
BOAZ BARAK

INTRODUCTION TO
THEORETICAL
COMPUTER SCIENCE

TEXTBOOK IN PREPARATION.
AVAILABLE ON https://introtcs.org
Text available on https://github.com/boazbk/tcs - please post any issues there - thank you!

This version was compiled on Wednesday 6th December, 2023 00:05

Copyright © 2023 Boaz Barak

This work is licensed under a Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” license.
To Ravit, Alma and Goren.
Contents

Preface 9

Preliminaries 17

0 Introduction 19

1 Mathematical Background 37

2 Computation and Representation 73

I Finite computation 111

3 Defining computation 113

4 Syntactic sugar, and computing every function 149

5 Code as data, data as code 175

II Uniform computation 205

6 Functions with Infinite domains, Automata, and Regular expressions 207

7 Loops and infinity 241

8 Equivalent models of computation 271

9 Universality and uncomputability 315

10 Restricted computational models 347

11 Is every theorem provable? 365


III Efficient algorithms 385

12 Efficient computation: An informal introduction 387

13 Modeling running time 407

14 Polynomial-time reductions 441

15 NP, NP completeness, and the Cook-Levin Theorem 469

16 What if P equals NP? 489

17 Space bounded computation 509

IV Randomized computation 511

18 Probability Theory 101 513

19 Probabilistic computation 533

20 Modeling randomized computation 545

V Advanced topics 569

21 Cryptography 571

22 Proofs and algorithms 599

23 Quantum computing 601

VI Appendices 631
Contents (detailed)

Preface 9
0.1 To the student . . . . . . . . . . . . . . . . . . . . . . . . 10
0.1.1 Is the effort worth it? . . . . . . . . . . . . . . . . 11
0.2 To potential instructors . . . . . . . . . . . . . . . . . . . 12
0.3 Acknowledgements . . . . . . . . . . . . . . . . . . . . . 14

Preliminaries 17

0 Introduction 19
0.1 Integer multiplication: an example of an algorithm . . . 20
0.2 Extended Example: A faster way to multiply (optional) 22
0.3 Algorithms beyond arithmetic . . . . . . . . . . . . . . . 27
0.4 On the importance of negative results . . . . . . . . . . 28
0.5 Roadmap to the rest of this book . . . . . . . . . . . . . 29
0.5.1 Dependencies between chapters . . . . . . . . . . 30
0.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
0.7 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 33

1 Mathematical Background 37
1.1 This chapter: a reader’s manual . . . . . . . . . . . . . . 37
1.2 A quick overview of mathematical prerequisites . . . . 38
1.3 Reading mathematical texts . . . . . . . . . . . . . . . . 39
1.3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . 40
1.3.2 Assertions: Theorems, lemmas, claims . . . . . . 40
1.3.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.4 Basic discrete math objects . . . . . . . . . . . . . . . . . 41
1.4.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.4.2 Special sets . . . . . . . . . . . . . . . . . . . . . . 42
1.4.3 Functions . . . . . . . . . . . . . . . . . . . . . . . 44
1.4.4 Graphs . . . . . . . . . . . . . . . . . . . . . . . . 46
1.4.5 Logic operators and quantifiers . . . . . . . . . . 49
1.4.6 Quantifiers for summations and products . . . . 50
1.4.7 Parsing formulas: bound and free variables . . . 50
1.4.8 Asymptotics and Big-𝑂 notation . . . . . . . . . . 52

1.4.9 Some “rules of thumb” for Big-𝑂 notation . . . . 53


1.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.5.1 Proofs and programs . . . . . . . . . . . . . . . . 55
1.5.2 Proof writing style . . . . . . . . . . . . . . . . . . 55
1.5.3 Patterns in proofs . . . . . . . . . . . . . . . . . . 56
1.6 Extended example: Topological Sorting . . . . . . . . . 59
1.6.1 Mathematical induction . . . . . . . . . . . . . . . 60
1.6.2 Proving the result by induction . . . . . . . . . . 61
1.6.3 Minimality and uniqueness . . . . . . . . . . . . . 63
1.7 This book: notation and conventions . . . . . . . . . . . 65
1.7.1 Variable name conventions . . . . . . . . . . . . . 66
1.7.2 Some idioms . . . . . . . . . . . . . . . . . . . . . 67
1.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
1.9 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 71

2 Computation and Representation 73


2.1 Defining representations . . . . . . . . . . . . . . . . . . 75
2.1.1 Representing natural numbers . . . . . . . . . . . 76
2.1.2 Meaning of representations (discussion) . . . . . 78
2.2 Representations beyond natural numbers . . . . . . . . 78
2.2.1 Representing (potentially negative) integers . . . 79
2.2.2 Two’s complement representation (optional) . . 79
2.2.3 Rational numbers and representing pairs of
strings . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.3 Representing real numbers . . . . . . . . . . . . . . . . . 82
2.4 Cantor’s Theorem, countable sets, and string represen-
tations of the real numbers . . . . . . . . . . . . . . . . . 83
2.4.1 Corollary: Boolean functions are uncountable . . 89
2.4.2 Equivalent conditions for countability . . . . . . 89
2.5 Representing objects beyond numbers . . . . . . . . . . 90
2.5.1 Finite representations . . . . . . . . . . . . . . . . 91
2.5.2 Prefix-free encoding . . . . . . . . . . . . . . . . . 91
2.5.3 Making representations prefix-free . . . . . . . . 94
2.5.4 “Proof by Python” (optional) . . . . . . . . . . . 95
2.5.5 Representing letters and text . . . . . . . . . . . . 97
2.5.6 Representing vectors, matrices, images . . . . . . 99
2.5.7 Representing graphs . . . . . . . . . . . . . . . . . 99
2.5.8 Representing lists and nested lists . . . . . . . . . 99
2.5.9 Notation . . . . . . . . . . . . . . . . . . . . . . . . 100
2.6 Defining computational tasks as mathematical functions 100
2.6.1 Distinguish functions from programs! . . . . . . 102
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.8 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 108

I Finite computation 111

3 Defining computation 113


3.1 Defining computation . . . . . . . . . . . . . . . . . . . . 115
3.2 Computing using AND, OR, and NOT. . . . . . . . . . . 116
3.2.1 Some properties of AND and OR . . . . . . . . . 118
3.2.2 Extended example: Computing XOR from
AND, OR, and NOT . . . . . . . . . . . . . . . . . 119
3.2.3 Informally defining “basic operations” and
“algorithms” . . . . . . . . . . . . . . . . . . . . . 121
3.3 Boolean Circuits . . . . . . . . . . . . . . . . . . . . . . . 123
3.3.1 Boolean circuits: a formal definition . . . . . . . . 124
3.4 Straight-line programs . . . . . . . . . . . . . . . . . . . 127
3.4.1 Specification of the AON-CIRC programming
language . . . . . . . . . . . . . . . . . . . . . . . 128
3.4.2 Proving equivalence of AON-CIRC programs
and Boolean circuits . . . . . . . . . . . . . . . . . 130
3.5 Physical implementations of computing devices (di-
gression) . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.5.1 Transistors . . . . . . . . . . . . . . . . . . . . . . 132
3.5.2 Logical gates from transistors . . . . . . . . . . . 133
3.5.3 Biological computing . . . . . . . . . . . . . . . . 133
3.5.4 Cellular automata and the game of life . . . . . . 134
3.5.5 Neural networks . . . . . . . . . . . . . . . . . . . 134
3.5.6 A computer made from marbles and pipes . . . . 135
3.6 The NAND function . . . . . . . . . . . . . . . . . . . . 135
3.6.1 NAND Circuits . . . . . . . . . . . . . . . . . . . . 136
3.6.2 More examples of NAND circuits (optional) . . . 138
3.6.3 The NAND-CIRC Programming language . . . . 139
3.7 Equivalence of all these models . . . . . . . . . . . . . . 141
3.7.1 Circuits with other gate sets . . . . . . . . . . . . 142
3.7.2 Specification vs. implementation (again) . . . . . 143
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.9 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 147

4 Syntactic sugar, and computing every function 149


4.1 Some examples of syntactic sugar . . . . . . . . . . . . . 151
4.1.1 User-defined procedures . . . . . . . . . . . . . . 151
4.1.2 Proof by Python (optional) . . . . . . . . . . . . . 153
4.1.3 Conditional statements . . . . . . . . . . . . . . . 154
4.2 Extended example: Addition and Multiplication (op-
tional) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.3 The LOOKUP function . . . . . . . . . . . . . . . . . . . 158

4.3.1 Constructing a NAND-CIRC program for


LOOKUP . . . . . . . . . . . . . . . . . . . . . . . 159
4.4 Computing every function . . . . . . . . . . . . . . . . . 161
4.4.1 Proof of NAND’s Universality . . . . . . . . . . . 162
4.4.2 Improving by a factor of 𝑛 (optional) . . . . . . . 163
4.5 Computing every function: An alternative proof . . . . 165
4.6 The class SIZE𝑛,𝑚 (𝑠) . . . . . . . . . . . . . . . . . . . . 167
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.8 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 174

5 Code as data, data as code 175


5.1 Representing programs as strings . . . . . . . . . . . . . 177
5.2 Counting programs, and lower bounds on the size of
NAND-CIRC programs . . . . . . . . . . . . . . . . . . . 178
5.2.1 Size hierarchy theorem (optional) . . . . . . . . . 180
5.3 The tuples representation . . . . . . . . . . . . . . . . . 182
5.3.1 From tuples to strings . . . . . . . . . . . . . . . . 183
5.4 A NAND-CIRC interpreter in NAND-CIRC . . . . . . . 184
5.4.1 Efficient universal programs . . . . . . . . . . . . 185
5.4.2 A NAND-CIRC interpreter in “pseudocode” . . . 186
5.4.3 A NAND interpreter in Python . . . . . . . . . . 188
5.4.4 Constructing the NAND-CIRC interpreter in
NAND-CIRC . . . . . . . . . . . . . . . . . . . . . 188
5.5 A Python interpreter in NAND-CIRC (discussion) . . . 191
5.6 The physical extended Church-Turing thesis (discus-
sion) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.6.1 Attempts at refuting the PECTT . . . . . . . . . . 195
5.7 Recap of Part I: Finite Computation . . . . . . . . . . . . 200
5.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
5.9 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 203

II Uniform computation 205

6 Functions with Infinite domains, Automata, and Regular expressions 207
6.1 Functions with inputs of unbounded length . . . . . . . 208
6.1.1 Varying inputs and outputs . . . . . . . . . . . . . 209
6.1.2 Formal Languages . . . . . . . . . . . . . . . . . . 211
6.1.3 Restrictions of functions . . . . . . . . . . . . . . . 211
6.2 Deterministic finite automata (optional) . . . . . . . . . 212
6.2.1 Anatomy of an automaton (finite vs. unbounded) 215
6.2.2 DFA-computable functions . . . . . . . . . . . . . 216
6.3 Regular expressions . . . . . . . . . . . . . . . . . . . . . 217
6.3.1 Algorithms for matching regular expressions . . 221

6.4 Efficient matching of regular expressions (optional) . . 223


6.4.1 Matching regular expressions using DFAs . . . . 227
6.4.2 Equivalence of regular expressions and automata 228
6.4.3 Closure properties of regular expressions . . . . 230
6.5 Limitations of regular expressions and the pumping
lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.6 Answering semantic questions about regular expres-
sions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.8 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 240

7 Loops and infinity 241


7.1 Turing Machines . . . . . . . . . . . . . . . . . . . . . . . 242
7.1.1 Extended example: A Turing machine for palin-
dromes . . . . . . . . . . . . . . . . . . . . . . . . 244
7.1.2 Turing machines: a formal definition . . . . . . . 245
7.1.3 Computable functions . . . . . . . . . . . . . . . . 247
7.1.4 Infinite loops and partial functions . . . . . . . . 248
7.2 Turing machines as programming languages . . . . . . 249
7.2.1 The NAND-TM Programming language . . . . . 251
7.2.2 Sneak peek: NAND-TM vs Turing machines . . . 254
7.2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . 255
7.3 Equivalence of Turing machines and NAND-TM pro-
grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
7.3.1 Specification vs implementation (again) . . . . . 260
7.4 NAND-TM syntactic sugar . . . . . . . . . . . . . . . . . 260
7.4.1 “GOTO” and inner loops . . . . . . . . . . . . . . 261
7.5 Uniformity, and NAND vs NAND-TM (discussion) . . 263
7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
7.7 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 267

8 Equivalent models of computation 271


8.1 RAM machines and NAND-RAM . . . . . . . . . . . . . 273
8.2 The gory details (optional) . . . . . . . . . . . . . . . . . 277
8.2.1 Indexed access in NAND-TM . . . . . . . . . . . . 277
8.2.2 Two dimensional arrays in NAND-TM . . . . . . 279
8.2.3 All the rest . . . . . . . . . . . . . . . . . . . . . . 279
8.3 Turing equivalence (discussion) . . . . . . . . . . . . . . 280
8.3.1 The “Best of both worlds” paradigm . . . . . . . 281
8.3.2 Let’s talk about abstractions . . . . . . . . . . . . 281
8.3.3 Turing completeness and equivalence, a formal
definition (optional) . . . . . . . . . . . . . . . . . 283
8.4 Cellular automata . . . . . . . . . . . . . . . . . . . . . . 284

8.4.1 One dimensional cellular automata are Turing


complete . . . . . . . . . . . . . . . . . . . . . . . 286
8.4.2 Configurations of Turing machines and the
next-step function . . . . . . . . . . . . . . . . . . 287
8.5 Lambda calculus and functional programming lan-
guages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
8.5.1 Applying functions to functions . . . . . . . . . . 290
8.5.2 Obtaining multi-argument functions via Currying 291
8.5.3 Formal description of the λ calculus . . . . . . . . 292
8.5.4 Infinite loops in the λ calculus . . . . . . . . . . . 295
8.6 The “Enhanced” λ calculus . . . . . . . . . . . . . . . . 295
8.6.1 Computing a function in the enhanced λ calculus 298
8.6.2 Enhanced λ calculus is Turing-complete . . . . . 298
8.7 From enhanced to pure λ calculus . . . . . . . . . . . . 301
8.7.1 List processing . . . . . . . . . . . . . . . . . . . . 302
8.7.2 The Y combinator, or recursion without recursion 303
8.8 The Church-Turing Thesis (discussion) . . . . . . . . . 306
8.8.1 Different models of computation . . . . . . . . . . 307
8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
8.10 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 312

9 Universality and uncomputability 315


9.1 Universality or a meta-circular evaluator . . . . . . . . . 316
9.1.1 Proving the existence of a universal Turing
Machine . . . . . . . . . . . . . . . . . . . . . . . . 318
9.1.2 Implications of universality (discussion) . . . . . 320
9.2 Is every function computable? . . . . . . . . . . . . . . . 321
9.3 The Halting problem . . . . . . . . . . . . . . . . . . . . 323
9.3.1 Is the Halting problem really hard? (discussion) 326
9.3.2 A direct proof of the uncomputability of HALT
(optional) . . . . . . . . . . . . . . . . . . . . . . . 327
9.4 Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . 329
9.4.1 Example: Halting on the zero problem . . . . . . 330
9.5 Rice’s Theorem and the impossibility of general soft-
ware verification . . . . . . . . . . . . . . . . . . . . . . . 334
9.5.1 Rice’s Theorem . . . . . . . . . . . . . . . . . . . . 335
9.5.2 Halting and Rice’s Theorem for other Turing-
complete models . . . . . . . . . . . . . . . . . . . 340
9.5.3 Is software verification doomed? (discussion) . . 341
9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
9.7 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 345

10 Restricted computational models 347


10.1 Turing completeness as a bug . . . . . . . . . . . . . . . 347

10.2 Context free grammars . . . . . . . . . . . . . . . . . . . 349


10.2.1 Context-free grammars as a computational model 351
10.2.2 The power of context free grammars . . . . . . . 353
10.2.3 Limitations of context-free grammars (optional) 355
10.3 Semantic properties of context free languages . . . . . . 357
10.3.1 Uncomputability of context-free grammar
equivalence (optional) . . . . . . . . . . . . . . . 357
10.4 Summary of semantic properties for regular expres-
sions and context-free grammars . . . . . . . . . . . . . 360
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
10.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 362

11 Is every theorem provable? 365


11.1 Hilbert’s Program and Gödel’s Incompleteness Theorem 366
11.1.1 Defining “Proof Systems” . . . . . . . . . . . . . . 367
11.2 Gödel’s Incompleteness Theorem: Computational
variant . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
11.3 Quantified integer statements . . . . . . . . . . . . . . . 371
11.4 Diophantine equations and the MRDP Theorem . . . . 373
11.5 Hardness of quantified integer statements . . . . . . . . 374
11.5.1 Step 1: Quantified mixed statements and com-
putation histories . . . . . . . . . . . . . . . . . . 375
11.5.2 Step 2: Reducing mixed statements to integer
statements . . . . . . . . . . . . . . . . . . . . . . 378
11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
11.7 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 383

III Efficient algorithms 385

12 Efficient computation: An informal introduction 387


12.1 Problems on graphs . . . . . . . . . . . . . . . . . . . . . 389
12.1.1 Finding the shortest path in a graph . . . . . . . . 390
12.1.2 Finding the longest path in a graph . . . . . . . . 392
12.1.3 Finding the minimum cut in a graph . . . . . . . 392
12.1.4 Min-Cut Max-Flow and Linear programming . . 393
12.1.5 Finding the maximum cut in a graph . . . . . . . 395
12.1.6 A note on convexity . . . . . . . . . . . . . . . . . 395
12.2 Beyond graphs . . . . . . . . . . . . . . . . . . . . . . . . 397
12.2.1 SAT . . . . . . . . . . . . . . . . . . . . . . . . . . 397
12.2.2 Solving linear equations . . . . . . . . . . . . . . . 398
12.2.3 Solving quadratic equations . . . . . . . . . . . . 399
12.3 More advanced examples . . . . . . . . . . . . . . . . . 399
12.3.1 Determinant of a matrix . . . . . . . . . . . . . . . 399
12.3.2 Permanent of a matrix . . . . . . . . . . . . . . . . 401

12.3.3 Finding a zero-sum equilibrium . . . . . . . . . . 401


12.3.4 Finding a Nash equilibrium . . . . . . . . . . . . 402
12.3.5 Primality testing . . . . . . . . . . . . . . . . . . . 402
12.3.6 Integer factoring . . . . . . . . . . . . . . . . . . . 403
12.4 Our current knowledge . . . . . . . . . . . . . . . . . . . 403
12.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
12.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 404
12.7 Further explorations . . . . . . . . . . . . . . . . . . . . 405

13 Modeling running time 407


13.1 Formally defining running time . . . . . . . . . . . . . . 409
13.1.1 Polynomial and Exponential Time . . . . . . . . . 410
13.2 Modeling running time using RAM Machines /
NAND-RAM . . . . . . . . . . . . . . . . . . . . . . . . . 412
13.3 Extended Church-Turing Thesis (discussion) . . . . . . 417
13.4 Efficient universal machine: a NAND-RAM inter-
preter in NAND-RAM . . . . . . . . . . . . . . . . . . . 418
13.4.1 Timed Universal Turing Machine . . . . . . . . . 420
13.5 The time hierarchy theorem . . . . . . . . . . . . . . . . 421
13.6 Non-uniform computation . . . . . . . . . . . . . . . . . 425
13.6.1 Oblivious NAND-TM programs . . . . . . . . . . 427
13.6.2 “Unrolling the loop”: algorithmic transforma-
tion of Turing Machines to circuits . . . . . . . . . 430
13.6.3 Can uniform algorithms simulate non-uniform
ones? . . . . . . . . . . . . . . . . . . . . . . . . . 432
13.6.4 Uniform vs. Non-uniform computation: A recap 434
13.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
13.8 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 438

14 Polynomial-time reductions 441


14.1 Formal definitions of problems . . . . . . . . . . . . . . 442
14.2 Polynomial-time reductions . . . . . . . . . . . . . . . . 443
14.2.1 Whistling pigs and flying horses . . . . . . . . . . 444
14.3 Reducing 3SAT to zero one and quadratic equations . . 446
14.3.1 Quadratic equations . . . . . . . . . . . . . . . . . 449
14.4 The subset sum problem . . . . . . . . . . . . . . . . . . 451
14.5 The independent set problem . . . . . . . . . . . . . . . 453
14.6 Some exercises and anatomy of a reduction. . . . . . . . 456
14.6.1 Dominating set . . . . . . . . . . . . . . . . . . . . 457
14.6.2 Anatomy of a reduction . . . . . . . . . . . . . . . 460
14.7 Reducing Independent Set to Maximum Cut . . . . . . 462
14.8 Reducing 3SAT to Longest Path . . . . . . . . . . . . . . 464
14.8.1 Summary of relations . . . . . . . . . . . . . . . . 466
14.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 467

14.10 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 467

15 NP, NP completeness, and the Cook-Levin Theorem 469


15.1 The class NP . . . . . . . . . . . . . . . . . . . . . . . . . 471
15.1.1 Examples of functions in NP . . . . . . . . . . . . 473
15.1.2 Basic facts about NP . . . . . . . . . . . . . . . . . 474
15.2 From NP to 3SAT: The Cook-Levin Theorem . . . . . . . 476
15.2.1 What does this mean? . . . . . . . . . . . . . . . . 477
15.2.2 The Cook-Levin Theorem: Proof outline . . . . . 478
15.3 The NANDSAT Problem, and why it is NP hard . . . . 479
15.4 The 3NAND problem . . . . . . . . . . . . . . . . . . . . 481
15.5 From 3NAND to 3SAT . . . . . . . . . . . . . . . . . . . 483
15.6 Wrapping up . . . . . . . . . . . . . . . . . . . . . . . . . 484
15.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
15.8 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 487

16 What if P equals NP? 489


16.1 Search-to-decision reduction . . . . . . . . . . . . . . . . 491
16.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 493
16.2.1 Example: Supervised learning . . . . . . . . . . . 496
16.2.2 Example: Breaking cryptosystems . . . . . . . . . 497
16.3 Finding mathematical proofs . . . . . . . . . . . . . . . 497
16.4 Quantifier elimination (advanced) . . . . . . . . . . . . 499
16.4.1 Application: self improving algorithm for 3SAT . 501
16.5 Approximating counting problems and posterior
sampling (advanced, optional) . . . . . . . . . . . . . . 502
16.6 What does all of this imply? . . . . . . . . . . . . . . . . 503
16.7 Can P ≠ NP be neither true nor false? . . . . . . . . . . 505
16.8 Is P = NP “in practice”? . . . . . . . . . . . . . . . . . . 506
16.9 What if P ≠ NP? . . . . . . . . . . . . . . . . . . . . . . . 507
16.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
16.11 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 508

17 Space bounded computation 509


17.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
17.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 509

IV Randomized computation 511

18 Probability Theory 101 513


18.1 Random coins . . . . . . . . . . . . . . . . . . . . . . . . 514
18.1.1 Random variables . . . . . . . . . . . . . . . . . . 517
18.1.2 Distributions over strings . . . . . . . . . . . . . . 519
18.1.3 More general sample spaces . . . . . . . . . . . . 519

18.2 Correlations and independence . . . . . . . . . . . . . . 520


18.2.1 Independent random variables . . . . . . . . . . . 521
18.2.2 Collections of independent random variables . . 523
18.3 Concentration and tail bounds . . . . . . . . . . . . . . . 523
18.3.1 Chebyshev’s Inequality . . . . . . . . . . . . . . . 525
18.3.2 The Chernoff bound . . . . . . . . . . . . . . . . . 526
18.3.3 Application: Supervised learning and empirical
risk minimization . . . . . . . . . . . . . . . . . . 527
18.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
18.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 532

19 Probabilistic computation 533


19.1 Finding approximately good maximum cuts . . . . . . . 534
19.1.1 Amplifying the success of randomized algorithms 535
19.1.2 Success amplification . . . . . . . . . . . . . . . . 536
19.1.3 Two-sided amplification . . . . . . . . . . . . . . . 537
19.1.4 What does this mean? . . . . . . . . . . . . . . . . 538
19.2 Solving SAT through randomization . . . . . . . . . . . 539
19.3 Bipartite matching . . . . . . . . . . . . . . . . . . . . . . 540
19.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
19.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 544
19.6 Acknowledgements . . . . . . . . . . . . . . . . . . . . . 544

20 Modeling randomized computation 545


20.1 Modeling randomized computation . . . . . . . . . . . 546
20.1.1 An alternative view: random coins as an “extra
input” . . . . . . . . . . . . . . . . . . . . . . . . . 549
20.1.2 Success amplification of two-sided error algo-
rithms . . . . . . . . . . . . . . . . . . . . . . . . . 551
20.2 BPP and NP completeness . . . . . . . . . . . . . . . . . 552
20.3 The power of randomization . . . . . . . . . . . . . . . . 553
20.3.1 Solving BPP in exponential time . . . . . . . . . . 553
20.3.2 Simulating randomized algorithms by circuits . . 554
20.4 Derandomization . . . . . . . . . . . . . . . . . . . . . . 555
20.4.1 Pseudorandom generators . . . . . . . . . . . . . 557
20.4.2 From existence to constructivity . . . . . . . . . . 558
20.4.3 Usefulness of pseudorandom generators . . . . . 559
20.5 P = NP and BPP vs P . . . . . . . . . . . . . . . . . . . . 560
20.6 Non-constructive existence of pseudorandom genera-
tors (advanced, optional) . . . . . . . . . . . . . . . . . 563
20.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
20.8 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 566

V Advanced topics 569

21 Cryptography 571
21.1 Classical cryptosystems . . . . . . . . . . . . . . . . . . . 572
21.2 Defining encryption . . . . . . . . . . . . . . . . . . . . . 574
21.3 Defining security of encryption . . . . . . . . . . . . . . 575
21.4 Perfect secrecy . . . . . . . . . . . . . . . . . . . . . . . . 577
21.4.1 Example: Perfect secrecy in the battlefield . . . . 578
21.4.2 Constructing perfectly secret encryption . . . . . 579
21.5 Necessity of long keys . . . . . . . . . . . . . . . . . . . 581
21.6 Computational secrecy . . . . . . . . . . . . . . . . . . . 582
21.6.1 Stream ciphers or the “derandomized one-time
pad” . . . . . . . . . . . . . . . . . . . . . . . . . . 584
21.7 Computational secrecy and NP . . . . . . . . . . . . . . 587
21.8 Public key cryptography . . . . . . . . . . . . . . . . . . 589
21.8.1 Defining public key encryption . . . . . . . . . . 591
21.8.2 Diffie-Hellman key exchange . . . . . . . . . . . . 592
21.9 Other security notions . . . . . . . . . . . . . . . . . . . 594
21.10 Magic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
21.10.1 Zero knowledge proofs . . . . . . . . . . . . . . . 595
21.10.2 Fully homomorphic encryption . . . . . . . . . . 595
21.10.3 Multiparty secure computation . . . . . . . . . . 596
21.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
21.12 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 597

22 Proofs and algorithms 599


22.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
22.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 599

23 Quantum computing 601


23.1 The double slit experiment . . . . . . . . . . . . . . . . . 602
23.2 Quantum amplitudes . . . . . . . . . . . . . . . . . . . . 602
23.3 Bell’s Inequality . . . . . . . . . . . . . . . . . . . . . . . 605
23.4 Quantum weirdness . . . . . . . . . . . . . . . . . . . . . 606
23.5 Quantum computing and computation - an executive
summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
23.6 Quantum systems . . . . . . . . . . . . . . . . . . . . . . 609
23.6.1 Quantum amplitudes . . . . . . . . . . . . . . . . 610
23.6.2 Recap . . . . . . . . . . . . . . . . . . . . . . . . . 611
23.7 Analysis of Bell’s Inequality (optional) . . . . . . . . . . 612
23.8 Quantum computation . . . . . . . . . . . . . . . . . . . 614
23.8.1 Quantum circuits . . . . . . . . . . . . . . . . . . 615
23.8.2 QNAND-CIRC programs (optional) . . . . . . . 617
23.8.3 Uniform computation . . . . . . . . . . . . . . . . 618
23.9 Physically realizing quantum computation . . . . . . . 619

23.10 Shor’s Algorithm: Hearing the shape of prime factors . 620


23.10.1 Period finding . . . . . . . . . . . . . . . . . . . . 621
23.10.2 Shor’s Algorithm: A bird’s eye view . . . . . . . . 621
23.11 Quantum Fourier Transform (advanced, optional) . . . 624
23.11.1 Quantum Fourier Transform over the Boolean
Cube: Simon’s Algorithm . . . . . . . . . . . . . . 625
23.11.2 From Fourier to Period finding: Simon’s Algo-
rithm (advanced, optional) . . . . . . . . . . . . . 626
23.11.3 From Simon to Shor (advanced, optional) . . . . 627
23.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
23.13 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 629
23.14 Further explorations . . . . . . . . . . . . . . . . . . . . 630
23.15 Acknowledgements . . . . . . . . . . . . . . . . . . . . . 630

VI Appendices 631
Preface

“We make ourselves no promises, but we cherish the hope that the unobstructed
pursuit of useless knowledge will prove to have consequences in the future
as in the past” … “An institution which sets free successive generations of
human souls is amply justified whether or not this graduate or that makes a
so-called useful contribution to human knowledge. A poem, a symphony, a
painting, a mathematical truth, a new scientific fact, all bear in themselves all
the justification that universities, colleges, and institutes of research need or
require”, Abraham Flexner, The Usefulness of Useless Knowledge, 1939.

“I suggest that you take the hardest courses that you can, because you learn
the most when you challenge yourself… CS 121 I found pretty hard.”, Mark
Zuckerberg, 2005.

This is a textbook for an undergraduate introductory course on


theoretical computer science. The educational goals of this book are to
convey the following:

• That computation arises in a variety of natural and human-made


systems, and not only in modern silicon-based computers.

• Similarly, beyond being an extremely important tool, computation


also serves as a useful lens to describe natural, physical, mathemati-
cal and even social concepts.

• The notion of universality of many different computational models,


and the related notion of the duality between code and data.

• The idea that one can precisely define a mathematical model of


computation, and then use that to prove (or sometimes only conjec-
ture) lower bounds and impossibility results.

• Some of the surprising results and discoveries in modern theoreti-


cal computer science, including the prevalence of NP-completeness,
the power of interaction, the power of randomness on one hand
and the possibility of derandomization on the other, the ability
to use hardness “for good” in cryptography, and the fascinating
possibility of quantum computing.


I hope that following this course, students would be able to rec-


ognize computation, with both its power and pitfalls, as it arises in
various settings, including seemingly “static” content or “restricted”
formalisms such as macros and scripts. They should be able to follow
through the logic of proofs about computation, including the cen-
tral concept of a reduction, as well as understanding “self-referential”
proofs (such as diagonalization-based proofs that involve programs
given their own code as input). Students should understand that
some problems are inherently intractable, and be able to recognize the
potential for intractability when they are faced with a new problem.
While this book only touches on cryptography, students should un-
derstand the basic idea of how we can use computational hardness for
cryptographic purposes. However, more than any specific skill, this
book aims to introduce students to a new way of thinking of computa-
tion as an object in its own right and to illustrate how this new way of
thinking leads to far-reaching insights and applications.
My aim in writing this text is to try to convey these concepts in the
simplest possible way and try to make sure that the formal notation
and model help elucidate, rather than obscure, the main ideas. I also
tried to take advantage of modern students’ familiarity (or at least
interest!) in programming, and hence use (highly simplified) pro-
gramming languages to describe our models of computation. That
said, this book does not assume fluency with any particular program-
ming language, but rather only some familiarity with the general
notion of programming. We will use programming metaphors and
idioms, occasionally mentioning specific programming languages
such as Python, C, or Lisp, but students should be able to follow these
descriptions even if they are not familiar with these languages.
Proofs in this book, including the existence of a universal Turing
Machine, the fact that every finite function can be computed by some
circuit, the Cook-Levin theorem, and many others, are often con-
structive and algorithmic, in the sense that they ultimately involve
transforming one program to another. While it is possible to follow
these proofs without seeing the code, I do think that having access
to the code, and the ability to play around with it and see how it acts
on various programs, can make these theorems more concrete for the
students. To that end, an accompanying website (which is still a work
in progress) allows executing programs in the various computational
models we define, as well as seeing constructive proofs of some of the
theorems.

0.1 TO THE STUDENT


This book can be challenging, mainly because it brings together a
variety of ideas and techniques in the study of computation. There

are quite a few technical hurdles to master, whether it is following


the diagonalization argument for proving the Halting Problem is
undecidable, combinatorial gadgets in NP-completeness reductions,
analyzing probabilistic algorithms, or arguing about the adversary to
prove the security of cryptographic primitives.
The best way to engage with this material is to read these notes ac-
tively, so make sure you have a pen ready. While reading, I encourage
you to stop and think about the following:

• When I state a theorem, stop and take a shot at proving it on your


own before reading the proof. You will be amazed by how much
better you can understand a proof even after only 5 minutes of
attempting it on your own.

• When reading a definition, make sure that you understand what


the definition means, and what the natural examples are of objects
that satisfy it and objects that do not. Try to think of the motivation
behind the definition, and whether there are other natural ways to
formalize the same concept.

• Actively notice which questions arise in your mind as you read the
text, and whether or not they are answered in the text.

As a general rule, it is more important that you understand the


definitions than the theorems, and it is more important that you
understand a theorem statement than its proof. After all, before you
can prove a theorem, you need to understand what it states, and to
understand what a theorem is about, you need to know the definitions
of the objects involved. Whenever a proof of a theorem is at least
somewhat complicated, I provide a “proof idea.” Feel free to skip the
actual proof in a first reading, focusing only on the proof idea.
This book contains some code snippets, but this is by no means
a programming text. You don’t need to know how to program to
follow this material. The reason we use code is that it is a precise way
to describe computation. Particular implementation details are not
as important to us, and so we will emphasize code readability at the
expense of considerations such as error handling, encapsulation, etc.
that can be extremely important for real-world programming.

0.1.1 Is the effort worth it?


This is not an easy book, and you might reasonably wonder why
you should spend the effort to learn this material. A traditional
justification for a “Theory of Computation” course is that you might
encounter these concepts later on in your career. Perhaps you will
come across a hard problem and realize it is NP complete, or find a
need to use what you learned about regular expressions. This might

very well be true, but the main benefit of this book is not in teaching
you any practical tool or technique, but instead in giving you a differ-
ent way of thinking: an ability to recognize computational phenomena
even when they occur in non-obvious settings, a way to model compu-
tational tasks and questions, and to reason about them.
Regardless of any use you will derive from this book, I believe
learning this material is important because it contains concepts that
are both beautiful and fundamental. The role that energy and matter
played in the 20th century is played in the 21st by computation and
information, not just as tools for our technology and economy, but also
as the basic building blocks we use to understand the world. This
book will give you a taste of some of the theory behind those, and
hopefully spark your curiosity to study more.

0.2 TO POTENTIAL INSTRUCTORS


I wrote this book for my Harvard course, but I hope that other lectur-
ers will find it useful as well. To some extent, it is similar in content
to “Theory of Computation” or “Great Ideas” courses such as those
taught at CMU or MIT.
The most significant difference between our approach and more
traditional ones (such as Hopcroft and Ullman’s [HU69; HU79] and
Sipser’s [Sip97]) is that we do not start with finite automata as our ini-
tial computational model. Instead, our initial computational model
is Boolean Circuits (an earlier book that starts with circuits as the
initial model is John Savage’s [Sav98]). We believe that Boolean
Circuits are more fundamental to the theory of computing (and even
its practice!) than
automata. In particular, Boolean Circuits are a prerequisite for many
concepts that one would want to teach in a modern course on theoret-
ical computer science, including cryptography, quantum computing,
derandomization, attempts at proving P ≠ NP, and more. Even in
cases where Boolean Circuits are not strictly required, they can of-
ten offer significant simplifications (as in the case of the proof of the
Cook-Levin Theorem).
Furthermore, I believe there are pedagogical reasons to start with
Boolean circuits as opposed to finite automata. Boolean circuits are a
more natural model of computation, and one that corresponds more
closely to computing in silicon, making the connection to practice
more immediate to the students. Finite functions are arguably easier
to grasp than infinite ones, as we can fully write down their truth ta-
ble. The theorem that every finite function can be computed by some
Boolean circuit is both simple enough and important enough to serve
as an excellent starting point for this course. Moreover, many of the
main conceptual points of the theory of computation, including the
notions of the duality between code and data, and the idea of universal-
ity, can already be seen in this context.

After Boolean circuits, we move on to Turing machines and prove


results such as the existence of a universal Turing machine, the un-
computability of the halting problem, and Rice’s Theorem. Automata
are discussed after we see Turing machines and undecidability, as an
example of a restricted computational model where problems such as
determining halting can be effectively solved.
While this is not our motivation, the order we present circuits, Tur-
ing machines, and automata roughly corresponds to the chronological
order of their discovery. Boolean algebra goes back to Boole’s and
DeMorgan’s works in the 1840s [Boo47; De 47] (though the defini-
tion of Boolean circuits and the connection to physical computation
was given 90 years later by Shannon [Sha38]). Alan Turing defined
what we now call “Turing Machines” in the 1930s [Tur37], while finite
automata were introduced in the 1943 work of McCulloch and Pitts
[MP43] but only really understood in the seminal 1959 work of Rabin
and Scott [RS59].
More importantly, while models such as finite-state machines, reg-
ular expressions, and context-free grammars are incredibly important
for practice, the main applications for these models (whether it is for
parsing, for analyzing properties such as liveness and safety, or even for
software-defined routing tables) rely crucially on the fact that these
are tractable models for which we can effectively answer semantic ques-
tions. This practical motivation can be better appreciated after students
see the undecidability of semantic properties of general computing
models.
The fact that we start with circuits makes proving the Cook-Levin
Theorem much easier. In fact, our proof of this theorem can be (and
is) done using a handful of lines of Python. Combining this proof
with the standard reductions (which are also implemented in Python)
allows students to appreciate visually how a question about computa-
tion can be mapped into a question about (for example) the existence
of an independent set in a graph.
Some other differences between this book and previous texts are
the following:

1. For measuring time complexity, we use the standard RAM machine


model used (implicitly) in algorithms courses, rather than Tur-
ing machines. While these two models are of course polynomially
equivalent, and hence make no difference for the definitions of the
classes P, NP, and EXP, our choice makes the distinction between
notions such as 𝑂(𝑛) or 𝑂(𝑛^2) time more meaningful. This choice
also ensures that these finer-grained time complexity classes corre-
spond to the informal definitions of linear and quadratic time that

students encounter in their algorithms lectures (or their whiteboard


coding interviews…).

2. We use the terminology of functions rather than languages. That is,


rather than saying that a Turing Machine 𝑀 decides a language 𝐿 ⊆
{0, 1}∗ , we say that it computes a function 𝐹 ∶ {0, 1}∗ → {0, 1}. The
terminology of “languages” arises from Chomsky’s work [Cho56],
but it is often more confusing than illuminating. The language
terminology also makes it cumbersome to discuss concepts such
as algorithms that compute functions with more than one bit of
output (including basic tasks such as addition, multiplication,
etc…). The fact that we use functions rather than languages means
we have to be extra vigilant about students distinguishing between
the specification of a computational task (e.g., the function) and its
implementation (e.g., the program). On the other hand, this point is
so important that it is worth repeatedly emphasizing and drilling
into the students, regardless of the notation used. The book does
mention the language terminology and reminds of it occasionally,
to make it easier for students to consult outside resources.

Reducing the time dedicated to finite automata and context-free


languages allows instructors to spend more time on topics that a mod-
ern course in the theory of computing needs to touch upon. These
include randomness and computation, the interactions between proofs
and programs (including Gödel’s incompleteness theorem, interactive
proof systems, and even a bit on the 𝜆-calculus and the Curry-Howard
correspondence), cryptography, and quantum computing.
This book contains sufficient detail to enable its use for self-study.
Toward that end, every chapter starts with a list of learning objectives,
ends with a recap, and is peppered with “pause boxes” which encour-
age students to stop and work out an argument or make sure they
understand a definition before continuing further.
Section 0.5 contains a “roadmap” for this book, with descriptions
of the different chapters, as well as the dependency structure between
them. This can help in planning a course based on this book.

0.3 ACKNOWLEDGEMENTS
This text is continually evolving, and I am getting input from many
people, for which I am deeply grateful. Salil Vadhan co-taught with
me the first iteration of this course and gave me a tremendous amount
of useful feedback and insights during this process. Michele Amoretti
and Marika Swanberg carefully read several chapters of this text and
gave extremely helpful detailed comments. Dave Evans and Richard
Xu contributed many pull requests fixing errors and improving phras-
ing. Thanks to Anil Ada, Venkat Guruswami, and Ryan O’Donnell for

helpful tips from their experience in teaching CMU 15-251. Thanks to


Adam Hesterberg and Madhu Sudan for their comments on their ex-
perience teaching CS 121 with this book. Kunal Marwaha gave many
comments, as well as provided great help with the technical aspects of
producing the book.
Thanks to everyone that sent me comments, typo reports, or posted
issues or pull requests on the GitHub repository https://github.
com/boazbk/tcs. In particular I would like to acknowledge help-
ful feedback from Scott Aaronson, Michele Amoretti, Aadi Bajpai,
Marguerite Basta, Anindya Basu, Sam Benkelman, Jarosław Błasiok,
Emily Chan, Christy Cheng, Michelle Chiang, Daniel Chiu, Chi-Ning
Chou, Michael Colavita, Brenna Courtney, Rodrigo Daboin Sanchez,
Robert Darley Waddilove, Anlan Du, Juan Esteller, David Evans,
Michael Fine, Simon Fischer, Leor Fishman, Zaymon Foulds-Cook,
William Fu, Kent Furuie, Piotr Galuszka, Carolyn Ge, Jason Giroux,
Mark Goldstein, Alexander Golovnev, Sayan Goswami, Maxwell
Grozovsky, Michael Haak, Rebecca Hao, Lucia Hoerr, Joosep Hook,
Austin Houck, Thomas Huet, Emily Jia, Serdar Kaçka, Chan Kang,
Nina Katz-Christy, Vidak Kazic, Joe Kerrigan, Eddie Kohler, Estefa-
nia Lahera, Allison Lee, Benjamin Lee, Ondřej Lengál, Raymond Lin,
Emma Ling, Alex Lombardi, Lisa Lu, Kai Ma, Aditya Mahadevan,
Kunal Marwaha, Christian May, Josh Mehr, Jacob Meyerson, Leon
Mlodzian, George Moe, Todd Morrill, Glenn Moss, Haley Mulligan,
Hamish Nicholson, Owen Niles, Sandip Nirmel, Sebastian Oberhoff,
Thomas Orton, Joshua Pan, Pablo Parrilo, Juan Perdomo, Banks Pick-
ett, Aaron Sachs, Abdelrhman Saleh, Brian Sapozhnikov, Anthony
Scemama, Peter Schäfer, Josh Seides, Alaisha Sharma, Nathan Sheely,
Haneul Shin, Noah Singer, Matthew Smedberg, Miguel Solano, Hikari
Sorensen, David Steurer, Alec Sun, Amol Surati, Everett Sussman,
Marika Swanberg, Garrett Tanzer, Eric Thomas, Sarah Turnill, Salil
Vadhan, Patrick Watts, Jonah Weissman, Ryan Williams, Licheng Xu,
Richard Xu, Wanqian Yang, Elizabeth Yeoh-Wang, Josh Zelinsky, Fred
Zhang, Grace Zhang, Alex Zhao, and Jessica Zhu.
I am using many open-source software packages in the production
of these notes for which I am grateful. In particular, I am thankful to
Donald Knuth and Leslie Lamport for LaTeX and to John MacFarlane
for Pandoc. David Steurer wrote the original scripts to produce this
text. The current version uses Sergio Correia’s panflute. The templates
for the LaTeX and HTML versions are derived from Tufte LaTeX,
Gitbook and Bookdown. Thanks to Amy Hendrickson for some LaTeX
consulting. Juan Esteller and Gabe Montague initially implemented
the NAND* programming languages in OCaml and Javascript. I used
the Jupyter project to write the supplemental code snippets.

Finally, I would like to thank my family: my wife Ravit, and my


children Alma and Goren. Working on this book (and the correspond-
ing course) took so much of my time that Alma wrote an essay for her
fifth-grade class saying that “universities should not pressure profes-
sors to work too much.” I’m afraid all I have to show for this effort is
600 pages of ultra-boring mathematical text.
PRELIMINARIES
0
Introduction

Learning Objectives:
• Introduce and motivate the study of computation for its own sake, irrespective of particular implementations.
• The notion of an algorithm and some of its history.
• Algorithms as not just tools, but also ways of thinking and understanding.
• Taste of Big-𝑂 analysis and the surprising creativity in the design of efficient algorithms.

“Computer Science is no more about computers than astronomy is about
telescopes”, attributed to Edsger Dijkstra.1

“Hackers need to understand the theory of computation about as much as
painters need to understand paint chemistry.”, Paul Graham 2003.2

“The subject of my talk is perhaps most directly indicated by simply asking
two questions: first, is it harder to multiply than to add? and second, why?…
I (would like to) show that there is no algorithm for multiplication computa-
tionally as simple as that for addition, and this proves something of a stumbling
block.”, Alan Cobham, 1964

1 This quote is typically read as disparaging the importance of actual physical
computers in Computer Science, but note that telescopes are absolutely essential
to astronomy as they provide us with the means to connect theoretical predictions
with actual experimental observations.

2 To be fair, in the following sentence Graham says “you need to know how to
calculate time and space complexity and about Turing completeness”. This book
includes these topics, as well as others such as NP-hardness, randomization,
cryptography, quantum computing, and more.

One of the ancient Babylonians’ greatest innovations is the place-


value number system. The place-value system represents numbers as
sequences of digits where the position of each digit determines its
value.
This is opposed to a system like Roman numerals, where every
digit has a fixed value regardless of position. For example, the aver-
age distance to the moon is approximately 259,956 Roman miles. In
standard Roman numerals, that would be

MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMDCCCCLVI

Writing the distance to the sun in Roman numerals would require


about 100,000 symbols; it would take a 50-page book to contain this
single number!


For someone who thinks of numbers in an additive system like


Roman numerals, quantities like the distance to the moon or sun are
not merely large—they are unspeakable: they cannot be expressed or
even grasped. It’s no wonder that Eratosthenes, the first to calculate
the earth’s diameter (up to about ten percent error), and Hipparchus,
the first to calculate the distance to the moon, used not a Roman-
numeral type system but the Babylonian sexagesimal (base 60) place-
value system.

0.1 INTEGER MULTIPLICATION: AN EXAMPLE OF AN ALGORITHM


In the language of Computer Science, the place-value system for rep-
resenting numbers is known as a data structure: a set of instructions,
or “recipe”, for representing objects as symbols. An algorithm is a set
of instructions, or “recipe”, for performing operations on such rep-
resentations. Data structures and algorithms have enabled amazing
applications that have transformed human society, but their impor-
tance goes beyond their practical utility. Structures from computer
science, such as bits, strings, graphs, and even the notion of a program
itself, as well as concepts such as universality and replication, have not
just found (many) practical uses but contributed a new language and
a new way to view the world.
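
To make the notion of a representation concrete, here is a small Python sketch (my own illustration, not from the text) of the place-value “recipe”: a number is stored as its list of digits, and recovered by summing each digit times the corresponding power of the base. The function names to_digits and from_digits are illustrative choices, not the book's.

```python
def to_digits(n, base=10):
    # Represent a non-negative integer as a list of digits, least significant first.
    digits = []
    while n > 0:
        n, d = divmod(n, base)
        digits.append(d)
    return digits or [0]

def from_digits(digits, base=10):
    # Recover the number: sum each digit times the corresponding power of the base.
    return sum(d * base ** i for i, d in enumerate(digits))

print(to_digits(259956))                 # [6, 5, 9, 9, 5, 2]
print(from_digits(to_digits(259956)))    # 259956
```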
In addition to coming up with the place-value system, the Babylo-
nians also invented the “standard algorithms” that we were all taught
in elementary school for adding and multiplying numbers. These al-
gorithms have been essential throughout the ages for people using
abaci, papyrus, or pencil and paper, but in our computer age, do they
still serve any purpose beyond torturing third-graders? To see why
these algorithms are still very much relevant, let us compare the Baby-
lonian digit-by-digit multiplication algorithm (“grade-school multi-
plication”) with the naive algorithm that multiplies numbers through
repeated addition. We start by formally describing both algorithms,
see Algorithm 0.1 and Algorithm 0.2.

Algorithm 0.1 — Multiplication via repeated addition.

Input: Non-negative integers 𝑥, 𝑦


Output: Product 𝑥 ⋅ 𝑦
1: Let 𝑟𝑒𝑠𝑢𝑙𝑡 ← 0.
2: for 𝑖 = 1, … , 𝑦 do
3: 𝑟𝑒𝑠𝑢𝑙𝑡 ← 𝑟𝑒𝑠𝑢𝑙𝑡 + 𝑥
4: end for
5: return 𝑟𝑒𝑠𝑢𝑙𝑡

Algorithm 0.2 — Grade-school multiplication.

Input: Non-negative integers 𝑥, 𝑦


Output: Product 𝑥 ⋅ 𝑦
1: Write 𝑥 = 𝑥_{𝑛−1} 𝑥_{𝑛−2} ⋯ 𝑥_0 and 𝑦 = 𝑦_{𝑚−1} 𝑦_{𝑚−2} ⋯ 𝑦_0 in dec-
imal place-value notation. # 𝑥_0 is the ones digit of 𝑥, 𝑥_1 is
the tens digit, etc.
2: Let 𝑟𝑒𝑠𝑢𝑙𝑡 ← 0
3: for 𝑖 = 0, … , 𝑛 − 1 do
4: for 𝑗 = 0, … , 𝑚 − 1 do
5: 𝑟𝑒𝑠𝑢𝑙𝑡 ← 𝑟𝑒𝑠𝑢𝑙𝑡 + 10^{𝑖+𝑗} ⋅ 𝑥_𝑖 ⋅ 𝑦_𝑗
6: end for
7: end for
8: return 𝑟𝑒𝑠𝑢𝑙𝑡

Both Algorithm 0.1 and Algorithm 0.2 assume that we already


know how to add numbers, and Algorithm 0.2 also assumes that we
can multiply a number by a power of 10 (which is, after all, a sim-
ple shift). Suppose that 𝑥 and 𝑦 are two integers of 𝑛 = 20 decimal
digits each. (This roughly corresponds to 64 binary digits, which is
a common size in many programming languages.) Computing 𝑥 ⋅ 𝑦
using Algorithm 0.1 entails adding 𝑥 to itself 𝑦 times which entails
(since 𝑦 is a 20-digit number) at least 10^19 additions. In contrast, the
grade-school algorithm (i.e., Algorithm 0.2) involves 𝑛^2 shifts and
single-digit products, and so at most 2𝑛^2 = 800 single-digit opera-
tions. To understand the difference, consider that a grade-schooler can
perform a single-digit operation in about 2 seconds, and so would re-
quire about 1,600 seconds (about half an hour) to compute 𝑥 ⋅ 𝑦 using
Algorithm 0.2. In contrast, even though it is more than a billion times
faster than a human, if we used Algorithm 0.1 to compute 𝑥 ⋅ 𝑦 using a
modern PC, it would take us 10^20/10^9 = 10^11 seconds (which is more
than three millennia!) to compute the same result.
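
To make this comparison concrete, here is a rough Python rendering of the two algorithms (a simplified sketch of Algorithm 0.1 and Algorithm 0.2, not code from the book; the function names multiply_by_addition and gradeschool_multiply are mine):

```python
def multiply_by_addition(x, y):
    # Algorithm 0.1: repeated addition; performs y additions.
    result = 0
    for _ in range(y):
        result += x
    return result

def gradeschool_multiply(x, y):
    # Algorithm 0.2: digit-by-digit multiplication; roughly n*m single-digit steps.
    x_digits = [int(d) for d in reversed(str(x))]   # x_digits[0] is the ones digit
    y_digits = [int(d) for d in reversed(str(y))]
    result = 0
    for i, xi in enumerate(x_digits):
        for j, yj in enumerate(y_digits):
            result += (10 ** (i + j)) * xi * yj
    return result

print(multiply_by_addition(12, 34), gradeschool_multiply(12, 34))   # 408 408
```

On two 20-digit inputs, multiply_by_addition would need on the order of 10^19 loop iterations, while gradeschool_multiply performs only 20 × 20 = 400 inner-loop steps.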
Computers have not made algorithms obsolete. On the contrary,
the vast increase in our ability to measure, store, and communicate
data has led to much higher demand for developing better and more
sophisticated algorithms that empower us to make better decisions
based on these data. We also see that, to no small extent, the notion of
algorithm is independent of the actual computing device that executes
it. The digit-by-digit multiplication algorithm is vastly better than
iterated addition, regardless of whether the technology we use to
implement it is a silicon-based chip, or a third-grader with pen and
paper.
Theoretical computer science is concerned with the inherent proper-
ties of algorithms and computation; namely, those properties that are

independent of current technology. We ask some questions that were


already pondered by the Babylonians, such as “what is the best way to
multiply two numbers?”, but also questions that rely on cutting-edge
science such as “could we use the effects of quantum entanglement to
factor numbers faster?”.

Remark 0.3 — Specification, implementation, and analysis
of algorithms. A full description of an algorithm has
three components:

• Specification: What is the task that the algorithm


performs (e.g., multiplication in the case of Algo-
rithm 0.1 and Algorithm 0.2.)
• Implementation: How is the task accomplished:
what is the sequence of instructions to be per-
formed. Even though Algorithm 0.1 and Algo-
rithm 0.2 perform the same computational task
(i.e., they have the same specification), they do it in
different ways (i.e., they have different implementa-
tions).
• Analysis: Why does this sequence of instructions
achieve the desired task. A full description of Algo-
rithm 0.1 and Algorithm 0.2 will include a proof for
each one of these algorithms that on input 𝑥, 𝑦, the
algorithm does indeed output 𝑥 ⋅ 𝑦.

Often as part of the analysis we show that the algorithm is not only correct but also efficient. That is, we want to show that not only will the algorithm compute the desired task, but that it will do so in a prescribed number of operations. For example, Algorithm 0.2 computes the multiplication function on inputs of 𝑛 digits using 𝑂(𝑛^2) operations, while Algorithm 0.4 (described below) computes the same function using 𝑂(𝑛^1.6) operations. (We define the 𝑂 notations used here in Section 1.4.8.)

0.2 EXTENDED EXAMPLE: A FASTER WAY TO MULTIPLY (OPTIONAL)
Once you think of the standard digit-by-digit multiplication algo-
rithm, it seems like the “obviously best” way to multiply numbers.
In 1960, the famous mathematician Andrey Kolmogorov organized
a seminar at Moscow State University in which he conjectured that
every algorithm for multiplying two 𝑛 digit numbers would require
a number of basic operations that is proportional to 𝑛^2 (Ω(𝑛^2) opera-
tions, using 𝑂-notation as defined in Chapter 1). In other words, Kol-
mogorov conjectured that in any multiplication algorithm, doubling
the number of digits would quadruple the number of basic operations

required. A young student named Anatoly Karatsuba was in the au-


dience, and within a week he disproved Kolmogorov’s conjecture by
discovering an algorithm that requires only about 𝐶𝑛^1.6 operations
for some constant 𝐶. Such a number becomes much smaller than 𝑛^2
as 𝑛 grows and so for large 𝑛 Karatsuba's algorithm is superior to the
grade-school one. (For example, Python's implementation switches
from the grade-school algorithm to Karatsuba's algorithm for num-
bers that are 1000 bits or larger.) While the difference between an
𝑂(𝑛^1.6) and an 𝑂(𝑛^2) algorithm can sometimes be crucial in practice
(see Section 0.3 below), in this book we will mostly ignore such dis-
tinctions. However, we describe Karatsuba’s algorithm below since it
is a good example of how algorithms can often be surprising, as well
as a demonstration of the analysis of algorithms, which is central to this
book and to theoretical computer science at large.
Karatsuba’s algorithm is based on a faster way to multiply two-digit
numbers. Suppose that 𝑥, 𝑦 ∈ [100] = {0, … , 99} are a pair of two-digit numbers. Let's write x̄ for the “tens” digit of 𝑥, and x̲ for the “ones” digit, so that 𝑥 = 10x̄ + x̲, and write similarly 𝑦 = 10ȳ + y̲ for x̄, x̲, ȳ, y̲ ∈ [10]. The grade-school algorithm for multiplying 𝑥 and 𝑦 is illustrated in Fig. 1.
The grade-school algorithm can be thought of as transforming the task of multiplying a pair of two-digit numbers into four single-digit multiplications via the formula

(10x̄ + x̲) × (10ȳ + y̲) = 100x̄ȳ + 10(x̄y̲ + x̲ȳ) + x̲y̲   (1)

Figure 1: The grade-school multiplication algorithm illustrated for multiplying 𝑥 = 10x̄ + x̲ and 𝑦 = 10ȳ + y̲. It uses the formula (10x̄ + x̲) × (10ȳ + y̲) = 100x̄ȳ + 10(x̄y̲ + x̲ȳ) + x̲y̲.

Generally, in the grade-school algorithm doubling the number of digits in the input results in quadrupling the number of operations, leading to an 𝑂(𝑛^2)-time algorithm. In contrast, Karatsuba's algorithm is based on the observation that we can express Eq. (1) also as

(10x̄ + x̲) × (10ȳ + y̲) = (100 − 10)x̄ȳ + 10[(x̄ + x̲)(ȳ + y̲)] − (10 − 1)x̲y̲   (2)

which reduces multiplying the two-digit numbers 𝑥 and 𝑦 to computing the following three simpler products: x̄ȳ, x̲y̲ and (x̄ + x̲)(ȳ + y̲). By repeating the same strategy recursively, we can reduce the task of multiplying two 𝑛-digit numbers to the task of multiplying three pairs of ⌊𝑛/2⌋ + 1 digit numbers.³ Since every time we double the number of digits we triple the number of operations, we will be able to multiply numbers of 𝑛 = 2^ℓ digits using about 3^ℓ = 𝑛^{log₂ 3} ∼ 𝑛^1.585 operations.

³ If 𝑥 is a number then ⌊𝑥⌋ is the integer obtained by rounding it down, see Section 1.7.
The above is the intuitive idea behind Karatsuba’s algorithm, but is
not enough to fully specify it. A complete description of an algorithm
entails a precise specification of its operations together with its analysis:

proof that the algorithm does in fact do what it’s supposed to do. The
operations of Karatsuba’s algorithm are detailed in Algorithm 0.4,
while the analysis is given in Lemma 0.5 and Lemma 0.6.

Algorithm 0.4 — Karatsuba multiplication.

Input: non-negative integers 𝑥, 𝑦 each of at most 𝑛 digits


Output: 𝑥 ⋅ 𝑦
1: procedure Karatsuba(𝑥, 𝑦)
2:   if 𝑛 ≤ 4 then return 𝑥 ⋅ 𝑦
3:   Let 𝑚 = ⌊𝑛/2⌋
4:   Write 𝑥 = 10^𝑚 x̄ + x̲ and 𝑦 = 10^𝑚 ȳ + y̲
5:   𝐴 ← Karatsuba(x̄, ȳ)
6:   𝐵 ← Karatsuba(x̄ + x̲, ȳ + y̲)
7:   𝐶 ← Karatsuba(x̲, y̲)
8:   return (10^{2𝑚} − 10^𝑚) ⋅ 𝐴 + 10^𝑚 ⋅ 𝐵 + (1 − 10^𝑚) ⋅ 𝐶
9: end procedure

Figure 2: Karatsuba's multiplication algorithm illustrated for multiplying 𝑥 = 10x̄ + x̲ and 𝑦 = 10ȳ + y̲. We compute the three orange, green and purple products x̄ȳ, x̲y̲ and (x̄ + x̲)(ȳ + y̲) and then add and subtract them to obtain the result.
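
The following Python sketch (ours, included only for illustration; it uses Python's built-in integers for the base case and for the additions) mirrors the recursion of Algorithm 0.4:

```python
def karatsuba(x, y):
    # n is the maximum number of decimal digits of x and y.
    n = max(len(str(x)), len(str(y)))
    if n <= 4:                       # base case: small enough to multiply directly
        return x * y
    m = n // 2
    x_hi, x_lo = divmod(x, 10**m)    # x = 10^m * x_hi + x_lo
    y_hi, y_lo = divmod(y, 10**m)    # y = 10^m * y_hi + y_lo
    A = karatsuba(x_hi, y_hi)
    B = karatsuba(x_hi + x_lo, y_hi + y_lo)
    C = karatsuba(x_lo, y_lo)
    # x*y = 10^(2m)*A + 10^m*(B - A - C) + C,
    # i.e. (10^(2m) - 10^m)*A + 10^m*B + (1 - 10^m)*C as in line 8 above.
    return 10**(2*m) * A + 10**m * (B - A - C) + C

assert karatsuba(1234567, 7654321) == 1234567 * 7654321
```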

Algorithm 0.4 is only half of the full description of Karatsuba’s


algorithm. The other half is the analysis, which entails proving that (1)
Algorithm 0.4 indeed computes the multiplication operation and (2)
it does so using 𝑂(𝑛^{log₂ 3}) operations. We now turn to showing both
facts:
Figure 3: Running time of Karatsuba's algorithm vs. the grade-school algorithm. (Python implementation available online.) Note the existence of a “cutoff” length, where for sufficiently large inputs Karatsuba becomes more efficient than the grade-school algorithm. The precise cutoff location varies by implementation and platform details, but will always occur eventually.

Lemma 0.5 For all non-negative integers 𝑥, 𝑦, when given input 𝑥, 𝑦, Algorithm 0.4 will output 𝑥 ⋅ 𝑦.

Proof. Let 𝑛 be the maximum number of digits of 𝑥 and 𝑦. We prove the lemma by induction on 𝑛. The base case is 𝑛 ≤ 4 where the algorithm returns 𝑥 ⋅ 𝑦 by definition. (It does not matter which algorithm we use to multiply four-digit numbers - we can even use repeated addition.) Otherwise, if 𝑛 > 4, we define 𝑚 = ⌊𝑛/2⌋, and write 𝑥 = 10^𝑚 x̄ + x̲ and 𝑦 = 10^𝑚 ȳ + y̲.
Plugging this into 𝑥 ⋅ 𝑦, we get

𝑥 ⋅ 𝑦 = 10^{2𝑚} x̄ȳ + 10^𝑚 (x̄y̲ + x̲ȳ) + x̲y̲ .   (3)

Rearranging the terms we see that

𝑥 ⋅ 𝑦 = 10^{2𝑚} x̄ȳ + 10^𝑚 [(x̄ + x̲)(ȳ + y̲) − x̄ȳ − x̲y̲] + x̲y̲ .   (4)

Since the numbers x̄, x̲, ȳ, y̲, x̄ + x̲, ȳ + y̲ all have at most 𝑚 + 2 < 𝑛 digits, the induction hypothesis implies that the values 𝐴, 𝐵, 𝐶 computed by the recursive calls will satisfy 𝐴 = x̄ȳ, 𝐵 = (x̄ + x̲)(ȳ + y̲) and 𝐶 = x̲y̲. Plugging this into (4) we see that 𝑥 ⋅ 𝑦 equals the value (10^{2𝑚} − 10^𝑚) ⋅ 𝐴 + 10^𝑚 ⋅ 𝐵 + (1 − 10^𝑚) ⋅ 𝐶 computed by Algorithm 0.4. ■


Lemma 0.6 If 𝑥, 𝑦 are integers of at most 𝑛 digits, Algorithm 0.4 will


take 𝑂(𝑛^{log₂ 3}) operations on input 𝑥, 𝑦.

Proof. Fig. 4 illustrates the idea behind the proof, which we only
sketch here, leaving filling out the details as Exercise 0.4. The proof
is again by induction. We define 𝑇(𝑛) to be the maximum number of
steps that Algorithm 0.4 takes on inputs of length at most 𝑛. Since in
the base case 𝑛 ≤ 4, Algorithm 0.4 performs a constant number of
computations, we know that 𝑇(4) ≤ 𝑐 for some constant 𝑐 and for 𝑛 > 4, it
satisfies the recursive equation

𝑇 (𝑛) ≤ 3𝑇 (⌊𝑛/2⌋ + 1) + 𝑐′ 𝑛 (5)

for some constant 𝑐′ (using the fact that addition can be done in 𝑂(𝑛)
operations).
The recursive equation (5) solves to 𝑂(𝑛^{log₂ 3}). The intuition behind this is presented in Fig. 4, and this is also a consequence of the so-called “Master Theorem” on recurrence relations. As mentioned above, we leave completing the proof to the reader as Exercise 0.4. ■
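
To get a feel for why the recurrence (5) leads to an 𝑂(𝑛^{log₂ 3}) bound, one can also evaluate it numerically. The following small Python script (ours, with an arbitrary choice of constants, purely for illustration) computes the recurrence with equality and compares it to 𝑛^{log₂ 3}:

```python
import functools
import math

C = 1  # plays the role of the constants c and c' in the recurrence; chosen arbitrarily

@functools.lru_cache(maxsize=None)
def T(n):
    # T(n) = 3*T(floor(n/2) + 1) + C*n, with T(n) = C for n <= 4.
    if n <= 4:
        return C
    return 3 * T(n // 2 + 1) + C * n

for n in [10, 100, 1000, 10**4, 10**5]:
    print(n, T(n), round(T(n) / n ** math.log2(3), 2))
# The last column (the ratio T(n) / n^{log2 3}) stays bounded as n grows,
# consistent with T(n) = O(n^{log2 3}).
```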

Figure 4: Karatsuba's algorithm reduces an 𝑛-bit multiplication to three 𝑛/2-bit multiplications, which in turn are reduced to nine 𝑛/4-bit multiplications and so on. We can represent the computational cost of all these multiplications in a 3-ary tree of depth log₂ 𝑛, where at the root the extra cost is 𝑐𝑛 operations, at the first level the extra cost is 𝑐(𝑛/2) operations, and at each of the 3^𝑖 nodes of level 𝑖, the extra cost is 𝑐(𝑛/2^𝑖). The total cost is 𝑐𝑛 ∑_{𝑖=0}^{log₂ 𝑛} (3/2)^𝑖 ≤ 10𝑐𝑛^{log₂ 3} by the formula for summing a geometric series.

Karatsuba’s algorithm is by no means the end of the line for multi-


plication algorithms. In the 1960’s, Toom and Cook extended Karat-
suba’s ideas to get an 𝑂(𝑛^{log_𝑘(2𝑘−1)}) time multiplication algorithm for
every constant 𝑘. In 1971, Schönhage and Strassen got even better al-
gorithms using the Fast Fourier Transform; their idea was to somehow
treat integers as “signals” and do the multiplication more efficiently
by moving to the Fourier domain. (The Fourier transform is a central
tool in mathematics and engineering, used in a great many applica-
tions; if you have not seen it yet, you are likely to encounter it at some
point in your studies.) In the years that followed researchers kept im-
proving the algorithm, and only very recently Harvey and Van Der

Hoeven managed to obtain an 𝑂(𝑛 log 𝑛) time algorithm for multipli-


cation (though it only starts beating the Schönhage-Strassen algorithm
for truly astronomical numbers). Yet, despite all this progress, we
still don’t know whether or not there is an 𝑂(𝑛) time algorithm for
multiplying two 𝑛 digit numbers!

R
Remark 0.7 — Matrix Multiplication (advanced note).
(This book contains many “advanced” or “optional”
notes and sections. These may assume background
that not every student has, and can be safely skipped
over as none of the future parts depends on them.)
Ideas similar to Karatsuba’s can be used to speed up
matrix multiplications as well. Matrices are a powerful
way to represent linear equations and operations,
widely used in numerous applications of scientific
computing, graphics, machine learning, and many
many more.
One of the basic operations one can do with
two matrices is to multiply them. For example, if

  𝑥 = ( 𝑥_{0,0}  𝑥_{0,1} )    and    𝑦 = ( 𝑦_{0,0}  𝑦_{0,1} )
      ( 𝑥_{1,0}  𝑥_{1,1} )              ( 𝑦_{1,0}  𝑦_{1,1} )

then the product of 𝑥 and 𝑦 is the matrix

  ( 𝑥_{0,0}𝑦_{0,0} + 𝑥_{0,1}𝑦_{1,0}    𝑥_{0,0}𝑦_{0,1} + 𝑥_{0,1}𝑦_{1,1} )
  ( 𝑥_{1,0}𝑦_{0,0} + 𝑥_{1,1}𝑦_{1,0}    𝑥_{1,0}𝑦_{0,1} + 𝑥_{1,1}𝑦_{1,1} )

You can see that we can compute this matrix by eight products of numbers.
Now suppose that 𝑛 is even and 𝑥 and 𝑦 are a pair of 𝑛 × 𝑛 matrices which we can think of as each composed of four (𝑛/2) × (𝑛/2) blocks 𝑥_{0,0}, 𝑥_{0,1}, 𝑥_{1,0}, 𝑥_{1,1} and 𝑦_{0,0}, 𝑦_{0,1}, 𝑦_{1,0}, 𝑦_{1,1}. Then the formula for the matrix product of 𝑥 and 𝑦 can be expressed in the same way as above, just replacing products 𝑥_{𝑎,𝑏} 𝑦_{𝑐,𝑑} with matrix products, and addition with matrix addition. This means that we can use the formula above to give an algorithm that doubles the dimension of the matrices at the expense of increasing the number of operations by a factor of 8, which for 𝑛 = 2^ℓ results in 8^ℓ = 𝑛^3 operations.
In 1969 Volker Strassen noted that we can compute the product of a pair of two-by-two matrices using only seven products of numbers by observing that each entry of the matrix 𝑥𝑦 can be computed by adding and subtracting the following seven terms:
𝑡_1 = (𝑥_{0,0} + 𝑥_{1,1})(𝑦_{0,0} + 𝑦_{1,1}), 𝑡_2 = (𝑥_{1,0} + 𝑥_{1,1})𝑦_{0,0},
𝑡_3 = 𝑥_{0,0}(𝑦_{0,1} − 𝑦_{1,1}), 𝑡_4 = 𝑥_{1,1}(𝑦_{1,0} − 𝑦_{0,0}),
𝑡_5 = (𝑥_{0,0} + 𝑥_{0,1})𝑦_{1,1}, 𝑡_6 = (𝑥_{1,0} − 𝑥_{0,0})(𝑦_{0,0} + 𝑦_{0,1}),
𝑡_7 = (𝑥_{0,1} − 𝑥_{1,1})(𝑦_{1,0} + 𝑦_{1,1}). Indeed, one can verify that

  𝑥𝑦 = ( 𝑡_1 + 𝑡_4 − 𝑡_5 + 𝑡_7    𝑡_3 + 𝑡_5 )
       ( 𝑡_2 + 𝑡_4                𝑡_1 + 𝑡_3 − 𝑡_2 + 𝑡_6 )

Using this observation, we can obtain an algorithm such that doubling the dimension of the matrices results in increasing the number of operations by a factor of 7, which means that for 𝑛 = 2^ℓ the cost is 7^ℓ = 𝑛^{log₂ 7} ∼ 𝑛^2.807. A long sequence of work has since improved this algorithm, and the current record has a running time of about 𝑂(𝑛^2.373). However, unlike the case of integer multiplication, at the moment we don't know of any algorithm for matrix multiplication that runs in time linear or even close to linear in the size of the input matrices (e.g., an 𝑂(𝑛^2 polylog(𝑛)) time algorithm). People have tried to use group representations, which can be thought of as generalizations of the Fourier transform, to obtain faster algorithms, but this effort has not yet succeeded.
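
As a quick sanity check of the seven-product identity, the following Python snippet (ours, purely for illustration) compares it against the usual eight-product formula on random 2 × 2 integer matrices:

```python
import random

def strassen_2x2(x, y):
    # Strassen's seven products for 2x2 matrices (0-indexed as in the text above).
    t1 = (x[0][0] + x[1][1]) * (y[0][0] + y[1][1])
    t2 = (x[1][0] + x[1][1]) * y[0][0]
    t3 = x[0][0] * (y[0][1] - y[1][1])
    t4 = x[1][1] * (y[1][0] - y[0][0])
    t5 = (x[0][0] + x[0][1]) * y[1][1]
    t6 = (x[1][0] - x[0][0]) * (y[0][0] + y[0][1])
    t7 = (x[0][1] - x[1][1]) * (y[1][0] + y[1][1])
    return [[t1 + t4 - t5 + t7, t3 + t5],
            [t2 + t4, t1 + t3 - t2 + t6]]

def direct_2x2(x, y):
    # The usual formula with eight products.
    return [[x[0][0]*y[0][0] + x[0][1]*y[1][0], x[0][0]*y[0][1] + x[0][1]*y[1][1]],
            [x[1][0]*y[0][0] + x[1][1]*y[1][0], x[1][0]*y[0][1] + x[1][1]*y[1][1]]]

for _ in range(100):
    x = [[random.randint(-9, 9) for _ in range(2)] for _ in range(2)]
    y = [[random.randint(-9, 9) for _ in range(2)] for _ in range(2)]
    assert strassen_2x2(x, y) == direct_2x2(x, y)
```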

0.3 ALGORITHMS BEYOND ARITHMETIC


The quest for better algorithms is by no means restricted to arithmetic
tasks such as adding, multiplying or solving equations. Many graph
algorithms, including algorithms for finding paths, matchings, span-
ning trees, cuts, and flows, have been discovered in the last several
decades, and this is still an intensive area of research. (For example,
the last few years saw many advances in algorithms for the maximum
flow problem, borne out of unexpected connections with electrical cir-
cuits and linear equation solvers.) These algorithms are being used
not just for the “natural” applications of routing network traffic or
GPS-based navigation, but also for applications as varied as drug dis-
covery through searching for structures in gene-interaction graphs to
computing risks from correlations in financial investments.
Google was founded based on the PageRank algorithm, which is
an efficient algorithm to approximate the “principal eigenvector” of
(a dampened version of) the adjacency matrix of the web graph. The
Akamai company was founded based on a new data structure, known
as consistent hashing, for a hash table where buckets are stored at dif-
ferent servers. The backpropagation algorithm, which computes partial
derivatives of a neural network in 𝑂(𝑛) instead of 𝑂(𝑛^2) time, under-
lies many of the recent phenomenal successes of learning deep neural
networks. Algorithms for solving linear equations under sparsity
constraints, a concept known as compressed sensing, have been used
to drastically reduce the amount and quality of data needed to ana-
lyze MRI images. This made a critical difference for MRI imaging of
cancer tumors in children, where previously doctors needed to use
anesthesia to suspend breath during the MRI exam, sometimes with
dire consequences.

Even for classical questions, studied through the ages, new dis-
coveries are still being made. For example, for the question of de-
termining whether a given integer is prime or composite, which has
been studied since the days of Pythagoras, efficient probabilistic algo-
rithms were only discovered in the 1970s, while the first deterministic
polynomial-time algorithm was only found in 2002. For the related
problem of actually finding the factors of a composite number, new
algorithms were found in the 1980s, and (as we’ll see later in this
course) discoveries in the 1990s raised the tantalizing prospect of
obtaining faster algorithms through the use of quantum mechanical
effects.
Despite all this progress, there are still many more questions than
answers in the world of algorithms. For almost all natural prob-
lems, we do not know whether the current algorithm is the “best”,
or whether a significantly better one is still waiting to be discovered.
As alluded to in Cobham’s opening quote for this chapter, even for
the basic problem of multiplying numbers we have not yet answered
the question of whether there is a multiplication algorithm that is as
efficient as our algorithms for addition. But at least we now know the
right way to ask it.

0.4 ON THE IMPORTANCE OF NEGATIVE RESULTS


Finding better algorithms for problems such as multiplication, solv-
ing equations, graph problems, or fitting neural networks to data, is
undoubtedly a worthwhile endeavor. But why is it important to prove
that such algorithms don’t exist? One motivation is pure intellectual
curiosity. Another reason to study impossibility results is that they
correspond to the fundamental limits of our world. In other words,
impossibility results are laws of nature.
Here are some examples of impossibility results outside computer
science (see Section 0.7 for more about these). In physics, the impos-
sibility of building a perpetual motion machine corresponds to the law
of conservation of energy. The impossibility of building a heat engine
beating Carnot’s bound corresponds to the second law of thermody-
namics, while the impossibility of faster-than-light information trans-
mission is a cornerstone of special relativity. In mathematics, while we
all learned the formula for solving quadratic equations in high school,
the impossibility of generalizing this formula to equations of degree
five or more gave birth to group theory. The impossibility of proving
Euclid’s fifth axiom from the first four gave rise to non-Euclidean ge-
ometries, which ended up crucial for the theory of general relativity.
In an analogous way, impossibility results for computation corre-
spond to “computational laws of nature” that tell us about the fun-
damental limits of any information processing apparatus, whether

based on silicon, neurons, or quantum particles. Moreover, computer


scientists found creative approaches to apply computational limitations
to achieve certain useful tasks. For example, much of modern Internet
traffic is encrypted using the RSA encryption scheme, the security of
which relies on the (conjectured) impossibility of efficiently factoring
large integers. More recently, the Bitcoin system uses a digital ana-
log of the “gold standard” where, instead of using a precious metal,
new currency is obtained by “mining” solutions for computationally
difficult problems.

✓ Chapter Recap

• The history of algorithms goes back thousands


of years; they have been essential to much of hu-
man progress and these days form the basis of
multi-billion dollar industries, as well as life-saving
technologies.
• There is often more than one algorithm to achieve
the same computational task. Finding a faster al-
gorithm can often make a much bigger difference
than improving computing hardware.
• Better algorithms and data structures don’t just
speed up calculations, but can yield new qualitative
insights.
• One question we will study is to find out what is
the most efficient algorithm for a given problem.
• To show that an algorithm is the most efficient one
for a given problem, we need to be able to prove
that it is impossible to solve the problem using a
smaller amount of computational resources.

0.5 ROADMAP TO THE REST OF THIS BOOK


Often, when we try to solve a computational problem, whether it is
solving a system of linear equations, finding the top eigenvector of a
matrix, or trying to rank Internet search results, it is enough to use the
“I know it when I see it” standard for describing algorithms. As long
as we find some way to solve the problem, we are happy and might
not care much about the exact mathematical model for our algorithm.
But when we want to answer a question such as “does there exist an
algorithm to solve the problem 𝑃 ?” we need to be much more precise.
In particular, we will need to (1) define exactly what it means to
solve 𝑃 , and (2) define exactly what an algorithm is. Even (1) can
sometimes be non-trivial but (2) is particularly challenging; it is not
at all clear how (and even whether) we can encompass all potential
ways to design algorithms. We will consider several simple models of
computation, and argue that, despite their simplicity, they do capture

all “reasonable” approaches to achieve computing, including all those


that are currently used in modern computing devices.
Once we have these formal models of computation, we can try
to obtain impossibility results for computational tasks, showing that
some problems can not be solved (or perhaps can not be solved within
the resources of our universe). Archimedes once said that given a
fulcrum and a long enough lever, he could move the world. We will
see how reductions allow us to leverage one hardness result into a
great many others, illuminating the boundaries between
the computable and uncomputable (or tractable and intractable)
problems.
Later in this book we will go back to examining our models of
computation, and see how resources such as randomness or quantum
entanglement could potentially change the power of our model. In
the context of probabilistic algorithms, we will see a glimpse of how
randomness has become an indispensable tool for understanding
computation, information, and communication. We will also see how
computational difficulty can be an asset rather than a hindrance, and
be used for the “derandomization” of probabilistic algorithms. The
same ideas also show up in cryptography, which has undergone not
just a technological but also an intellectual revolution in the last few
decades, much of it building on the foundations that we explore in
this course.
Theoretical Computer Science is a vast topic, branching out and
touching upon many scientific and engineering disciplines. This book
provides a very partial (and biased) sample of this area. More than
anything, I hope I will manage to “infect” you with at least some of
my love for this field, which is inspired and enriched by the connec-
tion to practice, but is also deep and beautiful regardless of applica-
tions.

0.5.1 Dependencies between chapters


This book is divided into the following parts, see Fig. 5.

• Preliminaries: Introduction, mathematical background, and repre-


senting objects as strings.

• Part I: Finite computation (Boolean circuits): Equivalence of cir-


cuits and straight-line programs. Universal gate sets. Existence of a
circuit for every function, representing circuits as strings, universal
circuit, lower bound on circuit size using the counting argument.

• Part II: Uniform computation (Turing machines): Equivalence of


Turing machines and programs with loops. Equivalence of models
(including RAM machines, 𝜆 calculus, and cellular automata),
configurations of Turing machines, existence of a universal Turing

machine, uncomputable functions (including the Halting problem


and Rice’s Theorem), Gödel’s incompleteness theorem, restricted
computational models (regular and context free languages).

• Part III: Efficient computation: Definition of running time, time


hierarchy theorem, P and NP, P/poly , NP completeness and the
Cook-Levin Theorem, space bounded computation.

• Part IV: Randomized computation: Probability, randomized algo-


rithms, BPP, amplification, BPP ⊆ P/𝑝𝑜𝑙𝑦 , pseudorandom genera-
tors and derandomization.

• Part V: Advanced topics: Cryptography, proofs and algorithms


(interactive and zero knowledge proofs, Curry-Howard correspon-
dence), quantum computing.

Figure 5: The dependency structure of the different


parts. Part I introduces the model of Boolean cir-
cuits to study finite functions with an emphasis on
quantitative questions (how many gates to compute
a function). Part II introduces the model of Turing
machines to study functions that have unbounded input
lengths with an emphasis on qualitative questions (is
this function computable or not). Much of Part II does
not depend on Part I, as Turing machines can be used
as the first computational model. Part III depends
on both parts as it introduces a quantitative study of
functions with unbounded input length. The more
advanced parts IV (randomized computation) and
V (advanced topics) rely on the material of Parts I, II
and III.

The book largely proceeds in linear order, with each chapter build-
ing on the previous ones, with the following exceptions:

• The topics of 𝜆 calculus (Section 8.5), Gödel’s in-
completeness theorem (Chapter 11), Automata/regular expres-
sions and context-free grammars (Chapter 10), and space-bounded
computation (Chapter 17), are not used in the following chapters.
Hence you can choose whether to cover or skip any subset of them.

• Part II (Uniform Computation / Turing Machines) does not have


a strong dependency on Part I (Finite computation / Boolean cir-
cuits) and it should be possible to teach them in the reverse order
with minor modification. Boolean circuits are used in Part III (efficient
computation) for results such as P ⊆ P/poly and the Cook-Levin

Theorem, as well as in Part IV (for BPP ⊆ P/poly and derandom-


ization) and Part V (specifically in cryptography and quantum
computing).

• All chapters in Part V (Advanced topics) are independent of one


another and can be covered in any order.

A course based on this book can use all of Parts I, II, and III (possi-
bly skipping over some or all of the 𝜆 calculus, Chapter 11, Chapter 10
or Chapter 17), and then either cover all or some of Part IV (random-
ized computation), and add a “sprinkling” of advanced topics from
Part V based on student or instructor interest.

0.6 EXERCISES
Exercise 0.1Rank the significance of the following inventions in speed-
ing up the multiplication of large (that is 100-digit or more) numbers.
That is, use “back of the envelope” estimates to order them in terms of
the speedup factor they offered over the previous state of affairs.

a. Discovery of the grade-school digit by digit algorithm (improving


upon repeated addition).

b. Discovery of Karatsuba’s algorithm (improving upon the digit by


digit algorithm).

c. Invention of modern electronic computers (improving upon calcu-


lations with pen and paper).

Exercise 0.2 The 1977 Apple II personal computer had a processor


speed of 1.023 MHz or about 10^6 operations per second. At the time of this writing the world's fastest supercomputer performs 93 “petaflops” (10^15 floating point operations per second) or about 10^18 basic steps per second. For each one of the following running times
(as a function of the input length 𝑛), compute for both computers how
large an input they could handle in a week of computation, if they run
an algorithm that has this running time:

a. 𝑛 operations.

b. 𝑛^2 operations.

c. 𝑛 log 𝑛 operations.

d. 2^𝑛 operations.

e. 𝑛! operations.

Exercise 0.3 — Usefulness of algorithmic non-existence. In this chapter we mentioned several companies that were founded based on the discovery of new algorithms. Can you give an example of a company that was founded based on the non-existence of an algorithm? See footnote for hint.⁴

⁴ As we will see in Chapter 21, almost any company relying on cryptography needs to assume the non-existence of certain algorithms. In particular, RSA Security was founded based on the security of the RSA cryptosystem, which presumes the non-existence of an efficient algorithm to compute the prime factorization of large integers.

Exercise 0.4 — Analysis of Karatsuba's Algorithm.

a. Suppose that 𝑇_1, 𝑇_2, 𝑇_3, … is a sequence of numbers such that 𝑇_2 ≤ 10 and for every 𝑛, 𝑇_𝑛 ≤ 3𝑇_{⌊𝑛/2⌋+1} + 𝐶𝑛 for some 𝐶 ≥ 1. Prove that 𝑇_𝑛 ≤ 20𝐶𝑛^{log₂ 3} for every 𝑛 > 2.⁵

⁵ Hint: Use a proof by induction - suppose that this is true for all 𝑛's from 1 to 𝑚 and prove that this is true also for 𝑚 + 1.

b. Prove that the number of single-digit operations that Karatsuba's algorithm takes to multiply two 𝑛 digit numbers is at most 1000𝑛^{log₂ 3}.

Exercise 0.5 Implement in the programming language of your


choice functions Gradeschool_multiply(x,y) and Karatsuba_multiply(x,y) that take two arrays of digits x and y and return
an array representing the product of x and y (where x is identified
with the number x[0]+10*x[1]+100*x[2]+... etc..) using the
grade-school algorithm and the Karatsuba algorithm respectively.
At what number of digits does the Karatsuba algorithm beat the
grade-school one?

Exercise 0.6 — Matrix Multiplication (optional, advanced). In this exercise, we show that if for some 𝜔 > 2, we can write the product of two 𝑘 × 𝑘 real-valued matrices 𝐴, 𝐵 using at most 𝑘^𝜔 multiplications, then we can multiply two 𝑛 × 𝑛 matrices in roughly 𝑛^𝜔 time for every large enough 𝑛.
To make this precise, we need to make some notation that is unfortunately somewhat cumbersome. Assume that there is some 𝑘 ∈ ℕ and 𝑚 ≤ 𝑘^𝜔 such that for every 𝑘 × 𝑘 matrices 𝐴, 𝐵, 𝐶 such that 𝐶 = 𝐴𝐵, we can write for every 𝑖, 𝑗 ∈ [𝑘]:

𝐶_{𝑖,𝑗} = ∑_{ℓ=0}^{𝑚−1} 𝛼_ℓ^{𝑖,𝑗} 𝑓_ℓ(𝐴) 𝑔_ℓ(𝐵)

for some linear functions 𝑓_0, … , 𝑓_{𝑚−1}, 𝑔_0, … , 𝑔_{𝑚−1} ∶ ℝ^{𝑘²} → ℝ and coefficients {𝛼_ℓ^{𝑖,𝑗}}_{𝑖,𝑗∈[𝑘],ℓ∈[𝑚]}. Prove that under this assumption for every 𝜖 > 0, if 𝑛 is sufficiently large, then there is an algorithm that computes the product of two 𝑛 × 𝑛 matrices using at most 𝑂(𝑛^{𝜔+𝜖}) arithmetic operations. See footnote for hint.⁶ ■

⁶ Hint: Start by showing this for the case that 𝑛 = 𝑘^𝑡 for some natural number 𝑡, in which case you can do so recursively by breaking the matrices into 𝑘 × 𝑘 blocks.

0.7 BIBLIOGRAPHICAL NOTES


For a brief overview of what we’ll see in this book, you could do far
worse than read Bernard Chazelle’s wonderful essay on the Algo-
rithm as an Idiom of modern science. The book of Moore and Mertens
[MM11] gives a wonderful and comprehensive overview of the theory
of computation, including much of the content discussed in this chap-
ter and the rest of this book. Aaronson’s book [Aar13] is another great
read that touches upon many of the same themes.
For more on the algorithms the Babylonians used, see Knuth’s
paper and Neugebauer’s classic book.
Many of the algorithms we mention in this chapter are covered
in algorithms textbooks such as those by Cormen, Leiserson, Rivest,
and Stein [Cor+09], Kleinberg and Tardos [KT06], and Dasgupta, Pa-
padimitriou and Vazirani [DPV08], as well as Jeff Erickson’s textbook.
Erickson’s book is freely available online and contains a great exposi-
tion of recursive algorithms in general and Karatsuba’s algorithm in
particular.
The story of Karatsuba’s discovery of his multiplication algorithm
is recounted by him in [Kar95]. As mentioned above, further improve-
ments were made by Toom and Cook [Too63; Coo66], Schönhage and
Strassen [SS71], Fürer [Für07], and recently by Harvey and Van Der
Hoeven [HV19], see this article for a nice overview. The last papers
crucially rely on the Fast Fourier transform algorithm. The fascinating
story of the (re)discovery of this algorithm by John Tukey in the con-
text of the cold war is recounted in [Coo87]. (We say re-discovery
because it later turned out that the algorithm dates back to Gauss
[HJB85].) The Fast Fourier Transform is covered in some of the books
mentioned below, and there are also online available lectures such as
Jeff Erickson’s. See also this popular article by David Austin. Fast ma-
trix multiplication was discovered by Strassen [Str69], and since then
this has been an active area of research. [Blä13] is a recommended
self-contained survey of this area.
The Backpropagation algorithm for fast differentiation of neural net-
works was invented by Werbos [Wer74]. The Pagerank algorithm was
invented by Larry Page and Sergey Brin [Pag+99]. It is closely related
to the HITS algorithm of Kleinberg [Kle99]. The Akamai company was
founded based on the consistent hashing data structure described in
[Kar+97]. Compressed sensing has a long history but two foundational
papers are [CRT06; Don06]. [Lus+08] gives a survey of applications
of compressed sensing to MRI; see also this popular article by Ellen-
berg [Ell10]. The deterministic polynomial-time algorithm for testing
primality was given by Agrawal, Kayal, and Saxena [AKS04].

We alluded briefly to classical impossibility results in mathematics,


including the impossibility of proving Euclid’s fifth postulate from the
other four, impossibility of trisecting an angle with a straightedge and
compass and the impossibility of solving a quintic equation via rad-
icals. A geometric proof of the impossibility of angle trisection (one
of the three geometric problems of antiquity, going back to the an-
cient Greeks) is given in this blog post of Tao. The book of Mario Livio
[Liv05] covers some of the background and ideas behind these impos-
sibility results. Some exciting recent research is focused on trying to
use computational complexity to shed light on fundamental questions
in physics such as understanding black holes and reconciling general
relativity with quantum mechanics.
1
Mathematical Background

Learning Objectives:
• Recall basic mathematical notions such as sets, functions, numbers, logical operators and quantifiers, strings, and graphs.
• Rigorously define Big-𝑂 notation.
• Proofs by induction.
• Practice with reading mathematical definitions, statements, and proofs.
• Transform an intuitive argument into a rigorous proof.

“I found that every number, which may be expressed from one to ten, surpasses
the preceding by one unit: afterwards the ten is doubled or tripled … until
a hundred; then the hundred is doubled and tripled in the same manner as
the units and the tens … and so forth to the utmost limit of numeration.”,
Muhammad ibn Mūsā al-Khwārizmī, 820, translation by Fredric Rosen,
1831.

In this chapter we review some of the mathematical concepts that


we use in this book. These concepts are typically covered in courses
or textbooks on “mathematics for computer science” or “discrete
mathematics”; see the “Bibliographical Notes” section (Section 1.9)
for several excellent resources on these topics that are freely-available
online.
A mathematician’s apology. Some students might wonder why this
book contains so much math. The reason is that mathematics is sim-
ply a language for modeling concepts in a precise and unambiguous
way. In this book we use math to model the concept of computation.
For example, we will consider questions such as “is there an efficient
algorithm to find the prime factors of a given integer?”. (We will see that
this question is particularly interesting, touching on areas as far apart
as Internet security and quantum mechanics!) To even phrase such a
question, we need to give a precise definition of the notion of an algo-
rithm, and of what it means for an algorithm to be efficient. Also, since
there is no empirical experiment to prove the nonexistence of an algo-
rithm, the only way to establish such a result is using a mathematical
proof.

1.1 THIS CHAPTER: A READER’S MANUAL


Depending on your background, you can approach this chapter in two
different ways:

• If you have already taken “discrete mathematics”, “mathematics


for computer science” or similar courses, you do not need to read




the whole chapter. You can just take a quick look at Section 1.2 to
see the main tools we will use, Section 1.7 for our notation and con-
ventions, and then skip ahead to the rest of this book. Alternatively,
you can sit back, relax, and read this chapter just to get familiar
with our notation, as well as to enjoy (or not) my philosophical
musings and attempts at humor.

• If your background is less extensive, see Section 1.9 for some re-
sources on these topics. This chapter briefly covers the concepts
that we need, but you may find it helpful to see a more in-depth
treatment. As usual with math, the best way to get comfortable
with this material is to work out exercises on your own.

• You might also want to start brushing up on discrete probability,


which we’ll use later in this book (see Chapter 18).

1.2 A QUICK OVERVIEW OF MATHEMATICAL PREREQUISITES


The main mathematical concepts we will use are the following. We
just list these notions below, deferring their definitions to the rest of
this chapter. If you are familiar with all of these, then you might want
to just skip to Section 1.7 to see the full list of notation we use.

• Proofs: First and foremost, this book involves a heavy dose of for-
mal mathematical reasoning, which includes mathematical defini-
tions, statements, and proofs.

• Sets and set operations: We will use extensively mathematical sets.


We use the basic set relations of membership (∈) and containment
(⊆), and set operations, principally union (∪), intersection (∩), and
set difference (⧵).

• Cartesian product and Kleene star operation: We also use the


Cartesian product of two sets 𝐴 and 𝐵, denoted as 𝐴 × 𝐵 (that is,
𝐴 × 𝐵 is the set of pairs (𝑎, 𝑏) where 𝑎 ∈ 𝐴 and 𝑏 ∈ 𝐵). We denote by
𝐴𝑛 the 𝑛 fold Cartesian product (e.g., 𝐴3 = 𝐴 × 𝐴 × 𝐴) and by 𝐴∗
(known as the Kleene star) the union of 𝐴𝑛 for all 𝑛 ∈ {0, 1, 2, …}.

• Functions: The domain and codomain of a function, properties such


as being one-to-one (also known as injective) or onto (also known
as surjective) functions, as well as partial functions (that, unlike
standard or “total” functions, are not necessarily defined on all
elements of their domain).

• Logical operations: The operations AND (∧), OR (∨), and NOT


(¬) and the quantifiers “there exists” (∃) and “for all” (∀).

• Basic combinatorics: Notions such as the binomial coefficient (𝑛 choose 𝑘) (the number of 𝑘-sized


subsets of a set of size 𝑛).

• Graphs: Undirected and directed graphs, connectivity, paths, and


cycles.

• Big-𝑂 notation: 𝑂, 𝑜, Ω, 𝜔, Θ notation for analyzing asymptotic


growth of functions.

• Discrete probability: We will use probability theory, and specifi-


cally probability over finite samples spaces such as tossing 𝑛 coins,
including notions such as random variables, expectation, and concen-
tration. We will only use probability theory in the second half of
this text, and will review it beforehand in Chapter 18. However,
probabilistic reasoning is a subtle (and extremely useful!) skill, and
it’s always good to start early in acquiring it.

In the rest of this chapter we briefly review the above notions. This
is partially to remind the reader and reinforce material that might
not be fresh in your mind, and partially to introduce our notation
and conventions which might occasionally differ from those you’ve
encountered before.

1.3 READING MATHEMATICAL TEXTS


Mathematicians use jargon for the same reason that it is used in many
other professions such as engineering, law, medicine, and others. We
want to make terms precise and introduce shorthand for concepts
that are frequently reused. Mathematical texts tend to “pack a lot
of punch” per sentence, and so the key is to read them slowly and
carefully, parsing each symbol at a time.
With time and practice you will see that reading mathematical texts
becomes easier and jargon is no longer an issue. Moreover, reading
mathematical texts is one of the most transferable skills you could take
from this book. Our world is changing rapidly, not just in the realm
of technology, but also in many other human endeavors, whether it is medicine, economics, law or even culture. Whatever your future aspirations, it is likely that you will encounter texts that use new concepts that you have not seen before (see Fig. 1.1 and Fig. 1.2 for two recent examples from current “hot areas”). Being able to internalize and then apply new definitions can be hugely important. It is a skill that’s much easier to acquire in the relatively safe and stable context of a mathematical course, where one at least has the guarantee that the concepts are fully specified, and you have access to your teaching staff for questions.

Figure 1.1: A snippet from the “methods” section of the “AlphaGo Zero” paper by Silver et al, Nature, 2017.
Figure 1.2: A snippet from the “Zerocash” paper of Ben-Sasson et al, that forms the basis of the cryptocurrency startup Zcash.

The basic components of a mathematical text are definitions, assertions and proofs.

1.3.1 Definitions
Mathematicians often define new concepts in terms of old concepts.
For example, here is a mathematical definition which you may have
encountered in the past (and will see again shortly):

Let 𝑆, 𝑇 be sets. We say that a


Definition 1.1 — One to one function.
function 𝑓 ∶ 𝑆 → 𝑇 is one to one (also known as injective) if for every
two elements 𝑥, 𝑥′ ∈ 𝑆, if 𝑥 ≠ 𝑥′ then 𝑓(𝑥) ≠ 𝑓(𝑥′ ).

Definition 1.1 captures a simple concept, but even so it uses quite


a bit of notation. When reading such a definition, it is often useful to
annotate it with a pen as you’re going through it (see Fig. 1.3). For
example, when you see an identifier such as 𝑓, 𝑆 or 𝑥, make sure that
you realize what sort of object it is: is it a set, a function, an element,
a number, a gremlin? You might also find it useful to explain the
definition in words to a friend (or to yourself).

1.3.2 Assertions: Theorems, lemmas, claims


Theorems, lemmas, claims and the like are true statements about the
concepts we defined. Deciding whether to call a particular statement a
“Theorem”, a “Lemma” or a “Claim” is a judgement call, and does not make a mathematical difference. All three correspond to statements which were proven to be true. The difference is that a Theorem refers to a significant result that we would want to remember and highlight. A Lemma often refers to a technical result that is not necessarily important in its own right, but that can often be very useful in proving other theorems. A Claim is a “throwaway” statement that we need to use in order to prove some other bigger results, but do not care so much about for its own sake.

Figure 1.3: An annotated form of Definition 1.1, marking which part is being defined and how.

1.3.3 Proofs
Mathematical proofs are the arguments we use to demonstrate that our
theorems, lemmas, and claims are indeed true. We discuss proofs in
Section 1.5 below, but the main point is that the mathematical stan-
dard of proof is very high. Unlike in some other realms, in mathe-
matics a proof is an “airtight” argument that demonstrates that the
statement is true beyond a shadow of a doubt. Some examples in this
section for mathematical proofs are given in Solved Exercise 1.1 and
Section 1.6. As mentioned in the preface, as a general rule, it is more
important you understand the definitions than the theorems, and it is
more important you understand a theorem statement than its proof.

1.4 BASIC DISCRETE MATH OBJECTS


In this section we quickly review some of the mathematical objects
(the “basic data structures” of mathematics, if you will) we use in this
book.

1.4.1 Sets
A set is an unordered collection of objects. For example, when we
write 𝑆 = {2, 4, 7}, we mean that 𝑆 denotes the set that contains the
numbers 2, 4, and 7. (We use the notation “2 ∈ 𝑆” to denote that 2 is
an element of 𝑆.) Note that the set {2, 4, 7} and {7, 4, 2} are identical,
since they contain the same elements. Also, a set either contains an
element or does not contain it – there is no notion of containing it
“twice” – and so we could even write the same set 𝑆 as {2, 2, 4, 7}
(though that would be a little weird). The cardinality of a finite set 𝑆,
denoted by |𝑆|, is the number of elements it contains. (Cardinality can
be defined for infinite sets as well; see the sources in Section 1.9.) So,
in the example above, |𝑆| = 3. A set 𝑆 is a subset of a set 𝑇 , denoted
by 𝑆 ⊆ 𝑇 , if every element of 𝑆 is also an element of 𝑇 . (We can
also describe this by saying that 𝑇 is a superset of 𝑆.) For example,
{2, 7} ⊆ {2, 4, 7}. The set that contains no elements is known as the
empty set and it is denoted by ∅. If 𝐴 is a subset of 𝐵 that is not equal
to 𝐵 we say that 𝐴 is a strict subset of 𝐵, and denote this by 𝐴 ⊊ 𝐵.
We can define sets by either listing all their elements or by writing
down a rule that they satisfy such as

EVEN = {𝑥 | 𝑥 = 2𝑦 for some non-negative integer 𝑦} .

Of course there is more than one way to write the same set, and of-
ten we will use intuitive notation listing a few examples that illustrate
the rule. For example, we can also define EVEN as

EVEN = {0, 2, 4, …} .
Note that a set can be either finite (such as the set {2, 4, 7}) or in-
finite (such as the set EVEN). Also, the elements of a set don’t have
to be numbers. We can talk about the sets such as the set {𝑎, 𝑒, 𝑖, 𝑜, 𝑢}
of all the vowels in the English language, or the set {New York, Los
Angeles, Chicago, Houston, Philadelphia, Phoenix, San Antonio,
San Diego, Dallas} of all cities in the U.S. with population more than
one million per the 2010 census. A set can even have other sets as ele-
ments, such as the set {∅, {1, 2}, {2, 3}, {1, 3}} of all even-sized subsets
of {1, 2, 3}.

Operations on sets: The union of two sets 𝑆, 𝑇 , denoted by 𝑆 ∪ 𝑇 ,


is the set that contains all elements that are either in 𝑆 or in 𝑇 . The
intersection of 𝑆 and 𝑇 , denoted by 𝑆 ∩ 𝑇 , is the set of elements that are

both in 𝑆 and in 𝑇 . The set difference of 𝑆 and 𝑇 , denoted by 𝑆 ⧵ 𝑇 (and


in some texts also by 𝑆 − 𝑇 ), is the set of elements that are in 𝑆 but not
in 𝑇 .
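
These operations have direct analogs in many programming languages; for example, here is a small illustration (ours) using Python's built-in set type:

```python
S = {2, 4, 7}
T = {4, 7, 11}

print(S | T)        # union:        {2, 4, 7, 11}
print(S & T)        # intersection: {4, 7}
print(S - T)        # difference:   {2}
print({2, 7} <= S)  # subset test:  True
```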

Tuples, lists, strings, sequences: A tuple is an ordered collection of items.


For example (1, 5, 2, 1) is a tuple with four elements (also known as
a 4-tuple or quadruple). Since order matters, this is not the same
tuple as the 4-tuple (1, 1, 5, 2) or the 3-tuple (1, 5, 2). A 2-tuple is also
known as a pair. We use the terms tuples and lists interchangeably.
A tuple where every element comes from some finite set Σ (such as
{0, 1}) is also known as a string. Analogously to sets, we denote the
length of a tuple 𝑇 by |𝑇 |. Just like sets, we can also think of infinite
analogues of tuples, such as the ordered collection (1, 4, 9, …) of all
perfect squares. Infinite ordered collections are known as sequences;
we might sometimes use the term “infinite sequence” to emphasize
this, and use “finite sequence” as a synonym for a tuple. (We can
identify a sequence (𝑎0 , 𝑎1 , 𝑎2 , …) of elements in some set 𝑆 with a
function 𝐴 ∶ ℕ → 𝑆 (where 𝑎𝑛 = 𝐴(𝑛) for every 𝑛 ∈ ℕ). Similarly,
we can identify a 𝑘-tuple (𝑎0 , … , 𝑎𝑘−1 ) of elements in 𝑆 with a function
𝐴 ∶ [𝑘] → 𝑆.)

Cartesian product: If 𝑆 and 𝑇 are sets, then their Cartesian product,


denoted by 𝑆 × 𝑇 , is the set of all ordered pairs (𝑠, 𝑡) where 𝑠 ∈ 𝑆 and
𝑡 ∈ 𝑇 . For example, if 𝑆 = {1, 2, 3} and 𝑇 = {10, 12}, then 𝑆 × 𝑇
contains the 6 elements (1, 10), (2, 10), (3, 10), (1, 12), (2, 12), (3, 12).
Similarly if 𝑆, 𝑇 , 𝑈 are sets then 𝑆 × 𝑇 × 𝑈 is the set of all ordered
triples (𝑠, 𝑡, 𝑢) where 𝑠 ∈ 𝑆, 𝑡 ∈ 𝑇 , and 𝑢 ∈ 𝑈 . More generally, for
every positive integer 𝑛 and sets 𝑆0 , … , 𝑆𝑛−1 , we denote by 𝑆0 × 𝑆1 ×
⋯ × 𝑆𝑛−1 the set of ordered 𝑛-tuples (𝑠0 , … , 𝑠𝑛−1 ) where 𝑠𝑖 ∈ 𝑆𝑖 for
every 𝑖 ∈ {0, … , 𝑛 − 1}. For every set 𝑆, we denote the set 𝑆 × 𝑆 by 𝑆 2 ,
𝑆 × 𝑆 × 𝑆 by 𝑆 3 , 𝑆 × 𝑆 × 𝑆 × 𝑆 by 𝑆 4 , and so on and so forth.
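
For finite sets the Cartesian product can be enumerated directly; for example, in Python (a small illustration using the standard itertools module):

```python
from itertools import product

S = {1, 2, 3}
T = {10, 12}

print(list(product(S, T)))              # the 6 ordered pairs making up S x T
print(len(list(product(S, repeat=3))))  # |S^3| = 27
```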

1.4.2 Special sets


There are several sets that we will use in this book time and again. The
set

ℕ = {0, 1, 2, …}

contains all natural numbers, i.e., non-negative integers. For any natural
number 𝑛 ∈ ℕ, we define the set [𝑛] as {0, … , 𝑛 − 1} = {𝑘 ∈ ℕ ∶
𝑘 < 𝑛}. (We start our indexing of both ℕ and [𝑛] from 0, while many
other texts index those sets from 1. Starting from zero or one is simply
a convention that doesn’t make much difference, as long as one is
consistent about it.)
We will also occasionally use the set ℤ = {… , −2, −1, 0, +1, +2, …} of (negative and non-negative) integers,¹ as well as the set ℝ of real numbers. (This is the set that includes not just the integers, but also fractional and irrational numbers; e.g., ℝ contains numbers such as +0.5, −𝜋, etc.) We denote by ℝ₊ the set {𝑥 ∈ ℝ ∶ 𝑥 > 0} of positive real numbers. This set is sometimes also denoted as (0, ∞).

¹ The letter Z stands for the German word “Zahlen”, which means numbers.

Strings: Another set we will use time and again is

{0, 1}^𝑛 = {(𝑥_0 , … , 𝑥_{𝑛−1}) ∶ 𝑥_0 , … , 𝑥_{𝑛−1} ∈ {0, 1}}


which is the set of all 𝑛-length binary strings for some natural number
𝑛. That is {0, 1}𝑛 is the set of all 𝑛-tuples of zeroes and ones. This is
consistent with our notation above: {0, 1}2 is the Cartesian product
{0, 1} × {0, 1}, {0, 1}3 is the product {0, 1} × {0, 1} × {0, 1} and so on.
We will write the string (𝑥0 , 𝑥1 , … , 𝑥𝑛−1 ) as simply 𝑥0 𝑥1 ⋯ 𝑥𝑛−1 . For
example,

{0, 1}3 = {000, 001, 010, 011, 100, 101, 110, 111} .
For every string 𝑥 ∈ {0, 1}𝑛 and 𝑖 ∈ [𝑛], we write 𝑥𝑖 for the 𝑖𝑡ℎ
element of 𝑥.
We will also often talk about the set of binary strings of all lengths,
which is

{0, 1}∗ = {(𝑥_0 , … , 𝑥_{𝑛−1}) ∶ 𝑛 ∈ ℕ , 𝑥_0 , … , 𝑥_{𝑛−1} ∈ {0, 1}} .

Another way to write this set is as

{0, 1}∗ = {0, 1}0 ∪ {0, 1}1 ∪ {0, 1}2 ∪ ⋯

or more concisely as

{0, 1}∗ = ∪𝑛∈ℕ {0, 1}𝑛 .


The set {0, 1}∗ includes the “string of length 0” or “the empty
string”, which we will denote by "". (In using this notation we fol-
low the convention of many programming languages. Other texts
sometimes use 𝜖 or 𝜆 to denote the empty string.)

Generalizing the star operation: For every set Σ, we define

Σ∗ = ∪𝑛∈ℕ Σ𝑛 .
For example, if Σ = {𝑎, 𝑏, 𝑐, 𝑑, … , 𝑧} then Σ∗ denotes the set of all finite
length strings over the alphabet a-z.

Concatenation: The concatenation of two strings 𝑥 ∈ Σ𝑛 and 𝑦 ∈ Σ𝑚 is


the (𝑛 + 𝑚)-length string 𝑥𝑦 obtained by writing 𝑦 after 𝑥. That is, if
𝑥 ∈ {0, 1}𝑛 and 𝑦 ∈ {0, 1}𝑚 , then 𝑥𝑦 is equal to the string 𝑧 ∈ {0, 1}𝑛+𝑚
such that for 𝑖 ∈ [𝑛], 𝑧𝑖 = 𝑥𝑖 and for 𝑖 ∈ {𝑛, … , 𝑛 + 𝑚 − 1}, 𝑧𝑖 = 𝑦𝑖−𝑛 .

1.4.3 Functions
If 𝑆 and 𝑇 are non-empty sets, a function 𝐹 mapping 𝑆 to 𝑇 , denoted
by 𝐹 ∶ 𝑆 → 𝑇 , associates with every element 𝑥 ∈ 𝑆 an element
𝐹 (𝑥) ∈ 𝑇 . The set 𝑆 is known as the domain of 𝐹 and the set 𝑇
is known as the codomain of 𝐹 . The image of a function 𝐹 is the set
{𝐹 (𝑥) | 𝑥 ∈ 𝑆} which is the subset of 𝐹 ’s codomain consisting of all
output elements that are mapped from some input. (Some texts use
range to denote the image of a function, while other texts use range
to denote the codomain of a function. Hence we will avoid using the
term “range” altogether.) As in the case of sets, we can write a func-
tion either by listing the table of all the values it gives for elements
in 𝑆 or by using a rule. For example if 𝑆 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and 𝑇 = {0, 1}, then the table below defines a function 𝐹 ∶ 𝑆 → 𝑇 .
Note that this function is the same as the function defined by the rule 𝐹(𝑥) = (𝑥 mod 2).²

² For two natural numbers 𝑥 and 𝑎, 𝑥 mod 𝑎 (shorthand for “modulo”) denotes the remainder of 𝑥 when it is divided by 𝑎. That is, it is the number 𝑟 in {0, … , 𝑎 − 1} such that 𝑥 = 𝑎𝑘 + 𝑟 for some integer 𝑘. We sometimes also use the notation 𝑥 = 𝑦 (mod 𝑎) to denote the assertion that 𝑥 mod 𝑎 is the same as 𝑦 mod 𝑎.

Table 1.1: An example of a function.

Input Output

0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1

If 𝐹 ∶ 𝑆 → 𝑇 satisfies that 𝐹 (𝑥) ≠ 𝐹 (𝑦) for all 𝑥 ≠ 𝑦 then we say


that 𝐹 is one-to-one (Definition 1.1, also known as an injective function
or simply an injection). If 𝐹 satisfies that for every 𝑦 ∈ 𝑇 there is some
𝑥 ∈ 𝑆 such that 𝐹 (𝑥) = 𝑦 then we say that 𝐹 is onto (also known as a
surjective function or simply a surjection). A function that is both one-
to-one and onto is known as a bijective function or simply a bijection.
A bijection from a set 𝑆 to itself is also known as a permutation of 𝑆. If
𝐹 ∶ 𝑆 → 𝑇 is a bijection then for every 𝑦 ∈ 𝑇 there is a unique 𝑥 ∈ 𝑆
such that 𝐹 (𝑥) = 𝑦. We denote this value 𝑥 by 𝐹 −1 (𝑦). Note that 𝐹 −1
is itself a bijection from 𝑇 to 𝑆 (can you see why?).
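
For finite sets, the one-to-one and onto conditions can be checked mechanically. Here is a small Python illustration (ours), representing a function as a dictionary from the domain to the codomain:

```python
def is_one_to_one(F, S, T):
    # F maps each element of S to an element of T; one-to-one means no two
    # elements of S share the same image.
    return len({F[x] for x in S}) == len(S)

def is_onto(F, S, T):
    # Onto means every element of T is the image of some element of S.
    return {F[x] for x in S} == set(T)

S = {0, 1, 2, 3}
T = {0, 1}
F = {x: x % 2 for x in S}      # the rule from Table 1.1, restricted to S
print(is_one_to_one(F, S, T))  # False
print(is_onto(F, S, T))        # True
```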
Giving a bijection between two sets is often a good way to show
they have the same size. In fact, the standard mathematical definition
of the notion that “𝑆 and 𝑇 have the same cardinality” is that there

exists a bijection 𝑓 ∶ 𝑆 → 𝑇 . Further, the cardinality of a set 𝑆 is


defined to be 𝑛 if there is a bijection from 𝑆 to the set {0, … , 𝑛 − 1}.
As we will see later in this book, this is a definition that generalizes to
defining the cardinality of infinite sets.

Partial functions: We will sometimes be interested in partial functions


from 𝑆 to 𝑇 . A partial function is allowed to be undefined on some
subset of 𝑆. That is, if 𝐹 is a partial function from 𝑆 to 𝑇 , then for
every 𝑠 ∈ 𝑆, either there is (as in the case of standard functions) an
element 𝐹 (𝑠) in 𝑇 , or 𝐹 (𝑠) is undefined. For example, the partial func-

tion 𝐹 (𝑥) = √𝑥 is only defined on non-negative real numbers. When
we want to distinguish between partial functions and standard (i.e.,
non-partial) functions, we will call the latter total functions. When we
say “function” without any qualifier then we mean a total function.
The notion of partial functions is a strict generalization of functions,
and so every function is a partial function, but not every partial func-
tion is a function. (That is, for every non-empty 𝑆 and 𝑇 , the set of
partial functions from 𝑆 to 𝑇 is a proper superset of the set of total
functions from 𝑆 to 𝑇 .) When we want to emphasize that a function
𝑓 from 𝐴 to 𝐵 might not be total, we will write 𝑓 ∶ 𝐴 →𝑝 𝐵. We can
think of a partial function 𝐹 from 𝑆 to 𝑇 also as a total function from
𝑆 to 𝑇 ∪ {⊥} where ⊥ is a special “failure symbol”. So, instead of
saying that 𝐹 is undefined at 𝑥, we can say that 𝐹 (𝑥) = ⊥.
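
A small Python illustration (ours): we can model a partial function by returning a special value (here None, playing the role of the failure symbol ⊥) on inputs where the function is undefined:

```python
import math

def sqrt_partial(x):
    # A partial function from the reals to the reals:
    # undefined (None, i.e. the failure symbol) on negative inputs.
    if x < 0:
        return None
    return math.sqrt(x)

print(sqrt_partial(9.0))   # 3.0
print(sqrt_partial(-1.0))  # None, i.e. "undefined"
```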

Basic facts about functions:Verifying that you can prove the following
results is an excellent way to brush up on functions:

• If 𝐹 ∶ 𝑆 → 𝑇 and 𝐺 ∶ 𝑇 → 𝑈 are one-to-one functions, then their


composition 𝐻 ∶ 𝑆 → 𝑈 defined as 𝐻(𝑠) = 𝐺(𝐹 (𝑠)) is also one to
one.

• If 𝐹 ∶ 𝑆 → 𝑇 is one to one, then there exists an onto function


𝐺 ∶ 𝑇 → 𝑆 such that 𝐺(𝐹 (𝑠)) = 𝑠 for every 𝑠 ∈ 𝑆.

• If 𝐺 ∶ 𝑇 → 𝑆 is onto then there exists a one-to-one function 𝐹 ∶ 𝑆 →


𝑇 such that 𝐺(𝐹 (𝑠)) = 𝑠 for every 𝑠 ∈ 𝑆.

• If 𝑆 and 𝑇 are non-empty finite sets then the following conditions


are equivalent to one another: (a) |𝑆| ≤ |𝑇 |, (b) there is a one-to-one function 𝐹 ∶ 𝑆 → 𝑇 , and (c) there is an onto function 𝐺 ∶ 𝑇 → 𝑆. These equivalences are in fact true even for infinite 𝑆 and 𝑇 . For infinite sets the condition (b) (or equivalently, (c)) is the commonly accepted definition for |𝑆| ≤ |𝑇 |.

Figure 1.4: We can represent finite functions as a directed graph where we put an edge from 𝑥 to 𝑓(𝑥). The onto condition corresponds to requiring that every vertex in the codomain of the function has in-degree at least one. The one-to-one condition corresponds to requiring that every vertex in the codomain of the function has in-degree at most one. In the examples above 𝐹 is an onto function, 𝐺 is one to one, and 𝐻 is neither onto nor one to one.

P

You can find the proofs of these results in many dis-


crete math texts, including for example, Section 4.5
in the Lehman-Leighton-Meyer notes. However, I
strongly suggest you try to prove them on your own,
or at least convince yourself that they are true by
proving special cases of those for small sizes (e.g.,
|𝑆| = 3, |𝑇 | = 4, |𝑈 | = 5).

Let us prove one of these facts as an example:


Lemma 1.2 If 𝑆, 𝑇 are non-empty sets and 𝐹 ∶ 𝑆 → 𝑇 is one to one, then
there exists an onto function 𝐺 ∶ 𝑇 → 𝑆 such that 𝐺(𝐹 (𝑠)) = 𝑠 for
every 𝑠 ∈ 𝑆.

Proof. Choose some 𝑠0 ∈ 𝑆. We will define the function 𝐺 ∶ 𝑇 → 𝑆 as


follows: for every 𝑡 ∈ 𝑇 , if there is some 𝑠 ∈ 𝑆 such that 𝐹 (𝑠) = 𝑡 then
set 𝐺(𝑡) = 𝑠 (the choice of 𝑠 is well defined since by the one-to-one
property of 𝐹 , there cannot be two distinct 𝑠, 𝑠′ that both map to 𝑡).
Otherwise, set 𝐺(𝑡) = 𝑠0 . Now for every 𝑠 ∈ 𝑆, by the definition of 𝐺,
if 𝑡 = 𝐹 (𝑠) then 𝐺(𝑡) = 𝐺(𝐹 (𝑠)) = 𝑠. Moreover, this also shows that
𝐺 is onto, since it means that for every 𝑠 ∈ 𝑆 there is some 𝑡, namely
𝑡 = 𝐹 (𝑠), such that 𝐺(𝑡) = 𝑠.
■

1.4.4 Graphs
Graphs are ubiquitous in Computer Science, and many other fields as well. They are used to model a variety of data types including social networks, scheduling constraints, road networks, deep neural nets, gene interactions, correlations between observations, and a great many more. Formal definitions of several kinds of graphs are given next, but if you have not seen graphs before in a course, I urge you to read up on them in one of the sources mentioned in Section 1.9.
Graphs come in two basic flavors: undirected and directed.³

³ It is possible, and sometimes useful, to think of an undirected graph as the special case of a directed graph that has the special property that for every pair 𝑢, 𝑣 either both the edges (𝑢, 𝑣) and (𝑣, 𝑢) are present or neither of them is. However, in many settings there is a significant difference between undirected and directed graphs, and so it's typically best to think of them as separate categories.

Definition 1.3 — Undirected graphs. An undirected graph 𝐺 = (𝑉 , 𝐸) con-


sists of a set 𝑉 of vertices and a set 𝐸 of edges. Every edge is a size
two subset of 𝑉 . We say that two vertices 𝑢, 𝑣 ∈ 𝑉 are neighbors, if
the edge {𝑢, 𝑣} is in 𝐸.

Figure 1.5: An example of an undirected and a directed graph. The undirected graph has vertex set {1, 2, 3, 4} and edge set {{1, 2}, {2, 3}, {2, 4}}. The directed graph has vertex set {𝑎, 𝑏, 𝑐} and the edge set {(𝑎, 𝑏), (𝑏, 𝑐), (𝑐, 𝑎), (𝑎, 𝑐)}.

Given this definition, we can define several other properties of graphs and their vertices. We define the degree of 𝑢 to be the number of neighbors 𝑢 has. A path in the graph is a tuple (𝑢0 , … , 𝑢𝑘 ) ∈ 𝑉^{𝑘+1} , for some 𝑘 > 0 such that 𝑢𝑖+1 is a neighbor of 𝑢𝑖 for every 𝑖 ∈ [𝑘]. A simple path is a path (𝑢0 , … , 𝑢𝑘 ) where all the 𝑢𝑖 ’s are distinct. A cycle is a path (𝑢0 , … , 𝑢𝑘 ) where 𝑢0 = 𝑢𝑘 . We say that two vertices 𝑢, 𝑣 ∈ 𝑉 are connected if either 𝑢 = 𝑣 or there is a path (𝑢0 , … , 𝑢𝑘 ) where

𝑢0 = 𝑢 and 𝑢𝑘 = 𝑣. We say that the graph 𝐺 is connected if every pair of


vertices in it is connected.
Here are some basic facts about undirected graphs. We give some
informal arguments below, but leave the full proofs as exercises (the
proofs can be found in many of the resources listed in Section 1.9).
Lemma 1.4 In any undirected graph 𝐺 = (𝑉 , 𝐸), the sum of the degrees
of all vertices is equal to twice the number of edges.
Lemma 1.4 can be shown by seeing that every edge {𝑢, 𝑣} con-
tributes twice to the sum of the degrees (once for 𝑢 and the second
time for 𝑣).
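As a quick sanity check of Lemma 1.4 (not a proof, of course), the following small Python sketch of ours computes the degrees in the undirected graph of Fig. 1.5 and compares their sum with twice the number of edges.

V = {1, 2, 3, 4}
E = [frozenset({1, 2}), frozenset({2, 3}), frozenset({2, 4})]

def degree(u):
    return sum(1 for e in E if u in e)

assert sum(degree(u) for u in V) == 2 * len(E)   # 1 + 3 + 1 + 1 == 6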
Lemma 1.5 The connectivity relation is transitive, in the sense that if 𝑢 is
connected to 𝑣, and 𝑣 is connected to 𝑤, then 𝑢 is connected to 𝑤.
Lemma 1.5 can be shown by simply attaching a path of the form
(𝑢, 𝑢1 , 𝑢2 , … , 𝑢𝑘−1 , 𝑣) to a path of the form (𝑣, 𝑢′1 , … , 𝑢′𝑘′ −1 , 𝑤) to obtain
the path (𝑢, 𝑢1 , … , 𝑢𝑘−1 , 𝑣, 𝑢′1 , … , 𝑢′𝑘′ −1 , 𝑤) that connects 𝑢 to 𝑤.
Lemma 1.6 For every undirected graph 𝐺 = (𝑉 , 𝐸) and connected pair
𝑢, 𝑣, the shortest path from 𝑢 to 𝑣 is simple. In particular, for every
connected pair there exists a simple path that connects them.
Lemma 1.6 can be shown by “shortcutting” any non-simple path from 𝑢 to 𝑣 where the same vertex 𝑤 appears twice to remove it (see Fig. 1.6). It is a good exercise to transform this intuitive reasoning into a formal proof:

Figure 1.6: If there is a path from 𝑢 to 𝑣 in a graph


that passes twice through a vertex 𝑤 then we can
“shortcut” it by removing the loop from 𝑤 to itself to
find a path from 𝑢 to 𝑣 that only passes once through
𝑤.

Solved Exercise 1.1 — Connected vertices have simple paths. Prove Lemma 1.6

Solution:
The proof follows the idea illustrated in Fig. 1.6. One complica-
tion is that there can be more than one vertex that is visited twice

by a path, and so “shortcutting” might not necessarily result in a


simple path; we deal with this by looking at a shortest path between
𝑢 and 𝑣. Details follow.
Let 𝐺 = (𝑉 , 𝐸) be a graph and 𝑢 and 𝑣 in 𝑉 be two connected
vertices in 𝐺. We will prove that there is a simple path between 𝑢
and 𝑣. Let 𝑘 be the shortest length of a path between 𝑢 and 𝑣 and
let 𝑃 = (𝑢0 , 𝑢1 , 𝑢2 , … , 𝑢𝑘−1 , 𝑢𝑘 ) be a 𝑘-length path from 𝑢 to 𝑣
(there can be more than one such path: if so we just choose one of
them). (That is 𝑢0 = 𝑢, 𝑢𝑘 = 𝑣, and (𝑢ℓ , 𝑢ℓ+1 ) ∈ 𝐸 for all ℓ ∈ [𝑘].)
We claim that 𝑃 is simple. Indeed, suppose otherwise that there is
some vertex 𝑤 that occurs twice in the path: 𝑤 = 𝑢𝑖 and 𝑤 = 𝑢𝑗 for
some 𝑖 < 𝑗. Then we can “shortcut” the path 𝑃 by considering the
path 𝑃 ′ = (𝑢0 , 𝑢1 , … , 𝑢𝑖−1 , 𝑤, 𝑢𝑗+1 , … , 𝑢𝑘 ) obtained by taking the
first 𝑖 vertices of 𝑃 (from 𝑢0 = 𝑢 to the first occurrence of 𝑤) and
the last 𝑘 − 𝑗 ones (from the vertex 𝑢𝑗+1 following the second oc-
currence of 𝑤 to 𝑢𝑘 = 𝑣). The path 𝑃 ′ is a valid path between 𝑢 and
𝑣 since every consecutive pair of vertices in it is connected by an
edge (in particular, since 𝑤 = 𝑢𝑖 = 𝑢𝑗 , both (𝑢𝑖−1 , 𝑤) and (𝑤, 𝑢𝑗+1 )
are edges in 𝐸), but since the length of 𝑃 ′ is 𝑘 − (𝑗 − 𝑖) < 𝑘, this
contradicts the minimality of 𝑃 .
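The “shortcutting” step in the solution is itself a small algorithm. The following Python sketch (our own illustration) removes one repeated vertex from a path; applying it repeatedly until no vertex repeats yields a simple path.

def shortcut_once(path):
    """If some vertex appears twice in path, remove the loop between
    its first and last occurrence; otherwise return path unchanged."""
    for i, w in enumerate(path):
        j = len(path) - 1 - path[::-1].index(w)   # last occurrence of w
        if j > i:
            return path[:i + 1] + path[j + 1:]
    return path

print(shortcut_once(["u", "a", "w", "b", "w", "c", "v"]))
# ['u', 'a', 'w', 'c', 'v']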

R
Remark 1.7 — Finding proofs. Solved Exercise 1.1 is a
good example of the process of finding a proof. You
start by ensuring you understand what the statement
means, and then come up with an informal argument
why it should be true. You then transform the infor-
mal argument into a rigorous proof. This proof need
not be very long or overly formal, but should clearly
establish why the conclusion of the statement follows
from its assumptions.

The concepts of degrees and connectivity extend naturally to di-


rected graphs, defined as follows.

Definition 1.8 — Directed graphs. A directed graph 𝐺 = (𝑉 , 𝐸) consists
of a set 𝑉 and a set 𝐸 ⊆ 𝑉 × 𝑉 of ordered pairs of 𝑉 . We sometimes
denote the edge (𝑢, 𝑣) also as 𝑢 → 𝑣. If the edge 𝑢 → 𝑣 is present
in the graph then we say that 𝑣 is an out-neighbor of 𝑢 and 𝑢 is an
in-neighbor of 𝑣.

A directed graph might contain both 𝑢 → 𝑣 and 𝑣 → 𝑢 in which


case 𝑢 will be both an in-neighbor and an out-neighbor of 𝑣 and vice
versa. The in-degree of 𝑢 is the number of in-neighbors it has, and the

out-degree of 𝑣 is the number of out-neighbors it has. A path in the


graph is a tuple (𝑢0 , … , 𝑢𝑘 ) ∈ 𝑉 𝑘+1 , for some 𝑘 > 0 such that 𝑢𝑖+1 is an
out-neighbor of 𝑢𝑖 for every 𝑖 ∈ [𝑘]. As in the undirected case, a simple
path is a path (𝑢0 , … , 𝑢𝑘−1 ) where all the 𝑢𝑖 ’s are distinct and a cycle
is a path (𝑢0 , … , 𝑢𝑘 ) where 𝑢0 = 𝑢𝑘 . One type of directed graphs we
often care about is directed acyclic graphs or DAGs, which, as their name
implies, are directed graphs without any cycles:

Definition 1.9 — Directed Acyclic Graphs. We say that 𝐺 = (𝑉 , 𝐸) is a


directed acyclic graph (DAG) if it is a directed graph and there does
not exist a list of vertices 𝑢0 , 𝑢1 , … , 𝑢𝑘 ∈ 𝑉 such that 𝑢0 = 𝑢𝑘 and
for every 𝑖 ∈ [𝑘], the edge 𝑢𝑖 → 𝑢𝑖+1 is in 𝐸.

The lemmas we mentioned above have analogs for directed graphs.


We again leave the proofs (which are essentially identical to their
undirected analogs) as exercises.
Lemma 1.10 In any directed graph 𝐺 = (𝑉 , 𝐸), the sum of the in-
degrees is equal to the sum of the out-degrees, which is equal to the
number of edges.
Lemma 1.11 In any directed graph 𝐺, if there is a path from 𝑢 to 𝑣 and a
path from 𝑣 to 𝑤, then there is a path from 𝑢 to 𝑤.
Lemma 1.12 For every directed graph 𝐺 = (𝑉 , 𝐸) and a pair 𝑢, 𝑣 such
that there is a path from 𝑢 to 𝑣, the shortest path from 𝑢 to 𝑣 is simple.

R
Remark 1.13 — Labeled graphs. For some applications
we will consider labeled graphs, where the vertices or
edges have associated labels (which can be numbers,
strings, or members of some other set). We can think
of such a graph as having an associated (possibly
partial) labelling function 𝐿 ∶ 𝑉 ∪ 𝐸 → ℒ, where ℒ is
the set of potential labels. However we will typically
not refer explicitly to this labeling function and simply
say things such as “vertex 𝑣 has the label 𝛼”.

1.4.5 Logic operators and quantifiers


If 𝑃 and 𝑄 are some statements that can be true or false, then 𝑃 AND
𝑄 (denoted as 𝑃 ∧ 𝑄) is a statement that is true if and only if both 𝑃
and 𝑄 are true, and 𝑃 OR 𝑄 (denoted as 𝑃 ∨ 𝑄) is a statement that is
true if and only if either 𝑃 or 𝑄 is true. The negation of 𝑃 , denoted as ¬𝑃 (or as 𝑃 with a bar on top of it), is true if and only if 𝑃 is false.
Suppose that 𝑃 (𝑥) is a statement that depends on some parameter 𝑥
(also sometimes known as an unbound variable) in the sense that for
every instantiation of 𝑥 with a value from some set 𝑆, 𝑃 (𝑥) is either

true or false. For example, 𝑥 > 7 is a statement that is not a priori


true or false, but becomes true or false whenever we instantiate 𝑥 with
some real number. We denote by ∀𝑥∈𝑆 𝑃 (𝑥) the statement that is true if and only if 𝑃 (𝑥) is true for every 𝑥 ∈ 𝑆. (In this book, we place the variable bound by a quantifier in a subscript and so write ∀𝑥∈𝑆 𝑃 (𝑥); many other texts do not use this subscript notation and instead write the same statement as ∀𝑥 ∈ 𝑆, 𝑃 (𝑥).) We denote by ∃𝑥∈𝑆 𝑃 (𝑥) the statement that is true if and only if there exists some 𝑥 ∈ 𝑆 such that 𝑃 (𝑥) is true.
For example, the following is a formalization of the true statement
that there exists a natural number 𝑛 larger than 100 that is not divisi-
ble by 3:

∃𝑛∈ℕ (𝑛 > 100) ∧ (∀𝑘∈ℕ 𝑘 + 𝑘 + 𝑘 ≠ 𝑛) .

“For sufficiently large 𝑛.” One expression that we will see come up time and again in this book is the claim that some statement 𝑃 (𝑛) is
true “for sufficiently large 𝑛”. What this means is that there exists an
integer 𝑁0 such that 𝑃 (𝑛) is true for every 𝑛 > 𝑁0 . We can formalize
this as ∃𝑁0 ∈ℕ ∀𝑛>𝑁0 𝑃 (𝑛).
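When a quantifier ranges over a finite set (or when we only want to check a statement up to some bound), it corresponds directly to Python’s all and any. The following sketch of ours illustrates both quantified statements above; note that it checks them only over bounded ranges, not over all of ℕ.

# "There exists n (searched here in 101..1000) with n > 100 that is not
# divisible by 3", mirroring the formula above.
exists_n = any(n > 100 and all(k + k + k != n for k in range(n + 1))
               for n in range(101, 1001))
assert exists_n            # e.g. n = 101 works

# "P(n) holds for sufficiently large n": there is N0 such that P(n) holds
# for every n > N0 (checked here only up to a bound).
P = lambda n: 2**n > 100 * n**2
N0 = 20
assert all(P(n) for n in range(N0 + 1, 2000))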

1.4.6 Quantifiers for summations and products


The following shorthands for summing up or taking products of sev-
eral numbers are often convenient. If 𝑆 = {𝑠0 , … , 𝑠𝑛−1 } is a finite set
and 𝑓 ∶ 𝑆 → ℝ is a function, then we write ∑𝑥∈𝑆 𝑓(𝑥) as shorthand for

𝑓(𝑠0 ) + 𝑓(𝑠1 ) + 𝑓(𝑠2 ) + … + 𝑓(𝑠𝑛−1 ) ,

and ∏𝑥∈𝑆 𝑓(𝑥) as shorthand for

𝑓(𝑠0 ) ⋅ 𝑓(𝑠1 ) ⋅ 𝑓(𝑠2 ) ⋅ … ⋅ 𝑓(𝑠𝑛−1 ) .

For example, the sum of the squares of all numbers from 1 to 100
can be written as

∑_{𝑖∈{1,…,100}} 𝑖² .   (1.1)

Since summing up over intervals of integers is so common, there is a special notation for it. For every two integers 𝑎 ≤ 𝑏, ∑_{𝑖=𝑎}^{𝑏} 𝑓(𝑖) denotes ∑_{𝑖∈𝑆} 𝑓(𝑖) where 𝑆 = {𝑥 ∈ ℤ ∶ 𝑎 ≤ 𝑥 ≤ 𝑏}. Hence, we can write the sum (1.1) as

∑_{𝑖=1}^{100} 𝑖² .
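In Python these shorthands correspond to sum and math.prod over an iterable, as in the following sketch of ours, which also checks the sum of squares above against the closed-form formula 𝑛(𝑛 + 1)(2𝑛 + 1)/6.

import math

S = range(1, 101)                     # the set {1, ..., 100}
f = lambda x: x**2

total = sum(f(x) for x in S)          # corresponds to the sum over x in S
product = math.prod(f(x) for x in S)  # corresponds to the product over x in S

assert total == 100 * 101 * 201 // 6           # = 338350
assert product == math.factorial(100) ** 2     # product of squares is (100!)^2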

1.4.7 Parsing formulas: bound and free variables


In mathematics, as in coding, we often have symbolic “variables” or
“parameters”. It is important to be able to understand, given some
formula, whether a given variable is bound or free in this formula. For

example, in the following statement 𝑛 is free but 𝑎 and 𝑏 are bound by


the ∃ quantifier:

∃𝑎,𝑏∈ℕ (𝑎 ≠ 1) ∧ (𝑎 ≠ 𝑛) ∧ (𝑛 = 𝑎 × 𝑏) (1.2)
Since 𝑛 is free, it can be set to any value, and the truth of the state-
ment (1.2) depends on the value of 𝑛. For example, if 𝑛 = 8 then (1.2)
is true, but for 𝑛 = 11 it is false. (Can you see why?)
The same issue appears when parsing code. For example, in the
following snippet from the C programming language

for (int i=0 ; i<n ; i=i+1) {


printf("*");
}

the variable i is bound within the for block but the variable n is
free.
The main property of bound variables is that we can rename them
(as long as the new name doesn’t conflict with another used variable)
without changing the meaning of the statement. Thus for example the
statement

∃𝑥,𝑦∈ℕ (𝑥 ≠ 1) ∧ (𝑥 ≠ 𝑛) ∧ (𝑛 = 𝑥 × 𝑦) (1.3)
is equivalent to (1.2) in the sense that it is true for exactly the same
set of 𝑛’s.
Similarly, the code

for (int j=0 ; j<n ; j=j+1) {


printf("*");
}

produces the same result as the code above that used i instead of j.

R
Remark 1.14 — Aside: mathematical vs programming no-
tation. Mathematical notation has a lot of similarities
with programming language, and for the same rea-
sons. Both are formalisms meant to convey complex
concepts in a precise way. However, there are some
cultural differences. In programming languages, we
often try to use meaningful variable names such as
NumberOfVertices while in math we often use short
identifiers such as 𝑛. Part of it might have to do with
the tradition of mathematical proofs as being hand-
written and verbally presented, as opposed to typed
up and compiled. Another reason is if the wrong
variable name is used in a proof, at worst it causes
confusion to readers; when the wrong variable name

is used in a program, planes might crash, patients


might die, and rockets could explode.
One consequence of that is that in mathematics we
often end up reusing identifiers, and also “run out”
of letters and hence use Greek letters too, as well as
distinguish between small and capital letters and
different font faces. Similarly, mathematical notation
tends to use quite a lot of “overloading”, using oper-
ators such as + for a great variety of objects (e.g., real
numbers, matrices, finite field elements, etc..), and
assuming that the meaning can be inferred from the
context.
Both fields have a notion of “types”, and in math
we often try to reserve certain letters for variables
of a particular type. For example, variables such as
𝑖, 𝑗, 𝑘, ℓ, 𝑚, 𝑛 will often denote integers, and 𝜖 will
often denote a small positive real number (see Sec-
tion 1.7 for more on these conventions). When reading
or writing mathematical texts, we usually don’t have
the advantage of a “compiler” that will check type
safety for us. Hence it is important to keep track of the
type of each variable, and see that the operations that
are performed on it “make sense”.
Kun’s book [Kun18] contains an extensive discus-
sion on the similarities and differences between the
cultures of mathematics and programming.

1.4.8 Asymptotics and Big-𝑂 notation


“log log log 𝑛 has been proved to go to infinity, but has never been observed to
do so.”, Anonymous, quoted by Carl Pomerance (2000)

It is often very cumbersome to describe precisely quantities such


as running time, and doing so is also not needed, since we are typically mostly
interested in the “higher order terms”. That is, we want to understand
the scaling behavior of the quantity as the input variable grows. For
example, as far as running time goes, the difference between an 𝑛5 -
time algorithm and an 𝑛2 -time one is much more significant than the
difference between a 100𝑛2 + 10𝑛 time algorithm and a 10𝑛2 time
algorithm. For this purpose, 𝑂-notation is extremely useful as a way
to “declutter” our text and focus our attention on what really matters.
For example, using 𝑂-notation, we can say that both 100𝑛2 + 10𝑛
and 10𝑛2 are simply Θ(𝑛2 ) (which informally means “the same up to
constant factors”), while 𝑛2 = 𝑜(𝑛5 ) (which informally means that 𝑛2
is “much smaller than” 𝑛5 ).
Generally (though still informally), if 𝐹 , 𝐺 are two functions map-
ping natural numbers to non-negative reals, then “𝐹 = 𝑂(𝐺)” means
that 𝐹 (𝑛) ≤ 𝐺(𝑛) if we don’t care about constant factors, while
“𝐹 = 𝑜(𝐺)” means that 𝐹 is much smaller than 𝐺, in the sense that no
matter by what constant factor we multiply 𝐹 , if we take 𝑛 to be large

enough then 𝐺 will be bigger (for this reason, sometimes 𝐹 = 𝑜(𝐺)


is written as 𝐹 ≪ 𝐺). We will write 𝐹 = Θ(𝐺) if 𝐹 = 𝑂(𝐺) and
𝐺 = 𝑂(𝐹 ), which one can think of as saying that 𝐹 is the same as 𝐺 if
we don’t care about constant factors. More formally, we define Big-𝑂
notation as follows:

Definition 1.15 — Big-𝑂 notation. Let ℝ+ = {𝑥 ∈ ℝ | 𝑥 > 0} be the set
of positive real numbers. For two functions 𝐹 , 𝐺 ∶ ℕ → ℝ+ , we say
that 𝐹 = 𝑂(𝐺) if there exist numbers 𝑎, 𝑁0 ∈ ℕ such that 𝐹 (𝑛) ≤
𝑎 ⋅ 𝐺(𝑛) for every 𝑛 > 𝑁0 . We say that 𝐹 = Θ(𝐺) if 𝐹 = 𝑂(𝐺) and
𝐺 = 𝑂(𝐹 ). We say that 𝐹 = Ω(𝐺) if 𝐺 = 𝑂(𝐹 ).
We say that 𝐹 = 𝑜(𝐺) if for every 𝜖 > 0 there is some 𝑁0 such
that 𝐹 (𝑛) < 𝜖𝐺(𝑛) for every 𝑛 > 𝑁0 . We say that 𝐹 = 𝜔(𝐺) if
𝐺 = 𝑜(𝐹 ).

It’s often convenient to use “anonymous functions” in the context of


𝑂-notation. For example, when we write a statement such as 𝐹 (𝑛) =
𝑂(𝑛3 ), we mean that 𝐹 = 𝑂(𝐺) where 𝐺 is the function defined by
𝐺(𝑛) = 𝑛³ . Chapter 7 in Jim Aspnes’ notes on discrete math provides
a good summary of 𝑂 notation; see also this tutorial for a gentler and
more programmer-oriented introduction.
𝑂 is not equality. Using the equality sign for 𝑂-notation is extremely common, but is somewhat of a misnomer, since a statement such as 𝐹 = 𝑂(𝐺) really means that 𝐹 is in the set {𝐺′ ∶ ∃𝑁,𝑐 s.t. ∀𝑛>𝑁 𝐺′ (𝑛) ≤ 𝑐𝐺(𝑛)}. If anything, it makes more sense to use inequalities and write 𝐹 ≤ 𝑂(𝐺) and 𝐹 ≥ Ω(𝐺), reserving equality for 𝐹 = Θ(𝐺), and so we will sometimes use this notation too, but since the equality notation is quite firmly entrenched we often stick to it as well. (Some texts write 𝐹 ∈ 𝑂(𝐺) instead of 𝐹 = 𝑂(𝐺), but we will not use this notation.) Despite the misleading equality sign, you should remember that a statement such as 𝐹 = 𝑂(𝐺) means that 𝐹 is “at most” 𝐺 in some rough sense when we ignore constants, and a statement such as 𝐹 = Ω(𝐺) means that 𝐹 is “at least” 𝐺 in the same rough sense.

Figure 1.7: If 𝐹 (𝑛) = 𝑜(𝐺(𝑛)) then for sufficiently large 𝑛, 𝐹 (𝑛) will be smaller than 𝐺(𝑛). For example, if Algorithm 𝐴 runs in time 1000 ⋅ 𝑛 + 10⁶ and Algorithm 𝐵 runs in time 0.01 ⋅ 𝑛² then even though 𝐵 might be more efficient for smaller inputs, when the inputs get sufficiently large, 𝐴 will run much faster than 𝐵.
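The definition can also be explored numerically: to show 𝐹 = 𝑂(𝐺) we exhibit explicit witnesses 𝑎 and 𝑁0, and for 𝐹 = 𝑜(𝐺) we can watch the ratio 𝐹 (𝑛)/𝐺(𝑛) tend to zero. Here is a small Python sketch of ours (it checks only finitely many 𝑛, so it is evidence rather than a proof):

F = lambda n: 100 * n**2 + 10 * n
G = lambda n: 10 * n**2

# Witnesses for F = O(G) (and G = O(F) is immediate), hence F = Theta(G):
a, N0 = 11, 10
assert all(F(n) <= a * G(n) for n in range(N0 + 1, 10**5))

# For n^2 = o(n^5) the ratio tends to zero:
for n in [10, 100, 1000]:
    print(n, n**2 / n**5)          # 0.001, 1e-06, 1e-09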

1.4.9 Some “rules of thumb” for Big-𝑂 notation


There are some simple heuristics that can help when trying to com-
pare two functions 𝐹 and 𝐺:

• Multiplicative constants don’t matter in 𝑂-notation, and so if


𝐹 (𝑛) = 𝑂(𝐺(𝑛)) then 100𝐹 (𝑛) = 𝑂(𝐺(𝑛)).

• When adding two functions, we only care about the larger one. For
example, for the purpose of 𝑂-notation, 𝑛3 + 100𝑛2 is the same as
𝑛3 , and in general in any polynomial, we only care about the larger
exponent.

• For every two constants 𝑎, 𝑏 > 0, 𝑛^𝑎 = 𝑂(𝑛^𝑏 ) if and only if 𝑎 ≤ 𝑏, and 𝑛^𝑎 = 𝑜(𝑛^𝑏 ) if and only if 𝑎 < 𝑏. For example, combining the two observations above, 100𝑛² + 10𝑛 + 100 = 𝑜(𝑛³ ).

• Polynomial is always smaller than exponential: 𝑛^𝑎 = 𝑜(2^{𝑛^𝜖} ) for every two constants 𝑎 > 0 and 𝜖 > 0, even if 𝜖 is much smaller than 𝑎. For example, 100𝑛^100 = 𝑜(2^{√𝑛} ).

• Similarly, logarithmic is always smaller than polynomial: (log 𝑛)^𝑎 (which we write as log^𝑎 𝑛) is 𝑜(𝑛^𝜖 ) for every two constants 𝑎, 𝜖 > 0. For example, combining the observations above, 100𝑛² log^100 𝑛 = 𝑜(𝑛³ ).
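The last two rules are easy to state but hard to “see” by plugging in small numbers, because the crossover points are enormous. Comparing base-2 logarithms avoids computing astronomically large values; the following Python sketch of ours illustrates both rules.

import math

# 100*n**100 = o(2**(n**0.5)): compare lg of both sides as a function of lg(n).
lg_poly = lambda lg_n: math.log2(100) + 100 * lg_n   # lg(100 * n^100)
lg_exp  = lambda lg_n: 2 ** (lg_n / 2)               # lg(2^sqrt(n)) = sqrt(n)

print(lg_poly(20), lg_exp(20))   # n = 2**20: ~2006.6 vs 1024 (polynomial ahead)
print(lg_poly(24), lg_exp(24))   # n = 2**24: ~2406.6 vs 4096 (exponential wins)

# 100 * n**2 * (log n)**100 = o(n**3): take n = 2**1200, so lg(n) = 1200.
lg_n = 1200
lg_lhs = math.log2(100) + 2 * lg_n + 100 * math.log2(lg_n)
lg_rhs = 3 * lg_n
print(lg_lhs < lg_rhs)           # True (about 3430 vs 3600)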

R
Remark 1.16 — Big 𝑂 for other applications (optional).
While Big-𝑂 notation is often used to analyze running
time of algorithms, this is by no means the only ap-
plication. We can use 𝑂 notation to bound asymptotic
relations between any functions mapping integers
to positive numbers. It can be used regardless of
whether these functions are a measure of running
time, memory usage, or any other quantity that may
have nothing to do with computation. Here is one
example which is unrelated to this book (and hence
one that you can feel free to skip): one way to state the
Riemann Hypothesis (one of the most famous open
questions in mathematics) is that it corresponds to
the conjecture that the number of primes between 0 and 𝑛 is equal to ∫_{2}^{𝑛} (1/ln 𝑥) 𝑑𝑥 up to an additive error of magnitude at most 𝑂(√𝑛 log 𝑛).

1.5 PROOFS
Many people think of mathematical proofs as a sequence of logical
deductions that starts from some axioms and ultimately arrives at
a conclusion. In fact, some dictionaries define proofs that way. This
is not entirely wrong, but at its essence, a mathematical proof of a
statement X is simply an argument that convinces the reader that X is
true beyond a shadow of a doubt.
To produce such a proof you need to:

1. Understand precisely what X means.

2. Convince yourself that X is true.

3. Write your reasoning down in plain, precise and concise English


(using formulas or notation only when they help clarity).

In many cases, the first part is the most important one. Understand-
ing what a statement means is oftentimes more than halfway towards
understanding why it is true. In the third part, to convince the reader
beyond a shadow of a doubt, we will often want to break down the
reasoning to “basic steps”, where each basic step is simple enough
to be “self-evident”. The combination of all steps yields the desired
statement.

1.5.1 Proofs and programs


There is a great deal of similarity between the process of writing proofs
and that of writing programs, and both require a similar set of skills.
Writing a program involves:

1. Understanding what is the task we want the program to achieve.

2. Convincing yourself that the task can be achieved by a computer,


perhaps by planning on a whiteboard or notepad how you will
break it up into simpler tasks.

3. Converting this plan into code that a compiler or interpreter can


understand, by breaking up each task into a sequence of the basic
operations of some programming language.

In programs as in proofs, step 1 is often the most important one.


A key difference is that the reader for proofs is a human being and
the reader for programs is a computer. (This difference is eroding
with time as more proofs are being written in a machine verifiable form;
moreover, to ensure correctness and maintainability of programs, it
is important that they can be read and understood by humans.) Thus
our emphasis is on readability and having a clear logical flow for our
proof (which is not a bad idea for programs as well). When writing a
proof, you should think of your audience as an intelligent but highly
skeptical and somewhat petty reader, that will “call foul” at every step
that is not well justified.

1.5.2 Proof writing style


A mathematical proof is a piece of writing, but it is a specific genre
of writing with certain conventions and preferred styles. As in any
writing, practice makes perfect, and it is also important to revise your
drafts for clarity.
In a proof for the statement 𝑋, all the text between the words
“Proof:” and “QED” should be focused on establishing that 𝑋 is true.
Digressions, examples, or ruminations should be kept outside these
two words, so they do not confuse the reader. The proof should have
a clear logical flow in the sense that every sentence or equation in it
should have some purpose and it should be crystal-clear to the reader

what this purpose is. When you write a proof, for every equation or
sentence you include, ask yourself:

1. Is this sentence or equation stating that some statement is true?

2. If so, does this statement follow from the previous steps, or are we
going to establish it in the next step?

3. What is the role of this sentence or equation? Is it one step towards


proving the original statement, or is it a step towards proving some
intermediate claim that you have stated before?

4. Finally, would the answers to questions 1-3 be clear to the reader?


If not, then you should reorder, rephrase, or add explanations.

Some helpful resources on mathematical writing include this hand-


out by Lee, this handout by Hutching, as well as several of the excel-
lent handouts in Stanford’s CS 103 class.

1.5.3 Patterns in proofs


“If it was so, it might be; and if it were so, it would be; but as it isn’t, it ain’t.
That’s logic.”, Lewis Carroll, Through the looking-glass.

Just like in programming, there are several common patterns of


proofs that occur time and again. Here are some examples:

Proofs by contradiction: One way to prove that 𝑋 is true is to show


that if 𝑋 was false it would result in a contradiction. Such proofs
often start with a sentence such as “Suppose, towards a contradiction,
that 𝑋 is false” and end with deriving some contradiction (such as a
violation of one of the assumptions in the theorem statement). Here is
an example:

Lemma 1.17 There are no natural numbers 𝑎, 𝑏 such that √2 = 𝑎/𝑏.

Proof. Suppose, towards a contradiction that this is false, and so let


𝑎 ∈ ℕ be the smallest number such that there exists some 𝑏 ∈ ℕ

satisfying √2 = 𝑎/𝑏. Squaring this equation we get that 2 = 𝑎²/𝑏² or 𝑎² = 2𝑏² (∗). But this means that 𝑎² is even, and since the product of two odd numbers is odd, it means that 𝑎 is even as well, or in other words, 𝑎 = 2𝑎′ for some 𝑎′ ∈ ℕ. Yet plugging this into (∗) shows that 4𝑎′² = 2𝑏² which means 𝑏² = 2𝑎′² is an even number as well. By the same considerations as above we get that 𝑏 is even and hence 𝑎/2 and 𝑏/2 are two natural numbers satisfying (𝑎/2)/(𝑏/2) = √2, contradicting the minimality of 𝑎.


Proofs of a universal statement: Often we want to prove a statement 𝑋 of


the form “Every object of type 𝑂 has property 𝑃 .” Such proofs often
start with a sentence such as “Let 𝑜 be an object of type 𝑂” and end by
showing that 𝑜 has the property 𝑃 . Here is a simple example:
Lemma 1.18 For every natural number 𝑛 ∈ ℕ, either 𝑛 or 𝑛 + 1 is even.

Proof. Let 𝑛 ∈ ℕ be some number. If 𝑛/2 is a whole number then


we are done, since then 𝑛 = 2(𝑛/2) and hence it is even. Otherwise,
𝑛/2 + 1/2 is a whole number, and hence 2(𝑛/2 + 1/2) = 𝑛 + 1 is even.

Proofs of an implication: Another common case is that the statement 𝑋
has the form “𝐴 implies 𝐵”. Such proofs often start with a sentence
such as “Assume that 𝐴 is true” and end with a derivation of 𝐵 from
𝐴. Here is a simple example:
Lemma 1.19 If 𝑏² ≥ 4𝑎𝑐 then there is a solution to the quadratic equation 𝑎𝑥² + 𝑏𝑥 + 𝑐 = 0.

Proof. Suppose that 𝑏² ≥ 4𝑎𝑐. Then 𝑑 = 𝑏² − 4𝑎𝑐 is a non-negative number and hence it has a square root 𝑠. Thus 𝑥 = (−𝑏 + 𝑠)/(2𝑎) satisfies

𝑎𝑥² + 𝑏𝑥 + 𝑐 = 𝑎(−𝑏 + 𝑠)²/(4𝑎²) + 𝑏(−𝑏 + 𝑠)/(2𝑎) + 𝑐 = (𝑏² − 2𝑏𝑠 + 𝑠²)/(4𝑎) + (−𝑏² + 𝑏𝑠)/(2𝑎) + 𝑐 .   (1.4)

Rearranging the terms of (1.4) we get

𝑠²/(4𝑎) + 𝑐 − 𝑏²/(4𝑎) = (𝑏² − 4𝑎𝑐)/(4𝑎) + 𝑐 − 𝑏²/(4𝑎) = 0 . ■
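The algebra above can also be sanity-checked numerically. A tiny Python sketch of ours plugs the root from the proof back into one concrete quadratic:

import math

a, b, c = 1, 5, 6                        # b*b = 25 >= 24 = 4*a*c
s = math.sqrt(b * b - 4 * a * c)         # s is the square root of d, as in the proof
x = (-b + s) / (2 * a)                   # here x = -2
assert abs(a * x**2 + b * x + c) < 1e-9  # x is indeed a solution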

Proofs of equivalence: If a statement has the form “𝐴 if and only if


𝐵” (often shortened as “𝐴 iff 𝐵”) then we need to prove both that 𝐴
implies 𝐵 and that 𝐵 implies 𝐴. We call the implication that 𝐴 implies
𝐵 the “only if” direction, and the implication that 𝐵 implies 𝐴 the “if”
direction.

Proofs by combining intermediate claims: When a proof is more complex,


it is often helpful to break it apart into several steps. That is, to prove
the statement 𝑋, we might first prove statements 𝑋1 ,𝑋2 , and 𝑋3 and
then prove that 𝑋1 ∧ 𝑋2 ∧ 𝑋3 implies 𝑋. (Recall that ∧ denotes the
logical AND operator.)

Proofs by case distinction:This is a special case of the above, where to


prove a statement 𝑋 we split into several cases 𝐶1 , … , 𝐶𝑘 , and prove
that (a) the cases are exhaustive, in the sense that one of the cases 𝐶𝑖
must happen and (b) go one by one and prove that each one of the
cases 𝐶𝑖 implies the result 𝑋 that we are after.

Proofs by induction: We discuss induction and give an example in


Section 1.6.1 below. We can think of such proofs as a variant of the
above, where we have an unbounded number of intermediate claims
𝑋0 , 𝑋1 , 𝑋2 , … , 𝑋𝑘 , and we prove that 𝑋0 is true, as well as that 𝑋0
implies 𝑋1 , and that 𝑋0 ∧ 𝑋1 implies 𝑋2 , and so on and so forth. The
website for CMU course 15-251 contains a useful handout on potential
pitfalls when making proofs by induction.

“Without loss of generality (w.l.o.g.)”: This term can be initially quite con-
fusing. It is essentially a way to simplify proofs by case distinctions.
The idea is that if Case 1 is equal to Case 2 up to a change of variables
or a similar transformation, then the proof of Case 1 will also imply
the proof of Case 2. It is always a statement that should be viewed
with suspicion. Whenever you see it in a proof, ask yourself if you
understand why the assumption made is truly without loss of gen-
erality, and when you use it, try to see if the use is indeed justified.
When writing a proof, sometimes it might be easiest to simply repeat
the proof of the second case (adding a remark that the proof is very
similar to the first one).

R
Remark 1.20 — Hierarchical Proofs (optional). Mathe-
matical proofs are ultimately written in English prose.
The well-known computer scientist Leslie Lamport
argues that this is a problem, and proofs should be
written in a more formal and rigorous way. In his
manuscript he proposes an approach for structured
hierarchical proofs, that have the following form:

• A proof for a statement of the form “If 𝐴 then 𝐵”


is a sequence of numbered claims, starting with
the assumption that 𝐴 is true, and ending with the
claim that 𝐵 is true.
• Every claim is followed by a proof showing how
it is derived from the previous assumptions or
claims.
• The proof for each claim is itself a sequence of
subclaims.

The advantage of Lamport’s format is that the role


that every sentence in the proof plays is very clear.
It is also much easier to transform such proofs into
machine-checkable forms. The disadvantage is that
such proofs can be tedious to read and write, with
less differentiation between the important parts of the
arguments versus the more routine ones.

1.6 EXTENDED EXAMPLE: TOPOLOGICAL SORTING


In this section we will prove the following: every directed acyclic
graph (DAG, see Definition 1.9) can be arranged in layers so that for
all directed edges 𝑢 → 𝑣, the layer of 𝑣 is larger than the layer of 𝑢.
This result is known as topological sorting and is used in many appli-
cations, including task scheduling, build systems, software package
management, spreadsheet cell calculations, and many others (see
Fig. 1.8). In fact, we will also use it ourselves later on in this book.

Figure 1.8: An example of topological sorting. We con-


sider a directed graph corresponding to a prerequisite
graph of the courses in some Computer Science pro-
gram. The edge 𝑢 → 𝑣 means that the course 𝑢 is a
prerequisite for the course 𝑣. A layering or “topologi-
cal sorting” of this graph is the same as mapping the
courses to semesters so that if we decide to take the
course 𝑣 in semester 𝑓(𝑣), then we have already taken
all the prerequisites for 𝑣 (i.e., its in-neighbors) in
prior semesters.

We start with the following definition. A layering of a directed


graph is a way to assign for every vertex 𝑣 a natural number
(corresponding to its layer), such that 𝑣’s in-neighbors are in
lower-numbered layers than 𝑣, and 𝑣’s out-neighbors are in
higher-numbered layers. The formal definition is as follows:

Definition 1.21 — Layering of a DAG. Let 𝐺 = (𝑉 , 𝐸) be a directed graph.
A layering of 𝐺 is a function 𝑓 ∶ 𝑉 → ℕ such that for every edge
𝑢 → 𝑣 of 𝐺, 𝑓(𝑢) < 𝑓(𝑣).

In this section we prove that a directed graph is acyclic if and only if


it has a valid layering.

Theorem 1.22 — Topological Sort. Let 𝐺 be a directed graph. Then 𝐺 is


acyclic if and only if there exists a layering 𝑓 of 𝐺.

To prove such a theorem, we need to first understand what it


means. Since it is an “if and only if” statement, Theorem 1.22 corre-
sponds to two statements:
Lemma 1.23 For every directed graph 𝐺, if 𝐺 is acyclic then it has a
layering.
Lemma 1.24 For every directed graph 𝐺, if 𝐺 has a layering, then it is
acyclic.

To prove Theorem 1.22 we need to prove both Lemma 1.23 and


Lemma 1.24. Lemma 1.24 is actually not that hard to prove. Intuitively,
if 𝐺 contains a cycle, then it cannot be the case that all edges on the
cycle increase in layer number, since if we travel along the cycle at
some point we must come back to the place we started from. The
formal proof is as follows:

Proof. Let 𝐺 = (𝑉 , 𝐸) be a directed graph and let 𝑓 ∶ 𝑉 → ℕ be a


layering of 𝐺 as per Definition 1.21 . Suppose, towards a contradiction,
that 𝐺 is not acyclic, and hence there exists some cycle 𝑢0 , 𝑢1 , … , 𝑢𝑘
such that 𝑢0 = 𝑢𝑘 and for every 𝑖 ∈ [𝑘] the edge 𝑢𝑖 → 𝑢𝑖+1 is present in
𝐺. Since 𝑓 is a layering, for every 𝑖 ∈ [𝑘], 𝑓(𝑢𝑖 ) < 𝑓(𝑢𝑖+1 ), which means
that
𝑓(𝑢0 ) < 𝑓(𝑢1 ) < ⋯ < 𝑓(𝑢𝑘 )

but this is a contradiction since 𝑢0 = 𝑢𝑘 and hence 𝑓(𝑢0 ) = 𝑓(𝑢𝑘 ).


Lemma 1.23 corresponds to the more difficult (and useful) direc-


tion. To prove it, we need to show how, given an arbitrary DAG 𝐺, we
can come up with a layering of the vertices of 𝐺 so that all edges “go
up”.

P
If you have not seen the proof of this theorem before
(or don’t remember it), this would be an excellent
point to pause and try to prove it yourself. One way
to do it would be to describe an algorithm that given as
input a directed acyclic graph 𝐺 on 𝑛 vertices and 𝑛−2
or fewer edges, constructs an array 𝐹 of length 𝑛 such
that for every edge 𝑢 → 𝑣 in the graph 𝐹 [𝑢] < 𝐹 [𝑣].

1.6.1 Mathematical induction


There are several ways to prove Lemma 1.23. One approach is to start by proving it for small graphs, such as graphs with 1, 2 or 3 vertices (see Fig. 1.9), for which we can check all the cases, and then to try to extend the proof to larger graphs. The technical term for this
proof approach is proof by induction.
Induction is simply an application of the self-evident Modus Ponens
rule that says that if

(a) 𝑃 is true
and
(b) 𝑃 implies 𝑄
then 𝑄 is true.

Figure 1.9: Some examples of DAGs of one, two and three vertices, and valid ways to assign layers to the vertices.

In the setting of proofs by induction we typically have a statement


𝑄(𝑘) that is parameterized by some integer 𝑘, and we prove that (a)
𝑄(0) is true, and (b) For every 𝑘 > 0, if 𝑄(0), … , 𝑄(𝑘 − 1) are all true
then 𝑄(𝑘) is true. (Usually proving (b) is the hard part, though there
are examples where the “base case” (a) is quite subtle.) By applying
Modus Ponens, we can deduce from (a) and (b) that 𝑄(1) is true.
Once we did so, since we now know that both 𝑄(0) and 𝑄(1) are true,
then we can use this and (b) to deduce (again using Modus Ponens)
that 𝑄(2) is true. We can repeat the same reasoning again and again
to obtain that 𝑄(𝑘) is true for every 𝑘. The statement (a) is called the
“base case”, while (b) is called the “inductive step”. The assumption
in (b) that 𝑄(𝑖) holds for 𝑖 < 𝑘 is called the “inductive hypothesis”.
(The form of induction described here is sometimes called “strong
induction” as opposed to “weak induction” where we replace (b)
by the statement (b’) that if 𝑄(𝑘 − 1) is true then 𝑄(𝑘) is true; weak
induction can be thought of as the special case of strong induction
where we don’t use the assumption that 𝑄(0), … , 𝑄(𝑘 − 2) are true.)

R
Remark 1.25 — Induction and recursion. Proofs by in-
duction are closely related to algorithms by recursion.
In both cases we reduce solving a larger problem to
solving a smaller instance of itself. In a recursive algo-
rithm to solve some problem P on an input of length
𝑘 we ask ourselves “what if someone handed me a
way to solve P on instances smaller than 𝑘?”. In an
inductive proof to prove a statement Q parameterized
by a number 𝑘, we ask ourselves “what if I already
knew that 𝑄(𝑘′ ) is true for 𝑘′ < 𝑘?”. Both induction
and recursion are crucial concepts for this course and
Computer Science at large (and even other areas of
inquiry, including not just mathematics but other
sciences as well). Both can be confusing at first, but
with time and practice they become clearer. For more
on proofs by induction and recursion, you might find
the following Stanford CS 103 handout, this MIT 6.00
lecture or this excerpt of the Lehman-Leighton book
useful.

1.6.2 Proving the result by induction


There are several ways to prove Lemma 1.23 by induction. We will
use induction on the number 𝑛 of vertices, and so we will define the
statement 𝑄(𝑛) as follows:

𝑄(𝑛) is “For every DAG 𝐺 = (𝑉 , 𝐸) with 𝑛 vertices, there is a layering of 𝐺.”



The statement for 𝑄(0) (where the graph contains no vertices) is


trivial. Thus it will suffice to prove the following: for every 𝑛 > 0, if
𝑄(𝑛 − 1) is true then 𝑄(𝑛) is true.
To do so, we need to somehow find a way, given a graph 𝐺 of 𝑛
vertices, to reduce the task of finding a layering for 𝐺 into the task of
finding a layering for some other graph 𝐺′ of 𝑛 − 1 vertices. The idea is
that we will find a source of 𝐺: a vertex 𝑣 that has no in-neighbors. We
can then assign to 𝑣 the layer 0, and layer the remaining vertices using
the inductive hypothesis in layers 1, 2, ….
The above is the intuition behind the proof of Lemma 1.23, but
when writing the formal proof below, we use the benefit of hind-
sight, and try to streamline what was a messy journey into a linear
and easy-to-follow flow of logic that starts with the word “Proof:” and ends with “QED” or the symbol ■. (QED stands for “quod erat demonstrandum”, which is Latin for “what was to be demonstrated” or “the very thing it was required to have shown”.) Discussions, examples and digressions can be very insightful, but we keep them outside the space
delimited between these two words, where (as described by this ex-
cellent handout) “every sentence must be load-bearing”. Just like we
do in programming, we can break the proof into little “subroutines”
or “functions” (known as lemmas or claims in math language), which
will be smaller statements that help us prove the main result. How-
ever, the proof should be structured in a way that ensures that it is
always crystal-clear to the reader in what stage we are of the proof.
The reader should be able to tell what the role of every sentence is in
the proof and which part it belongs to. We now present the formal
proof of Lemma 1.23.

Proof of Lemma 1.23. Let 𝐺 = (𝑉 , 𝐸) be a DAG and 𝑛 = |𝑉 | be the


number of its vertices. We prove the lemma by induction on 𝑛. The
base case is 𝑛 = 0 where there are no vertices, and so the statement is trivially true. (Using 𝑛 = 0 as the base case is logically valid, but can be confusing. If you find the trivial 𝑛 = 0 case to be confusing, you can always directly verify the statement for 𝑛 = 1 and then use both 𝑛 = 0 and 𝑛 = 1 as the base cases.) For the case of 𝑛 > 0, we make the inductive hypothesis that every DAG 𝐺′ of at most 𝑛 − 1 vertices has a layering.
We make the following claim:
Claim: 𝐺 must contain a vertex 𝑣 of in-degree zero.
Proof of Claim: Suppose otherwise that every vertex 𝑣 ∈ 𝑉 has an
in-neighbor. Let 𝑣0 be some vertex of 𝐺, let 𝑣1 be an in-neighbor of 𝑣0 ,
𝑣2 be an in-neighbor of 𝑣1 , and continue in this way for 𝑛 steps until
we construct a list 𝑣0 , 𝑣1 , … , 𝑣𝑛 such that for every 𝑖 ∈ [𝑛], 𝑣𝑖+1 is an
in-neighbor of 𝑣𝑖 , or in other words the edge 𝑣𝑖+1 → 𝑣𝑖 is present in the
graph. Since there are only 𝑛 vertices in this graph, one of the 𝑛 + 1
vertices in this sequence must repeat itself, and so there exists 𝑖 < 𝑗
such that 𝑣𝑖 = 𝑣𝑗 . But then the sequence 𝑣𝑗 → 𝑣𝑗−1 → ⋯ → 𝑣𝑖 is a cycle
in 𝐺, contradicting our assumption that it is acyclic. (QED Claim)
Given the claim, we can let 𝑣0 be some vertex of in-degree zero in
𝐺, and let 𝐺′ be the graph obtained by removing 𝑣0 from 𝐺. 𝐺′ has

𝑛 − 1 vertices and hence per the inductive hypothesis has a layering


𝑓 ′ ∶ (𝑉 ⧵ {𝑣0 }) → ℕ. We define 𝑓 ∶ 𝑉 → ℕ as follows:


𝑓(𝑣) = 𝑓 ′ (𝑣) + 1 if 𝑣 ≠ 𝑣0 , and 𝑓(𝑣) = 0 if 𝑣 = 𝑣0 .
We claim that 𝑓 is a valid layering, namely that for every edge 𝑢 →
𝑣, 𝑓(𝑢) < 𝑓(𝑣). To prove this, we split into cases:

• Case 1: 𝑢 ≠ 𝑣0 , 𝑣 ≠ 𝑣0 . In this case the edge 𝑢 → 𝑣 exists in the


graph 𝐺′ and hence by the inductive hypothesis 𝑓 ′ (𝑢) < 𝑓 ′ (𝑣)
which implies that 𝑓 ′ (𝑢) + 1 < 𝑓 ′ (𝑣) + 1.

• Case 2: 𝑢 = 𝑣0 , 𝑣 ≠ 𝑣0 . In this case 𝑓(𝑢) = 0 and 𝑓(𝑣) = 𝑓 ′ (𝑣) + 1 >


0.

• Case 3: 𝑢 ≠ 𝑣0 , 𝑣 = 𝑣0 . This case can’t happen since 𝑣0 does not


have in-neighbors.

• Case 4: 𝑢 = 𝑣0 , 𝑣 = 𝑣0 . This case again can’t happen since it means


that 𝑣0 is its own neighbor — it is involved in a self-loop, which is a form of cycle that is disallowed in an acyclic graph.

Thus, 𝑓 is a valid layering for 𝐺 which completes the proof.


P
Reading a proof is no less of an important skill than
producing one. In fact, just like understanding code,
it is a highly non-trivial skill in itself. Therefore I
strongly suggest that you re-read the above proof, ask-
ing yourself at every sentence whether the assumption
it makes is justified, and whether this sentence truly
demonstrates what it purports to achieve. Another
good habit is to ask yourself when reading a proof for
every variable you encounter (such as 𝑢, 𝑖, 𝐺′ , 𝑓 ′ , etc.
in the above proof) the following questions: (1) What
type of variable is it? Is it a number? a graph? a ver-
tex? a function? and (2) What do we know about it?
Is it an arbitrary member of the set? Have we shown
some facts about it?, and (3) What are we trying to
show about it?.
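The inductive proof of Lemma 1.23 translates directly into a recursive algorithm: find a source (a vertex of in-degree zero), put it in layer 0, and shift a layering of the remaining graph up by one. The following Python sketch is our own illustration of this idea (the name layer_dag and the example graph are not from the text); the graph is given as a dictionary mapping each vertex to its list of out-neighbors.

def layer_dag(V, E):
    """Return a layering f: V -> N of the DAG (V, E), mirroring the
    induction in the proof of Lemma 1.23."""
    if not V:
        return {}
    # A source exists since the graph is acyclic (the Claim in the proof).
    has_in_edge = {v for u in V for v in E.get(u, []) if v in V}
    v0 = next(v for v in V if v not in has_in_edge)
    # Layer the graph with v0 removed, shift all layers up by 1, put v0 at 0.
    f = {v: layer + 1 for v, layer in layer_dag(V - {v0}, E).items()}
    f[v0] = 0
    return f

E = {"a": ["b", "c"], "b": ["c"], "c": []}
f = layer_dag({"a", "b", "c"}, E)                  # e.g. {'a': 0, 'b': 1, 'c': 2}
assert all(f[u] < f[v] for u in E for v in E[u])   # every edge "goes up"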

1.6.3 Minimality and uniqueness


Theorem 1.22 guarantees that for every DAG 𝐺 = (𝑉 , 𝐸) there exists
some layering 𝑓 ∶ 𝑉 → ℕ but this layering is not necessarily unique.
For example, if 𝑓 ∶ 𝑉 → ℕ is a valid layering of the graph then so is
the function 𝑓 ′ defined as 𝑓 ′ (𝑣) = 2 ⋅ 𝑓(𝑣). However, it turns out that

the minimal layering is unique. A minimal layering is one where every


vertex is given the smallest layer number possible. We now formally
define minimality and state the uniqueness theorem:

Theorem 1.26 — Minimal layering is unique. Let 𝐺 = (𝑉 , 𝐸) be a DAG. We
say that a layering 𝑓 ∶ 𝑉 → ℕ is minimal if for every vertex 𝑣 ∈ 𝑉 , if
𝑣 has no in-neighbors then 𝑓(𝑣) = 0 and if 𝑣 has in-neighbors then
there exists an in-neighbor 𝑢 of 𝑣 such that 𝑓(𝑢) = 𝑓(𝑣) − 1.
For every layering 𝑓, 𝑔 ∶ 𝑉 → ℕ of 𝐺, if both 𝑓 and 𝑔 are minimal
then 𝑓 = 𝑔.

The definition of minimality in Theorem 1.26 implies that for every


vertex 𝑣 ∈ 𝑉 , we cannot move it to a lower layer without making
the layering invalid. If 𝑣 is a source (i.e., has in-degree zero) then
a minimal layering 𝑓 must put it in layer 0, and for every other 𝑣, if
𝑓(𝑣) = 𝑖, then we cannot modify this to set 𝑓(𝑣) ≤ 𝑖 − 1 since there
is an in-neighbor 𝑢 of 𝑣 satisfying 𝑓(𝑢) = 𝑖 − 1. What Theorem 1.26
says is that a minimal layering 𝑓 is unique in the sense that every other
minimal layering is equal to 𝑓.

Proof Idea:
The idea is to prove the theorem by induction on the layers. If 𝑓 and
𝑔 are minimal then they must agree on the source vertices, since both
𝑓 and 𝑔 should assign these vertices to layer 0. We can then show that
if 𝑓 and 𝑔 agree up to layer 𝑖 − 1, then the minimality property implies
that they need to agree in layer 𝑖 as well. In the actual proof we use
a small trick to save on writing. Rather than proving the statement
that 𝑓 = 𝑔 (or in other words that 𝑓(𝑣) = 𝑔(𝑣) for every 𝑣 ∈ 𝑉 ),
we prove the weaker statement that 𝑓(𝑣) ≤ 𝑔(𝑣) for every 𝑣 ∈ 𝑉 .
(This is a weaker statement since the condition that 𝑓(𝑣) is less than or equal to 𝑔(𝑣) is implied by the condition that 𝑓(𝑣) is equal to
𝑔(𝑣).) However, since 𝑓 and 𝑔 are just labels we give to two minimal
layerings, by simply changing the names “𝑓” and “𝑔” the same proof
also shows that 𝑔(𝑣) ≤ 𝑓(𝑣) for every 𝑣 ∈ 𝑉 and hence that 𝑓 = 𝑔.

Proof of Theorem 1.26. Let 𝐺 = (𝑉 , 𝐸) be a DAG and 𝑓, 𝑔 ∶ 𝑉 → ℕ be


two minimal valid layerings of 𝐺. We will prove that for every 𝑣 ∈ 𝑉 ,
𝑓(𝑣) ≤ 𝑔(𝑣). Since we didn’t assume anything about 𝑓, 𝑔 except their
minimality, the same proof will imply that for every 𝑣 ∈ 𝑉 , 𝑔(𝑣) ≤ 𝑓(𝑣)
and hence that 𝑓(𝑣) = 𝑔(𝑣) for every 𝑣 ∈ 𝑉 , which is what we needed
to show.
We will prove that 𝑓(𝑣) ≤ 𝑔(𝑣) for every 𝑣 ∈ 𝑉 by induction on
𝑖 = 𝑓(𝑣). The case 𝑖 = 0 is immediate: since in this case 𝑓(𝑣) = 0,
𝑔(𝑣) must be at least 𝑓(𝑣). For the case 𝑖 > 0, by the minimality of 𝑓,

if 𝑓(𝑣) = 𝑖 then there must exist some in-neighbor 𝑢 of 𝑣 such that


𝑓(𝑢) = 𝑖 − 1. By the induction hypothesis we get that 𝑔(𝑢) ≥ 𝑖 − 1, and
since 𝑔 is a valid layering it must hold that 𝑔(𝑣) > 𝑔(𝑢) which means
that 𝑔(𝑣) ≥ 𝑖 = 𝑓(𝑣).

P
The proof of Theorem 1.26 is fully rigorous, but is
written in a somewhat terse manner. Make sure that
you read through it and understand why this is indeed
an airtight proof of the Theorem’s statement.
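The minimal layering of Theorem 1.26 also has a simple recursive description: a source gets layer 0, and every other vertex gets one plus the maximum layer among its in-neighbors (so some in-neighbor sits exactly one layer below, as the minimality condition requires). The sketch below is our own illustration; it assumes the input is a DAG, given by in-neighbor lists.

from functools import lru_cache

in_neighbors = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["b"]}

@lru_cache(maxsize=None)
def minimal_layer(v):
    """Layer of v in the minimal layering: 0 for sources, otherwise
    1 + the maximum layer among v's in-neighbors."""
    preds = in_neighbors[v]
    return 0 if not preds else 1 + max(minimal_layer(u) for u in preds)

f = {v: minimal_layer(v) for v in in_neighbors}
assert f == {"a": 0, "b": 1, "c": 2, "d": 2}
assert all(f[u] < f[v] for v in in_neighbors for u in in_neighbors[v])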

1.7 THIS BOOK: NOTATION AND CONVENTIONS


Most of the notation we use in this book is standard and is used in
most mathematical texts. The main points where we diverge are:

• We index the natural numbers ℕ starting with 0 (though many


other texts, especially in computer science, do the same).

• We also index the set [𝑛] starting with 0, and hence define it as
{0, … , 𝑛 − 1}. In other texts it is often defined as {1, … , 𝑛}. Similarly,
we index our strings starting with 0, and hence a string 𝑥 ∈ {0, 1}𝑛
is written as 𝑥0 𝑥1 ⋯ 𝑥𝑛−1 .

• If 𝑛 is a natural number then 1^𝑛 does not equal the number 1 but rather this is the length 𝑛 string 11 ⋯ 1 (that is, a string of 𝑛 ones). Similarly, 0^𝑛 refers to the length 𝑛 string 00 ⋯ 0.

• Partial functions are functions that are not necessarily defined on


all inputs. When we write 𝑓 ∶ 𝐴 → 𝐵 this means that 𝑓 is a total
function unless we say otherwise. When we want to emphasize that
𝑓 can be a partial function, we will sometimes write 𝑓 ∶ 𝐴 →𝑝 𝐵.

• As we will see later on in the course, we will mostly describe our


computational problems in terms of computing a Boolean function
𝑓 ∶ {0, 1}∗ → {0, 1}. In contrast, many other textbooks refer to the
same task as deciding a language 𝐿 ⊆ {0, 1}∗ . These two viewpoints
are equivalent, since for every set 𝐿 ⊆ {0, 1}∗ there is a correspond-
ing function 𝐹 such that 𝐹 (𝑥) = 1 if and only if 𝑥 ∈ 𝐿. Computing
partial functions corresponds to the task known in the literature as a
solving promise problem. Because the language notation is so preva-
lent in other textbooks, we will occasionally remind the reader of
this correspondence.

• We use ⌈𝑥⌉ and ⌊𝑥⌋ for the “ceiling” and “floor” operators that
correspond to “rounding up” or “rounding down” a number to the

nearest integer. We use (𝑥 mod 𝑦) to denote the “remainder” of 𝑥


when divided by 𝑦. That is, (𝑥 mod 𝑦) = 𝑥 − 𝑦⌊𝑥/𝑦⌋. In context
when an integer is expected we’ll typically “silently round” the
quantities to an integer. For example, if we say that 𝑥 is a string of
√ √
length 𝑛 then this means that 𝑥 is of length ⌈ 𝑛 ⌉. (We round up
for the sake of convention, but in most such cases, it will not make a
difference whether we round up or down.)

• Like most Computer Science texts, we default to the logarithm in


base two. Thus, log 𝑛 is the same as log2 𝑛.

• We will also use the notation 𝑓(𝑛) = 𝑝𝑜𝑙𝑦(𝑛) as a shorthand for


𝑓(𝑛) = 𝑛^{𝑂(1)} (i.e., as shorthand for saying that there are some
constants 𝑎, 𝑏 such that 𝑓(𝑛) ≤ 𝑎 ⋅ 𝑛𝑏 for every sufficiently large
𝑛). Similarly, we will use 𝑓(𝑛) = 𝑝𝑜𝑙𝑦𝑙𝑜𝑔(𝑛) as shorthand for
𝑓(𝑛) = 𝑝𝑜𝑙𝑦(log 𝑛) (i.e., as shorthand for saying that there are
some constants 𝑎, 𝑏 such that 𝑓(𝑛) ≤ 𝑎 ⋅ (log 𝑛)𝑏 for every sufficiently
large 𝑛).

• As is often the case in mathematical literature, we use the apostro-


phe character to enrich our set of identifiers. Typically if 𝑥 denotes
some object, then 𝑥′ , 𝑥″ , etc. will denote other objects of the same
type.

• To save on “cognitive load” we will often use round constants such


as 10, 100, 1000 in the statements of both theorems and problem
set questions. When you see such a “round” constant, you can
typically assume that it has no special significance and was just
chosen arbitrarily. For example, if you see a theorem of the form
“Algorithm 𝐴 takes at most 1000 ⋅ 𝑛2 steps to compute function 𝐹 on
inputs of length 𝑛” then probably the number 1000 is an arbitrary
sufficiently large constant, and one could prove the same theorem
with a bound of the form 𝑐 ⋅ 𝑛2 for a constant 𝑐 that is smaller than
1000. Similarly, if a problem asks you to prove that some quantity is
at least 𝑛/100, it is quite possible that in truth the quantity is at least
𝑛/𝑑 for some constant 𝑑 that is smaller than 100.

1.7.1 Variable name conventions


Like programming, mathematics is full of variables. Whenever you see
a variable, it is always important to keep track of what its type is (e.g.,
whether the variable is a number, a string, a function, a graph, etc.).
To make this easier, we try to stick to certain conventions and consis-
tently use certain identifiers for variables of the same type. Some of
these conventions are listed in Section 1.7.1 below. These conventions
are not immutable laws and we might occasionally deviate from them.

Also, such conventions do not replace the need to explicitly declare for
each new variable the type of object that it denotes.

Table 1.2: Conventions for identifiers in this book

Identifier Often denotes object of type


𝑖,𝑗,𝑘,ℓ,𝑚,𝑛 Natural numbers (i.e., in ℕ = {0, 1, 2, …})
𝜖, 𝛿 Small positive real numbers (very close to 0)
𝑥, 𝑦, 𝑧, 𝑤 Typically strings in {0, 1}∗ though sometimes numbers or
other objects. We often identify an object with its
representation as a string.
𝐺 A graph. The set of 𝐺’s vertices is typically denoted by 𝑉 .
Often 𝑉 = [𝑛]. The set of 𝐺’s edges is typically denoted by
𝐸.
𝑆 Set
𝑓, 𝑔, ℎ Functions. We often (though not always) use lowercase
identifiers for finite functions, which map {0, 1}𝑛 to {0, 1}𝑚
(often 𝑚 = 1).
𝐹 , 𝐺, 𝐻 Infinite (unbounded input) functions mapping {0, 1}∗ to
{0, 1}∗ or {0, 1}∗ to {0, 1}𝑚 for some 𝑚. Based on context,
the identifiers 𝐺, 𝐻 are sometimes used to denote
functions and sometimes graphs.
𝐴, 𝐵, 𝐶 Boolean circuits
𝑀, 𝑁 Turing machines
𝑃,𝑄 Programs
𝑇 A function mapping ℕ to ℕ that corresponds to a time
bound.
𝑐 A positive number (often an unspecified constant; e.g.,
𝑇 (𝑛) = 𝑂(𝑛) corresponds to the existence of 𝑐 s.t.
𝑇 (𝑛) ≤ 𝑐 ⋅ 𝑛 every 𝑛 > 0). We sometimes use 𝑎, 𝑏 in a
similar way.
Σ Finite set (often used as the alphabet for a set of strings).

1.7.2 Some idioms


Mathematical texts often employ certain conventions or “idioms”.
Some examples of such idioms that we use in this text include the
following:

• “Let 𝑋 be …”, “let 𝑋 denote …”, or “let 𝑋 = …”: These are all
different ways for us to say that we are defining the symbol 𝑋 to
stand for whatever expression is in the …. When 𝑋 is a property of
some objects we might define 𝑋 by writing something along the
lines of “We say that … has the property 𝑋 if ….”. While we often

try to define terms before they are used, sometimes a mathematical


sentence reads easier if we use a term before defining it, in which
case we add “Where 𝑋 is …” to explain how 𝑋 is defined in the
preceding expression.

• Quantifiers: Mathematical texts involve many quantifiers such as


“for all” and “exists”. We sometimes spell these in words as in “for
all 𝑖 ∈ ℕ” or “there is 𝑥 ∈ {0, 1}∗ ”, and sometimes use the formal
symbols ∀ and ∃. It is important to keep track of which variable is
quantified in what way and of the dependencies between the variables. For
example, a sentence fragment such as “for every 𝑘 > 0 there exists
𝑛” means that 𝑛 can be chosen in a way that depends on 𝑘. The order
of quantifiers is important. For example, the following is a true
statement: “for every natural number 𝑘 > 1 there exists a prime number
𝑛 such that 𝑛 divides 𝑘.” In contrast, the following statement is false:
“there exists a prime number 𝑛 such that for every natural number 𝑘 > 1,
𝑛 divides 𝑘.”

• Numbered equations, theorems, definitions: To keep track of all


the terms we define and statements we prove, we often assign them
a (typically numeric) label, and then refer back to them in other
parts of the text.

• (i.e.,), (e.g.,): Mathematical texts tend to contain quite a few of


these expressions. We use 𝑋 (i.e., 𝑌 ) in cases where 𝑌 is equivalent
to 𝑋 and 𝑋 (e.g., 𝑌 ) in cases where 𝑌 is an example of 𝑋 (e.g., one
can use phrases such as “a natural number (i.e., a non-negative
integer)” or “a natural number (e.g., 7)”).

• “Thus”, “Therefore” , “We get that”: This means that the following
sentence is implied by the preceding one, as in “The 𝑛-vertex graph
𝐺 is connected. Therefore it contains at least 𝑛 − 1 edges.” We
sometimes use “indeed” to indicate that the following text justifies
the claim that was made in the preceding sentence as in “The 𝑛-
vertex graph 𝐺 has at least 𝑛 − 1 edges. Indeed, this follows since 𝐺 is
connected.”

• Constants: In Computer Science, we typically care about how our


algorithms’ resource consumption (such as running time) scales
with certain quantities (such as the length of the input). We refer to
quantities that do not depend on the length of the input as constants
and so often use statements such as “there exists a constant 𝑐 > 0 such
that for every 𝑛 ∈ ℕ, Algorithm 𝐴 runs in at most 𝑐 ⋅ 𝑛2 steps on inputs of
length 𝑛.” The qualifier “constant” for 𝑐 is not strictly needed but is
added to emphasize that 𝑐 here is a fixed number independent of 𝑛.
In fact sometimes, to reduce cognitive load, we will simply replace 𝑐

by a sufficiently large round number such as 10, 100, or 1000, or use


𝑂-notation and write “Algorithm 𝐴 runs in 𝑂(𝑛2 ) time.”

✓ Chapter Recap

• The basic “mathematical data structures” we’ll


need are numbers, sets, tuples, strings, graphs and
functions.
• We can use basic objects to define more complex
notions. For example, graphs can be defined as a list
of pairs.
• Given precise definitions of objects, we can state
unambiguous and precise statements. We can then
use mathematical proofs to determine whether these
statements are true or false.
• A mathematical proof is not a formal ritual but
rather a clear, precise and “bulletproof” argument
certifying the truth of a certain statement.
• Big-𝑂 notation is an extremely useful formalism
to suppress less significant details and allows us
to focus on the high-level behavior of quantities of
interest.
• The only way to get comfortable with mathematical
notions is to apply them in the contexts of solving
problems. You should expect to need to go back
time and again to the definitions and notation in
this chapter as you work through problems in this
course.

1.8 EXERCISES
Exercise 1.1 — Logical expressions.
a. Write a logical expression 𝜑(𝑥)
involving the variables 𝑥0 , 𝑥1 , 𝑥2 and the operators ∧ (AND), ∨
(OR), and ¬ (NOT), such that 𝜑(𝑥) is true if the majority of the
inputs are True.

b. Write a logical expression 𝜑(𝑥) involving the variables 𝑥0 , 𝑥1 , 𝑥2


and the operators ∧ (AND), ∨ (OR), and ¬ (NOT), such that 𝜑(𝑥)
2
is true if the sum ∑𝑖=0 𝑥𝑖 (identifying “true” with 1 and “false”
with 0) is odd.

Exercise 1.2 — Quantifiers. Use the logical quantifiers ∀ (for all), ∃ (there


exists), as well as ∧, ∨, ¬ and the arithmetic operations +, ×, =, >, < to
write the following:
a. An expression 𝜑(𝑛, 𝑘) such that for every natural number 𝑛, 𝑘,
𝜑(𝑛, 𝑘) is true if and only if 𝑘 divides 𝑛.

b. An expression 𝜑(𝑛) such that for every natural number 𝑛, 𝜑(𝑛) is


true if and only if 𝑛 is a power of three.

Exercise 1.3 Describe the following statement in English words:


∀𝑛∈ℕ ∃𝑝>𝑛 ∀𝑎, 𝑏 ∈ ℕ(𝑎 × 𝑏 ≠ 𝑝) ∨ (𝑎 = 1).

Exercise 1.4 — Set construction notation. Describe in words the following


sets:

a. 𝑆 = {𝑥 ∈ {0, 1}100 ∶ ∀𝑖∈{0,…,99} 𝑥𝑖 = 𝑥99−𝑖 }

b. 𝑇 = {𝑥 ∈ {0, 1}∗ ∶ ∀𝑖,𝑗∈{2,…,|𝑥|−1} 𝑖 ⋅ 𝑗 ≠ |𝑥|}

Exercise 1.5 — Existence of one to one mappings. For each one of the fol-
lowing pairs of sets (𝑆, 𝑇 ), prove or disprove the following statement:
there is a one to one function 𝑓 mapping 𝑆 to 𝑇 .

a. Let 𝑛 > 10. 𝑆 = {0, 1}^𝑛 and 𝑇 = [𝑛] × [𝑛] × [𝑛].

b. Let 𝑛 > 10. 𝑆 is the set of all functions mapping {0, 1}^𝑛 to {0, 1}. 𝑇 = {0, 1}^{𝑛³}.

c. Let 𝑛 > 100. 𝑆 = {𝑘 ∈ [𝑛] | 𝑘 is prime}, 𝑇 = {0, 1}^{⌈log 𝑛−1⌉}.

Exercise 1.6 — Inclusion Exclusion.
a. Let 𝐴, 𝐵 be finite sets. Prove that |𝐴 ∪ 𝐵| = |𝐴| + |𝐵| − |𝐴 ∩ 𝐵|.

b. Let 𝐴0 , … , 𝐴𝑘−1 be finite sets. Prove that |𝐴0 ∪ ⋯ ∪ 𝐴𝑘−1 | ≥ ∑_{𝑖=0}^{𝑘−1} |𝐴𝑖 | − ∑_{0≤𝑖<𝑗<𝑘} |𝐴𝑖 ∩ 𝐴𝑗 |.

c. Let 𝐴0 , … , 𝐴𝑘−1 be finite subsets of {1, … , 𝑛}, such that |𝐴𝑖 | = 𝑚 for
every 𝑖 ∈ [𝑘]. Prove that if 𝑘 > 100𝑛, then there exist two distinct
sets 𝐴𝑖 , 𝐴𝑗 s.t. |𝐴𝑖 ∩ 𝐴𝑗 | ≥ 𝑚2 /(10𝑛).

Exercise 1.7Prove that if 𝑆, 𝑇 are finite and 𝐹 ∶ 𝑆 → 𝑇 is one to one


then |𝑆| ≤ |𝑇 |.

Exercise 1.8 Prove that if 𝑆, 𝑇 are finite and 𝐹 ∶ 𝑆 → 𝑇 is onto then


|𝑆| ≥ |𝑇 |.

Exercise 1.9 Prove that for every finite 𝑆, 𝑇 , there are (|𝑇 | + 1)^{|𝑆|} partial functions from 𝑆 to 𝑇 .

Exercise 1.10 Suppose that {𝑆𝑛 }𝑛∈ℕ is a sequence such that 𝑆0 ≤ 10 and for 𝑛 > 1, 𝑆𝑛 ≤ 5𝑆_{⌊𝑛/5⌋} + 2𝑛. Prove by induction that 𝑆𝑛 ≤ 100𝑛 log 𝑛 for every 𝑛.

Exercise 1.11 Prove that for every undirected graph 𝐺 of 100 vertices,
if every vertex has degree at most 4, then there exists a subset 𝑆 of at
least 20 vertices such that no two vertices in 𝑆 are neighbors of one
another.

Exercise 1.12 — 𝑂-notation. For every pair of functions 𝐹 , 𝐺 below, determine which of the following relations holds: 𝐹 = 𝑂(𝐺), 𝐹 = Ω(𝐺), 𝐹 = 𝑜(𝐺) or 𝐹 = 𝜔(𝐺).

a. 𝐹 (𝑛) = 𝑛, 𝐺(𝑛) = 100𝑛.

b. 𝐹 (𝑛) = √𝑛, 𝐺(𝑛) = 𝑛.

c. 𝐹 (𝑛) = 𝑛 log 𝑛, 𝐺(𝑛) = 2^{(log 𝑛)²} .

d. 𝐹 (𝑛) = √𝑛, 𝐺(𝑛) = 2^{√(log 𝑛)} .

e. 𝐹 (𝑛) = (𝑛 choose ⌈0.2𝑛⌉), 𝐺(𝑛) = 2^{0.1𝑛} (where (𝑛 choose 𝑘) denotes the number of 𝑘-sized subsets of a set of size 𝑛). (Hint: One way to do this is to use Stirling’s approximation for the factorial function.)

Exercise 1.13 Give an example of a pair of functions 𝐹 , 𝐺 ∶ ℕ → ℕ such


that neither 𝐹 = 𝑂(𝐺) nor 𝐺 = 𝑂(𝐹 ) holds.

Exercise 1.14 Prove that for every undirected graph 𝐺 on 𝑛 vertices, if 𝐺


has at least 𝑛 edges then 𝐺 contains a cycle.

Exercise 1.15 Prove that for every undirected graph 𝐺 of 1000 vertices,
if every vertex has degree at most 4, then there exists a subset 𝑆 of at
least 200 vertices such that no two vertices in 𝑆 are neighbors of one
another.

1.9 BIBLIOGRAPHICAL NOTES


The heading “A Mathematician’s Apology” refers to Hardy’s classic
book [Har41]. Even when Hardy is wrong, he is very much worth
reading.
There are many online sources for the mathematical background
needed for this book. In particular, the lecture notes for MIT 6.042
“Mathematics for Computer Science” [LLM18] are extremely com-
prehensive, and videos and assignments for this course are available
online. Similarly, Berkeley CS 70: “Discrete Mathematics and Proba-
bility Theory” has extensive lecture notes online.

Other sources for discrete mathematics are Rosen [Ros19] and
Jim Aspnes’ online book [Asp18]. Lewis and Zax [LZ19], as well
as the online book of Fleck [Fle18], give a more gentle overview of
much of the same material. Solow [Sol14] is a good introduction
to proof reading and writing. Kun [Kun18] gives an introduction
to mathematics aimed at readers with programming backgrounds.
Stanford’s CS 103 course has a wonderful collection of handouts on
mathematical proof techniques and discrete mathematics.
The word graph in the sense of Definition 1.3 was coined by the
mathematician Sylvester in 1878 in analogy with the chemical graphs
used to visualize molecules. There is an unfortunate confusion be-
tween this term and the more common usage of the word “graph” as
a way to plot data, and in particular a plot of some function 𝑓(𝑥) as a
function of 𝑥. One way to relate these two notions is to identify every
function 𝑓 ∶ 𝐴 → 𝐵 with the directed graph 𝐺𝑓 over the vertex set
𝑉 = 𝐴 ∪ 𝐵 such that 𝐺𝑓 contains the edge 𝑥 → 𝑓(𝑥) for every 𝑥 ∈ 𝐴. In
a graph 𝐺𝑓 constructed in this way, every vertex in 𝐴 has out-degree
equal to one. If the function 𝑓 is one to one then every vertex in 𝐵 has
in-degree at most one. If the function 𝑓 is onto then every vertex in 𝐵
has in-degree at least one. If 𝑓 is a bijection then every vertex in 𝐵 has
in-degree exactly equal to one.
Carl Pomerance’s quote is taken from the home page of Doron
Zeilberger.
2
Computation and Representation

Learning Objectives:
• Distinguish between specification and implementation, or equivalently
  between mathematical functions and algorithms/programs.
• Representing an object as a string (often of zeroes and ones).
• Examples of representations for common objects such as numbers, vectors,
  lists, and graphs.
• Prefix-free representations.
• Cantor’s Theorem: The real numbers cannot be represented exactly as
  finite strings.

“The alphabet (sic) was a great invention, which enabled men (sic) to store
and to learn with little effort what others had learned the hard way – that is, to
learn from books rather than from direct, possibly painful, contact with the real
world.”, B.F. Skinner

“The name of the song is called ‘HADDOCK’S EYES.”’ [said the Knight]
“Oh, that’s the name of the song, is it?” Alice said, trying to feel interested.
“No, you don’t understand,” the Knight said, looking a little vexed. “That’s
what the name is CALLED. The name really is ‘THE AGED AGED MAN.”’
“Then I ought to have said ‘That’s what the SONG is called’?” Alice cor-
rected herself.
“No, you oughtn’t: that’s quite another thing! The SONG is called ‘WAYS
AND MEANS’: but that’s only what it’s CALLED, you know!”
“Well, what IS the song, then?” said Alice, who was by this time com-
pletely bewildered.
“I was coming to that,” the Knight said. “The song really IS ‘A-SITTING ON
A GATE’: and the tune’s my own invention.”
Lewis Carroll, Through the Looking-Glass

To a first approximation, computation is a process that maps an input


to an output.
When discussing computation, it is essential to separate the ques-
tion of what is the task we need to perform (i.e., the specification) from
the question of how we achieve this task (i.e., the implementation).
For example, as we’ve seen, there is more than one way to achieve the
computational task of computing the product of two integers.
Figure 2.1: Our basic notion of computation is some process that maps an
input to an output.

In this chapter we focus on the what part, namely defining com-
putational tasks. For starters, we need to define the inputs and out-
puts. Capturing all the potential inputs and outputs that we might
ever want to compute seems challenging, since computation today is
applied to a wide variety of objects. We do not compute merely on
numbers, but also on texts, images, videos, connection graphs of social




networks, MRI scans, gene data, and even other programs. We will
represent all these objects as strings of zeroes and ones, that is objects
such as 0011101 or 1011 or any other finite list of 1’s and 0’s. (This
choice is for convenience: there is nothing “holy” about zeroes and
ones, and we could have used any other finite collection of symbols.)
Today, we are so used to the notion of digital representation that
we are not surprised by the existence of such an encoding. But it is
actually a deep insight with significant implications. Many animals
can convey a particular fear or desire, but what is unique about hu-
mans is language: we use a finite collection of basic symbols to describe
a potentially unlimited range of experiences. Language allows trans-
mission of information over both time and space and enables soci-
eties that span a great many people and accumulate a body of shared
knowledge over time.

Figure 2.2: We represent numbers, texts, images, networks and many other
objects using strings of zeroes and ones. Writing the zeroes and ones
themselves in green font over a black background is optional.

Over the last several decades, we have seen a revolution in what we
can represent and convey in digital form. We can capture experiences
can represent and convey in digital form. We can capture experiences
with almost perfect fidelity, and disseminate it essentially instanta-
neously to an unlimited audience. Moreover, once information is in
digital form, we can compute over it, and gain insights from data that
were not accessible in prior times. At the heart of this revolution is the
simple but profound observation that we can represent an unbounded
variety of objects using a finite set of symbols (and in fact using only
the two symbols 0 and 1).
In later chapters, we will typically take such representations for
granted, and hence use expressions such as “program 𝑃 takes 𝑥 as
input” when 𝑥 might be a number, a vector, a graph, or any other
object, when we really mean that 𝑃 takes as input the representation of
𝑥 as a binary string. However, in this chapter we will dwell a bit more
on how we can construct such representations.

This chapter: A non-mathy overview


The main takeaways from this chapter are:

• We can represent all kinds of objects we want to use as in-


puts and outputs using binary strings. For example, we can
use the binary basis to represent integers and rational num-
bers as binary strings (see Section 2.1.1 and Section 2.2).

• We can compose the representations of simple objects to


represent more complex objects. In this way, we can rep-
resent lists of integers or rational numbers, and use that
to represent objects such as matrices, images, and graphs.
Prefix-free encoding is one way to achieve such a composi-
tion (see Section 2.5.2).

• A computational task specifies a map from an input to an


output— a function. It is crucially important to distinguish
between the “what” and the “how”, or the specification
and implementation (see Section 2.6.1). A function simply
defines which output corresponds to which input. It does
not specify how to compute the output from the input, and
as we’ve seen in the context of multiplication, there can be
more than one way to compute the same function.

• While the set of all possible binary strings is infinite, it still


cannot represent everything. In particular, there is no rep-
resentation of the real numbers (with absolute accuracy) as
binary strings. This result is also known as “Cantor’s The-
orem” (see Section 2.4) and is typically referred to as the
result that the “reals are uncountable.” It also implies that there
are different levels of infinity, though we will not get into this
topic in this book (see Remark 2.10).

The two “big ideas” we discuss are Big Idea 1 - we can com-
pose representations for simple objects to represent more
complex objects and Big Idea 2 - it is crucial to distinguish be-
tween functions (“what”) and programs (“how”). The latter
will be a theme we will come back to time and again in this
book.

2.1 DEFINING REPRESENTATIONS


Every time we store numbers, images, sounds, databases, or other ob-
jects on a computer, what we actually store in the computer’s memory
is the representation of these objects. Moreover, the idea of representa-
tion is not restricted to digital computers. When we write down text or
make a drawing we are representing ideas or experiences as sequences
of symbols (which might as well be strings of zeroes and ones). Even
our brain does not store the actual sensory inputs we experience, but
rather only a representation of them.
To use objects such as numbers, images, graphs, or others as inputs
for computation, we need to define precisely how to represent these
objects as binary strings. A representation scheme is a way to map an ob-
ject 𝑥 to a binary string 𝐸(𝑥) ∈ {0, 1}∗ . For example, a representation
scheme for natural numbers is a function 𝐸 ∶ ℕ → {0, 1}∗ . Of course,
we cannot merely represent all numbers as the string “0011” (for ex-
ample). A minimal requirement is that if two numbers 𝑥 and 𝑥′ are
different then they would be represented by different strings. Another
way to say this is that we require the encoding function 𝐸 to be one to
one.

2.1.1 Representing natural numbers


We now show how we can represent natural numbers as binary
strings. Over the years people have represented numbers in a variety
of ways, including Roman numerals, tally marks, our own Hindu-
Arabic decimal system, and many others. We can use any one of
those as well as many others to represent a number as a string (see
Fig. 2.3). However, for the sake of concreteness, we use the binary
basis as our default representation of natural numbers as strings.
For example, we represent the number six as the string 110 since
1 ⋅ 2^2 + 1 ⋅ 2^1 + 0 ⋅ 2^0 = 6, and similarly we represent the number thirty-
five as the string 𝑦 = 100011 which satisfies ∑_{𝑖=0}^{5} 𝑦𝑖 ⋅ 2^{|𝑦|−𝑖−1} = 35.
Some more examples are given in the table below.

Table 2.1: Representing numbers in the binary basis. The left-hand column
contains representations of natural numbers in the decimal basis, while the
right-hand column contains representations of the same numbers in the
binary basis.

Number (decimal representation)    Number (binary representation)
0                                  0
1                                  1
2                                  10
5                                  101
16                                 10000
40                                 101000
53                                 110101
389                                110000101
3750                               111010100110

Figure 2.3: Representing each one of the digits 0, 1, 2, … , 9 as a 12 × 8
bitmap image, which can be thought of as a string in {0, 1}^96. Using this
scheme we can represent a natural number 𝑥 of 𝑛 decimal digits as a string
in {0, 1}^{96𝑛}. Image taken from blog post of A. C. Andersen.

If 𝑛 is even, then the least significant digit of 𝑛’s binary representa-


tion is 0, while if 𝑛 is odd then this digit equals 1. Just like the number
⌊𝑛/10⌋ corresponds to “chopping off” the least significant decimal
digit (e.g., ⌊457/10⌋ = ⌊45.7⌋ = 45), the number ⌊𝑛/2⌋ corresponds
to the “chopping off” the least significant binary digit. Hence the bi-
nary representation can be formally defined as the following function
𝑁 𝑡𝑆 ∶ ℕ → {0, 1}∗ (𝑁 𝑡𝑆 stands for “natural numbers to strings”):

           ⎧ 0                          𝑛 = 0
𝑁𝑡𝑆(𝑛) =   ⎨ 1                          𝑛 = 1          (2.1)
           ⎩ 𝑁𝑡𝑆(⌊𝑛/2⌋) 𝑝𝑎𝑟𝑖𝑡𝑦(𝑛)       𝑛 > 1

where 𝑝𝑎𝑟𝑖𝑡𝑦 ∶ ℕ → {0, 1} is the function defined as 𝑝𝑎𝑟𝑖𝑡𝑦(𝑛) = 0
if 𝑛 is even and 𝑝𝑎𝑟𝑖𝑡𝑦(𝑛) = 1 if 𝑛 is odd, and as usual, for strings
𝑥, 𝑦 ∈ {0, 1}∗ , 𝑥𝑦 denotes the concatenation of 𝑥 and 𝑦. The function

𝑁 𝑡𝑆 is defined recursively: for every 𝑛 > 1 we define 𝑁 𝑡𝑆(𝑛) in terms


of the representation of the smaller number ⌊𝑛/2⌋. It is also possible to
define 𝑁 𝑡𝑆 non-recursively, see Exercise 2.2.
Throughout most of this book, the particular choices of represen-
tation of numbers as binary strings would not matter much: we just
need to know that such a representation exists. In fact, for many of our
purposes we can even use the simpler representation of mapping a
natural number 𝑛 to the length-𝑛 all-zero string 0𝑛 .

R
Remark 2.1 — Binary representation in python (optional).
We can implement the binary representation in Python
as follows:

def NtS(n):  # natural numbers to strings
    if n > 1:
        return NtS(n // 2) + str(n % 2)
    else:
        return str(n % 2)

print(NtS(236))
# 11101100

print(NtS(19))
# 10011

We can also use Python to implement the inverse


transformation, mapping a string back to the natural
number it represents.

def StN(x):  # String to number
    k = len(x)-1
    return sum(int(x[i])*(2**(k-i)) for i in range(k+1))

print(StN(NtS(236)))
# 236

R
Remark 2.2 — Programming examples. In this book,
we sometimes use code examples as in Remark 2.1.
The point is always to emphasize that certain com-
putations can be achieved concretely, rather than
illustrating the features of Python or any other pro-
gramming language. Indeed, one of the messages of
this book is that all programming languages are in
a certain precise sense equivalent to one another, and
hence we could have just as well used JavaScript, C,
COBOL, Visual Basic or even BrainF*ck. This book
is not about programming, and it is absolutely OK if

you are not familiar with Python or do not follow code


examples such as those in Remark 2.1.

2.1.2 Meaning of representations (discussion)


It is natural for us to think of 236 as the “actual” number, and of
11101100 as “merely” its representation. However, for most Europeans in
the middle ages CCXXXVI would be the “actual” number and 236 (if they
have even heard about it) would be the weird Hindu-Arabic positional
representation.1 When our AI robot overlords materialize, they will
probably think of 11101100 as the “actual” number and of 236 as “merely”
a representation that they need to use when they give commands to humans.

1 While the Babylonians already invented a positional system much earlier,
the decimal positional system we use today was invented by Indian
mathematicians around the third century. Arab mathematicians took it up in
the 8th century. It first received significant attention in Europe with the
publication of the 1202 book “Liber Abaci” by Leonardo of Pisa, also known
as Fibonacci, but it did not displace Roman numerals in common usage until
the 15th century.

So what is the “actual” number? This is a question that philosophers of
mathematics have pondered throughout history. Plato argued that
mathematical objects exist in some ideal sphere of existence
(that to a certain extent is more “real” than the world we perceive
via our senses, as this latter world is merely the shadow of this ideal
sphere). In Plato’s vision, the symbols 236 are merely notation for
some ideal object, that, in homage to the late musician, we can refer to
as “the number commonly represented by 236”.
The Austrian philosopher Ludwig Wittgenstein, on the other hand,
argued that mathematical objects do not exist at all, and the only
things that exist are the actual marks on paper that make up 236,
11101100 or CCXXXVI. In Wittgenstein’s view, mathematics is merely
about formal manipulation of symbols that do not have any inherent
meaning. You can think of the “actual” number as (somewhat recur-
sively) “that thing which is common to 236, 11101100 and CCXXXVI
and all other past and future representations that are meant to capture
the same object”.
While reading this book, you are free to choose your own phi-
losophy of mathematics, as long as you maintain the distinction be-
tween the mathematical objects themselves and the various particular
choices of representing them, whether as splotches of ink, pixels on a
screen, zeroes and ones, or any other form.

2.2 REPRESENTATIONS BEYOND NATURAL NUMBERS


We have seen that natural numbers can be represented as binary
strings. We now show that the same is true for other types of objects,
including (potentially negative) integers, rational numbers, vectors,
lists, graphs and many others. In many instances, choosing the “right”
string representation for a piece of data is highly non-trivial, and find-
ing the “best” one (e.g., most compact, best fidelity, most efficiently
manipulable, robust to errors, most informative features, etc.) is the

object of intense research. But for now, we focus on presenting some


simple representations for various objects that we would like to use as
inputs and outputs for computation.

2.2.1 Representing (potentially negative) integers


Since we can represent natural numbers as strings, we can
represent the full set of integers (i.e., members of the set
ℤ = {… , −3, −2, −1, 0, +1, +2, +3, …} ) by adding one more bit
that represents the sign. To represent a (potentially negative) number
𝑚, we prepend to the representation of the natural number |𝑚| a bit 𝜎
that equals 0 if 𝑚 ≥ 0 and equals 1 if 𝑚 < 0. Formally, we define the
function 𝑍𝑡𝑆 ∶ ℤ → {0, 1}∗ as follows


           ⎧ 0 𝑁𝑡𝑆(𝑚)       𝑚 ≥ 0
𝑍𝑡𝑆(𝑚) =   ⎨
           ⎩ 1 𝑁𝑡𝑆(−𝑚)      𝑚 < 0
where 𝑁 𝑡𝑆 is defined as in (2.1).
While the encoding function of a representation needs to be one
to one, it does not have to be onto. For example, in the representation
above there is no number that is represented by the empty string
but it is still a fine representation, since every integer is represented
uniquely by some string.
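
To make this concrete, here is a short Python sketch in the spirit of
Remark 2.1; the names ZtS and StZ are merely illustrative, and the sketch
builds on the NtS and StN functions from that remark.

def ZtS(m):  # integers to strings: a sign bit followed by NtS(|m|)
    return ("0" if m >= 0 else "1") + NtS(abs(m))

def StZ(x):  # strings back to integers
    sign = -1 if x[0] == "1" else 1
    return sign * StN(x[1:])

print(ZtS(35))        # 0100011
print(ZtS(-35))       # 1100011
print(StZ(ZtS(-35)))  # -35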

R
Remark 2.3 — Interpretation and context. Given a string
𝑦 ∈ {0, 1}∗ , how do we know if it’s “supposed” to
represent a (non-negative) natural number or a (po-
tentially negative) integer? For that matter, even if
we know 𝑦 is “supposed” to be an integer, how do
we know what representation scheme it uses? The
short answer is that we do not necessarily know this
information, unless it is supplied from the context. (In
programming languages, the compiler or interpreter
determines the representation of the sequence of bits
corresponding to a variable based on the variable’s
type.) We can treat the same string 𝑦 as representing a
natural number, an integer, a piece of text, an image,
or a green gremlin. Whenever we say a sentence such
as “let 𝑛 be the number represented by the string 𝑦,”
we will assume that we are fixing some canonical rep-
resentation scheme such as the ones above. The choice
of the particular representation scheme will rarely
matter, except that we want to make sure to stick with
the same one for consistency.
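
For instance, the following short snippet (an illustrative sketch, using
the StN function of Remark 2.1 and the StZ sketch above) reads the very
same string of bits in three different ways:

bits = "11000001"
print(StN(bits))          # 193, reading it as a natural number
print(StZ(bits))          # -65, reading it as a signed-magnitude integer
print(chr(int(bits, 2)))  # 'Á', reading it as a character code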

2.2.2 Two’s complement representation (optional)


Section 2.2.1’s approach of representing an integer using a specific
“sign bit” is known as the Signed Magnitude Representation and was

used in some early computers. However, the two’s complement rep-


resentation is much more common in practice. The two’s complement
representation of an integer 𝑘 in the set {−2𝑛 , −2𝑛 + 1, … , 2𝑛 − 1} is the
string 𝑍𝑡𝑆𝑛 (𝑘) of length 𝑛 + 1 defined as follows:


             ⎧ 𝑁𝑡𝑆_{𝑛+1}(𝑘)               0 ≤ 𝑘 ≤ 2^𝑛 − 1
𝑍𝑡𝑆_𝑛(𝑘) =   ⎨                                                  ,
             ⎩ 𝑁𝑡𝑆_{𝑛+1}(2^{𝑛+1} + 𝑘)      −2^𝑛 ≤ 𝑘 ≤ −1

where 𝑁𝑡𝑆_ℓ(𝑚) denotes the standard binary representation of a num-
ber 𝑚 ∈ {0, … , 2^ℓ − 1} as a string of length ℓ, padded with leading zeros
as needed. For example, if 𝑛 = 3 then 𝑍𝑡𝑆3 (1) = 𝑁 𝑡𝑆4 (1) = 0001,
𝑍𝑡𝑆3 (2) = 𝑁 𝑡𝑆4 (2) = 0010, 𝑍𝑡𝑆3 (−1) = 𝑁 𝑡𝑆4 (16 − 1) = 1111, and
𝑍𝑡𝑆3 (−8) = 𝑁 𝑡𝑆4 (16 − 8) = 1000. If 𝑘 is a negative number larger than
or equal to −2^𝑛 then 2^{𝑛+1} + 𝑘 is a number between 2^𝑛 and 2^{𝑛+1} − 1.
Hence the two’s complement representation of such a number 𝑘 is a
string of length 𝑛 + 1 with its first digit equal to 1.
Another way to say this is that we represent a potentially negative
number 𝑘 ∈ {−2^𝑛 , … , 2^𝑛 − 1} as the non-negative number 𝑘 mod 2^{𝑛+1}
(see also Fig. 2.4). This means that if two (potentially negative) num-
bers 𝑘 and 𝑘′ are not too large (i.e., 𝑘 + 𝑘′ ∈ {−2^𝑛 , … , 2^𝑛 − 1}),
then we can compute the representation of 𝑘 + 𝑘′ by adding modulo
2^{𝑛+1} the representations of 𝑘 and 𝑘′ as if they were non-negative inte-
gers. This property of the two’s complement representation is its main
attraction since, depending on their architectures, microprocessors
can often perform arithmetic operations modulo 2^𝑤 very efficiently
(for certain values of 𝑤 such as 32 and 64). Many systems leave it to
the programmer to check that values are not too large and will carry
out this modular arithmetic regardless of the size of the numbers in-
volved. For this reason, in some systems adding two large positive
numbers can result in a negative number (e.g., adding 2^𝑛 − 100 and
2^𝑛 − 200 might result in −300 since (2^{𝑛+1} − 300) mod 2^{𝑛+1} = −300,
see also Fig. 2.4).
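
The following Python sketch (again in the spirit of Remark 2.1; the names
ZtS2C and StZ2C are illustrative) implements this representation and shows
the overflow phenomenon of Fig. 2.4 for 𝑛 = 3:

def ZtS2C(n, k):  # two's complement: represent k in {-2^n,...,2^n - 1} with n+1 bits
    assert -2**n <= k <= 2**n - 1
    return format(k % 2**(n+1), "0" + str(n+1) + "b")

def StZ2C(n, x):  # decode an (n+1)-bit two's complement string back to an integer
    m = int(x, 2)
    return m if m <= 2**n - 1 else m - 2**(n+1)

n = 3
print(ZtS2C(n, 6), ZtS2C(n, 5), ZtS2C(n, -5))  # 0110 0101 1011
# adding the representations of 6 and 5 modulo 2^(n+1) and decoding gives -5:
s = (int(ZtS2C(n, 6), 2) + int(ZtS2C(n, 5), 2)) % 2**(n+1)
print(StZ2C(n, format(s, "0" + str(n+1) + "b")))  # -5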
Figure 2.4: In the two’s complement representation we represent a
potentially negative integer 𝑘 ∈ {−2^𝑛 , … , 2^𝑛 − 1} as an 𝑛 + 1 length
string using the binary representation of the integer 𝑘 mod 2^{𝑛+1}. On the
left-hand side: this representation for 𝑛 = 3 (the red integers are the
numbers being represented by the blue binary strings). If a microprocessor
does not check for overflows, adding the two positive numbers 6 and 5 might
result in the negative number −5 (since −5 mod 16 = 11). The right-hand
side is a C program that will on some 32 bit architecture print a negative
number after adding two positive numbers. (Integer overflow in C is
considered undefined behavior which means the result of this program,
including whether it runs or crashes, could differ depending on the
architecture, compiler, and even compiler options and version.)

2.2.3 Rational numbers and representing pairs of strings

We can represent a rational number of the form 𝑎/𝑏 by represent-
ing the two numbers 𝑎 and 𝑏. However, merely concatenating the
representations of 𝑎 and 𝑏 will not work. For example, the binary rep-
resentation of 4 is 100 and the binary representation of 43 is 101011,
but the concatenation 100101011 of these strings is also the concatena-
tion of the representation 10010 of 18 and the representation 1011 of
11. Hence, if we used such simple concatenation then we would not
be able to tell if the string 100101011 is supposed to represent 4/43 or
18/11.

We tackle this by giving a general representation for pairs of strings.
If we were using a pen and paper, we would just use a separator sym-

bol such as ‖ to represent, for example, the pair consisting of the num-
bers represented by 10 and 110001 as the length-9 string “10‖110001”.
In other words, there is a one to one map 𝐹 from pairs of strings
𝑥, 𝑦 ∈ {0, 1}∗ into a single string 𝑧 over the alphabet Σ = {0, 1, ‖}
(in other words, 𝑧 ∈ Σ∗ ). Using such separators is similar to the
way we use spaces and punctuation to separate words in English. By
adding a little redundancy, we achieve the same effect in the digital
domain. We can map the three-element set Σ to the three-element set
{00, 11, 01} ⊂ {0, 1}2 in a one-to-one fashion, and hence encode a
length 𝑛 string 𝑧 ∈ Σ∗ as a length 2𝑛 string 𝑤 ∈ {0, 1}∗ .
Our final representation for rational numbers is obtained by com-
posing the following steps:
1. Representing a (potentially negative) rational number as a pair of
integers 𝑎, 𝑏 such that 𝑟 = 𝑎/𝑏.

2. Representing an integer by a string via the binary representation.

3. Combining 1 and 2 to obtain a representation of a rational number


as a pair of strings.

4. Representing a pair of strings over {0, 1} as a single string over


Σ = {0, 1, ‖}.

5. Representing a string over Σ as a longer string over {0, 1}.

■ Example 2.4 — Representing a rational number as a string. Consider the


rational number 𝑟 = −5/8. We represent −5 as 1101 and +8 as
01000, and so we can represent 𝑟 as the pair of strings (1101, 01000)
and represent this pair as the length 10 string 1101‖01000 over
the alphabet {0, 1, ‖}. Now, applying the map 0 ↦ 00, 1 ↦ 11,
‖ ↦ 01, we can represent the latter string as the length 20 string
𝑠 = 11110011010011000000 over the alphabet {0, 1}.
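
The steps above can be carried out in a few lines of Python; the following
sketch (using the NtS function of Remark 2.1 and the illustrative ZtS
function from the earlier sketch) reproduces Example 2.4:

def RationalToString(a, b):
    # encode a and b with a sign bit, join them with the separator "|",
    # then map 0 -> 00, 1 -> 11, | -> 01 to get back to the alphabet {0,1}
    two_symbol = {"0": "00", "1": "11", "|": "01"}
    with_sep = ZtS(a) + "|" + ZtS(b)
    return "".join(two_symbol[c] for c in with_sep)

print(RationalToString(-5, 8))
# 11110011010011000000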

The same idea can be used to represent triples of strings, quadru-


ples, and so on as a string. Indeed, this is one instance of a very gen-
eral principle that we use time and again in both the theory and prac-
tice of computer science (for example, in Object Oriented program-
ming):

 Big Idea 1 If we can represent objects of type 𝑇 as strings, then we


can represent tuples of objects of type 𝑇 as strings as well.

Repeating the same idea, once we can represent objects of type 𝑇 ,


we can also represent lists of lists of such objects, and even lists of lists
of lists and so on and so forth. We will come back to this point when
we discuss prefix free encoding in Section 2.5.2.

2.3 REPRESENTING REAL NUMBERS


The set of real numbers ℝ contains all numbers including positive,
negative, and fractional, as well as irrational numbers such as 𝜋 or 𝑒.
Every real number can be approximated by a rational number, and
thus we can represent every real number 𝑥 by a rational number 𝑎/𝑏
that is very close to 𝑥. For example, we can represent 𝜋 by 22/7 within
an error of about 10^{−3} . If we want a smaller error (e.g., about 10^{−4} )
then we can use 311/99, and so on and so forth.

Figure 2.5: The floating-point representation of a real number 𝑥 ∈ ℝ is its
approximation as a number of the form 𝜎𝑏 ⋅ 2^𝑒 where 𝜎 ∈ {±1}, 𝑒 is a
(potentially negative) integer, and 𝑏 is a rational number between 1 and 2
expressed as a binary fraction 1.𝑏1𝑏2 … 𝑏𝑘 for some 𝑏1, … , 𝑏𝑘 ∈ {0, 1}
(that is 𝑏 = 1 + 𝑏1/2 + 𝑏2/4 + … + 𝑏𝑘/2^𝑘). Commonly-used floating-point
representations fix the numbers ℓ and 𝑘 of bits to represent 𝑒 and 𝑏
respectively. In the example above, assuming we use two’s complement
representation for 𝑒, the number represented is
−1 × 2^5 × (1 + 1/2 + 1/4 + 1/64 + 1/512) = −56.5625.

The above representation of real numbers via rational numbers that
approximate them is a fine choice for a representation scheme. How-
ever, typically in computing applications, it is more common to use
the floating-point representation scheme (see Fig. 2.5) to represent real
numbers. In the floating-point representation scheme we represent
𝑥 ∈ ℝ by the pair (𝑏, 𝑒) of (positive or negative) integers of some pre-
scribed sizes (determined by the desired accuracy) such that 𝑏 × 2^𝑒
is closest to 𝑥. Floating-point representation is the base-two version
of scientific notation, where one represents a number 𝑦 ∈ ℝ as its
approximation of the form 𝑏 × 10^𝑒 for 𝑏, 𝑒. It is called “floating-point”
because we can think of the number 𝑏 as specifying a sequence of
binary digits, and 𝑒 as describing the location of the “binary point”
within this sequence. The use of floating representation is the reason
why in many programming systems, printing the expression 0.1+0.2
will result in 0.30000000000000004 and not 0.3, see here, here and
here for more.
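
The following snippet (plain Python, nothing specific to this book) shows
this behavior, and also how exact rational arithmetic avoids it:

from fractions import Fraction

print(0.1 + 0.2)                    # 0.30000000000000004
print(0.1 + 0.2 == 0.3)             # False
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
# the culprit: 0.1 has no finite binary expansion, so the stored value is
# only close to 1/10:
print(Fraction(0.1))                # 3602879701896397/36028797018963968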
The reader might be (rightly) worried about the fact that the
floating-point representation (or the rational number one) can only
approximately represent real numbers. In many (though not all) com-
putational applications, one can make the accuracy tight enough so
that this does not affect the final result, though sometimes we do need
to be careful. Indeed, floating-point bugs can sometimes be no jok-
ing matter. For example, floating-point rounding errors have been
implicated in the failure of a U.S. Patriot missile to intercept an Iraqi
Scud missile, costing 28 lives, as well as a 100 million pound error in
computing payouts to British pensioners.

Figure 2.6: XKCD cartoon on floating-point arithmetic.

2.4 CANTOR’S THEOREM, COUNTABLE SETS, AND STRING REP-


RESENTATIONS OF THE REAL NUMBERS
“For any collection of fruits, we can make more fruit salads than there are
fruits. If not, we could label each salad with a different fruit, and consider the
salad of all fruits not in their salad. The label of this salad is in it if and only if
it is not.”, Martha Storey.

Given the issues with floating-point approximations for real num-


bers, a natural question is whether it is possible to represent real num-
bers exactly as strings. Unfortunately, the following theorem shows
that this cannot be done:

Theorem 2.5 — Cantor’s Theorem. There does not exist a one-to-one
function 𝑅𝑡𝑆 ∶ ℝ → {0, 1}∗ .2

2 𝑅𝑡𝑆 stands for “real numbers to strings”.

Countable sets. We say that a set 𝑆 is countable if there is an onto


map 𝐶 ∶ ℕ → 𝑆, or in other words, we can write 𝑆 as the sequence
𝐶(0), 𝐶(1), 𝐶(2), …. Since the binary representation yields an onto
map from {0, 1}∗ to ℕ, and the composition of two onto maps is onto,
a set 𝑆 is countable iff there is an onto map from {0, 1}∗ to 𝑆. Using
the basic properties of functions (see Section 1.4.3), a set is countable
if and only if there is a one-to-one function from 𝑆 to {0, 1}∗ . Hence,
we can rephrase Theorem 2.5 as follows:

Theorem 2.6 — Cantor’s Theorem (equivalent statement). The reals are
uncountable. That is, there does not exist an onto function 𝑁𝑡𝑅 ∶ ℕ → ℝ.

Theorem 2.6 was proven by Georg Cantor in 1874. This result (and
the theory around it) was quite shocking to mathematicians at the
time. By showing that there is no one-to-one map from ℝ to {0, 1}∗ (or
ℕ), Cantor showed that these two infinite sets have “different forms of
infinity” and that the set of real numbers ℝ is in some sense “bigger”
than the infinite set {0, 1}∗ . The notion that there are “shades of infin-
ity” was deeply disturbing to mathematicians and philosophers at the
time. The philosopher Ludwig Wittgenstein (whom we mentioned be-
fore) called Cantor’s results “utter nonsense” and “laughable.” Others
thought they were even worse than that. Leopold Kronecker called
Cantor a “corrupter of youth,” while Henri Poincaré said that Can-
tor’s ideas “should be banished from mathematics once and for all.”
The tide eventually turned, and these days Cantor’s work is univer-
sally accepted as the cornerstone of set theory and the foundations of
mathematics. As David Hilbert said in 1925, “No one shall expel us from
the paradise which Cantor has created for us.” As we will see later in this

book, Cantor’s ideas also play a huge role in the theory of computa-
tion.
Now that we have discussed Theorem 2.5’s importance, let us see
the proof. It is achieved in two steps:

1. Define some infinite set 𝒳 for which it is easier for us to prove that
𝒳 is not countable (namely, it’s easier for us to prove that there is
no one-to-one function from 𝒳 to {0, 1}∗ ).

2. Prove that there is a one-to-one function 𝐺 mapping 𝒳 to ℝ.

We can use a proof by contradiction to show that these two facts


together imply Theorem 2.5. Specifically, if we assume (towards the
sake of contradiction) that there exists some one-to-one 𝐹 mapping ℝ
to {0, 1}∗ , then the function 𝑥 ↦ 𝐹 (𝐺(𝑥)) obtained by composing 𝐹
with the function 𝐺 from Step 2 above would be a one-to-one function
from 𝒳 to {0, 1}∗ , which contradicts what we proved in Step 1!
To turn this idea into a full proof of Theorem 2.5 we need to:

• Define the set 𝒳.

• Prove that there is no one-to-one function from 𝒳 to {0, 1}∗

• Prove that there is a one-to-one function from 𝒳 to ℝ.

We now proceed to do precisely that. That is, we will define the set
{0, 1}∞ , which will play the role of 𝒳, and then state and prove two
lemmas that show that this set satisfies our two desired properties.

Definition 2.7 We denote by {0, 1}∞ the set {𝑓 | 𝑓 ∶ ℕ → {0, 1}}.

That is, {0, 1}∞ is a set of functions, and a function 𝑓 is in {0, 1}∞
iff its domain is ℕ and its codomain is {0, 1}. We can also think of
{0, 1}∞ as the set of all infinite sequences of bits, since a function 𝑓 ∶
ℕ → {0, 1} can be identified with the sequence (𝑓(0), 𝑓(1), 𝑓(2), …).
The following two lemmas show that {0, 1}∞ can play the role of 𝒳 to
establish Theorem 2.5.
Lemma 2.8 There does not exist a one-to-one map 𝐹𝑡𝑆 ∶ {0, 1}∞ → {0, 1}∗ .3

3 𝐹𝑡𝑆 stands for “functions to strings”.

Lemma 2.9 There does exist a one-to-one map 𝐹𝑡𝑅 ∶ {0, 1}∞ → ℝ.4

4 𝐹𝑡𝑅 stands for “functions to reals.”
As we’ve seen above, Lemma 2.8 and Lemma 2.9 together imply
Theorem 2.5. To repeat the argument more formally, suppose, for
the sake of contradiction, that there did exist a one-to-one function
𝑅𝑡𝑆 ∶ ℝ → {0, 1}∗ . By Lemma 2.9, there exists a one-to-one function
𝐹 𝑡𝑅 ∶ {0, 1}∞ → ℝ. Thus, under this assumption, since the composi-
tion of two one-to-one functions is one-to-one (see Exercise 2.12), the

function 𝐹 𝑡𝑆 ∶ {0, 1}∞ → {0, 1}∗ defined as 𝐹 𝑡𝑆(𝑓) = 𝑅𝑡𝑆(𝐹 𝑡𝑅(𝑓))


will be one to one, contradicting Lemma 2.8. See Fig. 2.7 for a graphi-
cal illustration of this argument.

Figure 2.7: We prove Theorem 2.5 by combining


Lemma 2.8 and Lemma 2.9. Lemma 2.9, which uses
standard calculus tools, shows the existence of a
one-to-one map 𝐹 𝑡𝑅 from the set {0, 1}∞ to the real
numbers. So, if a hypothetical one-to-one map 𝑅𝑡𝑆 ∶
ℝ → {0, 1}∗ existed, then we could compose them
to get a one-to-one map 𝐹 𝑡𝑆 ∶ {0, 1}∞ → {0, 1}∗ .
Yet this contradicts Lemma 2.8- the heart of the proof-
which rules out the existence of such a map.

Now all that is left is to prove these two lemmas. We start by prov-
ing Lemma 2.8 which is really the heart of Theorem 2.5.

Figure 2.8: We construct a function 𝑑 such that 𝑑 ≠


𝑆𝑡𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ by ensuring that
𝑑(𝑛(𝑥)) ≠ 𝑆𝑡𝐹 (𝑥)(𝑛(𝑥)) for every 𝑥 ∈ {0, 1}∗
with lexicographic order 𝑛(𝑥). We can think of this
as building a table where the columns correspond
to numbers 𝑚 ∈ ℕ and the rows correspond to
𝑥 ∈ {0, 1}∗ (sorted according to 𝑛(𝑥)). If the entry
in the 𝑥-th row and the 𝑚-th column corresponds to
𝑔(𝑚)) where 𝑔 = 𝑆𝑡𝐹 (𝑥) then 𝑑 is obtained by going
over the “diagonal” elements in this table (the entries
corresponding to the 𝑥-th row and 𝑛(𝑥)-th column)
and ensuring that 𝑑(𝑛(𝑥)) ≠ 𝑆𝑡𝐹 (𝑥)(𝑛(𝑥)).

Warm-up: ”Baby Cantor”. The proof of Lemma 2.8 is rather subtle. One
way to get intuition for it is to consider the following finite statement
“there is no onto function 𝑓 ∶ {0, … , 99} → {0, 1}^100 ”. Of course
we know it’s true since the set {0, 1}^100 is bigger than the set [100],
but let’s see a direct proof. For every 𝑓 ∶ {0, … , 99} → {0, 1}^100 , we
can define the string 𝑑 ∈ {0, 1}^100 as follows: 𝑑 = (1 − 𝑓(0)_0 , 1 −
𝑓(1)_1 , … , 1 − 𝑓(99)_99 ). If 𝑓 was onto, then there would exist some
𝑛 ∈ [100] such that 𝑓(𝑛) = 𝑑, but we claim that no such 𝑛 exists.

Indeed, if there was such 𝑛, then the 𝑛-th coordinate of 𝑑 would equal
𝑓(𝑛)_𝑛 but by definition this coordinate equals 1 − 𝑓(𝑛)_𝑛 . See also a
“proof by code” of this statement.
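
Such a “proof by code” could look as follows (an illustrative sketch, not
the code linked to in the original text): given any list of 100 strings in
{0, 1}^100, it constructs a string 𝑑 that differs from the 𝑖-th string in
the 𝑖-th coordinate, and hence cannot appear in the list.

import random

def diagonal(f):
    # f is a list of 100 strings, each of length 100, representing a
    # function f: {0,...,99} -> {0,1}^100
    return "".join("1" if f[i][i] == "0" else "0" for i in range(100))

f = ["".join(random.choice("01") for _ in range(100)) for _ in range(100)]
d = diagonal(f)
print(d in f)  # False: d differs from f[i] in coordinate i, for every i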

Proof of Lemma 2.8. We will prove that there does not exist an onto
function 𝑆𝑡𝐹 ∶ {0, 1}∗ → {0, 1}∞ . This implies the lemma since
for every two sets 𝐴 and 𝐵, there exists an onto function from 𝐴 to
𝐵 if and only if there exists a one-to-one function from 𝐵 to 𝐴 (see
Lemma 1.2).
The technique of this proof is known as the “diagonal argument”
and is illustrated in Fig. 2.8. We assume, towards a contradiction, that
there exists such a function 𝑆𝑡𝐹 ∶ {0, 1}∗ → {0, 1}∞ . We will show
that 𝑆𝑡𝐹 is not onto by demonstrating a function 𝑑 ∈ {0, 1}∞ such that
𝑑 ≠ 𝑆𝑡𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ . Consider the lexicographic ordering
of binary strings (i.e., "",0,1,00,01,…). For every 𝑛 ∈ ℕ, we let 𝑥𝑛 be the
𝑛-th string in this order. That is 𝑥0 = "", 𝑥1 = 0, 𝑥2 = 1 and so on and
so forth. We define the function 𝑑 ∈ {0, 1}∞ as follows:

𝑑(𝑛) = 1 − 𝑆𝑡𝐹 (𝑥𝑛 )(𝑛)

for every 𝑛 ∈ ℕ. That is, to compute 𝑑 on input 𝑛 ∈ ℕ, we first com-


pute 𝑔 = 𝑆𝑡𝐹 (𝑥𝑛 ), where 𝑥𝑛 ∈ {0, 1}∗ is the 𝑛-th string in the lexico-
graphical ordering. Since 𝑔 ∈ {0, 1}∞ , it is a function mapping ℕ to
{0, 1}. The value 𝑑(𝑛) is defined to be the negation of 𝑔(𝑛).
The definition of the function 𝑑 is a bit subtle. One way to think
about it is to imagine the function 𝑆𝑡𝐹 as being specified by an in-
finitely long table, in which every row corresponds to a string 𝑥 ∈
{0, 1}∗ (with strings sorted in lexicographic order), and contains the
sequence 𝑆𝑡𝐹 (𝑥)(0), 𝑆𝑡𝐹 (𝑥)(1), 𝑆𝑡𝐹 (𝑥)(2), …. The diagonal elements in
this table are the values

𝑆𝑡𝐹 ("")(0), 𝑆𝑡𝐹 (0)(1), 𝑆𝑡𝐹 (1)(2), 𝑆𝑡𝐹 (00)(3), 𝑆𝑡𝐹 (01)(4), …

which correspond to the elements 𝑆𝑡𝐹 (𝑥𝑛 )(𝑛) in the 𝑛-th row and
𝑛-th column of this table for 𝑛 = 0, 1, 2, …. The function 𝑑 we defined
above maps every 𝑛 ∈ ℕ to the negation of the 𝑛-th diagonal value.
To complete the proof that 𝑆𝑡𝐹 is not onto we need to show that
𝑑 ≠ 𝑆𝑡𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ . Indeed, let 𝑥 ∈ {0, 1}∗ be some string
and let 𝑔 = 𝑆𝑡𝐹 (𝑥). If 𝑛 is the position of 𝑥 in the lexicographical
order then by construction 𝑑(𝑛) = 1 − 𝑔(𝑛) ≠ 𝑔(𝑛) which means that
𝑔 ≠ 𝑑 which is what we wanted to prove.


R
Remark 2.10 — Generalizing beyond strings and reals.
Lemma 2.8 doesn’t really have much to do with the
natural numbers or the strings. An examination of
the proof shows that it really shows that for every
set 𝑆, there is no one-to-one map 𝐹 ∶ {0, 1}^𝑆 → 𝑆
where {0, 1}^𝑆 denotes the set {𝑓 | 𝑓 ∶ 𝑆 → {0, 1}}
of all Boolean functions with domain 𝑆. Since we can
identify a subset 𝑉 ⊆ 𝑆 with its characteristic function
𝑓 = 1𝑉 (i.e., 1𝑉 (𝑥) = 1 iff 𝑥 ∈ 𝑉 ), we can think of
{0, 1}^𝑆 also as the set of all subsets of 𝑆. This set
is sometimes called the power set of 𝑆 and denoted by
𝒫(𝑆) or 2^𝑆 .
The proof of Lemma 2.8 can be generalized to show
that there is no one-to-one map between a set and its
power set. In particular, it means that the set {0, 1}ℝ is
“even bigger” than ℝ. Cantor used these ideas to con-
struct an infinite hierarchy of shades of infinity. The
number of such shades turns out to be much larger
than |ℕ| or even |ℝ|. He denoted the cardinality of ℕ
by ℵ0 and denoted the next largest infinite number
by ℵ1 . (ℵ is the first letter in the Hebrew alphabet.)
Cantor also made the continuum hypothesis that
|ℝ| = ℵ1 . We will come back to the fascinating story
of this hypothesis later on in this book. This lecture of
Aaronson mentions some of these issues (see also this
Berkeley CS 70 lecture).

To complete the proof of Theorem 2.5, we need to show Lemma 2.9.


This requires some calculus background but is otherwise straightfor-
ward. If you have not had much experience with limits of a real series
before, then the formal proof below might be a little hard to follow.
This part is not the core of Cantor’s argument, nor are such limits
important to the remainder of this book, so you can feel free to take
Lemma 2.9 on faith and skip the proof.

Proof Idea:
We define 𝐹 𝑡𝑅(𝑓) to be the number between 0 and 2 whose dec-
imal expansion is 𝑓(0).𝑓(1)𝑓(2) …, or in other words 𝐹𝑡𝑅(𝑓) =
∑_{𝑖=0}^{∞} 𝑓(𝑖) ⋅ 10^{−𝑖} . If 𝑓 and 𝑔 are two distinct functions in {0, 1}∞ , then
there must be some input 𝑘 in which they disagree. If we take the
minimum such 𝑘, then the numbers 𝑓(0).𝑓(1)𝑓(2) … 𝑓(𝑘 − 1)𝑓(𝑘) …
and 𝑔(0).𝑔(1)𝑔(2) … 𝑔(𝑘) … agree with each other all the way up to the
𝑘 − 1-th digit after the decimal point, and disagree on the 𝑘-th digit.
But then these numbers must be distinct. Concretely, if 𝑓(𝑘) = 1 and
𝑔(𝑘) = 0 then the first number is larger than the second, and otherwise
(𝑓(𝑘) = 0 and 𝑔(𝑘) = 1) the first number is smaller than the second.
In the proof we have to be a little careful since these are numbers with
infinite expansions. For example, the number one half has two decimal

expansions 0.5 and 0.49999 ⋯. However, this issue does not come up
here, since we restrict attention only to numbers with decimal expan-
sions that do not involve the digit 9.

Proof of Lemma 2.9. For every 𝑓 ∈ {0, 1}∞ , we define 𝐹 𝑡𝑅(𝑓) to be the
number whose decimal expansion is 𝑓(0).𝑓(1)𝑓(2)𝑓(3) …. Formally,

𝐹𝑡𝑅(𝑓) = ∑_{𝑖=0}^{∞} 𝑓(𝑖) ⋅ 10^{−𝑖}          (2.2)

It is a known result in calculus (whose proof we will not repeat here)


that the series on the right-hand side of (2.2) converges to a definite
limit in ℝ.
We now prove that 𝐹 𝑡𝑅 is one to one. Let 𝑓, 𝑔 be two distinct func-
tions in {0, 1}∞ . Since 𝑓 and 𝑔 are distinct, there must be some input
on which they differ, and we define 𝑘 to be the smallest such input
and assume without loss of generality that 𝑓(𝑘) = 0 and 𝑔(𝑘) = 1.
(Otherwise, if 𝑓(𝑘) = 1 and 𝑔(𝑘) = 0, then we can simply switch the
roles of 𝑓 and 𝑔.) The numbers 𝐹 𝑡𝑅(𝑓) and 𝐹 𝑡𝑅(𝑔) agree with each
other up to the 𝑘 − 1-th digit after the decimal point. Since this digit
equals 0 for 𝐹𝑡𝑅(𝑓) and equals 1 for 𝐹𝑡𝑅(𝑔), we claim that 𝐹𝑡𝑅(𝑔) is
bigger than 𝐹𝑡𝑅(𝑓) by at least 0.5 ⋅ 10^{−𝑘} . To see this note that the dif-
ference 𝐹𝑡𝑅(𝑔) − 𝐹𝑡𝑅(𝑓) will be minimized if 𝑔(ℓ) = 0 for every ℓ > 𝑘
and 𝑓(ℓ) = 1 for every ℓ > 𝑘, in which case (since 𝑓 and 𝑔 agree up to
the 𝑘 − 1-th digit)

𝐹𝑡𝑅(𝑔) − 𝐹𝑡𝑅(𝑓) = 10^{−𝑘} − 10^{−𝑘−1} − 10^{−𝑘−2} − 10^{−𝑘−3} − ⋯          (2.3)

Since the infinite series ∑_{𝑖=0}^{∞} 10^{−𝑖} converges to 10/9, it follows that
for every such 𝑓 and 𝑔, 𝐹𝑡𝑅(𝑔) − 𝐹𝑡𝑅(𝑓) ≥ 10^{−𝑘} − 10^{−𝑘−1} ⋅ (10/9) > 0.
In particular we see that for every distinct 𝑓, 𝑔 ∈ {0, 1}∞ , 𝐹 𝑡𝑅(𝑓) ≠
𝐹 𝑡𝑅(𝑔), implying that the function 𝐹 𝑡𝑅 is one to one.

R
Remark 2.11 — Using decimal expansion (op-
tional). In the proof above we used the fact that
1 + 1/10 + 1/100 + ⋯ converges to 10/9, which
plugging into (2.3) yields that the difference between
𝐹𝑡𝑅(𝑔) and 𝐹𝑡𝑅(𝑓) is at least 10^{−𝑘} − 10^{−𝑘−1} ⋅ (10/9) > 0.
While the choice of the decimal representation for 𝐹𝑡𝑅
was arbitrary, we could not have used the binary
representation in its place. Had we used the binary
expansion instead of decimal, the corresponding se-
quence 1 + 1/2 + 1/4 + ⋯ converges to 2/1 = 2,
and since 2^{−𝑘} = 2^{−𝑘−1} ⋅ 2, we could not have de-
duced that 𝐹𝑡𝑅 is one to one. Indeed there do exist
pairs of distinct sequences 𝑓, 𝑔 ∈ {0, 1}∞ such that
∑_{𝑖=0}^{∞} 𝑓(𝑖) 2^{−𝑖} = ∑_{𝑖=0}^{∞} 𝑔(𝑖) 2^{−𝑖} . (For example, the se-
quence 1, 0, 0, 0, … and the sequence 0, 1, 1, 1, … have
this property.)
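
The difference between the two bases can also be seen numerically; the
following sketch (our own illustration, using exact fractions and truncating
the infinite sums to 30 terms) compares the two sequences 1, 0, 0, 0, … and
0, 1, 1, 1, … mentioned above:

from fractions import Fraction

f = [1] + [0] * 30   # the sequence 1,0,0,0,...
g = [0] + [1] * 30   # the sequence 0,1,1,1,...

def expansion(seq, base):
    return sum(Fraction(b, base**i) for i, b in enumerate(seq))

# in base 2 the truncated sums are 1 and 1 - 2^(-30): essentially the same number
print(expansion(f, 2), expansion(g, 2))
# in base 10 they stay far apart, as used in the proof of Lemma 2.9
print(expansion(f, 10) > expansion(g, 10))  # True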

2.4.1 Corollary: Boolean functions are uncountable


Cantor’s Theorem yields the following corollary that we will use
several times in this book: the set of all Boolean functions (mapping
{0, 1}∗ to {0, 1}) is not countable:

Theorem 2.12 — Boolean functions are uncountable. Let ALL be the set of
all functions 𝐹 ∶ {0, 1}∗ → {0, 1}. Then ALL is uncountable. Equiv-
alently, there does not exist an onto map 𝑆𝑡𝐴𝐿𝐿 ∶ {0, 1}∗ → ALL.

Proof Idea:
This is a direct consequence of Lemma 2.8, since we can use the
binary representation to show a one-to-one map from {0, 1}∞ to ALL.
Hence the uncountability of {0, 1}∞ implies the uncountability of
ALL.

Proof of Theorem 2.12. Since {0, 1}∞ is uncountable, the result will
follow by showing a one-to-one map from {0, 1}∞ to ALL. The reason
is that the existence of such a map implies that if ALL was countable,
and hence there was a one-to-one map from ALL to ℕ, then there
would have been a one-to-one map from {0, 1}∞ to ℕ, contradicting
Lemma 2.8.
We now show this one-to-one map. We simply map a function
𝑓 ∈ {0, 1}∞ to the function 𝐹 ∶ {0, 1}∗ → {0, 1} as follows. We let
𝐹 (0) = 𝑓(0), 𝐹 (1) = 𝑓(1), 𝐹 (10) = 𝑓(2), 𝐹 (11) = 𝑓(3) and so on and
so forth. That is, for every 𝑥 ∈ {0, 1}∗ that represents a natural number
𝑛 in the binary basis, we define 𝐹 (𝑥) = 𝑓(𝑛). If 𝑥 does not represent
such a number (e.g., it has a leading zero), then we set 𝐹 (𝑥) = 0.
This map is one-to-one since if 𝑓 ≠ 𝑔 are two distinct elements in
{0, 1}∞ , then there must be some input 𝑛 ∈ ℕ on which 𝑓(𝑛) ≠ 𝑔(𝑛).
But then if 𝑥 ∈ {0, 1}∗ is the string representing 𝑛, we see that 𝐹 (𝑥) ≠
𝐺(𝑥) where 𝐹 is the function in ALL that 𝑓 mapped to, and 𝐺 is the
function that 𝑔 is mapped to.


2.4.2 Equivalent conditions for countability


The results above establish many equivalent ways to phrase the fact
that a set is countable. Specifically, the following statements are all
equivalent:

1. The set 𝑆 is countable

2. There exists an onto map from ℕ to 𝑆

3. There exists an onto map from {0, 1}∗ to 𝑆.

4. There exists a one-to-one map from 𝑆 to ℕ

5. There exists a one-to-one map from 𝑆 to {0, 1}∗ .

6. There exists an onto map from some countable set 𝑇 to 𝑆.

7. There exists a one-to-one map from 𝑆 to some countable set 𝑇 .

P
Make sure you know how to prove the equivalence of
all the results above.

2.5 REPRESENTING OBJECTS BEYOND NUMBERS


Numbers are of course by no means the only objects that we can repre-
sent as binary strings. A representation scheme for representing objects
from some set 𝒪 consists of an encoding function that maps an object in
𝒪 to a string, and a decoding function that decodes a string back to an
object in 𝒪. Formally, we make the following definition:

Definition 2.13 — String representation. Let 𝒪 be any set. A representation


scheme for 𝒪 is a pair of functions 𝐸, 𝐷 where 𝐸 ∶ 𝒪 → {0, 1}∗ is a
total one-to-one function, 𝐷 ∶ {0, 1}∗ →𝑝 𝒪 is a (possibly partial)
function, and such that 𝐷 and 𝐸 satisfy that 𝐷(𝐸(𝑜)) = 𝑜 for every
𝑜 ∈ 𝒪. 𝐸 is known as the encoding function and 𝐷 is known as the
decoding function.

Note that the condition 𝐷(𝐸(𝑜)) = 𝑜 for every 𝑜 ∈ 𝒪 implies


that 𝐷 is onto (can you see why?). It turns out that to construct a
representation scheme we only need to find an encoding function. That
is, every one-to-one encoding function has a corresponding decoding
function, as shown in the following lemma:
Lemma 2.14 Suppose that 𝐸 ∶ 𝒪 → {0, 1}∗ is one-to-one. Then there
exists a function 𝐷 ∶ {0, 1}∗ → 𝒪 such that 𝐷(𝐸(𝑜)) = 𝑜 for every
𝑜 ∈ 𝒪.

Proof. Let 𝑜0 be some arbitrary element of 𝒪. For every 𝑥 ∈ {0, 1}∗ ,


there exists either zero or a single 𝑜 ∈ 𝒪 such that 𝐸(𝑜) = 𝑥 (otherwise
𝐸 would not be one-to-one). We will define 𝐷(𝑥) to equal 𝑜0 in the
first case and this single object 𝑜 in the second case. By definition
𝐷(𝐸(𝑜)) = 𝑜 for every 𝑜 ∈ 𝒪.
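
When 𝒪 is finite, the construction in this proof can be carried out directly
in code; here is a minimal Python sketch (the name make_decoder is
illustrative), assuming the encoder is given as a Python function and the
objects as a finite collection:

def make_decoder(encode, objects, default):
    # invert a one-to-one encoder on a finite collection of objects;
    # strings that encode no object decode to `default`, which plays the
    # role of the arbitrary element o_0 in the proof above
    table = {encode(o): o for o in objects}
    return lambda x: table.get(x, default)

decode = make_decoder(NtS, range(1000), 0)  # NtS from Remark 2.1
print(decode(NtS(236)))   # 236
print(decode("garbage"))  # 0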

R
Remark 2.15 — Total decoding functions. While the
decoding function of a representation scheme can in
general be a partial function, the proof of Lemma 2.14
implies that every representation scheme has a total
decoding function. This observation can sometimes be
useful.

2.5.1 Finite representations


If 𝒪 is finite, then we can represent every object in 𝒪 as a string of
length at most some number 𝑛. What is the value of 𝑛? Let us denote
by {0, 1}≤𝑛 the set {𝑥 ∈ {0, 1}∗ ∶ |𝑥| ≤ 𝑛} of strings of length at most 𝑛.
The size of {0, 1}≤𝑛 is equal to
|{0, 1}^0 | + |{0, 1}^1 | + |{0, 1}^2 | + ⋯ + |{0, 1}^𝑛 | = ∑_{𝑖=0}^{𝑛} 2^𝑖 = 2^{𝑛+1} − 1

using the standard formula for summing a geometric progression.


To obtain a representation of objects in 𝒪 as strings of length at
most 𝑛 we need to come up with a one-to-one function from 𝒪 to
{0, 1}^{≤𝑛} . We can do so if and only if |𝒪| ≤ 2^{𝑛+1} − 1, as is implied by
the following lemma:
Lemma 2.16 For every two non-empty finite sets 𝑆, 𝑇 , there exists a
one-to-one 𝐸 ∶ 𝑆 → 𝑇 if and only if |𝑆| ≤ |𝑇 |.

Proof. Let 𝑘 = |𝑆| and 𝑚 = |𝑇 | and so write the elements of 𝑆 and


𝑇 as 𝑆 = {𝑠0 , 𝑠1 , … , 𝑠𝑘−1 } and 𝑇 = {𝑡0 , 𝑡1 , … , 𝑡𝑚−1 }. We need to
show that there is a one-to-one function 𝐸 ∶ 𝑆 → 𝑇 iff 𝑘 ≤ 𝑚. For
the “if” direction, if 𝑘 ≤ 𝑚 we can simply define 𝐸(𝑠𝑖 ) = 𝑡𝑖 for every
𝑖 ∈ [𝑘]. Clearly for 𝑖 ≠ 𝑗, 𝑡𝑖 = 𝐸(𝑠𝑖 ) ≠ 𝐸(𝑠𝑗 ) = 𝑡𝑗 , and hence this
function is one-to-one. In the other direction, suppose that 𝑘 > 𝑚 and
𝐸 ∶ 𝑆 → 𝑇 is some function. Then 𝐸 cannot be one-to-one. Indeed, for
𝑖 = 0, 1, … , 𝑚 − 1 let us “mark” the element 𝑡𝑗 = 𝐸(𝑠𝑖 ) in 𝑇 . If 𝑡𝑗 was
marked before, then we have found two objects in 𝑆 mapping to the
same element 𝑡𝑗 . Otherwise, since 𝑇 has 𝑚 elements, when we get to
𝑖 = 𝑚 − 1 we mark all the objects in 𝑇 . Hence, in this case, 𝐸(𝑠𝑚 ) must
map to an element that was already marked before. (This observation
is sometimes known as the “Pigeonhole Principle”: the principle that

if you have a pigeon coop with 𝑚 holes and 𝑘 > 𝑚 pigeons, then there
must be two pigeons in the same hole.)

2.5.2 Prefix-free encoding


When showing a representation scheme for rational numbers, we used
the “hack” of encoding the alphabet {0, 1, ‖} to represent tuples of
strings as a single string. This is a special case of the general paradigm
of prefix-free encoding. The idea is the following: if our representation
has the property that no string 𝑥 representing an object 𝑜 is a prefix
(i.e., an initial substring) of a string 𝑦 representing a different object
𝑜′ , then we can represent a list of objects by merely concatenating the
representations of all the list members. For example, because in En-
glish every sentence ends with a punctuation mark such as a period,
exclamation, or question mark, no sentence can be a prefix of another
and so we can represent a list of sentences by merely concatenating
the sentences one after the other. (English has some complications
such as periods used for abbreviations (e.g., “e.g.”) or sentence quotes
containing punctuation, but the high level point of a prefix-free repre-
sentation for sentences still holds.)
It turns out that we can transform every representation to a prefix-
free form. This justifies Big Idea 1, and allows us to transform a repre-
sentation scheme for objects of a type 𝑇 to a representation scheme of
lists of objects of the type 𝑇 . By repeating the same technique, we can
also represent lists of lists of objects of type 𝑇 , and so on and so forth.
But first let us formally define prefix-freeness:

Definition 2.17 — Prefix free encoding. For two strings 𝑦, 𝑦′ , we say that 𝑦
is a prefix of 𝑦′ if |𝑦| ≤ |𝑦′ | and for every 𝑖 < |𝑦|, 𝑦′_𝑖 = 𝑦_𝑖 .

Let 𝒪 be a non-empty set and 𝐸 ∶ 𝒪 → {0, 1}∗ be a function. We


say that 𝐸 is prefix-free if 𝐸(𝑜) is non-empty for every 𝑜 ∈ 𝒪 and
there does not exist a distinct pair of objects 𝑜, 𝑜′ ∈ 𝒪 such that
𝐸(𝑜) is a prefix of 𝐸(𝑜′ ).

Recall that for every set 𝒪, the set 𝒪∗ consists of all finite length
tuples (i.e., lists) of elements in 𝒪. The following theorem shows that
if 𝐸 is a prefix-free encoding of 𝒪 then by concatenating encodings we
can obtain a valid (i.e., one-to-one) representation of 𝒪∗ :

Theorem 2.18 — Prefix-free implies tuple encoding. Suppose that 𝐸 ∶ 𝒪 →
{0, 1}∗ is prefix-free. Then the following map Ē ∶ 𝒪∗ → {0, 1}∗ (note the
bar: Ē encodes tuples, while 𝐸 encodes single objects) is one to one: for
every (𝑜0 , … , 𝑜_{𝑘−1} ) ∈ 𝒪∗ , we define

Ē(𝑜0 , … , 𝑜_{𝑘−1} ) = 𝐸(𝑜0 )𝐸(𝑜1 ) ⋯ 𝐸(𝑜_{𝑘−1} ) .



P
Theorem 2.18 is an example of a theorem that is a little
hard to parse, but in fact is fairly straightforward to
prove once you understand what it means. Therefore,
I highly recommend that you pause here to make
sure you understand the statement of this theorem.
You should also try to prove it on your own before
proceeding further.

Proof Idea:
The idea behind the proof is simple. Suppose that for example
we want to decode a triple (𝑜0 , 𝑜1 , 𝑜2 ) from its representation 𝑥 =
Ē(𝑜0 , 𝑜1 , 𝑜2 ) = 𝐸(𝑜0 )𝐸(𝑜1 )𝐸(𝑜2 ). We will do so by first finding the
first prefix 𝑥0 of 𝑥 that is a representation of some object. Then we
will decode this object, remove 𝑥0 from 𝑥 to obtain a new string 𝑥′ ,
and continue onwards to find the first prefix 𝑥1 of 𝑥′ and so on and so
forth (see Exercise 2.9). The prefix-freeness property of 𝐸 will ensure
that 𝑥0 will in fact be 𝐸(𝑜0 ), 𝑥1 will be 𝐸(𝑜1 ), etc.

Figure 2.9: If we have a prefix-free representation of each object then we
can concatenate the representations of 𝑘 objects to obtain a representation
for the tuple (𝑜0 , … , 𝑜_{𝑘−1} ).

Proof of Theorem 2.18. We now show the formal proof. Suppose, to-
wards the sake of contradiction, that there exist two distinct tuples
(𝑜0 , … , 𝑜_{𝑘−1} ) and (𝑜′0 , … , 𝑜′_{𝑘′−1} ) such that

Ē(𝑜0 , … , 𝑜_{𝑘−1} ) = Ē(𝑜′0 , … , 𝑜′_{𝑘′−1} ) .          (2.4)

We will denote the string Ē(𝑜0 , … , 𝑜_{𝑘−1} ) by 𝑥.


Let 𝑖 be the first index such that 𝑜𝑖 ≠ 𝑜𝑖′ . (If 𝑜𝑖 = 𝑜𝑖′ for all 𝑖 then,
since we assume the two tuples are distinct, one of them must be
larger than the other. In this case we assume without loss of generality
that 𝑘′ > 𝑘 and let 𝑖 = 𝑘.) In the case that 𝑖 < 𝑘, we see that the string
𝑥 can be written in two different ways:

𝑥 = Ē(𝑜0 , … , 𝑜_{𝑘−1} ) = 𝑥0 ⋯ 𝑥_{𝑖−1} 𝐸(𝑜𝑖 )𝐸(𝑜_{𝑖+1} ) ⋯ 𝐸(𝑜_{𝑘−1} )

and

𝑥 = Ē(𝑜′0 , … , 𝑜′_{𝑘′−1} ) = 𝑥0 ⋯ 𝑥_{𝑖−1} 𝐸(𝑜′𝑖 )𝐸(𝑜′_{𝑖+1} ) ⋯ 𝐸(𝑜′_{𝑘′−1} )

where 𝑥𝑗 = 𝐸(𝑜𝑗 ) = 𝐸(𝑜𝑗′ ) for all 𝑗 < 𝑖. Let 𝑦 be the string obtained
after removing the prefix 𝑥0 ⋯ 𝑥𝑖−1 from 𝑥. We see that 𝑦 can be writ-
ten as both 𝑦 = 𝐸(𝑜𝑖 )𝑠 for some string 𝑠 ∈ {0, 1}∗ and as 𝑦 = 𝐸(𝑜𝑖′ )𝑠′
for some 𝑠′ ∈ {0, 1}∗ . But this means that one of 𝐸(𝑜𝑖 ) and 𝐸(𝑜𝑖′ ) must
be a prefix of the other, contradicting the prefix-freeness of 𝐸.
In the case that 𝑖 = 𝑘 and 𝑘′ > 𝑘, we get a contradiction in the
following way. In this case

𝑥 = 𝐸(𝑜0 ) ⋯ 𝐸(𝑜_{𝑘−1} ) = 𝐸(𝑜′0 ) ⋯ 𝐸(𝑜′_{𝑘−1} )𝐸(𝑜′𝑘 ) ⋯ 𝐸(𝑜′_{𝑘′−1} )

which means that 𝐸(𝑜′𝑘 ) ⋯ 𝐸(𝑜′_{𝑘′−1} ) must correspond to the empty
string "". But in such a case 𝐸(𝑜′𝑘 ) must be the empty string, which in
particular is the prefix of any other string, contradicting the prefix-
freeness of 𝐸.

R
Remark 2.19 — Prefix freeness of list representation.
Even if the representation 𝐸 of objects in 𝒪 is prefix
free, this does not mean that our representation Ē
of lists of such objects will be prefix free as well. In
fact, it won’t be: for every three objects 𝑜, 𝑜′ , 𝑜″ the
representation of the list (𝑜, 𝑜′ ) will be a prefix of the
representation of the list (𝑜, 𝑜′ , 𝑜″ ). However, as we see
in Lemma 2.20 below, we can transform every repre-
sentation into prefix-free form, and so will be able to
use that transformation if needed to represent lists of
lists, lists of lists of lists, and so on and so forth.

2.5.3 Making representations prefix-free


Some natural representations are prefix-free. For example, every fixed
output length representation (i.e., one-to-one function 𝐸 ∶ 𝒪 → {0, 1}𝑛 )
is automatically prefix-free, since a string 𝑥 can only be a prefix of
an equal-length 𝑥′ if 𝑥 and 𝑥′ are identical. Moreover, the approach
we used for representing rational numbers can be used to show the
following:
Lemma 2.20 Let 𝐸 ∶ 𝒪 → {0, 1}∗ be a one-to-one function. Then there is
a one-to-one prefix-free encoding Ē such that |Ē(𝑜)| ≤ 2|𝐸(𝑜)| + 2 for
every 𝑜 ∈ 𝒪.

P
For the sake of completeness, we will include the
proof below, but it is a good idea for you to pause
here and try to prove it on your own, using the same
technique we used for representing rational numbers.

Proof of Lemma 2.20. The idea behind the proof is to use the map 0 ↦
00, 1 ↦ 11 to “double” every bit in the string 𝑥 and then mark the
end of the string by concatenating to it the pair 01. If we encode a
string 𝑥 in this way, it ensures that the encoding of 𝑥 is never a prefix

of the encoding of a distinct string 𝑥′ . Formally, we define the function


PF ∶ {0, 1}∗ → {0, 1}∗ as follows:

PF(𝑥) = 𝑥0 𝑥0 𝑥1 𝑥1 … 𝑥𝑛−1 𝑥𝑛−1 01

for every 𝑥 ∈ {0, 1}∗ . If 𝐸 ∶ 𝒪 → {0, 1}∗ is the (potentially not


prefix-free) representation for 𝒪, then we transform it into a prefix-
free representation Ē ∶ 𝒪 → {0, 1}∗ by defining Ē(𝑜) = PF(𝐸(𝑜)).
To prove the lemma we need to show that (1) Ē is one-to-one and
(2) Ē is prefix-free. In fact, prefix freeness is a stronger condition than
one-to-one (if two strings are equal then in particular one of them is a
prefix of the other) and hence it suffices to prove (2), which we now
do.
Let 𝑜 ≠ 𝑜′ in 𝒪 be two distinct objects. We will prove that Ē(𝑜) is
not a prefix of Ē(𝑜′ ), or in other words PF(𝑥) is not a prefix of PF(𝑥′ )
where 𝑥 = 𝐸(𝑜) and 𝑥′ = 𝐸(𝑜′ ). Since 𝐸 is one-to-one, 𝑥 ≠ 𝑥′ . We
will split into three cases, depending on whether |𝑥| < |𝑥′ |, |𝑥| = |𝑥′ |,
or |𝑥| > |𝑥′ |. If |𝑥| < |𝑥′ | then the two bits in positions 2|𝑥|, 2|𝑥| + 1
in PF(𝑥) have the value 01 but the corresponding bits in PF(𝑥′ ) will
equal either 00 or 11 (depending on the |𝑥|-th bit of 𝑥′ ) and hence
PF(𝑥) cannot be a prefix of PF(𝑥′ ). If |𝑥| = |𝑥′ | then, since 𝑥 ≠ 𝑥′ ,
there must be a coordinate 𝑖 in which they differ, meaning that the
strings PF(𝑥) and PF(𝑥′ ) differ in the coordinates 2𝑖, 2𝑖 + 1, which
again means that PF(𝑥) cannot be a prefix of PF(𝑥′ ). If |𝑥| > |𝑥′ |
then |PF(𝑥)| = 2|𝑥| + 2 > |PF(𝑥′ )| = 2|𝑥′ | + 2 and hence PF(𝑥) is
longer than (and cannot be a prefix of) PF(𝑥′ ). In all cases we see that
PF(𝑥) = 𝐸̄(𝑜) is not a prefix of PF(𝑥′ ) = 𝐸̄(𝑜′ ), hence completing the
proof.

The proof of Lemma 2.20 is not the only or even the best way to
transform an arbitrary representation into prefix-free form. Exer-
cise 2.10 asks you to construct a more efficient prefix-free transforma-
tion satisfying |𝐸̄(𝑜)| ≤ |𝐸(𝑜)| + 𝑂(log |𝐸(𝑜)|).

2.5.4 “Proof by Python” (optional)


The proofs of Theorem 2.18 and Lemma 2.20 are constructive in the
sense that they give us:

• A way to transform the encoding and decoding functions of any


representation of an object 𝑂 to encoding and decoding functions
that are prefix-free, and

• A way to extend prefix-free encoding and decoding of single objects


to encoding and decoding of lists of objects by concatenation.

Specifically, we could transform any pair of Python functions en-


code and decode to functions pfencode and pfdecode that correspond
to a prefix-free encoding and decoding. Similarly, given pfencode and
pfdecode for single objects, we can extend them to encoding of lists.
Let us show how this works for the case of the NtS and StN functions
we defined above.
We start with the “Python proof” of Lemma 2.20: a way to trans-
form an arbitrary representation into one that is prefix free. The func-
tion prefixfree below takes as input a pair of encoding and decoding
functions, and returns a triple of functions containing prefix-free encod-
ing and decoding functions, as well as a function that checks whether
a string is a valid encoding of an object.

# takes functions encode and decode mapping
# objects to lists of bits and vice versa,
# and returns functions pfencode and pfdecode that
# map objects to lists of bits and vice versa
# in a prefix-free way.
# Also returns a function pfvalid that says
# whether a list is a valid encoding
def prefixfree(encode, decode):
    def pfencode(o):
        L = encode(o)
        return [L[i//2] for i in range(2*len(L))]+[0,1]
    def pfdecode(L):
        return decode([L[j] for j in range(0,len(L)-2,2)])
    def pfvalid(L):
        return (len(L) % 2 == 0) and all(L[2*i]==L[2*i+1] for i in range((len(L)-2)//2)) and L[-2:]==[0,1]

    return pfencode, pfdecode, pfvalid

pfNtS, pfStN, pfvalidN = prefixfree(NtS,StN)

NtS(234)
# 11101010
pfNtS(234)
# 111111001100110001
pfStN(pfNtS(234))
# 234
pfvalidN(pfNtS(234))
# True

P
Note that the Python function prefixfree above
takes two Python functions as input and outputs
three Python functions as output. (When it’s not
too awkward, we use the term “Python function” or
“subroutine” to distinguish between such snippets of
Python programs and mathematical functions.) You
don’t have to know Python in this course, but you do
need to get comfortable with the idea of functions as
mathematical objects in their own right, that can be
used as inputs and outputs of other functions.

We now show a “Python proof” of Theorem 2.18. Namely, we show


a function represlists that takes as input a prefix-free representation
scheme (implemented via encoding, decoding, and validity testing
functions) and outputs a representation scheme for lists of such ob-
jects. If we want to make this representation prefix-free then we could
fit it into the function prefixfree above.

def represlists(pfencode,pfdecode,pfvalid):
    """
    Takes functions pfencode, pfdecode and pfvalid,
    and returns functions encodelist, decodelist
    that can encode and decode lists of the objects
    respectively.
    """

    def encodelist(L):
        """Gets list of objects, encodes it as list of bits"""
        return "".join([pfencode(obj) for obj in L])

    def decodelist(S):
        """Gets lists of bits, returns lists of objects"""
        i=0; j=1; res = []
        while j<=len(S):
            if pfvalid(S[i:j]):
                res += [pfdecode(S[i:j])]
                i=j
            j+= 1
        return res

    return encodelist,decodelist

LtS, StL = represlists(pfNtS,pfStN,pfvalidN)

LtS([234,12,5])
# 111111001100110001111100000111001101
StL(LtS([234,12,5]))
# [234, 12, 5]

2.5.5 Representing letters and text


We can represent a letter or symbol by a string, and then if this rep-
resentation is prefix-free, we can represent a sequence of symbols by
merely concatenating the representation of each symbol. One such
representation is ASCII, which represents 128 letters and symbols
as strings of 7 bits. Since the ASCII representation is fixed-length, it
is automatically prefix-free (can you see why?). Unicode is the rep-
resentation of (at the time of this writing) about 128,000 symbols as
numbers (known as code points) between 0 and 1,114,111. There are
several types of prefix-free representations of the code points, a pop-
ular one being UTF-8 that encodes every code point into a string of
between 8 and 32 bits.
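To get a feel for this, here is a small Python snippet (ours, not part of the
text's formal development; it assumes a standard Python 3 environment)
that prints the UTF-8 encodings of a few code points:

# Print the UTF-8 encoding (as bits) of a few symbols.
# Symbols with larger code points are encoded using more bytes.
for ch in ["a", "é", "€", "🐘"]:
    bits = ["{:08b}".format(b) for b in ch.encode("utf-8")]
    print(ch, "code point", ord(ch), ":", " ".join(bits))
# "a" has a one-byte (8-bit) encoding, "é" a two-byte one, "€" a
# three-byte one, and the emoji a four-byte (32-bit) one.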

■ Example 2.21 — The Braille representation. The Braille system is another
way to encode letters and other symbols as binary strings. Specifi-
cally, in Braille, every letter is encoded as a string in {0, 1}6 , which
is written using indented dots arranged in two columns and three
rows, see Fig. 2.10. (Some symbols require more than one six-bit
string to encode, and so Braille uses a more general prefix-free
encoding.)
The Braille system was invented in 1821 by Louis Braille when
he was just 12 years old (though he continued working on it and
improving it throughout his life). Braille was a French boy who
lost his eyesight at the age of 5 as the result of an accident.

Figure 2.10: The word “Binary” in “Grade 1” or “uncontracted” Unified
English Braille. This word is encoded using seven symbols since the first
one is a modifier indicating that the first letter is capitalized.

■ Example 2.22 — Representing objects in C (optional). We can use pro-


gramming languages to probe how our computing environment
represents various values. This is easiest to do in “unsafe” pro-
gramming languages such as C that allow direct access to the
memory.
Using a simple C program we have produced the following
representations of various values. One can see that for integers,
multiplying by 2 corresponds to a “left shift” inside each byte. In
contrast, for floating-point numbers, multiplying by two corre-
sponds to adding one to the exponent part of the representation.
In the architecture we used, a negative number is represented

using the two’s complement approach. C represents strings in a


prefix-free form by ensuring that a zero byte is at their end.

int 2       : 00000010 00000000 00000000 00000000
int 4       : 00000100 00000000 00000000 00000000
int 513     : 00000001 00000010 00000000 00000000
long 513    : 00000001 00000010 00000000 00000000 00000000 00000000 00000000 00000000
int -1      : 11111111 11111111 11111111 11111111
int -2      : 11111110 11111111 11111111 11111111
string Hello: 01001000 01100101 01101100 01101100 01101111 00000000
string abcd : 01100001 01100010 01100011 01100100 00000000
float 33.0  : 00000000 00000000 00000100 01000010
float 66.0  : 00000000 00000000 10000100 01000010
float 132.0 : 00000000 00000000 00000100 01000011
double 132.0: 00000000 00000000 00000000 00000000 00000000 10000000 01100000 01000000

2.5.6 Representing vectors, matrices, images


Once we can represent numbers and lists of numbers, then we can also
represent vectors (which are just lists of numbers). Similarly, we can
represent lists of lists, and thus, in particular, can represent matrices.
To represent an image, we can represent the color at each pixel by a
list of three numbers corresponding to the intensity of Red, Green and
Blue. (We can restrict to three primary colors since most humans only
have three types of cones in their retinas; we would have needed 16
primary colors to represent colors visible to the Mantis Shrimp.) Thus
an image of 𝑛 pixels would be represented by a list of 𝑛 such length-
three lists. A video can be represented as a list of images. Of course
these representations are rather wasteful and much more compact
representations are typically used for images and videos, though this
will not be our concern in this book.
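As a toy illustration, here is a short Python sketch (the particular
four-pixel image and the variable names are made up for illustration)
of this representation:

# A four-pixel (2x2) image, represented as a list of [Red, Green, Blue]
# intensity triples, one triple per pixel.
image = [[255, 0, 0], [0, 255, 0], [0, 0, 255], [128, 128, 128]]

# Flattening gives a single list of numbers, which can in turn be
# encoded as a binary string using the list representations above.
flat = [value for pixel in image for value in pixel]
print(flat)
# [255, 0, 0, 0, 255, 0, 0, 0, 255, 128, 128, 128]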

2.5.7 Representing graphs


A graph on 𝑛 vertices can be represented as an 𝑛 × 𝑛 adjacency matrix
whose (𝑖, 𝑗)-th entry is equal to 1 if the edge (𝑖, 𝑗) is present and is
equal to 0 otherwise. That is, we can represent an 𝑛 vertex directed
graph 𝐺 = (𝑉 , 𝐸) as a string 𝐴 ∈ {0, 1}𝑛² such that 𝐴𝑖,𝑗 = 1 iff the
edge (𝑖, 𝑗) ∈ 𝐸. We can transform an undirected graph into a directed
graph by replacing every edge {𝑖, 𝑗} with both edges (𝑖, 𝑗) and (𝑗, 𝑖).
Another representation for graphs is the adjacency list representa-
tion. That is, we identify the vertex set 𝑉 of a graph with the set [𝑛]

where 𝑛 = |𝑉 |, and represent the graph 𝐺 = (𝑉 , 𝐸) as a list of 𝑛


lists, where the 𝑖-th list consists of the out-neighbors of vertex 𝑖. The
difference between these representations can be significant for some
applications, though for us it would typically be immaterial.
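To make the two representations concrete, here is a small Python sketch
(the variable names are ours; the graph is the one of Fig. 2.11) that
computes both from a list of directed edges:

# A directed graph on the vertex set [5], given as a list of edges (i,j).
edges = [(1,0), (4,0), (1,4), (4,1), (2,1), (3,2), (4,3)]
n = 5

# Adjacency matrix: an n-by-n table of 0/1 values.
adj_matrix = [[1 if (i,j) in edges else 0 for j in range(n)] for i in range(n)]

# Adjacency list: the i-th list holds the out-neighbors of vertex i.
adj_list = [[j for j in range(n) if (i,j) in edges] for i in range(n)]

print(adj_matrix)
# [[0, 0, 0, 0, 0], [1, 0, 0, 0, 1], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 0, 1, 0]]
print(adj_list)
# [[], [0, 4], [1], [2], [0, 1, 3]]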

2.5.8 Representing lists and nested lists


If we have a way of representing objects from a set 𝒪 as binary strings,
then we can represent lists of these objects by applying a prefix-free
transformation. Moreover, we can use a trick similar to the above to
handle nested lists. The idea is that if we have some representation
𝐸 ∶ 𝒪 → {0, 1}∗ , then we can represent nested lists of items from
𝒪 using strings over the five element alphabet Σ = { 0 , 1 , [ , ] , , }.
For example, if 𝑜1 is represented by 0011, 𝑜2 is represented by 10011,
and 𝑜3 is represented by 00111, then we can represent the nested list
(𝑜1 , (𝑜2 , 𝑜3 )) as the string "[0011,[10011,00111]]" over the alphabet
Σ. By encoding every element of Σ itself as a three-bit string, we can
transform any representation for objects in 𝒪 into a representation that
enables representing (potentially nested) lists of these objects.

Figure 2.11: Representing the graph 𝐺 = ({0, 1, 2, 3, 4}, {(1, 0), (4, 0), (1, 4), (4, 1), (2, 1), (3, 2), (4, 3)}) in the adjacency matrix and adjacency list representations.
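Here is a small Python sketch of this idea (the three-bit table and the
helper name are our own choices for illustration); it maps each of the
five symbols of Σ to a distinct three-bit string and concatenates:

# Map each of the five symbols of Σ to a distinct three-bit string.
SYMBOL_BITS = {"0": "000", "1": "001", "[": "010", "]": "011", ",": "100"}

def encode_nested(s):
    """Encode a string over the alphabet Σ as a binary string."""
    return "".join(SYMBOL_BITS[c] for c in s)

print(encode_nested("[0011,[10011,00111]]"))
# 010000000001001100010001000000001001100000000001001001011011

Since every symbol is encoded by exactly three bits, decoding simply
reads the output three bits at a time.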

2.5.9 Notation
We will typically identify an object with its representation as a string.
For example, if 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is some function that maps
strings to strings and 𝑛 is an integer, we might make statements such
as “𝐹 (𝑛) + 1 is prime” to mean that if we represent 𝑛 as a string 𝑥,
then the integer 𝑚 represented by the string 𝐹 (𝑥) satisfies that 𝑚 + 1
is prime. (You can see how this convention of identifying objects with
their representation can save us a lot of cumbersome formalism.)
Similarly, if 𝑥, 𝑦 are some objects and 𝐹 is a function that takes strings
as inputs, then by 𝐹 (𝑥, 𝑦) we will mean the result of applying 𝐹 to the
representation of the ordered pair (𝑥, 𝑦). We use the same notation to
invoke functions on 𝑘-tuples of objects for every 𝑘.
This convention of identifying an object with its representation as
a string is one that we humans follow all the time. For example, when
people say a statement such as “17 is a prime number”, what they
really mean is that the integer whose decimal representation is the
string “17”, is prime.

When we say
𝐴 is an algorithm that computes the multiplication function on natural num-
bers.
what we really mean is that
𝐴 is an algorithm that computes the function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ such that
for every pair 𝑎, 𝑏 ∈ ℕ, if 𝑥 ∈ {0, 1}∗ is a string representing the pair (𝑎, 𝑏)
then 𝐹 (𝑥) will be a string representing their product 𝑎 ⋅ 𝑏.

2.6 DEFINING COMPUTATIONAL TASKS AS MATHEMATICAL FUNCTIONS
Abstractly, a computational process is some process that takes an input
which is a string of bits and produces an output which is a string
of bits. This transformation of input to output can be done using a
modern computer, a person following instructions, the evolution of
some natural system, or any other means.
In future chapters, we will turn to mathematically defining com-
putational processes, but, as we discussed above, at the moment we
focus on computational tasks. That is, we focus on the specification and
not the implementation. Again, at an abstract level, a computational
task can specify any relation that the output needs to have with the in-
put. However, for most of this book, we will focus on the simplest and
most common task of computing a function. Here are some examples:

• Given (a representation of) two integers 𝑥, 𝑦, compute the product


𝑥 × 𝑦. Using our representation above, this corresponds to com-
puting a function from {0, 1}∗ to {0, 1}∗ . We have seen that there is
more than one way to solve this computational task, and in fact, we
still do not know the best algorithm for this problem.

• Given (a representation of) an integer 𝑧 > 1, compute its factoriza-


tion; i.e., the list of primes 𝑝1 ≤ ⋯ ≤ 𝑝𝑘 such that 𝑧 = 𝑝1 ⋯ 𝑝𝑘 . This
again corresponds to computing a function from {0, 1}∗ to {0, 1}∗ .
The gaps in our knowledge of the complexity of this problem are
even larger.

• Given (a representation of) a graph 𝐺 and two vertices 𝑠 and 𝑡,


compute the length of the shortest path in 𝐺 between 𝑠 and 𝑡, or do
the same for the longest path (with no repeated vertices) between
𝑠 and 𝑡. Both these tasks correspond to computing a function from
{0, 1}∗ to {0, 1}∗ , though it turns out that there is a vast difference in
their computational difficulty.

• Given the code of a Python program, determine whether there is an


input that would force it into an infinite loop. This task corresponds
to computing a partial function from {0, 1}∗ to {0, 1} since not every
string corresponds to a syntactically valid Python program. We will
see that we do understand the computational status of this problem,
but the answer is quite surprising.

• Given (a representation of) an image 𝐼, decide if 𝐼 is a photo of a


cat or a dog. This corresponds to computing some (partial) func-
tion from {0, 1}∗ to {0, 1}.

R
Remark 2.23 — Boolean functions and languages. An
important special case of computational tasks corre-
sponds to computing Boolean functions, whose output
is a single bit {0, 1}. Computing such functions corre-
sponds to answering a YES/NO question, and hence
this task is also known as a decision problem. Given any
function 𝐹 ∶ {0, 1}∗ → {0, 1} and 𝑥 ∈ {0, 1}∗ , the task
of computing 𝐹 (𝑥) corresponds to the task of deciding
whether or not 𝑥 ∈ 𝐿 where 𝐿 = {𝑥 ∶ 𝐹 (𝑥) = 1} is
known as the language that corresponds to the function
𝐹 . (The language terminology is due to historical
connections between the theory of computation and
formal linguistics as developed by Noam Chomsky.)
Hence many texts refer to such a computational task
as deciding a language.
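To make this correspondence concrete, here is a tiny Python sketch
(ours, purely for illustration) of a Boolean function and a small sample
of the language it defines:

# The Boolean function F that outputs 1 exactly on even-length strings,
# and (a finite sample of) the corresponding language L = {x : F(x) = 1}.
def F(x):
    return 1 if len(x) % 2 == 0 else 0

sample = ["", "0", "01", "101", "1100"]
L = [x for x in sample if F(x) == 1]
print(L)
# ['', '01', '1100']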

For every particular function 𝐹 , there can be several possible algo-


rithms to compute 𝐹 . We will be interested in questions such as:

• For a given function 𝐹 , can it be the case that there is no algorithm to


compute 𝐹 ?

• If there is an algorithm, what is the best one? Could it be that 𝐹 is
“effectively uncomputable” in the sense that every algorithm for
computing 𝐹 requires a prohibitively large amount of resources?

• If we cannot answer this question, can we show equivalence be-
tween different functions 𝐹 and 𝐹 ′ in the sense that either they are
both easy (i.e., have fast algorithms) or they are both hard?

• Can a function being hard to compute ever be a good thing? Can we
use it for applications in areas such as cryptography?

Figure 2.12: A subset 𝐿 ⊆ {0, 1}∗ can be identified with the function
𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹 (𝑥) = 1 if 𝑥 ∈ 𝐿 and 𝐹 (𝑥) = 0 if 𝑥 ∉ 𝐿.
Functions with a single bit of output are called Boolean functions, while
subsets of strings are called languages. The above shows that the two are
essentially the same object, and we can identify the task of deciding
membership in 𝐿 (known as deciding a language in the literature) with
the task of computing the function 𝐹 .

In order to do that, we will need to mathematically define the no-


tion of an algorithm, which is what we will do in Chapter 3.

2.6.1 Distinguish functions from programs!


You should always watch out for potential confusions between speci-
fications and implementations or equivalently between mathematical
functions and algorithms/programs. It does not help that program-
ming languages (Python included) use the term “functions” to denote
(parts of) programs. This confusion also stems from thousands of years
of mathematical history, where people typically defined functions by
means of a way to compute them.
For example, consider the multiplication function on natural num-
bers. This is the function MULT ∶ ℕ × ℕ → ℕ that maps a pair (𝑥, 𝑦)
of natural numbers to the number 𝑥 ⋅ 𝑦. As we mentioned, it can be
implemented in more than one way:

def mult1(x,y):
    res = 0
    while y>0:
        res += x
        y -= 1
    return res

def mult2(x,y):
    a = str(x) # represent x as string in decimal notation
    b = str(y) # represent y as string in decimal notation
    res = 0
    for i in range(len(a)):
        for j in range(len(b)):
            res += int(a[len(a)-i-1])*int(b[len(b)-j-1])*(10**(i+j))
    return res

print(mult1(12,7))
# 84
print(mult2(12,7))
# 84

Both mult1 and mult2 produce the same output given the same
pair of natural number inputs. (Though mult1 will take far longer to
do so when the numbers become large.) Hence, even though these are
two different programs, they compute the same mathematical function.
This distinction between a program or algorithm 𝐴, and the function 𝐹
that 𝐴 computes will be absolutely crucial for us in this course (see also
Fig. 2.13).

 Big Idea 2 A function is not the same as a program. A program


computes a function.

Distinguishing functions from programs (or other ways for comput-


ing, including circuits and machines) is a crucial theme for this course.
For this reason, this is often a running theme in questions that I (and
many other instructors) assign in homework and exams (hint, hint).

Figure 2.13: A function is a mapping of inputs to outputs. A program is a
set of instructions on how to obtain an output given an input. A program
computes a function, but it is not the same as a function, popular
programming language terminology notwithstanding.

R
Remark 2.24 — Computation beyond functions (advanced, optional).
Functions capture quite a lot of compu-
tational tasks, but one can consider more general
settings as well. For starters, we can and will talk
about partial functions, that are not defined on all
inputs. When computing a partial function, we only

need to worry about the inputs on which the function


is defined. Another way to say it is that we can design
an algorithm for a partial function 𝐹 under the as-
sumption that someone “promised” us that all inputs
𝑥 would be such that 𝐹 (𝑥) is defined (as otherwise,
we do not care about the result). Hence such tasks are
also known as promise problems.
Another generalization is to consider relations that may
have more than one possible admissible output. For
example, consider the task of finding any solution for
a given set of equations. A relation 𝑅 maps a string
𝑥 ∈ {0, 1}∗ into a set of strings 𝑅(𝑥) (for example, 𝑥
might describe a set of equations, in which case 𝑅(𝑥)
would correspond to the set of all solutions to 𝑥). We
can also identify a relation 𝑅 with the set of pairs of
strings (𝑥, 𝑦) where 𝑦 ∈ 𝑅(𝑥). A computational pro-
cess solves a relation if for every 𝑥 ∈ {0, 1}∗ , it outputs
some string 𝑦 ∈ 𝑅(𝑥).
Later in this book, we will consider even more general
tasks, including interactive tasks, such as finding a
good strategy in a game, tasks defined using proba-
bilistic notions, and others. However, for much of this
book, we will focus on the task of computing a func-
tion, and often even a Boolean function, that has only a
single bit of output. It turns out that a great deal of the
theory of computation can be studied in the context of
this task, and the insights learned are applicable in the
more general settings.

✓ Chapter Recap

• We can represent objects we want to compute on


using binary strings.
• A representation scheme for a set of objects 𝒪 is a
one-to-one map from 𝒪 to {0, 1}∗ .
• We can use prefix-free encoding to “boost” a repre-
sentation for a set 𝒪 into representations of lists of
elements in 𝒪.
• A basic computational task is the task of computing
a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ . This task encom-
passes not just arithmetical computations such
as multiplication, factoring, etc. but a great many
other tasks arising in areas as diverse as scientific
computing, artificial intelligence, image processing,
data mining and many many more.
• We will study the question of finding (or at least
giving bounds on) what the best algorithm for
computing 𝐹 for various interesting functions 𝐹 is.

2.7 EXERCISES
Exercise 2.1 Which one of these objects can be represented by a binary
string?

a. An integer 𝑥

b. An undirected graph 𝐺.

c. A directed graph 𝐻

d. All of the above.

Exercise 2.2 — Binary representation.
a. Prove that the function 𝑁 𝑡𝑆 ∶ ℕ → {0, 1}∗ of the binary representation
defined in (2.1) satisfies that for every 𝑛 ∈ ℕ, if 𝑥 = 𝑁 𝑡𝑆(𝑛) then
|𝑥| = 1 + max(0, ⌊log2 𝑛⌋) and 𝑥𝑖 = ⌊𝑛/2⌊log2 𝑛⌋−𝑖 ⌋ mod 2.

b. Prove that 𝑁 𝑡𝑆 is a one to one function by coming up with a func-


tion 𝑆𝑡𝑁 ∶ {0, 1}∗ → ℕ such that 𝑆𝑡𝑁 (𝑁 𝑡𝑆(𝑛)) = 𝑛 for every
𝑛 ∈ ℕ.

Exercise 2.3 — More compact than ASCII representation. The ASCII encoding
can be used to encode a string of 𝑛 English letters as a 7𝑛 bit binary
string, but in this exercise, we ask about finding a more compact rep-
resentation for strings of English lowercase letters.

1. Prove that there exists a representation scheme (𝐸, 𝐷) for strings


over the 26-letter alphabet {𝑎, 𝑏, 𝑐, … , 𝑧} as binary strings such
that for every 𝑛 > 0 and length-𝑛 string 𝑥 ∈ {𝑎, 𝑏, … , 𝑧}𝑛 , the
representation 𝐸(𝑥) is a binary string of length at most 4.8𝑛 + 1000.
In other words, prove that for every 𝑛, there exists a one-to-one
function 𝐸 ∶ {𝑎, 𝑏, … , 𝑧}𝑛 → {0, 1}⌊4.8𝑛+1000⌋ .

2. Prove that there exists no representation scheme for strings over the
alphabet {𝑎, 𝑏, … , 𝑧} as binary strings such that for every length-𝑛
string 𝑥 ∈ {𝑎, 𝑏, … , 𝑧}𝑛 , the representation 𝐸(𝑥) is a binary string of
length ⌊4.6𝑛 + 1000⌋. In other words, prove that there exists some
𝑛 > 0 such that there is no one-to-one function 𝐸 ∶ {𝑎, 𝑏, … , 𝑧}𝑛 →
{0, 1}⌊4.6𝑛+1000⌋ .

3. Python’s bz2.compress function is a mapping from strings to


strings, which uses the lossless (and hence one to one) bzip2 algo-
rithm for compression. After converting to lowercase, and truncat-
ing spaces and numbers, the text of Tolstoy’s “War and Peace” con-
tains 𝑛 = 2,517,262 characters. Yet, if we run bz2.compress on the string of
the text of “War and Peace” we get a string of length 𝑘 = 6,274,768

bits, which is only 2.49𝑛 (and in particular much smaller than


4.6𝑛). Explain why this does not contradict your answer to the
previous question.

4. Interestingly, if we try to apply bz2.compress on a random string,


we get much worse performance. In my experiments, I got a ratio
of about 4.78 between the number of bits in the output and the
number of characters in the input. However, one could imagine that
one could do better and that there exists a company called “Pied
Piper” with an algorithm that can losslessly compress a string of 𝑛
random lowercase letters to fewer than 4.6𝑛 bits.5 Show that this
is not the case by proving that for every 𝑛 > 100 and one to one
function 𝐸𝑛𝑐𝑜𝑑𝑒 ∶ {𝑎, … , 𝑧}𝑛 → {0, 1}∗ , if we let 𝑍 be the random
variable |𝐸𝑛𝑐𝑜𝑑𝑒(𝑥)| (i.e., the length of 𝐸𝑛𝑐𝑜𝑑𝑒(𝑥)) for 𝑥 chosen
uniformly at random from the set {𝑎, … , 𝑧}𝑛 , then the expected
value of 𝑍 is at least 4.6𝑛.

5 Actually that particular fictional company uses a metric that focuses
more on compression speed than ratio, see here and here.

Exercise 2.4 — Representing graphs: upper bound. Show that there is a string
representation of directed graphs with vertex set [𝑛] and degree at
most 10 that uses at most 1000𝑛 log 𝑛 bits. More formally, show the
following: Suppose we define for every 𝑛 ∈ ℕ, the set 𝐺𝑛 as the set
containing all directed graphs (with no self loops) over the vertex
set [𝑛] where every vertex has degree at most 10. Then, prove that for
every sufficiently large 𝑛, there exists a one-to-one function 𝐸 ∶ 𝐺𝑛 →
{0, 1}⌊1000𝑛 log 𝑛⌋ .

Exercise 2.5 — Representing graphs: lower bound. 1. Define 𝑆𝑛 to be the
set of one-to-one and onto functions mapping [𝑛] to [𝑛]. Prove that
there is a one-to-one mapping from 𝑆𝑛 to 𝐺2𝑛 , where 𝐺2𝑛 is the set
defined in Exercise 2.4 above.

2. In this question you will show that one cannot improve the rep-
resentation of Exercise 2.4 to length 𝑜(𝑛 log 𝑛). Specifically, prove
for every sufficiently large 𝑛 ∈ ℕ there is no one-to-one function
𝐸 ∶ 𝐺𝑛 → {0, 1}⌊0.001𝑛 log 𝑛⌋+1000 .

Exercise 2.6 — Multiplying in different representation. Recall that the grade-
school algorithm for multiplying two numbers requires 𝑂(𝑛2 ) oper-
ations. Suppose that instead of using decimal representation, we use
one of the following representations 𝑅(𝑥) to represent a number 𝑥
between 0 and 10𝑛 − 1. For which one of these representations can you
still multiply the numbers in 𝑂(𝑛2 ) operations?

a. The standard binary representation: 𝐵(𝑥) = (𝑥0 , … , 𝑥𝑘 ) where
𝑥 = ∑𝑘𝑖=0 𝑥𝑖 2𝑖 and 𝑘 is the largest number s.t. 𝑥 ≥ 2𝑘 .

b. The reverse binary representation: 𝐵(𝑥) = (𝑥𝑘 , … , 𝑥0 ) where 𝑥𝑖 is


defined as above for 𝑖 = 0, … , 𝑘 − 1.

c. Binary coded decimal representation: 𝐵(𝑥) = (𝑦0 , … , 𝑦𝑛−1 ) where


𝑦𝑖 ∈ {0, 1}4 represents the 𝑖𝑡ℎ decimal digit of 𝑥 mapping 0 to 0000,
1 to 0001, 2 to 0010, etc. (i.e. 9 maps to 1001)

d. All of the above.

Exercise 2.7 Suppose that 𝑅 ∶ ℕ → {0, 1}∗ corresponds to representing a


number 𝑥 as a string of 𝑥 1’s, (e.g., 𝑅(4) = 1111, 𝑅(7) = 1111111, etc.).
If 𝑥, 𝑦 are numbers between 0 and 10𝑛 − 1, can we still multiply 𝑥 and
𝑦 using 𝑂(𝑛2 ) operations if we are given them in the representation
𝑅(⋅)?

Exercise 2.8 Recall that if 𝐹 is a one-to-one and onto function mapping


elements of a finite set 𝑈 into a finite set 𝑉 then the sizes of 𝑈 and 𝑉
are the same. Let 𝐵 ∶ ℕ → {0, 1}∗ be the function such that for every
𝑥 ∈ ℕ, 𝐵(𝑥) is the binary representation of 𝑥.

1. Prove that 𝑥 < 2𝑘 if and only if |𝐵(𝑥)| ≤ 𝑘.

2. Use 1. to compute the size of the set {𝑦 ∈ {0, 1}∗ ∶ |𝑦| ≤ 𝑘} where |𝑦|
denotes the length of the string 𝑦.

3. Use 1. and 2. to prove that 2𝑘 − 1 = 1 + 2 + 4 + ⋯ + 2𝑘−1 .

Exercise 2.9 — Prefix-free encoding of tuples. Suppose that 𝐹 ∶ ℕ → {0, 1}∗
is a one-to-one function that is prefix-free in the sense that there is no
𝑎 ≠ 𝑏 s.t. 𝐹 (𝑎) is a prefix of 𝐹 (𝑏).

a. Prove that 𝐹2 ∶ ℕ × ℕ → {0, 1}∗ , defined as 𝐹2 (𝑎, 𝑏) = 𝐹 (𝑎)𝐹 (𝑏) (i.e.,


the concatenation of 𝐹 (𝑎) and 𝐹 (𝑏)) is a one-to-one function.

b. Prove that 𝐹∗ ∶ ℕ∗ → {0, 1}∗ defined as 𝐹∗ (𝑎1 , … , 𝑎𝑘 ) =


𝐹 (𝑎1 ) ⋯ 𝐹 (𝑎𝑘 ) is a one-to-one function, where ℕ∗ denotes the set of
all finite-length lists of natural numbers.

Exercise 2.10 — More efficient prefix-free transformation. Suppose that 𝐹 ∶
𝑂 → {0, 1}∗ is some (not necessarily prefix-free) representation of the
objects in the set 𝑂, and 𝐺 ∶ ℕ → {0, 1}∗ is a prefix-free representa-
tion of the natural numbers. Define 𝐹 ′ (𝑜) = 𝐺(|𝐹 (𝑜)|)𝐹 (𝑜) (i.e., the
concatenation of the representation of the length of 𝐹 (𝑜) and 𝐹 (𝑜)).

a. Prove that 𝐹 ′ is a prefix-free representation of 𝑂.

b. Show that we can transform any representation to a prefix-free one


by a modification that takes a 𝑘 bit string into a string of length at
most 𝑘 + 𝑂(log 𝑘).

c. Show that we can transform any representation to a prefix-free one
by a modification that takes a 𝑘 bit string into a string of length at
most 𝑘 + log 𝑘 + 𝑂(log log 𝑘).6

6 Hint: Think recursively how to represent the length of the string.

Exercise 2.11 — Kraft’s Inequality. Suppose that 𝑆 ⊆ {0, 1}∗ is some finite
prefix-free set, and let 𝑛 be some number larger than max{|𝑥| ∶ 𝑥 ∈ 𝑆}.

a. For every 𝑥 ∈ 𝑆, let 𝐿(𝑥) ⊆ {0, 1}𝑛 denote all the length-𝑛 strings
whose first |𝑥| bits are 𝑥0 , … , 𝑥|𝑥|−1 . Prove that (1) |𝐿(𝑥)| = 2𝑛−|𝑥| and
(2) For every distinct 𝑥, 𝑥′ ∈ 𝑆, 𝐿(𝑥) is disjoint from 𝐿(𝑥′ ).

b. Prove that ∑𝑥∈𝑆 2−|𝑥| ≤ 1. (Hint: first show that ∑𝑥∈𝑆 |𝐿(𝑥)| ≤ 2𝑛 .)

c. Prove that there is no prefix-free encoding of strings with less than


logarithmic overhead. That is, prove that there is no function PF ∶
{0, 1}∗ → {0, 1}∗ s.t. |PF(𝑥)| ≤ |𝑥| + 0.9 log |𝑥| for every sufficiently
large 𝑥 ∈ {0, 1}∗ and such that the set {PF(𝑥) ∶ 𝑥 ∈ {0, 1}∗ } is prefix-
free. The factor 0.9 is arbitrary; all that matters is that it is less than
1.

Exercise 2.12 — Composition of one-to-one functions. Prove that for every
two one-to-one functions 𝐹 ∶ 𝑆 → 𝑇 and 𝐺 ∶ 𝑇 → 𝑈 , the function
𝐻 ∶ 𝑆 → 𝑈 defined as 𝐻(𝑥) = 𝐺(𝐹 (𝑥)) is one to one.

Exercise 2.13 — Natural numbers and strings. 1. We have shown that
the natural numbers can be represented as strings. Prove that
the other direction holds as well: that there is a one-to-one map
𝑆𝑡𝑁 ∶ {0, 1}∗ → ℕ. (𝑆𝑡𝑁 stands for “strings to numbers.”)

2. Recall that Cantor proved that there is no one-to-one map 𝑅𝑡𝑁 ∶


ℝ → ℕ. Show that Cantor’s result implies Theorem 2.5.

Exercise 2.14 — Map lists of integers to a number. Recall that for every set
𝑆, the set 𝑆 ∗ is defined as the set of all finite sequences of mem-
bers of 𝑆 (i.e., 𝑆 ∗ = {(𝑥0 , … , 𝑥𝑛−1 ) | 𝑛 ∈ ℕ , ∀𝑖∈[𝑛] 𝑥𝑖 ∈ 𝑆} ).
Prove that there is a one-to-one map from ℤ∗ to ℕ where ℤ is the set
{… , −3, −2, −1, 0, +1, +2, +3, …} of all integers.


2.8 BIBLIOGRAPHICAL NOTES


The study of representing data as strings, including issues such as
compression and error corrections falls under the purview of information
theory, as covered in the classic textbook of Cover and Thomas [CT06].
Representations are also studied in the field of data structures design, as
covered in texts such as [Cor+09].
The question of whether to represent integers with the most signif-
icant digit first or last is known as Big Endian vs. Little Endian repre-
sentation. This terminology comes from Cohen’s [Coh81] entertaining
and informative paper about the conflict between adherents of both
schools which he compared to the warring tribes in Jonathan Swift’s
“Gulliver’s Travels”. The two’s complement representation of signed
integers was suggested in von Neumann’s classic report [Neu45]
that detailed the design approaches for a stored-program computer,
though similar representations have been used even earlier in abacus
and other mechanical computation devices.
The idea that we should separate the definition or specification of
a function from its implementation or computation might seem “obvi-
ous,” but it took quite a lot of time for mathematicians to arrive at this
viewpoint. Historically, a function 𝐹 was identified by rules or formu-
las showing how to derive the output from the input. As we discuss
in greater depth in Chapter 9, in the 1800s this somewhat informal
notion of a function started “breaking at the seams,” and eventually
mathematicians arrived at the more rigorous definition of a function
as an arbitrary assignment of input to outputs. While many functions
may be described (or computed) by one or more formulas, today we
do not consider that to be an essential property of functions, and also
allow functions that do not correspond to any “nice” formula.
We have mentioned that all representations of the real numbers
are inherently approximate. Thus an important endeavor is to under-
stand what guarantees we can offer on the approximation quality of
the output of an algorithm, as a function of the approximation quality
of the inputs. This question is known as the question of determining
the numerical stability of given equations. The Floating-Point Guide
website contains an extensive description of the floating-point repre-
sentation, as well the many ways in which it could subtly fail, see also
the website 0.30000000000000004.com.
Dauben [Dau90] gives a biography of Cantor with emphasis on
the development of his mathematical ideas. [Hal60] is a classic text-
book on set theory, also including Cantor’s theorem. Cantor’s Theo-
rem is also covered in many texts on discrete mathematics, including
[LLM18; LZ19].

The adjacency matrix representation of graphs is not merely a con-


venient way to map a graph into a binary string, but it turns out that
many natural notions and operations on matrices are useful for graphs
as well. (For example, Google’s PageRank algorithm relies on this
viewpoint.) The notes of Spielman’s course are an excellent source for
this area, known as spectral graph theory. We will return to this view
much later in this book when we talk about random walks.
I
FINITE COMPUTATION
Learning Objectives:
• See that computation can be precisely modeled.
• Learn the computational model of Boolean circuits / straight-line programs.
• Equivalence of circuits and straight-line programs.
• Equivalence of AND/OR/NOT and NAND.
• Examples of computing in the physical world.

3
Defining computation

“there is no reason why mental as well as bodily labor should not be economized
by the aid of machinery”, Charles Babbage, 1852

“If, unwarned by my example, any man shall undertake and shall succeed
in constructing an engine embodying in itself the whole of the executive de-
partment of mathematical analysis upon different principles or by simpler
mechanical means, I have no fear of leaving my reputation in his charge, for he
alone will be fully able to appreciate the nature of my efforts and the value of
their results.”, Charles Babbage, 1864

“To understand a program you must become both the machine and the pro-
gram.”, Alan Perlis, 1982

People have been computing for thousands of years, with aids


that include not just pen and paper, but also abacus, slide rules, vari-
ous mechanical devices, and modern electronic computers. A priori,
the notion of computation seems to be tied to the particular mech-
anism that you use. You might think that the “best” algorithm for
multiplying numbers will differ if you implement it in Python on a
modern laptop than if you use pen and paper. However, as we saw
in the introduction (Chapter 0), an algorithm that is asymptotically
better would eventually beat a worse one regardless of the underly-
ing technology. This gives us hope for a technology independent way
of defining computation. This is what we do in this chapter. We will
define the notion of computing an output from an input by applying a
sequence of basic operations (see Fig. 3.3). Using this, we will be able
to precisely define statements such as “function 𝑓 can be computed
by model 𝑋” or “function 𝑓 can be computed by model 𝑋 using 𝑠
operations”.

Figure 3.1: Calculating wheels by Charles Babbage. Image taken from the Mark I ‘operating manual’.
Figure 3.2: A 1944 Popular Mechanics article on the Harvard Mark I computer.

This chapter: A non-mathy overview
The main takeaways from this chapter are:




Figure 3.3: A function mapping strings to strings


specifies a computational task, i.e., describes what the
desired relation between the input and the output
is. In this chapter we define models for implementing
computational processes that achieve the desired
relation, i.e., describe how to compute the output
from the input. We will see several examples of such
models using both Boolean circuits and straight-line
programming languages.

• We can use logical operations such as AND, OR, and NOT to


compute an output from an input (see Section 3.2).

• A Boolean circuit is a way to compose the basic logical


operations to compute a more complex function (see Sec-
tion 3.3). We can think of Boolean circuits as both a mathe-
matical model (which is based on directed acyclic graphs)
as well as physical devices we can construct in the real
world in a variety of ways, including not just silicon-based
semi-conductors but also mechanical and even biological
mechanisms (see Section 3.5).

• We can describe Boolean circuits also as straight-line pro-


grams, which are programs that do not have any looping
constructs (i.e., no while / for/ do .. until etc.), see
Section 3.4.

• It is possible to implement the AND, OR, and NOT oper-


ations using the NAND operation (as well as vice versa).
This means that circuits with AND/OR/NOT gates can
compute the same functions (i.e., are equivalent in power)
to circuits with NAND gates, and we can use either model
to describe computation based on our convenience, see
Section 3.6. To give out a “spoiler”, we will see in Chap-
ter 4 that such circuits can compute all finite functions.

One “big idea” of this chapter is the notion of equivalence


between models (Big Idea 3). Two computational models
are equivalent if they can compute the same set of functions.

Boolean circuits with AND/OR/NOT gates are equivalent to


circuits with NAND gates, but this is just one example of the
more general phenomenon that we will see many times in
this book.

3.1 DEFINING COMPUTATION


The name “algorithm” is derived from the Latin transliteration of
Muhammad ibn Musa al-Khwarizmi’s name. Al-Khwarizmi was a
Persian scholar during the 9th century whose books introduced the
western world to the decimal positional numeral system, as well as to
the solutions of linear and quadratic equations (see Fig. 3.4). However
Al-Khwarizmi’s descriptions of algorithms were rather informal by
today’s standards. Rather than use “variables” such as 𝑥, 𝑦, he used
concrete numbers such as 10 and 39, and trusted the reader to be
able to extrapolate from these examples, much as algorithms are still
taught to children today.
Here is how Al-Khwarizmi described the algorithm for solving an
equation of the form 𝑥2 + 𝑏𝑥 = 𝑐:

[How to solve an equation of the form ] “roots and squares are equal to num-
bers”: For instance “one square, and ten roots of the same, amount to thirty-
nine dirhems” that is to say, what must be the square which, when increased
by ten of its own root, amounts to thirty-nine? The solution is this: you halve
the number of the roots, which in the present instance yields five. This you
multiply by itself; the product is twenty-five. Add this to thirty-nine’ the sum
is sixty-four. Now take the root of this, which is eight, and subtract from it half
the number of roots, which is five; the remainder is three. This is the root of the
square which you sought for; the square itself is nine.

Figure 3.4: Text pages from Algebra manuscript with geometrical solutions
to two quadratic equations. Shelfmark: MS. Huntington 214 fol. 004v-005r

For the purposes of this book, we will need a much more precise
way to describe algorithms. Fortunately (or is it unfortunately?), at
least at the moment, computers lag far behind school-age children
in learning from examples. Hence in the 20th century, people came
up with exact formalisms for describing algorithms, namely program-
ming languages. Here is al-Khwarizmi’s quadratic equation solving
algorithm described in the Python programming language:

from math import sqrt
# Pythonspeak to enable use of the sqrt function to compute square roots.

def solve_eq(b,c):
    # return solution of x^2 + bx = c following Al Khwarizmi's instructions
    # Al Khwarizmi demonstrates this for the case b=10 and c=39
    val1 = b / 2.0      # "halve the number of the roots"
    val2 = val1 * val1  # "this you multiply by itself"
    val3 = val2 + c     # "Add this to thirty-nine"
    val4 = sqrt(val3)   # "take the root of this"
    val5 = val4 - val1  # "subtract from it half the number of roots"
    return val5         # "This is the root of the square which you sought for"

# Test: solve x^2 + 10*x = 39
print(solve_eq(10,39))
# 3.0

Figure 3.5: An explanation for children of the two digit addition algorithm.

We can define algorithms informally as follows:

Informal definition of an algorithm: An algorithm is a set of instruc-


tions for how to compute an output from an input by following a se-
quence of “elementary steps”.
An algorithm 𝐴 computes a function 𝐹 if for every input 𝑥, if we follow
the instructions of 𝐴 on the input 𝑥, we obtain the output 𝐹 (𝑥).

In this chapter we will make this informal definition precise using


the model of Boolean Circuits. We will show that Boolean Circuits
are equivalent in power to straight line programs that are written in
“ultra simple” programming languages that do not even have loops.
We will also see that the particular choice of elementary operations is
immaterial and many different choices yield models with equivalent
power (see Fig. 3.6). However, it will take us some time to get there.
We will start by discussing what are “elementary operations” and how
we map a description of an algorithm into an actual physical process
that produces an output from an input in the real world.

Figure 3.6: An overview of the computational models


defined in this chapter. We will show several equiv-
alent ways to represent a recipe for performing a
finite computation. Specifically we will show that we
can model such a computation using either a Boolean
circuit or a straight line program, and these two repre-
sentations are equivalent to one another. We will also
show that we can choose as our basic operations ei-
ther the set {AND, OR, NOT} or the set {NAND} and
these two choices are equivalent in power. By making
the choice of whether to use circuits or programs,
and whether to use {AND, OR, NOT} or {NAND} we
obtain four equivalent ways of modeling finite com-
putation. Moreover, there are many other choices of
sets of basic operations that are equivalent in power.

3.2 COMPUTING USING AND, OR, AND NOT.


An algorithm breaks down a complex calculation into a series of sim-
pler steps. These steps can be executed in a variety of different ways,
including:

• Writing down symbols on a piece of paper.

• Modifying the current flowing on electrical wires.

• Binding a protein to a strand of DNA.

• Responding to a stimulus by a member of a collection (e.g., a bee in


a colony, a trader in a market).

To formally define algorithms, let us try to “err on the side of sim-


plicity” and model our “basic steps” as truly minimal. For example,
here are some very simple functions:

• OR ∶ {0, 1}2 → {0, 1} defined as

                 ⎧ 0   𝑎 = 𝑏 = 0
     OR(𝑎, 𝑏) = ⎨
                 ⎩ 1   otherwise

• AND ∶ {0, 1}2 → {0, 1} defined as

                  ⎧ 1   𝑎 = 𝑏 = 1
     AND(𝑎, 𝑏) = ⎨
                  ⎩ 0   otherwise

• NOT ∶ {0, 1} → {0, 1} defined as

                ⎧ 0   𝑎 = 1
     NOT(𝑎) = ⎨
                ⎩ 1   𝑎 = 0
The functions AND, OR and NOT, are the basic logical operators
used in logic and many computer systems. In the context of logic, it is
common to use the notation 𝑎 ∧ 𝑏 for AND(𝑎, 𝑏), 𝑎 ∨ 𝑏 for OR(𝑎, 𝑏) and
𝑎̄ and ¬𝑎 for NOT(𝑎), and we will use this notation as well.
Each one of the functions AND, OR, NOT takes either one or two
single bits as input, and produces a single bit as output. Clearly, it
cannot get much more basic than that. However, the power of compu-
tation comes from composing such simple building blocks together.

■ Example 3.1 — Majority from 𝐴𝑁 𝐷, 𝑂𝑅 and 𝑁 𝑂𝑇 . Consider the func-
tion MAJ ∶ {0, 1}3 → {0, 1} that is defined as follows:


               ⎧ 1   𝑥0 + 𝑥1 + 𝑥2 ≥ 2
    MAJ(𝑥) = ⎨
               ⎩ 0   otherwise

That is, for every 𝑥 ∈ {0, 1}3 , MAJ(𝑥) = 1 if and only if the ma-
jority (i.e., at least two out of the three) of 𝑥’s elements are equal
to 1. Can you come up with a formula involving AND, OR and
NOT to compute MAJ? (It would be useful for you to pause at this
point and work out the formula for yourself. As a hint, although
the NOT operator is needed to compute some functions, you will
not need to use it to compute MAJ.)
Let us first try to rephrase MAJ(𝑥) in words: “MAJ(𝑥) = 1 if and
only if there exists some pair of distinct elements 𝑖, 𝑗 such that both
𝑥𝑖 and 𝑥𝑗 are equal to 1.” In other words it means that MAJ(𝑥) = 1
iff either both 𝑥0 = 1 and 𝑥1 = 1, or both 𝑥1 = 1 and 𝑥2 = 1, or both
𝑥0 = 1 and 𝑥2 = 1. Since the OR of three conditions 𝑐0 , 𝑐1 , 𝑐2 can
be written as OR(𝑐0 , OR(𝑐1 , 𝑐2 )), we can now translate this into a
formula as follows:

MAJ(𝑥0 , 𝑥1 , 𝑥2 ) = OR ( AND(𝑥0 , 𝑥1 ) , OR(AND(𝑥1 , 𝑥2 ) , AND(𝑥0 , 𝑥2 )) ) .


(3.1)
Recall that we can also write 𝑎 ∨ 𝑏 for OR(𝑎, 𝑏) and 𝑎 ∧ 𝑏 for
AND(𝑎, 𝑏). With this notation, (3.1) can also be written as

MAJ(𝑥0 , 𝑥1 , 𝑥2 ) = ((𝑥0 ∧ 𝑥1 ) ∨ (𝑥1 ∧ 𝑥2 )) ∨ (𝑥0 ∧ 𝑥2 ) .

We can also write (3.1) in a “programming language” form,


expressing it as a set of instructions for computing MAJ given the
basic operations AND, OR, NOT:

def MAJ(X[0],X[1],X[2]):
    firstpair = AND(X[0],X[1])
    secondpair = AND(X[1],X[2])
    thirdpair = AND(X[0],X[2])
    temp = OR(secondpair,thirdpair)
    return OR(firstpair,temp)

3.2.1 Some properties of AND and OR


Like standard addition and multiplication, the functions AND and OR
satisfy the properties of commutativity: 𝑎 ∨ 𝑏 = 𝑏 ∨ 𝑎 and 𝑎 ∧ 𝑏 = 𝑏 ∧ 𝑎
and associativity: (𝑎 ∨ 𝑏) ∨ 𝑐 = 𝑎 ∨ (𝑏 ∨ 𝑐) and (𝑎 ∧ 𝑏) ∧ 𝑐 = 𝑎 ∧ (𝑏 ∧ 𝑐). As in
the case of addition and multiplication, we often drop the parenthesis

and write 𝑎 ∨ 𝑏 ∨ 𝑐 ∨ 𝑑 for ((𝑎 ∨ 𝑏) ∨ 𝑐) ∨ 𝑑, and similarly OR’s and AND’s


of more terms. They also satisfy a variant of the distributive law:
Solved Exercise 3.1 — Distributive law for AND and OR. Prove that for every
𝑎, 𝑏, 𝑐 ∈ {0, 1}, 𝑎 ∧ (𝑏 ∨ 𝑐) = (𝑎 ∧ 𝑏) ∨ (𝑎 ∧ 𝑐).

Solution:
We can prove this by enumerating over all the 8 possible values
for 𝑎, 𝑏, 𝑐 ∈ {0, 1} but it also follows from the standard distributive
law. Suppose that we identify any positive integer with “true” and
the value zero with “false”. Then for all numbers 𝑢, 𝑣 ∈ ℕ, 𝑢 + 𝑣
is positive if and only if 𝑢 ∨ 𝑣 is true and 𝑢 ⋅ 𝑣 is positive if and only
if 𝑢 ∧ 𝑣 is true. This means that for every 𝑎, 𝑏, 𝑐 ∈ {0, 1}, the expres-
sion 𝑎 ∧ (𝑏 ∨ 𝑐) is true if and only if 𝑎 ⋅ (𝑏 + 𝑐) is positive, and the
expression (𝑎 ∧ 𝑏) ∨ (𝑎 ∧ 𝑐) is true if and only if 𝑎 ⋅ 𝑏 + 𝑎 ⋅ 𝑐 is positive,
But by the standard distributive law 𝑎 ⋅ (𝑏 + 𝑐) = 𝑎 ⋅ 𝑏 + 𝑎 ⋅ 𝑐 and
hence the former expression is true if and only if the latter one is.
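As a quick sanity check (this snippet is ours and is not needed for the
proof), one can also verify the identity by enumerating all 8 assignments
in Python, using arithmetic implementations of AND and OR:

# Verify a ∧ (b ∨ c) = (a ∧ b) ∨ (a ∧ c) for all a,b,c in {0,1}.
def AND(a,b): return a*b
def OR(a,b): return 1-(1-a)*(1-b)

for a in [0,1]:
    for b in [0,1]:
        for c in [0,1]:
            assert AND(a, OR(b,c)) == OR(AND(a,b), AND(a,c))
print("distributive law holds for all 8 assignments")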

3.2.2 Extended example: Computing XOR from AND, OR, and NOT
Let us see how we can obtain a different function from the same
building blocks. Define XOR ∶ {0, 1}2 → {0, 1} to be the function
XOR(𝑎, 𝑏) = 𝑎 + 𝑏 mod 2. That is, XOR(0, 0) = XOR(1, 1) = 0 and
XOR(1, 0) = XOR(0, 1) = 1. We claim that we can construct XOR
using only AND, OR, and NOT.

P
As usual, it is a good exercise to try to work out the
algorithm for XOR using AND, OR and NOT on your
own before reading further.

The following algorithm computes XOR using AND, OR, and NOT:

Algorithm 3.2 — 𝑋𝑂𝑅 from 𝐴𝑁 𝐷/𝑂𝑅/𝑁 𝑂𝑇 .

Input: 𝑎, 𝑏 ∈ {0, 1}.


Output: 𝑋𝑂𝑅(𝑎, 𝑏)
1: 𝑤1 ← 𝐴𝑁 𝐷(𝑎, 𝑏)
2: 𝑤2 ← 𝑁 𝑂𝑇 (𝑤1)
3: 𝑤3 ← 𝑂𝑅(𝑎, 𝑏)
4: return 𝐴𝑁 𝐷(𝑤2, 𝑤3)

Lemma 3.3 For every 𝑎, 𝑏 ∈ {0, 1}, on input 𝑎, 𝑏, Algorithm 3.2 outputs
𝑎 + 𝑏 mod 2.

Proof. For every 𝑎, 𝑏, XOR(𝑎, 𝑏) = 1 if and only if 𝑎 is different from


𝑏. On input 𝑎, 𝑏 ∈ {0, 1}, Algorithm 3.2 outputs AND(𝑤2, 𝑤3) where
𝑤2 = NOT(AND(𝑎, 𝑏)) and 𝑤3 = OR(𝑎, 𝑏).

• If 𝑎 = 𝑏 = 0 then 𝑤3 = OR(𝑎, 𝑏) = 0 and so the output will be 0.

• If 𝑎 = 𝑏 = 1 then AND(𝑎, 𝑏) = 1 and so 𝑤2 = NOT(AND(𝑎, 𝑏)) = 0


and the output will be 0.

• If 𝑎 = 1 and 𝑏 = 0 (or vice versa) then both 𝑤3 = OR(𝑎, 𝑏) = 1


and 𝑤1 = AND(𝑎, 𝑏) = 0, in which case the algorithm will output
AND(NOT(𝑤1), 𝑤3) = 1.

We can also express Algorithm 3.2 using a programming language.


Specifically, the following is a Python program that computes the XOR
function:

def AND(a,b): return a*b
def OR(a,b): return 1-(1-a)*(1-b)
def NOT(a): return 1-a

def XOR(a,b):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    return AND(w2,w3)

# Test out the code
print([f"XOR({a},{b})={XOR(a,b)}" for a in [0,1] for b in [0,1]])
# ['XOR(0,0)=0', 'XOR(0,1)=1', 'XOR(1,0)=1', 'XOR(1,1)=0']

Solved Exercise 3.2 — Compute 𝑋𝑂𝑅 on three bits of input. Let XOR3 ∶
{0, 1}3 → {0, 1} be the function defined as XOR3 (𝑎, 𝑏, 𝑐) = 𝑎 + 𝑏 + 𝑐
mod 2. That is, XOR3 (𝑎, 𝑏, 𝑐) = 1 if 𝑎 + 𝑏 + 𝑐 is odd, and XOR3 (𝑎, 𝑏, 𝑐) =
0 otherwise. Show that you can compute XOR3 using AND, OR, and
NOT. You can express it as a formula, use a programming language
such as Python, or use a Boolean circuit.

Solution:
Addition modulo two satisfies the same properties of associativ-
ity ((𝑎 + 𝑏) + 𝑐 = 𝑎 + (𝑏 + 𝑐)) and commutativity (𝑎 + 𝑏 = 𝑏 + 𝑎) as
standard addition. This means that, if we define 𝑎 ⊕ 𝑏 to equal 𝑎 + 𝑏

mod 2, then
XOR3 (𝑎, 𝑏, 𝑐) = (𝑎 ⊕ 𝑏) ⊕ 𝑐

or in other words

XOR3 (𝑎, 𝑏, 𝑐) = XOR(XOR(𝑎, 𝑏), 𝑐) .

Since we know how to compute XOR using AND, OR, and


NOT, we can compose this to compute XOR3 using the same build-
ing blocks. In Python this corresponds to the following program:

def XOR3(a,b,c):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    w4 = AND(w2,w3)
    w5 = AND(w4,c)
    w6 = NOT(w5)
    w7 = OR(w4,c)
    return AND(w6,w7)

# Let's test this out
print([f"XOR3({a},{b},{c})={XOR3(a,b,c)}" for a in [0,1] for b in [0,1] for c in [0,1]])
# ['XOR3(0,0,0)=0', 'XOR3(0,0,1)=1', 'XOR3(0,1,0)=1', 'XOR3(0,1,1)=0',
#  'XOR3(1,0,0)=1', 'XOR3(1,0,1)=0', 'XOR3(1,1,0)=0', 'XOR3(1,1,1)=1']

P
Try to generalize the above examples to obtain a way
to compute XOR𝑛 ∶ {0, 1}𝑛 → {0, 1} for every 𝑛 us-
ing at most 4𝑛 basic steps involving applications of a
function in {AND, OR, NOT} to outputs or previously
computed values.

3.2.3 Informally defining “basic operations” and “algorithms”


We have seen that we can obtain at least some examples of interesting
functions by composing together applications of AND, OR, and NOT.
This suggests that we can use AND, OR, and NOT as our “basic opera-
tions”, hence obtaining the following definition of an “algorithm”:
Semi-formal definition of an algorithm: An algorithm consists of a
sequence of steps of the form “compute a new value by applying AND,
OR, or NOT to previously computed values (assuming that the input
was also previously computed)”.

An algorithm 𝐴 computes a function 𝐹 if for every input 𝑥 to 𝐹 , if we


feed 𝑥 as input to the algorithm, the value computed in its last step is
𝐹 (𝑥).

There are several concerns that are raised by this definition:

1. First and foremost, this definition is indeed too informal. We do not


specify exactly what each step does, nor what it means to “feed 𝑥 as
input”.

2. Second, the choice of AND, OR or NOT seems rather arbitrary.


Why not XOR and MAJ? Why not allow operations like addition
and multiplication? What about any other logical constructions
such if/then or while?

3. Third, do we even know that this definition has anything to do


with actual computing? If someone gave us a description of such an
algorithm, could we use it to actually compute the function in the
real world?

P
These concerns will to a large extent guide us in the
upcoming chapters. Thus you would be well advised
to re-read the above informal definition and see what
you think about these issues.

A large part of this book will be devoted to addressing the above


issues. We will see that:

1. We can make the definition of an algorithm fully formal, and so


give a precise mathematical meaning to statements such as “Algo-
rithm 𝐴 computes function 𝑓”.

2. While the choice of AND/OR/NOT is arbitrary, and we could just


as well have chosen other functions, we will also see this choice
does not matter much. We will see that we would obtain the same
computational power if we instead used addition and multiplica-
tion, and essentially every other operation that could be reasonably
thought of as a basic step.

3. It turns out that we can and do compute such “AND/OR/NOT-


based algorithms” in the real world. First of all, such an algorithm
is clearly well specified, and so can be executed by a human with a
pen and paper. Second, there are a variety of ways to mechanize this
computation. We’ve already seen that we can write Python code
that corresponds to following such a list of instructions. But in fact
we can directly implement operations such as AND, OR, and NOT

via electronic signals using components known as transistors. This is


how modern electronic computers operate.

In the remainder of this chapter, and the rest of this book, we will
begin to answer some of these questions. We will see more examples
of the power of simple operations to compute more complex opera-
tions including addition, multiplication, sorting and more. We will
also discuss how to physically implement simple operations such as
AND, OR and NOT using a variety of technologies.

3.3 BOOLEAN CIRCUITS


Boolean circuits provide a precise notion of “composing basic opera-
tions together”. A Boolean circuit (see Fig. 3.9) is composed of gates
and inputs that are connected by wires. The wires carry a signal that
represents either the value 0 or 1. Each gate corresponds to either the
OR, AND, or NOT operation. An OR gate has two incoming wires,
and one or more outgoing wires. If these two incoming wires carry
the signals 𝑎 and 𝑏 (for 𝑎, 𝑏 ∈ {0, 1}), then the signal on the outgo-
ing wires will be OR(𝑎, 𝑏). AND and NOT gates are defined simi-
larly. The inputs have only outgoing wires. If we set a certain input
to a value 𝑎 ∈ {0, 1}, then this value is propagated on all the wires
outgoing from it. We also designate some gates as output gates, and
their value corresponds to the result of evaluating the circuit. For ex-
ample, Fig. 3.8 gives such a circuit for the XOR function, following
Section 3.2.2. We evaluate an 𝑛-input Boolean circuit 𝐶 on an input
𝑥 ∈ {0, 1}𝑛 by placing the bits of 𝑥 on the inputs, and then propagat-
ing the values on the wires until we reach an output, see Fig. 3.9.

Figure 3.7: Standard symbols for the logical operations or “gates” of AND,
OR, NOT, as well as the operation NAND discussed in Section 3.6.

R
Remark 3.4 — Physical realization of Boolean circuits.
Boolean circuits are a mathematical model that does not
necessarily correspond to a physical object, but they
can be implemented physically. In physical imple-
mentations of circuits, the signal is often implemented
by electric potential, or voltage, on a wire, where for
example voltage above a certain level is interpreted
as a logical value of 1, and below a certain level is in-
terpreted as a logical value of 0. Section 3.5 discusses
physical implementations of Boolean circuits (with
examples including using electrical signals such as
in silicon-based circuits, as well as biological and
mechanical implementations).

Figure 3.8: A circuit with AND, OR and NOT gates for computing the XOR function.

Solved Exercise 3.3 — All equal function. Define ALLEQ ∶ {0, 1}4 → {0, 1}
to be the function that on input 𝑥 ∈ {0, 1}4 outputs 1 if and only if
𝑥0 = 𝑥1 = 𝑥2 = 𝑥3 . Give a Boolean circuit for computing ALLEQ.

Figure 3.9: A Boolean Circuit consists of gates that are


connected by wires to one another and the inputs. The
left side depicts a circuit with 2 inputs and 5 gates,
one of which is designated the output gate. The right
side depicts the evaluation of this circuit on the input
𝑥 ∈ {0, 1}2 with 𝑥0 = 1 and 𝑥1 = 0. The value of
every gate is obtained by applying the corresponding
function (AND, OR, or NOT) to values on the wire(s)
that enter it. The output of the circuit on a given
input is the value of the output gate(s). In this case,
the circuit computes the XOR function and hence it
outputs 1 on the input 10.

Solution:
Another way to describe the function ALLEQ is that it outputs
1 on an input 𝑥 ∈ {0, 1}4 if and only if 𝑥 = 0000 or 𝑥 = 1111. We can
phrase the condition 𝑥 = 1111 as 𝑥0 ∧ 𝑥1 ∧ 𝑥2 ∧ 𝑥3 which can be
computed using three AND gates. Similarly we can phrase the con-
dition 𝑥 = 0000 as ¬𝑥0 ∧ ¬𝑥1 ∧ ¬𝑥2 ∧ ¬𝑥3 which can be computed using four
NOT gates and three AND gates. The output of ALLEQ is the OR
of these two conditions, which results in the circuit of 4 NOT gates,
6 AND gates, and one OR gate presented in Fig. 3.10.
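To make the composition concrete, here is a short Python sketch (our
own illustration, not code from the book's repository) that expresses the
ALLEQ construction above as a composition of AND, OR and NOT, and
checks it against the specification on all 16 inputs:

# Sketch: ALLEQ as a composition of AND, OR and NOT (illustrative only).
def AND(a, b): return a & b
def OR(a, b):  return a | b
def NOT(a):    return 1 - a

def ALLEQ(x0, x1, x2, x3):
    all_ones  = AND(AND(x0, x1), AND(x2, x3))                      # x = 1111
    all_zeros = AND(AND(NOT(x0), NOT(x1)), AND(NOT(x2), NOT(x3)))  # x = 0000
    return OR(all_ones, all_zeros)

from itertools import product
assert all(ALLEQ(*x) == (1 if len(set(x)) == 1 else 0)
           for x in product([0, 1], repeat=4))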

3.3.1 Boolean circuits: a formal definition


We defined Boolean circuits informally as obtained by connecting
AND, OR, and NOT gates via wires so as to produce an output from
an input. However, to be able to prove theorems about the existence or
non-existence of Boolean circuits for computing various functions we
need to:

1. Formally define a Boolean circuit as a mathematical object.

2. Formally define what it means for a circuit 𝐶 to compute a function
𝑓.

Figure 3.10: A Boolean circuit for computing the all equal function ALLEQ ∶ {0, 1}4 → {0, 1} that outputs 1 on 𝑥 ∈ {0, 1}4 if and only if 𝑥0 = 𝑥1 = 𝑥2 = 𝑥3 .

We now proceed to do so. We will define a Boolean circuit as a


labeled Directed Acyclic Graph (DAG). The vertices of the graph corre-
spond to the gates and inputs of the circuit, and the edges of the graph
correspond to the wires. A wire from an input or gate 𝑢 to a gate 𝑣 in
the circuit corresponds to a directed edge between the corresponding
vertices. The inputs are vertices with no incoming edges, while each
gate has the appropriate number of incoming edges based on the func-
tion it computes. (That is, AND and OR gates have two in-neighbors,
while NOT gates have one in-neighbor.) The formal definition is as
follows (see also Fig. 3.11):

Figure 3.11: A Boolean Circuit is a labeled directed


acyclic graph (DAG). It has 𝑛 input vertices, which are
marked with X[0],…, X[𝑛 − 1] and have no incoming
edges, and the rest of the vertices are gates. AND,
OR, and NOT gates have two, two, and one incoming
edges, respectively. If the circuit has 𝑚 outputs, then
𝑚 of the gates are known as outputs and are marked
with Y[0],…,Y[𝑚 − 1]. When we evaluate a circuit
𝐶 on an input 𝑥 ∈ {0, 1}𝑛 , we start by setting the
value of the input vertices to 𝑥0 , … , 𝑥𝑛−1 and then
propagate the values, assigning to each gate 𝑔 the
result of applying the operation of 𝑔 to the values of
𝑔’s in-neighbors. The output of the circuit is the value
assigned to the output gates.

Definition 3.5 — Boolean Circuits. Let 𝑛, 𝑚, 𝑠 be positive integers with


𝑠 ≥ 𝑚. A Boolean circuit with 𝑛 inputs, 𝑚 outputs, and 𝑠 gates, is a
labeled directed acyclic graph (DAG) 𝐺 = (𝑉 , 𝐸) with 𝑠+𝑛 vertices
satisfying the following properties:

• Exactly 𝑛 of the vertices have no in-neighbors. These vertices


are known as inputs and are labeled with the 𝑛 labels X[0], …,
X[𝑛 − 1]. Each input has at least one out-neighbor.

• The other 𝑠 vertices are known as gates. Each gate is labeled with
∧, ∨ or ¬. Gates labeled with ∧ (AND) or ∨ (OR) have two in-
neighbors. Gates labeled with ¬ (NOT) have one in-neighbor.
We will allow parallel edges. 1

• Exactly 𝑚 of the gates are also labeled with the 𝑚 labels Y[0], …,
Y[𝑚 − 1] (in addition to their label ∧/∨/¬). These are known as
outputs.

The size of a Boolean circuit is the number 𝑠 of gates it contains.


1
Having parallel edges means an AND or OR gate 𝑢 can have both its
in-neighbors be the same gate 𝑣. Since AND(𝑎, 𝑎) = OR(𝑎, 𝑎) = 𝑎 for
every 𝑎 ∈ {0, 1}, such parallel edges don't help in computing new
values in circuits with AND/OR/NOT gates. However, we will see
circuits with more general sets of gates later on.

P
This is a non-trivial mathematical definition, so it is
worth taking the time to read it slowly and carefully.
As in all mathematical definitions, we are using a
known mathematical object — a directed acyclic graph
(DAG) — to define a new object, a Boolean circuit.
This might be a good time to review some of the basic
properties of DAGs and in particular the fact that they
can be topologically sorted, see Section 1.6.

If 𝐶 is a circuit with 𝑛 inputs and 𝑚 outputs, and 𝑥 ∈ {0, 1}𝑛 , then


we can compute the output of 𝐶 on the input 𝑥 in the natural way:
assign the input vertices X[0], …, X[𝑛 − 1] the values 𝑥0 , … , 𝑥𝑛−1 ,
apply each gate on the values of its in-neighbors, and then output the
values that correspond to the output vertices. Formally, this is defined
as follows:

Definition 3.6 — Computing a function via a Boolean circuit. Let 𝐶 be a


Boolean circuit with 𝑛 inputs and 𝑚 outputs. For every 𝑥 ∈ {0, 1}𝑛 ,
the output of 𝐶 on the input 𝑥, denoted by 𝐶(𝑥), is defined as the
result of the following process:
We let ℎ ∶ 𝑉 → ℕ be the minimal layering of 𝐶 (aka topological
sorting, see Theorem 1.26). We let 𝐿 be the maximum layer of ℎ,
and for ℓ = 0, 1, … , 𝐿 we do the following:

• For every 𝑣 in the ℓ-th layer (i.e., 𝑣 such that ℎ(𝑣) = ℓ) do:

– If 𝑣 is an input vertex labeled with X[𝑖] for some 𝑖 ∈ [𝑛], then


we assign to 𝑣 the value 𝑥𝑖 .
– If 𝑣 is a gate vertex labeled with ∧ and with two in-neighbors
𝑢, 𝑤 then we assign to 𝑣 the AND of the values assigned to
𝑢 and 𝑤. (Since 𝑢 and 𝑤 are in-neighbors of 𝑣, they are in a
lower layer than 𝑣, and hence their values have already been
assigned.)
– If 𝑣 is a gate vertex labeled with ∨ and with two in-neighbors
𝑢, 𝑤 then we assign to 𝑣 the OR of the values assigned to 𝑢
and 𝑤.
– If 𝑣 is a gate vertex labeled with ¬ and with one in-neighbor 𝑢
then we assign to 𝑣 the negation of the value assigned to 𝑢.

• The result of this process is the value 𝑦 ∈ {0, 1}𝑚 such that for
every 𝑗 ∈ [𝑚], 𝑦𝑗 is the value assigned to the vertex with label
Y[𝑗].

Let 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 . We say that the circuit 𝐶 computes 𝑓 if
for every 𝑥 ∈ {0, 1}𝑛 , 𝐶(𝑥) = 𝑓(𝑥).
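The layer-by-layer evaluation process of Definition 3.6 translates directly
into code. The following Python sketch is our own illustration (not the
book's official implementation, and the names eval_circuit and the gate
encoding are our choices); for simplicity it assumes the gates are already
listed in a topological order, which Theorem 1.26 guarantees exists for
any DAG:

# Sketch: evaluating a Boolean circuit whose gates are given in topological
# order.  Vertices 0,...,n-1 are the inputs X[0],...,X[n-1]; each gate is a
# tuple (op, in_neighbors) and appears after its in-neighbors.
def eval_circuit(n, gates, outputs, x):
    """gates: list of ('AND'|'OR'|'NOT', [indices of in-neighbors])
       outputs: list of vertex indices labeled Y[0],...,Y[m-1]
       x: list of n input bits"""
    val = list(x)                      # values of vertices 0,...,n-1
    for op, ins in gates:              # in-neighbors are already computed
        if op == 'AND':
            val.append(val[ins[0]] & val[ins[1]])
        elif op == 'OR':
            val.append(val[ins[0]] | val[ins[1]])
        else:                          # 'NOT'
            val.append(1 - val[ins[0]])
    return [val[v] for v in outputs]

# A circuit for XOR in the spirit of Fig. 3.8: vertices 0,1 are inputs,
# vertices 2,...,5 are gates, vertex 5 is the output.
xor_circuit = [('OR', [0, 1]),        # vertex 2 = x0 OR x1
               ('AND', [0, 1]),       # vertex 3 = x0 AND x1
               ('NOT', [3]),          # vertex 4 = NOT(x0 AND x1)
               ('AND', [2, 4])]       # vertex 5 = output
assert [eval_circuit(2, xor_circuit, [5], [a, b])[0]
        for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]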

R
Remark 3.7 — Boolean circuits nitpicks (optional). In
phrasing Definition 3.5, we’ve made some technical

choices that are not very important, but will be con-


venient for us later on. Having parallel edges means
an AND or OR gate 𝑢 can have both its in-neighbors
be the same gate 𝑣. Since AND(𝑎, 𝑎) = OR(𝑎, 𝑎) = 𝑎
for every 𝑎 ∈ {0, 1}, such parallel edges don’t help in
computing new values in circuits with AND/OR/NOT
gates. However, we will see circuits with more general
sets of gates later on. The condition that every input
vertex has at least one out-neighbor is also not very
important because we can always add “dummy gates”
that touch these inputs. However, it is convenient
since it guarantees that (since every gate has at most
two in-neighbors) the number of inputs in a circuit is
never larger than twice its size.

3.4 STRAIGHT-LINE PROGRAMS


We have seen two ways to describe how to compute a function 𝑓 using
AND, OR and NOT:

• A Boolean circuit, defined in Definition 3.5, computes 𝑓 by connect-


ing via wires AND, OR, and NOT gates to the inputs.

• We can also describe such a computation using a straight-line


program that has lines of the form foo = AND(bar,blah), foo =
OR(bar,blah) and foo = NOT(bar) where foo, bar and blah are
variable names. (We call this a straight-line program since it contains
no loops or branching (e.g., if/then) statements.)

To make the second definition more precise, we will now define a


programming language that is equivalent to Boolean circuits. We call
this programming language the AON-CIRC programming language
(“AON” stands for AND/OR/NOT; “CIRC” stands for circuit).
For example, the following is an AON-CIRC program that on in-
put 𝑥 ∈ {0, 1}2 , outputs ¬(𝑥0 ∧ 𝑥1 ) (i.e., the NOT operation applied to
AND(𝑥0 , 𝑥1 )):

temp = AND(X[0],X[1])
Y[0] = NOT(temp)

AON-CIRC is not a practical programming language: it was de-


signed for pedagogical purposes only, as a way to model computation
as the composition of AND, OR, and NOT. However, it can still be
easily implemented on a computer.
Given this example, you might already be able to guess how to
write a program for computing (for example) 𝑥0 ∧ 𝑥1 ∨ 𝑥2 , and in
general how to translate a Boolean circuit into an AON-CIRC program.
However, since we will want to prove mathematical statements about

AON-CIRC programs, we will need to precisely define the AON-CIRC


programming language. Precise specifications of programming lan-
guages can sometimes be long and tedious,2 but are crucial for secure
and reliable implementations. Luckily, the AON-CIRC programming
language is simple enough that we can define it formally with rela-
tively little pain.

2
For example the C programming language specification takes more than 500 pages.

3.4.1 Specification of the AON-CIRC programming language


An AON-CIRC program is a sequence of strings, which we call
“lines”, satisfying the following conditions:

• Every line has one of the following forms: foo = AND(bar,baz),


foo = OR(bar,baz), or foo = NOT(bar) where foo, bar and baz
are variable identifiers. (We follow the common programming lan-
guages convention of using names such as foo, bar, baz as stand-
ins for generic identifiers.) The line foo = AND(bar,baz) corre-
sponds to the operation of assigning to the variable foo the logical
AND of the values of the variables bar and baz. Similarly foo =
OR(bar,baz) and foo = NOT(bar) correspond to the logical OR
and logical NOT operations.

• A variable identifier in the AON-CIRC programming language can


be any combination of letters, numbers, underscores, and brackets.
There are two special types of variables:

– Variables of the form X[𝑖], with 𝑖 ∈ {0, 1, … , 𝑛 − 1} are known as


input variables.
– Variables of the form Y[𝑗] are known as output variables.

• A valid AON-CIRC program 𝑃 includes input variables of the form


X[0],…,X[𝑛 − 1] and output variables of the form Y[0],…, Y[𝑚 − 1]
where 𝑛, 𝑚 are natural numbers. We say that 𝑛 is the number of
inputs of the program 𝑃 and 𝑚 is the number of outputs.

• In a valid AON-CIRC program, in every line the variables on the


right-hand side of the assignment operator must either be input
variables or variables that have already been assigned a value in a
previous line.

• If 𝑃 is a valid AON-CIRC program of 𝑛 inputs and 𝑚 outputs,


then for every 𝑥 ∈ {0, 1}𝑛 the output of 𝑃 on input 𝑥 is the string
𝑦 ∈ {0, 1}𝑚 defined as follows:

– Initialize the input variables X[0],…,X[𝑛 − 1] to the values


𝑥0 , … , 𝑥𝑛−1
– Run the operator lines of 𝑃 one by one in order, in each line
assigning to the variable on the left-hand side of the assignment
operators the value of the operation on the right-hand side.

– Let 𝑦 ∈ {0, 1}𝑚 be the values of the output variables Y[0],…,


Y[𝑚 − 1] at the end of the execution.

• We denote the output of 𝑃 on input 𝑥 by 𝑃 (𝑥).

• The size of an AON-CIRC program 𝑃 is the number of lines it con-


tains. (The reader might note that this corresponds to our defini-
tion of the size of a circuit as the number of gates it contains.)

Now that we formally specified AON-CIRC programs, we can


define what it means for an AON-CIRC program 𝑃 to compute a
function 𝑓:

Definition 3.8 — Computing a function via AON-CIRC programs. Let 𝑓 ∶
{0, 1}𝑛 → {0, 1}𝑚 , and 𝑃 be a valid AON-CIRC program with 𝑛
inputs and 𝑚 outputs. We say that 𝑃 computes 𝑓 if 𝑃 (𝑥) = 𝑓(𝑥) for
every 𝑥 ∈ {0, 1}𝑛 .

The following solved exercise gives an example of an AON-CIRC


program.
Solved Exercise 3.4 Consider the following function CMP ∶ {0, 1}4 →
{0, 1} that on four input bits 𝑎, 𝑏, 𝑐, 𝑑 ∈ {0, 1}, outputs 1 iff the number
represented by (𝑎, 𝑏) is larger than the number represented by (𝑐, 𝑑).
That is CMP(𝑎, 𝑏, 𝑐, 𝑑) = 1 iff 2𝑎 + 𝑏 > 2𝑐 + 𝑑.
Write an AON-CIRC program to compute CMP.

Solution:
Writing such a program is tedious but not truly hard. To com-
pare two numbers we first compare their most significant digit,
and then go down to the next digit and so on and so forth. In this
case where the numbers have just two binary digits, these compar-
isons are particularly simple. The number represented by (𝑎, 𝑏) is
larger than the number represented by (𝑐, 𝑑) if and only if one of
the following conditions happens:

1. The most significant bit 𝑎 of (𝑎, 𝑏) is larger than the most signifi-
cant bit 𝑐 of (𝑐, 𝑑).

or

2. The two most significant bits 𝑎 and 𝑐 are equal, but 𝑏 > 𝑑.

Another way to express the same condition is the following: the


number (𝑎, 𝑏) is larger than (𝑐, 𝑑) iff 𝑎 > 𝑐 OR (𝑎 ≥ 𝑐 AND 𝑏 > 𝑑).

For binary digits 𝛼, 𝛽, the condition 𝛼 > 𝛽 is simply that 𝛼 = 1


and 𝛽 = 0 or AND(𝛼, NOT(𝛽)) = 1, and the condition 𝛼 ≥ 𝛽 is sim-
ply OR(𝛼, NOT(𝛽)) = 1. Together these observations can be used to
give the following AON-CIRC program to compute CMP:

# Compute CMP:{0,1}^4-->{0,1}
# CMP(X)=1 iff 2X[0]+X[1] > 2X[2] + X[3]
temp_1 = NOT(X[2])
temp_2 = AND(X[0],temp_1)
temp_3 = OR(X[0],temp_1)
temp_4 = NOT(X[3])
temp_5 = AND(X[1],temp_4)
temp_6 = AND(temp_5,temp_3)
Y[0] = OR(temp_2,temp_6)

We can also present this program as a circuit with one gate per line,
see Fig. 3.12.
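Since AON-CIRC is so simple, an interpreter for it fits in a few lines.
The following Python sketch is our own minimal illustration (not the
book's official code; the name eval_aon_circ is ours and the parsing
does no validation). Running it on the CMP program above reproduces,
for instance, CMP(1, 1, 1, 0) = 1:

# Sketch: a minimal AON-CIRC interpreter (assumes the program is valid).
import re

def eval_aon_circ(program, x):
    var = {f"X[{i}]": xi for i, xi in enumerate(x)}   # input variables
    ops = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b,
           "NOT": lambda a: 1 - a}
    for line in program.splitlines():
        line = line.split("#")[0].strip()             # drop comments / blanks
        if not line:
            continue
        target, expr = [s.strip() for s in line.split("=", 1)]
        op, args = re.match(r"(\w+)\((.*)\)", expr).groups()
        var[target] = ops[op](*[var[a.strip()] for a in args.split(",")])
    m = 1 + max(int(v[2:-1]) for v in var if v.startswith("Y["))
    return [var[f"Y[{j}]"] for j in range(m)]

cmp_prog = """temp_1 = NOT(X[2])
temp_2 = AND(X[0],temp_1)
temp_3 = OR(X[0],temp_1)
temp_4 = NOT(X[3])
temp_5 = AND(X[1],temp_4)
temp_6 = AND(temp_5,temp_3)
Y[0] = OR(temp_2,temp_6)"""

assert eval_aon_circ(cmp_prog, [1, 1, 1, 0]) == [1]   # 3 > 2
assert eval_aon_circ(cmp_prog, [0, 1, 1, 0]) == [0]   # 1 < 2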

3.4.2 Proving equivalence of AON-CIRC programs and Boolean circuits


We now formally prove that AON-CIRC programs and Boolean cir-
cuits have exactly the same power:

Theorem 3.9 — Equivalence of circuits and straight-line programs. Let
𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 and 𝑠 ≥ 𝑚 be some number. Then 𝑓 is
computable by a Boolean circuit with 𝑠 gates if and only if 𝑓 is
computable by an AON-CIRC program of 𝑠 lines.

Figure 3.12: A circuit for computing the CMP function. The evaluation of this circuit on (1, 1, 1, 0) yields the output 1, since the number 3 (represented in binary as 11) is larger than the number 2 (represented in binary as 10).

Proof Idea:
The idea is simple - AON-CIRC programs and Boolean circuits
are just different ways of describing the exact same computational
process. For example, an AND gate in a Boolean circuit corresponds to
computing the AND of two previously-computed values. In an AON-
CIRC program this will correspond to the line that stores in a variable
the AND of two previously-computed variables.

P
This proof of Theorem 3.9 is simple at heart, but all
the details it contains can make it a little cumbersome
to read. You might be better off trying to work it out
yourself before reading it. Our GitHub repository con-
tains a “proof by Python” of Theorem 3.9: implemen-
tation of functions circuit2prog and prog2circuits
mapping Boolean circuits to AON-CIRC programs and
vice versa.

Proof of Theorem 3.9. Let 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 . Since the theorem is an
“if and only if” statement, to prove it we need to show both directions:
translating an AON-CIRC program that computes 𝑓 into a circuit that
computes 𝑓, and translating a circuit that computes 𝑓 into an AON-
CIRC program that does so.
We start with the first direction. Let 𝑃 be an AON-CIRC program
that computes 𝑓. We define a circuit 𝐶 as follows: the circuit will
have 𝑛 inputs and 𝑠 gates. For every 𝑖 ∈ [𝑠], if the 𝑖-th operator line
has the form foo = AND(bar,blah) then the 𝑖-th gate in the circuit
will be an AND gate that is connected to gates 𝑗 and 𝑘 where 𝑗 and
𝑘 correspond to the last lines before 𝑖 where the variables bar and
blah (respectively) were written to. (For example, if 𝑖 = 57 and the
last line bar was written to is 35 and the last line blah was written
to is 17 then the two in-neighbors of gate 57 will be gates 35 and 17.)
If either bar or blah is an input variable then we connect the gate to
the corresponding input vertex instead. If foo is an output variable
of the form Y[𝑗] then we add the same label to the corresponding
gate to mark it as an output gate. We do the analogous operations if
the 𝑖-th line involves an OR or a NOT operation (except that we use the
corresponding OR or NOT gate, and in the latter case have only one
in-neighbor instead of two). For every input 𝑥 ∈ {0, 1}𝑛 , if we run
the program 𝑃 on 𝑥, then the value written that is computed in the
𝑖-th line is exactly the value that will be assigned to the 𝑖-th gate if we
evaluate the circuit 𝐶 on 𝑥. Hence 𝐶(𝑥) = 𝑃 (𝑥) for every 𝑥 ∈ {0, 1}𝑛 .
For the other direction, let 𝐶 be a circuit of 𝑠 gates and 𝑛 inputs that
computes the function 𝑓. We sort the gates according to a topological
order and write them as 𝑣0 , … , 𝑣𝑠−1 . We now can create a program
𝑃 of 𝑠 operator lines as follows. For every 𝑖 ∈ [𝑠], if 𝑣𝑖 is an AND
gate with in-neighbors 𝑣𝑗 , 𝑣𝑘 then we will add a line to 𝑃 of the form
temp_𝑖 = AND(temp_𝑗,temp_𝑘), unless one of the vertices is an input
vertex or an output gate, in which case we change this to the form
X[.] or Y[.] appropriately. Because we work in topological order-
ing, we are guaranteed that the in-neighbors 𝑣𝑗 and 𝑣𝑘 correspond to
variables that have already been assigned a value. We do the same for
OR and NOT gates. Once again, one can verify that for every input 𝑥,
the value 𝑃 (𝑥) will equal 𝐶(𝑥) and hence the program computes the
same function as the circuit. (Note that since 𝐶 is a valid circuit, per
Definition 3.5, every input vertex of 𝐶 has at least one out-neighbor
and there are exactly 𝑚 output gates labeled 0, … , 𝑚 − 1; hence all the
variables X[0], …, X[𝑛 − 1] and Y[0] ,…, Y[𝑚 − 1] will appear in the
program 𝑃 .)

Figure 3.13: Two equivalent descriptions of the same
AND/OR/NOT computation as both an AON pro-
gram and a Boolean circuit.

3.5 PHYSICAL IMPLEMENTATIONS OF COMPUTING DEVICES (DIGRESSION)

Computation is an abstract notion that is distinct from its physical im-
plementations. While most modern computing devices are obtained by
mapping logical gates to semiconductor-based transistors, throughout
history people have computed using a huge variety of mechanisms,
including mechanical systems, gas and liquid (known as fluidics), bi-
ological and chemical processes, and even living creatures (e.g., see
Fig. 3.14 or this video for how crabs or slime mold can be used to do
computations).
In this section we will review some of these implementations, both
so you can get an appreciation of how it is possible to directly translate
Boolean circuits to the physical world, without going through the en-
tire stack of architecture, operating systems, and compilers, as well as
to emphasize that silicon-based processors are by no means the only
way to perform computation. Indeed, as we will see in Chapter 23,
a very exciting recent line of work involves using different media for
computation that would allow us to take advantage of quantum me-
chanical effects to enable different types of algorithms.

3.5.1 Transistors
A transistor can be thought of as an electric circuit with two inputs,
known as the source and the gate and an output, known as the sink.
The gate controls whether current flows from the source to the sink. In
a standard transistor, if the gate is "ON" then current can flow from the
source to the sink and if it is "OFF" then it can't. In a complementary
transistor this is reversed: if the gate is "OFF" then current can flow
from the source to the sink and if it is "ON" then it can't.

Figure 3.14: Crab-based logic gates from the paper "Robust soldier-crab ball gate" by Gunji, Nishiyama and Adamatzky. This is an example of an AND gate that relies on the tendency of two swarms of crabs arriving from different directions to combine to a single swarm that continues in the average of the directions.

There are several ways to implement the logic of a transistor. For
example, we can use faucets to implement it using water pressure
(e.g. Fig. 3.15). This might seem as merely a curiosity, but there is
a field known as fluidics concerned with implementing logical op-
erations using liquids or gasses. Some of the motivations include
operating in extreme environmental conditions such as in space or a
battlefield, where standard electronic equipment would not survive.
Figure 3.15: We can implement the logic of transistors using water. The water pressure from the gate closes or opens a faucet between the source and the sink.

The standard implementations of transistors use electrical current.
One of the original implementations used vacuum tubes. As its name
implies, a vacuum tube is a tube containing nothing (i.e., a vacuum)
and where a priori electrons could freely flow from the source (a
wire) to the sink (a plate). However, there is a gate (a grid) between
the two, where modulating its voltage can block the flow of electrons.

Early vacuum tubes were roughly the size of lightbulbs (and


looked very much like them too). In the 1950’s they were supplanted
by transistors, which implement the same logic using semiconduc-
tors which are materials that normally do not conduct electricity but
whose conductivity can be modified and controlled by inserting impu-
rities (“doping”) and applying an external electric field (this is known
as the field effect). In the 1960’s computers started to be implemented
using integrated circuits which enabled much greater density. In 1965,
Gordon Moore predicted that the number of transistors per integrated
circuit would double every year (see Fig. 3.16), and that this would
lead to “such wonders as home computers —or at least terminals con-
nected to a central computer— automatic controls for automobiles,
and personal portable communications equipment”. Since then, (ad-
justed versions of) this so-called “Moore’s law” have been running
strong, though exponential growth cannot be sustained forever, and
some physical limitations are already becoming apparent.

Figure 3.16: The number of transistors per integrated circuit from 1959 till 1965 and a prediction that exponential growth will continue for at least another decade. Figure taken from "Cramming More Components onto Integrated Circuits", Gordon Moore, 1965.

3.5.2 Logical gates from transistors
We can use transistors to implement various Boolean functions such
as AND, OR, and NOT. For each two-input gate 𝐺 ∶ {0, 1}2 → {0, 1},
such an implementation would be a system with two input wires 𝑥, 𝑦
and one output wire 𝑧, such that if we identify high voltage with “1”
and low voltage with “0”, then the wire 𝑧 will be equal to “1” if and
only if applying 𝐺 to the values of the wires 𝑥 and 𝑦 is 1 (see Fig. 3.19
and Fig. 3.20). This means that if there exists a AND/OR/NOT circuit
and Fig. 3.20). This means that if there exists an AND/OR/NOT circuit
to compute a function 𝑔 ∶ {0, 1}𝑛 → {0, 1}𝑚 , then we can compute 𝑔 in
the physical world using transistors as well.

Figure 3.17: Cartoon from Gordon Moore's article "predicting" the implications of radically improving transistor density.
3.5.3 Biological computing


Computation can be based on biological or chemical systems. For
example the lac operon produces the enzymes needed to digest lactose
only if the conditions 𝑥 ∧ (¬𝑦) hold where 𝑥 is “lactose is present”
and 𝑦 is “glucose is present”. Researchers have managed to create
transistors, and from them logic gates, based on DNA molecules (see
also Fig. 3.21). Projects such as the Cello programming language
enable converting Boolean circuits into DNA sequences that encode
operations that can be executed in bacterial cells, see this video. One
motivation for DNA computing is to achieve increased parallelism
or storage density; another is to create "smart biological agents" that
could perhaps be injected into bodies, replicate themselves, and fix or
kill cells that were damaged by a disease such as cancer. Computing
in biological systems is not restricted, of course, to DNA: even larger
systems such as flocks of birds can be considered as computational
processes.

Figure 3.18: The exponential growth in computing power over the last 120 years. Graph by Steve Jurvetson, extending a prior graph of Ray Kurzweil.

Figure 3.19: Implementing logical gates using transistors. Figure taken from Rory Mangles' website.

3.5.4 Cellular automata and the game of life


Cellular automata is a model of a system composed of a sequence of
cells, each of which can have a finite state. At each step, a cell updates
its state based on the states of its neighboring cells and some simple
rules. As we will discuss later in this book (see Section 8.4), cellular
automata such as Conway’s “Game of Life” can be used to simulate
computation gates.

Figure 3.20: Implementing a NAND gate (see Section 3.6) using transistors.

3.5.5 Neural networks


One computation device that we all carry with us is our own brain.
Brains have served humanity throughout history, doing computations
that range from distinguishing prey from predators, through making
scientific discoveries and artistic masterpieces, to composing witty 280
character messages. The exact working of the brain is still not fully
understood, but one common mathematical model for it is a (very
large) neural network.

Figure 3.21: Performance of DNA-based logic gates. Figure taken from paper of Bonnet et al, Science, 2013.

A neural network can be thought of as a Boolean circuit that instead

of AND/OR/NOT uses some other gates as the basic basis. For exam-
ple, one particular basis we can use are threshold gates. For every vector
𝑤 = (𝑤0 , … , 𝑤𝑘−1 ) of integers and integer 𝑡 (some or all of which
could be negative), the threshold function corresponding to 𝑤, 𝑡 is the
function 𝑇𝑤,𝑡 ∶ {0, 1}𝑘 → {0, 1} that maps 𝑥 ∈ {0, 1}𝑘 to 1 if and only if
∑_{𝑖=0}^{𝑘−1} 𝑤𝑖 𝑥𝑖 ≥ 𝑡. For example, the threshold function 𝑇𝑤,𝑡 correspond-
ing to 𝑤 = (1, 1, 1, 1, 1) and 𝑡 = 3 is simply the majority function MAJ5
on {0, 1}5 . Threshold gates can be thought of as an approximation for
neuron cells that make up the core of human and animal brains. To a
first approximation, a neuron has 𝑘 inputs and a single output, and
the neuron "fires" or "turns on" its output when those signals pass
some threshold.

Figure 3.22: An AND gate using a "Game of Life" configuration. Figure taken from Jean-Philippe Rennard's paper.
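As a quick illustration (a sketch of ours, not code from the book), the
threshold function takes only a couple of lines, and the special case
mentioned above indeed recovers the majority function on five bits:

# Sketch: the threshold function T_{w,t}, and the special case MAJ_5.
from itertools import product

def T(w, t, x):
    """Threshold gate: 1 iff sum_i w[i]*x[i] >= t."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0

# With w = (1,1,1,1,1) and t = 3 this is the majority function on {0,1}^5.
assert all(T([1]*5, 3, x) == (1 if sum(x) >= 3 else 0)
           for x in product([0, 1], repeat=5))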
Many machine learning algorithms use artificial neural networks
whose purpose is not to imitate biology but rather to perform some
computational tasks, and hence are not restricted to a threshold or
other biologically-inspired gates. Generally, a neural network is often
described as operating on signals that are real numbers, rather than
0/1 values, and where the output of a gate on inputs 𝑥0 , … , 𝑥𝑘−1 is
obtained by applying 𝑓(∑𝑖 𝑤𝑖 𝑥𝑖 ) where 𝑓 ∶ ℝ → ℝ is an activation
function such as rectified linear unit (ReLU), Sigmoid, or many others
(see Fig. 3.23). However, for the purposes of our discussion, all of
the above are equivalent (see also Exercise 3.13). In particular we can
reduce the setting of real inputs to binary inputs by representing a
real number in the binary basis, and multiplying the weight of the bit
corresponding to the 𝑖𝑡ℎ digit by 2𝑖 .

3.5.6 A computer made from marbles and pipes


We can implement computation using many other physical media,
without any electronic, biological, or chemical components. Many
suggestions for mechanical computers have been put forward, going
back at least to Gottfried Leibniz’s computing machines from the
1670s and Charles Babbage’s 1837 plan for a mechanical “Analytical
Engine". As one example, Fig. 3.24 shows a simple implementation of
a NAND (negation of AND, see Section 3.6) gate using marbles going
through pipes. We represent a logical value in {0, 1} by a pair of pipes,
such that there is a marble flowing through exactly one of the pipes.
We call one of the pipes the "0 pipe" and the other the "1 pipe", and
so the identity of the pipe containing the marble determines the logi-
cal value. A NAND gate corresponds to a mechanical object with two
pairs of incoming pipes and one pair of outgoing pipes, such that for
every 𝑎, 𝑏 ∈ {0, 1}, if two marbles are rolling toward the object in the
𝑎 pipe of the first pair and the 𝑏 pipe of the second pair, then a marble
will roll out of the object in the NAND(𝑎, 𝑏)-pipe of the outgoing pair.
In fact, there is even a commercially-available educational game that
uses marbles as a basis of computing, see Fig. 3.26.

Figure 3.23: Common activation functions used in Neural Networks, including rectified linear units (ReLU), sigmoids, and hyperbolic tangent. All of those can be thought of as continuous approximations to the step function. All of these can be used to compute the NAND gate (see Exercise 3.13). This property enables neural networks to (approximately) compute any function that can be computed by a Boolean circuit.

3.6 THE NAND FUNCTION


The NAND function is another simple function that is extremely use-
ful for defining computation. It is the function mapping {0, 1}2 to
{0, 1} defined by:

NAND(𝑎, 𝑏) = 0 if 𝑎 = 𝑏 = 1, and NAND(𝑎, 𝑏) = 1 otherwise.

As its name implies, NAND is the NOT of AND (i.e., NAND(𝑎, 𝑏) =
NOT(AND(𝑎, 𝑏))), and so we can clearly compute NAND using AND
and NOT. Interestingly, the opposite direction holds as well:

Theorem 3.10 — NAND computes AND,OR,NOT. We can compute AND,
OR, and NOT by composing only the NAND function.

Figure 3.24: A physical implementation of a NAND gate using marbles. Each wire in a Boolean circuit is modeled by a pair of pipes representing the values 0 and 1 respectively, and hence a gate has four input pipes (two for each logical input) and two output pipes. If one of the input pipes representing the value 0 has a marble in it then that marble will flow to the output pipe representing the value 1. (The dashed line represents a gadget that will ensure that at most one marble is allowed to flow onward in the pipe.) If both the input pipes representing the value 1 have marbles in them, then the first marble will be stuck but the second one will flow onwards to the output pipe representing the value 0.
OR, and NOT by composing only the NAND function.

Proof. We start with the following observation. For every 𝑎 ∈ {0, 1},
AND(𝑎, 𝑎) = 𝑎. Hence, NAND(𝑎, 𝑎) = NOT(AND(𝑎, 𝑎)) = NOT(𝑎).
This means that NAND can compute NOT. By the principle of “dou-
ble negation”, AND(𝑎, 𝑏) = NOT(NOT(AND(𝑎, 𝑏))), and hence
we can use NAND to compute AND as well. Once we can compute
AND and NOT, we can compute OR using “De Morgan’s Law”:
OR(𝑎, 𝑏) = NOT(AND(NOT(𝑎), NOT(𝑏))) (which can also be writ-
ten as 𝑎 ∨ 𝑏 = ¬(¬𝑎 ∧ ¬𝑏)) for every 𝑎, 𝑏 ∈ {0, 1}.
■

Figure 3.25: A "gadget" in a pipe that ensures that at most one marble can pass through it. The first marble that passes causes the barrier to lift and block new ones.

P
Theorem 3.10’s proof is very simple, but you should
make sure that (i) you understand the statement of
the theorem, and (ii) you follow its proof. In partic-
ular, you should make sure you understand why De
Morgan’s law is true.
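The three identities used in the proof are easy to check exhaustively.
Here is a short Python sketch doing exactly that (our own illustration,
not code from the book):

# Sketch: the NAND-based NOT, AND and OR from the proof of Theorem 3.10,
# checked against the standard operations on all possible inputs.
def NAND(a, b): return 0 if a == 1 and b == 1 else 1

def NOT(a):    return NAND(a, a)
def AND(a, b): return NAND(NAND(a, b), NAND(a, b))   # double negation
def OR(a, b):  return NAND(NOT(a), NOT(b))           # De Morgan's law

for a in (0, 1):
    assert NOT(a) == 1 - a
    for b in (0, 1):
        assert AND(a, b) == (a & b) and OR(a, b) == (a | b)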

We can use NAND to compute many other functions, as demon-
strated in the following exercise.

Figure 3.26: The game "Turing Tumble" contains an implementation of logical gates using marbles.

Solved Exercise 3.5 — Compute majority with NAND. Let MAJ ∶ {0, 1}3 →
{0, 1} be the function that on input 𝑎, 𝑏, 𝑐 outputs 1 iff 𝑎 + 𝑏 + 𝑐 ≥ 2.
Show how to compute MAJ using a composition of NAND’s.

Solution:
Recall that (3.1) states that

MAJ(𝑥0 , 𝑥1 , 𝑥2 ) = OR ( AND(𝑥0 , 𝑥1 ) , OR(AND(𝑥1 , 𝑥2 ) , AND(𝑥0 , 𝑥2 )) ) .   (3.2)
We can use Theorem 3.10 to replace all the occurrences of AND
and OR with NAND’s. Specifically, we can use the equivalence
AND(𝑎, 𝑏) = NOT(NAND(𝑎, 𝑏)), OR(𝑎, 𝑏) = NAND(NOT(𝑎), NOT(𝑏)),
and NOT(𝑎) = NAND(𝑎, 𝑎) to replace the right-hand side of
(3.2) with an expression involving only NAND, yielding that
MAJ(𝑎, 𝑏, 𝑐) is equivalent to the (somewhat unwieldy) expression

NAND( NAND( NAND(NAND(𝑎, 𝑏), NAND(𝑎, 𝑐)),

NAND(NAND(𝑎, 𝑏), NAND(𝑎, 𝑐)) ),

NAND(𝑏, 𝑐) )

The same formula can also be expressed as a circuit with NAND


gates, see Fig. 3.27.
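The unwieldy expression above can also be checked mechanically; the
following short Python sketch (ours, not the book's code) confirms it
agrees with MAJ on all eight inputs:

# Sketch: exhaustive check of the NAND-only expression for MAJ above.
def NAND(a, b): return 0 if a == 1 and b == 1 else 1

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            val = NAND(NAND(NAND(NAND(a, b), NAND(a, c)),
                            NAND(NAND(a, b), NAND(a, c))),
                       NAND(b, c))
            assert val == (1 if a + b + c >= 2 else 0)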

3.6.1 NAND Circuits


We define NAND Circuits as circuits in which all the gates are NAND
operations. Such a circuit again corresponds to a directed acyclic
graph (DAG) since all the gates correspond to the same function (i.e.,
NAND), we do not even need to label them, and all gates have in-
degree exactly two. Despite their simplicity, NAND circuits can be
quite powerful.

Figure 3.27: A circuit with NAND gates to compute the Majority function on three bits.

■ Example 3.11 — NAND circuit for XOR. Recall the XOR function


which maps 𝑥0 , 𝑥1 ∈ {0, 1} to 𝑥0 + 𝑥1 mod 2. We have seen in
Section 3.2.2 that we can compute XOR using AND, OR, and NOT,
and so by Theorem 3.10 we can compute it using only NAND’s.
However, the following is a direct construction of computing XOR
by a sequence of NAND operations:

1. Let 𝑢 = NAND(𝑥0 , 𝑥1 ).
2. Let 𝑣 = NAND(𝑥0 , 𝑢)
3. Let 𝑤 = NAND(𝑥1 , 𝑢).
4. The XOR of 𝑥0 and 𝑥1 is 𝑦0 = NAND(𝑣, 𝑤).

One can verify that this algorithm does indeed compute XOR
by enumerating all the four choices for 𝑥0 , 𝑥1 ∈ {0, 1}. We can also
represent this algorithm graphically as a circuit, see Fig. 3.28.
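The four steps above translate directly into code. This sketch (our own
illustration, not from the book) performs the enumeration mentioned in
the example:

# Sketch: the four NAND operations above, checked on all four inputs.
def NAND(a, b): return 0 if a == 1 and b == 1 else 1

def XOR(x0, x1):
    u = NAND(x0, x1)
    v = NAND(x0, u)
    w = NAND(x1, u)
    return NAND(v, w)

assert all(XOR(x0, x1) == (x0 + x1) % 2 for x0 in (0, 1) for x1 in (0, 1))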

In fact, we can show the following theorem:

Theorem 3.12 — NAND is a universal operation. For every Boolean circuit
𝐶 of 𝑠 gates, there exists a NAND circuit 𝐶′ of at most 3𝑠 gates that
computes the same function as 𝐶.

Proof Idea:
The idea of the proof is to just replace every AND, OR and NOT
gate with their NAND implementation following the proof of Theo-
rem 3.10.

Figure 3.28: A circuit with NAND gates to compute the XOR of two bits.

Proof of Theorem 3.12. If 𝐶 is a Boolean circuit, then since, as we’ve


seen in the proof of Theorem 3.10, for every 𝑎, 𝑏 ∈ {0, 1}

• NOT(𝑎) = NAND(𝑎, 𝑎)

• AND(𝑎, 𝑏) = NAND(NAND(𝑎, 𝑏), NAND(𝑎, 𝑏))

• OR(𝑎, 𝑏) = NAND(NAND(𝑎, 𝑎), NAND(𝑏, 𝑏))

we can replace every gate of 𝐶 with at most three NAND gates to


obtain an equivalent circuit 𝐶 ′ . The resulting circuit will have at most
3𝑠 gates.

 Big Idea 3 Two models are equivalent in power if they can be used
to compute the same set of functions.

3.6.2 More examples of NAND circuits (optional)


Here are some more sophisticated examples of NAND circuits:

Incrementing integers. Consider the task of computing, given as input


a string 𝑥 ∈ {0, 1}𝑛 that represents a natural number 𝑋 ∈ ℕ, the
representation of 𝑋 + 1. That is, we want to compute the function
INC𝑛 ∶ {0, 1}𝑛 → {0, 1}𝑛+1 such that for every 𝑥0 , … , 𝑥𝑛−1 , INC𝑛 (𝑥) =
𝑦 which satisfies ∑_{𝑖=0}^{𝑛} 𝑦𝑖 2^𝑖 = (∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 2^𝑖 ) + 1. (For simplicity of
notation, in this example we use the representation where the least
significant digit is first rather than last.)
The increment operation can be very informally described as fol-
lows: “Add 1 to the least significant bit and propagate the carry”. A little
more precisely, in the case of the binary representation, to obtain the
increment of 𝑥, we scan 𝑥 from the least significant bit onwards, and
flip all 1’s to 0’s until we encounter a bit equal to 0, in which case we
flip it to 1 and stop.
Thus we can compute the increment of 𝑥0 , … , 𝑥𝑛−1 by doing the
following:

Algorithm 3.13 — Compute Increment Function.

Input: 𝑥0 , 𝑥1 , … , 𝑥𝑛−1 representing the number ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 ⋅ 2^𝑖 # we use LSB-first representation
Output: 𝑦 ∈ {0, 1}𝑛+1 such that ∑_{𝑖=0}^{𝑛} 𝑦𝑖 ⋅ 2^𝑖 = ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 ⋅ 2^𝑖 + 1
1: Let 𝑐0 ← 1 # we pretend we have a ”carry” of 1 initially
2: for 𝑖 = 0, … , 𝑛 − 1 do
3: Let 𝑦𝑖 ← 𝑋𝑂𝑅(𝑥𝑖 , 𝑐𝑖 ).
4: if 𝑐𝑖 = 𝑥𝑖 = 1 then
5: 𝑐𝑖+1 = 1
6: else
7: 𝑐𝑖+1 = 0
8: end if
9: end for
10: Let 𝑦𝑛 ← 𝑐𝑛 .

Algorithm 3.13 describes precisely how to compute the increment


operation, and can be easily transformed into Python code that per-
forms the same computation, but it does not seem to directly yield
a NAND circuit to compute this. However, we can transform this
algorithm line by line to a NAND circuit. For example, since for ev-
ery 𝑎, NAND(𝑎, NOT(𝑎)) = 1, we can replace the initial statement
𝑐0 = 1 with 𝑐0 = NAND(𝑥0 , NAND(𝑥0 , 𝑥0 )). We already know
how to compute XOR using NAND and so we can use this to im-
plement the operation 𝑦𝑖 ← XOR(𝑥𝑖 , 𝑐𝑖 ). Similarly, we can write
the “if” statement as saying 𝑐𝑖+1 ← AND(𝑐𝑖 , 𝑥𝑖 ), or in other words

𝑐𝑖+1 ← NAND(NAND(𝑐𝑖 , 𝑥𝑖 ), NAND(𝑐𝑖 , 𝑥𝑖 )). Finally, the assignment


𝑦𝑛 = 𝑐𝑛 can be written as 𝑦𝑛 = NAND(NAND(𝑐𝑛 , 𝑐𝑛 ), NAND(𝑐𝑛 , 𝑐𝑛 )).
Combining these observations yields for every 𝑛 ∈ ℕ, a NAND circuit
to compute INC𝑛 . For example, Fig. 3.29 shows what this circuit looks
like for 𝑛 = 4.
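For concreteness, here is one way the Python code alluded to above
could look (a sketch of Algorithm 3.13, written with XOR and AND for
the carry logic since we already know how to express both with NANDs;
the function names INC and to_int are our own choices, not the book's):

# Sketch of Algorithm 3.13: increment in LSB-first binary representation.
def INC(x):
    n, y = len(x), []
    c = 1                         # the initial "carry" of 1
    for i in range(n):
        y.append(x[i] ^ c)        # y_i = XOR(x_i, c_i)
        c = x[i] & c              # c_{i+1} = AND(x_i, c_i)
    y.append(c)                   # y_n = c_n
    return y

def to_int(bits):                 # LSB-first bits -> integer
    return sum(b << i for i, b in enumerate(bits))

assert all(to_int(INC([(k >> i) & 1 for i in range(4)])) == k + 1
           for k in range(16))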

From increment to addition.Once we have the increment operation,


we can certainly compute addition by repeatedly incrementing (i.e.,
compute 𝑥+𝑦 by performing INC(𝑥) 𝑦 times). However, that would be
quite inefficient and unnecessary. With the same idea of keeping track
of carries we can implement the “grade-school” addition algorithm
and compute the function ADD𝑛 ∶ {0, 1}2𝑛 → {0, 1}𝑛+1 that on
input 𝑥 ∈ {0, 1}2𝑛 outputs the binary representation of the sum of the
numbers represented by 𝑥0 , … , 𝑥𝑛−1 and 𝑥𝑛 , … , 𝑥2𝑛−1 :

Figure 3.29: NAND circuit computing the increment function on 4 bits.
Algorithm 3.14 — Addition using NAND.

Input: 𝑢 ∈ {0, 1}𝑛 , 𝑣 ∈ {0, 1}𝑛 representing numbers in


LSB-first binary representation.
Output: LSB-first binary representation of 𝑢 + 𝑣.
1: Let 𝑐0 ← 0
2: for 𝑖 = 0, … , 𝑛 − 1 do
3: Let 𝑦𝑖 ← 𝑢𝑖 + 𝑣𝑖 + 𝑐𝑖 mod 2
4: if 𝑢𝑖 + 𝑣𝑖 + 𝑐𝑖 ≥ 2 then
5: 𝑐𝑖+1 ← 1
6: else
7: 𝑐𝑖+1 ← 0
8: end if
9: end for
10: Let 𝑦𝑛 ← 𝑐𝑛

Once again, Algorithm 3.14 can be translated into a NAND cir-


cuit. The crucial observation is that the “if/then” statement simply
corresponds to 𝑐𝑖+1 ← MAJ3 (𝑢𝑖 , 𝑣𝑖 , 𝑐𝑖 ) and we have seen in Solved
Exercise 3.5 that the function MAJ3 ∶ {0, 1}3 → {0, 1} can be computed
using NANDs.
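Translating Algorithm 3.14 into Python in the same way gives the
following sketch (again our own illustration, with the carry computed
via the majority observation just mentioned):

# Sketch of Algorithm 3.14: grade-school addition, LSB-first representation.
def MAJ3(a, b, c): return 1 if a + b + c >= 2 else 0

def ADD(u, v):
    n, y, c = len(u), [], 0
    for i in range(n):
        y.append((u[i] + v[i] + c) % 2)   # y_i = XOR of u_i, v_i and the carry
        c = MAJ3(u[i], v[i], c)           # carry iff at least two of them are 1
    y.append(c)
    return y

def to_int(bits): return sum(b << i for i, b in enumerate(bits))

assert all(to_int(ADD([(a >> i) & 1 for i in range(3)],
                      [(b >> i) & 1 for i in range(3)])) == a + b
           for a in range(8) for b in range(8))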

3.6.3 The NAND-CIRC Programming language


Just like we did for Boolean circuits, we can define a programming-
language analog of NAND circuits. It is even simpler than the AON-
CIRC language since we only have a single operation. We define the
NAND-CIRC Programming Language to be a programming language
where every line (apart from the input/output declaration) has the
following form:

foo = NAND(bar,blah)

where foo, bar and blah are variable identifiers.

■ Example 3.15 — Our first NAND-CIRC program. Here is an example of a


NAND-CIRC program:

u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)

P
Do you know what function this program computes?
Hint: you have seen it before.

Formally, just like we did in Definition 3.8 for AON-CIRC, we can


define the notion of computation by a NAND-CIRC program in the
natural way:

Definition 3.16 — Computing by a NAND-CIRC program. Let 𝑓 ∶ {0, 1}𝑛 →
{0, 1}𝑚 be some function, and let 𝑃 be a NAND-CIRC program.
We say that 𝑃 computes the function 𝑓 if:

1. 𝑃 has 𝑛 input variables X[0], … ,X[𝑛−1] and 𝑚 output variables


Y[0],…,Y[𝑚 − 1].

2. For every 𝑥 ∈ {0, 1}𝑛 , if we execute 𝑃 when we assign to


X[0], … ,X[𝑛 − 1] the values 𝑥0 , … , 𝑥𝑛−1 , then at the end of
the execution, the output variables Y[0],…,Y[𝑚 − 1] have the
values 𝑦0 , … , 𝑦𝑚−1 where 𝑦 = 𝑓(𝑥).

As before we can show that NAND circuits are equivalent to


NAND-CIRC programs (see Fig. 3.30):

Theorem 3.17 — NAND circuits and straight-line program equivalence. For
every 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 and 𝑠 ≥ 𝑚, 𝑓 is computable by a
NAND-CIRC program of 𝑠 lines if and only if 𝑓 is computable by a
NAND circuit of 𝑠 gates.

We omit the proof of Theorem 3.17 since it follows along exactly


the same lines as the equivalence of Boolean circuits and AON-CIRC
program (Theorem 3.9). Given Theorem 3.17 and Theorem 3.12, we
know that we can translate every 𝑠-line AON-CIRC program 𝑃 into
an equivalent NAND-CIRC program of at most 3𝑠 lines. In fact, this
translation can be easily done by replacing every line of the form
foo = AND(bar,blah), foo = OR(bar,blah) or foo = NOT(bar)
with the equivalent 1-3 lines that use the NAND operation. Our GitHub
repository contains a "proof by code": a simple Python program
AON2NAND that transforms an AON-CIRC into an equivalent NAND-
CIRC program.

Figure 3.30: A NAND program and the corresponding circuit. Note how every line in the program corresponds to a gate in the circuit.
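To give the flavor of such a translation, here is our own minimal sketch
(it is not the AON2NAND program from the repository, and the name
aon_to_nand and the fresh-variable scheme are our choices). It replaces
each AON-CIRC line with the NAND-CIRC lines given by the identities
from the proof of Theorem 3.12:

# Sketch: translating AON-CIRC lines into NAND-CIRC lines, using
# NOT(a)=NAND(a,a), AND(a,b)=NAND(NAND(a,b),NAND(a,b)) and
# OR(a,b)=NAND(NAND(a,a),NAND(b,b)).  Minimal illustration, no validation.
def aon_to_nand(program):
    out, fresh = [], 0
    def tmp():
        nonlocal fresh
        fresh += 1
        return f"aux_{fresh}"
    for line in program.splitlines():
        target, expr = [s.strip() for s in line.split("=", 1)]
        op, args = expr.split("(", 1)
        args = [a.strip() for a in args.rstrip(")").split(",")]
        if op.strip() == "NOT":
            out.append(f"{target} = NAND({args[0]},{args[0]})")
        elif op.strip() == "AND":
            t = tmp()
            out.append(f"{t} = NAND({args[0]},{args[1]})")
            out.append(f"{target} = NAND({t},{t})")
        else:  # OR
            t1, t2 = tmp(), tmp()
            out.append(f"{t1} = NAND({args[0]},{args[0]})")
            out.append(f"{t2} = NAND({args[1]},{args[1]})")
            out.append(f"{target} = NAND({t1},{t2})")
    return "\n".join(out)

print(aon_to_nand("temp = AND(X[0],X[1])\nY[0] = NOT(temp)"))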

R
Remark 3.18 — Is the NAND-CIRC programming language
Turing Complete? (optional note). You might have heard
of a term called “Turing Complete” that is sometimes
used to describe programming languages. (If you
haven’t, feel free to ignore the rest of this remark: we
define this term precisely in Chapter 8.) If so, you
might wonder if the NAND-CIRC programming lan-
guage has this property. The answer is no, or perhaps
more accurately, the term “Turing Completeness” is
not really applicable for the NAND-CIRC program-
ming language. The reason is that, by design, the
NAND-CIRC programming language can only com-
pute finite functions 𝐹 ∶ {0, 1}𝑛 → {0, 1}𝑚 that take a
fixed number of input bits and produce a fixed num-
ber of output bits. The term “Turing Complete” is
only applicable to programming languages for infinite
functions that can take inputs of arbitrary length. We
will come back to this distinction later on in this book.

3.7 EQUIVALENCE OF ALL THESE MODELS


If we put together Theorem 3.9, Theorem 3.12, and Theorem 3.17, we
obtain the following result:

Theorem 3.19 — Equivalence between models of finite computation. For


every sufficiently large 𝑠, 𝑛, 𝑚 and 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , the
following conditions are all equivalent to one another:

• 𝑓 can be computed by a Boolean circuit (with ∧, ∨, ¬ gates) of at


most 𝑂(𝑠) gates.

• 𝑓 can be computed by an AON-CIRC straight-line program of at


most 𝑂(𝑠) lines.

• 𝑓 can be computed by a NAND circuit of at most 𝑂(𝑠) gates.

• 𝑓 can be computed by a NAND-CIRC straight-line program of at


most 𝑂(𝑠) lines.

By “𝑂(𝑠)” we mean that the bound is at most 𝑐 ⋅ 𝑠 where 𝑐 is a con-


stant that is independent of 𝑛. For example, if 𝑓 can be computed by a
Boolean circuit of 𝑠 gates, then it can be computed by a NAND-CIRC

program of at most 3𝑠 lines, and if 𝑓 can be computed by a NAND


circuit of 𝑠 gates, then it can be computed by an AON-CIRC program
of at most 2𝑠 lines.
Proof Idea:
We omit the formal proof, which is obtained by combining Theo-
rem 3.9, Theorem 3.12, and Theorem 3.17. The key observation is that
the results we have seen allow us to translate a program/circuit that
computes 𝑓 in one of the above models into a program/circuit that
computes 𝑓 in another model by increasing the lines/gates by at most
a constant factor (in fact this constant factor is at most 3).

Theorem 3.9 is a special case of a more general result. We can con-
sider even more general models of computation, where instead of
AND/OR/NOT or NAND, we use other operations (see Section 3.7.1
below). It turns out that Boolean circuits are equivalent in power to
such models as well. The fact that all these different ways to define
computation lead to equivalent models shows that we are “on the
right track”. It justifies the seemingly arbitrary choices that we’ve
made of using AND/OR/NOT or NAND as our basic operations,
since these choices do not affect the power of our computational
model. Equivalence results such as Theorem 3.19 mean that we can
easily translate between Boolean circuits, NAND circuits, NAND-
CIRC programs and the like. We will use this ability later on in this
book, often shifting to the most convenient formulation without mak-
ing a big deal about it. Hence we will not worry too much about the
distinction between, for example, Boolean circuits and NAND-CIRC
programs.
In contrast, we will continue to take special care to distinguish
between circuits/programs and functions (recall Big Idea 2). A func-
tion corresponds to a specification of a computational task, and it is
a fundamentally different object than a program or a circuit, which
corresponds to the implementation of the task.

3.7.1 Circuits with other gate sets


There is nothing special about AND/OR/NOT or NAND. For every
set of functions 𝒢 = {𝐺0 , … , 𝐺𝑘−1 }, we can define a notion of circuits
that use elements of 𝒢 as gates, and a notion of a “𝒢 programming
language” where every line involves assigning to a variable foo the re-
sult of applying some 𝐺𝑖 ∈ 𝒢 to previously defined or input variables.
Specifically, we can make the following definition:

Definition 3.20 — General straight-line programs. Let ℱ = {𝑓0 , … , 𝑓𝑡−1 }
be a finite collection of Boolean functions, such that 𝑓𝑖 ∶ {0, 1}𝑘𝑖 →

{0, 1} for some 𝑘𝑖 ∈ ℕ. An ℱ program is a sequence of lines, each of


which assigns to some variable the result of applying some 𝑓𝑖 ∈ ℱ
to 𝑘𝑖 other variables. As above, we use X[𝑖] and Y[𝑗] to denote the
input and output variables.
We say that ℱ is a universal set of operations (also known as a uni-
versal gate set) if there exists a ℱ program to compute the function
NAND.

AON-CIRC programs correspond to {AND, OR, NOT} programs,
NAND-CIRC programs correspond to ℱ programs for the set
ℱ that only contains the NAND function, but we can also define
{IF, ZERO, ONE} programs (see below), or use any other set.
We can also define ℱ circuits, which will be directed graphs in
which each gate corresponds to applying a function 𝑓𝑖 ∈ ℱ, and will
each have 𝑘𝑖 incoming wires and a single outgoing wire. (If the func-
tion 𝑓𝑖 is not symmetric, in the sense that the order of its input matters
then we need to label each wire entering a gate as to which parameter
of the function it corresponds to.) As in Theorem 3.9, we can show
that ℱ circuits and ℱ programs are equivalent. We have seen that for
ℱ = {AND, OR, NOT}, the resulting circuits/programs are equivalent
in power to the NAND-CIRC programming language, as we can com-
pute NAND using AND/OR/NOT and vice versa. This turns out to be
a special case of a general phenomenon — the universality of NAND
and other gate sets — that we will explore more in-depth later in this
book.

■ Example 3.21 — IF,ZERO,ONE circuits. Let ℱ = {IF, ZERO, ONE}


where ZERO ∶ {0, 1} → {0} and ONE ∶ {0, 1} → {1} are the
constant zero and one functions, 3 and IF ∶ {0, 1}3 → {0, 1} is the
function that on input (𝑎, 𝑏, 𝑐) outputs 𝑏 if 𝑎 = 1 and 𝑐 otherwise.
Then ℱ is universal.
Indeed, we can demonstrate that {IF, ZERO, ONE} is universal
using the following formula for NAND:

NAND(𝑎, 𝑏) = IF(𝑎, IF(𝑏, ZERO, ONE), ONE) .
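One can verify this formula by plugging in all four possibilities for
𝑎, 𝑏. The following sketch (ours, not the book's code) does this, treating
ZERO and ONE simply as the constants 0 and 1:

# Sketch: checking NAND(a,b) = IF(a, IF(b, ZERO, ONE), ONE) on all inputs,
# with ZERO and ONE taken to be the constants 0 and 1.
def IF(a, b, c): return b if a == 1 else c
def NAND(a, b):  return 0 if a == 1 and b == 1 else 1

for a in (0, 1):
    for b in (0, 1):
        assert NAND(a, b) == IF(a, IF(b, 0, 1), 1)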


3
One can also define these functions as taking a length zero input. This makes no difference for the computational power of the model.

There are also some sets ℱ that are more restricted in power. For
example it can be shown that if we use only AND or OR gates (with-
out NOT) then we do not get an equivalent model of computation.
The exercises cover several examples of universal and non-universal
gate sets.

3.7.2 Specification vs. implementation (again)


As we discussed in Section 2.6.1, one of the most important distinc-
tions in this book is that of specification versus implementation or sep-
arating "what" from "how" (see Fig. 3.31). A function corresponds
to the specification of a computational task, that is what output should
be produced for every particular input. A program (or circuit, or any
other way to specify algorithms) corresponds to the implementation of
how to compute the desired output from the input. That is, a program
is a set of instructions on how to compute the output from the input.
Even within the same computational model there can be many differ-
ent ways to compute the same function. For example, there is more
than one NAND-CIRC program that computes the majority function,
more than one Boolean circuit to compute the addition function, and
so on and so forth.

Figure 3.31: It is crucial to distinguish between the specification of a computational task, namely what is the function that is to be computed and the implementation of it, namely the algorithm, program, or circuit that contains the instructions defining how to map an input to an output. The same function could be computed in many different ways.
Confusing specification and implementation (or equivalently func-
tions and programs) is a common mistake, and one that is unfortu-
nately encouraged by the common programming-language termi-
nology of referring to parts of programs as “functions”. However, in
both the theory and practice of computer science, it is important to
maintain this distinction, and it is particularly important for us in this
book.

✓ Chapter Recap

• An algorithm is a recipe for performing a compu-


tation as a sequence of “elementary” or “simple”
operations.
• One candidate definition for “elementary” opera-
tions is the set AND, OR and NOT.
• Another candidate definition for an “elementary”
operation is the NAND operation. It is an operation
that is easily implementable in the physical world
in a variety of methods including by electronic
transistors.

• We can use NAND to compute many other func-


tions, including majority, increment, and others.
• There are other equivalent choices, including the
sets {AND, OR, NOT} and {IF, ZERO, ONE}.
• We can formally define the notion of a function
𝐹 ∶ {0, 1}𝑛 → {0, 1}𝑚 being computable using the
NAND-CIRC Programming language.
• For every set of basic operations, the notions of be-
ing computable by a circuit and being computable
by a straight-line program are equivalent.

3.8 EXERCISES
Exercise 3.1 — Compare 4 bit numbers. Give a Boolean circuit
(with AND/OR/NOT gates) that computes the function
CMP8 ∶ {0, 1}8 → {0, 1} such that CMP8 (𝑎0 , 𝑎1 , 𝑎2 , 𝑎3 , 𝑏0 , 𝑏1 , 𝑏2 , 𝑏3 ) = 1
if and only if the number represented by 𝑎0 𝑎1 𝑎2 𝑎3 is larger than the
number represented by 𝑏0 𝑏1 𝑏2 𝑏3 .

Exercise 3.2 — Compare 𝑛 bit numbers. Prove that there exists a constant 𝑐


such that for every 𝑛 there is a Boolean circuit (with AND/OR/NOT
gates) 𝐶 of at most 𝑐 ⋅ 𝑛 gates that computes the function CMP2𝑛 ∶
{0, 1}2𝑛 → {0, 1} such that CMP2𝑛 (𝑎0 ⋯ 𝑎𝑛−1 𝑏0 ⋯ 𝑏𝑛−1 ) = 1 if and
only if the number represented by 𝑎0 ⋯ 𝑎𝑛−1 is larger than the number
represented by 𝑏0 ⋯ 𝑏𝑛−1 .

Exercise 3.3 — OR,NOT is universal. Prove that the set {OR, NOT} is univer-
sal, in the sense that one can compute NAND using these gates.

Exercise 3.4 — AND,OR is not universal. Prove that for every 𝑛-bit input
circuit 𝐶 that contains only AND and OR gates, as well as gates that
compute the constant functions 0 and 1, 𝐶 is monotone, in the sense
that if 𝑥, 𝑥′ ∈ {0, 1}𝑛 , 𝑥𝑖 ≤ 𝑥′𝑖 for every 𝑖 ∈ [𝑛], then 𝐶(𝑥) ≤ 𝐶(𝑥′ ).
Conclude that the set {AND, OR, 0, 1} is not universal.

Exercise 3.5 — XOR is not universal. Prove that for every 𝑛-bit input circuit
𝐶 that contains only XOR gates, as well as gates that compute the
constant functions 0 and 1, 𝐶 is affine or linear modulo two, in the sense
that there exists some 𝑎 ∈ {0, 1}𝑛 and 𝑏 ∈ {0, 1} such that for every
𝑥 ∈ {0, 1}𝑛 , 𝐶(𝑥) = ∑_{𝑖=0}^{𝑛−1} 𝑎𝑖 𝑥𝑖 + 𝑏 mod 2.
Conclude that the set {XOR, 0, 1} is not universal.


Exercise 3.6 — MAJ,NOT,1 is universal. Let MAJ ∶ {0, 1}3 → {0, 1} be the
majority function. Prove that {MAJ, NOT, 1} is a universal set of gates.

Exercise 3.7 — MAJ,NOT is not universal. Prove that {MAJ, NOT} is not a
universal set. See footnote for hint.4

4
Hint: Use the fact that ¬MAJ(𝑎, 𝑏, 𝑐) = MAJ(¬𝑎, ¬𝑏, ¬𝑐) to prove that every 𝑓 ∶ {0, 1}𝑛 → {0, 1} computable by a circuit with only MAJ and NOT gates satisfies 𝑓(0, 0, … , 0) ≠ 𝑓(1, 1, … , 1). Thanks to Nathan Brunelle and David Evans for suggesting this exercise.
Exercise 3.8 — NOR is universal. 2 Brunelle and David Evans for suggesting this exercise.

NOR(𝑎, 𝑏) = NOT(OR(𝑎, 𝑏)). Prove that {NOR} is a universal set of


gates.

Exercise 3.9 — Lookup is universal. Prove that {LOOKUP1 , 0, 1} is a uni-
versal set of gates where 0 and 1 are the constant functions and
LOOKUP1 ∶ {0, 1}3 → {0, 1} satisfies LOOKUP1 (𝑎, 𝑏, 𝑐) equals 𝑎 if
𝑐 = 0 and equals 𝑏 if 𝑐 = 1.

Exercise 3.10 — Bound on universal basis size (challenge). Prove that for ev-
ery subset 𝐵 of the functions from {0, 1}𝑘 to {0, 1}, if 𝐵 is universal
then there is a 𝐵-circuit of at most 𝑂(1) gates to compute the NAND
function (you can start by showing that there is a 𝐵 circuit of at most
𝑂(𝑘16 ) gates).5

5
Thanks to Alec Sun and Simon Fischer for comments on this problem.

Exercise 3.11 — Size and inputs / outputs. Prove that for every NAND cir-
cuit of size 𝑠 with 𝑛 inputs and 𝑚 outputs, 𝑠 ≥ min{𝑛/2, 𝑚}. See
footnote for hint.6

6
Hint: Use the conditions of Definition 3.5 stipulating that every input vertex has at least one out-neighbor and there are exactly 𝑚 output gates. See also Remark 3.7.
Exercise 3.12 — Threshold using NANDs. Prove that there is some constant
𝑐 such that for every 𝑛 > 1, and integers 𝑎0 , … , 𝑎𝑛−1 , 𝑏 ∈ {−2𝑛 , −2𝑛 +
1, … , −1, 0, +1, … , 2𝑛 }, there is a NAND circuit with at most 𝑐𝑛4 gates
that computes the threshold function 𝑓𝑎0 ,…,𝑎𝑛−1 ,𝑏 ∶ {0, 1}𝑛 → {0, 1} that
on input 𝑥 ∈ {0, 1}𝑛 outputs 1 if and only if ∑_{𝑖=0}^{𝑛−1} 𝑎𝑖 𝑥𝑖 > 𝑏.

Exercise 3.13 — NANDs from activation functions. We say that a function


𝑓 ∶ ℝ2 → ℝ is a NAND approximator if it has the following property: for
every 𝑎, 𝑏 ∈ ℝ, if min{|𝑎|, |1 − 𝑎|} ≤ 1/3 and min{|𝑏|, |1 − 𝑏|} ≤ 0.1 then
|𝑓(𝑎, 𝑏) − NAND(⌊𝑎⌉, ⌊𝑏⌉)| ≤ 0.1 where we denote by ⌊𝑥⌉ the integer
closest to 𝑥. That is, if 𝑎, 𝑏 are within a distance 1/3 to {0, 1} then we
want 𝑓(𝑎, 𝑏) to equal the NAND of the values in {0, 1} that are closest
to 𝑎 and 𝑏 respectively. Otherwise, we do not care what the output of
𝑓 is on 𝑎 and 𝑏.
In this exercise you will show that you can construct a NAND ap-
proximator from many common activation functions used in deep

neural networks. As a corollary you will obtain that deep neural net-
works can simulate NAND circuits. Since NAND circuits can also
simulate deep neural networks, these two computational models are
equivalent to one another.

1. Show that there is a NAND approximator 𝑓 defined as 𝑓(𝑎, 𝑏) =


𝐿(DR𝑒𝐿𝑈 (𝐿′ (𝑎, 𝑏))) where 𝐿′ ∶ ℝ2 → ℝ is an affine function (of the
form 𝐿′ (𝑎, 𝑏) = 𝛼𝑎 + 𝛽𝑏 + 𝛾 for some 𝛼, 𝛽, 𝛾 ∈ ℝ), 𝐿 is an affine
function (of the form 𝐿(𝑦) = 𝛼𝑦 + 𝛽 for 𝛼, 𝛽 ∈ ℝ), and DR𝑒𝐿𝑈 ∶
ℝ → ℝ, is the function defined as DR𝑒𝐿𝑈 (𝑥) = min(1, max(0, 𝑥)).
Note that DR𝑒𝐿𝑈 (𝑥) = 1 − 𝑅𝑒𝐿𝑈 (1 − 𝑅𝑒𝐿𝑈 (𝑥)) where 𝑅𝑒𝐿𝑈 (𝑥) =
max(𝑥, 0) is the rectified linear unit activation function.

2. Show that there is a NAND approximator 𝑓 defined as 𝑓(𝑎, 𝑏) =


𝐿(𝑠𝑖𝑔𝑚𝑜𝑖𝑑(𝐿′ (𝑎, 𝑏))) where 𝐿′ , 𝐿 are affine as above and 𝑠𝑖𝑔𝑚𝑜𝑖𝑑 ∶
ℝ → ℝ is the function defined as 𝑠𝑖𝑔𝑚𝑜𝑖𝑑(𝑥) = 𝑒𝑥 /(𝑒𝑥 + 1).

3. Show that there is a NAND approximator 𝑓 defined as 𝑓(𝑎, 𝑏) =


𝐿(𝑡𝑎𝑛ℎ(𝐿′ (𝑎, 𝑏))) where 𝐿′ , 𝐿 are affine as above and 𝑡𝑎𝑛ℎ ∶ ℝ → ℝ
is the function defined as 𝑡𝑎𝑛ℎ(𝑥) = (𝑒𝑥 − 𝑒−𝑥 )/(𝑒𝑥 + 𝑒−𝑥 ).

4. Prove that for every NAND-circuit 𝐶 with 𝑛 inputs and one output
that computes a function 𝑔 ∶ {0, 1}𝑛 → {0, 1}, if we replace every
gate of 𝐶 with a NAND-approximator and then invoke the result-
ing circuit on some 𝑥 ∈ {0, 1}𝑛 , the output will be a number 𝑦 such
that |𝑦 − 𝑔(𝑥)| ≤ 1/3.

Exercise 3.14 — Majority with NANDs efficiently. Prove that there is some
constant 𝑐 such that for every 𝑛 > 1, there is a NAND circuit of at
most 𝑐 ⋅ 𝑛 gates that computes the majority function on 𝑛 input bits
MAJ𝑛 ∶ {0, 1}𝑛 → {0, 1}. That is MAJ𝑛 (𝑥) = 1 iff ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 > 𝑛/2. See
footnote for hint.7

7
One approach to solve this is using recursion and analyzing it using the so called "Master Theorem".

Exercise 3.15 — Output at last layer. Prove that for every 𝑓 ∶ {0, 1}𝑛 →
{0, 1}, if there is a Boolean circuit 𝐶 of 𝑠 gates that computes 𝑓 then
there is a Boolean circuit 𝐶′ of at most 𝑠 gates such that in the minimal
layering of 𝐶′ , the output gate of 𝐶′ is placed in the last layer. See
footnote for hint.8

8
Hint: Vertices in layers beyond the output can be safely removed without changing the functionality of the circuit.

3.9 BIOGRAPHICAL NOTES


The excerpt from Al-Khwarizmi’s book is from “The Algebra of Ben-
Musa”, Fredric Rosen, 1831.

Charles Babbage (1791-1871) was a visionary scientist, mathemati-


cian, and inventor (see [Swa02; CM00]). More than a century before
the invention of modern electronic computers, Babbage realized that
computation can be in principle mechanized. His first design for a
mechanical computer was the difference engine that was designed to do
polynomial interpolation. He then designed the analytical engine which
was a much more general machine and the first prototype for a pro-
grammable general-purpose computer. Unfortunately, Babbage was
never able to complete the design of his prototypes. One of the earliest
people to realize the engine’s potential and far-reaching implications
was Ada Lovelace (see the notes for Chapter 7).
Boolean algebra was first investigated by Boole and DeMorgan
in the 1840’s [Boo47; De 47]. The definition of Boolean circuits and
connection to electrical relay circuits was given in Shannon’s Masters
Thesis [Sha38]. (Howard Gardener called Shannon’s thesis “possibly
the most important, and also the most famous, master’s thesis of the
[20th] century”.) Savage’s book [Sav98], like this one, introduces
the theory of computation starting with Boolean circuits as the first
model. Jukna’s book [Juk12] contains a modern in-depth exposition of
Boolean circuits, see also [Weg87].
The NAND function was shown to be universal by Sheffer [She13],
though this also appears in the earlier work of Peirce, see [Bur78].
Whitehead and Russell used NAND as the basis for their logic in
their magnum opus Principia Mathematica [WR12]. In her Ph.D thesis,
Ernst [Ern09] investigates empirically the minimal NAND circuits
for various functions. Nisan and Shocken’s book [NS05] builds a
computing system starting from NAND gates and ending with high-
level programs and games (“NAND to Tetris”); see also the website
nandtotetris.org.
We defined the size of a Boolean circuit in Definition 3.5 to be the
number of gates it contains. This is one of two conventions used in the
literature. The other convention is to define the size as the number of
wires (equivalent to the number of gates plus the number of inputs).
This makes very little difference in almost all settings, but can affect
the circuit size complexity of some “pathological examples” of func-
tions such as the constant zero function that do not depend on much
of their inputs.
Learning Objectives:
• Get comfortable with syntactic sugar or
automatic translation of higher-level logic to
low-level gates.
• Learn proof of major result: every finite
function can be computed by a Boolean
circuit.
• Start thinking quantitatively about the
number of lines required for computation.

4
Syntactic sugar, and computing every function

“[In 1951] I had a running compiler and nobody would touch it because,
they carefully told me, computers could only do arithmetic; they could not do
programs.”, Grace Murray Hopper, 1986.

“Syntactic sugar causes cancer of the semicolon.”, Alan Perlis, 1982.

The computational models we considered thus far are as “bare


bones” as they come. For example, our NAND-CIRC “programming
language” has only the single operation foo = NAND(bar,blah). In
this chapter we will see that these simple models are actually equiv-
alent to more sophisticated ones. The key observation is that we can
implement more complex features using our basic building blocks,
and then use these new features themselves as building blocks for
even more sophisticated features. This is known as “syntactic sugar”
in the field of programming language design since we are not modi-
fying the underlying programming model itself, but rather we merely
implement new features by syntactically transforming a program that
uses such features into one that doesn’t.
This chapter provides a “toolkit” that can be used to show that
many functions can be computed by NAND-CIRC programs, and
hence also by Boolean circuits. We will also use this toolkit to prove
a fundamental theorem: every finite function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚
can be computed by a Boolean circuit, see Theorem 4.13 below. While
the syntactic sugar toolkit is important in its own right, Theorem 4.13
can also be proven directly without using this toolkit. We present this
alternative proof in Section 4.5. See Fig. 4.1 for an outline of the results
of this chapter.

This chapter: A non-mathy overview


In this chapter, we will see our first major result: every fi-
nite function can be computed by some Boolean circuit (see
Theorem 4.13 and Big Idea 5). This is sometimes known as


Figure 4.1: An outline of the results of this chapter. In


Section 4.1 we give a toolkit of “syntactic sugar” trans-
formations showing how to implement features such
as programmer-defined functions and conditional
statements in NAND-CIRC. We use these tools in
Section 4.3 to give a NAND-CIRC program (or alter-
natively a Boolean circuit) to compute the LOOKUP
function. We then build on this result to show in Sec-
tion 4.4 that NAND-CIRC programs (or equivalently,
Boolean circuits) can compute every finite function.
An alternative direct proof of the same result is given
in Section 4.5.

the “universality” of AND, OR, and NOT (and, using the
equivalence of Chapter 3, of NAND as well).
Despite being an important result, Theorem 4.13 is actually
not that hard to prove. Section 4.5 presents a relatively sim-
ple direct proof of this result. However, in Section 4.1 and
Section 4.3 we derive this result using the concept of “syntac-
tic sugar” (see Big Idea 4). This is an important concept for
programming languages theory and practice. The idea be-
hind “syntactic sugar” is that we can extend a programming
language by implementing advanced features from its basic
components. For example, we can take the AON-CIRC and
NAND-CIRC programming languages we saw in Chapter 3,
and extend them to achieve features such as user-defined
functions (e.g., def Foo(...)), conditional statements (e.g.,
if blah ...), and more. Once we have these features, it
is not that hard to show that we can take the “truth table”
(table of all inputs and outputs) of any function, and use that
to create an AON-CIRC or NAND-CIRC program that maps
each input to its corresponding output.
We will also get our first glimpse of quantitative measures in
this chapter. While Theorem 4.13 tells us that every func-
tion can be computed by some circuit, the number of gates
in this circuit can be exponentially large. (We are not using
here “exponentially” as some colloquial term for “very very
big” but in a very precise mathematical sense, which also
happens to coincide with being very very big.) It turns out
that some functions (for example, integer addition and multi-

plication) can be in fact computed using far fewer gates. We


will explore this issue of “gate complexity” more deeply in
Chapter 5 and following chapters.

4.1 SOME EXAMPLES OF SYNTACTIC SUGAR


We now present some examples of “syntactic sugar” transformations
that we can use in constructing straightline programs or circuits. We
focus on the straight-line programming language view of our computa-
tional models, and specifically (for the sake of concreteness) on the
NAND-CIRC programming language. This is convenient because
many of the syntactic sugar transformations we present are easiest to
think about in terms of applying “search and replace” operations to
the source code of a program. However, by Theorem 3.19, all of our
results hold equally well for circuits, whether ones using NAND gates
or Boolean circuits that use the AND, OR, and NOT operations. Enu-
merating the examples of such syntactic sugar transformations can be
a little tedious, but we do it for two reasons:

1. To convince you that despite their seeming simplicity and limita-


tions, simple models such as Boolean circuits or the NAND-CIRC
programming language are actually quite powerful.

2. So you can realize how lucky you are to be taking a theory of com-
putation course and not a compilers course… :)

4.1.1 User-defined procedures


One staple of almost any programming language is the ability to
define and then execute procedures or subroutines. (These are often
known as functions in some programming languages, but we prefer
the name procedures to avoid confusion with the function that a pro-
gram computes.) The NAND-CIRC programming language does
not have this mechanism built in. However, we can achieve the same
effect using the time-honored technique of “copy and paste”. Specifi-
cally, we can replace code which defines a procedure such as

def Proc(a,b):
proc_code
return c
some_code
f = Proc(d,e)
some_more_code

with the following code where we “paste” the code of Proc



some_code
proc_code'
some_more_code

and where proc_code' is obtained by replacing all occurrences


of a with d, b with e, and c with f. When doing that we will need to
ensure that all other variables appearing in proc_code' don’t interfere
with other variables. We can always do so by renaming variables to
new names that were not used before. The above reasoning leads to
the proof of the following theorem:

Theorem 4.1 — Procedure definition syntactic sugar. Let NAND-CIRC-
PROC be the programming language NAND-CIRC augmented
with the syntax above for defining procedures. Then for every
NAND-CIRC-PROC program 𝑃 , there exists a standard (i.e.,
“sugar-free”) NAND-CIRC program 𝑃 ′ that computes the same
function as 𝑃 .

R
Remark 4.2 — No recursive procedure. NAND-CIRC-
PROC only allows non-recursive procedures. In partic-
ular, the code of a procedure Proc cannot call Proc but
only use procedures that were defined before it. With-
out this restriction, the above “search and replace”
procedure might never terminate and Theorem 4.1
would not be true.

Theorem 4.1 can be proven using the transformation above, but


since the formal proof is somewhat long and tedious, we omit it here.

■

Example 4.3 — Computing Majority from NAND using syntactic sugar. Pro-
cedures allow us to express NAND-CIRC programs much more
cleanly and succinctly. For example, because we can compute
AND, OR, and NOT using NANDs, we can compute the Majority
function as follows:

def NOT(a):
    return NAND(a,a)

def AND(a,b):
    temp = NAND(a,b)
    return NOT(temp)

def OR(a,b):
    temp1 = NOT(a)
    temp2 = NOT(b)
    return NAND(temp1,temp2)

def MAJ(a,b,c):
    and1 = AND(a,b)
    and2 = AND(a,c)
    and3 = AND(b,c)
    or1 = OR(and1,and2)
    return OR(or1,and3)

print(MAJ(0,1,1))
# 1

Fig. 4.2 presents the “sugar-free” NAND-CIRC program (and


the corresponding circuit) that is obtained by “expanding out” this
program, replacing the calls to procedures with their definitions.

 Big Idea 4 Once we show that a computational model 𝑋 is equiv-


alent to a model that has feature 𝑌 , we can assume we have 𝑌 when
showing that a function 𝑓 is computable by 𝑋.

Figure 4.2: A standard (i.e., “sugar-free”) NAND-


CIRC program that is obtained by expanding out the
procedure definitions in the program for Majority
of Example 4.3. The corresponding circuit is on
the right. Note that this is not the most efficient
NAND circuit/program for majority: we can save on
some gates by “shortcutting” steps where a gate 𝑢
computes NAND(𝑣, 𝑣) and then a gate 𝑤 computes
NAND(𝑢, 𝑢) (as indicated by the dashed green
arrows in the above figure).

R
Remark 4.4 — Counting lines. While we can use syn-
tactic sugar to present NAND-CIRC programs in more
readable ways, we did not change the definition of
the language itself. Therefore, whenever we say that
some function 𝑓 has an 𝑠-line NAND-CIRC program
we mean a standard “sugar-free” NAND-CIRC pro-
gram, where all syntactic sugar has been expanded
out. For example, the program of Example 4.3 is a
12-line program for computing the MAJ function,

even though it can be written in fewer lines using


NAND-CIRC-PROC.

4.1.2 Proof by Python (optional)


We can write a Python program that implements the proof of Theo-
rem 4.1. This is a Python program that takes a NAND-CIRC-PROC
program 𝑃 that includes procedure definitions and uses simple
“search and replace” to transform 𝑃 into a standard (i.e., “sugar-
free”) NAND-CIRC program 𝑃 ′ that computes the same function as
𝑃 without using any procedures. The idea is simple: if the program 𝑃
contains a definition of a procedure Proc of two arguments x and y,
then whenever we see a line of the form foo = Proc(bar,blah), we
can replace this line by:

1. The body of the procedure Proc (replacing all occurrences of x and


y with bar and blah respectively).

2. A line foo = exp, where exp is the expression following the re-
turn statement in the definition of the procedure Proc.

To make this more robust we add a prefix to the internal variables


used by Proc to ensure they don’t conflict with the variables of 𝑃 ;
for simplicity we ignore this issue in the code below though it can be
easily added.
The code of the Python function desugar below achieves such a
transformation.
Fig. 4.2 shows the result of applying desugar to the program of Ex-
ample 4.3 that uses syntactic sugar to compute the Majority function.
Specifically, we first apply desugar to remove usage of the OR func-
tion, then apply it to remove usage of the AND function, and finally
apply it a third time to remove usage of the NOT function.

R
Remark 4.5 — Parsing function definitions (optional). The
function desugar in Fig. 4.3 assumes that it is given
the procedure already split up into its name, argu-
ments, and body. It is not crucial for our purposes to
describe precisely how to scan a definition and split it
up into these components, but in case you are curious,
it can be achieved in Python via the following code:

def parse_func(code):
    """Parse a function definition into name, arguments and body"""
    lines = [l.strip() for l in code.split('\n')]
    regexp = r'def\s+([a-zA-Z\_0-9]+)\(([\sa-zA-Z0-9\_,]+)\)\s*:\s*'

Figure 4.3: Python code for transforming NAND-CIRC-PROC programs into standard sugar-free NAND-CIRC programs.

def desugar(code, func_name, func_args, func_body):
    """
    Replaces all occurrences of
        foo = func_name(func_args)
    with
        func_body[x->a,y->b]
        foo = [result returned in func_body]
    """
    # Uses Python regular expressions to simplify the search and replace,
    # see https://docs.python.org/3/library/re.html and Chapter 9 of the book

    # regular expression for capturing a list of variable names separated by commas
    arglist = ",".join([r"([a-zA-Z0-9\_\[\]]+)" for i in range(len(func_args))])
    # regular expression for capturing a statement of the form
    # "variable = func_name(arguments)"
    regexp = fr'([a-zA-Z0-9\_\[\]]+)\s*=\s*{func_name}\({arglist}\)\s*$'
    while True:
        m = re.search(regexp, code, re.MULTILINE)
        if not m: break
        newcode = func_body
        # replace function arguments by the variables from the function invocation
        for i in range(len(func_args)):
            newcode = newcode.replace(func_args[i], m.group(i+2))
        # Splice the new code inside
        newcode = newcode.replace('return', m.group(1) + " = ")
        code = code[:m.start()] + newcode + code[m.end()+1:]
    return code

    m = re.match(regexp, lines[0])
    return m.group(1), m.group(2).split(','), '\n'.join(lines[1:])
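To make the intended workflow concrete, here is a small wrapper (the name remove_procedure is our own and not part of the book's code) showing how parse_func and desugar can be chained, one pass per procedure, as described above:

import re  # both desugar and parse_func rely on the re module

def remove_procedure(program, proc_def):
    """One desugaring pass: parse a single procedure definition and
    replace every call to it inside `program` with its body."""
    name, args, body = parse_func(proc_def)
    return desugar(program, name, args, body)

# Sketch of desugaring the Majority program of Example 4.3, assuming
# maj_program and or_def/and_def/not_def hold the corresponding code strings:
# sugar_free = remove_procedure(maj_program, or_def)    # expand OR
# sugar_free = remove_procedure(sugar_free, and_def)    # expand AND
# sugar_free = remove_procedure(sugar_free, not_def)    # expand NOT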

4.1.3 Conditional statements


Another sorely missing feature in NAND-CIRC is a conditional
statement such as the if/then constructs that are found in many
programming languages. However, using procedures, we can ob-
tain an ersatz if/then construct. First we can compute the function
IF ∶ {0, 1}3 → {0, 1} such that IF(𝑎, 𝑏, 𝑐) equals 𝑏 if 𝑎 = 1 and 𝑐 if 𝑎 = 0.

P
Before reading onward, try to see how you could com-
pute the IF function using NAND’s. Once you do that,
see how you can use that to emulate if/then types of
constructs.

The IF function can be implemented from NANDs as follows (see


Exercise 4.2):

def IF(cond,a,b):
    notcond = NAND(cond,cond)
    temp = NAND(b,notcond)
    temp1 = NAND(a,cond)
    return NAND(temp,temp1)
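As a quick sanity check (this snippet is our own addition and not part of the original text), we can exhaustively verify that the four-line NAND implementation above agrees with the specification of IF on all eight inputs:

from itertools import product

def NAND(a, b):
    return 1 - (a & b)

def IF(cond, a, b):
    notcond = NAND(cond, cond)
    temp = NAND(b, notcond)
    temp1 = NAND(a, cond)
    return NAND(temp, temp1)

# IF(cond, a, b) should equal a when cond = 1 and b when cond = 0.
assert all(IF(cond, a, b) == (a if cond == 1 else b)
           for cond, a, b in product([0, 1], repeat=3))
print("IF agrees with its specification on all 8 inputs")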

The IF function is also known as a multiplexing function, since 𝑐𝑜𝑛𝑑


can be thought of as a switch that controls whether the output is con-
nected to 𝑎 or 𝑏. Once we have a procedure for computing the IF func-
tion, we can implement conditionals in NAND. The idea is that we
replace code of the form

if (condition): assign blah to variable foo

with code of the form

foo = IF(condition, blah, foo)

that assigns to foo its old value when condition equals 0, and
assign to foo the value of blah otherwise. More generally we can
replace code of the form

if (cond):
a = ...
b = ...
c = ...

with code of the form

temp_a = ...
temp_b = ...
temp_c = ...
a = IF(cond,temp_a,a)
b = IF(cond,temp_b,b)
c = IF(cond,temp_c,c)

Using such transformations, we can prove the following theorem.


Once again we omit the (not too insightful) full formal proof, though
see Section 4.1.2 for some hints on how to obtain it.

Theorem 4.6 — Conditional statements syntactic sugar. Let NAND-CIRC-
IF be the programming language NAND-CIRC augmented with
if/then/else statements for allowing code to be conditionally
executed based on whether a variable is equal to 0 or 1.
Then for every NAND-CIRC-IF program 𝑃 , there exists a stan-
dard (i.e., “sugar-free”) NAND-CIRC program 𝑃 ′ that computes
the same function as 𝑃 .

4.2 EXTENDED EXAMPLE: ADDITION AND MULTIPLICATION (OPTIONAL)
Using “syntactic sugar”, we can write the integer addition function as
follows:

# Add two n-bit integers
# Use LSB first notation for simplicity
def ADD(A,B):
    Result = [0]*(n+1)
    Carry = [0]*(n+1)
    Carry[0] = zero(A[0])
    for i in range(n):
        Result[i] = XOR(Carry[i],XOR(A[i],B[i]))
        Carry[i+1] = MAJ(Carry[i],A[i],B[i])
    Result[n] = Carry[n]
    return Result

ADD([1,1,1,0,0],[1,0,0,0,0])
# [0, 0, 0, 1, 0, 0]

where zero is the constant zero function, and MAJ and XOR corre-
spond to the majority and XOR functions respectively. While we use
Python syntax for convenience, in this example 𝑛 is some fixed integer
and so for every such 𝑛, ADD is a finite function that takes as input 2𝑛

bits and outputs 𝑛 + 1 bits. In particular for every 𝑛 we can remove


the loop construct for i in range(n) by simply repeating the code 𝑛
times, replacing the value of i with 0, 1, 2, … , 𝑛 − 1. By expanding out
all the features, for every value of 𝑛 we can translate the above pro-
gram into a standard (“sugar-free”) NAND-CIRC program. Fig. 4.4
depicts what we get for 𝑛 = 2.
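To actually run the ADD program above as Python, one also needs definitions of zero, XOR and MAJ and a value for 𝑛. Here is one minimal sketch (the particular NAND-based implementations are our own choice; any correct ones work):

def NAND(a, b): return 1 - (a & b)
def NOT(a):     return NAND(a, a)
def AND(a, b):  return NOT(NAND(a, b))
def OR(a, b):   return NAND(NOT(a), NOT(b))

def zero(a):      return AND(a, NOT(a))                    # the constant zero function
def XOR(a, b):    return AND(OR(a, b), NAND(a, b))
def MAJ(a, b, c): return OR(AND(a, b), OR(AND(a, c), AND(b, c)))

n = 5  # ADD becomes a finite function once n is fixed

# With these helpers in scope, the program above runs as written:
# ADD([1,1,1,0,0],[1,0,0,0,0]) returns [0, 0, 0, 1, 0, 0]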

Figure 4.4: The NAND-CIRC program and corre-


sponding NAND circuit for adding two-digit binary
numbers that are obtained by “expanding out” all the
syntactic sugar. The program/circuit has 43 lines/-
gates which is by no means necessary. It is possible
to add 𝑛 bit numbers using 9𝑛 NAND gates, see
Exercise 4.5.

By going through the above program carefully and accounting for


the number of gates, we can see that it yields a proof of the following
theorem (see also Fig. 4.5):

Theorem 4.7 — Addition using NAND-CIRC programs. For every 𝑛 ∈ ℕ,
let ADD𝑛 ∶ {0, 1}2𝑛 → {0, 1}𝑛+1 be the function that, given
𝑥, 𝑥′ ∈ {0, 1}𝑛 computes the representation of the sum of the num-
bers that 𝑥 and 𝑥′ represent. Then there is a constant 𝑐 ≤ 30 such
that for every 𝑛 there is a NAND-CIRC program of at most 𝑐𝑛 lines
computing ADD𝑛 .1

1 The value of 𝑐 can be improved to 9, see Exercise 4.5.

Once we have addition, we can use the grade-school algorithm to


obtain multiplication as well, thus obtaining the following theorem:

Theorem 4.8 — Multiplication using NAND-CIRC programs. For every 𝑛,
let MULT𝑛 ∶ {0, 1}2𝑛 → {0, 1}2𝑛 be the function that, given
𝑥, 𝑥′ ∈ {0, 1}𝑛 computes the representation of the product of the
numbers that 𝑥 and 𝑥′ represent. Then there is a constant 𝑐 such
that for every 𝑛, there is a NAND-CIRC program of at most 𝑐𝑛² lines
that computes the function MULT𝑛 .

Figure 4.5: The number of lines in our NAND-CIRC program to add two 𝑛 bit numbers, as a function of 𝑛, for 𝑛’s between 1 and 100. This is not the most efficient program for this task, but the important point is that it has the form 𝑂(𝑛).

We omit the proof, though in Exercise 4.7 we ask you to supply
a “constructive proof” in the form of a program (in your favorite

programming language) that on input a number 𝑛, outputs the code


of a NAND-CIRC program of at most 1000𝑛2 lines that computes the
MULT𝑛 function. In fact, we can use Karatsuba’s algorithm to show
that there is a NAND-CIRC program of 𝑂(𝑛log2 3 ) lines to compute
MULT𝑛 (and can get even further asymptotic improvements using
better algorithms).
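As an aside, here is a brief sketch of the Karatsuba idea mentioned above, written for integers rather than circuits (our own illustration): one multiplication of 𝑛-digit numbers is reduced to three (rather than four) multiplications of roughly 𝑛/2-digit numbers, which is what yields the 𝑂(𝑛^{log2 3}) bound.

def karatsuba(x, y):
    """Multiply non-negative integers using three recursive multiplications
    of half-size numbers instead of four (illustration only)."""
    if x < 10 or y < 10:
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    hi_x, lo_x = divmod(x, 10 ** m)
    hi_y, lo_y = divmod(y, 10 ** m)
    a = karatsuba(hi_x, hi_y)
    b = karatsuba(lo_x, lo_y)
    c = karatsuba(hi_x + lo_x, hi_y + lo_y) - a - b   # equals hi_x*lo_y + lo_x*hi_y
    return a * 10 ** (2 * m) + c * 10 ** m + b

assert karatsuba(1234, 5678) == 1234 * 5678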

4.3 THE LOOKUP FUNCTION


The LOOKUP function will play an important role in this chapter and
later. It is defined as follows:

Definition 4.9 — Lookup function. For every 𝑘, the lookup function of
order 𝑘, LOOKUP𝑘 ∶ {0, 1}^{2^𝑘+𝑘} → {0, 1}, is defined as follows: For
every 𝑥 ∈ {0, 1}^{2^𝑘} and 𝑖 ∈ {0, 1}𝑘 ,

LOOKUP𝑘 (𝑥, 𝑖) = 𝑥𝑖

where 𝑥𝑖 denotes the 𝑖-th entry of 𝑥, using the binary representation
to identify 𝑖 with a number in {0, … , 2^𝑘 − 1}.
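In Python, the specification of LOOKUP𝑘 is a one-liner (a sketch of the definition only; we represent 𝑖 with its most significant bit first, matching the convention used in the rest of this section):

def LOOKUP(X, i):
    """X is a list of 2**k bits, i a list of k bits (most significant bit first)."""
    index = int("".join(str(b) for b in i), 2)
    return X[index]

X = [0] * 16
X[6] = 1
assert LOOKUP(X, [0, 1, 1, 0]) == 1   # i = 0110 is the binary representation of 6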

Figure 4.6: The LOOKUP𝑘 function takes an input
in {0, 1}^{2^𝑘+𝑘}, which we denote by 𝑥, 𝑖 (with 𝑥 ∈
{0, 1}^{2^𝑘} and 𝑖 ∈ {0, 1}𝑘 ). The output is 𝑥𝑖 : the 𝑖-th
coordinate of 𝑥, where we identify 𝑖 as a number
in [2^𝑘 ] using the binary representation. In the above
example 𝑥 ∈ {0, 1}16 and 𝑖 ∈ {0, 1}4 . Since 𝑖 = 0110
is the binary representation of the number 6, the
output of LOOKUP4 (𝑥, 𝑖) in this case is 𝑥6 = 1.

See Fig. 4.6 for an illustration of the LOOKUP function. It turns


out that for every 𝑘, we can compute LOOKUP𝑘 using a NAND-CIRC
program:

Theorem 4.10 — Lookup function. For every 𝑘 > 0, there is a NAND-
CIRC program that computes the function LOOKUP𝑘 ∶ {0, 1}^{2^𝑘+𝑘} →
{0, 1}. Moreover, the number of lines in this program is at most
4 ⋅ 2𝑘 .

An immediate corollary of Theorem 4.10 is that for every 𝑘 > 0,


LOOKUP𝑘 can be computed by a Boolean circuit (with AND, OR and
NOT gates) of at most 8 ⋅ 2𝑘 gates.

4.3.1 Constructing a NAND-CIRC program for LOOKUP


We prove Theorem 4.10 by induction. For the case 𝑘 = 1, LOOKUP1
maps (𝑥0 , 𝑥1 , 𝑖) ∈ {0, 1}3 to 𝑥𝑖 . In other words, if 𝑖 = 0 then it outputs

𝑥0 and otherwise it outputs 𝑥1 , which (up to reordering variables) is


the same as the IF function presented in Section 4.1.3, which can be
computed by a 4-line NAND-CIRC program.
As a warm-up for the case of general 𝑘, let us consider the case
of 𝑘 = 2. Given input 𝑥 = (𝑥0 , 𝑥1 , 𝑥2 , 𝑥3 ) for LOOKUP2 and an
index 𝑖 = (𝑖0 , 𝑖1 ), if the most significant bit 𝑖0 of the index is 0 then
LOOKUP2 (𝑥, 𝑖) will equal 𝑥0 if 𝑖1 = 0 and equal 𝑥1 if 𝑖1 = 1. Similarly,
if the most significant bit 𝑖0 is 1 then LOOKUP2 (𝑥, 𝑖) will equal 𝑥2 if
𝑖1 = 0 and will equal 𝑥3 if 𝑖1 = 1. Another way to say this is that we
can write LOOKUP2 as follows:

def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
    if i[0]==1:
        return LOOKUP1(X[2],X[3],i[1])
    else:
        return LOOKUP1(X[0],X[1],i[1])

or in other words,

def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
    a = LOOKUP1(X[2],X[3],i[1])
    b = LOOKUP1(X[0],X[1],i[1])
    return IF( i[0],a,b)

More generally, as shown in the following lemma, we can compute


LOOKUP𝑘 using two invocations of LOOKUP𝑘−1 and one invocation
of IF:
Lemma 4.11 — Lookup recursion. For every 𝑘 ≥ 2, LOOKUP𝑘 (𝑥_0 , … , 𝑥_{2^𝑘−1} , 𝑖_0 , … , 𝑖_{𝑘−1})
is equal to

IF (𝑖_0 , LOOKUP_{𝑘−1} (𝑥_{2^{𝑘−1}} , … , 𝑥_{2^𝑘−1} , 𝑖_1 , … , 𝑖_{𝑘−1}), LOOKUP_{𝑘−1} (𝑥_0 , … , 𝑥_{2^{𝑘−1}−1} , 𝑖_1 , … , 𝑖_{𝑘−1}))

Proof. If the most significant bit 𝑖0 of 𝑖 is zero, then the index 𝑖 is


in {0, … , 2𝑘−1 − 1} and hence we can perform the lookup on the
“first half” of 𝑥 and the result of LOOKUP𝑘 (𝑥, 𝑖) will be the same as
𝑎 = LOOKUP𝑘−1 (𝑥0 , … , 𝑥2𝑘−1 −1 , 𝑖1 , … , 𝑖𝑘−1 ). On the other hand, if this
most significant bit 𝑖0 is equal to 1, then the index is in {2𝑘−1 , … , 2𝑘 −
1}, in which case the result of LOOKUP𝑘 (𝑥, 𝑖) is the same as 𝑏 =
LOOKUP𝑘−1 (𝑥2𝑘−1 , … , 𝑥2𝑘 −1 , 𝑖1 , … , 𝑖𝑘−1 ). Thus we can compute
LOOKUP𝑘 (𝑥, 𝑖) by first computing 𝑎 and 𝑏 and then outputting
IF(𝑖0 , 𝑏, 𝑎).

Proof of Theorem 4.10 from Lemma 4.11. Now that we have Lemma 4.11,
we can complete the proof of Theorem 4.10. We will prove by induc-
tion on 𝑘 that there is a NAND-CIRC program of at most 4 ⋅ (2𝑘 − 1)

lines for LOOKUP𝑘 . For 𝑘 = 1 this follows by the four line program for
IF we’ve seen before. For 𝑘 > 1, we use the following pseudocode:

a = LOOKUP_(k-1)(X[0],...,X[2^(k-1)-1],i[1],...,i[k-1])
b = LOOKUP_(k-1)(X[2^(k-1)],...,X[2^k-1],i[1],...,i[k-1])
return IF(i[0],b,a)

If we let 𝐿(𝑘) be the number of lines required for LOOKUP𝑘 , then


the above pseudo-code shows that

𝐿(𝑘) ≤ 2𝐿(𝑘 − 1) + 4 . (4.1)

Since under our induction hypothesis 𝐿(𝑘 − 1) ≤ 4(2𝑘−1 − 1), we get


that 𝐿(𝑘) ≤ 2 ⋅ 4(2𝑘−1 − 1) + 4 = 4(2𝑘 − 1) which is what we wanted
to prove. See Fig. 4.7 for a plot of the actual number of lines in our
implementation of LOOKUP𝑘 .
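The recursive construction of this proof translates directly into Python (a sketch mirroring the pseudocode above; IF is the four-line NAND procedure from Section 4.1.3):

def NAND(a, b): return 1 - (a & b)

def IF(cond, a, b):
    notcond = NAND(cond, cond)
    return NAND(NAND(b, notcond), NAND(a, cond))

def LOOKUP_rec(X, i):
    """Recursive LOOKUP_k: X has length 2**k, i has length k (MSB first)."""
    if len(i) == 1:
        return IF(i[0], X[1], X[0])      # LOOKUP_1 is just IF with reordered arguments
    half = len(X) // 2
    a = LOOKUP_rec(X[:half], i[1:])      # used when i[0] = 0: first half of X
    b = LOOKUP_rec(X[half:], i[1:])      # used when i[0] = 1: second half of X
    return IF(i[0], b, a)

X = [1, 1, 0, 0, 1, 0, 0, 1]             # an example with k = 3
assert all(LOOKUP_rec(X, [j >> 2 & 1, j >> 1 & 1, j & 1]) == X[j] for j in range(8))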

4.4 COMPUTING EVERY FUNCTION


At this point we know the following facts about NAND-CIRC pro-
grams (and so equivalently about Boolean circuits and our other
equivalent models):

1. They can compute at least some non-trivial functions.

2. Coming up with NAND-CIRC programs for various functions is a
very tedious task.

Figure 4.7: The number of lines in our implementation of the LOOKUP_k function as a function of 𝑘 (i.e., the length of the index). The number of lines in our implementation is roughly 3 ⋅ 2𝑘 .
Thus I would not blame the reader if they were not particularly
looking forward to a long sequence of examples of functions that can
be computed by NAND-CIRC programs. However, it turns out we are
not going to need this, as we can show in one fell swoop that NAND-
CIRC programs can compute every finite function:

Theorem 4.12 — Universality of NAND. There exists some constant 𝑐 > 0
such that for every 𝑛, 𝑚 > 0 and function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 ,
there is a NAND-CIRC program with at most 𝑐⋅𝑚2𝑛 lines that com-
putes the function 𝑓 .

By Theorem 3.19, the models of NAND circuits, NAND-CIRC pro-


grams, AON-CIRC programs, and Boolean circuits, are all equivalent
to one another, and hence Theorem 4.12 holds for all these models. In
particular, the following theorem is equivalent to Theorem 4.12:

Theorem 4.13 — Universality of Boolean circuits. There exists some
constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function
𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a Boolean circuit with at most
𝑐 ⋅ 𝑚2𝑛 gates that computes the function 𝑓 .

 Big Idea 5 Every finite function can be computed by a large


enough Boolean circuit.

Improved bounds. Though it will not be of great importance to us, it
is possible to improve on the proof of Theorem 4.12 and shave an extra
factor of 𝑛, as well as optimize the constant 𝑐, and so prove that for
every 𝜖 > 0, 𝑚 ∈ ℕ and sufficiently large 𝑛, if 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚
then 𝑓 can be computed by a NAND circuit of at most (1 + 𝜖) ⋅ 𝑚 ⋅ 2^𝑛/𝑛
gates. The proof of this result is beyond the scope of this book, but we
do discuss how to obtain a bound of the form 𝑂(𝑚 ⋅ 2^𝑛/𝑛) in Section 4.4.2;
see also the biographical notes.

4.4.1 Proof of NAND’s Universality


To prove Theorem 4.12, we need to give a NAND circuit, or equiva-
lently a NAND-CIRC program, for every possible function. We will
restrict our attention to the case of Boolean functions (i.e., 𝑚 = 1).
Exercise 4.9 asks you to extend the proof for all values of 𝑚. A func-
tion 𝐹 ∶ {0, 1}𝑛 → {0, 1} can be specified by a table of its values for
each one of the 2𝑛 inputs. For example, the table below describes one
particular function 𝐺 ∶ {0, 1}4 → {0, 1}:2

2 In case you are curious, this is the function that on input 𝑖 ∈ {0, 1}4 (which we interpret as a number in [16]), outputs the 𝑖-th digit of 𝜋 in the binary basis.
Table 4.1: An example of a function 𝐺 ∶ {0, 1}4 → {0, 1}.

Input (𝑥) Output (𝐺(𝑥))


0000 1
0001 1
0010 0
0011 0
0100 1
0101 0
0110 0
0111 1
1000 0
1001 0
1010 0
1011 0
1100 1
1101 1
1110 1
1111 1

For every 𝑥 ∈ {0, 1}4 , 𝐺(𝑥) = LOOKUP4 (1100100100001111, 𝑥), and


so the following is NAND-CIRC “pseudocode” to compute 𝐺 using
syntactic sugar for the LOOKUP_4 procedure.

G0000 = 1
G1000 = 1
G0100 = 0
...
G0111 = 1
G1111 = 1
Y[0] = LOOKUP_4(G0000,G1000,...,G1111,
X[0],X[1],X[2],X[3])

We can translate this pseudocode into an actual NAND-CIRC pro-


gram by adding three lines to define variables zero and one that are
initialized to 0 and 1 respectively, and then replacing a statement such
as Gxxx = 0 with Gxxx = NAND(one,one) and a statement such as
Gxxx = 1 with Gxxx = NAND(zero,zero). The call to LOOKUP_4 will
be replaced by the NAND-CIRC program that computes LOOKUP4 ,
plugging in the appropriate inputs.
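For concreteness, here is one possible way to obtain the constants (our own choice; the text only asserts that some three such lines exist), using the fact that NAND(𝑎, NOT(𝑎)) = 1 for both values of 𝑎. The sketch is in Python, but each assignment corresponds to one NAND-CIRC line:

def NAND(a, b): return 1 - (a & b)

def constants(x0):
    temp = NAND(x0, x0)     # temp = NOT(x0)
    one = NAND(x0, temp)    # NAND(a, NOT(a)) = 1 whether a is 0 or 1
    zero = NAND(one, one)   # NOT(1) = 0
    return zero, one

assert constants(0) == (0, 1) and constants(1) == (0, 1)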
There was nothing about the above reasoning that was particular to
the function 𝐺 above. Given every function 𝐹 ∶ {0, 1}𝑛 → {0, 1}, we
can write a NAND-CIRC program that does the following:

1. Initialize 2𝑛 variables of the form F00...0 till F11...1 so that for


every 𝑧 ∈ {0, 1}𝑛 , the variable corresponding to 𝑧 is assigned the
value 𝐹 (𝑧).

2. Compute LOOKUP𝑛 on the 2𝑛 variables initialized in the previ-


ous step, with the index variable being the input variables X[0
],…,X[𝑛 − 1 ]. That is, just like in the pseudocode for G above, we
use Y[0] = LOOKUP(F00..00,...,F11..1,X[0],..,X[𝑛 − 1])

The total number of lines in the resulting program is 3 + 2𝑛 lines for


initializing the variables plus the 4 ⋅ 2𝑛 lines that we pay for computing
LOOKUP𝑛 . This completes the proof of Theorem 4.12.
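The whole construction can be simulated in a few lines of Python (a sketch; the helper names are ours, and the lookup function below is just the specification of LOOKUP𝑛 , which could equally be replaced by the recursive NAND-based implementation sketched at the end of Section 4.3.1): hard-wire all 2^𝑛 values of 𝑓 and select the one indexed by the input.

def lookup(X, i):
    """LOOKUP: select the entry of X indexed by the bits i (MSB first)."""
    return X[int("".join(str(b) for b in i), 2)]

def circuit_from_truth_table(f, n):
    """Return a function computing f by the construction of Theorem 4.12:
    store all 2**n values of f and select one of them with LOOKUP."""
    table = [f(tuple((z >> (n - 1 - j)) & 1 for j in range(n))) for z in range(2 ** n)]
    return lambda x: lookup(table, x)

f = lambda x: (x[0] + x[1] + x[2]) % 2     # XOR of three bits, as an example
C = circuit_from_truth_table(f, 3)
assert all(C((a, b, c)) == f((a, b, c))
           for a in (0, 1) for b in (0, 1) for c in (0, 1))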

R
Remark 4.14 — Result in perspective. While Theo-
rem 4.12 seems striking at first, in retrospect, it is
perhaps not that surprising that every finite function
can be computed with a NAND-CIRC program. After
all, a finite function 𝐹 ∶ {0, 1}𝑛 → {0, 1}𝑚 can be
represented by simply the list of its outputs for each
one of the 2𝑛 input values. So it makes sense that we
could write a NAND-CIRC program of similar size
to compute it. What is more interesting is that some
functions, such as addition and multiplication, have

a much more efficient representation: one that only


requires 𝑂(𝑛2 ) or even fewer lines.

4.4.2 Improving by a factor of 𝑛 (optional)


By being a little more careful, we can improve the bound of Theo-
rem 4.12 and show that every function 𝐹 ∶ {0, 1}𝑛 → {0, 1}𝑚 can be
computed by a NAND-CIRC program of at most 𝑂(𝑚2𝑛 /𝑛) lines. In
other words, we can prove the following improved version:

Theorem 4.15 — Universality of NAND circuits, improved bound. There ex-
ists a constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function
𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a NAND-CIRC program with at most
𝑐 ⋅ 𝑚2𝑛 /𝑛 lines that computes the function 𝑓. 3

3 The constant 𝑐 in this theorem is at most 10 and in fact can be arbitrarily close to 1, see Section 4.8.
Proof. As before, it is enough to prove the case that 𝑚 = 1. Hence
we let 𝑓 ∶ {0, 1}𝑛 → {0, 1}, and our goal is to prove that there exists
a NAND-CIRC program of 𝑂(2𝑛 /𝑛) lines (or equivalently a Boolean
circuit of 𝑂(2𝑛 /𝑛) gates) that computes 𝑓.
We let 𝑘 = log(𝑛 − 2 log 𝑛) (the reasoning behind this choice will
become clear later on). We define the function 𝑔 ∶ {0, 1}𝑘 → {0, 1}^{2^{𝑛−𝑘}}
as follows:

𝑔(𝑎) = 𝑓(𝑎0^{𝑛−𝑘} )𝑓(𝑎0^{𝑛−𝑘−1} 1) ⋯ 𝑓(𝑎1^{𝑛−𝑘} ) .

In other words, if we use the usual binary representation to identify


the numbers {0, … , 2𝑛−𝑘 − 1} with the strings {0, 1}𝑛−𝑘 , then for every
𝑎 ∈ {0, 1}𝑘 and 𝑏 ∈ {0, 1}𝑛−𝑘

𝑔(𝑎)𝑏 = 𝑓(𝑎𝑏) . (4.2)

Figure 4.8: We can compute 𝑓 ∶ {0, 1}𝑛 → {0, 1} on


input 𝑥 = 𝑎𝑏 where 𝑎 ∈ {0, 1}𝑘 and 𝑏 ∈ {0, 1}𝑛−𝑘
by first computing the 2𝑛−𝑘 long string 𝑔(𝑎) that
corresponds to all 𝑓’s values on inputs that begin with
𝑎, and then outputting the 𝑏-th coordinate of this
string.

(4.2) means that for every 𝑥 ∈ {0, 1}𝑛 , if we write 𝑥 = 𝑎𝑏 with


𝑎 ∈ {0, 1}𝑘 and 𝑏 ∈ {0, 1}𝑛−𝑘 then we can compute 𝑓(𝑥) by first

computing the string 𝑇 = 𝑔(𝑎) of length 2𝑛−𝑘 , and then computing


LOOKUP𝑛−𝑘 (𝑇 , 𝑏) to retrieve the element of 𝑇 at the position cor-
responding to 𝑏 (see Fig. 4.8). The cost to compute the LOOKUP𝑛−𝑘
is 𝑂(2𝑛−𝑘 ) lines/gates and the cost in NAND-CIRC lines (or Boolean
gates) to compute 𝑓 is at most

𝑐𝑜𝑠𝑡(𝑔) + 𝑂(2𝑛−𝑘 ) , (4.3)

where 𝑐𝑜𝑠𝑡(𝑔) is the number of operations (i.e., lines of NAND-CIRC


programs or gates in a circuit) needed to compute 𝑔.
To complete the proof we need to give a bound on 𝑐𝑜𝑠𝑡(𝑔). Since 𝑔
is a function mapping {0, 1}𝑘 to {0, 1}^{2^{𝑛−𝑘}}, we can also think of it as a
collection of 2𝑛−𝑘 functions 𝑔_0 , … , 𝑔_{2^{𝑛−𝑘}−1} ∶ {0, 1}𝑘 → {0, 1}, where
𝑔𝑖 (𝑎) = 𝑔(𝑎)𝑖 for every 𝑎 ∈ {0, 1}𝑘 and 𝑖 ∈ [2𝑛−𝑘 ]. (That is, 𝑔𝑖 (𝑎) is
the 𝑖-th bit of 𝑔(𝑎).) Naively, we could use Theorem 4.12 to compute
each 𝑔𝑖 in 𝑂(2𝑘 ) lines, but then the total cost is 𝑂(2𝑛−𝑘 ⋅ 2𝑘 ) = 𝑂(2𝑛 )
which does not save us anything. However, the crucial observation
is that there are only 2^{2^𝑘} distinct functions mapping {0, 1}𝑘 to {0, 1}.
For example, if 𝑔17 is an identical function to 𝑔67 that means that if


we already computed 𝑔17 (𝑎) then we can compute 𝑔67 (𝑎) using only
a constant number of operations: simply copy the same value! In
general, if you have a collection of 𝑁 functions 𝑔0 , … , 𝑔𝑁−1 mapping
{0, 1}𝑘 to {0, 1}, of which at most 𝑆 are distinct then for every value
𝑎 ∈ {0, 1}𝑘 we can compute the 𝑁 values 𝑔0 (𝑎), … , 𝑔𝑁−1 (𝑎) using at
most 𝑂(𝑆 ⋅ 2𝑘 + 𝑁 ) operations (see Fig. 4.9).
In our case, because there are at most 2^{2^𝑘} distinct functions map-
ping {0, 1}𝑘 to {0, 1}, we can compute the function 𝑔 (and hence by
(4.2) also 𝑓) using at most

𝑂(2^{2^𝑘} ⋅ 2^𝑘 + 2^{𝑛−𝑘})    (4.4)
operations. Now all that is left is to plug into (4.4) our choice of 𝑘 =
log(𝑛 − 2 log 𝑛). By definition, 2^𝑘 = 𝑛 − 2 log 𝑛, which means that (4.4)
can be bounded by

𝑂 (2^{𝑛−2 log 𝑛} ⋅ (𝑛 − 2 log 𝑛) + 2^{𝑛−log(𝑛−2 log 𝑛)} ) ≤ 𝑂 ( (2^𝑛/𝑛²) ⋅ 𝑛 + 2^𝑛/(𝑛 − 2 log 𝑛) ) ≤ 𝑂 ( 2^𝑛/𝑛 + 2^𝑛/(0.5𝑛) ) = 𝑂 ( 2^𝑛/𝑛 )

which is what we wanted to prove. (We used above the fact that 𝑛 −
2 log 𝑛 ≥ 0.5𝑛 for sufficiently large 𝑛.)

Figure 4.9: If 𝑔_0 , … , 𝑔_{𝑁−1} is a collection of functions each mapping {0, 1}𝑘 to {0, 1} such that at most 𝑆 of them are distinct then for every 𝑎 ∈ {0, 1}𝑘 , we can compute all the values 𝑔_0 (𝑎), … , 𝑔_{𝑁−1} (𝑎) using at most 𝑂(𝑆 ⋅ 2𝑘 + 𝑁) operations by first computing the distinct functions and then copying the resulting values.

Using the connection between NAND-CIRC programs and Boolean


circuits, an immediate corollary of Theorem 4.15 is the following
improvement to Theorem 4.13:

Theorem 4.16 — Universality of Boolean circuits, improved bound. There
exists some constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and func-
tion 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a Boolean circuit with at most
𝑐 ⋅ 𝑚2𝑛 /𝑛 gates that computes the function 𝑓 .

4.5 COMPUTING EVERY FUNCTION: AN ALTERNATIVE PROOF


Theorem 4.13 is a fundamental result in the theory (and practice!) of
computation. In this section, we present an alternative proof of this
basic fact that Boolean circuits can compute every finite function. This
alternative proof gives a somewhat worse quantitative bound on the
number of gates but it has the advantage of being simpler, working
directly with circuits and avoiding the usage of all the syntactic sugar
machinery. (However, that machinery is useful in its own right, and
will find other applications later on.)

Theorem 4.17 — Universality of Boolean circuits (alternative phrasing). There
exists some constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and func-
tion 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a Boolean circuit with at most
𝑐 ⋅ 𝑚 ⋅ 𝑛2𝑛 gates that computes the function 𝑓 .

Proof Idea:
The idea of the proof is illustrated in Fig. 4.10. As before, it is
enough to focus on the case that 𝑚 = 1 (the function 𝑓 has a sin-
gle output), since we can always extend this to the case of 𝑚 > 1
by looking at the composition of 𝑚 circuits each computing a differ-
ent output bit of the function 𝑓. We start by showing that for every
𝛼 ∈ {0, 1}𝑛 , there is an 𝑂(𝑛)-sized circuit that computes the function
𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1} defined as follows: 𝛿𝛼 (𝑥) = 1 iff 𝑥 = 𝛼 (that is,
𝛿𝛼 outputs 0 on all inputs except the input 𝛼). We can then write any
function 𝑓 ∶ {0, 1}𝑛 → {0, 1} as the OR of at most 2𝑛 functions 𝛿𝛼 for
the 𝛼’s on which 𝑓(𝛼) = 1.
⋆

Figure 4.10: Given a function 𝑓 ∶ {0, 1}𝑛 → {0, 1}, we let {𝑥_0 , 𝑥_1 , … , 𝑥_{𝑁−1} } ⊆ {0, 1}𝑛 be the set of inputs such that 𝑓(𝑥𝑖 ) = 1, and note that 𝑁 ≤ 2𝑛 . We can express 𝑓 as the OR of 𝛿_{𝑥𝑖} for 𝑖 ∈ [𝑁] where the function 𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1} (for 𝛼 ∈ {0, 1}𝑛 ) is defined as follows: 𝛿𝛼 (𝑥) = 1 iff 𝑥 = 𝛼. We can compute the OR of 𝑁 values using 𝑁 two-input OR gates. Therefore if we have a circuit of size 𝑂(𝑛) to compute 𝛿𝛼 for every 𝛼 ∈ {0, 1}𝑛 , we can compute 𝑓 using a circuit of size 𝑂(𝑛 ⋅ 𝑁) = 𝑂(𝑛 ⋅ 2𝑛 ).

Proof of Theorem 4.17. We prove the theorem for the case 𝑚 = 1. The
result can be extended for 𝑚 > 1 as before (see also Exercise 4.9). Let

𝑓 ∶ {0, 1}𝑛 → {0, 1}. We will prove that there is an 𝑂(𝑛 ⋅ 2𝑛 )-sized
Boolean circuit to compute 𝑓 in the following steps:

1. We show that for every 𝛼 ∈ {0, 1}𝑛 , there is an 𝑂(𝑛)-sized circuit


that computes the function 𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1}, where 𝛿𝛼 (𝑥) = 1 iff
𝑥 = 𝛼.

2. We then show that this implies the existence of an 𝑂(𝑛 ⋅ 2𝑛 )-sized


circuit that computes 𝑓, by writing 𝑓(𝑥) as the OR of 𝛿𝛼 (𝑥) for all

𝛼 ∈ {0, 1}𝑛 such that 𝑓(𝛼) = 1. (If 𝑓 is the constant zero function
and hence there is no such 𝛼, then we can use the circuit 𝑓(𝑥) =
𝑥_0 ∧ NOT(𝑥_0 ).)

We start with Step 1:


CLAIM: For 𝛼 ∈ {0, 1}𝑛 , define 𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1} by 𝛿𝛼 (𝑥) = 1 if
𝑥 = 𝛼 and 𝛿𝛼 (𝑥) = 0 otherwise. Then there is a Boolean circuit using
at most 2𝑛 gates that computes 𝛿𝛼 .


PROOF OF CLAIM: The proof is illustrated in Fig. 4.11. As an
example, consider the function 𝛿_{011} ∶ {0, 1}3 → {0, 1}. This function
outputs 1 on 𝑥 if and only if 𝑥_0 = 0, 𝑥_1 = 1 and 𝑥_2 = 1, and so we can
write 𝛿_{011} (𝑥) = NOT(𝑥_0 ) ∧ 𝑥_1 ∧ 𝑥_2 , which translates into a Boolean circuit
with one NOT gate and two AND gates. More generally, for every
𝛼 ∈ {0, 1}𝑛 , we can express 𝛿𝛼 (𝑥) as (𝑥_0 = 𝛼_0 )∧(𝑥_1 = 𝛼_1 )∧⋯∧(𝑥_{𝑛−1} =
𝛼_{𝑛−1} ), where if 𝛼𝑖 = 0 we replace 𝑥𝑖 = 𝛼𝑖 with NOT(𝑥𝑖 ) and if 𝛼𝑖 = 1 we
replace 𝑥𝑖 = 𝛼𝑖 by simply 𝑥𝑖 . This yields a circuit that computes 𝛿𝛼
using 𝑛 AND gates and at most 𝑛 NOT gates, so a total of at most 2𝑛
gates.
Now for every function 𝑓 ∶ {0, 1}𝑛 → {0, 1}, we can write

𝑓(𝑥) = 𝛿𝑥0 (𝑥) ∨ 𝛿𝑥1 (𝑥) ∨ ⋯ ∨ 𝛿𝑥𝑁−1 (𝑥) (4.5)


where 𝑆 = {𝑥0 , … , 𝑥𝑁−1 } is the set of inputs on which 𝑓 outputs 1.
(To see this, you can verify that the right-hand side of (4.5) evaluates
to 1 on 𝑥 ∈ {0, 1}𝑛 if and only if 𝑥 is in the set 𝑆.)
Therefore we can compute 𝑓 using a Boolean circuit of at most 2𝑛
gates for each of the 𝑁 functions 𝛿𝑥𝑖 and combine that with at most 𝑁
OR gates, thus obtaining a circuit of at most 2𝑛 ⋅ 𝑁 + 𝑁 gates. Since
𝑆 ⊆ {0, 1}𝑛 , its size 𝑁 is at most 2𝑛 and hence the total number of
gates in this circuit is 𝑂(𝑛 ⋅ 2𝑛 ).
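Here is the same two-step construction expressed in Python (a sketch, with the gate-count bookkeeping left to the comments; the helper names are ours):

from itertools import product

def delta(alpha):
    """delta_alpha: outputs 1 only on the string alpha.
    As a circuit: at most n NOT gates plus n AND gates."""
    return lambda x: 1 if tuple(x) == tuple(alpha) else 0

def or_of_deltas(f, n):
    """Compute f as the OR of delta_alpha over all alpha with f(alpha) = 1.
    As a circuit: at most 2n gates per delta plus one OR gate per term."""
    deltas = [delta(alpha) for alpha in product((0, 1), repeat=n) if f(alpha) == 1]
    return lambda x: max((d(x) for d in deltas), default=0)  # OR of the terms (0 if f is constant zero)

f = lambda x: 1 if sum(x) >= 2 else 0      # MAJ on 3 bits, as an example
C = or_of_deltas(f, 3)
assert all(C(x) == f(x) for x in product((0, 1), repeat=3))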

4.6 THE CLASS SIZE𝑛,𝑚 (𝑠)


We have seen that every function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 can be com-
puted by a circuit of size 𝑂(𝑚 ⋅ 2𝑛 ), and some functions (such as ad-
dition and multiplication) can be computed by much smaller circuits.
We define SIZE𝑛,𝑚 (𝑠) to be the set of functions mapping 𝑛 bits to 𝑚
bits that can be computed by NAND circuits of at most 𝑠 gates (or
equivalently, by NAND-CIRC programs of at most 𝑠 lines). Formally,
the definition is as follows:

Figure 4.11: For every string 𝛼 ∈ {0, 1}𝑛 , there is a Boolean circuit of 𝑂(𝑛) gates to compute the function 𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1} such that 𝛿𝛼 (𝑥) = 1 if and only if 𝑥 = 𝛼. The circuit is very simple. Given input 𝑥_0 , … , 𝑥_{𝑛−1} we compute the AND of 𝑧_0 , … , 𝑧_{𝑛−1} where 𝑧𝑖 = 𝑥𝑖 if 𝛼𝑖 = 1 and 𝑧𝑖 = NOT(𝑥𝑖 ) if 𝛼𝑖 = 0. While formally Boolean circuits only have a gate for computing the AND of two inputs, we can implement an AND of 𝑛 inputs by composing 𝑛 two-input ANDs.

Definition 4.18 — Size class of functions. For all natural numbers 𝑛, 𝑚, 𝑠,
let SIZE𝑛,𝑚 (𝑠) denote the set of all functions 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚
such that there exists a NAND circuit of at most 𝑠 gates comput-
ing 𝑓. We denote by SIZE𝑛 (𝑠) the set SIZE𝑛,1 (𝑠). For every integer
𝑠 ≥ 1, we let SIZE(𝑠) = ∪𝑛,𝑚 SIZE𝑛,𝑚 (𝑠) be the set of all functions
𝑓 for which there exists a NAND circuit of at most 𝑠 gates that
compute 𝑓.

Fig. 4.12 depicts the set SIZE𝑛,1 (𝑠). Note that SIZE𝑛,𝑚 (𝑠) is a set of
functions, not of programs! Asking if a program or a circuit is a mem-
ber of SIZE𝑛,𝑚 (𝑠) is a category error as in the sense of Fig. 4.13. As we
discussed in Section 3.7.2 (and Section 2.6.1), the distinction between
programs and functions is absolutely crucial. You should always re-
member that while a program computes a function, it is not equal to
a function. In particular, as we’ve seen, there can be more than one
program to compute the same function.

Figure 4.12: There are 2^{2^𝑛} functions mapping {0, 1}𝑛
to {0, 1}, and an infinite number of circuits with 𝑛 bit
inputs and a single bit of output. Every circuit com-
putes one function, but every function can be com-
puted by many circuits. We say that 𝑓 ∈ SIZE𝑛,1 (𝑠)
if the smallest circuit that computes 𝑓 has 𝑠 or fewer
gates. For example XOR𝑛 ∈ SIZE𝑛,1 (4𝑛). Theo-
rem 4.12 shows that every function 𝑔 is computable
by some circuit of at most 𝑐 ⋅ 2𝑛 /𝑛 gates, and hence
SIZE𝑛,1 (𝑐 ⋅ 2𝑛 /𝑛) corresponds to the set of all func-
tions from {0, 1}𝑛 to {0, 1}.

While we defined SIZE𝑛 (𝑠) with respect to NAND gates, we


would get essentially the same class if we defined it with respect to
AND/OR/NOT gates:
Lemma 4.19 Let SIZE^{𝐴𝑂𝑁}_{𝑛,𝑚} (𝑠) denote the set of all functions 𝑓 ∶ {0, 1}𝑛 →
{0, 1}𝑚 that can be computed by an AND/OR/NOT Boolean circuit of
at most 𝑠 gates. Then,

SIZE𝑛,𝑚 (𝑠/2) ⊆ SIZE^{𝐴𝑂𝑁}_{𝑛,𝑚} (𝑠) ⊆ SIZE𝑛,𝑚 (3𝑠)

Proof. If 𝑓 can be computed by a NAND circuit of at most 𝑠/2 gates,


then by replacing each NAND with the two gates NOT and AND, we
can obtain an AND/OR/NOT Boolean circuit of at most 𝑠 gates that

computes 𝑓. On the other hand, if 𝑓 can be computed by a Boolean


AND/OR/NOT circuit of at most 𝑠 gates, then by Theorem 3.12 it can
be computed by a NAND circuit of at most 3𝑠 gates.

The results we have seen in this chapter can be phrased as showing


that ADD𝑛 ∈ SIZE2𝑛,𝑛+1 (100𝑛) and MULT𝑛 ∈ SIZE2𝑛,2𝑛 (10000𝑛log2 3 ).
Theorem 4.12 shows that for some constant 𝑐, SIZE𝑛,𝑚 (𝑐𝑚2𝑛 ) is equal
to the set of all functions from {0, 1}𝑛 to {0, 1}𝑚 .

Figure 4.13: A “category error” is a question such as “is a cucumber even or odd?” which does not even make sense. In this book one type of category error you should watch out for is confusing functions and programs (i.e., confusing specifications and implementations). If 𝐶 is a circuit or program, then asking if 𝐶 ∈ SIZE𝑛,1 (𝑠) is a category error, since SIZE𝑛,1 (𝑠) is a set of functions and not programs or circuits.

R
Remark 4.20 — Finite vs infinite functions. Unlike pro-
gramming languages such as Python, C or JavaScript,
the NAND-CIRC and AON-CIRC programming lan-
guages do not have arrays. A NAND-CIRC program
𝑃 has some fixed number 𝑛 and 𝑚 of inputs and out-
put variables. Hence, for example, there is no single
NAND-CIRC program that can compute the incre-
ment function INC ∶ {0, 1}∗ → {0, 1}∗ that maps a
string 𝑥 (which we identify with a number via the
binary representation) to the string that represents
𝑥 + 1. Rather for every 𝑛 > 0, there is a NAND-CIRC
program 𝑃𝑛 that computes the restriction INC𝑛 of
the function INC to inputs of length 𝑛. Since it can be
shown that for every 𝑛 > 0 such a program 𝑃𝑛 exists
of length at most 10𝑛, INC𝑛 ∈ SIZE𝑛,𝑛+1 (10𝑛) for
every 𝑛 > 0.
For the time being, our focus will be on finite func-
tions, but we will discuss how to extend the definition
of size complexity to functions with unbounded input
lengths later on in Section 13.6.

Solved Exercise 4.1 — SIZE closed under complement. In this exercise we


prove a certain “closure property” of the class SIZE𝑛 (𝑠). That is, we
show that if 𝑓 is in this class then (up to some small additive term) so
is the complement of 𝑓, which is the function 𝑔(𝑥) = 1 − 𝑓(𝑥).
Prove that there is a constant 𝑐 such that for every 𝑓 ∶ {0, 1}𝑛 →
{0, 1} and 𝑠 ∈ ℕ, if 𝑓 ∈ SIZE𝑛 (𝑠) then 1 − 𝑓 ∈ SIZE𝑛 (𝑠 + 𝑐).

Solution:
If 𝑓 ∈ SIZE𝑛 (𝑠) then there is an 𝑠-line NAND-CIRC program
𝑃 that computes 𝑓. We can rename the variable Y[0] in 𝑃 to a
variable temp and add the line

Y[0] = NAND(temp,temp)

at the very end to obtain a program 𝑃 ′ that computes 1 − 𝑓.
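The transformation described in this solution is literally a two-step text edit. Here is a sketch in Python (the function name is ours; we use temp_out rather than temp to reduce the chance of clashing with an existing variable):

def complement_program(P):
    """Given the text of a NAND-CIRC program P computing f, return the
    text of a program computing 1 - f."""
    renamed = P.replace("Y[0]", "temp_out")
    return renamed + "\nY[0] = NAND(temp_out,temp_out)"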



✓ Chapter Recap

• We can define the notion of computing a function


via a simplified “programming language”, where
computing a function 𝐹 in 𝑇 steps would corre-
spond to having a 𝑇 -line NAND-CIRC program
that computes 𝐹 .
• While the NAND-CIRC programming only has one
operation, other operations such as functions and
conditional execution can be implemented using it.
• Every function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 can be com-
puted by a circuit of at most 𝑂(𝑚2𝑛 ) gates (and in
fact at most 𝑂(𝑚2𝑛 /𝑛) gates).
• Sometimes (or maybe always?) we can translate an
efficient algorithm to compute 𝑓 into a circuit that
computes 𝑓 with a number of gates comparable to
the number of steps in this algorithm.

4.7 EXERCISES
Exercise 4.1 — Pairing. This exercise asks you to give a one-to-one map
from ℕ2 to ℕ. This can be useful to implement two-dimensional arrays
as “syntactic sugar” in programming languages that only have one-
dimensional arrays.

1. Prove that the map 𝐹 (𝑥, 𝑦) = 2𝑥 3𝑦 is a one-to-one map from ℕ2 to


ℕ.

2. Show that there is a one-to-one map 𝐹 ∶ ℕ2 → ℕ such that for every


𝑥, 𝑦, 𝐹 (𝑥, 𝑦) ≤ 100 ⋅ max{𝑥, 𝑦}2 + 100.

3. For every 𝑘, show that there is a one-to-one map 𝐹 ∶ ℕ𝑘 → ℕ such


that for every 𝑥0 , … , 𝑥𝑘−1 ∈ ℕ, 𝐹 (𝑥0 , … , 𝑥𝑘−1 ) ≤ 100 ⋅ (𝑥0 + 𝑥1 + … +
𝑥𝑘−1 + 100𝑘)𝑘 .

Exercise 4.2 — Computing MUX. Prove that the NAND-CIRC program be-


low computes the function MUX (or LOOKUP1 ) where MUX(𝑎, 𝑏, 𝑐)
equals 𝑎 if 𝑐 = 0 and equals 𝑏 if 𝑐 = 1:

t = NAND(X[2],X[2])
u = NAND(X[0],t)
v = NAND(X[1],X[2])
Y[0] = NAND(u,v)



Exercise 4.3 — At least two / Majority. Give a NAND-CIRC program of at
most 6 lines to compute the function MAJ ∶ {0, 1}3 → {0, 1} where
MAJ(𝑎, 𝑏, 𝑐) = 1 iff 𝑎 + 𝑏 + 𝑐 ≥ 2.

Exercise 4.4 — Conditional statements. In this exercise we will explore The-
orem 4.6: transforming NAND-CIRC-IF programs that use code such
as if .. then .. else .. to standard NAND-CIRC programs.

1. Give a “proof by code” of Theorem 4.6: a program in a program-


ming language of your choice that transforms a NAND-CIRC-IF
program 𝑃 into a “sugar-free” NAND-CIRC program 𝑃 ′ that com-
putes the same function. See footnote for hint.4

4 You can start by transforming 𝑃 into a NAND-CIRC-PROC program that uses procedure statements, and then use the code of Fig. 4.3 to transform the latter into a “sugar-free” NAND-CIRC program.

2. Prove the following statement, which is the heart of Theorem 4.6:

suppose that there exists an 𝑠-line NAND-CIRC program to com-


pute 𝑓 ∶ {0, 1}𝑛 → {0, 1} and an 𝑠′ -line NAND-CIRC program
to compute 𝑔 ∶ {0, 1}𝑛 → {0, 1}. Prove that there exist a NAND-
CIRC program of at most 𝑠 + 𝑠′ + 10 lines to compute the func-
tion ℎ ∶ {0, 1}𝑛+1 → {0, 1} where ℎ(𝑥0 , … , 𝑥𝑛−1 , 𝑥𝑛 ) equals
𝑓(𝑥0 , … , 𝑥𝑛−1 ) if 𝑥𝑛 = 0 and equals 𝑔(𝑥0 , … , 𝑥𝑛−1 ) otherwise.
(All programs in this item are standard “sugar-free” NAND-CIRC
programs.)

Exercise 4.5 — Half and full adders.

1. A half adder is the function HA ∶ {0, 1}2 → {0, 1}2 that corresponds to adding two binary bits. That
is, for every 𝑎, 𝑏 ∈ {0, 1}, HA(𝑎, 𝑏) = (𝑒, 𝑓) where 2𝑒 + 𝑓 = 𝑎 + 𝑏.
Prove that there is a NAND circuit of at most five NAND gates that
computes HA.

2. A full adder is the function FA ∶ {0, 1}3 → {0, 1}2 that takes in
two bits and a “carry” bit and outputs their sum. That is, for every
𝑎, 𝑏, 𝑐 ∈ {0, 1}, FA(𝑎, 𝑏, 𝑐) = (𝑒, 𝑓) such that 2𝑒 + 𝑓 = 𝑎 + 𝑏 + 𝑐.
Prove that there is a NAND circuit of at most nine NAND gates that
computes FA.

3. Prove that if there is a NAND circuit of 𝑐 gates that computes FA,


then there is a circuit of 𝑐𝑛 gates that computes ADD𝑛 where (as
in Theorem 4.7) ADD𝑛 ∶ {0, 1}2𝑛 → {0, 1}𝑛+1 is the function that
outputs the addition of two input 𝑛-bit numbers. See footnote for
hint.5

5 Use a “cascade” of adding the bits one after the other, starting with the least significant digit, just like in the elementary-school algorithm.

4. Show that for every 𝑛 there is a NAND-CIRC program to compute


ADD𝑛 with at most 9𝑛 lines.


Exercise 4.6 — Addition. Write a program using your favorite program-
ming language that on input of an integer 𝑛, outputs a NAND-CIRC
program that computes ADD𝑛 . Can you ensure that the program it
outputs for ADD𝑛 has fewer than 10𝑛 lines?

Exercise 4.7 — Multiplication. Write a program using your favorite pro-


gramming language that on input of an integer 𝑛, outputs a NAND-
CIRC program that computes MULT𝑛 . Can you ensure that the pro-
gram it outputs for MULT𝑛 has fewer than 1000 ⋅ 𝑛2 lines?

Exercise 4.8 — Efficient multiplication (challenge). Write a program using
your favorite programming language that on input of an integer 𝑛,
outputs a NAND-CIRC program that computes MULT𝑛 and has at
most 10000𝑛^{1.9} lines.6 What is the smallest number of lines you can
use to multiply two 2048 bit numbers?

6 Hint: Use Karatsuba’s algorithm.

Exercise 4.9 — Multibit function. In the text Theorem 4.12 is only proven
for the case 𝑚 = 1. In this exercise you will extend the proof for every
𝑚.
Prove that

1. If there is an 𝑠-line NAND-CIRC program to compute


𝑓 ∶ {0, 1}𝑛 → {0, 1} and an 𝑠′ -line NAND-CIRC program
to compute 𝑓 ′ ∶ {0, 1}𝑛 → {0, 1} then there is an 𝑠 + 𝑠′ -line
program to compute the function 𝑔 ∶ {0, 1}𝑛 → {0, 1}2 such that
𝑔(𝑥) = (𝑓(𝑥), 𝑓 ′ (𝑥)).

2. For every function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a NAND-CIRC


program of at most 10𝑚 ⋅ 2𝑛 lines that computes 𝑓. (You can use the
𝑚 = 1 case of Theorem 4.12, as well as Item 1.)

Exercise 4.10 — Simplifying using syntactic sugar. Let 𝑃 be the following


NAND-CIRC program:

Temp[0] = NAND(X[0],X[0])
Temp[1] = NAND(X[1],X[1])
Temp[2] = NAND(Temp[0],Temp[1])
Temp[3] = NAND(X[2],X[2])
Temp[4] = NAND(X[3],X[3])
Temp[5] = NAND(Temp[3],Temp[4])
Temp[6] = NAND(Temp[2],Temp[2])
Temp[7] = NAND(Temp[5],Temp[5])
Y[0] = NAND(Temp[6],Temp[7])

1. Write a program 𝑃 ′ with at most three lines of code that uses both
NAND as well as the syntactic sugar OR that computes the same func-
tion as 𝑃 .

2. Draw a circuit that computes the same function as 𝑃 and uses only
AND and NOT gates.

In the following exercises you are asked to compare the power of


pairs of programming languages. By “comparing the power” of two
programming languages 𝑋 and 𝑌 we mean determining the relation
between the set of functions that are computable using programs in 𝑋
and 𝑌 respectively. That is, to answer such a question you need to do
both of the following:

1. Either prove that for every program 𝑃 in 𝑋 there is a program 𝑃 ′


in 𝑌 that computes the same function as 𝑃 , or give an example for
a function that is computable by an 𝑋-program but not computable
by a 𝑌 -program.

and

2. Either prove that for every program 𝑃 in 𝑌 there is a program 𝑃 ′


in 𝑋 that computes the same function as 𝑃 , or give an example for a
function that is computable by a 𝑌 -program but not computable by
an 𝑋-program.

When you give an example as above of a function that is com-


putable in one programming language but not the other, you need
to prove that the function you showed is (1) computable in the first
programming language and (2) not computable in the second program-
ming language.
Exercise 4.11 — Compare IF and NAND. Let IF-CIRC be the programming
language where we have the following operations foo = 0, foo = 1,
foo = IF(cond,yes,no) (that is, we can use the constants 0 and 1,
and the IF ∶ {0, 1}3 → {0, 1} function such that IF(𝑎, 𝑏, 𝑐) equals 𝑏 if
𝑎 = 1 and equals 𝑐 if 𝑎 = 0). Compare the power of the NAND-CIRC
programming language and the IF-CIRC programming language.

Exercise 4.12 — Compare XOR and NAND. Let XOR-CIRC be the pro-
gramming language where we have the following operations foo
= XOR(bar,blah), foo = 1 and bar = 0 (that is, we can use the
constants 0, 1 and the XOR function that maps 𝑎, 𝑏 ∈ {0, 1}2 to 𝑎 + 𝑏
mod 2). Compare the power of the NAND-CIRC programming
language and the XOR-CIRC programming language. See footnote for
hint.7

Exercise 4.13 — Circuits for majority. Prove that there is some constant 𝑐
such that for every 𝑛 > 1, MAJ𝑛 ∈ SIZE𝑛 (𝑐𝑛) where MAJ𝑛 ∶ {0, 1}𝑛 →
{0, 1} is the majority function on 𝑛 input bits. That is, MAJ𝑛 (𝑥) = 1 iff
∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 > 𝑛/2. See footnote for hint.8

8 One approach to solve this is using recursion and the so-called Master Theorem.

Exercise 4.14 — Circuits for threshold. Prove that there is some constant 𝑐
such that for every 𝑛 > 1, and integers 𝑎_0 , … , 𝑎_{𝑛−1} , 𝑏 ∈ {−2^𝑛 , −2^𝑛 +
1, … , −1, 0, +1, … , 2^𝑛 }, there is a NAND circuit with at most 𝑛^𝑐 gates
that computes the threshold function 𝑓_{𝑎_0 ,…,𝑎_{𝑛−1} ,𝑏} ∶ {0, 1}𝑛 → {0, 1} that
on input 𝑥 ∈ {0, 1}𝑛 outputs 1 if and only if ∑_{𝑖=0}^{𝑛−1} 𝑎𝑖 𝑥𝑖 > 𝑏.

4.8 BIBLIOGRAPHICAL NOTES


See Jukna’s and Wegener’s books [Juk12; Weg87] for much more
extensive discussion on circuits. Shannon showed that every Boolean
function can be computed by a circuit of exponential size [Sha38]. The
improved bound of 𝑐 ⋅ 2𝑛 /𝑛 (with the optimal value of 𝑐 for many
bases) is due to Lupanov [Lup58]. An exposition of this for the case
of NAND (where 𝑐 = 1) is given in Chapter 4 of his book [Lup84].
(Thanks to Sasha Golovnev for tracking down this reference!)
The concept of “syntactic sugar” is also known as “macros” or
“meta-programming” and is sometimes implemented via a prepro-
cessor or macro language in a programming language or a text editor.
One modern example is the Babel JavaScript syntax transformer, that
converts JavaScript programs written using the latest features into
a format that older Browsers can accept. It even has a plug-in ar-
chitecture, that allows users to add their own syntactic sugar to the
language.
Learning Objectives:
• See one of the most important concepts in computing: duality between code and data.
• Build up comfort in moving between different representations of programs.
• Follow the construction of a “universal circuit evaluator” that can evaluate other circuits given their representation.
• See major result that complements the result of the last chapter: some functions require an exponential number of gates to compute.
• Discussion of Physical extended Church-Turing thesis stating that Boolean circuits capture all feasible computation in the physical world, and its physical and philosophical implications.

5
Code as data, data as code

“The term code script is, of course, too narrow. The chromosomal structures
are at the same time instrumental in bringing about the development they
foreshadow. They are law-code and executive power - or, to use another simile,
they are architect’s plan and builder’s craft - in one.” , Erwin Schrödinger,
1944.

“A mathematician would hardly call a correspondence between the set of 64


triples of four units and a set of twenty other units,”universal“, while such
correspondence is, probably, the most fundamental general feature of life on
Earth”, Misha Gromov, 2013

A program is simply a sequence of symbols, each of which can be


encoded as a string of 0’s and 1’s using (for example) the ASCII stan-
dard. Therefore we can represent every NAND-CIRC program (and
hence also every Boolean circuit) as a binary string. This statement
seems obvious but it is actually quite profound. It means that we can
treat circuits or NAND-CIRC programs both as instructions to car-
rying computation and also as data that could potentially be used as
inputs to other computations.

 Big Idea 6 A program is a piece of text, and so it can be fed as input


to other programs.

This correspondence between code and data is one of the most fun-
damental aspects of computing. It underlies the notion of general
purpose computers, that are not pre-wired to compute only one task,
and also forms the basis of our hope for obtaining general artificial
intelligence. This concept finds immense use in all areas of comput-
ing, from scripting languages to machine learning, but it is fair to say
that we haven’t yet fully mastered it. Many security exploits involve
cases such as “buffer overflows” when attackers manage to inject code
where the system expected only “passive” data (see Fig. 5.1). The re-
lation between code and data reaches beyond the realm of electronic




computers. For example, DNA can be thought of as both a program and data (in the words of Schrödinger, who, before the discovery of DNA's structure, wrote a book that inspired Watson and Crick, DNA is both “architect's plan and builder's craft”).

Figure 5.1: As illustrated in this xkcd cartoon, many exploits, including buffer overflow, SQL injections, and more, utilize the blurry line between “active programs” and “static strings”.

This chapter: A non-mathy overview

In this chapter, we will begin to explore some of the many applications of the correspondence between code and data. We start by using the representation of programs/circuits as strings to count the number of programs/circuits up to a certain size, and use that to obtain a counterpart to the result we proved in Chapter 4. There we proved that every function can be computed by a circuit, but that circuit could be exponentially large (see Theorem 4.16 for the precise bound). In
this chapter we will prove that there are some functions for
which we cannot do better: the smallest circuit that computes
them is exponentially large.
We will also use the notion of representing programs/cir-
cuits as strings to show the existence of a “universal circuit”
- a circuit that can evaluate other circuits. In programming
languages, this is known as a “meta circular evaluator” - a
program in a certain programming language that can eval-
uate other programs in the same language. These results
do have an important restriction: the universal circuit will
have to be of bigger size than the circuits it evaluates. We will
show how to get around this restriction in Chapter 7 where
we introduce loops and Turing machines.
See Fig. 5.2 for an overview of the results of this chapter.

Figure 5.2: Overview of the results in this chapter.


We use the representation of programs/circuits as
strings to derive two main results. First we show
the existence of a universal program/circuit, and
in fact (with more work) the existence of such a
program/circuit whose size is at most polynomial in
the size of the program/circuit it evaluates. We then
use the string representation to count the number
of programs/circuits of a given size, and use that to
establish that some functions require an exponential
number of lines/gates to compute.

5.1 REPRESENTING PROGRAMS AS STRINGS


We can represent programs or circuits as strings in a myriad of ways.
For example, since Boolean circuits are labeled directed acyclic graphs,
we can use the adjacency matrix or adjacency list representations for
them. However, since the code of a program is ultimately just a se-
quence of letters and symbols, arguably the conceptually simplest
representation of a program is as such a sequence. For example, the
following NAND-CIRC program 𝑃

temp_0 = NAND(X[0],X[1])
temp_1 = NAND(X[0],temp_0)
temp_2 = NAND(X[1],temp_0)
Y[0] = NAND(temp_1,temp_2)

is simply a string of 107 symbols which include lower and upper


case letters, digits, the underscore character _ and equality sign =,
punctuation marks such as “(”,“)”,“,”, spaces, and “new line” mark-
ers (often denoted as “\n” or “↵”). Each such symbol can be encoded
as a string of 7 bits using the ASCII encoding, and hence the program
𝑃 can be encoded as a string of length 7 ⋅ 107 = 749 bits.
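If you want to convince yourself of this count, the following short Python snippet (purely illustrative; the exact total shifts by a few symbols depending on how the new line markers are counted) computes the number of symbols in 𝑃 and the length of its 7-bit ASCII encoding:

P = """temp_0 = NAND(X[0],X[1])
temp_1 = NAND(X[0],temp_0)
temp_2 = NAND(X[1],temp_0)
Y[0] = NAND(temp_1,temp_2)"""

# number of symbols in the program, and bits in its 7-bit ASCII encoding
print(len(P), 7 * len(P))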
Nothing in the above discussion was specific to the program 𝑃 , and
hence we can use the same reasoning to prove that every NAND-CIRC
program can be represented as a string in {0, 1}∗ . In fact, we can do a
bit better. Since the names of the working variables of a NAND-CIRC
program do not affect its functionality, we can always transform a pro-
gram to have the form of 𝑃 ′ where all variables apart from the inputs
and outputs have the form temp_0, temp_1, temp_2, etc.. Moreover,
if the program has 𝑠 lines, then we will never need to use an index
larger than 3𝑠 (since each line involves at most three variables), and
similarly the indices of the input and output variables will all be at
most 3𝑠. Since a number between 0 and 3𝑠 can be expressed using
at most ⌈log10 (3𝑠 + 1)⌉ = 𝑂(log 𝑠) digits, each line in the program
(which has the form foo = NAND(bar,blah)), can be represented
using 𝑂(1) + 𝑂(log 𝑠) = 𝑂(log 𝑠) symbols, each of which can be rep-
resented by 7 bits. Hence an 𝑠 line program can be represented as a
string of 𝑂(𝑠 log 𝑠) bits, resulting in the following theorem:

Theorem 5.1 — Representing programs as strings. There is a constant 𝑐 such that for 𝑓 ∈ SIZE(𝑠), there exists a program 𝑃 computing 𝑓 whose string representation has length at most 𝑐𝑠 log 𝑠.

P
We omit the formal proof of Theorem 5.1 but please make sure that you understand why it follows from the reasoning above.

5.2 COUNTING PROGRAMS, AND LOWER BOUNDS ON THE SIZE


OF NAND-CIRC PROGRAMS
One consequence of the representation of programs as strings is that
the number of programs of certain length is bounded by the number
of strings that represent them. This has consequences for the sets
SIZE𝑛,𝑚 (𝑠) that we defined in Section 4.6.

Theorem 5.2 — Counting programs. For every 𝑠, 𝑛, 𝑚 ∈ ℕ,

|SIZE𝑛,𝑚(𝑠)| ≤ 2^{𝑂(𝑠 log 𝑠)} .

That is, there are at most 2^{𝑂(𝑠 log 𝑠)} functions computed by NAND-CIRC programs of at most 𝑠 lines.¹

¹ The implicit constant in the 𝑂(⋅) notation is smaller than 10. That is, for all sufficiently large 𝑠, |SIZE𝑛,𝑚(𝑠)| < 2^{10𝑠 log 𝑠}, see Remark 5.4. As discussed in Section 1.7, we use the bound 10 simply because it is a round number.

Proof. For any 𝑛, 𝑚 ∈ ℕ, we will show a one-to-one map 𝐸 from SIZE𝑛,𝑚(𝑠) to the set of strings of length 𝑐𝑠 log 𝑠 for some constant 𝑐. This will conclude the proof, since it implies that |SIZE𝑛,𝑚(𝑠)| is smaller than the size of the set of all strings of length at most ℓ = 𝑐𝑠 log 𝑠. The size of the latter set is 1 + 2 + 4 + ⋯ + 2^ℓ = 2^{ℓ+1} − 1 by the formula for sums of geometric progressions.
The map 𝐸 will simply map 𝑓 to the representation of the smallest
program computing 𝑓. Since 𝑓 ∈ SIZE𝑛,𝑚 (𝑠), there is a program 𝑃
of at most 𝑠 lines that can be represented using a string of length at
most 𝑐𝑠 log 𝑠 by Theorem 5.1. Moreover, the map 𝑓 ↦ 𝐸(𝑓) is one to
one, since for every distinct 𝑓, 𝑓 ′ ∶ {0, 1}𝑛 → {0, 1}𝑚 there must exist
some input 𝑥 ∈ {0, 1}𝑛 on which 𝑓(𝑥) ≠ 𝑓 ′ (𝑥). This means that the
programs that compute 𝑓 and 𝑓 ′ respectively cannot be identical.

Theorem 5.2 has an important corollary. The number of func-


tions that can be computed using small circuits/programs is much
smaller than the total number of functions, and hence there ex-
ist functions that require very large (in fact exponentially large) cir-
cuits to compute. To see why this is the case, note that a function
mapping {0, 1}2 to {0, 1} can be identified with the list of its four
values on the inputs 00, 01, 10, 11. A function mapping {0, 1}3 to
{0, 1} can be identified with the list of its eight values on the inputs
000, 001, 010, 011, 100, 101, 110, 111. More generally, every function
𝐹 ∶ {0, 1}𝑛 → {0, 1} can be identified with the list of its 2𝑛 values on

the inputs {0, 1}^𝑛. Hence the number of functions mapping {0, 1}^𝑛 to {0, 1} is equal to the number of possible 2^𝑛-length lists of values, which is exactly 2^{2^𝑛}. Note that this is double exponential in 𝑛, and hence even for small values of 𝑛 (e.g., 𝑛 = 10) the number of functions from {0, 1}^𝑛 to {0, 1} is truly astronomical.² As mentioned, this yields the following corollary:

² “Astronomical” here is an understatement: there are much fewer than 2^{2^{10}} stars, or even particles, in the observable universe.

Theorem 5.3 — Counting argument lower bound. There is a constant 𝛿 > 0, such that for every sufficiently large 𝑛, there is a function 𝑓 ∶ {0, 1}^𝑛 → {0, 1} such that 𝑓 ∉ SIZE𝑛(𝛿2^𝑛/𝑛). That is, the shortest NAND-CIRC program to compute 𝑓 requires more than 𝛿 ⋅ 2^𝑛/𝑛 lines.³

³ The constant 𝛿 is at least 0.1 and in fact, can be improved to be arbitrarily close to 1/2, see Exercise 5.7.
Proof. The proof is simple. If we let 𝑐 be the constant such that |SIZE𝑛(𝑠)| ≤ 2^{𝑐𝑠 log 𝑠} and 𝛿 = 1/𝑐, then setting 𝑠 = 𝛿2^𝑛/𝑛 we see that

|SIZE𝑛(𝛿2^𝑛/𝑛)| ≤ 2^{𝑐 (𝛿2^𝑛/𝑛) log 𝑠} < 2^{𝑐𝛿2^𝑛} = 2^{2^𝑛}

using the fact that since 𝑠 < 2^𝑛, log 𝑠 < 𝑛 and 𝛿 = 1/𝑐. But since
|SIZE𝑛 (𝑠)| is smaller than the total number of functions mapping 𝑛
bits to 1 bit, there must be at least one such function not in SIZE𝑛 (𝑠),
which is what we needed to prove.

We have seen before that every function mapping {0, 1}^𝑛 to {0, 1} can be computed by an 𝑂(2^𝑛/𝑛) line program. Theorem 5.3 shows that this is tight in the sense that some functions do require such an astronomical number of lines to compute.

 Big Idea 7 Some functions 𝑓 ∶ {0, 1}𝑛 → {0, 1} cannot be computed


by a Boolean circuit using fewer than exponential (in 𝑛) number of
gates.

In fact, as we explore in the exercises, this is the case for most func-
tions. Hence functions that can be computed in a small number of
lines (such as addition, multiplication, finding short paths in graphs,
or even the EVAL function) are the exception, rather than the rule.
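The gap between these two counts can be made concrete with a few lines of Python. The sketch below compares exponents only, and uses 𝑐 = 10 as an illustrative constant for the bound of Theorem 5.2 (this particular value is an assumption made here for the example, following Remark 5.4):

from math import log2

n = 10
s = int(0.1 * 2**n / n)        # circuit-size budget of about 0.1 * 2^n / n lines
c = 10                         # illustrative constant for the bound of Theorem 5.2

log_all_functions = 2**n               # log2 of 2^(2^n), the number of all f: {0,1}^n -> {0,1}
log_small_circuits = c * s * log2(s)   # log2 of the bound 2^(c*s*log s) on |SIZE_n(s)|

print(log_all_functions, round(log_small_circuits))  # 1024 vs. roughly 332
print(log_all_functions > log_small_circuits)        # True: most functions are outside SIZE_n(s)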

R
Remark 5.4 — More efficient representation (advanced,
optional). The ASCII representation is not the shortest
representation for NAND-CIRC programs. NAND-
CIRC programs are equivalent to circuits with NAND
gates, which means that a NAND-CIRC program of 𝑠
lines, 𝑛 inputs, and 𝑚 outputs can be represented by
a labeled directed graph of 𝑠 + 𝑛 vertices, of which 𝑛

have in-degree zero, and the 𝑠 others have in-degree


at most two. Using the adjacency matrix represen-
tation for such graphs, we can reduce the implicit
constant in Theorem 5.2 to be arbitrarily close to 5, see
Exercise 5.6.

5.2.1 Size hierarchy theorem (optional)


By Theorem 4.15 the class SIZE𝑛(10 ⋅ 2^𝑛/𝑛) contains all functions from {0, 1}^𝑛 to {0, 1}, while by Theorem 5.3, there is some function 𝑓 ∶ {0, 1}^𝑛 → {0, 1} that is not contained in SIZE𝑛(0.1 ⋅ 2^𝑛/𝑛). In other words, for every sufficiently large 𝑛,

SIZE𝑛(0.1 ⋅ 2^𝑛/𝑛) ⊊ SIZE𝑛(10 ⋅ 2^𝑛/𝑛) .

It turns out that we can use Theorem 5.3 to show a more general re-
sult: whenever we increase our “budget” of gates we can compute
new functions.

Theorem 5.5 — Size Hierarchy Theorem. For every sufficiently large 𝑛


and 10𝑛 < 𝑠 < 0.1 ⋅ 2𝑛 /𝑛,

SIZE𝑛 (𝑠) ⊊ SIZE𝑛 (𝑠 + 10𝑛) .

Proof Idea:
To prove the theorem we need to find a function 𝑓 ∶ {0, 1}𝑛 → {0, 1}
such that 𝑓 can be computed by a circuit of 𝑠 + 10𝑛 gates but it cannot
be computed by a circuit of 𝑠 gates. We will do so by coming up with
a sequence of functions 𝑓0 , 𝑓1 , 𝑓2 , … , 𝑓𝑁 with the following properties:
(1) 𝑓0 can be computed by a circuit of at most 10𝑛 gates, (2) 𝑓𝑁 cannot
be computed by a circuit of 0.1 ⋅ 2𝑛 /𝑛 gates, and (3) for every 𝑖 ∈
{0, … , 𝑁 }, if 𝑓𝑖 can be computed by a circuit of size 𝑠, then 𝑓𝑖+1 can be
computed by a circuit of size at most 𝑠+10𝑛. Together these properties
imply that if we let 𝑖 be the smallest number such that 𝑓𝑖 ∉ SIZE𝑛 (𝑠),
then since 𝑓𝑖−1 ∈ SIZE𝑛 (𝑠) it must hold that 𝑓𝑖 ∈ SIZE𝑛 (𝑠 + 10𝑛) which
is what we need to prove. See Fig. 5.4 for an illustration.

Proof of Theorem 5.5. Let 𝑓* ∶ {0, 1}^𝑛 → {0, 1} be the function (whose existence we are guaranteed by Theorem 5.3) such that 𝑓* ∉ SIZE𝑛(0.1 ⋅ 2^𝑛/𝑛). We define the functions 𝑓_0, 𝑓_1, … , 𝑓_{2^𝑛} mapping {0, 1}^𝑛 to {0, 1} as follows. For every 𝑥 ∈ {0, 1}^𝑛, if 𝑙𝑒𝑥(𝑥) ∈ {0, 1, … , 2^𝑛 − 1} is 𝑥's order in the lexicographical order then

𝑓_𝑖(𝑥) = 𝑓*(𝑥) if 𝑙𝑒𝑥(𝑥) < 𝑖, and 𝑓_𝑖(𝑥) = 0 otherwise.

Figure 5.4: We prove Theorem 5.5 by coming up with a list 𝑓_0, … , 𝑓_{2^𝑛} of functions such that 𝑓_0 is the all zero function, 𝑓_{2^𝑛} is a function (obtained from Theorem 5.3) outside of SIZE𝑛(0.1 ⋅ 2^𝑛/𝑛), and such that 𝑓_{𝑖−1} and 𝑓_𝑖 differ from one another on at most one input. We can show that for every 𝑖, the number of gates to compute 𝑓_𝑖 is at most 10𝑛 larger than the number of gates to compute 𝑓_{𝑖−1}, and so if we let 𝑖 be the smallest number such that 𝑓_𝑖 ∉ SIZE𝑛(𝑠), then 𝑓_𝑖 ∈ SIZE𝑛(𝑠 + 10𝑛).

The function 𝑓_0 is simply the constant zero function, while the function 𝑓_{2^𝑛} is equal to 𝑓*. Moreover, for every 𝑖 ∈ [2^𝑛], the functions 𝑓_𝑖 and 𝑓_{𝑖+1} differ on at most one input (i.e., the input 𝑥 ∈ {0, 1}^𝑛 such that 𝑙𝑒𝑥(𝑥) = 𝑖). Let 10𝑛 < 𝑠 < 0.1 ⋅ 2^𝑛/𝑛, and let 𝑖 be the first index such that 𝑓_𝑖 ∉ SIZE𝑛(𝑠). Since 𝑓_{2^𝑛} = 𝑓* ∉ SIZE𝑛(0.1 ⋅ 2^𝑛/𝑛) there must exist such an index 𝑖, and moreover 𝑖 > 0 since the constant zero function is a member of SIZE𝑛(10𝑛).
By our choice of 𝑖, 𝑓𝑖−1 is a member of SIZE𝑛 (𝑠). To complete the
proof, we need to show that 𝑓𝑖 ∈ SIZE𝑛 (𝑠 + 10𝑛). Let 𝑥∗ be the string
such that 𝑙𝑒𝑥(𝑥∗ ) = 𝑖 and let 𝑏 ∈ {0, 1} be the value of 𝑓 ∗ (𝑥∗ ). Then we
can define 𝑓𝑖 also as follows


𝑓_𝑖(𝑥) = 𝑏 if 𝑥 = 𝑥*, and 𝑓_𝑖(𝑥) = 𝑓_{𝑖−1}(𝑥) if 𝑥 ≠ 𝑥*

or in other words

𝑓𝑖 (𝑥) = IF(EQUAL(𝑥∗ , 𝑥), 𝑏, 𝑓𝑖−1 (𝑥))

where EQUAL ∶ {0, 1}2𝑛 → {0, 1} is the function that maps 𝑥, 𝑥′ ∈


{0, 1}𝑛 to 1 if they are equal and to 0 otherwise. Since (by our choice
of 𝑖), 𝑓𝑖−1 can be computed using at most 𝑠 gates and (as can be easily
verified) that EQUAL ∈ SIZE𝑛 (9𝑛), we can compute 𝑓𝑖 using at most
𝑠 + 9𝑛 + 𝑂(1) ≤ 𝑠 + 10𝑛 gates which is what we wanted to prove.
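To make the "patching" step in the proof concrete, here is a small Python sketch (an illustration written for this note, not part of the formal proof; f_prev, xstar and b are hypothetical stand-ins for 𝑓_{𝑖−1}, 𝑥* and 𝑏) of how 𝑓_𝑖 is obtained from 𝑓_{𝑖−1} using the EQUAL/IF composition above:

def make_next(f_prev, xstar, b):
    # Return f_i with f_i(x) = b if x == xstar, and f_prev(x) otherwise.
    # The circuit analogue uses EQUAL (about 9n gates) and IF (O(1) gates),
    # so f_i costs at most roughly 10n more gates than f_prev.
    def f_i(x):
        equal = int(x == xstar)            # plays the role of EQUAL(x*, x)
        return b if equal else f_prev(x)   # plays the role of IF(equal, b, f_prev(x))
    return f_i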

Figure 5.5: An illustration of some of what we know


about the size complexity classes (not to scale!). This
figure depicts classes of the form SIZE𝑛,𝑛 (𝑠) but the
state of affairs for other size complexity classes such
as SIZE𝑛,1 (𝑠) is similar. We know by Theorem 4.12
(with the improvement of Section 4.4.2) that all
functions mapping 𝑛 bits to 𝑛 bits can be computed
by a circuit of size 𝑐 ⋅ 2𝑛 for 𝑐 ≤ 10, while on the
other hand the counting lower bound (Theorem 5.3,
see also Exercise 5.4) shows that some such functions
will require 0.1 ⋅ 2𝑛 , and the size hierarchy theorem
(Theorem 5.5) shows the existence of functions in
SIZE𝑛 (𝑆) ⧵ SIZE𝑛 (𝑠) whenever 𝑠 = 𝑜(𝑆), see also
Exercise 5.5. We also consider some specific examples:
addition of two 𝑛/2 bit numbers can be done in 𝑂(𝑛)
lines, while we don’t know of such a program for
multiplying two 𝑛 bit numbers, though we do know
it can be done in 𝑂(𝑛²) and in fact even better size.
In the above, FACTOR𝑛 corresponds to the inverse
problem of multiplying- finding the prime factorization
of a given number. At the moment we do not know of any circuit with a polynomial (or even sub-exponential) number of lines that can compute FACTOR𝑛.

R
Remark 5.6 — Explicit functions. While the size hierar-
chy theorem guarantees that there exists some function
that can be computed using, for example, 𝑛² gates, but not using 100𝑛 gates, we do not know of any explicit


example of such a function. While we suspect that
integer multiplication is such an example, we do not
have any proof that this is the case.

5.3 THE TUPLES REPRESENTATION


ASCII is a fine representation of programs, but for some applications
it is useful to have a more concrete representation of NAND-CIRC
programs. In this section we describe a particular choice, that will
be convenient for us later on. A NAND-CIRC program is simply a
sequence of lines of the form

blah = NAND(baz,boo)

There is of course nothing special about the particular names we


use for variables. Although they would be harder to read, we could
write all our programs using only working variables such as temp_0,
temp_1 etc. Therefore, our representation for NAND-CIRC programs
ignores the actual names of the variables, and just associates a number
with each variable. We encode a line of the program as a triple of
numbers. If the line has the form foo = NAND(bar,blah) then we
encode it with the triple (𝑖, 𝑗, 𝑘) where 𝑖 is the number corresponding
to the variable foo and 𝑗 and 𝑘 are the numbers corresponding to bar
and blah respectively.
More concretely, we will associate every variable with a number
in the set [𝑡] = {0, 1, … , 𝑡 − 1}. The first 𝑛 numbers {0, … , 𝑛 − 1}
correspond to the input variables, the last 𝑚 numbers {𝑡 − 𝑚, … , 𝑡 − 1}
correspond to the output variables, and the intermediate numbers
{𝑛, … , 𝑡 − 𝑚 − 1} correspond to the remaining “workspace” variables.
Formally, we define our representation as follows:

Definition 5.7 — List of tuples representation. Let 𝑃 be a NAND-CIRC program of 𝑛 inputs, 𝑚 outputs, and 𝑠 lines, and let 𝑡 be the number of distinct variables used by 𝑃. The list of tuples representation of 𝑃 is the triple (𝑛, 𝑚, 𝐿) where 𝐿 is a list of triples of the form (𝑖, 𝑗, 𝑘) for 𝑖, 𝑗, 𝑘 ∈ [𝑡].
We assign a number for each variable of 𝑃 as follows:

• For every 𝑖 ∈ [𝑛], the variable X[𝑖] is assigned the number 𝑖.

• For every 𝑗 ∈ [𝑚], the variable Y[𝑗] is assigned the number


𝑡 − 𝑚 + 𝑗.

• Every other variable is assigned a number in {𝑛, 𝑛 + 1, … , 𝑡 − 𝑚 −


1} in the order in which the variable appears in the program 𝑃 .

The list of tuples representation is our default choice for represent-


ing NAND-CIRC programs. Since “list of tuples representation” is a
bit of a mouthful, we will often call it simply “the representation” for
a program 𝑃 . Sometimes, when the number 𝑛 of inputs and number
𝑚 of outputs are known from the context, we will simply represent a
program as the list 𝐿 instead of the triple (𝑛, 𝑚, 𝐿).

■ Example 5.8 — Representing the XOR program. Our favorite NAND-


CIRC program, the program

u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)

computing the XOR function is represented as the tuple (2, 1, 𝐿)


where 𝐿 = ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)). That is, the variables
X[0] and X[1] are given the indices 0 and 1 respectively, the vari-
ables u,v,w are given the indices 2, 3, 4 respectively, and the variable
Y[0] is given the index 5.

Transforming a NAND-CIRC program from its representation as


code to the representation as a list of tuples is a fairly straightforward
programming exercise, and in particular can be done in a few lines of Python.⁴ The list-of-tuples representation loses information such as the particular names we used for the variables, but this is OK since these names do not make a difference to the functionality of the program.

⁴ If you're curious what these few lines are, see our GitHub repository.
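For concreteness, here is one way such a transformation could look in Python. This is a minimal sketch written for this note (it is not the code from the book's repository, assumes well-formed lines of the form foo = NAND(bar,blah), and does no error checking):

def parse_line(line):
    # "foo = NAND(bar,blah)" -> ("foo", "bar", "blah")
    target, rest = line.split("=")
    args = rest.strip()[len("NAND("):-1].split(",")
    return target.strip(), args[0].strip(), args[1].strip()

def code_to_tuples(code, n, m):
    # Convert NAND-CIRC source into the list-of-tuples representation (n, m, L).
    parsed = [parse_line(l) for l in code.splitlines() if l.strip()]
    work = []                                  # workspace variables, in order of appearance
    for names in parsed:
        for name in names:
            if not name.startswith(("X[", "Y[")) and name not in work:
                work.append(name)
    t = n + len(work) + m                      # total number of distinct variables
    def idx(name):
        if name.startswith("X["): return int(name[2:-1])
        if name.startswith("Y["): return t - m + int(name[2:-1])
        return n + work.index(name)
    L = [(idx(a), idx(b), idx(c)) for (a, b, c) in parsed]
    return (n, m, L)

Running it on the XOR program of Example 5.8 with 𝑛 = 2 and 𝑚 = 1 reproduces the tuple (2, 1, [(2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)]).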

5.3.1 From tuples to strings


If 𝑃 is a program of size 𝑠, then the number 𝑡 of variables is at most 3𝑠
(as every line touches at most three variables). Hence we can encode
every variable index in [𝑡] as a string of length ℓ = ⌈log(3𝑠)⌉, by adding
leading zeroes as needed. Since this is a fixed-length encoding, it is
prefix free, and so we can encode the list 𝐿 of 𝑠 triples (corresponding
to the encoding of the 𝑠 lines of the program) as simply the string of
length 3ℓ𝑠 obtained by concatenating all of these encodings.
We define 𝑆(𝑠) to be the length of the string representing the list 𝐿
corresponding to a size 𝑠 program. By the above we see that

𝑆(𝑠) = 3𝑠⌈log(3𝑠)⌉ . (5.1)

We can represent 𝑃 = (𝑛, 𝑚, 𝐿) as a string by prepending a prefix


free representation of 𝑛 and 𝑚 to the list 𝐿. Since 𝑛, 𝑚 ≤ 3𝑠 (a pro-

gram must touch at least once all its input and output variables), those
prefix free representations can be encoded using strings of length
𝑂(log 𝑠). In particular, every program 𝑃 of at most 𝑠 lines can be rep-
resented by a string of length 𝑂(𝑠 log 𝑠). Similarly, every circuit 𝐶 of
at most 𝑠 gates can be represented by a string of length 𝑂(𝑠 log 𝑠) (for
example by translating 𝐶 to the equivalent program 𝑃 ).
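The encoding of the list 𝐿 itself is equally mechanical. The sketch below (illustrative only; the prefix-free header for 𝑛 and 𝑚 is omitted, and log denotes log base 2 as in (5.1)) produces a bit string of length 3𝑠⌈log(3𝑠)⌉:

from math import ceil, log2

def tuples_to_string(L):
    # Encode a list of s triples as a bit string of length 3*s*ceil(log2(3s)),
    # using a fixed-length (and hence prefix-free) encoding of each index.
    s = len(L)
    ell = ceil(log2(3 * s))
    enc = lambda x: format(x, "0{}b".format(ell))  # ell-bit binary with leading zeros
    return "".join(enc(i) + enc(j) + enc(k) for (i, j, k) in L)

L = [(2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)]
print(len(tuples_to_string(L)))   # 3 * 4 * ceil(log2(12)) = 48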

5.4 A NAND-CIRC INTERPRETER IN NAND-CIRC


Since we can represent programs as strings, we can also think of a
program as an input to a function. In particular, for every natural
number 𝑠, 𝑛, 𝑚 > 0 we define the function EVAL𝑠,𝑛,𝑚 ∶ {0, 1}𝑆(𝑠)+𝑛 →
{0, 1}𝑚 as follows:

EVAL𝑠,𝑛,𝑚(𝑝𝑥) = 𝑃(𝑥) if 𝑝 ∈ {0, 1}^{𝑆(𝑠)} represents a size-𝑠 program 𝑃 with 𝑛 inputs and 𝑚 outputs, and EVAL𝑠,𝑛,𝑚(𝑝𝑥) = 0^𝑚 otherwise.   (5.2)
where 𝑆(𝑠) is defined as in (5.1) and we use the concrete representa-
tion scheme described in Section 5.1.
That is, EVAL𝑠,𝑛,𝑚 takes as input the concatenation of two strings:
a string 𝑝 ∈ {0, 1}^{𝑆(𝑠)} and a string 𝑥 ∈ {0, 1}^𝑛. If 𝑝 is a string that
represents a list of triples 𝐿 such that (𝑛, 𝑚, 𝐿) is a list-of-tuples rep-
resentation of a size-𝑠 NAND-CIRC program 𝑃 , then EVAL𝑠,𝑛,𝑚 (𝑝𝑥)
is equal to the evaluation 𝑃 (𝑥) of the program 𝑃 on the input 𝑥. Oth-
erwise, EVAL𝑠,𝑛,𝑚 (𝑝𝑥) equals 0𝑚 (this case is not very important: you
can simply think of 0𝑚 as some “junk value” that indicates an error).

Take-away points. The fine details of EVAL𝑠,𝑛,𝑚 ’s definition are not


very crucial. Rather, what you need to remember about EVAL𝑠,𝑛,𝑚 is
that:
• EVAL𝑠,𝑛,𝑚 is a finite function taking a string of fixed length as input
and outputting a string of fixed length as output.

• EVAL𝑠,𝑛,𝑚 is a single function, such that computing EVAL𝑠,𝑛,𝑚


allows us to evaluate arbitrary NAND-CIRC programs of a certain
length on arbitrary inputs of the appropriate length.

• EVAL𝑠,𝑛,𝑚 is a function, not a program (recall the discussion in Sec-


tion 3.7.2). That is, EVAL𝑠,𝑛,𝑚 is a specification of what output is as-
sociated with what input. The existence of a program that computes
EVAL𝑠,𝑛,𝑚 (i.e., an implementation for EVAL𝑠,𝑛,𝑚 ) is a separate
fact, which needs to be established (and which we will do in Theo-
rem 5.9, with a more efficient program shown in Theorem 5.10).
One of the first examples of self circularity we will see in this book is
the following theorem, which we can think of as showing a “NAND-
CIRC interpreter in NAND-CIRC”:

Theorem 5.9 — Bounded Universality of NAND-CIRC programs. For every 𝑠, 𝑛, 𝑚 ∈ ℕ with 𝑠 ≥ 𝑚 there is a NAND-CIRC program 𝑈𝑠,𝑛,𝑚 that computes the function EVAL𝑠,𝑛,𝑚.

That is, the NAND-CIRC program 𝑈𝑠,𝑛,𝑚 takes the description


of any other NAND-CIRC program 𝑃 (of the right length and input-
s/outputs) and any input 𝑥, and computes the result of evaluating the
program 𝑃 on the input 𝑥. Given the equivalence between NAND-
CIRC programs and Boolean circuits, we can also think of 𝑈𝑠,𝑛,𝑚 as
a circuit that takes as input the description of other circuits and their
inputs, and returns their evaluation, see Fig. 5.6. We call this NAND-
CIRC program 𝑈𝑠,𝑛,𝑚 that computes EVAL𝑠,𝑛,𝑚 a bounded universal
program (or a universal circuit, see Fig. 5.6). “Universal” stands for
the fact that this is a single program that can evaluate arbitrary code,
where “bounded” stands for the fact that 𝑈𝑠,𝑛,𝑚 only evaluates pro-
grams of bounded size. Of course this limitation is inherent for the
NAND-CIRC programming language, since a program of 𝑠 lines (or,
equivalently, a circuit of 𝑠 gates) can take at most 2𝑠 inputs. Later, in
Chapter 7, we will introduce the concept of loops (and the model of
Turing machines), which allow us to escape this limitation.

Proof. Theorem 5.9 is an important result, but it is actually not hard to


prove. Specifically, since EVAL𝑠,𝑛,𝑚 is a finite function, Theorem 5.9 is
an immediate corollary of Theorem 4.12, which states that every finite
function can be computed by some NAND-CIRC program.

P
Theorem 5.9 is simple but important. Make sure you
understand what this theorem means, and why it is a
corollary of Theorem 4.12.

5.4.1 Efficient universal programs


Theorem 5.9 establishes the existence of a NAND-CIRC program
for computing EVAL𝑠,𝑛,𝑚 , but it provides no explicit bound on the
size of this program. Theorem 4.12, which we used to prove Theo-
rem 5.9, guarantees the existence of a NAND-CIRC program whose
size can be as large as exponential in the length of its input. This would
mean that even for moderately small values of 𝑠, 𝑛, 𝑚 (for example 𝑛 = 100, 𝑠 = 300, 𝑚 = 1), computing EVAL𝑠,𝑛,𝑚 might require a NAND program with more lines than there are atoms in the observable universe! Fortunately, we can do much better than that. In fact, for every 𝑠, 𝑛, 𝑚 there exists a NAND-CIRC program for computing EVAL𝑠,𝑛,𝑚 with size that is polynomial in its input length. This is shown in the following theorem.

Figure 5.6: A universal circuit 𝑈 is a circuit that gets as input the description of an arbitrary (smaller) circuit 𝑃 as a binary string, and an input 𝑥, and outputs the string 𝑃(𝑥) which is the evaluation of 𝑃 on 𝑥. We can also think of 𝑈 as a straight-line program that gets as input the code of a straight-line program 𝑃 and an input 𝑥, and outputs 𝑃(𝑥).

Theorem 5.10 — Efficient bounded universality of NAND-CIRC programs.


For every 𝑠, 𝑛, 𝑚 ∈ ℕ there is a NAND-CIRC program of at most 𝑂(𝑠² log 𝑠) lines that computes the function EVAL𝑠,𝑛,𝑚 ∶ {0, 1}^{𝑆+𝑛} → {0, 1}^𝑚 defined above (where 𝑆 is the number of bits needed to represent programs of 𝑠 lines).

P
If you haven’t done so already, now might be a good
time to review 𝑂 notation in Section 1.4.8. In particu-
lar, an equivalent way to state Theorem 5.10 is that it
says that there exists some number 𝑐 > 0 such that for
every 𝑠, 𝑛, 𝑚 ∈ ℕ, there exists a NAND-CIRC program
𝑃 of at most 𝑐𝑠² log 𝑠 lines that computes the function
EVAL𝑠,𝑛,𝑚 .

Unlike Theorem 5.9, Theorem 5.10 is not a trivial corollary of the


fact that every finite function can be computed by some circuit. Prov-
ing Theorem 5.10 requires us to present a concrete NAND-CIRC pro-
gram for computing the function EVAL𝑠,𝑛,𝑚 . We will do so in several
stages.

1. First, we will describe the algorithm to evaluate EVAL𝑠,𝑛,𝑚 in


“pseudo code”.

2. Then, we will show how we can write a program to compute


EVAL𝑠,𝑛,𝑚 in Python. We will not use much about Python, and
a reader that has familiarity with programming in any language
should be able to follow along.

3. Finally, we will show how we can transform this Python program


into a NAND-CIRC program.

This approach yields much more than just proving Theorem 5.10:
we will see that it is in fact always possible to transform (loop free)
code in high level languages such as Python to NAND-CIRC pro-
grams (and hence to Boolean circuits as well).

5.4.2 A NAND-CIRC interpeter in “pseudocode”


To prove Theorem 5.10 it suffices to give a NAND-CIRC program of
𝑂(𝑠² log 𝑠) lines that can evaluate NAND-CIRC programs of 𝑠 lines.
Let us start by thinking how we would evaluate such programs if we
weren’t restricted to only performing NAND operations. That is, let us
describe informally an algorithm that on input 𝑛, 𝑚, 𝑠, a list of triples

𝐿, and a string 𝑥 ∈ {0, 1}𝑛 , evaluates the program represented by


(𝑛, 𝑚, 𝐿) on the string 𝑥.

P
It would be highly worthwhile for you to stop here
and try to solve this problem yourself. For example,
you can try thinking how you would write a program
NANDEVAL(n,m,s,L,x) that computes this function in
the programming language of your choice.

We will now describe such an algorithm. We assume that we have


access to a bit array data structure that can store for every 𝑖 ∈ [𝑡] a
bit 𝑇𝑖 ∈ {0, 1}. Specifically, if Table is a variable holding this data
structure, then we assume we can perform the operations:

• GET(Table,i) which retrieves the bit corresponding to i in Table.


The value of i is assumed to be an integer in [𝑡].

• Table = UPDATE(Table,i,b) which updates Table so the bit cor-


responding to i is now set to b. The value of i is assumed to be an
integer in [𝑡] and b is a bit in {0, 1}.

Algorithm 5.11 — Eval NAND-CIRC programs.

Input: Numbers 𝑛, 𝑚, 𝑠 and 𝑡 ≤ 3𝑠, as well as a list 𝐿 of 𝑠


triples of numbers in [𝑡], and a string 𝑥 ∈ {0, 1}𝑛 .
Output: Evaluation of the program represented by
(𝑛, 𝑚, 𝐿) on the input 𝑥 ∈ {0, 1}𝑛 .
1: Let Vartable be table of size 𝑡
2: for 𝑖 in [𝑛] do
3: Vartable = UPDATE(Vartable,𝑖,𝑥𝑖 )
4: end for
5: for (𝑖, 𝑗, 𝑘) in 𝐿 do
6: 𝑎 ← GET(Vartable,𝑗)
7: 𝑏 ← GET(Vartable,𝑘)
8: Vartable = UPDATE(Vartable,𝑖,NAND(𝑎,𝑏))
9: end for
10: for 𝑗 in [𝑚] do
11: 𝑦𝑗 ← GET(Vartable,𝑡 − 𝑚 + 𝑗)
12: end for
13: return 𝑦0 , … , 𝑦𝑚−1

Algorithm 5.11 evaluates the program given to it as input one line


at a time, updating the Vartable table to contain the value of each
variable. At the end of the execution it outputs the variables at posi-
tions 𝑡 − 𝑚, 𝑡 − 𝑚 + 1, … , 𝑡 − 1, which correspond to the output variables.

5.4.3 A NAND interpreter in Python


To make things more concrete, let us see how we implement Algo-
rithm 5.11 in the Python programming language. (There is nothing
special about Python. We could have easily presented a corresponding
function in JavaScript, C, OCaml, or any other programming lan-
guage.) We will construct a function NANDEVAL that on input 𝑛, 𝑚, 𝐿, 𝑥
will output the result of evaluating the program represented by
(𝑛, 𝑚, 𝐿) on 𝑥. To keep things simple, we will not worry about the case
that 𝐿 does not represent a valid program of 𝑛 inputs and 𝑚 outputs.
The code is presented in Fig. 5.7.
Accessing an element of the array Vartable at a given index takes
a constant number of basic operations. Hence (since 𝑛, 𝑚 ≤ 𝑠 and 𝑡 ≤ 3𝑠), the program above will use 𝑂(𝑠) basic operations.⁵

⁵ Python does not distinguish between lists and arrays, but allows constant time random access to indexed elements of both of them. One could argue that if we allowed programs of truly unbounded length (e.g., larger than 2^{64}) then the price would not be constant but logarithmic in the length of the array/lists, but the difference between 𝑂(𝑠) and 𝑂(𝑠 log 𝑠) will not be important for our discussions.

5.4.4 Constructing the NAND-CIRC interpreter in NAND-CIRC

We now turn to describing the proof of Theorem 5.10. To prove the
theorem it is not enough to give a Python program. Rather, we need to
show how we compute the function EVAL𝑠,𝑛,𝑚 using a NAND-CIRC
program. In other words, our job is to transform, for every 𝑠, 𝑛, 𝑚, the
Python code of Section 5.4.3 to a NAND-CIRC program 𝑈𝑠,𝑛,𝑚 that
computes the function EVAL𝑠,𝑛,𝑚 .

P
Before reading further, try to think how you could give
a “constructive proof” of Theorem 5.10. That is, think
of how you would write, in the programming lan-
guage of your choice, a function universal(s,n,m)
that on input 𝑠, 𝑛, 𝑚 outputs the code for the NAND-
CIRC program 𝑈𝑠,𝑛,𝑚 such that 𝑈𝑠,𝑛,𝑚 computes
EVAL𝑠,𝑛,𝑚 . There is a subtle but crucial difference
between this function and the Python NANDEVAL pro-
gram described above. Rather than actually evaluating
a given program 𝑃 on some input 𝑤, the function
universal should output the code of a NAND-CIRC
program that computes the map (𝑃 , 𝑥) ↦ 𝑃 (𝑥).

Our construction will follow very closely the Python implementa-


tion of EVAL above. We will use variables Vartable[0],…,Vartable[2^ℓ − 1], where ℓ = ⌈log 3𝑠⌉ to store our variables. However, NAND doesn't
have integer-valued variables, so we cannot write code such as
Vartable[i] for some variable i. However, we can implement the
function GET(Vartable,i) that outputs the i-th bit of the array
Vartable. Indeed, this is nothing but the function LOOKUPℓ that we
have seen in Theorem 4.10!

Figure 5.7: Code for evaluating a NAND-CIRC program given in the list-of-tuples representation

def NAND(a, b):
    # the NAND function on two bits (as defined in earlier chapters)
    return 1 - a * b

def NANDEVAL(n, m, L, X):
    # Evaluate a NAND-CIRC program from list of tuples representation.
    s = len(L)  # num of lines
    t = max(max(a, b, c) for (a, b, c) in L) + 1  # max index in L + 1
    Vartable = [0] * t  # initialize array

    # helper functions
    def GET(V, i): return V[i]

    def UPDATE(V, i, b):
        V[i] = b
        return V

    # load input values to Vartable:
    for i in range(n):
        Vartable = UPDATE(Vartable, i, X[i])

    # Run the program
    for (i, j, k) in L:
        a = GET(Vartable, j)
        b = GET(Vartable, k)
        c = NAND(a, b)
        Vartable = UPDATE(Vartable, i, c)

    # Return outputs Vartable[t-m], Vartable[t-m+1],....,Vartable[t-1]
    return [GET(Vartable, t - m + j) for j in range(m)]

# Test on XOR (2 inputs, 1 output)
L = ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4))
print(NANDEVAL(2, 1, L, (0, 1)))  # XOR(0,1)
# [1]
print(NANDEVAL(2, 1, L, (1, 1)))  # XOR(1,1)
# [0]

P
Please make sure that you understand why GET and
LOOKUPℓ are the same function.

We saw that we can compute LOOKUPℓ in time 𝑂(2^ℓ) = 𝑂(𝑠) for our choice of ℓ.
For every ℓ, let UPDATEℓ ∶ {0, 1}^{2^ℓ+ℓ+1} → {0, 1}^{2^ℓ} correspond to the UPDATE function for arrays of length 2^ℓ. That is, on input 𝑉 ∈ {0, 1}^{2^ℓ}, 𝑖 ∈ {0, 1}^ℓ, 𝑏 ∈ {0, 1}, UPDATEℓ(𝑉, 𝑖, 𝑏) is equal to 𝑉′ ∈ {0, 1}^{2^ℓ} such that

𝑉′_𝑗 = 𝑉_𝑗 if 𝑗 ≠ 𝑖, and 𝑉′_𝑗 = 𝑏 if 𝑗 = 𝑖

where we identify the string 𝑖 ∈ {0, 1}^ℓ with a number in {0, … , 2^ℓ − 1} using the binary representation. We can compute UPDATEℓ using an 𝑂(2^ℓ ⋅ ℓ) = 𝑂(𝑠 log 𝑠) line NAND-CIRC program as follows:

1. For every 𝑗 ∈ [2^ℓ], there is an 𝑂(ℓ) line NAND-CIRC program to compute the function EQUALS𝑗 ∶ {0, 1}^ℓ → {0, 1} that on input 𝑖 outputs 1 if and only if 𝑖 is equal to (the binary representation of) 𝑗. (We leave verifying this as Exercise 5.2 and Exercise 5.3.)

2. We have seen that we can compute the function IF ∶ {0, 1}3 → {0, 1}
such that IF(𝑎, 𝑏, 𝑐) equals 𝑏 if 𝑎 = 1 and 𝑐 if 𝑎 = 0.

Together, this means that we can compute UPDATE (using some


“syntactic sugar” for bounded length loops) as follows:

def UPDATE_ell(V,i,b):
    # Get V[0]...V[2^ell-1], i in {0,1}^ell, b in {0,1}
    # Return NewV[0],...,NewV[2^ell-1]
    # updated array with NewV[i]=b and all
    # else same as V
    for j in range(2**ell): # j = 0,1,2,....,2^ell -1
        a = EQUALS_j(i)
        NewV[j] = IF(a,b,V[j])
    return NewV

Since the loop over j in UPDATE is run 2^ℓ times, and computing EQUALS_j takes 𝑂(ℓ) lines, the total number of lines to compute UPDATE is 𝑂(2^ℓ ⋅ ℓ) = 𝑂(𝑠 log 𝑠). Once we can compute GET and UPDATE, the rest of the implementation amounts to “book keeping” that needs to be done carefully, but is not too insightful, and hence we omit the full details. Since we run GET and UPDATE 𝑠 times, the total number of lines for computing EVAL𝑠,𝑛,𝑚 is 𝑂(𝑠²) + 𝑂(𝑠² log 𝑠) = 𝑂(𝑠² log 𝑠).
This completes (up to the omitted details) the proof of Theorem 5.10.

R
Remark 5.12 — Improving to quasilinear overhead (ad-
vanced optional note). The NAND-CIRC program
above is less efficient than its Python counterpart,
since NAND does not offer arrays with efficient ran-
dom access. Hence for example the LOOKUP operation
on an array of 𝑠 bits takes Ω(𝑠) lines in NAND even
though it takes 𝑂(1) steps (or maybe 𝑂(log 𝑠) steps,
depending on how we count) in Python.
It turns out that it is possible to improve the bound
of Theorem 5.10, and evaluate 𝑠 line NAND-CIRC
programs using a NAND-CIRC program of 𝑂(𝑠 log 𝑠)
lines. The key is to consider the description of NAND-
CIRC programs as circuits, and in particular as di-
rected acyclic graphs (DAGs) of bounded in-degree.
A universal NAND-CIRC program 𝑈𝑠 for 𝑠 line pro-
grams will correspond to a universal graph 𝐻𝑠 for such
𝑠 vertex DAGs. We can think of such a graph 𝑈𝑠 as
fixed “wiring” for a communication network, that
should be able to accommodate any arbitrary pattern
of communication between 𝑠 vertices (where this pat-
tern corresponds to an 𝑠 line NAND-CIRC program).
It turns out that such efficient routing networks exist
that allow embedding any 𝑠 vertex circuit inside a uni-
versal graph of size 𝑂(𝑠 log 𝑠), see the bibliographical
notes Section 5.9 for more on this issue.

5.5 A PYTHON INTERPRETER IN NAND-CIRC (DISCUSSION)


To prove Theorem 5.10 we essentially translated every line of the
Python program for EVAL into an equivalent NAND-CIRC snip-
pet. However, none of our reasoning was specific to the particu-
lar function EVAL. It is possible to translate every Python program
into an equivalent NAND-CIRC program of comparable efficiency.
(More concretely, if the Python program takes 𝑇 (𝑛) operations on
inputs of length at most 𝑛 then there exists a NAND-CIRC program of
𝑂(𝑇 (𝑛) log 𝑇 (𝑛)) lines that agrees with the Python program on inputs
of length 𝑛.) Actually doing so requires taking care of many details
and is beyond the scope of this book, but let me try to convince you
why you should believe it is possible in principle.
For starters, one can use CPython (the reference implementation
for Python), to evaluate every Python program using a C program. We
can combine this with a C compiler to transform a Python program
to various flavors of “machine language”. So, to transform a Python
program into an equivalent NAND-CIRC program, it is enough to
show how to transform a machine language program into an equiva-
lent NAND-CIRC program. One minimalistic (and hence convenient)
family of machine languages is known as the ARM architecture which

powers many mobile devices including essentially all Android devices.⁶

⁶ ARM stands for “Advanced RISC Machine” where RISC in turn stands for “Reduced instruction set computer”.

There are even simpler machine languages, such as the LEG architecture for which a backend for the LLVM compiler was imple-
mented (and hence can be the target of compiling any of the large
and growing list of languages that this compiler supports). Other ex-
amples include the TinyRAM architecture (motivated by interactive
proof systems that we will discuss in Chapter 22) and the teaching-
oriented Ridiculously Simple Computer architecture. Going one by
one over the instruction sets of such computers and translating them
to NAND snippets is no fun, but it is a feasible thing to do. In fact,
ultimately this is very similar to the transformation that takes place
in converting our high level code to actual silicon gates that are not
so different from the operations of a NAND-CIRC program. Indeed,
tools such as MyHDL that transform “Python to Silicon” can be used
to convert a Python program to a NAND-CIRC program.
The NAND-CIRC programming language is just a teaching tool,
and by no means do I suggest that writing NAND-CIRC programs, or
compilers to NAND-CIRC, is a practical, useful, or enjoyable activity.
What I do want is to make sure you understand why it can be done,
and to have the confidence that if your life (or at least your grade)
depended on it, then you would be able to do this. Understanding
how programs in high level languages such as Python are eventually
transformed into concrete low-level representation such as NAND is
fundamental to computer science.
The astute reader might notice that the above paragraphs only
outlined why it should be possible to find for every particular Python-
computable function 𝑓, a particular comparably efficient NAND-CIRC
program 𝑃 that computes 𝑓. But this still seems to fall short of our
goal of writing a “Python interpreter in NAND” which would mean
that for every parameter 𝑛, we come up with a single NAND-CIRC
program UNIV𝑠 such that given a description of a Python program
𝑃 , a particular input 𝑥, and a bound 𝑇 on the number of operations
(where the lengths of 𝑃 and 𝑥 and the value of 𝑇 are all at most 𝑠)
returns the result of executing 𝑃 on 𝑥 for at most 𝑇 steps. After all,
the transformation above takes every Python program into a different
NAND-CIRC program, and so does not yield “one NAND-CIRC pro-
gram to rule them all” that can evaluate every Python program up to
some given complexity. However, we can in fact obtain one NAND-
CIRC program to evaluate arbitrary Python programs. The reason is
that there exists a Python interpreter in Python: a Python program 𝑈
that takes a bit string, interprets it as Python code, and then runs that
code. Hence, we only need to show a NAND-CIRC program 𝑈 ∗ that
computes the same function as the particular Python program 𝑈 , and
this will give us a way to evaluate all Python programs.

What we are seeing time and again is the notion of universality or


self reference of computation, which is the sense that all reasonably rich
models of computation are expressive enough that they can “simulate
themselves”. The importance of this phenomenon to both the theory
and practice of computing, as well as far beyond it, including the
foundations of mathematics and basic questions in science, cannot be
overstated.

5.6 THE PHYSICAL EXTENDED CHURCH-TURING THESIS (DISCUS-


SION)
We’ve seen that NAND gates (and other Boolean operations) can be
implemented using very different systems in the physical world. What
about the reverse direction? Can NAND-CIRC programs simulate any
physical computer?
We can take a leap of faith and stipulate that Boolean circuits (or
equivalently NAND-CIRC programs) do actually encapsulate every
computation that we can think of. Such a statement (in the realm of
infinite functions, which we’ll encounter in Chapter 7) is typically
attributed to Alonzo Church and Alan Turing, and in that context
is known as the Church-Turing Thesis. As we will discuss in future
lectures, the Church-Turing Thesis is not a mathematical theorem or
conjecture. Rather, like theories in physics, the Church-Turing Thesis
is about mathematically modeling the real world. In the context of
finite functions, we can make the following informal hypothesis or
prediction:

“Physical Extended Church-Turing Thesis” (PECTT): If a function


𝐹 ∶ {0, 1}𝑛 → {0, 1}𝑚 can be computed in the physical world using 𝑠 amount
of “physical resources” then it can be computed by a Boolean circuit program of
roughly 𝑠 gates.

A priori it might seem rather extreme to hypothesize that our mea-


ger model of NAND-CIRC programs or Boolean circuits captures all
possible physical computation. But yet, in more than a century of
computing technologies, no one has yet built any scalable computing
device that challenges this hypothesis.
We now discuss the “fine print” of the PECTT in more detail, as
well as the (so far unsuccessful) challenges that have been raised
against it. There is no single universally-agreed-upon formalization
of “roughly 𝑠 physical resources”, but we can approximate this notion
by considering the size of any physical computing device and the
time it takes to compute the output, and ask that any such device can
be simulated by a Boolean circuit with a number of gates that is a
polynomial (with not too large exponent) in the size of the system and
the time it takes it to operate.

In other words, we can phrase the PECTT as stipulating that any


function that can be computed by a device that takes a certain volume
𝑉 of space and requires 𝑡 time to complete the computation, must be
computable by a Boolean circuit with a number of gates 𝑝(𝑉 , 𝑡) that is
polynomial in 𝑉 and 𝑡.
The exact form of the function 𝑝(𝑉 , 𝑡) is not universally agreed
upon but it is generally accepted that if 𝑓 ∶ {0, 1}𝑛 → {0, 1} is an
exponentially hard function, in the sense that it has no NAND-CIRC
program of fewer than, say, 2^{𝑛/2} lines, then a demonstration of a phys-
ical device that can compute in the real world 𝑓 for moderate input
lengths (e.g., 𝑛 = 500) would be a violation of the PECTT.

R
Remark 5.13 — Advanced note: making PECTT concrete
(advanced, optional). We can attempt a more exact
phrasing of the PECTT as follows. Suppose that 𝑍 is
a physical system that accepts 𝑛 binary stimuli and
has a binary output, and can be enclosed in a sphere
of volume 𝑉 . We say that the system 𝑍 computes a
function 𝑓 ∶ {0, 1}𝑛 → {0, 1} within 𝑡 seconds if when-
ever we set the stimuli to some value 𝑥 ∈ {0, 1}𝑛 , if
we measure the output after 𝑡 seconds then we obtain
𝑓(𝑥).
One can then phrase the PECTT as stipulating that if
there exists such a system 𝑍 that computes 𝐹 within
𝑡 seconds, then there exists a NAND-CIRC program
that computes 𝐹 and has at most 𝛼(𝑉𝑡)² lines, where
𝛼 is some normalization constant. (We can also con-
sider variants where we use surface area instead
of volume, or take (𝑉 𝑡) to a different power than 2.
However, none of these choices makes a qualitative
difference to the discussion below.) In particular,
suppose that 𝑓 ∶ {0, 1}𝑛 → {0, 1} is a function that
requires 2^𝑛/(100𝑛) > 2^{0.8𝑛} lines for any NAND-CIRC
program (such a function exists by Theorem 5.3).
Then the PECTT would imply that either the volume
or the time of a system that computes 𝐹 will have to be at least 2^{0.2𝑛}/√𝛼. Since this quantity grows expo-
nentially in 𝑛, it is not hard to set parameters so that
even for moderately large values of 𝑛, such a system
could not fit in our universe.
To fully make the PECTT concrete, we need to decide
on the units for measuring time and volume, and the
normalization constant 𝛼. One conservative choice is
to assume that we could squeeze computation to the
absolute physical limits (which are many orders of
magnitude beyond current technology). This corre-
sponds to setting 𝛼 = 1 and using the Planck units
for volume and time. The Planck length ℓ𝑃 (which is,
roughly speaking, the shortest distance that can the-
oretically be measured) is roughly 2^{−120} meters. The Planck time 𝑡𝑃 (which is the time it takes for light to travel one Planck length) is about 2^{−150} seconds. In the
above setting, if a function 𝐹 takes, say, 1KB of input
(e.g., roughly 10^4 bits, which can encode a 100 by 100 bitmap image), and requires at least 2^{0.8𝑛} = 2^{0.8⋅10^4} NAND lines to compute, then any physical system that computes it would require either volume of 2^{0.2⋅10^4} Planck lengths cubed, which is more than 2^{1500} meters cubed, or take at least 2^{0.2⋅10^4} Planck time units, which is larger than 2^{1500} seconds. To get a sense of
how big that number is, note that the universe is only
about 2^{60} seconds old, and its observable radius is only roughly 2^{90} meters. The above discussion sug-
gests that it is possible to empirically falsify the PECTT
by presenting a smaller-than-universe-size system that
computes such a function.
There are of course several hurdles to refuting the
PECTT in this way, one of which is that we can’t actu-
ally test the system on all possible inputs. However,
it turns out that we can get around this issue using
notions such as interactive proofs and program checking
that we might encounter later in this book. Another,
perhaps more salient problem, is that while we know
many hard functions exist, at the moment there is no
single explicit function 𝐹 ∶ {0, 1}𝑛 → {0, 1} for which
we can prove an 𝜔(𝑛) (let alone Ω(2𝑛 /𝑛)) lower bound
on the number of lines that a NAND-CIRC program
needs to compute it.
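As a back-of-the-envelope illustration of these magnitudes, the following Python snippet redoes the arithmetic of the remark in log₂ scale (the constants are the rough powers of two quoted above, taken here as assumptions rather than precise physical values):

n = 10**4                        # roughly 1KB of input
log_lines_needed = 0.8 * n       # assume f needs at least 2^{0.8n} NAND lines

# If alpha*(V*t)^2 must be at least the number of lines (taking alpha = 1),
# then (V*t)^2 >= 2^{0.8n}, so either V or t is at least 2^{0.2n} in Planck units.
log_V_or_t_planck = log_lines_needed / 4

log_planck_time_sec = -150       # Planck time is roughly 2^-150 seconds
log_universe_age_sec = 60        # the universe is roughly 2^60 seconds old

log_time_needed_sec = log_V_or_t_planck + log_planck_time_sec
print(log_time_needed_sec)                        # 1850.0, i.e., about 2^1850 seconds
print(log_time_needed_sec > log_universe_age_sec) # True: far longer than the age of the universe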

5.6.1 Attempts at refuting the PECTT


One of the admirable traits of mankind is the refusal to accept limita-
tions. In the best case this is manifested by people achieving long-
standing “impossible” challenges such as heavier-than-air flight,
putting a person on the moon, circumnavigating the globe, or even
resolving Fermat’s Last Theorem. In the worst case it is manifested by
people continually following the footsteps of previous failures to try to
do proven-impossible tasks such as build a perpetual motion machine,
trisect an angle with a compass and straightedge, or refute Bell’s in-
equality. The Physical Extended Church-Turing thesis (in its various
forms) has attracted both types of people. Here are some physical
devices that have been speculated to achieve computational tasks that
cannot be done by not-too-large NAND-CIRC programs:

• Spaghetti sort: One of the first lower bounds that Computer Sci-
ence students encounter is that sorting 𝑛 numbers requires making
Ω(𝑛 log 𝑛) comparisons. The “spaghetti sort” is a description of a
proposed “mechanical computer” that would do this faster. The
idea is that to sort 𝑛 numbers 𝑥1 , … , 𝑥𝑛 , we could cut 𝑛 spaghetti
noodles into lengths 𝑥1 , … , 𝑥𝑛 , and then if we simply hold them

together in our hand and bring them down to a flat surface, they
will emerge in sorted order. There are a great many reasons why
this is not truly a challenge to the PECTT hypothesis, and I will not
ruin the reader’s fun in finding them out by her or himself.

• Soap bubbles: One function 𝐹 ∶ {0, 1}𝑛 → {0, 1} that is conjectured


to require a large number of NAND lines to solve is the Euclidean
Steiner Tree problem. This is the problem where one is given 𝑚
points in the plane (𝑥1 , 𝑦1 ), … , (𝑥𝑚 , 𝑦𝑚 ) (say with integer coordi-
nates ranging from 1 till 𝑚, and hence the list can be represented
as a string of 𝑛 = 𝑂(𝑚 log 𝑚) size) and some number 𝐾. The goal
is to figure out whether it is possible to connect all the points by
line segments of total length at most 𝐾. This function is conjec-
tured to be hard because it is NP complete - a concept that we’ll en-
counter later in this course - and it is in fact reasonable to conjecture
that as 𝑚 grows, the number of NAND lines required to compute
this function grows exponentially in 𝑚, meaning that the PECTT
would predict that if 𝑚 is sufficiently large (such as few hundreds
or so) then no physical device could compute 𝐹 . Yet, some people
claimed that there is in fact a very simple physical device that could
solve this problem, that can be constructed using some wooden
pegs and soap. The idea is that if we take two glass plates, and put
𝑚 wooden pegs between them in the locations (𝑥1 , 𝑦1 ), … , (𝑥𝑚 , 𝑦𝑚 )
then bubbles will form whose edges touch those pegs in a way that
will minimize the total energy which turns out to be a function of
the total length of the line segments. The problem with this device
is that nature, just like people, often gets stuck in “local optima”.
That is, the resulting configuration will not be one that achieves
the absolute minimum of the total energy but rather one that can’t
be improved with local changes. Aaronson has carried out actual
experiments (see Fig. 5.8), and saw that while this device often
is successful for three or four pegs, it starts yielding suboptimal
results once the number of pegs grows beyond that.

• DNA computing. People have suggested using the properties of


DNA to do hard computational problems. The main advantage of
DNA is the ability to potentially encode a lot of information in a
relatively small physical space, as well as compute on this infor-
mation in a highly parallel manner. At the time of this writing, it
was demonstrated that one can use DNA to store about 10^{16} bits of information in a region of radius about a millimeter, as opposed to about 10^{10} bits with the best known hard disk technology.

Figure 5.8: Scott Aaronson tests a candidate device for computing Steiner trees using soap bubbles.

This does not pose a real challenge to the PECTT but does suggest that
one should be conservative about the choice of constant and not as-

sume that current hard disk + silicon technologies are the absolute best possible.⁷

⁷ We were extremely conservative in the suggested parameters for the PECTT, having assumed that as many as ℓ_𝑃^{−2} ⋅ 10^{−6} ∼ 10^{61} bits could potentially be stored in a millimeter radius region.

• Continuous/real computers. The physical world is often described
using continuous quantities such as time and space, and people
have suggested that analog devices might have direct access to
computing with real-valued quantities and would be inherently
more powerful than discrete models such as NAND machines.
Whether the “true” physical world is continuous or discrete is an
open question. In fact, we do not even know how to precisely phrase
this question, let alone answer it. Yet, regardless of the answer, it
seems clear that the effort to measure a continuous quantity grows
with the level of accuracy desired, and so there is no “free lunch”
or way to bypass the PECTT using such machines (see also this
paper). Related to that are proposals known as “hypercomputing”
or “Zeno’s computers” which attempt to use the continuity of time
by doing the first operation in one second, the second one in half a
second, the third operation in a quarter second and so on. These
fail for a similar reason to the one guaranteeing that Achilles will
eventually catch the tortoise despite the original Zeno’s paradox.

• Relativity computer and time travel. The formulation above as-


sumed the notion of time, but under the theory of relativity time is
in the eye of the observer. One approach to solve hard problems is
to leave the computer to run for a lot of time from his perspective,
but to ensure that this is actually a short while from our perspective.
One approach to do so is for the user to start the computer and then
go for a quick jog at close to the speed of light before checking on
its status. Depending on how fast one goes, few seconds from the
point of view of the user might correspond to centuries in com-
puter time (it might even finish updating its Windows operating
system!). Of course the catch here is that the energy required from
the user is proportional to how close one needs to get to the speed
of light. A more interesting proposal is to use time travel via closed
timelike curves (CTCs). In this case we could run an arbitrarily long
computation by doing some calculations, remembering the current
state, and then travelling back in time to continue where we left off.
Indeed, if CTCs exist then we’d probably have to revise the PECTT
(though in this case I will simply travel back in time and edit these
notes, so I can claim I never conjectured it in the first place…)

• Humans. Another computing system that has been proposed as


a counterexample to the PECTT is a 3 pound computer of about
0.1m radius, namely the human brain. Humans can walk around,
talk, feel, and do other things that are not commonly done by

NAND-CIRC programs, but can they compute partial functions


that NAND-CIRC programs cannot? There are certainly compu-
tational tasks that at the moment humans do better than computers
(e.g., play some video games, at the moment), but based on our
current understanding of the brain, humans (or other animals)
have no inherent computational advantage over computers. The
brain has about 10^{11} neurons, each operating at a speed of about 1000 operations per second. Hence a rough first approximation is that a Boolean circuit of about 10^{14} gates could simulate one second of a brain's activity.⁸

⁸ This is a very rough approximation that could be wrong by a few orders of magnitude in either direction. For one, there are other structures in the brain apart from neurons that one might need to simulate, hence requiring higher overhead. On the other hand, it is by no means clear that we need to fully clone the brain in order to achieve the same computational tasks that it does.

Note that the fact that such a circuit (likely) exists does not mean it is easy to find it. After all, constructing this circuit took evolution billions of years. Much of the recent effort in artificial intelligence research is focused on finding programs that replicate some of the brain's capabilities, and while they take massive computational effort to discover, these programs often turn out to
be much smaller than the pessimistic estimates above. For example,
at the time of this writing, Google’s neural network for machine
translation has about 104 nodes (and can be simulated by a NAND-
CIRC program of comparable size). Philosophers, priests and many
others have since time immemorial argued that there is something
about humans that cannot be captured by mechanical devices such
as computers; whether or not that is the case, the evidence is thin
that humans can perform computational tasks that are inherently impossible to achieve by computers of similar complexity.⁹

⁹ There are some well known scientists that have advocated that humans have inherent computational advantages over computers. See also this.
• Quantum computation. The most compelling attack on the Physi-
cal Extended Church-Turing Thesis comes from the notion of quan-
tum computing. The idea was initiated by the observation that sys-
tems with strong quantum effects are very hard to simulate on a
computer. Turning this observation on its head, people have pro-
posed using such systems to perform computations that we do not
know how to do otherwise. At the time of this writing, scalable
quantum computers have not yet been built, but it is a fascinating
possibility, and one that does not seem to contradict any known law
of nature. We will discuss quantum computing in much more detail
in Chapter 23. Modeling quantum computation involves extending
the model of Boolean circuits into Quantum circuits that have one
more (very special) gate. However, the main takeaway is that while
quantum computing does suggest we need to amend the PECTT,
it does not require a complete revision of our worldview. Indeed,
almost all of the content of this book remains the same regardless of
whether the underlying computational model is Boolean circuits or
quantum circuits.

R
Remark 5.14 — Physical Extended Church-Turing Thesis
and Cryptography. While even the precise phrasing of
the PECTT, let alone understanding its correctness, is
still a subject of active research, some variants of it are
already implicitly assumed in practice. Governments,
companies, and individuals currently rely on cryptog-
raphy to protect some of their most precious assets,
including state secrets, control of weapon systems
and critical infrastructure, securing commerce, and
protecting the confidentiality of personal information.
In applied cryptography, one often encounters state-
ments such as “cryptosystem 𝑋 provides 128 bits of
security”. What such a statement really means is that
(a) it is conjectured that there is no Boolean circuit
(or, equivalently, a NAND-CIRC program) of size
much smaller than 2^128 that can break 𝑋, and (b) we
assume that no other physical mechanism can do bet-
ter, and hence it would take roughly a 2^128 amount of
“resources” to break 𝑋. We say “conjectured” and not
“proved” because, while we can phrase the statement
that breaking the system cannot be done by an 𝑠-gate
circuit as a precise mathematical conjecture, at the
moment we are unable to prove such a statement for
any non-trivial cryptosystem. This is related to the P
vs NP question we will discuss in future chapters. We
will explore Cryptography in Chapter 21.

✓ Chapter Recap

• We can think of programs both as describing a pro-


cess, as well as simply a list of symbols that can be
considered as data that can be fed as input to other
programs.
• We can write a NAND-CIRC program that evalu-
ates arbitrary NAND-CIRC programs (or equiv-
alently a circuit that evaluates other circuits).
Moreover, the efficiency loss in doing so is not too
large.
• We can even write a NAND-CIRC program that
evaluates programs in other programming lan-
guages such as Python, C, Lisp, Java, Go, etc.
• By a leap of faith, we could hypothesize that the
number of gates in the smallest circuit that com-
putes a function 𝑓 captures roughly the amount
of physical resources required to compute 𝑓. This
statement is known as the Physical Extended Church-
Turing Thesis (PECTT).
• Boolean circuits (or equivalently AON-CIRC or
NAND-CIRC programs) capture a surprisingly
wide array of computational models. The strongest
currently known challenge to the PECTT comes
from the potential for using quantum mechanical

effects to speed-up computation, a model known as


quantum computers.

Figure 5.9: A finite computational task is specified by a function 𝑓 ∶ {0, 1}^𝑛 → {0, 1}^𝑚. We can model a computational process using Boolean circuits (of varying gate sets) or straight-line programs. Every function can be computed by many programs. We say that 𝑓 ∈ SIZE_{𝑛,𝑚}(𝑠) if there exists a NAND circuit of at most 𝑠 gates (equivalently a NAND-CIRC program of at most 𝑠 lines) that computes 𝑓. Every function 𝑓 ∶ {0, 1}^𝑛 → {0, 1}^𝑚 can be computed by a circuit of 𝑂(𝑚 ⋅ 2^𝑛 /𝑛) gates. Many functions such as multiplication, addition, solving linear equations, computing the shortest path in a graph, and others, can be computed by circuits with far fewer gates. In particular there is an 𝑂(𝑠^2 log 𝑠)-size circuit that computes the map 𝐶, 𝑥 ↦ 𝐶(𝑥) where 𝐶 is a string describing a circuit of 𝑠 gates. However, the counting argument shows there do exist some functions 𝑓 ∶ {0, 1}^𝑛 → {0, 1}^𝑚 that require Ω(𝑚 ⋅ 2^𝑛 /𝑛) gates to compute.

5.7 RECAP OF PART I: FINITE COMPUTATION


This chapter concludes the first part of this book that deals with finite
computation (computing functions that map a fixed number of Boolean
inputs to a fixed number of Boolean outputs). The main take-aways
from Chapter 3, Chapter 4, and Chapter 5 are as follows (see also
Fig. 5.9):

• We can formally define the notion of a function 𝑓 ∶ {0, 1}𝑛 →


{0, 1}𝑚 being computable using 𝑠 basic operations. Whether these
operations are AND/OR/NOT, NAND, or some other universal
basis does not make much difference. We can describe such a com-
putation either using a circuit or using a straight-line program.

• We define SIZE_{𝑛,𝑚}(𝑠) to be the set of functions that are computable
by NAND circuits of at most 𝑠 gates, or equivalently by NAND-CIRC
programs of at most 𝑠 lines. Up to a constant factor in 𝑠 (which we
will not care about), this is also the same as the set of functions that
are computable by a Boolean circuit of at most 𝑠 AND/OR/NOT gates.
The class SIZE_{𝑛,𝑚}(𝑠) is a set of functions, not of programs/circuits.

• Every function 𝑓 ∶ {0, 1}^𝑛 → {0, 1}^𝑚 can be computed using a
circuit of at most 𝑂(𝑚 ⋅ 2^𝑛 /𝑛) gates. Some functions require at least
Ω(𝑚 ⋅ 2^𝑛 /𝑛) gates. We define SIZE_{𝑛,𝑚}(𝑠) to be the set of functions
from {0, 1}^𝑛 to {0, 1}^𝑚 that can be computed using at most 𝑠 gates.

• We can describe a circuit/program 𝑃 as a string. For every 𝑠, there


is a universal circuit/program 𝑈𝑠 that can evaluate programs of
length 𝑠 given their description as strings. We can use this repre-
sentation also to count the number of circuits of at most 𝑠 gates and
hence prove that some functions cannot be computed by circuits of
smaller-than-exponential size.

• If there is a circuit of 𝑠 gates that computes a function 𝑓, then we


can build a physical device to compute 𝑓 using 𝑠 basic components
(such as transistors). The “Physical Extended Church-Turing The-
sis” postulates that the reverse direction is true as well: if 𝑓 is a
function for which every circuit requires at least 𝑠 gates, then every
physical device to compute 𝑓 will require about 𝑠 "physical
resources”. The main challenge to the PECTT is quantum computing,
which we will discuss in Chapter 23.

Sneak preview: In the next part we will discuss how to model compu-
tational tasks on unbounded inputs, which are specified using functions
𝐹 ∶ {0, 1}∗ → {0, 1}∗ (or 𝐹 ∶ {0, 1}∗ → {0, 1}) that can take an
unbounded number of Boolean inputs.

5.8 EXERCISES
Exercise 5.1 Which one of the following statements is false:

a. There is an 𝑂(𝑠^3) line NAND-CIRC program that given as input
program 𝑃 of 𝑠 lines in the list-of-tuples representation computes
the output of 𝑃 when all its inputs are equal to 1.

b. There is an 𝑂(𝑠^3) line NAND-CIRC program that given as input
program 𝑃 of 𝑠 characters encoded as a string of 7𝑠 bits using the
ASCII encoding, computes the output of 𝑃 when all its inputs are
equal to 1.

c. There is an 𝑂(√𝑠) line NAND-CIRC program that given as input
program 𝑃 of 𝑠 lines in the list-of-tuples representation computes
the output of 𝑃 when all its inputs are equal to 1.

Exercise 5.2 — Equals function. For every 𝑘 ∈ ℕ, show that there is an 𝑂(𝑘)
line NAND-CIRC program that computes the function EQUALS_𝑘 ∶
{0, 1}^{2𝑘} → {0, 1} where EQUALS(𝑥, 𝑥′ ) = 1 if and only if 𝑥 = 𝑥′.

Exercise 5.3 — Equal to constant function. For every 𝑘 ∈ ℕ and 𝑥′ ∈ {0, 1}^𝑘,
show that there is an 𝑂(𝑘) line NAND-CIRC program that computes
the function EQUALS_{𝑥′} ∶ {0, 1}^𝑘 → {0, 1} that on input 𝑥 ∈ {0, 1}^𝑘
outputs 1 if and only if 𝑥 = 𝑥′.

Exercise 5.4 — Counting lower bound for multibit functions. Prove that there
exists a number 𝛿 > 0 such that for every sufficiently large 𝑛 and every
𝑚 there exists a function 𝑓 ∶ {0, 1}^𝑛 → {0, 1}^𝑚 that requires at least
𝛿𝑚 ⋅ 2^𝑛 /𝑛 NAND gates to compute. See footnote for hint.^10
■

Footnote 10: How many functions from {0, 1}^𝑛 to {0, 1}^𝑚 exist? Note that our definition of circuits requires each output to correspond to a unique gate, though that restriction can make at most an 𝑂(𝑚) additive difference in the number of gates.

Exercise 5.5 — Size hierarchy theorem for multibit functions. Prove that there
exists a number 𝐶 such that for every 𝑛, 𝑚 and 𝑛 + 𝑚 < 𝑠 < 𝑚 ⋅ 2^𝑛 /(𝐶𝑛)
there exists a function 𝑓 ∈ SIZE_{𝑛,𝑚}(𝐶 ⋅ 𝑠) ⧵ SIZE_{𝑛,𝑚}(𝑠). See footnote for hint.^11

Footnote 11: Follow the proof of Theorem 5.5, replacing the use of the counting argument with Exercise 5.4.

Exercise 5.6 — Efficient representation of circuits and a tighter counting upper
bound. Use the ideas of Remark 5.4 to show that for every 𝜖 > 0 and
sufficiently large 𝑠, 𝑛, 𝑚,

|SIZE_{𝑛,𝑚}(𝑠)| < 2^{(2+𝜖)𝑠 log 𝑠 + 𝑛 log 𝑛 + 𝑚 log 𝑠} .

Conclude that the implicit constant in Theorem 5.2 can be made arbitrarily close to 5. See footnote for hint.^12
■

Footnote 12: Using the adjacency list representation, a graph with 𝑛 in-degree zero vertices and 𝑠 in-degree two vertices can be represented using roughly 2𝑠 log(𝑠 + 𝑛) ≤ 2𝑠(log 𝑠 + 𝑂(1)) bits. The labeling of the 𝑛 input and 𝑚 output vertices can be specified by a list of 𝑛 labels in [𝑛] and 𝑚 labels in [𝑚].

Exercise 5.7 — Tighter counting lower bound. Prove that for every 𝛿 < 1/2, if
𝑛 is sufficiently large then there exists a function 𝑓 ∶ {0, 1}^𝑛 → {0, 1}
such that 𝑓 ∉ SIZE_{𝑛,1}(𝛿2^𝑛 /𝑛). See footnote for hint.^13

Footnote 13: Hint: Use the results of Exercise 5.6 and the fact that in this regime 𝑚 = 1 and 𝑛 ≪ 𝑠.


Exercise 5.8 — Random functions are hard. Suppose 𝑛 > 1000 and that we
choose a function 𝐹 ∶ {0, 1}^𝑛 → {0, 1} at random, choosing for every
𝑥 ∈ {0, 1}^𝑛 the value 𝐹 (𝑥) to be the result of tossing an independent
unbiased coin. Prove that the probability that there is a 2^𝑛 /(1000𝑛)
line program that computes 𝐹 is at most 2^{−100}.^14
■

Footnote 14: Hint: An equivalent way to say this is that you need to prove that the set of functions that can be computed using at most 2^𝑛 /(1000𝑛) lines has fewer than 2^{−100} ⋅ 2^{2^𝑛} elements. Can you see why?
Exercise 5.9 The following is a tuple representing a NAND program:
(3, 1, ((3, 2, 2), (4, 1, 1), (5, 3, 4), (6, 2, 1), (7, 6, 6), (8, 0, 0), (9, 7, 8), (10, 5, 0), (11, 9, 10))).

1. Write a table with the eight values 𝑃 (000), 𝑃 (001), 𝑃 (010), 𝑃 (011),
𝑃 (100), 𝑃 (101), 𝑃 (110), 𝑃 (111) in this order.

2. Describe what the program does in words.

Exercise 5.10 — EVAL with XOR. For every sufficiently large 𝑛, let 𝐸_𝑛 ∶
{0, 1}^{𝑛^2} → {0, 1} be the function that takes an 𝑛^2-length string that
encodes a pair (𝑃 , 𝑥) where 𝑥 ∈ {0, 1}^𝑛 and 𝑃 is a NAND program
of 𝑛 inputs, a single output, and at most 𝑛^{1.1} lines, and returns the
output of 𝑃 on 𝑥.^15 That is, 𝐸_𝑛 (𝑃 , 𝑥) = 𝑃 (𝑥).

Footnote 15: Note that if 𝑛 is big enough, then it is easy to represent such a pair using 𝑛^2 bits, since we can represent the program using 𝑂(𝑛^{1.1} log 𝑛) bits, and we can always pad our representation to have exactly 𝑛^2 length.

Prove that for every sufficiently large 𝑛, there does not exist an XOR
circuit 𝐶 that computes the function 𝐸𝑛 , where a XOR circuit has the
XOR gate as well as the constants 0 and 1 (see Exercise 3.5). That is,
prove that there is some constant 𝑛0 such that for every 𝑛 > 𝑛0 and
XOR circuit 𝐶 of 𝑛^2 inputs and a single output, there exists a pair
(𝑃 , 𝑥) such that 𝐶(𝑃 , 𝑥) ≠ 𝐸𝑛 (𝑃 , 𝑥).

Exercise 5.11 — Learning circuits (challenge, optional, assumes more background).


(This exercise assumes background in probability theory and/or
machine learning that you might not have at this point. Feel free
to come back to it at a later point and in particular after going over
Chapter 18.) In this exercise we will use our bound on the number of
circuits of size 𝑠 to show that (if we ignore the cost of computation)
every such circuit can be learned from not too many training samples.
Specifically, if we find a size-𝑠 circuit that classifies correctly a training
set of 𝑂(𝑠 log 𝑠) samples from some distribution 𝐷, then it is guaran-
teed to do well on the whole distribution 𝐷. Since Boolean circuits
model very many physical processes (maybe even all of them, if the
(controversial) physical extended Church-Turing thesis is true), this
shows that all such processes could be learned as well (again, ignor-
ing the computation cost of finding a classifier that does well on the
training data).
Let 𝐷 be any probability distribution over {0, 1}𝑛 and let 𝐶 be a
NAND circuit with 𝑛 inputs, one output, and size 𝑠 ≥ 𝑛. Prove that
there is some constant 𝑐 such that with probability at least 0.999 the
following holds: if 𝑚 = 𝑐𝑠 log 𝑠 and 𝑥0 , … , 𝑥𝑚−1 are chosen indepen-
dently from 𝐷, then for every circuit 𝐶 ′ such that 𝐶 ′ (𝑥𝑖 ) = 𝐶(𝑥𝑖 ) on
every 𝑖 ∈ [𝑚], Pr𝑥∼𝐷 [𝐶 ′ (𝑥) ≤ 𝐶(𝑥)] ≤ 0.99.
In other words, if 𝐶 ′ is a so called “empirical risk minimizer” that
agrees with 𝐶 on all the training examples 𝑥0 , … , 𝑥𝑛−1 , then it will
also agree with 𝐶 with high probability for samples drawn from the
distribution 𝐷 (i.e., it "generalizes", to use Machine-Learning lingo).
See footnote for hint.^16

Footnote 16: Hint: Use our bound on the number of programs/circuits of size 𝑠 (Theorem 5.2), as well as the Chernoff Bound (Theorem 18.12) and the union bound.

5.9 BIBLIOGRAPHICAL NOTES


The EVAL function is usually known as a universal circuit. The imple-
mentation we describe is not the most efficient known. Valiant [Val76]
first showed a universal circuit of size 𝑂(𝑛 log 𝑛) where 𝑛 is the size of
the input. Universal circuits have attracted renewed interest in recent years
due to their applications to cryptography; see [LMS16; GKS17].
While we’ve seen that “most” functions mapping 𝑛 bits to one bit
require circuits of exponential size Ω(2^𝑛 /𝑛), we actually do not know

of any explicit function for which we can prove that it requires, say, at
least 𝑛^100 or even 100𝑛 size. At the moment, the strongest such lower
bound we know is that there are quite simple and explicit 𝑛-variable
functions that require at least (5 − 𝑜(1))𝑛 lines to compute, see this
paper of Iwama et al as well as this more recent work of Kulikov et al.
Proving lower bounds for restricted models of circuits is an extremely
interesting research area, for which Jukna’s book [Juk12] (see also
Wegener [Weg87]) provides a very good introduction and overview. I
learned of the proof of the size hierarchy theorem (Theorem 5.5) from
Sasha Golovnev.
Scott Aaronson’s blog post on how information is physical is a good
discussion on issues related to the physical extended Church-Turing
Thesis. Aaronson's survey on NP complete problems and physical
reality [Aar05] discusses these issues as well, though it might be
easier to read after we reach Chapter 15 on NP and NP-completeness.
II
UNIFORM COMPUTATION
Learning Objectives:
• Define functions on unbounded length inputs,
that cannot be described by a finite size table
of inputs and outputs.
• Equivalence with the task of deciding
membership in a language.
• Deterministic finite automatons (optional): A
simple example for a model for unbounded
computation.

• Equivalence with regular expressions.

6
Functions with Infinite domains, Automata, and Regular expressions

"An algorithm is a finite answer to an infinite number of questions." Attributed to Stephen Kleene.

The model of Boolean circuits (or equivalently, the NAND-CIRC


programming language) has one very significant drawback: a Boolean
circuit can only compute a finite function 𝑓. In particular, since every
gate has two inputs, a size 𝑠 circuit can compute on an input of length
at most 2𝑠. Thus this model does not capture our intuitive notion of an
algorithm as a single recipe to compute a potentially infinite function.
For example, the standard elementary school multiplication algorithm
is a single algorithm that multiplies numbers of all lengths. However,
we cannot express this algorithm as a single circuit, but rather need a
different circuit (or equivalently, a NAND-CIRC program) for every
input length (see Fig. 6.1).
In this chapter, we extend our definition of computational tasks to
consider functions with the unbounded domain of {0, 1}∗ . We focus
on the question of defining what tasks to compute, mostly leaving
the question of how to compute them to later chapters, where we will
see Turing machines and other computational models for computing
on unbounded inputs. However, we will see one example of a sim-
ple restricted model of computation - deterministic finite automata
(DFAs).

Figure 6.1: Once you know how to multiply multi-digit numbers, you can do so for every number 𝑛 of digits, but if you had to describe multiplication using Boolean circuits or NAND-CIRC programs, you would need a different program/circuit for every length 𝑛 of the input.

This chapter: A non-mathy overview
In this chapter, we discuss functions that take as input strings
of arbitrary length. We will often focus on the special case
of Boolean functions, where the output is a single bit. These
are still infinite functions since their inputs have unbounded




length and hence such a function cannot be computed by any


single Boolean circuit.
In the second half of this chapter, we discuss finite automata,
a computational model that can compute unbounded length
functions. Finite automata are not as powerful as Python or
other general-purpose programming languages but can serve
as an introduction to these more general models. We also
show a beautiful result - the functions computable by finite
automata are precisely the ones that correspond to regular
expressions. However, the reader can also feel free to skip
automata and go straight to our discussion of Turing machines
in Chapter 7.

6.1 FUNCTIONS WITH INPUTS OF UNBOUNDED LENGTH


Up until now, we considered the computational task of mapping
some string of length 𝑛 into a string of length 𝑚. However, in gen-
eral, computational tasks can involve inputs of unbounded length.
For example, the following Python function computes the function
XOR ∶ {0, 1}∗ → {0, 1}, where XOR(𝑥) equals 1 iff the number of 1's
in 𝑥 is odd. (In other words, XOR(𝑥) = ∑_{𝑖=0}^{|𝑥|−1} 𝑥_𝑖 mod 2 for every
𝑥 ∈ {0, 1}∗ .) As simple as it is, the XOR function cannot be com-
puted by a Boolean circuit. Rather, for every 𝑛, we can compute XOR𝑛
(the restriction of XOR to {0, 1}𝑛 ) using a different circuit (e.g., see
Fig. 6.2).

def XOR(X):
'''Takes list X of 0's and 1's
Outputs 1 if the number of 1's is odd and outputs 0
↪ otherwise'''
result = 0
for i in range(len(X)):
result = (result + X[i]) % 2
return result

Previously in this book, we studied the computation of finite func-


tions 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 . Such a function 𝑓 can always be described
by listing all the 2^𝑛 values it takes on inputs 𝑥 ∈ {0, 1}^𝑛 . In this chap-
ter, we consider functions such as XOR that take inputs of unbounded
size. While we can describe XOR using a finite number of symbols
(in fact, we just did so above), it takes infinitely many possible inputs, and so we cannot just write down all of its values. The same is true for many other functions capturing important computational tasks, including addition, multiplication, sorting, finding paths in graphs, fitting curves to points, and so on. To contrast with the finite case, we will sometimes call a function 𝐹 ∶ {0, 1}∗ → {0, 1} (or 𝐹 ∶ {0, 1}∗ → {0, 1}∗ ) infinite. However, this does not mean that 𝐹 takes as input strings of infinite length! It just means that 𝐹 can take as input a string that can be arbitrarily long, and so we cannot simply write down a table of all the outputs of 𝐹 on different inputs.

Figure 6.2: The NAND circuit and NAND-CIRC program for computing the XOR of 5 bits. Note how the circuit for XOR_5 merely repeats four times the circuit to compute the XOR of 2 bits.

 Big Idea 8 A function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ specifies the computa-
tional task mapping an input 𝑥 ∈ {0, 1}∗ into the output 𝐹 (𝑥).

As we have seen before, restricting attention to functions that use


binary strings as inputs and outputs does not detract from our gener-
ality, since other objects, including numbers, lists, matrices, images,
videos, and more, can be encoded as binary strings.
As before, it is essential to differentiate between specification and
implementation. For example, consider the following function:

TWINP(𝑥) = ⎧ 1   if ∃𝑝 ∈ ℕ s.t. 𝑝, 𝑝 + 2 are primes and 𝑝 > |𝑥|
           ⎩ 0   otherwise

This is a mathematically well-defined function. For every 𝑥,


TWINP(𝑥) has a unique value which is either 0 or 1. However, at
the moment, no one knows of a Python program that computes this
function. The Twin prime conjecture posits that for every 𝑛 there
exists 𝑝 > 𝑛 such that both 𝑝 and 𝑝 + 2 are primes. If this conjecture
is true, then TWINP is easy to compute indeed - the program def TWINP(x):
return 1 will do the trick. However, mathematicians have tried
unsuccessfully to prove this conjecture since 1849. That said, whether
or not we know how to implement the function TWINP, the definition
above provides its specification.

6.1.1 Varying inputs and outputs


Many of the functions that interest us take more than one input. For
example, the function

MULT(𝑥, 𝑦) = 𝑥 ⋅ 𝑦
takes the binary representation of a pair of integers 𝑥, 𝑦 ∈ ℕ, and
outputs the binary representation of their product 𝑥 ⋅ 𝑦. However, since
we can represent a pair of strings as a single string, we will consider
functions such as MULT as mapping {0, 1}∗ to {0, 1}∗ . We will typi-
cally not be concerned with low-level details such as the precise way
to represent a pair of integers as a string, since virtually all choices will
be equivalent for our purposes.
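To make this concrete, here is one simple (and by no means canonical) way to encode a pair of bit strings as a single string in Python. This sketch is our own illustration and is not part of the original text: we double every bit of the first string and then use the pair 01 as a separator.

def encode_pair(x, y):
    # Double every bit of x, append the separator "01", then append y.
    # The doubled prefix consists only of the pairs "00" and "11", so the
    # first "01" at an even position unambiguously marks the separator.
    return "".join(2 * b for b in x) + "01" + y

def decode_pair(z):
    i = 0
    while z[i:i+2] != "01":
        i += 2
    return z[0:i:2], z[i+2:]

print(decode_pair(encode_pair("101", "0011")))  # ('101', '0011')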

Another example of a function we want to compute is


PALINDROME(𝑥) = ⎧ 1   if 𝑥_𝑖 = 𝑥_{|𝑥|−𝑖} for every 𝑖 ∈ [|𝑥|]
                ⎩ 0   otherwise
PALINDROME has a single bit as output. Functions with a single
bit of output are known as Boolean functions. Boolean functions are
central to the theory of computation, and we will discuss them often
in this book. Note that even though Boolean functions have a single
bit of output, their input can be of arbitrary length. Thus they are still
infinite functions that cannot be described via a finite table of values.
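Still, like XOR, such functions typically have a short description as a program. For instance, here is a one-line Python sketch of PALINDROME (our own illustration, treating the input as a Python string of bits):

def PALINDROME(x):
    # 1 iff x reads the same forwards and backwards
    return 1 if x == x[::-1] else 0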
“Booleanizing” functions. Sometimes it might be convenient to ob-
tain a Boolean variant for a non-Boolean function. For example, the
following is a Boolean variant of MULT.

BMULT(𝑥, 𝑦, 𝑖) = ⎧ 𝑖-th bit of 𝑥 ⋅ 𝑦   if 𝑖 < |𝑥 ⋅ 𝑦|
                 ⎩ 0                   otherwise
If we can compute BMULT via any programming language such as
Python, C, Java, etc., we can compute MULT as well, and vice versa.
Solved Exercise 6.1 — Booleanizing general functions. Show that for every
function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , there exists a Boolean function BF ∶
{0, 1}∗ → {0, 1} such that a Python program to compute BF can be
transformed into a program to compute 𝐹 and vice versa.

Solution:
For every 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , we can define

              ⎧ 𝐹 (𝑥)_𝑖   if 𝑖 < |𝐹 (𝑥)| and 𝑏 = 0
BF(𝑥, 𝑖, 𝑏) = ⎨ 1         if 𝑖 < |𝐹 (𝑥)| and 𝑏 = 1
              ⎩ 0         if 𝑖 ≥ |𝐹 (𝑥)|
to be the function that on input 𝑥 ∈ {0, 1}∗ , 𝑖 ∈ ℕ, 𝑏 ∈ {0, 1}
outputs the 𝑖𝑡ℎ bit of 𝐹 (𝑥) if 𝑏 = 0 and 𝑖 < |𝐹 (𝑥)|. If 𝑏 = 1, then
BF(𝑥, 𝑖, 𝑏) outputs 1 iff 𝑖 < |𝐹 (𝑥)| and hence this allows to compute
the length of 𝐹 (𝑥).
Computing BF from 𝐹 is straightforward. For the other direc-
tion, given a Python function BF that computes BF, we can compute
𝐹 as follows:

def F(x):
res = []
i = 0
while BF(x,i,1):
res.append(BF(x,i,0))
i += 1
return res
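For completeness, here is a sketch of the "straightforward" direction as well (our own illustration, assuming F is available as a Python function that returns a list of bits):

def BF(x, i, b):
    y = F(x)
    if i >= len(y):          # outside the range of F(x)
        return 0
    return 1 if b == 1 else y[i]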

6.1.2 Formal Languages


For every Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1}, we can define the set
𝐿𝐹 = {𝑥|𝐹 (𝑥) = 1} of strings on which 𝐹 outputs 1. Such sets are
known as languages. This name is rooted in formal language theory as
pursued by linguists such as Noam Chomsky. A formal language is a
subset 𝐿 ⊆ {0, 1}∗ (or more generally 𝐿 ⊆ Σ∗ for some finite alphabet
Σ). The membership or decision problem for a language 𝐿, is the task of
determining, given 𝑥 ∈ {0, 1}∗ , whether or not 𝑥 ∈ 𝐿. If we can com-
pute the function 𝐹 , then we can decide membership in the language
𝐿𝐹 and vice versa. Hence, many texts such as [Sip97] refer to the task
of computing a Boolean function as “deciding a language”. In this
book, we mostly describe computational tasks using the function nota-
tion, which is easier to generalize to computation with more than one
bit of output. However, since the language terminology is so popular
in the literature, we will sometimes mention it.

6.1.3 Restrictions of functions


If 𝐹 ∶ {0, 1}∗ → {0, 1} is a Boolean function and 𝑛 ∈ ℕ then the re-
striction of 𝐹 to inputs of length 𝑛, denoted as 𝐹𝑛 , is the finite function
𝑓 ∶ {0, 1}𝑛 → {0, 1} such that 𝑓(𝑥) = 𝐹 (𝑥) for every 𝑥 ∈ {0, 1}𝑛 . That
is, 𝐹𝑛 is the finite function that is only defined on inputs in {0, 1}𝑛 , but
agrees with 𝐹 on those inputs. Since 𝐹𝑛 is a finite function, it can be
computed by a Boolean circuit, implying the following theorem:

Theorem 6.1 — Circuit collection for every infinite function. Let 𝐹 ∶ {0, 1}∗ →
{0, 1}. Then there is a collection {𝐶𝑛 }𝑛∈{1,2,…} of circuits such that
for every 𝑛 > 0, 𝐶𝑛 computes the restriction 𝐹𝑛 of 𝐹 to inputs of
length 𝑛.

Proof. This is an immediate corollary of the universality of Boolean


circuits. Indeed, since 𝐹𝑛 maps {0, 1}𝑛 to {0, 1}, Theorem 4.15 implies
that there exists a Boolean circuit 𝐶𝑛 to compute it. In fact, the size of
this circuit is at most 𝑐 ⋅ 2^𝑛 /𝑛 gates for some constant 𝑐 ≤ 10.

In particular, Theorem 6.1 implies that there exists such a circuit


collection {𝐶𝑛 } even for the TWINP function we described before,
even though we do not know of any program to compute it. Indeed,
this is not that surprising: for every particular 𝑛 ∈ ℕ, TWINP𝑛 is either
the constant zero function or the constant one function, both of which

can be computed by very simple Boolean circuits. Hence a collection


of circuits {𝐶𝑛 } that computes TWINP certainly exists. The difficulty
in computing TWINP using Python or any other programming lan-
guage arises from the fact that we do not know for each particular 𝑛
what is the circuit 𝐶𝑛 in this collection.

6.2 DETERMINISTIC FINITE AUTOMATA (OPTIONAL)


All our computational models so far - Boolean circuits and straight-
line programs - were only applicable for finite functions.
In Chapter 7, we will present Turing machines, which are the central
models of computation for unbounded input length functions. How-
ever, in this section we present the more basic model of deterministic
finite automata (DFA). Automata can serve as a good stepping-stone for
Turing machines, though they will not be used much in later parts of
this book, and so the reader can feel free to skip ahead to Chapter 7.
DFAs turn out to be equivalent in power to regular expressions: a pow-
erful mechanism to specify patterns, which is widely used in practice.
Our treatment of automata is relatively brief. There are plenty of re-
sources that help you get more comfortable with DFAs. In particular,
Chapter 1 of Sipser’s book [Sip97] contains an excellent exposition of
this material. There are also many websites with online simulators for
automata, as well as translators from regular expressions to automata
and vice versa (see for example here and here).
At a high level, an algorithm is a recipe for computing an output
from an input via a combination of the following steps:

1. Read a bit from the input


2. Update the state (working memory)
3. Stop and produce an output

For example, recall the Python program that computes the XOR
function:

def XOR(X):
'''Takes list X of 0's and 1's
Outputs 1 if the number of 1's is odd and outputs 0
↪ otherwise'''
result = 0
for i in range(len(X)):
result = (result + X[i]) % 2
return result

In each step, this program reads a single bit X[i] and updates its
state result based on that bit (flipping result if X[i] is 1 and keep-
ing it the same otherwise). When it is done traversing the input,
the program outputs result. In computer science, such a program is


called a single-pass constant-memory algorithm since it makes a single
pass over the input and its working memory is finite. (Indeed, in this
case, result can either be 0 or 1.) Such an algorithm is also known as
a Deterministic Finite Automaton or DFA (another name for DFAs is a
finite state machine). We can think of such an algorithm as a “machine”
that can be in one of 𝐶 states, for some constant 𝐶. The machine starts
in some initial state and then reads its input 𝑥 ∈ {0, 1}∗ one bit at a
time. Whenever the machine reads a bit 𝜎 ∈ {0, 1}, it transitions into a
new state based on 𝜎 and its prior state. The output of the machine is
based on the final state. Every single-pass constant-memory algorithm
corresponds to such a machine. If an algorithm uses 𝑐 bits of mem-
ory, then the contents of its memory can be represented as a string
of length 𝑐. Therefore such an algorithm can be in one of at most 2^𝑐
states at any point in the execution.
We can specify a DFA of 𝐶 states by a list of 𝐶 ⋅ 2 rules. Each rule
will be of the form “If the DFA is in state 𝑣 and the bit read from the
input is 𝜎 then the new state is 𝑣′ ”. At the end of the computation,
we will also have a rule of the form “If the final state is one of the
following … then output 1, otherwise output 0”. For example, the
Python program above can be represented by a two-state automaton
for computing XOR of the following form:

• Initialize in the state 0.


• For every state 𝑠 ∈ {0, 1} and input bit 𝜎 read, if 𝜎 = 1 then change
to state 1 − 𝑠, otherwise stay in state 𝑠.
• At the end output 1 iff 𝑠 = 1.

We can also describe a 𝐶-state DFA as a labeled graph of 𝐶 vertices.


For every state 𝑠 and bit 𝜎, we add a directed edge labeled with 𝜎
between 𝑠 and the state 𝑠′ such that if the DFA is at state 𝑠 and reads 𝜎
then it transitions to 𝑠′ . (If the state stays the same then this edge will
be a self-loop; similarly, if 𝑠 transitions to 𝑠′ in both the case 𝜎 = 0 and
𝜎 = 1 then the graph will contain two parallel edges.) We also label
the set 𝒮 of states on which the automaton will output 1 at the end of
the computation. This set is known as the set of accepting states. See
Fig. 6.3 for the graphical representation of the XOR automaton.
Formally, a DFA is specified by (1) the table of the 𝐶 ⋅ 2 rules, which
can be represented as a transition function 𝑇 that maps a state 𝑠 ∈ [𝐶]
and bit 𝜎 ∈ {0, 1} to the state 𝑠′ ∈ [𝐶] which the DFA will transition to
from state 𝑠 on input 𝜎 and (2) the set 𝒮 of accepting states. This leads
to the following definition.
Figure 6.3: A deterministic finite automaton that
computes the XOR function. It has two states 0 and 1,
and when it observes 𝜎 it transitions from 𝑣 to 𝑣 ⊕ 𝜎.

Definition 6.2 — Deterministic Finite Automaton. A deterministic finite
automaton (DFA) with 𝐶 states over {0, 1} is a pair (𝑇 , 𝒮) with
𝑇 ∶ [𝐶] × {0, 1} → [𝐶] and 𝒮 ⊆ [𝐶]. The finite function 𝑇 is known
as the transition function of the DFA. The set 𝒮 is known as the set of
accepting states.
Let 𝐹 ∶ {0, 1}∗ → {0, 1} be a Boolean function with the infinite
domain {0, 1}∗ . We say that (𝑇 , 𝒮) computes a function 𝐹 ∶ {0, 1}∗ →
{0, 1} if for every 𝑛 ∈ ℕ and 𝑥 ∈ {0, 1}𝑛 , if we define 𝑠0 = 0 and
𝑠𝑖+1 = 𝑇 (𝑠𝑖 , 𝑥𝑖 ) for every 𝑖 ∈ [𝑛], then

𝑠𝑛 ∈ 𝒮 ⇔ 𝐹 (𝑥) = 1
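To make Definition 6.2 concrete, the following short Python sketch (our own illustration, not part of the original text) simulates a DFA given as a pair (𝑇 , 𝒮), and instantiates it with the two-state XOR automaton described above; the names run_dfa, T_xor, and S_xor are ours.

def run_dfa(T, S, x):
    # Start at state 0, apply the transition function on each input bit,
    # and accept iff the final state lies in the set S of accepting states.
    s = 0
    for bit in x:
        s = T[(s, bit)]
    return 1 if s in S else 0

# The two-state XOR automaton: reading a 1 flips the state, reading a 0 keeps it.
T_xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
S_xor = {1}

print(run_dfa(T_xor, S_xor, [1, 1, 0, 1]))  # 1 (odd number of 1's)
print(run_dfa(T_xor, S_xor, [1, 1]))        # 0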

P
Make sure not to confuse the transition function of
an automaton (𝑇 in Definition 6.2), which is a finite
function specifying the table of “rules” which it fol-
lows, with the function the automaton computes (𝐹 in
Definition 6.2) which is an infinite function.

R
Remark 6.3 — Definitions in other texts. Deterministic
finite automata can be defined in several equivalent
ways. In particular Sipser [Sip97] defines a DFA as a
five-tuple (𝑄, Σ, 𝛿, 𝑞0 , 𝐹 ) where 𝑄 is the set of states,
Σ is the alphabet, 𝛿 is the transition function, 𝑞0 is
the initial state, and 𝐹 is the set of accepting states.
In this book the set of states is always of the form
𝑄 = {0, … , 𝐶 − 1} and the initial state is always 𝑞0 = 0,
but this makes no difference to the computational
power of these models. Also, we restrict our attention
to the case that the alphabet Σ is equal to {0, 1}.

Solved Exercise 6.2 — DFA for (010)∗ . Prove that there is a DFA that com-
putes the following function 𝐹 :

𝐹 (𝑥) = ⎧ 1   if 3 divides |𝑥| and 𝑥_{3𝑖} 𝑥_{3𝑖+1} 𝑥_{3𝑖+2} = 010 for every 𝑖 ∈ [|𝑥|/3]
        ⎩ 0   otherwise

Solution:
When asked to construct a deterministic finite automaton, it is
often useful to start by constructing a single-pass constant-memory

algorithm using a more general formalism (for example, using


pseudocode or a Python program). Once we have such an algo-
rithm, we can mechanically translate it into a DFA. Here is a simple
Python program for computing 𝐹 :

def F(X):
'''Return 1 iff X is a concatenation of zero/more
↪ copies of [0,1,0]'''
if len(X) % 3 != 0:
return False
ultimate = 0
penultimate = 1
antepenultimate = 0
for idx, b in enumerate(X):
antepenultimate = penultimate
penultimate = ultimate
ultimate = b
if idx % 3 == 2 and ((antepenultimate,
↪ penultimate, ultimate) != (0,1,0)):
return False
return True

Since we keep three Boolean variables, the working memory can


be in one of 2^3 = 8 configurations, and so the program above can
be directly translated into an 8 state DFA. While this is not needed
to solve the question, by examining the resulting DFA, we can see
that we can merge some states and obtain a 4 state automaton, de-
scribed in Fig. 6.4. See also Fig. 6.5, which depicts the execution of
this DFA on a particular input.
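For readers who want to check their understanding, here is one possible transition table for such a four-state automaton, written in Python. This is our own reconstruction, consistent with the description above (state 0 is the start and only accepting state, and one state acts as a rejecting "sink"); the actual Fig. 6.4 may label the states differently.

# States: 0 = just finished a copy of 010 (start/accept), 1 = saw '0',
# 2 = saw '01', 3 = rejecting sink.
T = {(0, 0): 1, (0, 1): 3,
     (1, 0): 3, (1, 1): 2,
     (2, 0): 0, (2, 1): 3,
     (3, 0): 3, (3, 1): 3}
S = {0}

def run(x):
    s = 0
    for bit in x:
        s = T[(s, bit)]
    return 1 if s in S else 0

print(run([0, 1, 0, 0, 1, 0]))  # 1
print(run([0, 1, 0, 0, 1]))     # 0 (length not divisible by 3)
print(run([0, 1, 1]))           # 0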

6.2.1 Anatomy of an automaton (finite vs. unbounded)


Now that we are considering computational tasks with unbounded
input sizes, it is crucial to distinguish between the components of our
algorithm that have fixed length and the components that grow with
the input size. For the case of DFAs these are the following:

Figure 6.4: A DFA that outputs 1 only on inputs 𝑥 ∈ {0, 1}∗ that are a concatenation of zero or more copies of 010. The state 0 is both the starting state and the only accepting state. The table denotes the transition function 𝑇 , which maps the current state and symbol read to the new state.

Constant size components: Given a DFA 𝐴, the following quantities are
fixed independent of the input size:

• The number of states 𝐶 in 𝐴.

• The transition function 𝑇 (which has 2𝐶 inputs, and so can be specified by a table of 2𝐶 rows, each entry in which is a number in [𝐶]).

• The set 𝒮 ⊆ [𝐶] of accepting states. This set can be described by a


string in {0, 1}^𝐶 specifying which states are in 𝒮 and which are not.

Together the above means that we can fully describe an automaton


using finitely many symbols. This is a property we require out of any
notion of “algorithm”: we should be able to write down a complete
specification of how it produces an output from an input.

Components of unbounded size: The following quantities relating to a
DFA are not bounded by any constant. We stress that these are still
finite for any given input.

• The length of the input 𝑥 ∈ {0, 1}∗ that the DFA is provided. The
input length is always finite, but not a priori bounded.

• The number of steps that the DFA takes can grow with the length of
the input. Indeed, a DFA makes a single pass on the input and so it
takes precisely |𝑥| steps on an input 𝑥 ∈ {0, 1}∗ .

Figure 6.5: Execution of the DFA of Fig. 6.4. The number of states and the transition function size are bounded, but the input can be arbitrarily long. If the DFA is at state 𝑠 and observes the value 𝜎 then it moves to the state 𝑇 (𝑠, 𝜎). At the end of the execution the DFA accepts iff the final state is in 𝒮.

6.2.2 DFA-computable functions


We say that a function 𝐹 ∶ {0, 1}∗ → {0, 1} is DFA computable if there
exists some DFA that computes 𝐹 . In Chapter 4 we saw that every
finite function is computable by some Boolean circuit. Thus, at this
point, you might expect that every infinite function is computable by
some DFA. However, this is very much not the case. We will soon see
some simple examples of infinite functions that are not computable by
DFAs, but for starters, let us prove that such functions exist.

Theorem 6.4 — DFA-computable functions are countable. Let DFACOMP be


the set of all Boolean functions 𝐹 ∶ {0, 1}∗ → {0, 1} such that there
exists a DFA computing 𝐹 . Then DFACOMP is countable.

Proof Idea:

Every DFA can be described by a finite length string, which yields


an onto map from {0, 1}∗ to DFACOMP: namely, the function that
maps a string describing an automaton 𝐴 to the function that it com-
putes.

Proof of Theorem 6.4. Every DFA can be described by a finite string,


representing the transition function 𝑇 and the set of accepting states,
and every DFA 𝐴 computes some function 𝐹 ∶ {0, 1}∗ → {0, 1}. Thus
we can define the following function 𝑆𝑡𝐷𝐶 ∶ {0, 1}∗ → DFACOMP:


𝑆𝑡𝐷𝐶(𝑎) = ⎧ 𝐹     if 𝑎 represents an automaton 𝐴 and 𝐹 is the function 𝐴 computes
           ⎩ ONE   otherwise

where ONE ∶ {0, 1}∗ → {0, 1} is the constant function that outputs
1 on all inputs (and is a member of DFACOMP). Since by definition,
every function 𝐹 in DFACOMP is computable by some automaton,
𝑆𝑡𝐷𝐶 is an onto function from {0, 1}∗ to DFACOMP, which means
that DFACOMP is countable (see Section 2.4.2).

Since the set of all Boolean functions is uncountable, we get the


following corollary:

Theorem 6.5 — Existence of DFA-uncomputable functions. There exists a
Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1} that is not computable by any
DFA.

Proof. If every Boolean function 𝐹 is computable by some DFA, then


DFACOMP equals the set ALL of all Boolean functions, but by Theo-
rem 2.12, the latter set is uncountable, contradicting Theorem 6.4.

6.3 REGULAR EXPRESSIONS


Searching for a piece of text is a common task in computing. At its
heart, the search problem is quite simple. We have a collection 𝑋 =
{𝑥0 , … , 𝑥𝑘 } of strings (e.g., files on a hard-drive, or student records in
a database), and the user wants to find out the subset of all the 𝑥 ∈ 𝑋
that are matched by some pattern (e.g., all files whose names end with
the string .txt). In full generality, we can allow the user to specify the
pattern by specifying a (computable) function 𝐹 ∶ {0, 1}∗ → {0, 1},
where 𝐹 (𝑥) = 1 corresponds to the pattern matching 𝑥. That is, the
user provides a program 𝑃 in a programming language such as Python,
and the system returns all 𝑥 ∈ 𝑋 such that 𝑃 (𝑥) = 1. For example,

one could search for all text files that contain the string important
document or perhaps (letting 𝑃 correspond to a neural-network based
classifier) all images that contain a cat. However, we don’t want our
system to get into an infinite loop just trying to evaluate the program
𝑃 ! For this reason, typical systems for searching files or databases do
not allow users to specify the patterns using full-fledged programming
languages. Rather, such systems use restricted computational models that
on the one hand are rich enough to capture many of the queries needed
in practice (e.g., all filenames ending with .txt, or all phone numbers
of the form (617) xxx-xxxx), but on the other hand are restricted
enough so that queries can be evaluated very efficiently on huge files
and in particular cannot result in an infinite loop.
One of the most popular such computational models is regular
expressions. If you ever used an advanced text editor, a command-line
shell, or have done any kind of manipulation of text files, then you
have probably come across regular expressions.
A regular expression over some alphabet Σ is obtained by combin-
ing elements of Σ with the operation of concatenation, as well as |
(corresponding to or) and ∗ (corresponding to repetition zero or
more times). (Common implementations of regular expressions in
programming languages and shells typically include some extra oper-
ations on top of | and ∗, but these operations can be implemented as
“syntactic sugar” using the operators | and ∗.) For example, the fol-
lowing regular expression over the alphabet {0, 1} corresponds to the
set of all strings 𝑥 ∈ {0, 1}∗ where every digit is repeated at least twice:

(00(0∗ )|11(1∗ ))∗ .

The following regular expression over the alphabet {𝑎, … , 𝑧, 0, … , 9}


corresponds to the set of all strings that consist of a sequence of one
or more of the letters 𝑎-𝑑 followed by a sequence of one or more digits
(without a leading zero):

(𝑎|𝑏|𝑐|𝑑)(𝑎|𝑏|𝑐|𝑑)∗ (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗ . (6.1)
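Readers who want to experiment can try out such expressions directly in Python, whose re module uses the same |, ∗, and parentheses operators (plus many extras we will not need). The following snippet is our own illustration and not part of the original text:

import re

# Strings over {0,1} in which every run of a digit has length at least two:
exp1 = "(00(0*)|11(1*))*"
print(bool(re.fullmatch(exp1, "110011")))    # True
print(bool(re.fullmatch(exp1, "101")))       # False

# One or more of the letters a-d, then digits with no leading zero:
exp2 = "(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"
print(bool(re.fullmatch(exp2, "abc12078")))  # True
print(bool(re.fullmatch(exp2, "abc0123")))   # False (leading zero)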

Formally, regular expressions are defined by the following recursive


definition:

Definition 6.6 — Regular expression. A regular expression 𝑒 over an al-
phabet Σ is a string over Σ ∪ {(, ), |, ∗, ∅, ""} that has one of the
following forms:

1. 𝑒 = 𝜎 where 𝜎 ∈ Σ

2. 𝑒 = (𝑒′ |𝑒″ ) where 𝑒′ , 𝑒″ are regular expressions.



3. 𝑒 = (𝑒′ )(𝑒″ ) where 𝑒′ , 𝑒″ are regular expressions. (We often


drop the parentheses when there is no danger of confusion and
so write this as 𝑒′ 𝑒″ .)

4. 𝑒 = (𝑒′ )∗ where 𝑒′ is a regular expression.

Finally we also allow the following “edge cases”: 𝑒 = ∅ and


𝑒 = "". These are the regular expressions corresponding to accept-
ing no strings, and accepting only the empty string respectively.

We will drop parentheses when they can be inferred from the


context. We also use the convention that OR and concatenation are
left-associative, and we give highest precedence to ∗, then concate-
nation, and then OR. Thus for example we write 00∗ |11 instead of
((0)(0∗ ))|((1)(1)).
Every regular expression 𝑒 corresponds to a function Φ𝑒 ∶ Σ∗ →
{0, 1} where Φ𝑒 (𝑥) = 1 if 𝑥 matches the regular expression. For exam-
ple, if 𝑒 = (00|11)∗ then Φ𝑒 (110011) = 1 but Φ𝑒 (101) = 0 (can you see
why?).

P
The formal definition of Φ𝑒 is one of those definitions
that is more cumbersome to write than to grasp. Thus
it might be easier for you first to work out the defini-
tion on your own, and then check that it matches what
is written below.

Definition 6.7 — Matching a regular expression. Let 𝑒 be a regular expres-
sion over the alphabet Σ. The function Φ𝑒 ∶ Σ∗ → {0, 1} is defined
as follows:

1. If 𝑒 = 𝜎 then Φ𝑒 (𝑥) = 1 iff 𝑥 = 𝜎.

2. If 𝑒 = (𝑒′ |𝑒″ ) then Φ𝑒 (𝑥) = Φ𝑒′ (𝑥)∨Φ𝑒″ (𝑥) where ∨ is the OR op-
erator.

3. If 𝑒 = (𝑒′ )(𝑒″ ) then Φ𝑒 (𝑥) = 1 iff there is some 𝑥′ , 𝑥″ ∈ Σ∗ such


that 𝑥 is the concatenation of 𝑥′ and 𝑥″ and Φ𝑒′ (𝑥′ ) = Φ𝑒″ (𝑥″ ) =
1.

4. If 𝑒 = (𝑒′ )∗ then Φ𝑒 (𝑥) = 1 iff there is some 𝑘 ∈ ℕ and some


𝑥0 , … , 𝑥𝑘−1 ∈ Σ∗ such that 𝑥 is the concatenation 𝑥0 ⋯ 𝑥𝑘−1 and
Φ𝑒′ (𝑥𝑖 ) = 1 for every 𝑖 ∈ [𝑘].

5. Finally, for the edge cases Φ∅ is the constant zero function, and
Φ"" is the function that only outputs 1 on the empty string "".

We say that a regular expression 𝑒 over Σ matches a string 𝑥 ∈ Σ∗


if Φ𝑒 (𝑥) = 1.

P
The definitions above are not inherently difficult but
are a bit cumbersome. So you should pause here and
go over it again until you understand why it corre-
sponds to our intuitive notion of regular expressions.
This is important not just for understanding regular
expressions themselves (which are used time and
again in a great many applications) but also for get-
ting better at understanding recursive definitions in
general.

A Boolean function is called “regular” if it outputs 1 on precisely


the set of strings that are matched by some regular expression. That is,

Definition 6.8 — Regular functions / languages. Let Σ be a finite set and
𝐹 ∶ Σ∗ → {0, 1} be a Boolean function. We say that 𝐹 is regular if
𝐹 = Φ𝑒 for some regular expression 𝑒.


Similarly, for every formal language 𝐿 ⊆ Σ∗ , we say that 𝐿 is reg-
ular if and only if there is a regular expression 𝑒 such that 𝑥 ∈ 𝐿 iff
𝑒 matches 𝑥.

■ Example 6.9 — A regular function. Let Σ = {𝑎, 𝑏, 𝑐, 𝑑, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}


and 𝐹 ∶ Σ∗ → {0, 1} be the function such that 𝐹 (𝑥) outputs 1 iff
𝑥 consists of one or more of the letters 𝑎-𝑑 followed by a sequence
of one or more digits (without a leading zero). Then 𝐹 is a regular
function, since 𝐹 = Φ𝑒 where

𝑒 = (𝑎|𝑏|𝑐|𝑑)(𝑎|𝑏|𝑐|𝑑)∗ (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗

is the expression we saw in (6.1).


If we wanted to verify, for example, that Φ𝑒 (𝑎𝑏𝑐12078) = 1,
we can do so by noticing that the expression (𝑎|𝑏|𝑐|𝑑) matches the
string 𝑎, (𝑎|𝑏|𝑐|𝑑)∗ matches 𝑏𝑐, (1|2|3|4|5|6|7|8|9) matches the string
1, and the expression (0|1|2|3|4|5|6|7|8|9)∗ matches the string 2078.
Each one of those boils down to a simpler expression. For example,
the expression (𝑎|𝑏|𝑐|𝑑)∗ matches the string 𝑏𝑐 because both of the
one-character strings 𝑏 and 𝑐 are matched by the expression 𝑎|𝑏|𝑐|𝑑.

Regular expression can be defined over any finite alphabet Σ, but


as usual, we will mostly focus our attention on the binary case, where
Σ = {0, 1}. Most (if not all) of the theoretical and practical general

insights about regular expressions can be gleaned from studying the


binary case.

6.3.1 Algorithms for matching regular expressions


Regular expressions would not be very useful for search if we could
not evaluate, given a regular expression 𝑒, whether a string 𝑥 is
matched by 𝑒. Luckily, there is an algorithm to do so. Specifically,
there is an algorithm (think “Python program” though later we
will formalize the notion of algorithms using Turing machines) that
on input a regular expression 𝑒 over the alphabet {0, 1} and a string
𝑥 ∈ {0, 1}∗ , outputs 1 iff 𝑒 matches 𝑥 (i.e., outputs Φ𝑒 (𝑥)).
Indeed, Definition 6.7 actually specifies a recursive algorithm for
computing Φ𝑒 . Specifically, each one of our operations -concatenation,
OR, and star- can be thought of as reducing the task of testing whether
an expression 𝑒 matches a string 𝑥 to testing whether some sub-
expressions of 𝑒 match substrings of 𝑥. Since these sub-expressions
are always shorter than the original expression, this yields a recursive
algorithm for checking if 𝑒 matches 𝑥, which will eventually terminate
at the base cases of the expressions that correspond to a single symbol
or the empty string.

Algorithm 6.10 — Regular expression matching.

Input: Regular expression 𝑒 over Σ∗ , 𝑥 ∈ Σ∗


Output: Φ𝑒 (𝑥)
1: procedure Match(𝑒,𝑥)
2: if 𝑒 = ∅ then return 0 ;
3: if 𝑥 = "" then return MatchE mp ty(𝑒) ;
4: if 𝑒 ∈ Σ then return 1 iff 𝑥 = 𝑒 ;
5: if 𝑒 = (𝑒′ |𝑒″ ) then return Match(𝑒′ , 𝑥) or Match(𝑒″ , 𝑥)
;
6: if 𝑒 = (𝑒′ )(𝑒″ ) then
7: for 𝑖 ∈ [|𝑥|] do
8: if Match(𝑒′ , 𝑥0 ⋯ 𝑥𝑖−1 ) and Match(𝑒″ , 𝑥𝑖 ⋯ 𝑥|𝑥|−1 )
then return 1 ;
9: end for
10: end if
11: if 𝑒 = (𝑒′ )∗ then
12: if 𝑒′ = "" then return Match("", 𝑥) ;
13: # ("")∗ is the same as ""
14: for 𝑖 ∈ [|𝑥|] do
15: # 𝑥0 ⋯ 𝑥𝑖−1 is shorter than 𝑥
16: if Match(𝑒, 𝑥0 ⋯ 𝑥𝑖−1 ) and Match(𝑒′ , 𝑥𝑖 ⋯ 𝑥|𝑥|−1 )
then return 1 ;
17: end for
18: end if
19: return 0
20: end procedure

We assume above that we have a procedure MatchEmpty that


on input a regular expression 𝑒 outputs 1 if and only if 𝑒 matches the
empty string "".
The key observation is that in our recursive definition of regular ex-
pressions, whenever 𝑒 is made up of one or two expressions 𝑒′ , 𝑒″ then
these two regular expressions are smaller than 𝑒. Eventually (when
they have size 1) then they must correspond to the non-recursive
case of a single alphabet symbol. Correspondingly, the recursive calls
made in Algorithm 6.10 always correspond to a shorter expression or
(in the case of an expression of the form (𝑒′ )∗ ) a shorter input string.
Thus, we can prove the correctness of Algorithm 6.10 on inputs of the
form (𝑒, 𝑥) by induction over min{|𝑒|, |𝑥|}. The base case is when ei-
ther 𝑥 = "" or 𝑒 is a single alphabet symbol, "" or ∅. In the case the
expression is of the form 𝑒 = (𝑒′ |𝑒″ ) or 𝑒 = (𝑒′ )(𝑒″ ), we make recursive
calls with the shorter expressions 𝑒′ , 𝑒″ . In the case the expression is of
the form 𝑒 = (𝑒′ )∗ , we make recursive calls with either a shorter string

𝑥 and the same expression, or with the shorter expression 𝑒′ and a


string 𝑥′ that is equal in length or shorter than 𝑥.
Solved Exercise 6.3 — Match the empty string. Give an algorithm that on
input a regular expression 𝑒, outputs 1 if and only if Φ𝑒 ("") = 1.

Solution:
We can obtain such a recursive algorithm by using the following
observations:

1. An expression of the form "" or (𝑒′ )∗ always matches the empty


string.

2. An expression of the form 𝜎, where 𝜎 ∈ Σ is an alphabet sym-


bol, never matches the empty string.

3. The regular expression ∅ does not match the empty string.

4. An expression of the form 𝑒′ |𝑒″ matches the empty string if and


only if one of 𝑒′ or 𝑒″ matches it.

5. An expression of the form (𝑒′ )(𝑒″ ) matches the empty string if


and only if both 𝑒′ and 𝑒″ match it.

Given the above observations, we see that the following algo-


rithm will check if 𝑒 matches the empty string:

Algorithm 6.11 — Check for empty string.

Input: Regular expression 𝑒 over Σ∗
Output: 1 iff 𝑒 matches the empty string.
1: procedure MatchEmpty(𝑒)
2: if 𝑒 = "" then return 1 ;
3: if 𝑒 = ∅ or 𝑒 ∈ Σ then return 0 ;
4: if 𝑒 = (𝑒′ |𝑒″ ) then return MatchEmpty(𝑒′ ) or MatchEmpty(𝑒″ ) ;
5: if 𝑒 = (𝑒′ )(𝑒″ ) then return MatchEmpty(𝑒′ ) and MatchEmpty(𝑒″ ) ;
6: if 𝑒 = (𝑒′ )∗ then return 1 ;
7: end procedure
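To see the two algorithms in action, here is a compact Python sketch of Algorithm 6.10 and Algorithm 6.11 (our own illustration, not part of the original text). It uses a simple tuple representation of regular expressions - ('sym', σ), ('or', e1, e2), ('concat', e1, e2), ('star', e1), ('empty',) for ∅, and ('eps',) for "" - chosen only for illustration, and it aims for clarity rather than efficiency.

def match_empty(e):
    # Algorithm 6.11: does e match the empty string?
    kind = e[0]
    if kind in ('eps', 'star'):
        return True
    if kind in ('empty', 'sym'):
        return False
    if kind == 'or':
        return match_empty(e[1]) or match_empty(e[2])
    if kind == 'concat':
        return match_empty(e[1]) and match_empty(e[2])

def match(e, x):
    # Algorithm 6.10: does e match the string x?
    kind = e[0]
    if kind == 'empty':
        return False
    if x == "":
        return match_empty(e)
    if kind == 'eps':
        return False                      # x is nonempty here
    if kind == 'sym':
        return x == e[1]
    if kind == 'or':
        return match(e[1], x) or match(e[2], x)
    if kind == 'concat':
        # try every split of x (including splits where one part is empty)
        return any(match(e[1], x[:i]) and match(e[2], x[i:])
                   for i in range(len(x) + 1))
    if kind == 'star':
        # a strictly shorter prefix matched by the starred expression,
        # followed by a nonempty suffix matched by the inner expression
        return any(match(e, x[:i]) and match(e[1], x[i:])
                   for i in range(len(x)))

# (00|11)* as a tuple expression:
e = ('star', ('or', ('concat', ('sym', '0'), ('sym', '0')),
              ('concat', ('sym', '1'), ('sym', '1'))))
print(match(e, "110011"))  # True
print(match(e, "101"))     # False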

6.4 EFFICIENT MATCHING OF REGULAR EXPRESSIONS (OP-


TIONAL)
Algorithm 6.10 is not very efficient. For example, given an expression
involving concatenation or the “star” operation and a string of length

𝑛, it can make 𝑛 recursive calls, and hence it can be shown that in the
worst case Algorithm 6.10 can take time exponential in the length of
the input string 𝑥. Fortunately, it turns out that there is a much more
efficient algorithm that can match regular expressions in linear (i.e.,
𝑂(𝑛)) time. Since we have not yet covered the topics of time and space
complexity, we describe this algorithm in high level terms, without
making the computational model precise. Rather we will use the
colloquial notion of 𝑂(𝑛) running time as used in introduction to
programming courses and whiteboard coding interviews. We will see
a formal definition of time complexity in Chapter 13.

Theorem 6.12 — Matching regular expressions in linear time. Let 𝑒 be a
regular expression. Then there is an 𝑂(𝑛) time algorithm that
computes Φ𝑒 .

The implicit constant in the 𝑂(𝑛) term of Theorem 6.12 depends on


the expression 𝑒. Thus, another way to state Theorem 6.12 is that for
every expression 𝑒, there is some constant 𝑐 and an algorithm 𝐴 that
computes Φ𝑒 on 𝑛-bit inputs using at most 𝑐 ⋅ 𝑛 steps. This makes sense
since in practice we often want to compute Φ𝑒 (𝑥) for a small regular
expression 𝑒 and a large document 𝑥. Theorem 6.12 tells us that we
can do so with running time that scales linearly with the size of the
document, even if it has (potentially) worse dependence on the size of
the regular expression.
We prove Theorem 6.12 by obtaining a more efficient recursive al-
gorithm, that determines whether 𝑒 matches a string 𝑥 ∈ {0, 1}𝑛 by
reducing this task to determining whether a related expression 𝑒′
matches 𝑥0 , … , 𝑥𝑛−2 . This will result in an expression for the running
time of the form 𝑇 (𝑛) = 𝑇 (𝑛 − 1) + 𝑂(1) which solves to 𝑇 (𝑛) = 𝑂(𝑛).

Restrictions of regular expressions. The central definition for the algo-


rithm behind Theorem 6.12 is the notion of a restriction of a regular
expression. The idea is that for every regular expression 𝑒 and sym-
bol 𝜎 in its alphabet, it is possible to define a regular expression 𝑒[𝜎]
such that 𝑒[𝜎] matches a string 𝑥 if and only if 𝑒 matches the string 𝑥𝜎.
For example, if 𝑒 is the regular expression (01)∗ (01) (i.e., one or more
occurrences of 01) then 𝑒[1] is equal to (01)∗ 0 and 𝑒[0] will be ∅. (Can
you see why?)
Algorithm 6.13 computes the restriction 𝑒[𝜎] given a regular ex-
pression 𝑒 and an alphabet symbol 𝜎. It always terminates, since the
recursive calls it makes are always on expressions smaller than the
input expression. Its correctness can be proven by induction on the
length of the regular expression 𝑒, with the base cases being when 𝑒 is
"", ∅, or a single alphabet symbol 𝜏 .

Algorithm 6.13 — Restricting regular expression.

Input: Regular expression 𝑒 over Σ, symbol 𝜎 ∈ Σ


Output: Regular expression 𝑒′ = 𝑒[𝜎] such that Φ𝑒′ (𝑥) =
Φ𝑒 (𝑥𝜎) for every 𝑥 ∈ Σ∗
1: procedure Restrict(𝑒,𝜎)
2: if 𝑒 = "" or 𝑒 = ∅ then return ∅ ;
3: if 𝑒 = 𝜏 for 𝜏 ∈ Σ then return "" if 𝜏 = 𝜎 and return ∅ otherwise ;
4: if 𝑒 = (𝑒′ |𝑒″ ) then return (Restrict(𝑒′ , 𝜎)|Restrict(𝑒″ , 𝜎)) ;
5: if 𝑒 = (𝑒′ )∗ then return (𝑒′ )∗ (Restrict(𝑒′ , 𝜎)) ;
6: if 𝑒 = (𝑒′ )(𝑒″ ) and Φ𝑒″ ("") = 0 then return (𝑒′ )(Restrict(𝑒″ , 𝜎)) ;
7: if 𝑒 = (𝑒′ )(𝑒″ ) and Φ𝑒″ ("") = 1 then return (𝑒′ Restrict(𝑒″ , 𝜎)) | Restrict(𝑒′ , 𝜎) ;
8: end procedure

Using this notion of restriction, we can define the following recur-


sive algorithm for regular expression matching:

Algorithm 6.14 — Regular expression matching in linear time.

Input: Regular expression 𝑒 over Σ∗ , 𝑥 ∈ Σ𝑛 where 𝑛 ∈ ℕ


Output: Φ𝑒 (𝑥)
1: procedure FMatch(𝑒,𝑥)
2: if 𝑥 = "" then return MatchE mp ty(𝑒) ;
3: Let 𝑒′ ← Re stri c t(𝑒, 𝑥𝑛−1 )
4: return F Match(𝑒′ , 𝑥0 ⋯ 𝑥𝑛−2 )
5: end procedure

By the definition of a restriction, for every 𝜎 ∈ Σ and 𝑥′ ∈ Σ∗ ,


the expression 𝑒 matches 𝑥′ 𝜎 if and only if 𝑒[𝜎] matches 𝑥′ . Hence for
every 𝑒 and 𝑥 ∈ Σ𝑛 , Φ𝑒[𝑥𝑛−1 ] (𝑥0 ⋯ 𝑥𝑛−2 ) = Φ𝑒 (𝑥) and Algorithm 6.14
does return the correct answer. The only remaining task is to analyze
its running time. Note that Algorithm 6.14 uses the MatchEmpty
procedure of Solved Exercise 6.3 in the base case that 𝑥 = "". However,
this is OK since this procedure’s running time depends only on 𝑒 and
is independent of the length of the original input.
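Continuing the tuple-based sketch given earlier in this chapter (again our own illustration, not part of the original text, and assuming the match_empty function and the (00|11)∗ expression e defined there), the restriction operation and the resulting matching algorithm can be written as follows:

def restrict(e, sigma):
    # Algorithm 6.13: return e[sigma], which matches x iff e matches x + sigma.
    kind = e[0]
    if kind in ('eps', 'empty'):
        return ('empty',)
    if kind == 'sym':
        return ('eps',) if e[1] == sigma else ('empty',)
    if kind == 'or':
        return ('or', restrict(e[1], sigma), restrict(e[2], sigma))
    if kind == 'star':
        return ('concat', e, restrict(e[1], sigma))
    if kind == 'concat':
        left = ('concat', e[1], restrict(e[2], sigma))
        if match_empty(e[2]):
            return ('or', left, restrict(e[1], sigma))
        return left

def fmatch(e, x):
    # Algorithm 6.14: peel off the last symbol of x and restrict e to it.
    if x == "":
        return match_empty(e)
    return fmatch(restrict(e, x[-1]), x[:-1])

print(fmatch(e, "110011"))  # True, agreeing with the recursive matcher above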
For simplicity, let us restrict our attention to the case that the al-
phabet Σ is equal to {0, 1}. Define 𝐶(ℓ) to be the maximum number
of operations that Algorithm 6.13 takes when given as input a regular
expression 𝑒 over {0, 1} of at most ℓ symbols. The value 𝐶(ℓ) can be
shown to be polynomial in ℓ, though this is not important for this the-
orem, since we only care about the dependence of the time to compute

Φ𝑒 (𝑥) on the length of 𝑥 and not about the dependence of this time on
the length of 𝑒.
Algorithm 6.14 is a recursive algorithm that, on input an expression
𝑒 and a string 𝑥 ∈ {0, 1}^𝑛 , does computation of at most 𝐶(|𝑒|) steps
and then calls itself with input some expression 𝑒′ and a string 𝑥′ of
length 𝑛 − 1. It will terminate after 𝑛 steps when it reaches a string of
length 0. So, the running time 𝑇 (𝑒, 𝑛) that it takes for Algorithm 6.14
to compute Φ𝑒 for inputs of length 𝑛 satisfies the recursive equation:

𝑇 (𝑒, 𝑛) = max{𝑇 (𝑒[0], 𝑛 − 1), 𝑇 (𝑒[1], 𝑛 − 1)} + 𝐶(|𝑒|) (6.2)

(In the base case 𝑛 = 0, 𝑇 (𝑒, 0) is equal to some constant depending


only on 𝑒.) To get some intuition for the expression Eq. (6.2), let us
open up the recursion for one level, writing 𝑇 (𝑒, 𝑛) as

𝑇 (𝑒, 𝑛) = max{𝑇 (𝑒[0][0], 𝑛 − 2) + 𝐶(|𝑒[0]|),


𝑇 (𝑒[0][1], 𝑛 − 2) + 𝐶(|𝑒[0]|),
𝑇 (𝑒[1][0], 𝑛 − 2) + 𝐶(|𝑒[1]|),
𝑇 (𝑒[1][1], 𝑛 − 2) + 𝐶(|𝑒[1]|)} + 𝐶(|𝑒|) .
Continuing this way, we can see that 𝑇 (𝑒, 𝑛) ≤ 𝑛 ⋅ 𝐶(𝐿) + 𝑂(1)
where 𝐿 is the largest length of any expression 𝑒′ that we encounter
along the way. Therefore, the following claim suffices to show that
Algorithm 6.14 runs in 𝑂(𝑛) time:

Claim: Let 𝑒 be a regular expression over {0, 1}, then there is a num-
ber 𝐿(𝑒) ∈ ℕ, such that for every sequence of symbols 𝛼0 , … , 𝛼𝑛−1 , if
we define 𝑒′ = 𝑒[𝛼0 ][𝛼1 ] ⋯ [𝛼𝑛−1 ] (i.e., restricting 𝑒 to 𝛼0 , and then 𝛼1
and so on and so forth), then |𝑒′ | ≤ 𝐿(𝑒).

Proof of claim: For a regular expression 𝑒 over {0, 1} and 𝛼 ∈ {0, 1}𝑚 ,
we denote by 𝑒[𝛼] the expression 𝑒[𝛼0 ][𝛼1 ] ⋯ [𝛼𝑚−1 ] obtained by restrict-
ing 𝑒 to 𝛼0 and then to 𝛼1 and so on. We let 𝑆(𝑒) = {𝑒[𝛼]|𝛼 ∈ {0, 1}∗ }.
We will prove the claim by showing that for every 𝑒, the set 𝑆(𝑒) is fi-
nite, and hence so is the number 𝐿(𝑒) which is the maximum length of
𝑒′ for 𝑒′ ∈ 𝑆(𝑒).
We prove this by induction on the structure of 𝑒. If 𝑒 is a symbol, the
empty string, or the empty set, then this is straightforward to show
as the only expressions 𝑆(𝑒) can contain are the expression itself, "",
and ∅. Otherwise we consider the cases (i) 𝑒 = (𝑒′ )∗ and (ii) 𝑒 =
𝑒′ 𝑒″ , where 𝑒′ , 𝑒″ are smaller expressions (and hence by the induction
hypothesis 𝑆(𝑒′ ) and 𝑆(𝑒″ ) are finite). In the case (i), if 𝑒 = (𝑒′ )∗ then
𝑒[𝛼] is either equal to (𝑒′ )∗ 𝑒′ [𝛼] or it is simply the empty set if 𝑒′ [𝛼] = ∅.
Since 𝑒′ [𝛼] is in the set 𝑆(𝑒′ ), the number of distinct expressions in
𝑆(𝑒) is at most |𝑆(𝑒′ )| + 1. In the case (ii), if 𝑒 = 𝑒′ 𝑒″ then all the
restrictions of 𝑒 to strings 𝛼 will either have the form 𝑒′ 𝑒″ [𝛼] or the form
𝑒′ 𝑒″ [𝛼]|𝑒′ [𝛼′ ] where 𝛼′ is some string such that 𝛼 = 𝛼′ 𝛼″ and 𝑒″ [𝛼″ ]

matches the empty string. Since 𝑒″ [𝛼] ∈ 𝑆(𝑒″ ) and 𝑒′ [𝛼′ ] ∈ 𝑆(𝑒′ ), the
number of the possible distinct expressions of the form 𝑒[𝛼] is at most
|𝑆(𝑒″ )| + |𝑆(𝑒″ )| ⋅ |𝑆(𝑒′ )|. (The remaining case 𝑒 = 𝑒′ |𝑒″ is handled similarly: since 𝑒[𝛼] = 𝑒′ [𝛼]|𝑒″ [𝛼], the number of distinct expressions in 𝑆(𝑒) is at most |𝑆(𝑒′ )| ⋅ |𝑆(𝑒″ )|.) This completes the proof of the claim.

The bottom line is that while running Algorithm 6.14 on a regular


expression 𝑒, all the expressions we ever encounter are in the finite set
𝑆(𝑒), no matter how large the input 𝑥 is, and so the running time of
Algorithm 6.14 satisfies the equation 𝑇 (𝑛) = 𝑇 (𝑛 − 1) + 𝐶 ′ for some
constant 𝐶 ′ depending on 𝑒. This solves to 𝑂(𝑛) where the implicit
constant in the O notation can (and will) depend on 𝑒 but crucially,
not on the length of the input 𝑥.
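To make the restriction-based algorithm concrete, here is a short Python sketch of Algorithm 6.13 and Algorithm 6.14. The tuple-based representation of expressions and the function names are our own illustrative choices (they are not part of the formal definitions above), and this direct transcription does not perform any simplification of the intermediate expressions; it is only meant to show the recursion at work.

# A regular expression is represented as a nested tuple (our own convention):
#   ("empty",)          the expression ∅
#   ("eps",)            the expression "" (matches only the empty string)
#   "0", "1", ...       a single alphabet symbol
#   ("or", e1, e2)      e1|e2
#   ("concat", e1, e2)  (e1)(e2)
#   ("star", e1)        (e1)*

def matches_empty(e):
    """Return True iff the expression e matches the empty string (cf. Solved Exercise 6.3)."""
    if e == ("empty",): return False
    if e == ("eps",): return True
    if isinstance(e, str): return False
    if e[0] == "or": return matches_empty(e[1]) or matches_empty(e[2])
    if e[0] == "concat": return matches_empty(e[1]) and matches_empty(e[2])
    return True  # ("star", e1) always matches ""

def restrict(e, sigma):
    """Algorithm 6.13: return e[sigma], which matches x iff e matches x+sigma."""
    if e == ("empty",) or e == ("eps",):
        return ("empty",)
    if isinstance(e, str):
        return ("eps",) if e == sigma else ("empty",)
    if e[0] == "or":
        return ("or", restrict(e[1], sigma), restrict(e[2], sigma))
    if e[0] == "star":
        return ("concat", e, restrict(e[1], sigma))
    # e = ("concat", e1, e2)
    if not matches_empty(e[2]):
        return ("concat", e[1], restrict(e[2], sigma))
    return ("or", ("concat", e[1], restrict(e[2], sigma)), restrict(e[1], sigma))

def fmatch(e, x):
    """Algorithm 6.14: peel off the last symbol of x and recurse."""
    if x == "":
        return 1 if matches_empty(e) else 0
    return fmatch(restrict(e, x[-1]), x[:-1])

# Example: the expression (01)* matches "0101" but not "011"
e = ("star", ("concat", "0", "1"))
print(fmatch(e, "0101"), fmatch(e, "011"))   # prints: 1 0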

6.4.1 Matching regular expressions using DFAs


Theorem 6.12 is already quite impressive, but we can do even better.
Specifically, no matter how long the string 𝑥 is, we can compute Φ𝑒 (𝑥)
by maintaining only a constant amount of memory and moreover
making a single pass over 𝑥. That is, the algorithm will scan the input
𝑥 once from start to finish, and then determine whether or not 𝑥 is
matched by the expression 𝑒. This is important in the common case
of trying to match a short regular expression over a huge file or docu-
ment that might not even fit in our computer’s memory. Of course, as
we have seen before, a single-pass constant-memory algorithm is sim-
ply a deterministic finite automaton. As we will see in Theorem 6.17, a
function can be computed by a regular expression if and only if it can be
computed by a DFA. We start with showing the “only if” direction:

Theorem 6.15 — DFA for regular expression matching. Let 𝑒 be a regular
expression. Then there is an algorithm that on input 𝑥 ∈ {0, 1}∗
computes Φ𝑒 (𝑥) while making a single pass over 𝑥 and maintaining
a constant amount of memory.

Proof Idea:
The single-pass constant-memory algorithm for checking if a string matches
a regular expression is presented in Algorithm 6.16. The idea is to
replace the recursive algorithm of Algorithm 6.14 with a dynamic pro-
gram, using the technique of memoization. If you haven’t yet taken an
algorithms course, you might not know these techniques. This is OK;
while this more efficient algorithm is crucial for the many practical
applications of regular expressions, it is not of great importance for
this book.


Algorithm 6.16 — Regular expression matching by a DFA.

Input: Regular expression 𝑒 over Σ∗ , 𝑥 ∈ Σ𝑛 where 𝑛 ∈ ℕ


Output: Φ𝑒 (𝑥)
1: procedure DFAMatch(𝑒,𝑥)
2: Let 𝑆 ← 𝑆(𝑒) be the set {𝑒[𝛼]|𝛼 ∈ Σ∗ } as defined in
the proof of the linear-time matching theorem.
3: for 𝑒′ ∈ 𝑆 do
4: Let 𝑣𝑒′ ← 1 if Φ𝑒′ ("") = 1 and 𝑣𝑒′ ← 0 otherwise
5: end for
6: for 𝑖 ∈ [𝑛] do
7: Let 𝑙𝑎𝑠𝑡𝑒′ ← 𝑣𝑒′ for all 𝑒′ ∈ 𝑆
8: Let 𝑣𝑒′ ← 𝑙𝑎𝑠𝑡𝑒′ [𝑥𝑖 ] for all 𝑒′ ∈ 𝑆
9: end for
10: return 𝑣𝑒
11: end procedure

Proof of Theorem 6.15. Algorithm 6.16 checks if a given string 𝑥 ∈ Σ∗


is matched by the regular expression 𝑒. For every regular expression
𝑒, this algorithm has a constant number of Boolean variables (specif-
ically a variable 𝑣𝑒′ for every 𝑒′ ∈ 𝑆(𝑒) and a variable 𝑙𝑎𝑠𝑡𝑒′ for every
𝑒′ in 𝑆(𝑒), using the fact that 𝑒′ [𝑥𝑖 ] is in 𝑆(𝑒) for every 𝑒′ ∈ 𝑆(𝑒)). It
makes a single pass over the input string. Hence it corresponds to a
DFA. We prove its correctness by induction on the length 𝑛 of the in-
put. Specifically, we will argue that before reading 𝑥𝑖 , the variable 𝑣𝑒′
is equal to Φ𝑒′ (𝑥0 ⋯ 𝑥𝑖−1 ) for every 𝑒′ ∈ 𝑆(𝑒). In the case 𝑖 = 0 this
holds since we initialize 𝑣𝑒′ = Φ𝑒′ ("") for all 𝑒′ ∈ 𝑆(𝑒). For 𝑖 > 0
this holds by induction since the inductive hypothesis implies that
𝑙𝑎𝑠𝑡𝑒′ = Φ𝑒′ (𝑥0 ⋯ 𝑥𝑖−2 ) for all 𝑒′ ∈ 𝑆(𝑒) and by the definition of the set
𝑆(𝑒), for every 𝑒′ ∈ 𝑆(𝑒) and 𝑥𝑖−1 ∈ Σ, 𝑒″ = 𝑒′ [𝑥𝑖−1 ] is in 𝑆(𝑒) and
Φ𝑒′ (𝑥0 ⋯ 𝑥𝑖−1 ) = Φ𝑒″ (𝑥0 ⋯ 𝑥𝑖−2 ).
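To see what this looks like in code, here is a hedged Python sketch of Algorithm 6.16; the representation and helper names are our own. One detail worth flagging: so that the set 𝑆(𝑒) is finite as a set of concrete objects, this sketch keeps the alternatives of | as a set (flattening nested unions, dropping duplicates, and simplifying with ∅ and ""), which corresponds to the proof's identification of expressions of the same form. Unions should therefore be built with the mk_or constructor below.

EMPTY, EPS = ("empty",), ("eps",)

def mk_or(*terms):
    """Union with the alternatives kept as a frozenset, so equivalent unions compare equal."""
    flat = set()
    for t in terms:
        if t == EMPTY:
            continue
        if isinstance(t, tuple) and t[0] == "or":
            flat.update(t[1])
        else:
            flat.add(t)
    if not flat:
        return EMPTY
    if len(flat) == 1:
        return next(iter(flat))
    return ("or", frozenset(flat))

def mk_concat(a, b):
    if a == EMPTY or b == EMPTY: return EMPTY
    if a == EPS: return b
    if b == EPS: return a
    return ("concat", a, b)

def nullable(e):
    """Does e match the empty string?"""
    if e == EPS: return True
    if e == EMPTY or isinstance(e, str): return False
    if e[0] == "star": return True
    if e[0] == "concat": return nullable(e[1]) and nullable(e[2])
    return any(nullable(t) for t in e[1])    # ("or", {...})

def restrict(e, sigma):
    """e[sigma]: matches x iff e matches x+sigma (Algorithm 6.13, with normalization)."""
    if e == EMPTY or e == EPS: return EMPTY
    if isinstance(e, str): return EPS if e == sigma else EMPTY
    if e[0] == "or": return mk_or(*(restrict(t, sigma) for t in e[1]))
    if e[0] == "star": return mk_concat(e, restrict(e[1], sigma))
    a, b = e[1], e[2]                        # e = (a)(b)
    if not nullable(b): return mk_concat(a, restrict(b, sigma))
    return mk_or(mk_concat(a, restrict(b, sigma)), restrict(a, sigma))

def dfa_match(e, x, alphabet=("0", "1")):
    """Algorithm 6.16: precompute S(e), then make a single pass over x."""
    S, frontier = {e}, [e]
    while frontier:                          # close S(e) under single-symbol restrictions
        cur = frontier.pop()
        for s in alphabet:
            nxt = restrict(cur, s)
            if nxt not in S:
                S.add(nxt)
                frontier.append(nxt)
    v = {ep: nullable(ep) for ep in S}       # v[e'] = Φ_e'("")
    for symbol in x:                         # single pass; only |S(e)| bits of state
        last = dict(v)
        v = {ep: last[restrict(ep, symbol)] for ep in S}
    return 1 if v[e] else 0

E = ("star", ("concat", "0", "1"))           # the expression (01)*, cf. Fig. 6.6
print(dfa_match(E, "0101"), dfa_match(E, "0110"))   # prints: 1 0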

6.4.2 Equivalence of regular expressions and automata


Recall that a Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1} is defined to be
regular if it is equal to Φ𝑒 for some regular expression 𝑒. (Equivalently,
a language 𝐿 ⊆ {0, 1}∗ is defined to be regular if there is a regular
expression 𝑒 such that 𝑒 matches 𝑥 iff 𝑥 ∈ 𝐿.) The following theorem is
the central result of automata theory:

Theorem 6.17 — DFA and regular expression equivalency. Let 𝐹 ∶ {0, 1}∗ →
{0, 1}. Then 𝐹 is regular if and only if there exists a DFA (𝑇 , 𝒮) that
computes 𝐹 .

Proof Idea:

One direction follows from Theorem 6.15, which shows that for
every regular expression 𝑒, the function Φ𝑒 can be computed by a DFA
(see for example Fig. 6.6). For the other direction, we show that given
a DFA (𝑇 , 𝒮), for every 𝑣, 𝑤 ∈ [𝐶] we can find a regular expression that
would match 𝑥 ∈ {0, 1}∗ if and only if the DFA, starting in state 𝑣, will
end up in state 𝑤 after reading 𝑥.

Proof of Theorem 6.17. Since Theorem 6.15 proves the “only if” direc-
tion, we only need to show the “if” direction. Let 𝐴 = (𝑇 , 𝒮) be a DFA
with 𝐶 states that computes the function 𝐹 . We need to show that 𝐹 is
regular.
For every 𝑣, 𝑤 ∈ [𝐶], we let 𝐹𝑣,𝑤 ∶ {0, 1}∗ → {0, 1} be the function that maps 𝑥 ∈ {0, 1}∗ to 1 if and only if the DFA 𝐴, starting at the state 𝑣, will reach the state 𝑤 if it reads the input 𝑥. We will prove that 𝐹𝑣,𝑤 is regular for every 𝑣, 𝑤. This will prove the theorem, since by Definition 6.2, 𝐹 (𝑥) is equal to the OR of 𝐹0,𝑤 (𝑥) for every 𝑤 ∈ 𝒮. Hence if we have a regular expression for every function of the form 𝐹𝑣,𝑤 then (using the | operation), we can obtain a regular expression for 𝐹 as well.

Figure 6.6: A deterministic finite automaton that computes the function Φ(01)∗ .

To give regular expressions for the functions 𝐹𝑣,𝑤 , we start by defining the following functions 𝐹ᵗ𝑣,𝑤 : for every 𝑣, 𝑤 ∈ [𝐶] and 0 ≤ 𝑡 ≤ 𝐶, 𝐹ᵗ𝑣,𝑤 (𝑥) = 1 if and only if starting from 𝑣 and observing 𝑥, the automaton reaches 𝑤 with all intermediate states being in the set [𝑡] = {0, … , 𝑡 − 1} (see Fig. 6.7). That is, while 𝑣, 𝑤 themselves might be outside [𝑡], 𝐹ᵗ𝑣,𝑤 (𝑥) = 1 if and only if throughout the execution of the automaton on the input 𝑥 (when initiated at 𝑣) it never enters any of the states outside [𝑡] and still ends up at 𝑤. If 𝑡 = 0 then [𝑡] is the empty set, and hence 𝐹⁰𝑣,𝑤 (𝑥) = 1 if and only if the automaton reaches 𝑤 from 𝑣 directly on 𝑥, without any intermediate state. If 𝑡 = 𝐶 then all states are in [𝑡], and hence 𝐹ᵗ𝑣,𝑤 = 𝐹𝑣,𝑤 .

Figure 6.7: Given a DFA of 𝐶 states, for every 𝑣, 𝑤 ∈ [𝐶] and number 𝑡 ∈ {0, … , 𝐶} we define the function 𝐹ᵗ𝑣,𝑤 ∶ {0, 1}∗ → {0, 1} to output one on input 𝑥 ∈ {0, 1}∗ if and only if when the DFA is initialized in the state 𝑣 and is given the input 𝑥, it will reach the state 𝑤 while going only through the intermediate states {0, … , 𝑡 − 1}.

We will prove the theorem by induction on 𝑡, showing that 𝐹ᵗ𝑣,𝑤 is regular for every 𝑣, 𝑤 and 𝑡. For the base case of 𝑡 = 0, 𝐹⁰𝑣,𝑤 is regular for every 𝑣, 𝑤 since it can be described as one of the expressions "", ∅, 0, 1 or 0|1. Specifically, if 𝑣 = 𝑤 then 𝐹⁰𝑣,𝑤 (𝑥) = 1 if and only if 𝑥 is the empty string. If 𝑣 ≠ 𝑤 then 𝐹⁰𝑣,𝑤 (𝑥) = 1 if and only if 𝑥 consists of a single symbol 𝜎 ∈ {0, 1} and 𝑇 (𝑣, 𝜎) = 𝑤. Therefore in this case 𝐹⁰𝑣,𝑤 corresponds to one of the four regular expressions 0|1, 0, 1 or ∅, depending on whether 𝐴 transitions to 𝑤 from 𝑣 when it reads either 0 or 1, only one of these symbols, or neither.

Inductive step: Now that we’ve seen the base case, let us prove the general case by induction. Assume, via the induction hypothesis, that for every 𝑣′ , 𝑤′ ∈ [𝐶], we have a regular expression 𝑅ᵗ𝑣′,𝑤′ that computes 𝐹ᵗ𝑣′,𝑤′ . We need to prove that 𝐹ᵗ⁺¹𝑣,𝑤 is regular for every 𝑣, 𝑤.

If the automaton arrives from 𝑣 to 𝑤 using the intermediate states


[𝑡+1], then it visits the 𝑡-th state zero or more times. If the path labeled
by 𝑥 causes the automaton to get from 𝑣 to 𝑤 without visiting the 𝑡-
th state at all, then 𝑥 is matched by the regular expression 𝑅ᵗ𝑣,𝑤 . If
the path labeled by 𝑥 causes the automaton to get from 𝑣 to 𝑤 while
visiting the 𝑡-th state 𝑘 > 0 times, then we can think of this path as:
• First travel from 𝑣 to 𝑡 using only intermediate states in [𝑡] = {0, … , 𝑡 − 1}.

• Then go from 𝑡 back to itself 𝑘 − 1 times using only intermediate states in [𝑡].

• Then go from 𝑡 to 𝑤 using only intermediate states in [𝑡].


Therefore in this case the string 𝑥 is matched by the regular expression 𝑅ᵗ𝑣,𝑡 (𝑅ᵗ𝑡,𝑡)∗ 𝑅ᵗ𝑡,𝑤 . (See also Fig. 6.8.)
Therefore we can compute 𝐹ᵗ⁺¹𝑣,𝑤 using the regular expression

𝑅ᵗ𝑣,𝑤 | 𝑅ᵗ𝑣,𝑡 (𝑅ᵗ𝑡,𝑡)∗ 𝑅ᵗ𝑡,𝑤 .

This completes the proof of the inductive step and hence of the theorem.

Figure 6.8: If we have regular expressions 𝑅ᵗ𝑣′,𝑤′ corresponding to 𝐹ᵗ𝑣′,𝑤′ for every 𝑣′ , 𝑤′ ∈ [𝐶], we can obtain a regular expression 𝑅ᵗ⁺¹𝑣,𝑤 corresponding to 𝐹ᵗ⁺¹𝑣,𝑤 . The key observation is that a path from 𝑣 to 𝑤 using {0, … , 𝑡} either does not touch 𝑡 at all, in which case it is captured by the expression 𝑅ᵗ𝑣,𝑤 , or it goes from 𝑣 to 𝑡, comes back to 𝑡 zero or more times, and then goes from 𝑡 to 𝑤, in which case it is captured by the expression 𝑅ᵗ𝑣,𝑡 (𝑅ᵗ𝑡,𝑡)∗ 𝑅ᵗ𝑡,𝑤 .
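The inductive construction in this proof is effectively an algorithm, and it can be implemented directly. Below is a Python sketch (ours; the helper names and the string representation of expressions are illustrative). It computes strings 𝑅ᵗ𝑣,𝑤 for 𝑡 = 0, 1, … , 𝐶 using the displayed formula, and returns the OR over the accepting states; as a sanity check we feed the resulting expression to Python's re module, whose syntax happens to agree with ours on |, ∗ and parentheses.

import re

def dfa_to_regex(C, T, accepting, alphabet=("0", "1")):
    """Given a DFA with states 0..C-1, transitions T[(v, s)] = w and a set of
    accepting states, return a regular expression (a string, with None standing
    for the empty expression ∅), following the proof of Theorem 6.17."""
    def union(a, b):
        if a is None: return b
        if b is None: return a
        return "(" + a + "|" + b + ")"
    def concat(a, b):
        if a is None or b is None: return None
        if a == "": return b
        if b == "": return a
        return "(" + a + ")(" + b + ")"
    def star(a):
        if a is None or a == "": return ""      # both ∅* and ""* match only ""
        return "(" + a + ")*"

    # Base case R^0_{v,w}: single symbols taking v to w, plus "" when v = w.
    # (When v = w we also include the self-loop symbols, which only helps.)
    R = [[None for _ in range(C)] for _ in range(C)]
    for v in range(C):
        for w in range(C):
            if v == w:
                R[v][w] = ""
            for s in alphabet:
                if T[(v, s)] == w:
                    R[v][w] = union(R[v][w], s)

    # Inductive step: R^{t+1}_{v,w} = R^t_{v,w} | R^t_{v,t} (R^t_{t,t})* R^t_{t,w}
    for t in range(C):
        newR = [[None for _ in range(C)] for _ in range(C)]
        for v in range(C):
            for w in range(C):
                through_t = concat(concat(R[v][t], star(R[t][t])), R[t][w])
                newR[v][w] = union(R[v][w], through_t)
        R = newR

    result = None
    for w in accepting:
        result = union(result, R[0][w])
    return result

# Example (ours): a 2-state DFA computing "the number of 1s in x is even"
T = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
expr = dfa_to_regex(2, T, accepting={0})
print(bool(re.fullmatch(expr, "0110")), bool(re.fullmatch(expr, "010")))  # True False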

6.4.3 Closure properties of regular expressions


If 𝐹 and 𝐺 are regular functions computed by the expressions 𝑒 and 𝑓
respectively, then the expression 𝑒|𝑓 computes the function 𝐻 = 𝐹 ∨ 𝐺
defined as 𝐻(𝑥) = 𝐹 (𝑥) ∨ 𝐺(𝑥). Another way to say this is that the set
of regular functions is closed under the OR operation. That is, if 𝐹 and 𝐺
are regular then so is 𝐹 ∨ 𝐺. An important corollary of Theorem 6.17
is that this set is also closed under the NOT operation:

Lemma 6.18 — Regular expressions closed under complement. If 𝐹 ∶ {0, 1}∗ →
{0, 1} is regular then so is the function 𝐹̄ , where 𝐹̄ (𝑥) = 1 − 𝐹 (𝑥) for
every 𝑥 ∈ {0, 1}∗ .

Proof. If 𝐹 is regular then by Theorem 6.15 it can be computed by a
DFA 𝐴. But we can then construct a DFA 𝐴̄ which does the same com-
putation but flips the set of accepted states. The DFA 𝐴̄ will compute
𝐹̄ . By Theorem 6.17 this implies that 𝐹̄ is regular as well.
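The operation of "flipping the set of accepted states" is easy to see in code. Here is a minimal Python sketch (the DFA representation below is our own choice):

def run_dfa(C, T, accepting, x):
    """Run a DFA with states 0..C-1, transition T[(state, symbol)] and accepting set."""
    s = 0
    for symbol in x:
        s = T[(s, symbol)]
    return 1 if s in accepting else 0

def complement_dfa(C, T, accepting):
    """Same transitions, flipped accepting states: computes 1 - F."""
    return C, T, set(range(C)) - set(accepting)

# Example (ours): a DFA for "x contains no 1" and its complement "x contains a 1"
T = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 1}
print(run_dfa(2, T, {0}, "000"), run_dfa(*complement_dfa(2, T, {0}), "000"))  # 1 0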

Since 𝑎 ∧ 𝑏 is the negation of 𝑎̄ ∨ 𝑏̄ (De Morgan’s law), Lemma 6.18 implies that the set of regular


functions is closed under the AND operation as well. Moreover, since
OR, NOT and AND are a universal basis, this set is also closed un-
der NAND, XOR, and any other finite function. That is, we have the
following corollary:

Theorem 6.19 — Closure of regular expressions. Let 𝑓 ∶ {0, 1}𝑘 → {0, 1} be
any finite Boolean function, and let 𝐹0 , … , 𝐹𝑘−1 ∶ {0, 1}∗ → {0, 1} be
regular functions. Then the function 𝐺(𝑥) = 𝑓(𝐹0 (𝑥), 𝐹1 (𝑥), … , 𝐹𝑘−1 (𝑥))
is regular.

Proof. This is a direct consequence of the closure of regular functions


under OR and NOT (and hence AND), combined with Theorem 4.13,
that states that every 𝑓 can be computed by a Boolean circuit (which is
simply a combination of the AND, OR, and NOT operations).

6.5 LIMITATIONS OF REGULAR EXPRESSIONS AND THE PUMPING


LEMMA
The efficiency of regular expression matching makes them very useful.
This is why operating systems and text editors often restrict their
search interface to regular expressions and do not allow searching by
specifying an arbitrary function. However, this efficiency comes at
a cost. As we have seen, regular expressions cannot compute every
function. In fact, there are some very simple (and useful!) functions
that they cannot compute. Here is one example:
Lemma 6.20 — Matching parentheses. Let Σ = {⟨, ⟩} and MATCHPAREN ∶
Σ∗ → {0, 1} be the function that given a string of parentheses, out-

puts 1 if and only if every opening parenthesis is matched by a corre-


sponding closed one. Then there is no regular expression over Σ that
computes MATCHPAREN.
Lemma 6.20 is a consequence of the following result, which is
known as the pumping lemma:

Theorem 6.21 — Pumping Lemma. Let 𝑒 be a regular expression over


some alphabet Σ. Then there is some number 𝑛0 such that for ev-
ery 𝑤 ∈ Σ∗ with |𝑤| > 𝑛0 and Φ𝑒 (𝑤) = 1, we can write 𝑤 = 𝑥𝑦𝑧 for
strings 𝑥, 𝑦, 𝑧 ∈ Σ∗ satisfying the following conditions:

1. |𝑦| ≥ 1.

2. |𝑥𝑦| ≤ 𝑛0 .

3. Φ𝑒 (𝑥𝑦𝑘 𝑧) = 1 for every 𝑘 ∈ ℕ.

Figure 6.9: To prove the “pumping lemma” we look


at a word 𝑤 that is much larger than the regular
expression 𝑒 that matches it. In such a case, part of
𝑤 must be matched by some sub-expression of the
form (𝑒′ )∗ , since this is the only operator that allows
matching words longer than the expression. If we
look at the “leftmost” such sub-expression and define
𝑦𝑘 to be the string that is matched by it, we obtain the
partition needed for the pumping lemma.

Proof Idea:
The idea behind the proof is the following. Let 𝑛0 be twice the
number of symbols that are used in the expression 𝑒, then the only
way that there is some 𝑤 with |𝑤| > 𝑛0 and Φ𝑒 (𝑤) = 1 is that 𝑒 con-
tains the ∗ (i.e. star) operator and that there is a non-empty substring
𝑦 of 𝑤 that was matched by (𝑒′ )∗ for some sub-expression 𝑒′ of 𝑒. We
can now repeat 𝑦 any number of times and still get a matching string.
See also Fig. 6.9.

P
The pumping lemma is a bit cumbersome to state,
but one way to remember it is that it simply says the
following: “if a string matching a regular expression is
long enough, one of its substrings must be matched using
the ∗ operator”.

Proof of Theorem 6.21. To prove the lemma formally, we use induction


on the length of the expression. Like all induction proofs, this will
be somewhat lengthy, but at the end of the day it directly follows the
intuition above that somewhere we must have used the star operation.
Reading this proof, and in particular understanding how the formal
proof below corresponds to the intuitive idea above, is a very good
way to get more comfortable with inductive proofs of this form.
Our inductive hypothesis is that for an 𝑛 length expression, 𝑛0 =
2𝑛 satisfies the conditions of the lemma. The base case is when the
expression is a single symbol 𝜎 ∈ Σ or that the expression is ∅ or
"". In all these cases the conditions of the lemma are satisfied simply
because 𝑛0 = 2, and there exists no string 𝑥 of length larger than 𝑛0
that is matched by the expression.
We now prove the inductive step. Let 𝑒 be a regular expression
with 𝑛 > 1 symbols. We set 𝑛0 = 2𝑛 and let 𝑤 ∈ Σ∗ be a string
satisfying |𝑤| > 𝑛0 . Since 𝑒 has more than one symbol, it has one of
the forms (a) 𝑒′ |𝑒″ , (b), (𝑒′ )(𝑒″ ), or (c) (𝑒′ )∗ where in all these cases
the subexpressions 𝑒′ and 𝑒″ have fewer symbols than 𝑒 and hence
satisfy the induction hypothesis.
In the case (a), every string 𝑤 matched by 𝑒 must be matched by
either 𝑒′ or 𝑒″ . If 𝑒′ matches 𝑤 then, since |𝑤| > 2|𝑒′ |, by the induction
hypothesis there exist 𝑥, 𝑦, 𝑧 with |𝑦| ≥ 1 and |𝑥𝑦| ≤ 2|𝑒′ | < 𝑛0 such
that 𝑒′ (and therefore also 𝑒 = 𝑒′ |𝑒″ ) matches 𝑥𝑦𝑘 𝑧 for every 𝑘. The
same argument works in the case that 𝑒″ matches 𝑤.
In the case (b), if 𝑤 is matched by (𝑒′ )(𝑒″ ) then we can write 𝑤 = 𝑤′ 𝑤″ where 𝑒′ matches 𝑤′ and 𝑒″ matches 𝑤″ . We split to subcases. If

|𝑤′ | > 2|𝑒′ | then by the induction hypothesis there exist 𝑥, 𝑦, 𝑧 ′ with
|𝑦| ≥ 1, |𝑥𝑦| ≤ 2|𝑒′ | < 𝑛0 such that 𝑤′ = 𝑥𝑦𝑧 ′ and 𝑒′ matches 𝑥𝑦𝑘 𝑧′
for every 𝑘 ∈ ℕ. This completes the proof since if we set 𝑧 = 𝑧 ′ 𝑤″
then we see that 𝑤 = 𝑤′ 𝑤″ = 𝑥𝑦𝑧 and 𝑒 = (𝑒′ )(𝑒″ ) matches 𝑥𝑦𝑘 𝑧 for
every 𝑘 ∈ ℕ. Otherwise, if |𝑤′ | ≤ 2|𝑒′ | then since |𝑤| = |𝑤′ | + |𝑤″ | >
𝑛0 = 2(|𝑒′ | + |𝑒″ |), it must be that |𝑤″ | > 2|𝑒″ |. Hence by the induction
hypothesis there exist 𝑥′ , 𝑦, 𝑧 such that |𝑦| ≥ 1, |𝑥′ 𝑦| ≤ 2|𝑒″ | and 𝑒″
matches 𝑥′ 𝑦𝑘 𝑧 for every 𝑘 ∈ ℕ. But now if we set 𝑥 = 𝑤′ 𝑥′ we see that
|𝑥𝑦| = |𝑤′ | + |𝑥′ 𝑦| ≤ 2|𝑒′ | + 2|𝑒″ | = 𝑛0 and on the other hand the
expression 𝑒 = (𝑒′ )(𝑒″ ) matches 𝑥𝑦𝑘 𝑧 = 𝑤′ 𝑥′ 𝑦𝑘 𝑧 for every 𝑘 ∈ ℕ.
In case (c), if 𝑤 is matched by (𝑒′ )∗ then 𝑤 = 𝑤0 ⋯ 𝑤𝑡 where for
every 𝑖 ∈ [𝑡], 𝑤𝑖 is a nonempty string matched by 𝑒′ . If |𝑤0 | > 2|𝑒′ |,
then we can use the same approach as in the concatenation case above.
Otherwise, we simply note that if 𝑥 is the empty string, 𝑦 = 𝑤0 , and
𝑧 = 𝑤1 ⋯ 𝑤𝑡 then |𝑥𝑦| ≤ 𝑛0 and 𝑥𝑦𝑘 𝑧 is matched by (𝑒′ )∗ for every
𝑘 ∈ ℕ.


R
Remark 6.22 — Recursive definitions and inductive
proofs. When an object is recursively defined (as in the
case of regular expressions) then it is natural to prove
properties of such objects by induction. That is, if we
want to prove that all objects of this type have prop-
erty 𝑃 , then it is natural to use an inductive step that
says that if 𝑜′ , 𝑜″ , 𝑜‴ etc. have property 𝑃 then so does an
object 𝑜 that is obtained by composing them.

Using the pumping lemma, we can easily prove Lemma 6.20 (i.e.,
the non-regularity of the “matching parenthesis” function):

Proof of Lemma 6.20. Suppose, towards the sake of contradiction, that


there is an expression 𝑒 such that Φ𝑒 = MATCHPAREN. Let 𝑛0 be
the number obtained from Theorem 6.21 and let 𝑤 = ⟨𝑛0 ⟩𝑛0 (i.e., 𝑛0
left parenthesis followed by 𝑛0 right parenthesis). Then we see that
if we write 𝑤 = 𝑥𝑦𝑧 as in Theorem 6.21, the condition |𝑥𝑦| ≤ 𝑛0
implies that 𝑦 consists solely of left parenthesis. Hence the string
𝑥𝑦2 𝑧 will contain more left parenthesis than right parenthesis. Hence
MATCHPAREN(𝑥𝑦2 𝑧) = 0 but by the pumping lemma Φ𝑒 (𝑥𝑦2 𝑧) = 1,
contradicting our assumption that Φ𝑒 = MATCHPAREN.

The pumping lemma is a very useful tool to show that certain func-
tions are not computable by a regular expression. However, it is not an
“if and only if” condition for regularity: there are non-regular func-
tions that still satisfy the pumping lemma conditions. To understand
the pumping lemma, it is crucial to follow the order of quantifiers in
Theorem 6.21. In particular, the number 𝑛0 in the statement of Theo-
rem 6.21 depends on the regular expression (in the proof we chose 𝑛0
to be twice the number of symbols in the expression). So, if we want
to use the pumping lemma to rule out the existence of a regular ex-
pression 𝑒 computing some function 𝐹 , we need to be able to choose
an appropriate input 𝑤 ∈ {0, 1}∗ that can be arbitrarily large and
satisfies 𝐹 (𝑤) = 1. This makes sense if you think about the intuition
behind the pumping lemma: we need 𝑤 to be large enough as to force
the use of the star operator.
Solved Exercise 6.4 — Palindromes is not regular. Prove that the following function over the alphabet {0, 1, ; } is not regular: PAL(𝑤) = 1 if and only if 𝑤 = 𝑢; 𝑢𝑅 where 𝑢 ∈ {0, 1}∗ and 𝑢𝑅 denotes 𝑢 “reversed”: the string 𝑢|𝑢|−1 ⋯ 𝑢0 . (The Palindrome function is most often defined without an explicit separator character ;, but the version with such a separator is a bit cleaner, and so we use it here. This does not make much difference, as one can easily encode the separator as a special binary string instead.)

Figure 6.10: A cartoon of a proof using the pumping lemma that a function 𝐹 is not regular. The pumping lemma states that if 𝐹 is regular then there exists a number 𝑛0 such that for every large enough 𝑤 with 𝐹 (𝑤) = 1, there exists a partition of 𝑤 to 𝑤 = 𝑥𝑦𝑧 satisfying certain conditions such that for every 𝑘 ∈ ℕ, 𝐹 (𝑥𝑦𝑘 𝑧) = 1. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every there exists quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every for every quantifier corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction which would be a choice of 𝑘 that would result in 𝐹 (𝑥𝑦𝑘 𝑧) = 0, hence violating the conclusion of the pumping lemma.

Solution:
We use the pumping lemma. Suppose toward the sake of con-
tradiction that there is a regular expression 𝑒 computing PAL,
and let 𝑛0 be the number obtained by the pumping lemma (The-
orem 6.21). Consider the string 𝑤 = 0𝑛0 ; 0𝑛0 . Since the reverse
of the all zero string is the all zero string, PAL(𝑤) = 1. Now, by
the pumping lemma, if PAL is computed by 𝑒, then we can write
𝑤 = 𝑥𝑦𝑧 such that |𝑥𝑦| ≤ 𝑛0 , |𝑦| ≥ 1 and PAL(𝑥𝑦𝑘 𝑧) = 1 for
every 𝑘 ∈ ℕ. In particular, it must hold that PAL(𝑥𝑧) = 1, but this
is a contradiction, since 𝑥𝑧 = 0𝑛0 −|𝑦| ; 0𝑛0 and so its two parts are
not of the same length and in particular are not the reverse of one
another.

For yet another example of a pumping-lemma based proof, see


Fig. 6.10 which illustrates a cartoon of the proof of the non-regularity
of the function 𝐹 ∶ {0, 1}∗ → {0, 1} which is defined as 𝐹 (𝑥) = 1 iff
𝑥 = 0𝑛 1𝑛 for some 𝑛 ∈ ℕ (i.e., 𝑥 consists of a string of consecutive
zeroes, followed by a string of consecutive ones of the same length).

6.6 ANSWERING SEMANTIC QUESTIONS ABOUT REGULAR EX-


PRESSIONS
Regular expressions have applications beyond search. For example,
regular expressions are often used to define tokens (such as what is a
valid variable identifier, or keyword) in the design of parsers, compilers
and interpreters for programming languages. Regular expressions
have other applications too: for example, in recent years, the world
of networking moved from fixed topologies to “software defined
networks”. Such networks are routed by programmable switches
that can implement policies such as “if packet is secured by SSL then
forward it to A, otherwise forward it to B”. To represent such policies
we need a language that is on one hand sufficiently expressive to
capture the policies we want to implement, but on the other hand
sufficiently restrictive so that we can quickly execute them at network
speed and also be able to answer questions such as “can C see the
packets moved from A to B?”. The NetKAT network programming
language uses a variant of regular expressions to achieve precisely
that. For this application, it is important that we are not merely able
to answer whether an expression 𝑒 matches a string 𝑥 but also answer
semantic questions about regular expressions such as “do expressions

𝑒 and 𝑒′ compute the same function?” and “does there exist a string 𝑥
that is matched by the expression 𝑒?”. The following theorem shows
that we can answer the latter question:

Theorem 6.23 — Emptiness of regular languages is computable. There is an
algorithm that given a regular expression 𝑒, outputs 1 if and only if
Φ𝑒 is the constant zero function.

Proof Idea:
The idea is that we can directly observe this from the structure
of the expression. The only way a regular expression 𝑒 computes
the constant zero function is if 𝑒 has the form ∅ or is obtained by
concatenating ∅ with other expressions.

Proof of Theorem 6.23. Define a regular expression to be “empty” if it


computes the constant zero function. Given a regular expression 𝑒, we
can determine if 𝑒 is empty using the following rules:

• If 𝑒 has the form 𝜎 or "" then it is not empty.

• If 𝑒 is not empty then 𝑒|𝑒′ is not empty for every 𝑒′ .

• If 𝑒 is not empty then 𝑒∗ is not empty.

• If 𝑒 and 𝑒′ are both not empty then 𝑒 𝑒′ is not empty.

• ∅ is empty.

Using these rules, it is straightforward to come up with a recursive


algorithm to determine emptiness.
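These rules translate directly into a few lines of Python; the sketch below reuses the tuple representation of the earlier sketches (an expression is ("empty",), ("eps",), a single symbol, or an ("or", …), ("concat", …), ("star", …) node). One edge case worth a comment: a starred expression always matches the empty string, so it is never empty, regardless of its body.

def is_empty(e):
    """Return True iff Φ_e is the constant zero function."""
    if e == ("empty",): return True
    if e == ("eps",) or isinstance(e, str): return False
    if e[0] == "or": return is_empty(e[1]) and is_empty(e[2])
    if e[0] == "concat": return is_empty(e[1]) or is_empty(e[2])
    return False   # ("star", e1): e1* matches "" and hence is never empty

print(is_empty(("concat", ("empty",), "1")))  # True:  ∅ concatenated with anything is empty
print(is_empty(("star", ("empty",))))         # False: ∅* still matches the empty string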

Using Theorem 6.23, we can obtain an algorithm that determines


whether or not two regular expressions 𝑒 and 𝑒′ are equivalent, in the
sense that they compute the same function.

Theorem 6.24 — Equivalence of regular expressions is computable. Let
REGEQ ∶ {0, 1}∗ → {0, 1} be the function that on input (a string

representing) a pair of regular expressions 𝑒, 𝑒′ , REGEQ(𝑒, 𝑒′ ) = 1


if and only if Φ𝑒 = Φ𝑒′ . Then there is an algorithm that computes
REGEQ.

Proof Idea:
The idea is to show that given a pair of regular expressions 𝑒 and
𝑒′ we can find an expression 𝑒″ such that Φ𝑒″ (𝑥) = 1 if and only if
Φ𝑒 (𝑥) ≠ Φ𝑒′ (𝑥). Therefore Φ𝑒″ is the constant zero function if and only

if 𝑒 and 𝑒′ are equivalent, and thus we can test for emptiness of 𝑒″ to


determine equivalence of 𝑒 and 𝑒′ .

Proof of Theorem 6.24. We will prove Theorem 6.24 from Theorem 6.23.
(The two theorems are in fact equivalent: it is easy to prove Theo-
rem 6.23 from Theorem 6.24, since checking for emptiness is the same
as checking equivalence with the expression ∅.) Given two regu-
lar expressions 𝑒 and 𝑒′ , we will compute an expression 𝑒″ such that
Φ𝑒″ (𝑥) = 1 if and only if Φ𝑒 (𝑥) ≠ Φ𝑒′ (𝑥). One can see that 𝑒 is equiva-
lent to 𝑒′ if and only if 𝑒″ is empty.
We start with the observation that for every pair of bits 𝑎, 𝑏 ∈ {0, 1}, 𝑎 ≠ 𝑏 if and only if

(𝑎 ∧ 𝑏̄) ∨ (𝑎̄ ∧ 𝑏) = 1 ,

where 𝑎̄ denotes the negation 1 − 𝑎. Hence we need to construct 𝑒″ such that for every 𝑥,

Φ𝑒″ (𝑥) = (Φ𝑒 (𝑥) ∧ (1 − Φ𝑒′ (𝑥))) ∨ ((1 − Φ𝑒 (𝑥)) ∧ Φ𝑒′ (𝑥)) . (6.3)

To construct the expression 𝑒″ , we will show how given any pair of expressions 𝑒 and 𝑒′ , we can construct expressions 𝑒 ∧ 𝑒′ and 𝑒̄ that compute the functions Φ𝑒 ∧ Φ𝑒′ and 1 − Φ𝑒 respectively. (Computing the expression for 𝑒 ∨ 𝑒′ is straightforward using the | operation of regular expressions.)
Specifically, by Lemma 6.18, regular functions are closed under negation, which means that for every regular expression 𝑒, there is an expression 𝑒̄ such that Φ𝑒̄ (𝑥) = 1 − Φ𝑒 (𝑥) for every 𝑥 ∈ {0, 1}∗ . Now, for every two expressions 𝑒 and 𝑒′ , the complement of the expression 𝑒̄ | 𝑒̄′ computes the AND of the two expressions (this is De Morgan’s law again), and we denote this expression by 𝑒 ∧ 𝑒′ .
Given these two transformations, we see that for every pair of regular expressions 𝑒 and 𝑒′ we can find a regular expression 𝑒″ satisfying (6.3) such that 𝑒″ is empty if and only if 𝑒 and 𝑒′ are equivalent.
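Turning this particular proof into code requires a regular-expression complementation procedure, which (via Theorem 6.15 and Theorem 6.17) goes through automata. A more direct route to an equivalence checker, different from the construction above but using the same ingredients, is to reuse the normalized restrict and nullable helpers from the single-pass matching sketch earlier: 𝑒 and 𝑒′ are equivalent if and only if no sequence of restrictions leads to a pair of expressions that disagree on matching the empty string. Here is such a sketch (ours):

def reg_equiv(e1, e2, alphabet=("0", "1")):
    """Return 1 iff Φ_e1 = Φ_e2, by exploring pairs of iterated restrictions.
    Uses the normalized restrict/nullable/mk_or/mk_concat helpers sketched earlier."""
    seen = {(e1, e2)}
    stack = [(e1, e2)]
    while stack:
        a, b = stack.pop()
        if nullable(a) != nullable(b):     # the two expressions disagree on some string
            return 0
        for s in alphabet:
            pair = (restrict(a, s), restrict(b, s))
            if pair not in seen:
                seen.add(pair)
                stack.append(pair)
    return 1

# Example: 0(0|1)* and (00|01)(0|1*)... built with the constructors; both match
# exactly the nonempty strings that start with 0, so they are equivalent.
A = mk_concat("0", ("star", mk_or("0", "1")))
B = mk_or(mk_concat(mk_or(mk_concat("0", "0"), mk_concat("0", "1")),
                    ("star", mk_or("0", "1"))), "0")
print(reg_equiv(A, B))   # prints: 1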

✓ Chapter Recap

• We model computational tasks on arbitrarily large


inputs using infinite functions 𝐹 ∶ {0, 1}∗ → {0, 1}∗ .
• Such functions take an arbitrarily long (but still
finite!) string as input, and cannot be described by
a finite table of inputs and outputs.
• A function with a single bit of output is known as
a Boolean function, and the task of computing it is
equivalent to deciding a language 𝐿 ⊆ {0, 1}∗ .

• Deterministic finite automata (DFAs) are one simple


model for computing (infinite) Boolean functions.
• There are some functions that cannot be computed
by DFAs.
• The set of functions computable by DFAs is the
same as the set of languages that can be recognized
by regular expressions.

6.7 EXERCISES
Exercise 6.1 — Closure properties of regular functions. Suppose that 𝐹 , 𝐺 ∶
{0, 1}∗ → {0, 1} are regular. For each one of the following defini-

tions of the function 𝐻, either prove that 𝐻 is always regular or give a


counterexample for regular 𝐹 , 𝐺 that would make 𝐻 not regular.

1. 𝐻(𝑥) = 𝐹 (𝑥) ∨ 𝐺(𝑥).

2. 𝐻(𝑥) = 𝐹 (𝑥) ∧ 𝐺(𝑥)

3. 𝐻(𝑥) = NAND(𝐹 (𝑥), 𝐺(𝑥)).

4. 𝐻(𝑥) = 𝐹 (𝑥𝑅 ) where 𝑥𝑅 is the reverse of 𝑥: 𝑥𝑅 = 𝑥𝑛−1 𝑥𝑛−2 ⋯ 𝑥0 for


𝑛 = |𝑥|.

5. 𝐻(𝑥) = 1 if 𝑥 = 𝑢𝑣 for some 𝑢, 𝑣 such that 𝐹 (𝑢) = 𝐺(𝑣) = 1, and 𝐻(𝑥) = 0 otherwise.

6. 𝐻(𝑥) = 1 if 𝑥 = 𝑢𝑢 for some 𝑢 such that 𝐹 (𝑢) = 𝐺(𝑢) = 1, and 𝐻(𝑥) = 0 otherwise.

7. 𝐻(𝑥) = 1 if 𝑥 = 𝑢𝑢𝑅 for some 𝑢 such that 𝐹 (𝑢) = 𝐺(𝑢) = 1, and 𝐻(𝑥) = 0 otherwise.

Exercise 6.2 One among the following two functions that map {0, 1}∗
to {0, 1} can be computed by a regular expression, and the other one
cannot. For the one that can be computed by a regular expression,
write the expression that does it. For the one that cannot, prove that
this cannot be done using the pumping lemma.

• 𝐹 (𝑥) = 1 if 4 divides 𝑥0 + 𝑥1 + ⋯ + 𝑥|𝑥|−1 , and 𝐹 (𝑥) = 0 otherwise.

• 𝐺(𝑥) = 1 if and only if 𝑥0 + 𝑥1 + ⋯ + 𝑥|𝑥|−1 ≥ |𝑥|/4, and 𝐺(𝑥) = 0 otherwise.

Exercise 6.3 — Non-regularity. 1. Prove that the following function 𝐹 ∶
{0, 1}∗ → {0, 1} is not regular. For every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 1 iff 𝑥 is
the all-1 string of length 3ⁱ for some 𝑖 > 0.

2. Prove that the following function 𝐹 ∶ {0, 1}∗ → {0, 1} is not regular.
For every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 1 iff ∑𝑗 𝑥𝑗 = 3ⁱ for some 𝑖 > 0.

6.8 BIBLIOGRAPHICAL NOTES


The relation of regular expressions with finite automata is a beautiful
topic, on which we only touch upon in this text. It is covered more
extensively in [Sip97; HMU14; Koz97]. These texts also discuss top-
ics such as non-deterministic finite automata (NFA) and the relation
between context-free grammars and pushdown automata.
The automaton of Fig. 6.4 was generated using the FSM simulator
of Ivan Zuzak and Vedrana Jankovic. Our proof of Theorem 6.12 is
closely related to the Myhill-Nerode Theorem. One direction of the
Myhill-Nerode theorem can be stated as saying that if 𝑒 is a regular
expression then there is at most a finite number of strings 𝑧0 , … , 𝑧𝑘−1
such that Φ𝑒[𝑧𝑖 ] ≠ Φ𝑒[𝑧𝑗 ] for every 0 ≤ 𝑖 ≠ 𝑗 < 𝑘.
Learning Objectives:
• Learn the model of Turing machines, which
can compute functions of arbitrary input
lengths.
• See a programming-language description of
Turing machines, using NAND-TM programs,
which add loops and arrays to NAND-CIRC.
• See some basic syntactic sugar and equivalence of variants of Turing machines and NAND-TM programs.

7
Loops and infinity

“The bounds of arithmetic were however outstepped the moment the idea of
applying the [punched] cards had occurred; and the Analytical Engine does not
occupy common ground with mere ‘calculating machines.’… In enabling mech-
anism to combine together general symbols, in successions of unlimited variety
and extent, a uniting link is established between the operations of matter and
the abstract mental processes of the most abstract branch of mathematical sci-
ence.” , Ada Augusta, countess of Lovelace, 1843

As the quote of Chapter 6 says, an algorithm is “a finite answer to


an infinite number of questions”. To express an algorithm, we need to
write down a finite set of instructions that will enable us to compute
on arbitrarily long inputs. To describe and execute an algorithm we
need the following components (see Fig. 7.1):

• The finite set of instructions to be performed.

• Some “local variables” or finite state used in the execution.

• A potentially unbounded working memory to store the input and


any other values we may require later.

• While the memory is unbounded, at every single step we can only


read and write to a finite part of it, and we need a way to address
which are the parts we want to read from and write to.

• If we only have a finite set of instructions but our input can be


arbitrarily long, we will need to repeat instructions (i.e., loop back).
We need a mechanism to decide when we will loop and when we
will halt.
Figure 7.1: An algorithm is a finite recipe to compute on arbitrarily long inputs. The components of an algorithm include the instructions to be performed, finite state or “local variables”, the memory to store the input and intermediate computations, as well as mechanisms to decide which part of the memory to access, and when to repeat instructions and when to halt.

This chapter: A non-mathy overview
In this chapter, we give a general model of an algorithm, which (unlike Boolean circuits) is not restricted to a fixed input length, and (unlike finite automata) is not restricted to
a finite amount of working memory. We will see two ways to


model algorithms:

• Turing machines, invented by Alan Turing in 1936, are hy-


pothetical abstract devices that yield finite descriptions of
algorithms that can handle arbitrarily long inputs.

• The NAND-TM Programming language extends NAND-


CIRC with the notion of loops and arrays to obtain finite
programs that can compute a function with arbitrarily
long inputs.

It turns out that these two models are equivalent. In fact,


they are equivalent to many other computational models,
including programming languages such as C, Lisp, Python,
JavaScript, etc. This notion, known as Turing equivalence
or Turing completeness, will be discussed in Chapter 8. See
Fig. 7.2 for an overview of the models presented in this chap-
ter and Chapter 8.

Figure 7.2: Overview of our models for finite and


unbounded computation. In the previous chapters
we study the computation of finite functions, which
are functions 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 for some fixed
𝑛, 𝑚, and modeled computing these functions using
circuits or straight-line programs. In this chapter we
study computing unbounded functions of the form
𝐹 ∶ {0, 1}∗ → {0, 1}𝑚 or 𝐹 ∶ {0, 1}∗ → {0, 1}∗ .
We model computing these functions using Turing
Machines or (equivalently) NAND-TM programs,
which add the notion of loops to the NAND-CIRC
programming language. In Chapter 8 we will show
that these models are equivalent to many other
models, including RAM machines, the 𝜆 calculus, and
all the common programming languages including C,
Python, Java, JavaScript, etc.

7.1 TURING MACHINES


“Computing is normally done by writing certain symbols on paper. We may
suppose that this paper is divided into squares like a child’s arithmetic book..
The behavior of the [human] computer at any moment is determined by the
symbols which he is observing, and of his’ state of mind’ at that moment… We
may suppose that in a simple operation not more than one symbol is altered.” ,
“We compare a man in the process of computing … to a machine which is only
capable of a finite number of configurations… The machine is supplied with a
‘tape’ (the analogue of paper) … divided into sections (called ‘squares’) each
capable of bearing a ‘symbol’ ” , Alan Turing, 1936

“What is the difference between a Turing machine and the modern computer?
It’s the same as that between Hillary’s ascent of Everest and the establishment
of a Hilton hotel on its peak.” , Alan Perlis, 1982.

The “granddaddy” of all models of computation is the Turing ma-


chine. Turing machines were defined in 1936 by Alan Turing in an
attempt to formally capture all the functions that can be computed
by human “computers” (see Fig. 7.4) that follow a well-defined set of
rules, such as the standard algorithms for addition or multiplication.
Turing thought of such a person as having access to as much
“scratch paper” as they need. For simplicity, we can think of this
scratch paper as a one dimensional piece of graph paper (or tape, as it is commonly referred to). The paper is divided into “cells”, where each “cell” can hold a single symbol (e.g., one digit or letter, and more generally, some element of a finite alphabet). At any point in time, the person can read from and write to a single cell of the paper. Based on the contents of this cell, the person can update their finite mental state, and/or move to the cell immediately to the left or right of the current one.

Figure 7.3: Aside from his many other achievements, Alan Turing was an excellent long-distance runner who just fell shy of making England’s Olympic team. A fellow runner once asked him why he punished himself so much in training. Alan said “I have such a stressful job that the only way I can get it out of my mind is by running hard; it’s the only way I can get some release.”
Turing modeled such a computation by a “machine” that maintains one of 𝑘 states. At each point in time the machine reads from its “work tape” a single symbol from a finite alphabet Σ and uses that to update its state, write to tape, and possibly move to an adjacent cell (see Fig. 7.7). To compute a function 𝐹 using this machine, we initialize the tape with the input 𝑥 ∈ {0, 1}∗ and our goal is to ensure that the tape will contain the value 𝐹 (𝑥) at the end of the computation. Specifically, a computation of a Turing machine 𝑀 with 𝑘 states and alphabet Σ on input 𝑥 ∈ {0, 1}∗ proceeds as follows:

Figure 7.4: Until the advent of electronic computers, the word “computer” was used to describe a person that performed calculations. Most of these “human computers” were women, and they were absolutely essential to many achievements, including mapping the stars, breaking the Enigma cipher, and the NASA space mission; see also the bibliographical notes. Photo from National Photo Company Collection; see also [Sob17].

• Initially the machine is at state 0 (known as the “starting state”) and the tape is initialized to ▷, 𝑥0 , … , 𝑥𝑛−1 , ∅, ∅, …. We use the symbol ▷ to denote the beginning of the tape, and the symbol ∅ to denote an empty cell. We will always assume that the alphabet Σ is a (potentially strict) superset of {▷, ∅, 0, 1}.

• The location 𝑖 to which the machine points to is set to 0.

• At each step, the machine reads the symbol 𝜎 = 𝑇 [𝑖] that is in the 𝑖-th location of the tape. Based on this symbol and its state 𝑠, the machine decides on:

  – What symbol 𝜎′ to write on the tape
  – Whether to move Left (i.e., 𝑖 ← 𝑖 − 1), Right (i.e., 𝑖 ← 𝑖 + 1), Stay in place, or Halt the computation.
  – What is going to be the new state 𝑠 ∈ [𝑘]

Figure 7.5: Steam-powered Turing machine mural, painted by CSE grad students at the University of Washington on the night before spring qualifying examinations, 1987. Image from https://www.cs.washington.edu/building/art/SPTM.

• The set of rules the Turing machine follows is known as its transi-
tion function.

• When the machine halts, its output is the binary string obtained by
reading the tape from the beginning until the first location in which
it contains a ∅ symbol, and then outputting all 0 and 1 symbols in
sequence, dropping the initial ▷ symbol if it exists, as well as the
final ∅ symbol.

7.1.1 Extended example: A Turing machine for palindromes


Let PAL (for palindromes) be the function that on input 𝑥 ∈ {0, 1}∗ ,
outputs 1 if and only if 𝑥 is an (even length) palindrome, in the sense
that 𝑥 = 𝑤0 ⋯ 𝑤𝑛−1 𝑤𝑛−1 𝑤𝑛−2 ⋯ 𝑤0 for some 𝑛 ∈ ℕ and 𝑤 ∈ {0, 1}𝑛 .
We now show a Turing machine 𝑀 that computes PAL. To specify
𝑀 we need to specify (i) 𝑀 ’s tape alphabet Σ which should contain at
least the symbols 0,1, ▷ and ∅, and (ii) 𝑀 ’s transition function which
determines what action 𝑀 takes when it reads a given symbol while it is in a particular state.

Figure 7.6: The components of a Turing Machine. Note how they correspond to the general components of algorithms as described in Fig. 7.1.
In our case, 𝑀 will use the alphabet {0, 1, ▷, ∅, ×} and will have
𝑘 = 11 states. Though the states are simply numbers between 0 and
𝑘 − 1, we will give them the following labels for convenience:

State Label
0 START
1 RIGHT_0
2 RIGHT_1
3 LOOK_FOR_0
4 LOOK_FOR_1
5 RETURN
6 OUTPUT_0
7 OUTPUT_1
8 0_AND_BLANK
9 1_AND_BLANK
10 BLANK_AND_STOP

We describe the operation of our Turing machine 𝑀 in words:

• 𝑀 starts in state START and goes right, looking for the first symbol
that is 0 or 1. If it finds ∅ before it hits such a symbol then it moves
to the OUTPUT_1 state described below.

• Once 𝑀 finds such a symbol 𝑏 ∈ {0, 1}, 𝑀 deletes 𝑏 from the tape
by writing the × symbol, enters the RIGHT_𝑏 mode, and
starts moving rightwards until it hits the first ∅ or × symbol.

• Once 𝑀 finds this symbol, it goes into the state LOOK_FOR_0 or


LOOK_FOR_1 depending on whether it was in the state RIGHT_0 or
RIGHT_1 and makes one left move.

• In the state LOOK_FOR_𝑏, 𝑀 checks whether the value on the tape is


𝑏. If it is, then 𝑀 deletes it by changing its value to ×, and moves to
the state RETURN. Otherwise, it changes to the OUTPUT_0 state.

• The RETURN state means that 𝑀 goes back to the beginning. Specifi-
cally, 𝑀 moves leftward until it hits the first symbol that is not 0 or
1, in which case it changes its state to START.

• The OUTPUT_𝑏 states mean that 𝑀 will eventually output the value
𝑏. In both the OUTPUT_0 and OUTPUT_1 states, 𝑀 goes left until it
hits ▷. Once it does so, it makes a right step, and changes to the
0_AND_BLANK or 1_AND_BLANK states respectively. In the latter states,
𝑀 writes the corresponding value, moves right and changes to the
BLANK_AND_STOP state, in which it writes ∅ to the tape and halts.

The above description can be turned into a table describing for each
one of the 11 ⋅ 5 combinations of state and symbol, what the Turing
machine will do when it is in that state and it reads that symbol. This
table is known as the transition function of the Turing machine.
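For illustration, here are a few of those rows, written as (state, symbol read) → (new state, symbol written, head movement). These particular choices are one possible way to fill in the table consistent with the description above (not the only one); completing the remaining rows is a useful exercise.

• (START, ▷) → (START, ▷, Right)
• (START, ×) → (START, ×, Right)
• (START, 0) → (RIGHT_0, ×, Right)
• (START, 1) → (RIGHT_1, ×, Right)
• (START, ∅) → (OUTPUT_1, ∅, Left)
• (RIGHT_0, 0) → (RIGHT_0, 0, Right)
• (RIGHT_0, 1) → (RIGHT_0, 1, Right)
• (RIGHT_0, ∅) → (LOOK_FOR_0, ∅, Left)
• (RIGHT_0, ×) → (LOOK_FOR_0, ×, Left)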

7.1.2 Turing machines: a formal definition

Figure 7.7: A Turing machine has access to a tape of


unbounded length. At each point in the execution,
the machine can read a single symbol of the tape,
and based on that and its current state, write a new
symbol, update the tape, decide whether to move left,
right, stay, or halt.

The formal definition of Turing machines is as follows:

Definition 7.1 — Turing Machine. A (one tape) Turing machine 𝑀 with 𝑘
states and alphabet Σ ⊇ {0, 1, ▷, ∅} is represented by a transition
function 𝛿𝑀 ∶ [𝑘] × Σ → [𝑘] × Σ × {L, R, S, H}.

For every 𝑥 ∈ {0, 1}∗ , the output of 𝑀 on input 𝑥, denoted by


𝑀 (𝑥), is the result of the following process:

• We initialize 𝑇 to be the sequence ▷, 𝑥0 , 𝑥1 , … , 𝑥𝑛−1 , ∅, ∅, …,


where 𝑛 = |𝑥|. (That is, 𝑇 [0] = ▷, 𝑇 [𝑖 + 1] = 𝑥𝑖 for 𝑖 ∈ [𝑛], and
𝑇 [𝑖] = ∅ for 𝑖 > 𝑛.)

• We also initialize 𝑖 = 0 and 𝑠 = 0.

• We then repeat the following process:

1. Let (𝑠′ , 𝜎′ , 𝐷) = 𝛿𝑀 (𝑠, 𝑇 [𝑖]).


2. Set 𝑠 ← 𝑠′ , 𝑇 [𝑖] ← 𝜎′ .
3. If 𝐷 = R then set 𝑖 ← 𝑖 + 1, if 𝐷 = L then set 𝑖 ← max{𝑖 − 1, 0}.
(If 𝐷 = S then we keep 𝑖 the same.)
4. If 𝐷 = H, then halt.

• If the process above halts, then 𝑀 ’s output, denoted by 𝑀 (𝑥), is


the string 𝑦 ∈ {0, 1}∗ obtained by concatenating all the symbols
in {0, 1} in positions 𝑇 [0], … , 𝑇 [𝑖] where 𝑖 + 1 is the first location
in the tape containing ∅.

• If the Turing machine does not halt then we denote 𝑀 (𝑥) = ⊥.

P
You should make sure you see why this formal def-
inition corresponds to our informal description of
a Turing machine. To get more intuition on Turing
machines, you can explore some of the online avail-
able simulators such as Martin Ugarte’s, Anthony
Morphett’s, or Paul Rendell’s.
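For readers who prefer code to formalism, here is a short Python sketch (ours) of the execution process in Definition 7.1. The transition function is given as a dictionary mapping (state, symbol) pairs to triples (new state, symbol to write, action), with the actions L, R, S, H written as strings.

def run_TM(delta, x, blank="∅", start="▷"):
    """Simulate a Turing machine with transition function delta on input x,
    following Definition 7.1. Returns the output string; if the machine never
    halts, this function loops forever as well."""
    T = [start] + list(x)          # the tape; extended with blanks as needed
    i, s = 0, 0                    # head location and state
    while True:
        if i >= len(T):
            T.append(blank)
        new_s, sigma, D = delta[(s, T[i])]
        s, T[i] = new_s, sigma
        if D == "R":
            i += 1
        elif D == "L":
            i = max(i - 1, 0)
        elif D == "H":
            break                  # "S" leaves the head where it is
    # Output: the 0/1 symbols up to the first blank, dropping the initial ▷
    out = []
    for c in T:
        if c == blank:
            break
        if c in ("0", "1"):
            out.append(c)
    return "".join(out)

# Example (ours): a one-state machine that flips every bit of its input
delta = {(0, "▷"): (0, "▷", "R"),
         (0, "0"): (0, "1", "R"),
         (0, "1"): (0, "0", "R"),
         (0, "∅"): (0, "∅", "H")}
print(run_TM(delta, "10110"))      # prints: 01001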

One should not confuse the transition function 𝛿𝑀 of a Turing ma-


chine 𝑀 with the function that the machine computes. The transition
function 𝛿𝑀 is a finite function, with 𝑘|Σ| inputs and 4𝑘|Σ| outputs.
(Can you see why?) The machine can compute an infinite function 𝐹
that takes as input a string 𝑥 ∈ {0, 1}∗ of arbitrary length and might
also produce an arbitrary length string as output.
In our formal definition, we identified the machine 𝑀 with its
transition function 𝛿𝑀 since the transition function tells us everything
we need to know about the Turing machine. However, this choice of
representation is somewhat arbitrary, and is based on our convention
that the state space is always the numbers {0, … , 𝑘 − 1} with 0 as
the starting state. Other texts use different conventions, and so their
mathematical definition of a Turing machine might look superficially

different. However, these definitions describe the same computational


process and have the same computational powers. Hence they are
equivalent despite their superficial differences. See Section 7.7 for a
comparison between Definition 7.1 and the way Turing Machines are
defined in texts such as Sipser [Sip97].

7.1.3 Computable functions


We now turn to make one of the most important definitions in this
book: computable functions.

Definition 7.2 — Computable functions. Let 𝐹 ∶ {0, 1}∗ → {0, 1}∗ be
a (total) function and let 𝑀 be a Turing machine. We say that 𝑀
computes 𝐹 if for every 𝑥 ∈ {0, 1}∗ , 𝑀 (𝑥) = 𝐹 (𝑥).
We say that a function 𝐹 is computable if there exists a Turing
machine 𝑀 that computes it.

Defining a function “computable” if and only if it can be computed


by a Turing machine might seem “reckless” but, as we’ll see in Chap-
ter 8, being computable in the sense of Definition 7.2 is equivalent to
being computable in virtually any reasonable model of computation.
This statement is known as the Church-Turing Thesis. (Unlike the ex-
tended Church-Turing Thesis which we discussed in Section 5.6, the
Church-Turing thesis itself is widely believed and there are no candi-
date devices that attack it.)

 Big Idea 9 We can precisely define what it means for a function to


be computable by any possible algorithm.

This is a good point to remind the reader that functions are not the
same as programs:

Functions ≠ Programs .

A Turing machine (or program) 𝑀 can compute some function


𝐹 , but it is not the same as 𝐹 . In particular, there can be more than
one program to compute the same function. Being computable is a
property of functions, not of machines.
We will often pay special attention to functions 𝐹 ∶ {0, 1}∗ → {0, 1}
that have a single bit of output. Hence we give a special name for the
set of computable functions of this form.

Definition 7.3 — The class R. We define R to be the set of all computable
functions 𝐹 ∶ {0, 1}∗ → {0, 1}.

R
Remark 7.4 — Functions vs. languages. As discussed
in Section 6.1.2, many texts use the terminology of
“languages” rather than functions to refer to compu-
tational tasks. A Turing machine 𝑀 decides a language
𝐿 if for every input 𝑥 ∈ {0, 1}∗ , 𝑀 (𝑥) outputs 1 if
and only if 𝑥 ∈ 𝐿. This is equivalent to computing
the Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1} defined as
𝐹 (𝑥) = 1 iff 𝑥 ∈ 𝐿. A language 𝐿 is decidable if there
is a Turing machine 𝑀 that decides it. For historical
reasons, some texts also call such languages recursive
, which is the reason that the letter R is often used
to denote the set of computable Boolean functions /
decidable languages defined in Definition 7.3.
In this book we stick to the terminology of functions
rather than languages, but all definitions and results
can be easily translated back and forth by using the
equivalence between the function 𝐹 ∶ {0, 1}∗ → {0, 1}
and the language 𝐿 = {𝑥 ∈ {0, 1}∗ | 𝐹 (𝑥) = 1}.

7.1.4 Infinite loops and partial functions


One crucial difference between circuits/straight-line programs and
Turing machines is the following. Looking at a NAND-CIRC program
𝑃 , we can always tell how many inputs and how many outputs 𝑃
has by simply looking at the X and Y variables. Furthermore, we are
guaranteed that if we invoke 𝑃 on any input, then some output will be
produced.
In contrast, given a Turing machine 𝑀 , we cannot determine a
priori the length of 𝑀 ’s output. In fact, we don’t even know if an
output would be produced at all! For example, it is straightforward
to come up with a Turing machine whose transition function never
outputs H and hence never halts.
If a machine 𝑀 fails to stop and produce an output on some input 𝑥, then it cannot compute any total function 𝐹 , since clearly on input 𝑥, 𝑀 will fail to output 𝐹 (𝑥). However, 𝑀 can still compute a partial function.¹

¹ A partial function 𝐹 from a set 𝐴 to a set 𝐵 is a function that is only defined on a subset of 𝐴 (see Section 1.4.3). We can also think of such a function as mapping 𝐴 to 𝐵 ∪ {⊥} where ⊥ is a special “failure” symbol such that 𝐹 (𝑎) = ⊥ indicates the function 𝐹 is not defined on 𝑎.

For example, consider the partial function DIV that on input a pair (𝑎, 𝑏) of natural numbers, outputs ⌈𝑎/𝑏⌉ if 𝑏 > 0, and is undefined
otherwise. We can define a Turing machine 𝑀 that computes DIV on
input 𝑎, 𝑏 by outputting the first 𝑐 = 0, 1, 2, … such that 𝑐𝑏 ≥ 𝑎. If 𝑎 > 0
and 𝑏 = 0 then the machine 𝑀 will never halt, but this is OK, since
DIV is undefined on such inputs. If 𝑎 = 0 and 𝑏 = 0, the machine 𝑀
will output 0, which is also OK, since we don’t care about what the
program outputs on inputs on which DIV is undefined. Formally, we
define computability of partial functions as follows:

Definition 7.5 — Computable (partial or total) functions. Let 𝐹 be either a
total or partial function mapping {0, 1}∗ to {0, 1}∗ and let 𝑀 be a

Turing machine. We say that 𝑀 computes 𝐹 if for every 𝑥 ∈ {0, 1}∗


on which 𝐹 is defined, 𝑀 (𝑥) = 𝐹 (𝑥). We say that a (partial or
total) function 𝐹 is computable if there is a Turing machine that
computes it.

Note that if 𝐹 is a total function, then it is defined on every 𝑥 ∈


{0, 1}∗ and hence in this case, Definition 7.5 is identical to Defini-
tion 7.2.

R
Remark 7.6 — Bot symbol. We often use ⊥ as our spe-
cial “failure symbol”. If a Turing machine 𝑀 fails to
halt on some input 𝑥 ∈ {0, 1}∗ then we denote this by
𝑀 (𝑥) = ⊥. This does not mean that 𝑀 outputs some
encoding of the symbol ⊥ but rather that 𝑀 enters
into an infinite loop when given 𝑥 as input.
If a partial function 𝐹 is undefined on 𝑥 then we can
also write 𝐹 (𝑥) = ⊥. Therefore one might think
that Definition 7.5 can be simplified to requiring that
𝑀 (𝑥) = 𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ , which would imply
that for every 𝑥, 𝑀 halts on 𝑥 if and only if 𝐹 is de-
fined on 𝑥. However, this is not the case: for a Turing
machine 𝑀 to compute a partial function 𝐹 it is not
necessary for 𝑀 to enter an infinite loop on inputs 𝑥
on which 𝐹 is not defined. All that is needed is for 𝑀
to output 𝐹 (𝑥) on values of 𝑥 on which 𝐹 is defined:
on other inputs it is OK for 𝑀 to output an arbitrary
value such as 0, 1, or anything else, or not to halt at all.
To borrow a term from the C programming language,
on inputs 𝑥 on which 𝐹 is not defined, what 𝑀 does is
“undefined behavior”.

7.2 TURING MACHINES AS PROGRAMMING LANGUAGES


The name “Turing machine”, with its “tape” and “head” evokes a
physical object, while in contrast we think of a program as a piece
of text. But we can think of a Turing machine as a program as well.
For example, consider the Turing machine 𝑀 of Section 7.1.1 that
computes the function PAL such that PAL(𝑥) = 1 iff 𝑥 is a palindrome.
We can also describe this machine as a program using the Python-like
pseudocode of the form below

# Gets an array Tape initialized to
# [">", x_0 , x_1 , .... , x_(n-1), "∅", "∅", ...]
# At the end of the execution, Tape[1] is equal to 1
# if x is a palindrome and is equal to 0 otherwise
def PAL(Tape):
    head = 0
    state = 0 # START
    while (state != 12): # 12 is used as a sentinel "halted" state
        if (state == 0 and Tape[head]=='0'):
            state = 1 # RIGHT_0
            Tape[head] = 'x'
            head += 1 # move right
        elif (state == 0 and Tape[head]=='1'):
            state = 2 # RIGHT_1
            Tape[head] = 'x'
            head += 1 # move right
        ... # more cases here

The precise details of this program are not important. What mat-
ters is that we can describe Turing machines as programs. Moreover,
note that when translating a Turing machine into a program, the tape becomes a list or array that can hold values from the finite set Σ.² The head position can be thought of as an integer-valued variable that holds integers of unbounded size. The state is a local register that can hold one of a fixed number of values in [𝑘].

² Most programming languages use arrays of fixed size, while a Turing machine’s tape is unbounded. But of course there is no need to store an infinite number of ∅ symbols. If you want, you can think of the tape as a list that starts off just long enough to store the input, but is dynamically grown in size as the Turing machine’s head explores new positions.

More generally we can think of every Turing machine 𝑀 as equivalent to a program similar to the following:
lent to a program similar to the following:

# Gets an array Tape initialized to
# [">", x_0 , x_1 , .... , x_(n-1), "∅", "∅", ...]
def M(Tape):
    state = 0
    i = 0 # holds head location
    while (True):
        # Move head, modify state, write to tape
        # based on current state and cell at head.
        # Below are just examples for how the program looks
        # for a particular transition function
        if Tape[i]=="0" and state==7:       # δ_M(7,"0")=(19,"1","R")
            Tape[i]="1"
            i += 1
            state = 19
        elif Tape[i]==">" and state == 13:  # δ_M(13,">")=(15,"0","S")
            Tape[i]="0"
            state = 15
        elif ...
        ...
        elif Tape[i]==">" and state == 29:  # δ_M(29,">")=(.,.,"H")
            break # Halt

If we wanted to use only Boolean (i.e., 0/1-valued) variables, then


we can encode the state variables using ⌈log 𝑘⌉ bits. Similarly, we
can represent each element of the alphabet Σ using ℓ = ⌈log |Σ|⌉ bits
and hence we can replace the Σ-valued array Tape[] with ℓ Boolean-
valued arrays Tape0[],…, Tape(ℓ − 1)[].

7.2.1 The NAND-TM Programming language


We now introduce the NAND-TM programming language, which cap-
tures the power of a Turing machine with a programming-language
formalism. Like the difference between Boolean circuits and Turing
machines, the main difference between NAND-TM and NAND-CIRC
is that NAND-TM models a single uniform algorithm that can compute
a function that takes inputs of arbitrary lengths. To do so, we extend the
NAND-CIRC programming language with two constructs:

• Loops: NAND-CIRC is a straight-line programming language: a


NAND-CIRC program of 𝑠 lines takes exactly 𝑠 steps of computa-
tion and hence in particular, cannot even touch more than 3𝑠 vari-
ables. Loops allow us to use a fixed-length program to encode the
instructions for a computation that can take an arbitrary amount of
time.

• Arrays: A NAND-CIRC program of 𝑠 lines touches at most 3𝑠 vari-


ables. While we can use variables with names such as Foo_17 or
Bar[22] in NAND-CIRC, they are not true arrays, since the number
in the identifier is a constant that is “hardwired” into the program.
NAND-TM contains actual arrays that can have a length that is not
a priori bounded.

Figure 7.8: A NAND-TM program has scalar variables


that can take a Boolean value, array variables that
hold a sequence of Boolean values, and a special
index variable i that can be used to index the array
variables. We refer to the i-th value of the array
variable Spam using Spam[i]. At each iteration of
the program the index variable can be incremented
or decremented by one step using the MODANDJUMP
operation.

Thus a good way to remember NAND-TM is using the following


informal equation:

NAND-TM = NAND-CIRC + loops + arrays (7.1)

R
Remark 7.7 — NAND-CIRC + loops + arrays = every-
thing.. As we will see, adding loops and arrays to
NAND-CIRC is enough to capture the full power of
all programming languages! Hence we could replace
“NAND-TM” with any of Python, C, Javascript, OCaml,
etc. in the left-hand side of (7.1). But we’re getting
ahead of ourselves: this issue will be discussed in
Chapter 8.

Concretely, the NAND-TM programming language adds the fol-


lowing features on top of NAND-CIRC (see Fig. 7.8):

• We add a special integer valued variable i. All other variables in


NAND-TM are Boolean valued (as in NAND-CIRC).

• Apart from i NAND-TM has two kinds of variables: scalars and


arrays. Scalar variables hold one bit (just as in NAND-CIRC). Array
variables hold an unbounded number of bits. At any point in the
computation we can access the array variables at the location in-
dexed by i using Foo[i]. We cannot access the arrays at locations
other than the one pointed to by i.

• We use the convention that arrays always start with a capital letter,
and scalar variables (which are never indexed with i) start with
lowercase letters. Hence Foo is an array and bar is a scalar variable.

• The input and output X and Y are now considered arrays with val-
ues of zeroes and ones. (There are also two other special arrays
X_nonblank and Y_nonblank, see below.)

• We add a special MODANDJUMP instruction that takes two Boolean


variables 𝑎, 𝑏 as input and does the following:

– If 𝑎 = 1 and 𝑏 = 1 then MODANDJUMP(𝑎, 𝑏) increments i by one


and jumps to the first line of the program.
– If 𝑎 = 0 and 𝑏 = 1 then MODANDJUMP(𝑎, 𝑏) decrements i by one
and jumps to the first line of the program. (If i is already equal
to 0 then it stays at 0.)
– If 𝑎 = 1 and 𝑏 = 0 then MODANDJUMP(𝑎, 𝑏) jumps to the first line of
the program without modifying i.
– If 𝑎 = 𝑏 = 0 then MODANDJUMP(𝑎, 𝑏) halts execution of the
program.

• The MODANDJUMP instruction always appears in the last line of a


NAND-TM program and nowhere else.

Default values. We need one more convention to handle “default val-


ues”. Turing machines have the special symbol ∅ to indicate that tape
location is “blank” or “uninitialized”. In NAND-TM there is no such
symbol, and all variables are Boolean, containing either 0 or 1. All
variables and locations of arrays default to 0 if they have not been ini-
tialized to another value. To keep track of whether a 0 in an array cor-
responds to a true zero or to an uninitialized cell, a programmer can
always add to an array Foo a “companion array” Foo_nonblank and
set Foo_nonblank[i] to 1 whenever the i th location is initialized. In
particular, we will use this convention for the input and output arrays
X and Y. A NAND-TM program has four special arrays X, X_nonblank,
Y, and Y_nonblank. When a NAND-TM program is executed on input
𝑥 ∈ {0, 1}∗ of length 𝑛, the first 𝑛 cells of the array X are initialized to
𝑥0 , … , 𝑥𝑛−1 and the first 𝑛 cells of the array X_nonblank are initialized
to 1. (All uninitialized cells default to 0.) The output of a NAND-TM
program is the string Y[0], …, Y[𝑚 − 1] where 𝑚 is the smallest inte-
ger such that Y_nonblank[𝑚]= 0. A NAND-TM program gets called
with X and X_nonblank initialized to contain the input, and writes to Y
and Y_nonblank to produce the output.
Formally, NAND-TM programs are defined as follows:

Definition 7.8 — NAND-TM programs. A NAND-TM program consists of


a sequence of lines of the form foo = NAND(bar,blah) ending
with a line of the form MODANDJUMP(foo,bar), where foo,bar,blah
are either scalar variables (sequences of letters, digits, and under-
scores) or array variables of the form Foo[i] (starting with capital
letters and indexed by i). The program has the array variables X,
X_nonblank, Y, Y_nonblank and the index variable i built in, and
can use additional array and scalar variables.
If 𝑃 is a NAND-TM program and 𝑥 ∈ {0, 1}∗ is an input then an
execution of 𝑃 on 𝑥 is the following process:

1. The arrays X and X_nonblank are initialized by X[𝑖]= 𝑥𝑖 and


X_nonblank[𝑖]= 1 for all 𝑖 ∈ [|𝑥|]. All other variables and cells
are initialized to 0. The index variable i is also initialized to 0.

2. The program is executed line by line. When the last line MODAND-
JUMP(foo,bar) is executed we do as follows:

a. If foo= 1 and bar= 0, jump to the first line without modify-


ing the value of i.

b. If foo= 1 and bar= 1, increment i by one and jump to the


first line.
c. If foo= 0 and bar= 1, decrement i by one (unless it is al-
ready zero) and jump to the first line.
d. If foo= 0 and bar= 0, halt and output Y[0], …, Y[𝑚 − 1]
where 𝑚 is the smallest integer such that Y_nonblank[𝑚]= 0.

7.2.2 Sneak peak: NAND-TM vs Turing machines


As the name implies, NAND-TM programs are a direct implemen-
tation of Turing machines in programming language form. We will
show the equivalence below, but you can already see how the compo-
nents of Turing machines and NAND-TM programs correspond to one
another:

Table 7.2: Turing Machine and NAND-TM analogs

• The state of a Turing machine (a single register that takes values in [𝑘]) corresponds to the scalar variables of a NAND-TM program (several variables such as foo, bar etc., each taking values in {0, 1}).

• The tape of a Turing machine (one tape containing values in a finite set Σ; potentially infinite, but 𝑇 [𝑡] defaults to ∅ for all locations 𝑡 that have not been accessed) corresponds to the arrays of a NAND-TM program (several arrays such as Foo, Bar etc.; for each such array Arr and index 𝑗, the value of Arr at position 𝑗 is either 0 or 1, and defaults to 0 for positions that have not been written to).

• The head location of a Turing machine (a number 𝑖 ∈ ℕ that encodes the position of the head) corresponds to the index variable of a NAND-TM program (the variable i that can be used to access the arrays).

• Accessing memory: at every step the Turing machine has access to its local state, but can only access the tape at the position of the current head location; at every step a NAND-TM program has access to all the scalar variables, but can only access the arrays at the location i of the index variable.

• Control of location / index variable: in each step the Turing machine can move the head location by at most one position; in each iteration of its main loop the NAND-TM program can modify the index i by at most one.

7.2.3 Examples
We now present some examples of NAND-TM programs.

■ Example 7.9 — Increment in NAND-TM.The following is a NAND-TM


program to compute the increment function. That is, INC ∶ {0, 1}∗ →
{0, 1}∗ such that for every 𝑥 ∈ {0, 1}𝑛 , INC(𝑥) is the 𝑛 + 1 bit long
string 𝑦 such that if 𝑋 = ∑_{𝑖=0}^{𝑛−1} 𝑥_𝑖 ⋅ 2^𝑖 is the number represented by
𝑥, then 𝑦 is the (least-significant digit first) binary representation of
the number 𝑋 + 1.
We start by describing the program using “syntactic sugar” for
NAND-CIRC for the IF, XOR and AND functions (as well as the con-
stant one function, and the function COPY that just maps a bit to
itself).

carry = IF(started,carry,one(started))
started = one(started)
Y[i] = XOR(X[i],carry)
carry = AND(X[i],carry)
Y_nonblank[i] = one(started)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

Since we used syntactic sugar, the above is not, strictly speaking,


a valid NAND-TM program. However, by “opening up” all the
syntactic sugar, we get the following “sugar free” valid program to
compute the same function.

temp_0 = NAND(started,started)
temp_1 = NAND(started,temp_0)
temp_2 = NAND(started,started)
temp_3 = NAND(temp_1,temp_2)
temp_4 = NAND(carry,started)
carry = NAND(temp_3,temp_4)
temp_6 = NAND(started,started)
started = NAND(started,temp_6)
temp_8 = NAND(X[i],carry)
temp_9 = NAND(X[i],temp_8)
temp_10 = NAND(carry,temp_8)
Y[i] = NAND(temp_9,temp_10)
temp_12 = NAND(X[i],carry)
carry = NAND(temp_12,temp_12)
temp_14 = NAND(started,started)
Y_nonblank[i] = NAND(started,temp_14)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

■ Example 7.10 — XOR in NAND-TM. The following is a NAND-TM pro-


gram to compute the XOR function on inputs of arbitrary length.
That is, XOR ∶ {0, 1}∗ → {0, 1} such that XOR(𝑥) = ∑_{𝑖=0}^{|𝑥|−1} 𝑥_𝑖 mod 2
for every 𝑥 ∈ {0, 1}∗ . Once again, we use a certain “syn-
tactic sugar”. Specifically, we access the arrays X and Y at their
zero-th entry, while NAND-TM only allows access to arrays in the
coordinate of the variable i.

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

To transform the program above to a valid NAND-TM program,


we can transform references such as X[0] and Y[0] to scalar vari-
ables x_0 and y_0 (similarly we can transform any reference of the
form Foo[17] or Bar[15] to scalars such as foo_17 and bar_15).
We then need to add code to load the value of X[0] to x_0 and
similarly to write to Y[0] the value of y_0, but this is not hard to
do. Using the fact that variables are initialized to zero by default,
we can create a variable init which will be set to 1 at the end of
the first iteration and not changed since then. We can then add an
array Atzero and code that will modify Atzero[i] to 1 if init is
0 and otherwise leave it as it is. This will ensure that Atzero[i] is
equal to 1 if and only if i is set to zero, and allow the program to
know when we are at the zeroth location. Thus we can add code
to read and write to the corresponding scalars x_0, y_0 when we
are at the zeroth location, and also code to move i to zero and then
halt at the end. Working this out fully is somewhat tedious, but can
be a good exercise.

P
Working out the above two examples can go a long
way towards understanding the NAND-TM language.
See our GitHub repository for a full specification of
the NAND-TM language.
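For readers who want to experiment, the following Python sketch (an informal aid, not the official specification) executes a NAND-TM program given as a list of NAND lines together with the pair of variables fed to the final MODANDJUMP instruction. Running it on the sugar-free increment program of Example 7.9 with input 110 (the number 3, least-significant digit first) prints 0010 (the number 4):

from collections import defaultdict

def NAND(a, b): return 1 - (a & b)

def run_nandtm(nand_lines, modandjump_args, x):
    arrays  = defaultdict(lambda: defaultdict(int))  # all cells default to 0
    scalars = defaultdict(int)                       # all scalars default to 0
    for j, bit in enumerate(x):                      # load the input
        arrays["X"][j] = int(bit)
        arrays["X_nonblank"][j] = 1
    i = 0                                            # the index variable

    def read(name):
        return arrays[name[:-3]][i] if name.endswith("[i]") else scalars[name]
    def write(name, val):
        if name.endswith("[i]"): arrays[name[:-3]][i] = val
        else: scalars[name] = val

    while True:
        for target, a, b in nand_lines:              # execute the lines in order
            write(target, NAND(read(a), read(b)))
        foo, bar = read(modandjump_args[0]), read(modandjump_args[1])
        if foo == 0 and bar == 0:                    # halt and produce the output
            out, m = [], 0
            while arrays["Y_nonblank"][m] == 1:
                out.append(str(arrays["Y"][m])); m += 1
            return "".join(out)
        if foo == 1 and bar == 1: i += 1             # increment i, jump back
        if foo == 0 and bar == 1: i = max(0, i - 1)  # decrement i (not below 0)
        # foo == 1, bar == 0: jump back without changing i

# The sugar-free increment program of Example 7.9, line by line:
inc = [("temp_0","started","started"), ("temp_1","started","temp_0"),
       ("temp_2","started","started"), ("temp_3","temp_1","temp_2"),
       ("temp_4","carry","started"),   ("carry","temp_3","temp_4"),
       ("temp_6","started","started"), ("started","started","temp_6"),
       ("temp_8","X[i]","carry"),      ("temp_9","X[i]","temp_8"),
       ("temp_10","carry","temp_8"),   ("Y[i]","temp_9","temp_10"),
       ("temp_12","X[i]","carry"),     ("carry","temp_12","temp_12"),
       ("temp_14","started","started"),("Y_nonblank[i]","started","temp_14")]
print(run_nandtm(inc, ("X_nonblank[i]", "X_nonblank[i]"), "110"))  # prints 0010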

7.3 EQUIVALENCE OF TURING MACHINES AND NAND-TM PRO-


GRAMS
Given the above discussion, it might not be surprising that Turing
machines turn out to be equivalent to NAND-TM programs. Indeed,
we designed the NAND-TM language to have this property. Never-
theless, this is a significant result, and the first of many other such
equivalence results we will see in this book.

Theorem 7.11 — Turing machines and NAND-TM programs are equivalent. For
every 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , 𝐹 is computable by a NAND-TM pro-
gram 𝑃 if and only if there is a Turing machine 𝑀 that computes 𝐹 .

Proof Idea:
To prove such an equivalence theorem, we need to show two di-
rections. We need to be able to (1) transform a Turing machine 𝑀 to
a NAND-TM program 𝑃 that computes the same function as 𝑀 and
(2) transform a NAND-TM program 𝑃 into a Turing machine 𝑀 that
computes the same function as 𝑃 .
The idea of the proof is illustrated in Fig. 7.9. To show (1), given
a Turing machine 𝑀 , we will create a NAND-TM program 𝑃 that
will have an array Tape for the tape of 𝑀 and scalar (i.e., non-array)
variable(s) state for the state of 𝑀 . Specifically, since the state of a
Turing machine is not in {0, 1} but rather in a larger set [𝑘], we will use
⌈log 𝑘⌉ variables state_0 , …, state_⌈log 𝑘⌉ − 1 to store the
representation of the state. Similarly, to encode the larger alphabet Σ
of the tape, we will use ⌈log |Σ|⌉ arrays Tape_0 , …, Tape_⌈log |Σ|⌉ − 1,
such that the 𝑖𝑡ℎ location of these arrays encodes the 𝑖𝑡ℎ symbol in the
tape for every 𝑖. Using the fact that every finite function can be computed
by a NAND-CIRC program, we will be able to compute the transition
function of 𝑀 , replacing moving left and right by decrementing and
incrementing i respectively.
We show (2) using very similar ideas. Given a program 𝑃 that uses
𝑎 array variables and 𝑏 scalar variables, we will create a Turing ma-
chine with about 2𝑏 states to encode the values of scalar variables, and
an alphabet of about 2𝑎 so we can encode the arrays using our tape.
(The reason the sizes are only “about” 2𝑎 and 2𝑏 is that we need to
add some symbols and steps for bookkeeping purposes.) The Turing
machine 𝑀 simulates each iteration of the program 𝑃 by updating its
state and tape accordingly.

Proof of Theorem 7.11. We start by proving the “if” direction of The-


orem 7.11. Namely we show that given a Turing machine 𝑀 , we can

Figure 7.9: Comparing a Turing machine to a NAND-


TM program. Both have an unbounded memory
component (the tape for a Turing machine, and the ar-
rays for a NAND-TM program), as well as a constant
local memory (state for a Turing machine, and scalar
variables for a NAND-TM program). Both can only
access at each step one location of the unbounded
memory, this is the “head” location for a Turing
machine, and the value of the index variable i for a
NAND-TM program.

find a NAND-TM program 𝑃𝑀 such that for every input 𝑥, if 𝑀 halts


on input 𝑥 with output 𝑦 then 𝑃𝑀 (𝑥) = 𝑦. Since our goal is just to
show such a program 𝑃𝑀 exists, we don’t need to write out the full
code of 𝑃𝑀 line by line, and can take advantage of our various “syn-
tactic sugar” in describing it.
The key observation is that by Theorem 4.12 we can compute every
finite function using a NAND-CIRC program. In particular, consider
the transition function 𝛿𝑀 ∶ [𝑘] × Σ → [𝑘] × Σ × {L, R, S, H} of our Turing
machine. We can encode its components as follows:

• We encode [𝑘] using {0, 1}ℓ and Σ using {0, 1}ℓ′ , where ℓ = ⌈log 𝑘⌉
  and ℓ′ = ⌈log |Σ|⌉.

• We encode the set {L, R, S, H} using {0, 1}2 . We will choose the
encoding L ↦ 01, R ↦ 11, S ↦ 10, H ↦ 00. (This conveniently
corresponds to the semantics of the MODANDJUMP operation.)

Hence we can identify 𝛿𝑀 with a function 𝑀 ∶ {0, 1}^{ℓ+ℓ′} → {0, 1}^{ℓ+ℓ′+2},
mapping strings of length ℓ + ℓ′ to strings of length ℓ + ℓ′ + 2. By
Theorem 4.12 there exists a finite length NAND-CIRC program
ComputeM that computes this function 𝑀 . The idea behind
the NAND-TM program to simulate 𝑀 is to:

1. Use variables state_0 … state_ℓ − 1 to encode 𝑀 ’s state.

2. Use arrays Tape_0[] … Tape_ℓ′ − 1[] to encode 𝑀 ’s tape.

3. Use the fact that the transition function is finite and hence computable
   by a NAND-CIRC program.

Given the above, we can write code of the form:


state_0 … state_ℓ − 1, Tape_0[i]… Tape_ℓ′ − 1[i], dir0,dir1 ←
TRANSITION( state_0 … state_ℓ − 1, Tape_0[i]… Tape_ℓ′ − 1[i] )
MODANDJUMP(dir0,dir1)
Every step of the main loop of the above program perfectly mimics
the computation of the Turing machine 𝑀 , and so the program carries
out exactly the definition of computation by a Turing machine as per
Definition 7.1.

For the other direction, suppose that 𝑃 is a NAND-TM program


with 𝑠 lines, ℓ scalar variables, and ℓ′ array variables. We will show
that there exists a Turing machine 𝑀𝑃 with 2ℓ + 𝐶 states and alphabet
Σ of size 𝐶 ′ + 2ℓ′ that computes the same function as 𝑃 (where 𝐶, 𝐶 ′

are some constants to be determined later).


Specifically, consider the function 𝑃 ∶ {0, 1}ℓ × {0, 1}ℓ′ → {0, 1}ℓ ×
{0, 1}ℓ′ that on input the contents of 𝑃 ’s scalar variables and the con-
tents of the array variables at location i in the beginning of an itera-
tion, outputs all the new values of these variables at the last line of the
iteration, right before the MODANDJUMP instruction is executed.
If foo and bar are the two variables that are used as input to the
MODANDJUMP instruction, then based on the values of these variables we
can compute whether i will increase, decrease or stay the same, and
whether the program will halt or jump back to the beginning. Hence a
Turing machine can simulate an execution of 𝑃 in one iteration using
a finite function applied to its alphabet. The overall operation of the
Turing machine will be as follows:

1. The machine 𝑀𝑃 encodes the contents of the array variables of 𝑃


in its tape and the contents of the scalar variables in (part of) its
state. Specifically, if 𝑃 has ℓ local variables and 𝑡 arrays, then the
state space of 𝑀 will be large enough to encode all 2ℓ assignments
to the local variables, and the alphabet Σ of 𝑀 will be large enough
to encode all 2𝑡 assignments for the array variables at each location.
The head location corresponds to the index variable i.

2. Recall that every line of the program 𝑃 corresponds to reading


and writing either a scalar variable, or an array variable at the loca-
tion i. In one iteration of 𝑃 the value of i remains fixed, and so the
machine 𝑀 can simulate this iteration by reading the values of all
array variables at i (which are encoded by the single symbol in the
alphabet Σ located at the i-th cell of the tape) , reading the values
of all scalar variables (which are encoded by the state), and updat-
ing both. The transition function of 𝑀 can output L, S, R depending
on whether the values given to the MODANDJUMP operation are 01, 10
or 11 respectively.

3. When the program halts (i.e., MODANDJUMP gets 00) then the Turing
machine will enter into a special loop to copy the results of the Y
array into the output and then halt. We can achieve this by adding a
few more states.

The above is not a full formal description of a Turing machine, but


our goal is just to show that such a machine exists. One can see that
𝑀𝑃 simulates every step of 𝑃 , and hence computes the same function
as 𝑃 .

R
Remark 7.12 — Running time equivalence (optional). If
we examine the proof of Theorem 7.11 then we can
see that every iteration of the loop of a NAND-TM
program corresponds to one step in the execution of
the Turing machine. We will come back to this ques-
tion of measuring the number of computation steps
later in this course. For now, the main take away point
is that NAND-TM programs and Turing machines
are essentially equivalent in power even when taking
running time into account.

7.3.1 Specification vs implementation (again)


Once you understand the definitions of both NAND-TM programs
and Turing machines, Theorem 7.11 is straightforward. Indeed,
NAND-TM programs are not as much a different model from Tur-
ing machines as they are simply a reformulation of the same model
using programming language notation. You can think of the differ-
ence between a Turing machine and a NAND-TM program as the
difference between representing a number using decimal or binary
notation. In contrast, the difference between a function 𝐹 and a Turing
machine that computes 𝐹 is much more profound: it is like the differ-
ence between the equation 𝑥² + 𝑥 = 12, and the number 3 that is a
solution for this equation. For this reason, while we take special care
in distinguishing functions from programs or machines, we will often
identify the two latter concepts. We will move freely between describ-
ing an algorithm as a Turing machine or as a NAND-TM program (as
well as some of the other equivalent computational models we will see
in Chapter 8 and beyond).

Table 7.3: Specification vs Implementation formalisms

• Finite computation. Specification: functions mapping {0, 1}𝑛 to {0, 1}𝑚 . Implementation: circuits, straight-line programs.

• Infinite computation. Specification: functions mapping {0, 1}∗ to {0, 1} or to {0, 1}∗ . Implementation: algorithms, Turing machines, programs.

7.4 NAND-TM SYNTACTIC SUGAR


Just like we did with NAND-CIRC in Chapter 4, we can use “syntactic
sugar” to make NAND-TM programs easier to write. For starters, we
can use all of the syntactic sugar of NAND-CIRC, such as macro def-
initions and conditionals (i.e., if/then). However, we can go beyond
this and achieve (for example):

• Inner loops such as the while and for operations common to many
programming languages.

• Multiple index variables (e.g., not just i but we can add j, k, etc.).

• Arrays with more than one dimension (e.g., Foo[i][j],


Bar[i][j][k] etc.)

In all of these cases (and many others) we can implement the new
feature as mere “syntactic sugar” on top of standard NAND-TM. This
means that the set of functions computable by NAND-TM with this
feature is the same as the set of functions computable by standard
NAND-TM. Similarly, we can show that the set of functions com-
putable by Turing machines that have more than one tape, or tapes
of more dimensions than one, is the same as the set of functions com-
putable by standard Turing machines.

7.4.1 “GOTO” and inner loops


We can implement more advanced looping constructs than the simple
MODANDJUMP. For example, we can implement GOTO. A GOTO statement
corresponds to jumping to a specific line in the execution. For exam-
ple, if we have code of the form

"start": do foo
GOTO("end")
"skip": do bar
"end": do blah

then the program will only do foo and blah as when it reaches the
line GOTO("end") it will jump to the line labeled with "end". We can
achieve the effect of GOTO in NAND-TM using conditionals. In the
code below, we assume that we have a variable pc that can take strings
of some constant length. This can be encoded using a finite number
of Boolean variables pc_0, pc_1, …, pc_𝑘 − 1, and so when we write
below pc = "label" what we mean is something like pc_0 = 0,pc_1
= 1, … (where the bits 0, 1, … correspond to the encoding of the finite
string "label" as a string of length 𝑘). We also assume that we have
access to conditional (i.e., if statements), which we can emulate using
syntactic sugar in the same way as we did in NAND-CIRC.

To emulate a GOTO statement, we will first modify a program P of


the form

do foo
do bar
do blah

to have the following form (using syntactic sugar for if):

pc = "line1"
if (pc=="line1"):
do foo
pc = "line2"
if (pc=="line2"):
do bar
pc = "line3"
if (pc=="line3"):
do blah

These two programs do the same thing. The variable pc cor-


responds to the “program counter” and tells the program which
line to execute next. We can see that if we wanted to emulate a
GOTO("line3") then we could simply modify the instruction pc =
"line2" to be pc = "line3".
In NAND-CIRC we could only have GOTOs that go forward in the
code, but since in NAND-TM everything is encompassed within a
large outer loop, we can use the same ideas to implement GOTOs that
can go backward, as well as conditional loops.
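The following short Python sketch (an analogy rather than NAND-TM code) illustrates why a single outer loop together with a program-counter variable suffices for backward jumps: the loop keeps re-running the dispatch, and setting pc to an earlier label sends control back:

# Emulating (also backward) GOTOs with a "program counter" variable
# inside a single outer loop, as described above.
counter = 3
pc = "loop"
while pc != "halt":                  # plays the role of NAND-TM's outer loop
    if pc == "loop":
        print("counter =", counter)
        counter -= 1
        pc = "loop" if counter > 0 else "halt"   # a backward GOTO, or halt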

Other loops. Once we have GOTO, we can emulate all the standard loop
constructs such as while, do .. until or for in NAND-TM as well.
For example, we can replace the code

while foo:
do blah
do bar

with

"loop":
if NOT(foo): GOTO("next")
do blah
GOTO("loop")
"next":
do bar

R
Remark 7.13 — GOTO’s in programming languages. The
GOTO statement was a staple of most early program-
ming languages, but has largely fallen out of favor and
is not included in many modern languages such as
Python, Java, Javascript. In 1968, Edsger Dijkstra wrote a
famous letter titled “Go to statement considered harm-
ful.” (see also Fig. 7.10). The main trouble with GOTO
is that it makes analysis of programs more difficult
by making it harder to argue about invariants of the
program.
When a program contains a loop of the form:

for j in range(100):
do something

do blah

you know that the line of code do blah can only be


reached if the loop ended, in which case you know
that j is equal to 100, and might also be able to argue
other properties of the state of the program. In con-
trast, if the program might jump to do blah from any
other point in the code, then it’s very hard for you as
the programmer to know what you can rely upon in
this code. As Dijkstra said, such invariants are important because
“our intellectual powers are rather geared to master static relations and …
our powers to visualize processes evolving in time are relatively poorly
developed” and so “we should … do … our utmost best to shorten the
conceptual gap between the static program and the dynamic process.”
That said, GOTO is still a major part of lower level lan-
guages where it is used to implement higher-level
looping constructs such as while and for loops.
For example, even though Java doesn’t have a GOTO
statement, the Java Bytecode (which is a lower-level
representation of Java) does have such a statement.
Similarly, Python bytecode has instructions such as
POP_JUMP_IF_TRUE that implement the GOTO function-
ality, and similar instructions are included in many
assembly languages. The way we use GOTO to imple-
ment a higher-level functionality in NAND-TM is
reminiscent of the way these various jump instructions
are used to implement higher-level looping constructs.

Figure 7.10: XKCD’s take on the GOTO statement.

7.5 UNIFORMITY, AND NAND VS NAND-TM (DISCUSSION)


While NAND-TM adds extra operations over NAND-CIRC, it is not
exactly accurate to say that NAND-TM programs or Turing machines
are “more powerful” than NAND-CIRC programs or Boolean circuits.
NAND-CIRC programs, having no loops, are simply not applicable

for computing functions with an unbounded number of inputs. Thus,


to compute a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ using NAND-CIRC (or
equivalently, Boolean circuits) we need a collection of programs/cir-
cuits: one for every input length.
The key difference between NAND-CIRC and NAND-TM is that
NAND-TM allows us to express the fact that the algorithm for com-
puting parities of length-100 strings is really the same one as the al-
gorithm for computing parities of length-5 strings (or similarly the
fact that the algorithm for adding 𝑛-bit numbers is the same for every
𝑛, etc.). That is, one can think of the NAND-TM program for general
parity as the “seed” out of which we can grow NAND-CIRC programs
for length 10, length 100, or length 1000 parities as needed.
This notion of a single algorithm that can compute functions of all
input lengths is known as uniformity of computation. Hence we think
of Turing machines / NAND-TM as uniform models of computation,
as opposed to Boolean circuits or NAND-CIRC, which are non-uniform
models, in which we have to specify a different program for every
input length.
Looking ahead, we will see that this uniformity leads to another
crucial difference between Turing machines and circuits. Turing ma-
chines can have inputs and outputs that are longer than the descrip-
tion of the machine as a string, and in particular there exists a Turing
machine that can “self replicate” in the sense that it can print its own
code. The notion of “self replication”, and the related notion of “self
reference” are crucial to many aspects of computation, and beyond
that to life itself, whether in the form of digital or biological programs.
For now, what you ought to remember is the following differences
between uniform and non-uniform computational models:

• Non-uniform computational models: Examples are NAND-CIRC


programs and Boolean circuits. These are models where each indi-
vidual program/circuit can compute a finite function 𝑓 ∶ {0, 1}𝑛 →
{0, 1}𝑚 . We have seen that every finite function can be computed by
some program/circuit. To discuss computation of an infinite func-
tion 𝐹 ∶ {0, 1}∗ → {0, 1}∗ we need to allow a sequence {𝑃𝑛 }𝑛∈ℕ of
programs/circuits (one for every input length), but this does not
capture the notion of a single algorithm to compute the function 𝐹 .

• Uniform computational models: Examples are Turing machines and


NAND-TM programs. These are models where a single program/-
machine can take inputs of arbitrary length and hence compute an
infinite function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ . The number of steps that
a program/machine takes on some input is not a priori bounded
in advance and in particular there is a chance that it will enter into
an infinite loop. Unlike the non-uniform case, we have not shown

that every infinite function can be computed by some NAND-TM


program/Turing machine. We will come back to this point in Chap-
ter 9.

✓ Chapter Recap

• Turing machines capture the notion of a single al-


gorithm that can evaluate functions of every input
length.
• They are equivalent to NAND-TM programs, which
add loops and arrays to NAND-CIRC.
• Unlike NAND-CIRC or Boolean circuits, the num-
ber of steps that a Turing machine takes on a given
input is not fixed in advance. In fact, a Turing ma-
chine or a NAND-TM program can enter into an
infinite loop on certain inputs, and not halt at all.

7.6 EXERCISES
Exercise 7.1 — Explicit NAND-TM programming. Produce the code of a
(syntactic-sugar free) NAND-TM program 𝑃 that computes the (un-
bounded input length) Majority function 𝑀𝑎𝑗 ∶ {0, 1}∗ → {0, 1} where
for every 𝑥 ∈ {0, 1}∗ , 𝑀𝑎𝑗(𝑥) = 1 if and only if ∑_{𝑖=0}^{|𝑥|−1} 𝑥_𝑖 > |𝑥|/2. We
say “produce” rather than “write” because you do not have to write
the code of 𝑃 by hand, but rather can use the programming language
of your choice to compute this code.

Exercise 7.2 — Computable functions examples. Prove that the following
functions are computable. For all of these functions, you do not have
to fully specify the Turing machine or the NAND-TM program that
computes the function, but rather only prove that such a machine or
program exists:

1. INC ∶ {0, 1}∗ → {0, 1}∗ which takes as input a representation of a


natural number 𝑛 and outputs the representation of 𝑛 + 1.

2. ADD ∶ {0, 1}∗ → {0, 1}∗ which takes as input a representation of


a pair of natural numbers (𝑛, 𝑚) and outputs the representation of
𝑛 + 𝑚.

3. MULT ∶ {0, 1}∗ → {0, 1}∗ , which takes a representation of a pair of


natural numbers (𝑛, 𝑚) and outputs the representation of 𝑛 ⋅ 𝑚.

4. SORT ∶ {0, 1}∗ → {0, 1}∗ which takes as input the representation of
a list of natural numbers (𝑎0 , … , 𝑎𝑛−1 ) and returns its sorted version
(𝑏0 , … , 𝑏𝑛−1 ) such that for every 𝑖 ∈ [𝑛] there is some 𝑗 ∈ [𝑛] with
𝑏𝑖 = 𝑎𝑗 and 𝑏0 ≤ 𝑏1 ≤ ⋯ ≤ 𝑏𝑛−1 .

Exercise 7.3 — Two index NAND-TM. Define NAND-TM’ to be the variant


of NAND-TM where there are two index variables i and j. Arrays
can be indexed by either i or j. The operation MODANDJUMP takes four
variables 𝑎, 𝑏, 𝑐, 𝑑 and uses the values of 𝑐, 𝑑 to decide whether to
increment j, decrement j or keep it in the same value (correspond-
ing to 01, 10, and 00 respectively). Prove that for every function
𝐹 ∶ {0, 1}∗ → {0, 1}∗ , 𝐹 is computable by a NAND-TM program if
and only if 𝐹 is computable by a NAND-TM’ program.

Exercise 7.4 — Two tape Turing machines. Define a two tape Turing machine
to be a Turing machine which has two separate tapes and two separate
heads. At every step, the transition function gets as input the location
of the cells in the two tapes, and can decide whether to move each
head independently. Prove that for every function 𝐹 ∶ {0, 1}∗ →
{0, 1}∗ , 𝐹 is computable by a standard Turing machine if and only if 𝐹
is computable by a two-tape Turing machine.

Exercise 7.5 — Two dimensional arrays. Define NAND-TM” to be the vari-
ant of NAND-TM where just like NAND-TM’ defined in Exercise 7.3
there are two index variables i and j, but now the arrays are two di-
mensional and so we index an array Foo by Foo[i][j]. Prove that for
every function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , 𝐹 is computable by a NAND-TM
program if and only if 𝐹 is computable by a NAND-TM” program.

Exercise 7.6 — Two dimensional Turing machines. Define a two-dimensional


Turing machine to be a Turing machine in which the tape is two dimen-
sional. At every step the machine can move Up, Down, Left, Right, or
Stay. Prove that for every function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , 𝐹 is com-
putable by a standard Turing machine if and only if 𝐹 is computable
by a two-dimensional Turing machine.

Exercise 7.7 Prove the following closure properties of the set R defined


in Definition 7.3:

1. If 𝐹 ∈ R then the function 𝐺(𝑥) = 1 − 𝐹 (𝑥) is in R.

2. If 𝐹 , 𝐺 ∈ R then the function 𝐻(𝑥) = 𝐹 (𝑥) ∨ 𝐺(𝑥) is in R.

3. If 𝐹 ∈ R then the function 𝐹 ∗ is in R where 𝐹 ∗ is defined as fol-


lows: 𝐹 ∗ (𝑥) = 1 iff there exist some strings 𝑤0 , … , 𝑤𝑘−1 such that
𝑥 = 𝑤0 𝑤1 ⋯ 𝑤𝑘−1 and 𝐹 (𝑤𝑖 ) = 1 for every 𝑖 ∈ [𝑘].

4. If 𝐹 ∈ R then the function 𝐺 is in R, where 𝐺 is defined by 𝐺(𝑥) = 1
   if there exists 𝑦 ∈ {0, 1}^{|𝑥|} such that 𝐹 (𝑥𝑦) = 1, and 𝐺(𝑥) = 0
   otherwise.

Exercise 7.8 — Oblivious Turing Machines (challenging). Define a Turing ma-
chine 𝑀 to be oblivious if its head movements are independent of its
input. That is, we say that 𝑀 is oblivious if there exists an infinite
sequence MOVE ∈ {L, R, S}∞ such that for every 𝑥 ∈ {0, 1}∗ , the
movements of 𝑀 when given input 𝑥 (up until the point it halts, if
such point exists) are given by MOVE0 , MOVE1 , MOVE2 , ….
Prove that for every function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , if 𝐹 is com-
putable then it is computable by an oblivious Turing machine. See 3
You can use the sequence R, L,R, R, L, L, R,R,R, L, L, L,
footnote for hint.3 ….

Exercise 7.9 — Single vs multiple bit. Prove that for every 𝐹 ∶ {0, 1}∗ →
{0, 1}∗ , the function 𝐹 is computable if and only if the following func-
tion 𝐺 ∶ {0, 1}∗ → {0, 1} is computable, where 𝐺 is defined as follows:
𝐺(𝑥, 𝑖, 𝜎) = 𝐹 (𝑥)_𝑖 if 𝑖 < |𝐹 (𝑥)| and 𝜎 = 0; 𝐺(𝑥, 𝑖, 𝜎) = 1 if 𝑖 < |𝐹 (𝑥)|
and 𝜎 = 1; and 𝐺(𝑥, 𝑖, 𝜎) = 0 if 𝑖 ≥ |𝐹 (𝑥)|.
Exercise 7.10 — Uncomputability via counting. Recall that R is the set of all
total functions from {0, 1}∗ to {0, 1} that are computable by a Turing
machine (see Definition 7.3). Prove that R is countable. That is, prove
that there exists a one-to-one map 𝐷𝑡𝑁 ∶ R → ℕ. You can use the
equivalence between Turing machines and NAND-TM programs.

Exercise 7.11 — Not every function is computable. Prove that the set of all
total functions from {0, 1}∗ → {0, 1} is not countable. You can use the
results of Section 2.4. (We will see an explicit uncomputable function
in Chapter 9.)

7.7 BIBLIOGRAPHICAL NOTES


Augusta Ada Byron, countess of Lovelace (1815-1852) lived a short
but turbulent life, though is today most well known for her collabo-
ration with Charles Babbage (see [Ste87] for a biography). Ada took
an immense interest in Babbage’s analytical engine, which we men-
tioned in Chapter 3. In 1842-3, she translated from Italian a paper of

Menabrea on the engine, adding copious notes (longer than the paper
itself). The quote in the chapter’s beginning is taken from Nota A in
this text. Lovelace’s notes contain several examples of programs for the
analytical engine, and because of this she has been called “the world’s
first computer programmer” though it is not clear whether they were
written by Lovelace or Babbage himself [Hol01]. Regardless, Ada was
clearly one of very few people (perhaps the only one outside of Bab-
bage himself) to fully appreciate how important and revolutionary the
idea of mechanizing computation truly is.
The books of Shetterly [She16] and Sobel [Sob17] discuss the his-
tory of human computers (who were female, more often than not)
and their important contributions to scientific discoveries in astron-
omy and space exploration.
Alan Turing was one of the intellectual giants of the 20th century.
He was not only the first person to define the notion of computation,
but also invented and used some of the world’s earliest computational
devices as part of the effort to break the Enigma cipher during World
War II, saving millions of lives. Tragically, Turing committed suicide
in 1954, following his conviction in 1952 for homosexual acts and a
court-mandated hormonal treatment. In 2009, British prime minister
Gordon Brown made an official public apology to Turing, and in 2013
Queen Elizabeth II granted Turing a posthumous pardon. Turing’s life
is the subject of a great book and a mediocre movie.
Sipser’s text [Sip97] defines a Turing machine as a seven tuple con-
sisting of the state space, input alphabet, tape alphabet, transition
function, starting state, accepting state, and rejecting state. Superfi-
cially this looks like a very different definition than Definition 7.1 but
it is simply a different representation of the same concept, just as a
graph can be represented in either adjacency list or adjacency matrix
form.
One difference is that Sipser considers a general set of states 𝑄 that
is not necessarily of the form 𝑄 = {0, 1, 2, … , 𝑘 − 1} for some natural
number 𝑘 > 0. Sipser also restricts his attention to Turing machines
that output only a single bit and therefore designates two special halt-
ing states: the “0 halting state” (often known as the rejecting state) and
the other as the “1 halting state” (often known as the accepting state).
Thus instead of writing 0 or 1 on an output tape, the machine will en-
ter into one of these states and halt. This again makes no difference
to the computational power, though we prefer to consider the more
general model of multi-bit outputs. (Sipser presents the basic task of a
Turing machine as that of deciding a language as opposed to computing
a function, but these are equivalent, see Remark 7.4.)
Sipser considers also functions with input in Σ∗ for an arbitrary
alphabet Σ (and hence distinguishes between the input alphabet which

he denotes as Σ and the tape alphabet which he denotes as Γ), while we


restrict attention to functions with binary strings as input. Again this
is not a major issue, since we can always encode an element of Σ using
a binary string of length ⌈log |Σ|⌉. Finally (and this is a very minor
point) Sipser requires the machine to either move left or right in every
step, without the Stay operation, though staying in place is very easy
to emulate by simply moving right and then back left.
Another definition used in the literature is that a Turing machine
𝑀 recognizes a language 𝐿 if for every 𝑥 ∈ 𝐿, 𝑀 (𝑥) = 1 and for
every 𝑥 ∉ 𝐿, 𝑀 (𝑥) ∈ {0, ⊥}. A language 𝐿 is recursively enumerable if
there exists a Turing machine 𝑀 that recognizes it, and the set of all
recursively enumerable languages is often denoted by RE. We will not
use this terminology in this book.
One of the first programming-language formulations of Turing
machines was given by Wang [Wan57]. Our formulation of NAND-
TM is aimed at making the connection with circuits more direct, with
the eventual goal of using it for the Cook-Levin Theorem, as well as
results such as P ⊆ P/poly and BPP ⊆ P/poly . The website esolangs.org
features a large variety of esoteric Turing-complete programming
languages. One of the most famous of them is Brainf*ck.
Learning Objectives:
• Learn about RAM machines and the λ
calculus.
• Equivalence between these and other models
and Turing machines.
• Cellular automata and configurations of
Turing machines.
• Understand the Church-Turing thesis.

8
Equivalent models of computation

“All problems in computer science can be solved by another level of indirec-


tion”, attributed to David Wheeler.

“Because we shall later compute with expressions for functions, we need a


distinction between functions and forms and a notation for expressing this
distinction. This distinction and a notation for describing it, from which we
deviate trivially, is given by Church.”, John McCarthy, 1960 (in paper
describing the LISP programming language)

So far we have defined the notion of computing a function using


Turing machines, which are not a close match to the way computation
is done in practice. In this chapter we justify this choice by showing
that the definition of computable functions will remain the same
under a wide variety of computational models. This notion is known
as Turing completeness or Turing equivalence and is one of the most
fundamental facts of computer science. In fact, a widely believed
claim known as the Church-Turing Thesis holds that every “reasonable”
definition of computable function is equivalent to being computable
by a Turing machine. We discuss the Church-Turing Thesis and the
potential definitions of “reasonable” in Section 8.8.
Some of the main computational models we discuss in this chapter
include:

• RAM Machines: Turing machines do not correspond to standard


computing architectures that have Random Access Memory (RAM).
The mathematical model of RAM machines is much closer to actual
computers, but we will see that it is equivalent in power to Turing
machines. We also discuss a programming language variant of
RAM machines, which we call NAND-RAM. The equivalence of
Turing machines and RAM machines enables demonstrating the
Turing Equivalence of many popular programming languages, in-
cluding all general-purpose languages used in practice such as C,
Python, JavaScript, etc.




• Cellular Automata: Many natural and artificial systems can be


modeled as collections of simple components, each evolving ac-
cording to simple rules based on its state and the state of its imme-
diate neighbors. One well-known such example is Conway’s Game
of Life. To prove that cellular automata are equivalent to Turing
machines we introduce the tool of configurations of Turing machines.
These have other applications, and in particular are used in Chap-
ter 11 to prove Gödel’s Incompleteness Theorem: a central result in
mathematics.

• 𝜆 calculus: The 𝜆 calculus is a model for expressing computation


that originates from the 1930’s, though it is closely connected to
functional programming languages widely used today. Showing
the equivalence of 𝜆 calculus to Turing machines involves a beauti-
ful technique to eliminate recursion known as the “Y Combinator”.

This chapter: A non-mathy overview


In this chapter we study equivalence between models. Two
computational models are equivalent (also known as Turing
equivalent) if they can compute the same set of functions. For
example, we have seen that Turing machines and NAND-TM
programs are equivalent since we can transform every Tur-
ing machine into a NAND-TM program that computes the
same function, and similarly can transform every NAND-
TM program into a Turing machine that computes the same
function.
In this chapter we show this extends far beyond Turing ma-
chines. The techniques we develop allow us to show that
all general-purpose programming languages (i.e., Python,
C, Java, etc.) are Turing Complete, in the sense that they can
simulate Turing machines and hence compute all functions
that can be computed by a TM. We will also show the other
direction- Turing machines can be used to simulate a pro-
gram in any of these languages and hence compute any
function computable by them. This means that all these
programming languages are Turing equivalent: they are equiv-
alent in power to Turing machines and to each other. This is
a powerful principle, which underlies the vast reach
of Computer Science. Moreover, it enables us to “have our
cake and eat it too”- since all these models are equivalent, we
can choose the model of our convenience for the task at hand.
To achieve this equivalence, we define a new computational
model known as RAM machines. RAM Machines capture the

architecture of modern computers more closely than Turing


machines, but are still computationally equivalent to Turing
machines.
Finally, we will show that Turing equivalence extends far
beyond traditional programming languages. We will see
that cellular automata, which are a mathematical model of ex-
tremely simple natural systems, are also Turing equivalent, and
also see the Turing equivalence of the 𝜆 calculus - a logical
system for expressing functions that is the basis for functional
programming languages such as Lisp, OCaml, and more.
See Fig. 8.1 for an overview of the results of this chapter.

Figure 8.1: Some Turing-equivalent models. All of


these are equivalent in power to Turing machines
(or equivalently NAND-TM programs) in the sense
that they can compute exactly the same class of
functions. All of these are models for computing
infinite functions that take inputs of unbounded
length. In contrast, Boolean circuits / NAND-CIRC
programs can only compute finite functions and hence
are not Turing complete.

8.1 RAM MACHINES AND NAND-RAM


One of the limitations of Turing machines (and NAND-TM programs)
is that we can only access one location of our arrays/tape at a time. If
the head is at position 22 in the tape and we want to access the 957-th
position then it will take us at least 923 steps to get there. In contrast,
almost every programming language has a formalism for directly
accessing memory locations. Actual physical computers also provide
so called Random Access Memory (RAM) which can be thought of as a
large array Memory, such that given an index 𝑝 (i.e., memory address,
or a pointer), we can read from and write to the 𝑝𝑡ℎ location of Memory.
(“Random access memory” is quite a misnomer since it has nothing to
do with probability, but since it is a standard term in both the theory
and practice of computing, we will use it as well.)
The computational model that models access to such a memory is
the RAM machine (sometimes also known as the Word RAM model),
as depicted in Fig. 8.2. The memory of a RAM machine is an array
of unbounded size where each cell can store a single word, which
we think of as a string in {0, 1}𝑤 and also (equivalently) as a num-
ber in [2𝑤 ]. For example, many modern computing architectures

use 64 bit words, in which every memory location holds a string in


{0, 1}⁶⁴ which can also be thought of as a number between 0 and
2⁶⁴ − 1 = 18,446,744,073,709,551,615. The parameter 𝑤 is known
as the word size. In practice often 𝑤 is a fixed number such as 64, but
when doing theory we model 𝑤 as a parameter that can depend on
the input length or number of steps. (You can think of 2𝑤 as roughly
corresponding to the largest memory address that we use in the com-
putation.) In addition to the memory array, a RAM machine also
contains a constant number of registers 𝑟0 , … , 𝑟𝑘−1 , each of which can
also contain a single word.
The operations a RAM machine can carry out include:

• Data movement: Load data from a certain cell in memory into


a register or store the contents of a register into a certain cell of
memory. A RAM machine can directly access any cell of memory
without having to move the “head” (as Turing machines do) to that
location. That is, in one step a RAM machine can load into register
𝑟𝑖 the contents of the memory cell indexed by register 𝑟𝑗 , or store
into the memory cell indexed by register 𝑟𝑗 the contents of register 𝑟𝑖 .

Figure 8.2: A RAM Machine contains a finite number of local registers, each of which holds an integer, and an unbounded memory array. It can perform arithmetic operations on its registers as well as load to a register 𝑟 the contents of the memory at the address indexed by the number in register 𝑟′.

• Computation: RAM machines can carry out computation on regis-
ters such as arithmetic operations, logical operations, and compar-
isons.

• Control flow: As in the case of Turing machines, the choice of what


instruction to perform next can depend on the state of the RAM
machine, which is captured by the contents of its register.

We will not give a formal definition of RAM Machines, though the


bibliographical notes section (Section 8.10) contains sources for such
definitions. Just as the NAND-TM programming language models
Turing machines, we can also define a NAND-RAM programming lan-
guage that models RAM machines. The NAND-RAM programming
language extends NAND-TM by adding the following features:

• The variables of NAND-RAM are allowed to be (non-negative)


integer valued rather than only Boolean as is the case in NAND-
TM. That is, a scalar variable foo holds a non-negative integer in ℕ
(rather than only a bit in {0, 1}), and an array variable Bar holds
an array of integers. As in the case of RAM machines, we will not
allow integers of unbounded size. Concretely, each variable holds
a number between 0 and 𝑇 − 1, where 𝑇 is the number of steps
that have been executed by the program so far. (You can ignore
this restriction for now: if we want to hold larger numbers, we
can simply execute dummy instructions; it will be useful in later
chapters.)

Figure 8.3: Different aspects of RAM machines and Turing machines. RAM machines can store integers in their local registers, and can read and write to their memory at a location specified by a register. In contrast, Turing machines can only access their memory in the head location, which moves at most one position to the right or left in each step.

• We allow indexed access to arrays. If foo is a scalar and Bar is an


array, then Bar[foo] refers to the location of Bar indexed by the
value of foo. (Note that this means we don’t need to have a special
index variable i anymore.)

• As is often the case in programming languages, we will assume


that for Boolean operations such as NAND, a zero valued integer is
considered as false, and a non-zero valued integer is considered as
true.

• In addition to NAND, NAND-RAM also includes all the basic arith-


metic operations of addition, subtraction, multiplication, (integer)
division, as well as comparisons (equal, greater than, less than,
etc..).

• NAND-RAM includes conditional statements if/then as part of


the language.

• NAND-RAM contains looping constructs such as while and do as


part of the language.

A full description of the NAND-RAM programming language is


in the appendix. However, the most important fact you need to know
about NAND-RAM is that you actually don’t need to know much
about NAND-RAM at all, since it is equivalent in power to Turing
machines:

Theorem 8.1 — Turing Machines (aka NAND-TM programs) and RAM machines
(aka NAND-RAM programs) are equivalent. For every function
𝐹 ∶ {0, 1}∗ → {0, 1}∗ , 𝐹 is computable by a NAND-TM program if
and only if 𝐹 is computable by a NAND-RAM program.

Since NAND-TM programs are equivalent to Turing machines, and


NAND-RAM programs are equivalent to RAM machines, Theorem 8.1
shows that all these four models are equivalent to one another.

Proof Idea:
Clearly NAND-RAM is only more powerful than NAND-TM, and
so if a function 𝐹 is computable by a NAND-TM program then it can
be computed by a NAND-RAM program. The challenging direction is
to transform a NAND-RAM program 𝑃 to an equivalent NAND-TM
program 𝑄. To describe the proof in full we will need to cover the full
formal specification of the NAND-RAM language, and show how we
can implement every one of its features as syntactic sugar on top of
NAND-TM.
This can be done but going over all the operations in detail is rather
tedious. Hence we will focus on describing the main ideas behind this
transformation. (See also Fig. 8.4.)

Figure 8.4: Overview of the steps in the proof of Theorem 8.1 simulating NAND-RAM with NAND-TM. We first use the inner loop syntactic sugar of Section 7.4.1 to enable loading an integer from an array to the index variable i of NAND-TM. Once we can do that, we can simulate indexed access in NAND-TM. We then use an embedding of ℕ × ℕ in ℕ to simulate two dimensional bit arrays in NAND-TM. Finally, we use the binary representation to encode one-dimensional arrays of integers as two dimensional arrays of bits, hence completing the simulation of NAND-RAM with NAND-TM.

NAND-RAM generalizes NAND-TM in two main ways: (a) adding
indexed access to the arrays (i.e., Foo[bar] syntax) and (b) moving
from Boolean-valued variables to integer-valued ones. The
transformation involves the following steps:

1. Indexed access of bit arrays: We start by showing how to handle (a).


Namely, we show how we can implement in NAND-TM the op-
eration Setindex(Bar) such that if Bar is an array that encodes
some integer 𝑗, then after executing Setindex(Bar) the value of
i will equal to 𝑗. This will allow us to simulate syntax of the form
Foo[Bar] by Setindex(Bar) followed by Foo[i].

2. Two dimensional bit arrays: We then show how we can use “syntactic
sugar” to augment NAND-TM with two dimensional arrays. That is,
have two indices i and j and two dimensional arrays, such that we can
use the syntax Foo[i][j] to access the (i,j)-th location of Foo.

3. Arrays of integers: Finally we will encode a one dimensional array


Arr of integers by a two dimensional Arrbin of bits. The idea is
simple: if 𝑎𝑖,0 , … , 𝑎𝑖,ℓ is a binary (prefix-free) representation of
Arr[𝑖], then Arrbin[𝑖][𝑗] will be equal to 𝑎𝑖,𝑗 .

Once we have arrays of integers, we can use our usual syntactic


sugar for functions, GOTO etc. to implement the arithmetic and control
flow operations of NAND-RAM.

The above approach is not the only way to obtain a proof of Theo-
rem 8.1, see for example Exercise 8.1

R
Remark 8.2 — RAM machines / NAND-RAM and assembly
language (optional). RAM machines correspond quite
closely to actual microprocessors such as those in the
Intel x86 series that also contains a large primary mem-
ory and a constant number of small registers. This is of
course no accident: RAM machines aim at modeling
more closely than Turing machines the architecture of
actual computing systems, which largely follows the
so called von Neumann architecture as described in
the report [Neu45]. As a result, NAND-RAM is sim-
ilar in its general outline to assembly languages such
as x86 or MIPS. These assembly languages all have
instructions to (1) move data from registers to mem-
ory, (2) perform arithmetic or logical computations
on registers, and (3) conditional execution and loops
(“if” and “goto”, commonly known as “branches” and
“jumps” in the context of assembly languages).
The main difference between RAM machines and
actual microprocessors (and correspondingly between

NAND-RAM and assembly languages) is that actual


microprocessors have a fixed word size 𝑤 so that all
registers and memory cells hold numbers in [2𝑤 ] (or
equivalently strings in {0, 1}𝑤 ). This number 𝑤 can
vary among different processors, but common values
are either 32 or 64. As a theoretical model, RAM ma-
chines do not have this limitation, but we rather let 𝑤
be the logarithm of our running time (which roughly
corresponds to its value in practice as well). Actual
microprocessors also have a fixed number of registers
(e.g., 14 general purpose registers in x86-64) but this
does not make a big difference with RAM machines.
It can be shown that RAM machines with as few as
two registers are as powerful as full-fledged RAM ma-
chines that have an arbitrarily large constant number
of registers.
Of course actual microprocessors have many features
not shared with RAM machines as well, including
parallelism, memory hierarchies, and many others.
However, RAM machines do capture actual comput-
ers to a first approximation and so (as we will see), the
running time of an algorithm on a RAM machine (e.g.,
𝑂(𝑛) vs 𝑂(𝑛2 )) is strongly correlated with its practical
efficiency.

8.2 THE GORY DETAILS (OPTIONAL)


We do not show the full formal proof of Theorem 8.1 but focus on the
most important parts: implementing indexed access, and simulating
two dimensional arrays with one dimensional ones. Even these are
already quite tedious to describe, as will not be surprising to anyone
that has ever written a compiler. Hence you can feel free to merely
skim this section. The important point is not for you to know all de-
tails by heart but to be convinced that in principle it is possible to
transform a NAND-RAM program to an equivalent NAND-TM pro-
gram, and even be convinced that, with sufficient time and effort, you
could do it if you wanted to.

8.2.1 Indexed access in NAND-TM


In NAND-TM we can only access our arrays in the position of the in-
dex variable i, while NAND-RAM has integer-valued variables and
can use them for indexed access to arrays, of the form Foo[bar]. To im-
plement indexed access in NAND-TM, we will encode integers in our
arrays using some prefix-free representation (see Section 2.5.2)), and
then have a procedure Setindex(Bar) that sets i to the value encoded
by Bar. We can simulate the effect of Foo[Bar] using Setindex(Bar)
followed by Foo[i].
Implementing Setindex(Bar) can be achieved as follows:

1. We initialize an array Atzero such that Atzero[0]= 1 and


Atzero[𝑗]= 0 for all 𝑗 > 0. (This can be easily done in NAND-TM
as all uninitialized variables default to zero.)

2. Set i to zero, by decrementing it until we reach the point where


Atzero[i]= 1.

3. Let Temp be an array encoding the number 0.

4. We use GOTO to simulate an inner loop of the form: while Temp ≠


Bar, increment Temp.

5. At the end of the loop, i is equal to the value encoded by Bar.


In NAND-TM code (using some syntactic sugar), we can imple-
ment the above operations as follows:
# assume Atzero is an array such that Atzero[0]=1
# and Atzero[j]=0 for all j>0

# set i to 0.
LABEL("zero_idx")
dir0 = zero
dir1 = one
# corresponds to i <- i-1
GOTO("zero_idx",NOT(Atzero[i]))
...
# zero out temp
# (code below assumes a specific prefix-free encoding
# in which 10 is the "end marker")
Temp[0] = 1
Temp[1] = 0
# set i to Bar, assume we know how to increment, compare
LABEL("increment_temp")
cond = EQUAL(Temp,Bar)
dir0 = one
dir1 = one
# corresponds to i <- i+1
INC(Temp)
GOTO("increment_temp",cond)
# if we reach this point, i is number encoded by Bar
...
# final instruction of program
MODANDJUMP(dir0,dir1)

8.2.2 Two dimensional arrays in NAND-TM


To implement two dimensional arrays, we want to embed them in a
one dimensional array. The idea is that we come up with a one to one

function 𝑒𝑚𝑏𝑒𝑑 ∶ ℕ × ℕ → ℕ, and so embed the location (𝑖, 𝑗) of the


two dimensional array Two in the location 𝑒𝑚𝑏𝑒𝑑(𝑖, 𝑗) of the array One.
Since the set ℕ × ℕ seems “much bigger” than the set ℕ, a priori it
might not be clear that such a one to one mapping exists. However,
once you think about it more, it is not that hard to construct. For ex-
ample, you could ask a child to use scissors and glue to transform a
10” by 10” piece of paper into a 1” by 100” strip. This is essentially
a one to one map from [10] × [10] to [100]. We can generalize this to
obtain a one to one map from [𝑛] × [𝑛] to [𝑛2 ] and more generally a one
to one map from ℕ × ℕ to ℕ. Specifically, the following map 𝑒𝑚𝑏𝑒𝑑
would do (see Fig. 8.5):

𝑒𝑚𝑏𝑒𝑑(𝑥, 𝑦) = ½ (𝑥 + 𝑦)(𝑥 + 𝑦 + 1) + 𝑥 .

Exercise 8.3 asks you to prove that 𝑒𝑚𝑏𝑒𝑑 is indeed one to one, as
well as computable by a NAND-TM program. (The latter can be done
by simply following the grade-school algorithms for multiplication,
addition, and division.) This means that we can replace code of the
form Two[Foo][Bar] = something (i.e., access the two dimensional
array Two at the integers encoded by the one dimensional arrays Foo
and Bar) by code of the form:

Blah = embed(Foo,Bar)
Setindex(Blah)
Two[i] = something

Figure 8.5: Illustration of the map 𝑒𝑚𝑏𝑒𝑑(𝑥, 𝑦) = ½ (𝑥 + 𝑦)(𝑥 + 𝑦 + 1) + 𝑥 for 𝑥, 𝑦 ∈ [10]; one can see that for every distinct pairs (𝑥, 𝑦) and (𝑥′ , 𝑦′ ), 𝑒𝑚𝑏𝑒𝑑(𝑥, 𝑦) ≠ 𝑒𝑚𝑏𝑒𝑑(𝑥′ , 𝑦′ ).
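
As a sanity check, here is a short Python sketch (ours, not part of the book's models) of the pairing function 𝑒𝑚𝑏𝑒𝑑, together with a brute-force verification that it is one to one on a small range (Exercise 8.3 asks for an actual proof):

def embed(x, y):
    # The pairing function above: maps N x N to N one-to-one.
    # (x + y)(x + y + 1) is always even, so integer division is exact.
    return (x + y) * (x + y + 1) // 2 + x

# Brute-force check of injectivity on [10] x [10].
values = [embed(x, y) for x in range(10) for y in range(10)]
assert len(values) == len(set(values)), "embed is not one-to-one on this range"
print(sorted(values)[:5])  # [0, 1, 2, 3, 4]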

8.2.3 All the rest


Once we have two dimensional arrays and indexed access, simulating
NAND-RAM with NAND-TM is just a matter of implementing the
standard algorithms for arithmetic operations and comparisons in
NAND-TM. While this is cumbersome, it is not difficult, and the end
result is to show that every NAND-RAM program 𝑃 can be simulated
by an equivalent NAND-TM program 𝑄, thus completing the proof of
Theorem 8.1.

R
Remark 8.3 — Recursion in NAND-RAM (advanced). One
concept that appears in many programming languages
but we did not include in NAND-RAM programs is
recursion. However, recursion (and function calls in
general) can be implemented in NAND-RAM using
the stack data structure. A stack is a data structure con-
taining a sequence of elements, where we can “push”
elements into it and “pop” them from it in “first in last
out” order.
We can implement a stack using an array of integers
Stack and a scalar variable stackpointer that will

be the number of items in the stack. We implement


push(foo) by

Stack[stackpointer]=foo
stackpointer += one

and implement bar = pop() by

stackpointer -= one
bar = Stack[stackpointer]

We implement a function call to 𝐹 by pushing the


arguments for 𝐹 into the stack. The code of 𝐹 will
“pop” the arguments from the stack, perform the com-
putation (which might involve making recursive or
non-recursive calls) and then “push” its return value
into the stack. Because of the “first in last out” na-
ture of a stack, we do not return control to the calling
procedure until all the recursive calls are done.
The fact that we can implement recursion using a non-
recursive language is not surprising. Indeed, machine
languages typically do not have recursion (or function
calls in general), and hence a compiler implements
function calls using a stack and GOTO. You can find
online tutorials on how recursion is implemented
via stack in your favorite programming language,
whether it’s Python, JavaScript, or Lisp/Scheme.
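
To make the stack idea concrete, here is a minimal Python sketch (the function name and the choice of factorial are ours) that evaluates a recursive function using an explicit stack and a loop, with no recursive calls at all:

def factorial_no_recursion(n):
    # Explicit call stack: each entry is a pending argument.
    stack = [n]          # "push" the initial argument
    result = 1
    while stack:
        k = stack.pop()  # "pop" the most recently pushed argument
        if k <= 1:
            continue
        result *= k
        stack.append(k - 1)   # simulate the recursive call on k-1
    return result

print(factorial_no_recursion(5))  # 120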

8.3 TURING EQUIVALENCE (DISCUSSION)


Standard programming languages such as C, Java, Python, Pascal,
and Fortran have very similar operations to NAND-RAM. (Indeed,
ultimately they can all be executed by machines which have a
fixed number of registers and a large memory array.) Hence using
Theorem 8.1, we can simulate any program in such a programming
language by a NAND-TM program. In the other direction, it is a fairly
easy programming exercise to write an interpreter for NAND-TM in
any of the above programming languages. Hence we can also simulate
NAND-TM programs (and so by Theorem 7.11, Turing machines) using
these programming languages. This property of being equivalent in
power to Turing machines / NAND-TM is called Turing equivalence
(or sometimes Turing completeness). Thus all programming languages
we are familiar with are Turing equivalent.¹

Figure 8.6: A punched card corresponding to a Fortran statement.

¹ Some programming languages have fixed (even if extremely large)
bounds on the amount of memory they can access, which formally
prevent them from being applicable to computing infinite functions
and hence simulating Turing machines. We ignore such issues in this
discussion and assume access to some storage device without a fixed
upper bound on its capacity.

8.3.1 The “Best of both worlds” paradigm

The equivalence between Turing machines and RAM machines allows
us to choose the most convenient language for the task at hand:
• When we want to prove a theorem about all programs/algorithms,
we can use Turing machines (or NAND-TM) since they are sim-

pler and easier to analyze. In particular, if we want to show that


a certain function cannot be computed, then we will use Turing
machines.

• When we want to show that a function can be computed we can use


RAM machines or NAND-RAM, because they are easier to pro-
gram in and correspond more closely to high level programming
languages we are used to. In fact, we will often describe NAND-
RAM programs in an informal manner, trusting that the reader
can fill in the details and translate the high level description to the
precise program. (This is just like the way people typically use in-
formal or “pseudocode” descriptions of algorithms, trusting that
their audience will know to translate these descriptions to code if
needed.)

Our usage of Turing machines / NAND-TM and RAM Machines


/ NAND-RAM is very similar to the way people use in practice high
and low level programming languages. When one wants to produce
a device that executes programs, it is convenient to do so for a very
simple and “low level” programming language. When one wants to
describe an algorithm, it is convenient to use as high level a formalism
as possible.

 Big Idea 10 Using equivalence results such as those between


Turing and RAM machines, we can “have our cake and eat it too”.
We can use a simpler model such as Turing machines when we want
to prove something can’t be done, and use a feature-rich model such as
RAM machines when we want to prove something can be done.
Figure 8.7: By having the two equivalent languages NAND-TM and NAND-RAM, we can “have our cake and eat it too”, using NAND-TM when we want to prove that programs can’t do something, and using NAND-RAM or other high level languages when we want to prove that programs can do something.

8.3.2 Let’s talk about abstractions

“The programmer is in the unique position that … he has to be able to think in terms of conceptual hierarchies that are much deeper than a single mind ever needed to face before.”, Edsger Dijkstra, “On the cruelty of really teaching computing science”, 1988.

At some point in any theory of computation course, the instructor


and students need to have the talk. That is, we need to discuss the level
of abstraction in describing algorithms. In algorithms courses, one
typically describes algorithms in English, assuming readers can “fill
in the details” and would be able to convert such an algorithm into an
implementation if needed. For example, Algorithm 8.4 is a high level
description of the breadth first search algorithm.

Algorithm 8.4 — Breadth First Search.

Input: Graph 𝐺, vertices 𝑢, 𝑣


Output: ”connected” when 𝑢 is connected to 𝑣 in 𝐺, ”disconnected” otherwise
1: Initialize empty queue 𝑄.
2: Put 𝑢 in 𝑄
3: while 𝑄 is not empty do
4: Remove top vertex 𝑤 from 𝑄
5: if 𝑤 = 𝑣 then return ”connected”
6: end if
7: Mark 𝑤
8: Add all unmarked neighbors of 𝑤 to 𝑄.
9: end while
10: return ”disconnected”

If we wanted to give more details on how to implement breadth


first search in a programming language such as Python or C (or
NAND-RAM / NAND-TM for that matter), we would describe how
we implement the queue data structure using an array, and similarly
how we would use arrays to mark vertices. We call such an “interme-
diate level” description an implementation level or pseudocode descrip-
tion. Finally, if we want to describe the implementation precisely, we
would give the full code of the program (or another fully precise rep-
resentation, such as in the form of a list of tuples). We call this a formal
or low level description.
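
For example, here is one possible implementation-level rendering of Algorithm 8.4 in Python. It is a sketch that assumes the graph is given in an adjacency-list representation (an assumption of this snippet), with a Python list serving as the queue and a Boolean array for the marks:

def bfs_connected(adj, u, v):
    # adj[w] is the list of neighbors of vertex w; vertices are 0..n-1.
    n = len(adj)
    marked = [False] * n       # marking array
    queue = [u]                # queue implemented as a Python list
    while queue:
        w = queue.pop(0)       # remove top vertex from the queue
        if w == v:
            return "connected"
        if marked[w]:
            continue
        marked[w] = True
        for x in adj[w]:
            if not marked[x]:
                queue.append(x)
    return "disconnected"

# Example: a path 0 - 1 - 2 and an isolated vertex 3.
adj = [[1], [0, 2], [1], []]
print(bfs_connected(adj, 0, 2))  # connected
print(bfs_connected(adj, 0, 3))  # disconnected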

Figure 8.8: We can describe an algorithm at different


levels of granularity/detail and precision. At the
highest level we just write the idea in words, omitting
all details on representation and implementation.
In the intermediate level (also known as implemen-
tation or pseudocode) we give enough details of the
implementation that would allow someone to de-
rive it, though we still fall short of providing the full
code. The lowest level is where the actual code or
mathematical description is fully spelled out. These
different levels of detail all have their uses, and mov-
ing between them is one of the most important skills
for a computer scientist.

While we started off by describing NAND-CIRC, NAND-TM, and


NAND-RAM programs at the full formal level, as we progress in this

book we will move to implementation and high level description.


After all, our goal is not to use these models for actual computation,
but rather to analyze the general phenomenon of computation. That
said, if you don’t understand how the high level description translates
to an actual implementation, going “down to the metal” is often an
excellent exercise. One of the most important skills for a computer
scientist is the ability to move up and down hierarchies of abstractions.
A similar distinction applies to the notion of representation of objects
as strings. Sometimes, to be precise, we give a low level specification
of exactly how an object maps into a binary string. For example, we
might describe an encoding of 𝑛 vertex graphs as length 𝑛2 binary
strings, by saying that we map a graph 𝐺 over the vertices [𝑛] to a
string 𝑥 ∈ {0, 1}𝑛2 such that the 𝑛 ⋅ 𝑖 + 𝑗-th coordinate of 𝑥 is 1 if and
only if the directed edge from 𝑖 to 𝑗 is present in 𝐺. We can also use an intermediate or
implementation level description, by simply saying that we represent a
graph using the adjacency matrix representation.
Finally, because translating between the various representations of
graphs (and objects in general) can be done via a NAND-RAM (and
hence a NAND-TM) program, when talking at a high level
we also suppress discussion of representation altogether. For example,
the fact that graph connectivity is a computable function is true re-
gardless of whether we represent graphs as adjacency lists, adjacency
matrices, list of edge-pairs, and so on and so forth. Hence, in cases
where the precise representation doesn’t make a difference, we would
often talk about our algorithms as taking as input an object 𝑋 (that
can be a graph, a vector, a program, etc.) without specifying how 𝑋 is
encoded as a string.
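
To make the low level graph representation above concrete, here is a short Python sketch (an illustration only; the function name is ours) mapping a directed graph on the vertices [𝑛], given as a set of edges, to the length 𝑛2 binary string described above:

def graph_to_string(n, edges):
    # Coordinate n*i + j is '1' iff the directed edge from i to j is present.
    x = ['0'] * (n * n)
    for (i, j) in edges:
        x[n * i + j] = '1'
    return ''.join(x)

# The graph on [3] with edges 0 -> 1 and 2 -> 0.
print(graph_to_string(3, {(0, 1), (2, 0)}))  # 010000100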

Defining ”Algorithms”. Up until now we have used the term “algo-


rithm” informally. However, Turing machines and the range of equiv-
alent models yield a way to precisely and formally define algorithms.
Hence whenever we refer to an algorithm in this book, we will mean
that it is an instance of one of the Turing equivalent models, such as
Turing machines, NAND-TM, RAM machines, etc. Because of the
equivalence of all these models, in many contexts, it will not matter
which of these we use.

8.3.3 Turing completeness and equivalence, a formal definition (optional)


A computational model is some way to define what it means for a pro-
gram (which is represented by a string) to compute a (partial) func-
tion. A computational model ℳ is Turing complete if we can map every
Turing machine (or equivalently NAND-TM program) 𝑁 into a pro-
gram 𝑃 for ℳ that computes the same function as 𝑁 . It is Turing
equivalent if the other direction holds as well (i.e., we can map every
program in ℳ to a Turing machine that computes the same function).

We can define this notion formally as follows. (This formal definition


is not crucial for the remainder of this book, so feel free to skip it as long
as you understand the general concept of Turing equivalence; This
notion is sometimes referred to in the literature as Gödel numbering
or admissible numbering.)

Definition 8.5 — Turing completeness and equivalence (optional). Let ℱ be
the set of all partial functions from {0, 1}∗ to {0, 1}∗ . A computa-
tional model is a map ℳ ∶ {0, 1}∗ → ℱ.
We say that a program 𝑃 ∈ {0, 1}∗ ℳ-computes a function 𝐹 ∈ ℱ
if ℳ(𝑃 ) = 𝐹 .
A computational model ℳ is Turing complete if there is a com-
putable map ENCODEℳ ∶ {0, 1}∗ → {0, 1}∗ such that for every Turing
machine 𝑁 (represented as a string), ℳ(ENCODEℳ (𝑁 )) is equal
to the partial function computed by 𝑁 .
A computational model ℳ is Turing equivalent if it is Tur-
ing complete and there exists a computable map DECODEℳ ∶
{0, 1}∗ → {0, 1}∗ such that for every string 𝑃 ∈ {0, 1}∗ , 𝑁 =
DECODEℳ (𝑃 ) is a string representation of a Turing machine that
computes the function ℳ(𝑃 ).

Some examples of Turing equivalent models (some of which we


have already seen, and some are discussed below) include:

• Turing machines
• NAND-TM programs
• NAND-RAM programs
• λ calculus
• Game of life (mapping programs and inputs/outputs to starting
and ending configurations)
• Programming languages such as Python/C/Javascript/OCaml…
(allowing for unbounded storage)

8.4 CELLULAR AUTOMATA


Many physical systems can be described as consisting of a large num-
ber of elementary components that interact with one another. One
way to model such systems is using cellular automata. This is a system
that consists of a large (or even infinite) number of cells. Each cell
only has a constant number of possible states. At each time step, a cell
updates to a new state by applying some simple rule to the state of
itself and its neighbors.
A canonical example of a cellular automaton is Conway’s Game
of Life. In this automata the cells are arranged in an infinite two di-
mensional grid. Each cell has only two states: “dead” (which we can

Figure 8.9: Rules for Conway’s Game of Life. Image from this blog post.

encode as 0 and identify with ∅) or “alive” (which we can encode


as 1). The next state of a cell depends on its previous state and the
states of its 8 vertical, horizontal and diagonal neighbors (see Fig. 8.9).
A dead cell becomes alive only if exactly three of its neighbors are
alive. A live cell continues to live if it has two or three live neighbors.
Even though the number of cells is potentially infinite, we can en-
code the state using a finite-length string by only keeping track of the
live cells. If we initialize the system in a configuration with a finite
number of live cells, then the number of live cells will stay finite in all
future steps. The Wikipedia page for the Game of Life contains some
beautiful figures and animations of configurations that produce very
interesting evolutions.

Figure 8.10: In a two dimensional cellular automaton


every cell is in position 𝑖, 𝑗 for some integers 𝑖, 𝑗 ∈ ℤ.
The state of a cell is some value 𝐴𝑖,𝑗 ∈ Σ for some
finite alphabet Σ. At a given time step, the state of the
cell is adjusted according to some function applied to
the state of (𝑖, 𝑗) and all its neighbors (𝑖 ± 1, 𝑗 ± 1).
In a one dimensional cellular automaton every cell is in
position 𝑖 ∈ ℤ and the state 𝐴𝑖 of 𝑖 at the next time
step depends on its current state and the state of its
two neighbors 𝑖 − 1 and 𝑖 + 1.

Since the cells in the game of life are arranged in an infinite two-
dimensional grid, it is an example of a two dimensional cellular automa-
ton. We can also consider the even simpler setting of a one dimensional
cellular automaton, where the cells are arranged in an infinite line, see
Fig. 8.10. It turns out that even this simple model is enough to achieve

Turing-completeness. We will now formally define one-dimensional


cellular automata and then prove their Turing completeness.

Definition 8.6 — One dimensional cellular automata. Let Σ be a finite set
containing the symbol ∅. A one dimensional cellular automaton over
alphabet Σ is described by a transition rule 𝑟 ∶ Σ3 → Σ, which
satisfies 𝑟(∅, ∅, ∅) = ∅.
A configuration of the automaton 𝑟 is a function 𝐴 ∶ ℤ → Σ. If
an automaton with rule 𝑟 is in configuration 𝐴, then its next config-
uration, denoted by 𝐴′ = NEXT𝑟 (𝐴), is the function 𝐴′ such that
𝐴′ (𝑖) = 𝑟(𝐴(𝑖 − 1), 𝐴(𝑖), 𝐴(𝑖 + 1)) for every 𝑖 ∈ ℤ. In other words,
the next state of the automaton 𝑟 at point 𝑖 is obtained by applying
the rule 𝑟 to the values of 𝐴 at 𝑖 and its two neighbors.

Finite configuration. We say that a configuration of an automaton 𝑟
is finite if there is only some finite number of indices 𝑖0 , … , 𝑖𝑘−1 in ℤ
such that 𝐴(𝑖𝑗 ) ≠ ∅ for 𝑗 ∈ [𝑘]. (That is, for every 𝑖 ∉ {𝑖0 , … , 𝑖𝑘−1 }, 𝐴(𝑖) = ∅.)
Such a configuration can be represented using a finite string that
encodes the indices 𝑖0 , … , 𝑖𝑘−1 and the values 𝐴(𝑖0 ), … , 𝐴(𝑖𝑘−1 ). Since
𝑟(∅, ∅, ∅) = ∅, if 𝐴 is a finite configuration then NEXT𝑟 (𝐴) is finite
as well. We will only be interested in studying cellular automata that
are initialized in finite configurations, and hence remain in a finite
configuration throughout their evolution.
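
Here is a brief Python sketch (an informal illustration; the particular rule below is just an arbitrary example satisfying 𝑟(∅, ∅, ∅) = ∅) of computing one step of a one dimensional cellular automaton on a finite configuration, storing only the cells whose state is not ∅:

BLANK = "_"   # plays the role of the blank symbol ∅

def next_config(rule, config):
    # config is a dict mapping index -> symbol, listing only non-blank cells.
    # rule maps (left, center, right) symbols to the new symbol of the center cell.
    candidates = set()
    for i in config:
        candidates.update({i - 1, i, i + 1})
    new_config = {}
    for i in candidates:
        out = rule(config.get(i - 1, BLANK), config.get(i, BLANK), config.get(i + 1, BLANK))
        if out != BLANK:
            new_config[i] = out
    return new_config

# Example rule: a cell becomes "1" iff exactly one of the three inputs is "1".
def example_rule(a, b, c):
    return "1" if [a, b, c].count("1") == 1 else BLANK

config = {0: "1"}              # a finite configuration with a single live cell
for _ in range(3):
    config = next_config(example_rule, config)
print(sorted(config))          # positions of the live cells after three steps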

8.4.1 One dimensional cellular automata are Turing complete


We can write a program (for example using NAND-RAM) that sim-
ulates the evolution of any cellular automaton from an initial finite
configuration by simply storing the values of the cells with state not
equal to ∅ and repeatedly applying the rule 𝑟. Hence cellular au-
tomata can be simulated by Turing machines. What is more surprising
is that the other direction holds as well. For example, as simple as its
rules seem, we can simulate a Turing machine using the game of life
(see Fig. 8.11).
In fact, even one dimensional cellular automata can be Turing com-
plete:

Theorem 8.7 — One dimensional automata are Turing complete. For every
Turing machine 𝑀 , there is a one dimensional cellular automaton
that can simulate 𝑀 on every input 𝑥.

Figure 8.11: A Game-of-Life configuration simulating a Turing machine. Figure by Paul Rendell.

To make the notion of “simulating a Turing machine” more precise
we will need to define configurations of Turing machines. We will
do so in Section 8.4.2 below, but at a high level a configuration of a
Turing machine is a string that encodes its full state at a given step in

its computation. That is, the contents of all (non-empty) cells of its
tape, its current state, as well as the head position.
The key idea in the proof of Theorem 8.7 is that at every point in
the computation of a Turing machine 𝑀 , the only cell in 𝑀 ’s tape that
can change is the one where the head is located, and the value this
cell changes to is a function of its current state and the finite state of
𝑀 . This observation allows us to encode the configuration of a Turing
machine 𝑀 as a finite configuration of a cellular automaton 𝑟, and
ensure that a one-step evolution of this encoded configuration under
the rules of 𝑟 corresponds to one step in the execution of the Turing
machine 𝑀 .

8.4.2 Configurations of Turing machines and the next-step function


To turn the above ideas into a rigorous proof (and even statement!)
of Theorem 8.7 we will need to precisely define the notion of config-
urations of Turing machines. This notion will be useful for us in later
chapters as well.

Figure 8.12: A configuration of a Turing machine 𝑀


with alphabet Σ and state space [𝑘] encodes the state
of 𝑀 at a particular step in its execution as a string 𝛼
over the alphabet Σ̄ = Σ × ({⋅} ∪ [𝑘]). The string is of
length 𝑡 where 𝑡 is such that 𝑀’s tape contains ∅ in
all positions 𝑡 and larger and 𝑀’s head is in a position
smaller than 𝑡. If 𝑀’s head is in the 𝑖-th position, then
for 𝑗 ≠ 𝑖, 𝛼𝑗 encodes the value of the 𝑗-th cell of 𝑀’s
tape, while 𝛼𝑖 encodes both this value as well as the
current state of 𝑀. If the machine writes the value 𝜏,
changes state to 𝑡, and moves right, then in the next
configuration will contain at position 𝑖 the value (𝜏, ⋅)
and at position 𝑖 + 1 the value (𝛼𝑖+1 , 𝑡).

Definition 8.8 — Configuration of Turing Machines. Let 𝑀 be a Turing ma-
chine with tape alphabet Σ and state space [𝑘]. A configuration of 𝑀
is a string 𝛼 ∈ Σ̄∗ , where Σ̄ = Σ × ({⋅} ∪ [𝑘]), that satisfies that there
is exactly one coordinate 𝑖 for which 𝛼𝑖 = (𝜎, 𝑠) for some 𝜎 ∈ Σ and
𝑠 ∈ [𝑘]. For all other coordinates 𝑗, 𝛼𝑗 = (𝜎′ , ⋅) for some 𝜎′ ∈ Σ.

A configuration 𝛼 ∈ Σ̄∗ of 𝑀 corresponds to the following state
of its execution:

• 𝑀 ’s tape contains 𝛼𝑗,0 for all 𝑗 < |𝛼| and contains ∅ for all po-
sitions that are at least |𝛼|, where we let 𝛼𝑗,0 be the value 𝜎 such
that 𝛼𝑗 = (𝜎, 𝑡) with 𝜎 ∈ Σ and 𝑡 ∈ {⋅} ∪ [𝑘]. (In other words,

since 𝛼𝑗 is a pair of an alphabet symbol 𝜎 and either a state in [𝑘]


or the symbol ⋅, 𝛼𝑗,0 is the first component 𝜎 of this pair.)

• 𝑀 ’s head is in the unique position 𝑖 for which 𝛼𝑖 has the form


(𝜎, 𝑠) for 𝑠 ∈ [𝑘], and 𝑀 ’s state is equal to 𝑠.

P
Definition 8.8 below has some technical details, but
is not actually that deep or complicated. Try to take a
moment to stop and think how you would encode as a
string the state of a Turing machine at a given point in
an execution.
Think what are all the components that you need to
know in order to be able to continue the execution
from this point onwards, and what is a simple way
to encode them using a list of finite symbols. In par-
ticular, with an eye towards our future applications,
try to think of an encoding which will make it as sim-
ple as possible to map a configuration at step 𝑡 to the
configuration at step 𝑡 + 1.

Definition 8.8 is a little cumbersome, but ultimately a configuration


is simply a string that encodes a snapshot of the Turing machine at a
given point in the execution. (In operating-systems lingo, it is a “core
dump”.) Such a snapshot needs to encode the following components:

1. The current head position.

2. The full contents of the large scale memory, that is the tape.

3. The contents of the “local registers”, that is the state of the ma-
chine.

The precise details of how we encode a configuration are not impor-


tant, but we do want to record the following simple fact:
Lemma 8.9 Let 𝑀 be a Turing machine and let NEXT𝑀 ∶ Σ̄∗ → Σ̄∗
be the function that maps a configuration of 𝑀 to the configuration
at the next step of the execution. Then for every 𝑖 ∈ ℕ, the value of
NEXT𝑀 (𝛼)𝑖 only depends on the coordinates 𝛼𝑖−1 , 𝛼𝑖 , 𝛼𝑖+1 .
(For simplicity of notation, above we use the convention that if 𝑖
is “out of bounds”, such as 𝑖 < 0 or 𝑖 > |𝛼|, then we assume that
𝛼𝑖 = (∅, ⋅).) We leave proving Lemma 8.9 as Exercise 8.7. The idea
behind the proof is simple: if the head is neither in position 𝑖 nor
positions 𝑖 − 1 and 𝑖 + 1, then the next-step configuration at 𝑖 will be
the same as it was before. Otherwise, we can “read off” the state of the
Turing machine and the value of the tape at the head location from the

configuration at 𝑖 or one of its neighbors and use that to update what


the new state at 𝑖 should be. Completing the full proof is not hard,
but doing it is a great way to ensure that you are comfortable with the
definition of configurations.

Completing the proof of Theorem 8.7. We can now restate Theorem 8.7
more formally, and complete its proof:

Theorem 8.10 — One dimensional automata are Turing complete (formal state-
ment). For every Turing machine 𝑀 , if we denote by Σ̄ the alphabet
of its configuration strings, then there is a one-dimensional cellular
automaton 𝑟 over the alphabet Σ̄ such that

NEXT𝑀 (𝛼) = NEXT𝑟 (𝛼)

for every configuration 𝛼 ∈ Σ̄∗ of 𝑀 (again using the convention
that we consider 𝛼𝑖 = ∅ if 𝑖 is “out of bounds”).

Proof. We consider the element (∅, ⋅) of Σ̄ to correspond to the ∅


element of the automaton 𝑟. In this case, by Lemma 8.9, the function
NEXT𝑀 that maps a configuration of 𝑀 into the next one is in fact a
valid rule for a one dimensional automata.

The automaton arising from the proof of Theorem 8.10 has a large
alphabet, and furthermore one whose size depends on the ma-
chine 𝑀 that is being simulated. It turns out that one can obtain an
automaton with an alphabet of fixed size that is independent of the
program being simulated, and in fact the alphabet of the automaton
can be the minimal set {0, 1}! See Fig. 8.13 for an example of such an
Turing-complete automaton.

R
Remark 8.11 — Configurations of NAND-TM programs.
We can use the same approach as Definition 8.8 to
define configurations of a NAND-TM program. Such a
configuration will need to encode:

1. The current value of the variable i.

2. For every scalar variable foo, the value of foo.

3. For every array variable Bar, the value Bar[𝑗] for
every 𝑗 ∈ {0, … , 𝑡 − 1} where 𝑡 − 1 is the largest
value that the index variable i ever achieved in the
computation.

Figure 8.13: Evolution of a one dimensional automaton. Each row in the figure corresponds to a configuration. The initial configuration corresponds to the top row and contains only a single “live” cell. This figure corresponds to the “Rule 110” automaton of Stephen Wolfram which is Turing complete. Figure taken from Wolfram MathWorld.

8.5 LAMBDA CALCULUS AND FUNCTIONAL PROGRAMMING LANGUAGES
The λ calculus is another way to define computable functions. It was
proposed by Alonzo Church in the 1930’s around the same time as
Alan Turing’s proposal of the Turing machine. Interestingly, while
Turing machines are not used for practical computation, the λ calculus
has inspired functional programming languages such as LISP, ML and
Haskell, and indirectly the development of many other programming
languages as well. In this section we will present the λ calculus and
show that its power is equivalent to NAND-TM programs (and hence
also to Turing machines). Our Github repository contains a Jupyter
notebook with a Python implementation of the λ calculus that you can
experiment with to get a better feel for this topic.

The λ operator. At the core of the λ calculus is a way to define “anony-


mous” functions. For example, instead of giving a name 𝑓 to a func-
tion and defining it as

𝑓(𝑥) = 𝑥 × 𝑥
we can write it as

𝜆𝑥.𝑥 × 𝑥
and so (𝜆𝑥.𝑥 × 𝑥)(7) = 49. That is, you can think of 𝜆𝑥.𝑒𝑥𝑝(𝑥),
where 𝑒𝑥𝑝 is some expression as a way of specifying the anonymous
function 𝑥 ↦ 𝑒𝑥𝑝(𝑥). Anonymous functions, using either 𝜆𝑥.𝑓(𝑥), 𝑥 ↦
𝑓(𝑥) or other closely related notation, appear in many programming
languages. For example, in Python we can define the squaring function
using lambda x: x*x while in JavaScript we can use x => x*x or
(x) => x*x. In Scheme we would define it as (lambda (x) (* x x)).
Clearly, the name of the argument to a function doesn’t matter, and so
𝜆𝑦.𝑦 × 𝑦 is the same as 𝜆𝑥.𝑥 × 𝑥, as both correspond to the squaring
function.
Dropping parentheses. To reduce notational clutter, when writing
𝜆 calculus expressions we often drop the parentheses for function
evaluation. Hence instead of writing 𝑓(𝑥) for the result of applying
the function 𝑓 to the input 𝑥, we can also write this as simply 𝑓 𝑥.
Therefore we can write (𝜆𝑥.𝑥 × 𝑥)7 = 49. In this chapter, we will use
both the 𝑓(𝑥) and 𝑓 𝑥 notations for function application. Function
applications associate from left to right, and hence 𝑓 𝑔 ℎ
is the same as (𝑓 𝑔) ℎ.

8.5.1 Applying functions to functions


A key feature of the λ calculus is that functions are “first-class objects”
in the sense that we can use functions as arguments to other functions.

For example, can you guess what number the following expression
evaluates to?

(((𝜆𝑓.(𝜆𝑦.(𝑓 (𝑓 𝑦))))(𝜆𝑥.𝑥 × 𝑥)) 3) (8.1)

P
The expression (8.1) might seem daunting, but before
you look at the solution below, try to break it apart
to its components, and evaluate each component at a
time. Working out this example would go a long way
toward understanding the λ calculus.

Let’s evaluate (8.1) one step at a time. As nice as it is for the λ


calculus to allow anonymous functions, adding names can be very
helpful for understanding complicated expressions. So, let us write
𝐹 = 𝜆𝑓.(𝜆𝑦.(𝑓(𝑓𝑦))) and 𝑔 = 𝜆𝑥.𝑥 × 𝑥.
Therefore (8.1) becomes

((𝐹 𝑔) 3) .

On input a function 𝑓, 𝐹 outputs the function 𝜆𝑦.(𝑓(𝑓 𝑦)), or in


other words 𝐹 𝑓 is the function 𝑦 ↦ 𝑓(𝑓(𝑦)). Our function 𝑔 is simply
𝑔(𝑥) = 𝑥2 and so (𝐹 𝑔) is the function that maps 𝑦 to (𝑦2 )2 = 𝑦4 . Hence
((𝐹 𝑔)3) = 34 = 81.
Solved Exercise 8.1 What number does the following expression evalu-
ate to?

((𝜆𝑥.(𝜆𝑦.𝑥)) 2) 9 . (8.2)

Solution:
𝜆𝑦.𝑥 is the function that on input 𝑦 ignores its input and outputs
𝑥. Hence (𝜆𝑥.(𝜆𝑦.𝑥))2 yields the function 𝑦 ↦ 2 (or, using 𝜆 nota-
tion, the function 𝜆𝑦.2). Hence (8.2) is equivalent to (𝜆𝑦.2)9 = 2.

8.5.2 Obtaining multi-argument functions via Currying


In a λ expression of the form 𝜆𝑥.𝑒, the expression 𝑒 can itself involve
the λ operator. Thus for example the function

𝜆𝑥.(𝜆𝑦.𝑥 + 𝑦) (8.3)

maps 𝑥 to the function 𝑦 ↦ 𝑥 + 𝑦.


In particular, if we invoke the function (8.3) on 𝑎 to obtain some
function 𝑓, and then invoke 𝑓 on 𝑏, we obtain the value 𝑎 + 𝑏. We

can see that the one-argument function (8.3) corresponding to 𝑎 ↦


(𝑏 ↦ 𝑎 + 𝑏) can also be thought of as the two-argument function
(𝑎, 𝑏) ↦ 𝑎 + 𝑏. Generally, we can use the λ expression 𝜆𝑥.(𝜆𝑦.𝑓(𝑥, 𝑦))
to simulate the effect of a two argument function (𝑥, 𝑦) ↦ 𝑓(𝑥, 𝑦). This
technique is known as Currying. We will use the shorthand 𝜆𝑥, 𝑦.𝑒
for 𝜆𝑥.(𝜆𝑦.𝑒). If 𝑓 = 𝜆𝑥.(𝜆𝑦.𝑒) then (𝑓𝑎)𝑏 corresponds to applying 𝑓𝑎
and then invoking the resulting function on 𝑏, obtaining the result of
replacing in 𝑒 the occurrences of 𝑥 with 𝑎 and occurrences of 𝑦 with
𝑏. By our rules of associativity, this is the same as (𝑓 𝑎 𝑏) which we’ll
sometimes also write as 𝑓(𝑎, 𝑏).
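
The same currying idea can be written directly in Python (a small illustration of ours, not part of the λ calculus itself):

# A two-argument function ...
def add(x, y):
    return x + y

# ... and its curried form: a function that returns a function.
curried_add = lambda x: (lambda y: x + y)

print(add(2, 3))          # 5
print(curried_add(2)(3))  # 5
add2 = curried_add(2)     # the function y -> 2 + y, with 2 "hardwired"
print(add2(10))           # 12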

8.5.3 Formal description of the λ calculus


We now provide a formal description of the λ calculus. We start with
“basic expressions” that contain a single variable such as 𝑥 or 𝑦 and
build more complex expressions of the form (𝑒 𝑒′ ) and 𝜆𝑥.𝑒 where 𝑒, 𝑒′
are expressions and 𝑥 is a variable identifier. Formally, λ expressions
are defined as follows:

Definition 8.12 — λ expression. A λ expression is either a single variable
identifier or an expression 𝑒 of one of the following forms:

• Application: 𝑒 = (𝑒′ 𝑒″ ), where 𝑒′ and 𝑒″ are λ expressions.

• Abstraction: 𝑒 = 𝜆𝑥.(𝑒′ ) where 𝑒′ is a λ expression.

Figure 8.14: In the “currying” transformation, we can create the effect of a two parameter function 𝑓(𝑥, 𝑦) with the λ expression 𝜆𝑥.(𝜆𝑦.𝑓(𝑥, 𝑦)) which on input 𝑥 outputs a one-parameter function 𝑓𝑥 that has 𝑥 “hardwired” into it and such that 𝑓𝑥 (𝑦) = 𝑓(𝑥, 𝑦). This can be illustrated by a circuit diagram; see Chelsea Voss’s site.

Definition 8.12 is a recursive definition since we defined the concept


of λ expressions in terms of itself. This might seem confusing at first,
but in fact you have known recursive definitions since you were an
elementary school student. Consider how we define an arithmetic
expression: it is an expression that is either just a number, or has one of
the forms (𝑒 + 𝑒′ ), (𝑒 − 𝑒′ ), (𝑒 × 𝑒′ ), or (𝑒 ÷ 𝑒′ ), where 𝑒 and 𝑒′ are other
arithmetic expressions.
Free and bound variables. Variables in a λ expression can either be
free or bound to a 𝜆 operator (in the sense of Section 1.4.7). In a single-
variable λ expression 𝑣𝑎𝑟, the variable 𝑣𝑎𝑟 is free. The set of free and
bound variables in an application expression 𝑒 = (𝑒′ 𝑒″ ) is the same
as that of the underlying expressions 𝑒′ and 𝑒″ . In an abstraction ex-
pression 𝑒 = 𝜆𝑣𝑎𝑟.(𝑒′ ), all free occurrences of 𝑣𝑎𝑟 in 𝑒′ are bound to
the 𝜆 operator of 𝑒. If you find the notion of free and bound variables
confusing, you can avoid all these issues by using unique identifiers
for all variables.
Precedence and parentheses. We will use the following rules to allow
us to drop some parentheses. Function application associates from left
to right, and so 𝑓𝑔ℎ is the same as (𝑓𝑔)ℎ. Function application has a
higher precedence than the λ operator, and so 𝜆𝑥.𝑓𝑔𝑥 is the same as

𝜆𝑥.((𝑓𝑔)𝑥). This is similar to how we use the precedence rules in arith-


metic operations to allow us to use fewer parentheses and so write the
expression (7 × 3) + 2 as 7 × 3 + 2. As mentioned in Section 8.5.2, we
also use the shorthand 𝜆𝑥, 𝑦.𝑒 for 𝜆𝑥.(𝜆𝑦.𝑒) and the shorthand 𝑓(𝑥, 𝑦)
for (𝑓 𝑥) 𝑦. This plays nicely with the “Currying” transformation of
simulating multi-input functions using λ expressions.

Equivalence of λ expressions. As we have seen in Solved Exercise 8.1,


the rule that (𝜆𝑥.𝑒𝑥𝑝)𝑒𝑥𝑝′ is equivalent to 𝑒𝑥𝑝[𝑥 → 𝑒𝑥𝑝′ ] enables us
to modify λ expressions and obtain a simpler equivalent form for them.
Another rule that we can use is that the parameter does not matter
and hence for example 𝜆𝑦.𝑦 is the same as 𝜆𝑧.𝑧. Together these rules
define the notion of equivalence of λ expressions:

Definition 8.13 — Equivalence of λ expressions. Two λ expressions are
equivalent if they can be made into the same expression by repeated
applications of the following rules:

1. Evaluation (aka 𝛽 reduction): The expression (𝜆𝑥.𝑒𝑥𝑝)𝑒𝑥𝑝′ is


equivalent to 𝑒𝑥𝑝[𝑥 → 𝑒𝑥𝑝′ ].

2. Variable renaming (aka 𝛼 conversion): The expression 𝜆𝑥.𝑒𝑥𝑝


is equivalent to 𝜆𝑦.𝑒𝑥𝑝[𝑥 → 𝑦].

If 𝑒𝑥𝑝 is a λ expression of the form 𝜆𝑥.𝑒𝑥𝑝′ then it naturally corre-


sponds to the function that maps any input 𝑧 to 𝑒𝑥𝑝′ [𝑥 → 𝑧]. Hence
the λ calculus naturally implies a computational model. Since in the λ
calculus the inputs can themselves be functions, we need to decide in
what order we evaluate an expression such as

(𝜆𝑥.𝑓)(𝜆𝑦.𝑔 𝑧) . (8.4)
There are two natural conventions for this:

• Call by name (aka “lazy evaluation”): We evaluate (8.4) by first plug-


ging in the right-hand expression (𝜆𝑦.𝑔 𝑧) as input to the left-hand
side function, obtaining 𝑓[𝑥 → (𝜆𝑦.𝑔 𝑧)] and then continue from
there.

• Call by value (aka “eager evaluation”): We evaluate (8.4) by first


evaluating the right-hand side and obtaining ℎ = 𝑔[𝑦 → 𝑧], and then
plugging this into the left-hand side to obtain 𝑓[𝑥 → ℎ].

Because the λ calculus has only pure functions, that do not have
“side effects”, in many cases the order does not matter. In fact, it can
be shown that if we obtain a definite irreducible expression (for ex-
ample, a number) in both strategies, then it will be the same one.

However, for concreteness we will always use the “call by name” (i.e.,
lazy evaluation) order. (The same choice is made in the programming
language Haskell, though many other programming languages use
eager evaluation.) Formally, the evaluation of a λ expression using
“call by name” is captured by the following process:

Definition 8.14 — Simplification of λ expressions. Let 𝑒 be a λ expres-


sion. The simplification of 𝑒 is the result of the following recursive
process:

1. If 𝑒 is a single variable 𝑥 then the simplification of 𝑒 is 𝑒.

2. If 𝑒 has the form 𝑒 = 𝜆𝑥.𝑒′ then the simplification of 𝑒 is 𝜆𝑥.𝑓 ′


where 𝑓 ′ is the simplification of 𝑒′ .

3. (Evaluation / 𝛽 reduction.) If 𝑒 has the form 𝑒 = (𝜆𝑥.𝑒′ 𝑒″ ) then


the simplification of 𝑒 is the simplification of 𝑒′ [𝑥 → 𝑒″ ], which
denotes replacing all copies of 𝑥 in 𝑒′ bound to the 𝜆 operator
with 𝑒″

4. (Renaming / 𝛼 conversion.) The canonical simplification of 𝑒 is


obtained by taking the simplification of 𝑒 and renaming the vari-
ables so that the first bound variable in the expression is 𝑣0 , the
second one is 𝑣1 , and so on and so forth.

We say that two λ expressions 𝑒 and 𝑒′ are equivalent, denoted by


𝑒 ≅ 𝑒′ , if they have the same canonical simplification.

Solved Exercise 8.2 — Equivalence of λ expressions. Prove that the following


two expressions 𝑒 and 𝑓 are equivalent:

𝑒 = 𝜆𝑥.𝑥

𝑓 = (𝜆𝑎.(𝜆𝑏.𝑏))(𝜆𝑧.𝑧 𝑧)

Solution:
The canonical simplification of 𝑒 is simply 𝜆𝑣0 .𝑣0 . To do the
canonical simplification of 𝑓 we first use 𝛽 reduction to plug in
𝜆𝑧.𝑧𝑧 instead of 𝑎 in (𝜆𝑏.𝑏) but since 𝑎 is not used in this function at
all, we simply obtained 𝜆𝑏.𝑏 which simplifies to 𝜆𝑣0 .𝑣0 as well.

8.5.4 Infinite loops in the λ calculus


Like Turing machines and NAND-TM programs, the simplification
process in the λ calculus can also enter into an infinite loop. For exam-
ple, consider the λ expression

(𝜆𝑥.𝑥𝑥) (𝜆𝑥.𝑥𝑥) (8.5)


If we try to simplify (8.5) by invoking the left-hand function on the
right-hand one, then we get another copy of (8.5) and hence this never
ends. There are examples where the order of evaluation can matter for
whether or not an expression can be simplified, see Exercise 8.9.

8.6 THE “ENHANCED” Λ CALCULUS


We now discuss the λ calculus as a computational model. We will
start by describing an “enhanced” version of the λ calculus that con-
tains some “superfluous features” but is easier to wrap your head
around. We will first show how the enhanced λ calculus is equiva-
lent to Turing machines in computational power. Then we will show
how all the features of “enhanced λ calculus” can be implemented as
“syntactic sugar” on top of the “pure” (i.e., non-enhanced) λ calculus.
Hence the pure λ calculus is equivalent in power to Turing machines
(and hence also to RAM machines and all other Turing-equivalent
models).
The enhanced λ calculus includes the following set of objects and
operations:
• Boolean constants and IF function: There are λ expressions 0, 1
and IF that satisfy the following conditions: for every λ expression
𝑒 and 𝑓, IF 1 𝑒 𝑓 = 𝑒 and IF 0 𝑒 𝑓 = 𝑓. That is, IF is the function that
given three arguments 𝑎, 𝑒, 𝑓 outputs 𝑒 if 𝑎 = 1 and 𝑓 if 𝑎 = 0.

• Pairs: There is a λ expression PAIR which we will think of as the


pairing function. For every λ expressions 𝑒, 𝑓, PAIR 𝑒 𝑓 is the
pair ⟨𝑒, 𝑓⟩ that contains 𝑒 as its first member and 𝑓 as its second
member. We also have λ expressions HEAD and TAIL that extract
the first and second member of a pair respectively. Hence, for every
λ expressions 𝑒, 𝑓, HEAD (PAIR 𝑒 𝑓) = 𝑒 and TAIL (PAIR 𝑒 𝑓) = 𝑓.²
² In Lisp, the PAIR, HEAD and TAIL functions are
traditionally called cons, car and cdr.
• Lists and strings: There is λ expression NIL that corresponds to
the empty list, which we also denote by ⟨⊥⟩. Using PAIR and NIL
we construct lists. The idea is that if 𝐿 is a 𝑘 element list of the
form ⟨𝑒1 , 𝑒2 , … , 𝑒𝑘 , ⊥⟩ then for every λ expression 𝑒0 we can obtain
the 𝑘 + 1 element list ⟨𝑒0 , 𝑒1 , 𝑒2 , … , 𝑒𝑘 , ⊥⟩ using the expression
PAIR 𝑒0 𝐿. For example, for every three λ expressions 𝑒, 𝑓, 𝑔, the
following corresponds to the three element list ⟨𝑒, 𝑓, 𝑔, ⊥⟩:
PAIR 𝑒 (PAIR 𝑓 (PAIR 𝑔 NIL)) .

The λ expression ISEMPTY returns 1 on NIL and returns 0 on every


other list. A string is simply a list of bits.

• List operations: The enhanced λ calculus also contains the


list-processing functions MAP, REDUCE, and FILTER. Given
a list 𝐿 = ⟨𝑥0 , … , 𝑥𝑛−1 , ⊥⟩ and a function 𝑓, MAP 𝐿 𝑓 ap-
plies 𝑓 on every member of the list to obtain the new list
𝐿′ = ⟨𝑓(𝑥0 ), … , 𝑓(𝑥𝑛−1 ), ⊥⟩. Given a list 𝐿 as above and an
expression 𝑓 whose output is either 0 or 1, FILTER 𝐿 𝑓 returns the
list ⟨𝑥𝑖 ⟩𝑓𝑥𝑖 =1 containing all the elements of 𝐿 for which 𝑓 outputs
1. The function REDUCE applies a “combining” operation to a
list. For example, REDUCE 𝐿 + 0 will return the sum of all the
elements in the list 𝐿. More generally, REDUCE takes a list 𝐿, an
operation 𝑓 (which we think of as taking two arguments) and a λ
expression 𝑧 (which we think of as the “neutral element” for the
operation 𝑓, such as 0 for addition and 1 for multiplication). The
output is defined via


                ⎧ 𝑧                                     if 𝐿 = NIL
REDUCE 𝐿 𝑓 𝑧 = ⎨
                ⎩ 𝑓 (HEAD 𝐿) (REDUCE (TAIL 𝐿) 𝑓 𝑧)      otherwise

See Fig. 8.16 for an illustration of the three list-processing operations.

• Recursion: Finally, we want to be able to execute recursive func-


tions. Since in λ calculus functions are anonymous, we can’t write
a definition of the form 𝑓(𝑥) = 𝑏𝑙𝑎ℎ where 𝑏𝑙𝑎ℎ includes calls to
𝑓. Instead we use functions 𝑓 that take an additional input 𝑚𝑒 as a
parameter. The operator RECURSE will take such a function 𝑓 as
input and return a “recursive version” of 𝑓 where all the calls to 𝑚𝑒
are replaced by recursive calls to this function. That is, if we have a
function 𝐹 taking two parameters 𝑚𝑒 and 𝑥, then RECURSE 𝐹 will
be the function 𝑓 taking one parameter 𝑥 such that 𝑓(𝑥) = 𝐹 (𝑓, 𝑥)
for every 𝑥.
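
As an informal illustration of the list operations described above (using Python tuples for pairs and None for NIL; these choices are ours, not part of the enhanced λ calculus), here is how MAP, FILTER and REDUCE can be defined recursively on lists built from pairs:

NIL = None   # the empty list

def reduce_(L, f, z):
    # REDUCE L f z = z if L is empty, else f(HEAD L, REDUCE(TAIL L, f, z))
    if L is NIL:
        return z
    head, tail = L
    return f(head, reduce_(tail, f, z))

def map_(L, f):
    return reduce_(L, lambda x, rest: (f(x), rest), NIL)

def filter_(L, f):
    return reduce_(L, lambda x, rest: (x, rest) if f(x) else rest, NIL)

# The list <1, 2, 3, ⊥> as nested pairs.
L = (1, (2, (3, NIL)))
print(reduce_(L, lambda a, b: a + b, 0))   # 6
print(map_(L, lambda x: x * x))            # (1, (4, (9, None)))
print(filter_(L, lambda x: x % 2 == 1))    # (1, (3, None))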

Solved Exercise 8.3 — Compute NAND using λ calculus. Give a λ expression
𝑁 such that 𝑁 𝑥 𝑦 = NAND(𝑥, 𝑦) for every 𝑥, 𝑦 ∈ {0, 1}.

Solution:
The NAND of 𝑥, 𝑦 is equal to 1 unless 𝑥 = 𝑦 = 1. Hence we can
write

𝑁 = 𝜆𝑥, 𝑦.IF(𝑥, IF(𝑦, 0, 1), 1)


Solved Exercise 8.4 — Compute XOR using λ calculus. Give a λ expression


XOR such that for every list 𝐿 = ⟨𝑥0 , … , 𝑥𝑛−1 , ⊥⟩ where 𝑥𝑖 ∈ {0, 1} for
𝑖 ∈ [𝑛], XOR𝐿 evaluates to ∑ 𝑥𝑖 mod 2.

Solution:
First, we note that we can compute XOR of two bits as follows:

NOT = 𝜆𝑎.IF(𝑎, 0, 1) (8.6)

and
XOR2 = 𝜆𝑎, 𝑏.IF(𝑏, NOT(𝑎), 𝑎) (8.7)
(We are using here a bit of syntactic sugar to describe the func-
tions. To obtain the λ expression for XOR we will simply replace
the expression (8.6) in (8.7).) Now recursively we can define the
XOR of a list as follows:


         ⎧ 0                                      if 𝐿 is empty
XOR(𝐿) = ⎨
         ⎩ XOR2 (HEAD(𝐿), XOR(TAIL(𝐿)))           otherwise

This means that XOR is equal to

RECURSE (𝜆𝑚𝑒, 𝐿.IF(ISEMPTY(𝐿), 0, XOR2 (HEAD 𝐿 , 𝑚𝑒(TAIL 𝐿)))) .

That is, XOR is obtained by applying the RECURSE operator


to the function that on inputs 𝑚𝑒, 𝐿, returns 0 if ISEMPTY(𝐿) and
otherwise returns XOR2 applied to HEAD(𝐿) and to 𝑚𝑒(TAIL(𝐿)).
We could have also computed XOR using the REDUCE opera-
tion, we leave working this out as an exercise to the reader.
Figure 8.15: A list ⟨𝑥0 , 𝑥1 , 𝑥2 ⟩ in the λ calculus is con-


structed from the tail up, building the pair ⟨𝑥2 , NIL⟩,
then the pair ⟨𝑥1 , ⟨𝑥2 , NIL⟩⟩ and finally the pair
⟨𝑥0 , ⟨𝑥1 , ⟨𝑥2 , NIL⟩⟩⟩. That is, a list is a pair where
the first element of the pair is the first element of the
list and the second element is the rest of the list. The
figure on the left renders this “pairs inside pairs”
construction, though it is often easier to think of a list
as a “chain”, as in the figure on the right, where the
second element of each pair is thought of as a link,
pointer or reference to the remainder of the list.

Figure 8.16: Illustration of the MAP, FILTER and


REDUCE operations.

8.6.1 Computing a function in the enhanced λ calculus


An enhanced λ expression is obtained by composing the objects above
with the application and abstraction rules. The result of simplifying a λ
expression is an equivalent expression, and hence if two expressions
have the same simplification then they are equivalent.

Definition 8.15 — Computing a function via λ calculus. Let 𝐹 ∶ {0, 1}∗ →
{0, 1}∗ be a function and let 𝑒𝑥𝑝 be a λ expression.
We say that 𝑒𝑥𝑝 computes 𝐹 if for every 𝑥 ∈ {0, 1}∗ ,

𝑒𝑥𝑝⟨𝑥0 , … , 𝑥𝑛−1 , ⊥⟩ ≅ ⟨𝑦0 , … , 𝑦𝑚−1 , ⊥⟩

where 𝑛 = |𝑥|, 𝑦 = 𝐹 (𝑥), and 𝑚 = |𝑦|, and the notion of equiva-
lence is defined as per Definition 8.14.

8.6.2 Enhanced λ calculus is Turing-complete


The basic operations of the enhanced λ calculus more or less amount
to the Lisp or Scheme programming languages. Given that, it is per-
haps not surprising that the enhanced λ-calculus is equivalent to
Turing machines:

Theorem 8.16 — Lambda calculus and NAND-TM. For every function


𝐹 ∶ {0, 1}∗ → {0, 1}∗ , 𝐹 is computable in the enhanced λ calculus if
and only if it is computable by a Turing machine.

Proof Idea:
To prove the theorem, we need to show that (1) if 𝐹 is computable
by a λ calculus expression then it is computable by a Turing machine,
and (2) if 𝐹 is computable by a Turing machine, then it is computable
by an enhanced λ calculus expression.
Showing (1) is fairly straightforward. Applying the simplification
rules to a λ expression basically amounts to “search and replace”

which we can implement easily in, say, NAND-RAM, or for that


matter Python (both of which are equivalent to Turing machines in
power). Showing (2) essentially amounts to simulating a Turing ma-
chine (or writing a NAND-TM interpreter) in a functional program-
ming language such as LISP or Scheme. We give the details below but
how this can be done is a good exercise in mastering some functional
programming techniques that are useful in their own right.

Proof of Theorem 8.16. We only sketch the proof. The “if” direction
is simple. As mentioned above, evaluating λ expressions basically
amounts to “search and replace”. It is also a fairly straightforward
programming exercise to implement all the above basic operations in
an imperative language such as Python or C, and using the same ideas
we can do so in NAND-RAM as well, which we can then transform to
a NAND-TM program.
For the “only if” direction we need to simulate a Turing machine
using a λ expression. We will do so by first showing for every Tur-
ing machine 𝑀 a λ expression to compute the next-step function
NEXT𝑀 ∶ Σ̄∗ → Σ̄∗ that maps a configuration of 𝑀 to the next one (see
Section 8.4.2).

A configuration of 𝑀 is a string 𝛼 ∈ Σ̄∗ for a finite set Σ̄. We can
encode every symbol 𝜎 ∈ Σ̄ by a finite string in {0, 1}ℓ , and so we will
encode a configuration 𝛼 in the λ calculus as a list ⟨𝛼0 , 𝛼1 , … , 𝛼𝑚−1 , ⊥⟩
where 𝛼𝑖 is an ℓ-length string (i.e., an ℓ-length list of 0’s and 1’s) en-
coding a symbol in Σ̄.

By Lemma 8.9, for every 𝛼 ∈ Σ̄∗ , NEXT𝑀 (𝛼)𝑖 is equal to
𝑟(𝛼𝑖−1 , 𝛼𝑖 , 𝛼𝑖+1 ) for some finite function 𝑟 ∶ Σ̄3 → Σ̄. Using our
encoding of Σ̄ as {0, 1}ℓ , we can also think of 𝑟 as mapping {0, 1}3ℓ to
{0, 1}ℓ . By Solved Exercise 8.3, we can compute the NAND function,
and hence every finite function, including 𝑟, using the λ calculus.
Using this insight, we can compute NEXT𝑀 using the λ calculus as
follows. Given a list 𝐿 encoding the configuration 𝛼0 ⋯ 𝛼𝑚−1 , we
define the lists 𝐿𝑝𝑟𝑒𝑣 and 𝐿𝑛𝑒𝑥𝑡 encoding the configuration 𝛼 shifted
by one step to the right and left respectively. The next configuration
𝛼′ is defined as 𝛼′𝑖 = 𝑟(𝐿𝑝𝑟𝑒𝑣 [𝑖], 𝐿[𝑖], 𝐿𝑛𝑒𝑥𝑡 [𝑖]) where we let 𝐿′ [𝑖] denote
the 𝑖-th element of 𝐿′ . This can be computed by recursion (and hence
using the enhanced λ calculus’ RECURSE operator) as follows:

Algorithm 8.17 — NEXT𝑀 using the λ calculus.
Input: List 𝐿 = ⟨𝛼0 , 𝛼1 , … , 𝛼𝑚−1 , ⊥⟩ encoding a configuration 𝛼.
Output: List 𝐿′ encoding NEXT𝑀 (𝛼)
1: procedure ComputeNext(𝐿𝑝𝑟𝑒𝑣 , 𝐿, 𝐿𝑛𝑒𝑥𝑡 )
2:   if ISEMPTY 𝐿𝑝𝑟𝑒𝑣 then
3:     return NIL
4:   end if
5:   𝑎 ← HEAD 𝐿𝑝𝑟𝑒𝑣
6:   if ISEMPTY 𝐿 then
7:     𝑏 ← ∅    # Encoding of ∅ in {0, 1}ℓ
8:   else
9:     𝑏 ← HEAD 𝐿
10:  end if
11:  if ISEMPTY 𝐿𝑛𝑒𝑥𝑡 then
12:    𝑐 ← ∅
13:  else
14:    𝑐 ← HEAD 𝐿𝑛𝑒𝑥𝑡
15:  end if
16:  return PAIR 𝑟(𝑎, 𝑏, 𝑐) ComputeNext(TAIL 𝐿𝑝𝑟𝑒𝑣 , TAIL 𝐿, TAIL 𝐿𝑛𝑒𝑥𝑡 )
17: end procedure
18: 𝐿𝑝𝑟𝑒𝑣 ← PAIR ∅ 𝐿    # 𝐿𝑝𝑟𝑒𝑣 = ⟨∅, 𝛼0 , … , 𝛼𝑚−1 , ⊥⟩
19: 𝐿𝑛𝑒𝑥𝑡 ← TAIL 𝐿    # 𝐿𝑛𝑒𝑥𝑡 = ⟨𝛼1 , … , 𝛼𝑚−1 , ⊥⟩
20: return ComputeNext(𝐿𝑝𝑟𝑒𝑣 , 𝐿, 𝐿𝑛𝑒𝑥𝑡 )

Once we can compute NEXT𝑀 , we can simulate the execution of


𝑀 on input 𝑥 using the following recursion. Define FINAL(𝛼) to be
the final configuration of 𝑀 when initialized at configuration 𝛼. The
function FINAL can be defined recursively as follows:


           ⎧ 𝛼                              if 𝛼 is a halting configuration
FINAL(𝛼) = ⎨
           ⎩ FINAL(NEXT𝑀 (𝛼))               otherwise

Checking whether a configuration is halting (i.e., whether it is


one in which the transition function would output Halt) can be easily
implemented in the 𝜆 calculus, and hence we can use the RECURSE
to compute FINAL. If we let 𝛼0 be the initial configuration of 𝑀 on
input 𝑥 then we can obtain the output 𝑀 (𝑥) from FINAL(𝛼0 ), hence
completing the proof.

8.7 FROM ENHANCED TO PURE Λ CALCULUS


While the collection of “basic” functions we allowed for the enhanced
λ calculus is smaller than what’s provided by most Lisp dialects, com-
ing from NAND-TM it still seems a little “bloated”. Can we make do
with less? In other words, can we find a subset of these basic opera-
tions that can implement the rest?
It turns out that there is in fact a proper subset of the operations of
the enhanced λ calculus that can be used to implement the rest. That
subset is the empty set. That is, we can implement all the operations
above using the λ formalism only, even without using 0’s and 1’s. It’s
λ’s all the way down!

P
This is a good point to pause and think how
you would implement these operations your-
self. For example, start by thinking how you
could implement MAP using REDUCE, and
then REDUCE using RECURSE combined with
0, 1, IF, PAIR, HEAD, TAIL, NIL, ISEMPTY. You can
also implement PAIR, HEAD and TAIL based on
0, 1, IF. The most challenging part is to implement
RECURSE using only the operations of the pure λ
calculus.

Theorem 8.18 — Enhanced λ calculus equivalent to pure λ calculus. There
are λ expressions that implement the functions 0, 1, IF, PAIR, HEAD,
TAIL, NIL, ISEMPTY, MAP, REDUCE, and RECURSE.

The idea behind Theorem 8.18 is that we encode 0 and 1 them-


selves as λ expressions, and build things up from there. This is known
as Church encoding, as it was originated by Church in his effort to
show that the λ calculus can be a basis for all computation. We will
not write the full formal proof of Theorem 8.18 but outline the ideas
involved in it:

• We define 0 to be the function that on two inputs 𝑥, 𝑦 outputs 𝑦,


and 1 to be the function that on two inputs 𝑥, 𝑦 outputs 𝑥. We use
Currying to achieve the effect of two-input functions and hence
0 = 𝜆𝑥.𝜆𝑦.𝑦 and 1 = 𝜆𝑥.𝜆𝑦.𝑥. (This representation scheme is the
common convention for representing false and true but there are
many other alternative representations for 0 and 1 that would have
worked just as well.)

• The above implementation makes the IF function trivial:


IF(𝑐𝑜𝑛𝑑, 𝑎, 𝑏) is simply 𝑐𝑜𝑛𝑑 𝑎 𝑏 since 0𝑎𝑏 = 𝑏 and 1𝑎𝑏 = 𝑎. We

can write IF = 𝜆𝑥.𝑥 to achieve IF(𝑐𝑜𝑛𝑑, 𝑎, 𝑏) = (((IF𝑐𝑜𝑛𝑑)𝑎)𝑏) =


𝑐𝑜𝑛𝑑 𝑎 𝑏.

• To encode a pair (𝑥, 𝑦) we will produce a function 𝑓𝑥,𝑦 that has 𝑥


and 𝑦 “in its belly” and satisfies 𝑓𝑥,𝑦 𝑔 = 𝑔𝑥𝑦 for every function 𝑔.
That is, PAIR = 𝜆𝑥, 𝑦. (𝜆𝑔.𝑔𝑥𝑦). We can extract the first element of
a pair 𝑝 by writing 𝑝1 and the second element by writing 𝑝0, and so
HEAD = 𝜆𝑝.𝑝1 and TAIL = 𝜆𝑝.𝑝0.

• We define NIL to be the function that ignores its input and always
outputs 1. That is, NIL = 𝜆𝑥.1. The ISEMPTY function checks,
given an input 𝑝, whether we get 1 if we apply 𝑝 to the function
𝑧𝑒𝑟𝑜 = 𝜆𝑥, 𝑦.0 that ignores both its inputs and always outputs 0. For
every valid pair of the form 𝑝 = PAIR𝑥𝑦, 𝑝𝑧𝑒𝑟𝑜 = 𝑝𝑥𝑦 = 0 while
NIL𝑧𝑒𝑟𝑜 = 1. Formally, ISEMPTY = 𝜆𝑝.𝑝(𝜆𝑥, 𝑦.0).
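
Here is a brief Python rendering of these Church encodings (a sketch for intuition only; Python lambdas play the role of λ abstractions, and the helper to_bool is our own addition for testing):

# Church booleans: 1 selects its first argument, 0 its second.
ONE  = lambda x: lambda y: x
ZERO = lambda x: lambda y: y
IF   = lambda cond: cond          # IF cond a b  is just  cond a b

# Pairs: PAIR x y is a function holding x and y "in its belly".
PAIR = lambda x: lambda y: (lambda g: g(x)(y))
HEAD = lambda p: p(ONE)
TAIL = lambda p: p(ZERO)

# View a Church boolean as a Python bool (for testing only).
to_bool = lambda b: b(True)(False)

p = PAIR("first")("second")
print(HEAD(p))                         # first
print(TAIL(p))                         # second
print(to_bool(IF(ONE)(ONE)(ZERO)))     # True
print(to_bool(IF(ZERO)(ONE)(ZERO)))    # False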

R
Remark 8.19 — Church numerals (optional). There is
nothing special about Boolean values. You can use
similar tricks to implement natural numbers using
λ terms. The standard way to do so is to represent
the number 𝑛 by the function ITER𝑛 that on input a
function 𝑓 outputs the function 𝑥 ↦ 𝑓(𝑓(⋯ 𝑓(𝑥))) (𝑛
times). That is, we represent the natural number 1 as
𝜆𝑓.𝑓, the number 2 as 𝜆𝑓.(𝜆𝑥.𝑓(𝑓𝑥)), the number 3 as
𝜆𝑓.(𝜆𝑥.𝑓(𝑓(𝑓𝑥))), and so on and so forth. (Note that
this is not the same representation we used for 1 in
the Boolean context: this is fine; we already know that
the same object can be represented in more than one
way.) The number 0 is represented by the function
that maps any function 𝑓 to the identity function 𝜆𝑥.𝑥.
(That is, 0 = 𝜆𝑓.(𝜆𝑥.𝑥).)
In this representation, we can compute PLUS(𝑛, 𝑚)
as 𝜆𝑓.𝜆𝑥.(𝑛𝑓)((𝑚𝑓)𝑥) and TIMES(𝑛, 𝑚) as 𝜆𝑓.𝑛(𝑚𝑓).
Subtraction and division are trickier, but can be
achieved using recursion. (Working this out is a great
exercise.)
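
In the same informal Python style (again a sketch of ours, not part of the book's formal development), Church numerals and the PLUS and TIMES operations of Remark 8.19 look as follows:

# The Church numeral n maps f to the function applying f n times.
def church(n):
    return lambda f: lambda x: x if n == 0 else f(church(n - 1)(f)(x))

PLUS  = lambda n: lambda m: lambda f: lambda x: n(f)(m(f)(x))
TIMES = lambda n: lambda m: lambda f: n(m(f))

# Convert back to a Python integer by counting applications of +1.
to_int = lambda n: n(lambda k: k + 1)(0)

two, three = church(2), church(3)
print(to_int(PLUS(two)(three)))   # 5
print(to_int(TIMES(two)(three)))  # 6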

8.7.1 List processing


Now we come to a bigger hurdle, which is how to implement
MAP, FILTER, REDUCE and RECURSE in the pure λ calculus. It
turns out that we can build MAP and FILTER from REDUCE, and
REDUCE from RECURSE. For example MAP(𝐿, 𝑓) is the same as
REDUCE(𝐿, 𝑔, NIL) where 𝑔 is the operation that on input 𝑥 and 𝑦,
outputs PAIR(𝑓(𝑥), 𝑦). (I leave checking this as a (recommended!)
exercise for you, the reader.)
We can define REDUCE(𝐿, 𝑓, 𝑧) recursively, by setting
REDUCE(NIL, 𝑓, 𝑧) = 𝑧 and stipulating that given a non-

empty list 𝐿, which we can think of as a pair (ℎ𝑒𝑎𝑑, 𝑟𝑒𝑠𝑡),


REDUCE(𝐿, 𝑓, 𝑧) = 𝑓(ℎ𝑒𝑎𝑑, REDUCE(𝑟𝑒𝑠𝑡, 𝑓, 𝑧)). Thus, we
might try to write a recursive λ expression for REDUCE as follows

REDUCE = 𝜆𝐿, 𝑓, 𝑧.IF(ISEMPTY(𝐿), 𝑧, 𝑓 HEAD(𝐿) REDUCE(TAIL(𝐿), 𝑓, 𝑧)) .   (8.8)
The only fly in this ointment is that the λ calculus does not have the
notion of recursion, and so this is an invalid definition. But of course
we can use our RECURSE operator to solve this problem. We will
replace the recursive call to “REDUCE” with a call to a function 𝑚𝑒
that is given as an extra argument, and then apply RECURSE to this.
Thus REDUCE = RECURSE 𝑚𝑦𝑅𝐸𝐷𝑈 𝐶𝐸 where

𝑚𝑦𝑅𝐸𝐷𝑈 𝐶𝐸 = 𝜆𝑚𝑒, 𝐿, 𝑓, 𝑧.IF(ISEMPTY(𝐿), 𝑧, 𝑓 HEAD(𝐿) 𝑚𝑒(TAIL(𝐿), 𝑓, 𝑧)) .   (8.9)

8.7.2 The Y combinator, or recursion without recursion


Eq. (8.9) means that implementing MAP, FILTER, and REDUCE boils
down to implementing the RECURSE operator in the pure λ calculus.
This is what we do now.
How can we implement recursion without recursion? We will
illustrate this using a simple example - the XOR function. As shown in
Solved Exercise 8.4, we can write the XOR function of a list recursively
as follows:


         ⎧ 0                                      if 𝐿 is empty
XOR(𝐿) = ⎨
         ⎩ XOR2 (HEAD(𝐿), XOR(TAIL(𝐿)))           otherwise

where XOR2 ∶ {0, 1}2 → {0, 1} is the XOR on two bits. In Python we
would write this as

def xor2(a,b): return 1-b if a else b


def head(L): return L[0]
def tail(L): return L[1:]

def xor(L): return xor2(head(L),xor(tail(L))) if L else 0

print(xor([0,1,1,0,0,1]))
# 1

Now, how could we eliminate this recursive call? The main idea is
that since functions can take other functions as input, it is perfectly
legal in Python (and the λ calculus of course) to give a function itself

as input. So, our idea is to try to come up with a non-recursive function


tempxor that takes two inputs: a function and a list, and such that
tempxor(tempxor,L) will output the XOR of L!

P
At this point you might want to stop and try to im-
plement this on your own in Python or any other
programming language of your choice (as long as it
allows functions as inputs).

Our first attempt might be to simply use the idea of replacing the
recursive call by me. Let’s define this function as myxor

def myxor(me,L): return xor2(head(L),me(tail(L))) if L else 0

Let’s test this out:

myxor(myxor,[1,0,1])

If you do this, you will get the following complaint from the inter-
preter:
TypeError: myxor() missing 1 required positional argument
The problem is that myxor expects two inputs- a function and a
list- while in the call to me we only provided a list. To correct this, we
modify the call to also provide the function itself:

def tempxor(me,L): return xor2(head(L),me(me,tail(L))) if L else 0

Note the call me(me,..) in the definition of tempxor: given a func-


tion me as input, tempxor will actually call the function me with itself
as the first input. If we test this out now, we see that we actually get
the right result!

tempxor(tempxor,[1,0,1])
# 0
tempxor(tempxor,[1,0,1,1])
# 1

and so we can define xor(L) as simply return tempxor(tempxor,L).
The approach above is not specific to XOR. Given a recursive func-
tion f that takes an input x, we can obtain a non-recursive version as
follows:

1. Create the function myf that takes a pair of inputs me and x, and
replaces recursive calls to f with calls to me.

2. Create the function tempf that converts calls in myf of the form
me(x) to calls of the form me(me,x).

3. The function f(x) will be defined as tempf(tempf,x)

Here is the way we implement the RECURSE operator in Python. It


will take a function myf as above, and replace it with a function g such
that g(x)=myf(g,x) for every x.

def RECURSE(myf):
    def tempf(me,x): return myf(lambda y: me(me,y),x)
    return lambda x: tempf(tempf,x)

xor = RECURSE(myxor)

print(xor([0,1,1,0,0,1]))
# 1

print(xor([1,1,0,0,1,1,1,1]))
# 0
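
The same recipe works for other recursive functions, not just XOR. As
an illustration (this sketch is ours and not part of the book's appendix;
the helper names myreduce and RECURSE3 are hypothetical), here is how
the three-argument REDUCE function from the beginning of this section
can be obtained without recursion:

# Sketch (ours): applying the same idea to REDUCE.
# myreduce replaces the recursive call to "reduce" with a call to me.
def myreduce(me, L, f, z):
    return z if not L else f(L[0], me(L[1:], f, z))

# RECURSE3: a variant of RECURSE for functions taking three arguments.
def RECURSE3(myf):
    def tempf(me, L, f, z):
        return myf(lambda L2, f2, z2: me(me, L2, f2, z2), L, f, z)
    return lambda L, f, z: tempf(tempf, L, f, z)

reduce = RECURSE3(myreduce)
print(reduce([1, 2, 3, 4], lambda a, b: a + b, 0))
# 10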

From Python to the λ calculus. In the λ calculus, a two input function


𝑔 that takes a pair of inputs 𝑚𝑒, 𝑦 is written as 𝜆𝑚𝑒.(𝜆𝑦.𝑔). So the
function 𝑦 ↦ 𝑚𝑒(𝑚𝑒, 𝑦) is simply written as 𝑚𝑒 𝑚𝑒 and similarly
the function 𝑥 ↦ 𝑡𝑒𝑚𝑝𝑓(𝑡𝑒𝑚𝑝𝑓, 𝑥) is simply 𝑡𝑒𝑚𝑝𝑓 𝑡𝑒𝑚𝑝𝑓. (Can
you see why?) Therefore the function tempf defined above can be
written as λ me. myf(me me). This means that if we denote the input
of RECURSE by 𝑓, then RECURSE 𝑓 = 𝑡𝑒𝑚𝑝𝑓 𝑡𝑒𝑚𝑝𝑓 where 𝑡𝑒𝑚𝑝𝑓 =
𝜆𝑚.𝑓(𝑚 𝑚) or in other words

RECURSE = 𝜆𝑓.((𝜆𝑚.𝑓(𝑚 𝑚)) (𝜆𝑚.𝑓(𝑚 𝑚)))
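
To see why this expression behaves like a recursion operator, it is
instructive to carry out one step of the simplification (this short
derivation is added here for illustration). Applying RECURSE to an
expression 𝑓 and expanding the definition once gives

RECURSE 𝑓 = (𝜆𝑚.𝑓(𝑚 𝑚)) (𝜆𝑚.𝑓(𝑚 𝑚)) → 𝑓((𝜆𝑚.𝑓(𝑚 𝑚)) (𝜆𝑚.𝑓(𝑚 𝑚))) = 𝑓(RECURSE 𝑓) ,

so (under the “call by name” evaluation order) RECURSE 𝑓 simplifies to
𝑓 applied to RECURSE 𝑓, which is exactly the self-referential behavior
we needed from the recursion operator.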

The online appendix contains an implementation of the λ calcu-


lus using Python. Here is an implementation of the recursive XOR
function from that appendix:3
3
Because of specific issues of Python syntax, in this implementation we
use f * g for applying f to g rather than fg, and use λx(exp) rather than
λx.exp for abstraction. We also use _0 and _1 for the λ terms for 0 and 1
so as not to confuse with the Python constants.

# XOR of two bits
XOR2 = λ(a,b)(IF(a,IF(b,_0,_1),b))

# Recursive XOR with recursive calls replaced by m parameter
myXOR = λ(m,l)(IF(ISEMPTY(l),_0,XOR2(HEAD(l),m(TAIL(l)))))

# Recurse operator (aka Y combinator)


RECURSE = λf((λm(f(m*m)))(λm(f(m*m))))

# XOR function
XOR = RECURSE(myXOR)

#TESTING:

XOR(PAIR(_1,NIL)) # List [1]


# equals 1

XOR(PAIR(_1,PAIR(_0,PAIR(_1,NIL)))) # List [1,0,1]


# equals 0

R
Remark 8.20 — The Y combinator. The RECURSE opera-
tor above is better known as the Y combinator.
It is one of a family of fixed point operators that, given
a lambda expression 𝐹 , find a fixed point 𝑓 of 𝐹 such
that 𝑓 = 𝐹 𝑓. If you think about it, XOR is the fixed
point of 𝑚𝑦𝑋𝑂𝑅 above. XOR is the function such
that for every 𝑥, if we plug in XOR as the first argument
of 𝑚𝑦𝑋𝑂𝑅 then we get back XOR, or in other words
XOR = 𝑚𝑦𝑋𝑂𝑅 XOR. Hence finding a fixed point for
𝑚𝑦𝑋𝑂𝑅 is the same as applying RECURSE to it.

8.8 THE CHURCH-TURING THESIS (DISCUSSION)


“[In 1934], Church had been speculating, and finally definitely proposed, that
the λ-definable functions are all the effectively calculable functions …. When
Church proposed this thesis, I sat down to disprove it … but, quickly realizing
that [my approach failed], I became overnight a supporter of the thesis.”,
Stephen Kleene, 1979.

“[The thesis is] not so much a definition or to an axiom but … a natural law.”,
Emil Post, 1936.

We have defined functions to be computable if they can be computed


by a NAND-TM program, and we’ve seen that the definition would
remain the same if we replaced NAND-TM programs by Python pro-
grams, Turing machines, λ calculus, cellular automata, and many
other computational models. The Church-Turing thesis is that this is
the only sensible definition of “computable” functions. Unlike the
“Physical Extended Church-Turing Thesis” (PECTT) which we saw
before, the Church-Turing thesis does not make a concrete physical
prediction that can be experimentally tested, but it certainly motivates
predictions such as the PECTT. One can think of the Church-Turing
Thesis as either advocating a definitional choice, making some pre-
diction about all potential computing devices, or suggesting some

laws of nature that constrain the natural world. In Scott Aaronson’s


words, “whatever it is, the Church-Turing thesis can only be regarded
as extremely successful”. No candidate computing device (including
quantum computers, and also much less reasonable models such as
the hypothetical “closed time curve” computers we mentioned before)
has so far mounted a serious challenge to the Church-Turing thesis.
These devices might potentially make some computations more effi-
cient, but they do not change the difference between what is finitely
computable and what is not. (The extended Church-Turing thesis,
which we discuss in Section 13.3, stipulates that Turing machines cap-
ture also the limit of what can be efficiently computable. Just like its
physical version, quantum computing presents the main challenge to
this thesis.)

8.8.1 Different models of computation


We can summarize the models we have seen in the following table:

Table 8.1: Different models for computing finite functions and
functions with arbitrary input length.

Computational problems           Type of model                  Examples
------------------------------   ----------------------------   -----------------------------------
Finite functions                 Non-uniform computation        Boolean circuits, NAND circuits,
𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚            (algorithm depends on          straight-line programs
                                 input length)                  (e.g., NAND-CIRC)
Functions with unbounded         Sequential access to memory    Turing machines, NAND-TM programs
inputs 𝐹 ∶ {0, 1}∗ → {0, 1}∗
–                                Indexed access / RAM           RAM machines, NAND-RAM, modern
                                                                programming languages
–                                Other                          Lambda calculus, cellular automata

Later on in Chapter 17 we will study memory bounded computa-


tion. It turns out that NAND-TM programs with a constant amount
of memory are equivalent to the model of finite automata (the adjec-
tives “deterministic” or “non-deterministic” are sometimes added as
well; this model is also known as finite state machines), which in turn
captures the notion of regular languages (those that can be described by
regular expressions), which is a concept we will see in Chapter 10.

✓ Chapter Recap

• While we defined computable functions using


Turing machines, we could just as well have done
so using many other models, including not just
NAND-TM programs but also RAM machines,
NAND-RAM, the λ-calculus, cellular automata and
many other models.
• Very simple models turn out to be “Turing com-
plete” in the sense that they can simulate arbitrarily
complex computation.

8.9 EXERCISES
Exercise 8.1 — Alternative proof for TM/RAM equivalence. Let SEARCH ∶
{0, 1}∗ → {0, 1}∗ be the following function. The input is a pair
(𝐿, 𝑘) where 𝑘 ∈ {0, 1}∗ , 𝐿 is an encoding of a list of key value pairs
(𝑘0 , 𝑣0 ), … , (𝑘𝑚−1 , 𝑣𝑚−1 ) where 𝑘0 , … , 𝑘𝑚−1 , 𝑣0 , … , 𝑣𝑚−1 are binary
strings. The output is 𝑣𝑖 for the smallest 𝑖 such that 𝑘𝑖 = 𝑘, if such 𝑖
exists, and otherwise the empty string.

1. Prove that SEARCH is computable by a Turing machine.

2. Let UPDATE(𝐿, 𝑘, 𝑣) be the function whose input is a list 𝐿 of pairs,


and whose output is the list 𝐿′ obtained by prepending the pair
(𝑘, 𝑣) to the beginning of 𝐿. Prove that UPDATE is computable by a
Turing machine.

3. Suppose we encode the configuration of a NAND-RAM program


by a list 𝐿 of key/value pairs where the key is either the name of
a scalar variable foo or of the form Bar[<num>] for some num-
ber <num> and it contains all the non-zero values of variables. Let
NEXT(𝐿) be the function that maps a configuration of a NAND-
RAM program at one step to the configuration in the next step.
Prove that NEXT is computable by a Turing machine (you don’t
have to implement each one of the arithmetic operations: it is
enough to implement addition and multiplication).

4. Prove that for every 𝐹 ∶ {0, 1}∗ → {0, 1}∗ that is computable by a
NAND-RAM program, 𝐹 is computable by a Turing machine.

Exercise 8.2 — NAND-TM lookup. This exercise shows part of the proof that
NAND-TM can simulate NAND-RAM. Produce the code of a NAND-
TM program that computes the function LOOKUP ∶ {0, 1}∗ → {0, 1}
that is defined as follows. On input 𝑝𝑓(𝑖)𝑥, where 𝑝𝑓(𝑖) denotes a
prefix-free encoding of an integer 𝑖, LOOKUP(𝑝𝑓(𝑖)𝑥) = 𝑥𝑖 if 𝑖 < |𝑥|

and LOOKUP(𝑝𝑓(𝑖)𝑥) = 0 otherwise. (We don’t care what LOOKUP


outputs on inputs that are not of this form.) You can choose any
prefix-free encoding of your choice, and also can use your favorite
programming language to produce this code.

Exercise 8.3 — Pairing. Let 𝑒𝑚𝑏𝑒𝑑 ∶ ℕ² → ℕ be the function defined as


𝑒𝑚𝑏𝑒𝑑(𝑥0 , 𝑥1 ) = ½ (𝑥0 + 𝑥1 )(𝑥0 + 𝑥1 + 1) + 𝑥1 .

1. Prove that for every 𝑥0 , 𝑥1 ∈ ℕ, 𝑒𝑚𝑏𝑒𝑑(𝑥0 , 𝑥1 ) is indeed a natural


number.

2. Prove that 𝑒𝑚𝑏𝑒𝑑 is one-to-one

3. Construct a NAND-TM program 𝑃 such that for every 𝑥0 , 𝑥1 ∈ ℕ,


𝑃 (𝑝𝑓(𝑥0 )𝑝𝑓(𝑥1 )) = 𝑝𝑓(𝑒𝑚𝑏𝑒𝑑(𝑥0 , 𝑥1 )), where 𝑝𝑓 is the prefix-free
encoding map defined above. You can use the syntactic sugar for
inner loops, conditionals, and incrementing/decrementing the
counter.

4. Construct NAND-TM programs 𝑃0 , 𝑃1 such that for every 𝑥0 , 𝑥1 ∈


ℕ and 𝑖 ∈ {0, 1}, 𝑃𝑖 (𝑝𝑓(𝑒𝑚𝑏𝑒𝑑(𝑥0 , 𝑥1 ))) = 𝑝𝑓(𝑥𝑖 ). You can use the syn-
tactic sugar for inner loops, conditionals, and incrementing/decre-
menting the counter.

Exercise 8.4 — Shortest Path. Let SHORTPATH ∶ {0, 1}∗ → {0, 1}∗
be the function that on input a string encoding a triple (𝐺, 𝑢, 𝑣) out-
puts a string encoding ∞ if 𝑢 and 𝑣 are disconnected in 𝐺 or a string
encoding the length 𝑘 of the shortest path from 𝑢 to 𝑣. Prove that
SHORTPATH is computable by a Turing machine. See footnote for
hint.4
4
You don’t have to give a full description of a Turing machine: use our
“have the cake and eat it too” paradigm to show the existence of such a
machine by arguing from more powerful equivalent models.
Exercise 8.5 — Longest Path. Let LONGPATH ∶ {0, 1}∗ → {0, 1}∗ be
the function that on input a string encoding a triple (𝐺, 𝑢, 𝑣) outputs
a string encoding ∞ if 𝑢 and 𝑣 are disconnected in 𝐺 or a string en-
coding the length 𝑘 of the longest simple path from 𝑢 to 𝑣. Prove that
LONGPATH is computable by a Turing machine. See footnote for
hint.5
5
Same hint as Exercise 8.4 applies. Note that for showing that LONGPATH
is computable you don’t have to give an efficient algorithm.
Exercise 8.6 — Shortest path λ expression. Let SHORTPATH be as in
Exercise 8.4. Prove that there exists a 𝜆 expression that computes
SHORTPATH. You can use Exercise 8.4.

Exercise 8.7 — Next-step function is local. Prove Lemma 8.9 and use it to
complete the proof of Theorem 8.7.

Exercise 8.8 — λ calculus requires at most three variables. Prove that for ev-
ery λ-expression 𝑒 with no free variables there is an equivalent λ-
expression 𝑓 that only uses the variables 𝑥, 𝑦, and 𝑧.6
■
6
Hint: You can reduce the number of variables a function takes by
“pairing them up”. That is, define a λ expression PAIR such that for
every 𝑥, 𝑦 PAIR𝑥𝑦 is some function 𝑓 such that 𝑓0 = 𝑥 and 𝑓1 = 𝑦.
Then use PAIR to iteratively reduce the number of variables used.

Exercise 8.9 — Evaluation order example in λ calculus.

1. Let 𝑒 = 𝜆𝑥.7 ((𝜆𝑥.𝑥𝑥)(𝜆𝑥.𝑥𝑥)). Prove that the simplification process
of 𝑒 ends in a definite number if we use the “call by name” evaluation
order while it never ends if we use the “call by value” order.

2. (bonus, challenging) Let 𝑒 be any λ expression. Prove that if the
simplification process ends in a definite number if we use the “call
by value” order then it also ends in such a number if we use the
“call by name” order. See footnote for hint.7
7
Use structural induction on the expression 𝑒.

Exercise 8.10 — Zip function. Give an enhanced λ calculus expression to
compute the function 𝑧𝑖𝑝 that on input a pair of lists 𝐼 and 𝐿 of the
same length 𝑛, outputs a list of 𝑛 pairs 𝑀 such that the 𝑗-th element
of 𝑀 (which we denote by 𝑀𝑗 ) is the pair (𝐼𝑗 , 𝐿𝑗 ). Thus 𝑧𝑖𝑝 “zips
together” these two lists of elements into a single list of pairs.8
8
The name 𝑧𝑖𝑝 is a common name for this operation, for example in
Python. It should not be confused with the zip compression file format.

Exercise 8.11 — Next-step function without 𝑅𝐸𝐶𝑈𝑅𝑆𝐸. Let 𝑀 be a Turing
machine. Give an enhanced λ calculus expression to compute the
next-step function NEXT𝑀 of 𝑀 (as in the proof of Theorem 8.16)
without using RECURSE. See footnote for hint.9
9
Use MAP and REDUCE (and potentially FILTER). You might also find
the function 𝑧𝑖𝑝 of Exercise 8.10 useful.

Exercise 8.12 — λ calculus to NAND-TM compiler (challenging). Give a program
in the programming language of your choice that takes as input a λ
expression 𝑒 and outputs a NAND-TM program 𝑃 that computes the
same function as 𝑒. For partial credit you can use the GOTO and all
NAND-CIRC syntactic sugar in your output program. You can use
any encoding of λ expressions as binary string that is convenient for
you. See footnote for hint.10
10
Try to set up a procedure such that if array Left contains an encoding
of a λ expression 𝜆𝑥.𝑒 and array Right contains an encoding of another
λ expression 𝑒′ , then the array Result will contain 𝑒[𝑥 → 𝑒′ ].
Exercise 8.13 — At least two in 𝜆 calculus. Let 1 = 𝜆𝑥, 𝑦.𝑥 and 0 = 𝜆𝑥, 𝑦.𝑦 as
before. Define

ALT = 𝜆𝑎, 𝑏, 𝑐.(𝑎(𝑏1(𝑐10))(𝑏𝑐0))

Prove that ALT is a 𝜆 expression that computes the at least two func-
tion. That is, for every 𝑎, 𝑏, 𝑐 ∈ {0, 1} (as encoded above) ALT𝑎𝑏𝑐 = 1
if and only if at least two of {𝑎, 𝑏, 𝑐} are equal to 1.


Exercise 8.14 — Locality of next-step function. This question will help you
get a better sense of the notion of locality of the next step function of Tur-
ing machines. This locality plays an important role in results such as
the Turing completeness of 𝜆 calculus and one dimensional cellular
automata, as well as results such as Gödel’s Incompleteness Theorem
and the Cook-Levin theorem that we will see later in this course. De-
fine STRINGS to be the programming language that has the following
semantics:

• A STRINGS program 𝑄 has a single string variable str that is both


the input and the output of 𝑄. The program has no loops and no
other variables, but rather consists of a sequence of conditional
search and replace operations that modify str.

• The operations of a STRINGS program are:

– REPLACE(pattern1,pattern2) where pattern1 and pattern2


are fixed strings. This replaces the first occurrence of pattern1
in str with pattern2
– if search(pattern) { code } executes code if pattern is a
substring of str. The code code can itself include nested if’s.
(One can also add an else { ... } to execute if pattern is not
a substring of str).
– the returned value is str

• A STRINGS program 𝑄 computes a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ if


for every 𝑥 ∈ {0, 1}∗ , if we initialize str to 𝑥 and then execute the
sequence of instructions in 𝑄, then at the end of the execution str
equals 𝐹 (𝑥).

For example, the following is a STRINGS program that computes


the function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ such that for every 𝑥 ∈ {0, 1}∗ , if 𝑥
contains a substring of the form 𝑦 = 11𝑎𝑏11 where 𝑎, 𝑏 ∈ {0, 1}, then
𝐹 (𝑥) = 𝑥′ where 𝑥′ is obtained by replacing the first occurrence of 𝑦 in
𝑥 with 00.

if search('110011') {
replace('110011','00')
} else if search('110111') {
replace('110111','00')
} else if search('111011') {
replace('111011','00')
} else if search('111111') {
replace('111111','00')
}

Prove that for every Turing machine program 𝑀 , there exists a


STRINGS program 𝑄 that computes the NEXT𝑀 function that maps
every string encoding a valid configuration of 𝑀 to the string encoding
the configuration of the next step of 𝑀 ’s computation. (We don’t
care what the function will do on strings that do not encode a valid
configuration.) You don’t have to write the STRINGS program fully,
but you do need to give a convincing argument that such a program
exists.

8.10 BIBLIOGRAPHICAL NOTES


Chapter 7 in the wonderful book of Moore and Mertens [MM11]
contains a great exposition of much of this material.
The RAM model can be very useful in studying the concrete com-
plexity of practical algorithms. Its theoretical study was initiated in
[CR73]. However, the exact set of operations that are allowed in the
RAM model and their costs vary between texts and contexts. One
needs to be careful in making such definitions, especially if the word
size grows, as was already shown by Shamir [Sha79]. Chapter 3 in
Savage’s book [Sav98] contains a more formal description of RAM
machines, see also the paper [Hag98]. A study of RAM algorithms
that are independent of the input size (known as the “transdichoto-
mous RAM model”) was initiated by [FW93].
The models of computation we considered so far are inherently
sequential, but these days much computation happens in parallel,
whether using multi-core processors or in massively parallel dis-
tributed computation in data centers or over the Internet. Parallel
computing is important in practice, but it does not really make much
difference for the question of what can and can’t be computed. After
all, if a computation can be performed using 𝑚 machines in 𝑡 time,
then it can be computed by a single machine in time 𝑚𝑡.
The λ-calculus was described by Church in [Chu41]. Pierce’s book
[Pie02] is a canonical textbook, see also [Bar84]. The “Currying tech-
nique” is named after the logician Haskell Curry (the Haskell pro-
gramming language is named after Haskell Curry as well). Curry
himself attributed this concept to Moses Schönfinkel, though for some
reason the term “Schönfinkeling” never caught on.
Unlike most programming languages, the pure λ-calculus doesn’t
have the notion of types. Every object in the λ calculus can also be
thought of as a λ expression and hence as a function that takes one
input and returns one output. All functions take one input and re-
turn one output, and if you feed a function an input of a form it didn’t
expect, it still evaluates the λ expression via “search and replace”,
replacing all instances of its parameter with copies of the input expres-

sion you fed it. Typed variants of the λ calculus are objects of intense
research, and are strongly related to type systems for programming
language and computer-verifiable proof systems, see [Pie02]. Some of
the typed variants of the λ calculus do not have infinite loops, which
makes them very useful as ways of enabling static analysis of pro-
grams as well as computer-verifiable proofs. We will come back to this
point in Chapter 10 and Chapter 22.
Tao has proposed showing the Turing completeness of fluid dy-
namics (a “water computer”) as a way of settling the question of the
behavior of the Navier-Stokes equations, see this popular article.
Learning Objectives:
• The universal machine/program - “one program to rule them all”
• A fundamental result in computer science and mathematics: the
existence of uncomputable functions.
• The halting problem: the canonical example of an uncomputable
function.
• Introduction to the technique of reductions.
• Rice’s Theorem: A “meta tool” for uncomputability results, and a
starting point for much of the research on compilers, programming
languages, and software verification.

9
Universality and uncomputability

“A function of a variable quantity is an analytic expression composed in any


way whatsoever of the variable quantity and numbers or constant quantities.”,
Leonhard Euler, 1748.

“The importance of the universal machine is clear. We do not need to have an


infinity of different machines doing different jobs. … The engineering problem
of producing various machines for various jobs is replaced by the office work of
‘programming’ the universal machine”, Alan Turing, 1948

One of the most significant results we showed for Boolean circuits


(or equivalently, straight-line programs) is the notion of universality:
there is a single circuit that can evaluate all other circuits. However,
this result came with a significant caveat. To evaluate a circuit of 𝑠
gates, the universal circuit needed to use a number of gates larger
than 𝑠. It turns out that uniform models such as Turing machines or
NAND-TM programs allow us to “break out of this cycle” and obtain
a truly universal Turing machine 𝑈 that can evaluate all other machines,
including machines that are more complex (e.g., more states) than 𝑈
itself. (Similarly, there is a Universal NAND-TM program 𝑈 ′ that can
evaluate all NAND-TM programs, including programs that have more
lines than 𝑈 ′ .)
It is no exaggeration to say that the existence of such a universal
program/machine underlies the information technology revolution
that began in the latter half of the 20th century (and is still ongoing).
Up to that point in history, people had produced various special-
purpose calculating devices such as the abacus, the slide ruler, and
machines that compute various trigonometric series. But as Turing
(who was perhaps the one to see most clearly the ramifications of
universality) observed, a general purpose computer is much more pow-
erful. Once we build a device that can compute the single universal
function, we have the ability, via software, to extend it to do arbitrary
computations. For example, if we want to simulate a new Turing ma-
chine 𝑀 , we do not need to build a new physical machine, but rather




can represent 𝑀 as a string (i.e., using code) and then input 𝑀 to the
universal machine 𝑈 .
Beyond the practical applications, the existence of a universal algo-
rithm also has surprising theoretical ramifications, and in particular
can be used to show the existence of uncomputable functions, upend-
ing the intuitions of mathematicians over the centuries from Euler
to Hilbert. In this chapter we will prove the existence of the univer-
sal program, and also show its implications for uncomputability, see
Fig. 9.1.

This chapter: A non-mathy overview


In this chapter we will see two of the most important results
in Computer Science:

1. The existence of a universal Turing machine: a single algo-


rithm that can evaluate all other algorithms,

2. The existence of uncomputable functions: functions (includ-


ing the famous “Halting problem”) that cannot be computed
by any algorithm.

Along the way, we develop the technique of reductions as a


way to show hardness of computing a function. A reduction
gives a way to compute a certain function using “wishful
thinking” and assuming that another function can be com-
puted. Reductions are of course widely used in program-
ming - we often obtain an algorithm for one task by using
another task as a “black box” subroutine. However we will
use it in the “contrapositive”: rather than using a reduction
to show that the former task is “easy”, we use them to show
that the latter task is “hard”. Don’t worry if you find this
confusing - reductions are initially confusing - but they can be
mastered with time and practice.

9.1 UNIVERSALITY OR A META-CIRCULAR EVALUATOR


We start by proving the existence of a universal Turing machine. This is
a single Turing machine 𝑈 that can evaluate arbitrary Turing machines
𝑀 on arbitrary inputs 𝑥, including machines 𝑀 that can have more
states and larger alphabet than 𝑈 itself. In particular, 𝑈 can even be
used to evaluate itself! This notion of self reference will appear time and
again in this book, and as we will see, leads to several counter-intuitive
phenomena in computing.

Figure 9.1: In this chapter we will show the existence


of a universal Turing machine and then use this to de-
rive first the existence of some uncomputable function.
We then use this to derive the uncomputability of
Turing’s famous “halting problem” (i.e., the HALT
function), from which a host of other uncomputabil-
ity results follow. We also introduce reductions, which
allow us to use the uncomputability of a function 𝐹 to
derive the uncomputability of a new function 𝐺.

Theorem 9.1 — Universal Turing Machine.There exists a Turing machine


𝑈 such that on every string 𝑀 which represents a Turing machine,
and 𝑥 ∈ {0, 1}∗ , 𝑈 (𝑀 , 𝑥) = 𝑀 (𝑥).
That is, if the machine 𝑀 halts on 𝑥 and outputs some 𝑦 ∈
{0, 1}∗ then 𝑈 (𝑀 , 𝑥) = 𝑦, and if 𝑀 does not halt on 𝑥 (i.e.,
𝑀 (𝑥) = ⊥) then 𝑈 (𝑀 , 𝑥) = ⊥.

 Big Idea 11 There is a “universal” algorithm that can evaluate


arbitrary algorithms on arbitrary inputs.

Proof Idea:
Once you understand what the theorem says, it is not that hard to
prove. The desired program 𝑈 is an interpreter for Turing machines.
That is, 𝑈 gets a representation of the machine 𝑀 (think of it as source
code), and some input 𝑥, and needs to simulate the execution of 𝑀 on
𝑥.

Figure 9.2: A Universal Turing Machine is a single Turing Machine 𝑈
that can evaluate, given input the (description as a string of) arbitrary
Turing machine 𝑀 and input 𝑥, the output of 𝑀 on 𝑥. In contrast to
the universal circuit depicted in Fig. 5.6, the machine 𝑀 can be much
more complex (e.g., more states or tape alphabet symbols) than 𝑈.

Think of how you would code 𝑈 in your favorite programming
language. First, you would need to decide on some representation

scheme for 𝑀 . For example, you can use an array or a dictionary


to encode 𝑀 ’s transition function. Then you would use some data
structure, such as a list, to store the contents of 𝑀 ’s tape. Now you can
simulate 𝑀 step by step, updating the data structure as you go along.
The interpreter will continue the simulation until the machine halts.
Once you do that, translating this interpreter from your favorite
programming language to a Turing machine can be done just as we
have seen in Chapter 8. The end result is what’s known as a “meta-

circular evaluator”: an interpreter for a programming language in the


same one. This is a concept that has a long history in computer science
starting from the original universal Turing machine. See also Fig. 9.3.

9.1.1 Proving the existence of a universal Turing Machine


To prove (and even properly state) Theorem 9.1, we need to fix some
representation for Turing machines as strings. One potential choice
for such a representation is to use the equivalence between Turing
machines and NAND-TM programs and hence represent a Turing
machine 𝑀 using the ASCII encoding of the source code of the corre-
sponding NAND-TM program 𝑃 . However, we will use a more direct
encoding.

Definition 9.2 — String representation of Turing Machine. Let 𝑀 be a Turing
machine with 𝑘 states and a size ℓ alphabet Σ = {𝜎0 , … , 𝜎ℓ−1 } (we
use the convention 𝜎0 = 0, 𝜎1 = 1, 𝜎2 = ∅, 𝜎3 = ▷). We represent
𝑀 as the triple (𝑘, ℓ, 𝑇 ) where 𝑇 is the table of values for 𝛿𝑀 :

𝑇 = (𝛿𝑀 (0, 𝜎0 ), 𝛿𝑀 (0, 𝜎1 ), … , 𝛿𝑀 (𝑘 − 1, 𝜎ℓ−1 )) ,


where each value 𝛿𝑀 (𝑠, 𝜎) is a triple (𝑠′ , 𝜎′ , 𝑑) with 𝑠′ ∈ [𝑘],
𝜎′ ∈ Σ and 𝑑 a number in {0, 1, 2, 3} encoding one of {L, R, S, H}. Thus

such a machine 𝑀 is encoded by a list of 2 + 3𝑘 ⋅ ℓ natural num-


bers. The string representation of 𝑀 is obtained by concatenating a
prefix free representation of all these integers. If a string 𝛼 ∈ {0, 1}∗
does not represent a list of integers in the form above, then we treat
it as representing the trivial Turing machine with one state that
immediately halts on every input.
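
To make the encoding concrete, the following Python snippet (ours, not
part of the text) produces the list of 2 + 3𝑘 ⋅ ℓ numbers of Definition 9.2;
obtaining the string representation would then amount to concatenating
prefix-free encodings of these integers. The name represent and the
dictionary format for the transition function are our own choices.

# Sketch (ours): the list-of-numbers representation of Definition 9.2.
# delta maps (state, symbol index) to (new state, new symbol index, d),
# where d in {0,1,2,3} encodes one of {L, R, S, H}.
def represent(k, l, delta):
    nums = [k, l]
    for s in range(k):
        for sigma in range(l):
            sprime, sigmaprime, d = delta[(s, sigma)]
            nums.extend([sprime, sigmaprime, d])
    return nums   # a list of 2 + 3*k*l natural numbers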

R
Remark 9.3 — Take away points of representation. The
details of the representation scheme of Turing ma-
chines as strings are immaterial for almost all applica-
tions. What you need to remember are the following
points:

1. We can represent every Turing machine as a string.


2. Given the string representation of a Turing ma-
chine 𝑀 and an input 𝑥, we can simulate 𝑀 ’s
execution on the input 𝑥. (This is the content of
Theorem 9.1.)

An additional minor issue is that for convenience we


make the assumption that every string represents some
Turing machine. This is very easy to ensure by just
mapping strings that would otherwise not represent a

Turing machine into some fixed trivial machine. This


assumption is not very important, but does make a
few results (such as Rice’s Theorem: Theorem 9.15) a
little less cumbersome to state.

Using this representation, we can formally prove Theorem 9.1.

Proof of Theorem 9.1. We will only sketch the proof, giving the major
ideas. First, we observe that we can easily write a Python program
that, on input a representation (𝑘, ℓ, 𝑇 ) of a Turing machine 𝑀 and
an input 𝑥, evaluates 𝑀 on 𝑥. Here is the code of this program for
concreteness, though you can feel free to skip it if you are not familiar
with (or interested in) Python:

# constants
def EVAL(δ,x):
    '''Evaluate TM given by transition table δ
    on input x'''
    Tape = ["▷"] + [a for a in x]
    i = 0; s = 0   # i = head pos, s = state
    while True:
        s, Tape[i], d = δ[(s,Tape[i])]
        if d == "H": break
        if d == "L": i = max(i-1,0)
        if d == "R": i += 1
        if i >= len(Tape): Tape.append('Φ')

    j = 1; Y = []  # produce output
    while Tape[j] != 'Φ':
        Y.append(Tape[j])
        j += 1
    return Y

On input a transition table 𝛿 this program will simulate the cor-


responding machine 𝑀 step by step, at each point maintaining the
invariant that the array Tape contains the contents of 𝑀 ’s tape, and
the variable s contains 𝑀 ’s current state.
The above does not prove the theorem as stated, since we need
to show a Turing machine that computes EVAL rather than a Python
program. With enough effort, we can translate this Python code
line by line to a Turing machine. However, to prove the theorem we
don’t need to do this, but can use our “eat the cake and have it too”
paradigm. That is, while we need to evaluate a Turing machine, in
writing the code for the interpreter we are allowed to use a richer
model such as NAND-RAM since it is equivalent in power to Turing
machines per Theorem 8.1.

Translating the above Python code to NAND-RAM is truly straight-


forward. The only issue is that NAND-RAM doesn’t have the dictio-
nary data structure built in, which we have used above to store the
transition function δ. However, we can represent a dictionary 𝐷 of
the form {𝑘𝑒𝑦0 ∶ 𝑣𝑎𝑙0 , … , 𝑘𝑒𝑦𝑚−1 ∶ 𝑣𝑎𝑙𝑚−1 } as simply a list of pairs.
To compute 𝐷[𝑘] we can scan over all the pairs until we find one of
the form (𝑘, 𝑣) in which case we return 𝑣. Similarly we scan the list
to update the dictionary with a new value, either modifying it or ap-
pending the pair (𝑘𝑒𝑦, 𝑣𝑎𝑙) at the end.
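
For concreteness, here is a small sketch (ours, written in Python rather
than NAND-RAM) of this list-of-pairs implementation of a dictionary;
the names lookup and update are our own:

# Sketch (ours): a dictionary represented as a list of (key, value) pairs.
def lookup(L, k):
    for (key, val) in L:        # scan until we find the key
        if key == k:
            return val
    return None                 # key not present

def update(L, k, v):
    for i in range(len(L)):     # modify the pair if the key already exists...
        if L[i][0] == k:
            L[i] = (k, v)
            return L
    L.append((k, v))            # ...otherwise append it at the end
    return L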

R
Remark 9.4 — Efficiency of the simulation. The argu-
ment in the proof of Theorem 9.1 is a very inefficient
way to implement the dictionary data structure in
practice, but it suffices for the purpose of proving the
theorem. Reading and writing to a dictionary of 𝑚
values in this implementation takes Ω(𝑚) steps, but
it is in fact possible to do this in 𝑂(log 𝑚) steps using
a search tree data structure or even 𝑂(1) (for “typical”
instances) using a hash table. NAND-RAM and RAM
machines correspond to the architecture of modern
electronic computers, and so we can implement hash
tables and search trees in NAND-RAM just as they are
implemented in other programming languages.

The construction above yields a universal Turing machine with a


very large number of states. However, since universal Turing machines
have such a philosophical and technical importance, researchers have
attempted to find the smallest possible universal Turing machines, see
Section 9.7.

9.1.2 Implications of universality (discussion)


There is more than one Turing machine 𝑈 that satisfies the condi-
tions of Theorem 9.1, but the existence of even a single such machine
is already extremely fundamental to both the theory and practice of
computer science. Theorem 9.1’s impact reaches beyond the particu-
lar model of Turing machines. Because we can simulate every Turing
machine by a NAND-TM program and vice versa, Theorem 9.1 im-
mediately implies there exists a universal NAND-TM program 𝑃𝑈
such that 𝑃𝑈 (𝑃 , 𝑥) = 𝑃 (𝑥) for every NAND-TM program 𝑃 . We can
also “mix and match” models. For example since we can simulate
every NAND-RAM program by a Turing machine, and every Turing
machine by the 𝜆 calculus, Theorem 9.1 implies that there exists a 𝜆
expression 𝑒 such that for every NAND-RAM program 𝑃 and input 𝑥
on which 𝑃 (𝑥) = 𝑦, if we encode (𝑃 , 𝑥) as a 𝜆-expression 𝑓 (using the

Figure 9.3: a) A particularly elegant example of a


“meta-circular evaluator” comes from John Mc-
Carthy’s 1960 paper, where he defined the Lisp
programming language and gave a Lisp function that
evaluates an arbitrary Lisp program (see above). Lisp
was not initially intended as a practical program-
ming language and this example was merely meant
as an illustration that the Lisp universal function is
more elegant than the universal Turing machine. It
was McCarthy’s graduate student Steve Russell who
suggested that it can be implemented. As McCarthy
later recalled, “I said to him, ho, ho, you’re confusing
theory with practice, this eval is intended for reading, not
for computing. But he went ahead and did it. That is, he
compiled the eval in my paper into IBM 704 machine code,
fixing a bug, and then advertised this as a Lisp interpreter,
which it certainly was”. b) A self-replicating C program
from the classic essay of Thompson [Tho84].

𝜆-calculus encoding of strings as lists of 0’s and 1’s) then (𝑒 𝑓) eval-


uates to an encoding of 𝑦. More generally we can say that for every
𝒳 and 𝒴 in the set { Turing machines, RAM Machines, NAND-TM,
NAND-RAM, 𝜆-calculus, JavaScript, Python, … } of Turing equivalent
models, there exists a program/machine in 𝒳 that computes the map
(𝑃 , 𝑥) ↦ 𝑃 (𝑥) for every program/machine 𝑃 ∈ 𝒴.
The idea of a “universal program” is of course not limited to theory.
For example compilers for programming languages are often used to
compile themselves, as well as programs more complicated than the
compiler. (An extreme example of this is Fabrice Bellard’s Obfuscated
Tiny C Compiler which is a C program of 2048 bytes that can compile
a large subset of the C programming language, and in particular can
compile itself.) This is also related to the fact that it is possible to write
a program that can print its own source code, see Fig. 9.3. There are
universal Turing machines known that require a very small number
of states or alphabet symbols, and in particular there is a universal
Turing machine (with respect to a particular choice of representing
Turing machines as strings) whose tape alphabet is {▷, ∅, 0, 1} and
has fewer than 25 states (see Section 9.7).

9.2 IS EVERY FUNCTION COMPUTABLE?


In Theorem 4.12, we saw that NAND-CIRC programs can compute
every finite function 𝑓 ∶ {0, 1}𝑛 → {0, 1}. Therefore a natural guess is
that NAND-TM programs (or equivalently, Turing machines) could
compute every infinite function 𝐹 ∶ {0, 1}∗ → {0, 1}. However, this
turns out to be false. That is, there exists a function 𝐹 ∶ {0, 1}∗ → {0, 1}
that is uncomputable!

The existence of uncomputable functions is quite surprising. Our


intuitive notion of a “function” (and the notion most mathematicians
had until the 20th century) is that a function 𝑓 defines some implicit
or explicit way of computing the output 𝑓(𝑥) from the input 𝑥. The
notion of an “uncomputable function” thus seems to be a contradic-
tion in terms, but yet the following theorem shows that such creatures
do exist:

Theorem 9.5 — Uncomputable functions. There exists a function 𝐹 ∗ ∶
{0, 1}∗ → {0, 1} that is not computable by any Turing machine.

Proof Idea:
The idea behind the proof follows quite closely Cantor’s proof that
the reals are uncountable (Theorem 2.5), and in fact the theorem can
also be obtained fairly directly from that result (see Exercise 7.11).
However, it is instructive to see the direct proof. The idea is to con-
struct 𝐹 ∗ in a way that will ensure that every possible machine 𝑀 will
in fact fail to compute 𝐹 ∗ . We do so by defining 𝐹 ∗ (𝑥) to equal 0 if 𝑥
describes a Turing machine 𝑀 which satisfies 𝑀 (𝑥) = 1 and defining
𝐹 ∗ (𝑥) = 1 otherwise. By construction, if 𝑀 is any Turing machine and
𝑥 is the string describing it, then 𝐹 ∗ (𝑥) ≠ 𝑀 (𝑥) and therefore 𝑀 does
not compute 𝐹 ∗ .

Proof of Theorem 9.5. The proof is illustrated in Fig. 9.4. We start by


defining the following function 𝐺 ∶ {0, 1}∗ → {0, 1}:
For every string 𝑥 ∈ {0, 1}∗ , if 𝑥 satisfies (1) 𝑥 is a valid repre-
sentation of some Turing machine 𝑀 (per the representation scheme
above) and (2) when the program 𝑀 is executed on the input 𝑥 it
halts and produces an output, then we define 𝐺(𝑥) as the first bit of
this output. Otherwise (i.e., if 𝑥 is not a valid representation of a Tur-
ing machine, or the machine 𝑀𝑥 never halts on 𝑥) we define 𝐺(𝑥) = 0.
We define 𝐹 ∗ (𝑥) = 1 − 𝐺(𝑥).
We claim that there is no Turing machine that computes 𝐹 ∗ . In-
deed, suppose, towards the sake of contradiction, there exists a ma-
chine 𝑀 that computes 𝐹 ∗ , and let 𝑥 be the binary string that rep-
resents the machine 𝑀 . On one hand, since by our assumption 𝑀
computes 𝐹 ∗ , on input 𝑥 the machine 𝑀 halts and outputs 𝐹 ∗ (𝑥). On
the other hand, by the definition of 𝐹 ∗ , since 𝑥 is the representation
of the machine 𝑀 , 𝐹 ∗ (𝑥) = 1 − 𝐺(𝑥) = 1 − 𝑀 (𝑥), hence yielding a
contradiction.


Figure 9.4: We construct an uncomputable function


by defining for every two strings 𝑥, 𝑦 the value
1 − 𝑀𝑦 (𝑥) which equals 0 if the machine described
by 𝑦 outputs 1 on 𝑥, and 1 otherwise. We then define
𝐹 ∗ (𝑥) to be the “diagonal” of this table, namely
𝐹 ∗ (𝑥) = 1 − 𝑀𝑥 (𝑥) for every 𝑥. The function 𝐹 ∗
is uncomputable, because if it was computable by
some machine whose string description is 𝑥∗ then we
would get that 𝑀𝑥∗ (𝑥∗ ) = 𝐹 ∗ (𝑥∗ ) = 1 − 𝑀𝑥∗ (𝑥∗ ).

 Big Idea 12 There are some functions that can not be computed by
any algorithm.

P
The proof of Theorem 9.5 is short but subtle. I suggest
that you pause here and go back to read it again and
think about it - this is a proof that is worth reading at
least twice if not three or four times. It is not often the
case that a few lines of mathematical reasoning estab-
lish a deeply profound fact - that there are problems
we simply cannot solve.

The type of argument used to prove Theorem 9.5 is known as di-


agonalization since it can be described as defining a function based
on the diagonal entries of a table as in Fig. 9.4. The proof can be
thought of as an infinite version of the counting argument we used
for showing lower bound for NAND-CIRC programs in Theorem 5.3.
Namely, we show that it’s not possible to compute all functions from
{0, 1}∗ → {0, 1} by Turing machines simply because there are more
functions like that than there are Turing machines.
As mentioned in Remark 7.4, many texts use the “language” ter-
minology and so will call a set 𝐿 ⊆ {0, 1}∗ an undecidable or non-
recursive language if the function 𝐹 ∶ {0, 1}∗ → {0, 1} such that
𝐹 (𝑥) = 1 ↔ 𝑥 ∈ 𝐿 is uncomputable.

9.3 THE HALTING PROBLEM


Theorem 9.5 shows that there is some function that cannot be com-
puted. But is this function the equivalent of the “tree that falls in the
forest with no one hearing it”? That is, perhaps it is a function that
no one actually wants to compute. It turns out that there are natural
uncomputable functions:

Theorem 9.6 — Uncomputability of Halting function. Let HALT ∶ {0, 1}∗ →
{0, 1} be the function such that for every pair of strings 𝑀 , 𝑥 ∈ {0, 1}∗ ,
HALT(𝑀 , 𝑥) = 1 if Turing machine 𝑀 halts on the input 𝑥 and
HALT(𝑀 , 𝑥) = 0 otherwise. Then HALT is not computable.

Before turning to prove Theorem 9.6, we note that HALT is a very


natural function to want to compute. For example, one can think of
HALT as a special case of the task of managing an “App store”. That
is, given the code of some application, the gatekeeper for the store
needs to decide if this code is safe enough to allow in the store or not.
At a minimum, it seems that we should verify that the code would not
go into an infinite loop.

Proof Idea:
One way to think about this proof is as follows:

Uncomputability of 𝐹 ∗ + Universality = Uncomputability of HALT

That is, we will use the universal Turing machine that computes EVAL
to derive the uncomputability of HALT from the uncomputability of
𝐹 ∗ shown in Theorem 9.5. Specifically, the proof will be by contra-
diction. That is, we will assume towards a contradiction that HALT is
computable, and use that assumption, together with the universal Tur-
ing machine of Theorem 9.1, to derive that 𝐹 ∗ is computable, which
will contradict Theorem 9.5.

 Big Idea 13 If a function 𝐹 is uncomputable we can show that


another function 𝐻 is uncomputable by giving a way to reduce the task
of computing 𝐹 to computing 𝐻.

Proof of Theorem 9.6. The proof will use the previously established
result Theorem 9.5. Recall that Theorem 9.5 shows that the following
function 𝐹 ∗ ∶ {0, 1}∗ → {0, 1} is uncomputable:


           ⎧ 0   𝑥(𝑥) = 1
𝐹 ∗ (𝑥) =  ⎨
           ⎩ 1   otherwise

where 𝑥(𝑥) denotes the output of the Turing machine described by the
string 𝑥 on the input 𝑥 (with the usual convention that 𝑥(𝑥) = ⊥ if this
computation does not halt).
We will show that the uncomputability of 𝐹 ∗ implies the uncom-
putability of HALT. Specifically, we will assume, towards a contra-
diction, that there exists a Turing machine 𝑀 that can compute the
HALT function, and use that to obtain a Turing machine 𝑀 ′ that com-
putes the function 𝐹 ∗ . (This is known as a proof by reduction, since we
reduce the task of computing 𝐹 ∗ to the task of computing HALT. By
the contrapositive, this means the uncomputability of 𝐹 ∗ implies the
uncomputability of HALT.)
Indeed, suppose that 𝑀 is a Turing machine that computes HALT.
Algorithm 9.7 describes a Turing machine 𝑀 ′ that computes 𝐹 ∗ . (We
use “high level” description of Turing machines, appealing to the
“have your cake and eat it too” paradigm, see Big Idea 10.)

Algorithm 9.7 — 𝐹 ∗ to 𝐻𝐴𝐿𝑇 reduction.

Input: 𝑥 ∈ {0, 1}∗


Output: 𝐹 ∗ (𝑥)
1: # Assume T.M. 𝑀𝐻𝐴𝐿𝑇 computes 𝐻𝐴𝐿𝑇
2: Let 𝑧 ← 𝑀𝐻𝐴𝐿𝑇 (𝑥, 𝑥). # Assume 𝑧 = 𝐻𝐴𝐿𝑇 (𝑥, 𝑥).
3: if 𝑧 = 0 then
4: return 1
5: end if
6: Let 𝑦 ← 𝑈 (𝑥, 𝑥) # 𝑈 universal TM, i.e., 𝑦 = 𝑥(𝑥)
7: if 𝑦 = 1 then
8: return 0
9: end if
10: return 1

We claim that Algorithm 9.7 computes the function 𝐹 ∗ . In-


deed, suppose that 𝑥(𝑥) = 1 (and hence 𝐹 ∗ (𝑥) = 0). In this
case, HALT(𝑥, 𝑥) = 1 and hence, under our assumption that
𝑀 (𝑥, 𝑥) = HALT(𝑥, 𝑥), the value 𝑧 will equal 1, and hence Al-
gorithm 9.7 will set 𝑦 = 𝑥(𝑥) = 1, and output the correct value
0.
Suppose otherwise that 𝑥(𝑥) ≠ 1 (and hence 𝐹 ∗ (𝑥) = 1). In this
case there are two possibilities:

• Case 1: The machine described by 𝑥 does not halt on the input 𝑥


(and hence 𝐹 ∗ (𝑥) = 1). In this case, HALT(𝑥, 𝑥) = 0. Since we
assume that 𝑀 computes HALT it means that on input 𝑥, 𝑥, the
machine 𝑀 must halt and output the value 0. This means that
Algorithm 9.7 will set 𝑧 = 0 and output 1.

• Case 2: The machine described by 𝑥 halts on the input 𝑥 and out-


puts some 𝑦′ ≠ 1 (and hence 𝐹 ∗ (𝑥) = 1). In this case, since
HALT(𝑥, 𝑥) = 1, under our assumptions, Algorithm 9.7 will set
𝑦 = 𝑦′ ≠ 1 and so output 1.

We see that in all cases, 𝑀 ′ (𝑥) = 𝐹 ∗ (𝑥), which contradicts the


fact that 𝐹 ∗ is uncomputable. Hence we reach a contradiction to our
original assumption that 𝑀 computes HALT.
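
For readers who prefer code, the reduction of Algorithm 9.7 can also be
phrased as the following Python-style sketch (ours), where HALT stands
for the hypothetical algorithm computing HALT and EVAL for the
universal machine of Theorem 9.1:

# Sketch (ours) of Algorithm 9.7: computing F* under the (contradictory)
# assumption that some algorithm HALT(M, x) exists.
def Fstar(x):
    if HALT(x, x) == 0:     # the machine described by x does not halt on x
        return 1
    y = EVAL(x, x)          # safe to evaluate: we know the computation halts
    return 0 if y == 1 else 1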

P
Once again, this is a proof that’s worth reading more
than once. The uncomputability of the halting prob-
lem is one of the fundamental theorems of computer
science, and is the starting point for much of the in-
vestigations we will see later. An excellent way to get
a better understanding of Theorem 9.6 is to go over
Section 9.3.2, which presents an alternative proof of
the same result.

9.3.1 Is the Halting problem really hard? (discussion)


Many people’s first instinct when they see the proof of Theorem 9.6
is to not believe it. That is, most people do believe the mathematical
statement, but intuitively it doesn’t seem that the Halting problem is
really that hard. After all, being uncomputable only means that HALT
cannot be computed by a Turing machine.
But programmers seem to solve HALT all the time by informally or
formally arguing that their programs halt. It’s true that their programs
are written in C or Python, as opposed to Turing machines, but that
makes no difference: we can easily translate back and forth between
this model and any other programming language.
While every programmer encounters at some point an infinite loop,
is there really no way to solve the halting problem? Some people
argue that they personally can, if they think hard enough, determine
whether any concrete program that they are given will halt or not.
Some have even argued that humans in general have the ability to
do that, and hence humans have inherently superior intelligence to
computers or anything else modeled by Turing machines.1
1
This argument has also been connected to the issues of consciousness
and free will. I am personally skeptical of its relevance to these issues.
Perhaps the reasoning is that humans have the ability to solve the halting
problem but they exercise their free will and consciousness by choosing
not to do so.
The best answer we have so far is that there truly is no way to solve
HALT, whether using Macs, PCs, quantum computers, humans, or
any other combination of electronic, mechanical, and biological de-
vices. Indeed this assertion is the content of the Church-Turing Thesis.
This of course does not mean that for every possible program 𝑃 , it
is hard to decide if 𝑃 enters an infinite loop. Some programs don’t
even have loops at all (and hence trivially halt), and there are many

other far less trivial examples of programs that we can certify to never
enter an infinite loop (or programs that we know for sure that will
enter such a loop). However, there is no general procedure that would
determine for an arbitrary program 𝑃 whether it halts or not. More-
over, there are some very simple programs for which no one knows
whether they halt or not. For example, the following Python program
will halt if and only if Goldbach’s conjecture is false:

def isprime(p):
    return all(p % i for i in range(2,p-1))

def Goldbach(n):
    return any( (isprime(p) and isprime(n-p))
                for p in range(2,n-1))

n = 4
while True:
    if not Goldbach(n): break
    n += 2

Given that Goldbach’s Conjecture has been open since 1742, it is


unclear that humans have any magical ability to say whether this (or
other similar programs) will halt or not.

9.3.2 A direct proof of the uncomputability of HALT (optional)


It turns out that we can combine the ideas of the proofs of Theo-
rem 9.5 and Theorem 9.6 to obtain a short proof of the latter theorem,
that does not appeal to the uncomputability of 𝐹 ∗ . This short proof
appeared in print in a 1965 letter to the editor of Christopher Strachey:

Figure 9.5: SMBC’s take on solving the Halting problem.

To the Editor, The Computer Journal.
An Impossible Program
Sir,

A well-known piece of folk-lore among programmers holds that it is


impossible to write a program which can examine any other program
and tell, in every case, if it will terminate or get into a closed loop when
it is run. I have never actually seen a proof of this in print, and though
Alan Turing once gave me a verbal proof (in a railway carriage on the
way to a Conference at the NPL in 1953), I unfortunately and promptly
forgot the details. This left me with an uneasy feeling that the proof
must be long or complicated, but in fact it is so short and simple that it
may be of interest to casual readers. The version below uses CPL, but
not in any essential way.
Suppose T[R] is a Boolean function taking a routine (or program) R
with no formal or free variables as its arguments and that for all R,
T[R] = True if R terminates if run and that T[R] = False if R does not
terminate.

Consider the routine P defined as follows


rec routine P
§L: if T[P] go to L
Return §

If T[P] = True the routine P will loop, and it will only terminate if
T[P] = False. In each case T[P] has exactly the wrong value, and this
contradiction shows that the function T cannot exist.
Yours faithfully,
C. Strachey
Churchill College, Cambridge

P
Try to stop and extract the argument for proving
Theorem 9.6 from the letter above.

Since CPL is not as common today, let us reproduce this proof. The
idea is the following: suppose for the sake of contradiction that there
exists a program T such that T(f,x) equals True iff f halts on input
x. (Strachey’s letter considers the no-input variant of HALT, but as
we’ll see, this is an immaterial distinction.) Then we can construct a
program P and an input x such that T(P,x) gives the wrong answer.
The idea is that on input x, the program P will do the following: run
T(x,x), and if the answer is True then go into an infinite loop, and
otherwise halt. Now you can see that T(P,P) will give the wrong
answer: if P halts when it gets its own code as input, then T(P,P) is
supposed to be True, but then P(P) will go into an infinite loop. And
if P does not halt, then T(P,P) is supposed to be False but then P(P)
will halt. We can also code this up in Python:

def CantSolveMe(T):
    """
    Gets function T that claims to solve HALT.
    Returns a pair (P,x) of code and input on which
    T(P,x) ≠ HALT(x)
    """
    def fool(x):
        if T(x,x):
            while True: pass
        return "I halted"

    return (fool,fool)

For example, consider the following naive Python program T that
guesses that a given function does not halt if its source code contains
while or for:

def T(f,x):
    """Crude halting tester - decides it doesn't halt if it
    contains a loop."""
    import inspect
    source = inspect.getsource(f)
    if source.find("while"): return False
    if source.find("for"): return False
    return True

If we now set (f,x) = CantSolveMe(T), then T(f,x)=False but


f(x) does in fact halt. This is of course not specific to this particular T:
for every program T, if we run (f,x) = CantSolveMe(T) then we’ll
get an input on which T gives the wrong answer to HALT.

9.4 REDUCTIONS
The Halting problem turns out to be a linchpin of uncomputability, in
the sense that Theorem 9.6 has been used to show the uncomputabil-
ity of a great many interesting functions. We will see several examples
of such results in this chapter and the exercises, but there are many
more such results (see Fig. 9.6).

Figure 9.6: Some uncomputability results. An arrow


from problem X to problem Y means that we use the
uncomputability of X to prove the uncomputability
of Y by reducing computing X to computing Y.
All of these results except for the MRDP Theorem
appear in either the text or exercises. The Halting
Problem HALT serves as our starting point for all
these uncomputability results as well as many others.

The idea behind such uncomputability results is conceptually sim-


ple but can at first be quite confusing. If we know that HALT is un-
computable, and we want to show that some other function BLAH is
uncomputable, then we can do so via a contrapositive argument (i.e.,
proof by contradiction). That is, we show that if there exists a Turing
machine that computes BLAH then there exists a Turing machine that
computes HALT. (Indeed, this is exactly how we showed that HALT
itself is uncomputable, by deriving this fact from the uncomputability
of the function 𝐹 ∗ of Theorem 9.5.)

For example, to prove that BLAH is uncomputable, we could show


that there is a computable function 𝑅 ∶ {0, 1}∗ → {0, 1}∗ such that for
every pair 𝑀 and 𝑥, HALT(𝑀 , 𝑥) = BLAH(𝑅(𝑀 , 𝑥)). The existence of
such a function 𝑅 implies that if BLAH was computable then HALT
would be computable as well, hence leading to a contradiction! The
confusing part about reductions is that we are assuming something
we believe is false (that BLAH has an algorithm) to derive something
that we know is false (that HALT has an algorithm). Michael Sipser
describes such results as having the form “If pigs could whistle then
horses could fly”.
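
Schematically (this snippet is ours, and R and BLAH are hypothetical),
a reduction-based proof amounts to observing that the following
composition would compute HALT if BLAH were computable, and so
BLAH cannot be computable:

# Sketch (ours): if BLAH were computable then so would HALT be, since
# HALT(M, x) = BLAH(R(M, x)) for the computable reduction R.
def HALT_via_BLAH(M, x):
    w = R(M, x)         # R: the (computable) reduction
    return BLAH(w)      # hypothetical algorithm for BLAH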
A reduction-based proof has two components. For starters, since
we need 𝑅 to be computable, we should describe the algorithm to
compute it. The algorithm to compute 𝑅 is known as a reduction since
the transformation 𝑅 modifies an input to HALT to an input to BLAH,
and hence reduces the task of computing HALT to the task of comput-
ing BLAH. The second component of a reduction-based proof is the
analysis of the algorithm 𝑅: namely a proof that 𝑅 does indeed satisfy
the desired properties.
Reduction-based proofs are just like other proofs by contradiction,
but the fact that they involve hypothetical algorithms that don’t really
exist tends to make reductions quite confusing. The one silver lining
is that at the end of the day the notion of reductions is mathematically
quite simple, and so it’s not that bad even if you have to go back to
first principles every time you need to remember what is the direction
that a reduction should go in.

R
Remark 9.8 — Reductions are algorithms. A reduction
is an algorithm, which means that, as discussed in
Remark 0.3, a reduction has three components:

• Specification (what): In the case of a reduction


from HALT to BLAH, the specification is that func-
tion 𝑅 ∶ {0, 1}∗ → {0, 1}∗ should satisfy that
HALT(𝑀 , 𝑥) = BLAH(𝑅(𝑀 , 𝑥)) for every Tur-
ing machine 𝑀 and input 𝑥. In general, to reduce
a function 𝐹 to 𝐺, the reduction should satisfy
𝐹 (𝑤) = 𝐺(𝑅(𝑤)) for every input 𝑤 to 𝐹 .
• Implementation (how): The algorithm’s descrip-
tion: the precise instructions how to transform an
input 𝑤 to the output 𝑦 = 𝑅(𝑤).
• Analysis (why): A proof that the algorithm meets
the specification. In particular, in a reduction from
𝐹 to 𝐺 this is a proof that for every input 𝑤, the
output 𝑦 = 𝑅(𝑤) of the algorithm satisfies that
𝐹 (𝑤) = 𝐺(𝑦).

9.4.1 Example: Halting on the zero problem


Here is a concrete example for a proof by reduction. We define the
function HALTONZERO ∶ {0, 1}∗ → {0, 1} as follows. Given any
string 𝑀 , HALTONZERO(𝑀 ) = 1 if and only if 𝑀 describes a Turing
machine that halts when it is given the string 0 as input. A priori
HALTONZERO seems like a potentially easier function to compute
than the full-fledged HALT function, and so we could perhaps hope
that it is not uncomputable. Alas, the following theorem shows that
this is not the case:

Theorem 9.9 — Halting without input. HALTONZERO is uncomputable.

P
The proof of Theorem 9.9 is below, but before reading
it you might want to pause for a couple of minutes
and think how you would prove it yourself. In partic-
ular, try to think of what a reduction from HALT to
HALTONZERO would look like. Doing so is an excel-
lent way to get some initial comfort with the notion
of proofs by reduction, which is a technique we will be
using time and again in this book. You can also see
Fig. 9.8 and the following Colab notebook for a Python
implementation of this reduction.

Figure 9.7: To prove Theorem 9.9, we show that


HALTONZERO is uncomputable by giving a reduction
from the task of computing HALT to the task of com-
puting HALTONZERO. This shows that if there was a
hypothetical algorithm 𝐴 computing HALTONZERO,
then there would be an algorithm 𝐵 computing
HALT, contradicting Theorem 9.6. Since neither 𝐴 nor
𝐵 actually exists, this is an example of an implication
of the form “if pigs could whistle then horses could
fly”.

Proof of Theorem 9.9. The proof is by reduction from HALT, see


Fig. 9.7. We will assume, towards the sake of contradiction, that
HALTONZERO is computable by some algorithm 𝐴, and use this
hypothetical algorithm 𝐴 to construct an algorithm 𝐵 to compute

HALT, hence obtaining a contradiction to Theorem 9.6. (As discussed


in Big Idea 10, following our “have your cake and eat it too” paradigm,
we just use the generic name “algorithm” rather than worrying
whether we model them as Turing machines, NAND-TM programs,
NAND-RAM, etc.; this makes no difference since all these models are
equivalent to one another.)
Since this is our first proof by reduction from the Halting prob-
lem, we will spell it out in more details than usual. Such a proof by
reduction consists of two steps:

1. Description of the reduction: We will describe the operation of our


algorithm 𝐵, and how it makes “function calls” to the hypothetical
algorithm 𝐴.

2. Analysis of the reduction: We will then prove that under the hypoth-
esis that Algorithm 𝐴 computes HALTONZERO, Algorithm 𝐵 will
compute HALT.

Algorithm 9.10 — HALT to HALTONZERO reduction.

Input: Turing machine 𝑀 and string 𝑥.
Output: Turing machine 𝑀 ′ such that 𝑀 halts on 𝑥 iff 𝑀 ′ halts on zero
1: procedure 𝑁𝑀,𝑥 (𝑤)    # Description of the T.M. 𝑁𝑀,𝑥
2:   return EVAL(𝑀 , 𝑥)   # Ignore the input 𝑤, evaluate 𝑀 on 𝑥.
3: end procedure
4: return 𝑁𝑀,𝑥            # We do not execute 𝑁𝑀,𝑥 : only return its description

Our Algorithm 𝐵 works as follows: on input 𝑀 , 𝑥, it runs Algo-


rithm 9.10 to obtain a Turing machine 𝑀 ′ , and then returns 𝐴(𝑀 ′ ).
The machine 𝑀 ′ ignores its input 𝑧 and simply runs 𝑀 on 𝑥.
In pseudocode, the program 𝑁𝑀,𝑥 will look something like the
following:

def N(z):
M = r'.......'
# a string constant containing desc. of M
x = r'.......'
# a string constant containing x
return eval(M,x)
# note that we ignore the input z

That is, if we think of 𝑁𝑀,𝑥 as a program, then it is a program that


contains 𝑀 and 𝑥 as “hardwired constants”, and given any input 𝑧, it

simply ignores the input and always returns the result of evaluating
𝑀 on 𝑥. The algorithm 𝐵 does not actually execute the machine 𝑁𝑀,𝑥 .
𝐵 merely writes down the description of 𝑁𝑀,𝑥 as a string (just as we
did above) and feeds this string as input to 𝐴.
The above completes the description of the reduction. The analysis is
obtained by proving the following claim:
Claim: For all strings 𝑀 , 𝑥, 𝑧, the machine 𝑁𝑀,𝑥 constructed by
Algorithm 𝐵 in Step 1 satisfies that 𝑁𝑀,𝑥 halts on 𝑧 if and only if the
program described by 𝑀 halts on the input 𝑥.
Proof of Claim: Since 𝑁𝑀,𝑥 ignores its input and evaluates 𝑀 on 𝑥
using the universal Turing machine, it will halt on 𝑧 if and only if 𝑀
halts on 𝑥.
In particular if we instantiate this claim with the input 𝑧 = 0 to
𝑁𝑀,𝑥 , we see that HALTONZERO(𝑁𝑀,𝑥 ) = HALT(𝑀 , 𝑥). Thus if
the hypothetical algorithm 𝐴 satisfies 𝐴(𝑀 ) = HALTONZERO(𝑀 )
for every 𝑀 then the algorithm 𝐵 we construct satisfies 𝐵(𝑀 , 𝑥) =
HALT(𝑀 , 𝑥) for every 𝑀 , 𝑥, contradicting the uncomputability of
HALT.
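
To make the reduction concrete, here is a minimal Python sketch in the spirit of the Colab notebook mentioned above. The names eval_tm (a universal evaluator) and halt_on_zero (the hypothetical algorithm 𝐴) are placeholders assumed only for illustration; in particular halt_on_zero cannot actually exist.

# Sketch of the reduction, assuming a hypothetical function
# halt_on_zero(src) that computes HALTONZERO (no such algorithm exists).

def make_N(M: str, x: str) -> str:
    """Return the source code of the machine N_{M,x}: a program that
    ignores its input z and evaluates M on the hardwired input x."""
    return f'''
def N(z):
    M = {M!r}   # description of M, hardwired as a string constant
    x = {x!r}   # the input x, hardwired as a string constant
    return eval_tm(M, x)   # ignore z, run the (assumed) universal evaluator
'''

def B(M: str, x: str) -> int:
    """Would compute HALT(M, x) if halt_on_zero computed HALTONZERO."""
    N_src = make_N(M, x)       # only write down N's code; never run it
    return halt_on_zero(N_src)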

Figure 9.8: A Python implementation of the reduction


showing that HALTONZERO is uncomputable if
HALT is. See this Colab notebook for a full implemen-
tation of the reduction.

R
Remark 9.11 — The hardwiring technique. In the proof of
Theorem 9.9 we used the technique of “hardwiring”
an input 𝑥 to a program/machine 𝑃 . That is, we take
a program that computes the function 𝑥 ↦ 𝑓(𝑥) and
“fix” or “hardwire” some of the inputs to some con-
stant value. For example, if you have a program that
takes as input a pair of numbers 𝑥, 𝑦 and outputs their
product (i.e., computes the function 𝑓(𝑥, 𝑦) = 𝑥 × 𝑦),

then you can “hardwire” the second input to be 17


and obtain a program that takes as input a number
𝑥 and outputs 𝑥 × 17 (i.e., computes the function
𝑔(𝑥) = 𝑥 × 17). This technique is quite common in
reductions and elsewhere, and we will use it time and
again in this book.
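
As a concrete (and admittedly toy) illustration not taken from the text above, hardwiring an input corresponds to what programmers call partial application. A minimal Python sketch:

from functools import partial

def product(x, y):
    return x * y

# "Hardwire" the second input to the constant 17:
def times_seventeen(x):
    return product(x, 17)

# Equivalently, using the standard library helper:
times_seventeen_v2 = partial(product, y=17)

assert times_seventeen(3) == times_seventeen_v2(3) == 51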

9.5 RICE’S THEOREM AND THE IMPOSSIBILITY OF GENERAL


SOFTWARE VERIFICATION
The uncomputability of the Halting problem turns out to be a special
case of a much more general phenomenon. Namely, that we cannot
certify semantic properties of general purpose programs. “Semantic prop-
erties” mean properties of the function that the program computes, as
opposed to properties that depend on the particular syntax used by
the program.
An example for a semantic property of a program 𝑃 is the property
that whenever 𝑃 is given an input string with an even number of 1’s,
it outputs 0. Another example is the property that 𝑃 will always halt
whenever the input ends with a 1. In contrast, the property that a C
program contains a comment before every function declaration is not
a semantic property, since it depends on the actual source code as
opposed to the input/output relation.
Checking semantic properties of programs is of great interest, as it
corresponds to checking whether a program conforms to a specifica-
tion. Alas it turns out that such properties are in general uncomputable.
We have already seen some examples of uncomputable semantic func-
tions, namely HALT and HALTONZERO, but these are just the “tip of
the iceberg”. We start by observing one more such example:

Theorem 9.12 — Computing all zero function. Let ZEROFUNC ∶ {0, 1}∗ →
{0, 1} be the function such that for every 𝑀 ∈ {0, 1}∗ , ZEROFUNC(𝑀 ) =
1 if and only if 𝑀 represents a Turing machine such that 𝑀 outputs
0 on every input 𝑥 ∈ {0, 1}∗ . Then ZEROFUNC is uncomputable.

P
Despite the similarity in their names, ZEROFUNC and
HALTONZERO are two different functions. For exam-
ple, if 𝑀 is a Turing machine that on input 𝑥 ∈ {0, 1}∗ ,
halts and outputs the OR of all of 𝑥’s coordinates, then
HALTONZERO(𝑀 ) = 1 (since 𝑀 does halt on the
input 0) but ZEROFUNC(𝑀 ) = 0 (since 𝑀 does not
compute the constant zero function).

Proof of Theorem 9.12. The proof is by reduction from HALTONZERO.


Suppose, towards the sake of contradiction, that there was an algo-
rithm 𝐴 such that 𝐴(𝑀 ) = ZEROFUNC(𝑀 ) for every 𝑀 ∈ {0, 1}∗ .
Then we will construct an algorithm 𝐵 that solves HALTONZERO,
contradicting Theorem 9.9.
Given a Turing machine 𝑁 (which is the input to HALTONZERO),
our Algorithm 𝐵 does the following:

1. Construct a Turing machine 𝑀 which on input 𝑥 ∈ {0, 1}∗ , first


runs 𝑁 (0) and then outputs 0.

2. Return 𝐴(𝑀 ).

Now if 𝑁 halts on the input 0 then the Turing machine 𝑀 com-


putes the constant zero function, and hence under our assumption
that 𝐴 computes ZEROFUNC, 𝐴(𝑀 ) = 1. If 𝑁 does not halt on the
input 0, then the Turing machine 𝑀 will not halt on any input, and
so in particular will not compute the constant zero function. Hence
under our assumption that 𝐴 computes ZEROFUNC, 𝐴(𝑀 ) = 0.
We see that in both cases, ZEROFUNC(𝑀 ) = HALTONZERO(𝑁 )
and hence the value that Algorithm 𝐵 returns in step 2 is equal to
HALTONZERO(𝑁 ) which is what we needed to prove.
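
As with the previous reduction, the construction can be sketched in a few lines of Python. The names zerofunc (the hypothetical algorithm 𝐴) and eval_tm (a universal evaluator) are assumptions made for this sketch only; zerofunc cannot actually exist.

# Sketch of the HALTONZERO-to-ZEROFUNC reduction.

def make_M(N: str) -> str:
    """Source of a machine M that, on any input x, first runs N on 0
    and then outputs 0."""
    return f'''
def M(x):
    N = {N!r}          # description of N, hardwired as a constant
    eval_tm(N, "0")    # step 1: run N on the input 0 (may loop forever)
    return "0"         # step 2: output 0 regardless of x
'''

def B(N: str) -> int:
    """Would compute HALTONZERO(N) if zerofunc computed ZEROFUNC."""
    return zerofunc(make_M(N))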

Another result along similar lines is the following:

Theorem 9.13 — Uncomputability of verifying parity. The following function is uncomputable:

COMPUTES-PARITY(𝑃 ) = 1 if 𝑃 computes the parity function, and COMPUTES-PARITY(𝑃 ) = 0 otherwise.

P
We leave the proof of Theorem 9.13 as an exercise
(Exercise 9.6). I strongly encourage you to stop here
and try to solve this exercise.

9.5.1 Rice’s Theorem


Theorem 9.13 can be generalized far beyond the parity function. In
fact, this generalization rules out verifying any type of semantic spec-
ification on programs. We define a semantic specification on programs
to be some property that does not depend on the code of the program
but just on the function that the program computes.
For example, consider the following two C programs

int First(int n) {
if (n<0) return 0;
return 2*n;
}

int Second(int n) {
int i = 0;
int j = 0;
if (n<0) return 0;
while (j<n) {
i = i + 2;
j = j + 1;
}
return i;
}

First and Second are two distinct C programs, but they compute
the same function. A semantic property would be either true for both
programs or false for both programs, since it depends on the function
the programs compute and not on their code. An example for a se-
mantic property that both First and Second satisfy is the following:
“The program 𝑃 computes a function 𝑓 mapping integers to integers satisfy-
ing that 𝑓(𝑛) ≥ 𝑛 for every input 𝑛”.
A property is not semantic if it depends on the source code rather
than the input/output behavior. For example, properties such as “the
program contains the variable k” or “the program uses the while op-
eration” are not semantic. Such properties can be true for one of the
programs and false for others. Formally, we define semantic proper-
ties as follows:

Definition 9.14 — Semantic properties. A pair of Turing machines
𝑀 and 𝑀 ′ are functionally equivalent if for every 𝑥 ∈ {0, 1}∗ ,
𝑀 (𝑥) = 𝑀 ′ (𝑥). (In particular, 𝑀 (𝑥) = ⊥ iff 𝑀 ′ (𝑥) = ⊥ for all
𝑥.)
A function 𝐹 ∶ {0, 1}∗ → {0, 1} is semantic if for every pair of
strings 𝑀 , 𝑀 ′ that represent functionally equivalent Turing ma-
chines, 𝐹 (𝑀 ) = 𝐹 (𝑀 ′ ). (Recall that we assume that every string
represents some Turing machine, see Remark 9.3)

There are two trivial examples of semantic functions: the constant


one function and the constant zero function. For example, if 𝑍 is the
constant zero function (i.e., 𝑍(𝑀 ) = 0 for every 𝑀 ) then clearly
𝑍(𝑀 ) = 𝑍(𝑀 ′ ) for every pair of Turing machines 𝑀 and 𝑀 ′ that are
functionally equivalent. Here is a non-trivial example:

Solved Exercise 9.1 — ZEROFUNC is semantic. Prove that the function


ZEROFUNC is semantic.

Solution:
Recall that ZEROFUNC(𝑀 ) = 1 if and only if 𝑀 (𝑥) = 0 for
every 𝑥 ∈ {0, 1}∗ . If 𝑀 and 𝑀 ′ are functionally equivalent, then for
every 𝑥, 𝑀 (𝑥) = 𝑀 ′ (𝑥). Hence ZEROFUNC(𝑀 ) = 1 if and only if
ZEROFUNC(𝑀 ′ ) = 1.

Often the properties of programs that we are most interested in


computing are the semantic ones, since we want to understand the
programs’ functionality. Unfortunately, Rice’s Theorem tells us that
these properties are all uncomputable:

Theorem 9.15 — Rice’s Theorem. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. If 𝐹 is seman-
tic and non-trivial then it is uncomputable.

Proof Idea:
The idea behind the proof is to show that every semantic non-
trivial function 𝐹 is at least as hard to compute as HALTONZERO.
This will conclude the proof since by Theorem 9.9, HALTONZERO
is uncomputable. If a function 𝐹 is non-trivial then there are two
machines 𝑀0 and 𝑀1 such that 𝐹 (𝑀0 ) = 0 and 𝐹 (𝑀1 ) = 1. So,
the goal would be to take a machine 𝑁 and find a way to map it into
a machine 𝑀 = 𝑅(𝑁 ), such that (i) if 𝑁 halts on zero then 𝑀 is
functionally equivalent to 𝑀1 and (ii) if 𝑁 does not halt on zero then
𝑀 is functionally equivalent to 𝑀0 .
Because 𝐹 is semantic, if we achieved this, then we would be guar-
anteed that HALTONZERO(𝑁 ) = 𝐹 (𝑅(𝑁 )), and hence would show
that if 𝐹 was computable, then HALTONZERO would be computable
as well, contradicting Theorem 9.9.

Proof of Theorem 9.15. We will not give the proof in full formality, but
rather illustrate the proof idea by restricting our attention to a particu-
lar semantic function 𝐹 . However, the same techniques generalize to
all possible semantic functions. Define MONOTONE ∶ {0, 1}∗ → {0, 1}
as follows: MONOTONE(𝑀 ) = 1 if there does not exist 𝑛 ∈ ℕ and
two inputs 𝑥, 𝑥′ ∈ {0, 1}𝑛 such that for every 𝑖 ∈ [𝑛] 𝑥𝑖 ≤ 𝑥′𝑖 but 𝑀 (𝑥)
outputs 1 and 𝑀 (𝑥′ ) = 0. That is, MONOTONE(𝑀 ) = 1 if it’s not
possible to find an input 𝑥 such that flipping some bits of 𝑥 from 0 to
1 will change 𝑀 ’s output in the other direction from 1 to 0. We will

prove that MONOTONE is uncomputable, but the proof will easily


generalize to any semantic function.
We start by noting that MONOTONE is neither the constant zero
nor the constant one function:

• The machine INF that simply goes into an infinite loop on every
input satisfies MONOTONE(INF) = 1, since INF is not defined
anywhere and so in particular there are no two inputs 𝑥, 𝑥′ where
𝑥𝑖 ≤ 𝑥′𝑖 for every 𝑖 but INF(𝑥) = 1 and INF(𝑥′ ) = 0.

• The machine PAR that computes the XOR or parity of its input is
not monotone (e.g., PAR(1, 0, 0, … , 0) = 1 but PAR(1, 1, 0, 0, … , 0) =
0) and hence MONOTONE(PAR) = 0.

(Note that INF and PAR are machines and not functions.)
We will now give a reduction from HALTONZERO to
MONOTONE. That is, we assume towards a contradiction that
there exists an algorithm 𝐴 that computes MONOTONE and we will
build an algorithm 𝐵 that computes HALTONZERO. Our algorithm 𝐵
will work as follows:

Algorithm 𝐵:
Input: String 𝑁 describing a Turing machine. (Goal: Compute
HALTONZERO(𝑁 ))
Assumption: Access to Algorithm 𝐴 to compute MONOTONE.
Operation:

1. Construct the following machine 𝑀 : “On input 𝑧 ∈ {0, 1}∗ do: (a)
Run 𝑁 (0), (b) Return PAR(𝑧)”.
2. Return 1 − 𝐴(𝑀 ).

To complete the proof we need to show that 𝐵 outputs the cor-


rect answer, under our assumption that 𝐴 computes MONOTONE.
In other words, we need to show that HALTONZERO(𝑁 ) = 1 −
MONOTONE(𝑀 ). Suppose that 𝑁 does not halt on zero. In this
case the program 𝑀 constructed by Algorithm 𝐵 enters into an in-
finite loop in step (a) and will never reach step (b). Hence in this
case 𝑀 is functionally equivalent to INF. (The machine 𝑀 is not
the same machine as INF: its description or code is different. But it
does have the same input/output behavior (in this case) of never
halting on any input. Also, while the program 𝑀 will go into an in-
finite loop on every input, Algorithm 𝐵 never actually runs 𝑀 : it
only produces its code and feeds it to 𝐴. Hence Algorithm 𝐵 will
not enter into an infinite loop even in this case.) Thus in this case,
MONOTONE(𝑀 ) = MONOTONE(INF) = 1.

If 𝑁 does halt on zero, then step (a) in 𝑀 will eventually conclude


and 𝑀 ’s output will be determined by step (b), where it simply out-
puts the parity of its input. Hence in this case, 𝑀 computes the non-
monotone parity function (i.e., is functionally equivalent to PAR), and
so we get that MONOTONE(𝑀 ) = MONOTONE(PAR) = 0. In both
cases, MONOTONE(𝑀 ) = 1 − HALTONZERO(𝑁 ), which is what
we wanted to prove.
An examination of this proof shows that we did not use anything
about MONOTONE beyond the fact that it is semantic and non-trivial.
For every semantic non-trivial 𝐹 , we can use the same proof, replacing
PAR and INF with two machines 𝑀0 and 𝑀1 such that 𝐹 (𝑀0 ) = 0 and
𝐹 (𝑀1 ) = 1. Such machines must exist if 𝐹 is non-trivial.
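
Here is a Python sketch of Algorithm 𝐵 from this proof, with the hypothetical monotone (the assumed algorithm 𝐴) and eval_tm (a universal evaluator) as placeholder names; as in the earlier reductions, 𝐵 only writes down the code of the constructed machine and never runs it.

# Sketch of the HALTONZERO-to-MONOTONE reduction (monotone cannot exist).

def make_M(N: str) -> str:
    """Source of a machine that runs N on 0 and then outputs the
    parity of its own input z."""
    return f'''
def M(z):
    N = {N!r}                       # description of N, hardwired
    eval_tm(N, "0")                 # (a) run N on 0; loops forever if N does
    return str(z.count("1") % 2)    # (b) output PAR(z)
'''

def B(N: str) -> int:
    """Would compute HALTONZERO(N) given monotone."""
    return 1 - monotone(make_M(N))  # note the "1 -" from the proof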

R
Remark 9.16 — Semantic is not the same as uncom-
putable. Rice’s Theorem is so powerful and such a
popular way of proving uncomputability that peo-
ple sometimes get confused and think that it is the
only way to prove uncomputability. In particular, a
common misconception is that if a function 𝐹 is not
semantic then it is computable. This is not at all the
case.
For example, consider the following function
HALTNOYALE ∶ {0, 1}∗ → {0, 1}. This is a function
that on input a string that represents a NAND-TM
program 𝑃 , outputs 1 if and only if both (i) 𝑃 halts
on the input 0, and (ii) the program 𝑃 does not con-
tain a variable with the identifier Yale. The function
HALTNOYALE is clearly not semantic, as it will out-
put two different values when given as input one of
the following two functionally equivalent programs:

Yale[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Yale[0])

and

Harvard[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Harvard[0])

However, HALTNOYALE is uncomputable since every


program 𝑃 can be transformed into an equivalent
(and in fact improved :)) program 𝑃 ′ that does not
contain the variable Yale. Hence if we could compute
HALTNOYALE then we could determine halting on zero for
NAND-TM programs (and hence for Turing machines
as well).
Moreover, as we will see in Chapter 11, there are un-
computable functions whose inputs are not programs,
and hence for which the adjective “semantic” is not
applicable.

Properties such as “the program contains the variable


Yale” are sometimes known as syntactic properties.
The terms “semantic” and “syntactic” are used be-
yond the realm of programming languages: a famous
example of a syntactically correct but semantically
meaningless sentence in English is Chomsky’s “Color-
less green ideas sleep furiously.” However, formally
defining “syntactic properties” is rather subtle and we
will not use this terminology in this book, sticking to
the terms “semantic” and “non-semantic” only.

9.5.2 Halting and Rice’s Theorem for other Turing-complete models


As we saw before, many natural computational models turn out to be
equivalent to one another, in the sense that we can transform a “pro-
gram” of one model (such as a 𝜆 expression, or a game-of-life config-
urations) into another model (such as a NAND-TM program). This
equivalence implies that we can translate the uncomputability of the
Halting problem for NAND-TM programs into uncomputability for
Halting in other models. For example:

Theorem 9.17 — NAND-TM Machine Halting. Let NANDTMHALT ∶
{0, 1}∗ → {0, 1} be the function that on input strings 𝑃 ∈
{0, 1}∗ and 𝑥 ∈ {0, 1}∗ outputs 1 if the NAND-TM program de-
scribed by 𝑃 halts on the input 𝑥 and outputs 0 otherwise. Then
NANDTMHALT is uncomputable.

P
Once again, this is a good point for you to stop and try
to prove the result yourself before reading the proof
below.

Proof. We have seen in Theorem 7.11 that for every Turing machine
𝑀 , there is an equivalent NAND-TM program 𝑃𝑀 such that for ev-
ery 𝑥, 𝑃𝑀 (𝑥) = 𝑀 (𝑥). In particular this means that HALT(𝑀 ) =
NANDTMHALT(𝑃𝑀 ).
The transformation 𝑀 ↦ 𝑃𝑀 that is obtained from the proof
of Theorem 7.11 is constructive. That is, the proof yields a way to
compute the map 𝑀 ↦ 𝑃𝑀 . This means that this proof yields a
reduction from the task of computing HALT to the task of computing
NANDTMHALT, which means that since HALT is uncomputable,
neither is NANDTMHALT.


The same proof carries over to other computational models such as


the 𝜆 calculus, two dimensional (or even one-dimensional) automata etc.
Hence for example, there is no algorithm to decide if a 𝜆 expression
evaluates the identity function, and no algorithm to decide whether
an initial configuration of the game of life will result in eventually
coloring the cell (0, 0) black or not.
Indeed, we can generalize Rice’s Theorem to all these models. For
example, if 𝐹 ∶ {0, 1}∗ → {0, 1} is a non-trivial function such that
𝐹 (𝑃 ) = 𝐹 (𝑃 ′ ) for every functionally equivalent NAND-TM programs
𝑃 , 𝑃 ′ then 𝐹 is uncomputable, and the same holds for NAND-RAM
programs, 𝜆-expressions, and all other Turing complete models (as
defined in Definition 8.5), see also Exercise 9.12.

9.5.3 Is software verification doomed? (discussion)


Programs are increasingly being used for mission critical purposes,
whether it’s running our banking system, flying planes, or monitoring
nuclear reactors. If we can’t even give a certification algorithm that
a program correctly computes the parity function, how can we ever
be assured that a program does what it is supposed to do? The key
insight is that while it is impossible to certify that a general program
conforms with a specification, it is possible to write a program in
the first place in a way that will make it easier to certify. As a trivial
example, if you write a program without loops, then you can certify
that it halts. Also, while it might not be possible to certify that an
arbitrary program computes the parity function, it is quite possible to
write a particular program 𝑃 for which we can mathematically prove
that 𝑃 computes the parity. In fact, writing programs or algorithms
and providing proofs for their correctness is what we do all the time in
algorithms research.
The field of software verification is concerned with verifying that
given programs satisfy certain conditions. These conditions can be
that the program computes a certain function, that it never writes
into a dangerous memory location, that is respects certain invari-
ants, and others. While the general tasks of verifying this may be
uncomputable, researchers have managed to do so for many inter-
esting cases, especially if the program is written in the first place in
a formalism or programming language that makes verification eas-
ier. That said, verification, especially of large and complex programs,
remains a highly challenging task in practice as well, and the num-
ber of programs that have been formally proven correct is still quite
small. Moreover, even phrasing the right theorem to prove (i.e., the
specification) is often a highly non-trivial endeavor.

Figure 9.9: The set R of computable Boolean functions


(Definition 7.3) is a proper subset of the set of all
functions mapping {0, 1}∗ to {0, 1}. In this chapter
we saw a few examples of elements in the latter set
that are not in the former.

✓ Chapter Recap

• There is a universal Turing machine (or NAND-TM


program) 𝑈 such that on input a description of a
Turing machine 𝑀 and some input 𝑥, 𝑈 (𝑀 , 𝑥) halts
and outputs 𝑀 (𝑥) if (and only if) 𝑀 halts on input
𝑥. Unlike in the case of finite computation (i.e.,
NAND-CIRC programs / circuits), the input to
the program 𝑈 can be a machine 𝑀 that has more
states than 𝑈 itself.
• Unlike the finite case, there are actually functions
that are inherently uncomputable in the sense that
they cannot be computed by any Turing machine.
• These include not only some “degenerate” or “eso-
teric” functions but also functions that people have
deeply cared about and conjectured that could be
computed.
• If the Church-Turing thesis holds then a function
𝐹 that is uncomputable according to our definition
cannot be computed by any means in our physical
world.

9.6 EXERCISES
Exercise 9.1 — NAND-RAM Halt. Let NANDRAMHALT ∶ {0, 1}∗ → {0, 1}
be the function such that on input (𝑃 , 𝑥) where 𝑃 represents a NAND-
RAM program, NANDRAMHALT(𝑃 , 𝑥) = 1 iff 𝑃 halts on the input 𝑥.
Prove that NANDRAMHALT is uncomputable.

Exercise 9.2 — Timed halting. Let TIMEDHALT ∶ {0, 1}∗ → {0, 1} be
the function that on input (a string representing) a triple (𝑀 , 𝑥, 𝑇 ),

TIMEDHALT(𝑀 , 𝑥, 𝑇 ) = 1 iff the Turing machine 𝑀 , on input 𝑥,


halts within at most 𝑇 steps (where a step is defined as one sequence
of reading a symbol from the tape, updating the state, writing a new
symbol and (potentially) moving the head).
Prove that TIMEDHALT is computable.

Exercise 9.3 — Space halting (challenging). Let SPACEHALT ∶ {0, 1}∗ →
{0, 1} be the function that on input (a string representing) a triple
(𝑀 , 𝑥, 𝑇 ), SPACEHALT(𝑀 , 𝑥, 𝑇 ) = 1 iff the Turing machine 𝑀 , on
input 𝑥, halts before its head reached the 𝑇 -th location of its tape. (We
don’t care how many steps 𝑀 makes, as long as the head stays inside
locations {0, … , 𝑇 − 1}.)
Prove that SPACEHALT is computable. See footnote for hint.²
■
² Hint: A machine with alphabet Σ can have at most |Σ|^𝑇 choices for the contents of the first 𝑇 locations of its tape. What happens if the machine repeats a previously seen configuration, in the sense that the tape contents, the head location, and the current state are all identical to what they were in some previous state of the execution?

Exercise 9.4 — Computable compositions. Suppose that 𝐹 ∶ {0, 1}∗ → {0, 1}
and 𝐺 ∶ {0, 1}∗ → {0, 1} are computable functions. For each one of the
following functions 𝐻, either prove that 𝐻 is necessarily computable or
give an example of a pair 𝐹 and 𝐺 of computable functions such that
𝐻 will not be computable. Prove your assertions.
give an example of a pair 𝐹 and 𝐺 of computable functions such that
𝐻 will not be computable. Prove your assertions.

1. 𝐻(𝑥) = 1 iff 𝐹 (𝑥) = 1 OR 𝐺(𝑥) = 1.

2. 𝐻(𝑥) = 1 iff there exist two non-empty strings 𝑢, 𝑣 ∈ {0, 1}∗ such
that 𝑥 = 𝑢𝑣 (i.e., 𝑥 is the concatenation of 𝑢 and 𝑣), 𝐹 (𝑢) = 1 and
𝐺(𝑣) = 1.

3. 𝐻(𝑥) = 1 iff there exists a list 𝑢0 , … , 𝑢𝑡−1 of non-empty strings such
that 𝐹 (𝑢𝑖 ) = 1 for every 𝑖 ∈ [𝑡] and 𝑥 = 𝑢0 𝑢1 ⋯ 𝑢𝑡−1 .

4. 𝐻(𝑥) = 1 iff 𝑥 is a valid string representation of a NAND++


program 𝑃 such that for every 𝑧 ∈ {0, 1}∗ , on input 𝑧 the program
𝑃 outputs 𝐹 (𝑧).

5. 𝐻(𝑥) = 1 iff 𝑥 is a valid string representation of a NAND++


program 𝑃 such that on input 𝑥 the program 𝑃 outputs 𝐹 (𝑥).

6. 𝐻(𝑥) = 1 iff 𝑥 is a valid string representation of a NAND++


program 𝑃 such that on input 𝑥, 𝑃 outputs 𝐹 (𝑥) after executing at
most 100 ⋅ |𝑥|2 lines.

Exercise 9.5 Prove that the following function FINITE ∶ {0, 1}∗ → {0, 1}
is uncomputable. On input 𝑃 ∈ {0, 1}∗ , we define FINITE(𝑃 ) = 1
if and only if 𝑃 is a string that represents a NAND++ program such
that there are only a finite number of inputs 𝑥 ∈ {0, 1}∗ s.t. 𝑃 (𝑥) = 1.³
³ Hint: You can use Rice’s Theorem.


Exercise 9.6 — Computing parity. Prove Theorem 9.13 without using Rice’s
Theorem.

Exercise 9.7 — TM Equivalence. Let EQ ∶ {0, 1}∗ → {0, 1} be the func-
tion defined as follows: given a string representing a pair (𝑀 , 𝑀 ′ )
of Turing machines, EQ(𝑀 , 𝑀 ′ ) = 1 iff 𝑀 and 𝑀 ′ are functionally
equivalent as per Definition 9.14. Prove that EQ is uncomputable.
Note that you cannot use Rice’s Theorem directly, as this theorem
only deals with functions that take a single Turing machine as input,
and EQ takes two machines.

Exercise 9.8 For each of the following two functions, say whether it is
computable or not:

1. Given a NAND-TM program 𝑃 , an input 𝑥, and a number 𝑘, when


we run 𝑃 on 𝑥, does the index variable i ever reach 𝑘?

2. Given a NAND-TM program 𝑃 , an input 𝑥, and a number 𝑘, when


we run 𝑃 on 𝑥, does 𝑃 ever write to an array at index 𝑘?

Exercise 9.9 Let 𝐹 ∶ {0, 1}∗ → {0, 1} be the function that is defined as
follows. On input a string 𝑃 that represents a NAND-RAM program
and a string 𝑀 that represents a Turing machine, 𝐹 (𝑃 , 𝑀 ) = 1 if and
only if there exists some input 𝑥 such that 𝑃 halts on 𝑥 but 𝑀 does not halt
on 𝑥. Prove that 𝐹 is uncomputable. See footnote for hint.⁴
⁴ Hint: While it cannot be applied directly, with a little “massaging” you can prove this using Rice’s Theorem.
■

Exercise 9.10 — Recursively enumerable. Define a function 𝐹 ∶ {0, 1}∗ →
{0, 1} to be recursively enumerable if there exists a Turing machine 𝑀
such that for every 𝑥 ∈ {0, 1}∗ , if 𝐹 (𝑥) = 1 then 𝑀 (𝑥) = 1,
and if 𝐹 (𝑥) = 0 then 𝑀 (𝑥) = ⊥. (i.e., if 𝐹 (𝑥) = 0 then 𝑀 does not halt
on 𝑥.)

1. Prove that every computable 𝐹 is also recursively enumerable.

2. Prove that there exists 𝐹 that is not computable but is recursively
enumerable. See footnote for hint.⁵
⁵ HALT has this property.

3. Prove that there exists a function 𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹 is
not recursively enumerable. See footnote for hint.⁶
⁶ You can either use the diagonalization method to prove this directly or show that the set of all recursively enumerable functions is countable.

4. Prove that there exists a function 𝐹 ∶ {0, 1}∗ → {0, 1} such that
𝐹 is recursively enumerable but the function 𝐹̄ defined as 𝐹̄(𝑥) = 1 − 𝐹 (𝑥) is not recursively enumerable. See footnote for hint.⁷
⁷ HALT has this property: show that if both HALT and its complement were recursively enumerable then HALT would in fact be computable.


Exercise 9.11 — Rice’s Theorem: standard form. In this exercise we will
prove Rice’s Theorem in the form that it is typically stated in the litera-
ture.
For a Turing machine 𝑀 , define 𝐿(𝑀 ) ⊆ {0, 1}∗ to be the set of all
𝑥 ∈ {0, 1}∗ such that 𝑀 halts on the input 𝑥 and outputs 1. (The set
𝐿(𝑀 ) is known in the literature as the language recognized by 𝑀 . Note
that 𝑀 might either output a value other than 1 or not halt at all on
inputs 𝑥 ∉ 𝐿(𝑀 ). )

1. Prove that for every Turing machine 𝑀 , if we define 𝐹𝑀 ∶ {0, 1}∗ →


{0, 1} to be the function such that 𝐹𝑀 (𝑥) = 1 iff 𝑥 ∈ 𝐿(𝑀 ) then 𝐹𝑀
is recursively enumerable as defined in Exercise 9.10.

2. Use Theorem 9.15 to prove that for every 𝐺 ∶ {0, 1}∗ → {0, 1}, if (a)
𝐺 is neither the constant zero nor the constant one function, and
(b) for every 𝑀 , 𝑀 ′ such that 𝐿(𝑀 ) = 𝐿(𝑀 ′ ), 𝐺(𝑀 ) = 𝐺(𝑀 ′ ),
then 𝐺 is uncomputable. See footnote for hint.⁸
⁸ Hint: Show that any 𝐺 satisfying (b) must be semantic.

Exercise 9.12 — Rice’s Theorem for general Turing-equivalent models (optional).


Let ℱ be the set of all partial functions from {0, 1}∗ to {0, 1} and ℳ ∶
{0, 1}∗ → ℱ be a Turing-equivalent model as defined in Definition 8.5.
We define a function 𝐹 ∶ {0, 1}∗ → {0, 1} to be ℳ-semantic if there
exists some 𝒢 ∶ ℱ → {0, 1} such that 𝐹 (𝑃 ) = 𝒢(ℳ(𝑃 )) for every
𝑃 ∈ {0, 1}∗ .
Prove that for every ℳ-semantic 𝐹 ∶ {0, 1}∗ → {0, 1} that is neither
the constant one nor the constant zero function, 𝐹 is uncomputable.

Exercise 9.13 — Busy Beaver. In this question we define the NAND-
TM variant of the busy beaver function (see Aaronson’s 1999 essay,
2017 blog post and 2020 survey [Aar20]; see also Tao’s highly recom-
mended presentation on how civilization’s scientific progress can be
measured by the quantities we can grasp).

1. Let 𝑇𝐵𝐵 ∶ {0, 1}∗ → ℕ be defined as follows. For every string 𝑃 ∈


{0, 1}∗ , if 𝑃 represents a NAND-TM program such that when 𝑃 is
executed on the input 0 then it halts within 𝑀 steps then 𝑇𝐵𝐵 (𝑃 ) =
𝑀 . Otherwise (if 𝑃 does not represent a NAND-TM program, or it
is a program that does not halt on 0), 𝑇𝐵𝐵 (𝑃 ) = 0. Prove that 𝑇𝐵𝐵
is uncomputable.
2. Let TOWER(𝑛) denote the number 2^(2^(⋯^2)) with 𝑛 twos (that is, a “tower of pow-
ers of two” of height 𝑛). To get a sense of how fast this function
grows, TOWER(1) = 2, TOWER(2) = 2^2 = 4, TOWER(3) = 2^(2^2) =
16, TOWER(4) = 2^16 = 65536 and TOWER(5) = 2^65536 which
is about 10^20000. TOWER(6) is already a number that is too big to


write even in scientific notation. Define NBB ∶ ℕ → ℕ (for “NAND-
TM Busy Beaver”) to be the function NBB(𝑛) = max𝑃 ∈{0,1}𝑛 𝑇𝐵𝐵 (𝑃 )
where 𝑇𝐵𝐵 is as defined in the previous item. Prove that NBB grows
faster than TOWER, in the sense that TOWER(𝑛) = 𝑜(NBB(𝑛)). See
footnote for hint.⁹
⁹ You will not need to use very specific properties of the TOWER function in this exercise. For example, NBB(𝑛) also grows faster than the Ackermann function.

9.7 BIBLIOGRAPHICAL NOTES


The cartoon of the Halting problem in Fig. 9.1 is taken from Charles
Cooper’s website, Copyright 2019 Charles F. Cooper.
Section 7.2 in [MM11] gives a highly recommended overview of
uncomputability. Gödel, Escher, Bach [Hof99] is a classic popular
science book that touches on uncomputability, and unprovability, and
specifically Gödel’s Theorem that we will see in Chapter 11. See also
the recent book by Holt [Hol18].
The history of the definition of a function is intertwined with the
development of mathematics as a field. For many years, a function
was identified (as per Euler’s quote above) with the means to calcu-
late the output from the input. In the 1800’s, with the invention of
the Fourier series and with the systematic study of continuity and
differentiability, people have started looking at more general kinds of
functions, but the modern definition of a function as an arbitrary map-
ping was not yet universally accepted. For example, in 1899 Poincare
wrote “we have seen a mass of bizarre functions which appear to be forced
to resemble as little as possible honest functions which serve some purpose.
… they are invented on purpose to show that our ancestor’s reasoning was at
fault, and we shall never get anything more than that out of them”. Some of
this fascinating history is discussed in [Gra83; Kle91; Lüt02; Gra05].
The existence of a universal Turing machine, and the uncomputabil-
ity of HALT was first shown by Turing in his seminal paper [Tur37],
though closely related results were shown by Church a year before.
These works built on Gödel’s 1931 incompleteness theorem that we will
discuss in Chapter 11.
Some universal Turing machines with a small alphabet and number
of states are given in [Rog96], including a single-tape universal Turing
machine with the binary alphabet and with less than 25 states; see
also the survey [WN09]. Adam Yedidia has written software to help
in producing Turing machines with a small number of states. This is
related to the recreational pastime of “Code Golfing” which is about
solving a certain computational task using as short a program as
possible. Finding “highly complex” small Turing machines is also

related to the “Busy Beaver” problem, see Exercise 9.13 and the survey
[Aar20].
The diagonalization argument used to prove uncomputability of 𝐹 ∗
is derived from Cantor’s argument for the uncountability of the reals
discussed in Chapter 2.
Christopher Strachey was an English computer scientist and the
inventor of the CPL programming language. He was also an early
artificial intelligence visionary, programming a computer to play
Checkers and even write love letters in the early 1950’s, see this New
Yorker article and this website.
Rice’s Theorem was proven in [Ric53]. It is typically stated in a
form somewhat different than what we used, see Exercise 9.11.
We do not discuss in the chapter the concept of recursively enumer-
able languages, but it is covered briefly in Exercise 9.10. As usual, we
use function, as opposed to language, notation.
Learning Objectives:
• See that Turing completeness is not always a
good thing.
• Another example of an always-halting
formalism: context-free grammars and simply
typed 𝜆 calculus.
• The pumping lemma for non context-free
functions.
• Examples of computable and uncomputable
semantic properties of regular expressions and
context-free grammars.

10
Restricted computational models

“Happy families are all alike; every unhappy family is unhappy in its own
way”, Leo Tolstoy (opening of the book “Anna Karenina”).

We have seen that many models of computation are Turing equiva-


lent, including Turing machines, NAND-TM/NAND-RAM programs,
standard programming languages such as C/Python/Javascript, as
well as other models such as the 𝜆 calculus and even the game of life.
The flip side of this is that for all these models, Rice’s theorem (The-
orem 9.15) holds as well, which means that any semantic property of
programs in such a model is uncomputable.
The uncomputability of halting and other semantic specification
problems for Turing equivalent models motivates restricted com-
putational models that are (a) powerful enough to capture a set of
functions useful for certain applications but (b) weak enough that we
can still solve semantic specification problems on them. In this chapter
we discuss several such examples.

 Big Idea 14 We can use restricted computational models to bypass


limitations such as uncomputability of the Halting problem and Rice’s
Theorem. Such models can compute only a restricted subclass of
functions, but allow us to answer at least some semantic questions on
programs.

10.1 TURING COMPLETENESS AS A BUG


We have seen that seemingly simple computational models or sys-
tems can turn out to be Turing complete. The following webpage lists
several examples of formalisms that “accidentally” turned out to be
Turing complete, including supposedly limited languages such as the C
preprocessor, CSS, (certain variants of) SQL, sendmail configuration,
as well as games such as Minecraft, Super Mario, and the card game




Figure 10.1: Some restricted computational models.


We have already seen two equivalent restricted
models of computation: regular expressions and
deterministic finite automata. We show a more
powerful model: context-free grammars. We also
present tools to demonstrate that some functions can
not be computed in these models.

“Magic: The Gathering”. Turing completeness is not always a good


thing, as it means that such formalisms can give rise to arbitrarily
complex behavior. For example, the postscript format (a precursor of
PDF) is a Turing-complete programming language meant to describe
documents for printing. The expressive power of postscript can allow
for short descriptions of very complex images, but it also gave rise to
some nasty surprises, such as the attacks described in this page rang-
ing from using infinite loops as a denial of service attack, to accessing
the printer’s file system.

■ Example 10.1 — The DAO Hack. An interesting recent example of the
pitfalls of Turing-completeness arose in the context of the cryp-
tocurrency Ethereum. The distinguishing feature of this currency
is the ability to design “smart contracts” using an expressive (and
in particular Turing-complete) programming language. In our
current “human operated” economy, Alice and Bob might sign a
contract to agree that if condition X happens then they will jointly
invest in Charlie’s company. Ethereum allows Alice and Bob to
create a joint venture where Alice and Bob pool their funds to-
gether into an account that will be governed by some program 𝑃
that decides under what conditions it disburses funds from it. For
example, one could imagine a piece of code that interacts between
Alice, Bob, and some program running on Bob’s car that allows
Alice to rent out Bob’s car without any human intervention or
overhead.
Specifically Ethereum uses the Turing-complete programming
language solidity which has a syntax similar to JavaScript. The
flagship of Ethereum was an experiment known as The “Decen-
tralized Autonomous Organization” or The DAO. The idea was
to create a smart contract that would create an autonomously run
decentralized venture capital fund, without human managers,
where shareholders could decide on investment opportunities. The

DAO was at the time the biggest crowdfunding success in history.


At its height the DAO was worth 150 million dollars, which was
more than ten percent of the total Ethereum market. Investing in
the DAO (or entering any other “smart contract”) amounts to pro-
viding your funds to be run by a computer program. i.e., “code
is law”, or to use the words the DAO described itself: “The DAO
is borne from immutable, unstoppable, and irrefutable computer code”.
Unfortunately, it turns out that (as we saw in Chapter 9) under-
standing the behavior of computer programs is quite a hard thing
to do. A hacker (or perhaps, some would say, a savvy investor)
was able to fashion an input that caused the DAO code to enter
into an infinite recursive loop in which it continuously transferred
funds into the hacker’s account, thereby cleaning out about 60 mil-
lion dollars out of the DAO. While this transaction was “legal” in
the sense that it complied with the code of the smart contract, it
was obviously not what the humans who wrote this code had in
mind. The Ethereum community struggled with the response to
this attack. Some tried the “Robin Hood” approach of using the
same loophole to drain the DAO funds into a secure account, but
it only had limited success. Eventually, the Ethereum community
decided that the code can be mutable, stoppable, and refutable.
Specifically, the Ethereum maintainers and miners agreed on a
“hard fork” (also known as a “bailout”) to revert history to be-
fore the hacker’s transaction occurred. Some community members
strongly opposed this decision, and so an alternative currency
called Ethereum Classic was created that preserved the original
history.

10.2 CONTEXT FREE GRAMMARS


If you have ever written a program, you’ve experienced a syntax error.
You probably also had the experience of your program entering into
an infinite loop. What is less likely is that the compiler or interpreter
entered an infinite loop while trying to figure out if your program has
a syntax error.
When a person designs a programming language, they need to
determine its syntax. That is, the designer decides which strings corre-
sponds to valid programs, and which ones do not (i.e., which strings
contain a syntax error). To ensure that a compiler or interpreter al-
ways halts when checking for syntax errors, language designers typi-
cally do not use a general Turing-complete mechanism to express their
syntax. Rather they use a restricted computational model. One of the
most popular choices for such models is context free grammars.

To explain context free grammars, let us begin with a canonical ex-


ample. Consider the function ARITH ∶ Σ∗ → {0, 1} that takes as input
a string 𝑥 over the alphabet Σ = {(, ), +, −, ×, ÷, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and returns 1 if and only if the string 𝑥 represents a valid arithmetic
expression. Intuitively, we build expressions by applying an opera-
tion such as +,−,× or ÷ to smaller expressions, or enclosing them in
parentheses, where the “base case” corresponds to expressions that
are simply numbers. More precisely, we can make the following defi-
nitions:

• A digit is one of the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

• A number is a sequence of digits. (For simplicity we drop the condi-


tion that the sequence does not have a leading zero, though it is not
hard to encode it in a context-free grammar as well.)

• An operation is one of +, −, ×, ÷

• An expression has either the form “number”, the form “sub-


expression1 operation sub-expression2”, or the form “(sub-
expression1)”, where “sub-expression1” and “sub-expression2” are
themselves expressions. (Note that this is a recursive definition.)

A context free grammar (CFG) is a formal way of specifying such


conditions. A CFG consists of a set of rules that tell us how to generate
strings from smaller components. In the above example, one of the
rules is “if 𝑒𝑥𝑝1 and 𝑒𝑥𝑝2 are valid expressions, then 𝑒𝑥𝑝1 × 𝑒𝑥𝑝2 is
also a valid expression”; we can also write this rule using the short-
hand 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 ⇒ 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 × 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛. As in the above ex-
ample, the rules of a context-free grammar are often recursive: the rule
𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 ⇒ 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 × 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 defines valid expressions in
terms of itself. We now formally define context-free grammars:

Definition 10.2 — Context Free Grammar. Let Σ be some finite set. A
context free grammar (CFG) over Σ is a triple (𝑉 , 𝑅, 𝑠) such that:

• 𝑉 , known as the variables, is a set disjoint from Σ.

• 𝑠 ∈ 𝑉 is known as the initial variable.

• 𝑅 is a set of rules. Each rule is a pair (𝑣, 𝑧) with 𝑣 ∈ 𝑉 and


𝑧 ∈ (Σ ∪ 𝑉 )∗ . We often write the rule (𝑣, 𝑧) as 𝑣 ⇒ 𝑧 and say that

the string 𝑧 can be derived from the variable 𝑣.



■ Example 10.3 — Context free grammar for arithmetic expressions. The


example above of well-formed arithmetic expressions can be cap-
tured formally by the following context free grammar:

• The alphabet Σ is {(, ), +, −, ×, ÷, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

• The variables are 𝑉 = {𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 , 𝑛𝑢𝑚𝑏𝑒𝑟 , 𝑑𝑖𝑔𝑖𝑡 , 𝑜𝑝𝑒𝑟𝑎𝑡𝑖𝑜𝑛}.

• The rules are the set 𝑅 containing the following 19 rules:

– The 4 rules 𝑜𝑝𝑒𝑟𝑎𝑡𝑖𝑜𝑛 ⇒ +, 𝑜𝑝𝑒𝑟𝑎𝑡𝑖𝑜𝑛 ⇒ −, 𝑜𝑝𝑒𝑟𝑎𝑡𝑖𝑜𝑛 ⇒ ×,


and 𝑜𝑝𝑒𝑟𝑎𝑡𝑖𝑜𝑛 ⇒ ÷.
– The 10 rules 𝑑𝑖𝑔𝑖𝑡 ⇒ 0,…, 𝑑𝑖𝑔𝑖𝑡 ⇒ 9.
– The rule 𝑛𝑢𝑚𝑏𝑒𝑟 ⇒ 𝑑𝑖𝑔𝑖𝑡.
– The rule 𝑛𝑢𝑚𝑏𝑒𝑟 ⇒ 𝑑𝑖𝑔𝑖𝑡 𝑛𝑢𝑚𝑏𝑒𝑟.
– The rule 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 ⇒ 𝑛𝑢𝑚𝑏𝑒𝑟.
– The rule 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 ⇒ 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 𝑜𝑝𝑒𝑟𝑎𝑡𝑖𝑜𝑛 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛.
– The rule 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 ⇒ (𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛).

• The starting variable is 𝑒𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛

People use many different notations to write context free grammars.


One of the most common notations is the Backus–Naur form. In this
notation we write a rule of the form 𝑣 ⇒ 𝑎 (where 𝑣 is a variable and 𝑎
is a string) in the form <v> := a. If we have several rules of the form
𝑣 ↦ 𝑎, 𝑣 ↦ 𝑏, and 𝑣 ↦ 𝑐 then we can combine them as <v> := a|b|c.
(In words we say that 𝑣 can derive either 𝑎, 𝑏, or 𝑐.) For example, the
Backus-Naur description for the context free grammar of Example 10.3
is the following (using ASCII equivalents for operations):

operation := +|-|*|/
digit := 0|1|2|3|4|5|6|7|8|9
number := digit|digit number
expression := number|expression operation expression|(expression)

Another example of a context free grammar is the “matching paren-


theses” grammar, which can be represented in Backus-Naur as fol-
lows:

match := ""|match match|(match)

A string over the alphabet { (,) } can be generated from this gram-
mar (where match is the starting expression and "" corresponds to the
empty string) if and only if it consists of a matching set of parentheses.

In contrast, by Lemma 6.20 there is no regular expression that matches


a string 𝑥 if and only if 𝑥 contains a valid sequence of matching paren-
theses.

10.2.1 Context-free grammars as a computational model


We can think of a context-free grammar over the alphabet Σ as defin-
ing a function that maps every string 𝑥 in Σ∗ to 1 or 0 depending on
whether 𝑥 can be generated by the rules of the grammars. We now
make this definition formally.

Definition 10.4 — Deriving a string from a grammar. If 𝐺 = (𝑉 , 𝑅, 𝑠) is a
context-free grammar over Σ, then for two strings 𝛼, 𝛽 ∈ (Σ ∪ 𝑉 )∗
we say that 𝛽 can be derived in one step from 𝛼, denoted by 𝛼 ⇒𝐺 𝛽,
if we can obtain 𝛽 from 𝛼 by applying one of the rules of 𝐺. That is,
we obtain 𝛽 by replacing in 𝛼 one occurrence of the variable 𝑣 with
the string 𝑧, where 𝑣 ⇒ 𝑧 is a rule of 𝐺.
We say that 𝛽 can be derived from 𝛼, denoted by 𝛼 ⇒∗𝐺 𝛽, if it
can be derived by some finite number 𝑘 of steps. That is, if there
are 𝛼1 , … , 𝛼𝑘−1 ∈ (Σ ∪ 𝑉 )∗ , so that 𝛼 ⇒𝐺 𝛼1 ⇒𝐺 𝛼2 ⇒𝐺 ⋯ ⇒𝐺
𝛼𝑘−1 ⇒𝐺 𝛽.
We say that 𝑥 ∈ Σ∗ is matched by 𝐺 = (𝑉 , 𝑅, 𝑠) if 𝑥 can be de-
rived from the starting variable 𝑠 (i.e., if 𝑠 ⇒∗𝐺 𝑥). We define the
function computed by (𝑉 , 𝑅, 𝑠) to be the map Φ𝑉 ,𝑅,𝑠 ∶ Σ∗ → {0, 1}
such that Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 iff 𝑥 is matched by (𝑉 , 𝑅, 𝑠). A function
𝐹 ∶ Σ∗ → {0, 1} is context free if 𝐹 = Φ𝑉 ,𝑅,𝑠 for some CFG (𝑉 , 𝑅, 𝑠).¹
¹ As in the case of Definition 6.7 we can also use language rather than function notation and say that a language 𝐿 ⊆ Σ∗ is context free if the function 𝐹 such that 𝐹 (𝑥) = 1 iff 𝑥 ∈ 𝐿 is context free.
A priori it might not be clear that the map Φ𝑉 ,𝑅,𝑠 is computable, but it turns out that this is the case.

Theorem 10.5 — Context-free grammars always halt. For every CFG


(𝑉 , 𝑅, 𝑠) over {0, 1}, the function Φ𝑉 ,𝑅,𝑠 ∶ {0, 1}∗ → {0, 1} is
computable.

As usual we restrict attention to grammars over {0, 1} although the


proof extends to any finite alphabet Σ.

Proof. We only sketch the proof. We start with the observation we can
convert every CFG to an equivalent version of Chomsky normal form,
where all rules either have the form 𝑢 → 𝑣𝑤 for variables 𝑢, 𝑣, 𝑤 or the
form 𝑢 → 𝜎 for a variable 𝑢 and symbol 𝜎 ∈ Σ, plus potentially the
rule 𝑠 → "" where 𝑠 is the starting variable.
The idea behind such a transformation is to simply add new vari-
ables as needed, and so for example we can translate a rule such as
𝑣 → 𝑢𝜎𝑤 into the three rules 𝑣 → 𝑢𝑟, 𝑟 → 𝑡𝑤 and 𝑡 → 𝜎.

Using the Chomsky Normal form we get a natural recursive algo-


rithm for computing whether 𝑠 ⇒∗𝐺 𝑥 for a given grammar 𝐺 and
string 𝑥. We simply try all possible guesses for the first rule 𝑠 → 𝑢𝑣
that is used in such a derivation, and then all possible ways to par-
tition 𝑥 as a concatenation 𝑥 = 𝑥′ 𝑥″ . If we guessed the rule and the
partition correctly, then this reduces our task to checking whether
𝑢 ⇒∗𝐺 𝑥′ and 𝑣 ⇒∗𝐺 𝑥″ , which (as it involves shorter strings) can
be done recursively. The base cases are when 𝑥 is empty or a single
symbol, and can be easily handled.
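
The recursive algorithm sketched above can be written in a few lines of Python. The dictionary-based grammar representation below (terminal rules as single-character strings, pair rules as tuples of two variables) is an assumption of this sketch, the empty string is handled separately as in the proof, and this is the naive exponential-time version; as discussed in the remark below, far more efficient parsers exist.

# Membership test for a grammar already in Chomsky normal form.

def derives(rules, var, x):
    """Does variable `var` derive the (non-empty) string x?"""
    for rhs in rules[var]:
        if isinstance(rhs, str):                 # rule  var -> terminal symbol
            if x == rhs:
                return True
        else:                                    # rule  var -> u w  (two variables)
            u, w = rhs
            for i in range(1, len(x)):           # try every split x = x'x''
                if derives(rules, u, x[:i]) and derives(rules, w, x[i:]):
                    return True
    return False

# Example: non-empty matching parentheses (S -> SS | (S) | () in CNF).
rules = {
    "S": [("S", "S"), ("L", "T"), ("L", "R")],
    "T": [("S", "R")],
    "L": ["("],
    "R": [")"],
}
assert derives(rules, "S", "(())")
assert not derives(rules, "S", "(()")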

R
Remark 10.6 — Parse trees. While we focus on the
task of deciding whether a CFG matches a string, the
algorithm to compute Φ𝑉 ,𝑅,𝑠 actually gives more in-
formation than that. That is, on input a string 𝑥, if
Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 then the algorithm yields the sequence
of rules that one can apply from the starting vertex 𝑠
to obtain the final string 𝑥. We can think of these rules
as determining a tree with 𝑠 being the root vertex and
the sinks (or leaves) corresponding to the substrings
of 𝑥 that are obtained by the rules that do not have a
variable in their second element. This tree is known
as the parse tree of 𝑥, and often yields very useful
information about the structure of 𝑥.
Often the first step in a compiler or interpreter for a
programming language is a parser that transforms the
source into the parse tree (also known as the abstract
syntax tree). There are also tools that can automati-
cally convert a description of a context-free grammars
into a parser algorithm that computes the parse tree of
a given string. (Indeed, the above recursive algorithm
can be used to achieve this, but there are much more
efficient versions, especially for grammars that have
particular forms, and programming language design-
ers often try to ensure their languages have these more
efficient grammars.)

10.2.2 The power of context free grammars


Context free grammars can capture every regular expression:

Theorem 10.7 — Context free grammars and regular expressions. Let 𝑒 be a
regular expression over {0, 1}, then there is a CFG (𝑉 , 𝑅, 𝑠) over
{0, 1} such that Φ𝑉 ,𝑅,𝑠 = Φ𝑒 .

Proof. We prove the theorem by induction on the length of 𝑒. If 𝑒 is


an expression of one bit length, then 𝑒 = 0 or 𝑒 = 1, in which case
we leave it to the reader to verify that there is a (trivial) CFG that

computes it. Otherwise, we fall into one of the following cases: case 1:
𝑒 = 𝑒′ 𝑒″ , case 2: 𝑒 = 𝑒′ |𝑒″ or case 3: 𝑒 = (𝑒′ )∗ where in all cases 𝑒′ , 𝑒″
are shorter regular expressions. By the induction hypothesis, we can
define grammars (𝑉 ′ , 𝑅′ , 𝑠′ ) and (𝑉 ″ , 𝑅″ , 𝑠″ ) that compute Φ𝑒′ and
Φ𝑒″ respectively. By renaming variables, we can also assume without
loss of generality that 𝑉 ′ and 𝑉 ″ are disjoint.
In case 1, we can define the new grammar as follows: we add a new
starting variable 𝑠 ∉ 𝑉 ′ ∪ 𝑉 ″ and the rule 𝑠 ↦ 𝑠′ 𝑠″ . In case 2, we can
define the new grammar as follows: we add a new starting variable
𝑠 ∉ 𝑉 ′ ∪ 𝑉 ″ and the rules 𝑠 ↦ 𝑠′ and 𝑠 ↦ 𝑠″ . Case 3 will be the
only one that uses recursion. As before we add a new starting variable
𝑠 ∉ 𝑉 ′ , but now add the rules 𝑠 ↦ "" (i.e., the empty string) and
also add, for every rule of the form (𝑠′ , 𝛼) ∈ 𝑅′ , the rule 𝑠 ↦ 𝑠𝛼 to 𝑅.
We leave it to the reader as (a very good!) exercise to verify that in
all three cases the grammars we produce capture the same function as
the original expression.
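
The inductive construction in this proof can be sketched in Python as follows. The nested-tuple encoding of regular expressions and the integer-variable representation of grammars (rules maps each variable to a list of right-hand sides, each a tuple mixing the terminals '0'/'1' with variables) are assumptions made for this sketch only.

import itertools
_fresh = itertools.count()

def regex_to_cfg(e):
    """Return (rules, start) for a grammar computing the same function as e."""
    s = next(_fresh)                         # a fresh variable for this sub-expression
    if e in ('0', '1'):                      # base case: a single symbol
        return {s: [(e,)]}, s
    op = e[0]
    if op == 'concat':                       # case 1: e = e' e''
        r1, s1 = regex_to_cfg(e[1])
        r2, s2 = regex_to_cfg(e[2])
        return {**r1, **r2, s: [(s1, s2)]}, s
    if op == 'union':                        # case 2: e = e' | e''
        r1, s1 = regex_to_cfg(e[1])
        r2, s2 = regex_to_cfg(e[2])
        return {**r1, **r2, s: [(s1,), (s2,)]}, s
    if op == 'star':                         # case 3: e = (e')*
        r1, s1 = regex_to_cfg(e[1])
        # rule s -> "" plus, for every rule s1 -> alpha, the rule s -> s alpha
        return {**r1, s: [()] + [(s,) + alpha for alpha in r1[s1]]}, s
    raise ValueError(f"unknown expression {e!r}")

# Example: the regular expression (01)* as a nested tuple.
rules, start = regex_to_cfg(('star', ('concat', '0', '1')))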

It turns out that CFG’s are strictly more powerful than regular
expressions. In particular, as we’ve seen, the “matching parentheses”
function MATCHPAREN can be computed by a context free grammar,
whereas, as shown in Lemma 6.20, it cannot be computed by regular
expressions. Here is another example:
Solved Exercise 10.1 — Context free grammar for palindromes. Let PAL ∶
{0, 1, ; }∗ → {0, 1} be the function defined in Solved Exercise 6.4 where
PAL(𝑤) = 1 iff 𝑤 has the form 𝑢; 𝑢𝑅 . Then PAL can be computed by a
context-free grammar.

Solution:
A simple grammar computing PAL can be described using
Backus–Naur notation:

start := ; | 0 start 0 | 1 start 1

One can prove by induction that this grammar generates exactly


the strings 𝑤 such that PAL(𝑤) = 1.

A more interesting example is computing the strings of the form


𝑢; 𝑣 that are not palindromes:
Solved Exercise 10.2 — Non-palindromes. Prove that there is a context free
grammar that computes NPAL ∶ {0, 1, ; }∗ → {0, 1} where NPAL(𝑤) =
1 if 𝑤 = 𝑢; 𝑣 but 𝑣 ≠ 𝑢𝑅 .


Solution:
Using Backus–Naur notation we can describe such a grammar as
follows

palindrome := ; | 0 palindrome 0 | 1 palindrome 1


different := 0 palindrome 1 | 1 palindrome 0
start := different | 0 start | 1 start | start 0 | start 1

In words, this means that we can characterize a string 𝑤 such


that NPAL(𝑤) = 1 as having the following form

𝑤 = 𝛼𝑏𝑢; 𝑢𝑅 𝑏′ 𝛽

where 𝛼, 𝛽, 𝑢 are arbitrary strings and 𝑏 ≠ 𝑏′ . Hence we can


generate such a string by first generating a palindrome 𝑢; 𝑢𝑅
(palindrome variable), then adding 0 on either the left or right and
1 on the opposite side to get something that is not a palindrome
(different variable), and then we can add arbitrary number of 0’s
and 1’s on either end (the start variable).

10.2.3 Limitations of context-free grammars (optional)


Even though context-free grammars are more powerful than regular
expressions, there are some simple languages that are not captured
by context free grammars. One tool to show this is the context-free
grammar analog of the “pumping lemma” (Theorem 6.21):

Theorem 10.8 — Context-free pumping lemma. Let (𝑉 , 𝑅, 𝑠) be a CFG


over Σ, then there are some numbers 𝑛0 , 𝑛1 ∈ ℕ such that for every
𝑥 ∈ Σ∗ with |𝑥| > 𝑛0 , if Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 then 𝑥 = 𝑎𝑏𝑐𝑑𝑒 such that
|𝑏| + |𝑐| + |𝑑| ≤ 𝑛1 , |𝑏| + |𝑑| ≥ 1, and Φ𝑉 ,𝑅,𝑠 (𝑎𝑏𝑘 𝑐𝑑𝑘 𝑒) = 1 for every
𝑘 ∈ ℕ.

P
The context-free pumping lemma is even more cum-
bersome to state than its regular analog, but you can
remember it as saying the following: “If a long enough
string is matched by a grammar, there must be a variable
that is repeated in the derivation.”

Proof of Theorem 10.8. We only sketch the proof. The idea is that if
the total number of symbols in the rules of the grammar is 𝑛0 , then
the only way to get |𝑥| > 𝑛0 with Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 is to use recursion.
That is, there must be some variable 𝑣 ∈ 𝑉 such that we are able to

derive from 𝑣 the value 𝑏𝑣𝑑 for some strings 𝑏, 𝑑 ∈ Σ∗ , and then further
on derive from 𝑣 some string 𝑐 ∈ Σ∗ such that 𝑏𝑐𝑑 is a substring of
𝑥 (in other words, 𝑥 = 𝑎𝑏𝑐𝑑𝑒 for some 𝑎, 𝑒 ∈ {0, 1}∗ ). If we take
the variable 𝑣 satisfying this requirement with a minimum number
of derivation steps, then we can ensure that |𝑏𝑐𝑑| is at most some
constant depending on 𝑛0 and we can set 𝑛1 to be that constant (𝑛1 =
10 ⋅ |𝑅| ⋅ 𝑛0 will do, since we will not need more than |𝑅| applications
of rules, and each such application can grow the string by at most 𝑛0
symbols).
Thus by the definition of the grammar, we can repeat the derivation
to replace the substring 𝑏𝑐𝑑 in 𝑥 with 𝑏𝑘 𝑐𝑑𝑘 for every 𝑘 ∈ ℕ while
retaining the property that the output of Φ𝑉 ,𝑅,𝑠 is still one. Since 𝑏𝑐𝑑
is a substring of 𝑥, we can write 𝑥 = 𝑎𝑏𝑐𝑑𝑒 and are guaranteed that
𝑎𝑏𝑘 𝑐𝑑𝑘 𝑒 is matched by the grammar for every 𝑘.

Using Theorem 10.8 one can show that even the simple function
𝐹 ∶ {0, 1}∗ → {0, 1} defined as follows:

𝐹 (𝑥) = 1 if 𝑥 = 𝑤𝑤 for some 𝑤 ∈ {0, 1}∗ , and 𝐹 (𝑥) = 0 otherwise,

is not context free. (In contrast, the function 𝐺 ∶ {0, 1}∗ → {0, 1}
defined as 𝐺(𝑥) = 1 iff 𝑥 = 𝑤0 𝑤1 ⋯ 𝑤𝑛−1 𝑤𝑛−1 𝑤𝑛−2 ⋯ 𝑤0 for some
𝑤 ∈ {0, 1}∗ and 𝑛 = |𝑤| is context free, can you see why?.)
Solved Exercise 10.3 — Equality is not context-free. Let EQ ∶ {0, 1, ; }∗ →
{0, 1} be the function such that EQ(𝑥) = 1 if and only if 𝑥 = 𝑢; 𝑢 for
some 𝑢 ∈ {0, 1}∗ . Then EQ is not context free.

Solution:
We use the context-free pumping lemma. Suppose towards the
sake of contradiction that there is a grammar 𝐺 that computes EQ,
and let 𝑛0 be the constant obtained from Theorem 10.8.
Consider the string 𝑥 = 1𝑛0 0𝑛0 ; 1𝑛0 0𝑛0 , and write it as 𝑥 = 𝑎𝑏𝑐𝑑𝑒
as per Theorem 10.8, with |𝑏𝑐𝑑| ≤ 𝑛0 and with |𝑏| + |𝑑| ≥ 1. By The-
orem 10.8, it should hold that EQ(𝑎𝑐𝑒) = 1. However, by case anal-
ysis this can be shown to be a contradiction.
Firstly, unless 𝑏 is on the left side of the ; separator and 𝑑 is on
the right side, dropping 𝑏 and 𝑑 will definitely make the two parts
different. But if it is the case that 𝑏 is on the left side and 𝑑 is on the
right side, then by the condition that |𝑏𝑐𝑑| ≤ 𝑛0 we know that 𝑏 is a
string of only zeros and 𝑑 is a string of only ones. If we drop 𝑏 and
𝑑 then since one of them is non-empty, we get that there are either

less zeroes on the left side than on the right side, or there are less
ones on the right side than on the left side. In either case, we get
that EQ(𝑎𝑐𝑒) = 0, obtaining the desired contradiction.

10.3 SEMANTIC PROPERTIES OF CONTEXT FREE LANGUAGES


As in the case of regular expressions, the limitations of context free
grammars do provide some advantages. For example, emptiness of
context free grammars is decidable:

Theorem 10.9 — Emptiness for CFG’s is decidable. There is an algorithm
that on input a context-free grammar 𝐺, outputs 1 if and only if Φ𝐺
is the constant zero function.

Proof Idea:
The proof is easier to see if we transform the grammar to Chomsky
Normal Form as in Theorem 10.5. Given a grammar 𝐺, we can recur-
sively define a non-terminal variable 𝑣 to be non-empty if there is either
a rule of the form 𝑣 ⇒ 𝜎, or there is a rule of the form 𝑣 ⇒ 𝑢𝑤 where
both 𝑢 and 𝑤 are non-empty. Then the grammar is non-empty if and
only if the starting variable 𝑠 is non-empty.

Proof of Theorem 10.9. We assume that the grammar 𝐺 in Chomsky


Normal Form as in Theorem 10.5. We consider the following proce-
dure for marking variables as “non-empty”:

1. We start by marking all variables 𝑣 that are involved in a rule of the


form 𝑣 ⇒ 𝜎 as non-empty.

2. We then continue to mark 𝑣 as non-empty if it is involved in a rule


of the form 𝑣 ⇒ 𝑢𝑤 where 𝑢, 𝑤 have been marked before.

We continue this way until we cannot mark any more variables. We


then declare that the grammar is empty if and only if 𝑠 has not been
marked. To see why this is a valid algorithm, note that if a variable 𝑣
has been marked as “non-empty” then there is some string 𝛼 ∈ Σ∗ that
can be derived from 𝑣. On the other hand, if 𝑣 has not been marked,
then every sequence of derivations from 𝑣 will always have a variable
that has not been replaced by alphabet symbols. Hence in particular
Φ𝐺 is the all zero function if and only if the starting variable 𝑠 is not
marked “non-empty”.
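The marking procedure can be written out concretely. The following Python sketch (not from the book) assumes a hypothetical representation of a Chomsky Normal Form grammar as a list of rules, each rule being a pair of a variable name and either a terminal symbol or a pair of variable names:

def is_empty(rules, start="s"):
    """Return True iff the grammar derives no string, i.e. Phi_G is the all-zero function."""
    nonempty = set()
    changed = True
    while changed:
        changed = False
        for v, rhs in rules:
            if v in nonempty:
                continue
            if isinstance(rhs, str):
                # A rule v => sigma (a single terminal) immediately makes v non-empty.
                nonempty.add(v)
                changed = True
            elif rhs[0] in nonempty and rhs[1] in nonempty:
                # A rule v => u w makes v non-empty once both u and w are non-empty.
                nonempty.add(v)
                changed = True
    return start not in nonempty

# Example: s => a b, a => "0", b => "1" derives the string "01", so it is not empty.
print(is_empty([("s", ("a", "b")), ("a", "0"), ("b", "1")]))  # prints False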


10.3.1 Uncomputability of context-free grammar equivalence (optional)


By analogy to regular expressions, one might have hoped to get an
algorithm for deciding whether two given context free grammars
are equivalent. Alas, no such luck. It turns out that the equivalence
problem for context free grammars is uncomputable. This is a direct
corollary of the following theorem:

Theorem 10.10 — Fullness of CFG’s is uncomputable. For every set Σ, let CFGFULLΣ be the function that on input a context-free grammar 𝐺 over Σ, outputs 1 if and only if 𝐺 computes the constant 1 function. Then there is some finite Σ such that CFGFULLΣ is uncomputable.

Theorem 10.10 immediately implies that equivalence for context-


free grammars is uncomputable, since computing “fullness” of a
grammar 𝐺 over some alphabet Σ = {𝜎0 , … , 𝜎𝑘−1 } corresponds to
checking whether 𝐺 is equivalent to the grammar 𝑠 ⇒ ""|𝑠𝜎0 | ⋯ |𝑠𝜎𝑘−1 .
Note that Theorem 10.10 and Theorem 10.9 together imply that
context-free grammars, unlike regular expressions, are not closed
under complement. (Can you see why?) Since we can encode every
element of Σ using ⌈log |Σ|⌉ bits (and this finite encoding can be easily
carried out within a grammar) Theorem 10.10 implies that fullness is
also uncomputable for grammars over the binary alphabet.

Proof Idea:
We prove the theorem by reducing from the Halting problem. To
do that we use the notion of configurations of NAND-TM programs, as
defined in Definition 8.8. Recall that a configuration of a program 𝑃 is a
binary string 𝑠 that encodes all the information about the program in
the current iteration.
We define Σ to be {0, 1} plus some separator characters and define
INVALID𝑃 ∶ Σ∗ → {0, 1} to be the function that maps every string 𝐿 ∈
Σ∗ to 1 if and only if 𝐿 does not encode a sequence of configurations
that correspond to a valid halting history of the computation of 𝑃 on
the empty input.
The heart of the proof is to show that INVALID𝑃 is context-free.
Once we do that, we see that 𝑃 halts on the empty input if and only if
INVALID𝑃 (𝐿) = 1 for every 𝐿. To show that, we will encode the list
in a special way that makes it amenable to deciding via a context-free
grammar. Specifically we will reverse all the odd-numbered strings.

Proof of Theorem 10.10. We only sketch the proof. We will show that if
we can compute CFGFULL then we can solve HALTONZERO, which
has been proven uncomputable in Theorem 9.9. Let 𝑀 be an input

Turing machine for HALTONZERO. We will use the notion of configu-


rations of a Turing machine, as defined in Definition 8.8.
Recall that a configuration of Turing machine 𝑀 and input 𝑥 cap-
tures the full state of 𝑀 at some point of the computation. The partic-
ular details of configurations are not so important, but what you need
to remember is that:

• A configuration can be encoded by a binary string 𝜎 ∈ {0, 1}∗ .

• The initial configuration of 𝑀 on the input 0 is some fixed string.

• A halting configuration will have the value of a certain state (which can
be easily “read off” from it) set to 1.

• If 𝜎 is a configuration at some step 𝑖 of the computation, we denote


by NEXT𝑀 (𝜎) the configuration at the next step. NEXT𝑀 (𝜎) is
a string that agrees with 𝜎 on all but a constant number of coor-
dinates (those encoding the position corresponding to the head
position and the two adjacent ones). On those coordinates, the
value of NEXT𝑀 (𝜎) can be computed by some finite function.

We will let the alphabet Σ = {0, 1} ∪ {‖, #}. A computation his-


tory of 𝑀 on the input 0 is a string 𝐿 ∈ Σ∗ that corresponds to a list
‖𝜎0 #𝜎1 ‖𝜎2 #𝜎3 ⋯ 𝜎𝑡−2 ‖𝜎𝑡−1 # (i.e., ‖ comes before an even numbered
block, and # comes before an odd numbered one) such that if 𝑖 is
even then 𝜎𝑖 is the string encoding the configuration of 𝑃 on input 0
at the beginning of its 𝑖-th iteration, and if 𝑖 is odd then it is the same
except the string is reversed. (That is, for odd 𝑖, 𝑟𝑒𝑣(𝜎𝑖 ) encodes the
configuration of 𝑃 on input 0 at the beginning of its 𝑖-th iteration.)
Reversing the odd-numbered blocks is a technical trick to ensure that
the function INVALID𝑀 we define below is context free.
We now define INVALID𝑀 ∶ Σ∗ → {0, 1} as follows:

                 ⎧ 0   𝐿 is a valid computation history of 𝑀 on 0
INVALID𝑀 (𝐿) = ⎨
                 ⎩ 1   otherwise

We will show the following claim:


CLAIM: INVALID𝑀 is context-free.
The claim implies the theorem. Since 𝑀 halts on 0 if and only if
there exists a valid computation history, INVALID𝑀 is the constant
one function if and only if 𝑀 does not halt on 0. In particular, this
allows us to reduce determining whether 𝑀 halts on 0 to determining
whether the grammar 𝐺𝑀 corresponding to INVALID𝑀 is full.
We now turn to the proof of the claim. We will not show all the
details, but the main point INVALID𝑀 (𝐿) = 1 if at least one of the
following three conditions hold:

1. 𝐿 is not of the right format, i.e. not of the form ⟨binary-string⟩#⟨binary-string⟩‖⟨binary-string⟩# ⋯.

2. 𝐿 contains a substring of the form ‖𝜎#𝜎′ ‖ such that


𝜎′ ≠ 𝑟𝑒𝑣(NEXT𝑃 (𝜎))

3. 𝐿 contains a substring of the form #𝜎‖𝜎′ # such that


𝜎′ ≠ NEXT𝑃 (𝑟𝑒𝑣(𝜎))

Since context-free functions are closed under the OR operation, the


claim will follow if we show that we can verify conditions 1, 2 and 3
via a context-free grammar.
For condition 1 this is very simple: checking that 𝐿 is of the correct
format can be done using a regular expression. Since regular expres-
sions are closed under negation, this means that checking that 𝐿 is not
of this format can also be done by a regular expression and hence by a
context-free grammar.
For conditions 2 and 3, this follows via very similar reasoning to
that showing that the function 𝐹 such that 𝐹 (𝑢#𝑣) = 1 iff 𝑢 ≠ 𝑟𝑒𝑣(𝑣)
is context-free, see Solved Exercise 10.2. After all, the NEXT𝑀 function
only modifies its input in a constant number of places. We leave filling
out the details as an exercise to the reader. Since INVALID𝑀 (𝐿) = 1
if and only if 𝐿 satisfies one of the conditions 1., 2. or 3., and all three
conditions can be tested for via a context-free grammar, this completes
the proof of the claim and hence the theorem.
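As an illustration of condition 1, checking that a string is of the alternating-separator format can be done with a regular expression. The following Python sketch (not from the book) uses the character | in place of ‖ and, for simplicity, only handles the case of an even number of blocks; the proof needs the complement of this check, which is also regular since regular expressions are closed under negation.

import re

# Checks the alternating format "|<binary>#<binary>|<binary>#<binary>...".
FORMAT = re.compile(r"(\|[01]*#[01]*)+")

def has_valid_format(L: str) -> bool:
    return FORMAT.fullmatch(L) is not None

print(has_valid_format("|010#110|00#11"))  # True
print(has_valid_format("010#110|00#11"))   # False: does not start with the separator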

10.4 SUMMARY OF SEMANTIC PROPERTIES FOR REGULAR EX-


PRESSIONS AND CONTEXT-FREE GRAMMARS
To summarize, we can often trade expressiveness of the model for
amenability to analysis. If we consider computational models that are
not Turing complete, then we are sometimes able to bypass Rice’s The-
orem and answer certain semantic questions about programs in such
models. Here is a summary of some of what is known about semantic
questions for the different models we have seen.

Table 10.1: Computability of semantic properties

Model                     Halting        Emptiness      Equivalence
Regular expressions       Computable     Computable     Computable
Context-free grammars     Computable     Computable     Uncomputable
Turing-complete models    Uncomputable   Uncomputable   Uncomputable

✓ Chapter Recap

• The uncomputability of the Halting problem for


general models motivates the definition of re-
stricted computational models.
• In some restricted models we can answer semantic
questions such as: does a given program terminate,
or do two programs compute the same function?
• Regular expressions are a restricted model of com-
putation that is often useful to capture tasks of
string matching. We can test efficiently whether
an expression matches a string, as well as answer
questions such as Halting and Equivalence.
• Context-free grammars are a stronger, yet still not Turing-complete, model of computation. The halting
problem for context free grammars is computable,
but equivalence is not computable.

10.5 EXERCISES
Exercise 10.1 — Closure properties of context-free functions. Suppose that
𝐹 , 𝐺 ∶ {0, 1}∗ → {0, 1} are context free. For each one of the following
definitions of the function 𝐻, either prove that 𝐻 is always context
free or give a counterexample for regular 𝐹 , 𝐺 that would make 𝐻 not
context free.

1. 𝐻(𝑥) = 𝐹 (𝑥) ∨ 𝐺(𝑥).

2. 𝐻(𝑥) = 𝐹 (𝑥) ∧ 𝐺(𝑥)

3. 𝐻(𝑥) = NAND(𝐹 (𝑥), 𝐺(𝑥)).

4. 𝐻(𝑥) = 𝐹 (𝑥𝑅 ) where 𝑥𝑅 is the reverse of 𝑥: 𝑥𝑅 = 𝑥𝑛−1 𝑥𝑛−2 ⋯ 𝑥0 for


𝑛 = |𝑥|.

           ⎧ 1   𝑥 = 𝑢𝑣 s.t. 𝐹 (𝑢) = 𝐺(𝑣) = 1
5. 𝐻(𝑥) = ⎨
           ⎩ 0   otherwise

           ⎧ 1   𝑥 = 𝑢𝑢 s.t. 𝐹 (𝑢) = 𝐺(𝑢) = 1
6. 𝐻(𝑥) = ⎨
           ⎩ 0   otherwise

           ⎧ 1   𝑥 = 𝑢𝑢𝑅 s.t. 𝐹 (𝑢) = 𝐺(𝑢) = 1
7. 𝐻(𝑥) = ⎨
           ⎩ 0   otherwise

Exercise 10.2 Prove that the function 𝐹 ∶ {0, 1}∗ → {0, 1} such that
𝐹 (𝑥) = 1 if and only if |𝑥| is a power of two is not context free.


Exercise 10.3 — Syntax for programming languages. Consider the following
syntax of a “programming language” whose source can be written
using the ASCII character set:

• Variables are obtained by a sequence of letters, numbers and under-


scores, but can’t start with a number.

• A statement has either the form foo = bar; where foo and bar are
variables, or the form IF (foo) BEGIN ... END where ... is list
of one or more statements, potentially separated by newlines.

A program in our language is simply a sequence of statements (pos-


sibly separated by newlines or spaces).

1. Let VAR ∶ {0, 1}∗ → {0, 1} be the function that given a string
𝑥 ∈ {0, 1}∗ , outputs 1 if and only if 𝑥 corresponds to an ASCII
encoding of a valid variable identifier. Prove that VAR is regular.

2. Let SYN ∶ {0, 1}∗ → {0, 1} be the function that given a string
𝑠 ∈ {0, 1}∗ , outputs 1 if and only if 𝑠 is an ASCII encoding of a valid
program in our language. Prove that SYN is context free. (You do
not have to specify the full formal grammar for SYN, but you need
to show that such a grammar exists.)
3. Prove that SYN is not regular. See footnote for hint.²

² Try to see if you can “embed” in some way a function that looks similar to MATCHPAREN in SYN, so you can use a similar proof. Of course for a function to be non-regular, it does not need to utilize literal parentheses symbols.

10.6 BIBLIOGRAPHICAL NOTES


As in the case of regular expressions, there are many resources avail-
able that cover context-free grammar in great detail. Chapter 2 of
[Sip97] contains many examples of context-free grammars and their
properties. There are also websites such as Grammophone where you
can input grammars, and see what strings they generate, as well as
some of the properties that they satisfy.
The adjective “context free” is used for CFG’s because a rule of
the form 𝑣 ↦ 𝑎 means that we can always replace 𝑣 with the string
𝑎, no matter what is the context in which 𝑣 appears. More generally,
we might want to consider cases where the replacement rules depend
on the context. This gives rise to the notion of general (aka “Type 0”)
grammars that allow rules of the form 𝑎 ⇒ 𝑏 where both 𝑎 and 𝑏 are
strings over (𝑉 ∪ Σ)∗ . The idea is that if, for example, we wanted to
enforce the condition that we only apply some rule such as 𝑣 ↦ 0𝑤1
when 𝑣 is surrounded by three zeroes on both sides, then we could do
so by adding a rule of the form 000𝑣000 ↦ 0000𝑤1000 (and of course
we can add much more general conditions). Alas, this generality

comes at a cost - general grammars are Turing complete and hence


their halting problem is uncomputable. That is, there is no algorithm
𝐴 that can determine for every general grammar 𝐺 and a string 𝑥,
whether or not the grammar 𝐺 generates 𝑥.
The Chomsky Hierarchy is a hierarchy of grammars from the least
restrictive (most powerful) Type 0 grammars, which correspond to
recursively enumerable languages (see Exercise 9.10) to the most re-
strictive Type 3 grammars, which correspond to regular languages.
Context-free languages correspond to Type 2 grammars. Type 1 gram-
mars are context sensitive grammars. These are more powerful than
context-free grammars but still less powerful than Turing machines.
In particular functions/languages corresponding to context-sensitive
grammars are always computable, and in fact can be computed by a
linear bounded automaton, which is a non-deterministic algorithm
that takes 𝑂(𝑛) space. For this reason, the class of functions/languages
corresponding to context-sensitive grammars is also known as the
complexity class NSPACE(𝑂(𝑛)) (we discuss space-bounded com-
plexity in Chapter 17). While Rice’s Theorem implies that we cannot
compute any non-trivial semantic property of Type 0 grammars, the
situation is more complex for other types of grammars: some seman-
tic properties can be determined and some cannot, depending on the
grammar’s place in the hierarchy.
Learning Objectives:
• More examples of uncomputable functions
that are not as tied to computation.
• Gödel’s incompleteness theorem - a result
that shook the world of mathematics in the
early 20th century.

11
Is every theorem provable?

“Take any definite unsolved problem, such as … the existence of an infinite


number of prime numbers of the form 2^𝑛 + 1. However unapproachable these
problems may seem to us and however helpless we stand before them, we have,
nevertheless, the firm conviction that their solution must follow by a finite
number of purely logical processes…”
“…This conviction of the solvability of every mathematical problem is a pow-
erful incentive to the worker. We hear within us the perpetual call: There is the
problem. Seek its solution. You can find it by pure reason, for in mathematics
there is no ignorabimus.”, David Hilbert, 1900.

“The meaning of a statement is its method of verification.”, Moritz Schlick,


1938 (aka “The verification principle” of logical positivism)

The problems shown uncomputable in Chapter 9, while natural


and important, still intimately involved NAND-TM programs or other
computing mechanisms in their definitions. One could perhaps hope
that as long as we steer clear of functions whose inputs are themselves
programs, we can avoid the “curse of uncomputability”. Alas, we have
no such luck.
In this chapter we will see an example of a natural and seemingly
“computation free” problem that nevertheless turns out to be uncom-
putable: solving Diophantine equations. As a corollary, we will see
one of the most striking results of 20th century mathematics: Gödel’s
Incompleteness Theorem, which showed that there are some mathemat-
ical statements (in fact, in number theory) that are inherently unprov-
able. We will actually start with the latter result, and then show the
former.

This chapter: A non-mathy overview


The marquee result of this chapter is Gödel’s Incompleteness
Theorem, which states that for every proof system, there are
some statements about arithmetic that are true but unprovable
in this system. But more than that we will see a deep connec-




tion between uncomputability and unprovability. For example,


the uncomputability of the Halting problem immediately
gives rise to the existence of unprovable statements about
Turing machines. To even state Gödel’s Incompleteness The-
orem we will need to formally define the notion of a “proof
system”. We give a very general definition, that encompasses
all types of “axioms + inference rules” systems used in logic
and math. We will then build up the machinery to encode
computation using arithmetic that will enable us to prove
Gödel’s Theorem.

Figure 11.1: Outline of the results of this chapter. One


version of Gödel’s Incompleteness Theorem is an
immediate consequence of the uncomputability of the
Halting problem. To obtain the theorem as originally
stated (for statements about the integers) we first
prove that the QMS problem of determining truth
of quantified statements involving both integers and
strings is uncomputable. We do so using the notion of
Turing Machine configurations but there are alternative
approaches to do so as well, see Remark 11.14.

11.1 HILBERT’S PROGRAM AND GÖDEL’S INCOMPLETENESS


THEOREM
“And what are these …vanishing increments? They are neither finite quanti-
ties, nor quantities infinitely small, nor yet nothing. May we not call them the
ghosts of departed quantities?”, George Berkeley, Bishop of Cloyne, 1734.

The 1700’s and 1800’s were a time of great discoveries in mathe-


matics but also of several crises. The discovery of calculus by Newton
and Leibniz in the late 1600’s ushered in a golden age of problem solv-
ing. Many longstanding challenges succumbed to the new tools that
were discovered, and mathematicians got ever better at doing some
truly impressive calculations. However, the rigorous foundations
behind these calculations left much to be desired. Mathematicians
manipulated infinitesimal quantities and infinite series cavalierly, and
while most of the time they ended up with the correct results, there
were a few strange examples (such as trying to calculate the value
of the infinite series 1 − 1 + 1 − 1 + 1 + …) which seemed to give
out different answers depending on the method of calculation. This
led to a growing sense of unease in the foundations of the subject

which was addressed in the works of mathematicians such as Cauchy,


Weierstrass, and Riemann, who eventually placed analysis on firmer
foundations, giving rise to the 𝜖’s and 𝛿’s that students taking honors
calculus grapple with to this day.
In the beginning of the 20th century, there was an effort to replicate
this effort, in greater rigor, to all parts of mathematics. The hope was
to show that all the true results of mathematics can be obtained by
starting with a number of axioms, and deriving theorems from them
using logical rules of inference. This effort was known as the Hilbert
program, named after the influential mathematician David Hilbert.
Alas, it turns out the results we’ve seen dealt a devastating blow to
this program, as was shown by Kurt Gödel in 1931:

Theorem 11.1 — Gödel’s Incompleteness Theorem: informal version. For every sound proof system 𝑉 for sufficiently rich mathematical statements, there is a mathematical statement that is true but is not provable in 𝑉.

11.1.1 Defining “Proof Systems”


Before proving Theorem 11.1, we need to define “proof systems” and
even formally define the notion of a “mathematical statement”. In
geometry and other areas of mathematics, proof systems are often
defined by starting with some basic assumptions or axioms and then
deriving more statements by using inference rules such as the famous
Modus Ponens, but what axioms shall we use? What rules? We will
use an extremely general notion of proof systems, not even restricting
ourselves to ones that have the form of axioms and inference.

Mathematical statements. At the highest level, a mathematical statement


is simply a piece of text, which we can think of as a string 𝑥 ∈ {0, 1}∗ .
Mathematical statements contain assertions whose truth does not
depend on any empirical fact, but rather only on properties of abstract
objects. For example, the following is a mathematical statement:¹

¹ This happens to be a false statement.

“The number 2,696,635,869,504,783,333,238,805,675,613, 588,278,597,832,162,617,892,474,670,798,113


is prime”.

Mathematical statements do not have to involve numbers. They


can assert properties of any other mathematical object including sets,
strings, functions, graphs and yes, even programs. Thus, another exam-
ample of a mathematical statement is the following:²

² It is unknown whether this statement is true or false.

The following Python function halts on every positive integer n

def f(n):
    if n==1: return 1
    return f(3*n+1) if n % 2 else f(n//2)

Proof systems. A proof for a statement 𝑥 ∈ {0, 1}∗ is another piece of


text 𝑤 ∈ {0, 1}∗ that certifies the truth of the statement asserted in 𝑥.
The conditions for a valid proof system are:

1. (Effectiveness) Given a statement 𝑥 and a proof 𝑤, there is an algo-


rithm to verify whether or not 𝑤 is a valid proof for 𝑥. (For exam-
ple, by going line by line and checking that each line follows from
the preceding ones using one of the allowed inference rules.)

2. (Soundness) If there is a valid proof 𝑤 for 𝑥 then 𝑥 is true.

These are quite minimal requirements for a proof system. Require-


ment 2 (soundness) is the very definition of a proof system: you
shouldn’t be able to prove things that are not true. Requirement 1
is also essential. If there is no set of rules (i.e., an algorithm) to check
that a proof is valid then in what sense is it a proof system? We could
replace it with a system where the “proof” for a statement 𝑥 is “trust
me: it’s true”.
We formally define proof systems as an algorithm 𝑉 where
𝑉 (𝑥, 𝑤) = 1 holds if the string 𝑤 is a valid proof for the statement 𝑥.
Even if 𝑥 is true, the string 𝑤 does not have to be a valid proof for it
(there are plenty of wrong proofs for true statements such as 4=2+2)
but if 𝑤 is a valid proof for 𝑥 then 𝑥 must be true.

Definition 11.2 — Proof systems. Let 𝒯 ⊆ {0, 1}∗ be some set (which we consider the “true” statements). A proof system for 𝒯 is an algorithm 𝑉 that satisfies:

1. (Effectiveness) For every 𝑥, 𝑤 ∈ {0, 1}∗ , 𝑉 (𝑥, 𝑤) halts with an out-


put of either 0 or 1.

2. (Soundness) For every 𝑥 ∉ 𝒯 and 𝑤 ∈ {0, 1}∗ , 𝑉 (𝑥, 𝑤) = 0.

A true statement 𝑥 ∈ 𝒯 is unprovable (with respect to 𝑉 ) if for


every 𝑤 ∈ {0, 1}∗ , 𝑉 (𝑥, 𝑤) = 0. We say that 𝑉 is complete if there
does not exist a true statement 𝑥 that is unprovable with respect to
𝑉.
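As a toy illustration of this definition (not an example from the book), consider statements of the form “𝑁 is composite”, where a “proof” is a string encoding a non-trivial divisor of 𝑁. The Python verifier below is effective and sound, and in this particular toy case it also happens to be complete:

def V(x: str, w: str) -> int:
    # Effectiveness: V always halts.  Soundness: V(x, w) = 1 only if x is true.
    try:
        if not x.endswith(" is composite"):
            return 0
        n = int(x[: -len(" is composite")])
        d = int(w)
        return 1 if 1 < d < n and n % d == 0 else 0
    except ValueError:
        return 0

print(V("91 is composite", "7"))  # 1: the string "7" is a valid proof
print(V("97 is composite", "7"))  # 0: no valid proof exists, since 97 is prime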

 Big Idea 15 A proof is just a string of text whose meaning is given


by a verification algorithm.

11.2 GÖDEL’S INCOMPLETENESS THEOREM: COMPUTATIONAL


VARIANT
Our first formalization of Theorem 11.1 involves statements about
Turing machines. We let ℋ be the set of true statements 𝑥 ∈ {0, 1}∗ of
the form “Turing machine 𝑀 does not halt on the zero input”.

Theorem 11.3 — Gödel’s Incompleteness Theorem: computational variant.


There does not exist a complete proof system for ℋ.

Proof Idea:
If we had such a complete and sound proof system then we could
solve the HALTONZERO problem. On input a Turing machine 𝑀 , we
would in parallel run the machine on the input zero, as well as search
all purported proofs 𝑤 and output 0 if we find a proof of “𝑀 does
not halt on zero”. If the system is sound and complete then either the
machine will halt or we will eventually find such a proof, and it will
provide us with the correct output.

Proof of Theorem 11.3. Assume for the sake of contradiction that there
was such a proof system 𝑉 . We will use 𝑉 to build an algorithm 𝐴
that computes HALTONZERO, hence contradicting Theorem 9.9. Our
algorithm 𝐴 will work as follows:

Algorithm 11.4 — Halting from proofs.

Input: Turing machine 𝑀


Output: 1 if 𝑀 halts on input 0; 0 otherwise.
1: for 𝑛 = 1, 2, 3, … do
2: for 𝑤 ∈ {0, 1}𝑛 do
3: if 𝑉 (”𝑀 does not halt on 0”, 𝑤) = 1 then
4: return 0
5: end if
6: Simulate 𝑀 for 𝑛 steps on 0.
7: if 𝑀 halts then
8: return 1
9: end if
10: end for
11: end for

If 𝑀 halts on zero within 𝑁 steps, then by the soundness of the


proof system, there will not exist a proof for “𝑀 does not halt on
0”, and so we will never return 0. By the time we get to 𝑛 = 𝑁
in the loop, we will simulate 𝑀 for 𝑁 steps and so return 1. On the other
hand, if 𝑀 does not halt on 0, then we will never return 1. Because
382 i n trod u c ti on to the ore ti ca l comp u te r sc i e nc e

the proof system is complete, there exists 𝑤 that proves this fact, and
so when Algorithm 𝐴 reaches 𝑛 = |𝑤| we will eventually find this
𝑤 and output 0. Hence under the assumption that the proof system
is complete and sound, 𝐴(𝑀 ) solves the HALTONZERO function,
yielding a contradiction.
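The following Python sketch mirrors the structure of Algorithm 11.4. The helpers verify (standing in for the proof system 𝑉) and step (simulating 𝑀 on input 0 for a given number of steps) are hypothetical placeholders and are not specified here:

from itertools import product

def all_strings_of_length(n):
    for bits in product("01", repeat=n):
        yield "".join(bits)

def halt_on_zero(M, verify, step):
    # verify(statement, w): hypothetical placeholder for the proof system V.
    # step(M, n): hypothetical placeholder that simulates M on input 0 for n steps
    #             and returns True if M has halted by then.
    n = 1
    while True:
        statement = f"{M} does not halt on 0"
        for w in all_strings_of_length(n):
            if verify(statement, w) == 1:
                return 0
        if step(M, n):
            return 1
        n += 1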

R
Remark 11.5 — The Gödel statement (optional). One can
extract from the proof of Theorem 11.3 a procedure
that for every proof system 𝑉 , yields a true statement
𝑥∗ that cannot be proven in 𝑉 . But Gödel’s proof
gave a very explicit description of such a statement 𝑥∗
which is closely related to the “Liar’s paradox”. That
is, Gödel’s statement 𝑥∗ was designed to be true if and
only if ∀𝑤∈{0,1}∗ 𝑉 (𝑥, 𝑤) = 0. In other words, it satisfied
the following property

𝑥∗ is true ⇔ 𝑥∗ does not have a proof in 𝑉 (11.1)

One can see that if 𝑥∗ is true, then it does not have a


proof, but if it is false then (assuming the proof sys-
tem is sound) then it cannot have a proof, and hence
𝑥∗ must be both true and unprovable. One might
wonder how is it possible to come up with an 𝑥∗ that
satisfies a condition such as (11.1) where the same
string 𝑥∗ appears on both the right-hand side and the
left-hand side of the equation. The idea is that the
proof of Theorem 11.3 yields a way to transform every
statement 𝑥 into a statement 𝐹 (𝑥) that is true if and
only if 𝑥 does not have a proof in 𝑉 . Thus 𝑥∗ needs to
be a fixed point of 𝐹 : a sentence such that 𝑥∗ = 𝐹 (𝑥∗ ).
It turns out that we can always find such a fixed point
of 𝐹 . We’ve already seen this phenomenon in the 𝜆
calculus, where the 𝑌 combinator maps every 𝐹 into
a fixed point 𝑌 𝐹 of 𝐹 . This is very related to the idea
of programs that can print their own code. Indeed,
Scott Aaronson likes to describe Gödel’s statement as
follows:

The following sentence repeated twice, the sec-


ond time in quotes, is not provable in the formal
system 𝑉 . “The following sentence repeated
twice, the second time in quotes, is not provable
in the formal system 𝑉 .”

In the argument above we actually showed that 𝑥∗ is


true, under the assumption that 𝑉 is sound. Since 𝑥∗
is true and does not have a proof in 𝑉 , this means that
we cannot carry the above argument in the system 𝑉 ,
which means that 𝑉 cannot prove its own soundness
(or even consistency: that there is no proof of both a

statement and its negation). Using this idea, it’s not


hard to get Gödel’s second incompleteness theorem,
which says that every sufficiently rich 𝑉 cannot prove
its own consistency. That is, if we formalize the state-
ment 𝑐∗ that is true if and only if 𝑉 is consistent (i.e.,
𝑉 cannot prove both a statement and the statement’s
negation), then 𝑐∗ cannot be proven in 𝑉 .

11.3 QUANTIFIED INTEGER STATEMENTS


There is something “unsatisfying” about Theorem 11.3. Sure, it shows
there are statements that are unprovable, but they don’t feel like “real”
statements about math. After all, they talk about programs rather than
numbers, matrices, or derivatives, or whatever it is they teach in math
courses. It turns out that we can get an analogous result for statements
such as “there are no positive integers 𝑥 and 𝑦 such that 𝑥^2 − 2 =
𝑦^7 ”, or “there are positive integers 𝑥, 𝑦, 𝑧 such that 𝑥^2 + 𝑦^6 = 𝑧^11 ”
that only talk about natural numbers. It doesn’t get much more “real
math” than this. Indeed, the 19th century mathematician Leopold
Kronecker famously said that “God made the integers, all else is the
work of man.” (By the way, the status of the above two statements is
unknown.)
To make this more precise, let us define the notion of quantified
integer statements:

Definition 11.6 — Quantified integer statements. A quantified integer state-
ment is a well-formed statement with no unbound variables involv-
ing integers, variables, the operators >, <, ×, +, −, =, the logical
operations ¬ (NOT), ∧ (AND), and ∨ (OR), as well as quantifiers
of the form ∃𝑥∈ℕ and ∀𝑦∈ℕ where 𝑥, 𝑦 are variable names.

We often care deeply about determining the truth of quantified


integer statements. For example, the statement that Fermat’s Last
Theorem is true for 𝑛 = 3 can be phrased as the quantified integer
statement

¬∃𝑎∈ℕ ∃𝑏∈ℕ ∃𝑐∈ℕ (𝑎 > 0)∧(𝑏 > 0)∧(𝑐 > 0)∧(𝑎 × 𝑎 × 𝑎 + 𝑏 × 𝑏 × 𝑏 = 𝑐 × 𝑐 × 𝑐) .

The twin prime conjecture, that states that there is an infinite num-
ber of numbers 𝑝 such that both 𝑝 and 𝑝 + 2 are primes can be phrased
as the quantified integer statement

∀𝑛∈ℕ ∃𝑝∈ℕ (𝑝 > 𝑛) ∧ PRIME(𝑝) ∧ PRIME(𝑝 + 2)

where we replace an instance of PRIME(𝑞) with the statement (𝑞 >


1) ∧ ∀𝑎∈ℕ ∀𝑏∈ℕ (𝑎 = 1) ∨ (𝑎 = 𝑞) ∨ ¬(𝑎 × 𝑏 = 𝑞).

The claim (mentioned in Hilbert’s quote above) that there are infinitely
many primes of the form 𝑝 = 2^𝑛 + 1 can be phrased as follows:

∀𝑛∈ℕ ∃𝑝∈ℕ (𝑝 > 𝑛) ∧ PRIME(𝑝) ∧ (∀𝑘∈ℕ (𝑘 ≠ 2 ∧ PRIME(𝑘)) ⇒ ¬DIVIDES(𝑘, 𝑝 − 1))   (11.2)

where DIVIDES(𝑎, 𝑏) is the statement ∃𝑐∈ℕ 𝑎 × 𝑐 = 𝑏. In English, this


corresponds to the claim that for every 𝑛 there is some 𝑝 > 𝑛 such that
all of 𝑝 − 1’s prime factors are equal to 2.
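As an aside, for the two “macros” DIVIDES and PRIME the quantified variables can be bounded in terms of the inputs, so their definitions can be evaluated directly by brute force. The Python sketch below (an illustration only, not part of the formal definitions) does exactly that; for general quantified integer statements no such bound exists, which is what Theorem 11.9 below is about.

def DIVIDES(a: int, b: int) -> bool:
    # "exists c in N with a*c = b"; here c can be bounded by b.
    return any(a * c == b for c in range(b + 1))

def PRIME(q: int) -> bool:
    # "(q > 1) and for all a, b: (a = 1) or (a = q) or not (a*b = q)";
    # here a and b can be bounded by q.
    return q > 1 and all(a == 1 or a == q or a * b != q
                         for a in range(q + 1) for b in range(q + 1))

print([p for p in range(20) if PRIME(p)])  # [2, 3, 5, 7, 11, 13, 17, 19]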

R
Remark 11.7 — Syntactic sugar for quantified integer
statements. To make our statements more readable,
we often use syntactic sugar and so write 𝑥 ≠ 𝑦 as
shorthand for ¬(𝑥 = 𝑦), and so on. Similarly, the
“implication operator” 𝑎 ⇒ 𝑏 is “syntactic sugar” or
shorthand for ¬𝑎 ∨ 𝑏, and the “if and only if operator”
𝑎 ⇔ 𝑏 is shorthand for (𝑎 ⇒ 𝑏) ∧ (𝑏 ⇒ 𝑎). We will
also allow ourselves the use of “macros”: plugging in
one quantified integer statement in another, as we did
with DIVIDES and PRIME above.

Much of number theory is concerned with determining the truth


of quantified integer statements. Since our experience has been that,
given enough time (which could sometimes be several centuries) hu-
manity has managed to do so for the statements that it cared enough
about, one could (as Hilbert did) hope that eventually we would be
able to prove or disprove all such statements. Alas, this turns out to be
impossible:

Theorem 11.8 — Gödel’s Incompleteness Theorem for quantified integer state-


ments. Let 𝑉 ∶ {0, 1}∗ → {0, 1} be a computable purported verification
procedure for quantified integer statements. Then either:

• 𝑉 is not sound: There exists a false statement 𝑥 and a string


𝑤 ∈ {0, 1}∗ such that 𝑉 (𝑥, 𝑤) = 1.

or

• 𝑉 is not complete: There exists a true statement 𝑥 such that for


every 𝑤 ∈ {0, 1}∗ , 𝑉 (𝑥, 𝑤) = 0.

Theorem 11.8 is a direct corollary of the following result, just


as Theorem 11.3 was a direct corollary of the uncomputability of
HALTONZERO:

Theorem 11.9 — Uncomputability of quantified integer statements. Let QIS ∶ {0, 1}∗ → {0, 1} be the function that given a (string representation of) a quantified integer statement outputs 1 if it is true and 0 if it is false. Then QIS is uncomputable.

Since a quantified integer statement is simply a sequence of sym-


bols, we can easily represent it as a string. For simplicity we will as-
sume that every string represents some quantified integer statement,
by mapping strings that do not correspond to such a statement to an
arbitrary statement such as ∃𝑥∈ℕ 𝑥 = 1.

P
Please stop here and make sure you understand
why the uncomputability of QIS (i.e., Theorem 11.9)
means that there is no sound and complete proof
system for proving quantified integer statements (i.e.,
Theorem 11.8). This follows in the same way that
Theorem 11.3 followed from the uncomputability of
HALTONZERO, but working out the details is a great
exercise (see Exercise 11.1)

In the rest of this chapter, we will show the proof of Theorem 11.8,
following the outline illustrated in Fig. 11.1.

11.4 DIOPHANTINE EQUATIONS AND THE MRDP THEOREM


Many of the functions people wanted to compute over the years in-
volved solving equations. These have a much longer history than
mechanical computers. The Babylonians already knew how to solve
some quadratic equations in 2000BC, and the formula for all quadrat-
ics appears in the Bakhshali Manuscript that was composed in India
around the 3rd century. During the Renaissance, Italian mathemati-
cians discovered generalization of these formulas for cubic and quar-
tic (degrees 3 and 4) equations. Many of the greatest minds of the
17th and 18th century, including Euler, Lagrange, Leibniz and Gauss
worked on the problem of finding such a formula for quintic equations
to no avail, until in the 19th century Ruffini, Abel and Galois showed
that no such formula exists, along the way giving birth to group theory.
However, the fact that there is no closed-form formula does
not mean we can not solve such equations. People have been
solving higher degree equations numerically for ages. The Chinese
manuscript Jiuzhang Suanshu from the first century mentions such
approaches. Solving polynomial equations is by no means restricted
only to ancient history or to students’ homework. The gradient
descent method is the workhorse powering many of the machine

learning tools that have revolutionized Computer Science over the last
several years.
But there are some equations that we simply do not know how to solve by any means. For example, it took more than 200 years until people succeeded in proving that the equation 𝑎^11 + 𝑏^11 = 𝑐^11 has no solution in integers.³ The notorious difficulty of so called Diophantine equations (i.e., finding integer roots of a polynomial) motivated the mathematician David Hilbert in 1900 to include the question of finding a general procedure for solving such equations in his famous list of twenty-three open problems for mathematics of the 20th century. I don’t think Hilbert doubted that such a procedure exists. After all, the whole history of mathematics up to this point involved the discovery of ever more powerful methods, and even impossibility results such as the inability to trisect an angle with a straightedge and compass, or the non-existence of an algebraic formula for quintic equations, merely pointed out the need to use more general methods.

Figure 11.2: Diophantine equations such as finding a positive integer solution to the equation 𝑎(𝑎 + 𝑏)(𝑎 + 𝑐) + 𝑏(𝑏 + 𝑎)(𝑏 + 𝑐) + 𝑐(𝑐 + 𝑎)(𝑐 + 𝑏) = 4(𝑎 + 𝑏)(𝑎 + 𝑐)(𝑏 + 𝑐) (depicted more compactly and whimsically above) can be surprisingly difficult. There are many equations for which we do not know if they have a solution, and there is no algorithm to solve them in general. The smallest solution for this equation has 80 digits! See this Quora post for more information, including the credits for this image.

³ This is a special case of what’s known as “Fermat’s Last Theorem” which states that 𝑎^𝑛 + 𝑏^𝑛 = 𝑐^𝑛 has no solution in integers for 𝑛 > 2. This was conjectured in 1637 by Pierre de Fermat but only proven by Andrew Wiles in 1994. The case 𝑛 = 11 (along with all other so called “regular prime exponents”) was established by Kummer in 1850.

Alas, this turned out not to be the case for Diophantine equations. In 1970, Yuri Matiyasevich, building on a decades long line of work by Martin Davis, Hilary Putnam and Julia Robinson, showed that there is simply no method to solve such equations in general:

Theorem 11.10 — MRDP Theorem. Let DIO ∶ {0, 1}∗ → {0, 1} be the function that takes as input a string describing a 100-variable polynomial with integer coefficients 𝑃 (𝑥0 , … , 𝑥99 ) and outputs 1 if and only if there exists 𝑧0 , … , 𝑧99 ∈ ℕ s.t. 𝑃 (𝑧0 , … , 𝑧99 ) = 0.
Then DIO is uncomputable.

As usual, we assume some standard way to express numbers and


text as binary strings. The constant 100 is of course arbitrary; the prob-
lem is known to be uncomputable even for polynomials of degree
four and at most 58 variables. In fact the number of variables can be
reduced to nine, at the expense of the polynomial having a larger (but
still constant) degree. See Jones’s paper for more about this issue.

R
Remark 11.11 — Active code vs static data. The diffi-
culty in finding a way to distinguish between “code”
such as NAND-TM programs, and “static content”
such as polynomials is just another manifestation of
the phenomenon that code is the same as data. While
a fool-proof solution for distinguishing between the
two is inherently impossible, finding heuristics that do
a reasonable job keeps many firewall and anti-virus
manufacturers very busy (and finding ways to bypass
these tools keeps many hackers busy as well).

11.5 HARDNESS OF QUANTIFIED INTEGER STATEMENTS


We will not prove the MRDP Theorem (Theorem 11.10). However, as
we mentioned, we will prove the uncomputability of QIS (i.e., Theo-
rem 11.9), which is a special case of the MRDP Theorem. The reason
is that a Diophantine equation is a special case of a quantified integer
statement where the only quantifier is ∃. This means that deciding the
truth of quantified integer statements is a potentially harder problem
than solving Diophantine equations, and so it is potentially easier to
prove that QIS is uncomputable.

P
If you find the last sentence confusing, it is worth-
while to reread it until you are sure you follow its
logic. We are so accustomed to trying to find solu-
tions for problems that it can sometimes be hard to
follow the arguments for showing that problems are
uncomputable.

Our proof of the uncomputability of QIS (i.e. Theorem 11.9) will, as


usual, go by reduction from the Halting problem, but we will do so in
two steps:

1. We will first use a reduction from the Halting problem to show that
deciding the truth of quantified mixed statements is uncomputable.
Quantified mixed statements involve both strings and integers.
Since quantified mixed statements are a more general concept than
quantified integer statements, it is easier to prove the uncomputabil-
ity of deciding their truth.

2. We will then reduce the problem of quantified mixed statements to


quantified integer statements.

11.5.1 Step 1: Quantified mixed statements and computation histories


We define quantified mixed statements as statements involving not just
integers and the usual arithmetic operators, but also string variables as
well.

Definition 11.12 — Quantified mixed statements. A quantified mixed state-


ment is a well-formed statement with no unbound variables involv-
ing integers, variables, the operators >, <, ×, +, −, =, the logical
operations ¬ (NOT), ∧ (AND), and ∨ (OR), as well as quanti-
fiers of the form ∃𝑥∈ℕ , ∃𝑎∈{0,1}∗ , ∀𝑦∈ℕ , ∀𝑏∈{0,1}∗ where 𝑥, 𝑦, 𝑎, 𝑏 are
variable names. These also include the operator |𝑎| which returns
the length of a string valued variable 𝑎, as well as the operator 𝑎𝑖
where 𝑎 is a string-valued variable and 𝑖 is an integer-valued expression, which is true if 𝑖 is smaller than the length of 𝑎 and the 𝑖-th
coordinate of 𝑎 is 1, and is false otherwise.

For example, the true statement that for every string 𝑎 there is a
string 𝑏 that corresponds to 𝑎 in reverse order can be phrased as the
following quantified mixed statement
∀𝑎∈{0,1}∗ ∃𝑏∈{0,1}∗ (|𝑎| = |𝑏|) ∧ (∀𝑖∈ℕ 𝑖 < |𝑎| ⇒ (𝑎𝑖 ⇔ 𝑏|𝑎|−𝑖−1 )) .
Quantified mixed statements are more general than quantified
integer statements, and so the following theorem is potentially easier
to prove than Theorem 11.9:

Theorem 11.13 — Uncomputability of quantified mixed statements. Let QMS ∶ {0, 1}∗ → {0, 1} be the function that given a (string representation of) a quantified mixed statement outputs 1 if it is true and 0 if it is false. Then QMS is uncomputable.

Proof Idea:
The idea behind the proof is similar to that used in showing that
one-dimensional cellular automata are Turing complete (Theorem 8.7)
as well as showing that equivalence (or even “fullness”) of context
free grammars is uncomputable (Theorem 10.10). We use the notion
of a configuration of a NAND-TM program as in Definition 8.8. Such
a configuration can be thought of as a string 𝛼 over some large-but-
finite alphabet Σ describing its current state, including the values
of all arrays, scalars, and the index variable i. It can be shown that
if 𝛼 is the configuration at a certain step of the execution and 𝛽 is
the configuration at the next step, then 𝛽𝑗 = 𝛼𝑗 for all 𝑗 outside of
{𝑖 − 1, 𝑖, 𝑖 + 1} where 𝑖 is the value of i. In particular, every value 𝛽𝑗 is
simply a function of 𝛼𝑗−1,𝑗,𝑗+1 . Using these observations we can write
a quantified mixed statement NEXT(𝛼, 𝛽) that will be true if and only if
𝛽 is the configuration encoding the next step after 𝛼. Since a program
𝑃 halts on input 𝑥 if and only if there is a sequence of configurations
𝛼0 , … , 𝛼𝑡−1 (known as a computation history) starting with the initial
configuration with input 𝑥 and ending in a halting configuration, we
can define a quantified mixed statement to determine if there is such
a sequence by taking an existential quantifier over all strings 𝐻 (for
history) that encode a tuple (𝛼0 , 𝛼1 , … , 𝛼𝑡−1 ) and then checking that
𝛼0 and 𝛼𝑡−1 are valid starting and halting configurations, and that
NEXT(𝛼𝑗 , 𝛼𝑗+1 ) is true for every 𝑗 ∈ {0, … , 𝑡 − 2}.

Proof of Theorem 11.13. The proof is obtained by a reduction from the


Halting problem. Specifically, we will use the notion of a configura-
tion of a Turing machines (Definition 8.8) that we have seen in the

context of proving that one dimensional cellular automata are Turing


complete. We need the following facts about configurations:

• For every Turing machine 𝑀 , there is a finite alphabet Σ, and a


configuration of 𝑀 is a string 𝛼 ∈ Σ∗ .

• A configuration 𝛼 encodes all the state of the program at a particu-


lar iteration, including the array, scalar, and index variables.

• If 𝛼 is a configuration, then 𝛽 = NEXT𝑃 (𝛼) denotes the configura-


tion of the computation after one more iteration. 𝛽 is a string over Σ
of length either |𝛼| or |𝛼| + 1, and every coordinate of 𝛽 is a function
of just three coordinates in 𝛼. That is, for every 𝑗 ∈ {0, … , |𝛽| − 1},
𝛽𝑗 = MAP𝑃 (𝛼𝑗−1 , 𝛼𝑗 , 𝛼𝑗+1 ) where MAP𝑃 ∶ Σ3 → Σ is some function
depending on 𝑃 .

• There are simple conditions to check whether a string 𝛼 is a valid


starting configuration corresponding to an input 𝑥, as well as to
check whether a string 𝛼 is a halting configuration. In particular
these conditions can be phrased as quantified mixed statements.

• A program 𝑀 halts on input 𝑥 if and only if there exists a sequence


of configurations 𝐻 = (𝛼0 , 𝛼1 , … , 𝛼𝑇 −1 ) such that (i) 𝛼0 is a valid
starting configuration of 𝑀 with input 𝑥, (ii) 𝛼𝑇 −1 is a valid halting
configuration of 𝑃 , and (iii) 𝛼𝑖+1 = NEXT𝑃 (𝛼𝑖 ) for every 𝑖 ∈
{0, … , 𝑇 − 2}.

We can encode such a sequence 𝐻 of configuration as a binary


string. For concreteness, we let ℓ = ⌈log(|Σ| + 1)⌉ and encode each
symbol 𝜎 in Σ ∪ {"; "} by a string in {0, 1}ℓ . We use “;” as a “separator”
symbol, and so encode 𝐻 = (𝛼0 , 𝛼1 , … , 𝛼𝑇 −1 ) as the concatenation
of the encodings of each configuration, using “;” to separate the en-
coding of 𝛼𝑖 and 𝛼𝑖+1 for every 𝑖 ∈ [𝑇 ]. In particular for every Turing
machine 𝑀 , 𝑀 halts on the input 0 if and only if the following state-
ment 𝜑𝑀 is true

∃𝐻∈{0,1}∗ 𝐻 encodes halting configuration sequence starting with input 0 .

If we can encode the statement 𝜑𝑀 as a quantified mixed statement


then, since 𝜑𝑀 is true if and only if HALTONZERO(𝑀 ) = 1, this
would reduce the task of computing HALTONZERO to computing
QMS, and hence imply (using Theorem 9.9 ) that QMS is uncom-
putable, completing the proof. Indeed, 𝜑𝑀 can be encoded as a quan-
tified mixed statement for the following reasons:

1. Let 𝛼, 𝛽 ∈ {0, 1}∗ be two strings that encode configurations of 𝑀 .


We can define a quantified mixed predicate NEXT(𝛼, 𝛽) that is true

if and only if 𝛽 = NEXT𝑀 (𝛼) (i.e., 𝛽 encodes the configuration


obtained by proceeding from 𝛼 in one computational step). Indeed
NEXT(𝛼, 𝛽) is true if for every 𝑖 ∈ {0, … , |𝛽|} which is a multiple
of ℓ, 𝛽𝑖,…,𝑖+ℓ−1 = MAP𝑀 (𝛼𝑖−ℓ,⋯,𝑖+2ℓ−1 ) where MAP𝑀 ∶ {0, 1}3ℓ →
{0, 1}ℓ is the finite function above (identifying elements of Σ with
their encoding in {0, 1}ℓ ). Since MAP𝑀 is a finite function, we can
express it using the logical operations AND,OR, NOT (for example
by computing MAP𝑀 with NAND’s).

2. Using the above we can now write the condition that for every
substring of 𝐻 that has the form 𝛼ENC(; )𝛽 with 𝛼, 𝛽 ∈ {0, 1}ℓ
and ENC(; ) being the encoding of the separator “;”, it holds that
NEXT(𝛼, 𝛽) is true.

3. Finally, if 𝛼0 is a binary string encoding the initial configuration of


𝑀 on input 0, checking that the first |𝛼0 | bits of 𝐻 equal 𝛼0 can be
expressed using AND,OR, and NOT’s. Similarly checking that the
last configuration encoded by 𝐻 corresponds to a state in which 𝑀
will halt can also be expressed as a quantified statement.

Together the above yields a computable procedure that maps every


Turing machine 𝑀 into a quantified mixed statement 𝜑𝑀 such that
HALTONZERO(𝑀 ) = 1 if and only if QMS(𝜑𝑀 ) = 1. This reduces
computing HALTONZERO to computing QMS, and hence the uncom-
putability of HALTONZERO implies the uncomputability of QMS.
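To make the “local checkability” of configurations concrete, the Python sketch below (an illustration only) checks that 𝛽 is the successor of 𝛼 by scanning windows of three symbols; next_window is a hypothetical placeholder for the finite map MAP𝑀, and the exact padding and offset conventions depend on the chosen encoding of configurations:

def is_next_config(alpha: str, beta: str, next_window) -> bool:
    # next_window(left, mid, right): hypothetical placeholder for MAP_M, giving the
    # new value of a coordinate from the three old coordinates around it.
    if len(beta) not in (len(alpha), len(alpha) + 1):
        return False
    padded = "_" + alpha + "__"  # pad with a blank symbol so every window has three entries
    return all(beta[j] == next_window(padded[j], padded[j + 1], padded[j + 2])
               for j in range(len(beta)))

# Checking a whole history (alpha_0, ..., alpha_{T-1}) then amounts to applying
# is_next_config to every consecutive pair, plus checking the first and last entries.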

R
Remark 11.14 — Alternative proofs. There are sev-
eral other ways to show that QMS is uncomputable.
For example, we can express the condition that a 1-
dimensional cellular automaton eventually writes a
“1” to a given cell from a given initial configuration
as a quantified mixed statement over a string encod-
ing the history of all configurations. We can then use
the fact that cellular automatons can simulate Tur-
ing machines (Theorem 8.7) to reduce the halting
problem to QMS. We can also use other well known
uncomputable problems such as tiling or the post cor-
respondence problem. Exercise 11.5 and Exercise 11.6
explore two alternative proofs of Theorem 11.13.

11.5.2 Step 2: Reducing mixed statements to integer statements


We now show how to prove Theorem 11.9 using Theorem 11.13. The
idea is again a proof by reduction. We will show a transformation of
every quantified mixed statement 𝜑 into a quantified integer statement

𝜉 that does not use string-valued variables such that 𝜑 is true if and
only if 𝜉 is true.
To remove string-valued variables from a statement, we encode
every string by a pair of integers. We will show that we can encode a
string 𝑥 ∈ {0, 1}∗ by a pair of numbers (𝑋, 𝑛) ∈ ℕ × ℕ s.t.

• 𝑛 = |𝑥|

• There is a quantified integer statement COORD(𝑋, 𝑖) that for every


𝑖 < 𝑛, will be true if 𝑥𝑖 = 1 and will be false otherwise.

This will mean that we can replace a “for all” quantifier over strings
such as ∀𝑥∈{0,1}∗ with a pair of quantifiers over integers of the form
∀𝑋∈ℕ ∀𝑛∈ℕ (and similarly replace an existential quantifier of the form
∃𝑥∈{0,1}∗ with a pair of quantifiers ∃𝑋∈ℕ ∃𝑛∈ℕ ) . We can then replace all
calls to |𝑥| by 𝑛 and all calls to 𝑥𝑖 by COORD(𝑋, 𝑖). This means that
if we are able to define COORD via a quantified integer statement,
then we obtain a proof of Theorem 11.9, since we can use it to map
every mixed quantified statement 𝜑 to an equivalent quantified inte-
ger statement 𝜉 such that 𝜉 is true if and only if 𝜑 is true, and hence
QMS(𝜑) = QIS(𝜉). Such a procedure implies that the task of comput-
ing QMS reduces to the task of computing QIS, which means that the
uncomputability of QMS implies the uncomputability of QIS.
The above shows that the proof of Theorem 11.9 boils down to find-
ing the right encoding of strings as integers, and the right way to
implement COORD as a quantified integer statement. To achieve this
we use the following technical result :
Lemma 11.15 — Constructible prime sequence. There is a sequence of prime numbers 𝑝0 < 𝑝1 < 𝑝2 < ⋯ such that there is a quantified integer statement PSEQ(𝑝, 𝑖) that is true if and only if 𝑝 = 𝑝𝑖 .
Using Lemma 11.15 we can encode a string 𝑥 ∈ {0, 1}∗ by the numbers
(𝑋, 𝑛) where 𝑋 = ∏_{𝑖∶𝑥𝑖=1} 𝑝𝑖 and 𝑛 = |𝑥|. We can then define the
statement COORD(𝑋, 𝑖) as

COORD(𝑋, 𝑖) = ∃𝑝∈ℕ PSEQ(𝑝, 𝑖) ∧ DIVIDES(𝑝, 𝑋)

where DIVIDES(𝑎, 𝑏), as before, is defined as ∃𝑐∈ℕ 𝑎 × 𝑐 = 𝑏. Note that


indeed if 𝑋, 𝑛 encodes the string 𝑥 ∈ {0, 1}∗ , then for every 𝑖 < 𝑛,
COORD(𝑋, 𝑖) = 𝑥𝑖 , since 𝑝𝑖 divides 𝑋 if and only if 𝑥𝑖 = 1.
Thus all that is left to conclude the proof of Theorem 11.9 is to
prove Lemma 11.15, which we now proceed to do.

Proof. The sequence of prime numbers we consider is the following:


We fix 𝐶 to be a sufficiently large constant (𝐶 = 2^(2^34) will do) and
define 𝑝𝑖 to be the smallest prime number that is in the interval
[(𝑖 + 𝐶)^3 + 1, (𝑖 + 𝐶 + 1)^3 − 1]. It is known that there exists such a prime

number for every 𝑖 ∈ ℕ. Given this, the definition of PSEQ(𝑝, 𝑖) is


simple:

(𝑝 > (𝑖+𝐶)×(𝑖+𝐶)×(𝑖+𝐶)) ∧ (𝑝 < (𝑖+𝐶+1)×(𝑖+𝐶+1)×(𝑖+𝐶+1)) ∧ PRIME(𝑝) ∧ (∀𝑝′∈ℕ ¬PRIME(𝑝′ ) ∨ (𝑝′ ≤ (𝑖+𝐶)×(𝑖+𝐶)×(𝑖+𝐶)) ∨ (𝑝′ ≥ 𝑝))

We leave it to the reader to verify that PSEQ(𝑝, 𝑖) is true iff 𝑝 = 𝑝𝑖 .
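To illustrate the encoding with made-up small parameters (rather than the large constant used in the lemma), the Python sketch below takes 𝐶 = 2, which happens to suffice for the small intervals used here, computes 𝑝𝑖 as the smallest prime in [(𝑖 + 𝐶)^3 + 1, (𝑖 + 𝐶 + 1)^3 − 1], and checks coordinates of an encoded string by divisibility:

def is_prime(q: int) -> bool:
    return q > 1 and all(q % d for d in range(2, int(q ** 0.5) + 1))

C = 2  # a small constant for illustration only; the lemma needs C large enough that
       # every interval below is guaranteed to contain a prime

def p(i: int) -> int:
    # The i-th prime of the sequence: the smallest prime in [(i+C)^3 + 1, (i+C+1)^3 - 1].
    for q in range((i + C) ** 3 + 1, (i + C + 1) ** 3):
        if is_prime(q):
            return q
    raise ValueError("no prime found in the interval; C is too small")

def encode(x: str):
    # Encode x by (X, n) with X the product of p_i over coordinates i where x_i = 1.
    X = 1
    for i, bit in enumerate(x):
        if bit == "1":
            X *= p(i)
    return X, len(x)

def coord(X: int, i: int) -> bool:
    # COORD(X, i) holds iff p_i divides X.
    return X % p(i) == 0

X, n = encode("1011")
print([coord(X, i) for i in range(n)])  # [True, False, True, True]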


To sum up we have shown that for every quantified mixed state-


ment 𝜑, we can compute a quantified integer statement 𝜉 such that
QMS(𝜑) = 1 if and only if QIS(𝜉) = 1. Hence the uncomputability
of QMS (Theorem 11.13) implies the uncomputability of QIS, com-
pleting the proof of Theorem 11.9, and so also the proof of Gödel’s
Incompleteness Theorem for quantified integer statements (Theo-
rem 11.8).

✓ Chapter Recap

• Uncomputable functions include also functions


that seem to have nothing to do with NAND-TM
programs or other computational models such
as determining the satisfiability of Diophantine
equations.
• This also implies that for any sound proof system
(and in particular every finite axiomatic system) 𝑆,
there are interesting statements 𝑋 (namely of the
form “𝐹 (𝑥) = 0” for an uncomputable function
𝐹 ) such that 𝑆 is not able to prove either 𝑋 or its
negation.

11.6 EXERCISES
Exercise 11.1 — Gödel’s Theorem from uncomputability of 𝑄𝐼𝑆 . Prove Theo-
rem 11.8 using Theorem 11.9.

Exercise 11.2 — Proof systems and uncomputability. Let FINDPROOF ∶
{0, 1}∗ → {0, 1} be the following function. On input a Turing machine
𝑉 (which we think of as the verifying algorithm for a proof system)
and a string 𝑥 ∈ {0, 1}∗ , FINDPROOF(𝑉 , 𝑥) = 1 if and only if there
exists 𝑤 ∈ {0, 1}∗ such that 𝑉 (𝑥, 𝑤) = 1.

1. Prove that FINDPROOF is uncomputable.

2. Prove that there exists a Turing machine 𝑉 such that 𝑉 halts on every input 𝑥, 𝑤 but the function FINDPROOF𝑉 defined as FINDPROOF𝑉 (𝑥) = FINDPROOF(𝑉 , 𝑥) is uncomputable. See footnote for hint.⁴

⁴ Hint: think of 𝑥 as saying “Turing machine 𝑀 halts on input 𝑢” and 𝑤 being a proof that is the number of steps that it will take for this to happen. Can you find an always-halting 𝑉 that will verify such statements?

Exercise 11.3 — Expression for floor. Let FSQRT(𝑛, 𝑚) = ∀𝑗∈ℕ ((𝑗 × 𝑗) > 𝑚) ∨ (𝑗 ≤ 𝑛). Prove that FSQRT(𝑛, 𝑚) is true if and only if 𝑛 = ⌊√𝑚⌋.

Exercise 11.4 — axiomatic proof systems. For every representation of logical


statements as strings, we can define an axiomatic proof system to
consist of a finite set of strings 𝐴 and a finite set of rules 𝐼0 , … , 𝐼𝑚−1
with 𝐼𝑗 ∶ ({0, 1}∗ )𝑘𝑗 → {0, 1}∗ such that a proof (𝑠1 , … , 𝑠𝑛 ) that 𝑠𝑛
is true is valid if for every 𝑖, either 𝑠𝑖 ∈ 𝐴 or there is some 𝑗 ∈ [𝑚] and
𝑖1 , … , 𝑖𝑘𝑗 < 𝑖 such that 𝑠𝑖 = 𝐼𝑗 (𝑠𝑖1 , … , 𝑠𝑖𝑘𝑗 ). A system is sound if
there is no false 𝑠 such that there is a proof that 𝑠 is true.
Prove that for every uncomputable function 𝐹 ∶ {0, 1}∗ → {0, 1}
and every sound axiomatic proof system 𝑆 (that is characterized by a
finite number of axioms and inference rules), there is some input 𝑥 for
which the proof system 𝑆 is not able to prove neither that 𝐹 (𝑥) = 0
nor that 𝐹 (𝑥) ≠ 0.

Exercise 11.5 — Post Corrrespondence Problem. In the Post Correspondence


Problem the input is a set 𝑆 = {(𝛼0 , 𝛽 0 ), … , (𝛼𝑐−1 , 𝛽 𝑐−1 )} where each
𝛼𝑖 and 𝛽 𝑗 is a string in {0, 1}∗ . We say that PCP(𝑆) = 1 if and only if
there exists a list (𝛼0 , 𝛽0 ), … , (𝛼𝑚−1 , 𝛽𝑚−1 ) of pairs in 𝑆 such that

𝛼0 𝛼1 ⋯ 𝛼𝑚−1 = 𝛽0 𝛽1 ⋯ 𝛽𝑚−1 .

(We can think of each pair (𝛼, 𝛽) ∈ 𝑆 as a “domino tile” and the question is whether we can stack a list of such tiles so that the top and the bottom yield the same string.) It can be shown that the PCP is uncomputable by a fairly straightforward though somewhat tedious proof (see for example the Wikipedia page for the Post Correspondence Problem or Section 5.2 in [Sip97]).
Use this fact to provide a direct proof that QMS is uncomputable by showing that there exists a computable map 𝑅 ∶ {0, 1}∗ → {0, 1}∗ such that PCP(𝑆) = QMS(𝑅(𝑆)) for every string 𝑆 encoding an instance of the post correspondence problem.

Figure 11.3: In the puzzle problem, the input can be thought of as a finite collection Σ of types of puzzle pieces and the goal is to find out whether or not there is a way to arrange pieces from these types in a rectangle. Formally, we model the input as a pair of functions 𝑚𝑎𝑡𝑐ℎ↔ , 𝑚𝑎𝑡𝑐ℎ↕ ∶ Σ^2 → {0, 1} such that 𝑚𝑎𝑡𝑐ℎ↔ (𝑙𝑒𝑓𝑡, 𝑟𝑖𝑔ℎ𝑡) = 1 (respectively 𝑚𝑎𝑡𝑐ℎ↕ (𝑢𝑝, 𝑑𝑜𝑤𝑛) = 1) if the pair of pieces are compatible when placed in their respective positions. We assume Σ contains a special symbol ∅ corresponding to having no piece, and an arrangement of puzzle pieces by an (𝑚 − 2) × (𝑛 − 2) rectangle is modeled by a string 𝑥 ∈ Σ^{𝑚⋅𝑛} whose “outer coordinates” are ∅ and such that for every 𝑖 ∈ [𝑛 − 1], 𝑗 ∈ [𝑚 − 1], 𝑚𝑎𝑡𝑐ℎ↕ (𝑥𝑖,𝑗 , 𝑥𝑖+1,𝑗 ) = 1 and 𝑚𝑎𝑡𝑐ℎ↔ (𝑥𝑖,𝑗 , 𝑥𝑖,𝑗+1 ) = 1.
Exercise 11.6 — Uncomputability of puzzle. Let PUZZLE ∶ {0, 1}∗ → {0, 1} be
the problem of determining, given a finite collection of types of “puz-
zle pieces”, whether it is possible to put them together in a rectangle,
see Fig. 11.3. Formally, we think of such a collection as a finite set Σ
(see Fig. 11.3). We model the criteria as to which pieces “fit together”
by a pair of finite function 𝑚𝑎𝑡𝑐ℎ↕ , 𝑚𝑎𝑡𝑐ℎ↔ ∶ Σ2 → {0, 1} such that a
piece 𝑎 fits above a piece 𝑏 if and only if 𝑚𝑎𝑡𝑐ℎ↕ (𝑎, 𝑏) = 1 and a piece
𝑐 fits to the left of a piece 𝑑 if and only if 𝑚𝑎𝑡𝑐ℎ↔ (𝑐, 𝑑) = 1. To model

the “straight edge” pieces that can be placed next to a “blank spot”
we assume that Σ contains the symbol ∅ and the matching functions
are defined accordingly. A square tiling of Σ is an 𝑚 × 𝑛 long string
𝑥 ∈ Σ𝑚𝑛 , such that for every 𝑖 ∈ {1, … , 𝑚 − 2} and 𝑗 ∈ {1, … , 𝑛 − 2},
𝑚𝑎𝑡𝑐ℎ(𝑥𝑖,𝑗 , 𝑥𝑖−1,𝑗 , 𝑥𝑖+1,𝑗 , 𝑥𝑖,𝑗−1 , 𝑥𝑖,𝑗+1 ) = 1 (i.e., every “internal pieve”
fits in with the pieces adjacent to it). We also require all of the “outer
pieces” (i.e., 𝑥𝑖,𝑗 where 𝑖 ∈ {0, 𝑚 − 1} of 𝑗 ∈ {0, 𝑛 − 1}) are “blank”
or equal to ∅. The function PUZZLE takes as input a string describing
the set Σ and the function 𝑚𝑎𝑡𝑐ℎ and outputs 1 if and only if there is
some square tiling of Σ: some not all blank string 𝑥 ∈ Σ𝑚𝑛 satisfying
the above condition.

1. Prove that PUZZLE is uncomputable.

2. Give a reduction from PUZZLE to QMS.

Exercise 11.7 — MRDP exercise. The MRDP theorem states that the
problem of determining, given a 𝑘-variable polynomial 𝑝 with integer
coefficients, whether there exists integers 𝑥0 , … , 𝑥𝑘−1 such that
𝑝(𝑥0 , … , 𝑥𝑘−1 ) = 0 is uncomputable. Consider the following quadratic
integer equation problem: the input is a list of polynomials 𝑝0 , … , 𝑝𝑚−1
over 𝑘 variables with integer coefficients, where each of the polynomi-
als is of degree at most two (i.e., it is a quadratic function). The goal
is to determine whether there exist integers 𝑥0 , … , 𝑥𝑘−1 that solve the
equations 𝑝0 (𝑥) = ⋯ = 𝑝𝑚−1 (𝑥) = 0.
Use the MRDP Theorem to prove that this problem is uncom-
putable. That is, show that the function QUADINTEQ ∶ {0, 1}∗ →
{0, 1} is uncomputable, where this function gets as input a string de-
scribing the polynomials 𝑝0 , … , 𝑝𝑚−1 (each with integer coefficients
and degree at most two), and outputs 1 if and only if there exists
𝑥0 , … , 𝑥𝑘−1 ∈ ℤ such that for every 𝑖 ∈ [𝑚], 𝑝𝑖 (𝑥0 , … , 𝑥𝑘−1 ) = 0. See footnote for hint.⁵

⁵ You can replace the equation 𝑦 = 𝑥^4 with the pair of equations 𝑦 = 𝑧^2 and 𝑧 = 𝑥^2. Also, you can replace the equation 𝑤 = 𝑥^6 with the three equations 𝑤 = 𝑦𝑢, 𝑦 = 𝑥^4 and 𝑢 = 𝑥^2.

Exercise 11.8 — The Busy Beaver problem. In this question we define the NAND-TM variant of the busy beaver function.

1. We define the function 𝑇 ∶ {0, 1}∗ → ℕ as follows: for every


string 𝑃 ∈ {0, 1}∗ , if 𝑃 represents a NAND-TM program such that
when 𝑃 is executed on the input 0 (i.e., the string of length 1 that is
simply 0), a total of 𝑀 lines are executed before the program halts,
then 𝑇 (𝑃 ) = 𝑀 . Otherwise (if 𝑃 does not represent a NAND-TM
program, or it is a program that does not halt on 0), 𝑇 (𝑃 ) = 0.
Prove that 𝑇 is uncomputable.

2. Let TOWER(𝑛) denote the number 2^(2^(⋯^2)) with 𝑛 twos (that is, a “tower of powers of two” of height 𝑛). To get a sense of how fast this function grows, TOWER(1) = 2, TOWER(2) = 2^2 = 4, TOWER(3) = 2^(2^2) = 16, TOWER(4) = 2^16 = 65536 and TOWER(5) = 2^65536 which is about 10^20000. TOWER(6) is already a number that is too big to
write even in scientific notation. Define NBB ∶ ℕ → ℕ (for “NAND-
TM Busy Beaver”) to be the function NBB(𝑛) = max𝑃 ∈{0,1}𝑛 𝑇 (𝑃 )
where 𝑇 ∶ ℕ → ℕ is the function defined in Item 1. Prove that
NBB grows faster than TOWER, in the sense that TOWER(𝑛) =
𝑜(NBB(𝑛)) (i.e., for every 𝜖 > 0, there exists 𝑛0 such that for every 6
You will not need to use very specific properties of
𝑛 > 𝑛0 , TOWER(𝑛) < 𝜖 ⋅ NBB(𝑛).).6 the TOWER function in this exercise. For example,
NBB(𝑛) also grows faster than the Ackerman func-
■ tion. You might find Aaronson’s blog post on the
same topic to be quite interesting, and relevant to this
book at large. If you like it then you might also enjoy
11.7 BIBLIOGRAPHICAL NOTES this piece by Terence Tao.

As mentioned before, Gödel, Escher, Bach [Hof99] is a highly recom-


mended book covering Gödel’s Theorem. A classic popular science
book about Fermat’s Last Theorem is [Sin97].
Cantor’s are used for both Turing and Gödel’s theorems. In a twist
of fate, using techniques originating from the works of Gödel and Tur-
ing, Paul Cohen showed in 1963 that Cantor’s Continuum Hypothesis
is independent of the axioms of set theory, which means that neither
it nor its negation is provable from these axioms and hence in some
sense can be considered as “neither true nor false” (see [Coh08]). The
Continuum Hypothesis is the conjecture that for every subset 𝑆 of ℝ,
either there is a one-to-one and onto map between 𝑆 and ℕ or there
is a one-to-one and onto map between 𝑆 and ℝ. It was conjectured
by Cantor and listed by Hilbert in 1900 as one of the most important
problems in mathematics. See also the non-conventional survey of
Shelah [She03]. See here for recent progress on a related question.
Thanks to Alex Lombardi for pointing out an embarrassing mistake
in the description of Fermat’s Last Theorem. (I said that it was open
for exponent 11 before Wiles’ work.)
III
EFFICIENT ALGORITHMS
Learning Objectives:
• Describe at a high level some interesting
computational problems.
• The difference between polynomial and
exponential time.
• Examples of techniques for obtaining efficient
algorithms
• Examples of how seemingly small differences in problems can potentially make huge differences in their computational complexity.

12
Efficient computation: An informal introduction

“The problem of distinguishing prime numbers from composite and of resolving


the latter into their prime factors is … one of the most important and useful
in arithmetic … Nevertheless we must confess that all methods … are either
restricted to very special cases or are so laborious … they try the patience of
even the practiced calculator … and do not apply at all to larger numbers.”,
Carl Friedrich Gauss, 1798

“For practical purposes, the difference between algebraic and exponential order
is often more crucial than the difference between finite and non-finite.”, Jack
Edmonds, "Paths, Trees, and Flowers", 1963

“What is the most efficient way to sort a million 32-bit integers?”, Eric
Schmidt to Barack Obama, 2008
“I think the bubble sort would be the wrong way to go.”, Barack Obama.

So far we have been concerned with which functions are computable


and which ones are not. In this chapter we look at the finer question
of the time that it takes to compute functions, as a function of their input
length. Time complexity is extremely important to both the theory and
practice of computing, but in introductory courses, coding interviews,
and software development, terms such as “𝑂(𝑛) running time” are of-
ten used in an informal way. People don’t have a precise definition of
what a linear-time algorithm is, but rather assume that “they’ll know
it when they see it”. In this book we will define running time pre-
cisely, using the mathematical models of computation we developed
in the previous chapters. This will allow us to ask (and sometimes
answer) questions such as:

• “Is there a function that can be computed in 𝑂(𝑛2 ) time but not in
𝑂(𝑛) time?”

• “Are there natural problems for which the best algorithm (and not
just the best known) requires 2Ω(𝑛) time?”




 Big Idea 16 The running time of an algorithm is not a number, it is


a function of the length of the input.

This chapter: A non-mathy overview


In this chapter, we informally survey examples of compu-
tational problems. For some of these problems we know
efficient (i.e., 𝑂(𝑛𝑐 )-time for a small constant 𝑐) algorithms,
and for others the best known algorithms are exponential.
We present these examples to get a feel as to the kinds of
problems that lie on each side of this divide and also see how
sometimes seemingly minor changes in problem formula-
tion can make the (known) complexity of a problem “jump”
from polynomial to exponential. We do not formally define
the notion of running time in this chapter, but use the same
"I know it when I see it" notion of an 𝑂(𝑛) or 𝑂(𝑛²) time
algorithm as the one you've seen in introduction to com-
puter science courses. We will see the precise definition of
running time (using Turing machines and RAM machines /
NAND-RAM) in Chapter 13.

While the difference between 𝑂(𝑛) and 𝑂(𝑛2 ) time can be crucial in
practice, in this book we focus on the even bigger difference between
polynomial and exponential running time. As we will see, the difference
between polynomial versus exponential time is typically insensitive to
the choice of the particular computational model, a polynomial-time
algorithm is still polynomial whether you use Turing machines, RAM
machines, or a parallel cluster as your model of computation, and sim-
ilarly an exponential-time algorithm will remain exponential in all of
these platforms. One of the interesting phenomena of computing is
that there is often a kind of a “threshold phenomenon” or “zero-one
law” for running time. Many natural problems can either be solved
in polynomial running time with a not-too-large exponent (e.g., some-
thing like 𝑂(𝑛²) or 𝑂(𝑛³)), or require exponential (e.g., at least 2^{Ω(𝑛)}
or 2^{Ω(√𝑛)}) time to solve. The reasons for this phenomenon are still not
fully understood, but some light on it is shed by the concept of NP
completeness, which we will see in Chapter 15.
This chapter is merely a tiny sample of the landscape of computa-
tional problems and efficient algorithms. If you want to explore the
field of algorithms and data structures more deeply (which I very
much hope you do!), the bibliographical notes contain references to
some excellent texts, some of which are available freely on the web.

Remark 12.1 — Relations between parts of this book.
Part I of this book contained a quantitative study of
computation of finite functions. We asked what are
the resources (in terms of gates of Boolean circuits or
lines in straight-line programs) required to compute
various finite functions.
Part II of the book contained a qualitative study of
computation of infinite functions (i.e., functions of
unbounded input length). In that part we asked the
qualitative question of whether or not a function is com-
putable at all, regardless of the number of operations.
Part III of the book, beginning with this chapter,
merges the two approaches and contains a quantitative
study of computation of infinite functions. In this part
we ask how do resources for computing a function
scale with the length of the input. In Chapter 13 we
define the notion of running time, and the class P of
functions that can be computed using a number of
steps that scales polynomially with the input length.
In Section 13.6 we will relate this class to the models
of Boolean circuits and straightline programs that we
studied in Part I.

12.1 PROBLEMS ON GRAPHS


In this chapter we discuss several examples of important computa-
tional problems. Many of the problems will involve graphs. We have
already encountered graphs before (see Section 1.4.4) but now quickly
recall the basic notation. A graph 𝐺 consists of a set of vertices 𝑉 and
edges 𝐸 where each edge is a pair of vertices. We typically denote by
𝑛 the number of vertices (and in fact often consider graphs where the
set of vertices 𝑉 equals the set [𝑛] of the integers between 0 and 𝑛 − 1).
In a directed graph, an edge is an ordered pair (𝑢, 𝑣), which we some-
times denote as 𝑢 → 𝑣. In an undirected graph, an edge is an unordered
pair (or simply a set) {𝑢, 𝑣} which we sometimes denote as 𝑢–𝑣 or
𝑢 ∼ 𝑣. An equivalent viewpoint is that an undirected graph corre-
sponds to a directed graph satisfying the property that whenever the
edge 𝑢 → 𝑣 is present then so is the edge 𝑣 → 𝑢. In this chapter we restrict
our attention to graphs that are undirected and simple (i.e., containing
no parallel edges or self-loops). Graphs can be represented either in
the adjacency list or adjacency matrix representation. We can transform
between these two representations using 𝑂(𝑛2 ) operations, and hence
for our purposes we will mostly consider them as equivalent.
Graphs are so ubiquitous in computer science and other sciences
because they can be used to model a great many of the data that we
encounter. These are not just the “obvious” data such as the road

network (which can be thought of as a graph of whose vertices are


locations with edges corresponding to road segments), or the web
(which can be thought of as a graph whose vertices are web pages
with edges corresponding to links), or social networks (which can
be thought of as a graph whose vertices are people and the edges
correspond to friend relation). Graphs can also denote correlations in
data (e.g., graph of observations of features with edges corresponding
to features that tend to appear together), causal relations (e.g., gene
regulatory networks, where a gene is connected to gene products it
derives), or the state space of a system (e.g., graph of configurations
of a physical system, with edges corresponding to states that can be
reached from one another in one step).

12.1.1 Finding the shortest path in a graph


Figure 12.1: Some examples of graphs found on the Internet.
The shortest path problem is the task of finding, given a graph 𝐺 =
(𝑉 , 𝐸) and two vertices 𝑠, 𝑡 ∈ 𝑉 , the length of the shortest path
between 𝑠 and 𝑡 (if such a path exists). That is, we want to find the
smallest number 𝑘 such that there are vertices 𝑣0 , 𝑣1 , … , 𝑣𝑘 with 𝑣0 = 𝑠,
𝑣𝑘 = 𝑡 and for every 𝑖 ∈ {0, … , 𝑘 − 1} an edge between 𝑣𝑖 and 𝑣𝑖+1 . For-
mally, we define MINPATH ∶ {0, 1}∗ → {0, 1}∗ to be the function that
on input a triple (𝐺, 𝑠, 𝑡) (represented as a string) outputs the number
𝑘 which is the length of the shortest path in 𝐺 between 𝑠 and 𝑡 or a
string representing no path if no such path exists. (In practice people
often want to also find the actual path and not just its length; it turns
out that the algorithms to compute the length of the path often yield
the actual path itself as a byproduct, and so everything we say about
the task of computing the length also applies to the task of finding the
path.)
If each vertex has at least two neighbors then there can be an expo-
nential number of paths from 𝑠 to 𝑡, but fortunately we do not have to
enumerate them all to find the shortest path. We can find the short-
est path using a breadth first search (BFS), enumerating 𝑠's neigh-
bors, and then neighbors' neighbors, etc., in order. If we maintain the
neighbors in a list we can perform a BFS in 𝑂(𝑛²) time, while using a
queue we can do this in 𝑂(𝑚) time, where 𝑚 is the number of edges.¹
¹ A queue is a data structure for storing a list of elements in "First In First Out (FIFO)" order. Each "pop" operation removes an element from the queue in the order that they were "pushed" into it; see the Wikipedia page.
Dijkstra’s algorithm is a well-known generalization of BFS to weighted
graphs. More formally, the algorithm for computing the function
MINPATH is described in Algorithm 12.2.

Algorithm 12.2 — Shortest path via BFS.

Input: Graph 𝐺 = (𝑉 , 𝐸) and vertices 𝑠, 𝑡 ∈ 𝑉 . Assume


𝑉 = [𝑛].
Output: Length 𝑘 of shortest path from 𝑠 to 𝑡 or ∞ if no
such path exists.
1: Let 𝐷 be length-𝑛 array.
2: Set 𝐷[𝑠] = 0 and 𝐷[𝑖] = ∞ for all 𝑖 ∈ [𝑛] ⧵ {𝑠}.
3: Initialize queue 𝑄 to contain 𝑠.
4: while 𝑄 non empty do
5: Pop 𝑣 from 𝑄
6: if 𝑣 = 𝑡 then
7: return 𝐷[𝑣]
8: end if
9: for 𝑢 neighbor of 𝑣 with 𝐷[𝑢] = ∞ do
10: Set 𝐷[𝑢] ← 𝐷[𝑣] + 1
11: Add 𝑢 to 𝑄.
12: end for
13: end while
14: return ∞

Since we only add to the queue vertices 𝑤 with 𝐷[𝑤] = ∞ (and


then immediately set 𝐷[𝑤] to an actual number), we never push to
the queue a vertex more than once, and hence the algorithm makes at
most 𝑛 “push” and “pop” operations. For each vertex 𝑣, the number
of times we run the inner loop is equal to the degree of 𝑣 and hence
the total running time is proportional to the sum of all degrees which
equals twice the number 𝑚 of edges. Algorithm 12.2 returns the cor-
rect answer since the vertices are added to the queue in the order of
their distance from 𝑠, and hence we will reach 𝑡 after we have explored
all the vertices that are closer to 𝑠 than 𝑡.
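To make the pseudocode concrete, here is a short Python sketch of Algorithm 12.2 (this sketch, including the representation of the graph as a dictionary adj mapping each vertex to the list of its neighbors, is our own illustration and not part of the formal models used in this book):

from collections import deque

def min_path(adj, s, t):
    # Length of the shortest path from s to t in the graph given by the
    # adjacency-list dictionary adj, or None if no such path exists.
    dist = {s: 0}                # D[s] = 0; vertices absent from dist have D = infinity
    queue = deque([s])
    while queue:
        v = queue.popleft()      # pop in FIFO order
        if v == t:
            return dist[v]
        for u in adj[v]:
            if u not in dist:    # only push vertices u with D[u] = infinity
                dist[u] = dist[v] + 1
                queue.append(u)
    return None

# Example: the path 0-1-2-3 together with the extra edge {0, 2}
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(min_path(adj, 0, 3))       # prints 2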

Remark 12.3 — On data structures. If you’ve ever taken
an algorithms course, you have probably encountered
many data structures such as lists, arrays, queues,
stacks, heaps, search trees, hash tables and many
more. Data structures are extremely important in com-
puter science, and each one of those offers different
tradeoffs between overhead in storage, operations
supported, cost in time for each operation, and more.
For example, if we store 𝑛 items in a list, we will need
a linear (i.e., 𝑂(𝑛) time) scan to retrieve an element,
while we achieve the same operation in 𝑂(1) time if
we used a hash table. However, when we only care
about polynomial-time algorithms, such factors of
𝑂(𝑛) in the running time will not make much differ-

ence. Similarly, if we don’t care about the difference


between 𝑂(𝑛) and 𝑂(𝑛2 ), then it doesn’t matter if we
represent graphs as adjacency lists or adjacency matri-
ces. Hence we will often describe our algorithms at a
very high level, without specifying the particular data
structures that are used to implement them. How-
ever, it will always be clear that there exists some data
structure that is sufficient for our purposes.

12.1.2 Finding the longest path in a graph


The longest path problem is the task of finding the length of the longest
simple (i.e., non-intersecting) path between a given pair of vertices
𝑠 and 𝑡 in a given graph 𝐺. If the graph is a road network, then the
longest path might seem less motivated than the shortest path (unless
you are the kind of person that always prefers the “scenic route”).
But graphs can and are used to model a variety of phenomena, and in
many such cases finding the longest path (and some of its variants)
can be very useful. In particular, finding the longest path is a gener-
alization of the famous Hamiltonian path problem which asks for a
maximally long simple path (i.e., path that visits all 𝑛 vertices once)
between 𝑠 and 𝑡, as well as the notorious traveling salesman problem
(TSP) of finding (in a weighted graph) a path visiting all vertices of
cost at most 𝑤. TSP is a classical optimization problem, with appli-
cations ranging from planning and logistics to DNA sequencing and
astronomy.
Surprisingly, while we can find the shortest path in 𝑂(𝑚) time,
there is no known algorithm for the longest path problem that signif-
icantly improves on the trivial “exhaustive search” or “brute force”
algorithm that enumerates all the exponentially many possibilities
for such paths. Specifically, the best known algorithms for the longest
path problem take 𝑂(𝑐𝑛 ) time for some constant 𝑐 > 1. (At the mo-
ment the best record is 𝑐 ∼ 1.65 or so; even obtaining an 𝑂(2𝑛 ) time
bound is not that simple, see Exercise 12.1.)

12.1.3 Finding the minimum cut in a graph


Figure 12.2: A knight's tour can be thought of as a maximally long path on the graph corresponding to a chessboard where we put an edge between any two squares that can be reached by one step via a legal knight move.
Given a graph 𝐺 = (𝑉 , 𝐸), a cut of 𝐺 is a subset 𝑆 ⊆ 𝑉 such that 𝑆
is neither empty nor is it all of 𝑉 . The edges cut by 𝑆 are those edges
where one of their endpoints is in 𝑆 and the other is in 𝑆̄ = 𝑉 ⧵ 𝑆. We
denote this set of edges by 𝐸(𝑆, 𝑆̄). If 𝑠, 𝑡 ∈ 𝑉 are a pair of vertices
then an 𝑠, 𝑡 cut is a cut such that 𝑠 ∈ 𝑆 and 𝑡 ∈ 𝑆̄ (see Fig. 12.3).
The minimum 𝑠, 𝑡 cut problem is the task of finding, given 𝑠 and 𝑡, the
minimum number 𝑘 such that there is an 𝑠, 𝑡 cut cutting 𝑘 edges (the
problem is also sometimes phrased as finding the set that achieves
this minimum; it turns out that algorithms to compute the number
often yield the set as well). Formally, we define MINCUT ∶ {0, 1}∗ →

{0, 1}∗ to be the function that on input a string representing a triple


(𝐺 = (𝑉 , 𝐸), 𝑠, 𝑡) of a graph and two vertices, outputs the minimum
number 𝑘 such that there exists a set 𝑆 ⊆ 𝑉 with 𝑠 ∈ 𝑆, 𝑡 ∉ 𝑆 and
|𝐸(𝑆, 𝑆̄)| = 𝑘.
Computing minimum 𝑠, 𝑡 cuts is useful in many applications since
minimum cuts often correspond to bottlenecks. For example, in a com-
munication or railroad network the minimum cut between 𝑠 and 𝑡
corresponds to the smallest number of edges that, if dropped, will
disconnect 𝑠 from 𝑡. (This was actually the original motivation for this
problem; see Section 12.6.) Similar applications arise in scheduling
and planning. In the setting of image segmentation, one can define a
graph whose vertices are pixels and whose edges correspond to neigh-
boring pixels of distinct colors. If we want to separate the foreground
from the background then we can pick (or guess) a foreground pixel 𝑠
and background pixel 𝑡 and ask for a minimum cut between them.
Figure 12.3: A cut in a graph 𝐺 = (𝑉 , 𝐸) is simply a subset 𝑆 of its vertices. The edges that are cut by 𝑆 are all those whose one endpoint is in 𝑆 and the other one is in 𝑆̄ = 𝑉 ⧵ 𝑆. The cut edges are colored red in this figure.
The naive algorithm for computing MINCUT will check all 2ⁿ pos-
sible subsets of an 𝑛-vertex graph, but it turns out we can do much
sible subsets of an 𝑛-vertex graph, but it turns out we can do much
better than that. As we’ve seen in this book time and again, there is
more than one algorithm to compute the same function, and some
of those algorithms might be more efficient than others. Luckily the
minimum cut problem is one of those cases. In particular, as we will
see in the next section, there are algorithms that compute MINCUT in
time which is polynomial in the number of vertices.

12.1.4 Min-Cut Max-Flow and Linear programming


We can obtain a polynomial-time algorithm for computing MINCUT
using the Max-Flow Min-Cut Theorem. This theorem says that the
minimum cut between 𝑠 and 𝑡 equals the maximum amount of flow
we can send from 𝑠 to 𝑡, if every edge has unit capacity. Specifically,
imagine that every edge of the graph corresponded to a pipe that
could carry one unit of fluid per one unit of time (say 1 liter of water
per second). The maximum 𝑠, 𝑡 flow is the maximum units of water
that we could transfer from 𝑠 to 𝑡 over these pipes. If there is an 𝑠, 𝑡
cut of 𝑘 edges, then the maximum flow is at most 𝑘. The reason is
that such a cut 𝑆 acts as a “bottleneck” since at most 𝑘 units can flow
from 𝑆 to its complement at any given unit of time. This means that
the maximum 𝑠, 𝑡 flow is always at most the value of the minimum
𝑠, 𝑡 cut. The surprising and non-trivial content of the Max-Flow Min-
Cut Theorem is that the maximum flow is also at least the value of the
minimum cut, and hence computing the cut is the same as computing
the flow.
The Max-Flow Min-Cut Theorem reduces the task of computing a
minimum cut to the task of computing a maximum flow. However, this
still does not show how to compute such a flow. The Ford-Fulkerson

Algorithm is a direct way to compute a flow using incremental im-


provements. But computing flows in polynomial time is also a special
case of a much more general tool known as linear programming.
A flow on a graph 𝐺 of 𝑚 edges can be modeled as a vector 𝑥 ∈ ℝ𝑚
where for every edge 𝑒, 𝑥𝑒 corresponds to the amount of water per
time-unit that flows on 𝑒. We think of an edge 𝑒 as an ordered pair
(𝑢, 𝑣) (we can choose the order arbitrarily) and let 𝑥𝑒 be the amount
of flow that goes from 𝑢 to 𝑣. (If the flow is in the other direction then
we make 𝑥𝑒 negative.) Since every edge has capacity one, we know
that −1 ≤ 𝑥𝑒 ≤ 1 for every edge 𝑒. A valid flow has the property that
the amount of water leaving the source 𝑠 is the same as the amount
entering the sink 𝑡, and that for every other vertex 𝑣, the amount of
water entering and leaving 𝑣 is the same.
Mathematically, we can write these conditions as follows:

∑_{𝑒∋𝑠} 𝑥𝑒 + ∑_{𝑒∋𝑡} 𝑥𝑒 = 0

∑_{𝑒∋𝑣} 𝑥𝑒 = 0   ∀ 𝑣 ∈ 𝑉 ⧵ {𝑠, 𝑡}     (12.1)

−1 ≤ 𝑥𝑒 ≤ 1   ∀ 𝑒 ∈ 𝐸
where for every vertex 𝑣, summing over 𝑒 ∋ 𝑣 means summing over all
the edges that touch 𝑣.
The maximum flow problem can be thought of as the task of max-
imizing ∑𝑒∋𝑠 𝑥𝑒 over all the vectors 𝑥 ∈ ℝ𝑚 that satisfy the above
conditions (12.1). Maximizing a linear function ℓ(𝑥) over the set of
𝑥 ∈ ℝ𝑚 that satisfy certain linear equalities and inequalities is known
as linear programming. Luckily, there are polynomial-time algorithms
for solving linear programming, and hence we can solve the maxi-
mum flow (and so, equivalently, minimum cut) problem in polyno-
mial time. In fact, there are much better algorithms for maximum-
flow/minimum-cut, even for weighted directed graphs, with currently

the record standing at 𝑂(min{𝑚^{10/7}, 𝑚√𝑛}) time.
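As an illustration of this reduction, here is a sketch that feeds the maximum-flow linear program of (12.1) to the generic solver scipy.optimize.linprog. The tiny example graph and the variable names are ours; this is meant only to show the translation into a linear program, not to be an efficient maximum-flow implementation:

from scipy.optimize import linprog

# A small graph with unit capacities; each edge is oriented arbitrarily.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n, s, t = 4, 0, 3

# One variable x_e per edge with -1 <= x_e <= 1 (negative means flow in the reverse direction).
# Conservation constraints: the net flow into every vertex other than s and t is zero.
A_eq, b_eq = [], []
for v in range(n):
    if v in (s, t):
        continue
    row = [0.0] * len(edges)
    for i, (u, w) in enumerate(edges):
        if w == v:
            row[i] = 1.0    # flow on (u, v) enters v
        elif u == v:
            row[i] = -1.0   # flow on (v, w) leaves v
    A_eq.append(row)
    b_eq.append(0.0)

# Objective: maximize the net flow leaving s (linprog minimizes, so we negate).
c = [0.0] * len(edges)
for i, (u, w) in enumerate(edges):
    if u == s:
        c[i] = -1.0
    elif w == s:
        c[i] = 1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(-1, 1)] * len(edges), method="highs")
print(-res.fun)  # maximum flow value, which equals the minimum s,t cut (here 2.0)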
Solved Exercise 12.1 — Global minimum cut. Given a graph 𝐺 = (𝑉 , 𝐸),
define the global minimum cut of 𝐺 to be the minimum over all 𝑆 ⊆ 𝑉
with 𝑆 ≠ ∅ and 𝑆 ≠ 𝑉 of the number of edges cut by 𝑆. Prove that
there is a polynomial-time algorithm to compute the global minimum
cut of a graph.

Solution:
By the above we know that there is a polynomial-time algorithm
𝐴 that on input (𝐺, 𝑠, 𝑡) finds the minimum 𝑠, 𝑡 cut in the graph

𝐺. Using 𝐴, we can obtain an algorithm 𝐵 that on input a graph 𝐺


computes the global minimum cut as follows:

1. For every distinct pair 𝑠, 𝑡 ∈ 𝑉 , Algorithm 𝐵 sets 𝑘𝑠,𝑡 ←


𝐴(𝐺, 𝑠, 𝑡).

2. 𝐵 returns the minimum of 𝑘𝑠,𝑡 over all distinct pairs 𝑠, 𝑡

The running time of 𝐵 will be 𝑂(𝑛2 ) times the running time of 𝐴


and hence polynomial time. Moreover, if the global minimum cut
is 𝑆, then when 𝐵 reaches an iteration with 𝑠 ∈ 𝑆 and 𝑡 ∉ 𝑆 it will
obtain the value of this cut, and hence the value output by 𝐵 will
be the value of the global minimum cut.
The above is our first example of a reduction in the context of
polynomial-time algorithms. Namely, we reduced the task of com-
puting the global minimum cut to the task of computing minimum
𝑠, 𝑡 cuts.
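The reduction in the solution above takes only a few lines of code. In the sketch below, min_st_cut is a hypothetical black-box subroutine standing in for the polynomial-time algorithm 𝐴; the point is only to show how 𝐵 uses 𝐴:

def global_min_cut(vertices, G, min_st_cut):
    # Computes the global minimum cut of G by calling the (assumed)
    # polynomial-time subroutine min_st_cut(G, s, t) on every distinct pair s, t.
    best = float("inf")
    for s in vertices:
        for t in vertices:
            if s != t:
                best = min(best, min_st_cut(G, s, t))
    return best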

12.1.5 Finding the maximum cut in a graph


The maximum cut problem is the task of finding, given an input graph
𝐺 = (𝑉 , 𝐸), the subset 𝑆 ⊆ 𝑉 that maximizes the number of edges
cut by 𝑆. (We can also define an 𝑠, 𝑡-cut variant of the maximum cut
like we did for minimum cut; the two variants have similar complexity
but the global maximum cut is more common in the literature.) Like
its cousin the minimum cut problem, the maximum cut problem is
also very well motivated. For example, maximum cut arises in VLSI
design, and also has some surprising relation to analyzing the Ising
model in statistical physics.
Surprisingly, while (as we’ve seen) there is a polynomial-time al-
gorithm for the minimum cut problem, there is no known algorithm
solving maximum cut much faster than the trivial "brute force" algo-
rithm that tries all 2ⁿ possibilities for the set 𝑆.
Figure 12.4: In a convex function 𝑓 (left figure), for every 𝑥 and 𝑦 and 𝑝 ∈ [0, 1] it holds that 𝑓(𝑝𝑥 + (1 − 𝑝)𝑦) ≤ 𝑝 ⋅ 𝑓(𝑥) + (1 − 𝑝) ⋅ 𝑓(𝑦). In particular this means that every local minimum of 𝑓 is also a global minimum. In contrast in a non-convex function there can be many local minima.
12.1.6 A note on convexity
There is an underlying reason for the sometimes radical difference
between the difficulty of maximizing and minimizing a function over
a domain. If 𝐷 ⊆ ℝⁿ, then a function 𝑓 ∶ 𝐷 → ℝ is convex if for every
𝑥, 𝑦 ∈ 𝐷 and 𝑝 ∈ [0, 1], 𝑓(𝑝𝑥 + (1 − 𝑝)𝑦) ≤ 𝑝𝑓(𝑥) + (1 − 𝑝)𝑓(𝑦).
That is, 𝑓 applied to the 𝑝-weighted midpoint between 𝑥 and 𝑦 is
smaller than the 𝑝-weighted average value of 𝑓. If 𝐷 itself is convex
(which means that if 𝑥, 𝑦 are in 𝐷 then so is the line segment between
them), then this means that if 𝑥 is a local minimum of 𝑓 then it is also
a global minimum. The reason is that if 𝑓(𝑦) < 𝑓(𝑥) then every point
𝑧 = 𝑝𝑥 + (1 − 𝑝)𝑦 on the line segment between 𝑥 and 𝑦 will satisfy
𝑓(𝑧) ≤ 𝑝𝑓(𝑥) + (1 − 𝑝)𝑓(𝑦) < 𝑓(𝑥) and hence in particular 𝑥 cannot
be a local minimum.
Figure 12.5: In the high dimensional case, if 𝑓 is a convex function (left figure) the global minimum is the only local minimum, and we can find it by a local-search algorithm which can be thought of as dropping a marble and letting it "slide down" until it reaches the global minimum. In contrast, a non-convex function (right figure) might have an exponential number of local minima in which any local-search algorithm could get stuck.
Intuitively, local minima of functions are much


easier to find than global ones: after all, any “local search” algorithm
that keeps finding a nearby point on which the value is lower, will
eventually arrive at a local minimum. One example of such a local
search algorithm is gradient descent which takes a sequence of small
steps, each one in the direction that would reduce the value by the
most amount based on the current derivative.
Indeed, under certain technical conditions, we can often efficiently
find the minimum of convex functions over a convex domain, and
this is the reason why problems such as minimum cut and shortest
path are easy to solve. On the other hand, maximizing a convex func-
tion over a convex domain (or equivalently, minimizing a concave
function) can often be a hard computational task. A linear function
is both convex and concave, which is the reason that both the maxi-
mization and minimization problems for linear functions can be done
efficiently.
The minimum cut problem is not a priori a convex minimization
task, because the set of potential cuts is discrete and not continuous.
However, it turns out that we can embed it in a continuous and con-
vex set via the (linear) maximum flow problem. The “max flow min
cut” theorem ensures that this embedding is “tight” in the sense that
the minimum “fractional cut” that we obtain through the maximum-
flow linear program will be the same as the true minimum cut. Un-
fortunately, we don’t know of such a tight embedding in the setting of
the maximum cut problem.
Convexity arises time and again in the context of efficient computa-
tion. For example, one of the basic tasks in machine learning is empir-
ical risk minimization. This is the task of finding a classifier for a given
set of training examples. That is, the input is a list of labeled examples
(𝑥0 , 𝑦0 ), … , (𝑥𝑚−1 , 𝑦𝑚−1 ), where each 𝑥𝑖 ∈ {0, 1}𝑛 and 𝑦𝑖 ∈ {0, 1},
and the goal is to find a classifier ℎ ∶ {0, 1}𝑛 → {0, 1} (or sometimes
ℎ ∶ {0, 1}𝑛 → ℝ) that minimizes the number of errors. More generally,
we want to find ℎ that minimizes
∑_{𝑖=0}^{𝑚−1} 𝐿(𝑦𝑖 , ℎ(𝑥𝑖 ))

where 𝐿 is some loss function measuring how far is the predicted la-
bel ℎ(𝑥𝑖 ) from the true label 𝑦𝑖 . When 𝐿 is the square loss function
𝐿(𝑦, 𝑦′ ) = (𝑦 − 𝑦′ )2 and ℎ is a linear function, empirical risk mini-
mization corresponds to the well-known convex minimization task of
linear regression. In other cases, when the task is non-convex, there can
be many global or local minima. That said, even if we don’t find the
global (or even a local) minimum, this continuous embedding can still
help us. In particular, when running a local improvement algorithm

such as Gradient Descent, we might still find a function ℎ that is “use-


ful” in the sense of having a small error on future examples from the
same distribution.
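To make this concrete, here is a minimal numpy sketch of gradient descent for the (convex) square-loss linear regression task; the toy data, step size and number of iterations are arbitrary choices made for this illustration:

import numpy as np

def linear_regression_gd(X, y, steps=1000, lr=0.1):
    # Minimize the average square loss (1/n) * sum_i (y_i - <w, x_i>)^2 by gradient descent.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n   # gradient of the average square loss
        w -= lr * grad                     # take a small step against the gradient
    return w

# Toy data generated from a known linear function.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
print(linear_regression_gd(X, y))          # converges to approximately [2, -1]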

12.2 BEYOND GRAPHS


Not all computational problems arise from graphs. We now list some
other examples of computational problems that are of great interest.

12.2.1 SAT
A propositional formula 𝜑 involves 𝑛 variables 𝑥1 , … , 𝑥𝑛 and the logical
operators AND (∧), OR (∨), and NOT (¬, also denoted by an overline, as in 𝑥̄𝑖 ). We say
that such a formula is in conjunctive normal form (CNF for short) if it is
an AND of ORs of variables or their negations (we call a term of the
form 𝑥𝑖 or 𝑥̄𝑖 a literal). For example, this is a CNF formula

(𝑥7 ∨ 𝑥22 ∨ 𝑥15 ) ∧ (𝑥37 ∨ 𝑥22 ) ∧ (𝑥55 ∨ 𝑥7 )

The satisfiability problem is the task of determining, given a CNF


formula 𝜑, whether or not there exists a satisfying assignment for 𝜑. A
satisfying assignment for 𝜑 is a string 𝑥 ∈ {0, 1}𝑛 such that 𝜑 evalu-
ates to True if we assign its variables the values of 𝑥. The SAT problem
might seem like an abstract question of interest only in logic but in fact
SAT is of huge interest in industrial optimization, with applications
including manufacturing planning, circuit synthesis, software verifica-
tion, air-traffic control, scheduling sports tournaments, and more.

2SAT. We say that a formula is a 𝑘-CNF if it is an AND of ORs where


each OR involves exactly 𝑘 literals. The 𝑘-SAT problem is the restric-
tion of the satisfiability problem for the case that the input formula is
a 𝑘-CNF. In particular, the 2SAT problem is to find out, given a 2-CNF
formula 𝜑, whether there is an assignment 𝑥 ∈ {0, 1}𝑛 that satisfies
𝜑, in the sense that it makes it evaluate to 1 or “True”. The trivial,
brute-force, algorithm for 2SAT will enumerate all the 2𝑛 assignments
𝑥 ∈ {0, 1}𝑛 but fortunately we can do much better. The key is that
we can think of every constraint of the form ℓ𝑖 ∨ ℓ𝑗 (where ℓ𝑖 , ℓ𝑗 are
literals, corresponding to variables or their negations) as an implication
ℓ𝑖 ⇒ ℓ𝑗 , since it corresponds to the constraints that if the literal ℓ𝑖′ = ℓ𝑖
is true then it must be the case that ℓ𝑗 is true as well. Hence we can
think of 𝜑 as a directed graph between the 2𝑛 literals, with an edge
from ℓ𝑖 to ℓ𝑗 corresponding to an implication from the former to the
latter. It can be shown that 𝜑 is unsatisfiable if and only if there is a
variable 𝑥𝑖 such that there is a directed path from 𝑥𝑖 to 𝑥𝑖 as well as
a directed path from 𝑥𝑖 to 𝑥𝑖 (see Exercise 12.2). This reduces 2SAT
to the (efficiently solvable) problem of determining connectivity in
directed graphs.
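Here is a sketch of this reduction in Python. Literals are encoded as nonzero integers (𝑖 stands for 𝑥𝑖 and −𝑖 for its negation), and for simplicity the path checks use plain brute-force reachability; any polynomial-time reachability procedure (for example, one based on strongly connected components) would do:

def two_sat_satisfiable(n, clauses):
    # clauses is a list of pairs of literals; the clause (a OR b) contributes
    # the implications (NOT a) => b and (NOT b) => a to the implication graph.
    edges = set()
    for a, b in clauses:
        edges.add((-a, b))
        edges.add((-b, a))

    def reachable(u, v):
        # simple depth-first reachability check from literal u to literal v
        seen, stack = {u}, [u]
        while stack:
            w = stack.pop()
            if w == v:
                return True
            for (x, y) in edges:
                if x == w and y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    # Unsatisfiable iff for some variable there are paths x_i ~> NOT x_i and NOT x_i ~> x_i.
    for i in range(1, n + 1):
        if reachable(i, -i) and reachable(-i, i):
            return False
    return True

# (x1 or x2) and (not x1 or x2) and (not x2 or x1) and (not x1 or not x2) is unsatisfiable
print(two_sat_satisfiable(2, [(1, 2), (-1, 2), (-2, 1), (-1, -2)]))  # False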

3SAT. The 3SAT problem is the task of determining satisfiability


for 3CNFs. One might think that changing from two to three would
not make that much of a difference for complexity. One would be
wrong. Despite much effort, we do not know of a significantly better
than brute force algorithm for 3SAT (the best known algorithms take
roughly 1.3ⁿ steps).
Interestingly, a similar issue arises time and again in computation,
where the difference between two and three often corresponds to
the difference between tractable and intractable. We do not fully un-
derstand the reasons for this phenomenon, though the notion of NP
completeness we will see later does offer a partial explanation. It may
be related to the fact that optimizing a polynomial often amounts to
equations on its derivative. The derivative of a quadratic polynomial is
linear, while the derivative of a cubic is quadratic, and, as we will see,
the difference between solving linear and quadratic equations can be
quite profound.

12.2.2 Solving linear equations


One of the most useful problems that people have been solving time
and again is solving 𝑛 linear equations in 𝑛 variables. That is, solve
equations of the form

𝑎0,0 𝑥0 + 𝑎0,1 𝑥1 +⋯ + 𝑎0,𝑛−1 𝑥𝑛−1 = 𝑏0


𝑎1,0 𝑥0 + 𝑎1,1 𝑥1 +⋯ + 𝑎1,𝑛−1 𝑥𝑛−1 = 𝑏1
⋮
𝑎𝑛−1,0 𝑥0 + 𝑎𝑛−1,1 𝑥1 +⋯ + 𝑎𝑛−1,𝑛−1 𝑥𝑛−1 = 𝑏𝑛−1
where {𝑎𝑖,𝑗 }𝑖,𝑗∈[𝑛] and {𝑏𝑖 }𝑖∈[𝑛] are real (or rational) numbers. More
compactly, we can write this as the equations 𝐴𝑥 = 𝑏 where 𝐴 is an
𝑛 × 𝑛 matrix, and we think of 𝑥, 𝑏 as column vectors in ℝⁿ.
The standard Gaussian elimination algorithm can be used to solve
such equations in polynomial time (i.e., determine if they have a so-
lution, and if so, to find it). As we discussed above, if we are willing
to allow some loss in precision, we even have algorithms that handle
linear inequalities, also known as linear programming. In contrast, if
we insist on integer solutions, the task of solving for linear equalities
or inequalities is known as integer programming, and the best known
algorithms are exponential time in the worst case.
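For instance, over the reals (up to floating-point precision) such a system can be solved with a single library call whose underlying algorithm is essentially Gaussian elimination; the 2 × 2 system below is just an illustrative example:

import numpy as np

# Solve  2*x0 + x1 = 5  and  x0 - 3*x1 = -1,  i.e. a 2x2 instance of Ax = b.
A = np.array([[2.0, 1.0], [1.0, -3.0]])
b = np.array([5.0, -1.0])
print(np.linalg.solve(A, b))  # [2. 1.]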

Remark 12.4 — Bit complexity of numbers. Whenever we
discuss problems whose inputs correspond to num-
bers, the input length corresponds to how many bits
are needed to describe the number (or, as is equiv-
alent up to a constant factor, the number of digits

in base 10, 16 or any other constant). The difference


between the length of the input and the magnitude
of the number itself can be of course quite profound.
For example, most people would agree that there is
a huge difference between having a billion (i.e. 109 )
dollars and having nine dollars. Similarly there is a
huge difference between an algorithm that takes 𝑛
steps on an 𝑛-bit number and an algorithm that takes
2𝑛 steps.
One example is the problem (discussed below) of
finding the prime factors of a given integer 𝑁 . The
natural algorithm is to search for such a factor by try-
ing all numbers from 1 to 𝑁 , but that would take 𝑁
steps which is exponential in the input length, which
is the number of bits needed to describe 𝑁 . (The run-
ning time of this algorithm can be easily improved to
roughly √𝑁 , but this is still exponential (i.e., 2^{𝑛/2}) in
the number 𝑛 of bits to describe 𝑁 .) It is an important
and long open question whether there is such an algo-
rithm that runs in time polynomial in the input length
(i.e., polynomial in log 𝑁 ).
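To see the difference concretely, here is the naive trial-division routine alluded to above; it performs about √𝑁 iterations, which is roughly 2^{𝑛/2} for an 𝑛-bit input and hence hopeless already for inputs of a few hundred bits (the code itself is only an illustration):

def smallest_factor(N):
    # Finds the smallest non-trivial factor of N by trial division,
    # using about sqrt(N) iterations, i.e. exponential in the bit length of N.
    d = 2
    while d * d <= N:
        if N % d == 0:
            return d
        d += 1
    return N  # N is prime

print(smallest_factor(91))  # 7, since 91 = 7 * 13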

12.2.3 Solving quadratic equations


Suppose that we want to solve not just linear but also equations in-
volving quadratic terms of the form 𝑎𝑖,𝑗,𝑘 𝑥𝑗 𝑥𝑘 . That is, suppose that
we are given a set of quadratic polynomials 𝑝1 , … , 𝑝𝑚 and consider
the equations {𝑝𝑖 (𝑥) = 0}. To avoid issues with bit representations,
we will always assume that the equations contain the constraints
{𝑥𝑖² − 𝑥𝑖 = 0}𝑖∈[𝑛] . Since only 0 and 1 satisfy the equation 𝑎² − 𝑎 = 0,
this assumption means that we can restrict attention to solutions in
{0, 1}𝑛 . Solving quadratic equations in several variables is a classical
and extremely well motivated problem. This is the generalization of
the classical case of single-variable quadratic equations that gener-
ations of high school students grapple with. It also generalizes the
quadratic assignment problem, introduced in the 1950’s as a way to
optimize assignment of economic activities. Once again, we do not
know a much better algorithm for this problem than the one that enu-
merates over all the 2𝑛 possibilities.

12.3 MORE ADVANCED EXAMPLES


We now list a few more examples of interesting problems that are a
little more advanced but are of significant interest in areas such as
physics, economics, number theory, and cryptography.

12.3.1 Determinant of a matrix


The determinant of an 𝑛 × 𝑛 matrix 𝐴, denoted by det(𝐴), is an ex-
tremely important quantity in linear algebra. For example, it is known

that det(𝐴) ≠ 0 if and only if 𝐴 is non-singular, which means that it


has an inverse 𝐴−1 , and hence we can always uniquely solve equations
of the form 𝐴𝑥 = 𝑏 where 𝑥 and 𝑏 are 𝑛-dimensional vectors. More
generally, the determinant can be thought of as a quantitative measure
as to what extent 𝐴 is far from being singular. If the rows of 𝐴 are “al-
most” linearly dependent (for example, if the third row is very close
to being a linear combination of the first two rows) then the determi-
nant will be small, while if they are far from it (for example, if they are
orthogonal to one another, then the determinant will be large). In
particular, for every matrix 𝐴, the absolute value of the determinant
of 𝐴 is at most the product of the norms (i.e., square root of sum of
squares of entries) of the rows, with equality if and only if the rows
are orthogonal to one another.
The determinant can be defined in several ways. One way to define
the determinant of an 𝑛 × 𝑛 matrix 𝐴 is:

det(𝐴) = ∑_{𝜋∈𝑆𝑛} sign(𝜋) ∏_{𝑖∈[𝑛]} 𝐴𝑖,𝜋(𝑖)     (12.2)

where 𝑆𝑛 is the set of all permutations from [𝑛] to [𝑛] and the sign of
a permutation 𝜋 is equal to −1 raised to the power of the number of
inversions in 𝜋 (pairs 𝑖, 𝑗 such that 𝑖 > 𝑗 but 𝜋(𝑖) < 𝜋(𝑗)).
This definition suggests that computing det(𝐴) might require
summing over |𝑆𝑛 | terms which would take exponential time since
|𝑆𝑛 | = 𝑛! > 2𝑛 . However, there are other ways to compute the de-
terminant. For example, it is known that det is the only function that
satisfies the following conditions:

1. det(AB) = det(𝐴)det(𝐵) for every square matrices 𝐴, 𝐵.

2. For every 𝑛 × 𝑛 triangular matrix 𝑇 with diagonal entries


𝑑0 , … , 𝑑𝑛−1 , det(𝑇 ) = ∏_{𝑖=0}^{𝑛−1} 𝑑𝑖 . In particular det(𝐼) = 1 where 𝐼 is
the identity matrix. (A triangular matrix is one in which either all
entries below the diagonal, or all entries above the diagonal, are
zero.)

3. det(𝑆) = −1 where 𝑆 is a “swap matrix” that corresponds to


swapping two rows or two columns of 𝐼. That is, there are two
coordinates 𝑎, 𝑏 such that for every 𝑖, 𝑗,
𝑆𝑖,𝑗 = 1 if 𝑖 = 𝑗 and 𝑖 ∉ {𝑎, 𝑏},   𝑆𝑖,𝑗 = 1 if {𝑖, 𝑗} = {𝑎, 𝑏},   and 𝑆𝑖,𝑗 = 0 otherwise.

Using these rules and the Gaussian elimination algorithm, it is
possible to tell whether 𝐴 is singular or not, and in the latter case, de-
compose 𝐴 as a product of a polynomial number of swap matrices
and triangular matrices. (Indeed one can verify that the row opera-
tions in Gaussian elimination corresponds to either multiplying by a

swap matrix or by a triangular matrix.) Hence we can compute the


determinant for an 𝑛 × 𝑛 matrix using a polynomial number of arithmetic
operations.
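A sketch of this approach in Python, using exact rational arithmetic to avoid floating-point issues (this is a straightforward Gaussian-elimination implementation, written for clarity rather than efficiency):

from fractions import Fraction

def determinant(A):
    # Determinant via Gaussian elimination with row swaps: each swap multiplies
    # the determinant by -1, and the determinant of the resulting triangular
    # matrix is the product of its diagonal entries.
    n = len(A)
    M = [[Fraction(x) for x in row] for row in A]
    det = Fraction(1)
    for j in range(n):
        pivot = next((i for i in range(j, n) if M[i][j] != 0), None)
        if pivot is None:
            return Fraction(0)          # the matrix is singular
        if pivot != j:
            M[j], M[pivot] = M[pivot], M[j]
            det = -det                  # a row swap flips the sign
        det *= M[j][j]
        for i in range(j + 1, n):       # eliminate the entries below the pivot
            factor = M[i][j] / M[j][j]
            for k in range(j, n):
                M[i][k] -= factor * M[j][k]
    return det

print(determinant([[1, 2], [3, 4]]))    # -2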

12.3.2 Permanent of a matrix


Given an 𝑛 × 𝑛 matrix 𝐴, the permanent of 𝐴 is defined as

perm(𝐴) = ∑_{𝜋∈𝑆𝑛} ∏_{𝑖∈[𝑛]} 𝐴𝑖,𝜋(𝑖) .     (12.3)

That is, perm(𝐴) is defined analogously to the determinant in (12.2)


except that we drop the term sign(𝜋). The permanent of a matrix is a
natural quantity, and has been studied in several contexts including
combinatorics and graph theory. It also arises in physics where it can
be used to describe the quantum state of multiple Boson particles (see
here and here).

Permanent modulo 2. If the entries of 𝐴 are integers, then we can de-


fine the Boolean function 𝑝𝑒𝑟𝑚2 which outputs on input a matrix 𝐴
the result of the permanent of 𝐴 modulo 2. It turns out that we can
compute 𝑝𝑒𝑟𝑚2 (𝐴) in polynomial time. The key is that modulo 2, −𝑥
and +𝑥 are the same quantity and hence, since the only difference
between (12.2) and (12.3) is that some terms are multiplied by −1,
det(𝐴) mod 2 = perm(𝐴) mod 2 for every 𝐴.
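A quick sanity check of this identity on a small matrix, computing both quantities by brute force over all permutations (which is of course only feasible for tiny 𝑛):

from itertools import permutations

def perm_and_det(A):
    # Brute-force permanent and determinant of a small integer matrix,
    # directly from the sums over permutations in (12.2) and (12.3).
    n = len(A)
    perm_val, det_val = 0, 0
    for pi in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][pi[i]]
        inversions = sum(1 for i in range(n) for j in range(i + 1, n) if pi[i] > pi[j])
        perm_val += prod
        det_val += (-1) ** inversions * prod
    return perm_val, det_val

A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
p, d = perm_and_det(A)
print(p % 2 == d % 2)  # True: the permanent and the determinant agree modulo 2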

Permanent modulo 3. Emboldened by our good fortune above, we


might hope to be able to compute the permanent modulo any prime 𝑝
and perhaps in full generality. Alas, we have no such luck. In a similar
“two to three” type of a phenomenon, we do not know of a much
better than brute force algorithm to even compute the permanent
modulo 3.

12.3.3 Finding a zero-sum equilibrium


A zero sum game is a game between two players where the payoff for
one is the same as the penalty for the other. That is, whatever the first
player gains, the second player loses. As much as we want to avoid
them, zero sum games do arise in life, and the one good thing about
them is that at least we can compute the optimal strategy.
A zero sum game can be specified by an 𝑛 × 𝑛 matrix 𝐴, where if
player 1 chooses action 𝑖 and player 2 chooses action 𝑗 then player one
gets 𝐴𝑖,𝑗 and player 2 loses the same amount. The famous Min Max
Theorem by John von Neumann states that if we allow probabilistic or
“mixed” strategies (where a player does not choose a single action but
rather a distribution over actions) then it does not matter who plays
first: the end result will be the same. Mathematically the min max
theorem is that if we let Δ𝑛 be the set of probability distributions over

[𝑛] (i.e., non-negative column vectors in ℝⁿ whose entries sum to 1)


then

max_{𝑝∈Δ𝑛} min_{𝑞∈Δ𝑛} 𝑝⊤ 𝐴𝑞 = min_{𝑞∈Δ𝑛} max_{𝑝∈Δ𝑛} 𝑝⊤ 𝐴𝑞     (12.4)

The min-max theorem turns out to be a corollary of linear pro-


gramming duality, and indeed the value of (12.4) can be computed
efficiently by a linear program.

12.3.4 Finding a Nash equilibrium


Fortunately, not all real-world games are zero sum, and we do have
more general games, where the payoff of one player does not neces-
sarily equal the loss of the other. John Nash won the Nobel prize for
showing that there is a notion of equilibrium for such games as well.
In many economic texts it is taken as an article of faith that when
actual agents are involved in such a game then they reach a Nash
equilibrium. However, unlike zero sum games, we do not know of
an efficient algorithm for finding a Nash equilibrium given the de-
scription of a general (non-zero-sum) game. In particular this means
that, despite economists’ intuitions, there are games for which natural
strategies will take an exponential number of steps to converge to an
equilibrium.

12.3.5 Primality testing


Another classical computational problem, that has been of interest
since the ancient Greeks, is to determine whether a given number
𝑁 is prime or composite. Clearly we can do so by trying to divide it
by all the numbers in 2, … , 𝑁 − 1, but this would take at least 𝑁
steps which is exponential in its bit complexity 𝑛 = log 𝑁 . We can
reduce these 𝑁 steps to √𝑁 by observing that if 𝑁 is a composite of
the form 𝑁 = 𝑃 𝑄 then either 𝑃 or 𝑄 is at most √𝑁 . But this is
still quite terrible. If 𝑁 is a 1024 bit integer, √𝑁 is about 2⁵¹², and so
running this algorithm on such an input would take much more than
the lifetime of the universe.
Luckily, it turns out we can do radically better. In the 1970’s, Ra-
bin and Miller gave probabilistic algorithms to determine whether a
given number 𝑁 is prime or composite in time 𝑝𝑜𝑙𝑦(𝑛) for 𝑛 = log 𝑁 .
We will discuss the probabilistic model of computation later in this
course. In 2002, Agrawal, Kayal, and Saxena found a deterministic
𝑝𝑜𝑙𝑦(𝑛) time algorithm for this problem. This is surely a development
that mathematicians from Archimedes till Gauss would have found
exciting.
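For illustration, here is a short implementation of the Miller-Rabin probabilistic primality test (this follows the standard description of the test; for a composite 𝑁, each random round detects compositeness with probability at least 3/4, so the error probability shrinks exponentially with the number of rounds):

import random

def is_probably_prime(N, rounds=20):
    # Miller-Rabin probabilistic primality test.
    if N < 2:
        return False
    for p in (2, 3, 5, 7):
        if N % p == 0:
            return N == p
    d, r = N - 1, 0
    while d % 2 == 0:               # write N - 1 = 2^r * d with d odd
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, N - 1)
        x = pow(a, d, N)
        if x in (1, N - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, N)
            if x == N - 1:
                break
        else:
            return False            # a witnesses that N is composite
    return True                     # N is prime with high probability

print(is_probably_prime(2**61 - 1))  # True (a Mersenne prime)
print(is_probably_prime(2**61 + 1))  # False (divisible by 3)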

12.3.6 Integer factoring


Given that we can efficiently determine whether a number 𝑁 is prime
or composite, we could expect that in the latter case we could also ef-
ficiently find the factorization of 𝑁 . Alas, no such algorithm is known.
In a surprising and exciting turn of events, the non-existence of such an
algorithm has been used as a basis for encryption, and indeed it un-
derlies much of the security of the world wide web. We will return to
the factoring problem later in this course. We remark that we do know
much better than brute force algorithms for this problem. While the
brute force algorithms would require 2^{Ω(𝑛)} time to factor an 𝑛-bit inte-
ger, there are known algorithms running in time roughly 2^{𝑂(√𝑛)} and
also algorithms that are widely believed (though not fully rigorously
analyzed) to run in time roughly 2^{𝑂(𝑛^{1/3})}. (By "roughly" we mean that
we neglect factors that are polylogarithmic in 𝑛.)

12.4 OUR CURRENT KNOWLEDGE


The difference between an exponential and polynomial time algo-
rithm might seem merely “quantitative” but it is in fact extremely
significant. As we’ve already seen, the brute force exponential time
algorithm runs out of steam very very fast, and as Edmonds says, in
practice there might not be much difference between a problem where
the best algorithm is exponential and a problem that is not solvable
at all. Thus the efficient algorithms we mentioned above are widely
used and power many computer science applications. Moreover, a
polynomial-time algorithm often arises out of significant insight into
the problem at hand, whether it is the "max-flow min-cut" result, the
solvability of the determinant, or the group theoretic structure that
enables primality testing. Such insight can be useful regardless of its
computational implications.
Figure 12.6: The current computational status of several interesting problems. For all of them we either know a polynomial-time algorithm or the known algorithms require at least 2^{𝑛^𝑐} for some 𝑐 > 0. In fact for all except the factoring problem, we either know an 𝑂(𝑛³) time algorithm or the best known algorithms require at least 2^{Ω(𝑛)} time where 𝑛 is a natural parameter such that there is a brute force algorithm taking roughly 2ⁿ or 𝑛! time. Whether this "cliff" between the easy and hard problems is a real phenomenon or a reflection of our ignorance is still an open question.
At the moment we do not know whether the "hard" problems are
truly hard, or whether it is merely because we haven't yet found the
right algorithms for them. However, we will now see that there are
problems that do inherently require exponential time. We just don't
know if any of the examples above fall into that category.
right algorithms for them. However, we will now see that there are open question.
problems that do inherently require exponential time. We just don’t
know if any of the examples above fall into that category.

✓ Chapter Recap

• There are many natural problems that have


polynomial-time algorithms, and other natural
problems that we’d love to solve, but for which the
best known algorithms are exponential.
• Often a polynomial time algorithm relies on dis-
covering some hidden structure in the problem, or
finding a surprising equivalent formulation for it.

• There are many interesting problems where there


is an exponential gap between the best known algo-
rithm and the best algorithm that we can rule out.
Closing this gap is one of the main open questions
of theoretical computer science.

12.5 EXERCISES
Exercise 12.1 — exponential time algorithm for longest path. The naive algo-
rithm for computing the longest path in a given graph could take
more than 𝑛! steps. Give a 𝑝𝑜𝑙𝑦(𝑛) ⋅ 2ⁿ time algorithm for the longest
path problem in 𝑛 vertex graphs.²
² Hint: Use dynamic programming to compute for every 𝑠, 𝑡 ∈ [𝑛] and 𝑆 ⊆ [𝑛] the value 𝑃 (𝑠, 𝑡, 𝑆) which equals 1 if there is a simple path from 𝑠 to 𝑡 that uses exactly the vertices in 𝑆. Do this iteratively for 𝑆's of growing sizes.
■
Exercise 12.2 — 2SAT algorithm. For every 2CNF 𝜑, define the graph 𝐺𝜑
on 2𝑛 vertices corresponding to the literals 𝑥1 , … , 𝑥𝑛 , 𝑥̄1 , … , 𝑥̄𝑛 , such
that there is an edge from ℓ̄𝑖 to ℓ𝑗 iff the constraint ℓ𝑖 ∨ ℓ𝑗 is in 𝜑. Prove that 𝜑
is unsatisfiable if and only if there is some 𝑖 such that there is a path
from 𝑥𝑖 to 𝑥̄𝑖 and from 𝑥̄𝑖 to 𝑥𝑖 in 𝐺𝜑 . Show how to use this to solve
2SAT in polynomial time.

Exercise 12.3 — Reductions for showing algorithms.The following fact is


true: there is a polynomial-time algorithm BIP that on input a graph
𝐺 = (𝑉 , 𝐸) outputs 1 if and only if the graph is bipartite: there is a
partition of 𝑉 to disjoint parts 𝑆 and 𝑇 such that every edge (𝑢, 𝑣) ∈ 𝐸
satisfies either 𝑢 ∈ 𝑆 and 𝑣 ∈ 𝑇 or 𝑢 ∈ 𝑇 and 𝑣 ∈ 𝑆. Use this
fact to prove that there is a polynomial-time algorithm to compute
the following function CLIQUEPARTITION that on input a graph
𝐺 = (𝑉 , 𝐸) outputs 1 if and only if there is a partition of 𝑉
into two parts 𝑆 and 𝑇 such that both 𝑆 and 𝑇 are cliques: for every
pair of distinct vertices 𝑢, 𝑣 ∈ 𝑆, the edge (𝑢, 𝑣) is in 𝐸 and similarly for
every pair of distinct vertices 𝑢, 𝑣 ∈ 𝑇 , the edge (𝑢, 𝑣) is in 𝐸.

12.6 BIBLIOGRAPHICAL NOTES


The classic undergraduate introduction to algorithms text is
[Cor+09]. Two texts that are less “encyclopedic” are Kleinberg and
Tardos [KT06], and Dasgupta, Papadimitriou and Vazirani [DPV08].
Jeff Erickson’s book is an excellent algorithms text that is freely
available online.
The origins of the minimum cut problem date to the Cold War.
Specifically, Ford and Fulkerson discovered their max-flow/min-cut
algorithm in 1955 as a way to find out the minimum amount of train

tracks that would need to be blown up to disconnect Russia from the


rest of Europe. See the survey [Sch05] for more.
Some algorithms for the longest path problem are given in [Wil09;
Bjo14].

12.7 FURTHER EXPLORATIONS


Some topics related to this chapter that might be accessible to ad-
vanced students include: (to be completed)
Learning Objectives:
• Formally modeling running time, and in
particular notions such as 𝑂(𝑛) or 𝑂(𝑛3 ) time
algorithms.
• The classes P and EXP modelling polynomial
and exponential time respectively.
• The time hierarchy theorem, that in particular says that for every 𝑘 ≥ 1 there are functions we can compute in 𝑂(𝑛^{𝑘+1}) time but cannot compute in 𝑂(𝑛^𝑘) time.
• The class P/poly of non-uniform computation and the result that P ⊆ P/poly.

13
Modeling running time

“When the measure of the problem-size is reasonable and when the


sizes assume values arbitrarily large, an asymptotic estimate of … the or-
der of difficulty of [an] algorithm .. is theoretically important. It cannot
be rigged by making the algorithm artificially difficult for smaller sizes”,
Jack Edmonds, “Paths, Trees, and Flowers”, 1963

Max Newman: It is all very well to say that a machine could … do this or
that, but … what about the time it would take to do it?
Alan Turing: To my mind this time factor is the one question which will
involve all the real technical difficulty.
BBC radio panel on “Can automatic Calculating Machines Be Said to
Think?”, 1952

In Chapter 12 we saw examples of efficient algorithms, and made


some claims about their running time, but did not give a mathemati-
cally precise definition for this concept. We do so in this chapter, using
the models of Turing machines and RAM machines (or equivalently
NAND-TM and NAND-RAM) we have seen before. The running
time of an algorithm is not a fixed number since any non-trivial algo-
rithm will take longer to run on longer inputs. Thus, what we want
to measure is the dependence between the number of steps the algo-
rithms takes and the length of the input. In particular we care about
the distinction between algorithms that take at most polynomial time
(i.e., 𝑂(𝑛^𝑐) time for some constant 𝑐) and problems for which every
algorithm requires at least exponential time (i.e., Ω(2^{𝑛^𝑐}) for some 𝑐). As
mentioned in Edmonds' quote in Chapter 12, the difference between


these two can sometimes be as important as the difference between
being computable and uncomputable.

This chapter: A non-mathy overview


In this chapter we formally define what it means for a func-
tion to be computable in a certain number of steps. As dis-
cussed in Chapter 12, running time is not a number, rather




Figure 13.1: Overview of the results of this chapter.

what we care about is the scaling behavior of the number


of steps as the input size grows. We can use either Turing
machines or RAM machines to give such a formal definition
- it turns out that this doesn’t make a difference at the reso-
lution we care about. We make several important definitions
and prove some important theorems in this chapter. We will
define the main time complexity classes we use in this book,
and also show the Time Hierarchy Theorem which states that
given more resources (more time steps per input size) we can
compute more functions.

To put this in more “mathy” language, in this chapter we define


what it means for a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ to be computable
in time 𝑇 (𝑛) steps, where 𝑇 is some function mapping the length 𝑛
of the input to the number of computation steps allowed. Using this
definition we will do the following (see also Fig. 13.1):

• We define the class P of Boolean functions that can be computed


in polynomial time and the class EXP of functions that can be com-
puted in exponential time. Note that P ⊆ EXP. If we can compute
a function in polynomial time, we can certainly compute it in expo-
nential time.

• We show that the times to compute a function using a Turing ma-


chine and using a RAM machine (or NAND-RAM program) are
polynomially related. In particular this means that the classes P and
EXP are identical regardless of whether they are defined using
Turing machines or RAM machines / NAND-RAM programs.

• We give an efficient universal NAND-RAM program and use this to


establish the time hierarchy theorem that in particular implies that P is
a strict subset of EXP.
• We relate the notions defined here to the non-uniform models of
Boolean circuits and NAND-CIRC programs defined in Chapter 3.
We define P/poly to be the class of functions that can be computed
by a sequence of polynomial-sized circuits. We prove that P ⊆ P/poly
and that P/poly contains uncomputable functions.

13.1 FORMALLY DEFINING RUNNING TIME


Our models of computation (Turing machines, NAND-TM and
NAND-RAM programs and others) all operate by executing a se-
quence of instructions on an input one step at a time. We can define
the running time of an algorithm 𝑀 in one of these models by measur-
ing the number of steps 𝑀 takes on input 𝑥 as a function of the length
|𝑥| of the input. We start by defining running time with respect to Tur-
ing machines:

Let 𝑇 ∶ ℕ → ℕ be some
Definition 13.1 — Running time (Turing Machines).
function mapping natural numbers to natural numbers. We say
that a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is computable in 𝑇 (𝑛) Turing-
Machine time (TM-time for short) if there exists a Turing machine 𝑀
such that for every sufficiently large 𝑛 and every 𝑥 ∈ {0, 1}𝑛 , when
given input 𝑥, the machine 𝑀 halts after executing at most 𝑇 (𝑛)
steps and outputs 𝐹 (𝑥).
We define TIMETM (𝑇 (𝑛)) to be the set of Boolean functions
(functions mapping {0, 1}∗ to {0, 1}) that are computable in 𝑇 (𝑛)
TM time.

 Big Idea 17 For a function 𝐹 ∶ {0, 1}∗ → {0, 1} and 𝑇 ∶


ℕ → ℕ, we can formally define what it means for 𝐹 to be computable
in time at most 𝑇 (𝑛) where 𝑛 is the size of the input.

Definition 13.1 is not very complicated but is one of
the most important definitions of this book. As usual,
TIMETM (𝑇 (𝑛)) is a class of functions, not of machines. If
𝑀 is a Turing machine then a statement such as “𝑀
is a member of TIMETM (𝑛2 )” does not make sense.
The concept of TM-time as defined here is sometimes
known as “single-tape Turing machine time” in the
literature, since some texts consider Turing machines
with more than one working tape.

The relaxation of considering only “sufficiently large” 𝑛’s is not


very important but it is convenient since it allows us to avoid dealing
explicitly with un-interesting “edge cases”.
While the notion of being computable within a certain running time
can be defined for every function, the class TIMETM (𝑇 (𝑛)) is a class
of Boolean functions that have a single bit of output. This choice is not
very important, but is made for simplicity and convenience later on.
In fact, every non-Boolean function has a computationally equivalent
Boolean variant, see Exercise 13.3.
Solved Exercise 13.1 — Example of time bounds. Prove that TIMETM (10⋅𝑛3 ) ⊆
TIMETM (2ⁿ).

Solution:
The proof is illustrated in Fig. 13.2. Suppose that 𝐹 ∈ TIMETM (10⋅
𝑛3 ) and hence there exists some number 𝑁0 and a machine 𝑀 such
that for every 𝑛 > 𝑁0 , and 𝑥 ∈ {0, 1}∗ , 𝑀 (𝑥) outputs 𝐹 (𝑥) within
at most 10 ⋅ 𝑛3 steps. Since 10 ⋅ 𝑛3 = 𝑜(2𝑛 ), there is some number
𝑁1 such that for every 𝑛 > 𝑁1 , 10 ⋅ 𝑛3 < 2𝑛 . Hence for every
𝑛 > max{𝑁0 , 𝑁1 }, 𝑀 (𝑥) will output 𝐹 (𝑥) within at most 2𝑛 steps,
demonstrating that 𝐹 ∈ TIMETM (2ⁿ).
Figure 13.2: Comparing 𝑇 (𝑛) = 10𝑛³ with 𝑇 ′ (𝑛) = 2ⁿ (on the right figure the Y axis is in log scale). Since for every large enough 𝑛, 𝑇 ′ (𝑛) ≥ 𝑇 (𝑛), TIMETM (𝑇 (𝑛)) ⊆ TIMETM (𝑇 ′ (𝑛)).

13.1.1 Polynomial and Exponential Time


Unlike the notion of computability, the exact running time can be a
function of the model we use. However, it turns out that if we only
care about “coarse enough” resolution (as will most often be the case)
then the choice of the model, whether Turing machines, RAM ma-
chines, NAND-TM/NAND-RAM programs, or C/Python programs,
does not matter. This is known as the extended Church-Turing Thesis.
Specifically we will mostly care about the difference between polyno-
mial and exponential time.
The two main time complexity classes we will be interested in are
the following:

• Polynomial time: A function 𝐹 ∶ {0, 1}∗ → {0, 1} is computable in polynomial
time if it is in the class P = ∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ). That is, 𝐹 ∈ P if
there is an algorithm to compute 𝐹 that runs in time at most polynomial (i.e.,
at most 𝑛𝑐 for some constant 𝑐) in the length of the input.

• Exponential time: A function 𝐹 ∶ {0, 1}∗ → {0, 1} is computable in exponential
time if it is in the class EXP = ∪𝑐∈{1,2,3,…} TIMETM (2^{𝑛^𝑐}). That is, 𝐹 ∈ EXP
if there is an algorithm to compute 𝐹 that runs in time at most exponential
(i.e., at most 2^{𝑛^𝑐} for some constant 𝑐) in the length of the input.

In other words, these are defined as follows:

Definition 13.2 — P and EXP. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. We say that
𝐹 ∈ P if there is a polynomial 𝑝 ∶ ℕ → ℝ and a Turing machine
𝑀 such that for every 𝑥 ∈ {0, 1}∗ , when given input 𝑥, the Turing
machine halts within at most 𝑝(|𝑥|) steps and outputs 𝐹 (𝑥).
We say that 𝐹 ∈ EXP if there is a polynomial 𝑝 ∶ ℕ → ℝ and
a Turing machine 𝑀 such that for every 𝑥 ∈ {0, 1}∗ , when given
input 𝑥, 𝑀 halts within at most 2𝑝(|𝑥|) steps and outputs 𝐹 (𝑥).

P
Please take the time to make sure you understand
these definitions. In particular, sometimes students
think of the class EXP as corresponding to functions
that are not in P. However, this is not the case. If 𝐹 is
in EXP then it can be computed in exponential time.
This does not mean that it cannot be computed in
polynomial time as well.

Solved Exercise 13.2 — Different definitions of P. Prove that P as defined in
Definition 13.2 is equal to ∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ).

Solution:
To show these two sets are equal we need to show that P ⊆
∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ) and ∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ) ⊆ P. We start
with the former inclusion. Suppose that 𝐹 ∈ P. Then there is some
polynomial 𝑝 ∶ ℕ → ℝ and a Turing machine 𝑀 such that 𝑀
computes 𝐹 and 𝑀 halts on every input 𝑥 within at most 𝑝(|𝑥|)
steps. We can write the polynomial 𝑝 ∶ ℕ → ℝ in the form
𝑝(𝑛) = ∑_{𝑖=0}^{𝑑} 𝑎𝑖 𝑛𝑖 where 𝑎0 , … , 𝑎𝑑 ∈ ℝ, and we assume that 𝑎𝑑
is non-zero (or otherwise we just let 𝑑 correspond to the largest
number such that 𝑎𝑑 is non-zero). The degree of 𝑝 is the number 𝑑.
Since 𝑛𝑑 = 𝑜(𝑛𝑑+1 ), no matter what the coefficient 𝑎𝑑 is, for large
enough 𝑛, 𝑝(𝑛) < 𝑛𝑑+1 which means that the Turing machine 𝑀
will halt on inputs of length 𝑛 within fewer than 𝑛𝑑+1 steps, and
hence 𝐹 ∈ TIMETM (𝑛𝑑+1 ) ⊆ ∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ).
For the second inclusion, suppose that 𝐹 ∈ ∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ).
Then there is some positive 𝑐 ∈ ℕ such that 𝐹 ∈ TIMETM (𝑛𝑐 ) which
means that there is a Turing machine 𝑀 and some number 𝑁0 such
that 𝑀 computes 𝐹 and for every 𝑛 > 𝑁0 , 𝑀 halts on length 𝑛

inputs within at most 𝑛𝑐 steps. Let 𝑇0 be the maximum number


of steps that 𝑀 takes on inputs of length at most 𝑁0 . Then if we
define the polynomial 𝑝(𝑛) = 𝑛𝑐 + 𝑇0 then we see that 𝑀 halts on
every input 𝑥 within at most 𝑝(|𝑥|) steps and hence the existence of
𝑀 demonstrates that 𝐹 ∈ P.

Since exponential time is much larger than polynomial time, P ⊆


EXP. All of the problems we listed in Chapter 12 are in EXP, but as
we’ve seen, for some of them there are much better algorithms that
demonstrate that they are in fact in the smaller class P.

P                 EXP (but not known to be in P)
----------------  ------------------------------
Shortest path     Longest Path
Min cut           Max cut
2SAT              3SAT
Linear eqs        Quad. eqs
Zerosum           Nash
Determinant       Permanent
Primality         Factoring

Table: A table of the examples from Chapter 12. All these problems are in EXP
but only the ones on the left column are currently known to be in P as well
(i.e., they have a polynomial-time algorithm). See also Fig. 13.3.

R
Remark 13.3 — Boolean versions of problems. Many
of the problems defined in Chapter 12 correspond to
non-Boolean functions (functions with more than one
bit of output) while P and EXP are sets of Boolean
functions. However, for every non-Boolean function
𝐹 we can always define a computationally-equivalent
Boolean function 𝐺 by letting 𝐺(𝑥, 𝑖) be the 𝑖-th bit
of 𝐹 (𝑥) (see Exercise 13.3). Hence the table above,
as well as Fig. 13.3, refer to the computationally-equivalent Boolean variants
of these problems.

Figure 13.3: Some examples of problems that are known to be in P and problems
that are known to be in EXP but not known whether or not they are in P. Since
both P and EXP are classes of Boolean functions, in this figure we always refer
to the Boolean (i.e., Yes/No) variant of the problems.
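As a toy illustration of Remark 13.3 (not the formal definition, which appears
in Exercise 13.3 and also uses a flag 𝜎 to recover |𝐹 (𝑥)|), the Boolean variant
of a string-valued function can be sketched in Python as follows, where F stands
for any function returning a bit string and boolean_variant is an illustrative
name:

def boolean_variant(F, x, i):
    # G(x, i) = the i-th bit of F(x), and 0 if i is out of range
    y = F(x)
    return int(y[i]) if i < len(y) else 0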
13.2 MODELING RUNNING TIME USING RAM MACHINES / NAND-RAM
Turing machines are a clean theoretical model of computation, but
do not closely correspond to real-world computing architectures. The
discrepancy between Turing machines and actual computers does
not matter much when we consider the question of which functions
are computable, but can make a difference in the context of efficiency.

Even a basic staple of undergraduate algorithms such as “merge sort”


cannot be implemented on a Turing machine in 𝑂(𝑛 log 𝑛) time (see
Section 13.8). RAM machines (or equivalently, NAND-RAM programs)
match more closely actual computing architecture and what we mean
when we say 𝑂(𝑛) or 𝑂(𝑛 log 𝑛) algorithms in algorithms courses
or whiteboard coding interviews. We can define running time with
respect to NAND-RAM programs just as we did for Turing machines.

Definition 13.4 — Running time (RAM). Let 𝑇 ∶ ℕ → ℕ be some func-
tion mapping natural numbers to natural numbers. We say that
a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is computable in 𝑇 (𝑛) RAM time
(RAM-time for short) if there exists a NAND-RAM program 𝑃 such
that for every sufficiently large 𝑛 and every 𝑥 ∈ {0, 1}𝑛 , when given
input 𝑥, the program 𝑃 halts after executing at most 𝑇 (𝑛) lines and
outputs 𝐹 (𝑥).
We define TIMERAM (𝑇 (𝑛)) to be the set of Boolean functions
(functions mapping {0, 1}∗ to {0, 1}) that are computable in 𝑇 (𝑛)
RAM time.

Because NAND-RAM programs correspond more closely to our


natural notions of running time, we will use NAND-RAM as our
“default” model of running time, and hence use TIME(𝑇 (𝑛)) (without
any subscript) to denote TIMERAM (𝑇 (𝑛)). However, it turns out that
as long as we only care about the difference between exponential and
polynomial time, this does not make much difference. The reason is
that Turing machines can simulate NAND-RAM programs with at
most a polynomial overhead (see also Fig. 13.4):

Theorem 13.5 — Relating RAM and Turing machines. Let 𝑇 ∶ ℕ → ℕ be a
function such that 𝑇 (𝑛) ≥ 𝑛 for every 𝑛 and the map 𝑛 ↦ 𝑇 (𝑛) can
be computed by a Turing machine in time 𝑂(𝑇 (𝑛)). Then

TIMETM (𝑇 (𝑛)) ⊆ TIMERAM (10 ⋅ 𝑇 (𝑛)) ⊆ TIMETM (𝑇 (𝑛)4 ) . (13.1)

P
The technical details of Theorem 13.5, such as the con-
dition that 𝑛 ↦ 𝑇 (𝑛) is computable in 𝑂(𝑇 (𝑛)) time
or the constants 10 and 4 in (13.1) (which are not tight
and can be improved), are not very important. In par-
ticular, all non-pathological time bound functions we
encounter in practice such as 𝑇 (𝑛) = 𝑛, 𝑇 (𝑛) = 𝑛 log 𝑛,
𝑇 (𝑛) = 2𝑛 etc. will satisfy the conditions of Theo-
rem 13.5, see also Remark 13.6.
The main message of the theorem is that Turing machines
and RAM machines are “roughly equivalent” in the
sense that one can simulate the other with polynomial
overhead. Similarly, while the proof involves


some technical details, it’s not very deep or hard, and
merely follows the simulation of RAM machines with
Turing machines we saw in Theorem 8.1 with more
careful “book keeping”.

For example, by instantiating Theorem 13.5 with 𝑇 (𝑛) = 𝑛𝑎 and


using the fact that 10𝑛𝑎 = 𝑜(𝑛𝑎+1 ), we see that TIMETM (𝑛𝑎 ) ⊆
TIMERAM (𝑛𝑎+1 ) ⊆ TIMETM (𝑛4𝑎+4 ) which means that (by Solved Ex-
ercise 13.2)

P = ∪𝑎=1,2,… TIMETM (𝑛𝑎 ) = ∪𝑎=1,2,… TIMERAM (𝑛𝑎 ) .

Figure 13.4: The proof of Theorem 13.5 shows that we can simulate 𝑇 steps of a
Turing machine with 𝑇 steps of a NAND-RAM program, and can simulate 𝑇 steps of
a NAND-RAM program with 𝑜(𝑇 4 ) steps of a Turing machine. Hence
TIMETM (𝑇 (𝑛)) ⊆ TIMERAM (10 ⋅ 𝑇 (𝑛)) ⊆ TIMETM (𝑇 (𝑛)4 ).

That is, we could have equally well defined P as the class of functions
computable by NAND-RAM programs (instead of Turing machines) that run in time
polynomial in the length of the input. Similarly, by instantiating Theorem 13.5
with 𝑇 (𝑛) = 2^{𝑛^𝑎} we see that the class EXP can also be defined as the set of
functions computable by NAND-RAM programs in time at most 2𝑝(𝑛) where 𝑝 is
some polynomial. Similar
equivalence results are known for many models including cellular
automata, C/Python/Javascript programs, parallel computers, and a
great many other models, which justifies the choice of P as capturing
a technology-independent notion of tractability. (See Section 13.3
for more discussion of this issue.) This equivalence between Turing
machines and NAND-RAM (as well as other models) allows us to
pick our favorite model depending on the task at hand (i.e., “have our
cake and eat it too”) even when we study questions of efficiency, as
long as we only care about the gap between polynomial and exponential
time. When we want to design an algorithm, we can use the extra
power and convenience afforded by NAND-RAM. When we want
to analyze a program or prove a negative result, we can restrict our
attention to Turing machines.

 Big Idea 18 All “reasonable” computational models are equiv-


alent if we only care about the distinction between polynomial and
exponential.

The adjective “reasonable” above refers to all scalable computa-


tional models that have been implemented, with the possible excep-
tion of quantum computers, see Section 13.3 and Chapter 23.

Proof Idea:
The direction TIMETM (𝑇 (𝑛)) ⊆ TIMERAM (10 ⋅ 𝑇 (𝑛)) is not hard to
show, since a NAND-RAM program 𝑃 can simulate a Turing machine
𝑀 with constant overhead by storing the transition table of 𝑀 in

an array (as is done in the proof of Theorem 9.1). Simulating every


step of the Turing machine can be done in a constant number 𝑐 of
steps of RAM, and it can be shown this constant 𝑐 is smaller than
10. Thus the heart of the theorem is to prove that TIMERAM (𝑇 (𝑛)) ⊆
TIMETM (𝑇 (𝑛)4 ). This proof closely follows the proof of Theorem 8.1,
where we have shown that every function 𝐹 that is computable by
a NAND-RAM program 𝑃 is computable by a Turing machine (or
equivalently a NAND-TM program) 𝑀 . To prove Theorem 13.5, we
follow the exact same proof but just check that the overhead of the
simulation of 𝑃 by 𝑀 is polynomial. The proof has many details, but
is not deep. It is therefore much more important that you understand
the statement of this theorem than its proof.

Proof of Theorem 13.5. We only focus on the non-trivial direction


TIMERAM (𝑇 (𝑛)) ⊆ TIMETM (𝑇 (𝑛)4 ). Let 𝐹 ∈ TIMERAM (𝑇 (𝑛)). 𝐹 can
be computed in time 𝑇 (𝑛) by some NAND-RAM program 𝑃 and we
need to show that it can also be computed in time 𝑇 (𝑛)4 by a Turing
machine 𝑀 . This will follow from showing that 𝐹 can be computed
in time 𝑇 (𝑛)4 by a NAND-TM program, since for every NAND-TM
program 𝑄 there is a Turing machine 𝑀 simulating it such that each
iteration of 𝑄 corresponds to a single step of 𝑀 .
As mentioned above, we follow the proof of Theorem 8.1 (simula-
tion of NAND-RAM programs using NAND-TM programs) and use
the exact same simulation, but with a more careful accounting of the
number of steps that the simulation costs. Recall, that the simulation
of NAND-RAM works by “peeling off” features of NAND-RAM one
by one, until we are left with NAND-TM.
We will not provide the full details but will present the main ideas
used in showing that every feature of NAND-RAM can be simulated
by NAND-TM with at most a polynomial overhead:

1. Recall that every NAND-RAM variable or array element can con-


tain an integer between 0 and 𝑇 where 𝑇 is the number of lines that
have been executed so far. Therefore if 𝑃 is a NAND-RAM pro-
gram that computes 𝐹 in 𝑇 (𝑛) time, then on inputs of length 𝑛, all
integers used by 𝑃 are of magnitude at most 𝑇 (𝑛). This means that
the largest value i can ever reach is at most 𝑇 (𝑛) and so each one of
𝑃 ’s variables can be thought of as an array of at most 𝑇 (𝑛) indices,
each of which holds a natural number of magnitude at most 𝑇 (𝑛).
We let ℓ = ⌈log 𝑇 (𝑛)⌉ be the number of bits needed to encode such
numbers. (We can start off the simulation by computing 𝑇 (𝑛) and
ℓ.)

2. We can encode a NAND-RAM array of length ≤ 𝑇 (𝑛) containing


numbers in {0, … , 𝑇 (𝑛) − 1} as a Boolean (i.e., NAND-TM) array
of 𝑇 (𝑛)ℓ = 𝑂(𝑇 (𝑛) log 𝑇 (𝑛)) bits, which we can also think of as
a two dimensional array as we did in the proof of Theorem 8.1. We
encode a NAND-RAM scalar containing a number in {0, … , 𝑇 (𝑛) −
1} simply by a shorter NAND-TM array of ℓ bits.

3. We can simulate the two dimensional arrays using one-


dimensional arrays of length 𝑇 (𝑛)ℓ = 𝑂(𝑇 (𝑛) log 𝑇 (𝑛)). All the
arithmetic operations on integers use the grade-school algorithms,
that take time that is polynomial in the number ℓ of bits of the
integers, which is 𝑝𝑜𝑙𝑦(log 𝑇 (𝑛)) in our case. Hence we can simulate
𝑇 (𝑛) steps of NAND-RAM with 𝑂(𝑇 (𝑛)𝑝𝑜𝑙𝑦(log 𝑇 (𝑛))) steps of a
model that uses random access memory but only Boolean-valued
one-dimensional arrays.

4. The most expensive step is to translate from random access mem-


ory to the sequential memory model of NAND-TM/Turing ma-
chines. As we did in the proof of Theorem 8.1 (see Section 8.2), we
can simulate accessing an array Foo at some location encoded in an
array Bar by:

a. Copying Bar to some temporary array Temp


b. Having an array Index which is initially all zeros except 1 at the
first location.
c. Repeating the following until Temp encodes the number 0:
(Number of repetitions is at most 𝑇 (𝑛).)
• Decrease the number encoded in Temp by 1. (Takes a number of
steps polynomial in ℓ = ⌈log 𝑇 (𝑛)⌉.)
• Decrease i until it is equal to 0. (Takes 𝑂(𝑇 (𝑛)) steps.)
• Scan Index until we reach the point in which it equals 1 and
then change this 1 to 0 and go one step further and write 1 in
this location. (Takes 𝑂(𝑇 (𝑛)) steps.)
d. When we are done we know that if we scan Index until we reach
the point in which Index[i]= 1 then i contains the value that
was encoded by Bar (Takes 𝑂(𝑇 (𝑛)) steps.)

The total cost for each such operation is 𝑂(𝑇 (𝑛)2 +𝑇 (𝑛)𝑝𝑜𝑙𝑦(log 𝑇 (𝑛))) =
𝑂(𝑇 (𝑛)2 ) steps.
In sum, we simulate a single step of NAND-RAM using
𝑂(𝑇 (𝑛)2 𝑝𝑜𝑙𝑦(log 𝑇 (𝑛))) steps of NAND-TM, and hence the total
simulation time is 𝑂(𝑇 (𝑛)3 𝑝𝑜𝑙𝑦(log 𝑇 (𝑛))) which is smaller than 𝑇 (𝑛)4
for sufficiently large 𝑛.
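To get more intuition for why step 4 dominates the cost, here is a toy Python
model (an illustration only, not the construction used in the proof; the class
and function names are assumptions of the sketch). Memory may only be accessed
by moving a single head one cell at a time, as on a tape, and we count how many
elementary moves a single "random access" costs; since each access costs a
number of moves proportional to the address, 𝑇 accesses cost 𝑂(𝑇 2 ) sequential
steps overall.

class SequentialTape:
    def __init__(self, cells):
        self.cells = list(cells)
        self.head = 0       # current head position
        self.steps = 0      # number of elementary (one-cell) moves performed

    def move(self, delta):  # delta is +1 or -1
        self.head += delta
        self.steps += 1

    def read(self):
        return self.cells[self.head]

def read_address(tape, address):
    """Read the cell at `address` using only one-cell head movements."""
    while tape.head > 0:            # sweep back to the start
        tape.move(-1)
    for _ in range(address):        # then walk right, one cell per step
        tape.move(+1)
    return tape.read()

tape = SequentialTape(range(1000))
print(read_address(tape, 700), tape.steps)   # 700 700: one access took ~700 moves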


R
Remark 13.6 — Nice time bounds. When considering
general time bounds we need to make sure to rule
out some “pathological” cases such as functions 𝑇
that don’t give enough time for the algorithm to read
the input, or functions where the time bound itself is
uncomputable. We say that a function 𝑇 ∶ ℕ → ℕ is
a nice time bound function (or nice function for short)
if for every 𝑛 ∈ ℕ, 𝑇 (𝑛) ≥ 𝑛 (i.e., 𝑇 allows enough
time to read the input), for every 𝑛′ ≥ 𝑛, 𝑇 (𝑛′ ) ≥ 𝑇 (𝑛)
(i.e., 𝑇 allows more time on longer inputs), and the
map 𝐹 (𝑥) = 1𝑇 (|𝑥|) (i.e., mapping a string of length
𝑛 to a sequence of 𝑇 (𝑛) ones) can be computed by a
NAND-RAM program in 𝑂(𝑇 (𝑛)) time.
All the “normal” time complexity bounds we en-
counter in applications such as 𝑇 (𝑛) = 100𝑛,
𝑇 (𝑛) = 𝑛2 log 𝑛, 𝑇 (𝑛) = 2^√𝑛 , etc. are “nice”.
Hence from now on we will only care about the
class TIME(𝑇 (𝑛)) when 𝑇 is a “nice” function. The
computability condition is in particular typically easily
satisfied. For example, for arithmetic functions such
as 𝑇 (𝑛) = 𝑛3 , we can typically compute the binary
representation of 𝑇 (𝑛) in time polynomial in the num-
ber of bits of 𝑇 (𝑛) and hence poly-logarithmic in 𝑇 (𝑛).
Hence the time to write the string 1𝑇 (𝑛) in such cases
will be 𝑇 (𝑛) + 𝑝𝑜𝑙𝑦(log 𝑇 (𝑛)) = 𝑂(𝑇 (𝑛)).

13.3 EXTENDED CHURCH-TURING THESIS (DISCUSSION)


Theorem 13.5 shows that the computational models of Turing machines
and RAM machines / NAND-RAM programs are equivalent up to poly-
nomial factors in the running time. Other examples of polynomially
equivalent models include:

• All standard programming languages, including C/Python/JavaScript/Lisp/etc.

• The 𝜆 calculus (see also Section 13.8).

• Cellular automata

• Parallel computers

• Biological computing devices such as DNA-based computers.

The Extended Church Turing Thesis is the statement that this is true
for all physically realizable computing models. In other words, the
extended Church Turing thesis says that for every scalable computing
device 𝐶 (which has a finite description but can be in principle used
to run computation on arbitrarily large inputs), there is some con-
stant 𝑎 such that for every function 𝐹 ∶ {0, 1}∗ → {0, 1} that 𝐶 can

compute on 𝑛 length inputs using an 𝑆(𝑛) amount of physical re-


sources, 𝐹 is in TIME(𝑆(𝑛)𝑎 ). This is a strengthening of the (“plain”)
Church-Turing Thesis, discussed in Section 8.8, which states that the
set of computable functions is the same for all physically realizable
models, but without requiring the overhead in the simulation between
different models to be at most polynomial.
All the current constructions of scalable computational models and
programming languages conform to the Extended Church-Turing
Thesis, in the sense that they can be simulated with polynomial over-
head by Turing machines (and hence also by NAND-TM or NAND-
RAM programs). Consequently, the classes P and EXP are robust to
the choice of model, and we can use the programming language of
our choice, or high level descriptions of an algorithm, to determine
whether or not a problem is in P.
Like the Church-Turing thesis itself, the extended Church-Turing
thesis is in the asymptotic setting and does not directly yield an ex-
perimentally testable prediction. However, it can be instantiated with
more concrete bounds on the overhead, yielding experimentally-
testable predictions such as the Physical Extended Church-Turing Thesis
we mentioned in Section 5.6.
In the last hundred+ years of studying and mechanizing com-
putation, no one has yet constructed a scalable computing device
that violates the extended Church Turing Thesis. However, quan-
tum computing, if realized, will pose a serious challenge to the ex-
tended Church-Turing Thesis (see Chapter 23). However, even if
the promises of quantum computing are fully realized, the extended
Church-Turing thesis is “morally” correct, in the sense that, while we
do need to adapt the thesis to account for the possibility of quantum
computing, its broad outline remains unchanged. We are still able
to model computation mathematically, we can still treat programs
as strings and have a universal program, we still have time hierarchy
and uncomputability results, and there is still no reason to doubt the
(“plain”) Church-Turing thesis. Moreover, the prospect of quantum
computing does not seem to make a difference for the time complexity
of many (though not all!) of the concrete problems that we care about.
In particular, as far as we know, out of all the example problems men-
tioned in Chapter 12 the complexity of only one— integer factoring—
is affected by modifying our model to include quantum computers as
well.

13.4 EFFICIENT UNIVERSAL MACHINE: A NAND-RAM INTERPRETER IN NAND-RAM
We have seen in Theorem 9.1 the “universal Turing machine”. Exam-
ining that proof, and combining it with Theorem 13.5 , we can see that
the program 𝑈 has a polynomial overhead, in the sense that it can sim-
ulate 𝑇 steps of a given NAND-TM (or NAND-RAM) program 𝑃 on
an input 𝑥 in 𝑂(𝑇 4 ) steps. But in fact, by directly simulating NAND-
RAM programs we can do better with only a constant multiplicative
overhead. That is, there is a universal NAND-RAM program 𝑈 such that
for every NAND-RAM program 𝑃 , 𝑈 simulates 𝑇 steps of 𝑃 using
only 𝑂(𝑇 ) steps. (The implicit constant in the 𝑂 notation can depend
on the program 𝑃 but does not depend on the length of the input.)

Theorem 13.7 — Efficient universality of NAND-RAM. There exists a NAND-RAM
program 𝑈 satisfying the following:

1. (𝑈 is a universal NAND-RAM program.) For every NAND-RAM


program 𝑃 and input 𝑥, 𝑈 (𝑃 , 𝑥) = 𝑃 (𝑥) where by 𝑈 (𝑃 , 𝑥) we
denote the output of 𝑈 on a string encoding the pair (𝑃 , 𝑥).

2. (𝑈 is efficient.) There are some constants 𝑎, 𝑏 such that for ev-


ery NAND-RAM program 𝑃 , if 𝑃 halts on input 𝑥 after at most
𝑇 steps, then 𝑈 (𝑃 , 𝑥) halts after at most 𝐶 ⋅ 𝑇 steps where
𝐶 ≤ 𝑎|𝑃 |𝑏 .

P
As in the case of Theorem 13.5, the proof of Theo-
rem 13.7 is not very deep and so it is more important
to understand its statement. Specifically, if you under-
stand how you would go about writing an interpreter
for NAND-RAM using a modern programming lan-
guage such as Python, then you know everything you
need to know about the proof of this theorem.

Proof of Theorem 13.7. To present a universal NAND-RAM program in full we would
need to describe a precise representation scheme, as well as the full NAND-RAM
instructions for the program. While this can be done, it is more important to
focus on the main ideas, and so we just sketch the proof here. A specification
of NAND-RAM is given in the appendix, and for the purposes of this simulation,
we can simply use the representation of the NAND-RAM code as an ASCII string.

Figure 13.5: The universal NAND-RAM program 𝑈 simulates an input NAND-RAM
program 𝑃 by storing all of 𝑃 ’s variables inside a single array Vars of 𝑈. If
𝑃 has 𝑡 variables, then the array Vars is divided into blocks of length 𝑡,
where the 𝑗-th coordinate of the 𝑖-th block contains the 𝑖-th element of the
𝑗-th array of 𝑃 . If the 𝑗-th variable of 𝑃 is scalar, then we just store its
value in the zeroth block of Vars.

The program 𝑈 gets as input a NAND-RAM program 𝑃 and an


input 𝑥 and simulates 𝑃 one step at a time. To do so, 𝑈 does the fol-
lowing:

1. 𝑈 maintains variables program_counter, and number_steps for the


current line to be executed and the number of steps executed so far.

2. 𝑈 initially scans the code of 𝑃 to find the number 𝑡 of unique vari-


able names that 𝑃 uses. It will translate each variable name into a
number between 0 and 𝑡 − 1 and use an array Program to store 𝑃 ’s
code where for every line ℓ, Program[ℓ] will store the ℓ-th line of 𝑃
where the variable names have been translated to numbers. (More
concretely, we will use a constant number of arrays to separately
encode the operation used in this line, and the variable names and
indices of the operands.)

3. 𝑈 maintains a single array Vars that contains all the values of 𝑃 ’s


variables. We divide Vars into blocks of length 𝑡. If 𝑠 is a num-
ber corresponding to an array variable Foo of 𝑃 , then we store
Foo[0] in Vars[𝑠], we store Foo[1] in Vars[𝑡 + 𝑠], Foo[2]
in Vars[2𝑡 + 𝑠] and so on and so forth (see Fig. 13.5). Generally,
if the 𝑠-th variable of 𝑃 is a scalar variable, then its value will be
stored in location Vars[𝑠]. If it is an array variable then the value of
its 𝑖-th element will be stored in location Vars[𝑡 ⋅ 𝑖 + 𝑠].

4. To simulate a single step of 𝑃 , the program 𝑈 recovers from Pro-


gram the line corresponding to program_counter and executes it.
Since NAND-RAM has a constant number of arithmetic operations,
we can implement the logic of which operation to execute using a
sequence of a constant number of if-then-else’s. Retrieving from
Vars the values of the operands of each instruction can be done
using a constant number of arithmetic operations.

The setup stages take only a constant (depending on |𝑃 | but not


on the input 𝑥) number of steps. Once we are done with the setup, to
simulate a single step of 𝑃 , we just need to retrieve the corresponding
line and do a constant number of “if elses” and accesses to Vars to
simulate it. Hence the total running time to simulate 𝑇 steps of the
program 𝑃 is at most 𝑂(𝑇 ) when suppressing constants that depend
on the program 𝑃 .
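To make the "interpreter" picture concrete, here is a minimal Python sketch of
such a simulation loop. It does not use the actual NAND-RAM syntax (which is
specified in the appendix); the tuple encoding of instructions, the tiny
three-instruction set, and the names run and prog are simplifying assumptions
made only for illustration. The structure the proof relies on is visible: after
a setup phase whose cost does not depend on the number of steps to be simulated,
each iteration of the loop simulates one instruction with a constant amount of
interpreter work, so 𝑇 steps are simulated in 𝑂(𝑇 ) time.

def run(program, x, max_steps):
    """Simulate `program` on the list of input bits `x` for at most `max_steps` steps.

    Instructions are tuples whose variable references have already been
    translated into integer addresses of the single memory array (playing the
    role of U's Vars array):
      ('NAND', dst, a, b)  -- memory[dst] = NAND(memory[a], memory[b])
      ('JMP', cond, line)  -- jump to `line` if memory[cond] == 1
      ('HALT', out)        -- halt and output memory[out]
    """
    memory = {i: bit for i, bit in enumerate(x)}   # setup: load the input
    pc = 0        # program_counter
    steps = 0     # number_steps
    while steps < max_steps:
        op = program[pc]
        if op[0] == 'NAND':
            _, dst, a, b = op
            memory[dst] = 1 - (memory.get(a, 0) & memory.get(b, 0))
            pc += 1
        elif op[0] == 'JMP':
            _, cond, line = op
            pc = line if memory.get(cond, 0) == 1 else pc + 1
        elif op[0] == 'HALT':
            return memory.get(op[1], 0)
        steps += 1
    return None   # did not halt within the step budget

# Example: a two-instruction program computing NAND of the two input bits.
prog = [('NAND', 2, 0, 1), ('HALT', 2)]
print(run(prog, [1, 1], max_steps=100))   # prints 0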

13.4.1 Timed Universal Turing Machine


One corollary of the efficient universal machine is the following.
Given any Turing machine 𝑀 , input 𝑥, and “step budget” 𝑇 , we can
simulate the execution of 𝑀 for 𝑇 steps in time that is polynomial in

𝑇 . Formally, we define a function TIMEDEVAL that takes the three


parameters 𝑀 , 𝑥, and the time budget, and outputs 𝑀 (𝑥) if 𝑀 halts
within at most 𝑇 steps, and outputs 0 otherwise. The timed univer-
sal Turing machine computes TIMEDEVAL in polynomial time (see
Fig. 13.6). (Since we measure time as a function of the input length,
we define TIMEDEVAL as taking the input 𝑇 represented in unary: a
string of 𝑇 ones.)

Theorem 13.8 — Timed Universal Turing Machine. Let TIMEDEVAL ∶
{0, 1}∗ → {0, 1}∗ be the function defined as

                           ⎧𝑀 (𝑥)   𝑀 halts within ≤ 𝑇 steps on 𝑥
    TIMEDEVAL(𝑀 , 𝑥, 1𝑇 ) = ⎨
                           ⎩0       otherwise

Then TIMEDEVAL ∈ P.

Proof. We only sketch the proof since the result follows fairly directly
from Theorem 13.5 and Theorem 13.7. By Theorem 13.5 to show that
TIMEDEVAL ∈ P, it suffices to give a polynomial-time NAND-RAM
program to compute TIMEDEVAL.
Figure 13.6: The timed universal Turing machine takes as input a Turing machine
𝑀, an input 𝑥, and a time bound 𝑇 , and outputs 𝑀(𝑥) if 𝑀 halts within at most
𝑇 steps. Theorem 13.8 states that there is such a machine that runs in time
polynomial in 𝑇 .

Such a program can be obtained as follows. Given a Turing machine 𝑀 , by
Theorem 13.5 we can transform it in time polynomial in its description into a
functionally-equivalent NAND-RAM program 𝑃 such that the execution of 𝑀 on 𝑇
steps can be simulated by the execution of 𝑃 on 𝑐 ⋅ 𝑇 steps. We can then run
the universal NAND-RAM machine of Theorem 13.7 to simulate 𝑃 for 𝑐 ⋅ 𝑇 steps, using
𝑂(𝑇 ) time, and output 0 if the execution did not halt within this bud-
get. This shows that TIMEDEVAL can be computed by a NAND-RAM
program in time polynomial in |𝑀 | and linear in 𝑇 , which means
TIMEDEVAL ∈ P.
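In the same toy setting as the interpreter sketch of Section 13.4, TIMEDEVAL is
just the step-bounded simulation with a default output of 0. Here compile_to_ram
stands for the assumed polynomial-time translation of a Turing machine into a
program for the toy interpreter (it is not implemented here), and overhead_c
plays the role of the constant 𝑐; both are assumptions of this sketch.

def timed_eval(M, x, T, compile_to_ram, overhead_c=10):
    # Translate M once (cost depends on |M| but not on T), then simulate it for
    # at most overhead_c * T steps using the run() sketch above.
    prog = compile_to_ram(M)
    result = run(prog, x, max_steps=overhead_c * T)
    return 0 if result is None else result   # output 0 if M did not halt in time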

13.5 THE TIME HIERARCHY THEOREM


Some functions are uncomputable, but are there functions that can
be computed, but only at an exorbitant cost? For example, is there a
function that can be computed in time 2𝑛 , but can not be computed in
time 20.9𝑛 ? It turns out that the answer is Yes:

Theorem 13.9 — Time Hierarchy Theorem. For every nice function 𝑇 ∶
ℕ → ℕ, there is a function 𝐹 ∶ {0, 1}∗ → {0, 1} in TIME(𝑇 (𝑛) log 𝑛) ⧵
TIME(𝑇 (𝑛)).

There is nothing special about log 𝑛, and we could have used any
other efficiently computable function that tends to infinity with 𝑛.

 Big Idea 19 If we have more time, we can compute more functions.

R
Remark 13.10 — Simpler corollary of the time hierarchy
theorem. The generality of the time hierarchy theorem
can make its proof a little hard to read. It might be
easier to follow the proof if you first try to prove by
yourself the easier statement P ⊊ EXP.
You can do so by showing that the following function
𝐹 ∶ {0, 1}∗ → {0, 1} is in EXP ⧵ P: for every Turing
machine 𝑀 and input 𝑥, 𝐹 (𝑀 , 𝑥) = 1 if and only if
𝑀 halts on 𝑥 within at most |𝑥|log |𝑥| steps. One can
show that 𝐹 ∈ TIME(𝑛𝑂(log 𝑛) ) ⊆ EXP using the
universal Turing machine (or the efficient universal
NAND-RAM program of Theorem 13.7). On the other
hand, we can use similar ideas to those used to show
the uncomputability of HALT in Section 9.3.2 to prove
that 𝐹 ∉ P.

Figure 13.7: The Time Hierarchy Theorem (Theorem 13.9) states that all of these
classes are distinct.

Proof Idea:
In the proof of Theorem 9.6 (the uncomputability of the Halting
problem), we have shown that the function HALT cannot be com-
puted in any finite time. An examination of the proof shows that it
gives something stronger. Namely, the proof shows that if we fix our
computational budget to be 𝑇 steps, then not only can we not dis-
tinguish between programs that halt and those that do not, but we
cannot even distinguish between programs that halt within at most 𝑇 ′
steps and those that take more than that (where 𝑇 ′ is some number
depending on 𝑇 ). Therefore, the proof of Theorem 13.9 follows the

ideas of the uncomputability of the halting problem, but again with a


more careful accounting of the running time.

Proof of Theorem 13.9. Our proof is inspired by the proof of the un-
computability of the halting problem. Specifically, for every function
𝑇 as in the theorem’s statement, we define the Bounded Halting func-
tion HALT𝑇 as follows. The input to HALT𝑇 is a pair (𝑃 , 𝑥) such that
|𝑃 | ≤ log log |𝑥| and 𝑃 encodes some NAND-RAM program. We
define

                ⎧1,   𝑃 halts on 𝑥 within ≤ 100 ⋅ 𝑇 (|𝑃 | + |𝑥|) steps
HALT𝑇 (𝑃 , 𝑥) = ⎨
                ⎩0,   otherwise

(The constant 100 and the function log log 𝑛 are rather arbitrary, and
are chosen for convenience in this proof.)
Theorem 13.9 is an immediate consequence of the following two
claims:
Claim 1: HALT𝑇 ∈ TIME(𝑇 (𝑛) ⋅ log 𝑛)
and
Claim 2: HALT𝑇 ∉ TIME(𝑇 (𝑛)).
Please make sure you understand why indeed the theorem follows
directly from the combination of these two claims. We now turn to
proving them.
Proof of claim 1: We can easily check in linear time whether an
input has the form 𝑃 , 𝑥 where |𝑃 | ≤ log log |𝑥|. Since 𝑇 (⋅) is a nice
function, we can evaluate it in 𝑂(𝑇 (𝑛)) time. Thus, we can compute
HALT𝑇 (𝑃 , 𝑥) as follows:

1. Compute 𝑇0 = 𝑇 (|𝑃 | + |𝑥|) in 𝑂(𝑇0 ) steps.

2. Use the universal NAND-RAM program of Theorem 13.7 to simu-


late 100⋅𝑇0 steps of 𝑃 on the input 𝑥 using at most 𝑝𝑜𝑙𝑦(|𝑃 |)𝑇0 steps.
(Recall that we use 𝑝𝑜𝑙𝑦(ℓ) to denote a quantity that is bounded by
𝑎ℓ𝑏 for some constants 𝑎, 𝑏.)

3. If 𝑃 halts within these 100 ⋅ 𝑇0 steps then output 1, else output 0.

The length of the input is 𝑛 = |𝑃 | + |𝑥|. Since |𝑥| ≤ 𝑛 and


(log log |𝑥|)𝑏 = 𝑜(log |𝑥|) for every 𝑏, the running time will be
𝑜(𝑇 (|𝑃 | + |𝑥|) log 𝑛) and hence the above algorithm demonstrates that
HALT𝑇 ∈ TIME(𝑇 (𝑛) ⋅ log 𝑛), completing the proof of Claim 1.
Proof of claim 2: This proof is the heart of Theorem 13.9, and is
very reminiscent of the proof that HALT is not computable. Assume,
for the sake of contradiction, that there is some NAND-RAM program

𝑃 ∗ that computes HALT𝑇 (𝑃 , 𝑥) within 𝑇 (|𝑃 | + |𝑥|) steps. We are going


to show a contradiction by creating a program 𝑄 and showing that
under our assumptions, if 𝑄 runs for less than 𝑇 (𝑛) steps when given
(a padded version of) its own code as input then it actually runs for
more than 𝑇 (𝑛) steps and vice versa. (It is worth re-reading the last
sentence twice or thrice to make sure you understand this logic. It is
very similar to the direct proof of the uncomputability of the halting
problem where we obtained a contradiction by using an assumed
“halting solver” to construct a program that, given its own code as
input, halts if and only if it does not halt.)
We will define 𝑄∗ to be the program that on input a string 𝑧 does
the following:

1. If 𝑧 does not have the form 𝑧 = 𝑃 1𝑚 where 𝑃 represents a NAND-


RAM program and |𝑃 | < 0.1 log log 𝑚 then return 0. (Recall that
1𝑚 denotes the string of 𝑚 ones.)

2. Compute 𝑏 = 𝑃 ∗ (𝑃 , 𝑧) (at a cost of at most 𝑇 (|𝑃 | + |𝑧|) steps, under


our assumptions).

3. If 𝑏 = 1 then 𝑄∗ goes into an infinite loop, otherwise it halts.

Let ℓ be the length of the description of 𝑄∗ as a string, and let 𝑚 be larger
than 2^{2^{1000ℓ}} . We will reach a contradiction by splitting into cases
according to whether or not HALT𝑇 (𝑄∗ , 𝑄∗ 1𝑚 ) equals 0 or 1.


On the one hand, if HALT𝑇 (𝑄∗ , 𝑄∗ 1𝑚 ) = 1, then under our as-
sumption that 𝑃 ∗ computes HALT𝑇 , 𝑄∗ will go into an infinite loop
on input 𝑧 = 𝑄∗ 1𝑚 , and hence in particular 𝑄∗ does not halt within
100𝑇 (|𝑄∗ | + 𝑚) steps on the input 𝑧. But this contradicts our assump-
tion that HALT𝑇 (𝑄∗ , 𝑄∗ 1𝑚 ) = 1.
This means that it must hold that HALT𝑇 (𝑄∗ , 𝑄∗ 1𝑚 ) = 0. But
in this case, since we assume 𝑃 ∗ computes HALT𝑇 , 𝑄∗ does not do
anything in phase 3 of its computation, and so the only computa-
tion costs come in phases 1 and 2 of the computation. It is not hard
to verify that Phase 1 can be done in linear time, and in fact in less than 5|𝑧|
steps. Phase 2 involves executing 𝑃 ∗ , which under our assumption
requires 𝑇 (|𝑄∗ | + 𝑚) steps. In total we can perform both phases in
less than 10𝑇 (|𝑄∗ | + 𝑚) steps, which by definition means that
HALT𝑇 (𝑄∗ , 𝑄∗ 1𝑚 ) = 1, but this is of course a contradiction. This
completes the proof of Claim 2 and hence of Theorem 13.9.
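The three phases of 𝑄∗ can be written out schematically in Python. The
hypothetical solver 𝑃 ∗ is passed in as a parameter (by the theorem it cannot
actually exist), and parsing 𝑧 into the form 𝑃 1𝑚 by looking for the last zero
is a crude stand-in for the real string encoding; both are assumptions of this
sketch, whose only purpose is to make the self-referential structure explicit.

import math

def Q_star(z, P_star):
    # Phase 1: check that z = P + '1'*m with |P| < 0.1 * log log m.
    split = z.rfind('0') + 1          # crude parse: P ends at the last '0'
    P, m = z[:split], len(z) - split
    if m < 2 or len(P) >= 0.1 * math.log2(math.log2(m)):
        return 0
    # Phase 2: run the assumed T(n)-time solver for bounded halting on (P, z).
    b = P_star(P, z)
    # Phase 3: do the opposite of what the solver predicts about this very run.
    if b == 1:
        while True:   # predicted to halt within the bound, so loop forever
            pass
    return 0          # predicted to exceed the bound, so halt right away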

Solved Exercise 13.3 — P vs EXP. Prove that P ⊊ EXP.




Solution:
This statement follows directly from the time hierarchy theo-
rem, but it can be an instructive exercise to prove it directly, see
Remark 13.10. We need to show that there exists 𝐹 ∈ EXP ⧵ P.
Let 𝑇 (𝑛) = 𝑛log 𝑛 and 𝑇 ′ (𝑛) = 𝑛log 𝑛/2 . Both are nice functions.
Since 𝑇 (𝑛)/𝑇 ′ (𝑛) = 𝜔(log 𝑛), by Theorem 13.9 there exists some
𝐹 in TIME(𝑇 (𝑛)) ⧵ TIME(𝑇 ′ (𝑛)). Since for sufficiently large 𝑛,
2𝑛 > 𝑛log 𝑛 , 𝐹 ∈ TIME(2𝑛 ) ⊆ EXP. On the other hand, 𝐹 ∉ P.
Indeed, suppose otherwise that there was a constant 𝑐 > 0 and
a Turing machine computing 𝐹 on 𝑛-length input in at most 𝑛𝑐
steps for all sufficiently large 𝑛. Then since for 𝑛 large enough
𝑛𝑐 < 𝑛log 𝑛/2 , it would have followed that 𝐹 ∈ TIME(𝑛log 𝑛/2 )
contradicting our choice of 𝐹 .

The time hierarchy theorem tells us that there are functions we can

compute in 𝑂(𝑛2 ) time but not 𝑂(𝑛), in 2𝑛 time, but not 2^√𝑛 , etc. In
particular there are most definitely functions that we can compute in
time 2𝑛 but not 𝑂(𝑛). We have seen that we have no shortage of natu-
ral functions for which the best known algorithm requires roughly 2𝑛
time, and that many people have invested significant effort in trying
to improve that. However, unlike in the finite vs. infinite case, for all
of the examples above at the moment we do not know how to rule
out even an 𝑂(𝑛) time algorithm. We will however see that there is a
single unproven conjecture that would imply such a result for most of
these problems.
The time hierarchy theorem relies on the existence of an efficient
universal NAND-RAM program, as proven in Theorem 13.7. For
other models such as Turing machines we have similar time hierarchy
results showing that there are functions computable in time 𝑇 (𝑛) and
not in time 𝑇 (𝑛)/𝑓(𝑛) where 𝑓(𝑛) corresponds to the overhead in the
corresponding universal machine.

Figure 13.8: Some complexity classes and some of the functions we know (or
conjecture) to be contained in them.

13.6 NON-UNIFORM COMPUTATION

We have now seen two measures of “computation cost” for functions.


In Section 4.6 we defined the complexity of computing finite functions
using circuits / straightline programs. Specifically, for a finite function
𝑔 ∶ {0, 1}𝑛 → {0, 1} and number 𝑠 ∈ ℕ, 𝑔 ∈ SIZE𝑛 (𝑠) if there is a circuit
of at most 𝑠 NAND gates (or equivalently an 𝑠-line NAND-CIRC
program) that computes 𝑔. To relate this to the classes TIME(𝑇 (𝑛))
defined in this chapter we first need to extend the class SIZE𝑛 (𝑠) from
finite functions to functions with unbounded input length.

Definition 13.11 — Non-uniform computation. Let 𝐹 ∶ {0, 1}∗ → {0, 1}
and 𝑇 ∶ ℕ → ℕ be a nice time bound. For every 𝑛 ∈ ℕ, define
𝐹↾𝑛 ∶ {0, 1}𝑛 → {0, 1} to be the restriction of 𝐹 to inputs of size 𝑛.
That is, 𝐹↾𝑛 is the function mapping {0, 1}𝑛 to {0, 1} such that for
every 𝑥 ∈ {0, 1}𝑛 , 𝐹↾𝑛 (𝑥) = 𝐹 (𝑥).
We say that 𝐹 is non-uniformly computable in at most 𝑇 (𝑛) size,
denoted by 𝐹 ∈ SIZE(𝑇 ) if there exists a sequence (𝐶0 , 𝐶1 , 𝐶2 , …)
of NAND circuits such that:

• For every 𝑛 ∈ ℕ, 𝐶𝑛 computes the function 𝐹↾𝑛

• For every sufficiently large 𝑛, 𝐶𝑛 has at most 𝑇 (𝑛) gates.

In other words, 𝐹 ∈ SIZE(𝑇 ) iff for every 𝑛 ∈ ℕ, it holds that


𝐹↾𝑛 ∈ SIZE𝑛 (𝑇 (𝑛)). The non-uniform analog to the class P is the class
P/poly defined as

P/poly = ∪𝑐∈ℕ SIZE(𝑛𝑐 ) . (13.2)


There is a big difference between non-uniform computation and uni-
form complexity classes such as TIME(𝑇 (𝑛)) or P. The condition
𝐹 ∈ P means that there is a single Turing machine 𝑀 that computes
𝐹 on all inputs in polynomial time. The condition 𝐹 ∈ P/poly only
means that for every input length 𝑛 there can be a different circuit 𝐶𝑛
that computes 𝐹 using polynomially many gates on inputs of these
lengths. As we will see, 𝐹 ∈ P/poly does not necessarily imply that
𝐹 ∈ P. However, the other direction is true:

Theorem 13.12 — Non-uniform computation contains uniform computation. There is
some 𝑎 ∈ ℕ s.t. for every nice 𝑇 ∶ ℕ → ℕ and
𝐹 ∶ {0, 1}∗ → {0, 1},

TIME(𝑇 (𝑛)) ⊆ SIZE(𝑇 (𝑛)𝑎 ) .

In particular, Theorem 13.12 shows that for every 𝑐, TIME(𝑛𝑐 ) ⊆


SIZE(𝑛𝑐𝑎 ) and hence P ⊆ P/poly .
Figure 13.9: We can think of an infinite function 𝐹 ∶ {0, 1}∗ → {0, 1} as a
collection of finite functions 𝐹0 , 𝐹1 , 𝐹2 , … where 𝐹↾𝑛 ∶ {0, 1}𝑛 → {0, 1} is
the restriction of 𝐹 to inputs of length 𝑛. We say 𝐹 is in P/poly if for every
𝑛, the function 𝐹↾𝑛 is computable by a polynomial-size NAND-CIRC program, or
equivalently, a polynomial-sized Boolean circuit.

Proof Idea:
The idea behind the proof is to “unroll the loop”. Specifically, we will use the
programming language variants of non-uniform and uniform computation: namely
NAND-CIRC and NAND-TM. The main difference between the two is that NAND-TM has
loops. However, for

every fixed 𝑛, if we know that a NAND-TM program runs in at most


𝑇 (𝑛) steps, then we can replace its loop by simply “copying and past-
ing” its code 𝑇 (𝑛) times, similar to how in Python we can replace code
such as

for i in range(4):
print(i)

with the “loop free” code

print(0)
print(1)
print(2)
print(3)

To make this idea into an actual proof we need to tackle one tech-
nical difficulty, and this is to ensure that the NAND-TM program is
oblivious in the sense that the value of the index variable i in the 𝑗-th
iteration of the loop will depend only on 𝑗 and not on the contents of
the input. We make a digression to do just that in Section 13.6.1 and
then complete the proof of Theorem 13.12.

13.6.1 Oblivious NAND-TM programs


Our approach for proving Theorem 13.12 involves “unrolling the
loop”. For example, consider the following NAND-TM to compute the
XOR function on inputs of arbitrary length:

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

Setting (as an example) 𝑛 = 3, we can attempt to translate this


NAND-TM program into a NAND-CIRC program for computing
XOR3 ∶ {0, 1}3 → {0, 1} by simply “copying and pasting” the loop
three times (dropping the MODANDJMP line):

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)

temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)

However, the above is still not a valid NAND-CIRC program since


it contains references to the special variable i. To make it into a valid
NAND-CIRC program, we replace references to i in the first iteration
with 0, references in the second iteration with 1, and references in the
third iteration with 2. (We also create a variable zero and use it for the
first time any variable is instantiated, as well as remove assignments to
non-output variables that are never used later on.) The resulting pro-
gram is a standard “loop free and index free” NAND-CIRC program
that computes XOR3 (see also Fig. 13.10):

temp_0 = NAND(X[0],X[0])
one = NAND(X[0],temp_0)
zero = NAND(one,one)
temp_2 = NAND(X[0],zero)
temp_3 = NAND(X[0],temp_2)
temp_4 = NAND(zero,temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[1],Y[0])
temp_3 = NAND(X[1],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[2],Y[0])
temp_3 = NAND(X[2],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)

Key to this transformation was the fact that in our original NAND-
TM program for XOR, regardless of whether the input is 011, 100, or
any other string, the index variable i is guaranteed to equal 0 in the
first iteration, 1 in the second iteration, 2 in the third iteration, and so
on and so forth. The particular sequence 0, 1, 2, … is immaterial: the
crucial property is that the NAND-TM program for XOR is oblivious
in the sense that the value of the index i in the 𝑗-th iteration depends
only on 𝑗 and does not depend on the particular choice of the input.

Figure 13.10: A NAND circuit for XOR3 obtained by “unrolling the loop” of the
NAND-TM program for computing XOR three times.

Luckily, it is possible to transform every NAND-TM program into
a functionally equivalent oblivious program with at most quadratic

overhead. (Similarly we can transform any Turing machine into a


functionally equivalent oblivious Turing machine, see Exercise 13.6.)

Theorem 13.13 — Making NAND-TM oblivious. Let 𝑇 ∶ ℕ → ℕ be a nice


function and let 𝐹 ∈ TIMETM (𝑇 (𝑛)). Then there is a NAND-TM
program 𝑃 that computes 𝐹 in 𝑂(𝑇 (𝑛)2 ) steps and satisfying the
following. For every 𝑛 ∈ ℕ there is a sequence 𝑖0 , 𝑖1 , … , 𝑖𝑚−1 such
that for every 𝑥 ∈ {0, 1}𝑛 , if 𝑃 is executed on input 𝑥 then in the
𝑗-th iteration the variable i is equal to 𝑖𝑗 .

In other words, Theorem 13.13 implies that if we can compute 𝐹 in


𝑇 (𝑛) steps, then we can compute it in 𝑂(𝑇 (𝑛)2 ) steps with a program
𝑃 in which the position of i in the 𝑗-th iteration depends only on 𝑗
and the length of the input, and not on the contents of the input. Such
a program can be easily translated into a NAND-CIRC program of
𝑂(𝑇 (𝑛)2 ) lines by “unrolling the loop”.
Proof Idea:
We can translate any NAND-TM program 𝑃 ′ into an oblivious
program 𝑃 by making 𝑃 “sweep” its arrays. That is, the index i in
𝑃 will always move all the way from position 0 to position 𝑇 (𝑛) − 1
and back again. We can then simulate the program 𝑃 ′ with at most
𝑇 (𝑛) overhead: if 𝑃 ′ wants to move i left when we are in a rightward
sweep then we simply wait the at most 2𝑇 (𝑛) steps until the next time
we are back in the same position while sweeping to the left.

Proof of Theorem 13.13. Let 𝑃 ′ be a NAND-TM program computing 𝐹


in 𝑇 (𝑛) steps. We construct an oblivious NAND-TM program 𝑃 for
computing 𝐹 as follows (see also Fig. 13.11).

1. On input 𝑥, 𝑃 will compute 𝑇 = 𝑇 (|𝑥|) and set up arrays Atstart


and Atend satisfying Atstart[0]= 1 and Atstart[𝑖]= 0 for 𝑖 > 0
and Atend[𝑇 − 1]= 1 and Atend[i]= 0 for all 𝑖 ≠ 𝑇 − 1. We can do
this because 𝑇 is a nice function. Note that since this computation
does not depend on 𝑥 but only on its length, it is oblivious.

2. 𝑃 will also have a special array Marker initialized to all zeroes.

3. The index variable of 𝑃 will change direction of movement to
the right whenever Atstart[i]= 1 and to the left whenever
Atend[i]= 1.

Figure 13.11: We simulate a 𝑇 (𝑛)-time NAND-TM program 𝑃 ′ with an oblivious
NAND-TM program 𝑃 by adding special arrays Atstart and Atend to mark positions
0 and 𝑇 − 1 respectively. The program 𝑃 will simply “sweep” its arrays from
right to left and back again. If the original program 𝑃 ′ would have moved i in
a different direction then we wait 𝑂(𝑇 ) steps until we reach the same point
back again, and so 𝑃 runs in 𝑂(𝑇 (𝑛)2 ) time.

4. The program 𝑃 simulates the execution of 𝑃 ′ . However, if the


MODANDJMP instruction in 𝑃 ′ attempts to move to the right when 𝑃
is moving left (or vice versa) then 𝑃 will set Marker[i] to 1 and
enter into a special “waiting mode”. In this mode 𝑃 will wait until

the next time in which Marker[i]= 1 (at the next sweep) at which
points 𝑃 zeroes Marker[i] and continues with the simulation. In
the worst case this will take 2𝑇 (𝑛) steps (if 𝑃 has to go all the way
from one end to the other and back again.)

5. We also modify 𝑃 to ensure it ends the computation after simu-


lating exactly 𝑇 (𝑛) steps of 𝑃 ′ , adding “dummy steps” if 𝑃 ′ ends
early.

We see that 𝑃 simulates the execution of 𝑃 ′ with an overhead of


𝑂(𝑇 (𝑛)) steps of 𝑃 per one step of 𝑃 ′ , hence completing the proof.

Theorem 13.13 implies Theorem 13.12. Indeed, if 𝑃 is a 𝑘-line obliv-


ious NAND-TM program computing 𝐹 in time 𝑇 (𝑛) then for every 𝑛
we can obtain a NAND-CIRC program of (𝑘 − 1) ⋅ 𝑇 (𝑛) lines by simply
making 𝑇 (𝑛) copies of 𝑃 (dropping the final MODANDJMP line). In the
𝑗-th copy we replace all references of the form Foo[i] to foo_𝑖𝑗 where
𝑖𝑗 is the value of i in the 𝑗-th iteration.
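The substitution just described can itself be phrased as a short Python routine.
This is only an illustration of the transformation, not a full NAND-TM-to-circuit
compiler; the function and variable names are assumptions of the sketch, and the
clean-up steps mentioned above (introducing the zero constant and dropping
assignments that are never used) are omitted.

def unroll(loop_body, index_sequence):
    """Copy `loop_body` once per iteration, replacing every reference to the
    index i by its concrete (oblivious) value, and dropping the jump line."""
    unrolled = []
    for i_j in index_sequence:
        for line in loop_body:
            if line.startswith("MODANDJUMP"):
                continue   # the jump disappears once the loop is unrolled
            unrolled.append(line.replace("[i]", "[%d]" % i_j))
    return unrolled

xor_loop_body = [
    "temp_0 = NAND(X[0],X[0])",
    "Y_nonblank[0] = NAND(X[0],temp_0)",
    "temp_2 = NAND(X[i],Y[0])",
    "temp_3 = NAND(X[i],temp_2)",
    "temp_4 = NAND(Y[0],temp_2)",
    "Y[0] = NAND(temp_3,temp_4)",
    "MODANDJUMP(X_nonblank[i],X_nonblank[i])",
]
print("\n".join(unroll(xor_loop_body, [0, 1, 2])))   # an 18-line NAND-CIRC program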

13.6.2 “Unrolling the loop”: algorithmic transformation of Turing Machines to circuits
The proof of Theorem 13.12 is algorithmic, in the sense that the proof
yields a polynomial-time algorithm that given a Turing machine 𝑀
and parameters 𝑇 and 𝑛, produces a circuit of 𝑂(𝑇 2 ) gates that agrees
with 𝑀 on all inputs 𝑥 ∈ {0, 1}𝑛 (as long as 𝑀 runs for less than 𝑇
steps on these inputs). We record this fact in the following theorem, since
it will be useful for us later on:

Theorem 13.14 — Turing-machine to circuit compiler. There is an algorithm
UNROLL such that for every Turing machine 𝑀 and numbers 𝑛, 𝑇 ,
UNROLL(𝑀 , 1𝑇 , 1𝑛 ) runs for 𝑝𝑜𝑙𝑦(|𝑀 |, 𝑇 , 𝑛) steps and outputs a
NAND circuit 𝐶 with 𝑛 inputs, 𝑂(𝑇 2 ) gates, and one output, such
that

           ⎧𝑦   𝑀 halts in ≤ 𝑇 steps and outputs 𝑦
    𝐶(𝑥) = ⎨
           ⎩0   otherwise

Figure 13.12: The function UNROLL takes as input a Turing machine 𝑀, an input
length parameter 𝑛, a step budget parameter 𝑇 , and outputs a circuit 𝐶 of size
𝑝𝑜𝑙𝑦(𝑇 ) that takes 𝑛 bits of inputs and outputs 𝑀(𝑥) if 𝑀 halts on 𝑥 within at
most 𝑇 steps.

Proof. We only sketch the proof since it follows by directly translat-


ing the proof of Theorem 13.12 into an algorithm together with the
simulation of Turing machines by NAND-TM programs (see also
Fig. 13.13). Specifically, UNROLL does the following:

1. Transform the Turing machine 𝑀 into an equivalent NAND-TM


program 𝑃 .

2. Transform the NAND-TM program 𝑃 into an equivalent oblivious


program 𝑃 ′ following the proof of Theorem 13.13. The program 𝑃 ′
takes 𝑇 ′ = 𝑂(𝑇 2 ) steps to simulate 𝑇 steps of 𝑃 .

3. “Unroll the loop” of 𝑃 ′ by obtaining a NAND-CIRC program of


𝑂(𝑇 ′ ) lines (or equivalently a NAND circuit with 𝑂(𝑇 2 ) gates)
corresponding to the execution of 𝑇 ′ iterations of 𝑃 ′ .

Figure 13.13: We can transform a Turing machine 𝑀,


input length parameter 𝑛, and time bound 𝑇 into an
𝑂(𝑇 2 )-sized NAND circuit that agrees with 𝑀 on
all inputs 𝑥 ∈ {0, 1}𝑛 on which 𝑀 halts in at most
𝑇 steps. The transformation is obtained by first using
the equivalence of Turing machines and NAND-
TM programs 𝑃 , then turning 𝑃 into an equivalent
oblivious NAND-TM program 𝑃 ′ via Theorem 13.13,
then “unrolling” 𝑂(𝑇 2 ) iterations of the loop of
𝑃 ′ to obtain an 𝑂(𝑇 2 ) line NAND-CIRC program
that agrees with 𝑃 ′ on length 𝑛 inputs, and finally
translating this program into an equivalent circuit.

 Big Idea 20 By “unrolling the loop” we can transform an al-


gorithm that takes 𝑇 (𝑛) steps to compute 𝐹 into a circuit that uses
𝑝𝑜𝑙𝑦(𝑇 (𝑛)) gates to compute the restriction of 𝐹 to {0, 1}𝑛 .

P
Reviewing the transformations described in Fig. 13.13,
as well as solving the following two exercises is a great
way to get more comfort with non-uniform complexity
and in particular with P/poly and its relation to P.

Solved Exercise 13.4 — Alternative characterization of P. Prove that for every
𝐹 ∶ {0, 1}∗ → {0, 1}, 𝐹 ∈ P if and only if there is a polynomial-
time Turing machine 𝑀 such that for every 𝑛 ∈ ℕ, 𝑀 (1𝑛 ) outputs a
description of an 𝑛 input circuit 𝐶𝑛 that computes the restriction 𝐹↾𝑛
of 𝐹 to inputs in {0, 1}𝑛 .


Solution:
We start with the “if” direction. Suppose that there is a polynomial-
time Turing machine 𝑀 that on input 1𝑛 outputs a circuit 𝐶𝑛 that
computes 𝐹↾𝑛 . Then the following is a polynomial-time Turing
machine 𝑀 ′ to compute 𝐹 . On input 𝑥 ∈ {0, 1}∗ , 𝑀 ′ will:

1. Let 𝑛 = |𝑥| and compute 𝐶𝑛 = 𝑀 (1𝑛 ).

2. Return the evaluation of 𝐶𝑛 on 𝑥.

Since we can evaluate a Boolean circuit on an input in poly-


nomial time, 𝑀 ′ runs in polynomial time and computes 𝐹 (𝑥) on
every input 𝑥.
For the “only if” direction, if 𝑀 ′ is a Turing machine that com-
putes 𝐹 in polynomial-time, then (applying the equivalence of
Turing machines and NAND-TM as well as Theorem 13.13) there is
also an oblivious NAND-TM program 𝑃 that computes 𝐹 in time
𝑝(𝑛) for some polynomial 𝑝. We can now define 𝑀 to be the Turing
machine that on input 1𝑛 outputs the NAND circuit obtained by
“unrolling the loop” of 𝑃 for 𝑝(𝑛) iterations. The resulting NAND
circuit computes 𝐹↾𝑛 and has 𝑂(𝑝(𝑛)) gates. It can also be trans-
formed to a Boolean circuit with 𝑂(𝑝(𝑛)) AND/OR/NOT gates.
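The "if" direction of this exercise is essentially the following short sketch.
Here generate_circuit stands for the assumed polynomial-time Turing machine 𝑀
(given the input length, it returns a circuit), and the encoding of a circuit
as a list of (dst, a, b) NAND gates over numbered wires, with the last gate's
destination as the output, is a convention invented only for this illustration.

def evaluate_nand_circuit(gates, x):
    # wires 0 .. len(x)-1 hold the input bits; each gate writes one new wire
    wires = {i: b for i, b in enumerate(x)}
    for dst, a, b in gates:
        wires[dst] = 1 - (wires[a] & wires[b])
    return wires[gates[-1][0]]

def compute_F(generate_circuit, x):
    gates = generate_circuit(len(x))         # polynomial time by assumption
    return evaluate_nand_circuit(gates, x)   # circuit evaluation is also polynomial time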

Solved Exercise 13.5 — P/poly characterization by advice. Let 𝐹 ∶ {0, 1}∗ →
{0, 1}. Then 𝐹 ∈ P/poly if and only if there exists a polynomial 𝑝 ∶ ℕ →
ℕ, a polynomial-time Turing machine 𝑀 and a sequence {𝑎𝑛 }𝑛∈ℕ of
strings, such that for every 𝑛 ∈ ℕ:

• |𝑎𝑛 | ≤ 𝑝(𝑛)
• For every 𝑥 ∈ {0, 1}𝑛 , 𝑀 (𝑎𝑛 , 𝑥) = 𝐹 (𝑥).

Solution:
We only sketch the proof. For the “only if” direction, if 𝐹 ∈
P/poly then we can use for 𝑎𝑛 simply the description of the cor-
responding circuit 𝐶𝑛 and for 𝑀 the program that computes in
polynomial time the evaluation of a circuit on its input.
For the “if” direction, we can use the same “unrolling the loop”
technique of Theorem 13.12 to show that if 𝑃 is a polynomial-time
NAND-TM program, then for every 𝑛 ∈ ℕ, the map 𝑥 ↦ 𝑃 (𝑎𝑛 , 𝑥)
can be computed by a polynomial-size NAND-CIRC program 𝑄𝑛 .


13.6.3 Can uniform algorithms simulate non-uniform ones?


Theorem 13.12 shows that every function in TIME(𝑇 (𝑛)) is in
SIZE(𝑝𝑜𝑙𝑦(𝑇 (𝑛))). One can ask if there is an inverse relation. Suppose
that 𝐹 is such that 𝐹↾𝑛 has a “short” NAND-CIRC program for every
𝑛. Can we say that it must be in TIME(𝑇 (𝑛)) for some “small” 𝑇 ? The
answer is an emphatic no. Not only is P/poly not contained in P, in fact
P/poly contains functions that are uncomputable!

Theorem 13.15 — P/poly contains uncomputable functions. There exists an
uncomputable function 𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹 ∈ P/poly .

Proof Idea:
Since P/poly corresponds to non-uniform computation, a function
𝐹 is in P/poly if for every 𝑛 ∈ ℕ, the restriction 𝐹↾𝑛 to inputs of length
𝑛 has a small circuit/program, even if the circuits for different values
of 𝑛 are completely different from one another. In particular, if 𝐹 has
the property that for every equal-length inputs 𝑥 and 𝑥′ , 𝐹 (𝑥) =
𝐹 (𝑥′ ) then this means that 𝐹↾𝑛 is either the constant function zero
or the constant function one for every 𝑛 ∈ ℕ. Since the constant
function has a (very!) small circuit, such a function 𝐹 will always
be in P/poly (indeed even in smaller classes). Yet by a reduction from
the Halting problem, we can obtain a function with this property that
is uncomputable.

Proof of Theorem 13.15. Consider the following “unary halting func-


tion” UH ∶ {0, 1}∗ → {0, 1} defined as follows. We let 𝑆 ∶ ℕ → {0, 1}∗
be the function that on input 𝑛 ∈ ℕ, outputs the string that corre-
sponds to the binary representation of the number 𝑛 without the most
significant 1 digit. Note that 𝑆 is onto. For every 𝑥 ∈ {0, 1}∗ , we de-
fine UH(𝑥) = HALTONZERO(𝑆(|𝑥|)). That is, if 𝑛 is the length of 𝑥,
then UH(𝑥) = 1 if and only if the string 𝑆(𝑛) encodes a NAND-TM
program that halts on the input 0.
UH is uncomputable, since otherwise we could compute
HALTONZERO by transforming the input program 𝑃 into the integer
𝑛 such that 𝑃 = 𝑆(𝑛) and then running UH(1𝑛 ) (i.e., UH on the string
of 𝑛 ones). On the other hand, for every 𝑛, UH𝑛 (𝑥) is either equal
to 0 for all inputs 𝑥 or equal to 1 on all inputs 𝑥, and hence can be
computed by a NAND-CIRC program of a constant number of lines.
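For concreteness, the encoding function 𝑆 used in this proof is easy to write
down; the following one-line Python illustration maps 𝑛 ≥ 1 to the binary
representation of 𝑛 with the most significant 1 removed, and is onto {0, 1}∗ :

def S(n):
    # bin(n) is '0b1...'; drop the '0b' prefix and the leading 1 digit
    return bin(n)[3:]

# S(1) = '', S(2) = '0', S(3) = '1', S(4) = '00', S(5) = '01', ...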

The issue here is of course uniformity. For a function 𝐹 ∶ {0, 1}∗ →


{0, 1}, if 𝐹 is in TIME(𝑇 (𝑛)) then we have a single algorithm that
can compute 𝐹↾𝑛 for every 𝑛. On the other hand, 𝐹↾𝑛 might be in

SIZE(𝑇 (𝑛)) for every 𝑛 using a completely different algorithm for ev-
ery input length. For this reason we typically use P/poly not as a model
of efficient computation but rather as a way to model inefficient compu-
tation. For example, in cryptography people often define an encryp-
tion scheme to be secure if breaking it for a key of length 𝑛 requires
more than a polynomial number of NAND lines. Since P ⊆ P/poly ,
this in particular precludes a polynomial time algorithm for doing so,
but there are technical reasons why working in a non-uniform model
makes more sense in cryptography. It also allows to talk about se-
curity in non-asymptotic terms such as a scheme having “128 bits of
security”.
While it can sometimes be a real issue, in many natural settings the
difference between uniform and non-uniform computation does not
seem so important. In particular, in all the examples of problems not
known to be in P we discussed before: longest path, 3SAT, factoring,
etc., these problems are also not known to be in P/poly either. Thus,
for “natural” functions, if you pretend that TIME(𝑇 (𝑛)) is roughly the
same as SIZE(𝑇 (𝑛)), you will be right more often than wrong.

Figure 13.14: Relations between P, EXP, and P/poly . It


is known that P ⊆ EXP, P ⊆ P/poly and that P/poly
contains uncomputable functions (which in particular
are outside of EXP). It is not known whether or not
EXP ⊆ P/poly though it is believed that EXP ⊈ P/poly .

13.6.4 Uniform vs. Non-uniform computation: A recap


To summarize, the two models of computation we have described so
far are:

• Uniform models: Turing machines, NAND-TM programs, RAM ma-


chines, NAND-RAM programs, C/JavaScript/Python, etc. These mod-
els include loops and unbounded memory hence a single program
can compute a function with unbounded input length.

• Non-uniform models: Boolean Circuits or straightline programs have


no loops and can only compute finite functions. The time to execute
them is exactly the number of lines or gates they contain.

For a function 𝐹 ∶ {0, 1}∗ → {0, 1} and some nice time bound
𝑇 ∶ ℕ → ℕ, we know that:

• If 𝐹 is uniformly computable in time 𝑇 (𝑛) then there is a sequence


of circuits 𝐶1 , 𝐶2 , … where 𝐶𝑛 has 𝑝𝑜𝑙𝑦(𝑇 (𝑛)) gates and computes
𝐹↾𝑛 (i.e., restriction of 𝐹 to {0, 1}𝑛 ) for every 𝑛.

• The reverse direction is not necessarily true - there are examples of


functions 𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹↾𝑛 can be computed by
even a constant size circuit but 𝐹 is uncomputable.

This means that non-uniform complexity is more useful to establish


hardness of a function than its easiness.

✓ Chapter Recap

• We can define the time complexity of a function


using NAND-TM programs, and similarly to the
notion of computability, this appears to capture the
inherent complexity of the function.
• There are many natural problems that have
polynomial-time algorithms, and other natural
problems that we’d love to solve, but for which the
best known algorithms are exponential.
• The definition of polynomial time, and hence the
class P, is robust to the choice of model, whether
it is Turing machines, NAND-TM, NAND-RAM,
modern programming languages, and many other
models.
• The time hierarchy theorem shows that there are
some problems that can be solved in exponential,
but not in polynomial time. However, we do not
know if that is the case for the natural examples
that we described in this lecture.
• By “unrolling the loop” we can show that every
function computable in time 𝑇 (𝑛) can be computed
by a sequence of NAND-CIRC programs (one for
every input length) each of size at most 𝑝𝑜𝑙𝑦(𝑇 (𝑛)).

13.7 EXERCISES
Exercise 13.1 — Equivalence of different definitions of P and EXP. Prove
that the classes P and EXP defined in Definition 13.2 are equal to
∪𝑐∈{1,2,3,…} TIME(𝑛^𝑐 ) and ∪𝑐∈{1,2,3,…} TIME(2^(𝑛^𝑐 )) respectively. (If
𝑆1 , 𝑆2 , 𝑆3 , … is a collection of sets then the set 𝑆 = ∪𝑐∈{1,2,3,…} 𝑆𝑐 is
the set of all elements 𝑒 such that there exists some 𝑐 ∈ {1, 2, 3, …}
where 𝑒 ∈ 𝑆𝑐 .)

Exercise 13.2 — Robustness to representation. Theorem 13.5 shows that the
classes P and EXP are robust with respect to variations in the choice
of the computational model. This exercise shows that these classes

are also robust with respect to our choice of the representation of the
input.
Specifically, let 𝐹 be a function mapping graphs to {0, 1}, and let
𝐹 ′ , 𝐹 ″ ∶ {0, 1}∗ → {0, 1} be the functions defined as follows. For every
𝑥 ∈ {0, 1}∗ :

• 𝐹 ′ (𝑥) = 1 iff 𝑥 represents a graph 𝐺 via the adjacency matrix


representation such that 𝐹 (𝐺) = 1.

• 𝐹 ″ (𝑥) = 1 iff 𝑥 represents a graph 𝐺 via the adjacency list represen-


tation such that 𝐹 (𝐺) = 1.

Prove that 𝐹 ′ ∈ P iff 𝐹 ″ ∈ P.


More generally, for every function 𝐹 ∶ {0, 1}∗ → {0, 1}, the answer
to the question of whether 𝐹 ∈ P (or whether 𝐹 ∈ EXP) is unchanged
by switching representations, as long as transforming one represen-
tation to the other can be done in polynomial time (which essentially
holds for all reasonable representations).

Exercise 13.3 — Boolean functions. For every function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ ,
define 𝐵𝑜𝑜𝑙(𝐹 ) to be the function mapping {0, 1}∗ to {0, 1} such that
on input a (string representation of a) triple (𝑥, 𝑖, 𝜎) with 𝑥 ∈ {0, 1}∗ ,
𝑖 ∈ ℕ and 𝜎 ∈ {0, 1},

                     ⎧ 𝐹 (𝑥)𝑖   if 𝜎 = 0 and 𝑖 < |𝐹 (𝑥)|
𝐵𝑜𝑜𝑙(𝐹 )(𝑥, 𝑖, 𝜎) =  ⎨ 1        if 𝜎 = 1 and 𝑖 < |𝐹 (𝑥)|
                     ⎩ 0        otherwise
where 𝐹 (𝑥)𝑖 is the 𝑖-th bit of the string 𝐹 (𝑥).
Prove that for every 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , 𝐵𝑜𝑜𝑙(𝐹 ) ∈ P if and only
if there is a Turing Machine 𝑀 and a polynomial 𝑝 ∶ ℕ → ℕ such that
for every 𝑥 ∈ {0, 1}∗ , on input 𝑥, 𝑀 halts within ≤ 𝑝(|𝑥|) steps and
outputs 𝐹 (𝑥).

Exercise 13.4 — Composition of polynomial time. Say that a (possibly non-
Boolean) function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is computable in polynomial time,
if there is a Turing Machine 𝑀 and a polynomial 𝑝 ∶ ℕ → ℕ such that
for every 𝑥 ∈ {0, 1}∗ , on input 𝑥, 𝑀 halts within ≤ 𝑝(|𝑥|) steps and
outputs 𝐹 (𝑥). Prove that for every pair of functions 𝐹 , 𝐺 ∶ {0, 1}∗ →
{0, 1}∗ computable in polynomial time, their composition 𝐹 ∘𝐺, which is
the function 𝐻 s.t. 𝐻(𝑥) = 𝐹 (𝐺(𝑥)), is also computable in polynomial
time.

Exercise 13.5 — Non-composition of exponential time. Say that a (possibly
non-Boolean) function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is computable in exponential

time, if there is a Turing Machine 𝑀 and a polynomial 𝑝 ∶ ℕ → ℕ such


that for every 𝑥 ∈ {0, 1}∗ , on input 𝑥, 𝑀 halts within ≤ 2𝑝(|𝑥|) steps
and outputs 𝐹 (𝑥). Prove that there is some 𝐹 , 𝐺 ∶ {0, 1}∗ → {0, 1}∗
s.t. both 𝐹 and 𝐺 are computable in exponential time, but 𝐹 ∘ 𝐺 is not
computable in exponential time.

Exercise 13.6 — Oblivious Turing Machines. We say that a Turing machine 𝑀
is oblivious if there is some function 𝑇 ∶ ℕ × ℕ → ℤ such that for every
input 𝑥 of length 𝑛, and 𝑡 ∈ ℕ it holds that:

• If 𝑀 takes more than 𝑡 steps to halt on the input 𝑥, then in the 𝑡-


th step 𝑀 ’s head will be in the position 𝑇 (𝑛, 𝑡). (Note that this
position depends only on the length of 𝑥 and not its contents.)

• If 𝑀 halts before the 𝑡-th step then 𝑇 (𝑛, 𝑡) = −1.

Prove that if 𝐹 ∈ P then there exists an oblivious Turing machine 𝑀
that computes 𝐹 in polynomial time. See footnote for hint.¹
■

¹ Hint: This is the Turing machine analog of Theorem 13.13. We replace one step of the original TM 𝑀 ′ computing 𝐹 with a “sweep” of the oblivious TM 𝑀 in which it goes 𝑇 steps to the right and then 𝑇 steps to the left.
Exercise 13.7 Let EDGE ∶ {0, 1}∗ → {0, 1} be the function such that on

input a string representing a triple (𝐿, 𝑖, 𝑗), where 𝐿 is the adjacency


list representation of an 𝑛 vertex graph 𝐺, and 𝑖 and 𝑗 are numbers in
[𝑛], EDGE(𝐿, 𝑖, 𝑗) = 1 if the edge {𝑖, 𝑗} is present in the graph. EDGE
outputs 0 on all other inputs.

1. Prove that EDGE ∈ P.

2. Let PLANARMATRIX ∶ {0, 1}∗ → {0, 1} be the function that


on input an adjacency matrix 𝐴 outputs 1 if and only if the graph
represented by 𝐴 is planar (that is, can be drawn on the plane with-
out edges crossing one another). For this question, you can use
without proof the fact that PLANARMATRIX ∈ P. Prove that
PLANARLIST ∈ P where PLANARLIST ∶ {0, 1}∗ → {0, 1} is the
function that on input an adjacency list 𝐿 outputs 1 if and only if 𝐿
represents a planar graph.

Exercise 13.8 — Evaluate NAND circuits. Let NANDEVAL ∶ {0, 1}∗ →
{0, 1} be the function such that for every string representing a pair
(𝑄, 𝑥) where 𝑄 is an 𝑛-input 1-output NAND-CIRC (not NAND-TM!)
program and 𝑥 ∈ {0, 1}𝑛 , NANDEVAL(𝑄, 𝑥) = 𝑄(𝑥). On all other
inputs NANDEVAL outputs 0.
Prove that NANDEVAL ∈ P.


Exercise 13.9 — Find hard function. Let NANDHARD ∶ {0, 1}∗ → {0, 1}
be the function such that on input a string representing a pair (𝑓, 𝑠)
where

• 𝑓 ∈ {0, 1}^(2^𝑛 ) for some 𝑛 ∈ ℕ

• 𝑠 ∈ ℕ

NANDHARD(𝑓, 𝑠) = 1 if there is no NAND-CIRC program 𝑄


of at most 𝑠 lines that computes the function 𝐹 ∶ {0, 1}𝑛 → {0, 1}
whose truth table is the string 𝑓. That is, NANDHARD(𝑓, 𝑠) = 1
if for every NAND-CIRC program 𝑄 of at most 𝑠 lines, there exists
some 𝑥 ∈ {0, 1}𝑛 such that 𝑄(𝑥) ≠ 𝑓𝑥 where 𝑓𝑥 denotes the 𝑥-th
coordinate of 𝑓, using the binary representation to identify {0, 1}𝑛
with the numbers {0, … , 2𝑛 − 1}.

1. Prove that NANDHARD ∈ EXP.

2. (Challenge) Prove that there is an algorithm FINDHARD such


that if 𝑛 is sufficiently large, then FINDHARD(1𝑛 ) runs in time
2^(2^(𝑂(𝑛)) ) and outputs a string 𝑓 ∈ {0, 1}^(2^𝑛 ) that is the truth table of
a function that is not contained in SIZE(2𝑛 /(1000𝑛)). (In other
words, if 𝑓 is the string output by FINDHARD(1𝑛 ) then if we let
𝐹 ∶ {0, 1}𝑛 → {0, 1} be the function such that 𝐹 (𝑥) outputs the 𝑥-th
coordinate of 𝑓, then 𝐹 ∉ SIZE(2𝑛 /(1000𝑛)).)²

² Hint: Use Item 1, the existence of functions requiring exponentially hard NAND programs, and the fact that there are only finitely many functions mapping {0, 1}𝑛 to {0, 1}.

Exercise 13.10 Suppose that you are in charge of scheduling courses in


computer science in University X. In University X, computer science
students wake up late, and have to work on their startups in the af-
ternoon, and take long weekends with their investors. So you only
have two possible slots: you can schedule a course either Monday-
Wednesday 11am-1pm or Tuesday-Thursday 11am-1pm.
Let SCHEDULE ∶ {0, 1}∗ → {0, 1} be the function that takes as input
a list of courses 𝐿 and a list of conflicts 𝐶 (i.e., list of pairs of courses
that cannot share the same time slot) and outputs 1 if and only if there
is a “conflict free” scheduling of the courses in 𝐿, where no pair in 𝐶 is
scheduled in the same time slot.
More precisely, the list 𝐿 is a list of strings (𝑐0 , … , 𝑐𝑛−1 ) and the list
𝐶 is a list of pairs of the form (𝑐𝑖 , 𝑐𝑗 ). SCHEDULE(𝐿, 𝐶) = 1 if and
only if there exists a partition of 𝑐0 , … , 𝑐𝑛−1 into two parts so that there
is no pair (𝑐𝑖 , 𝑐𝑗 ) ∈ 𝐶 such that both 𝑐𝑖 and 𝑐𝑗 are in the same part.
Prove that SCHEDULE ∈ P. As usual, you do not have to provide
the full code to show that this is the case, and can describe operations
at a high level, as well as appeal to any data structures or other results
mentioned in the book or in lecture. Note that to show that a function
𝐹 is in P you need to both (1) present an algorithm 𝐴 that computes

𝐹 in polynomial time, (2) prove that 𝐴 does indeed run in polynomial


time, and does indeed compute the correct answer.
Try to think whether or not your algorithm extends to the case
where there are three possible time slots.

13.8 BIBLIOGRAPHICAL NOTES


Because we are interested in the maximum number of steps for inputs
of a given length, running-time as we defined it is often known as
worst case complexity. The minimum number of steps (or “best case”
complexity) to compute a function on length 𝑛 inputs is typically not
a meaningful quantity since essentially every natural problem will
have some trivially easy instances. However, the average case complexity
(i.e., complexity on a “typical” or “random” input) is an interesting
concept which we’ll return to when we discuss cryptography. That
said, worst-case complexity is the most standard and basic of the
complexity measures, and will be our focus in most of this book.
Some lower bounds for single-tape Turing machines are given in
[Maa85].
For defining efficiency in the 𝜆 calculus, one needs to be careful
about the order of application of the reduction steps, which can matter
for computational efficiency, see for example this paper.
The notation P/poly is used for historical reasons. It was introduced
by Karp and Lipton, who considered this class as corresponding to
functions that can be computed by polynomial-time Turing machines
that are given for any input length 𝑛 an advice string of length polyno-
mial in 𝑛.
Learning Objectives:
• Introduce the notion of polynomial-time
reductions as a way to relate the complexity of
problems to one another.
• See several examples of such reductions.
• 3SAT as a basic starting point for reductions.

14
Polynomial-time reductions

Consider some of the problems we have encountered in Chapter 12:

1. The 3SAT problem: deciding whether a given 3CNF formula has a


satisfying assignment.

2. Finding the longest path in a graph.

3. Finding the maximum cut in a graph.

4. Solving quadratic equations over 𝑛 variables 𝑥0 , … , 𝑥𝑛−1 ∈ ℝ.

All of these problems have the following properties:

• These are important problems, and people have spent significant


effort on trying to find better algorithms for them.

• Each one of these is a search problem, whereby we search for a


solution that is “good” in some easy to define sense (e.g., a long
path, a satisfying assignment, etc.).

• Each of these problems has a trivial exponential time algorithm that


involve enumerating all possible solutions.

• At the moment, for all these problems the best known algorithm is
not much faster than the trivial one in the worst case.

In this chapter and in Chapter 15 we will see that, despite their


apparent differences, we can relate the computational complexity of
these and many other problems. In fact, it turns out that the prob-
lems above are computationally equivalent, in the sense that solving one
of them immediately implies solving the others. This phenomenon,
known as NP completeness, is one of the surprising discoveries of the-
oretical computer science, and we will see that it has far-reaching
ramifications.




This chapter: A non-mathy overview


This chapter introduces the concept of a polynomial time re-
duction which is a central object in computational complexity
and this book in particular. A polynomial-time reduction is a
way to reduce the task of solving one problem to another. The
way we use reductions in complexity is to argue that if the
first problem is hard to solve efficiently, then the second must
also be hard. We see several examples for reductions in this
chapter, and reductions will be the basis for the theory of NP
completeness that we will develop in Chapter 15.
All the code for the reductions described in this chapter is
available on the following Jupyter notebook.

Figure 14.1: In this chapter we show that if the 3SAT problem cannot be solved in polynomial time, then neither can the QUADEQ, LONGESTPATH, ISET and MAXCUT problems. We do this by using the reduction paradigm showing for example “if pigs could whistle” (i.e., if we had an efficient algorithm for QUADEQ) then “horses could fly” (i.e., we would have an efficient algorithm for 3SAT).

In this chapter we will see that for each one of the problems of find-
ing a longest path in a graph, solving quadratic equations, and finding
the maximum cut, if there exists a polynomial-time algorithm for this
problem then there exists a polynomial-time algorithm for the 3SAT
problem as well. In other words, we will reduce the task of solving
3SAT to each one of the above tasks. Another way to interpret these
results is that if there does not exist a polynomial-time algorithm for
3SAT then there does not exist a polynomial-time algorithm for these
other problems as well. In Chapter 15 we will see evidence (though
no proof!) that all of the above problems do not have polynomial-time
algorithms and hence are inherently intractable.

14.1 FORMAL DEFINITIONS OF PROBLEMS


For reasons of technical convenience rather than anything substantial,
we concern ourselves with decision problems (i.e., Yes/No questions) or

in other words Boolean (i.e., one-bit output) functions. We model the


problems above as functions mapping {0, 1}∗ to {0, 1} in the following
way:

3SAT. The 3SAT problem can be phrased as the function 3SAT ∶


{0, 1}∗ → {0, 1} that takes as input a 3CNF formula 𝜑 (i.e., a formula
of the form 𝐶0 ∧ ⋯ ∧ 𝐶𝑚−1 where each 𝐶𝑖 is the OR of three variables
or their negation) and maps 𝜑 to 1 if there exists some assignment to
the variables of 𝜑 that causes it to evaluate to true, and to 0 otherwise.
For example

3SAT ("(𝑥0 ∨ 𝑥1 ∨ 𝑥2 ) ∧ (𝑥1 ∨ 𝑥2 ∨ 𝑥3 ) ∧ (𝑥0 ∨ 𝑥2 ∨ 𝑥3 )") = 1

since the assignment 𝑥 = 1101 satisfies the input formula. In the


above we assume some representation of formulas as strings, and
define the function to output 0 if its input is not a valid representation;
we use the same convention for all the other functions below.

Quadratic equations. The quadratic equations problem corresponds to the


function QUADEQ ∶ {0, 1}∗ → {0, 1} that maps a set of quadratic
equations 𝐸 to 1 if there is an assignment 𝑥 that satisfies all equations,
and to 0 otherwise.

Longest path. The longest path problem corresponds to the function


LONGPATH ∶ {0, 1}∗ → {0, 1} that maps a graph 𝐺 and a number 𝑘
to 1 if there is a simple path in 𝐺 of length at least 𝑘, and maps (𝐺, 𝑘)
to 0 otherwise. The longest path problem is a generalization of the
well-known Hamiltonian Path Problem of determining whether a path
of length 𝑛 exists in a given 𝑛 vertex graph.

Maximum cut. The maximum cut problem corresponds to the function


MAXCUT ∶ {0, 1}∗ → {0, 1} that maps a graph 𝐺 and a number 𝑘 to
1 if there is a cut in 𝐺 that cuts at least 𝑘 edges, and maps (𝐺, 𝑘) to 0
otherwise.
All of the problems above are in EXP but it is not known whether
or not they are in P. However, we will see in this chapter that if either
QUADEQ , LONGPATH or MAXCUT are in P, then so is 3SAT.

14.2 POLYNOMIAL-TIME REDUCTIONS


Suppose that 𝐹 , 𝐺 ∶ {0, 1}∗ → {0, 1} are two Boolean functions. A
polynomial-time reduction (or sometimes just “reduction” for short) from
𝐹 to 𝐺 is a way to show that 𝐹 is “no harder” than 𝐺, in the sense
that a polynomial-time algorithm for 𝐺 implies a polynomial-time
algorithm for 𝐹 .

Definition 14.1 — Polynomial-time reductions. Let 𝐹 , 𝐺 ∶ {0, 1}∗ → {0, 1}.
We say that 𝐹 reduces to 𝐺, denoted by 𝐹 ≤𝑝 𝐺, if there is a
polynomial-time computable 𝑅 ∶ {0, 1}∗ → {0, 1}∗ such that
for every 𝑥 ∈ {0, 1}∗ ,

𝐹 (𝑥) = 𝐺(𝑅(𝑥)) . (14.1)

We say that 𝐹 and 𝐺 have equivalent complexity if 𝐹 ≤𝑝 𝐺 and 𝐺 ≤𝑝


𝐹.

The following exercise justifies our intuition that 𝐹 ≤𝑝 𝐺 signifies


that “𝐹 is no harder than 𝐺”.
Solved Exercise 14.1 — Reductions and P. Prove that if 𝐹 ≤𝑝 𝐺 and 𝐺 ∈ P
then 𝐹 ∈ P.

Figure 14.2: If 𝐹 ≤𝑝 𝐺 then we can transform a polynomial-time algorithm 𝐵 that computes 𝐺 into a polynomial-time algorithm 𝐴 that computes 𝐹 . To compute 𝐹 (𝑥) we can run the reduction 𝑅 guaranteed by the fact that 𝐹 ≤𝑝 𝐺 to obtain 𝑦 = 𝑅(𝑥) and then run our algorithm 𝐵 for 𝐺 to compute 𝐺(𝑦).

As usual, solving this exercise on your own is an excellent way to make sure you understand Definition 14.1.

Solution:

Suppose there is an algorithm 𝐵 that computes 𝐺 in time 𝑝(𝑛)


where 𝑛 is its input size. Then, (14.1) directly gives an algorithm
𝐴 to compute 𝐹 (see Fig. 14.2). Indeed, on input 𝑥 ∈ {0, 1}∗ , Al-
gorithm 𝐴 will run the polynomial-time reduction 𝑅 to obtain
𝑦 = 𝑅(𝑥) and then return 𝐵(𝑦). By (14.1), 𝐺(𝑅(𝑥)) = 𝐹 (𝑥) and
hence Algorithm 𝐴 will indeed compute 𝐹 .
We now show that 𝐴 runs in polynomial time. By assumption, 𝑅
can be computed in time 𝑞(𝑛) for some polynomial 𝑞. In particular,
this means that |𝑦| ≤ 𝑞(|𝑥|) (as just writing down 𝑦 takes |𝑦| steps).
Computing 𝐵(𝑦) will take at most 𝑝(|𝑦|) ≤ 𝑝(𝑞(|𝑥|)) steps. Thus
the total running time of 𝐴 on inputs of length 𝑛 is at most the time
to compute 𝑦, which is bounded by 𝑞(𝑛), and the time to compute
𝐵(𝑦), which is bounded by 𝑝(𝑞(𝑛)), and since the composition of
two polynomials is a polynomial, 𝐴 runs in polynomial time.
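
To make the composition in the analysis above concrete, here is a minimal Python sketch (not the code from the book's repository); the reduction R and the algorithm algB for 𝐺 are hypothetical placeholders, instantiated below with toy stand-ins.

# A sketch of Solved Exercise 14.1: composing a polynomial-time reduction R
# (witnessing F <=_p G) with a polynomial-time algorithm algB for G yields
# an algorithm algA for F. R and algB are hypothetical placeholders.
def compose(R, algB):
    """Return an algorithm for F given a reduction R and an algorithm for G."""
    def algA(x):
        y = R(x)          # compute y = R(x) in polynomial time
        return algB(y)    # F(x) = G(R(x)) = algB(y)
    return algA

# Toy stand-ins: F(x) = 1 iff x has an even number of 1s, G(y) = 1 iff y
# ends with '0', and R appends a bit recording the parity of x.
R    = lambda x: x + ('0' if x.count('1') % 2 == 0 else '1')
algB = lambda y: 1 if y.endswith('0') else 0
algA = compose(R, algB)
assert algA('1011') == 0 and algA('11') == 1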

 Big Idea 21 A reduction 𝐹 ≤𝑝 𝐺 shows that 𝐹 is “no harder than


𝐺” or equivalently that 𝐺 is “no easier than 𝐹 ”.

14.2.1 Whistling pigs and flying horses


A reduction from 𝐹 to 𝐺 can be used for two purposes:

• If we already know an algorithm for 𝐺 and 𝐹 ≤𝑝 𝐺 then we can


use the reduction to obtain an algorithm for 𝐹 . This is a widely
used tool in algorithm design. For example in Section 12.1.4 we saw
how the Min-Cut Max-Flow theorem allows us to reduce the task of
computing a minimum cut in a graph to the task of computing a
maximum flow in it.

• If we have proven (or have evidence) that there exists no polynomial-


time algorithm for 𝐹 and 𝐹 ≤𝑝 𝐺 then the existence of this reduction
allows us to conclude that there exists no polynomial-time algo-
rithm for 𝐺. This is the “if pigs could whistle then horses could
fly” interpretation we’ve seen in Section 9.4. We show that if there
were a hypothetical efficient algorithm for 𝐺 (a “whistling pig”)
then, since 𝐹 ≤𝑝 𝐺, there would be an efficient algorithm for
𝐹 (a “flying horse”). In this book we often use reductions for this
second purpose, although the line between the two is sometimes
blurry (see the bibliographical notes in Section 14.10).

The most crucial difference between the notion in Definition 14.1


and the reductions we saw in the context of uncomputability (e.g.,
in Section 9.4) is that for relating time complexity of problems, we
need the reduction to be computable in polynomial time, as opposed to
merely computable. Definition 14.1 also restricts reductions to have a
very specific format. That is, to show that 𝐹 ≤𝑝 𝐺, rather than allow-
ing a general algorithm for 𝐹 that uses a “magic box” that computes
𝐺, we only allow an algorithm that computes 𝐹 (𝑥) by outputting
𝐺(𝑅(𝑥)). This restricted form is convenient for us, but people have
defined and used more general reductions as well (see Section 14.10).
In this chapter we use reductions to relate the computational com-
plexity of the problems mentioned above: 3SAT, Quadratic Equations,
Maximum Cut, and Longest Path, as well as a few others. We will
reduce 3SAT to the latter problems, demonstrating that solving any
one of them efficiently will result in an efficient algorithm for 3SAT.
In Chapter 15 we show the other direction: reducing each one of these
problems to 3SAT in one fell swoop.

Transitivity of reductions. Since we think of 𝐹 ≤𝑝 𝐺 as saying that


(as far as polynomial-time computation is concerned) 𝐹 is “easier or
equal in difficulty to” 𝐺, we would expect that if 𝐹 ≤𝑝 𝐺 and 𝐺 ≤𝑝 𝐻,
then it would hold that 𝐹 ≤𝑝 𝐻. Indeed this is the case:
Solved Exercise 14.2 — Transitivity of polynomial-time reductions. For every
𝐹 , 𝐺, 𝐻 ∶ {0, 1}∗ → {0, 1}, if 𝐹 ≤𝑝 𝐺 and 𝐺 ≤𝑝 𝐻 then 𝐹 ≤𝑝 𝐻.


Solution:
If 𝐹 ≤𝑝 𝐺 and 𝐺 ≤𝑝 𝐻 then there exist polynomial-time com-
putable functions 𝑅1 and 𝑅2 mapping {0, 1}∗ to {0, 1}∗ such that
for every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 𝐺(𝑅1 (𝑥)) and for every 𝑦 ∈ {0, 1}∗ ,
𝐺(𝑦) = 𝐻(𝑅2 (𝑦)). Combining these two equalities, we see that
for every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 𝐻(𝑅2 (𝑅1 (𝑥))) and so to show that
𝐹 ≤𝑝 𝐻, it is sufficient to show that the map 𝑥 ↦ 𝑅2 (𝑅1 (𝑥)) is
computable in polynomial time. But if there are some constants 𝑐, 𝑑
such that 𝑅1 (𝑥) is computable in time |𝑥|𝑐 and 𝑅2 (𝑦) is computable
in time |𝑦|𝑑 then 𝑅2 (𝑅1 (𝑥)) is computable in time (|𝑥|𝑐 )𝑑 = |𝑥|𝑐𝑑
which is polynomial.
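
As a small illustration (our own sketch, with R1 and R2 standing for the hypothetical reductions guaranteed by 𝐹 ≤𝑝 𝐺 and 𝐺 ≤𝑝 𝐻), the composed reduction is simply function composition:

def compose_reductions(R1, R2):
    # If F(x) = G(R1(x)) and G(y) = H(R2(y)), then F(x) = H(R2(R1(x))),
    # so x -> R2(R1(x)) is a polynomial-time reduction from F to H.
    return lambda x: R2(R1(x))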

14.3 REDUCING 3SAT TO ZERO ONE AND QUADRATIC EQUATIONS


We now show our first example of a reduction. The Zero-One Lin-
ear Equations problem corresponds to the function 01EQ ∶ {0, 1}∗ →
{0, 1} whose input is a collection 𝐸 of linear equations in variables
𝑥0 , … , 𝑥𝑛−1 , and the output is 1 iff there is an assignment 𝑥 ∈ {0, 1}𝑛
of 0/1 values to the variables that satisfies all the equations. For exam-
ple, if the input 𝐸 is a string encoding the set of equations

𝑥0 + 𝑥1 + 𝑥2 = 2
𝑥0 + 𝑥2 = 1
𝑥1 + 𝑥2 = 2
then 01EQ(𝐸) = 1 since the assignment 𝑥 = 011 satisfies all three
equations. We specifically restrict attention to linear equations in
variables 𝑥0 , … , 𝑥𝑛−1 in which every equation has the form ∑𝑖∈𝑆 𝑥𝑖 = 𝑏
where 𝑆 ⊆ [𝑛] and 𝑏 ∈ ℕ.¹

¹ If you are familiar with matrix notation you may note that such equations can be written as 𝐴𝑥 = b where 𝐴 is an 𝑚 × 𝑛 matrix with entries in 0/1 and b ∈ ℕ^𝑚 .

If we asked the question of whether there is a solution 𝑥 ∈ ℝ^𝑛 of
real numbers to 𝐸, then this can be solved using the famous Gaussian
elimination algorithm in polynomial time. However, there is no known
efficient algorithm to solve 01EQ. Indeed, such an algorithm would
imply an algorithm for 3SAT as shown by the following theorem:

Theorem 14.2 — Hardness of 01𝐸𝑄. 3SAT ≤𝑝 01EQ

Proof Idea:
A constraint 𝑥2 ∨ ¬𝑥5 ∨ 𝑥7 can be written as 𝑥2 + (1 − 𝑥5 ) + 𝑥7 ≥ 1.
This is a linear inequality but since the sum on the left-hand side is
at most three, we can also turn it into an equality by adding two new
variables 𝑦, 𝑧 and writing it as 𝑥2 + (1 − 𝑥5 ) + 𝑥7 + 𝑦 + 𝑧 = 3. (We
will use fresh variables 𝑦, 𝑧 for every constraint.) Finally, for every
variable 𝑥𝑖 we can add a variable 𝑥′𝑖 corresponding to its negation by

adding the equation 𝑥𝑖 + 𝑥′𝑖 = 1, hence mapping the original constraint


𝑥2 ∨ ¬𝑥5 ∨ 𝑥7 to 𝑥2 + 𝑥′5 + 𝑥7 + 𝑦 + 𝑧 = 3. The main takeaway
technique from this reduction is the idea of adding auxiliary variables
to replace an equation such as 𝑥1 + 𝑥2 + 𝑥3 ≥ 1 that is not quite in the
form we want with the equivalent (for 0/1 valued variables) equation
𝑥1 + 𝑥2 + 𝑥3 + 𝑢 + 𝑣 = 3 which is in the form we want.

Figure 14.3: Left: Python code implementing the reduction of 3SAT to 01EQ. Right: Example output of the reduction. Code is in our repository.

Proof of Theorem 14.2. To prove the theorem we need to:

1. Describe an algorithm 𝑅 for mapping an input 𝜑 for 3SAT into an


input 𝐸 for 01EQ.

2. Prove that the algorithm runs in polynomial time.

3. Prove that 01EQ(𝑅(𝜑)) = 3SAT(𝜑) for every 3CNF formula 𝜑.

We now proceed to do just that. Since this is our first reduction, we


will spell out this proof in detail. However it straightforwardly follows
the proof idea.

Algorithm 14.3 — 3𝑆𝐴𝑇 to 01𝐸𝑄 reduction.

Input: 3CNF formula 𝜑 with 𝑛 variables 𝑥0 , … , 𝑥𝑛−1 and 𝑚


clauses.
Output: Set 𝐸 of linear equations over 0/1 such that
3𝑆𝐴𝑇 (𝜑) = 1 iff 01𝐸𝑄(𝐸) = 1.
1: Let 𝐸’s variables be 𝑥0 , … , 𝑥𝑛−1 , 𝑥′0 , … , 𝑥′𝑛−1 , 𝑦0 , … , 𝑦𝑚−1 ,
𝑧0 , … , 𝑧𝑚−1 .
2: for 𝑖 ∈ [𝑛] do
3: add to 𝐸 the equation 𝑥𝑖 + 𝑥′𝑖 = 1
4: end for
5: for 𝑗 ∈ [𝑚] do
6: Let 𝑗-th clause be 𝑤0 ∨ 𝑤1 ∨ 𝑤2 where 𝑤0 , 𝑤1 , 𝑤2 are
literals.
7: for 𝑎 ∈ [3] do
8: if 𝑤𝑎 is variable 𝑥𝑖 then
9: set 𝑡𝑎 ← 𝑥𝑖
10: end if
11: if 𝑤𝑎 is negation ¬𝑥𝑖 then
12: set 𝑡𝑎 ← 𝑥′𝑖
13: end if
14: end for
15: Add to 𝐸 the equation 𝑡0 + 𝑡1 + 𝑡2 + 𝑦𝑗 + 𝑧𝑗 = 3.
16: end for
17: return 𝐸

The reduction is described in Algorithm 14.3, see also Fig. 14.3. If


the input formula has 𝑛 variables and 𝑚 clauses, Algorithm 14.3 cre-
ates a set 𝐸 of 𝑛 + 𝑚 equations over 2𝑛 + 2𝑚 variables. Algorithm 14.3
makes an initial loop of 𝑛 steps (each taking constant time) and then
another loop of 𝑚 steps (each taking constant time) to create the equa-
tions, and hence it runs in polynomial time.
Let 𝑅 be the function computed by Algorithm 14.3. The heart of
the proof is to show that for every 3CNF 𝜑, 01EQ(𝑅(𝜑)) = 3SAT(𝜑).
We split the proof into two parts. The first part, traditionally known
as the completeness property, is to show that if 3SAT(𝜑) = 1 then
01EQ(𝑅(𝜑)) = 1. The second part, traditionally known as the sound-
ness property, is to show that if 01EQ(𝑅(𝜑)) = 1 then 3SAT(𝜑) = 1.
(The names “completeness” and “soundness” derive from viewing a so-
lution to 𝑅(𝜑) as a “proof” that 𝜑 is satisfiable, in which case these
conditions corresponds to completeness and soundness as defined
in Section 11.1.1. However, if you find the names confusing you can
simply think of completeness as the “1-instance maps to 1-instance”

property and soundness as the “0-instance maps to 0-instance” prop-


erty.)
We complete the proof by showing both parts:

• Completeness: Suppose that 3SAT(𝜑) = 1, which means that


there is an assignment 𝑥 ∈ {0, 1}𝑛 that satisfies 𝜑. If we use the
assignment 𝑥0 , … , 𝑥𝑛−1 and 1 − 𝑥0 , … , 1 − 𝑥𝑛−1 for the first 2𝑛
variables of 𝐸 = 𝑅(𝜑) then we will satisfy all equations of the form
𝑥𝑖 + 𝑥′𝑖 = 1. Moreover, for every 𝑗 ∈ [𝑚], if 𝑡0 + 𝑡1 + 𝑡2 + 𝑦𝑗 + 𝑧𝑗 =
3(∗) is the equation arising from the 𝑗th clause of 𝜑 (with 𝑡0 , 𝑡1 , 𝑡2
being variables of the form 𝑥𝑖 or 𝑥′𝑖 depending on the literals of the
clause) then our assignment to the first 2𝑛 variables ensures that
𝑡0 + 𝑡1 + 𝑡2 ≥ 1 (since 𝑥 satisfied 𝜑) and hence we can assign values
to 𝑦𝑗 and 𝑧𝑗 that will ensure that the equation (∗) is satisfied. Hence
in this case 𝐸 = 𝑅(𝜑) is satisfied, meaning that 01EQ(𝑅(𝜑)) = 1.

• Soundness: Suppose that 01EQ(𝑅(𝜑)) = 1, which means that the


set of equations 𝐸 = 𝑅(𝜑) has a satisfying assignment 𝑥0 , … , 𝑥𝑛−1 ,
𝑥′0 , … , 𝑥′𝑛−1 , 𝑦0 , … , 𝑦𝑚−1 , 𝑧0 , … , 𝑧𝑚−1 . Then, since the equations
contain the condition 𝑥𝑖 + 𝑥′𝑖 = 1, for every 𝑖 ∈ [𝑛], 𝑥′𝑖 is the negation
of 𝑥𝑖 , and moreover, for every 𝑗 ∈ [𝑚], if 𝐶 has the form 𝑤0 ∨ 𝑤1 ∨ 𝑤2
and is the 𝑗-th clause of 𝜑, then the corresponding assignment 𝑥
will ensure that 𝑤0 + 𝑤1 + 𝑤2 ≥ 1, implying that 𝐶 is satisfied.
Hence in this case 3SAT(𝜑) = 1.
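
Figure 14.3 shows the book's Python implementation of this reduction; the following is a separate minimal sketch of Algorithm 14.3 (our own, not the repository's code). The encoding conventions are ours: a clause is a list of (variable index, is-positive) pairs, and a linear equation is a pair (S, b) meaning that the variables with indices in S sum to b.

def sat_to_01eq(n, clauses):
    """Map a 3CNF with n variables (clauses as lists of (var, is_positive)
    pairs) to a set of 0/1 linear equations.
    Output variable indices: 0..n-1 are x_i, n..2n-1 are x'_i, followed by
    the slack variables y_j and z_j for each clause j."""
    m = len(clauses)
    eqs = [({i, n + i}, 1) for i in range(n)]          # x_i + x'_i = 1
    for j, clause in enumerate(clauses):
        terms = {i if pos else n + i for (i, pos) in clause}
        terms |= {2 * n + j, 2 * n + m + j}            # slack variables y_j, z_j
        eqs.append((terms, 3))                         # t_0 + t_1 + t_2 + y_j + z_j = 3
    return eqs

# Example: a formula with the single clause (x_2 OR NOT x_5 OR x_7)
# from the proof idea, over 8 variables.
eqs = sat_to_01eq(8, [[(2, True), (5, False), (7, True)]])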

14.3.1 Quadratic equations


Now that we reduced 3SAT to 01EQ, we can use this to reduce 3SAT
to the quadratic equations problem. This is the function QUADEQ in
which the input is a list of 𝑛-variate polynomials 𝑝0 , … , 𝑝𝑚−1 ∶ ℝ𝑛 → ℝ
that are all of degree at most two (i.e., they are quadratic) and with
integer coefficients. (The latter condition is for convenience and can
be achieved by scaling.) We define QUADEQ(𝑝0 , … , 𝑝𝑚−1 ) to equal 1
if and only if there is a solution 𝑥 ∈ ℝ𝑛 to the equations 𝑝0 (𝑥) = 0,
𝑝1 (𝑥) = 0, …, 𝑝𝑚−1 (𝑥) = 0.
For example, the following is a set of quadratic equations over the
variables 𝑥0 , 𝑥1 , 𝑥2 :
𝑥0^2 − 𝑥0 = 0
𝑥1^2 − 𝑥1 = 0
𝑥2^2 − 𝑥2 = 0
1 − 𝑥0 − 𝑥1 + 𝑥0 𝑥1 = 0

You can verify that 𝑥 ∈ ℝ3 satisfies this set of equations if and only if
𝑥 ∈ {0, 1}3 and 𝑥0 ∨ 𝑥1 = 1.

Theorem 14.4 — Hardness of quadratic equations.

3SAT ≤𝑝 QUADEQ

Proof Idea:
Using the transitivity of reductions (Solved Exercise 14.2), it is
enough to show that 01EQ ≤𝑝 QUADEQ, but this follows since we can
phrase the condition 𝑥𝑖 ∈ {0, 1} as the quadratic constraint 𝑥𝑖^2 − 𝑥𝑖 = 0.
The takeaway technique of this reduction is that we can use non-
linearity to force continuous variables (e.g., variables taking values in
ℝ) to be discrete (e.g., take values in {0, 1}).

Proof of Theorem 14.4. By Theorem 14.2 and Solved Exercise 14.2,


it is sufficient to prove that 01EQ ≤𝑝 QUADEQ. Let 𝐸 be an in-
stance of 01EQ with variables 𝑥0 , … , 𝑥𝑛−1 . We map 𝐸 to the set of
quadratic equations 𝐸 ′ that is obtained by taking the linear equations
in 𝐸 and adding to them the 𝑛 quadratic equations 𝑥𝑖^2 − 𝑥𝑖 = 0 for
all 𝑖 ∈ [𝑛]. (See Algorithm 14.5.) The map 𝐸 ↦ 𝐸 ′ can be com-
puted in polynomial time. We claim that 01EQ(𝐸) = 1 if and only
if QUADEQ(𝐸 ′ ) = 1. Indeed, the only difference between the two
instances is that:

• In the 01EQ instance 𝐸, the equations are over variables 𝑥0 , … , 𝑥𝑛−1


in {0, 1}.

• In the QUADEQ instance 𝐸 ′ , the equations are over variables


𝑥0 , … , 𝑥𝑛−1 ∈ ℝ but we have the extra constraints 𝑥𝑖^2 − 𝑥𝑖 = 0
for all 𝑖 ∈ [𝑛].

Since for every 𝑎 ∈ ℝ, 𝑎2 − 𝑎 = 0 if and only if 𝑎 ∈ {0, 1}, the two


sets of equations are equivalent and 01EQ(𝐸) = QUADEQ(𝐸 ′ ) which
is what we wanted to prove.


Algorithm 14.5 — 01𝐸𝑄 to 𝑄𝑈 𝐴𝐷𝐸𝑄 reduction.

Input: Set 𝐸 of linear equations over 𝑛 variables 𝑥0 , … , 𝑥𝑛−1 .


Output: Set 𝐸 ′ of quadratic equations over 𝑚 variables 𝑤0 , … , 𝑤𝑚−1 such that there is a 0/1 assignment 𝑥 ∈ {0, 1}𝑛 satisfying the equations of 𝐸 iff there is an assignment 𝑤 ∈ ℝ𝑚 satisfying the equations of 𝐸 ′ . That is, 01𝐸𝑄(𝐸) = 𝑄𝑈 𝐴𝐷𝐸𝑄(𝐸 ′ ).
1: Let 𝑚 ← 𝑛.
2: Variables of 𝐸 ′ are set to be the same variables 𝑥0 , … , 𝑥𝑛−1 as 𝐸.
3: for every equation 𝑒 ∈ 𝐸 do
4: Add 𝑒 to 𝐸 ′
5: end for
6: for 𝑖 ∈ [𝑛] do
7: Add to 𝐸 ′ the equation 𝑥𝑖^2 − 𝑥𝑖 = 0.
8: end for
9: return 𝐸 ′
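
The following is a minimal Python sketch of Algorithm 14.5 (again ours, not the repository's code). We keep the linear equations in the (S, b) encoding used above, and represent a quadratic polynomial as a dictionary mapping a tuple of at most two variable indices to an integer coefficient, with the empty tuple standing for the constant term; this encoding is our own choice for illustration.

def zero_one_eq_to_quadeq(eqs, n):
    """Map a 01EQ instance (list of (S, b) pairs over n variables) to a list
    of quadratic polynomials that should all evaluate to zero."""
    quad = []
    for (S, b) in eqs:
        p = {(i,): 1 for i in S}        # the linear equation sum_{i in S} x_i = b ...
        p[()] = -b                      # ... rewritten as sum_{i in S} x_i - b = 0
        quad.append(p)
    for i in range(n):
        quad.append({(i, i): 1, (i,): -1})   # x_i^2 - x_i = 0 forces x_i in {0,1}
    return quad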

14.4 THE SUBSET SUM PROBLEM


As another consequence of the reduction of 3SAT to 01EQ, we can
also show that 3SAT (through 01EQ) reduces to the subset sum prob-
lem (also known as the knapsack problem). In the subset sum prob-
lem, we are given a list of integers 𝑥0 , … , 𝑥𝑛−1 ∈ ℤ and an integer
𝑇 ∈ ℤ. We need to determine whether or not there exists some set
of the integers that sums up to 𝑇 . That is, for 𝑥0 , … , 𝑥𝑛−1 , 𝑇 ∈ ℤ,
SSUM(𝑥0 , … , 𝑥𝑛−1 , 𝑇 ) = 1 if and only if there exists 𝑆 ⊆ [𝑛] such that
∑𝑖∈𝑆 𝑥𝑖 = 𝑇 . Note that the input length for the subset sum problem
is the length of the string needed to encode all the numbers, which will
be approximately ⌈log 𝑇 ⌉ + ∑_{𝑖=0}^{𝑛−1} ⌈log 𝑥𝑖 ⌉, since encoding an integer 𝑥
using the binary representation requires ⌈log 𝑥⌉ bits.

Theorem 14.6 — Hardness of subset sum.

3SAT ≤𝑝 SSUM

Proof Idea:
We reduce from 01EQ. The intuition is the following. Consider
an instance 𝐸 of 01EQ with 𝑛 variables 𝑥0 , … , 𝑥𝑛−1 and 𝑚 equa-
tions 𝑒0 , … , 𝑒𝑚−1 . Recall that each equation 𝑒ℓ in 𝐸 has the form
𝑥𝑖 + 𝑥𝑗 + 𝑥𝑘 = 𝑏 (potentially with more or less than three variables
summed up on the left-hand side of the equation). For every variable
𝑥𝑖 , we can define a vector 𝑣𝑖 ∈ {0, 1}𝑚 where 𝑣𝑡𝑖 = 1 if the variable

𝑥𝑖 appears in the equation 𝑒𝑡 and 𝑣𝑡𝑖 = 0 otherwise. Then there is a


solution to the set of equations if and only if there is some set 𝑆 ⊆ [𝑛]
(corresponding to the 𝑖’s such that 𝑥𝑖 = 1) such that ∑𝑖∈𝑆 𝑣𝑖 = 𝑏⃗
where 𝑏⃗ ∈ ℤ𝑚 is the vector of right hand sides of the equations (i.e., 𝑏⃗𝑡
is the value 𝑏𝑡 on the righthand side of the 𝑡-th equation). Now if we
could interpret the vectors 𝑣0 , … , 𝑣𝑛−1 and 𝑏⃗ as numbers then we could
think of this as a subset sum instance. The key insight is that we can in
fact think of vectors as numbers by thinking of the 𝑗-th coordinate of
the vector 𝑣 as the 𝑗-th digit. Since the vectors are in {0, 1}𝑚 , the nat-
ural choice is to use the binary basis, but this turns out to cause issues
with “carries” when we add them up. Hence we use a larger basis 𝐵,
see proof below.

Proof of Theorem 14.6. For a given instance of 01EQ over 𝑛 variables, we note
that if an equation has a right-hand side larger than 𝑛 then it can never be
satisfied (since the sum of at most 𝑛 variables in {0, 1} is at most 𝑛). More
concretely, if the instance has such an equation then we know for sure that
the answer is 0 (and in the context of a reduction we can map it into some
trivial instance of subset sum that doesn’t have a solution, such as
𝑥0 = 𝑥1 = 1 and 𝑇 = 3).
Our reduction is described in Algorithm 14.7. On input an instance
𝐸 = {𝑒𝑡 }𝑚𝑡=1 of 01EQ over 𝑛 variables 𝑥0 , … , 𝑥𝑛−1 , we output an SSUM
instance 𝑦0 , … , 𝑦𝑛−1 , 𝑇 computed as follows:
• 𝑦𝑖 = ∑_{𝑡=0}^{𝑚−1} 𝐵^𝑡 𝑣^𝑖_𝑡 where 𝑣^𝑖_𝑡 equals 1 if the variable 𝑥𝑖 appears in the
equation 𝑒𝑡 and equals 0 otherwise. The number 𝐵 is set to be 2𝑛
(any number larger than 𝑛 would work).

• 𝑇 = ∑_{𝑡=0}^{𝑚−1} 𝐵^𝑡 𝑏𝑡 where 𝑏𝑡 is the integer on the right-hand side of the
equation 𝑒𝑡 .

In other words, 𝑦0 , … , 𝑦𝑛−1 and 𝑇 are the integers such that, written
in the 𝐵-ary basis, the 𝑡-th digit of 𝑦𝑖 is 1 iff 𝑥𝑖 appears in 𝑒𝑡 , and the
𝑡-th digit of 𝑇 is the right-hand side of 𝑒𝑡 .
The following claim will imply the correctness of the reduction:
Claim: For every 𝑥 ∈ {0, 1}𝑛 , if 𝑆 = {𝑖|𝑥𝑖 = 1} then 𝑥 satisfies the
equations of 𝐸 if and only if ∑𝑖∈𝑆 𝑦𝑖 = 𝑇 .
Proof: Key to the proof is the following simple property of grade-
school addition: when adding at most 𝑛 numbers in the 𝐵-ary basis,
if all the numbers have all their digits either 0 or 1, and 𝐵 > 𝑛, then
for every 𝑡, the 𝑡-th digit of the sum is the sum of the 𝑡-th digits of
the numbers. This is a simple consequence of the fact that there is no
“carry” in the addition. Since in our case the numbers 𝑦0 , … , 𝑦𝑛−1 sat-
isfy this property in the 𝐵-ary basis, and 𝐵 > 𝑛, we get that for every

𝑆 ⊆ [𝑛] and every digit 𝑡, the 𝑡-th digit of the sum ∑𝑖∈𝑆 𝑦𝑖 is simply
the sum of the 𝑡-th digit, which would correspond to the sum over 𝑥𝑖
for all 𝑥𝑖 ’s that participate in the 𝑡-th equation. This sum would equal
the 𝑡-th digit of 𝑇 if and only if that equation is satisfied.
The claim shows that 01EQ(𝐸) = SSUM(𝑦0 , … , 𝑦𝑛−1 , 𝑇 ) which is
what we needed to prove.

Algorithm 14.7 — 01𝐸𝑄 to 𝑆𝑆𝑈 𝑀 reduction.

Input: Set 𝐸 = {𝑒𝑡 }𝑡∈[𝑚] of 𝑚 linear equations over 𝑛


variables 𝑥0 , … , 𝑥𝑛−1 .
Output: Numbers 𝑦0 , … , 𝑦𝑛−1 , 𝑇 ∈ ℤ such that there is an
0/1 assignment 𝑥 ∈ {0, 1}𝑛 satisfying the equations of
𝐸 iff there is 𝑆 ⊆ [𝑛] such that ∑𝑖∈𝑆 𝑦𝑖 = 𝑇 .
1: for every equation 𝑒𝑡 ∈ 𝐸 do
2: Let 𝐴 ⊆ [𝑛] and 𝑏 ∈ ℤ be such that 𝑒𝑡 has the form
∑𝑖∈𝐴 𝑥𝑖 = 𝑏
3: Let 𝑣𝑖𝑡 ← 1 if 𝑖 ∈ 𝐴 and 𝑣𝑖𝑡 ← 0 otherwise.
4: Let 𝑏𝑡 ← 𝑏.
5: end for
6: Set 𝐵 ← 2𝑛
7: for 𝑖 ∈ [𝑛] do
8: Let 𝑦𝑖 ← ∑_{𝑡=1}^{𝑚} 𝐵^𝑡 𝑣_𝑖^𝑡 .
9: end for
10: Let 𝑇 ← ∑_{𝑡=1}^{𝑚} 𝐵^𝑡 𝑏𝑡
11: return 𝑦0 , … , 𝑦𝑛−1 , 𝑇
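
Here is a minimal Python sketch of Algorithm 14.7 (ours, not the repository's code), reusing the (S, b) encoding of equations and assuming, as in the proof, that every right-hand side is at most 𝑛 (otherwise one would output a trivially unsatisfiable instance).

def zero_one_eq_to_subset_sum(eqs, n):
    """Map a 01EQ instance to numbers y_0,...,y_{n-1} and a target T, writing
    each number in base B = 2n so that adding at most n of them causes no carries."""
    B = 2 * n
    ys = [sum(B ** t for t, (S, b) in enumerate(eqs) if i in S) for i in range(n)]
    T = sum(b * B ** t for t, (S, b) in enumerate(eqs))
    return ys, T

# The example instance from Section 14.3 (x0+x1+x2=2, x0+x2=1, x1+x2=2):
eqs = [({0, 1, 2}, 2), ({0, 2}, 1), ({1, 2}, 2)]
ys, T = zero_one_eq_to_subset_sum(eqs, 3)
assert ys[1] + ys[2] == T    # the satisfying assignment x=011 corresponds to S={1,2}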

14.5 THE INDEPENDENT SET PROBLEM


For a graph 𝐺 = (𝑉 , 𝐸), an independent set (also known as a stable
set) is a subset 𝑆 ⊆ 𝑉 such that there are no edges with both end-
points in 𝑆 (in other words, 𝐸(𝑆, 𝑆) = ∅). Every “singleton” (set
consisting of a single vertex) is trivially an independent set, but find-
ing larger independent sets can be challenging. The maximum indepen-
dent set problem (henceforth simply “independent set”) is the task of
finding the largest independent set in the graph. The independent set
problem is naturally related to scheduling problems: if we put an edge
between two conflicting tasks, then an independent set corresponds
to a set of tasks that can all be scheduled together without conflicts.
The independent set problem has been studied in a variety of settings,
including for example in the case of algorithms for finding structure in
protein-protein interaction graphs.
As mentioned in Section 14.1, we think of the independent set prob-
lem as the function ISET ∶ {0, 1}∗ → {0, 1} that on input a graph 𝐺

and a number 𝑘 outputs 1 if and only if the graph 𝐺 contains an in-


dependent set of size at least 𝑘. We now reduce 3SAT to Independent
set.

Theorem 14.8 — Hardness of Independent Set. 3SAT ≤𝑝 ISET.

Proof Idea:
The idea is that finding a satisfying assignment to a 3SAT formula
corresponds to satisfying many local constraints without creating
any conflicts. One can think of “𝑥17 = 0” and “𝑥17 = 1” as two
conflicting events, and of the constraint 𝑥17 ∨ ¬𝑥5 ∨ 𝑥9 as creating
a conflict between the events “𝑥17 = 0”, “𝑥5 = 1” and “𝑥9 = 0”,
saying that these three cannot simultaneously co-occur. Using these
ideas, we can think of solving a 3SAT problem as trying to
schedule non-conflicting events, though the devil is, as usual, in the
details. The takeaway technique here is to map each clause of the
original formula into a gadget which is a small subgraph (or more
generally “subinstance”) satisfying some convenient properties. We
will see these “gadgets” used time and again in the construction of
polynomial-time reductions.

Algorithm 14.9 — 3𝑆𝐴𝑇 to 𝐼𝑆 reduction.

Input: 3𝑆𝐴𝑇 formula 𝜑 with 𝑛 variables and 𝑚 clauses.


Output: Graph 𝐺 = (𝑉 , 𝐸) and number 𝑘, such that 𝐺 has an independent set of size 𝑘 iff 𝜑 has a satisfying assignment. That is, 3𝑆𝐴𝑇 (𝜑) = 𝐼𝑆𝐸𝑇 (𝐺, 𝑘).
1: Initialize 𝑉 ← ∅, 𝐸 ← ∅
2: for every clause 𝐶 = 𝑦 ∨ 𝑦′ ∨ 𝑦″ of 𝜑 do
3: Add three vertices (𝐶, 𝑦), (𝐶, 𝑦′ ), (𝐶, 𝑦″ ) to 𝑉
4: Add edges {(𝐶, 𝑦), (𝐶, 𝑦′ )}, {(𝐶, 𝑦′ ), (𝐶, 𝑦″ )}, {(𝐶, 𝑦″ ), (𝐶, 𝑦)} to 𝐸.
5: end for
6: for every pair of distinct clauses 𝐶, 𝐶 ′ in 𝜑 do
7: for every 𝑖 ∈ [𝑛] do
8: if 𝐶 contains literal 𝑥𝑖 and 𝐶 ′ contains literal ¬𝑥𝑖 then
9: Add edge {(𝐶, 𝑥𝑖 ), (𝐶 ′ , ¬𝑥𝑖 )} to 𝐸
10: end if
11: end for
12: end for
13: return (𝐺 = (𝑉 , 𝐸), 𝑚)

Figure 14.4: An example of the reduction of 3SAT to ISET for the case the original input formula is 𝜑 = (𝑥0 ∨ 𝑥1 ∨ 𝑥2 ) ∧ (𝑥0 ∨ 𝑥1 ∨ 𝑥2 ) ∧ (𝑥1 ∨ 𝑥2 ∨ 𝑥3 ). We map each clause of 𝜑 to a triangle of three vertices, each tagged above with “𝑥𝑖 = 0” or “𝑥𝑖 = 1” depending on the value of 𝑥𝑖 that would satisfy the particular literal. We put an edge between every two literals that are conflicting (i.e., tagged with “𝑥𝑖 = 0” and “𝑥𝑖 = 1” respectively).
14: return (𝐺 = (𝑉 , 𝐸), 𝑚)

Proof of Theorem 14.8. Given a 3SAT formula 𝜑 on 𝑛 variables and


with 𝑚 clauses, we will create a graph 𝐺 with 3𝑚 vertices as follows.
(See Algorithm 14.9, see also Fig. 14.4 for an example and Fig. 14.5 for
Python code.)

• A clause 𝐶 in 𝜑 has the form 𝐶 = 𝑦 ∨ 𝑦′ ∨ 𝑦″ where 𝑦, 𝑦′ , 𝑦″ are


literals (variables or their negation). For each such clause 𝐶, we will
add three vertices to 𝐺, and label them (𝐶, 𝑦), (𝐶, 𝑦′ ), and (𝐶, 𝑦″ )
respectively. We will also add the three edges between all pairs of
these vertices, so they form a triangle. Since there are 𝑚 clauses in 𝜑,
the graph 𝐺 will have 3𝑚 vertices.

• In addition to the above edges, we also add an edge between ev-


ery pair of vertices of the form (𝐶, 𝑦) and (𝐶 ′ , 𝑦′ ) where 𝑦 and 𝑦′
are conflicting literals. That is, we add an edge between (𝐶, 𝑦) and
(𝐶 ′ , 𝑦′ ) if there is an 𝑖 such that 𝑦 = 𝑥𝑖 and 𝑦′ = ¬𝑥𝑖 or vice versa.

The algorithm constructing 𝐺 based on 𝜑 takes polynomial time


since it involves two loops, the first taking 𝑂(𝑚) steps and the second
taking 𝑂(𝑚2 𝑛) steps (see Algorithm 14.9). Hence to prove the theo-
rem we need to show that 𝜑 is satisfiable if and only if 𝐺 contains an
independent set of 𝑚 vertices. We now show both directions of this
equivalence:
Part 1: Completeness. The “completeness” direction is to show that
if 𝜑 has a satisfying assignment 𝑥∗ , then 𝐺 has an independent set 𝑆 ∗
of 𝑚 vertices. Let us now show this.
Indeed, suppose that 𝜑 has a satisfying assignment 𝑥∗ ∈ {0, 1}𝑛 .
Then for every clause 𝐶 = 𝑦 ∨ 𝑦′ ∨ 𝑦″ of 𝜑, one of the literals 𝑦, 𝑦′ , 𝑦″
must evaluate to true under the assignment 𝑥∗ (as otherwise it would
not satisfy 𝜑). We let 𝑆 be a set of 𝑚 vertices that is obtained by choos-
ing for every clause 𝐶 one vertex of the form (𝐶, 𝑦) such that 𝑦 eval-
uates to true under 𝑥∗ . (If there is more than one such vertex for the
same 𝐶, we arbitrarily choose one of them.)
We claim that 𝑆 is an independent set. Indeed, suppose otherwise
that there was a pair of vertices (𝐶, 𝑦) and (𝐶 ′ , 𝑦′ ) in 𝑆 that have an
edge between them. Since we picked one vertex out of each triangle
corresponding to a clause, it must be that 𝐶 ≠ 𝐶 ′ . Hence the only
way that there is an edge between (𝐶, 𝑦) and (𝐶 ′ , 𝑦′ ) is if 𝑦 and 𝑦′ are
conflicting literals (i.e., 𝑦 = 𝑥𝑖 and 𝑦′ = ¬𝑥𝑖 for some 𝑖). But then they
can’t both evaluate to true under the assignment 𝑥∗ , which contradicts
the way we constructed the set 𝑆. This completes the proof of the
completeness condition.
Part 2: Soundness. The “soundness” direction is to show that if
𝐺 has an independent set 𝑆 ∗ of 𝑚 vertices, then 𝜑 has a satisfying
assignment 𝑥∗ ∈ {0, 1}𝑛 . Let us now show this.

Indeed, suppose that 𝐺 has an independent set 𝑆 ∗ with 𝑚 vertices.


We will define an assignment 𝑥∗ ∈ {0, 1}𝑛 for the variables of 𝜑 as
follows. For every 𝑖 ∈ [𝑛], we set 𝑥∗𝑖 according to the following rules:

• If 𝑆 ∗ contains a vertex of the form (𝐶, 𝑥𝑖 ) then we set 𝑥∗𝑖 = 1.

• If 𝑆 ∗ contains a vertex of the form (𝐶, ¬𝑥𝑖 ) then we set 𝑥∗𝑖 = 0.

• If 𝑆 ∗ does not contain a vertex of either of these forms, then it does


not matter which value we give to 𝑥∗𝑖 , but for concreteness we’ll set
𝑥∗𝑖 = 0.

The first observation is that 𝑥∗ is indeed well defined, in the sense


that the rules above do not conflict with one another by asking to set 𝑥∗𝑖
to be both 0 and 1. This follows from the fact that 𝑆 ∗ is an independent
set and hence if it contains a vertex of the form (𝐶, 𝑥𝑖 ) then it cannot
contain a vertex of the form (𝐶 ′ , ¬𝑥𝑖 ).
We now claim that 𝑥∗ is a satisfying assignment for 𝜑. Indeed, since
𝑆 ∗ is an independent set, it cannot have more than one vertex inside
each one of the 𝑚 triangles (𝐶, 𝑦), (𝐶, 𝑦′ ), (𝐶, 𝑦″ ) corresponding to a
clause of 𝜑. Hence since |𝑆 ∗ | = 𝑚, it must have exactly one vertex in
each such triangle. For every clause 𝐶 of 𝜑, if (𝐶, 𝑦) is the vertex in
𝑆 ∗ in the triangle corresponding to 𝐶, then by the way we defined 𝑥∗ ,
the literal 𝑦 must evaluate to true, which means that 𝑥∗ satisfies this
clause. Therefore 𝑥∗ satisfies all clauses of 𝜑, which is the definition of
a satisfying assignment.
This completes the proof of Theorem 14.8

Figure 14.5: The reduction of 3SAT to Independent Set. On the right-hand side is Python code that implements this reduction. On the left-hand side is a sample output of the reduction. We use black for the “triangle edges” and red for the “conflict edges”. Note that the satisfying assignment 𝑥∗ = 0110 corresponds to the independent set (0, ¬𝑥3 ), (1, ¬𝑥0 ), (2, 𝑥2 ).
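
Figure 14.5 shows the book's own Python implementation; the following is a separate minimal sketch of Algorithm 14.9 (ours, not the repository's code), with a literal encoded as a (variable, is-positive) pair and each clause assumed to consist of three distinct literals.

def sat_to_iset(clauses):
    """Map a 3CNF formula to (vertices, edges, k): the graph has an
    independent set of size k = number of clauses iff the formula is satisfiable."""
    vertices, edges = [], set()
    for j, clause in enumerate(clauses):
        tri = [(j, lit) for lit in clause]
        vertices += tri
        for a in range(len(tri)):                 # "triangle" edges within a clause
            for b in range(a + 1, len(tri)):
                edges.add(frozenset([tri[a], tri[b]]))
    for (j, (i, pos)) in vertices:                # "conflict" edges between clauses
        for (jp, (ip, posp)) in vertices:
            if j != jp and i == ip and pos != posp:
                edges.add(frozenset([(j, (i, pos)), (jp, (ip, posp))]))
    return vertices, edges, len(clauses)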

14.6 SOME EXERCISES AND ANATOMY OF A REDUCTION.


Reductions can be confusing and working out exercises is a great way
to gain more comfort with them. Here is one such example. As usual,
I recommend you try it out yourself before looking at the solution.

Solved Exercise 14.3 — Vertex cover. A vertex cover in a graph 𝐺 = (𝑉 , 𝐸)
is a subset 𝑆 ⊆ 𝑉 of vertices such that every edge touches at least
one vertex of 𝑆 (see Fig. 14.6). The vertex cover problem is the task to
determine, given a graph 𝐺 and a number 𝑘, whether there exists a
vertex cover in the graph with at most 𝑘 vertices. Formally, this is the
function VC ∶ {0, 1}∗ → {0, 1} such that for every 𝐺 = (𝑉 , 𝐸) and
𝑘 ∈ ℕ, VC(𝐺, 𝑘) = 1 if and only if there exists a vertex cover 𝑆 ⊆ 𝑉
such that |𝑆| ≤ 𝑘.
Prove that 3SAT ≤𝑝 VC.

Solution:
The key observation is that if 𝑆 ⊆ 𝑉 is a vertex cover (i.e., a set that
touches every edge), then there is no edge 𝑒 such that both 𝑒’s endpoints
are in the complement set 𝑉 ⧵ 𝑆, and vice versa. In other words,
𝑆 is a vertex cover if and only if 𝑉 ⧵ 𝑆 is an independent set. Since
the size of 𝑉 ⧵ 𝑆 is |𝑉 | − |𝑆|, we see that the polynomial-time map
𝑅(𝐺, 𝑘) = (𝐺, 𝑛 − 𝑘) (where 𝑛 is the number of vertices of 𝐺)
satisfies that VC(𝑅(𝐺, 𝑘)) = ISET(𝐺, 𝑘), which means that it is a
reduction from independent set to vertex cover. Combining this with
Theorem 14.8 and the transitivity of reductions (Solved Exercise 14.2)
yields 3SAT ≤𝑝 VC.

Figure 14.6: A vertex cover in a graph is a subset of vertices that touches all edges. In this 7-vertex graph, the 3 filled vertices are a vertex cover.
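
In code, this reduction is essentially a one-liner; the sketch below (ours, not the repository's) represents a graph as a pair (vertices, edges), with edges given as a set of two-element frozensets.

def iset_to_vc(G, k):
    vertices, edges = G
    return G, len(vertices) - k      # ISET(G, k) = VC(G, n - k)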

Solved Exercise 14.4 — Clique is equivalent to independent set. The maximum
clique problem corresponds to the function CLIQUE ∶ {0, 1}∗ → {0, 1}
such that for a graph 𝐺 and a number 𝑘, CLIQUE(𝐺, 𝑘) = 1 iff there
is a subset 𝑆 of 𝑘 vertices such that for every distinct 𝑢, 𝑣 ∈ 𝑆, the edge
{𝑢, 𝑣} is in 𝐺. Such a set is known as a clique.
Prove that CLIQUE ≤𝑝 ISET and ISET ≤𝑝 CLIQUE.

Solution:
If 𝐺 = (𝑉 , 𝐸) is a graph, we denote by 𝐺̄ its complement, which
is the graph on the same vertices 𝑉 such that for every distinct
𝑢, 𝑣 ∈ 𝑉 , the edge {𝑢, 𝑣} is present in 𝐺̄ if and only if this edge is
not present in 𝐺.
This means that for every set 𝑆, 𝑆 is an independent set in 𝐺 if
and only if 𝑆 is a clique in 𝐺̄. Therefore for every 𝑘, ISET(𝐺, 𝑘) =
CLIQUE(𝐺̄, 𝑘). Since the map 𝐺 ↦ 𝐺̄ can be computed efficiently,
this yields a reduction ISET ≤𝑝 CLIQUE. Moreover, since the
complement of 𝐺̄ is 𝐺, this yields a reduction in the other direction as well.
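
A matching sketch for this reduction, using the same (vertices, edges) representation as above:

from itertools import combinations

def complement(G):
    """Return the complement graph: same vertices, complemented edge set."""
    vertices, edges = G
    all_pairs = {frozenset(p) for p in combinations(vertices, 2)}
    return (vertices, all_pairs - edges)

def iset_to_clique(G, k):
    return complement(G), k          # ISET(G, k) = CLIQUE(complement of G, k)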

14.6.1 Dominating set


In the two examples above, the reduction was almost “trivial”: the
reduction from independent set to vertex cover merely changes the

number 𝑘 to 𝑛 − 𝑘, and the reduction from independent set to clique


flips edges to non-edges and vice versa. The following exercise re-
quires a somewhat more interesting reduction.
Solved Exercise 14.5 — Dominating set. A dominating set in a graph 𝐺 =
(𝑉 , 𝐸) is a subset 𝑆 ⊆ 𝑉 of vertices such that every 𝑢 ∈ 𝑉 ⧵ 𝑆 is a
neighbor in 𝐺 of some 𝑠 ∈ 𝑆 (see Fig. 14.7). The dominating set problem
is the task, given a graph 𝐺 = (𝑉 , 𝐸) and number 𝑘, of determining
whether there exists a dominating set 𝑆 ⊆ 𝑉 with |𝑆| ≤ 𝑘. Formally,
this is the function DS ∶ {0, 1}∗ → {0, 1} such that DS(𝐺, 𝑘) = 1 iff
there is a dominating set in 𝐺 of at most 𝑘 vertices.
Prove that ISET ≤𝑝 DS.

Solution:
Since we know that ISET ≤𝑝 VC, using transitivity, it is enough
to show that VC ≤𝑝 DS. As Fig. 14.7 shows, a dominating set is
not the same thing as a vertex cover. However, we can still relate
the two problems. The idea is to map a graph 𝐺 into a graph 𝐻
such that a vertex cover in 𝐺 would translate into a dominating set
in 𝐻 and vice versa. We do so by including in 𝐻 all the vertices
and edges of 𝐺, but for every edge {𝑢, 𝑣} of 𝐺 we also add to 𝐻 a
new vertex 𝑤𝑢,𝑣 and connect it to both 𝑢 and 𝑣. Let ℓ be the number
of isolated vertices in 𝐺. The idea behind the proof is that we can
transform a vertex cover 𝑆 of 𝑘 vertices in 𝐺 into a dominating set
of 𝑘 + ℓ vertices in 𝐻 by adding to 𝑆 all the isolated vertices, and
moreover we can transform every (𝑘 + ℓ)-sized dominating set in 𝐻
into a vertex cover in 𝐺. We now give the details.

Figure 14.7: A dominating set is a subset 𝑆 of vertices such that every vertex in the graph is either in 𝑆 or a neighbor of 𝑆. The figure shows two copies of the same graph. The red vertices on the left are a vertex cover that is not a dominating set. The blue vertices on the right are a dominating set that is not a vertex cover.
Description of the algorithm. Given an instance (𝐺, 𝑘) for the
vertex cover problem, we will map 𝐺 into an instance (𝐻, 𝑘′ ) for
the dominating set problem as follows (see Fig. 14.8 for Python
implementation):

Algorithm 14.10 — 𝑉 𝐶 to 𝐷𝑆 reduction.

Input: Graph 𝐺 = (𝑉 , 𝐸) and number 𝑘.


Output: Graph 𝐻 = (𝑉 ′ , 𝐸 ′ ) and number 𝑘′ , such that
𝐺 has a vertex cover of size 𝑘 iff 𝐻 has a dominating
set of size 𝑘′ , that is, 𝐷𝑆(𝐻, 𝑘′ ) = 𝑉 𝐶(𝐺, 𝑘).
1: Initialize 𝑉 ′ ← 𝑉 , 𝐸 ′ ← 𝐸
2: for every edge {𝑢, 𝑣} ∈ 𝐸 do
3: Add vertex 𝑤𝑢,𝑣 to 𝑉 ′
4: Add edges {𝑢, 𝑤𝑢,𝑣 }, {𝑣, 𝑤𝑢,𝑣 } to 𝐸 ′ .
5: end for
6: Let ℓ ← number of isolated vertices in 𝐺
7: return (𝐻 = (𝑉 ′ , 𝐸 ′ ) , 𝑘 + ℓ)

Algorithm 14.10 runs in polynomial time, since the loop takes


𝑂(𝑚) steps where 𝑚 is the number of edges, and each step can be
implemented in constant or at most linear time (depending on the
representation of the graph 𝐻). Counting the number of isolated
vertices in an 𝑛 vertex graph 𝐺 can be done in time 𝑂(𝑛2 ) if 𝐺 is
represented in the adjacency matrix representation and 𝑂(𝑛) time
if it is represented in the adjacency list representation. Regardless,
the algorithm runs in polynomial time.
To complete the proof we need to prove that for every 𝐺, 𝑘,
if 𝐻, 𝑘′ is the output of Algorithm 14.10 on input (𝐺, 𝑘), then
DS(𝐻, 𝑘′ ) = VC(𝐺, 𝑘). We split the proof into two parts. The
completeness part is that if VC(𝐺, 𝑘) = 1 then DS(𝐻, 𝑘′ ) = 1. The
soundness part is that if DS(𝐻, 𝑘′ ) = 1 then VC(𝐺, 𝑘) = 1.
Completeness. Suppose that VC(𝐺, 𝑘) = 1. Then there is a ver-
tex cover 𝑆 ⊆ 𝑉 of at most 𝑘 vertices. Let 𝐼 be the set of isolated
vertices in 𝐺 and ℓ be their number. Then |𝑆 ∪ 𝐼| ≤ 𝑘 + ℓ. We claim
that 𝑆 ∪ 𝐼 is a dominating set in 𝐻. Indeed for every vertex 𝑣 of 𝐻
there are three cases:

• Case 1: 𝑣 is an isolated vertex of 𝐺. In this case 𝑣 is in 𝑆 ∪ 𝐼.

• Case 2: 𝑣 is a non-isolated vertex of 𝐺 and hence there is an edge


{𝑢, 𝑣} of 𝐺 for some 𝑢. In this case since 𝑆 is a vertex cover, one
of 𝑢, 𝑣 has to be in 𝑆, and hence either 𝑣 or a neighbor of 𝑣 has to
be in 𝑆 ⊆ 𝑆 ∪ 𝐼.

• Case 3: 𝑣 is of the form 𝑤𝑢,𝑢′ for some two neighbors 𝑢, 𝑢′ in 𝐺.


But then since 𝑆 is a vertex cover, one of 𝑢, 𝑢′ has to be in 𝑆 and
hence 𝑆 contains a neighbor of 𝑣.

We conclude that 𝑆 ∪ 𝐼 is a dominating set of size at most


𝑘′ = 𝑘 + ℓ in 𝐻 and hence under the assumption that VC(𝐺, 𝑘) = 1,
DS(𝐻, 𝑘′ ) = 1.
Soundness. Suppose that DS(𝐻, 𝑘′ ) = 1. Then there is a domi-
nating set 𝐷 of size at most 𝑘′ = 𝑘 + ℓ in 𝐻. For every edge {𝑢, 𝑣} in
the graph 𝐺, if 𝐷 contains the vertex 𝑤𝑢,𝑣 then we remove this ver-
tex and add 𝑢 in its place. The only two neighbors of 𝑤𝑢,𝑣 are 𝑢 and
𝑣, and since 𝑢 is a neighbor of both 𝑤𝑢,𝑣 and of 𝑣, replacing 𝑤𝑢,𝑣
with 𝑢 maintains the property that it is a dominating set. More-
over, this change cannot increase the size of 𝐷. Thus following this
modification, we can assume that 𝐷 is a dominating set of at most
𝑘 + ℓ vertices that does not contain any vertices of the form 𝑤𝑢,𝑣 .
Let 𝐼 be the set of isolated vertices in 𝐺. These vertices are also
isolated in 𝐻 and hence must be included in 𝐷 (an isolated ver-
tex must be in any dominating set, since it has no neighbors). We
let 𝑆 = 𝐷 ⧵ 𝐼. Then |𝑆| ≤ 𝑘. We claim that 𝑆 is a vertex cover
in 𝐺. Indeed, for every edge {𝑢, 𝑣} of 𝐺, either the vertex 𝑤𝑢,𝑣 or
one of its neighbors must be in 𝐷 by the dominating set property.
But since we ensured 𝐷 doesn’t contain any of the vertices of the
form 𝑤𝑢,𝑣 , it must be the case that either 𝑢 or 𝑣 is in 𝐷, and since
𝑢 and 𝑣 are not isolated they are not in 𝐼, so either 𝑢 or 𝑣 is in 𝑆. This shows
that 𝑆 is a vertex cover of 𝐺 of size at most 𝑘, hence proving that
VC(𝐺, 𝑘) = 1.

A corollary of Algorithm 14.10 and the other reductions we have


seen so far is that if DS ∈ P (i.e., dominating set has a polynomial-time
algorithm) then 3SAT ∈ P (i.e., 3SAT has a polynomial-time algo-
rithm). By the contra-positive, if 3SAT does not have a polynomial-
time algorithm then neither does dominating set.

Figure 14.8: Python implementation of the reduction from vertex cover to dominating set, together with an example of an input graph and the resulting output graph. This reduction allows us to transform a hypothetical polynomial-time algorithm for dominating set (a “whistling pig”) into a hypothetical polynomial-time algorithm for vertex cover (a “flying horse”).

14.6.2 Anatomy of a reduction


The reduction of Solved Exercise 14.5 gives a good illustration of the
anatomy of a reduction. A reduction consists of four parts:

Figure 14.9: The four components of a reduction, illustrated for the particular reduction of vertex cover to dominating set. A reduction from problem 𝐹 to problem 𝐺 is an algorithm that maps an input 𝑥 for 𝐹 into an input 𝑅(𝑥) for 𝐺. To show that the reduction is correct we need to show the properties of efficiency: algorithm 𝑅 runs in polynomial time, completeness: if 𝐹 (𝑥) = 1 then 𝐺(𝑅(𝑥)) = 1, and soundness: if 𝐺(𝑅(𝑥)) = 1 then 𝐹 (𝑥) = 1.

• Algorithm description: This is the description of how the algorithm


maps an input into the output. For example, in Solved Exercise 14.5
this is the description of how we map an instance (𝐺, 𝑘) of the
vertex cover problem into an instance (𝐻, 𝑘′ ) of the dominating set
problem.

• Algorithm analysis: It is not enough to describe how the algorithm


works but we need to also explain why it works. In particular we
need to provide an analysis explaining why the reduction is both
efficient (i.e., runs in polynomial time) and correct (satisfies that
𝐺(𝑅(𝑥)) = 𝐹 (𝑥) for every 𝑥). Specifically, the components of
analysis of a reduction 𝑅 include:

– Efficiency: We need to show that 𝑅 runs in polynomial time. In


most reductions we encounter this part is straightforward, as the
reductions we typically use involve a constant number of nested
loops, each involving a constant number of operations. For ex-
ample, the reduction of Solved Exercise 14.5 just enumerates over
the edges and vertices of the input graph.
– Completeness: In a reduction 𝑅 demonstrating 𝐹 ≤𝑝 𝐺, the
completeness condition is the condition that for every 𝑥 ∈ {0, 1}∗ ,
if 𝐹 (𝑥) = 1 then 𝐺(𝑅(𝑥)) = 1. Typically we construct the
reduction to ensure that this holds, by giving a way to map a
“certificate/solution” certifying that 𝐹 (𝑥) = 1 into a solution
certifying that 𝐺(𝑅(𝑥)) = 1. For example, in Solved Exercise 14.5
we constructed the graph 𝐻 such that for every vertex cover 𝑆
in 𝐺, the set 𝑆 ∪ 𝐼 (where 𝐼 is the isolated vertices) would be a
dominating set in 𝐻.
– Soundness: This is the condition that if 𝐹 (𝑥) = 0 then
𝐺(𝑅(𝑥)) = 0 or (taking the contrapositive) if 𝐺(𝑅(𝑥)) = 1 then
𝐹 (𝑥) = 1. This is sometimes straightforward but can often be

harder to show than the completeness condition, and in more


advanced reductions (such as the reduction 3SAT ≤𝑝 ISET
of Theorem 14.8) demonstrating soundness is the main part
of the analysis. For example, in Solved Exercise 14.5 to show
soundness we needed to show that for every dominating set 𝐷 in
the graph 𝐻, there exists a vertex cover 𝑆 of size at most |𝐷| − ℓ
in the graph 𝐺 (where ℓ is the number of isolated vertices).
This was challenging since the dominating set 𝐷 might not be
necessarily the one we “had in mind”. In particular, in the proof
above we needed to modify 𝐷 to ensure that it does not contain
vertices of the form 𝑤𝑢,𝑣 , and it was important to show that this
modification still maintains the property that 𝐷 is a dominating
set, and also does not make it bigger.

Whenever you need to provide a reduction, you should make sure


that your description has all these components. While it is sometimes
tempting to weave together the description of the reduction and its
analysis, it is usually clearer if you separate the two, and also break
down the analysis to its three components of efficiency, completeness,
and soundness.

14.7 REDUCING INDEPENDENT SET TO MAXIMUM CUT


We now show that the independent set problem reduces to the maxi-
mum cut (or “max cut”) problem, modeled as the function MAXCUT
that on input a pair (𝐺, 𝑘) outputs 1 iff 𝐺 contains a cut of at least 𝑘
edges. Since both are graph problems, a reduction from independent
set to max cut maps one graph into the other, but as we will see the
output graph does not have to have the same vertices or edges as the
input graph.

Theorem 14.11 — Hardness of Max Cut. ISET ≤𝑝 MAXCUT

Proof Idea:
We will map a graph 𝐺 into a graph 𝐻 such that a large indepen-
dent set in 𝐺 becomes a partition cutting many edges in 𝐻. We can
think of a cut in 𝐻 as coloring each vertex either “blue” or “red”. We
will add a special “source” vertex 𝑠∗ , connect it to all other vertices,
and assume without loss of generality that it is colored blue. Hence
the more vertices we color red, the more edges from 𝑠∗ we cut. Now,
for every edge 𝑢, 𝑣 in the original graph 𝐺 we will add a special “gad-
get” which will be a small subgraph that involves 𝑢,𝑣, the source 𝑠∗ ,
and two other additional vertices. We design the gadget in a way so
that if the red vertices are not an independent set in 𝐺 then the cor-
responding cut in 𝐻 will be “penalized” in the sense that it would

not cut as many edges. Once we set for ourselves this objective, it is
not hard to find a gadget that achieves it− see the proof below. Once
again the takeaway technique is to use (this time a slightly more
clever) gadget.

Figure 14.10: In the reduction of ISET to MAXCUT we map an 𝑛-vertex 𝑚-edge graph 𝐺 into the (𝑛 + 2𝑚 + 1)-vertex and (𝑛 + 5𝑚)-edge graph 𝐻 as follows. The graph 𝐻 contains a special “source” vertex 𝑠∗ , 𝑛 vertices 𝑣0 , … , 𝑣𝑛−1 , and 2𝑚 vertices 𝑒^0_0 , 𝑒^1_0 , … , 𝑒^0_{𝑚−1} , 𝑒^1_{𝑚−1} with each pair corresponding to an edge of 𝐺. We put an edge between 𝑠∗ and 𝑣𝑖 for every 𝑖 ∈ [𝑛], and if the 𝑡-th edge of 𝐺 was (𝑣𝑖 , 𝑣𝑗 ) then we add the five edges (𝑠∗ , 𝑒^0_𝑡 ), (𝑠∗ , 𝑒^1_𝑡 ), (𝑣𝑖 , 𝑒^0_𝑡 ), (𝑣𝑗 , 𝑒^1_𝑡 ), (𝑒^0_𝑡 , 𝑒^1_𝑡 ). The intent is that if we cut at most one of 𝑣𝑖 , 𝑣𝑗 from 𝑠∗ then we’ll be able to cut 4 out of these five edges, while if we cut both 𝑣𝑖 and 𝑣𝑗 from 𝑠∗ then we’ll be able to cut at most three of them.

Proof of Theorem 14.11. We will transform a graph 𝐺 of 𝑛 vertices and


𝑚 edges into a graph 𝐻 of 𝑛 + 1 + 2𝑚 vertices and 𝑛 + 5𝑚 edges in the
following way (see also Fig. 14.10). The graph 𝐻 contains all vertices
of 𝐺 (though not the edges between them!) and in addition 𝐻 also
has:
• A special vertex 𝑠∗ that is connected to all the vertices of 𝐺
• For every edge 𝑒 = {𝑢, 𝑣} ∈ 𝐸(𝐺), two vertices 𝑒^0, 𝑒^1 such that 𝑒^0 is connected to 𝑢 and 𝑒^1 is connected to 𝑣, and moreover we add the edges {𝑒^0, 𝑒^1}, {𝑒^0, 𝑠∗}, {𝑒^1, 𝑠∗} to 𝐻.
Theorem 14.11 will follow by showing that 𝐺 contains an inde-
pendent set of size at least 𝑘 if and only if 𝐻 has a cut cutting at least
𝑘 + 4𝑚 edges. We now prove both directions of this equivalence:
Part 1: Completeness. If 𝐼 is an independent 𝑘-sized set in 𝐺, then
we can define 𝑆 to be a cut in 𝐻 of the following form: we let 𝑆 con-
tain all the vertices of 𝐼 and for every edge 𝑒 = {𝑢, 𝑣} ∈ 𝐸(𝐺), if 𝑢 ∈ 𝐼
and 𝑣 ∉ 𝐼 then we add 𝑒1 to 𝑆; if 𝑢 ∉ 𝐼 and 𝑣 ∈ 𝐼 then we add 𝑒0 to
𝑆; and if 𝑢 ∉ 𝐼 and 𝑣 ∉ 𝐼 then we add both 𝑒0 and 𝑒1 to 𝑆. (We don’t
need to worry about the case that both 𝑢 and 𝑣 are in 𝐼 since it is an
independent set.) We can verify that in all cases the number of edges
from 𝑆 to its complement in the gadget corresponding to 𝑒 will be four
(see Fig. 14.11). Since 𝑠∗ is not in 𝑆, we also have 𝑘 edges from 𝑠∗ to 𝐼,
for a total of 𝑘 + 4𝑚 edges.
Part 2: Soundness. Suppose that 𝑆 is a cut in 𝐻 that cuts at least
𝐶 = 𝑘 + 4𝑚 edges. We can assume that 𝑠∗ is not in 𝑆 (otherwise we
can “flip” 𝑆 to its complement, since this does not change the size
of the cut). Now let 𝐼 be the set of vertices in 𝑆 that correspond to the
original vertices of 𝐺. If 𝐼 was an independent set of size 𝑘 then we
would be done. This might not always be the case but we will see that
if 𝐼 is not an independent set then it’s also larger than 𝑘. Specifically, we define 𝑚_in = |𝐸(𝐼, 𝐼)| to be the number of edges in 𝐺 that are contained in 𝐼 and let 𝑚_out = 𝑚 − 𝑚_in (i.e., if 𝐼 is an independent set then 𝑚_in = 0 and 𝑚_out = 𝑚). By the properties of our gadget we know that for every edge {𝑢, 𝑣} of 𝐺, we can cut at most three edges when both 𝑢 and 𝑣 are in 𝑆, and at most four edges otherwise. Hence the number 𝐶 of edges cut by 𝑆 satisfies 𝐶 ≤ |𝐼| + 3𝑚_in + 4𝑚_out = |𝐼| + 3𝑚_in + 4(𝑚 − 𝑚_in) = |𝐼| + 4𝑚 − 𝑚_in. Since 𝐶 = 𝑘 + 4𝑚 we get that |𝐼| − 𝑚_in ≥ 𝑘. Now we can transform 𝐼 into an independent set 𝐼′ by going over every one of the 𝑚_in edges that are inside 𝐼 and removing one of the endpoints of the edge from it. The resulting set 𝐼′ is an independent set in the graph 𝐺 of size |𝐼| − 𝑚_in ≥ 𝑘 and so this concludes the proof of the soundness condition.

Figure 14.11: In the reduction of independent set to max cut, for every 𝑡 ∈ [𝑚], we have a “gadget” corresponding to the 𝑡-th edge 𝑒 = {𝑣𝑖, 𝑣𝑗} in the original graph. If we think of the side of the cut containing the special source vertex 𝑠∗ as “white” and the other side as “blue”, then the leftmost and center figures show that if 𝑣𝑖 and 𝑣𝑗 are not both blue then we can cut four edges from the gadget. In contrast, by enumerating all possibilities one can verify that if both 𝑢 and 𝑣 are blue, then no matter how we color the intermediate vertices 𝑒^0_𝑡, 𝑒^1_𝑡, we will cut at most three edges from the gadget. The figure above contains only the gadget edges and ignores the edges connecting 𝑠∗ to the vertices 𝑣0, … , 𝑣𝑛−1.

Figure 14.12: The reduction of independent set to max cut. On the right-hand side is Python code implementing the reduction. On the left-hand side is an example output of the reduction where we apply it to the independent set instance that is obtained by running the reduction of Theorem 14.8 on the 3CNF formula (𝑥0 ∨ 𝑥3 ∨ 𝑥2) ∧ (𝑥0 ∨ 𝑥1 ∨ 𝑥2) ∧ (𝑥1 ∨ 𝑥2 ∨ 𝑥3).
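The Python listing referred to in Fig. 14.12 is not reproduced above, so the following is a hedged sketch of how the reduction of Theorem 14.11 could be coded. It assumes (purely for illustration) a Graph helper class with edge, edges and num_vertices methods and vertices numbered 0, … , 𝑛 − 1, in the spirit of the other code snippets in this chapter; it is not the book’s exact listing.

def ISET2MAXCUT(G, k):
    """Sketch: map an independent-set instance (G, k) to a max-cut instance (H, k + 4m)."""
    H = Graph()                          # assumed helper class
    n = G.num_vertices()                 # original vertices v_0, ..., v_{n-1}
    edges = list(G.edges())
    m = len(edges)
    for i in range(n):                   # special source s* connected to every v_i
        H.edge("s*", f"v_{i}")
    for t, (i, j) in enumerate(edges):   # gadget for the t-th edge {v_i, v_j}
        H.edge("s*", f"e_{t}_0")
        H.edge("s*", f"e_{t}_1")
        H.edge(f"v_{i}", f"e_{t}_0")
        H.edge(f"v_{j}", f"e_{t}_1")
        H.edge(f"e_{t}_0", f"e_{t}_1")
    return H, k + 4 * m                  # G has a k-sized independent set iff H has a cut of at least k+4m edges

On an 𝑛-vertex, 𝑚-edge input this produces exactly the 𝑛 + 2𝑚 + 1 vertex, 𝑛 + 5𝑚 edge graph described in Fig. 14.10.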

14.8 REDUCING 3SAT TO LONGEST PATH


Note: This section is still a little messy; feel free to skip it or just read
it without going into the proof details. The proof appears in Section
7.5 in Sipser’s book.
One of the most basic algorithms in Computer Science is Dijkstra’s algorithm to find the shortest path between two vertices. We now show that in contrast, an efficient algorithm for the longest path problem would imply a polynomial-time algorithm for 3SAT.

Theorem 14.12 — Hardness of longest path. 3SAT ≤𝑝 LONGPATH

Proof Idea:
To prove Theorem 14.12 we need to show how to transform a 3CNF formula 𝜑 into a graph 𝐺 and two vertices 𝑠, 𝑡 such that 𝐺 has a path of length at least 𝑘 if and only if 𝜑 is satisfiable. The idea of the reduction is sketched in Fig. 14.13 and Fig. 14.14. We will construct a graph that contains a potentially long “snaking path” that corresponds to all variables in the formula. We will add a “gadget” corresponding to each clause of 𝜑 in a way that we would only be able to use the gadgets if we have a satisfying assignment.

Figure 14.13: We can transform a 3SAT formula 𝜑 into a graph 𝐺 such that the longest path in the graph 𝐺 would correspond to a satisfying assignment in 𝜑. In this graph, the black colored part corresponds to the variables of 𝜑 and the blue colored part corresponds to the clauses. A sufficiently long path would have to first “snake” through the black part, for each variable choosing either the “upper path” (corresponding to assigning it the value True) or the “lower path” (corresponding to assigning it the value False). Then to achieve maximum length the path would traverse through the blue part, where to go between two vertices corresponding to a clause such as 𝑥17 ∨ 𝑥32 ∨ 𝑥57, the corresponding vertices would have to have not been traversed before.

Figure 14.14: The graph above with the longest path marked on it; the part of the path corresponding to variables is in green and the part corresponding to the clauses is in pink.

def TSAT2LONGPATH(φ):
    """Reduce 3SAT to LONGPATH"""
    # numvars, getclauses and Graph are helper utilities from the book's accompanying code
    def var(v):  # return variable index and True/False depending if positive or negated
        return (int(v[2:]), False) if v[0] == "¬" else (int(v[1:]), True)
    n = numvars(φ)
    clauses = getclauses(φ)
    m = len(clauses)
    G = Graph()
    G.edge("start","start_0")
    for i in range(n):  # add 2 length-m paths per variable
        G.edge(f"start_{i}",f"v_{i}_{0}_T")
        G.edge(f"start_{i}",f"v_{i}_{0}_F")
        for j in range(m-1):
            G.edge(f"v_{i}_{j}_T",f"v_{i}_{j+1}_T")
            G.edge(f"v_{i}_{j}_F",f"v_{i}_{j+1}_F")
        G.edge(f"v_{i}_{m-1}_T",f"end_{i}")
        G.edge(f"v_{i}_{m-1}_F",f"end_{i}")
        if i<n-1:
            G.edge(f"end_{i}",f"start_{i+1}")
    G.edge(f"end_{n-1}","start_clauses")
    for j,C in enumerate(clauses):  # add gadget for each clause
        for v in enumerate(C):
            i,sign = var(v[1])
            s = "F" if sign else "T"
            G.edge(f"C_{j}_in",f"v_{i}_{j}_{s}")
            G.edge(f"v_{i}_{j}_{s}",f"C_{j}_out")
        if j<m-1:
            G.edge(f"C_{j}_out",f"C_{j+1}_in")
    G.edge("start_clauses","C_0_in")
    G.edge(f"C_{m-1}_out","end")
    return G, 1+n*(m+1)+1+2*m+1

Proof of Theorem 14.12. We build a graph 𝐺 that “snakes” from 𝑠 to 𝑡 as


follows. After 𝑠 we add a sequence of 𝑛 long loops. Each loop has an
“upper path” and a “lower path”. A simple path cannot take both the
upper path and the lower path, and so it will need to take exactly one
of them to reach 𝑠 from 𝑡.
Our intention is that a path in the graph will correspond to an as-
signment 𝑥 ∈ {0, 1}𝑛 in the sense that taking the upper path in the 𝑖𝑡ℎ
loop corresponds to assigning 𝑥𝑖 = 1 and taking the lower path cor-
responds to assigning 𝑥𝑖 = 0. When we are done snaking through all
the 𝑛 loops corresponding to the variables to reach 𝑡 we need to pass
through 𝑚 “obstacles”: for each clause 𝑗 we will have a small gad-
get consisting of a pair of vertices 𝑠𝑗 , 𝑡𝑗 that have three paths between
them. For example, if the 𝑗-th clause had the form 𝑥17 ∨ ¬𝑥55 ∨ 𝑥72 then
one path would go through a vertex in the lower loop corresponding
to 𝑥17 , one path would go through a vertex in the upper loop corre-
sponding to 𝑥55 and the third would go through the lower loop cor-
responding to 𝑥72 . We see that if we went in the first stage according
to a satisfying assignment then we will be able to find a free vertex to
travel from 𝑠𝑗 to 𝑡𝑗 . We link 𝑡1 to 𝑠2 , 𝑡2 to 𝑠3 , etc and link 𝑡𝑚 to 𝑡. Thus
a satisfying assignment would correspond to a path from 𝑠 to 𝑡 that
goes through one path in each loop corresponding to the variables,
and one path in each loop corresponding to the clauses. We can make
the loop corresponding to the variables long enough so that we must
take the entire path in each loop in order to have a fighting chance of
getting a path as long as the one that corresponds to a satisfying assignment. But if we do that, then the only way we are able to reach 𝑡 is
if the paths we took corresponded to a satisfying assignment, since
otherwise we will have one clause 𝑗 where we cannot reach 𝑡𝑗 from 𝑠𝑗
without using a vertex we already used before.

14.8.1 Summary of relations


We have shown that there are a number of functions 𝐹 for which we
can prove a statement of the form “If 𝐹 ∈ P then 3SAT ∈ P”. Hence
coming up with a polynomial-time algorithm for even one of these
problems will entail a polynomial-time algorithm for 3SAT (see for
example Fig. 14.16). In Chapter 15 we will show the inverse direction (“If 3SAT ∈ P then 𝐹 ∈ P”) for these functions, hence allowing us to conclude that they have equivalent complexity to 3SAT.

Figure 14.15: The result of applying the reduction of 3SAT to LONGPATH to the formula (𝑥0 ∨ ¬𝑥3 ∨ 𝑥2) ∧ (¬𝑥0 ∨ 𝑥1 ∨ ¬𝑥2) ∧ (𝑥1 ∨ 𝑥2 ∨ ¬𝑥3).

✓ Chapter Recap

• The computational complexity of many seemingly unrelated computational problems can be related to one another through the use of reductions.
• If 𝐹 ≤𝑝 𝐺 then a polynomial-time algorithm for 𝐺 can be transformed into a polynomial-time algorithm for 𝐹.
• Equivalently, if 𝐹 ≤𝑝 𝐺 and 𝐹 does not have a polynomial-time algorithm then neither does 𝐺.
• We’ve developed many techniques to show that 3SAT ≤𝑝 𝐹 for interesting functions 𝐹. Sometimes we can do so by using transitivity of reductions: if 3SAT ≤𝑝 𝐺 and 𝐺 ≤𝑝 𝐹 then 3SAT ≤𝑝 𝐹.

Figure 14.16: So far we have shown that P ⊆ EXP and that several problems we care about such as 3SAT and MAXCUT are in EXP but it is not known whether or not they are in P. However, since 3SAT ≤𝑝 MAXCUT we can rule out the possibility that MAXCUT ∈ P but 3SAT ∉ P. The relation of P/poly to the class EXP is not known. We know that EXP does not contain P/poly since the latter even contains uncomputable functions, but we do not know whether or not EXP ⊆ P/poly (though it is believed that this is not the case and in particular that both 3SAT and MAXCUT are not in P/poly).

14.9 EXERCISES

14.10 BIBLIOGRAPHICAL NOTES


Several notions of reductions are defined in the literature. The notion defined in Definition 14.1 is often known as a mapping reduction, a many-to-one reduction, or a Karp reduction.
The maximal (as opposed to maximum) independent set is the task of finding a “local maximum” of an independent set: an independent set 𝑆 such that one cannot add a vertex to it without losing the independence property (the complement of such a set is a vertex cover). Unlike finding a maximum independent set, finding a maximal independent set can be done efficiently by a greedy algorithm, but this local maximum can be much smaller than the global maximum.
Reduction of independent set to max cut taken from these notes.
Image of Hamiltonian Path through Dodecahedron by Christoph
Sommer.
We have mentioned that the line between reductions used for algo-
rithm design and showing hardness is sometimes blurry. An excellent
example for this is the area of SAT Solvers (see [Gom+08]). In this
field people use algorithms for SAT (that take exponential time in the
worst case but often are much faster on many instances in practice)
together with reductions of the form 𝐹 ≤𝑝 SAT to derive algorithms
for other functions 𝐹 of interest.
Learning Objectives:
• Introduce the class NP capturing a great many
important computational problems
• NP-completeness: evidence that a problem
might be intractable.
• The P vs NP problem.

15
NP, NP completeness, and the Cook-Levin Theorem

“In this paper we give theorems that suggest, but do not imply, that these
problems, as well as many others, will remain intractable perpetually”, Richard
Karp, 1972

“Sad to say, but it will be many more years, if ever before we really understand
the Mystical Power of Twoness… 2-SAT is easy, 3-SAT is hard, 2-dimensional
matching is easy, 3-dimensional matching is hard. Why? oh, Why?” Eugene
Lawler

So far we have shown that 3SAT is no harder than Quadratic Equa-


tions, Independent Set, Maximum Cut, and Longest Path. But to show
that these problems are computationally equivalent we need to give re-
ductions in the other direction, reducing each one of these problems to
3SAT as well. It turns out we can reduce all these problems to 3SAT in
one fell swoop.
In fact, this result extends far beyond these particular problems. All
of the problems we discussed in Chapter 14, and a great many other
problems, share the same commonality: they are all search problems,
where the goal is to decide, given an instance 𝑥, whether there exists
a solution 𝑦 that satisfies some condition that can be verified in poly-
nomial time. For example, in 3SAT, the instance is a formula and the
solution is an assignment to the variables; in Max-Cut the instance is a
graph and the solution is a cut in the graph; and so on and so forth. It
turns out that every such search problem can be reduced to 3SAT.

This chapter: A non-mathy overview


In this chapter we will see the definition of the complexity class NP, one of the most important definitions in this book, and the Cook-Levin Theorem, one of the most important theorems in it. Intuitively, the class NP corresponds to the class of problems where it is easy to verify a solution (i.e., verification can be done by a polynomial-time algorithm). For example, finding a satisfying assignment to a 2SAT or

3SAT formula is such a problem, since if we are given an assignment to the variables of a 2SAT or 3SAT formula then
we can efficiently verify that it satisfies all constraints. More
precisely, NP is the class of decision problems (i.e., Boolean
functions or languages) corresponding to determining the
existence of such a solution, though we will see in Chapter 16
that the decision and search problems are closely related.
As the examples of 2SAT and 3SAT show, there are some
computational problems (i.e., functions) in NP for which we
have a polynomial-time algorithm, and some for which no
such algorithm is known. It is an outstanding open question
whether or not all functions in NP have a polynomial-time
algorithm, or in other words (to use just a little bit of math)
whether or not P = NP. In this chapter we will see that
there are some functions in NP that are in a precise sense
“hardest in all of NP” in the sense that if even one of these
functions has a polynomial-time algorithm then all functions
in NP have such an algorithm. Such functions are known
as NP complete. The Cook-Levin Theorem states that 3SAT
is NP complete. Using a complex web of polynomial-time
reductions, researchers have derived from the Cook-Levin
theorem the NP-completeness of thousands of computa-
tional problems from all areas of mathematics, natural and
social sciences, engineering, and more. These results provide
strong evidence that all of these problems cannot be solved in
the worst case by a polynomial-time algorithm.

Figure 15.1: Overview of the results of this chapter.


We define NP to contain all decision problems for
which a solution can be efficiently verified. The main
result of this chapter is the Cook-Levin Theorem (The-
orem 15.6) which states that 3SAT has a polynomial-
time algorithm if and only if every problem in NP
has a polynomial-time algorithm. Another way to
state this theorem is that 3SAT is NP complete. We
will prove the Cook-Levin theorem by defining the
two intermediate problems NANDSAT and 3NAND,
proving that NANDSAT is NP complete, and then
proving that NANDSAT ≤𝑝 3NAND ≤𝑝 3SAT.

15.1 THE CLASS NP


To make the above precise, we will make the following mathematical
definition. We define the class NP to contain all Boolean functions that
correspond to a search problem of the form above. That is, a Boolean
function 𝐹 is in NP if 𝐹 has the form that on input a string 𝑥, 𝐹 (𝑥) = 1
if and only if there exists a “solution” string 𝑤 such that the pair (𝑥, 𝑤)
satisfies some polynomial-time checkable condition. Formally, NP is
defined as follows:

Definition 15.1 — NP. We say that 𝐹 ∶ {0, 1}∗ → {0, 1} is in NP if there exists some integer 𝑎 > 0 and 𝑉 ∶ {0, 1}∗ → {0, 1} such that 𝑉 ∈ P and for every 𝑥 ∈ {0, 1}^𝑛,

𝐹(𝑥) = 1 ⇔ ∃_{𝑤∈{0,1}^{𝑛^𝑎}} s.t. 𝑉(𝑥𝑤) = 1 .   (15.1)

In other words, for 𝐹 to be in NP, there needs to exist some polynomial-time computable verification function 𝑉, such that if 𝐹(𝑥) = 1 then there must exist 𝑤 (of length polynomial in |𝑥|) such that 𝑉(𝑥𝑤) = 1, and if 𝐹(𝑥) = 0 then for every such 𝑤, 𝑉(𝑥𝑤) = 0. Since the existence of this string 𝑤 certifies that 𝐹(𝑥) = 1, 𝑤 is often referred to as a certificate, witness, or proof that 𝐹(𝑥) = 1.
See also Fig. 15.2 for an illustration of Definition 15.1. The name NP stands for “non-deterministic polynomial time” and is used for historical reasons; see the bibliographical notes. The string 𝑤 in (15.1) is sometimes known as a solution, certificate, or witness for the instance 𝑥.

Figure 15.2: The class NP corresponds to problems where solutions can be efficiently verified. That is, this is the class of functions 𝐹 such that 𝐹(𝑥) = 1 if there is a “solution” 𝑤 of length polynomial in |𝑥| that can be verified by a polynomial-time algorithm 𝑉.
Solved Exercise 15.1 — Alternative definition of NP. Show that the condition that |𝑤| = |𝑥|^𝑎 in Definition 15.1 can be replaced by the condition that |𝑤| ≤ 𝑝(|𝑥|) for some polynomial 𝑝. That is, prove that for every 𝐹 ∶ {0, 1}∗ → {0, 1}, 𝐹 ∈ NP if and only if there is a polynomial-time Turing machine 𝑉 and a polynomial 𝑝 ∶ ℕ → ℕ such that for every 𝑥 ∈ {0, 1}∗, 𝐹(𝑥) = 1 if and only if there exists 𝑤 ∈ {0, 1}∗ with |𝑤| ≤ 𝑝(|𝑥|) such that 𝑉(𝑥, 𝑤) = 1.

Solution:
The “only if” direction (namely that if 𝐹 ∈ NP then there is an algorithm 𝑉 and a polynomial 𝑝 as above) follows immediately from Definition 15.1 by letting 𝑝(𝑛) = 𝑛^𝑎. For the “if” direction, the idea is that if a string 𝑤 is of size at most 𝑝(𝑛) for a degree-𝑑 polynomial 𝑝, then there is some 𝑛0 such that for all 𝑛 > 𝑛0, |𝑤| < 𝑛^{𝑑+1}. Hence we can encode 𝑤 by a string of exactly length 𝑛^{𝑑+1} by padding it with 1 and an appropriate number of zeroes. Hence if there is an algorithm 𝑉 and polynomial 𝑝 as above, then we can define an algorithm 𝑉′ that does the following on input 𝑥, 𝑤′ with |𝑥| = 𝑛 and |𝑤′| = 𝑛^𝑎, where we set 𝑎 = 𝑑 + 1:

• If 𝑛 ≤ 𝑛0 then 𝑉 ′ (𝑥, 𝑤′ ) ignores 𝑤′ and enumerates over all 𝑤


of length at most 𝑝(𝑛) and outputs 1 if there exists 𝑤 such that
𝑉 (𝑥, 𝑤) = 1. (Since 𝑛 < 𝑛0 , this only takes a constant number of
steps.)

• If 𝑛 > 𝑛0 then 𝑉′(𝑥, 𝑤′) “strips out” the padding by dropping all the rightmost zeroes from 𝑤′ until it reaches the first 1 (which it drops as well) and obtains a string 𝑤. If |𝑤| ≤ 𝑝(𝑛) then 𝑉′ outputs 𝑉(𝑥, 𝑤) (and otherwise it outputs 0).

Since 𝑉 runs in polynomial time, 𝑉′ runs in polynomial time as well, and by definition for every 𝑥, there exists 𝑤′ ∈ {0, 1}^{|𝑥|^𝑎} such that 𝑉′(𝑥𝑤′) = 1 if and only if there exists 𝑤 ∈ {0, 1}∗ with |𝑤| ≤ 𝑝(|𝑥|) such that 𝑉(𝑥𝑤) = 1.
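The padding encoding used in the “if” direction can be made concrete in a couple of lines of Python; the following is only an illustration of the encoding (witnesses are represented as strings over {0,1}):

def pad(w, length):
    """Encode a witness w with len(w) < length as a string of exactly `length` bits:
    append a single 1 and then enough zeroes."""
    return w + "1" + "0" * (length - len(w) - 1)

def unpad(wprime):
    """Invert pad: drop the trailing zeroes and then the final 1."""
    return wprime.rstrip("0")[:-1]

For example, pad("101", 8) returns "10110000" and unpad("10110000") recovers "101".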

The definition of NP means that for every 𝐹 ∈ NP and string


𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 1 if and only if there is a short and efficiently
verifiable proof of this fact. That is, we can think of the function 𝑉 in
Definition 15.1 as a verifier algorithm, similar to what we’ve seen in
Section 11.1. The verifier checks whether a given string 𝑤 ∈ {0, 1}∗ is a
valid proof for the statement “𝐹 (𝑥) = 1”. Essentially all proof systems
considered in mathematics involve line-by-line checks that can be car-
ried out in polynomial time. Thus the heart of NP is asking for state-
ments that have short (i.e., polynomial in the size of the statements)
proofs. Indeed, as we will see in Chapter 16, Kurt Gödel phrased the
question of whether NP = P as asking whether “the mental work of
a mathematician [in proving theorems] could be completely replaced
by a machine”.

R
Remark 15.2 — NP not (necessarily) closed under com-
plement. Definition 15.1 is asymmetric in the sense that
there is a difference between an output of 1 and an
output of 0. You should make sure you understand
why this definition does not guarantee that if 𝐹 ∈ NP
then the function 1 − 𝐹 (i.e., the map 𝑥 ↦ 1 − 𝐹 (𝑥)) is
in NP as well.
In fact, it is believed that there do exist functions 𝐹
such that 𝐹 ∈ NP but 1 − 𝐹 ∉ NP. For example, as
shown below, 3SAT ∈ NP, but the complement function 3SAT‾ that
on input a 3CNF formula 𝜑 outputs 1 if and only if 𝜑
is not satisfiable is not known (nor believed) to be in
NP. This is in contrast to the class P which does satisfy


that if 𝐹 ∈ P then 1 − 𝐹 is in P as well.

15.1.1 Examples of functions in NP


We now present some examples of functions that are in the class NP.
We start with the canonical example of the 3SAT function.

■ Example 15.3 — 3𝑆𝐴𝑇 ∈ NP. 3SAT is in NP since for every ℓ-


variable formula 𝜑, 3SAT(𝜑) = 1 if and only if there exists a
satisfying assignment 𝑥 ∈ {0, 1}ℓ such that 𝜑(𝑥) = 1, and we
can check this condition in polynomial time.
The above reasoning explains why 3SAT is in NP, but since this
is our first example, we will now belabor the point and expand out
in full formality the precise representation of the witness 𝑤 and the
algorithm 𝑉 that demonstrate that 3SAT is in NP. Since demon-
strating that functions are in NP is fairly straightforward, in future
cases we will not use as much detail, and the reader can also feel
free to skip the rest of this example.
Using Solved Exercise 15.1, it is OK if the witness is of size at most
polynomial in the input length 𝑛, rather than of precisely size 𝑛𝑎
for some integer 𝑎 > 0. Specifically, we can represent a 3CNF
formula 𝜑 with 𝑘 variables and 𝑚 clauses as a string of length
𝑛 = 𝑂(𝑚 log 𝑘), since every one of the 𝑚 clauses involves three variables or their negations, and the identity of each variable can be represented using ⌈log₂ 𝑘⌉ bits. We assume that every variable par-
ticipates in some clause (as otherwise it can be ignored) and hence
that 𝑚 ≥ 𝑘, which in particular means that the input length 𝑛 is at
least as large as 𝑚 and 𝑘.
We can represent an assignment to the 𝑘 variables using a 𝑘-
length string 𝑤. The following algorithm checks whether a given 𝑤
satisfies the formula 𝜑:
Algorithm 15.4 — Verifier for 3𝑆𝐴𝑇.
Input: 3CNF formula 𝜑 on 𝑘 variables and with 𝑚 clauses, string 𝑤 ∈ {0, 1}^𝑘
Output: 1 iff 𝑤 satisfies 𝜑
1: for 𝑗 ∈ [𝑚] do
2:   Let ℓ1 ∨ ℓ2 ∨ ℓ3 be the 𝑗-th clause of 𝜑
3:   if 𝑤 violates all three literals then
4:     return 0
5:   end if
6: end for
7: return 1

Algorithm 15.4 takes 𝑂(𝑚) time to enumerate over all clauses, and will return 1 if and only if 𝑤 satisfies all the clauses.
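For concreteness, here is one way Algorithm 15.4 could be rendered in Python. The representation of the formula as a list of clauses, each a list of (variable index, sign) pairs, is an assumption made purely for this illustration.

def verify_3sat(clauses, w):
    """Return 1 iff the assignment w (a list of 0/1 values) satisfies every clause.
    Each clause is a list of up to three pairs (i, positive), standing for the
    literal x_i if positive is True and for its negation otherwise."""
    for clause in clauses:
        if not any(w[i] == (1 if positive else 0) for (i, positive) in clause):
            return 0                    # w violates all literals of this clause
    return 1

For example, verify_3sat([[(0, True), (1, False)]], [0, 0]) returns 1, since the assignment satisfies the clause through its second literal.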

Here are some more examples for problems in NP. For each one
of these problems we merely sketch how the witness is represented
and why it is efficiently checkable, but working out the details can be a
good way to get more comfortable with Definition 15.1:

• QUADEQ is in NP since for every ℓ-variable instance of quadratic


equations 𝐸, QUADEQ(𝐸) = 1 if and only if there exists an assign-
ment 𝑥 ∈ {0, 1}ℓ that satisfies 𝐸. We can check the condition that
𝑥 satisfies 𝐸 in polynomial time by enumerating over all the equa-
tions in 𝐸, and for each such equation 𝑒, plug in the values of 𝑥 and
verify that 𝑒 is satisfied.

• ISET is in NP since for every graph 𝐺 and integer 𝑘, ISET(𝐺, 𝑘) =


1 if and only if there exists a set 𝑆 of 𝑘 vertices that contains no
pair of neighbors in 𝐺. We can check the condition that 𝑆 is an
independent set of size ≥ 𝑘 in polynomial time by first checking
that |𝑆| ≥ 𝑘 and then enumerating over all edges {𝑢, 𝑣} in 𝐺, and
for each such edge verify that either 𝑢 ∉ 𝑆 or 𝑣 ∉ 𝑆.

• LONGPATH is in NP since for every graph 𝐺 and integer 𝑘,


LONGPATH(𝐺, 𝑘) = 1 if and only if there exists a simple path 𝑃
in 𝐺 that is of length at least 𝑘. We can check the condition that 𝑃
is a simple path of length 𝑘 in polynomial time by checking that it
has the form (𝑣0 , 𝑣1 , … , 𝑣𝑘 ) where each 𝑣𝑖 is a vertex in 𝐺, no 𝑣𝑖 is
repeated, and for every 𝑖 ∈ [𝑘], the edge {𝑣𝑖 , 𝑣𝑖+1 } is present in the
graph.

• MAXCUT is in NP since for every graph 𝐺 and integer 𝑘, MAXCUT(𝐺, 𝑘) = 1 if and only if there exists a cut (𝑆, 𝑆̄) in 𝐺 (where 𝑆̄ denotes the complement of 𝑆) that cuts at least 𝑘 edges. We can check the condition that (𝑆, 𝑆̄) is a cut of value at least 𝑘 in polynomial time by checking that 𝑆 is a subset of 𝐺’s vertices and enumerating over all the edges {𝑢, 𝑣} of 𝐺, counting those edges such that 𝑢 ∈ 𝑆 and 𝑣 ∉ 𝑆 or vice versa; a Python sketch of this verifier appears right after this list.
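As promised, here is a hedged sketch of that last verifier; the graph is assumed to be given simply as a list of edges (pairs of vertices), an illustrative choice rather than a fixed representation.

def verify_maxcut(edges, k, S):
    """Return 1 iff the vertex set S (a Python set) cuts at least k of the given edges,
    i.e. iff the cut between S and its complement has value at least k."""
    cut = sum(1 for (u, v) in edges if (u in S) != (v in S))   # edges crossing the cut
    return 1 if cut >= k else 0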

15.1.2 Basic facts about NP


The definition of NP is one of the most important definitions of this
book, and is worth taking the time to digest and internalize. The
following solved exercises establish some basic properties of this class.
As usual, I highly recommend that you try to work out the solutions
yourself.
Solved Exercise 15.2 — Verifying is no harder than solving. Prove that P ⊆ NP.

Solution:
Suppose that 𝐹 ∈ P. Define the following function 𝑉: 𝑉(𝑥0^𝑛) = 1 iff 𝑛 = |𝑥| and 𝐹(𝑥) = 1. (𝑉 outputs 0 on all other inputs.) Since 𝐹 ∈ P we can clearly compute 𝑉 in polynomial time as well.
Let 𝑥 ∈ {0, 1}^𝑛 be some string. If 𝐹(𝑥) = 1 then 𝑉(𝑥0^𝑛) = 1. On the other hand, if 𝐹(𝑥) = 0 then for every 𝑤 ∈ {0, 1}^𝑛, 𝑉(𝑥𝑤) = 0. Therefore, setting 𝑎 = 1 (i.e., 𝑤 ∈ {0, 1}^{𝑛^1} = {0, 1}^𝑛), we see that 𝑉 satisfies (15.1), and establishes that 𝐹 ∈ NP.


R
Remark 15.5 — NP does not mean non-polynomial!.
People sometimes think that NP stands for “non-
polynomial time”. As Solved Exercise 15.2 shows, this
is far from the truth, and in fact every polynomial-
time computable function is in NP as well.
If 𝐹 is in NP it certainly does not mean that 𝐹 is hard
to compute (though it does not, as far as we know,
necessarily mean that it’s easy to compute either).
Rather, it means that 𝐹 is easy to verify, in the technical
sense of Definition 15.1.

Solved Exercise 15.3 — NP is in exponential time. Prove that NP ⊆ EXP.


Solution:
Suppose that 𝐹 ∈ NP and let 𝑉 be the polynomial-time computable function that satisfies (15.1) and 𝑎 the corresponding constant. Then given every 𝑥 ∈ {0, 1}^𝑛, we can check whether 𝐹(𝑥) = 1 in time 𝑝𝑜𝑙𝑦(𝑛) ⋅ 2^{𝑛^𝑎} = 𝑜(2^{𝑛^{𝑎+1}}) by enumerating over all the 2^{𝑛^𝑎} strings 𝑤 ∈ {0, 1}^{𝑛^𝑎} and checking whether 𝑉(𝑥𝑤) = 1, in which case we return 1. If 𝑉(𝑥𝑤) = 0 for every such 𝑤 then we return 0. By construction, the algorithm above will run in time at most exponential in its input length and by the definition of NP it will return 𝐹(𝑥) for every 𝑥.
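The enumeration in this solution is short enough to write out. The sketch below assumes the verifier 𝑉 is given as a Python function taking the concatenation 𝑥𝑤 as a single string, mirroring (15.1); it is an illustration of the argument, not an algorithm one would run in practice.

from itertools import product

def decide_by_brute_force(x, V, a):
    """Decide F(x) by trying all 2^(n^a) candidate witnesses w of length n^a
    and running the polynomial-time verifier V on x concatenated with w."""
    n = len(x)
    for bits in product("01", repeat=n ** a):
        w = "".join(bits)
        if V(x + w) == 1:
            return 1
    return 0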

Solved Exercise 15.2 and Solved Exercise 15.3 together imply that

P ⊆ NP ⊆ EXP .

The time hierarchy theorem (Theorem 13.9) implies that P ⊊ EXP


and hence at least one of the two inclusions P ⊆ NP or NP ⊆ EXP
is strict. It is believed that both of them are in fact strict inclusions.
That is, it is believed that there are functions in NP that cannot be
computed in polynomial time (this is the P ≠ NP conjecture) and
that there are functions 𝐹 in EXP for which we cannot even effi-
ciently certify that 𝐹 (𝑥) = 1 for a given input 𝑥. One function 𝐹
that is believed to lie in EXP ⧵ NP is the complement function 3SAT‾ defined as 3SAT‾(𝜑) = 1 − 3SAT(𝜑) for every 3CNF formula 𝜑. The conjecture that 3SAT‾ ∉ NP is known as the “NP ≠ co-NP” conjecture. It
implies the P ≠ NP conjecture (see Exercise 15.2).
We have previously informally equated the notion of 𝐹 ≤𝑝 𝐺 with
𝐹 being “no harder than 𝐺” and in particular have seen in Solved
Exercise 14.1 that if 𝐺 ∈ P and 𝐹 ≤𝑝 𝐺, then 𝐹 ∈ P as well. The
following exercise shows that if 𝐹 ≤𝑝 𝐺 then it is also “no harder to
verify” than 𝐺. That is, regardless of whether or not it is in P, if 𝐺 has
the property that solutions to it can be efficiently verified, then so does
𝐹.
Solved Exercise 15.4 — Reductions and NP. Let 𝐹, 𝐺 ∶ {0, 1}∗ → {0, 1}. Show that if 𝐹 ≤𝑝 𝐺 and 𝐺 ∈ NP then 𝐹 ∈ NP.

Solution:
Suppose that 𝐺 is in NP and in particular there exists 𝑎 and 𝑉 ∈ P such that for every 𝑦 ∈ {0, 1}∗, 𝐺(𝑦) = 1 ⇔ ∃_{𝑤∈{0,1}^{|𝑦|^𝑎}} 𝑉(𝑦𝑤) = 1. Suppose also that 𝐹 ≤𝑝 𝐺 and so in particular there is an 𝑛^𝑏-time computable function 𝑅 such that 𝐹(𝑥) = 𝐺(𝑅(𝑥)) for all 𝑥 ∈ {0, 1}∗. Define 𝑉′ to be a Turing machine that on input a pair (𝑥, 𝑤) computes 𝑦 = 𝑅(𝑥) and returns 1 if and only if |𝑤| = |𝑦|^𝑎 and 𝑉(𝑦𝑤) = 1. Then 𝑉′ runs in polynomial time, and for every 𝑥 ∈ {0, 1}∗, 𝐹(𝑥) = 1 iff there exists 𝑤 of size |𝑅(𝑥)|^𝑎, which is at most polynomial in |𝑥|, such that 𝑉′(𝑥, 𝑤) = 1, hence demonstrating that 𝐹 ∈ NP (using the characterization of Solved Exercise 15.1).
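The construction of 𝑉′ in this solution is essentially function composition. The following minimal sketch illustrates it, assuming 𝑅 and the verifier for 𝐺 are given as Python functions and ignoring the bookkeeping of witness lengths (which is handled as in Solved Exercise 15.1).

def make_verifier_for_F(R, V_G):
    """Given a reduction R witnessing F <=p G and a verifier V_G for G, return a
    verifier for F: a witness for x (with respect to F) is a witness for R(x)."""
    def V_F(x, w):
        return V_G(R(x), w)
    return V_F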


15.2 FROM NP TO 3SAT: THE COOK-LEVIN THEOREM


We have seen several examples of problems for which we do not know
if their best algorithm is polynomial or exponential, but we can show
that they are in NP. That is, we don’t know if they are easy to solve, but
we do know that it is easy to verify a given solution. There are many,
many, many more examples of interesting functions we would like to
compute that are easily shown to be in NP. What is quite amazing is
that if we can solve 3SAT then we can solve all of them!
The following is one of the most fundamental theorems in Com-
puter Science:

Theorem 15.6 — Cook-Levin Theorem. For every 𝐹 ∈ NP, 𝐹 ≤𝑝 3SAT.

We will soon show the proof of Theorem 15.6, but note that it im-
mediately implies that QUADEQ, LONGPATH, and MAXCUT all
reduce to 3SAT. Combining it with the reductions we’ve seen in Chap-
ter 14, it implies that all these problems are equivalent! For example,
to reduce QUADEQ to LONGPATH, we can first reduce QUADEQ to
3SAT using Theorem 15.6 and use the reduction we’ve seen in Theo-
rem 14.12 from 3SAT to LONGPATH. That is, since QUADEQ ∈ NP,
Theorem 15.6 implies that QUADEQ ≤𝑝 3SAT, and Theorem 14.12
implies that 3SAT ≤𝑝 LONGPATH, which by the transitivity of reduc-
tions (Solved Exercise 14.2) means that QUADEQ ≤𝑝 LONGPATH.
Similarly, since LONGPATH ∈ NP, we can use Theorem 15.6 and
Theorem 14.4 to show that LONGPATH ≤𝑝 3SAT ≤𝑝 QUADEQ,
concluding that LONGPATH and QUADEQ are computationally
equivalent.
There is of course nothing special about QUADEQ and LONGPATH
here: by combining Theorem 15.6 with the reductions we saw, we see that just
like 3SAT, every 𝐹 ∈ NP reduces to LONGPATH, and the same is true
for QUADEQ and MAXCUT. All these problems are in some sense
“the hardest in NP” since an efficient algorithm for any one of them
would imply an efficient algorithm for all the problems in NP. This
motivates the following definition:

Definition 15.7 — NP-hardness and NP-completeness. Let 𝐺 ∶ {0, 1}∗ → {0, 1}. We say that 𝐺 is NP hard if for every 𝐹 ∈ NP, 𝐹 ≤𝑝 𝐺. We say that 𝐺 is NP complete if 𝐺 is NP hard and 𝐺 ∈ NP.

The Cook-Levin Theorem (Theorem 15.6) can be rephrased as


saying that 3SAT is NP hard, and since it is also in NP, this means that
3SAT is NP complete. Together with the reductions of Chapter 14,
Theorem 15.6 shows that despite their superficial differences, 3SAT,
quadratic equations, longest path, independent set, and maximum
cut, are all NP-complete. Many thousands of additional problems


have been shown to be NP-complete, arising from all the sciences,
mathematics, economics, engineering and many other fields. (For a
few examples, see this Wikipedia page and this website.)

 Big Idea 22 If a single NP-complete problem has a polynomial-time algo-


rithm, then there is such an algorithm for every decision problem that
corresponds to the existence of an efficiently-verifiable solution.

15.2.1 What does this mean?


As we’ve seen in Solved Exercise 15.2, P ⊆ NP. The most famous conjecture in Computer Science is that this containment is strict. That is, it is widely conjectured that P ≠ NP. One way to refute the conjecture that P ≠ NP is to give a polynomial-time algorithm for even a single one of the NP-complete problems such as 3SAT, Max Cut, or the thousands of others that have been studied in all fields of human endeavors. The fact that these problems have been studied by so many people, and yet not a single polynomial-time algorithm for any of them has been found, supports the conjecture that indeed P ≠ NP. In fact, for many of these problems (including all the ones we mentioned above), we don’t even know of a 2^{𝑜(𝑛)}-time algorithm! However, to the frustration of computer scientists, we have not yet been able to prove that P ≠ NP or even rule out the existence of an 𝑂(𝑛)-time algorithm for 3SAT. Resolving whether or not P = NP is known as the P vs NP problem. A million-dollar prize has been offered for the solution of this problem, a popular book has been written, and every year a new paper comes out claiming a proof of P = NP or P ≠ NP, only to wither under scrutiny.

Figure 15.3: The world if P ≠ NP (left) and P = NP (right). In the former case the set of NP-complete problems is disjoint from P and Ladner’s theorem shows that there exist problems that are neither in P nor are NP-complete. (There are remarkably few natural candidates for such problems, with some prominent examples being decision variants of problems such as integer factoring, lattice shortest vector, and finding Nash equilibria.) In the latter case that P = NP the notion of NP-completeness loses its meaning, as essentially all functions in P (save for the trivial constant zero and constant one functions) are NP-complete.

One of the mysteries of computation is that people have observed a certain empirical “zero-one law” or “dichotomy” in the computational complexity of natural problems, in the sense that many natural problems are either in P (often in TIME(𝑂(𝑛)) or TIME(𝑂(𝑛^2))), or they are NP hard. This is related to the fact that for most natural problems, the best known algorithm is either exponential or polynomial, with not too many examples where the best running time is some strange intermediate complexity such as 2^{2^{√log 𝑛}}. However, it is believed that there exist problems in NP that are neither in P nor are NP-complete, and in fact a result known as “Ladner’s Theorem” shows that if P ≠ NP then this is indeed the case (see also Exercise 15.1 and Fig. 15.3).

Figure 15.4: A rough illustration of the (conjectured) status of problems in exponential time. Darker colors correspond to higher running time, and the circle in the middle is the problems in P. NP is a (conjectured to be proper) superclass of P and the NP-complete problems (or NPC for short) are the “hardest” problems in NP, in the sense that a solution for one of them implies a solution for all other problems in NP. It is conjectured that all the NP-complete problems require at least exp(𝑛^𝜖) time to solve for a constant 𝜖 > 0, and many require exp(Ω(𝑛)) time. The permanent is not believed to be contained in NP though it is NP-hard, which means that a polynomial-time algorithm for it implies that P = NP.

15.2.2 The Cook-Levin Theorem: Proof outline


We will now prove the Cook-Levin Theorem, which is the underpin-
ning to a great web of reductions from 3SAT to thousands of problems
across many great fields. Some problems that have been shown to be
NP-complete include: minimum-energy protein folding, minimum
surface-area foam configuration, map coloring, optimal Nash equi-
librium, quantum state entanglement, minimum supersequence of
a genome, minimum codeword problem, shortest vector in a lattice,
minimum genus knots, positive Diophantine equations, integer pro-
gramming, and many many more. The worst-case complexity of all
these problems is (up to polynomial factors) equivalent to that of
3SAT, and through the Cook-Levin Theorem, to all problems in NP.
To prove Theorem 15.6 we need to show that 𝐹 ≤𝑝 3SAT for every
𝐹 ∈ NP. We will do so in three stages. We define two intermediate
problems: NANDSAT and 3NAND. We will shortly show the def-
initions of these two problems, but Theorem 15.6 will follow from
combining the following three results:

1. NANDSAT is NP hard (Lemma 15.8).

2. NANDSAT ≤𝑝 3NAND (Lemma 15.9).

3. 3NAND ≤𝑝 3SAT (Lemma 15.10).

By the transitivity of reductions, it will follow that for every 𝐹 ∈


NP,

𝐹 ≤𝑝 NANDSAT ≤𝑝 3NAND ≤𝑝 3SAT


hence establishing Theorem 15.6.
We will prove these three results Lemma 15.8, Lemma 15.9 and
Lemma 15.10 one by one, providing the requisite definitions as we go
along.

15.3 THE NANDSAT PROBLEM, AND WHY IT IS NP HARD


The function NANDSAT ∶ {0, 1}∗ → {0, 1} is defined as follows:

• The input to NANDSAT is a string 𝑄 representing a NAND-CIRC


program (or equivalently, a circuit with NAND gates).

• The output of NANDSAT on input 𝑄 is 1 if and only if there exists a


string 𝑤 ∈ {0, 1}𝑛 (where 𝑛 is the number of inputs to 𝑄) such that
𝑄(𝑤) = 1.

Solved Exercise 15.5 — 𝑁 𝐴𝑁 𝐷𝑆𝐴𝑇 ∈ NP. Prove that NANDSAT ∈ NP.




Solution:
We have seen that the circuit (or straightline program) evalua-
tion problem can be computed in polynomial time. Specifically,
given a NAND-CIRC program 𝑄 of 𝑠 lines and 𝑛 inputs, and
𝑤 ∈ {0, 1}𝑛 , we can evaluate 𝑄 on the input 𝑤 in time which is
polynomial in 𝑠 and hence verify whether or not 𝑄(𝑤) = 1.
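To make the verification step concrete, here is a hedged evaluator sketch. It assumes, purely for illustration, that a NAND-CIRC program is represented as a list of triples (target, a, b), each standing for the line target = NAND(a, b), with input variables named "X[0]", "X[1]", … and the output variable named "Y[0]".

def eval_nand_circ(Q, w):
    """Evaluate a NAND-CIRC program Q (a list of (target, a, b) triples) on the
    input bits w, returning the final value of the output variable Y[0]."""
    val = {f"X[{i}]": w[i] for i in range(len(w))}   # input variables
    for (target, a, b) in Q:
        val[target] = 1 - (val[a] & val[b])          # NAND of the two operands
    return val["Y[0]"]

For instance, eval_nand_circ([("Y[0]", "X[0]", "X[1]")], [1, 1]) returns 0.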

We now prove that NANDSAT is NP hard.


Lemma 15.8 NANDSAT is NP hard.

Proof Idea:
The proof closely follows the proof that P ⊆ P/poly (Theorem 13.12, see also Section 13.6.2). Specifically, if 𝐹 ∈ NP then there is a polynomial time Turing machine 𝑀 and positive integer 𝑎 such that for every 𝑥 ∈ {0, 1}^𝑛, 𝐹(𝑥) = 1 iff there is some 𝑤 ∈ {0, 1}^{𝑛^𝑎} such that 𝑀(𝑥𝑤) = 1. The proof that P ⊆ P/poly gave us a way (via “unrolling the loop”) to come up in polynomial time with a Boolean circuit 𝐶 on 𝑛^𝑎 inputs that computes the function 𝑤 ↦ 𝑀(𝑥𝑤). We can then translate 𝐶 into an equivalent NAND circuit (or NAND-CIRC program) 𝑄. We see that there is a string 𝑤 ∈ {0, 1}^{𝑛^𝑎} such that 𝑄(𝑤) = 1 if and only if there is such 𝑤 satisfying 𝑀(𝑥𝑤) = 1, which (by definition) happens if and only if 𝐹(𝑥) = 1. Hence the translation of 𝑥 into the circuit 𝑄 is a reduction showing 𝐹 ≤𝑝 NANDSAT.

P
The proof is a little bit technical but ultimately follows
quite directly from the definition of NP, as well as the
ability to “unroll the loop” of NAND-TM programs as
discussed in Section 13.6.2. If you find it confusing, try
to pause here and think how you would implement
in your favorite programming language the function
unroll which on input a NAND-TM program 𝑃
and numbers 𝑇 , 𝑛 outputs an 𝑛-input NAND-CIRC
program 𝑄 of 𝑂(|𝑇 |) lines such that for every input
𝑧 ∈ {0, 1}𝑛 , if 𝑃 halts on 𝑧 within at most 𝑇 steps and
outputs 𝑦, then 𝑄(𝑧) = 𝑦.

Proof of Lemma 15.8. Let 𝐹 ∈ NP. To prove Lemma 15.8 we need to


give a polynomial-time computable function that will map every 𝑥∗ ∈
{0, 1}∗ to a NAND-CIRC program 𝑄 such that 𝐹 (𝑥∗ ) = NANDSAT(𝑄).
Let 𝑥∗ ∈ {0, 1}∗ be such a string and let 𝑛 = |𝑥∗ | be its length. By
Definition 15.1 there exists 𝑉 ∈ P and positive 𝑎 ∈ ℕ such that 𝐹 (𝑥∗ ) =
1 if and only if there exists 𝑤 ∈ {0, 1}^{𝑛^𝑎} satisfying 𝑉(𝑥∗𝑤) = 1.

Let 𝑚 = 𝑛^𝑎. Since 𝑉 ∈ P there is some NAND-TM program 𝑃∗ that computes 𝑉 on inputs of the form 𝑥𝑤 with 𝑥 ∈ {0, 1}^𝑛 and 𝑤 ∈ {0, 1}^𝑚 in at most (𝑛 + 𝑚)^𝑐 time for some constant 𝑐. Using our “unrolling the loop NAND-TM to NAND compiler” of Theorem 13.14, we can obtain a NAND-CIRC program 𝑄′ that has 𝑛 + 𝑚 inputs and at most 𝑂((𝑛 + 𝑚)^{2𝑐}) lines such that 𝑄′(𝑥𝑤) = 𝑃∗(𝑥𝑤) for every 𝑥 ∈ {0, 1}^𝑛 and 𝑤 ∈ {0, 1}^𝑚.
We can then use a simple “hardwiring” technique, reminiscent of
Remark 9.11 to map 𝑄′ into a circuit/NAND-CIRC program 𝑄 on 𝑚
inputs such that 𝑄(𝑤) = 𝑄′ (𝑥∗ 𝑤) for every 𝑤 ∈ {0, 1}𝑚 .
CLAIM: There is a polynomial-time algorithm that on input a
NAND-CIRC program 𝑄′ on 𝑛 + 𝑚 inputs and 𝑥∗ ∈ {0, 1}𝑛 , outputs
a NAND-CIRC program 𝑄 such that for every 𝑤 ∈ {0, 1}^𝑚, 𝑄(𝑤) =
𝑄′ (𝑥∗ 𝑤).
PROOF OF CLAIM: We can do so by adding a few lines to ensure
that the variables zero and one are 0 and 1 respectively, and then
simply replacing any reference in 𝑄′ to an input 𝑥𝑖 with 𝑖 ∈ [𝑛] by the corresponding value based on 𝑥∗𝑖. See Fig. 15.5 for an implementation
of this reduction in Python.
Our final reduction maps an input 𝑥∗ into the NAND-CIRC pro-
gram 𝑄 obtained above. By the above discussion, this reduction runs
in polynomial time. Since we know that 𝐹 (𝑥∗ ) = 1 if and only if there
exists 𝑤 ∈ {0, 1}𝑚 such that 𝑃 ∗ (𝑥∗ 𝑤) = 1, this means that 𝐹 (𝑥∗ ) = 1 if
and only if NANDSAT(𝑄) = 1, which is what we wanted to prove.

Figure 15.5: Given a 𝑇-line NAND-CIRC program


𝑄 that has 𝑛 + 𝑚 inputs and some 𝑥∗ ∈ {0, 1}𝑛 ,
we can transform 𝑄 into a 𝑇 + 3 line NAND-CIRC
program 𝑄′ that computes the map 𝑤 ↦ 𝑄(𝑥∗ 𝑤)
for 𝑤 ∈ {0, 1}𝑚 by simply adding code to compute
the zero and one constants, replacing all references to
X[𝑖] with either zero or one depending on the value
of 𝑥∗𝑖 , and then replacing the remaining references
to X[𝑗] with X[𝑗 − 𝑛]. Above is Python code that
implements this transformation, as well as an example
of its execution on a simple program.
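Since the listing of Fig. 15.5 is not reproduced here, the following is a hedged sketch of the “hardwiring” step of the claim, operating on the program’s source text. The variable names X[i], zero and one follow the text’s conventions; everything else (the function name, the use of regular expressions) is an illustrative choice.

import re

def hardwire(Q_source, xstar):
    """Given the source of a NAND-CIRC program with n+m inputs (n = len(xstar)),
    return the source of a program on m inputs computing w -> Q(x* w): prepend
    lines computing the constants one and zero, replace X[i] for i < n by the
    appropriate constant, and shift the remaining input indices down by n."""
    n = len(xstar)
    header = ["one = NAND(X[0],NAND(X[0],X[0]))",   # NAND(a, NOT(a)) = 1 for any bit a
              "zero = NAND(one,one)"]               # NAND(1,1) = 0
    def fix(match):
        i = int(match.group(1))
        if i < n:                                   # hardwired input: plug in x*_i
            return "one" if xstar[i] == 1 else "zero"
        return f"X[{i - n}]"                        # remaining input: shift index down
    body = [re.sub(r"X\[(\d+)\]", fix, line) for line in Q_source.splitlines()]
    return "\n".join(header + body)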

15.4 THE 3NAND PROBLEM


The 3NAND problem is defined as follows:

• The input is a logical formula Ψ on a set of variables 𝑧0 , … , 𝑧𝑟−1


which is an AND of constraints of the form 𝑧𝑖 = NAND(𝑧𝑗 , 𝑧𝑘 ).

• The output is 1 if and only if there is an input 𝑧 ∈ {0, 1}𝑟 that


satisfies all of the constraints.

For example, the following is a 3NAND formula with 5 variables


and 3 constraints:

Ψ = (𝑧3 = NAND(𝑧0 , 𝑧2 ))∧(𝑧1 = NAND(𝑧0 , 𝑧2 ))∧(𝑧4 = NAND(𝑧3 , 𝑧1 )) .

In this case 3NAND(Ψ) = 1 since the assignment 𝑧 = 01010 satisfies


it. Given a 3NAND formula Ψ on 𝑟 variables and an assignment 𝑧 ∈
{0, 1}𝑟 , we can check in polynomial time whether Ψ(𝑧) = 1, and hence
3NAND ∈ NP. We now prove that 3NAND is NP hard:
Lemma 15.9 NANDSAT ≤𝑝 3NAND.

Proof Idea:
To prove Lemma 15.9 we need to give a polynomial-time map from
every NAND-CIRC program 𝑄 to a 3NAND formula Ψ such that there
exists 𝑤 such that 𝑄(𝑤) = 1 if and only if there exists 𝑧 satisfying Ψ.
For every line 𝑖 of 𝑄, we define a corresponding variable 𝑧𝑖 of Ψ. If
the line 𝑖 has the form foo = NAND(bar,blah) then we will add the
clause 𝑧𝑖 = NAND(𝑧𝑗 , 𝑧𝑘 ) where 𝑗 and 𝑘 are the last lines in which bar
and blah were written to. We will also set variables corresponding
to the input variables, as well as add a clause to ensure that the final
output is 1. The resulting reduction can be implemented in about a
dozen lines of Python, see Fig. 15.6.

Figure 15.6: Python code to reduce an instance 𝑄 of


NANDSAT to an instance Ψ of 3NAND. In the exam-
ple above we transform the NAND-CIRC program
xor5 which has 5 input variables and 16 lines, into
a 3NAND formula Ψ that has 24 variables and 20
clauses. Since xor5 outputs 1 on the input 1, 0, 0, 1, 1,
there exists an assignment 𝑧 ∈ {0, 1}24 to Ψ such that
(𝑧0 , 𝑧1 , 𝑧2 , 𝑧3 , 𝑧4 ) = (1, 0, 0, 1, 1) and Ψ evaluates to
true on 𝑧.
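The code in Fig. 15.6 is likewise not reproduced in the text. The sketch below is an illustrative version of the reduction using the same assumed list-of-triples representation as the evaluator sketch in Section 15.3, and it simplifies matters by assuming that every variable (including zero) is written to before it is read.

def nandsat_to_3nand(Q, n):
    """Map a NAND-CIRC program Q with n inputs (a list of m triples (target, a, b))
    to a list of 3NAND constraints (i, j, k), each meaning z_i = NAND(z_j, z_k),
    over the n+m variables z_0,...,z_{n+m-1}, following the proof of Lemma 15.9."""
    last = {f"X[{i}]": i for i in range(n)}     # inputs correspond to z_0,...,z_{n-1}
    constraints = []
    for t, (target, a, b) in enumerate(Q):
        ell = n + t                             # line t corresponds to variable z_{n+t}
        constraints.append((ell, last[a], last[b]))
        last[target] = ell                      # z_ell now holds this variable's value
    zero = last["zero"]                         # index of the line computing the constant zero
    constraints.append((last["Y[0]"], zero, zero))   # force the output to be NAND(0,0) = 1
    return constraints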

Proof of Lemma 15.9. To prove Lemma 15.9 we need to give a reduction


from NANDSAT to 3NAND. Let 𝑄 be a NAND-CIRC program with
𝑛 inputs, one output, and 𝑚 lines. We can assume without loss of
generality that 𝑄 contains the variables one and zero as usual.
We map 𝑄 to a 3NAND formula Ψ as follows:

• Ψ has 𝑚 + 𝑛 variables 𝑧0 , … , 𝑧𝑚+𝑛−1 .

• The first 𝑛 variables 𝑧0 , … , 𝑧𝑛−1 will corresponds to the inputs of 𝑄.


The next 𝑚 variables 𝑧𝑛 , … , 𝑧𝑛+𝑚−1 will correspond to the 𝑚 lines
of 𝑄.

• For every ℓ ∈ {𝑛, 𝑛 + 1, … , 𝑛 + 𝑚}, if the ℓ − 𝑛-th line of the program


𝑄 is foo = NAND(bar,blah) then we add to Ψ the constraint 𝑧ℓ =
NAND(𝑧𝑗 , 𝑧𝑘 ) where 𝑗 − 𝑛 and 𝑘 − 𝑛 correspond to the last lines
in which the variables bar and blah (respectively) were written to.
If one or both of bar and blah was not written to before then we
use 𝑧ℓ0 instead of the corresponding value 𝑧𝑗 or 𝑧𝑘 in the constraint,
where ℓ0 − 𝑛 is the line in which zero is assigned a value. If one or
both of bar and blah is an input variable X[i] then we use 𝑧𝑖 in the
constraint.

• Let ℓ∗ be the last line in which the output y_0 is assigned a value.
Then we add the constraint 𝑧ℓ∗ = NAND(𝑧ℓ0 , 𝑧ℓ0 ) where ℓ0 − 𝑛 is as
above the last line in which zero is assigned a value. Note that this
is effectively the constraint 𝑧ℓ∗ = NAND(0, 0) = 1.

To complete the proof we need to show that there exists 𝑤 ∈ {0, 1}𝑛
s.t. 𝑄(𝑤) = 1 if and only if there exists 𝑧 ∈ {0, 1}𝑛+𝑚 that satisfies all
constraints in Ψ. We now show both sides of this equivalence.
Part I: Completeness. Suppose that there is 𝑤 ∈ {0, 1}𝑛 s.t. 𝑄(𝑤) =
1. Let 𝑧 ∈ {0, 1}𝑛+𝑚 be defined as follows: for 𝑖 ∈ [𝑛], 𝑧𝑖 = 𝑤𝑖 and
for 𝑖 ∈ {𝑛, 𝑛 + 1, … , 𝑛 + 𝑚} 𝑧𝑖 equals the value that is assigned in
the (𝑖 − 𝑛)-th line of 𝑄 when executed on 𝑤. Then by construction
𝑧 satisfies all of the constraints of Ψ (including the constraint that
𝑧ℓ∗ = NAND(0, 0) = 1 since 𝑄(𝑤) = 1.)
Part II: Soundness. Suppose that there exists 𝑧 ∈ {0, 1}𝑛+𝑚 satisfy-
ing Ψ. Soundness will follow by showing that 𝑄(𝑧0 , … , 𝑧𝑛−1 ) = 1 (and
hence in particular there exists 𝑤 ∈ {0, 1}𝑛 , namely 𝑤 = 𝑧0 ⋯ 𝑧𝑛−1 ,
such that 𝑄(𝑤) = 1). To do this we will prove the following claim
(∗): for every ℓ ∈ [𝑚], 𝑧ℓ+𝑛 equals the value assigned in the ℓ-th step
of the execution of the program 𝑄 on 𝑧0 , … , 𝑧𝑛−1 . Note that because 𝑧
satisfies the constraints of Ψ, (∗) is sufficient to prove the soundness
condition since these constraints imply that the last value assigned
to the variable y_0 in the execution of 𝑄 on 𝑧0 ⋯ 𝑧𝑛−1 is equal to 1. To
prove (∗) suppose, towards a contradiction, that it is false, and let ℓ be
the smallest number such that 𝑧ℓ+𝑛 is not equal to the value assigned
in the ℓ-th step of the execution of 𝑄 on 𝑧0 , … , 𝑧𝑛−1 . But since 𝑧 sat-
isfies the constraints of Ψ, we get that 𝑧ℓ+𝑛 = NAND(𝑧𝑖 , 𝑧𝑗 ) where
(by the assumption above that ℓ is smallest with this property) these
values do correspond to the values last assigned to the variables on the
right-hand side of the assignment operator in the ℓ-th line of the pro-
gram. But this means that the value assigned in the ℓ-th step is indeed
simply the NAND of 𝑧𝑖 and 𝑧𝑗 , contradicting our assumption on the
choice of ℓ.

15.5 FROM 3NAND TO 3SAT


The final step in the proof of Theorem 15.6 is the following:
Lemma 15.10 3NAND ≤𝑝 3SAT.

Proof Idea:
To prove Lemma 15.10 we need to map a 3NAND formula 𝜑 into a 3SAT formula 𝜓 such that 𝜑 is satisfiable if and only if 𝜓 is. The idea is that we can transform every NAND constraint of the form 𝑎 = NAND(𝑏, 𝑐) into the AND of ORs involving the variables 𝑎, 𝑏, 𝑐 and their negations, where each of the ORs contains at most three terms. The construction is fairly straightforward, and the details are given below.

Figure 15.7: A 3NAND instance that is obtained by taking a NAND-TM program for computing the AND function, unrolling it to obtain a NANDSAT instance, and then composing it with the reduction of Lemma 15.9.

P
It is a good exercise for you to try to find a 3CNF for-
mula 𝜉 on three variables 𝑎, 𝑏, 𝑐 such that 𝜉(𝑎, 𝑏, 𝑐) is
true if and only if 𝑎 = NAND(𝑏, 𝑐). Once you do so, try
to see why this implies a reduction from 3NAND to
3SAT, and hence completes the proof of Lemma 15.10

Figure 15.8: Code and example output for the reduc-


tion given in Lemma 15.10 of 3NAND to 3SAT.

Proof of Lemma 15.10. The constraint

𝑧𝑖 = NAND(𝑧𝑗, 𝑧𝑘)   (15.2)

is satisfied if 𝑧𝑖 = 1 whenever (𝑧𝑗, 𝑧𝑘) ≠ (1, 1), and 𝑧𝑖 = 0 when 𝑧𝑗 = 𝑧𝑘 = 1. By going through all cases, we can verify that (15.2) is equivalent to the constraint

(¬𝑧𝑖 ∨ ¬𝑧𝑗 ∨ ¬𝑧𝑘) ∧ (𝑧𝑖 ∨ 𝑧𝑗) ∧ (𝑧𝑖 ∨ 𝑧𝑘) .   (15.3)

Indeed if 𝑧𝑗 = 𝑧𝑘 = 1 then the first constraint of Eq. (15.3) is only true if 𝑧𝑖 = 0. On the other hand, if either of 𝑧𝑗 or 𝑧𝑘 equals 0 then unless 𝑧𝑖 = 1 either the second or third constraints will fail. This means that, given any 3NAND formula 𝜑 over 𝑛 variables 𝑧0, … , 𝑧𝑛−1, we can obtain a 3SAT formula 𝜓 over the same variables by replacing every 3NAND constraint of 𝜑 with three 3OR constraints as in Eq. (15.3).¹ Because of the equivalence of (15.2) and (15.3), the formula 𝜓 satisfies that 𝜓(𝑧0, … , 𝑧𝑛−1) = 𝜑(𝑧0, … , 𝑧𝑛−1) for every assignment 𝑧0, … , 𝑧𝑛−1 ∈ {0, 1}^𝑛 to the variables. In particular 𝜓 is satisfiable if and only if 𝜑 is, thus completing the proof.

¹ The resulting formula will have some of the ORs involving only two variables. If we wanted to insist on each formula involving three distinct variables we can always add a “dummy variable” 𝑧𝑛+𝑚 and include it in all the ORs involving only two variables, and add a constraint requiring this dummy variable to be zero.
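The translation of Eq. (15.3) into code is immediate. The following hedged sketch (an illustration in the spirit of Fig. 15.8, not its exact listing) represents a literal as a pair (variable index, sign):

def three_nand_to_3sat(constraints):
    """Map a list of 3NAND constraints (i, j, k), each meaning z_i = NAND(z_j, z_k),
    to an equisatisfiable list of clauses, each a list of literals (index, positive)."""
    clauses = []
    for (i, j, k) in constraints:
        clauses.append([(i, False), (j, False), (k, False)])   # ¬z_i ∨ ¬z_j ∨ ¬z_k
        clauses.append([(i, True), (j, True)])                 # z_i ∨ z_j
        clauses.append([(i, True), (k, True)])                 # z_i ∨ z_k
    return clauses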

Figure 15.9: An instance of the independent set problem


obtained by applying the reductions NANDSAT ≤𝑝
3NAND ≤𝑝 3SAT ≤𝑝 ISET starting with the xor5
NAND-CIRC program.

15.6 WRAPPING UP
We have shown that for every function 𝐹 in NP, 𝐹 ≤𝑝 NANDSAT ≤𝑝
3NAND ≤𝑝 3SAT, and so 3SAT is NP-hard. Since in Chapter 14 we
saw that 3SAT ≤𝑝 QUADEQ, 3SAT ≤𝑝 ISET, 3SAT ≤𝑝 MAXCUT
and 3SAT ≤𝑝 LONGPATH, all these problems are NP-hard as well.
Finally, since all the aforementioned problems are in NP, they are
all in fact NP-complete and have equivalent complexity. There are
thousands of other natural problems that are NP-complete as well.
Finding a polynomial-time algorithm for any one of them will imply a
polynomial-time algorithm for all of them.

Figure 15.10: We believe that P ≠ NP and all NP


complete problems lie outside of P, but we cannot
rule out the possibility that P = NP. However, we can rule out the possibility that some NP-complete
problems are in P and others are not, since we know
that if even one NP-complete problem is in P then
P = NP. The relation between P/poly and NP is
not known though it can be shown that if one NP-
complete problem is in P/poly then NP ⊆ P/poly .

✓ Chapter Recap

• Many of the problems for which we don’t know


polynomial-time algorithms are NP-complete,
which means that finding a polynomial-time algo-
rithm for one of them would imply a polynomial-
time algorithm for all of them.
• It is conjectured that NP ≠ P which means that
we believe that polynomial-time algorithms for
these problems are not merely unknown but are
non-existent.
• While an NP-hardness result means for example
that a full-fledged “textbook” solution to a problem
such as MAX-CUT that is as clean and general as
the algorithm for MIN-CUT probably does not
exist, it does not mean that we need to give up
whenever we see a MAX-CUT instance. Later in
this course we will discuss several strategies to deal
with NP-hardness, including average-case complexity
and approximation algorithms.

15.7 EXERCISES
Exercise 15.1 — Poor man's Ladner's Theorem. Prove that if there is no 𝑛^{𝑂(log² 𝑛)}-time algorithm for 3SAT then there is some 𝐹 ∈ NP such that 𝐹 ∉ P and 𝐹 is not NP complete.²

² Hint: Use the function 𝐹 that on input a formula 𝜑 and a string of the form 1^𝑡, outputs 1 if and only if 𝜑 is satisfiable and 𝑡 = |𝜑|^{log |𝜑|}.

Exercise 15.2 — NP ≠ co-NP ⇒ NP ≠ P. Let 3SAT‾ be the function that on input a 3CNF formula 𝜑 returns 1 − 3SAT(𝜑). Prove that if 3SAT‾ ∉ NP then P ≠ NP. See footnote for hint.³

³ Hint: Prove and then use the fact that P is closed under complement.

Exercise 15.3Define WSAT to be the following function: the input is a


CNF formula 𝜑 where each clause is the OR of one to three variables
(without negations), and a number 𝑘 ∈ ℕ. For example, the following
formula can be used for a valid input to WSAT: 𝜑 = (𝑥5 ∨ 𝑥2 ∨ 𝑥1 ) ∧
(𝑥1 ∨ 𝑥3 ∨ 𝑥0 ) ∧ (𝑥2 ∨ 𝑥4 ∨ 𝑥0 ). The output WSAT(𝜑, 𝑘) = 1 if and
only if there exists a satisfying assignment to 𝜑 in which exactly 𝑘
of the variables get the value 1. For example for the formula above
WSAT(𝜑, 2) = 1 since the assignment (1, 1, 0, 0, 0, 0) satisfies all the


clauses. However WSAT(𝜑, 1) = 0 since there is no single variable
appearing in all clauses.
Prove that WSAT is NP-complete.

Exercise 15.4 In the employee recruiting problem we are given a list of


potential employees, each of which has some subset of 𝑚 potential
skills, and a number 𝑘. We need to assemble a team of 𝑘 employees
such that for every skill there would be one member of the team with
this skill.
For example, if Alice has the skills “C programming”, “NAND
programming” and “Solving Differential Equations”, Bob has the
skills “C programming” and “Solving Differential Equations”, and
Charlie has the skills “NAND programming” and “Coffee Brewing”,
then if we want a team of two people that covers all the four skills, we
would hire Alice and Charlie.
Define the function EMP s.t. on input the skills 𝐿 of all potential
employees (in the form of a sequence 𝐿 of 𝑛 lists 𝐿1 , … , 𝐿𝑛 , each
containing distinct numbers between 0 and 𝑚), and a number 𝑘,
EMP(𝐿, 𝑘) = 1 if and only if there is a subset 𝑆 of 𝑘 potential em-
ployees such that for every skill 𝑗 in [𝑚], there is an employee in 𝑆 that
has the skill 𝑗.
Prove that EMP is NP complete.

Exercise 15.5 — Balanced max cut. Prove that the “balanced variant” of
the maximum cut problem is NP-complete, where this is defined as
BMC ∶ {0, 1}∗ → {0, 1} where for every graph 𝐺 = (𝑉 , 𝐸) and 𝑘 ∈ ℕ,
BMC(𝐺, 𝑘) = 1 if and only if there exists a cut 𝑆 in 𝐺 cutting at least 𝑘
edges such that |𝑆| = |𝑉 |/2.

Exercise 15.6 — Regular expression intersection. Let MANYREGS be the following function: On input a list of regular expressions exp0, … , exp𝑚
(represented as strings in some standard way), output 1 if and only if
there is a single string 𝑥 ∈ {0, 1}∗ that matches all of them. Prove that
MANYREGS is NP-hard.

15.8 BIBLIOGRAPHICAL NOTES


Aaronson’s 120-page survey [Aar16] is a beautiful and extensive exposition of the P vs NP problem, its importance and status. See also Chapter 3 in Wigderson’s excellent book [Wig19]. Johnson
[Joh12] gives a survey of the historical development of the theory of
NP completeness. The following web page keeps a catalog of failed
attempts at settling P vs NP. At the time of this writing, it lists about


110 papers claiming to resolve the question, of which about 60 claim to
prove that P = NP and about 50 claim to prove that P ≠ NP.
Ladner’s Theorem was proved by Richard Ladner in 1975. Lad-
ner, who was born to deaf parents, later switched his research focus
into computing for assistive technologies, where he has made many
contributions. In 2014, he wrote a personal essay on his path from
theoretical CS to accessibility research.
Eugene Lawler’s quote on the “mystical power of twoness” was
taken from the wonderful book “The Nature of Computation” by
Moore and Mertens. See also this memorial essay on Lawler by
Lenstra.
Learning Objectives:
• Explore the consequences of P = NP
• Search-to-decision reduction: transform algorithms that solve the decision version to the search version for NP-complete problems.
• Optimization and learning problems
• Quantifier elimination and solving problems in the polynomial hierarchy.
• What is the evidence for P = NP vs P ≠ NP?

16
What if P equals NP?

“You don’t have to believe in God, but you should believe in The Book.”, Paul Erdős, 1985.¹

“No more half measures, Walter”, Mike Ehrmantraut in “Breaking Bad”, 2010.

¹ Paul Erdős (1913-1996) was one of the most prolific mathematicians of all time. Though he was an atheist, Erdős often referred to “The Book” in which God keeps the most elegant proof of each mathematical theorem.

“The evidence in favor of [P ≠ NP] and [ its algebraic counterpart ] is so


overwhelming, and the consequences of their failure are so grotesque, that their
status may perhaps be compared to that of physical laws rather than that of
ordinary mathematical conjectures.”, Volker Strassen, laudation for Leslie
Valiant, 1986.

“Suppose aliens invade the earth and threaten to obliterate it in a year’s time
unless human beings can find the [fifth Ramsey number]. We could marshal
the world’s best minds and fastest computers, and within a year we could prob-
ably calculate the value. If the aliens demanded the [sixth Ramsey number],
however, we would have no choice but to launch a preemptive attack.”, Paul
Erdős, as quoted by Graham and Spencer, 1990.²

² The 𝑘-th Ramsey number, denoted as 𝑅(𝑘, 𝑘), is the smallest number 𝑛 such that for every graph 𝐺 on 𝑛 vertices, either 𝐺 or its complement contains a 𝑘-sized independent set. If P = NP then we can compute 𝑅(𝑘, 𝑘) in time polynomial in 2^𝑘, while otherwise it can potentially take closer to 2^{2^𝑘} steps.

We have mentioned that the question of whether P = NP, which
is equivalent to whether there is a polynomial-time algorithm for
3SAT, is the great open question of Computer Science. But why is it so
important? In this chapter, we will try to figure out the implications of
such an algorithm.
First, let us get one qualm out of the way. Sometimes people say,
“What if P = NP but the best algorithm for 3SAT takes 𝑛^1000 time?” Well,
𝑛^1000 is much larger than, say, 2^{0.001√𝑛} for any input smaller than 2^50 bits,
which is about as large a hard drive as you will encounter, and so another way to
phrase this question is to say “what if the complexity of 3SAT is ex-
ponential for all inputs that we will ever encounter, but then grows
much smaller than that?” To me this sounds like the computer science
equivalent of asking, “what if the laws of physics change completely
once they are out of the range of our telescopes?”. Sure, this is a valid
possibility, but wondering about it does not sound like the most pro-
ductive use of our time.




So, as the saying goes, we’ll keep an open mind, but not so open
that our brains fall out, and assume from now on that:

• There is a mathematical god,

and

• She does not “beat around the bush” or take “half measures”.

What we mean by this is that we will consider two extreme scenar-


ios:

• 3SAT is very easy: 3SAT has an 𝑂(𝑛) or 𝑂(𝑛²) time algorithm with
  a not too huge constant (say smaller than 10^6).

• 3SAT is very hard: 3SAT is exponentially hard and cannot be
  solved faster than 2^{𝜖𝑛} for some not too tiny 𝜖 > 0 (say at least
  10^{−6}). We can even make the stronger assumption that for every
  sufficiently large 𝑛, the restriction of 3SAT to inputs of length 𝑛
  cannot be computed by a circuit of fewer than 2^{𝜖𝑛} gates.

At the time of writing, the fastest known algorithm for 3SAT re-
quires more than 2^{0.35𝑛} steps to solve 𝑛-variable formulas, while we do not
even know how to rule out the possibility that we can compute 3SAT
using 10𝑛 gates. To put it in perspective, for the case 𝑛 = 1000 our
lower and upper bounds for the computational costs are apart by
a factor of about 10^100. As far as we know, it could be the case that
1000-variable 3SAT can be solved in a millisecond on a first-generation
iPhone, and it can also be the case that such instances require more
than the age of the universe to solve on the world’s fastest supercom-
puter.
So far, most of our evidence points to the latter possibility of 3SAT
being exponentially hard, but we have not ruled out the former possi-
bility either. In this chapter we will explore some of the consequences
of the “3SAT easy” scenario.

This chapter: A non-mathy overview


This chapter shows some of the truly breathtaking conse-
quences that would be derived from an efficient algorithm
for NP-complete problems. We will see that such an algo-
rithm would imply efficient algorithms for tasks including
solving search problems, eliminating quantifiers, fitting data
with complex models, sampling and counting, and more.
While the evidence strongly suggests that such an algorithm
does not exist, the tools developed in these results have
nonetheless found many other applications.

16.1 SEARCH-TO-DECISION REDUCTION


A priori, having a fast algorithm for 3SAT might not seem so impres-
sive. Sure, such an algorithm allows us to decide the satisfiability of
not just 3CNF formulas but also of quadratic equations, as well as find
out whether there is a long path in a graph, and solve many other de-
cision problems. But this is not typically what we want to do. It’s not
enough to know if a formula is satisfiable: we want to discover the
actual satisfying assignment. Similarly, it’s not enough to find out if a
graph has a long path: we want to actually find the path.
It turns out that if we can solve these decision problems, we can
solve the corresponding search problems as well:

Theorem 16.1 — Search vs Decision. Suppose that P = NP. Then
for every polynomial-time algorithm 𝑉 and 𝑎, 𝑏 ∈ ℕ, there is a
polynomial-time algorithm FIND𝑉 such that for every 𝑥 ∈ {0, 1}^𝑛,
if there exists 𝑦 ∈ {0, 1}^{𝑎𝑛^𝑏} satisfying 𝑉 (𝑥𝑦) = 1, then FIND𝑉 (𝑥)
finds some string 𝑦′ satisfying this condition.

P
To understand what the statement of Theo-
rem 16.1 means, let us look at the special case of
the MAXCUT problem. It is not hard to see that there
is a polynomial-time algorithm VERIFYCUT such that
VERIFYCUT(𝐺, 𝑘, 𝑆) = 1 if and only if 𝑆 is a subset
of 𝐺’s vertices that cuts at least 𝑘 edges. Theorem 16.1
implies that if P = NP then there is a polynomial-time
algorithm FINDCUT that on input 𝐺, 𝑘 outputs a set
𝑆 such that VERIFYCUT(𝐺, 𝑘, 𝑆) = 1 if such a set
exists. This means that if P = NP, by trying all values
of 𝑘 we can find in polynomial time a maximum cut
in any given graph. We can use a similar argument to
show that if P = NP then we can find a satisfying as-
signment for every satisfiable 3CNF formula, find the
longest path in a graph, solve integer programming,
and so on and so forth.

Proof Idea:
The idea behind the proof of Theorem 16.1 is simple; let us
demonstrate it for the special case of 3SAT. (In fact, this case is not
so “special”− since 3SAT is NP-complete, we can reduce the task of
solving the search problem for MAXCUT or any other problem in
NP to the task of solving it for 3SAT.) Suppose that P = NP and we
are given a satisfiable 3CNF formula 𝜑, and we now want to find a
satisfying assignment 𝑦 for 𝜑. Define 3SAT0 (𝜑) to output 1 if there is
a satisfying assignment 𝑦 for 𝜑 such that its first bit is 0, and similarly
define 3SAT1 (𝜑) = 1 if there is a satisfying assignment 𝑦 with 𝑦0 = 1.

The key observation is that both 3SAT0 and 3SAT1 are in NP, and so if
P = NP then we can compute them in polynomial time as well. Thus
we can use this to find the first bit of the satisfying assignment. We
can continue in this way to recover all the bits.

Proof of Theorem 16.1. Let 𝑉 be some polynomial time algorithm and


𝑎, 𝑏 ∈ ℕ some constants. Define the function STARTSWITH𝑉 as
follows: For every 𝑥 ∈ {0, 1}∗ and 𝑧 ∈ {0, 1}∗ , STARTSWITH𝑉 (𝑥, 𝑧) =
1 if and only if there exists some 𝑦 ∈ {0, 1}^{𝑎𝑛^𝑏 − |𝑧|} (where 𝑛 = |𝑥|) such
that 𝑉 (𝑥𝑧𝑦) = 1. That is, STARTSWITH𝑉 (𝑥, 𝑧) outputs 1 if there is
some string 𝑤 of length 𝑎|𝑥|^𝑏 such that 𝑉 (𝑥𝑤) = 1 and the first |𝑧|
bits of 𝑤 are 𝑧0 , … , 𝑧ℓ−1 . Since, given 𝑥, 𝑦, 𝑧 as above, we can check in
polynomial time if 𝑉 (𝑥𝑧𝑦) = 1, the function STARTSWITH𝑉 is in NP
and hence if P = NP we can compute it in polynomial time.
Now for every such polynomial-time 𝑉 and 𝑎, 𝑏 ∈ ℕ, we can imple-
ment FIND𝑉 (𝑥) as follows:

Algorithm 16.2 — FIND𝑉 : Search to decision reduction.

Input: 𝑥 ∈ {0, 1}^𝑛
Output: 𝑧 ∈ {0, 1}^{𝑎𝑛^𝑏} s.t. 𝑉 (𝑥𝑧) = 1, if such 𝑧 exists. Otherwise output the empty string.

1: Initially 𝑧0 = 𝑧1 = ⋯ = 𝑧_{𝑎𝑛^𝑏−1} = 0.
2: for ℓ = 0, … , 𝑎𝑛^𝑏 − 1 do
3:   Let 𝑏0 ← STARTSWITH𝑉 (𝑥𝑧0 ⋯ 𝑧ℓ−1 0).
4:   Let 𝑏1 ← STARTSWITH𝑉 (𝑥𝑧0 ⋯ 𝑧ℓ−1 1).
5:   if 𝑏0 = 𝑏1 = 0 then
6:     return ""
7:     # Can't extend 𝑥𝑧0 … 𝑧ℓ−1 to an input that 𝑉 accepts
8:   end if
9:   if 𝑏0 = 1 then
10:     𝑧ℓ ← 0
11:     # Can extend 𝑥𝑧0 … 𝑧ℓ−1 with 0 to an accepting input
12:   else
13:     𝑧ℓ ← 1
14:     # Can extend 𝑥𝑧0 … 𝑧ℓ−1 with 1 to an accepting input
15:   end if
16: end for
17: return 𝑧0 , … , 𝑧_{𝑎𝑛^𝑏−1}

To analyze Algorithm 16.2, note that it makes 2𝑎𝑛^𝑏 invocations to
STARTSWITH𝑉 and hence if the latter is polynomial-time, then so is
Algorithm 16.2. Now suppose that 𝑥 is such that there exists some 𝑦
satisfying 𝑉 (𝑥𝑦) = 1. We claim that at every step ℓ = 0, … , 𝑎𝑛^𝑏 − 1, we
maintain the invariant that there exists 𝑦 ∈ {0, 1}^{𝑎𝑛^𝑏} whose first ℓ bits
are 𝑧 s.t. 𝑉 (𝑥𝑦) = 1. Note that this claim implies the theorem, since in
particular it means that for ℓ = 𝑎𝑛^𝑏 − 1, 𝑧 satisfies 𝑉 (𝑥𝑧) = 1.
We prove the claim by induction. For ℓ = 0, this holds vacuously.
Now for every ℓ > 0, if the call STARTSWITH𝑉 (𝑥𝑧0 ⋯ 𝑧ℓ−1 0)
returns 1, then we are guaranteed the invariant by definition of
STARTSWITH𝑉 . Now under our inductive hypothesis, there are
𝑦ℓ , … , 𝑦_{𝑎𝑛^𝑏−1} such that 𝑉 (𝑥𝑧0 ⋯ 𝑧ℓ−1 𝑦ℓ ⋯ 𝑦_{𝑎𝑛^𝑏−1}) = 1. If the call to
STARTSWITH𝑉 (𝑥𝑧0 ⋯ 𝑧ℓ−1 0) returns 0 then it must be the case that
𝑦ℓ = 1, and hence when we set 𝑧ℓ = 1 we maintain the invariant.
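
To make the reduction concrete, here is a short Python sketch of Algorithm 16.2. This is only an illustration: the function decide_prefix is a hypothetical stand-in for the decision procedure for STARTSWITH𝑉 (which would run in polynomial time if P = NP), and the brute-force implementation below is exponential-time and is included only so the snippet can be run on tiny examples.

from itertools import product

def find_witness(decide_prefix, m, x):
    # decide_prefix(x, z): True iff some y in {0,1}^m that starts with the
    # prefix z satisfies V(x + y) = 1 (this is the function STARTSWITH_V).
    z = ""
    for _ in range(m):
        if decide_prefix(x, z + "0"):
            z += "0"      # some witness starts with z followed by 0
        elif decide_prefix(x, z + "1"):
            z += "1"      # some witness starts with z followed by 1
        else:
            return ""     # no witness extends z, so no witness exists at all
    return z

def brute_force_prefix_decider(V, m):
    # Exponential-time stand-in for the polynomial-time oracle (illustration only).
    def decide_prefix(x, z):
        return any(V(x + z + "".join(rest))
                   for rest in product("01", repeat=m - len(z)))
    return decide_prefix

# Hypothetical toy verifier: the witness must equal the reverse of x.
V = lambda s: s[:4][::-1] == s[4:]
print(find_witness(brute_force_prefix_decider(V, 4), 4, "0110"))   # prints "0110"

The loop mirrors the invariant in the proof: after ℓ iterations the prefix 𝑧 can still be extended to a full witness, so when the loop ends 𝑧 itself is a witness.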

16.2 OPTIMIZATION
Theorem 16.1 allows us to find solutions for NP problems if P = NP,
but it is not immediately clear that we can find the optimal solution.
For example, suppose that P = NP, and you are given a graph 𝐺. Can
you find the longest simple path in 𝐺 in polynomial time?

P
This is actually an excellent question for you to at-
tempt on your own. That is, assuming P = NP, give
a polynomial-time algorithm that on input a graph 𝐺,
outputs a maximally long simple path in the graph 𝐺.

The answer is Yes. The idea is simple: if P = NP then we can find


out in polynomial time if an 𝑛-vertex graph 𝐺 contains a simple path
of length 𝑛, and moreover, by Theorem 16.1, if 𝐺 does contain such a
path, then we can find it. (Can you see why?) If 𝐺 does not contain a
simple path of length 𝑛, then we will check if it contains a simple path
of length 𝑛 − 1, and continue in this way to find the largest 𝑘 such that
𝐺 contains a simple path of length 𝑘.
The above reasoning was not specifically tailored to finding paths
in graphs. In fact, it can be vastly generalized to proving the following
result:

Theorem 16.3 — Optimization from P = NP. Suppose that P = NP. Then
for every polynomial-time computable function 𝑓 ∶ {0, 1}∗ → ℕ
(identifying 𝑓(𝑥) with natural numbers via the binary representa-
tion) there is a polynomial-time algorithm OPT such that on input
𝑥 ∈ {0, 1}∗ ,

OPT(𝑥, 1^𝑚 ) = max_{𝑦∈{0,1}^𝑚} 𝑓(𝑥, 𝑦) .

Moreover under the same assumption, there is a polynomial-


time algorithm FINDOPT such that for every 𝑥 ∈ {0, 1}∗ , FINDOPT(𝑥, 1𝑚 )
outputs 𝑦∗ ∈ {0, 1}∗ such that 𝑓(𝑥, 𝑦∗ ) = max𝑦∈{0,1}𝑚 𝑓(𝑥, 𝑦).

P
The statement of Theorem 16.3 is a bit cumbersome.
To understand it, think how it would subsume the
example above of a polynomial time algorithm for
finding the maximum length path in a graph. In
this case the function 𝑓 would be the map that on
input a pair 𝑥, 𝑦 outputs 0 if the pair (𝑥, 𝑦) does not
represent some graph and a simple path inside the
graph respectively; otherwise 𝑓(𝑥, 𝑦) would equal
the length of the path 𝑦 in the graph 𝑥. Since a path
in an 𝑛 vertex graph can be represented by at most
𝑛 log 𝑛 bits, for every 𝑥 representing a graph of 𝑛 ver-
tices, finding max_{𝑦∈{0,1}^{𝑛 log 𝑛}} 𝑓(𝑥, 𝑦) corresponds to
finding the length of the maximum simple path in the
graph corresponding to 𝑥, and finding the string 𝑦∗
that achieves this maximum corresponds to actually
finding the path.

Proof Idea:
The proof follows by generalizing our ideas from the longest path
example above. Let 𝑓 be as in the theorem statement. If P = NP then
for every string 𝑥 ∈ {0, 1}∗ and number 𝑘, we can test in
𝑝𝑜𝑙𝑦(|𝑥|, 𝑚) time whether there exists 𝑦 such that 𝑓(𝑥, 𝑦) ≥ 𝑘, or in
other words test whether max𝑦∈{0,1}𝑚 𝑓(𝑥, 𝑦) ≥ 𝑘. If 𝑓(𝑥, 𝑦) is an
integer between 0 and 𝑝𝑜𝑙𝑦(|𝑥| + |𝑦|) (as is the case in the example of
longest path) then we can just try out all possibilities for 𝑘 to find the
maximum number 𝑘 for which max𝑦 𝑓(𝑥, 𝑦) ≥ 𝑘. Otherwise, we can
use binary search to home in on the right value. Once we do so, we
can use search-to-decision to actually find the string 𝑦∗ that achieves
the maximum.

Proof of Theorem 16.3. For every 𝑓 as in the theorem statement, we can


define the Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1} as follows.


𝐹 (𝑥, 1^𝑚 , 𝑘) = 1 if ∃𝑦∈{0,1}^𝑚 𝑓(𝑥, 𝑦) ≥ 𝑘, and 𝐹 (𝑥, 1^𝑚 , 𝑘) = 0 otherwise.
Since 𝑓 is computable in polynomial time, 𝐹 is in NP, and so under
our assumption that P = NP, 𝐹 itself can be computed in polynomial
time. Now, for every 𝑥 and 𝑚, we can compute the largest 𝑘 such that
𝐹 (𝑥, 1𝑚 , 𝑘) = 1 by a binary search. Specifically, we will do this as
follows:

1. We maintain two numbers 𝑎, 𝑏 such that we are guaranteed that


𝑎 ≤ max𝑦∈{0,1}𝑚 𝑓(𝑥, 𝑦) < 𝑏.

2. Initially we set 𝑎 = 0 and 𝑏 = 2^{𝑇 (𝑛)} where 𝑇 (𝑛) is the running time
   of 𝑓. (A function with 𝑇 (𝑛) running time can't output more than
   𝑇 (𝑛) bits and so can't output a number larger than 2^{𝑇 (𝑛)}.)

3. At each point in time, we compute the midpoint 𝑐 = ⌊(𝑎 + 𝑏)/2⌋
   and let 𝑦 = 𝐹 (𝑥, 1^𝑚 , 𝑐).

a. If 𝑦 = 1 then we set 𝑎 = 𝑐 and leave 𝑏 as it is.


b. If 𝑦 = 0 then we set 𝑏 = 𝑐 and leave 𝑎 as it is.

4. We then go back to step 3, until 𝑏 ≤ 𝑎 + 1.

Since |𝑏 − 𝑎| shrinks by a factor of 2, within log2 2^{𝑇 (𝑛)} = 𝑇 (𝑛)
steps, we will get to the point at which 𝑏 ≤ 𝑎 + 1, and then we can
simply output 𝑎. Once we find the maximum value of 𝑘 such that
𝐹 (𝑥, 1^𝑚 , 𝑘) = 1, we can use the search to decision reduction of Theo-
rem 16.1 to obtain the actual value 𝑦∗ ∈ {0, 1}^𝑚 such that 𝑓(𝑥, 𝑦∗ ) = 𝑘.
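
The binary-search step above can be summarized in a few lines of Python. This is a sketch under the chapter's assumption: exists_at_least is a hypothetical decision procedure for the NP function 𝐹 defined above (so it would be polynomial time only if P = NP), and T is the bound on the running time of 𝑓.

def max_objective(exists_at_least, x, m, T):
    # exists_at_least(x, m, k): True iff some y in {0,1}^m has f(x, y) >= k.
    # Loop invariant: a <= max_y f(x, y) < b.
    a, b = 0, 2 ** T
    while b > a + 1:
        c = (a + b) // 2
        if exists_at_least(x, m, c):
            a = c          # the maximum is at least c
        else:
            b = c          # the maximum is smaller than c
    return a               # after at most T halvings, a equals the maximum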

■ Example 16.4 — Integer programming. One application for Theo-


rem 16.3 is in solving optimization problems. For example, the task
of linear programming is to find 𝑦 ∈ ℝ^𝑛 that maximizes some linear
objective ∑_{𝑖=0}^{𝑛−1} 𝑐𝑖 𝑦𝑖 subject to the constraint that 𝑦 satisfies linear
inequalities of the form ∑_{𝑖=0}^{𝑛−1} 𝑎𝑖 𝑦𝑖 ≤ 𝑐. As we discussed in Sec-
tion 12.1.3, there is a known polynomial-time algorithm for linear
programming. However, if we want to place additional constraints
on 𝑦, such as requiring the coordinates of 𝑦 to be integer or 0/1
valued then the best-known algorithms run in exponential time in
the worst case. However, if P = NP then Theorem 16.3 tells us
that we would be able to solve all problems of this form in poly-
nomial time. For every string 𝑥 that describes a set of constraints
and objective, we will define a function 𝑓 such that if 𝑦 satisfies
the constraints of 𝑥 then 𝑓(𝑥, 𝑦) is the value of the objective, and
otherwise we set 𝑓(𝑥, 𝑦) = −𝑀 where 𝑀 is some large number. We
can then use Theorem 16.3 to compute the 𝑦 that maximizes 𝑓(𝑥, 𝑦)
and that will give us the assignment for the variables that satisfies
our constraints and maximizes the objective. (If the computation
results in 𝑦 such that 𝑓(𝑥, 𝑦) = −𝑀 then we can double 𝑀 and try
again; if the true maximum objective is achieved by some string
𝑦∗ , then eventually 𝑀 will be large enough so that −𝑀 would be

smaller than the objective achieved by 𝑦∗ , and hence when we run
the procedure of Theorem 16.3 we would get a value larger than −𝑀 .)
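
To make the reduction of Example 16.4 concrete, here is a small Python sketch that wraps a 0/1 integer program as a function 𝑓 of the form required by Theorem 16.3. The names and the penalty value M are illustrative; actually maximizing 𝑓 efficiently would still require the hypothetical algorithm of Theorem 16.3.

def make_ilp_objective(c, A, b, M):
    # maximize sum_i c[i]*y[i] subject to sum_i A[j][i]*y[i] <= b[j] for all j,
    # over y in {0,1}^n; infeasible assignments get the large negative value -M.
    def f(y):
        n = len(y)
        for j in range(len(b)):
            if sum(A[j][i] * y[i] for i in range(n)) > b[j]:
                return -M
        return sum(c[i] * y[i] for i in range(n))
    return f

# Tiny example: maximize y0 + 2*y1 subject to y0 + y1 <= 1.
f = make_ilp_objective(c=[1, 2], A=[[1, 1]], b=[1], M=10**6)
print(max(f((y0, y1)) for y0 in (0, 1) for y1 in (0, 1)))   # prints 2, via y = (0, 1)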

R
Remark 16.5 — Need for binary search. In many ex-
amples, such as the case of finding the longest path,
we don’t need to use the binary search step in Theo-
rem 16.3, and can simply enumerate over all possible
values for 𝑘 until we find the correct one. One exam-
ple where we do need to use this binary search step
is in the case of the problem of finding a maximum
length path in a weighted graph. This is the problem
where 𝐺 is a weighted graph, and every edge of 𝐺 is
given a weight which is a number between 0 and 2^𝑘 .
Theorem 16.3 shows that we can find the maximum-
weight simple path in 𝐺 (i.e., simple path maximizing
the sum of the weights of its edges) in time polyno-
mial in the number of vertices and in 𝑘.
Beyond just this example there is a vast field of math-
ematical optimization that studies problems of the
same form as in Theorem 16.3. In the context of opti-
mization, 𝑥 typically denotes a set of constraints over
some variables (that can be Boolean, integer, or real
valued), 𝑦 encodes an assignment to these variables,
and 𝑓(𝑥, 𝑦) is the value of some objective function that
we want to maximize. Given that we don’t know
efficient algorithms for NP complete problems, re-
searchers in optimization research study special cases
of functions 𝑓 (such as linear programming and
semidefinite programming) where it is possible to
optimize the value efficiently. Optimization is widely
used in a great many scientific areas including: ma-
chine learning, engineering, economics and operations
research.

16.2.1 Example: Supervised learning


One classical optimization task is supervised learning. In supervised
learning we are given a list of examples 𝑥0 , 𝑥1 , … , 𝑥𝑚−1 (where we
can think of each 𝑥𝑖 as a string in {0, 1}𝑛 for some 𝑛) and the la-
bels for them 𝑦0 , … , 𝑦𝑚−1 (which we will think of simply as bits, i.e.,
𝑦𝑖 ∈ {0, 1}). For example, we can think of the 𝑥𝑖 ’s as images of ei-
ther dogs or cats, for which 𝑦𝑖 = 1 in the former case and 𝑦𝑖 = 0
in the latter case. Our goal is to come up with a hypothesis or predic-
tor ℎ ∶ {0, 1}𝑛 → {0, 1} such that if we are given a new example 𝑥
that has an (unknown to us) label 𝑦, then with high probability ℎ
will predict the label. That is, with high probability it will hold that
ℎ(𝑥) = 𝑦. The idea in supervised learning is to use the Occam’s Ra-
zor principle: the simplest hypothesis that explains the data is likely

to be correct. There are several ways to model this, but one popular
approach is to pick some fairly simple function 𝐻 ∶ {0, 1}𝑘+𝑛 → {0, 1}.
We think of the first 𝑘 inputs as the parameters and the last 𝑛 inputs
as the example data. (For example, we can think of the first 𝑘 inputs
of 𝐻 as specifying the weights and connections for some neural net-
work that will then be applied on the latter 𝑛 inputs.) We can then
phrase the supervised learning problem as finding, given a set of la-
beled examples 𝑆 = {(𝑥0 , 𝑦0 ), … , (𝑥𝑚−1 , 𝑦𝑚−1 )}, the set of parameters
𝜃0 , … , 𝜃𝑘−1 ∈ {0, 1} that minimizes the number of errors made by
the predictor 𝑥 ↦ 𝐻(𝜃, 𝑥). (This is often known as Empirical Risk
Minimization.)
In other words, we can define for every set 𝑆 as above the function
𝐹𝑆 ∶ {0, 1}𝑘 → [𝑚] such that 𝐹𝑆 (𝜃) = ∑(𝑥,𝑦)∈𝑆 |𝐻(𝜃, 𝑥) − 𝑦|. Now,
finding the value 𝜃 that minimizes 𝐹𝑆 (𝜃) is equivalent to solving the
supervised learning problem with respect to 𝐻. For every polynomial-
time computable 𝐻 ∶ {0, 1}𝑘+𝑛 → {0, 1}, the task of minimizing
𝐹𝑆 (𝜃) can be “massaged” to fit the form of Theorem 16.3 and hence if
P = NP, then we can solve the supervised learning problem in great
generality. In fact, this observation extends to essentially any learn-
ing model, and allows for finding the optimal predictors given the
minimum number of examples. (This is in contrast to many current
learning algorithms, which often rely on having access to an extremely
large number of examples− far beyond the minimum needed, and
in particular far beyond the number of examples humans use for the
same tasks.)
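
As a toy illustration of the function 𝐹𝑆 above, here is a Python sketch; the hypothesis family H and the data S are placeholders, and the brute-force minimization at the end (exponential in the number of parameters) stands in for the efficient optimization that Theorem 16.3 would provide if P = NP.

from itertools import product

def empirical_risk(H, S):
    # S is a list of labeled examples (x, y) with y in {0, 1}.
    # Returns F_S, mapping a parameter vector theta to its number of errors.
    def F_S(theta):
        return sum(abs(H(theta, x) - y) for (x, y) in S)
    return F_S

# Hypothetical toy hypothesis family: predict 1 iff x and theta agree on the first bit.
H = lambda theta, x: 1 if x[0] == theta[0] else 0
S = [((0, 1), 1), ((1, 1), 0), ((0, 0), 1)]
F_S = empirical_risk(H, S)
best = min(product((0, 1), repeat=2), key=F_S)   # brute force over theta in {0,1}^2
print(best, F_S(best))                           # a parameter vector with zero errors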

16.2.2 Example: Breaking cryptosystems


We will discuss cryptography later in this course, but it turns out that
if P = NP then almost every cryptosystem can be efficiently bro-
ken. One approach is to treat finding an encryption key as an in-
stance of a supervised learning problem. If there is an encryption
scheme that maps a “plaintext” message 𝑝 and a key 𝜃 to a “cipher-
text” 𝑐, then given examples of ciphertext/plaintext pairs of the
form (𝑐0 , 𝑝0 ), … , (𝑐𝑚−1 , 𝑝𝑚−1 ), our goal is to find the key 𝜃 such that
𝐸(𝜃, 𝑝𝑖 ) = 𝑐𝑖 where 𝐸 is the encryption algorithm. While you might
think getting such “labeled examples” is unrealistic, it turns out (as
many amateur home-brew crypto designers learn the hard way) that
this is actually quite common in real-life scenarios, and that it is also
possible to relax the assumption to having more minimal prior infor-
mation about the plaintext (e.g., that it is English text). We defer a
more formal treatment to Chapter 21.

16.3 FINDING MATHEMATICAL PROOFS


In the context of Gödel’s Theorem, we discussed the notion of a proof
system (see Section 11.1). Generally speaking, a proof system can be
thought of as an algorithm 𝑉 ∶ {0, 1}∗ → {0, 1} (known as the verifier)
such that given a statement 𝑥 ∈ {0, 1}∗ and a candidate proof 𝑤 ∈ {0, 1}∗ ,
𝑉 (𝑥, 𝑤) = 1 if and only if 𝑤 encodes a valid proof for the statement 𝑥.
Any type of proof system that is used in mathematics for geometry,
number theory, analysis, etc., is an instance of this form. In fact, stan-
dard mathematical proof systems have an even simpler form where
the proof 𝑤 encodes a sequence of lines 𝑤0 , … , 𝑤𝑚 (each of which is
itself a binary string) such that each line 𝑤𝑖 is either an axiom or fol-
lows from some prior lines through an application of some inference
rule. For example, Peano’s axioms encode a set of axioms and rules
for the natural numbers, and one can use them to formalize proofs
in number theory. Also, there are some even stronger axiomatic sys-
tems, the most popular one being Zermelo–Fraenkel with the Axiom
of Choice or ZFC for short. Thus, although mathematicians typically
write their papers in natural language, proofs of number theorists
can typically be translated to ZFC or similar systems, and so in par-
ticular the existence of an 𝑛-page proof for a statement 𝑥 implies that
there exists a string 𝑤 of length 𝑝𝑜𝑙𝑦(𝑛) (in fact often 𝑂(𝑛) or 𝑂(𝑛2 ))
that encodes the proof in such a system. Moreover, because verify-
ing a proof simply involves going over each line and checking that it
does indeed follow from the prior lines, it is fairly easy to do that in
𝑂(|𝑤|) or 𝑂(|𝑤|2 ) (where as usual |𝑤| denotes the length of the proof
𝑤). This means that for every reasonable proof system 𝑉 , the follow-
ing function SHORTPROOF𝑉 ∶ {0, 1}∗ → {0, 1} is in NP, where
for every input of the form 𝑥1𝑚 , SHORTPROOF𝑉 (𝑥, 1𝑚 ) = 1 if and
only if there exists 𝑤 ∈ {0, 1}∗ with |𝑤| ≤ 𝑚 s.t. 𝑉 (𝑥𝑤) = 1. That
is, SHORTPROOF𝑉 (𝑥, 1𝑚 ) = 1 if there is a proof (in the system 𝑉 )
of length at most 𝑚 bits that 𝑥 is true. Thus, if P = NP, then despite
Gödel’s Incompleteness Theorems, we can still automate mathematics
in the sense of finding proofs that are not too long for every statement
that has one. (Frankly speaking, if the shortest proof for some state-
ment requires a terabyte, then human mathematicians won’t ever find
this proof either.) For this reason, Gödel himself felt that the question
of whether SHORTPROOF𝑉 has a polynomial time algorithm is of
great interest. As Gödel wrote in a letter to John von Neumann in 1956
(before the concept of NP or even “polynomial time” was formally
defined):
One can obviously easily construct a Turing machine, which for every
formula 𝐹 in first order predicate logic and every natural number 𝑛, al-
lows one to decide if there is a proof of 𝐹 of length 𝑛 (length = number
of symbols). Let 𝜓(𝐹 , 𝑛) be the number of steps the machine requires

for this and let 𝜑(𝑛) = max𝐹 𝜓(𝐹 , 𝑛). The question is how fast 𝜑(𝑛)
grows for an optimal machine. One can show that 𝜑 ≥ 𝑘 ⋅ 𝑛 [for some
constant 𝑘 > 0]. If there really were a machine with 𝜑(𝑛) ∼ 𝑘 ⋅ 𝑛 (or
even ∼ 𝑘 ⋅ 𝑛²), this would have consequences of the greatest importance.
Namely, it would obviously mean that in spite of the undecidability
of the Entscheidungsproblem,³ the mental work of a mathematician
concerning Yes-or-No questions could be completely replaced by a ma-
chine. After all, one would simply have to choose the natural number
𝑛 so large that when the machine does not deliver a result, it makes no
sense to think more about the problem.

³ The undecidability of the Entscheidungsproblem refers to the uncomputability of the function that maps a statement in first order logic to 1 if and only if that statement has a proof.

For many reasonable proof systems (including the one that Gödel
referred to), SHORTPROOF𝑉 is in fact NP-complete, and so Gödel can
be thought of as the first person to formulate the P vs NP question.
Unfortunately, the letter was only discovered in 1988.

16.4 QUANTIFIER ELIMINATION (ADVANCED)


If P = NP then we can solve all NP search and optimization problems in
polynomial time. But can we do more? It turns out that the answer is
that Yes we can!
An NP decision problem can be thought of as the task of deciding,
given some string 𝑥 ∈ {0, 1}∗ the truth of a statement of the form

∃𝑦∈{0,1}𝑝(|𝑥|) 𝑉 (𝑥𝑦) = 1

for some polynomial-time algorithm 𝑉 and polynomial 𝑝 ∶ ℕ → ℕ.


That is, we are trying to determine, given some string 𝑥, whether
there exists a string 𝑦 such that 𝑥 and 𝑦 satisfy some polynomial-time
checkable condition 𝑉 . For example, in the independent set problem,
the string 𝑥 represents a graph 𝐺 and a number 𝑘, the string 𝑦 repre-
sents some subset 𝑆 of 𝐺’s vertices, and the condition that we check is
whether |𝑆| ≥ 𝑘 and there is no edge {𝑢, 𝑣} in 𝐺 such that both 𝑢 ∈ 𝑆
and 𝑣 ∈ 𝑆.
We can consider more general statements such as checking, given a
string 𝑥 ∈ {0, 1}∗ , the truth of a statement of the form

∃𝑦∈{0,1}𝑝0 (|𝑥|) ∀𝑧∈{0,1}𝑝1 (|𝑥|) 𝑉 (𝑥𝑦𝑧) = 1 , (16.1)

which in words corresponds to checking, given some string 𝑥, whether


there exists a string 𝑦 such that for every string 𝑧, the triple (𝑥, 𝑦, 𝑧) sat-
isfy some polynomial-time checkable condition. We can also consider
more levels of quantifiers such as checking the truth of the statement

∃𝑦∈{0,1}𝑝0 (|𝑥|) ∀𝑧∈{0,1}𝑝1 (|𝑥|) ∃𝑤∈{0,1}𝑝2 (|𝑥|) 𝑉 (𝑥𝑦𝑧𝑤) = 1 (16.2)

and so on and so forth.


For example, given an 𝑛-input NAND-CIRC program 𝑃 , we might
want to find the smallest NAND-CIRC program 𝑃 ′ that computes the

same function as 𝑃 . The question of whether there is such a 𝑃 ′ that


can be described by a string of at most 𝑠 bits can be phrased as

∃𝑃 ′ ∈{0,1}𝑠 ∀𝑥∈{0,1}𝑛 𝑃 (𝑥) = 𝑃 ′ (𝑥) (16.3)

which has the form (16.1). (Since NAND-CIRC programs are equiv-
alent to Boolean circuits, the search problem corresponding to (16.3)
known as the circuit minimization problem and is widely studied in
Engineering. You can skip ahead to Section 16.4.1 to see a particularly
compelling application of this.)
Another example of a statement involving 𝑎 levels of quantifiers
would be to check, given a chess position 𝑥, whether there is a strategy
that guarantees that White wins within 𝑎 steps. For example, if 𝑎 = 3
we would want to check whether, given the board position 𝑥, there exists a
move 𝑦 for White such that for every move 𝑧 for Black there exists a
move 𝑤 for White that ends in a checkmate.
It turns out that if P = NP then we can solve these kinds of prob-
lems as well:

Theorem 16.6 — Polynomial hierarchy collapse. If P = NP then for every
𝑎 ∈ ℕ, polynomial 𝑝 ∶ ℕ → ℕ and polynomial-time algorithm
𝑉 , there is a polynomial-time algorithm SOLVE𝑉 ,𝑎 that on input
𝑥 ∈ {0, 1}^𝑛 returns 1 if and only if

∃𝑦0 ∈{0,1}^𝑚 ∀𝑦1 ∈{0,1}^𝑚 ⋯ 𝒬𝑦𝑎−1 ∈{0,1}^𝑚 𝑉 (𝑥𝑦0 𝑦1 ⋯ 𝑦𝑎−1 ) = 1 (16.4)

where 𝑚 = 𝑝(𝑛) and 𝒬 is either ∃ or ∀ depending on whether 𝑎 is


odd or even, respectively. (For the ease of notation, we assume that
all the strings we quantify over have the same length 𝑚 = 𝑝(𝑛), but
using simple padding one can show that this captures the general
case of strings of different polynomial lengths.)

Proof Idea:
To understand the idea behind the proof, consider the special case
where we want to decide, given 𝑥 ∈ {0, 1}𝑛 , whether for every 𝑦 ∈
{0, 1}𝑛 there exists 𝑧 ∈ {0, 1}𝑛 such that 𝑉 (𝑥𝑦𝑧) = 1. Consider the
function 𝐹 such that 𝐹 (𝑥𝑦) = 1 if there exists 𝑧 ∈ {0, 1}𝑛 such that
𝑉 (𝑥𝑦𝑧) = 1. Since 𝑉 runs in polynomial time, 𝐹 ∈ NP and hence if
P = NP, then there is an algorithm 𝑉 ′ that on input 𝑥, 𝑦 outputs 1 if
and only if there exists 𝑧 ∈ {0, 1}𝑛 such that 𝑉 (𝑥𝑦𝑧) = 1. Now we
can see that the original statement we consider is true if and only if for
every 𝑦 ∈ {0, 1}𝑛 , 𝑉 ′ (𝑥𝑦) = 1, which means it is false if and only if
the following condition (∗) holds: there exists some 𝑦 ∈ {0, 1}𝑛 such
that 𝑉 ′ (𝑥𝑦) = 0. But for every 𝑥 ∈ {0, 1}^𝑛 , the question of whether
condition (∗) holds is itself an NP question (as we assumed 𝑉 ′ can be computed
in polynomial time) and hence under the assumption that P = NP
we can determine in polynomial time whether condition (∗), and
hence our original statement, is true.

Proof of Theorem 16.6. We prove the theorem by induction. We assume


that there is a polynomial-time algorithm SOLVE𝑉 ,𝑎−1 that can solve
the problem (16.4) for 𝑎 − 1 and use that to solve the problem for 𝑎.
For 𝑎 = 1, SOLVE𝑉 ,𝑎−1 (𝑥) = 1 iff 𝑉 (𝑥) = 1 which is a polynomial-time
computation since 𝑉 runs in polynomial time. For every 𝑥, 𝑦0 , define
the statement 𝜑𝑥,𝑦0 to be the following:

𝜑𝑥,𝑦0 = ∀𝑦1 ∈{0,1}𝑚 ∃𝑦2 ∈{0,1}𝑚 ⋯ 𝒬𝑦𝑎−1 ∈{0,1}𝑚 𝑉 (𝑥𝑦0 𝑦1 ⋯ 𝑦𝑎−1 ) = 1

By the definition of SOLVE𝑉 ,𝑎 , for every 𝑥 ∈ {0, 1}𝑛 , our goal is


that SOLVE𝑉 ,𝑎 (𝑥) = 1 if and only if there exists 𝑦0 ∈ {0, 1}𝑚 such that
𝜑𝑥,𝑦0 is true.
The negation of 𝜑𝑥,𝑦0 is the statement

𝜑̄𝑥,𝑦0 = ∃𝑦1 ∈{0,1}^𝑚 ∀𝑦2 ∈{0,1}^𝑚 ⋯ 𝒬̄𝑦𝑎−1 ∈{0,1}^𝑚 𝑉 (𝑥𝑦0 𝑦1 ⋯ 𝑦𝑎−1 ) = 0

where 𝒬̄ is ∃ if 𝒬 was ∀ and 𝒬̄ is ∀ otherwise. (Please stop and verify
that you understand why this is true; this is a generalization of the fact
that if Ψ is some logical condition then the negation of ∃𝑦 ∀𝑧 Ψ(𝑦, 𝑧) is
∀𝑦 ∃𝑧 ¬Ψ(𝑦, 𝑧).)
The crucial observation is that 𝜑̄𝑥,𝑦0 is exactly a statement of the
form we consider with 𝑎 − 1 quantifiers instead of 𝑎, and hence by
our inductive hypothesis there is some polynomial time algorithm
𝑆̄ that on input 𝑥𝑦0 outputs 1 if and only if 𝜑̄𝑥,𝑦0 is true. If we let 𝑆
be the algorithm that on input 𝑥, 𝑦0 outputs 1 − 𝑆̄(𝑥𝑦0 ), then we see
that 𝑆 outputs 1 if and only if 𝜑𝑥,𝑦0 is true. Hence we can rephrase the
original statement (16.4) as follows:

∃𝑦0 ∈{0,1}𝑚 𝑆(𝑥𝑦0 ) = 1 (16.5)

but since 𝑆 is a polynomial-time algorithm, Eq. (16.5) is clearly a


statement in NP and hence under our assumption that P = NP there is
a polynomial time algorithm that on input 𝑥 ∈ {0, 1}𝑛 , will determine
if (16.5) is true and so also if the original statement (16.4) is true.
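
The inductive structure of this proof can be mirrored in a short recursive Python sketch. The helper np_exists is a hypothetical oracle: if P = NP it could be implemented in polynomial time, and here it is realized by exponential-time brute force purely so the sketch runs on tiny parameters. Note also that in the actual proof each level of the recursion is first turned into a genuine polynomial-time algorithm before being handed to the NP decision procedure for the next level.

from itertools import product

def np_exists(pred, m):
    # Stand-in for the NP oracle: does some y in {0,1}^m satisfy pred(y)?
    # (Brute force here; a polynomial-time version would exist if P = NP.)
    return any(pred(y) for y in product((0, 1), repeat=m))

def solve(V, x, quantifiers, m, prefix=()):
    # Decides  Q0 y0  Q1 y1 ...  V(x, y0, y1, ...) = 1, where quantifiers is a
    # string over {"E", "A"} ("E" for exists, "A" for forall).
    if not quantifiers:
        return V(x, *prefix) == 1
    rest = quantifiers[1:]
    if quantifiers[0] == "E":
        return np_exists(lambda y: solve(V, x, rest, m, prefix + (y,)), m)
    # "forall y: Phi(y)" is the same as "not exists y: not Phi(y)"
    return not np_exists(lambda y: not solve(V, x, rest, m, prefix + (y,)), m)

# Toy check: there exists y0 that bitwise dominates every y1 (namely y0 = (1,1)).
V = lambda x, y0, y1: 1 if all(a >= b for a, b in zip(y0, y1)) else 0
print(solve(V, None, "EA", 2))   # prints True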

The algorithm of Theorem 16.6 can solve the search problem
as well: find the value 𝑦0 that certifies the truth of (16.4). We note
that while this algorithm is in polynomial time, the exponent of this
polynomial blows up quite fast. If the original NANDSAT algorithm
required Ω(𝑛²) time, solving 𝑎 levels of quantifiers would require time
Ω(𝑛^{2^𝑎}).⁴

⁴ We do not know whether such loss is inherent. As far as we can tell, it's possible that the quantified boolean formula problem has a linear-time algorithm. We will, however, see later in this course that it satisfies a notion known as PSPACE-hardness that is even stronger than NP-hardness.

16.4.1 Application: self-improving algorithm for 3SAT
Suppose that we found a polynomial-time algorithm 𝐴 for 3SAT that
is “good but not great”. For example, maybe our algorithm runs in
time 𝑐𝑛² for some not too small constant 𝑐. However, it's possible
that the best possible SAT algorithm is actually much more efficient
than that. Perhaps, as we guessed before, there is a circuit 𝐶 ∗ of at
most 10^6 𝑛 gates that computes 3SAT on 𝑛 variables, and we simply
haven’t discovered it yet. We can use Theorem 16.6 to “bootstrap” our
original “good but not great” 3SAT algorithm to discover the optimal
one. The idea is that we can phrase the question of whether there
exists a size 𝑠 circuit that computes 3SAT for all length 𝑛 inputs as
follows: there exists a size ≤ 𝑠 circuit 𝐶 such that for every formula 𝜑
described by a string of length at most 𝑛, if 𝐶(𝜑) = 1 then there exists
an assignment 𝑥 to the variables of 𝜑 that satisfies it. One can see that
this is a statement of the form (16.2) and hence if P = NP we can solve
it in polynomial time as well. We can therefore imagine investing huge
computational resources in running 𝐴 one time to discover the circuit
𝐶 ∗ and then using 𝐶 ∗ for all further computation.

16.5 APPROXIMATING COUNTING PROBLEMS AND POSTERIOR


SAMPLING (ADVANCED, OPTIONAL)
Given a Boolean circuit 𝐶, if P = NP then we can find an input 𝑥 (if
one exists) such that 𝐶(𝑥) = 1. But what if there is more than one 𝑥
like that? Clearly we can’t efficiently output all such 𝑥’s; there might
be exponentially many. But we can get an arbitrarily good multiplica-
tive approximation (i.e., a 1±𝜖 factor for arbitrarily small 𝜖 > 0) for the
number of such 𝑥’s, as well as output a (nearly) uniform member of
this set. The details are beyond the scope of this book, but this result is
formally stated in the following theorem (whose proof is omitted).

Theorem 16.7 — Approximate counting if P = NP. Let 𝑉 ∶ {0, 1}∗ → {0, 1}
be some polynomial-time algorithm, and suppose that P = NP.
Then there exists an algorithm COUNT𝑉 that on input 𝑥, 1^𝑚 , 𝜖,
runs in time polynomial in |𝑥|, 𝑚, 1/𝜖 and outputs a number in
[2^𝑚 + 1] satisfying

(1−𝜖)COUNT𝑉 (𝑥, 𝑚, 𝜖) ≤ ∣{𝑦 ∈ {0, 1}^𝑚 ∶ 𝑉 (𝑥𝑦) = 1}∣ ≤ (1+𝜖)COUNT𝑉 (𝑥, 𝑚, 𝜖) .



In other words, the algorithm COUNT𝑉 gives an approximation


up to a factor of 1 ± 𝜖 for the number of witnesses for 𝑥 with respect
to the verifying algorithm 𝑉 . Once again, to understand this theorem
it can be useful to see how it implies that if P = NP then there is a
polynomial-time algorithm that given a graph 𝐺 and a number 𝑘,
can compute a number 𝐾 that is within a 1 ± 0.01 factor equal to the
number of simple paths in 𝐺 of length 𝑘. (That is, 𝐾 is between 0.99 to
1.01 times the number of such paths.)

Posterior sampling and probabilistic programming. The algorithm for count-
ing can also be extended to sampling from a given posterior distri-
bution. That is, if 𝐶 ∶ {0, 1}^𝑛 → {0, 1}^𝑚 is a Boolean circuit and
𝑦 ∈ {0, 1}^𝑚 , then if P = NP we can sample from (a close approx-
imation of) the distribution of uniform 𝑥 ∈ {0, 1}^𝑛 conditioned on
𝐶(𝑥) = 𝑦. This task is known as posterior sampling and is crucial for
Bayesian data analysis. These days it is known how to achieve pos-
terior sampling only for circuits 𝐶 of very special form, and even in
these cases more often than not we do not have guarantees on the quality
of the sampling algorithm. The field of making inferences by sampling
from posterior distributions specified by circuits or programs is known
as probabilistic programming.

16.6 WHAT DOES ALL OF THIS IMPLY?


So, what will happen if we have a 10^6 𝑛 algorithm for 3SAT? We have
mentioned that NP-hard problems arise in many contexts, and indeed
scientists, engineers, programmers and others routinely encounter
such problems in their daily work. A better 3SAT algorithm will prob-
ably make their lives easier, but that is the wrong place to look for
the most foundational consequences. Indeed, while the invention of
electronic computers did of course make it easier to do calculations
that people were already doing with mechanical devices and pen and
paper, the main applications computers are used for today were not
even imagined before their invention.
An exponentially faster algorithm for all NP problems would be
no less radical an improvement (and indeed, in some sense would
be more) than the computer itself, and it is as hard for us to imagine
what it would imply as it was for Babbage to envision today’s world.
For starters, such an algorithm would completely change the way we
program computers. Since we could automatically find the “best”
(in any measure we chose) program that achieves a certain task, we
would not need to define how to achieve a task, but only specify tests
as to what would be a good solution, and could also ensure that a
program satisfies an exponential number of tests without actually
running them.

The possibility that P = NP is often described as “automating


creativity”. There is something to that analogy, as we often think of
a creative solution as one that is hard to discover but that, once the
“spark” hits, is easy to verify. But there is also an element of hubris
to that statement, implying that the most impressive consequence of
such an algorithmic breakthrough will be that computers would suc-
ceed in doing something that humans already do today. Nevertheless,
artificial intelligence, like many other fields, will clearly be greatly
impacted by an efficient 3SAT algorithm. For example, it is clearly
much easier to find a better Chess-playing algorithm when, given any
algorithm 𝑃 , you can find the smallest algorithm 𝑃 ′ that plays Chess
better than 𝑃 . Moreover, as we mentioned above, much of machine
learning (and statistical reasoning in general) is about finding “sim-
ple” concepts that explain the observed data, and if NP = P, we could
search for such concepts automatically for any notion of “simplicity”
we see fit. In fact, we could even “skip the middle man” and do an
automatic search for the learning algorithm with smallest general-
ization error. Ultimately the field of Artificial Intelligence is about
trying to “shortcut” billions of years of evolution to obtain artificial
programs that match (or beat) the performance of natural ones, and a
fast algorithm for NP would provide the ultimate shortcut. (One in-
teresting theory is that P = NP and evolution has already discovered
this algorithm, which we are already using without realizing it. At the
moment, there seems to be very little evidence for such a scenario. In
fact, we have some partial results in the other direction showing that,
regardless of whether P = NP, many types of “local search” or “evolu-
tionary” algorithms require exponential time to solve 3SAT and other
NP-hard problems.)
More generally, a faster algorithm for NP problems would be im-
mensely useful in any field where one is faced with computational or
quantitative problems− which is basically all fields of science, math,
and engineering. This will not only help with concrete problems such
as designing a better bridge, or finding a better drug, but also with
addressing basic mysteries such as trying to find scientific theories or
“laws of nature”. In a fascinating talk, physicist Nima Arkani-Hamed
discusses the effort of finding scientific theories in much the same lan-
guage as one would describe solving an NP problem, for which the
solution is easy to verify or seems “inevitable”, once found, but that
requires searching through a huge landscape of possibilities to reach,
and that often can get “stuck” at local optima:
“the laws of nature have this amazing feeling of inevitability… which is associ-
ated with local perfection.”
“The classical picture of the world is the top of a local mountain in the space of
ideas. And you go up to the top and it looks amazing up there and absolutely

incredible. And you learn that there is a taller mountain out there. Find it,
Mount Quantum…. they’re not smoothly connected … you’ve got to make a
jump to go from classical to quantum … This also tells you why we have such
major challenges in trying to extend our understanding of physics. We don’t
have these knobs, and little wheels, and twiddles that we can turn. We have to
learn how to make these jumps. And it is a tall order. And that’s why things are
difficult.”

Finding an efficient algorithm for NP amounts to always being able


to search through an exponential space and find not just the “local”
mountain, but the tallest peak.
But perhaps more than any computational speedups, a fast algo-
rithm for NP problems would bring about a new type of understanding.
In many of the areas where NP-completeness arises, it is not as much
a barrier for solving computational problems as it is a barrier for ob-
taining “closed-form formulas” or other types of more constructive
descriptions of the behavior of natural, biological, social and other sys-

tems. A better algorithm for NP, even if it is "merely" 2^√𝑛-time, seems
to require obtaining a new way to understand these types of systems,
whether it is characterizing Nash equilibria, spin-glass configurations,
entangled quantum states, or any of the other questions where NP is
currently a barrier for analytical understanding. Such new insights
would be very fruitful regardless of their computational utility.

 Big Idea 23 If P = NP, we can efficiently solve a fantastic number


of decision, search, optimization, counting, and sampling problems
from all areas of human endeavors.

16.7 CAN P ≠ NP BE NEITHER TRUE NOR FALSE?


The Continuum Hypothesis is a conjecture made by Georg Cantor in
1878, positing the non-existence of a certain type of infinite cardinality.
(One way to phrase it is that for every infinite subset 𝑆 of the real
numbers ℝ, either there is a one-to-one and onto function 𝑓 ∶ 𝑆 → ℝ
or there is a one-to-one and onto function 𝑓 ∶ 𝑆 → ℕ.) This was
considered one of the most important open problems in set theory,
and settling its truth or falseness was the first problem put forward by
Hilbert in the 1900 address we mentioned before. However, using the
theories developed by Gödel and Turing, in 1963 Paul Cohen proved
that both the Continuum Hypothesis and its negation are consistent
with the standard axioms of set theory (i.e., the Zermelo-Fraenkel
axioms + the Axiom of choice, or “ZFC” for short). Formally, what
he proved is that if ZFC is consistent, then so is ZFC when we assume
either the continuum hypothesis or its negation.

Today, many (though not all) mathematicians interpret this result


as saying that the Continuum Hypothesis is neither true nor false, but
rather is an axiomatic choice that we are free to make one way or the
other. Could the same hold for P ≠ NP?
In short, the answer is No. For example, suppose that we are try-
ing to decide between the "3SAT is easy" conjecture (there is a 10^6 𝑛
time algorithm for 3SAT) and the "3SAT is hard" conjecture (for ev-
ery 𝑛, any NAND-CIRC program that solves 𝑛 variable 3SAT takes
2^{10^{−6} 𝑛} lines). Then, since for 𝑛 = 10^8 , 2^{10^{−6} 𝑛} > 10^6 𝑛, this boils down
to the finite question of deciding whether or not there is a 10^{13}-line
NAND-CIRC program deciding 3SAT on formulas with 10^8 variables.
If there is such a program then there is a finite proof of its existence,
namely the approximately 1TB file describing the program, and for
which the verification is the (finite in principle though infeasible in
practice) process of checking that it succeeds on all inputs.⁵ If there
isn't such a program, then there is also a finite proof of that, though
any such proof would take longer since we would need to enumer-
ate over all programs as well. Ultimately, since it boils down to a finite
statement about bits and numbers, either the statement or its negation
must follow from the standard axioms of arithmetic in a finite number
of arithmetic steps. Thus, we cannot justify our ignorance in distin-
guishing between the "3SAT easy" and "3SAT hard" cases by claiming
that this might be an inherently ill-defined question. Similar reason-
ing (with different numbers) applies to other variants of the P vs NP
question. We note that in the case that 3SAT is hard, it may well be
that there is no short proof of this fact using the standard axioms, and
this is a question that people have been studying in various restricted
forms of proof systems.

⁵ This inefficiency is not necessarily inherent. Results in program-checking, interactive proofs, and average-case complexity can be used for efficient verification of proofs of related statements. In contrast, the inefficiency of verifying failure of all programs could well be inherent.
this is a question that people have been studying in various restricted
forms of proof systems.

16.8 IS P = NP “IN PRACTICE”?


The fact that a problem is NP-hard means that we believe there is no
efficient algorithm that solves it in the worst case. It does not, however,
mean that every single instance of the problem is hard. For exam-
ple, if all the clauses in a 3SAT instance 𝜑 contain the same variable
𝑥𝑖 (possibly in negated form), then by guessing a value to 𝑥𝑖 we can
reduce 𝜑 to a 2SAT instance which can then be efficiently solved. Gen-
eralizations of this simple idea are used in “SAT solvers”, which are
algorithms that have solved certain specific interesting SAT formulas
with thousands of variables, despite the fact that we believe SAT to
be exponentially hard in the worst case. Similarly, a lot of problems
arising in economics and machine learning are NP-hard.⁶ And yet
vendors and customers manage to figure out market-clearing prices
(as economists like to point out, there is milk on the shelves) and mice
succeed in distinguishing cats from dogs. Hence people (and ma-
chines) seem to regularly succeed in solving interesting instances of
NP-hard problems, typically by using some combination of guessing
and making local improvements.

⁶ The computational difficulty of problems in economics such as finding optimal (or any) equilibria is quite subtle. Some variants of such problems are NP-hard, while others have a certain "intermediate" complexity.
It is also true that there are many interesting instances of NP-hard
problems that we do not currently know how to solve. Across all ap-
plication areas, whether it is scientific computing, optimization, con-
trol or more, people often encounter hard instances of NP problems
on which our current algorithms fail. In fact, as we will see, all of our
digital security infrastructure relies on the fact that some concrete and
easy-to-generate instances of, say, 3SAT (or, equivalently, any other
NP-hard problem) are exponentially hard to solve.
Thus it would be wrong to say that NP is easy “in practice”, nor
would it be correct to take NP-hardness as the “final word” on the
complexity of a problem, particularly when we have more informa-
tion about how any given instance is generated. Understanding both
the “typical complexity” of NP problems, as well as the power and
limitations of certain heuristics (such as various local-search based al-
gorithms) is a very active area of research. We will see more on these
topics later in this course.

16.9 WHAT IF P ≠ NP?


So, P = NP would give us all kinds of fantastical outcomes. But we
strongly suspect that P ≠ NP, and moreover that there is no much-
better-than-brute-force algorithm for 3SAT. If indeed that is the case, is
it all bad news?
One might think that an impossibility result, telling you that you
cannot do something, is the kind of cloud that does not have a silver
lining. But in fact, as we already alluded to before, it does. A hard
(in a sufficiently strong sense) problem in NP can be used to create a
code that cannot be broken, a task that for thousands of years has been
the dream of not just spies but of many scientists and mathematicians
over the generations. But the complexity viewpoint turned out to
yield much more than simple codes, achieving tasks that people had
previously not even dared to dream of. These include the notion of
public key cryptography, allowing two people to communicate securely
without ever having exchanged a secret key; electronic cash, allowing
private and secure transaction without a central authority; and secure
multiparty computation, enabling parties to compute a joint function on
private inputs without revealing any extra information about it. Also,
as we will see, computational hardness can be used to replace the role
of randomness in many settings.
Furthermore, while it is often convenient to pretend that computa-
tional problems are simply handed to us, and that our job as computer
scientists is to find the most efficient algorithm for them, this is not

how things work in most computing applications. Typically even for-


mulating the problem to solve is a highly non-trivial task. When we
discover that the problem we want to solve is NP-hard, this might be a
useful sign that we used the wrong formulation for it.
Beyond all these, the quest to understand computational hardness
− including the discoveries of lower bounds for restricted compu-
tational models, as well as new types of reductions (such as those
arising from “probabilistically checkable proofs”) − has already had
surprising positive applications to problems in algorithm design, as
well as in coding for both communication and storage. This is not
surprising since, as we mentioned before, from group theory to the
theory of relativity, the pursuit of impossibility results has often been
one of the most fruitful enterprises of mankind.

✓ Chapter Recap

• The question of whether P = NP is one of the


most important and fascinating questions of com-
puter science and science at large, touching on all
fields of the natural and social sciences, as well as
mathematics and engineering.
• Our current evidence and understanding supports
the “SAT hard” scenario that there is no much-
better-than-brute-force algorithm for 3SAT or many
other NP-hard problems.
• We are very far from proving this, however. Re-
searchers have studied proving lower bounds on
the number of gates to compute explicit functions
in restricted forms of circuits, and have made some
advances in this effort, along the way generating
mathematical tools that have found other uses.
However, we have made essentially no headway
in proving lower bounds for general models of
computation such as Boolean circuits and Turing
machines. Indeed, we currently do not even know
how to rule out the possibility that for every 𝑛 ∈ ℕ,
SAT restricted to 𝑛-length inputs has a Boolean
circuit of less than 10𝑛 gates (even though there
exist 𝑛-input functions that require at least 2^𝑛 /(10𝑛)
gates to compute).
• Understanding how to cope with this compu-
tational intractability, and even benefit from it,
comprises much of the research in theoretical
computer science.

16.10 EXERCISES

16.11 BIBLIOGRAPHICAL NOTES


As mentioned before, Aaronson’s survey [Aar16] is a great exposition
of the P vs NP problem. Another recommended survey by Aaronson
is [Aar05] which discusses the question of whether NP complete
problems could be computed by any physical means.
The paper [BU11] discusses some results about problems in the
polynomial hierarchy.
17
Space bounded computation

PLAN: Example of space bounded algorithms, importance of pre-


serving space. The classes L and PSPACE, space hierarchy theorem,
PSPACE=NPSPACE, constant space = regular languages.

17.1 EXERCISES

17.2 BIBLIOGRAPHICAL NOTES



IV
RANDOMIZED COMPUTATION
18

Probability Theory 101

Learning Objectives:
• Review the basic notion of probability theory that we will use.
• Sample spaces, and in particular the space {0, 1}^𝑛
• Events, probabilities of unions and intersections.
• Random variables and their expectation, variance, and standard deviation.
• Independence and correlation for both events and random variables.
• Markov, Chebyshev and Chernoff tail bounds (bounding the probability that a random variable will deviate from its expectation).

“God doesn’t play dice with the universe”, Albert Einstein

“Einstein was doubly wrong … not only does God definitely play dice, but He
sometimes confuses us by throwing them where they can’t be seen.”, Stephen
Hawking

“ ‘The probability of winning a battle has no place in our theory because it


does not belong to any [random experiment]. Probability cannot be applied
to this problem any more than the physical concept of work can be applied to
the ’work’ done by an actor reciting his part.”, Richard Von Mises, 1928
(paraphrased)

“I am unable to see why ‘objectivity’ requires us to interpret every probability


as a frequency in some random experiment; particularly when in most problems
probabilities are frequencies only in an imaginary universe invented just for the
purpose of allowing a frequency interpretation.”, E.T. Jaynes, 1976

Before we show how to use randomness in algorithms, let us do a


quick review of some basic notions in probability theory. This is not
meant to replace a course on probability theory, and if you have not
seen this material before, I highly recommend you look at additional
resources to get up to speed. Fortunately, we will not need many of
the advanced notions of probability theory, but, as we will see, even
the so-called “simple” setting of tossing 𝑛 coins can lead to very subtle
and interesting issues.

This chapter: A non-mathy overview


This chapter contains an overview of the basics of probability
theory, as needed for understanding randomized computa-
tion. The main topics covered are the notions of:

1. A sample space, which for us will almost always consist


of the set of all possible outcomes of the experiment of
tossing a finite number of independent coins.




2. An event, which is simply a subset of the sample space,


with the probability of the event happening being the
fraction of outcomes that are in this subset.

3. A random variable, which is a way to assign some number


or statistic to an outcome of the sample space.

4. The notion of conditioning, which corresponds to how the


value of a random variable (or the probability of an event)
changes if we restrict attention to outcomes for which the
value of another variable is known (or for which some
other event has happened). Random variables and events
that have no impact on one another are called independent.

5. Expectation, which is the average of a random variable,


and concentration bounds which quantify the probability
that a random variable can “stray too far” from its ex-
pected value.

These concepts are at once both basic and subtle. While


we will not need many “fancy” topics covered in statistics
courses, including special distributions (e.g., geometric,
Poisson, exponential, Gaussian, etc.), nor topics such as hy-
pothesis testing or regression, this doesn’t mean that the
probability we use is “trivial”. The human brain has not
evolved to do probabilistic reasoning very well, and notions
such as conditioning and independence can be quite subtle
and confusing even in the basic setting of tossing a random
coin. However, this is all the more reason that studying these
notions in this basic setting is useful not just for following
this book, but also as a strong foundation for “fancier topics”.

18.1 RANDOM COINS


The nature of randomness and probability is a topic of great philo-
sophical, scientific and mathematical depth. Is there actual random-
ness in the world, or does it proceed in a deterministic clockwork fash-
ion from some initial conditions set at the beginning of time? Does
probability refer to our uncertainty of beliefs, or to the frequency of
occurrences in repeated experiments? How can we define probability
over infinite sets?
These are all important questions that have been studied and de-
bated by scientists, mathematicians, statisticians and philosophers.
Fortunately, we will not need to deal directly with these questions
here. We will be mostly interested in the setting of tossing 𝑛 random,

unbiased and independent coins. Below we define the basic proba-


bilistic objects of events and random variables when restricted to this
setting. These can be defined for much more general probabilistic ex-
periments or sample spaces, and later on we will briefly discuss how
this can be done. However, the 𝑛-coin case is sufficient for almost
everything we’ll need in this course.
If instead of “heads” and “tails” we encode the sides of each coin
by “zero” and “one”, we can encode the result of tossing 𝑛 coins as
a string in {0, 1}𝑛 . Each particular outcome 𝑥 ∈ {0, 1}𝑛 is obtained
with probability 2−𝑛 . For example, if we toss three coins, then we
obtain each of the 8 outcomes 000, 001, 010, 011, 100, 101, 110, 111
with probability 2−3 = 1/8 (see also Fig. 18.1). We can describe the
experiment of tossing 𝑛 coins as choosing a string 𝑥 uniformly at
random from {0, 1}𝑛 , and hence we’ll use the shorthand 𝑥 ∼ {0, 1}𝑛
for 𝑥 that is chosen according to this experiment.
An event is simply a subset 𝐴 of {0, 1}𝑛 . The probability of 𝐴, de-
noted by Pr𝑥∼{0,1}𝑛 [𝐴] (or Pr[𝐴] for short, when the sample space is
understood from the context), is the probability that an 𝑥 chosen uni-
formly at random will be contained in 𝐴. Note that this is the same as
|𝐴|/2𝑛 (where |𝐴| as usual denotes the number of elements in the set
𝐴). For example, the probability that 𝑥 has an even number of ones
is Pr[𝐴] where 𝐴 = {𝑥 ∶ ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 = 0 mod 2}. In the case 𝑛 = 3,
𝐴 = {000, 011, 101, 110}, and hence Pr[𝐴] = 4/8 = 1/2 (see Fig. 18.2). It
turns out this is true for every 𝑛:

Lemma 18.1 For every 𝑛 > 0,

Pr_{𝑥∼{0,1}^𝑛} [ ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 is even ] = 1/2 .

P
To test your intuition on probability, try to stop here
and prove the lemma on your own.

Figure 18.1: The probabilistic experiment of tossing three coins corresponds to making 2 × 2 × 2 = 8 choices, each with equal probability. In this example, the blue set corresponds to the event 𝐴 = {𝑥 ∈ {0, 1}^3 | 𝑥0 = 0} where the first coin toss is equal to 0, and the pink set corresponds to the event 𝐵 = {𝑥 ∈ {0, 1}^3 | 𝑥1 = 1} where the second coin toss is equal to 1 (with their intersection having a purplish color). As we can see, each of these events contains 4 elements (out of 8 total) and so has probability 1/2. The intersection of 𝐴 and 𝐵 contains two elements, and so the probability that both of these events occur is 2/8 = 1/4.

Figure 18.2: The event that if we toss three coins 𝑥0, 𝑥1, 𝑥2 ∈ {0, 1} then the sum of the 𝑥𝑖's is even has probability 1/2 since it corresponds to exactly 4 out of the 8 possible strings of length 3.

Proof of Lemma 18.1. We prove the lemma by induction on 𝑛. For the


case 𝑛 = 1 it is clear since 𝑥 = 0 is even and 𝑥 = 1 is odd, and hence
the probability that 𝑥 ∈ {0, 1} is even is 1/2. Let 𝑛 > 1. We assume
by induction that the lemma is true for 𝑛 − 1 and we will prove it
for 𝑛. We split the set {0, 1}𝑛 into four disjoint sets 𝐸0 , 𝐸1 , 𝑂0 , 𝑂1 ,
where for 𝑏 ∈ {0, 1}, 𝐸𝑏 is defined as the set of 𝑥 ∈ {0, 1}𝑛 such that
𝑥0 ⋯ 𝑥𝑛−2 has an even number of ones and 𝑥𝑛−1 = 𝑏, and similarly 𝑂𝑏 is
the set of 𝑥 ∈ {0, 1}^𝑛 such that 𝑥0 ⋯ 𝑥𝑛−2 has an odd number of ones and
𝑥𝑛−1 = 𝑏. Since 𝐸0 is obtained by simply extending (𝑛 − 1)-length strings
with an even number of ones by the digit 0, the size of 𝐸0 is simply the
number of such 𝑛 − 1-length strings which by the induction hypothesis


is 2𝑛−1 /2 = 2𝑛−2 . The same reasoning applies for 𝐸1 , 𝑂0 , and 𝑂1 .
Hence each one of the four sets 𝐸0 , 𝐸1 , 𝑂0 , 𝑂1 is of size 2𝑛−2 . Since
𝑥 ∈ {0, 1}𝑛 has an even number of ones if and only if 𝑥 ∈ 𝐸0 ∪ 𝑂1
(i.e., either the first 𝑛 − 1 coordinates sum up to an even number and
the final coordinate is 0 or the first 𝑛 − 1 coordinates sum up to an odd
number and the final coordinate is 1), we get that the probability that
𝑥 satisfies this property is

|𝐸0 ∪ 𝑂1|/2^𝑛 = (2^{𝑛−2} + 2^{𝑛−2})/2^𝑛 = 1/2 ,
using the fact that 𝐸0 and 𝑂1 are disjoint and hence |𝐸0 ∪ 𝑂1 | =
|𝐸0 | + |𝑂1 |.
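
To make these definitions concrete, here is a short Python sketch (our own illustration, not part of the formal development; the helper names are ours) that enumerates the sample space {0, 1}^𝑛 for small 𝑛 and checks Lemma 18.1 by brute force.

# Illustrative sketch (not from the book): verify Lemma 18.1 by enumeration.
from itertools import product

def prob(event, n):
    # Probability of `event` (a predicate on tuples in {0,1}^n)
    # under the uniform distribution over {0,1}^n.
    outcomes = list(product([0, 1], repeat=n))
    return sum(1 for x in outcomes if event(x)) / len(outcomes)

def even_ones(x):
    return sum(x) % 2 == 0

for n in range(1, 8):
    assert prob(even_ones, n) == 1 / 2
print("Pr[number of ones is even] = 1/2 for n = 1,...,7")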

We can also use the intersection (∩) and union (∪) operators to
talk about the probability of both event 𝐴 and event 𝐵 happening, or
the probability of event 𝐴 or event 𝐵 happening. For example, the
probability 𝑝 that 𝑥 has an even number of ones and 𝑥0 = 1 is the same
𝑛−1
as Pr[𝐴 ∩ 𝐵] where 𝐴 = {𝑥 ∈ {0, 1}𝑛 ∶ ∑𝑖=0 𝑥𝑖 = 0 mod 2} and
𝐵 = {𝑥 ∈ {0, 1}𝑛 ∶ 𝑥0 = 1}. This probability is equal to 1/4 for
𝑛 > 1. (It is a great exercise for you to pause here and verify that you
understand why this is the case.)
Because intersection corresponds to considering the logical AND
of the conditions that two events happen, while union corresponds
to considering the logical OR, we will sometimes use the ∧ and ∨
operators instead of ∩ and ∪, and so write this probability 𝑝 = Pr[𝐴 ∩
𝐵] defined above also as
Pr_{𝑥∼{0,1}^𝑛} [ ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 = 0 mod 2 ∧ 𝑥0 = 1 ] .

If 𝐴 ⊆ {0, 1}^𝑛 is an event, then 𝐴̄ = {0, 1}^𝑛 ⧵ 𝐴 corresponds to the
event that 𝐴 does not happen. Since |𝐴̄| = 2^𝑛 − |𝐴|, we get that

Pr[𝐴̄] = |𝐴̄|/2^𝑛 = (2^𝑛 − |𝐴|)/2^𝑛 = 1 − |𝐴|/2^𝑛 = 1 − Pr[𝐴] .

This makes sense: since 𝐴̄ happens if and only if 𝐴 does not happen,
the probability of 𝐴̄ should be one minus the probability of 𝐴.

R
Remark 18.2 — Remember the sample space. While the
above definition might seem very simple and almost
trivial, the human mind seems not to have evolved for
probabilistic reasoning, and it is surprising how often
people can get even the simplest settings of probability
wrong. One way to make sure you don’t get confused
when trying to calculate probability statements is


to always ask yourself the following two questions:
(1) Do I understand what is the sample space that
this probability is taken over?, and (2) Do I under-
stand what is the definition of the event that we are
analyzing?.
For example, suppose that I were to randomize seating
in my course, and then it turned out that students
sitting in row 7 performed better on the final: how
surprising should we find this? If we started out with
the hypothesis that there is something special about
the number 7 and chose it ahead of time, then the
event that we are discussing is the event 𝐴 that stu-
dents sitting in row 7 had better performance on
the final, and we might find it surprising. However, if
we first looked at the results and then chose the row
whose average performance is best, then the event
we are discussing is the event 𝐵 that there exists some
row where the performance is higher than the over-
all average. 𝐵 is a superset of 𝐴, and its probability
(even if there is no correlation between sitting and
performance) can be quite significant.

18.1.1 Random variables


Events correspond to Yes/No questions, but often we want to analyze
finer questions. For example, if we make a bet at the roulette wheel,
we don’t want to just analyze whether we won or lost, but also how
much we’ve gained. A (real valued) random variable is simply a way
to associate a number with the result of a probabilistic experiment.
Formally, a random variable is a function 𝑋 ∶ {0, 1}𝑛 → ℝ that maps
every outcome 𝑥 ∈ {0, 1}𝑛 to an element 𝑋(𝑥) ∈ ℝ. For example, the
function SUM ∶ {0, 1}𝑛 → ℝ that maps 𝑥 to the sum of its coordinates
𝑛−1
(i.e., to ∑𝑖=0 𝑥𝑖 ) is a random variable.
The expectation of a random variable 𝑋, denoted by 𝔼[𝑋], is the
average value that this number takes, taken over all draws from the
probabilistic experiment. In other words, the expectation of 𝑋 is de-
fined as follows:
𝔼[𝑋] = ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛} 𝑋(𝑥) .

If 𝑋 and 𝑌 are random variables, then we can define 𝑋 + 𝑌 as


simply the random variable that maps a point 𝑥 ∈ {0, 1}𝑛 to 𝑋(𝑥) +
𝑌 (𝑥). One basic and very useful property of the expectation is that it
is linear:
Lemma 18.3 — Linearity of expectation.

𝔼[𝑋 + 𝑌 ] = 𝔼[𝑋] + 𝔼[𝑌 ]


Proof.

𝔼[𝑋 + 𝑌 ] = ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛} (𝑋(𝑥) + 𝑌 (𝑥))
          = ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛} 𝑋(𝑥) + ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛} 𝑌 (𝑥)
          = 𝔼[𝑋] + 𝔼[𝑌 ] .

Similarly, 𝔼[𝑘𝑋] = 𝑘 𝔼[𝑋] for every 𝑘 ∈ ℝ.


Solved Exercise 18.1 — Expectation of sum. Let 𝑋 ∶ {0, 1}^𝑛 → ℝ be the
random variable that maps 𝑥 ∈ {0, 1}^𝑛 to 𝑥0 + 𝑥1 + … + 𝑥𝑛−1 . Prove
that 𝔼[𝑋] = 𝑛/2.


Solution:
We can solve this using the linearity of expectation. We can de-
fine random variables 𝑋0 , 𝑋1 , … , 𝑋𝑛−1 such that 𝑋𝑖 (𝑥) = 𝑥𝑖 . Since
each 𝑥𝑖 equals 1 with probability 1/2 and 0 with probability 1/2,
𝔼[𝑋𝑖 ] = 1/2. Since 𝑋 = ∑_{𝑖=0}^{𝑛−1} 𝑋𝑖 , by the linearity of expectation

𝔼[𝑋] = 𝔼[𝑋0 ] + 𝔼[𝑋1 ] + ⋯ + 𝔼[𝑋𝑛−1 ] = 𝑛/2 .

P
If you have not seen discrete probability before, please
go over this argument again until you are sure you
follow it; it is a prototypical simple example of the
type of reasoning we will employ again and again in
this course.
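
To double-check this reasoning, the following sketch (ours; not from the text) computes 𝔼[𝑋] by direct enumeration over {0, 1}^𝑛 and confirms both the value 𝑛/2 and the linearity-of-expectation argument.

# Illustrative sketch (not from the book): expectation by enumeration and linearity.
from itertools import product

def expectation(X, n):
    # E[X] for a uniformly random x in {0,1}^n, computed by summing over all 2^n outcomes.
    return sum(X(x) for x in product([0, 1], repeat=n)) / 2**n

n = 6
SUM = lambda x: sum(x)                            # X(x) = x_0 + ... + x_{n-1}
coords = [lambda x, i=i: x[i] for i in range(n)]  # X_i(x) = x_i

assert expectation(SUM, n) == sum(expectation(Xi, n) for Xi in coords) == n / 2
print("E[SUM] =", expectation(SUM, n))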

If 𝐴 is an event, then 1𝐴 is the random variable such that 1𝐴 (𝑥)


equals 1 if 𝑥 ∈ 𝐴, and 1𝐴 (𝑥) = 0 otherwise. Note that Pr[𝐴] = 𝔼[1𝐴 ]
(can you see why?). Using this and the linearity of expectation, we
can show one of the most useful bounds in probability theory:
Lemma 18.4 — Union bound. For every two events 𝐴, 𝐵, Pr[𝐴 ∪ 𝐵] ≤
Pr[𝐴] + Pr[𝐵].

P
Before looking at the proof, try to see why the union
bound makes intuitive sense. We can also prove
it directly from the definition of probabilities and
the cardinality of sets, together with the inequality
|𝐴 ∪ 𝐵| ≤ |𝐴| + |𝐵|. Can you see why the latter
inequality is true? (See also Fig. 18.3.)
Proof of Lemma 18.4. For every 𝑥, the variable 1𝐴∪𝐵 (𝑥) ≤ 1𝐴 (𝑥)+1𝐵 (𝑥).
Hence, Pr[𝐴∪𝐵] = 𝔼[1𝐴∪𝐵 ] ≤ 𝔼[1𝐴 +1𝐵 ] = 𝔼[1𝐴 ]+𝔼[1𝐵 ] = Pr[𝐴]+Pr[𝐵].

The way we often use this in theoretical computer science is to


argue that, for example, if there is a list of 100 bad events that can hap-
pen, and each one of them happens with probability at most 1/10000,
then with probability at least 1 − 100/10000 = 0.99, no bad event
happens.

18.1.2 Distributions over strings


While most of the time we think of random variables as having
as output a real number, we sometimes consider random vari-
ables whose output is a string. That is, we can think of a map
𝑌 ∶ {0, 1}𝑛 → {0, 1}∗ and consider the “random variable” 𝑌 such
that for every 𝑦 ∈ {0, 1}∗ , the probability that 𝑌 outputs 𝑦 is equal
to (1/2^𝑛)|{𝑥 ∈ {0, 1}^𝑛 | 𝑌 (𝑥) = 𝑦}|. To avoid confusion, we will typically
refer to such string-valued random variables as distributions over
strings. So, a distribution 𝑌 over strings {0, 1}∗ can be thought of as
a finite collection of strings 𝑦0 , … , 𝑦𝑀−1 ∈ {0, 1}∗ and probabilities
𝑝0 , … , 𝑝𝑀−1 (which are non-negative numbers summing up to one),
so that Pr[𝑌 = 𝑦𝑖 ] = 𝑝𝑖 .

Figure 18.3: The union bound tells us that the probability of 𝐴 or 𝐵 happening is at most the sum of the individual probabilities. We can see it by noting that for every two sets |𝐴 ∪ 𝐵| ≤ |𝐴| + |𝐵| (with equality only if 𝐴 and 𝐵 have no intersection).
Two distributions 𝑌 and 𝑌 ′ are identical if they assign the same
probability to every string. For example, consider the following two
functions 𝑌 , 𝑌 ′ ∶ {0, 1}2 → {0, 1}2 . For every 𝑥 ∈ {0, 1}2 , we define
𝑌 (𝑥) = 𝑥 and 𝑌 ′ (𝑥) = 𝑥0 (𝑥0 ⊕ 𝑥1 ) where ⊕ is the XOR operation.
Although these are two different functions, they induce the same
distribution over {0, 1}2 when invoked on a uniform input. The distri-
bution 𝑌 (𝑥) for 𝑥 ∼ {0, 1}2 is of course the uniform distribution over
{0, 1}2 . On the other hand 𝑌 ′ is simply the map 00 ↦ 00, 01 ↦ 01,
10 ↦ 11, 11 ↦ 10 which is a permutation of 𝑌 .
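
As a quick sanity check of this example, here is a small sketch (ours; the helper name output_distribution is hypothetical) that tabulates the two output distributions and verifies that they are identical.

# Illustrative sketch (not from the book): two different maps, one distribution.
from collections import Counter
from itertools import product

def output_distribution(f, n):
    # The distribution of f(x) for x uniform in {0,1}^n, as a dict {value: probability}.
    counts = Counter(f(x) for x in product([0, 1], repeat=n))
    return {y: c / 2**n for y, c in counts.items()}

Y      = lambda x: (x[0], x[1])         # Y(x) = x
Yprime = lambda x: (x[0], x[0] ^ x[1])  # Y'(x) = x_0 (x_0 XOR x_1)

assert output_distribution(Y, 2) == output_distribution(Yprime, 2)
print("Y and Y' induce the same (uniform) distribution over {0,1}^2")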

18.1.3 More general sample spaces


While throughout most of this book we assume that the underlying
probabilistic experiment corresponds to tossing 𝑛 independent coins,
all the claims we make easily generalize to sampling 𝑥 from a more
general finite or countable set 𝑆 (and not-so-easily generalizes to
uncountable sets 𝑆 as well). A probability distribution over a finite set
𝑆 is simply a function 𝜇 ∶ 𝑆 → [0, 1] such that ∑𝑥∈𝑆 𝜇(𝑥) = 1. We
think of this as the experiment where we obtain every 𝑥 ∈ 𝑆 with
probability 𝜇(𝑥), and sometimes denote this as 𝑥 ∼ 𝜇. In particular,
tossing 𝑛 random coins corresponds to the probability distribution
𝜇 ∶ {0, 1}𝑛 → [0, 1] defined as 𝜇(𝑥) = 2−𝑛 for every 𝑥 ∈ {0, 1}𝑛 . An
event 𝐴 is a subset of 𝑆, and the probability of 𝐴, which we denote by
Pr𝜇 [𝐴], is ∑𝑥∈𝐴 𝜇(𝑥). A random variable is a function 𝑋 ∶ 𝑆 → ℝ, where


the probability that 𝑋 = 𝑦 is equal to ∑𝑥∈𝑆 s.t. 𝑋(𝑥)=𝑦 𝜇(𝑥).

18.2 CORRELATIONS AND INDEPENDENCE


One of the most delicate but important concepts in probability is the
notion of independence (and the opposing notion of correlations). Subtle
correlations are often behind surprises and errors in probability and
statistical analysis, and several mistaken predictions have been blamed
on miscalculating the correlations between, say, housing prices in
Florida and Arizona, or voter preferences in Ohio and Michigan. See
also Joe Blitzstein’s aptly named talk “Conditioning is the Soul of
Statistics”. (Another thorny issue is of course the difference between
correlation and causation. Luckily, this is another point we don’t need to
worry about in our clean setting of tossing 𝑛 coins.)
Two events 𝐴 and 𝐵 are independent if the fact that 𝐴 happens
makes 𝐵 neither more nor less likely to happen. For example, if we
think of the experiment of tossing 3 random coins 𝑥 ∈ {0, 1}3 , and we
let 𝐴 be the event that 𝑥0 = 1 and 𝐵 the event that 𝑥0 + 𝑥1 + 𝑥2 ≥ 2,
then if 𝐴 happens it is more likely that 𝐵 happens, and hence these
events are not independent. On the other hand, if we let 𝐶 be the event
that 𝑥1 = 1, then because the second coin toss is not affected by the
result of the first one, the events 𝐴 and 𝐶 are independent.
The formal definition is that events 𝐴 and 𝐵 are independent if
Pr[𝐴 ∩ 𝐵] = Pr[𝐴] ⋅ Pr[𝐵]. If Pr[𝐴 ∩ 𝐵] > Pr[𝐴] ⋅ Pr[𝐵] then we say
that 𝐴 and 𝐵 are positively correlated, while if Pr[𝐴 ∩ 𝐵] < Pr[𝐴] ⋅ Pr[𝐵]
then we say that 𝐴 and 𝐵 are negatively correlated (see Fig. 18.4).
If we consider the above examples on the experiment of choosing
𝑥 ∈ {0, 1}3 then we can see that

Pr[𝑥0 = 1] = 1/2

Pr[𝑥0 + 𝑥1 + 𝑥2 ≥ 2] = Pr[{011, 101, 110, 111}] = 4/8 = 1/2

but

Pr[𝑥0 = 1 ∧ 𝑥0 + 𝑥1 + 𝑥2 ≥ 2] = Pr[{101, 110, 111}] = 3/8 > 1/2 ⋅ 1/2

and hence, as we already observed, the events {𝑥0 = 1} and {𝑥0 +
𝑥1 + 𝑥2 ≥ 2} are not independent and in fact are positively correlated.
On the other hand, Pr[𝑥0 = 1 ∧ 𝑥1 = 1] = Pr[{110, 111}] = 2/8 = 1/2 ⋅ 1/2
and hence the events {𝑥0 = 1} and {𝑥1 = 1} are indeed independent.

Figure 18.4: Two events 𝐴 and 𝐵 are independent if Pr[𝐴 ∩ 𝐵] = Pr[𝐴] ⋅ Pr[𝐵]. In the two figures above, the empty 𝑥 × 𝑥 square is the sample space, and 𝐴 and 𝐵 are two events in this sample space. In the left figure, 𝐴 and 𝐵 are independent, while in the right figure they are negatively correlated, since 𝐵 is less likely to occur if we condition on 𝐴 (and vice versa). Mathematically, one can see this by noticing that in the left figure the areas of 𝐴 and 𝐵 respectively are 𝑎 ⋅ 𝑥 and 𝑏 ⋅ 𝑥, and so their probabilities are 𝑎⋅𝑥/𝑥^2 = 𝑎/𝑥 and 𝑏⋅𝑥/𝑥^2 = 𝑏/𝑥 respectively, while the area of 𝐴 ∩ 𝐵 is 𝑎 ⋅ 𝑏 which corresponds to the probability 𝑎⋅𝑏/𝑥^2. In the right figure, the area of the triangle 𝐵 is 𝑏⋅𝑥/2 which corresponds to a probability of 𝑏/(2𝑥), but the area of 𝐴 ∩ 𝐵 is 𝑏′⋅𝑎/2 for some 𝑏′ < 𝑏. This means that the probability of 𝐴 ∩ 𝐵 is 𝑏′⋅𝑎/(2𝑥^2) < 𝑏/(2𝑥) ⋅ 𝑎/𝑥, or in other words Pr[𝐴 ∩ 𝐵] < Pr[𝐴] ⋅ Pr[𝐵].

R
Remark 18.5 — Disjointness vs independence. People
sometimes confuse the notion of disjointness and in-
dependence, but these are actually quite different. Two


events 𝐴 and 𝐵 are disjoint if 𝐴 ∩ 𝐵 = ∅, which means
that if 𝐴 happens then 𝐵 definitely does not happen.
They are independent if Pr[𝐴 ∩ 𝐵] = Pr[𝐴] Pr[𝐵] which
means that knowing that 𝐴 happens gives us no infor-
mation about whether 𝐵 happened or not. If 𝐴 and 𝐵
have non-zero probability, then being disjoint implies
that they are not independent, since in particular it
means that they are negatively correlated.

Conditional probability: If 𝐴 and 𝐵 are events, and 𝐴 happens with


non-zero probability then we define the probability that 𝐵 happens
conditioned on 𝐴 to be Pr[𝐵|𝐴] = Pr[𝐴 ∩ 𝐵]/ Pr[𝐴]. This corresponds
to calculating the probability that 𝐵 happens if we already know
that 𝐴 happened. Note that 𝐴 and 𝐵 are independent if and only if
Pr[𝐵|𝐴] = Pr[𝐵].

More than two events: We can generalize this definition to more than
two events. We say that events 𝐴1 , … , 𝐴𝑘 are mutually independent
if knowing that any set of them occurred or didn’t occur does not
change the probability that an event outside the set occurs. Formally,
the condition is that for every subset 𝐼 ⊆ [𝑘],

Pr[∧_{𝑖∈𝐼} 𝐴𝑖 ] = ∏_{𝑖∈𝐼} Pr[𝐴𝑖 ].

For example, if 𝑥 ∼ {0, 1}3 , then the events {𝑥0 = 1}, {𝑥1 = 1} and
{𝑥2 = 1} are mutually independent. On the other hand, the events
{𝑥0 = 1}, {𝑥1 = 1} and {𝑥0 + 𝑥1 = 0 mod 2} are not mutually
independent, even though every pair of these events is independent
(can you see why? see also Fig. 18.5).
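
To see the last example concretely, the following sketch (ours, for illustration only) enumerates {0, 1}^3 and checks that the three events are pairwise independent but not mutually independent.

# Illustrative sketch (not from the book): pairwise vs. mutual independence.
from itertools import product

SPACE = list(product([0, 1], repeat=3))

def Pr(event):
    return sum(1 for x in SPACE if event(x)) / len(SPACE)

A = lambda x: x[0] == 1
B = lambda x: x[1] == 1
E = lambda x: (x[0] + x[1]) % 2 == 0

# Every pair of the events A, B, E is independent:
for U, V in [(A, B), (A, E), (B, E)]:
    assert Pr(lambda x: U(x) and V(x)) == Pr(U) * Pr(V)
# ...but the three events are not mutually independent:
triple = Pr(lambda x: A(x) and B(x) and E(x))
assert triple == 1 / 4 and Pr(A) * Pr(B) * Pr(E) == 1 / 8
print("Pr[A and B and E] =", triple, "while Pr[A]*Pr[B]*Pr[E] =", Pr(A) * Pr(B) * Pr(E))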

18.2.1 Independent random variables


We say that two random variables 𝑋 ∶ {0, 1}𝑛 → ℝ and 𝑌 ∶ {0, 1}𝑛 → ℝ
are independent if for every 𝑢, 𝑣 ∈ ℝ, the events {𝑋 = 𝑢} and {𝑌 = 𝑣}
are independent. (We use {𝑋 = 𝑢} as shorthand for {𝑥 | 𝑋(𝑥) = 𝑢}.)
In other words, 𝑋 and 𝑌 are independent if Pr[𝑋 = 𝑢 ∧ 𝑌 = 𝑣] =
Pr[𝑋 = 𝑢] Pr[𝑌 = 𝑣] for every 𝑢, 𝑣 ∈ ℝ. For example, if two random
variables depend on the result of tossing different coins then they are
independent:

Lemma 18.6 Suppose that 𝑆 = {𝑠0 , … , 𝑠𝑘−1 } and 𝑇 = {𝑡0 , … , 𝑡𝑚−1 } are
disjoint subsets of {0, … , 𝑛 − 1} and let 𝑋, 𝑌 ∶ {0, 1}^𝑛 → ℝ be random
variables such that 𝑋 = 𝐹 (𝑥_{𝑠0} , … , 𝑥_{𝑠𝑘−1} ) and 𝑌 = 𝐺(𝑥_{𝑡0} , … , 𝑥_{𝑡𝑚−1} ) for
some functions 𝐹 ∶ {0, 1}^𝑘 → ℝ and 𝐺 ∶ {0, 1}^𝑚 → ℝ. Then 𝑋 and 𝑌
are independent.

Figure 18.5: Consider the sample space {0, 1}^𝑛 and the events 𝐴, 𝐵, 𝐶, 𝐷, 𝐸 corresponding to 𝐴: 𝑥0 = 1, 𝐵: 𝑥1 = 1, 𝐶: 𝑥0 + 𝑥1 + 𝑥2 ≥ 2, 𝐷: 𝑥0 + 𝑥1 + 𝑥2 = 0 mod 2 and 𝐸: 𝑥0 + 𝑥1 = 0 mod 2. We can see that 𝐴 and 𝐵 are independent, 𝐶 is positively correlated with 𝐴 and positively correlated with 𝐵, the three events 𝐴, 𝐵, 𝐷 are mutually independent, and while every pair out of 𝐴, 𝐵, 𝐸 is independent, the three events 𝐴, 𝐵, 𝐸 are not mutually independent since their intersection has probability 2/8 = 1/4 instead of 1/2 ⋅ 1/2 ⋅ 1/2 = 1/8.
P
The notation in the lemma’s statement is a bit cum-
bersome, but at the end of the day, it simply says that
if 𝑋 and 𝑌 are random variables that depend on two
disjoint sets 𝑆 and 𝑇 of coins (for example, 𝑋 might
be the sum of the first 𝑛/2 coins, and 𝑌 might be the
largest consecutive stretch of zeroes in the second 𝑛/2
coins), then they are independent.

Proof of Lemma 18.6. Let 𝑎, 𝑏 ∈ ℝ, and let 𝐴 = {𝑥 ∈ {0, 1}𝑘 ∶ 𝐹 (𝑥) = 𝑎}


and 𝐵 = {𝑥 ∈ {0, 1}𝑚 ∶ 𝐺(𝑥) = 𝑏}. Since 𝑆 and 𝑇 are disjoint, we can
reorder the indices so that 𝑆 = {0, … , 𝑘 − 1} and 𝑇 = {𝑘, … , 𝑘 + 𝑚 − 1}
without affecting any of the probabilities. Hence we can write Pr[𝑋 =
𝑎 ∧ 𝑌 = 𝑏] = |𝐶|/2𝑛 where 𝐶 = {𝑥0 , … , 𝑥𝑛−1 ∶ (𝑥0 , … , 𝑥𝑘−1 ) ∈
𝐴 ∧ (𝑥𝑘 , … , 𝑥𝑘+𝑚−1 ) ∈ 𝐵}. Another way to write this using string
concatenation is that 𝐶 = {𝑥𝑦𝑧 ∶ 𝑥 ∈ 𝐴, 𝑦 ∈ 𝐵, 𝑧 ∈ {0, 1}𝑛−𝑘−𝑚 }, and
hence |𝐶| = |𝐴||𝐵|2𝑛−𝑘−𝑚 , which means that

|𝐶|/2^𝑛 = (|𝐴|/2^𝑘) ⋅ (|𝐵|/2^𝑚) ⋅ (2^{𝑛−𝑘−𝑚}/2^{𝑛−𝑘−𝑚}) = Pr[𝑋 = 𝑎] Pr[𝑌 = 𝑏].

If 𝑋 and 𝑌 are independent random variables then (letting 𝑆𝑋 , 𝑆𝑌


denote the sets of all numbers that have positive probability of being
the output of 𝑋 and 𝑌 , respectively):

𝔼[𝑋𝑌 ] = ∑_{𝑎∈𝑆_𝑋, 𝑏∈𝑆_𝑌} Pr[𝑋 = 𝑎 ∧ 𝑌 = 𝑏] ⋅ 𝑎𝑏 =_{(1)} ∑_{𝑎∈𝑆_𝑋, 𝑏∈𝑆_𝑌} Pr[𝑋 = 𝑎] Pr[𝑌 = 𝑏] ⋅ 𝑎𝑏
       =_{(2)} ( ∑_{𝑎∈𝑆_𝑋} Pr[𝑋 = 𝑎] ⋅ 𝑎 ) ( ∑_{𝑏∈𝑆_𝑌} Pr[𝑌 = 𝑏] ⋅ 𝑏 ) =_{(3)} 𝔼[𝑋] 𝔼[𝑌 ]

where the first equality (=(1) ) follows from the independence of 𝑋


and 𝑌 , the second equality (=(2) ) follows by “opening the paren-
theses” of the right-hand side, and the third equality (=(3) ) follows
from the definition of expectation. (This is not an “if and only if”; see
Exercise 18.3.)
Another useful fact is that if 𝑋 and 𝑌 are independent random
variables, then so are 𝐹 (𝑋) and 𝐺(𝑌 ) for all functions 𝐹 , 𝐺 ∶ ℝ → ℝ.
This is intuitively true since learning 𝐹 (𝑋) can only provide us with
less information than does learning 𝑋 itself. Hence, if learning 𝑋
does not teach us anything about 𝑌 (and so also about 𝐺(𝑌 )) then
neither will learning 𝐹 (𝑋). Indeed, to prove this we can write for
every 𝑎, 𝑏 ∈ ℝ:
Pr[𝐹 (𝑋) = 𝑎 ∧ 𝐺(𝑌 ) = 𝑏] = ∑_{𝑥 s.t. 𝐹(𝑥)=𝑎, 𝑦 s.t. 𝐺(𝑦)=𝑏} Pr[𝑋 = 𝑥 ∧ 𝑌 = 𝑦]
   = ∑_{𝑥 s.t. 𝐹(𝑥)=𝑎, 𝑦 s.t. 𝐺(𝑦)=𝑏} Pr[𝑋 = 𝑥] Pr[𝑌 = 𝑦]
   = ( ∑_{𝑥 s.t. 𝐹(𝑥)=𝑎} Pr[𝑋 = 𝑥] ) ⋅ ( ∑_{𝑦 s.t. 𝐺(𝑦)=𝑏} Pr[𝑌 = 𝑦] )
   = Pr[𝐹 (𝑋) = 𝑎] Pr[𝐺(𝑌 ) = 𝑏].

18.2.2 Collections of independent random variables


We can extend the notions of independence to more than two random
variables: we say that the random variables 𝑋0 , … , 𝑋𝑛−1 are mutually
independent if for every 𝑎0 , … , 𝑎𝑛−1 ∈ ℝ,

Pr [𝑋0 = 𝑎0 ∧ ⋯ ∧ 𝑋𝑛−1 = 𝑎𝑛−1 ] = Pr[𝑋0 = 𝑎0 ] ⋯ Pr[𝑋𝑛−1 = 𝑎𝑛−1 ].

And similarly, we have that


Lemma 18.7 — Expectation of product of independent random variables. If
𝑋0 , … , 𝑋𝑛−1 are mutually independent then
𝔼[ ∏_{𝑖=0}^{𝑛−1} 𝑋𝑖 ] = ∏_{𝑖=0}^{𝑛−1} 𝔼[𝑋𝑖 ].

Lemma 18.8 — Functions preserve independence. If 𝑋0 , … , 𝑋𝑛−1 are mu-


tually independent, and 𝑌0 , … , 𝑌𝑛−1 are defined as 𝑌𝑖 = 𝐹𝑖 (𝑋𝑖 ) for
some functions 𝐹0 , … , 𝐹𝑛−1 ∶ ℝ → ℝ, then 𝑌0 , … , 𝑌𝑛−1 are mutually
independent as well.

P
We leave proving Lemma 18.7 and Lemma 18.8 as
Exercise 18.6 and Exercise 18.7. It is a good idea for
you stop now and do these exercises to make sure you
are comfortable with the notion of independence, as
we will use it heavily later on in this course.

18.3 CONCENTRATION AND TAIL BOUNDS


The name “expectation” is somewhat misleading. For example, sup-
pose that you and I place a bet on the outcome of 10 coin tosses, where
if they all come out to be 1’s then I pay you 100,000 dollars and other-
wise you pay me 10 dollars. If we let 𝑋 ∶ {0, 1}10 → ℝ be the random
variable denoting your gain, then we see that

𝔼[𝑋] = 2^{−10} ⋅ 100000 − (1 − 2^{−10}) ⋅ 10 ∼ 90.



But we don’t really “expect” the result of this experiment to be for


you to gain 90 dollars. Rather, 99.9% of the time you will pay me 10
dollars, and you will hit the jackpot 0.1% of the times.
However, if we repeat this experiment again and again (with fresh
and hence independent coins), then in the long run we do expect your
average earning to be close to 90 dollars, which is the reason why
casinos can make money in a predictable way even though every
individual bet is random. For example, if we toss 𝑛 independent and
unbiased coins, then as 𝑛 grows, the number of coins that come up
ones will be more and more concentrated around 𝑛/2 according to the
famous “bell curve” (see Fig. 18.6).
Much of probability theory is concerned with so called concentration
or tail bounds, which are upper bounds on the probability that a ran-
dom variable 𝑋 deviates too much from its expectation. The first and
simplest one of them is Markov’s inequality:

Theorem 18.9 — Markov’s inequality. If 𝑋 is a non-negative random
variable then for every 𝑘 > 0, Pr[𝑋 ≥ 𝑘 𝔼[𝑋]] ≤ 1/𝑘.

Figure 18.6: The probabilities that we obtain a particular sum when we toss 𝑛 = 10, 20, 100, 1000 coins converge quickly to the Gaussian/normal distribution.

P
Markov’s Inequality is actually a very natural state-
ment (see also Fig. 18.7). For example, if you know
that the average (not the median!) household income
in the US is 70,000 dollars, then in particular you can
deduce that at most 25 percent of households make
more than 280,000 dollars, since otherwise, even if
the remaining 75 percent had zero income, the top
25 percent alone would cause the average income to
be larger than 70,000 dollars. From this example you
can already see that in many situations, Markov’s
inequality will not be tight and the probability of devi-
ating from expectation will be much smaller: see the
Chebyshev and Chernoff inequalities below.

Proof of Theorem 18.9. Let 𝜇 = 𝔼[𝑋] and define 𝑌 = 1𝑋≥𝑘𝜇 . That


is, 𝑌 (𝑥) = 1 if 𝑋(𝑥) ≥ 𝑘𝜇 and 𝑌 (𝑥) = 0 otherwise. Note that by
definition, for every 𝑥, 𝑌 (𝑥) ≤ 𝑋(𝑥)/(𝑘𝜇). We need to show 𝔼[𝑌 ] ≤ 1/𝑘.
But this follows since 𝔼[𝑌 ] ≤ 𝔼[𝑋/(𝑘𝜇)] = 𝔼[𝑋]/(𝑘𝜇) = 𝜇/(𝑘𝜇) = 1/𝑘.

Figure 18.7: Markov’s Inequality tells us that a non-negative random variable 𝑋 cannot be much larger than its expectation, with high probability. For example, if the expectation of 𝑋 is 𝜇, then the probability that 𝑋 > 4𝜇 must be at most 1/4, as otherwise just the contribution from this part of the sample space will be too large.

The averaging principle. While the expectation of a random variable
𝑋 is hardly always the “typical value”, we can show that 𝑋 is guar-
anteed to achieve a value that is at least its expectation with positive
probability. For example, if the average grade in an exam is 87 points,
at least one student got a grade 87 or more on the exam. This is known
as the averaging principle, and despite its simplicity it is surprisingly


useful.
Lemma 18.10 Let 𝑋 be a random variable, then Pr[𝑋 ≥ 𝔼[𝑋]] > 0.

Proof. Suppose towards the sake of contradiction that Pr[𝑋 < 𝔼[𝑋]] =
1. Then the random variable 𝑌 = 𝔼[𝑋] − 𝑋 is always positive. By
linearity of expectation 𝔼[𝑌 ] = 𝔼[𝑋] − 𝔼[𝑋] = 0. Yet by Markov, a
non-negative random variable 𝑌 with 𝔼[𝑌 ] = 0 must equal 0 with
probability 1, since the probability that 𝑌 > 𝑘 ⋅ 0 = 0 is at most 1/𝑘 for
every 𝑘 > 1. Hence we get a contradiction to the assumption that 𝑌 is
always positive.

18.3.1 Chebyshev’s Inequality


Markov’s inequality says that a (non-negative) random variable 𝑋
can’t go too crazy and be, say, a million times its expectation, with
significant probability. But ideally we would like to say that with
high probability, 𝑋 should be very close to its expectation, e.g., in the
range [0.99𝜇, 1.01𝜇] where 𝜇 = 𝔼[𝑋]. In such a case we say that 𝑋 is
concentrated, and hence its expectation (i.e., mean) will be close to its
median and other ways of measuring 𝑋’s “typical value”. Chebyshev’s
inequality can be thought of as saying that 𝑋 is concentrated if it has a
small standard deviation.
A standard way to measure the deviation of a random variable
from its expectation is by using its standard deviation. For a random
variable 𝑋, we define the variance of 𝑋 as Var[𝑋] = 𝔼[(𝑋 − 𝜇)2 ]
where 𝜇 = 𝔼[𝑋]; i.e., the variance is the average squared distance
of 𝑋 from its expectation. The standard deviation of 𝑋 is defined as
𝜎[𝑋] = √Var[𝑋]. (This is well-defined since the variance, being an
average of a square, is always a non-negative number.)
Using Chebyshev’s inequality, we can control the probability that
a random variable is too many standard deviations away from its
expectation.

Theorem 18.11 — Chebyshev’s inequality. Suppose that 𝜇 = 𝔼[𝑋] and
𝜎^2 = Var[𝑋]. Then for every 𝑘 > 0, Pr[|𝑋 − 𝜇| ≥ 𝑘𝜎] ≤ 1/𝑘^2 .

Proof. The proof follows from Markov’s inequality. We define the


random variable 𝑌 = (𝑋 − 𝜇)2 . Then 𝔼[𝑌 ] = Var[𝑋] = 𝜎2 , and hence
by Markov the probability that 𝑌 > 𝑘2 𝜎2 is at most 1/𝑘2 . But clearly
(𝑋 − 𝜇)2 ≥ 𝑘2 𝜎2 if and only if |𝑋 − 𝜇| ≥ 𝑘𝜎.

One example of how to use Chebyshev’s inequality is the setting


when 𝑋 = 𝑋1 + ⋯ + 𝑋𝑛 where 𝑋𝑖 ’s are independent and identically
distributed (i.i.d for short) variables with values in [0, 1] where each
has expectation 1/2. Since 𝔼[𝑋] = ∑𝑖 𝔼[𝑋𝑖 ] = 𝑛/2, we would like to
say that 𝑋 is very likely to be in, say, the interval [0.499𝑛, 0.501𝑛]. Us-
ing Markov’s inequality directly will not help us, since it will only tell
us that 𝑋 is very likely to be at most 100𝑛 (which we already knew,
since it always lies between 0 and 𝑛). However, since 𝑋1 , … , 𝑋𝑛 are
independent,

Var[𝑋1 + ⋯ + 𝑋𝑛 ] = Var[𝑋1 ] + ⋯ + Var[𝑋𝑛 ] . (18.1)

(We leave showing this to the reader as Exercise 18.8.)


For every random variable 𝑋𝑖 in [0, 1], Var[𝑋𝑖 ] ≤ 1 (if the variable
is always in [0, 1], it can’t be more than 1 away from its expectation),

and hence (18.1) implies that Var[𝑋] ≤ 𝑛 and hence 𝜎[𝑋] ≤ √𝑛. For
large 𝑛, √𝑛 ≪ 0.001𝑛, and in particular if √𝑛 ≤ 0.001𝑛/𝑘, we can
use Chebyshev’s inequality to bound the probability that 𝑋 is not in
[0.499𝑛, 0.501𝑛] by 1/𝑘^2 .
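
The following simulation sketch (ours; the parameter choices are arbitrary, and random.getrandbits is used only as a fast way to toss many fair coins) illustrates this concentration empirically: the probability of falling outside [0.499𝑛, 0.501𝑛] drops rapidly as 𝑛 grows.

# Illustrative sketch (not from the book): concentration of the sum of n fair coins.
import random

def deviation_probability(n, trials=200):
    # Empirical probability that the sum of n fair coins lies outside [0.499n, 0.501n].
    bad = 0
    for _ in range(trials):
        ones = bin(random.getrandbits(n)).count("1")  # number of heads among n tosses
        if not (0.499 * n <= ones <= 0.501 * n):
            bad += 1
    return bad / trials

for n in [10**4, 10**5, 10**6]:
    print("n =", n, " empirical deviation probability ~", deviation_probability(n))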

18.3.2 The Chernoff bound


Chebyshev’s inequality already shows a connection between inde-
pendence and concentration, but in many cases we can hope for
a quantitatively much stronger result. If, as in the example above,
𝑋 = 𝑋1 + … + 𝑋𝑛 where the 𝑋𝑖 ’s are bounded i.i.d random variables
of mean 1/2, then as 𝑛 grows, the distribution of 𝑋 would be roughly
the normal or Gaussian distribution− that is, distributed according to
the bell curve (see Fig. 18.6 and Fig. 18.8). This distribution has the
property of being very concentrated in the sense that the probability of
deviating 𝑘 standard deviations from the mean is not merely 1/𝑘^2 as is
guaranteed by Chebyshev, but rather is roughly 𝑒^{−𝑘^2}. Specifically, for
a normal random variable 𝑋 of expectation 𝜇 and standard deviation
𝜎, the probability that |𝑋 − 𝜇| ≥ 𝑘𝜎 is at most 2𝑒^{−𝑘^2/2}. That is, we have
an exponential decay of the probability of deviation.


The following extremely useful theorem shows that such expo-
nential decay occurs every time we have a sum of independent and
bounded variables. This theorem is known under many names in dif-
ferent communities, though it is mostly called the Chernoff bound in
the computer science literature:

Theorem 18.12 — Chernoff/Hoeffding bound. If 𝑋0 , … , 𝑋𝑛−1 are i.i.d ran-
dom variables such that 𝑋𝑖 ∈ [0, 1] and 𝔼[𝑋𝑖 ] = 𝑝 for every 𝑖, then
for every 𝜖 > 0

Pr [ ∣∑_{𝑖=0}^{𝑛−1} 𝑋𝑖 − 𝑝𝑛∣ > 𝜖𝑛 ] ≤ 2 ⋅ 𝑒^{−2𝜖^2 𝑛} .   (18.2)

Figure 18.8: In the normal distribution or the bell curve, the probability of deviating 𝑘 standard deviations from the expectation shrinks exponentially in 𝑘^2, and specifically with probability at least 1 − 2𝑒^{−𝑘^2/2}, a random variable 𝑋 of expectation 𝜇 and standard deviation 𝜎 satisfies 𝜇 − 𝑘𝜎 ≤ 𝑋 ≤ 𝜇 + 𝑘𝜎. This figure gives more precise bounds for 𝑘 = 1, 2, 3, 4, 5, 6. (Image credit: Imran Baghirov)
We omit the proof, which appears in many texts, and uses Markov’s
inequality on i.i.d random variables 𝑌0 , … , 𝑌𝑛 that are of the form
𝑌𝑖 = 𝑒𝜆𝑋𝑖 for some carefully chosen parameter 𝜆. See Exercise 18.11
for a proof of the simple (but highly useful and representative) case
where each 𝑋𝑖 is {0, 1} valued and 𝑝 = 1/2. (See also Exercise 18.12
for a generalization.)

R
Remark 18.13 — Slight simplification of Chernoff. Since 𝑒
is roughly 2.7 (and in particular larger than 2),
(18.2) would still be true if we replaced its right-hand
side with 𝑒^{−2𝜖^2 𝑛+1} . For 𝑛 > 1/𝜖^2 , the equation will
still be true if we replaced the right-hand side with
the simpler 𝑒^{−𝜖^2 𝑛} . Hence we will sometimes use the
Chernoff bound as stating that for 𝑋0 , … , 𝑋𝑛−1 and 𝑝
as above, if 𝑛 > 1/𝜖^2 then

Pr [ ∣∑_{𝑖=0}^{𝑛−1} 𝑋𝑖 − 𝑝𝑛∣ > 𝜖𝑛 ] ≤ 𝑒^{−𝜖^2 𝑛} .   (18.3)
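
As a rough numerical illustration (a sketch of ours, not part of the text), the code below compares the empirical probability of an 𝜖𝑛 deviation for fair coins with the right-hand side of (18.2).

# Illustrative sketch (not from the book): empirical tail vs. the Chernoff bound.
import math
import random

def empirical_tail(n, eps, trials=5000):
    # Empirical Pr[ |sum of n fair coins - n/2| > eps*n ].
    bad = sum(
        1 for _ in range(trials)
        if abs(bin(random.getrandbits(n)).count("1") - n / 2) > eps * n
    )
    return bad / trials

n, eps = 1000, 0.05
print("empirical tail:", empirical_tail(n, eps))
print("Chernoff bound 2*e^(-2*eps^2*n):", 2 * math.exp(-2 * eps**2 * n))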

18.3.3 Application: Supervised learning and empirical risk minimization


Here is a nice application of the Chernoff bound. Consider the task
of supervised learning. You are given a set 𝑆 of 𝑛 samples of the form
(𝑥0 , 𝑦0 ), … , (𝑥𝑛−1 , 𝑦𝑛−1 ) drawn from some unknown distribution 𝐷
over pairs (𝑥, 𝑦). For simplicity we will assume that 𝑥𝑖 ∈ {0, 1}𝑚 and
𝑦𝑖 ∈ {0, 1}. (We use here the concept of general distribution over the
finite set {0, 1}𝑚+1 as discussed in Section 18.1.3.) The goal is to find
a classifier ℎ ∶ {0, 1}𝑚 → {0, 1} that will minimize the test error which
is the probability 𝐿(ℎ) that ℎ(𝑥) ≠ 𝑦 where (𝑥, 𝑦) is drawn from the
distribution 𝐷. That is, 𝐿(ℎ) = Pr(𝑥,𝑦)∼𝐷 [ℎ(𝑥) ≠ 𝑦].
One way to find such a classifier is to consider a collection 𝒞 of po-
tential classifiers and look at the classifier ℎ in 𝒞 that does best on the
training set 𝑆. The classifier ℎ is known as the empirical risk minimizer
(see also Section 12.1.6) . The Chernoff bound can be used to show
that as long as the number 𝑛 of samples is sufficiently larger than the
logarithm of |𝒞|, the test error 𝐿(ℎ) will be close to its training error
𝐿̂ 𝑆 (ℎ), which is defined as the fraction of pairs (𝑥𝑖 , 𝑦𝑖 ) ∈ 𝑆 that it fails
to classify. (Equivalently, 𝐿̂_𝑆 (ℎ) = (1/𝑛) ∑_{𝑖∈[𝑛]} |ℎ(𝑥𝑖 ) − 𝑦𝑖 |.)

Theorem 18.14 — Generalization of ERM. Let 𝐷 be any distribution over
pairs (𝑥, 𝑦) ∈ {0, 1}^{𝑚+1} and 𝒞 be any set of functions mapping
{0, 1}^𝑚 to {0, 1}. Then for every 𝜖, 𝛿 > 0, if 𝑛 > log |𝒞| log(1/𝛿)/𝜖^2 and 𝑆
is a set of (𝑥0 , 𝑦0 ), … , (𝑥𝑛−1 , 𝑦𝑛−1 ) samples that are drawn indepen-
dently from 𝐷 then

Pr_𝑆 [ ∀_{ℎ∈𝒞} |𝐿(ℎ) − 𝐿̂_𝑆 (ℎ)| ≤ 𝜖 ] > 1 − 𝛿 ,

where the probability is taken over the choice of the set of samples
𝑆.
In particular if |𝒞| ≤ 2^𝑘 and 𝑛 > 𝑘 log(1/𝛿)/𝜖^2 then with probability at
least 1 − 𝛿, the classifier ℎ∗ ∈ 𝒞 that minimizes the empirical (training) er-
ror 𝐿̂_𝑆 (ℎ) satisfies 𝐿(ℎ∗ ) ≤ 𝐿̂_𝑆 (ℎ∗ ) + 𝜖, and hence its test error is at
most 𝜖 worse than its training error.

Proof Idea:
The idea is to combine the Chernoff bound with the union bound.
Let 𝑘 = log |𝒞|. We first use the Chernoff bound to show that for
every fixed ℎ ∈ 𝒞, if we choose 𝑆 at random then the probability that
|𝐿(ℎ) − 𝐿̂_𝑆 (ℎ)| > 𝜖 will be smaller than 𝛿/2^𝑘 . We can then use the union
bound over all the 2^𝑘 members of 𝒞 to show that this will be the case
for every ℎ.

Proof of Theorem 18.14. Set 𝑘 = log |𝒞| and so 𝑛 > 𝑘 log(1/𝛿)/𝜖2 . We


start by making the following claim.
CLAIM: For every ℎ ∈ 𝒞, the probability over 𝑆 that |𝐿(ℎ) −
𝐿̂ 𝑆 (ℎ)| ≥ 𝜖 is smaller than 𝛿/2𝑘 .
We prove the claim using the Chernoff bound. Specifically, for ev-
ery such ℎ, let us define a collection of random variables 𝑋0 , … , 𝑋𝑛−1
as follows:


𝑋𝑖 = 1 if ℎ(𝑥𝑖 ) ≠ 𝑦𝑖 , and 𝑋𝑖 = 0 otherwise.

Since the samples (𝑥0 , 𝑦0 ), … , (𝑥𝑛−1 , 𝑦𝑛−1 ) are drawn independently


from the same distribution 𝐷, the random variables 𝑋0 , … , 𝑋𝑛−1 are
independently and identically distributed. Moreover, for every 𝑖,
𝔼[𝑋𝑖 ] = 𝐿(ℎ). Hence by the Chernoff bound (see (18.3)), the probabil-
ity that | ∑_{𝑖=0}^{𝑛−1} 𝑋𝑖 − 𝑛 ⋅ 𝐿(ℎ)| ≥ 𝜖𝑛 is at most 𝑒^{−𝜖^2 𝑛} < 𝑒^{−𝑘 log(1/𝛿)} < 𝛿/2^𝑘
(using the fact that 𝑒 > 2). Since 𝐿̂_𝑆 (ℎ) = (1/𝑛) ∑_{𝑖∈[𝑛]} 𝑋𝑖 , this completes
the proof of the claim.
Given the claim, the theorem follows from the union bound. In-
deed, for every ℎ ∈ 𝒞, define the “bad event” 𝐵ℎ to be the event (over
the choice of 𝑆) that |𝐿(ℎ) − 𝐿̂ 𝑆 (ℎ)| > 𝜖. By the claim Pr[𝐵ℎ ] < 𝛿/2𝑘 ,
and hence by the union bound the probability that the union of 𝐵ℎ for
all ℎ ∈ 𝒞 happens is smaller than |𝒞|𝛿/2^𝑘 = 𝛿. If for every ℎ ∈ 𝒞, 𝐵ℎ
does not happen, it means that for every ℎ ∈ 𝒞, |𝐿(ℎ) − 𝐿̂_𝑆 (ℎ)| ≤ 𝜖,

and so the probability of the latter event is larger than 1 − 𝛿 which is


what we wanted to prove.
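
To get a feel for this theorem, here is a small simulation sketch (ours; the distribution 𝐷, the hypothesis class, and all parameter choices below are made up purely for illustration). It draws a training set and a large fresh test set, and checks that the training and test errors of every classifier in a small class agree up to a small 𝜖.

# Illustrative sketch (not from the book): training vs. test error for a small class.
import random

m = 8
def sample_D():
    # A toy distribution over (x, y): y is the majority of x's bits, flipped with prob. 0.1.
    x = tuple(random.randint(0, 1) for _ in range(m))
    y = int(sum(x) > m / 2)
    if random.random() < 0.1:
        y = 1 - y
    return x, y

# A small class C: "output bit i of x" for each i, plus the two constant functions.
C = [lambda x, i=i: x[i] for i in range(m)] + [lambda x: 0, lambda x: 1]

def error(h, samples):
    return sum(1 for x, y in samples if h(x) != y) / len(samples)

train = [sample_D() for _ in range(50_000)]    # the sample set S
test  = [sample_D() for _ in range(200_000)]   # large fresh set, approximates L(h)

gap = max(abs(error(h, train) - error(h, test)) for h in C)
print("largest |training error - test error| over C:", gap)  # typically well below 0.02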

✓ Chapter Recap

• A basic probabilistic experiment corresponds to


tossing 𝑛 coins or choosing 𝑥 uniformly at random
from {0, 1}𝑛 .
• Random variables assign a real number to every
result of a coin toss. The expectation of a random
variable 𝑋 is its average value.
• There are several concentration results, also known
as tail bounds showing that under certain condi-
tions, random variables deviate significantly from
their expectation only with small probability.

18.4 EXERCISES
Exercise 18.1 Suppose that we toss three independent fair coins 𝑎, 𝑏, 𝑐 ∈
{0, 1}. What is the probability that the XOR of 𝑎,𝑏, and 𝑐 is equal to 1?
What is the probability that the AND of these three values is equal to
1? Are these two events independent?

Exercise 18.2 Give an example of random variables 𝑋, 𝑌 ∶ {0, 1}^3 → ℝ
such that 𝔼[𝑋𝑌 ] ≠ 𝔼[𝑋] 𝔼[𝑌 ].

Exercise 18.3 Give an example of random variables 𝑋, 𝑌 ∶ {0, 1}^3 → ℝ
such that 𝑋 and 𝑌 are not independent but 𝔼[𝑋𝑌 ] = 𝔼[𝑋] 𝔼[𝑌 ].

Exercise 18.4 Let 𝑛 be an odd number, and let 𝑋 ∶ {0, 1}^𝑛 → ℝ be the
random variable defined as follows: for every 𝑥 ∈ {0, 1}^𝑛 , 𝑋(𝑥) = 1 if
∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 > 𝑛/2 and 𝑋(𝑥) = 0 otherwise. Prove that 𝔼[𝑋] = 1/2.

Exercise 18.5 — standard deviation. 1. Give an example for a random
variable 𝑋 such that 𝑋’s standard deviation is equal to 𝔼[|𝑋 − 𝔼[𝑋]|].

2. Give an example for a random variable 𝑋 such that 𝑋’s standard


deviation is not equal to 𝔼[|𝑋 − 𝔼[𝑋]|].

Exercise 18.6 — Product of expectations. Prove Lemma 18.7.


Exercise 18.7 — Transformations preserve independence. Prove Lemma 18.8.




Exercise 18.8 — Variance of independent random variables. Prove that if
𝑋0 , … , 𝑋𝑛−1 are independent random variables then Var[𝑋0 + ⋯ +
𝑋𝑛−1 ] = ∑_{𝑖=0}^{𝑛−1} Var[𝑋𝑖 ].

Exercise 18.9 — Entropy (challenge). Recall the definition of a distribution
𝜇 over some finite set 𝑆. Shannon defined the entropy of a distribution
𝜇, denoted by 𝐻(𝜇), to be ∑𝑥∈𝑆 𝜇(𝑥) log(1/𝜇(𝑥)). The idea is that if 𝜇
is a distribution of entropy 𝑘, then encoding members of 𝜇 will require
𝑘 bits, in an amortized sense. In this exercise we justify this definition.
Let 𝜇 be such that 𝐻(𝜇) = 𝑘.
1. Prove that for every one-to-one function 𝐹 ∶ 𝑆 → {0, 1}∗ ,
𝔼𝑥∼𝜇 |𝐹 (𝑥)| ≥ 𝑘.
2. Prove that for every 𝜖, there is some 𝑛 and a one-to-one function
𝐹 ∶ 𝑆 𝑛 → {0, 1}∗ , such that 𝔼𝑥∼𝜇𝑛 |𝐹 (𝑥)| ≤ 𝑛(𝑘 + 𝜖), where 𝑥 ∼ 𝜇
denotes the experiments of choosing 𝑥0 , … , 𝑥𝑛−1 each independently
from 𝑆 using the distribution 𝜇.

Exercise 18.10 — Entropy approximation to binomial. Let 𝐻(𝑝) = 𝑝 log(1/𝑝) +
(1 − 𝑝) log(1/(1 − 𝑝)).¹ Prove that for every 𝑝 ∈ (0, 1) and 𝜖 > 0, if 𝑛 is
large enough then²

2^{(𝐻(𝑝)−𝜖)𝑛} ≤ (𝑛 choose 𝑝𝑛) ≤ 2^{(𝐻(𝑝)+𝜖)𝑛} ,

where (𝑛 choose 𝑘) is the binomial coefficient 𝑛!/(𝑘!(𝑛 − 𝑘)!) which is equal to the
number of 𝑘-size subsets of {0, … , 𝑛 − 1}.

¹ While you don’t need this to solve this exercise, this is the function that maps 𝑝 to the entropy (as defined in Exercise 18.9) of the 𝑝-biased coin distribution over {0, 1}, which is the function 𝜇 ∶ {0, 1} → [0, 1] s.t. 𝜇(0) = 1 − 𝑝 and 𝜇(1) = 𝑝.
² Hint: Use Stirling’s formula for approximating the factorial function.

Exercise 18.11 — Chernoff using Stirling. 1. Prove that Pr_{𝑥∼{0,1}^𝑛} [∑ 𝑥𝑖 =
𝑘] = (𝑛 choose 𝑘) 2^{−𝑛} .

2. Use this and Exercise 18.10 to prove (an approximate version of)
the Chernoff bound for the case that 𝑋0 , … , 𝑋𝑛−1 are i.i.d. random
variables over {0, 1} each equaling 0 and 1 with probability 1/2.
That is, prove that for every 𝜖 > 0, and 𝑋0 , … , 𝑋𝑛−1 as above,
Pr[| ∑_{𝑖=0}^{𝑛−1} 𝑋𝑖 − 𝑛/2| > 𝜖𝑛] < 2^{−0.1⋅𝜖^2 𝑛} .

Exercise 18.12 — Poor man’s Chernoff. Exercise 18.11 establishes the Cher-
noff bound for the case that 𝑋0 , … , 𝑋𝑛−1 are i.i.d variables over {0, 1}
with expectation 1/2. In this exercise we use a slightly different
method (bounding the moments of the random variables) to estab-
lish a version of Chernoff where the random variables range over [0, 1]
and their expectation is some number 𝑝 ∈ [0, 1] that may be different
than 1/2. Let 𝑋0 , … , 𝑋𝑛−1 be i.i.d random variables with 𝔼 𝑋𝑖 = 𝑝 and
Pr[0 ≤ 𝑋𝑖 ≤ 1] = 1. Define 𝑌𝑖 = 𝑋𝑖 − 𝑝.

1. Prove that for every 𝑗0 , … , 𝑗𝑛−1 ∈ ℕ, if there exists one 𝑖 such that 𝑗𝑖
is odd then 𝔼[∏_{𝑖=0}^{𝑛−1} 𝑌𝑖^{𝑗𝑖} ] = 0.

2. Prove that for every 𝑘, 𝔼[(∑_{𝑖=0}^{𝑛−1} 𝑌𝑖 )^𝑘 ] ≤ (10𝑘𝑛)^{𝑘/2} .³

3. Prove that for every 𝜖 > 0, Pr[| ∑_𝑖 𝑌𝑖 | ≥ 𝜖𝑛] ≤ 2^{−𝜖^2 𝑛/(10000 log 1/𝜖)} .⁴

³ Hint: Bound the number of tuples 𝑗0 , … , 𝑗𝑛−1 such that every 𝑗𝑖 is even and ∑ 𝑗𝑖 = 𝑘 using the Binomial coefficient and the fact that in any such tuple there are at most 𝑘/2 distinct indices.
⁴ Hint: Set 𝑘 = 2⌈𝜖^2 𝑛/1000⌉ and then show that if the event | ∑ 𝑌𝑖 | ≥ 𝜖𝑛 happens then the random variable (∑ 𝑌𝑖 )^𝑘 is a factor of 𝜖^{−𝑘} larger than its expectation.

■

Exercise 18.13 — Sampling. Suppose that a country has 300,000,000 citi-


zens, 52 percent of which prefer the color “green” and 48 percent of
which prefer the color “orange”. Suppose we sample 𝑛 random citi-
zens and ask them their favorite color (assume they will answer truth-
fully). What is the smallest value 𝑛 among the following choices so
that the probability that the majority of the sample answers “green” is
at most 0.05?

a. 1,000

b. 10,000

c. 100,000

d. 1,000,000

Exercise 18.14 Would the answer to Exercise 18.13 change if the country
had 300,000,000,000 citizens?

Exercise 18.15 — Sampling (2). Under the same assumptions as Exer-
cise 18.13, what is the smallest value 𝑛 among the following choices so
that the probability that the majority of the sample answers “green” is
at most 2−100 ?

a. 1,000

b. 10,000

c. 100,000

d. 1,000,000

e. It is impossible to get such low probability since there are fewer


than 2100 citizens.



18.5 BIBLIOGRAPHICAL NOTES


There are many sources for more information on discrete probability,
including the texts referenced in Section 1.9. One particularly recom-
mended source for probability is Harvard’s STAT 110 class, whose
lectures are available on youtube and whose book is available online.
The version of the Chernoff bound that we stated in Theorem 18.12
is sometimes known as Hoeffding’s Inequality. Other variants of the
Chernoff bound are known as well, but all of them are equally good
for the applications of this book.
Learning Objectives:
• See examples of randomized algorithms
• Get more comfort with analyzing
probabilistic processes and tail bounds
• Success amplification using tail bounds

19
Probabilistic computation

“in 1946 .. (I asked myself) what are the chances that a Canfield solitaire laid
out with 52 cards will come out successfully? After spending a lot of time
trying to estimate them by pure combinatorial calculations, I wondered whether
a more practical method … might not be to lay it out say one hundred times and
simply observe and count”, Stanislaw Ulam, 1983
“The salient features of our method are that it is probabilistic … and with a
controllable miniscule probability of error.”, Michael Rabin, 1977

In early computer systems, much effort was taken to drive out


randomness and noise. Hardware components were prone to non-
deterministic behavior from a number of causes, whether it was vac-
uum tubes overheating or actual physical bugs causing short circuits
(see Fig. 19.1). This motivated John von Neumann, one of the early
computing pioneers, to write a paper on how to error correct computa-
tion, introducing the notion of redundancy.
So it is quite surprising that randomness turned out to be not just a
hindrance but also a resource for computation, enabling us to achieve
tasks much more efficiently than previously known. One of the first
applications involved the very same John von Neumann. While he
was sick in bed and playing cards, Stan Ulam came up with the ob-
servation that calculating statistics of a system could be done much
faster by running several randomized simulations. He mentioned this
idea to von Neumann, who became very excited about it; indeed, it
turned out to be crucial for the neutron transport calculations that
were needed for development of the Atom bomb and later on the hy-
drogen bomb. Because this project was highly classified, Ulam, von
Neumann and their collaborators came up with the codeword “Monte
Carlo” for this approach (based on the famous casinos where Ulam’s
uncle gambled). The name stuck, and randomized algorithms are
known as Monte Carlo algorithms to this day.¹

Figure 19.1: A 1947 entry in the log book of the Harvard MARK II computer containing an actual bug that caused a hardware malfunction. By Courtesy of the Naval Surface Warfare Center.

¹ Some texts also talk about “Las Vegas algorithms” that always return the right answer but whose running time is only polynomial on the average. Since this Monte Carlo vs Las Vegas terminology is confusing, we will not use these terms anymore, and simply talk about randomized algorithms.

In this chapter, we will see some examples of randomized algo-
rithms that use randomness to compute a quantity in a faster or sim-
pler way than was known otherwise. We will describe the algorithms


in an informal / “pseudo-code” way, rather than as Turing machines


or NAND-TM/NAND-RAM programs. In Chapter 20 we will discuss
how to augment the computational models we saw before to incorpo-
rate the ability to “toss coins”.

This chapter: A non-mathy overview


This chapter gives some examples of randomized algorithms
to get a sense of why probability can be useful for compu-
tation. We will also see the technique of success amplification
which is key for many randomized algorithms.

19.1 FINDING APPROXIMATELY GOOD MAXIMUM CUTS


We start with the following example. Recall the maximum cut problem
of finding, given a graph 𝐺 = (𝑉 , 𝐸), the cut that maximizes the num-
ber of edges. This problem is NP-hard, which means that we do not
know of any efficient algorithm that can solve it, but randomization
enables a simple algorithm that can cut at least half of the edges:

Theorem 19.1 — Approximating max cut. There is an efficient probabilis-


tic algorithm that on input an 𝑛-vertex 𝑚-edge graph 𝐺, outputs a
cut (𝑆, 𝑆) that cuts at least 𝑚/2 of the edges of 𝐺 in expectation.

Proof Idea:
We simply choose a random cut: we choose a subset 𝑆 of vertices by
choosing every vertex 𝑣 to be a member of 𝑆 with probability 1/2 in-
dependently. It’s not hard to see that each edge is cut with probability
1/2 and so the expected number of cut edges is 𝑚/2.

Proof of Theorem 19.1. The algorithm is extremely simple:

Algorithm Random Cut:


Input: Graph 𝐺 = (𝑉 , 𝐸) with 𝑛 vertices and 𝑚 edges. Denote 𝑉 =
{𝑣0 , 𝑣1 , … , 𝑣𝑛−1 }.
Operation:

1. Pick 𝑥 uniformly at random in {0, 1}𝑛 .


2. Let 𝑆 ⊆ 𝑉 be the set {𝑣𝑖 ∶ 𝑥𝑖 = 1 , 𝑖 ∈ [𝑛]} that includes all vertices
corresponding to coordinates of 𝑥 where 𝑥𝑖 = 1.
3. Output the cut (𝑆, 𝑆).

We claim that the expected number of edges cut by the algorithm is


𝑚/2. Indeed, for every edge 𝑒 ∈ 𝐸, let 𝑋𝑒 be the random variable such
that 𝑋𝑒 (𝑥) = 1 if the edge 𝑒 is cut by 𝑥, and 𝑋𝑒 (𝑥) = 0 otherwise. For
every such edge 𝑒 = {𝑖, 𝑗}, 𝑋𝑒 (𝑥) = 1 if and only if 𝑥𝑖 ≠ 𝑥𝑗 . Since the
pair (𝑥𝑖 , 𝑥𝑗 ) obtains each of the values 00, 01, 10, 11 with probability
1/4, the probability that 𝑥𝑖 ≠ 𝑥𝑗 is 1/2 and hence 𝔼[𝑋𝑒 ] = 1/2. If we let
𝑋 be the random variable corresponding to the total number of edges
cut by 𝑆, then 𝑋 = ∑𝑒∈𝐸 𝑋𝑒 and hence by linearity of expectation

𝔼[𝑋] = ∑_{𝑒∈𝐸} 𝔼[𝑋𝑒 ] = 𝑚 ⋅ (1/2) = 𝑚/2 .
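
A direct translation of this algorithm into Python looks as follows (a sketch of ours; representing the graph as a list of edges over vertices 0, …, 𝑛 − 1 is an assumption made for illustration).

# Illustrative sketch (not from the book): the random cut algorithm.
import random

def random_cut(n, edges):
    # Choose x uniformly in {0,1}^n, let S = {i : x_i = 1}, and count the cut edges.
    x = [random.randint(0, 1) for _ in range(n)]
    S = {i for i in range(n) if x[i] == 1}
    cut = sum(1 for (u, v) in edges if x[u] != x[v])
    return S, cut

# Example: a 4-cycle has m = 4 edges, so a random cut cuts 2 edges in expectation.
n, edges = 4, [(0, 1), (1, 2), (2, 3), (3, 0)]
average = sum(random_cut(n, edges)[1] for _ in range(10_000)) / 10_000
print("average number of cut edges ~", average, "(expectation is", len(edges) / 2, ")")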

Randomized algorithms work in the worst case. It is tempting to think of


a randomized algorithm such as the one of Theorem 19.1 as an algo-
rithm that works for a “random input graph” but it is actually much
better than that. The expectation in this theorem is not taken over the
choice of the graph, but rather only over the random choices of the algo-
rithm. In particular, for every graph 𝐺, the algorithm is guaranteed to
cut half of the edges of the input graph in expectation. That is,

 Big Idea 24 A randomized algorithm outputs the correct value


with good probability on every possible input.

We will define more formally what “good probability” means in


Chapter 20 but the crucial point is that this probability is always only
taken over the random choices of the algorithm, while the input is not
chosen at random.

19.1.1 Amplifying the success of randomized algorithms


Theorem 19.1 gives us an algorithm that cuts 𝑚/2 edges in expectation.
But, as we saw before, expectation does not immediately imply con-
centration, and so a priori, it may be the case that when we run the
algorithm, most of the time we don’t get a cut matching the expecta-
tion. Luckily, we can amplify the probability of success by repeating
the process several times and outputting the best cut we find. We
start by arguing that the probability the algorithm above succeeds in
cutting at least 𝑚/2 edges is not too tiny.
Lemma 19.2 The probability that a random cut in an 𝑚 edge graph cuts
at least 𝑚/2 edges is at least 1/(2𝑚).

Proof Idea:
To see the idea behind the proof, think of the case that 𝑚 = 1000. In
this case one can show that we will cut at least 500 edges with proba-
bility at least 0.001 (and so in particular larger than 1/(2𝑚) = 1/2000).
Specifically, if we assume otherwise, then this means that with proba-
bility more than 0.999 the algorithm cuts 499 or fewer edges. But since
we can never cut more than the total of 1000 edges, given this assump-
tion, the highest value of the expected number of edges cut is if we
cut exactly 499 edges with probability 0.999 and cut 1000 edges with
probability 0.001. Yet even in this case the expected number of edges
will be 0.999 ⋅ 499 + 0.001 ⋅ 1000 < 500, which contradicts the fact that
we’ve calculated the expectation to be at least 500 in Theorem 19.1.

Proof of Lemma 19.2. Let 𝑝 be the probability that we cut at least 𝑚/2
edges and suppose, towards a contradiction, that 𝑝 < 1/(2𝑚). Since
the number of edges cut is an integer, and 𝑚/2 is a multiple of 0.5,
by definition of 𝑝, with probability 1 − 𝑝 we cut at most 𝑚/2 − 0.5
edges. Moreover, since we can never cut more than 𝑚 edges, under
our assumption that 𝑝 < 1/(2𝑚), we can bound the expected number
of edges cut by

𝑝𝑚 + (1 − 𝑝)(𝑚/2 − 0.5) ≤ 𝑝𝑚 + 𝑚/2 − 0.5


But if 𝑝 < 1/(2𝑚) then 𝑝𝑚 < 0.5 and so the right-hand side is smaller
than 𝑚/2, which contradicts the fact that (as proven in Theorem 19.1)
the expected number of edges cut is at least 𝑚/2.

19.1.2 Success amplification


Lemma 19.2 shows that our algorithm succeeds at least some of the
time, but we’d like to succeed almost all of the time. The approach
to do that is to simply repeat our algorithm many times, with fresh
randomness each time, and output the best cut we get in one of these
repetitions. It turns out that with extremely high probability we will
get a cut of size at least 𝑚/2. For example, if we repeat this experiment
2000𝑚 times, then (using the inequality (1 − 1/𝑘)𝑘 ≤ 1/𝑒 ≤ 1/2)
we can show that the probability that we will never cut at least 𝑚/2
edges, where 𝑘 = 2𝑚, is at most

(1 − 1/(2𝑚))^{2000𝑚} = (1 − 1/𝑘)^{1000𝑘} = ((1 − 1/𝑘)^𝑘 )^{1000} ≤ 2^{−1000} .

More generally, the same calculations can be used to show the


following lemma:
Lemma 19.3 There is an algorithm that on input a graph 𝐺 = (𝑉 , 𝐸) and
a number 𝑘, runs in time polynomial in |𝑉 | and 𝑘 and outputs a cut
(𝑆, 𝑆) such that

Pr[number of edges cut by (𝑆, 𝑆) ≥ |𝐸|/2] ≥ 1 − 2−𝑘 .

Proof of Lemma 19.3. The algorithm will work as follows:



Algorithm AMPLIFY RANDOM CUT:


Input: Graph 𝐺 = (𝑉 , 𝐸) with 𝑛 vertices and 𝑚 edges. Denote 𝑉 =
{𝑣0 , 𝑣1 , … , 𝑣𝑛−1 }. Number 𝑘 > 0.
Operation:

1. Repeat the following 200𝑘𝑚 times:


a. Pick 𝑥 uniformly at random in {0, 1}𝑛 .
b. Let 𝑆 ⊆ 𝑉 be the set {𝑣𝑖 ∶ 𝑥𝑖 = 1 , 𝑖 ∈ [𝑛]} that includes all
vertices corresponding to coordinates of 𝑥 where 𝑥𝑖 = 1.
c. If (𝑆, 𝑆) cuts at least 𝑚/2 edges then halt and output (𝑆, 𝑆).
2. Output “failed”

We leave completing the analysis as an exercise to the reader (see


Exercise 19.1).
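
For concreteness, here is a sketch (ours, following the pseudocode above) of the amplified algorithm in Python.

# Illustrative sketch (not from the book): amplified random cut.
import random

def amplify_random_cut(n, edges, k):
    # Repeat the basic random cut 200*k*m times; output a cut of >= m/2 edges if one is found.
    m = len(edges)
    for _ in range(200 * k * m):
        x = [random.randint(0, 1) for _ in range(n)]
        cut = sum(1 for (u, v) in edges if x[u] != x[v])
        if cut >= m / 2:
            return {i for i in range(n) if x[i] == 1}
    return None  # "failed": by the analysis this happens with probability at most 2^{-k}

n, edges = 5, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
print(amplify_random_cut(n, edges, k=10))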

19.1.3 Two-sided amplification


The analysis above relied on the fact that the maximum cut has one
sided error. By this we mean that if we get a cut of size at least 𝑚/2
then we know we have succeeded. This is common for randomized
algorithms, but is not the only case. In particular, consider the task of
computing some Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1}. A randomized
algorithm 𝐴 for computing 𝐹 , given input 𝑥, might toss coins and suc-
ceed in outputting 𝐹 (𝑥) with probability, say, 0.9. We say that 𝐴 has
two sided errors if there is positive probability that 𝐴(𝑥) outputs 1 when
𝐹 (𝑥) = 0, and positive probability that 𝐴(𝑥) outputs 0 when 𝐹 (𝑥) = 1.
In such a case, to amplify 𝐴’s success, we cannot simply repeat it 𝑘
times and output 1 if a single one of those repetitions resulted in 1,
nor can we output 0 if a single one of the repetitions resulted in 0. But
we can output the majority value of these repetitions. By the Chernoff
bound (Theorem 18.12), with probability exponentially close to 1 (i.e.,
1 − 2^{−Ω(𝑘)} ), the fraction of the repetitions where 𝐴 will output 𝐹 (𝑥)
will be at least, say 0.89, and in such cases we will of course output the
correct answer.
The above translates into the following theorem

If 𝐹 ∶ {0, 1}∗ → {0, 1} is a func-


Theorem 19.4 — Two-sided amplification.
tion such that there is a polynomial-time algorithm 𝐴 satisfying

Pr[𝐴(𝑥) = 𝐹 (𝑥)] ≥ 0.51


for every 𝑥 ∈ {0, 1}∗ , then there is a polynomial time algorithm 𝐵


satisfying
Pr[𝐵(𝑥) = 𝐹 (𝑥)] ≥ 1 − 2^{−|𝑥|}
for every 𝑥 ∈ {0, 1}∗ .

We omit the proof of Theorem 19.4, since we will prove a more


general result later on in Theorem 20.5.
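
The amplification step itself is easy to express in code. In the sketch below (ours; the algorithm A is a toy stand-in for any randomized algorithm that answers correctly with probability at least 0.51), we run 𝐴 independently 𝑘 times and output the majority answer.

# Illustrative sketch (not from the book): two-sided error amplification by majority vote.
import random
from collections import Counter

def amplify(A, x, k):
    # Run A on x for k independent repetitions and return the most common answer.
    answers = Counter(A(x) for _ in range(k))
    return answers.most_common(1)[0][0]

def A(x):
    # Toy randomized algorithm: outputs the "correct" answer with probability only 0.6.
    correct = len(x) % 2
    return correct if random.random() < 0.6 else 1 - correct

print(amplify(A, "10110", k=1001))  # equals 1 (= len("10110") % 2) except with tiny probability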

19.1.4 What does this mean?


We have shown a probabilistic algorithm that on any 𝑚 edge graph
𝐺, will output a cut of at least 𝑚/2 edges with probability at least
1 − 2−1000 . Does it mean that we can consider this problem as “easy”?
Should we be somewhat wary of using a probabilistic algorithm, since
it can sometimes fail?
First of all, it is important to emphasize that this is still a worst case
guarantee. That is, we are not assuming anything about the input
graph: the probability is only due to the internal randomness of the al-
gorithm. While a probabilistic algorithm might not seem as nice as a
deterministic algorithm that is guaranteed to give an output, to get a
sense of what a failure probability of 2−1000 means, note that:

• The chance of winning the Massachusetts Mega Millions lottery is


one over (75)^5 ⋅ 15, which is roughly 2^{−35} . So 2^{−1000} corresponds
to winning the lottery about 300 times in a row, at which point you
might not care so much about your algorithm failing.

• The chance for a U.S. resident to be struck by lightning is about


1/700000, which corresponds to about a 2−45 chance that you’ll
be struck by lightning the very second that you’re reading this
sentence (after which again you might not care so much about the
algorithm’s performance).

• Since the earth is about 5 billion years old, we can estimate the
chance that an asteroid of the magnitude that caused the dinosaurs’
extinction will hit us this very second to be about 2−60 . It is quite
likely that even a deterministic algorithm will fail if this happens.

So, in practical terms, a probabilistic algorithm is just as good as


a deterministic one. But it is still a theoretically fascinating ques-
tion whether randomized algorithms actually yield more power, or
whether it is the case that for any computational problem that can be
solved by a probabilistic algorithm, there is a deterministic algorithm
with nearly the same performance.² For example, we will see in Ex-
ercise 19.2 that there is in fact a deterministic algorithm that can cut
at least 𝑚/2 edges in an 𝑚-edge graph. We will discuss this question
in generality in Chapter 20. For now, let us see a couple of examples
where randomization leads to algorithms that are better in some sense
than the known deterministic algorithms.

² This question does have some significance to practice, since hardware that generates high quality randomness at speed is non-trivial to construct.

19.2 SOLVING SAT THROUGH RANDOMIZATION


The 3SAT problem is NP hard, and so it is unlikely that it has a poly-
nomial (or even subexponential) time algorithm. But this does not
mean that we can’t do at least somewhat better than the trivial 2𝑛 al-
gorithm for 𝑛-variable 3SAT. The best known worst-case algorithms
for 3SAT are randomized, and are related to the following simple
algorithm, variants of which are also used in practice:

Algorithm WalkSAT:
Input: An 𝑛 variable 3CNF formula 𝜑.
Parameters: 𝑇 , 𝑆 ∈ ℕ
Operation:

1. Repeat the following 𝑇 steps:


a. Choose a random assignment 𝑥 ∈ {0, 1}𝑛 and repeat the following
for 𝑆 steps:
1. If 𝑥 satisfies 𝜑 then output 𝑥.
2. Otherwise, choose a random clause (ℓ𝑖 ∨ ℓ𝑗 ∨ ℓ𝑘 ) that 𝑥 does
not satisfy, choose a random literal in ℓ𝑖 , ℓ𝑗 , ℓ𝑘 and modify 𝑥 to
satisfy this literal.
2. If all the 𝑇 ⋅ 𝑆 repetitions above did not result in a satisfying assign-
ment then output Unsatisfiable
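
For reference, here is a sketch (ours) of WalkSAT in Python, where a 3CNF formula is represented as a list of clauses and each clause is a list of signed integers (the literal 𝑣 > 0 stands for the variable 𝑥_{𝑣−1} and −𝑣 for its negation); this encoding is an assumption made for illustration.

# Illustrative sketch (not from the book): the WalkSAT algorithm.
import random

def satisfies(clause, x):
    # A clause is satisfied if some literal v agrees with the assignment x.
    return any((x[abs(v) - 1] == 1) == (v > 0) for v in clause)

def walksat(clauses, n, T, S):
    # T random restarts; in each, S local steps that fix a random literal
    # of a random unsatisfied clause. Returns an assignment or None ("Unsatisfiable").
    for _ in range(T):
        x = [random.randint(0, 1) for _ in range(n)]
        for _ in range(S):
            unsatisfied = [c for c in clauses if not satisfies(c, x)]
            if not unsatisfied:
                return x
            v = random.choice(random.choice(unsatisfied))  # a random literal of a random bad clause
            x[abs(v) - 1] = 1 if v > 0 else 0               # flip the variable to satisfy that literal
        if all(satisfies(c, x) for c in clauses):
            return x
    return None

# Example: (x1 or x2 or not-x3) and (not-x1 or x3 or x2) and (not-x2 or x3 or x1)
clauses = [[1, 2, -3], [-1, 3, 2], [-2, 3, 1]]
print(walksat(clauses, n=3, T=100, S=10))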

The running time of this algorithm is 𝑆 ⋅ 𝑇 ⋅ 𝑝𝑜𝑙𝑦(𝑛), and so the


key question is how small we can make 𝑆 and 𝑇 so that the proba-
bility that WalkSAT outputs Unsatisfiable on a satisfiable formula
𝜑 is small. It is known that we can do so with 𝑆 ⋅ 𝑇 = Õ((4/3)^𝑛 ) =
Õ(1.333 …^𝑛 ) (see Exercise 19.4 for a weaker result), but we’ll show
below a simpler analysis yielding 𝑆 ⋅ 𝑇 = Õ((√3)^𝑛 ) = Õ(1.74^𝑛 ), which is
still much better than the trivial 2^𝑛 bound.³

³ At the time of this writing, the best known randomized algorithms for 3SAT run in time roughly 𝑂(1.308^𝑛 ), and the best known deterministic algorithms run in time 𝑂(1.3303^𝑛 ) in the worst case.
√ 𝑛
If we set 𝑇 = 100 ⋅ 3 and
Theorem 19.5 — WalkSAT simple analysis.
𝑆 = 𝑛/2, then the probability we output Unsatisfiable for a
satisfiable 𝜑 is at most 1/2.

Proof. Suppose that 𝜑 is a satisfiable formula and let 𝑥∗ be a satisfying


assignment for it. For every 𝑥 ∈ {0, 1}𝑛 , denote by Δ(𝑥, 𝑥∗ ) the num-
ber of coordinates that differ between 𝑥 and 𝑥∗ . The heart of the proof
is the following claim:
Claim I: For every 𝑥, 𝑥∗ as above, in every local improvement step,
the value Δ(𝑥, 𝑥∗ ) is decreased by one with probability at least 1/3.

Proof of Claim I: Since 𝑥∗ is a satisfying assignment, if 𝐶 is a clause


that 𝑥 does not satisfy, then at least one of the variables involved in 𝐶
must get different values in 𝑥 and 𝑥∗ . Thus when we change 𝑥 by one
of the three literals in the clause, we have probability at least 1/3 of
decreasing the distance.
The second claim is that our starting point is not that bad:
Claim II: With probability at least $1/2$ over a random $x \in \{0,1\}^n$, $\Delta(x, x^*) \le n/2$.
Proof of Claim II: Consider the map FLIP ∶ {0, 1}𝑛 → {0, 1}𝑛 that
simply “flips” all the bits of its input from 0 to 1 and vice versa. That
is, FLIP(𝑥0 , … , 𝑥𝑛−1 ) = (1 − 𝑥0 , … , 1 − 𝑥𝑛−1 ). Clearly FLIP is one to
one. Moreover, if 𝑥 is of distance 𝑘 to 𝑥∗ , then FLIP(𝑥) is distance 𝑛 − 𝑘
to 𝑥∗ . Now let 𝐵 be the “bad event” in which 𝑥 is of distance > 𝑛/2
from 𝑥∗ . Then the set 𝐴 = FLIP(𝐵) = {FLIP(𝑥) ∶ 𝑥 ∈ 𝐵} satisfies
|𝐴| = |𝐵| and that if 𝑥 ∈ 𝐴 then 𝑥 is of distance < 𝑛/2 from 𝑥∗ . Since
𝐴 and 𝐵 are disjoint events, Pr[𝐴] + Pr[𝐵] ≤ 1. Since they have the
same cardinality, they have the same probability and so we get that
2 Pr[𝐵] ≤ 1 or Pr[𝐵] ≤ 1/2. (See also Fig. 19.2).
Claims I and II imply that each of the $T$ iterations of the outer loop succeeds with probability at least $\frac{1}{2} \cdot \sqrt{3}^{-n}$. Indeed, by Claim II, the original guess $x$ will satisfy $\Delta(x,x^*) \le n/2$ with probability $\Pr[\Delta(x,x^*) \le n/2] \ge 1/2$. By Claim I, even conditioned on all the history so far, for each of the $S = n/2$ steps of the inner loop we have probability at least $1/3$ of being "lucky" and decreasing the distance (i.e., the output of $\Delta$) by one. The chance we will be lucky in all $n/2$ steps is hence at least $(1/3)^{n/2} = \sqrt{3}^{-n}$.
Since any single iteration of the outer loop succeeds with probability at least $\frac{1}{2} \cdot \sqrt{3}^{-n}$, the probability that we never do so in $T = 100\sqrt{3}^n$ repetitions is at most $(1 - \frac{1}{2\sqrt{3}^n})^{100\cdot\sqrt{3}^n} \le (1/e)^{50}$. ■

Figure 19.2 caption: For every $x^* \in \{0,1\}^n$, we can sort all strings in $\{0,1\}^n$ according to their distance from $x^*$, where we let $A = \{x \in \{0,1\}^n \mid dist(x,x^*) \le n/2\}$ be the "top half" of strings. If we define FLIP $: \{0,1\}^n \to \{0,1\}^n$ to be the map that "flips" the bits of a given string $x$, then it maps every $x \notin A$ to an output $FLIP(x) \in A$ in a one-to-one way, and so it demonstrates that $\Pr[A] \ge \Pr[\overline{A}]$ and hence $\Pr[A] \ge 1/2$.

19.3 BIPARTITE MATCHING

The matching problem is one of the canonical optimization problems,


arising in all kinds of applications: matching residents with hospitals,
kidney donors with patients, flights with crews, and many others.
One prototypical variant is bipartite perfect matching. In this problem, we are given a bipartite graph $G = (L \cup R, E)$ which has $2n$ vertices partitioned into $n$-sized sets $L$ and $R$, where all edges have one endpoint in $L$ and the other in $R$. The goal is to determine whether there is a perfect matching, a subset $M \subseteq E$ of $n$ disjoint edges. That is, $M$ matches every vertex in $L$ to a unique vertex in $R$.

Figure 19.3 caption: The bipartite matching problem in the graph $G = (L \cup R, E)$ can be reduced to the minimum $s,t$ cut problem in the graph $G'$ obtained by adding vertices $s,t$ to $G$, connecting $s$ with $L$ and connecting $t$ with $R$.

The bipartite matching problem turns out to have a polynomial-time algorithm, since we can reduce finding a matching in $G$ to find-

ing a maximum flow (or equivalently, minimum cut) in a related


graph 𝐺′ (see Fig. 19.3). However, we will see a different probabilistic
algorithm to determine whether a graph contains such a matching.
Let us label 𝐺’s vertices as 𝐿 = {ℓ0 , … , ℓ𝑛−1 } and 𝑅 = {𝑟0 , … , 𝑟𝑛−1 }.
A matching 𝑀 corresponds to a permutation 𝜋 ∈ 𝑆𝑛 (i.e., one-to-one
and onto function 𝜋 ∶ [𝑛] → [𝑛]) where for every 𝑖 ∈ [𝑛], we define 𝜋(𝑖)
to be the unique 𝑗 such that 𝑀 contains the edge {ℓ𝑖 , 𝑟𝑗 }. Define an
𝑛 × 𝑛 matrix 𝐴 = 𝐴(𝐺) where 𝐴𝑖,𝑗 = 1 if and only if the edge {ℓ𝑖 , 𝑟𝑗 }
is present and 𝐴𝑖,𝑗 = 0 otherwise. The correspondence between
matchings and permutations implies the following claim:
Lemma 19.6 — Matching polynomial. Define $P = P(G)$ to be the polynomial mapping $\mathbb{R}^{n^2}$ to $\mathbb{R}$ where

$$P(x_{0,0},\ldots,x_{n-1,n-1}) = \sum_{\pi \in S_n} sign(\pi) \left( \prod_{i=0}^{n-1} A_{i,\pi(i)} \right) \prod_{i=0}^{n-1} x_{i,\pi(i)} \;\;\; (19.1)$$

Then 𝐺 has a perfect matching if and only if 𝑃 is not identically zero.


That is, $G$ has a perfect matching if and only if there exists some assignment $x = (x_{i,j})_{i,j \in [n]} \in \mathbb{R}^{n^2}$ such that $P(x) \ne 0$. (The sign of a permutation $\pi : [n] \to [n]$, denoted by $sign(\pi)$, can be defined in several equivalent ways, one of which is that it equals $-1$ if the number of pairs $x < y$ s.t. $\pi(x) > \pi(y)$ is odd and equals $+1$ otherwise. The importance of the term $sign(\pi)$ is that it makes $P$ equal to the determinant of the matrix $(A_{i,j}x_{i,j})$ and hence efficiently computable.)

Proof. If $G$ has a perfect matching $M^*$, then let $\pi^*$ be the permutation corresponding to $M^*$ and let $x^* \in \mathbb{R}^{n^2}$ be defined as follows: $x^*_{i,j} = 1$ if $j = \pi^*(i)$ and $x^*_{i,j} = 0$ otherwise. (That is, $x^*_{i,j} = 1$ iff $\pi^*(i) = j$.) We claim that $P(x^*) = sign(\pi^*)$, which in particular means that $P$ is not identically zero. To see why this is true, write $P(x^*) = \sum_{\pi \in S_n} sign(\pi) P_\pi(x^*)$ where $P_\pi(x^*) = \prod_{i=0}^{n-1} A_{i,\pi(i)} x^*_{i,\pi(i)}$. But for all $\pi \ne \pi^*$ there will be some $i$ such that $\pi(i) \ne \pi^*(i)$ and so $x^*_{i,\pi(i)} = 0$, which means that $P_\pi(x^*) = 0$. On the other hand, since $\pi^*$ is a matching in $G$, $A_{i,\pi^*(i)} = 1$ for all $i$, and hence $P_{\pi^*}(x^*) = \prod_{i=0}^{n-1} A_{i,\pi^*(i)} x^*_{i,\pi^*(i)} = 1$, and so $P(x^*) = sign(\pi^*)$.
On the other hand, suppose that $P$ is not identically zero. By (19.1), this means there is some $x \in \{0,1\}^{n^2}$ and some permutation $\pi$ such that $\prod_{i=0}^{n-1} A_{i,\pi(i)} x_{i,\pi(i)} \ne 0$. But for this to happen, it must be that $A_{i,\pi(i)} \ne 0$ for all $i$, which means that for every $i$, the edge $\{\ell_i, r_{\pi(i)}\}$ exists in the graph, and hence $\pi$ corresponds to a perfect matching in $G$.

As we've seen before, for every $x \in \mathbb{R}^{n^2}$, we can compute $P(x)$ by simply computing the determinant of the matrix $A(x)$, which is obtained by replacing $A_{i,j}$ with $A_{i,j} x_{i,j}$. This reduces testing perfect
matching to the zero testing problem for polynomials: given some poly-
nomial 𝑃 (⋅), test whether 𝑃 is identically zero or not. The intuition
behind our randomized algorithm for zero testing is the following:

If a polynomial is not identically zero, then it can’t have “too many” roots.

This intuition sort of makes sense. For one-variable polynomials, we know that a non-zero linear function has at most one root, a quadratic function (e.g., a parabola) has at most two roots, and generally a degree $d$ equation has at most $d$ roots. While in more than one variable there can be an infinite number of roots (e.g., the polynomial $x_0 + y_0$ vanishes on the line $y_0 = -x_0$), it is still the case that the set of roots is very "small" compared to the set of all inputs. For example, the roots of a bivariate polynomial form a curve, the roots of a three-variable polynomial form a surface, and more generally the roots of an $n$-variable polynomial form a space of dimension $n-1$.
This intuition leads to the following simple randomized algorithm:

To decide if 𝑃 is identically zero, choose a “random” input 𝑥 and check if


𝑃 (𝑥) ≠ 0.

This makes sense: if there are only “few” roots, then we expect that
with high probability the random input 𝑥 is not going to be one of
those roots. However, to transform this into an actual algorithm, we
need to make both the intuition and the notion of a “random” input
precise. Choosing a random real number is quite problematic, espe-
cially when you have only a finite number of coins at your disposal,
and so we start by reducing the task to a finite setting. We will use the
following result:

Theorem 19.7 — Schwartz–Zippel lemma. For every integer $q$ and polynomial $P : \mathbb{R}^n \to \mathbb{R}$ with integer coefficients, if $P$ has degree at most $d$ and is not identically zero, then it has at most $d \cdot q^{n-1}$ roots in the set $[q]^n = \{(x_0,\ldots,x_{n-1}) : x_i \in \{0,\ldots,q-1\}\}$.

We omit the (not too complicated) proof of Theorem 19.7. We


remark that it holds not just over the real numbers but over any field
as well. Since the matching polynomial 𝑃 of Lemma 19.6 has degree at
most 𝑛, Theorem 19.7 leads directly to a simple algorithm for testing if
it is non-zero:

Algorithm Perfect-Matching:
Input: Bipartite graph 𝐺 on 2𝑛 vertices {ℓ0 , … , ℓ𝑛−1 , 𝑟0 , … , 𝑟𝑛−1 }.
Operation:

1. For every $i,j \in [n]$, choose $x_{i,j}$ independently at random from $[2n] = \{0,\ldots,2n-1\}$.
2. Compute the determinant of the matrix $A(x)$ whose $(i,j)^{th}$ entry equals $x_{i,j}$ if the edge $\{\ell_i, r_j\}$ is present and $0$ otherwise.
3. Output no perfect matching if this determinant is zero, and out-
put perfect matching otherwise.
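As an illustration, here is a Python sketch of the algorithm above (not the book's code; the function names and matrix representation are choices made here). It performs the random substitution and computes the determinant exactly over the rationals. "True" answers are always correct, while a "False" answer can err, with probability at most $1/2$ by the Schwartz–Zippel lemma:

```python
import random
from fractions import Fraction

def determinant(M):
    """Exact determinant of a square matrix of Fractions via Gaussian elimination."""
    M = [row[:] for row in M]
    n, det = len(M), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= factor * M[col][c]
    return det

def has_perfect_matching(adj):
    """Randomized perfect-matching test: adj[i][j] == 1 iff edge {l_i, r_j} exists.
    Substitutes random values from [2n] for the variables and tests the determinant."""
    n = len(adj)
    A = [[Fraction(random.randrange(2 * n)) if adj[i][j] else Fraction(0)
          for j in range(n)] for i in range(n)]
    return determinant(A) != 0

# A 3-by-3 bipartite graph that contains a perfect matching:
print(has_perfect_matching([[1, 1, 0], [0, 1, 0], [0, 0, 1]]))
```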

This algorithm can be improved further (e.g., see Exercise 19.5).


While it is not necessarily faster than the cut-based algorithms for per-
fect matching, it does have some advantages. In particular, it is more
amenable to parallelization. (However, it also has the significant dis-
advantage that it does not produce a matching but only states that one
exists.) The Schwartz–Zippel Lemma, and the associated zero testing
algorithm for polynomials, is widely used across computer science,
including in several settings where we have no known deterministic
algorithm matching their performance.

✓ Chapter Recap

• Using concentration results, we can amplify in


polynomial time the success probability of a prob-
abilistic algorithm from a mere $1/p(n)$ to $1 - 2^{-q(n)}$ for all polynomials $p$ and $q$.
• There are several randomized algorithms that are
better in various senses (e.g., simpler, faster, or
other advantages) than the best known determinis-
tic algorithm for the same problem.

19.4 EXERCISES
Exercise 19.1 — Amplification for max cut. Prove Lemma 19.3.

Exercise 19.2 — Deterministic max cut algorithm. (TODO: add exercise to give a deterministic max cut algorithm that gives $m/2$ edges. Talk about greedy approach.) ■

Exercise 19.3 — Simulating distributions using coins. Our model for probability involves tossing $n$ coins, but sometimes algorithms require sampling from other distributions, such as selecting a uniform number in $\{0,\ldots,M-1\}$ for some $M$. Fortunately, we can simulate this with an exponentially small probability of error: prove that for every $M$, if $n > k\lceil \log M \rceil$, then there is a function $F : \{0,1\}^n \to \{0,\ldots,M-1\} \cup \{\bot\}$ such that (1) the probability that $F(x) = \bot$ is at most $2^{-k}$ and (2) the distribution of $F(x)$ conditioned on $F(x) \ne \bot$ is equal to the uniform distribution over $\{0,\ldots,M-1\}$. (Hint: Think of $x \in \{0,1\}^n$ as choosing $k$ numbers $y_1,\ldots,y_k \in \{0,\ldots,2^{\lceil \log M \rceil}-1\}$. Output the first such number that is in $\{0,\ldots,M-1\}$.) ■

Exercise 19.4 — Better walksat analysis.
1. Prove that for every $\epsilon > 0$, if $n$ is large enough then for every $x^* \in \{0,1\}^n$, $\Pr_{x \sim \{0,1\}^n}[\Delta(x,x^*) \le n/3] \le 2^{-(1 - H(1/3) - \epsilon)n}$ where $H(p) = p\log(1/p) + (1-p)\log(1/(1-p))$ is the same function as in Exercise 18.10.
2. Prove that $2^{1 - H(1/4) + (1/4)\log 3} = (3/2)$.
3. Use the above to prove that for every $\delta > 0$ and large enough $n$, if we set $T = 1000 \cdot (3/2 + \delta)^n$ and $S = n/4$ in the WalkSAT algorithm, then for every satisfiable 3CNF $\varphi$, the probability that we output unsatisfiable is at most $1/2$.

Exercise 19.5 — Faster bipartite matching (challenge). (To be completed: improve the matching algorithm by working modulo a prime.)

19.5 BIBLIOGRAPHICAL NOTES


The books of Motwani and Raghavan [MR95] and Mitzenmacher and
Upfal [MU17] are two excellent resources for randomized algorithms.
Some of the history of the discovery of the Monte Carlo algorithm is covered here.

19.6 ACKNOWLEDGEMENTS
Learning Objectives:
• Formal definition of probabilistic polynomial
time: the class BPP.
• Proof that every function in BPP can be
computed by 𝑝𝑜𝑙𝑦(𝑛)-sized NAND-CIRC
programs/circuits.
• Relations between BPP and NP.
• Pseudorandom generators

20
Modeling randomized computation

“Any one who considers arithmetical methods of producing random digits is, of
course, in a state of sin.” John von Neumann, 1951.

So far we have described randomized algorithms in an informal


way, assuming that an operation such as “pick a string 𝑥 ∈ {0, 1}𝑛 ”
can be done efficiently. We have neglected to address two questions:

1. How do we actually efficiently obtain random strings in the physi-


cal world?

2. What is the mathematical model for randomized computations,


and is it more powerful than deterministic computation?

The first question is of both practical and theoretical importance,


but for now let’s just say that there are various physical sources of
“random” or “unpredictable” data. A user’s mouse movements and
typing pattern, (non-solid state) hard drive and network latency,
thermal noise, and radioactive decay have all been used as sources for
randomness (see discussion in Section 20.8). For example, many Intel
chips come with a random number generator built in. One can even
build mechanical coin tossing machines (see Fig. 20.1).

This chapter: A non-mathy overview


In this chapter we focus on the second question: formally
modeling probabilistic computation and studying its power.
We will show that:

1. We can define the class BPP that captures all Boolean


functions that can be computed in polynomial time by a
randomized algorithm. Crucially BPP is still very much a worst case class of computation: the probability is only over the choice of the random coins of the algorithm, as opposed to the choice of the input.

Figure 20.1 caption: A mechanical coin tosser built for Percy Diaconis by Harvard technicians Steve Sansone and Rick Haggerty.




2. We can amplify the success probability of randomized


algorithms, and as a result the definition of the class BPP
is equivalent whether or not we require 2/3 success, 0.51
success or every 1 − 2−𝑛 success.

3. Though, as is the case for P and NP, there is much we


do not know about the class BPP, we can establish
some relations between BPP and the other complexity
classes we saw before. In particular we will show that
P ⊆ BPP ⊆ EXP and BPP ⊆ P/poly .

4. While the relation between BPP and NP is not known, we


can show that if P = NP then BPP = P.

5. We also show that the concept of NP completeness ap-


plies equally well if we use randomized algorithms as our
model of “efficient computation”. That is, if a single NP
complete problem has a randomized polynomial-time
algorithm, then all of NP can be computed in polynomial-
time by randomized algorithms.

6. Finally we will discuss the question of whether BPP = P


and show some of the intriguing evidence that the answer
might actually be “Yes” using the concept of pseudorandom
generators.

20.1 MODELING RANDOMIZED COMPUTATION


Modeling randomized computation is actually quite easy. We can
add the following operations to any programming language such as
NAND-TM, NAND-RAM, NAND-CIRC etc.:

foo = RAND()

where foo is a variable. The result of applying this operation is that


foo is assigned a random bit in {0, 1}. (Every time the RAND operation
is invoked it returns a fresh independent random bit.) We call the
programming languages that are augmented with this extra operation
RNAND-TM, RNAND-RAM, and RNAND-CIRC respectively.
Similarly, we can easily define randomized Turing machines as
Turing machines in which the transition function 𝛿 gets as an extra
input (in addition to the current state and symbol read from the tape)
a bit $b$ that in each step is chosen at random in $\{0,1\}$. Of course the
transition function can ignore this bit (and have the same output
regardless of whether 𝑏 = 0 or 𝑏 = 1), and hence randomized Turing
machines generalize deterministic Turing machines.

We can use the RAND() operation to define the notion of a function


being computed by a randomized 𝑇 (𝑛) time algorithm for every nice
time bound 𝑇 ∶ ℕ → ℕ, as well as the notion of a finite function being
computed by a size 𝑆 randomized NAND-CIRC program (or, equiv-
alently, a randomized circuit with 𝑆 gates that correspond to either
the NAND or coin-tossing operations). However, for simplicity we
will not define randomized computation in full generality, but simply
focus on the class of functions that are computable by randomized
algorithms running in polynomial time, which by historical convention
is known as BPP:

Definition 20.1 — The class BPP. Let $F : \{0,1\}^* \to \{0,1\}$. We say that $F \in \mathbf{BPP}$ if there exist constants $a,b \in \mathbb{N}$ and an RNAND-TM program $P$ such that for every $x \in \{0,1\}^*$, on input $x$, the program $P$ halts within at most $a|x|^b$ steps and

$$\Pr[P(x) = F(x)] \ge \frac{2}{3} \;\;\; (20.1)$$

where this probability is taken over the result of the RAND operations of $P$.

Note that the probability in (20.1) is taken only over the ran-
dom choices in the execution of 𝑃 and not over the choice of the in-
put 𝑥. In particular, as discussed in Big Idea 24, BPP is still a worst
case complexity class, in the sense that if 𝐹 is in BPP then there is a
polynomial-time randomized algorithm that computes 𝐹 with proba-
bility at least 2/3 on every possible (and not just random) input.
The same polynomial-overhead simulation of NAND-RAM pro-
grams by NAND-TM programs we saw in Theorem 13.5 extends to
randomized programs as well. Hence the class BPP is the same re-
gardless of whether it is defined via RNAND-TM or RNAND-RAM
programs. Similarly, we could have just as well defined BPP using
randomized Turing machines.
Because of these equivalences, below we will use the name “poly-
nomial time randomized algorithm” to denote a computation that can be
modeled by a polynomial-time RNAND-TM program, RNAND-RAM
program, or a randomized Turing machine (or any programming lan-
guage that includes a coin tossing operation). Since all these models
are equivalent up to polynomial factors, you can use your favorite
model to capture polynomial-time randomized algorithms without
any loss in generality.
Solved Exercise 20.1 — Choosing from a set. Modern programming languages often involve not just the ability to toss a random coin in $\{0,1\}$ but also to choose an element at random from a set $S$. Show that you can emulate this primitive using coin tossing. Specifically, show that

there is a randomized algorithm 𝐴 that, on input a set 𝑆 of 𝑚 strings


of length 𝑛, runs in time 𝑝𝑜𝑙𝑦(𝑛, 𝑚) and outputs either an element
𝑥 ∈ 𝑆 or “fail” such that

1. Let 𝑝 be the probability that 𝐴 outputs “fail”, then 𝑝 < 2−𝑛 (a


number small enough that it can be ignored).

2. For every $x \in S$, the probability that $A$ outputs $x$ is exactly $\frac{1-p}{m}$ (and so the output is uniform over $S$ if we ignore the tiny probability of failure).

Solution:
If the size of $S$ is a power of two, that is $m = 2^\ell$ for some $\ell \in \mathbb{N}$, then we can choose a random element in $S$ by tossing $\ell$ coins to obtain a string $w \in \{0,1\}^\ell$ and then output the $i$-th element of $S$ where $i$ is the number whose binary representation is $w$.
If $S$ is not a power of two, then our first attempt will be to let $\ell = \lceil \log m \rceil$ and do the same, but then output the $i$-th element of $S$ if $i \in [m]$ and output "fail" otherwise. Conditioned on not outputting "fail", this element is distributed uniformly in $S$. However, in the worst case, $2^\ell$ can be almost $2m$ and so the probability of fail
might be close to half. To reduce the failure probability, we can
repeat the experiment above 𝑛 times. Specifically, we will use the
following algorithm

Algorithm 20.2 — Sample from set.

Input: Set 𝑆 = {𝑥0 , … , 𝑥𝑚−1 } with 𝑥𝑖 ∈ {0, 1}𝑛 for all


𝑖 ∈ [𝑚].
Output: Either 𝑥 ∈ 𝑆 or ”fail”
1: Let ℓ ← ⌈log 𝑚⌉
2: for 𝑗 = 0, 1, … , 𝑛 − 1 do
3: Pick 𝑤 ∼ {0, 1}ℓ
4: Let 𝑖 ∈ [2ℓ ] be number whose binary representa-
tion is 𝑤.
5: if 𝑖 < 𝑚 then
6: return 𝑥𝑖
7: end if
8: end for
9: return ”fail”

Conditioned on not failing, the output of Algorithm 20.2 is uniformly distributed in $S$. However, since $2^\ell < 2m$, the probability of failure in each iteration is less than $1/2$ and so the probability of failure in all of them is at most $(1/2)^n = 2^{-n}$.
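For concreteness, here is a small Python sketch of this sampling procedure (illustrative only; the function name is a choice made here, and it uses 100 attempts rather than $n$, so the failure probability is at most $2^{-100}$):

```python
import random, math

def sample_from_set(S):
    """Sample (nearly) uniformly from a finite set using only fair coin tosses,
    following the idea of Algorithm 20.2 above (a sketch, not the book's code)."""
    m = len(S)
    ell = max(1, math.ceil(math.log2(m)))      # number of coin tosses per attempt
    for _ in range(100):                       # repeat to make failure unlikely
        i = int("".join(random.choice("01") for _ in range(ell)), 2)
        if i < m:
            return S[i]
    return None                                # "fail" (probability at most 2**-100)

print(sample_from_set(["a", "b", "c", "d", "e"]))
```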

20.1.1 An alternative view: random coins as an “extra input”


While we presented randomized computation as adding an extra
“coin tossing” operation to our programs, we can also model this as
being given an additional extra input. That is, we can think of a ran-
domized algorithm 𝐴 as a deterministic algorithm 𝐴′ that takes two
inputs 𝑥 and 𝑟 where the second input 𝑟 is chosen at random from
{0, 1}𝑚 for some 𝑚 ∈ ℕ (see Fig. 20.2). The equivalence to the Defini-
tion 20.1 is shown in the following theorem:

Theorem 20.3 — Alternative characterization of BPP. Let $F : \{0,1\}^* \to \{0,1\}$. Then $F \in \mathbf{BPP}$ if and only if there exist $a,b \in \mathbb{N}$ and $G : \{0,1\}^* \to \{0,1\}$ such that $G$ is in $\mathbf{P}$ and for every $x \in \{0,1\}^*$,

$$\Pr_{r \sim \{0,1\}^{a|x|^b}}[G(xr) = F(x)] \ge \frac{2}{3}. \;\;\; (20.2)$$

Proof Idea: The idea behind the proof is that, as illustrated in Fig. 20.2, we can simply replace sampling a random coin with reading a bit from the extra "random input" $r$ and vice versa. To prove this rigorously we need to work through some slightly cumbersome formal notation. This might be one of those proofs that is easier to work out on your own than to read.

Figure 20.2 caption: The two equivalent views of randomized algorithms. We can think of such an algorithm as having access to an internal RAND() operation that outputs a random independent value in $\{0,1\}$ whenever it is invoked, or we can think of it as a deterministic algorithm that in addition to the standard input $x \in \{0,1\}^n$ obtains an additional auxiliary input $r \in \{0,1\}^m$ that is chosen uniformly at random.


Proof of Theorem 20.3. We start by showing the “only if” direction.


Let 𝐹 ∈ BPP and let 𝑃 be an RNAND-TM program that computes
𝐹 as per Definition 20.1, and let 𝑎, 𝑏 ∈ ℕ be such that on every input
of length 𝑛, the program 𝑃 halts within at most 𝑎𝑛𝑏 steps. We will
construct a polynomial-time algorithm 𝑃 ′ such that for every 𝑥 ∈
$\{0,1\}^n$, if we set $m = an^b$, then

$$\Pr_{r \sim \{0,1\}^m}[P'(xr) = 1] = \Pr[P(x) = 1],$$

where the probability in the right-hand side is taken over the RAND()
operations in 𝑃 . In particular this means that if we define 𝐺(𝑥𝑟) =
𝑃 ′ (𝑥𝑟) then the function 𝐺 satisfies the conditions of (20.2).
The algorithm 𝑃 ′ will be very simple: it simulates the program 𝑃 ,
maintaining a counter 𝑖 initialized to 0. Every time that 𝑃 makes a
RAND() operation, the program 𝑃 ′ will supply the result from 𝑟𝑖 and
increment 𝑖 by one. We will never “run out” of bits, since the running
time of 𝑃 is at most 𝑎𝑛𝑏 and hence it can make at most this number of

RAND() calls. The output of 𝑃 ′ (𝑥𝑟) for a random 𝑟 ∼ {0, 1}𝑚 will be
distributed identically to the output of 𝑃 (𝑥).
For the other direction, given a function 𝐺 ∈ P satisfying the condi-
tion (20.2) and a NAND-TM 𝑃 ′ that computes 𝐺 in polynomial time,
we can construct an RNAND-TM program 𝑃 that computes 𝐹 in poly-
nomial time. On input 𝑥 ∈ {0, 1}𝑛 , the program 𝑃 will simply use the
RAND() instruction 𝑎𝑛𝑏 times to fill an array R[0] , …, R[𝑎𝑛𝑏 − 1] and
then execute the original program 𝑃 ′ on input 𝑥𝑟 where 𝑟𝑖 is the 𝑖-th
element of the array R. Once again, it is clear that if 𝑃 ′ runs in polyno-
mial time then so will $P$, and for every input $x$ and $r \in \{0,1\}^{an^b}$, the output of $P$ on input $x$ and where the coin tosses outcome is $r$ is equal to $P'(xr)$.
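The following toy Python sketch (not from the book; the procedure and its names are made up for illustration) shows the two views side by side for one specific randomized procedure: the first version draws its coins internally, while the second receives all of its coins up-front as an explicit "random tape" $r$ and is otherwise deterministic. (Reducing the sampled index modulo len(x) is a simplification made here for brevity.)

```python
import random

def approx_majority(x):
    """"Internal coins" view: sample 100 random coordinates of x and guess
    whether most coordinates of x equal 1."""
    samples = [x[random.randrange(len(x))] for _ in range(100)]
    return int(sum(samples) > 50)

def approx_majority_with_tape(x, r):
    """"Random tape" view: the same procedure, but every coin is read from the
    auxiliary input r, so the function itself is deterministic."""
    coins_per_sample = max(1, (len(x) - 1).bit_length())
    samples = []
    for j in range(100):
        bits = r[j * coins_per_sample:(j + 1) * coins_per_sample]
        i = int("".join(map(str, bits)), 2)
        samples.append(x[i % len(x)])
    return int(sum(samples) > 50)

x = [1, 1, 0, 1]
r = [random.randint(0, 1) for _ in range(100 * 2)]   # the "random tape"
print(approx_majority(x), approx_majority_with_tape(x, r))
```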

Remark 20.4 — Definitions of BPP and NP. The char-
acterization of BPP in Theorem 20.3 is reminiscent
of the characterization of NP in Definition 15.1, with
the randomness in the case of BPP playing the role
of the solution in the case of NP. However, there are
important differences between the two:

• The definition of NP is “one sided”: 𝐹 (𝑥) = 1 if


there exists a solution 𝑤 such that 𝐺(𝑥𝑤) = 1 and
𝐹 (𝑥) = 0 if for every string 𝑤 of the appropriate
length, 𝐺(𝑥𝑤) = 0. In contrast, the characteriza-
tion of BPP is symmetric with respect to the cases
𝐹 (𝑥) = 0 and 𝐹 (𝑥) = 1.
• The relation between NP and BPP is not immedi-
ately clear. It is not known whether BPP ⊆ NP,
NP ⊆ BPP, or these two classes are incomparable.
It is however known (with a non-trivial proof) that
if P = NP then BPP = P (see Theorem 20.11).
• Most importantly, the definition of NP is “inef-
fective,” since it does not yield a way of actually
finding whether there exists a solution among the
exponentially many possibilities. By contrast, the
definition of BPP gives us a way to compute the
function in practice by simply choosing the second
input at random.

”Random tapes”. Theorem 20.3 motivates sometimes considering the


randomness of an RNAND-TM (or RNAND-RAM) program as an
extra input. As such, if 𝐴 is a randomized algorithm that on inputs of
length 𝑛 makes at most 𝑚 coin tosses, we will often use the notation
𝐴(𝑥; 𝑟) (where 𝑥 ∈ {0, 1}𝑛 and 𝑟 ∈ {0, 1}𝑚 ) to refer to the result of
executing $A$ on $x$ when the coin tosses of $A$ correspond to the coordinates
of 𝑟. This second, or “auxiliary,” input is sometimes referred to as

a “random tape.” This terminology originates from the model of


randomized Turing machines.

20.1.2 Success amplification of two-sided error algorithms


The number 2/3 might seem arbitrary, but as we’ve seen in Chapter 19
it can be amplified to our liking:

Theorem 20.5 — Amplification. Let $F : \{0,1\}^* \to \{0,1\}$ be a Boolean function such that there is a polynomial $p : \mathbb{N} \to \mathbb{N}$ and a polynomial-time randomized algorithm $A$ satisfying that for every $x \in \{0,1\}^n$,

$$\Pr[A(x) = F(x)] \ge \frac{1}{2} + \frac{1}{p(n)}. \;\;\; (20.3)$$

Then for every polynomial $q : \mathbb{N} \to \mathbb{N}$ there is a polynomial-time randomized algorithm $B$ satisfying for every $x \in \{0,1\}^n$,

$$\Pr[B(x) = F(x)] \ge 1 - 2^{-q(n)}.$$

 Big Idea 25 We can amplify the success of randomized algorithms


to a value that is arbitrarily close to 1.

Proof Idea:
The proof is the same as we've seen before in the case of maximum cut and other examples. We use the Chernoff bound to argue that if $A$ computes $F$ with probability at least $\frac{1}{2} + \epsilon$ and we run it $O(k/\epsilon^2)$ times, each time using fresh and independent random coins, then the probability that the majority of the answers will not be correct will be less than $2^{-k}$. Amplification can be thought of as a "polling" of the choices for randomness for the algorithm (see Fig. 20.3).

Proof of Theorem 20.5. Let $A$ be an algorithm satisfying (20.3). Set $\epsilon = \frac{1}{p(n)}$ and $k = q(n)$ where $p,q$ are the polynomials in the theorem statement. We can run $A$ on input $x$ for $t = 10k/\epsilon^2$ times, using fresh
randomness in each execution, and compute the outputs 𝑦0 , … , 𝑦𝑡−1 .
We output the value 𝑦 that appeared the largest number of times. Let
𝑋𝑖 be the random variable that is equal to 1 if 𝑦𝑖 = 𝐹 (𝑥) and equal to
0 otherwise. The random variables 𝑋0 , … , 𝑋𝑡−1 are i.i.d. and satisfy
𝔼[𝑋𝑖 ] = Pr[𝑋𝑖 = 1] ≥ 1/2 + 𝜖, and hence by linearity of expectation
$\mathbb{E}[\sum_{i=0}^{t-1} X_i] \ge t(1/2 + \epsilon)$. For the plurality value to be incorrect, it must hold that $\sum_{i=0}^{t-1} X_i \le t/2$, which means that $\sum_{i=0}^{t-1} X_i$ is at least $\epsilon t$ far
from its expectation. Hence by the Chernoff bound (Theorem 18.12),

the probability that the plurality value is not correct is at most $2e^{-\epsilon^2 t}$, which is smaller than $2^{-k}$ for our choice of $t$.
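Here is a minimal Python sketch of amplification by repetition and plurality vote (illustrative code, not the book's; `noisy_parity` is a made-up algorithm that errs with probability 0.3):

```python
import random
from collections import Counter

def amplify(algorithm, x, t):
    """Run a randomized `algorithm` t times on x with fresh coins and output the
    plurality answer, as in the idea behind Theorem 20.5."""
    answers = Counter(algorithm(x) for _ in range(t))
    return answers.most_common(1)[0][0]

def noisy_parity(x):
    """A toy algorithm: outputs the parity of x, but errs with probability 0.3."""
    correct = sum(x) % 2
    return correct if random.random() > 0.3 else 1 - correct

x = [1, 0, 1, 1]
print(amplify(noisy_parity, x, t=101))   # wrong only with exponentially small probability in t
```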


20.2 BPP AND NP COMPLETENESS


Since “noisy processes” abound in nature, randomized algorithms can
be realized physically, and so it is reasonable to propose BPP rather
than P as our mathematical model for “feasible” or “tractable” com-
putation. One might wonder if this makes all the previous chapters
irrelevant, and in particular if the theory of NP completeness still
applies to probabilistic algorithms. Fortunately, the answer is Yes:

Theorem 20.6 — NP hardness and BPP. Suppose that $F$ is NP-hard and $F \in \mathbf{BPP}$. Then $\mathbf{NP} \subseteq \mathbf{BPP}$.

Figure 20.3 caption: If $F \in \mathbf{BPP}$ then there is a randomized polynomial-time algorithm $P$ with the following property: in the case $F(x) = 0$ two thirds of the "population" of random choices satisfy $P(x;r) = 0$ and in the case $F(x) = 1$ two thirds of the population satisfy $P(x;r) = 1$. We can think of amplification as a form of "polling" of the choices of randomness. By the Chernoff bound, if we poll a sample of $O(\frac{\log(1/\delta)}{\epsilon^2})$ random choices $r$, then with probability at least $1-\delta$, the fraction of $r$'s in the sample satisfying $P(x;r) = 1$ will give us an estimate of the fraction of the population within an $\epsilon$ margin of error. This is the same calculation used by pollsters to determine the needed sample size in their polls.

Before seeing the proof, note that Theorem 20.6 implies that if there was a randomized polynomial time algorithm for any NP-complete problem such as 3SAT, ISET etc., then there would be such an algorithm for every problem in NP. Thus, regardless of whether our model of computation is deterministic or randomized algorithms, NP complete problems retain their status as the "hardest problems in NP."
plete problems retain their status as the “hardest problems in NP.”
Proof Idea:
The idea is to simply run the reduction as usual, and plug it into
the randomized algorithm instead of a deterministic one. It would
be an excellent exercise, and a way to reinforce the definitions of NP-
hardness and randomized algorithms, for you to work out the proof
for yourself. However for the sake of completeness, we include this
proof below.

Proof of Theorem 20.6. Suppose that 𝐹 is NP-hard and 𝐹 ∈ BPP.


We will now show that this implies that NP ⊆ BPP. Let 𝐺 ∈ NP.
By the definition of NP-hardness, it follows that 𝐺 ≤𝑝 𝐹 , or that in
other words there exists a polynomial-time computable function 𝑅 ∶
{0, 1}∗ → {0, 1}∗ such that 𝐺(𝑥) = 𝐹 (𝑅(𝑥)) for every 𝑥 ∈ {0, 1}∗ . Now
if 𝐹 is in BPP then there is a polynomial-time RNAND-TM program 𝑃
such that
Pr[𝑃 (𝑦) = 𝐹 (𝑦)] ≥ 2/3 (20.4)
for every 𝑦 ∈ {0, 1}∗ (where the probability is taken over the random
coin tosses of 𝑃 ). Hence we can get a polynomial-time RNAND-TM
program 𝑃 ′ to compute 𝐺 by setting 𝑃 ′ (𝑥) = 𝑃 (𝑅(𝑥)). By (20.4)
Pr[𝑃 ′ (𝑥) = 𝐹 (𝑅(𝑥))] ≥ 2/3 and since 𝐹 (𝑅(𝑥)) = 𝐺(𝑥) this implies that
Pr[𝑃 ′ (𝑥) = 𝐺(𝑥)] ≥ 2/3, which proves that 𝐺 ∈ BPP.

Most of the results we’ve seen about NP hardness, including the


search to decision reduction of Theorem 16.1, the decision to optimiza-
tion reduction of Theorem 16.3, and the quantifier elimination result
of Theorem 16.6, all carry over in the same way if we replace P with
BPP as our model of efficient computation. Thus if NP ⊆ BPP then
we get essentially all of the strange and wonderful consequences of
P = NP. Unsurprisingly, we cannot rule out this possibility. In fact,
unlike P = EXP, which is ruled out by the time hierarchy theorem, we
don’t even know how to rule out the possibility that BPP = EXP! Thus
a priori it’s possible (though seems highly unlikely) that randomness
is a magical tool that allows us to speed up arbitrary exponential time 1
At the time of this writing, the largest “natural” com-
computation.1 Nevertheless, as we discuss below, it is believed that plexity class which we can’t rule out being contained
in BPP is the class NEXP, which we did not define
randomization’s power is much weaker and BPP lies in much more in this course, but corresponds to non-deterministic
“pedestrian” territory. exponential time. See this paper for a discussion of
this question.

20.3 THE POWER OF RANDOMIZATION


A major question is whether randomization can add power to compu-
tation. Mathematically, we can phrase this as the following question:
does BPP = P? Given what we’ve seen so far about the relations of
other complexity classes such as P and NP, or NP and EXP, one might
guess that:

1. We do not know the answer to this question.

2. But we suspect that BPP is different than P.

One would be correct about the former, but wrong about the latter.
As we will see, we do in fact have reasons to believe that BPP = P.
This can be thought of as supporting the extended Church Turing hy-
pothesis that deterministic polynomial-time Turing machines capture
what can be feasibly computed in the physical world.
We now survey some of the relations that are known between
BPP and other complexity classes we have encountered. (See also
Fig. 20.4.)
Figure 20.4 caption: Some possibilities for the relations between BPP and other complexity classes. Most researchers believe that BPP = P and that these classes are not powerful enough to solve NP-complete problems, let alone all problems in EXP. However, we have not even been able yet to rule out the possibility that randomness is a "silver bullet" that allows exponential speedup on all problems, and hence BPP = EXP. As we've already seen, we also can't rule out that P = NP. Interestingly, in the latter case, P = BPP.

20.3.1 Solving BPP in exponential time
It is not hard to see that if $F$ is in BPP then it can be computed in exponential time.

Theorem 20.7 — Simulating randomized algorithms in exponential time. $\mathbf{BPP} \subseteq \mathbf{EXP}$

The proof of Theorem 20.7 readily follows by enumer-
ating over all the (exponentially many) choices for the
random coins. We omit the formal proof, as doing it
by yourself is an excellent way to get comfortable with
Definition 20.1.

20.3.2 Simulating randomized algorithms by circuits


We have seen in Theorem 13.12 that if 𝐹 is in P, then there is a polyno-
mial 𝑝 ∶ ℕ → ℕ such that for every 𝑛, the restriction 𝐹↾𝑛 of 𝐹 to inputs
{0, 1}𝑛 is in SIZE(𝑝(𝑛)). (In other words, that P ⊆ P/poly .) A priori it is
not at all clear that the same holds for a function in BPP, but this does
turn out to be the case.

Figure 20.5: The possible guarantees for a random-


ized algorithm 𝐴 computing some function 𝐹 . In
the tables above, the columns correspond to differ-
ent inputs and the rows to different choices of the
random tape. A cell at position 𝑟, 𝑥 is colored green
if 𝐴(𝑥; 𝑟) = 𝐹 (𝑥) (i.e., the algorithm outputs the
correct answer) and red otherwise. The standard BPP
guarantee corresponds to the middle figure, where
for every input 𝑥, at least two thirds of the choices
𝑟 for a random tape will result in 𝐴 computing the
correct value. That is, every column is colored green
in at least two thirds of its coordinates. In the left
figure we have an “average case” guarantee where
the algorithm is only guaranteed to output the correct
answer with probability two thirds over a random
input (i.e., at most one third of the total entries of the
table are colored red, but there could be an all red
column). The right figure corresponds to the “offline
BPP” case, with probability at least two thirds over
the random choice 𝑟, 𝑟 will be good for every input.
That is, at least two thirds of the rows are all green. Theorem 20.8 ($\mathbf{BPP} \subseteq \mathbf{P}_{/poly}$) is proven by amplifying the success of a BPP algorithm until we have the "offline BPP" guarantee, and then hardwiring the choice of the randomness $r$ to obtain a non-uniform deterministic algorithm.

Theorem 20.8 — Randomness does not help for non-uniform computation. $\mathbf{BPP} \subseteq \mathbf{P}_{/poly}$.

That is, for every $F \in \mathbf{BPP}$, there exist some $a,b \in \mathbb{N}$ such that for every $n > 0$, $F\upharpoonright_n \in SIZE(an^b)$ where $F\upharpoonright_n$ is the restriction of $F$ to inputs in $\{0,1\}^n$.

Proof Idea:
The idea behind the proof is that we can first amplify by repetition
the probability of success from 2/3 to 1 − 0.1 ⋅ 2−𝑛 . This will allow us to
show that for every 𝑛 ∈ ℕ there exists a single fixed choice of “favorable
coins” which is a string 𝑟 of length polynomial in 𝑛 such that if 𝑟 is
used for the randomness then we output the right answer on all of
the possible 2𝑛 inputs. We can then use the standard “unravelling the
loop” technique to transform an RNAND-TM program to an RNAND-
CIRC program, and “hardwire” the favorable choice of random coins

to transform the RNAND-CIRC program into a plain old deterministic


NAND-CIRC program.

Proof of Theorem 20.8. Suppose that 𝐹 ∈ BPP. Let 𝑃 be a polynomial-


time RNAND-TM program that computes 𝐹 as per Definition 20.1.
Using Theorem 20.5, we can amplify the success probability of 𝑃 to
obtain an RNAND-TM program 𝑃 ′ that is at most a factor of 𝑂(𝑛)
slower (and hence still polynomial time) such that for every 𝑥 ∈
{0, 1}𝑛

$$\Pr_{r \sim \{0,1\}^m}[P'(x;r) = F(x)] \ge 1 - 0.1 \cdot 2^{-n}, \;\;\; (20.5)$$

where 𝑚 is the number of coin tosses that 𝑃 ′ uses on inputs of


length 𝑛. We use the notation 𝑃 ′ (𝑥; 𝑟) to denote the execution of 𝑃 ′
on input 𝑥 and when the result of the coin tosses corresponds to the
string 𝑟.
For every 𝑥 ∈ {0, 1}𝑛 , define the “bad” event 𝐵𝑥 to hold if 𝑃 ′ (𝑥) ≠
𝐹 (𝑥), where the sample space for this event consists of the coins of 𝑃 ′ .
Then by (20.5), Pr[𝐵𝑥 ] ≤ 0.1 ⋅ 2−𝑛 for every 𝑥 ∈ {0, 1}𝑛 . Since there are
2𝑛 many such 𝑥’s, by the union bound we see that the probability that
the union of the events {𝐵𝑥 }𝑥∈{0,1}𝑛 is at most 0.1. This means that if
we choose 𝑟 ∼ {0, 1}𝑚 , then with probability at least 0.9 it will be the
case that for every 𝑥 ∈ {0, 1}𝑛 , 𝐹 (𝑥) = 𝑃 ′ (𝑥; 𝑟). (Indeed, otherwise the
event 𝐵𝑥 would hold for some 𝑥.) In particular, because of the mere
fact that the probability of ∪𝑥∈{0,1}𝑛 𝐵𝑥 is smaller than 1, this means
that there exists a particular 𝑟∗ ∈ {0, 1}𝑚 such that

𝑃 ′ (𝑥; 𝑟∗ ) = 𝐹 (𝑥) (20.6)


for every 𝑥 ∈ {0, 1}𝑛 .
Now let us use the standard “unravelling the loop” technique and
transform 𝑃 ′ into a NAND-CIRC program 𝑄 of polynomial in 𝑛 size,
such that 𝑄(𝑥𝑟) = 𝑃 ′ (𝑥; 𝑟) for every 𝑥 ∈ {0, 1}𝑛 and 𝑟 ∈ {0, 1}𝑚 . Then
by “hardwiring” the values 𝑟0∗ , … , 𝑟𝑚−1

in place of the last 𝑚 inputs of
𝑄, we obtain a new NAND-CIRC program 𝑄𝑟∗ that satisfies by (20.6)
that 𝑄𝑟∗ (𝑥) = 𝐹 (𝑥) for every 𝑥 ∈ {0, 1}𝑛 . This demonstrates that 𝐹↾𝑛
has a polynomial-sized NAND-CIRC program, hence completing the
proof of Theorem 20.8.
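The following toy Python sketch (not from the book; the procedure, its names, and its tiny parameters are made up for illustration) mirrors the key step of the argument: for small enough sizes we can enumerate all coin strings and find a single one that is correct on every input, which can then be "hardwired" to obtain a deterministic algorithm.

```python
from itertools import product

def find_favorable_coins(A, inputs, m):
    """Search for one choice of coins r that makes A(x, r) correct on every
    (x, expected) pair, as in the hardwiring step of the proof of Theorem 20.8."""
    for r in product((0, 1), repeat=m):
        if all(A(x, r) == expected for x, expected in inputs):
            return r
    return None

def A(x, r):
    # Toy randomized procedure for OR of two bits: it examines the two positions
    # named by the coins, so some coin strings are correct on all inputs and some are not.
    return int(x[r[0]] == 1 or x[r[1]] == 1)

inputs = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(find_favorable_coins(A, inputs, m=2))   # prints a coin string good for all four inputs
```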

20.4 DERANDOMIZATION
The proof of Theorem 20.8 can be summarized as follows: we can
replace a 𝑝𝑜𝑙𝑦(𝑛)-time algorithm that tosses coins as it runs with an
algorithm that uses a single set of coin tosses 𝑟∗ ∈ {0, 1}𝑝𝑜𝑙𝑦(𝑛) which

will be good enough for all inputs of size 𝑛. Another way to say it is
that for the purposes of computing functions, we do not need “online”
access to random coins and can generate a set of coins “offline” ahead
of time, before we see the actual input.
But this does not really help us with answering the question of
whether BPP equals P, since we still need to find a way to generate
these “offline” coins in the first place. To derandomize an RNAND-
TM program we will need to come up with a single deterministic
algorithm that will work for all input lengths. That is, unlike in the
case of RNAND-CIRC programs, we cannot choose for every input
length 𝑛 some string 𝑟∗ ∈ {0, 1}𝑝𝑜𝑙𝑦(𝑛) to use as our random coins.
Can we derandomize randomized algorithms, or does randomness
add an inherent extra power for computation? This is a fundamentally
interesting question but is also of practical significance. Ever since
people started to use randomized algorithms during the Manhattan
project, they have been trying to remove the need for randomness and
replace it with numbers that are selected through some deterministic
process. Throughout the years this approach has often been used 2
One amusing anecdote is a recent case where scam-
successfully, though there have been a number of failures as well.2 mers managed to predict the imperfect “pseudo-
A common approach people used over the years was to replace random generator” used by slot machines to cheat
casinos. Unfortunately we don’t know the details of
the random coins of the algorithm by a “randomish looking” string
how they did it, since the case was sealed.
that they generated through some arithmetic progress. For example,
one can use the digits of 𝜋 for the random tape. Using these type of
methods corresponds to what von Neumann referred to as a “state
of sin”. (Though this is a sin that he himself frequently committed,
as generating true randomness in sufficient quantity was and still is
often too expensive.) The reason that this is considered a “sin” is that
such a procedure will not work in general. For example, it is easy to
modify any probabilistic algorithm 𝐴 such as the ones we have seen in
Chapter 19, to an algorithm 𝐴′ that is guaranteed to fail if the random
tape happens to equal the digits of 𝜋. This means that the procedure
“replace the random tape by the digits of 𝜋” does not yield a general
way to transform a probabilistic algorithm to a deterministic one that
will solve the same problem. Of course, this procedure does not always
fail, but we have no good way to determine when it fails and when
it succeeds. This reasoning is not specific to 𝜋 and holds for every
deterministically produced string, whether it obtained by 𝜋, 𝑒, the
Fibonacci series, or anything else.
An algorithm that checks if its random tape is equal to 𝜋 and then
fails seems to be quite silly, but this is but the “tip of the iceberg” for a
very serious issue. Time and again people have learned the hard way
that one needs to be very careful about producing random bits using
deterministic means. As we will see when we discuss cryptography,

many spectacular security failures and break-ins were the result of


using “insufficiently random” coins.

20.4.1 Pseudorandom generators


So, we can’t use any single string to “derandomize” a probabilistic
algorithm. It turns out however, that we can use a collection of strings
to do so. Another way to think about it is that rather than trying to
eliminate the need for randomness, we start by focusing on reducing the
amount of randomness needed. (Though we will see that if we reduce
the randomness sufficiently, we can eventually get rid of it altogether.)
We make the following definition:

Definition 20.9 — Pseudorandom generator. A function $G : \{0,1\}^\ell \to \{0,1\}^m$ is a $(T,\epsilon)$-pseudorandom generator if for every circuit $C$ with $m$ inputs, one output, and at most $T$ gates,

$$\left| \Pr_{s \sim \{0,1\}^\ell}[C(G(s)) = 1] - \Pr_{r \sim \{0,1\}^m}[C(r) = 1] \right| < \epsilon \;\;\; (20.7)$$
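To make Definition 20.9 concrete, the following Python sketch (illustrative only, feasible just for tiny parameters; the candidate generator and test below are made up here) computes the distinguishing advantage of a test $C$ against a candidate generator $G$ by brute-force enumeration. The toy generator is easily distinguished, so it is far from pseudorandom:

```python
from itertools import product

def advantage(G, C, ell, m):
    """Brute-force the quantity bounded by epsilon in Definition 20.9:
    | Pr_s[C(G(s))=1] - Pr_r[C(r)=1] | over all seeds s and all strings r."""
    p_pseudo = sum(C(G(s)) for s in product((0, 1), repeat=ell)) / 2 ** ell
    p_uniform = sum(C(r) for r in product((0, 1), repeat=m)) / 2 ** m
    return abs(p_pseudo - p_uniform)

# A (bad) candidate generator that repeats its 2-bit seed three times, and a
# simple test that checks whether the first two bits equal the last two bits.
G = lambda s: s * 3
C = lambda r: int(r[:2] == r[-2:])
print(advantage(G, C, ell=2, m=6))   # 0.75, so this G is easily distinguished
```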

This is a definition that’s worth reading more than
once, and spending some time to digest it. Note that it
takes several parameters:

• 𝑇 is the limit on the number of gates of the circuit


𝐶 that the generator needs to “fool”. The larger 𝑇
is, the stronger the generator.
• $\epsilon$ is how close the output of the pseudorandom generator is to the true uniform distribution over $\{0,1\}^m$. The smaller $\epsilon$ is, the stronger the generator.

• $\ell$ is the input length and $m$ is the output length. If $\ell \ge m$ then it is trivial to come up with such a generator: on input $s \in \{0,1\}^\ell$, we can output $s_0,\ldots,s_{m-1}$. In this case $\Pr_{s \sim \{0,1\}^\ell}[P(G(s)) = 1]$ will simply equal $\Pr_{r \sim \{0,1\}^m}[P(r) = 1]$, no matter how many lines $P$ has. So, the smaller $\ell$ is and the larger $m$ is, the stronger the generator, and to get anything non-trivial, we need $m > \ell$.

Figure 20.6 caption: A pseudorandom generator $G$ maps a short string $s \in \{0,1\}^\ell$ into a long string $r \in \{0,1\}^m$ such that a small program/circuit $P$ cannot distinguish between the case that it is provided a random input $r \sim \{0,1\}^m$ and the case that it is provided a "pseudorandom" input of the form $r = G(s)$ where $s \sim \{0,1\}^\ell$. The short string $s$ is sometimes called the seed of the pseudorandom generator, as it is a small object that can be thought of as yielding a large "tree of randomness".

Furthermore note that although our eventual goal is to


fool probabilistic randomized algorithms that take an
unbounded number of inputs, Definition 20.9 refers to
finite and deterministic NAND-CIRC programs.

We can think of a pseudorandom generator as a “randomness


amplifier.” It takes an input 𝑠 of ℓ bits chosen at random and ex-
pands these ℓ bits into an output 𝑟 of 𝑚 > ℓ pseudorandom bits. If

𝜖 is small enough then the pseudorandom bits will “look random”


to any NAND-CIRC program that is not too big. Still, there are two
questions we haven’t answered:

• What reason do we have to believe that pseudorandom generators with


non-trivial parameters exist?

• Even if they do exist, why would such generators be useful to derandomize


randomized algorithms? After all, Definition 20.9 does not involve
RNAND-TM or RNAND-RAM programs, but rather deterministic
NAND-CIRC programs with no randomness and no loops.

We will now (partially) answer both questions. For the first ques-
tion, let us come clean and confess we do not know how to prove that
interesting pseudorandom generators exist. By interesting we mean
pseudorandom generators that satisfy that 𝜖 is some small constant
(say 𝜖 < 1/3), 𝑚 > ℓ, and the function 𝐺 itself can be computed in
𝑝𝑜𝑙𝑦(𝑚) time. Nevertheless, Lemma 20.12 (whose statement and proof
is deferred to the end of this chapter) shows that if we only drop the
last condition (polynomial-time computability), then there do in fact
exist pseudorandom generators where 𝑚 is exponentially larger than ℓ.

P
At this point you might want to skip ahead and look at
the statement of Lemma 20.12. However, since its proof
is somewhat subtle, I recommend you defer reading it
until you’ve finished reading the rest of this chapter.

20.4.2 From existence to constructivity


The fact that there exists a pseudorandom generator does not mean
that there is one that can be efficiently computed. However, it turns
out that we can turn complexity “on its head” and use the assumed
non-existence of fast algorithms for problems such as 3SAT to obtain
pseudorandom generators that can then be used to transform random-
ized algorithms into deterministic ones. This is known as the Hardness
vs Randomness paradigm. A number of results along those lines, most
of which are outside the scope of this course, have led researchers to
believe the following conjecture:

Optimal PRG conjecture: There is a polynomial-time computable


function PRG ∶ {0, 1}∗ → {0, 1} that yields an exponentially secure
pseudorandom generator.
Specifically, there exists a constant $\delta > 0$ such that for every $\ell$ and $m < 2^{\delta\ell}$, if we define $G : \{0,1\}^\ell \to \{0,1\}^m$ as $G(s)_i = PRG(s,i)$ for every $s \in \{0,1\}^\ell$ and $i \in [m]$, then $G$ is a $(2^{\delta\ell}, 2^{-\delta\ell})$ pseudorandom generator.

P
The “optimal PRG conjecture” is worth while reading
more than once. What it posits is that we can obtain
a (𝑇 , 𝜖) pseudorandom generator 𝐺 such that every
output bit of 𝐺 can be computed in time polynomial
in the length ℓ of the input, where 𝑇 is exponentially
large in ℓ and 𝜖 is exponentially small in ℓ. (Note that
we could not hope for the entire output to be com-
putable in ℓ, as just writing the output down will take
too long.)
To understand why we call such a pseudorandom
generator “optimal,” it is a great exercise to convince
yourself that, for example, there does not exist a
(21.1ℓ , 2−1.1ℓ ) pseudorandom generator (in fact, the
number 𝛿 in the conjecture must be smaller than 1). To
see that we can’t have 𝑇 ≫ 2ℓ , note that if we allow a
NAND-CIRC program with much more than 2ℓ lines
then this NAND-CIRC program could “hardwire” in-
side it all the outputs of 𝐺 on all its 2ℓ inputs, and use
that to distinguish between a string of the form 𝐺(𝑠)
and a uniformly chosen string in {0, 1}𝑚 . To see that
we can’t have 𝜖 ≪ 2−ℓ , note that by guessing the input
𝑠 (which will be successful with probability 2−ℓ ), we
can obtain a small (i.e., 𝑂(ℓ) line) NAND-CIRC pro-
gram that achieves a 2−ℓ advantage in distinguishing a
pseudorandom and uniform input. Working out these
details is a highly recommended exercise.

We emphasize again that the optimal PRG conjecture is, as its name
implies, a conjecture, and we still do not know how to prove it. In par-
ticular, it is stronger than the conjecture that P ≠ NP. But we do have
some evidence for its truth. There is a spectrum of different types of
pseudorandom generators, and there are weaker assumptions than
the optimal PRG conjecture that suffice to prove that BPP = P. In
particular this is known to hold under the assumption that there exists
a function 𝐹 ∈ TIME(2𝑂(𝑛) ) and 𝜖 > 0 such that for every sufficiently
large 𝑛, 𝐹↾𝑛 is not in SIZE(2𝜖𝑛 ). The name “Optimal PRG conjecture”
is non-standard. This conjecture is sometimes known in the literature 3
A pseudorandom generator of the form we posit,
as the existence of exponentially strong pseudorandom functions.3 where each output bit can be computed individually
in time polynomial in the seed length, is commonly
known as a pseudorandom function generator. For more
20.4.3 Usefulness of pseudorandom generators on the many interesting results and connections in the
study of pseudorandomness, see this monograph of Salil
We now show that optimal pseudorandom generators are indeed very Vadhan.
useful, by proving the following theorem:

Suppose that the optimal


Theorem 20.10 — Derandomization of BPP.
PRG conjecture is true. Then BPP = P.

Proof Idea:

The optimal PRG conjecture tells us that we can achieve exponential


expansion of ℓ truly random coins into as many as 2𝛿ℓ “pseudorandom
coins.” Looked at from the other direction, it allows us to reduce the
need for randomness by taking an algorithm that uses 𝑚 coins and
converting it into an algorithm that only uses 𝑂(log 𝑚) coins. Now an
algorithm of the latter type by can be made fully deterministic by enu-
merating over all the 2𝑂(log 𝑚) (which is polynomial in 𝑚) possibilities
for its random choices.

We now proceed with the proof details.

Proof of Theorem 20.10. Let 𝐹 ∈ BPP and let 𝑃 be a NAND-TM pro-


gram and 𝑎, 𝑏, 𝑐, 𝑑 constants such that for every 𝑥 ∈ {0, 1}𝑛 , 𝑃 (𝑥)
runs in at most 𝑐 ⋅ 𝑛𝑑 steps and Pr𝑟∼{0,1}𝑚 [𝑃 (𝑥; 𝑟) = 𝐹 (𝑥)] ≥ 2/3.
By “unrolling the loop” and hardwiring the input 𝑥, we can obtain
for every input 𝑥 ∈ {0, 1}𝑛 a NAND-CIRC program 𝑄𝑥 of at most,
say, 𝑇 = 10𝑐 ⋅ 𝑛𝑑 lines, that takes 𝑚 bits of input and such that
𝑄(𝑟) = 𝑃 (𝑥; 𝑟).
Now suppose that 𝐺 ∶ {0, 1}ℓ → {0, 1} is a (𝑇 , 0.1) pseudorandom
generator. Then we could deterministically estimate the probability
𝑝(𝑥) = Pr𝑟∼{0,1}𝑚 [𝑄𝑥 (𝑟) = 1] up to 0.1 accuracy in time 𝑂(𝑇 ⋅ 2ℓ ⋅ 𝑚 ⋅
𝑐𝑜𝑠𝑡(𝐺)) where 𝑐𝑜𝑠𝑡(𝐺) is the time that it takes to compute a single
output bit of 𝐺.
The reason is that we know that 𝑝(𝑥)̃ = Pr𝑠∼{0,1}ℓ [𝑄𝑥 (𝐺(𝑠)) =
1] will give us such an estimate for 𝑝(𝑥), and we can compute the
probability 𝑝(𝑥)
̃ by simply trying all 2ℓ possibillites for 𝑠. Now, under
the optimal PRG conjecture we can set $T = 2^{\delta\ell}$ or equivalently $\ell = \frac{1}{\delta}\log T$, and our total computation time is polynomial in $2^\ell = T^{1/\delta}$. Since $T \le 10c \cdot n^d$, this running time will be polynomial in $n$.

This completes the proof, since we are guaranteed that


Pr𝑟∼{0,1}𝑚 [𝑄𝑥 (𝑟) = 𝐹 (𝑥)] ≥ 2/3, and hence estimating the
probability 𝑝(𝑥) to within 0.1 accuracy is sufficient to compute 𝐹 (𝑥).
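Here is a Python sketch of the seed-enumeration step in the proof (illustrative only; the "generator" G below is an arbitrary toy stand-in defined here, not an actual pseudorandom generator, and the test Q is likewise made up):

```python
from itertools import product

def derandomize(Q, G, ell):
    """Deterministically estimate Pr_r[Q(r)=1] by running Q on G(s) for every one
    of the 2^ell seeds s, as in the proof of Theorem 20.10."""
    seeds = list(product((0, 1), repeat=ell))
    return sum(Q(G(s)) for s in seeds) / len(seeds)

# Toy stand-ins: G stretches 3 bits to 6 bits by appending pairwise XORs,
# and Q accepts iff its input has more ones than zeros.
G = lambda s: s + (s[0] ^ s[1], s[1] ^ s[2], s[0] ^ s[2])
Q = lambda r: int(sum(r) > len(r) / 2)
print(derandomize(Q, G, ell=3))
```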

20.5 P = NP AND BPP VS P


Two computational complexity questions that we cannot settle are:

• Is P = NP? Where we believe the answer is negative.

• Is BPP = P? Where we believe the answer is positive.

However we can say that the “conventional wisdom” is correct on


at least one of these questions. Namely, if we’re wrong on the first
count, then we’ll be right on the second one:

Theorem 20.11 — Sipser–Gács Theorem. If P = NP then BPP = P.

P
Before reading the proof, it is instructive to think
why this result is not “obvious.” If P = NP then
given any randomized algorithm 𝐴 and input 𝑥,
we will be able to figure out in polynomial time if
there is a string 𝑟 ∈ {0, 1}𝑚 of random coins for 𝐴
such that 𝐴(𝑥𝑟) = 1. The problem is that even if
Pr𝑟∼{0,1}𝑚 [𝐴(𝑥𝑟) = 𝐹 (𝑥)] ≥ 0.9999, it can still be the
case that even when 𝐹 (𝑥) = 0 there exists a string 𝑟
such that 𝐴(𝑥𝑟) = 1.
The proof is rather subtle. It is much more important
that you understand the statement of the theorem than
that you follow all the details of the proof.

Proof Idea:
The construction follows the “quantifier elimination” idea which
we have seen in Theorem 16.6. We will show that for every 𝐹 ∈ BPP,
we can reduce the question of some input 𝑥 satisfies 𝐹 (𝑥) = 1 to the
question of whether a formula of the form ∃𝑢∈{0,1}𝑚 ∀𝑣∈{0,1}𝑘 𝑃 (𝑢, 𝑣)
is true, where 𝑚, 𝑘 are polynomial in the length of 𝑥 and 𝑃 is
polynomial-time computable. By Theorem 16.6, if P = NP then we can
decide in polynomial time whether such a formula is true or false.
The idea behind this construction is that using amplification we
can obtain a randomized algorithm 𝐴 for computing 𝐹 using 𝑚 coins
such that for every 𝑥 ∈ {0, 1}𝑛 , if 𝐹 (𝑥) = 0 then the set 𝑆 ⊆ {0, 1}𝑚
of coins that make 𝐴 output 1 is extremely tiny (i.e., exponentially
small relative to 2𝑚 ), and if 𝐹 (𝑥) = 1 then 𝑆 is very large (of size
close to 2𝑚 ). We then consider “shifts” of the set 𝑆: sets of the form
𝑆 ⊕ 𝑠 where 𝑠 ∈ {0, 1}𝑚 is some string, where 𝑆 ⊕ 𝑠 is defined as
{𝑟 ⊕ 𝑠 | 𝑟 ∈ 𝑆}. Note that for every such shift 𝑠, the cardinality of 𝑆 ⊕ 𝑠
is the same as the cardinality of 𝑆. Hence, if 𝐹 (𝑥) = 0, and so 𝑆 is
“tiny”, then for every polynomial number of shifts 𝑠0 , … , 𝑠𝑘 ∈ {0, 1}𝑚 ,
Figure 20.7: If 𝐹 ∈ BPP then through amplification we
the union of the sets 𝑆 ⊕ 𝑠𝑖 will not cover {0, 1}𝑚 . On the other hand,
can ensure that there is an algorithm 𝐴 to compute
we will show that if 𝑆 is very large, then there exists a polynomial 𝐹 on 𝑛-length inputs and using 𝑚 coins such that
number of such shifts such as ∪𝑘−1 Pr𝑟∼{0,1}𝑚 [𝐴(𝑥𝑟) ≠ 𝐹 (𝑥)] ≪ 1/𝑝𝑜𝑙𝑦(𝑚). Hence
𝑖=0 (𝑆 ⊕ 𝑠𝑖 ) = {0, 1} .
𝑚
if 𝐹 (𝑥) = 1 then almost all of the 2𝑚 choices for 𝑟
We can express the condition that there exists 𝑠0 , … , 𝑠𝑘−1 such that will cause 𝐴(𝑥𝑟) to output 1, while if 𝐹 (𝑥) = 0 then
∪𝑖∈[𝑘] (𝑆 ⊕ 𝑠𝑖 ) = {0, 1}𝑚 as a statement with a constant number of 𝐴(𝑥𝑟) = 0 for almost all 𝑟’s. To prove the Sipser–
Gács Theorem we consider several “shifts” of the set
quantifiers. (Specifically, this condition holds if for every 𝑦 ∈ {0, 1}𝑚 ,
𝑆 ⊆ {0, 1}𝑚 of the coins 𝑟 such that 𝐴(𝑥𝑟) = 1. If
there exists 𝑠 ∈ 𝑆 and 𝑖 ∈ {0, … , 𝑘 − 1} such that 𝑦 = 𝑠 ⊕ 𝑠𝑖 .) 𝐹 (𝑥) = 1 then we can find a set of 𝑘 shifts 𝑠0 , … , 𝑠𝑘−1
⋆ for which ∪𝑖∈[𝑘] (𝑆 ⊕ 𝑠𝑖 ) = {0, 1}𝑚 . If 𝐹 (𝑥) = 0 then
for every such set | ∪𝑖∈[𝑘] 𝑆𝑖 | ≤ 𝑘|𝑆| ≪ 2𝑚 . We can
phrase the question of whether there is such a set of
Proof of Theorem 20.11. Let 𝐹 ∈ BPP. Using Theorem 20.5, there shifts using a constant number of quantifiers, and so
exists a polynomial-time algorithm 𝐴 such that for every 𝑥 ∈ {0, 1}𝑛 , can solve it in polynomial time if P = NP.

Pr𝑟∈{0,1}𝑚 [𝐴(𝑥𝑟) = 𝐹 (𝑥)] ≥ 1 − 2−𝑛 where 𝑚 is polynomial in 𝑛. In


particular (since an exponential dominates a polynomial, and we can
always assume 𝑛 is sufficiently large), it holds that

$$\Pr_{r \in \{0,1\}^m}[A(xr) = F(x)] \ge 1 - \frac{1}{10m^2}. \;\;\; (20.8)$$

Let $x \in \{0,1\}^n$, and let $S_x \subseteq \{0,1\}^m$ be the set $\{r \in \{0,1\}^m : A(xr) = 1\}$. By our assumption, if $F(x) = 0$ then $|S_x| \le \frac{1}{10m^2} 2^m$ and if $F(x) = 1$ then $|S_x| \ge (1 - \frac{1}{10m^2}) 2^m$.

For a set 𝑆 ⊆ {0, 1}𝑚 and a string 𝑠 ∈ {0, 1}𝑚 , we define the set
𝑆 ⊕ 𝑠 to be {𝑟 ⊕ 𝑠 ∶ 𝑟 ∈ 𝑆} where ⊕ denotes the XOR operation. That
is, 𝑆 ⊕ 𝑠 is the set 𝑆 “shifted” by 𝑠. Note that |𝑆 ⊕ 𝑠| = |𝑆|. (Please
make sure that you see why this is true.)
The heart of the proof is the following two claims:
CLAIM I: For every subset $S \subseteq \{0,1\}^m$, if $|S| \le \frac{1}{1000m} 2^m$, then for every $s_0,\ldots,s_{100m-1} \in \{0,1\}^m$, $\cup_{i \in [100m]} (S \oplus s_i) \subsetneq \{0,1\}^m$.
CLAIM II: For every subset $S \subseteq \{0,1\}^m$, if $|S| \ge \frac{1}{2} 2^m$ then there exists a set of strings $s_0,\ldots,s_{100m-1}$ such that $\cup_{i \in [100m]} (S \oplus s_i) = \{0,1\}^m$.
exist a set of string 𝑠0 , … , 𝑠100𝑚−1 such that ∪𝑖∈[100𝑚] (𝑆 ⊕ 𝑠𝑖 ) = {0, 1}𝑚 .
CLAIM I and CLAIM II together imply the theorem. Indeed, they
mean that under our assumptions, for every 𝑥 ∈ {0, 1}𝑛 , 𝐹 (𝑥) = 1 if
and only if

∃𝑠0 ,…,𝑠100𝑚−1 ∈{0,1}𝑚 ∪𝑖∈[100𝑚] (𝑆𝑥 ⊕ 𝑠𝑖 ) = {0, 1}𝑚

which we can re-write as

∃𝑠0 ,…,𝑠100𝑚−1 ∈{0,1}𝑚 ∀𝑤∈{0,1}𝑚 (𝑤 ∈ (𝑆𝑥 ⊕𝑠0 )∨𝑤 ∈ (𝑆𝑥 ⊕𝑠1 )∨⋯ 𝑤 ∈ (𝑆𝑥 ⊕𝑠100𝑚−1 ))

or equivalently

∃𝑠0 ,…,𝑠100𝑚−1 ∈{0,1}𝑚 ∀𝑤∈{0,1}𝑚 (𝐴(𝑥(𝑤⊕𝑠0 )) = 1∨𝐴(𝑥(𝑤⊕𝑠1 )) = 1∨⋯∨𝐴(𝑥(𝑤⊕𝑠100𝑚−1 )) = 1)

which (since 𝐴 is computable in polynomial time) is exactly the


type of statement shown in Theorem 16.6 to be decidable in polyno-
mial time if P = NP.
We see that all that is left is to prove CLAIM I and CLAIM II.
CLAIM I follows immediately from the fact that

$$\left| \cup_{i \in [100m]} (S_x \oplus s_i) \right| \le \sum_{i=0}^{100m-1} |S_x \oplus s_i| = \sum_{i=0}^{100m-1} |S_x| = 100m|S_x|.$$

To prove CLAIM II, we will use a technique known as the prob-
abilistic method (see the proof of Lemma 20.12 for a more extensive
discussion). Note that this is a completely different use of probability
than in the theorem statement; here we just use the methods of probability
to prove an existential statement.
Proof of CLAIM II: Let 𝑆 ⊆ {0, 1}𝑚 with |𝑆| ≥ 0.5 ⋅ 2𝑚 be as
in the claim’s statement. Consider the following probabilistic ex-
periment: we choose 100𝑚 random shifts 𝑠0 , … , 𝑠100𝑚−1 indepen-
dently at random in {0, 1}𝑚 , and consider the event GOOD that
∪𝑖∈[100𝑚] (𝑆 ⊕ 𝑠𝑖 ) = {0, 1}𝑚 . To prove CLAIM II it is enough to show
that Pr[GOOD] > 0, since that means that in particular there must exist
shifts 𝑠0 , … , 𝑠100𝑚−1 that satisfy this condition.
For every 𝑧 ∈ {0, 1}𝑚 , define the event BAD𝑧 to hold if 𝑧 ∉
∪𝑖∈[100𝑚] (𝑆 ⊕ 𝑠𝑖 ). The event GOOD holds if BAD𝑧 fails for every
𝑧 ∈ {0, 1}𝑚 , and so our goal is to prove that Pr[∪𝑧∈{0,1}𝑚 BAD𝑧 ] < 1. By
the union bound, to show this, it is enough to show that Pr[BAD𝑧 ] <
2^{−𝑚} for every 𝑧 ∈ {0, 1}𝑚 . Define the event BAD^𝑖_𝑧 to hold if 𝑧 ∉ 𝑆 ⊕ 𝑠𝑖 .
Since every shift 𝑠𝑖 is chosen independently, for every fixed 𝑧 the
events BAD^0_𝑧 , … , BAD^{100𝑚−1}_𝑧 are mutually independent,4 and hence

Pr[BAD𝑧 ] = Pr[∩𝑖∈[100𝑚] BAD^𝑖_𝑧 ] = ∏_{𝑖=0}^{100𝑚−1} Pr[BAD^𝑖_𝑧 ] .   (20.9)

4 The condition of independence here is subtle. It is not the case that all of the 2^𝑚 × 100𝑚 events {BAD^𝑖_𝑧 }_{𝑧∈{0,1}𝑚 , 𝑖∈[100𝑚]} are mutually independent. Only for a fixed 𝑧 ∈ {0, 1}𝑚 , the 100𝑚 events of the form BAD^𝑖_𝑧 are mutually independent.

So this means that the result will follow by showing that
Pr[BAD^𝑖_𝑧 ] ≤ 1/2 for every 𝑧 ∈ {0, 1}𝑚 and 𝑖 ∈ [100𝑚] (as that would
allow us to bound the right-hand side of (20.9) by 2^{−100𝑚} < 2^{−𝑚} ). In other
words, we need to show that for every 𝑧 ∈ {0, 1}𝑚 and set 𝑆 ⊆ {0, 1}𝑚
with |𝑆| ≥ (1/2)2^𝑚 ,

Pr_{𝑠∼{0,1}𝑚} [𝑧 ∈ 𝑆 ⊕ 𝑠] ≥ 1/2 .   (20.10)

To show this, we observe that 𝑧 ∈ 𝑆 ⊕ 𝑠 if and only if 𝑠 ∈ 𝑆 ⊕ 𝑧
(can you see why?). Hence we can rewrite the probability on the left-
hand side of (20.10) as Pr_{𝑠∼{0,1}𝑚} [𝑠 ∈ 𝑆 ⊕ 𝑧], which simply equals
|𝑆 ⊕ 𝑧|/2^𝑚 = |𝑆|/2^𝑚 ≥ 1/2! This concludes the proof of CLAIM II and
hence of Theorem 20.11.
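For readers who like to experiment, the covering phenomenon behind CLAIM II is easy to observe numerically. The following small Python sketch (an illustration added here, not code from the text; the function names and parameters are arbitrary choices) picks a random set 𝑆 of density 1/2 in {0, 1}^𝑚 for a small 𝑚, takes 100𝑚 random shifts, and checks whether the shifted copies cover all of {0, 1}^𝑚.

```python
# Illustration: empirically check the covering claim (CLAIM II) for small m.
# We pick a random set S of size 2^m / 2 and test whether 100*m random
# shifts s_0,...,s_{100m-1} satisfy  union_i (S xor s_i) = {0,1}^m.
import random

def shifts_cover(m, num_trials=20, seed=0):
    rng = random.Random(seed)
    universe = list(range(2 ** m))
    successes = 0
    for _ in range(num_trials):
        S = set(rng.sample(universe, 2 ** (m - 1)))  # random set of density 1/2
        covered = set()
        for _ in range(100 * m):
            s = rng.randrange(2 ** m)                # random shift
            covered |= {r ^ s for r in S}            # S xor s
        successes += (len(covered) == 2 ** m)
    return successes / num_trials

if __name__ == "__main__":
    # For m = 8 the union covers all of {0,1}^8 in every trial, matching the
    # claim that Pr[GOOD] > 0 (in fact the probability is very close to 1).
    print(shifts_cover(8))
```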

20.6 NON-CONSTRUCTIVE EXISTENCE OF PSEUDORANDOM GENERATORS (ADVANCED, OPTIONAL)

We now show that, if we don’t insist on constructivity of pseudoran-
dom generators, then we can show that there exist pseudorandom
generators with output that is exponentially larger than the input length.

Lemma 20.12 — Existence of inefficient pseudorandom generators. There is
some absolute constant 𝐶 such that for every 𝜖, 𝑇 , if ℓ > 𝐶(log 𝑇 +
log(1/𝜖)) and 𝑚 ≤ 𝑇 , then there is a (𝑇 , 𝜖) pseudorandom generator
𝐺 ∶ {0, 1}ℓ → {0, 1}𝑚 .

Proof Idea:
The proof uses an extremely useful technique known as the “probabilistic
method” which is not too hard mathematically but can be confusing at
first.5 The idea is to give a “non-constructive” proof of
existence of the pseudorandom generator 𝐺 by showing that if 𝐺 was
chosen at random, then the probability that it would be a valid (𝑇 , 𝜖)
pseudorandom generator is positive. In particular this means that
there exists a single 𝐺 that is a valid (𝑇 , 𝜖) pseudorandom generator.
The probabilistic method is just a proof technique to demonstrate the
existence of such a function. Ultimately, our goal is to show the exis-
tence of a deterministic function 𝐺 that satisfies the condition.

5 There is a whole (highly recommended) book by Alon and Spencer devoted to this method.

The above discussion might be rather abstract at this point, but


would become clearer after seeing the proof.

Proof of Lemma 20.12. Let 𝜖, 𝑇 , ℓ, 𝑚 be as in the lemma’s statement. We


need to show that there exists a function 𝐺 ∶ {0, 1}ℓ → {0, 1}𝑚 that
“fools” every 𝑇 line program 𝑃 in the sense of (20.7). We will show
that this follows from the following claim:
Claim I: For every fixed NAND-CIRC program 𝑃 , if we pick 𝐺 ∶
{0, 1}ℓ → {0, 1}𝑚 at random then the probability that (20.7) is violated
is at most 2^{−𝑇²} .

Before proving Claim I, let us see why it implies Lemma 20.12. We


can identify a function 𝐺 ∶ {0, 1}ℓ → {0, 1}𝑚 with its “truth table”
or simply the list of evaluations on all its possible 2ℓ inputs. Since
each output is an 𝑚 bit string, we can also think of 𝐺 as a string in
{0, 1}^{𝑚⋅2^ℓ} . We define ℱ^𝑚_ℓ to be the set of all functions from {0, 1}^ℓ to
{0, 1}𝑚 . As discussed above we can identify ℱ^𝑚_ℓ with {0, 1}^{𝑚⋅2^ℓ} and
choosing a random function 𝐺 ∼ ℱ^𝑚_ℓ corresponds to choosing a
random 𝑚 ⋅ 2^ℓ -long bit string.


For every NAND-CIRC program 𝑃 let 𝐵𝑃 be the event that, if we
choose 𝐺 at random from ℱ^𝑚_ℓ then (20.7) is violated with respect to
the program 𝑃 . It is important to understand what is the sample space
that the event 𝐵𝑃 is defined over, namely this event depends on the
choice of 𝐺 and so 𝐵𝑃 is a subset of ℱ^𝑚_ℓ . An equivalent way to define
the event 𝐵𝑃 is that it is the subset of all functions mapping {0, 1}ℓ to
{0, 1}𝑚 that violate (20.7), or in other words:

𝐵𝑃 = { 𝐺 ∈ ℱ^𝑚_ℓ ∣ ∣ (1/2^ℓ) ∑_{𝑠∈{0,1}^ℓ} 𝑃 (𝐺(𝑠)) − (1/2^𝑚) ∑_{𝑟∈{0,1}^𝑚} 𝑃 (𝑟) ∣ > 𝜖 }   (20.11)

(We’ve replaced here the probability statements in (20.7) with the


equivalent sums so as to reduce confusion as to what is the sample
space that 𝐵𝑃 is defined over.)
To understand this proof it is crucial that you pause here and see
how the definition of 𝐵𝑃 above corresponds to (20.11). This may well
take re-reading the above text once or twice, but it is a good exercise
at parsing probabilistic statements and learning how to identify the
sample space that these statements correspond to.
Now, we’ve shown in Theorem 5.2 that up to renaming variables
(which makes no difference to the program’s functionality) there are
2^{𝑂(𝑇 log 𝑇 )} NAND-CIRC programs of at most 𝑇 lines. Since 𝑇 log 𝑇 <
𝑇 ² for sufficiently large 𝑇 , this means that if Claim I is true, then
by the union bound it holds that the probability of the union of
𝐵𝑃 over all NAND-CIRC programs of at most 𝑇 lines is at most
2^{𝑂(𝑇 log 𝑇 )} ⋅ 2^{−𝑇²} < 0.1 for sufficiently large 𝑇 . What is important for
us about the number 0.1 is that it is smaller than 1. In particular this
means that there exists a single 𝐺∗ ∈ ℱ^𝑚_ℓ such that 𝐺∗ does not violate
(20.7) with respect to any NAND-CIRC program of at most 𝑇 lines,
but that precisely means that 𝐺∗ is a (𝑇 , 𝜖) pseudorandom generator.
Hence to conclude the proof of Lemma 20.12, it suffices to prove
Claim I. Choosing a random 𝐺 ∶ {0, 1}ℓ → {0, 1}𝑚 amounts to choos-
ing 𝐿 = 2ℓ random strings 𝑦0 , … , 𝑦𝐿−1 ∈ {0, 1}𝑚 and letting 𝐺(𝑥) = 𝑦𝑥
(identifying {0, 1}ℓ and [𝐿] via the binary representation). This means
that proving the claim amounts to showing that for every fixed function
𝑃 ∶ {0, 1}𝑚 → {0, 1}, if 𝐿 > 2^{𝐶(log 𝑇 +log(1/𝜖))} (which by setting 𝐶 > 4,
we can ensure is larger than 10𝑇 ²/𝜖² ) then the probability that

∣ (1/𝐿) ∑_{𝑖=0}^{𝐿−1} 𝑃 (𝑦𝑖 ) − Pr_{𝑠∼{0,1}𝑚} [𝑃 (𝑠) = 1] ∣ > 𝜖   (20.12)

is at most 2^{−𝑇²} .

(20.12) follows directly from the Chernoff bound. Indeed, if for
every 𝑖 ∈ [𝐿] we let the random variable 𝑋𝑖 denote 𝑃 (𝑦𝑖 ), then
since 𝑦0 , … , 𝑦𝐿−1 are chosen independently at random, these are in-
dependent and identically distributed random variables with mean
𝔼𝑦∼{0,1}𝑚 [𝑃 (𝑦)] = Pr𝑦∼{0,1}𝑚 [𝑃 (𝑦) = 1], and hence the probability that
their average deviates from this expectation by more than 𝜖 is at most 2 ⋅ 2^{−𝜖²𝐿/2} .
Since 𝐿 ≥ 10𝑇 ²/𝜖² , this quantity is at most 2^{−𝑇²} , which proves Claim I.
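The Chernoff estimate in this last step can also be checked by simulation. The sketch below (an illustration added here, not part of the text's argument; the predicate 𝑃 and the parameters are arbitrary choices) samples 𝐿 values 𝑦𝑖, compares the empirical average of 𝑃(𝑦𝑖) with the true mean, and prints both the observed deviation frequency and the bound 2 ⋅ 2^{−𝜖²𝐿/2}.

```python
# Illustration: estimate the probability that the empirical average
# (1/L) * sum_i P(y_i) deviates from E[P] by more than eps, and compare it
# with the Chernoff-style bound 2 * 2^(-eps^2 * L / 2).
import random

def deviation_probability(P, m, L, eps, trials=2000, seed=1):
    rng = random.Random(seed)
    true_mean = sum(P(y) for y in range(2 ** m)) / 2 ** m
    bad = 0
    for _ in range(trials):
        avg = sum(P(rng.randrange(2 ** m)) for _ in range(L)) / L
        bad += abs(avg - true_mean) > eps
    return bad / trials

if __name__ == "__main__":
    # P is some fixed "test": here, whether the lowest three bits are all 1.
    P = lambda y: int(y & 0b111 == 0b111)
    m, L, eps = 10, 500, 0.1
    print("empirical deviation frequency:", deviation_probability(P, m, L, eps))
    print("Chernoff-style bound         :", 2 * 2 ** (-eps ** 2 * L / 2))
```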

✓ Chapter Recap

• We can model randomized algorithms by either


adding a special “coin toss” operation or assuming
an extra randomly chosen input.

Figure 20.8: The relation between BPP and the other complexity classes that we have seen. We know that P ⊆ BPP ⊆ EXP and BPP ⊆ P/poly but we don’t know how BPP compares with NP and can’t rule out even BPP = EXP. Most evidence points to the possibility that BPP = P.

• The class BPP contains the set of Boolean func-


tions that can be computed by polynomial time
randomized algorithms.
• BPP is a worst case class of computation: a ran-
domized algorithm to compute a function must
compute it correctly with high probability on every
input.
• We can amplify the success probability of a random-
ized algorithm from any value strictly larger than
1/2 into a success probability that is exponentially
close to 1.
• We know that P ⊆ BPP ⊆ EXP.
• We also know that BPP ⊆ P/poly .
• The relation between BPP and NP is not known,
but we do know that if P = NP then BPP = P.
• Pseudorandom generators are objects that take
a short random “seed” and expand it to a much
longer output that “appears random” for effi-
cient algorithms. We conjecture that exponentially
strong pseudorandom generators exist. Under this
conjecture, BPP = P.

20.7 EXERCISES

20.8 BIBLIOGRAPHICAL NOTES


In this chapter we ignore the issue of how we actually get random
bits in practice. The output of many physical processes, whether it
is thermal heat, network and hard drive latency, user typing pat-
tern and mouse movements, and more can be thought of as a binary
string sampled from some distribution 𝜇 that might have significant
unpredictability (or entropy) but is not necessarily the uniform distri-
bution over {0, 1}𝑛 . Indeed, as this paper shows, even (real-world)
coin tosses do not have exactly the distribution of a uniformly random
string. Therefore, to use the resulting measurements for randomized
algorithms, one typically needs to apply a “distillation” or random-

ness extraction process to the raw measurements to transform them


to the uniform distribution. Vadhan’s book [Vad+12] is an excellent
source for more discussion on both randomness extractors and pseu-
dorandom generators.
The name BPP stands for “bounded probability polynomial time”.
This is an historical accident: this class probably should have been
called RP or PP but both names were taken by other classes.
The proof of Theorem 20.8 actually yields more than its statement.
We can use the same “unrolling the loop” arguments we’ve used be-
fore to show that the restriction to {0, 1}𝑛 of every function in BPP
is also computable by a polynomial-size RNAND-CIRC program
(i.e., NAND-CIRC program with the RAND operation). Like in the P
vs SIZE(𝑝𝑜𝑙𝑦(𝑛)) case, there are also functions outside BPP whose
restrictions can be computed by polynomial-size RNAND-CIRC pro-
grams. Nevertheless the proof of Theorem 20.8 shows that even such
functions can be computed by polynomial-sized NAND-CIRC pro-
grams without using the rand operations. This can be phrased as
saying that BPSIZE(𝑇 (𝑛)) ⊆ SIZE(𝑂(𝑛𝑇 (𝑛))) (where BPSIZE is
defined in the natural way using RNAND programs). The stronger
version of Theorem 20.8 we mentioned can be phrased as saying that
BPP/poly = P/poly .
V
ADVANCED TOPICS
Learning Objectives:
• Definition of perfect secrecy
• The one-time pad encryption scheme
• Necessity of long keys for perfect secrecy
• Computational secrecy and the derandomized
one-time pad.
• Public key encryption
• A taste of advanced topics

21
Cryptography

“Human ingenuity cannot concoct a cipher which human ingenuity cannot


resolve.”, Edgar Allan Poe, 1841

“A good disguise should not reveal the person’s height”, Shafi Goldwasser
and Silvio Micali, 1982

““Perfect Secrecy” is defined by requiring of a system that after a cryptogram


is intercepted by the enemy the a posteriori probabilities of this cryptogram rep-
resenting various messages be identically the same as the a priori probabilities
of the same messages before the interception. It is shown that perfect secrecy is
possible but requires, if the number of messages is finite, the same number of
possible keys.”, Claude Shannon, 1945

“We stand today on the brink of a revolution in cryptography.”, Whitfield


Diffie and Martin Hellman, 1976

Cryptography - the art or science of “secret writing” - has been


around for several millennia, and for almost all of that time Edgar
Allan Poe’s quote above held true. Indeed, the history of cryptography
is littered with the figurative corpses of cryptosystems believed secure
and then broken, and sometimes with the actual corpses of those who
have mistakenly placed their faith in these cryptosystems.
Yet, something changed in the last few decades, which is the “revo-
lution” alluded to (and to a large extent initiated by) Diffie and Hell-
man’s 1976 paper quoted above. New cryptosystems have been found
that have not been broken despite being subjected to immense efforts
involving both human ingenuity and computational power on a scale
that completely dwarfs the “code breakers” of Poe’s time. Even more
amazingly, these cryptosystems are not only seemingly unbreakable,
but they also achieve this under much harsher conditions. Not only
do today’s attackers have more computational power, but they also
have more data to work with. In Poe’s age, an attacker would be lucky
if they got access to more than a few encryptions of known messages.
These days attackers might have massive amounts of data—terabytes




or more—at their disposal. In fact, with public key encryption, an at-


tacker can generate as many ciphertexts as they wish.
The key to this success has been a clearer understanding of both
how to define security for cryptographic tools and how to relate this
security to concrete computational problems. Cryptography is a vast
and continuously changing topic, but we will touch on some of these
issues in this chapter.

This chapter: A non-mathy overview


Cryptography cannot be covered in a single chapter, and
so this chapter merely gives a “taste” of crypto, focusing on
the aspects most related to computational complexity. For a
more extensive treatment, see my lecture notes from which
this chapter is adapted. We will discuss some “classical
cryptosystems” and show how we can mathematically define
security of encryption, and use the one-time pad to achieve
an encryption that provably satisfies this definition. We will
then see the fundamental limitation of this definition, and
how to bypass it by relaxing security, restricting attention
only to attackers that have bounded computational
resources.
tied to computational complexity and the P vs NP question.
We will also give a small taste of some of the “paradoxical”
cryptographic constructions that go way beyond encryp-
tion, including public-key cryptography, fully-homomorphic
encryption, and multi-party secure computation.

21.1 CLASSICAL CRYPTOSYSTEMS


A great many cryptosystems have been devised and broken through-
out the ages. Let us recount just some of these stories. In 1587, Mary,
Queen of Scots, and the heir to the throne of England, wanted to ar-
range the assassination of her cousin, Queen Elizabeth I of England,
so that she could ascend to the throne and finally escape the house
arrest under which she had been for the last 18 years. As part of this
complicated plot, she sent a coded letter to Sir Anthony Babington.
Mary used what’s known as a substitution cipher where each letter
is transformed into a different obscure symbol (see Fig. 21.1). At a
first look, such a letter might seem rather inscrutable—a meaningless
sequence of strange symbols. However, after some thought, one might
recognize that these symbols repeat several times and moreover that
different symbols repeat with different frequencies.

Figure 21.1: Snippet from encrypted communication between Queen Mary and Sir Babington

Now it doesn’t take a large leap of faith to assume that perhaps each
symbol corresponds to a different letter and the more frequent symbols
correspond to letters that occur in the alphabet with higher frequency. From this
observation, there is a short gap to completely breaking the cipher,
which was in fact done by Queen Elizabeth’s spies, who used the de-
coded letters to learn of all the co-conspirators and to convict Queen
Mary of treason, a crime for which she was executed. Trusting in su-
perficial security measures (such as using “inscrutable” symbols) is a
trap that users of cryptography have been falling into again and again
over the years. (As with many things, this is the subject of a great
XKCD cartoon, see Fig. 21.2.)
The Vigenère cipher is named after Blaise de Vigenère, who de-
scribed it in a book in 1586 (though it was invented earlier by Bellaso).
The idea is to use a collection of substitution cyphers: if there are 𝑛
different ciphers then the first letter of the plaintext is encoded with
the first cipher, the second with the second cipher, the 𝑛𝑡ℎ with the 𝑛𝑡ℎ
cipher, and then the 𝑛 + 1𝑠𝑡 letter is again encoded with the first cipher.
The key 𝑘 is usually a word or a phrase of 𝑛 letters. The 𝑖𝑡ℎ substitu-
tion cipher will shift each letter by the same shift needed to get from A
to 𝑘𝑖 . If 𝑘𝑖 is C, for example, the 𝑖th substitution cipher will shift every
letter by two places.

Figure 21.2: XKCD’s take on the added security of using uncommon symbols

This “flattens” the frequencies and makes it much
harder to do frequency analysis, which is why this cipher was consid-
ered “unbreakable” for 300+ years and got the nickname “le chiffre
indéchiffrable” (“the unbreakable cipher”). Nevertheless, Charles
Babbage cracked the Vigenère cipher in 1854 (though he did not pub-
lish it). In 1863 Friedrich Kasiski broke the cipher and published the
result. The idea is that once you guess the length of the cipher, you
can reduce the task to breaking a simple substitution cipher which can
be done via frequency analysis (can you see why?). Confederate gen-
erals used Vigenère regularly during the civil war, and their messages
were routinely cryptanalyzed by Union officers.
Figure 21.3: Confederate Cipher Disk for implementing the Vigenère cipher

The Enigma cipher was a mechanical cipher (looking like a type-
writer, see Fig. 21.5) where each letter typed would get mapped into
a different letter depending on the (rather complicated) key and cur-
rent state of the machine, which had several rotors that rotated at
different paces. An identically wired machine at the other end could
be used to decrypt. Just as many ciphers in history, this has also been
believed by the Germans to be “impossible to break” and even quite
late in the war they refused to believe it was broken despite mount-
ing evidence to that effect. (In fact, some German generals refused
to believe it was broken even after the war.) Breaking Enigma was a
heroic effort which was initiated by the Poles and then completed by
the British at Bletchley Park, with Alan Turing (of the Turing machine)
playing a key role. As part of this effort the Brits built arguably the
world’s first large scale mechanical computation devices (though they
looked more similar to washing machines than to iPhones).

Figure 21.4: Confederate encryption of the message “Gen’l Pemberton: You can expect no help from this side of the river. Let Gen’l Johnston know, if possible, when you can attack the same point on the enemy’s lines. Inform me also and I will endeavor to make a diversion. I have sent some caps. I subjoin a despatch from General Johnston.”

They were

also helped along the way by some quirks and errors of the German
operators. For example, the fact that their messages ended with “Heil
Hitler” turned out to be quite useful.
Here is one entertaining anecdote: the Enigma machine would
never map a letter to itself. In March 1941, Mavis Batey, a cryptana-
lyst at Bletchley Park received a very long message that she tried to
decrypt. She then noticed a curious property— the message did not
contain the letter “L”.1 She realized that the probability that no “L”’s
appeared in the message is too small for this to happen by chance.
Hence she surmised that the original message must have been com-
posed only of L’s. That is, it must have been the case that the operator,
perhaps to test the machine, had simply sent out a message where he
repeatedly pressed the letter “L”. This observation helped her decode
the next message, which helped inform of a planned Italian attack and
secure a resounding British victory in what became known as “the
Battle of Cape Matapan”. Mavis also helped break another Enigma
machine. Using the information she provided, the Brits were able
to feed the Germans with the false information that the main allied
invasion would take place in Pas de Calais rather than on Normandy.
In the words of General Eisenhower, the intelligence from Bletchley
Park was of “priceless value”. It made a huge difference for the Allied
war effort, thereby shortening World War II and saving millions of
lives. See also this interview with Sir Harry Hinsley.

Figure 21.5: In the Enigma mechanical cipher the secret key would be the settings of the rotors and internal wires. As the operator typed up their message, the encrypted version appeared in the display area above, and the internal state of the cipher was updated (and so typing the same letter twice would generally result in two different letters output). Decrypting follows the same process: if the sender and receiver are using the same key then typing the ciphertext would result in the plaintext appearing in the display.

1 Here is a nice exercise: compute (up to an order of magnitude) the probability that a 50-letter long message composed of random letters will end up not containing the letter “L”.

21.2 DEFINING ENCRYPTION


Many of the troubles that cryptosystem designers faced over history
(and still face!) can be attributed to not properly defining or under-
standing the goals they want to achieve in the first place. Let us focus
on the setting of private key encryption. (This is also known as “sym-
metric encryption”; for thousands of years, “private key encryption”
was synonymous with encryption and only in the 1970s was the con-
cept of public key encryption invented, see Definition 21.11.) A sender
(traditionally called “Alice”) wants to send a message (known also
as a plaintext) 𝑥 ∈ {0, 1}∗ to a receiver (traditionally called “Bob”).
They would like their message to be kept secret from an adversary
who listens in or “eavesdrops” on the communication channel (and is
traditionally called “Eve”).
Alice and Bob share a secret key 𝑘 ∈ {0, 1}∗ . (While the letter 𝑘
is often used elsewhere in the book to denote a natural number, in
this chapter we use it to denote the string corresponding to a secret
key.) Alice uses the key 𝑘 to “scramble” or encrypt the plaintext 𝑥 into
a ciphertext 𝑦, and Bob uses the key 𝑘 to “unscramble” or decrypt the
ciphertext 𝑦 back into the plaintext 𝑥. This motivates the following
definition which attempts to capture what it means for an encryption

scheme to be valid or “make sense”, regardless of whether or not it is


secure:

Definition 21.1 — Valid encryption scheme. Let 𝐿 ∶ ℕ → ℕ and 𝐶 ∶ ℕ → ℕ


be two functions mapping natural numbers to natural numbers.
A pair of polynomial-time computable functions (𝐸, 𝐷) map-
ping strings to strings is a valid private key encryption scheme (or
encryption scheme for short) with plaintext length function 𝐿(⋅) and
ciphertext length function 𝐶(⋅) if for every 𝑛 ∈ ℕ, 𝑘 ∈ {0, 1}𝑛 and
𝑥 ∈ {0, 1}𝐿(𝑛) , |𝐸𝑘 (𝑥)| = 𝐶(𝑛) and

𝐷(𝑘, 𝐸(𝑘, 𝑥)) = 𝑥 . (21.1)

We will often write the first input (i.e., the key) to the encryp-
tion and decryption as a subscript and so can write (21.1) also as
𝐷𝑘 (𝐸𝑘 (𝑥)) = 𝑥.
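As a concrete (and deliberately insecure) illustration of this interface, here is a minimal Python sketch; it is added for illustration and is not code from the text, and the helper names are arbitrary. It implements the trivial scheme 𝐸𝑘 (𝑥) = 𝑥 that the next section points out is valid but offers no secrecy, and checks the validity condition 𝐷𝑘 (𝐸𝑘 (𝑥)) = 𝑥 by brute force for small 𝑛 and 𝐿(𝑛).

```python
# Illustration: the (E, D) interface of Definition 21.1, instantiated with the
# trivial (valid but totally insecure) scheme E_k(x) = x, plus a brute-force
# check of the validity condition D_k(E_k(x)) = x for small n and L(n).
from itertools import product

def E(k, x):   # encryption: here it ignores the key entirely
    return x

def D(k, y):   # decryption
    return y

def is_valid(E, D, n, L):
    """Check D_k(E_k(x)) = x for all keys in {0,1}^n and plaintexts in {0,1}^L."""
    for k in product("01", repeat=n):
        for x in product("01", repeat=L):
            k_str, x_str = "".join(k), "".join(x)
            if D(k_str, E(k_str, x_str)) != x_str:
                return False
    return True

if __name__ == "__main__":
    print(is_valid(E, D, n=3, L=4))   # True: valid, though it offers no secrecy
```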
Solved Exercise 21.1 — Lengths of ciphertext and plaintext. Prove that for
every valid encryption scheme (𝐸, 𝐷) with functions 𝐿, 𝐶, we have 𝐶(𝑛) ≥
𝐿(𝑛) for every 𝑛.

Solution:
For every fixed key 𝑘 ∈ {0, 1}𝑛 , the equation (21.1) implies that
the map 𝑦 ↦ 𝐷𝑘 (𝑦) inverts the map 𝑥 ↦ 𝐸𝑘 (𝑥), which in partic-
ular means that the map 𝑥 ↦ 𝐸𝑘 (𝑥) must be one to one. Hence
its codomain must be at least as large as its domain, and since its
domain is {0, 1}^{𝐿(𝑛)} and its codomain is {0, 1}^{𝐶(𝑛)} it follows that
𝐶(𝑛) ≥ 𝐿(𝑛).

Figure 21.6: A private-key encryption scheme is a pair of algorithms 𝐸, 𝐷 such that for every key 𝑘 ∈ {0, 1}𝑛 and plaintext 𝑥 ∈ {0, 1}^{𝐿(𝑛)} , 𝑦 = 𝐸𝑘 (𝑥) is a ciphertext of length 𝐶(𝑛). The encryption scheme is valid if for every such 𝑦, 𝐷𝑘 (𝑦) = 𝑥. That is, the decryption of an encryption of 𝑥 is 𝑥, as long as both encryption and decryption use the same key.

Since the ciphertext length is always at least the plaintext length


(and in most applications it is not much longer than that), we typi-
cally focus on the plaintext length as the quantity to optimize in an
encryption scheme. The larger 𝐿(𝑛) is, the better the scheme, since it
means we need a shorter secret key to protect messages of the same
length.

21.3 DEFINING SECURITY OF ENCRYPTION


Definition 21.1 says nothing about the security of 𝐸 and 𝐷, and even
allows the trivial encryption scheme that ignores the key altogether
and sets 𝐸𝑘 (𝑥) = 𝑥 for every 𝑥. Defining security is not a trivial matter.

P
You would appreciate the subtleties of defining secu-
rity of encryption more if at this point you take a five
minute break from reading, and try (possibly with a
partner) to brainstorm on how you would mathemat-
ically define the notion that an encryption scheme is
secure, in the sense that it protects the secrecy of the
plaintext 𝑥.

Throughout history, many attacks on cryptosystems were rooted


in the cryptosystem designers’ reliance on “security through
obscurity”— trusting that the fact their methods are not known to
their enemy will protect them from being broken. This is a faulty
assumption - if you reuse a method again and again (even with a
different key each time) then eventually your adversaries will figure
out what you are doing. And if Alice and Bob meet frequently in a
secure location to decide on a new method, they might as well take
the opportunity to exchange their secrets. These considerations led
Auguste Kerckhoffs in 1883 to state the following principle:

A cryptosystem should be secure even if everything about the system, except the
key, is public knowledge.2
2 The actual quote is “Il faut qu’il n’exige pas le secret, et qu’il puisse sans inconvénient tomber entre les mains de l’ennemi”, loosely translated as “The system must not require secrecy and can be stolen by the enemy without causing trouble”. According to Steve Bellovin the NSA version is “assume that the first copy of any device we make is shipped to the Kremlin”.

Why is it OK to assume the key is secret and not the algorithm?
Because we can always choose a fresh key. But of course that won’t
help us much if our key is “1234” or “passw0rd!”. In fact, if you use
any deterministic algorithm to choose the key then eventually your
adversary will figure this out. Therefore for security we must choose
the key at random and can restate Kerckhoffs’s principle as follows:

There is no secrecy without randomness

This is such a crucial point that is worth repeating:

 Big Idea 26 There is no secrecy without randomness.

At the heart of every cryptographic scheme there is a secret key,


and the secret key is always chosen at random. A corollary of that
is that to understand cryptography, you need to know probability
theory.

R
Remark 21.2 — Randomness in the real world. Choos-
ing the secrets for cryptography requires generating
randomness, which is often done by measuring some
“unpredictable” or “high entropy” data, and then
applying hash functions to the result to “extract” a

uniformly random string. Great care must be taken in


doing this, and randomness generators often turn out
to be the Achilles heel of secure systems.
In 2006 a programmer removed a line of code from the
procedure to generate entropy in OpenSSL package
distributed by Debian since it caused a warning in
some automatic verification code. As a result for two
years (until this was discovered) all the randomness
generated by this procedure used only the process
ID as an “unpredictable” source. This means that all
communication done by users in that period is fairly
easily breakable (and in particular, if some entities
recorded that communication they could break it also
retroactively). See XKCD’s take on that incident.
In 2012 two separate teams of researchers scanned a
large number of RSA keys on the web and found out
that about 4 percent of them are easy to break. The
main issue were devices such as routers, internet-
connected printers and such. These devices sometimes
run variants of Linux—a desktop operating system—
but without a hard drive, mouse or keyboard, they
don’t have access to many of the entropy sources that
desktops have. Coupled with some good old fash-
ioned ignorance of cryptography and software bugs,
this led to many keys that are downright trivial to
break, see this blog post and this web page for more
details.
Since randomness is so crucial to security, breaking
the procedure to generate randomness can lead to a
complete break of the system that uses this random-
ness. Indeed, the Snowden documents, combined with
observations of Shumow and Ferguson, strongly sug-
gest that the NSA has deliberately inserted a trapdoor
in one of the pseudorandom generators published by
the National Institute of Standards and Technology
(NIST). Fortunately, this generator wasn’t widely
adopted, but apparently the NSA did pay 10 million
dollars to RSA Security so the latter would make this
generator the default option in their products.

21.4 PERFECT SECRECY


If you think about encryption scheme security for a while, you might
come up with the following principle for defining security: “An
encryption scheme is secure if it is not possible to recover the key 𝑘 from
𝐸𝑘 (𝑥)”. However, a moment’s thought shows that the key is not really
what we’re trying to protect. After all, the whole point of an encryp-
tion is to protect the confidentiality of the plaintext 𝑥. So, we can try to
define that “an encryption scheme is secure if it is not possible to recover the
plaintext 𝑥 from 𝐸𝑘 (𝑥)”. Yet it is not clear what this means either. Sup-
pose that an encryption scheme reveals the first 10 bits of the plaintext

𝑥. It might still not be possible to recover 𝑥 completely, but on an in-


tuitive level, this seems like it would be extremely unwise to use such
an encryption scheme in practice. Indeed, often even partial information
about the plaintext is enough for the adversary to achieve its goals.
The above thinking led Shannon in 1945 to formalize the notion of
perfect secrecy, which is that an encryption reveals absolutely nothing
about the message. There are several equivalent ways to define it, but
perhaps the cleanest one is the following:

Definition 21.3 — Perfect secrecy. A valid encryption scheme (𝐸, 𝐷)
with plaintext length 𝐿(⋅) is perfectly secret if for every 𝑛 ∈ ℕ and
plaintexts 𝑥, 𝑥′ ∈ {0, 1}^{𝐿(𝑛)} , the following two distributions 𝑌 and
𝑌 ′ over {0, 1}∗ are identical:

• 𝑌 is obtained by sampling 𝑘 ∼ {0, 1}𝑛 and outputting 𝐸𝑘 (𝑥).

• 𝑌 ′ is obtained by sampling 𝑘 ∼ {0, 1}𝑛 and outputting 𝐸𝑘 (𝑥′ ).

P
This definition might take more than one reading
to parse. Try to think of how this condition would
correspond to your intuitive notion of “learning no
information” about 𝑥 from observing 𝐸𝑘 (𝑥), and to
Shannon’s quote in the beginning of this chapter.
In particular, suppose that you knew ahead of time
that Alice sent either an encryption of 𝑥 or an en-
cryption of 𝑥′ . Would you learn anything new from
observing the encryption of the message that Alice
actually sent? It may help you to look at Fig. 21.7.
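One way to build intuition for the definition is to check it by brute force on toy schemes. The Python sketch below is an illustration added here, not code from the text; the XOR-based scheme in it is the one that Section 21.4.2 will introduce as the one-time pad, and the helper names are arbitrary. It enumerates, for every plaintext 𝑥, the multiset of ciphertexts 𝐸𝑘 (𝑥) over all keys and declares the scheme perfectly secret if all these multisets coincide.

```python
# Illustration: brute-force check of perfect secrecy (Definition 21.3) for
# toy schemes, by comparing for every plaintext x the multiset of
# ciphertexts E_k(x) over all keys k.
from collections import Counter
from itertools import product

def all_strings(n):
    return ["".join(bits) for bits in product("01", repeat=n)]

def is_perfectly_secret(E, n, L):
    distributions = []
    for x in all_strings(L):
        # distribution of E_k(x) for a uniformly random key k (as a multiset)
        distributions.append(Counter(E(k, x) for k in all_strings(n)))
    return all(d == distributions[0] for d in distributions)

def xor_strings(a, b):
    return "".join(str(int(c) ^ int(d)) for c, d in zip(a, b))

if __name__ == "__main__":
    otp = lambda k, x: xor_strings(x, k)      # the scheme of Fig. 21.8 for n = 2
    trivial = lambda k, x: x                  # ignores the key
    print(is_perfectly_secret(otp, n=2, L=2))      # True
    print(is_perfectly_secret(trivial, n=2, L=2))  # False
```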

21.4.1 Example: Perfect secrecy in the battlefield


To understand Definition 21.3, suppose that Alice sends only one of
two possible messages: “attack” or “retreat”, which we denote by 𝑥0
and 𝑥1 respectively, and that she sends each one of those messages
with probability 1/2. Let us put ourselves in the shoes of Eve, the
eavesdropping adversary. A priori we would have guessed that Alice
sent either 𝑥0 or 𝑥1 with probability 1/2. Now we observe 𝑦 = 𝐸𝑘 (𝑥𝑖 )
where 𝑘 is a uniformly chosen key in {0, 1}𝑛 . How does this new
information cause us to update our beliefs on whether Alice sent the
plaintext 𝑥0 or the plaintext 𝑥1 ?

Figure 21.7: For any key length 𝑛, we can visualize an encryption scheme (𝐸, 𝐷) as a graph with a vertex for every one of the 2^{𝐿(𝑛)} possible plaintexts and for every one of the ciphertexts in {0, 1}∗ of the form 𝐸𝑘 (𝑥) for 𝑘 ∈ {0, 1}𝑛 and 𝑥 ∈ {0, 1}^{𝐿(𝑛)} . For every plaintext 𝑥 and key 𝑘, we add an edge labeled 𝑘 between 𝑥 and 𝐸𝑘 (𝑥). By the validity condition, if we pick any fixed key 𝑘, the map 𝑥 ↦ 𝐸𝑘 (𝑥) must be one-to-one. The condition of perfect secrecy simply corresponds to requiring that every two plaintexts 𝑥 and 𝑥′ have exactly the same set of neighbors (or multi-set, if there are parallel edges).

P
Before reading the next paragraph, you might want
to try the analysis yourself. You may find it useful to

look at the Wikipedia entry on Bayesian Inference or


these MIT lecture notes.

Let us define 𝑝0 (𝑦) to be the probability (taken over 𝑘 ∼ {0, 1}𝑛 )


that 𝑦 = 𝐸𝑘 (𝑥0 ) and similarly 𝑝1 (𝑦) to be Pr𝑘∼{0,1}𝑛 [𝑦 = 𝐸𝑘 (𝑥1 )].
Note that, since Alice chooses the message to send at random, our
a priori probability for observing 𝑦 is (1/2)𝑝0 (𝑦) + (1/2)𝑝1 (𝑦). However,
as per Definition 21.3, the perfect secrecy condition guarantees that
𝑝0 (𝑦) = 𝑝1 (𝑦)! Let us denote the number 𝑝0 (𝑦) = 𝑝1 (𝑦) by 𝑝. By the
formula for conditional probability, the probability that Alice sent the
message 𝑥0 conditioned on our observation 𝑦 is simply

Pr[𝑖 = 0 | 𝑦 = 𝐸𝑘 (𝑥𝑖 )] = Pr[𝑖 = 0 ∧ 𝑦 = 𝐸𝑘 (𝑥𝑖 )] / Pr[𝑦 = 𝐸𝑘 (𝑥𝑖 )] .   (21.2)

(The equation (21.2) is a special case of Bayes’ rule which, although


a simple restatement of the formula for conditional probability, is
an extremely important and widely used tool in statistics and data
analysis.)
Since the probability that 𝑖 = 0 and 𝑦 is the ciphertext 𝐸𝑘 (𝑥0 ) is equal
to (1/2) ⋅ 𝑝0 (𝑦), and the a priori probability of observing 𝑦 is (1/2)𝑝0 (𝑦) +
(1/2)𝑝1 (𝑦), we can rewrite (21.2) as

Pr[𝑖 = 0 | 𝑦 = 𝐸𝑘 (𝑥𝑖 )] = ((1/2)𝑝0 (𝑦)) / ((1/2)𝑝0 (𝑦) + (1/2)𝑝1 (𝑦)) = 𝑝/(𝑝 + 𝑝) = 1/2

using the fact that 𝑝0 (𝑦) = 𝑝1 (𝑦) = 𝑝. This means that observing the
ciphertext 𝑦 did not help us at all! We still would not be able to guess
whether Alice sent “attack” or “retreat” with better than 50/50 odds!
This example can be vastly generalized to show that perfect secrecy
is indeed “perfect” in the sense that observing a ciphertext gives Eve
no additional information about the plaintext beyond her a priori knowl-
edge.

21.4.2 Constructing perfectly secret encryption


Perfect secrecy is an extremely strong condition, and implies that an
eavesdropper does not learn any information from observing the ci-
phertext. You might think that an encryption scheme satisfying such a
strong condition will be impossible, or at least extremely complicated,
to achieve. However it turns out we can in fact obtain a perfectly secret
encryption scheme fairly easily. Such a scheme for two-bit messages is
illustrated in Fig. 21.8.

Figure 21.8: A perfectly secret encryption scheme for two-bit keys and messages. The blue vertices represent plaintexts and the red vertices represent ciphertexts, each edge mapping a plaintext 𝑥 to a ciphertext 𝑦 = 𝐸𝑘 (𝑥) is labeled with the corresponding key 𝑘. Since there are four possible keys, the degree of the graph is four and it is in fact a complete bipartite graph. The encryption scheme is valid in the sense that for every 𝑘 ∈ {0, 1}², the map 𝑥 ↦ 𝐸𝑘 (𝑥) is one-to-one, which in other words means that the set of edges labeled with 𝑘 is a matching.

In fact, this can be generalized to any number of bits:

Theorem 21.4 — One Time Pad (Vernam 1917, Shannon 1949). There is a per-
fectly secret valid encryption scheme (𝐸, 𝐷) with 𝐿(𝑛) = 𝐶(𝑛) = 𝑛.

Proof Idea:
Our scheme is the one-time pad also known as the “Vernam Ci-
pher”, see Fig. 21.9. The encryption is exceedingly simple: to encrypt
a message 𝑥 ∈ {0, 1}𝑛 with a key 𝑘 ∈ {0, 1}𝑛 we simply output 𝑥 ⊕ 𝑘
where ⊕ is the bitwise XOR operation that outputs the string corre-
sponding to XORing each coordinate of 𝑥 and 𝑘.

Proof of Theorem 21.4. For two binary strings 𝑎 and 𝑏 of the same
length 𝑛, we define 𝑎 ⊕ 𝑏 to be the string 𝑐 ∈ {0, 1}𝑛 such that
𝑐𝑖 = 𝑎𝑖 + 𝑏𝑖 mod 2 for every 𝑖 ∈ [𝑛]. The encryption scheme
(𝐸, 𝐷) is defined as follows: 𝐸𝑘 (𝑥) = 𝑥 ⊕ 𝑘 and 𝐷𝑘 (𝑦) = 𝑦 ⊕ 𝑘.
By the associative law of addition (which works also modulo two),
𝐷𝑘 (𝐸𝑘 (𝑥)) = (𝑥 ⊕ 𝑘) ⊕ 𝑘 = 𝑥 ⊕ (𝑘 ⊕ 𝑘) = 𝑥 ⊕ 0𝑛 = 𝑥, using the fact
that for every bit 𝜎 ∈ {0, 1}, 𝜎 + 𝜎 mod 2 = 0 and 𝜎 + 0 = 𝜎 mod 2.
Hence (𝐸, 𝐷) form a valid encryption.
To analyze the perfect secrecy property, we claim that for every
𝑥 ∈ {0, 1}𝑛 , the distribution 𝑌𝑥 = 𝐸𝑘 (𝑥) where 𝑘 ∼ {0, 1}𝑛 is simply
the uniform distribution over {0, 1}𝑛 , and hence in particular the
distributions 𝑌𝑥 and 𝑌𝑥′ are identical for every 𝑥, 𝑥′ ∈ {0, 1}𝑛 . Indeed,
for every particular 𝑦 ∈ {0, 1}𝑛 , the value 𝑦 is output by 𝑌𝑥 if and
only if 𝑦 = 𝑥 ⊕ 𝑘 which holds if and only if 𝑘 = 𝑥 ⊕ 𝑦. Since 𝑘 is
chosen uniformly at random in {0, 1}𝑛 , the probability that 𝑘 happens
to equal 𝑥 ⊕ 𝑦 is exactly 2−𝑛 , which means that every string 𝑦 is output
by 𝑌𝑥 with probability 2−𝑛 .
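As a practical-looking (but still toy) illustration of the construction, here is the one-time pad over bytes in a few lines of Python. This is a sketch added for illustration, not code from the text; as the surrounding discussion emphasizes, the key must be as long as the message and must never be reused.

```python
# Illustration: the one-time pad over bytes.
# The key must be as long as the message and must never be reused.
import os

def otp_encrypt(key: bytes, plaintext: bytes) -> bytes:
    assert len(key) == len(plaintext)
    return bytes(k ^ p for k, p in zip(key, plaintext))

def otp_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return otp_encrypt(key, ciphertext)   # XOR with the same key inverts it

if __name__ == "__main__":
    message = b"attack at dawn"
    key = os.urandom(len(message))        # fresh uniformly random key
    ciphertext = otp_encrypt(key, message)
    assert otp_decrypt(key, ciphertext) == message
    print(ciphertext.hex())
```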

P
The argument above is quite simple but is worth
reading again. To understand why the one-time pad
is perfectly secret, it is useful to envision it as a bi-
partite graph as we’ve done in Fig. 21.8. (In fact the
encryption scheme of Fig. 21.8 is precisely the one-
time pad for 𝑛 = 2.) For every 𝑛, the one-time pad
encryption scheme corresponds to a bipartite graph
with 2^𝑛 vertices on the “left side” corresponding to the
plaintexts in {0, 1}𝑛 and 2^𝑛 vertices on the “right side”
corresponding to the ciphertexts {0, 1}𝑛 . For every
𝑥 ∈ {0, 1}𝑛 and 𝑘 ∈ {0, 1}𝑛 , we connect 𝑥 to the vertex
𝑦 = 𝐸𝑘 (𝑥) with an edge that we label with 𝑘. One can
see that this is the complete bipartite graph, where
every vertex on the left is connected to all vertices on
the right. In particular this means that for every left
vertex 𝑥, the distribution on the ciphertexts obtained
by taking a random 𝑘 ∈ {0, 1}𝑛 and going to the
neighbor of 𝑥 on the edge labeled 𝑘 is the uniform dis-
tribution over {0, 1}𝑛 . This ensures the perfect secrecy
condition.

Figure 21.9: In the one time pad encryption scheme we encrypt a plaintext 𝑥 ∈ {0, 1}𝑛 with a key 𝑘 ∈ {0, 1}𝑛 by the ciphertext 𝑥 ⊕ 𝑘 where ⊕ denotes the bitwise XOR operation.

21.5 NECESSITY OF LONG KEYS


So, does Theorem 21.4 give the final word on cryptography, and
mean that we can all communicate with perfect secrecy and live
happily ever after? No it doesn’t. While the one-time pad is efficient,
and gives perfect secrecy, it has one glaring disadvantage: to commu-
nicate 𝑛 bits you need to store a key of length 𝑛. In contrast, practically
used cryptosystems such as AES-128 have a short key of 128 bits (i.e.,
16 bytes) that can be used to protect terabytes or more of communica-
tion! Imagine that we all needed to use the one time pad. If that was
the case, then if you had to communicate with 𝑚 people, you would
have to maintain (securely!) 𝑚 huge files that are each as long as the
length of the maximum total communication you expect with that per-
son. Imagine that every time you opened an account with Amazon,
Google, or any other service, they would need to send you in the mail
(ideally with a secure courier) a DVD full of random numbers, and
every time you suspected a virus, you’d need to ask all these services
for a fresh DVD. This doesn’t sound so appealing.
This is not just a theoretical issue. The Soviets have used the one-
time pad for their confidential communication since before the 1940’s.
In fact, even before Shannon’s work, the U.S. intelligence already
knew in 1941 that the one-time pad is in principle “unbreakable” (see
page 32 in the Venona document).

Figure 21.10: Gene Grabeel, who founded the U.S. Russian SigInt program on 1 Feb 1943. Photo taken in 1942, see Page 7 in the Venona historical study.

However, it turned out that the
hassle of manufacturing so many keys for all the communication took
its toll on the Soviets and they ended up reusing the same keys for
more than one message. They did try to use them for completely dif-
ferent receivers in the (false) hope that this wouldn’t be detected. The
Venona Project of the U.S. Army was founded in February 1943 by
Gene Grabeel (see Fig. 21.10), a former home economics teacher from
Madison Heights, Virginia and Lt. Leonard Zubko. In October 1943,
they had their breakthrough when it was discovered that the Russians
were reusing their keys. In the 37 years of its existence, the project has
resulted in a treasure chest of intelligence, exposing hundreds of KGB
agents and Russian spies in the U.S. and other countries, including
Julius Rosenberg, Harry Gold, Klaus Fuchs, Alger Hiss, Harry Dexter
White and many others.

Figure 21.11: An encryption scheme where the number of keys is smaller than the number of plaintexts corresponds to a bipartite graph where the degree is smaller than the number of vertices on the left side. Together with the validity condition this implies that there will be two left vertices 𝑥, 𝑥′ with non-identical neighborhoods, and hence the scheme does not satisfy perfect secrecy.

Unfortunately it turns out that such long keys are necessary for
perfect secrecy:

Theorem 21.5 — Perfect secrecy requires long keys. For every perfectly
secret encryption scheme (𝐸, 𝐷) the length function 𝐿 satisfies
𝐿(𝑛) ≤ 𝑛.

Proof Idea:
The idea behind the proof is illustrated in Fig. 21.11. We define a
graph between the plaintexts and ciphertexts, where we put an edge
between plaintext 𝑥 and ciphertext 𝑦 if there is some key 𝑘 such that
𝑦 = 𝐸𝑘 (𝑥). The degree of this graph is at most the number of potential
keys. The fact that the degree is smaller than the number of plaintexts
(and hence of ciphertexts) implies that there would be two plaintexts
𝑥 and 𝑥′ with different sets of neighbors, and hence the distribution
of a ciphertext corresponding to 𝑥 (with a random key) will not be
identical to the distribution of a ciphertext corresponding to 𝑥′ .

Proof of Theorem 21.5. Let 𝐸, 𝐷 be a valid encryption scheme with


messages of length 𝐿 and key of length 𝑛 < 𝐿. We will show that
(𝐸, 𝐷) is not perfectly secret by providing two plaintexts 𝑥0 , 𝑥1 ∈
{0, 1}𝐿 such that the distributions 𝑌𝑥0 and 𝑌𝑥1 are not identical, where
𝑌𝑥 is the distribution obtained by picking 𝑘 ∼ {0, 1}𝑛 and outputting
𝐸𝑘 (𝑥).
We choose 𝑥0 = 0𝐿 . Let 𝑆0 ⊆ {0, 1}∗ be the set of all ciphertexts
that have non-zero probability of being output in 𝑌𝑥0 . That is, 𝑆0 =
{𝑦 | ∃𝑘∈{0,1}𝑛 𝑦 = 𝐸𝑘 (𝑥0 )}. Since there are only 2𝑛 keys, we know that
|𝑆0 | ≤ 2𝑛 .
We will show the following claim:
Claim I: There exists some 𝑥1 ∈ {0, 1}𝐿 and 𝑘 ∈ {0, 1}𝑛 such that
𝐸𝑘 (𝑥1 ) ∉ 𝑆0 .
Claim I implies that the string 𝐸𝑘 (𝑥1 ) has positive probability of
being output by 𝑌𝑥1 and zero probability of being output by 𝑌𝑥0 and
hence in particular 𝑌𝑥0 and 𝑌𝑥1 are not identical. To prove Claim I, just
choose a fixed 𝑘 ∈ {0, 1}𝑛 . By the validity condition, the map 𝑥 ↦
𝐸𝑘 (𝑥) is a one to one map of {0, 1}𝐿 to {0, 1}∗ and hence in particular
the image of this map which is the set 𝐼𝑘 = {𝑦 | ∃𝑥∈{0,1}𝐿 𝑦 = 𝐸𝑘 (𝑥)}
has size at least (in fact exactly) 2𝐿 . Since |𝑆0 | ≤ 2𝑛 < 2𝐿 , this means
that |𝐼𝑘 | > |𝑆0 | and so in particular there exists some string 𝑦 in 𝐼𝑘 ⧵ 𝑆0 .
But by the definition of 𝐼𝑘 this means that there is some 𝑥 ∈ {0, 1}𝐿
such that 𝐸𝑘 (𝑥) ∉ 𝑆0 which concludes the proof of Claim I and hence
of Theorem 21.5.
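The counting argument can be watched in action. The Python sketch below is an illustration added here, not code from the text; the toy scheme and helper names are arbitrary. It takes a scheme whose key is shorter than its plaintext and finds, exactly as in Claim I, a plaintext 𝑥1 and key 𝑘 with 𝐸𝑘 (𝑥1 ) outside 𝑆0 , witnessing the failure of perfect secrecy.

```python
# Illustration: for a toy scheme with key length n < L, find (as in Claim I)
# a plaintext x1 and key k with E_k(x1) outside S_0, the support of the
# encryptions of x0 = 0^L.  Such an x1 witnesses that the scheme is not
# perfectly secret.
from itertools import product

def all_strings(n):
    return ["".join(bits) for bits in product("01", repeat=n)]

def xor_strings(a, b):
    return "".join(str(int(c) ^ int(d)) for c, d in zip(a, b))

def find_witness(E, n, L):
    x0 = "0" * L
    S0 = {E(k, x0) for k in all_strings(n)}     # at most 2^n ciphertexts
    k = "0" * n                                  # any fixed key works
    for x1 in all_strings(L):
        if E(k, x1) not in S0:                   # must exist since 2^L > 2^n
            return x1, E(k, x1)
    return None

if __name__ == "__main__":
    # A toy "short-key one-time pad": XOR the key into the first n bits only.
    def E(k, x):
        return xor_strings(x[: len(k)], k) + x[len(k):]
    print(find_witness(E, n=2, L=4))
```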


21.6 COMPUTATIONAL SECRECY


To sum up the previous episodes, we now know that:

• It is possible to obtain a perfectly secret encryption scheme with key


length the same as the plaintext.

and

• It is not possible to obtain such a scheme with key that is even a


single bit shorter than the plaintext.

How does this mesh with the fact that, as we’ve already seen, peo-
ple routinely use cryptosystems with a 16 byte (i.e., 128 bit) key but
many terabytes of plaintext? The proof of Theorem 21.5 does in fact
give a way to break all these cryptosystems, but an examination of this
proof shows that it only yields an algorithm with time exponential in
the length of the key. This motivates the following relaxation of perfect
secrecy to a condition known as “computational secrecy”. Intuitively,
an encryption scheme is computationally secret if no polynomial time
algorithm can break it. The formal definition is below:

Definition 21.6 — Computational secrecy. Let (𝐸, 𝐷) be a valid encryp-
tion scheme where for keys of length 𝑛, the plaintexts are of length
𝐿(𝑛) and the ciphertexts are of length 𝑚(𝑛). We say that (𝐸, 𝐷) is
computationally secret if for every polynomial 𝑝 ∶ ℕ → ℕ, and large
enough 𝑛, if 𝑃 is an 𝑚(𝑛)-input and single output NAND-CIRC
program of at most 𝑝(𝑛) lines, and 𝑥0 , 𝑥1 ∈ {0, 1}^{𝐿(𝑛)} then

∣ 𝔼_{𝑘∼{0,1}^𝑛} [𝑃 (𝐸𝑘 (𝑥0 ))] − 𝔼_{𝑘∼{0,1}^𝑛} [𝑃 (𝐸𝑘 (𝑥1 ))] ∣ < 1/𝑝(𝑛)   (21.3)

P
Definition 21.6 requires a second or third read and
some practice to truly understand. One excellent exer-
cise to make sure you follow it is to see that if we allow
𝑃 to be an arbitrary function mapping {0, 1}𝑚(𝑛) to
{0, 1}, and we replace the condition in (21.3) that the
left-hand side is smaller than 1/𝑝(𝑛) with the condition
that it is equal to 0 then we get the perfect secrecy
condition of Definition 21.3. Indeed if the distributions
𝐸𝑘 (𝑥0 ) and 𝐸𝑘 (𝑥1 ) are identical then applying any
function 𝑃 to them we get the same expectation. On
the other hand, if the two distributions above give a
different probability for some element 𝑦∗ ∈ {0, 1}𝑚(𝑛) ,
then the function 𝑃 (𝑦) that outputs 1 iff 𝑦 = 𝑦∗ will

have a different expectation under the former distribu-


tion than under the latter.

Definition 21.6 raises two natural questions:

• Is it strong enough to ensure that a computationally secret encryp-


tion scheme protects the secrecy of messages that are encrypted
with it?

• Is it weak enough that, unlike perfect secrecy, it is possible to obtain


a computationally secret encryption scheme where the key is much
smaller than the message?

To the best of our knowledge, the answer to both questions is Yes.


This is just one example of a much broader phenomenon. We can
use computational hardness to achieve many cryptographic goals,
including some goals that have been dreamed about for millenia, and
other goals that people have not even dared to imagine.

 Big Idea 27 Computational hardness is necessary and sufficient for


almost all cryptographic applications.

Regarding the first question, it is not hard to show that if, for ex-
ample, Alice uses a computationally secret encryption algorithm to
encrypt either “attack” or “retreat” (each chosen with probability
1/2), then as long as she’s restricted to polynomial-time algorithms, an
adversary Eve will not be able to guess the message with probability
better than, say, 0.51, even after observing its encrypted form. (We
omit the proof, but it is an excellent exercise for you to work it out on
your own.)
To answer the second question we will show that under the same
assumption we used for derandomizing BPP, we can obtain a com-
putationally secret cryptosystem where the key is almost exponentially
smaller than the plaintext.

21.6.1 Stream ciphers or the “derandomized one-time pad”


It turns out that if pseudorandom generators exist as in the optimal
PRG conjecture, then there exists a computationally secret encryption
scheme with keys that are much shorter than the plaintext. The con-
struction below is known as a stream cipher, though perhaps a better
name is the “derandomized one-time pad”. It is widely used in prac-
tice with keys on the order of a few tens or hundreds of bits protecting Figure 21.12: In a stream cipher or “derandomized
many terabytes or even petabytes of communication. one-time pad” we use a pseudorandom generator
𝐺 ∶ {0, 1}𝑛 → {0, 1}𝐿 to obtain an encryption scheme
We start by recalling the notion of a pseudorandom generator, as de-
with a key length of 𝑛 and plaintexts of length 𝐿.
fined in Definition 20.9. For this chapter, we will fix a special case of We encrypt the plaintext 𝑥 ∈ {0, 1}𝐿 with key
the definition: 𝑘 ∈ {0, 1}𝑛 by the ciphertext 𝑥 ⊕ 𝐺(𝑘).

Definition 21.7 — Cryptographic pseudorandom generator. Let 𝐿 ∶ ℕ → ℕ be
some function. A cryptographic pseudorandom generator with stretch
𝐿(⋅) is a polynomial-time computable function 𝐺 ∶ {0, 1}∗ → {0, 1}∗
such that:

• For every 𝑛 ∈ ℕ and 𝑠 ∈ {0, 1}𝑛 , |𝐺(𝑠)| = 𝐿(𝑛).

• For every polynomial 𝑝 ∶ ℕ → ℕ and 𝑛 large enough, if 𝐶 is a cir-


cuit of 𝐿(𝑛) inputs, one output, and at most 𝑝(𝑛) gates then

∣ Pr_{𝑠∼{0,1}^𝑛} [𝐶(𝐺(𝑠)) = 1] − Pr_{𝑟∼{0,1}^{𝐿(𝑛)}} [𝐶(𝑟) = 1] ∣ < 1/𝑝(𝑛) .

In this chapter we will call a cryptographic pseudorandom gener-


ator simply a pseudorandom generator or PRG for short. The optimal
PRG conjecture of Section 20.4.2 implies that there is a pseudoran-
dom generator that can “fool” circuits of exponential size and where
the gap in probabilities is at most one over an exponential quantity.
Since exponentials grow faster than every polynomial, the optimal PRG
conjecture implies the following:

The crypto PRG conjecture: For every 𝑎 ∈ ℕ, there is a cryptographic


pseudorandom generator with 𝐿(𝑛) = 𝑛𝑎 .

The crypto PRG conjecture is a weaker conjecture than the optimal


PRG conjecture, but it too (as we will see) is still stronger than the
conjecture that P ≠ NP.

Theorem 21.8 — Derandomized one-time pad. Suppose that the crypto
PRG conjecture is true. Then for every constant 𝑎 ∈ ℕ there is a
computationally secret encryption scheme (𝐸, 𝐷) with plaintext
length 𝐿(𝑛) at least 𝑛^𝑎 .

Proof Idea:
The proof is illustrated in Fig. 21.12. We simply take the one-time
pad on 𝐿 bit plaintexts, but replace the key with 𝐺(𝑘) where 𝑘 is a
string in {0, 1}𝑛 and 𝐺 ∶ {0, 1}𝑛 → {0, 1}𝐿 is a pseudorandom gen-
erator. Since the one time pad cannot be broken, an adversary that
breaks the derandomized one-time pad can be used to distinguish
between the output of the pseudorandom generator and the uniform
distribution.

Proof of Theorem 21.8. Let 𝐺 ∶ {0, 1}𝑛 → {0, 1}𝐿 for 𝐿 = 𝑛𝑎 be the
restriction to input length 𝑛 of the pseudorandom generator 𝐺 whose

existence we are guaranteed from the crypto PRG conjecture. We


now define our encryption scheme as follows: given key 𝑘 ∈ {0, 1}𝑛
and plaintext 𝑥 ∈ {0, 1}𝐿 , the encryption 𝐸𝑘 (𝑥) is simply 𝑥 ⊕ 𝐺(𝑘).
To decrypt a string 𝑦 ∈ {0, 1}𝐿 we output 𝑦 ⊕ 𝐺(𝑘). This is a valid
encryption since 𝐺 is computable in polynomial time and (𝑥 ⊕ 𝐺(𝑘)) ⊕
𝐺(𝑘) = 𝑥 ⊕ (𝐺(𝑘) ⊕ 𝐺(𝑘)) = 𝑥 for every 𝑥 ∈ {0, 1}𝐿 .
Computational secrecy follows from the condition of a pseudo-
random generator. Suppose, towards a contradiction, that there is
a polynomial 𝑝, NAND-CIRC program 𝑄 of at most 𝑝(𝐿) lines and
𝑥, 𝑥′ ∈ {0, 1}𝐿(𝑛) such that

∣ 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐸𝑘 (𝑥))] − 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐸𝑘 (𝑥′ ))] ∣ > 1/𝑝(𝐿) .

(We use here the simple fact that for a {0, 1}-valued random variable
𝑋, Pr[𝑋 = 1] = 𝔼[𝑋].)
By the definition of our encryption scheme, this means that

∣ 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐺(𝑘) ⊕ 𝑥)] − 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐺(𝑘) ⊕ 𝑥′ )] ∣ > 1/𝑝(𝐿) .   (21.4)

Now, as we saw in the security analysis of the one-time pad,
for every two strings 𝑥, 𝑥′ ∈ {0, 1}𝐿 , the distributions 𝑟 ⊕ 𝑥 and 𝑟 ⊕ 𝑥′ are
identical, where 𝑟 ∼ {0, 1}𝐿 . Hence

𝔼_{𝑟∼{0,1}^𝐿} [𝑄(𝑟 ⊕ 𝑥)] = 𝔼_{𝑟∼{0,1}^𝐿} [𝑄(𝑟 ⊕ 𝑥′ )] .   (21.5)

By plugging (21.5) into (21.4) we can derive that

∣ 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐺(𝑘) ⊕ 𝑥)] − 𝔼_{𝑟∼{0,1}^𝐿} [𝑄(𝑟 ⊕ 𝑥)] + 𝔼_{𝑟∼{0,1}^𝐿} [𝑄(𝑟 ⊕ 𝑥′ )] − 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐺(𝑘) ⊕ 𝑥′ )] ∣ > 1/𝑝(𝐿) .   (21.6)
(Please make sure that you can see why this is true.)
Now we can use the triangle inequality that |𝐴 + 𝐵| ≤ |𝐴| + |𝐵| for
every two numbers 𝐴, 𝐵, applying it for 𝐴 = 𝔼𝑘∼{0,1}𝑛 [𝑄(𝐺(𝑘) ⊕ 𝑥)] −
𝔼𝑟∼{0,1}𝐿 [𝑄(𝑟⊕𝑥)] and 𝐵 = 𝔼𝑟∼{0,1}𝐿 [𝑄(𝑟⊕𝑥′ )]−𝔼𝑘∼{0,1}𝑛 [𝑄(𝐺(𝑘)⊕𝑥′ )]
to derive

∣ 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐺(𝑘) ⊕ 𝑥)] − 𝔼_{𝑟∼{0,1}^𝐿} [𝑄(𝑟 ⊕ 𝑥)] ∣ + ∣ 𝔼_{𝑟∼{0,1}^𝐿} [𝑄(𝑟 ⊕ 𝑥′ )] − 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐺(𝑘) ⊕ 𝑥′ )] ∣ > 1/𝑝(𝐿) .   (21.7)

In particular, either the first term or the second term of the left-
hand side of (21.7) must be at least 1/(2𝑝(𝐿)). Let us assume the first case
holds (the second case is analyzed in exactly the same way). Then we
get that

∣ 𝔼_{𝑘∼{0,1}^𝑛} [𝑄(𝐺(𝑘) ⊕ 𝑥)] − 𝔼_{𝑟∼{0,1}^𝐿} [𝑄(𝑟 ⊕ 𝑥)] ∣ > 1/(2𝑝(𝐿)) .   (21.8)

But if we now define the NAND-CIRC program 𝑃𝑥 that on input
𝑟 ∈ {0, 1}𝐿 outputs 𝑄(𝑟 ⊕ 𝑥) then (since XOR of 𝐿 bits can be computed
in 𝑂(𝐿) lines), we get that 𝑃𝑥 has 𝑝(𝐿) + 𝑂(𝐿) lines and by (21.8) it
can distinguish between an input of the form 𝐺(𝑘) and an input of the
form 𝑟 ∼ {0, 1}^𝐿 with advantage better than 1/(2𝑝(𝐿)). Since a polynomial
is dominated by an exponential, if we make 𝐿 large enough, this will
contradict the (2^{𝛿𝑛} , 2^{−𝛿𝑛}) security of the pseudorandom generator 𝐺.
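To make the construction concrete, here is a sketch of the “derandomized one-time pad” in Python. It is an illustration added here, not code from the text, and not production cryptography: SHA-256 in counter mode is used purely as a stand-in for the pseudorandom generator 𝐺 (it is not a proven PRG). The 128-bit key is stretched into a pad as long as the message, which is then XORed with the plaintext.

```python
# Illustration: a "derandomized one-time pad" / stream cipher.  SHA-256 in
# counter mode serves only as a stand-in for the pseudorandom generator G of
# the crypto PRG conjecture; this sketch is not production-grade cryptography.
import hashlib

def G(key: bytes, out_len: int) -> bytes:
    """Stretch a short key into out_len pseudorandom-looking bytes."""
    out = b""
    counter = 0
    while len(out) < out_len:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:out_len]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    pad = G(key, len(plaintext))
    return bytes(p ^ g for p, g in zip(plaintext, pad))

def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return encrypt(key, ciphertext)   # XOR with the same pad inverts it

if __name__ == "__main__":
    key = b"0123456789abcdef"                 # 16-byte (128-bit) key
    message = b"a short key protecting a much longer message"
    ct = encrypt(key, message)
    assert decrypt(key, ct) == message
    print(ct.hex())
```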

R
Remark 21.9 — Stream ciphers in practice. The two
most widely used forms of (private key) encryption
schemes in practice are stream ciphers and block ciphers.
(To make things more confusing, a block cipher is
always used in some mode of operation and some
of these modes effectively turn a block cipher into
a stream cipher.) A block cipher can be thought of as
a sort of a “random invertible map” from {0, 1}𝑛 to
{0, 1}𝑛 , and can be used to construct a pseudorandom
generator and from it a stream cipher, or to encrypt
data directly using other modes of operations. There
are a great many other security notions and consider-
ations for encryption schemes beyond computational
secrecy. Many of those involve handling scenarios
such as chosen plaintext, man in the middle, and cho-
sen ciphertext attacks, where the adversary is not just
merely a passive eavesdropper but can influence the
communication in some way. While this chapter is
meant to give you some taste of the ideas behind cryp-
tography, there is much more to know before applying
it correctly to obtain secure applications, and a great
many people have managed to get it wrong.

21.7 COMPUTATIONAL SECRECY AND NP


We’ve also mentioned before that an efficient algorithm for NP could
be used to break all cryptography. We now give an example of how
this can be done:

Theorem 21.10 — Breaking encryption using NP algorithm. If P = NP
then there is no computationally secret encryption scheme with
𝐿(𝑛) > 𝑛.
Furthermore, for every valid encryption scheme (𝐸, 𝐷) with
𝐿(𝑛) > 𝑛 + 100 there is a polynomial 𝑝 such that for every large
enough 𝑛 there exist 𝑥0 , 𝑥1 ∈ {0, 1}𝐿(𝑛) and a 𝑝(𝑛)-line NAND-

CIRC program EVE s.t.

Pr_{𝑖∼{0,1},𝑘∼{0,1}^𝑛} [EVE(𝐸𝑘 (𝑥𝑖 )) = 𝑖] ≥ 0.99 .

Note that the “furthermore” part is extremely strong. It means


that if the plaintext is even a little bit larger than the key, then we can
already break the scheme in a very strong way. That is, there will be
a pair of messages 𝑥0 , 𝑥1 (think of 𝑥0 as “sell” and 𝑥1 as “buy”) and
an efficient strategy for Eve such that if Eve gets a ciphertext 𝑦 then
she will be able to tell whether 𝑦 is an encryption of 𝑥0 or 𝑥1 with
probability very close to 1. (We model breaking the scheme as Eve
outputting 0 or 1 corresponding to whether the message sent was 𝑥0
or 𝑥1 . Note that we could have just as well modified Eve to output 𝑥0
instead of 0 and 𝑥1 instead of 1. The key point is that a priori Eve only
had a 50/50 chance of guessing whether Alice sent 𝑥0 or 𝑥1 but after
seeing the ciphertext this chance increases to better than 99/100.) The
condition P = NP can be relaxed to NP ⊆ BPP and even the weaker
condition NP ⊆ P/poly with essentially the same proof.

Proof Idea:
The proof follows along the lines of Theorem 21.5 but this time
paying attention to the computational aspects. If P = NP then for
every plaintext 𝑥 and ciphertext 𝑦, we can efficiently tell whether there
exists 𝑘 ∈ {0, 1}𝑛 such that 𝐸𝑘 (𝑥) = 𝑦. So, to prove this result we need
to show that if the plaintexts are long enough, there would exist a pair
𝑥0 , 𝑥1 such that the probability that a random encryption of 𝑥1 also is
a valid encryption of 𝑥0 will be very small. The details of how to show
this are below.

Proof of Theorem 21.10. We focus on showing only the “furthermore”


part since it is the more interesting and the other part follows by es-
sentially the same proof.
Suppose that (𝐸, 𝐷) is such an encryption, let 𝑛 be large enough,
and let 𝑥0 = 0𝐿(𝑛) . For every 𝑥 ∈ {0, 1}𝐿(𝑛) we define 𝑆𝑥 to be the set
of all valid encryptions of 𝑥. That is 𝑆𝑥 = {𝑦 | ∃𝑘∈{0,1}𝑛 𝑦 = 𝐸𝑘 (𝑥)}. As
in the proof of Theorem 21.5, since there are 2𝑛 keys 𝑘, |𝑆𝑥 | ≤ 2𝑛 for
every 𝑥 ∈ {0, 1}𝐿(𝑛) .
We denote by 𝑆0 the set 𝑆𝑥0 . We define our algorithm EVE to out-
put 0 on input 𝑦 ∈ {0, 1}∗ if 𝑦 ∈ 𝑆0 and to output 1 otherwise. This
can be implemented in polynomial time if P = NP, since the key 𝑘
can serve the role of an efficiently verifiable solution. (Can you see
why?) Clearly Pr[EVE(𝐸𝑘 (𝑥0 )) = 0] = 1 and so in the case that EVE
gets an encryption of 𝑥0 then she guesses correctly with probability
1. The remainder of the proof is devoted to showing that there exists
𝑥1 ∈ {0, 1}^𝐿(𝑛) such that Pr[EVE(𝐸𝑘(𝑥1)) = 0] ≤ 0.01, which
will conclude the proof by showing that EVE guesses wrongly with
probability at most (1/2) ⋅ 0 + (1/2) ⋅ 0.01 < 0.01.
Consider now the following probabilistic experiment (which we
define solely for the sake of analysis). We consider the sample space
of choosing 𝑥 uniformly in {0, 1}^𝐿(𝑛) and define the random variable
𝑍𝑘(𝑥) to equal 1 if and only if 𝐸𝑘(𝑥) ∈ 𝑆0. For every 𝑘, the map 𝑥 ↦
𝐸𝑘(𝑥) is one-to-one, which means that the probability that 𝑍𝑘 = 1
is equal to the probability that 𝑥 ∈ 𝐸𝑘^{−1}(𝑆0), which is |𝑆0|/2^𝐿(𝑛). So by the
linearity of expectation, 𝔼[∑_{𝑘∈{0,1}^𝑛} 𝑍𝑘] ≤ 2^𝑛|𝑆0|/2^𝐿(𝑛) ≤ 2^{2𝑛}/2^𝐿(𝑛).

We will now use the following extremely simple but useful fact
known as the averaging principle (see also Lemma 18.10): for every
random variable 𝑍, if 𝔼[𝑍] = 𝜇, then with positive probability 𝑍 ≤ 𝜇.
(Indeed, if 𝑍 > 𝜇 with probability one, then the expected value of 𝑍
will have to be larger than 𝜇, just like you can’t have a class in which
all students got A or A- and yet the overall average is B+.) In our case
it means that with positive probability ∑_{𝑘∈{0,1}^𝑛} 𝑍𝑘 ≤ 2^{2𝑛}/2^𝐿(𝑛). In other
words, there exists some 𝑥1 ∈ {0, 1}^𝐿(𝑛) such that ∑_{𝑘∈{0,1}^𝑛} 𝑍𝑘(𝑥1) ≤
2^{2𝑛}/2^𝐿(𝑛). Yet this means that if we choose a random 𝑘 ∼ {0, 1}^𝑛, then
the probability that 𝐸𝑘(𝑥1) ∈ 𝑆0 is at most (1/2^𝑛) ⋅ 2^{2𝑛}/2^𝐿(𝑛) = 2^{𝑛−𝐿(𝑛)}.

So, in particular if we have an algorithm EVE that outputs 0 if 𝑦 ∈
𝑆0 and outputs 1 otherwise, then Pr[EVE(𝐸𝑘(𝑥0)) = 0] = 1 and
Pr[EVE(𝐸𝑘(𝑥1)) = 0] ≤ 2^{𝑛−𝐿(𝑛)}, which will be smaller than 2^{−10} < 0.01
if 𝐿(𝑛) ≥ 𝑛 + 10.
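To make the argument more tangible, here is a toy Python illustration (not part of the formal proof) of how the key serves as an efficiently verifiable solution. We use an artificial scheme with key length 𝑛 = 4 and plaintext length 𝐿 = 8, and let EVE brute-force over all 2^𝑛 keys in place of the NP oracle.

# Toy illustration of the attack in Theorem 21.10 (not the formal proof).
# A tiny artificial scheme with key length n and plaintext length L > n;
# "Eve" plays the role of the NP oracle by brute-forcing over all 2^n keys.
import itertools, random

n, L = 4, 8  # toy parameters; the theorem takes L(n) >= n + 10

def E(k, x):  # a toy encryption: XOR x with the key repeated (illustration only)
    return tuple(x[i] ^ k[i % n] for i in range(L))

keys = list(itertools.product([0, 1], repeat=n))
x0 = tuple([0] * L)
S0 = {E(k, x0) for k in keys}           # all valid encryptions of x0

def EVE(y):                              # outputs 0 iff y is a possible encryption of x0
    return 0 if tuple(y) in S0 else 1    # with an NP oracle this membership test is efficient

x1 = tuple(random.randint(0, 1) for _ in range(L))
k = random.choice(keys)
print(EVE(E(k, x0)), EVE(E(k, x1)))      # first is always 0; second is 1 with high probability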

In retrospect Theorem 21.10 is perhaps not surprising. After all, as


we’ve mentioned before it is known that the Optimal PRG conjecture
(which is the basis for the derandomized one-time pad encryption) is
false if P = NP (and in fact even if NP ⊆ BPP or even NP ⊆ P/poly ).

21.8 PUBLIC KEY CRYPTOGRAPHY


People have been dreaming about heavier-than-air flight since at least
the days of Leonardo Da Vinci (not to mention Icarus from Greek
mythology). Jules Verne wrote with rather insightful details about
going to the moon in 1865. But, as far as I know, in all the thousands
of years people have been using secret writing, until about 50 years
ago no one has considered the possibility of communicating securely
without first exchanging a shared secret key.
Yet in the late 1960’s and early 1970’s, several people started to
question this “common wisdom”. Perhaps the most surprising of
these visionaries was an undergraduate student at Berkeley named
Ralph Merkle. In the fall of 1974 Merkle wrote in a project proposal
for his computer security course that while “it might seem intuitively
obvious that if two people have never had the opportunity to prear-
range an encryption method, then they will be unable to communicate
securely over an insecure channel… I believe it is false”. The project
proposal was rejected by his professor as “not good enough”. Merkle
later submitted a paper to the Communications of the ACM where he
apologized for the lack of references since he was unable to find any
mention of the problem in the scientific literature, and the only source
where he saw the problem even raised was in a science fiction story.
The paper was rejected with the comment that “Experience shows that
it is extremely dangerous to transmit key information in the clear.”
Merkle showed that one can design a protocol where Alice and Bob
can use 𝑇 invocations of a hash function to exchange a key, but an
adversary (in the random oracle model, though he of course didn’t
use this name) would need roughly 𝑇 2 invocations to break it. He
conjectured that it may be possible to obtain such protocols where
breaking is exponentially harder than using them, but could not think of
any concrete way of doing so.
We only found out much later that in the late 1960’s, a few years
before Merkle, James Ellis of the British Intelligence agency GCHQ
was having similar thoughts. His curiosity was spurred by an old
World-War II manuscript from Bell Labs that suggested the following
way that two people could communicate securely over a phone line.
Alice would inject noise to the line, Bob would relay his messages,
and then Alice would subtract the noise to get the signal. The idea is
that an adversary over the line sees only the sum of Alice’s and Bob’s
signals, and doesn’t know what came from what. This got James Ellis
thinking whether it would be possible to achieve something like that
digitally. As Ellis later recollected, in 1970 he realized that in princi-
ple this should be possible, since he could think of a hypothetical
black box 𝐵 that on input a “handle” 𝛼 and plaintext 𝑥 would give a
“ciphertext” 𝑦 and that there would be a secret key 𝛽 corresponding
to 𝛼, such that feeding 𝛽 and 𝑦 to the box would recover 𝑥. However,
Ellis had no idea how to actually instantiate this box. He and others
kept giving this question as a puzzle to bright new recruits until one
of them, Clifford Cocks, came up in 1973 with a candidate solution
loosely based on the factoring problem; in 1974 another GCHQ re-
cruit, Malcolm Williamson, came up with a solution using modular
exponentiation.
But among all those thinking of public key cryptography, probably
the people who saw the furthest were two researchers at Stanford,
Whit Diffie and Martin Hellman. They realized that with the advent
of electronic communication, cryptography would find new applica-
tions beyond the military domain of spies and submarines, and they
understood that in this new world of many users and point to point
communication, cryptography will need to scale up. Diffie and Hell-


man envisioned an object which we now call “trapdoor permutation”
though they called it a “one way trapdoor function” or sometimes simply
“public key encryption”. Though they didn’t have full formal defini-
tions, their idea was that this is an injective function that is easy (e.g.,
polynomial-time) to compute but hard (e.g., exponential-time) to in-
vert. However, there is a certain trapdoor, knowledge of which would
allow polynomial time inversion. Diffie and Hellman argued that us-
ing such a trapdoor function, it would be possible for Alice and Bob
to communicate securely without ever having exchanged a secret key. But
they didn’t stop there. They realized that protecting the integrity of
communication is no less important than protecting its secrecy. Thus
they imagined that Alice could “run encryption in reverse” in order to
certify or sign messages.
At this point, Diffie and Hellman were in a position not unlike
physicists who predicted that a certain particle should exist but with-
out any experimental verification. Luckily they met Ralph Merkle,
and his ideas about a probabilistic key exchange protocol, together with
a suggestion from their Stanford colleague John Gill, inspired them
to come up with what today is known as the Diffie Hellman Key Ex-
change (which unbeknownst to them was found two years earlier at
GCHQ by Malcolm Williamson). They published their paper “New
Directions in Cryptography” in 1976, and it is considered to have
brought about the birth of modern cryptography.
The Diffie-Hellman Key Exchange is still widely used today for
secure communication. However, it still fell short of providing Diffie
and Hellman’s elusive trapdoor function. This was done the next year
by Rivest, Shamir and Adleman who came up with the RSA trapdoor
function, which through the framework of Diffie and Hellman yielded
not just encryption but also signatures. (A close variant of the RSA
function was discovered earlier by Clifford Cocks at GCHQ, though
as far as I can tell Cocks, Ellis and Williamson did not realize the
application to digital signatures.) From this point on began a flurry of
advances in cryptography which hasn’t died down till this day.

21.8.1 Defining public key encryption

Figure 21.13: Top left: Ralph Merkle, Martin Hellman and Whit Diffie, who together came up in 1976 with the concept of public key encryption and a key exchange protocol. Bottom left: Adi Shamir, Ron Rivest, and Leonard Adleman who, following Diffie and Hellman’s paper, discovered the RSA function that can be used for public key encryption and digital signatures. Interestingly, one can see the equation P = NP on the blackboard behind them. Right: John Gill, who was the first person to suggest to Diffie and Hellman that they use modular exponentiation as an easy-to-compute but hard-to-invert function.

A public key encryption consists of a triple of algorithms:

• The key generation algorithm, which we denote by 𝐾𝑒𝑦𝐺𝑒𝑛 or KG for
short, is a randomized algorithm that outputs a pair of strings (𝑒, 𝑑)
where 𝑒 is known as the public (or encryption) key, and 𝑑 is known
as the private (or decryption) key. The key generation algorithm gets
as input 1^𝑛 (i.e., a string of ones of length 𝑛). We refer to 𝑛 as the
security parameter of the scheme. The bigger we make 𝑛, the more
secure the encryption will be, but also the less efficient it will be.
• The encryption algorithm, which we denote by 𝐸, takes the encryp-


tion key 𝑒 and a plaintext 𝑥, and outputs the ciphertext 𝑦 = 𝐸𝑒 (𝑥).

• The decryption algorithm, which we denote by 𝐷, takes the decryp-


tion key 𝑑 and a ciphertext 𝑦, and outputs the plaintext 𝑥 = 𝐷𝑑 (𝑦).
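The following Python stub summarizes the syntax of these three algorithms (the names and signatures are only illustrative; a concrete construction is sketched in Section 21.8.2):

# An illustrative sketch of the public key encryption interface just described.
from typing import Tuple

def key_gen(n: int) -> Tuple[bytes, bytes]:
    """Randomized: returns (public encryption key e, private decryption key d)."""
    ...

def encrypt(e: bytes, plaintext: bytes) -> bytes:
    """Uses only the public key e."""
    ...

def decrypt(d: bytes, ciphertext: bytes) -> bytes:
    """Uses the private key d; correctness requires decrypt(d, encrypt(e, x)) == x."""
    ...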

We now make this a formal definition:

Definition 21.11 — Public Key Encryption. A computationally secret public
key encryption with plaintext length 𝐿 ∶ ℕ → ℕ is a triple of ran-
domized polynomial-time algorithms (KG, 𝐸, 𝐷) that satisfy the
following conditions:

• For every 𝑛, if (𝑒, 𝑑) is output by KG(1^𝑛) with positive proba-
bility, and 𝑥 ∈ {0, 1}^𝐿(𝑛), then 𝐷𝑑(𝐸𝑒(𝑥)) = 𝑥 with probability
one.

• For every polynomial 𝑝, and sufficiently large 𝑛, if 𝑃 is a NAND-
CIRC program of at most 𝑝(𝑛) lines then for every 𝑥, 𝑥′ ∈
{0, 1}^𝐿(𝑛), |𝔼[𝑃(𝑒, 𝐸𝑒(𝑥))] − 𝔼[𝑃(𝑒, 𝐸𝑒(𝑥′))]| < 1/𝑝(𝑛), where
this probability is taken over the coins of KG and 𝐸.

Figure 21.14: In a public key encryption, Alice generates a private/public keypair (𝑒, 𝑑), publishes 𝑒 and keeps 𝑑 secret. To encrypt a message for Alice, one only needs to know 𝑒. To decrypt it we need to know 𝑑.

Definition 21.11 allows 𝐸 and 𝐷 to be randomized algorithms. In


fact, it turns out that it is necessary for 𝐸 to be randomized to obtain
computational secrecy. It also turns out that, unlike the private key
case, we can transform a public-key encryption that works for mes-
sages that are only one bit long into a public-key encryption scheme
that can encrypt arbitrarily long messages, and in particular messages
that are longer than the key. In particular this means that we cannot ob-
tain a perfectly secret public-key encryption scheme even for one-bit
long messages (since it would imply a perfectly secret public-key, and
hence in particular private-key, encryption with messages longer than
the key).
We will not give full constructions for public key encryption
schemes in this chapter, but will mention some of the ideas that
underlie the most widely used schemes today. These generally belong
to one of two families:

• Group theoretic constructions based on problems such as integer factor-


ing and the discrete logarithm over finite fields or elliptic curves.

• Lattice/coding based constructions based on problems such as the


closest vector in a lattice or bounded distance decoding.

Group-theory based encryptions such as the RSA cryptosystem, the


Diffie-Hellman protocol, and Elliptic-Curve Cryptography, are cur-
rently more widely implemented. But the lattice/coding schemes are
recently on the rise, particularly because the known group theoretic


encryption schemes can be broken by quantum computers, which we
discuss in Chapter 23.

21.8.2 Diffie-Hellman key exchange


As just one example of how public key encryption schemes are con-
structed, let us now describe the Diffie-Hellman key exchange. We
describe the Diffie-Hellman protocol at a somewhat informal level,
without presenting a full security analysis.
The computational problem underlying the Diffie Hellman protocol
is the discrete logarithm problem. Let’s suppose that 𝑔 is some integer.
We can compute the map 𝑥 ↦ 𝑔𝑥 and also its inverse 𝑦 ↦ log𝑔 𝑦. (For
example, we can compute a logarithm by binary search: start with
some interval [𝑥𝑚𝑖𝑛 , 𝑥𝑚𝑎𝑥 ] that is guaranteed to contain log𝑔 𝑦. We can
then test whether the interval’s midpoint 𝑥𝑚𝑖𝑑 satisfies 𝑔𝑥𝑚𝑖𝑑 > 𝑦, and
based on that halve the size of the interval.)
However, suppose now that we use modular arithmetic and work
modulo some prime number 𝑝. If 𝑝 has 𝑛 binary digits and 𝑔 is in [𝑝]
then we can compute the map 𝑥 ↦ 𝑔𝑥 mod 𝑝 in time polynomial in 𝑛.
(This is not trivial, and is a great exercise for you to work this out; as a
hint, start by showing that one can compute the map 𝑘 ↦ 𝑔^(2^𝑘) mod 𝑝
using 𝑘 modular multiplications modulo 𝑝; if you’re stumped, you


can look up this Wikipedia entry.) On the other hand, because of the
“wraparound” property of modular arithmetic, we cannot run binary
search to find the inverse of this map (known as the discrete logarithm).
In fact, there is no known polynomial-time algorithm for computing
this discrete logarithm map (𝑔, 𝑥, 𝑝) ↦ log𝑔 𝑥 mod 𝑝, where we define
log𝑔 𝑥 mod 𝑝 as the number 𝑎 ∈ [𝑝] such that 𝑔𝑎 = 𝑥 mod 𝑝.
The Diffie-Hellman protocol for Bob to send a message to Alice is as
follows:

• Alice: Chooses 𝑝 to be a random 𝑛 bit long prime (which can be


done by choosing random numbers and running a primality testing
algorithm on them), and 𝑔 and 𝑎 at random in [𝑝]. She sends to Bob
the triple (𝑝, 𝑔, 𝑔𝑎 mod 𝑝).

• Bob: Given the triple (𝑝, 𝑔, ℎ), Bob sends a message 𝑥 ∈ {0, 1}𝐿
to Alice by choosing 𝑏 at random in [𝑝], and sending to Alice the
pair (𝑔𝑏 mod 𝑝, 𝑟𝑒𝑝(ℎ𝑏 mod 𝑝) ⊕ 𝑥) where 𝑟𝑒𝑝 ∶ [𝑝] → {0, 1}∗
is some “representation function” that maps [𝑝] to {0, 1}𝐿 . (The
function 𝑟𝑒𝑝 does not need to be one-to-one and you can think of
𝑟𝑒𝑝(𝑧) as simply outputting 𝐿 of the bits of 𝑧 in the natural binary
representation, it does need to satisfy certain technical conditions
which we omit in this description.)
• Alice: Given 𝑔′ , 𝑧, Alice recovers 𝑥 by outputting 𝑟𝑒𝑝(𝑔′𝑎 mod 𝑝) ⊕


𝑧.

The correctness of the protocol follows from the simple fact that
(𝑔𝑎 )𝑏 = (𝑔𝑏 )𝑎 for every 𝑔, 𝑎, 𝑏 and this still holds if we work modulo
𝑝. Its security relies on the computational assumption that computing
this map is hard, even in a certain “average case” sense (this computa-
tional assumption is known as the Decisional Diffie Hellman assump-
tion). The Diffie-Hellman key exchange protocol can be thought of as
a public key encryption where Alice’s first message is the public key,
and Bob’s message is the encryption.
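Here is a toy Python sketch of the protocol above. The prime is hard-coded and the representation function simply takes the 𝐿 lowest-order bits; a real implementation would sample a random 𝑛-bit prime and use a representation function satisfying the technical conditions we omitted.

# A toy sketch of the Diffie-Hellman based encryption described above.
import random

L = 16
p = 2**61 - 1                      # a fixed toy prime; Alice would sample a random n-bit prime
g = 3

def rep(z):                        # toy "representation function": the L lowest-order bits of z
    return z & ((1 << L) - 1)

# Alice publishes (p, g, g^a mod p) and keeps a secret
a = random.randrange(2, p)
h = pow(g, a, p)

# Bob encrypts x in {0,1}^L using the public triple (p, g, h)
x = random.getrandbits(L)
b = random.randrange(2, p)
ciphertext = (pow(g, b, p), rep(pow(h, b, p)) ^ x)

# Alice decrypts using her secret a, since (g^b)^a = (g^a)^b = h^b mod p
g_b, z = ciphertext
recovered = rep(pow(g_b, a, p)) ^ z
assert recovered == x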
One can think of the Diffie-Hellman protocol as being based on
a “trapdoor pseudorandom generator” where the triple 𝑔𝑎 , 𝑔𝑏 , 𝑔𝑎𝑏
looks “random” to someone that doesn’t know 𝑎, but someone that
does know 𝑎 can see that raising the second element to the 𝑎-th power
yields the third element. The Diffie-Hellman protocol can be described
abstractly in the context of any finite Abelian group for which we can
efficiently compute the group operation. It has been implemented
on other groups than numbers modulo 𝑝, and in particular Elliptic
Curve Cryptography (ECC) is obtained by basing the Diffie Hell-
man on elliptic curve groups which gives some practical advantages.
Another common group theoretic basis for key-exchange/public key
encryption protocol is the RSA function. A big disadvantage of Diffie-
Hellman (both the modular arithmetic and elliptic curve variants)
and RSA is that both schemes can be broken in polynomial time by a
quantum computer. We will discuss quantum computing later in this
course.

21.9 OTHER SECURITY NOTIONS


There is a great deal to cryptography beyond just encryption schemes,
and beyond the notion of a passive adversary. A central objective
is integrity or authentication: protecting communications from being
modified by an adversary. Integrity is often more fundamental than
secrecy: whether it is a software update or viewing the news, you
might often not care about the communication being secret as much as
that it indeed came from its claimed source. Digital signature schemes
are the analog of public key encryption for authentication, and are
widely used (in particular as the basis for public key certificates) to
provide a foundation of trust in the digital world.
Similarly, even for encryption, we often need to ensure security
against active attacks, and so notions such as non-malleability and
adaptive chosen ciphertext security have been proposed. An encryp-
tion scheme is only as secure as the secret key, and mechanisms to
make sure the key is generated properly, and is protected against re-
fresh or even compromise (i.e., forward secrecy) have been studied as


well. Hopefully this chapter provides you with some appreciation for
cryptography as an intellectual field, but does not imbue you with a
false self confidence in implementing it.
Cryptographic hash functions are another widely used tool with a
variety of uses, including extracting randomness from high entropy
sources, achieving hard-to-forge short “digests” of files, protecting
passwords, and much more.

21.10 MAGIC
Beyond encryption and signature schemes, cryptographers have man-
aged to obtain objects that truly seem paradoxical and “magical”. We
briefly discuss some of these objects. We do not give any details, but
hopefully this will spark your curiosity to find out more.

21.10.1 Zero knowledge proofs


On October 31, 1903, the mathematician Frank Nelson Cole
gave an hourlong lecture to a meeting of the American Mathe-
matical Society where he did not speak a single word. Rather,
he calculated on the board the value 2^67 − 1, which is equal to
147,573,952,589,676,412,927, and then showed that this number is
equal to 193,707,721 × 761,838,257,287. Cole’s proof showed that
2^67 − 1 is not a prime, but it also revealed additional information,
namely its actual factors. This is often the case with proofs: they teach
us more than just the validity of the statements.
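Today one can redo Cole’s calculation in a few lines of Python:

# Checking Cole's 1903 computation.
print(2**67 - 1)                                   # 147573952589676412927
print(193_707_721 * 761_838_257_287)               # the same number
print(2**67 - 1 == 193_707_721 * 761_838_257_287)  # True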
In Zero Knowledge Proofs we try to achieve the opposite effect. We
want a proof for a statement 𝑋 where we can rigorously show that the
proofs reveals absolutely no additional information about 𝑋 beyond the
fact that it is true. This turns out to be an extremely useful object for
a variety of tasks including authentication, secure protocols, voting,
anonymity in cryptocurrencies, and more. Constructing these ob-
jects relies on the theory of NP completeness. Thus this theory that
originally was designed to give a negative result (show that some prob-
lems are hard) ended up yielding positive applications, enabling us to
achieve tasks that were not possible otherwise.

21.10.2 Fully homomorphic encryption


Suppose that we are given a bit-by-bit encryption of a string
𝐸𝑘 (𝑥0 ), … , 𝐸𝑘 (𝑥𝑛−1 ). By design, these ciphertexts are supposed to
be “completely unscrutable” and we should not be able to extract
any information about 𝑥𝑖 ’s from it. However, already in 1978, Rivest,
Adleman and Dertouzos observed that this does not imply that we
could not manipulate these encryptions. For example, it turns out the
security of an encryption scheme does not immediately rule out the
ability to take a pair of encryptions 𝐸𝑘 (𝑎) and 𝐸𝑘 (𝑏) and compute


from them 𝐸𝑘 (𝑎NAND𝑏) without knowing the secret key 𝑘. But do there
exist encryption schemes that allow such manipulations? And if so, is
this a bug or a feature?
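As a toy illustration of manipulating ciphertexts without the key, note that with the one-time pad 𝐸𝑘(𝑥) = 𝑥 ⊕ 𝑘, anyone can turn an encryption of 𝑎 into an encryption of 𝑎 ⊕ 𝑏 for any known 𝑏, without ever learning 𝑘 (fully homomorphic encryption pushes this much further, to NAND and hence arbitrary computation on encrypted data):

# With the one-time pad E_k(x) = x XOR k, a party that never sees k can still
# transform an encryption of a into a valid encryption of a XOR b.
import secrets

n = 16
k = secrets.randbits(n)          # secret key, never seen by the manipulating party
a = 0b1010101010101010
b = 0b0000111100001111

c = a ^ k                        # E_k(a)
c_modified = c ^ b               # computed from the ciphertext alone
assert c_modified == (a ^ b) ^ k # it is a valid encryption E_k(a XOR b)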
Rivest et al already showed that such encryption schemes could
be immensely useful, and their utility has only grown in the age of
cloud computing. After all, if we can compute NAND then we can
use this to run any algorithm 𝑃 on the encrypted data, and map
𝐸𝑘 (𝑥0 ), … , 𝐸𝑘 (𝑥𝑛−1 ) to 𝐸𝑘 (𝑃 (𝑥0 , … , 𝑥𝑛−1 )). For example, a client
could store their secret data 𝑥 in encrypted form on the cloud, and
have the cloud provider perform all sorts of computation on these
data without ever revealing to the provider the private key, and so
without the provider ever learning any information about the secret
data.
The question of the existence of such a scheme took a much longer time
to resolve. Only in 2009 did Craig Gentry give the first construction of
an encryption scheme that allows computing a universal basis of
gates on the data (known as a Fully Homomorphic Encryption scheme in
crypto parlance). Gentry’s scheme left much to be desired in terms of
efficiency, and improving upon it has been the focus of an intensive
research program that has already seen significant improvements.

21.10.3 Multiparty secure computation


Cryptography is about enabling mutually distrusting parties to
achieve a common goal. Perhaps the most general primitive achiev-
ing this objective is secure multiparty computation. The idea in secure
multiparty computation is that 𝑛 parties interact together to compute
some function 𝐹 (𝑥0 , … , 𝑥𝑛−1 ) where 𝑥𝑖 is the private input of the 𝑖-th
party. The crucial point is that there is no commonly trusted party or
authority and that nothing is revealed about the secret data beyond the
function’s output. One example is an electronic voting protocol where
only the total vote count is revealed, with the privacy of the individual
voters protected, but without having to trust any authority to either
count the votes correctly or to keep information confidential. Another
example is implementing a second price (aka Vickrey) auction where
𝑛 − 1 parties submit bids to an item owned by the 𝑛-th party, and the
item goes to the highest bidder but at the price of the second highest bid.
Using secure multiparty computation we can implement second price
auction in a way that will ensure the secrecy of the numerical values
of all bids (including even the top one) except the second highest one,
and the secrecy of the identity of all bidders (including even the sec-
ond highest bidder) except the top one. We emphasize that such a
protocol requires no trust even in the auctioneer itself, who will also
not learn any additional information. Secure multiparty computation
can be used even for computing randomized processes, with one exam-
ple being playing Poker over the net without having to trust any server
for correct shuffling of cards or not revealing the information.

✓ Chapter Recap

• We can formally define the notion of security of an


encryption scheme.
• Perfect secrecy ensures that an adversary does not
learn anything about the plaintext from the cipher-
text, regardless of their computational powers.
• The one-time pad is a perfectly secret encryption
with the length of the key equaling the length of
the message. No perfectly secret encryption can
have key shorter than the message.
• Computational secrecy can be as good as perfect
secrecy since it ensures that the advantage that
computationally bounded adversaries gain from
observing the ciphertext is exponentially small. If
the optimal PRG conjecture is true then there exists
a computationally secret encryption scheme with
messages that can be (almost) exponentially bigger
than the key.
• There are many cryptographic tools that go well be-
yond private key encryption. These include public
key encryption, digital signatures and hash functions,
as well as more “magical” tools such as multiparty
secure computation, fully homomorphic encryption, zero
knowledge proofs, and many others.

21.11 EXERCISES

21.12 BIBLIOGRAPHICAL NOTES


Much of this text is taken from my lecture notes on cryptography.
Shannon’s manuscript was written in 1945 but was classified, and a
partial version was only published in 1949. Still it has revolutionized
cryptography, and is the forerunner to much of what followed.
The Venona project’s history is described in this document. Aside
from Grabeel and Zubko, credit to the discovery that the Soviets were
reusing keys is shared by Lt. Richard Hallock, Carrie Berry, Frank
Lewis, and Lt. Karl Elmquist, and there are others that have made
important contributions to this project. See pages 27 and 28 in the
document.
In a 1955 letter to the NSA that only recently came forward, John
Nash proposed an “unbreakable” encryption scheme. He wrote “I
hope my handwriting, etc. do not give the impression I am just a crank or
circle-squarer…. The significance of this conjecture [that certain encryption
schemes are exponentially secure against key recovery attacks] .. is that it is


quite feasible to design ciphers that are effectively unbreakable.”. John Nash
made seminal contributions in mathematics and game theory, and was
awarded both the Abel Prize in mathematics and the Nobel Memorial
Prize in Economic Sciences. However, he has struggled with mental
illness throughout his life. His biography, A Beautiful Mind was made
into a popular movie. It is natural to compare Nash’s 1955 letter to the
NSA to Gödel’s letter to von Neumann we mentioned before. From
the theoretical computer science point of view, the crucial difference
is that while Nash informally talks about exponential vs polynomial
computation time, he does not mention the word “Turing machine” or
other models of computation, and it is not clear if he is aware or not
that his conjecture can be made mathematically precise (assuming a
formalization of “sufficiently complex types of enciphering”).
The definition of computational secrecy we use is the notion of
computational indistinguishability (known to be equivalent to semantic
security) that was given by Goldwasser and Micali in 1982.
Although they used a different terminology, Diffie and Hellman
already made clear in their paper that their protocol can be used as
a public key encryption, with the first message being put in a “pub-
lic file”. In 1985, ElGamal showed how to obtain a signature scheme
based on the Diffie Hellman ideas, and since he described the Diffie-
Hellman encryption scheme in the same paper, the public key encryp-
tion scheme originally proposed by Diffie and Hellman is sometimes
also known as ElGamal encryption.
My survey contains a discussion on the different types of public key
assumptions. While the standard elliptic curve cryptographic schemes
are as susceptible to quantum computers as Diffie-Hellman and RSA,
their main advantage is that the best known classical algorithms for
computing discrete logarithms over elliptic curve groups take time 2^{𝜖𝑛}
for some 𝜖 > 0 where 𝑛 is the number of bits to describe a group ele-
ment. In contrast, for the multiplicative group modulo a prime 𝑝 the
best algorithms take time 2^{𝑂(𝑛^{1/3} 𝑝𝑜𝑙𝑦𝑙𝑜𝑔(𝑛))}, which means that (assum-
ing the known algorithms are optimal) we need to set the prime to be
bigger (and so have larger key sizes with corresponding overhead in
communication and computation) to get the same level of security.
Zero-knowledge proofs were constructed by Goldwasser, Micali,
and Rackoff in 1982, and their wide applicability was shown (using
the theory of NP completeness) by Goldreich, Micali, and Wigderson
in 1986.
Two party and multiparty secure computation protocols were con-
structed (respectively) by Yao in 1982 and Goldreich, Micali, and
Wigderson in 1987. The latter work gave a general transformation
from security against passive adversaries to security against active


adversaries using zero knowledge proofs.
22
Proofs and algorithms

“Let’s not try to define knowledge, but try to define zero-knowledge.”, Shafi
Goldwasser.

Proofs have captured human imagination for thousands of years,


ever since the publication of Euclid’s Elements, a book second only to
the Bible in the number of editions printed.
Plan:

• Proofs and algorithms

• Interactive proofs

• Zero knowledge proofs

• Propositions as types, Coq and other proof assistants.

22.1 EXERCISES

22.2 BIBLIOGRAPHICAL NOTES



Learning Objectives:
• See main aspects in which quantum mechanics differs from local deterministic theories.
• Model of quantum circuits, or equivalently QNAND-CIRC programs
• The complexity class BQP and what we know about its relation to other classes
• Ideas behind Shor’s Algorithm and the Quantum Fourier Transform

23
Quantum computing

“We always have had (secret, secret, close the doors!) … a great deal of diffi-
culty in understanding the world view that quantum mechanics represents …
It has not yet become obvious to me that there’s no real problem. … Can I learn
anything from asking this question about computers–about this may or may
not be mystery as to what the world view of quantum mechanics is?” , Richard
Feynman, 1981

“The only difference between a probabilistic classical world and the equations
of the quantum world is that somehow or other it appears as if the probabilities
would have to go negative”, Richard Feynman, 1981

There were two schools of natural philosophy in ancient Greece.


Aristotle believed that objects have an essence that explains their behav-
ior, and a theory of the natural world has to refer to the reasons (or “fi-
nal cause” to use Aristotle’s language) as to why they exhibit certain
phenomena. Democritus believed in a purely mechanistic explanation
of the world. In his view, the universe was ultimately composed of
elementary particles (or Atoms) and our observed phenomena arise
from the interactions between these particles according to some local
rules. Modern science (arguably starting with Newton) has embraced
Democritus’ point of view, of a mechanistic or “clockwork” universe
of particles and forces acting upon them.
While the classification of particles and forces evolved with time,
to a large extent the “big picture” has not changed from Newton till
Einstein. In particular it was held as an axiom that if we knew fully
the current state of the universe (i.e., the particles and their properties
such as location and velocity) then we could predict its future state at
any point in time. In computational language, in all these theories the
state of a system with 𝑛 particles could be stored in an array of 𝑂(𝑛)
numbers, and predicting the evolution of the system can be done by
running some efficient (e.g., 𝑝𝑜𝑙𝑦(𝑛) time) deterministic computation
on this array.

23.1 THE DOUBLE SLIT EXPERIMENT


Alas, in the beginning of the 20th century, several experimental re-
sults were calling into question this “clockwork” or “billiard ball”
theory of the world. One such experiment is the famous double slit ex-
periment. Here is one way to describe it. Suppose that we buy one of
those baseball pitching machines, and aim it at a soft plastic wall, but
put a metal barrier with a single slit between the machine and the plastic
wall (see Fig. 23.1). If we shoot baseballs at the plastic wall, then some
of the baseballs would bounce off the metal barrier, while some would
make it through the slit and dent the wall. If we now carve out an ad-
ditional slit in the metal barrier then more balls would get through,
and so the plastic wall would be even more dented.
So far this is pure common sense, and it is indeed (to my knowledge)
an accurate description of what happens when we shoot baseballs
at a plastic wall. However, this is not the same when we shoot
photons. Amazingly, if we shoot with a “photon gun” (i.e., a laser) at
a wall equipped with photon detectors through some barrier, then
(as shown in Fig. 23.2) in some positions of the wall we will see fewer
hits when the two slits are open than when only one of them is.¹ In
particular there are positions in the wall that are hit when the first slit
is open, hit when the second slit is open, but are not hit at all when both
slits are open!

¹ A nice illustrated description of the double slit experiment appears in this video.

Figure 23.1: In the “double baseball experiment” we shoot baseballs from a gun at a soft wall through a hard barrier that has one or two slits open in it. There is only “constructive interference” in the sense that the dent in each position in the wall when both slits are open is the sum of the dents when each slit is open on its own.

It seems as if each photon coming out of the gun is aware of the
global setup of the experiment, and behaves differently if two slits are
open than if only one is. If we try to “catch the photon in the act” and
place a detector right next to each slit so we can see exactly the path
each photon takes then something even more bizarre happens. The
mere fact that we measure the path changes the photon’s behavior, and
now this “destructive interference” pattern is gone and the number
of times a position is hit when two slits are open is the sum of the
number of times it is hit when each slit is open.

P
You should read the paragraphs above more than
once and make sure you appreciate how truly mind
boggling these results are.

Figure 23.2: The setup of the double slit experiment in the case of photon or electron guns. We see also destructive interference in the sense that there are some positions on the wall that get fewer hits when both slits are open than they get when only one of the slits is open. Image credit: Wikipedia.
23.2 QUANTUM AMPLITUDES
The double slit and other experiments ultimately forced scientists to
accept a very counterintuitive picture of the world. It is not merely
about nature being randomized, but rather it is about the probabilities
in some sense “going negative” and cancelling each other!
To see what we mean by this, let us go back to the baseball exper-


iment. Suppose that the probability a ball passes through the left slit
is 𝑝𝐿 and the probability that it passes through the right slit is 𝑝𝑅 .
Then, if we shoot 𝑁 balls out of each gun, we expect the wall will be
hit (𝑝𝐿 + 𝑝𝑅 )𝑁 times. In contrast, in the quantum world of photons
instead of baseballs, it can sometimes be the case that in both the first
and second case the wall is hit with positive probabilities 𝑝𝐿 and 𝑝𝑅
respectively but somehow when both slits are open the wall (or a par-
ticular position in it) is not hit at all. It’s almost as if the probabilities
can “cancel each other out”.
To understand the way we model this in quantum mechanics, it is
helpful to think of a “lazy evaluation” approach to probability. We
can think of a probabilistic experiment such as shooting a baseball
through two slits in two different ways:
• When a ball is shot, “nature” tosses a coin and decides if it will go
through the left slit (which happens with probability 𝑝𝐿 ), right slit
(which happens with probability 𝑝𝑅 ), or bounce back. If it passes
through one of the slits then it will hit the wall. Later we can look at
the wall and find out whether or not this event happened, but the
fact that the event happened or not is determined independently of
whether or not we look at the wall.

• The other viewpoint is that when a ball is shot, “nature” computes


the probabilities 𝑝𝐿 and 𝑝𝑅 as before, but does not yet “toss the
coin” and determines what happened. Only when we actually
look at the wall, nature tosses a coin and with probability 𝑝𝐿 + 𝑝𝑅
ensures we see a dent. That is, nature uses “lazy evaluation”, and
only determines the result of a probabilistic experiment when we
decide to measure it.
While the first scenario seems much more natural, the end result
in both is the same (the wall is hit with probability 𝑝𝐿 + 𝑝𝑅 ) and so
the question of whether we should model nature as following the first
scenario or second one seems like asking about the proverbial tree that
falls in the forest with no one hearing it.
However, when we want to describe the double slit experiment
with photons rather than baseballs, it is the second scenario that lends
itself better to a quantum generalization. Quantum mechanics as-
sociates a number 𝛼 known as an amplitude with each probabilistic
experiment. This number 𝛼 can be negative, and in fact even complex.
We never observe the amplitudes directly, since whenever we mea-
sure an event with amplitude 𝛼, nature tosses a coin and determines
that the event happens with probability |𝛼|2 . However, the sign (or
in the complex case, phase) of the amplitudes can affect whether two
different events have constructive or destructive interference.
Specifically, consider an event that can either occur or not (e.g. “de-
tector number 17 was hit by a photon”). In classical probability, we
model this by a probability distribution over the two outcomes: a pair
of non-negative numbers 𝑝 and 𝑞 such that 𝑝 + 𝑞 = 1, where 𝑝 corre-
sponds to the probability that the event occurs and 𝑞 corresponds to
the probability that the event does not occur. In quantum mechanics,
we model this also by pair of numbers, which we call amplitudes. This
is a pair of (potentially negative or even complex) numbers 𝛼 and 𝛽
such that |𝛼|2 + |𝛽|2 = 1. The probability that the event occurs is |𝛼|2
and the probability that it does not occur is |𝛽|2 . In isolation, these
negative or complex numbers don’t matter much, since we square
them anyway to obtain probabilities. But the interaction of positive
and negative amplitudes can result in surprising cancellations where
somehow combining two scenarios where an event happens with
positive probability results in a scenario where it never does.
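Here is a tiny numerical illustration (with made-up amplitudes, purely for intuition) of how two events that each occur with probability 1/2 can combine to an event that never occurs:

# Two paths with assumed amplitudes 1/sqrt(2) and -1/sqrt(2): each alone gives
# probability 1/2, yet combined the amplitudes cancel completely.
alpha, beta = 1 / 2**0.5, -1 / 2**0.5
print(alpha**2, beta**2)     # each alone: the event occurs with probability 0.5
print((alpha + beta)**2)     # combined: probability 0, total destructive interference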

P
If you don’t find the above description confusing and
unintuitive, you probably didn’t get it. Please make
sure to re-read the above paragraphs until you are
thoroughly confused.

Quantum mechanics is a mathematical theory that allows us to


calculate and predict the results of the double-slit and many other ex-
periments. If you think of quantum mechanics as an explanation as to
what “really” goes on in the world, it can be rather confusing. How-
ever, if you simply “shut up and calculate” then it works amazingly
well at predicting experimental results. In particular, in the double
slit experiment, for any position in the wall, we can compute num-
bers 𝛼 and 𝛽 such that photons from the first and second slit hit that
position with probabilities |𝛼|2 and |𝛽|2 respectively. When we open
both slits, the probability that the position will be hit is proportional
to |𝛼 + 𝛽|2 , and so in particular, if 𝛼 = −𝛽 then it will be the case that,
despite being hit when either slit one or slit two are open, the position
is not hit at all when they both are. If you are confused by quantum
mechanics, you are not alone: for decades people have been trying to
come up with explanations for “the underlying reality” behind quan-
tum mechanics, including Bohmian Mechanics, Many Worlds and
others. However, none of these interpretations have gained universal
acceptance and all of those (by design) yield the same experimental
predictions. Thus at this point many scientists prefer to just ignore the
question of what is the “true reality” and go back to simply “shutting
up and calculating”.
R
Remark 23.1 — Complex vs real, other simplifications. If
(like the author) you are a bit intimidated by complex
numbers, don’t worry: you can think of all ampli-
tudes as real (though potentially negative) numbers
without loss of understanding. All the “magic” of
quantum computing already arises in this case, and
so we will often restrict attention to real amplitudes in
this chapter.
We will also only discuss so-called pure quantum
states, and not the more general notion of mixed states.
Pure states turn out to be sufficient for understanding
the algorithmic aspects of quantum computing.
More generally, this chapter is not meant to be a com-
plete description of quantum mechanics, quantum
information theory, or quantum computing, but rather
illustrate the main points where these differ from
classical computing.

23.3 BELL’S INEQUALITY


There is something weird about quantum mechanics. In 1935 Einstein,
Podolsky and Rosen (EPR) tried to pinpoint this issue by highlighting
a previously unrealized corollary of this theory. They showed that
the idea that nature does not determine the results of an experiment
until it is measured results in so called “spooky action at a distance”.
Namely, making a measurement of one object may instantaneously
effect the state (i.e., the vector of amplitudes) of another object in the
other end of the universe.
Since the vector of amplitudes is just a mathematical abstraction,
the EPR paper was considered to be merely a thought experiment for
philosophers to be concerned about, without bearing on experiments.
This changed when in 1965 John Bell showed an actual experiment
to test the predictions of EPR and hence pit intuitive common sense
against the quantum mechanics. Quantum mechanics won: it turns
out that it is in fact possible to use measurements to create correlations
between the states of objects far removed from one another that cannot
be explained by any prior theory. Nonetheless, since the results of
these experiments are so obviously wrong to anyone that has ever sat
in an armchair, that there are still a number of Bell denialists arguing
that this can’t be true and quantum mechanics is wrong.

² If you are extremely paranoid about Alice and Bob communicating with one another, you can coordinate with your assistant to perform the experiment exactly at the same time, and make sure that the rooms are sufficiently far apart (e.g., are on two different continents, or maybe even one is on the moon and another is on earth) so that Alice and Bob couldn’t communicate to each other in time even if they do so at the speed of light.

So, what is this Bell’s Inequality? Suppose that Alice and Bob try
to convince you they have telepathic ability, and they aim to prove it
via the following experiment. Alice and Bob will be in separate closed
rooms.² You will interrogate Alice and your associate will interrogate
Bob. You choose a random bit 𝑥 ∈ {0, 1} and your associate chooses
a random 𝑦 ∈ {0, 1}. We let 𝑎 be Alice’s response and 𝑏 be Bob’s


response. We say that Alice and Bob win this experiment if 𝑎 ⊕ 𝑏 =
𝑥 ∧ 𝑦. In other words, Alice and Bob need to output two bits that
disagree if 𝑥 = 𝑦 = 1 and agree otherwise.³

³ This form of Bell’s game was shown by Clauser, Horne, Shimony, and Holt.
Now if Alice and Bob are not telepathic, then they need to agree in
advance on some strategy. It’s not hard for Alice and Bob to succeed
with probability 3/4: just always output the same bit. Moreover, by
doing some case analysis, we can show that no matter what strategy
they use, Alice and Bob cannot succeed with higher probability than
that:⁴

⁴ Theorem 23.2 below assumes that Alice and Bob use deterministic strategies 𝑓 and 𝑔 respectively. More generally, Alice and Bob could use a randomized strategy, or equivalently, each could choose 𝑓 and 𝑔 from some distributions ℱ and 𝒢 respectively. However the averaging principle (Lemma 18.10) implies that if all possible deterministic strategies succeed with probability at most 3/4, then the same is true for all randomized strategies.

Theorem 23.2 — Bell’s Inequality. For every two functions 𝑓, 𝑔 ∶ {0, 1} →
{0, 1},

Pr_{𝑥,𝑦∈{0,1}} [𝑓(𝑥) ⊕ 𝑔(𝑦) = 𝑥 ∧ 𝑦] ≤ 3/4 .

Proof. Since the probability is taken over all four choices of 𝑥, 𝑦 ∈
{0, 1}, the only way the theorem can be violated is if there exist two
functions 𝑓, 𝑔 that satisfy

𝑓(𝑥) ⊕ 𝑔(𝑦) = 𝑥 ∧ 𝑦
for all the four choices of 𝑥, 𝑦 ∈ {0, 1}². Let’s plug in all these four
choices and see what we get (below we use the equalities 𝑧 ⊕ 0 = 𝑧,
𝑧 ∧ 0 = 0 and 𝑧 ∧ 1 = 𝑧):

𝑓(0) ⊕ 𝑔(0) = 0 (plugging in 𝑥 = 0, 𝑦 = 0)


𝑓(0) ⊕ 𝑔(1) = 0 (plugging in 𝑥 = 0, 𝑦 = 1)
𝑓(1) ⊕ 𝑔(0) = 0 (plugging in 𝑥 = 1, 𝑦 = 0)
𝑓(1) ⊕ 𝑔(1) = 1 (plugging in 𝑥 = 1, 𝑦 = 1)
If we XOR together the first and second equalities we get 𝑔(0) ⊕
𝑔(1) = 0 while if we XOR together the third and fourth equalities we
get 𝑔(0) ⊕ 𝑔(1) = 1, thus obtaining a contradiction.
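Since there are only 16 pairs of deterministic strategies, one can also verify Theorem 23.2 by brute force, for example with the following Python snippet:

# Exhaustive check: every pair of deterministic strategies f, g : {0,1} -> {0,1}
# wins the game (f(x) XOR g(y) == x AND y) on at most 3 of the 4 pairs (x, y).
from itertools import product

best = 0
for f in product([0, 1], repeat=2):          # f[x] is Alice's answer on question x
    for g in product([0, 1], repeat=2):      # g[y] is Bob's answer on question y
        wins = sum((f[x] ^ g[y]) == (x & y) for x, y in product([0, 1], repeat=2))
        best = max(best, wins)
print(best / 4)   # prints 0.75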

An amazing experimentally verified fact is that quantum mechanics
allows for “telepathy”.⁵ Specifically, it has been shown that using the
weirdness of quantum mechanics, there is in fact a strategy for Alice
and Bob to succeed in this game with probability larger than 3/4 (in
fact, they can succeed with probability about 0.85, see Lemma 23.5).

⁵ More accurately, one either has to give up on a “billiard ball type” theory of the universe or believe in telepathy (believe it or not, some scientists went for the latter option).

23.4 QUANTUM WEIRDNESS


Some of the counterintuitive properties that arise from quantum me-
chanics include:

• Interference - As we’ve seen, quantum amplitudes can “cancel each


other out”.
• Measurement - The idea that amplitudes are negative as long as


“no one is looking” and “collapse” (by squaring them) to positive
probabilities when they are measured is deeply disturbing. Indeed,
as shown by EPR and Bell, this leads to various strange outcomes
such as “spooky actions at a distance”, where we can create corre-
lations between the results of measurements in places far removed.
Unfortunately (or fortunately?) these strange outcomes have been
confirmed experimentally.

• Entanglement - The notion that two parts of the system could be


connected in this weird way where measuring one will affect the
other is known as quantum entanglement.

As counter-intuitive as these concepts are, they have been experi-


mentally confirmed, so we just have to live with them.

R
Remark 23.3 — More on quantum. The discussion in this
lecture is quite brief and somewhat superficial. The
chapter on quantum computation in my book with
Arora (see draft here) is one relatively short resource
that contains essentially everything we discuss here
and more. See also this blog post of Aaronson for a
high level explanation of Shor’s algorithm which ends
with links to several more detailed expositions. This
lecture of Aaronson contains a great discussion of
the feasibility of quantum computing (Aaronson’s
course lecture notes and the book that they spawned
are fantastic reads as well). The videos of Umesh Vazi-
rani’s EdX course are an accessible and recommended
introduction to quantum computing. See the “biblio-
graphical notes” section at the end of this chapter for
more resources.

23.5 QUANTUM COMPUTING AND COMPUTATION - AN EXECUTIVE


SUMMARY.
One of the strange aspects of the quantum-mechanical picture of the
world is that unlike in the billiard ball example, there is no obvious
algorithm to simulate the evolution of 𝑛 particles over 𝑡 time periods
in 𝑝𝑜𝑙𝑦(𝑛, 𝑡) steps. In fact, the natural way to simulate 𝑛 quantum par-
ticles will require a number of steps that is exponential in 𝑛. This is a
huge headache for scientists that actually need to do these calculations
in practice.
In 1981, physicist Richard Feynman proposed to “turn this
lemon to lemonade” by making the following almost tautological
observation:
If a physical system cannot be simulated by a computer in 𝑇 steps, the


system can be considered as performing a computation that would take
more than 𝑇 steps.

So, he asked whether one could design a quantum system such that
its outcome 𝑦 based on the initial condition 𝑥 would be some function
𝑦 = 𝑓(𝑥) such that (a) we don’t know how to efficiently compute
in any other way, and (b) is actually useful for something.⁶ In 1985,
David Deutsch formally suggested the notion of a quantum Turing
machine, and the model has been since refined in works of Deutsch
and Josza and Bernstein and Vazirani. Such a system is now known as
a quantum computer.

⁶ As its title suggests, Feynman’s lecture was actually focused on the other side of simulating physics with a computer. However, he mentioned that as a “side remark” one could wonder if it’s possible to simulate physics with a new kind of computer - a “quantum computer” which would “not [be] a Turing machine, but a machine of a different kind”. As far as I know, Feynman did not suggest that such a computer could be useful for computations completely outside the domain of quantum simulation. Indeed, he was more interested in the question of whether quantum mechanics could be simulated by a classical computer.

For a while these hypothetical quantum computers seemed useful
for one of two things. First, to provide a general-purpose mecha-
nism to simulate a variety of the real quantum systems that people
care about, such as various interactions inside molecules in quantum
chemistry. Second, as a challenge to the Extended Church Turing hypoth-
esis which says that every physically realizable computation device
can be modeled (up to polynomial overhead) by Turing machines (or
equivalently, NAND-TM / NAND-RAM programs).
Quantum chemistry is important (and in particular understand-
ing it can be a bottleneck for designing new materials, drugs, and
more), but it is still a rather niche area within the broader context of
computing (and even scientific computing) applications. Hence for a
while most researchers (to the extent they were aware of it), thought
of quantum computers as a theoretical curiosity that has little bear-
ing to practice, given that this theoretical “extra power” of quantum
computer seemed to offer little advantage in the majority of the prob-
lems people want to solve in areas such as combinatorial optimization,
machine learning, data structures, etc.
To some extent this is still true today. As far as we know, quantum
computers, if built, will not provide exponential speed ups for 95%
of the applications of computing.⁷ In particular, as far as we know,
quantum computers will not help us solve NP complete problems in
polynomial or even sub-exponential time, though Grover’s algorithm
(Remark 23.4) does yield a quadratic advantage in many cases.

⁷ This “95 percent” is a figure of speech, but not completely so. At the time of this writing, cryptocurrency mining electricity consumption is estimated to use up at least 70TWh or 0.3 percent of the world’s production, which is about 2 to 5 percent of the total energy usage for the computing industry. All the current cryptocurrencies will be broken by quantum computers. Also, for many web servers the TLS protocol (which is based on the current non-lattice based systems and would be completely broken by quantum computing) is responsible for about 1 percent of the CPU usage.

However, there is one cryptography-sized exception: In 1994 Peter
Shor showed that quantum computers can solve the integer factoring
and discrete logarithm problems in polynomial time. This result has
captured the imagination of a great many people, and completely
energized research into quantum computing. This is both because the
hardness of these particular problems provides the foundations for
securing such a huge part of our communications (and these days,
our economy), and because it was a powerful demonstration that
quantum computers could turn out to be useful for problems that


a-priori seem to have nothing to do with quantum physics.
As we’ll discuss later, at the moment there are several intensive
efforts to construct large scale quantum computers. It seems safe
to say that, as far as we know, in the next five years or so there will
not be a quantum computer large enough to factor, say, a 1024 bit
number. On the other hand, it does seem quite likely that in the very
near future there will be quantum computers which achieve some task
exponentially faster than the best-known way to achieve the same
task with a classical computer. When and if a quantum computer is
built that is strong enough to break reasonable parameters of Diffie
Hellman, RSA and elliptic curve cryptography is anybody’s guess. It
could also be a “self destroying prophecy” whereby the existence of
a small-scale quantum computer would cause everyone to shift away
to lattice-based crypto which in turn will diminish the motivation
to invest the huge resources needed to build a large scale quantum
computer.⁸

⁸ Of course, given that we’re still hearing of attacks exploiting “export grade” cryptography that was supposed to disappear in the 1990’s, I imagine that we’ll still have products running 1024 bit RSA when everyone has a quantum laptop.

R
Remark 23.4 — Quantum computing and NP. Despite
popular accounts of quantum computers as having
variables that can take “zero and one at the same
time” and therefore can “explore an exponential num-
ber of possibilities simultaneously”, their true power
is much more subtle and nuanced. In particular, as far
as we know, quantum computers do not enable us to
solve NP complete problems such as 3SAT in polyno-
mial or even sub-exponential time. However, Grover’s
search algorithm does give a more modest advan-
tage (namely, quadratic) for quantum computers
over classical ones for problems in NP. In particular,
due to Grover’s search algorithm, we know that the
𝑘-SAT problem for 𝑛 variables can be solved in time
𝑂(2^{𝑛/2} 𝑝𝑜𝑙𝑦(𝑛)) on a quantum computer for every 𝑘.
In contrast, the best known algorithms for 𝑘-SAT on a
classical computer take roughly 2^{(1−1/𝑘)𝑛} steps.

23.6 QUANTUM SYSTEMS


Before we talk about quantum computing, let us recall how we phys-
ically realize “vanilla” or classical computing. We model a logical bit
that can equal 0 or a 1 by some physical system that can be in one of
two states. For example, it might be a wire with high or low voltage,
charged or uncharged capacitor, or even (as we saw) a pipe with or
without a flow of water, or the presence or absence of a soldier crab. A
classical system of 𝑛 bits is composed of 𝑛 such “basic systems”, each
of which can be in either a “zero” or “one” state. We can model the
state of such a system by a string 𝑠 ∈ {0, 1}𝑛 . If we perform an op-


eration such as writing to the 17-th bit the NAND of the 3rd and 5th
bits, this corresponds to applying a local function to 𝑠 such as setting
𝑠17 = 1 − 𝑠3 ⋅ 𝑠5 .
In the probabilistic setting, we would model the state of the system
by a distribution. For an individual bit, we could model it by a pair of
non-negative numbers 𝛼, 𝛽 such that 𝛼 + 𝛽 = 1, where 𝛼 is the prob-
ability that the bit is zero and 𝛽 is the probability that the bit is one.
For example, applying the negation (i.e., NOT) operation to this bit
corresponds to mapping the pair (𝛼, 𝛽) to (𝛽, 𝛼) since the probability
that NOT(𝜎) is equal to 1 is the same as the probability that 𝜎 is equal
to 0. This means that we can think of the NOT function as the linear
map 𝑁 ∶ ℝ² → ℝ² such that 𝑁(𝛼, 𝛽) = (𝛽, 𝛼), or equivalently as the
matrix

𝑁 = ( 0 1
      1 0 ) .
If we think of the 𝑛-bit system as a whole, then since the 𝑛 bits can
take one of 2𝑛 possible values, we model the state of the system as a
vector 𝑝 of 2𝑛 probabilities. For every 𝑠 ∈ {0, 1}𝑛 , we denote by 𝑒𝑠
the 2𝑛 dimensional vector that has 1 in the coordinate correspond-
ing to 𝑠 (identifying it with a number in [2𝑛 ]), and so can write 𝑝 as
∑𝑠∈{0,1}𝑛 𝑝𝑠 𝑒𝑠 where 𝑝𝑠 is the probability that the system is in the state
𝑠.
Applying the operation above of setting the 17-th bit to the NAND
of the 3rd and 5th bits, corresponds to transforming the vector 𝑝 to
the vector 𝐹𝑝 where 𝐹 ∶ ℝ^(2^𝑛) → ℝ^(2^𝑛) is the linear map that maps 𝑒𝑠 to
𝑒_{𝑠0⋯𝑠16(1−𝑠3⋅𝑠5)𝑠18⋯𝑠𝑛−1}.⁹

⁹ Since {𝑒𝑠}_{𝑠∈{0,1}^𝑛} is a basis for ℝ^(2^𝑛), it suffices to define the map 𝐹 on vectors of this form.
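Here is a small numerical sanity check of this way of modeling probabilistic bit operations as linear maps, using 𝑛 = 2 bits and the operation that sets bit 1 to the NAND of bits 0 and 1 (a scaled-down analog of the example above):

# A probabilistic 2-bit system as a vector in R^4, and a bit operation as a linear map.
import numpy as np

# p[s] = probability of state s, with s in {00, 01, 10, 11} read as a 2-bit string s0 s1
p = np.array([0.1, 0.2, 0.3, 0.4])

# F sets bit 1 to NAND(bit 0, bit 1): it maps e_{s0 s1} to e_{s0 (1 - s0*s1)}
F = np.zeros((4, 4))
for s in range(4):
    s0, s1 = s >> 1, s & 1
    t = (s0 << 1) | (1 - s0 * s1)
    F[t, s] = 1.0

q = F @ p
print(q, q.sum())   # still a probability distribution (entries sum to 1)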

P
Please make sure you understand why performing the
operation will take a system in state 𝑝 to a system in
the state 𝐹 𝑝. Understanding the evolution of proba-
bilistic systems is a prerequisite to understanding the
evolution of quantum systems.
If your linear algebra is a bit rusty, now would be a
good time to review it, and in particular make sure
you are comfortable with the notions of matrices, vec-
tors, (orthogonal and orthonormal) bases, and norms.

23.6.1 Quantum amplitudes


In the quantum setting, the state of an individual bit (or “qubit”,
to use quantum parlance) is modeled by a pair of numbers (𝛼, 𝛽)
such that |𝛼|2 + |𝛽|2 = 1. While in general these numbers can be
complex, for the rest of this chapter, we will often assume they are

real (though potentially negative), and hence often drop the absolute
value operator. (This turns out not to make much of a difference in
explanatory power.) As before, we think of 𝛼2 as the probability that
the bit equals 0 and 𝛽 2 as the probability that the bit equals 1. As we
did before, we can model the NOT operation by the map 𝑁 ∶ ℝ2 → ℝ2
where 𝑁 (𝛼, 𝛽) = (𝛽, 𝛼).
Following quantum tradition, instead of using 𝑒0 and 𝑒1 as we did
above, from now on we will denote the vector (1, 0) by |0⟩ and the
vector (0, 1) by |1⟩ (and moreover, think of these as column vectors).
This is known as the Dirac “ket” notation. This means that NOT is
the unique linear map 𝑁 ∶ ℝ2 → ℝ2 that satisfies 𝑁 |0⟩ = |1⟩ and
𝑁 |1⟩ = |0⟩. In other words, in the quantum case, as in the probabilistic
case, NOT corresponds to the matrix

0 1
𝑁 =( ) .
1 0

In classical computation, we typically think that there are only two


operations that we can do on a single bit: keep it the same or negate
it. In the quantum setting, a single bit operation corresponds to any
linear map OP ∶ ℝ2 → ℝ2 that is norm preserving in the sense that
for every 𝛼, 𝛽, if we apply OP to the vector $\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$ then we obtain a
vector $\begin{pmatrix} \alpha' \\ \beta' \end{pmatrix}$ such that $\alpha'^2 + \beta'^2 = \alpha^2 + \beta^2$. Such a linear map OP
corresponds to a unitary two by two matrix.10 Keeping the bit the
same corresponds to the identity matrix $I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and (as we've seen) the
NOT operation corresponds to the matrix $N = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. But there
are other operations we can use as well. One such useful operation is
the Hadamard operation, which corresponds to the matrix

$$H = \tfrac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix} .$$

10 As we mentioned, quantum mechanics actually models states as vectors with complex coordinates. However, this does not make any qualitative difference to our discussion.

In fact it turns out that Hadamard is all that we need to add to a


classical universal basis to achieve the full power of quantum comput-
ing.
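As a quick sanity check (our own illustration, not from the text), here is a numpy snippet verifying that the Hadamard matrix is unitary and that applying it to |0⟩ yields the uniform superposition (1/√2)(|0⟩ + |1⟩):

```python
import numpy as np

ket0 = np.array([1.0, 0.0])              # |0>
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

assert np.allclose(H @ H.T, np.eye(2))   # H is unitary (and its own inverse)
psi = H @ ket0                           # (|0> + |1>)/sqrt(2)
print(psi, np.abs(psi) ** 2)             # amplitudes and measurement probabilities
```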

23.6.2 Recap
The state of a quantum system of 𝑛 qubits is modeled by a $2^n$ dimen-
sional vector 𝜓 of unit norm (i.e., the squares of all coordinates sum up
to 1), which we write as $\psi = \sum_{x \in \{0,1\}^n} \psi_x |x\rangle$ where |𝑥⟩ is the col-
umn vector that has 0 in all coordinates except the one corresponding

to 𝑥 (identifying {0, 1}𝑛 with the numbers {0, … , 2𝑛 − 1}). We use


the convention that if 𝑎, 𝑏 are strings of lengths 𝑘 and ℓ respectively
then we can write the 2𝑘+ℓ dimensional vector with 1 in the 𝑎𝑏-th
coordinate and zero elsewhere not just as |𝑎𝑏⟩ but also as |𝑎⟩|𝑏⟩. In
particular, for every 𝑥 ∈ {0, 1}𝑛 , we can write the vector |𝑥⟩ also as
|𝑥0 ⟩|𝑥1 ⟩ ⋯ |𝑥𝑛−1 ⟩. This notation satisfies certain nice distributive laws
such as |𝑎⟩(|𝑏⟩ + |𝑏′ ⟩)|𝑐⟩ = |𝑎𝑏𝑐⟩ + |𝑎𝑏′ 𝑐⟩.
A quantum operation on such a system is modeled by a $2^n \times 2^n$
unitary matrix 𝑈 (one that satisfies $UU^\top = I$ where $U^\top$ is the transpose
operation, or conjugate transpose for complex matrices). If the system


is in state 𝜓 and we apply to it the operation 𝑈 , then the new state of
the system is 𝑈 𝜓.
When we measure an 𝑛-qubit system in a state 𝜓 = ∑𝑥∈{0,1}𝑛 𝜓𝑥 |𝑥⟩,
then we observe the value 𝑥 ∈ {0, 1}𝑛 with probability |𝜓𝑥 |2 . In this
case, the system collapses to the state |𝑥⟩.
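The following small numpy sketch (ours, for illustration) puts the recap together for 𝑛 = 2 qubits: it builds a state via the tensor product, applies a unitary, and computes the measurement probabilities.

```python
import numpy as np

# Basis states |0> and |1>; |ab> = |a> tensor |b> is a 4-dimensional vector.
ket = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)

psi = np.kron(ket[0], ket[0])          # start in |00>
U = np.kron(H, I2)                     # apply Hadamard to the first qubit only
psi = U @ psi                          # new state (|00> + |10>)/sqrt(2)

probs = np.abs(psi) ** 2               # measurement probabilities for 00,01,10,11
for x, p in zip(["00", "01", "10", "11"], probs):
    print(x, p)
```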

23.7 ANALYSIS OF BELL’S INEQUALITY (OPTIONAL)


Now that we have the notation in place, we can show a strategy for
Alice and Bob to display “quantum telepathy” in Bell’s Game. Re-
call that in the classical case, Alice and Bob can succeed in the “Bell
Game” with probability at most 3/4 = 0.75. We now show that quantum
mechanics allows them to succeed with probability at least 0.8.11
11 The strategy we show is not the best one. Alice and Bob can in fact succeed with probability cos²(𝜋/8) ∼ 0.854.
Lemma 23.5 There is a 2-qubit quantum state 𝜓 ∈ ℂ4 so that if Alice
has access to the first qubit of 𝜓, can manipulate and measure it and
output 𝑎 ∈ {0, 1} and Bob has access to the second qubit of 𝜓 and can
manipulate and measure it and output 𝑏 ∈ {0, 1} then Pr[𝑎 ⊕ 𝑏 =
𝑥 ∧ 𝑦] ≥ 0.8.

Proof. Alice and Bob will start by preparing a 2-qubit quantum system
in the state

$$\psi = \tfrac{1}{\sqrt{2}}|00\rangle + \tfrac{1}{\sqrt{2}}|11\rangle$$

(this state is known as an EPR pair). Alice takes the first qubit of
the system to her room, and Bob takes the second qubit to his room. Now,
when Alice receives 𝑥: if 𝑥 = 0 she does nothing and if 𝑥 = 1 she applies
the unitary map $R_{-\pi/8}$ to her qubit, where $R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$
is the unitary operation corresponding to rotation in the plane with
angle 𝜃. When Bob receives 𝑦, if 𝑦 = 0 he does nothing and if 𝑦 = 1
he applies the unitary map 𝑅𝜋/8 to his qubit. Then each one of them
measures their qubit and sends this as their response.
Recall that to win the game Bob and Alice want their outputs to
be more likely to differ if 𝑥 = 𝑦 = 1 and to be more likely to agree

otherwise. We will split the analysis in one case for each of the four
possible values of 𝑥 and 𝑦.
Case 1: 𝑥 = 0 and 𝑦 = 0. If 𝑥 = 𝑦 = 0 then the state does not
change. Because the state 𝜓 is proportional to |00⟩ + |11⟩, the measure-
ments of Bob and Alice will always agree (if Alice measures 0 then the
state collapses to |00⟩ and so Bob measures 0 as well, and similarly for
1). Hence in the case 𝑥 = 𝑦 = 0, Alice and Bob always win.
Case 2: 𝑥 = 0 and 𝑦 = 1. If 𝑥 = 0 and 𝑦 = 1 then after Alice
measures her bit, if she gets 0 then the system collapses to the state
|00⟩, in which case after Bob performs his rotation, his qubit is in
the state cos(𝜋/8)|0⟩ + sin(𝜋/8)|1⟩. Thus, when Bob measures his
qubit, he will get 0 (and hence agree with Alice) with probability
cos2 (𝜋/8) ≥ 0.85. Similarly, if Alice gets 1 then the system collapses
to |11⟩, in which case after rotation Bob’s qubit will be in the state
− sin(𝜋/8)|0⟩ + cos(𝜋/8)|1⟩ and so once again he will agree with Alice
with probability cos2 (𝜋/8).
The analysis for Case 3, where 𝑥 = 1 and 𝑦 = 0, is completely
analogous to Case 2. Hence Alice and Bob will agree with probability
cos²(𝜋/8) in this case as well.12
12 We are using the (not too hard) observation that the result of this experiment is the same regardless of the order in which Alice and Bob apply their rotations and measurements.
Case 4: 𝑥 = 1 and 𝑦 = 1. For the case that 𝑥 = 1 and 𝑦 = 1,
after both Alice and Bob perform their rotations, the state will be
proportional to

𝑅−𝜋/8 |0⟩𝑅𝜋/8 |0⟩ + 𝑅−𝜋/8 |1⟩𝑅𝜋/8 |1⟩ . (23.1)

Intuitively, since we rotate one state by 45 degrees and the other


state by -45 degrees, they will become orthogonal to each other, and
the measurements will behave like independent coin tosses that agree
with probability 1/2. However, for the sake of completeness, we now
show the full calculation.
Opening up the coefficients and using cos(−𝑥) = cos(𝑥) and
sin(−𝑥) = − sin(𝑥), we can see that (23.1) is proportional to

cos2 (𝜋/8)|00⟩ + cos(𝜋/8) sin(𝜋/8)|01⟩


− sin(𝜋/8) cos(𝜋/8)|10⟩ + sin2 (𝜋/8)|11⟩
− sin2 (𝜋/8)|00⟩ + sin(𝜋/8) cos(𝜋/8)|01⟩
− cos(𝜋/8) sin(𝜋/8)|10⟩ + cos2 (𝜋/8)|11⟩ .
Using the trigonometric identities 2 sin(𝛼) cos(𝛼) = sin(2𝛼) and
cos²(𝛼) − sin²(𝛼) = cos(2𝛼), we see that the probability of getting any
one of |00⟩, |10⟩, |01⟩, |11⟩ is proportional to cos(𝜋/4) = sin(𝜋/4) = $\tfrac{1}{\sqrt{2}}$.
Hence all four options for (𝑎, 𝑏) are equally likely, which mean that in
this case 𝑎 = 𝑏 with probability 0.5.
Taking all the four cases together, the overall probability of winning
the game is at least $\tfrac{1}{4} \cdot 1 + \tfrac{1}{2} \cdot 0.85 + \tfrac{1}{4} \cdot 0.5 = 0.8$.
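For readers who want to check the arithmetic, here is a short numpy simulation (our own sketch, not part of the original text) of the above strategy; it computes the agreement probability for each of the four input pairs and the overall value of the game.

```python
import numpy as np

def R(theta):
    """Rotation by angle theta (a single-qubit unitary)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# EPR pair (|00> + |11>)/sqrt(2), with qubit ordering |alice, bob>.
psi = np.array([1, 0, 0, 1]) / np.sqrt(2)

total = 0.0
for x in [0, 1]:
    for y in [0, 1]:
        A = R(-np.pi / 8) if x == 1 else np.eye(2)   # Alice's rotation
        B = R(np.pi / 8) if y == 1 else np.eye(2)    # Bob's rotation
        state = np.kron(A, B) @ psi
        probs = np.abs(state) ** 2                   # probabilities of 00,01,10,11
        # They win if a XOR b equals x AND y.
        win = sum(p for (a, b), p in zip([(0, 0), (0, 1), (1, 0), (1, 1)], probs)
                  if (a ^ b) == (x & y))
        total += win / 4
        print(f"x={x} y={y}: win probability {win:.3f}")
print(f"overall: {total:.3f}")   # roughly 0.80
```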

R
Remark 23.6 — Quantum vs probabilistic strategies. It
is instructive to understand what about quantum
mechanics enabled this gain in Bell’s Inequality.
Consider the following analogous probabilistic strat-
egy for Alice and Bob. They agree that each one of
them will output 0 if they get 0 as input and output 1
with probability 𝑝 if they get 1 as input. In this case
one can see that their success probability would be
$\tfrac{1}{4} \cdot 1 + \tfrac{1}{2}(1 - p) + \tfrac{1}{4}[2p(1 - p)] = 0.75 - 0.5p^2 \leq 0.75$.
The quantum strategy we described above can be
thought of as a variant of the probabilistic strategy for
parameter 𝑝 set to sin2 (𝜋/8) = 0.15. But in the case
𝑥 = 𝑦 = 1, instead of disagreeing only with probability
2𝑝(1 − 𝑝) = 1/4, we can use the negative probabilities
in the quantum world and rotate the state in opposite
directions. Therefore, the probability of disagreement
ends up being sin2 (𝜋/4) = 0.5.

23.8 QUANTUM COMPUTATION


Recall that in the classical setting, we modeled computation as ob-
tained by a sequence of basic operations. We had two types of computa-
tional models:

• Non uniform models of computation such as Boolean circuits and


NAND-CIRC programs, where a finite function 𝑓 ∶ {0, 1}𝑛 → {0, 1}
is computable in size 𝑇 if it can be expressed as a combination of
𝑇 basic operations (gates in a circuit or lines in a NAND-CIRC
program)

• Uniform models of computation such as Turing machines and NAND-


TM programs, where an infinite function 𝐹 ∶ {0, 1}∗ → {0, 1} is
computable in time 𝑇 (𝑛) if there is a single algorithm that on input
𝑥 ∈ {0, 1}𝑛 evaluates 𝐹 (𝑥) using at most 𝑇 (𝑛) basic steps.

When considering efficient computation, we defined the class P to


consist of all infinite functions 𝐹 ∶ {0, 1}∗ → {0, 1} that can be com-
puted by a Turing machine or NAND-TM program in time 𝑝(𝑛) for
some polynomial 𝑝(⋅). We defined the class P/poly to consist of all
infinite functions 𝐹 ∶ {0, 1}∗ → {0, 1} such that for every 𝑛, the re-
striction 𝐹↾𝑛 of 𝐹 to {0, 1}𝑛 can be computed by a Boolean circuit or
NAND-CIRC program of size at most 𝑝(𝑛) for some polynomial 𝑝(⋅).
We will do the same for quantum computation, focusing mostly on
the non uniform setting of quantum circuits, since that is simpler, and
already illustrates the important differences with classical computing.

23.8.1 Quantum circuits


A quantum circuit is analogous to a Boolean circuit, and can be de-
scribed as a directed acyclic graph. One crucial difference is that the
out degree of every vertex in a quantum circuit is at most one. This
is because we cannot “reuse” quantum states without measuring
them (which collapses their “probabilities”). Therefore, we cannot
use the same qubit as input for two different gates; this is known as
the No Cloning Theorem. Another
more technical difference is that to express our operations as uni-
tary matrices, we will need to make sure all our gates are reversible.
This is not hard to ensure. For example, in the quantum context, in-
stead of thinking of NAND as a (non reversible) map from {0, 1}2 to
{0, 1}, we will think of it as the reversible map on three qubits that
maps 𝑎, 𝑏, 𝑐 to 𝑎, 𝑏, 𝑐 ⊕ NAND(𝑎, 𝑏) (i.e., flip the last bit if NAND
of the first two bits is 1). Equivalently, the NAND operation cor-
responds to the 8 × 8 unitary matrix 𝑈𝑁𝐴𝑁𝐷 such that (identify-
ing {0, 1}3 with [8]) for every 𝑎, 𝑏, 𝑐 ∈ {0, 1}, if |𝑎𝑏𝑐⟩ is the basis
element with 1 in the 𝑎𝑏𝑐-th coordinate and zero elsewhere, then
$U_{NAND}|abc\rangle = |ab(c \oplus \mathrm{NAND}(a,b))\rangle$.14
14 Readers familiar with quantum computing should note that 𝑈_NAND is a close variant of the so called Toffoli gate and so QNAND-CIRC programs correspond to quantum circuits with the Hadamard and Toffoli gates.
If we order the rows and columns as 000, 001, 010, … , 111, then 𝑈_NAND can be written as the
following matrix:

$$U_{NAND} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}$$
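The following numpy sketch (ours; not from the text) builds this 8 × 8 matrix directly from the rule |𝑎𝑏𝑐⟩ ↦ |𝑎𝑏(𝑐 ⊕ NAND(𝑎, 𝑏))⟩ and checks that it is a permutation matrix, and in particular unitary.

```python
import numpy as np
import itertools

U = np.zeros((8, 8))
for a, b, c in itertools.product([0, 1], repeat=3):
    nand = 1 - a * b                       # NAND(a, b)
    src = 4 * a + 2 * b + c                # index of |abc>
    dst = 4 * a + 2 * b + (c ^ nand)       # index of |ab (c xor NAND(a,b))>
    U[dst, src] = 1                        # column |abc> maps to that basis vector

assert np.allclose(U @ U.T, np.eye(8))     # U is unitary (a permutation matrix)
print(U.astype(int))                       # matches the matrix displayed above
```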

If we have an 𝑛 qubit system, then for 𝑖, 𝑗, 𝑘 ∈ [𝑛], we will denote
by $U_{NAND}^{i,j,k}$ the $2^n \times 2^n$ unitary matrix that corresponds to applying
𝑈_NAND to the 𝑖-th, 𝑗-th, and 𝑘-th bits, leaving the others intact. That is,
for every $v = \sum_{x \in \{0,1\}^n} v_x |x\rangle$,
$U_{NAND}^{i,j,k} v = \sum_{x \in \{0,1\}^n} v_x |x_0 \cdots x_{k-1}(x_k \oplus \mathrm{NAND}(x_i, x_j))x_{k+1} \cdots x_{n-1}\rangle$.
As mentioned above, we will also use the Hadamard or HAD opera-
tion. A quantum circuit is obtained by applying a sequence of 𝑈_NAND
and HAD gates, which correspond to the matrix

$$H = \tfrac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix} .$$

Another way to define 𝐻 is that for 𝑏 ∈ {0, 1}, $H|b\rangle = \tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}(-1)^b|1\rangle$.
We define $\mathrm{HAD}^i$ to be the $2^n \times 2^n$ unitary matrix that
applies HAD to the 𝑖-th qubit and leaves the others intact. Using the

ket notation, we can write this as

$$\mathrm{HAD}^{i} \sum_{x \in \{0,1\}^n} v_x |x\rangle = \tfrac{1}{\sqrt{2}} \sum_{x \in \{0,1\}^n} v_x |x_0 \cdots x_{i-1}\rangle \left(|0\rangle + (-1)^{x_i}|1\rangle\right) |x_{i+1} \cdots x_{n-1}\rangle .$$

A quantum circuit is obtained by composing these basic operations


on some 𝑚 qubits. If 𝑚 ≥ 𝑛, we use a circuit to compute a function
𝑓 ∶ {0, 1}𝑛 → {0, 1}:

• On input 𝑥, we initialize the system to hold 𝑥0 , … , 𝑥𝑛−1 in the first 𝑛


qubits, and initialize all remaining 𝑚 − 𝑛 qubits to zero.

• We execute each elementary operation one by one.

• At the end of the computation, we measure the system, and output


the result of the last qubit (i.e. the qubit in location 𝑚 − 1).15

• We say that the circuit computes 𝑓, if the probability that this output
equals 𝑓(𝑥) is at least 2/3. Note that this probability is obtained
by summing up the squares of the amplitudes of all coordinates
in the final state of the system corresponding to vectors |𝑦⟩ where
$y_{m-1} = f(x)$.

15 For simplicity we restrict attention to functions with a single bit of output, though the definition of quantum circuits naturally extends to circuits with multiple outputs.

Formally this is defined as follows:

Definition 23.7 — Quantum circuit. A quantum circuit of 𝑚 inputs and
𝑠 gates over the {𝑈_NAND, HAD} basis is a sequence of 𝑠 unitary
$2^n \times 2^n$ matrices $U_0, \ldots, U_{s-1}$ such that each matrix $U_\ell$ is either of
the form $U_{NAND}^{i,j,k}$ for 𝑖, 𝑗, 𝑘 ∈ [𝑛] or $\mathrm{HAD}^i$ for 𝑖 ∈ [𝑛].

A quantum circuit computes a function $f : \{0,1\}^n \to \{0,1\}$ if the
following is true for every $x \in \{0,1\}^n$:
Let 𝑣 be the vector

$$v = U_{s-1} U_{s-2} \cdots U_1 U_0 |x 0^{m-n}\rangle$$

and write 𝑣 as $\sum_{y \in \{0,1\}^m} v_y |y\rangle$. Then

$$\sum_{y \in \{0,1\}^m \text{ s.t. } y_{m-1} = f(x)} |v_y|^2 \geq \tfrac{2}{3} .$$
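To make the definition concrete, here is a small numpy sketch (ours, with hypothetical helper names) that evaluates a quantum circuit given as a list of gate matrices, following Definition 23.7 for a circuit on 𝑚 qubits.

```python
import numpy as np
from functools import reduce

def had(i, m):
    """The 2^m x 2^m matrix applying Hadamard to qubit i, identity elsewhere."""
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    return reduce(np.kron, [H if j == i else np.eye(2) for j in range(m)])

def unand(i, j, k, m):
    """The 2^m x 2^m matrix flipping qubit k iff NAND(qubit i, qubit j) = 1."""
    U = np.zeros((2**m, 2**m))
    for x in range(2**m):
        bits = [(x >> (m - 1 - t)) & 1 for t in range(m)]   # qubit 0 = most significant
        bits[k] ^= 1 - bits[i] * bits[j]
        U[int("".join(map(str, bits)), 2), x] = 1
    return U

def run_circuit(gates, x, m):
    """Probability that measuring qubit m-1 yields 1 on input string x."""
    v = np.zeros(2**m)
    v[int(x.ljust(m, "0"), 2)] = 1                  # initial state |x 0^{m-n}>
    for U in gates:
        v = U @ v
    return sum(abs(v[y])**2 for y in range(2**m) if y & 1)  # qubit m-1 is the lowest bit

m = 3
gates = [had(0, m), unand(0, 1, 2, m)]   # Hadamard on qubit 0, then NAND into qubit 2
print(run_circuit(gates, "01", m))       # probability the output qubit equals 1
```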

P
Please stop here and see that this definition makes
sense to you.

Once we have the notion of quantum circuits, we can define the


quantum analog of P/poly (i.e., define the class of functions com-
putable by polynomial size quantum circuits) as follows:

Definition 23.8 — BQP/poly. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. We say that
𝐹 ∈ BQP/poly if there exists some polynomial 𝑝 ∶ ℕ → ℕ such that
for every 𝑛 ∈ ℕ, if 𝐹↾𝑛 is the restriction of 𝐹 to inputs of length 𝑛,
then there is a quantum circuit of size at most 𝑝(𝑛) that computes
𝐹↾𝑛 .

R
Remark 23.9 — The obviously exponential fallacy. A
priori it might seem “obvious” that quantum com-
puting is exponentially powerful, since to perform a
quantum computation on 𝑛 bits we need to maintain
the 2𝑛 dimensional state vector and apply 2𝑛 × 2𝑛 ma-
trices to it. Indeed popular descriptions of quantum
computing (too) often say something along the lines
that the difference between quantum and classical
computers is that a classical bit can either be zero or
one while a qubit can be in both states at once, and
so in many qubits a quantum computer can perform
exponentially many computations at once.
Depending on how you interpret it, this description
is either false or would apply equally well to proba-
bilistic computation, even though we’ve already seen
that every randomized algorithm can be simulated by
a similar-sized circuit, and in fact we conjecture that
BPP = P.
Moreover, this “obvious” approach for simulating
a quantum computation will take not just exponen-
tial time but exponential space as well, while it can be
shown that using a simple recursive formula one can
calculate the final quantum state using polynomial
space (in physics this is known as “Feynman path inte-
grals”). So, the exponentially long vector description
by itself does not imply that quantum computers are
exponentially powerful. Indeed, we cannot prove that
they are (i.e., as far as we know, every QNAND-CIRC
program could be simulated by a NAND-CIRC pro-
gram with polynomial overhead), but we do have
some problems (integer factoring most prominently)
for which they do provide exponential speedup over
the currently best known classical (deterministic or
probabilistic) algorithms.

23.8.2 QNAND-CIRC programs (optional)


Just like in the classical case, there is an equivalence between circuits
and straightline programs, and so we can define the programming
language QNAND that is the quantum analog of our NAND-CIRC
programming language. To do so, we only add a single operation:
HAD(foo) which applies the single-bit operation 𝐻 to the variable foo.

We also use the following interpretation to make NAND reversible: foo


= NAND(bar,blah) means that we modify foo to be the XOR of its
original value and the NAND of bar and blah. (In other words, apply
the 8 by 8 unitary transformation 𝑈𝑁𝐴𝑁𝐷 defined above to the three
qubits corresponding to foo, bar and blah.) If foo is initialized to
zero then this makes no difference.
If 𝑃 is a QNAND-CIRC program with 𝑛 input variables, ℓ
workspace variables, and 𝑚 output variables, then running it on the
input 𝑥 ∈ {0, 1}𝑛 corresponds to setting up a system with 𝑛 + 𝑚 + ℓ
qubits and performing the following process:

1. We initialize the input variables X[0] … X[𝑛 − 1] to 𝑥0 , … , 𝑥𝑛−1 and


all other variables to 0.

2. We execute the program line by line, applying the corresponding


physical operation 𝐻 or 𝑈𝑁𝐴𝑁𝐷 to the qubits that are referred to by
the line.

3. We measure the output variables Y[0], …, Y[𝑚 − 1] and output


the result (if there is more than one output then we measure more
variables).

23.8.3 Uniform computation


Just as in the classical case, we can define uniform computational mod-
els. For example, we can define the QNAND-TM programming language
to be QNAND augmented with loops and arrays just like NAND-
TM is obtained from NAND. Using this we can define the class BQP
which is the uniform analog of BQP/poly . Just as in the classical setting
it holds that BPP ⊆ P/poly , in the quantum setting it can be shown that
BQP ⊆ BQP/poly . Just like the classical case, we can also use Quantum
Turing Machines instead of QNAND-TM to define BQP.
Yet another way to define BQP is the following: a function 𝐹 ∶
{0, 1}∗ → {0, 1} is in BQP if (1) 𝐹 ∈ BQP/poly and (2) moreover
for every 𝑛, the quantum circuit that verifies this can be generated
by a classical polynomial time NAND-TM program (or, equivalently,
a polynomial-time Turing machine).16 We use this definition here,
though an equivalent one can be made using QNAND-TM or quantum
Turing machines:
16 This is analogous to the alternative characterization of P that appears in ??.

Definition 23.10 — The class BQP. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. We say that
𝐹 ∈ BQP if there exists a polynomial time NAND-TM program 𝑃
such that for every 𝑛, 𝑃 (1𝑛 ) is the description of a quantum circuit
𝐶𝑛 that computes the restriction of 𝐹 to {0, 1}𝑛 .

P
One way to verify that you’ve understood these def-
initions it to see that you can prove (1) P ⊆ BQP
and in fact the stronger statement BPP ⊆ BQP, (2)
BQP ⊆ EXP, and (3) For every NP-complete function
𝐹 , if 𝐹 ∈ BQP then NP ⊆ BQP. Exercise 23.1 asks you
to work these out.

The relation between NP and BQP is not known (see also Re-
mark 23.4). It is widely believed that NP ⊈ BQP, but there is no
consensus whether or not BQP ⊆ NP. It is quite possible that these
two classes are incomparable, in the sense that NP ⊈ BQP (and in par-
ticular no NP-complete function belongs to BQP) but also BQP ⊈ NP
(and there are some interesting candidates for such problems).
It can be shown that QNANDEVAL (evaluating a quantum circuit
on an input) is computable by a polynomial size QNAND-CIRC pro-
gram, and moreover this program can even be generated uniformly
and hence QNANDEVAL is in BQP. This allows us to “port” many
of the results of classical computational complexity into the quantum
realm as well.

R
Remark 23.11 — Restricting attention to circuits. Because
the non uniform model is a little cleaner to work with,
in the rest of this chapter we mostly restrict attention
to this model, though all the algorithms we discuss
can be implemented in uniform computation as well.

23.9 PHYSICALLY REALIZING QUANTUM COMPUTATION


To realize quantum computation one needs to create a system with 𝑛
independent binary states (i.e., “qubits”), and be able to manipulate
small subsets of two or three of these qubits to change their state.
While by the way we defined operations above it might seem that
one needs to be able to perform arbitrary unitary operations on these
two or three qubits, it turns out that there are several choices for universal
sets: a small constant number of gates that generate all others. The
biggest challenge is how to keep the system from being measured and
collapsing to a single classical combination of states. This is sometimes
known as the coherence time of the system. The threshold theorem says
that there is some absolute constant level of errors 𝜏 so that if errors
are created at every gate at rate smaller than 𝜏 then we can recover
from those and perform arbitrary long computations. (Of course there
are different ways to model the errors and so there are actually several
threshold theorems corresponding to various noise models).

There have been several proposals to build quantum computers:

• Superconducting quantum computers use super-conducting elec-


tric circuits to do quantum computation. These are currently the
devices with largest number of fully controllable qubits.

• At Harvard, Lukin’s group is using cold atoms to implement quan-


tum computers.

• Trapped ion quantum computers use the states of an ion to sim-


ulate a qubit. People have made some recent advances on these
computers too. For example, an ion-trap computer was used to im-
plement Shor’s algorithm to factor 15. (It turns out that 15 = 3 × 5
:) )

• Topological quantum computers use a different technology, which


is more stable by design but arguably harder to manipulate to cre-
ate quantum computers.

These approaches are not mutually exclusive and it could be that


ultimately quantum computers are built by combining all of them
together. At the moment, we have devices with about 100 qubits,
and about 1% error per gate. Such restricted machines are sometimes
called “Noisy Intermediate-Scale Quantum Computers” or “NISQ”.
See this article by John Preskill for some of the progress and applica-
tions of such more restricted devices. If the number of qubits is in-
creased and the error is decreased by one or two orders of magnitude,
we could start seeing more applications.

23.10 SHOR’S ALGORITHM: HEARING THE SHAPE OF PRIME FAC-


TORS
Bell’s Inequality is a powerful demonstration that there is some-
thing very strange going on with quantum mechanics. But could
this “strangeness” be of any use to solve computational problems not
directly related to quantum systems? A priori, one could guess the
answer is no. In 1994 Peter Shor showed that one would be wrong:
Figure 23.3 (caption): Superconducting quantum computer prototype at Google. Image credit: Google / MIT Technology Review.

Theorem 23.12 — Shor's Algorithm. There is a polynomial-time quantum
algorithm that on input an integer 𝑀 (represented in base
two), outputs the prime factorization of 𝑀 .

Another way to state Theorem 23.12 is that if we define


FACTORING ∶ {0, 1}∗ → {0, 1} to be the function that on input a
pair of numbers (𝑀 , 𝑋) outputs 1 if and only if 𝑀 has a factor 𝑃 such
that 2 ≤ 𝑃 ≤ 𝑋, then FACTORING is in BQP. This is an exponential
improvement over the best known classical algorithms, which take

roughly $2^{\tilde{O}(n^{1/3})}$ time, where the $\tilde{O}$ notation hides factors that are
polylogarithmic in 𝑛. While we will not prove Theorem 23.12 in this


chapter, we will sketch some of the ideas behind the proof.

23.10.1 Period finding


At the heart of Shor’s Theorem is an efficient quantum algorithm for
finding periods of a given function. For example, a function 𝑓 ∶ ℝ → ℝ
is periodic if there is some ℎ > 0 such that 𝑓(𝑥 + ℎ) = 𝑓(𝑥) for every 𝑥
(e.g., see Fig. 23.4).
Musical notes yield one type of periodic function. When you pull
on a string on a musical instrument, it vibrates in a repeating pattern.
Hence, if we plot the speed of the string (and so also the speed of
the air around it) as a function of time, it will correspond to some
periodic function. The length of the period is known as the wave length
of the note. The frequency is the number of times the function repeats
itself within a unit of time. For example, the “Middle C” note has
a frequency of 261.63 Hertz, which means its period is 1/(261.63)
seconds.
Figure 23.4 (caption): Top: A periodic function. Bottom: An a-periodic function.
If we play a chord by playing several notes at once, we get a more
complex periodic function obtained by combining the functions of
the individual notes (see Fig. 23.5). The human ear contains many
small hairs, each of which is sensitive to a narrow band of frequencies.
Hence when we hear the sound corresponding to a chord, the hairs in
our ears actually separate it out to the components corresponding to
each frequency.
It turns out that (essentially) every periodic function 𝑓 ∶ ℝ → ℝ
can be decomposed into a sum of simple wave functions (namely
functions of the form 𝑥 ↦ sin(𝜃𝑥) or 𝑥 ↦ cos(𝜃𝑥)). This is known as
the Fourier Transform (see Fig. 23.6). The Fourier transform makes it
easy to compute the period of a given function: it will simply be the
least common multiple of the periods of the constituent waves.
Figure 23.5 (caption): Left: The air-pressure when playing a “C Major” chord as a function of time. Right: The coefficients of the Fourier transform of the same function; we can see that it is the sum of three frequencies corresponding to the C, E and G notes (261.63, 329.63 and 392 Hertz respectively). Credit: Bjarke Mønsted's Quora answer.

23.10.2 Shor’s Algorithm: A bird’s eye view


On input an integer 𝑀 , Shor's algorithm outputs the prime factor-
ization of 𝑀 in time that is polynomial in log 𝑀 . The main steps in the
algorithm are the following:

Step 1: Reduce to period finding. The first step in the algorithm is to
pick a random 𝐴 ∈ {0, 1, … , 𝑀 − 1} and define the function $F_A :$
$\{0,1\}^m \to \{0,1\}^m$ as $F_A(x) = A^x \pmod M$, where we identify the
string $x \in \{0,1\}^m$ with an integer using the binary representation,
and similarly represent the integer $A^x \pmod M$ as a string. (We will
choose 𝑚 to be some polynomial in log 𝑀 and so in particular $\{0,1\}^m$ is a
large enough set to represent all the numbers in {0, 1, … , 𝑀 − 1}.)
Figure 23.6 (caption): If 𝑓 is a periodic function then when we represent it in the Fourier transform, we expect the coefficients corresponding to wavelengths that do not evenly divide the period to be very small, as they would tend to “cancel out”.

Some not-too-hard (though somewhat technical) calculations show


that: (1) The function $F_A$ is periodic (i.e., there is some integer $p_A$ such
that $F_A(x + p_A) = F_A(x)$ for almost17 every 𝑥) and more importantly
(2) If we can recover the period $p_A$ of $F_A$ for several randomly chosen
𝐴's, then we can recover the factorization of 𝑀 . Hence, factoring 𝑀
reduces to finding out the period of the function $F_A$. Exercise 23.2
asks you to work this out for the related task of computing the discrete
logarithm (which underlies the security of the Diffie-Hellman key
exchange and elliptic curve cryptography).
17 We'll ignore this “almost” qualifier in the discussion below. It causes some annoying, yet ultimately manageable, technical issues in the full-fledged algorithm.

Step 2: Period finding via the Quantum Fourier Transform. Using a simple
trick known as “repeated squaring”, it is possible to compute the
map 𝑥 ↦ 𝐹𝐴 (𝑥) in time polynomial in 𝑚, which means we can also
compute this map using a polynomial number of NAND gates, and so
in particular we can generate in polynomial quantum time a quantum
state 𝜌 that is (up to normalization) equal to

∑ |𝑥⟩|𝐹𝐴 (𝑥)⟩ .
𝑥∈{0,1}𝑚

In particular, if we were to measure the state 𝜌, we would get a ran-


dom pair of the form (𝑥, 𝑦) where 𝑦 = 𝐹𝐴 (𝑥). So far, this is not at all
impressive. After all, we did not need the power of quantum comput-
ing to generate such pairs: we could simply generate a random 𝑥 and
then compute 𝐹𝐴 (𝑥).
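For concreteness, here is a classical sketch (ours) of the repeated-squaring computation of $F_A(x) = A^x \bmod M$; it uses a number of multiplications polynomial in the bit length of 𝑥, which is what allows the map to be implemented by a polynomial number of (reversible) NAND gates.

```python
def F(A, x, M):
    """Compute A**x mod M by repeated squaring (same as Python's pow(A, x, M))."""
    result = 1
    base = A % M
    while x > 0:
        if x & 1:                  # if the current lowest bit of x is 1
            result = (result * base) % M
        base = (base * base) % M   # square the base for the next bit of x
        x >>= 1
    return result

assert F(7, 90, 101) == pow(7, 90, 101)
```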
Another way to describe the state 𝜌 is that the coefficient of |𝑥⟩|𝑦⟩ in
𝜌 is proportional to 𝑓𝐴,𝑦 (𝑥) where 𝑓𝐴,𝑦 ∶ {0, 1}𝑚 → ℝ is the function
such that

$$f_{A,y}(x) = \begin{cases} 1 & y = A^x \pmod M \\ 0 & \text{otherwise} \end{cases}$$

The magic of Shor’s algorithm comes from a procedure known


as the Quantum Fourier Transform. It allows us to change the state 𝜌 into
the state 𝜌 ̂ where the coefficient of |𝑥⟩|𝑦⟩ is now proportional to the
𝑥-th Fourier coefficient of 𝑓𝐴,𝑦 . In other words, if we measure the state
𝜌,̂ we will obtain a pair (𝑥, 𝑦) such that the probability of choosing 𝑥
is proportional to the square of the weight of the frequency 𝑥 in the
representation of the function $f_{A,y}$. Since for every 𝑦, the function
$f_{A,y}$ has the period $p_A$, it can be shown that the frequency 𝑥 will be
(almost18) a multiple of $p_A$. If we make several such samples $y_0, \ldots, y_k$
and obtain the frequencies $x_1, \ldots, x_k$, then the true period $p_A$ divides
all of them, and it can be shown that it is going to be in fact the greatest
common divisor (g.c.d.) of all these frequencies: a value which can be
computed in polynomial time.
18 The “almost” qualifier again appears because the original function was only “almost” periodic, but it turns out this can be handled by using an “approximate greatest common divisor” algorithm instead of a standard g.c.d. The latter can be obtained using a tool known as the continued fraction representation of a number.

As mentioned above, we can recover the factorization of 𝑀 from


the periods of 𝐹𝐴0 , … , 𝐹𝐴𝑡 for some randomly chosen 𝐴0 , … , 𝐴𝑡 in
{0, … , 𝑀 − 1} and 𝑡 which is polynomial in log 𝑀 .
The resulting algorithm can be described in a high (and somewhat
inaccurate) level as follows:

Shor’s Algorithm: (sketch)


Input: Number 𝑀 ∈ ℕ.
Output: Prime factorization of 𝑀 .
Operations:

1. Repeat the following 𝑘 = 𝑝𝑜𝑙𝑦(log 𝑀 ) number of times:


a. Choose 𝐴 ∈ {0, … , 𝑀 − 1} at random, and let 𝑓𝐴 ∶ ℤ𝑀 → ℤ𝑀 be
the map $x \mapsto A^x \bmod M$.
b. For 𝑡 = 𝑝𝑜𝑙𝑦(log 𝑀 ), repeat 𝑡 times the following step: use the Quantum
Fourier Transform to create a quantum state |𝜓⟩ over 𝑝𝑜𝑙𝑦(log(𝑚))
qubits, such that if we measure |𝜓⟩ we obtain a pair of strings
(𝑗, 𝑦) with probability proportional to the square of the coefficient
corresponding to the wave function 𝑥 ↦ cos(𝑥𝜋𝑗/𝑀 ) or 𝑥 ↦
sin(𝑥𝜋𝑗/𝑀 ) in the Fourier transform of the function 𝑓𝐴,𝑦 ∶ ℤ𝑚 →
{0, 1} defined as 𝑓𝐴,𝑦 (𝑥) = 1 iff 𝑓𝐴 (𝑥) = 𝑦.
c. If 𝑗1 , … , 𝑗𝑡 are the coefficients we obtained in the previous step,
then the least common multiple of 𝑀 /𝑗1 , … , 𝑀 /𝑗𝑡 is likely to be
the period of the function 𝑓𝐴 .
2. If we let 𝐴0 , … , 𝐴𝑘−1 and 𝑝0 , … , 𝑝𝑘−1 be the numbers we chose in
the previous step and the corresponding periods of the functions
𝑓𝐴0 , … , 𝑓𝐴𝑘−1 then we can use classical results in number theory to
obtain from these a non-trivial prime factor 𝑄 of 𝑀 (if such exists).
We can now run the algorithm again with the (smaller) input 𝑀 /𝑄
to obtain all other factors.

Reducing factoring to order finding is cumbersome, but can be


done in polynomial time using a classical computer. The key quantum
ingredient in Shor’s algorithm is the quantum fourier transform.

R
Remark 23.13 — Quantum Fourier Transform. Despite
its name, the Quantum Fourier Transform does not
actually give a way to compute the Fourier Trans-
form of a function 𝑓 ∶ {0, 1}𝑚 → ℝ. This would be
impossible to do in time polynomial in 𝑚, as simply
writing down the Fourier Transform would require 2𝑚
coefficients. Rather the Quantum Fourier Transform
gives a quantum state where the amplitude correspond-
ing to an element (think: frequency) ℎ is equal to
the corresponding Fourier coefficient. This allows us to
sample from a distribution where ℎ is drawn with
probability proportional to the square of its Fourier
coefficient. This is not the same as computing the

Fourier transform, but is good enough for recovering


the period.

23.11 QUANTUM FOURIER TRANSFORM (ADVANCED, OPTIONAL)


The above description of Shor’s algorithm skipped over the implemen-
tation of the main quantum ingredient: the Quantum Fourier Transform
algorithm. In this section we discuss the ideas behind this algorithm.
We will be rather brief and imprecise. Remark 23.3 and Section 23.13
contain references to sources of more information about this topic.
To understand the Quantum Fourier Transform, we need to better
understand the Fourier Transform itself. In particular, we will need
to understand how it applies not just to functions whose input is a
real number but to functions whose domain can be any arbitrary
commutative group. Therefore we now take a short detour to (very
basic) group theory, and define the notion of periodic functions over
groups.

R
Remark 23.14 — Group theory. While we define the con-
cepts we use, some background in group or number
theory might be quite helpful for fully understanding
this section.
We will not use anything more than the basic proper-
ties of finite Abelian groups. Specifically we use the
following notions:

• A finite group 𝔾 can be thought of as simply a set


of elements and some binary operation ⋆ on these
elements (i.e., if 𝑔, ℎ ∈ 𝔾 then 𝑔 ⋆ ℎ is an element of
𝔾 as well).
• The operation ⋆ satisfies the sort of properties that
a product operation does, namely, it is associative
(i.e., (𝑔 ⋆ ℎ) ⋆ 𝑓 = 𝑔 ⋆ (ℎ ⋆ 𝑓)) and there is some
element 1 such that 𝑔 ⋆ 1 = 𝑔 for all 𝑔, and for
every 𝑔 ∈ 𝔾 there exists an element 𝑔−1 such that
𝑔 ⋆ 𝑔−1 = 1.
• A group is called commutative (also known as
Abelian) if 𝑔 ⋆ ℎ = ℎ ⋆ 𝑔 for all 𝑔, ℎ ∈ 𝔾.

The Fourier transform is a deep and vast topic, on which we will


barely touch upon here. Over the real numbers, the Fourier trans-
form of a function 𝑓 is obtained by expressing 𝑓 in the form $\sum_\alpha \hat{f}(\alpha)\chi_\alpha$
where the $\chi_\alpha$'s are “wave functions” (e.g. sines and cosines). How-
ever, it turns out that the same notion exists for every Abelian group 𝔾.

Specifically, for every such group 𝔾, if 𝑓 is a function mapping 𝔾 to ℂ,


then we can write 𝑓 as

$$f = \sum_{g \in \mathbb{G}} \hat{f}(g)\chi_g , \qquad (23.2)$$

where the $\chi_g$'s are functions mapping 𝔾 to ℂ that are analogs of
the “wave functions” for the group 𝔾 and for every 𝑔 ∈ 𝔾, $\hat{f}(g)$ is a
complex number known as the Fourier coefficient of 𝑓 corresponding to
𝑔.19 The representation (23.2) is known as the Fourier expansion or
Fourier transform of 𝑓, the numbers $(\hat{f}(g))_{g \in \mathbb{G}}$ are known as the Fourier
coefficients of 𝑓 and the functions $(\chi_g)_{g \in \mathbb{G}}$ are known as the Fourier
characters. The central property of the Fourier characters is that they
are homomorphisms of the group into the complex numbers, in the
sense that for every 𝑥, 𝑥′ ∈ 𝔾, $\chi_g(x \star x') = \chi_g(x)\chi_g(x')$, where ⋆ is
the group operation. One corollary of this property is that if $\chi_g(h) = 1$
then $\chi_g$ is ℎ-periodic in the sense that $\chi_g(x \star h) = \chi_g(x)$ for every 𝑥.
It turns out that if 𝑓 is periodic with minimal period ℎ, then the only
Fourier characters that have non zero coefficients in the expression
(23.2) are those that are ℎ-periodic as well. This can be used to recover
the period of 𝑓 from its Fourier expansion.
19 The equation (23.2) means that if we think of 𝑓 as a |𝔾| dimensional vector over the complex numbers, then we can write this vector as a sum (with certain coefficients) of the vectors $\{\chi_g\}_{g \in \mathbb{G}}$.

23.11.1 Quantum Fourier Transform over the Boolean Cube: Simon’s


Algorithm
We now describe the simplest setting of the Quantum Fourier Trans-
form: the group {0, 1}𝑛 with the XOR operation, which we’ll de-
note by ({0, 1}𝑛 , ⊕). It can be shown that the Fourier transform over
({0, 1}^𝑛, ⊕) corresponds to expressing $f : \{0,1\}^n \to \mathbb{C}$ as

$$f = \sum_{y \in \{0,1\}^n} \hat{f}(y)\chi_y$$

where $\chi_y : \{0,1\}^n \to \mathbb{C}$ is defined as $\chi_y(x) = (-1)^{\sum_i y_i x_i}$ and
$\hat{f}(y) = \tfrac{1}{\sqrt{2^n}} \sum_{x \in \{0,1\}^n} f(x)(-1)^{\sum_i y_i x_i}$.
The Quantum Fourier Transform over ({0, 1}^𝑛, ⊕) is actually quite
simple:

Theorem 23.15 — QFT Over the Boolean Cube. Let $\rho = \sum_{x \in \{0,1\}^n} f(x)|x\rangle$
be a quantum state where $f : \{0,1\}^n \to \mathbb{C}$ is some function satisfy-
ing $\sum_{x \in \{0,1\}^n} |f(x)|^2 = 1$. Then we can use 𝑛 gates to transform 𝜌 to
the state

$$\sum_{y \in \{0,1\}^n} \hat{f}(y)|y\rangle$$

where $f = \sum_y \hat{f}(y)\chi_y$ and $\chi_y : \{0,1\}^n \to \mathbb{C}$ is the function
$\chi_y(x) = (-1)^{\sum x_i y_i}$.

Proof Idea:
The idea behind the proof is that the Hadamard operation corre-
sponds to the Fourier transform over the group {0, 1}𝑛 (with the XOR
operations). To show this, we just need to do the calculations.

Proof of Theorem 23.15. We can express the Hadamard operation HAD


as follows:

$$\mathrm{HAD}|a\rangle = \tfrac{1}{\sqrt{2}}\left(|0\rangle + (-1)^a|1\rangle\right) .$$

We are given the state

$$\rho = \sum_{x \in \{0,1\}^n} f(x)|x\rangle .$$

Now suppose that we apply the HAD operation to each of the 𝑛
qubits. We can see that we get the state

$$2^{-n/2} \sum_{x \in \{0,1\}^n} f(x) \prod_{i=0}^{n-1} \left(|0\rangle + (-1)^{x_i}|1\rangle\right) .$$

We can now use the distributive law and open up a term of the
form

$$f(x)\left(|0\rangle + (-1)^{x_0}|1\rangle\right) \cdots \left(|0\rangle + (-1)^{x_{n-1}}|1\rangle\right)$$

to the following sum over $2^n$ terms:20

$$f(x) \sum_{y \in \{0,1\}^n} (-1)^{\sum y_i x_i}|y\rangle .$$

But by changing the order of summations, we see that the final state
is

$$\sum_{y \in \{0,1\}^n} 2^{-n/2} \left( \sum_{x \in \{0,1\}^n} f(x)(-1)^{\sum x_i y_i} \right) |y\rangle$$

which exactly corresponds to 𝜌̂.
20 If you find this confusing, try to work out why $(|0\rangle + (-1)^{x_0}|1\rangle)(|0\rangle + (-1)^{x_1}|1\rangle)(|0\rangle + (-1)^{x_2}|1\rangle)$ is the same as the sum over $2^3$ terms $|000\rangle + (-1)^{x_2}|001\rangle + \cdots + (-1)^{x_0+x_1+x_2}|111\rangle$.
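As a numerical check (our own, not part of the proof), the following numpy snippet verifies that applying a Hadamard gate to each of 𝑛 qubits maps the vector of values 𝑓(𝑥) to the vector of Fourier coefficients 𝑓̂(𝑦) defined above.

```python
import numpy as np
from functools import reduce

n = 3
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Hn = reduce(np.kron, [H] * n)               # Hadamard applied to each of the n qubits

rng = np.random.default_rng(0)
f = rng.normal(size=2**n)
f = f / np.linalg.norm(f)                    # a valid quantum state (unit norm)

# Fourier coefficients over ({0,1}^n, XOR), computed directly from the definition.
fhat = np.array([
    sum(f[x] * (-1) ** bin(x & y).count("1") for x in range(2**n))
    for y in range(2**n)
]) / np.sqrt(2**n)

assert np.allclose(Hn @ f, fhat)   # applying H to every qubit equals the Fourier transform
```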


23.11.2 From Fourier to Period finding: Simon’s Algorithm (advanced,


optional)
Using Theorem 23.15 it is not hard to get an algorithm that can recover
a string ℎ∗ ∈ {0, 1}𝑛 given a circuit that computes a function 𝐹 ∶
{0, 1}𝑛 → {0, 1}∗ that is ℎ∗ periodic in the sense that 𝐹 (𝑥) = 𝐹 (𝑥′ ) for
q ua n tu m comp u ti ng 643

distinct 𝑥, 𝑥′ if and only if 𝑥′ = 𝑥 ⊕ ℎ∗ . The key observation is that if


we compute the state ∑𝑥∈{0,1}𝑛 |𝑥⟩|𝐹 (𝑥)⟩, and perform the Quantum
Fourier transform on the first 𝑛 qubits, then we would get a state such
that the only basis elements with nonzero coefficients would be of the
form |𝑦⟩ where

$$\sum_i y_i h^*_i = 0 \ (\mathrm{mod}\ 2) \qquad (23.3)$$


So, by measuring the state, we can obtain a sample of a random 𝑦
satisfying (23.3). But since (23.3) is a linear equation modulo 2 about
the unknown 𝑛 variables ℎ∗0 , … , ℎ∗𝑛−1 , if we repeat this procedure
to get 𝑛 such equations, we will have at least as many equations as
variables and (it can be shown that) this will suffice to recover ℎ∗ .
This result is known as Simon’s Algorithm, and it preceded and
inspired Shor’s algorithm.

23.11.3 From Simon to Shor (advanced, optional)


Theorem 23.15 seemed to really use the special bit-wise structure of
the group {0, 1}𝑛 , and so one could wonder if it can be extended to
other groups. However, it turns out that we can in fact achieve such a
generalization.
The key step in Shor’s algorithm is to implement the Fourier trans-
form for the group ℤ𝐿 which is the set of numbers {0, … , 𝐿 − 1} with
the operation being addition modulo 𝐿. In this case it turns out that
the Fourier characters are the functions $\chi_y(x) = \omega^{yx}$ where $\omega = e^{2\pi i/L}$
($i$ here denotes the complex number $\sqrt{-1}$). The 𝑦-th Fourier coeffi-
cient of a function $f : \mathbb{Z}_L \to \mathbb{C}$ is

$$\hat{f}(y) = \tfrac{1}{\sqrt{L}} \sum_{x \in \mathbb{Z}_L} f(x)\omega^{xy} . \qquad (23.4)$$

The key to implementing the Quantum Fourier Transform for such


groups is to use the same recursive equations that enable the classical
Fast Fourier Transform (FFT) algorithm. Specifically, consider the case
that 𝐿 = 2ℓ . We can separate the sum over 𝑥 in (23.4) to the terms
corresponding to even 𝑥’s (of the form 𝑥 = 2𝑧) and odd 𝑥’s (of the
form 𝑥 = 2𝑧 + 1) to obtain

$$\hat{f}(y) = \tfrac{1}{\sqrt{L}} \sum_{z \in \mathbb{Z}_{L/2}} f(2z)(\omega^2)^{yz} + \tfrac{\omega^y}{\sqrt{L}} \sum_{z \in \mathbb{Z}_{L/2}} f(2z+1)(\omega^2)^{yz} \qquad (23.5)$$

which reduces computing the Fourier transform of 𝑓 over the group


ℤ2ℓ to computing the Fourier transform of the functions 𝑓𝑒𝑣𝑒𝑛 and
$f_{odd}$ (corresponding to applying 𝑓 to only the even and odd 𝑥's
respectively) which have 2ℓ−1 inputs that we can identify with the
group ℤ2ℓ−1 = ℤ𝐿/2 .

Specifically, the Fourier characters of the group ℤ𝐿/2 are the func-
tions $\chi_y(x) = e^{2\pi i yx/(L/2)} = (\omega^2)^{yx}$ for every $x, y \in \mathbb{Z}_{L/2}$. Moreover,
since $\omega^L = 1$, $(\omega^2)^y = (\omega^2)^{y \bmod L/2}$ for every $y \in \mathbb{N}$. Thus (23.5)
translates into

$$\hat{f}(y) = \hat{f}_{even}(y \bmod L/2) + \omega^y \hat{f}_{odd}(y \bmod L/2) .$$

This observation is usually used to obtain a fast (e.g. 𝑂(𝐿 log 𝐿) time)
algorithm for the Fourier transform in a classical setting, but it can
be used to obtain a quantum circuit of 𝑝𝑜𝑙𝑦(log 𝐿) gates to transform a
state of the form $\sum_{x \in \mathbb{Z}_L} f(x)|x\rangle$ to a state of the form $\sum_{y \in \mathbb{Z}_L} \hat{f}(y)|y\rangle$.
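For readers who have not seen it, here is a compact, purely classical sketch of this recursion (the Fast Fourier Transform) for 𝐿 a power of two; the quantum circuit implements essentially the same even/odd split with 𝑝𝑜𝑙𝑦(log 𝐿) gates. (This is our own illustration; the normalization by 1/√𝐿 is omitted for brevity.)

```python
import cmath

def fft(f):
    """Fourier coefficients of f over Z_L (L a power of two), via the even/odd recursion."""
    L = len(f)
    if L == 1:
        return f[:]
    even = fft(f[0::2])                      # transform of z -> f(2z)
    odd = fft(f[1::2])                       # transform of z -> f(2z+1)
    omega = cmath.exp(2j * cmath.pi / L)
    return [even[y % (L // 2)] + omega**y * odd[y % (L // 2)] for y in range(L)]

# Example: a function on Z_8 with period 4 has nonzero coefficients only at multiples of 2.
f = [1, 0, 2, 0, 1, 0, 2, 0]
print([round(abs(c), 3) for c in fft(f)])
```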
The case that 𝐿 is not an exact power of two causes some complica-
tions in both the classical case of the Fast Fourier Transform and the
quantum setting of Shor’s algorithm. However, it is possible to handle
these. The idea is that we can embed 𝑍𝐿 in the group ℤ𝐴⋅𝐿 for any
integer 𝐴, and we can find an integer 𝐴 such that 𝐴 ⋅ 𝐿 will be close
enough to a power of 2 (i.e., a number of the form 2𝑚 for some 𝑚), so
that if we do the Fourier transform over the group ℤ2𝑚 then we will
not introduce too many errors.

✓ Chapter Recap

• The state of an 𝑛-qubit quantum system can be


modeled as a 2𝑛 dimensional vector.
• An operation on the state corresponds to applying
a unitary matrix to this vector.
• Quantum circuits are obtained by composing basic
operations such as HAD and 𝑈𝑁𝐴𝑁𝐷 .
• We can use quantum circuits to define the classes
BQP/poly and BQP, which are the quantum analogs
of P/poly and BPP respectively.
• There are some problems for which the best known
quantum algorithm is exponentially faster than
the best known, but quantum computing is not a
panacea. In particular, as far as we know, quantum
computers could still require exponential time to
solve NP-complete problems such as SAT.

23.12 EXERCISES

R
Remark 23.16 — Disclaimer. Most of the exercises have
been written in the summer of 2018 and haven’t yet
been fully debugged. While I would prefer people
do not post online solutions to the exercises, I would
greatly appreciate if you let me know of any bugs. You
can do so by posting a GitHub issue about the exer-

cise, and optionally complement this with an email to


me with more details about the attempted solution.

Exercise 23.1 — Quantum and classical complexity class relations. Prove the
following relations between quantum complexity classes and classical
ones:

1. P/poly ⊆ BQP/poly.21

2. P ⊆ BQP.22

3. BPP ⊆ BQP.23

4. BQP ⊆ EXP.24

5. If SAT ∈ BQP then NP ⊆ BQP.25

21 Hint: You can use 𝑈_NAND to simulate NAND gates.
22 Hint: Use the alternative characterization of P as in Solved Exercise 13.4.
23 Hint: You can use the HAD gate to simulate a coin toss.
24 Hint: In exponential time simulating quantum computation boils down to matrix multiplication.
25 Hint: If a reduction can be implemented in P it can be implemented in BQP as well.

Exercise 23.2 — Discrete logarithm from order finding. Show a probabilistic
polynomial time classical algorithm that given an Abelian finite group
𝔾 (in the form of an algorithm that computes the group operation),
a generator 𝑔 for the group, and an element ℎ ∈ 𝔾, as well as access to a
black box that on input 𝑓 ∈ 𝔾 outputs the order of 𝑓 (the smallest 𝑎
such that $f^a = 1$), computes the discrete logarithm of ℎ with respect to
𝑔. That is, the algorithm should output a number 𝑥 such that $g^x = h$.
See footnote for hint.26
■
26 We are given $h = g^x$ and need to recover 𝑥. To do so we can compute the order of various elements of the form $h^a g^b$. The order of such an element is a number 𝑐 satisfying $c(xa + b) = 0$ (mod |𝔾|). With a few random examples we will get a non trivial equation on 𝑥 (where 𝑐 is not zero modulo |𝔾|) and then we can use our knowledge of 𝑎, 𝑏, 𝑐 to recover 𝑥.
23.13 BIBLIOGRAPHICAL NOTES
Chapters 9 and 10 in the book Quantum Computing Since Democritus
give an informal but highly informative introduction to the topics
of this lecture and much more. Shor’s and Simon’s algorithms are
also covered in Chapter 10 of my book with Arora on computational
complexity.
There are many excellent videos available online covering some
of these materials. The Fourier transform is covered in these videos of
Dr. Chris Geoscience, Clare Zhang and Vi Hart. More specifically to
quantum computing, the videos of Umesh Vazirani on the Quantum
Fourier Transform and Kelsey Houston-Edwards on Shor’s Algorithm
are highly recommended.
Chapter 10 in Avi Wigderson’s book gives a high level overview of
quantum computing. Andrew Childs’ lecture notes on quantum algo-
rithms, as well as the lecture notes of Umesh Vazirani, John Preskill,
and John Watrous, are also excellent resources for these topics.

Regarding quantum mechanics in general, this video illustrates


the double slit experiment, this Scientific American video is a nice ex-
position of Bell’s Theorem. This talk and panel moderated by Brian
Greene discusses some of the philosophical and technical issues
around quantum mechanics and its so called “measurement prob-
lem”. The Feynman lectures on the Fourier Transform and quantum
mechanics in general are very much worth reading.
The Fast Fourier Transform, used as a component in Shor’s algo-
rithm, is one of the most useful algorithms across many applications
areas. The stories of its discovery by Gauss in trying to calculate aster-
oid orbits and rediscovery by Tukey during the cold war are fascinat-
ing as well.

23.14 FURTHER EXPLORATIONS


Some topics related to this chapter that might be accessible to ad-
vanced students include: (to be completed)

23.15 ACKNOWLEDGEMENTS
Thanks to Scott Aaronson for many helpful comments about this
chapter.
VI
APPENDICES
Bibliography

[Aar05] Scott Aaronson. “NP-complete problems and physical


reality”. In: ACM Sigact News 36.1 (2005). Available on
https://arxiv.org/abs/quant-ph/0502072, pp. 30–52.
[Aar13] Scott Aaronson. Quantum computing since Democritus.
Cambridge University Press, 2013.
[Aar16] Scott Aaronson. “P =? NP”. In: Open problems in mathe-
matics. Available on https://www.scottaaronson.com/
papers/pnp.pdf. Springer, 2016, pp. 1–122.
[Aar20] Scott Aaronson. “The Busy Beaver Frontier”. In: SIGACT
News (2020). Available on https://www.scottaaronson.
com/papers/bb.pdf.
[AKS04] Manindra Agrawal, Neeraj Kayal, and Nitin Saxena.
“PRIMES is in P”. In: Annals of mathematics (2004),
pp. 781–793.
[Asp18] James Aspens. Notes on Discrete Mathematics. Online
textbook for CS 202. Available on http://www.cs.yale.
edu/homes/aspnes/classes/202/notes.pdf. 2018.
[Bar84] H. P. Barendregt. The lambda calculus : its syntax and
semantics. Amsterdam New York New York, N.Y: North-
Holland Sole distributors for the U.S.A. and Canada,
Elsevier Science Pub. Co, 1984. i sbn : 0444875085.
[Bjo14] Andreas Bjorklund. “Determinant sums for undirected
hamiltonicity”. In: SIAM Journal on Computing 43.1
(2014), pp. 280–299.
[Blä13] Markus Bläser. Fast Matrix Multiplication. Graduate Sur-
veys 5. Theory of Computing Library, 2013, pp. 1–60.
DOI: 10.4086/toc.gs.2013.005. URL: http://www.theoryofcomputing.org/library.html.
[Boo47] George Boole. The mathematical analysis of logic. Philo-
sophical Library, 1847.




[BU11] David Buchfuhrer and Christopher Umans. “The com-


plexity of Boolean formula minimization”. In: Journal of
Computer and System Sciences 77.1 (2011), pp. 142–153.
[Bur78] Arthur W Burks. “Booke review: Charles S. Peirce, The
new elements of mathematics”. In: Bulletin of the Ameri-
can Mathematical Society 84.5 (1978), pp. 913–918.
[Cho56] Noam Chomsky. “Three models for the description of
language”. In: IRE Transactions on information theory 2.3
(1956), pp. 113–124.
[Chu41] Alonzo Church. The calculi of lambda-conversion. Prince-
ton London: Princeton University Press H. Milford,
Oxford University Press, 1941. i sbn : 978-0-691-08394-0.
[CM00] Bruce Collier and James MacLachlan. Charles Babbage:
And the engines of perfection. Oxford University Press,
2000.
[Coh08] Paul Cohen. Set theory and the continuum hypothesis. Mine-
ola, N.Y: Dover Publications, 2008. i sbn : 0486469212.
[Coh81] Danny Cohen. “On holy wars and a plea for peace”. In:
Computer 14.10 (1981), pp. 48–54.
[Coo66] SA Cook. “On the minimum computation time for mul-
tiplication”. In: Doctoral dissertation, Harvard University
Department of Mathematics, Cambridge, Mass. 1 (1966).
[Coo87] James W Cooley. “The re-discovery of the fast Fourier
transform algorithm”. In: Microchimica Acta 93.1-6
(1987), pp. 33–45.
[Cor+09] Thomas H. Cormen et al. Introduction to Algorithms, 3rd
Edition. MIT Press, 2009. i sbn : 978-0-262-03384-8. u rl :
http://mitpress.mit.edu/books/introduction-algorithms.
[CR73] Stephen A Cook and Robert A Reckhow. “Time bounded
random access machines”. In: Journal of Computer and
System Sciences 7.4 (1973), pp. 354–375.
[CRT06] Emmanuel J Candes, Justin K Romberg, and Terence Tao.
“Stable signal recovery from incomplete and inaccurate
measurements”. In: Communications on Pure and Applied
Mathematics: A Journal Issued by the Courant Institute of
Mathematical Sciences 59.8 (2006), pp. 1207–1223.
[CT06] Thomas M. Cover and Joy A. Thomas. “Elements of
information theory 2nd edition”. In: Willey-Interscience:
NJ (2006).
[Dau90] Joseph Warren Dauben. Georg Cantor: His mathematics and
philosophy of the infinite. Princeton University Press, 1990.

[De 47] Augustus De Morgan. Formal logic: or, the calculus of


inference, necessary and probable. Taylor and Walton, 1847.
[Don06] David L Donoho. “Compressed sensing”. In: IEEE Trans-
actions on information theory 52.4 (2006), pp. 1289–1306.
[DPV08] Sanjoy Dasgupta, Christos H. Papadimitriou, and Umesh
V. Vazirani. Algorithms. McGraw-Hill, 2008. i sbn : 978-0-
07-352340-8.
[Ell10] Jordan Ellenberg. “Fill in the blanks: Using math to turn
lo-res datasets into hi-res samples”. In: Wired Magazine
18.3 (2010), pp. 501–509.
[Ern09] Elizabeth Ann Ernst. “Optimal combinational multi-
level logic synthesis”. Available on https://deepblue.lib.umich.edu/handle/2027.42/62373. PhD thesis.
University of Michigan, 2009.
[Fle18] Margaret M. Fleck. Building Blocks for Theoretical Com-
puter Science. Online book, available at http://mfleck.cs.
illinois.edu/building-blocks/. 2018.
[Für07] Martin Fürer. “Faster integer multiplication”. In: Pro-
ceedings of the 39th Annual ACM Symposium on Theory of
Computing, San Diego, California, USA, June 11-13, 2007.
2007, pp. 57–66.
[FW93] Michael L Fredman and Dan E Willard. “Surpassing
the information theoretic bound with fusion trees”.
In: Journal of computer and system sciences 47.3 (1993),
pp. 424–436.
[GKS17] Daniel Günther, Ágnes Kiss, and Thomas Schneider.
“More efficient universal circuit constructions”. In: Inter-
national Conference on the Theory and Application of Cryp-
tology and Information Security. Springer. 2017, pp. 443–
470.
[Gom+08] Carla P Gomes et al. “Satisfiability solvers”. In: Founda-
tions of Artificial Intelligence 3 (2008), pp. 89–134.
[Gra05] Judith Grabiner. The origins of Cauchy’s rigorous cal-
culus. Mineola, N.Y: Dover Publications, 2005. i sbn :
9780486438153.
[Gra83] Judith V Grabiner. “Who gave you the epsilon? Cauchy
and the origins of rigorous calculus”. In: The American
Mathematical Monthly 90.3 (1983), pp. 185–194.
[Hag98] Torben Hagerup. “Sorting and searching on the word
RAM”. In: Annual Symposium on Theoretical Aspects of
Computer Science. Springer. 1998, pp. 366–398.

[Hal60] Paul R Halmos. Naive set theory. Republished in 2017 by


Courier Dover Publications. 1960.
[Har41] GH Hardy. “A Mathematician’s Apology”. In: (1941).
[HJB85] Michael T Heideman, Don H Johnson, and C Sidney
Burrus. “Gauss and the history of the fast Fourier trans-
form”. In: Archive for history of exact sciences 34.3 (1985),
pp. 265–277.
[HMU14] John Hopcroft, Rajeev Motwani, and Jeffrey Ullman.
Introduction to automata theory, languages, and compu-
tation. Harlow, Essex: Pearson Education, 2014. i sbn :
1292039051.
[Hof99] Douglas Hofstadter. Gödel, Escher, Bach : an eternal golden
braid. New York: Basic Books, 1999. i sbn : 0465026567.
[Hol01] Jim Holt. “The Ada perplex: how Byron’s daughter came
to be celebrated as a cybervisionary”. In: The New Yorker
5 (2001), pp. 88–93.
[Hol18] Jim Holt. When Einstein walked with Gödel : excursions to
the edge of thought. New York: Farrar, Straus and Giroux,
2018. i sbn : 0374146705.
[HU69] John E Hopcroft and Jeffrey D Ullman. “Formal lan-
guages and their relation to automata”. In: (1969).
[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to
Automata Theory, Languages and Computation. Addison-
Wesley, 1979. i sbn : 0-201-02988-X.
[HV19] David Harvey and Joris Van Der Hoeven. “Integer multi-
plication in time O(n log n)”. working paper or preprint.
Mar. 2019. URL: https://hal.archives-ouvertes.fr/hal-02070778.
[Joh12] David S Johnson. “A brief history of NP-completeness,
1954–2012”. In: Documenta Mathematica (2012), pp. 359–
376.
[Juk12] Stasys Jukna. Boolean function complexity: advances and
frontiers. Vol. 27. Springer Science & Business Media,
2012.
[Kar+97] David R. Karger et al. “Consistent Hashing and Random
Trees: Distributed Caching Protocols for Relieving Hot
Spots on the World Wide Web”. In: Proceedings of the
Twenty-Ninth Annual ACM Symposium on the Theory of
Computing, El Paso, Texas, USA, May 4-6, 1997. 1997,
pp. 654–663. DOI: 10.1145/258533.258660. URL: https://doi.org/10.1145/258533.258660.

[Kar95] ANATOLII ALEXEEVICH Karatsuba. “The complexity


of computations”. In: Proceedings of the Steklov Institute of
Mathematics-Interperiodica Translation 211 (1995), pp. 169–
183.
[Kle91] Israel Kleiner. “Rigor and Proof in Mathematics: A
Historical Perspective”. In: Mathematics Magazine 64.5
(1991), pp. 291–314. i ssn : 0025570X, 19300980. u rl :
http://www.jstor.org/stable/2690647.
[Kle99] Jon M Kleinberg. “Authoritative sources in a hyper-
linked environment”. In: Journal of the ACM (JACM) 46.5
(1999), pp. 604–632.
[Koz97] Dexter Kozen. Automata and computability. New York:
Springer, 1997. i sbn : 978-3-642-85706-5.
[KT06] Jon M. Kleinberg and Éva Tardos. Algorithm design.
Addison-Wesley, 2006. i sbn : 978-0-321-37291-8.
[Kun18] Jeremy Kun. A programmer’s introduction to mathematics.
Middletown, DE: CreateSpace Independent Publishing
Platform, 2018. i sbn : 1727125452.
[Liv05] Mario Livio. The equation that couldn’t be solved : how
mathematical genius discovered the language of symmetry.
New York: Simon & Schuster, 2005. i sbn : 0743258207.
[LLM18] Eric Lehman, Thomson F. Leighton, and Albert R. Meyer.
Mathematics for Computer Science. 2018.
[LMS16] Helger Lipmaa, Payman Mohassel, and Seyed Saeed
Sadeghian. “Valiant’s Universal Circuit: Improvements,
Implementation, and Applications.” In: IACR Cryptology
ePrint Archive 2016 (2016), p. 17.
[Lup58] O Lupanov. “A circuit synthesis method”. In: Izv. Vuzov,
Radiofizika 1.1 (1958), pp. 120–130.
[Lup84] Oleg B. Lupanov. Asymptotic complexity bounds for control
circuits. In Russian. MSU, 1984.
[Lus+08] Michael Lustig et al. “Compressed sensing MRI”. In:
IEEE signal processing magazine 25.2 (2008), pp. 72–82.
[Lüt02] Jesper Lützen. “Between Rigor and Applications: De-
velopments in the Concept of Function in Mathematical
Analysis”. In: The Cambridge History of Science. Ed. by
Mary Jo Nye. Vol. 5. The Cambridge History of
Science. Cambridge University Press, 2002, pp. 468–487.
DOI: 10.1017/CHOL9780521571999.026.
[LZ19] Harry Lewis and Rachel Zax. Essential Discrete Mathemat-
ics for Computer Science. Princeton University Press, 2019.
ISBN: 9780691190617.
[Maa85] Wolfgang Maass. “Combinatorial lower bound argu-
ments for deterministic and nondeterministic Turing
machines”. In: Transactions of the American Mathematical
Society 292.2 (1985), pp. 675–693.
[MM11] Cristopher Moore and Stephan Mertens. The nature of
computation. Oxford University Press, 2011.
[MP43] Warren S McCulloch and Walter Pitts. “A logical calcu-
lus of the ideas immanent in nervous activity”. In: The
bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.
[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized
algorithms. Cambridge university press, 1995.
[MU17] Michael Mitzenmacher and Eli Upfal. Probability and
computing: randomization and probabilistic techniques in
algorithms and data analysis. Cambridge university press,
2017.
[Neu45] John von Neumann. “First Draft of a Report on the ED-
VAC”. In: (1945). Reprinted in the IEEE Annals of the
History of Computing journal, 1993.
[NS05] Noam Nisan and Shimon Schocken. The elements of com-
puting systems: building a modern computer from first princi-
ples. MIT press, 2005.
[Pag+99] Lawrence Page et al. The PageRank citation ranking: Bring-
ing order to the web. Tech. rep. Stanford InfoLab, 1999.
[Pie02] Benjamin Pierce. Types and programming languages. Cam-
bridge, Mass: MIT Press, 2002. ISBN: 0262162091.
[Ric53] Henry Gordon Rice. “Classes of recursively enumerable
sets and their decision problems”. In: Transactions of the
American Mathematical Society 74.2 (1953), pp. 358–366.
[Rog96] Yurii Rogozhin. “Small universal Turing machines”. In:
Theoretical Computer Science 168.2 (1996), pp. 215–240.
[Ros19] Kenneth Rosen. Discrete mathematics and its applications.
New York, NY: McGraw-Hill, 2019. ISBN: 125967651X.
[RS59] Michael O Rabin and Dana Scott. “Finite automata and
their decision problems”. In: IBM journal of research and
development 3.2 (1959), pp. 114–125.
[Sav98] John E Savage. Models of computation. Vol. 136. Available
electronically at http://cs.brown.edu/people/jsavage/
book/. Addison-Wesley Reading, MA, 1998.
[Sch05] Alexander Schrijver. “On the history of combinatorial
optimization (till 1960)”. In: Handbooks in operations
research and management science 12 (2005), pp. 1–68.
[Sha38] Claude E Shannon. “A symbolic analysis of relay and
switching circuits”. In: Electrical Engineering 57.12 (1938),
pp. 713–723.
[Sha79] Adi Shamir. “Factoring numbers in O(log n) arith-
metic steps”. In: Information Processing Letters 8.1 (1979),
pp. 28–31.
[She03] Saharon Shelah. “Logical dreams”. In: Bulletin of the
American Mathematical Society 40.2 (2003), pp. 203–228.
[She13] Henry Maurice Sheffer. “A set of five independent pos-
tulates for Boolean algebras, with application to logical
constants”. In: Transactions of the American mathematical
society 14.4 (1913), pp. 481–488.
[She16] Margot Shetterly. Hidden figures : the American dream and
the untold story of the Black women mathematicians who
helped win the space race. New York, NY: William Morrow,
2016. ISBN: 9780062363602.
[Sin97] Simon Singh. Fermat’s enigma : the quest to solve the world’s
greatest mathematical problem. New York: Walker, 1997.
ISBN: 0385493622.
[Sip97] Michael Sipser. Introduction to the theory of computation.
PWS Publishing Company, 1997. ISBN: 978-0-534-94728-6.
[Sob17] Dava Sobel. The Glass Universe : How the Ladies of the
Harvard Observatory Took the Measure of the Stars. New
York: Penguin Books, 2017. ISBN: 9780143111344.
[Sol14] Daniel Solow. How to read and do proofs : an introduction
to mathematical thought processes. Hoboken, New Jersey:
John Wiley & Sons, Inc, 2014. ISBN: 9781118164020.
[SS71] Arnold Schönhage and Volker Strassen. “Schnelle Multi-
plikation großer Zahlen” (Fast multiplication of large
numbers). In: Computing 7.3-4 (1971),
pp. 281–292.
[Ste87] Dorothy Stein. Ada : a life and a legacy. Cambridge, Mass:
MIT Press, 1987. ISBN: 0262691167.
[Str69] Volker Strassen. “Gaussian elimination is not optimal”.
In: Numerische mathematik 13.4 (1969), pp. 354–356.
[Swa02] Doron Swade. The difference engine : Charles Babbage and
the quest to build the first computer. New York: Penguin
Books, 2002. ISBN: 0142001449.
[Tho84] Ken Thompson. “Reflections on trusting trust”. In: Com-
munications of the ACM 27.8 (1984), pp. 761–763.
[Too63] Andrei L Toom. “The complexity of a scheme of func-
tional elements realizing the multiplication of integers”.
In: Soviet Mathematics Doklady. Vol. 3. 4. 1963, pp. 714–
716.
[Tur37] Alan M Turing. “On computable numbers, with an ap-
plication to the Entscheidungsproblem”. In: Proceedings
of the London mathematical society 2.1 (1937), pp. 230–265.
[Vad+12] Salil P Vadhan et al. “Pseudorandomness”. In: Founda-
tions and Trends® in Theoretical Computer Science 7.1–3
(2012), pp. 1–336.
[Val76] Leslie G Valiant. “Universal circuits (preliminary re-
port)”. In: Proceedings of the eighth annual ACM sympo-
sium on Theory of computing. ACM. 1976, pp. 196–203.
[Wan57] Hao Wang. “A variant to Turing’s theory of computing
machines”. In: Journal of the ACM (JACM) 4.1 (1957),
pp. 63–92.
[Weg87] Ingo Wegener. The complexity of Boolean functions. Vol. 1.
BG Teubner Stuttgart, 1987.
[Wer74] P. J. Werbos. Beyond regression: New tools for prediction
and analysis in the behavioral sciences. Ph.D. thesis,
Harvard University, Cambridge, MA, 1974.
[Wig19] Avi Wigderson. Mathematics and Computation. Draft
available on https://www.math.ias.edu/avi/book.
Princeton University Press, 2019.
[Wil09] Ryan Williams. “Finding paths of length k in O*(2^k)
time”. In: Information Processing Letters 109.6 (2009),
pp. 315–318.
[WN09] Damien Woods and Turlough Neary. “The complex-
ity of small universal Turing machines: A survey”. In:
Theoretical Computer Science 410.4-5 (2009), pp. 443–450.
[WR12] Alfred North Whitehead and Bertrand Russell. Principia
mathematica. Vol. 2. University Press, 1912.