INTRODUCTION TO THEORETICAL COMPUTER SCIENCE
TEXTBOOK IN PREPARATION. AVAILABLE ON https://introtcs.org
Text available on https://github.com/boazbk/tcs - please post any issues there - thank you!
Preface
Preliminaries
0 Introduction
1 Mathematical Background
21 Cryptography
VI Appendices
Contents (detailed)
Preface
0.1 To the student
0.1.1 Is the effort worth it?
0.2 To potential instructors
0.3 Acknowledgements
Preliminaries
0 Introduction
0.1 Integer multiplication: an example of an algorithm
0.2 Extended Example: A faster way to multiply (optional)
0.3 Algorithms beyond arithmetic
0.4 On the importance of negative results
0.5 Roadmap to the rest of this book
0.5.1 Dependencies between chapters
0.6 Exercises
0.7 Bibliographical notes
1 Mathematical Background
1.1 This chapter: a reader's manual
1.2 A quick overview of mathematical prerequisites
1.3 Reading mathematical texts
1.3.1 Definitions
1.3.2 Assertions: Theorems, lemmas, claims
1.3.3 Proofs
1.4 Basic discrete math objects
1.4.1 Sets
1.4.2 Special sets
1.4.3 Functions
1.4.4 Graphs
1.4.5 Logic operators and quantifiers
1.4.6 Quantifiers for summations and products
1.4.7 Parsing formulas: bound and free variables
1.4.8 Asymptotics and Big-𝑂 notation
21 Cryptography
21.1 Classical cryptosystems
21.2 Defining encryption
21.3 Defining security of encryption
21.4 Perfect secrecy
VI Appendices
Preface
“We make ourselves no promises, but we cherish the hope that the unobstructed
pursuit of useless knowledge will prove to have consequences in the future
as in the past” … “An institution which sets free successive generations of
human souls is amply justified whether or not this graduate or that makes a
so-called useful contribution to human knowledge. A poem, a symphony, a
painting, a mathematical truth, a new scientific fact, all bear in themselves all
the justification that universities, colleges, and institutes of research need or
require”, Abraham Flexner, The Usefulness of Useless Knowledge, 1939.
“I suggest that you take the hardest courses that you can, because you learn
the most when you challenge yourself… CS 121 I found pretty hard.”, Mark
Zuckerberg, 2005.
• Actively notice which questions arise in your mind as you read the
text, and whether or not they are answered in the text.
very well be true, but the main benefit of this book is not in teaching
you any practical tool or technique, but instead in giving you a differ-
ent way of thinking: an ability to recognize computational phenomena
even when they occur in non-obvious settings, a way to model compu-
tational tasks and questions, and to reason about them.
Regardless of any use you will derive from this book, I believe
learning this material is important because it contains concepts that
are both beautiful and fundamental. The role that energy and matter
played in the 20th century is played in the 21st by computation and
information, not just as tools for our technology and economy, but also
as the basic building blocks we use to understand the world. This
book will give you a taste of some of the theory behind them, and
hopefully spark your curiosity to study more.
0.3 ACKNOWLEDGEMENTS
This text is continually evolving, and I am getting input from many
people, for which I am deeply grateful. Salil Vadhan co-taught with
me the first iteration of this course and gave me a tremendous amount
of useful feedback and insights during this process. Michele Amoretti
and Marika Swanberg carefully read several chapters of this text and
gave extremely helpful detailed comments. Dave Evans and Richard
Xu contributed many pull requests fixing errors and improving phrasing.
Thanks to Anil Ada, Venkat Guruswami, and Ryan O'Donnell for
helpful tips from their experience in teaching CMU 15-251.
Thanks to everyone that sent me comments, typo reports, or posted
issues or pull requests on the GitHub repository https://github.
com/boazbk/tcs. In particular I would like to acknowledge helpful
feedback from Scott Aaronson, Michele Amoretti, Aadi Bajpai, Mar-
guerite Basta, Anindya Basu, Sam Benkelman, Jarosław Błasiok, Emily
Chan, Christy Cheng, Michelle Chiang, Daniel Chiu, Chi-Ning Chou,
Michael Colavita, Rodrigo Daboin Sanchez, Robert Darley Waddilove,
Anlan Du, Juan Esteller, David Evans, Michael Fine, Simon Fischer,
Leor Fishman, Zaymon Foulds-Cook, William Fu, Kent Furuie, Piotr
Galuszka, Carolyn Ge, Mark Goldstein, Alexander Golovnev, Sayan
Goswami, Michael Haak, Rebecca Hao, Joosep Hook, Thomas HUET,
Emily Jia, Chan Kang, Nina Katz-Christy, Vidak Kazic, Eddie Kohler,
Estefania Lahera, Allison Lee, Benjamin Lee, Ondřej Lengál, Raymond
Lin, Emma Ling, Alex Lombardi, Lisa Lu, Aditya Mahadevan, Chris-
tian May, Jacob Meyerson, Leon Mlodzian, George Moe, Glenn Moss,
Hamish Nicholson, Owen Niles, Sandip Nirmel, Sebastian Oberhoff,
Thomas Orton, Joshua Pan, Pablo Parrilo, Juan Perdomo, Banks Pick-
ett, Aaron Sachs, Abdelrhman Saleh, Brian Sapozhnikov, Anthony
Scemama, Peter Schäfer, Josh Seides, Alaisha Sharma, Haneul Shin,
Noah Singer, Matthew Smedberg, Miguel Solano, Hikari Sorensen,
David Steurer, Alec Sun, Amol Surati, Everett Sussman, Marika Swan-
berg, Garrett Tanzer, Eric Thomas, Sarah Turnill, Salil Vadhan, Patrick
Watts, Jonah Weissman, Ryan Williams, Licheng Xu, Richard Xu, Wan-
qian Yang, Elizabeth Yeoh-Wang, Josh Zelinsky, Fred Zhang, Grace
Zhang, and Jessica Zhu.
I am using many open source software packages in the production
of these notes for which I am grateful. In particular, I am thankful to
Donald Knuth and Leslie Lamport for LaTeX and to John MacFarlane
for Pandoc. David Steurer wrote the original scripts to produce this
text. The current version uses Sergio Correia’s panflute. The templates
for the LaTeX and HTML versions are derived from Tufte LaTeX,
Gitbook and Bookdown. Thanks to Amy Hendrickson for some LaTeX
consulting. Juan Esteller and Gabe Montague initially implemented
the NAND* programming languages in OCaml and JavaScript. I used
the Jupyter project to write the supplemental code snippets.
Finally, I would like to thank my family: my wife Ravit, and my
children Alma and Goren. Working on this book (and the correspond-
ing course) took so much of my time that Alma wrote an essay for her
fifth-grade class saying that “universities should not pressure profes-
sors to work too much.” I’m afraid all I have to show for this effort is
600 pages of ultra-boring mathematical text.
PRELIMINARIES

Introduction

Learning Objectives:
• Introduce and motivate the study of computation for its own sake, irrespective of particular implementations.
• The notion of an algorithm and some of its history.
• Algorithms as not just tools, but also ways of thinking and understanding.
• Taste of Big-𝑂 analysis and the surprising creativity in the design of efficient algorithms.
Remark 0.3 — Specification, implementation and analysis of algorithms. A full description of an algorithm has three components: its specification (what the algorithm is supposed to do), its implementation (how the algorithm does it), and its analysis (why the algorithm achieves its specification).
since the numbers $\overline{x}, \underline{x}, \overline{y}, \underline{y}, \overline{x}+\underline{x}, \overline{y}+\underline{y}$ all have at most $m + 1 < n$ digits,
the induction hypothesis implies that the values $A, B, C$ computed
by the recursive calls will satisfy $A = \overline{x}\,\overline{y}$, $B = (\overline{x}+\underline{x})(\overline{y}+\underline{y})$ and
$C = \underline{x}\,\underline{y}$. Plugging this into (4), we see that $x \cdot y$ equals the value
$(10^{2m} - 10^{m}) \cdot A + 10^{m} \cdot B + (1 - 10^{m}) \cdot C$ computed by Algorithm 0.4.
■
Proof. Fig. 2 illustrates the idea behind the proof, which we only
sketch here, leaving filling out the details as Exercise 0.4. The proof
is again by induction. We define 𝑇(𝑛) to be the maximum number of
steps that Algorithm 0.4 takes on inputs of length at most 𝑛. Since in
the base case 𝑛 ≤ 2, Algorithm 0.4 performs a constant number of
computations, we know that 𝑇(2) ≤ 𝑐 for some constant 𝑐, and for 𝑛 > 2 it
satisfies the recursive equation

$T(n) \leq 3T(\lfloor n/2 \rfloor + 1) + c'n \qquad (5)$

for some constant 𝑐′ (using the fact that addition can be done in 𝑂(𝑛)
operations).
The recursive equation (5) solves to $O(n^{\log_2 3})$. The intuition behind
this is presented in Fig. 2, and it is also a consequence of the
so-called "Master Theorem" on recurrence relations. As mentioned
above, we leave completing the proof to the reader as Exercise 0.4.
■
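To make this concrete, here is a short Python sketch of the algorithm analyzed above; it illustrates the recursion and the identity, and is not necessarily the book's Algorithm 0.4 verbatim.

def karatsuba(x, y):
    # multiply nonnegative integers using three recursive multiplications
    if x < 10 or y < 10:
        return x * y
    m = min(len(str(x)), len(str(y))) // 2
    xbar, xlow = divmod(x, 10 ** m)   # x = 10^m * xbar + xlow
    ybar, ylow = divmod(y, 10 ** m)
    A = karatsuba(xbar, ybar)
    C = karatsuba(xlow, ylow)
    B = karatsuba(xbar + xlow, ybar + ylow)
    # x*y = (10^(2m) - 10^m)*A + 10^m*B + (1 - 10^m)*C
    return (10 ** (2 * m) - 10 ** m) * A + 10 ** m * B + (1 - 10 ** m) * C

assert karatsuba(4862, 1234) == 4862 * 1234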
Remark 0.7 — Matrix Multiplication (advanced note).
(This book contains many “advanced” or “optional”
notes and sections. These may assume background
that not every student has, and can be safely skipped
over as none of the future parts depends on them.)
Ideas similar to Karatsuba’s can be used to speed up
matrix multiplications as well. Matrices are a powerful
way to represent linear equations and operations,
widely used in a great many applications of scientific
computing, graphics, machine learning, and many more.
One of the basic operations one can do with
two matrices is to multiply them. For example,
if $x = \begin{pmatrix} x_{0,0} & x_{0,1} \\ x_{1,0} & x_{1,1} \end{pmatrix}$ and $y = \begin{pmatrix} y_{0,0} & y_{0,1} \\ y_{1,0} & y_{1,1} \end{pmatrix}$
then the product of $x$ and $y$ is the matrix
$\begin{pmatrix} x_{0,0}y_{0,0} + x_{0,1}y_{1,0} & x_{0,0}y_{0,1} + x_{0,1}y_{1,1} \\ x_{1,0}y_{0,0} + x_{1,1}y_{1,0} & x_{1,0}y_{0,1} + x_{1,1}y_{1,1} \end{pmatrix}$.
You can see that we can compute this matrix by eight products
of numbers.
Now suppose that 𝑛 is even and 𝑥 and 𝑦 are a pair of
𝑛 × 𝑛 matrices which we can think of as each com-
posed of four (𝑛/2) × (𝑛/2) blocks 𝑥0,0 , 𝑥0,1 , 𝑥1,0 , 𝑥1,1
and 𝑦0,0 , 𝑦0,1 , 𝑦1,0 , 𝑦1,1 . Then the formula for the matrix
product of 𝑥 and 𝑦 can be expressed in the same way
as above, just replacing products 𝑥𝑎,𝑏 𝑦𝑐,𝑑 with matrix
products, and addition with matrix addition. This
means that we can use the formula above to give an
algorithm that doubles the dimension of the matrices
at the expense of increasing the number of operations
by a factor of 8, which for $n = 2^\ell$ results in $8^\ell = n^3$
operations.
In 1969 Volker Strassen noted that we can compute
the product of a pair of two-by-two matrices using
only seven products of numbers by observing that
each entry of the matrix 𝑥𝑦 can be computed by
adding and subtracting the following seven terms:
$t_1 = (x_{0,0} + x_{1,1})(y_{0,0} + y_{1,1})$, $t_2 = (x_{1,0} + x_{1,1})y_{0,0}$,
$t_3 = x_{0,0}(y_{0,1} - y_{1,1})$, $t_4 = x_{1,1}(y_{1,0} - y_{0,0})$,
$t_5 = (x_{0,0} + x_{0,1})y_{1,1}$, $t_6 = (x_{1,0} - x_{0,0})(y_{0,0} + y_{0,1})$,
$t_7 = (x_{0,1} - x_{1,1})(y_{1,0} + y_{1,1})$. Indeed, one can verify that
$xy = \begin{pmatrix} t_1 + t_4 - t_5 + t_7 & t_3 + t_5 \\ t_2 + t_4 & t_1 + t_3 - t_2 + t_6 \end{pmatrix}$.
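One can verify this identity mechanically; the following Python snippet (an illustration, not part of the book) checks it on random 2 × 2 integer matrices.

import random

for _ in range(100):
    x = [[random.randint(-9, 9) for _ in range(2)] for _ in range(2)]
    y = [[random.randint(-9, 9) for _ in range(2)] for _ in range(2)]
    t1 = (x[0][0] + x[1][1]) * (y[0][0] + y[1][1])
    t2 = (x[1][0] + x[1][1]) * y[0][0]
    t3 = x[0][0] * (y[0][1] - y[1][1])
    t4 = x[1][1] * (y[1][0] - y[0][0])
    t5 = (x[0][0] + x[0][1]) * y[1][1]
    t6 = (x[1][0] - x[0][0]) * (y[0][0] + y[0][1])
    t7 = (x[0][1] - x[1][1]) * (y[1][0] + y[1][1])
    # Strassen's seven products recover all four entries of the product xy
    strassen = [[t1 + t4 - t5 + t7, t3 + t5], [t2 + t4, t1 + t3 - t2 + t6]]
    direct = [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
    assert strassen == direct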
Even for classical questions, studied through the ages, new dis-
coveries are still being made. For example, for the question of de-
termining whether a given integer is prime or composite, which has
been studied since the days of Pythagoras, efficient probabilistic algo-
rithms were only discovered in the 1970s, while the first deterministic
polynomial-time algorithm was only found in 2002. For the related
problem of actually finding the factors of a composite number, new
algorithms were found in the 1980s, and (as we’ll see later in this
course) discoveries in the 1990s raised the tantalizing prospect of
obtaining faster algorithms through the use of quantum mechanical
effects.
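To give a flavor of such probabilistic algorithms, here is a minimal Python sketch of a Fermat-style primality test; this is a simplification for illustration, not the specific algorithms referenced above (which refine this idea, as in the Miller-Rabin test).

import random

def probably_prime(n, trials=20):
    # A prime n satisfies pow(a, n-1, n) == 1 for every 1 < a < n (Fermat).
    # Most composites fail this for a random a; Carmichael numbers are the
    # rare exception, which is why practical tests refine this check.
    if n < 4:
        return n in (2, 3)
    return all(pow(random.randrange(2, n - 1), n - 1, n) == 1
               for _ in range(trials))

assert probably_prime(101) and not probably_prime(100)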
Despite all this progress, there are still many more questions than
answers in the world of algorithms. For almost all natural prob-
lems, we do not know whether the current algorithm is the “best”,
or whether a significantly better one is still waiting to be discovered.
As alluded to in Cobham's opening quote for this chapter, even for the
basic problem of multiplying numbers we have not yet answered the
question of whether there is a multiplication algorithm that is as ef-
ficient as our algorithms for addition. But at least we now know the
right way to ask it.
The book largely proceeds in linear order, with each chapter build-
ing on the previous ones, with the following exceptions:
• The topics of 𝜆 calculus (Section 8.5), Gödel's in-
completeness theorem (Chapter 11), Automata/regular expres-
sions and context-free grammars (Chapter 10), and space-bounded
computation (Chapter 17), are not used in the following chapters.
Hence you can choose whether to cover or skip any subset of them.
with minor modification. Boolean circuits are used in Part III (efficient
computation) for results such as P ⊆ P/poly and the Cook-Levin
Theorem, as well as in Part IV (for BPP ⊆ P/poly and derandom-
ization) and Part V (specifically in cryptography and quantum
computing).
A course based on this book can use all of Parts I, II, and III (possi-
bly skipping over some or all of the 𝜆 calculus, Chapter 11, Chapter 10
or Chapter 17), and then either cover all or some of Part IV (random-
ized computation), and add a “sprinkling” of advanced topics from
Part V based on student or instructor interest.
0.6 EXERCISES
Exercise 0.1 — Rank the significance of the following inventions in speed-
ing up multiplication of large (that is 100-digit or more) numbers.
That is, use “back of the envelope” estimates to order them in terms of
the speedup factor they offered over the previous state of affairs.
a. 𝑛 operations.
b. 𝑛2 operations.
c. 𝑛 log 𝑛 operations.
d. 2𝑛 operations.
e. 𝑛! operations.
b. Prove that the number of single-digit operations that Karatsuba's algorithm takes to multiply two 𝑛-digit numbers is at most $1000 n^{\log_2 3}$.¹

¹ Hint: Use a proof by induction: suppose that the claim is true for all 𝑛's from 1 to 𝑚, and prove that it is true also for 𝑚 + 1.
Mathematical Background

Learning Objectives:
• Transform an intuitive argument into a rigorous proof.
“I found that every number, which may be expressed from one to ten, surpasses
the preceding by one unit: afterwards the ten is doubled or tripled … until
a hundred; then the hundred is doubled and tripled in the same manner as
the units and the tens … and so forth to the utmost limit of numeration.”,
Muhammad ibn Mūsā al-Khwārizmī, 820, translation by Frederic Rosen,
1831.
the whole chapter. You can just take a quick look at Section 1.2 to see
the main tools we will use, Section 1.7 for our notation and conven-
tions, and then skip ahead to the rest of this book. Alternatively,
you can sit back, relax, and read this chapter just to get familiar
with our notation, as well as to enjoy (or not) my philosophical
musings and attempts at humor.
• If your background is less extensive, see Section 1.9 for some re-
sources on these topics. This chapter briefly covers the concepts
that we need, but you may find it helpful to see a more in-depth
treatment. As usual with math, the best way to get comfort with
this material is to work out exercises on your own.
• Proofs: First and foremost, this book involves a heavy dose of for-
mal mathematical reasoning, which includes mathematical defini-
tions, statements, and proofs.
In the rest of this chapter we briefly review the above notions. This
is partially to remind the reader and reinforce material that might
not be fresh in your mind, and partially to introduce our notation
and conventions which might occasionally differ from those you’ve
encountered before.
1.3.1 Definitions
Mathematicians often define new concepts in terms of old concepts.
For example, here is a mathematical definition which you may have
encountered in the past (and will see again shortly):
1.3.3 Proofs
Mathematical proofs are the arguments we use to demonstrate that our
theorems, lemmas, and claims are indeed true. We discuss proofs in
Section 1.5 below, but the main point is that the mathematical stan-
dard of proof is very high. Unlike in some other realms, in mathe-
matics a proof is an “airtight” argument that demonstrates that the
statement is true beyond a shadow of a doubt. Some examples in this
section for mathematical proofs are given in Solved Exercise 1.1 and
Section 1.6. As mentioned in the preface, as a general rule, it is more
important you understand the definitions than the theorems, and it is
more important you understand a theorem statement than its proof.
1.4.1 Sets
A set is an unordered collection of objects. For example, when we
write 𝑆 = {2, 4, 7}, we mean that 𝑆 denotes the set that contains the
numbers 2, 4, and 7. (We use the notation “2 ∈ 𝑆” to denote that 2 is
an element of 𝑆.) Note that the sets {2, 4, 7} and {7, 4, 2} are identical,
since they contain the same elements. Also, a set either contains an
element or does not contain it – there is no notion of containing it
“twice” – and so we could even write the same set 𝑆 as {2, 2, 4, 7}
(though that would be a little weird). The cardinality of a finite set 𝑆,
denoted by |𝑆|, is the number of elements it contains. (Cardinality can
be defined for infinite sets as well; see the sources in Section 1.9.) So,
in the example above, |𝑆| = 3. A set 𝑆 is a subset of a set 𝑇 , denoted
by 𝑆 ⊆ 𝑇 , if every element of 𝑆 is also an element of 𝑇 . (We can
also describe this by saying that 𝑇 is a superset of 𝑆.) For example,
{2, 7} ⊆ {2, 4, 7}. The set that contains no elements is known as the
empty set and it is denoted by ∅. If 𝐴 is a subset of 𝐵 that is not equal
to 𝐵 we say that 𝐴 is a strict subset of 𝐵, and denote this by 𝐴 ⊊ 𝐵.
We can define sets by either listing all their elements or by writing
down a rule that they satisfy, such as

$\text{EVEN} = \{ x \in \mathbb{N} \;:\; x = 2k \text{ for some } k \in \mathbb{N} \} .$

Of course there is more than one way to write the same set, and often
we will use intuitive notation listing a few examples that illustrate
the rule. For example, we can also define EVEN as

$\text{EVEN} = \{ 0, 2, 4, \ldots \} .$
The set ℕ = {0, 1, 2, …} (1.3)
contains all natural numbers, i.e., non-negative integers. For any natural
number 𝑛 ∈ ℕ, we define the set [𝑛] as {0, … , 𝑛 − 1} = {𝑘 ∈ ℕ ∶
𝑘 < 𝑛}. (We start our indexing of both ℕ and [𝑛] from 0, while many
other texts index those sets from 1. Starting from zero or one is simply
a convention that doesn’t make much difference, as long as one is
consistent about it.)
We will also occasionally use the set ℤ = {…, −2, −1, 0, +1, +2, …} of
(negative and non-negative) integers (the letter Z stands for the German
word "Zahlen", which means numbers), as well as the set ℝ of real
numbers. (This is the set that includes not just the integers, but also
fractional and irrational numbers; e.g., ℝ contains numbers such as
+0.5, −𝜋, etc.) We denote by ℝ+ the set {𝑥 ∈ ℝ ∶ 𝑥 > 0} of positive real
numbers. This set is sometimes also denoted as (0, ∞).
{0, 1}3 = {000, 001, 010, 011, 100, 101, 110, 111} . (1.5)
For every string 𝑥 ∈ {0, 1}𝑛 and 𝑖 ∈ [𝑛], we write 𝑥𝑖 for the 𝑖-th
element of 𝑥.
We will also often talk about the set of binary strings of all lengths,
which is

$\{0,1\}^* = \bigcup_{n \in \mathbb{N}} \{0,1\}^n ,$

and more generally, for every set Σ, we can write the set of all
finite-length strings over the alphabet Σ concisely as

$\Sigma^* = \bigcup_{n \in \mathbb{N}} \Sigma^n . \qquad (1.9)$
For example, if Σ = {𝑎, 𝑏, 𝑐, 𝑑, … , 𝑧} then Σ∗ denotes the set of all finite
length strings over the alphabet a-z.
1.4.3 Functions
If 𝑆 and 𝑇 are nonempty sets, a function 𝐹 mapping 𝑆 to 𝑇 , denoted
by 𝐹 ∶ 𝑆 → 𝑇 , associates with every element 𝑥 ∈ 𝑆 an element
𝐹 (𝑥) ∈ 𝑇 . The set 𝑆 is known as the domain of 𝐹 and the set 𝑇
is known as the codomain of 𝐹 . The image of a function 𝐹 is the set
{𝐹 (𝑥) | 𝑥 ∈ 𝑆} which is the subset of 𝐹 ’s codomain consisting of all
output elements that are mapped from some input. (Some texts use
range to denote the image of a function, while other texts use range
to denote the codomain of a function. Hence we will avoid using the
term “range” altogether.) As in the case of sets, we can write a func-
tion either by listing the table of all the values it gives for elements
in 𝑆 or by using a rule. For example if 𝑆 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and 𝑇 = {0, 1}, then the table below defines a function 𝐹 ∶ 𝑆 → 𝑇 .
Note that this function is the same as the function defined by the rule
𝐹(𝑥) = (𝑥 mod 2).²

² For two natural numbers 𝑥 and 𝑎, 𝑥 mod 𝑎 (shorthand for "modulo")
denotes the remainder of 𝑥 when it is divided by 𝑎. That is, it is the
number 𝑟 in {0, …, 𝑎 − 1} such that 𝑥 = 𝑎𝑘 + 𝑟 for some integer 𝑘. We
sometimes also use the notation 𝑥 = 𝑦 (mod 𝑎) to denote the assertion
that 𝑥 mod 𝑎 is the same as 𝑦 mod 𝑎.

Table 1.1: An example of a function.

Input   Output
0       0
1       1
2       0
3       1
4       0
5       1
6       0
7       1
8       0
9       1
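In code, the table and the rule give the same mathematical function; a quick illustrative check:

F = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 0, 9: 1}
assert all(F[x] == x % 2 for x in range(10))  # same function as x mod 2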
Basic facts about functions: Verifying that you can prove the following
results is an excellent way to brush up on functions:
• If 𝑆 and 𝑇 are finite sets then the following conditions are equivalent
to one another: (a) |𝑆| ≤ |𝑇|, (b) there is a one-to-one function
𝐹 ∶ 𝑆 → 𝑇, and (c) there is an onto function 𝐺 ∶ 𝑇 → 𝑆. (This is
actually true even for infinite 𝑆 and 𝑇: in that case (b) (or equivalently
(c)) is the commonly accepted definition for |𝑆| ≤ |𝑇|.)

Figure 1.4: We can represent finite functions as a directed graph where
we put an edge from 𝑥 to 𝑓(𝑥). The onto condition corresponds to
requiring that every vertex in the codomain of the function has in-degree
at least one. The one-to-one condition corresponds to requiring that
every vertex in the codomain of the function has in-degree at most one.
In the examples above 𝐹 is an onto function, 𝐺 is one to one, and 𝐻 is
neither onto nor one to one.

You can find the proofs of these results in many discrete math texts,
including for example, Section 4.5 in the Lehman-Leighton-Meyer notes.
However, I
are connected if either 𝑢 = 𝑣 or there is a path (𝑢0, …, 𝑢𝑘) where
𝑢0 = 𝑢 and 𝑢𝑘 = 𝑣. We say that the graph 𝐺 is connected if every pair
of vertices in it is connected.

Figure: An example of an undirected and a directed graph. The
undirected graph has vertex set {1, 2, 3, 4} and edge set
{{1, 2}, {2, 3}, {2, 4}}. The directed graph has vertex set {𝑎, 𝑏, 𝑐}
and edge set {(𝑎, 𝑏), (𝑏, 𝑐), (𝑐, 𝑎), (𝑎, 𝑐)}.
Here are some basic facts about undirected graphs. We give some
informal arguments below, but leave the full proofs as exercises (the
proofs can be found in many of the resources listed in Section 1.9).
Lemma 1.4 In any undirected graph 𝐺 = (𝑉 , 𝐸), the sum of the degrees
of all vertices is equal to twice the number of edges.
Lemma 1.4 can be shown by seeing that every edge {𝑢, 𝑣} con-
tributes twice to the sum of the degrees (once for 𝑢 and the second
time for 𝑣).
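For instance, here is a quick Python check of Lemma 1.4 on the example graph from the figure above (an illustration, not a proof):

V = {1, 2, 3, 4}
E = [{1, 2}, {2, 3}, {2, 4}]
degree = {v: sum(v in e for e in E) for v in V}
# degrees are 1, 3, 1, 1: their sum (6) is twice the number of edges (3)
assert sum(degree.values()) == 2 * len(E)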
Lemma 1.5 The connectivity relation is transitive, in the sense that if 𝑢 is
connected to 𝑣, and 𝑣 is connected to 𝑤, then 𝑢 is connected to 𝑤.
Lemma 1.5 can be shown by simply attaching a path of the form
(𝑢, 𝑢1 , 𝑢2 , … , 𝑢𝑘−1 , 𝑣) to a path of the form (𝑣, 𝑢′1 , … , 𝑢′𝑘′ −1 , 𝑤) to obtain
the path (𝑢, 𝑢1 , … , 𝑢𝑘−1 , 𝑣, 𝑢′1 , … , 𝑢′𝑘′ −1 , 𝑤) that connects 𝑢 to 𝑤.
Lemma 1.6 For every undirected graph 𝐺 = (𝑉 , 𝐸) and connected pair
𝑢, 𝑣, the shortest path from 𝑢 to 𝑣 is simple. In particular, for every
connected pair there exists a simple path that connects them.
Lemma 1.6 can be shown by "shortcutting" any non-simple path
from 𝑢 to 𝑣 in which the same vertex 𝑤 appears twice, removing the
loop between the two occurrences (see Fig. 1.6). It is a good exercise
to transform this intuitive reasoning into a formal proof:
Solved Exercise 1.1 — Connected vertices have simple paths. Prove Lemma 1.6.
■
Solution:
The proof follows the idea illustrated in Fig. 1.6. One complica-
tion is that there can be more than one vertex that is visited twice
by a path, and so “shortcutting” might not necessarily result in a
Remark 1.7 — Finding proofs. Solved Exercise 1.1 is a
good example of the process of finding a proof. You
start by ensuring you understand what the statement
means, and then come up with an informal argument
why it should be true. You then transform the infor-
mal argument into a rigorous proof. This proof need
not be very long or overly formal, but should clearly
establish why the conclusion of the statement follows
from its assumptions.
graph is a tuple $(u_0, \ldots, u_k) \in V^{k+1}$, for some 𝑘 > 0, such that 𝑢𝑖+1 is an
out-neighbor of 𝑢𝑖 for every 𝑖 ∈ [𝑘]. As in the undirected case, a simple
path is a path (𝑢0 , … , 𝑢𝑘−1 ) where all the 𝑢𝑖 ’s are distinct and a cycle
is a path (𝑢0, …, 𝑢𝑘) where 𝑢0 = 𝑢𝑘. One type of directed graph we
often care about is the directed acyclic graph (DAG), which, as its name
implies, is a directed graph without any cycles:
Remark 1.13 — Labeled graphs. For some applications
we will consider labeled graphs, where the vertices or
edges have associated labels (which can be numbers,
strings, or members of some other set). We can think
of such a graph as having an associated (possibly
partial) labelling function 𝐿 ∶ 𝑉 ∪ 𝐸 → ℒ, where ℒ is
the set of potential labels. However we will typically
not refer explicitly to this labeling function and simply
say things such as “vertex 𝑣 has the label 𝛼”.
For example, the sum of the squares of all numbers from 1 to 100
can be written as

$\sum_{i \in \{1,\ldots,100\}} i^2 , \qquad (1.13)$

or equivalently as

$\sum_{i=1}^{100} i^2 . \qquad (1.14)$
∃𝑎,𝑏∈ℕ (𝑎 ≠ 1) ∧ (𝑎 ≠ 𝑛) ∧ (𝑛 = 𝑎 × 𝑏) (1.15)
Since 𝑛 is free, it can be set to any value, and the truth of the state-
ment (1.15) depends on the value of 𝑛. For example, if 𝑛 = 8 then
(1.15) is true, but for 𝑛 = 11 it is false. (Can you see why?)
The same issue appears when parsing code. For example, in the
following snippet from the C programming language
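An illustrative example of such a snippet (a stand-in using the variable names from the discussion):

for (int i = 0; i < n; i++) {
    printf("*");   /* i is bound by the for loop; n is free */
}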
the variable i is bound within the for block but the variable n is
free.
The main property of bound variables is that we can rename them
(as long as the new name doesn’t conflict with another used variable)
without changing the meaning of the statement. Thus for example the
statement
∃𝑥,𝑦∈ℕ (𝑥 ≠ 1) ∧ (𝑥 ≠ 𝑛) ∧ (𝑛 = 𝑥 × 𝑦) (1.16)
is equivalent to (1.15) in the sense that it is true for exactly the same
set of 𝑛’s.
Similarly, the code
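for (int j = 0; j < n; j++) {
    printf("*");   /* same illustrative loop, with j in place of i */
}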
produces the same result as the code above that used i instead of j.
Remark 1.14 — Aside: mathematical vs programming no-
tation. Mathematical notation has a lot of similarities
with programming language, and for the same rea-
sons. Both are formalisms meant to convey complex
concepts in a precise way. However, there are some
cultural differences. In programming languages, we
often try to use meaningful variable names such as
NumberOfVertices while in math we often use short
identifiers such as 𝑛. Part of it might have to do with
the tradition of mathematical proofs as being hand-
written and verbally presented, as opposed to typed
up and compiled. Another reason is if the wrong
variable name is used in a proof, at worst is causes
confusion to readers; when the wrong variable name
texts write 𝐹 ∈ 𝑂(𝐺) instead of 𝐹 = 𝑂(𝐺), but we will not use this
notation.) Despite the misleading equality sign, you should remember
that a statement such as 𝐹 = 𝑂(𝐺) means that 𝐹 is “at most” 𝐺 in
some rough sense when we ignore constants, and a statement such as
𝐹 = Ω(𝐺) means that 𝐹 is “at least” 𝐺 in the same rough sense.
• When adding two functions, we only care about the larger one. For
example, for the purpose of 𝑂-notation, 𝑛3 + 100𝑛2 is the same as
𝑛3 , and in general in any polynomial, we only care about the larger
exponent.
every two constants 𝑎 > 0 and 𝜖 > 0, even if 𝜖 is much smaller than
𝑎. For example, $100n^{100} = o(2^{\sqrt{n}})$.
Remark 1.16 — Big 𝑂 for other applications (optional).
While Big-𝑂 notation is often used to analyze running
time of algorithms, this is by no means the only ap-
plication. We can use 𝑂 notation to bound asymptotic
relations between any functions mapping integers
to positive numbers. It can be used regardless of
whether these functions are a measure of running
time, memory usage, or any other quantity that may
have nothing to do with computation. Here is one
example which is unrelated to this book (and hence
one that you can feel free to skip): one way to state the
Riemann Hypothesis (one of the most famous open
questions in mathematics) is that it corresponds to
the conjecture that the number of primes between 0
and 𝑛 is equal to $\int_{2}^{n} \frac{1}{\ln x}\, dx$ up to an additive error of
magnitude at most $O(\sqrt{n} \log n)$.
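A small numerical illustration of this remark (the helper functions are our own, not from the book):

import math

def prime_count(n):
    # count the primes up to n by trial division (fine for small n)
    return sum(all(p % d != 0 for d in range(2, math.isqrt(p) + 1))
               for p in range(2, n + 1))

def log_integral(n, steps=100000):
    # numerically approximate the integral of 1/ln(x) from 2 to n
    dx = (n - 2) / steps
    return sum(dx / math.log(2 + (i + 0.5) * dx) for i in range(steps))

# for n = 10000 this prints 1229 and about 1245: a gap of roughly 16,
# far below sqrt(n)*log(n) (about 921), consistent with the conjecture
print(prime_count(10000), round(log_integral(10000)))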
1.5 PROOFS
Many people think of mathematical proofs as a sequence of logical
deductions that starts from some axioms and ultimately arrives at a
conclusion. In fact, some dictionaries define proofs that way. This is
not entirely wrong, but at its essence mathematical proof of a state-
ment X is simply an argument that convinces the reader that X is true
beyond a shadow of a doubt.
To produce such a proof you need to: (1) understand precisely what the statement means; (2) convince yourself that it is true; and (3) write your reasoning down in plain, precise, and concise language.
In many cases, the first part is the most important one. Understand-
ing what a statement means is oftentimes more than halfway towards
understanding why it is true. In third part, to convince the reader
beyond a shadow of a doubt, we will often want to break down the
reasoning to “basic steps”, where each basic step is simple enough
to be “self evident”. The combination of all steps yields the desired
statement.
what this purpose is. When you write a proof, for every equation or
sentence you include, ask yourself:

1. Is this sentence or equation stating that some statement is true?

2. If so, does this statement follow from the previous steps, or are we
going to establish it in the next step?
Remark 1.20 — Hierarchical Proofs (optional). Mathe-
matical proofs are ultimately written in English prose.
The well-known computer scientist Leslie Lamport
argues that this is a problem, and proofs should be
written in a more formal and rigorous way. In his
manuscript he proposes an approach for structured
hierarchical proofs, that have the following form:
If you have not seen the proof of this theorem before (or don't
remember it), this would be an excellent point to pause and try to
prove it yourself. One way to do it would be to describe an algorithm
that given as input a directed acyclic graph 𝐺 on 𝑛 vertices and 𝑛 − 2
or fewer edges, constructs an array 𝐹 of length 𝑛 such that for every
edge 𝑢 → 𝑣 in the graph, 𝐹[𝑢] < 𝐹[𝑣].

If (a) 𝑃 is true and (b) 𝑃 implies 𝑄, then 𝑄 is true.

Figure 1.9: Some examples of DAGs of one, two and three vertices,
and valid ways to assign layers to the vertices.
Remark 1.25 — Induction and recursion. Proofs by in-
duction are closely related to algorithms by recursion.
In both cases we reduce solving a larger problem to
solving a smaller instance of itself. In a recursive algo-
rithm to solve some problem P on an input of length
𝑘 we ask ourselves “what if someone handed me a
way to solve P on instances smaller than 𝑘?”. In an
inductive proof to prove a statement Q parameterized
by a number 𝑘, we ask ourselves “what if I already
knew that 𝑄(𝑘′ ) is true for 𝑘′ < 𝑘?”. Both induction
and recursion are crucial concepts for this course and
Computer Science at large (and even other areas of
inquiry, including not just mathematics but other
sciences as well). Both can be confusing at first, but
with time and practice they become clearer. For more
on proofs by induction and recursion, you might find
the following Stanford CS 103 handout, this MIT 6.00
lecture or this excerpt of the Lehman-Leighton book
useful.
$f(v) = \begin{cases} f'(v) + 1 & v \neq v_0 \\ 0 & v = v_0 \end{cases} \qquad (1.20)$
We claim that 𝑓 is a valid layering, namely that for every edge 𝑢 →
𝑣, 𝑓(𝑢) < 𝑓(𝑣). To prove this, we split into cases:
Reading a proof is no less of an important skill than
producing one. In fact, just like understanding code,
it is a highly non-trivial skill in itself. Therefore I
strongly suggest that you re-read the above proof, ask-
ing yourself at every sentence whether the assumption
it makes is justified, and whether this sentence truly
demonstrates what it purports to achieve. Another
good habit is to ask yourself when reading a proof for
every variable you encounter (such as 𝑢, 𝑖, 𝐺′ , 𝑓 ′ , etc.
in the above proof) the following questions: (1) What
type of variable is it? Is it a number? A graph? A vertex?
A function? (2) What do we know about it? Is it
an arbitrary member of the set? Have we shown some
facts about it? (3) What are we trying to show
about it?
Theorem 1.26 — Minimal layering is unique. Let 𝐺 = (𝑉, 𝐸) be a DAG. We
say that a layering 𝑓 ∶ 𝑉 → ℕ is minimal if for every vertex 𝑣 ∈ 𝑉, if
𝑣 has no in-neighbors then 𝑓(𝑣) = 0, and if 𝑣 has in-neighbors then
there exists an in-neighbor 𝑢 of 𝑣 such that 𝑓(𝑢) = 𝑓(𝑣) − 1.
For every pair of layerings 𝑓, 𝑔 ∶ 𝑉 → ℕ of 𝐺, if both 𝑓 and 𝑔 are minimal
then 𝑓 = 𝑔.
Proof Idea:
The idea is to prove the theorem by induction on the layers. If 𝑓 and
𝑔 are minimal then they must agree on the source vertices, since both
𝑓 and 𝑔 should assign these vertices to layer 0. We can then show that
if 𝑓 and 𝑔 agree up to layer 𝑖 − 1, then the minimality property implies
that they need to agree in layer 𝑖 as well. In the actual proof we use
a small trick to save on writing. Rather than proving the statement
that 𝑓 = 𝑔 (or in other words that 𝑓(𝑣) = 𝑔(𝑣) for every 𝑣 ∈ 𝑉 ),
we prove the weaker statement that 𝑓(𝑣) ≤ 𝑔(𝑣) for every 𝑣 ∈ 𝑉 .
(This is a weaker statement since the condition that 𝑓(𝑣) is less than
or equal to 𝑔(𝑣) is implied by the condition that 𝑓(𝑣) is equal to
𝑔(𝑣).) However, since 𝑓 and 𝑔 are just labels we give to two minimal
layerings, by simply changing the names “𝑓” and “𝑔” the same proof
also shows that 𝑔(𝑣) ≤ 𝑓(𝑣) for every 𝑣 ∈ 𝑉 and hence that 𝑓 = 𝑔.
⋆
The proof of Theorem 1.26 is fully rigorous, but is
written in a somewhat terse manner. Make sure that
you read through it and understand why this is indeed
an airtight proof of the Theorem’s statement.
• We also index the set [𝑛] starting with 0, and hence define it as
{0, … , 𝑛 − 1}. In other texts it is often defined as {1, … , 𝑛}. Similarly,
we index our strings starting with 0, and hence a string 𝑥 ∈ {0, 1}𝑛
is written as 𝑥0 𝑥1 ⋯ 𝑥𝑛−1 .
• We use ⌈𝑥⌉ and ⌊𝑥⌋ for the "ceiling" and "floor" operators that
correspond to "rounding up" or "rounding down" a number to the
nearest integer.
Also, such conventions do not replace the need to explicitly declare for
each new variable the type of object that it denotes.
• “Let 𝑋 be …”, “let 𝑋 denote …”, or “let 𝑋 = …”: These are all
different ways for us to say that we are defining the symbol 𝑋 to
stand for whatever expression is in the …. When 𝑋 is a property of
some objects we might define 𝑋 by writing something along the
lines of “We say that … has the property 𝑋 if ….”. While we often
• “Thus”, “Therefore” , “We get that”: This means that the following
sentence is implied by the preceding one, as in “The 𝑛-vertex graph
𝐺 is connected. Therefore it contains at least 𝑛 − 1 edges.” We
sometimes use “indeed” to indicate that the following text justifies
the claim that was made in the preceding sentence as in “The 𝑛-
vertex graph 𝐺 has at least 𝑛 − 1 edges. Indeed, this follows since 𝐺 is
connected.”
1.8 EXERCISES
Exercise 1.1 — Logical expressions. a. Write a logical expression 𝜑(𝑥)
involving the variables 𝑥0, 𝑥1, 𝑥2 and the operators ∧ (AND), ∨
(OR), and ¬ (NOT), such that 𝜑(𝑥) is true if the majority of the
inputs are True.
b. Let 𝑛 > 10. 𝑆 is the set of all functions mapping {0, 1}𝑛 to {0, 1}.
𝑇 = {0, 1}𝑛 .
c. Let 𝐴0, …, 𝐴𝑘−1 be finite subsets of {1, …, 𝑛}, such that |𝐴𝑖| = 𝑚 for
every 𝑖 ∈ [𝑘]. Prove that if 𝑘 > 100𝑛, then there exist two distinct
sets 𝐴𝑖, 𝐴𝑗 s.t. $|A_i \cap A_j| \geq m^2/(10n)$.
■
Exercise 1.9 — Prove that for every finite 𝑆, 𝑇, there are $(|T| + 1)^{|S|}$
partial functions from 𝑆 to 𝑇.
■
Exercise 1.11 Prove that for every undirected graph 𝐺 of 100 vertices,
if every vertex has degree at most 4, then there exists a subset 𝑆 of at
least 20 vertices such that no two vertices in 𝑆 are neighbors of one
another.
■
d. $F(n) = \sqrt{n}$, $G(n) = 2^{\sqrt{\log n}}$

e. $F(n) = \binom{n}{\lceil 0.2n \rceil}$, $G(n) = 2^{0.1n}$ (where $\binom{n}{k}$ is the number of 𝑘-sized
subsets of a set of size 𝑛). See footnote for hint.⁷

⁷ One way to do this is to use Stirling's approximation for the factorial
function.
Exercise 1.15 Prove that for every undirected graph 𝐺 of 1000 vertices,
if every vertex has degree at most 4, then there exists a subset 𝑆 of at
least 200 vertices such that no two vertices in 𝑆 are neighbors of one
another.
■
2 Computation and Representation

Learning Objectives:
• Prefix-free representations.
• Cantor's Theorem: the real numbers cannot be represented exactly by finite strings.
“The alphabet (sic) was a great invention, which enabled men (sic) to store
and to learn with little effort what others had learned the hard way – that is, to
learn from books rather than from direct, possibly painful, contact with the real
world.”, B.F. Skinner
“The name of the song is called ‘HADDOCK’S EYES.”’ [said the Knight]
“Oh, that’s the name of the song, is it?” Alice said, trying to feel interested.
“No, you don’t understand,” the Knight said, looking a little vexed. “That’s
what the name is CALLED. The name really is ‘THE AGED AGED MAN.”’
“Then I ought to have said ‘That’s what the SONG is called’?” Alice cor-
rected herself.
“No, you oughtn’t: that’s quite another thing! The SONG is called ‘WAYS
AND MEANS’: but that’s only what it’s CALLED, you know!”
“Well, what IS the song, then?” said Alice, who was by this time com-
pletely bewildered.
“I was coming to that,” the Knight said. “The song really IS ‘A-SITTING ON
A GATE’: and the tune’s my own invention.”
Lewis Carroll, Through the Looking-Glass
networks, MRI scans, gene data, and even other programs. We will
represent all these objects as strings of zeroes and ones, that is objects
such as 0011101 or 1011 or any other finite list of 1’s and 0’s. (This
choice is for convenience: there is nothing “holy” about zeroes and
ones, and we could have used any other finite collection of symbols.)
Today, we are so used to the notion of digital representation that
we are not surprised by the existence of such an encoding. But it is
actually a deep insight with significant implications. Many animals
can convey a particular fear or desire, but what is unique about hu-
mans is language: we use a finite collection of basic symbols to describe
a potentially unlimited range of experiences. Language allows trans-
mission of information over both time and space and enables soci-
eties that span a great many people and accumulate a body of shared
knowledge over time.

Figure 2.2: We represent numbers, texts, images, networks and many
other objects using strings of zeroes and ones. Writing the zeroes and
ones themselves in green font over a black background is optional.

Over the last several decades, we have seen a revolution in what we
can represent and convey in digital form. We can capture experiences
with almost perfect fidelity, and disseminate them essentially
instantaneously to an unlimited audience. Moreover, once information is in
digital form, we can compute over it, and gain insights from data that
were not accessible in prior times. At the heart of this revolution is the
simple but profound observation that we can represent an unbounded
variety of objects using a finite set of symbols (and in fact using only
the two symbols 0 and 1).
In later chapters, we will typically take such representations for
granted, and hence use expressions such as “program 𝑃 takes 𝑥 as
input” when 𝑥 might be a number, a vector, a graph, or any other
object, when we really mean that 𝑃 takes as input the representation of
𝑥 as a binary string. However, in this chapter we will dwell a bit more
on how we can construct such representations.
The two "big ideas" we discuss are Big Idea 1 - we can compose
representations for simple objects to represent more complex objects -
and Big Idea 2 - it is crucial to distinguish between functions
("what") and programs ("how"). The latter will be a theme we will
come back to time and again in this book.
Number   Binary representation
40       101000
53       110101
389      110000101
3750     111010100110
$NtS(n) = \begin{cases} 0 & n = 0 \\ 1 & n = 1 \\ NtS(\lfloor n/2 \rfloor)\; parity(n) & n > 1 \end{cases} \qquad (2.1)$
where 𝑝𝑎𝑟𝑖𝑡𝑦 ∶ ℕ → {0, 1} is the function defined as 𝑝𝑎𝑟𝑖𝑡𝑦(𝑛) = 0
if 𝑛 is even and 𝑝𝑎𝑟𝑖𝑡𝑦(𝑛) = 1 if 𝑛 is odd, and as usual, for strings
𝑥, 𝑦 ∈ {0, 1}∗ , 𝑥𝑦 denotes the concatenation of 𝑥 and 𝑦. The function
Remark 2.1 — Binary representation in Python (optional). We can
implement the binary representation in Python as follows:
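One possible implementation, following the recursion in (2.1):

def NtS(n):
    # natural number to binary string, per equation (2.1)
    if n <= 1:
        return str(n)
    return NtS(n // 2) + str(n % 2)

def StN(x):
    # inverse map: binary string back to a natural number
    n = 0
    for bit in x:
        n = 2 * n + int(bit)
    return n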
print(NtS(236))
# 11101100
print(NtS(19))
# 10011
print(StN(NtS(236)))
# 236
Remark 2.2 — Programming examples. In this book,
we sometimes use code examples as in Remark 2.1.
The point is always to emphasize that certain com-
putations can be achieved concretely, rather than
illustrating the features of Python or any other pro-
gramming language. Indeed, one of the messages of
this book is that all programming languages are in
a certain precise sense equivalent to one another, and
hence we could have just as well used JavaScript, C,
COBOL, Visual Basic or even BrainF*ck. This book
is not about programming, and it is absolutely OK if
$ZtS(m) = \begin{cases} 0\; NtS(m) & m \geq 0 \\ 1\; NtS(-m) & m < 0 \end{cases} \qquad (2.2)$
where 𝑁 𝑡𝑆 is defined as in (2.1).
While the encoding function of a representation needs to be one
to one, it does not have to be onto. For example, in the representation
above there is no number that is represented by the empty string
but it is still a fine representation, since every integer is represented
uniquely by some string.
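In code, (2.2) translates directly; a sketch reusing the NtS function above:

def ZtS(m):
    # sign bit followed by the binary representation of |m|, per (2.2)
    return "0" + NtS(m) if m >= 0 else "1" + NtS(-m)

assert ZtS(5) == "0101" and ZtS(-5) == "1101"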
Remark 2.3 — Interpretation and context. Given a string
𝑦 ∈ {0, 1}∗ , how do we know if it’s “supposed” to
represent a (nonnegative) natural number or a (po-
tentially negative) integer? For that matter, even if
we know 𝑦 is “supposed” to be an integer, how do
we know what representation scheme it uses? The
short answer is that we do not necessarily know this
information, unless it is supplied from the context. (In
programming languages, the compiler or interpreter
determines the representation of the sequence of bits
corresponding to a variable based on the variable’s
type.) We can treat the same string 𝑦 as representing a
natural number, an integer, a piece of text, an image,
or a green gremlin. Whenever we say a sentence such
as “let 𝑛 be the number represented by the string 𝑦,”
we will assume that we are fixing some canonical rep-
resentation scheme such as the ones above. The choice
of the particular representation scheme will rarely
matter, except that we want to make sure to stick with
the same one for consistency.
$ZtS_n(k) = \begin{cases} NtS_{n+1}(k) & 0 \leq k \leq 2^n - 1 \\ NtS_{n+1}(2^{n+1} + k) & -2^n \leq k \leq -1 \end{cases} \qquad (2.3)$
Given the issues with floating point approximations for real num-
bers, a natural question is whether it is possible to represent real num-
bers exactly as strings. Unfortunately, the following theorem shows
that this cannot be done:
Theorem 2.5 was proven by Georg Cantor in 1874. This result (and
the theory around it) was quite shocking to mathematicians at the
time. By showing that there is no one-to-one map from ℝ to {0, 1}∗ (or
ℕ), Cantor showed that these two infinite sets have “different forms of
infinity” and that the set of real numbers ℝ is in some sense “bigger”
than the infinite set {0, 1}∗ . The notion that there are “shades of infin-
ity” was deeply disturbing to mathematicians and philosophers at the
time. The philosopher Ludwig Wittgenstein (whom we mentioned be-
fore) called Cantor’s results “utter nonsense” and “laughable.” Others
thought they were even worse than that. Leopold Kronecker called
Cantor a “corrupter of youth,” while Henri Poincaré said that Can-
tor’s ideas “should be banished from mathematics once and for all.”
The tide eventually turned, and these days Cantor’s work is univer-
sally accepted as the cornerstone of set theory and the foundations of
mathematics. As David Hilbert said in 1925, “No one shall expel us from
the paradise which Cantor has created for us.” As we will see later in this
book, Cantor’s ideas also play a huge role in the theory of computa-
tion.
Now that we have discussed Theorem 2.5’s importance, let us see
the proof. It is achieved in two steps:
1. Define some infinite set 𝒳 for which it is easier for us to prove that
𝒳 is not countable (namely, it's easier for us to prove that there is
no one-to-one function from 𝒳 to {0, 1}∗).

2. Show that there is a one-to-one function mapping 𝒳 to ℝ, so that if
ℝ were countable then 𝒳 would be countable as well.
We now proceed to do precisely that. That is, we will define the set
{0, 1}∞ , which will play the role of 𝒳, and then state and prove two
lemmas that show that this set satisfies our two desired properties.
That is, {0, 1}∞ is a set of functions, and a function 𝑓 is in {0, 1}∞
iff its domain is ℕ and its codomain is {0, 1}. We can also think of
{0, 1}∞ as the set of all infinite sequences of bits, since a function 𝑓 ∶
ℕ → {0, 1} can be identified with the sequence (𝑓(0), 𝑓(1), 𝑓(2), …).
The following two lemmas show that {0, 1}∞ can play the role of 𝒳 to
establish Theorem 2.5.
Lemma 2.8 There does not exist a one-to-one map 𝐹𝑡𝑆 ∶ {0, 1}∞ → {0, 1}∗.³

Lemma 2.9 There does exist a one-to-one map 𝐹𝑡𝑅 ∶ {0, 1}∞ → ℝ.⁴

³ 𝐹𝑡𝑆 stands for "functions to strings".
⁴ 𝐹𝑡𝑅 stands for "functions to reals."
As we’ve seen above, Lemma 2.8 and Lemma 2.9 together imply
Theorem 2.5. To repeat the argument more formally, suppose, for
the sake of contradiction, that there did exist a one-to-one function
𝑅𝑡𝑆 ∶ ℝ → {0, 1}∗ . By Lemma 2.9, there exists a one-to-one function
𝐹𝑡𝑅 ∶ {0, 1}∞ → ℝ. Thus, under this assumption, since the composition
of two one-to-one functions is one-to-one (see Exercise 2.12), the
map 𝑅𝑡𝑆 ∘ 𝐹𝑡𝑅 ∶ {0, 1}∞ → {0, 1}∗ would be one-to-one, contradicting
Lemma 2.8.
Now all that is left is to prove these two lemmas. We start by prov-
ing Lemma 2.8 which is really the heart of Theorem 2.5.
Warm-up: "Baby Cantor". The proof of Lemma 2.8 is rather subtle. One
way to get intuition for it is to consider the following finite statement:
"there is no onto function 𝑓 ∶ {0, …, 99} → {0, 1}¹⁰⁰". Of course we
know it's true since the set {0, 1}¹⁰⁰ is bigger than the set [100], but
let's see a direct proof. For every 𝑓 ∶ {0, …, 99} → {0, 1}¹⁰⁰, we
can define the string 𝑑 ∈ {0, 1}¹⁰⁰ as follows: $d = (1 - f(0)_0,
1 - f(1)_1, \ldots, 1 - f(99)_{99})$. If 𝑓 was onto, then there would exist some
𝑛 ∈ [100] such that 𝑓(𝑛) = 𝑑, but we claim that no such 𝑛 exists.
Indeed, if there were such an 𝑛, then the 𝑛-th coordinate of 𝑑 would equal
$f(n)_n$, but by definition this coordinate equals $1 - f(n)_n$. See also a
"proof by code" of this statement below.
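In that spirit, here is a diagonalization sketch in Python (our rendering; the linked "proof by code" may differ):

import random

def diagonal(f):
    # f is a list of 100 strings in {0,1}^100; f[n] plays the role of f(n).
    # Return d whose n-th coordinate is the negation of f(n)_n.
    return "".join("1" if f[n][n] == "0" else "0" for n in range(100))

f = ["".join(random.choice("01") for _ in range(100)) for _ in range(100)]
d = diagonal(f)
# d differs from every f(n) in the n-th coordinate, so f cannot be onto
assert all(d != f[n] for n in range(100))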
Proof of Lemma 2.8. We will prove that there does not exist an onto
function 𝑆𝑡𝐹 ∶ {0, 1}∗ → {0, 1}∞ . This implies the lemma since
for every two sets 𝐴 and 𝐵, there exists an onto function from 𝐴 to
𝐵 if and only if there exists a one-to-one function from 𝐵 to 𝐴 (see
Lemma 1.2).
The technique of this proof is known as the “diagonal argument”
and is illustrated in Fig. 2.8. We assume, towards a contradiction, that
there exists such a function 𝑆𝑡𝐹 ∶ {0, 1}∗ → {0, 1}∞ . We will show
that 𝑆𝑡𝐹 is not onto by demonstrating a function 𝑑 ∈ {0, 1}∞ such that
𝑑 ≠ 𝑆𝑡𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ . Consider the lexicographic ordering
of binary strings (i.e., "",0,1,00,01,…). For every 𝑛 ∈ ℕ, we let 𝑥𝑛 be the
𝑛-th string in this order. That is 𝑥0 = "", 𝑥1 = 0, 𝑥2 = 1 and so on and
so forth. We define the function 𝑑 ∈ {0, 1}∞ by $d(n) = 1 - StF(x_n)(n)$
for every $n \in \mathbb{N}$. The diagonal values of the table of 𝑆𝑡𝐹 are
𝑆𝑡𝐹 ("")(0), 𝑆𝑡𝐹 (0)(1), 𝑆𝑡𝐹 (1)(2), 𝑆𝑡𝐹 (00)(3), 𝑆𝑡𝐹 (01)(4), … (2.5)
which correspond to the elements 𝑆𝑡𝐹 (𝑥𝑛 )(𝑛) in the 𝑛-th row and
𝑛-th column of this table for 𝑛 = 0, 1, 2, …. The function 𝑑 we defined
above maps every 𝑛 ∈ ℕ to the negation of the 𝑛-th diagonal value.
To complete the proof that 𝑆𝑡𝐹 is not onto we need to show that
𝑑 ≠ 𝑆𝑡𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ . Indeed, let 𝑥 ∈ {0, 1}∗ be some string
and let 𝑔 = 𝑆𝑡𝐹 (𝑥). If 𝑛 is the position of 𝑥 in the lexicographical
order then by construction 𝑑(𝑛) = 1 − 𝑔(𝑛) ≠ 𝑔(𝑛) which means that
𝑔 ≠ 𝑑 which is what we wanted to prove.
■
Remark 2.10 — Generalizing beyond strings and reals.
Lemma 2.8 doesn’t really have much to do with the
natural numbers or the strings. An examination of
the proof shows that it really shows that for every
set 𝑆, there is no one-to-one map 𝐹 ∶ {0, 1}𝑆 → 𝑆
where {0, 1}𝑆 denotes the set {𝑓 | 𝑓 ∶ 𝑆 → {0, 1}}
of all Boolean functions with domain 𝑆. Since we can
identify a subset 𝑉 ⊆ 𝑆 with its characteristic function
𝑓 = 1𝑉 (i.e., 1𝑉 (𝑥) = 1 iff 𝑥 ∈ 𝑉 ), we can think of
{0, 1}𝑆 also as the set of all subsets of 𝑆. This set
is sometimes called the power set of 𝑆 and denoted by
𝒫(𝑆) or 2𝑆.
The proof of Lemma 2.8 can be generalized to show
that there is no one-to-one map between a set and its
power set. In particular, it means that the set {0, 1}ℝ is
“even bigger” than ℝ. Cantor used these ideas to con-
struct an infinite hierarchy of shades of infinity. The
number of such shades turns out to be much larger
than |ℕ| or even |ℝ|. He denoted the cardinality of ℕ
by ℵ0 and denoted the next largest infinite number
by ℵ1 . (ℵ is the first letter in the Hebrew alphabet.)
Cantor also made the continuum hypothesis that
|ℝ| = ℵ1 . We will come back to the fascinating story
of this hypothesis later on in this book. This lecture of
Aaronson mentions some of these issues (see also this
Berkeley CS 70 lecture).
Proof Idea:
We define 𝐹𝑡𝑅(𝑓) to be the number between 0 and 2 whose decimal
expansion is 𝑓(0).𝑓(1)𝑓(2)…, or in other words $FtR(f) =
\sum_{i=0}^{\infty} f(i) \cdot 10^{-i}$. If 𝑓 and 𝑔 are two distinct functions in {0, 1}∞, then
there must be some input 𝑘 in which they disagree. If we take the
minimum such 𝑘, then the numbers 𝑓(0).𝑓(1)𝑓(2) … 𝑓(𝑘 − 1)𝑓(𝑘) …
and 𝑔(0).𝑔(1)𝑔(2) … 𝑔(𝑘) … agree with each other all the way up to the
𝑘 − 1-th digit after the decimal point, and disagree on the 𝑘-th digit.
But then these numbers must be distinct. Concretely, if 𝑓(𝑘) = 1 and
𝑔(𝑘) = 0 then the first number is larger than the second, and otherwise
(𝑓(𝑘) = 0 and 𝑔(𝑘) = 1) the first number is smaller than the second.
In the proof we have to be a little careful since these are numbers with
infinite expansions. For example, the number one half has two decimal
expansions 0.5 and 0.49999⋯. However, this issue does not come up
here, since we restrict attention only to numbers with decimal expan-
sions that do not involve the digit 9.
⋆
Proof of Lemma 2.9. For every 𝑓 ∈ {0, 1}∞, we define 𝐹𝑡𝑅(𝑓) to be the
number whose decimal expansion is 𝑓(0).𝑓(1)𝑓(2)𝑓(3)…. Formally,

$FtR(f) = \sum_{i=0}^{\infty} f(i) \cdot 10^{-i} \qquad (2.6)$
Remark 2.11 — Using decimal expansion (optional). In the proof above
we used the fact that 1 + 1/10 + 1/100 + ⋯ converges to 10/9, which
plugging into (2.7) yields that the difference between 𝐹𝑡𝑅(𝑔) and
𝐹𝑡𝑅(ℎ) is at least $10^{-k} - 10^{-k-1} \cdot (10/9) > 0$. While the choice
of the decimal representation for 𝐹𝑡𝑅 was arbitrary, we could not have
used the binary representation in its place. Had we used the binary
expansion instead of decimal, the corresponding sequence
1 + 1/2 + 1/4 + ⋯ converges to 2,
Equivalently, there does not exist an onto map 𝑆𝑡𝐴𝐿𝐿 ∶ {0, 1}∗ → ALL.
Proof Idea:
This is a direct consequence of Lemma 2.8, since we can use the
binary representation to show a one-to-one map from {0, 1}∞ to ALL.
Hence the uncountability of {0, 1}∞ implies the uncountability of
ALL.
⋆
Proof of Theorem 2.12. Since {0, 1}∞ is uncountable, the result will
follow by showing a one-to-one map from {0, 1}∞ to ALL. The reason
is that the existence of such a map implies that if ALL was countable,
and hence there was a one-to-one map from ALL to ℕ, then there
would have been a one-to-one map from {0, 1}∞ to ℕ, contradicting
Lemma 2.8.
We now show this one-to-one map. We simply map a function
𝑓 ∈ {0, 1}∞ to the function 𝐹 ∶ {0, 1}∗ → {0, 1} as follows. We let
𝐹(0) = 𝑓(0), 𝐹(1) = 𝑓(1), 𝐹(10) = 𝑓(2), 𝐹(11) = 𝑓(3) and so on and
so forth. That is, for every 𝑥 ∈ {0, 1}∗ that represents a natural number
𝑛 in the binary basis, we define 𝐹 (𝑥) = 𝑓(𝑛). If 𝑥 does not represent
such a number (e.g., it has a leading zero), then we set 𝐹 (𝑥) = 0.
This map is one-to-one since if 𝑓 ≠ 𝑔 are two distinct elements in
{0, 1}∞ , then there must be some input 𝑛 ∈ ℕ on which 𝑓(𝑛) ≠ 𝑔(𝑛).
But then if 𝑥 ∈ {0, 1}∗ is the string representing 𝑛, we see that 𝐹 (𝑥) ≠
𝐺(𝑥) where 𝐹 is the function in ALL that 𝑓 is mapped to, and 𝐺 is the
function that 𝑔 is mapped to.
■
Make sure you know how to prove the equivalence of
all the results above.
Remark 2.15 — Total decoding functions. While the
decoding function of a representation scheme can in
general be a partial function, the proof of Lemma 2.14
implies that every representation scheme has a total
decoding function. This observation can sometimes be
useful.
if you have a pigeon coop with 𝑚 holes, and 𝑘 > 𝑚 pigeons, then there
must be two pigeons in the same hole.)
■
Recall that for every set 𝒪, the set 𝒪∗ consists of all finite length
tuples (i.e., lists) of elements in 𝒪. The following theorem shows that
if 𝐸 is a prefix-free encoding of 𝒪 then by concatenating encodings we
can obtain a valid (i.e., one-to-one) representation of 𝒪∗ :
Theorem 2.18 — Prefix-free implies tuple encoding. Suppose that 𝐸 ∶ 𝒪 →
{0, 1}∗ is prefix-free. Then the map 𝐸 ∶ 𝒪∗ → {0, 1}∗ defined by
𝐸(𝑜0, …, 𝑜𝑘−1) = 𝐸(𝑜0)𝐸(𝑜1)⋯𝐸(𝑜𝑘−1) is one-to-one (we overload the
name 𝐸 for both maps).
Theorem 2.18 is an example of a theorem that is a little
hard to parse, but in fact is fairly straightforward to
prove once you understand what it means. Therefore,
I highly recommend that you pause here to make
sure you understand the statement of this theorem.
You should also try to prove it on your own before
proceeding further.
Proof Idea:
The idea behind the proof is simple. Suppose that for example
we want to decode a triple (𝑜0, 𝑜1, 𝑜2) from its representation 𝑥 =
𝐸(𝑜0, 𝑜1, 𝑜2) = 𝐸(𝑜0)𝐸(𝑜1)𝐸(𝑜2). We will do so by first finding the
first prefix 𝑥0 of 𝑥 that is a representation of some object. Then we

Figure 2.9: If we have a prefix-free representation of each object then
we can concatenate the representations of 𝑘 objects to obtain a
representation for the tuple (𝑜0, …, 𝑜𝑘−1).
Proof of Theorem 2.18. We now show the formal proof. Suppose, towards
the sake of contradiction, that there exist two distinct tuples
(𝑜0, …, 𝑜𝑘−1) and (𝑜0′, …, 𝑜𝑘′′−1) such that

$E(o_0)E(o_1) \cdots E(o_{k-1}) = E(o'_0)E(o'_1) \cdots E(o'_{k'-1}) .$

Denote this string by 𝑥, and let 𝑖 be the smallest index such that
𝑜𝑖 ≠ 𝑜𝑖′, so that 𝑥𝑗 = 𝐸(𝑜𝑗) = 𝐸(𝑜𝑗′) for all 𝑗 < 𝑖. Let 𝑦 be the string
obtained after removing the prefix 𝑥0 ⋯ 𝑥𝑖−1 from 𝑥. We see that 𝑦 can be writ-
ten as both 𝑦 = 𝐸(𝑜𝑖 )𝑠 for some string 𝑠 ∈ {0, 1}∗ and as 𝑦 = 𝐸(𝑜𝑖′ )𝑠′
for some 𝑠′ ∈ {0, 1}∗ . But this means that one of 𝐸(𝑜𝑖 ) and 𝐸(𝑜𝑖′ ) must
be a prefix of the other, contradicting the prefix-freeness of 𝐸.
Remark 2.19 — Prefix freeness of list representation.
Even if the representation 𝐸 of objects in 𝒪 is prefix
free, this does not mean that our representation 𝐸
of lists of such objects will be prefix free as well. In
fact, it won’t be: for every three objects 𝑜, 𝑜′ , 𝑜″ the
representation of the list (𝑜, 𝑜′ ) will be a prefix of the
representation of the list (𝑜, 𝑜′ , 𝑜″ ). However, as we see
in Lemma 2.20 below, we can transform every repre-
sentation into prefix-free form, and so will be able to
use that transformation if needed to represent lists of
lists, lists of lists of lists, and so on and so forth.
For the sake of completeness, we will include the
proof below, but it is a good idea for you to pause
here and try to prove it on your own, using the same
technique we used for representing rational numbers.
Proof of Lemma 2.20. The idea behind the proof is to use the map 0 ↦
00, 1 ↦ 11 to “double” every bit in the string 𝑥 and then mark the
end of the string by concatenating to it the pair 01. If we encode a
The proof of Lemma 2.20 is not the only or even the best way to
transform an arbitrary representation into prefix-free form.
Exercise 2.10 asks you to construct a more efficient prefix-free
transformation satisfying $|\overline{E}(o)| \leq |E(o)| + O(\log |E(o)|)$.
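One possible implementation of this transformation as a Python function, matching the names used in the snippets below (a sketch, not necessarily the book's exact code):

def prefixfree(encode, decode):
    # Make an encoding prefix-free as in the proof of Lemma 2.20:
    # double every bit and append "01" as an end marker.
    def pfencode(o):
        return "".join(2 * bit for bit in encode(o)) + "01"
    def pfdecode(x):
        return decode(x[:-2][::2])
    def pfvalid(x):
        y = x[:-2]
        return (len(x) >= 4 and x.endswith("01") and len(y) % 2 == 0
                and all(y[2 * i] == y[2 * i + 1] for i in range(len(y) // 2)))
    return pfencode, pfdecode, pfvalid

pfNtS, pfStN, pfvalidM = prefixfree(NtS, StN)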
NtS(234)
# 11101010
pfNtS(234)
# 111111001100110001
pfStN(pfNtS(234))
# 234
pfvalidM(pfNtS(234))
# true
Note that the Python function prefixfree above
takes two Python functions as input and outputs
three Python functions as output. (When it’s not
too awkward, we use the term “Python function” or
“subroutine” to distinguish between such snippets of
Python programs and mathematical functions.) You
don’t have to know Python in this course, but you do
need to get comfortable with the idea of functions as
mathematical objects in their own right, that can be
used as inputs and outputs of other functions.
def represlists(pfencode,pfdecode,pfvalid):
    """
    Takes functions pfencode, pfdecode and pfvalid,
    and returns functions encodelist, decodelist
    that can encode and decode lists of the objects
    respectively.
    """

    def encodelist(L):
        """Gets list of objects, encodes it as list of bits"""
        return "".join([pfencode(obj) for obj in L])

    def decodelist(S):
        """Gets list of bits, returns list of objects"""
        i=0; j=1 ; res = []
        while j<=len(S):
            if pfvalid(S[i:j]):
                res += [pfdecode(S[i:j])]
                i=j
            j+= 1
        return res

    return encodelist,decodelist
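The functions LtS and StL used in the following example can be obtained by applying represlists to the prefix-free encoders for natural numbers; assuming the names pfNtS, pfStN and pfvalidM from the earlier example (an illustrative wiring, not necessarily the book's exact code):

LtS, StL = represlists(pfNtS, pfStN, pfvalidM)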
LtS([234,12,5])
# 111111001100110001111100000111001101
StL(LtS([234,12,5]))
# [234, 12, 5]
edge $\overrightarrow{i\,j} \in E$. We can transform an undirected graph into a directed graph by replacing every edge $\{i, j\}$ with both edges $\overrightarrow{i\,j}$ and $\overleftarrow{i\,j}$.
Another representation for graphs is the adjacency list representa-
tion. That is, we identify the vertex set 𝑉 of a graph with the set [𝑛]
2.5.9 Notation
We will typically identify an object with its representation as a string.
For example, if 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is some function that maps
strings to strings and 𝑛 is an integer, we might make statements such
as “𝐹 (𝑛) + 1 is prime” to mean that if we represent 𝑛 as a string 𝑥,
then the integer 𝑚 represented by the string 𝐹 (𝑥) satisfies that 𝑚 + 1
is prime. (You can see how this convention of identifying objects with
their representation can save us a lot of cumbersome formalism.)
Similarly, if 𝑥, 𝑦 are some objects and 𝐹 is a function that takes strings
as inputs, then by 𝐹 (𝑥, 𝑦) we will mean the result of applying 𝐹 to the
representation of the ordered pair (𝑥, 𝑦). We use the same notation to
invoke functions on 𝑘-tuples of objects for every 𝑘.
This convention of identifying an object with its representation as
a string is one that we humans follow all the time. For example, when
people say a statement such as “17 is a prime number”, what they
really mean is that the integer whose decimal representation is the
string “17”, is prime.
When we say
𝐴 is an algorithm that computes the multiplication function on natural num-
bers.
what we really mean is that
𝐴 is an algorithm that computes the function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ such that
for every pair 𝑎, 𝑏 ∈ ℕ, if 𝑥 ∈ {0, 1}∗ is a string representing the pair (𝑎, 𝑏)
then 𝐹 (𝑥) will be a string representing their product 𝑎 ⋅ 𝑏.
R
Remark 2.23 — Boolean functions and languages. An
important special case of computational tasks corre-
sponds to computing Boolean functions, whose output
is a single bit {0, 1}. Computing such functions corre-
sponds to answering a YES/NO question, and hence
this task is also known as a decision problem. Given any
function 𝐹 ∶ {0, 1}∗ → {0, 1} and 𝑥 ∈ {0, 1}∗ , the task
of computing 𝐹 (𝑥) corresponds to the task of deciding
whether or not 𝑥 ∈ 𝐿 where 𝐿 = {𝑥 ∶ 𝐹 (𝑥) = 1} is
known as the language that corresponds to the function
𝐹 . (The language terminology is due to historical
connections between the theory of computation and
formal linguistics as developed by Noam Chomsky.)
Hence many texts refer to such a computational task
as deciding a language.
def mult1(x,y):
    res = 0
    while y>0:
        res += x
        y -= 1
    return res

def mult2(x,y):
    a = str(x) # represent x as string in decimal notation
    b = str(y) # represent y as string in decimal notation
    res = 0
    for i in range(len(a)):
        for j in range(len(b)):
            # multiply the i-th digit from the right of a by the
            # j-th digit from the right of b, shifted appropriately
            res += int(a[len(a)-i-1])*int(b[len(b)-j-1])*(10**(i+j))
    return res
print(mult1(12,7))
# 84
print(mult2(12,7))
# 84
Both mult1 and mult2 produce the same output given the same
pair of natural number inputs. (Though mult1 will take far longer to
do so when the numbers become large.) Hence, even though these are
two different programs, they compute the same mathematical function.
This distinction between a program or algorithm 𝐴, and the function 𝐹
that 𝐴 computes will be absolutely crucial for us in this course (see also
Fig. 2.13).
✓ Chapter Recap
2.7 EXERCISES
Exercise 2.1 Which one of these objects can be represented by a binary string?

a. An integer 𝑥.

b. An undirected graph 𝐺.

c. A directed graph 𝐻.
Exercise 2.3 — More compact than ASCII representation. The ASCII encoding
can be used to encode a string of 𝑛 English letters as a 7𝑛 bit binary
string, but in this exercise, we ask about finding a more compact rep-
resentation for strings of English lowercase letters.
2. Prove that there exists no representation scheme for strings over the
alphabet {𝑎, 𝑏, … , 𝑧} as binary strings such that for every length-𝑛
string 𝑥 ∈ {𝑎, 𝑏, … , 𝑧}𝑛 , the representation 𝐸(𝑥) is a binary string of
length ⌊4.6𝑛 + 1000⌋. In other words, prove that there exists some
𝑛 > 0 such that there is no one-to-one function 𝐸 ∶ {𝑎, 𝑏, … , 𝑧}𝑛 →
{0, 1}⌊4.6𝑛+1000⌋ .
Exercise 2.4 — Representing graphs: upper bound. Show that there is a string
representation of directed graphs with vertex set [𝑛] and degree at
most 10 that uses at most 1000𝑛 log 𝑛 bits. More formally, show the
following: Suppose we define for every 𝑛 ∈ ℕ, the set 𝐺𝑛 as the set
containing all directed graphs (with no self loops) over the vertex
set [𝑛] where every vertex has degree at most 10. Then, prove that for
every sufficiently large 𝑛, there exists a one-to-one function 𝐸 ∶ 𝐺𝑛 →
{0, 1}⌊1000𝑛 log 𝑛⌋ .
■
Exercise 2.5 — Representing graphs: lower bound.

1. Define 𝑆𝑛 to be the set of one-to-one and onto functions mapping [𝑛] to [𝑛]. Prove that there is a one-to-one mapping from 𝑆𝑛 to 𝐺2𝑛 , where 𝐺2𝑛 is the set defined in Exercise 2.4 above.
2. In this question you will show that one cannot improve the rep-
resentation of Exercise 2.4 to length 𝑜(𝑛 log 𝑛). Specifically, prove
for every sufficiently large 𝑛 ∈ ℕ there is no one-to-one function
𝐸 ∶ 𝐺𝑛 → {0, 1}⌊0.001𝑛 log 𝑛⌋+1000 .
■
2. Use 1. to compute the size of the set {𝑦 ∈ {0, 1}∗ ∶ |𝑦| ≤ 𝑘} where |𝑦|
denotes the length of the string 𝑦.
Exercise 2.10 — More efficient prefix-free transformation. Suppose that 𝐹 ∶ 𝑂 → {0, 1}∗ is some (not necessarily prefix-free) representation of the objects in the set 𝑂, and 𝐺 ∶ ℕ → {0, 1}∗ is a prefix-free representation of the natural numbers. Define 𝐹 ′ (𝑜) = 𝐺(|𝐹 (𝑜)|)𝐹 (𝑜) (i.e., the concatenation of the representation of the length of 𝐹 (𝑜) and 𝐹 (𝑜) itself).
Exercise 2.11 — Kraft’s Inequality. Suppose that 𝑆 ⊆ {0, 1}𝑛 is some finite
prefix-free set.
a. For every 𝑘 ≤ 𝑛 and length-𝑘 string 𝑥 ∈ 𝑆, let 𝐿(𝑥) ⊆ {0, 1}𝑛 denote
all the length-𝑛 strings whose first 𝑘 bits are 𝑥0 , … , 𝑥𝑘−1 . Prove that
(1) |𝐿(𝑥)| = 2𝑛−|𝑥| and (2) If 𝑥 ≠ 𝑥′ then 𝐿(𝑥) is disjoint from
𝐿(𝑥′ ).
3
Defining computation
“there is no reason why mental as well as bodily labor should not be economized
by the aid of machinery”, Charles Babbage, 1852
“If, unwarned by my example, any man shall undertake and shall succeed
in constructing an engine embodying in itself the whole of the executive de-
partment of mathematical analysis upon different principles or by simpler
mechanical means, I have no fear of leaving my reputation in his charge, for he
alone will be fully able to appreciate the nature of my efforts and the value of
their results.”, Charles Babbage, 1864
“To understand a program you must become both the machine and the pro-
gram.”, Alan Perlis, 1982
[How to solve an equation of the form] "roots and squares are equal to numbers": For instance, "one square, and ten roots of the same, amount to thirty-nine dirhems"; that is to say, what must be the square which, when increased by ten of its own roots, amounts to thirty-nine? The solution is this: you halve the number of the roots, which in the present instance yields five. This you multiply by itself; the product is twenty-five. Add this to thirty-nine; the sum is sixty-four. Now take the root of this, which is eight, and subtract from it half the number of roots, which is five; the remainder is three. This is the root of the square which you sought for; the square itself is nine.
For the purposes of this book, we will need a much more precise way to describe algorithms. Fortunately (or is it unfortunately?), at least at the moment, computers lag far behind school-age children in learning from examples. Hence in the 20th century, people came up with exact formalisms for describing algorithms, namely programming languages. Here is al-Khwarizmi's quadratic equation solving algorithm, expressed in Python (the body below follows the quoted instructions step by step):

Figure 3.4: Text pages from Algebra manuscript with geometrical solutions to two quadratic equations. Shelfmark: MS. Huntington 214 fol. 004v-005r

Figure 3.5: An explanation for children of the two digit addition algorithm

import math

def solve_eq(b,c):
    # return solution of x^2 + bx = c following Al Khwarizmi's instructions
    # Al Khwarizmi demonstrates this for the case b=10 and c=39
    val1 = b/2.0            # "halve the number of the roots"
    val2 = val1*val1        # "this you multiply by itself"
    val3 = val2 + c         # "Add this to thirty-nine"
    val4 = math.sqrt(val3)  # "take the root of this"
    val5 = val4 - val1      # "subtract from it half the number of roots"
    return val5             # "This is the root of the square which you sought for"
$$\mathrm{OR}(a,b) = \begin{cases} 0 & a = b = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (3.1)$$

$$\mathrm{AND}(a,b) = \begin{cases} 1 & a = b = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.2)$$

$$\mathrm{NOT}(a) = \begin{cases} 0 & a = 1 \\ 1 & a = 0 \end{cases} \qquad (3.3)$$
The functions AND, OR and NOT are the basic logical operators used in logic and many computer systems. In the context of logic, it is common to use the notation $a \wedge b$ for AND(𝑎, 𝑏), $a \vee b$ for OR(𝑎, 𝑏), and $\overline{a}$ or $\neg a$ for NOT(𝑎), and we will use this notation as well.
Each one of the functions AND, OR, NOT takes either one or two
single bits as input, and produces a single bit as output. Clearly, it
cannot get much more basic than that. However, the power of compu-
tation comes from composing such simple building blocks together.
$$\mathrm{MAJ}(x) = \begin{cases} 1 & x_0 + x_1 + x_2 \ge 2 \\ 0 & \text{otherwise.} \end{cases} \qquad (3.4)$$
That is, for every 𝑥 ∈ {0, 1}3 , MAJ(𝑥) = 1 if and only if the ma-
jority (i.e., at least two out of the three) of 𝑥’s elements are equal
to 1. Can you come up with a formula involving AND, OR and
NOT to compute MAJ? (It would be useful for you to pause at this
point and work out the formula for yourself. As a hint, although
the NOT operator is needed to compute some functions, you will
not need to use it to compute MAJ.)
Let us first try to rephrase MAJ(𝑥) in words: "MAJ(𝑥) = 1 if and only if there exists some pair of distinct elements 𝑖, 𝑗 such that both 𝑥𝑖 and 𝑥𝑗 are equal to 1." In other words it means that MAJ(𝑥) = 1 iff either both 𝑥0 = 1 and 𝑥1 = 1, or both 𝑥1 = 1 and 𝑥2 = 1, or both 𝑥0 = 1 and 𝑥2 = 1. Since the OR of three conditions 𝑐0 , 𝑐1 , 𝑐2 can be written as OR(𝑐0 , OR(𝑐1 , 𝑐2 )), we can now translate this into a formula,

$$\mathrm{MAJ}(x_0, x_1, x_2) = \mathrm{OR}\bigl(\mathrm{AND}(x_0, x_1), \mathrm{OR}(\mathrm{AND}(x_1, x_2), \mathrm{AND}(x_0, x_2))\bigr),$$

which corresponds to the following program:
def MAJ(X[0],X[1],X[2]):
    firstpair = AND(X[0],X[1])
    secondpair = AND(X[1],X[2])
    thirdpair = AND(X[0],X[2])
    temp = OR(secondpair,thirdpair)
    return OR(firstpair,temp)
Solution:
We can prove this by enumerating over all the 8 possible values
for 𝑎, 𝑏, 𝑐 ∈ {0, 1} but it also follows from the standard distributive
law. Suppose that we identify any positive integer with “true” and
the value zero with "false". Then for all numbers 𝑢, 𝑣 ∈ ℕ, 𝑢 + 𝑣
is positive if and only if 𝑢 ∨ 𝑣 is true and 𝑢 ⋅ 𝑣 is positive if and only
if 𝑢 ∧ 𝑣 is true. This means that for every 𝑎, 𝑏, 𝑐 ∈ {0, 1}, the expres-
sion 𝑎 ∧ (𝑏 ∨ 𝑐) is true if and only if 𝑎 ⋅ (𝑏 + 𝑐) is positive, and the
expression (𝑎 ∧ 𝑏) ∨ (𝑎 ∧ 𝑐) is true if and only if 𝑎 ⋅ 𝑏 + 𝑎 ⋅ 𝑐 is positive. But by the standard distributive law, 𝑎 ⋅ (𝑏 + 𝑐) = 𝑎 ⋅ 𝑏 + 𝑎 ⋅ 𝑐, and
hence the former expression is true if and only if the latter one is.
■
3.2.2 Extended example: Computing XOR from AND, OR, and NOT
Let us see how we can obtain a different function from the same
building blocks. Define XOR ∶ {0, 1}2 → {0, 1} to be the function
XOR(𝑎, 𝑏) = 𝑎 + 𝑏 mod 2. That is, XOR(0, 0) = XOR(1, 1) = 0 and
XOR(1, 0) = XOR(0, 1) = 1. We claim that we can construct XOR
using only AND, OR, and NOT.
P
As usual, it is a good exercise to try to work out the
algorithm for XOR using AND, OR and NOT on your
own before reading further.
The following algorithm computes XOR using AND, OR, and NOT:

def XOR(a,b):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    return AND(w2,w3)
Solution:
Addition modulo two satisfies the same properties of associativ-
ity ((𝑎 + 𝑏) + 𝑐 = 𝑎 + (𝑏 + 𝑐)) and commutativity (𝑎 + 𝑏 = 𝑏 + 𝑎) as
standard addition. This means that, if we define 𝑎 ⊕ 𝑏 to equal 𝑎 + 𝑏
mod 2, then
XOR3 (𝑎, 𝑏, 𝑐) = (𝑎 ⊕ 𝑏) ⊕ 𝑐 (3.7)
or in other words
def XOR3(a,b,c):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    w4 = AND(w2,w3)
    w5 = AND(w4,c)
    w6 = NOT(w5)
    w7 = OR(w4,c)
    return AND(w6,w7)
P
Try to generalize the above examples to obtain a way
to compute XOR𝑛 ∶ {0, 1}𝑛 → {0, 1} for every 𝑛 us-
ing at most 4𝑛 basic steps involving applications of a
function in {AND, OR, NOT} to outputs or previously
computed values.
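Here is one way to carry out this generalization, as a minimal Python sketch (ours, not the book's code) that assumes the AND, OR and NOT functions defined above: it XORs in one bit at a time, using the 4-operation XOR construction from Section 3.2.2 at each step, for a total of at most 4𝑛 basic operations.

def XOR_n(X):
    """Compute the XOR of the list of bits X using 4(len(X)-1) basic operations."""
    result = X[0]
    for i in range(1, len(X)):
        # each 2-bit XOR uses 4 applications of AND/OR/NOT
        w1 = AND(result, X[i])
        w2 = NOT(w1)
        w3 = OR(result, X[i])
        result = AND(w2, w3)
    return result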
P
These concerns will to a large extent guide us in the
upcoming chapters. Thus you would be well advised
to re-read the above informal definition and see what
you think about these issues.
In the remainder of this chapter, and the rest of this book, we will
begin to answer some of these questions. We will see more examples
of the power of simple operations to compute more complex opera-
tions including addition, multiplication, sorting and more. We will
also discuss how to physically implement simple operations such as
AND, OR and NOT using a variety of technologies.
from it. We also designate some gates as output gates, and their value
corresponds to the result of evaluating the circuit. For example, Fig. 3.8 gives such a circuit for the XOR function, following Section 3.2.2. We
evaluate an 𝑛-input Boolean circuit 𝐶 on an input 𝑥 ∈ {0, 1}𝑛 by plac-
ing the bits of 𝑥 on the inputs, and then propagating the values on the
wires until we reach an output, see Fig. 3.9.
R
Remark 3.4 — Physical realization of Boolean circuits. Boolean circuits are a mathematical model that does not necessarily correspond to a physical object, but they can be implemented physically. In physical implementations of circuits, the signal is often implemented by electric potential, or voltage, on a wire, where for example voltage above a certain level is interpreted as a logical value of 1, and below a certain level is interpreted as a logical value of 0. Section 3.4 discusses physical implementations of Boolean circuits (with examples including electrical signals such as in silicon-based circuits, but also biological and mechanical implementations).

Figure 3.8: A circuit with AND, OR and NOT gates for computing the XOR function.
Solution:
Another way to describe the function ALLEQ is that it outputs 1 on an input 𝑥 ∈ {0, 1}⁴ if and only if 𝑥 = 0⁴ or 𝑥 = 1⁴. We can phrase the condition 𝑥 = 1⁴ as 𝑥0 ∧ 𝑥1 ∧ 𝑥2 ∧ 𝑥3 , which can be computed using three AND gates. Similarly we can phrase the condition 𝑥 = 0⁴ as $\overline{x_0} \wedge \overline{x_1} \wedge \overline{x_2} \wedge \overline{x_3}$, which can be computed using four NOT gates and three AND gates. The output of ALLEQ is the OR of these two conditions, which results in the circuit of 4 NOT gates, 6 AND gates, and one OR gate presented in Fig. 3.10.
■
1. Formally define a Boolean circuit as a mathematical object.

2. Formally define what it means for a circuit 𝐶 to compute a function 𝑓.

Figure 3.10: A Boolean circuit for computing the all equal function ALLEQ ∶ {0, 1}⁴ → {0, 1} that outputs 1 on 𝑥 ∈ {0, 1}⁴ if and only if 𝑥0 = 𝑥1 = 𝑥2 = 𝑥3 .
• The other 𝑠 vertices are known as gates. Each gate is labeled with
∧, ∨ or ¬. Gates labeled with ∧ (AND) or ∨ (OR) have two in-
neighbors. Gates labeled with ¬ (NOT) have one in-neighbor.
We will allow parallel edges.
• Exactly 𝑚 of the gates are also labeled with the 𝑚 labels Y[0], …,
Y[𝑚 − 1] (in addition to their label ∧/∨/¬). These are known as
outputs.
• For every 𝑣 in the ℓ-th layer (i.e., 𝑣 such that ℎ(𝑣) = ℓ) do:
• The result of this process is the value 𝑦 ∈ {0, 1}𝑚 such that for
every 𝑗 ∈ [𝑚], 𝑦𝑗 is the value assigned to the vertex with label
Y[𝑗].
Let 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 . We say that the circuit 𝐶 computes 𝑓 if
for every 𝑥 ∈ {0, 1}𝑛 , 𝐶(𝑥) = 𝑓(𝑥).
R
Remark 3.7 — Boolean circuits nitpicks (optional). In
phrasing Definition 3.5, we’ve made some technical
Solution:
Writing such a program is tedious but not truly hard. To com-
pare two numbers we first compare their most significant digit,
and then go down to the next digit and so on and so forth. In this
case where the numbers have just two binary digits, these compar-
isons are particularly simple. The number represented by (𝑎, 𝑏) is
larger than the number represented by (𝑐, 𝑑) if and only if one of
the following conditions happens:
1. The most significant bit 𝑎 of (𝑎, 𝑏) is larger than the most signifi-
cant bit 𝑐 of (𝑐, 𝑑).
or
2. The two most significant bits 𝑎 and 𝑐 are equal, but 𝑏 > 𝑑.
temp_1 = NOT(X[2])
temp_2 = AND(X[0],temp_1)
temp_3 = OR(X[0],temp_1)
temp_4 = NOT(X[3])
temp_5 = AND(X[1],temp_4)
temp_6 = AND(temp_5,temp_3)
Y[0] = OR(temp_2,temp_6)
Theorem 3.9 — Equivalence of circuits and straight-line programs. Let 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 and 𝑠 ≥ 𝑚 be some number. Then 𝑓 is computable by a Boolean circuit of 𝑠 gates if and only if 𝑓 is computable by an AON-CIRC program of 𝑠 lines.
Figure 3.12: A circuit for computing the CMP function. The evaluation of this circuit on (1, 1, 1, 0) yields the output 1, since the number 3 (represented in binary as 11) is larger than the number 2 (represented in binary as 10).

Proof Idea:
The idea is simple: AON-CIRC programs and Boolean circuits are just different ways of describing the exact same computational process. For example, an AND gate in a Boolean circuit corresponds to computing the AND of two previously-computed values. In an AON-CIRC program this corresponds to the line that stores in a variable the AND of two previously-computed variables.
⋆
P
This proof of Theorem 3.9 is simple at heart, but all
the details it contains can make it a little cumbersome
to read. You might be better off trying to work it out
yourself before reading it. Our GitHub repository con-
tains a “proof by Python” of Theorem 3.9: implemen-
tation of functions circuit2prog and prog2circuits
mapping Boolean circuits to AON-CIRC programs and
vice versa.
Proof of Theorem 3.9. Let 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 . Since the theorem is an
“if and only if” statement, to prove it we need to show both directions:
translating an AON-CIRC program that computes 𝑓 into a circuit that
computes 𝑓, and translating a circuit that computes 𝑓 into an AON-
CIRC program that does so.
We start with the first direction. Let 𝑃 be an 𝑠 line AON-CIRC that
computes 𝑓. We define a circuit 𝐶 as follows: the circuit will have 𝑛
inputs and 𝑠 gates. For every 𝑖 ∈ [𝑠], if the 𝑖-th line has the form foo
= AND(bar,blah) then the 𝑖-th gate in the circuit will be an AND
gate that is connected to gates 𝑗 and 𝑘 where 𝑗 and 𝑘 correspond to
the last lines before 𝑖 where the variables bar and blah (respectively)
were written to. (For example, if 𝑖 = 57 and the last line bar was
written to is 35 and the last line blah was written to is 17 then the two
in-neighbors of gate 57 will be gates 35 and 17.) If either bar or blah is
an input variable then we connect the gate to the corresponding input
vertex instead. If foo is an output variable of the form Y[𝑗] then we
add the same label to the corresponding gate to mark it as an output
gate. We do the analogous operations if the 𝑖-th line involves an OR
or a NOT operation (except that we use the corresponding OR or NOT
gate, and in the latter case have only one in-neighbor instead of two).
For every input 𝑥 ∈ {0, 1}𝑛 , if we run the program 𝑃 on 𝑥, then the
value that is computed in the 𝑖-th line is exactly the value
that will be assigned to the 𝑖-th gate if we evaluate the circuit 𝐶 on 𝑥.
Hence 𝐶(𝑥) = 𝑃 (𝑥) for every 𝑥 ∈ {0, 1}𝑛 .
For the other direction, let 𝐶 be a circuit of 𝑠 gates and 𝑛 inputs that
computes the function 𝑓. We sort the gates according to a topological
order and write them as 𝑣0 , … , 𝑣𝑠−1 . We now can create a program
𝑃 of 𝑠 lines as follows. For every 𝑖 ∈ [𝑠], if 𝑣𝑖 is an AND gate with
in-neighbors 𝑣𝑗 , 𝑣𝑘 then we will add a line to 𝑃 of the form temp_𝑖
= AND(temp_𝑗,temp_𝑘), unless one of the vertices is an input vertex
or an output gate, in which case we change this to the form X[.] or
Y[.] appropriately. Because we work in topological ordering, we are
guaranteed that the in-neighbors 𝑣𝑗 and 𝑣𝑘 correspond to variables
that have already been assigned a value. We do the same for OR and
NOT gates. Once again, one can verify that for every input 𝑥, the
value 𝑃 (𝑥) will equal 𝐶(𝑥) and hence the program computes the
same function as the circuit. (Note that since 𝐶 is a valid circuit, per
Definition 3.5, every input vertex of 𝐶 has at least one out-neighbor
and there are exactly 𝑚 output gates labeled 0, … , 𝑚 − 1; hence all the
variables X[0], …, X[𝑛 − 1] and Y[0] ,…, Y[𝑚 − 1] will appear in the
program 𝑃 .)
■
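To make the topological-order evaluation in the second direction concrete, here is a minimal Python sketch (an illustrative assumption of ours, not the repository's circuit2prog/prog2circuits code) that evaluates a circuit given as a topologically sorted list of gates:

def eval_circuit(n, gates, outputs, x):
    """Evaluate a circuit on input x.
    Vertices 0..n-1 are inputs; each gate is (op, j, k) with op in
    {'AND','OR','NOT'} and j, k indices of previously-defined vertices."""
    values = list(x)
    for (op, j, k) in gates:  # gates are sorted in topological order
        if op == 'AND':
            values.append(values[j] & values[k])
        elif op == 'OR':
            values.append(values[j] | values[k])
        else:  # 'NOT' uses only its first in-neighbor
            values.append(1 - values[j])
    return [values[o] for o in outputs]

# the XOR circuit of Fig. 3.8, gate by gate:
xor_gates = [('AND',0,1), ('NOT',2,2), ('OR',0,1), ('AND',3,4)]
print([eval_circuit(2, xor_gates, [5], x)[0] for x in [(0,0),(0,1),(1,0),(1,1)]])
# [0, 1, 1, 0]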
3.4.1 Transistors
A transistor can be thought of as an electric circuit with two inputs, known as the source and the gate, and an output, known as the sink. The gate controls whether current flows from the source to the sink. In a standard transistor, if the gate is "ON" then current can flow from the source to the sink, and if it is "OFF" then it can't. In a complementary transistor this is reversed: if the gate is "OFF" then current can flow from the source to the sink, and if it is "ON" then it can't.

Figure 3.14: Crab-based logic gates from the paper "Robust soldier-crab ball gate" by Gunji, Nishiyama and Adamatzky. This is an example of an AND gate that relies on the tendency of two swarms of crabs arriving from different directions to combine to a single swarm that continues in the average of the directions.

There are several ways to implement the logic of a transistor. For example, we can use faucets to implement it using water pressure (e.g., Fig. 3.15). This might seem merely a curiosity, but there is
a field known as fluidics concerned with implementing logical op-
erations using liquids or gasses. Some of the motivations include
operating in extreme environmental conditions such as in space or a
battlefield, where standard electronic equipment would not survive.
The standard implementations of transistors use electrical current. One of the original implementations used vacuum tubes. As its name implies, a vacuum tube is a tube containing nothing (i.e., a vacuum), in which a priori electrons could freely flow from the source (a wire) to the sink (a plate). However, there is a gate (a grid) between the two, and modulating its voltage can block the flow of electrons.

Figure 3.15: We can implement the logic of transistors using water. The water pressure from the gate closes or opens a faucet between the source and the sink.
Early vacuum tubes were roughly the size of lightbulbs (and
looked very much like them too). In the 1950’s they were supplanted
by transistors, which implement the same logic using semiconduc-
tors which are materials that normally do not conduct electricity but
whose conductivity can be modified and controlled by inserting impu-
rities (“doping”) and applying an external electric field (this is known
as the field effect). In the 1960’s computers started to be implemented
using integrated circuits which enabled much greater density. In 1965,
Gordon Moore predicted that the number of transistors per integrated
circuit would double every year (see Fig. 3.16), and that this would
lead to "such wonders as home computers —or at least terminals connected to a central computer— automatic controls for automobiles, and personal portable communications equipment."
𝑎 pipe of the first pair and the 𝑏 pipe of the second pair, then a marble
will roll out of the object in the NAND(𝑎, 𝑏)-pipe of the outgoing pair.
In fact, there is even a commercially-available educational game that
uses marbles as a basis of computing, see Fig. 3.26.
Proof. We start with the following observation. For every 𝑎 ∈ {0, 1}, AND(𝑎, 𝑎) = 𝑎. Hence, NAND(𝑎, 𝑎) = NOT(AND(𝑎, 𝑎)) = NOT(𝑎). This means that NAND can compute NOT. By the principle of "double negation", AND(𝑎, 𝑏) = NOT(NOT(AND(𝑎, 𝑏))), and hence we can use NAND to compute AND as well. Once we can compute AND and NOT, we can compute OR using "De Morgan's Law": OR(𝑎, 𝑏) = NOT(AND(NOT(𝑎), NOT(𝑏))) (which can also be written as $a \vee b = \overline{\overline{a} \wedge \overline{b}}$) for every 𝑎, 𝑏 ∈ {0, 1}.
■

Figure 3.24: A physical implementation of a NAND gate using marbles. Each wire in a Boolean circuit is modeled by a pair of pipes representing the values 0 and 1 respectively, and hence a gate has four input pipes (two for each logical input) and two output pipes. If one of the input pipes representing the value 0 has a marble in it then that marble will flow to the output pipe representing the value 1. (The dashed line represents a gadget that ensures that at most one marble is allowed to flow onward in the pipe.) If both input pipes representing the value 1 have marbles in them, then the first marble will be stuck but the second one will flow onwards to the output pipe representing the value 0.

P
Theorem 3.10’s proof is very simple, but you should
make sure that (i) you understand the statement of
the theorem, and (ii) you follow its proof. In partic-
ular, you should make sure you understand why De
Morgan’s law is true.
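As a quick sanity check on the proof (a sketch of ours, not part of the book), one can verify all three constructions by exhaustive enumeration:

def NAND(a, b):
    return 1 - a * b

for a in [0, 1]:
    assert NAND(a, a) == 1 - a  # NOT via NAND
    for b in [0, 1]:
        assert NAND(NAND(a, b), NAND(a, b)) == a & b   # AND via double negation
        assert NAND(NAND(a, a), NAND(b, b)) == a | b   # OR via De Morgan's law
print("all NAND constructions verified")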
Solution:
Recall that we can write XOR(𝑎, 𝑏) = AND(NOT(AND(𝑎, 𝑏)), OR(𝑎, 𝑏)), and that each of AND, OR and NOT can be computed using only NAND. This yields the following construction for computing the XOR of 𝑥0 and 𝑥1 :

1. Let 𝑢 = NAND(𝑥0 , 𝑥1 ).
2. Let 𝑣 = NAND(𝑥0 , 𝑢).
3. Let 𝑤 = NAND(𝑥1 , 𝑢).
4. The XOR of 𝑥0 and 𝑥1 is 𝑦0 = NAND(𝑣, 𝑤).
One can verify that this algorithm does indeed compute XOR
by enumerating all the four choices for 𝑥0 , 𝑥1 ∈ {0, 1}. We can also
represent this algorithm graphically as a circuit, see Fig. 3.28.
Proof Idea:
The idea of the proof is to just replace every AND, OR and NOT gate with their NAND implementations following the proof of Theorem 3.10.
⋆

Figure 3.28: A circuit with NAND gates to compute the XOR of two bits.
• NOT(𝑎) = NAND(𝑎, 𝑎)
• AND(𝑎, 𝑏) = NAND(NAND(𝑎, 𝑏), NAND(𝑎, 𝑏))
• OR(𝑎, 𝑏) = NAND(NAND(𝑎, 𝑎), NAND(𝑏, 𝑏))
Big Idea 3 Two models are equivalent in power if they can be used
to compute the same set of functions.
input 𝑥 ∈ {0, 1}2𝑛 outputs the binary representation of the sum of the numbers represented by 𝑥0 , … , 𝑥𝑛−1 and 𝑥𝑛 , … , 𝑥2𝑛−1 :
foo = NAND(bar,blah)
u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)
P
Do you know what function this program computes?
Hint: you have seen it before.
Theorem 3.17 — NAND circuits and straight-line program equivalence. For every 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 and 𝑠 ≥ 𝑚, 𝑓 is computable by a NAND-CIRC program of 𝑠 lines if and only if 𝑓 is computable by a NAND circuit of 𝑠 gates.
R
Remark 3.18 — Is the NAND-CIRC programming language
Turing Complete? (optional note). You might have heard
of a term called “Turing Complete” that is sometimes
Theorem 3.19 — Equivalence between models of finite computation. For every sufficiently large 𝑠, 𝑛, 𝑚 and 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , the following conditions are all equivalent to one another:
Proof Idea:
We omit the formal proof, which is obtained by combining Theo-
rem 3.9, Theorem 3.12, and Theorem 3.17. The key observation is that
the results we have seen allow us to translate a program/circuit that
computes 𝑓 in one of the above models into a program/circuit that computes 𝑓 in another one of these models.
■

Example 3.21 — IF, ZERO, ONE circuits. Let ℱ = {IF, ZERO, ONE}, where ZERO ∶ {0, 1} → {0} and ONE ∶ {0, 1} → {1} are the constant zero and one functions, and IF ∶ {0, 1}3 → {0, 1} is the function that on input (𝑎, 𝑏, 𝑐) outputs 𝑏 if 𝑎 = 1 and 𝑐 otherwise. Then ℱ is universal.
Indeed, we can demonstrate that {IF, ZERO, ONE} is universal using the following formula for NAND: NAND(𝑎, 𝑏) = IF(𝑎, IF(𝑏, 0, 1), 1), where the constants 0 and 1 are obtained using the ZERO and ONE functions.
✓ Chapter Recap
3.7 EXERCISES
Exercise 3.1 — Compare 4 bit numbers. Give a Boolean circuit
(with AND/OR/NOT gates) that computes the function
CMP8 ∶ {0, 1}8 → {0, 1} such that CMP8 (𝑎0 , 𝑎1 , 𝑎2 , 𝑎3 , 𝑏0 , 𝑏1 , 𝑏2 , 𝑏3 ) = 1
if and only if the number represented by 𝑎0 𝑎1 𝑎2 𝑎3 is larger than the
number represented by 𝑏0 𝑏1 𝑏2 𝑏3 .
■
Exercise 3.4 — AND,OR is not universal. Prove that for every 𝑛-bit input
circuit 𝐶 that contains only AND and OR gates, as well as gates that
compute the constant functions 0 and 1, 𝐶 is monotone, in the sense
that if 𝑥, 𝑥′ ∈ {0, 1}𝑛 , 𝑥𝑖 ≤ 𝑥′𝑖 for every 𝑖 ∈ [𝑛], then 𝐶(𝑥) ≤ 𝐶(𝑥′ ).
Conclude that the set {AND, OR, 0, 1} is not universal.
■
Exercise 3.7 — MAJ, NOT is not universal. Prove that {MAJ, NOT} is not a universal set. See footnote for hint.⁴
■
⁴ Hint: Use the fact that $\overline{\mathrm{MAJ}(a,b,c)} = \mathrm{MAJ}(\overline{a}, \overline{b}, \overline{c})$ to prove that every 𝑓 ∶ {0, 1}𝑛 → {0, 1} computable by a circuit with only MAJ and NOT gates satisfies 𝑓(0, 0, … , 0) ≠ 𝑓(1, 1, … , 1). Thanks to Nathan Brunelle and David Evans for suggesting this exercise.

Exercise 3.8 — NOR is universal. Let NOR ∶ {0, 1}2 → {0, 1} be defined as NOR(𝑎, 𝑏) = NOT(OR(𝑎, 𝑏)). Prove that {NOR} is a universal set of gates.
4. Prove that for every NAND-circuit 𝐶 with 𝑛 inputs and one output
that computes a function 𝑔 ∶ {0, 1}𝑛 → {0, 1}, if we replace every
gate of 𝐶 with a NAND-approximator and then invoke the result-
ing circuit on some 𝑥 ∈ {0, 1}𝑛 , the output will be a number 𝑦 such
that |𝑦 − 𝑔(𝑥)| ≤ 1/3.
4
Syntactic sugar, and computing every function
“[In 1951] I had a running compiler and nobody would touch it because,
they carefully told me, computers could only do arithmetic; they could not do
programs.”, Grace Murray Hopper, 1986.
2. So you can realize how lucky you are to be taking a theory of com-
putation course and not a compilers course… :)
def Proc(a,b):
    proc_code
    return c
some_code
f = Proc(d,e)
some_more_code

with the procedure call replaced by an inlined copy of its code:

some_code
proc_code'
some_more_code

(where proc_code' is the code proc_code with the variable a replaced by d, b replaced by e, and the variable c holding the return value replaced by f).
Theorem 4.1 — Procedure definition syntactic sugar. Let NAND-CIRC-PROC be the programming language NAND-CIRC augmented with the syntax above for defining procedures. Then for every NAND-CIRC-PROC program 𝑃 , there exists a standard (i.e., "sugar free") NAND-CIRC program 𝑃 ′ that computes the same function as 𝑃 .
R
Remark 4.2 — No recursive procedure. NAND-CIRC-
PROC only allows non recursive procedures. In particu-
lar, the code of a procedure Proc cannot call Proc but
only use procedures that were defined before it. With-
out this restriction, the above “search and replace”
procedure might never terminate and Theorem 4.1
would not be true.
■

Example 4.3 — Computing Majority from NAND using syntactic sugar. Procedures allow us to express NAND-CIRC programs much more cleanly and succinctly. For example, because we can compute AND, OR, and NOT using NANDs, we can compute the Majority function as follows:
def NOT(a):
    return NAND(a,a)

def AND(a,b):
    temp = NAND(a,b)
    return NOT(temp)

def OR(a,b):
    temp1 = NOT(a)
    temp2 = NOT(b)
    return NAND(temp1,temp2)

def MAJ(a,b,c):
    and1 = AND(a,b)
    and2 = AND(a,c)
    and3 = AND(b,c)
    or1 = OR(and1,and2)
    return OR(or1,and3)

print(MAJ(0,1,1))
# 1
R
Remark 4.4 — Counting lines. While we can use syn-
tactic sugar to present NAND-CIRC programs in more
readable ways, we did not change the definition of
the language itself. Therefore, whenever we say that
some function 𝑓 has an 𝑠-line NAND-CIRC program
we mean a standard “sugar free” NAND-CIRC pro-
gram, where all syntactic sugar has been expanded
out. For example, the program of Example 4.3 is a
12-line program for computing the MAJ function, even though its description using procedures above appears shorter.
2. A line foo = exp, where exp is the expression following the re-
turn statement in the definition of the procedure Proc.
R
Remark 4.5 — Parsing function definitions (optional). The
function desugar in Fig. 4.3 assumes that it is given
the procedure already split up into its name, argu-
ments, and body. It is not crucial for our purposes to
describe precisely how to scan a definition and split it
up into these components, but in case you are curious,
it can be achieved in Python via the following code:
import re

def parse_func(code):
    """Parse a function definition into name, arguments and body"""
    lines = [l.strip() for l in code.split('\n')]
    regexp = r'def\s+([a-zA-Z\_0-9]+)\(([\sa-zA-Z0-9\_,]+)\)\s*:\s*'
    m = re.match(regexp,lines[0])
    return m.group(1), m.group(2).split(','), '\n'.join(lines[1:])

Figure 4.3: Python code for transforming NAND-CIRC-PROC programs into standard sugar free NAND-CIRC programs.
P
Before reading onward, try to see how you could com-
pute the IF function using NAND’s. Once you do that,
see how you can use that to emulate if/then types of
constructs.
def IF(cond,a,b):
    notcond = NAND(cond,cond)
    temp = NAND(b,notcond)
    temp1 = NAND(a,cond)
    return NAND(temp,temp1)
that assigns to foo its old value when condition equals 0, and assigns to foo the value of blah otherwise. More generally we can replace code of the form
if (cond):
    a = ...
    b = ...
    c = ...

with code of the form

temp_a = ...
temp_b = ...
temp_c = ...
a = IF(cond,temp_a,a)
b = IF(cond,temp_b,b)
c = IF(cond,temp_c,c)
Theorem 4.6 — Conditional statements syntactic sugar. Let NAND-CIRC-IF be the programming language NAND-CIRC augmented with if/then/else statements for allowing code to be conditionally executed based on whether a variable is equal to 0 or 1. Then for every NAND-CIRC-IF program 𝑃 , there exists a standard (i.e., "sugar free") NAND-CIRC program 𝑃 ′ that computes the same function as 𝑃 .
ADD([1,1,1,0,0],[1,0,0,0,0])
# [0, 0, 0, 1, 0, 0]
where zero is the constant zero function, and MAJ and XOR correspond to the majority and XOR functions respectively. While we use Python syntax for convenience, in this example 𝑛 is some fixed integer and so for every such 𝑛, ADD is a finite function that takes as input 2𝑛 bits and outputs 𝑛 + 1 bits.
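Here is a minimal self-contained Python sketch of this grade-school addition scheme (our illustration, not necessarily the book's exact ADD code); numbers are given as lists of bits with the least significant bit first, which matches the ADD example above:

def MAJ(a, b, c):
    # majority: the carry bit of a + b + c
    return 1 if a + b + c >= 2 else 0

def XOR3(a, b, c):
    # parity: the sum bit of a + b + c
    return (a + b + c) % 2

def ADD(A, B):
    """Add two n-bit numbers given as lists of bits, least significant first."""
    carry = 0  # the "zero" function supplies the initial carry
    result = []
    for a, b in zip(A, B):
        result.append(XOR3(a, b, carry))
        carry = MAJ(a, b, carry)
    result.append(carry)
    return result

print(ADD([1,1,1,0,0],[1,0,0,0,0]))
# [0, 0, 0, 1, 0, 0]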
Theorem 4.8 — Multiplication using NAND-CIRC programs. For every 𝑛, let MULT𝑛 ∶ {0, 1}2𝑛 → {0, 1}2𝑛 be the function that, given the representations of two 𝑛-bit numbers 𝑥 and 𝑦, outputs the representation of their product 𝑥 ⋅ 𝑦. Then there is a NAND-CIRC program of 𝑂(𝑛²) lines that computes MULT𝑛 .
def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
    if i[0]==1:
        return LOOKUP1(X[2],X[3],i[1])
    else:
        return LOOKUP1(X[0],X[1],i[1])

or in other words,

def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
    a = LOOKUP1(X[2],X[3],i[1])
    b = LOOKUP1(X[0],X[1],i[1])
    return IF(i[0],a,b)
Proof of Theorem 4.10 from Lemma 4.11. Now that we have Lemma 4.11,
we can complete the proof of Theorem 4.10. We will prove by induc-
tion on 𝑘 that there is a NAND-CIRC program of at most 4 ⋅ (2𝑘 − 1)
lines for LOOKUP𝑘 . For 𝑘 = 1 this follows by the four line program for
IF we’ve seen before. For 𝑘 > 1, we use the following pseudocode:
a = LOOKUP_(k-1)(X[0],...,X[2^(k-1)-1],i[1],...,i[k-1])
b = LOOKUP_(k-1)(X[2^(k-1)],...,X[2^k-1],i[1],...,i[k-1])
return IF(i[0],b,a)
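In runnable Python form, the recursion looks as follows (a sketch of ours for intuition; the book's object is a straight-line program, not a recursive Python function). The index bits are taken most significant first, so i[0] selects between the two halves:

def IF(cond, a, b):
    return a if cond == 1 else b

def LOOKUP(X, i):
    """Return X[index], where index is encoded by the bits of i (MSB first).
    X has 2^k entries and i has k bits."""
    if len(i) == 1:
        return IF(i[0], X[1], X[0])
    half = len(X) // 2
    a = LOOKUP(X[:half], i[1:])   # used when i[0] == 0
    b = LOOKUP(X[half:], i[1:])   # used when i[0] == 1
    return IF(i[0], b, a)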
2. Coming up with NAND-CIRC programs for various functions is a very tedious task.

Figure 4.7: The number of lines in our implementation of the LOOKUP_k function as a function of 𝑘 (i.e., the length of the index). The number of lines in our implementation is roughly 3 ⋅ 2^𝑘 .
Thus I would not blame the reader if they were not particularly
looking forward to a long sequence of examples of functions that can
be computed by NAND-CIRC programs. However, it turns out we are
not going to need this, as we can show in one fell swoop that NAND-
CIRC programs can compute every finite function:
G0000 = 1
G1000 = 1
G0100 = 0
...
G0111 = 1
G1111 = 1
Y[0] = LOOKUP_4(G0000,G1000,...,G1111,
X[0],X[1],X[2],X[3])
R
Remark 4.14 — Result in perspective. While Theo-
rem 4.12 seems striking at first, in retrospect, it is
perhaps not that surprising that every finite function
can be computed with a NAND-CIRC program. After
all, a finite function 𝐹 ∶ {0, 1}𝑛 → {0, 1}𝑚 can be
represented by simply the list of its outputs for each
one of the 2𝑛 input values. So it makes sense that we
could write a NAND-CIRC program of similar size
to compute it. What is more interesting is that some
functions, such as addition and multiplication, have programs that are much smaller than this trivial bound.
Theorem 4.15 — Universality of NAND circuits, improved bound. There exists a constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a NAND-CIRC program with at most 𝑐 ⋅ 𝑚2𝑛 /𝑛 lines that computes the function 𝑓.³
³ The constant 𝑐 in this theorem is at most 10 and in fact can be arbitrarily close to 1, see Section 4.8.
Proof. As before, it is enough to prove the case that 𝑚 = 1. Hence
we let 𝑓 ∶ {0, 1}𝑛 → {0, 1}, and our goal is to prove that there exists
a NAND-CIRC program of 𝑂(2𝑛 /𝑛) lines (or equivalently a Boolean
circuit of 𝑂(2𝑛 /𝑛) gates) that computes 𝑓.
We let 𝑘 = log(𝑛 − 2 log 𝑛) (the reasoning behind this choice will become clear later on). We define the function 𝑔 ∶ {0, 1}^𝑘 → {0, 1}^{2^{𝑛−𝑘}} as follows: for every 𝑎 ∈ {0, 1}^𝑘 , 𝑔(𝑎) is the string whose 𝑖-th bit is 𝑓(𝑎𝑏), where 𝑏 is the 𝑖-th string of {0, 1}^{𝑛−𝑘} in lexicographic order. We define 𝑔𝑖 (𝑎) = 𝑔(𝑎)𝑖 for every 𝑎 ∈ {0, 1}^𝑘 and 𝑖 ∈ [2^{𝑛−𝑘}]. (That is, 𝑔𝑖 (𝑎) is the 𝑖-th bit of 𝑔(𝑎).) Naively, we could use Theorem 4.12 to compute each 𝑔𝑖 in 𝑂(2^𝑘 ) lines, but then the total cost is 𝑂(2^{𝑛−𝑘} ⋅ 2^𝑘 ) = 𝑂(2^𝑛 ), which does not save us anything. However, the crucial observation is that there are only $2^{2^k}$ distinct functions mapping {0, 1}^𝑘 to {0, 1}.
Since each of the 𝑔𝑖 's is one of at most $2^{2^k}$ distinct functions mapping {0, 1}^𝑘 to {0, 1}, we can compute the function 𝑔 (and hence by (4.5) also 𝑓) using at most

$$O\left(2^{2^k} \cdot 2^k + 2^{n-k}\right) \qquad (4.7)$$

operations. Now all that is left is to plug into (4.7) our choice of 𝑘 = log(𝑛 − 2 log 𝑛). By definition, $2^k = n - 2\log n$, which means that (4.7) can be bounded by

$$O\left(2^{n-2\log n} \cdot (n - 2\log n) + 2^{n - \log(n - 2\log n)}\right) \le \qquad (4.8)$$

$$O\left(\frac{2^n}{n^2} \cdot n + \frac{2^n}{n - 2\log n}\right) \le O\left(\frac{2^n}{n} + \frac{2^n}{0.5\,n}\right) = O\left(\frac{2^n}{n}\right), \qquad (4.9)$$

which is what we wanted to prove. (We used above the fact that $n - 2\log n \ge 0.5\,n$ for sufficiently large 𝑛.)

Figure 4.9: If 𝑔0 , … , 𝑔𝑁−1 is a collection of functions each mapping {0, 1}^𝑘 to {0, 1} such that at most 𝑆 of them are distinct, then for every 𝑎 ∈ {0, 1}^𝑘 , we can compute all the values 𝑔0 (𝑎), … , 𝑔𝑁−1 (𝑎) using at most 𝑂(𝑆 ⋅ 2^𝑘 + 𝑁 ) operations by first computing the distinct functions and then copying the resulting values.
■
Theorem 4.16 — Universality of Boolean circuits, improved bound. There exists some constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a Boolean circuit with at most 𝑐 ⋅ 𝑚2𝑛 /𝑛 gates that computes the function 𝑓 .
directly with circuits and avoiding the usage of all the syntactic sugar
machinery. (However, that machinery is useful in its own right, and
will find other applications later on.)
Theorem 4.17 — Universality of Boolean circuits (alternative phrasing). There exists some constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 , there is a Boolean circuit with at most 𝑐 ⋅ 𝑚 ⋅ 𝑛2𝑛 gates that computes the function 𝑓 .
Proof Idea:
The idea of the proof is illustrated in Fig. 4.10. As before, it is
enough to focus on the case that 𝑚 = 1 (the function 𝑓 has a sin-
gle output), since we can always extend this to the case of 𝑚 > 1
by looking at the composition of 𝑚 circuits each computing a differ-
ent output bit of the function 𝑓. We start by showing that for every
𝛼 ∈ {0, 1}𝑛 , there is an 𝑂(𝑛) sized circuit that computes the function
𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1} defined as follows: 𝛿𝛼 (𝑥) = 1 iff 𝑥 = 𝛼 (that is,
𝛿𝛼 outputs 0 on all inputs except the input 𝛼). We can then write any
function 𝑓 ∶ {0, 1}𝑛 → {0, 1} as the OR of at most 2𝑛 functions 𝛿𝛼 for
the 𝛼’s on which 𝑓(𝛼) = 1.
⋆
Proof of Theorem 4.17. We prove the theorem for the case 𝑚 = 1. The
result can be extended for 𝑚 > 1 as before (see also Exercise 4.9). Let
𝑓 ∶ {0, 1}𝑛 → {0, 1}. We will prove that there is an 𝑂(𝑛 ⋅ 2𝑛 )-sized
Boolean circuit to compute 𝑓 in the following steps:
1. We show that for every 𝛼 ∈ {0, 1}𝑛 , there is an 𝑂(𝑛) sized circuit
that computes the function 𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1}, where 𝛿𝛼 (𝑥) = 1 iff
𝑥 = 𝛼.
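The following Python sketch (ours, for intuition only) mirrors this construction: 𝑓 is represented by the set of inputs 𝛼 on which it equals 1, and evaluated as the OR of the corresponding 𝛿𝛼 functions:

def delta(alpha, x):
    """1 iff x == alpha; realized in the circuit by ANDing z_i = x_i or NOT(x_i)."""
    return 1 if all(zi == ai for zi, ai in zip(x, alpha)) else 0

def f_as_or_of_deltas(ones, x):
    """ones: the list of inputs alpha (bit tuples) with f(alpha) = 1."""
    return 1 if any(delta(alpha, x) for alpha in ones) else 0

# Example: ALLEQ on 4 bits is the OR of just two deltas.
ones = [(0,0,0,0), (1,1,1,1)]
print(f_as_or_of_deltas(ones, (1,1,1,1)), f_as_or_of_deltas(ones, (0,1,0,0)))
# 1 0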
4.6 THE CLASS SIZE(𝑇 )

Figure 4.11: For every string 𝛼 ∈ {0, 1}𝑛 , there is a Boolean circuit of 𝑂(𝑛) gates to compute the function 𝛿𝛼 ∶ {0, 1}𝑛 → {0, 1} such that 𝛿𝛼 (𝑥) = 1 if and only if 𝑥 = 𝛼. The circuit is very simple. Given input 𝑥0 , … , 𝑥𝑛−1 we compute the AND of 𝑧0 , … , 𝑧𝑛−1 where 𝑧𝑖 = 𝑥𝑖 if 𝛼𝑖 = 1 and 𝑧𝑖 = NOT(𝑥𝑖 ) if 𝛼𝑖 = 0. While formally Boolean circuits only have a gate for computing the AND of two inputs, we can implement an AND of 𝑛 inputs by composing two-input ANDs.

We have seen that every function 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 can be computed by a circuit of size 𝑂(𝑚 ⋅ 2𝑛 ), and some functions (such as addition and multiplication) can be computed by much smaller circuits. We define SIZE(𝑠) to be the set of functions that can be computed by NAND circuits of at most 𝑠 gates (or equivalently, by NAND-CIRC programs of at most 𝑠 lines). Formally, the definition is as follows:
Figure 4.12: There are $2^{2^n}$ functions mapping {0, 1}𝑛 to {0, 1}, and an infinite number of circuits with 𝑛 bit inputs and a single bit of output. Every circuit computes one function, but every function can be computed by many circuits. We say that 𝑓 ∈ SIZE𝑛,1 (𝑠) if the smallest circuit that computes 𝑓 has 𝑠 or fewer gates. For example XOR𝑛 ∈ SIZE𝑛,1 (4𝑛). Theorem 4.12 shows that every function 𝑔 is computable by some circuit of at most 𝑐 ⋅ 2𝑛 /𝑛 gates, and hence SIZE𝑛,1 (𝑐 ⋅ 2𝑛 /𝑛) corresponds to the set of all functions from {0, 1}𝑛 to {0, 1}.
Solution:
If 𝑓 ∈ SIZE(𝑠) then there is an 𝑠-line NAND-CIRC program
𝑃 that computes 𝑓. We can rename the variable Y[0] in 𝑃 to a
variable temp and add the line
Y[0] = NAND(temp,temp)
✓ Chapter Recap
4.7 EXERCISES
Exercise 4.1 — Pairing. This exercise asks you to give a one-to-one map from ℕ² to ℕ. This can be useful to implement two-dimensional arrays as "syntactic sugar" in programming languages that only have one-dimensional arrays.
t = NAND(X[2],X[2])
u = NAND(X[0],t)
v = NAND(X[1],X[2])
Y[0] = NAND(u,v)
2. A full adder is the function FA ∶ {0, 1}3 → {0, 1}2 that takes in
two bits and a “carry” bit and outputs their sum. That is, for every
𝑎, 𝑏, 𝑐 ∈ {0, 1}, FA(𝑎, 𝑏, 𝑐) = (𝑒, 𝑓) such that 2𝑒 + 𝑓 = 𝑎 + 𝑏 + 𝑐.
Prove that there is a NAND circuit of at most nine NAND gates that
computes FA.
Temp[0] = NAND(X[0],X[0])
Temp[1] = NAND(X[1],X[1])
Temp[2] = NAND(Temp[0],Temp[1])
Temp[3] = NAND(X[2],X[2])
Temp[4] = NAND(X[3],X[3])
Temp[5] = NAND(Temp[3],Temp[4])
Temp[6] = NAND(Temp[2],Temp[2])
Temp[7] = NAND(Temp[5],Temp[5])
Y[0] = NAND(Temp[6],Temp[7])
1. Write a program 𝑃 ′ with at most three lines of code that uses both
NAND as well as the syntactic sugar OR that computes the same func-
tion as 𝑃 .
2. Draw a circuit that computes the same function as 𝑃 and uses only
AND and NOT gates.
and such that for every 𝑛 > 1, MAJ𝑛 ∈ SIZE(𝑐𝑛), where MAJ𝑛 ∶ {0, 1}𝑛 → {0, 1} is the majority function on 𝑛 input bits. That is, MAJ𝑛 (𝑥) = 1 iff $\sum_{i=0}^{n-1} x_i > n/2$. See footnote for hint.⁹
⁹ One approach to solve this is using recursion and the so-called Master Theorem.
■
Exercise 4.14 — Circuits for threshold. Prove that there is some constant 𝑐 such that for every 𝑛 > 1, and integers 𝑎0 , … , 𝑎𝑛−1 , 𝑏 ∈ {−2𝑛 , −2𝑛 + 1, … , −1, 0, +1, … , 2𝑛 }, there is a NAND circuit with at most 𝑛^𝑐 gates that computes the threshold function 𝑓𝑎0 ,…,𝑎𝑛−1 ,𝑏 ∶ {0, 1}𝑛 → {0, 1} that on input 𝑥 ∈ {0, 1}𝑛 outputs 1 if and only if $\sum_{i=0}^{n-1} a_i x_i > b$.
■
5
Code as data, data as code

"The term code script is, of course, too narrow. The chromosomal structures
are at the same time instrumental in bringing about the development they
foreshadow. They are law-code and executive power - or, to use another simile,
they are architect’s plan and builder’s craft - in one.” , Erwin Schrödinger,
1944.
This correspondence between code and data is one of the most fun-
damental aspects of computing. It underlies the notion of general
purpose computers, that are not pre-wired to compute only one task,
and also forms the basis of our hope for obtaining general artificial
intelligence. This concept finds immense use in all areas of comput-
ing, from scripting languages to machine learning, but it is fair to say
that we haven’t yet fully mastered it. Many security exploits involve
cases such as “buffer overflows” when attackers manage to inject code
where the system expected only “passive” data (see Fig. 5.1). The re-
lation between code and data reaches beyond the realm of electronic computers.
temp_0 = NAND(X[0],X[1])
temp_1 = NAND(X[0],temp_0)
temp_2 = NAND(X[1],temp_0)
Y[0] = NAND(temp_1,temp_2)
Theorem 5.1 — Representing programs as strings. There is a constant 𝑐 such that for every 𝑓 ∈ SIZE(𝑠), there exists a NAND-CIRC program 𝑃 computing 𝑓 whose string representation has length at most 𝑐𝑠 log 𝑠.
Theorem 5.2 — Counting programs. For every 𝑠, |SIZE(𝑠)| ≤ $2^{O(s \log s)}$. That is, there are at most $2^{O(s \log s)}$ functions computed by NAND-CIRC programs of at most 𝑠 lines.¹
¹ The implicit constant in the 𝑂(⋅) notation is smaller than 10. That is, for all sufficiently large 𝑠, |SIZE(𝑠)| < $2^{10 s \log s}$, see Remark 5.4. As discussed in Section 1.7, we use the bound 10 simply because it is a round number.

Proof. We will show a one-to-one map 𝐸 from SIZE(𝑠) to the set of strings of length at most ℓ = 𝑐𝑠 log 𝑠 for some constant 𝑐. This will conclude the proof, since it implies that |SIZE(𝑠)| is smaller than the size of the set of all strings of length at most ℓ, which equals $1 + 2 + 4 + \cdots + 2^{\ell} = 2^{\ell+1} - 1$ by the formula for sums of geometric progressions.
The map 𝐸 will simply map 𝑓 to the representation of the program
computing 𝑓. Specifically, we let 𝐸(𝑓) be the representation of the
program 𝑃 computing 𝑓 given by Theorem 5.1. This representation
has size at most 𝑐𝑠 log 𝑠, and moreover the map 𝐸 is one to one, since
if 𝑓 ≠ 𝑓 ′ then every two programs computing 𝑓 and 𝑓 ′ respectively
must have different representations.
■
Theorem 5.3 — Counting argument lower bound. There is a constant 𝛿 > 0, such that for every sufficiently large 𝑛, there is a function 𝑓 ∶ {0, 1}𝑛 → {0, 1} such that 𝑓 ∉ SIZE(𝛿2𝑛 /𝑛). That is, the shortest NAND-CIRC program that computes 𝑓 has more than 𝛿2𝑛 /𝑛 lines.

Proof. Let 𝛿 = 1/𝑐, where 𝑐 is the constant from Theorem 5.2, and set 𝑠 = 𝛿2𝑛 /𝑛. Then

$$\left|SIZE\!\left(\frac{\delta 2^n}{n}\right)\right| \le 2^{c \frac{\delta 2^n}{n} \log s} < 2^{c \delta 2^n} = 2^{2^n}, \qquad (5.2)$$
using the fact that since 𝑠 < 2𝑛 , log 𝑠 < 𝑛 and 𝛿 = 1/𝑐. But since
|SIZE(𝑠)| is smaller than the total number of functions mapping 𝑛 bits
to 1 bit, there must be at least one such function not in SIZE(𝑠), which
is what we needed to prove.
■
We have seen before that every function mapping {0, 1}𝑛 to {0, 1}
can be computed by an 𝑂(2𝑛 /𝑛) line program. Theorem 5.3 shows
that this is tight in the sense that some functions do require such an
astronomical number of lines to compute.
In fact, as we explore in the exercises, this is the case for most func-
tions. Hence functions that can be computed in a small number of
lines (such as addition, multiplication, finding short paths in graphs,
or even the EVAL function) are the exception, rather than the rule.
R
Remark 5.4 — More efficient representation (advanced,
optional). The ASCII representation is not the shortest
representation for NAND-CIRC programs. NAND-
CIRC programs are equivalent to circuits with NAND
gates, which means that a NAND-CIRC program of 𝑠
lines, 𝑛 inputs, and 𝑚 outputs can be represented by
a labeled directed graph of 𝑠 + 𝑛 vertices, of which 𝑛
have in-degree zero, and the 𝑠 others have in-degree
at most two. Using the adjacency matrix represen-
tation for such graphs, we can reduce the implicit
constant in Theorem 5.2 to be arbitrarily close to 5, see
Exercise 5.6.
It turns out that we can use Theorem 5.3 to show a more general re-
sult: whenever we increase our “budget” of gates we can compute
new functions.
Proof Idea:
To prove the theorem we need to find a function 𝑓 ∶ {0, 1}𝑛 → {0, 1}
such that 𝑓 can be computed by a circuit of 𝑠 + 10𝑛 gates but it cannot
be computed by a circuit of 𝑠 gates. We will do so by coming up with
a sequence of functions 𝑓0 , 𝑓1 , 𝑓2 , … , 𝑓𝑁 with the following properties:
(1) 𝑓0 can be computed by a circuit of at most 10𝑛 gates, (2) 𝑓𝑁 cannot
be computed by a circuit of 0.1 ⋅ 2𝑛 /𝑛 gates, and (3) for every 𝑖 ∈
{0, … , 𝑁 }, if 𝑓𝑖 can be computed by a circuit of size 𝑠, then 𝑓𝑖+1 can be
computed by a circuit of size at most 𝑠+10𝑛. Together these properties
imply that if we let 𝑖 be the smallest number such that 𝑓𝑖 ∉ SIZE𝑛 (𝑠),
then since 𝑓𝑖−1 ∈ SIZE(𝑠) it must hold that 𝑓𝑖 ∈ SIZE(𝑠 + 10𝑛) which is
what we need to prove. See Fig. 5.4 for an illustration.
⋆
must exist such an index 𝑖, and moreover 𝑖 > 0 since the constant zero
function is a member of SIZE𝑛 (10𝑛).
By our choice of 𝑖, 𝑓𝑖−1 is a member of SIZE𝑛 (𝑠). To complete the proof, we need to show that 𝑓𝑖 ∈ SIZE𝑛 (𝑠 + 10𝑛). Let 𝑥∗ be the string such that 𝑙𝑒𝑥(𝑥∗ ) = 𝑖, and let 𝑏 ∈ {0, 1} be the value of 𝑓 ∗ (𝑥∗ ). Then we can define 𝑓𝑖 also as follows:

$$f_i(x) = \begin{cases} b & x = x^* \\ f_{i-1}(x) & x \ne x^* \end{cases} \qquad (5.6)$$
or in other words
blah = NAND(baz,boo)
u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)
$$EVAL_{s,n,m}(px) = \begin{cases} P(x) & p \in \{0,1\}^{S(s)} \text{ represents a size-} s \text{ program } P \text{ with } n \text{ inputs and } m \text{ outputs} \\ 0^m & \text{otherwise} \end{cases} \qquad (5.9)$$
where 𝑆(𝑠) is defined as in (5.8) and we use the concrete representa-
tion scheme described in Section 5.1.
That is, EVAL𝑠,𝑛,𝑚 takes as input the concatenation of two strings:
a string 𝑝 ∈ {0, 1}𝑆(𝑠) and a string 𝑥 ∈ {0, 1}𝑛 . If 𝑝 is a string that
represents a list of triples 𝐿 such that (𝑛, 𝑚, 𝐿) is a list-of-tuples rep-
resentation of a size-𝑠 NAND-CIRC program 𝑃 , then EVAL𝑠,𝑛,𝑚 (𝑝𝑥)
is equal to the evaluation 𝑃 (𝑥) of the program 𝑃 on the input 𝑥. Oth-
erwise, EVAL𝑠,𝑛,𝑚 (𝑝𝑥) equals 0𝑚 (this case is not very important: you
can simply think of 0𝑚 as some “junk value” that indicates an error).
One of the first examples of self circularity we will see in this book is
the following theorem, which we can think of as showing a “NAND-
CIRC interpreter in NAND-CIRC”:
Theorem 5.9 — Bounded Universality of NAND-CIRC programs. For every 𝑠, 𝑛, 𝑚 ∈ ℕ with 𝑠 ≥ 𝑚 there is a NAND-CIRC program 𝑈𝑠,𝑛,𝑚 that computes the function EVAL𝑠,𝑛,𝑚 .
That is, 𝑈𝑠,𝑛,𝑚 takes as input the description of any program 𝑃 of the appropriate size (lines, inputs/outputs) and any input 𝑥, and computes the result of evaluating the
program 𝑃 on the input 𝑥. Given the equivalence between NAND-
CIRC programs and Boolean circuits, we can also think of 𝑈𝑠,𝑛,𝑚 as
a circuit that takes as input the description of other circuits and their
inputs, and returns their evaluation, see Fig. 5.6. We call this NAND-
CIRC program 𝑈𝑠,𝑛,𝑚 that computes EVAL𝑠,𝑛,𝑚 a bounded universal
program (or a universal circuit, see Fig. 5.6). “Universal” stands for
the fact that this is a single program that can evaluate arbitrary code,
where “bounded” stands for the fact that 𝑈𝑠,𝑛,𝑚 only evaluates pro-
grams of bounded size. Of course this limitation is inherent for the
NAND-CIRC programming language, since a program of 𝑠 lines (or,
equivalently, a circuit of 𝑠 gates) can take at most 2𝑠 inputs. Later, in
Chapter 7, we will introduce the concept of loops (and the model of
Turing Machines), which allow us to escape this limitation.
P
Theorem 5.9 is simple but important. Make sure you
understand what this theorem means, and why it is a
corollary of Theorem 4.12.
{0, 1}𝑆+𝑛 → {0, 1}𝑚 defined above (where 𝑆 is the number of bits
needed to represent programs of 𝑠 lines).
P
If you haven’t done so already, now might be a good
time to review 𝑂 notation in Section 1.4.8. In particu-
lar, an equivalent way to state Theorem 5.10 is that it
says that there exists some number 𝑐 > 0 such that for
every 𝑠, 𝑛, 𝑚 ∈ ℕ, there exists a NAND-CIRC program
𝑃 of at most 𝑐𝑠2 log 𝑠 lines that computes the function
EVAL𝑠,𝑛,𝑚 .
This approach yields much more than just proving Theorem 5.10:
we will see that it is in fact always possible to transform (loop free)
code in high level languages such as Python to NAND-CIRC pro-
grams (and hence to Boolean circuits as well).
P
It would be highly worthwhile for you to stop here
and try to solve this problem yourself. For example,
you can try thinking how you would write a program that evaluates a given NAND-CIRC program 𝑃 on a given input 𝑥.
P
Before reading further, try to think how you could give
a “constructive proof” of Theorem 5.10. That is, think
of how you would write, in the programming lan-
guage of your choice, a function universal(s,n,m)
that on input 𝑠, 𝑛, 𝑚 outputs the code for the NAND-
CIRC program 𝑈𝑠,𝑛,𝑚 such that 𝑈𝑠,𝑛,𝑚 computes
EVAL𝑠,𝑛,𝑚 . There is a subtle but crucial difference
between this function and the Python NANDEVAL pro-
gram described above. Rather than actually evaluating
a given program 𝑃 on some input 𝑤, the function
universal should output the code of a NAND-CIRC
program that computes the map (𝑃 , 𝑥) ↦ 𝑃 (𝑥).
P
Please make sure that you understand why GET and
LOOKUPℓ are the same function.
Figure 5.7: Code for evaluating a NAND-CIRC program given in the list-of-tuples representation

def NANDEVAL(n,m,L,X):
    # Evaluate a NAND-CIRC program from list of tuples representation.
    s = len(L) # num of lines
    t = max(max(a,b,c) for (a,b,c) in L)+1 # max index in L + 1
    Vartable = [0] * t # initialize array
    # helper functions
    def GET(V,i): return V[i]
    def UPDATE(V,i,b):
        V[i]=b
        return V
    for i in range(n): Vartable = UPDATE(Vartable,i,X[i]) # load inputs
    for (i,j,k) in L: # execute each line: variable i gets NAND of variables j and k
        Vartable = UPDATE(Vartable,i, 1 - GET(Vartable,j)*GET(Vartable,k))
    # outputs are the last m variables (the assumed convention of this representation)
    return [GET(Vartable,t-m+j) for j in range(m)]
For every ℓ, let UPDATE_ℓ ∶ {0, 1}^{2^ℓ + ℓ + 1} → {0, 1}^{2^ℓ} correspond to the UPDATE function for arrays of length 2^ℓ. That is, on input 𝑉 ∈ {0, 1}^{2^ℓ}, 𝑖 ∈ {0, 1}^ℓ and 𝑏 ∈ {0, 1}, UPDATE_ℓ(𝑉 , 𝑖, 𝑏) outputs the array 𝑉 ′ ∈ {0, 1}^{2^ℓ} such that

$$V'_j = \begin{cases} V_j & j \ne i \\ b & j = i \end{cases} \qquad (5.10)$$

where we identify the string 𝑖 ∈ {0, 1}^ℓ with a number in {0, … , 2^ℓ − 1} using the binary representation. We can compute UPDATE_ℓ using an 𝑂(2^ℓ ⋅ ℓ) = 𝑂(𝑠 log 𝑠) line NAND-CIRC program as follows:
2. We have seen that we can compute the function IF ∶ {0, 1}3 → {0, 1}
such that IF(𝑎, 𝑏, 𝑐) equals 𝑏 if 𝑎 = 1 and 𝑐 if 𝑎 = 0.
def UPDATE_ell(V,i,b):
    # Get V[0]...V[2^ell-1], i in {0,1}^ell, b in {0,1}
    # Return NewV[0],...,NewV[2^ell-1]
    # updated array with NewV[i]=b and all
    # else same as V
    NewV = [0]*(2**ell)
    for j in range(2**ell): # j = 0,1,2,....,2^ell - 1
        a = EQUALS_j(i)
        NewV[j] = IF(a,b,V[j])
    return NewV
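The subroutine EQUALS_j used above is assumed rather than defined in this snippet; here is one plausible Python sketch of it (ours, with an assumed least-significant-bit-first convention for the index string):

def make_EQUALS_j(j, ell):
    """Return a function EQUALS_j(i) that is 1 iff the ell index bits i encode j."""
    def EQUALS_j(i):
        # compare each bit of i with the corresponding bit of j
        return 1 if all(i[t] == ((j >> t) & 1) for t in range(ell)) else 0
    return EQUALS_j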
R
Remark 5.12 — Improving to quasilinear overhead (ad-
vanced optional note). The NAND-CIRC program
above is less efficient than its Python counterpart,
since NAND does not offer arrays with efficient ran-
dom access. Hence for example the LOOKUP operation
R
Remark 5.13 — Advanced note: making PECTT concrete
(advanced, optional). We can attempt a more exact
phrasing of the PECTT as follows. Suppose that 𝑍 is
a physical system that accepts 𝑛 binary stimuli and
has a binary output, and can be enclosed in a sphere
of volume 𝑉 . We say that the system 𝑍 computes a
function 𝑓 ∶ {0, 1}𝑛 → {0, 1} within 𝑡 seconds if when-
ever we set the stimuli to some value 𝑥 ∈ {0, 1}𝑛 , if
we measure the output after 𝑡 seconds then we obtain
𝑓(𝑥).
One can then phrase the PECTT as stipulating that if
there exists such a system 𝑍 that computes 𝐹 within
𝑡 seconds, then there exists a NAND-CIRC program
that computes 𝐹 and has at most 𝛼(𝑉 𝑡)2 lines, where
𝛼 is some normalization constant. (We can also con-
sider variants where we use surface area instead
of volume, or take (𝑉 𝑡) to a different power than 2.
However, none of these choices makes a qualitative
difference to the discussion below.) In particular,
suppose that 𝑓 ∶ {0, 1}𝑛 → {0, 1} is a function that
requires $2^n/(100n) > 2^{0.8n}$ lines for any NAND-CIRC program (such a function exists by Theorem 5.3). Then the PECTT would imply that either the volume or the time of a system that computes 𝑓 will have to be at least $2^{0.2n}/\sqrt{\alpha}$. Since this quantity grows expo-
nentially in 𝑛, it is not hard to set parameters so that
even for moderately large values of 𝑛, such a system
could not fit in our universe.
To fully make the PECTT concrete, we need to decide
on the units for measuring time and volume, and the
normalization constant 𝛼. One conservative choice is
to assume that we could squeeze computation to the
absolute physical limits (which are many orders of
magnitude beyond current technology). This corre-
sponds to setting 𝛼 = 1 and using the Planck units
for volume and time. The Planck length ℓ𝑃 (which is,
roughly speaking, the shortest distance that can the-
oretically be measured) is roughly $2^{-120}$ meters. The Planck time 𝑡𝑃 (which is the time it takes for light to travel one Planck length) is about $2^{-150}$ seconds. In the above setting, if a function 𝐹 takes, say, 1KB of input (e.g., roughly $10^4$ bits, which can encode a 100 by 100 bitmap image), and requires at least $2^{0.8n} = 2^{0.8 \cdot 10^4}$ lines to compute,
• Spaghetti sort: One of the first lower bounds that Computer Sci-
ence students encounter is that sorting 𝑛 numbers requires making
Ω(𝑛 log 𝑛) comparisons. The “spaghetti sort” is a description of a
proposed “mechanical computer” that would do this faster. The
idea is that to sort 𝑛 numbers 𝑥1 , … , 𝑥𝑛 , we could cut 𝑛 spaghetti
noodles into lengths 𝑥1 , … , 𝑥𝑛 , and then if we simply hold them
together in our hand and bring them down to a flat surface, they
will emerge in sorted order. There are a great many reasons why this is not truly a challenge to the PECTT, and I will not ruin the reader’s fun of finding them out on their own.
• Continuous/real computers. The physical world is often described using continuous quantities such as time and space.
R
Remark 5.14 — Physical Extended Church-Turing Thesis
and Cryptography. While even the precise phrasing of
the PECTT, let alone understanding its correctness, is
still a subject of active research, some variants of it are
✓ Chapter Recap
Sneak preview: In the next part we will discuss how to model compu-
tational tasks on unbounded inputs, which are specified using functions
𝐹 ∶ {0, 1}∗ → {0, 1}∗ (or 𝐹 ∶ {0, 1}∗ → {0, 1}) that can take an
unbounded number of Boolean inputs.
5.8 EXERCISES
Exercise 5.1 Which one of the following statements is false:
Exercise 5.5 — Size hierarchy theorem for multibit functions. Prove that there exists a number 𝐶 such that for every 𝑛, 𝑚 and 𝑛 + 𝑚 < 𝑠 < 𝑚 ⋅ 2ⁿ/(𝐶𝑛) there exists a function 𝑓 ∈ SIZE_{𝑛,𝑚}(𝐶 ⋅ 𝑠) ∖ SIZE_{𝑛,𝑚}(𝑠). See footnote for hint.¹¹

¹¹ Hint: Follow the proof of Theorem 5.5, replacing the use of the counting argument with Exercise 5.4.
■
Exercise 5.6 Prove that for every 𝜖 > 0, if 𝑠 is sufficiently large then for every 𝑛, 𝑚,

$$|\text{SIZE}_{n,m}(s)| < 2^{(2+\epsilon)s \log s + n \log n + m \log s}. \qquad (5.11)$$

Conclude that the implicit constant in Theorem 5.2 can be made arbitrarily close to 5. See footnote for hint.¹²

¹² Hint: Using the adjacency list representation, a graph with 𝑛 in-degree zero vertices and 𝑠 in-degree two vertices can be represented using roughly 2𝑠 log(𝑠 + 𝑛) ≤ 2𝑠(log 𝑠 + 𝑂(1)) bits. The labeling of the 𝑛 input and 𝑚 output vertices can be specified by a list of 𝑛 labels in [𝑛] and 𝑚 labels in [𝑚].

■
Exercise 5.7 — Tighter counting lower bound. Prove that for every 𝛿 < 1/2, if 𝑛 is sufficiently large then there exists a function 𝑓 : {0,1}ⁿ → {0,1} such that 𝑓 ∉ SIZE_{𝑛,1}(𝛿 ⋅ 2ⁿ/𝑛). See footnote for hint.¹³

¹³ Hint: Use the results of Exercise 5.6 and the fact that
Exercise 5.8 — Random functions are hard. Suppose 𝑛 > 1000 and that we choose a function 𝐹 : {0,1}ⁿ → {0,1} at random, choosing for every 𝑥 ∈ {0,1}ⁿ the value 𝐹(𝑥) to be the result of tossing an independent unbiased coin. Prove that the probability that there is a 2ⁿ/(1000𝑛) line program that computes 𝐹 is at most 2^{−100}.¹⁴

¹⁴ Hint: An equivalent way to say this is that you need to prove that the set of functions that can be computed using at most 2ⁿ/(1000𝑛) lines has fewer than 2^{−100} ⋅ 2^{2ⁿ} elements. Can you see why?

■
Exercise 5.9 The following is a tuple representing a NAND program:
(3, 1, ((3, 2, 2), (4, 1, 1), (5, 3, 4), (6, 2, 1), (7, 6, 6), (8, 0, 0), (9, 7, 8), (10, 5, 0), (11, 9, 10))).
1. Write a table with the eight values 𝑃 (000), 𝑃 (001), 𝑃 (010), 𝑃 (011),
𝑃 (100), 𝑃 (101), 𝑃 (110), 𝑃 (111) in this order.
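To experiment with this exercise, here is a small Python evaluator for the tuple representation. It is a sketch based on our reading of the representation: a triple (𝑖, 𝑗, 𝑘) means variable 𝑖 gets the NAND of variables 𝑗 and 𝑘, variables 0, … , 𝑛 − 1 hold the input, and the 𝑚 largest-indexed variables hold the output.

def eval_nand_program(prog, x):
    # prog = (n, m, lines); each line (i, j, k) means var[i] = NAND(var[j], var[k]).
    n, m, lines = prog
    t = max(max(line) for line in lines) + 1  # total number of variables used
    var = [0] * t
    for i in range(n):
        var[i] = x[i]                  # variables 0..n-1 hold the input
    for (i, j, k) in lines:
        var[i] = 1 - (var[j] & var[k]) # NAND of the two source variables
    return var[t - m:]                 # the m last variables hold the output

P = (3, 1, ((3, 2, 2), (4, 1, 1), (5, 3, 4), (6, 2, 1), (7, 6, 6),
            (8, 0, 0), (9, 7, 8), (10, 5, 0), (11, 9, 10)))
for a in range(2):
    for b in range(2):
        for c in range(2):
            print(a, b, c, eval_nand_program(P, [a, b, c]))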
prove that there is some constant 𝑛₀ such that for every 𝑛 > 𝑛₀ and XOR circuit 𝐶 of 𝑛² inputs and a single output, there exists a pair (𝑃, 𝑥) such that 𝐶(𝑃, 𝑥) ≠ 𝐸_𝑛(𝑃, 𝑥).
■
def XOR(X):
    '''Takes list X of 0's and 1's
    Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
Big Idea 8 A function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ specifies the computa-
tional task mapping an input 𝑥 ∈ {0, 1}∗ into the output 𝐹 (𝑥).
MULT(𝑥, 𝑦) = 𝑥 ⋅ 𝑦 (6.2)
that takes the binary representation of a pair of integers 𝑥, 𝑦 ∈ ℕ
and outputs the binary representation of their product 𝑥 ⋅ 𝑦. How-
ever, since we can represent a pair of strings as a single string, we will
consider functions such as MULT as mapping {0, 1}∗ to {0, 1}∗ . We
will typically not be concerned with low-level details such as the pre-
cise way to represent a pair of integers as a string, since essentially all
choices will be equivalent for our purposes.
$$\text{PALINDROME}(x) = \begin{cases} 1 & \forall_{i \in [|x|]}\ x_i = x_{|x|-i} \\ 0 & \text{otherwise} \end{cases} \qquad (6.3)$$
PALINDROME has a single bit as output. Functions with a single bit of output are known as Boolean functions. Boolean functions are central to the theory of computation, and we will discuss them often in this book. Note that even though Boolean functions have a single bit of output, their input can be of arbitrary length. Thus they are still infinite functions that cannot be described via a finite table of values.
“Booleanizing” functions. Sometimes it might be convenient to ob-
tain a Boolean variant for a non-Boolean function. For example, the
following is a Boolean variant of MULT.
$$\text{BMULT}(x,y,i) = \begin{cases} i^{th} \text{ bit of } x \cdot y & i < |x \cdot y| \\ 0 & \text{otherwise} \end{cases} \qquad (6.4)$$
If we can compute BMULT via any programming language such
as Python, C, Java, etc. then we can compute MULT as well, and vice
versa.
Solved Exercise 6.1 — Booleanizing general functions. Show that for every function 𝐹 : {0,1}∗ → {0,1}∗ there exists a Boolean function BF : {0,1}∗ → {0,1} such that a Python program to compute BF can be transformed into a program to compute 𝐹 and vice versa.
■
Solution:
For every 𝐹 : {0,1}∗ → {0,1}∗, we can define BF(𝑥, 𝑖, 𝑏) to output the 𝑖-th bit of 𝐹(𝑥) when 𝑏 = 0, and to output 1 if and only if 𝑖 < |𝐹(𝑥)| when 𝑏 = 1. Given a program to compute BF, we can compute 𝐹 as follows:

def F(x):
    res = []
    i = 0
    while BF(x,i,1):
        res.append(BF(x,i,0))
        i += 1
    return res
despite the fact that we don’t know of any program to compute it. In-
deed, this is not that surprising: for every particular 𝑛 ∈ ℕ, TWINP𝑛
is either the constant zero function or the constant one function, both
of which can be computed by very simple Boolean circuits. Hence
a collection of circuits {𝐶𝑛 } that computes TWINP certainly exists.
The difficulty in computing TWINP using Python or any other programming language arises from the fact that we don’t know, for each particular 𝑛, what the circuit 𝐶_𝑛 in this collection is.
For example, recall the Python program that computes the XOR
function:
def XOR(X):
    '''Takes list X of 0's and 1's
    Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
In each step, this program reads a single bit X[i] and updates its state result based on that bit (flipping result if X[i] is 1 and keeping it the same otherwise). When it is done traversing the input, the program outputs result. In computer science, such a program is called a single-pass constant-memory algorithm since it makes a single pass over the input and its working memory is of finite size. (Indeed, in this case result can either be 0 or 1.) Such an algorithm is also known as a Deterministic Finite Automaton or DFA (another name for DFAs is finite state machines). We can think of such an algorithm as
a “machine” that can be in one of 𝐶 states, for some constant 𝐶. The
machine starts in some initial state, and then reads its input 𝑥 ∈ {0, 1}∗
one bit at a time. Whenever the machine reads a bit 𝜎 ∈ {0, 1}, it
transitions into a new state based on 𝜎 and its prior state. The output
of the machine is based on the final state. Every single-pass constant-
memory algorithm corresponds to such a machine. If an algorithm
uses 𝑐 bits of memory, then the contents of its memory are a string of
length 𝑐. Since there are 2𝑐 such strings, at any point in the execution,
such an algorithm can be in one of 2𝑐 states.
We can specify a DFA of 𝐶 states by a list of 𝐶 ⋅ 2 rules. Each rule
will be of the form “If the DFA is in state 𝑣 and the bit read from the
input is 𝜎 then the new state is 𝑣′ ”. At the end of the computation
we will also have a rule of the form “If the final state is one of the
following … then output 1, otherwise output 0”. For example, the Python program above can be represented by a two-state automaton for computing XOR of the following form:
• Initialize in state 0
• For every state 𝑠 ∈ {0, 1} and input bit 𝜎 read, if 𝜎 = 1 then change
to state 1 − 𝑠, otherwise stay in state 𝑠.
• At the end output 1 iff 𝑠 = 1.
Definition 6.2 — Deterministic Finite Automaton. A deterministic finite automaton (DFA) with 𝐶 states over {0,1} is a pair (𝑇, 𝒮) with 𝑇 : [𝐶] × {0,1} → [𝐶] and 𝒮 ⊆ [𝐶]. The finite function 𝑇 is known as the transition function of the DFA and the set 𝒮 is known as the set of accepting states.
Let 𝐹 ∶ {0, 1}∗ → {0, 1} be a Boolean function with the infinite
domain {0, 1}∗ . We say that (𝑇 , 𝒮) computes a function 𝐹 ∶ {0, 1}∗ →
{0, 1} if for every 𝑛 ∈ ℕ and 𝑥 ∈ {0, 1}𝑛 , if we define 𝑠0 = 0 and
𝑠𝑖+1 = 𝑇 (𝑠𝑖 , 𝑥𝑖 ) for every 𝑖 ∈ [𝑛], then
𝑠𝑛 ∈ 𝒮 ⇔ 𝐹 (𝑥) = 1 (6.6)
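As an illustration (our own, not part of the formal text), Definition 6.2 translates directly into the following Python sketch, shown here on the two-state XOR automaton:

def run_dfa(T, S, x):
    # T: transition function [C] x {0,1} -> [C]; S: set of accepting states.
    # Implements s_0 = 0, s_{i+1} = T(s_i, x_i); accepts iff s_n is in S.
    s = 0
    for bit in x:
        s = T(s, bit)
    return 1 if s in S else 0

# The two-state XOR automaton: flip the state on a 1, keep it on a 0.
T_xor = lambda s, sigma: 1 - s if sigma == 1 else s
print(run_dfa(T_xor, {1}, [1, 0, 1, 1]))  # 1, since the number of 1's is odd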
P
Make sure not to confuse the transition function of
an automaton (𝑇 in Definition 6.2), which is a finite
function specifying the table of “rules” which it fol-
lows, with the function the automaton computes (𝐹 in
Definition 6.2) which is an infinite function.
R
Remark 6.3 — Definitions in other texts. Deterministic
finite automata can be defined in several equivalent
ways. In particular Sipser [Sip97] defines a DFA as a
five-tuple (𝑄, Σ, 𝛿, 𝑞0 , 𝐹 ) where 𝑄 is the set of states,
Σ is the alphabet, 𝛿 is the transition function, 𝑞0 is
the initial state, and 𝐹 is the set of accepting states.
In this book the set of states is always of the form
𝑄 = {0, … , 𝐶 − 1} and the initial state is always 𝑞0 = 0,
but this makes no difference to the computational
power of these models. Also, we restrict our attention
to the case that the alphabet Σ is equal to {0, 1}.
Solved Exercise 6.2 — DFA for (010)∗. Prove that there is a DFA that computes the following function 𝐹: 𝐹(𝑥) = 1 if and only if 𝑥 is a concatenation of zero or more copies of the string 010.
Solution:
When asked to construct a deterministic finite automaton, it helps to start by thinking of it as a single-pass constant-memory algorithm, such as the following Python program:
def F(X):
'''Return 1 iff X is a concatenation of zero/more
↪ copies of [0,1,0]'''
if len(X) % 3 != 0:
return False
ultimate = 0
penultimate = 1
antepenultimate = 0
for idx, b in enumerate(X):
antepenultimate = penultimate
penultimate = ultimate
ultimate = b
if idx % 3 == 2 and ((antepenultimate,
↪ penultimate, ultimate) != (0,1,0)):
return False
return True
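One possible way to turn this single-pass algorithm into an explicit DFA (this particular automaton is our own sketch; the book’s solution may number the states differently) uses four states: state 0 (“at a block boundary”, accepting), state 1 (“seen 0”), state 2 (“seen 01”), and state 3 (a rejecting sink):

# Transition table: T[state][symbol] gives the next state.
T = {0: {0: 1, 1: 3},   # at a multiple of 3: a 0 starts a new "010" block
     1: {0: 3, 1: 2},   # after "0": must see a 1
     2: {0: 0, 1: 3},   # after "01": a 0 completes the block
     3: {0: 3, 1: 3}}   # sink: the input can no longer match
ACCEPT = {0}            # accept iff we end exactly at a block boundary

def F(x):
    s = 0
    for b in x:
        s = T[s][b]
    return 1 if s in ACCEPT else 0

print(F([0, 1, 0, 0, 1, 0]), F([0, 1]), F([]))  # 1 0 1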
• The set 𝒮 ⊆ [𝐶] of accepting states. There are at most 2^𝐶 such sets, each of which can be described by a string in {0,1}^𝐶 specifying which states are in 𝒮 and which aren’t.
• The length of the input 𝑥 ∈ {0, 1}∗ that the DFA is provided with. It
is always finite, but not bounded.
• The number of steps that the DFA takes can grow with the length of
the input. Indeed, a DFA makes a single pass on the input and so it
takes exactly |𝑥| steps on an input 𝑥 ∈ {0, 1}∗ .
Proof Idea:
Every DFA can be described by a finite length string, which yields
an onto map from {0, 1}∗ to DFACOMP: namely the function that
maps a string describing an automaton 𝐴 to the function that it com-
putes.
⋆
$$StDC(a) = \begin{cases} F & a \text{ represents automaton } A \text{ and } F \text{ is the function } A \text{ computes} \\ \text{ONE} & \text{otherwise} \end{cases} \qquad (6.8)$$

where ONE : {0,1}∗ → {0,1} is the constant function that outputs 1.
Theorem 6.5 — Existence of DFA-uncomputable functions. There exists a Boolean function 𝐹 : {0,1}∗ → {0,1} that is not computable by any DFA.
that are matched by some pattern (e.g., all files whose names end with
the string .txt). In full generality, we can allow the user to specify the
pattern by specifying a (computable) function 𝐹 ∶ {0, 1}∗ → {0, 1},
where 𝐹 (𝑥) = 1 corresponds to the pattern matching 𝑥. That is, the
user provides a program 𝑃 in some Turing-complete programming
language such as Python, and the system will return all the 𝑥 ∈ 𝑋
such that 𝑃 (𝑥) = 1. For example, one could search for all text files
that contain the string important document or perhaps (letting 𝑃
correspond to a neural-network based classifier) all images that con-
tain a cat. However, we don’t want our system to get into an infinite
loop just trying to evaluate the program 𝑃 ! For this reason, typical
systems for searching files or databases do not allow users to specify
the patterns using full-fledged programming languages. Rather, such
systems use restricted computational models that on the one hand are
rich enough to capture many of the queries needed in practice (e.g., all
filenames ending with .txt, or all phone numbers of the form (617)
xxx-xxxx), but on the other hand are restricted enough so the queries
can be evaluated very efficiently on huge files and in particular cannot
result in an infinite loop.
One of the most popular such computational models is regular
expressions. If you ever used an advanced text editor, a command line
shell, or have done any kind of manipulation of text files, then you
have probably come across regular expressions.
A regular expression over some alphabet Σ is obtained by combin-
ing elements of Σ with the operation of concatenation, as well as |
(corresponding to or) and ∗ (corresponding to repetition zero or
more times). (Common implementations of regular expressions in
programming languages and shells typically include some extra oper-
ations on top of | and ∗, but these operations can be implemented as
“syntactic sugar” using the operators | and ∗.) For example, the fol-
lowing regular expression over the alphabet {0, 1} corresponds to the
set of all strings 𝑥 ∈ {0,1}∗ where every digit is repeated at least twice: (00(0∗)|11(1∗))∗.
1. 𝑒 = 𝜎 where 𝜎 ∈ Σ
P
The formal definition of Φ𝑒 is one of those definitions
that is more cumbersome to write than to grasp. Thus
it might be easier for you to first work it out on your
own and then check that your definition matches what
is written below.
2. If 𝑒 = (𝑒′|𝑒″) then Φ𝑒(𝑥) = Φ𝑒′(𝑥) ∨ Φ𝑒″(𝑥), where ∨ is the OR operator.
5. Finally, for the edge cases Φ∅ is the constant zero function, and
Φ"" is the function that only outputs 1 on the empty string "".
P
The definitions above are not inherently difficult, but
are a bit cumbersome. So you should pause here and
go over it again until you understand why it corre-
sponds to our intuitive notion of regular expressions.
This is important not just for understanding regular
expressions themselves (which are used time and
again in a great many applications) but also for get-
ting better at understanding recursive definitions in
general.
𝑒 = (𝑎|𝑏|𝑐|𝑑)(𝑎|𝑏|𝑐|𝑑)∗ (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗
(6.11)
is the expression we saw in (6.10).
either a shorter string 𝑥 and the same expression, or with the shorter
expression 𝑒′ and a string 𝑥′ that is equal in length or shorter than 𝑥.
Solved Exercise 6.3 — Match the empty string. Give an algorithm that on input a regular expression 𝑒, outputs 1 if and only if Φ𝑒("") = 1.
■
Solution:
We can obtain such a recursive algorithm by using the following observations:

• If 𝑒 = 𝜎 for some 𝜎 ∈ Σ, or 𝑒 = ∅, then Φ𝑒("") = 0.
• If 𝑒 = "" then Φ𝑒("") = 1.
• If 𝑒 = (𝑒′)∗ then Φ𝑒("") = 1 (the star can match zero copies).
• If 𝑒 = (𝑒′|𝑒″) then Φ𝑒("") = 1 if and only if Φ𝑒′("") = 1 or Φ𝑒″("") = 1.
• If 𝑒 = (𝑒′)(𝑒″) then Φ𝑒("") = 1 if and only if both Φ𝑒′("") = 1 and Φ𝑒″("") = 1.
Theorem 6.11 — Matching regular expressions in linear time. Let 𝑒 be a regular expression. Then there is an 𝑂(𝑛) time algorithm that computes Φ𝑒.
(By “𝑂(𝑛) time” we mean here the dependence of the time to compute Φ𝑒(𝑥) on the length of 𝑥, and not the dependence of this time on the length of 𝑒.)
Algorithm 6.13 is a recursive algorithm that, on input an expression 𝑒 and a string 𝑥 ∈ {0,1}ⁿ, does computation of at most 𝐶(|𝑒|) steps and then calls itself with input some expression 𝑒′ and a string 𝑥′ of length 𝑛 − 1. It will terminate after 𝑛 steps when it reaches a string of length 0. So, the running time 𝑇(𝑒, 𝑛) that it takes for Algorithm 6.13 to compute Φ𝑒 for inputs of length 𝑛 satisfies the recursive equation:
Claim: Let 𝑒 be a regular expression over {0, 1}, then there is a num-
ber 𝐿(𝑒) ∈ ℕ, such that for every sequence of symbols 𝛼0 , … , 𝛼𝑛−1 , if
we define 𝑒′ = 𝑒[𝛼0 ][𝛼1 ] ⋯ [𝛼𝑛−1 ] (i.e., restricting 𝑒 to 𝛼0 , and then 𝛼1
and so on and so forth), then |𝑒′ | ≤ 𝐿(𝑒).
Proof of claim: For a regular expression 𝑒 over {0,1} and 𝛼 ∈ {0,1}^𝑚, we denote by 𝑒[𝛼] the expression 𝑒[𝛼₀][𝛼₁] ⋯ [𝛼_{𝑚−1}] obtained by restricting 𝑒 to 𝛼₀ and then to 𝛼₁ and so on. We let 𝑆(𝑒) = {𝑒[𝛼] | 𝛼 ∈ {0,1}∗}.
We will prove the claim by showing that for every 𝑒, the set 𝑆(𝑒) is fi-
nite, and hence so is the number 𝐿(𝑒) which is the maximum length of
𝑒′ for 𝑒′ ∈ 𝑆(𝑒).
We prove this by induction on the structure of 𝑒. If 𝑒 is a symbol, the empty string, or the empty set, then this is straightforward to show, as the only expressions 𝑆(𝑒) can contain are the expression itself, "",
and ∅. Otherwise we split to the two cases (i) 𝑒 = 𝑒′∗ and (ii) 𝑒 =
𝑒′ 𝑒″ , where 𝑒′ , 𝑒″ are smaller expressions (and hence by the induction
hypothesis 𝑆(𝑒′ ) and 𝑆(𝑒″ ) are finite). In the case (i), if 𝑒 = (𝑒′ )∗ then
𝑒[𝛼] is either equal to (𝑒′ )∗ 𝑒′ [𝛼] or it is simply the empty set if 𝑒′ [𝛼] = ∅.
Since 𝑒′ [𝛼] is in the set 𝑆(𝑒′ ), the number of distinct expressions in
𝑆(𝑒) is at most |𝑆(𝑒′ )| + 1. In the case (ii), if 𝑒 = 𝑒′ 𝑒″ then all the
restrictions of 𝑒 to strings 𝛼 will either have the form 𝑒′ 𝑒″ [𝛼] or the form
𝑒′ 𝑒″ [𝛼]|𝑒′ [𝛼′ ] where 𝛼′ is some string such that 𝛼 = 𝛼′ 𝛼″ and 𝑒″ [𝛼″ ]
matches the empty string. Since 𝑒″ [𝛼] ∈ 𝑆(𝑒″ ) and 𝑒′ [𝛼′ ] ∈ 𝑆(𝑒′ ), the
number of the possible distinct expressions of the form 𝑒[𝛼] is at most
|𝑆(𝑒″ )| + |𝑆(𝑒″ )| ⋅ |𝑆(𝑒′ )|. This completes the proof of the claim.
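To make the recursive approach concrete, here is a Python sketch in the spirit of Algorithm 6.13, where restricting 𝑒 to a symbol corresponds to the function deriv below. The tuple representation of expressions — "empty" (∅), "eps" (""), ("sym", σ), ("or", 𝑒₁, 𝑒₂), ("concat", 𝑒₁, 𝑒₂), ("star", 𝑒₁) — is our own choice for illustration, not the book’s notation.

def nullable(e):
    # Does e match the empty string ""? (Solved Exercise 6.3)
    if e == "empty" or e[0] == "sym": return False
    if e == "eps" or e[0] == "star": return True
    if e[0] == "or": return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # concat

def deriv(e, sigma):
    # The restriction e[sigma]: matches y iff e matches sigma + y.
    if e == "empty" or e == "eps": return "empty"
    if e[0] == "sym": return "eps" if e[1] == sigma else "empty"
    if e[0] == "or": return ("or", deriv(e[1], sigma), deriv(e[2], sigma))
    if e[0] == "star": return ("concat", deriv(e[1], sigma), e)
    # concat: sigma is consumed by e1, or by e2 if e1 can match ""
    d = ("concat", deriv(e[1], sigma), e[2])
    return ("or", d, deriv(e[2], sigma)) if nullable(e[1]) else d

def phi(e, x):
    for sigma in x:
        e = deriv(e, sigma)
    return 1 if nullable(e) else 0

# (01)* as a tuple expression:
e = ("star", ("concat", ("sym", "0"), ("sym", "1")))
print(phi(e, "0101"), phi(e, "011"))  # 1 0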
Theorem 6.14 — DFA for regular expression matching. Let 𝑒 be a regular expression. Then there is an algorithm that on input 𝑥 ∈ {0,1}∗ computes Φ𝑒(𝑥) while making a single pass over 𝑥 and maintaining a constant amount of memory.
Proof Idea:
The single-pass constant-memory algorithm for checking if a string matches a regular expression is presented in Algorithm 6.15. The idea is to replace the recursive algorithm of Algorithm 6.13 with a dynamic program, using the technique of memoization. If you haven’t yet taken an algorithms course, you might not know these techniques. This is OK;
while this more efficient algorithm is crucial for the many practical
applications of regular expressions, it is not of great importance for
this book.
⋆
Proof Idea:
One direction follows from Theorem 6.14, which shows that for
every regular expression 𝑒, the function Φ𝑒 can be computed by a DFA
(see for example Fig. 6.6). For the other direction, we show that given a DFA (𝑇, 𝒮), for every 𝑣, 𝑤 ∈ [𝐶] we can find a regular expression that would match 𝑥 ∈ {0,1}∗ if and only if the DFA, starting in state 𝑣, will end up in state 𝑤 after reading 𝑥.
⋆
Proof of Theorem 6.16. Since Theorem 6.14 proves the “only if” direction, we only need to show the “if” direction. Let 𝐴 = (𝑇, 𝒮) be a DFA with 𝐶 states that computes the function 𝐹. We need to show that 𝐹 is regular.

For every 𝑣, 𝑤 ∈ [𝐶], we let 𝐹_{𝑣,𝑤} : {0,1}∗ → {0,1} be the function that maps 𝑥 ∈ {0,1}∗ to 1 if and only if the DFA 𝐴, starting at the state 𝑣, will reach the state 𝑤 if it reads the input 𝑥. We will prove that 𝐹_{𝑣,𝑤} is regular for every 𝑣, 𝑤. This will prove the theorem, since by Definition 6.2, 𝐹(𝑥) is equal to the OR of 𝐹_{0,𝑤}(𝑥) for every 𝑤 ∈ 𝒮. Hence if we have a regular expression for every function of the form 𝐹_{𝑣,𝑤} then (using the | operation) we can obtain a regular expression for 𝐹 as well.

[Figure 6.6: A deterministic finite automaton that computes the function Φ_{(01)∗}.]

To give regular expressions for the functions 𝐹_{𝑣,𝑤}, we start by defining the following functions 𝐹^𝑡_{𝑣,𝑤}: for every 𝑣, 𝑤 ∈ [𝐶] and 0 ≤ 𝑡 ≤ 𝐶, 𝐹^𝑡_{𝑣,𝑤}(𝑥) = 1 if and only if starting from 𝑣 and observing 𝑥, the automaton reaches 𝑤 with all intermediate states being in the set [𝑡] = {0, … , 𝑡 − 1} (see Fig. 6.7). That is, while 𝑣, 𝑤 themselves might be outside [𝑡], 𝐹^𝑡_{𝑣,𝑤}(𝑥) = 1 if and only if throughout the execution of the automaton on the input 𝑥 (when initiated at 𝑣) it never enters any of the states outside [𝑡] and still ends up at 𝑤. If 𝑡 = 0 then [𝑡] is the empty set, and hence 𝐹^0_{𝑣,𝑤}(𝑥) = 1 if and only if the automaton reaches 𝑤 from 𝑣 directly on 𝑥, without any intermediate state. If 𝑡 = 𝐶 then all states are in [𝑡], and hence 𝐹^𝐶_{𝑣,𝑤} = 𝐹_{𝑣,𝑤}.

[Figure 6.7: Given a DFA of 𝐶 states, for every 𝑣, 𝑤 ∈ [𝐶] and number 𝑡 ∈ {0, … , 𝐶} we define the function 𝐹^𝑡_{𝑣,𝑤} : {0,1}∗ → {0,1} to output one on input 𝑥 ∈ {0,1}∗ if and only if when the DFA is initialized in the state 𝑣 and is given the input 𝑥, it will reach the state 𝑤 while going only through the intermediate states {0, … , 𝑡 − 1}.]

We will prove the theorem by induction on 𝑡, showing that 𝐹^𝑡_{𝑣,𝑤} is regular for every 𝑣, 𝑤 and 𝑡. For the base case of 𝑡 = 0, 𝐹^0_{𝑣,𝑤} is regular

$$R^t_{v,w} \;|\; R^t_{v,t} \left(R^t_{t,t}\right)^* R^t_{t,w}. \qquad (6.14)$$
This completes the proof of the inductive step and hence of the theo-
rem.
■
1. |𝑦| ≥ 1.
2. |𝑥𝑦| ≤ 𝑛0 .
Proof Idea:
The idea behind the proof is the following. Let 𝑛₀ be twice the number of symbols that are used in the expression 𝑒. Then the only way that there is some 𝑤 with |𝑤| > 𝑛₀ and Φ𝑒(𝑤) = 1 is that 𝑒 contains
the ∗ (i.e. star) operator and that there is a nonempty substring 𝑦 of
𝑤 that was matched by (𝑒′ )∗ for some sub-expression 𝑒′ of 𝑒. We can
now repeat 𝑦 any number of times and still get a matching string. See
also Fig. 6.9.
⋆
P
The pumping lemma is a bit cumbersome to state,
but one way to remember it is that it simply says the
following: “if a string matching a regular expression is
long enough, one of its substrings must be matched using
the ∗ operator”.
|𝑤′| > 2|𝑒′| then by the induction hypothesis there exist 𝑥, 𝑦, 𝑧′ with |𝑦| ≥ 1, |𝑥𝑦| ≤ 2|𝑒′| < 𝑛₀ such that 𝑤′ = 𝑥𝑦𝑧′ and 𝑒′ matches 𝑥𝑦^𝑘𝑧′ for every 𝑘 ∈ ℕ. This completes the proof since if we set 𝑧 = 𝑧′𝑤″ then we see that 𝑤 = 𝑤′𝑤″ = 𝑥𝑦𝑧 and 𝑒 = (𝑒′)(𝑒″) matches 𝑥𝑦^𝑘𝑧 for every 𝑘 ∈ ℕ. Otherwise, if |𝑤′| ≤ 2|𝑒′| then since |𝑤| = |𝑤′| + |𝑤″| > 𝑛₀ = 2(|𝑒′| + |𝑒″|), it must be that |𝑤″| > 2|𝑒″|. Hence by the induction hypothesis there exist 𝑥′, 𝑦, 𝑧 such that |𝑦| ≥ 1, |𝑥′𝑦| ≤ 2|𝑒″| and 𝑒″ matches 𝑥′𝑦^𝑘𝑧 for every 𝑘 ∈ ℕ. But now if we set 𝑥 = 𝑤′𝑥′ we see that |𝑥𝑦| ≤ |𝑤′| + |𝑥′𝑦| ≤ 2|𝑒′| + 2|𝑒″| = 𝑛₀ and on the other hand the expression 𝑒 = (𝑒′)(𝑒″) matches 𝑥𝑦^𝑘𝑧 = 𝑤′𝑥′𝑦^𝑘𝑧 for every 𝑘 ∈ ℕ.

In case (c), if 𝑤 is matched by (𝑒′)∗ then 𝑤 = 𝑤₀ ⋯ 𝑤𝑡 where for every 𝑖 ∈ [𝑡], 𝑤𝑖 is a nonempty string matched by 𝑒′. If |𝑤₀| > 2|𝑒′| then we can use the same approach as in the concatenation case above. Otherwise, we simply note that if 𝑥 is the empty string, 𝑦 = 𝑤₀, and 𝑧 = 𝑤₁ ⋯ 𝑤𝑡 then |𝑥𝑦| ≤ 𝑛₀ and 𝑥𝑦^𝑘𝑧 is matched by (𝑒′)∗ for every 𝑘 ∈ ℕ.
■
R
Remark 6.21 — Recursive definitions and inductive proofs. When an object is recursively defined (as in the case of regular expressions) then it is natural to prove properties of such objects by induction. That is, if we want to prove that all objects of this type have property 𝑃, then it is natural to use an inductive step that says that if 𝑜′, 𝑜″, 𝑜‴ etc. have property 𝑃 then so does an object 𝑜 that is obtained by composing them.
Using the pumping lemma, we can easily prove Lemma 6.19 (i.e.,
the non-regularity of the “matching parenthesis” function):
The pumping lemma is a very useful tool to show that certain func-
tions are not computable by a regular expression. However, it is not an “if and only if” condition for regularity: there are non-regular functions that still satisfy the conditions of the pumping lemma. To
understand the pumping lemma, it is important to follow the order of
quantifiers in Theorem 6.20. In particular, the number 𝑛0 in the state-
ment of Theorem 6.20 depends on the regular expression (in the proof
we chose 𝑛0 to be twice the number of symbols in the expression). So,
if we want to use the pumping lemma to rule out the existence of a
regular expression 𝑒 computing some function 𝐹 , we need to be able
to choose an appropriate input 𝑤 ∈ {0, 1}∗ that can be arbitrarily large
and satisfies 𝐹 (𝑤) = 1. This makes sense if you think about the intu-
ition behind the pumping lemma: we need 𝑤 to be large enough as to
force the use of the star operator.
Solved Exercise 6.4 — Palindromes is not regular. Prove that the following function over the alphabet {0, 1, ;} is not regular: PAL(𝑤) = 1 if and only if 𝑤 = 𝑢;𝑢^𝑅 where 𝑢 ∈ {0,1}∗ and 𝑢^𝑅 denotes 𝑢 “reversed”: the string 𝑢_{|𝑢|−1} ⋯ 𝑢₀. (The Palindrome function is most often defined without an explicit separator character ;, but the version with such a separator is a bit cleaner and so we use it here. This does not make a significant difference.)
[Figure 6.10: A cartoon of a proof using the pumping lemma that a function 𝐹 is not regular. The pumping lemma states that if 𝐹 is regular then there exists a number 𝑛₀ such that for every large enough 𝑤 with 𝐹(𝑤) = 1, there exists a partition of 𝑤 to 𝑤 = 𝑥𝑦𝑧 satisfying certain conditions such that for every 𝑘 ∈ ℕ, 𝐹(𝑥𝑦^𝑘𝑧) = 1. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every there exists quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every for every quantifier corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction, which would be a choice of 𝑘 that would result in 𝐹(𝑥𝑦^𝑘𝑧) = 0, hence violating the conclusion of the pumping lemma.]
Solution:
We use the pumping lemma. Suppose towards the sake of contradiction that there is a regular expression 𝑒 computing PAL, and let 𝑛₀ be the number obtained by the pumping lemma (Theorem 6.20). Consider the string 𝑤 = 0^{𝑛₀};0^{𝑛₀}. Since the reverse of the all zero string is the all zero string, PAL(𝑤) = 1. Now, by the pumping lemma, if PAL is computed by 𝑒, then we can write 𝑤 = 𝑥𝑦𝑧 such that |𝑥𝑦| ≤ 𝑛₀, |𝑦| ≥ 1 and PAL(𝑥𝑦^𝑘𝑧) = 1 for every 𝑘 ∈ ℕ. In particular, it must hold that PAL(𝑥𝑧) = 1, but this is a contradiction, since 𝑥𝑧 = 0^{𝑛₀−|𝑦|};0^{𝑛₀} and so its two parts are not of the same length and in particular are not the reverse of one another.
■
function?” and “does there exist a string 𝑥 that is matched by the ex-
pression 𝑒?”. The following theorem shows that we can answer the
latter question:
Theorem 6.22 — Emptiness of regular languages is computable. There is an algorithm that given a regular expression 𝑒, outputs 1 if and only if Φ𝑒 is the constant zero function.
Proof Idea:
The idea is that we can directly observe this from the structure
of the expression. The only way a regular expression 𝑒 computes
the constant zero function is if 𝑒 has the form ∅ or is obtained by
concatenating ∅ with other expressions.
⋆
• ∅ is empty.
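The remaining rules of this recursion can be sketched in Python as follows (using the same illustrative tuple representation of expressions as in the earlier sketch):

def is_empty(e):
    # Returns True iff Phi_e is the constant zero function.
    if e == "empty": return True                  # the empty set is empty
    if e == "eps" or e[0] == "sym": return False  # "" and symbols match something
    if e[0] == "star": return False               # (e')* always matches ""
    if e[0] == "or":                              # e'|e'' is empty iff both are
        return is_empty(e[1]) and is_empty(e[2])
    return is_empty(e[1]) or is_empty(e[2])       # (e')(e'') is empty iff either is

print(is_empty(("concat", ("sym", "0"), "empty")))  # True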
Theorem 6.23 — Equivalence of regular expressions is computable. Let REGEQ : {0,1}∗ → {0,1} be the function that on input (a string representing) a pair of regular expressions 𝑒, 𝑒′, outputs 1 if and only if Φ𝑒 = Φ𝑒′. Then REGEQ is computable.
Proof Idea:
The idea is to show that given a pair of regular expressions 𝑒 and 𝑒′ we can find an expression 𝑒″ such that Φ𝑒″(𝑥) = 1 if and only if Φ𝑒(𝑥) ≠ Φ𝑒′(𝑥). Therefore Φ𝑒″ is the constant zero function if and only if 𝑒 and 𝑒′ are equivalent, and so we can use Theorem 6.22 to decide this.
Proof of Theorem 6.23. We will prove Theorem 6.23 from Theorem 6.22.
(The two theorems are in fact equivalent: it is easy to prove Theo-
rem 6.22 from Theorem 6.23, since checking for emptiness is the same
as checking equivalence with the expression ∅.) Given two regu-
lar expressions 𝑒 and 𝑒′ , we will compute an expression 𝑒″ such that
Φ𝑒″ (𝑥) = 1 if and only if Φ𝑒 (𝑥) ≠ Φ𝑒′ (𝑥). One can see that 𝑒 is equiva-
lent to 𝑒′ if and only if 𝑒″ is empty.
We start with the observation that for every bits 𝑎, 𝑏 ∈ {0,1}, 𝑎 ≠ 𝑏 if and only if

$$(a \wedge \overline{b}) \,\vee\, (\overline{a} \wedge b). \qquad (6.15)$$

Hence we need to construct 𝑒″ such that for every 𝑥,

$$\Phi_{e''}(x) = (\Phi_e(x) \wedge \overline{\Phi_{e'}(x)}) \,\vee\, (\overline{\Phi_e(x)} \wedge \Phi_{e'}(x)). \qquad (6.16)$$

To construct the expression 𝑒″, we will show how given any pair of expressions 𝑒 and 𝑒′, we can construct expressions 𝑒 ∧ 𝑒′ and $\overline{e}$ that compute the functions Φ𝑒 ∧ Φ𝑒′ and $\overline{\Phi_e}$ respectively. (Computing the expression for 𝑒 ∨ 𝑒′ is straightforward using the | operation of regular expressions.)

Specifically, by Lemma 6.17, regular functions are closed under negation, which means that for every regular expression 𝑒, there is an expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_e(x)$ for every 𝑥 ∈ {0,1}∗. Now, for every two expressions 𝑒 and 𝑒′, the expression

$$e \wedge e' = \overline{\left(\overline{e}\,|\,\overline{e'}\right)} \qquad (6.17)$$

computes the AND of the two expressions. Given these two transformations, we see that for every regular expressions 𝑒 and 𝑒′ we can find a regular expression 𝑒″ satisfying (6.16) such that 𝑒″ is empty if and only if 𝑒 and 𝑒′ are equivalent.
■
✓ Chapter Recap
6.7 EXERCISES
Exercise 6.1 — Closure properties of regular functions. Suppose that 𝐹, 𝐺 : {0,1}∗ → {0,1} are regular. For each one of the following definitions:
Exercise 6.2 One among the following two functions that map {0,1}∗ to {0,1} can be computed by a regular expression, and the other one cannot. For the one that can be computed by a regular expression, write the expression that does it. For the one that cannot, prove that this cannot be done using the pumping lemma.

• 𝐹(𝑥) = 1 if 4 divides $\sum_{i=0}^{|x|-1} x_i$, and 𝐹(𝑥) = 0 otherwise.

• 𝐺(𝑥) = 1 if and only if $\sum_{i=0}^{|x|-1} x_i \geq |x|/4$, and 𝐺(𝑥) = 0 otherwise.
2. Prove that the following function 𝐹 : {0,1}∗ → {0,1} is not regular. For every 𝑥 ∈ {0,1}∗, 𝐹(𝑥) = 1 iff $\sum_j x_j = 3^i$ for some 𝑖 > 0.
■
7
Loops and infinity
“The bounds of arithmetic were however outstepped the moment the idea of
applying the [punched] cards had occurred; and the Analytical Engine does
not occupy common ground with mere ‘calculating machines.’ … In enabling
mechanism to combine together general symbols, in successions of unlim-
ited variety and extent, a uniting link is established between the operations of
matter and the abstract mental processes of the most abstract branch of mathe-
matical science.”, Ada Augusta, countess of Lovelace, 1843
It turns out that these two models are equivalent, and in fact
they are equivalent to many other computational models
including programming languages such as C, Lisp, Python,
JavaScript, etc. This notion, known as Turing equivalence
or Turing completeness, will be discussed in Chapter 8. See
Fig. 7.2 for an overview of the models presented in this chap-
ter and Chapter 8.
“What is the difference between a Turing machine and the modern computer?
It’s the same as that between Hillary’s ascent of Everest and the establishment
of a Hilton hotel on its peak.” , Alan Perlis, 1982.
• At each step, the machine reads the symbol 𝜎 = 𝑇 [𝑖] that is in the
𝑖𝑡ℎ location of the tape, and based on this symbol and its state 𝑠
decides on:
• When the machine halts then its output is the binary string ob-
tained by reading the tape from the beginning until the head posi-
tion, dropping all symbols such as ▷, ∅, etc. that are not either 0 or
1.
In our case, 𝑀 will use the alphabet {0, 1, ▷, ∅, ×} and will have 𝑘 = 13 states. Though the states are simply numbers between 0 and 𝑘 − 1, for convenience we will give them the following labels:
State Label
0 START
1 RIGHT_0
2 RIGHT_1
3 LOOK_FOR_0
4 LOOK_FOR_1
5 RETURN
6 REJECT
7 ACCEPT
8 OUTPUT_0
9 OUTPUT_1
10 0_AND_BLANK
11 1_AND_BLANK
12 BLANK_AND_STOP
• 𝑀 starts in state START and will go right, looking for the first sym-
bol that is 0 or 1. If we find ∅ before we hit such a symbol then we
will move to the OUTPUT_1 state that we describe below.
• Once 𝑀 finds such a symbol 𝑏 ∈ {0, 1}, 𝑀 deletes 𝑏 from the tape
by writing the × symbol, it enters either the RIGHT_0 or RIGHT_1
mode according to the value of 𝑏 and starts moving rightwards
until it hits the first ∅ or × symbol.
• The OUTPUT_𝑏 states mean that we are going to output the value 𝑏.
In both these states we go left until we hit ▷. Once we do so, we
make a right step, and change to the 1_AND_BLANK or 0_AND_BLANK
states respectively. In the latter states, we write the corresponding
value, and then move right and change to the BLANK_AND_STOP
state, in which we write ∅ to the tape and halt.
The above description can be turned into a table describing, for each one of the 13 ⋅ 5 combinations of state and symbol, what the Turing machine will do when it is in that state and it reads that symbol. This table is known as the transition function of the Turing machine.
P
You should make sure you see why this formal def-
inition corresponds to our informal description of
a Turing Machine. To get more intuition on Turing
Machines, you can explore some of the online avail-
able simulators such as Martin Ugarte’s, Anthony
Morphett’s, or Paul Rendell’s.
This is a good point to remind the reader that functions are not the
same as programs:
R
Remark 7.4 — Functions vs. languages. As discussed
in Section 6.1.2, many texts use the terminology of
“languages” rather than functions to refer to compu-
tational tasks. A Turing machine 𝑀 decides a language
𝐿 if for every input 𝑥 ∈ {0, 1}∗ , 𝑀 (𝑥) outputs 1 if
and only if 𝑥 ∈ 𝐿. This is equivalent to computing
the Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1} defined as
𝐹 (𝑥) = 1 iff 𝑥 ∈ 𝐿. A language 𝐿 is decidable if there
is a Turing machine 𝑀 that decides it. For historical
reasons, some texts also call such a language recursive
(which is the reason that the letter R is often used
to denote the set of computable Boolean functions /
decidable languages defined in Definition 7.3).
In this book we stick to the terminology of functions
rather than languages, but all definitions and results
can be easily translated back and forth by using the
equivalence between the function 𝐹 ∶ {0, 1}∗ → {0, 1}
and the language 𝐿 = {𝑥 ∈ {0, 1}∗ | 𝐹 (𝑥) = 1}.
Definition 7.5 — Computable (partial or total) functions. Let 𝐹 be either a total or partial function mapping {0,1}∗ to {0,1}∗ and let 𝑀 be a Turing machine.
R
Remark 7.6 — Bot symbol. We often use ⊥ as our spe-
cial “failure symbol”. If a Turing machine 𝑀 fails to
halt on some input 𝑥 ∈ {0, 1}∗ then we denote this by
𝑀 (𝑥) = ⊥. This does not mean that 𝑀 outputs some
encoding of the symbol ⊥ but rather that 𝑀 enters
into an infinite loop when given 𝑥 as input.
If a partial function 𝐹 is undefined on 𝑥 then we can
also write 𝐹 (𝑥) = ⊥. Therefore one might think
that Definition 7.5 can be simplified to requiring that
𝑀 (𝑥) = 𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ , which would imply
that for every 𝑥, 𝑀 halts on 𝑥 if and only if 𝐹 is de-
fined on 𝑥. However this is not the case: for a Turing
Machine 𝑀 to compute a partial function 𝐹 it is not
necessary for 𝑀 to enter an infinite loop on inputs 𝑥
on which 𝐹 is not defined. All that is needed is for 𝑀
to output 𝐹 (𝑥) on 𝑥’s on which 𝐹 is defined: on other
inputs it is OK for 𝑀 to output an arbitrary value such
as 0, 1, or anything else, or not to halt at all. To borrow
a term from the C programming language, on inputs 𝑥
on which 𝐹 is not defined, what 𝑀 does is “undefined
behavior”.
def PAL(Tape):
    head = 0
    state = 0 # START
    while (state != 12):
        if (state == 0 and Tape[head]=='0'):
            state = 3 # LOOK_FOR_0
            Tape[head] = 'x'
            head += 1 # move right
        if (state == 0 and Tape[head]=='1'):
            state = 4 # LOOK_FOR_1
            Tape[head] = 'x'
            head += 1 # move right
        ... # more if statements here
The particular details of this program are not important. What matters is that we can describe Turing machines as programs. Moreover, note that when translating a Turing machine into a program, the tape becomes a list or array that can hold values from the finite set Σ.² The head position can be thought of as an integer-valued variable that can hold integers of unbounded size. The state is a local register that can hold one of a fixed number of values in [𝑘]. More generally we can think of every Turing Machine 𝑀 as equivalent to such a program.

² Most programming languages use arrays of fixed size, while a Turing machine’s tape is unbounded. But of course there is no need to store an infinite number of ∅ symbols. If you want, you can think of the tape as a list that starts off just long enough to store the input, but is dynamically grown in size as the Turing machine’s head explores new positions.
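To make this correspondence concrete, here is a minimal sketch of a generic Turing machine simulator in Python; the representation of the transition function (a Python function returning a (state, symbol, direction) triple) and the exact conventions are our own choices for illustration:

def simulate_TM(delta, x, blank='_', start='>'):
    # delta: (state, symbol) -> (new_state, new_symbol, direction),
    # where direction is one of 'L', 'R', 'S', 'H'; state 0 is the start state.
    tape = [start] + list(x)
    head, state = 0, 0
    while True:
        if head >= len(tape):
            tape.append(blank)           # grow the tape on demand
        state, symbol, direction = delta(state, tape[head])
        tape[head] = symbol
        if direction == 'L':
            head = max(head - 1, 0)
        elif direction == 'R':
            head += 1
        elif direction == 'H':
            break
    # output: the 0/1 symbols from the beginning of the tape up to the head
    return [s for s in tape[:head+1] if s in ('0', '1')]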
R
Remark 7.7 — NAND-CIRC + loops + arrays = everything. As we will see, adding loops and arrays to
NAND-CIRC is enough to capture the full power of
all programming languages! Hence we could replace
“NAND-TM” with any of Python, C, Javascript, OCaml,
etc. in the lefthand side of (7.2). But we’re getting
ahead of ourselves: this issue will be discussed in
Chapter 8.
• We use the convention that arrays always start with a capital letter,
and scalar variables (which are never indexed with i) start with
lowercase letters. Hence Foo is an array and bar is a scalar variable.
• The input and output X and Y are now considered arrays with val-
ues of zeroes and ones. (There are also two other special arrays
X_nonblank and Y_nonblank, see below.)
2. The program is executed line by line; when the last line MODANDJMP(foo,bar) is executed, we do as follows:
a. If foo = 1 and bar = 0 then jump to the first line without modifying the value of i.
b. If foo = 1 and bar = 1 then increment i by one and jump to the first line.
c. If foo = 0 and bar = 1 then decrement i by one (as long as i > 0) and jump to the first line.
d. If foo = 0 and bar = 0 then halt.
7.2.3 Examples
We now present some examples of NAND-TM programs.
carry = IF(started,carry,one(started))
started = one(started)
Y[i] = XOR(X[i],carry)
carry = AND(X[i],carry)
Y_nonblank[i] = one(started)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
temp_0 = NAND(started,started)
temp_1 = NAND(started,temp_0)
temp_2 = NAND(started,started)
temp_3 = NAND(temp_1,temp_2)
temp_4 = NAND(carry,started)
carry = NAND(temp_3,temp_4)
temp_6 = NAND(started,started)
started = NAND(started,temp_6)
temp_8 = NAND(X[i],carry)
temp_9 = NAND(X[i],temp_8)
temp_10 = NAND(carry,temp_8)
Y[i] = NAND(temp_9,temp_10)
temp_12 = NAND(X[i],carry)
carry = NAND(temp_12,temp_12)
temp_14 = NAND(started,started)
Y_nonblank[i] = NAND(started,temp_14)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
P
Working out the above two examples can go a long
way towards understanding the NAND-TM language.
See our GitHub repository for a full specification of
the NAND-TM language.
Theorem 7.11 — Turing machines and NAND-TM programs are equivalent. For every 𝐹 : {0,1}∗ → {0,1}∗, 𝐹 is computable by a NAND-TM program if and only if 𝐹 is computable by a Turing machine.
Proof Idea:
To prove such an equivalence theorem, we need to show two di-
rections. We need to be able to (1) transform a Turing machine 𝑀 to
a NAND-TM program 𝑃 that computes the same function as 𝑀 and
(2) transform a NAND-TM program 𝑃 into a Turing machine 𝑀 that
computes the same function as 𝑃 .
The idea of the proof is illustrated in Fig. 7.9. To show (1), given
a Turing machine 𝑀 , we will create a NAND-TM program 𝑃 that
will have an array Tape for the tape of 𝑀 and scalar (i.e., non array)
variable(s) state for the state of 𝑀. Specifically, since the state of a Turing machine is not in {0,1} but rather in a larger set [𝑘], we will use ⌈log 𝑘⌉ variables state_0, …, state_{⌈log 𝑘⌉−1} to store the representation of the state. Similarly, to encode the larger alphabet Σ of the tape, we will use ⌈log |Σ|⌉ arrays Tape_0, …, Tape_{⌈log |Σ|⌉−1}, such that the 𝑖-th location of these arrays encodes the 𝑖-th symbol in the tape, for every 𝑖. Using the fact that every function can be computed
by a NAND-CIRC program, we will be able to compute the transition
function of 𝑀 , replacing moving left and right by decrementing and
incrementing i respectively.
We show (2) using very similar ideas. Given a program 𝑃 that
uses 𝑎 array variables and 𝑏 scalar variables, we will create a Turing
machine with about 2𝑏 states to encode the values of scalar variables,
and an alphabet of about 2𝑎 so we can encode the arrays using our
tape. (The reason the sizes are only “about” 2𝑎 and 2𝑏 is that we will
need to add some symbols and steps for bookkeeping purposes.) The
Turing Machine 𝑀 will simulate each iteration of the program 𝑃 by
updating its state and tape accordingly.
⋆
• We encode [𝑘] using {0,1}^ℓ and Σ using {0,1}^{ℓ′}, where ℓ = ⌈log 𝑘⌉ and ℓ′ = ⌈log |Σ|⌉.

• We encode the set {L, R, S, H} using {0,1}². We will choose the encoding L ↦ 01, R ↦ 11, S ↦ 10, H ↦ 00. (This conveniently corresponds to the semantics of the MODANDJUMP operation.)
Every step of the main loop of the above program perfectly mimics
the computation of the Turing Machine 𝑀 and so the program carries
out exactly the definition of computation by a Turing Machine as per
Definition 7.1.
For the other direction, suppose that 𝑃 is a NAND-TM program with 𝑠 lines, ℓ scalar variables, and ℓ′ array variables. We will show that there exists a Turing machine 𝑀_𝑃 with 2^ℓ + 𝐶 states and an alphabet Σ of size 𝐶′ + 2^{ℓ′} that computes the same function as 𝑃 (where 𝐶, 𝐶′ are some absolute constants).

{0,1}^{ℓ′} that on input the contents of 𝑃’s scalar variables and the contents
3. When the program halts (i.e., MODANDJMP gets 00) then the Turing
machine will enter into a special loop to copy the results of the Y
array into the output and then halt. We can achieve this by adding a
few more states.
R
Remark 7.13 — Running time equivalence (optional). If
we examine the proof of Theorem 7.11 then we can see
that every iteration of the loop of a NAND-TM pro-
gram corresponds to one step in the execution of the
Turing machine. We will come back to this question
of measuring number of computation steps later in
this course. For now the main take away point is that
NAND-TM programs and Turing Machines are essen-
tially equivalent in power even when taking running
time into account.
• Inner loops such as the while and for operations common to many programming languages.
• Multiple index variables (e.g., not just i but we can add j, k, etc.).
In all of these cases (and many others) we can implement the new
feature as mere “syntactic sugar” on top of standard NAND-TM,
which means that the set of functions computable by NAND-TM
with this feature is the same as the set of functions computable by
standard NAND-TM. Similarly, we can show that the set of functions
computable by Turing Machines that have more than one tape, or
tapes of more dimensions than one, is the same as the set of functions
computable by standard Turing machines.
"start": do foo
GOTO("end")
"skip": do bar
"end": do blah
then the program will only do foo and blah as when it reaches the
line GOTO("end") it will jump to the line labeled with "end". We can
achieve the effect of GOTO in NAND-TM using conditionals. In the
code below, we assume that we have a variable pc that can take strings
of some constant length. This can be encoded using a finite number
of Boolean variables pc_0, pc_1, …, pc_{𝑘−1}, and so when we write below pc = "label" what we mean is something like pc_0 = 0, pc_1 = 1, … (where the bits 0, 1, … correspond to the encoding of the finite string "label" as a string of length 𝑘). We also assume that we have
access to conditional (i.e., if statements), which we can emulate using
syntactic sugar in the same way as we did in NAND-CIRC.
To emulate a GOTO statement, we will first modify a program P of
the form
do foo
do bar
do blah
into the following form:

pc = "line1"
if (pc=="line1"):
do foo
pc = "line2"
if (pc=="line2"):
do bar
pc = "line3"
if (pc=="line3"):
do blah
Other loops. Once we have GOTO, we can emulate all the standard loop
constructs such as while, do .. until or for in NAND-TM as well.
For example, we can replace the code
while foo:
do blah
do bar
with
"loop":
if NOT(foo): GOTO("next")
do blah
GOTO("loop")
"next":
do bar
R
Remark 7.14 — GOTO’s in programming languages. The GOTO statement was a staple of most early programming languages, but has largely fallen out of favor and is not included in many modern languages such as Python, Java, and JavaScript. In 1968, Edsger Dijkstra wrote a famous letter titled “Go to statement considered harmful.” (see also Fig. 7.10). The main trouble with GOTO is that it makes analysis of programs more difficult by making it harder to argue about invariants of the program.
When a program contains a loop of the form:
for j in range(100):
do something
do blah
code. This notion of “self replication”, and the related notion of “self
reference” is crucial to many aspects of computation, as well of course
to life itself, whether in the form of digital or biological programs.
For now, what you ought to remember is the following differences
between uniform and non uniform computational models:
✓ Chapter Recap
7.6 EXERCISES
Exercise 7.1 — Explicit NAND-TM programming. Produce the code of a (syntactic-sugar free) NAND-TM program 𝑃 that computes the (unbounded input length) Majority function Maj : {0,1}∗ → {0,1}, where for every 𝑥 ∈ {0,1}∗, Maj(𝑥) = 1 if and only if $\sum_{i=0}^{|x|-1} x_i > |x|/2$. We say “produce” rather than “write” because you do not have to write the code of 𝑃 by hand, but rather can use the programming language of your choice to compute this code.

■
4. SORT ∶ {0, 1}∗ → {0, 1}∗ which takes as input the representation of
a list of natural numbers (𝑎0 , … , 𝑎𝑛−1 ) and returns its sorted version
(𝑏0 , … , 𝑏𝑛−1 ) such that for every 𝑖 ∈ [𝑛] there is some 𝑗 ∈ [𝑛] with
𝑏𝑖 = 𝑎𝑗 and 𝑏0 ≤ 𝑏1 ≤ ⋯ ≤ 𝑏𝑛−1 .
there are two index variables i and j, but now the arrays are two dimensional, and so we index an array Foo by Foo[i][j]. Prove that for every function 𝐹 : {0,1}∗ → {0,1}∗, 𝐹 is computable by a NAND-TM program if and only if 𝐹 is computable by a NAND-TM″ program.
■
$$G(x) = \begin{cases} 1 & \exists y \in \{0,1\}^{|x|} \text{ s.t. } F(xy) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (7.3)$$
is in R.
machine (see Definition 7.3). Prove that R is countable. That is, prove
that there exists a one-to-one map 𝐷𝑡𝑁 ∶ R → ℕ. You can use the
equivalence between Turing machines and NAND-TM programs.
■
8
Equivalent models of computation
Theorem 8.1 — Turing Machines (aka NAND-TM programs) and RAM machines (aka NAND-RAM programs) are equivalent. For every function 𝐹 : {0,1}∗ → {0,1}∗, 𝐹 is computable by a NAND-TM program if and only if 𝐹 is computable by a NAND-RAM program.
Proof Idea:
Clearly NAND-RAM is only more powerful than NAND-TM, and so if a function 𝐹 is computable by a NAND-TM program then it can be computed by a NAND-RAM program. The challenging direction is to transform a NAND-RAM program 𝑃 to an equivalent NAND-TM program 𝑄. To describe the proof in full we will need to cover the full formal specification of the NAND-RAM language, and show how we can implement every one of its features as syntactic sugar on top of NAND-TM.

This can be done, but going over all the operations in detail is rather tedious. Hence we will focus on describing the main ideas behind this simulation.

[Figure 8.4: Overview of the steps in the proof of Theorem 8.1 simulating NAND-RAM with NAND-TM. We first use the inner loop syntactic sugar of Section 7.4.1 to enable loading an integer from an array to the index variable i of NAND-TM. Once we can do that, we can simulate indexed access in NAND-TM. We then use an embedding of ℕ² in ℕ to simulate two dimensional bit arrays in NAND-TM. Finally, we use the binary representation to encode one-dimensional arrays of integers as two dimensional arrays of bits, hence completing the simulation of NAND-RAM with NAND-TM.]
2. Two dimensional bit arrays: We then show how we can use “syntactic
sugar” to augment NAND-TM with two dimensional arrays. That is,
have two indices i and j and two dimensional arrays, such that we can
use the syntax Foo[i][j] to access the (i,j)-th location of Foo.
R
Remark 8.2 — RAM machines / NAND-RAM and assembly language (optional). RAM machines correspond quite closely to actual microprocessors such as those in the Intel x86 series that also contain a large primary memory and a constant number of small registers. This is of course no accident: RAM machines aim at modeling more closely than Turing machines the architecture of actual computing systems, which largely follows the so called von Neumann architecture as described in the report [Neu45]. As a result, NAND-RAM is similar in its general outline to assembly languages such as x86 or MIPS. These assembly languages all have instructions to (1) move data from registers to memory, (2) perform arithmetic or logical computations on registers, and (3) conditional execution and loops (“if” and “goto”, commonly known as “branches” and “jumps” in the context of assembly languages).
The main difference between RAM machines and actual microprocessors (and correspondingly between
# set i to 0.
LABEL("zero_idx")
dir0 = zero
dir1 = one
# corresponds to i <- i-1
GOTO("zero_idx",NOT(Atzero[i]))
...
# zero out temp
# (code below assumes a specific prefix-free encoding
# in which 10 is the "end marker")
Temp[0] = 1
Temp[1] = 0
# set i to Bar, assume we know how to increment, compare
LABEL("increment_temp")
cond = EQUAL(Temp,Bar)
dir0 = one
dir1 = one
# corresponds to i <- i+1
INC(Temp)
GOTO("increment_temp",cond)
# if we reach this point, i is number encoded by Bar
...
# final instruction of program
MODANDJUMP(dir0,dir1)
Exercise 8.3 asks you to prove that 𝑒𝑚𝑏𝑒𝑑 is indeed one to one, as
well as computable by a NAND-TM program. (The latter can be done
by simply following the grade-school algorithms for multiplication,
addition, and division.) This means that we can replace code of the
form Two[Foo][Bar] = something (i.e., access the two dimensional
array Two at the integers encoded by the one dimensional arrays Foo
and Bar) by code of the form:
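The book defines the specific function 𝑒𝑚𝑏𝑒𝑑 earlier; as an illustration, here is a Python sketch using one standard choice, the Cantor pairing function, together with hypothetical replacement code of the kind described above (the names Twoflat and blah are ours):

def embed(x, y):
    # Cantor pairing: a one-to-one map from pairs of naturals to naturals,
    # computable with grade-school addition, multiplication, and division.
    return (x + y) * (x + y + 1) // 2 + y

# Hypothetical "de-sugared" replacement for Two[Foo][Bar] = something,
# where Foo and Bar encode integers and Twoflat is a one-dimensional array:
#   blah = embed(Foo, Bar)
#   Twoflat[blah] = something

# sanity check: embed is one-to-one on a small grid
assert len({embed(x, y) for x in range(100) for y in range(100)}) == 100 * 100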
R
Remark 8.3 — Recursion in NAND-RAM (advanced). One concept that appears in many programming languages but we did not include in NAND-RAM programs is recursion. However, recursion (and function calls in general) can be implemented in NAND-RAM using the stack data structure. A stack is a data structure containing a sequence of elements, where we can “push” elements into it and “pop” them from it in “first in, last out” order.
We can implement a stack using an array of integers Stack and a scalar variable stackpointer that will hold the position of the next free slot in the stack:

# push foo onto the stack
Stack[stackpointer] = foo
stackpointer += one

# pop the top of the stack into bar
stackpointer -= one
bar = Stack[stackpointer]
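As a toy illustration of this idea (in Python rather than NAND-RAM, and not part of the book’s construction), here is an explicit stack replacing a recursive computation of the factorial:

def fact(n):
    stack, result = [], 1
    while n > 0:          # "push" phase: record the pending work
        stack.append(n)
        n -= 1
    while stack:          # "pop" phase: consume in first-in-last-out order
        result *= stack.pop()
    return result

print(fact(5))  # 120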
Definition 8.5 — Turing completeness and equivalence (optional). Let ℱ be the set of all partial functions from {0,1}∗ to {0,1}∗. A computational model is a map ℳ : {0,1}∗ → ℱ.
We say that a program 𝑃 ∈ {0,1}∗ ℳ-computes a function 𝐹 ∈ ℱ if ℳ(𝑃) = 𝐹.
A computational model ℳ is Turing complete if there is a computable map ENCODE_ℳ : {0,1}∗ → {0,1}∗ such that for every Turing machine 𝑁 (represented as a string), ℳ(ENCODE_ℳ(𝑁)) is equal to the partial function computed by 𝑁.
A computational model ℳ is Turing equivalent if it is Turing complete and there exists a computable map DECODE_ℳ : {0,1}∗ → {0,1}∗ such that for every string 𝑃 ∈ {0,1}∗, 𝑁 = DECODE_ℳ(𝑃) is a string representation of a Turing machine that computes the function ℳ(𝑃).
• Turing machines
• NAND-TM programs
• NAND-RAM programs
• λ calculus
• Game of life (mapping programs and inputs/outputs to starting
and ending configurations)
• Programming languages such as Python/C/Javascript/OCaml…
(allowing for unbounded storage)
Since the cells in the game of life are arranged in an infinite two-dimensional grid, it is an example of a two dimensional cellular automaton. We can also consider the even simpler setting of a one dimensional cellular automaton, where the cells are arranged in an infinite line, see Fig. 8.10. It turns out that even this simple model is enough to achieve
Theorem 8.7 — One dimensional automata are Turing complete. For every Turing machine 𝑀, there is a one dimensional cellular automaton that can simulate 𝑀 on every input 𝑥.
[Figure 8.11: A Game-of-Life configuration simulating a Turing Machine. Figure by Paul Rendell.]

To make the notion of “simulating a Turing machine” more precise we will need to define configurations of Turing machines. We will do so in Section 8.4.2 below, but at a high level a configuration of a Turing machine is a string that encodes its full state at a given step in
its computation. That is, the contents of all (non empty) cells of its
tape, its current state, as well as the head position.
The key idea in the proof of Theorem 8.7 is that at every point in
the computation of a Turing machine 𝑀 , the only cell in 𝑀 ’s tape that
can change is the one where the head is located, and the value this
cell changes to is a function of its current state and the finite state of
𝑀 . This observation allows us to encode the configuration of a Turing
machine 𝑀 as a finite configuration of a cellular automaton 𝑟, and
ensure that a one-step evolution of this encoded configuration under
the rules of 𝑟 corresponds to one step in the execution of the Turing
machine 𝑀 .
• 𝑀’s tape contains 𝛼_{𝑗,0} for all 𝑗 < |𝛼| and contains ∅ for all positions that are at least |𝛼|, where we let 𝛼_{𝑗,0} be the value 𝜎 such that 𝛼_𝑗 = (𝜎, 𝑡) with 𝜎 ∈ Σ and 𝑡 ∈ {⋅} ∪ [𝑘]. (In other words, 𝛼_{𝑗,0} is the tape-symbol component of 𝛼_𝑗.)
P
Definition 8.8 below has some technical details, but
is not actually that deep or complicated. Try to take a
moment to stop and think how you would encode as a
string the state of a Turing machine at a given point in
an execution.
Think what are all the components that you need to
know in order to be able to continue the execution
from this point onwards, and what is a simple way
to encode them using a list of finite symbols. In par-
ticular, with an eye towards our future applications,
try to think of an encoding which will make it as sim-
ple as possible to map a configuration at step 𝑡 to the
configuration at step 𝑡 + 1.
2. The full contents of the large scale memory, that is the tape.
3. The contents of the “local registers”, that is the state of the ma-
chine.
Theorem 8.10 — One dimensional automata are Turing complete (formal statement). For every Turing Machine 𝑀, if we denote by Σ the alphabet of its configuration strings, then there is a one-dimensional cellular automaton 𝑟 over the alphabet Σ such that
The automaton arising from the proof of Theorem 8.10 has a large alphabet, and furthermore one whose size depends on the machine 𝑀 that is being simulated. It turns out that one can obtain an automaton with an alphabet of fixed size that is independent of the program being simulated, and in fact the alphabet of the automaton can be the minimal set {0,1}! See Fig. 8.13 for an example of such a Turing-complete automaton.
R
Remark 8.11 — Configurations of NAND-TM programs.
We can use the same approach as Definition 8.8 to
define configurations of a NAND-TM program. Such a
configuration will need to encode:
𝑓(𝑥) = 𝑥 × 𝑥 (8.3)
we can write it as
𝜆𝑥.𝑥 × 𝑥 (8.4)
and so (𝜆𝑥.𝑥 × 𝑥)(7) = 49. That is, you can think of 𝜆𝑥.𝑒𝑥𝑝(𝑥),
where 𝑒𝑥𝑝 is some expression as a way of specifying the anonymous
function 𝑥 ↦ 𝑒𝑥𝑝(𝑥). Anonymous functions, using either 𝜆𝑥.𝑓(𝑥), 𝑥 ↦
𝑓(𝑥) or other closely related notation, appear in many programming
languages. For example, in Python we can define the squaring function
using lambda x: x*x while in JavaScript we can use x => x*x or
(x) => x*x. In Scheme we would define it as (lambda (x) (* x x)).
Clearly, the name of the argument to a function doesn’t matter, and so
𝜆𝑦.𝑦 × 𝑦 is the same as 𝜆𝑥.𝑥 × 𝑥, as both correspond to the squaring
function.
Dropping parenthesis. To reduce notational clutter, when writing
𝜆 calculus expressions we often drop the parentheses for function
evaluation. Hence instead of writing 𝑓(𝑥) for the result of applying
the function 𝑓 to the input 𝑥, we can also write this as simply 𝑓 𝑥.
Therefore we can write (𝜆𝑥.𝑥 × 𝑥)7 = 49. In this chapter, we will use both the 𝑓(𝑥) and 𝑓 𝑥 notations for function application. Function application associates from left to right, and hence 𝑓 𝑔 ℎ is the same as (𝑓 𝑔) ℎ.
For example, can you guess what number the following expression is
equal to?
P
The expression (8.5) might seem daunting, but before
you look at the solution below, try to break it apart
into its components, and evaluate one component at a
time. Working out this example would go a long way
toward understanding the λ calculus.
((𝐹 𝑔) 3) . (8.6)

((𝜆𝑥.(𝜆𝑦.𝑥)) 2) 9 . (8.7)
Solution:
𝜆𝑦.𝑥 is the function that on input 𝑦 ignores its input and outputs
𝑥. Hence (𝜆𝑥.(𝜆𝑦.𝑥))2 yields the function 𝑦 ↦ 2 (or, using 𝜆 nota-
tion, the function 𝜆𝑦.2). Hence (8.7) is equivalent to (𝜆𝑦.2)9 = 2.
■
𝜆𝑥.(𝜆𝑦.𝑥 + 𝑦) (8.8)
(𝜆𝑥.𝑓)(𝜆𝑦.𝑔𝑧) . (8.9)
There are two natural conventions for this:
Because the λ calculus has only pure functions, that do not have
“side effects”, in many cases the order does not matter. In fact, it can
be shown that if we obtain a definite irreducible expression (for ex-
ample, a number) in both strategies, then it will be the same one.
However, for concreteness we will always use the “call by name” (i.e.,
lazy evaluation) order. (The same choice is made in the programming
language Haskell, though many other programming languages use
eager evaluation.) Formally, the evaluation of a λ expression using
“call by name” is captured by the following process:
𝑒 = 𝜆𝑥.𝑥 (8.10)
𝑓 = (𝜆𝑎.(𝜆𝑏.𝑏))(𝜆𝑧.𝑧𝑧) (8.11)
Solution:
The canonical simplification of 𝑒 is simply 𝜆𝑣0.𝑣0. To do the
canonical simplification of 𝑓 we first use 𝛽 reduction to plug in
𝜆𝑧.𝑧𝑧 instead of 𝑎 in (𝜆𝑏.𝑏), but since 𝑎 is not used in this function at
all, we simply obtain 𝜆𝑏.𝑏, which simplifies to 𝜆𝑣0.𝑣0 as well.
■
REDUCE 𝐿 𝑓 𝑧 = 𝑧 if 𝐿 = NIL, and REDUCE 𝐿 𝑓 𝑧 = 𝑓 (HEAD 𝐿) (REDUCE (TAIL 𝐿) 𝑓 𝑧) otherwise. (8.14)
See Fig. 8.16 for an illustration of the three list-processing operations.
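To make the recursion in (8.14) concrete, here is a minimal Python sketch (ours, not from the text) of the same operation, using plain Python lists in place of λ-calculus lists:

def REDUCE(L, f, z):
    # returns z on the empty list, otherwise combines the head with
    # the result of reducing the tail
    return z if not L else f(L[0], REDUCE(L[1:], f, z))

print(REDUCE([1, 2, 3], lambda a, b: a + b, 0))
# 6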
Solved Exercise 8.3 — Compute NAND using λ calculus. Give a λ expression
𝑁 such that 𝑁 𝑥 𝑦 = NAND(𝑥, 𝑦) for every 𝑥, 𝑦 ∈ {0, 1}.
■
Solution:
The NAND of 𝑥, 𝑦 is equal to 1 unless 𝑥 = 𝑦 = 1. Hence we can
write 𝑁 = 𝜆𝑥, 𝑦.IF(𝑥, NOT(𝑦), 1).
■

Solved Exercise 8.4 — Compute XOR using λ calculus. Give a λ expression
XOR such that for every list 𝐿 of bits, XOR 𝐿 evaluates to the XOR of 𝐿's elements.

Solution:
First, we note that we can compute the XOR of two bits as follows:

NOT = 𝜆𝑎.IF(𝑎, 0, 1) (8.16)

and

XOR2 = 𝜆𝑎, 𝑏.IF(𝑏, NOT(𝑎), 𝑎) (8.17)
(We are using here a bit of syntactic sugar to describe the func-
tions. To obtain the λ expression for XOR we will simply replace
the expression (8.16) in (8.17).) Now recursively we can define the
XOR of a list as follows:
XOR(𝐿) = 0 if 𝐿 is empty, and XOR(𝐿) = XOR2(HEAD(𝐿), XOR(TAIL(𝐿))) otherwise. (8.18)
This means that XOR is equal to RECURSE 𝑚𝑦XOR, where 𝑚𝑦XOR = 𝜆𝑚𝑒, 𝐿.IF(ISEMPTY(𝐿), 0, XOR2(HEAD 𝐿, 𝑚𝑒(TAIL 𝐿))).
Proof Idea:
To prove the theorem, we need to show that (1) if 𝐹 is computable
by a λ calculus expression then it is computable by a Turing machine,
and (2) if 𝐹 is computable by a Turing machine, then it is computable
by an enhanced λ calculus expression.
Showing (1) is fairly straightforward. Applying the simplification
rules to a λ expression basically amounts to “search and replace”
which we can implement easily in, say, NAND-RAM, or for that
matter Python (both of which are equivalent to Turing machines in
power). Showing (2) essentially amounts to simulating a Turing ma-
chine (or writing a NAND-TM interpreter) in a functional program-
ming language such as LISP or Scheme. We give the details below,
but working out how this can be done on your own is a good exercise in mastering
some functional programming techniques that are useful in their own right.
⋆
Proof of Theorem 8.16. We only sketch the proof. The “if” direction
is simple. As mentioned above, evaluating λ expressions basically
amounts to “search and replace”. It is also a fairly straightforward
programming exercise to implement all the above basic operations in
an imperative language such as Python or C, and using the same ideas
we can do so in NAND-RAM as well, which we can then transform to
a NAND-TM program.
For the "only if" direction we need to simulate a Turing machine
using a λ expression. We will do so by first showing that for every
Turing machine 𝑀 there is a λ expression that computes the next-step
function NEXT𝑀 ∶ Σ∗ → Σ∗ that maps a configuration of 𝑀 to the next
one (see Section 8.4.2).
A configuration of 𝑀 is a string 𝛼 ∈ Σ∗ for a finite set Σ. We can
encode every symbol 𝜎 ∈ Σ by a string in {0, 1}ℓ, and so we will
encode a configuration 𝛼 in the λ calculus as a list ⟨𝛼0, 𝛼1, …, 𝛼𝑚−1, ⊥⟩
where 𝛼𝑖 is an ℓ-length string (i.e., an ℓ-length list of 0's and 1's) encoding a symbol in Σ.
By Lemma 8.9, for every 𝛼 ∈ Σ∗, NEXT𝑀(𝛼)𝑖 is equal to
𝑟(𝛼𝑖−1, 𝛼𝑖, 𝛼𝑖+1) for some finite function 𝑟 ∶ Σ³ → Σ. Using our
encoding of Σ as {0, 1}ℓ, we can also think of 𝑟 as mapping {0, 1}3ℓ to
{0, 1}ℓ. By Solved Exercise 8.3, we can compute the NAND function,
and hence every finite function, including 𝑟, using the λ calculus.
Using this insight, we can compute NEXT𝑀 using the λ calculus as
follows. Given a list 𝐿 encoding the configuration 𝛼0 ⋯ 𝛼𝑚−1 , we
define the lists 𝐿𝑝𝑟𝑒𝑣 and 𝐿𝑛𝑒𝑥𝑡 encoding the configuration 𝛼 shifted
by one step to the right and left respectively. The next configuration
𝛼′ is defined as 𝛼′𝑖 = 𝑟(𝐿𝑝𝑟𝑒𝑣[𝑖], 𝐿[𝑖], 𝐿𝑛𝑒𝑥𝑡[𝑖]) where we let 𝐿′[𝑖] denote the 𝑖-th element of the list 𝐿′.
FINAL(𝛼) = 𝛼 if 𝛼 is a halting configuration, and FINAL(𝛼) = NEXT𝑀(𝛼) otherwise. (8.21)
P
This is a good point to pause and think how
you would implement these operations your-
self. For example, start by thinking how you
could implement MAP using REDUCE, and
then REDUCE using RECURSE combined with
0, 1, IF, PAIR, HEAD, TAIL, NIL, ISEMPTY. You can
also implement PAIR, HEAD and TAIL based on 0, 1, IF. The most
challenging part is to implement RECURSE using only
the operations of the pure λ calculus.
Theorem 8.18 — Enhanced λ calculus equivalent to pure λ calculus. There
are λ expressions that implement the functions 0, 1, IF, PAIR, HEAD,
TAIL, NIL, ISEMPTY, MAP, REDUCE, and RECURSE.
• We define NIL to be the function that ignores its input and always
outputs 1. That is, NIL = 𝜆𝑥.1. The ISEMPTY function checks,
given an input 𝑝, whether we get 1 if we apply 𝑝 to the function
𝑧𝑒𝑟𝑜 = 𝜆𝑥, 𝑦.0 that ignores both its inputs and always outputs 0. For
every valid pair of the form 𝑝 = PAIR 𝑥 𝑦, 𝑝 𝑧𝑒𝑟𝑜 = 𝑧𝑒𝑟𝑜 𝑥 𝑦 = 0, while
NIL 𝑧𝑒𝑟𝑜 = 1. Formally, ISEMPTY = 𝜆𝑝.𝑝(𝜆𝑥, 𝑦.0).
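As a sanity check, here is a Python sketch (ours, not from the text) of these encodings, with 0 and 1 encoded as curried two-argument selector functions; the decoder to_bool is our own helper, used only to read off the answer:

one  = lambda x: lambda y: x            # 1 = λx,y.x
zero = lambda x: lambda y: y            # 0 = λx,y.y
PAIR = lambda x: lambda y: (lambda f: f(x)(y))
NIL  = lambda f: one                    # ignores its input, outputs 1
ISEMPTY = lambda p: p(lambda x: lambda y: zero)
to_bool = lambda b: b(True)(False)      # decode a λ-encoded bit

print(to_bool(ISEMPTY(NIL)))              # True
print(to_bool(ISEMPTY(PAIR(one)(zero))))  # False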
R
Remark 8.19 — Church numerals (optional). There is
nothing special about Boolean values. You can use
similar tricks to implement natural numbers using
λ terms. The standard way to do so is to represent
the number 𝑛 by the function ITER𝑛 that on input a
function 𝑓 outputs the function 𝑥 ↦ 𝑓(𝑓(⋯ 𝑓(𝑥))) (𝑛
times). That is, we represent the natural number 1 as
𝜆𝑓.𝑓, the number 2 as 𝜆𝑓.(𝜆𝑥.𝑓(𝑓𝑥)), the number 3 as
𝜆𝑓.(𝜆𝑥.𝑓(𝑓(𝑓𝑥))), and so on and so forth. (Note that
this is not the same representation we used for 1 in
the Boolean context: this is fine; we already know that
the same object can be represented in more than one
way.) The number 0 is represented by the function
that maps any function 𝑓 to the identity function 𝜆𝑥.𝑥.
(That is, 0 = 𝜆𝑓.(𝜆𝑥.𝑥).)
In this representation, we can compute PLUS(𝑛, 𝑚)
as 𝜆𝑓.𝜆𝑥.(𝑛𝑓)((𝑚𝑓)𝑥) and TIMES(𝑛, 𝑚) as 𝜆𝑓.𝑛(𝑚𝑓).
Subtraction and division are trickier, but can be
achieved using recursion. (Working this out is a great
exercise.)
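The following Python sketch (ours, not from the text) implements this representation, along with the PLUS and TIMES operations from the remark; the decoder to_int is our own helper, counting how many times 𝑓 gets applied:

zero  = lambda f: lambda x: x                   # 0 = λf.(λx.x)
succ  = lambda n: lambda f: lambda x: f(n(f)(x))
PLUS  = lambda n: lambda m: lambda f: lambda x: n(f)(m(f)(x))
TIMES = lambda n: lambda m: lambda f: n(m(f))
to_int = lambda n: n(lambda k: k + 1)(0)        # count applications of f

two   = succ(succ(zero))
three = succ(two)
print(to_int(PLUS(two)(three)))   # 5
print(to_int(TIMES(two)(three)))  # 6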
XOR(𝐿) = 0 if 𝐿 is empty, and XOR(𝐿) = XOR2(HEAD(𝐿), XOR(TAIL(𝐿))) otherwise. (8.24)
where XOR2 ∶ {0, 1}2 → {0, 1} is the XOR on two bits. In Python we
would write this as
def xor2(a,b): return 1-b if a else b
def head(L): return L[0]
def tail(L): return L[1:]
def xor(L): return xor2(head(L), xor(tail(L))) if L else 0

print(xor([0,1,1,0,0,1]))
# 1
Now, how could we eliminate this recursive call? The main idea is
that since functions can take other functions as input, it is perfectly
legal in Python (and the λ calculus of course) to give a function itself
as input. So, our idea is to try to come up with a non-recursive function
tempxor that takes two inputs, a function and a list, such that
tempxor(tempxor,L) will output the XOR of L!
P
At this point you might want to stop and try to im-
plement this on your own in Python or any other
programming language of your choice (as long as it
allows functions as inputs).
Our first attempt might be to simply use the idea of replacing the
recursive call by me. Let's define this function as myxor:

def myxor(me,L): return xor2(head(L), me(tail(L))) if L else 0

If we now invoke it as follows:

myxor(myxor,[1,0,1])
If you do this, you will get the following complaint from the inter-
preter:
TypeError: myxor() missing 1 required positional argument
The problem is that myxor expects two inputs (a function and a
list), while in the call to me we only provided a list. To correct this, we
modify the call to also provide the function itself:

def tempxor(me,L): return xor2(head(L), me(me,tail(L))) if L else 0

Testing it, we see that it indeed computes XOR:
tempxor(tempxor,[1,0,1])
# 0
tempxor(tempxor,[1,0,1,1])
# 1
1. Create the function myf that takes a pair of inputs me and x, and
replaces recursive calls to f with calls to me.

2. Create the function tempf that converts calls in myf of the form
me(x) to calls of the form me(me,x).

3. The function f we want is then obtained by defining f(x) = tempf(tempf,x).
def RECURSE(myf):
    def tempf(me,x): return myf(lambda y: me(me,y),x)
    return lambda x: tempf(tempf,x)
xor = RECURSE(myxor)
print(xor([0,1,1,0,0,1]))
# 1
print(xor([1,1,0,0,1,1,1,1]))
# 0
# XOR function
myXOR = lambda me,L: IF(ISEMPTY(L), 0, XOR2(HEAD(L), me(TAIL(L))))
XOR = RECURSE(myXOR)
#TESTING:
R
Remark 8.20 — The Y combinator. The RECURSE opera-
tor above is better known as the Y combinator.
It is one of a family of fixed point operators that given
a lambda expression 𝐹 , find a fixed point 𝑓 of 𝐹 such
that 𝑓 = 𝐹 𝑓. If you think about it, XOR is the fixed
point of 𝑚𝑦𝑋𝑂𝑅 above: XOR is the function such
that if we plug in XOR as the first argument
of 𝑚𝑦𝑋𝑂𝑅 then we get back XOR, or in other words
XOR = 𝑚𝑦𝑋𝑂𝑅 XOR. Hence finding a fixed point for
𝑚𝑦𝑋𝑂𝑅 is the same as applying RECURSE to it.
"[The thesis is] not so much a definition or an axiom but … a natural law.",
Emil Post, 1936.
Computational problems | Type of model | Examples
Finite functions 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 | Non-uniform computation (algorithm depends on input length) | Boolean circuits, NAND circuits, straight-line programs (e.g., NAND-CIRC)
Functions with unbounded inputs 𝐹 ∶ {0, 1}∗ → {0, 1}∗ | Sequential access to memory | Turing machines, NAND-TM programs
– | Indexed access / RAM | RAM machines, NAND-RAM, modern programming languages
– | Other | Lambda calculus, cellular automata
✓ Chapter Recap
8.9 EXERCISES
Exercise 8.1 — Alternative proof for TM/RAM equivalence. Let SEARCH ∶
{0, 1}∗ → {0, 1}∗ be the following function. The input is a pair
(𝐿, 𝑘) where 𝑘 ∈ {0, 1}∗ and 𝐿 is an encoding of a list of key-value pairs
(𝑘0, 𝑣0), …, (𝑘𝑚−1, 𝑣𝑚−1) where 𝑘0, …, 𝑘𝑚−1, 𝑣0, …, 𝑣𝑚−1 are binary
strings. The output is 𝑣𝑖 for the smallest 𝑖 such that 𝑘𝑖 = 𝑘, if such 𝑖
exists, and otherwise the empty string.
4. Prove that for every 𝐹 ∶ {0, 1}∗ → {0, 1}∗ that is computable by a
NAND-RAM program, 𝐹 is computable by a Turing machine.
Exercise 8.2 — NAND-TM lookup. This exercise shows part of the proof that
NAND-TM can simulate NAND-RAM. Produce the code of a NAND-TM
program that computes the function LOOKUP ∶ {0, 1}∗ → {0, 1}
that is defined as follows. On input 𝑝𝑓(𝑖)𝑥, where 𝑝𝑓(𝑖) denotes a
prefix-free encoding of an integer 𝑖, LOOKUP(𝑝𝑓(𝑖)𝑥) = 𝑥𝑖 if 𝑖 < |𝑥|
and LOOKUP(𝑝𝑓(𝑖)𝑥) = 0 otherwise.
Exercise 8.7 — Next-step function is local. Prove Lemma 8.9 and use it to
complete the proof of Theorem 8.7.
Exercise 8.8 — λ calculus requires at most three variables. Prove that for every λ-expression 𝑒 with no free variables there is an equivalent λ-expression 𝑓 that only uses the variables 𝑥, 𝑦, and 𝑧.6

6 Hint: You can reduce the number of variables a function takes by "pairing them up". That is, define a λ expression PAIR such that for every 𝑥, 𝑦, PAIR 𝑥 𝑦 is some function 𝑓 such that 𝑓 0 = 𝑥 and 𝑓 1 = 𝑦. Then use PAIR to iteratively reduce the number of variables used.

■

Exercise 8.9 — Evaluation order example in λ calculus. 1. Let 𝑒 = (𝜆𝑥.7)((𝜆𝑥.𝑥𝑥)(𝜆𝑥.𝑥𝑥)). Prove that the simplification process of 𝑒 ends in a definite number if we use the "call by name" evaluation order, while it never ends if we use "call by value".
Exercise 8.11 — Next-step function without RECURSE. Let 𝑀 be a Turing
machine. Give an enhanced λ calculus expression to compute the
next-step function NEXT𝑀 of 𝑀 (as in the proof of Theorem 8.16)
without using RECURSE. See footnote for hint.9

9 Use MAP and REDUCE (and potentially FILTER). You might also find the function 𝑧𝑖𝑝 of Exercise 8.10 useful.

■
Exercise 8.12 — λ calculus to NAND-TM compiler (challenging). Give a program
in the programming language of your choice that takes as input a λ
expression 𝑒 and outputs a NAND-TM program 𝑃 that computes the
same function as 𝑒. For partial credit you can use the GOTO and all
NAND-CIRC syntactic sugar in your output program. You can use
any encoding of λ expressions as binary strings that is convenient for
you. See footnote for hint.10

10 Try to set up a procedure such that if array Left contains an encoding of a λ expression 𝜆𝑥.𝑒 and array Right contains an encoding of another λ expression 𝑒′, then the array Result will contain 𝑒[𝑥 → 𝑒′].

■
Exercise 8.13 — At least two in 𝜆 calculus. Let 1 = 𝜆𝑥, 𝑦.𝑥 and 0 = 𝜆𝑥, 𝑦.𝑦 as
before. Define

Prove that ALT is a 𝜆 expression that computes the at least two function. That is, for every 𝑎, 𝑏, 𝑐 ∈ {0, 1} (as encoded above) ALT 𝑎 𝑏 𝑐 = 1
if and only if at least two of {𝑎, 𝑏, 𝑐} are equal to 1.
■
if search('110011') {
replace('110011','00')
} else if search('110111') {
replace('110111','00')
} else if search('111011') {
replace('111011','00')
} else if search('111111') {
replace('111111','00')
}
sion you fed it. Typed variants of the λ calculus are objects of intense
research, and are strongly related to type systems for programming
languages and computer-verifiable proof systems, see [Pie02]. Some of
the typed variants of the λ calculus do not have infinite loops, which
makes them very useful as ways of enabling static analysis of pro-
grams as well as computer-verifiable proofs. We will come back to this
point in Chapter 10 and Chapter 22.
Tao has proposed showing the Turing completeness of fluid dy-
namics (a “water computer”) as a way of settling the question of the
behavior of the Navier-Stokes equations, see this popular article.
Learning Objectives:
• The universal machine/program - “one
program to rule them all”
• A fundamental result in computer science and
mathematics: the existence of uncomputable
functions.
• The halting problem: the canonical example of
an uncomputable function.
• Introduction to the technique of reductions.
can represent 𝑀 as a string (i.e., using code) and then input 𝑀 to the
universal machine 𝑈 .
Beyond the practical applications, the existence of a universal algo-
rithm also has surprising theoretical ramifications, and in particular
can be used to show the existence of uncomputable functions, upend-
ing the intuitions of mathematicians over the centuries from Euler
to Hilbert. In this chapter we will prove the existence of the univer-
sal program, and also show its implications for uncomputability, see
Fig. 9.1.
Proof Idea:
Once you understand what the theorem says, it is not that hard to
prove. The desired program 𝑈 is an interpreter for Turing machines.
That is, 𝑈 gets a representation of the machine 𝑀 (think of it as source
code), and some input 𝑥, and needs to simulate the execution of 𝑀 on
𝑥.

Figure 9.2: A Universal Turing Machine is a single Turing Machine 𝑈 that can evaluate, given input the (description as a string of) arbitrary Turing machine 𝑀 and input 𝑥, the output of 𝑀 on 𝑥. In contrast to the universal circuit depicted in Fig. 5.6, the machine 𝑀 can be much more complex (e.g., more states or tape alphabet symbols) than 𝑈.

Think of how you would code 𝑈 in your favorite programming
language. First, you would need to decide on some representation
scheme for Turing machines as strings.
Definition 9.2 — String representation of Turing Machine. Let 𝑀 be a Turing
machine with 𝑘 states and a size-ℓ alphabet Σ = {𝜎0, …, 𝜎ℓ−1} (we
use the convention 𝜎0 = 0, 𝜎1 = 1, 𝜎2 = ∅, 𝜎3 = ▷). We represent
𝑀 as the triple (𝑘, ℓ, 𝑇) where 𝑇 is the table of values for 𝛿𝑀:
R
Remark 9.3 — Take away points of representation. The
details of the representation scheme of Turing ma-
chines as strings are immaterial for almost all applica-
tions. What you need to remember are the following
points:
Proof of Theorem 9.1. We will only sketch the proof, giving the major
ideas. First, we observe that we can easily write a Python program
that, on input a representation (𝑘, ℓ, 𝑇) of a Turing machine 𝑀 and
an input 𝑥, evaluates 𝑀 on 𝑥. Here is the code of this program for
concreteness, though you can feel free to skip it if you are not familiar
with (or interested in) Python:
def EVAL(δ,x):
    '''Evaluate TM given by transition table δ
    on input x'''
    Tape = ["▷"] + [a for a in x]   # tape starts with the start symbol ▷
    i = 0; s = 0 # i = head pos, s = state
    while True:
        s, Tape[i], d = δ[(s,Tape[i])]       # apply transition function
        if d == "H": break                   # halt
        if d == "L": i = max(i-1,0)          # move left (not beyond start)
        if d == "R": i += 1                  # move right
        if i >= len(Tape): Tape.append('Φ')  # extend tape with empty symbol
    j = 1; Y = [] # produce output
    while Tape[j] != 'Φ':
        Y.append(Tape[j])
        j += 1
    return Y
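As a hypothetical usage sketch (ours, not from the text), here is a transition table for a trivial one-state machine that replaces every 0 on its input with 1, moving right until it reaches the empty symbol Φ and then halting:

# transition table: (state, symbol) -> (new state, written symbol, direction)
δ = {
    (0, "▷"): (0, "▷", "R"),   # skip the start symbol
    (0, "0"): (0, "1", "R"),   # flip 0 to 1 and move right
    (0, "1"): (0, "1", "R"),   # leave 1 in place and move right
    (0, "Φ"): (0, "Φ", "H"),   # halt at the end of the input
}
print(EVAL(δ, "011"))
# ['1', '1', '1']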
R
Remark 9.4 — Efficiency of the simulation. The argu-
ment in the proof of Theorem 9.1 is a very inefficient
way to implement the dictionary data structure in
practice, but it suffices for the purpose of proving the
theorem. Reading and writing to a dictionary of 𝑚
values in this implementation takes Ω(𝑚) steps, but
it is in fact possible to do this in 𝑂(log 𝑚) steps using
a search tree data structure or even 𝑂(1) (for “typical”
instances) using a hash table. NAND-RAM and RAM
machines correspond to the architecture of modern
electronic computers, and so we can implement hash
tables and search trees in NAND-RAM just as they are
implemented in other programming languages.
Proof Idea:
The idea behind the proof follows quite closely Cantor’s proof that
the reals are uncountable (Theorem 2.5), and in fact the theorem can
also be obtained fairly directly from that result (see Exercise 7.11).
However, it is instructive to see the direct proof. The idea is to con-
struct 𝐹 ∗ in a way that will ensure that every possible machine 𝑀 will
in fact fail to compute 𝐹 ∗ . We do so by defining 𝐹 ∗ (𝑥) to equal 0 if 𝑥
describes a Turing machine 𝑀 which satisfies 𝑀 (𝑥) = 1 and defining
𝐹 ∗ (𝑥) = 1 otherwise. By construction, if 𝑀 is any Turing machine and
𝑥 is the string describing it, then 𝐹 ∗ (𝑥) ≠ 𝑀 (𝑥) and therefore 𝑀 does
not compute 𝐹 ∗ .
⋆
Big Idea 12 There are some functions that cannot be computed by
any algorithm.
P
The proof of Theorem 9.5 is short but subtle. I suggest
that you pause here and go back to read it again and
think about it - this is a proof that is worth reading at
least twice if not three or four times. It is not often the
case that a few lines of mathematical reasoning estab-
lish a deeply profound fact - that there are problems
we simply cannot solve.
Proof Idea:
One way to think about this proof is as follows:
Proof of Theorem 9.6. The proof will use the previously established
result Theorem 9.5. Recall that Theorem 9.5 shows that the following
function 𝐹 ∗ ∶ {0, 1}∗ → {0, 1} is uncomputable:
𝐹∗(𝑥) = 1 if 𝑥(𝑥) = 0, and 𝐹∗(𝑥) = 0 otherwise. (9.3)
where 𝑥(𝑥) denotes the output of the Turing machine described by the
string 𝑥 on the input 𝑥 (with the usual convention that 𝑥(𝑥) = ⊥ if this
computation does not halt).
We will show that the uncomputability of 𝐹 ∗ implies the uncom-
putability of HALT. Specifically, we will assume, towards a contra-
diction, that there exists a Turing machine 𝑀 that can compute the
HALT function, and use that to obtain a Turing machine 𝑀 ′ that com-
putes the function 𝐹 ∗ . (This is known as a proof by reduction, since we
reduce the task of computing 𝐹 ∗ to the task of computing HALT. By
the contrapositive, this means the uncomputability of 𝐹 ∗ implies the
uncomputability of HALT.)
Indeed, suppose that 𝑀 is a Turing machine that computes HALT.
Algorithm 9.7 describes a Turing Machine 𝑀 ′ that computes 𝐹 ∗ . (We
use “high level” description of Turing machines, appealing to the
“have your cake and eat it too” paradigm, see Big Idea 10.)
P
Once again, this is a proof that’s worth reading more
than once. The uncomputability of the halting prob-
lem is one of the fundamental theorems of computer
science, and is the starting point for much of the in-
vestigations we will see later. An excellent way to get
a better understanding of Theorem 9.6 is to go over
Section 9.3.2, which presents an alternative proof of
the same result.
enter an infinite loop (or programs that we know for sure will
enter such a loop). However, there is no general procedure that would
determine for an arbitrary program 𝑃 whether it halts or not. More-
over, there are some very simple programs for which no one knows
whether they halt or not. For example, the following Python program
will halt if and only if Goldbach’s conjecture is false:
def isprime(p):
    return all(p % i for i in range(2,p-1))

def Goldbach(n):
    return any( (isprime(p) and isprime(n-p))
                for p in range(2,n-1))

n = 4
while True:
    if not Goldbach(n): break  # loop halts iff Goldbach's conjecture fails
    n += 2
rec routine P
§L: if T[P] go to L
Return §
If T[P] = True the routine P will loop, and it will only terminate if
T[P] = False. In each case T[P] has exactly the wrong value, and this
contradiction shows that the function T cannot exist.
Yours faithfully,
C. Strachey
Churchill College, Cambridge
P
Try to stop and extract the argument for proving
Theorem 9.6 from the letter above.
Since CPL is not as common today, let us reproduce this proof. The
idea is the following: suppose for the sake of contradiction that there
exists a program T such that T(f,x) equals True iff f halts on input
x. (Strachey’s letter considers the no-input variant of HALT, but as
we’ll see, this is an immaterial distinction.) Then we can construct a
program P and an input x such that T(P,x) gives the wrong answer.
The idea is that on input x, the program P will do the following: run
T(x,x), and if the answer is True then go into an infinite loop, and
otherwise halt. Now you can see that T(P,P) will give the wrong
answer: if P halts when it gets its own code as input, then T(P,P) is
supposed to be True, but then P(P) will go into an infinite loop. And
if P does not halt, then T(P,P) is supposed to be False but then P(P)
will halt. We can also code this up in Python:
def CantSolveMe(T):
    """
    Gets function T that claims to solve HALT.
    Returns a pair (P,x) of code and input on which
    T(P,x) ≠ HALT(x)
    """
    def fool(x):
        if T(x,x):
            while True: pass
        return "I halted"
    return (fool,fool)
def T(f,x):
    """Crude halting tester - decides it doesn't halt
    if its source code contains a loop."""
    import inspect
    source = inspect.getsource(f)
    if "while" in source: return False
    if "for" in source: return False
    return True
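Running the two together shows the inevitable failure (a usage sketch, ours, not from the text; it assumes the code lives in a file so that inspect can read fool's source):

P, x = CantSolveMe(T)
print(T(P, x))   # False: T claims that P does not halt on x ...
print(P(x))      # ... but P(x) halts and returns "I halted"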
9.4 REDUCTIONS
The Halting problem turns out to be a linchpin of uncomputability, in
the sense that Theorem 9.6 has been used to show the uncomputabil-
ity of a great many interesting functions. We will see several examples
of such results in this chapter and the exercises, but there are many
more such results (see Fig. 9.6).
R
Remark 9.8 — Reductions are algorithms. A reduction
is an algorithm, which means that, as discussed in
Remark 0.3, a reduction has three components:
P
The proof of Theorem 9.9 is below, but before reading
it you might want to pause for a couple of minutes
and think how you would prove it yourself. In partic-
ular, try to think of what a reduction from HALT to
HALTONZERO would look like. Doing so is an excellent way to get some initial comfort with the notion
of proofs by reduction, which is a technique we will be
using time and again in this book.
in Big Idea 10, following our “have your cake and eat it too” paradigm,
we just use the generic name “algorithm” rather than worrying
whether we model them as Turing machines, NAND-TM programs,
NAND-RAM, etc.; this makes no difference since all these models are
equivalent to one another.)
Since this is our first proof by reduction from the Halting problem, we will spell it out in more detail than usual. Such a proof by
reduction consists of two steps:
2. Analysis of the reduction: We will then prove that under the hypoth-
esis that Algorithm 𝐴 computes HALTONZERO, Algorithm 𝐵 will
compute HALT.
def N(z):
    M = r'.......'
    # a string constant containing desc. of M
    x = r'.......'
    # a string constant containing x
    return eval(M,x)
    # note that we ignore the input z
R
Remark 9.11 — The hardwiring technique. In the proof of
Theorem 9.9 we used the technique of “hardwiring”
an input 𝑥 to a program/machine 𝑃. That is, modifying a program 𝑃 so that it uses "hardwired constants"
for some or all of its input. This technique is quite
common in reductions and elsewhere, and we will
often use it again in this course.
P
Despite the similarity in their names, ZEROFUNC and
HALTONZERO are two different functions. For exam-
ple, if 𝑀 is a Turing machine that on input 𝑥 ∈ {0, 1}∗ ,
halts and outputs the OR of all of 𝑥’s coordinates, then
HALTONZERO(𝑀 ) = 1 (since 𝑀 does halt on the
input 0) but ZEROFUNC(𝑀 ) = 0 (since 𝑀 does not
compute the constant zero function).
2. Return 𝐴(𝑀 ).
P
We leave the proof of Theorem 9.13 as an exercise
(Exercise 9.6). I strongly encourage you to stop here
and try to solve this exercise.
int First(int n) {
    if (n<0) return 0;
    return 2*n;
}

int Second(int n) {
    int i = 0;
    int j = 0;
    if (n<0) return 0;
    while (j<n) {
        i = i + 2;
        j = j + 1;
    }
    return i;
}
First and Second are two distinct C programs, but they compute
the same function. A semantic property would be either true for both
programs or false for both programs, since it depends on the function
the programs compute and not on their code. An example for a se-
mantic property that both First and Second satisfy is the following:
“The program 𝑃 computes a function 𝑓 mapping integers to integers satisfy-
ing that 𝑓(𝑛) ≥ 𝑛 for every input 𝑛”.
Solution:
Recall that ZEROFUNC(𝑀 ) = 1 if and only if 𝑀 (𝑥) = 0 for
every 𝑥 ∈ {0, 1}∗ . If 𝑀 and 𝑀 ′ are functionally equivalent, then for
every 𝑥, 𝑀 (𝑥) = 𝑀 ′ (𝑥). Hence ZEROFUNC(𝑀 ) = 1 if and only if
ZEROFUNC(𝑀 ′ ) = 1.
■
Theorem 9.15 — Rice’s Theorem. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. If 𝐹 is seman-
tic and non-trivial then it is uncomputable.
Proof Idea:
The idea behind the proof is to show that every semantic non-
trivial function 𝐹 is at least as hard to compute as HALTONZERO.
This will conclude the proof since by Theorem 9.9, HALTONZERO
is uncomputable. If a function 𝐹 is non trivial then there are two
un i ve rsa l i ty a n d u ncomp u ta bi l i ty 347
Proof of Theorem 9.15. We will not give the proof in full formality, but
rather illustrate the proof idea by restricting our attention to a particu-
lar semantic function 𝐹 . However, the same techniques generalize to
all possible semantic functions. Define MONOTONE ∶ {0, 1}∗ → {0, 1}
as follows: MONOTONE(𝑀 ) = 1 if there does not exist 𝑛 ∈ ℕ and
two inputs 𝑥, 𝑥′ ∈ {0, 1}𝑛 such that for every 𝑖 ∈ [𝑛] 𝑥𝑖 ≤ 𝑥′𝑖 but 𝑀 (𝑥)
outputs 1 and 𝑀 (𝑥′ ) = 0. That is, MONOTONE(𝑀 ) = 1 if it’s not
possible to find an input 𝑥 such that flipping some bits of 𝑥 from 0 to
1 will change 𝑀 ’s output in the other direction from 1 to 0. We will
prove that MONOTONE is uncomputable, but the proof will easily
generalize to any semantic function.
We start by noting that MONOTONE is neither the constant zero
nor the constant one function:
• The machine INF that simply goes into an infinite loop on every
input satisfies MONOTONE(INF) = 1, since INF is not defined
anywhere and so in particular there are no two inputs 𝑥, 𝑥′ where
𝑥𝑖 ≤ 𝑥′𝑖 for every 𝑖 but INF(𝑥) = 1 and INF(𝑥′) = 0.

• The machine PAR that computes the XOR or parity of its input is
not monotone (e.g., PAR(1, 0, 0, …, 0) = 1 but PAR(1, 1, 0, 0, …, 0) =
0) and hence MONOTONE(PAR) = 0.
(Note that INF and PAR are machines and not functions.)
We will now give a reduction from HALTONZERO to
MONOTONE. That is, we assume towards a contradiction that
there exists an algorithm 𝐴 that computes MONOTONE and we will
build an algorithm 𝐵 that computes HALTONZERO. Our algorithm 𝐵
will work as follows:
Algorithm 𝐵:
Input: String 𝑁 describing a Turing machine. (Goal: Compute
HALTONZERO(𝑁 ))
Assumption: Access to Algorithm 𝐴 to compute MONOTONE.
Operation:
1. Construct the following machine 𝑀 : “On input 𝑧 ∈ {0, 1}∗ do: (a)
Run 𝑁 (0), (b) Return PAR(𝑧)”.
2. Return 1 − 𝐴(𝑀 ).
R
Remark 9.16 — Semantic is not the same as uncom-
putable. Rice’s Theorem is so powerful and such a
popular way of proving uncomputability that peo-
ple sometimes get confused and think that it is the
only way to prove uncomputability. In particular, a
common misconception is that if a function 𝐹 is not
semantic then it is computable. This is not at all the
case.
For example, consider the following function
HALTNOYALE ∶ {0, 1}∗ → {0, 1}. This is a function
that on input a string that represents a NAND-TM
program 𝑃 , outputs 1 if and only if both (i) 𝑃 halts
on the input 0, and (ii) the program 𝑃 does not con-
tain a variable with the identifier Yale. The function
HALTNOYALE is not semantic, since it can assign different values to programs that compute the same function. For example, consider the following two equivalent programs:
Yale[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Yale[0])
and
Harvard[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Harvard[0])
P
Once again, this is a good point for you to stop and try
to prove the result yourself before reading the proof
below.
Proof. We have seen in Theorem 7.11 that for every Turing machine
𝑀 , there is an equivalent NAND-TM program 𝑃𝑀 such that for ev-
ery 𝑥, 𝑃𝑀 (𝑥) = 𝑀 (𝑥). In particular this means that HALT(𝑀 ) =
NANDTMHALT(𝑃𝑀 ).
The transformation 𝑀 ↦ 𝑃𝑀 that is obtained from the proof
of Theorem 7.11 is constructive. That is, the proof yields a way to
compute the map 𝑀 ↦ 𝑃𝑀. This means that this proof yields a
reduction from the task of computing HALT to the task of computing
neither is NANDTMHALT.
■
and providing proofs for their correctness is what we do all the time in
algorithms research.
The field of software verification is concerned with verifying that
given programs satisfy certain conditions. These conditions can be
that the program computes a certain function, that it never writes
into a dangerous memory location, that it respects certain invariants, and others. While the general task of verifying this may be
uncomputable, researchers have managed to do so for many interesting cases, especially if the program is written in the first place in
a formalism or programming language that makes verification easier. That said, verification, especially of large and complex programs,
remains a highly challenging task in practice as well, and the number of programs that have been formally proven correct is still quite
small. Moreover, even phrasing the right theorem to prove (i.e., the
specification) is often a highly non-trivial endeavor.
✓ Chapter Recap
9.6 EXERCISES
Exercise 9.1 — NAND-RAM Halt. Let NANDRAMHALT ∶ {0, 1}∗ → {0, 1}
be the function such that on input (𝑃, 𝑥) where 𝑃 represents a NAND-RAM program, NANDRAMHALT(𝑃, 𝑥) = 1 iff 𝑃 halts on the input 𝑥.
Prove that NANDRAMHALT is uncomputable.
■
2. 𝐻(𝑥) = 1 iff there exist two nonempty strings 𝑢, 𝑣 ∈ {0, 1}∗ such
that 𝑥 = 𝑢𝑣 (i.e., 𝑥 is the concatenation of 𝑢 and 𝑣), 𝐹 (𝑢) = 1 and
𝐺(𝑣) = 1.
3. 𝐻(𝑥) = 1 iff there exists a list 𝑢0, …, 𝑢𝑡−1 of non-empty strings such
that 𝐹(𝑢𝑖) = 1 for every 𝑖 ∈ [𝑡] and 𝑥 = 𝑢0𝑢1 ⋯ 𝑢𝑡−1.
Exercise 9.6 — Computing parity. Prove Theorem 9.13 without using Rice’s
Theorem.
■
Exercise 9.8 For each of the following two functions, say whether it is
computable or not:
3. Prove that there exists a function 𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹 is
not recursively enumerable. See footnote for hint.6

6 You can either use the diagonalization method to prove this directly or show that the set of all recursively enumerable functions is countable.
4. Prove that there exists a function 𝐹 ∶ {0, 1}∗ → {0, 1} such that
𝐹 is recursively enumerable but the function 𝐹̄ defined as 𝐹̄(𝑥) =
1 − 𝐹(𝑥) is not recursively enumerable. See footnote for hint.7

7 HALT has this property: show that if both HALT and the function 1 − HALT were recursively enumerable, then HALT would be computable.
2. Use Theorem 9.15 to prove that for every 𝐺 ∶ {0, 1}∗ → {0, 1}, if (a)
𝐺 is neither the constant zero nor the constant one function, and
(b) for every 𝑀, 𝑀′ such that 𝐿(𝑀) = 𝐿(𝑀′), 𝐺(𝑀) = 𝐺(𝑀′),
then 𝐺 is uncomputable. See footnote for hint.8

8 Show that any 𝐺 satisfying (b) must be semantic.
… they are invented on purpose to show that our ancestor’s reasoning was at
fault, and we shall never get anything more than that out of them”. Some of
this fascinating history is discussed in [Gra83; Kle91; Lüt02; Gra05].
The existence of a universal Turing machine, and the uncomputabil-
ity of HALT was first shown by Turing in his seminal paper [Tur37],
though closely related results were shown by Church a year before.
These works built on Gödel’s 1931 incompleteness theorem that we will
discuss in Chapter 11.
Some universal Turing Machines with a small alphabet and number
of states are given in [Rog96], including a single-tape universal Turing
machine with the binary alphabet and with less than 25 states; see
also the survey [WN09]. Adam Yedidia has written software to help
in producing Turing machines with a small number of states. This is
related to the recreational pastime of "Code Golfing", which is about
solving a certain computational task using as short a program
as possible. Finding "highly complex" small Turing machines is also
related to the “Busy Beaver” problem, see Exercise 9.13 and the survey
[Aar20].
The diagonalization argument used to prove uncomputability of 𝐹 ∗
is derived from Cantor’s argument for the uncountability of the reals
discussed in Chapter 2.
Christopher Strachey was an English computer scientist and the
inventor of the CPL programming language. He was also an early
artificial intelligence visionary, programming a computer to play
Checkers and even write love letters in the early 1950’s, see this New
Yorker article and this website.
Rice’s Theorem was proven in [Ric53]. It is typically stated in a
form somewhat different than what we used, see Exercise 9.11.
We do not discuss in the chapter the concept of recursively enumer-
able languages, but it is covered briefly in Exercise 9.10. As usual, we
use function, as opposed to language, notation.
The cartoon of the Halting problem in Fig. 9.1 is copyright 2019
Charles F. Cooper.
Learning Objectives:
• See that Turing completeness is not always a
good thing.
• Another example of an always-halting
formalism: context-free grammars and simply
typed 𝜆 calculus.
• The pumping lemma for non context-free
functions.
• Examples of computable and uncomputable
“Happy families are all alike; every unhappy family is unhappy in its own
way”, Leo Tolstoy (opening of the book “Anna Karenina”).
• An operation is one of +, −, ×, ÷
operation := +|-|*|/
digit := 0|1|2|3|4|5|6|7|8|9
number := digit|digit number
expression := number|expression operation expression|(expression)
The following grammar, in contrast, generates the strings of matched parentheses:

match := ""|match match|(match)

A string over the alphabet { (,) } can be generated from this grammar (where match is the starting expression and "" corresponds to the
empty string) if and only if it consists of a matching set of parentheses.
Definition 10.4 — Deriving a string from a grammar. If 𝐺 = (𝑉, 𝑅, 𝑠) is a
context-free grammar over Σ, then for two strings 𝛼, 𝛽 ∈ (Σ ∪ 𝑉)∗
we say that 𝛽 can be derived in one step from 𝛼, denoted by 𝛼 ⇒𝐺 𝛽,
if we can obtain 𝛽 from 𝛼 by applying one of the rules of 𝐺. That is,
we obtain 𝛽 by replacing in 𝛼 one occurrence of the variable 𝑣 with
the string 𝑧, where 𝑣 ⇒ 𝑧 is a rule of 𝐺.
We say that 𝛽 can be derived from 𝛼, denoted by 𝛼 ⇒∗𝐺 𝛽, if it
can be derived by some finite number 𝑘 of steps. That is, if there
are 𝛼1 , … , 𝛼𝑘−1 ∈ (Σ ∪ 𝑉 )∗ , so that 𝛼 ⇒𝐺 𝛼1 ⇒𝐺 𝛼2 ⇒𝐺 ⋯ ⇒𝐺
𝛼𝑘−1 ⇒𝐺 𝛽.
We say that 𝑥 ∈ Σ∗ is matched by 𝐺 = (𝑉 , 𝑅, 𝑠) if 𝑥 can be de-
rived from the starting variable 𝑠 (i.e., if 𝑠 ⇒∗𝐺 𝑥). We define the
function computed by (𝑉 , 𝑅, 𝑠) to be the map Φ𝑉 ,𝑅,𝑠 ∶ Σ∗ → {0, 1}
such that Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 iff 𝑥 is matched by (𝑉 , 𝑅, 𝑠). A function
𝐹 ∶ Σ∗ → {0, 1} is context free if 𝐹 = Φ𝑉 ,𝑅,𝑠 for some CFG (𝑉 , 𝑅, 𝑠).
A priori it might not be clear that the map Φ𝑉,𝑅,𝑠 is computable,
but it turns out that this is the case.1

1 As in the case of Definition 6.7 we can also use language rather than function notation and say that a language 𝐿 ⊆ Σ∗ is context free if the function 𝐹 such that 𝐹(𝑥) = 1 iff 𝑥 ∈ 𝐿 is context free.
Proof. We only sketch the proof. We start with the observation that we can
convert every CFG to an equivalent version of Chomsky normal form,
where all rules either have the form 𝑢 → 𝑣𝑤 for variables 𝑢, 𝑣, 𝑤 or the
form 𝑢 → 𝜎 for a variable 𝑢 and symbol 𝜎 ∈ Σ, plus potentially the
rule 𝑠 → "" where 𝑠 is the starting variable.
The idea behind such a transformation is to simply add new vari-
ables as needed, and so for example we can translate a rule such as
𝑣 → 𝑢𝜎𝑤 into the three rules 𝑣 → 𝑢𝑟, 𝑟 → 𝑡𝑤 and 𝑡 → 𝜎.
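To illustrate how membership can then be decided, here is a Python sketch (ours, not from the text) of a recursive procedure for grammars in Chomsky normal form; binary_rules and terminal_rules are hypothetical dictionary encodings of the rules:

from functools import lru_cache

def matches(x, start, binary_rules, terminal_rules):
    # binary_rules: variable v -> list of pairs (u,w) for rules v -> u w
    # terminal_rules: variable v -> set of symbols σ for rules v -> σ
    @lru_cache(maxsize=None)
    def derives(v, i, j):
        # does variable v derive the substring x[i:j]?
        if j - i == 1:
            return x[i] in terminal_rules.get(v, set())
        return any(derives(u, i, k) and derives(w, k, j)
                   for (u, w) in binary_rules.get(v, [])
                   for k in range(i + 1, j))
    return len(x) > 0 and derives(start, 0, len(x))

# grammar for { aⁿbⁿ } in Chomsky normal form: S -> A T | A B, T -> S B
print(matches("aabb", "S",
              {"S": [("A","T"), ("A","B")], "T": [("S","B")]},
              {"A": {"a"}, "B": {"b"}}))
# True

With memoization this runs in polynomial time; it is essentially the classical CYK algorithm written recursively.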
R
Remark 10.6 — Parse trees. While we focus on the
task of deciding whether a CFG matches a string, the
algorithm to compute Φ𝑉 ,𝑅,𝑠 actually gives more in-
formation than that. That is, on input a string 𝑥, if
Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 then the algorithm yields the sequence
of rules that one can apply from the starting vertex 𝑠
to obtain the final string 𝑥. We can think of these rules
as determining a tree with 𝑠 being the root vertex and
the sinks (or leaves) corresponding to the substrings
of 𝑥 that are obtained by the rules that do not have a
variable in their second element. This tree is known
as the parse tree of 𝑥, and often yields very useful
information about the structure of 𝑥.
Often the first step in a compiler or interpreter for a
programming language is a parser that transforms the
source into the parse tree (also known as the abstract
syntax tree). There are also tools that can automatically convert a description of a context-free grammar
into a parser algorithm that computes the parse tree of
a given string. (Indeed, the above recursive algorithm
can be used to achieve this, but there are much more
efficient versions, especially for grammars that have
particular forms, and programming language design-
ers often try to ensure their languages have these more
efficient grammars.)
Theorem 10.7 — Context free grammars and regular expressions. Let 𝑒 be a
regular expression over {0, 1}. Then there is a CFG (𝑉, 𝑅, 𝑠) over
{0, 1} such that Φ𝑉,𝑅,𝑠 = Φ𝑒.
computes it. Otherwise, we fall into one of the following cases: case
1: 𝑒 = 𝑒′𝑒″, case 2: 𝑒 = 𝑒′|𝑒″ or case 3: 𝑒 = (𝑒′)∗ where in all cases
𝑒′, 𝑒″ are shorter regular expressions. By the induction hypothesis,
we have grammars (𝑉′, 𝑅′, 𝑠′) and (𝑉″, 𝑅″, 𝑠″) that compute Φ𝑒′ and Φ𝑒″
respectively. By renaming of variables, we can also assume without
loss of generality that 𝑉′ and 𝑉″ are disjoint.
In case 1, we can define the new grammar as follows: we add a new
starting variable 𝑠 ∉ 𝑉′ ∪ 𝑉″ and the rule 𝑠 ↦ 𝑠′𝑠″. In case 2, we can
define the new grammar as follows: we add a new starting variable
𝑠 ∉ 𝑉′ ∪ 𝑉″ and the rules 𝑠 ↦ 𝑠′ and 𝑠 ↦ 𝑠″. Case 3 will be the
only one that uses recursion. As before we add a new starting variable
𝑠 ∉ 𝑉′ ∪ 𝑉″, but now add the rules 𝑠 ↦ "" (i.e., the empty string) and
also add, for every rule of the form (𝑠′, 𝛼) ∈ 𝑅′, the rule 𝑠 ↦ 𝑠𝛼 to 𝑅.
We leave it to the reader as (a very good!) exercise to verify that in
all three cases the grammars we produce capture the same function as
the original expression.
■
It turns out that CFG’s are strictly more powerful than regular
expressions. In particular, as we’ve seen, the “matching parenthesis”
function MATCHPAREN can be computed by a context free grammar,
whereas, as shown in Lemma 6.19, it cannot be computed by regular
expressions. Here is another example:
Solved Exercise 10.1 — Context free grammar for palindromes. Let PAL ∶
{0, 1, ; }∗ → {0, 1} be the function defined in Solved Exercise 6.4, where
PAL(𝑤) = 1 if and only if 𝑤 = 𝑢; 𝑢𝑅 for some 𝑢 ∈ {0, 1}∗.
Solution:
A simple grammar computing PAL can be described using
Backus–Naur notation:

start := ;|0 start 0|1 start 1
Solution:
Using Backus–Naur notation we can describe such a grammar as
follows
𝑤 = 𝛼𝑏𝑢; 𝑢𝑅 𝑏′ 𝛽 (10.1)
P
The context-free pumping lemma is even more cum-
bersome to state than its regular analog, but you can
remember it as saying the following: “If a long enough
string is matched by a grammar, there must be a variable
that is repeated in the derivation.”
Proof of Theorem 10.8. We only sketch the proof. The idea is that if
the total number of symbols in the rules of the grammar is 𝑘0 , then
the only way to get |𝑥| > 𝑛0 with Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 is to use recursion.
That is, there must be some variable 𝑣 ∈ 𝑉 such that we are able to
366 i n trod u c ti on to the ore ti ca l comp u te r sc i e nc e
derive from 𝑣 the value 𝑏𝑣𝑑 for some strings 𝑏, 𝑑 ∈ Σ∗ , and then further
on derive from 𝑣 some string 𝑐 ∈ Σ∗ such that 𝑏𝑐𝑑 is a substring of
𝑥 (in other words, 𝑥 = 𝑎𝑏𝑐𝑑𝑒 for some 𝑎, 𝑒 ∈ {0, 1}∗ ). If we take
the variable 𝑣 satisfying this requirement with a minimum number
of derivation steps, then we can ensure that |𝑏𝑐𝑑| is at most some
constant depending on 𝑛0 and we can set 𝑛1 to be that constant (𝑛1 =
10 ⋅ |𝑅| ⋅ 𝑛0 will do, since we will not need more than |𝑅| applications
of rules, and each such application can grow the string by at most 𝑛0
symbols).
Thus by the definition of the grammar, we can repeat the derivation
to replace the substring 𝑏𝑐𝑑 in 𝑥 with 𝑏ᵏ𝑐𝑑ᵏ for every 𝑘 ∈ ℕ while
retaining the property that the output of Φ𝑉,𝑅,𝑠 is still one. Since 𝑏𝑐𝑑
is a substring of 𝑥, we can write 𝑥 = 𝑎𝑏𝑐𝑑𝑒 and are guaranteed that
𝑎𝑏ᵏ𝑐𝑑ᵏ𝑒 is matched by the grammar for every 𝑘.
■
Using Theorem 10.8 one can show that even the simple function
𝐹 ∶ {0, 1}∗ → {0, 1} defined as follows:

𝐹(𝑥) = 1 iff 𝑥 = 𝑤𝑤 for some 𝑤 ∈ {0, 1}∗

is not context free. (In contrast, the function 𝐺 ∶ {0, 1}∗ → {0, 1}
defined as 𝐺(𝑥) = 1 iff 𝑥 = 𝑤0𝑤1 ⋯ 𝑤𝑛−1𝑤𝑛−1𝑤𝑛−2 ⋯ 𝑤0 for some
𝑤 ∈ {0, 1}∗ and 𝑛 = |𝑤| is context free; can you see why?)
Solved Exercise 10.3 — Equality is not context-free. Let EQ ∶ {0, 1, ; }∗ →
{0, 1} be the function such that EQ(𝑥) = 1 if and only if 𝑥 = 𝑢; 𝑢 for
some 𝑢 ∈ {0, 1}∗. Then EQ is not context free.
■
Solution:
We use the context-free pumping lemma. Suppose towards the
sake of contradiction that there is a grammar 𝐺 that computes EQ,
and let 𝑛0 be the constant obtained from Theorem 10.8.
Consider the string 𝑥 = 1𝑛0 0𝑛0 ; 1𝑛0 0𝑛0 , and write it as 𝑥 = 𝑎𝑏𝑐𝑑𝑒
as per Theorem 10.8, with |𝑏𝑐𝑑| ≤ 𝑛0 and with |𝑏| + |𝑑| ≥ 1. By The-
orem 10.8, it should hold that EQ(𝑎𝑐𝑒) = 1. However, by case anal-
ysis this can be shown to be a contradiction.
Firstly, unless 𝑏 is on the left side of the ; separator and 𝑑 is on
the right side, dropping 𝑏 and 𝑑 will definitely make the two parts
different. But if it is the case that 𝑏 is on the left side and 𝑑 is on the
right side, then by the condition that |𝑏𝑐𝑑| ≤ 𝑛0 we know that 𝑏 is a
string of only zeros and 𝑑 is a string of only ones. If we drop 𝑏 and
𝑑 then since one of them is non-empty, we get that there are either
fewer zeros on the left side than on the right side, or fewer
ones on the right side than on the left side. In either case, we get
that EQ(𝑎𝑐𝑒) = 0, obtaining the desired contradiction.
■
Theorem 10.9 — Emptiness for CFG's is decidable. There is an algorithm
that on input a context-free grammar 𝐺, outputs 1 if and only if Φ𝐺
is the constant zero function.
Proof Idea:
The proof is easier to see if we transform the grammar to Chomsky
Normal Form as in Theorem 10.5. Given a grammar 𝐺, we can recursively define a non-terminal variable 𝑣 to be non-empty if there is either
a rule of the form 𝑣 ⇒ 𝜎, or there is a rule of the form 𝑣 ⇒ 𝑢𝑤 where
both 𝑢 and 𝑤 are non-empty. Then the grammar is non-empty if and
only if the starting variable 𝑠 is non-empty.
⋆
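Here is a Python sketch (ours, not from the text) of this recursive definition, computed as a fixed point iteration; the rule encodings are hypothetical:

def cfg_is_empty(start, terminal_vars, binary_rules):
    # terminal_vars: set of variables v having some rule v -> σ
    # binary_rules: list of triples (v, u, w) for rules v -> u w
    nonempty = set(terminal_vars)
    changed = True
    while changed:        # iterate until no new variable becomes non-empty
        changed = False
        for (v, u, w) in binary_rules:
            if v not in nonempty and u in nonempty and w in nonempty:
                nonempty.add(v)
                changed = True
    return start not in nonempty

print(cfg_is_empty("S", {"A", "B"}, [("S", "A", "B")]))
# False: the starting variable S is non-empty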
Proof Idea:
We prove the theorem by reducing from the Halting problem. To
do that we use the notion of configurations of NAND-TM programs, as
defined in Definition 8.8. Recall that a configuration of a program 𝑃 is a
binary string 𝑠 that encodes all the information about the program in
the current iteration.
We define Σ to be {0, 1} plus some separator characters and define
INVALID𝑃 ∶ Σ∗ → {0, 1} to be the function that maps every string 𝐿 ∈
Σ∗ to 1 if and only if 𝐿 does not encode a sequence of configurations
that correspond to a valid halting history of the computation of 𝑃 on
the empty input.
The heart of the proof is to show that INVALID𝑃 is context-free.
Once we do that, we see that 𝑃 halts on the empty input if and only if
INVALID𝑃 (𝐿) = 1 for every 𝐿. To show that, we will encode the list
in a special way that makes it amenable to deciding via a context-free
grammar. Specifically we will reverse all the odd-numbered strings.
⋆
Proof of Theorem 10.10. We only sketch the proof. We will show that if
we can compute CFGFULL then we can solve HALTONZERO, which
has been proven uncomputable in Theorem 9.9. Let 𝑀 be an input
Turing machine; we want to determine whether 𝑀 halts on the input 0.
• A halting configuration will have the value of a certain state (which can
be easily "read off" from it) set to 1.
✓ Chapter Recap
10.5 EXERCISES
Exercise 10.1 — Closure properties of context-free functions. Suppose that
𝐹, 𝐺 ∶ {0, 1}∗ → {0, 1} are context free. For each one of the following
definitions of the function 𝐻, either prove that 𝐻 is always context
free or give a counterexample for regular 𝐹 , 𝐺 that would make 𝐻 not
context free.
Exercise 10.2 Prove that the function 𝐹 ∶ {0, 1}∗ → {0, 1} such that
𝐹 (𝑥) = 1 if and only if |𝑥| is a power of two is not context free.
■
• A statement has either the form foo = bar; where foo and bar are
variables, or the form IF (foo) BEGIN ... END where ... is a list
of one or more statements, potentially separated by newlines.
1. Let VAR ∶ {0, 1}∗ → {0, 1} be the function that given a string
𝑥 ∈ {0, 1}∗ , outputs 1 if and only if 𝑥 corresponds to an ASCII
encoding of a valid variable identifier. Prove that VAR is regular.
2. Let SYN ∶ {0, 1}∗ → {0, 1} be the function that given a string
𝑠 ∈ {0, 1}∗ , outputs 1 if and only if 𝑠 is an ASCII encoding of a valid
program in our language. Prove that SYN is context free. (You do
not have to specify the full formal grammar for SYN, but you need
to show that such a grammar exists.)
3. Prove that SYN is not regular. See footnote for hint.2

2 Try to see if you can "embed" in some way a function that looks similar to MATCHPAREN in SYN, so you can use a similar proof. Of course, for a function to be non-regular, it does not need to utilize literal parentheses symbols.

■
11
Is every theorem provable?
Theorem 11.1 — Gödel's Incompleteness Theorem: informal version. For
every sound proof system 𝑉 for sufficiently rich mathematical
statements, there is a mathematical statement that is true but is not
provable in 𝑉.
def f(n):
    if n==1: return 1
    return f(3*n+1) if n % 2 else f(n//2)
Proof Idea:
If we had such a complete and sound proof system then we could
solve the HALTONZERO problem. On input a Turing machine 𝑀 ,
we would search all purported proofs 𝑤 and halt as soon as we find
a proof of either “𝑀 halts on zero” or “𝑀 does not halt on zero”. If
the system is sound and complete then we will eventually find such a
proof, and it will provide us with the correct output.
⋆
Proof of Theorem 11.3. Assume for the sake of contradiction that there
was such a proof system 𝑉. We will use 𝑉 to build an algorithm 𝐴
that computes HALTONZERO, hence contradicting Theorem 9.9. Our
algorithm 𝐴 will work as follows:
R
Remark 11.5 — The Gödel statement (optional). One can
extract from the proof of Theorem 11.3 a procedure
that for every proof system 𝑉 , yields a true statement
𝑥∗ that cannot be proven in 𝑉 . But Gödel’s proof
gave a very explicit description of such a statement 𝑥∗
which is closely related to the “Liar’s paradox”. That
is, Gödel’s statement 𝑥∗ was designed to be true if and
only if ∀𝑤∈{0,1}∗ 𝑉(𝑥∗, 𝑤) = 0. In other words, it satisfied
the following property
R
Remark 11.7 — Syntactic sugar for quantified integer
statements. To make our statements more readable,
we often use syntactic sugar and so write 𝑥 ≠ 𝑦 as
shorthand for ¬(𝑥 = 𝑦), and so on. Similarly, the
“implication operator” 𝑎 ⇒ 𝑏 is “syntactic sugar” or
shorthand for ¬𝑎 ∨ 𝑏, and the “if and only if operator”
𝑎 ⇔ 𝑏 is shorthand for (𝑎 ⇒ 𝑏) ∧ (𝑏 ⇒ 𝑎). We will
also allow ourselves the use of “macros”: plugging in
one quantified integer statement in another, as we did
with DIVIDES and PRIME above.
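For concreteness, here is one standard way to define these two macros as quantified integer statements (a sketch consistent with the discussion above, not necessarily the exact formulas used earlier in the text):

DIVIDES(𝑎, 𝑏) ∶= ∃𝑐∈ℕ 𝑎 × 𝑐 = 𝑏

PRIME(𝑝) ∶= (𝑝 > 1) ∧ (∀𝑎∈ℕ ∀𝑏∈ℕ (𝑎 × 𝑏 = 𝑝) ⇒ ((𝑎 = 1) ∨ (𝑏 = 1)))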
Theorem 11.9 — Uncomputability of quantified integer statements. Let
QIS ∶ {0, 1}∗ → {0, 1} be the function that given a (string representation of a) quantified integer statement 𝜑, outputs 1 if and only if
𝜑 is true. Then QIS is uncomputable.
P
Please stop here and make sure you understand
why the uncomputability of QIS (i.e., Theorem 11.9)
means that there is no sound and complete proof
system for proving quantified integer statements (i.e.,
Theorem 11.8). This follows in the same way that
Theorem 11.3 followed from the uncomputability of
HALTONZERO, but working out the details is a great
exercise (see Exercise 11.1).
In the rest of this chapter, we will show the proof of Theorem 11.8,
following the outline illustrated in Fig. 11.1.
R
Remark 11.11 — Active code vs static data. The diffi-
culty in finding a way to distinguish between “code”
such as NAND-TM programs, and “static content”
such as polynomials is just another manifestation of
the phenomenon that code is the same as data. While
a fool-proof solution for distinguishing between the
two is inherently impossible, finding heuristics that do
a reasonable job keeps many firewall and anti-virus
manufacturers very busy (and finding ways to bypass
these tools keeps many hackers busy as well).
P
If you find the last sentence confusing, it is worth-
while to reread it until you are sure you follow its
logic. We are so accustomed to trying to find solu-
tions for problems that it can sometimes be hard to
follow the arguments for showing that problems are
uncomputable.
1. We will first use a reduction from the Halting problem to show that
deciding the truth of quantified mixed statements is uncomputable.
Quantified mixed statements involve both strings and integers.
Since quantified mixed statements are a more general concept than
quantified integer statements, it is easier to prove the uncomputabil-
ity of deciding their truth.
For example, the true statement that for every string 𝑎 there is a
string 𝑏 that corresponds to 𝑎 in reverse order can be phrased as the
following quantified mixed statement

∀𝑎∈{0,1}∗ ∃𝑏∈{0,1}∗ (|𝑎| = |𝑏|) ∧ (∀𝑖∈ℕ 𝑖 < |𝑎| ⇒ (𝑎𝑖 ⇔ 𝑏|𝑎|−𝑖−1)) . (11.5)
Theorem 11.13 — Uncomputability of quantified mixed statements. Let
QMS ∶ {0, 1}∗ → {0, 1} be the function that given a (string representation of a) quantified mixed statement 𝜑, outputs 1 if and only if
𝜑 is true. Then QMS is uncomputable.
Proof Idea:
The idea behind the proof is similar to that used in showing that
one-dimensional cellular automata are Turing complete (Theorem 8.7)
as well as showing that equivalence (or even “fullness”) of context
free grammars is uncomputable (Theorem 10.10). We use the notion
2. Using the above we can now write the condition that for every
substring of 𝐻 that has the form 𝛼ENC(; )𝛽 with 𝛼, 𝛽 ∈ {0, 1}ℓ
and ENC(; ) being the encoding of the separator “;”, it holds that
NEXT(𝛼, 𝛽) is true.
R
Remark 11.14 — Alternative proofs. There are sev-
eral other ways to show that QMS is uncomputable.
For example, we can express the condition that a 1-
dimensional cellular automaton eventually writes a
“1” to a given cell from a given initial configuration
as a quantified mixed statement over a string encod-
ing the history of all configurations. We can then use
the fact that cellular automata can simulate Turing machines (Theorem 8.7) to reduce the halting
problem to QMS. We can also use other well known
uncomputable problems such as tiling or the post cor-
respondence problem. Exercise 11.5 and Exercise 11.6
explore two alternative proofs of Theorem 11.13.
• 𝑛 = |𝑥|
This will mean that we can replace a “for all” quantifier over strings
such as ∀𝑥∈{0,1}∗ with a pair of quantifiers over integers of the form
∀𝑋∈ℕ ∀𝑛∈ℕ (and similarly replace an existential quantifier of the form
∃𝑥∈{0,1}∗ with a pair of quantifiers ∃𝑋∈ℕ ∃𝑛∈ℕ ) . We can then replace all
calls to |𝑥| by 𝑛 and all calls to 𝑥𝑖 by COORD(𝑋, 𝑖). This means that
if we are able to define COORD via a quantified integer statement,
then we obtain a proof of Theorem 11.9, since we can use it to map
every mixed quantified statement 𝜑 to an equivalent quantified inte-
ger statement 𝜉 such that 𝜉 is true if and only if 𝜑 is true, and hence
QMS(𝜑) = QIS(𝜉). Such a procedure implies that the task of comput-
ing QMS reduces to the task of computing QIS, which means that the
uncomputability of QMS implies the uncomputability of QIS.
The above shows that the proof of Theorem 11.9 boils down to finding the right encoding of strings as integers, and the right way to
implement COORD as a quantified integer statement. To achieve this
we use the following technical result:
Lemma 11.15 — Constructible prime sequence. There is a sequence of prime
numbers 𝑝0 < 𝑝1 < 𝑝2 < ⋯ such that there is a quantified integer
statement PSEQ(𝑝, 𝑖) that is true if and only if 𝑝 = 𝑝𝑖.
Using Lemma 11.15 we can encode a string 𝑥 ∈ {0, 1}∗ by the numbers
(𝑋, 𝑛) where 𝑋 = ∏𝑖∶𝑥𝑖=1 𝑝𝑖 and 𝑛 = |𝑥|. We can then define the
statement COORD(𝑋, 𝑖) as

∃𝑝∈ℕ PSEQ(𝑝, 𝑖) ∧ DIVIDES(𝑝, 𝑋)

which is true if and only if 𝑝𝑖 divides 𝑋, i.e., if and only if 𝑥𝑖 = 1.
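A small Python sketch (ours, not from the text) of this encoding, using sympy's prime(k) (which returns the k-th prime, 1-indexed) for the prime sequence:

from sympy import prime

def encode(x):
    # X is the product of p_i over the coordinates with x_i = 1
    X = 1
    for i, bit in enumerate(x):
        if bit == 1:
            X *= prime(i + 1)
    return (X, len(x))

def coord(X, i):
    return 1 if X % prime(i + 1) == 0 else 0   # p_i divides X iff x_i = 1

X, n = encode([1, 0, 1, 1])
print([coord(X, i) for i in range(n)])
# [1, 0, 1, 1]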
✓ Chapter Recap
11.6 EXERCISES
Exercise 11.1 — Gödel’s Theorem from uncomputability of 𝑄𝐼𝑆 . Prove Theo-
rem 11.8 using Theorem 11.9.
■
Exercise 11.2 — Proof systems and uncomputability. Let FINDPROOF ∶
{0, 1}∗ → {0, 1} be the following function. On input a Turing machine
(We can think of each pair (𝛼, 𝛽) ∈ 𝑆 as a “domino tile” and the ques-
tion is whether we can stack a list of such tiles so that the top and the
bottom yield the same string.) It can be shown that the PCP is uncom-
putable by a fairly straightforward though somewhat tedious proof
(see for example the Wikipedia page for the Post Correspondence
Problem or Section 5.2 in [Sip97]).
Use this fact to provide a direct proof that QMS is uncomputable by
showing that there exists a computable map 𝑅 ∶ {0, 1}∗ → {0, 1}∗ such
that PCP(𝑆) = QMS(𝑅(𝑆)) for every string 𝑆 encoding an instance of
the post correspondence problem.
■
(that is, TOWER(𝑛) is a "tower of powers of two" of height 𝑛). To get a sense of how fast this function
grows, TOWER(1) = 2, TOWER(2) = 2² = 4, TOWER(3) = 2^(2²) = 16.
"For practical purposes, the difference between algebraic and exponential order
is often more crucial than the difference between finite and non-finite.", Jack
Edmonds, "Paths, Trees, and Flowers", 1963
“What is the most efficient way to sort a million 32-bit integers?”, Eric
Schmidt to Barack Obama, 2008
“I think the bubble sort would be the wrong way to go.”, Barack Obama.
• “Is there a function that can be computed in 𝑂(𝑛2 ) time but not in
𝑂(𝑛) time?”
• “Are there natural problems for which the best algorithm (and not
just the best known) requires 2Ω(𝑛) time?”
While the difference between 𝑂(𝑛) and 𝑂(𝑛2 ) time can be crucial in
practice, in this book we focus on the even bigger difference between
polynomial and exponential running time. As we will see, the difference
between polynomial versus exponential time is typically insensitive to
the choice of the particular computational model, a polynomial-time
algorithm is still polynomial whether you use Turing machines, RAM
machines, or parallel cluster as your model of computation, and sim-
ilarly an exponential-time algorithm will remain exponential in all of
these platforms. One of the interesting phenomena of computing is
that there is often a kind of a “threshold phenomenon” or “zero-one
law” for running time. Many natural problems can either be solved
in polynomial running time with a not-too-large exponent (e.g., some-
thing like 𝑂(𝑛2 ) or 𝑂(𝑛3 )), or require exponential (e.g., at least 2Ω(𝑛)
√
or 2Ω( 𝑛) ) time to solve. The reasons for this phenomenon are still not
fully understood, but some light on it is shed by the concept of NP
completeness, which we will see in Chapter 15.
This chapter is merely a tiny sample of the landscape of computa-
tional problems and efficient algorithms. If you want to explore the
field of algorithms and data structures more deeply (which I very
much hope you do!), the bibliographical notes contain references to
some excellent texts, some of which are available freely on the web.
Remark 12.1 — Relations between parts of this book.
Part I of this book contained a quantitative study of
computation of finite functions. We asked what are
the resources (in terms of gates of Boolean circuits or
lines in straight-line programs) required to compute
various finite functions.
Part II of the book contained a qualitative study of
computation of infinite functions (i.e., functions of
unbounded input length). In that part we asked the
qualitative question of whether or not a function is com-
putable at all, regardless of the number of operations.
Part III of the book, beginning with this chapter, merges the two approaches and contains a quantitative study of computation of infinite functions. In this part we ask how the resources required to compute a function scale with the length of the input. In Chapter 13 we define the notion of running time, and the class P of functions that can be computed using a number of steps that scales polynomially with the input length. In Section 13.6 we will relate this class to the models of Boolean circuits and straight-line programs that we studied in Part I.
Remark 12.3 — On data structures. If you’ve ever taken
an algorithms course, you have probably encountered
many data structures such as lists, arrays, queues,
stacks, heaps, search trees, hash tables and many
more. Data structures are extremely important in com-
puter science, and each one of those offers different
tradeoffs between overhead in storage, operations
supported, cost in time for each operation, and more.
For example, if we store 𝑛 items in a list, we will need a linear (i.e., 𝑂(𝑛) time) scan to retrieve an element, while we can achieve the same operation in 𝑂(1) time if we use a hash table. However, when we only care about polynomial-time algorithms, such factors of 𝑂(𝑛) in the running time will not make much difference.
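As a quick illustration of this tradeoff (a sketch of ours, not from the text): in Python, membership in a list requires a linear scan, while membership in a hash-based set takes expected constant time.

import timeit

n = 10**6
as_list = list(range(n))
as_set = set(as_list)

# Searching for an absent element forces a full O(n) scan of the list,
# while the hash table answers in O(1) expected time.
print(timeit.timeit(lambda: -1 in as_list, number=10))  # roughly seconds
print(timeit.timeit(lambda: -1 in as_set, number=10))   # microseconds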
∑_{𝑒∋𝑠} 𝑥𝑒 + ∑_{𝑒∋𝑡} 𝑥𝑒 = 0
−1 ≤ 𝑥𝑒 ≤ 1   for all 𝑒 ∈ 𝐸
where for every vertex 𝑣, summing over 𝑒 ∋ 𝑣 means summing over all
the edges that touch 𝑣.
The maximum flow problem can be thought of as the task of maximizing ∑_{𝑒∋𝑠} 𝑥𝑒 over all the vectors 𝑥 ∈ ℝ^𝑚 that satisfy the above conditions (12.1). Maximizing a linear function ℓ(𝑥) over the set of 𝑥 ∈ ℝ^𝑚 that satisfy certain linear equalities and inequalities is known
as linear programming. Luckily, there are polynomial-time algorithms
for solving linear programming, and hence we can solve the maxi-
mum flow (and so, equivalently, minimum cut) problem in polyno-
mial time. In fact, there are much better algorithms for maximum-flow/minimum-cut, even for weighted directed graphs, with the record currently standing at 𝑂(min{𝑚^{10/7}, 𝑚√𝑛}) time.
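To make the linear-programming connection concrete, the following sketch (ours; for simplicity it uses a tiny directed variant with nonnegative flows rather than the −1 ≤ 𝑥𝑒 ≤ 1 formulation above) computes a maximum flow by handing the LP to scipy's generic solver:

# Max flow on edges s->a, a->t, s->t (unit capacities) as a linear program.
# Variables: x = (x_sa, x_at, x_st).
from scipy.optimize import linprog

# maximize x_sa + x_st (flow out of s), i.e. minimize -(x_sa + x_st)
c = [-1, 0, -1]
# flow conservation at the internal vertex a: x_sa - x_at = 0
A_eq = [[1, -1, 0]]
b_eq = [0]
bounds = [(0, 1), (0, 1), (0, 1)]  # 0 <= x_e <= capacity(e) = 1

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(-res.fun)  # maximum flow value: 2.0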
Solved Exercise 12.1 — Global minimum cut. Given a graph 𝐺 = (𝑉, 𝐸), define the global minimum cut of 𝐺 to be the minimum over all 𝑆 ⊆ 𝑉
with 𝑆 ≠ ∅ and 𝑆 ≠ 𝑉 of the number of edges cut by 𝑆. Prove that
there is a polynomial-time algorithm to compute the global minimum
cut of a graph.
■
Solution:
By the above we know that there is a polynomial-time algorithm 𝐴 that on input (𝐺, 𝑠, 𝑡) finds the minimum 𝑠, 𝑡 cut in the graph 𝐺. To find the global minimum cut we can fix an arbitrary vertex 𝑠 and run 𝐴 on (𝐺, 𝑠, 𝑡) for every other vertex 𝑡, returning the smallest of the cuts found: every global minimum cut must separate 𝑠 from some other vertex 𝑡.
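Here is a sketch of this algorithm in Python (ours, using the networkx library's min-cut routine as a stand-in for the algorithm 𝐴):

import networkx as nx

def global_min_cut_value(G):
    """Global min cut of an undirected graph via n-1 min s-t cut calls:
    fix s arbitrarily; any global cut separates s from some other t."""
    D = nx.DiGraph()  # model each undirected edge as two unit-capacity arcs
    for u, v in G.edges:
        D.add_edge(u, v, capacity=1)
        D.add_edge(v, u, capacity=1)
    nodes = list(G.nodes)
    s = nodes[0]
    return min(nx.minimum_cut_value(D, s, t) for t in nodes[1:])

print(global_min_cut_value(nx.cycle_graph(5)))  # a cycle's global min cut is 2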
where 𝐿 is some loss function measuring how far the predicted label ℎ(𝑥𝑖) is from the true label 𝑦𝑖. When 𝐿 is the square loss function 𝐿(𝑦, 𝑦′) = (𝑦 − 𝑦′)² and ℎ is a linear function, empirical risk minimization corresponds to the well-known convex minimization task of linear regression. In other cases, when the task is non convex, there can be many global or local minima. That said, even if we don’t find the global (or even a local) minimum, this continuous embedding can still help us when running a local improvement algorithm such as gradient descent.
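For concreteness, here is a minimal sketch (ours) of empirical risk minimization with the square loss and a linear hypothesis ℎ, using the simplest local improvement algorithm, gradient descent:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.1                                    # learning rate
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the mean square loss
    w -= eta * grad
print(w)  # close to w_true: the square loss is convex, so descent converges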
12.2.1 SAT
A propositional formula 𝜑 involves 𝑛 variables 𝑥1, … , 𝑥𝑛 and the logical operators AND (∧), OR (∨), and NOT (¬, also denoted by an overline). We say that such a formula is in conjunctive normal form (CNF for short) if it is an AND of ORs of variables or their negations (we call a term of the form 𝑥𝑖 or ¬𝑥𝑖 a literal). For example, (𝑥0 ∨ ¬𝑥1 ∨ 𝑥2) ∧ (¬𝑥0 ∨ 𝑥2 ∨ 𝑥3) ∧ (𝑥1 ∨ ¬𝑥2 ∨ ¬𝑥3) is a CNF formula.
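To make the definition concrete, here is a small sketch (ours) that represents a CNF formula as a list of clauses, each clause being a list of (variable, sign) literals, together with a brute-force satisfiability check:

from itertools import product

def evaluate(cnf, x):
    """Evaluate a CNF: each clause is a list of (i, sign) literals,
    where sign=True stands for x_i and sign=False for ¬x_i."""
    return all(any(x[i] == sign for (i, sign) in clause) for clause in cnf)

def brute_force_sat(cnf, n):
    """Check satisfiability by trying all 2^n assignments."""
    return any(evaluate(cnf, x) for x in product([False, True], repeat=n))

# (x_0 ∨ ¬x_1 ∨ x_2) ∧ (¬x_0 ∨ x_2)
cnf = [[(0, True), (1, False), (2, True)], [(0, False), (2, True)]]
print(brute_force_sat(cnf, 3))  # True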
Remark 12.4 — Bit complexity of numbers. Whenever we
discuss problems whose inputs correspond to num-
bers, the input length corresponds to how many bits
are needed to describe the number (or, as is equivalent up to a constant factor, the number of digits in its decimal representation).
where 𝑆𝑛 is the set of all permutations from [𝑛] to [𝑛] and the sign of
a permutation 𝜋 is equal to −1 raised to the power of the number of
inversions in 𝜋 (pairs 𝑖, 𝑗 such that 𝑖 > 𝑗 but 𝜋(𝑖) < 𝜋(𝑗)).
This definition suggests that computing det(𝐴) might require
summing over |𝑆𝑛 | terms which would take exponential time since
|𝑆𝑛| = 𝑛! > 2^𝑛. However, there are other ways to compute the de-
terminant. For example, it is known that det is the only function that
satisfies the following conditions:
✓ Chapter Recap
12.5 EXERCISES
Exercise 12.1 — Exponential time algorithm for longest path. The naive algorithm for computing the longest path in a given graph could take more than 𝑛! steps. Give a 𝑝𝑜𝑙𝑦(𝑛) ⋅ 2^𝑛 time algorithm for the longest path problem in 𝑛-vertex graphs.²
² Hint: Use dynamic programming to compute for every 𝑠, 𝑡 ∈ [𝑛] and 𝑆 ⊆ [𝑛] the value 𝑃(𝑠, 𝑡, 𝑆) which equals 1 if there is a simple path from 𝑠 to 𝑡 that uses exactly the vertices in 𝑆. Do this iteratively for 𝑆’s of growing sizes.
■
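Here is a sketch of the dynamic program from the hint (ours; it takes an adjacency matrix, encodes the sets 𝑆 as bitmasks, and, since only the length matters, tracks just the endpoint 𝑡):

def longest_path(adj):
    """Length (in edges) of the longest simple path in a graph given by an
    adjacency matrix, in poly(n)*2^n time: P[S][t] is True iff some simple
    path uses exactly the vertices in the bitmask S and ends at t."""
    n = len(adj)
    P = [[False] * n for _ in range(1 << n)]
    for t in range(n):
        P[1 << t][t] = True                    # single-vertex paths
    best = 0
    for S in range(1 << n):                    # subsets before their supersets
        for t in range(n):
            if not P[S][t]:
                continue
            best = max(best, bin(S).count("1") - 1)
            for u in range(n):                 # try to extend the path to u
                if adj[t][u] and not (S >> u) & 1:
                    P[S | (1 << u)][u] = True
    return best

# A 4-cycle: the longest simple path has 3 edges.
C4 = [[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]]
print(longest_path(C4))  # 3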
Exercise 12.2 — 2SAT algorithm. For every 2CNF 𝜑, define the graph 𝐺𝜑 on 2𝑛 vertices corresponding to the literals 𝑥1, … , 𝑥𝑛, ¬𝑥1, … , ¬𝑥𝑛, such that there is a directed edge from ℓ𝑖 to ℓ𝑗 iff the constraint ¬ℓ𝑖 ∨ ℓ𝑗 is in 𝜑. Prove that 𝜑 is unsatisfiable if and only if there is some 𝑖 such that there is a path from 𝑥𝑖 to ¬𝑥𝑖 and from ¬𝑥𝑖 to 𝑥𝑖 in 𝐺𝜑. Show how to use this to solve 2SAT in polynomial time.
■
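Here is a sketch (ours) of the resulting polynomial-time algorithm, using a simple DFS reachability check on the implication graph:

def two_sat(n, clauses):
    """Decide satisfiability of a 2CNF with variables 0..n-1.
    Each clause is a pair of literals; a literal (i, True) means x_i.
    Clause (l1 ∨ l2) yields implications ¬l1 -> l2 and ¬l2 -> l1."""
    def node(lit):
        i, positive = lit
        return 2 * i + (1 if positive else 0)
    def neg(lit):
        return (lit[0], not lit[1])

    graph = {v: set() for v in range(2 * n)}
    for l1, l2 in clauses:
        graph[node(neg(l1))].add(node(l2))
        graph[node(neg(l2))].add(node(l1))

    def reachable(a, b):                       # simple DFS reachability
        stack, seen = [a], {a}
        while stack:
            v = stack.pop()
            if v == b:
                return True
            for w in graph[v] - seen:
                seen.add(w)
                stack.append(w)
        return False

    # unsatisfiable iff some x_i reaches ¬x_i and ¬x_i reaches x_i
    return not any(reachable(2*i+1, 2*i) and reachable(2*i, 2*i+1)
                   for i in range(n))

# (x_0 ∨ x_1) ∧ (¬x_0 ∨ x_1) ∧ (¬x_1 ∨ x_0) ∧ (¬x_0 ∨ ¬x_1): unsatisfiable
print(two_sat(2, [((0,True),(1,True)), ((0,False),(1,True)),
                  ((1,False),(0,True)), ((0,False),(1,False))]))  # False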
13
Modeling running time
compute in 𝑂(𝑛^𝑘) time.
• The class P/poly of non uniform computation and the result that P ⊆ P/poly.
Max Newman: It is all very well to say that a machine could … do this or
that, but … what about the time it would take to do it?
Alan Turing: To my mind this time factor is the one question which will
involve all the real technical difficulty.
BBC radio panel on “Can Automatic Calculating Machines Be Said to Think?”, 1952
Definition 13.1 — Running time (Turing Machines). Let 𝑇 ∶ ℕ → ℕ be some function mapping natural numbers to natural numbers. We say
that a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is computable in 𝑇 (𝑛) Turing-
Machine time (TM-time for short) if there exists a Turing Machine 𝑀
such that for every sufficiently large 𝑛 and every 𝑥 ∈ {0, 1}𝑛 , when
given input 𝑥, the machine 𝑀 halts after executing at most 𝑇 (𝑛)
steps and outputs 𝐹 (𝑥).
We define TIMETM (𝑇 (𝑛)) to be the set of Boolean functions
(functions mapping {0, 1}∗ to {0, 1}) that are computable in 𝑇 (𝑛)
TM time.
Definition 13.1 is not very complicated but is one of
the most important definitions of this book. As usual,
TIMETM (𝑇 (𝑛)) is a class of functions, not of machines. If
𝑀 is a Turing Machine then a statement such as “𝑀
is a member of TIMETM (𝑛2 )” does not make sense.
The concept of TM-time as defined here is sometimes
known as “single-tape Turing machine time” in the
literature, since some texts consider Turing machines
with more than one working tape.
Solution:
The proof is illustrated in Fig. 13.2. Suppose that 𝐹 ∈ TIMETM(10 ⋅ 𝑛³) and hence there is some number 𝑁0 and a machine 𝑀 such that for every 𝑛 > 𝑁0 and 𝑥 ∈ {0, 1}^𝑛, 𝑀(𝑥) outputs 𝐹(𝑥) within at most 10 ⋅ 𝑛³ steps. Since 10 ⋅ 𝑛³ = 𝑜(2^𝑛), there is some number 𝑁1 such that for every 𝑛 > 𝑁1, 10 ⋅ 𝑛³ < 2^𝑛. Hence for every 𝑛 > max{𝑁0, 𝑁1}, 𝑀(𝑥) will output 𝐹(𝑥) within at most 2^𝑛 steps, demonstrating that 𝐹 ∈ TIMETM(2^𝑛).
■
Figure 13.2: Comparing 𝑇(𝑛) = 10𝑛³ with 𝑇′(𝑛) = 2^𝑛 (on the right figure the Y axis is in log scale); for every large enough 𝑛, 𝑇′(𝑛) ≥ 𝑇(𝑛).
Please take the time to make sure you understand
these definitions. In particular, sometimes students
think of the class EXP as corresponding to functions
that are not in P. However, this is not the case. If 𝐹 is
in EXP then it can be computed in exponential time.
This does not mean that it cannot be computed in
polynomial time as well.
Solution:
To show these two sets are equal we need to show that P ⊆ ∪_{𝑐∈{1,2,3,…}} TIMETM(𝑛^𝑐) and ∪_{𝑐∈{1,2,3,…}} TIMETM(𝑛^𝑐) ⊆ P. We start
Table: A table of the examples from Chapter 12. All these problems
are in EXP but only the ones on the left column are currently known to
be in P as well (i.e., they have a polynomial-time algorithm). See also
Fig. 13.3.
Remark 13.3 — Boolean versions of problems. Many of the problems defined in Chapter 12 correspond to non Boolean functions (functions with more than one bit of output) while P and EXP are sets of Boolean functions. However, for every non-Boolean function 𝐹 we can always define a computationally-equivalent Boolean function 𝐺 by letting 𝐺(𝑥, 𝑖) be the 𝑖-th bit of 𝐹(𝑥) (see Exercise 13.3). Hence the table above, as well as Fig. 13.3, refer to the computationally-equivalent Boolean variants of these problems.
Figure 13.3: Some examples of problems that are known to be in P and problems that are known to be in EXP but not known whether or not they are in P. Since both P and EXP are classes of Boolean functions, in this figure we always refer to the Boolean (i.e., Yes/No) variant of the problems.
13.2 MODELING RUNNING TIME USING RAM MACHINES / NAND-
RAM
Turing Machines are a clean theoretical model of computation, but
do not closely correspond to real-world computing architectures. The
discrepancy between Turing Machines and actual computers does
not matter much when we consider the question of which functions
Theorem 13.5 — Relating RAM and Turing machines. Let 𝑇 ∶ ℕ → ℕ be a function such that 𝑇(𝑛) ≥ 𝑛 for every 𝑛 and the map 𝑛 ↦ 𝑇(𝑛) can be computed by a Turing machine in time 𝑂(𝑇(𝑛)). Then
TIMETM(𝑇(𝑛)) ⊆ TIMERAM(10 ⋅ 𝑇(𝑛)) ⊆ TIMETM(𝑇(𝑛)⁴).   (13.1)
The technical details of Theorem 13.5, such as the con-
dition that 𝑛 ↦ 𝑇 (𝑛) is computable in 𝑂(𝑇 (𝑛)) time
or the constants 10 and 4 in (13.1) (which are not tight
and can be improved), are not very important. In par-
ticular, all non pathological time bound functions we
encounter in practice such as 𝑇 (𝑛) = 𝑛, 𝑇 (𝑛) = 𝑛 log 𝑛,
𝑇 (𝑛) = 2𝑛 etc. will satisfy the conditions of Theo-
rem 13.5, see also Remark 13.6.
That is, we could have equally well defined P as the class of functions computable by NAND-RAM programs (instead of Turing Machines) that run in time polynomial in the length of the input. Similarly, by instantiating Theorem 13.5 with 𝑇(𝑛) = 2^{𝑛^𝑎} we see that the class EXP can also be defined as the set of functions computable by NAND-RAM programs in time at most 2^{𝑝(𝑛)} where 𝑝 is some polynomial. Similar equivalence results are known for many models including cellular automata, C/Python/Javascript programs, parallel computers, and a great many other models, which justifies the choice of P as capturing a technology-independent notion of tractability. (See Section 13.3 for more discussion of this issue.) This equivalence between Turing machines and NAND-RAM (as well as other models) allows us to pick our favorite model depending on the task at hand (i.e., “have our cake and eat it too”) even when we study questions of efficiency, as long as we only care about the gap between polynomial and exponential time. When we want to design an algorithm, we can use the extra power and convenience afforded by NAND-RAM. When we want to analyze a program or prove a negative result, we can restrict our attention to Turing machines.
Figure 13.4: The proof of Theorem 13.5 shows that we can simulate 𝑇 steps of a Turing Machine with 𝑇 steps of a NAND-RAM program, and can simulate 𝑇 steps of a NAND-RAM program with 𝑜(𝑇⁴) steps of a Turing Machine. Hence TIMETM(𝑇(𝑛)) ⊆ TIMERAM(10 ⋅ 𝑇(𝑛)) ⊆ TIMETM(𝑇(𝑛)⁴).
The total cost for each such operation is 𝑂(𝑇(𝑛)² + 𝑇(𝑛)𝑝𝑜𝑙𝑦(log 𝑇(𝑛))) = 𝑂(𝑇(𝑛)²) steps.
In sum, we simulate a single step of NAND-RAM using 𝑂(𝑇(𝑛)² 𝑝𝑜𝑙𝑦(log 𝑇(𝑛))) steps of NAND-TM, and hence the total simulation time is 𝑂(𝑇(𝑛)³ 𝑝𝑜𝑙𝑦(log 𝑇(𝑛))) which is smaller than 𝑇(𝑛)⁴ for sufficiently large 𝑛.
■
Remark 13.6 — Nice time bounds. When considering general time bounds we need to make sure to rule out some “pathological” cases, such as functions 𝑇 that don’t give enough time for the algorithm to read the input, or functions where the time bound itself is uncomputable. We say that a function 𝑇 ∶ ℕ → ℕ is a nice time bound function (or nice function for short) if for every 𝑛 ∈ ℕ, 𝑇(𝑛) ≥ 𝑛 (i.e., 𝑇 allows enough time to read the input), for every 𝑛′ ≥ 𝑛, 𝑇(𝑛′) ≥ 𝑇(𝑛) (i.e., 𝑇 allows more time on longer inputs), and the map 𝐹(𝑥) = 1^{𝑇(|𝑥|)} (i.e., mapping a string of length 𝑛 to a sequence of 𝑇(𝑛) ones) can be computed by a NAND-RAM program in 𝑂(𝑇(𝑛)) time.
All the “normal” time complexity bounds we encounter in applications such as 𝑇(𝑛) = 100𝑛, 𝑇(𝑛) = 𝑛² log 𝑛, 𝑇(𝑛) = 2^{√𝑛}, etc. are “nice”. Hence from now on we will only care about the class TIME(𝑇(𝑛)) when 𝑇 is a “nice” function. The computability condition is in particular typically easily satisfied. For example, for arithmetic functions such as 𝑇(𝑛) = 𝑛³, we can typically compute the binary representation of 𝑇(𝑛) in time polynomial in the number of bits of 𝑇(𝑛) and hence poly-logarithmic in 𝑇(𝑛). Hence the time to write the string 1^{𝑇(𝑛)} in such cases will be 𝑇(𝑛) + 𝑝𝑜𝑙𝑦(log 𝑇(𝑛)) = 𝑂(𝑇(𝑛)).
• Cellular automata
• Parallel computers
The Extended Church-Turing Thesis is the statement that this is true for all physically realizable computing models. In other words, the Extended Church-Turing Thesis says that for every scalable computing
device 𝐶 (which has a finite description but can be in principle used
to run computation on arbitrarily large inputs), there is some con-
stant 𝑎 such that for every function 𝐹 ∶ {0, 1}∗ → {0, 1} that 𝐶 can
As in the case of Theorem 13.5, the proof of Theo-
rem 13.7 is not very deep and so it is more important
to understand its statement. Specifically, if you under-
stand how you would go about writing an interpreter
for NAND-RAM using a modern programming lan-
guage such as Python, then you know everything you
need to know about the proof of this theorem.
Theorem 13.8 — Timed Universal Turing Machine. Let TIMEDEVAL ∶ {0, 1}∗ → {0, 1}∗ be the function that on input (a string representing) a triple (𝑀, 𝑥, 1^𝑇) outputs 𝑀(𝑥) if 𝑀 halts on 𝑥 within at most 𝑇 steps, and 0 otherwise. Then TIMEDEVAL ∈ P.
Proof. We only sketch the proof since the result follows fairly directly from Theorem 13.5 and Theorem 13.7. By Theorem 13.5, to show that TIMEDEVAL ∈ P, it suffices to give a polynomial-time NAND-RAM program to compute TIMEDEVAL. Such a program can be obtained as follows. Given a Turing Machine 𝑀, by Theorem 13.5 we can transform it in time polynomial in its description into a functionally-equivalent NAND-RAM program 𝑃 such that the execution of 𝑀 on 𝑇 steps can be simulated by the execution of 𝑃 on 𝑐 ⋅ 𝑇 steps. We can then run the universal NAND-RAM program of Theorem 13.7 to simulate 𝑃 for 𝑐 ⋅ 𝑇 steps in time polynomial in 𝑇.
Figure 13.6: The timed universal Turing Machine takes as input a Turing machine 𝑀, an input 𝑥, and a time bound 𝑇, and outputs 𝑀(𝑥) if 𝑀 halts within at most 𝑇 steps. Theorem 13.8 states that there is such a machine that runs in time polynomial in 𝑇.
There is nothing special about log 𝑛, and we could have used any
other efficiently computable function that tends to infinity with 𝑛.
Remark 13.10 — Simpler corollary of the time hierarchy theorem. The generality of the time hierarchy theorem can make its proof a little hard to read. It might be easier to follow the proof if you first try to prove by yourself the easier statement P ⊊ EXP.
You can do so by showing that the following function 𝐹 ∶ {0, 1}∗ → {0, 1} is in EXP ⧵ P: for every Turing Machine 𝑀 and input 𝑥, 𝐹(𝑀, 𝑥) = 1 if and only if 𝑀 halts on 𝑥 within at most |𝑥|^{log |𝑥|} steps. One can show that 𝐹 ∈ TIME(𝑛^{𝑂(log 𝑛)}) ⊆ EXP using the universal Turing machine (or the efficient universal NAND-RAM program of Theorem 13.7). On the other hand, we can use similar ideas to those used to show the uncomputability of HALT in Section 9.3.2 to prove that 𝐹 ∉ P.
Proof Idea:
In the proof of Theorem 9.6 (the uncomputability of the Halting
problem), we have shown that the function HALT cannot be com-
puted in any finite time. An examination of the proof shows that it
gives something stronger. Namely, the proof shows that if we fix our computational budget to be 𝑇 steps, then not only can we not distinguish between programs that halt and those that do not, but we cannot even distinguish between programs that halt within at most 𝑇′ steps and those that take more than that (where 𝑇′ is some number depending on 𝑇). Therefore, the proof of Theorem 13.9 follows the ideas of the
on 𝑇 ). Therefore, the proof of Theorem 13.9 follows the ideas of the
Proof of Theorem 13.9. Our proof is inspired by the proof of the un-
computability of the halting problem. Specifically, for every function
𝑇 as in the theorem’s statement, we define the Bounded Halting func-
tion HALT𝑇 as follows. The input to HALT𝑇 is a pair (𝑃 , 𝑥) such that
|𝑃 | ≤ log log |𝑥| encodes some NAND-RAM program. We define
Solution:
The time hierarchy theorem tells us that there are functions we can compute in 𝑂(𝑛²) time but not 𝑂(𝑛), in 2^𝑛 time but not 2^{√𝑛}, etc. In particular there are most definitely functions that we can compute in time 2^𝑛 but not 𝑂(𝑛). We have seen that we have no shortage of natu-
ral functions for which the best known algorithm requires roughly 2𝑛
time, and that many people have invested significant effort in trying
to improve that. However, unlike in the finite vs. infinite case, for all
of the examples above at the moment we do not know how to rule
out even an 𝑂(𝑛) time algorithm. We will however see that there is a
single unproven conjecture that would imply such a result for most of
these problems.
The time hierarchy theorem relies on the existence of an efficient
universal NAND-RAM program, as proven in Theorem 13.7. For
other models such as Turing Machines we have similar time hierarchy
results showing that there are functions computable in time 𝑇 (𝑛) and
not in time 𝑇 (𝑛)/𝑓(𝑛) where 𝑓(𝑛) corresponds to the overhead in the
corresponding universal machine.
13.6 NON UNIFORM COMPUTATION
Figure 13.8: Some complexity classes and some of the functions we know (or conjecture) to be contained in them.
We have now seen two measures of “computation cost” for functions.
In Section 4.6 we defined the complexity of computing finite functions
using circuits / straightline programs. Specifically, for a finite function
𝑔 ∶ {0, 1}𝑛 → {0, 1} and number 𝑇 ∈ ℕ, 𝑔 ∈ SIZE(𝑇 ) if there is a circuit
of at most 𝑇 NAND gates (or equivalently a 𝑇 -line NAND-CIRC
program) that computes 𝑔. To relate this to the classes TIME(𝑇 (𝑛))
defined in this chapter we first need to extend the class SIZE(𝑇 (𝑛))
from finite functions to functions with unbounded input length.
The non uniform analog to the class P is the class P/poly, defined as the class of functions 𝐹 ∶ {0, 1}∗ → {0, 1} for which there is some polynomial 𝑝 such that 𝐹↾𝑛 ∈ SIZE(𝑝(𝑛)) for every 𝑛 ∈ ℕ. A key step in relating P to P/poly is “unrolling the loop”: for example, the Python loop below can be replaced by the straight-line program that follows it.
for i in range(4):
print(i)
print(0)
print(1)
print(2)
print(3)
To make this idea into an actual proof we need to tackle one tech-
nical difficulty, and this is to ensure that the NAND-TM program is
oblivious in the sense that the value of the index variable i in the 𝑗-th
iteration of the loop will depend only on 𝑗 and not on the contents of
the input. We make a digression to do just that in Section 13.6.1 and
then complete the proof of Theorem 13.12.
⋆
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
one = NAND(X[0],temp_0)
zero = NAND(one,one)
temp_2 = NAND(X[0],zero)
temp_3 = NAND(X[0],temp_2)
temp_4 = NAND(zero,temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[1],Y[0])
temp_3 = NAND(X[1],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[2],Y[0])
temp_3 = NAND(X[2],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
Key to this transformation was the fact that in our original NAND-
TM program for XOR, regardless of whether the input is 011, 100, or
any other string, the index variable i is guaranteed to equal 0 in the
first iteration, 1 in the second iteration, 2 in the third iteration, and so
on and so forth. The particular sequence 0, 1, 2, … is immaterial: the
crucial property is that the NAND-TM program for XOR is oblivious
in the sense that the value of the index i in the 𝑗-th iteration depends
only on 𝑗 and does not depend on the particular choice of the input.
Figure 13.10: A NAND circuit for XOR₃ obtained by “unrolling the loop” of the NAND-TM program for computing XOR three times.
Luckily, it is possible to transform every NAND-TM program into a functionally equivalent oblivious program with at most quadratic overhead. (Similarly we can transform any Turing machine into a functionally equivalent oblivious Turing machine, see Exercise 13.6.)
Proof Idea:
We can translate any NAND-TM program 𝑃 ′ into an oblivious
program 𝑃 by making 𝑃 “sweep” its arrays. That is, the index i in
𝑃 will always move all the way from position 0 to position 𝑇 (𝑛) − 1
and back again. We can then simulate the program 𝑃 ′ with at most
𝑇(𝑛) overhead: if 𝑃′ wants to move i left when we are in a rightward sweep, then we simply wait at most 2𝑇(𝑛) steps until the next time we are back in the same position while sweeping to the left.
⋆
In the worst case this will take 2𝑇(𝑛) steps (if 𝑃 has to go all the way from one end to the other and back again).
Theorem 13.14 — Turing-machine to circuit compiler. There is an algorithm UNROLL such that for every Turing Machine 𝑀 and numbers 𝑛, 𝑇, UNROLL(𝑀, 1^𝑇, 1^𝑛) runs for 𝑝𝑜𝑙𝑦(|𝑀|, 𝑇, 𝑛) steps and outputs a NAND circuit 𝐶 with 𝑛 inputs, 𝑂(𝑇²) gates, and one output, such that 𝐶(𝑥) = 𝑦 if 𝑀 halts on input 𝑥 in ≤ 𝑇 steps and outputs 𝑦, and 𝐶(𝑥) = 0 otherwise.   (13.7)
Figure 13.12: The function UNROLL takes as input a Turing Machine 𝑀, an input length parameter 𝑛, a step budget parameter 𝑇, and outputs a circuit 𝐶 of size 𝑝𝑜𝑙𝑦(𝑇) that takes 𝑛 bits of input and outputs 𝑀(𝑥) if 𝑀 halts on 𝑥 within at most 𝑇 steps.
Proof. We only sketch the proof since it follows by directly translat-
ing the proof of Theorem 13.12 into an algorithm together with the
simulation of Turing machines by NAND-TM programs (see also
Fig. 13.13). Specifically, UNROLL does the following:
Reviewing the transformations described in Fig. 13.13, as well as solving the following two exercises, is a great way to get more comfortable with non-uniform complexity and in particular with P/poly and its relation to P.
Solution:
We start with the “if” direction. Suppose that there is a polynomial-
time Turing Machine 𝑀 that on input 1𝑛 outputs a circuit 𝐶𝑛 that
• |𝑎𝑛 | ≤ 𝑝(𝑛)
• For every 𝑥 ∈ {0, 1}𝑛 , 𝑀 (𝑎𝑛 , 𝑥) = 𝐹 (𝑥).
Solution:
We only sketch the proof. For the “only if” direction, if 𝐹 ∈
P/poly then we can use for 𝑎𝑛 simply the description of the cor-
responding circuit 𝐶𝑛 and for 𝑀 the program that computes in
polynomial time the evaluation of a circuit on its input.
For the “if” direction, we can use the same “unrolling the loop”
technique of Theorem 13.12 to show that if 𝑃 is a polynomial-time
NAND-TM program, then for every 𝑛 ∈ ℕ, the map 𝑥 ↦ 𝑃 (𝑎𝑛 , 𝑥)
can be computed by a polynomial size NAND-CIRC program 𝑄𝑛 .
■
Theorem 13.15 — P/poly contains uncomputable functions. There exists an uncomputable function 𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹 ∈ P/poly.
Proof Idea:
Since P/poly corresponds to non uniform computation, a function
𝐹 is in P/poly if for every 𝑛 ∈ ℕ, the restriction 𝐹↾𝑛 to inputs of length
𝑛 has a small circuit/program, even if the circuits for different values
of 𝑛 are completely different from one another. In particular, if 𝐹 has
the property that 𝐹(𝑥) = 𝐹(𝑥′) for every pair of equal-length inputs 𝑥 and 𝑥′, then this means that 𝐹↾𝑛 is either the constant function zero
or the constant function one for every 𝑛 ∈ ℕ. Since the constant
function has a (very!) small circuit, such a function 𝐹 will always
be in P/poly (indeed even in smaller classes). Yet by a reduction from
the Halting problem, we can obtain a function with this property that
is uncomputable.
⋆
✓ Chapter Recap
13.7 EXERCISES
Exercise 13.1 — Equivalence of different definitions of P and EXP. Prove that the classes P and EXP defined in Definition 13.2 are equal to ∪_{𝑐∈{1,2,3,…}} TIME(𝑛^𝑐) and ∪_{𝑐∈{1,2,3,…}} TIME(2^{𝑛^𝑐}) respectively. (If
Exercise 13.4 — Composition of polynomial time. Prove that if 𝐹, 𝐺 ∶ {0, 1}∗ → {0, 1}∗ are in P then their composition 𝐹 ∘ 𝐺, which is the function 𝐻 s.t. 𝐻(𝑥) = 𝐹(𝐺(𝑥)), is also in P.
■
Exercise 13.7 Let EDGE ∶ {0, 1}∗ → {0, 1} be the function such that on
input a string representing a triple (𝐿, 𝑖, 𝑗), where 𝐿 is the adjacency
list representation of an 𝑛 vertex graph 𝐺, and 𝑖 and 𝑗 are numbers in
[𝑛], EDGE(𝐿, 𝑖, 𝑗) = 1 if the edge {𝑖, 𝑗} is present in the graph. EDGE
outputs 0 on all other inputs.
Let NANDEVAL ∶ {0, 1}∗ → {0, 1} be the function such that for every string representing a pair
(𝑄, 𝑥) where 𝑄 is an 𝑛-input 1-output NAND-CIRC (not NAND-TM!)
program and 𝑥 ∈ {0, 1}𝑛 , NANDEVAL(𝑄, 𝑥) = 𝑄(𝑥). On all other
inputs NANDEVAL outputs 0.
Prove that NANDEVAL ∈ P.
■
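Here is a sketch (ours) of such an evaluator for NAND-CIRC programs given as source text; its running time is polynomial in the lengths of 𝑄 and 𝑥, which is the heart of why NANDEVAL ∈ P:

def nand_eval(program, x):
    """Evaluate a NAND-CIRC program (one 'foo = NAND(bar,blah)' per line)
    on input bits x, where X[i] refers to x[i]; returns the value of Y[0]."""
    values = {f"X[{i}]": b for i, b in enumerate(x)}
    for line in program.strip().splitlines():
        target, expr = [s.strip() for s in line.split("=", 1)]
        a, b = [s.strip() for s in expr[len("NAND("):-1].split(",")]
        values[target] = 1 - (values[a] & values[b])
    return values["Y[0]"]

xor_prog = """u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)"""
print([nand_eval(xor_prog, x) for x in [(0,0),(0,1),(1,0),(1,1)]])  # [0,1,1,0]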
• 𝑠∈ℕ
14
Polynomial-time reductions
• At the moment, for all these problems the best known algorithm is
not much faster than the trivial one in the worst case.
In this chapter we will see that for each one of the problems of find-
ing a longest path in a graph, solving quadratic equations, and finding
the maximum cut, if there exists a polynomial-time algorithm for this
problem then there exists a polynomial-time algorithm for the 3SAT
problem as well. In other words, we will reduce the task of solving
3SAT to each one of the above tasks. Another way to interpret these
results is that if there does not exist a polynomial-time algorithm for
3SAT then there does not exist a polynomial-time algorithm for these
other problems as well. In Chapter 15 we will see evidence (though
no proof!) that all of the above problems do not have polynomial-time
algorithms and hence are inherently intractable.
Solution:
If 𝐹 ≤𝑝 𝐺 and 𝐺 ≤𝑝 𝐻 then there exist polynomial-time com-
putable functions 𝑅1 and 𝑅2 mapping {0, 1}∗ to {0, 1}∗ such that
for every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 𝐺(𝑅1 (𝑥)) and for every 𝑦 ∈ {0, 1}∗ ,
𝐺(𝑦) = 𝐻(𝑅2 (𝑦)). Combining these two equalities, we see that
for every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 𝐻(𝑅2 (𝑅1 (𝑥))) and so to show that
𝐹 ≤𝑝 𝐻, it is sufficient to show that the map 𝑥 ↦ 𝑅2 (𝑅1 (𝑥)) is
computable in polynomial time. But if there are some constants 𝑐, 𝑑 such that 𝑅1(𝑥) is computable in time |𝑥|^𝑐 and 𝑅2(𝑦) is computable in time |𝑦|^𝑑, then 𝑅2(𝑅1(𝑥)) is computable in time (|𝑥|^𝑐)^𝑑 = |𝑥|^{𝑐𝑑}, which is polynomial.
■
𝑥0 + 𝑥 1 + 𝑥 2 = 2
𝑥0 + 𝑥 2 = 1 (14.3)
𝑥1 + 𝑥 2 = 2
then 01EQ(𝐸) = 1 since the assignment 𝑥 = 011 satisfies all three equations. We specifically restrict attention to linear equations in variables 𝑥0, … , 𝑥𝑛−1 in which every equation has the form ∑_{𝑖∈𝑆} 𝑥𝑖 = 𝑏 where 𝑆 ⊆ [𝑛] and 𝑏 ∈ ℕ.¹
¹ If you are familiar with matrix notation you may note that such equations can be written as 𝐴𝑥 = 𝐛 where 𝐴 is an 𝑚 × 𝑛 matrix with entries in 0/1 and 𝐛 ∈ ℕ^𝑚.
If we asked the question of whether there is a solution 𝑥 ∈ ℝ^𝑛 of real numbers to 𝐸, then this could be solved using the famous Gaussian elimination algorithm in polynomial time. However, there is no known efficient algorithm to solve 01EQ. Indeed, such an algorithm would imply an algorithm for 3SAT, as shown by the following theorem:
Proof Idea:
A constraint 𝑥2 ∨ ¬𝑥5 ∨ 𝑥7 can be written as 𝑥2 + (1 − 𝑥5) + 𝑥7 ≥ 1. This is a linear inequality, but since the sum on the left-hand side is at most three, we can also turn it into an equality by adding two new variables 𝑦, 𝑧 and writing it as 𝑥2 + (1 − 𝑥5) + 𝑥7 + 𝑦 + 𝑧 = 3. (We will
use fresh such variables 𝑦, 𝑧 for every constraint.) Finally, for every
variable 𝑥𝑖 we can add a variable 𝑥′𝑖 corresponding to its negation by
𝑥0² − 𝑥0 = 0
𝑥1² − 𝑥1 = 0
𝑥2² − 𝑥2 = 0      (14.4)
1 − 𝑥0 − 𝑥1 + 𝑥0𝑥1 = 0
You can verify that 𝑥 ∈ ℝ³ satisfies this set of equations if and only if 𝑥 ∈ {0, 1}³ and 𝑥0 ∨ 𝑥1 = 1.
Proof Idea:
Using the transitivity of reductions (Solved Exercise 14.2), it is
enough to show that 01EQ ≤𝑝 QUADEQ, but this follows since we can phrase the equation 𝑥𝑖 ∈ {0, 1} as the quadratic constraint 𝑥𝑖² − 𝑥𝑖 = 0. The takeaway technique of this reduction is that we can use
nonlinearity to force continuous variables (e.g., variables taking values
in ℝ) to be discrete (e.g., take values in {0, 1}).
⋆
Proof Idea:
add three vertices to 𝐺, and label them (𝐶, 𝑦), (𝐶, 𝑦′ ), and (𝐶, 𝑦″ )
respectively. We will also add the three edges between all pairs of
these vertices, so they form a triangle. Since there are 𝑚 clauses in 𝜑,
the graph 𝐺 will have 3𝑚 vertices.
Solution:
The key observation is that if 𝑆 ⊆ 𝑉 is a vertex cover that touches all edges, then there is no edge 𝑒 such that both of 𝑒’s endpoints are in the set 𝑆̄ = 𝑉 ⧵ 𝑆, and vice versa. In other words, 𝑆 is a vertex cover if and only if 𝑆̄ is an independent set. Since the size of 𝑆̄ is |𝑉| − |𝑆|, we see that the polynomial-time map 𝑅(𝐺, 𝑘) = (𝐺, 𝑛 − 𝑘) (where 𝑛 is the number of vertices of 𝐺) satisfies that VC(𝑅(𝐺, 𝑘)) = ISET(𝐺, 𝑘), which means that it is a reduction from independent set to vertex cover.
Figure 14.6: A vertex cover in a graph is a subset of vertices that touches all edges. In this 7-vertex graph, the 3 filled vertices are a vertex cover.
■
Solved Exercise 14.4 — Clique is equivalent to independent set. The maximum clique problem corresponds to the function CLIQUE ∶ {0, 1}∗ → {0, 1} such that for a graph 𝐺 and a number 𝑘, CLIQUE(𝐺, 𝑘) = 1 iff there is a subset 𝑆 of 𝑘 vertices such that for every distinct 𝑢, 𝑣 ∈ 𝑆, the edge {𝑢, 𝑣} is in 𝐺. Such a set is known as a clique.
Prove that CLIQUE ≤𝑝 ISET and ISET ≤𝑝 CLIQUE.
■
Solution:
If 𝐺 = (𝑉, 𝐸) is a graph, we denote by 𝐺̄ its complement, which is the graph on the same vertices 𝑉 such that for every distinct 𝑢, 𝑣 ∈ 𝑉, the edge {𝑢, 𝑣} is present in 𝐺̄ if and only if this edge is not present in 𝐺.
This means that for every set 𝑆, 𝑆 is an independent set in 𝐺 if and only if 𝑆 is a clique in 𝐺̄. Therefore for every 𝑘, ISET(𝐺, 𝑘) = CLIQUE(𝐺̄, 𝑘). Since the map 𝐺 ↦ 𝐺̄ can be computed efficiently, this yields a reduction ISET ≤𝑝 CLIQUE. Moreover, since the complement of 𝐺̄ is 𝐺 itself, this yields a reduction in the other direction as well.
■
Solution:
Since we know that ISET ≤𝑝 VC, using transitivity it is enough to show that VC ≤𝑝 DS. As Fig. 14.7 shows, a dominating set is not the same thing as a vertex cover. However, we can still relate the two problems. The idea is to map a graph 𝐺 into a graph 𝐻 such that a vertex cover in 𝐺 would translate into a dominating set in 𝐻 and vice versa. We do so by including in 𝐻 all the vertices and edges of 𝐺, but for every edge {𝑢, 𝑣} of 𝐺 we also add to 𝐻 a new vertex 𝑤𝑢,𝑣 and connect it to both 𝑢 and 𝑣. Let ℓ be the number of isolated vertices in 𝐺. The idea behind the proof is that we can transform a vertex cover 𝑆 of 𝑘 vertices in 𝐺 into a dominating set of 𝑘 + ℓ vertices in 𝐻 by adding to 𝑆 all the isolated vertices, and moreover we can transform every 𝑘 + ℓ sized dominating set in 𝐻 into a vertex cover in 𝐺. We now give the details.
Figure 14.7: A dominating set is a subset 𝑆 of vertices such that every vertex in the graph is either in 𝑆 or a neighbor of 𝑆. The figure shows two copies of the same graph. The red vertices on the left are a vertex cover that is not a dominating set. The blue vertices on the right are a dominating set that is not a vertex cover.
Description of the algorithm. Given an instance (𝐺, 𝑘) for the vertex cover problem, we will map 𝐺 into an instance (𝐻, 𝑘′) for the dominating set problem as follows (see Fig. 14.8 for a Python implementation):
DS(𝐻 ′ , 𝑘′ ) = 1.
Soundness. Suppose that DS(𝐻, 𝑘′ ) = 1. Then there is a domi-
nating set 𝐷 of size at most 𝑘′ = 𝑘 + ℓ in 𝐻. For every edge {𝑢, 𝑣} in
the graph 𝐺, if 𝐷 contains the vertex 𝑤𝑢,𝑣 then we remove this ver-
tex and add 𝑢 in its place. The only two neighbors of 𝑤𝑢,𝑣 are 𝑢 and
𝑣, and since 𝑢 is a neighbor of both 𝑤𝑢,𝑣 and of 𝑣, replacing 𝑤𝑢,𝑣
with 𝑣 maintains the property that it is a dominating set. Moreover,
this change cannot increase the size of 𝐷. Thus following this mod-
ification, we can assume that 𝐷 is a dominating set of at most 𝑘 + ℓ
vertices that does not contain any vertices of the form 𝑤𝑢,𝑣 .
Let 𝐼 be the set of isolated vertices in 𝐺. These vertices are also
isolated in 𝐻 and hence must be included in 𝐷 (an isolated ver-
tex must be in any dominating set, since it has no neighbors). We
let 𝑆 = 𝐷 ⧵ 𝐼. Then |𝑆| ≤ 𝑘. We claim that 𝑆 is a vertex cover
in 𝐺. Indeed, for every edge {𝑢, 𝑣} of 𝐺, either the vertex 𝑤𝑢,𝑣 or
Proof Idea:
We will map a graph 𝐺 into a graph 𝐻 such that a large indepen-
dent set in 𝐺 becomes a partition cutting many edges in 𝐻. We can
think of a cut in 𝐻 as coloring each vertex either “blue” or “red”. We
will add a special “source” vertex 𝑠∗ , connect it to all other vertices,
and assume without loss of generality that it is colored blue. Hence
the more vertices we color red, the more edges from 𝑠∗ we cut. Now,
for every edge 𝑢, 𝑣 in the original graph 𝐺 we will add a special “gad-
get” which will be a small subgraph that involves 𝑢,𝑣, the source 𝑠∗ ,
and two other additional vertices. We design the gadget in a way so
that if the red vertices are not an independent set in 𝐺 then the cor-
responding cut in 𝐻 will be “penalized” in the sense that it would
not cut as many edges. Once we set for ourselves this objective, it is not hard to find a gadget that achieves it; see the proof below. Once again the takeaway technique is to use a (this time slightly more clever) gadget.
⋆
Figure 14.14: The graph above with the longest path marked on it; the part of the path corresponding to variables is in green and the part corresponding to the clauses is in pink.

def TSAT2LONGPATH(φ):
    """Reduce 3SAT to LONGPATH"""
    def var(v):
        # return (variable index, sign); sign is False if the literal
        # is negated and True if it is positive
        return (int(v[2:]), False) if v[0] == "¬" else (int(v[1:]), True)
    n = numvars(φ)
    clauses = getclauses(φ)
    m = len(clauses)
    G = Graph()
    G.edge("start","start_0")
    for i in range(n): # add 2 length-m paths per variable
        G.edge(f"start_{i}",f"v_{i}_{0}_T")
        G.edge(f"start_{i}",f"v_{i}_{0}_F")
        for j in range(m-1):
            G.edge(f"v_{i}_{j}_T",f"v_{i}_{j+1}_T")
            G.edge(f"v_{i}_{j}_F",f"v_{i}_{j+1}_F")
        G.edge(f"v_{i}_{m-1}_T",f"end_{i}")
        G.edge(f"v_{i}_{m-1}_F",f"end_{i}")
        if i<n-1:
            G.edge(f"end_{i}",f"start_{i+1}")
    G.edge(f"end_{n-1}","start_clauses")
    for j,C in enumerate(clauses): # add gadget for each clause
        for v in C:
            i,sign = var(v)
            s = "F" if sign else "T"
            G.edge(f"C_{j}_in",f"v_{i}_{j}_{s}")
            G.edge(f"v_{i}_{j}_{s}",f"C_{j}_out")
        if j<m-1:
            G.edge(f"C_{j}_out",f"C_{j+1}_in")
    G.edge("start_clauses","C_0_in")
    G.edge(f"C_{m-1}_out","end")
    return G, 1+n*(m+1)+1+2*m+1
14.8 EXERCISES
15
NP, NP completeness, and the Cook-Levin Theorem
“In this paper we give theorems that suggest, but do not imply, that these
problems, as well as many others, will remain intractable perpetually”, Richard
Karp, 1972
“Sad to say, but it will be many more years, if ever before we really understand
the Mystical Power of Twoness… 2-SAT is easy, 3-SAT is hard, 2-dimensional
matching is easy, 3-dimensional matching is hard. Why? oh, Why?” Eugene
Lawler
that |𝑤| ≤ 𝑝(|𝑥|) for some polynomial 𝑝. That is, prove that for every
𝐹 ∶ {0, 1}∗ → {0, 1}, 𝐹 ∈ NP if and only if there is a polynomial-
time Turing machine 𝑉 and a polynomial 𝑝 ∶ ℕ → ℕ such that for
every 𝑥 ∈ {0, 1}∗ 𝐹 (𝑥) = 1 if and only if there exists 𝑤 ∈ {0, 1}∗ with
|𝑤| ≤ 𝑝(|𝑥|) such that 𝑉 (𝑥, 𝑤) = 1.
■
Solution:
The “only if” direction (namely that if 𝐹 ∈ NP then there is an algorithm 𝑉 and a polynomial 𝑝 as above) follows immediately from Definition 15.1 by letting 𝑝(𝑛) = 𝑛^𝑎. For the “if” direction, the idea is that if a string 𝑤 is of size at most 𝑝(𝑛) for a degree-𝑑 polynomial 𝑝, then there is some 𝑛0 such that for all 𝑛 > 𝑛0, |𝑤| < 𝑛^{𝑑+1}. Hence we can encode 𝑤 by a string of exactly length 𝑛^{𝑑+1} by padding it with 1 and an appropriate number of zeroes. Hence if there is an algorithm 𝑉 and polynomial 𝑝 as above, then we can define an algorithm 𝑉′ that does the following on input 𝑥, 𝑤′ with |𝑥| = 𝑛 and |𝑤′| = 𝑛^{𝑑+1}:
such that 𝑉′(𝑥𝑤′) = 1 if and only if there exists 𝑤 ∈ {0, 1}∗ with |𝑤| ≤ 𝑝(|𝑥|) such that 𝑉(𝑥𝑤) = 1.
■
Remark 15.2 — NP not (necessarily) closed under complement. Definition 15.1 is asymmetric in the sense that it treats the cases 𝐹(𝑥) = 1 and 𝐹(𝑥) = 0 differently: if 𝐹 ∈ NP, it does not follow that the function 1 − 𝐹 is in NP as well.
Here are some more examples for problems in NP. For each one
of these problems we merely sketch how the witness is represented
and why it is efficiently checkable, but working out the details can be a
good way to get more comfortable with Definition 15.1:
subset of 𝐺’s vertices and enumerating over all the edges {𝑢, 𝑣} of
𝐺, counting those edges such that 𝑢 ∈ 𝑆 and 𝑣 ∉ 𝑆 or vice versa.
Solution:
Suppose that 𝐹 ∈ P. Define the following function 𝑉: 𝑉(𝑥0^𝑛) = 1 iff 𝑛 = |𝑥| and 𝐹(𝑥) = 1. (𝑉 outputs 0 on all other inputs.) Since 𝐹 ∈ P we can clearly compute 𝑉 in polynomial time as well.
Let 𝑥 ∈ {0, 1}^𝑛 be some string. If 𝐹(𝑥) = 1 then 𝑉(𝑥0^𝑛) = 1. On the other hand, if 𝐹(𝑥) = 0 then for every 𝑤 ∈ {0, 1}^𝑛, 𝑉(𝑥𝑤) = 0.
Therefore, setting 𝑎 = 𝑏 = 1, we see that 𝑉 satisfies (15.1), and es-
tablishes that 𝐹 ∈ NP.
■
Remark 15.5 — NP does not mean non-polynomial!.
People sometimes think that NP stands for “non poly-
nomial time”. As Solved Exercise 15.2 shows, this is
far from the truth, and in fact every polynomial-time
computable function is in NP as well.
If 𝐹 is in NP it certainly does not mean that 𝐹 is hard
to compute (though it does not, as far as we know,
necessarily mean that it’s easy to compute either).
Rather, it means that 𝐹 is easy to verify, in the technical
sense of Definition 15.1.
Solution:
Suppose that 𝐹 ∈ NP and let 𝑉 be the polynomial-time computable function that satisfies (15.1) and 𝑎 the corresponding constant. Then given every 𝑥 ∈ {0, 1}^𝑛, we can check whether 𝐹(𝑥) = 1 in time 𝑝𝑜𝑙𝑦(𝑛) ⋅ 2^{𝑛^𝑎} = 𝑜(2^{𝑛^{𝑎+1}}) by enumerating over all the possible witnesses 𝑤 ∈ {0, 1}^{𝑛^𝑎} and computing 𝑉(𝑥𝑤) for each one of them.
Solved Exercise 15.2 and Solved Exercise 15.3 together imply that
P ⊆ NP ⊆ EXP . (15.2)
Solution:
Suppose that 𝐺 is in NP and in particular there exists 𝑎 and 𝑉 ∈ P such that for every 𝑦 ∈ {0, 1}∗, 𝐺(𝑦) = 1 ⇔ ∃_{𝑤∈{0,1}^{|𝑦|^𝑎}} 𝑉(𝑦𝑤) = 1. Suppose also that 𝐹 ≤𝑝 𝐺 and so in particular there is an 𝑛^𝑏-time computable function 𝑅 such that 𝐹(𝑥) = 𝐺(𝑅(𝑥)) for all 𝑥 ∈ {0, 1}∗. Define 𝑉′ to be a Turing Machine that on input a pair
We will soon show the proof of Theorem 15.6, but note that it im-
mediately implies that QUADEQ, LONGPATH, and MAXCUT all
reduce to 3SAT. Combining it with the reductions we’ve seen in Chap-
ter 14, it implies that all these problems are equivalent! For example,
to reduce QUADEQ to LONGPATH, we can first reduce QUADEQ to
3SAT using Theorem 15.6 and use the reduction we’ve seen in Theo-
rem 14.10 from 3SAT to LONGPATH. That is, since QUADEQ ∈ NP,
Theorem 15.6 implies that QUADEQ ≤𝑝 3SAT, and Theorem 14.10
implies that 3SAT ≤𝑝 LONGPATH, which by the transitivity of reduc-
tions (Solved Exercise 14.2) means that QUADEQ ≤𝑝 LONGPATH.
Similarly, since LONGPATH ∈ NP, we can use Theorem 15.6 and
Theorem 14.4 to show that LONGPATH ≤𝑝 3SAT ≤𝑝 QUADEQ,
concluding that LONGPATH and QUADEQ are computationally
equivalent.
There is of course nothing special about QUADEQ and LONGPATH
here: by combining (15.6) with the reductions we saw, we see that just
like 3SAT, every 𝐹 ∈ NP reduces to LONGPATH, and the same is true
for QUADEQ and MAXCUT. All these problems are in some sense
“the hardest in NP” since an efficient algorithm for any one of them
would imply an efficient algorithm for all the problems in NP. This
motivates the following definition:
Solution:
We have seen that the circuit (or straightline program) evalua-
tion problem can be computed in polynomial time. Specifically,
given a NAND-CIRC program 𝑄 of 𝑠 lines and 𝑛 inputs, and
𝑤 ∈ {0, 1}𝑛 , we can evaluate 𝑄 on the input 𝑤 in time which is
polynomial in 𝑠 and hence verify whether or not 𝑄(𝑤) = 1.
■
Proof Idea:
The proof closely follows the proof that P ⊆ P/poly (Theorem 13.12, see also Section 13.6.2). Specifically, if 𝐹 ∈ NP then there is a polynomial-time Turing machine 𝑀 and positive integer 𝑎 such that for every 𝑥 ∈ {0, 1}^𝑛, 𝐹(𝑥) = 1 iff there is some 𝑤 ∈ {0, 1}^{𝑛^𝑎} such that 𝑀(𝑥𝑤) = 1. The proof that P ⊆ P/poly gave us a way (via “unrolling the loop”) to come up in polynomial time with a Boolean circuit 𝐶 on 𝑛^𝑎 inputs that computes the function 𝑤 ↦ 𝑀(𝑥𝑤). We can then translate 𝐶 into an equivalent NAND circuit (or NAND-CIRC program) 𝑄. We see that there is a string 𝑤 ∈ {0, 1}^{𝑛^𝑎} such that 𝑄(𝑤) = 1 if and only if
The proof is a little bit technical but ultimately follows
quite directly from the definition of NP, as well as the
ability to “unroll the loop” of NAND-TM programs as
discussed in Section 13.6.2. If you find it confusing, try
to pause here and think how you would implement
in your favorite programming language the function
unroll which on input a NAND-TM program 𝑃
and numbers 𝑇 , 𝑛 outputs an 𝑛-input NAND-CIRC
program 𝑄 of 𝑂(|𝑇 |) lines such that for every input
𝑧 ∈ {0, 1}𝑛 , if 𝑃 halts on 𝑧 within at most 𝑇 steps and
outputs 𝑦, then 𝑄(𝑧) = 𝑦.
Proof Idea:
To prove Lemma 15.9 we need to give a polynomial-time map from
every NAND-CIRC program 𝑄 to a 3NAND formula Ψ such that there
exists 𝑤 such that 𝑄(𝑤) = 1 if and only if there exists 𝑧 satisfying Ψ.
For every line 𝑖 of 𝑄, we define a corresponding variable 𝑧𝑖 of Ψ. If
the line 𝑖 has the form foo = NAND(bar,blah) then we will add the
clause 𝑧𝑖 = NAND(𝑧𝑗 , 𝑧𝑘 ) where 𝑗 and 𝑘 are the last lines in which bar
and blah were written to. We will also set variables corresponding
to the input variables, as well as add a clause to ensure that the final
output is 1. The resulting reduction can be implemented in about a
dozen lines of Python, see Fig. 15.6.
⋆
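Fig. 15.6 is not reproduced here, but the following sketch (ours, not the book's code) conveys the mapping; the constraints forcing the final output to 1 via the zero variable, described in the proof below, are omitted:

def nandsat_to_3nand(program, n):
    """Map an n-input NAND-CIRC program to 3NAND constraints (i, j, k),
    each meaning z_i = NAND(z_j, z_k). Inputs are z_0..z_{n-1}; the
    variable assigned in line l becomes z_{n+l}."""
    where = {f"X[{i}]": i for i in range(n)}  # last variable holding each name
    constraints = []
    for l, line in enumerate(program.strip().splitlines()):
        target, expr = [s.strip() for s in line.split("=", 1)]
        a, b = [s.strip() for s in expr[len("NAND("):-1].split(",")]
        constraints.append((n + l, where[a], where[b]))
        where[target] = n + l
    # a satisfying z must also set the output variable (index returned
    # below) to 1; the extra constraints enforcing this are omitted here
    return constraints, where["Y[0]"]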
• Let ℓ∗ be the last line in which the output y_0 is assigned a value.
Then we add the constraint 𝑧ℓ∗ = NAND(𝑧ℓ0 , 𝑧ℓ0 ) where ℓ0 − 𝑛 is as
above the last line in which zero is assigned a value. Note that this
is effectively the constraint 𝑧ℓ∗ = NAND(0, 0) = 1.
To complete the proof we need to show that there exists 𝑤 ∈ {0, 1}𝑛
s.t. 𝑄(𝑤) = 1 if and only if there exists 𝑧 ∈ {0, 1}𝑛+𝑚 that satisfies all
constraints in Ψ. We now show both sides of this equivalence.
Part I: Completeness. Suppose that there is 𝑤 ∈ {0, 1}𝑛 s.t. 𝑄(𝑤) =
1. Let 𝑧 ∈ {0, 1}𝑛+𝑚 be defined as follows: for 𝑖 ∈ [𝑛], 𝑧𝑖 = 𝑤𝑖 and
for 𝑖 ∈ {𝑛, 𝑛 + 1, … , 𝑛 + 𝑚} 𝑧𝑖 equals the value that is assigned in
the (𝑖 − 𝑛)-th line of 𝑄 when executed on 𝑤. Then by construction
𝑧 satisfies all of the constraints of Ψ (including the constraint that
𝑧ℓ∗ = NAND(0, 0) = 1 since 𝑄(𝑤) = 1.)
Part II: Soundness. Suppose that there exists 𝑧 ∈ {0, 1}𝑛+𝑚 satisfy-
ing Ψ. Soundness will follow by showing that 𝑄(𝑧0 , … , 𝑧𝑛−1 ) = 1 (and
hence in particular there exists 𝑤 ∈ {0, 1}𝑛 , namely 𝑤 = 𝑧0 ⋯ 𝑧𝑛−1 ,
such that 𝑄(𝑤) = 1). To do this we will prove the following claim
(∗): for every ℓ ∈ [𝑚], 𝑧ℓ+𝑛 equals the value assigned in the ℓ-th step
of the execution of the program 𝑄 on 𝑧0 , … , 𝑧𝑛−1 . Note that because 𝑧
satisfies the constraints of Ψ, (∗) is sufficient to prove the soundness
condition since these constraints imply that the last value assigned to
the variable y_0 in the execution of 𝑄 on 𝑧0 ⋯ 𝑧𝑛−1 is equal to 1. To
prove (∗) suppose, towards a contradiction, that it is false, and let ℓ be
the smallest number such that 𝑧ℓ+𝑛 is not equal to the value assigned
in the ℓ-th step of the execution of 𝑄 on 𝑧0 , … , 𝑧𝑛−1 . But since 𝑧 sat-
isfies the constraints of Ψ, we get that 𝑧ℓ+𝑛 = NAND(𝑧𝑖 , 𝑧𝑗 ) where
(by the assumption above that ℓ is smallest with this property) these
values do correspond to the values last assigned to the variables on the
righthand side of the assignment operator in the ℓ-th line of the pro-
gram. But this means that the value assigned in the ℓ-th step is indeed
simply the NAND of 𝑧𝑖 and 𝑧𝑗 , contradicting our assumption on the
choice of ℓ.
■
Proof Idea:
To prove Lemma 15.10 we need to map a 3NAND formula 𝜑 into a 3SAT formula 𝜓 such that 𝜑 is satisfiable if and only if 𝜓 is. The idea is that we can transform every NAND constraint of the form 𝑎 = NAND(𝑏, 𝑐) into the AND of ORs involving the variables 𝑎, 𝑏, 𝑐 and their negations, where each of the ORs contains at most three literals.
Figure 15.7: A 3NAND instance that is obtained by taking a NAND-TM program for computing the AND function, unrolling it to obtain a NANDSAT instance, and then composing it with the reduction of Lemma 15.9.
It is a good exercise for you to try to find a 3CNF for-
mula 𝜉 on three variables 𝑎, 𝑏, 𝑐 such that 𝜉(𝑎, 𝑏, 𝑐) is
true if and only if 𝑎 = NAND(𝑏, 𝑐). Once you do so, try
to see why this implies a reduction from 3NAND to
3SAT, and hence completes the proof of Lemma 15.10.
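If you want to test a candidate 𝜉 without checking all eight cases by hand, here is a small harness (ours); the candidate shown is just one possibility, so try finding your own first:

from itertools import product

def is_valid_gadget(xi):
    """Check that xi(a,b,c) holds exactly when a = NAND(b,c)."""
    return all(bool(xi(a, b, c)) == (a == 1 - (b & c))
               for a, b, c in product([0, 1], repeat=3))

# One candidate 3CNF: (a ∨ b) ∧ (a ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c)
xi = lambda a, b, c: (a or b) and (a or c) and (not a or not b or not c)
print(is_valid_gadget(xi))  # True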
15.6 WRAPPING UP
We have shown that for every function 𝐹 in NP, 𝐹 ≤𝑝 NANDSAT ≤𝑝
3NAND ≤𝑝 3SAT, and so 3SAT is NP-hard. Since in Chapter 14 we
saw that 3SAT ≤𝑝 QUADEQ, 3SAT ≤𝑝 ISET, 3SAT ≤𝑝 MAXCUT
and 3SAT ≤𝑝 LONGPATH, all these problems are NP-hard as well.
Finally, since all the aforementioned problems are in NP, they are
all in fact NP-complete and have equivalent complexity. There are
thousands of other natural problems that are NP-complete as well.
Finding a polynomial-time algorithm for any one of them will imply a
polynomial-time algorithm for all of them.
✓ Chapter Recap
15.7 EXERCISES
Exercise 15.1 — Poor man’s Ladner’s Theorem. Prove that if there is no 𝑛^{𝑂(log² 𝑛)} time algorithm for 3SAT then there is some 𝐹 ∈ NP such that 𝐹 ∉ P and 𝐹 is not NP complete.²
² Hint: Use the function 𝐹 that on input a formula 𝜑 and a string of the form 1^𝑡, outputs 1 if and only if 𝜑 is satisfiable and 𝑡 = |𝜑|^{log |𝜑|}.
■
16
What if P equals NP?
• What is the evidence for P = NP vs P ≠ NP?
“You don’t have to believe in God, but you should believe in The Book.”, Paul Erdős, 1985.¹
¹ Paul Erdős (1913-1996) was one of the most prolific mathematicians of all times. Though he was an atheist, Erdős often referred to “The Book” in which God keeps the most elegant proof of each mathematical theorem.
“No more half measures, Walter”, Mike Ehrmantraut in “Breaking Bad”, 2010.
“Suppose aliens invade the earth and threaten to obliterate it in a year’s time unless human beings can find the [fifth Ramsey number]. We could marshal the world’s best minds and fastest computers, and within a year we could probably calculate the value. If the aliens demanded the [sixth Ramsey number], however, we would have no choice but to launch a preemptive attack.”, Paul Erdős, as quoted by Graham and Spencer, 1990.²
² The 𝑘-th Ramsey number, denoted as 𝑅(𝑘, 𝑘), is the smallest number 𝑛 such that for every graph 𝐺 on 𝑛 vertices, either 𝐺 or its complement contains a 𝑘-sized independent set. If P = NP then we can compute 𝑅(𝑘, 𝑘) in time polynomial in 2^𝑘, while otherwise it can potentially take closer to 2^{2^𝑘} steps.
We have mentioned that the question of whether P = NP, which is equivalent to whether there is a polynomial-time algorithm for 3SAT, is the great open question of Computer Science. But why is it so important? In this chapter, we will try to figure out the implications of such an algorithm.
First, let us get one qualm out of the way. Sometimes people say,
“What if P = NP but the best algorithm for 3SAT takes 𝑛^{1000} time?” Well, 𝑛^{1000} is much larger than, say, 2^{0.001√𝑛} for any input smaller than 2^{50}, which is as large a harddrive as you will encounter, and so another way to phrase this question is to say “what if the complexity of 3SAT is exponential for all inputs that we will ever encounter, but then grows much smaller than that?” To me this sounds like the computer science
equivalent of asking, “what if the laws of physics change completely
once they are out of the range of our telescopes?”. Sure, this is a valid
possibility, but wondering about it does not sound like the most pro-
ductive use of our time.
So, as the saying goes, we’ll keep an open mind, but not so open
that our brains fall out, and assume from now on that:
and
• She does not “beat around the bush” or take “half measures”.
• 3SAT is very easy: 3SAT has an 𝑂(𝑛) or 𝑂(𝑛²) time algorithm with a not too huge constant (say smaller than 10⁶).
At the time of writing, the fastest known algorithm for 3SAT requires more than 2^{0.35𝑛} steps to solve 𝑛-variable formulas, while we do not even know how to rule out the possibility that we can compute 3SAT using 10𝑛 gates. To put it in perspective, for the case 𝑛 = 1000 our lower and upper bounds for the computational costs are apart by a factor of about 10^{100}. As far as we know, it could be the case that 1000-variable 3SAT can be solved in a millisecond on a first-generation iPhone, and it can also be the case that such instances require more than the age of the universe to solve on the world’s fastest supercomputer.
So far, most of our evidence points to the latter possibility of 3SAT
being exponentially hard, but we have not ruled out the former possi-
bility either. In this chapter we will explore some of the consequences
of the “3SAT easy” scenario.
Theorem 16.1 — Search vs Decision. Suppose that P = NP. Then for every polynomial-time algorithm 𝑉 and 𝑎, 𝑏 ∈ ℕ, there is a polynomial-time algorithm FIND𝑉 such that for every 𝑥 ∈ {0, 1}^𝑛, if there exists 𝑦 ∈ {0, 1}^{𝑎𝑛^𝑏} satisfying 𝑉(𝑥𝑦) = 1, then FIND𝑉(𝑥) outputs some string 𝑦′ satisfying this condition.
To understand what the statement of Theo-
rem 16.1 means, let us look at the special case of
the MAXCUT problem. It is not hard to see that there
is a polynomial-time algorithm VERIFYCUT such that
VERIFYCUT(𝐺, 𝑘, 𝑆) = 1 if and only if 𝑆 is a subset
of 𝐺’s vertices that cuts at least 𝑘 edges. Theorem 16.1
implies that if P = NP then there is a polynomial-time
algorithm FINDCUT that on input 𝐺, 𝑘 outputs a set
𝑆 such that VERIFYCUT(𝐺, 𝑘, 𝑆) = 1 if such a set
exists. This means that if P = NP, by trying all values
of 𝑘 we can find in polynomial time a maximum cut
in any given graph. We can use a similar argument to
show that if P = NP then we can find a satisfying assignment for every satisfiable 3CNF formula, find the longest path in a graph, solve integer programming, and so on and so forth.
Proof Idea:
The idea behind the proof of Theorem 16.1 is simple; let us
demonstrate it for the special case of 3SAT. (In fact, this case is not
so “special”− since 3SAT is NP-complete, we can reduce the task of
solving the search problem for MAXCUT or any other problem in
NP to the task of solving it for 3SAT.) Suppose that P = NP and we
are given a satisfiable 3CNF formula 𝜑, and we now want to find a
satisfying assignment 𝑦 for 𝜑. Define 3SAT0 (𝜑) to output 1 if there is
a satisfying assignment 𝑦 for 𝜑 such that its first bit is 0, and similarly
define 3SAT1 (𝜑) = 1 if there is a satisfying assignment 𝑦 with 𝑦0 = 1.
The key observation is that both 3SAT0 and 3SAT1 are in NP, and so if
P = NP then we can compute them in polynomial time as well. Thus
we can use this to find the first bit of the satisfying assignment. We
can continue in this way to recover all the bits.
⋆
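Here is a sketch (ours) of this bit-by-bit recovery for CNF formulas; brute_decide is merely a stand-in for the hypothetical polynomial-time decision procedure that the assumption P = NP would provide:

from itertools import product

def substitute(clauses, var, val):
    """Plug x_var = val into a CNF given as lists of (variable, sign)
    literals: drop satisfied clauses, delete falsified literals."""
    out = []
    for clause in clauses:
        if (var, val) in clause:
            continue                               # clause is satisfied
        out.append([lit for lit in clause if lit[0] != var])
    return out

def brute_decide(clauses):
    """Stand-in for the hypothetical polynomial-time SAT decider."""
    if any(len(c) == 0 for c in clauses):
        return False                               # an empty clause: unsat
    if not clauses:
        return True
    n = 1 + max(v for c in clauses for v, _ in c)
    return any(all(any(x[v] == s for v, s in c) for c in clauses)
               for x in product([False, True], repeat=n))

def search_from_decision(clauses, n, decide):
    """Recover a satisfying assignment bit by bit from a decision oracle."""
    assignment = []
    for i in range(n):
        for val in (False, True):
            reduced = substitute(clauses, i, val)
            if decide(reduced):
                clauses, assignment = reduced, assignment + [val]
                break
        else:
            return None                            # formula unsatisfiable
    return assignment

# (x_0 ∨ x_1) ∧ (¬x_0 ∨ ¬x_1): e.g. x_0=False, x_1=True works
cnf = [[(0, True), (1, True)], [(0, False), (1, False)]]
print(search_from_decision(cnf, 2, brute_decide))  # [False, True]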
are 𝑧 s.t. 𝑉 (𝑥𝑦) = 1. Note that this claim implies the theorem, since in
particular it means that for ℓ = 𝑎𝑛^𝑏 − 1, 𝑧 satisfies 𝑉(𝑥𝑧) = 1.
We prove the claim by induction. For ℓ = 0, this holds vacuously.
Now for every ℓ > 0, if the call STARTSWITH𝑉 (𝑥𝑧0 ⋯ 𝑧ℓ−1 0)
returns 1, then we are guaranteed the invariant by definition of
STARTSWITH𝑉. Now under our inductive hypothesis, there are 𝑦ℓ, … , 𝑦_{𝑎𝑛^𝑏−1} such that 𝑃(𝑥𝑧0 ⋯ 𝑧ℓ−1𝑦ℓ ⋯ 𝑦_{𝑎𝑛^𝑏−1}) = 1. If the call to STARTSWITH𝑉(𝑥𝑧0 ⋯ 𝑧ℓ−10) returns 0 then it must be the case that 𝑦ℓ = 1, and hence when we set 𝑧ℓ = 1 we maintain the invariant.
■
16.2 OPTIMIZATION
Theorem 16.1 allows us to find solutions for NP problems if P = NP,
but it is not immediately clear that we can find the optimal solution.
For example, suppose that P = NP, and you are given a graph 𝐺. Can
you find the longest simple path in 𝐺 in polynomial time?
This is actually an excellent question for you to at-
tempt on your own. That is, assuming P = NP, give
a polynomial-time algorithm that on input a graph 𝐺,
outputs a maximally long simple path in the graph 𝐺.
The statement of Theorem 16.3 is a bit cumbersome.
To understand it, think how it would subsume the
example above of a polynomial time algorithm for
finding the maximum length path in a graph. In
this case the function 𝑓 would be the map that on
input a pair 𝑥, 𝑦 outputs 0 if the pair (𝑥, 𝑦) does not
represent some graph and a simple path inside the
graph respectively; otherwise 𝑓(𝑥, 𝑦) would equal
the length of the path 𝑦 in the graph 𝑥. Since a path
in an 𝑛 vertex graph can be represented by at most
Proof Idea:
The proof follows by generalizing our ideas from the longest path example above. Let 𝑓 be as in the theorem statement. If P = NP then for every string 𝑥 ∈ {0, 1}∗ and number 𝑘, we can test in 𝑝𝑜𝑙𝑦(|𝑥|, 𝑚) time whether there exists 𝑦 such that 𝑓(𝑥, 𝑦) ≥ 𝑘, or in other words test whether max_{𝑦∈{0,1}^𝑚} 𝑓(𝑥, 𝑦) ≥ 𝑘. If 𝑓(𝑥, 𝑦) is an integer between 0 and 𝑝𝑜𝑙𝑦(|𝑥| + |𝑦|) (as is the case in the example of longest path) then we can just try out all possibilities for 𝑘 to find the maximum number 𝑘 for which max_𝑦 𝑓(𝑥, 𝑦) ≥ 𝑘. Otherwise, we can use binary search to hone in on the right value. Once we do so, we can use search-to-decision to actually find the string 𝑦∗ that achieves the maximum.
⋆
𝐹(𝑥, 1^𝑚, 𝑘) = 1 if ∃_{𝑦∈{0,1}^𝑚} 𝑓(𝑥, 𝑦) ≥ 𝑘, and 𝐹(𝑥, 1^𝑚, 𝑘) = 0 otherwise.   (16.2)
Since 𝑓 is computable in polynomial time, 𝐹 is in NP, and so under
our assumption that P = NP, 𝐹 itself can be computed in polynomial
time. Now, for every 𝑥 and 𝑚, we can compute the largest 𝑘 such that
𝐹 (𝑥, 1𝑚 , 𝑘) = 1 by a binary search. Specifically, we will do this as
follows:
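The numbered steps are not reproduced above, but as a sketch (ours), the binary search could look as follows, treating 𝐹 as a black-box procedure that is monotone in 𝑘:

def max_threshold(F, x, m, hi):
    """Largest k in [0, hi] with F(x, m, k) = 1, or -1 if there is none;
    uses O(log hi) calls to F, assuming F(x, m, k) is monotone in k."""
    lo, best = 0, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if F(x, m, mid):
            best, lo = mid, mid + 1    # mid is feasible; search higher
        else:
            hi = mid - 1               # mid is infeasible; search lower
    return best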
Remark 16.5 — Need for binary search. In many examples, such as the case of finding the longest path, we don’t need the binary search step in Theorem 16.3, and can simply enumerate over all possible values for 𝑘 until we find the correct one. One example where we do need the binary search step is the problem of finding a maximum-weight path in a weighted graph. This is the problem where 𝐺 is a weighted graph, and every edge of 𝐺 is given a weight which is a number between 0 and 2^𝑘. Theorem 16.3 shows that we can find the maximum-weight simple path in 𝐺 (i.e., the simple path maximizing the sum of
One can obviously easily construct a Turing machine, which for every
formula 𝐹 in first order predicate logic and every natural number 𝑛, al-
lows one to decide if there is a proof of 𝐹 of length 𝑛 (length = number
of symbols). Let 𝜓(𝐹 , 𝑛) be the number of steps the machine requires
for this and let 𝜑(𝑛) = max𝐹 𝜓(𝐹 , 𝑛). The question is how fast 𝜑(𝑛)
grows for an optimal machine. One can show that 𝜑 ≥ 𝑘 ⋅ 𝑛 [for some
constant 𝑘 > 0]. If there really were a machine with 𝜑(𝑛) ∼ 𝑘 ⋅ 𝑛 (or
even ∼ 𝑘 ⋅ 𝑛²), this would have consequences of the greatest importance. Namely, it would obviously mean that in spite of the undecidability of the Entscheidungsproblem,⁴ the mental work of a mathematician concerning Yes-or-No questions could be completely replaced by a machine. After all, one would simply have to choose the natural number 𝑛 so large that when the machine does not deliver a result, it makes no sense to think more about the problem.

⁴ The undecidability of the Entscheidungsproblem refers to the uncomputability of the function that maps a statement in first order logic to 1 if and only if that statement has a proof.
For many reasonable proof systems (including the one that Gödel
referred to), SHORTPROOF𝑉 is in fact NP-complete, and so Gödel can
be thought of as the first person to formulate the P vs NP question.
Unfortunately, the letter was only discovered in 1988.
{0, 1}𝑛 there exists 𝑧 ∈ {0, 1}𝑛 such that 𝑉 (𝑥𝑦𝑧) = 1. Consider the
function 𝐹 such that 𝐹 (𝑥𝑦) = 1 if there exists 𝑧 ∈ {0, 1}𝑛 such that
𝑉(𝑥𝑦𝑧) = 1. Since 𝑉 runs in polynomial time, 𝐹 ∈ NP, and hence if
P = NP, then there is an algorithm 𝑉 ′ that on input 𝑥, 𝑦 outputs 1 if
and only if there exists 𝑧 ∈ {0, 1}𝑛 such that 𝑉 (𝑥𝑦𝑧) = 1. Now we
can see that the original statement we consider is true if and only if for
every 𝑦 ∈ {0, 1}𝑛 , 𝑉 ′ (𝑥𝑦) = 1, which means it is false if and only if
the following condition (∗) holds: there exists some 𝑦 ∈ {0, 1}𝑛 such
that 𝑉′(𝑥𝑦) = 0. But for every 𝑥 ∈ {0, 1}𝑛, the question of whether the condition (∗) holds is itself in NP (as we assumed 𝑉′ can be computed in polynomial time), and hence under the assumption that P = NP we can determine in polynomial time whether the condition (∗), and hence our original statement, is true.
⋆
$$\varphi_{x,y_0} = \forall_{y_1 \in \{0,1\}^m} \exists_{y_2 \in \{0,1\}^m} \cdots \mathcal{Q}_{y_{a-1} \in \{0,1\}^m} \; V(x y_0 y_1 \cdots y_{a-1}) = 1 \qquad (16.8)$$
$$\overline{\varphi}_{x,y_0} = \exists_{y_1 \in \{0,1\}^m} \forall_{y_2 \in \{0,1\}^m} \cdots \overline{\mathcal{Q}}_{y_{a-1} \in \{0,1\}^m} \; V(x y_0 y_1 \cdots y_{a-1}) = 0 \qquad (16.9)$$
The algorithm of Theorem 16.6 can solve the search problem as well: find the value 𝑦0 that certifies the truth of (16.7). We note that while this algorithm is in polynomial time, the exponent of this polynomial blows up quite fast. If the original NANDSAT algorithm required $\Omega(n^2)$ time, solving 𝑎 levels of quantifiers would require time $\Omega(n^{2^a})$.⁷

⁷ We do not know whether such loss is inherent. As far as we can tell, it's possible that the quantified boolean formula problem has a linear-time algorithm. We will, however, see later in this course that it satisfies a notion known as PSPACE-hardness that is even stronger than NP-hardness.

16.4.1 Application: self improving algorithm for 3SAT

Suppose that we found a polynomial-time algorithm 𝐴 for 3SAT that
is “good but not great”. For example, maybe our algorithm runs in
time 𝑐𝑛2 for some not too small constant 𝑐. However, it’s possible
that the best possible SAT algorithm is actually much more efficient
than that. Perhaps, as we guessed before, there is a circuit 𝐶 ∗ of at
most 106 𝑛 gates that computes 3SAT on 𝑛 variables, and we simply
haven’t discovered it yet. We can use Theorem 16.6 to “bootstrap” our
original “good but not great” 3SAT algorithm to discover the optimal
one. The idea is that we can phrase the question of whether there
exists a size 𝑠 circuit that computes 3SAT for all length 𝑛 inputs as
follows: there exists a size ≤ 𝑠 circuit 𝐶 such that for every formula 𝜑
described by a string of length at most 𝑛, if 𝐶(𝜑) = 1 then there exists
an assignment 𝑥 to the variables of 𝜑 that satisfies it. One can see that
this is a statement of the form (16.5) and hence if P = NP we can solve
it in polynomial time as well. We can therefore imagine investing huge
computational resources in running 𝐴 one time to discover the circuit
𝐶 ∗ and then using 𝐶 ∗ for all further computation.
“the laws of nature have this amazing feeling of inevitability… which is associ-
ated with local perfection.”
“The classical picture of the world is the top of a local mountain in the space of
ideas. And you go up to the top and it looks amazing up there and absolutely
incredible. And you learn that there is a taller mountain out there. Find it,
Mount Quantum…. they’re not smoothly connected … you’ve got to make a
jump to go from classical to quantum … This also tells you why we have such
major challenges in trying to extend our understanding of physics. We don’t
have these knobs, and little wheels, and twiddles that we can turn. We have to
learn how to make these jumps. And it is a tall order. And that’s why things are
difficult.”
✓ Chapter Recap
16.10 EXERCISES
17.1 EXERCISES
“Einstein was doubly wrong … not only does God definitely play dice, but He
sometimes confuses us by throwing them where they can’t be seen.”, Stephen
Hawking
These are all important questions that have been studied and de-
bated by scientists, mathematicians, statisticians and philosophers.
Fortunately, we will not need to deal directly with these questions
here. We will be mostly interested in the setting of tossing 𝑛 random,
unbiased and independent coins. Below we define the basic proba-
bilistic objects of events and random variables when restricted to this
setting. These can be defined for much more general probabilistic ex-
periments or sample spaces, and later on we will briefly discuss how
this can be done. However, the 𝑛-coin case is sufficient for almost
everything we’ll need in this course.
If instead of “heads” and “tails” we encode the sides of each coin
by “zero” and “one”, we can encode the result of tossing 𝑛 coins as
a string in {0, 1}𝑛 . Each particular outcome 𝑥 ∈ {0, 1}𝑛 is obtained
with probability 2−𝑛 . For example, if we toss three coins, then we
obtain each of the 8 outcomes 000, 001, 010, 011, 100, 101, 110, 111
with probability 2−3 = 1/8 (see also Fig. 18.1). We can describe the
experiment of tossing 𝑛 coins as choosing a string 𝑥 uniformly at
random from {0, 1}𝑛 , and hence we’ll use the shorthand 𝑥 ∼ {0, 1}𝑛
for 𝑥 that is chosen according to this experiment.
An event is simply a subset 𝐴 of {0, 1}𝑛 . The probability of 𝐴, de-
noted by Pr𝑥∼{0,1}𝑛 [𝐴] (or Pr[𝐴] for short, when the sample space is
understood from the context), is the probability that an 𝑥 chosen uni-
formly at random will be contained in 𝐴. Note that this is the same as
$|A|/2^n$ (where |𝐴| as usual denotes the number of elements in the set 𝐴). For example, the probability that 𝑥 has an even number of ones is Pr[𝐴] where $A = \{x : \sum_{i=0}^{n-1} x_i = 0 \mod 2\}$. In the case 𝑛 = 3, 𝐴 = {000, 011, 101, 110}, and hence Pr[𝐴] = 4/8 = 1/2 (see Fig. 18.2). It turns out this is true for every 𝑛:

Figure 18.1: The probabilistic experiment of tossing three coins corresponds to making 2 × 2 × 2 = 8 choices, each with equal probability. In this example, the blue set corresponds to the event $A = \{x \in \{0,1\}^3 \mid x_0 = 0\}$ where the first coin toss is equal to 0, and the pink set corresponds to the event $B = \{x \in \{0,1\}^3 \mid x_1 = 1\}$ where the second coin toss is equal to 1 (with their intersection having a purplish color). As we can see, each of these events contains 4 elements (out of 8 total) and so has probability 1/2. The intersection of 𝐴 and 𝐵 contains two elements, and so the probability that both of these events occur is 2/8 = 1/4.

Lemma 18.1 For every 𝑛 > 0,

$$\Pr_{x \sim \{0,1\}^n}\left[\sum_{i=0}^{n-1} x_i \text{ is even}\right] = 1/2 \qquad (18.1)$$
P
To test your intuition on probability, try to stop here
and prove the lemma on your own.
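To build intuition, one can also check the lemma empirically for small 𝑛 by brute-force enumeration of the sample space (a small Python sketch, not part of the proof):

from itertools import product

def even_parity_probability(n):
    """Enumerate all of {0,1}^n and compute the fraction of strings
    whose number of ones is even (i.e., Pr[sum of x_i is even])."""
    outcomes = list(product([0, 1], repeat=n))
    even = sum(1 for x in outcomes if sum(x) % 2 == 0)
    return even / len(outcomes)

# Each of these prints 0.5, matching Lemma 18.1.
for n in range(1, 8):
    print(n, even_parity_probability(n))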
$$\frac{|E_0 \cup O_1|}{2^n} = \frac{2^{n-2} + 2^{n-2}}{2^n} = \frac{1}{2}, \qquad (18.2)$$
using the fact that 𝐸0 and 𝑂1 are disjoint and hence |𝐸0 ∪ 𝑂1 | =
|𝐸0 | + |𝑂1 |.
■
We can also use the intersection (∩) and union (∪) operators to
talk about the probability of both event 𝐴 and event 𝐵 happening, or
the probability of event 𝐴 or event 𝐵 happening. For example, the
probability 𝑝 that 𝑥 has an even number of ones and $x_0 = 1$ is the same as $\Pr[A \cap B]$ where $A = \{x \in \{0,1\}^n : \sum_{i=0}^{n-1} x_i = 0 \mod 2\}$ and
𝐵 = {𝑥 ∈ {0, 1}𝑛 ∶ 𝑥0 = 1}. This probability is equal to 1/4 for
𝑛 > 1. (It is a great exercise for you to pause here and verify that you
understand why this is the case.)
Because intersection corresponds to considering the logical AND
of the conditions that two events happen, while union corresponds
to considering the logical OR, we will sometimes use the ∧ and ∨
operators instead of ∩ and ∪, and so write this probability 𝑝 = Pr[𝐴 ∩
𝐵] defined above also as
$$\Pr_{x \sim \{0,1\}^n}\left[\sum_i x_i = 0 \mod 2 \;\wedge\; x_0 = 1\right]. \qquad (18.3)$$
$$\Pr[\overline{A}] = \frac{|\overline{A}|}{2^n} = \frac{2^n - |A|}{2^n} = 1 - \frac{|A|}{2^n} = 1 - \Pr[A] \qquad (18.4)$$

This makes sense: since $\overline{A}$ happens if and only if 𝐴 does not happen, the probability of $\overline{A}$ should be one minus the probability of 𝐴.
Proof.
$$\mathbb{E}[X+Y] = \sum_{x \in \{0,1\}^n} 2^{-n}\left(X(x) + Y(x)\right) = \sum_{x \in \{0,1\}^n} 2^{-n} X(x) + \sum_{x \in \{0,1\}^n} 2^{-n} Y(x) = \mathbb{E}[X] + \mathbb{E}[Y]$$
■
Solution:
We can solve this using the linearity of expectation. We can de-
fine random variables 𝑋0 , 𝑋1 , … , 𝑋𝑛−1 such that 𝑋𝑖 (𝑥) = 𝑥𝑖 . Since
each 𝑥𝑖 equals 1 with probability 1/2 and 0 with probability 1/2,
$\mathbb{E}[X_i] = 1/2$. Since $X = \sum_{i=0}^{n-1} X_i$, by the linearity of expectation $\mathbb{E}[X] = \sum_{i=0}^{n-1} \mathbb{E}[X_i] = n/2$.
P
If you have not seen discrete probability before, please
go over this argument again until you are sure you
follow it; it is a prototypical simple example of the
type of reasoning we will employ again and again in
this course.
P
Before looking at the proof, try to see why the union
bound makes intuitive sense. We can also prove
it directly from the definition of probabilities and the fact that $|A \cup B| \le |A| + |B|$ for every two sets 𝐴, 𝐵.
Proof of Lemma 18.4. For every 𝑥, the variable 1𝐴∪𝐵 (𝑥) ≤ 1𝐴 (𝑥)+1𝐵 (𝑥).
Hence, Pr[𝐴∪𝐵] = 𝔼[1𝐴∪𝐵 ] ≤ 𝔼[1𝐴 +1𝐵 ] = 𝔼[1𝐴 ]+𝔼[1𝐵 ] = Pr[𝐴]+Pr[𝐵].
■
R
Remark 18.5 — Disjointness vs independence. People
sometimes confuse the notion of disjointness and in-
dependence, but these are actually quite different. Two
events 𝐴 and 𝐵 are disjoint if 𝐴 ∩ 𝐵 = ∅, which means
that if 𝐴 happens then 𝐵 definitely does not happen.
They are independent if Pr[𝐴 ∩ 𝐵] = Pr[𝐴] Pr[𝐵] which
means that knowing that 𝐴 happens gives us no infor-
mation about whether 𝐵 happened or not. If 𝐴 and 𝐵
have nonzero probability, then being disjoint implies
that they are not independent, since in particular it
means that they are negatively correlated.
More than two events: We can generalize this definition to more than
two events. We say that events 𝐴1 , … , 𝐴𝑘 are mutually independent
if knowing that any set of them occurred or didn’t occur does not
change the probability that an event outside the set occurs. Formally,
the condition is that for every subset 𝐼 ⊆ [𝑘],

$$\Pr\left[\bigcap_{i \in I} A_i\right] = \prod_{i \in I} \Pr[A_i].$$
For example, if 𝑥 ∼ {0, 1}3 , then the events {𝑥0 = 1}, {𝑥1 = 1} and
{𝑥2 = 1} are mutually independent. On the other hand, the events
{𝑥0 = 1}, {𝑥1 = 1} and {𝑥0 + 𝑥1 = 0 mod 2} are not mutually
independent, even though every pair of these events is independent
(can you see why? see also Fig. 18.5).
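The following small Python check (an illustration, not part of the text's argument) verifies this for the three-coin example: each pair of the events satisfies Pr[𝐴 ∩ 𝐵] = Pr[𝐴] Pr[𝐵], but the triple intersection does not factor:

from itertools import product

space = list(product([0, 1], repeat=3))  # all outcomes of 3 coins

def pr(event):
    """Probability of an event, given as a predicate on outcomes."""
    return sum(1 for x in space if event(x)) / len(space)

A = lambda x: x[0] == 1
B = lambda x: x[1] == 1
C = lambda x: (x[0] + x[1]) % 2 == 0

# Pairwise independence holds: each of these prints True.
print(pr(lambda x: A(x) and B(x)) == pr(A) * pr(B))
print(pr(lambda x: A(x) and C(x)) == pr(A) * pr(C))
print(pr(lambda x: B(x) and C(x)) == pr(B) * pr(C))

# But mutual independence fails: 1/4 != 1/8, so this prints False.
print(pr(lambda x: A(x) and B(x) and C(x)) == pr(A) * pr(B) * pr(C))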
P
The notation in the lemma’s statement is a bit cum-
bersome, but at the end of the day, it simply says that
if 𝑋 and 𝑌 are random variables that depend on two
disjoint sets 𝑆 and 𝑇 of coins (for example, 𝑋 might
be the sum of the first 𝑛/2 coins, and 𝑌 might be the
largest consecutive stretch of zeroes in the second 𝑛/2
coins), then they are independent.
$= \mathbb{E}[X] \, \mathbb{E}[Y] \qquad (18.13)$

where the first equality ($=^{(1)}$) follows from the independence of 𝑋 and 𝑌.
neither will learning 𝐹 (𝑋). Indeed, to prove this we can write for
every 𝑎, 𝑏 ∈ ℝ:
$$\Pr[F(X)=a \wedge G(Y)=b] = \sum_{x \text{ s.t. } F(x)=a,\; y \text{ s.t. } G(y)=b} \Pr[X=x] \Pr[Y=y] = \left(\sum_{x \text{ s.t. } F(x)=a} \Pr[X=x]\right) \cdot \left(\sum_{y \text{ s.t. } G(y)=b} \Pr[Y=y]\right) = \Pr[F(X)=a] \Pr[G(Y)=b]. \qquad (18.14)$$
$$\mathbb{E}\left[\prod_{i=0}^{n-1} X_i\right] = \prod_{i=0}^{n-1} \mathbb{E}[X_i]. \qquad (18.16)$$
P
We leave proving Lemma 18.7 and Lemma 18.8 as
Exercise 18.6 and Exercise 18.7. It is a good idea for you to stop now and do these exercises to make sure you are
comfortable with the notion of independence, as we
will use it heavily later on in this course.
Theorem 18.9 — Markov's inequality. If 𝑋 is a non-negative random variable then for every 𝑘 > 1, $\Pr[X \ge k \, \mathbb{E}[X]] \le 1/k$.
Figure 18.6: The probabilities that we obtain a particular sum when we toss 𝑛 = 10, 20, 100, 1000 coins converge quickly to the Gaussian/normal distribution.

P
Markov’s Inequality is actually a very natural state-
ment (see also Fig. 18.7). For example, if you know
that the average (not the median!) household income
in the US is 70,000 dollars, then in particular you can
deduce that at most 25 percent of households make
more than 280,000 dollars, since otherwise, even if
the remaining 75 percent had zero income, the top
25 percent alone would cause the average income to
be larger than 70,000 dollars. From this example you
can already see that in many situations, Markov’s
inequality will not be tight and the probability of devi-
ating from expectation will be much smaller: see the
Chebyshev and Chernoff inequalities below.
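As a quick sanity check, the following Python sketch (illustrative only, not part of the text) estimates both sides of Markov's inequality for a simple non-negative random variable:

import random

def markov_check(k, trials=100_000):
    """Empirically compare Pr[X >= k*E[X]] with the Markov bound 1/k,
    where X is the sum of 10 fair coin tosses (so E[X] is about 5)."""
    samples = [sum(random.randint(0, 1) for _ in range(10))
               for _ in range(trials)]
    mean = sum(samples) / trials
    tail = sum(1 for x in samples if x >= k * mean) / trials
    return tail, 1 / k

for k in [1.2, 1.5, 2.0]:
    tail, bound = markov_check(k)
    print(f"k={k}: Pr[X >= k*E[X]] ~ {tail:.4f} <= 1/k = {bound:.4f}")

As the printout shows, the actual tail is typically far below the Markov bound, matching the observation above that Markov's inequality is often not tight.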
Proof. Suppose, towards a contradiction, that $\Pr[X < \mathbb{E}[X]] = 1$. Then the random variable $Y = \mathbb{E}[X] - X$ is always positive. By
linearity of expectation 𝔼[𝑌 ] = 𝔼[𝑋] − 𝔼[𝑋] = 0. Yet by Markov, a
non-negative random variable 𝑌 with 𝔼[𝑌 ] = 0 must equal 0 with
probability 1, since the probability that 𝑌 > 𝑘 ⋅ 0 = 0 is at most 1/𝑘 for
every 𝑘 > 1. Hence we get a contradiction to the assumption that 𝑌 is
always positive.
■
We omit the proof, which appears in many texts, and uses Markov’s
inequality on i.i.d random variables 𝑌0 , … , 𝑌𝑛 that are of the form
𝑌𝑖 = 𝑒𝜆𝑋𝑖 for some carefully chosen parameter 𝜆. See Exercise 18.11
for a proof of the simple (but highly useful and representative) case
where each 𝑋𝑖 is {0, 1} valued and 𝑝 = 1/2. (See also Exercise 18.12
for a generalization.)
R
Remark 18.13 — Slight simplification of Chernoff. Since 𝑒 is roughly 2.7 (and in particular larger than 2), (18.19) would still be true if we replaced its righthand side with $e^{-2\epsilon^2 n + 1}$. For $n > 1/\epsilon^2$, the equation will hold even with the simpler righthand side $e^{-\epsilon^2 n}$.
where the probability is taken over the choice of the set of samples
𝑆.
In particular if $|\mathcal{C}| \le 2^k$ and $n > \frac{k \log(1/\delta)}{\epsilon^2}$ then with probability at least 1 − 𝛿, the classifier $h^* \in \mathcal{C}$ that minimizes the empirical test error $\hat{L}_S(h)$ satisfies $L(h^*) \le \hat{L}_S(h^*) + \epsilon$, and hence its test error is at most 𝜖 worse than its training error.
Proof Idea:
The idea is to combine the Chernoff bound with the union bound.
Let $k = \log |\mathcal{C}|$. We first use the Chernoff bound to show that for every fixed $h \in \mathcal{C}$, if we choose 𝑆 at random then the probability that $|L(h) - \hat{L}_S(h)| > \epsilon$ will be smaller than $\delta/2^k$. We can then use the union bound over all the $2^k$ members of $\mathcal{C}$ to show that this will be the case for every ℎ.
⋆
$$X_i = \begin{cases} 1 & h(x_i) \ne y_i \\ 0 & \text{otherwise} \end{cases} \qquad (18.22)$$
Since the samples (𝑥0 , 𝑦0 ), … , (𝑥𝑛−1 , 𝑦𝑛−1 ) are drawn independently
from the same distribution 𝐷, the random variables 𝑋0 , … , 𝑋𝑛−1 are
independently and identically distributed. Moreover, for every 𝑖,
$\mathbb{E}[X_i] = L(h)$. Hence by the Chernoff bound (see (18.20)), the probability that $|\sum_{i=0}^{n-1} X_i - n \cdot L(h)| \ge \epsilon n$ is at most $e^{-\epsilon^2 n} < e^{-k \log(1/\delta)} < \delta/2^k$ (using the fact that 𝑒 > 2). Since $\hat{L}_S(h) = \frac{1}{n} \sum_{i \in [n]} X_i$, this completes the proof of the claim.
Given the claim, the theorem follows from the union bound. In-
deed, for every ℎ ∈ 𝒞, define the “bad event” 𝐵ℎ to be the event (over
the choice of 𝑆) that |𝐿(ℎ) − 𝐿̂ 𝑆 (ℎ)| > 𝜖. By the claim Pr[𝐵ℎ ] < 𝛿/2𝑘 ,
and hence by the union bound the probability that the union of $B_h$ for all $h \in \mathcal{C}$ happens is smaller than $|\mathcal{C}|\delta/2^k = \delta$. If for every $h \in \mathcal{C}$, $B_h$ does not happen, it means that for every $h \in \mathcal{C}$, $|L(h) - \hat{L}_S(h)| \le \epsilon$,
and so the probability of the latter event is larger than 1 − 𝛿 which is
what we wanted to prove.
■
✓ Chapter Recap
18.4 EXERCISES
Exercise 18.1 Suppose that we toss three independent fair coins 𝑎, 𝑏, 𝑐 ∈ {0, 1}. What is the probability that the XOR of 𝑎, 𝑏, and 𝑐 is equal to 1?
What is the probability that the AND of these three values is equal to
1? Are these two events independent?
■
Exercise 18.8 — Variance of independent random variables. Prove that if $X_0, \ldots, X_{n-1}$ are independent random variables then $\mathrm{Var}[X_0 + \cdots + X_{n-1}] = \sum_{i=0}^{n-1} \mathrm{Var}[X_i]$.
■
2. Use this and Exercise 18.10 to prove (an approximate version of)
the Chernoff bound for the case that 𝑋0 , … , 𝑋𝑛−1 are i.i.d. random
variables over {0, 1} each equaling 0 and 1 with probability 1/2.
That is, prove that for every 𝜖 > 0, and $X_0, \ldots, X_{n-1}$ as above, $\Pr\left[\left|\sum_{i=0}^{n-1} X_i - \frac{n}{2}\right| > \epsilon n\right] < 2^{-0.1 \cdot \epsilon^2 n}$.
■
1. Prove that for every $j_0, \ldots, j_{n-1} \in \mathbb{N}$, if there exists one 𝑖 such that $j_i$ is odd then $\mathbb{E}[\prod_{i=0}^{n-1} Y_i^{j_i}] = 0$.

2. Prove that for every 𝑘, $\mathbb{E}[(\sum_{i=0}^{n-1} Y_i)^k] \le (10kn)^{k/2}$.³

3. Prove that for every 𝜖 > 0, $\Pr[|\sum_i Y_i| \ge \epsilon n] \ge 2^{-\epsilon^2 n / (10000 \log 1/\epsilon)}$.⁴

■

³ Hint: Bound the number of tuples $j_0, \ldots, j_{n-1}$ such that every $j_i$ is even and $\sum j_i = k$ using the Binomial coefficient and the fact that in any such tuple there are at most 𝑘/2 distinct indices.

⁴ Hint: Set $k = 2\lceil \epsilon^2 n / 1000 \rceil$ and then show that if the event $|\sum Y_i| \ge \epsilon n$ happens then the random variable $(\sum Y_i)^k$ is a factor of $\epsilon^{-k}$ larger than its expectation.
Exercise 18.13 — Sampling. Suppose that a country has 300,000,000 citi-
zens, 52 percent of which prefer the color “green” and 48 percent of
which prefer the color “orange”. Suppose we sample 𝑛 random citi-
zens and ask them their favorite color (assume they will answer truth-
fully). What is the smallest value 𝑛 among the following choices so
that the probability that the majority of the sample answers “green” is
at most 0.05?
a. 1,000
b. 10,000
c. 100,000
d. 1,000,000
19
Probabilistic computation
“in 1946 .. (I asked myself) what are the chances that a Canfield solitaire laid
out with 52 cards will come out successfully? After spending a lot of time
trying to estimate them by pure combinatorial calculations, I wondered whether
a more practical method … might not be to lay it out say one hundred times and simply observe and count”, Stanislaw Ulam, 1983
“The salient features of our method are that it is probabilistic … and with a
controllable miniscule probability of error.”, Michael Rabin, 1977
simpler way than was known otherwise. We will describe the algorithms in an informal, “pseudo-code” way, rather than as NAND-TM, NAND-RAM programs or Turing machines. In Chapter 20 we will discuss how to augment the computational models we saw before to incorporate the ability to “toss coins”.
Proof Idea:
We simply choose a random cut: we choose a subset 𝑆 of vertices by
choosing every vertex 𝑣 to be a member of 𝑆 with probability 1/2 in-
dependently. It’s not hard to see that each edge is cut with probability
1/2 and so the expected number of cut edges is 𝑚/2.
⋆
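The random-cut experiment in this proof idea is easy to express in code. Here is a minimal Python sketch (illustrative, with a hypothetical graph representation as a list of edges over vertices 0, …, 𝑛−1):

import random

def random_cut(n, edges):
    """Choose a uniformly random subset S of the n vertices and
    return (S, number of edges cut by S). In expectation the cut
    contains half of the edges."""
    S = {v for v in range(n) if random.random() < 0.5}
    cut_size = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return S, cut_size

Each edge (𝑢, 𝑣) is cut exactly when 𝑢 and 𝑣 fall on different sides, which happens with probability 1/2; by linearity of expectation the expected cut size is 𝑚/2.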
Proof Idea:
To see the idea behind the proof, think of the case that 𝑚 = 1000. In
this case one can show that we will cut at least 500 edges with proba-
bility at least 0.001 (and so in particular larger than 1/(2𝑚) = 1/2000).
Specifically, if we assume otherwise, then this means that with proba-
bility more than 0.999 the algorithm cuts 499 or fewer edges. But since
we can never cut more than the total of 1000 edges, given this assumption, the expected number of cut edges is maximized if we cut exactly 499 edges with probability 0.999 and cut 1000 edges with probability 0.001. Yet even in this case the expected number of edges will
be 0.999 ⋅ 499 + 0.001 ⋅ 1000 < 500, which contradicts the fact that we’ve
calculated the expectation to be at least 500 in Theorem 19.1.
⋆
Proof of Lemma 19.2. Let 𝑝 be the probability that we cut at least 𝑚/2
edges and suppose, towards a contradiction, that 𝑝 < 1/(2𝑚). Since
the number of edges cut is an integer, and 𝑚/2 is a multiple of 0.5,
by definition of 𝑝, with probability 1 − 𝑝 we cut at most 𝑚/2 − 0.5
edges. Moreover, since we can never cut more than 𝑚 edges, under
our assumption that 𝑝 < 1/(2𝑚), we can bound the expected number
of edges cut by

$$p \cdot m + (1-p)(m/2 - 0.5) \le p \cdot m + m/2 - 0.5 \; .$$
But if 𝑝 < 1/(2𝑚) then 𝑝𝑚 < 0.5 and so the righthand side is smaller
than 𝑚/2, which contradicts the fact that (as proven in Theorem 19.1)
the expected number of edges cut is at least 𝑚/2.
■
• Since the earth is about 5 billion years old, we can estimate the
chance that an asteroid of the magnitude that caused the dinosaurs’
extinction will hit us this very second to be about 2−60 . It is quite
likely that even a deterministic algorithm will fail if this happens.
for 3SAT are randomized, and are related to the following simple
algorithm, variants of which are also used in practice:
Algorithm WalkSAT:
Input: An 𝑛 variable 3CNF formula 𝜑.
Parameters: 𝑇 , 𝑆 ∈ ℕ
Operation:
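A minimal Python sketch of a WalkSAT-style local search (an illustration of the standard heuristic; the exact roles assigned to the parameters here — 𝑇 random restarts and 𝑆 variable flips per restart — are assumptions):

import random

def walksat(clauses, n, T, S):
    """Attempt to satisfy a 3CNF formula over variables 0..n-1.
    clauses: list of 3-tuples of literals, where literal +i / -i
    means variable i-1 is true / false (variables numbered from 1).
    Performs T random restarts with S local-search flips each."""
    def satisfied(lit, assignment):
        v = abs(lit) - 1
        return assignment[v] == (lit > 0)
    for _ in range(T):
        assignment = [random.random() < 0.5 for _ in range(n)]
        for _ in range(S):
            unsat = [c for c in clauses
                     if not any(satisfied(l, assignment) for l in c)]
            if not unsat:
                return assignment  # all clauses satisfied
            # pick a violated clause and flip a random variable in it
            clause = random.choice(unsat)
            v = abs(random.choice(clause)) - 1
            assignment[v] = not assignment[v]
    return None  # report failure after T restarts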
$$P(x_{0,0}, \ldots, x_{n-1,n-1}) = \sum_{\pi \in S_n} \mathrm{sign}(\pi) \left(\prod_{i=0}^{n-1} A_{i,\pi(i)}\right) \prod_{i=0}^{n-1} x_{i,\pi(i)} \qquad (19.7)$$
If a polynomial is not identically zero, then it can’t have “too many” roots.
This makes sense: if there are only “few” roots, then we expect that
with high probability the random input 𝑥 is not going to be one of
those roots. However, to transform this into an actual algorithm, we
need to make both the intuition and the notion of a “random” input
precise. Choosing a random real number is quite problematic, espe-
cially when you have only a finite number of coins at your disposal,
and so we start by reducing the task to a finite setting. We will use the
following result:
✓ Chapter Recap
19.2 EXERCISES
Exercise 19.1 — Amplification for max cut. Prove Lemma 19.3.
■
Exercise 19.2 — Deterministic max cut algorithm.⁵

■

⁵ TODO: add exercise to give a deterministic max cut algorithm that gives 𝑚/2 edges. Talk about greedy approach.
19.4 ACKNOWLEDGEMENTS
Learning Objectives:
• Formal definition of probabilistic polynomial
time: the class BPP.
• Proof that every function in BPP can be
computed by 𝑝𝑜𝑙𝑦(𝑛)-sized NAND-CIRC
programs/circuits.
• Relations between BPP and NP.
• Pseudorandom generators
20
Modeling randomized computation
“Any one who considers arithmetical methods of producing random digits is, of
course, in a state of sin.” John von Neumann, 1951.
1. We can define the class BPP that captures all Boolean functions that
can be computed in polynomial time by a randomized algorithm.
Crucially BPP is still very much a worst case class of computation:
the probability is only over the choice of the random coins of the
algorithm, as opposed to the choice of the input.
3. Though, as is the case for P and NP, there is much we do not know
about the class BPP, we can establish some relations between BPP
and the other complexity classes we saw before. In particular we
will show that P ⊆ BPP ⊆ EXP and BPP ⊆ P/poly .
4. While the relation between BPP and NP is not known, we can show
that if P = NP then BPP = P.
where this probability is taken over the result of the RAND opera-
tions of 𝑃 .
Note that the probability in (20.1) is taken only over the ran-
dom choices in the execution of 𝑃 and not over the choice of the in-
put 𝑥. In particular, as discussed in Big Idea 24, BPP is still a worst
case complexity class, in the sense that if 𝐹 is in BPP then there is a
polynomial-time randomized algorithm that computes 𝐹 with proba-
bility at least 2/3 on every possible (and not just random) input.
The same polynomial-overhead simulation of NAND-RAM pro-
grams by NAND-TM programs we saw in Theorem 13.5 extends to
randomized programs as well. Hence the class BPP is the same re-
gardless of whether it is defined via RNAND-TM or RNAND-RAM
programs. Similarly, we could have just as well defined BPP using
randomized Turing machines.
Because of these equivalences, below we will use the name “poly-
nomial time randomized algorithm” to denote a computation that can be
modeled by a polynomial-time RNAND-TM program, RNAND-RAM
program, or a randomized Turing machine (or any programming lan-
guage that includes a coin tossing operation). Since all these models
are equivalent up to polynomial factors, you can use your favorite
model to capture polynomial-time randomized algorithms without
any loss in generality.
Solved Exercise 20.1 — Choosing from a set. Modern programming languages often involve not just the ability to toss a random coin in {0, 1}
but also to choose an element at random from a set 𝑆. Show that you
can emulate this primitive using coin tossing. Specifically, show that
there is a randomized algorithm 𝐴 that on input a set 𝑆 of 𝑚 strings of
length 𝑛, runs in time 𝑝𝑜𝑙𝑦(𝑛, 𝑚) and outputs either an element 𝑥 ∈ 𝑆
or “fail” such that
■
Solution:
If the size of 𝑆 is a power of two, that is $m = 2^\ell$ for some $\ell \in \mathbb{N}$,
then we can choose a random element in 𝑆 by tossing ℓ coins to
obtain a string 𝑤 ∈ {0, 1}ℓ and then output the 𝑖-th element of 𝑆
where 𝑖 is the number whose binary representation is 𝑤.
If the size of 𝑆 is not a power of two, then our first attempt will be to let ℓ = ⌈log 𝑚⌉ and do the same, but then output the 𝑖-th element of 𝑆 if 𝑖 ∈ [𝑚] and output “fail” otherwise. Conditioned on not outputting “fail”, this element is distributed uniformly in 𝑆. However, in the worst case, $2^\ell$ can be almost 2𝑚 and so the probability of fail might be close to half. To reduce the failure probability, we can
repeat the experiment above 𝑛 times. Specifically, we will use the
following algorithm
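A minimal Python sketch of this repeated rejection-sampling algorithm (illustrative; the repetition count 𝑛 is the failure-amplification parameter from the exercise):

import random
from math import ceil, log2

def choose_from_set(S, n):
    """Choose a uniformly random element of the list S using only
    fair coin tosses. Each attempt succeeds with probability more
    than 1/2, so n attempts fail with probability below 2^(-n)."""
    m = len(S)
    ell = max(1, ceil(log2(m)))
    for _ in range(n):
        # toss ell coins to get a number i in [0, 2^ell)
        i = sum(random.randint(0, 1) << j for j in range(ell))
        if i < m:
            return S[i]  # uniform over S, conditioned on success
    return "fail"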
$$\Pr_{r \sim \{0,1\}^{a|x|^b}}\left[G(xr) = F(x)\right] \ge \frac{2}{3}. \qquad (20.2)$$
Proof Idea:
The idea behind the proof is that, as illustrated in Fig. 20.2, we can
simply replace sampling a random coin with reading a bit from the
extra “random input” 𝑟 and vice versa. To prove this rigorously we
need to work through some slightly cumbersome formal notation.
This might be one of those proofs that is easier to work out on your
own than to read.
⋆
where the probability in the righthand side is taken over the RAND()
operations in 𝑃 . In particular this means that if we define 𝐺(𝑥𝑟) =
𝑃 ′ (𝑥𝑟) then the function 𝐺 satisfies the conditions of (20.2).
The algorithm 𝑃 ′ will be very simple: it simulates the program 𝑃 ,
maintaining a counter 𝑖 initialized to 0. Every time that 𝑃 makes a
RAND() operation, the program 𝑃 ′ will supply the result from 𝑟𝑖 and
increment 𝑖 by one. We will never “run out” of bits, since the running time of 𝑃 is at most $an^b$ and hence it can make at most this number of RAND() calls. The output of 𝑃′(𝑥𝑟) for a random $r \sim \{0,1\}^m$ will be
distributed identically to the output of 𝑃 (𝑥).
For the other direction, given a function 𝐺 ∈ P satisfying the condi-
tion (20.2) and a NAND-TM 𝑃 ′ that computes 𝐺 in polynomial time,
we can construct an RNAND-TM program 𝑃 that computes 𝐹 in poly-
nomial time. On input 𝑥 ∈ {0, 1}𝑛 , the program 𝑃 will simply use the
RAND() instruction $an^b$ times to fill an array R[0], …, R[$an^b - 1$] and
then execute the original program 𝑃 ′ on input 𝑥𝑟 where 𝑟𝑖 is the 𝑖-th
element of the array R. Once again, it is clear that if 𝑃 ′ runs in polyno-
mial time then so will 𝑃, and for every input 𝑥 and $r \in \{0,1\}^{an^b}$, the output of 𝑃 on input 𝑥 with coins 𝑟 equals 𝑃′(𝑥𝑟) = 𝐺(𝑥𝑟), and hence 𝑃 computes 𝐹 with probability at least 2/3.
R
Remark 20.4 — Definitions of BPP and NP. The char-
acterization of BPP Theorem 20.3 is reminiscent of
the characterization of NP in Definition 15.1, with
the randomness in the case of BPP playing the role
of the solution in the case of NP. However, there are
important differences between the two:
$$\Pr[A(x) = F(x)] \ge \frac{1}{2} + \frac{1}{p(n)}. \qquad (20.4)$$
Proof Idea:
The proof is the same as we’ve seen before in the case of maximum
cut and other examples. We use the Chernoff bound to argue that if
𝐴 computes 𝐹 with probability at least $\frac{1}{2} + \epsilon$ and we run it $O(k/\epsilon^2)$
times, each time using fresh and independent random coins, then the
probability that the majority of the answers will not be correct will be
less than 2−𝑘 . Amplification can be thought of as a “polling” of the
choices for randomness for the algorithm (see Fig. 20.3).
⋆
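In code, the amplification step is just “run repeatedly and take the majority” (a Python sketch; run_A stands for a hypothetical randomized algorithm that answers correctly with probability at least 1/2 + eps on every input, and the constant in the trial count is illustrative):

import random
from math import ceil

def amplify(run_A, x, eps, k):
    """Run the randomized algorithm run_A(x) independently
    O(k/eps^2) times, with fresh coins each time, and output the
    majority answer. By the Chernoff bound the majority errs with
    probability less than 2^(-k)."""
    trials = ceil(10 * k / eps**2)  # constant 10 is illustrative
    ones = sum(run_A(x) for _ in range(trials))
    return 1 if 2 * ones > trials else 0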
Before seeing the proof, note that Theorem 20.6 implies that if there
was a randomized polynomial time algorithm for any NP-complete
problem such as 3SAT, ISET etc., then there would be such an algo-
rithm for every problem in NP. Thus, regardless of whether our model of computation allows deterministic or randomized algorithms, NP-complete problems retain their status as the “hardest problems in NP.”
Proof Idea:
The idea is to simply run the reduction as usual, and plug it into
the randomized algorithm instead of a deterministic one. It would
be an excellent exercise, and a way to reinforce the definitions of NP-
hardness and randomized algorithms, for you to work out the proof
for yourself. However for the sake of completeness, we include this
proof below.
⋆
for every 𝑦 ∈ {0, 1}∗ (where the probability is taken over the random
coin tosses of 𝑃 ). Hence we can get a polynomial-time RNAND-TM
program 𝑃 ′ to compute 𝐺 by setting 𝑃 ′ (𝑥) = 𝑃 (𝑅(𝑥)). By (20.6)
Pr[𝑃 ′ (𝑥) = 𝐹 (𝑅(𝑥))] ≥ 2/3 and since 𝐹 (𝑅(𝑥)) = 𝐺(𝑥) this implies that
Pr[𝑃 ′ (𝑥) = 𝐺(𝑥)] ≥ 2/3, which proves that 𝐺 ∈ BPP.
■
don’t even know how to rule out the possibility that BPP = EXP! Thus
a priori it’s possible (though seems highly unlikely) that randomness
is a magical tool that allows us to speed up arbitrary exponential time computation.¹ Nevertheless, as we discuss below, it is believed that randomization's power is much weaker and BPP lies in much more “pedestrian” territory.

¹ At the time of this writing, the largest “natural” complexity class which we can't rule out being contained in BPP is the class NEXP, which we did not define in this course, but corresponds to non deterministic exponential time. See this paper for a discussion of this question.
20.3 THE POWER OF RANDOMIZATION
A major question is whether randomization can add power to compu-
tation. Mathematically, we can phrase this as the following question:
does BPP = P? Given what we’ve seen so far about the relations of
other complexity classes such as P and NP, or NP and EXP, one might
guess that:
One would be correct about the former, but wrong about the latter.
As we will see, we do in fact have reasons to believe that BPP = P.
This can be thought of as supporting the extended Church Turing hy-
pothesis that deterministic polynomial-time Turing machines capture
what can be feasibly computed in the physical world.
We now survey some of the relations that are known between
BPP and other complexity classes we have encountered. (See also
Fig. 20.4.)
Proof Idea:
The idea behind the proof is that we can first amplify by repetition
the probability of success from 2/3 to $1 - 0.1 \cdot 2^{-n}$. This will allow us to
show that for every 𝑛 ∈ ℕ there exists a single fixed choice of “favorable
coins” which is a string 𝑟 of length polynomial in 𝑛 such that if 𝑟 is
used for the randomness then we output the right answer on all of
the possible $2^n$ inputs. We can then use the standard “unravelling the
loop” technique to transform an RNAND-TM program to an RNAND-
CIRC program, and “hardwire” the favorable choice of random coins
to transform the RNAND-CIRC program into a plain old deterministic
NAND-CIRC program.
⋆
slower (and hence still polynomial time) such that for every 𝑥 ∈
{0, 1}𝑛
20.4 DERANDOMIZATION
The proof of Theorem 20.8 can be summarized as follows: we can
replace a 𝑝𝑜𝑙𝑦(𝑛)-time algorithm that tosses coins as it runs with an
algorithm that uses a single set of coin tosses 𝑟∗ ∈ {0, 1}𝑝𝑜𝑙𝑦(𝑛) which
will be good enough for all inputs of size 𝑛. Another way to say it is
that for the purposes of computing functions, we do not need “online”
access to random coins and can generate a set of coins “offline” ahead
of time, before we see the actual input.
But this does not really help us with answering the question of
whether BPP equals P, since we still need to find a way to generate
these “offline” coins in the first place. To derandomize an RNAND-
TM program we will need to come up with a single deterministic
564 i n trod u c ti on to the ore ti ca l comp u te r sc i e nc e
algorithm that will work for all input lengths. That is, unlike in the
case of RNAND-CIRC programs, we cannot choose for every input
length 𝑛 some string 𝑟∗ ∈ {0, 1}𝑝𝑜𝑙𝑦(𝑛) to use as our random coins.
Can we derandomize randomized algorithms, or does randomness
add an inherent extra power for computation? This is a fundamentally
interesting question but is also of practical significance. Ever since
people started to use randomized algorithms during the Manhattan
project, they have been trying to remove the need for randomness and
replace it with numbers that are selected through some deterministic
process. Throughout the years this approach has often been used successfully, though there have been a number of failures as well.²

² One amusing anecdote is a recent case where scammers managed to predict the imperfect “pseudorandom generator” used by slot machines to cheat casinos. Unfortunately we don't know the details of how they did it, since the case was sealed.

A common approach people used over the years was to replace the random coins of the algorithm by a “randomish looking” string that they generated through some arithmetic process. For example, one can use the digits of 𝜋 for the random tape. Using these types of
methods corresponds to what von Neumann referred to as a “state
of sin”. (Though this is a sin that he himself frequently committed,
as generating true randomness in sufficient quantity was and still is
often too expensive.) The reason that this is considered a “sin” is that
such a procedure will not work in general. For example, it is easy to
modify any probabilistic algorithm 𝐴 such as the ones we have seen in
Chapter 19, to an algorithm 𝐴′ that is guaranteed to fail if the random
tape happens to equal the digits of 𝜋. This means that the procedure
“replace the random tape by the digits of 𝜋” does not yield a general
way to transform a probabilistic algorithm to a deterministic one that
will solve the same problem. Of course, this procedure does not always
fail, but we have no good way to determine when it fails and when
it succeeds. This reasoning is not specific to 𝜋 and holds for every deterministically produced string, whether it is obtained from 𝜋, 𝑒, the Fibonacci series, or anything else.
An algorithm that checks if its random tape is equal to 𝜋 and then
fails seems to be quite silly, but this is but the “tip of the iceberg” for a
very serious issue. Time and again people have learned the hard way
that one needs to be very careful about producing random bits using
deterministic means. As we will see when we discuss cryptography,
many spectacular security failures and break-ins were the result of
using “insufficiently random” coins.
P
This is a definition that’s worth reading more than
once, and spending some time to digest it. Note that it
takes several parameters:
We will now (partially) answer both questions. For the first ques-
tion, let us come clean and confess we do not know how to prove that
interesting pseudorandom generators exist. By interesting we mean
pseudorandom generators that satisfy that 𝜖 is some small constant
(say 𝜖 < 1/3), 𝑚 > ℓ, and the function 𝐺 itself can be computed in
𝑝𝑜𝑙𝑦(𝑚) time. Nevertheless, Lemma 20.12 (whose statement and proof
is deferred to the end of this chapter) shows that if we only drop the
last condition (polynomial-time computability), then there do in fact
exist pseudorandom generators where 𝑚 is exponentially larger than ℓ.
P
At this point you might want to skip ahead and look at
the statement of Lemma 20.12. However, since its proof
is somewhat subtle, I recommend you defer reading it
until you’ve finished reading the rest of this chapter.
P
The “optimal PRG conjecture” is worth reading more than once. What it posits is that we can obtain a (𝑇, 𝜖) pseudorandom generator 𝐺 such that every output bit of 𝐺 can be computed in time polynomial in the length ℓ of the input, where 𝑇 is exponentially large in ℓ and 𝜖 is exponentially small in ℓ. (Note that we could not hope for the entire output to be computed in time polynomial in ℓ: merely writing it down would take exponential time.)
We emphasize again that the optimal PRG conjecture is, as its name
implies, a conjecture, and we still do not know how to prove it. In par-
ticular, it is stronger than the conjecture that P ≠ NP. But we do have
some evidence for its truth. There is a spectrum of different types of
pseudorandom generators, and there are weaker assumptions than
the optimal PRG conjecture that suffice to prove that BPP = P. In
particular this is known to hold under the assumption that there exists
a function 𝐹 ∈ TIME(2𝑂(𝑛) ) and 𝜖 > 0 such that for every sufficiently
large 𝑛, 𝐹↾𝑛 is not in SIZE($2^{\epsilon n}$). The name “Optimal PRG conjecture” is not standard. This conjecture is sometimes known in the literature as the existence of exponentially strong pseudorandom functions.³

³ A pseudorandom generator of the form we posit, where each output bit can be computed individually in time polynomial in the seed length, is commonly known as a pseudorandom function generator. For more on the many interesting results and connections in the study of pseudorandomness, see this monograph of Salil Vadhan.

20.4.3 Usefulness of pseudorandom generators

We now show that optimal pseudorandom generators are indeed very useful, by proving the following theorem:
Proof Idea:
The optimal PRG conjecture tells us that we can achieve exponential
expansion of ℓ truly random coins into as many as $2^{\delta\ell}$ “pseudorandom
coins.” Looked at from the other direction, it allows us to reduce the
need for randomness by taking an algorithm that uses 𝑚 coins and
converting it into an algorithm that only uses 𝑂(log 𝑚) coins. Now an
algorithm of the latter type by can be made fully deterministic by enu-
merating over all the 2𝑂(log 𝑚) (which is polynomial in 𝑚) possibilities
for its random choices.
⋆
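In code, the resulting deterministic simulation looks roughly as follows (a Python sketch under the conjecture's assumptions; prg and run_A stand for a hypothetical pseudorandom generator and the randomized algorithm being derandomized):

from itertools import product

def derandomize(run_A, x, prg, seed_len):
    """Deterministically simulate a randomized algorithm run_A(x, coins)
    that uses m pseudorandom coins: enumerate all 2^seed_len seeds,
    expand each with the generator, and take the majority answer."""
    votes = 0
    total = 0
    for seed in product([0, 1], repeat=seed_len):
        coins = prg(seed)          # expands seed to m pseudorandom bits
        votes += run_A(x, coins)   # run_A returns 0 or 1
        total += 1
    return 1 if 2 * votes > total else 0

With seed_len = 𝑂(log 𝑚), the loop runs for polynomially many iterations, so the whole simulation is deterministic polynomial time.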
We now proceed with the proof details.
P
Before reading the proof, it is instructive to think
why this result is not “obvious.” If P = NP then
given any randomized algorithm 𝐴 and input 𝑥,
we will be able to figure out in polynomial time if
there is a string $r \in \{0,1\}^m$ of random coins for 𝐴 such that 𝐴 outputs 1 on input 𝑥 with coins 𝑟. But the mere existence of such an 𝑟 tells us nothing about the fraction of coins on which 𝐴 outputs 1, which is what determines the value 𝐹(𝑥).
Proof Idea:
The construction follows the “quantifier elimination” idea which
we have seen in Theorem 16.6. We will show that for every 𝐹 ∈ BPP,
we can reduce the question of whether some input 𝑥 satisfies 𝐹(𝑥) = 1 to the
question of whether a formula of the form ∃𝑢∈{0,1}𝑚 ∀𝑣∈{0,1}𝑘 𝑃 (𝑢, 𝑣)
is true, where 𝑚, 𝑘 are polynomial in the length of 𝑥 and 𝑃 is
polynomial-time computable. By Theorem 16.6, if P = NP then we can
decide in polynomial time whether such a formula is true or false.
The idea behind this construction is that using amplification we
can obtain a randomized algorithm 𝐴 for computing 𝐹 using 𝑚 coins
such that for every 𝑥 ∈ {0, 1}𝑛 , if 𝐹 (𝑥) = 0 then the set 𝑆 ⊆ {0, 1}𝑚
of coins that make 𝐴 output 1 is extremely tiny, and if 𝐹 (𝑥) = 1 then
it is very large. Now in the case 𝐹 (𝑥) = 1, one can show that this
means that there exists a small number 𝑘 of “shifts” 𝑠0 , … , 𝑠𝑘−1 such
that the union of the sets 𝑆 ⊕ 𝑠𝑖 (i.e., sets of the form {𝑠 ⊕ 𝑠𝑖 | 𝑠 ∈ 𝑆})
covers {0, 1}𝑚 , while in the case 𝐹 (𝑥) = 0 this union will always be of
size at most 𝑘|𝑆| which is much smaller than 2𝑚 . We can express the
condition that there exists 𝑠0 , … , 𝑠𝑘−1 such that ∪𝑖∈[𝑘] (𝑆 ⊕ 𝑠𝑖 ) = {0, 1}𝑚
as a statement with a constant number of quantifiers.
⋆
$$\exists_{s_0,\ldots,s_{100m-1} \in \{0,1\}^m} \forall_{w \in \{0,1\}^m} \left( w \in (S_x \oplus s_0) \vee w \in (S_x \oplus s_1) \vee \cdots \vee w \in (S_x \oplus s_{100m-1}) \right) \qquad (20.12)$$

or equivalently, that $\cup_{i \in [100m]} (S_x \oplus s_i) = \{0,1\}^m$ for some choice of the shifts. On the other hand, for every choice of the shifts,

$$\left| \cup_{i \in [100m]} (S_x \oplus s_i) \right| \le \sum_{i=0}^{100m-1} |S_x \oplus s_i| = \sum_{i=0}^{100m-1} |S_x| = 100m |S_x| \; . \qquad (20.14)$$
To prove CLAIM II, we will use a technique known as the prob-
abilistic method (see the proof of Lemma 20.12 for a more extensive
discussion). Note that this is a completely different use of probability
than in the theorem statement; we just use the methods of probability
to prove an existential statement.
Proof of CLAIM II: Let 𝑆 ⊆ {0, 1}𝑚 with |𝑆| ≥ 0.5 ⋅ 2𝑚 be as
in the claim’s statement. Consider the following probabilistic ex-
periment: we choose 100𝑚 random shifts 𝑠0 , … , 𝑠100𝑚−1 indepen-
dently at random in {0, 1}𝑚 , and consider the event GOOD that
∪𝑖∈[100𝑚] (𝑆 ⊕ 𝑠𝑖 ) = {0, 1}𝑚 . To prove CLAIM II it is enough to show
that Pr[GOOD] > 0, since that means that in particular there must exist
shifts 𝑠0 , … , 𝑠100𝑚−1 that satisfy this condition.
For every $z \in \{0,1\}^m$, define the event BAD$_z$ to hold if $z \notin \cup_{i \in [100m]} (S \oplus s_i)$. The event GOOD holds if BAD$_z$ fails for every
𝑧 ∈ {0, 1}𝑚 , and so our goal is to prove that Pr[∪𝑧∈{0,1}𝑚 BAD𝑧 ] < 1. By
the union bound, to show this, it is enough to show that $\Pr[\mathrm{BAD}_z] < 2^{-m}$ for every $z \in \{0,1\}^m$. Define the event $\mathrm{BAD}_z^i$ to hold if $z \notin S \oplus s_i$.
$$\Pr_{s \sim \{0,1\}^m}\left[z \in S \oplus s\right] \ge \frac{1}{2}. \qquad (20.16)$$
Proof Idea:
The proof uses an extremely useful technique known as the “prob-
abilistic method” which is not too hard mathematically but can be
confusing at first.⁵ The idea is to give a “non constructive” proof of the existence of the desired pseudorandom generator.

⁵ There is a whole (highly recommended) book by Alon and Spencer devoted to this method.
$$B_P = \left\{ G \in \mathcal{F}^m_\ell \;\middle|\; \left| \frac{1}{2^\ell} \sum_{s \in \{0,1\}^\ell} P(G(s)) - \frac{1}{2^m} \sum_{r \in \{0,1\}^m} P(r) \right| > \epsilon \right\} \qquad (20.17)$$
(We’ve replaced here the probability statements in (20.9) with the
equivalent sums so as to reduce confusion as to what is the sample
space that 𝐵𝑃 is defined over.)
To understand this proof it is crucial that you pause here and see
how the definition of 𝐵𝑃 above corresponds to (20.17). This may well
take re-reading the above text once or twice, but it is a good exercise
at parsing probabilistic statements and learning how to identify the
sample space that these statements correspond to.
Now, we’ve shown in Theorem 5.2 that up to renaming variables
(which makes no difference to program’s functionality) there are
2𝑂(𝑇 log 𝑇 ) NAND-CIRC programs of at most 𝑇 lines. Since 𝑇 log 𝑇 <
𝑇 2 for sufficiently large 𝑇 , this means that if Claim I is true, then
by the union bound it holds that the probability of the union of
𝐵𝑃 over all NAND-CIRC programs of at most 𝑇 lines is at most
2𝑂(𝑇 log 𝑇 ) 2−𝑇 < 0.1 for sufficiently large 𝑇 . What is important for
2
mod e l i ng r a n d omi ze d comp u tati on 573
is at most 2−𝑇 .
2
✓ Chapter Recap
20.7 EXERCISES
21
Cryptography
“A good disguise should not reveal the person’s height”, Shafi Goldwasser
and Silvio Micali, 1982
We will often write the first input (i.e., the key) to the encryption and decryption as a subscript and so can write (21.1) also as $D_k(E_k(x)) = x$.

Figure 21.6: A private-key encryption scheme is a pair of algorithms 𝐸, 𝐷 such that for every key $k \in \{0,1\}^n$ and plaintext $x \in \{0,1\}^{L(n)}$, $y = E_k(x)$ is a ciphertext of length 𝐶(𝑛). The encryption scheme is valid if for every such 𝑦, $D_k(y) = x$. That is, the decryption of an encryption of 𝑥 is 𝑥, as long as both encryption and decryption use the same key.

Solved Exercise 21.1 — Lengths of ciphertext and plaintext. Prove that for every valid encryption scheme (𝐸, 𝐷) with functions 𝐿, 𝐶, it holds that 𝐶(𝑛) ≥ 𝐿(𝑛) for every 𝑛.

■
Solution:
For every fixed key 𝑘 ∈ {0, 1}𝑛 , the equation (21.1) implies that
the map 𝑦 ↦ 𝐷𝑘 (𝑦) inverts the map 𝑥 ↦ 𝐸𝑘 (𝑥), which in partic-
ular means that the map 𝑥 ↦ 𝐸𝑘 (𝑥) must be one to one. Hence
its codomain must be at least as large as its domain, and since its
domain is {0, 1}𝐿(𝑛) and its codomain is {0, 1}𝐶(𝑛) it follows that
𝐶(𝑛) ≥ 𝐿(𝑛).
■
P
You would appreciate the subtleties of defining secu-
rity of encryption more if at this point you take a five
minute break from reading, and try (possibly with a
partner) to brainstorm on how you would mathemat-
ically define the notion that an encryption scheme is
secure, in the sense that it protects the secrecy of the
plaintext 𝑥.
R
Remark 21.2 — Randomness in the real world. Choos-
ing the secrets for cryptography requires generating
randomness, which is often done by measuring some
“unpredictable” or “high entropy” data, and then
applying hash functions to the result to “extract” a
uniformly random string. Great care must be taken in
doing this, and randomness generators often turn out
to be the Achilles heel of secure systems.
In 2006 a programmer removed a line of code from the
procedure to generate entropy in the OpenSSL package
distributed by Debian since it caused a warning in
some automatic verification code. As a result for two
years (until this was discovered) all the randomness
generated by this procedure used only the process
ID as an “unpredictable” source. This means that all
communication done by users in that period is fairly
easily breakable (and in particular, if some entities
recorded that communication they could break it also
retroactively). See XKCD’s take on that incident.
In 2012 two separate teams of researchers scanned a
large number of RSA keys on the web and found out
that about 4 percent of them are easy to break. The
main issue were devices such as routers, internet-
connected printers and such. These devices sometimes
run variants of Linux (a desktop operating system)
but without a hard drive, mouse or keyboard, they
don’t have access to many of the entropy sources that
desktops have. Coupled with some good old fashioned
ignorance of cryptography and software bugs, this led
to many keys that are downright trivial to break, see
this blog post and this web page for more details.
Since randomness is so crucial to security, breaking
the procedure to generate randomness can lead to a
complete break of the system that uses this random-
ness. Indeed, the Snowden documents, combined with other public evidence, suggest that intelligence agencies deliberately weakened at least one pseudorandom generator standardized by NIST (the Dual EC DRBG standard).
P
This definition might take more than one reading
to parse. Try to think of how this condition would
correspond to your intuitive notion of “learning no
information” about 𝑥 from observing 𝐸𝑘 (𝑥), and to
Shannon’s quote in the beginning of this chapter.
586 i n trod u c ti on to the ore ti ca l comp u te r sc i e nc e
$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\Pr[i = 0 \wedge y = E_k(x_i)]}{\Pr[y = E_k(x_i)]}. \qquad (21.2)$$

$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\frac{1}{2} p_0(y)}{\frac{1}{2} p_0(y) + \frac{1}{2} p_1(y)} = \frac{p}{p + p} = \frac{1}{2} \qquad (21.3)$$
using the fact that 𝑝0 (𝑦) = 𝑝1 (𝑦) = 𝑝. This means that observing the
ciphertext 𝑦 did not help us at all! We still would not be able to guess
whether Alice sent “attack” or “retreat” with better than 50/50 odds!
This example can be vastly generalized to show that perfect secrecy
is indeed “perfect” in the sense that observing a ciphertext gives Eve
no additional information about the plaintext beyond her a priori knowl-
edge.
Theorem 21.4 — One Time Pad (Vernam 1917, Shannon 1949). There is a perfectly secret valid encryption scheme (𝐸, 𝐷) with 𝐿(𝑛) = 𝐶(𝑛) = 𝑛.
P
The argument above is quite simple but is worth reading again. To understand why the one-time pad is perfectly secret, it is useful to envision it as a bipartite graph as we've done in Fig. 21.8. (In fact the encryption scheme of Fig. 21.8 is precisely the one-time pad for 𝑛 = 2.) For every 𝑛, the one-time pad encryption scheme corresponds to a bipartite graph with $2^n$ vertices on the “left side” corresponding to the plaintexts in $\{0,1\}^n$ and $2^n$ vertices on the “right side” corresponding to the ciphertexts $\{0,1\}^n$. For every $x \in \{0,1\}^n$ and $k \in \{0,1\}^n$, we connect 𝑥 to the vertex $y = E_k(x)$ with an edge that we label with 𝑘. One can see that this is the complete bipartite graph, where every vertex on the left is connected to all vertices on the right. In particular this means that for every left vertex 𝑥, the distribution on the ciphertexts obtained by taking a random $k \in \{0,1\}^n$ and going to the neighbor of 𝑥 on the edge labeled 𝑘 is the uniform distribution over $\{0,1\}^n$. This ensures the perfect secrecy condition.

Figure 21.9: In the one time pad encryption scheme we encrypt a plaintext $x \in \{0,1\}^n$ with a key $k \in \{0,1\}^n$ by the ciphertext $x \oplus k$ where ⊕ denotes the bitwise XOR operation.
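The one-time pad is short enough to state in a few lines of Python (an illustrative sketch operating on bit lists):

import random

def otp_keygen(n):
    """Sample a uniformly random n-bit key."""
    return [random.randint(0, 1) for _ in range(n)]

def otp_encrypt(key, plaintext):
    """E_k(x) = x XOR k, bit by bit."""
    return [x ^ k for x, k in zip(plaintext, key)]

def otp_decrypt(key, ciphertext):
    """D_k(y) = y XOR k; decryption is identical to encryption."""
    return [y ^ k for y, k in zip(ciphertext, key)]

x = [1, 0, 1, 1]
k = otp_keygen(len(x))
assert otp_decrypt(k, otp_encrypt(k, x)) == x  # validity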
Proof Idea:
The idea behind the proof is illustrated in Fig. 21.11. We define a graph between the plaintexts and ciphertexts, where we put an edge between plaintext 𝑥 and ciphertext 𝑦 if there is some key 𝑘 such that $y = E_k(x)$.

Figure 21.10: Gene Grabeel, who founded the U.S. Russian SigInt program on 1 Feb 1943. Photo taken in 1942, see Page 7 in the Venona historical study.
How does this mesh with the fact that, as we’ve already seen, peo-
ple routinely use cryptosystems with a 16 byte (i.e., 128 bit) key but
many terabytes of plaintext? The proof of Theorem 21.5 does give in
fact a way to break all these cryptosystems, but an examination of this
proof shows that it only yields an algorithm with time exponential in
the length of the key. This motivates the following relaxation of perfect
secrecy to a condition known as “computational secrecy”. Intuitively,
an encryption scheme is computationally secret if no polynomial time
algorithm can break it. The formal definition is below:
P
Definition 21.6 requires a second or third read and
some practice to truly understand. One excellent exer-
cise to make sure you follow it is to see that if we allow
𝑃 to be an arbitrary function mapping {0, 1}𝑚(𝑛) to
{0, 1}, and we replace the condition in (21.4) that the
lefthand side is smaller than $\frac{1}{p(n)}$ with the condition
that it is equal to 0 then we get the perfect secrecy
condition of Definition 21.3. Indeed if the distributions
𝐸𝑘 (𝑥0 ) and 𝐸𝑘 (𝑥1 ) are identical then applying any
function 𝑃 to them we get the same expectation. On
the other hand, if the two distributions above give a
different probability for some element 𝑦∗ ∈ {0, 1}𝑚(𝑛) ,
then the function 𝑃 (𝑦) that outputs 1 iff 𝑦 = 𝑦∗ will
have a different expectation under the former distribu-
tion than under the latter.
Regarding the first question, it is not hard to show that if, for ex-
ample, Alice uses a computationally secret encryption algorithm to
encrypt either “attack” or “retreat” (each chosen with probability 1/2), then no polynomial-time eavesdropper can guess which message was sent with probability noticeably better than 1/2.
$$\left| \Pr_{s \sim \{0,1\}^\ell}[C(G(s)) = 1] - \Pr_{r \sim \{0,1\}^m}[C(r) = 1] \right| < \frac{1}{p(n)}. \qquad (21.5)$$
Proof Idea:
The proof is illustrated in Fig. 21.12. We simply take the one-time
pad on 𝐿 bit plaintexts, but replace the key with 𝐺(𝑘) where 𝑘 is a
string in {0, 1}𝑛 and 𝐺 ∶ {0, 1}𝑛 → {0, 1}𝐿 is a pseudorandom gen-
erator. Since the one time pad cannot be broken, an adversary that
breaks the derandomized one-time pad can be used to distinguish
between the output of the pseudorandom generator and the uniform
distribution.
⋆
Proof of Theorem 21.8. Let $G : \{0,1\}^n \rightarrow \{0,1\}^L$ for $L = n^a$ be the restriction to input length 𝑛 of the pseudorandom generator 𝐺 whose
existence we are guaranteed from the crypto PRG conjecture. We
now define our encryption scheme as follows: given key 𝑘 ∈ {0, 1}𝑛
and plaintext 𝑥 ∈ {0, 1}𝐿 , the encryption 𝐸𝑘 (𝑥) is simply 𝑥 ⊕ 𝐺(𝑘).
To decrypt a string 𝑦 ∈ {0, 1}𝑚 we output 𝑦 ⊕ 𝐺(𝑘). This is a valid
encryption since 𝐺 is computable in polynomial time and (𝑥 ⊕ 𝐺(𝑘)) ⊕
𝐺(𝑘) = 𝑥 ⊕ (𝐺(𝑘) ⊕ 𝐺(𝑘)) = 𝑥 for every 𝑥 ∈ {0, 1}𝐿 .
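Concretely, this derandomized one-time pad can be sketched as follows (illustrative Python; insecure_prg is a stand-in for the hypothetical generator 𝐺 of the conjecture, built here from a hash function for demonstration only and not to be used for real cryptography):

import hashlib

def insecure_prg(key_bytes, out_len):
    """Stand-in for G: expands a short key into out_len pseudorandom
    bytes by hashing (key, counter). For illustration only."""
    out = b""
    counter = 0
    while len(out) < out_len:
        out += hashlib.sha256(key_bytes + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:out_len]

def encrypt(key_bytes, plaintext):
    """E_k(x) = x XOR G(k): a one-time pad keyed by the PRG output."""
    pad = insecure_prg(key_bytes, len(plaintext))
    return bytes(x ^ p for x, p in zip(plaintext, pad))

decrypt = encrypt  # XORing with G(k) again recovers the plaintext

msg = b"attack at dawn"
key = b"short key"
assert decrypt(key, encrypt(key, msg)) == msg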
Computational secrecy follows from the condition of a pseudo-
random generator. Suppose, towards a contradiction, that there is
a polynomial 𝑝, a NAND-CIRC program 𝑄 of at most 𝑝(𝐿) lines and $x, x' \in \{0,1\}^{L(n)}$ such that
(We use here the simple fact that for a {0, 1}-valued random variable
𝑋, Pr[𝑋 = 1] = 𝔼[𝑋].)
By the definition of our encryption scheme, this means that
Now, since (as we saw in the security analysis of the one-time pad) for every pair of strings $x, x' \in \{0,1\}^L$, the distributions $r \oplus x$ and $r \oplus x'$ are identical, where $r \sim \{0,1\}^L$, it follows that
$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] + \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)}. \qquad (21.9)$$
(Please make sure that you can see why this is true.)
Now we can use the triangle inequality that |𝐴 + 𝐵| ≤ |𝐴| + |𝐵| for
every two numbers 𝐴, 𝐵, applying it for 𝐴 = 𝔼𝑘∼{0,1}𝑛 [𝑄(𝐺(𝑘) ⊕ 𝑥)] −
𝔼𝑟∼{0,1}𝐿 [𝑄(𝑟⊕𝑥)] and 𝐵 = 𝔼𝑟∼{0,1}𝐿 [𝑄(𝑟⊕𝑥′ )]−𝔼𝑘∼{0,1}𝑛 [𝑄(𝐺(𝑘)⊕𝑥′ )]
to derive
$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] \right| + \left| \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)}. \qquad (21.10)$$
In particular, either the first term or the second term of the lefthand side of (21.10) must be at least $\frac{1}{2p(L)}$. Let us assume the first
. Let us assume the first
case holds (the second case is analyzed in exactly the same way).
Then we get that
R
Remark 21.9 — Stream ciphers in practice. The two
most widely used forms of (private key) encryption
schemes in practice are stream ciphers and block ciphers.
(To make things more confusing, a block cipher is
always used in some mode of operation and some
of these modes effectively turn a block cipher into
a stream cipher.) A block cipher can be thought as
a sort of a “random invertible map” from {0, 1}𝑛 to
{0, 1}𝑛 , and can be used to construct a pseudorandom
generator and from it a stream cipher, or to encrypt
data directly using other modes of operations. There
are a great many other security notions and consider-
ations for encryption schemes beyond computational
secrecy. Many of those involve handling scenarios
such as chosen plaintext, man in the middle, and cho-
sen ciphertext attacks, where the adversary is not just
merely a passive eavesdropper but can influence the
communication in some way. While this chapter is
Theorem 21.10 — Breaking encryption using NP algorithm. If P = NP then there is no computationally secret encryption scheme with 𝐿(𝑛) > 𝑛.
Furthermore, for every valid encryption scheme (𝐸, 𝐷) with
𝐿(𝑛) > 𝑛 + 100 there is a polynomial 𝑝 such that for every large
enough 𝑛 there exist 𝑥0 , 𝑥1 ∈ {0, 1}𝐿(𝑛) and a 𝑝(𝑛)-line NAND-
CIRC program EVE s.t.
We will now use the following extremely simple but useful fact
known as the averaging principle (see also Lemma 18.10): for every
random variable 𝑍, if 𝔼[𝑍] = 𝜇, then with positive probability 𝑍 ≤ 𝜇.
(Indeed, if 𝑍 > 𝜇 with probability one, then the expected value of 𝑍
will have to be larger than 𝜇, just like you can’t have a class in which
all students got A or A- and yet the overall average is B+.) In our case
it means that with positive probability $\sum_{k \in \{0,1\}^n} Z_k \le \frac{2^{2n}}{2^{L(n)}}$. In other words, there exists some $x_1 \in \{0,1\}^{L(n)}$ such that $\sum_{k \in \{0,1\}^n} Z_k(x_1) \le \frac{2^{2n}}{2^{L(n)}}$. Yet this means that if we choose a random $k \sim \{0,1\}^n$, then the probability that $E_k(x_1) \in S_0$ is at most $\frac{1}{2^n} \cdot \frac{2^{2n}}{2^{L(n)}} = 2^{n - L(n)}$.
and Hellman’s elusive trapdoor function. This was done the next year
by Rivest, Shamir and Adleman who came up with the RSA trapdoor
function, which through the framework of Diffie and Hellman yielded
not just encryption but also signatures. (A close variant of the RSA
function was discovered earlier by Clifford Cocks at GCHQ, though
as far as I can tell Cocks, Ellis and Williamson did not realize the
application to digital signatures.) From this point on began a flurry of
advances in cryptography which hasn't died down to this day.
• Bob: Given the triple (𝑝, 𝑔, ℎ), Bob sends a message $x \in \{0,1\}^L$ to Alice by choosing 𝑏 at random in [𝑝], and sending to Alice the pair $(g^b \mod p, \; rep(h^b \mod p) \oplus x)$ where $rep : [p] \rightarrow \{0,1\}^*$ is some “representation function” that maps [𝑝] to $\{0,1\}^L$. (The function 𝑟𝑒𝑝 does not need to be one-to-one, and you can think of 𝑟𝑒𝑝(𝑧) as simply outputting 𝐿 of the bits of 𝑧 in the natural binary representation; it does, however, need to satisfy certain technical conditions which we omit in this description.)
The correctness of the protocol follows from the simple fact that $(g^a)^b = (g^b)^a$ for every 𝑔, 𝑎, 𝑏 and this still holds if we work modulo 𝑝. Its security relies on the computational assumption that computing
this map is hard, even in a certain “average case” sense (this computa-
tional assumption is known as the Decisional Diffie Hellman assump-
tion). The Diffie-Hellman key exchange protocol can be thought of as
a public key encryption where the Alice’s first message is the public
key, and Bob’s message is the encryption.
One can think of the Diffie-Hellman protocol as being based on a “trapdoor pseudorandom generator” where the triple $g^a, g^b, g^{ab}$ looks “random” to someone that doesn't know 𝑎, but someone that does know 𝑎 can see that raising the second element to the 𝑎-th power yields the third element. The Diffie-Hellman protocol can be described
abstractly in the context of any finite Abelian group for which we can
efficiently compute the group operation. It has been implemented
on other groups than numbers modulo 𝑝, and in particular Elliptic
Curve Cryptography (ECC) is obtained by basing the Diffie Hell-
man on elliptic curve groups which gives some practical advantages.
Another common group theoretic basis for key-exchange/public key
encryption protocol is the RSA function. A big disadvantage of Diffie-
Hellman (both the modular arithmetic and elliptic curve variants)
and RSA is that both schemes can be broken in polynomial time by a quantum computer running Shor's algorithm.
21.10 MAGIC
Beyond encryption and signature schemes, cryptographers have man-
aged to obtain objects that truly seem paradoxical and “magical”. We
briefly discuss some of these objects. We do not give any details, but
hopefully this will spark your curiosity to find out more.
✓ Chapter Recap
21.11 EXERCISES
Because sub-exponential algorithms are known for the discrete logarithm modulo a prime (assuming the known algorithms are optimal), we need to set the prime to be bigger (and so have larger key sizes with corresponding overhead in communication and computation) to get the same level of security.
Zero-knowledge proofs were constructed by Goldwasser, Micali,
and Rackoff in 1982, and their wide applicability was shown (using
the theory of NP completeness) by Goldreich, Micali, and Wigderson
in 1986.
Two-party and multiparty secure computation protocols were constructed (respectively) by Yao in 1982 and by Goldreich, Micali, and Wigderson in 1987. The latter work gave a general transformation
from security against passive adversaries to security against active
adversaries using zero knowledge proofs.
22
Proofs and algorithms
“Let’s not try to define knowledge, but try to define zero-knowledge.”, Shafi
Goldwasser.
• Interactive proofs
22.1 EXERCISES
Quantum computing
“We always have had (secret, secret, close the doors!) … a great deal of diffi-
culty in understanding the world view that quantum mechanics represents …
It has not yet become obvious to me that there’s no real problem. … Can I learn
anything from asking this question about computers–about this may or may
not be mystery as to what the world view of quantum mechanics is?”, Richard Feynman, 1981
“The only difference between a probabilistic classical world and the equations
of the quantum world is that somehow or other it appears as if the probabilities
would have to go negative”, Richard Feynman, 1981
Specifically, consider an event that can either occur or not (e.g. “de-
tector number 17 was hit by a photon”). In classical probability, we
model this by a probability distribution over the two outcomes: a pair
of non-negative numbers 𝑝 and 𝑞 such that 𝑝 + 𝑞 = 1, where 𝑝 corre-
sponds to the probability that the event occurs and 𝑞 corresponds to
the probability that the event does not occur. In quantum mechanics, we model this also by a pair of numbers, which we call amplitudes. This
is a pair of (potentially negative or even complex) numbers 𝛼 and 𝛽
such that |𝛼|2 + |𝛽|2 = 1. The probability that the event occurs is |𝛼|2
and the probability that it does not occur is |𝛽|2 . In isolation, these
negative or complex numbers don’t matter much, since we anyway
square them to obtain probabilities. But the interaction of positive and
negative amplitudes can result in surprising cancellations where some-
how combining two scenarios where an event happens with positive
probability results in a scenario where it never does.
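As a small illustration of such cancellations (ours, not the book’s), the numpy snippet below applies the amplitude matrix 𝐻 that appears later in this chapter twice to the state “the bit is 0”: after one application both outcomes have probability 1/2, but after two applications the two paths leading to outcome 1 carry opposite signs and cancel exactly. The analogous non-negative “coin tossing” matrix 𝐹 shows no such cancellation.

import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # amplitudes may be negative
F = np.array([[0.5, 0.5], [0.5, 0.5]])        # a classical "coin toss" map

state = np.array([1.0, 0.0])                  # the event surely does not occur

print(H @ state)        # [0.707 0.707]: each outcome now has probability 1/2
print(H @ (H @ state))  # [1. 0.] (up to rounding): the paths to outcome 1 cancel
print(F @ (F @ state))  # [0.5 0.5]: with non-negative numbers, no cancellation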
P
If you don’t find the above description confusing and
unintuitive, you probably didn’t get it. Please make
sure to re-read the above paragraphs until you are
thoroughly confused.
R
Remark 23.1 — Complex vs real, other simplifications. If
(like the author) you are a bit intimidated by complex
numbers, don’t worry: you can think of all ampli-
tudes as real (though potentially negative) numbers
without loss of understanding. All the “magic” of
quantum computing already arises in this case, and
so we will often restrict attention to real amplitudes in
this chapter.
We will also only discuss so-called pure quantum
states, and not the more general notion of mixed states.
Pure states turn out to be sufficient for understanding
the algorithmic aspects of quantum computing.
More generally, this chapter is not meant to be a complete description of quantum mechanics, quantum information theory, or quantum computing, but rather to illustrate the main points where these differ from classical computing.
rooms. You will interrogate Alice and your associate will interrogate
Bob. You choose a random bit 𝑥 ∈ {0, 1} and your associate chooses
a random 𝑦 ∈ {0, 1}. We let 𝑎 be Alice’s response and 𝑏 be Bob’s
response. We say that Alice and Bob win this experiment if 𝑎 ⊕ 𝑏 =
𝑥 ∧ 𝑦. In other words, Alice and Bob need to output two bits that
disagree if 𝑥 = 𝑦 = 1 and agree otherwise.
Now if Alice and Bob are not telepathic, then they need to agree in
advance on some strategy. It’s not hard for Alice and Bob to succeed
with probability 3/4: just always output the same bit. Moreover, by
doing some case analysis, we can show that no matter what strategy
they use, Alice and Bob cannot succeed with higher probability than
that:
R
Remark 23.3 — Randomized strategies. Theorem 23.2
above assumes that Alice and Bob use deterministic
strategies 𝑓 and 𝑔 respectively. More generally, Alice
and Bob could use a randomized strategy, or equiva-
lently, each could choose 𝑓 and 𝑔 from some distri-
butions ℱ and 𝒢 respectively. However the averaging
principle (Lemma 18.10) implies that if all possible
deterministic strategies succeed with probability at
most 3/4, then the same is true for all randomized
strategies.
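Since each player’s deterministic strategy is just a function from {0,1} to {0,1}, there are only 4 × 4 = 16 strategy pairs, and the case analysis can be checked mechanically. Here is a short Python sketch (an illustration, not the book’s code):

from itertools import product

# A deterministic strategy is a function {0,1} -> {0,1}, encoded as the
# pair (output on input 0, output on input 1).
strategies = list(product([0, 1], repeat=2))

best = max(sum(1 for x in [0, 1] for y in [0, 1]
               if (f[x] ^ g[y]) == (x & y)) / 4
           for f in strategies for g in strategies)
print(best)  # 0.75: no deterministic strategy wins with probability above 3/4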
could turn out to be useful for problems that a-priori seemed to have
nothing to do with quantum physics.
As we’ll discuss later, at the moment there are several intensive
efforts to construct large scale quantum computers. It seems safe
to say that, as far as we know, in the next five years or so there will
not be a quantum computer large enough to factor, say, a 1024 bit
number. On the other hand, it does seem quite likely that in the very
near future there will be quantum computers which achieve some task
exponentially faster than the best-known way to achieve the same
task with a classical computer. When and if a quantum computer is
built that is strong enough to break reasonable parameters of Diffie
Hellman, RSA and elliptic curve cryptography is anybody’s guess. It
could also be a “self destroying prophecy” whereby the existence of
a small-scale quantum computer would cause everyone to shift away
to lattice-based crypto which in turn will diminish the motivation
to invest the huge resources needed to build a large scale quantum computer.5

5 Of course, given that we’re still hearing of attacks exploiting “export grade” cryptography that was supposed to disappear in 1990’s, I imagine that we’ll still have products running 1024 bit RSA when everyone has a quantum laptop.

R
Remark 23.4 — Quantum computing and NP. Despite
popular accounts of quantum computers as having
variables that can take “zero and one at the same
time” and therefore can “explore an exponential num-
ber of possibilities simultaneously”, their true power
is much more subtle and nuanced. In particular, as far
as we know, quantum computers do not enable us to
solve NP complete problems such as 3SAT in polyno-
mial or even sub-exponential time. However, Grover’s
search algorithm does give a more modest advan-
tage (namely, quadratic) for quantum computers
over classical ones for problems in NP. In particular,
due to Grover’s search algorithm, we know that the
𝑘-SAT problem for 𝑛 variables can be solved in time 𝑂(2^{𝑛/2} · poly(𝑛)) on a quantum computer for every 𝑘. In contrast, the best known algorithms for 𝑘-SAT on a classical computer take roughly 2^{(1−1/𝑘)𝑛} steps.
Big Idea 28 Quantum computers are not a panacea and are un-
likely to solve NP complete problems, but they can provide exponen-
tial speedups to certain structured problems.
𝑒_{𝑠_0 ⋯ 𝑠_16 (1−𝑠_3⋅𝑠_5) 𝑠_18 ⋯ 𝑠_{𝑛−1}}. (Since {𝑒_𝑠}_{𝑠∈{0,1}^𝑛} is a basis for ℝ^{2^𝑛}, it suffices to
P
Please make sure you understand why performing the
operation will take a system in state 𝑝 to a system in
the state 𝐹 𝑝. Understanding the evolution of proba-
bilistic systems is a prerequisite to understanding the
evolution of quantum systems.
If your linear algebra is a bit rusty, now would be a
good time to review it, and in particular make sure
you are comfortable with the notions of matrices, vec-
tors, (orthogonal and orthonormal) bases, and norms.
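For instance (an illustrative example, not from the text), here is a two-bit probabilistic system in numpy: the state is a vector of four non-negative numbers summing to one, indexed by the strings 00, 01, 10, 11, and the operation “flip the second bit” corresponds to a permutation matrix 𝐹:

import numpy as np

# A probabilistic state of two bits, indexed by the strings 00, 01, 10, 11.
p = np.array([0.5, 0.25, 0.25, 0.0])

# "Flip the second bit" swaps e_00 <-> e_01 and e_10 <-> e_11:
F = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])

print(F @ p)  # [0.25 0.5  0.   0.25]: the system is now in the state Fp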
𝑁 = ( 0  1
      1  0 )      (23.3)

𝐻 = (1/√2) ( +1  +1
             +1  −1 )      (23.4)
Proof. Alice and Bob will start by preparing a 2-qubit quantum system
in the state
𝜓 = (1/√2)|00⟩ + (1/√2)|11⟩      (23.5)
(this state is known as an EPR pair). Alice takes the first qubit
of the system to her room, and Bob takes the second qubit to his
room. Now, when Alice receives 𝑥 if 𝑥 = 0 she does nothing and
if 𝑥 = 1 she applies the unitary map 𝑅−𝜋/8 to her qubit where
𝑅_𝜃 = ( cos 𝜃  −sin 𝜃
        sin 𝜃   cos 𝜃 ) is the unitary operation corresponding to
rotation in the plane with angle 𝜃. When Bob receives 𝑦, if 𝑦 = 0 he
does nothing and if 𝑦 = 1 he applies the unitary map 𝑅𝜋/8 to his qubit.
Then each one of them measures their qubit and sends this as their
response.
Recall that to win the game Bob and Alice want their outputs to
be more likely to differ if 𝑥 = 𝑦 = 1 and to be more likely to agree
otherwise. We will split the analysis in one case for each of the four
possible values of 𝑥 and 𝑦.
Case 1: 𝑥 = 0 and 𝑦 = 0. If 𝑥 = 𝑦 = 0 then the state does not
change. Because the state 𝜓 is proportional to |00⟩ + |11⟩, the measure-
ments of Bob and Alice will always agree (if Alice measures 0 then the
state collapses to |00⟩ and so Bob measures 0 as well, and similarly for
1). Hence in the case 𝑥 = 𝑦 = 0, Alice and Bob always win.
Case 2: 𝑥 = 0 and 𝑦 = 1. If 𝑥 = 0 and 𝑦 = 1 then after Alice
measures her bit, if she gets 0 then the system collapses to the state
|00⟩, in which case after Bob performs his rotation, his qubit is in
the state cos(𝜋/8)|0⟩ + sin(𝜋/8)|1⟩. Thus, when Bob measures his
qubit, he will get 0 (and hence agree with Alice) with probability
cos2 (𝜋/8) ≥ 0.85. Similarly, if Alice gets 1 then the system collapses
to |11⟩, in which case after rotation Bob’s qubit will be in the state
− sin(𝜋/8)|0⟩ + cos(𝜋/8)|1⟩ and so once again he will agree with Alice
with probability cos2 (𝜋/8).
The analysis for Case 3, where 𝑥 = 1 and 𝑦 = 0, is completely
analogous to Case 2. Hence Alice and Bob will agree with probability
cos2 (𝜋/8) in this case as well. (To show this we use the observation
that the result of this experiment is the same regardless of the order
in which Alice and Bob apply their rotations and measurements; this
requires a proof but is not very hard to show.)
Case 4: 𝑥 = 1 and 𝑦 = 1. For the case that 𝑥 = 1 and 𝑦 = 1,
after both Alice and Bob perform their rotations, the state will be
proportional to
R
Remark 23.6 — Quantum vs probabilistic strategies. It
is instructive to understand what is it about quan-
tum mechanics that enabled this gain in Bell’s
Inequality. For this, consider the following anal-
ogous probabilistic strategy for Alice and Bob.
They agree that each one of them outputs 0 if he or she gets 0 as input, and outputs 1 with probability 𝑝 if they get 1 as input. In this case one can see that their success probability would be (1/4)⋅1 + (1/2)(1−𝑝) + (1/4)[2𝑝(1−𝑝)] = 0.75 − 0.5𝑝² ≤ 0.75.
The quantum strategy we described above can be thought of as a variant of the probabilistic strategy for parameter 𝑝 set to sin²(𝜋/8) ≈ 0.15. But in the case 𝑥 = 𝑦 = 1, instead of disagreeing only with probability 2𝑝(1−𝑝) = 1/4, the existence of the so-called “negative probabilities” in the quantum world allowed us to rotate the state in opposing directions to achieve destructive interference and hence a higher probability of disagreement, namely sin²(𝜋/4) = 0.5.
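The case analysis above can also be verified numerically. The following numpy sketch (ours, not the book’s) prepares the EPR pair, applies the rotations from the proof for each of the four input pairs, and averages the winning probability; it prints roughly 0.8, comfortably above the classical bound of 3/4.

import numpy as np

def R(theta):
    # Rotation of a single qubit by angle theta.
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

I2 = np.eye(2)
psi = np.array([1.0, 0, 0, 1.0]) / np.sqrt(2)  # the EPR pair (|00> + |11>)/sqrt(2)

total = 0.0
for x in [0, 1]:
    for y in [0, 1]:
        A = R(-np.pi / 8) if x else I2   # Alice rotates only if x = 1
        B = R(np.pi / 8) if y else I2    # Bob rotates only if y = 1
        probs = (np.kron(A, B) @ psi) ** 2   # amplitudes are real here
        total += sum(probs[2 * a + b]
                     for a in [0, 1] for b in [0, 1]
                     if (a ^ b) == (x & y)) / 4

print(total)  # about 0.80 > 0.75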
columns as 000, 001, 010, … , 111, then 𝑈𝑁𝐴𝑁𝐷 can be written as the
following matrix:
𝑈_NAND = ( 0 1 0 0 0 0 0 0
           1 0 0 0 0 0 0 0
           0 0 0 1 0 0 0 0
           0 0 1 0 0 0 0 0
           0 0 0 0 0 1 0 0
           0 0 0 0 1 0 0 0
           0 0 0 0 0 0 1 0
           0 0 0 0 0 0 0 1 )      (23.8)
If we have an 𝑛 qubit system, then for 𝑖, 𝑗, 𝑘 ∈ [𝑛], we will denote by 𝑈_NAND^{𝑖,𝑗,𝑘} the 2^𝑛 × 2^𝑛 unitary matrix that corresponds to applying 𝑈_NAND to the 𝑖-th, 𝑗-th, and 𝑘-th bits, leaving the others intact. That is, for every 𝑣 = ∑_{𝑥∈{0,1}^𝑛} 𝑣_𝑥|𝑥⟩, 𝑈_NAND^{𝑖,𝑗,𝑘} 𝑣 = ∑_{𝑥∈{0,1}^𝑛} 𝑣_𝑥 |𝑥_0 ⋯ 𝑥_{𝑘−1} (𝑥_𝑘 ⊕ NAND(𝑥_𝑖, 𝑥_𝑗)) 𝑥_{𝑘+1} ⋯ 𝑥_{𝑛−1}⟩.
As mentioned above, we will also use the Hadamard or HAD operation. A quantum circuit is obtained by applying a sequence of 𝑈_NAND and HAD gates, where a HAD gate corresponds to applying the matrix

𝐻 = (1/√2) ( +1  +1
             +1  −1 )      (23.9)

Another way to define 𝐻 is that for 𝑏 ∈ {0, 1}, 𝐻|𝑏⟩ = (1/√2)|0⟩ + (1/√2)(−1)^𝑏|1⟩. We define HAD^𝑖 to be the 2^𝑛 × 2^𝑛 unitary matrix that applies HAD to the 𝑖-th qubit and leaves the others intact. Using the ket notation, we can write this as

HAD^𝑖 ∑_{𝑥∈{0,1}^𝑛} 𝑣_𝑥|𝑥⟩ = (1/√2) ∑_{𝑥∈{0,1}^𝑛} 𝑣_𝑥 |𝑥_0 ⋯ 𝑥_{𝑖−1}⟩ (|0⟩ + (−1)^{𝑥_𝑖}|1⟩) |𝑥_{𝑖+1} ⋯ 𝑥_{𝑛−1}⟩ .      (23.10)
A quantum circuit is obtained by composing these basic operations
on some 𝑚 qubits. If 𝑚 ≥ 𝑛, we use a circuit to compute a function
𝑓 ∶ {0, 1}𝑛 → {0, 1}:
• We say that the circuit computes the function 𝑓 if the probability that this output equals 𝑓(𝑥) is at least 2/3. Note that this probability is obtained by summing up the squares of the amplitudes of all coordinates in the final state of the system corresponding to vectors |𝑦⟩ where 𝑦_{𝑚−1} = 𝑓(𝑥):

∑_{𝑦∈{0,1}^𝑚 s.t. 𝑦_{𝑚−1}=𝑓(𝑥)} |𝑣_𝑦|² ≥ 2/3 .      (23.12)
P
Please stop here and see that this definition makes
sense to you.
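To check your understanding, it may help to simulate these operations directly. The sketch below (an illustration under our own naming and bit-ordering conventions, with 𝑥_0 as the most significant bit) applies HAD^𝑖 and 𝑈_NAND^{𝑖,𝑗,𝑘} to a state vector of length 2^𝑛:

import numpy as np

def had(v, i, n):
    # Apply HAD to qubit i of an n-qubit state vector v of length 2**n.
    w = np.zeros_like(v)
    for x in range(2 ** n):
        mask = 1 << (n - 1 - i)          # position of bit i in x_0 ... x_{n-1}
        b = (x >> (n - 1 - i)) & 1
        # HAD maps |b> to (|0> + (-1)^b |1>)/sqrt(2):
        w[x & ~mask] += v[x] / np.sqrt(2)
        w[x | mask] += ((-1) ** b) * v[x] / np.sqrt(2)
    return w

def u_nand(v, i, j, k, n):
    # Apply U_NAND^{i,j,k}: XOR NAND(x_i, x_j) into x_k.
    w = np.zeros_like(v)
    for x in range(2 ** n):
        xi = (x >> (n - 1 - i)) & 1
        xj = (x >> (n - 1 - j)) & 1
        if 1 - xi * xj:                  # NAND(x_i, x_j) = 1: flip bit k
            w[x ^ (1 << (n - 1 - k))] += v[x]
        else:
            w[x] += v[x]
    return w

v = np.zeros(4); v[0] = 1.0   # the two-qubit state |00>
v = had(v, 0, 2)              # (|00> + |10>)/sqrt(2)
v = u_nand(v, 0, 0, 1, 2)     # XOR NOT(x_0) into x_1, since NAND(a,a) = NOT(a)
print(v)                      # amplitude 1/sqrt(2) on |01> and on |10>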
R
Remark 23.9 — The obviously exponential fallacy. A
priori it might seem “obvious” that quantum com-
puting is exponentially powerful, since to perform a
quantum computation on 𝑛 bits we need to maintain
the 2𝑛 dimensional state vector and apply 2𝑛 × 2𝑛 ma-
trices to it. Indeed popular descriptions of quantum
computing (too) often say something along the lines
that the difference between quantum and classical
computer is that a classical bit can either be zero or
one while a qubit can be in both states at once, and
so in many qubits a quantum computer can perform
exponentially many computations at once.
Depending on how you interpret it, this description
is either false or would apply equally well to proba-
bilistic computation, even though we’ve already seen
that every randomized algorithm can be simulated by
a similar-sized circuit, and in fact we conjecture that
BPP = P.
Moreover, this “obvious” approach for simulating a quantum computation will take not just exponential time but exponential space as well, while it can be shown that using a simple recursive formula one can calculate the final quantum state using polynomial space (in physics this is known as “Feynman path integrals”).
So, the exponentially long vector description by itself
does not imply that quantum computers are exponen-
tially powerful. Indeed, we cannot prove that they are
(i.e., we have not been able to rule out the possibility
that every QNAND-CIRC program could be simu-
lated by a NAND-CIRC program/ Boolean circuit with
polynomial overhead), but we do have some problems
(integer factoring most prominently) for which they
do provide exponential speedup over the currently
best known classical (deterministic or probabilistic)
algorithms.
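The recursive formula alluded to in the remark can be sketched as follows (an illustration under our own conventions, not the book’s code): to compute a single amplitude ⟨𝑦|𝑈_𝑇 ⋯ 𝑈_1|𝑥⟩ we sum, over all intermediate basis states 𝑧, the last gate’s entry times a recursively computed amplitude. The running time is exponential, but the recursion depth is only the number of gates, so the space is polynomial if each gate’s entries are computed on the fly rather than stored as full matrices (as they are in this toy version).

import numpy as np

def amplitude(gates, y, x, n):
    # <y| U_T ... U_1 |x> for a circuit given as a list of 2^n x 2^n matrices.
    # Exponential time, but the recursion stack is only O(T) deep.
    if not gates:
        return 1.0 if y == x else 0.0
    return sum(gates[-1][y, z] * amplitude(gates[:-1], z, x, n)
               for z in range(2 ** n))

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
print(amplitude([H, H], 0, 0, 1))  # 1.0
print(amplitude([H, H], 1, 0, 1))  # 0.0: the two paths to |1> cancel out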
Definition 23.10 — The class BQP. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. We say that
𝐹 ∈ BQP if there exists a polynomial time NAND-TM program 𝑃
such that for every 𝑛, 𝑃 (1𝑛 ) is the description of a quantum circuit
𝐶𝑛 that computes the restriction of 𝐹 to {0, 1}𝑛 .
P
Definition 23.10 is the quantum analog of the alter-
native characterization of P that appears in Solved
Exercise 13.4. One way to verify that you’ve under-
stood Definition 23.10 is to see that you can prove
(1) P ⊆ BQP and in fact the stronger statement
BPP ⊆ BQP, (2) BQP ⊆ EXP, and (3) For every
NP-complete function 𝐹 , if 𝐹 ∈ BQP then NP ⊆ BQP.
Exercise 23.1 asks you to work these out.
The relation between NP and BQP is not known (see also Re-
mark 23.4). It is widely believed that NP ⊈ BQP, but there is no
consensus whether or not BQP ⊆ NP. It is quite possible that these
two classes are incomparable, in the sense that NP ⊈ BQP (and in par-
ticular no NP-complete function belongs to BQP) but also BQP ⊈ NP
(and there are some interesting candidates for such problems).
It can be shown that QNANDEVAL (evaluating a quantum circuit
on an input) is computable by a polynomial size QNAND-CIRC pro-
gram, and moreover this program can even be generated uniformly
and hence QNANDEVAL is in BQP. This allows us to “port” many
of the results of classical computational complexity into the quantum
realm, including the notions of a universal quantum Turing machine,
as well as all of the uncomputability results. There is even a quantum
analog of the Cook-Levin Theorem.
R
Remark 23.11 — Restricting attention to circuits. Because
the non-uniform model is a little cleaner to work with,
in the rest of this chapter we mostly restrict attention
to this model, though all the algorithms we discuss
can be implemented using uniform algorithms as well.
Musical notes yield one type of periodic function. When you pull
on a string on a musical instrument, it vibrates in a repeating pattern.
Hence, if we plot the speed of the string (and so also the speed of
the air around it) as a function of time, it will correspond to some
periodic function. The length of the period is known as the wave length
of the note. The frequency is the number of times the function repeats
itself within a unit of time. For example, the “Middle C” note has
a frequency of 261.63 Hertz, which means its period is 1/(261.63)
seconds.
If we play a chord by playing several notes at once, we get a more
complex periodic function obtained by combining the functions of
the individual notes (see Fig. 23.5). The human ear contains many
small hairs, each of which is sensitive to a narrow band of frequencies.
Hence when we hear the sound corresponding to a chord, the hairs in
our ears actually separate it out to the components corresponding to
each frequency.
It turns out that (essentially) every periodic function 𝑓 ∶ ℝ → ℝ
can be decomposed into a sum of simple wave functions (namely
functions of the form 𝑥 ↦ sin(𝜃𝑥) or 𝑥 ↦ cos(𝜃𝑥)). This is known as
the Fourier Transform (see Fig. 23.6). The Fourier transform makes it
easy to compute the period of a given function: it will simply be the
least common multiple of the periods of the constituent waves.
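Here is a small numpy illustration (ours, not part of the text) of this decomposition: we synthesize one second of a “C major” chord as a sum of three sine waves and use the fast Fourier transform to separate it back into its three notes.

import numpy as np

rate = 8192                         # samples per second
t = np.arange(rate) / rate          # one second of "audio"

freqs = [261.63, 329.63, 392.0]     # the C, E and G notes
signal = sum(np.sin(2 * np.pi * f * t) for f in freqs)

coeffs = np.abs(np.fft.rfft(signal))
# With a one-second window, the index of a coefficient is its frequency in Hz:
print(sorted(np.argsort(coeffs)[-3:]))  # approximately [262, 330, 392]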
Figure 23.5: Left: the air-pressure when playing a “C Major” chord as a function of time. Right: the coefficients of the Fourier transform of the same function; we can see that it is the sum of three frequencies corresponding to the C, E and G notes (261.63, 329.63 and 392 Hertz respectively). Credit: Bjarke Mønsted’s Quora answer.

23.10.2 Shor’s Algorithm: A bird’s eye view

On input an integer 𝑀, Shor’s algorithm outputs the prime factorization of 𝑀 in time that is polynomial in log 𝑀. The main steps in the algorithm are the following:
Step 1: Reduce to period finding. The first step in the algorithm is to pick a random 𝐴 ∈ {0, 1, …, 𝑀−1} and define the function 𝐹_𝐴 ∶ {0,1}^𝑚 → {0,1}^𝑚 as 𝐹_𝐴(𝑥) = 𝐴^𝑥 (mod 𝑀), where we identify the string 𝑥 ∈ {0,1}^𝑚 with an integer using the binary representation, and similarly represent the integer 𝐴^𝑥 (mod 𝑀) as a string. (We will choose 𝑚 to be some polynomial in log 𝑀, and so in particular {0,1}^𝑚 is a large enough set to represent all the numbers in {0, 1, …, 𝑀−1}.) Some not-too-hard (though somewhat technical) calculations show that: (1) the function 𝐹_𝐴 is periodic (i.e., there is some integer 𝑝_𝐴 such that 𝐹_𝐴(𝑥 + 𝑝_𝐴) = 𝐹_𝐴(𝑥) for “almost” every 𝑥) and, more importantly, (2) if we can recover the period 𝑝_𝐴 of 𝐹_𝐴 for several randomly chosen 𝐴’s, then we can recover the factorization of 𝑀. (We’ll ignore the “almost” qualifier in the discussion below; it causes some annoying, yet ultimately manageable, technical issues in the full-fledged algorithm.) Hence, factoring 𝑀 reduces to finding the period of the function 𝐹_𝐴. Exercise 23.2 asks you to work this out for the related

Figure 23.6: If 𝑓 is a periodic function then when we represent it in the Fourier transform, we expect the coefficients corresponding to wavelengths that do not evenly divide the period to be very small, as they would tend to “cancel out”.
Step 2: Period finding via the Quantum Fourier Transform. Using a simple trick known as “repeated squaring”, it is possible to compute the map 𝑥 ↦ 𝐹_𝐴(𝑥) in time polynomial in 𝑚, which means we can also compute this map using a polynomial number of NAND gates, and so in particular we can generate in polynomial quantum time a quantum state 𝜌 that is (up to normalization) equal to
R
Remark 23.13 — Quantum Fourier Transform. Despite
its name, the Quantum Fourier Transform does not
actually give a way to compute the Fourier Trans-
form of a function 𝑓 ∶ {0, 1}𝑚 → ℝ. This would be
impossible to do in time polynomial in 𝑚, as simply
writing down the Fourier Transform would require 2𝑚
coefficients. Rather the Quantum Fourier Transform
gives a quantum state where the amplitude correspond-
ing to an element (think: frequency) ℎ is equal to
the corresponding Fourier coefficient. This allows us to
sample from a distribution where ℎ is drawn with
probability proportional to the square of its Fourier
coefficient. This is not the same as computing the
Fourier transform, but is good enough for recovering
the period.
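To see how step (2) of the reduction works, here is a classical Python sketch (ours, not the book’s code, using one standard way of extracting a factor from a period): the period 𝑝_𝐴 is found by brute force, which is exactly the step that Shor’s algorithm replaces with the Quantum Fourier Transform, and pow(A, x, M) is the “repeated squaring” computation of 𝐹_𝐴 mentioned above.

import math, random

def period(A, M):
    # Smallest p >= 1 with A^p = 1 (mod M); brute force stands in for the
    # quantum period-finding step. Assumes gcd(A, M) = 1.
    y, p = A % M, 1
    while y != 1:
        y = (y * A) % M
        p += 1
    return p

def find_factor(M):
    while True:
        A = random.randrange(2, M)
        g = math.gcd(A, M)
        if g > 1:
            return g                  # lucky guess: A shares a factor with M
        p = period(A, M)
        if p % 2 == 0:
            root = pow(A, p // 2, M)  # a square root of 1 modulo M
            g = math.gcd(root - 1, M)
            if 1 < g < M:
                return g              # a nontrivial square root yields a factor

M = 15 * 17
g = find_factor(M)
print(g, M // g)  # e.g. 15 17 or 3 85: a nontrivial split of 255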
R
Remark 23.14 — Group theory. While we define the con-
cepts we use, some background in group or number
theory will be very helpful for fully understanding
this section. In particular we will use the notion of
finite commutative (a.k.a. Abelian) groups. These are
defined as follows.
𝑓 = ∑_{𝑔∈𝔾} 𝑓̂(𝑔)𝜒_𝑔 ,      (23.15)

𝑓 = ∑_{𝑦∈{0,1}^𝑛} 𝑓̂(𝑦)𝜒_𝑦      (23.16)
∑_{𝑦∈{0,1}^𝑛} 𝑓̂(𝑦)|𝑦⟩      (23.17)

where 𝑓 = ∑_𝑦 𝑓̂(𝑦)𝜒_𝑦 and 𝜒_𝑦 ∶ {0, 1}^𝑛 → ℂ is the function 𝜒_𝑦(𝑥) = (−1)^{∑_𝑖 𝑥_𝑖 𝑦_𝑖}.
Proof Idea:
The idea behind the proof is that the Hadamard operation corre-
sponds to the Fourier transform over the group {0, 1}𝑛 (with the XOR
operations). To show this, we just need to do the calculations.
⋆
HAD|𝑎⟩ = (1/√2)(|0⟩ + (−1)^𝑎|1⟩) .      (23.18)
We are given the state
𝜌 = ∑_{𝑥∈{0,1}^𝑛} 𝑓(𝑥)|𝑥⟩ .      (23.19)

2^{−𝑛/2} ∑_{𝑥∈{0,1}^𝑛} 𝑓(𝑥) ∏_{𝑖=0}^{𝑛−1} (|0⟩ + (−1)^{𝑥_𝑖}|1⟩) .      (23.20)
We can now use the distributive law and open up a term of the
form
(If you find the above confusing, try to work out this calculation explicitly for 𝑛 = 3; namely show that (|0⟩ + (−1)^{𝑥_0}|1⟩)(|0⟩ + (−1)^{𝑥_1}|1⟩)(|0⟩ + (−1)^{𝑥_2}|1⟩) is the same as the sum over the 2³ terms |000⟩ + (−1)^{𝑥_2}|001⟩ + ⋯ + (−1)^{𝑥_0+𝑥_1+𝑥_2}|111⟩.)
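Here is a numpy snippet, ours rather than the book’s, that performs exactly this check for 𝑛 = 3, comparing the tensor-product form against the 2³-term expansion for every 𝑥:

import numpy as np
from itertools import product

n = 3
for x in product([0, 1], repeat=n):
    # The product form: tensor product of (|0> + (-1)^{x_i}|1>) over i.
    v = np.array([1.0])
    for xi in x:
        v = np.kron(v, np.array([1.0, (-1.0) ** xi]))
    # The expanded form: coefficient (-1)^{sum_i x_i y_i} on each |y>.
    w = np.array([(-1.0) ** sum(a * b for a, b in zip(x, y))
                  for y in product([0, 1], repeat=n)])
    assert np.allclose(v, w)
print("the product form matches the 8-term expansion for every x")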
By changing the order of summations, we see that the final state is
𝑓̂(𝑦) = (1/√𝐿) ∑_{𝑥∈ℤ_𝐿} 𝑓(𝑥)𝜔^{𝑥𝑦} .      (23.25)

𝑓̂(𝑦) = (1/√𝐿) ∑_{𝑧∈ℤ_{𝐿/2}} 𝑓(2𝑧)(𝜔²)^{𝑦𝑧} + (𝜔^𝑦/√𝐿) ∑_{𝑧∈ℤ_{𝐿/2}} 𝑓(2𝑧+1)(𝜔²)^{𝑦𝑧}      (23.26)
This observation is usually used to obtain a fast (e.g., 𝑂(𝐿 log 𝐿) time) algorithm to compute the Fourier transform in a classical setting, but it can
✓ Chapter Recap
23.12 EXERCISES
Exercise 23.1 — Quantum and classical complexity class relations. Prove the following relations between quantum complexity classes and classical ones:

1. P/poly ⊆ BQP/poly. See footnote for hint.7

2. P ⊆ BQP. See footnote for hint.8

3. BPP ⊆ BQP. See footnote for hint.9

4. BQP ⊆ EXP. See footnote for hint.10

7 You can use 𝑈_NAND to simulate NAND gates.
8 Use the alternative characterization of P as in Solved Exercise 13.4.
9 You can use the HAD gate to simulate a coin toss.
10 In exponential time simulating quantum computa-
Exercise 23.2 — Discrete logarithm from order finding. Show a probabilistic polynomial time classical algorithm that, given an Abelian finite group 𝔾 (in the form of an algorithm that computes the group operation), a generator 𝑔 for the group, and an element ℎ ∈ 𝔾, as well as access to a black box that on input 𝑓 ∈ 𝔾 outputs the order of 𝑓 (the smallest 𝑎 such that 𝑓^𝑎 = 1), computes the discrete logarithm of ℎ with respect to 𝑔. That is, the algorithm should output a number 𝑥 such that 𝑔^𝑥 = ℎ. See footnote for hint.12

12 We are given ℎ = 𝑔^𝑥 and need to recover 𝑥. To