INTRODUCTION TO
THEORETICAL
COMPUTER SCIENCE

TEXTBOOK IN PREPARATION.
AVAILABLE ON HTTPS://INTROTCS.ORG
Text available on https://github.com/boazbk/tcs - please post any issues there - thank you!
Preface 9
Preliminaries 17
0 Introduction 19
1 Mathematical Background 37
21 Cryptography 571
VI Appendices 631
Contents (detailed)
Preface 9
0.1 To the student . . . . . . . . . . . . . . . . . . . . . . . . 10
0.1.1 Is the effort worth it? . . . . . . . . . . . . . . . . 11
0.2 To potential instructors . . . . . . . . . . . . . . . . . . . 12
0.3 Acknowledgements . . . . . . . . . . . . . . . . . . . . . 14
Preliminaries 17
0 Introduction 19
0.1 Integer multiplication: an example of an algorithm . . . 20
0.2 Extended Example: A faster way to multiply (optional) 22
0.3 Algorithms beyond arithmetic . . . . . . . . . . . . . . . 27
0.4 On the importance of negative results . . . . . . . . . . 28
0.5 Roadmap to the rest of this book . . . . . . . . . . . . . 29
0.5.1 Dependencies between chapters . . . . . . . . . . 30
0.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
0.7 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 33
1 Mathematical Background 37
1.1 This chapter: a reader’s manual . . . . . . . . . . . . . . 37
1.2 A quick overview of mathematical prerequisites . . . . 38
1.3 Reading mathematical texts . . . . . . . . . . . . . . . . 39
1.3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . 40
1.3.2 Assertions: Theorems, lemmas, claims . . . . . . 40
1.3.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.4 Basic discrete math objects . . . . . . . . . . . . . . . . . 41
1.4.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.4.2 Special sets . . . . . . . . . . . . . . . . . . . . . . 42
1.4.3 Functions . . . . . . . . . . . . . . . . . . . . . . . 44
1.4.4 Graphs . . . . . . . . . . . . . . . . . . . . . . . . 46
1.4.5 Logic operators and quantifiers . . . . . . . . . . 49
1.4.6 Quantifiers for summations and products . . . . 50
1.4.7 Parsing formulas: bound and free variables . . . 50
1.4.8 Asymptotics and Big-𝑂 notation . . . . . . . . . . 52
21 Cryptography 571
21.1 Classical cryptosystems . . . . . . . . . . . . . . . . . . . 572
21.2 Defining encryption . . . . . . . . . . . . . . . . . . . . . 574
21.3 Defining security of encryption . . . . . . . . . . . . . . 575
21.4 Perfect secrecy . . . . . . . . . . . . . . . . . . . . . . . . 577
21.4.1 Example: Perfect secrecy in the battlefield . . . . 578
21.4.2 Constructing perfectly secret encryption . . . . . 579
21.5 Necessity of long keys . . . . . . . . . . . . . . . . . . . 581
21.6 Computational secrecy . . . . . . . . . . . . . . . . . . . 582
21.6.1 Stream ciphers or the “derandomized one-time pad” . . . . . . . . . . 584
21.7 Computational secrecy and NP . . . . . . . . . . . . . . 587
21.8 Public key cryptography . . . . . . . . . . . . . . . . . . 589
21.8.1 Defining public key encryption . . . . . . . . . . 591
21.8.2 Diffie-Hellman key exchange . . . . . . . . . . . . 592
21.9 Other security notions . . . . . . . . . . . . . . . . . . . 594
21.10 Magic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
21.10.1 Zero knowledge proofs . . . . . . . . . . . . . . . 595
21.10.2 Fully homomorphic encryption . . . . . . . . . . 595
21.10.3 Multiparty secure computation . . . . . . . . . . 596
21.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
21.12 Bibliographical notes . . . . . . . . . . . . . . . . . . . . 597
VI Appendices 631
Preface
“We make ourselves no promises, but we cherish the hope that the unobstructed
pursuit of useless knowledge will prove to have consequences in the future
as in the past” … “An institution which sets free successive generations of
human souls is amply justified whether or not this graduate or that makes a
so-called useful contribution to human knowledge. A poem, a symphony, a
painting, a mathematical truth, a new scientific fact, all bear in themselves all
the justification that universities, colleges, and institutes of research need or
require”, Abraham Flexner, The Usefulness of Useless Knowledge, 1939.
“I suggest that you take the hardest courses that you can, because you learn
the most when you challenge yourself… CS 121 I found pretty hard.”, Mark
Zuckerberg, 2005.
• Actively notice which questions arise in your mind as you read the
text, and whether or not they are answered in the text.
very well be true, but the main benefit of this book is not in teaching
you any practical tool or technique, but instead in giving you a differ-
ent way of thinking: an ability to recognize computational phenomena
even when they occur in non-obvious settings, a way to model compu-
tational tasks and questions, and to reason about them.
Regardless of any use you will derive from this book, I believe
learning this material is important because it contains concepts that
are both beautiful and fundamental. The role that energy and matter
played in the 20th century is played in the 21st by computation and
information, not just as tools for our technology and economy, but also
as the basic building blocks we use to understand the world. This
book will give you a taste of some of the theory behind those, and
hopefully spark your curiosity to study more.
0.3 ACKNOWLEDGEMENTS
This text is continually evolving, and I am getting input from many
people, for which I am deeply grateful. Salil Vadhan co-taught with
me the first iteration of this course and gave me a tremendous amount
of useful feedback and insights during this process. Michele Amoretti
and Marika Swanberg carefully read several chapters of this text and
gave extremely helpful detailed comments. Dave Evans and Richard
Xu contributed many pull requests fixing errors and improving phras-
ing. Thanks to Anil Ada, Venkat Guruswami, and Ryan O’Donnell for
Introduction
[The Roman-numeral representation of a large number: hundreds of M's followed by DCCCCLVI.]
Remark 0.3 — Specification, implementation, and analysis of algorithms. A full description of an algorithm has three components: a specification (“what” the algorithm does), an implementation (“how” it does it), and an analysis (“why” it is correct), the last being a
proof that the algorithm does in fact do what it’s supposed to do. The
operations of Karatsuba’s algorithm are detailed in Algorithm 0.4,
while the analysis is given in Lemma 0.5 and Lemma 0.6.
since the numbers $\overline{x}, \underline{x}, \overline{y}, \underline{y}, \overline{x}+\underline{x}, \overline{y}+\underline{y}$ all have at most $m + 2 < n$ digits, the induction hypothesis implies that the values $A, B, C$ computed by the recursive calls will satisfy $A = \overline{x}\,\overline{y}$, $B = (\overline{x}+\underline{x})(\overline{y}+\underline{y})$ and $C = \underline{x}\,\underline{y}$. Plugging this into (4) we see that $x \cdot y$ equals the value $(10^{2m} - 10^m) \cdot A + 10^m \cdot B + (1 - 10^m) \cdot C$ computed by Algorithm 0.4.
■
Proof. Fig. 2 illustrates the idea behind the proof, which we only
sketch here, leaving filling out the details as Exercise 0.4. The proof
is again by induction. We define 𝑇 (𝑛) to be the maximum number of
steps that Algorithm 0.4 takes on inputs of length at most $n$. Since in the base case $n \le 4$, Algorithm 0.4 performs a constant number of computations, we know that $T(4) \le c$ for some constant $c$, and for $n > 4$ it satisfies the recursive equation
$$T(n) \le 3T(\lfloor n/2 \rfloor + 2) + c'n \qquad (5)$$
for some constant 𝑐′ (using the fact that addition can be done in 𝑂(𝑛)
operations).
The recursive equation (5) solves to $O(n^{\log_2 3})$. The intuition behind this is presented in Fig. 2, and this is also a consequence of the so-called “Master Theorem” on recurrence relations. As mentioned above, we leave completing the proof to the reader as Exercise 0.4.
■
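For concreteness, here is a short Python sketch of Karatsuba's algorithm (our illustration: it works directly on Python integers, whereas Algorithm 0.4 in the book operates on arrays of digits):

def karatsuba(x, y):
    # Multiply non-negative integers via the identity
    # x*y = (10^(2m) - 10^m)*A + 10^m*B + (1 - 10^m)*C.
    if x < 10 or y < 10:                  # base case: a single-digit factor
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    xbar, xund = divmod(x, 10 ** m)       # x = 10^m * xbar + xund
    ybar, yund = divmod(y, 10 ** m)       # y = 10^m * ybar + yund
    A = karatsuba(xbar, ybar)
    B = karatsuba(xbar + xund, ybar + yund)
    C = karatsuba(xund, yund)
    return (10 ** (2 * m) - 10 ** m) * A + 10 ** m * B + (1 - 10 ** m) * C

print(karatsuba(1234, 5678))  # 7006652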
Remark 0.7 — Matrix Multiplication (advanced note).
(This book contains many “advanced” or “optional”
notes and sections. These may assume background
that not every student has, and can be safely skipped
over as none of the future parts depends on them.)
Ideas similar to Karatsuba’s can be used to speed up
matrix multiplications as well. Matrices are a powerful
way to represent linear equations and operations,
widely used in numerous applications of scientific
computing, graphics, machine learning, and many
many more.
One of the basic operations one can do with
two matrices is to multiply them. For example,
if $x = \begin{pmatrix} x_{0,0} & x_{0,1} \\ x_{1,0} & x_{1,1} \end{pmatrix}$ and $y = \begin{pmatrix} y_{0,0} & y_{0,1} \\ y_{1,0} & y_{1,1} \end{pmatrix}$, then the product of $x$ and $y$ is the matrix
$$\begin{pmatrix} x_{0,0}y_{0,0} + x_{0,1}y_{1,0} & x_{0,0}y_{0,1} + x_{0,1}y_{1,1} \\ x_{1,0}y_{0,0} + x_{1,1}y_{1,0} & x_{1,0}y_{0,1} + x_{1,1}y_{1,1} \end{pmatrix} .$$
You can see that we can compute this matrix by eight products of numbers.
Now suppose that 𝑛 is even and 𝑥 and 𝑦 are a pair of
𝑛 × 𝑛 matrices which we can think of as each com-
posed of four (𝑛/2) × (𝑛/2) blocks 𝑥0,0 , 𝑥0,1 , 𝑥1,0 , 𝑥1,1
and 𝑦0,0 , 𝑦0,1 , 𝑦1,0 , 𝑦1,1 . Then the formula for the matrix
product of 𝑥 and 𝑦 can be expressed in the same way
as above, just replacing products 𝑥𝑎,𝑏 𝑦𝑐,𝑑 with matrix
products, and addition with matrix addition. This
means that we can use the formula above to give an
algorithm that doubles the dimension of the matrices
at the expense of increasing the number of operations
by a factor of 8, which for $n = 2^\ell$ results in $8^\ell = n^3$ operations.
In 1969 Volker Strassen noted that we can compute
the product of a pair of two-by-two matrices using
only seven products of numbers by observing that
each entry of the matrix 𝑥𝑦 can be computed by
adding and subtracting the following seven terms:
$t_1 = (x_{0,0} + x_{1,1})(y_{0,0} + y_{1,1})$, $t_2 = (x_{1,0} + x_{1,1})y_{0,0}$,
$t_3 = x_{0,0}(y_{0,1} - y_{1,1})$, $t_4 = x_{1,1}(y_{1,0} - y_{0,0})$,
$t_5 = (x_{0,0} + x_{0,1})y_{1,1}$, $t_6 = (x_{1,0} - x_{0,0})(y_{0,0} + y_{0,1})$,
$t_7 = (x_{0,1} - x_{1,1})(y_{1,0} + y_{1,1})$. Indeed, one can verify that
$$xy = \begin{pmatrix} t_1 + t_4 - t_5 + t_7 & t_3 + t_5 \\ t_2 + t_4 & t_1 + t_3 - t_2 + t_6 \end{pmatrix} .$$
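As a quick sanity check, the following Python snippet (ours, not from the book) verifies the seven-product identity against the eight-product formula on random 2×2 matrices:

import random

def strassen_2x2(x, y):
    # Strassen's seven products for 2x2 matrices (given as lists of rows).
    t1 = (x[0][0] + x[1][1]) * (y[0][0] + y[1][1])
    t2 = (x[1][0] + x[1][1]) * y[0][0]
    t3 = x[0][0] * (y[0][1] - y[1][1])
    t4 = x[1][1] * (y[1][0] - y[0][0])
    t5 = (x[0][0] + x[0][1]) * y[1][1]
    t6 = (x[1][0] - x[0][0]) * (y[0][0] + y[0][1])
    t7 = (x[0][1] - x[1][1]) * (y[1][0] + y[1][1])
    return [[t1 + t4 - t5 + t7, t3 + t5],
            [t2 + t4, t1 + t3 - t2 + t6]]

for _ in range(1000):
    x = [[random.randint(-9, 9), random.randint(-9, 9)] for _ in range(2)]
    y = [[random.randint(-9, 9), random.randint(-9, 9)] for _ in range(2)]
    naive = [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    assert strassen_2x2(x, y) == naive  # matches the eight-product formula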
Even for classical questions, studied through the ages, new dis-
coveries are still being made. For example, for the question of de-
termining whether a given integer is prime or composite, which has
been studied since the days of Pythagoras, efficient probabilistic algo-
rithms were only discovered in the 1970s, while the first deterministic
polynomial-time algorithm was only found in 2002. For the related
problem of actually finding the factors of a composite number, new
algorithms were found in the 1980s, and (as we’ll see later in this
course) discoveries in the 1990s raised the tantalizing prospect of
obtaining faster algorithms through the use of quantum mechanical
effects.
Despite all this progress, there are still many more questions than
answers in the world of algorithms. For almost all natural prob-
lems, we do not know whether the current algorithm is the “best”,
or whether a significantly better one is still waiting to be discovered.
As alluded to in Cobham’s opening quote for this chapter, even for
the basic problem of multiplying numbers we have not yet answered
the question of whether there is a multiplication algorithm that is as
efficient as our algorithms for addition. But at least we now know the
right way to ask it.
✓ Chapter Recap
The book largely proceeds in linear order, with each chapter build-
ing on the previous ones, with the following exceptions:
• The topics of 𝜆 calculus (Section 8.5), Gödel’s in-
completeness theorem (Chapter 11), Automata/regular expres-
sions and context-free grammars (Chapter 10), and space-bounded
computation (Chapter 17), are not used in the following chapters.
Hence you can choose whether to cover or skip any subset of them.
A course based on this book can use all of Parts I, II, and III (possi-
bly skipping over some or all of the 𝜆 calculus, Chapter 11, Chapter 10
or Chapter 17), and then either cover all or some of Part IV (random-
ized computation), and add a “sprinkling” of advanced topics from
Part V based on student or instructor interest.
0.6 EXERCISES
Exercise 0.1 Rank the significance of the following inventions in speed-
ing up the multiplication of large (that is 100-digit or more) numbers.
That is, use “back of the envelope” estimates to order them in terms of
the speedup factor they offered over the previous state of affairs.
a. $n$ operations.
b. $n^2$ operations.
c. $n \log n$ operations.
d. $2^n$ operations.
e. $n!$ operations.
Exercise 0.6 — Matrix Multiplication (optional, advanced). In this exercise, we show that if for some $\omega > 2$, we can write the product of two $k \times k$ real-valued matrices $A, B$ using at most $k^\omega$ multiplications, then we can multiply two $n \times n$ matrices in roughly $n^\omega$ time for every large enough $n$.
To make this precise, we need some notation that is unfortunately somewhat cumbersome. Assume that there is some $k \in \mathbb{N}$ and $m \le k^\omega$ such that for all $k \times k$ matrices $A, B, C$ such that $C = AB$, we can write for every $i, j \in [k]$:
$$C_{i,j} = \sum_{\ell=0}^{m-1} \alpha_\ell^{i,j} f_\ell(A) g_\ell(B)$$
1
Mathematical Background

• Transform an intuitive argument into a rigorous proof.
“I found that every number, which may be expressed from one to ten, surpasses
the preceding by one unit: afterwards the ten is doubled or tripled … until
a hundred; then the hundred is doubled and tripled in the same manner as
the units and the tens … and so forth to the utmost limit of numeration.”,
Muhammad ibn Mūsā al-Khwārizmī, 820, translation by Fredric Rosen,
1831.
the whole chapter. You can just take a quick look at Section 1.2 to
see the main tools we will use, Section 1.7 for our notation and con-
ventions, and then skip ahead to the rest of this book. Alternatively,
you can sit back, relax, and read this chapter just to get familiar
with our notation, as well as to enjoy (or not) my philosophical
musings and attempts at humor.
• If your background is less extensive, see Section 1.9 for some re-
sources on these topics. This chapter briefly covers the concepts
that we need, but you may find it helpful to see a more in-depth
treatment. As usual with math, the best way to get comfortable
with this material is to work out exercises on your own.
• Proofs: First and foremost, this book involves a heavy dose of for-
mal mathematical reasoning, which includes mathematical defini-
tions, statements, and proofs.
In the rest of this chapter we briefly review the above notions. This
is partially to remind the reader and reinforce material that might
not be fresh in your mind, and partially to introduce our notation
and conventions which might occasionally differ from those you’ve
encountered before.
1.3.1 Definitions
Mathematicians often define new concepts in terms of old concepts.
For example, here is a mathematical definition which you may have
encountered in the past (and will see again shortly):
1.3.3 Proofs
Mathematical proofs are the arguments we use to demonstrate that our
theorems, lemmas, and claims are indeed true. We discuss proofs in
Section 1.5 below, but the main point is that the mathematical stan-
dard of proof is very high. Unlike in some other realms, in mathe-
matics a proof is an “airtight” argument that demonstrates that the
statement is true beyond a shadow of a doubt. Some examples in this
section for mathematical proofs are given in Solved Exercise 1.1 and
Section 1.6. As mentioned in the preface, as a general rule, it is more
important you understand the definitions than the theorems, and it is
more important you understand a theorem statement than its proof.
1.4.1 Sets
A set is an unordered collection of objects. For example, when we
write 𝑆 = {2, 4, 7}, we mean that 𝑆 denotes the set that contains the
numbers 2, 4, and 7. (We use the notation “2 ∈ 𝑆” to denote that 2 is
an element of 𝑆.) Note that the sets {2, 4, 7} and {7, 4, 2} are identical,
since they contain the same elements. Also, a set either contains an
element or does not contain it – there is no notion of containing it
“twice” – and so we could even write the same set 𝑆 as {2, 2, 4, 7}
(though that would be a little weird). The cardinality of a finite set 𝑆,
denoted by |𝑆|, is the number of elements it contains. (Cardinality can
be defined for infinite sets as well; see the sources in Section 1.9.) So,
in the example above, |𝑆| = 3. A set 𝑆 is a subset of a set 𝑇 , denoted
by 𝑆 ⊆ 𝑇 , if every element of 𝑆 is also an element of 𝑇 . (We can
also describe this by saying that 𝑇 is a superset of 𝑆.) For example,
{2, 7} ⊆ {2, 4, 7}. The set that contains no elements is known as the
empty set and it is denoted by ∅. If 𝐴 is a subset of 𝐵 that is not equal
to 𝐵 we say that 𝐴 is a strict subset of 𝐵, and denote this by 𝐴 ⊊ 𝐵.
We can define sets by either listing all their elements or by writing
down a rule that they satisfy such as
Of course there is more than one way to write the same set, and of-
ten we will use intuitive notation listing a few examples that illustrate
the rule. For example, we can also define EVEN as
EVEN = {0, 2, 4, …} .
Note that a set can be either finite (such as the set {2, 4, 7}) or in-
finite (such as the set EVEN). Also, the elements of a set don’t have
to be numbers. We can talk about the sets such as the set {𝑎, 𝑒, 𝑖, 𝑜, 𝑢}
of all the vowels in the English language, or the set {New York, Los
Angeles, Chicago, Houston, Philadelphia, Phoenix, San Antonio,
San Diego, Dallas} of all cities in the U.S. with population more than
one million per the 2010 census. A set can even have other sets as ele-
ments, such as the set {∅, {1, 2}, {2, 3}, {1, 3}} of all even-sized subsets
of {1, 2, 3}.
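These conventions map directly onto Python's built-in sets, which give a convenient way to experiment with them (a small aside of ours, not from the book):

S = {2, 4, 7}
print(S == {7, 4, 2})      # True: order does not matter
print(S == {2, 2, 4, 7})   # True: no notion of containing an element twice
print(len(S))              # 3: the cardinality |S|
print({2, 7} <= S)         # True: {2, 7} is a subset of S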
ℕ = {0, 1, 2, …}
contains all natural numbers, i.e., non-negative integers. For any natural
number 𝑛 ∈ ℕ, we define the set [𝑛] as {0, … , 𝑛 − 1} = {𝑘 ∈ ℕ ∶
𝑘 < 𝑛}. (We start our indexing of both ℕ and [𝑛] from 0, while many
other texts index those sets from 1. Starting from zero or one is simply
a convention that doesn’t make much difference, as long as one is
consistent about it.)
We will also occasionally use the set ℤ = {… , −2, −1, 0, +1, +2, …} of (negative and non-negative) integers¹, as well as the set ℝ of real numbers. (This is the set that includes not just the integers, but also fractional and irrational numbers; e.g., ℝ contains numbers such as +0.5, −𝜋, etc.) We denote by ℝ₊ the set {𝑥 ∈ ℝ ∶ 𝑥 > 0} of positive real numbers. This set is sometimes also denoted as (0, ∞).

¹ The letter Z stands for the German word “Zahlen”, which means numbers.
$$\{0,1\}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\} .$$
For every string $x \in \{0,1\}^n$ and $i \in [n]$, we write $x_i$ for the $i$-th element of $x$.
We will also often talk about the set of binary strings of all lengths, which is
$$\{0,1\}^* = \cup_{n \in \mathbb{N}} \{0,1\}^n .$$
More generally, for every alphabet $\Sigma$, the set of all finite-length strings over $\Sigma$ can be written concisely as
$$\Sigma^* = \cup_{n \in \mathbb{N}} \Sigma^n .$$
For example, if Σ = {𝑎, 𝑏, 𝑐, 𝑑, … , 𝑧} then Σ∗ denotes the set of all finite
length strings over the alphabet a-z.
1.4.3 Functions
If 𝑆 and 𝑇 are non-empty sets, a function 𝐹 mapping 𝑆 to 𝑇 , denoted
by 𝐹 ∶ 𝑆 → 𝑇 , associates with every element 𝑥 ∈ 𝑆 an element
𝐹 (𝑥) ∈ 𝑇 . The set 𝑆 is known as the domain of 𝐹 and the set 𝑇
is known as the codomain of 𝐹 . The image of a function 𝐹 is the set
{𝐹 (𝑥) | 𝑥 ∈ 𝑆} which is the subset of 𝐹 ’s codomain consisting of all
output elements that are mapped from some input. (Some texts use
range to denote the image of a function, while other texts use range
to denote the codomain of a function. Hence we will avoid using the
term “range” altogether.) As in the case of sets, we can write a func-
tion either by listing the table of all the values it gives for elements
in 𝑆 or by using a rule. For example if 𝑆 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and 𝑇 = {0, 1}, then the table below defines a function 𝐹 ∶ 𝑆 → 𝑇 .
Note that this function is the same as the function defined by the rule 𝐹(𝑥) = (𝑥 mod 2).²

² For two natural numbers 𝑥 and 𝑎, 𝑥 mod 𝑎 (shorthand for “modulo”) denotes the remainder of 𝑥 when it is divided by 𝑎. That is, it is the number 𝑟 in {0, … , 𝑎 − 1} such that 𝑥 = 𝑎𝑘 + 𝑟 for some integer 𝑘. We sometimes also use the notation 𝑥 = 𝑦 (mod 𝑎) to denote the assertion that 𝑥 mod 𝑎 is the same as 𝑦 mod 𝑎.

Table 1.1: An example of a function.

Input   Output
0       0
1       1
2       0
3       1
4       0
5       1
6       0
7       1
8       0
9       1
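In Python terms (our aside), the same function can be written either as a table or as a rule, and the two agree on every input:

F_table = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 0, 9: 1}
F_rule = lambda x: x % 2
assert all(F_table[x] == F_rule(x) for x in range(10))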
Basic facts about functions: Verifying that you can prove the following results is an excellent way to brush up on functions:
A simple path is a path $(u_0, \ldots, u_{k-1})$ where all the $u_i$'s are distinct. A cycle is a path $(u_0, \ldots, u_k)$ where $u_0 = u_k$. We say that two vertices $u, v \in V$ are connected if either $u = v$ or there is a path $(u_0, \ldots, u_k)$ where $u_0 = u$ and $u_k = v$.

[Figure: an example of an undirected and a directed graph. The undirected graph has vertex set {1, 2, 3, 4} and edge set {{1, 2}, {2, 3}, {2, 4}}. The directed graph has vertex set {𝑎, 𝑏, 𝑐} and edge set {(𝑎, 𝑏), (𝑏, 𝑐), (𝑐, 𝑎), (𝑎, 𝑐)}.]
Solved Exercise 1.1 — Connected vertices have simple paths. Prove Lemma 1.6
■
Solution:
The proof follows the idea illustrated in Fig. 1.6. One complica-
tion is that there can be more than one vertex that is visited twice
Remark 1.7 — Finding proofs. Solved Exercise 1.1 is a
good example of the process of finding a proof. You
start by ensuring you understand what the statement
means, and then come up with an informal argument
why it should be true. You then transform the infor-
mal argument into a rigorous proof. This proof need
not be very long or overly formal, but should clearly
establish why the conclusion of the statement follows
from its assumptions.
Remark 1.13 — Labeled graphs. For some applications
we will consider labeled graphs, where the vertices or
edges have associated labels (which can be numbers,
strings, or members of some other set). We can think
of such a graph as having an associated (possibly
partial) labeling function 𝐿 ∶ 𝑉 ∪ 𝐸 → ℒ, where ℒ is
the set of potential labels. However we will typically
not refer explicitly to this labeling function and simply
say things such as “vertex 𝑣 has the label 𝛼”.
For example, the sum of the squares of all numbers from 1 to 100 can be written as
$$\sum_{i \in \{1,\ldots,100\}} i^2 \qquad (1.1)$$
or, more concisely, as
$$\sum_{i=1}^{100} i^2 .$$
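In Python-like notation (our aside), the same sum can be computed with the index playing exactly the role of a bound variable:

print(sum(i ** 2 for i in range(1, 101)))  # 338350; i is bound inside the expression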
$$\exists_{a,b \in \mathbb{N}} \; (a \neq 1) \wedge (a \neq n) \wedge (n = a \times b) \qquad (1.2)$$
Since 𝑛 is free, it can be set to any value, and the truth of the state-
ment (1.2) depends on the value of 𝑛. For example, if 𝑛 = 8 then (1.2)
is true, but for 𝑛 = 11 it is false. (Can you see why?)
The same issue appears when parsing code. For example, in a snippet from the C programming language such as
for (int i = 0; i < n; i++) { printf("%d\n", i); }
the variable i is bound within the for block but the variable n is free.
The main property of bound variables is that we can rename them
(as long as the new name doesn’t conflict with another used variable)
without changing the meaning of the statement. Thus for example the
statement
$$\exists_{x,y \in \mathbb{N}} \; (x \neq 1) \wedge (x \neq n) \wedge (n = x \times y) \qquad (1.3)$$
is equivalent to (1.2) in the sense that it is true for exactly the same
set of 𝑛’s.
Similarly, the code
for (int j = 0; j < n; j++) { printf("%d\n", j); }
produces the same result as the code above that used i instead of j.
Remark 1.14 — Aside: mathematical vs programming no-
tation. Mathematical notation has a lot of similarities
with programming languages, and for the same rea-
sons. Both are formalisms meant to convey complex
concepts in a precise way. However, there are some
cultural differences. In programming languages, we
often try to use meaningful variable names such as
NumberOfVertices while in math we often use short
identifiers such as 𝑛. Part of it might have to do with
the tradition of mathematical proofs as being hand-
written and verbally presented, as opposed to typed
up and compiled. Another reason is if the wrong
variable name is used in a proof, at worst it causes
confusion to readers; when the wrong variable name
texts write 𝐹 ∈ 𝑂(𝐺) instead of 𝐹 = 𝑂(𝐺), but we will not use this
notation.) Despite the misleading equality sign, you should remember
that a statement such as 𝐹 = 𝑂(𝐺) means that 𝐹 is “at most” 𝐺 in
some rough sense when we ignore constants, and a statement such as
𝐹 = Ω(𝐺) means that 𝐹 is “at least” 𝐺 in the same rough sense.
• When adding two functions, we only care about the larger one. For
example, for the purpose of $O$-notation, $n^3 + 100n^2$ is the same as $n^3$, and in general in any polynomial, we only care about the larger exponent.
every two constants $a > 0$ and $\epsilon > 0$, even if $\epsilon$ is much smaller than $a$. For example, $100 n^{100} = o(2^{\sqrt{n}})$.
Remark 1.16 — Big 𝑂 for other applications (optional).
While Big-𝑂 notation is often used to analyze running
time of algorithms, this is by no means the only ap-
plication. We can use 𝑂 notation to bound asymptotic
relations between any functions mapping integers
to positive numbers. It can be used regardless of
whether these functions are a measure of running
time, memory usage, or any other quantity that may
have nothing to do with computation. Here is one
example which is unrelated to this book (and hence
one that you can feel free to skip): one way to state the
Riemann Hypothesis (one of the most famous open
questions in mathematics) is that it corresponds to
the conjecture that the number of primes between 0 and $n$ is equal to $\int_2^n \frac{1}{\ln x}\, dx$ up to an additive error of magnitude at most $O(\sqrt{n} \log n)$.
1.5 PROOFS
Many people think of mathematical proofs as a sequence of logical
deductions that starts from some axioms and ultimately arrives at
a conclusion. In fact, some dictionaries define proofs that way. This
is not entirely wrong, but at its essence, a mathematical proof of a
statement X is simply an argument that convinces the reader that X is
true beyond a shadow of a doubt.
To produce such a proof you need to:

1. Understand precisely what the statement means.

2. Convince yourself that the statement is true.

3. Write your reasoning down in plain, precise, and concise language.
In many cases, the first part is the most important one. Understand-
ing what a statement means is oftentimes more than halfway towards
understanding why it is true. In the third part, to convince the reader
beyond a shadow of a doubt, we will often want to break down the
reasoning to “basic steps”, where each basic step is simple enough
to be “self-evident”. The combination of all steps yields the desired
statement.
what this purpose is. When you write a proof, for every equation or
sentence you include, ask yourself:
2. If so, does this statement follow from the previous steps, or are we
going to establish it in the next step?
Remark 1.20 — Hierarchical Proofs (optional). Mathe-
matical proofs are ultimately written in English prose.
The well-known computer scientist Leslie Lamport
argues that this is a problem, and proofs should be
written in a more formal and rigorous way. In his
manuscript he proposes an approach for structured
hierarchical proofs, that have the following form:
If you have not seen the proof of this theorem before
(or don’t remember it), this would be an excellent
point to pause and try to prove it yourself. One way
to do it would be to describe an algorithm that given as
input a directed acyclic graph 𝐺 on 𝑛 vertices and 𝑛−2
or fewer edges, constructs an array 𝐹 of length 𝑛 such
that for every edge 𝑢 → 𝑣 in the graph 𝐹 [𝑢] < 𝐹 [𝑣].
(a) 𝑃 is true, and
(b) 𝑃 implies 𝑄,
then 𝑄 is true.

[Figure 1.9: Some examples of DAGs of one, two, and three vertices, and valid ways to assign layers to the vertices.]
Remark 1.25 — Induction and recursion. Proofs by in-
duction are closely related to algorithms by recursion.
In both cases we reduce solving a larger problem to
solving a smaller instance of itself. In a recursive algo-
rithm to solve some problem P on an input of length
𝑘 we ask ourselves “what if someone handed me a
way to solve P on instances smaller than 𝑘?”. In an
inductive proof to prove a statement Q parameterized
by a number 𝑘, we ask ourselves “what if I already
knew that 𝑄(𝑘′ ) is true for 𝑘′ < 𝑘?”. Both induction
and recursion are crucial concepts for this course and
Computer Science at large (and even other areas of
inquiry, including not just mathematics but other
sciences as well). Both can be confusing at first, but
with time and practice they become clearer. For more
on proofs by induction and recursion, you might find
the following Stanford CS 103 handout, this MIT 6.00
lecture or this excerpt of the Lehman-Leighton book
useful.
$$f(v) = \begin{cases} f'(v) + 1 & v \neq v_0 \\ 0 & v = v_0 \end{cases} .$$
We claim that 𝑓 is a valid layering, namely that for every edge 𝑢 →
𝑣, 𝑓(𝑢) < 𝑓(𝑣). To prove this, we split into cases:
Reading a proof is no less of an important skill than
producing one. In fact, just like understanding code,
it is a highly non-trivial skill in itself. Therefore I
strongly suggest that you re-read the above proof, ask-
ing yourself at every sentence whether the assumption
it makes is justified, and whether this sentence truly
demonstrates what it purports to achieve. Another
good habit is to ask yourself when reading a proof for
every variable you encounter (such as 𝑢, 𝑖, 𝐺′ , 𝑓 ′ , etc.
in the above proof) the following questions: (1) What
type of variable is it? Is it a number? a graph? a ver-
tex? a function? and (2) What do we know about it?
Is it an arbitrary member of the set? Have we shown
some facts about it?, and (3) What are we trying to
show about it?.
Theorem 1.26 — Minimal layering is unique. Let 𝐺 = (𝑉, 𝐸) be a DAG. We say that a layering 𝑓 ∶ 𝑉 → ℕ is minimal if for every vertex 𝑣 ∈ 𝑉, if 𝑣 has no in-neighbors then 𝑓(𝑣) = 0, and if 𝑣 has in-neighbors then there exists an in-neighbor 𝑢 of 𝑣 such that 𝑓(𝑢) = 𝑓(𝑣) − 1.
For every pair of layerings 𝑓, 𝑔 ∶ 𝑉 → ℕ of 𝐺, if both 𝑓 and 𝑔 are minimal, then 𝑓 = 𝑔.
Proof Idea:
The idea is to prove the theorem by induction on the layers. If 𝑓 and
𝑔 are minimal then they must agree on the source vertices, since both
𝑓 and 𝑔 should assign these vertices to layer 0. We can then show that
if 𝑓 and 𝑔 agree up to layer 𝑖 − 1, then the minimality property implies
that they need to agree in layer 𝑖 as well. In the actual proof we use
a small trick to save on writing. Rather than proving the statement
that 𝑓 = 𝑔 (or in other words that 𝑓(𝑣) = 𝑔(𝑣) for every 𝑣 ∈ 𝑉 ),
we prove the weaker statement that 𝑓(𝑣) ≤ 𝑔(𝑣) for every 𝑣 ∈ 𝑉 .
(This is a weaker statement, since the condition that 𝑓(𝑣) is less than or equal to 𝑔(𝑣) is implied by the condition that 𝑓(𝑣) is equal to 𝑔(𝑣).) However, since 𝑓 and 𝑔 are just labels we give to two minimal
layerings, by simply changing the names “𝑓” and “𝑔” the same proof
also shows that 𝑔(𝑣) ≤ 𝑓(𝑣) for every 𝑣 ∈ 𝑉 and hence that 𝑓 = 𝑔.
⋆
The proof of Theorem 1.26 is fully rigorous, but is
written in a somewhat terse manner. Make sure that
you read through it and understand why this is indeed
an airtight proof of the Theorem’s statement.
• We also index the set $[n]$ starting with 0, and hence define it as $\{0, \ldots, n-1\}$. In other texts it is often defined as $\{1, \ldots, n\}$. Similarly, we index our strings starting with 0, and hence a string $x \in \{0,1\}^n$ is written as $x_0 x_1 \cdots x_{n-1}$.
• We use ⌈𝑥⌉ and ⌊𝑥⌋ for the “ceiling” and “floor” operators that
correspond to “rounding up” or “rounding down” a number to the
Also, such conventions do not replace the need to explicitly declare for
each new variable the type of object that it denotes.
• “Let 𝑋 be …”, “let 𝑋 denote …”, or “let 𝑋 = …”: These are all
different ways for us to say that we are defining the symbol 𝑋 to
stand for whatever expression is in the …. When 𝑋 is a property of
some objects we might define 𝑋 by writing something along the
lines of “We say that … has the property 𝑋 if ….”. While we often
• “Thus”, “Therefore” , “We get that”: This means that the following
sentence is implied by the preceding one, as in “The 𝑛-vertex graph
𝐺 is connected. Therefore it contains at least 𝑛 − 1 edges.” We
sometimes use “indeed” to indicate that the following text justifies
the claim that was made in the preceding sentence as in “The 𝑛-
vertex graph 𝐺 has at least 𝑛 − 1 edges. Indeed, this follows since 𝐺 is
connected.”
✓ Chapter Recap
1.8 EXERCISES
Exercise 1.1 — Logical expressions. a. Write a logical expression 𝜑(𝑥) involving the variables 𝑥0, 𝑥1, 𝑥2 and the operators ∧ (AND), ∨ (OR), and ¬ (NOT), such that 𝜑(𝑥) is true if the majority of the inputs are True.
b. Let 𝑛 > 10. 𝑆 is the set of all functions mapping $\{0,1\}^n$ to $\{0,1\}$, and $T = \{0,1\}^{n^3}$.
c. Let $A_0, \ldots, A_{k-1}$ be finite subsets of $\{1, \ldots, n\}$, such that $|A_i| = m$ for every $i \in [k]$. Prove that if $k > 100n$, then there exist two distinct sets $A_i, A_j$ s.t. $|A_i \cap A_j| \ge m^2/(10n)$.
■
Exercise 1.9 Prove that for every finite 𝑆, 𝑇, there are $(|T| + 1)^{|S|}$ partial functions from 𝑆 to 𝑇.
■
Exercise 1.11 Prove that for every undirected graph 𝐺 of 100 vertices,
if every vertex has degree at most 4, then there exists a subset 𝑆 of at
least 20 vertices such that no two vertices in 𝑆 are neighbors of one
another.
■
d. $F(n) = \sqrt{n}$, $G(n) = 2^{\sqrt{\log n}}$.

e. $F(n) = \binom{n}{\lceil 0.2n \rceil}$, $G(n) = 2^{0.1n}$ (where $\binom{n}{k}$ is the number of $k$-sized subsets of a set of size $n$). See footnote for hint.⁷

⁷ One way to do this is to use Stirling's approximation for the factorial function.
Exercise 1.15 Prove that for every undirected graph 𝐺 of 1000 vertices,
if every vertex has degree at most 4, then there exists a subset 𝑆 of at
least 200 vertices such that no two vertices in 𝑆 are neighbors of one
another.
■
2
Computation and Representation

• … graphs.
• Prefix-free representations.
“The alphabet (sic) was a great invention, which enabled men (sic) to store
and to learn with little effort what others had learned the hard way – that is, to
learn from books rather than from direct, possibly painful, contact with the real
world.”, B.F. Skinner
“The name of the song is called ‘HADDOCK’S EYES.”’ [said the Knight]
“Oh, that’s the name of the song, is it?” Alice said, trying to feel interested.
“No, you don’t understand,” the Knight said, looking a little vexed. “That’s
what the name is CALLED. The name really is ‘THE AGED AGED MAN.”’
“Then I ought to have said ‘That’s what the SONG is called’?” Alice cor-
rected herself.
“No, you oughtn’t: that’s quite another thing! The SONG is called ‘WAYS
AND MEANS’: but that’s only what it’s CALLED, you know!”
“Well, what IS the song, then?” said Alice, who was by this time com-
pletely bewildered.
“I was coming to that,” the Knight said. “The song really IS ‘A-SITTING ON
A GATE’: and the tune’s my own invention.”
Lewis Carroll, Through the Looking-Glass
networks, MRI scans, gene data, and even other programs. We will
represent all these objects as strings of zeroes and ones, that is objects
such as 0011101 or 1011 or any other finite list of 1’s and 0’s. (This
choice is for convenience: there is nothing “holy” about zeroes and
ones, and we could have used any other finite collection of symbols.)
Today, we are so used to the notion of digital representation that
we are not surprised by the existence of such an encoding. But it is
actually a deep insight with significant implications. Many animals
can convey a particular fear or desire, but what is unique about hu-
mans is language: we use a finite collection of basic symbols to describe
a potentially unlimited range of experiences. Language allows trans-
mission of information over both time and space and enables soci-
eties that span a great many people and accumulate a body of shared
knowledge over time.

[Figure 2.2: We represent numbers, texts, images, networks and many other objects using strings of zeroes and ones. Writing the zeroes and ones themselves in green font over a black background is optional.]

Over the last several decades, we have seen a revolution in what we can represent and convey in digital form. We can capture experiences
with almost perfect fidelity, and disseminate them essentially instanta-
neously to an unlimited audience. Moreover, once information is in
digital form, we can compute over it, and gain insights from data that
were not accessible in prior times. At the heart of this revolution is the
simple but profound observation that we can represent an unbounded
variety of objects using a finite set of symbols (and in fact using only
the two symbols 0 and 1).
In later chapters, we will typically take such representations for
granted, and hence use expressions such as “program 𝑃 takes 𝑥 as
input” when 𝑥 might be a number, a vector, a graph, or any other
object, when we really mean that 𝑃 takes as input the representation of
𝑥 as a binary string. However, in this chapter we will dwell a bit more
on how we can construct such representations.
The two “big ideas” we discuss are Big Idea 1 - we can com-
pose representations for simple objects to represent more
complex objects and Big Idea 2 - it is crucial to distinguish be-
tween functions (“what”) and programs (“how”). The latter
will be a theme we will come back to time and again in this
book.
Number   Binary representation
40       101000
53       110101
389      110000101
3750     111010100110
$$NtS(n) = \begin{cases} 0 & n = 0 \\ 1 & n = 1 \\ NtS(\lfloor n/2 \rfloor)\, parity(n) & n > 1 \end{cases} \qquad (2.1)$$
where 𝑝𝑎𝑟𝑖𝑡𝑦 ∶ ℕ → {0, 1} is the function defined as 𝑝𝑎𝑟𝑖𝑡𝑦(𝑛) = 0
if 𝑛 is even and 𝑝𝑎𝑟𝑖𝑡𝑦(𝑛) = 1 if 𝑛 is odd, and as usual, for strings
𝑥, 𝑦 ∈ {0, 1}∗ , 𝑥𝑦 denotes the concatenation of 𝑥 and 𝑦. The function
Remark 2.1 — Binary representation in python (optional).
We can implement the binary representation in Python
as follows:
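# A reconstruction following equation (2.1); the original snippet is not
# reproduced in this excerpt.
def NtS(n):
    """Represent the natural number n as a binary string."""
    if n == 0:
        return "0"
    if n == 1:
        return "1"
    return NtS(n // 2) + str(n % 2)   # NtS(floor(n/2)) followed by parity(n)

def StN(x):
    """Recover the natural number represented by the binary string x."""
    n = 0
    for digit in x:
        n = 2 * n + int(digit)
    return n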
print(NtS(236))
# 11101100
print(NtS(19))
# 10011
print(StN(NtS(236)))
# 236
Remark 2.2 — Programming examples. In this book,
we sometimes use code examples as in Remark 2.1.
The point is always to emphasize that certain com-
putations can be achieved concretely, rather than
illustrating the features of Python or any other pro-
gramming language. Indeed, one of the messages of
this book is that all programming languages are in
a certain precise sense equivalent to one another, and
hence we could have just as well used JavaScript, C,
COBOL, Visual Basic or even BrainF*ck. This book
is not about programming, and it is absolutely OK if
$$ZtS(m) = \begin{cases} 0\,NtS(m) & m \ge 0 \\ 1\,NtS(-m) & m < 0 \end{cases}$$
where $NtS$ is defined as in (2.1).
While the encoding function of a representation needs to be one
to one, it does not have to be onto. For example, in the representation
above there is no number that is represented by the empty string
but it is still a fine representation, since every integer is represented
uniquely by some string.
Remark 2.3 — Interpretation and context. Given a string
𝑦 ∈ {0, 1}∗ , how do we know if it’s “supposed” to
represent a (non-negative) natural number or a (po-
tentially negative) integer? For that matter, even if
we know 𝑦 is “supposed” to be an integer, how do
we know what representation scheme it uses? The
short answer is that we do not necessarily know this
information, unless it is supplied from the context. (In
programming languages, the compiler or interpreter
determines the representation of the sequence of bits
corresponding to a variable based on the variable’s
type.) We can treat the same string 𝑦 as representing a
natural number, an integer, a piece of text, an image,
or a green gremlin. Whenever we say a sentence such
as “let 𝑛 be the number represented by the string 𝑦,”
we will assume that we are fixing some canonical rep-
resentation scheme such as the ones above. The choice
of the particular representation scheme will rarely
matter, except that we want to make sure to stick with
the same one for consistency.
$$ZtS_n(k) = \begin{cases} NtS_{n+1}(k) & 0 \le k \le 2^n - 1 \\ NtS_{n+1}(2^{n+1} + k) & -2^n \le k \le -1 \end{cases} ,$$
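This is the familiar two's-complement style of encoding. A small Python sketch (ours; we assume $NtS_{n+1}$ pads its output with leading zeroes to exactly $n+1$ bits):

def ZtS(k, n):
    # Encode an integer k with -2^n <= k <= 2^n - 1 as an (n+1)-bit string.
    if k < 0:
        k = 2 ** (n + 1) + k                # maps -2^n..-1 to 2^n..2^{n+1}-1
    return format(k, "0%db" % (n + 1))      # NtS_{n+1}: pad to n+1 bits

print(ZtS(5, 3))   # 0101
print(ZtS(-5, 3))  # 1011 (= 16 - 5 = 11 in binary)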
bol such as ‖ to represent, for example, the pair consisting of the num-
bers represented by 10 and 110001 as the length-9 string “10‖110001”.
In other words, there is a one to one map 𝐹 from pairs of strings
𝑥, 𝑦 ∈ {0, 1}∗ into a single string 𝑧 over the alphabet Σ = {0, 1, ‖}
(in other words, 𝑧 ∈ Σ∗ ). Using such separators is similar to the
way we use spaces and punctuation to separate words in English. By
adding a little redundancy, we achieve the same effect in the digital
domain. We can map the three-element set Σ to the three-element set
{00, 11, 01} ⊂ {0, 1}2 in a one-to-one fashion, and hence encode a
length 𝑛 string 𝑧 ∈ Σ∗ as a length 2𝑛 string 𝑤 ∈ {0, 1}∗ .
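For instance, a Python sketch of this separator-then-doubling encoding (illustrative; the names are ours):

SYMBOL_TO_BITS = {"0": "00", "1": "11", "|": "01"}   # '|' plays the role of ‖

def encode_pair(x, y):
    # Encode the pair (x, y) of binary strings as a single binary string.
    return "".join(SYMBOL_TO_BITS[c] for c in x + "|" + y)

print(encode_pair("10", "110001"))  # 110001111100000011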
Our final representation for rational numbers is obtained by com-
posing the following steps:
1. Representing a (potentially negative) rational number as a pair of
integers 𝑎, 𝑏 such that 𝑟 = 𝑎/𝑏.
Theorem 2.5 was proven by Georg Cantor in 1874. This result (and
the theory around it) was quite shocking to mathematicians at the
time. By showing that there is no one-to-one map from ℝ to {0, 1}∗ (or
ℕ), Cantor showed that these two infinite sets have “different forms of
infinity” and that the set of real numbers ℝ is in some sense “bigger”
than the infinite set {0, 1}∗ . The notion that there are “shades of infin-
ity” was deeply disturbing to mathematicians and philosophers at the
time. The philosopher Ludwig Wittgenstein (whom we mentioned be-
fore) called Cantor’s results “utter nonsense” and “laughable.” Others
thought they were even worse than that. Leopold Kronecker called
Cantor a “corrupter of youth,” while Henri Poincaré said that Can-
tor’s ideas “should be banished from mathematics once and for all.”
The tide eventually turned, and these days Cantor’s work is univer-
sally accepted as the cornerstone of set theory and the foundations of
mathematics. As David Hilbert said in 1925, “No one shall expel us from
the paradise which Cantor has created for us.” As we will see later in this
book, Cantor’s ideas also play a huge role in the theory of computa-
tion.
Now that we have discussed Theorem 2.5’s importance, let us see
the proof. It is achieved in two steps:
1. Define some infinite set 𝒳 for which it is easier for us to prove that 𝒳 is not countable (namely, it's easier for us to prove that there is no one-to-one function from 𝒳 to {0, 1}∗).

2. Show that there is a one-to-one function mapping 𝒳 to ℝ, so that if ℝ were countable then 𝒳 would be countable as well.

We now proceed to do precisely that. That is, we will define the set
{0, 1}∞ , which will play the role of 𝒳, and then state and prove two
lemmas that show that this set satisfies our two desired properties.
That is, {0, 1}∞ is a set of functions, and a function 𝑓 is in {0, 1}∞
iff its domain is ℕ and its codomain is {0, 1}. We can also think of
{0, 1}∞ as the set of all infinite sequences of bits, since a function 𝑓 ∶
ℕ → {0, 1} can be identified with the sequence (𝑓(0), 𝑓(1), 𝑓(2), …).
The following two lemmas show that {0, 1}∞ can play the role of 𝒳 to
establish Theorem 2.5.
Lemma 2.8 There does not exist a one-to-one map $FtS : \{0,1\}^\infty \to \{0,1\}^*$.³

Lemma 2.9 There does exist a one-to-one map $FtR : \{0,1\}^\infty \to \mathbb{R}$.⁴

³ $FtS$ stands for “functions to strings”.
⁴ $FtR$ stands for “functions to reals.”
As we’ve seen above, Lemma 2.8 and Lemma 2.9 together imply
Theorem 2.5. To repeat the argument more formally, suppose, for
the sake of contradiction, that there did exist a one-to-one function
𝑅𝑡𝑆 ∶ ℝ → {0, 1}∗ . By Lemma 2.9, there exists a one-to-one function
$FtR : \{0,1\}^\infty \to \mathbb{R}$. Thus, under this assumption, since the composition of two one-to-one functions is one-to-one (see Exercise 2.12), the map $f \mapsto RtS(FtR(f))$ would be a one-to-one function from $\{0,1\}^\infty$ to $\{0,1\}^*$, contradicting Lemma 2.8.
Now all that is left is to prove these two lemmas. We start by prov-
ing Lemma 2.8 which is really the heart of Theorem 2.5.
Warm-up: “Baby Cantor”. The proof of Lemma 2.8 is rather subtle. One way to get intuition for it is to consider the following finite statement: “there is no onto function $f : \{0,\ldots,99\} \to \{0,1\}^{100}$”. Of course we know it's true, since the set $\{0,1\}^{100}$ is bigger than the set $[100]$, but let's see a direct proof. For every $f : \{0,\ldots,99\} \to \{0,1\}^{100}$, we can define the string $d \in \{0,1\}^{100}$ as follows: $d = (1 - f(0)_0, 1 - f(1)_1, \ldots, 1 - f(99)_{99})$. If $f$ were onto, then there would exist some $n \in [100]$ such that $f(n) = d$, but we claim that no such $n$ exists. Indeed, if there were such an $n$, then the $n$-th coordinate of $d$ would equal $f(n)_n$, but by definition this coordinate equals $1 - f(n)_n$. See also a “proof by code” of this statement.
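In the spirit of that “proof by code”, here is one way the check might look in Python (our sketch, not the book's linked code):

import random

def diagonal(f):
    # The string d in {0,1}^100 that differs from f(n) in coordinate n.
    return tuple(1 - f(n)[n] for n in range(100))

table = [tuple(random.randint(0, 1) for _ in range(100)) for _ in range(100)]

def f(n):
    return table[n]

d = diagonal(f)
assert all(f(n) != d for n in range(100))  # d witnesses that f is not onto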
Proof of Lemma 2.8. We will prove that there does not exist an onto
function 𝑆𝑡𝐹 ∶ {0, 1}∗ → {0, 1}∞ . This implies the lemma since
for every two sets 𝐴 and 𝐵, there exists an onto function from 𝐴 to
𝐵 if and only if there exists a one-to-one function from 𝐵 to 𝐴 (see
Lemma 1.2).
The technique of this proof is known as the “diagonal argument”
and is illustrated in Fig. 2.8. We assume, towards a contradiction, that
there exists such a function 𝑆𝑡𝐹 ∶ {0, 1}∗ → {0, 1}∞ . We will show
that 𝑆𝑡𝐹 is not onto by demonstrating a function 𝑑 ∈ {0, 1}∞ such that
𝑑 ≠ 𝑆𝑡𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ . Consider the lexicographic ordering
of binary strings (i.e., "",0,1,00,01,…). For every 𝑛 ∈ ℕ, we let 𝑥𝑛 be the
𝑛-th string in this order. That is 𝑥0 = "", 𝑥1 = 0, 𝑥2 = 1 and so on and
so forth. We define the function 𝑑 ∈ {0, 1}∞ by setting 𝑑(𝑛) = 1 − 𝑆𝑡𝐹(𝑥𝑛)(𝑛) for every 𝑛 ∈ ℕ. If we imagine the values of 𝑆𝑡𝐹 arranged in an infinite table, then 𝑑 is obtained by negating the diagonal values
𝑆𝑡𝐹("")(0), 𝑆𝑡𝐹(0)(1), 𝑆𝑡𝐹(1)(2), 𝑆𝑡𝐹(00)(3), 𝑆𝑡𝐹(01)(4), …
which correspond to the elements 𝑆𝑡𝐹 (𝑥𝑛 )(𝑛) in the 𝑛-th row and
𝑛-th column of this table for 𝑛 = 0, 1, 2, …. The function 𝑑 we defined
above maps every 𝑛 ∈ ℕ to the negation of the 𝑛-th diagonal value.
To complete the proof that 𝑆𝑡𝐹 is not onto we need to show that
𝑑 ≠ 𝑆𝑡𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ . Indeed, let 𝑥 ∈ {0, 1}∗ be some string
and let 𝑔 = 𝑆𝑡𝐹 (𝑥). If 𝑛 is the position of 𝑥 in the lexicographical
order then by construction 𝑑(𝑛) = 1 − 𝑔(𝑛) ≠ 𝑔(𝑛) which means that
𝑔 ≠ 𝑑 which is what we wanted to prove.
■
Remark 2.10 — Generalizing beyond strings and reals.
Lemma 2.8 doesn’t really have much to do with the
natural numbers or the strings. An examination of
the proof shows that it really shows that for every
set 𝑆, there is no one-to-one map 𝐹 ∶ {0, 1}𝑆 → 𝑆
where {0, 1}𝑆 denotes the set {𝑓 | 𝑓 ∶ 𝑆 → {0, 1}}
of all Boolean functions with domain 𝑆. Since we can
identify a subset 𝑉 ⊆ 𝑆 with its characteristic function
𝑓 = 1𝑉 (i.e., 1𝑉 (𝑥) = 1 iff 𝑥 ∈ 𝑉 ), we can think of
{0, 1}𝑆 also as the set of all subsets of 𝑆. This subset
is sometimes called the power set of 𝑆 and denoted by
𝒫(𝑆) or 2𝑆 .
The proof of Lemma 2.8 can be generalized to show that there is no one-to-one map from the power set of any set to the set itself. In particular, it means that the set $\{0,1\}^{\mathbb{R}}$ is
“even bigger” than ℝ. Cantor used these ideas to con-
struct an infinite hierarchy of shades of infinity. The
number of such shades turns out to be much larger
than |ℕ| or even |ℝ|. He denoted the cardinality of ℕ
by ℵ0 and denoted the next largest infinite number
by ℵ1 . (ℵ is the first letter in the Hebrew alphabet.)
Cantor also made the continuum hypothesis that
|ℝ| = ℵ1 . We will come back to the fascinating story
of this hypothesis later on in this book. This lecture of
Aaronson mentions some of these issues (see also this
Berkeley CS 70 lecture).
Proof Idea:
We define $FtR(f)$ to be the number between 0 and 2 whose decimal expansion is $f(0).f(1)f(2)\ldots$, or in other words $FtR(f) = \sum_{i=0}^{\infty} f(i) \cdot 10^{-i}$. If $f$ and $g$ are two distinct functions in $\{0,1\}^\infty$, then
there must be some input 𝑘 in which they disagree. If we take the
minimum such 𝑘, then the numbers 𝑓(0).𝑓(1)𝑓(2) … 𝑓(𝑘 − 1)𝑓(𝑘) …
and 𝑔(0).𝑔(1)𝑔(2) … 𝑔(𝑘) … agree with each other all the way up to the
𝑘 − 1-th digit after the decimal point, and disagree on the 𝑘-th digit.
But then these numbers must be distinct. Concretely, if 𝑓(𝑘) = 1 and
𝑔(𝑘) = 0 then the first number is larger than the second, and otherwise
(𝑓(𝑘) = 0 and 𝑔(𝑘) = 1) the first number is smaller than the second.
In the proof we have to be a little careful since these are numbers with
infinite expansions. For example, the number one half has two decimal
expansions 0.5 and 0.49999 ⋯. However, this issue does not come up
here, since we restrict attention only to numbers with decimal expan-
sions that do not involve the digit 9.
⋆
Proof of Lemma 2.9. For every $f \in \{0,1\}^\infty$, we define $FtR(f)$ to be the number whose decimal expansion is $f(0).f(1)f(2)f(3)\ldots$. Formally,
$$FtR(f) = \sum_{i=0}^{\infty} f(i) \cdot 10^{-i} \qquad (2.2)$$
Remark 2.11 — Using decimal expansion (op-
tional). In the proof above we used the fact that
$1 + 1/10 + 1/100 + \cdots$ converges to $10/9$, which plugging into (2.3) yields that the difference between $FtR(g)$ and $FtR(h)$ is at least $10^{-k} - 10^{-k-1} \cdot (10/9) > 0$.
While the choice of the decimal representation for 𝐹 𝑡𝑅
was arbitrary, we could not have used the binary
representation in its place. Had we used the binary
expansion instead of decimal, the corresponding se-
quence 1 + 1/2 + 1/4 + ⋯ converges to 2/1 = 2,
Equivalently, there does not exist an onto map 𝑆𝑡𝐴𝐿𝐿 ∶ {0, 1}∗ → ALL.
Proof Idea:
This is a direct consequence of Lemma 2.8, since we can use the
binary representation to show a one-to-one map from {0, 1}∞ to ALL.
Hence the uncountability of {0, 1}∞ implies the uncountability of
ALL.
⋆
Proof of Theorem 2.12. Since {0, 1}∞ is uncountable, the result will
follow by showing a one-to-one map from {0, 1}∞ to ALL. The reason
is that the existence of such a map implies that if ALL was countable,
and hence there was a one-to-one map from ALL to ℕ, then there
would have been a one-to-one map from {0, 1}∞ to ℕ, contradicting
Lemma 2.8.
We now show this one-to-one map. We simply map a function
𝑓 ∈ {0, 1}∞ to the function 𝐹 ∶ {0, 1}∗ → {0, 1} as follows. We let
𝐹 (0) = 𝑓(0), 𝐹 (1) = 𝑓(1), 𝐹 (10) = 𝑓(2), 𝐹 (11) = 𝑓(3) and so on and
so forth. That is, for every 𝑥 ∈ {0, 1}∗ that represents a natural number
𝑛 in the binary basis, we define 𝐹 (𝑥) = 𝑓(𝑛). If 𝑥 does not represent
such a number (e.g., it has a leading zero), then we set 𝐹 (𝑥) = 0.
This map is one-to-one since if 𝑓 ≠ 𝑔 are two distinct elements in
{0, 1}∞ , then there must be some input 𝑛 ∈ ℕ on which 𝑓(𝑛) ≠ 𝑔(𝑛).
But then if 𝑥 ∈ {0, 1}∗ is the string representing 𝑛, we see that 𝐹(𝑥) ≠ 𝐺(𝑥), where 𝐹 is the function in ALL that 𝑓 is mapped to, and 𝐺 is the function that 𝑔 is mapped to.
■
Make sure you know how to prove the equivalence of
all the results above.
Remark 2.15 — Total decoding functions. While the
decoding function of a representation scheme can in
general be a partial function, the proof of Lemma 2.14
implies that every representation scheme has a total
decoding function. This observation can sometimes be
useful.
if you have a pigeon coop with 𝑚 holes and 𝑘 > 𝑚 pigeons, then there
must be two pigeons in the same hole.)
■
Recall that for every set 𝒪, the set 𝒪∗ consists of all finite length
tuples (i.e., lists) of elements in 𝒪. The following theorem shows that
if 𝐸 is a prefix-free encoding of 𝒪 then by concatenating encodings we
can obtain a valid (i.e., one-to-one) representation of 𝒪∗ :
Theorem 2.18 is an example of a theorem that is a little
hard to parse, but in fact is fairly straightforward to
prove once you understand what it means. Therefore,
I highly recommend that you pause here to make
sure you understand the statement of this theorem.
You should also try to prove it on your own before
proceeding further.
Proof Idea:
The idea behind the proof is simple. Suppose that, for example, we want to decode a triple $(o_0, o_1, o_2)$ from its representation $x = E(o_0, o_1, o_2) = E(o_0)E(o_1)E(o_2)$. We will do so by first finding the first prefix $x_0$ of $x$ that is a representation of some object. Then we

[Figure 2.9: If we have a prefix-free representation of each object, then we can concatenate the representations of $k$ objects to obtain a representation for the tuple $(o_0, \ldots, o_{k-1})$.]
Proof of Theorem 2.18. We now show the formal proof. Suppose, towards the sake of contradiction, that there exist two distinct tuples $(o_0, \ldots, o_{k-1})$ and $(o'_0, \ldots, o'_{k'-1})$ such that
$$E(o_0)E(o_1) \cdots E(o_{k-1}) = E(o'_0)E(o'_1) \cdots E(o'_{k'-1}) .$$
Denote this common string by $x$, let $i$ be the smallest coordinate such that $o_i \neq o'_i$ (assume first that $i < k$ and $i < k'$), and write $x_j = E(o_j) = E(o'_j)$ for all $j < i$.
after removing the prefix 𝑥0 ⋯ 𝑥𝑖−1 from 𝑥. We see that 𝑦 can be writ-
ten as both 𝑦 = 𝐸(𝑜𝑖 )𝑠 for some string 𝑠 ∈ {0, 1}∗ and as 𝑦 = 𝐸(𝑜𝑖′ )𝑠′
for some 𝑠′ ∈ {0, 1}∗ . But this means that one of 𝐸(𝑜𝑖 ) and 𝐸(𝑜𝑖′ ) must
be a prefix of the other, contradicting the prefix-freeness of 𝐸.
In the case that 𝑖 = 𝑘 and 𝑘′ > 𝑘, we get a contradiction in the
following way. In this case
Remark 2.19 — Prefix freeness of list representation.
Even if the representation 𝐸 of objects in 𝒪 is prefix
free, this does not mean that our representation 𝐸
of lists of such objects will be prefix free as well. In
fact, it won’t be: for every three objects 𝑜, 𝑜′ , 𝑜″ the
representation of the list (𝑜, 𝑜′ ) will be a prefix of the
representation of the list (𝑜, 𝑜′ , 𝑜″ ). However, as we see
in Lemma 2.20 below, we can transform every repre-
sentation into prefix-free form, and so will be able to
use that transformation if needed to represent lists of
lists, lists of lists of lists, and so on and so forth.
For the sake of completeness, we will include the
proof below, but it is a good idea for you to pause
here and try to prove it on your own, using the same
technique we used for representing rational numbers.
Proof of Lemma 2.20. The idea behind the proof is to use the map 0 ↦
00, 1 ↦ 11 to “double” every bit in the string 𝑥 and then mark the
end of the string by concatenating to it the pair 01. If we encode a
string 𝑥 in this way, it ensures that the encoding of 𝑥 is never a prefix
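A Python sketch of this doubling-plus-marker transformation (our illustration; it matches the pfNtS outputs shown below):

def prefixfree_encode(x):
    # Double every bit of x (0 -> 00, 1 -> 11) and append the marker 01.
    # Since 01 never appears inside the doubled part, no encoding is a
    # proper prefix of another.
    return "".join(bit + bit for bit in x) + "01"

print(prefixfree_encode("11101010"))  # 111111001100110001 (= pfNtS(234))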
The proof of Lemma 2.20 is not the only or even the best way to transform an arbitrary representation into prefix-free form. Exercise 2.10 asks you to construct a more efficient prefix-free transformation satisfying $|\overline{E}(o)| \le |E(o)| + O(\log |E(o)|)$.
NtS(234)
# 11101010
pfNtS(234)
# 111111001100110001
pfStN(pfNtS(234))
# 234
pfvalidM(pfNtS(234))
# true
Note that the Python function prefixfree above
takes two Python functions as input and outputs
three Python functions as output. (When it’s not
too awkward, we use the term “Python function” or
“subroutine” to distinguish between such snippets of
Python programs and mathematical functions.) You
don’t have to know Python in this course, but you do
need to get comfortable with the idea of functions as
mathematical objects in their own right, that can be
used as inputs and outputs of other functions.
def represlists(pfencode, pfdecode, pfvalid):
    """
    Takes functions pfencode, pfdecode and pfvalid,
    and returns functions encodelist, decodelist
    that can encode and decode lists of the objects
    respectively.
    """
    def encodelist(L):
        """Gets list of objects, encodes it as list of bits"""
        return "".join([pfencode(obj) for obj in L])

    def decodelist(S):
        """Gets lists of bits, returns lists of objects"""
        i = 0; j = 1; res = []
        while j <= len(S):
            if pfvalid(S[i:j]):   # S[i:j] is a valid encoding: decode it greedily
                res += [pfdecode(S[i:j])]
                i = j
            j += 1
        return res

    return encodelist, decodelist
LtS([234,12,5])
# 111111001100110001111100000111001101
StL(LtS([234,12,5]))
# [234, 12, 5]
edge $\overrightarrow{i\,j} \in E$. We can transform an undirected graph to a directed graph by replacing every edge $\{i, j\}$ with both edges $\overrightarrow{i\,j}$ and $\overleftarrow{i\,j}$.
Another representation for graphs is the adjacency list representa-
tion. That is, we identify the vertex set 𝑉 of a graph with the set [𝑛]
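To make the two representations concrete, here is how the 4-vertex undirected graph with edges {0, 1}, {1, 2}, {1, 3} might look in Python (our illustration):

# adjacency matrix: A[i][j] == 1 iff {i, j} is an edge
A = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 0],
     [0, 1, 0, 0]]

# adjacency list: adj[i] lists the neighbors of vertex i
adj = [[1], [0, 2, 3], [1], [1]]

assert all((j in adj[i]) == (A[i][j] == 1) for i in range(4) for j in range(4))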
2.5.9 Notation
We will typically identify an object with its representation as a string.
For example, if 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is some function that maps
strings to strings and 𝑛 is an integer, we might make statements such
as “𝐹 (𝑛) + 1 is prime” to mean that if we represent 𝑛 as a string 𝑥,
then the integer 𝑚 represented by the string 𝐹 (𝑥) satisfies that 𝑚 + 1
is prime. (You can see how this convention of identifying objects with
their representation can save us a lot of cumbersome formalism.)
Similarly, if 𝑥, 𝑦 are some objects and 𝐹 is a function that takes strings
as inputs, then by 𝐹 (𝑥, 𝑦) we will mean the result of applying 𝐹 to the
representation of the ordered pair (𝑥, 𝑦). We use the same notation to
invoke functions on 𝑘-tuples of objects for every 𝑘.
This convention of identifying an object with its representation as
a string is one that we humans follow all the time. For example, when
people say a statement such as “17 is a prime number”, what they
really mean is that the integer whose decimal representation is the
string “17”, is prime.
When we say
𝐴 is an algorithm that computes the multiplication function on natural num-
bers.
what we really mean is that
𝐴 is an algorithm that computes the function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ such that
for every pair 𝑎, 𝑏 ∈ ℕ, if 𝑥 ∈ {0, 1}∗ is a string representing the pair (𝑎, 𝑏)
then 𝐹 (𝑥) will be a string representing their product 𝑎 ⋅ 𝑏.
Remark 2.23 — Boolean functions and languages. An
important special case of computational tasks corre-
sponds to computing Boolean functions, whose output
is a single bit {0, 1}. Computing such functions corre-
sponds to answering a YES/NO question, and hence
this task is also known as a decision problem. Given any
function 𝐹 ∶ {0, 1}∗ → {0, 1} and 𝑥 ∈ {0, 1}∗ , the task
of computing 𝐹 (𝑥) corresponds to the task of deciding
whether or not 𝑥 ∈ 𝐿 where 𝐿 = {𝑥 ∶ 𝐹 (𝑥) = 1} is
known as the language that corresponds to the function
𝐹 . (The language terminology is due to historical
connections between the theory of computation and
formal linguistics as developed by Noam Chomsky.)
Hence many texts refer to such a computational task
as deciding a language.
def mult1(x,y):
res = 0
while y>0:
res += x
y -= 1
return res
def mult2(x,y):
    a = str(x) # represent x as string in decimal notation
    b = str(y) # represent y as string in decimal notation
    res = 0
    for i in range(len(a)):
        for j in range(len(b)):
            # digit i of a (counting from the least significant digit)
            # times digit j of b contributes at position i+j
            res += int(a[len(a)-1-i])*int(b[len(b)-1-j])*(10**(i+j))
    return res
print(mult1(12,7))
# 84
print(mult2(12,7))
# 84
Both mult1 and mult2 produce the same output given the same
pair of natural number inputs. (Though mult1 will take far longer to
do so when the numbers become large.) Hence, even though these are
two different programs, they compute the same mathematical function.
This distinction between a program or algorithm 𝐴, and the function 𝐹
that 𝐴 computes will be absolutely crucial for us in this course (see also
Fig. 2.13).
✓ Chapter Recap
2.7 EXERCISES
Exercise 2.1 Which one of these objects can be represented by a binary
string?
a. An integer 𝑥
b. An undirected graph 𝐺.
c. A directed graph 𝐻
Exercise 2.3 — More compact than ASCII representation. The ASCII encoding
can be used to encode a string of 𝑛 English letters as a 7𝑛 bit binary
string, but in this exercise, we ask about finding a more compact rep-
resentation for strings of English lowercase letters.
2. Prove that there exists no representation scheme for strings over the
alphabet {𝑎, 𝑏, … , 𝑧} as binary strings such that for every length-𝑛
string 𝑥 ∈ {𝑎, 𝑏, … , 𝑧}𝑛 , the representation 𝐸(𝑥) is a binary string of
length ⌊4.6𝑛 + 1000⌋. In other words, prove that there exists some
𝑛 > 0 such that there is no one-to-one function 𝐸 ∶ {𝑎, 𝑏, … , 𝑧}𝑛 →
{0, 1}⌊4.6𝑛+1000⌋ .
Exercise 2.4 — Representing graphs: upper bound. Show that there is a string
representation of directed graphs with vertex set [𝑛] and degree at
most 10 that uses at most 1000𝑛 log 𝑛 bits. More formally, show the
following: Suppose we define for every 𝑛 ∈ ℕ, the set 𝐺𝑛 as the set
containing all directed graphs (with no self loops) over the vertex
set [𝑛] where every vertex has degree at most 10. Then, prove that for
every sufficiently large 𝑛, there exists a one-to-one function 𝐸 ∶ 𝐺𝑛 →
{0, 1}⌊1000𝑛 log 𝑛⌋ .
■
Exercise 2.5 — Representing graphs: lower bound.

1. Define 𝑆𝑛 to be the set of one-to-one and onto functions mapping [𝑛] to [𝑛]. Prove that there is a one-to-one mapping from 𝑆𝑛 to 𝐺2𝑛, where 𝐺2𝑛 is the set defined in Exercise 2.4 above.

2. In this question you will show that one cannot improve the representation of Exercise 2.4 to length 𝑜(𝑛 log 𝑛). Specifically, prove that for every sufficiently large 𝑛 ∈ ℕ there is no one-to-one function 𝐸 ∶ 𝐺𝑛 → {0, 1}⌊0.001𝑛 log 𝑛⌋+1000.
■
2. Use 1. to compute the size of the set {𝑦 ∈ {0, 1}∗ ∶ |𝑦| ≤ 𝑘} where |𝑦|
denotes the length of the string 𝑦.
Exercise 2.10 — More efficient prefix-free transformation. Suppose that 𝐹 ∶ 𝑂 → {0, 1}∗ is some (not necessarily prefix-free) representation of the objects in the set 𝑂, and 𝐺 ∶ ℕ → {0, 1}∗ is a prefix-free representation of the natural numbers. Define 𝐹′(𝑜) = 𝐺(|𝐹(𝑜)|)𝐹(𝑜) (i.e., the concatenation of the representation of the length of 𝐹(𝑜) and 𝐹(𝑜) itself).
a. For every 𝑥 ∈ 𝑆, let 𝐿(𝑥) ⊆ {0, 1}𝑛 denote all the length-𝑛 strings
whose first 𝑘 bits are 𝑥0 , … , 𝑥𝑘−1 . Prove that (1) |𝐿(𝑥)| = 2𝑛−|𝑥| and
(2) For every distinct 𝑥, 𝑥′ ∈ 𝑆, 𝐿(𝑥) is disjoint from 𝐿(𝑥′ ).
b. Prove that ∑𝑥∈𝑆 2−|𝑥| ≤ 1. (Hint: first show that ∑𝑥∈𝑆 |𝐿(𝑥)| ≤ 2𝑛 .)
3
Defining computation
“there is no reason why mental as well as bodily labor should not be economized
by the aid of machinery”, Charles Babbage, 1852
“If, unwarned by my example, any man shall undertake and shall succeed
in constructing an engine embodying in itself the whole of the executive de-
partment of mathematical analysis upon different principles or by simpler
mechanical means, I have no fear of leaving my reputation in his charge, for he
alone will be fully able to appreciate the nature of my efforts and the value of
their results.”, Charles Babbage, 1864
“To understand a program you must become both the machine and the pro-
gram.”, Alan Perlis, 1982
[How to solve an equation of the form] “roots and squares are equal to numbers”: For instance “one square, and ten roots of the same, amount to thirty-nine dirhems”; that is to say, what must be the square which, when increased by ten of its own root, amounts to thirty-nine? The solution is this: you halve the number of the roots, which in the present instance yields five. This you multiply by itself; the product is twenty-five. Add this to thirty-nine; the sum is sixty-four. Now take the root of this, which is eight, and subtract from it half the number of roots, which is five; the remainder is three. This is the root of the square which you sought for; the square itself is nine.
For the purposes of this book, we will need a much more precise way to describe algorithms. Fortunately (or is it unfortunately?), at least at the moment, computers lag far behind school-age children in learning from examples. Hence in the 20th century, people came up with exact formalisms for describing algorithms, namely programming languages. Here is al-Khwarizmi's quadratic equation solving algorithm, described in the Python programming language:

def solve_eq(b,c):
    # return solution of x^2 + bx = c following Al Khwarizmi's instructions
    # Al Khwarizmi demonstrates this for the case b=10 and c=39
    val1 = b / 2.0      # halve the number of the roots
    val2 = val1 * val1  # this you multiply by itself
    val3 = val2 + c     # add this to c (thirty-nine)
    val4 = val3 ** 0.5  # take the root of this
    val5 = val4 - val1  # subtract from it half the number of roots
    return val5         # this is the root of the square you sought

Figure 3.4: Text pages from Algebra manuscript with geometrical solutions to two quadratic equations. Shelfmark: MS. Huntington 214 fol. 004v-005r

Figure 3.5: An explanation for children of the two digit addition algorithm
$$\mathrm{OR}(a,b) = \begin{cases} 0 & a = b = 0 \\ 1 & \text{otherwise} \end{cases}$$

$$\mathrm{AND}(a,b) = \begin{cases} 1 & a = b = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{NOT}(a) = \begin{cases} 0 & a = 1 \\ 1 & a = 0 \end{cases}$$
The functions AND, OR and NOT are the basic logical operators used in logic and many computer systems. In the context of logic, it is common to use the notation 𝑎 ∧ 𝑏 for AND(𝑎, 𝑏), 𝑎 ∨ 𝑏 for OR(𝑎, 𝑏), and ¬𝑎 (also written $\bar{a}$) for NOT(𝑎), and we will use this notation as well.
Each one of the functions AND, OR, NOT takes either one or two
single bits as input, and produces a single bit as output. Clearly, it
cannot get much more basic than that. However, the power of compu-
tation comes from composing such simple building blocks together.
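While AND, OR and NOT are mathematical functions rather than programs, they have an obvious Python rendering, which the snippets below implicitly assume (a sketch; any equivalent definition would do):

def AND(a, b): return a & b   # 1 if and only if a = b = 1
def OR(a, b):  return a | b   # 0 if and only if a = b = 0
def NOT(a):    return 1 - a   # flip a single bit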
$$\mathrm{MAJ}(x) = \begin{cases} 1 & x_0 + x_1 + x_2 \ge 2 \\ 0 & \text{otherwise} \end{cases}$$
That is, for every 𝑥 ∈ {0, 1}3 , MAJ(𝑥) = 1 if and only if the ma-
jority (i.e., at least two out of the three) of 𝑥’s elements are equal
to 1. Can you come up with a formula involving AND, OR and
NOT to compute MAJ? (It would be useful for you to pause at this
point and work out the formula for yourself. As a hint, although
the NOT operator is needed to compute some functions, you will
not need to use it to compute MAJ.)
Let us first try to rephrase MAJ(𝑥) in words: “MAJ(𝑥) = 1 if and
only if there exists some pair of distinct elements 𝑖, 𝑗 such that both
𝑥𝑖 and 𝑥𝑗 are equal to 1.” In other words it means that MAJ(𝑥) = 1
iff either both 𝑥0 = 1 and 𝑥1 = 1, or both 𝑥1 = 1 and 𝑥2 = 1, or both
𝑥0 = 1 and 𝑥2 = 1. Since the OR of three conditions 𝑐0, 𝑐1, 𝑐2 can be written as OR(𝑐0, OR(𝑐1, 𝑐2)), we can now translate this into a formula as follows:

$$\mathrm{MAJ}(x_0,x_1,x_2) = \mathrm{OR}\bigl(\mathrm{AND}(x_0,x_1),\ \mathrm{OR}(\mathrm{AND}(x_1,x_2),\ \mathrm{AND}(x_0,x_2))\bigr).$$
def MAJ(X[0],X[1],X[2]):
firstpair = AND(X[0],X[1])
secondpair = AND(X[1],X[2])
thirdpair = AND(X[0],X[2])
temp = OR(secondpair,thirdpair)
return OR(firstpair,temp)
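One way to verify that this program indeed computes MAJ is to enumerate all eight inputs (a sketch using the AND and OR functions from before; MAJ3 restates the program above with valid Python parameter names):

from itertools import product

def MAJ3(a, b, c):
    firstpair  = AND(a, b)
    secondpair = AND(b, c)
    thirdpair  = AND(a, c)
    temp = OR(secondpair, thirdpair)
    return OR(firstpair, temp)

for x in product([0, 1], repeat=3):
    # the majority of three bits is 1 exactly when at least two of them are 1
    assert MAJ3(*x) == (1 if sum(x) >= 2 else 0)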
Solution:
We can prove this by enumerating over all the 8 possible values
for 𝑎, 𝑏, 𝑐 ∈ {0, 1} but it also follows from the standard distributive
law. Suppose that we identify any positive integer with “true” and the value zero with “false”. Then for all numbers 𝑢, 𝑣 ∈ ℕ, 𝑢 + 𝑣 is positive if and only if 𝑢 ∨ 𝑣 is true, and 𝑢 ⋅ 𝑣 is positive if and only if 𝑢 ∧ 𝑣 is true. This means that for every 𝑎, 𝑏, 𝑐 ∈ {0, 1}, the expression 𝑎 ∧ (𝑏 ∨ 𝑐) is true if and only if 𝑎 ⋅ (𝑏 + 𝑐) is positive, and the expression (𝑎 ∧ 𝑏) ∨ (𝑎 ∧ 𝑐) is true if and only if 𝑎 ⋅ 𝑏 + 𝑎 ⋅ 𝑐 is positive. But by the standard distributive law 𝑎 ⋅ (𝑏 + 𝑐) = 𝑎 ⋅ 𝑏 + 𝑎 ⋅ 𝑐, and hence the former expression is true if and only if the latter one is.
■
3.2.2 Extended example: Computing XOR from AND, OR, and NOT
Let us see how we can obtain a different function from the same
building blocks. Define XOR ∶ {0, 1}2 → {0, 1} to be the function
XOR(𝑎, 𝑏) = 𝑎 + 𝑏 mod 2. That is, XOR(0, 0) = XOR(1, 1) = 0 and
XOR(1, 0) = XOR(0, 1) = 1. We claim that we can construct XOR
using only AND, OR, and NOT.
P
As usual, it is a good exercise to try to work out the
algorithm for XOR using AND, OR and NOT on your
own before reading further.
The following algorithm computes XOR using AND, OR, and NOT:
def XOR(a,b):
w1 = AND(a,b)
w2 = NOT(w1)
w3 = OR(a,b)
return AND(w2,w3)
Solution:
Addition modulo two satisfies the same properties of associativ-
ity ((𝑎 + 𝑏) + 𝑐 = 𝑎 + (𝑏 + 𝑐)) and commutativity (𝑎 + 𝑏 = 𝑏 + 𝑎) as
standard addition. This means that, if we define 𝑎 ⊕ 𝑏 to equal 𝑎 + 𝑏 mod 2, then

$$\mathrm{XOR}_3(a, b, c) = (a \oplus b) \oplus c$$

or in other words:
def XOR3(a,b,c):
w1 = AND(a,b)
w2 = NOT(w1)
w3 = OR(a,b)
w4 = AND(w2,w3)
w5 = AND(w4,c)
w6 = NOT(w5)
w7 = OR(w4,c)
return AND(w6,w7)
P
Try to generalize the above examples to obtain a way
to compute XOR𝑛 ∶ {0, 1}𝑛 → {0, 1} for every 𝑛 us-
ing at most 4𝑛 basic steps involving applications of a
function in {AND, OR, NOT} to outputs or previously
computed values.
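One way to meet this bound (a sketch, using the two-bit XOR function defined above: folding XOR over the input uses 𝑛 − 1 two-bit XORs, each of which expands to four AND/OR/NOT operations, for at most 4𝑛 basic steps in total):

def XORn(X):
    # compute the parity of the bits in the list X
    res = X[0]
    for b in X[1:]:
        res = XOR(res, b)  # each call costs 4 basic operations
    return res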
P
These concerns will to a large extent guide us in the
upcoming chapters. Thus you would be well advised
to re-read the above informal definition and see what
you think about these issues.
In the remainder of this chapter, and the rest of this book, we will
begin to answer some of these questions. We will see more examples
of the power of simple operations to compute more complex opera-
tions including addition, multiplication, sorting and more. We will
also discuss how to physically implement simple operations such as
AND, OR and NOT using a variety of technologies.
outgoing from it. We also designate some gates as output gates, and
their value corresponds to the result of evaluating the circuit. For ex-
ample, Fig. 3.8 gives such a circuit for the XOR function, following
Section 3.2.2. We evaluate an 𝑛-input Boolean circuit 𝐶 on an input
𝑥 ∈ {0, 1}𝑛 by placing the bits of 𝑥 on the inputs, and then propagat-
ing the values on the wires until we reach an output, see Fig. 3.9.
Figure 3.8: A circuit with AND, OR and NOT gates for computing the XOR function.

R
Remark 3.4 — Physical realization of Boolean circuits. Boolean circuits are a mathematical model that does not
necessarily correspond to a physical object, but they
can be implemented physically. In physical imple-
mentations of circuits, the signal is often implemented
by electric potential, or voltage, on a wire, where for
example voltage above a certain level is interpreted
as a logical value of 1, and below a certain level is in-
terpreted as a logical value of 0. Section 3.5 discusses
physical implementations of Boolean circuits (with
examples including using electrical signals such as
in silicon-based circuits, as well as biological and
mechanical implementations).
Solution:
Another way to describe the function ALLEQ is that it outputs 1 on an input 𝑥 ∈ {0, 1}⁴ if and only if 𝑥 = 0⁴ or 𝑥 = 1⁴. We can phrase the condition 𝑥 = 1⁴ as 𝑥0 ∧ 𝑥1 ∧ 𝑥2 ∧ 𝑥3, which can be computed using three AND gates. Similarly we can phrase the condition 𝑥 = 0⁴ as ¬𝑥0 ∧ ¬𝑥1 ∧ ¬𝑥2 ∧ ¬𝑥3, which can be computed using four NOT gates and three AND gates. The output of ALLEQ is the OR of these two conditions, which results in the circuit of 4 NOT gates, 6 AND gates, and one OR gate presented in Fig. 3.10.
■
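In Python-like form (a sketch assuming the AND, OR and NOT functions from before), the circuit of Fig. 3.10 corresponds to:

def ALLEQ(x0, x1, x2, x3):
    all_ones  = AND(x0, AND(x1, AND(x2, x3)))                      # x = 1111: three ANDs
    all_zeros = AND(NOT(x0), AND(NOT(x1), AND(NOT(x2), NOT(x3))))  # x = 0000: four NOTs, three ANDs
    return OR(all_ones, all_zeros)                                 # one OR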
Figure 3.10: A Boolean circuit for computing the all equal function ALLEQ ∶ {0, 1}⁴ → {0, 1} that outputs 1 on 𝑥 ∈ {0, 1}⁴ if and only if 𝑥0 = 𝑥1 = 𝑥2 = 𝑥3.

1. Formally define a Boolean circuit as a mathematical object.

2. Formally define what it means for a circuit 𝐶 to compute a function 𝑓.
• The other 𝑠 vertices are known as gates. Each gate is labeled with
∧, ∨ or ¬. Gates labeled with ∧ (AND) or ∨ (OR) have two in-
neighbors. Gates labeled with ¬ (NOT) have one in-neighbor.
We will allow parallel edges.
• Exactly 𝑚 of the gates are also labeled with the 𝑚 labels Y[0], …,
Y[𝑚 − 1] (in addition to their label ∧/∨/¬). These are known as
outputs.
• For every 𝑣 in the ℓ-th layer (i.e., 𝑣 such that ℎ(𝑣) = ℓ) do:
• The result of this process is the value 𝑦 ∈ {0, 1}𝑚 such that for
every 𝑗 ∈ [𝑚], 𝑦𝑗 is the value assigned to the vertex with label
Y[𝑗].
Let 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 . We say that the circuit 𝐶 computes 𝑓 if
for every 𝑥 ∈ {0, 1}𝑛 , 𝐶(𝑥) = 𝑓(𝑥).
R
Remark 3.7 — Boolean circuits nitpicks (optional). In
phrasing Definition 3.5, we’ve made some technical choices.
temp = AND(X[0],X[1])
Y[0] = NOT(temp)
Definition 3.8 — Computing a function via AON-CIRC programs. Let 𝑓 ∶ {0, 1}ⁿ → {0, 1}ᵐ, and 𝑃 be a valid AON-CIRC program with 𝑛 inputs and 𝑚 outputs. We say that 𝑃 computes 𝑓 if 𝑃(𝑥) = 𝑓(𝑥) for every 𝑥 ∈ {0, 1}ⁿ.
Solution:
Writing such a program is tedious but not truly hard. To com-
pare two numbers we first compare their most significant digit,
and then go down to the next digit and so on and so forth. In this
case where the numbers have just two binary digits, these compar-
isons are particularly simple. The number represented by (𝑎, 𝑏) is
larger than the number represented by (𝑐, 𝑑) if and only if one of
the following conditions happens:
1. The most significant bit 𝑎 of (𝑎, 𝑏) is larger than the most signifi-
cant bit 𝑐 of (𝑐, 𝑑).
or
2. The two most significant bits 𝑎 and 𝑐 are equal, but 𝑏 > 𝑑.
# Compute CMP:{0,1}^4-->{0,1}
# CMP(X)=1 iff 2X[0]+X[1] > 2X[2] + X[3]
temp_1 = NOT(X[2])
temp_2 = AND(X[0],temp_1)
temp_3 = OR(X[0],temp_1)
temp_4 = NOT(X[3])
temp_5 = AND(X[1],temp_4)
temp_6 = AND(temp_5,temp_3)
Y[0] = OR(temp_2,temp_6)
Theorem 3.9 — Equivalence of circuits and straight-line programs. Let 𝑓 ∶ {0, 1}ⁿ → {0, 1}ᵐ and 𝑠 ≥ 𝑚 be some number. Then 𝑓 is computable by a Boolean circuit with 𝑠 gates if and only if 𝑓 is computable by an AON-CIRC program of 𝑠 lines.

Figure 3.12: A circuit for computing the CMP function. The evaluation of this circuit on (1, 1, 1, 0) yields the output 1, since the number 3 (represented in binary as 11) is larger than the number 2 (represented in binary as 10).

Proof Idea:
The idea is simple: AON-CIRC programs and Boolean circuits are just different ways of describing the exact same computational
process. For example, an AND gate in a Boolean circuit corresponds to
computing the AND of two previously-computed values. In an AON-
CIRC program this will correspond to the line that stores in a variable
the AND of two previously-computed variables.
⋆
P
This proof of Theorem 3.9 is simple at heart, but all
the details it contains can make it a little cumbersome
to read. You might be better off trying to work it out
yourself before reading it. Our GitHub repository con-
tains a “proof by Python” of Theorem 3.9: implemen-
tation of functions circuit2prog and prog2circuits
mapping Boolean circuits to AON-CIRC programs and
vice versa.
Proof of Theorem 3.9. Let 𝑓 ∶ {0, 1}𝑛 → {0, 1}𝑚 . Since the theorem is an
“if and only if” statement, to prove it we need to show both directions:
translating an AON-CIRC program that computes 𝑓 into a circuit that
computes 𝑓, and translating a circuit that computes 𝑓 into an AON-
CIRC program that does so.
We start with the first direction. Let 𝑃 be an AON-CIRC program
that computes 𝑓. We define a circuit 𝐶 as follows: the circuit will
have 𝑛 inputs and 𝑠 gates. For every 𝑖 ∈ [𝑠], if the 𝑖-th operator line
has the form foo = AND(bar,blah) then the 𝑖-th gate in the circuit
will be an AND gate that is connected to gates 𝑗 and 𝑘 where 𝑗 and
𝑘 correspond to the last lines before 𝑖 where the variables bar and
blah (respectively) were written to. (For example, if 𝑖 = 57 and the
last line bar was written to is 35 and the last line blah was written
to is 17 then the two in-neighbors of gate 57 will be gates 35 and 17.)
If either bar or blah is an input variable then we connect the gate to
the corresponding input vertex instead. If foo is an output variable
of the form Y[𝑗] then we add the same label to the corresponding
gate to mark it as an output gate. We do the analogous operations if
the 𝑖-th line involves an OR or a NOT operation (except that we use the
corresponding OR or NOT gate, and in the latter case have only one
in-neighbor instead of two). For every input 𝑥 ∈ {0, 1}𝑛 , if we run
the program 𝑃 on 𝑥, then the value that is computed in the 𝑖-th line is exactly the value that will be assigned to the 𝑖-th gate if we
evaluate the circuit 𝐶 on 𝑥. Hence 𝐶(𝑥) = 𝑃 (𝑥) for every 𝑥 ∈ {0, 1}𝑛 .
For the other direction, let 𝐶 be a circuit of 𝑠 gates and 𝑛 inputs that
computes the function 𝑓. We sort the gates according to a topological
order and write them as 𝑣0, …, 𝑣𝑠−1. We can now create a program
𝑃 of 𝑠 operator lines as follows. For every 𝑖 ∈ [𝑠], if 𝑣𝑖 is an AND
gate with in-neighbors 𝑣𝑗 , 𝑣𝑘 then we will add a line to 𝑃 of the form
temp_𝑖 = AND(temp_𝑗,temp_𝑘), unless one of the vertices is an input
vertex or an output gate, in which case we change this to the form
X[.] or Y[.] appropriately. Because we work in topological order-
ing, we are guaranteed that the in-neighbors 𝑣𝑗 and 𝑣𝑘 correspond to
variables that have already been assigned a value. We do the same for
OR and NOT gates. Once again, one can verify that for every input 𝑥,
the value 𝑃 (𝑥) will equal 𝐶(𝑥) and hence the program computes the
same function as the circuit. (Note that since 𝐶 is a valid circuit, per
Definition 3.5, every input vertex of 𝐶 has at least one out-neighbor
and there are exactly 𝑚 output gates labeled 0, … , 𝑚 − 1; hence all the
variables X[0], …, X[𝑛 − 1] and Y[0] ,…, Y[𝑚 − 1] will appear in the
program 𝑃 .)
■
Figure 3.13: Two equivalent descriptions of the same
AND/OR/NOT computation as both an AON pro-
gram and a Boolean circuit.
3.5.1 Transistors
A transistor can be thought of as an electric circuit with two inputs, known as the source and the gate, and an output, known as the sink. The gate controls whether current flows from the source to the sink. In a standard transistor, if the gate is “ON” then current can flow from the source to the sink, and if it is “OFF” then it can’t. In a complementary transistor this is reversed: if the gate is “OFF” then current can flow from the source to the sink, and if it is “ON” then it can’t.

Figure 3.14: Crab-based logic gates from the paper “Robust soldier-crab ball gate” by Gunji, Nishiyama and Adamatzky. This is an example of an AND gate that relies on the tendency of two swarms of crabs arriving from different directions to combine to a single swarm that continues in the average of the directions.

There are several ways to implement the logic of a transistor. For example, we can use faucets to implement it using water pressure (e.g., Fig. 3.15). This might seem merely a curiosity, but there is a field known as fluidics concerned with implementing logical operations using liquids or gasses. Some of the motivations include operating in extreme environmental conditions such as in space or on a battlefield, where standard electronic equipment would not survive.

Figure 3.15: We can implement the logic of transistors using water. The water pressure from the gate closes or opens a faucet between the source and the sink.

The standard implementations of transistors use electrical current. One of the original implementations used vacuum tubes. As its name implies, a vacuum tube is a tube containing nothing (i.e., a vacuum), in which a priori electrons could freely flow from the source (a wire) to the sink (a plate). However, there is a gate (a grid) between the two, and modulating its voltage can block the flow of electrons.
Instead of AND/OR/NOT, we can use some other gates as the basic basis. For example, one particular basis is that of threshold gates. For every vector 𝑤 = (𝑤0, …, 𝑤𝑘−1) of integers and integer 𝑡 (some or all of which could be negative), the threshold function corresponding to 𝑤, 𝑡 is the function $T_{w,t} : \{0,1\}^k \to \{0,1\}$ that maps $x \in \{0,1\}^k$ to 1 if and only if $\sum_{i=0}^{k-1} w_i x_i \ge t$. For example, the threshold function $T_{w,t}$ corresponding to 𝑤 = (1, 1, 1, 1, 1) and 𝑡 = 3 is simply the majority function MAJ5 on {0, 1}⁵. Threshold gates can be thought of as an approximation for neuron cells that make up the core of human and animal brains. To a first approximation, a neuron has 𝑘 inputs and a single output, and the neuron “fires” or “turns on” its output when those signals pass some threshold.

Figure 3.22: An AND gate using a “Game of Life” configuration. Figure taken from Jean-Philippe Rennard’s paper.
Many machine learning algorithms use artificial neural networks
whose purpose is not to imitate biology but rather to perform some
computational tasks, and hence are not restricted to a threshold or
other biologically-inspired gates. Generally, a neural network is often
described as operating on signals that are real numbers, rather than
0/1 values, and where the output of a gate on inputs 𝑥0 , … , 𝑥𝑘−1 is
obtained by applying 𝑓(∑𝑖 𝑤𝑖 𝑥𝑖 ) where 𝑓 ∶ ℝ → ℝ is an activation
function such as rectified linear unit (ReLU), Sigmoid, or many others
(see Fig. 3.23). However, for the purposes of our discussion, all of
the above are equivalent (see also Exercise 3.13). In particular we can
reduce the setting of real inputs to binary inputs by representing a
real number in the binary basis, and multiplying the weight of the bit
corresponding to the 𝑖𝑡ℎ digit by 2𝑖 .
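A Python sketch of a threshold gate, with the majority function MAJ5 arising as the special case 𝑤 = (1, 1, 1, 1, 1), 𝑡 = 3:

def T(w, t, x):
    # threshold gate: output 1 iff the weighted sum of the inputs is at least t
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0

def MAJ5(x):
    return T([1, 1, 1, 1, 1], 3, x)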
Proof. We start with the following observation. For every 𝑎 ∈ {0, 1}, AND(𝑎, 𝑎) = 𝑎. Hence, NAND(𝑎, 𝑎) = NOT(AND(𝑎, 𝑎)) = NOT(𝑎). This means that NAND can compute NOT. By the principle of “double negation”, AND(𝑎, 𝑏) = NOT(NOT(AND(𝑎, 𝑏))), and hence we can use NAND to compute AND as well. Once we can compute AND and NOT, we can compute OR using “De Morgan’s Law”: OR(𝑎, 𝑏) = NOT(AND(NOT(𝑎), NOT(𝑏))) (which can also be written as $a \vee b = \overline{\bar{a} \wedge \bar{b}}$) for every 𝑎, 𝑏 ∈ {0, 1}.
■

Figure 3.25: A “gadget” in a pipe that ensures that at most one marble can pass through it. The first marble that passes causes the barrier to lift and block new ones.
P
Theorem 3.10’s proof is very simple, but you should
make sure that (i) you understand the statement of
the theorem, and (ii) you follow its proof. In partic-
ular, you should make sure you understand why De
Morgan’s law is true.
Solution:
Recall that, as we saw in Section 3.2.2, XOR(𝑎, 𝑏) = AND(NOT(AND(𝑎, 𝑏)), OR(𝑎, 𝑏)). We can compute XOR using only NAND operations as follows:
1. Let 𝑢 = NAND(𝑥0 , 𝑥1 ).
2. Let 𝑣 = NAND(𝑥0 , 𝑢)
3. Let 𝑤 = NAND(𝑥1 , 𝑢).
4. The XOR of 𝑥0 and 𝑥1 is 𝑦0 = NAND(𝑣, 𝑤).
One can verify that this algorithm does indeed compute XOR
by enumerating all the four choices for 𝑥0 , 𝑥1 ∈ {0, 1}. We can also
represent this algorithm graphically as a circuit, see Fig. 3.28.
Figure 3.28: A circuit with NAND gates to compute the XOR of two bits.

Proof Idea:
The idea of the proof is to just replace every AND, OR and NOT gate with their NAND implementation following the proof of Theorem 3.10.
⋆
• NOT(𝑎) = NAND(𝑎, 𝑎)
Big Idea 3 Two models are equivalent in power if they can be used
to compute the same set of functions.
foo = NAND(bar,blah)
u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)
P
Do you know what function this program computes?
Hint: you have seen it before.
Theorem 3.17 — NAND circuits and straight-line program equivalence. For every 𝑓 ∶ {0, 1}ⁿ → {0, 1}ᵐ and 𝑠 ≥ 𝑚, 𝑓 is computable by a NAND circuit of at most 𝑠 gates if and only if 𝑓 is computable by a NAND-CIRC program of at most 𝑠 lines.
R
Remark 3.18 — Is the NAND-CIRC programming language
Turing Complete? (optional note). You might have heard
of a term called “Turing Complete” that is sometimes
used to describe programming languages. (If you
haven’t, feel free to ignore the rest of this remark: we
define this term precisely in Chapter 8.) If so, you
might wonder if the NAND-CIRC programming lan-
guage has this property. The answer is no, or perhaps
more accurately, the term “Turing Completeness” is
not really applicable for the NAND-CIRC program-
ming language. The reason is that, by design, the
NAND-CIRC programming language can only com-
pute finite functions 𝐹 ∶ {0, 1}𝑛 → {0, 1}𝑚 that take a
fixed number of input bits and produce a fixed num-
ber of output bits. The term “Turing Complete” is
only applicable to programming languages for infinite
functions that can take inputs of arbitrary length. We
will come back to this distinction later on in this book.
✓ Chapter Recap
3.8 EXERCISES

Exercise 3.1 — Compare 4 bit numbers. Give a Boolean circuit (with AND/OR/NOT gates) that computes the function CMP₈ ∶ {0, 1}⁸ → {0, 1} such that CMP₈(𝑎0, 𝑎1, 𝑎2, 𝑎3, 𝑏0, 𝑏1, 𝑏2, 𝑏3) = 1 if and only if the number represented by 𝑎0𝑎1𝑎2𝑎3 is larger than the number represented by 𝑏0𝑏1𝑏2𝑏3.
■
Exercise 3.4 — AND,OR is not universal. Prove that for every 𝑛-bit input
circuit 𝐶 that contains only AND and OR gates, as well as gates that
compute the constant functions 0 and 1, 𝐶 is monotone, in the sense
that if 𝑥, 𝑥′ ∈ {0, 1}𝑛 , 𝑥𝑖 ≤ 𝑥′𝑖 for every 𝑖 ∈ [𝑛], then 𝐶(𝑥) ≤ 𝐶(𝑥′ ).
Conclude that the set {AND, OR, 0, 1} is not universal.
■
Exercise 3.7 — MAJ,NOT is not universal. Prove that {MAJ, NOT} is not a universal set. See footnote for hint.⁴
■

⁴ Hint: Use the fact that $\overline{\mathrm{MAJ}(a,b,c)} = \mathrm{MAJ}(\bar{a},\bar{b},\bar{c})$ to prove that every 𝑓 ∶ {0, 1}ⁿ → {0, 1} computable by a circuit with only MAJ and NOT gates satisfies 𝑓(0, 0, …, 0) ≠ 𝑓(1, 1, …, 1). Thanks to Nathan Brunelle and David Evans for suggesting this exercise.

Exercise 3.8 — NOR is universal. Let NOR ∶ {0, 1}² → {0, 1} be defined as NOR(𝑎, 𝑏) = NOT(OR(𝑎, 𝑏)). Prove that {NOR} is a universal set of gates.
neural networks. As a corollary you will obtain that deep neural net-
works can simulate NAND circuits. Since NAND circuits can also
simulate deep neural networks, these two computational models are
equivalent to one another.
4. Prove that for every NAND-circuit 𝐶 with 𝑛 inputs and one output
that computes a function 𝑔 ∶ {0, 1}𝑛 → {0, 1}, if we replace every
gate of 𝐶 with a NAND-approximator and then invoke the result-
ing circuit on some 𝑥 ∈ {0, 1}𝑛 , the output will be a number 𝑦 such
that |𝑦 − 𝑔(𝑥)| ≤ 1/3.
4
Syntactic sugar, and computing every function
“[In 1951] I had a running compiler and nobody would touch it because,
they carefully told me, computers could only do arithmetic; they could not do
programs.”, Grace Murray Hopper, 1986.
2. So you can realize how lucky you are to be taking a theory of com-
putation course and not a compilers course… :)
def Proc(a,b):
proc_code
return c
some_code
f = Proc(d,e)
some_more_code
some_code
proc_code'
some_more_code
Theorem 4.1 — Procedure definition syntactic sugar. Let NAND-CIRC-PROC be the programming language NAND-CIRC augmented with the syntax above for defining procedures. Then for every NAND-CIRC-PROC program 𝑃, there exists a standard (i.e., “sugar-free”) NAND-CIRC program 𝑃′ that computes the same function as 𝑃.
R
Remark 4.2 — No recursive procedure. NAND-CIRC-
PROC only allows non-recursive procedures. In partic-
ular, the code of a procedure Proc cannot call Proc but
only use procedures that were defined before it. With-
out this restriction, the above “search and replace”
procedure might never terminate and Theorem 4.1
would not be true.
■

Example 4.3 — Computing Majority from NAND using syntactic sugar. Procedures allow us to express NAND-CIRC programs much more cleanly and succinctly. For example, because we can compute AND, OR, and NOT using NANDs, we can compute the Majority function as follows:
def NOT(a):
return NAND(a,a)
def AND(a,b):
temp = NAND(a,b)
return NOT(temp)
def OR(a,b):
temp1 = NOT(a)
temp2 = NOT(b)
return NAND(temp1,temp2)
def MAJ(a,b,c):
and1 = AND(a,b)
and2 = AND(a,c)
and3 = AND(b,c)
or1 = OR(and1,and2)
return OR(or1,and3)
print(MAJ(0,1,1))
# 1
R
Remark 4.4 — Counting lines. While we can use syn-
tactic sugar to present NAND-CIRC programs in more
readable ways, we did not change the definition of
the language itself. Therefore, whenever we say that
some function 𝑓 has an 𝑠-line NAND-CIRC program
we mean a standard “sugar-free” NAND-CIRC pro-
gram, where all syntactic sugar has been expanded
out. For example, the program of Example 4.3 is a 12-line program for computing the MAJ function.
2. A line foo = exp, where exp is the expression following the re-
turn statement in the definition of the procedure Proc.
R
Remark 4.5 — Parsing function definitions (optional). The
function desugar in Fig. 4.3 assumes that it is given
the procedure already split up into its name, argu-
ments, and body. It is not crucial for our purposes to
describe precisely how to scan a definition and split it
up into these components, but in case you are curious,
it can be achieved in Python via the following code:
import re

def parse_func(code):
    """Parse a function definition into name, arguments and body"""
    lines = [l.strip() for l in code.split('\n')]
    regexp = r'def\s+([a-zA-Z\_0-9]+)\(([\sa-zA-Z0-9\_,]+)\)\s*:\s*'
    m = re.match(regexp, lines[0])
    return m.group(1), m.group(2).split(','), '\n'.join(lines[1:])

Figure 4.3: Python code for transforming NAND-CIRC-PROC programs into standard sugar-free NAND-CIRC programs.
P
Before reading onward, try to see how you could com-
pute the IF function using NAND’s. Once you do that,
see how you can use that to emulate if/then types of
constructs.
def IF(cond,a,b):
notcond = NAND(cond,cond)
temp = NAND(b,notcond)
temp1 = NAND(a,cond)
return NAND(temp,temp1)
that assigns to foo its old value when condition equals 0, and assigns to foo the value of blah otherwise. More generally we can replace code of the form
if (cond):
a = ...
b = ...
c = ...
with code of the form:
temp_a = ...
temp_b = ...
temp_c = ...
a = IF(cond,temp_a,a)
b = IF(cond,temp_b,b)
c = IF(cond,temp_c,c)
Theorem 4.6 — Conditional statements syntactic sugar. Let NAND-CIRC-IF be the programming language NAND-CIRC augmented with if/then/else statements for allowing code to be conditionally executed based on whether a variable is equal to 0 or 1.
Then for every NAND-CIRC-IF program 𝑃, there exists a standard (i.e., “sugar-free”) NAND-CIRC program 𝑃′ that computes the same function as 𝑃.
ADD([1,1,1,0,0],[1,0,0,0,0])
# [0, 0, 0, 1, 0, 0]

where zero is the constant zero function, and MAJ and XOR correspond to the majority and XOR functions respectively. While we use Python syntax for convenience, in this example 𝑛 is some fixed integer and so for every such 𝑛, ADD is a finite function that takes as input 2𝑛 bits and outputs 𝑛 + 1 bits.
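The code of ADD is a straightforward rendering of grade-school addition; a minimal sketch consistent with the usage above (an assumption of this sketch: numbers are lists of bits with the least significant bit first, XOR computes the sum bit and MAJ the carry):

def MAJ(a, b, c): return 1 if a + b + c >= 2 else 0
def XOR(a, b):    return (a + b) % 2
def zero(a):      return 0   # the constant zero function

def ADD(A, B):
    n = len(A)
    Result = [0] * (n + 1)
    carry = zero(A[0])
    for i in range(n):
        Result[i] = XOR(A[i], XOR(B[i], carry))  # sum bit
        carry = MAJ(A[i], B[i], carry)           # carry bit
    Result[n] = carry
    return Result

print(ADD([1, 1, 1, 0, 0], [1, 0, 0, 0, 0]))
# [0, 0, 0, 1, 0, 0]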
Theorem 4.8 — Multiplication using NAND-CIRC programs. For every 𝑛, let MULT𝑛 ∶ {0, 1}²ⁿ → {0, 1}²ⁿ be the function that, given 𝑥, 𝑦 ∈ {0, 1}ⁿ, outputs the representation of the product of the numbers that 𝑥 and 𝑦 represent. Then MULT𝑛 can be computed by a NAND-CIRC program of 𝑂(𝑛²) lines.
LOOKUP𝑘 (𝑥, 𝑖) = 𝑥𝑖
def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
if i[0]==1:
return LOOKUP1(X[2],X[3],i[1])
else:
return LOOKUP1(X[0],X[1],i[1])
or in other words,
def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
a = LOOKUP1(X[2],X[3],i[1])
b = LOOKUP1(X[0],X[1],i[1])
return IF( i[0],a,b)
Proof of Theorem 4.10 from Lemma 4.11. Now that we have Lemma 4.11, we can complete the proof of Theorem 4.10. We will prove by induction on 𝑘 that there is a NAND-CIRC program of at most 4 ⋅ (2ᵏ − 1) lines for LOOKUP𝑘. For 𝑘 = 1 this follows by the four line program for IF we’ve seen before. For 𝑘 > 1, we use the following pseudocode:

a = LOOKUP_(k-1)(X[0],...,X[2^(k-1)-1],i[1],...,i[k-1])
b = LOOKUP_(k-1)(X[2^(k-1)],...,X[2^k-1],i[1],...,i[k-1])
return IF(i[0],b,a)
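A direct Python rendering of this recursion (a sketch; as in LOOKUP2 above, i[0] is the most significant index bit):

def IF(cond, a, b):
    # return a if cond equals 1, and b otherwise (four NAND lines)
    return a if cond == 1 else b

def LOOKUP(X, i):
    # X is a list of 2**k bits, i is a list of k index bits
    if len(i) == 1:
        return IF(i[0], X[1], X[0])
    half = len(X) // 2
    b = LOOKUP(X[half:], i[1:])  # upper half, selected when i[0] == 1
    a = LOOKUP(X[:half], i[1:])  # lower half, selected when i[0] == 0
    return IF(i[0], b, a)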
2. Coming up with NAND-CIRC programs for various functions is a very tedious task.

Figure 4.7: The number of lines in our implementation of the LOOKUP_k function as a function of 𝑘 (i.e., the length of the index). The number of lines in our implementation is roughly 3 ⋅ 2ᵏ.
Thus I would not blame the reader if they were not particularly
looking forward to a long sequence of examples of functions that can
be computed by NAND-CIRC programs. However, it turns out we are
not going to need this, as we can show in one fell swoop that NAND-
CIRC programs can compute every finite function:
G0000 = 1
G1000 = 1
G0100 = 0
...
G0111 = 1
G1111 = 1
Y[0] = LOOKUP_4(G0000,G1000,...,G1111,
X[0],X[1],X[2],X[3])
R
Remark 4.14 — Result in perspective. While Theorem 4.12 seems striking at first, in retrospect, it is perhaps not that surprising that every finite function can be computed with a NAND-CIRC program. After all, a finite function 𝐹 ∶ {0, 1}ⁿ → {0, 1}ᵐ can be represented by simply the list of its outputs for each one of the 2ⁿ input values. So it makes sense that we could write a NAND-CIRC program of similar size to compute it. What is more interesting is that some functions, such as addition and multiplication, have dramatically smaller programs: ones whose size is polynomial, rather than exponential, in 𝑛.
Theorem 4.15 — Universality of NAND circuits, improved bound. There exists a constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function 𝑓 ∶ {0, 1}ⁿ → {0, 1}ᵐ, there is a NAND-CIRC program with at most 𝑐 ⋅ 𝑚2ⁿ/𝑛 lines that computes the function 𝑓.³

³ The constant 𝑐 in this theorem is at most 10 and in fact can be arbitrarily close to 1, see Section 4.8.
Proof. As before, it is enough to prove the case that 𝑚 = 1. Hence
we let 𝑓 ∶ {0, 1}𝑛 → {0, 1}, and our goal is to prove that there exists
a NAND-CIRC program of 𝑂(2𝑛 /𝑛) lines (or equivalently a Boolean
circuit of 𝑂(2𝑛 /𝑛) gates) that computes 𝑓.
We let 𝑘 = log(𝑛 − 2 log 𝑛) (the reasoning behind this choice will become clear later on). We define the function $g : \{0,1\}^k \to \{0,1\}^{2^{n-k}}$ as follows: 𝑔(𝑎) is the string whose coordinates are the values 𝑓(𝑎𝑧) over all 𝑧 ∈ {0, 1}ⁿ⁻ᵏ. Since each coordinate of 𝑔 is a function mapping {0, 1}ᵏ to {0, 1}, we can compute the function 𝑔 (and hence by (4.2) also 𝑓) using at most

$$O\left(2^{2^k} \cdot 2^k + 2^{n-k}\right) \tag{4.4}$$

operations. Now all that is left is to plug into (4.4) our choice of 𝑘 = log(𝑛 − 2 log 𝑛). By definition, $2^k = n - 2\log n$, which means that (4.4) can be bounded by

$$O\left(2^{n-2\log n}\cdot(n-2\log n) + 2^{\,n-\log(n-2\log n)}\right) \le O\left(\frac{2^n}{n^2}\cdot n + \frac{2^n}{0.5n}\right) = O\left(\frac{2^n}{n}\right),$$

which is what we wanted to prove. (We used above the fact that $n - 2\log n \ge 0.5n$ for sufficiently large 𝑛.)

Figure 4.9: If 𝑔0, …, 𝑔𝑁−1 is a collection of functions each mapping {0, 1}ᵏ to {0, 1} such that at most 𝑆 of them are distinct, then for every 𝑎 ∈ {0, 1}ᵏ, we can compute all the values 𝑔0(𝑎), …, 𝑔𝑁−1(𝑎) using at most 𝑂(𝑆 ⋅ 2ᵏ + 𝑁) operations by first computing the distinct functions and then copying the resulting values.
■
Theorem 4.16 — Universality of Boolean circuits, improved bound. There exists some constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function 𝑓 ∶ {0, 1}ⁿ → {0, 1}ᵐ, there is a Boolean circuit with at most 𝑐 ⋅ 𝑚2ⁿ/𝑛 gates that computes the function 𝑓.
Theorem 4.17 — Universality of Boolean circuits (alternative phrasing). There exists some constant 𝑐 > 0 such that for every 𝑛, 𝑚 > 0 and function 𝑓 ∶ {0, 1}ⁿ → {0, 1}ᵐ, there is a Boolean circuit with at most 𝑐 ⋅ 𝑚 ⋅ 𝑛2ⁿ gates that computes the function 𝑓.
Proof Idea:
The idea of the proof is illustrated in Fig. 4.10. As before, it is enough to focus on the case that 𝑚 = 1 (the function 𝑓 has a single output), since we can always extend this to the case of 𝑚 > 1 by looking at the composition of 𝑚 circuits each computing a different output bit of the function 𝑓. We start by showing that for every 𝛼 ∈ {0, 1}ⁿ, there is an 𝑂(𝑛)-sized circuit that computes the function 𝛿𝛼 ∶ {0, 1}ⁿ → {0, 1} defined as follows: 𝛿𝛼(𝑥) = 1 iff 𝑥 = 𝛼 (that is, 𝛿𝛼 outputs 0 on all inputs except the input 𝛼). We can then write any function 𝑓 ∶ {0, 1}ⁿ → {0, 1} as the OR of at most 2ⁿ functions 𝛿𝛼 for the 𝛼’s on which 𝑓(𝛼) = 1.
⋆

Figure 4.10: Given a function 𝑓 ∶ {0, 1}ⁿ → {0, 1}, we let {𝑥0, 𝑥1, …, 𝑥𝑁−1} ⊆ {0, 1}ⁿ be the set of inputs such that 𝑓(𝑥𝑖) = 1, and note that 𝑁 ≤ 2ⁿ. We can express 𝑓 as the OR of 𝛿𝑥ᵢ for 𝑖 ∈ [𝑁] where the function 𝛿𝛼 ∶ {0, 1}ⁿ → {0, 1} (for 𝛼 ∈ {0, 1}ⁿ) is defined as follows: 𝛿𝛼(𝑥) = 1 iff 𝑥 = 𝛼. We can compute the OR of 𝑁 values using 𝑁 two-input OR gates. Therefore if we have a circuit of size 𝑂(𝑛) to compute 𝛿𝛼 for every 𝛼 ∈ {0, 1}ⁿ, we can compute 𝑓 using a circuit of size 𝑂(𝑛 ⋅ 𝑁) = 𝑂(𝑛 ⋅ 2ⁿ).

Proof of Theorem 4.17. We prove the theorem for the case 𝑚 = 1. The result can be extended for 𝑚 > 1 as before (see also Exercise 4.9). Let 𝑓 ∶ {0, 1}ⁿ → {0, 1}. We will prove that there is an 𝑂(𝑛 ⋅ 2ⁿ)-sized Boolean circuit to compute 𝑓 in the following steps:
𝛼 ∈ {0, 1}ⁿ such that 𝑓(𝛼) = 1. (If 𝑓 is the constant zero function and hence there is no such 𝛼, then we can use the circuit 𝑓(𝑥) = 𝑥0 ∧ ¬𝑥0.)
$$\delta_\alpha(x) = \begin{cases} 1 & x = \alpha \\ 0 & \text{otherwise} \end{cases}$$
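Concretely, 𝛿𝛼 can be computed as the AND of 𝑛 literals, where the 𝑖-th literal is 𝑥𝑖 if 𝛼𝑖 = 1 and NOT(𝑥𝑖) otherwise; this uses 𝑂(𝑛) gates. A Python sketch (assuming the AND and NOT functions from before):

def delta(alpha, x):
    # output 1 iff the bit lists x and alpha are equal
    res = 1
    for ai, xi in zip(alpha, x):
        res = AND(res, xi if ai == 1 else NOT(xi))
    return res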
Fig. 4.12 depicts the set SIZE𝑛,1 (𝑠). Note that SIZE𝑛,𝑚 (𝑠) is a set of
functions, not of programs! Asking if a program or a circuit is a mem-
ber of SIZE𝑛,𝑚 (𝑠) is a category error as in the sense of Fig. 4.13. As we
discussed in Section 3.7.2 (and Section 2.6.1), the distinction between
programs and functions is absolutely crucial. You should always re-
member that while a program computes a function, it is not equal to
a function. In particular, as we’ve seen, there can be more than one
program to compute the same function.
Figure 4.12: There are $2^{2^n}$ functions mapping {0, 1}ⁿ to {0, 1}, and an infinite number of circuits with 𝑛 bit inputs and a single bit of output. Every circuit computes one function, but every function can be computed by many circuits. We say that 𝑓 ∈ SIZE𝑛,1(𝑠) if the smallest circuit that computes 𝑓 has 𝑠 or fewer gates. For example XOR𝑛 ∈ SIZE𝑛,1(4𝑛). Theorem 4.12 shows that every function 𝑔 is computable by some circuit of at most 𝑐 ⋅ 2ⁿ/𝑛 gates, and hence SIZE𝑛,1(𝑐 ⋅ 2ⁿ/𝑛) corresponds to the set of all functions from {0, 1}ⁿ to {0, 1}.
Solution:
If 𝑓 ∈ SIZE𝑛 (𝑠) then there is an 𝑠-line NAND-CIRC program
𝑃 that computes 𝑓. We can rename the variable Y[0] in 𝑃 to a
variable temp and add the line
Y[0] = NAND(temp,temp)
✓ Chapter Recap

4.7 EXERCISES

Exercise 4.1 — Pairing. This exercise asks you to give a one-to-one map from ℕ² to ℕ. This can be useful to implement two-dimensional arrays.
t = NAND(X[2],X[2])
u = NAND(X[0],t)
v = NAND(X[1],X[2])
Y[0] = NAND(u,v)
■
2. A full adder is the function FA ∶ {0, 1}3 → {0, 1}2 that takes in
two bits and a “carry” bit and outputs their sum. That is, for every
𝑎, 𝑏, 𝑐 ∈ {0, 1}, FA(𝑎, 𝑏, 𝑐) = (𝑒, 𝑓) such that 2𝑒 + 𝑓 = 𝑎 + 𝑏 + 𝑐.
Prove that there is a NAND circuit of at most nine NAND gates that
computes FA.
Temp[0] = NAND(X[0],X[0])
Temp[1] = NAND(X[1],X[1])
Temp[2] = NAND(Temp[0],Temp[1])
Temp[3] = NAND(X[2],X[2])
Temp[4] = NAND(X[3],X[3])
Temp[5] = NAND(Temp[3],Temp[4])
Temp[6] = NAND(Temp[2],Temp[2])
Temp[7] = NAND(Temp[5],Temp[5])
Y[0] = NAND(Temp[6],Temp[7])
1. Write a program 𝑃 ′ with at most three lines of code that uses both
NAND as well as the syntactic sugar OR that computes the same func-
tion as 𝑃 .
2. Draw a circuit that computes the same function as 𝑃 and uses only
AND and NOT gates.
“The term code script is, of course, too narrow. The chromosomal structures
are at the same time instrumental in bringing about the development they
foreshadow. They are law-code and executive power - or, to use another simile,
they are architect’s plan and builder’s craft - in one.” , Erwin Schrödinger,
1944.
This correspondence between code and data is one of the most fun-
damental aspects of computing. It underlies the notion of general
purpose computers, that are not pre-wired to compute only one task,
and also forms the basis of our hope for obtaining general artificial
intelligence. This concept finds immense use in all areas of comput-
ing, from scripting languages to machine learning, but it is fair to say
that we haven’t yet fully mastered it. Many security exploits involve
cases such as “buffer overflows” when attackers manage to inject code
where the system expected only “passive” data (see Fig. 5.1). The re-
lation between code and data reaches beyond the realm of electronic computers.
temp_0 = NAND(X[0],X[1])
temp_1 = NAND(X[0],temp_0)
temp_2 = NAND(X[1],temp_0)
Y[0] = NAND(temp_1,temp_2)
Theorem 5.1 — Representing programs as strings. There is a constant 𝑐 such that for 𝑓 ∈ SIZE(𝑠), there exists a program 𝑃 computing 𝑓 whose string representation has length at most 𝑐𝑠 log 𝑠.
That is, there are at most $2^{O(s \log s)}$ functions computed by NAND-CIRC programs of at most 𝑠 lines.¹

¹ The implicit constant in the 𝑂(⋅) notation is smaller than 10. That is, for all sufficiently large 𝑠, $|\mathrm{SIZE}_{n,m}(s)| < 2^{10 s \log s}$, see Remark 5.4. As discussed in Section 1.7, we use the bound 10 simply because it is a round number.

Proof. For any 𝑛, 𝑚 ∈ ℕ, we will show a one-to-one map 𝐸 from SIZE𝑛,𝑚(𝑠) to the set of strings of length 𝑐𝑠 log 𝑠 for some constant 𝑐. This will conclude the proof, since it implies that |SIZE𝑛,𝑚(𝑠)| is smaller than the size of the set of all strings of length at most ℓ = 𝑐𝑠 log 𝑠. The size of the latter set is $1 + 2 + 4 + \cdots + 2^{\ell} = 2^{\ell+1} - 1$ by the formula for sums of geometric progressions.
The map 𝐸 will simply map 𝑓 to the representation of the smallest
program computing 𝑓. Since 𝑓 ∈ SIZE𝑛,𝑚 (𝑠), there is a program 𝑃
of at most 𝑠 lines that can be represented using a string of length at
most 𝑐𝑠 log 𝑠 by Theorem 5.1. Moreover, the map 𝑓 ↦ 𝐸(𝑓) is one to
one, since for every distinct 𝑓, 𝑓 ′ ∶ {0, 1}𝑛 → {0, 1}𝑚 there must exist
some input 𝑥 ∈ {0, 1}𝑛 on which 𝑓(𝑥) ≠ 𝑓 ′ (𝑥). This means that the
programs that compute 𝑓 and 𝑓 ′ respectively cannot be identical.
■
the inputs {0, 1}ⁿ. Hence the number of functions mapping {0, 1}ⁿ to {0, 1} is equal to the number of possible 2ⁿ-length lists of values, which is exactly $2^{2^n}$. Note that this is double exponential in 𝑛, and hence even for small values of 𝑛 (e.g., 𝑛 = 10) the number of functions from {0, 1}ⁿ to {0, 1} is truly astronomical.² As mentioned, this yields the following corollary:

² “Astronomical” here is an understatement: there are much fewer than $2^{2^{10}}$ stars, or even particles, in the observable universe.
Theorem 5.3 — Counting argument lower bound. There is a constant 𝛿 > 0, such that for every sufficiently large 𝑛, there is a function 𝑓 ∶ {0, 1}ⁿ → {0, 1} such that 𝑓 ∉ SIZE𝑛(𝛿2ⁿ/𝑛). That is, the shortest NAND-CIRC program that computes 𝑓 requires more than 𝛿2ⁿ/𝑛 lines.
using the fact that since 𝑠 < 2𝑛 , log 𝑠 < 𝑛 and 𝛿 = 1/𝑐. But since
|SIZE𝑛 (𝑠)| is smaller than the total number of functions mapping 𝑛
bits to 1 bit, there must be at least one such function not in SIZE𝑛 (𝑠),
which is what we needed to prove.
■
We have seen before that every function mapping {0, 1}𝑛 to {0, 1}
can be computed by an 𝑂(2𝑛 /𝑛) line program. Theorem 5.3 shows
that this is tight in the sense that some functions do require such an
astronomical number of lines to compute.
In fact, as we explore in the exercises, this is the case for most func-
tions. Hence functions that can be computed in a small number of
lines (such as addition, multiplication, finding short paths in graphs,
or even the EVAL function) are the exception, rather than the rule.
R
Remark 5.4 — More efficient representation (advanced, optional). The ASCII representation is not the shortest representation for NAND-CIRC programs. NAND-CIRC programs are equivalent to circuits with NAND gates, which means that a NAND-CIRC program of 𝑠 lines, 𝑛 inputs, and 𝑚 outputs can be represented by a labeled directed graph of 𝑠 + 𝑛 vertices, of which 𝑛 have in-degree zero and 𝑠 have in-degree at most two.
It turns out that we can use Theorem 5.3 to show a more general re-
sult: whenever we increase our “budget” of gates we can compute
new functions.
Proof Idea:
To prove the theorem we need to find a function 𝑓 ∶ {0, 1}𝑛 → {0, 1}
such that 𝑓 can be computed by a circuit of 𝑠 + 10𝑛 gates but it cannot
be computed by a circuit of 𝑠 gates. We will do so by coming up with
a sequence of functions 𝑓0 , 𝑓1 , 𝑓2 , … , 𝑓𝑁 with the following properties:
(1) 𝑓0 can be computed by a circuit of at most 10𝑛 gates, (2) 𝑓𝑁 cannot
be computed by a circuit of 0.1 ⋅ 2𝑛 /𝑛 gates, and (3) for every 𝑖 ∈
{0, … , 𝑁 }, if 𝑓𝑖 can be computed by a circuit of size 𝑠, then 𝑓𝑖+1 can be
computed by a circuit of size at most 𝑠+10𝑛. Together these properties
imply that if we let 𝑖 be the smallest number such that 𝑓𝑖 ∉ SIZE𝑛 (𝑠),
then since 𝑓𝑖−1 ∈ SIZE𝑛 (𝑠) it must hold that 𝑓𝑖 ∈ SIZE𝑛 (𝑠 + 10𝑛) which
is what we need to prove. See Fig. 5.4 for an illustration.
⋆
Proof of Theorem 5.5. Let 𝑓* ∶ {0, 1}ⁿ → {0, 1} be the function (whose existence we are guaranteed by Theorem 5.3) such that 𝑓* ∉ SIZE𝑛(0.1 ⋅ 2ⁿ/𝑛). We define the functions 𝑓0, 𝑓1, …, 𝑓₂ₙ mapping {0, 1}ⁿ to {0, 1} as follows. For every 𝑥 ∈ {0, 1}ⁿ, if 𝑙𝑒𝑥(𝑥) ∈ {0, 1, …, 2ⁿ − 1} is 𝑥’s order in the lexicographical order then

$$f_i(x) = \begin{cases} f^*(x) & lex(x) < i \\ 0 & \text{otherwise} \end{cases}$$

Figure 5.4: We prove Theorem 5.5 by coming up with a list 𝑓0, …, 𝑓₂ₙ of functions such that 𝑓0 is the all-zero function, 𝑓₂ₙ is a function (obtained from Theorem 5.3) outside of SIZE𝑛(0.1 ⋅ 2ⁿ/𝑛), and such that 𝑓𝑖−1 and 𝑓𝑖 differ from one another on at most one input. We can show that for every 𝑖, the number of gates to compute 𝑓𝑖 is at most 10𝑛 larger than the number of gates to compute 𝑓𝑖−1, and so if we let 𝑖 be the smallest number such that 𝑓𝑖 ∉ SIZE𝑛(𝑠), then 𝑓𝑖 ∈ SIZE𝑛(𝑠 + 10𝑛).

$$f_i(x) = \begin{cases} b & x = x^* \\ f_{i-1}(x) & x \ne x^* \end{cases}$$
or in other words
blah = NAND(baz,boo)
Definition 5.7 — List of tuples representation. Let 𝑃 be a NAND-CIRC program of 𝑛 inputs, 𝑚 outputs, and 𝑠 lines, and let 𝑡 be the number of distinct variables used by 𝑃. The list of tuples representation of 𝑃 is the triple (𝑛, 𝑚, 𝐿) where 𝐿 is a list of triples of the form (𝑖, 𝑗, 𝑘) for 𝑖, 𝑗, 𝑘 ∈ [𝑡].
We assign a number for each variable of 𝑃 as follows:
u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)
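For instance, under one natural numbering convention (a sketch of such a convention: inputs first, workspace variables next in order of appearance, outputs last), the program above would be represented as:

# X[0] -> 0, X[1] -> 1, u -> 2, v -> 3, w -> 4, Y[0] -> 5
L = [(2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)]
prog = (2, 1, L)   # n = 2 inputs, m = 1 output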
program must touch at least once all its input and output variables), those prefix-free representations can be encoded using strings of length 𝑂(log 𝑠). In particular, every program 𝑃 of at most 𝑠 lines can be represented by a string of length 𝑂(𝑠 log 𝑠). Similarly, every circuit 𝐶 of at most 𝑠 gates can be represented by a string of length 𝑂(𝑠 log 𝑠) (for example by translating 𝐶 to the equivalent program 𝑃).
$$\mathrm{EVAL}_{s,n,m}(px) = \begin{cases} P(x) & p \in \{0,1\}^{|S(s)|} \text{ represents a size-$s$ program $P$ with $n$ inputs and $m$ outputs} \\ 0^m & \text{otherwise} \end{cases} \tag{5.2}$$
where 𝑆(𝑠) is defined as in (5.1) and we use the concrete representa-
tion scheme described in Section 5.1.
That is, EVAL𝑠,𝑛,𝑚 takes as input the concatenation of two strings:
a string 𝑝 ∈ {0, 1}|𝑆(𝑠)| and a string 𝑥 ∈ {0, 1}𝑛 . If 𝑝 is a string that
represents a list of triples 𝐿 such that (𝑛, 𝑚, 𝐿) is a list-of-tuples rep-
resentation of a size-𝑠 NAND-CIRC program 𝑃 , then EVAL𝑠,𝑛,𝑚 (𝑝𝑥)
is equal to the evaluation 𝑃 (𝑥) of the program 𝑃 on the input 𝑥. Oth-
erwise, EVAL𝑠,𝑛,𝑚 (𝑝𝑥) equals 0𝑚 (this case is not very important: you
can simply think of 0𝑚 as some “junk value” that indicates an error).
Theorem 5.9 — Bounded Universality of NAND-CIRC programs. For every 𝑠, 𝑛, 𝑚 ∈ ℕ with 𝑠 ≥ 𝑚 there is a NAND-CIRC program 𝑈𝑠,𝑛,𝑚 that computes the function EVAL𝑠,𝑛,𝑚.
P
Theorem 5.9 is simple but important. Make sure you
understand what this theorem means, and why it is a
corollary of Theorem 4.12.
computing EVAL𝑠,𝑛,𝑚 with size that is polynomial in its input length. This is shown in the following theorem.
P
If you haven’t done so already, now might be a good
time to review 𝑂 notation in Section 1.4.8. In particu-
lar, an equivalent way to state Theorem 5.10 is that it
says that there exists some number 𝑐 > 0 such that for
every 𝑠, 𝑛, 𝑚 ∈ ℕ, there exists a NAND-CIRC program
𝑃 of at most 𝑐𝑠2 log 𝑠 lines that computes the function
EVAL𝑠,𝑛,𝑚 .
This approach yields much more than just proving Theorem 5.10:
we will see that it is in fact always possible to transform (loop free)
code in high level languages such as Python to NAND-CIRC pro-
grams (and hence to Boolean circuits as well).
P
It would be highly worthwhile for you to stop here
and try to solve this problem yourself. For example,
you can try thinking how you would write a program
NANDEVAL(n,m,s,L,x) that computes this function in
the programming language of your choice.
P
Before reading further, try to think how you could give
a “constructive proof” of Theorem 5.10. That is, think
of how you would write, in the programming lan-
guage of your choice, a function universal(s,n,m)
that on input 𝑠, 𝑛, 𝑚 outputs the code for the NAND-
CIRC program 𝑈𝑠,𝑛,𝑚 such that 𝑈𝑠,𝑛,𝑚 computes
EVAL𝑠,𝑛,𝑚 . There is a subtle but crucial difference
between this function and the Python NANDEVAL pro-
gram described above. Rather than actually evaluating
a given program 𝑃 on some input 𝑤, the function
universal should output the code of a NAND-CIRC
program that computes the map (𝑃 , 𝑥) ↦ 𝑃 (𝑥).
Figure 5.7: Code for evaluating a NAND-CIRC program given in the list-of-tuples representation
def NANDEVAL(n,m,L,X):
# Evaluate a NAND-CIRC program from list of tuple representation.
s = len(L) # num of lines
t = max(max(a,b,c) for (a,b,c) in L)+1 # max index in L + 1
Vartable = [0] * t # initialize array
# helper functions
def GET(V,i): return V[i]
def UPDATE(V,i,b):
V[i]=b
return V
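A sketch of the rest of the evaluation, under the assumption that the numbering convention gives the inputs indices 0, …, 𝑛 − 1 and the outputs the last 𝑚 indices:

    def NAND(a, b): return 1 - a * b

    # load the inputs into the variable table
    for i in range(n):
        Vartable = UPDATE(Vartable, i, X[i])

    # execute the program line by line
    for (i, j, k) in L:
        a = GET(Vartable, j)
        b = GET(Vartable, k)
        Vartable = UPDATE(Vartable, i, NAND(a, b))

    # the outputs are the last m variables
    return [GET(Vartable, t - m + j) for j in range(m)]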
P
Please make sure that you understand why GET and
LOOKUPℓ are the same function.
UPDATE function for arrays of length 2^ℓ. That is, on input 𝑉 ∈ {0, 1}^{2^ℓ}, 𝑖 ∈ {0, 1}^ℓ and 𝑏 ∈ {0, 1}, we need to compute the array 𝑉′ ∈ {0, 1}^{2^ℓ} such that

$$V'_j = \begin{cases} V_j & j \ne i \\ b & j = i \end{cases}$$

where we identify the string 𝑖 ∈ {0, 1}^ℓ with a number in {0, …, 2^ℓ − 1} using the binary representation. We can compute UPDATE_ℓ using an 𝑂(2^ℓ ⋅ ℓ) = 𝑂(𝑠 log 𝑠) line NAND-CIRC program as follows:
2. We have seen that we can compute the function IF ∶ {0, 1}3 → {0, 1}
such that IF(𝑎, 𝑏, 𝑐) equals 𝑏 if 𝑎 = 1 and 𝑐 if 𝑎 = 0.
def UPDATE_ell(V,i,b):
    # Get V[0]...V[2^ell-1], i in {0,1}^ell, b in {0,1}
    # Return NewV[0],...,NewV[2^ell-1]:
    # updated array with NewV[i]=b and all else same as V
    NewV = [0] * (2**ell)       # ell is a fixed constant here
    for j in range(2**ell):     # j = 0,1,2,...,2^ell - 1
        a = EQUALS_j(i)         # a equals 1 iff i is the binary representation of j
        NewV[j] = IF(a,b,V[j])
    return NewV
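Here EQUALS_j denotes, for each fixed 𝑗, the function that outputs 1 iff its ℓ input bits are the binary representation of 𝑗; it can be computed with 𝑂(ℓ) AND/NOT operations. A Python sketch (the helper name make_EQUALS and the most-significant-bit-first order are assumptions of this sketch):

def make_EQUALS(j, ell):
    # return the function EQUALS_j for this fixed j
    pattern = [(j >> (ell - 1 - p)) & 1 for p in range(ell)]
    def EQUALS_j(i):
        res = 1
        for ip, pp in zip(i, pattern):
            res = res & (ip if pp == 1 else 1 - ip)  # AND of literals
        return res
    return EQUALS_j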
R
Remark 5.12 — Improving to quasilinear overhead (ad-
vanced optional note). The NAND-CIRC program
above is less efficient than its Python counterpart,
since NAND does not offer arrays with efficient ran-
dom access. Hence for example the LOOKUP operation
on an array of 𝑠 bits takes Ω(𝑠) lines in NAND even
though it takes 𝑂(1) steps (or maybe 𝑂(log 𝑠) steps,
depending on how we count) in Python.
It turns out that it is possible to improve the bound
of Theorem 5.10, and evaluate 𝑠 line NAND-CIRC
programs using a NAND-CIRC program of 𝑂(𝑠 log 𝑠)
lines. The key is to consider the description of NAND-
CIRC programs as circuits, and in particular as di-
rected acyclic graphs (DAGs) of bounded in-degree.
A universal NAND-CIRC program 𝑈𝑠 for 𝑠 line pro-
grams will correspond to a universal graph 𝐻𝑠 for such
𝑠 vertex DAGs. We can think of such a graph 𝑈𝑠 as
fixed “wiring” for a communication network, that
should be able to accommodate any arbitrary pattern
of communication between 𝑠 vertices (where this pat-
tern corresponds to an 𝑠 line NAND-CIRC program).
It turns out that such efficient routing networks exist
that allow embedding any 𝑠 vertex circuit inside a uni-
versal graph of size 𝑂(𝑠 log 𝑠), see the bibliographical
notes Section 5.9 for more on this issue.
R
Remark 5.13 — Advanced note: making PECTT concrete
(advanced, optional). We can attempt a more exact
phrasing of the PECTT as follows. Suppose that 𝑍 is
a physical system that accepts 𝑛 binary stimuli and
has a binary output, and can be enclosed in a sphere
of volume 𝑉 . We say that the system 𝑍 computes a
function 𝑓 ∶ {0, 1}𝑛 → {0, 1} within 𝑡 seconds if when-
ever we set the stimuli to some value 𝑥 ∈ {0, 1}𝑛 , if
we measure the output after 𝑡 seconds then we obtain
𝑓(𝑥).
One can then phrase the PECTT as stipulating that if there exists such a system 𝑍 that computes 𝑓 within 𝑡 seconds, then there exists a NAND-CIRC program that computes 𝑓 and has at most 𝛼(𝑉𝑡)² lines, where 𝛼 is some normalization constant. (We can also consider variants where we use surface area instead of volume, or take (𝑉𝑡) to a different power than 2. However, none of these choices makes a qualitative difference to the discussion below.) In particular, suppose that 𝑓 ∶ {0, 1}ⁿ → {0, 1} is a function that requires 2ⁿ/(100𝑛) > 2^{0.8𝑛} lines for any NAND-CIRC program (such a function exists by Theorem 5.3). Then the PECTT would imply that either the volume or the time of a system that computes 𝑓 will have to be at least $2^{0.2n}/\sqrt{\alpha}$. Since this quantity grows exponentially in 𝑛, it is not hard to set parameters so that even for moderately large values of 𝑛, such a system could not fit in our universe.
To fully make the PECTT concrete, we need to decide
on the units for measuring time and volume, and the
normalization constant 𝛼. One conservative choice is
to assume that we could squeeze computation to the
absolute physical limits (which are many orders of
magnitude beyond current technology). This corre-
sponds to setting 𝛼 = 1 and using the Planck units
for volume and time. The Planck length ℓ𝑃 (which is,
roughly speaking, the shortest distance that can the-
oretically be measured) is roughly 2−120 meters. The
• Spaghetti sort: One of the first lower bounds that Computer Sci-
ence students encounter is that sorting 𝑛 numbers requires making
Ω(𝑛 log 𝑛) comparisons. The “spaghetti sort” is a description of a
proposed “mechanical computer” that would do this faster. The
idea is that to sort 𝑛 numbers 𝑥1 , … , 𝑥𝑛 , we could cut 𝑛 spaghetti
noodles into lengths 𝑥1 , … , 𝑥𝑛 , and then if we simply hold them
together in our hand and bring them down to a flat surface, they
will emerge in sorted order. There are a great many reasons why
this is not truly a challenge to the PECTT hypothesis, and I will not
ruin the reader’s fun in finding them out by her or himself.
assume that current hard disk + silicon technologies are the absolute best possible.⁷

⁷ We were extremely conservative in the suggested parameters for the PECTT, having assumed that as many as $\ell_P^{-2} \cdot 10^{-6} \sim 10^{61}$ bits could potentially be stored in a millimeter radius region.
• Continuous/real computers. The physical world is often described stored in a millimeter radius region.
using continuous quantities such as time and space, and people
have suggested that analog devices might have direct access to
computing with real-valued quantities and would be inherently
more powerful than discrete models such as NAND machines.
Whether the “true” physical world is continuous or discrete is an
open question. In fact, we do not even know how to precisely phrase
this question, let alone answer it. Yet, regardless of the answer, it
seems clear that the effort to measure a continuous quantity grows
with the level of accuracy desired, and so there is no “free lunch”
or way to bypass the PECTT using such machines (see also this
paper). Related to that are proposals known as “hypercomputing”
or “Zeno’s computers” which attempt to use the continuity of time
by doing the first operation in one second, the second one in half a
second, the third operation in a quarter second, and so on. These
fail for a similar reason to the one guaranteeing that Achilles will
eventually catch the tortoise despite the original Zeno’s paradox.
R
Remark 5.14 — Physical Extended Church-Turing Thesis
and Cryptography. While even the precise phrasing of
the PECTT, let alone understanding its correctness, is
still a subject of active research, some variants of it are
already implicitly assumed in practice. Governments,
companies, and individuals currently rely on cryptog-
raphy to protect some of their most precious assets,
including state secrets, control of weapon systems
and critical infrastructure, securing commerce, and
protecting the confidentiality of personal information.
In applied cryptography, one often encounters state-
ments such as “cryptosystem 𝑋 provides 128 bits of
security”. What such a statement really means is that
(a) it is conjectured that there is no Boolean circuit
(or, equivalently, a NAND-CIRC program) of size
much smaller than $2^{128}$ that can break $X$, and (b) we
assume that no other physical mechanism can do better,
and hence it would take roughly a $2^{128}$ amount of
"resources" to break $X$. We say "conjectured" and not
“resources” to break 𝑋. We say “conjectured” and not
“proved” because, while we can phrase the statement
that breaking the system cannot be done by an 𝑠-gate
circuit as a precise mathematical conjecture, at the
moment we are unable to prove such a statement for
any non-trivial cryptosystem. This is related to the P
vs NP question we will discuss in future chapters. We
will explore Cryptography in Chapter 21.
✓ Chapter Recap
Sneak preview: In the next part we will discuss how to model compu-
tational tasks on unbounded inputs, which are specified using functions
𝐹 ∶ {0, 1}∗ → {0, 1}∗ (or 𝐹 ∶ {0, 1}∗ → {0, 1}) that can take an
unbounded number of Boolean inputs.
5.8 EXERCISES
Exercise 5.1 Which one of the following statements is false:
Exercise 5.4 — Counting lower bound for multibit functions. Prove that there exists a number $\delta > 0$ such that for every sufficiently large $n$ and every $m$ there exists a function $f : \{0,1\}^n \to \{0,1\}^m$ that requires at least $\delta m \cdot 2^n/n$ NAND gates to compute. See footnote for hint.¹⁰

¹⁰ How many functions from $\{0,1\}^n$ to $\{0,1\}^m$ exist? Note that our definition of circuits requires each output to correspond to a unique gate, though that restriction can make at most an $O(m)$ additive difference in the number of gates.

■

Exercise 5.5 — Size hierarchy theorem for multibit functions. Prove that there exists a number $C$ such that for every $n, m$ and $n + m < s < m \cdot 2^n/(Cn)$ there exists a function $f \in \text{SIZE}_{n,m}(C \cdot s) \setminus \text{SIZE}_{n,m}(s)$. See footnote for hint.¹¹

¹¹ Follow the proof of Theorem 5.5, replacing the use of the counting argument with Exercise 5.4.
■
Conclude that the implicit constant in Theorem 5.2 can be made arbitrarily close to 5. See footnote for hint.¹²

¹² Using the adjacency list representation, a graph with $n$ in-degree zero vertices and $s$ in-degree two vertices can be represented using roughly $2s \log(s+n) \le 2s(\log s + O(1))$ bits. The labeling of the $n$ input and $m$ output vertices can be specified by a list of $n$ labels in $[n]$ and $m$ labels in $[m]$.

■

Exercise 5.7 — Tighter counting lower bound. Prove that for every $\delta < 1/2$, if $n$ is sufficiently large then there exists a function $f : \{0,1\}^n \to \{0,1\}$ such that $f \notin \text{SIZE}_{n,1}\left(\frac{\delta 2^n}{n}\right)$. See footnote for hint.¹³

¹³ Hint: Use the results of Exercise 5.6 and the fact that…
Exercise 5.8 — Random functions are hard. Suppose $n > 1000$ and that we choose a function $F : \{0,1\}^n \to \{0,1\}$ at random, choosing for every $x \in \{0,1\}^n$ the value $F(x)$ to be the result of tossing an independent unbiased coin. Prove that the probability that there is a $2^n/(1000n)$-line program that computes $F$ is at most $2^{-100}$.¹⁴

¹⁴ Hint: An equivalent way to say this is that you need to prove that the set of functions that can be computed using at most $2^n/(1000n)$ lines has fewer than $2^{-100} \cdot 2^{2^n}$ elements. Can you see why?

■
Exercise 5.9 The following is a tuple representing a NAND program:
(3, 1, ((3, 2, 2), (4, 1, 1), (5, 3, 4), (6, 2, 1), (7, 6, 6), (8, 0, 0), (9, 7, 8), (10, 5, 0), (11, 9, 10))).
1. Write a table with the eight values 𝑃 (000), 𝑃 (001), 𝑃 (010), 𝑃 (011),
𝑃 (100), 𝑃 (101), 𝑃 (110), 𝑃 (111) in this order.
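The following Python sketch (our illustration, not part of the exercise) evaluates such a tuple mechanically; it assumes that variables $0, \ldots, n-1$ hold the inputs, that a triple (i, j, k) corresponds to the line "variable i = NAND(variable j, variable k)", and that the outputs are the last $m$ variables assigned:

def eval_nand_program(prog, x):
    """Evaluate a tuple-represented NAND program on a list of input bits x."""
    n, m, lines = prog
    val = {i: x[i] for i in range(n)}  # variables 0..n-1 hold the inputs
    for (i, j, k) in lines:            # each triple: variable i = NAND(var j, var k)
        val[i] = 1 - val[j] * val[k]
    last = max(val)                    # assumption: outputs are the last m variables
    return [val[last - m + 1 + t] for t in range(m)]

P = (3, 1, ((3, 2, 2), (4, 1, 1), (5, 3, 4), (6, 2, 1), (7, 6, 6),
            (8, 0, 0), (9, 7, 8), (10, 5, 0), (11, 9, 10)))
for v in range(8):
    x = [(v >> 2) & 1, (v >> 1) & 1, v & 1]  # 000, 001, ..., 111
    print(x, eval_nand_program(P, x))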
Prove that for every sufficiently large $n$, there does not exist an XOR
circuit $C$ that computes the function $E_n$, where an XOR circuit has the
XOR gate as well as the constants 0 and 1 (see Exercise 3.5). That is,
prove that there is some constant $n_0$ such that for every $n > n_0$ and
XOR circuit $C$ of $n^2$ inputs and a single output, there exists a pair
$(P, x)$ such that $C(P, x) \neq E_n(P, x)$.
■
of any explicit function for which we can prove that it requires, say, at
least $n^{100}$ or even $100n$ size. At the moment, the strongest such lower
bound we know is that there are quite simple and explicit 𝑛-variable
functions that require at least (5 − 𝑜(1))𝑛 lines to compute, see this
paper of Iwama et al as well as this more recent work of Kulikov et al.
Proving lower bounds for restricted models of circuits is an extremely
interesting research area, for which Jukna’s book [Juk12] (see also
Wegener [Weg87]) provides a very good introduction and overview. I
learned of the proof of the size hierarchy theorem (Theorem 5.5) from
Sasha Golovnev.
Scott Aaronson's blog post on how information is physical is a good
discussion on issues related to the physical extended Church-Turing
Thesis. Aaronson's survey on NP complete problems and physical
reality [Aar05] discusses these issues as well, though it might be
easier to read after we reach Chapter 15 on NP and NP-completeness.
II
UNIFORM COMPUTATION
Learning Objectives:
• Define functions on unbounded length inputs that cannot be described by a finite size table of inputs and outputs.
• Equivalence with the task of deciding membership in a language.
• Deterministic finite automata (optional): a simple example of a model for unbounded computation.
def XOR(X):
'''Takes list X of 0's and 1's
Outputs 1 if the number of 1's is odd and outputs 0
↪ otherwise'''
result = 0
for i in range(len(X)):
result = (result + X[i]) % 2
return result
graphs, fitting curves to points, and so on. To contrast with the fi-
nite case, we will sometimes call a function 𝐹 ∶ {0, 1}∗ → {0, 1} (or
𝐹 ∶ {0, 1}∗ → {0, 1}∗ ) infinite. However, this does not mean that 𝐹
takes as input strings of infinite length! It just means that 𝐹 can take
as input a string that can be arbitrarily long, and so we cannot simply
write down a table of all the outputs of 𝐹 on different inputs.
Big Idea 8 A function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ specifies the computa-
tional task mapping an input 𝑥 ∈ {0, 1}∗ into the output 𝐹 (𝑥).
MULT(𝑥, 𝑦) = 𝑥 ⋅ 𝑦
takes the binary representation of a pair of integers 𝑥, 𝑦 ∈ ℕ, and
outputs the binary representation of their product 𝑥 ⋅ 𝑦. However, since
we can represent a pair of strings as a single string, we will consider
functions such as MULT as mapping {0, 1}∗ to {0, 1}∗ . We will typi-
cally not be concerned with low-level details such as the precise way
to represent a pair of integers as a string, since virtually all choices will
be equivalent for our purposes.
$$\text{PALINDROME}(x) = \begin{cases} 1 & \forall_{i \in [|x|]}\; x_i = x_{|x|-i-1} \\ 0 & \text{otherwise} \end{cases}$$
PALINDROME has a single bit as output. Functions with a single
bit of output are known as Boolean functions. Boolean functions are
central to the theory of computation, and we will discuss them often
in this book. Note that even though Boolean functions have a single
bit of output, their input can be of arbitrary length. Thus they are still
infinite functions that cannot be described via a finite table of values.
“Booleanizing” functions. Sometimes it might be convenient to ob-
tain a Boolean variant for a non-Boolean function. For example, the
following is a Boolean variant of MULT.
Solution:
For every 𝐹 ∶ {0, 1}∗ → {0, 1}∗ , we can define
def F(x):
res = []
i = 0
while BF(x,i,1):
res.append(BF(x,i,0))
i += 1
return res
For example, recall the Python program that computes the XOR
function:
def XOR(X):
'''Takes list X of 0's and 1's
Outputs 1 if the number of 1's is odd and outputs 0
↪ otherwise'''
result = 0
for i in range(len(X)):
result = (result + X[i]) % 2
return result
In each step, this program reads a single bit X[i] and updates its
state result based on that bit (flipping result if X[i] is 1 and keeping
it the same otherwise). When it is done traversing the input, it outputs result.
Definition 6.2 — Deterministic Finite Automaton. A deterministic finite
automaton (DFA) with $C$ states over $\{0,1\}$ is a pair $(T, \mathcal{S})$ with
$T : [C] \times \{0,1\} \to [C]$ and $\mathcal{S} \subseteq [C]$. The finite function $T$ is known
as the transition function of the DFA. The set $\mathcal{S}$ is known as the set of
accepting states.
Let 𝐹 ∶ {0, 1}∗ → {0, 1} be a Boolean function with the infinite
domain {0, 1}∗ . We say that (𝑇 , 𝒮) computes a function 𝐹 ∶ {0, 1}∗ →
{0, 1} if for every 𝑛 ∈ ℕ and 𝑥 ∈ {0, 1}𝑛 , if we define 𝑠0 = 0 and
𝑠𝑖+1 = 𝑇 (𝑠𝑖 , 𝑥𝑖 ) for every 𝑖 ∈ [𝑛], then
𝑠𝑛 ∈ 𝒮 ⇔ 𝐹 (𝑥) = 1
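The computation described in Definition 6.2 is straightforward to phrase in code. The following Python sketch is our own illustration (the names run_dfa, T, and S are ours, not the text's):

def run_dfa(T, S, x):
    """Simulate the DFA (T, S) of Definition 6.2 on the input x."""
    s = 0                   # the DFA always starts at state 0
    for b in x:
        s = T[(s, b)]       # s_{i+1} = T(s_i, x_i)
    return 1 if s in S else 0

# Example: a 2-state DFA computing the XOR of the input bits
T = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(run_dfa(T, {1}, [1, 0, 1]))   # prints 0, since XOR(1,0,1) = 0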
P
Make sure not to confuse the transition function of
an automaton (𝑇 in Definition 6.2), which is a finite
function specifying the table of “rules” which it fol-
lows, with the function the automaton computes (𝐹 in
Definition 6.2) which is an infinite function.
R
Remark 6.3 — Definitions in other texts. Deterministic
finite automata can be defined in several equivalent
ways. In particular Sipser [Sip97] defines a DFA as a
five-tuple (𝑄, Σ, 𝛿, 𝑞0 , 𝐹 ) where 𝑄 is the set of states,
Σ is the alphabet, 𝛿 is the transition function, 𝑞0 is
the initial state, and 𝐹 is the set of accepting states.
In this book the set of states is always of the form
𝑄 = {0, … , 𝐶 − 1} and the initial state is always 𝑞0 = 0,
but this makes no difference to the computational
power of these models. Also, we restrict our attention
to the case that the alphabet Σ is equal to {0, 1}.
Solved Exercise 6.2 — DFA for (010)∗. Prove that there is a DFA that computes the following function $F$: $F(x) = 1$ if and only if $x$ is a concatenation of zero or more copies of 010.
Solution:
When asked to construct a deterministic finite automaton, it is
often useful to start by constructing a single-pass constant-memory
def F(X):
    '''Return 1 iff X is a concatenation of zero/more copies of [0,1,0]'''
    if len(X) % 3 != 0:
        return False
    ultimate = 0
    penultimate = 1
    antepenultimate = 0
    for idx, b in enumerate(X):
        antepenultimate = penultimate
        penultimate = ultimate
        ultimate = b
        if idx % 3 == 2 and ((antepenultimate, penultimate, ultimate) != (0, 1, 0)):
            return False
    return True
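To turn this single-pass program into an automaton, one can track the position modulo 3 inside the current hoped-for copy of 010, with one extra "rejecting sink" state. One possible such DFA, written in the dictionary format of the run_dfa sketch above (our illustration, not the book's figure), is:

# States 0, 1, 2 expect the 1st, 2nd, 3rd symbol of a copy of 010;
# state 3 is a rejecting sink. State 0 (a whole number of copies read
# so far) is the only accepting state.
T = {(0, 0): 1, (0, 1): 3,
     (1, 0): 3, (1, 1): 2,
     (2, 0): 0, (2, 1): 3,
     (3, 0): 3, (3, 1): 3}
S = {0}
print(run_dfa(T, S, [0, 1, 0, 0, 1, 0]))  # 1: two copies of 010
print(run_dfa(T, S, [0, 1]))              # 0: not a whole number of copies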
• The length of the input 𝑥 ∈ {0, 1}∗ that the DFA is provided. The
input length is always finite, but not a priori bounded.
• The number of steps that the DFA takes can grow with the length of
the input. Indeed, a DFA makes a single pass on the input and so it
takes precisely |𝑥| steps on an input 𝑥 ∈ {0, 1}∗ .
Proof Idea:
$$StDC(a) = \begin{cases} F & \text{$a$ represents automaton $A$ and $F$ is the function $A$ computes} \\ \text{ONE} & \text{otherwise} \end{cases}$$
where ONE ∶ {0, 1}∗ → {0, 1} is the constant function that outputs
1 on all inputs (and is a member of DFACOMP). Since by definition,
every function 𝐹 in DFACOMP is computable by some automaton,
𝑆𝑡𝐷𝐶 is an onto function from {0, 1}∗ to DFACOMP, which means
that DFACOMP is countable (see Section 2.4.2).
■
Theorem 6.5 — Existence of DFA-uncomputable functions. There exists a
Boolean function $F : \{0,1\}^* \to \{0,1\}$ that is not computable by any
DFA.
one could search for all text files that contain the string important
document or perhaps (letting 𝑃 correspond to a neural-network based
classifier) all images that contain a cat. However, we don’t want our
system to get into an infinite loop just trying to evaluate the program
𝑃 ! For this reason, typical systems for searching files or databases do
not allow users to specify the patterns using full-fledged programming
languages. Rather, such systems use restricted computational models that
on the one hand are rich enough to capture many of the queries needed
in practice (e.g., all filenames ending with .txt, or all phone numbers
of the form (617) xxx-xxxx), but on the other hand are restricted
enough so that queries can be evaluated very efficiently on huge files
and in particular cannot result in an infinite loop.
One of the most popular such computational models is regular
expressions. If you ever used an advanced text editor, a command-line
shell, or have done any kind of manipulation of text files, then you
have probably come across regular expressions.
A regular expression over some alphabet Σ is obtained by combin-
ing elements of Σ with the operation of concatenation, as well as |
(corresponding to or) and ∗ (corresponding to repetition zero or
more times). (Common implementations of regular expressions in
programming languages and shells typically include some extra oper-
ations on top of | and ∗, but these operations can be implemented as
"syntactic sugar" using the operators | and ∗.) For example, the following regular expression over the alphabet {0, 1} corresponds to the set of all strings $x \in \{0,1\}^*$ where every digit is repeated at least twice:

$$(00(0^*)\;|\;11(1^*))^*$$
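As a quick sanity check of this expression (a sketch using Python's built-in re module; the pattern syntax differs slightly from the mathematical notation, e.g. (?: ) for grouping):

import re

# (00(0*)|11(1*))*: every maximal run of identical digits has length >= 2
pattern = re.compile(r"(?:000*|111*)*\Z")

print(bool(pattern.match("001100")))  # True: runs 00, 11, 00
print(bool(pattern.match("00100")))   # False: the middle 1 is a run of length 1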
1. 𝑒 = 𝜎 where 𝜎 ∈ Σ
P
The formal definition of Φ𝑒 is one of those definitions
that is more cumbersome to write than to grasp. Thus
it might be easier for you first to work out the defini-
tion on your own, and then check that it matches what
is written below.
2. If 𝑒 = (𝑒′ |𝑒″ ) then Φ𝑒 (𝑥) = Φ𝑒′ (𝑥)∨Φ𝑒″ (𝑥) where ∨ is the OR op-
erator.
5. Finally, for the edge cases Φ∅ is the constant zero function, and
Φ"" is the function that only outputs 1 on the empty string "".
P
The definitions above are not inherently difficult but
are a bit cumbersome. So you should pause here and
go over it again until you understand why it corre-
sponds to our intuitive notion of regular expressions.
This is important not just for understanding regular
expressions themselves (which are used time and
again in a great many applications) but also for get-
ting better at understanding recursive definitions in
general.
𝑒 = (𝑎|𝑏|𝑐|𝑑)(𝑎|𝑏|𝑐|𝑑)∗ (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗
Solution:
We can obtain such a recursive algorithm by using the following
observations:
𝑛, it can make 𝑛 recursive calls, and hence it can be shown that in the
worst case Algorithm 6.10 can take time exponential in the length of
the input string 𝑥. Fortunately, it turns out that there is a much more
efficient algorithm that can match regular expressions in linear (i.e.,
𝑂(𝑛)) time. Since we have not yet covered the topics of time and space
complexity, we describe this algorithm in high level terms, without
making the computational model precise. Rather we will use the
colloquial notion of 𝑂(𝑛) running time as used in introduction to
programming courses and whiteboard coding interviews. We will see
a formal definition of time complexity in Chapter 13.
Theorem 6.12 — Matching regular expressions in linear time. Let $e$ be a
regular expression. Then there is an $O(n)$ time algorithm that
computes $\Phi_e$.
Φ𝑒 (𝑥) on the length of 𝑥 and not about the dependence of this time on
the length of 𝑒.
Algorithm 6.14 is a recursive algorithm that takes as input an expression
$e$ and a string $x \in \{0,1\}^n$, performs at most $C(|e|)$ steps of computation,
and then calls itself with input some expression $e'$ and a string $x'$ of
length $n-1$. It will terminate after $n$ steps when it reaches a string of
length 0. So, the running time $T(e, n)$ that it takes for Algorithm 6.14
to compute $\Phi_e$ for inputs of length $n$ satisfies the recursive equation:
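In terms of the bound $L(e)$ established in the claim below, one natural way to write this recurrence is

$$T(e, n) \le C(L(e)) + \max_{e'} T(e', n-1) \;,$$

where the maximum ranges over the expressions $e'$ obtained by restricting $e$ to a single symbol. Since every such restriction (and every further restriction) has length at most $L(e)$, unrolling the recurrence gives $T(e, n) \le C(L(e)) \cdot n + O(1)$, which is $O(n)$ for any fixed expression $e$.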
Claim: Let 𝑒 be a regular expression over {0, 1}, then there is a num-
ber 𝐿(𝑒) ∈ ℕ, such that for every sequence of symbols 𝛼0 , … , 𝛼𝑛−1 , if
we define 𝑒′ = 𝑒[𝛼0 ][𝛼1 ] ⋯ [𝛼𝑛−1 ] (i.e., restricting 𝑒 to 𝛼0 , and then 𝛼1
and so on and so forth), then |𝑒′ | ≤ 𝐿(𝑒).
Proof of claim: For a regular expression 𝑒 over {0, 1} and 𝛼 ∈ {0, 1}𝑚 ,
we denote by 𝑒[𝛼] the expression 𝑒[𝛼0 ][𝛼1 ] ⋯ [𝛼𝑚−1 ] obtained by restrict-
ing 𝑒 to 𝛼0 and then to 𝛼1 and so on. We let 𝑆(𝑒) = {𝑒[𝛼]|𝛼 ∈ {0, 1}∗ }.
We will prove the claim by showing that for every 𝑒, the set 𝑆(𝑒) is fi-
nite, and hence so is the number 𝐿(𝑒) which is the maximum length of
𝑒′ for 𝑒′ ∈ 𝑆(𝑒).
We prove this by induction on the structure of $e$. If $e$ is a symbol, the
empty string, or the empty set, then this is straightforward to show,
since the only expressions $S(e)$ can contain are the expression itself, "",
and ∅. Otherwise we split into the two cases (i) $e = (e')^*$ and (ii) $e =$
𝑒′ 𝑒″ , where 𝑒′ , 𝑒″ are smaller expressions (and hence by the induction
hypothesis 𝑆(𝑒′ ) and 𝑆(𝑒″ ) are finite). In the case (i), if 𝑒 = (𝑒′ )∗ then
𝑒[𝛼] is either equal to (𝑒′ )∗ 𝑒′ [𝛼] or it is simply the empty set if 𝑒′ [𝛼] = ∅.
Since 𝑒′ [𝛼] is in the set 𝑆(𝑒′ ), the number of distinct expressions in
𝑆(𝑒) is at most |𝑆(𝑒′ )| + 1. In the case (ii), if 𝑒 = 𝑒′ 𝑒″ then all the
restrictions of 𝑒 to strings 𝛼 will either have the form 𝑒′ 𝑒″ [𝛼] or the form
𝑒′ 𝑒″ [𝛼]|𝑒′ [𝛼′ ] where 𝛼′ is some string such that 𝛼 = 𝛼′ 𝛼″ and 𝑒″ [𝛼″ ]
matches the empty string. Since 𝑒″ [𝛼] ∈ 𝑆(𝑒″ ) and 𝑒′ [𝛼′ ] ∈ 𝑆(𝑒′ ), the
number of the possible distinct expressions of the form 𝑒[𝛼] is at most
|𝑆(𝑒″ )| + |𝑆(𝑒″ )| ⋅ |𝑆(𝑒′ )|. This completes the proof of the claim.
Theorem 6.15 — DFA for regular expression matching. Let $e$ be a regular
expression. Then there is an algorithm that on input $x \in \{0,1\}^*$
computes $\Phi_e(x)$ while making a single pass over $x$ and maintaining
a constant amount of memory.
Proof Idea:
The single-pass constant-memory algorithm for checking if a string matches
a regular expression is presented in Algorithm 6.16. The idea is to
replace the recursive algorithm of Algorithm 6.14 with a dynamic program,
using the technique of memoization. If you haven't yet taken an
algorithms course, you might not know these techniques. This is OK;
while this more efficient algorithm is crucial for the many practical
applications of regular expressions, it is not of great importance for
this book.
⋆
Proof Idea:
One direction follows from Theorem 6.15, which shows that for
every regular expression 𝑒, the function Φ𝑒 can be computed by a DFA
(see for example Fig. 6.6). For the other direction, we show that given
a DFA (𝑇 , 𝒮) for every 𝑣, 𝑤 ∈ [𝐶] we can find a regular expression that
would match 𝑥 ∈ {0, 1}∗ if and only if the DFA starting in state 𝑣, will
end up in state 𝑤 after reading 𝑥.
⋆
Proof of Theorem 6.17. Since Theorem 6.15 proves the “only if” direc-
tion, we only need to show the “if” direction. Let 𝐴 = (𝑇 , 𝒮) be a DFA
with 𝐶 states that computes the function 𝐹 . We need to show that 𝐹 is
regular.
Figure 6.6: A deterministic finite automaton that computes the function $\Phi_{(01)^*}$.

For every $v, w \in [C]$, we let $F_{v,w} : \{0,1\}^* \to \{0,1\}$ be the function
that maps $x \in \{0,1\}^*$ to 1 if and only if the DFA $A$, starting at the
state 𝑣, will reach the state 𝑤 if it reads the input 𝑥. We will prove that
𝐹𝑣,𝑤 is regular for every 𝑣, 𝑤. This will prove the theorem, since by
Definition 6.2, 𝐹 (𝑥) is equal to the OR of 𝐹0,𝑤 (𝑥) for every 𝑤 ∈ 𝒮.
Hence if we have a regular expression for every function of the form
𝐹𝑣,𝑤 then (using the | operation), we can obtain a regular expression
for 𝐹 as well.
Figure 6.7: Given a DFA of $C$ states, for every $v, w \in [C]$ and number $t \in \{0, \ldots, C\}$ we define the function $F^t_{v,w} : \{0,1\}^* \to \{0,1\}$ to output one on input $x \in \{0,1\}^*$ if and only if when the DFA is initialized in the state $v$ and is given the input $x$, it will reach the state $w$ while going only through the intermediate states $\{0, \ldots, t-1\}$.

To give regular expressions for the functions $F_{v,w}$, we start by
defining the following functions $F^t_{v,w}$: for every $v, w \in [C]$ and
$0 \le t \le C$, $F^t_{v,w}(x) = 1$ if and only if starting from $v$ and observing
$x$, the automaton reaches $w$ with all intermediate states being in the set
$[t] = \{0, \ldots, t-1\}$ (see Fig. 6.7). That is, while $v, w$ themselves might
be outside $[t]$, $F^t_{v,w}(x) = 1$ if and only if throughout the execution of
the automaton on the input $x$ (when initiated at $v$) it never enters any
of the states outside $[t]$ and still ends up at $w$. If $t = 0$ then $[t]$ is the
empty set, and hence $F^0_{v,w}(x) = 1$ if and only if the automaton reaches
$w$ from $v$ directly on $x$, without any intermediate state. If $t = C$ then
all states are in $[t]$, and hence $F^C_{v,w} = F_{v,w}$.
We will prove the theorem by induction on $t$, showing that $F^t_{v,w}$ is
regular for every $v, w$ and $t$. For the base case of $t = 0$, $F^0_{v,w}$ is regular

$$R^t_{v,w} \;\big|\; R^t_{v,t} \left(R^t_{t,t}\right)^* R^t_{t,w} \;.$$
This completes the proof of the inductive step and hence of the theo-
rem.
■
1. $|y| \ge 1$.

2. $|xy| \le n_0$.
Proof Idea:
The idea behind the proof is the following. Let 𝑛0 be twice the
number of symbols that are used in the expression 𝑒, then the only
way that there is some 𝑤 with |𝑤| > 𝑛0 and Φ𝑒 (𝑤) = 1 is that 𝑒 con-
tains the ∗ (i.e. star) operator and that there is a non-empty substring
𝑦 of 𝑤 that was matched by (𝑒′ )∗ for some sub-expression 𝑒′ of 𝑒. We
can now repeat 𝑦 any number of times and still get a matching string.
See also Fig. 6.9.
⋆
P
The pumping lemma is a bit cumbersome to state,
but one way to remember it is that it simply says the
following: “if a string matching a regular expression is
long enough, one of its substrings must be matched using
the ∗ operator”.
$|w'| > 2|e'|$ then by the induction hypothesis there exist $x, y, z'$ with
$|y| \ge 1$, $|xy| \le 2|e'| < n_0$ such that $w' = xyz'$ and $e'$ matches $xy^kz'$
for every $k \in \mathbb{N}$. This completes the proof since if we set $z = z'w''$
then we see that $w = w'w'' = xyz$ and $e = (e')(e'')$ matches $xy^kz$ for
every $k \in \mathbb{N}$. Otherwise, if $|w'| \le 2|e'|$ then since $|w| = |w'| + |w''| >
n_0 = 2(|e'| + |e''|)$, it must be that $|w''| > 2|e''|$. Hence by the induction
hypothesis there exist $x', y, z$ such that $|y| \ge 1$, $|x'y| \le 2|e''|$ and $e''$
matches $x'y^kz$ for every $k \in \mathbb{N}$. But now if we set $x = w'x'$ we see that
$|xy| = |w'| + |x'y| \le 2|e'| + 2|e''| = n_0$ and on the other hand the
expression $e = (e')(e'')$ matches $xy^kz = w'x'y^kz$ for every $k \in \mathbb{N}$.

In case (c), if $w$ is matched by $(e')^*$ then $w = w_0 \cdots w_t$ where for
every $i \in [t]$, $w_i$ is a nonempty string matched by $e'$. If $|w_0| > 2|e'|$,
then we can use the same approach as in the concatenation case above.
Otherwise, we simply note that if $x$ is the empty string, $y = w_0$, and
$z = w_1 \cdots w_t$ then $|xy| \le n_0$ and $xy^kz$ is matched by $(e')^*$ for every
$k \in \mathbb{N}$.
■
R
Remark 6.22 — Recursive definitions and inductive
proofs. When an object is recursively defined (as in the
case of regular expressions) then it is natural to prove
properties of such objects by induction. That is, if we
want to prove that all objects of this type have property
$P$, then it is natural to use an inductive step that
says that if $o', o'', o'''$ etc. have property $P$ then so does an
object $o$ that is obtained by composing them.
Using the pumping lemma, we can easily prove Lemma 6.20 (i.e.,
the non-regularity of the “matching parenthesis” function):
The pumping lemma is a very useful tool to show that certain func-
tions are not computable by a regular expression. However, it is not an
“if and only if” condition for regularity: there are non-regular func-
tions that still satisfy the pumping lemma conditions. To understand
the pumping lemma, it is crucial to follow the order of quantifiers in
Theorem 6.21. In particular, the number 𝑛0 in the statement of Theo-
rem 6.21 depends on the regular expression (in the proof we chose 𝑛0
to be twice the number of symbols in the expression). So, if we want
to use the pumping lemma to rule out the existence of a regular ex-
pression 𝑒 computing some function 𝐹 , we need to be able to choose
an appropriate input 𝑤 ∈ {0, 1}∗ that can be arbitrarily large and
satisfies 𝐹 (𝑤) = 1. This makes sense if you think about the intuition
behind the pumping lemma: we need 𝑤 to be large enough as to force
the use of the star operator.
Solved Exercise 6.4 — Palindromes is not regular. Prove that the following
function over the alphabet $\{0, 1, ;\}$ is not regular: $PAL(w) = 1$ if and
only if $w = u;u^R$ where $u \in \{0,1\}^*$ and $u^R$ denotes $u$ "reversed":
the string $u_{|u|-1} \cdots u_0$. (The Palindrome function is most often defined
without an explicit separator character ;, but the version with such a
separator is a bit cleaner, and so we use it here. This does not make
much difference.)
Figure 6.10: A cartoon of a proof using the pumping lemma that a function 𝐹 is not regular. The pumping lemma states that if 𝐹 is regular then there
exists a number 𝑛0 such that for every large enough 𝑤 with 𝐹 (𝑤) = 1, there exists a partition of 𝑤 to 𝑤 = 𝑥𝑦𝑧 satisfying certain conditions such
that for every 𝑘 ∈ ℕ, 𝐹 (𝑥𝑦𝑘 𝑧) = 1. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every there exists
quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every for every quantifier
corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof
corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction which would be a choice
of 𝑘 that would result in 𝐹 (𝑥𝑦𝑘 𝑧) = 0, hence violating the conclusion of the pumping lemma.
Solution:
We use the pumping lemma. Suppose toward the sake of contradiction
that there is a regular expression $e$ computing PAL,
and let $n_0$ be the number obtained by the pumping lemma (Theorem
6.21). Consider the string $w = 0^{n_0};0^{n_0}$. Since the reverse
of the all zero string is the all zero string, $PAL(w) = 1$. Now, by
the pumping lemma, if PAL is computed by $e$, then we can write
$w = xyz$ such that $|xy| \le n_0$, $|y| \ge 1$ and $PAL(xy^kz) = 1$ for
every $k \in \mathbb{N}$. In particular, it must hold that $PAL(xz) = 1$, but this
is a contradiction, since $xz = 0^{n_0-|y|};0^{n_0}$ and so its two parts are
not of the same length and in particular are not the reverse of one
another.
■
𝑒 and 𝑒′ compute the same function?” and “does there exist a string 𝑥
that is matched by the expression 𝑒?”. The following theorem shows
that we can answer the latter question:
Theorem 6.23 — Emptiness of regular languages is computable. There is an
algorithm that given a regular expression $e$, outputs 1 if and only if
$\Phi_e$ is the constant zero function.
Proof Idea:
The idea is that we can directly observe this from the structure
of the expression. The only way a regular expression 𝑒 computes
the constant zero function is if 𝑒 has the form ∅ or is obtained by
concatenating ∅ with other expressions.
⋆
• ∅ is empty.
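The remaining rules can be spelled out directly in code. The following Python sketch is ours and assumes a hypothetical representation in which a regular expression is either the string "empty_set" (for ∅), a string over the alphabet (including "" for the empty-string expression), or a nested tuple ("or", …), ("concat", …), ("star", …):

def is_empty(e):
    """Return True iff Phi_e is the constant zero function."""
    if e == "empty_set":        # ∅ matches nothing
        return True
    if isinstance(e, str):      # a symbol, or "" which matches the empty string
        return False
    op = e[0]
    if op == "or":              # (e'|e'') is empty iff both parts are empty
        return is_empty(e[1]) and is_empty(e[2])
    if op == "concat":          # e'e'' is empty iff at least one part is empty
        return is_empty(e[1]) or is_empty(e[2])
    if op == "star":            # (e')* always matches the empty string
        return False

# (0|∅)∅ matches nothing, since the concatenation includes ∅
print(is_empty(("concat", ("or", "0", "empty_set"), "empty_set")))  # True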
Theorem 6.24 — Equivalence of regular expressions is computable. Let
$REGEQ : \{0,1\}^* \to \{0,1\}$ be the function that on input (a string
representing) a pair of regular expressions $e, e'$, outputs 1 if and only if
$\Phi_e = \Phi_{e'}$. Then $REGEQ$ is computable.
Proof Idea:
The idea is to show that given a pair of regular expressions $e$ and
$e'$ we can find an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if
$\Phi_e(x) \neq \Phi_{e'}(x)$. Therefore $\Phi_{e''}$ is the constant zero function if and only
if $e$ and $e'$ are equivalent.
Proof of Theorem 6.24. We will prove Theorem 6.24 from Theorem 6.23.
(The two theorems are in fact equivalent: it is easy to prove Theo-
rem 6.23 from Theorem 6.24, since checking for emptiness is the same
as checking equivalence with the expression ∅.) Given two regu-
lar expressions 𝑒 and 𝑒′ , we will compute an expression 𝑒″ such that
Φ𝑒″ (𝑥) = 1 if and only if Φ𝑒 (𝑥) ≠ Φ𝑒′ (𝑥). One can see that 𝑒 is equiva-
lent to 𝑒′ if and only if 𝑒″ is empty.
We start with the observation that for every bit $a, b \in \{0,1\}$, $a \neq b$ if
and only if

$$(a \wedge \overline{b}) \vee (\overline{a} \wedge b) \;.$$

Hence we need to construct $e''$ such that for every $x$,

$$\Phi_{e''}(x) = (\Phi_e(x) \wedge \overline{\Phi_{e'}(x)}) \vee (\overline{\Phi_e(x)} \wedge \Phi_{e'}(x)) \;. \quad (6.3)$$
To construct the expression $e''$, we will show how given any pair of
expressions $e$ and $e'$, we can construct expressions $e \wedge e'$ and $\overline{e}$ that
compute the functions $\Phi_e \wedge \Phi_{e'}$ and $\overline{\Phi_e}$ respectively. (Computing the
expression for $e \vee e'$ is straightforward using the | operation of regular
expressions.)

Specifically, by Lemma 6.18, regular functions are closed under
negation, which means that for every regular expression $e$, there is an
expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_e(x)$ for every $x \in \{0,1\}^*$. Now,
for every two expressions $e$ and $e'$, the expression

$$e \wedge e' = \overline{\left(\overline{e}\,|\,\overline{e'}\right)}$$
computes the AND of the two expressions. Given these two transfor-
mations, we see that for every regular expressions 𝑒 and 𝑒′ we can find
a regular expression 𝑒″ satisfying (6.3) such that 𝑒″ is empty if and
only if 𝑒 and 𝑒′ are equivalent.
■
✓ Chapter Recap
6.7 EXERCISES
Exercise 6.1 — Closure properties of regular functions. Suppose that $F, G :
\{0,1\}^* \to \{0,1\}$ are regular. For each one of the following definitions
Exercise 6.2 One among the following two functions that map $\{0,1\}^*$
to {0, 1} can be computed by a regular expression, and the other one
cannot. For the one that can be computed by a regular expression,
write the expression that does it. For the one that cannot, prove that
this cannot be done using the pumping lemma.
• $F(x) = 1$ if 4 divides $\sum_{i=0}^{|x|-1} x_i$ and $F(x) = 0$ otherwise.

• $G(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i \ge |x|/4$ and $G(x) = 0$ otherwise.
2. Prove that the following function $F : \{0,1\}^* \to \{0,1\}$ is not regular.
For every $x \in \{0,1\}^*$, $F(x) = 1$ iff $\sum_j x_j = 3^i$ for some $i > 0$.
■
7
Loops and infinity

and NAND-TM programs.
“The bounds of arithmetic were however outstepped the moment the idea of
applying the [punched] cards had occurred; and the Analytical Engine does not
occupy common ground with mere ‘calculating machines.’… In enabling mech-
anism to combine together general symbols, in successions of unlimited variety
and extent, a uniting link is established between the operations of matter and
the abstract mental processes of the most abstract branch of mathematical sci-
ence.” , Ada Augusta, countess of Lovelace, 1843
“What is the difference between a Turing machine and the modern computer?
It’s the same as that between Hillary’s ascent of Everest and the establishment
of a Hilton hotel on its peak.” , Alan Perlis, 1982.
• At each step, the machine reads the symbol $\sigma = T[i]$ that is in the
$i$-th location of the tape. Based on this symbol and its state $s$, the
machine decides on:

– What symbol $\sigma'$ to write on the tape

– Whether to move Left (i.e., $i \leftarrow i-1$), Right (i.e., $i \leftarrow i+1$), Stay
in place, or Halt the computation.

– What is going to be the new state $s \in [k]$

Figure 7.5: Steam-powered Turing machine mural, painted by CSE grad students at the University of Washington on the night before spring qualifying examinations, 1987. Image from https://www.cs.washington.edu/building/art/SPTM.
• The set of rules the Turing machine follows is known as its transi-
tion function.
• When the machine halts, its output is the binary string obtained by
reading the tape from the beginning until the first location in which
it contains a ∅ symbol, and then outputting all 0 and 1 symbols in
sequence, dropping the initial ▷ symbol if it exists, as well as the
final ∅ symbol.
State Label
0 START
1 RIGHT_0
2 RIGHT_1
3 LOOK_FOR_0
4 LOOK_FOR_1
5 RETURN
6 OUTPUT_0
7 OUTPUT_1
8 0_AND_BLANK
9 1_AND_BLANK
10 BLANK_AND_STOP
• 𝑀 starts in state START and goes right, looking for the first symbol
that is 0 or 1. If it finds ∅ before it hits such a symbol then it moves
to the OUTPUT_1 state described below.
• Once $M$ finds such a symbol $b \in \{0,1\}$, $M$ deletes $b$ from the tape
by writing the × symbol, enters the RIGHT_$b$ state, and starts
moving rightwards until it hits the first ∅ or × symbol.
• The RETURN state means that 𝑀 goes back to the beginning. Specifi-
cally, 𝑀 moves leftward until it hits the first symbol that is not 0 or
1, in which case it changes its state to START.
• The OUTPUT_𝑏 states mean that 𝑀 will eventually output the value
𝑏. In both the OUTPUT_0 and OUTPUT_1 states, 𝑀 goes left until it
hits ▷. Once it does so, it makes a right step, and changes to the
1_AND_BLANK or 0_AND_BLANK states respectively. In the latter states,
𝑀 writes the corresponding value, moves right and changes to the
BLANK_AND_STOP state, in which it writes ∅ to the tape and halts.
The above description can be turned into a table describing for each
one of the $11 \cdot 5$ combinations of state and symbol, what the Turing
machine will do when it is in that state and it reads that symbol. This
table is known as the transition function of the Turing machine.
P
You should make sure you see why this formal def-
inition corresponds to our informal description of
a Turing machine. To get more intuition on Turing
machines, you can explore some of the online avail-
able simulators such as Martin Ugarte’s, Anthony
Morphett’s, or Paul Rendell’s.
This is a good point to remind the reader that functions are not the
same as programs:
Functions ≠ Programs .
R
Remark 7.4 — Functions vs. languages. As discussed
in Section 6.1.2, many texts use the terminology of
“languages” rather than functions to refer to compu-
tational tasks. A Turing machine 𝑀 decides a language
𝐿 if for every input 𝑥 ∈ {0, 1}∗ , 𝑀 (𝑥) outputs 1 if
and only if 𝑥 ∈ 𝐿. This is equivalent to computing
the Boolean function 𝐹 ∶ {0, 1}∗ → {0, 1} defined as
𝐹 (𝑥) = 1 iff 𝑥 ∈ 𝐿. A language 𝐿 is decidable if there
is a Turing machine 𝑀 that decides it. For historical
reasons, some texts also call such languages recursive,
which is the reason that the letter R is often used
to denote the set of computable Boolean functions /
decidable languages defined in Definition 7.3.
In this book we stick to the terminology of functions
rather than languages, but all definitions and results
can be easily translated back and forth by using the
equivalence between the function 𝐹 ∶ {0, 1}∗ → {0, 1}
and the language 𝐿 = {𝑥 ∈ {0, 1}∗ | 𝐹 (𝑥) = 1}.
Definition 7.5 — Computable (partial or total) functions. Let $F$ be either a
total or partial function mapping $\{0,1\}^*$ to $\{0,1\}^*$ and let $M$ be a
R
Remark 7.6 — Bot symbol. We often use ⊥ as our spe-
cial “failure symbol”. If a Turing machine 𝑀 fails to
halt on some input 𝑥 ∈ {0, 1}∗ then we denote this by
𝑀 (𝑥) = ⊥. This does not mean that 𝑀 outputs some
encoding of the symbol ⊥ but rather that 𝑀 enters
into an infinite loop when given 𝑥 as input.
If a partial function 𝐹 is undefined on 𝑥 then we can
also write 𝐹 (𝑥) = ⊥. Therefore one might think
that Definition 7.5 can be simplified to requiring that
𝑀 (𝑥) = 𝐹 (𝑥) for every 𝑥 ∈ {0, 1}∗ , which would imply
that for every 𝑥, 𝑀 halts on 𝑥 if and only if 𝐹 is de-
fined on 𝑥. However, this is not the case: for a Turing
machine 𝑀 to compute a partial function 𝐹 it is not
necessary for 𝑀 to enter an infinite loop on inputs 𝑥
on which 𝐹 is not defined. All that is needed is for 𝑀
to output 𝐹 (𝑥) on values of 𝑥 on which 𝐹 is defined:
on other inputs it is OK for 𝑀 to output an arbitrary
value such as 0, 1, or anything else, or not to halt at all.
To borrow a term from the C programming language,
on inputs 𝑥 on which 𝐹 is not defined, what 𝑀 does is
“undefined behavior”.
def PAL(Tape):
    head = 0
    state = 0  # START
    while state != 12:
        if state == 0 and Tape[head] == '0':
            state = 3  # LOOK_FOR_0
            Tape[head] = 'x'
            head += 1  # move right
        if state == 0 and Tape[head] == '1':
            state = 4  # LOOK_FOR_1
            Tape[head] = 'x'
            head += 1  # move right
        ...  # more if statements here
The precise details of this program are not important. What matters
is that we can describe Turing machines as programs. Moreover,
note that when translating a Turing machine into a program, the tape
becomes a list or array that can hold values from the finite set Σ.²

² Most programming languages use arrays of fixed size, while a Turing machine's tape is unbounded. But of course there is no need to store an infinite number of ∅ symbols. If you want, you can think of the tape as a list that starts off just long enough to store the input, but is dynamically grown in size as the Turing machine's head explores new positions.

The head position can be thought of as an integer-valued variable that holds
integers of unbounded size. The state is a local register that can hold
one of a fixed number of values in $[k]$.

More generally we can think of every Turing machine $M$ as equivalent to a program similar to the following:
        state = 19
    elif Tape[i]==">" and state == 13:  # δ_M(13,">")=(15,"0","S")
        Tape[i] = "0"
        state = 15
    elif ...
    ...
    elif Tape[i]==">" and state == 29:  # δ_M(29,">")=(.,.,"H")
        break  # Halt
R
Remark 7.7 — NAND-CIRC + loops + arrays = everything.
As we will see, adding loops and arrays to
NAND-CIRC is enough to capture the full power of
all programming languages! Hence we could replace
“NAND-TM” with any of Python, C, Javascript, OCaml,
etc. in the left-hand side of (7.1). But we’re getting
ahead of ourselves: this issue will be discussed in
Chapter 8.
• We use the convention that arrays always start with a capital letter,
and scalar variables (which are never indexed with i) start with
lowercase letters. Hence Foo is an array and bar is a scalar variable.
• The input and output X and Y are now considered arrays with val-
ues of zeroes and ones. (There are also two other special arrays
X_nonblank and Y_nonblank, see below.)
2. The program is executed line by line. When the last line
MODANDJUMP(foo,bar) is executed we do as follows:
7.2.3 Examples
We now present some examples of NAND-TM programs.
carry = IF(started,carry,one(started))
started = one(started)
Y[i] = XOR(X[i],carry)
carry = AND(X[i],carry)
Y_nonblank[i] = one(started)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
temp_0 = NAND(started,started)
temp_1 = NAND(started,temp_0)
temp_2 = NAND(started,started)
temp_3 = NAND(temp_1,temp_2)
temp_4 = NAND(carry,started)
carry = NAND(temp_3,temp_4)
temp_6 = NAND(started,started)
started = NAND(started,temp_6)
temp_8 = NAND(X[i],carry)
temp_9 = NAND(X[i],temp_8)
temp_10 = NAND(carry,temp_8)
Y[i] = NAND(temp_9,temp_10)
temp_12 = NAND(X[i],carry)
carry = NAND(temp_12,temp_12)
temp_14 = NAND(started,started)
Y_nonblank[i] = NAND(started,temp_14)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
P
Working out the above two examples can go a long
way towards understanding the NAND-TM language.
See our GitHub repository for a full specification of
the NAND-TM language.
Theorem 7.11 — Turing machines and NAND-TM programs are equivalent. For
every $F : \{0,1\}^* \to \{0,1\}^*$, $F$ is computable by a NAND-TM program
if and only if $F$ is computable by a Turing machine.
Proof Idea:
To prove such an equivalence theorem, we need to show two di-
rections. We need to be able to (1) transform a Turing machine 𝑀 to
a NAND-TM program 𝑃 that computes the same function as 𝑀 and
(2) transform a NAND-TM program 𝑃 into a Turing machine 𝑀 that
computes the same function as 𝑃 .
The idea of the proof is illustrated in Fig. 7.9. To show (1), given
a Turing machine 𝑀 , we will create a NAND-TM program 𝑃 that
will have an array Tape for the tape of 𝑀 and scalar (i.e., non-array)
variable(s) state for the state of 𝑀 . Specifically, since the state of a
Turing machine is not in {0, 1} but rather in a larger set [𝑘], we will use
$\lceil \log k \rceil$ variables state_0, …, state_$(\lceil \log k \rceil - 1)$ to store the
representation of the state. Similarly, to encode the larger alphabet $\Sigma$
of the tape, we will use $\lceil \log |\Sigma| \rceil$ arrays Tape_0, …, Tape_$(\lceil \log |\Sigma| \rceil - 1)$,
such that the $i$-th location of these arrays encodes the $i$-th symbol in the
tape. Using the fact that every function can be computed
by a NAND-CIRC program, we will be able to compute the transition
function of 𝑀 , replacing moving left and right by decrementing and
incrementing i respectively.
We show (2) using very similar ideas. Given a program 𝑃 that uses
𝑎 array variables and 𝑏 scalar variables, we will create a Turing ma-
chine with about 2𝑏 states to encode the values of scalar variables, and
an alphabet of about 2𝑎 so we can encode the arrays using our tape.
(The reason the sizes are only “about” 2𝑎 and 2𝑏 is that we need to
add some symbols and steps for bookkeeping purposes.) The Turing
machine 𝑀 simulates each iteration of the program 𝑃 by updating its
state and tape accordingly.
⋆
• We encode $[k]$ using $\{0,1\}^\ell$ and $\Sigma$ using $\{0,1\}^{\ell'}$, where $\ell = \lceil \log k \rceil$ and $\ell' = \lceil \log |\Sigma| \rceil$.

• We encode the set $\{L, R, S, H\}$ using $\{0,1\}^2$. We will choose the
encoding $L \mapsto 01$, $R \mapsto 11$, $S \mapsto 10$, $H \mapsto 00$. (This conveniently
corresponds to the semantics of the MODANDJUMP operation.)
$\{0,1\}^{\ell'}$ that on input the contents of $P$'s scalar variables and the con-
3. When the program halts (i.e., MODANDJUMP gets 00) then the Turing
machine will enter into a special loop to copy the results of the Y
array into the output and then halt. We can achieve this by adding a
few more states.
R
Remark 7.12 — Running time equivalence (optional). If
we examine the proof of Theorem 7.11 then we can
see that every iteration of the loop of a NAND-TM
program corresponds to one step in the execution of
the Turing machine. We will come back to this ques-
tion of measuring the number of computation steps
later in this course. For now, the main take away point
is that NAND-TM programs and Turing machines
are essentially equivalent in power even when taking
running time into account.
• Inner loops such as the while and for operations common to many
programming languages.
• Multiple index variables (e.g., not just i but we can add j, k, etc.).
In all of these cases (and many others) we can implement the new
feature as mere “syntactic sugar” on top of standard NAND-TM. This
means that the set of functions computable by NAND-TM with this
feature is the same as the set of functions computable by standard
NAND-TM. Similarly, we can show that the set of functions com-
putable by Turing machines that have more than one tape, or tapes
of more dimensions than one, is the same as the set of functions com-
putable by standard Turing machines.
"start": do foo
GOTO("end")
"skip": do bar
"end": do blah
then the program will only do foo and blah as when it reaches the
line GOTO("end") it will jump to the line labeled with "end". We can
achieve the effect of GOTO in NAND-TM using conditionals. In the
code below, we assume that we have a variable pc that can take strings
of some constant length. This can be encoded using a finite number
of Boolean variables pc_0, pc_1, …, pc_𝑘 − 1, and so when we write
below pc = "label" what we mean is something like pc_0 = 0,pc_1
= 1, … (where the bits 0, 1, … correspond to the encoding of the finite
string "label" as a string of length 𝑘). We also assume that we have
access to conditional (i.e., if statements), which we can emulate using
syntactic sugar in the same way as we did in NAND-CIRC.
272 i n trod u c ti on to the ore ti ca l comp u te r sc i e nc e
do foo
do bar
do blah
pc = "line1"
if (pc=="line1"):
do foo
pc = "line2"
if (pc=="line2"):
do bar
pc = "line3"
if (pc=="line3"):
do blah
Other loops. Once we have GOTO, we can emulate all the standard loop
constructs such as while, do .. until or for in NAND-TM as well.
For example, we can replace the code
while foo:
do blah
do bar
with
"loop":
if NOT(foo): GOTO("next")
do blah
GOTO("loop")
"next":
do bar
loop s a n d i n fi n i ty 273
R
Remark 7.13 — GOTO’s in programming languages. The
GOTO statement was a staple of most early program-
ming languages, but has largely fallen out of favor and
is not included in many modern languages such as
Python, Java, and Javascript. In 1968, Edsger Dijkstra wrote a
famous letter titled “Go to statement considered harm-
ful.” (see also Fig. 7.10). The main trouble with GOTO
is that it makes analysis of programs more difficult
by making it harder to argue about invariants of the
program.
When a program contains a loop of the form:
for j in range(100):
do something
do blah
✓ Chapter Recap
7.6 EXERCISES
Exercise 7.1 — Explicit NAND TM programming. Produce the code of a
(syntactic-sugar free) NAND-TM program $P$ that computes the (unbounded
input length) Majority function $Maj : \{0,1\}^* \to \{0,1\}$ where
for every $x \in \{0,1\}^*$, $Maj(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i > |x|/2$. We
say "produce" rather than "write" because you do not have to write
the code of $P$ by hand, but rather can use the programming language
of your choice to compute this code.
■
4. SORT ∶ {0, 1}∗ → {0, 1}∗ which takes as input the representation of
a list of natural numbers (𝑎0 , … , 𝑎𝑛−1 ) and returns its sorted version
(𝑏0 , … , 𝑏𝑛−1 ) such that for every 𝑖 ∈ [𝑛] there is some 𝑗 ∈ [𝑛] with
𝑏𝑖 = 𝑎𝑗 and 𝑏0 ≤ 𝑏1 ≤ ⋯ ≤ 𝑏𝑛−1 .
$$G(x) = \begin{cases} 1 & \exists_{y \in \{0,1\}^{|x|}}\; F(xy) = 1 \\ 0 & \text{otherwise} \end{cases}$$

is in R.
Menabrea on the engine, adding copious notes (longer than the paper
itself). The quote in the chapter’s beginning is taken from Nota A in
this text. Lovelace’s notes contain several examples of programs for the
analytical engine, and because of this she has been called “the world’s
first computer programmer” though it is not clear whether they were
written by Lovelace or Babbage himself [Hol01]. Regardless, Ada was
clearly one of very few people (perhaps the only one outside of Bab-
bage himself) to fully appreciate how important and revolutionary the
idea of mechanizing computation truly is.
The books of Shetterly [She16] and Sobel [Sob17] discuss the his-
tory of human computers (who were female, more often than not)
and their important contributions to scientific discoveries in astron-
omy and space exploration.
Alan Turing was one of the intellectual giants of the 20th century.
He was not only the first person to define the notion of computation,
but also invented and used some of the world’s earliest computational
devices as part of the effort to break the Enigma cipher during World
War II, saving millions of lives. Tragically, Turing committed suicide
in 1954, following his conviction in 1952 for homosexual acts and a
court-mandated hormonal treatment. In 2009, British prime minister
Gordon Brown made an official public apology to Turing, and in 2013
Queen Elizabeth II granted Turing a posthumous pardon. Turing’s life
is the subject of a great book and a mediocre movie.
Sipser’s text [Sip97] defines a Turing machine as a seven tuple con-
sisting of the state space, input alphabet, tape alphabet, transition
function, starting state, accepting state, and rejecting state. Superfi-
cially this looks like a very different definition than Definition 7.1 but
it is simply a different representation of the same concept, just as a
graph can be represented in either adjacency list or adjacency matrix
form.
One difference is that Sipser considers a general set of states 𝑄 that
is not necessarily of the form 𝑄 = {0, 1, 2, … , 𝑘 − 1} for some natural
number 𝑘 > 0. Sipser also restricts his attention to Turing machines
that output only a single bit and therefore designates two special halt-
ing states: the “0 halting state” (often known as the rejecting state) and
the other as the “1 halting state” (often known as the accepting state).
Thus instead of writing 0 or 1 on an output tape, the machine will en-
ter into one of these states and halt. This again makes no difference
to the computational power, though we prefer to consider the more
general model of multi-bit outputs. (Sipser presents the basic task of a
Turing machine as that of deciding a language as opposed to computing
a function, but these are equivalent, see Remark 7.4.)
Sipser considers also functions with input in Σ∗ for an arbitrary
alphabet Σ (and hence distinguishes between the input alphabet which
8
Equivalent models of computation
Theorem 8.1 — Turing machines (aka NAND-TM programs) and RAM machines
(aka NAND-RAM programs) are equivalent. For every function
$F : \{0,1\}^* \to \{0,1\}^*$, $F$ is computable by a NAND-TM program
if and only if $F$ is computable by a NAND-RAM program.
Proof Idea:
Clearly NAND-RAM is only more powerful than NAND-TM, and
so if a function $F$ is computable by a NAND-TM program then it can
be computed by a NAND-RAM program. The challenging direction is
to transform a NAND-RAM program $P$ to an equivalent NAND-TM
program $Q$. To describe the proof in full we will need to cover the full
formal specification of the NAND-RAM language, and show how we
can implement every one of its features as syntactic sugar on top of
NAND-TM.

Figure 8.4: Overview of the steps in the proof of Theorem 8.1 simulating NANDRAM with NANDTM. We first use the inner loop syntactic sugar of Section 7.4.1 to enable loading an integer from an array to the index variable i of NANDTM. Once we can do that, we can simulate indexed access in NANDTM. We then use an embedding of $\mathbb{N}^2$ in $\mathbb{N}$ to simulate two dimensional bit arrays in NANDTM. Finally, we use the binary representation to encode one-dimensional arrays of integers as two dimensional arrays of bits, hence completing the simulation of NANDRAM with NANDTM.

This can be done but going over all the operations in detail is rather
tedious. Hence we will focus on describing the main ideas behind this
2. Two dimensional bit arrays: We then show how we can use “syntactic
sugar” to augment NAND-TM with two dimensional arrays. That is,
have two indices i and j and two dimensional arrays, such that we can
use the syntax Foo[i][j] to access the (i,j)-th location of Foo.
R
Remark 8.2 — RAM machines / NAND-RAM and assembly
language (optional). RAM machines correspond quite
closely to actual microprocessors such as those in the
Intel x86 series that also contains a large primary mem-
ory and a constant number of small registers. This is of
course no accident: RAM machines aim at modeling
more closely than Turing machines the architecture of
actual computing systems, which largely follows the
so called von Neumann architecture as described in
the report [Neu45]. As a result, NAND-RAM is sim-
ilar in its general outline to assembly languages such
as x86 or MIPS. These assembly languages all have
instructions to (1) move data from registers to mem-
ory, (2) perform arithmetic or logical computations
on registers, and (3) conditional execution and loops
(“if” and “goto”, commonly known as “branches” and
“jumps” in the context of assembly languages).
The main difference between RAM machines and
actual microprocessors (and correspondingly between
# set i to 0.
LABEL("zero_idx")
dir0 = zero
dir1 = one
# corresponds to i <- i-1
GOTO("zero_idx",NOT(Atzero[i]))
...
# zero out temp
# (code below assumes a specific prefix-free encoding in which 10 is the "end marker")
Temp[0] = 1
Temp[1] = 0
# set i to Bar, assume we know how to increment, compare
LABEL("increment_temp")
cond = EQUAL(Temp,Bar)
dir0 = one
dir1 = one
# corresponds to i <- i+1
INC(Temp)
GOTO("increment_temp",cond)
# if we reach this point, i is number encoded by Bar
...
# final instruction of program
MODANDJUMP(dir0,dir1)
$$embed(x, y) = \tfrac{1}{2}(x+y)(x+y+1) + x \;.$$
Exercise 8.3 asks you to prove that 𝑒𝑚𝑏𝑒𝑑 is indeed one to one, as
well as computable by a NAND-TM program. (The latter can be done
by simply following the grade-school algorithms for multiplication,
addition, and division.) This means that we can replace code of the
form Two[Foo][Bar] = something (i.e., access the two dimensional
array Two at the integers encoded by the one dimensional arrays Foo
and Bar) by code of the form:
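In Python, the underlying arithmetic of this embedding and its inverse can be sketched as follows (our illustration; the actual replacement code is in NAND-TM, built from the syntactic sugar above):

def embed(x, y):
    """Cantor pairing: a one-to-one map from pairs of naturals to naturals."""
    return (x + y) * (x + y + 1) // 2 + x

def unembed(z):
    """Invert embed by first recovering the 'diagonal' t = x + y."""
    t = 0
    while (t + 1) * (t + 2) // 2 <= z:
        t += 1
    x = z - t * (t + 1) // 2
    return (x, t - x)

assert all(unembed(embed(x, y)) == (x, y)
           for x in range(50) for y in range(50))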
R
Remark 8.3 — Recursion in NAND-RAM (advanced). One
concept that appears in many programming languages
but we did not include in NAND-RAM programs is
recursion. However, recursion (and function calls in
general) can be implemented in NAND-RAM using
the stack data structure. A stack is a data structure con-
taining a sequence of elements, where we can “push”
elements into it and “pop” them from it in “first in last
out” order.
We can implement a stack using an array of integers
Stack and a scalar variable stackpointer that will
# push foo onto the stack
Stack[stackpointer] = foo
stackpointer += one

# pop the top of the stack into bar
stackpointer -= one
bar = Stack[stackpointer]
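As a small illustration of the same idea in Python rather than NAND-RAM (our sketch, not from the text), here is a recursive function rewritten with an explicit stack that mirrors the push and pop operations above:

def factorial_recursive(n):
    return 1 if n == 0 else n * factorial_recursive(n - 1)

def factorial_with_stack(n):
    stack = []               # explicit stack replacing the call stack
    while n > 0:             # "push" a frame for each pending call
        stack.append(n)
        n -= 1
    result = 1               # base case: factorial(0) = 1
    while stack:             # "pop" frames in first-in-last-out order
        result *= stack.pop()
    return result

assert factorial_with_stack(10) == factorial_recursive(10) == 3628800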
Definition 8.5 — Turing completeness and equivalence (optional). Let $\mathcal{F}$ be
the set of all partial functions from $\{0,1\}^*$ to $\{0,1\}^*$. A computational
model is a map $\mathcal{M} : \{0,1\}^* \to \mathcal{F}$.

We say that a program $P \in \{0,1\}^*$ $\mathcal{M}$-computes a function $F \in \mathcal{F}$
if $\mathcal{M}(P) = F$.

A computational model $\mathcal{M}$ is Turing complete if there is a computable
map $ENCODE_{\mathcal{M}} : \{0,1\}^* \to \{0,1\}^*$ such that for every Turing
machine $N$ (represented as a string), $\mathcal{M}(ENCODE_{\mathcal{M}}(N))$ is equal
to the partial function computed by $N$.

A computational model $\mathcal{M}$ is Turing equivalent if it is Turing
complete and there exists a computable map $DECODE_{\mathcal{M}} :
\{0,1\}^* \to \{0,1\}^*$ such that for every string $P \in \{0,1\}^*$, $N =
DECODE_{\mathcal{M}}(P)$ is a string representation of a Turing machine that
computes the function $\mathcal{M}(P)$.
• Turing machines
• NAND-TM programs
• NAND-RAM programs
• λ calculus
• Game of life (mapping programs and inputs/outputs to starting
and ending configurations)
• Programming languages such as Python/C/Javascript/OCaml…
(allowing for unbounded storage)
Since the cells in the game of life are arranged in an infinite two-
dimensional grid, it is an example of a two dimensional cellular automa-
ton. We can also consider the even simpler setting of a one dimensional
cellular automaton, where the cells are arranged in an infinite line, see
Fig. 8.10. It turns out that even this simple model is enough to achieve
Theorem 8.7 — One dimensional automata are Turing complete. For every
Turing machine $M$, there is a one dimensional cellular automaton
that can simulate $M$ on every input $x$.
Figure 8.11: A Game-of-Life configuration simulating a Turing machine. Figure by Paul Rendell.

To make the notion of "simulating a Turing machine" more precise
we will need to define configurations of Turing machines. We will
do so in Section 8.4.2 below, but at a high level a configuration of a
Turing machine is a string that encodes its full state at a given step in
its computation. That is, the contents of all (non-empty) cells of its
tape, its current state, as well as the head position.
The key idea in the proof of Theorem 8.7 is that at every point in
the computation of a Turing machine 𝑀 , the only cell in 𝑀 ’s tape that
can change is the one where the head is located, and the value this
cell changes to is a function of its current state and the finite state of
𝑀 . This observation allows us to encode the configuration of a Turing
machine 𝑀 as a finite configuration of a cellular automaton 𝑟, and
ensure that a one-step evolution of this encoded configuration under
the rules of 𝑟 corresponds to one step in the execution of the Turing
machine 𝑀 .
• 𝑀 ’s tape contains 𝛼𝑗,0 for all 𝑗 < |𝛼| and contains ∅ for all po-
sitions that are at least |𝛼|, where we let 𝛼𝑗,0 be the value 𝜎 such
that 𝛼𝑗 = (𝜎, 𝑡) with 𝜎 ∈ Σ and 𝑡 ∈ {⋅} ∪ [𝑘]. (In other words,
P
Definition 8.8 below has some technical details, but
is not actually that deep or complicated. Try to take a
moment to stop and think how you would encode as a
string the state of a Turing machine at a given point in
an execution.
Think what are all the components that you need to
know in order to be able to continue the execution
from this point onwards, and what is a simple way
to encode them using a list of finite symbols. In par-
ticular, with an eye towards our future applications,
try to think of an encoding which will make it as sim-
ple as possible to map a configuration at step 𝑡 to the
configuration at step 𝑡 + 1.
2. The full contents of the large scale memory, that is the tape.
3. The contents of the “local registers”, that is the state of the ma-
chine.
Theorem 8.10 — One dimensional automata are Turing complete (formal statement).
For every Turing machine $M$, if we denote by $\Sigma$ the alphabet
of its configuration strings, then there is a one-dimensional cellular
automaton $r$ over the alphabet $\Sigma$ such that
The automaton arising from the proof of Theorem 8.10 has a large
alphabet, and furthermore one whose size depends on the machine
$M$ that is being simulated. It turns out that one can obtain an
automaton with an alphabet of fixed size that is independent of the
program being simulated, and in fact the alphabet of the automaton
can be the minimal set $\{0,1\}$! See Fig. 8.13 for an example of such a
Turing-complete automaton.
R
Remark 8.11 — Configurations of NAND-TM programs.
We can use the same approach as Definition 8.8 to
define configurations of a NAND-TM program. Such a
configuration will need to encode:
𝑓(𝑥) = 𝑥 × 𝑥
we can write it as
𝜆𝑥.𝑥 × 𝑥
and so (𝜆𝑥.𝑥 × 𝑥)(7) = 49. That is, you can think of 𝜆𝑥.𝑒𝑥𝑝(𝑥),
where 𝑒𝑥𝑝 is some expression as a way of specifying the anonymous
function 𝑥 ↦ 𝑒𝑥𝑝(𝑥). Anonymous functions, using either 𝜆𝑥.𝑓(𝑥), 𝑥 ↦
𝑓(𝑥) or other closely related notation, appear in many programming
languages. For example, in Python we can define the squaring function
using lambda x: x*x while in JavaScript we can use x => x*x or
(x) => x*x. In Scheme we would define it as (lambda (x) (* x x)).
Clearly, the name of the argument to a function doesn’t matter, and so
𝜆𝑦.𝑦 × 𝑦 is the same as 𝜆𝑥.𝑥 × 𝑥, as both correspond to the squaring
function.
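To make this concrete, here is a small runnable illustration (mine, not from the text) of anonymous functions in Python:

square = lambda x: x * x      # the Python analog of 𝜆𝑥.𝑥 × 𝑥
print(square(7))              # 49
print((lambda x: x * x)(7))   # applying an anonymous function directly, also 49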
Dropping parentheses. To reduce notational clutter, when writing
𝜆 calculus expressions we often drop the parentheses for function
evaluation. Hence instead of writing 𝑓(𝑥) for the result of applying
the function 𝑓 to the input 𝑥, we can also write this as simply 𝑓 𝑥.
Therefore we can write (𝜆𝑥.𝑥 × 𝑥)7 = 49. In this chapter, we will use
both the 𝑓(𝑥) and 𝑓 𝑥 notations for function application. Function
application binds from left to right, and hence 𝑓 𝑔 ℎ
is the same as (𝑓 𝑔) ℎ.
For example, can you guess what number the following expression is
equal to?
P
The expression (8.1) might seem daunting, but before
you look at the solution below, try to break it apart
to its components, and evaluate each component at a
time. Working out this example would go a long way
toward understanding the λ calculus.
((𝐹 𝑔) 3) .
((𝜆𝑥.(𝜆𝑦.𝑥)) 2) 9 . (8.2)
Solution:
𝜆𝑦.𝑥 is the function that on input 𝑦 ignores its input and outputs
𝑥. Hence (𝜆𝑥.(𝜆𝑦.𝑥))2 yields the function 𝑦 ↦ 2 (or, using 𝜆 nota-
tion, the function 𝜆𝑦.2). Hence (8.2) is equivalent to (𝜆𝑦.2)9 = 2.
■
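As a sanity check, we can mimic this evaluation in Python (an illustration of mine, not part of the text), since Python's lambda behaves the same way here:

# (𝜆𝑥.(𝜆𝑦.𝑥)) 2 9: the inner function ignores its argument y
print((lambda x: (lambda y: x))(2)(9))   # prints 2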
𝜆𝑥.(𝜆𝑦.𝑥 + 𝑦) (8.3)
(𝜆𝑥.𝑓)(𝜆𝑦.𝑔 𝑧) . (8.4)
There are two natural conventions for this: “call by name” (evaluate
the function first, plugging in the argument expression unevaluated)
and “call by value” (evaluate the argument first, and only then apply
the function to the result).
Because the λ calculus has only pure functions, which do not have
“side effects”, in many cases the order does not matter. In fact, it can
be shown that if we obtain a definite irreducible expression (for ex-
ample, a number) in both strategies, then it will be the same one.
However, for concreteness we will always use the “call by name” (i.e.,
lazy evaluation) order. (The same choice is made in the programming
language Haskell, though many other programming languages use
eager evaluation.) Formally, the evaluation of a λ expression using
“call by name” is captured by the following process:
𝑒 = 𝜆𝑥.𝑥
𝑓 = (𝜆𝑎.(𝜆𝑏.𝑏))(𝜆𝑧.𝑧 𝑧)
Solution:
The canonical simplification of 𝑒 is simply 𝜆𝑣₀.𝑣₀. To do the
canonical simplification of 𝑓 we first use 𝛽 reduction to plug in
𝜆𝑧.𝑧𝑧 instead of 𝑎 in (𝜆𝑏.𝑏), but since 𝑎 is not used in this function at
all, we simply obtain 𝜆𝑏.𝑏, which simplifies to 𝜆𝑣₀.𝑣₀ as well.
■
REDUCE 𝐿 𝑓 𝑧 = { 𝑧                                   if 𝐿 = NIL
               { 𝑓 (HEAD 𝐿) (REDUCE (TAIL 𝐿) 𝑓 𝑧)   otherwise
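The following minimal Python sketch (an illustration of mine; it represents NIL by the empty list) mirrors this definition of REDUCE:

def reduce_(L, f, z):
    # return z on the empty list, otherwise fold f over the list
    return z if not L else f(L[0], reduce_(L[1:], f, z))

print(reduce_([1, 2, 3], lambda a, b: a + b, 0))   # 6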
Solved Exercise 8.3 — Compute NAND using λ calculus. Give a λ expression
𝑁 such that 𝑁 𝑥 𝑦 = NAND(𝑥, 𝑦) for every 𝑥, 𝑦 ∈ {0, 1}.
■
Solution:
The NAND of 𝑥, 𝑦 is equal to 1 unless 𝑥 = 𝑦 = 1. Hence we can
write

𝑁 = 𝜆𝑥, 𝑦.IF(𝑥, IF(𝑦, 0, 1), 1)
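We can sanity-check such an expression in Python using the encodings 1 = 𝜆𝑥, 𝑦.𝑥 and 0 = 𝜆𝑥, 𝑦.𝑦 (an illustration of mine, not part of the text):

one  = lambda x: lambda y: x            # the encoding of 1
zero = lambda x: lambda y: y            # the encoding of 0
IF   = lambda c: lambda a: lambda b: c(a)(b)
NAND = lambda x: lambda y: IF(x)(IF(y)(zero)(one))(one)

to_bit = lambda b: b(1)(0)              # decode a Church boolean to 0/1
print([to_bit(NAND(a)(b)) for a in (zero, one) for b in (zero, one)])
# [1, 1, 1, 0]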
Solution:
First, we note that we can compute XOR of two bits as follows:

NOT = 𝜆𝑎.IF(𝑎, 0, 1)   (8.6)

and

XOR2 = 𝜆𝑎, 𝑏.IF(𝑏, NOT(𝑎), 𝑎)   (8.7)
(We are using here a bit of syntactic sugar to describe the func-
tions. To obtain the λ expression for XOR we will simply replace
the expression (8.6) in (8.7).) Now recursively we can define the
XOR of a list as follows:
XOR(𝐿) = { 0                                 if 𝐿 is empty
         { XOR2(HEAD(𝐿), XOR(TAIL(𝐿)))      otherwise
Proof Idea:
To prove the theorem, we need to show that (1) if 𝐹 is computable
by a λ calculus expression then it is computable by a Turing machine,
and (2) if 𝐹 is computable by a Turing machine, then it is computable
by an enhanced λ calculus expression.
Showing (1) is fairly straightforward. Applying the simplification
rules to a λ expression basically amounts to “search and replace”
Proof of Theorem 8.16. We only sketch the proof. The “if” direction
is simple. As mentioned above, evaluating λ expressions basically
amounts to “search and replace”. It is also a fairly straightforward
programming exercise to implement all the above basic operations in
an imperative language such as Python or C, and using the same ideas
we can do so in NAND-RAM as well, which we can then transform to
a NAND-TM program.
For the “only if” direction we need to simulate a Turing machine
using a λ expression. We will do so by first showing for every Tur-
ing machine 𝑀 a λ expression to compute the next-step function
NEXT𝑀 ∶ Σ∗ → Σ∗ that maps a configuration of 𝑀 to the next one (see
Section 8.4.2).
A configuration of 𝑀 is a string 𝛼 ∈ Σ∗ for a finite set Σ. We can
encode every symbol 𝜎 ∈ Σ by a string in {0, 1}ℓ, and so we will
encode a configuration 𝛼 in the λ calculus as a list ⟨𝛼0 , 𝛼1 , … , 𝛼𝑚−1 , ⊥⟩
where 𝛼𝑖 is an ℓ-length string (i.e., an ℓ-length list of 0’s and 1’s) en-
coding a symbol in Σ.
By Lemma 8.9, for every 𝛼 ∈ Σ∗, NEXT𝑀(𝛼)𝑖 is equal to
𝑟(𝛼𝑖−1, 𝛼𝑖, 𝛼𝑖+1) for some finite function 𝑟 ∶ Σ³ → Σ. Using our
encoding of Σ as {0, 1}ℓ, we can also think of 𝑟 as mapping {0, 1}³ℓ to
{0, 1}ℓ . By Solved Exercise 8.3, we can compute the NAND function,
and hence every finite function, including 𝑟, using the λ calculus.
Using this insight, we can compute NEXT𝑀 using the λ calculus as
follows. Given a list 𝐿 encoding the configuration 𝛼0 ⋯ 𝛼𝑚−1 , we
define the lists 𝐿𝑝𝑟𝑒𝑣 and 𝐿𝑛𝑒𝑥𝑡 encoding the configuration 𝛼 shifted
by one step to the right and left respectively. The next configuration
𝛼′ is defined as 𝛼′𝑖 = 𝑟(𝐿𝑝𝑟𝑒𝑣 [𝑖], 𝐿[𝑖], 𝐿𝑛𝑒𝑥𝑡 [𝑖]) where we let 𝐿′ [𝑖] denote
the 𝑖-th element of 𝐿′ . This can be computed by recursion (and hence
using the enhanced λ calculus’ RECURSE operator) as follows:
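Here is a rough Python rendering of this computation (a sketch of my own, assuming a local rule r and a blank symbol, rather than the enhanced λ calculus itself):

def next_config(L, r, blank):
    # shift the configuration right and left, padding with the blank symbol
    Lprev = [blank] + L[:-1]
    Lnext = L[1:] + [blank]
    # the new value at each coordinate depends only on its three neighbors
    return [r(p, c, n) for p, c, n in zip(Lprev, L, Lnext)]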
FINAL(𝛼) = { 𝛼                    if 𝛼 is a halting configuration
           { FINAL(NEXT𝑀(𝛼))     otherwise
P
This is a good point to pause and think how
you would implement these operations your-
self. For example, start by thinking how you
could implement MAP using REDUCE, and
then REDUCE using RECURSE combined with
0, 1, IF, PAIR, HEAD, TAIL, NIL, ISEMPTY. You can
also implement PAIR, HEAD and TAIL based on
0, 1, IF. The most challenging part is to implement
RECURSE using only the operations of the pure λ
calculus.
Theorem 8.18 — Enhanced λ calculus equivalent to pure λ calculus. There
are λ expressions that implement the functions 0, 1, IF, PAIR, HEAD,
TAIL, NIL, ISEMPTY, MAP, REDUCE, and RECURSE.
• We define NIL to be the function that ignores its input and always
outputs 1. That is, NIL = 𝜆𝑥.1. The ISEMPTY function checks,
given an input 𝑝, whether we get 1 if we apply 𝑝 to the function
𝑧𝑒𝑟𝑜 = 𝜆𝑥, 𝑦.0 that ignores both its inputs and always outputs 0. For
every valid pair of the form 𝑝 = PAIR 𝑥 𝑦, 𝑝 𝑧𝑒𝑟𝑜 = 𝑧𝑒𝑟𝑜 𝑥 𝑦 = 0 while
NIL 𝑧𝑒𝑟𝑜 = 1. Formally, ISEMPTY = 𝜆𝑝.𝑝(𝜆𝑥, 𝑦.0).
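The following Python rendering of PAIR, HEAD, TAIL, NIL, and ISEMPTY (an illustration of mine, using the standard pair encoding consistent with the definitions above) may help make these constructions concrete:

PAIR    = lambda x: lambda y: lambda g: g(x)(y)   # a pair is 𝜆𝑔.𝑔 𝑥 𝑦
HEAD    = lambda p: p(lambda x: lambda y: x)
TAIL    = lambda p: p(lambda x: lambda y: y)
NIL     = lambda x: 1                             # ignores input, outputs 1
ISEMPTY = lambda p: p(lambda x: lambda y: 0)      # apply p to 𝑧𝑒𝑟𝑜

p = PAIR("a")("b")
print(HEAD(p), TAIL(p))            # a b
print(ISEMPTY(NIL), ISEMPTY(p))    # 1 0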
R
Remark 8.19 — Church numerals (optional). There is
nothing special about Boolean values. You can use
similar tricks to implement natural numbers using
λ terms. The standard way to do so is to represent
the number 𝑛 by the function ITER𝑛 that on input a
function 𝑓 outputs the function 𝑥 ↦ 𝑓(𝑓(⋯ 𝑓(𝑥))) (𝑛
times). That is, we represent the natural number 1 as
𝜆𝑓.𝑓, the number 2 as 𝜆𝑓.(𝜆𝑥.𝑓(𝑓𝑥)), the number 3 as
𝜆𝑓.(𝜆𝑥.𝑓(𝑓(𝑓𝑥))), and so on and so forth. (Note that
this is not the same representation we used for 1 in
the Boolean context: this is fine; we already know that
the same object can be represented in more than one
way.) The number 0 is represented by the function
that maps any function 𝑓 to the identity function 𝜆𝑥.𝑥.
(That is, 0 = 𝜆𝑓.(𝜆𝑥.𝑥).)
In this representation, we can compute PLUS(𝑛, 𝑚)
as 𝜆𝑓.𝜆𝑥.(𝑛𝑓)((𝑚𝑓)𝑥) and TIMES(𝑛, 𝑚) as 𝜆𝑓.𝑛(𝑚𝑓).
Subtraction and division are trickier, but can be
achieved using recursion. (Working this out is a great
exercise.)
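In Python, Church numerals and the arithmetic operations above can be rendered as follows (an illustration of mine, not part of the text):

zero = lambda f: lambda x: x                 # 0 maps f to the identity
one  = lambda f: lambda x: f(x)              # 1 applies f once
PLUS  = lambda n: lambda m: lambda f: lambda x: n(f)(m(f)(x))
TIMES = lambda n: lambda m: lambda f: n(m(f))

to_int = lambda n: n(lambda k: k + 1)(0)     # decode by counting applications
two = PLUS(one)(one)
print(to_int(PLUS(two)(one)), to_int(TIMES(two)(two)))   # 3 4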
XOR(𝐿) = { 0                                 if 𝐿 is empty
         { XOR2(HEAD(𝐿), XOR(TAIL(𝐿)))      otherwise
where XOR2 ∶ {0, 1}² → {0, 1} is the XOR on two bits. In Python we
would write this as

def xor2(a, b): return 1 - b if a else b   # XOR of two bits
def xor(L): return xor2(L[0], xor(L[1:])) if L else 0

print(xor([0,1,1,0,0,1]))
# 1
Now, how could we eliminate this recursive call? The main idea is
that since functions can take other functions as input, it is perfectly
legal in Python (and the λ calculus of course) to give a function itself
as input.
P
At this point you might want to stop and try to im-
plement this on your own in Python or any other
programming language of your choice (as long as it
allows functions as inputs).
Our first attempt might be to simply use the idea of replacing the
recursive call by me. Let's define this function as myxor

def myxor(me, L): return xor2(L[0], me(L[1:])) if L else 0

myxor(myxor, [1,0,1])
If you do this, you will get the following complaint from the inter-
preter:
TypeError: myxor() missing 1 required positional argu-
ment
The problem is that myxor expects two inputs (a function and a
list) while in the call to me we only provided a list. To correct this, we
modify the call to also provide the function itself:
def tempxor(me, L): return xor2(L[0], me(me, L[1:])) if L else 0

tempxor(tempxor, [1,0,1])
# 0
tempxor(tempxor, [1,0,1,1])
# 1
1. Create the function myf that takes a pair of inputs me and x, and
replaces recursive calls to f with calls to me.
2. Create the function tempf that converts calls in myf of the form
me(x) to calls of the form me(me,x).
def RECURSE(myf):
    def tempf(me,x): return myf(lambda y: me(me,y),x)
    return lambda x: tempf(tempf,x)
xor = RECURSE(myxor)
print(xor([0,1,1,0,0,1]))
# 1
print(xor([1,1,0,0,1,1,1,1]))
# 0
# XOR function
XOR = RECURSE(myXOR)
#TESTING:
R
Remark 8.20 — The Y combinator. The RECURSE opera-
tor above is better known as the Y combinator.
It is one of a family of fixed point operators that given
a lambda expression 𝐹, find a fixed point 𝑓 of 𝐹 such
that 𝑓 = 𝐹 𝑓. If you think about it, XOR is the fixed
point of 𝑚𝑦𝑋𝑂𝑅 above: XOR is the function such
that for every 𝑥, if we plug in XOR as the first argument
of 𝑚𝑦𝑋𝑂𝑅 then we get back XOR, or in other words
XOR = 𝑚𝑦𝑋𝑂𝑅 XOR. Hence finding a fixed point for
𝑚𝑦𝑋𝑂𝑅 is the same as applying RECURSE to it.
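For concreteness, here is the standard fixed-point construction rendered in Python (a sketch of mine; since Python evaluates eagerly, this is the "Z combinator" variant of Y, which eta-expands the self-application to avoid an infinite loop):

Z = lambda F: (lambda x: F(lambda v: x(x)(v)))(lambda x: F(lambda v: x(x)(v)))

fact = Z(lambda me: lambda n: 1 if n == 0 else n * me(n - 1))
print(fact(5))   # 120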
“[The thesis is] not so much a definition or an axiom but … a natural law.”,
Emil Post, 1936.
Computational problems            Type of model                  Examples
─────────────────────────────────────────────────────────────────────────────────
Finite functions                  Non-uniform computation        Boolean circuits, NAND
𝑓 ∶ {0, 1}ⁿ → {0, 1}ᵐ             (algorithm depends on          circuits, straight-line
                                  input length)                  programs (e.g., NAND-CIRC)

Functions with unbounded inputs   Sequential access to memory    Turing machines,
𝐹 ∶ {0, 1}∗ → {0, 1}∗                                            NAND-TM programs

–                                 Indexed access / RAM           RAM machines, NAND-RAM,
                                                                 modern programming languages

–                                 Other                          Lambda calculus,
                                                                 cellular automata
✓ Chapter Recap
8.9 EXERCISES
Exercise 8.1 — Alternative proof for TM/RAM equivalence. Let SEARCH ∶
{0, 1}∗ → {0, 1}∗ be the following function. The input is a pair
(𝐿, 𝑘) where 𝑘 ∈ {0, 1}∗ and 𝐿 is an encoding of a list of key-value pairs
(𝑘0, 𝑣0), …, (𝑘𝑚−1, 𝑣𝑚−1) where 𝑘0, …, 𝑘𝑚−1, 𝑣0, …, 𝑣𝑚−1 are binary
strings. The output is 𝑣𝑖 for the smallest 𝑖 such that 𝑘𝑖 = 𝑘, if such 𝑖
exists, and otherwise the empty string.
4. Prove that for every 𝐹 ∶ {0, 1}∗ → {0, 1}∗ that is computable by a
NAND-RAM program, 𝐹 is computable by a Turing machine.
Exercise 8.2 — NAND-TM lookup. This exercise shows part of the proof that
NAND-TM can simulate NAND-RAM. Produce the code of a NAND-
TM program that computes the function LOOKUP ∶ {0, 1}∗ → {0, 1}
that is defined as follows. On input 𝑝𝑓(𝑖)𝑥, where 𝑝𝑓(𝑖) denotes a
prefix-free encoding of an integer 𝑖, LOOKUP(𝑝𝑓(𝑖)𝑥) = 𝑥𝑖 if 𝑖 < |𝑥|
Exercise 8.7 — Next-step function is local. Prove Lemma 8.9 and use it to
complete the proof of Theorem 8.7.
Exercise 8.8 — λ calculus requires at most three variables. Prove that for ev-
ery λ-expression 𝑒 with no free variables there is an equivalent λ-
expression 𝑓 that only uses the variables 𝑥, 𝑦, and 𝑧.6
■
6 Hint: You can reduce the number of variables a function takes by “pairing them up”. That is, define a λ expression PAIR such that for every 𝑥, 𝑦, PAIR 𝑥 𝑦 is some function 𝑓 such that 𝑓 0 = 𝑥 and 𝑓 1 = 𝑦. Then use PAIR to iteratively reduce the number of variables used.

Exercise 8.9 — Evaluation order example in λ calculus. 1. Let 𝑒 =
𝜆𝑥.7 ((𝜆𝑥.𝑥𝑥)(𝜆𝑥.𝑥𝑥)). Prove that the simplification process of 𝑒
ends in a definite number if we use the “call by name” evaluation
order, but never ends if we use the “call by value” order.
Exercise 8.11 — Next-step function without RECURSE. Let 𝑀 be a Turing
machine. Give an enhanced λ calculus expression to compute the
next-step function NEXT𝑀 of 𝑀 (as in the proof of Theorem 8.16)
without using RECURSE. See footnote for hint.9
■
9 Hint: Use MAP and REDUCE (and potentially FILTER). You might also find the function 𝑧𝑖𝑝 of Exercise 8.10 useful.
Exercise 8.12 — λ calculus to NAND-TM compiler (challenging). Give a program
in the programming language of your choice that takes as input a λ
expression 𝑒 and outputs a NAND-TM program 𝑃 that computes the
same function as 𝑒. For partial credit you can use the GOTO and all
NAND-CIRC syntactic sugar in your output program. You can use
any encoding of λ expressions as binary strings that is convenient for
you. See footnote for hint.10
■
10 Hint: Try to set up a procedure such that if array Left contains an encoding of a λ expression 𝜆𝑥.𝑒 and array Right contains an encoding of another λ expression 𝑒′, then the array Result will contain 𝑒[𝑥 → 𝑒′].
Exercise 8.13 — At least two in 𝜆 calculus. Let 1 = 𝜆𝑥, 𝑦.𝑥 and 0 = 𝜆𝑥, 𝑦.𝑦 as
before. Define
Prove that ALT is a 𝜆 expression that computes the at least two func-
tion. That is, for every 𝑎, 𝑏, 𝑐 ∈ {0, 1} (as encoded above) ALT 𝑎 𝑏 𝑐 = 1
if and only if at least two of {𝑎, 𝑏, 𝑐} are equal to 1.
■
if search('110011') {
replace('110011','00')
} else if search('110111') {
replace('110111','00')
} else if search('111011') {
replace('111011','00')
} else if search('111111') {
  replace('111111','00')
}
sion you fed it. Typed variants of the λ calculus are objects of intense
research, and are strongly related to type systems for programming
languages and computer-verifiable proof systems, see [Pie02]. Some of
the typed variants of the λ calculus do not have infinite loops, which
makes them very useful as ways of enabling static analysis of pro-
grams as well as computer-verifiable proofs. We will come back to this
point in Chapter 10 and Chapter 22.
Tao has proposed showing the Turing completeness of fluid dy-
namics (a “water computer”) as a way of settling the question of the
behavior of the Navier-Stokes equations, see this popular article.
Learning Objectives:
• The universal machine/program - “one
program to rule them all”
• A fundamental result in computer science and
mathematics: the existence of uncomputable
functions.
• The halting problem: the canonical example of
an uncomputable function.
• Introduction to the technique of reductions.
can represent 𝑀 as a string (i.e., using code) and then input 𝑀 to the
universal machine 𝑈 .
Beyond the practical applications, the existence of a universal algo-
rithm also has surprising theoretical ramifications, and in particular
can be used to show the existence of uncomputable functions, upend-
ing the intuitions of mathematicians over the centuries from Euler
to Hilbert. In this chapter we will prove the existence of the univer-
sal program, and also show its implications for uncomputability, see
Fig. 9.1.
Proof Idea:
Once you understand what the theorem says, it is not that hard to
prove. The desired program 𝑈 is an interpreter for Turing machines.
That is, 𝑈 gets a representation of the machine 𝑀 (think of it as source
code), and some input 𝑥, and needs to simulate the execution of 𝑀 on
𝑥.
Think of how you would code 𝑈 in your favorite programming
language. First, you would need to decide on some representation

Figure 9.2: A Universal Turing Machine is a single Turing Machine 𝑈 that can evaluate, given as input the description (as a string) of an arbitrary Turing machine 𝑀 and an input 𝑥, the output of 𝑀 on 𝑥. In contrast to the universal circuit depicted in Fig. 5.6, the machine 𝑀 can be much more complex (e.g., more states or tape alphabet symbols) than 𝑈.
Definition 9.2 — String representation of Turing Machine. Let 𝑀 be a Turing
machine with 𝑘 states and a size-ℓ alphabet Σ = {𝜎0, …, 𝜎ℓ−1} (we
use the convention 𝜎0 = 0, 𝜎1 = 1, 𝜎2 = ∅, 𝜎3 = ▷). We represent
𝑀 as the triple (𝑘, ℓ, 𝑇) where 𝑇 is the table of values for 𝛿𝑀:
R
Remark 9.3 — Take away points of representation. The
details of the representation scheme of Turing ma-
chines as strings are immaterial for almost all applica-
tions. What you need to remember are the following
points:
Proof of Theorem 9.1. We will only sketch the proof, giving the major
ideas. First, we observe that we can easily write a Python program
that, on input a representation (𝑘, ℓ, 𝑇) of a Turing machine 𝑀 and
an input 𝑥, evaluates 𝑀 on 𝑥. Here is the code of this program for
concreteness, though you can feel free to skip it if you are not familiar
with (or interested in) Python:
def EVAL(δ,x):
    '''Evaluate TM given by transition table δ
    on input x'''
    Tape = ["▷"] + [a for a in x]        # tape starts with the start marker
    i = 0; s = 0 # i = head pos, s = state
    while True:
        s, Tape[i], d = δ[(s,Tape[i])]   # apply one step of δ
        if d == "H": break               # halt
        if d == "L": i = max(i-1,0)      # move left (not past the start)
        if d == "R": i += 1              # move right
        if i >= len(Tape): Tape.append('Φ')  # extend the tape with blanks
    j = 1; Y = [] # produce output
    while Tape[j] != 'Φ':
        Y.append(Tape[j])
        j += 1
    return Y
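For instance, under the representation conventions above, the following (hypothetical) transition table describes a one-state machine that scans right over its input and halts at the first blank, so EVAL returns the input unchanged:

δ = {
    (0, "▷"): (0, "▷", "R"),   # skip the start marker
    (0, "0"): (0, "0", "R"),   # move right over a 0
    (0, "1"): (0, "1", "R"),   # move right over a 1
    (0, "Φ"): (0, "Φ", "H"),   # halt at the first blank
}
print(EVAL(δ, "101"))   # ['1', '0', '1']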
R
Remark 9.4 — Efficiency of the simulation. The argu-
ment in the proof of Theorem 9.1 is a very inefficient
way to implement the dictionary data structure in
practice, but it suffices for the purpose of proving the
theorem. Reading and writing to a dictionary of 𝑚
values in this implementation takes Ω(𝑚) steps, but
it is in fact possible to do this in 𝑂(log 𝑚) steps using
a search tree data structure or even 𝑂(1) (for “typical”
instances) using a hash table. NAND-RAM and RAM
machines correspond to the architecture of modern
electronic computers, and so we can implement hash
tables and search trees in NAND-RAM just as they are
implemented in other programming languages.
Proof Idea:
The idea behind the proof follows quite closely Cantor’s proof that
the reals are uncountable (Theorem 2.5), and in fact the theorem can
also be obtained fairly directly from that result (see Exercise 7.11).
However, it is instructive to see the direct proof. The idea is to con-
struct 𝐹 ∗ in a way that will ensure that every possible machine 𝑀 will
in fact fail to compute 𝐹 ∗ . We do so by defining 𝐹 ∗ (𝑥) to equal 0 if 𝑥
describes a Turing machine 𝑀 which satisfies 𝑀 (𝑥) = 1 and defining
𝐹 ∗ (𝑥) = 1 otherwise. By construction, if 𝑀 is any Turing machine and
𝑥 is the string describing it, then 𝐹 ∗ (𝑥) ≠ 𝑀 (𝑥) and therefore 𝑀 does
not compute 𝐹 ∗ .
⋆
Big Idea 12 There are some functions that cannot be computed by
any algorithm.
P
The proof of Theorem 9.5 is short but subtle. I suggest
that you pause here and go back to read it again and
think about it - this is a proof that is worth reading at
least twice if not three or four times. It is not often the
case that a few lines of mathematical reasoning estab-
lish a deeply profound fact - that there are problems
we simply cannot solve.
Proof Idea:
One way to think about this proof is as follows:
That is, we will use the universal Turing machine that computes EVAL
to derive the uncomputability of HALT from the uncomputability of
𝐹 ∗ shown in Theorem 9.5. Specifically, the proof will be by contra-
diction. That is, we will assume towards a contradiction that HALT is
computable, and use that assumption, together with the universal Tur-
ing machine of Theorem 9.1, to derive that 𝐹 ∗ is computable, which
will contradict Theorem 9.5.
⋆
Proof of Theorem 9.6. The proof will use the previously established
result Theorem 9.5. Recall that Theorem 9.5 shows that the following
function 𝐹 ∗ ∶ {0, 1}∗ → {0, 1} is uncomputable:
𝐹∗(𝑥) = { 0   if 𝑥(𝑥) = 1
        { 1   otherwise
where 𝑥(𝑥) denotes the output of the Turing machine described by the
string 𝑥 on the input 𝑥 (with the usual convention that 𝑥(𝑥) = ⊥ if this
computation does not halt).
We will show that the uncomputability of 𝐹 ∗ implies the uncom-
putability of HALT. Specifically, we will assume, towards a contra-
diction, that there exists a Turing machine 𝑀 that can compute the
HALT function, and use that to obtain a Turing machine 𝑀 ′ that com-
putes the function 𝐹 ∗ . (This is known as a proof by reduction, since we
reduce the task of computing 𝐹 ∗ to the task of computing HALT. By
the contrapositive, this means the uncomputability of 𝐹 ∗ implies the
uncomputability of HALT.)
Indeed, suppose that 𝑀 is a Turing machine that computes HALT.
Algorithm 9.7 describes a Turing machine 𝑀 ′ that computes 𝐹 ∗ . (We
use “high level” description of Turing machines, appealing to the
“have your cake and eat it too” paradigm, see Big Idea 10.)
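In Python-like pseudocode, the reduction of Algorithm 9.7 can be sketched as follows (a sketch of mine with hypothetical helper names: HALT is the assumed subroutine computed by 𝑀, and EVAL is the universal machine of Theorem 9.1):

def Fstar(x):
    if HALT(x, x) == 0:       # x(x) does not halt, hence x(x) ≠ 1
        return 1
    # safe to evaluate: we know the computation halts
    return 0 if EVAL(x, x) == 1 else 1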
P
Once again, this is a proof that’s worth reading more
than once. The uncomputability of the halting prob-
lem is one of the fundamental theorems of computer
science, and is the starting point for much of the in-
vestigations we will see later. An excellent way to get
a better understanding of Theorem 9.6 is to go over
Section 9.3.2, which presents an alternative proof of
the same result.
other far less trivial examples of programs that we can certify to never
enter an infinite loop (or programs that we know for sure that will
enter such a loop). However, there is no general procedure that would
determine for an arbitrary program 𝑃 whether it halts or not. More-
over, there are some very simple programs for which no one knows
whether they halt or not. For example, the following Python program
will halt if and only if Goldbach’s conjecture is false:
def isprime(p):
return all(p % i for i in range(2,p-1))
def Goldbach(n):
return any( (isprime(p) and isprime(n-p))
for p in range(2,n-1))
n = 4
while True:
if not Goldbach(n): break
n+= 2
If T[P] = True the routine P will loop, and it will only terminate if
T[P] = False. In each case T[P] has exactly the wrong value, and this
contradiction shows that the function T cannot exist.
Yours faithfully,
C. Strachey
Churchill College, Cambridge
P
Try to stop and extract the argument for proving
Theorem 9.6 from the letter above.
Since CPL is not as common today, let us reproduce this proof. The
idea is the following: suppose for the sake of contradiction that there
exists a program T such that T(f,x) equals True iff f halts on input
x. (Strachey’s letter considers the no-input variant of HALT, but as
we’ll see, this is an immaterial distinction.) Then we can construct a
program P and an input x such that T(P,x) gives the wrong answer.
The idea is that on input x, the program P will do the following: run
T(x,x), and if the answer is True then go into an infinite loop, and
otherwise halt. Now you can see that T(P,P) will give the wrong
answer: if P halts when it gets its own code as input, then T(P,P) is
supposed to be True, but then P(P) will go into an infinite loop. And
if P does not halt, then T(P,P) is supposed to be False but then P(P)
will halt. We can also code this up in Python:
def CantSolveMe(T):
"""
Gets function T that claims to solve HALT.
Returns a pair (P,x) of code and input on which
T(P,x) ≠ HALT(x)
"""
def fool(x):
if T(x,x):
while True: pass
return "I halted"
return (fool,fool)
def T(f,x):
    """Crude halting tester - decides that f doesn't halt
    if its source contains a loop."""
    import inspect
    source = inspect.getsource(f)
    if "while" in source: return False
    if "for" in source: return False
    return True
9.4 REDUCTIONS
The Halting problem turns out to be a linchpin of uncomputability, in
the sense that Theorem 9.6 has been used to show the uncomputabil-
ity of a great many interesting functions. We will see several examples
of such results in this chapter and the exercises, but there are many
more such results (see Fig. 9.6).
R
Remark 9.8 — Reductions are algorithms. A reduction
is an algorithm, which means that, as discussed in
Remark 0.3, a reduction has three components:
P
The proof of Theorem 9.9 is below, but before reading
it you might want to pause for a couple of minutes
and think how you would prove it yourself. In partic-
ular, try to think of what a reduction from HALT to
HALTONZERO would look like. Doing so is an excel-
lent way to get some initial comfort with the notion
of proofs by reduction, which is a technique we will be
using time and again in this book. You can also see
Fig. 9.8 and the following Colab notebook for a Python
implementation of this reduction.
2. Analysis of the reduction: We will then prove that under the hypoth-
esis that Algorithm 𝐴 computes HALTONZERO, Algorithm 𝐵 will
compute HALT.
def N(z):
M = r'.......'
# a string constant containing desc. of M
x = r'.......'
# a string constant containing x
return eval(M,x)
# note that we ignore the input z
simply ignores the input and always returns the result of evaluating
𝑀 on 𝑥. The algorithm 𝐵 does not actually execute the machine 𝑁𝑀,𝑥 .
𝐵 merely writes down the description of 𝑁𝑀,𝑥 as a string (just as we
did above) and feeds this string as input to 𝐴.
The above completes the description of the reduction. The analysis is
obtained by proving the following claim:
Claim: For all strings 𝑀, 𝑥, 𝑧, the machine 𝑁𝑀,𝑥 constructed by
Algorithm 𝐵 in Step 1 satisfies that 𝑁𝑀,𝑥 halts on 𝑧 if and only if the
program described by 𝑀 halts on the input 𝑥.
Proof of Claim: Since 𝑁𝑀,𝑥 ignores its input and evaluates 𝑀 on 𝑥
using the universal Turing machine, it will halt on 𝑧 if and only if 𝑀
halts on 𝑥.
In particular if we instantiate this claim with the input 𝑧 = 0 to
𝑁𝑀,𝑥 , we see that HALTONZERO(𝑁𝑀,𝑥 ) = HALT(𝑀 , 𝑥). Thus if
the hypothetical algorithm 𝐴 satisfies 𝐴(𝑀 ) = HALTONZERO(𝑀 )
for every 𝑀 then the algorithm 𝐵 we construct satisfies 𝐵(𝑀 , 𝑥) =
HALT(𝑀 , 𝑥) for every 𝑀 , 𝑥, contradicting the uncomputability of
HALT.
■
R
Remark 9.11 — The hardwiring technique. In the proof of
Theorem 9.9 we used the technique of “hardwiring”
an input 𝑥 to a program/machine 𝑃 . That is, we take
a program that computes the function 𝑥 ↦ 𝑓(𝑥) and
“fix” or “hardwire” some of the inputs to some con-
stant value. For example, if you have a program that
takes as input a pair of numbers 𝑥, 𝑦 and outputs their
product (i.e., computes the function 𝑓(𝑥, 𝑦) = 𝑥 × 𝑦),
then you can hardwire the value 17 for 𝑥 and obtain a
program that computes the function 𝑦 ↦ 17 × 𝑦.
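In programming terms, hardwiring is just partial application; for example (an illustration of mine, not from the original text):

from functools import partial

def f(x, y):
    return x * y          # a program taking a pair of inputs

g = partial(f, 17)        # "hardwire" the input x to 17
print(g(3))               # 51: g computes the function y ↦ 17*y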
P
Despite the similarity in their names, ZEROFUNC and
HALTONZERO are two different functions. For exam-
ple, if 𝑀 is a Turing machine that on input 𝑥 ∈ {0, 1}∗ ,
halts and outputs the OR of all of 𝑥’s coordinates, then
HALTONZERO(𝑀 ) = 1 (since 𝑀 does halt on the
input 0) but ZEROFUNC(𝑀 ) = 0 (since 𝑀 does not
compute the constant zero function).
2. Return 𝐴(𝑀 ).
P
We leave the proof of Theorem 9.13 as an exercise
(Exercise 9.6). I strongly encourage you to stop here
and try to solve this exercise.
int First(int n) {
if (n<0) return 0;
return 2*n;
}
int Second(int n) {
int i = 0;
int j = 0;
if (n<0) return 0;
while (j<n) {
i = i + 2;
j = j + 1;
}
return i;
}
First and Second are two distinct C programs, but they compute
the same function. A semantic property would be either true for both
programs or false for both programs, since it depends on the function
the programs compute and not on their code. An example of a se-
mantic property that both First and Second satisfy is the following:
“The program 𝑃 computes a function 𝑓 mapping integers to integers satisfy-
ing that 𝑓(𝑛) ≥ 𝑛 for every input 𝑛”.
A property is not semantic if it depends on the source code rather
than the input/output behavior. For example, properties such as “the
program contains the variable k” or “the program uses the while op-
eration” are not semantic. Such properties can be true for one of the
programs and false for others. Formally, we define semantic proper-
ties as follows:
Solution:
Recall that ZEROFUNC(𝑀 ) = 1 if and only if 𝑀 (𝑥) = 0 for
every 𝑥 ∈ {0, 1}∗ . If 𝑀 and 𝑀 ′ are functionally equivalent, then for
every 𝑥, 𝑀 (𝑥) = 𝑀 ′ (𝑥). Hence ZEROFUNC(𝑀 ) = 1 if and only if
ZEROFUNC(𝑀 ′ ) = 1.
■
Theorem 9.15 — Rice’s Theorem. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. If 𝐹 is seman-
tic and non-trivial then it is uncomputable.
Proof Idea:
The idea behind the proof is to show that every semantic non-
trivial function 𝐹 is at least as hard to compute as HALTONZERO.
This will conclude the proof since by Theorem 9.9, HALTONZERO
is uncomputable. If a function 𝐹 is non-trivial then there are two
machines 𝑀0 and 𝑀1 such that 𝐹 (𝑀0 ) = 0 and 𝐹 (𝑀1 ) = 1. So,
the goal would be to take a machine 𝑁 and find a way to map it into
a machine 𝑀 = 𝑅(𝑁 ), such that (i) if 𝑁 halts on zero then 𝑀 is
functionally equivalent to 𝑀1 and (ii) if 𝑁 does not halt on zero then
𝑀 is functionally equivalent to 𝑀0 .
Because 𝐹 is semantic, if we achieved this, then we would be guar-
anteed that HALTONZERO(𝑁 ) = 𝐹 (𝑅(𝑁 )), and hence would show
that if 𝐹 was computable, then HALTONZERO would be computable
as well, contradicting Theorem 9.9.
⋆
Proof of Theorem 9.15. We will not give the proof in full formality, but
rather illustrate the proof idea by restricting our attention to a particu-
lar semantic function 𝐹 . However, the same techniques generalize to
all possible semantic functions. Define MONOTONE ∶ {0, 1}∗ → {0, 1}
as follows: MONOTONE(𝑀 ) = 1 if there does not exist 𝑛 ∈ ℕ and
two inputs 𝑥, 𝑥′ ∈ {0, 1}𝑛 such that for every 𝑖 ∈ [𝑛] 𝑥𝑖 ≤ 𝑥′𝑖 but 𝑀 (𝑥)
outputs 1 and 𝑀 (𝑥′ ) = 0. That is, MONOTONE(𝑀 ) = 1 if it’s not
possible to find an input 𝑥 such that flipping some bits of 𝑥 from 0 to
1 will change 𝑀 ’s output in the other direction from 1 to 0. We will
• The machine INF that simply goes into an infinite loop on every
input satisfies MONOTONE(INF) = 1, since INF is not defined
anywhere and so in particular there are no two inputs 𝑥, 𝑥′ where
𝑥𝑖 ≤ 𝑥′𝑖 for every 𝑖 but INF(𝑥) = 0 and INF(𝑥′ ) = 1.
• The machine PAR that computes the XOR or parity of its input is
not monotone (e.g., PAR(1, 0, 0, …, 0) = 1 but PAR(1, 1, 0, …, 0) =
0) and hence MONOTONE(PAR) = 0.
(Note that INF and PAR are machines and not functions.)
We will now give a reduction from HALTONZERO to
MONOTONE. That is, we assume towards a contradiction that
there exists an algorithm 𝐴 that computes MONOTONE and we will
build an algorithm 𝐵 that computes HALTONZERO. Our algorithm 𝐵
will work as follows:
Algorithm 𝐵:
Input: String 𝑁 describing a Turing machine. (Goal: Compute
HALTONZERO(𝑁 ))
Assumption: Access to Algorithm 𝐴 to compute MONOTONE.
Operation:
1. Construct the following machine 𝑀 : “On input 𝑧 ∈ {0, 1}∗ do: (a)
Run 𝑁 (0), (b) Return PAR(𝑧)”.
2. Return 1 − 𝐴(𝑀 ).
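In Python-like pseudocode, Algorithm 𝐵 can be sketched as follows (a sketch of mine with hypothetical helpers: A is the assumed algorithm for MONOTONE, run_on_zero(N) runs the machine 𝑁 on input 0, and parity computes PAR):

def B(N):
    def M(z):
        run_on_zero(N)     # (a) halts if and only if N halts on zero
        return parity(z)   # (b) then behave like PAR
    # if N halts on zero then M computes PAR (not monotone);
    # otherwise M never halts and is (vacuously) monotone
    return 1 - A(M)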
R
Remark 9.16 — Semantic is not the same as uncom-
putable. Rice’s Theorem is so powerful and such a
popular way of proving uncomputability that peo-
ple sometimes get confused and think that it is the
only way to prove uncomputability. In particular, a
common misconception is that if a function 𝐹 is not
semantic then it is computable. This is not at all the
case.
For example, consider the following function
HALTNOYALE ∶ {0, 1}∗ → {0, 1}. This is a function
that on input a string that represents a NAND-TM
program 𝑃 , outputs 1 if and only if both (i) 𝑃 halts
on the input 0, and (ii) the program 𝑃 does not con-
tain a variable with the identifier Yale. The function
HALTNOYALE is clearly not semantic, as it will out-
put two different values when given as input one of
the following two functionally equivalent programs:
Yale[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Yale[0])
and
Harvard[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Harvard[0])
P
Once again, this is a good point for you to stop and try
to prove the result yourself before reading the proof
below.
Proof. We have seen in Theorem 7.11 that for every Turing machine
𝑀 , there is an equivalent NAND-TM program 𝑃𝑀 such that for ev-
ery 𝑥, 𝑃𝑀 (𝑥) = 𝑀 (𝑥). In particular this means that HALT(𝑀 ) =
NANDTMHALT(𝑃𝑀 ).
The transformation 𝑀 ↦ 𝑃𝑀 that is obtained from the proof
of Theorem 7.11 is constructive. That is, the proof yields a way to
compute the map 𝑀 ↦ 𝑃𝑀 . This means that this proof yields a
reduction from the task of computing HALT to the task of computing
NANDTMHALT, which means that since HALT is uncomputable,
neither is NANDTMHALT.
■
✓ Chapter Recap
9.6 EXERCISES
Exercise 9.1 — NAND-RAM Halt. Let NANDRAMHALT ∶ {0, 1}∗ → {0, 1}
be the function such that on input (𝑃, 𝑥) where 𝑃 represents a NAND-
RAM program, NANDRAMHALT(𝑃, 𝑥) = 1 iff 𝑃 halts on the input 𝑥.
Prove that NANDRAMHALT is uncomputable.
■
2. 𝐻(𝑥) = 1 iff there exist two non-empty strings 𝑢, 𝑣 ∈ {0, 1}∗ such
that 𝑥 = 𝑢𝑣 (i.e., 𝑥 is the concatenation of 𝑢 and 𝑣), 𝐹 (𝑢) = 1 and
𝐺(𝑣) = 1.
Exercise 9.5 Prove that the following function FINITE ∶ {0, 1}∗ → {0, 1}
is uncomputable. On input 𝑃 ∈ {0, 1}∗, we define FINITE(𝑃) = 1
if and only if 𝑃 is a string that represents a NAND++ program such
that there are only a finite number of inputs 𝑥 ∈ {0, 1}∗ s.t. 𝑃(𝑥) = 1.3
3 Hint: You can use Rice’s Theorem.
■
Exercise 9.6 — Computing parity. Prove Theorem 9.13 without using Rice’s
Theorem.
■
Exercise 9.8 For each of the following two functions, say whether it is
computable or not:
3. Prove that there exists a function 𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹 is
not recursively enumerable. See footnote for hint.6
6 Hint: You can either use the diagonalization method to prove this directly or show that the set of all recursively enumerable functions is countable.
4. Prove that there exists a function 𝐹 ∶ {0, 1}∗ → {0, 1} such that
𝐹 is recursively enumerable but the function 𝐹̄ defined as 𝐹̄(𝑥) =
1 − 𝐹(𝑥) is not recursively enumerable. See footnote for hint.7
7 Hint: HALT has this property: show that if both HALT and its complement 𝐻𝐴𝐿𝑇̄ were recursively enumerable then HALT would in fact be computable.
■
2. Use Theorem 9.15 to prove that for every 𝐺 ∶ {0, 1}∗ → {0, 1}, if (a)
𝐺 is neither the constant zero nor the constant one function, and
(b) for every 𝑀, 𝑀′ such that 𝐿(𝑀) = 𝐿(𝑀′), 𝐺(𝑀) = 𝐺(𝑀′),
then 𝐺 is uncomputable. See footnote for hint.8
8 Hint: Show that any 𝐺 satisfying (b) must be semantic.
related to the “Busy Beaver” problem, see Exercise 9.13 and the survey
[Aar20].
The diagonalization argument used to prove uncomputability of 𝐹 ∗
is derived from Cantor’s argument for the uncountability of the reals
discussed in Chapter 2.
Christopher Strachey was an English computer scientist and the
inventor of the CPL programming language. He was also an early
artificial intelligence visionary, programming a computer to play
Checkers and even write love letters in the early 1950’s, see this New
Yorker article and this website.
Rice’s Theorem was proven in [Ric53]. It is typically stated in a
form somewhat different than what we used, see Exercise 9.11.
We do not discuss in the chapter the concept of recursively enumer-
able languages, but it is covered briefly in Exercise 9.10. As usual, we
use function, as opposed to language, notation.
Learning Objectives:
• See that Turing completeness is not always a
good thing.
• Another example of an always-halting
formalism: context-free grammars and simply
typed 𝜆 calculus.
• The pumping lemma for non context-free
functions.
• Examples of computable and uncomputable
“Happy families are all alike; every unhappy family is unhappy in its own
way”, Leo Tolstoy (opening of the book “Anna Karenina”).
• An operation is one of +, −, ×, ÷
operation := +|-|*|/
digit := 0|1|2|3|4|5|6|7|8|9
number := digit|digit number
expression := number|expression operation expression|(expression)
A string over the alphabet { (,) } can be generated from the grammar

match := ""|match match|(match)

(where match is the starting expression and "" corresponds to the
empty string) if and only if it consists of a matching set of parentheses.
Definition 10.4 — Deriving a string from a grammar. If 𝐺 = (𝑉, 𝑅, 𝑠) is a
context-free grammar over Σ, then for two strings 𝛼, 𝛽 ∈ (Σ ∪ 𝑉)∗
we say that 𝛽 can be derived in one step from 𝛼, denoted by 𝛼 ⇒𝐺 𝛽,
if we can obtain 𝛽 from 𝛼 by applying one of the rules of 𝐺. That is,
we obtain 𝛽 by replacing in 𝛼 one occurrence of the variable 𝑣 with
the string 𝑧, where 𝑣 ⇒ 𝑧 is a rule of 𝐺.
We say that 𝛽 can be derived from 𝛼, denoted by 𝛼 ⇒∗𝐺 𝛽, if it
can be derived by some finite number 𝑘 of steps. That is, if there
are 𝛼1 , … , 𝛼𝑘−1 ∈ (Σ ∪ 𝑉 )∗ , so that 𝛼 ⇒𝐺 𝛼1 ⇒𝐺 𝛼2 ⇒𝐺 ⋯ ⇒𝐺
𝛼𝑘−1 ⇒𝐺 𝛽.
We say that 𝑥 ∈ Σ∗ is matched by 𝐺 = (𝑉 , 𝑅, 𝑠) if 𝑥 can be de-
rived from the starting variable 𝑠 (i.e., if 𝑠 ⇒∗𝐺 𝑥). We define the
function computed by (𝑉 , 𝑅, 𝑠) to be the map Φ𝑉 ,𝑅,𝑠 ∶ Σ∗ → {0, 1}
such that Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 iff 𝑥 is matched by (𝑉 , 𝑅, 𝑠). A function
𝐹 ∶ Σ∗ → {0, 1} is context free if 𝐹 = Φ𝑉 ,𝑅,𝑠 for some CFG (𝑉 , 𝑅, 𝑠).
A priori it might not be clear that the map Φ𝑉,𝑅,𝑠 is computable,
but it turns out that this is the case.1
1 As in the case of Definition 6.7 we can also use language rather than function notation and say that a language 𝐿 ⊆ Σ∗ is context free if the function 𝐹 such that 𝐹(𝑥) = 1 iff 𝑥 ∈ 𝐿 is context free.
Proof. We only sketch the proof. We start with the observation that we
can convert every CFG to an equivalent version of Chomsky normal form,
where all rules either have the form 𝑢 → 𝑣𝑤 for variables 𝑢, 𝑣, 𝑤 or the
form 𝑢 → 𝜎 for a variable 𝑢 and symbol 𝜎 ∈ Σ, plus potentially the
rule 𝑠 → "" where 𝑠 is the starting variable.
The idea behind such a transformation is to simply add new vari-
ables as needed, and so for example we can translate a rule such as
𝑣 → 𝑢𝜎𝑤 into the three rules 𝑣 → 𝑢𝑟, 𝑟 → 𝑡𝑤 and 𝑡 → 𝜎.
R
Remark 10.6 — Parse trees. While we focus on the
task of deciding whether a CFG matches a string, the
algorithm to compute Φ𝑉 ,𝑅,𝑠 actually gives more in-
formation than that. That is, on input a string 𝑥, if
Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 then the algorithm yields the sequence
of rules that one can apply from the starting vertex 𝑠
to obtain the final string 𝑥. We can think of these rules
as determining a tree with 𝑠 being the root vertex and
the sinks (or leaves) corresponding to the substrings
of 𝑥 that are obtained by the rules that do not have a
variable in their second element. This tree is known
as the parse tree of 𝑥, and often yields very useful
information about the structure of 𝑥.
Often the first step in a compiler or interpreter for a
programming language is a parser that transforms the
source into the parse tree (also known as the abstract
syntax tree). There are also tools that can automati-
cally convert a description of a context-free grammar
into a parser algorithm that computes the parse tree of
a given string. (Indeed, the above recursive algorithm
can be used to achieve this, but there are much more
efficient versions, especially for grammars that have
particular forms, and programming language design-
ers often try to ensure their languages have these more
efficient grammars.)
Theorem 10.7 — Context free grammars and regular expressions. Let 𝑒 be a
regular expression over {0, 1}. Then there is a CFG (𝑉, 𝑅, 𝑠) over
{0, 1} such that Φ𝑉,𝑅,𝑠 = Φ𝑒.
computes it. Otherwise, we fall into one of the following cases: case 1:
𝑒 = 𝑒′𝑒″, case 2: 𝑒 = 𝑒′|𝑒″ or case 3: 𝑒 = (𝑒′)∗ where in all cases 𝑒′, 𝑒″
are shorter regular expressions. By the induction hypothesis, we can
define grammars (𝑉′, 𝑅′, 𝑠′) and (𝑉″, 𝑅″, 𝑠″) that compute Φ𝑒′ and
Φ𝑒″ respectively. By renaming variables, we can also assume without
loss of generality that 𝑉′ and 𝑉″ are disjoint.
In case 1, we can define the new grammar as follows: we add a new
starting variable 𝑠 ∉ 𝑉′ ∪ 𝑉″ and the rule 𝑠 ↦ 𝑠′𝑠″. In case 2, we can
define the new grammar as follows: we add a new starting variable
𝑠 ∉ 𝑉′ ∪ 𝑉″ and the rules 𝑠 ↦ 𝑠′ and 𝑠 ↦ 𝑠″. Case 3 will be the
only one that uses recursion. As before we add a new starting variable
𝑠 ∉ 𝑉′ ∪ 𝑉″, but now add the rules 𝑠 ↦ "" (i.e., the empty string) and
also add, for every rule of the form (𝑠′, 𝛼) ∈ 𝑅′, the rule 𝑠 ↦ 𝑠𝛼 to 𝑅.
We leave it to the reader as (a very good!) exercise to verify that in
all three cases the grammars we produce capture the same function as
the original expression.
■
It turns out that CFG’s are strictly more powerful than regular
expressions. In particular, as we’ve seen, the “matching parentheses”
function MATCHPAREN can be computed by a context free grammar,
whereas, as shown in Lemma 6.20, it cannot be computed by regular
expressions. Here is another example:
Solved Exercise 10.1 — Context free grammar for palindromes. Let PAL ∶
{0, 1, ; }∗ → {0, 1} be the function defined in Solved Exercise 6.4, where
Solution:
A simple grammar computing PAL can be described using
Backus–Naur notation:
Solution:
Using Backus–Naur notation we can describe such a grammar as
follows
𝑤 = 𝛼𝑏𝑢; 𝑢𝑅 𝑏′ 𝛽
P
The context-free pumping lemma is even more cum-
bersome to state than its regular analog, but you can
remember it as saying the following: “If a long enough
string is matched by a grammar, there must be a variable
that is repeated in the derivation.”
Proof of Theorem 10.8. We only sketch the proof. The idea is that if
the total number of symbols in the rules of the grammar is 𝑛0 , then
the only way to get |𝑥| > 𝑛0 with Φ𝑉 ,𝑅,𝑠 (𝑥) = 1 is to use recursion.
That is, there must be some variable 𝑣 ∈ 𝑉 such that we are able to
368 i n trod u c ti on to the ore ti ca l comp u te r sc i e nc e
derive from 𝑣 the value 𝑏𝑣𝑑 for some strings 𝑏, 𝑑 ∈ Σ∗ , and then further
on derive from 𝑣 some string 𝑐 ∈ Σ∗ such that 𝑏𝑐𝑑 is a substring of
𝑥 (in other words, 𝑥 = 𝑎𝑏𝑐𝑑𝑒 for some 𝑎, 𝑒 ∈ {0, 1}∗ ). If we take
the variable 𝑣 satisfying this requirement with a minimum number
of derivation steps, then we can ensure that |𝑏𝑐𝑑| is at most some
constant depending on 𝑛0 and we can set 𝑛1 to be that constant (𝑛1 =
10 ⋅ |𝑅| ⋅ 𝑛0 will do, since we will not need more than |𝑅| applications
of rules, and each such application can grow the string by at most 𝑛0
symbols).
Thus by the definition of the grammar, we can repeat the derivation
to replace the substring 𝑏𝑐𝑑 in 𝑥 with 𝑏ᵏ𝑐𝑑ᵏ for every 𝑘 ∈ ℕ while
retaining the property that the output of Φ𝑉,𝑅,𝑠 is still one. Since 𝑏𝑐𝑑
is a substring of 𝑥, we can write 𝑥 = 𝑎𝑏𝑐𝑑𝑒 and are guaranteed that
𝑎𝑏ᵏ𝑐𝑑ᵏ𝑒 is matched by the grammar for every 𝑘.
■
Using Theorem 10.8 one can show that even the simple function
𝐹 ∶ {0, 1}∗ → {0, 1} defined by 𝐹(𝑥) = 1 iff 𝑥 = 𝑤𝑤 for some
𝑤 ∈ {0, 1}∗ is not context free. (In contrast, the function 𝐺 ∶ {0, 1}∗ → {0, 1}
defined as 𝐺(𝑥) = 1 iff 𝑥 = 𝑤0𝑤1 ⋯ 𝑤𝑛−1𝑤𝑛−1𝑤𝑛−2 ⋯ 𝑤0 for some
𝑤 ∈ {0, 1}∗ and 𝑛 = |𝑤| is context free, can you see why?)
Solved Exercise 10.3 — Equality is not context-free. Let EQ ∶ {0, 1, ; }∗ →
{0, 1} be the function such that EQ(𝑥) = 1 if and only if 𝑥 = 𝑢; 𝑢 for
some 𝑢 ∈ {0, 1}∗. Then EQ is not context free.
■
Solution:
We use the context-free pumping lemma. Suppose towards the
sake of contradiction that there is a grammar 𝐺 that computes EQ,
and let 𝑛0 be the constant obtained from Theorem 10.8.
Consider the string 𝑥 = 1^{𝑛₀}0^{𝑛₀};1^{𝑛₀}0^{𝑛₀}, and write it as 𝑥 = 𝑎𝑏𝑐𝑑𝑒
as per Theorem 10.8, with |𝑏𝑐𝑑| ≤ 𝑛0 and with |𝑏| + |𝑑| ≥ 1. By The-
orem 10.8, it should hold that EQ(𝑎𝑐𝑒) = 1. However, by case anal-
ysis this can be shown to be a contradiction.
Firstly, unless 𝑏 is on the left side of the ; separator and 𝑑 is on
the right side, dropping 𝑏 and 𝑑 will definitely make the two parts
different. But if it is the case that 𝑏 is on the left side and 𝑑 is on the
right side, then by the condition that |𝑏𝑐𝑑| ≤ 𝑛0 we know that 𝑏 is a
string of only zeros and 𝑑 is a string of only ones. If we drop 𝑏 and
𝑑 then since one of them is non-empty, we get that there are either
fewer zeros on the left side than on the right side, or fewer
ones on the right side than on the left side. In either case, we get
that EQ(𝑎𝑐𝑒) = 0, obtaining the desired contradiction.
■
Theorem 10.9 — Emptiness for CFG's is decidable. There is an algorithm
that on input a context-free grammar 𝐺, outputs 1 if and only if Φ𝐺
is the constant zero function.
Proof Idea:
The proof is easier to see if we transform the grammar to Chomsky
Normal Form as in Theorem 10.5. Given a grammar 𝐺, we can recur-
sively define a non-terminal variable 𝑣 to be non-empty if there is either
a rule of the form 𝑣 ⇒ 𝜎, or there is a rule of the form 𝑣 ⇒ 𝑢𝑤 where
both 𝑢 and 𝑤 are non-empty. Then the grammar is non-empty if and
only if the starting variable 𝑠 is non-empty.
⋆
Proof Idea:
We prove the theorem by reducing from the Halting problem. To
do that we use the notion of configurations of NAND-TM programs, as
defined in Definition 8.8. Recall that a configuration of a program 𝑃 is a
binary string 𝑠 that encodes all the information about the program in
the current iteration.
We define Σ to be {0, 1} plus some separator characters and define
INVALID𝑃 ∶ Σ∗ → {0, 1} to be the function that maps every string 𝐿 ∈
Σ∗ to 1 if and only if 𝐿 does not encode a sequence of configurations
that correspond to a valid halting history of the computation of 𝑃 on
the empty input.
The heart of the proof is to show that INVALID𝑃 is context-free.
Once we do that, we see that 𝑃 halts on the empty input if and only if
INVALID𝑃 (𝐿) = 1 for every 𝐿. To show that, we will encode the list
in a special way that makes it amenable to deciding via a context-free
grammar. Specifically we will reverse all the odd-numbered strings.
⋆
Proof of Theorem 10.10. We only sketch the proof. We will show that if
we can compute CFGFULL then we can solve HALTONZERO, which
has been proven uncomputable in Theorem 9.9. Let 𝑀 be an input
• A halting configuration will have the value of a certain state (which
can be easily “read off” from it) set to 1.
✓ Chapter Recap
10.5 EXERCISES
Exercise 10.1 — Closure properties of context-free functions. Suppose that
𝐹, 𝐺 ∶ {0, 1}∗ → {0, 1} are context free. For each one of the following
definitions of the function 𝐻, either prove that 𝐻 is always context
free or give a counterexample for regular 𝐹 , 𝐺 that would make 𝐻 not
context free.
Exercise 10.2 Prove that the function 𝐹 ∶ {0, 1}∗ → {0, 1} such that
𝐹 (𝑥) = 1 if and only if |𝑥| is a power of two is not context free.
■
• A statement has either the form foo = bar; where foo and bar are
variables, or the form IF (foo) BEGIN ... END where ... is list
of one or more statements, potentially separated by newlines.
1. Let VAR ∶ {0, 1}∗ → {0, 1} be the function that given a string
𝑥 ∈ {0, 1}∗ , outputs 1 if and only if 𝑥 corresponds to an ASCII
encoding of a valid variable identifier. Prove that VAR is regular.
2. Let SYN ∶ {0, 1}∗ → {0, 1} be the function that given a string
𝑠 ∈ {0, 1}∗ , outputs 1 if and only if 𝑠 is an ASCII encoding of a valid
program in our language. Prove that SYN is context free. (You do
not have to specify the full formal grammar for SYN, but you need
to show that such a grammar exists.)
3. Prove that SYN is not regular. See footnote for hint.2
■
2 Hint: Try to see if you can “embed” in some way a function that looks similar to MATCHPAREN in SYN, so you can use a similar proof. Of course for a function to be non-regular, it does not need to utilize literal parentheses symbols.
11
Is every theorem provable?
Theorem 11.1 — Gödel’s Incompleteness Theorem: informal version. For
every sound proof system 𝑉 for sufficiently rich mathematical
statements, there is a mathematical statement that is true but is not
provable in 𝑉.
def f(n):
    # halts iff the "Collatz sequence" starting at n eventually reaches 1
    if n==1: return 1
    return f(3*n+1) if n % 2 else f(n//2)
Proof Idea:
If we had such a complete and sound proof system then we could
solve the HALTONZERO problem. On input a Turing machine 𝑀 , we
would in parallel run the machine on the input zero, as well as search
all purported proofs 𝑤 and output 0 if we find a proof of “𝑀 does
not halt on zero”. If the system is sound and complete then either the
machine will halt or we will eventually find such a proof, and it will
provide us with the correct output.
⋆
Proof of Theorem 11.3. Assume for the sake of contradiction that there
was such a proof system 𝑉 . We will use 𝑉 to build an algorithm 𝐴
that computes HALTONZERO, hence contradicting Theorem 9.9. Our
algorithm 𝐴 will work as follows:
the proof system is complete, there exists 𝑤 that proves this fact, and
so when Algorithm 𝐴 reaches 𝑛 = |𝑤| we will eventually find this
𝑤 and output 0. Hence under the assumption that the proof system
is complete and sound, 𝐴(𝑀 ) solves the HALTONZERO function,
yielding a contradiction.
■
R
Remark 11.5 — The Gödel statement (optional). One can
extract from the proof of Theorem 11.3 a procedure
that for every proof system 𝑉 , yields a true statement
𝑥∗ that cannot be proven in 𝑉 . But Gödel’s proof
gave a very explicit description of such a statement 𝑥∗
which is closely related to the “Liar’s paradox”. That
is, Gödel’s statement 𝑥∗ was designed to be true if and
only if ∀𝑤∈{0,1}∗ 𝑉(𝑥∗, 𝑤) = 0. In other words, it satisfied
the following property
The twin prime conjecture, that states that there is an infinite num-
ber of numbers 𝑝 such that both 𝑝 and 𝑝 + 2 are primes, can be phrased
as the quantified integer statement

∀𝑛∈ℕ ∃𝑝∈ℕ (𝑝 > 𝑛) ∧ PRIME(𝑝) ∧ PRIME(𝑝 + 2)
R
Remark 11.7 — Syntactic sugar for quantified integer
statements. To make our statements more readable,
we often use syntactic sugar and so write 𝑥 ≠ 𝑦 as
shorthand for ¬(𝑥 = 𝑦), and so on. Similarly, the
“implication operator” 𝑎 ⇒ 𝑏 is “syntactic sugar” or
shorthand for ¬𝑎 ∨ 𝑏, and the “if and only if operator”
𝑎 ⇔ 𝑏 is shorthand for (𝑎 ⇒ 𝑏) ∧ (𝑏 ⇒ 𝑎). We will
also allow ourselves the use of “macros”: plugging in
one quantified integer statement in another, as we did
with DIVIDES and PRIME above.
or
Theorem 11.9 — Uncomputability of quantified integer statements. Let
QIS ∶ {0, 1}∗ → {0, 1} be the function that given a (string rep-
resentation of) a quantified integer statement outputs 1 if it is true
and 0 if it is false. Then QIS is uncomputable.
P
Please stop here and make sure you understand
why the uncomputability of QIS (i.e., Theorem 11.9)
means that there is no sound and complete proof
system for proving quantified integer statements (i.e.,
Theorem 11.8). This follows in the same way that
Theorem 11.3 followed from the uncomputability of
HALTONZERO, but working out the details is a great
exercise (see Exercise 11.1)
In the rest of this chapter, we will show the proof of Theorem 11.8,
following the outline illustrated in Fig. 11.1.
learning tools that have revolutionized Computer Science over the last
several years.
But there are some equations that we simply do not know how to
solve by any means. For example, it took more than 200 years until peo-
ple succeeded in proving that the equation 𝑎¹¹ + 𝑏¹¹ = 𝑐¹¹ has no
solution in integers.3 The notorious difficulty of so called Diophantine
equations (i.e., finding integer roots of a polynomial) motivated the
mathematician David Hilbert in 1900 to include the question of find-
ing a general procedure for solving such equations in his famous list
of twenty-three open problems for mathematics of the 20th century. I
don’t think Hilbert doubted that such a procedure exists. After all, the
whole history of mathematics up to this point involved the discovery
of ever more powerful methods, and even impossibility results such
as the inability to trisect an angle with a straightedge and compass, or
the non-existence of an algebraic formula for quintic equations, merely
pointed out the need to use more general methods.
Alas, this turned out not to be the case for Diophantine equations.
In 1970, Yuri Matiyasevich, building on a decades long line of work by
Martin Davis, Hilary Putnam and Julia Robinson, showed that there is
simply no method to solve such equations in general:

Theorem 11.10 — MRDP Theorem. Let DIO ∶ {0, 1}∗ → {0, 1} be the
function that takes as input a string describing a 100-variable poly-
nomial with integer coefficients 𝑃(𝑥0, …, 𝑥99) and outputs 1 if and
only if there exists 𝑧0, …, 𝑧99 ∈ ℕ s.t. 𝑃(𝑧0, …, 𝑧99) = 0.
Then DIO is uncomputable.

Figure 11.2: Diophantine equations such as finding a positive integer solution to the equation 𝑎(𝑎 + 𝑏)(𝑎 + 𝑐) + 𝑏(𝑏 + 𝑎)(𝑏 + 𝑐) + 𝑐(𝑐 + 𝑎)(𝑐 + 𝑏) = 4(𝑎 + 𝑏)(𝑎 + 𝑐)(𝑏 + 𝑐) (depicted more compactly and whimsically above) can be surprisingly difficult. There are many equations for which we do not know if they have a solution, and there is no algorithm to solve them in general. The smallest solution for this equation has 80 digits! See this Quora post for more information, including the credits for this image.

3 This is a special case of what’s known as “Fermat’s Last Theorem” which states that 𝑎ⁿ + 𝑏ⁿ = 𝑐ⁿ has no solution in integers for 𝑛 > 2. This was conjectured in 1637 by Pierre de Fermat but only proven by Andrew Wiles in 1995. The case 𝑛 = 11 (along with all other so called “regular prime exponents”) was established by Kummer in 1850.
R
Remark 11.11 — Active code vs static data. The diffi-
culty in finding a way to distinguish between “code”
such as NAND-TM programs, and “static content”
such as polynomials is just another manifestation of
the phenomenon that code is the same as data. While
a fool-proof solution for distinguishing between the
two is inherently impossible, finding heuristics that do
a reasonable job keeps many firewall and anti-virus
manufacturers very busy (and finding ways to bypass
these tools keeps many hackers busy as well).
P
If you find the last sentence confusing, it is worth-
while to reread it until you are sure you follow its
logic. We are so accustomed to trying to find solu-
tions for problems that it can sometimes be hard to
follow the arguments for showing that problems are
uncomputable.
1. We will first use a reduction from the Halting problem to show that
deciding the truth of quantified mixed statements is uncomputable.
Quantified mixed statements involve both strings and integers.
Since quantified mixed statements are a more general concept than
quantified integer statements, it is easier to prove the uncomputabil-
ity of deciding their truth.
pression which is true if 𝑖 is smaller than the length of 𝑎 and the 𝑖-th
coordinate of 𝑎 is 1, and is false otherwise.
For example, the true statement that for every string 𝑎 there is a
string 𝑏 that corresponds to 𝑎 in reverse order can be phrased as the
following quantified mixed statement

∀𝑎∈{0,1}∗ ∃𝑏∈{0,1}∗ (|𝑎| = |𝑏|) ∧ (∀𝑖∈ℕ 𝑖 < |𝑎| ⇒ (𝑎𝑖 ⇔ 𝑏|𝑎|−𝑖−1)) .
Quantified mixed statements are more general than quantified
integer statements, and so the following theorem is potentially easier
to prove than Theorem 11.9:
Theorem 11.13 — Uncomputability of quantified mixed statements. Let
QMS ∶ {0, 1}∗ → {0, 1} be the function that given a (string rep-
resentation of) a quantified mixed statement outputs 1 if it is true
and 0 if it is false. Then QMS is uncomputable.
Proof Idea:
The idea behind the proof is similar to that used in showing that
one-dimensional cellular automata are Turing complete (Theorem 8.7)
as well as showing that equivalence (or even “fullness”) of context
free grammars is uncomputable (Theorem 10.10). We use the notion
of a configuration of a NAND-TM program as in Definition 8.8. Such
a configuration can be thought of as a string 𝛼 over some large-but-
finite alphabet Σ describing its current state, including the values
of all arrays, scalars, and the index variable i. It can be shown that
if 𝛼 is the configuration at a certain step of the execution and 𝛽 is
the configuration at the next step, then 𝛽𝑗 = 𝛼𝑗 for all 𝑗 outside of
{𝑖 − 1, 𝑖, 𝑖 + 1} where 𝑖 is the value of i. In particular, every value 𝛽𝑗 is simply a function of 𝛼𝑗−1 , 𝛼𝑗 , 𝛼𝑗+1 . Using these observations we can write a quantified mixed statement NEXT(𝛼, 𝛽) that will be true if and only if 𝛽 is the configuration encoding the next step after 𝛼. Since a program 𝑃 halts on input 𝑥 if and only if there is a sequence of configurations 𝛼0 , … , 𝛼𝑡−1 (known as a computation history) starting with the initial configuration with input 𝑥 and ending in a halting configuration, we can define a quantified mixed statement to determine if there is such a sequence by taking an existential quantifier over all strings 𝐻 (for history) that encode a tuple (𝛼0 , 𝛼1 , … , 𝛼𝑡−1 ) and then checking that 𝛼0 and 𝛼𝑡−1 are valid starting and halting configurations, and that NEXT(𝛼𝑗 , 𝛼𝑗+1 ) is true for every 𝑗 ∈ {0, … , 𝑡 − 2}.
⋆
2. Using the above we can now write the condition that for every substring of 𝐻 that has the form 𝛼 ENC(;) 𝛽 with 𝛼, 𝛽 ∈ {0, 1}^ℓ and ENC(;) being the encoding of the separator ";", it holds that NEXT(𝛼, 𝛽) is true.
R
Remark 11.14 — Alternative proofs. There are sev-
eral other ways to show that QMS is uncomputable.
For example, we can express the condition that a 1-
dimensional cellular automaton eventually writes a
“1” to a given cell from a given initial configuration
as a quantified mixed statement over a string encod-
ing the history of all configurations. We can then use
the fact that cellular automata can simulate Turing machines (Theorem 8.7) to reduce the halting problem to QMS. We can also use other well known uncomputable problems such as tiling or the Post Correspondence Problem. Exercise 11.5 and Exercise 11.6 explore two alternative proofs of Theorem 11.13.
𝜉 that does not use string-valued variables such that 𝜑 is true if and only if 𝜉 is true.
To remove string-valued variables from a statement, we encode every string by a pair of integers. We will show that we can encode a string 𝑥 ∈ {0, 1}∗ by a pair of numbers (𝑋, 𝑛) ∈ ℕ s.t.
• 𝑛 = |𝑥|
• there is a quantified integer statement COORD(𝑋, 𝑖) that is true if and only if 𝑖 < 𝑛 and the 𝑖-th coordinate of 𝑥 is 1.
This will mean that we can replace a “for all” quantifier over strings
such as ∀𝑥∈{0,1}∗ with a pair of quantifiers over integers of the form
∀𝑋∈ℕ ∀𝑛∈ℕ (and similarly replace an existential quantifier of the form ∃𝑥∈{0,1}∗ with a pair of quantifiers ∃𝑋∈ℕ ∃𝑛∈ℕ ). We can then replace all
calls to |𝑥| by 𝑛 and all calls to 𝑥𝑖 by COORD(𝑋, 𝑖). This means that
if we are able to define COORD via a quantified integer statement,
then we obtain a proof of Theorem 11.9, since we can use it to map
every mixed quantified statement 𝜑 to an equivalent quantified inte-
ger statement 𝜉 such that 𝜉 is true if and only if 𝜑 is true, and hence
QMS(𝜑) = QIS(𝜉). Such a procedure implies that the task of comput-
ing QMS reduces to the task of computing QIS, which means that the
uncomputability of QMS implies the uncomputability of QIS.
The above shows that the proof of Theorem 11.9 boils down to finding the right encoding of strings as integers, and the right way to implement COORD as a quantified integer statement. To achieve this we use the following technical result:
Lemma 11.15 — Constructible prime sequence. There is a sequence of prime numbers 𝑝0 < 𝑝1 < 𝑝2 < ⋯ such that there is a quantified integer statement PSEQ(𝑝, 𝑖) that is true if and only if 𝑝 = 𝑝𝑖 .
Using Lemma 11.15 we can encode a string 𝑥 ∈ {0, 1}∗ by the numbers (𝑋, 𝑛) where 𝑋 = ∏𝑖∶𝑥𝑖=1 𝑝𝑖 and 𝑛 = |𝑥|. We can then define the statement COORD(𝑋, 𝑖) as ∃𝑝∈ℕ PSEQ(𝑝, 𝑖) ∧ DIVIDES(𝑝, 𝑋), where DIVIDES(𝑝, 𝑋) = ∃𝑐∈ℕ 𝑋 = 𝑝 × 𝑐 is the quantified integer statement asserting that 𝑝 divides 𝑋.
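To make the encoding concrete, here is a short Python sketch (ours, not the book's) that encodes a string as the pair (𝑋, 𝑛) and recovers each bit by testing divisibility by the 𝑖-th prime, mirroring the COORD statement:

def primes(k):
    # return the first k primes by naive trial division
    found = []
    candidate = 2
    while len(found) < k:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def encode(x):
    # X is the product of p_i over the coordinates i where x_i = 1
    ps = primes(len(x))
    X = 1
    for i, bit in enumerate(x):
        if bit == "1":
            X *= ps[i]
    return X, len(x)

def coord(X, i, n):
    # True iff the i-th bit of the encoded string is 1, i.e. p_i divides X
    return X % primes(n)[i] == 0

X, n = encode("10110")
assert "".join("1" if coord(X, i, n) else "0" for i in range(n)) == "10110"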
✓ Chapter Recap
11.6 EXERCISES
Exercise 11.1 — Gödel’s Theorem from uncomputability of 𝑄𝐼𝑆 . Prove Theorem 11.8 using Theorem 11.9.
■
Exercise 11.2 — Proof systems and uncomputability. Let FINDPROOF ∶ {0, 1}∗ → {0, 1} be the following function. On input a Turing machine 𝑉 (which we think of as the verifying algorithm for a proof system) and a string 𝑥 ∈ {0, 1}∗ , FINDPROOF(𝑉 , 𝑥) = 1 if and only if there exists 𝑤 ∈ {0, 1}∗ such that 𝑉 (𝑥, 𝑤) = 1.
Exercise 11.3 — Expression for floor. Let FSQRT(𝑛, 𝑚) = ∀𝑗∈ℕ ((𝑗 × 𝑗) > 𝑚) ∨ (𝑗 ≤ 𝑛). Prove that FSQRT(𝑛, 𝑚) is true if and only if 𝑛 = ⌊√𝑚⌋.
■
Exercise 11.5 — Post correspondence problem. In the Post Correspondence Problem (PCP) the input is a set 𝑆 of pairs of strings (𝛼, 𝛽), and PCP(𝑆) = 1 if and only if there is a list (𝛼0 , 𝛽 0 ), … , (𝛼𝑚−1 , 𝛽 𝑚−1 ) of pairs from 𝑆 (repetitions allowed) such that
𝛼0 𝛼1 ⋯ 𝛼𝑚−1 = 𝛽 0 𝛽 1 ⋯ 𝛽 𝑚−1 .
(We can think of each pair (𝛼, 𝛽) ∈ 𝑆 as a "domino tile" and the question is whether we can stack a list of such tiles so that the top and the bottom yield the same string.) It can be shown that PCP is uncomputable by a fairly straightforward though somewhat tedious proof (see for example the Wikipedia page for the Post Correspondence Problem or Section 5.2 in [Sip97]).
Use this fact to provide a direct proof that QMS is uncomputable by showing that there exists a computable map 𝑅 ∶ {0, 1}∗ → {0, 1}∗ such that PCP(𝑆) = QMS(𝑅(𝑆)) for every string 𝑆 encoding an instance of the Post Correspondence Problem.
■
Figure 11.3: In the puzzle problem, the input can be thought of as a finite collection Σ of types of puzzle pieces and the goal is to find out whether or not there is a way to arrange pieces from these types in a rectangle. Formally, we model the input as a pair of functions 𝑚𝑎𝑡𝑐ℎ↔ , 𝑚𝑎𝑡𝑐ℎ↕ ∶ Σ2 → {0, 1} such that 𝑚𝑎𝑡𝑐ℎ↔ (𝑙𝑒𝑓𝑡, 𝑟𝑖𝑔ℎ𝑡) = 1 (respectively 𝑚𝑎𝑡𝑐ℎ↕ (𝑢𝑝, 𝑑𝑜𝑤𝑛) = 1) if the pair of pieces are compatible when placed in their respective positions. We assume Σ contains a special symbol ∅ corresponding to having no piece, and an arrangement of puzzle pieces in an (𝑚 − 2) × (𝑛 − 2) rectangle is modeled by a string 𝑥 ∈ Σ𝑚⋅𝑛 whose "outer coordinates" are ∅ and such that for every 𝑖 ∈ [𝑛 − 1], 𝑗 ∈ [𝑚 − 1], 𝑚𝑎𝑡𝑐ℎ↕ (𝑥𝑖,𝑗 , 𝑥𝑖+1,𝑗 ) = 1 and 𝑚𝑎𝑡𝑐ℎ↔ (𝑥𝑖,𝑗 , 𝑥𝑖,𝑗+1 ) = 1.
Exercise 11.6 — Uncomputability of puzzle. Let PUZZLE ∶ {0, 1}∗ → {0, 1} be the problem of determining, given a finite collection of types of "puzzle pieces", whether it is possible to put them together in a rectangle, see Fig. 11.3. Formally, we think of such a collection as a finite set Σ (see Fig. 11.3). We model the criteria as to which pieces "fit together" by a pair of finite functions 𝑚𝑎𝑡𝑐ℎ↕ , 𝑚𝑎𝑡𝑐ℎ↔ ∶ Σ2 → {0, 1} such that a piece 𝑎 fits above a piece 𝑏 if and only if 𝑚𝑎𝑡𝑐ℎ↕ (𝑎, 𝑏) = 1 and a piece 𝑐 fits to the left of a piece 𝑑 if and only if 𝑚𝑎𝑡𝑐ℎ↔ (𝑐, 𝑑) = 1. To model the "straight edge" pieces that can be placed next to a "blank spot" we assume that Σ contains the symbol ∅ and the matching functions are defined accordingly. A square tiling of Σ is an 𝑚 × 𝑛 string 𝑥 ∈ Σ𝑚𝑛 such that for every 𝑖 ∈ {1, … , 𝑚 − 2} and 𝑗 ∈ {1, … , 𝑛 − 2}, 𝑚𝑎𝑡𝑐ℎ↕ (𝑥𝑖−1,𝑗 , 𝑥𝑖,𝑗 ) = 𝑚𝑎𝑡𝑐ℎ↕ (𝑥𝑖,𝑗 , 𝑥𝑖+1,𝑗 ) = 𝑚𝑎𝑡𝑐ℎ↔ (𝑥𝑖,𝑗−1 , 𝑥𝑖,𝑗 ) = 𝑚𝑎𝑡𝑐ℎ↔ (𝑥𝑖,𝑗 , 𝑥𝑖,𝑗+1 ) = 1 (i.e., every "internal piece" fits in with the pieces adjacent to it). We also require that all of the "outer pieces" (i.e., 𝑥𝑖,𝑗 where 𝑖 ∈ {0, 𝑚 − 1} or 𝑗 ∈ {0, 𝑛 − 1}) are "blank" or equal to ∅. The function PUZZLE takes as input a string describing the set Σ and the functions 𝑚𝑎𝑡𝑐ℎ↕ , 𝑚𝑎𝑡𝑐ℎ↔ and outputs 1 if and only if there is some square tiling of Σ: some not-all-blank string 𝑥 ∈ Σ𝑚𝑛 satisfying the above condition.
Exercise 11.7 — MRDP exercise. The MRDP theorem states that the problem of determining, given a 𝑘-variable polynomial 𝑝 with integer coefficients, whether there exist integers 𝑥0 , … , 𝑥𝑘−1 such that 𝑝(𝑥0 , … , 𝑥𝑘−1 ) = 0 is uncomputable. Consider the following quadratic integer equation problem: the input is a list of polynomials 𝑝0 , … , 𝑝𝑚−1 over 𝑘 variables with integer coefficients, where each of the polynomials is of degree at most two (i.e., it is a quadratic function). The goal is to determine whether there exist integers 𝑥0 , … , 𝑥𝑘−1 that solve the equations 𝑝0 (𝑥) = ⋯ = 𝑝𝑚−1 (𝑥) = 0.
Use the MRDP Theorem to prove that this problem is uncomputable. That is, show that the function QUADINTEQ ∶ {0, 1}∗ → {0, 1} is uncomputable, where this function gets as input a string describing the polynomials 𝑝0 , … , 𝑝𝑚−1 (each with integer coefficients and degree at most two), and outputs 1 if and only if there exist 𝑥0 , … , 𝑥𝑘−1 ∈ ℤ such that for every 𝑖 ∈ [𝑚], 𝑝𝑖 (𝑥0 , … , 𝑥𝑘−1 ) = 0. See footnote for hint.5
■
5 You can replace the equation 𝑦 = 𝑥⁴ with the pair of equations 𝑦 = 𝑧² and 𝑧 = 𝑥². Also, you can replace the equation 𝑤 = 𝑥⁶ with the three equations 𝑤 = 𝑦𝑢, 𝑦 = 𝑥⁴ and 𝑢 = 𝑥².
2. Let TOWER(𝑛) denote the number 2^{2^{⋯^2}} with 𝑛 twos (that is, a "tower of powers of two" of height 𝑛). To get a sense of how fast this function grows, TOWER(1) = 2, TOWER(2) = 2² = 4, TOWER(3) = 2^{2²} = 16, and TOWER(4) = 2^16 = 65536.
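A short Python sketch (our illustration, not from the book) makes the growth vivid:

def tower(n):
    # 2^2^...^2 with n twos; by convention tower(0) = 1
    result = 1
    for _ in range(n):
        result = 2 ** result
    return result

# tower(1)=2, tower(2)=4, tower(3)=16, tower(4)=65536, and
# tower(5) = 2**65536 already has 19729 decimal digits.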
12
Efficient computation: an informal introduction
“For practical purposes, the difference between algebraic and exponential order is often more crucial than the difference between finite and non-finite.”, Jack Edmonds, “Paths, Trees, and Flowers”, 1963
“What is the most efficient way to sort a million 32-bit integers?”, Eric Schmidt to Barack Obama, 2008
“I think the bubble sort would be the wrong way to go.”, Barack Obama.
• “Is there a function that can be computed in 𝑂(𝑛2 ) time but not in
𝑂(𝑛) time?”
• “Are there natural problems for which the best algorithm (and not
just the best known) requires 2Ω(𝑛) time?”
While the difference between 𝑂(𝑛) and 𝑂(𝑛²) time can be crucial in practice, in this book we focus on the even bigger difference between polynomial and exponential running time. As we will see, the difference between polynomial and exponential time is typically insensitive to the choice of the particular computational model: a polynomial-time algorithm is still polynomial whether you use Turing machines, RAM machines, or a parallel cluster as your model of computation, and similarly an exponential-time algorithm will remain exponential in all of these platforms. One of the interesting phenomena of computing is that there is often a kind of a "threshold phenomenon" or "zero-one law" for running time. Many natural problems can either be solved in polynomial running time with a not-too-large exponent (e.g., something like 𝑂(𝑛²) or 𝑂(𝑛³)), or require exponential (e.g., at least 2^{Ω(𝑛)} or 2^{Ω(√𝑛)}) time to solve. The reasons for this phenomenon are still not fully understood, but some light on it is shed by the concept of NP completeness, which we will see in Chapter 15.
This chapter is merely a tiny sample of the landscape of computa-
tional problems and efficient algorithms. If you want to explore the
field of algorithms and data structures more deeply (which I very
much hope you do!), the bibliographical notes contain references to
some excellent texts, some of which are available freely on the web.
R
Remark 12.1 — Relations between parts of this book.
Part I of this book contained a quantitative study of
computation of finite functions. We asked what are
the resources (in terms of gates of Boolean circuits or
lines in straight-line programs) required to compute
various finite functions.
Part II of the book contained a qualitative study of
computation of infinite functions (i.e., functions of
unbounded input length). In that part we asked the
qualitative question of whether or not a function is com-
putable at all, regardless of the number of operations.
Part III of the book, beginning with this chapter,
merges the two approaches and contains a quantitative
study of computation of infinite functions. In this part we ask how the resources required for computing a function scale with the length of the input. In Chapter 13 we
define the notion of running time, and the class P of
functions that can be computed using a number of
steps that scales polynomially with the input length.
In Section 13.6 we will relate this class to the models
of Boolean circuits and straight-line programs that we
studied in Part I.
R
Remark 12.3 — On data structures. If you’ve ever taken
an algorithms course, you have probably encountered
many data structures such as lists, arrays, queues,
stacks, heaps, search trees, hash tables and many
more. Data structures are extremely important in com-
puter science, and each one of those offers different
tradeoffs between overhead in storage, operations
supported, cost in time for each operation, and more.
For example, if we store 𝑛 items in a list, we will need a linear (i.e., 𝑂(𝑛) time) scan to retrieve an element, while we can achieve the same operation in 𝑂(1) time if we use a hash table. However, when we only care about polynomial-time algorithms, such factors of 𝑂(𝑛) in the running time will not make much difference.
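As a quick illustration (our own sketch, not from the book), compare membership tests in a Python list, which take a linear scan, with a set (a hash table), which answers in expected constant time:

import timeit

n = 10**5
items = list(range(n))
table = set(items)

# looking up the last element: the list scans all n entries,
# while the hash table jumps straight to it
print(timeit.timeit(lambda: (n - 1) in items, number=100))  # roughly n times slower
print(timeit.timeit(lambda: (n - 1) in table, number=100))

Either way the lookup takes polynomial time, which is why such distinctions do not affect the class of polynomial-time algorithms.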
∑𝑒∋𝑠 𝑥𝑒 + ∑𝑒∋𝑡 𝑥𝑒 = 0
−1 ≤ 𝑥𝑒 ≤ 1 for every 𝑒 ∈ 𝐸,
where for every vertex 𝑣, summing over 𝑒 ∋ 𝑣 means summing over all the edges that touch 𝑣.
The maximum flow problem can be thought of as the task of max-
imizing ∑𝑒∋𝑠 𝑥𝑒 over all the vectors 𝑥 ∈ ℝ𝑚 that satisfy the above
conditions (12.1). Maximizing a linear function ℓ(𝑥) over the set of
𝑥 ∈ ℝ𝑚 that satisfy certain linear equalities and inequalities is known
as linear programming. Luckily, there are polynomial-time algorithms
for solving linear programming, and hence we can solve the maxi-
mum flow (and so, equivalently, minimum cut) problem in polyno-
mial time. In fact, there are much better algorithms for maximum-flow/minimum-cut, even for weighted directed graphs, with the record currently standing at 𝑂(min{𝑚^{10/7}, 𝑚√𝑛}) time.
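As a small illustration (ours; it assumes the networkx package, which the book does not use), here is a maximum flow computed in polynomial time on a toy graph:

import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "a", capacity=3)
G.add_edge("s", "b", capacity=2)
G.add_edge("a", "b", capacity=1)
G.add_edge("a", "t", capacity=2)
G.add_edge("b", "t", capacity=3)

# polynomial-time maximum flow, which by duality also yields a minimum s,t cut
flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
print(flow_value)  # 5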
Solved Exercise 12.1 — Global minimum cut. Given a graph 𝐺 = (𝑉 , 𝐸), define the global minimum cut of 𝐺 to be the minimum over all 𝑆 ⊆ 𝑉 with 𝑆 ≠ ∅ and 𝑆 ≠ 𝑉 of the number of edges cut by 𝑆. Prove that there is a polynomial-time algorithm to compute the global minimum cut of a graph.
■
Solution:
By the above we know that there is a polynomial-time algorithm 𝐴 that on input (𝐺, 𝑠, 𝑡) finds the minimum 𝑠, 𝑡 cut in the graph 𝐺. We can therefore fix an arbitrary vertex 𝑠 and run 𝐴 on (𝐺, 𝑠, 𝑡) for each of the 𝑛 − 1 choices of 𝑡 ≠ 𝑠, outputting the smallest cut found. This takes polynomial time, and it is correct because the optimal set 𝑆 must separate 𝑠 from some vertex 𝑡, in which case the minimum 𝑠, 𝑡 cut is at most the global minimum cut.
where 𝐿 is some loss function measuring how far the predicted label ℎ(𝑥𝑖 ) is from the true label 𝑦𝑖 . When 𝐿 is the square loss function 𝐿(𝑦, 𝑦′ ) = (𝑦 − 𝑦′ )² and ℎ is a linear function, empirical risk minimization corresponds to the well-known convex minimization task of linear regression. In other cases, when the task is non-convex, there can be many global or local minima. That said, even if we don't find the global (or even a local) minimum, this continuous embedding can still help us, in particular when running a local improvement algorithm such as gradient descent.
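Here is a minimal sketch of the convex special case (ours, with numpy's least-squares routine standing in for the generic ERM procedure):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 examples with 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# minimize the empirical square loss sum_i (<w, x_i> - y_i)^2 over linear h
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w_hat, 2))       # approximately recovers w_true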
12.2.1 SAT
A propositional formula 𝜑 involves 𝑛 variables 𝑥1 , … , 𝑥𝑛 and the logical operators AND (∧), OR (∨), and NOT (¬). We say that such a formula is in conjunctive normal form (CNF for short) if it is an AND of ORs of variables or their negations (we call a term of the form 𝑥𝑖 or ¬𝑥𝑖 a literal). For example, the following is a CNF formula: (𝑥1 ∨ ¬𝑥2 ∨ 𝑥3 ) ∧ (¬𝑥1 ∨ 𝑥2 ∨ 𝑥4 ).
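For concreteness, here is our own brute-force (exponential-time) satisfiability check, representing a CNF as a list of clauses, each a list of (variable, sign) pairs:

from itertools import product

def satisfiable(n, clauses):
    # try all 2^n assignments; a clause is satisfied if some literal is true
    for assignment in product([False, True], repeat=n):
        if all(any(assignment[v] == sign for v, sign in clause) for clause in clauses):
            return True
    return False

# (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x4), numbering variables from 0
cnf = [[(0, True), (1, False), (2, True)], [(0, False), (1, True), (3, True)]]
print(satisfiable(4, cnf))  # True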
R
Remark 12.4 — Bit complexity of numbers. Whenever we discuss problems whose inputs correspond to numbers, the input length corresponds to how many bits are needed to describe the number (or, as is equivalent up to a constant factor, the number of digits in its decimal representation).
det(𝐴) = ∑𝜋∈𝑆𝑛 sign(𝜋) ∏𝑖∈[𝑛] 𝐴𝑖,𝜋(𝑖) ,
where 𝑆𝑛 is the set of all permutations from [𝑛] to [𝑛] and the sign of a permutation 𝜋 is equal to −1 raised to the power of the number of inversions in 𝜋 (pairs 𝑖, 𝑗 such that 𝑖 > 𝑗 but 𝜋(𝑖) < 𝜋(𝑗)).
This definition suggests that computing det(𝐴) might require summing over |𝑆𝑛 | terms which would take exponential time since |𝑆𝑛 | = 𝑛! > 2^𝑛 . However, there are other ways to compute the determinant. For example, it is known that det is the only function that satisfies the following conditions: it is multilinear in the rows of the matrix, it is alternating (it vanishes whenever two rows are equal), and it maps the identity matrix to 1.
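The contrast is easy to see in code (our sketch): the function below implements the 𝑛!-term sum directly, while numpy's np.linalg.det uses Gaussian elimination and needs only 𝑂(𝑛³) arithmetic operations.

import numpy as np
from itertools import permutations

def det_leibniz(A):
    # exponential-time determinant: sum over all n! permutations
    n = len(A)
    total = 0.0
    for perm in permutations(range(n)):
        inversions = sum(perm[i] > perm[j] for i in range(n) for j in range(i + 1, n))
        term = (-1.0) ** inversions
        for i in range(n):
            term *= A[i][perm[i]]
        total += term
    return total

A = np.random.default_rng(1).normal(size=(5, 5))
print(det_leibniz(A), np.linalg.det(A))  # agree up to floating-point error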
✓ Chapter Recap
12.5 EXERCISES
Exercise 12.1 — Exponential time algorithm for longest path. The naive algorithm for computing the longest path in a given graph could take more than 𝑛! steps. Give a 𝑝𝑜𝑙𝑦(𝑛)2^𝑛 time algorithm for the longest path problem in 𝑛-vertex graphs.2
■
2 Hint: Use dynamic programming to compute for every 𝑠, 𝑡 ∈ [𝑛] and 𝑆 ⊆ [𝑛] the value 𝑃 (𝑠, 𝑡, 𝑆) which equals 1 if there is a simple path from 𝑠 to 𝑡 that uses exactly the vertices in 𝑆. Do this iteratively for 𝑆’s of growing sizes.
Exercise 12.2 — 2SAT algorithm. For every 2CNF 𝜑, define the graph 𝐺𝜑 on 2𝑛 vertices corresponding to the literals 𝑥1 , … , 𝑥𝑛 , ¬𝑥1 , … , ¬𝑥𝑛 , such that there is a directed edge ℓ𝑖 → ℓ𝑗 iff the constraint ¬ℓ𝑖 ∨ ℓ𝑗 is in 𝜑. Prove that 𝜑 is unsatisfiable if and only if there is some 𝑖 such that there is a path from 𝑥𝑖 to ¬𝑥𝑖 and from ¬𝑥𝑖 to 𝑥𝑖 in 𝐺𝜑 . Show how to use this to solve 2SAT in polynomial time.
■
13
Modeling running time
• The class P of functions that can be computed in 𝑂(𝑛^𝑘 ) time for some constant 𝑘.
• The class P/poly of non-uniform computation and the result that P ⊆ P/poly
Max Newman: It is all very well to say that a machine could … do this or
that, but … what about the time it would take to do it?
Alan Turing: To my mind this time factor is the one question which will
involve all the real technical difficulty.
BBC radio panel on “Can automatic Calculating Machines Be Said to
Think?”, 1952
Definition 13.1 — Running time (Turing Machines). Let 𝑇 ∶ ℕ → ℕ be some function mapping natural numbers to natural numbers. We say that a function 𝐹 ∶ {0, 1}∗ → {0, 1}∗ is computable in 𝑇 (𝑛) Turing-Machine time (TM-time for short) if there exists a Turing machine 𝑀 such that for every sufficiently large 𝑛 and every 𝑥 ∈ {0, 1}^𝑛 , when given input 𝑥, the machine 𝑀 halts after executing at most 𝑇 (𝑛) steps and outputs 𝐹 (𝑥).
We define TIMETM (𝑇 (𝑛)) to be the set of Boolean functions (functions mapping {0, 1}∗ to {0, 1}) that are computable in 𝑇 (𝑛) TM time.
P
Definition 13.1 is not very complicated but is one of
the most important definitions of this book. As usual,
TIMETM (𝑇 (𝑛)) is a class of functions, not of machines. If
𝑀 is a Turing machine then a statement such as “𝑀
is a member of TIMETM (𝑛2 )” does not make sense.
The concept of TM-time as defined here is sometimes
known as “single-tape Turing machine time” in the
literature, since some texts consider Turing machines
with more than one working tape.
Solution:
The proof is illustrated in Fig. 13.2. Suppose that 𝐹 ∈ TIMETM (10 ⋅ 𝑛³) and hence there exists some number 𝑁0 and a machine 𝑀 such that for every 𝑛 > 𝑁0 and 𝑥 ∈ {0, 1}^𝑛 , 𝑀 (𝑥) outputs 𝐹 (𝑥) within at most 10 ⋅ 𝑛³ steps. Since 10 ⋅ 𝑛³ = 𝑜(2^𝑛 ), there is some number 𝑁1 such that for every 𝑛 > 𝑁1 , 10 ⋅ 𝑛³ < 2^𝑛 . Hence for every 𝑛 > max{𝑁0 , 𝑁1 }, 𝑀 (𝑥) will output 𝐹 (𝑥) within at most 2^𝑛 steps, demonstrating that 𝐹 ∈ TIMETM (2^𝑛 ).
■
Figure 13.2: Comparing 𝑇 (𝑛) = 10𝑛³ with 𝑇 ′ (𝑛) = 2^𝑛 (on the right figure the Y axis is in log scale). Since for every large enough 𝑛, 𝑇 ′ (𝑛) ≥ 𝑇 (𝑛), TIMETM (𝑇 (𝑛)) ⊆ TIMETM (𝑇 ′ (𝑛)).
P
Please take the time to make sure you understand
these definitions. In particular, sometimes students
think of the class EXP as corresponding to functions
that are not in P. However, this is not the case. If 𝐹 is
in EXP then it can be computed in exponential time.
This does not mean that it cannot be computed in
polynomial time as well.
Solution:
To show these two sets are equal we need to show that P ⊆
∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ) and ∪𝑐∈{1,2,3,…} TIMETM (𝑛𝑐 ) ⊆ P. We start
with the former inclusion. Suppose that 𝐹 ∈ P. Then there is some
polynomial 𝑝 ∶ ℕ → ℝ and a Turing machine 𝑀 such that 𝑀
computes 𝐹 and 𝑀 halts on every input 𝑥 within at most 𝑝(|𝑥|)
steps. We can write the polynomial 𝑝 ∶ ℕ → ℝ in the form 𝑝(𝑛) = ∑_{𝑖=0}^{𝑑} 𝑎𝑖 𝑛^𝑖 where 𝑎0 , … , 𝑎𝑑 ∈ ℝ, and we assume that 𝑎𝑑
is non-zero (or otherwise we just let 𝑑 correspond to the largest
number such that 𝑎𝑑 is non-zero). The degree of 𝑝 is the number 𝑑.
Since 𝑛^𝑑 = 𝑜(𝑛^{𝑑+1} ), no matter what the coefficient 𝑎𝑑 is, for large enough 𝑛, 𝑝(𝑛) < 𝑛^{𝑑+1} which means that the Turing machine 𝑀 will halt on inputs of length 𝑛 within fewer than 𝑛^{𝑑+1} steps, and hence 𝐹 ∈ TIMETM (𝑛^{𝑑+1} ) ⊆ ∪𝑐∈{1,2,3,…} TIMETM (𝑛^𝑐 ).
For the second inclusion, suppose that 𝐹 ∈ ∪𝑐∈{1,2,3,…} TIMETM (𝑛^𝑐 ). Then there is some positive 𝑐 ∈ ℕ such that 𝐹 ∈ TIMETM (𝑛^𝑐 ), which means that there is a Turing machine 𝑀 and some number 𝑁0 such that 𝑀 computes 𝐹 and for every 𝑛 > 𝑁0 , 𝑀 halts on length-𝑛 inputs within at most 𝑛^𝑐 steps; since 𝑛^𝑐 is a polynomial, this means that 𝐹 ∈ P.
Table: A table of the examples from Chapter 12. All these problems are in EXP but only the ones in the left column are currently known to be in P as well (i.e., they have a polynomial-time algorithm). See also Fig. 13.3.
R
Remark 13.3 — Boolean versions of problems. Many of the problems defined in Chapter 12 correspond to non-Boolean functions (functions with more than one bit of output) while P and EXP are sets of Boolean functions. However, for every non-Boolean function 𝐹 we can always define a computationally-equivalent Boolean function 𝐺 by letting 𝐺(𝑥, 𝑖) be the 𝑖-th bit of 𝐹 (𝑥) (see Exercise 13.3). Hence the table above, as well as Fig. 13.3, refer to the computationally-equivalent Boolean variants of these problems.
Figure 13.3: Some examples of problems that are known to be in P and problems that are known to be in EXP but not known whether or not they are in P. Since both P and EXP are classes of Boolean functions, in this figure we always refer to the Boolean (i.e., Yes/No) variant of the problems.
13.2 MODELING RUNNING TIME USING RAM MACHINES / NAND-RAM
Turing machines are a clean theoretical model of computation, but
do not closely correspond to real-world computing architectures. The
discrepancy between Turing machines and actual computers does
not matter much when we consider the question of which functions
are computable, but can make a difference in the context of efficiency.
Theorem 13.5 — Relating RAM and Turing machines. Let 𝑇 ∶ ℕ → ℕ be a function such that 𝑇 (𝑛) ≥ 𝑛 for every 𝑛 and the map 𝑛 ↦ 𝑇 (𝑛) can be computed by a Turing machine in time 𝑂(𝑇 (𝑛)). Then
TIMETM (𝑇 (𝑛)) ⊆ TIMERAM (10 ⋅ 𝑇 (𝑛)) ⊆ TIMETM (𝑇 (𝑛)^4 ).   (13.1)
P
The technical details of Theorem 13.5, such as the con-
dition that 𝑛 ↦ 𝑇 (𝑛) is computable in 𝑂(𝑇 (𝑛)) time
or the constants 10 and 4 in (13.1) (which are not tight
and can be improved), are not very important. In par-
ticular, all non-pathological time bound functions we
encounter in practice such as 𝑇 (𝑛) = 𝑛, 𝑇 (𝑛) = 𝑛 log 𝑛,
𝑇 (𝑛) = 2𝑛 etc. will satisfy the conditions of Theo-
rem 13.5, see also Remark 13.6.
The main message of the theorem is that Turing machines and RAM machines are "roughly equivalent" in the sense that one can simulate the other with polynomial overhead.
Figure 13.4: The proof of Theorem 13.5 shows that we can simulate 𝑇 steps of a Turing machine with 𝑇 steps of a NAND-RAM program, and can simulate 𝑇 steps of a NAND-RAM program with 𝑜(𝑇^4 ) steps of a Turing machine. Hence TIMETM (𝑇 (𝑛)) ⊆ TIMERAM (10 ⋅ 𝑇 (𝑛)) ⊆ TIMETM (𝑇 (𝑛)^4 ).
That is, we could have equally well defined P as the class of functions computable by NAND-RAM programs (instead of Turing machines) that run in time polynomial in the length of the input. Similarly, by instantiating Theorem 13.5 with 𝑇 (𝑛) = 2^{𝑛^𝑎} we see that the class EXP can also be defined as the set of functions computable by NAND-RAM programs in time at most 2^{𝑝(𝑛)} where 𝑝 is some polynomial. Similar
equivalence results are known for many models including cellular
automata, C/Python/Javascript programs, parallel computers, and a
great many other models, which justifies the choice of P as capturing
a technology-independent notion of tractability. (See Section 13.3
for more discussion of this issue.) This equivalence between Turing
machines and NAND-RAM (as well as other models) allows us to
pick our favorite model depending on the task at hand (i.e., “have our
cake and eat it too”) even when we study questions of efficiency, as
long as we only care about the gap between polynomial and exponential
time. When we want to design an algorithm, we can use the extra
power and convenience afforded by NAND-RAM. When we want
to analyze a program or prove a negative result, we can restrict our
attention to Turing machines.
Proof Idea:
The direction TIMETM (𝑇 (𝑛)) ⊆ TIMERAM (10 ⋅ 𝑇 (𝑛)) is not hard to show, since a NAND-RAM program 𝑃 can simulate a Turing machine 𝑀 with constant overhead by storing the transition table of 𝑀 in an array and keeping track of 𝑀 's state, head position, and tape in its variables and memory.
The total cost for each such operation is 𝑂(𝑇 (𝑛)2 +𝑇 (𝑛)𝑝𝑜𝑙𝑦(log 𝑇 (𝑛))) =
𝑂(𝑇 (𝑛)2 ) steps.
In sum, we simulate a single step of NAND-RAM using
𝑂(𝑇 (𝑛)2 𝑝𝑜𝑙𝑦(log 𝑇 (𝑛))) steps of NAND-TM, and hence the total
simulation time is 𝑂(𝑇 (𝑛)3 𝑝𝑜𝑙𝑦(log 𝑇 (𝑛))) which is smaller than 𝑇 (𝑛)4
for sufficiently large 𝑛.
■
R
Remark 13.6 — Nice time bounds. When considering
general time bounds we need to make sure to rule
out some “pathological” cases such as functions 𝑇
that don’t give enough time for the algorithm to read
the input, or functions where the time bound itself is
uncomputable. We say that a function 𝑇 ∶ ℕ → ℕ is
a nice time bound function (or nice function for short)
if for every 𝑛 ∈ ℕ, 𝑇 (𝑛) ≥ 𝑛 (i.e., 𝑇 allows enough
time to read the input), for every 𝑛′ ≥ 𝑛, 𝑇 (𝑛′ ) ≥ 𝑇 (𝑛)
(i.e., 𝑇 allows more time on longer inputs), and the map 𝐹 (𝑥) = 1^{𝑇 (|𝑥|)} (i.e., mapping a string of length 𝑛 to a sequence of 𝑇 (𝑛) ones) can be computed by a NAND-RAM program in 𝑂(𝑇 (𝑛)) time.
All the "normal" time complexity bounds we encounter in applications such as 𝑇 (𝑛) = 100𝑛, 𝑇 (𝑛) = 𝑛² log 𝑛, 𝑇 (𝑛) = 2^{√𝑛} , etc. are "nice".
Hence from now on we will only care about the
class TIME(𝑇 (𝑛)) when 𝑇 is a “nice” function. The
computability condition is in particular typically easily
satisfied. For example, for arithmetic functions such
as 𝑇 (𝑛) = 𝑛3 , we can typically compute the binary
representation of 𝑇 (𝑛) in time polynomial in the num-
ber of bits of 𝑇 (𝑛) and hence poly-logarithmic in 𝑇 (𝑛).
Hence the time to write the string 1𝑇 (𝑛) in such cases
will be 𝑇 (𝑛) + 𝑝𝑜𝑙𝑦(log 𝑇 (𝑛)) = 𝑂(𝑇 (𝑛)).
• Cellular automata
• Parallel computers
The Extended Church Turing Thesis is the statement that this is true
for all physically realizable computing models. In other words, the
extended Church Turing thesis says that for every scalable computing device 𝐶 (which has a finite description but can be in principle used to run computation on arbitrarily large inputs), there is some constant 𝑎 such that for every function 𝐹 ∶ {0, 1}∗ → {0, 1} that 𝐶 can compute on length-𝑛 inputs using 𝑆(𝑛) physical resources, 𝐹 can be computed by a Turing machine in time at most 𝑆(𝑛)^𝑎 .
P
As in the case of Theorem 13.5, the proof of Theo-
rem 13.7 is not very deep and so it is more important
to understand its statement. Specifically, if you under-
stand how you would go about writing an interpreter
for NAND-RAM using a modern programming lan-
guage such as Python, then you know everything you
need to know about the proof of this theorem.
Theorem 13.8 — Timed Universal Turing Machine. Let TIMEDEVAL ∶ {0, 1}∗ → {0, 1}∗ be the function defined as TIMEDEVAL(𝑀 , 𝑥, 1^𝑇 ) = 𝑀 (𝑥) if 𝑀 halts on 𝑥 within at most 𝑇 steps, and TIMEDEVAL(𝑀 , 𝑥, 1^𝑇 ) = 0 otherwise.
Then TIMEDEVAL ∈ P.
Proof. We only sketch the proof since the result follows fairly directly from Theorem 13.5 and Theorem 13.7. By Theorem 13.5, to show that TIMEDEVAL ∈ P, it suffices to give a polynomial-time NAND-RAM program to compute TIMEDEVAL.
Such a program can be obtained as follows. Given a Turing machine 𝑀 , by Theorem 13.5 we can transform it in time polynomial in its description into a functionally-equivalent NAND-RAM program 𝑃 such that the execution of 𝑀 on 𝑇 steps can be simulated by the execution of 𝑃 on 𝑐 ⋅ 𝑇 steps. We can then run the universal NAND-RAM machine of Theorem 13.7 to simulate 𝑃 for 𝑐 ⋅ 𝑇 steps, using 𝑂(𝑇 ) time, and output 0 if the execution did not halt within this budget. This shows that TIMEDEVAL can be computed by a NAND-RAM program in time polynomial in |𝑀 | and linear in 𝑇 , which means TIMEDEVAL ∈ P.
■
Figure 13.6: The timed universal Turing machine takes as input a Turing machine 𝑀, an input 𝑥, and a time bound 𝑇 , and outputs 𝑀(𝑥) if 𝑀 halts within at most 𝑇 steps. Theorem 13.8 states that there is such a machine that runs in time polynomial in 𝑇 .
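In spirit, the timed universal machine is just a step-bounded interpreter. Here is a toy Python sketch; the encoding of transition tables as dictionaries is our own, not the book's:

def timed_eval(delta, x, T, blank="_", start="s", halt="h"):
    # simulate the single-tape machine given by delta for at most T steps
    tape = dict(enumerate(x))
    state, head = start, 0
    for _ in range(T):
        if state == halt:
            out, i = [], 0
            while tape.get(i, blank) != blank:
                out.append(tape[i])
                i += 1
            return "".join(out)
        state, write, move = delta[(state, tape.get(head, blank))]
        tape[head] = write
        head += 1 if move == "R" else -1
    return "0"  # the machine did not halt within the budget of T steps

# a machine that flips its first bit and halts
delta = {("s", "0"): ("h", "1", "R"), ("s", "1"): ("h", "0", "R")}
print(timed_eval(delta, "10", T=5))  # prints "00"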
There is nothing special about log 𝑛, and we could have used any
other efficiently computable function that tends to infinity with 𝑛.
R
Remark 13.10 — Simpler corollary of the time hierarchy
theorem. The generality of the time hierarchy theorem
can make its proof a little hard to read. It might be
easier to follow the proof if you first try to prove by
yourself the easier statement P ⊊ EXP.
You can do so by showing that the following function 𝐹 ∶ {0, 1}∗ → {0, 1} is in EXP ⧵ P: for every Turing machine 𝑀 and input 𝑥, 𝐹 (𝑀 , 𝑥) = 1 if and only if 𝑀 halts on 𝑥 within at most |𝑥|^{log |𝑥|} steps. One can show that 𝐹 ∈ TIME(𝑛^{𝑂(log 𝑛)} ) ⊆ EXP using the
universal Turing machine (or the efficient universal
NAND-RAM program of Theorem 13.7). On the other
hand, we can use similar ideas to those used to show
the uncomputability of HALT in Section 9.3.2 to prove
that 𝐹 ∉ P.
Proof Idea:
In the proof of Theorem 9.6 (the uncomputability of the Halting
problem), we have shown that the function HALT cannot be com-
puted in any finite time. An examination of the proof shows that it
gives something stronger. Namely, the proof shows that if we fix our
computational budget to be 𝑇 steps, then not only can we not dis-
tinguish between programs that halt and those that do not, but we
cannot even distinguish between programs that halt within at most 𝑇 ′
steps and those that take more than that (where 𝑇 ′ is some number
depending on 𝑇 ). Therefore, the proof of Theorem 13.9 follows the
Proof of Theorem 13.9. Our proof is inspired by the proof of the un-
computability of the halting problem. Specifically, for every function
𝑇 as in the theorem’s statement, we define the Bounded Halting func-
tion HALT𝑇 as follows. The input to HALT𝑇 is a pair (𝑃 , 𝑥) such that |𝑃 | ≤ log log |𝑥| and 𝑃 encodes some NAND-RAM program. We define HALT𝑇 (𝑃 , 𝑥) = 1 if and only if 𝑃 halts on the input 𝑥 within at most 100 ⋅ 𝑇 (|𝑃 | + |𝑥|) steps. (The constant 100 and the function log log 𝑛 are rather arbitrary, and are chosen for convenience in this proof.)
Theorem 13.9 is an immediate consequence of the following two
claims:
Claim 1: HALT𝑇 ∈ TIME(𝑇 (𝑛) ⋅ log 𝑛)
and
Claim 2: HALT𝑇 ∉ TIME(𝑇 (𝑛)).
Please make sure you understand why indeed the theorem follows
directly from the combination of these two claims. We now turn to
proving them.
Proof of claim 1: We can easily check in linear time whether an
input has the form 𝑃 , 𝑥 where |𝑃 | ≤ log log |𝑥|. Since 𝑇 (⋅) is a nice
function, we can evaluate it in 𝑂(𝑇 (𝑛)) time. Thus, we can compute
HALT𝑇 (𝑃 , 𝑥) as follows:
Solution:
This statement follows directly from the time hierarchy theorem, but it can be an instructive exercise to prove it directly, see Remark 13.10. We need to show that there exists 𝐹 ∈ EXP ⧵ P. Let 𝑇 (𝑛) = 𝑛^{log 𝑛} and 𝑇 ′ (𝑛) = 𝑛^{log 𝑛/2} . Both are nice functions. Since 𝑇 (𝑛)/𝑇 ′ (𝑛) = 𝜔(log 𝑛), by Theorem 13.9 there exists some 𝐹 in TIME(𝑇 (𝑛)) ⧵ TIME(𝑇 ′ (𝑛)). Since for sufficiently large 𝑛, 2^𝑛 > 𝑛^{log 𝑛} , 𝐹 ∈ TIME(2^𝑛 ) ⊆ EXP. On the other hand, 𝐹 ∉ P. Indeed, suppose otherwise that there was a constant 𝑐 > 0 and a Turing machine computing 𝐹 on 𝑛-length input in at most 𝑛^𝑐 steps for all sufficiently large 𝑛. Then since for 𝑛 large enough 𝑛^𝑐 < 𝑛^{log 𝑛/2} , it would have followed that 𝐹 ∈ TIME(𝑛^{log 𝑛/2} ), contradicting our choice of 𝐹 .
■
The time hierarchy theorem tells us that there are functions we can compute in 𝑂(𝑛²) time but not 𝑂(𝑛), in 2^𝑛 time but not 2^{√𝑛} , etc. In particular there are most definitely functions that we can compute in time 2^𝑛 but not 𝑂(𝑛). We have seen that we have no shortage of natural functions for which the best known algorithm requires roughly 2^𝑛
time, and that many people have invested significant effort in trying
to improve that. However, unlike in the finite vs. infinite case, for all
of the examples above at the moment we do not know how to rule
out even an 𝑂(𝑛) time algorithm. We will however see that there is a
single unproven conjecture that would imply such a result for most of
these problems.
The time hierarchy theorem relies on the existence of an efficient
universal NAND-RAM program, as proven in Theorem 13.7. For
other models such as Turing machines we have similar time hierarchy
results showing that there are functions computable in time 𝑇 (𝑛) and
not in time 𝑇 (𝑛)/𝑓(𝑛) where 𝑓(𝑛) corresponds to the overhead in the
corresponding universal machine.
for i in range(4):
print(i)
print(0)
print(1)
print(2)
print(3)
To make this idea into an actual proof we need to tackle one tech-
nical difficulty, and this is to ensure that the NAND-TM program is
oblivious in the sense that the value of the index variable i in the 𝑗-th
iteration of the loop will depend only on 𝑗 and not on the contents of
the input. We make a digression to do just that in Section 13.6.1 and
then complete the proof of Theorem 13.12.
⋆
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
one = NAND(X[0],temp_0)
zero = NAND(one,one)
temp_2 = NAND(X[0],zero)
temp_3 = NAND(X[0],temp_2)
temp_4 = NAND(zero,temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[1],Y[0])
temp_3 = NAND(X[1],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[2],Y[0])
temp_3 = NAND(X[2],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
Key to this transformation was the fact that in our original NAND-
TM program for XOR, regardless of whether the input is 011, 100, or
any other string, the index variable i is guaranteed to equal 0 in the
first iteration, 1 in the second iteration, 2 in the third iteration, and so
on and so forth. The particular sequence 0, 1, 2, … is immaterial: the
crucial property is that the NAND-TM program for XOR is oblivious
in the sense that the value of the index i in the 𝑗-th iteration depends
only on 𝑗 and does not depend on the particular choice of the input.
Luckily, it is possible to transform every NAND-TM program into a functionally equivalent oblivious program with at most quadratic overhead.
Figure 13.10: A NAND circuit for XOR3 obtained by "unrolling the loop" of the NAND-TM program for computing XOR three times.
the next time in which Marker[i]= 1 (at the next sweep), at which point 𝑃 zeroes Marker[i] and continues with the simulation. In the worst case this will take 2𝑇 (𝑛) steps (if 𝑃 has to go all the way from one end to the other and back again).
Theorem 13.14 — Turing-machine to circuit compiler. There is an algorithm UNROLL such that for every Turing machine 𝑀 and numbers 𝑛, 𝑇 , UNROLL(𝑀 , 1^𝑇 , 1^𝑛 ) runs for 𝑝𝑜𝑙𝑦(|𝑀 |, 𝑇 , 𝑛) steps and outputs a NAND circuit 𝐶 with 𝑛 inputs, 𝑂(𝑇 ²) gates, and one output, such that
𝐶(𝑥) = 𝑦 if 𝑀 halts on 𝑥 in ≤ 𝑇 steps and outputs 𝑦, and 𝐶(𝑥) = 0 otherwise.
Figure 13.12: The function UNROLL takes as input a Turing machine 𝑀, an input length parameter 𝑛, a step budget parameter 𝑇 , and outputs a circuit 𝐶 of size 𝑝𝑜𝑙𝑦(𝑇 ) that takes 𝑛 bits of input and outputs 𝑀(𝑥) if 𝑀 halts on 𝑥 within at most 𝑇 steps.
P
Reviewing the transformations described in Fig. 13.13, as well as solving the following two exercises, is a great way to get more comfortable with non-uniform complexity and in particular with P/poly and its relation to P.
Solution:
We start with the "if" direction. Suppose that there is a polynomial-time Turing machine 𝑀 that on input 1^𝑛 outputs a circuit 𝐶𝑛 that computes 𝐹↾𝑛 . Then the following is a polynomial-time Turing machine 𝑀 ′ to compute 𝐹 . On input 𝑥 ∈ {0, 1}∗ , 𝑀 ′ will compute the circuit 𝐶|𝑥| = 𝑀 (1^{|𝑥|} ) and then output the evaluation of 𝐶|𝑥| on 𝑥; both steps take polynomial time.
The next exercise concerns the characterization of P/poly via "advice": 𝐹 ∈ P/poly if and only if there exist a polynomial 𝑝 ∶ ℕ → ℕ, a polynomial-time Turing machine 𝑀 , and a sequence of advice strings {𝑎𝑛 }𝑛∈ℕ such that:
• |𝑎𝑛 | ≤ 𝑝(𝑛)
• For every 𝑥 ∈ {0, 1}^𝑛 , 𝑀 (𝑎𝑛 , 𝑥) = 𝐹 (𝑥).
Solution:
We only sketch the proof. For the “only if” direction, if 𝐹 ∈
P/poly then we can use for 𝑎𝑛 simply the description of the cor-
responding circuit 𝐶𝑛 and for 𝑀 the program that computes in
polynomial time the evaluation of a circuit on its input.
For the “if” direction, we can use the same “unrolling the loop”
technique of Theorem 13.12 to show that if 𝑃 is a polynomial-time
NAND-TM program, then for every 𝑛 ∈ ℕ, the map 𝑥 ↦ 𝑃 (𝑎𝑛 , 𝑥)
can be computed by a polynomial-size NAND-CIRC program 𝑄𝑛 .
■
Theorem 13.15 — P/poly contains uncomputable functions. There exists an uncomputable function 𝐹 ∶ {0, 1}∗ → {0, 1} such that 𝐹 ∈ P/poly .
Proof Idea:
Since P/poly corresponds to non-uniform computation, a function
𝐹 is in P/poly if for every 𝑛 ∈ ℕ, the restriction 𝐹↾𝑛 to inputs of length
𝑛 has a small circuit/program, even if the circuits for different values
of 𝑛 are completely different from one another. In particular, if 𝐹 has
the property that for every pair of equal-length inputs 𝑥 and 𝑥′ , 𝐹 (𝑥) =
𝐹 (𝑥′ ) then this means that 𝐹↾𝑛 is either the constant function zero
or the constant function one for every 𝑛 ∈ ℕ. Since the constant
function has a (very!) small circuit, such a function 𝐹 will always
be in P/poly (indeed even in smaller classes). Yet by a reduction from
the Halting problem, we can obtain a function with this property that
is uncomputable.
⋆
A function in P/poly can be computed by circuits in SIZE(𝑇 (𝑛)) for every 𝑛 using a completely different algorithm for every input length. For this reason we typically use P/poly not as a model of efficient computation but rather as a way to model inefficient compu-
tation. For example, in cryptography people often define an encryp-
tion scheme to be secure if breaking it for a key of length 𝑛 requires
more than a polynomial number of NAND lines. Since P ⊆ P/poly ,
this in particular precludes a polynomial time algorithm for doing so,
but there are technical reasons why working in a non-uniform model
makes more sense in cryptography. It also allows us to talk about se-
curity in non-asymptotic terms such as a scheme having “128 bits of
security”.
While it can sometimes be a real issue, in many natural settings the
difference between uniform and non-uniform computation does not
seem so important. In particular, in all the examples of problems not
known to be in P we discussed before: longest path, 3SAT, factoring,
etc., these problems are also not known to be in P/poly either. Thus,
for “natural” functions, if you pretend that TIME(𝑇 (𝑛)) is roughly the
same as SIZE(𝑇 (𝑛)), you will be right more often than wrong.
For a function 𝐹 ∶ {0, 1}∗ → {0, 1} and some nice time bound 𝑇 ∶ ℕ → ℕ, we know that:
• if 𝐹 is computable in time 𝑇 (𝑛) then 𝐹↾𝑛 has circuits of size 𝑂(𝑇 (𝑛)²) (by "unrolling the loop" as in Theorem 13.14), and
• the converse fails in general: 𝐹↾𝑛 can have small circuits for every 𝑛 (even if 𝐹 is uncomputable, as in Theorem 13.15) without 𝐹 having any efficient uniform algorithm.
✓ Chapter Recap
13.7 EXERCISES
Exercise 13.1 — Equivalence of different definitions of P and EXP. Prove that the classes P and EXP defined in Definition 13.2 are equal to ∪𝑐∈{1,2,3,…} TIME(𝑛^𝑐 ) and ∪𝑐∈{1,2,3,…} TIME(2^{𝑛^𝑐} ) respectively.
are also robust with respect to our choice of the representation of the input.
Specifically, let 𝐹 be a function mapping graphs to {0, 1}, and let 𝐹 ′ , 𝐹 ″ ∶ {0, 1}∗ → {0, 1} be the functions defined as follows. For every 𝑥 ∈ {0, 1}∗ :
• 𝐹 ′ (𝑥) = 1 if and only if 𝑥 represents a graph 𝐺 via the adjacency matrix representation and 𝐹 (𝐺) = 1.
• 𝐹 ″ (𝑥) = 1 if and only if 𝑥 represents a graph 𝐺 via the adjacency list representation and 𝐹 (𝐺) = 1.
Prove that 𝐹 ′ ∈ P if and only if 𝐹 ″ ∈ P.
14
Polynomial-time reductions
• At the moment, for all these problems the best known algorithm is
not much faster than the trivial one in the worst case.
In this chapter we will see that for each one of the problems of find-
ing a longest path in a graph, solving quadratic equations, and finding
the maximum cut, if there exists a polynomial-time algorithm for this
problem then there exists a polynomial-time algorithm for the 3SAT
problem as well. In other words, we will reduce the task of solving
3SAT to each one of the above tasks. Another way to interpret these
results is that if there does not exist a polynomial-time algorithm for
3SAT then there does not exist a polynomial-time algorithm for any of these other problems either. In Chapter 15 we will see evidence (though
no proof!) that all of the above problems do not have polynomial-time
algorithms and hence are inherently intractable.
Solution:
If 𝐹 ≤𝑝 𝐺 and 𝐺 ≤𝑝 𝐻 then there exist polynomial-time com-
putable functions 𝑅1 and 𝑅2 mapping {0, 1}∗ to {0, 1}∗ such that
for every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 𝐺(𝑅1 (𝑥)) and for every 𝑦 ∈ {0, 1}∗ ,
𝐺(𝑦) = 𝐻(𝑅2 (𝑦)). Combining these two equalities, we see that
for every 𝑥 ∈ {0, 1}∗ , 𝐹 (𝑥) = 𝐻(𝑅2 (𝑅1 (𝑥))) and so to show that
𝐹 ≤𝑝 𝐻, it is sufficient to show that the map 𝑥 ↦ 𝑅2 (𝑅1 (𝑥)) is
computable in polynomial time. But if there are some constants 𝑐, 𝑑 such that 𝑅1 (𝑥) is computable in time |𝑥|^𝑐 and 𝑅2 (𝑦) is computable in time |𝑦|^𝑑 then 𝑅2 (𝑅1 (𝑥)) is computable in time (|𝑥|^𝑐 )^𝑑 = |𝑥|^{𝑐𝑑} which is polynomial.
■
𝑥0 + 𝑥1 + 𝑥2 = 2
𝑥0 + 𝑥2 = 1
𝑥1 + 𝑥2 = 2
then 01EQ(𝐸) = 1 since the assignment 𝑥 = 011 satisfies all three equations. We specifically restrict attention to linear equations in variables 𝑥0 , … , 𝑥𝑛−1 in which every equation has the form ∑𝑖∈𝑆 𝑥𝑖 = 𝑏 where 𝑆 ⊆ [𝑛] and 𝑏 ∈ ℕ.1
If we asked the question of whether there is a solution 𝑥 ∈ ℝ^𝑛 of real numbers to 𝐸, then this can be solved using the famous Gaussian elimination algorithm in polynomial time. However, there is no known efficient algorithm to solve 01EQ. Indeed, such an algorithm would imply an algorithm for 3SAT, as shown by the following theorem:
3SAT ≤𝑝 01EQ
1 If you are familiar with matrix notation you may note that such equations can be written as 𝐴𝑥 = b where 𝐴 is an 𝑚 × 𝑛 matrix with entries in 0/1 and b ∈ ℕ^𝑚 .
Proof Idea:
A constraint 𝑥2 ∨ ¬𝑥5 ∨ 𝑥7 can be written as 𝑥2 + (1 − 𝑥5 ) + 𝑥7 ≥ 1. This is a linear inequality, but since the sum on the left-hand side is at most three, we can also turn it into an equality by adding two new variables 𝑦, 𝑧 and writing it as 𝑥2 + (1 − 𝑥5 ) + 𝑥7 + 𝑦 + 𝑧 = 3. (We will use fresh variables 𝑦, 𝑧 for every constraint.) Finally, for every variable 𝑥𝑖 we can add a variable 𝑥′𝑖 corresponding to its negation by adding the equation 𝑥𝑖 + 𝑥′𝑖 = 1, so that a term such as 1 − 𝑥5 can be replaced by 𝑥′5 .
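To make the translation concrete, here is a small sketch of the clause-to-equation step in Python (the naming conventions are ours):

def clause_to_equation(clause, fresh):
    # clause: list of (index, positive?) literals; fresh: id for slack variables
    terms = [f"x{i}" if positive else f"x{i}'" for i, positive in clause]
    slacks = [f"y{fresh}", f"z{fresh}"]
    return " + ".join(terms + slacks) + " = 3"

print(clause_to_equation([(2, True), (5, False), (7, True)], fresh=0))
# x2 + x5' + x7 + y0 + z0 = 3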
You can verify that 𝑥 ∈ ℝ3 satisfies this set of equations if and only if
𝑥 ∈ {0, 1}3 and 𝑥0 ∨ 𝑥1 = 1.
3SAT ≤𝑝 QUADEQ
Proof Idea:
Using the transitivity of reductions (Solved Exercise 14.2), it is
enough to show that 01EQ ≤𝑝 QUADEQ, but this follows since we can phrase the constraint 𝑥𝑖 ∈ {0, 1} as the quadratic equation 𝑥𝑖² − 𝑥𝑖 = 0.
The takeaway technique of this reduction is that we can use non-
linearity to force continuous variables (e.g., variables taking values in
ℝ) to be discrete (e.g., take values in {0, 1}).
⋆
3SAT ≤𝑝 SSUM
Proof Idea:
We reduce from 01EQ. The intuition is the following. Consider
an instance 𝐸 of 01EQ with 𝑛 variables 𝑥0 , … , 𝑥𝑛−1 and 𝑚 equa-
tions 𝑒0 , … , 𝑒𝑚−1 . Recall that each equation 𝑒ℓ in 𝐸 has the form
𝑥𝑖 + 𝑥𝑗 + 𝑥𝑘 = 𝑏 (potentially with more or less than three variables summed up on the left-hand side of the equation). For every variable 𝑥𝑖 , we can define a vector 𝑣^𝑖 ∈ {0, 1}^𝑚 where 𝑣^𝑖_𝑡 = 1 if and only if the variable 𝑥𝑖 appears in the equation 𝑒𝑡 .
In other words, 𝑦0 , … , 𝑦𝑛−1 and 𝑇 are the integers such that, written in the 𝐵-ary basis, the 𝑡-th digit of 𝑦𝑖 is 1 iff 𝑥𝑖 appears in the equation 𝑒𝑡 , and the 𝑡-th digit of 𝑇 is the right-hand side of 𝑒𝑡 .
The following claim will imply the correctness of the reduction:
Claim: For every 𝑥 ∈ {0, 1}𝑛 , if 𝑆 = {𝑖|𝑥𝑖 = 1} then 𝑥 satisfies the
equations of 𝐸 if and only if ∑𝑖∈𝑆 𝑦𝑖 = 𝑇 .
Proof: Key to the proof is the following simple property of grade-
school addition: when adding at most 𝑛 numbers in the 𝐵-ary basis,
if all the numbers have all their digits either 0 or 1, and 𝐵 > 𝑛, then
for every 𝑡, the 𝑡-th digit of the sum is the sum of the 𝑡-th digits of
the numbers. This is a simple consequence of the fact that there is no
“carry” in the addition. Since in our case the numbers 𝑦0 , … , 𝑦𝑛−1 satisfy this property in the 𝐵-ary basis, and 𝐵 > 𝑛, we get that for every 𝑆 ⊆ [𝑛] and every digit 𝑡, the 𝑡-th digit of the sum ∑𝑖∈𝑆 𝑦𝑖 is simply the sum of the 𝑡-th digits, which corresponds to the sum over the 𝑥𝑖 ’s that participate in the 𝑡-th equation. This sum equals
the 𝑡-th digit of 𝑇 if and only if that equation is satisfied.
The claim shows that 01EQ(𝐸) = SSUM(𝑦0 , … , 𝑦𝑛−1 , 𝑇 ) which is
what we needed to prove.
■
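Here is a toy numerical check of this encoding (our own code), using the example 𝑥0 + 𝑥1 + 𝑥2 = 2, 𝑥0 + 𝑥2 = 1, 𝑥1 + 𝑥2 = 2 from earlier in the chapter, with base 𝐵 = 10 > 𝑛 = 3:

equations = [([0, 1, 2], 2), ([0, 2], 1), ([1, 2], 2)]  # (variables, right-hand side)
n, B = 3, 10
# y_i has a 1 in digit t exactly when x_i appears in equation e_t
y = [sum(B**t for t, (S, _) in enumerate(equations) if i in S) for i in range(n)]
T = sum(b * B**t for t, (_, b) in enumerate(equations))
for mask in range(2**n):
    if sum(y[i] for i in range(n) if mask >> i & 1) == T:
        print([mask >> i & 1 for i in range(n)])  # prints [0, 1, 1], i.e. x = 011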
Proof Idea:
The idea is that finding a satisfying assignment to a 3SAT formula
corresponds to satisfying many local constraints without creating
any conflicts. One can think of “𝑥17 = 0” and “𝑥17 = 1” as two
conflicting events, and of the constraints 𝑥17 ∨ 𝑥5 ∨ 𝑥9 as creating
a conflict between the events “𝑥17 = 0”, “𝑥5 = 1” and “𝑥9 = 0”,
saying that these three cannot simultaneously co-occur. Using these
ideas, we can we can think of solving a 3SAT problem as trying to
schedule non-conflicting events, though the devil is, as usual, in the
details. The takeaway technique here is to map each clause of the
original formula into a gadget which is a small subgraph (or more
generally “subinstance”) satisfying some convenient properties. We
will see these “gadgets” used time and again in the construction of
polynomial-time reductions.
⋆
Solution:
The key observation is that if 𝑆 ⊆ 𝑉 is a vertex cover, i.e., a set that touches all edges, then there is no edge 𝑒 such that both 𝑒's endpoints are in the set 𝑆̄ = 𝑉 ⧵ 𝑆, and vice versa. In other words, 𝑆 is a vertex cover if and only if 𝑆̄ is an independent set. Since the size of 𝑆̄ is |𝑉 | − |𝑆|, we see that the polynomial-time map 𝑅(𝐺, 𝑘) = (𝐺, 𝑛 − 𝑘) (where 𝑛 is the number of vertices of 𝐺) satisfies that VC(𝑅(𝐺, 𝑘)) = ISET(𝐺, 𝑘), which means that it is a reduction from independent set to vertex cover.
■
Figure 14.6: A vertex cover in a graph is a subset of vertices that touches all edges. In this 7-vertex graph, the 3 filled vertices are a vertex cover.
Solved Exercise 14.4 — Clique is equivalent to independent set. The maximum clique problem corresponds to the function CLIQUE ∶ {0, 1}∗ → {0, 1} such that for a graph 𝐺 and a number 𝑘, CLIQUE(𝐺, 𝑘) = 1 iff there is a subset 𝑆 of 𝑘 vertices such that for every distinct 𝑢, 𝑣 ∈ 𝑆, the edge {𝑢, 𝑣} is in 𝐺. Such a set is known as a clique.
Prove that CLIQUE ≤𝑝 ISET and ISET ≤𝑝 CLIQUE.
■
Solution:
If 𝐺 = (𝑉 , 𝐸) is a graph, we denote by 𝐺̄ its complement, which is the graph on the same vertices 𝑉 such that for every distinct 𝑢, 𝑣 ∈ 𝑉 , the edge {𝑢, 𝑣} is present in 𝐺̄ if and only if this edge is not present in 𝐺.
This means that for every set 𝑆, 𝑆 is an independent set in 𝐺 if and only if 𝑆 is a clique in 𝐺̄. Therefore for every 𝑘, ISET(𝐺, 𝑘) = CLIQUE(𝐺̄, 𝑘). Since the map 𝐺 ↦ 𝐺̄ can be computed efficiently, this yields a reduction ISET ≤𝑝 CLIQUE. Moreover, since the complement of 𝐺̄ is 𝐺 itself, this yields a reduction in the other direction as well.
■
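In code, the complementation map is a one-liner (our sketch, assuming the networkx package):

import networkx as nx

G = nx.cycle_graph(5)    # the 5-cycle
Gbar = nx.complement(G)  # the reduction's output

S = [0, 2]  # an independent set in G ...
assert all(not G.has_edge(u, v) for u in S for v in S if u != v)
# ... is exactly a clique in the complement
assert all(Gbar.has_edge(u, v) for u in S for v in S if u != v)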
Solution:
Since we know that ISET ≤𝑝 VC, using transitivity, it is enough to show that VC ≤𝑝 DS. As Fig. 14.7 shows, a dominating set is not the same thing as a vertex cover. However, we can still relate the two problems. The idea is to map a graph 𝐺 into a graph 𝐻 such that a vertex cover in 𝐺 would translate into a dominating set in 𝐻 and vice versa. We do so by including in 𝐻 all the vertices and edges of 𝐺, but for every edge {𝑢, 𝑣} of 𝐺 we also add to 𝐻 a new vertex 𝑤𝑢,𝑣 and connect it to both 𝑢 and 𝑣. Let ℓ be the number of isolated vertices in 𝐺. The idea behind the proof is that we can transform a vertex cover 𝑆 of 𝑘 vertices in 𝐺 into a dominating set of 𝑘 + ℓ vertices in 𝐻 by adding to 𝑆 all the isolated vertices, and moreover we can transform every 𝑘 + ℓ-sized dominating set in 𝐻 into a vertex cover in 𝐺. We now give the details.
Description of the algorithm. Given an instance (𝐺, 𝑘) for the vertex cover problem, we will map 𝐺 into an instance (𝐻, 𝑘′ ) for the dominating set problem as follows (see Fig. 14.8 for a Python implementation, and the sketch below):
Figure 14.7: A dominating set is a subset 𝑆 of vertices such that every vertex in the graph is either in 𝑆 or a neighbor of 𝑆. The figure shows two copies of the same graph. The red vertices on the left are a vertex cover that is not a dominating set. The blue vertices on the right are a dominating set that is not a vertex cover.
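Here is our own sketch of this map (in the spirit of the book's Fig. 14.8, assuming networkx):

import networkx as nx

def vc_to_ds(G, k):
    # H is G plus one fresh gadget vertex per edge, connected to both endpoints
    H = G.copy()
    for u, v in G.edges():
        w = ("gadget", u, v)
        H.add_edge(u, w)
        H.add_edge(v, w)
    isolated = sum(1 for v in G.nodes() if G.degree(v) == 0)
    return H, k + isolated  # k' = k + number of isolated vertices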
Proof Idea:
We will map a graph 𝐺 into a graph 𝐻 such that a large indepen-
dent set in 𝐺 becomes a partition cutting many edges in 𝐻. We can
think of a cut in 𝐻 as coloring each vertex either “blue” or “red”. We
will add a special “source” vertex 𝑠∗ , connect it to all other vertices,
and assume without loss of generality that it is colored blue. Hence
the more vertices we color red, the more edges from 𝑠∗ we cut. Now,
for every edge 𝑢, 𝑣 in the original graph 𝐺 we will add a special “gad-
get” which will be a small subgraph that involves 𝑢,𝑣, the source 𝑠∗ ,
and two other additional vertices. We design the gadget in a way so
that if the red vertices are not an independent set in 𝐺 then the cor-
responding cut in 𝐻 will be “penalized” in the sense that it would
not cut as many edges. Once we set for ourselves this objective, it is
not hard to find a gadget that achieves it− see the proof below. Once
again the takeaway technique is to use (this time a slightly more
clever) gadget.
⋆
would be done. This might not always be the case but we will see that
if 𝐼 is not an independent set then it’s also larger than 𝑘. Specifically,
we define 𝑚_in = |𝐸(𝐼, 𝐼)| to be the number of edges in 𝐺 that are contained in 𝐼 and let 𝑚_out = 𝑚 − 𝑚_in (i.e., if 𝐼 is an independent set then 𝑚_in = 0 and 𝑚_out = 𝑚). By the properties of our gadget we know that for every edge {𝑢, 𝑣} of 𝐺, we can cut at most three edges when both 𝑢 and 𝑣 are in 𝑆, and at most four edges otherwise. Hence the number 𝐶 of edges cut by 𝑆 satisfies 𝐶 ≤ |𝐼| + 3𝑚_in + 4𝑚_out = |𝐼| + 3𝑚_in + 4(𝑚 − 𝑚_in ) = |𝐼| + 4𝑚 − 𝑚_in . Since 𝐶 = 𝑘 + 4𝑚 we get that |𝐼| − 𝑚_in ≥ 𝑘. Now we can transform 𝐼 into an independent set 𝐼 ′ by going over every one of the 𝑚_in edges that are inside 𝐼 and removing one of the endpoints of the edge from it. The resulting set 𝐼 ′ is an independent set in the graph 𝐺 of size |𝐼| − 𝑚_in ≥ 𝑘 and so this concludes the proof of the soundness condition.
■
Figure 14.11: In the reduction of independent set to max cut, for every 𝑡 ∈ [𝑚], we have a "gadget" corresponding to the 𝑡-th edge 𝑒 = {𝑣𝑖 , 𝑣𝑗 } in the original graph. If we think of the side of the cut containing the special source vertex 𝑠∗ as "white" and the other side as "blue", then the leftmost and center figures show that if 𝑣𝑖 and 𝑣𝑗 are not both blue then we can cut four edges from the gadget. In contrast, by enumerating all possibilities one can verify that if both 𝑢 and 𝑣 are blue, then no matter how we color the intermediate vertices 𝑒𝑡0 , 𝑒𝑡1 , we will cut at most three edges from the gadget. The figure above contains only the gadget edges and ignores the edges connecting 𝑠∗ to the vertices 𝑣0 , … , 𝑣𝑛−1 .
Figure 14.12: The reduction of independent set to max cut. On the right-hand side is Python code implementing the reduction. On the left-hand side is an example output of the reduction where we apply it to the independent set instance that is obtained by running the reduction of Theorem 14.8 on the 3CNF formula (𝑥0 ∨ 𝑥3 ∨ 𝑥2 ) ∧ (𝑥0 ∨ 𝑥1 ∨ 𝑥2 ) ∧ (𝑥1 ∨ 𝑥2 ∨ 𝑥3 ).
def var(v): # return variable index and whether positive or negated
    return (int(v[2:]), False) if v[0]=="¬" else (int(v[1:]), True)
n = numvars(φ)
clauses = getclauses(φ)
m = len(clauses)
G =Graph()
G.edge("start","start_0")
for i in range(n): # add 2 length-m paths per variable
G.edge(f"start_{i}",f"v_{i}_{0}_T")
G.edge(f"start_{i}",f"v_{i}_{0}_F")
for j in range(m-1):
G.edge(f"v_{i}_{j}_T",f"v_{i}_{j+1}_T")
G.edge(f"v_{i}_{j}_F",f"v_{i}_{j+1}_F")
G.edge(f"v_{i}_{m-1}_T",f"end_{i}")
G.edge(f"v_{i}_{m-1}_F",f"end_{i}")
if i<n-1:
G.edge(f"end_{i}",f"start_{i+1}")
G.edge(f"end_{n-1}","start_clauses")
for j,C in enumerate(clauses): # add gadget for each
↪ clause
for v in enumerate(C):
i,sign = var(v[1])
s = "F" if sign else "T"
G.edge(f"C_{j}_in",f"v_{i}_{j}_{s}")
G.edge(f"v_{i}_{j}_{s}",f"C_{j}_out")
if j<m-1:
G.edge(f"C_{j}_out",f"C_{j+1}_in")
G.edge("start_clauses","C_0_in")
G.edge(f"C_{m-1}_out","end")
return G, 1+n*(m+1)+1+2*m+1
“upper path” and a “lower path”. A simple path cannot take both the
upper path and the lower path, and so it will need to take exactly one
of them to reach 𝑠 from 𝑡.
Our intention is that a path in the graph will correspond to an as-
signment 𝑥 ∈ {0, 1}𝑛 in the sense that taking the upper path in the 𝑖𝑡ℎ
loop corresponds to assigning 𝑥𝑖 = 1 and taking the lower path cor-
responds to assigning 𝑥𝑖 = 0. When we are done snaking through all
the 𝑛 loops corresponding to the variables to reach 𝑡 we need to pass
through 𝑚 “obstacles”: for each clause 𝑗 we will have a small gad-
get consisting of a pair of vertices 𝑠𝑗 , 𝑡𝑗 that have three paths between
them. For example, if the 𝑗𝑡ℎ clause had the form 𝑥17 ∨ 𝑥55 ∨ 𝑥72 then
one path would go through a vertex in the lower loop corresponding
to 𝑥17 , one path would go through a vertex in the upper loop corre-
sponding to 𝑥55 and the third would go through the lower loop cor-
responding to 𝑥72 . We see that if we went in the first stage according
to a satisfying assignment then we will be able to find a free vertex to
travel from 𝑠𝑗 to 𝑡𝑗 . We link 𝑡1 to 𝑠2 , 𝑡2 to 𝑠3 , etc and link 𝑡𝑚 to 𝑡. Thus
a satisfying assignment would correspond to a path from 𝑠 to 𝑡 that
goes through one path in each loop corresponding to the variables,
and one path in each loop corresponding to the clauses. We can make the loops corresponding to the variables long enough so that we must take the entire path in each loop in order to have a fighting chance of getting a path as long as the one corresponding to a satisfying assignment. But if we do that, then the only way we are able to reach 𝑡 is if the paths we took corresponded to a satisfying assignment, since otherwise we will have some clause 𝑗 where we cannot reach 𝑡𝑗 from 𝑠𝑗 without using a vertex we already used before.
■
✓ Chapter Recap
14.9 EXERCISES
15
NP, NP completeness, and the Cook-Levin Theorem
“In this paper we give theorems that suggest, but do not imply, that these
problems, as well as many others, will remain intractable perpetually”, Richard
Karp, 1972
“Sad to say, but it will be many more years, if ever before we really understand
the Mystical Power of Twoness… 2-SAT is easy, 3-SAT is hard, 2-dimensional
matching is easy, 3-dimensional matching is hard. Why? oh, Why?” Eugene
Lawler
that |𝑤| ≤ 𝑝(|𝑥|) for some polynomial 𝑝. That is, prove that for every
𝐹 ∶ {0, 1}∗ → {0, 1}, 𝐹 ∈ NP if and only if there is a polynomial-
time Turing machine 𝑉 and a polynomial 𝑝 ∶ ℕ → ℕ such that for
every 𝑥 ∈ {0, 1}∗ 𝐹 (𝑥) = 1 if and only if there exists 𝑤 ∈ {0, 1}∗ with
|𝑤| ≤ 𝑝(|𝑥|) such that 𝑉 (𝑥, 𝑤) = 1.
■
Solution:
The “only if” direction (namely that if 𝐹 ∈ NP then there is an
algorithm 𝑉 and a polynomial 𝑝 as above) follows immediately
from Definition 15.1 by letting 𝑝(𝑛) = 𝑛^𝑎 . For the "if" direction, the idea is that if a string 𝑤 is of size at most 𝑝(𝑛) for a degree 𝑑 polynomial 𝑝, then there is some 𝑛0 such that for all 𝑛 > 𝑛0 , |𝑤| < 𝑛^{𝑑+1} . Hence we can encode 𝑤 by a string 𝑤′ of exactly length 𝑛^{𝑑+1} (for example, using some padding scheme), and define a verifier 𝑉 ′
such that 𝑉 ′ (𝑥𝑤′ ) = 1 if and only if there exists 𝑤 ∈ {0, 1}∗ with
|𝑤| ≤ 𝑝(|𝑥|) such that 𝑉 (𝑥𝑤) = 1.
■
R
Remark 15.2 — NP not (necessarily) closed under com-
plement. Definition 15.1 is asymmetric in the sense that
there is a difference between an output of 1 and an
output of 0. You should make sure you understand
why this definition does not guarantee that if 𝐹 ∈ NP
then the function 1 − 𝐹 (i.e., the map 𝑥 ↦ 1 − 𝐹 (𝑥)) is
in NP as well.
In fact, it is believed that there do exist functions 𝐹 such that 𝐹 ∈ NP but 1 − 𝐹 ∉ NP. For example, as shown below, 3SAT ∈ NP, but the complement function that on input a 3CNF formula 𝜑 outputs 1 if and only if 𝜑 is not satisfiable is not known (nor believed) to be in NP.
Here are some more examples for problems in NP. For each one of these problems we merely sketch how the witness is represented and why it is efficiently checkable, but working out the details can be a good way to get more comfortable with Definition 15.1:
• MAXCUT: the witness is the cut itself, i.e., a subset 𝑆 of 𝐺's vertices; we can verify it by enumerating over all the edges {𝑢, 𝑣} of 𝐺, counting those edges such that 𝑢 ∈ 𝑆 and 𝑣 ∉ 𝑆 or vice versa (see the sketch below).
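For instance, here is a sketch (ours) of a polynomial-time verifier for the MAXCUT witness just described:

def verify_maxcut(edges, k, S):
    # accept iff the set S cuts at least k edges
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    return cut >= k

square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(verify_maxcut(square, 4, [0, 2]))  # True: opposite corners cut all 4 edges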
Solution:
Suppose that 𝐹 ∈ P. Define the following function 𝑉 : 𝑉 (𝑥0^𝑛 ) = 1 iff 𝑛 = |𝑥| and 𝐹 (𝑥) = 1. (𝑉 outputs 0 on all other inputs.) Since 𝐹 ∈ P we can clearly compute 𝑉 in polynomial time as well.
Let 𝑥 ∈ {0, 1}^𝑛 be some string. If 𝐹 (𝑥) = 1 then 𝑉 (𝑥0^𝑛 ) = 1. On the other hand, if 𝐹 (𝑥) = 0 then for every 𝑤 ∈ {0, 1}^𝑛 , 𝑉 (𝑥𝑤) = 0. Therefore, setting 𝑎 = 1 (i.e., 𝑤 ∈ {0, 1}^𝑛 ), we see that 𝑉 satisfies (15.1), and hence 𝐹 ∈ NP.
R
Remark 15.5 — NP does not mean non-polynomial!.
People sometimes think that NP stands for “non-
polynomial time”. As Solved Exercise 15.2 shows, this
is far from the truth, and in fact every polynomial-
time computable function is in NP as well.
If 𝐹 is in NP it certainly does not mean that 𝐹 is hard
to compute (though it does not, as far as we know,
necessarily mean that it’s easy to compute either).
Rather, it means that 𝐹 is easy to verify, in the technical
sense of Definition 15.1.
Solution:
Suppose that 𝐹 ∈ NP and let 𝑉 be the polynomial-time com-
putable function that satisfies (15.1) and 𝑎 the corresponding
constant. Then given every x ∈ {0,1}^n, we can check whether F(x) = 1 in time poly(n) ⋅ 2^{n^a} = o(2^{n^{a+1}}) by enumerating over all w ∈ {0,1}^{n^a} and evaluating V(xw) on each.
Solved Exercise 15.2 and Solved Exercise 15.3 together imply that
P ⊆ NP ⊆ EXP .
Solution:
Suppose that 𝐺 is in NP and in particular there exists 𝑎 and 𝑉 ∈
P such that for every y ∈ {0,1}*, G(y) = 1 ⇔ ∃ w ∈ {0,1}^{|y|^a} s.t. V(yw) = 1. Suppose also that F ≤_p G, and so in particular there is an n^b-time computable function R such that F(x) = G(R(x)) for all x ∈ {0,1}*. Define V′ to be a Turing machine that on input a pair (x, w) outputs V(R(x)w). Then V′ runs in polynomial time, and F(x) = 1 if and only if there exists w of length |R(x)|^a ≤ |x|^{ab} such that V′(xw) = 1, which demonstrates that F ∈ NP.
We will soon show the proof of Theorem 15.6, but note that it im-
mediately implies that QUADEQ, LONGPATH, and MAXCUT all
reduce to 3SAT. Combining it with the reductions we’ve seen in Chap-
ter 14, it implies that all these problems are equivalent! For example,
to reduce QUADEQ to LONGPATH, we can first reduce QUADEQ to
3SAT using Theorem 15.6 and use the reduction we’ve seen in Theo-
rem 14.12 from 3SAT to LONGPATH. That is, since QUADEQ ∈ NP,
Theorem 15.6 implies that QUADEQ ≤𝑝 3SAT, and Theorem 14.12
implies that 3SAT ≤𝑝 LONGPATH, which by the transitivity of reduc-
tions (Solved Exercise 14.2) means that QUADEQ ≤𝑝 LONGPATH.
Similarly, since LONGPATH ∈ NP, we can use Theorem 15.6 and
Theorem 14.4 to show that LONGPATH ≤𝑝 3SAT ≤𝑝 QUADEQ,
concluding that LONGPATH and QUADEQ are computationally
equivalent.
There is of course nothing special about QUADEQ and LONGPATH
here: by combining Theorem 15.6 with the reductions we saw, we see that just
like 3SAT, every 𝐹 ∈ NP reduces to LONGPATH, and the same is true
for QUADEQ and MAXCUT. All these problems are in some sense
“the hardest in NP” since an efficient algorithm for any one of them
would imply an efficient algorithm for all the problems in NP. This
motivates the following definition:
Solution:
We have seen that the circuit (or straightline program) evalua-
tion problem can be computed in polynomial time. Specifically,
given a NAND-CIRC program 𝑄 of 𝑠 lines and 𝑛 inputs, and
𝑤 ∈ {0, 1}𝑛 , we can evaluate 𝑄 on the input 𝑤 in time which is
polynomial in 𝑠 and hence verify whether or not 𝑄(𝑤) = 1.
■
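For concreteness, here is a sketch of such an evaluator in Python, under a simplified representation (an assumption made here for illustration): a program is a list of triples (tgt, a, b) meaning “variable tgt gets the NAND of variables a and b”, variables 0, …, n−1 hold the input, and the last assigned variable holds the output.

```python
def nand_eval(prog, n, w):
    """Evaluate a straight-line NAND program on input w, in time linear
    in the number of lines (hence polynomial in the program size).
    prog: list of triples (tgt, a, b) meaning  var[tgt] = NAND(var[a], var[b])
    n: number of inputs; w: input string of n bits."""
    var = {i: int(w[i]) for i in range(n)}
    for (tgt, a, b) in prog:
        var[tgt] = 1 - (var[a] & var[b])
    return var[prog[-1][0]]  # value assigned by the last line

# NAND(x0, x1), then a NAND of the result with itself, computes AND:
prog = [(2, 0, 1), (3, 2, 2)]
print(nand_eval(prog, 2, "11"))  # 1
print(nand_eval(prog, 2, "10"))  # 0
```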
Proof Idea:
The proof closely follows the proof that P ⊆ P/poly (Theorem 13.12
, see also Section 13.6.2). Specifically, if 𝐹 ∈ NP then there is a poly-
nomial time Turing machine M and positive integer a such that for every x ∈ {0,1}^n, F(x) = 1 iff there is some w ∈ {0,1}^{n^a} such that M(xw) = 1.
P
The proof is a little bit technical but ultimately follows
quite directly from the definition of NP, as well as the
ability to “unroll the loop” of NAND-TM programs as
discussed in Section 13.6.2. If you find it confusing, try
to pause here and think how you would implement
in your favorite programming language the function
unroll which on input a NAND-TM program 𝑃
and numbers 𝑇 , 𝑛 outputs an 𝑛-input NAND-CIRC
program 𝑄 of 𝑂(|𝑇 |) lines such that for every input
𝑧 ∈ {0, 1}𝑛 , if 𝑃 halts on 𝑧 within at most 𝑇 steps and
outputs 𝑦, then 𝑄(𝑧) = 𝑦.
Proof Idea:
To prove Lemma 15.9 we need to give a polynomial-time map from
every NAND-CIRC program 𝑄 to a 3NAND formula Ψ such that there
exists 𝑤 such that 𝑄(𝑤) = 1 if and only if there exists 𝑧 satisfying Ψ.
For every line 𝑖 of 𝑄, we define a corresponding variable 𝑧𝑖 of Ψ. If
the line 𝑖 has the form foo = NAND(bar,blah) then we will add the
clause 𝑧𝑖 = NAND(𝑧𝑗 , 𝑧𝑘 ) where 𝑗 and 𝑘 are the last lines in which bar
and blah were written to. We will also set variables corresponding
to the input variables, as well as add a clause to ensure that the final
output is 1. The resulting reduction can be implemented in about a
dozen lines of Python, see Fig. 15.6.
⋆
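The book’s Figure 15.6 contains its own Python implementation; the following independent sketch, using the same simplified list-of-triples representation as in the evaluator above, conveys the idea:

```python
def circ_to_3nand(prog, n):
    """Map a NAND-CIRC program (list of triples, as above) to a 3NAND
    instance: a list of constraints (i, j, k) meaning z_i = NAND(z_j, z_k).
    Inputs are z_0..z_{n-1}; line t gets the fresh variable z_{n+t}.
    Also returns the variable that must be forced to equal 1 (which can
    be done with the NAND(0,0) trick from the proof), so the instance is
    satisfiable iff some w makes Q(w) = 1."""
    constraints, last = [], {i: i for i in range(n)}
    for t, (tgt, a, b) in enumerate(prog):
        v = n + t
        constraints.append((v, last[a], last[b]))  # z_v = NAND of sources
        last[tgt] = v                              # tgt now lives in z_v
    return constraints, n + len(prog) - 1          # constraints, output var

# The AND program from before yields z_2 = NAND(z_0, z_1),
# z_3 = NAND(z_2, z_2), with z_3 the variable to be forced to 1:
print(circ_to_3nand([(2, 0, 1), (3, 2, 2)], 2))  # ([(2, 0, 1), (3, 2, 2)], 3)
```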
• Let ℓ∗ be the last line in which the output y_0 is assigned a value.
Then we add the constraint z_{ℓ*} = NAND(z_{ℓ₀}, z_{ℓ₀}), where ℓ₀ − n is, as above, the last line in which the variable zero is assigned a value. Note that this
is effectively the constraint 𝑧ℓ∗ = NAND(0, 0) = 1.
To complete the proof we need to show that there exists 𝑤 ∈ {0, 1}𝑛
s.t. 𝑄(𝑤) = 1 if and only if there exists 𝑧 ∈ {0, 1}𝑛+𝑚 that satisfies all
constraints in Ψ. We now show both sides of this equivalence.
Part I: Completeness. Suppose that there is 𝑤 ∈ {0, 1}𝑛 s.t. 𝑄(𝑤) =
1. Let 𝑧 ∈ {0, 1}𝑛+𝑚 be defined as follows: for 𝑖 ∈ [𝑛], 𝑧𝑖 = 𝑤𝑖 and
for i ∈ {n, n + 1, …, n + m − 1}, z_i equals the value that is assigned in
the (𝑖 − 𝑛)-th line of 𝑄 when executed on 𝑤. Then by construction
𝑧 satisfies all of the constraints of Ψ (including the constraint that
𝑧ℓ∗ = NAND(0, 0) = 1 since 𝑄(𝑤) = 1.)
Part II: Soundness. Suppose that there exists 𝑧 ∈ {0, 1}𝑛+𝑚 satisfy-
ing Ψ. Soundness will follow by showing that 𝑄(𝑧0 , … , 𝑧𝑛−1 ) = 1 (and
hence in particular there exists 𝑤 ∈ {0, 1}𝑛 , namely 𝑤 = 𝑧0 ⋯ 𝑧𝑛−1 ,
such that 𝑄(𝑤) = 1). To do this we will prove the following claim
(∗): for every ℓ ∈ [𝑚], 𝑧ℓ+𝑛 equals the value assigned in the ℓ-th step
of the execution of the program 𝑄 on 𝑧0 , … , 𝑧𝑛−1 . Note that because 𝑧
satisfies the constraints of Ψ, (∗) is sufficient to prove the soundness
condition since these constraints imply that the last value assigned
to the variable y_0 in the execution of 𝑄 on 𝑧0 ⋯ 𝑧𝑛−1 is equal to 1. To
prove (∗) suppose, towards a contradiction, that it is false, and let ℓ be
the smallest number such that 𝑧ℓ+𝑛 is not equal to the value assigned
in the ℓ-th step of the execution of 𝑄 on 𝑧0 , … , 𝑧𝑛−1 . But since 𝑧 sat-
isfies the constraints of Ψ, we get that 𝑧ℓ+𝑛 = NAND(𝑧𝑖 , 𝑧𝑗 ) where
(by the assumption above that ℓ is smallest with this property) these
values do correspond to the values last assigned to the variables on the
right-hand side of the assignment operator in the ℓ-th line of the pro-
gram. But this means that the value assigned in the ℓ-th step is indeed
simply the NAND of 𝑧𝑖 and 𝑧𝑗 , contradicting our assumption on the
choice of ℓ.
■
Proof Idea:
To prove Lemma 15.10 we need to map a 3NAND formula φ into a 3SAT formula ψ such that φ is satisfiable if and only if ψ is. The idea is that we can transform every NAND constraint of the form a = NAND(b, c) into the AND of ORs involving the variables a, b, c and their negations, where each of the ORs contains at most three variables.
Figure 15.7: A 3NAND instance that is obtained by taking a NAND-TM program for computing the AND function, unrolling it to obtain a NANDSAT instance, and then composing it with the reduction of Lemma 15.9.
P
It is a good exercise for you to try to find a 3CNF for-
mula 𝜉 on three variables 𝑎, 𝑏, 𝑐 such that 𝜉(𝑎, 𝑏, 𝑐) is
true if and only if 𝑎 = NAND(𝑏, 𝑐). Once you do so, try
to see why this implies a reduction from 3NAND to
3SAT, and hence completes the proof of Lemma 15.10.
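If you want to check a candidate after trying, brute force over the 8 assignments suffices. The sketch below tests one natural candidate, ξ(a, b, c) = (a ∨ b) ∧ (a ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c), which is not necessarily the formula the exercise has in mind:

```python
from itertools import product

def xi(a, b, c):  # (a ∨ b) ∧ (a ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c)
    return (a or b) and (a or c) and ((not a) or (not b) or (not c))

# xi(a, b, c) should hold exactly when a = NAND(b, c):
print(all(xi(a, b, c) == (a == 1 - (b & c))
          for a, b, c in product([0, 1], repeat=3)))  # True
```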
15.6 WRAPPING UP
We have shown that for every function 𝐹 in NP, 𝐹 ≤𝑝 NANDSAT ≤𝑝
3NAND ≤𝑝 3SAT, and so 3SAT is NP-hard. Since in Chapter 14 we
saw that 3SAT ≤𝑝 QUADEQ, 3SAT ≤𝑝 ISET, 3SAT ≤𝑝 MAXCUT
and 3SAT ≤𝑝 LONGPATH, all these problems are NP-hard as well.
Finally, since all the aforementioned problems are in NP, they are
all in fact NP-complete and have equivalent complexity. There are
thousands of other natural problems that are NP-complete as well.
Finding a polynomial-time algorithm for any one of them will imply a
polynomial-time algorithm for all of them.
✓ Chapter Recap
15.7 EXERCISES
Exercise 15.1 — Poor man’s Ladner’s Theorem. Prove that if there is no n^{O(log² n)}-time algorithm for 3SAT then there is some F ∈ NP such that F ∉ P and F is not NP-complete.²
² Hint: Use the function F that on input a formula φ and a string of the form 1^t, outputs 1 if and only if φ is satisfiable and t = |φ|^{log |φ|}.
■
16
What if P equals NP?
• What is the evidence for P = NP vs P ≠ NP?
“You don’t have to believe in God, but you should believe in The Book.”, Paul Erdős, 1985.¹
¹ Paul Erdős (1913-1996) was one of the most prolific mathematicians of all times. Though he was an atheist, Erdős often referred to “The Book” in which God keeps the most elegant proof of each mathematical theorem.
“No more half measures, Walter”, Mike Ehrmantraut in “Breaking Bad”, 2010.
“Suppose aliens invade the earth and threaten to obliterate it in a year’s time
unless human beings can find the [fifth Ramsey number]. We could marshal
the world’s best minds and fastest computers, and within a year we could prob-
ably calculate the value. If the aliens demanded the [sixth Ramsey number],
however, we would have no choice but to launch a preemptive attack.”, Paul
Erdős, as quoted by Graham and Spencer, 1990.²
² The k-th Ramsey number, denoted R(k, k), is the smallest number n such that for every graph G on n vertices, either G or its complement contains a k-sized independent set. If P = NP then we can compute R(k, k) in time polynomial in 2^k, while otherwise it can potentially take closer to 2^{2^k} steps.
We have mentioned that the question of whether P = NP, which is equivalent to whether there is a polynomial-time algorithm for 3SAT, is the great open question of Computer Science. But why is it so
important? In this chapter, we will try to figure out the implications of
such an algorithm.
First, let us get one qualm out of the way. Sometimes people say,
“What if P = NP but the best algorithm for 3SAT takes n^{1000} time?” Well, n^{1000} is much larger than, say, 2^{0.001√n} for any input smaller than 2^{50}, which is as large a hard drive as you will encounter, and so another way to
phrase this question is to say “what if the complexity of 3SAT is ex-
ponential for all inputs that we will ever encounter, but then grows
much smaller than that?” To me this sounds like the computer science
equivalent of asking, “what if the laws of physics change completely
once they are out of the range of our telescopes?”. Sure, this is a valid
possibility, but wondering about it does not sound like the most pro-
ductive use of our time.
So, as the saying goes, we’ll keep an open mind, but not so open
that our brains fall out, and assume from now on that:
• 3SAT is very easy: 3SAT has an O(n) or O(n²) time algorithm with a not too huge constant (say smaller than 10⁶), and
• she does not “beat around the bush” or take “half measures”.
At the time of writing, the fastest known algorithm for 3SAT requires more than 2^{0.35n} steps to solve n-variable formulas, while we do not
even know how to rule out the possibility that we can compute 3SAT
using 10𝑛 gates. To put it in perspective, for the case 𝑛 = 1000 our
lower and upper bounds for the computational costs are apart by
a factor of about 10100 . As far as we know, it could be the case that
1000-variable 3SAT can be solved in a millisecond on a first-generation
iPhone, and it can also be the case that such instances require more
than the age of the universe to solve on the world’s fastest supercom-
puter.
So far, most of our evidence points to the latter possibility of 3SAT
being exponentially hard, but we have not ruled out the former possi-
bility either. In this chapter we will explore some of the consequences
of the “3SAT easy” scenario.
Theorem 16.1 — Search vs Decision. Suppose that P = NP. Then for every polynomial-time algorithm V and a, b ∈ ℕ, there is a polynomial-time algorithm FIND_V such that for every x ∈ {0,1}^n, if there exists y ∈ {0,1}^{an^b} satisfying V(xy) = 1, then FIND_V(x) outputs such a string y.
P
To understand what the statement of Theo-
rem 16.1 means, let us look at the special case of
the MAXCUT problem. It is not hard to see that there
is a polynomial-time algorithm VERIFYCUT such that
VERIFYCUT(𝐺, 𝑘, 𝑆) = 1 if and only if 𝑆 is a subset
of 𝐺’s vertices that cuts at least 𝑘 edges. Theorem 16.1
implies that if P = NP then there is a polynomial-time
algorithm FINDCUT that on input 𝐺, 𝑘 outputs a set
𝑆 such that VERIFYCUT(𝐺, 𝑘, 𝑆) = 1 if such a set
exists. This means that if P = NP, by trying all values
of 𝑘 we can find in polynomial time a maximum cut
in any given graph. We can use a similar argument to
show that if P = NP then we can find a satisfying as-
signment for every satisfiable 3CNF formula, find the
longest path in a graph, solve integer programming,
and so on and so forth.
Proof Idea:
The idea behind the proof of Theorem 16.1 is simple; let us
demonstrate it for the special case of 3SAT. (In fact, this case is not
so “special”− since 3SAT is NP-complete, we can reduce the task of
solving the search problem for MAXCUT or any other problem in
NP to the task of solving it for 3SAT.) Suppose that P = NP and we
are given a satisfiable 3CNF formula 𝜑, and we now want to find a
satisfying assignment 𝑦 for 𝜑. Define 3SAT0 (𝜑) to output 1 if there is
a satisfying assignment 𝑦 for 𝜑 such that its first bit is 0, and similarly
define 3SAT1 (𝜑) = 1 if there is a satisfying assignment 𝑦 with 𝑦0 = 1.
The key observation is that both 3SAT0 and 3SAT1 are in NP, and so if
P = NP then we can compute them in polynomial time as well. Thus
we can use this to find the first bit of the satisfying assignment. We
can continue in this way to recover all the bits.
⋆
maintain the invariant that there exists y ∈ {0,1}^{an^b} whose first ℓ bits are z s.t. V(xy) = 1. Note that this claim implies the theorem, since in particular it means that for ℓ = an^b, the string z itself satisfies V(xz) = 1.
We prove the claim by induction. For ℓ = 0, this holds vacuously.
Now for every ℓ > 0, if the call STARTSWITH_V(x z₀ ⋯ z_{ℓ−1} 0) returns 1, then we are guaranteed the invariant by definition of STARTSWITH_V. Under our inductive hypothesis, there are y_ℓ, …, y_{an^b−1} such that V(x z₀ ⋯ z_{ℓ−1} y_ℓ ⋯ y_{an^b−1}) = 1. If the call to STARTSWITH_V(x z₀ ⋯ z_{ℓ−1} 0) returns 0 then it must be the case that y_ℓ = 1, and hence when we set z_ℓ = 1 we maintain the invariant.
■
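Here is how the bit-by-bit recovery could look in Python. Both sat_decide (a polynomial-time decision procedure for satisfiability, which exists if P = NP) and the restrict method are hypothetical names used only for this sketch:

```python
def sat_search(phi, n, sat_decide):
    """Recover a satisfying assignment of a satisfiable formula phi on n
    variables, bit by bit, given a decision procedure sat_decide (which
    runs in polynomial time if P = NP). phi.restrict(b) is a hypothetical
    method returning phi with its next unassigned variable fixed to b."""
    assignment = []
    for _ in range(n):
        # if fixing the next bit to 0 keeps phi satisfiable, do so;
        # otherwise fixing it to 1 must keep phi satisfiable
        bit = 0 if sat_decide(phi.restrict(0)) else 1
        phi = phi.restrict(bit)
        assignment.append(bit)
    return assignment
```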
16.2 OPTIMIZATION
Theorem 16.1 allows us to find solutions for NP problems if P = NP,
but it is not immediately clear that we can find the optimal solution.
For example, suppose that P = NP, and you are given a graph 𝐺. Can
you find the longest simple path in 𝐺 in polynomial time?
P
This is actually an excellent question for you to at-
tempt on your own. That is, assuming P = NP, give
a polynomial-time algorithm that on input a graph 𝐺,
outputs a maximally long simple path in the graph 𝐺.
P
The statement of Theorem 16.3 is a bit cumbersome.
To understand it, think how it would subsume the
example above of a polynomial time algorithm for
finding the maximum length path in a graph. In
this case the function 𝑓 would be the map that on
input a pair x, y outputs 0 if the pair (x, y) does not represent a graph together with a simple path inside it; otherwise f(x, y) would equal
the length of the path 𝑦 in the graph 𝑥. Since a path
in an 𝑛 vertex graph can be represented by at most
𝑛 log 𝑛 bits, for every 𝑥 representing a graph of 𝑛 ver-
tices, finding max𝑦∈{0,1}𝑛 log 𝑛 𝑓(𝑥, 𝑦) corresponds to
finding the length of the maximum simple path in the
graph corresponding to 𝑥, and finding the string 𝑦∗
that achieves this maximum corresponds to actually
finding the path.
Proof Idea:
The proof follows by generalizing our ideas from the longest path
example above. Let 𝑓 be as in the theorem statement. If P = NP then
for every string x ∈ {0,1}* and number k, we can test in
𝑝𝑜𝑙𝑦(|𝑥|, 𝑚) time whether there exists 𝑦 such that 𝑓(𝑥, 𝑦) ≥ 𝑘, or in
other words test whether max𝑦∈{0,1}𝑚 𝑓(𝑥, 𝑦) ≥ 𝑘. If 𝑓(𝑥, 𝑦) is an
integer between 0 and 𝑝𝑜𝑙𝑦(|𝑥| + |𝑦|) (as is the case in the example of
longest path) then we can just try out all possibilities for 𝑘 to find the
maximum number 𝑘 for which max𝑦 𝑓(𝑥, 𝑦) ≥ 𝑘. Otherwise, we can
use binary search to hone down on the right value. Once we do so, we
can use search-to-decision to actually find the string 𝑦∗ that achieves
the maximum.
⋆
F(x, 1^m, k) = 1 if ∃ y ∈ {0,1}^m such that f(x, y) ≥ k, and F(x, 1^m, k) = 0 otherwise.
Since 𝑓 is computable in polynomial time, 𝐹 is in NP, and so under
our assumption that P = NP, 𝐹 itself can be computed in polynomial
time. Now, for every 𝑥 and 𝑚, we can compute the largest 𝑘 such that
F(x, 1^m, k) = 1 by binary search. Specifically, we maintain an interval [k₀, k₁] that is guaranteed to contain the largest such k, and repeatedly query F(x, 1^m, ⌊(k₀ + k₁ + 1)/2⌋) to halve the interval, as in the sketch below.
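A sketch of this binary search in Python, with F standing for the decision procedure assumed above:

```python
def max_value(F, x, m, k_max):
    """Largest k in [0, k_max] with F(x, m, k) = 1, found with about
    log2(k_max) calls to F. F is assumed monotone in k (as it is when it
    asks "is there y with f(x, y) >= k?") and F(x, m, 0) is assumed true."""
    lo, hi = 0, k_max              # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if F(x, m, mid):           # some y achieves value >= mid
            lo = mid
        else:
            hi = mid - 1
    return lo
```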
Remark 16.5 — Need for binary search. In many ex-
amples, such as the case of finding the longest path,
we don’t need to use the binary search step in Theo-
rem 16.3, and can simply enumerate over all possible
values for 𝑘 until we find the correct one. One exam-
ple where we do need to use this binary search step
is in the case of the problem of finding a maximum
length path in a weighted graph. This is the problem
where 𝐺 is a weighted graph, and every edge of 𝐺 is
given a weight which is a number between 0 and 2^k.
Theorem 16.3 shows that we can find the maximum-
weight simple path in 𝐺 (i.e., simple path maximizing
the sum of the weights of its edges) in time polyno-
mial in the number of vertices and in 𝑘.
Beyond just this example there is a vast field of math-
ematical optimization that studies problems of the
same form as in Theorem 16.3. In the context of opti-
mization, 𝑥 typically denotes a set of constraints over
some variables (that can be Boolean, integer, or real
valued), 𝑦 encodes an assignment to these variables,
and 𝑓(𝑥, 𝑦) is the value of some objective function that
we want to maximize. Given that we don’t know
efficient algorithms for NP complete problems, re-
searchers in optimization research study special cases
of functions 𝑓 (such as linear programming and
semidefinite programming) where it is possible to
optimize the value efficiently. Optimization is widely
used in a great many scientific areas including: ma-
chine learning, engineering, economics and operations
research.
to be correct. There are several ways to model this, but one popular
approach is to pick some fairly simple function 𝐻 ∶ {0, 1}𝑘+𝑛 → {0, 1}.
We think of the first 𝑘 inputs as the parameters and the last 𝑛 inputs
as the example data. (For example, we can think of the first 𝑘 inputs
of 𝐻 as specifying the weights and connections for some neural net-
work that will then be applied on the latter 𝑛 inputs.) We can then
phrase the supervised learning problem as finding, given a set of la-
beled examples 𝑆 = {(𝑥0 , 𝑦0 ), … , (𝑥𝑚−1 , 𝑦𝑚−1 )}, the set of parameters
𝜃0 , … , 𝜃𝑘−1 ∈ {0, 1} that minimizes the number of errors made by
the predictor x ↦ H(θ, x). (This is often known as Empirical Risk
Minimization.)
In other words, we can define for every set 𝑆 as above the function
𝐹𝑆 ∶ {0, 1}𝑘 → [𝑚] such that 𝐹𝑆 (𝜃) = ∑(𝑥,𝑦)∈𝑆 |𝐻(𝜃, 𝑥) − 𝑦|. Now,
finding the value 𝜃 that minimizes 𝐹𝑆 (𝜃) is equivalent to solving the
supervised learning problem with respect to 𝐻. For every polynomial-
time computable 𝐻 ∶ {0, 1}𝑘+𝑛 → {0, 1}, the task of minimizing
𝐹𝑆 (𝜃) can be “massaged” to fit the form of Theorem 16.3 and hence if
P = NP, then we can solve the supervised learning problem in great
generality. In fact, this observation extends to essentially any learn-
ing model, and allows for finding the optimal predictors given the
minimum number of examples. (This is in contrast to many current
learning algorithms, which often rely on having access to an extremely
large number of examples− far beyond the minimum needed, and
in particular far beyond the number of examples humans use for the
same tasks.)
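As a toy illustration, empirical risk minimization can be written as a brute-force search over all 2^k parameter vectors. This takes exponential time, which is exactly the cost that a polynomial-time algorithm arising from P = NP would avoid. A sketch:

```python
from itertools import product

def erm(H, k, S):
    """Return the k-bit parameter vector theta minimizing the number of
    errors made by the predictor x -> H(theta, x) on the labeled sample S.
    H: function taking (theta, x) and returning 0 or 1;
    S: list of (x, y) pairs with y in {0, 1}.
    Brute force over all 2**k choices of theta (exponential time)."""
    def errors(theta):
        return sum(abs(H(theta, x) - y) for (x, y) in S)
    return min(product([0, 1], repeat=k), key=errors)
```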
for this and let 𝜑(𝑛) = max𝐹 𝜓(𝐹 , 𝑛). The question is how fast 𝜑(𝑛)
grows for an optimal machine. One can show that 𝜑 ≥ 𝑘 ⋅ 𝑛 [for some
constant 𝑘 > 0]. If there really were a machine with 𝜑(𝑛) ∼ 𝑘 ⋅ 𝑛 (or
even ∼ 𝑘 ⋅ 𝑛2 ), this would have consequences of the greatest importance.
Namely, it would obviously mean that in spite of the undecidability of the Entscheidungsproblem,³ the mental work of a mathematician concerning Yes-or-No questions could be completely replaced by a machine. After all, one would simply have to choose the natural number n so large that when the machine does not deliver a result, it makes no sense to think more about the problem.
³ The undecidability of the Entscheidungsproblem refers to the uncomputability of the function that maps a statement in first order logic to 1 if and only if that statement has a proof.
For many reasonable proof systems (including the one that Gödel
referred to), SHORTPROOF𝑉 is in fact NP-complete, and so Gödel can
be thought of as the first person to formulate the P vs NP question.
Unfortunately, the letter was only discovered in 1988.
∃ y ∈ {0,1}^{p(|x|)} V(xy) = 1
which has the form (16.1). (Since NAND-CIRC programs are equiv-
alent to Boolean circuits, the search problem corresponding to (16.3) is known as the circuit minimization problem and is widely studied in engineering. You can skip ahead to Section 16.4.1 to see a particularly
compelling application of this.)
Another example of a statement involving 𝑎 levels of quantifiers
would be to check, given a chess position 𝑥, whether there is a strategy
that guarantees that White wins within a steps. For example, if a = 3 we would want to check if, given the board position x, there exists a move y for White such that for every move z for Black there exists a move w for White that ends in a checkmate.
It turns out that if P = NP then we can solve these kinds of prob-
lems as well:
Proof Idea:
To understand the idea behind the proof, consider the special case
where we want to decide, given 𝑥 ∈ {0, 1}𝑛 , whether for every 𝑦 ∈
{0, 1}𝑛 there exists 𝑧 ∈ {0, 1}𝑛 such that 𝑉 (𝑥𝑦𝑧) = 1. Consider the
function 𝐹 such that 𝐹 (𝑥𝑦) = 1 if there exists 𝑧 ∈ {0, 1}𝑛 such that
𝑉 (𝑥𝑦𝑧) = 1. Since 𝑉 runs in polynomial-time 𝐹 ∈ NP and hence if
P = NP, then there is an algorithm 𝑉 ′ that on input 𝑥, 𝑦 outputs 1 if
and only if there exists 𝑧 ∈ {0, 1}𝑛 such that 𝑉 (𝑥𝑦𝑧) = 1. Now we
can see that the original statement we consider is true if and only if for
every 𝑦 ∈ {0, 1}𝑛 , 𝑉 ′ (𝑥𝑦) = 1, which means it is false if and only if
the following condition (∗) holds: there exists some 𝑦 ∈ {0, 1}𝑛 such
that 𝑉 ′ (𝑥𝑦) = 0. But for every 𝑥 ∈ {0, 1}𝑛 , the question of whether
the condition (∗) holds is itself in NP (as we assumed V′ can be computed in polynomial time), and hence if P = NP we can decide it, and with it the original statement, in polynomial time.
The algorithm of Theorem 16.6 can solve the search problem as well: find the value y₀ that certifies the truth of (16.4). We note that while this algorithm is in polynomial time, the exponent of the polynomial grows with the number of quantifier levels.
incredible. And you learn that there is a taller mountain out there. Find it,
Mount Quantum…. they’re not smoothly connected … you’ve got to make a
jump to go from classical to quantum … This also tells you why we have such
major challenges in trying to extend our understanding of physics. We don’t
have these knobs, and little wheels, and twiddles that we can turn. We have to
learn how to make these jumps. And it is a tall order. And that’s why things are
difficult.”
✓ Chapter Recap
16.10 EXERCISES
17.1 EXERCISES
18
Probability Theory 101
“Einstein was doubly wrong … not only does God definitely play dice, but He
sometimes confuses us by throwing them where they can’t be seen.”, Stephen
Hawking
We can also use the intersection (∩) and union (∪) operators to
talk about the probability of both event 𝐴 and event 𝐵 happening, or
the probability of event 𝐴 or event 𝐵 happening. For example, the
probability p that x has an even number of ones and x₀ = 1 is the same as Pr[A ∩ B] where A = {x ∈ {0,1}^n : ∑_{i=0}^{n−1} x_i = 0 mod 2} and B = {x ∈ {0,1}^n : x₀ = 1}. This probability is equal to 1/4 for
𝑛 > 1. (It is a great exercise for you to pause here and verify that you
understand why this is the case.)
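One way to verify this is to enumerate the sample space for a small n, say n = 4 (a quick sketch):

```python
from itertools import product

n = 4  # the claim holds for every n > 1; we check n = 4
outcomes = list(product([0, 1], repeat=n))
A_and_B = [x for x in outcomes if sum(x) % 2 == 0 and x[0] == 1]
print(len(A_and_B) / len(outcomes))  # 0.25
```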
Because intersection corresponds to considering the logical AND
of the conditions that two events happen, while union corresponds
to considering the logical OR, we will sometimes use the ∧ and ∨
operators instead of ∩ and ∪, and so write this probability 𝑝 = Pr[𝐴 ∩
𝐵] defined above also as
Pr_{x∼{0,1}^n} [ ∑_{i=0}^{n−1} x_i = 0 mod 2 ∧ x₀ = 1 ] .
Pr[Ā] = |Ā|/2^n = (2^n − |A|)/2^n = 1 − |A|/2^n = 1 − Pr[A] .
This makes sense: since Ā happens if and only if A does not happen, the probability of Ā should be one minus the probability of A.
Remark 18.2 — Remember the sample space. While the
above definition might seem very simple and almost
trivial, the human mind seems not to have evolved for
probabilistic reasoning, and it is surprising how often
people can get even the simplest settings of probability
wrong. One way to make sure you don’t get confused
Proof.
𝔼[X + Y] = ∑_{x∈{0,1}^n} 2^{−n} (X(x) + Y(x)) = ∑_{x∈{0,1}^n} 2^{−n} X(x) + ∑_{x∈{0,1}^n} 2^{−n} Y(x) = 𝔼[X] + 𝔼[Y]
■
Solution:
We can solve this using the linearity of expectation. We can de-
fine random variables 𝑋0 , 𝑋1 , … , 𝑋𝑛−1 such that 𝑋𝑖 (𝑥) = 𝑥𝑖 . Since
each 𝑥𝑖 equals 1 with probability 1/2 and 0 with probability 1/2,
𝔼[X_i] = 1/2. Since X = ∑_{i=0}^{n−1} X_i, by the linearity of expectation 𝔼[X] = 𝔼[X₀] + 𝔼[X₁] + ⋯ + 𝔼[X_{n−1}] = n/2.
P
If you have not seen discrete probability before, please
go over this argument again until you are sure you
follow it; it is a prototypical simple example of the
type of reasoning we will employ again and again in
this course.
P
Before looking at the proof, try to see why the union
bound makes intuitive sense. We can also prove
it directly from the definition of probabilities and
the cardinality of sets, together with the equation
|𝐴 ∪ 𝐵| ≤ |𝐴| + |𝐵|. Can you see why the latter
equation is true? (See also Fig. 18.3.)
Proof of Lemma 18.4. For every x, 1_{A∪B}(x) ≤ 1_A(x) + 1_B(x).
Hence, Pr[𝐴∪𝐵] = 𝔼[1𝐴∪𝐵 ] ≤ 𝔼[1𝐴 +1𝐵 ] = 𝔼[1𝐴 ]+𝔼[1𝐵 ] = Pr[𝐴]+Pr[𝐵].
■
Pr[x₀ = 1] = 1/2
Pr[x₀ + x₁ + x₂ ≥ 2] = Pr[{011, 101, 110, 111}] = 4/8 = 1/2
but
Pr[x₀ = 1 ∧ x₀ + x₁ + x₂ ≥ 2] = Pr[{101, 110, 111}] = 3/8 > 1/2 ⋅ 1/2
and hence, as we already observed, the events {x₀ = 1} and {x₀ + x₁ + x₂ ≥ 2} are not independent and in fact are positively correlated. On the other hand, Pr[x₀ = 1 ∧ x₁ = 1] = Pr[{110, 111}] = 2/8 = 1/2 ⋅ 1/2, and hence the events {x₀ = 1} and {x₁ = 1} are indeed independent.
Figure 18.4: Two events A and B are independent if Pr[A ∩ B] = Pr[A] ⋅ Pr[B]. In the two figures above, the empty x × x square is the sample space, and A and B are two events in this sample space. In the left figure, A and B are independent, while in the right figure they are negatively correlated, since B is less likely to occur if we condition on A (and vice versa). Mathematically, one can see this by noticing that in the left figure the areas of A and B respectively are a ⋅ x and b ⋅ x, and so their probabilities are (a ⋅ x)/x² = a/x and (b ⋅ x)/x² = b/x respectively, while the area of A ∩ B is a ⋅ b, which corresponds to the probability (a ⋅ b)/x². In the right figure, the area of the triangle B is (b ⋅ x)/2, which corresponds to a probability of b/(2x), but the area of A ∩ B is (b′ ⋅ a)/2 for some b′ < b. This means that the probability of A ∩ B is (b′ ⋅ a)/(2x²) < b/(2x) ⋅ a/x, or in other words Pr[A ∩ B] < Pr[A] ⋅ Pr[B].
Remark 18.5 — Disjointness vs independence. People sometimes confuse the notions of disjointness and independence.
More than two events: We can generalize this definition to more than
two events. We say that events 𝐴1 , … , 𝐴𝑘 are mutually independent
if knowing that any set of them occurred or didn’t occur does not
change the probability that an event outside the set occurs. Formally,
the condition is that for every subset 𝐼 ⊆ [𝑘],
Pr[∧_{i∈I} A_i] = ∏_{i∈I} Pr[A_i].
For example, if 𝑥 ∼ {0, 1}3 , then the events {𝑥0 = 1}, {𝑥1 = 1} and
{𝑥2 = 1} are mutually independent. On the other hand, the events
{𝑥0 = 1}, {𝑥1 = 1} and {𝑥0 + 𝑥1 = 0 mod 2} are not mutually
independent, even though every pair of these events is independent
(can you see why? see also Fig. 18.5).
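A short enumeration over {0,1}³ makes this example concrete (a sketch):

```python
from itertools import product

omega = list(product([0, 1], repeat=3))
A = {x for x in omega if x[0] == 1}
B = {x for x in omega if x[1] == 1}
C = {x for x in omega if (x[0] + x[1]) % 2 == 0}

def pr(E):
    return len(E) / len(omega)

# Every pair is independent ...
print(pr(A & B) == pr(A) * pr(B))  # True
print(pr(A & C) == pr(A) * pr(C))  # True
print(pr(B & C) == pr(B) * pr(C))  # True
# ... but the triple is not:
print(pr(A & B & C), pr(A) * pr(B) * pr(C))  # 0.25 vs 0.125
```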
P
The notation in the lemma’s statement is a bit cum-
bersome, but at the end of the day, it simply says that
if 𝑋 and 𝑌 are random variables that depend on two
disjoint sets 𝑆 and 𝑇 of coins (for example, 𝑋 might
be the sum of the first 𝑛/2 coins, and 𝑌 might be the
largest consecutive stretch of zeroes in the second 𝑛/2
coins), then they are independent.
|C|/2^n = (|A| ⋅ |B| ⋅ 2^{n−k−m}) / (2^k ⋅ 2^m ⋅ 2^{n−k−m}) = Pr[X = a] Pr[Y = b].

∑_{x s.t. F(x)=a, y s.t. G(y)=b} Pr[X = x] Pr[Y = y] = (∑_{x s.t. F(x)=a} Pr[X = x]) ⋅ (∑_{y s.t. G(y)=b} Pr[Y = y]) = Pr[F(X) = a] Pr[G(Y) = b].
P
We leave proving Lemma 18.7 and Lemma 18.8 as
Exercise 18.6 and Exercise 18.7. It is a good idea for
you to stop now and do these exercises to make sure you
are comfortable with the notion of independence, as
we will use it heavily later on in this course.
Theorem 18.9 — Markov’s inequality. If X is a non-negative random variable then for every k > 0, Pr[X ≥ k 𝔼[X]] ≤ 1/k.
As a corollary, a non-negative random variable must attain a value of at least its expectation with positive probability, i.e., Pr[X ≥ 𝔼[X]] > 0. Indeed, suppose towards the sake of contradiction that Pr[X < 𝔼[X]] = 1. Then the random variable Y = 𝔼[X] − X is always positive. By
linearity of expectation 𝔼[𝑌 ] = 𝔼[𝑋] − 𝔼[𝑋] = 0. Yet by Markov, a
non-negative random variable 𝑌 with 𝔼[𝑌 ] = 0 must equal 0 with
probability 1, since the probability that 𝑌 > 𝑘 ⋅ 0 = 0 is at most 1/𝑘 for
every 𝑘 > 1. Hence we get a contradiction to the assumption that 𝑌 is
always positive.
■
Consider X = X₀ + ⋯ + X_{n−1} where the X_i’s are independent and identically distributed (i.i.d. for short) variables with values in [0, 1] where each
has expectation 1/2. Since 𝔼[𝑋] = ∑𝑖 𝔼[𝑋𝑖 ] = 𝑛/2, we would like to
say that 𝑋 is very likely to be in, say, the interval [0.499𝑛, 0.501𝑛]. Us-
ing Markov’s inequality directly will not help us, since it will only tell
us that 𝑋 is very likely to be at most 100𝑛 (which we already knew,
since it always lies between 0 and 𝑛). However, since 𝑋1 , … , 𝑋𝑛 are
independent,
We omit the proof, which appears in many texts, and uses Markov’s
inequality on i.i.d random variables 𝑌0 , … , 𝑌𝑛 that are of the form
Y_i = e^{λ X_i} for some carefully chosen parameter λ. See Exercise 18.11
for a proof of the simple (but highly useful and representative) case
where each 𝑋𝑖 is {0, 1} valued and 𝑝 = 1/2. (See also Exercise 18.12
for a generalization.)
Remark 18.13 — Slight simplification of Chernoff. Since e is roughly 2.7 (and in particular larger than 2), (18.2) would still be true if we replaced its right-hand side with e^{−2ε²n+1}. For n > 1/ε², the equation will
where the probability is taken over the choice of the set of samples
𝑆.
In particular if |𝒞| ≤ 2^k and n > k log(1/δ)/ε², then with probability at least 1 − δ, the classifier h* ∈ 𝒞 that minimizes the empirical error L̂_S(h) satisfies L(h*) ≤ L̂_S(h*) + ε, and hence its test error is at most ε worse than its training error.
Proof Idea:
The idea is to combine the Chernoff bound with the union bound.
Let 𝑘 = log |𝒞|. We first use the Chernoff bound to show that for
every fixed ℎ ∈ 𝒞, if we choose 𝑆 at random then the probability that
|L(h) − L̂_S(h)| > ε will be smaller than δ/2^k. We can then use the union
bound over all the 2𝑘 members of 𝒞 to show that this will be the case
for every ℎ.
⋆
X_i = 1 if h(x_i) ≠ y_i, and X_i = 0 otherwise.
(using the fact that e > 2). Since L̂_S(h) = (1/n) ∑_{i∈[n]} X_i, this completes
the proof of the claim.
Given the claim, the theorem follows from the union bound. In-
deed, for every ℎ ∈ 𝒞, define the “bad event” 𝐵ℎ to be the event (over
the choice of 𝑆) that |𝐿(ℎ) − 𝐿̂ 𝑆 (ℎ)| > 𝜖. By the claim Pr[𝐵ℎ ] < 𝛿/2𝑘 ,
and hence by the union bound the probability that the union of 𝐵ℎ for
all h ∈ 𝒞 happens is smaller than |𝒞| ⋅ δ/2^k = δ. If for every h ∈ 𝒞, B_h does not happen, it means that for every h ∈ 𝒞, |L(h) − L̂_S(h)| ≤ ε, completing the proof. ■
✓ Chapter Recap
18.4 EXERCISES
Exercise 18.1 Suppose that we toss three independent fair coins a, b, c ∈ {0,1}. What is the probability that the XOR of a, b, and c is equal to 1?
What is the probability that the AND of these three values is equal to
1? Are these two events independent?
■
Give an example of two random variables X and Y such that X and Y are not independent but 𝔼[XY] = 𝔼[X] 𝔼[Y].
■
Exercise 18.8 — Variance of independent random variables. Prove that if X₀, …, X_{n−1} are independent random variables then Var[X₀ + ⋯ + X_{n−1}] = ∑_{i=0}^{n−1} Var[X_i].
■
2. Use this and Exercise 18.10 to prove (an approximate version of)
the Chernoff bound for the case that 𝑋0 , … , 𝑋𝑛−1 are i.i.d. random
variables over {0, 1} each equaling 0 and 1 with probability 1/2.
That is, prove that for every ε > 0, and X₀, …, X_{n−1} as above, Pr[|∑_{i=0}^{n−1} X_i − n/2| > εn] < 2^{−0.1⋅ε²n}.
Exercise 18.12 — Poor man’s Chernoff. Exercise 18.11 establishes the Cher-
noff bound for the case that 𝑋0 , … , 𝑋𝑛−1 are i.i.d variables over {0, 1}
with expectation 1/2. In this exercise we use a slightly different
method (bounding the moments of the random variables) to estab-
lish a version of Chernoff where the random variables range over [0, 1]
and their expectation is some number 𝑝 ∈ [0, 1] that may be different
than 1/2. Let 𝑋0 , … , 𝑋𝑛−1 be i.i.d random variables with 𝔼 𝑋𝑖 = 𝑝 and
Pr[0 ≤ 𝑋𝑖 ≤ 1] = 1. Define 𝑌𝑖 = 𝑋𝑖 − 𝑝.
1. Prove that for every j₀, …, j_{n−1} ∈ ℕ, if there exists one i such that j_i is odd, then 𝔼[∏_{i=0}^{n−1} Y_i^{j_i}] = 0.

2. Prove that for every k, 𝔼[(∑_{i=0}^{n−1} Y_i)^k] ≤ (10kn)^{k/2}.³
³ Hint: Bound the number of tuples j₀, …, j_{n−1} such that every j_i is even and ∑ j_i = k using the Binomial coefficient and the fact that in any such tuple there are at most k/2 distinct indices.

3. Prove that for every ε > 0, Pr[|∑_i Y_i| ≥ εn] ≤ 2^{−ε²n/(10000 log 1/ε)}.⁴
⁴ Hint: Set k = 2⌈ε²n/1000⌉ and then show that if the event |∑ Y_i| ≥ εn happens then the random variable (∑ Y_i)^k is a factor of ε^{−k} larger than its expectation.

■
a. 1,000
b. 10,000
c. 100,000
d. 1,000,000
a. 1,000
b. 10,000
c. 100,000
d. 1,000,000
■
19
Probabilistic computation
“in 1946 .. (I asked myself) what are the chances that a Canfield solitaire laid
out with 52 cards will come out successfully? After spending a lot of time
trying to estimate them by pure combinatorial calculations, I wondered whether
a more practical method … might not be to lay it out say one hundred times and
simply observe and count”, Stanislaw Ulam, 1983
“The salient features of our method are that it is probabilistic … and with a
controllable miniscule probability of error.”, Michael Rabin, 1977
Proof Idea:
We simply choose a random cut: we choose a subset 𝑆 of vertices by
choosing every vertex 𝑣 to be a member of 𝑆 with probability 1/2 in-
dependently. It’s not hard to see that each edge is cut with probability
1/2 and so the expected number of cut edges is 𝑚/2.
⋆
For every such edge e = {i, j}, X_e(x) = 1 if and only if x_i ≠ x_j. Since the pair (x_i, x_j) obtains each of the values 00, 01, 10, 11 with probability 1/4, the probability that x_i ≠ x_j is 1/2 and hence 𝔼[X_e] = 1/2. If we let X be the random variable corresponding to the total number of edges cut by S, then X = ∑_{e∈E} X_e and hence by linearity of expectation 𝔼[X] = ∑_{e∈E} 𝔼[X_e] = m/2.
Proof Idea:
To see the idea behind the proof, think of the case that 𝑚 = 1000. In
this case one can show that we will cut at least 500 edges with proba-
bility at least 0.001 (and so in particular larger than 1/(2𝑚) = 1/2000).
Specifically, if we assume otherwise, then this means that with proba-
bility more than 0.999 the algorithm cuts 499 or fewer edges. But since
we can never cut more than the total of 1000 edges, given this assump-
tion, the highest value of the expected number of edges cut is if we
cut exactly 499 edges with probability 0.999 and cut 1000 edges with
probability 0.001. Yet even in this case the expected number of edges
will be 0.999 ⋅ 499 + 0.001 ⋅ 1000 < 500, which contradicts the fact that
we’ve calculated the expectation to be at least 500 in Theorem 19.1.
⋆
Proof of Lemma 19.2. Let 𝑝 be the probability that we cut at least 𝑚/2
edges and suppose, towards a contradiction, that 𝑝 < 1/(2𝑚). Since
the number of edges cut is an integer, and 𝑚/2 is a multiple of 0.5,
by definition of 𝑝, with probability 1 − 𝑝 we cut at most 𝑚/2 − 0.5
edges. Moreover, since we can never cut more than 𝑚 edges, under
our assumption that p < 1/(2m), we can bound the expected number of edges cut by (1 − p)(m/2 − 1/2) + pm < m/2 − 1/2 + (1/(2m)) ⋅ m = m/2, contradicting the fact that the expectation is at least m/2. ■
• Since the earth is about 5 billion years old, we can estimate the
chance that an asteroid of the magnitude that caused the dinosaurs’
extinction will hit us this very second to be about 2^{−60}. It is quite
likely that even a deterministic algorithm will fail if this happens.
Algorithm WalkSAT:
Input: An 𝑛 variable 3CNF formula 𝜑.
Parameters: 𝑇 , 𝑆 ∈ ℕ
Operation:
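In Python, the standard WalkSAT procedure with these parameters can be sketched as follows (details such as how the flipped variable is chosen vary between presentations, so treat this as one representative variant):

```python
import random

def walksat(clauses, n, T, S):
    """Standard WalkSAT sketch. clauses: list of clauses, each a list of
    literals; literal +i / -i means variable i-1 appears positive / negated.
    T random restarts, each followed by S local flipping steps.
    Returns a satisfying assignment (list of bits) or None."""
    def satisfied(clause, x):
        return any(x[abs(l) - 1] == (1 if l > 0 else 0) for l in clause)
    for _ in range(T):
        x = [random.randint(0, 1) for _ in range(n)]   # random restart
        for _ in range(S):
            unsat = [c for c in clauses if not satisfied(c, x)]
            if not unsat:
                return x
            l = random.choice(random.choice(unsat))    # random literal of a
            x[abs(l) - 1] ^= 1                         # random unsat clause
    return None

# phi = (x1 OR x2) AND (NOT x1 OR x3):
print(walksat([[1, 2], [-1, 3]], 3, T=10, S=30))
```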
P(x_{0,0}, …, x_{n−1,n−1}) = ∑_{π∈S_n} sign(π) ∏_{i=0}^{n−1} A_{i,π(i)} x_{i,π(i)}   (19.1)
that ∏_{i=0}^{n−1} A_{i,π(i)} x_{i,π(i)} ≠ 0. But for this to happen, it must be that
𝐴𝑖,𝜋(𝑖) ≠ 0 for all 𝑖, which means that for every 𝑖, the edge (𝑖, 𝜋(𝑖))
exists in the graph, and hence 𝜋 must be a perfect matching in 𝐺.
■
If a polynomial is not identically zero, then it can’t have “too many” roots.
This makes sense: if there are only “few” roots, then we expect that
with high probability the random input 𝑥 is not going to be one of
those roots. However, to transform this into an actual algorithm, we
need to make both the intuition and the notion of a “random” input
precise. Choosing a random real number is quite problematic, espe-
cially when you have only a finite number of coins at your disposal,
and so we start by reducing the task to a finite setting. We will use the
following result, a version of which is known as the Schwartz–Zippel Lemma: if P is a nonzero polynomial of total degree at most d in the variables x₀, …, x_{n−1}, then for every finite set S of numbers, the probability that P(x) = 0 when each coordinate of x is chosen uniformly and independently from S is at most d/|S|.
Algorithm Perfect-Matching:
Input: Bipartite graph 𝐺 on 2𝑛 vertices {ℓ0 , … , ℓ𝑛−1 , 𝑟0 , … , 𝑟𝑛−1 }.
Operation:
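In Python, the operation can be sketched as follows: choose each x_{i,j} at random from a set of 2n values and test whether the determinant of the matrix with entries A_{i,j} x_{i,j} is nonzero. Since the polynomial in (19.1) has degree n and coefficients ±1, we can work modulo a large prime (to keep the numbers small) and still have one-sided error at most n/(2n) = 1/2:

```python
import random

def det_mod(M, p):
    """Determinant of M modulo a prime p, via Gaussian elimination."""
    M = [row[:] for row in M]
    n, det = len(M), 1
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c] % p != 0), None)
        if pivot is None:
            return 0
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det                        # swapping rows flips the sign
        det = det * M[c][c] % p
        inv = pow(M[c][c], p - 2, p)          # modular inverse (p is prime)
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            for k in range(c, n):
                M[r][k] = (M[r][k] - f * M[c][k]) % p
    return det % p

def perfect_matching(A):
    """A: n x n 0/1 bipartite adjacency matrix (A[i][j] = 1 iff edge (l_i, r_j)).
    One-sided error: a True answer is always correct, while an existing
    matching is missed with probability at most n/(2n) = 1/2."""
    n, p = len(A), (1 << 61) - 1              # a large prime modulus
    X = [[A[i][j] * random.randrange(1, 2 * n + 1) for j in range(n)]
         for i in range(n)]
    return det_mod(X, p) != 0
```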
✓ Chapter Recap
19.4 EXERCISES
Exercise 19.1 — Amplification for max cut. Prove Lemma 19.3
■
Exercise 19.2 — Deterministic max cut algorithm.⁵
⁵ TODO: add exercise to give a deterministic max cut algorithm that gives m/2 edges. Talk about greedy approach.
■
19.6 ACKNOWLEDGEMENTS
Learning Objectives:
• Formal definition of probabilistic polynomial
time: the class BPP.
• Proof that every function in BPP can be
computed by 𝑝𝑜𝑙𝑦(𝑛)-sized NAND-CIRC
programs/circuits.
• Relations between BPP and NP.
• Pseudorandom generators
20
Modeling randomized computation
“Any one who considers arithmetical methods of producing random digits is, of
course, in a state of sin.” John von Neumann, 1951.
foo = RAND()
where this probability is taken over the result of the RAND opera-
tions of 𝑃 .
Note that the probability in (20.1) is taken only over the ran-
dom choices in the execution of 𝑃 and not over the choice of the in-
put 𝑥. In particular, as discussed in Big Idea 24, BPP is still a worst
case complexity class, in the sense that if 𝐹 is in BPP then there is a
polynomial-time randomized algorithm that computes 𝐹 with proba-
bility at least 2/3 on every possible (and not just random) input.
The same polynomial-overhead simulation of NAND-RAM pro-
grams by NAND-TM programs we saw in Theorem 13.5 extends to
randomized programs as well. Hence the class BPP is the same re-
gardless of whether it is defined via RNAND-TM or RNAND-RAM
programs. Similarly, we could have just as well defined BPP using
randomized Turing machines.
Because of these equivalences, below we will use the name “poly-
nomial time randomized algorithm” to denote a computation that can be
modeled by a polynomial-time RNAND-TM program, RNAND-RAM
program, or a randomized Turing machine (or any programming lan-
guage that includes a coin tossing operation). Since all these models
are equivalent up to polynomial factors, you can use your favorite
model to capture polynomial-time randomized algorithms without
any loss in generality.
Solved Exercise 20.1 — Choosing from a set. Modern programming languages often involve not just the ability to toss a random coin in {0,1}
but also to choose an element at random from a set 𝑆. Show that you
can emulate this primitive using coin tossing. Specifically, show that
Solution:
If the size of S is a power of two, that is m = 2^ℓ for some ℓ ∈ ℕ,
then we can choose a random element in 𝑆 by tossing ℓ coins to
obtain a string 𝑤 ∈ {0, 1}ℓ and then output the 𝑖-th element of 𝑆
where 𝑖 is the number whose binary representation is 𝑤.
If 𝑆 is not a power of two, then our first attempt will be to let
ℓ = ⌈log 𝑚⌉ and do the same, but then output the 𝑖-th element of
𝑆 if 𝑖 ∈ [𝑚] and output “fail” otherwise. Conditioned on not out-
putting “fail”, this element is distributed uniformly in 𝑆. However,
in the worst case, 2^ℓ can be almost 2m and so the probability of failure might be close to half. To reduce the failure probability, we can
repeat the experiment above 𝑛 times. Specifically, we will use the
following algorithm
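A sketch of this algorithm in Python:

```python
import random

def choose(S, n):
    """Uniform element of the list S using only fair coin tosses.
    Each round fails with probability < 1/2, so all n rounds fail with
    probability < 2**(-n); on total failure we output an arbitrary element."""
    m = len(S)
    ell = max(1, (m - 1).bit_length())      # ell = ceil(log2(m))
    for _ in range(n):
        i = 0
        for _ in range(ell):                # read ell random bits as a number
            i = 2 * i + random.randint(0, 1)
        if i < m:
            return S[i]                     # uniform, conditioned on success
    return S[0]                             # "fail": arbitrary element
```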
Pr_{r∼{0,1}^{a|x|^b}} [G(xr) = F(x)] ≥ 2/3 .   (20.2)
where the probability in the right-hand side is taken over the RAND()
operations in 𝑃 . In particular this means that if we define 𝐺(𝑥𝑟) =
𝑃 ′ (𝑥𝑟) then the function 𝐺 satisfies the conditions of (20.2).
The algorithm 𝑃 ′ will be very simple: it simulates the program 𝑃 ,
maintaining a counter 𝑖 initialized to 0. Every time that 𝑃 makes a
RAND() operation, the program 𝑃 ′ will supply the result from 𝑟𝑖 and
increment 𝑖 by one. We will never “run out” of bits, since the running
time of 𝑃 is at most 𝑎𝑛𝑏 and hence it can make at most this number of
RAND() calls. The output of 𝑃 ′ (𝑥𝑟) for a random 𝑟 ∼ {0, 1}𝑚 will be
distributed identically to the output of 𝑃 (𝑥).
For the other direction, given a function 𝐺 ∈ P satisfying the condi-
tion (20.2) and a NAND-TM 𝑃 ′ that computes 𝐺 in polynomial time,
we can construct an RNAND-TM program 𝑃 that computes 𝐹 in poly-
nomial time. On input 𝑥 ∈ {0, 1}𝑛 , the program 𝑃 will simply use the
RAND() instruction an^b times to fill an array R[0], …, R[an^b − 1] and
then execute the original program 𝑃 ′ on input 𝑥𝑟 where 𝑟𝑖 is the 𝑖-th
element of the array R. Once again, it is clear that if 𝑃 ′ runs in polyno-
mial time then so will P, and for every input x and r ∈ {0,1}^{an^b}, the output P′(xr) equals the output of P on x when P’s RAND() operations return the bits of r.
Remark 20.4 — Definitions of BPP and NP. The char-
acterization of BPP in Theorem 20.3 is reminiscent
of the characterization of NP in Definition 15.1, with
the randomness in the case of BPP playing the role
of the solution in the case of NP. However, there are
important differences between the two:
Pr[A(x) = F(x)] ≥ 1/2 + 1/p(n) .   (20.3)
Proof Idea:
The proof is the same as we’ve seen before in the case of maximum
cut and other examples. We use the Chernoff bound to argue that if
A computes F with probability at least 1/2 + ε and we run it O(k/ε²)
times, each time using fresh and independent random coins, then the
probability that the majority of the answers will not be correct will be
less than 2−𝑘 . Amplification can be thought of as a “polling” of the
choices for randomness for the algorithm (see Fig. 20.3).
⋆
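A sketch of amplification by majority in Python:

```python
import random

def amplify(A, x, t):
    """Run the randomized algorithm A on x for t independent repetitions and
    output the majority answer; if A is correct with probability 1/2 + eps,
    the Chernoff bound makes the error probability 2^(-Omega(eps^2 * t))."""
    ones = sum(A(x) for _ in range(t))
    return 1 if 2 * ones > t else 0

# Toy illustration: a "noisy" algorithm for the constant-1 function
noisy = lambda x: 1 if random.random() < 0.6 else 0   # correct w.p. 0.6
print(amplify(noisy, None, 1001))                     # 1 almost surely
```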
the probability that the plurality value is not correct is at most 2e^{−ε²t},
One would be correct about the former, but wrong about the latter.
As we will see, we do in fact have reasons to believe that BPP = P.
This can be thought of as supporting the extended Church Turing hy-
pothesis that deterministic polynomial-time Turing machines capture
what can be feasibly computed in the physical world.
We now survey some of the relations that are known between
BPP and other complexity classes we have encountered. (See also
Fig. 20.4.)
Figure 20.4: Some possibilities for the relations between BPP and other complexity classes. Most researchers believe that BPP = P and that these classes are not powerful enough to solve NP-complete problems, let alone all problems in EXP. However, we have not even been able yet to rule out the possibility that randomness is a “silver bullet” that allows exponential speedup on all problems, and hence BPP = EXP. As we’ve already seen, we also can’t rule out that P = NP. Interestingly, in the latter case, P = BPP.

20.3.1 Solving BPP in exponential time
It is not hard to see that if F is in BPP then it can be computed in exponential time.

Theorem 20.7 — Simulating randomized algorithms in exponential time. BPP ⊆ EXP
P
The proof of Theorem 20.7 readily follows by enumer-
ating over all the (exponentially many) choices for the
random coins. We omit the formal proof, as doing it
by yourself is an excellent way to get comfortable with
Definition 20.1.
Proof Idea:
The idea behind the proof is that we can first amplify by repetition
the probability of success from 2/3 to 1 − 0.1 ⋅ 2−𝑛 . This will allow us to
show that for every 𝑛 ∈ ℕ there exists a single fixed choice of “favorable
coins” which is a string 𝑟 of length polynomial in 𝑛 such that if 𝑟 is
used for the randomness then we output the right answer on all of
the possible 2𝑛 inputs. We can then use the standard “unravelling the
loop” technique to transform an RNAND-TM program to an RNAND-
CIRC program, and “hardwire” the favorable choice of random coins into the resulting circuit.
20.4 DERANDOMIZATION
The proof of Theorem 20.8 can be summarized as follows: we can
replace a 𝑝𝑜𝑙𝑦(𝑛)-time algorithm that tosses coins as it runs with an
algorithm that uses a single set of coin tosses 𝑟∗ ∈ {0, 1}𝑝𝑜𝑙𝑦(𝑛) which
will be good enough for all inputs of size 𝑛. Another way to say it is
that for the purposes of computing functions, we do not need “online”
access to random coins and can generate a set of coins “offline” ahead
of time, before we see the actual input.
But this does not really help us with answering the question of
whether BPP equals P, since we still need to find a way to generate
these “offline” coins in the first place. To derandomize an RNAND-
TM program we will need to come up with a single deterministic
algorithm that will work for all input lengths. That is, unlike in the
case of RNAND-CIRC programs, we cannot choose for every input
length 𝑛 some string 𝑟∗ ∈ {0, 1}𝑝𝑜𝑙𝑦(𝑛) to use as our random coins.
Can we derandomize randomized algorithms, or does randomness
add an inherent extra power for computation? This is a fundamentally
interesting question but is also of practical significance. Ever since
people started to use randomized algorithms during the Manhattan
project, they have been trying to remove the need for randomness and
replace it with numbers that are selected through some deterministic
process. Throughout the years this approach has often been used successfully, though there have been a number of failures as well.²
² One amusing anecdote is a recent case where scammers managed to predict the imperfect “pseudorandom generator” used by slot machines to cheat casinos. Unfortunately we don’t know the details of how they did it, since the case was sealed.
A common approach people used over the years was to replace the random coins of the algorithm by a “randomish looking” string that they generated through some arithmetic process. For example, one can use the digits of π for the random tape. Using these types of
as generating true randomness in sufficient quantity was and still is
often too expensive.) The reason that this is considered a “sin” is that
such a procedure will not work in general. For example, it is easy to
modify any probabilistic algorithm 𝐴 such as the ones we have seen in
Chapter 19, to an algorithm 𝐴′ that is guaranteed to fail if the random
tape happens to equal the digits of 𝜋. This means that the procedure
“replace the random tape by the digits of 𝜋” does not yield a general
way to transform a probabilistic algorithm to a deterministic one that
will solve the same problem. Of course, this procedure does not always
fail, but we have no good way to determine when it fails and when
it succeeds. This reasoning is not specific to 𝜋 and holds for every
deterministically produced string, whether it is obtained from π, e, the
Fibonacci series, or anything else.
An algorithm that checks if its random tape is equal to 𝜋 and then
fails seems to be quite silly, but this is but the “tip of the iceberg” for a
very serious issue. Time and again people have learned the hard way
that one needs to be very careful about producing random bits using
deterministic means. As we will see when we discuss cryptography,
P
This is a definition that’s worth reading more than
once, and spending some time to digest it. Note that it
takes several parameters: the bound T on the size of the distinguishing programs, the bound ε on their advantage, and the input and output lengths ℓ and m.
We will now (partially) answer both questions. For the first ques-
tion, let us come clean and confess we do not know how to prove that
interesting pseudorandom generators exist. By interesting we mean
pseudorandom generators that satisfy that 𝜖 is some small constant
(say 𝜖 < 1/3), 𝑚 > ℓ, and the function 𝐺 itself can be computed in
𝑝𝑜𝑙𝑦(𝑚) time. Nevertheless, Lemma 20.12 (whose statement and proof
is deferred to the end of this chapter) shows that if we only drop the
last condition (polynomial-time computability), then there do in fact
exist pseudorandom generators where 𝑚 is exponentially larger than ℓ.
P
At this point you might want to skip ahead and look at
the statement of Lemma 20.12. However, since its proof
is somewhat subtle, I recommend you defer reading it
until you’ve finished reading the rest of this chapter.
P
The “optimal PRG conjecture” is worth while reading
more than once. What it posits is that we can obtain
a (𝑇 , 𝜖) pseudorandom generator 𝐺 such that every
output bit of 𝐺 can be computed in time polynomial
in the length ℓ of the input, where 𝑇 is exponentially
large in ℓ and 𝜖 is exponentially small in ℓ. (Note that
we could not hope for the entire output to be com-
putable in ℓ, as just writing the output down will take
too long.)
To understand why we call such a pseudorandom
generator “optimal,” it is a great exercise to convince
yourself that, for example, there does not exist a (2^{1.1ℓ}, 2^{−1.1ℓ}) pseudorandom generator (in fact, the number δ in the conjecture must be smaller than 1). To
see that we can’t have 𝑇 ≫ 2ℓ , note that if we allow a
NAND-CIRC program with much more than 2ℓ lines
then this NAND-CIRC program could “hardwire” in-
side it all the outputs of 𝐺 on all its 2ℓ inputs, and use
that to distinguish between a string of the form 𝐺(𝑠)
and a uniformly chosen string in {0, 1}𝑚 . To see that
we can’t have 𝜖 ≪ 2−ℓ , note that by guessing the input
𝑠 (which will be successful with probability 2−ℓ ), we
can obtain a small (i.e., 𝑂(ℓ) line) NAND-CIRC pro-
gram that achieves a 2−ℓ advantage in distinguishing a
pseudorandom and uniform input. Working out these
details is a highly recommended exercise.
We emphasize again that the optimal PRG conjecture is, as its name
implies, a conjecture, and we still do not know how to prove it. In par-
ticular, it is stronger than the conjecture that P ≠ NP. But we do have
some evidence for its truth. There is a spectrum of different types of
pseudorandom generators, and there are weaker assumptions than
the optimal PRG conjecture that suffice to prove that BPP = P. In
particular this is known to hold under the assumption that there exists a function F ∈ TIME(2^{O(n)}) and ε > 0 such that for every sufficiently large n, F↾_n is not in SIZE(2^{εn}). The name “Optimal PRG conjecture” is non-standard. This conjecture is sometimes known in the literature as the existence of exponentially strong pseudorandom functions.³
³ A pseudorandom generator of the form we posit, where each output bit can be computed individually in time polynomial in the seed length, is commonly known as a pseudorandom function generator. For more on the many interesting results and connections in the study of pseudorandomness, see this monograph of Salil Vadhan.

20.4.3 Usefulness of pseudorandom generators
We now show that optimal pseudorandom generators are indeed very useful, by proving that if the optimal PRG conjecture is true then BPP = P:
Proof Idea:
P
Before reading the proof, it is instructive to think
why this result is not “obvious.” If P = NP then
given any randomized algorithm 𝐴 and input 𝑥,
we will be able to figure out in polynomial time if
there is a string 𝑟 ∈ {0, 1}𝑚 of random coins for 𝐴
such that 𝐴(𝑥𝑟) = 1. The problem is that even if
Pr𝑟∼{0,1}𝑚 [𝐴(𝑥𝑟) = 𝐹 (𝑥)] ≥ 0.9999, it can still be the
case that even when 𝐹 (𝑥) = 0 there exists a string 𝑟
such that 𝐴(𝑥𝑟) = 1.
The proof is rather subtle. It is much more important
that you understand the statement of the theorem than
that you follow all the details of the proof.
Proof Idea:
The construction follows the “quantifier elimination” idea which
we have seen in Theorem 16.6. We will show that for every 𝐹 ∈ BPP,
we can reduce the question of whether an input x satisfies F(x) = 1 to the
question of whether a formula of the form ∃𝑢∈{0,1}𝑚 ∀𝑣∈{0,1}𝑘 𝑃 (𝑢, 𝑣)
is true, where 𝑚, 𝑘 are polynomial in the length of 𝑥 and 𝑃 is
polynomial-time computable. By Theorem 16.6, if P = NP then we can
decide in polynomial time whether such a formula is true or false.
The idea behind this construction is that using amplification we
can obtain a randomized algorithm 𝐴 for computing 𝐹 using 𝑚 coins
such that for every 𝑥 ∈ {0, 1}𝑛 , if 𝐹 (𝑥) = 0 then the set 𝑆 ⊆ {0, 1}𝑚
of coins that make 𝐴 output 1 is extremely tiny (i.e., exponentially
small relative to 2𝑚 ), and if 𝐹 (𝑥) = 1 then 𝑆 is very large (of size
close to 2𝑚 ). We then consider “shifts” of the set 𝑆: sets of the form
𝑆 ⊕ 𝑠 where 𝑠 ∈ {0, 1}𝑚 is some string, where 𝑆 ⊕ 𝑠 is defined as
{𝑟 ⊕ 𝑠 | 𝑟 ∈ 𝑆}. Note that for every such shift 𝑠, the cardinality of 𝑆 ⊕ 𝑠
is the same as the cardinality of 𝑆. Hence, if 𝐹 (𝑥) = 0, and so 𝑆 is
“tiny”, then for every polynomial number of shifts s₀, …, s_{k−1} ∈ {0,1}^m, the union of the sets S ⊕ s_i will not cover {0,1}^m. On the other hand, we will show that if S is very large, then there exists a polynomial number of such shifts such that ∪_{i=0}^{k−1}(S ⊕ s_i) = {0,1}^m. We can express the condition that there exists s₀, …, s_{k−1} such that ∪_{i∈[k]}(S ⊕ s_i) = {0,1}^m as a statement with a constant number of quantifiers. (Specifically, this condition holds if for every y ∈ {0,1}^m, there exists s ∈ S and i ∈ {0, …, k−1} such that y = s ⊕ s_i.)
⋆

Figure 20.7: If F ∈ BPP then through amplification we can ensure that there is an algorithm A to compute F on n-length inputs and using m coins such that Pr_{r∼{0,1}^m}[A(xr) ≠ F(x)] ≪ 1/poly(m). Hence if F(x) = 1 then almost all of the 2^m choices for r will cause A(xr) to output 1, while if F(x) = 0 then A(xr) = 0 for almost all r’s. To prove the Sipser–Gács Theorem we consider several “shifts” of the set S ⊆ {0,1}^m of the coins r such that A(xr) = 1. If F(x) = 1 then we can find a set of k shifts s₀, …, s_{k−1} for which ∪_{i∈[k]}(S ⊕ s_i) = {0,1}^m. If F(x) = 0 then for every such set |∪_{i∈[k]}(S ⊕ s_i)| ≤ k|S| ≪ 2^m. We can phrase the question of whether there is such a set of shifts using a constant number of quantifiers, and so can solve it in polynomial time if P = NP.

Proof of Theorem 20.11. Let F ∈ BPP. Using Theorem 20.5, there exists a polynomial-time algorithm A such that for every x ∈ {0,1}^n,

Pr_{r∈{0,1}^m} [A(xr) = F(x)] ≥ 1 − 1/(10m²) .   (20.8)
Let x ∈ {0,1}^n, and let S_x ⊆ {0,1}^m be the set {r ∈ {0,1}^m : A(xr) = 1}. By our assumption, if F(x) = 0 then |S_x| ≤ (1/(10m²))2^m, and if F(x) = 1 then |S_x| ≥ (1 − 1/(10m²))2^m.
For a set 𝑆 ⊆ {0, 1}𝑚 and a string 𝑠 ∈ {0, 1}𝑚 , we define the set
𝑆 ⊕ 𝑠 to be {𝑟 ⊕ 𝑠 ∶ 𝑟 ∈ 𝑆} where ⊕ denotes the XOR operation. That
is, 𝑆 ⊕ 𝑠 is the set 𝑆 “shifted” by 𝑠. Note that |𝑆 ⊕ 𝑠| = |𝑆|. (Please
make sure that you see why this is true.)
The heart of the proof is the following two claims:
CLAIM I: For every subset S ⊆ {0,1}^m, if |S| ≤ (1/(1000m))2^m, then for every s₀, …, s_{100m−1} ∈ {0,1}^m, ∪_{i∈[100m]}(S ⊕ s_i) ⊊ {0,1}^m.
CLAIM II: For every subset S ⊆ {0,1}^m, if |S| ≥ (1/2)2^m then there exists a set of strings s₀, …, s_{100m−1} such that ∪_{i∈[100m]}(S ⊕ s_i) = {0,1}^m.
CLAIM I and CLAIM II together imply the theorem. Indeed, they
mean that under our assumptions, for every 𝑥 ∈ {0, 1}𝑛 , 𝐹 (𝑥) = 1 if
and only if
∃_{s₀,…,s_{100m−1}∈{0,1}^m} ∀_{w∈{0,1}^m} (w ∈ (S_x ⊕ s₀) ∨ w ∈ (S_x ⊕ s₁) ∨ ⋯ ∨ w ∈ (S_x ⊕ s_{100m−1}))
or equivalently, that there exist s₀, …, s_{100m−1} whose shifts of S_x cover {0,1}^m; this is a statement with a constant number of quantifiers, and so by Theorem 16.6 it can be decided in polynomial time if P = NP. To prove CLAIM I, we simply use the union bound: for every s₀, …, s_{100m−1},
|∪_{i∈[100m]}(S_x ⊕ s_i)| ≤ ∑_{i=0}^{100m−1} |S_x ⊕ s_i| = ∑_{i=0}^{100m−1} |S_x| = 100m|S_x| ,
which is strictly smaller than 2^m when |S_x| ≤ (1/(1000m))2^m.
For CLAIM II, the key observation is that z ∈ S ⊕ s if and only if s ∈ S ⊕ z, and hence if |S| ≥ (1/2)2^m then for every fixed z ∈ {0,1}^m,
Pr_{s∼{0,1}^m} [z ∈ S ⊕ s] ≥ 1/2 .   (20.10)
Proof Idea:
The proof uses an extremely useful technique known as the “prob-
abilistic method” which is not too hard mathematically but can be confusing at first.⁵ The idea is to give a “non-constructive” proof of existence of the pseudorandom generator G by showing that if G was chosen at random, then the probability that it would be a valid (T, ε) pseudorandom generator is positive. In particular this means that there exists a single G that is a valid (T, ε) pseudorandom generator. The probabilistic method is just a proof technique to demonstrate the existence of such a function. Ultimately, our goal is to show the existence of a deterministic function G that satisfies the condition.
⁵ There is a whole (highly recommended) book by Alon and Spencer devoted to this method.
⋆
$$B_P = \left\{ G \in \mathcal{F}^m_\ell \;\middle|\; \left| \frac{1}{2^\ell} \sum_{s \in \{0,1\}^\ell} P(G(s)) - \frac{1}{2^m} \sum_{r \in \{0,1\}^m} P(r) \right| > \epsilon \right\} \qquad (20.11)$$
$$\left| \frac{1}{L} \sum_{i=0}^{L-1} P(y_i) - \Pr_{s \sim \{0,1\}^m}[P(s) = 1] \right| > \epsilon \qquad (20.12)$$

is at most 2^{−𝑇²}.
✓ Chapter Recap
20.7 EXERCISES
21
Cryptography
“A good disguise should not reveal the person’s height”, Shafi Goldwasser
and Silvio Micali, 1982
to letters that occur in the alphabet with higher frequency. From this
observation, there is a short gap to completely breaking the cipher,
which was in fact done by Queen Elizabeth’s spies, who used the de-
coded letters to learn of all the co-conspirators and to convict Queen
Mary of treason, a crime for which she was executed. Trusting in su-
perficial security measures (such as using “inscrutable” symbols) is a
trap that users of cryptography have been falling into again and again
over the years. (As with many things, this is the subject of a great
XKCD cartoon, see Fig. 21.2.)
The Vigenère cipher is named after Blaise de Vigenère, who described it in a book in 1586 (though it was invented earlier by Bellaso). The idea is to use a collection of substitution ciphers: if there are 𝑛 different ciphers then the first letter of the plaintext is encoded with the first cipher, the second with the second cipher, the 𝑛-th with the 𝑛-th cipher, and then the (𝑛 + 1)-st letter is again encoded with the first cipher. The key 𝑘 is usually a word or a phrase of 𝑛 letters. The 𝑖-th substitution cipher will shift each letter by the same shift needed to get from A to 𝑘_𝑖. If 𝑘_𝑖 is C, for example, the 𝑖-th substitution cipher will shift every letter by two places. This "flattens" the frequencies and makes it much harder to do frequency analysis, which is why this cipher was considered "unbreakable" for 300+ years and got the nickname "le chiffre indéchiffrable" ("the unbreakable cipher"). Nevertheless, Charles Babbage cracked the Vigenère cipher in 1854 (though he did not publish it). In 1863 Friedrich Kasiski broke the cipher and published the result. The idea is that once you guess the length of the key, you can reduce the task to breaking a simple substitution cipher, which can be done via frequency analysis (can you see why?). Confederate generals used Vigenère regularly during the Civil War, and their messages were routinely cryptanalyzed by Union officers.

Figure 21.2: XKCD's take on the added security of using uncommon symbols
The Enigma cipher was a mechanical cipher (looking like a typewriter, see Fig. 21.5) where each letter typed would get mapped into a different letter depending on the (rather complicated) key and current state of the machine, which had several rotors that rotated at different paces. An identically wired machine at the other end could be used to decrypt. Like many ciphers in history, Enigma was believed by the Germans to be "impossible to break", and even quite late in the war they refused to believe it was broken despite mounting evidence to that effect. (In fact, some German generals refused to believe it was broken even after the war.) Breaking Enigma was a heroic effort which was initiated by the Poles and then completed by the British at Bletchley Park, with Alan Turing (of the Turing machine) playing a key role. As part of this effort the Brits built arguably the world's first large scale mechanical computation devices (though they looked more similar to washing machines than to iPhones). They were also helped along the way by some quirks and errors of the German operators. For example, the fact that their messages ended with "Heil Hitler" turned out to be quite useful.

Figure 21.3: Confederate Cipher Disk for implementing the Vigenère cipher

Figure 21.4: Confederate encryption of the message "Gen'l Pemberton: You can expect no help from this side of the river. Let Gen'l Johnston know, if possible, when you can attack the same point on the enemy's lines. Inform me also and I will endeavor to make a diversion. I have sent some caps. I subjoin a despatch from General Johnston."
Here is one entertaining anecdote: the Enigma machine would never map a letter to itself. In March 1941, Mavis Batey, a cryptanalyst at Bletchley Park, received a very long message that she tried to decrypt. She then noticed a curious property: the message did not contain the letter "L".¹ She realized that the probability that no "L"'s appeared in the message was too small for this to happen by chance. Hence she surmised that the original message must have been composed only of L's. That is, it must have been the case that the operator, perhaps to test the machine, had simply sent out a message where he repeatedly pressed the letter "L". This observation helped her decode the next message, which helped inform of a planned Italian attack and secure a resounding British victory in what became known as "the Battle of Cape Matapan". Mavis also helped break another Enigma machine. Using the information she provided, the Brits were able to feed the Germans the false information that the main allied invasion would take place in Pas de Calais rather than in Normandy. In the words of General Eisenhower, the intelligence from Bletchley Park was of "priceless value". It made a huge difference for the Allied war effort, thereby shortening World War II and saving millions of lives. See also this interview with Sir Harry Hinsley.

Figure 21.5: In the Enigma mechanical cipher the secret key would be the settings of the rotors and internal wires. As the operator typed up their message, the encrypted version appeared in the display area above, and the internal state of the cipher was updated (so typing the same letter twice would generally result in two different letters output). Decrypting follows the same process: if the sender and receiver are using the same key then typing the ciphertext would result in the plaintext appearing in the display.

¹ Here is a nice exercise: compute (up to an order of magnitude) the probability that a 50-letter long message composed of random letters will end up not containing the letter "L".
We will often write the first input (i.e., the key) to the encryp-
tion and decryption as a subscript and so can write (21.1) also as
𝐷𝑘 (𝐸𝑘 (𝑥)) = 𝑥.
Solved Exercise 21.1 — Lengths of ciphertext and plaintext. Prove that for every valid encryption scheme (𝐸, 𝐷) with functions 𝐿, 𝐶, we have 𝐶(𝑛) ≥ 𝐿(𝑛) for every 𝑛.
■
Solution:
For every fixed key 𝑘 ∈ {0, 1}^𝑛, the equation (21.1) implies that the map 𝑦 ↦ 𝐷_𝑘(𝑦) inverts the map 𝑥 ↦ 𝐸_𝑘(𝑥), which in particular means that the map 𝑥 ↦ 𝐸_𝑘(𝑥) must be one to one. Hence its codomain must be at least as large as its domain, and since its domain is {0, 1}^{𝐿(𝑛)} and its codomain is {0, 1}^{𝐶(𝑛)} it follows that 𝐶(𝑛) ≥ 𝐿(𝑛).

Figure 21.6: A private-key encryption scheme is a pair of algorithms 𝐸, 𝐷 such that for every key 𝑘 ∈ {0, 1}^𝑛 and plaintext 𝑥 ∈ {0, 1}^{𝐿(𝑛)}, 𝑦 = 𝐸_𝑘(𝑥) is a ciphertext of length 𝐶(𝑛). The encryption scheme is valid if for every such 𝑦, 𝐷_𝑘(𝑦) = 𝑥. That is, the decryption of an encryption of 𝑥 is 𝑥, as long as both encryption and decryption use the same key.
■
P
You would appreciate the subtleties of defining secu-
rity of encryption more if at this point you take a five
minute break from reading, and try (possibly with a
partner) to brainstorm on how you would mathemat-
ically define the notion that an encryption scheme is
secure, in the sense that it protects the secrecy of the
plaintext 𝑥.
A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.²

² The actual quote is "Il faut qu'il n'exige pas le secret, et qu'il puisse sans inconvénient tomber entre les mains de l'ennemi", loosely translated as "The system must not require secrecy and can be stolen by the enemy without causing trouble". According to Steve Bellovin, the NSA version is "assume that the first copy of any device we make is shipped to the Kremlin".

Why is it OK to assume the key is secret and not the algorithm? Because we can always choose a fresh key. But of course that won't help us much if our key is "1234" or "passw0rd!". In fact, if you use any deterministic algorithm to choose the key then eventually your adversary will figure this out. Therefore for security we must choose the key at random, and we can restate Kerckhoffs's principle as follows:
R
Remark 21.2 — Randomness in the real world. Choos-
ing the secrets for cryptography requires generating
randomness, which is often done by measuring some
“unpredictable” or “high entropy” data, and then
applying hash functions to the result to “extract” a
P
This definition might take more than one reading
to parse. Try to think of how this condition would
correspond to your intuitive notion of “learning no
information” about 𝑥 from observing 𝐸𝑘 (𝑥), and to
Shannon’s quote in the beginning of this chapter.
In particular, suppose that you knew ahead of time
that Alice sent either an encryption of 𝑥 or an en-
cryption of 𝑥′ . Would you learn anything new from
observing the encryption of the message that Alice
actually sent? It may help you to look at Fig. 21.7.
$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\Pr[i = 0 \wedge y = E_k(x_i)]}{\Pr[y = E_k(x_i)]} \;. \qquad (21.2)$$

$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\tfrac{1}{2} p_0(y)}{\tfrac{1}{2} p_0(y) + \tfrac{1}{2} p_1(y)} = \frac{p}{p + p} = \frac{1}{2}$$

using the fact that 𝑝_0(𝑦) = 𝑝_1(𝑦) = 𝑝. This means that observing the ciphertext 𝑦 did not help us at all! We still would not be able to guess whether Alice sent "attack" or "retreat" with better than 50/50 odds!
This example can be vastly generalized to show that perfect secrecy
is indeed “perfect” in the sense that observing a ciphertext gives Eve
no additional information about the plaintext beyond her a priori knowl-
edge.
Theorem 21.4 — One Time Pad (Vernam 1917, Shannon 1949). There is a perfectly secret valid encryption scheme (𝐸, 𝐷) with 𝐿(𝑛) = 𝐶(𝑛) = 𝑛.
Proof Idea:
Our scheme is the one-time pad also known as the “Vernam Ci-
pher”, see Fig. 21.9. The encryption is exceedingly simple: to encrypt
a message 𝑥 ∈ {0, 1}𝑛 with a key 𝑘 ∈ {0, 1}𝑛 we simply output 𝑥 ⊕ 𝑘
where ⊕ is the bitwise XOR operation that outputs the string corre-
sponding to XORing each coordinate of 𝑥 and 𝑘.
⋆
Proof of Theorem 21.4. For two binary strings 𝑎 and 𝑏 of the same
length 𝑛, we define 𝑎 ⊕ 𝑏 to be the string 𝑐 ∈ {0, 1}𝑛 such that
𝑐𝑖 = 𝑎𝑖 + 𝑏𝑖 mod 2 for every 𝑖 ∈ [𝑛]. The encryption scheme
(𝐸, 𝐷) is defined as follows: 𝐸𝑘 (𝑥) = 𝑥 ⊕ 𝑘 and 𝐷𝑘 (𝑦) = 𝑦 ⊕ 𝑘.
By the associative law of addition (which works also modulo two),
𝐷𝑘 (𝐸𝑘 (𝑥)) = (𝑥 ⊕ 𝑘) ⊕ 𝑘 = 𝑥 ⊕ (𝑘 ⊕ 𝑘) = 𝑥 ⊕ 0𝑛 = 𝑥, using the fact
that for every bit 𝜎 ∈ {0, 1}, 𝜎 + 𝜎 mod 2 = 0 and 𝜎 + 0 = 𝜎 mod 2.
Hence (𝐸, 𝐷) forms a valid encryption scheme.
To analyze the perfect secrecy property, we claim that for every
𝑥 ∈ {0, 1}𝑛 , the distribution 𝑌𝑥 = 𝐸𝑘 (𝑥) where 𝑘 ∼ {0, 1}𝑛 is simply
the uniform distribution over {0, 1}𝑛 , and hence in particular the
distributions 𝑌𝑥 and 𝑌𝑥′ are identical for every 𝑥, 𝑥′ ∈ {0, 1}𝑛 . Indeed,
for every particular 𝑦 ∈ {0, 1}𝑛 , the value 𝑦 is output by 𝑌𝑥 if and
only if 𝑦 = 𝑥 ⊕ 𝑘 which holds if and only if 𝑘 = 𝑥 ⊕ 𝑦. Since 𝑘 is
chosen uniformly at random in {0, 1}𝑛 , the probability that 𝑘 happens
to equal 𝑥 ⊕ 𝑦 is exactly 2−𝑛 , which means that every string 𝑦 is output
by 𝑌𝑥 with probability 2−𝑛 .
■
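Here is a minimal Python sketch of the one-time pad, working on bytes rather than single bits (XORing a byte just XORs its 8 bits in parallel):

import os

def xor(a: bytes, b: bytes) -> bytes:
    # bitwise XOR of two equal-length byte strings
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"attack at dawn"
key = os.urandom(len(plaintext))  # the key must be as long as the plaintext
ciphertext = xor(plaintext, key)  # E_k(x) = x XOR k

assert xor(ciphertext, key) == plaintext  # D_k(E_k(x)) = x since k XOR k = 0

Since 𝑘 is uniform, 𝑥 ⊕ 𝑘 is uniformly distributed no matter what 𝑥 is, which is exactly the perfect secrecy property proven above.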
P
The argument above is quite simple but is worth reading again. To understand why the one-time pad is perfectly secret, it is useful to envision it as a bipartite graph as we've done in Fig. 21.8. (In fact the encryption scheme of Fig. 21.8 is precisely the one-time pad for 𝑛 = 2.) For every 𝑛, the one-time pad encryption scheme corresponds to a bipartite graph with 2^𝑛 vertices on the "left side" corresponding to the plaintexts in {0, 1}^𝑛 and 2^𝑛 vertices on the "right side" corresponding to the ciphertexts {0, 1}^𝑛. For every 𝑥 ∈ {0, 1}^𝑛 and 𝑘 ∈ {0, 1}^𝑛, we connect 𝑥 to the vertex 𝑦 = 𝐸_𝑘(𝑥) with an edge that we label with 𝑘. One can see that this is the complete bipartite graph, where every vertex on the left is connected to all vertices on the right. In particular this means that for every left vertex 𝑥, the distribution on the ciphertexts obtained

Figure 21.9: In the one time pad encryption scheme we encrypt a plaintext 𝑥 ∈ {0, 1}^𝑛 with a key 𝑘 ∈ {0, 1}^𝑛 by the ciphertext 𝑥 ⊕ 𝑘 where ⊕ denotes the bitwise XOR operation.
Proof Idea:
The idea behind the proof is illustrated in Fig. 21.11. We define a
graph between the plaintexts and ciphertexts, where we put an edge
between plaintext 𝑥 and ciphertext 𝑦 if there is some key 𝑘 such that
𝑦 = 𝐸𝑘 (𝑥). The degree of this graph is at most the number of potential
keys. The fact that the degree is smaller than the number of plaintexts
(and hence of ciphertexts) implies that there would be two plaintexts
𝑥 and 𝑥′ with different sets of neighbors, and hence the distribution
of a ciphertext corresponding to 𝑥 (with a random key) will not be
identical to the distribution of a ciphertext corresponding to 𝑥′ .
⋆
How does this mesh with the fact that, as we’ve already seen, peo-
ple routinely use cryptosystems with a 16 byte (i.e., 128 bit) key but
many terabytes of plaintext? The proof of Theorem 21.5 does in fact give a way to break all these cryptosystems, but an examination of this
proof shows that it only yields an algorithm with time exponential in
the length of the key. This motivates the following relaxation of perfect
secrecy to a condition known as “computational secrecy”. Intuitively,
an encryption scheme is computationally secret if no polynomial time
algorithm can break it. The formal definition is below:
P
Definition 21.6 requires a second or third read and some practice to truly understand. One excellent exercise to make sure you follow it is to see that if we allow 𝑃 to be an arbitrary function mapping {0, 1}^{𝑚(𝑛)} to {0, 1}, and we replace the condition in (21.3) that the left-hand side is smaller than 1/𝑝(𝑛) with the condition that it is equal to 0, then we get the perfect secrecy condition of Definition 21.3. Indeed if the distributions 𝐸_𝑘(𝑥_0) and 𝐸_𝑘(𝑥_1) are identical then applying any function 𝑃 to them we get the same expectation. On the other hand, if the two distributions above give a different probability for some element 𝑦* ∈ {0, 1}^{𝑚(𝑛)}, then the function 𝑃(𝑦) that outputs 1 iff 𝑦 = 𝑦* will
Regarding the first question, it is not hard to show that if, for ex-
ample, Alice uses a computationally secret encryption algorithm to
encrypt either “attack” or “retreat” (each chosen with probability
1/2), then as long as the adversary Eve is restricted to polynomial-time algorithms, she will not be able to guess the message with probability
better than, say, 0.51, even after observing its encrypted form. (We
omit the proof, but it is an excellent exercise for you to work it out on
your own.)
To answer the second question we will show that under the same
assumption we used for derandomizing BPP, we can obtain a com-
putationally secret cryptosystem where the key is almost exponentially
smaller than the plaintext.
Definition 21.7 — Cryptographic pseudorandom generator. Let 𝐿 ∶ ℕ → ℕ be some function. A cryptographic pseudorandom generator with stretch 𝐿(⋅) is a polynomial-time computable function 𝐺 ∶ {0, 1}* → {0, 1}* such that:

$$\left| \Pr_{s \sim \{0,1\}^\ell}[C(G(s)) = 1] - \Pr_{r \sim \{0,1\}^m}[C(r) = 1] \right| < \frac{1}{p(n)} \;.$$
Proof Idea:
The proof is illustrated in Fig. 21.12. We simply take the one-time
pad on 𝐿 bit plaintexts, but replace the key with 𝐺(𝑘) where 𝑘 is a
string in {0, 1}𝑛 and 𝐺 ∶ {0, 1}𝑛 → {0, 1}𝐿 is a pseudorandom gen-
erator. Since the one time pad cannot be broken, an adversary that
breaks the derandomized one-time pad can be used to distinguish
between the output of the pseudorandom generator and the uniform
distribution.
⋆
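The following Python sketch illustrates the structure of this construction. Since we do not have a proven pseudorandom generator, SHA-256 in counter mode is used here as a heuristic stand-in for 𝐺; this is an illustration only, not a vetted implementation:

import hashlib

# Sketch of the "derandomized one-time pad": XOR the plaintext with G(k),
# where k is a short key. SHA-256 in counter mode stands in (heuristically!)
# for a cryptographic pseudorandom generator G with large stretch.
def G(key: bytes, L: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < L:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:L]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    pad = G(key, len(plaintext))
    return bytes(p ^ s for p, s in zip(plaintext, pad))

decrypt = encrypt  # XORing with G(k) twice recovers the plaintext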
Proof of Theorem 21.8. Let 𝐺 ∶ {0, 1}𝑛 → {0, 1}𝐿 for 𝐿 = 𝑛𝑎 be the
restriction to input length 𝑛 of the pseudorandom generator 𝐺 whose
$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(E_k(x))] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(E_k(x'))] \right| > \frac{1}{p(L)} \;.$$

(We use here the simple fact that for a {0, 1}-valued random variable 𝑋, Pr[𝑋 = 1] = 𝔼[𝑋].)
By the definition of our encryption scheme, this means that

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)} \;.$$

Now since (as we saw in the security analysis of the one-time pad), for every pair of strings 𝑥, 𝑥′ ∈ {0, 1}^𝐿, the distributions 𝑟 ⊕ 𝑥 and 𝑟 ⊕ 𝑥′ are identical where 𝑟 ∼ {0, 1}^𝐿, we have 𝔼_{𝑟∼{0,1}^𝐿}[𝑄(𝑟 ⊕ 𝑥)] = 𝔼_{𝑟∼{0,1}^𝐿}[𝑄(𝑟 ⊕ 𝑥′)]. Hence, subtracting and adding these equal quantities inside the absolute value,

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] + \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)} \;. \qquad (21.6)$$

(Please make sure that you can see why this is true.)
Now we can use the triangle inequality |𝐴 + 𝐵| ≤ |𝐴| + |𝐵| for every two numbers 𝐴, 𝐵, applying it to 𝐴 = 𝔼_{𝑘∼{0,1}^𝑛}[𝑄(𝐺(𝑘) ⊕ 𝑥)] − 𝔼_{𝑟∼{0,1}^𝐿}[𝑄(𝑟 ⊕ 𝑥)] and 𝐵 = 𝔼_{𝑟∼{0,1}^𝐿}[𝑄(𝑟 ⊕ 𝑥′)] − 𝔼_{𝑘∼{0,1}^𝑛}[𝑄(𝐺(𝑘) ⊕ 𝑥′)] to derive

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] \right| + \left| \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)} \;. \qquad (21.7)$$

In particular, either the first term or the second term of the left-hand side of (21.7) must be at least 1/(2𝑝(𝐿)). Let us assume the first case
holds (the second case is analyzed in exactly the same way). Then we
get that
R
Remark 21.9 — Stream ciphers in practice. The two
most widely used forms of (private key) encryption
schemes in practice are stream ciphers and block ciphers.
(To make things more confusing, a block cipher is
always used in some mode of operation and some
of these modes effectively turn a block cipher into
a stream cipher.) A block cipher can be thought of as
a sort of a “random invertible map” from {0, 1}𝑛 to
{0, 1}𝑛 , and can be used to construct a pseudorandom
generator and from it a stream cipher, or to encrypt
data directly using other modes of operation. There
are a great many other security notions and consider-
ations for encryption schemes beyond computational
secrecy. Many of those involve handling scenarios
such as chosen plaintext, man in the middle, and cho-
sen ciphertext attacks, where the adversary is not just
merely a passive eavesdropper but can influence the
communication in some way. While this chapter is
meant to give you some taste of the ideas behind cryp-
tography, there is much more to know before applying
it correctly to obtain secure applications, and a great
many people have managed to get it wrong.
Theorem 21.10 — Breaking encryption using NP algorithm. If P = NP then there is no computationally secret encryption scheme with 𝐿(𝑛) > 𝑛.
Furthermore, for every valid encryption scheme (𝐸, 𝐷) with
𝐿(𝑛) > 𝑛 + 100 there is a polynomial 𝑝 such that for every large
enough 𝑛 there exist 𝑥0 , 𝑥1 ∈ {0, 1}𝐿(𝑛) and a 𝑝(𝑛)-line NAND-
Proof Idea:
The proof follows along the lines of Theorem 21.5 but this time
paying attention to the computational aspects. If P = NP then for
every plaintext 𝑥 and ciphertext 𝑦, we can efficiently tell whether there
exists 𝑘 ∈ {0, 1}𝑛 such that 𝐸𝑘 (𝑥) = 𝑦. So, to prove this result we need
to show that if the plaintexts are long enough, there would exist a pair
𝑥0 , 𝑥1 such that the probability that a random encryption of 𝑥1 also is
a valid encryption of 𝑥0 will be very small. The details of how to show
this are below.
⋆
We will now use the following extremely simple but useful fact
known as the averaging principle (see also Lemma 18.10): for every
random variable 𝑍, if 𝔼[𝑍] = 𝜇, then with positive probability 𝑍 ≤ 𝜇.
(Indeed, if 𝑍 > 𝜇 with probability one, then the expected value of 𝑍
will have to be larger than 𝜇, just like you can’t have a class in which
all students got A or A- and yet the overall average is B+.) In our case
it means that with positive probability ∑_{𝑘∈{0,1}^𝑛} 𝑍_𝑘 ≤ 2^{2𝑛}/2^{𝐿(𝑛)}. In other words, there exists some 𝑥_1 ∈ {0, 1}^{𝐿(𝑛)} such that ∑_{𝑘∈{0,1}^𝑛} 𝑍_𝑘(𝑥_1) ≤ 2^{2𝑛}/2^{𝐿(𝑛)}. Yet this means that if we choose a random 𝑘 ∼ {0, 1}^𝑛, then the probability that 𝐸_𝑘(𝑥_1) ∈ 𝑆_0 is at most (1/2^𝑛) ⋅ 2^{2𝑛}/2^{𝐿(𝑛)} = 2^{𝑛−𝐿(𝑛)}.
obvious that if two people have never had the opportunity to prear-
range an encryption method, then they will be unable to communicate
securely over an insecure channel… I believe it is false”. The project
proposal was rejected by his professor as “not good enough”. Merkle
later submitted a paper to the Communications of the ACM where he
apologized for the lack of references since he was unable to find any
mention of the problem in the scientific literature, and the only source
where he saw the problem even raised was in a science fiction story.
The paper was rejected with the comment that “Experience shows that
it is extremely dangerous to transmit key information in the clear.”
Merkle showed that one can design a protocol where Alice and Bob
can use 𝑇 invocations of a hash function to exchange a key, but an
adversary (in the random oracle model, though he of course didn’t
use this name) would need roughly 𝑇 2 invocations to break it. He
conjectured that it may be possible to obtain such protocols where
breaking is exponentially harder than using them, but could not think of
any concrete way of doing so.
We only found out much later that in the late 1960’s, a few years
before Merkle, James Ellis of the British Intelligence agency GCHQ
was having similar thoughts. His curiosity was spurred by an old
World-War II manuscript from Bell Labs that suggested the following
way that two people could communicate securely over a phone line.
Alice would inject noise to the line, Bob would relay his messages,
and then Alice would subtract the noise to get the signal. The idea is
that an adversary over the line sees only the sum of Alice’s and Bob’s
signals, and doesn’t know what came from what. This got James Ellis
thinking whether it would be possible to achieve something like that
digitally. As Ellis later recollected, in 1970 he realized that in princi-
ple this should be possible, since he could think of a hypothetical
black box 𝐵 that on input a “handle” 𝛼 and plaintext 𝑥 would give a
“ciphertext” 𝑦 and that there would be a secret key 𝛽 corresponding
to 𝛼, such that feeding 𝛽 and 𝑦 to the box would recover 𝑥. However,
Ellis had no idea how to actually instantiate this box. He and others
kept giving this question as a puzzle to bright new recruits until one
of them, Clifford Cocks, came up in 1973 with a candidate solution
loosely based on the factoring problem; in 1974 another GCHQ re-
cruit, Malcolm Williamson, came up with a solution using modular
exponentiation.
But among all those thinking of public key cryptography, probably
the people who saw the furthest were two researchers at Stanford,
Whit Diffie and Martin Hellman. They realized that with the advent
of electronic communication, cryptography would find new applica-
tions beyond the military domain of spies and submarines, and they
understood that in this new world of many users and point to point
• Bob: Given the triple (𝑝, 𝑔, ℎ), Bob sends a message 𝑥 ∈ {0, 1}𝐿
to Alice by choosing 𝑏 at random in [𝑝], and sending to Alice the
pair (𝑔𝑏 mod 𝑝, 𝑟𝑒𝑝(ℎ𝑏 mod 𝑝) ⊕ 𝑥) where 𝑟𝑒𝑝 ∶ [𝑝] → {0, 1}∗
is some “representation function” that maps [𝑝] to {0, 1}𝐿 . (The
function 𝑟𝑒𝑝 does not need to be one-to-one and you can think of
𝑟𝑒𝑝(𝑧) as simply outputting 𝐿 of the bits of 𝑧 in the natural binary
representation; it does, however, need to satisfy certain technical conditions
which we omit in this description.)
The correctness of the protocol follows from the simple fact that
(𝑔𝑎 )𝑏 = (𝑔𝑏 )𝑎 for every 𝑔, 𝑎, 𝑏 and this still holds if we work modulo
𝑝. Its security relies on the computational assumption that computing
this map is hard, even in a certain “average case” sense (this computa-
tional assumption is known as the Decisional Diffie Hellman assump-
tion). The Diffie-Hellman key exchange protocol can be thought of as
a public key encryption where Alice’s first message is the public key,
and Bob’s message is the encryption.
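Here is a toy Python sketch of the key exchange itself. The parameters below are illustrative only: the Mersenne prime 2¹²⁷ − 1 is far too small for modern security, we have not verified that 3 generates the group, and real deployments use standardized parameters of 2048+ bits or elliptic curve groups:

import random

# Toy Diffie-Hellman key exchange (illustrative parameters only).
p = 2 ** 127 - 1   # a Mersenne prime; far too small for real-world security
g = 3              # base (a generator would be verified in practice)

a = random.randrange(2, p - 1)   # Alice's secret exponent
b = random.randrange(2, p - 1)   # Bob's secret exponent
A = pow(g, a, p)                 # Alice publishes g^a mod p
B = pow(g, b, p)                 # Bob publishes g^b mod p

# Both sides compute the same shared secret, since (g^a)^b = (g^b)^a mod p.
assert pow(B, a, p) == pow(A, b, p)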
One can think of the Diffie-Hellman protocol as being based on
a “trapdoor pseudorandom generator” where the triple 𝑔𝑎 , 𝑔𝑏 , 𝑔𝑎𝑏
looks “random” to someone that doesn’t know 𝑎, but someone that
does know 𝑎 can see that raising the second element to the 𝑎-th power
yields the third element. The Diffie-Hellman protocol can be described
abstractly in the context of any finite Abelian group for which we can
efficiently compute the group operation. It has been implemented
in groups other than the numbers modulo 𝑝, and in particular Elliptic
Curve Cryptography (ECC) is obtained by basing the Diffie Hell-
man on elliptic curve groups which gives some practical advantages.
Another common group theoretic basis for key-exchange/public key
encryption protocol is the RSA function. A big disadvantage of Diffie-
Hellman (both the modular arithmetic and elliptic curve variants)
and RSA is that both schemes can be broken in polynomial time by a
quantum computer. We will discuss quantum computing later in this
course.
21.10 MAGIC
Beyond encryption and signature schemes, cryptographers have man-
aged to obtain objects that truly seem paradoxical and “magical”. We
briefly discuss some of these objects. We do not give any details, but
hopefully this will spark your curiosity to find out more.
can be used even for computing randomized processes, with one exam-
ple being playing Poker over the net without having to trust any server
for correct shuffling of cards or not revealing the information.
✓ Chapter Recap
21.11 EXERCISES
ing the known algorithms are optimal) we need to set the prime to be
bigger (and so have larger key sizes with corresponding overhead in
communication and computation) to get the same level of security.
Zero-knowledge proofs were constructed by Goldwasser, Micali,
and Rackoff in 1982, and their wide applicability was shown (using
the theory of NP completeness) by Goldreich, Micali, and Wigderson
in 1986.
Two party and multiparty secure computation protocols were con-
structed (respectively) by Yao in 1982 and Goldreich, Micali, and
Wigderson in 1987. The latter work gave a general transformation
“Let’s not try to define knowledge, but try to define zero-knowledge.”, Shafi
Goldwasser.
• Interactive proofs
22.1 EXERCISES
23
Quantum computing
“We always have had (secret, secret, close the doors!) … a great deal of diffi-
culty in understanding the world view that quantum mechanics represents …
It has not yet become obvious to me that there’s no real problem. … Can I learn
anything from asking this question about computers–about this may or may
not be mystery as to what the world view of quantum mechanics is?” , Richard
Feynman, 1981
“The only difference between a probabilistic classical world and the equations
of the quantum world is that somehow or other it appears as if the probabilities
would have to go negative”, Richard Feynman, 1981
P
You should read the paragraphs above more than once and make sure you appreciate how truly mind boggling these results are.

Figure 23.2: The setup of the double slit experiment in the case of photon or electron guns. We see also destructive interference in the sense that there are some positions on the wall that get fewer hits when both slits are open than they get when only one of the slits is open. Image credit: Wikipedia.
23.2 QUANTUM AMPLITUDES
The double slit and other experiments ultimately forced scientists to
accept a very counterintuitive picture of the world. It is not merely
about nature being randomized, but rather it is about the probabilities
in some sense “going negative” and cancelling each other!
Specifically, consider an event that can either occur or not (e.g. “de-
tector number 17 was hit by a photon”). In classical probability, we
model this by a probability distribution over the two outcomes: a pair
of non-negative numbers 𝑝 and 𝑞 such that 𝑝 + 𝑞 = 1, where 𝑝 corre-
sponds to the probability that the event occurs and 𝑞 corresponds to
the probability that the event does not occur. In quantum mechanics,
we model this also by a pair of numbers, which we call amplitudes. This
is a pair of (potentially negative or even complex) numbers 𝛼 and 𝛽
such that |𝛼|2 + |𝛽|2 = 1. The probability that the event occurs is |𝛼|2
and the probability that it does not occur is |𝛽|2 . In isolation, these
negative or complex numbers don’t matter much, since we square
them anyway to obtain probabilities. But the interaction of positive
and negative amplitudes can result in surprising cancellations where
somehow combining two scenarios where an event happens with
positive probability results in a scenario where it never does.
P
If you don’t find the above description confusing and
unintuitive, you probably didn’t get it. Please make
sure to re-read the above paragraphs until you are
thoroughly confused.
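One way to see such a cancellation in action: consider the operation 𝑈 that maps a pair of amplitudes (𝛼, 𝛽) to ((𝛼 + 𝛽)/√2, (𝛼 − 𝛽)/√2). (This is the Hadamard operation, which we will meet again later in this chapter.) Starting from the state (1, 0), applying 𝑈 twice gives

$$\begin{pmatrix} 1 \\ 0 \end{pmatrix} \xrightarrow{U} \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{pmatrix} \xrightarrow{U} \begin{pmatrix} \tfrac{1}{2} + \tfrac{1}{2} \\ \tfrac{1}{2} - \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} .$$

After one application the event would occur with probability 1/2, but after two applications the two "paths" leading to the event carry amplitudes +1/2 and −1/2 and cancel exactly, so the event never occurs.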
R
Remark 23.1 — Complex vs real, other simplifications. If
(like the author) you are a bit intimidated by complex
numbers, don’t worry: you can think of all ampli-
tudes as real (though potentially negative) numbers
without loss of understanding. All the “magic” of
quantum computing already arises in this case, and
so we will often restrict attention to real amplitudes in
this chapter.
We will also only discuss so-called pure quantum
states, and not the more general notion of mixed states.
Pure states turn out to be sufficient for understanding
the algorithmic aspects of quantum computing.
More generally, this chapter is not meant to be a com-
plete description of quantum mechanics, quantum
information theory, or quantum computing, but rather
illustrate the main points where these differ from
classical computing.
𝑓(𝑥) ⊕ 𝑔(𝑦) = 𝑥 ∧ 𝑦
for all four choices of (𝑥, 𝑦) ∈ {0, 1}². Let's plug in all these four
choices and see what we get (below we use the equalities 𝑧 ⊕ 0 = 𝑧,
𝑧 ∧ 0 = 0 and 𝑧 ∧ 1 = 𝑧):
fact, they can succeed with probability about 0.85, see Lemma 23.5).
R
Remark 23.3 — More on quantum. The discussion in this
lecture is quite brief and somewhat superficial. The
chapter on quantum computation in my book with
Arora (see draft here) is one relatively short resource
that contains essentially everything we discuss here
and more. See also this blog post of Aaronson for a
high level explanation of Shor’s algorithm which ends
with links to several more detailed expositions. This
lecture of Aaronson contains a great discussion of
the feasibility of quantum computing (Aaronson’s
course lecture notes and the book that they spawned
are fantastic reads as well). The videos of Umesh Vazi-
rani's EdX course are an accessible and recommended
introduction to quantum computing. See the “biblio-
graphical notes” section at the end of this chapter for
more resources.
So, he asked whether one could design a quantum system such that its outcome 𝑦 based on the initial condition 𝑥 would be some function 𝑦 = 𝑓(𝑥) such that (a) we don't know how to efficiently compute in any other way, and (b) is actually useful for something.⁶ In 1985, David Deutsch formally suggested the notion of a quantum Turing machine, and the model has been since refined in works of Deutsch and Jozsa and Bernstein and Vazirani. Such a system is now known as a quantum computer.

⁶ As its title suggests, Feynman's lecture was actually focused on the other side of simulating physics with a computer. However, he mentioned that as a "side remark" one could wonder if it's possible to simulate physics with a new kind of computer - a "quantum computer" which would "not [be] a Turing machine, but a machine of a different kind". As far as I know, Feynman did not suggest that such a computer could be useful for computations completely outside the domain of quantum simulation. Indeed, he was more interested in the question of whether quantum mechanics could be simulated by a classical computer.

For a while these hypothetical quantum computers seemed useful for one of two things. First, to provide a general-purpose mechanism to simulate a variety of the real quantum systems that people care about, such as various interactions inside molecules in quantum chemistry. Second, as a challenge to the Extended Church Turing hypothesis which says that every physically realizable computation device can be modeled (up to polynomial overhead) by Turing machines (or equivalently, NAND-TM / NAND-RAM programs).
Quantum chemistry is important (and in particular understand-
ing it can be a bottleneck for designing new materials, drugs, and
more), but it is still a rather niche area within the broader context of
computing (and even scientific computing) applications. Hence for a
while most researchers (to the extent they were aware of it), thought
of quantum computers as a theoretical curiosity that has little bear-
ing on practice, given that this theoretical "extra power" of quantum computers seemed to offer little advantage in the majority of the prob-
lems people want to solve in areas such as combinatorial optimization,
machine learning, data structures, etc..
To some extent this is still true today. As far as we know, quantum computers, if built, will not provide exponential speed ups for 95% of the applications of computing.⁷ In particular, as far as we know, quantum computers will not help us solve NP complete problems in polynomial or even sub-exponential time, though Grover's algorithm (Remark 23.4) does yield a quadratic advantage in many cases.

⁷ This "95 percent" is a figure of speech, but not completely so. At the time of this writing, cryptocurrency mining electricity consumption is estimated to use up at least 70TWh or 0.3 percent of the world's production, which is about 2 to 5 percent of the total energy usage for the computing industry. All the current cryptocurrencies will be broken by quantum computers. Also, for many web servers the TLS protocol (which is based on the current non-lattice based systems and would be completely broken by quantum computing) is responsible for about 1 percent of the CPU usage.

However, there is one cryptography-sized exception: In 1994 Peter Shor showed that quantum computers can solve the integer factoring and discrete logarithm problems in polynomial time. This result has
CPU usage.
captured the imagination of a great many people, and completely
energized research into quantum computing. This is both because the
hardness of these particular problems provides the foundations for
securing such a huge part of our communications (and these days,
our economy), and because it was a powerful demonstration that
P
Please make sure you understand why performing the
operation will take a system in state 𝑝 to a system in
the state 𝐹 𝑝. Understanding the evolution of proba-
bilistic systems is a prerequisite to understanding the
evolution of quantum systems.
If your linear algebra is a bit rusty, now would be a
good time to review it, and in particular make sure
you are comfortable with the notions of matrices, vec-
tors, (orthogonal and orthonormal) bases, and norms.
real (though potentially negative), and hence often drop the absolute
value operator. (This turns out not to make much of a difference in
explanatory power.) As before, we think of 𝛼2 as the probability that
the bit equals 0 and 𝛽 2 as the probability that the bit equals 1. As we
did before, we can model the NOT operation by the map 𝑁 ∶ ℝ2 → ℝ2
where 𝑁 (𝛼, 𝛽) = (𝛽, 𝛼).
Following quantum tradition, instead of using 𝑒0 and 𝑒1 as we did
above, from now on we will denote the vector (1, 0) by |0⟩ and the
vector (0, 1) by |1⟩ (and moreover, think of these as column vectors).
This is known as the Dirac “ket” notation. This means that NOT is
the unique linear map 𝑁 ∶ ℝ2 → ℝ2 that satisfies 𝑁 |0⟩ = |1⟩ and
𝑁 |1⟩ = |0⟩. In other words, in the quantum case, as in the probabilistic
case, NOT corresponds to the matrix
$$N = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \;.$$

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix} \;.$$
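Since a single-qubit state is just a unit vector in ℝ² and operations on it are 2×2 matrices, it is easy to play with these definitions numerically. A minimal numpy sketch:

import numpy as np

ket0 = np.array([1.0, 0.0])                            # the state |0>
N = np.array([[0.0, 1.0], [1.0, 0.0]])                 # the NOT operation
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)   # the Hadamard operation

print(N @ ket0)        # [0. 1.]       : NOT|0> = |1>
print(H @ ket0)        # [0.707 0.707] : equal amplitudes for |0> and |1>
print(H @ (H @ ket0))  # [1. 0.]       : the two paths to |1> cancel out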
23.6.2 Recap
The state of a quantum system of 𝑛 qubits is modeled by a 2^𝑛-dimensional vector 𝜓 of unit norm (i.e., the squares of all coordinates sum up
to 1), which we write as 𝜓 = ∑𝑥∈{0,1}𝑛 𝜓𝑥 |𝑥⟩ where |𝑥⟩ is the col-
umn vector that has 0 in all coordinates except the one corresponding
Proof. Alice and Bob will start by preparing a 2-qubit quantum system in the state

$$\psi = \tfrac{1}{\sqrt{2}}|00\rangle + \tfrac{1}{\sqrt{2}}|11\rangle$$

(this state is known as an EPR pair). Alice takes the first qubit of the system to her room, and Bob takes the second qubit to his room. Now, when Alice receives 𝑥, if 𝑥 = 0 she does nothing and if 𝑥 = 1 she applies the unitary map 𝑅_{−𝜋/8} to her qubit, where

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

is the unitary operation corresponding to rotation in the plane by angle 𝜃. When Bob receives 𝑦, if 𝑦 = 0 he does nothing and if 𝑦 = 1 he applies the unitary map 𝑅_{𝜋/8} to his qubit. Then each one of them measures their qubit and sends this as their response.
Recall that to win the game Bob and Alice want their outputs to
be more likely to differ if 𝑥 = 𝑦 = 1 and to be more likely to agree
otherwise. We will split the analysis in one case for each of the four
possible values of 𝑥 and 𝑦.
Case 1: 𝑥 = 0 and 𝑦 = 0. If 𝑥 = 𝑦 = 0 then the state does not
change. Because the state 𝜓 is proportional to |00⟩ + |11⟩, the measure-
ments of Bob and Alice will always agree (if Alice measures 0 then the
state collapses to |00⟩ and so Bob measures 0 as well, and similarly for
1). Hence in the case 𝑥 = 𝑦 = 0, Alice and Bob always win.
Case 2: 𝑥 = 0 and 𝑦 = 1. If 𝑥 = 0 and 𝑦 = 1 then after Alice
measures her bit, if she gets 0 then the system collapses to the state
|00⟩, in which case after Bob performs his rotation, his qubit is in
the state cos(𝜋/8)|0⟩ + sin(𝜋/8)|1⟩. Thus, when Bob measures his
qubit, he will get 0 (and hence agree with Alice) with probability
cos2 (𝜋/8) ≥ 0.85. Similarly, if Alice gets 1 then the system collapses
to |11⟩, in which case after rotation Bob’s qubit will be in the state
− sin(𝜋/8)|0⟩ + cos(𝜋/8)|1⟩ and so once again he will agree with Alice
with probability cos2 (𝜋/8).
The analysis for Case 3, where 𝑥 = 1 and 𝑦 = 0, is completely analogous to Case 2. Hence Alice and Bob will agree with probability cos²(𝜋/8) in this case as well.¹²

Case 4: 𝑥 = 1 and 𝑦 = 1. For the case that 𝑥 = 1 and 𝑦 = 1, after both Alice and Bob perform their rotations, the state will be proportional to

¹² We are using the (not too hard) observation that the result of this experiment is the same regardless of the order in which Alice and Bob apply their rotations and measurements.
R
Remark 23.6 — Quantum vs probabilistic strategies. It
is instructive to understand what about quantum
mechanics enabled this gain in Bell’s Inequality.
Consider the following analogous probabilistic strat-
egy for Alice and Bob. They agree that each one of
them will output 0 if they get 0 as input and output 1
with probability 𝑝 if they get 1 as input. In this case
one can see that their success probability would be (1/4)⋅1 + (1/2)(1 − 𝑝) + (1/4)[2𝑝(1 − 𝑝)] = 0.75 − 0.5𝑝² ≤ 0.75. The quantum strategy we described above can be thought of as a variant of the probabilistic strategy for parameter 𝑝 set to sin²(𝜋/8) ≈ 0.15. But in the case 𝑥 = 𝑦 = 1, instead of disagreeing only with probability 2𝑝(1 − 𝑝) = 1/4, we can use the negative probabilities in the quantum world and rotate the state in opposite directions. Therefore, the probability of disagreement ends up being sin²(𝜋/4) = 0.5.
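The four cases can also be checked numerically. The following numpy sketch computes the exact success probability of the strategy from the proof above by applying the rotations to the EPR pair and summing the outcome probabilities (for this particular strategy it prints about 0.80, already beating the classical bound of 0.75 discussed in the remark):

import numpy as np

def R(theta):
    # Rotation of the plane by angle theta.
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

psi = np.zeros(4)
psi[0b00] = psi[0b11] = 1 / np.sqrt(2)   # the EPR pair (|00> + |11>)/sqrt(2)

total = 0.0
for x in (0, 1):
    for y in (0, 1):
        A = R(-np.pi / 8) if x == 1 else np.eye(2)   # Alice's rotation
        B = R(np.pi / 8) if y == 1 else np.eye(2)    # Bob's rotation
        probs = (np.kron(A, B) @ psi) ** 2           # outcome probabilities
        if x == 1 and y == 1:
            win = probs[0b01] + probs[0b10]          # they should disagree
        else:
            win = probs[0b00] + probs[0b11]          # they should agree
        total += win / 4                             # inputs are uniform
print(total)   # about 0.8018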
$$U_{NAND} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}$$
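As a quick sanity check, a few lines of numpy (using the convention that the basis state |𝑎𝑏𝑐⟩ has index 4𝑎 + 2𝑏 + 𝑐) confirm that this matrix maps |𝑎𝑏𝑐⟩ to |𝑎𝑏(𝑐 ⊕ 𝑁𝐴𝑁𝐷(𝑎, 𝑏))⟩:

import numpy as np

# U_NAND as a permutation of the 8 basis states.
U = np.eye(8)[[1, 0, 3, 2, 5, 4, 6, 7]]
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            src = np.zeros(8)
            src[4 * a + 2 * b + c] = 1
            target = 4 * a + 2 * b + (c ^ (1 - (a & b)))   # c XOR NAND(a,b)
            assert (U @ src)[target] == 1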
$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix} \;.$$
$$\mathrm{HAD}_i \sum_{x \in \{0,1\}^n} v_x |x\rangle = \frac{1}{\sqrt{2}} \sum_{x \in \{0,1\}^n} v_x \, |x_0 \cdots x_{i-1}\rangle \left( |0\rangle + (-1)^{x_i}|1\rangle \right) |x_{i+1} \cdots x_{n-1}\rangle \;.$$
$$\sum_{y \in \{0,1\}^m \text{ s.t. } y_{m-1} = f(x)} |v_y|^2 \;\geq\; \frac{2}{3} \;.$$
P
Please stop here and see that this definition makes
sense to you.
R
Remark 23.9 — The obviously exponential fallacy. A
priori it might seem “obvious” that quantum com-
puting is exponentially powerful, since to perform a
quantum computation on 𝑛 bits we need to maintain
the 2𝑛 dimensional state vector and apply 2𝑛 × 2𝑛 ma-
trices to it. Indeed popular descriptions of quantum
computing (too) often say something along the lines
that the difference between quantum and classical
computers is that a classical bit can either be zero or
one while a qubit can be in both states at once, and
so in many qubits a quantum computer can perform
exponentially many computations at once.
Depending on how you interpret it, this description
is either false or would apply equally well to proba-
bilistic computation, even though we’ve already seen
that every randomized algorithm can be simulated by
a similar-sized circuit, and in fact we conjecture that
BPP = P.
Moreover, this “obvious” approach for simulating
a quantum computation will take not just exponen-
tial time but exponential space as well, while it can be
shown that using a simple recursive formula one can
calculate the final quantum state using polynomial
space (in physics this is known as “Feynman path inte-
grals”). So, the exponentially long vector description
by itself does not imply that quantum computers are
exponentially powerful. Indeed, we cannot prove that
they are (i.e., as far as we know, every QNAND-CIRC
program could be simulated by a NAND-CIRC pro-
gram with polynomial overhead), but we do have
some problems (integer factoring most prominently)
for which they do provide exponential speedup over
the currently best known classical (deterministic or
probabilistic) algorithms.
Definition 23.10 — The class BQP. Let 𝐹 ∶ {0, 1}∗ → {0, 1}. We say that
𝐹 ∈ BQP if there exists a polynomial time NAND-TM program 𝑃
such that for every 𝑛, 𝑃 (1𝑛 ) is the description of a quantum circuit
𝐶𝑛 that computes the restriction of 𝐹 to {0, 1}𝑛 .
P
One way to verify that you’ve understood these def-
initions is to see that you can prove (1) P ⊆ BQP
and in fact the stronger statement BPP ⊆ BQP, (2)
BQP ⊆ EXP, and (3) For every NP-complete function
𝐹 , if 𝐹 ∈ BQP then NP ⊆ BQP. Exercise 23.1 asks you
to work these out.
The relation between NP and BQP is not known (see also Re-
mark 23.4). It is widely believed that NP ⊈ BQP, but there is no
consensus whether or not BQP ⊆ NP. It is quite possible that these
two classes are incomparable, in the sense that NP ⊈ BQP (and in par-
ticular no NP-complete function belongs to BQP) but also BQP ⊈ NP
(and there are some interesting candidates for such problems).
It can be shown that QNANDEVAL (evaluating a quantum circuit
on an input) is computable by a polynomial size QNAND-CIRC pro-
gram, and moreover this program can even be generated uniformly
and hence QNANDEVAL is in BQP. This allows us to “port” many
of the results of classical computational complexity into the quantum
realm as well.
R
Remark 23.11 — Restricting attention to circuits. Because
the non-uniform model is a little cleaner to work with,
in the rest of this chapter we mostly restrict attention
to this model, though all the algorithms we discuss
can be implemented in uniform computation as well.
roughly $2^{\tilde{O}(n^{1/3})}$ time, where the $\tilde{O}$ notation hides factors that are
Step 2: Period finding via the Quantum Fourier Transform. Using a simple trick known as "repeated squaring", it is possible to compute the map 𝑥 ↦ 𝐹_𝐴(𝑥) in time polynomial in 𝑚, which means we can also compute this map using a polynomial number of NAND gates, and so in particular we can generate in polynomial quantum time a quantum state 𝜌 that is (up to normalization) equal to

$$\sum_{x \in \{0,1\}^m} |x\rangle |F_A(x)\rangle \;.$$
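For concreteness, here is the repeated-squaring trick in a few lines of Python (Python's built-in pow implements the same idea):

def mod_exp(A: int, x: int, M: int) -> int:
    # Compute A^x mod M with O(log x) multiplications ("repeated squaring").
    result = 1
    base = A % M
    while x > 0:
        if x & 1:                       # current bit of the exponent is 1:
            result = result * base % M  # multiply in the power A^(2^i)
        base = base * base % M          # square: A^(2^i) -> A^(2^(i+1))
        x >>= 1
    return result

assert mod_exp(7, 2 ** 20 + 3, 1000003) == pow(7, 2 ** 20 + 3, 1000003)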
$$f_{A,y}(x) = \begin{cases} 1 & y = A^x \ (\mathrm{mod}\ M) \\ 0 & \text{otherwise} \end{cases} \;.$$
R
Remark 23.13 — Quantum Fourier Transform. Despite
its name, the Quantum Fourier Transform does not
actually give a way to compute the Fourier Trans-
form of a function 𝑓 ∶ {0, 1}𝑚 → ℝ. This would be
impossible to do in time polynomial in 𝑚, as simply
writing down the Fourier Transform would require 2𝑚
coefficients. Rather the Quantum Fourier Transform
gives a quantum state where the amplitude correspond-
ing to an element (think: frequency) ℎ is equal to
the corresponding Fourier coefficient. This allows us to
sample from a distribution where ℎ is drawn with
probability proportional to the square of its Fourier
coefficient. This is not the same as computing the
R
Remark 23.14 — Group theory. While we define the con-
cepts we use, some background in group or number
theory might be quite helpful for fully understanding
this section.
We will not use anything more than the basic proper-
ties of finite Abelian groups. Specifically we use the
following notions:
$$f = \sum_{g \in \mathbb{G}} \hat{f}(g) \chi_g \;, \qquad (23.2)$$

$$f = \sum_{y \in \{0,1\}^n} \hat{f}(y) \chi_y$$

$$\sum_{y \in \{0,1\}^n} \hat{f}(y) |y\rangle$$

where $f = \sum_y \hat{f}(y)\chi_y$ and $\chi_y : \{0,1\}^n \to \mathbb{C}$ is the function $\chi_y(x) = (-1)^{\sum_i x_i y_i}$.
Proof Idea:
The idea behind the proof is that the Hadamard operation corre-
sponds to the Fourier transform over the group {0, 1}𝑛 (with the XOR
operations). To show this, we just need to do the calculations.
⋆
$$\mathrm{HAD}|a\rangle = \tfrac{1}{\sqrt{2}}\left( |0\rangle + (-1)^a |1\rangle \right) \;.$$

$$\rho = \sum_{x \in \{0,1\}^n} f(x) |x\rangle \;.$$

$$2^{-n/2} \sum_{x \in \{0,1\}^n} f(x) \prod_{i=0}^{n-1} \left( |0\rangle + (-1)^{x_i} |1\rangle \right) \;.$$
We can now use the distributive law and open up a term of the
form
But by changing the order of summations, we see that the final state
is
$$\hat{f}(y) = \frac{1}{\sqrt{L}} \sum_{x \in \mathbb{Z}_L} f(x) \omega^{xy} \;. \qquad (23.4)$$

$$\hat{f}(y) = \frac{1}{\sqrt{L}} \sum_{z \in \mathbb{Z}_{L/2}} f(2z) (\omega^2)^{yz} + \frac{\omega^y}{\sqrt{L}} \sum_{z \in \mathbb{Z}_{L/2}} f(2z+1) (\omega^2)^{yz} \qquad (23.5)$$
Specifically, the Fourier characters of the group ℤ_{𝐿/2} are the functions 𝜒_𝑦(𝑥) = 𝑒^{2𝜋𝑖𝑦𝑥/(𝐿/2)} = (𝜔²)^{𝑦𝑥} for every 𝑥, 𝑦 ∈ ℤ_{𝐿/2}. Moreover, since 𝜔^𝐿 = 1, (𝜔²)^𝑦 = (𝜔²)^{𝑦 mod 𝐿/2} for every 𝑦 ∈ ℕ. Thus (23.5) translates into

$$\hat{f}(y) = \hat{f}_{even}(y \bmod L/2) + \omega^y \hat{f}_{odd}(y \bmod L/2) \;.$$
This observation is usually used to obtain a fast (e.g., 𝑂(𝐿 log 𝐿) time) algorithm to compute the Fourier transform in a classical setting, but it can also be used to obtain a quantum circuit of 𝑝𝑜𝑙𝑦(log 𝐿) gates to transform a state of the form $\sum_{x \in \mathbb{Z}_L} f(x)|x\rangle$ to a state of the form $\sum_{y \in \mathbb{Z}_L} \hat{f}(y)|y\rangle$.
The case that 𝐿 is not an exact power of two causes some complica-
tions in both the classical case of the Fast Fourier Transform and the
quantum setting of Shor’s algorithm. However, it is possible to handle
these. The idea is that we can embed ℤ_𝐿 in the group ℤ_{𝐴⋅𝐿} for any
integer 𝐴, and we can find an integer 𝐴 such that 𝐴 ⋅ 𝐿 will be close
enough to a power of 2 (i.e., a number of the form 2𝑚 for some 𝑚), so
that if we do the Fourier transform over the group ℤ2𝑚 then we will
not introduce too many errors.
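The recursion above is exactly what the classical Fast Fourier Transform implements. Here is a compact Python sketch for 𝐿 a power of two (using the unnormalized convention, i.e., omitting the 1/√𝐿 factor of (23.4)):

import cmath

def fft(f):
    # Fast Fourier Transform over Z_L for L a power of two, using
    # fhat(y) = fhat_even(y mod L/2) + w^y * fhat_odd(y mod L/2).
    L = len(f)
    if L == 1:
        return list(f)
    even = fft(f[0::2])   # transform of z -> f(2z)
    odd = fft(f[1::2])    # transform of z -> f(2z+1)
    w = cmath.exp(2j * cmath.pi / L)
    return [even[y % (L // 2)] + w ** y * odd[y % (L // 2)] for y in range(L)]

# Sanity check against the direct O(L^2) definition fhat(y) = sum_x f(x) w^(xy):
f = [1, 2, 3, 4]
direct = [sum(f[x] * cmath.exp(2j * cmath.pi * x * y / 4) for x in range(4))
          for y in range(4)]
assert all(abs(u - v) < 1e-9 for u, v in zip(fft(f), direct))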
✓ Chapter Recap
23.12 EXERCISES
R
Remark 23.16 — Disclaimer. Most of the exercises have
been written in the summer of 2018 and haven’t yet
been fully debugged. While I would prefer people
do not post online solutions to the exercises, I would
greatly appreciate if you let me know of any bugs. You
can do so by posting a GitHub issue about the exercise.
Exercise 23.1 — Quantum and classical complexity class relations. Prove the following relations between quantum complexity classes and classical ones:

1. P_{/poly} ⊆ BQP_{/poly}.²¹

2. P ⊆ BQP.²²

3. BPP ⊆ BQP.²³

4. BQP ⊆ EXP.²⁴

5. If SAT ∈ BQP then NP ⊆ BQP.²⁵

²¹ Hint: You can use 𝑈_{𝑁𝐴𝑁𝐷} to simulate NAND gates.
²² Hint: Use the alternative characterization of P as in Solved Exercise 13.4.
²³ Hint: You can use the HAD gate to simulate a coin toss.
²⁴ Hint: In exponential time, simulating quantum computation boils down to matrix multiplication.
²⁵ Hint: If a reduction can be implemented in P, it can be implemented in BQP as well.
Exercise 23.2 — Discrete logarithm from order finding. Show a probabilistic polynomial time classical algorithm that, given an Abelian finite group 𝔾 (in the form of an algorithm that computes the group operation), a generator 𝑔 for the group, and an element ℎ ∈ 𝔾, as well as access to a black box that on input 𝑓 ∈ 𝔾 outputs the order of 𝑓 (the smallest 𝑎 such that 𝑓^𝑎 = 1), computes the discrete logarithm of ℎ with respect to 𝑔. That is, the algorithm should output a number 𝑥 such that 𝑔^𝑥 = ℎ. See footnote for hint.²⁶
■

²⁶ We are given ℎ = 𝑔^𝑥 and need to recover 𝑥. To do so we can compute the order of various elements of the form ℎ^𝑎𝑔^𝑏. The order of such an element is a number 𝑐 satisfying 𝑐(𝑥𝑎 + 𝑏) = 0 (mod |𝔾|). With a few random examples we will get a non-trivial equation on 𝑥 (where 𝑐 is not zero modulo |𝔾|) and then we can use our knowledge of 𝑎, 𝑏, 𝑐 to recover 𝑥.
23.13 BIBLIOGRAPHICAL NOTES
Chapters 9 and 10 in the book Quantum Computing Since Democritus
give an informal but highly informative introduction to the topics
of this lecture and much more. Shor’s and Simon’s algorithms are
also covered in Chapter 10 of my book with Arora on computational
complexity.
There are many excellent videos available online covering some
of these materials. The Fourier transform is covered in these videos of Dr. Chris Geoscience, Clare Zhang and Vi Hart. More specifically to quantum computing, the videos of Umesh Vazirani on the Quantum Fourier Transform and Kelsey Houston-Edwards on Shor's Algorithm are highly recommended.
Chapter 10 in Avi Wigderson’s book gives a high level overview of
quantum computing. Andrew Childs’ lecture notes on quantum algo-
rithms, as well as the lecture notes of Umesh Vazirani, John Preskill,
and John Watrous
23.15 ACKNOWLEDGEMENTS
Thanks to Scott Aaronson for many helpful comments about this
chapter.
VI
APPENDICES
Bibliography