
INFORMATION,

RANDOMNESS &
INCOMPLETENESS
Papers on Algorithmic
Information Theory
Second Edition

G J Chaitin
IBM, P O Box 704
Yorktown Heights, NY 10598
[email protected]

September 30, 1997


This collection of reprints was published by World Scientific in Singapore. The first edition appeared in 1987, and the second edition appeared in 1990. This is the second edition with an updated bibliography.
Acknowledgments

The author and the publisher are grateful to the following for permission to reprint the papers included in this volume.

• Academic Press, Inc. (Adv. Appl. Math.)
• American Mathematical Society (AMS Notices)
• Association for Computing Machinery (J. ACM, SICACT News, SIGACT News)
• Cambridge University Press (Algorithmic Information Theory)
• Elsevier Science Publishers (Theor. Comput. Sci.)
• IBM (IBM J. Res. Dev.)
• IEEE (IEEE Trans. Info. Theory, IEEE Symposia Abstracts)
• N. Ikeda, Osaka University (Osaka J. Math.)
• John Wiley & Sons, Inc. (Encyclopedia of Statistical Sci., Commun. Pure Appl. Math.)
• I. Kalantari, Western Illinois University (Recursive Function Theory: Newsletter)
• MIT Press (The Maximum Entropy Formalism)
• Pergamon Journals Ltd. (Comp. Math. Applic.)
• Plenum Publishing Corp. (Int. J. Theor. Phys.)
• I. Prigogine, Université Libre de Bruxelles (Mondes en Développement)
• Springer-Verlag (Open Problems in Communication and Computation)
• Verlag Kammerer & Unverzagt (The Universal Turing Machine: A Half-Century Survey)
• W. H. Freeman and Company (Sci. Amer.)
Preface

God not only plays dice in quantum mechanics, but even with the whole numbers! The discovery of randomness in arithmetic is presented in my book Algorithmic Information Theory published by Cambridge University Press. There I show that to decide if an algebraic equation in integers has finitely or infinitely many solutions is in some cases absolutely intractable. I exhibit an infinite series of such arithmetical assertions that are random arithmetical facts, and for which it is essentially the case that the only way to prove them is to assume them as axioms. This extreme form of Gödel's incompleteness theorem shows that some arithmetical truths are totally impervious to reasoning.
The papers leading to this result were published over a period of more than twenty years in widely scattered journals, but because of their unity of purpose they fall together naturally into the present book, intended as a companion volume to my Cambridge University Press monograph. I hope that it will serve as a stimulus for work on complexity, randomness and unpredictability, in physics and biology as well as in metamathematics.
For the second edition, I have added the article "Randomness in arithmetic" (Part I), a collection of abstracts (Part VII), a bibliography (Part VIII), and, as an Epilogue, two essays which have not been published elsewhere that assess the impact of algorithmic information theory on mathematics and biology, respectively. I should also like to point out that it is straightforward to apply to LISP the techniques used in Part VI to study bounded-transfer Turing machines. A few footnotes have been added to Part VI, but the subject richly deserves book-length treatment, and I intend to write a book about LISP in the near future.[1]
Gregory Chaitin

[1. LISP program-size complexity is discussed at length in my book Information-Theoretic Incompleteness, published by World Scientific in 1992.]
Contents

I Introductory/Tutorial/Survey Papers 9
Randomness and mathematical proof 11
Randomness in arithmetic 29
On the difficulty of computations 41
Information-theoretic computational complexity 57
Algorithmic information theory 75
Algorithmic information theory 83

II Applications to Metamathematics 109
Gödel's theorem and information 111
Randomness and Gödel's theorem 131
An algebraic equation for the halting probability 137
Computing the busy beaver function 145

III Applications to Biology 151
To a mathematical definition of "life" 153
Toward a mathematical definition of "life" 165

IV Technical Papers on Self-Delimiting Programs 195
A theory of program size formally identical to information theory 197
Incompleteness theorems for random reals 225
Algorithmic entropy of sets 261

V Technical Papers on Blank-Endmarker Programs 289
Information-theoretic limitations of formal systems 291
A note on Monte Carlo primality tests and algorithmic information theory 335
Information-theoretic characterizations of recursive infinite strings 345
Program size, oracles, and the jump operation 351

VI Technical Papers on Turing Machines & LISP 367
On the length of programs for computing finite binary sequences 369
On the length of programs for computing finite binary sequences: statistical considerations 411
On the simplicity and speed of programs for computing infinite sets of natural numbers 435

VII Abstracts 461
On the length of programs for computing finite binary sequences by bounded-transfer Turing machines 463
On the length of programs for computing finite binary sequences by bounded-transfer Turing machines II 465
Computational complexity and Gödel's incompleteness theorem 467
Computational complexity and Gödel's incompleteness theorem 469
Information-theoretic aspects of the Turing degrees 473
Information-theoretic aspects of Post's construction of a simple set 475
On the difficulty of generating all binary strings of complexity less than n 477
On the greatest natural number of definitional or information complexity ≤ n 479
A necessary and sufficient condition for an infinite binary string to be recursive 481
There are few minimal descriptions 483
Information-theoretic computational complexity 485
A theory of program size formally identical to information theory 487
Recent work on algorithmic information theory 489

VIII Bibliography 491
Publications of G J Chaitin 493
Discussions of Chaitin's work 499

Epilogue 503
Undecidability & randomness in pure mathematics 503
Algorithmic information & evolution 517

About the author 529


Part I
Introductory/Tutorial/Survey
Papers

RANDOMNESS AND
MATHEMATICAL PROOF
Scientific American 232, No. 5
(May 1975), pp. 47–52

by Gregory J. Chaitin

Abstract
Although randomness can be precisely defined and can even be measured, a given number cannot be proved to be random. This enigma establishes a limit to what is possible in mathematics.

Almost everyone has an intuitive notion of what a random number is. For example, consider these two series of binary digits:
01010101010101010101
01101100110111100010

The first is obviously constructed according to a simple rule; it consists of the number 01 repeated ten times. If one were asked to speculate on how the series might continue, one could predict with considerable confidence that the next two digits would be 0 and 1. Inspection of the second series of digits yields no such comprehensive pattern. There is no obvious rule governing the formation of the number, and there is no rational way to guess the succeeding digits. The arrangement seems haphazard; in other words, the sequence appears to be a random assortment of 0's and 1's.
The second series of binary digits was generated by flipping a coin 20 times and writing a 1 if the outcome was heads and a 0 if it was tails. Tossing a coin is a classical procedure for producing a random number, and one might think at first that the provenance of the series alone would certify that it is random. This is not so. Tossing a coin 20 times can produce any one of 2^20 (or a little more than a million) binary series, and each of them has exactly the same probability. Thus it should be no more surprising to obtain the series with an obvious pattern than to obtain the one that seems to be random; each represents an event with a probability of 2^(-20). If origin in a probabilistic event were made the sole criterion of randomness, then both series would have to be considered random, and indeed so would all others, since the same mechanism can generate all the possible series. The conclusion is singularly unhelpful in distinguishing the random from the orderly.
Clearly a more sensible definition of randomness is required, one that does not contradict the intuitive concept of a "patternless" number. Such a definition has been devised only in the past 10 years. It does not consider the origin of a number but depends entirely on the characteristics of the sequence of digits. The new definition enables us to describe the properties of a random number more precisely than was formerly possible, and it establishes a hierarchy of degrees of randomness. Of perhaps even greater interest than the capabilities of the definition, however, are its limitations. In particular the definition cannot help to determine, except in very special cases, whether or not a given series of digits, such as the second one above, is in fact random or only seems to be random. This limitation is not a flaw in the definition; it is a consequence of a subtle but fundamental anomaly in the foundation of mathematics. It is closely related to a famous theorem devised and proved in 1931 by Kurt Gödel, which has come to be known as Gödel's incompleteness theorem. Both the theorem and the recent discoveries concerning the nature of randomness help to define the boundaries that constrain certain mathematical methods.

Algorithmic Definition
The new definition of randomness has its heritage in information theory, the science, developed mainly since World War II, that studies the transmission of messages. Suppose you have a friend who is visiting a planet in another galaxy, and that sending him telegrams is very expensive. He forgot to take along his tables of trigonometric functions, and he has asked you to supply them. You could simply translate the numbers into an appropriate code (such as the binary numbers) and transmit them directly, but even the most modest tables of the six functions have a few thousand digits, so that the cost would be high. A much cheaper way to convey the same information would be to transmit instructions for calculating the tables from the underlying trigonometric formulas, such as Euler's equation e^(ix) = cos x + i sin x. Such a message could be relatively brief, yet inherent in it is all the information contained in even the largest tables.
Suppose, on the other hand, your friend is interested not in trigonometry but in baseball. He would like to know the scores of all the major-league games played since he left the earth some thousands of years before. In this case it is most unlikely that a formula could be found for compressing the information into a short message; in such a series of numbers each digit is essentially an independent item of information, and it cannot be predicted from its neighbors or from some underlying rule. There is no alternative to transmitting the entire list of scores.
In this pair of whimsical messages is the germ of a new definition of randomness. It is based on the observation that the information embodied in a random series of numbers cannot be "compressed," or reduced to a more compact form. In formulating the actual definition it is preferable to consider communication not with a distant friend but with a digital computer. The friend might have the wit to make inferences about numbers or to construct a series from partial information or from vague instructions. The computer does not have that capacity, and for our purposes that deficiency is an advantage. Instructions given the computer must be complete and explicit, and they must enable it to proceed step by step without requiring that it comprehend the result of any part of the operations it performs. Such a program of instructions is an algorithm. It can demand any finite number of mechanical manipulations of numbers, but it cannot ask for judgments about their meaning.
The definition also requires that we be able to measure the information content of a message in some more precise way than by the cost of sending it as a telegram. The fundamental unit of information is the "bit," defined as the smallest item of information capable of indicating a choice between two equally likely things. In binary notation one bit is equivalent to one digit, either a 0 or a 1.
We are now able to describe more precisely the differences between the two series of digits presented at the beginning of this article:
01010101010101010101
01101100110111100010
The first could be specified to a computer by a very simple algorithm, such as "Print 01 ten times." If the series were extended according to the same rule, the algorithm would have to be only slightly larger; it might be made to read, for example, "Print 01 a million times." The number of bits in such an algorithm is a small fraction of the number of bits in the series it specifies, and as the series grows larger the size of the program increases at a much slower rate.
For the second series of digits there is no corresponding shortcut. The most economical way to express the series is to write it out in full, and the shortest algorithm for introducing the series into a computer would be "Print 01101100110111100010." If the series were much larger (but still apparently patternless), the algorithm would have to be expanded to the corresponding size. This "incompressibility" is a property of all random numbers; indeed, we can proceed directly to define randomness in terms of incompressibility: A series of numbers is random if the smallest algorithm capable of specifying it to a computer has about the same number of bits of information as the series itself.
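To make the contrast concrete, here is a minimal sketch (my own illustration, not part of the article) that uses an off-the-shelf compressor as a crude, machine-dependent upper bound on the size of "the smallest algorithm"; the true minimal program size is not computable, so a compressor is the best one can do in practice:

```python
import random
import zlib

def compressed_bytes(s: str) -> int:
    # Size of a zlib encoding of s: a rough upper bound on the size of
    # the smallest description of s on this particular "machine".
    return len(zlib.compress(s.encode("ascii"), 9))

patterned = "01" * 500                                          # simple rule
coin_flips = "".join(random.choice("01") for _ in range(1000))  # no rule

print(compressed_bytes(patterned))   # a few dozen bytes: highly compressible
print(compressed_bytes(coin_flips))  # stays near its ~1000 bits of entropy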
This definition was independently proposed about 1965 by A. N. Kolmogorov of the Academy of Science of the U.S.S.R. and by me, when I was an undergraduate at the City College of the City University of New York. Both Kolmogorov and I were then unaware of related proposals made in 1960 by Ray J. Solomonoff of the Zator Company in an endeavor to measure the simplicity of scientific theories. During the past decade we and others have continued to explore the meaning of randomness. The original formulations have been improved and the feasibility of the approach has been amply confirmed.

Model of Inductive Method


The algorithmic definition of randomness provides a new foundation for the theory of probability. By no means does it supersede classical probability theory, which is based on an ensemble of possibilities, each of which is assigned a probability. Rather, the algorithmic approach complements the ensemble method by giving precise meaning to concepts that had been intuitively appealing but that could not be formally adopted.
The ensemble theory of probability, which originated in the 17th century, remains today of great practical importance. It is the foundation of statistics, and it is applied to a wide range of problems in science and engineering. The algorithmic theory also has important implications, but they are primarily theoretical. The area of broadest interest is its amplification of Gödel's incompleteness theorem. Another application (which actually preceded the formulation of the theory itself) is in Solomonoff's model of scientific induction.
Solomonoff represented a scientist's observations as a series of binary digits. The scientist seeks to explain these observations through a theory, which can be regarded as an algorithm capable of generating the series and extending it, that is, predicting future observations. For any given series of observations there are always several competing theories, and the scientist must choose among them. The model demands that the smallest algorithm, the one consisting of the fewest bits, be selected. Stated another way, this rule is the familiar formulation of Occam's razor: Given differing theories of apparently equal merit, the simplest is to be preferred.
Thus in the Solomonoff model a theory that enables one to understand a series of observations is seen as a small computer program that reproduces the observations and makes predictions about possible future observations. The smaller the program, the more comprehensive the theory and the greater the degree of understanding. Observations that are random cannot be reproduced by a small program and therefore cannot be explained by a theory. In addition the future behavior of a random system cannot be predicted. For random data the most compact way for the scientist to communicate his observations is for him to publish them in their entirety.
Defining randomness or the simplicity of theories through the capabilities of the digital computer would seem to introduce a spurious element into these essentially abstract notions: the peculiarities of the particular computing machine employed. Different machines communicate through different computer languages, and a set of instructions expressed in one of those languages might require more or fewer bits when the instructions are translated into another language. Actually, however, the choice of computer matters very little. The problem can be avoided entirely simply by insisting that the randomness of all numbers be tested on the same machine. Even when different machines are employed, the idiosyncrasies of various languages can readily be compensated for. Suppose, for example, someone has a program written in English and wishes to utilize it with a computer that reads only French. Instead of translating the algorithm itself he could preface the program with a complete English course written in French. Another mathematician with a French program and an English machine would follow the opposite procedure. In this way only a fixed number of bits need be added to the program, and that number grows less significant as the size of the series specified by the program increases. In practice a device called a compiler often makes it possible to ignore the differences between languages when one is addressing a computer.
Since the choice of a particular machine is largely irrelevant, we can choose for our calculations an ideal computer. It is assumed to have unlimited storage capacity and unlimited time to complete its calculations. Input to and output from the machine are both in the form of binary digits. The machine begins to operate as soon as the program is given it, and it continues until it has finished printing the binary series that is the result. The machine then halts. Unless an error is made in the program, the computer will produce exactly one output for any given program.

Minimal Programs and Complexity


Any specified series of numbers can be generated by an infinite number of algorithms. Consider, for example, the three-digit decimal series 123. It could be produced by an algorithm such as "Subtract 1 from 124 and print the result," or "Subtract 2 from 125 and print the result," or an infinity of other programs formed on the same model. The programs of greatest interest, however, are the smallest ones that will yield a given numerical series. The smallest programs are called minimal programs; for a given series there may be only one minimal program or there may be many.
Any minimal program is necessarily random, whether or not the series it generates is random. This conclusion is a direct result of the way we have defined randomness. Consider the program P, which is a minimal program for the series of digits S. If we assume that P is not random, then by definition there must be another program, P′, substantially smaller than P that will generate it. We can then produce S by the following algorithm: "From P′ calculate P, then from P calculate S." This program is only a few bits longer than P′, and thus it must be substantially shorter than P. P is therefore not a minimal program.
The minimal program is closely related to another fundamental concept in the algorithmic theory of randomness: the concept of complexity. The complexity of a series of digits is the number of bits that must be put into a computing machine in order to obtain the original series as output. The complexity is therefore equal to the size in bits of the minimal programs of the series. Having introduced this concept, we can now restate our definition of randomness in more rigorous terms: A random series of digits is one whose complexity is approximately equal to its size in bits.
The notion of complexity serves not only to define randomness but also to measure it. Given several series of numbers each having n digits, it is theoretically possible to identify all those of complexity n − 1, n − 10, n − 100 and so forth and thereby to rank the series in decreasing order of randomness. The exact value of complexity below which a series is no longer considered random remains somewhat arbitrary. The value ought to be set low enough for numbers with obviously random properties not to be excluded and high enough for numbers with a conspicuous pattern to be disqualified, but to set a particular numerical value is to judge what degree of randomness constitutes actual randomness. It is this uncertainty that is reflected in the qualified statement that the complexity of a random series is approximately equal to the size of the series.

Properties of Random Numbers


The methods of the algorithmic theory of probability can illuminate many of the properties of both random and nonrandom numbers. The frequency distribution of digits in a series, for example, can be shown to have an important influence on the randomness of the series. Simple inspection suggests that a series consisting entirely of either 0's or 1's is far from random, and the algorithmic approach confirms that conclusion. If such a series is n digits long, its complexity is approximately equal to the logarithm to the base 2 of n. (The exact value depends on the machine language employed.) The series can be produced by a simple algorithm such as "Print 0 n times," in which virtually all the information needed is contained in the binary numeral for n. The size of this number is about log2 n bits. Since for even a moderately long series the logarithm of n is much smaller than n itself, such numbers are of low complexity; their intuitively perceived pattern is mathematically confirmed.
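A quick numerical check of this logarithmic growth (a sketch; the fixed overhead of the "Print 0 n times" instruction itself is machine-dependent and omitted here):

```python
# The description "Print 0 n times" needs only the bits of n itself,
# about log2(n), plus a fixed overhead for the instruction.
for n in (100, 10_000, 1_000_000):
    description_bits = len(bin(n)) - 2   # bits in the binary numeral for n
    print(n, description_bits)           # 100 -> 7, 10000 -> 14, 1000000 -> 20
```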
Another binary series that can be profitably analyzed in this way is one where 0's and 1's are present with relative frequencies of three-fourths and one-fourth. If the series is of size n, it can be demonstrated that its complexity is no greater than four-fifths n, that is, a program that will produce the series can be written in 4n/5 bits. This maximum applies regardless of the sequence of the digits, so that no series with such a frequency distribution can be considered very random. In fact, it can be proved that in any long binary series that is random the relative frequencies of 0's and 1's must be very close to one-half. (In a random decimal series the relative frequency of each digit is, of course, one-tenth.)
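A reconstruction of where the four-fifths figure can come from (the article states it without derivation, so this is my sketch): a series of length n with about n/4 ones is one of roughly C(n, n/4) possibilities, and naming its index in that list costs about n·H(1/4) bits, where H is the binary entropy function:

```latex
\log_2 \binom{n}{n/4} \approx n\,H(\tfrac{1}{4}), \qquad
H(p) = -p \log_2 p - (1-p) \log_2 (1-p), \qquad
H(\tfrac{1}{4}) \approx 0.81 .
```

The resulting 0.81n matches, up to rounding and lower-order terms, the four-fifths n quoted above; the exact constant depends on the machine employed.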
Numbers having a nonrandom frequency distribution are exceptional. Of all the possible n-digit binary numbers there is only one, for example, that consists entirely of 0's and only one that is all 1's. All the rest are less orderly, and the great majority must, by any reasonable standard, be called random. To choose an arbitrary limit, we can calculate the fraction of all n-digit binary numbers that have a complexity of less than n − 10. There are 2^1 programs one digit long that might generate an n-digit series; there are 2^2 programs two digits long that could yield such a series, 2^3 programs three digits long and so forth, up to the longest programs permitted within the allowed complexity; of these there are 2^(n−11). The sum of this series (2^1 + 2^2 + … + 2^(n−11)) is equal to 2^(n−10) − 2. Hence there are fewer than 2^(n−10) programs of size less than n − 10, and since each of these programs can specify no more than one series of digits, fewer than 2^(n−10) of the 2^n numbers have a complexity less than n − 10. Since 2^(n−10)/2^n = 1/1024, it follows that of all the n-digit binary numbers only about one in 1,000 have a complexity less than n − 10. In other words, only about one series in 1,000 can be compressed into a computer program more than 10 digits smaller than itself.
A necessary corollary of this calculation is that more than 999 of every 1,000 n-digit binary numbers have a complexity equal to or greater than n − 10. If that degree of complexity can be taken as an appropriate test of randomness, then almost all n-digit numbers are in fact random. If a fair coin is tossed n times, the probability is greater than .999 that the result will be random to this extent. It would therefore seem easy to exhibit a specimen of a long series of random digits; actually it is impossible to do so.
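The geometric sum in this counting argument is easy to verify numerically (a minimal check; n = 50 is chosen arbitrarily):

```python
# Count all programs shorter than n-10 bits and compare with 2^(n-10).
n = 50
programs = sum(2**k for k in range(1, n - 10))   # 2^1 + ... + 2^(n-11)
assert programs == 2**(n - 10) - 2
print(programs / 2**n)   # about 1/1024: the fraction of n-bit series
                         # that can even possibly have complexity < n-10
```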
Formal Systems
It can readily be shown that a specific series of digits is not random; it is sufficient to find a program that will generate the series and that is substantially smaller than the series itself. The program need not be a minimal program for the series; it need only be a small one. To demonstrate that a particular series of digits is random, on the other hand, one must prove that no small program for calculating it exists.
It is in the realm of mathematical proof that Gödel's incompleteness theorem is such a conspicuous landmark; my version of the theorem predicts that the required proof of randomness cannot be found. The consequences of this fact are just as interesting for what they reveal about Gödel's theorem as they are for what they indicate about the nature of random numbers.
Gödel's theorem represents the resolution of a controversy that preoccupied mathematicians during the early years of the 20th century. The question at issue was: "What constitutes a valid proof in mathematics and how is such a proof to be recognized?" David Hilbert had attempted to resolve the controversy by devising an artificial language in which valid proofs could be found mechanically, without any need for human insight or judgment. Gödel showed that there is no such perfect language.
Hilbert established a finite alphabet of symbols, an unambiguous grammar specifying how a meaningful statement could be formed, a finite list of axioms, or initial assumptions, and a finite list of rules of inference for deducing theorems from the axioms or from other theorems. Such a language, with its rules, is called a formal system.
A formal system is defined so precisely that a proof can be evaluated by a recursive procedure involving only simple logical and arithmetical manipulations. In other words, in the formal system there is an algorithm for testing the validity of proofs. Today, although not in Hilbert's time, the algorithm could be executed on a digital computer and the machine could be asked to "judge" the merits of the proof.
Because of Hilbert's requirement that a formal system have a proof-checking algorithm, it is possible in theory to list one by one all the theorems that can be proved in a particular system. One first lists in alphabetical order all sequences of symbols one character long and applies the proof-testing algorithm to each of them, thereby finding all theorems (if any) whose proofs consist of a single character. One then tests all the two-character sequences of symbols, and so on. In this way all potential proofs can be checked, and eventually all theorems can be discovered in order of the size of their proofs. (The method is, of course, only a theoretical one; the procedure is too lengthy to be practical.)
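This enumeration is mechanical enough to sketch directly (my illustration; theorem_proved_by is a hypothetical stand-in for the system's proof-checking algorithm, which Hilbert's requirement guarantees exists):

```python
from itertools import count, product

SYMBOLS = "01"   # stand-in alphabet; a real formal system has more symbols

def theorem_proved_by(candidate: str):
    # Hypothetical proof checker: if `candidate` is a valid proof in the
    # formal system, return the theorem it establishes, otherwise None.
    raise NotImplementedError

def theorems_in_order_of_proof_size():
    # Enumerate every string of symbols in order of length, keeping the
    # ones that check out as proofs, exactly as described above.
    for length in count(1):
        for chars in product(SYMBOLS, repeat=length):
            theorem = theorem_proved_by("".join(chars))
            if theorem is not None:
                yield theorem
```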

Unprovable Statements
Gödel showed in his 1931 proof that Hilbert's plan for a completely systematic mathematics cannot be fulfilled. He did this by constructing an assertion about the positive integers in the language of the formal system that is true but that cannot be proved in the system. The formal system, no matter how large or how carefully constructed it is, cannot encompass all true theorems and is therefore incomplete. Gödel's technique can be applied to virtually any formal system, and it therefore demands the surprising and, for many, discomforting conclusion that there can be no definitive answer to the question "What is a valid proof?"
Gödel's proof of the incompleteness theorem is based on the paradox of Epimenides the Cretan, who is said to have averred, "All Cretans are liars" [see "Paradox," by W. V. Quine; Scientific American, April, 1962]. The paradox can be rephrased in more general terms as "This statement is false," an assertion that is true if and only if it is false and that is therefore neither true nor false. Gödel replaced the concept of truth with that of provability and thereby constructed the sentence "This statement is unprovable," an assertion that, in a specific formal system, is provable if and only if it is false. Thus either a falsehood is provable, which is forbidden, or a true statement is unprovable, and hence the formal system is incomplete. Gödel then applied a technique that uniquely numbers all statements and proofs in the formal system and thereby converted the sentence "This statement is unprovable" into an assertion about the properties of the positive integers. Because this transformation is possible, the incompleteness theorem applies with equal cogency to all formal systems in which it is possible to deal with the positive integers [see "Gödel's Proof," by Ernest Nagel and James R. Newman; Scientific American, June, 1956].
The intimate association between Gödel's proof and the theory of random numbers can be made plain through another paradox, similar in form to the paradox of Epimenides. It is a variant of the Berry paradox, first published in 1908 by Bertrand Russell. It reads: "Find the smallest positive integer which to be specified requires more characters than there are in this sentence." The sentence has 114 characters (counting spaces between words and the period but not the quotation marks), yet it supposedly specifies an integer that, by definition, requires more than 114 characters to be specified.
As before, in order to apply the paradox to the incompleteness theorem it is necessary to remove it from the realm of truth to the realm of provability. The phrase "which requires" must be replaced by "which can be proved to require," it being understood that all statements will be expressed in a particular formal system. In addition the vague notion of "the number of characters required to specify" an integer can be replaced by the precisely defined concept of complexity, which is measured in bits rather than characters.
The result of these transformations is the following computer program: "Find a series of binary digits that can be proved to be of a complexity greater than the number of bits in this program." The program tests all possible proofs in the formal system in order of their size until it encounters the first one proving that a specific binary sequence is of a complexity greater than the number of bits in the program. Then it prints the series it has found and halts. Of course, the paradox in the statement from which the program was derived has not been eliminated. The program supposedly calculates a number that no program its size should be able to calculate. In fact, the program finds the first number that it can be proved incapable of finding.
The absurdity of this conclusion merely demonstrates that the program will never find the number it is designed to look for. In a formal system one cannot prove that a particular series of digits is of a complexity greater than the number of bits in the program employed to specify the series.
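The structure of this paradoxical program can be sketched as follows (my illustration; proof_shows_complexity_exceeds is a hypothetical wrapper around the formal system's proof checker, and SELF_SIZE_BITS stands for the program's own size, which a real construction obtains by self-reference):

```python
from itertools import count, product

SELF_SIZE_BITS = 10_000   # hypothetical: the size in bits of this whole
                          # program, axioms and proof checker included

def proof_shows_complexity_exceeds(proof: str, bound: int):
    # Hypothetical: check `proof` with the formal system's algorithm; if
    # it validly proves that some specific bit string s has complexity
    # greater than `bound`, return s, otherwise return None.
    raise NotImplementedError

def berry_program():
    # Test all possible proofs in order of size, halting on the first one
    # that certifies a series more complex than this very program.
    for length in count(1):
        for chars in product("01", repeat=length):
            s = proof_shows_complexity_exceeds("".join(chars), SELF_SIZE_BITS)
            if s is not None:
                return s   # the text's argument shows this is never reached
```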
A further generalization can be made about this paradox. It is not the number of bits in the program itself that is the limiting factor but the number of bits in the formal system as a whole. Hidden in the program are the axioms and rules of inference that determine the behavior of the system and provide the algorithm for testing proofs. The information content of these axioms and rules can be measured and can be designated the complexity of the formal system. The size of the entire program therefore exceeds the complexity of the formal system by a fixed number of bits c. (The actual value of c depends on the machine language employed.) The theorem proved by the paradox can therefore be stated as follows: In a formal system of complexity n it is impossible to prove that a particular series of binary digits is of complexity greater than n + c, where c is a constant that is independent of the particular system employed.

Limits of Formal Systems


Since complexity has been defined as a measure of randomness, this theorem implies that in a formal system no number can be proved to be random unless the complexity of the number is less than that of the system itself. Because all minimal programs are random the theorem also implies that a system of greater complexity is required in order to prove that a program is a minimal one for a particular series of digits.
The complexity of the formal system has such an important bearing on the proof of randomness because it is a measure of the amount of information the system contains, and hence of the amount of information that can be derived from it. The formal system rests on axioms: fundamental statements that are irreducible in the same sense that a minimal program is. (If an axiom could be expressed more compactly, then the briefer statement would become a new axiom and the old one would become a derived theorem.) The information embodied in the axioms is thus itself random, and it can be employed to test the randomness of other data. The randomness of some numbers can therefore be proved, but only if they are smaller than the formal system. Moreover, any formal system is of necessity finite, whereas any series of digits can be made arbitrarily large. Hence there will always be numbers whose randomness cannot be proved.
The endeavor to define and measure randomness has greatly clarified the significance and the implications of Gödel's incompleteness theorem. That theorem can now be seen not as an isolated paradox but as a natural consequence of the constraints imposed by information theory. In 1946 Hermann Weyl said that the doubt induced by such discoveries as Gödel's theorem had been "a constant drain on the enthusiasm and determination with which I pursued my research work." From the point of view of information theory, however, Gödel's theorem does not appear to give cause for depression. Instead it seems simply to suggest that in order to progress, mathematicians, like investigators in other sciences, must search for new axioms.

Illustrations

Algorithmic definition of randomness
(a) 10100 → Computer → 11111111111111111111
(b) 01101100110111100010 → Computer → 01101100110111100010
Algorithmic definition of randomness relies on the capabilities and limitations of the digital computer. In order to produce a particular output, such as a series of binary digits, the computer must be given a set of explicit instructions that can be followed without making intellectual judgments. Such a program of instructions is an algorithm. If the desired output is highly ordered (a), a relatively small algorithm will suffice; a series of twenty 1's, for example, might be generated by some hypothetical computer from the program 10100, which is the binary notation for the decimal number 20. For a random series of digits (b) the most concise program possible consists of the series itself. The smallest programs capable of generating a particular series are called the minimal programs of the series; the size of these programs, measured in bits, or binary digits, is the complexity of the series. A series of digits is defined as random if the series' complexity approaches its size in bits.
Formal systems
Alphabet, Grammar, Axioms, Rules of Inference
↓
Computer
↓
Theorem 1, Theorem 2, Theorem 3, Theorem 4, Theorem 5, …
Formal systems devised by David Hilbert contain an algorithm that mechanically checks the validity of all proofs that can be formulated in the system. The formal system consists of an alphabet of symbols in which all statements can be written; a grammar that specifies how the symbols are to be combined; a set of axioms, or principles accepted without proof; and rules of inference for deriving theorems from the axioms. Theorems are found by writing all the possible grammatical statements in the system and testing them to determine which ones are in accord with the rules of inference and are therefore valid proofs. Since this operation can be performed by an algorithm it could be done by a digital computer. In 1931 Kurt Gödel demonstrated that virtually all formal systems are incomplete: in each of them there is at least one statement that is true but that cannot be proved.

Inductive reasoning
Observations: 0101010101
Predictions: 01010101010101010101
Theory: Ten repetitions of 01
Size of Theory: 21 characters
Predictions: 01010101010000000000
Theory: Five repetitions of 01 followed by ten 0's
Size of Theory: 42 characters
Inductive reasoning as it is employed in science was analyzed mathematically by Ray J. Solomonoff. He represented a scientist's observations as a series of binary digits; the observations are to be explained and new ones are to be predicted by theories, which are regarded as algorithms instructing a computer to reproduce the observations. (The programs would not be English sentences but binary series, and their size would be measured not in characters but in bits.) Here two competing theories explain the existing data; Occam's razor demands that the simpler, or smaller, theory be preferred. The task of the scientist is to search for minimal programs. If the data are random, the minimal programs are no more concise than the observations and no theory can be formulated.

Random sequences
[Illustration: a graph of the number of n-digit sequences as a function of their complexity. The curve grows roughly exponentially from about 0 to about 2^n as the complexity goes from 0 to n.]
Random sequences of binary digits make up the majority of all such sequences. Of the 2^n series of n digits, most are of a complexity that is within a few bits of n. As complexity decreases, the number of series diminishes in a roughly exponential manner. Orderly series are rare; there is only one, for example, that consists of n 1's.

Three paradoxes
Russell Paradox
Consider the set of all sets that are not members of themselves. Is this set a member of itself?
Epimenides Paradox
Consider this statement: "This statement is false." Is this statement true?
Berry Paradox
Consider this sentence: "Find the smallest positive integer which to be specified requires more characters than there are in this sentence." Does this sentence specify a positive integer?
Three paradoxes delimit what can be proved. The first, devised by Bertrand Russell, indicated that informal reasoning in mathematics can yield contradictions, and it led to the creation of formal systems. The second, attributed to Epimenides, was adapted by Gödel to show that even within a formal system there are true statements that are unprovable. The third leads to the demonstration that a specific number cannot be proved random.

Unprovable statements
(a) This statement is unprovable.
(b) The complexity of 01101100110111100010 is greater than 15 bits.
(c) The series of digits 01101100110111100010 is random.
(d) 10100 is a minimal program for the series 11111111111111111111.
Unprovable statements can be shown to be false, if they are false, but they cannot be shown to be true. A proof that "This statement is unprovable" (a) reveals a self-contradiction in a formal system. The assignment of a numerical value to the complexity of a particular number (b) requires a proof that no smaller algorithm for generating the number exists; the proof could be supplied only if the formal system itself were more complex than the number. Statements labeled c and d are subject to the same limitation, since the identification of a random number or a minimal program requires the determination of complexity.

Further Reading
• A Profile of Mathematical Logic. Howard DeLong. Addison-Wesley, 1970.
• Theories of Probability: An Examination of Foundations. Terrence L. Fine. Academic Press, 1973.
• Universal Gambling Schemes and the Complexity Measures of Kolmogorov and Chaitin. Thomas M. Cover. Technical Report No. 12, Statistics Department, Stanford University, 1974.
• "Information-Theoretic Limitations of Formal Systems." Gregory J. Chaitin in Journal of the Association for Computing Machinery, Vol. 21, pages 403–424; July, 1974.
RANDOMNESS IN
ARITHMETIC
Scientific American 259, No. 1
(July 1988), pp. 80–85

by Gregory J. Chaitin

Gregory J. Chaitin is on the staff of the IBM Thomas J. Watson Research Center in Yorktown Heights, N.Y. He is the principal architect of algorithmic information theory and has just published two books in which the theory's concepts are applied to elucidate the nature of randomness and the limitations of mathematics. This is Chaitin's second article for Scientific American.

Abstract
It is impossible to prove whether each member of a family of algebraic equations has a finite or an infinite number of solutions: the answers vary randomly and therefore elude mathematical reasoning.


What could be more certain than the fact that 2 plus 2 equals 4? Since the time of the ancient Greeks mathematicians have believed there is little, if anything, as unequivocal as a proved theorem. In fact, mathematical statements that can be proved true have often been regarded as a more solid foundation for a system of thought than any maxim about morals or even physical objects. The 17th-century German mathematician and philosopher Gottfried Wilhelm Leibniz even envisioned a "calculus" of reasoning such that all disputes could one day be settled with the words "Gentlemen, let us compute!" By the beginning of this century symbolic logic had progressed to such an extent that the German mathematician David Hilbert declared that all mathematical questions are in principle decidable, and he confidently set out to codify once and for all the methods of mathematical reasoning.
Such blissful optimism was shattered by the astonishing and profound discoveries of Kurt Gödel and Alan M. Turing in the 1930's. Gödel showed that no finite set of axioms and methods of reasoning could encompass all the mathematical properties of the positive integers. Turing later couched Gödel's ingenious and complicated proof in a more accessible form. He showed that Gödel's incompleteness theorem is equivalent to the assertion that there can be no general method for systematically deciding whether a computer program will ever halt, that is, whether it will ever cause the computer to stop running. Of course, if a particular program does cause the computer to halt, that fact can be easily proved by running the program. The difficulty lies in proving that an arbitrary program never halts.
I have recently been able to take a further step along the path laid out by Gödel and Turing. By translating a particular computer program into an algebraic equation of a type that was familiar even to the ancient Greeks, I have shown that there is randomness in the branch of pure mathematics known as number theory. My work indicates that, to borrow Einstein's metaphor, God sometimes plays dice with whole numbers!
This result, which is part of a body of work called algorithmic information theory, is not a cause for pessimism; it does not portend anarchy or lawlessness in mathematics. (Indeed, most mathematicians continue working on problems as before.) What it means is that mathematical laws of a different kind might have to apply in certain situations: statistical laws. In the same way that it is impossible to predict the exact moment at which an individual atom undergoes radioactive decay, mathematics is sometimes powerless to answer particular questions. Nevertheless, physicists can still make reliable predictions about averages over large ensembles of atoms. Mathematicians may in some cases be limited to a similar approach.

My work is a natural extension of Turing's, but whereas Turing considered whether or not an arbitrary program would ever halt, I consider the probability that any general-purpose computer will stop running if its program is chosen completely at random. What do I mean when I say "chosen completely at random"? Since at the most fundamental level any program can be reduced to a sequence of bits (each of which can take on the value 0 or 1) that are "read" and "interpreted" by the computer hardware, I mean that a completely random program consisting of n bits could just as well be the result of flipping a coin n times (in which a "heads" represents a 0 and a "tails" represents 1, or vice versa).
The probability that such a completely random program will halt, which I have named omega (Ω), can be expressed in terms of a real number between 0 and 1. (The statement Ω = 0 would mean that no random program will ever halt, and Ω = 1 would mean that every random program halts. For a general-purpose computer neither of these extremes is actually possible.) Because Ω is a real number, it can be fully expressed only as an unending sequence of digits. In base 2 such a sequence would amount to an infinite string of 0's and 1's.
Perhaps the most interesting characteristic of Ω is that it is algorithmically random: it cannot be compressed into a program (considered as a string of bits) shorter than itself. This definition of randomness, which has a central role in algorithmic information theory, was independently formulated in the mid-1960's by the late A. N. Kolmogorov and me. (I have since had to correct the definition.)
The basic idea behind the definition is a simple one. Some sequences of bits can be compressed into programs much shorter than they are, because they follow a pattern or rule. For example, a 200-bit sequence of the form 0101010101… can be greatly compressed by describing it as "100 repetitions of 01." Such sequences certainly are not random. A 200-bit sequence generated by tossing a coin, on the other hand, cannot be compressed, since in general there is no pattern to the succession of 0's and 1's: it is a completely random sequence.
Of all the possible sequences of bits, most are incompressible and therefore random. Since a sequence of bits can be considered to be a base-2 representation of any real number (if one allows infinite sequences), it follows that most real numbers are in fact random. It is not difficult to show that an algorithmically random number, such as Ω, exhibits the usual statistical properties one associates with randomness. One such property is normality: every possible digit appears with equal frequency in the number. In a base-2 representation this means that as the number of digits of Ω approaches infinity, 0 and 1 respectively account for exactly 50 percent of Ω's digits.
A key technical point that must be stipulated in order for Ω to make sense is that an input program must be self-delimiting: its total length (in bits) must be given within the program itself. (This seemingly minor point, which paralyzed progress in the field for nearly a decade, is what entailed the redefinition of algorithmic randomness.) Real programming languages are self-delimiting, because they provide constructs for beginning and ending a program. Such constructs allow a program to contain well-defined subprograms, which may also have other subprograms nested in them. Because a self-delimiting program is built up by concatenating and nesting self-delimiting subprograms, a program is syntactically complete only when the last open subprogram is closed. In essence the beginning and ending constructs for programs and subprograms function respectively like left and right parentheses in mathematical expressions.
If programs were not self-delimiting, they could not be constructed from subprograms, and summing the halting probabilities for all programs would yield an infinite number. If one considers only self-delimiting programs, not only is Ω limited to the range between 0 and 1 but also it can be explicitly calculated "in the limit from below." That is to say, it is possible to calculate an infinite sequence of rational numbers (which can be expressed in terms of a finite sequence of bits) each of which is closer to the true value of Ω than the preceding number.
One way to do this is to systematically calculate Ω_n for increasing values of n; Ω_n is the probability that a completely random program up to n bits in size will halt within n seconds if the program is run on a given computer. Since there are 2^k possible programs that are k bits long, Ω_n can in principle be calculated by determining for every value of k between 1 and n how many of the possible programs actually halt within n seconds, multiplying that number by 2^(-k) and then summing all the products. In other words, each k-bit program that halts contributes 2^(-k) to Ω_n; programs that do not halt contribute 0.
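In code, the procedure reads roughly as follows (a sketch of the scheme just described; halts_within is a hypothetical stand-in for running the chosen machine, and the enumeration is hopelessly slow in practice):

```python
from itertools import product

def halts_within(program_bits: str, seconds: int) -> bool:
    # Hypothetical: run the chosen general-purpose computer on this
    # program for at most `seconds` seconds and report whether it halts.
    # Bit strings that are not valid self-delimiting programs are taken
    # to never halt, so they contribute nothing to the sum.
    raise NotImplementedError

def omega_n(n: int) -> float:
    # Lower bound on Omega, as in the text: every k-bit program
    # (1 <= k <= n) that halts within n seconds contributes 2^(-k).
    total = 0.0
    for k in range(1, n + 1):
        halted = sum(halts_within("".join(bits), n)
                     for bits in product("01", repeat=k))
        total += halted * 2.0 ** (-k)
    return total
```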
If one were miraculously given the value of Ω with k bits of precision, one could calculate a sequence of Ω_n's until one reached a value that equaled the given value of Ω. At this point one would know all programs of a size less than k bits that halt; in essence one would have solved Turing's halting problem for all programs of a size less than k bits. Of course, the time required for the calculation would be enormous for reasonable values of k.

So far I have been referring exclusively to computers and their programs in discussing the halting problem, but it took on a new dimension in light of the work of J. P. Jones of the University of Calgary and Y. V. Matijasevič of the V. A. Steklov Institute of Mathematics in Leningrad. Their work provides a method for casting the problem as assertions about particular diophantine equations. These algebraic equations, which involve only multiplication, addition and exponentiation of whole numbers, are named after the third-century Greek mathematician Diophantos of Alexandria.
To be more specific, by applying the method of Jones and Matijasevič one can equate the statement that a particular program does not halt with the assertion that one of a particular family of diophantine equations has no solution in whole numbers. As with the original version of the halting problem for computers, it is easy to prove a solution exists: all one has to do is to plug in the correct numbers and verify that the resulting numbers on the left and right sides of the equal sign are in fact equal. The much more difficult problem is to prove that there are absolutely no solutions when this is the case.
The family of equations is constructed from a basic equation that contains a particular variable k, called the parameter, which takes on the values 1, 2, 3 and so on. Hence there is an infinitely large family of equations (one for each value of k) that can be generated from one basic equation for each of a "family" of programs. The mathematical assertion that the diophantine equation with parameter k has no solution encodes the assertion that the kth computer program never halts. On the other hand, if the kth program does halt, then the equation has exactly one solution. In a sense the truth or falsehood of assertions of this type is mathematically uncertain, since it varies unpredictably as the parameter k takes on different values.
My approach to the question of unpredictability in mathematics is similar, but it achieves a much greater degree of randomness. Instead of "arithmetizing" computer programs that may or may not halt as a family of diophantine equations, I apply the method of Jones and Matijasevič to arithmetize a single program to calculate the kth bit in Ω_n.

The method is based on a curious property of the parity of binomial coefficients (whether they are even or odd numbers) that was noticed by Édouard A. Lucas a century ago but was not properly appreciated until now. Binomial coefficients are the multiplicands of the powers of x that arise when one expands expressions of the type (x + 1)^n. These coefficients can easily be computed by constructing what is known as Pascal's triangle.
Lucas's theorem asserts that the coefficient of x^k in the expansion of (x + 1)^n is odd only if each digit in the base-2 representation of the number k is less than or equal to the corresponding digit in the base-2 representation of n (starting from the right and reading left). To put it a little more simply, the coefficient for x^k in an expansion of (x + 1)^n is odd if for every bit of k that is a 1 the corresponding bit of n is also a 1; otherwise the coefficient is even. For example, the coefficient of x^2 in the binomial expansion of (x + 1)^4 is 6, which is even. Hence the 1 in the base-2 representation of 2 (10) is not matched with a 1 in the same position in the base-2 representation of 4 (100).
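Bitwise, Lucas's criterion says the coefficient is odd exactly when k AND n equals k. A short check of this (an illustration of the theorem as stated, not Chaitin's own code):

```python
from math import comb

def coeff_is_odd(n: int, k: int) -> bool:
    # Lucas's parity criterion: the coefficient of x^k in (x + 1)^n is
    # odd exactly when every 1-bit of k is matched by a 1-bit of n,
    # i.e. when k AND n == k bitwise.
    return (n & k) == k

# Verify the criterion against directly computed binomial coefficients.
for n in range(64):
    for k in range(n + 1):
        assert coeff_is_odd(n, k) == (comb(n, k) % 2 == 1)

print(coeff_is_odd(4, 2))   # False: C(4,2) = 6 is even, as in the text
```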
Although the arithmetization is conceptually simple and elegant, it is a substantial programming task to carry through the construction. Nevertheless, I thought it would be fun to do it. I therefore developed a "compiler" program for producing equations from programs for a register machine. A register machine is a computer that consists of a small set of registers for storing arbitrarily large numbers. It is an abstraction, of course, since any real computer has registers with a limited capacity.
Feeding a register-machine program that executes instructions in the LISP computer language, as input, into a real computer programmed with the compiler yields within a few minutes, as output, an equation about 200 pages long containing about 17,000 nonnegative integer variables. I can thus derive a diophantine equation having a parameter k that encodes the kth bit of Ω_n merely by plugging a LISP program (in binary form) for calculating the kth bit of Ω_n into the 200-page equation. For any given pair of values of k and n, the diophantine equation has exactly one solution if the kth bit of Ω_n is a 1, and it has no solution if the kth bit of Ω_n is a 0.
Because this applies for any pair of values for k and n, one can in principle keep k fixed and systematically increase the value of n without limit, calculating the kth bit of Ω_n for each value of n. For small values of n the kth bit of Ω_n will fluctuate erratically between 0 and 1. Eventually, however, it will settle on either a 0 or a 1, since for very large values of n it will be equal to the kth bit of Ω, which is immutable. Hence the diophantine equation actually has infinitely many solutions for a particular value of its parameter k if the kth bit of Ω turns out to be a 1, and for similar reasons it has only finitely many solutions if the kth bit of Ω turns out to be a 0. In this way, instead of considering whether a diophantine equation has any solutions for each value of its parameter k, I ask whether it has infinitely many solutions.

Although it might seem that there is little to be gained by asking whether there are infinitely many solutions instead of whether there are any solutions, there is in fact a critical distinction: the answers to my question are logically independent. Two mathematical assertions are logically independent if it is impossible to derive one from the other, that is, if neither is a logical consequence of the other. This notion of independence should be distinguished from the one used in statistics. There two chance events are said to be independent if the outcome
of one has no bearing on the outcome of the other. For example, the
result of tossing a coin in no way affects the result of the next toss: the
results are statistically independent.
In my approach I bring both notions of independence to bear. The
answer to my question for one value of k is logically independent of the
answer for another value of k. The reason is that the individual bits of
Ω, which determine the answers, are statistically independent.
Although it is easy to show that for about half of the values of k the number of solutions is finite and for the other half the number of solutions is infinite, there is no possible way to compress the answers into a formula or set of rules; they mimic the results of coin tosses. Because Ω is algorithmically random, even knowing the answers for 1,000 values of k would not help one to give the correct answer for another value of k. A mathematician could do no better than a gambler tossing a coin in deciding whether a particular equation had a finite or an infinite
number of solutions. Whatever axioms and proofs one could apply to find the answer for the diophantine equation with one value of k, they would be inapplicable for the same equation with another value of k.
Mathematical reasoning is therefore essentially helpless in such a case, since there are no logical interconnections between the diophantine equations generated in this way. No matter how bright one is or how long the proofs and how complicated the mathematical axioms are, the infinite series of propositions stating whether the number of solutions of the diophantine equations is finite or infinite will quickly defeat one as k increases. Randomness, uncertainty and unpredictability occur even in the elementary branches of number theory that deal with diophantine equations.

How have the incompleteness theorem of Gödel, the halting problem of Turing and my own work affected mathematics? The fact is that most mathematicians have shrugged off the results. Of course, they agree in principle that any finite set of axioms is incomplete, but in practice they dismiss the fact as not applying directly to their work. Unfortunately,
however, it may sometimes apply. Although Gödel's original theorem
seemed to apply only to unusual mathematical propositions that were
not likely to be of interest in practice, algorithmic information theory
has shown that incompleteness and randomness are natural and per-
vasive. This suggests to me that the possibility of searching for new
axioms applying to the whole numbers should perhaps be taken more
seriously.
Indeed, the fact that many mathematical problems have remained
unsolved for hundreds and even thousands of years tends to support my
contention. Mathematicians steadfastly assume that the failure to solve
these problems lies strictly within themselves, but could the fault not
lie in the incompleteness of their axioms? For example, the question of
whether there are any perfect odd numbers has defied an answer since
the time of the ancient Greeks. (A perfect number is a number that
is exactly the sum of its divisors, excluding itself. Hence 6 is a perfect
number, since 6 equals 1 plus 2 plus 3.) Could it be that the statement
"There are no odd perfect numbers" is unprovable? If it is, perhaps
mathematicians had better accept it as an axiom.
This may seem like a ridiculous suggestion to most mathematicians,
but to a physicist or a biologist it may not seem so absurd. To those
who work in the empirical sciences the usefulness of a hypothesis, and not necessarily its "self-evident truth," is the key criterion by which to
judge whether it should be regarded as the basis for a theory. If there
are many conjectures that can be settled by invoking a hypothesis,
empirical scientists take the hypothesis seriously. (The nonexistence of odd perfect numbers does not appear to have significant implications and would therefore not be a useful axiom by this criterion.)
Actually in a few cases mathematicians have already taken unproved
but useful conjectures as a basis for their work. The so-called Riemann
hypothesis, for instance, is often accepted as being true, even though
it has never been proved, because many other important theorems are
based on it. Moreover, the hypothesis has been tested empirically by
means of the most powerful computers, and none has come up with a
single counterexample. Indeed, computer programs (which, as I have
indicated, are equivalent to mathematical statements) are also tested
in this way|by verifying a number of test cases rather than by rigorous
mathematical proof.

Are there other problems in other fields of science that can benefit from these insights into the foundations of mathematics? I believe algorithmic information theory may have relevance to biology. The regulatory genes of a developing embryo are in effect a computer program for constructing an organism. The "complexity" of this biochemical computer program could conceivably be measured in terms analogous to those I have developed in quantifying the information content of Ω.
Although Ω is completely random (or infinitely complex) and cannot ever be computed exactly, it can be approximated with arbitrary precision given an infinite amount of time. The complexity of living organisms, it seems to me, could be approximated in a similar way. A sequence of Ωn's, which approach Ω, can be regarded as a metaphor for evolution and perhaps could contain the germ of a mathematical model for the evolution of biological complexity.
At the end of his life John von Neumann challenged mathematicians to find an abstract mathematical theory for the origin and evolution of life. This fundamental problem, like most fundamental problems, is magnificently difficult. Perhaps algorithmic information theory can help to suggest a way to proceed.

Further Reading
• Algorithmic Information Theory. Gregory J. Chaitin. Cambridge University Press, 1987.
• Information, Randomness & Incompleteness. Gregory J. Chaitin. World Scientific Publishing Co. Pte. Ltd., 1987.
• The Ultimate in Undecidability. Ian Stewart in Nature, Vol. 332, No. 6160, pages 115–116; March 10, 1988.
ON THE DIFFICULTY OF
COMPUTATIONS
IEEE Transactions on Information Theory
IT-16 (1970), pp. 5–9

Gregory J. Chaitin1

Abstract
Two practical considerations concerning the use of computing machin-
ery are the amount of information that must be given to the machine
for it to perform a given task and the time it takes the machine to per-
form it. The size of programs and their running time are studied for
mathematical models of computing machines. The study of the amount of information (i.e., number of bits) in a computer program needed for it to put out a given finite binary sequence leads to a definition of a random sequence; the random sequences of a given length are those that require the longest programs. The study of the running time of programs for computing infinite sets of natural numbers leads to an arithmetic of computers, which is a distributive lattice.
1 Manuscript received May 5, 1969; revised July 3, 1969. This paper was presented as a lecture at the Pan-American Symposium of Applied Mathematics, Buenos Aires, Argentina, August 1968. The author is at Mario Bravo 249, Buenos Aires, Argentina.

Figure 1. A Turing-Post machine (a black box reading and writing an infinite tape).
Section I
The modern computing machine sprang into existence at the end of
World War II. But already in 1936 Turing and Post had proposed a
mathematical model of computing machines (figure 1).2 The math-
ematical model of the computing machine that Turing and Post pro-
posed, commonly referred to as the Turing machine, is a black box with a finite number of internal states. The box can read and write on an infinite paper tape, which is divided into squares. A digit or letter may
be written on each square of the tape, or the square may be blank.
Each second the machine performs one of the following actions. It may
stop, it may shift the tape one square to the right or one square to the
left, it may erase the square on which the read-write head is positioned,
or it may write a digit or letter on the square on which the read-write
head is positioned. The action it performs is determined solely by the
internal state of the black box at the moment, and the current state of
the black box is determined solely by its previous internal state and the
character read on the square of the tape on which its read-write head
was positioned.
Incredible as it may seem at first, a machine of such primitive design
can multiply numbers written on its tape, and can write on its tape
2 Their papers appear in Davis [1]. As general references on computability theory we may also cite Davis [2]–[4], Minsky [5], Rogers [6], and Arbib [7].
the successive digits of π. Indeed, it is now generally accepted that
any calculation that a modern electronic digital computer or a human
computer can do, can also be done by such a machine.

Section II
How much information must be provided to a computer in order for
it to perform a given task? The point of view we will present here is
somewhat different from the usual one. In a typical scientific applica-
tion, the computer may be used to analyze statistically huge amounts of
data and produce a brief report in which a great many observations are
reduced to a handful of statistical parameters. We would view this in
the following manner. The same nal result could have been achieved
if we had provided the computer with a table of the results, together
with instructions for printing them in a neat report. This observation
is, of course, ridiculous for all practical purposes. For, had we known
the results, it would not have been necessary to use a computer. This
example, then, does not exemplify those aspects of computation that
we will emphasize.
Rather, we are thinking of such scientific applications as solving the Schrödinger wave equation for the helium atom. Here we have no data, only a program, and the program will produce after much calculation a
great deal of printout. Or consider calculating the apparent positions
of the planets as observed from the earth over a period of years. A
small program incorporating the very simple Newtonian theory for this
situation will predict a great many astronomical observations. In this
problem there are no data, only a program that contains, of course,
a table of the masses of the planets and their initial positions and
velocities.

Section III
Let us now consider the problem of the amount of information that
it is necessary to provide to a computer in order for it to calculate a
given finite binary sequence. A computing machine is defined for these
purposes to be a device that accepts as input a program, performs
the calculations indicated to it in the program, and finally puts out
the binary sequence it has calculated. In line with the mathematical
theory of information, it is natural for the program to be viewed as a
sequence of bits or 0's and 1's. Furthermore, in computer engineering all
programs and data are represented in the machine's circuits in binary
form. Thus, we may consider a computer to be a device that accepts
one binary sequence (the program) and emits another (the result of the
calculation).
011001001 → Computer → 1111110010001100110100
As an example of a computer we would then have an electronic dig-
ital computer that accepts programs consisting of magnetized spots
on magnetic tape and puts out its results in the same form. Another
example is a Turing machine. The program is a series of 0's and 1's
written on the machine's tape at the start of the calculation, and the
result is a sequence of 0's and 1's written on its tape when it stops. As
was mentioned, the second of these examples can do anything that the
first can.

Section IV
We are interested in the amount of information that must be supplied to a computer M in order for it to calculate a given finite binary sequence S. We may now define this as the size or length of the smallest binary sequence that causes the machine M to calculate S. We denote the length of the shortest program for M to calculate S by L(M, S). It has been shown that there is a computing machine M that has the following three properties.3
1) L(M, S) ≤ k + 1 for all binary sequences S of length k.
In other words, any binary sequence of length k can be calculated by
this computer M if it is given an appropriate program at most k +1 bits
in length. The proof is as follows. If no better way to calculate a binary
3 Solomonoff [8] was the first to employ computers of this kind.
sequence occurs to us, we can always include the binary sequence as a
table in the program. This computer is so designed that we need add
only a single bit to the sequence to obtain a program for computing it.
The computer M emits the sequence S when it is given the program
S0.
2) Those binary sequences S for which L(M, S) < j are fewer than 2^j in number.
Thus, most binary sequences of length k require programs of about the same length k, and the number of sequences that can be computed by smaller programs decreases exponentially as the size of the program decreases. The proof is as follows. There are only 2^j − 2 binary sequences less than j in length. Thus, there are fewer than 2^j programs less than j in length, for each program is a binary sequence. At best, a program will cause the computer to calculate a single binary sequence. At worst, an error in the program will trap the computer in an endless loop, and no binary sequence will be calculated. As each program causes the computer to calculate at most one binary sequence, the number of sequences calculated must be smaller than the number of programs. Thus, fewer than 2^j binary sequences can be calculated by means of programs less than j in length.
3) For any other computer M′ there exists a constant c(M′) such that for all binary sequences S, L(M, S) ≤ L(M′, S) + c(M′).
In other words, this computer requires shorter programs than any other computer, or more exactly it does not require programs much longer than those required by any other computer. The proof is as follows. The computer M is designed to interpret the circuit diagrams of any other computer M′. Given a program for M′ and the circuit diagrams of M′, the computer M proceeds to calculate how M′ would behave, i.e., it proceeds to simulate M′. Thus, we need only add a fixed number of bits to any program for M′ in order to obtain a program that enables M to calculate the same result. This program for M is of the form PC1.
The 1 at the right end of the program indicates to the computer M that this is a simulation, C is a fixed binary sequence of length c(M′) − 1 giving the circuit diagrams of the computer M′, which is to be imitated, and P is the program for M′.4
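Footnote 4 below describes how M separates a simulation program PC1 into P and C. Under that doubling convention the parsing is mechanical, as in this Python sketch (my reconstruction; the example program is arbitrary):

    def split_simulation_program(program):
        # Format (per footnote 4): P, then C with each bit doubled and an
        # unequal punctuation pair at its left end, then a final 1.
        assert program.endswith("1")
        body = program[:-1]                 # drop the simulation marker
        i = len(body)
        c_bits = []
        while body[i - 2] == body[i - 1]:   # equal pairs: doubled bits of C
            c_bits.append(body[i - 1])
            i -= 2
        i -= 2                              # skip the unequal punctuation pair
        return body[:i], "".join(reversed(c_bits))

    # P = 10110 and C = 100, encoded as 10110 + 01 + 110000 + 1.
    print(split_simulation_program("1011001110000" + "1"))   # ('10110', '100')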

Section V
Kolmogorov [9] and the author [11], [12] have independently suggested that computers such as those previously described be applied to the problem of defining what is meant by a random or patternless finite binary sequence of 0's and 1's. In the traditional foundations of the mathematical theory of probability, as expounded by Kolmogorov in his classic [10], there is no place for the concept of an individual random sequence of 0's and 1's. Yet it is not altogether meaningless to say that
the sequence
110010111110011001011110000010
is more random or patternless than the sequences
111111111111111111111111111111
010101010101010101010101010101
for we may describe these last two sequences as thirty 1's or fifteen 01's, but there is no shorter way to specify the first sequence than by just
writing it all out.
We believe that the random or patternless sequences of a given
length are those that require the longest programs. We have seen that
most of the binary sequences of length k require programs of about
length k. These, then, are the random or patternless sequences. Those
sequences that can be obtained by putting into a computer a program
much shorter than k are the nonrandom sequences, those that possess
a pattern or follow a law. The more a binary sequence can be compressed into a short program for calculating it, the less random is the sequence.
As an example of this, let us consider those sequences of 0's and 1's
in which 0's and 1's do not occur with equal frequency. Let p be the
4 How can the computer M separate PC into P and C? C has each of its bits doubled, except the pair of bits at its left end. These are unequal and serve as punctuation separating C from P.
relative frequency of 1's, and let q = 1 − p be the relative frequency of 0's. A long binary sequence that has the property that 1's are more frequent than 0's can be obtained from a computer program whose length is only that of the desired sequence reduced by a factor H(p, q) = −p log2 p − q log2 q. For example, if 1's occur approximately 3/4 of the time and 0's occur 1/4 of the time in a long binary sequence of length k, there is a program for computing that sequence with length only about H(3/4, 1/4)k = 0.80k. That is, the program need be only approximately 80 percent the length of the sequence it computes. In summary, if 0's and 1's occur with unequal frequencies, we can compress such sequences into programs only a certain percentage (depending on the frequencies) of the size of the sequence. Thus, random or incompressible sequences will have about as many 0's as 1's, which agrees with our intuitive expectations.
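The factor H(p, q) is easy to evaluate directly; the following Python lines (a quick sketch) confirm that H(3/4, 1/4) is about 0.81, the figure rounded to 0.80 above:

    from math import log2

    def H(p):
        # Shannon entropy -p log2 p - q log2 q of a binary source with
        # relative frequency p of 1's; the achievable compression factor.
        q = 1 - p
        return -p * log2(p) - q * log2(q)

    print(H(3/4))         # 0.8112..., i.e. roughly 0.80
    print(H(3/4) * 1000)  # a 1000-bit sequence of this kind: about 811 bits
    print(H(1/2))         # 1.0: equal frequencies leave nothing to compress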
In a similar manner it can be shown that all groups of 0's and 1's will occur with approximately the expected frequency in a long binary sequence that we call random: 01100 will appear 2^−5 k times in long sequences of length k, etc.5

Section VI
The definition of random or patternless finite binary sequences just presented is related to certain considerations in information theory and in the methodology of science.
The two problems considered in Shannon's classical exposition [15] are to transmit information as efficiently and as reliably as possible. Here we are interested in examining the viewpoint of information theory concerning the efficient transmission of information. An information source may be redundant, and information theory teaches us to code or compress messages so that what is redundant is eliminated and communications equipment is optimally employed. For example, let us consider an information source that emits one symbol (either an A or a B) each second. Successive symbols are independent, and A's are three times more frequent than B's. Suppose it is desired to transmit the messages over a channel that is capable of transmitting either an
5 Martin-Löf [14] also discusses the statistical properties of random sequences.
A or a B each second. Then the channel has a capacity of 1 bit per second, while the information source has entropy 0.80 bits per symbol, and thus it is possible to code the messages in such a way that on the average 1/0.80 = 1.25 symbols of message are transmitted over the channel each second. The receiver must decode the messages; that is, expand them into their original form.
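One standard way to realize such a code (an illustrative sketch, not necessarily the scheme Shannon had in mind) is to encode blocks of source symbols; even a prefix code on pairs already approaches the entropy:

    # Pair probabilities for an independent source with P(A) = 3/4,
    # P(B) = 1/4, and a Huffman-style prefix code for the four pairs.
    prob = {"AA": 9/16, "AB": 3/16, "BA": 3/16, "BB": 1/16}
    code = {"AA": "0", "AB": "10", "BA": "110", "BB": "111"}

    bits_per_symbol = sum(prob[s] * len(code[s]) for s in prob) / 2
    print(bits_per_symbol)   # 0.84375, already close to the entropy 0.81

Coding longer blocks brings the average arbitrarily close to the entropy, which is the source of the figure 1/0.80 = 1.25 quoted above.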
In summary, information theory teaches us that messages from an information source that is not completely random (that is, which does not have maximum entropy) can be compressed. The definition of randomness is merely the converse of this fundamental theorem of information theory: if lack of randomness in a message allows it to be coded into a shorter sequence, then the random messages must be those that cannot be coded into shorter messages. A computing machine is clearly the most general possible decoder for compressed messages. We thus consider that this definition of randomness is in perfect agreement with, and indeed strongly suggested by, the coding theorem for a noiseless channel of information theory.

Section VII
This definition is also closely related to classical problems of the methodology of science.6
Consider a scientist who has been observing a closed system that
once every second either emits a ray of light or does not. He summarizes
his observations in a sequence of 0's and 1's in which a 0 represents "ray not emitted" and a 1 represents "ray emitted." The sequence may start
0110101110…
and continue for a few million more bits. The scientist then examines
the sequence in the hope of observing some kind of pattern or law.
What does he mean by this? It seems plausible that a sequence of 0's
and 1's is patternless if there is no better way to calculate it than just
by writing it all out at once from a table giving the whole sequence.
The scientist might state:
6 Solomonoff [8] also discusses the relation between program lengths and the problem of induction.
My Scientific Theory: 0110101110…
This would not be considered an acceptable theory. On the other hand,
if the scientist should hit upon a method by which the whole sequence
could be calculated by a computer whose program is short compared
with the sequence, he would certainly not consider the sequence to be
entirely patternless or random. The shorter the program, the greater
the pattern he may ascribe to the sequence.
There are many parallels between the foregoing and the way sci-
entists actually think. For example, a simple theory that accounts for
a set of facts is generally considered better or more likely to be true
than one that needs a large number of assumptions. By "simplicity" is not meant "ease of use in making predictions." For although general
relativity is considered to be the simple theory par excellence, very ex-
tended calculations are necessary to make predictions from it. Instead,
one refers to the number of arbitrary choices that have been made in
specifying the theoretical structure. One is naturally suspicious of a
theory whose number of arbitrary elements is of an order of magnitude
comparable to the amount of information about reality that it accounts
for.

Section VIII
Let us now turn to the problem of the amount of time necessary for
computations.7 We will develop the following thesis. Call an infinite set of natural numbers perfect if there is no essentially quicker way to compute infinitely many of its members than computing the whole set. Perfect sets exist. This thesis was suggested by the following vague and imprecise considerations.8
One of the most profound problems of the theory of numbers is that
of calculating large primes. While the sieve of Eratosthenes appears to
be as quick an algorithm for calculating all the primes as is possible, in
7 As general references we may cite Blum [16] and Arbib and Blum [17]. Our exposition is a summary of that of [13].
8 See Hardy and Wright [18], Sections 1.4 and 2.5 for the number-theoretic background of the following remarks.
recent times hope has centered on calculating large primes by calculat-
ing a subset of the primes, those that are Mersenne numbers. Lucas's
test can decide the primality of a Mersenne number with rapidity far
greater than is furnished by the sieve method. If there are an infinity
of Mersenne primes, then it appears that Lucas has achieved a decisive
advance in this classical problem of the theory of numbers.
An opposing point of view is that there is no essentially better way
to calculate large primes than by calculating them all. If this is the case,
it apparently follows that there must be only finitely many Mersenne
primes.
These considerations, then, suggested that there are infinite sets of natural numbers that are arbitrarily difficult to compute, and that do not have any infinite subsets essentially easier to compute than the whole set. Here difficulty of computation refers to speed. Our development will be as follows. First, we define computers for calculating infinite sets of natural numbers. Then we introduce a way of comparing the rapidity of computers, a transitive binary relation, i.e., almost a partial ordering. Next we focus our attention on those computers that are greater than or equal to all others under this ordering, i.e., the fastest computers. Our results are conditioned on the computers having this property. The meaning of "arbitrarily difficult to compute" is then clarified. Last, we exhibit sets that are arbitrarily difficult to compute and do not have any subset essentially easier to compute than the whole set.

Section IX
We are interested in the speed of programs for generating the elements of an infinite set of natural numbers. For these purposes we may consider a computer to be a device that once a second emits a (possibly empty) finite set of natural numbers and that once started never stops. That is to say, a computer is now viewed as a function whose arguments are the program and the time and whose value is a finite set of natural numbers. If a program causes the computer to emit infinitely many natural numbers in size order and without any repetitions, we say that the computing machine calculates the infinite set of natural numbers
that it emits.
A Turing machine can be used to compute infinite sets of natural numbers; it is only necessary to establish a convention as to when natural numbers are emitted. For example, we may divide the machine's tape into two halves, and stipulate that what is written on the right half cannot be erased. The computational scratchwork is done on the left half of the tape, and the successive members of the infinite set of natural numbers are written on the nonerasable squares in decimal notation, separated by commas, with no blank spaces permitted between characters. The moment a comma has been written, it is considered that the digits between it and the previous comma form the numeral representing the next natural number emitted by the machine. We suppose that the Turing machine performs a single cycle of activity (read tape; shift, write, or erase tape; change internal state) each second. Last, we stipulate that the machine be started scanning the first nonerasable square of the tape, that initially the nonerasable squares be all blank, and that the program for the computer be written on the first erasable squares, with a blank serving as punctuation to indicate the end of the program and the beginning of an infinite blank region of tape.

Section X
We now order the computers according to their speeds. C ≼ C′ is defined as meaning that C is not much slower than C′.
What do we mean by saying that computer C is not much slower than computer C′ for the purpose of computing infinite sets of natural numbers? There is a computable change of C's time scale that makes C as fast as C′ or faster. More exactly, there is a computable function f(n) (for example n!, or n^n^···^n with n exponents) with the following property. Let P′ be any program that makes C′ calculate an infinite set of natural numbers. Then there exists a program P that makes C calculate the same set of natural numbers and has the additional property that every natural number emitted by C′ during the first t seconds of calculation is emitted by C during the first f(t) seconds of calculation, for all but a finite number of values of t. We may symbolize
this relation between the computers C and C′ as C ≼ C′, for it has the property that C ≼ C′ and C′ ≼ C″ only if C ≼ C″.
In this way, we have introduced an ordering of the computers for computing infinite sets of natural numbers, and it can be shown that a distributive lattice results. The most important property of this ordering for our present purposes is that there is a set of computers ≽ all other computers. In what follows we assume that the computer that is used is a member of this set of fastest computers.

Section XI
We now clarify what we mean by "arbitrarily difficult to compute."
Let f(n) be any computable function that carries natural numbers into natural numbers. Such functions can get big very quickly indeed. For example, consider the function n^n^···^n in which there are n^n exponents. There are infinite sets of natural numbers such that, no matter how the computer is programmed, at least f(n) seconds will pass before the computer emits all those elements of the set that are less than or equal to n. Of course, a finite number of exceptions are possible, for any finite part of an infinite set can be computed very quickly by including in the computer's program a table of the first few elements of the set. Note that the difficulty in computing such sets of natural numbers does not lie in the fact that their elements get very big very quickly, for even small elements of such sets require more than astronomical amounts of time to be computed. What is more, there are infinite sets of natural numbers that are arbitrarily difficult to compute and include 90 percent of the natural numbers.
We finally exhibit infinite sets of natural numbers that are arbitrarily difficult to compute, and do not have any infinite subsets essentially easier to compute than the whole set. Consider the following tree of natural numbers (figure 2).9 The infinite sets of natural numbers that we promised to exhibit are obtained by starting at the root of the tree (that is, at 0) and walking forward, including in the set every natural number that is stepped on.
9 This tree is used in Rogers [6], p. 158, in connection with retraceable sets. Retraceable sets are in some ways analogous to those sets that concern us here.

                  .7...
              .3.
             .    .8...
         .1.
        .    .    .9...
        .     .4.
        .         .10...
       0.
        .         .11...
        .     .5.
        .    .    .12...
         .2.
             .    .13...
              .6.
                  .14...

Figure 2. A tree of natural numbers

It is easy to see that no infinite subset of such a set can be computed much more quickly than the whole set. For suppose we are told that n is in such a set. Then we know at once that the greatest integer less than n/2 is the previous element of the set. Thus, knowing that 1 000 000 is in the set, we immediately produce all smaller elements in it, by walking backwards through the tree. They are 499 999, 249 999, 124 999, etc. It follows that there is no appreciable difference between generating an infinite subset of such a set, and generating the whole set, for gaps in an incomplete generation can be filled in very quickly.
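Since the two successors of n in this tree are 2n + 1 and 2n + 2, the predecessor of n is the greatest integer less than n/2, and the backward walk is immediate (a quick Python sketch):

    def ancestors(n):
        # Predecessor in the tree: the greatest integer less than n/2,
        # i.e. (n - 1) // 2; the successors of m are 2m + 1 and 2m + 2.
        while n > 0:
            n = (n - 1) // 2
            yield n

    print(list(ancestors(1000000)))
    # [499999, 249999, 124999, 62499, 31249, 15624, 7811, 3905, 1952,
    #  975, 487, 243, 121, 60, 29, 14, 6, 2, 0]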
It is also easy to see that there are sets that can be obtained by walk-
ing through this tree and are arbitrarily difficult to compute. These,
then, are the sets that we wished to exhibit.

Acknowledgment
The author wishes to express his gratitude to Prof. G. Pollitzer of
the University of Buenos Aires, whose constructive criticism much improved the clarity of this presentation.

References
[1] M. Davis, Ed., The Undecidable. Hewlett, N.Y.: Raven Press, 1965.
[2] —, Computability and Unsolvability. New York: McGraw-Hill, 1958.
[3] —, "Unsolvable problems: A review," Proc. Symp. on Mathematical Theory of Automata. Brooklyn, N.Y.: Polytech. Inst. Brooklyn Press, 1963, pp. 15–22.
[4] —, "Applications of recursive function theory to number theory," Proc. Symp. in Pure Mathematics, vol. 5. Providence, R.I.: AMS, 1962, pp. 135–138.
[5] M. Minsky, Computation: Finite and Infinite Machines. Englewood Cliffs, N.J.: Prentice-Hall, 1967.
[6] H. Rogers, Jr., Theory of Recursive Functions and Effective Computability. New York: McGraw-Hill, 1967.
[7] M. A. Arbib, Theories of Abstract Automata. Englewood Cliffs, N.J.: Prentice-Hall (to be published).
[8] R. J. Solomonoff, "A formal theory of inductive inference," Inform. and Control, vol. 7, pp. 1–22, March 1964; pp. 224–254, June 1964.
[9] A. N. Kolmogorov, "Three approaches to the definition of the concept 'quantity of information'," Probl. Peredachi Inform., vol. 1, pp. 3–11, 1965.
[10] —, Foundations of the Theory of Probability. New York: Chelsea, 1950.
[11] G. J. Chaitin, "On the length of programs for computing finite binary sequences," J. ACM, vol. 13, pp. 547–569, October 1966.
[12] —, "On the length of programs for computing finite binary sequences: statistical considerations," J. ACM, vol. 16, pp. 145–159, January 1969.
[13] —, "On the simplicity and speed of programs for computing infinite sets of natural numbers," J. ACM, vol. 16, pp. 407–422, July 1969.
[14] P. Martin-Löf, "The definition of random sequences," Inform. and Control, vol. 9, pp. 602–619, December 1966.
[15] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, Ill.: University of Illinois Press, 1949.
[16] M. Blum, "A machine-independent theory of the complexity of recursive functions," J. ACM, vol. 14, pp. 322–336, April 1967.
[17] M. A. Arbib and M. Blum, "Machine dependence of degrees of difficulty," Proc. AMS, vol. 16, pp. 442–447, June 1965.
[18] G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers. Oxford: Oxford University Press, 1962.
The following references have come to the author's attention since this lecture was given.
[19] D. G. Willis, "Computational complexity and probability constructions," Stanford University, Stanford, Calif., March 1969.
[20] A. N. Kolmogorov, "Logical basis for information theory and probability theory," IEEE Trans. Information Theory, vol. IT-14, pp. 662–664, September 1968.
[21] D. W. Loveland, "A variant of the Kolmogorov concept of complexity," Dept. of Math., Carnegie-Mellon University, Pittsburgh, Pa., Rept. 69-4.
[22] P. R. Young, "Toward a theory of enumerations," J. ACM, vol. 16, pp. 328–348, April 1969.
[23] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms. Reading, Mass.: Addison-Wesley, 1969.
[24] 1969 Conf. Rec. of the ACM Symp. on Theory of Computing (Marina del Rey, Calif.).
INFORMATION-
THEORETIC
COMPUTATIONAL
COMPLEXITY
Invited Paper
IEEE Transactions on Information Theory
IT-20 (1974), pp. 10–15

Gregory J. Chaitin1

Abstract
This paper attempts to describe, in nontechnical language, some of the concepts and methods of one school of thought regarding computational complexity. It applies the viewpoint of information theory to computers. This will first lead us to a definition of the degree of randomness of individual binary strings, and then to an information-theoretic version of Gödel's theorem on the limitations of the axiomatic method. Finally, we will examine in the light of these ideas the scientific method and von

Neumann's views on the basic conceptual problems of biology.

This field's fundamental concept is the complexity of a binary string, that is, a string of bits, of zeros and ones. The complexity of a binary string is the minimum quantity of information needed to define the string. For example, the string of length n consisting entirely of ones is of complexity approximately log2 n, because only log2 n bits of information are required to specify n in binary notation.
However, this is rather vague. Exactly what is meant by the definition of a string? To make this idea precise a computer is used. One says that a string defines another when the first string gives instructions for constructing the second string. In other words, one string defines another when it is a program for a computer to calculate the second string. The fact that a string of n ones is of complexity approximately log2 n can now be translated more correctly into the following. There is a program log2 n + c bits long that calculates the string of n ones. The program performs a loop for printing ones n times. A fixed number c of bits are needed to program the loop, and log2 n bits more for specifying n in binary notation.
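Concretely (a minimal Python sketch, standing in for the idealized machine): the part of such a program that varies with n is just the binary numeral for n, log2 n bits, while the fixed decode-and-print code plays the role of the constant c:

    def run(program):
        # The program is the binary numeral for n; this fixed loop code
        # contributes only the constant c bits of the count log2 n + c.
        n = int(program, 2)
        return "1" * n

    print(run("1010"))          # ten ones from a program only 4 bits long
    print(len(run("1" * 20)))   # 1048575 ones (2^20 - 1) from 20 bits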
Exactly how are the computer and the concept of information combined to define the complexity of a binary string? A computer is considered to take one binary string and perhaps eventually produce another. The first string is the program that has been given to the machine. The second string is the output of this program; it is what this program calculates. Now consider a given string that is to be calculated. How much information must be given to the machine to do this? That is to say, what is the length in bits of the shortest program for calculating the string? This is its complexity.
It can be objected that this is not a precise definition of the complexity of a string, inasmuch as it depends on the computer that one
1 Manuscript received January 29, 1973; revised July 18, 1973. This paper was presented at the IEEE International Congress of Information Theory, Ashkelon, Israel, June 1973. The author is at Mario Bravo 249, Buenos Aires, Argentina.
is using. Moreover, a definition should not be based on a machine, but rather on a model that does not have the physical limitations of real computers.
Here we will not define the computer used in the definition of complexity. However, this can indeed be done with all the precision of which mathematics is capable. Since 1936 it has been known how to define an idealized computer with unlimited memory. This was done in a very intuitive way by Turing and also by Post, and there are elegant definitions based on other principles [2]. The theory of recursive functions (or computability theory) has grown up around the questions of what is computable and what is not.
Thus it is not difficult to define a computer mathematically. What remains to be analyzed is which definition should be adopted, inasmuch as some computers are easier to program than others. A decade ago Solomonoff solved this problem [7]. He constructed a definition of a computer whose programs are not much longer than those of any other computer. More exactly, Solomonoff's machine simulates running a program on another computer, when it is given a description of that computer together with its program.
Thus it is clear that the complexity of a string is a mathematical concept, even though here we have not given a precise definition. Furthermore, it is a very natural concept, easy to understand for those who have worked with computers. Recapitulating, the complexity of a binary string is the information needed to define it, that is to say, the number of bits of information that must be given to a computer in order to calculate it, or in other words, the size in bits of the shortest program for calculating it. It is understood that a certain mathematical definition of an idealized computer is being used, but it is not given here, because as a first approximation it is sufficient to think of the length in bits of a program for a typical computer in use today.
Now we would like to consider the most important properties of the
complexity of a string. First of all, the complexity of a string of length
n is less than n + c, because any string of length n can be calculated
by putting it directly into a program as a table. This requires n bits,
to which must be added c bits of instructions for printing the table.
In other words, if nothing better occurs to us, the string itself can be
used as its de nition, and this requires only a few more bits than its
length.
Thus the complexity of each string of length n is less than n + c. Moreover, the complexity of the great majority of strings of length n is approximately n, and very few strings of length n are of complexity much less than n. The reason is simply that there are far fewer programs of length appreciably less than n than strings of length n. More exactly, there are 2^n strings of length n, and less than 2^(n−k) programs of length less than n − k. Thus the number of strings of length n and complexity less than n − k decreases exponentially as k increases.
These considerations have revealed the basic fact that the great ma-
jority of strings of length n are of complexity very close to n. Therefore,
if one generates a binary string of length n by tossing a fair coin n times
and noting whether each toss gives head or tail, it is highly probable
that the complexity of this string will be very close to n. In 1965 Kolmogorov proposed calling random those strings of length n whose complexity is approximately n [8]. We made the same proposal independently [9]. It can be shown that a string that is random in this sense
has the statistical properties that one would expect. For example, zeros
and ones appear in such strings with relative frequencies that tend to
one-half as the length of the strings increases.
Consequently, the great majority of strings of length n are random,
that is, need programs of approximately length n, that is to say, are
of complexity approximately n. What happens if one wishes to show
that a particular string is random? What if one wishes to prove that
the complexity of a certain string is almost equal to its length? What
if one wishes to exhibit a specific example of a string of length n and
complexity close to n, and assure oneself by means of a proof that there
is no shorter program for calculating this string?
It should be pointed out that this question can occur quite natu-
rally to a programmer with a competitive spirit and a mathematical
way of thinking. At the beginning of the sixties we attended a course
at Columbia University in New York. Each time the professor gave an
exercise to be programmed, the students tried to see who could write
the shortest program. Even though several times it seemed very difficult to improve upon the best program that had been discovered, we
did not fool ourselves. We realized that in order to be sure, for exam-
ple, that the shortest program for the IBM 650 that prints the prime
numbers has, say, 28 instructions, it would be necessary to prove it, not
merely to continue for a long time unsuccessfully trying to discover a
program with less than 28 instructions. We could never even sketch a
first approach to a proof.
It turns out that it was not our fault that we did not find a proof, because we faced a fundamental limitation. One confronts a very basic difficulty when one tries to prove that a string is random, when one attempts to establish a lower bound on its complexity. We will try to suggest why this problem arises by means of a famous paradox, that of Berry [1, p. 153].
Consider the smallest positive integer that cannot be defined by an English phrase with less than 1 000 000 000 characters. Supposedly the shortest definition of this number has 1 000 000 000 or more characters. However, we defined this number by a phrase much less than 1 000 000 000 characters in length when we described it as "the smallest positive integer that cannot be defined by an English phrase with less than 1 000 000 000 characters"!
What relationship is there between this and proving that a string is complex, that its shortest program needs more than n bits? Consider the first string that can be proven to be of complexity greater than 1 000 000 000. Here once more we face a paradox similar to that of Berry, because this description leads to a program with much less than 1 000 000 000 bits that calculates a string supposedly of complexity greater than 1 000 000 000. Why is there a short program for calculating "the first string that can be proven to be of complexity greater than 1 000 000 000"?
The answer depends on the concept of a formal axiom system, whose importance was emphasized by Hilbert [1]. Hilbert proposed that mathematics be made as exact and precise as possible. In order to avoid arguments between mathematicians about the validity of proofs, he set down explicitly the methods of reasoning used in mathematics. In fact, he invented an artificial language with rules of grammar and spelling that have no exceptions. He proposed that this language be used to eliminate the ambiguities and uncertainties inherent in any natural language. The specifications are so precise and exact that checking if a proof written in this artificial language is correct is completely mechanical. We would say today that it is so clear whether a proof is valid or
not that this can be checked by a computer.
Hilbert hoped that this way mathematics would attain the great-
est possible objectivity and exactness. Hilbert said that there can no
longer be any doubt about proofs. The deductive method should be
completely clear.
Suppose that proofs are written in the language that Hilbert con-
structed, and in accordance with his rules concerning the accepted
methods of reasoning. We claim that a computer can be programmed
to print all the theorems that can be proven. It is an endless program
that every now and then writes on the printer a theorem. Furthermore,
no theorem is omitted. Each will eventually be printed, if one is very
patient and waits long enough.
How is this possible? The program works in the following manner. The language invented by Hilbert has an alphabet with finitely many signs or characters. First the program generates the strings of characters in this alphabet that are one character in length. It checks if one of these strings satisfies the completely mechanical rules for a correct proof and prints all the theorems whose proofs it has found. Then the program generates all the possible proofs that are two characters in length, and examines each of them to determine if it is valid. The program then examines all possible proofs of length three, of length four, and so on. If a theorem can be proven, the program will eventually find a proof for it in this way, and then print it.
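In outline the procedure looks as follows in Python, where is_valid_proof and conclusion_of are hypothetical placeholders for the purely mechanical proof checker that Hilbert's rules provide (a sketch, with a two-letter stand-in alphabet):

    from itertools import count, product

    ALPHABET = "01"   # stand-in for the finite alphabet of Hilbert's language

    def theorems(is_valid_proof, conclusion_of):
        # Generate all candidate proofs in order of increasing length and
        # emit the theorem each valid one proves; no theorem is omitted.
        for length in count(1):
            for chars in product(ALPHABET, repeat=length):
                candidate = "".join(chars)
                if is_valid_proof(candidate):
                    yield conclusion_of(candidate)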
Consider again "the first string that can be proven to be of complexity greater than 1 000 000 000." To find this string one generates all theorems until one finds the first theorem that states that a particular string is of complexity greater than 1 000 000 000. Moreover, the program for finding this string is short, because it need only have the number 1 000 000 000 written in binary notation, log2 1 000 000 000 bits, and a routine of fixed length c that examines all possible proofs until it finds one showing that a specific string is of complexity greater than 1 000 000 000.
In fact, we see that there is a program log2 n + c bits long that calculates the first string that can be proven to be of complexity greater than n. Here we have Berry's paradox again, because this program of length log2 n + c calculates something that supposedly cannot be calculated by a program of length less than or equal to n. Also, log2 n + c is much less than n for all sufficiently great values of n, because the logarithm increases very slowly.
What can the meaning of this paradox be? In the case of Berry's original paradox, one cannot arrive at a meaningful conclusion, inasmuch as one is dealing with vague concepts such as an English phrase's defining a positive integer. However our version of the paradox deals with exact concepts that have been defined mathematically. Therefore, it cannot really be a contradiction. It would be absurd for a string not to have a program of length less than or equal to n for calculating it, and at the same time to have such a program. Thus we arrive at the interesting conclusion that such a string cannot exist. For all sufficiently great values of n, one cannot talk about "the first string that can be proven to be of complexity greater than n," because this string cannot exist. In other words, for all sufficiently great values of n, it cannot be proven that a particular string is of complexity greater than n. If one uses the methods of reasoning accepted by Hilbert, there is an upper bound to the complexity that it is possible to prove that a particular string has.
This is the surprising result that we wished to obtain. Most strings
of length n are of complexity approximately n, and a string generated
by tossing a coin will almost certainly have this property. Nevertheless,
one cannot exhibit individual examples of arbitrarily complex strings
using methods of reasoning accepted by Hilbert. The lower bounds on the complexity of specific strings that can be established are limited, and we will never be mathematically certain that a particular string is very complex, even though most strings are random.2
In 1931 Gödel questioned Hilbert's ideas in a similar way [1], [2]. Hilbert had proposed specifying once and for all exactly what is accepted as a proof, but Gödel explained that no matter what Hilbert specified so precisely, there would always be true statements about the integers that the methods of reasoning accepted by Hilbert would be
2 This is a particularly perverse example of Kac's comment [13, p. 16] that "as is often the case, it is much easier to prove that an overwhelming majority of objects possess a certain property than to exhibit even one such object." The most familiar example of this is Shannon's proof of the coding theorem for a noisy channel; while it is shown that most coding schemes achieve close to the channel capacity, in practice it is difficult to implement a good coding scheme.
incapable of proving. This mathematical result has been considered to
be of great philosophical importance. Von Neumann commented that
the intellectual shock provoked by the crisis in the foundations of math-
ematics was equaled only by two other scienti c events in this century:
the theory of relativity and quantum theory [4].
We have combined ideas from information theory and computability theory in order to define the complexity of a binary string, and have then used this concept to give a definition of a random string and to show that a formal axiom system enables one to prove that a random string is indeed random in only finitely many cases.
Now we would like to examine some other possible applications of
this viewpoint. In particular, we would like to suggest that the con-
cept of the complexity of a string and the fundamental methodological
problems of science are intimately related. We will also suggest that
this concept may be of theoretical value in biology.
Solomonoff [7] and the author [9] proposed that the concept of complexity might make it possible to precisely formulate the situation that a scientist faces when he has made observations and wishes to understand them and make predictions. In order to do this the scientist searches for a theory that is in agreement with all his observations. We consider his observations to be represented by a binary string, and a theory to be a program that calculates this string. Scientists consider the simplest theory to be the best one, and that if a theory is too "ad hoc," it is useless. How can we formulate these intuitions about the scientific method in a precise fashion? The simplicity of a theory is inversely proportional to the length of the program that constitutes it. That is to say, the best program for understanding or predicting observations is the shortest one that reproduces what the scientist has observed up to that moment. Also, if the program has the same number of bits as the observations, then it is useless, because it is too "ad hoc." If a string of observations only has theories that are programs with the same length as the string of observations, then the observations are random, and can neither be comprehended nor predicted. They are what they are, and that is all; the scientist cannot have a theory in the proper sense of the concept; he can only show someone else what he observed and say "it was this."
In summary, the value of a scientific theory is that it enables one to compress many observations into a few theoretical hypotheses. There
is a theory only when the string of observations is not random, that
is to say, when its complexity is appreciably less than its length in
bits. In this case the scientist can communicate his observations to a
colleague much more economically than by just transmitting the string
of observations. He does this by sending his colleague the program
that is his theory, and this program must have much fewer bits than
the original string of observations.
It is also possible to make a similar analysis of the deductive method,
that is to say, of formal axiom systems. This is accomplished by ana-
lyzing more carefully the new version of Berry's paradox that was pre-
sented. Here we only sketch the three basic results that are obtained
in this manner.3
1. In a formal system with n bits of axioms it is impossible to prove
that a particular binary string is of complexity greater than n + c.
2. Contrariwise, there are formal systems with n + c bits of axioms
in which it is possible to determine each string of complexity less
than n and the complexity of each of these strings, and it is also
possible to exhibit each string of complexity greater than or equal
to n, but without being able to know by how much the complexity
of each of these strings exceeds n.
3. Unfortunately, any formal system in which it is possible to deter-
mine each string of complexity less than n has either one grave
problem or another. Either it has few bits of axioms and needs
incredibly long proofs, or it has short proofs but an incredibly
great number of bits of axioms. We say "incredibly" because these
quantities increase more quickly than any computable function of
n.
It is necessary to clarify the relationship between this and the preceding analysis of the scientific method. There are less than 2^n strings of complexity less than n, but some of them are incredibly long. If one wishes to communicate all of them to someone else, there are two alternatives. The first is to directly show all of them to him. In this case one
3 See the Appendix.
will have to send him an incredibly long message because some of these
strings are incredibly long. The other alternative is to send him a very
short message consisting of n bits of axioms from which he can deduce
which strings are of complexity less than n. Although the message is
very short in this case, he will have to spend an incredibly long time to
deduce from these axioms the strings of complexity less than n. This
is analogous to the dilemma of a scientist who must choose between di-
rectly publishing his observations, or publishing a theory that explains
them, but requires very extended calculations in order to do this.
Finally, we would like to suggest that the concept of complexity
may possibly be of theoretical value in biology.
At the end of his life von Neumann tried to lay the foundation for a mathematics of biological phenomena. His first effort in this direction was his work Theory of Games and Economic Behavior, in which he analyzes what is a rational way to behave in situations in which there are conflicting interests [3]. The Computer and the Brain, his notes for a lecture series, was published shortly after his death [5]. This book discusses the differences and similarities between the computer and the brain, as a first step to a theory of how the brain functions. A decade later his work Theory of Self-Reproducing Automata appeared, in which von Neumann constructs an artificial universe and within it a computer that is capable of reproducing itself [6]. But von Neumann points out that the problem of formulating a mathematical theory of the evolution of life in this abstract setting remains to be solved; and to express mathematically the evolution of the complexity of organisms, one must first define complexity precisely.4 We submit that "organism" must also be defined, and have tried elsewhere to suggest how this might perhaps be done [10].
We believe that the concept of complexity that has been presented
here may be the tool that von Neumann felt is needed. It is by no means
accidental that biological phenomena are considered to be extremely
complex. Consider how a human being analyzes what he sees, or uses
natural languages to communicate. We cannot carry out these tasks
by computer because they are as yet too complex for us; the programs
4 In an important paper [14], Eigen studies these questions from the point of view of thermodynamics and biochemistry.
Information-Theoretic Computational Complexity 67
would be too long.5

Appendix
In this Appendix we try to give a more detailed idea of how the results
concerning formal axiom systems that were stated are established.6
Two basic mathematical concepts that are employed are the con-
cepts of a recursive function and a partial recursive function. A function
is recursive if there is an algorithm for calculating its value when one
is given the value of its arguments, in other words, if there is a Tur-
ing machine for doing this. If it is possible that this algorithm never
terminates and the function is thus unde ned for some values of its
arguments, then the function is called partial recursive.7
In what follows we are concerned with computations involving bi-
nary strings. The binary strings are considered to be ordered in the
following manner: (, 0, 1, 00, 01, 10, 11, 000, 001, 010, : : : The natural
number n is represented by the nth binary string (n = 0 1 2 : : :). The
length of a binary string s is denoted lg(s). Thus if s is considered to
be a natural number, then lg(s) = log2(s + 1)]. Here x] is the greatest
integer  x.
De nition 1. A computer is a partial recursive function C (p). Its
argument p is a binary string. The value of C (p) is the binary string
output by the computer C when it is given the program p. If C (p) is
unde ned, this means that running the program p on C produces an
unending computation.
De nition 2. The complexity IC (s) of a binary string s is de ned
to be the length of the shortest program p that makes the computer C
output s, i.e.,
IC (s) = Cmin
(p)=s
lg(p):
If no program makes C output s, then IC (s) is de ned to be in nite.
5 Chandrasekaran and Reeker 15] discuss the relevance of complexity to articial
intelligence.
6 See 11], 12] for dierent approaches.
7 Full treatments of these concepts can be found in standard texts, e.g., Rogers
16].
68 Part I|Introductory/Tutorial/Survey Papers
De nition 3. A computer U is universal if for any computer C and
any binary string s, IU (s)  IC (s) + c, where the constant c depends
only on C .
It is easy to see that there are universal computers. For example,
consider the computer U such that U (0i 1p) = Ci(p), where Ci is the
ith computer, i.e., a program for U consists of two parts: the left-hand
part indicates which computer is to be simulated, and the right-hand
part gives the program to be simulated. We now suppose that some
particular universal computer U has been chosen as the standard one
for measuring complexities, and shall henceforth write I (s) instead of
IU (s).
De nition 4. The rules of inference of a class of formal axiom
systems is a recursive function F (a h) (a a binary string, h a natural
number) with the property that F (a h)  F (a h + 1). The value
of F (a h) is the nite (possibly empty) set of theorems that can be
proven from the axioms a by means of proofs  h characters in length.
F (a) = Sh F (a h) is the set of theorems that are consequences of the
axioms a. The ordered pair hF ai, which implies both the choice of
rules of inference and axioms, is a particular formal axiom system.
This is a fairly abstract de nition, but it retains all those features of
formal axiom systems that we need. Note that although one may not be
interested in some axioms (e.g., if they are false or incomprehensible),
it is stipulated that F (a h) is always de ned.
Theorem 1. a) There is a constant c such that I (s)  lg(s) + c
for all binary strings s. b) There are less than 2n binary strings of
complexity less than n.
Proof of a). There is a computer C such that C (p) = p for all
programs p. Thus for all binary strings s, I (s)  IC (s) + c = lg(s) + c.
Proof of b). As there are less than 2n programs of length less than
n, there must be less than this number of binary strings of complexity
less than n. Q.E.D.
Thesis. A random binary string s is one having the property that
I (s) lg(s).
Theorem 2. Consider the rules of inference F . Suppose that a
proposition of the form \I (s)  n" is in F (a) only if it is true, i.e., only
if I (s)  n. Then a proposition of the form \I (s)  n" is in F (a) only
if n  lg(a) + c, where c is a constant that depends only on F .
Information-Theoretic Computational Complexity 69
Proof. Consider that binary string sk having the shortest proof
from the axioms a that it is of complexity > lg(a) + 2k. We claim that
I (sk )  lg(a) + k + c0, where c0 depends only on F . Taking k = c0,
we conclude that the binary string sc with the shortest proof from the
0

axioms a that it is of complexity > lg(a) + 2c0 is, in fact, of complexity


 lg(a) + 2c0 , which is impossible. It follows that sk doesn't exist for
k = c0, that is, no binary string can be proven from the axioms a to be
of complexity > lg(a) + 2c0 . Thus the theorem is proved with c = 2c0.
It remains to verify the claim that I (sk)  lg(a)+k+c0. Consider the
computer C that does the following when it is given the program 0k 1a.
It calculates F (a h) for h = 0 1 2 : : : until it nds the rst theorem in
F (a h) of the form \I (s)  n" with n > lg(a) + 2k. Finally C outputs
the binary string s in the theorem it has found. Thus C (0k 1a) is equal
to sk , if sk exists. It follows that
I (sk) = I (C (0k 1a))
 IC (C (0k 1a)) + c00
 lg(0k 1a) + c00 = lg(a) + k + (c00 + 1) = lg(a) + k + c0 :

Q.E.D.
De nition 5. An is de ned to be the kth binary string of length
n, where k is the number of programs p of length < n for which U (p)
is de ned, i.e., An has n and this number k coded into it.
Theorem 3. There are rules of inference F 1 such that for all n,
F (An) is the union of the set of all true propositions of the form
1
\I (s) = k" with k < n and the set of all true propositions of the form
\I (s)  n."
Proof. From An one knows n and for how many programs p of
length < n U (p) is de ned. One then simulates in parallel, running each
program p of length < n on U until one has determined the value of U (p)
for each p of length < n for which U (p) is de ned. Knowing the value
of U (p) for each p of length < n for which U (p) is de ned, one easily
determines each string of complexity < n and its complexity. What's
more, all other strings must be of complexity  n. This completes our
sketch of how all true propositions of the form \I (s) = k" with k < n
and of the form \I (s)  n" can be derived from the axiom An. Q.E.D.
70 Part I|Introductory/Tutorial/Survey Papers
Recall that we consider the nth binary string to be the natural
number n.
De nition 6. The partial function B (n) is de ned to be the biggest
natural number of complexity  n, i.e.,
B (n) = Imax
(k)n
k = lg(max
p)n
U (p):
Theorem 4. Let f be a partial recursive function that carries
natural numbers into natural numbers. Then B (n)  f (n) for all
suciently great values of n.
Proof. Consider the computer C such that C (p) = f (p) for all p.
I (f (n))  IC (f (n)) + c  lg(n) + c = log2(n + 1)] + c < n
for all suciently great values of n. Thus B (n)  f (n) for all suciently
great values of n. Q.E.D.
Theorem 5. Consider the rules of inference F . Let

Fn = F (a B (n))
a
where the union is taken over all binary strings a of length  B (n),
i.e., Fn is the ( nite) set of all theorems that can be deduced by means
of proofs with not more than B (n) characters from axioms with not
more than B (n) bits. Let sn be the rst binary string s not in any
proposition of the form \I (s) = k" in Fn. Then I (sn)  n + c, where
the constant c depends only on F .
Proof. We claim that there is a computer C such that if U (p) =
B (n), then C (p) = sn. As, by the de nition of B , there is a p0 of length
 n such that U (p0 ) = B (n), it follows that

I (sn)  IC (sn) + c = IC (C (p0)) + c  lg(p0 ) + c  n + c


which was to be proved.
It remains to verify the claim that there is a C such that if U (p) =
B (n), then C (p) = sn. C works as follows. Given the program p, C rst
simulates running the program p on U . Once C has determined U (p),
it calculates F (a U (p)) for all binary strings a such that lg(a)  U (p),
and forms the union of these 2U (p)+1 ; 1 dierent sets of propositions,
Information-Theoretic Computational Complexity 71
which is Fn if U (p) = B (n). Finally C outputs the rst binary string s
not in any proposition of the form \I (s) = k" in this set of propositions
s is sn if U (p) = B (n). Q.E.D.
Theorem 6. Consider the rules of inference F . If F (a h) includes
all true propositions of the form \I (s) = k" with k  n + c, then either
lg(a) > B (n) or h > B (n). Here c is a constant that depends only on
F.
Proof. This is an immediate consequence of Theorem 5. Q.E.D.
The following theorem gives an upper bound on the size of the proofs
in the formal systems hF 1 Ani that were studied in Theorem 3, and
also shows that the lower bound on the size of these proofs that is given
by Theorem 6 cannot be essentially improved.
Theorem 7. There is a constant c such that for all n F 1(An B (n +
c)) includes all true propositions of the form \I (s) = k" with k < n.
Proof. We claim that there is a computer C such that for all n,
C (An) = the least natural number h such that F 1(An h) includes all
true propositions of the form \I (s) = k" with k < n. Thus the com-
plexity of this value of h is  lg(An) + c = n + c, and B (n + c) is 
this value of h, which was to be proved.
It remains to verify the claim. C works as follows when it is given the
program An. First, it determines each binary string of complexity < n
and its complexity, in the manner described in the proof of Theorem 3.
Then it calculates F 1(An h) for h = 0 1 2 : : : until all true propositions
of the form \I (s) = k" with k < n are included in F 1(An h). The nal
value of h is then output by C . Q.E.D.

References
1] J. van Heijenoort, Ed., From Frege to Godel: A Source Book
in Mathematical Logic, 1879{1931. Cambridge, Mass.: Harvard
Univ. Press, 1967.
2] M. Davis, Ed., The Undecidable|Basic Papers on Undecidable
Propositions, Unsolvable Problems and Computable Functions.
Hewlett, N.Y.: Raven Press, 1965.
72 Part I|Introductory/Tutorial/Survey Papers
3] J. von Neumann and O. Morgenstern, Theory of Games and Eco-
nomic Behavior. Princeton, N.J.: Princeton Univ. Press, 1944.
4] |, \Method in the physical sciences," in John von Neumann|
Collected Works. New York: Macmillan, 1963, vol. 6, no. 35.
5] |, The Computer and the Brain. New Haven, Conn.: Yale Univ.
Press, 1958.
6] |, Theory of Self-Reproducing Automata. Urbana, Ill.: Univ.
Illinois Press, 1966. (Edited and completed by A. W. Burks.)
7] R. J. Solomono, \A formal theory of inductive inference," In-
form. Contr., vol. 7, pp. 1{22, Mar. 1964 also, pp. 224{254, June
1964.
8] A. N. Kolmogorov, \Logical basis for information theory and
probability theory," IEEE Trans. Inform. Theory, vol. IT-14, pp.
662{664, Sept. 1968.
9] G. J. Chaitin, \On the diculty of computations," IEEE Trans.
Inform. Theory, vol. IT-16, pp. 5{9, Jan. 1970.
10] |, \To a mathematical de nition of `life'," ACM SICACT News,
no. 4, pp. 12{18, Jan. 1970.
11] |, \Computational complexity and G odel's incompleteness theo-
rem," (Abstract) AMS Notices, vol. 17, p. 672, June 1970 (Paper)
ACM SIGACT News, no. 9, pp. 11{12, Apr. 1971.
12] |, \Information-theoretic limitations of formal systems," pre-
sented at the Courant Institute Computational Complexity Sy-
mp., N.Y., Oct. 1971. A revised version will appear in J. Ass.
Comput. Mach.
13] M. Kac, Statistical Independence in Probability, Analysis, and
Number Theory, Carus Math. Mono., Mathematical Association
of America, no. 12, 1959.
Information-Theoretic Computational Complexity 73
14] M. Eigen, \Selforganization of matter and the evolution of bi-
ological macromolecules," Die Naturwissenschaften, vol. 58, pp.
465{523, Oct. 1971.
15] B. Chandrasekaran and L. H. Reeker, \Arti cial intelligence|a
case for agnosticism," Ohio State University, Columbus, Ohio,
Rep. OSU-CISRC-TR-72-9, Aug. 1972 also, IEEE Trans. Syst.,
Man, Cybern., vol. SMC-4, pp. 88{94, Jan. 1974.
16] H. Rogers, Jr., Theory of Recursive Functions and Eective Com-
putability. New York: McGraw-Hill, 1967.
74 Part I|Introductory/Tutorial/Survey Papers
ALGORITHMIC
INFORMATION THEORY
Encyclopedia of Statistical Sciences, Vol-
ume 1, Wiley, New York, 1982, pp. 38{41

The Shannon entropy* concept of classical information theory* 9] is an


ensemble notion it is a measure of the degree of ignorance concerning
which possibility holds in an ensemble with a given a priori probability
distribution* X
n
H (p1 : : :  pn )
; pk log2 pk :
k=1
In algorithmic information theory the primary concept is that of the
information content of an individual object, which is a measure of how
dicult it is to specify or describe how to construct or calculate that
object. This notion is also known as information-theoretic complexity.
For introductory expositions, see refs. 1, 4, and 6. For the necessary
background on computability theory and mathematical logic, see refs.
3, 7, and 8. For a more technical survey of algorithmic information
theory and a more complete bibliography, see ref. 2. See also ref. 5.
The original formulation of the concept of algorithmic information
is independently due to R. J. Solomono 22], A. N. Kolmogorov* 19],
and G. J. Chaitin 10]. The information content I (x) of a binary string
x is de ned to be the size in bits (binary digits) of the smallest pro-
gram for a canonical universal computer U to calculate x. (That the
75
76 Part I|Introductory/Tutorial/Survey Papers
computer U is universal means that for any other computer M there
is a pre x  such that the program p makes U do exactly the same
computation that the program p makes M do.) The joint information
I (x y) of two strings is de ned to be the size of the smallest program
that makes U calculate both of them. And the conditional or relative
information I (xjy ) of x given y is de ned to be the size of the smallest
program for U to calculate x from y. The choice of the standard com-
puter U introduces at most an O(1) uncertainty in the numerical value
of these concepts. (O(f ) is read \order of f " and denotes a function
whose absolute value is bounded by a constant times f:)
With the original formulation of these de nitions, for most x one
has
I (x) = jxj + O(1) (1)
(here jxj denotes the length or size of the string x, in bits), but unfor-
tunately
I (x y)  I (x) + I (y) + O(1) (2)
holds only if one replaces the O(1) error estimate by O(log I (x)I (y)).
Chaitin 12] and L. A. Levin 20] independently discovered how to re-
formulate these de nitions so that the subadditivity property (2) holds.
The change is to require that the set of meaningful computer programs
be an instantaneous code, i.e., that no program be a pre x of another.
With this modi cation, (2) now holds, but instead of (1) most x satisfy
I (x) = jxj + I (jxj) + O(1)
= jxj + O(log jxj):
Moreover, in this theory the decomposition of the joint information
of two objects into the sum of the information content of the rst object
added to the relative information of the second one given the rst has
a dierent form than in classical information theory. In fact, instead of
I (x y) = I (x) + I (yjx) + O(1) (3)
one has
I (x y) = I (x) + I (yjx I (x)) + O(1): (4)
That (3) is false follows from the fact that I (x I (x)) = I (x)+ O(1) and
I (I (x)jx) is unbounded. This was noted by Chaitin 12] and studied
more precisely by Solovay 12, p. 339] and Ga%c 17].
Algorithmic Information Theory 77
Two other important concepts of algorithmic information theory are
mutual or common information and algorithmic independence. Their
importance has been emphasized by Fine 5, p. 141]. The mutual in-
formation content of two strings is de ned as follows:
I (x : y)
I (x) + I (y) ; I (x y):
In other words, the mutual information* of two strings is the extent
to which it is more economical to calculate them together than to cal-
culate them separately. And x and y are said to be algorithmically
independent if their mutual information I (x : y) is essentially zero, i.e.,
if I (x y) is approximately equal to I (x) + I (y). Mutual information is
symmetrical, i.e., I (x : y) = I (y : x) + O(1). More important, from the
decomposition (4) one obtains the following two alternative expressions
for mutual information:
I (x : y) = I (x) ; I (xjy I (y)) + O(1)
= I (y) ; I (yjx I (x)) + O(1):
Thus this notion of mutual information, although is applies to indi-
vidual objects rather than to ensembles, shares many of the formal
properties of the classical version of this concept.
Up until now there have been two principal applications of algorith-
mic information theory: (a) to provide a new conceptual foundation
for probability theory and statistics by making it possible to rigor-
ously de ne the notion of a random sequence*, and (b) to provide an
information-theoretic approach to metamathematics and the limitative
theorems of mathematical logic. A possible application to theoretical
mathematical biology is also mentioned below.
A random or patternless binary sequence xn of length n may be
de ned to be one of maximal or near-maximal complexity, i.e., one
whose complexity I (xn) is not much less than n. Similarly, an in nite
binary sequence x may be de ned to be random if its initial segments
xn are all random nite binary sequences. More precisely, x is random
if and only if
9c8n I (xn) > n ; c]: (5)
In other words, the in nite sequence x is random if and only if there
exists a c such that for all positive integers n, the algorithmic informa-
tion content of the string consisting of the rst n bits of the sequence x,
78 Part I|Introductory/Tutorial/Survey Papers
is bounded from below by n ; c. Similarly, a random real number may
be de ned to be one having the property that the base 2 expansion of
its fractional part is a random in nite binary sequence.
These de nitions are intended to capture the intuitive notion of a
lawless, chaotic, unstructured sequence. Sequences certi ed as random
in this sense would be ideal for use in Monte Carlo* calculations 14],
and they would also be ideal as one-time pads for Vernam ciphers or as
encryption keys 16]. Unfortunately, as we shall see below, it is a variant
of G odel's famous incompleteness theorem that such certi cation is
impossible. It is a corollary that no pseudo-random number* generator
p satisfy these de nitions. Indeed, consider a real number x, such as
can
2, , or e, which has the property that it is possible to compute the
successive binary digits of its base 2 expansion. Such x satisfy
I (xn) = I (n) + O(1) = O(log n)
and are therefore maximally nonrandom. Nevertheless, most real num-
bers are random. In fact, if each bit of an in nite binary sequence is
produced by an independent toss of an unbiased coin, then the prob-
ability that it will satisfy (5) is 1. We consider next a particularly
interesting random real number, $, discovered by Chaitin 12, p. 336].
A. M. Turing's theorem that the halting problem is unsolvable is a
fundamental result of the theory of algorithms 4]. Turing's theorem
states that there is no mechanical procedure for deciding whether or
not an arbitrary program p eventually comes to a halt when run on
the universal computer U . Let $ be the probability that the standard
computer U eventually halts if each bit of its program p is produced
by an independent toss of an unbiased coin. The unsolvability of the
halting problem is intimately connected to the fact that the halting
probability $ is a random real number, i.e., its base 2 expansion is a
random in nite binary sequence in the very strong sense (5) de ned
above. From (5) it follows that $ is normal (a notion due to E
. Borel
18]), that $ is a Kollectiv* with respect to all computable place selec-
tion rules (a concept due to R. von Mises and A. Church 15]), and it
also follows that $ satis es all computable statistical tests of random-
ness* (this notion being due to P. Martin-L of 21]). An essay by C. H.
Bennett on other remarkable properties of $, including its immunity
to computable gambling schemes, is contained in ref. 6.
Algorithmic Information Theory 79
K. G odel established his famous incompleteness theorem by mod-
ifying the paradox of the liar instead of \This statement is false" he
considers \This statement is unprovable." The latter statement is true
if and only if it is unprovable it follows that not all true statements
are theorems and thus that any formalization of mathematical logic is
incomplete 3, 7, 8]. More relevant to algorithmic information theory
is the paradox of \the smallest positive integer that cannot be speci-
ed in less than a billion words." The contradiction is that the phrase
in quotes has only 14 words, even though at least 1 billion should be
necessary. This is a version of the Berry paradox, rst published by
Russell 7, p. 153]. To obtain a theorem rather than a contradiction,
one considers instead \the binary string s which has the shortest proof
that its complexity I (s) is greater than 1 billion." The point is that
this string s cannot exist. This leads one to the metatheorem that
although most bit strings are random and have information content
approximately equal to their lengths, it is impossible to prove that a
speci c string has information content greater than n unless one is using
at least n bits of axioms. See ref. 4 for a more complete exposition of
this information-theoretic version of G odel's incompleteness theorem,
which was rst presented in ref. 11. It can also be shown that n bits of
assumptions or postulates are needed to be able to determine the rst
n bits of the base 2 expansion of the real number $.
Finally, it should be pointed out that these concepts are potentially
relevant to biology. The algorithmic approach is closer to the intuitive
notion of the information content of a biological organism than is the
classical ensemble viewpoint, for the role of a computer program and
of deoxyribonucleic acid (DNA) are roughly analogous. Reference 13
discusses possible applications of the concept of mutual algorithmic
information to theoretical biology it is suggested that a living organism
might be de ned as a highly correlated region, one whose parts have
high mutual information.

General References
1] Chaitin, G. J. (1975). Sci. Amer., 232 (5), 47{52. (An introduc-
tion to algorithmic information theory emphasizing the meaning
80 Part I|Introductory/Tutorial/Survey Papers
of the basic concepts.)
2] Chaitin, G. J. (1977). IBM J. Res. Dev., 21, 350{359, 496. (A
survey of algorithmic information theory.)
3] Davis, M., ed. (1965). The Undecidable|Basic Papers on Unde-
cidable Propositions, Unsolvable Problems and Computable Func-
tions . Raven Press, New York.
4] Davis, M. (1978). In Mathematics Today: Twelve Informal Es-
says . L. A. Steen, ed. Springer-Verlag, New York, pp. 241{267.
(An introduction to algorithmic information theory largely de-
voted to a detailed presentation of the relevant background in
computability theory and mathematical logic.)
5] Fine, T. L. (1973). Theories of Probability: An Examination of
Foundations . Academic Press, New York. (A survey of the re-
markably diverse proposals that have been made for formulating
probability mathematically. Caution: The material on algorith-
mic information theory contains some inaccuracies, and it is also
somewhat dated as a result of recent rapid progress in this eld.)
6] Gardner, M. (1979). Sci. Amer., 241 (5), 20{34. (An introduc-
tion to algorithmic information theory emphasizing the funda-
mental role played by $.)
7] Heijenoort, J. van, ed. (1977). From Frege to Godel: A Source
Book in Mathematical Logic, 1879{1931 . Harvard University
Press, Cambridge, Mass. (This book and ref. 3 comprise a stimu-
lating collection of all the classic papers on computability theory
and mathematical logic.)
8] Hofstadter, D. R. (1979). Godel, Escher, Bach: An Eternal
Golden Braid . Basic Books, New York. (The longest and most lu-
cid introduction to computability theory and mathematical logic.)
9] Shannon, C. E. and Weaver, W. (1949). The Mathematical The-
ory of Communication . University of Illinois Press, Urbana, Ill.
(The rst and still one of the very best books on classical infor-
mation theory.)
Algorithmic Information Theory 81
Additional References
10] Chaitin, G. J. (1966). J. ACM, 13, 547{569 16, 145{159 (1969).
11] Chaitin, G. J. (1974). IEEE Trans. Inf. Theory, IT-20, 10{15.
12] Chaitin, G. J. (1975). J. ACM, 22, 329{340.
13] Chaitin, G. J. (1979). In The Maximum Entropy Formalism, R.
D. Levine and M. Tribus, eds. MIT Press, Cambridge, Mass., pp.
477{498.
14] Chaitin, G. J. and Schwartz, J. T. (1978). Commun. Pure Appl.
Math., 31, 521{527.
15] Church, A. (1940). Bull. AMS, 46, 130{135.
16] Feistel, H. (1973). Sci. Amer., 228 (5), 15{23.
17] Ga%c, P. (1974). Sov. Math. Dokl., 15, 1477{1480.
18] Kac, M. (1959). Statistical Independence in Probability, Analy-
sis and Number Theory . Mathematical Association of America,
Washington, D.C.
19] Kolmogorov, A. N. (1965). Problems of Inf. Transmission, 1, 1{7.
20] Levin, L. A. (1974). Problems of Inf. Transmission, 10, 206{210.
21] Martin-L of, P. (1966). Inf. Control, 9, 602{619.
22] Solomono, R. J. (1964). Inf. Control, 7, 1{22, 224{254.

(Entropy
Information Theory
Martingales
Monte Carlo Methods
Pseudo-Random Number Generators
Statistical Independence
Tests of Randomness)
82 Part I|Introductory/Tutorial/Survey Papers

G. J. Chaitin
ALGORITHMIC
INFORMATION THEORY
IBM Journal of Research and Development
21 (1977), pp. 350{359, 496

G. J. Chaitin

Abstract
This paper reviews algorithmic information theory, which is an attempt
to apply information-theoretic and probabilistic ideas to recursive func-
tion theory. Typical concerns in this approach are, for example, the
number of bits of information required to specify an algorithm, or the
probability that a program whose bits are chosen by coin ipping pro-
duces a given output. During the past few years the denitions of algo-
rithmic information theory have been reformulated. The basic features
of the new formalism are presented here and certain results of R. M.
Solovay are reported.

83
84 Part I|Introductory/Tutorial/Survey Papers
Historical Introduction
To our knowledge, the rst publication of the ideas of algorithmic in-
formation theory was the description of R. J. Solomono's ideas given
in 1962 by M. L. Minsky in his paper, \Problems of formulation for
arti cial intelligence" 1]:
Consider a slightly dierent form of inductive inference
problem. Suppose that we are given a very long \data"
sequence of symbols the problem is to make a prediction
about the future of the sequence. This is a problem famil-
iar in discussion concerning \inductive probability." The
problem is refreshed a little, perhaps, by introducing the
modern notion of universal computer and its associated
language of instruction formulas. An instruction sequence
will be considered acceptable if it causes the computer to
produce a sequence, perhaps in nite, that begins with the
given nite \data" sequence. Each acceptable instruction
sequence thus makes a prediction, and Occam's razor would
choose the simplest such sequence and advocate its predic-
tion. (More generally, one could weight the dierent pre-
dictions by weights associated with the simplicities of the
instructions.) If the simplicity function is just the length
of the instructions, we are then trying to nd a minimal
description, i.e., an optimally ecient encoding of the data
sequence.
Such an induction method could be of interest only if
one could show some signi cant invariance with respect to
choice of de ning universal machine. There is no such in-
variance for a xed pair of data strings. For one could design
a machine which would yield the entire rst string with a
very small input, and the second string only for some very
complex input. On the brighter side, one can see that in a
sense the induced structure on the space of data strings has
some invariance in an \in the large" or \almost everywhere"
sense. Given two dierent universal machines, the induced
structures cannot be desperately dierent. We appeal to
Algorithmic Information Theory 85
the \translation theorem" whereby an arbitrary instruction
formula for one machine may be converted into an equiva-
lent instruction formula for the other machine by the addi-
tion of a constant pre x text. This text instructs the second
machine to simulate the behavior of the rst machine in op-
erating on the remainder of the input text. Then for data
strings much larger than this translation text (and its in-
verse) the choice between the two machines cannot greatly
aect the induced structure. It would be interesting to see
if these intuitive notions could be pro tably formalized.
Even if this theory can be worked out, it is likely that
it will present overwhelming computational diculties in
application. The recognition problem for minimal descrip-
tions is, in general, unsolvable, and a practical induction
machine will have to use heuristic methods. In this con-
nection it would be interesting to write a program to play
R. Abbott's inductive card game 2].]
Algorithmic information theory originated in the independent work
of Solomono (see 1, 3{6]), of A. N. Kolmogorov and P. Martin-L of
(see 7{14]), and of G. J. Chaitin (see 15{26]). Whereas Solomono
weighted together all the programs for a given result into a probability
measure, Kolmogorov and Chaitin concentrated their attention on the
size of the smallest program. Recently it has been realized by Chaitin
and independently by L. A. Levin that if programs are stipulated to
be self-delimiting, these two diering approaches become essentially
equivalent. This paper attempts to cast into a uni ed scheme the recent
work in this area by Chaitin 23,24] and by R. M. Solovay 27,28]. The
reader may also nd it interesting to examine the parallel eorts of
Levin (see 29{35]). There has been a substantial amount of other
work in this general area, often involving variants of the de nitions
deemed more suitable for particular applications (see, e.g., 36{47]).
86 Part I|Introductory/Tutorial/Survey Papers
Algorithmic Information Theory of Finite
Computations 23]
Denitions
Let us start by considering a class of Turing machines with the following
characteristics. Each Turing machine has three tapes: a program tape,
a work tape, and an output tape. There is a scanning head on each
of the three tapes. The program tape is read-only and each of its
squares contains a 0 or a 1. It may be shifted in only one direction.
The work tape may be shifted in either direction and may be read
and erased, and each of its squares contains a blank, a 0, or a 1. The
work tape is initially blank. The output tape may be shifted in only
one direction. Its squares are initially blank, and may have a 0, a 1,
or a comma written on them, and cannot be rewritten. Each Turing
machine of this type has a nite number n of states, and is de ned by
an n 3 table, which gives the action to be performed and the next
state as a function of the current state and the contents of the square
of the work tape that is currently being scanned. The rst state in
this table is by convention the initial state. There are eleven possible
actions: halt, shift work tape left/right, write blank/0/1 on work tape,
read square of program tape currently being scanned and copy onto
square of work tape currently being scanned and then shift program
tape, write 0/1/comma on output tape and then shift output tape,
and consult oracle. The oracle is included for the purpose of de ning
relative concepts. It enables the Turing machine to choose between
two possible state transitions, depending on whether or not the binary
string currently being scanned on the work tape is in a certain set,
which for now we shall take to be the null set.
From each Turing machine M of this type we de ne a probability
P , an entropy H , and a complexity I . P (s) is the probability that M
eventually halts with the string s written on its output tape if each
square of the program tape is lled with a 0 or a 1 by a separate toss
of an unbiased coin. By \string" we shall always mean a nite binary
string. From the probability P (s) we obtain the entropy H (s) by taking
the negative base-two logarithm, i.e., H (s) is ; log2 P (s). A string p is
Algorithmic Information Theory 87
said to be a program if when it is written on M 's program tape and M
starts computing scanning the rst bit of p, then M eventually halts
after reading all of p and without reading any other squares of the tape.
A program p is said to be a minimal program if no other program makes
M produce the same output and has a smaller size. And nally the
complexity I (s) is de ned to be the least n such that for some contents
of its program tape M eventually halts with s written on the output
tape after reading precisely n squares of the program tape i.e., I (s) is
the size of a minimal program for s. To summarize, P is the probability
that M calculates s given a random program, H is ; log2 P , and I is
the minimum number of bits required to specify an algorithm for M to
calculate s.
It is important to note that blanks are not allowed on the program
tape, which is imagined to be entirely lled with 0's and 1's. Thus pro-
grams are not followed by endmarker blanks. This forces them to be
self-delimiting a program must indicate within itself what size it has.
Thus no program can be a pre x of another one, and the programs for
M form what is known as a pre x-free set or an instantaneous code.
This has two very important eects: It enables a natural probability
distribution to be de ned on the set of programs, and it makes it pos-
sible for programs to be built up from subroutines by concatenation.
Both of these desirable features are lost if blanks are used as program
endmarkers. This occurs because there is no natural probability distrib-
ution on programs with endmarkers one, of course, makes all programs
of the same size equiprobable, but it is also necessary to specify in some
arbitrary manner the probability of each particular size. Moreover, if
two subroutines with blanks as endmarkers are concatenated, it is nec-
essary to include additional information indicating where the rst one
ends and the second one begins.
Here is an example of a speci c Turing machine M of the above
type. M counts the number n of 0's up to the rst 1 it encounters
on its program tape, then transcribes the next n bits of the program
tape onto the output tape, and nally halts. So M outputs s i it
nds length(s) 0's followed by a 1 followed by s on its program tape.
Thus P (s) = exp2 ;2 length(s) ; 1], H (s) = 2 length(s)+1, and I (s) =
2 length(s) + 1. Here exp2 x] is the base-two exponential function 2x.
Clearly this is a very special-purpose computer which embodies a very
88 Part I|Introductory/Tutorial/Survey Papers
limited class of algorithms and yields uninteresting functions P , H , and
I.
On the other hand it is easy to see that there are \general-purpose"
Turing machines that maximize P and minimize H and I  in fact, con-
sider those universal Turing machines which will simulate an arbitrary
Turing machine if a suitable pre x indicating the machine to simulate
is added to its programs. Such Turing machines yield essentially the
same P , H , and I . We therefore pick, somewhat arbitrarily, a par-
ticular one of these, U , and the de nitive de nition of P , H , and I
is given in terms of it. The universal Turing machine U works as fol-
lows. If U nds i 0's followed by a 1 on its program tape, it simulates
the computation that the ith Turing machine of the above type per-
forms upon reading the remainder of the program tape. By the ith
Turing machine we mean the one that comes ith in a list of all possible
de ning tables in which the tables are ordered by size (i.e., number
of states) and lexicographically among those of the same size. With
this choice of Turing machine, P , H , and I can be digni ed with the
following titles: P (s) is the algorithmic probability of s, H (s) is the
algorithmic entropy of s, and I (s) is the algorithmic information of s.
Following Solomono 3], P (s) and H (s) may also be called the a priori
probability and entropy of s. I (s) may also be termed the descrip-
tive, program-size, or information-theoretic complexity of s. And since
P is maximal and H and I are minimal, the above choice of special-
purpose Turing machine shows that P (s)  exp2 ;2 length(s) ; O(1)],
H (s)  2 length(s) + O(1), and I (s)  2 length(s) + O(1).
We have de ned P (s), H (s), and I (s) for individual strings s.
It is also convenient to consider computations which produce nite
sequences of strings. These are separated by commas on the out-
put tape. One thus de nes the joint probability P (s1 : : : sn), the
joint entropy H (s1 : : : sn ), and the joint complexity I (s1 : : : sn ) of
an n-tuple s1 : : :  sn. Finally one de nes the conditional probabil-
ity P (t1 : : : tmjs1 : : :  sn) of the m-tuple t1 : : : tm given the n-tuple
s1 : : : sn to be the quotient of the joint probability of the n-tuple and
the m-tuple divided by the joint probability of the n-tuple. In particu-
lar P (tjs) is de ned to be P (s t)=P (s). And of course the conditional
entropy is de ned to be the negative base-two logarithm of the condi-
tional probability. Thus by de nition H (s t) = H (s) + H (tjs). Finally,
Algorithmic Information Theory 89
in order to extend the above de nitions to tuples whose members may
either be strings or natural numbers, we identify the natural number n
with its binary expansion.

Basic Relationships
We now review some basic properties of these concepts. The relation
H (s t) = H (t s) + O(1)
states that the probability of computing the pair s t is essentially the
same as the probability of computing the pair t s. This is true because
there is a pre x that converts any program for one of these pairs into
a program for the other one. The inequality
H (s)  H (s t) + O(1)
states that the probability of computing s is not less than the proba-
bility of computing the pair s t. This is true because a program for s
can be obtained from any program for the pair s t by adding a xed
pre x to it. The inequality
H (s t)  H (s) + H (t) + O(1)
states that the probability of computing the pair s t is not less than the
product of the probabilities of computing s and t, and follows from the
fact that programs are self-delimiting and can be concatenated. The
inequality
O(1)  H (tjs)  H (t) + O(1)
is merely a restatement of the previous two properties. However, in
view of the direct relationship between conditional entropy and relative
complexity indicated below, this inequality also states that being told
something by an oracle cannot make it more dicult to obtain t. The
relationship between entropy and complexity is
H (s) = I (s) + O(1)
i.e., the probability of computing s is essentially the same as 1= exp2 the
size of a minimal program for s]. This implies that a signi cant frac-
tion of the probability of computing s is contributed by its minimal
90 Part I|Introductory/Tutorial/Survey Papers
programs, and that there are few minimal or near-minimal programs
for a given result. The relationship between conditional entropy and
relative complexity is
H (tjs) = Is(t) + O(1):
Here Is(t) denotes the complexity of t relative to a set having a single
element which is a minimal program for s. In other words,
I (s t) = I (s) + Is(t) + O(1):
This relation states that one obtains what is essentially a minimal pro-
gram for the pair s t by concatenating the following two subroutines:
 a minimal program for s

 a minimal program for calculating t using an oracle for the set


consisting of a minimal program for s.

Algorithmic Randomness
Consider an arbitrary string s of length n. From the fact that
H (n) + H (sjn) = H (n s) = H (s) + O(1)
it is easy to show that H (s)  n + H (n) + O(1), and that less than
exp2 n ; k + O(1)] of the s of length n satisfy H (s) < n + H (n) ; k.
It follows that for most s of length n, H (s) is approximately equal to
n + H (n). These are the most complex strings of length n, the ones
which are most dicult to specify, the ones with highest entropy, and
they are said to be the algorithmically random strings of length n. Thus
a typical string s of length n will have H (s) close to n + H (n), whereas
if s has pattern or can be distinguished in some fashion, then it can
be compressed or coded into a program that is considerably smaller.
That H (s) is usually n + H (n) can be thought of as follows: In order
to specify a typical strings s of length n, it is necessary to rst specify
its size n, which requires H (n) bits, and it is necessary then to specify
each of the n bits in s, which requires n more bits and brings the
total to n + H (n). In probabilistic terms this can be stated as follows:
Algorithmic Information Theory 91
the sum of the probabilities of all the strings of length n is essentially
equal to P (n), and most strings s of length n have probability P (s)
essentially equal to P (n)=2n . On the other hand, one of the strings of
length n that is least random and that has most pattern is the string
consisting entirely of 0's. It is easy to see that this string has entropy
H (n)+ O(1) and probability essentially equal to P (n), which is another
way of saying that almost all the information in it is in its length. Here
is an example in the middle: If p is a minimal program of size n, then it
is easy to see that H (p) = n + O(1) and P (p) is essentially 2;n . Finally
it should be pointed out that since H (s) = H (n)+ H (sjn)+ O(1) if s is
of length n, the above de nition of randomness is equivalent to saying
that the most random strings of length n have H (sjn) close to n, while
the least random ones have H (sjn) close to 0.
Later we shall show that even though most strings are algorithmi-
cally random, i.e., have nearly as much entropy as possible, an inherent
limitation of formal axiomatic theories is that a lower bound n on the
entropy of a speci c string can be established only if n is less than
the entropy of the axioms of the formal theory. In other words, it is
possible to prove that a speci c object is of complexity greater than
n only if n is less than the complexity of the axioms being employed
in the demonstration. These statements may be considered to be an
information-theoretic version of G odel's famous incompleteness theo-
rem.
Now let us turn from nite random strings to in nite ones, or equiv-
alently, by invoking the correspondence between a real number and its
dyadic expansion, to random reals. Consider an in nite string X ob-
tained by ipping an unbiased coin, or equivalently a real x uniformly
distributed in the unit interval. From the preceding considerations and
the Borel-Cantelli lemma it is easy to see that with probability one
there is a c such that H (Xn ) > n ; c for all n, where Xn denotes the
rst n bits of X , that is, the rst n bits of the dyadic expansion of x.
We take this property to be our de nition of an algorithmically random
in nite string X or real x.
Algorithmic randomness is a clear-cut property for in nite strings,
but in the case of nite strings it is a matter of degree. If a cuto were
to be chosen, however, it would be well to place it at about the point at
which H (s) is equal to length(s). Then an in nite random string could
92 Part I|Introductory/Tutorial/Survey Papers
be de ned to be one for which all initial segments are nite random
strings, within a certain tolerance.
Now consider the real number $ de ned as the halting probability of
the universal Turing machine U that we used to de ne P , H , and I  i.e.,
$ is the probability that U eventually halts if each square of its program
tape is lled with a 0 or a 1 by a separate toss of an unbiased coin. Then
it is not dicult to see that $ is in fact an algorithmically random real,
because if one were given the rst n bits of the dyadic expansion of $,
then one could use this to tell whether each program for U of size less
than n ever halts or not. In other words, when written in binary the
probability of halting $ is a random or incompressible in nite string.
Thus the basic theorem of recursive function theory that the halting
problem is unsolvable corresponds in algorithmic information theory to
the theorem that the probability of halting is algorithmically random
if the program is chosen by coin ipping.
This concludes our review of the most basic facts regarding the
probability, entropy, and complexity of nite objects, namely strings
and tuples of strings. Before presenting some of Solovay's remark-
able results regarding these concepts, and in particular regarding $, we
would like to review the most important facts which are known regard-
ing the probability, entropy, and complexity of in nite objects, namely
recursively enumerable sets of strings.

Algorithmic Information Theory of In


nite
Computations 24]
In order to de ne the probability, entropy, and complexity of r.e. (recur-
sively enumerable) sets of strings it is necessary to consider unending
computations performed on our standard universal Turing machine U .
A computation is said to produce an r.e. set of strings A if all the
members of A and only members of A are eventually written on the
output tape, each followed by a comma. It is important that U not be
required to halt if A is nite. The members of the set A may be written
in arbitrary order, and duplications are ignored. A technical point: If
there are only nitely many strings written on the output tape, and the
Algorithmic Information Theory 93
last one is in nite or is not followed by a comma, then it is considered
to be an \un nished" string and is also ignored. Note that since com-
putations may be endless, it is now possible for a semi-in nite portion
of the program tape to be read.
The de nitions of the probability P (A), the entropy H (A), and the
complexity I (A) of an r.e. set of strings A may now be given. P (A) is
the probability that U produces the output set A if each square of its
program tape is lled with a 0 or a 1 by a separate toss of an unbiased
coin. H (A) is the negative base-two logarithm of P (A). And I (A)
is the size in bits of a minimal program that produces the output set
A, i.e., I (A) is the least n such that there is a program tape contents
that makes U undertake a computation in the course of which it reads
precisely n squares of the program tape and produces the set of strings
A. In order to de ne the joint and conditional probability and entropy
we need a mechanism for encoding two r.e. sets A and B into a single set
A join B . To obtain A join B one pre xes each string in A with a 0 and
each string in B with a 1 and takes the union of the two resulting sets.
Enumerating A join B is equivalent to simultaneously enumerating A
and B . So the joint probability P (A B ) is P (A join B ), the joint
entropy H (A B ) is H (A join B ), and the joint complexity I (A B ) is
I (A join B ). These de nitions can obviously be extended to more than
two r.e. sets, but it is unnecessary to do so here. Lastly, the conditional
probability P (B jA) of B given A is the quotient of P (A B ) divided
by P (A), and the conditional entropy H (B jA) is the negative base-two
logarithm of P (B jA). Thus by de nition H (A B ) = H (A) + H (B jA).
As before, one obtains the following basic inequalities:
 H (A B ) = H (B A) + O(1),

 H (A)  H (A B ) + O(1),

 H (A B )  H (A) + H (B ) + O(1),

 O(1)  H (B jA)  H (B ) + O(1),

 I (A B )  I (A) + I (B ) + O(1).


In order to demonstrate the third and the fth of these relations one
imagines two unending computations to be occurring simultaneously.
94 Part I|Introductory/Tutorial/Survey Papers
Then one interleaves the bits of the two programs in the order in which
they are read. Putting a xed size pre x in front of this, one obtains a
single program for performing both computations simultaneously whose
size is O(1) plus the sum of the sizes of the original programs.
So far things look much as they did for individual strings. But
the relationship between entropy and complexity turns out to be more
complicated for r.e. sets than it was in the case of individual strings.
Obviously the entropy H (A) is always less than or equal to the com-
plexity I (A), because of the probability contributed by each minimal
program for A:
 H (A)  I (A).

But how about bounds on I (A) in terms of H (A)? First of all, it is


easy to see that if A is a singleton set whose only member is the string
s, then H (A) = H (s) + O(1) and I (A) = I (s) + O(1). Thus the theory
of the algorithmic information of individual strings is contained in the
theory of the algorithmic information of r.e. sets as the special case of
sets having a single element:
 For singleton A, I (A) = H (A) + O(1).

There is also a close but not an exact relationship between H and I


in the case of sets consisting of initial segments of the set of natural
numbers (recall we identify the natural number n with its binary rep-
resentation). Let us use the adjective \initial" for any set consisting of
all natural numbers less than a given one:
 For initial A, I (A) = H (A) + O(log H (A)).

Moreover, it is possible to show that there are in nitely many initial


sets A for which I (A) > H (A) + O(log H (A)). This is the greatest
known discrepancy between I and H for r.e. sets. It is demonstrated by
showing that occasionally the number of initial sets A with H (A) < n
is appreciably greater than the number of initial sets A with I (A) < n.
On the other hand, with the aid of a crucial game-theoretic lemma of
D. A. Martin, Solovay 28] has shown that
 I (A)  3H (A) + O(log H (A)).
Algorithmic Information Theory 95
These are the best results currently known regarding the relationship
between the entropy and the complexity of an r.e. set clearly much
remains to be done. Furthermore, what is the relationship between the
conditional entropy and the relative complexity of r.e. sets? And how
many minimal or near-minimal programs for an r.e. set are there?
We would now like to mention some other results concerning these
concepts. Solovay has shown that:
 There are exp2 n ; H (n)+ O(1)] singleton sets A with H (A) < n,

 There are exp2 n ; H (n) + O(1)] singleton sets A with I (A) < n.

We have extended Solovay's result as follows:


 There are exp2 n ; H 0 (n) + O(1)] nite sets A with H (A) < n,

 There are exp2 n ; H (Ln ) + O(log H (Ln ))] sets A with I (A) < n,

 There are exp2 n;H 0 (Ln )+O(log H 0 (Ln ))] sets A with H (A) < n.

Here Ln is the set of natural numbers less than n, and H 0 is the entropy
relative to the halting problem if U is provided with an oracle for the
halting problem instead of one for the null set, then the probability,
entropy, and complexity measures one obtains are P 0, H 0, and I 0 instead
of P , H , and I . Two nal results:
 I 0(A the complement of A)  H (A) + O(1)

 the probability that the complement of an r.e. set has cardinality


n is essentially equal to the probability that a set r.e. in the halting
problem has cardinality n.

More Advanced Results 27]


The previous sections outline the basic features of the new formalism for
algorithmic information theory obtained by stipulating that programs
be self-delimiting instead of having endmarker blanks. Error terms in
the basic relations which were logarithmic in the previous approach 9]
are now of the order of unity.
96 Part I|Introductory/Tutorial/Survey Papers
In the previous approach the complexity of n is usually log2 n+O(1),
there is an information-theoretic characterization of recursive in nite
strings 25,26], and much is known about complexity oscillations in
random in nite strings 14]. The corresponding properties in the new
approach have been elucidated by Solovay in an unpublished paper 27].
We present some of his results here. For related work by Solovay, see
the publications 28,48,49].

Recursive Bounds on H(n)


Following 23, p. 337], let us consider recursive upper and lower bounds
on H (n). PLet f be an unbounded recursive function, and consider
the series exp2 ;f (n)] summed over all natural numbers n. If this
in nite series converges, then H (n) < f (n) + O(1) for all n. And if
it diverges, then the inequalities H (n) > f (n) and H (n) < f (n) each
hold for in nitely many n. Thus, for example, for any  > 0,
H (n) < log n + log log n + (1 + ) log log log n + O(1)
for all n, and
H (n) > log n + log log n + log log log n
for in nitely many n, where all logarithms are base two. See 50] for
the results on convergence used to prove this.
Solovay has obtained the following results regarding recursive upper
bounds on H , i.e., recursive h such that H (n) < h(n) for all n. First
he shows that there is a recursive upper bound on H which is almost
correct in nitely often, i.e., jH (n) ; h(n)j < c for in nitely many values
of n. In fact, the lim sup of the fraction of values of i less than n such
that jH (i) ; h(i)j < c is greater than 0. However, he also shows that the
values of n for which jH (n) ; h(n)j < c must in a certain sense be quite
sparse. In fact, he establishes that if h is any recursive upper bound
on H then there cannot exist a tolerance c and a recursive function f
such that there are always at least n dierent natural numbers i less
than f (n) at which h(i) is within c of H (i). It follows that the lim inf
of the fraction of values of i less than n such that jH (i) ; h(i)j < c is
zero.
Algorithmic Information Theory 97
The basic idea behind his construction of h is to choose f so that
P exp
2 ;f (n)] converges \as slowly" as possible. As a P by-product he
obtains a recursive convergent series of rational numbers an such that
if P bn is any recursive convergent series of rational numbers, then lim
sup an=bn is greater than zero.

Nonrecursive Innite Strings with Simple Initial


Segments
At the high-order end of the complexity scale for in nite strings are the
random strings, and the recursive strings are at the low order end. Is
anything else there? More formally, let X be an in nite binary string,
and let Xn be the rst n bits of X . If X is recursive, then we have
H (Xn ) = H (n) + O(1). What about the converse, i.e., what can be
said about X given only that H (Xn ) = H (n) + O(1)? Obviously
H (Xn ) = H (n Xn ) + O(1) = H (n) + H (Xn jn) + O(1):
So H (Xn ) = H (n)+ O(1) i H (Xn jn) = O(1). Then using a relativized
version of the proof in 37, pp. 525{526], one can show that X is re-
cursive in the halting problem. Moreover, by using a priority argument
Solovay is actually able to construct a nonrecursive X that satis es
H (Xn ) = H (n) + O(1).

Equivalent Denitions of an Algorithmically Ran-


dom Real
Pick a recursive enumeration O0 O1 O2 : : : of all open intervals with
rational endpoints. A sequence of open sets U0 U1 U2 : : : is said to be
simultaneously r.e. if there is a recursive function h such that Un is the
union of those Oi whose index i is of the form h(n j ), for some natural
number j . Consider a real number x in the unit interval. We say
that x has the Solovay randomness property if the following holds. Let
U0 U1 U2 : : : be any simultaneously r.e. sequence of open sets such that
the sum of the usual Lebesgue measure of the Un converges. Then x is in
only nitely many of the Un . We say that x has the Chaitin randomness
property if there is a c such that H (Xn ) > n ; c for all n, where Xn
98 Part I|Introductory/Tutorial/Survey Papers
is the string consisting of the rst n bits of the dyadic expansion of x.
Solovay has shown that these randomness properties are equivalent to
each other, and that they are also equivalent to Martin-L of's de nition
10] of randomness.

The Entropy of Initial Segments of Algorithmically


Random and of -like Reals
Consider a random real x. By the de nition of randomness, H (Xn ) >
n + O(1). On the other hand, for any in nite string X , random or not,
we have H (Xn )  n + H (n) + O(1). Solovay shows that the above
bounds are each sometimes sharp. More Pprecisely, consider a random
X and a recursive function f such that exp2 ;f (n)] diverges (e.g.,
f (n) = integer part of log2 n). Then there are in nitely many natural
numbers n such that H (Xn )  n + f (n). And consider an unbounded
monotone increasing recursive function g (e.g., g(n) = integer part of
log log n). There in nitely many natural numbers n such that it is
simultaneously the case that H (Xn )  n + g(n) and H (n)  f (n).
Solovay has obtained much more precise results along these lines
about Ω and a class of reals which he calls "Ω-like." A real number is
said to be an r.e. real if the set of all rational numbers less than it is
an r.e. subset of the rational numbers. Roughly speaking, an r.e. real
x is Ω-like if for any r.e. real y one can get in an effective manner a
good approximation to y from any good approximation to x, and the
quality of the approximation to y is at most O(1) binary digits worse
than the quality of the approximation to x. The formal definition of
Ω-like is as follows. The real x is said to dominate the real y if there
is a partial recursive function f and a constant c with the property
that if q is any rational number that is less than x, then f(q) is defined
and is a rational number that is less than y and satisfies the inequality
c|x − q| ≥ |y − f(q)|. And a real number is said to be Ω-like if it is an r.e.
real that dominates all r.e. reals. Solovay proves that Ω is in fact Ω-like,
and that if x and y are Ω-like, then H(Xn) = H(Yn) + O(1), where Xn
and Yn are the first n bits in the dyadic expansions of x and y. It is an
immediate corollary that if x is Ω-like then H(Xn) = H(Ωn) + O(1),
and that all Ω-like reals are algorithmically random. Moreover Solovay
shows that the algorithmic probability P(s) of any string s is always
an Ω-like real.
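In symbols, writing x ⪰ y for "x dominates y" (the ⪰ notation is ours, not Solovay's), the definition above reads:

    x \succeq y \iff \exists f \text{ partial recursive}, \; \exists c, \; \forall q \in \mathbb{Q} \text{ with } q < x : \; f(q) \text{ is defined}, \; f(q) < y, \; c\,|x - q| \ge |y - f(q)|,

and x is Ω-like iff x is an r.e. real with x ⪰ y for every r.e. real y.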
In order to state Solovay's results contrasting the behavior of H(Ωn)
with that of H(Xn) for a typical real number x, it is necessary to define
two extremely slowly growing monotone functions α and α′. α(n) =
min H(j) (j ≥ n), and α′ is defined in the same manner as α except
that H is replaced by H′, the algorithmic entropy relative to the halting
problem. It can be shown (see [29, pp. 90–91]) that α goes to infinity,
but more slowly than any monotone partial recursive function does.
More precisely, if f is an unbounded nondecreasing partial recursive
function, then α(n) is less than f(n) for almost all n for which f(n)
is defined. Similarly α′ goes to infinity, but more slowly than any
monotone partial function recursive in the halting problem does. More
precisely, if f is an unbounded nondecreasing partial function recursive in the halting
problem, then α′(n) is less than f(n) for almost all n for which f(n) is
defined. In particular, α′(n) is less than α(α(n)) for almost all n.
We can now state Solovay's results. Consider a real number x uni-
formly distributed in the unit interval. With probability one there is a
c such that H(Xn) > n + H(n) − c holds for infinitely many n. And
with probability one, H(Xn) > n + α(n) + O(log α(n)). Whereas if x
is Ω-like, then the following occurs:

H(Xn) < n + H(n) − α(n) + O(log α(n)),

and for infinitely many n we have

H(Xn) < n + α′(n) + O(log α′(n)).

This shows that the complexity of initial segments of the dyadic ex-
pansions of Ω-like reals is atypical. It is an open question whether
H(Ωn) − n tends to infinity; Solovay suspects that it does.
Algorithmic Information Theory and Metamathematics
There is something paradoxical about being able to prove that a spe-
cific finite string is random; this is perhaps captured in the following
antinomies from the writings of M. Gardner [51] and B. Russell [52]. In
reading them one should interpret "dull," "uninteresting," and "inde-
finable" to mean "random," and "interesting" and "definable" to mean
"nonrandom."
[Natural] numbers can of course be interesting in a variety of ways. The number 30 was interesting to George Moore when he wrote his famous tribute to "the woman of 30," the age at which he believed a married woman was most fascinating. To a number theorist 30 is more likely to be exciting because it is the largest integer such that all smaller integers with which it has no common divisor are prime numbers... The question arises: Are there any uninteresting numbers? We can prove that there are none by the following simple steps. If there are dull numbers, we can then divide all numbers into two sets—interesting and dull. In the set of dull numbers there will be only one number that is smallest. Since it is the smallest uninteresting number it becomes, ipso facto, an interesting number. [Hence there are no dull numbers!] [51]
Among transfinite ordinals some can be defined, while others cannot; for the total number of possible definitions is ℵ0, while the number of transfinite ordinals exceeds ℵ0. Hence there must be indefinable ordinals, and among these there must be a least. But this is defined as "the least indefinable ordinal," which is a contradiction. [52]
Here is our incompleteness theorem for formal axiomatic theories whose
arithmetic consequences are true. The setup is as follows: The axioms
are a finite string, the rules of inference are an algorithm for enumerat-
ing the theorems given the axioms, and we fix the rules of inference and
vary the axioms. Within such a formal theory a specific string cannot
be proven to be of entropy more than O(1) greater than the entropy of
the axioms of the theory. Conversely, there are formal theories whose
axioms have entropy n + O(1) in which it is possible to establish all
true propositions of the form "H(specific string) ≥ n."
Proof. Consider the enumeration of the theorems of the formal ax-
iomatic theory in order of the size of their proofs. For each natural
number k, let s be the string in the theorem of the form "H(s) ≥ n" with
n greater than H(axioms) + k which appears first in the enumeration.
On the one hand, if all theorems are true, then H(s) > H(axioms) + k.
On the other hand, the above prescription for calculating s shows that
H(s) ≤ H(axioms) + H(k) + O(1). It follows that k < H(k) + O(1).
However, this inequality is false for all k ≥ k*, where k* depends only
on the rules of inference. The apparent contradiction is avoided only if
s does not exist for k = k*, i.e., only if it is impossible to prove in the
formal theory that a specific string has H greater than H(axioms) + k*.
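The "prescription for calculating s" invoked in this proof is just a search through the theorems. Here is a minimal sketch in Python, where enumerate_theorems(), yielding the theorems in order of proof size, and parse_entropy_bound(t), recovering the pair (s, n) from any theorem of the form "H(s) ≥ n", are hypothetical subroutines standing in for the formal theory's proof checker:

    def first_string_proved_complex(h_axioms, k):
        # Search the theorems, in order of proof size, for the first one
        # asserting H(s) >= n with n > H(axioms) + k, and output that s.
        # This search is itself a program of size H(axioms) + H(k) + O(1),
        # so if it halts it outputs a string s with
        # H(s) <= H(axioms) + H(k) + O(1), contradicting the theorem
        # H(s) > H(axioms) + k once k is large; hence it can never succeed.
        for theorem in enumerate_theorems():        # hypothetical enumerator
            parsed = parse_entropy_bound(theorem)   # hypothetical parser: (s, n) or None
            if parsed is not None:
                s, n = parsed
                if n > h_axioms + k:
                    return s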
Proof of Converse. The set T of all true propositions of the form
"H(s) < k" is r.e. Choose a fixed enumeration of T without repetitions,
and for each natural number n let s be the string in the last proposition
of the form "H(s) < n" in the enumeration. It is not difficult to see
that H(s, n) = n + O(1). Let p be a minimal program for the pair
s, n. Then p is the desired axiom, for H(p) = n + O(1) and to obtain
all true propositions of the form "H(s) ≥ n" from p one enumerates
T until all s with H(s) < n have been discovered. All other s have
H(s) ≥ n.
We developed this information-theoretic approach to metamathe-
matics before being in possession of the notion of self-delimiting pro-
grams (see [20–22] and also [53]); the technical details are somewhat
different when programs have blanks as endmarkers. The conclusion to
be drawn from all this is that even though most strings are random, we
will never be able to explicitly exhibit a string of reasonable size which
demonstrably possesses this property. A less pessimistic conclusion to be
drawn is that it is reasonable to measure the power of formal axiomatic
theories in information-theoretic terms. The fact that in some sense
one gets out of a formal theory no more than one puts in should not be
taken too seriously: a formal theory is at its best when a great many ap-
parently independent theorems are shown to be closely interrelated by
reducing them to a handful of axioms. In this sense a formal axiomatic
theory is valuable for the same reason as a scientific theory: in both
cases information is being compressed, and one is also concerned with
the trade-off between the degree of compression and the length of proofs
of interesting theorems or the time required to compute predictions.
Algorithmic Information Theory and Biology
Above we have pointed out a number of open problems. In our opin-
ion, however, the most important challenge is to see if the ideas of
algorithmic information theory can contribute in some form or manner
to theoretical mathematical biology in the style of von Neumann [54],
in which genetic information is considered to be an extremely large and
complicated program for constructing organisms. We alluded briefly to
this in a previous paper [21], and discussed it at greater length in a
publication [19] of somewhat limited access.
Von Neumann wished to isolate the basic conceptual problems of
biology from the detailed physics and biochemistry of life as we know
it. The gist of his message is that it should be possible to formulate
mathematically and to answer in a quite general setting such funda-
mental questions as "How is self-reproduction possible?", "What is an
organism?", "What is its degree of organization?", and "How probable
is evolution?". He achieved this for the first question; he showed that
exact self-reproduction of universal Turing machines is possible in a
particular deterministic model universe.
There is such an enormous difference between dead and organized
living matter that it must be possible to give a quantitative structural
characterization of this difference, i.e., of degree of organization. One
possibility [19] is to characterize an organism as a highly interdependent
region, one for which the complexity of the whole is much less than the
sum of the complexities of its parts. C. H. Bennett [55] has suggested
another approach based on the notion of "logical depth." A structure is
deep "if it is superficially random but subtly redundant, in other words,
if almost all its algorithmic probability is contributed by slow-running
programs. A string's logical depth should reflect the amount of compu-
tational work required to expose its buried redundancy." It is Bennett's
thesis that "a priori the most probable explanation of 'organized infor-
mation' such as the sequence of bases in a naturally occurring DNA
molecule is that it is the product of an extremely long evolutionary
process." For related work by Bennett, see [56].
This, then, is the fundamental problem of theoretical biology that
we hope the ideas of algorithmic information theory may help to solve:
to set up a nondeterministic model universe, to formally define what
it means for a region of space-time in that universe to be an organism
and what is its degree of organization, and to rigorously demonstrate
that, starting from simple initial conditions, organisms will appear and
evolve in degree of organization in a reasonable amount of time and
with high probability.
Acknowledgments
The quotation by M. L. Minsky in the first section is reprinted with
the kind permission of the publisher American Mathematical Society
from Mathematical Problems in the Biological Sciences, Proceedings of
Symposia in Applied Mathematics XIV, pp. 42–43, copyright © 1962.
We are grateful to R. M. Solovay for permitting us to include several of
his unpublished results in the section entitled "More advanced results."
The quotation by M. Gardner in the section on algorithmic information
theory and metamathematics is reprinted with his kind permission, and
the quotation by B. Russell in that section is reprinted with permission
of the Johns Hopkins University Press. We are grateful to C. H. Bennett
for permitting us to present his notion of logical depth in print for the
first time in the section on algorithmic information theory and biology.
References
[1] M. L. Minsky, "Problems of Formulation for Artificial Intelligence," Mathematical Problems in the Biological Sciences, Proceedings of Symposia in Applied Mathematics XIV, R. E. Bellman, ed., American Mathematical Society, Providence, RI, 1962, p. 35.

[2] M. Gardner, "An Inductive Card Game," Sci. Amer. 200, No. 6, 160 (1959).

[3] R. J. Solomonoff, "A Formal Theory of Inductive Inference," Info. Control 7, 1, 224 (1964).
[4] D. G. Willis, "Computational Complexity and Probability Constructions," J. ACM 17, 241 (1970).

[5] T. M. Cover, "Universal Gambling Schemes and the Complexity Measures of Kolmogorov and Chaitin," Statistics Department Report 12, Stanford University, CA, October, 1974.

[6] R. J. Solomonoff, "Complexity Based Induction Systems: Comparisons and Convergence Theorems," Report RR-329, Rockford Research, Cambridge, MA, August, 1976.

[7] A. N. Kolmogorov, "On Tables of Random Numbers," Sankhya A25, 369 (1963).

[8] A. N. Kolmogorov, "Three Approaches to the Quantitative Definition of Information," Prob. Info. Transmission 1, No. 1, 1 (1965).

[9] A. N. Kolmogorov, "Logical Basis for Information Theory and Probability Theory," IEEE Trans. Info. Theor. IT-14, 662 (1968).

[10] P. Martin-Löf, "The Definition of Random Sequences," Info. Control 9, 602 (1966).

[11] P. Martin-Löf, "Algorithms and Randomness," Intl. Stat. Rev. 37, 265 (1969).

[12] P. Martin-Löf, "The Literature on von Mises' Kollektivs Revisited," Theoria 35, Part 1, 12 (1969).

[13] P. Martin-Löf, "On the Notion of Randomness," Intuitionism and Proof Theory, A. Kino, J. Myhill, and R. E. Vesley, eds., North-Holland Publishing Co., Amsterdam, 1970, p. 73.

[14] P. Martin-Löf, "Complexity Oscillations in Infinite Binary Sequences," Z. Wahrscheinlichk. verwand. Geb. 19, 225 (1971).

[15] G. J. Chaitin, "On the Length of Programs for Computing Finite Binary Sequences," J. ACM 13, 547 (1966).
[16] G. J. Chaitin, "On the Length of Programs for Computing Finite Binary Sequences: Statistical Considerations," J. ACM 16, 145 (1969).

[17] G. J. Chaitin, "On the Simplicity and Speed of Programs for Computing Infinite Sets of Natural Numbers," J. ACM 16, 407 (1969).

[18] G. J. Chaitin, "On the Difficulty of Computations," IEEE Trans. Info. Theor. IT-16, 5 (1970).

[19] G. J. Chaitin, "To a Mathematical Definition of 'Life'," ACM SICACT News 4, 12 (1970).

[20] G. J. Chaitin, "Information-theoretic Limitations of Formal Systems," J. ACM 21, 403 (1974).

[21] G. J. Chaitin, "Information-theoretic Computational Complexity," IEEE Trans. Info. Theor. IT-20, 10 (1974).

[22] G. J. Chaitin, "Randomness and Mathematical Proof," Sci. Amer. 232, No. 5, 47 (1975). (Also published in the Japanese and Italian editions of Sci. Amer.)

[23] G. J. Chaitin, "A Theory of Program Size Formally Identical to Information Theory," J. ACM 22, 329 (1975).

[24] G. J. Chaitin, "Algorithmic Entropy of Sets," Comput. & Math. Appls. 2, 233 (1976).

[25] G. J. Chaitin, "Information-theoretic Characterizations of Recursive Infinite Strings," Theoret. Comput. Sci. 2, 45 (1976).

[26] G. J. Chaitin, "Program Size, Oracles, and the Jump Operation," Osaka J. Math., to be published in Vol. 14, No. 1, 1977.

[27] R. M. Solovay, "Draft of a paper... on Chaitin's work... done for the most part during the period of Sept.–Dec. 1974," unpublished manuscript, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, May, 1975.
[28] R. M. Solovay, "On Random R. E. Sets," Proceedings of the Third Latin American Symposium on Mathematical Logic, Campinas, Brazil, July, 1976. To be published.

[29] A. K. Zvonkin and L. A. Levin, "The Complexity of Finite Objects and the Development of the Concepts of Information and Randomness by Means of the Theory of Algorithms," Russ. Math. Surv. 25, No. 6, 83 (1970).

[30] L. A. Levin, "On the Notion of a Random Sequence," Soviet Math. Dokl. 14, 1413 (1973).

[31] P. Gács, "On the Symmetry of Algorithmic Information," Soviet Math. Dokl. 15, 1477 (1974). "Corrections," Soviet Math. Dokl. 15, No. 6, v (1974).

[32] L. A. Levin, "Laws of Information Conservation (Nongrowth) and Aspects of the Foundation of Probability Theory," Prob. Info. Transmission 10, 206 (1974).

[33] L. A. Levin, "Uniform Tests of Randomness," Soviet Math. Dokl. 17, 337 (1976).

[34] L. A. Levin, "Various Measures of Complexity for Finite Objects (Axiomatic Description)," Soviet Math. Dokl. 17, 522 (1976).

[35] L. A. Levin, "On the Principle of Conservation of Information in Intuitionistic Mathematics," Soviet Math. Dokl. 17, 601 (1976).

[36] D. E. Knuth, Seminumerical Algorithms. The Art of Computer Programming, Volume 2, Addison-Wesley Publishing Co., Inc., Reading, MA, 1969. See Ch. 2, "Random Numbers," p. 1.

[37] D. W. Loveland, "A Variant of the Kolmogorov Concept of Complexity," Info. Control 15, 510 (1969).

[38] T. L. Fine, Theories of Probability—An Examination of Foundations, Academic Press, Inc., New York, 1973. See Ch. V, "Computational Complexity, Random Sequences, and Probability," p. 118.
[39] J. T. Schwartz, On Programming: An Interim Report on the SETL Project. Installment I: Generalities, Lecture Notes, Courant Institute of Mathematical Sciences, New York University, 1973. See Item 1, "On the Sources of Difficulty in Programming," p. 1, and Item 2, "A Second General Reflection on Programming," p. 12.

[40] T. Kamae, "On Kolmogorov's Complexity and Information," Osaka J. Math. 10, 305 (1973).

[41] C. P. Schnorr, "Process Complexity and Effective Random Tests," J. Comput. Syst. Sci. 7, 376 (1973).

[42] M. E. Hellman, "The Information Theoretic Approach to Cryptography," Information Systems Laboratory, Center for Systems Research, Stanford University, April, 1974.

[43] W. L. Gewirtz, "Investigations in the Theory of Descriptive Complexity," Courant Computer Science Report 5, Courant Institute of Mathematical Sciences, New York University, October, 1974.

[44] R. P. Daley, "Minimal-program Complexity of Pseudo-recursive and Pseudo-random Sequences," Math. Syst. Theor. 9, 83 (1975).

[45] R. P. Daley, "Noncomplex Sequences: Characterizations and Examples," J. Symbol. Logic 41, 626 (1976).

[46] J. Gruska, "Descriptional Complexity (of Languages)—A Short Survey," Mathematical Foundations of Computer Science 1976, A. Mazurkiewicz, ed., Lecture Notes in Computer Science 45, Springer-Verlag, Berlin, 1976, p. 65.

[47] J. Ziv, "Coding Theorems for Individual Sequences," undated manuscript, Bell Laboratories, Murray Hill, NJ.

[48] R. M. Solovay, "A Model of Set-theory in which Every Set of Reals is Lebesgue Measurable," Ann. Math. 92, 1 (1970).

[49] R. Solovay and V. Strassen, "A Fast Monte-Carlo Test for Primality," SIAM J. Comput. 6, 84 (1977).
[50] G. H. Hardy, A Course of Pure Mathematics, Tenth edition, Cambridge University Press, London, 1952. See Section 218, "Logarithmic Tests of Convergence for Series and Integrals," p. 417.

[51] M. Gardner, "A Collection of Tantalizing Fallacies of Mathematics," Sci. Amer. 198, No. 1, 92 (1958).

[52] B. Russell, "Mathematical Logic as Based on the Theory of Types," From Frege to Gödel: A Source Book in Mathematical Logic, 1879–1931, J. van Heijenoort, ed., Harvard University Press, Cambridge, MA, 1967, p. 153; reprinted from Amer. J. Math. 30, 222 (1908).

[53] M. Levin, "Mathematical Logic for Computer Scientists," MIT Project MAC TR-131, June, 1974, pp. 145, 153.

[54] J. von Neumann, Theory of Self-reproducing Automata, University of Illinois Press, Urbana, 1966; edited and completed by A. W. Burks.

[55] C. H. Bennett, "On the Thermodynamics of Computation," undated manuscript, IBM Thomas J. Watson Research Center, Yorktown Heights, NY.

[56] C. H. Bennett, "Logical Reversibility of Computation," IBM J. Res. Develop. 17, 525 (1973).
Received February 2, 1977; revised March 9, 1977

The author is located at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598.
Part II
Applications to Metamathematics
GÖDEL'S THEOREM AND INFORMATION

International Journal of Theoretical Physics 22 (1982), pp. 941–954

Gregory J. Chaitin
IBM Research, P.O. Box 218
Yorktown Heights, New York 10598
Abstract

Gödel's theorem may be demonstrated using arguments having an
information-theoretic flavor. In such an approach it is possible to ar-
gue that if a theorem contains more information than a given set of
axioms, then it is impossible for the theorem to be derived from the ax-
ioms. In contrast with the traditional proof based on the paradox of the
liar, this new viewpoint suggests that the incompleteness phenomenon
discovered by Gödel is natural and widespread rather than pathological
and unusual.
1. Introduction
To set the stage, let us listen to Hermann Weyl (1946), as quoted by
Eric Temple Bell (1951):
We are less certain than ever about the ultimate foundations of (logic and) mathematics. Like everybody and everything in the world today, we have our "crisis." We have had it for nearly fifty years. Outwardly it does not seem to hamper our daily work, and yet I for one confess that it has had a considerable practical influence on my mathematical life: it directed my interests to fields I considered relatively "safe," and has been a constant drain on the enthusiasm and determination with which I pursued my research work. This experience is probably shared by other mathematicians who are not indifferent to what their scientific endeavors mean in the context of man's whole caring and knowing, suffering and creative existence in the world.
And these are the words of John von Neumann (1963):
...there have been within the experience of people now living at least three serious crises... There have been two such crises in physics—namely, the conceptual soul-searching connected with the discovery of relativity and the conceptual difficulties connected with discoveries in quantum theory... The third crisis was in mathematics. It was
a very serious conceptual crisis, dealing with rigor and the
proper way to carry out a correct mathematical proof. In
view of the earlier notions of the absolute rigor of mathemat-
ics, it is surprising that such a thing could have happened,
and even more surprising that it could have happened in
these latter days when miracles are not supposed to take
place. Yet it did happen.
At the time of its discovery, Kurt Gödel's incompleteness theorem
was a great shock and caused much uncertainty and depression among
mathematicians sensitive to foundational issues, since it seemed to pull
the rug out from under mathematical certainty, objectivity, and rigor.
Also, its proof was considered to be extremely difficult and recondite.
With the passage of time the situation has been reversed. A great many
different proofs of Gödel's theorem are now known, and the result is
now considered easy to prove and almost obvious: It is equivalent to the
unsolvability of the halting problem, or alternatively to the assertion
that there is an r.e. (recursively enumerable) set that is not recursive.
And it has had no lasting impact on the daily lives of mathematicians
or on their working habits; no one loses sleep over it any more.
Gödel's original proof constructed a paradoxical assertion that is
true but not provable within the usual formalizations of number the-
ory. In contrast I would like to measure the power of a set of axioms
and rules of inference. I would like to be able to say that if one has ten
pounds of axioms and a twenty-pound theorem, then that theorem can-
not be derived from those axioms. And I will argue that this approach
to Gödel's theorem does suggest a change in the daily habits of math-
ematicians, and that Gödel's theorem cannot be shrugged away.
To be more specific, I will apply the viewpoint of thermodynamics
and statistical mechanics to Gödel's theorem, and will use such con-
cepts as probability, randomness, entropy, and information to study
the incompleteness phenomenon and to attempt to evaluate how wide-
spread it is. On the basis of this analysis, I will suggest that mathe-
matics is perhaps more akin to physics than mathematicians have been
willing to admit, and that perhaps a more flexible attitude with re-
spect to adopting new axioms and methods of reasoning is the proper
response to Gödel's theorem. Probabilistic proofs of primality via sam-
pling (Chaitin and Schwartz, 1978) also suggest that the sources of
mathematical truth are wider than usually thought. Perhaps number
theory should be pursued more openly in the spirit of experimental
science (Pólya, 1959)!
I am indebted to John McCarthy and especially to Jacob Schwartz
for making me realize that Gödel's theorem is not an obstacle to a
practical AI (artificial intelligence) system based on formal logic. Such
an AI would take the form of an intelligent proof checker. Gottfried
Wilhelm Leibniz and David Hilbert's dream that disputes could be
settled with the words "Gentlemen, let us compute!" and that mathe-
matics could be formalized, should still be a topic for active research.
Even though mathematicians and logicians have erroneously dropped
this train of thought, dissuaded by Gödel's theorem, great advances
have in fact been made "covertly," under the banner of computer sci-
ence, LISP, and AI (Cole et al., 1981; Dewar et al., 1981; Levin, 1974;
Wilf, 1982).
To speak in metaphors from Douglas Hofstadter (1979), we shall
now stroll through an art gallery of proofs of Gödel's theorem, to the
tune of Moussorgsky's pictures at an exhibition! Let us start with
some traditional proofs (Davis, 1978; Hofstadter, 1979; Levin, 1974;
Post, 1965).
2. Traditional Proofs of Gödel's Theorem
Gödel's original proof of the incompleteness theorem is based on the
paradox of the liar: "This statement is false." He obtains a theorem in-
stead of a paradox by changing this to: "This statement is unprovable."
If this assertion is unprovable, then it is true, and the formalization of
number theory in question is incomplete. If this assertion is provable,
then it is false, and the formalization of number theory is inconsistent.
The original proof was quite intricate, much like a long program in ma-
chine language. The famous technique of Gödel numbering statements
was but one of the many ingenious ideas brought to bear by Gödel to
construct a number-theoretic assertion which says of itself that it is
unprovable.
Gödel's original proof applies to a particular formalization of num-
ber theory, and was to be followed by a paper showing that the same
methods applied to a much broader class of formal axiomatic systems.
The modern approach in fact applies to all formal axiomatic systems, a
concept which could not even be defined when Gödel wrote his original
paper, owing to the lack of a mathematical definition of effective proce-
dure or computer algorithm. After Alan Turing succeeded in defining
effective procedure by inventing a simple idealized computer now called
the Turing machine (also done independently by Emil Post), it became
possible to proceed in a more general fashion.
Hilbert's key requirement for a formal mathematical system was
that there be an objective criterion for deciding if a proof written in the
language of the system is valid or not. In other words, there must be an
algorithm, a computer program, a Turing machine, for checking proofs.
And the compact modern definition of formal axiomatic system as a
recursively enumerable set of assertions is an immediate consequence
if one uses the so-called British Museum algorithm. One applies the
proof checker in turn to all possible proofs, and prints all the theorems,
which of course would actually take astronomical amounts of time. By
the way, in practice LISP is a very convenient programming language
in which to write a simple proof checker (Levin, 1974).
Turing showed that the halting problem is unsolvable, that is, that
there is no effective procedure or algorithm for deciding whether or
not a program ever halts. Armed with the general definition of a for-
mal axiomatic system as an r.e. set of assertions in a formal language,
one can immediately deduce a version of Gödel's incompleteness theo-
rem from Turing's theorem. I will sketch three different proofs of the
unsolvability of the halting problem in a moment; first let me derive
Gödel's theorem from it. The reasoning is simply that if it were always
possible to prove whether or not particular programs halt, since the set
of theorems is r.e., one could use this to solve the halting problem for
any particular program by enumerating all theorems until the matter is
settled. But this contradicts the unsolvability of the halting problem.
Here come three proofs that the halting problem is unsolvable. One
proof considers the function F(N) defined to be either one more than
the value of the Nth computable function applied to the natural number
N, or zero if this value is undefined because the Nth computer program
does not halt on input N. F cannot be a computable function, for if
program N calculated it, then one would have F(N) = F(N) + 1, which
is impossible. But the only way that F can fail to be computable is
because one cannot decide if the Nth program ever halts when given
input N.
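As a sketch, this diagonal argument fits in a few lines of Python; halts(N, x) and run(N, x), a halting oracle and a universal interpreter for the Nth program, are hypothetical subroutines, and the point of the proof is precisely that the first of them cannot exist:

    def F(N):
        # One more than the value of the Nth computable function at N,
        # or zero if the Nth program never halts on input N.
        if halts(N, N):            # hypothetical halting oracle
            return run(N, N) + 1   # hypothetical universal interpreter
        return 0

    # If F were itself computed by program number N0, then
    # F(N0) = F(N0) + 1, which is impossible; so halts() is not computable.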
The proof I have just given is of course a variant of the diago-
nal method which Georg Cantor used to show that the real numbers
are more numerous than the natural numbers (Courant and Robbins,
1941). Something much closer to Cantor's original technique can also
be used to prove Turing's theorem. The argument runs along the lines
of Bertrand Russell's paradox (Russell, 1967) of the set of all things
that are not members of themselves. Consider programs for enumer-
ating sets of natural numbers, and number these computer programs.
Define a set of natural numbers consisting of the numbers of all pro-
grams which do not include their own number in their output set. This
set of natural numbers cannot be recursively enumerable, for if it were
listed by computer program N , one arrives at Russell's paradox of the
barber in a small town who shaves all those and only those who do
not shave themselves, and can neither shave himself nor avoid doing
so. But the only way that this set can fail to be recursively enumerable
is if it is impossible to decide whether or not a program ever outputs a
specific natural number, and this is a variant of the halting problem.
For yet another proof of the unsolvability of the halting problem,
consider programs which take no input and which either produce a sin-
gle natural number as output or loop forever without ever producing an
output. Think of these programs as being written in binary notation,
instead of as natural numbers as before. I now define a so-called Busy
Beaver function: BB of N is the largest natural number output by any
program less than N bits in size. The original Busy Beaver function
measured program size in terms of the number of states in a Turing
machine instead of using the more correct information-theoretic mea-
sure, bits. It is easy to see that BB of N grows more quickly than any
computable function, and is therefore not computable, which as before
implies that the halting problem is unsolvable.
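The same obstacle is visible in the obvious sketch of BB, where the only non-effective step is the halting test (programs_below(N), enumerating all programs of size less than N bits, and halts and run are hypothetical, as before):

    def BB(N):
        # Largest natural number output by any halting program of size < N bits.
        best = 0
        for p in programs_below(N):    # hypothetical enumerator
            if halts(p):               # hypothetical oracle; the missing ingredient
                best = max(best, run(p))
        return best

If halts() were computable, this would make BB computable, contradicting the fact just noted that BB grows more quickly than any computable function.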
In a beautiful and easy to understand paper Post (1965) gave ver-
sions of Gödel's theorem based on his concepts of simple and creative
r.e. sets. And he formulated the modern abstract form of Gödel's the-
orem, which is like a Japanese haiku: there is an r.e. set of natural
numbers that is not recursive. This set has the property that there are
programs for printing all the members of the set in some order, but not
in ascending order. One can eventually realize that a natural number
is a member of the set, but there is no algorithm for deciding if a given
number is in the set or not. The set is r.e. but its complement is not.
In fact, the set of (numbers of) halting programs is such a set. Now
consider a particular formal axiomatic system in which one can talk
about natural numbers and computer programs and such, and let X
be any r.e. set whose complement is not r.e. It follows immediately
that not all true assertions of the form "the natural number N is not a
member of the set X" are theorems in the formal axiomatic system. In
fact, if X is what Post called a simple r.e. set, then only finitely many
of these assertions can be theorems.
These traditional proofs of Gödel's incompleteness theorem show
that formal axiomatic systems are incomplete, but they do not suggest
ways to measure the power of formal axiomatic systems, to rank their
degree of completeness or incompleteness. Actually, Post's concept of
a simple set contains the germ of the information-theoretic versions
of Gödel's theorem that I will give later, but this is only visible in
retrospect. One could somehow choose a particular simple r.e. set X
and rank formal axiomatic systems according to how many different
theorems of the form "N is not in X" are provable. Here are three
other quantitative versions of Gödel's incompleteness theorem which
do sort of fall within the scope of traditional methods.
Consider a particular formal axiomatic system in which it is possible
to talk about total recursive functions (computable functions which
have a natural number as value for each natural number input) and
their running time computational complexity. It is possible to construct
a total recursive function which grows more quickly than any function
which is provably total recursive in the formal axiomatic system. It is
also possible to construct a total recursive function which takes longer
to compute than any provably total recursive function. That is to say,
there is a computer program which produces a natural number output and then
halts whenever it is given a natural number input, but this cannot be
proved in the formal axiomatic system, because the program takes too
long to produce its output.
It is also fun to use constructive transfinite ordinal numbers (Hof-
stadter, 1979) to measure the power of formal axiomatic systems. A
constructive ordinal is one which can be obtained as the limit from
below of a computable sequence of smaller constructive ordinals. One
measures the power of a formal axiomatic system by the first construc-
tive ordinal which cannot be proved to be a constructive ordinal within
the system. This is like the paradox of the first unmentionable or in-
definable ordinal number (Russell, 1967)!
Before turning to information-theoretic incompleteness theorems, I
must first explain the basic concepts of algorithmic information theory
(Chaitin, 1975b, 1977, 1982).
3. Algorithmic Information Theory
Algorithmic information theory focuses on individual objects rather
than on the ensembles and probability distributions considered in
Claude Shannon and Norbert Wiener's information theory. How many
bits does it take to describe how to compute an individual object?
In other words, what is the size in bits of the smallest program for
calculating it? It is easy to see that since general-purpose computers
(universal Turing machines) can simulate each other, the choice of com-
puter as yardstick is not very important and really only corresponds to
the choice of origin in a coordinate system.
The fundamental concepts of this new information theory are: al-
gorithmic information content, joint information, relative information,
mutual information, algorithmic randomness, and algorithmic indepen-
dence. These are defined roughly as follows.
The algorithmic information content I(X) of an individual object
X is defined to be the size of the smallest program to calculate X.
Programs must be self-delimiting so that subroutines can be combined
by concatenating them. The joint information I(X, Y) of two objects
X and Y is defined to be the size of the smallest program to calculate X
and Y simultaneously. The relative or conditional information content
I(X|Y) of X given Y is defined to be the size of the smallest program
to calculate X from a minimal program for Y.
Note that the relative information content of an object is never
greater than its absolute information content, for being given additional
information can only help. Also, since subroutines can be concatenated,
it follows that joint information is subadditive. That is to say, the joint
information content is bounded from above by the sum of the individual
information contents of the objects in question. The extent to which
the joint information content is less than this sum leads to the next
fundamental concept, mutual information.
The mutual information content I (X : Y ) measures the commonal-
ity of X and Y : it is de ned as the extent to which knowing X helps
one to calculate Y , which is essentially the same as the extent to which
knowing Y helps one to calculate X , which is also the same as the ex-
tent to which it is cheaper to calculate them together than separately.
That is to say,

I(X : Y) = I(X) − I(X|Y)
         = I(Y) − I(Y|X)
         = I(X) + I(Y) − I(X, Y).

Note that this implies that

I(X, Y) = I(X) + I(Y|X)
        = I(Y) + I(X|Y).
I can now define two very fundamental and philosophically signif-
icant notions: algorithmic randomness and algorithmic independence.
These concepts are, I believe, quite close to the intuitive notions that go
by the same name, namely, that an object is chaotic, typical, unnote-
worthy, without structure, pattern, or distinguishing features, and is
irreducible information, and that two objects have nothing in common
and are unrelated.
Consider, for example, the set of all N -bit long strings. Most such
strings S have I (S ) approximately equal to N plus I (N ), which is N
plus the algorithmic information contained in the base-two numeral for
N , which is equal to N plus order of log N . No N -bit long S has
information content greater than this. A few have less information
content these are strings with a regular structure or pattern. Those
S of a given size having greatest information content are said to be
random or patternless or algorithmically incompressible. The cutoff
between random and nonrandom is somewhere around I(S) equal to N
if the string S is N bits long.
Similarly, an infinite binary sequence such as the base-two expansion
of π is random if and only if all its initial segments are random, that
is, if and only if there is a constant C such that no initial segment has
information content less than C bits below its length. Of course, π is
the extreme opposite of a random string: it takes only I(N) which is
order of log N bits to calculate π's first N bits. But the probability
that an infinite sequence obtained by independent tosses of a fair coin
is algorithmically random is unity.
Two strings are algorithmically independent if their mutual infor-
mation is essentially zero, more precisely, if their mutual information
is as small as possible. Consider, for example, two arbitrary strings
X and Y each N bits in size. Usually, X and Y will be random to
each other, excepting the fact that they have the same length, so that
I (X : Y ) is approximately equal to I (N ). In other words, knowing one
of them is no help in calculating the other, excepting that it tells one
the other string's size.
To illustrate these ideas, let me give an information-theoretic proof
that there are infinitely many prime numbers (Chaitin, 1979). Sup-
pose on the contrary that there are only finitely many primes, in fact,
K of them. Consider an algorithmically random natural number N.
On the one hand, we know that I(N) is equal to log2 N + order of
log log N, since the base-two numeral for N is an algorithmically ran-
dom (log2 N)-bit string. On the other hand, N can be calculated from
the exponents in its prime factorization, and vice versa. Thus I(N) is
equal to the joint information of the K exponents in its prime factor-
ization. By subadditivity, this joint information is bounded from above
by the sum of the information contents of the K individual exponents.
Each exponent is of order log N. The information content of each expo-
nent is thus of order log log N. Hence I(N) is simultaneously equal to
log2 N + O(log log N) and less than or equal to K · O(log log N), which
is impossible.
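The counting at the end of the argument can be displayed as a single chain (a sketch, writing e_1, ..., e_K for the exponents in the prime factorization of N):

    \log_2 N + O(\log \log N) \;=\; I(N) \;\le\; \sum_{i=1}^{K} I(e_i) + O(1) \;\le\; K \cdot O(\log \log N),

which fails for large random N, since the left side grows like log N while the right side grows only like log log N.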
The concepts of algorithmic information theory are made to order
for obtaining quantitative incompleteness theorems, and I will now give
a number of information-theoretic proofs of Gödel's theorem (Chaitin,
1974a, 1974b, 1975a, 1977, 1982; Chaitin and Schwartz, 1978; Gardner,
1979).
4. Information-Theoretic Proofs of Gödel's Theorem
I propose that we consider a formal axiomatic system to be a computer
program for listing the set of theorems, and measure its size in bits.
In other words, the measure of the size of a formal axiomatic system
that I will use is quite crude. It is merely the amount of space it
takes to specify a proof-checking algorithm and how to apply it to all
possible proofs, which is roughly the amount of space it takes to be very
precise about the alphabet, vocabulary, grammar, axioms, and rules of
inference. This is roughly proportional to the number of pages it takes
to present the formal axiomatic system in a textbook.
Here is the first information-theoretic incompleteness theorem.
Consider an N-bit formal axiomatic system. There is a computer pro-
gram of size N which does not halt, but one cannot prove this within
the formal axiomatic system. On the other hand, N bits of axioms can
permit one to deduce precisely which programs of size less than N halt
and which ones do not. Here are two different N-bit axioms which do
this. If God tells one how many different programs of size less than N
halt, this can be expressed as an N-bit base-two numeral, and from it
one could eventually deduce which of these programs halt and which do
not. An alternative divine revelation would be knowing that program
of size less than N which takes longest to halt. (In the current context,
programs have all input contained within them.)
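To see why the first revelation suffices, here is a minimal sketch (Python; programs_below(N), enumerating all programs of size less than N, and halted_within(p, t), which runs program p for t steps and reports whether it has halted by then, are hypothetical subroutines):

    def settle_halting(N, count):
        # 'count' is the revealed number of programs of size < N that halt.
        # Dovetail all of them until exactly that many have halted; the
        # remaining programs can then never halt.
        halted, t = set(), 0
        while len(halted) < count:
            t += 1
            for p in programs_below(N):                      # hypothetical enumerator
                if p not in halted and halted_within(p, t):  # hypothetical step-runner
                    halted.add(p)
        return halted   # any program of size < N not in this set runs forever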
Another way to thwart an N -bit formal axiomatic system is to
merely toss an unbiased coin slightly more than N times. It is al-
most certain that the resulting binary string will be algorithmically
random, but it is not possible to prove this within the formal axiomatic
system. If one believes the postulate of quantum mechanics that God
plays dice with the universe (Albert Einstein did not), then physics pro-
vides a means to expose the limitations of formal axiomatic systems.
In fact, within an N-bit formal axiomatic system it is not even possible
to prove that a particular object has algorithmic information content
greater than N, even though almost all (all but finitely many) objects
have this property.
The proof of this closely resembles G. G. Berry's paradox of "the
first natural number which cannot be named in less than a billion
words," published by Russell at the turn of the century (Russell, 1967).
The version of Berry's paradox that will do the trick is "that object
having the shortest proof that its algorithmic information content is
greater than a billion bits." More precisely, "that object having the
shortest proof within the following formal axiomatic system that its
information content is greater than the information content of the for-
mal axiomatic system: ...," where the dots are to be filled in with a
complete description of the formal axiomatic system in question.
By the way, the fact that in a given formal axiomatic system one
can only prove that finitely many specific strings are random, is closely
related to Post's notion of a simple r.e. set. Indeed, the set of nonran-
dom or compressible strings is a simple r.e. set. So Berry and Post had
the germ of my incompleteness theorem!
In order to proceed, I must define a fascinating algorithmically ran-
dom real number between zero and one, which I like to call Ω (Chaitin,
1975b; Gardner, 1979). Ω is a suitable subject for worship by mystical
cultists, for as Charles Bennett (Gardner, 1979) has argued persua-
sively, in a sense Ω contains all constructive mathematical truth, and
expresses it as concisely and compactly as possible. Knowing the nu-
merical value of Ω with N bits of precision, that is to say, knowing
the first N bits of Ω's base-two expansion, is another N-bit axiom that
permits one to deduce precisely which programs of size less than N halt
and which ones do not.
Ω is defined as the halting probability of whichever standard general-
purpose computer has been chosen, if each bit of its program is pro-
duced by an independent toss of a fair coin. To Turing's theorem in
recursive function theory that the halting problem is unsolvable, there
corresponds in algorithmic information theory the theorem that the
base-two expansion of Ω is algorithmically random. Therefore it takes
N bits of axioms to be able to prove what the first N bits of Ω are, and
these bits seem completely accidental like the products of a random
physical process. One can therefore measure the power of a formal ax-
iomatic system by how much of the numerical value of Ω it is possible
to deduce from its axioms. This is sort of like measuring the power
of a formal axiomatic system in terms of the size in bits of the short-
est program whose halting problem is undecidable within the formal
axiomatic system.
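Here is a minimal sketch (Python) of how the first N bits of Ω settle the halting problem for programs of size less than N; programs_of_size_at_most(t) and halted_within(p, t) are hypothetical subroutines as in the earlier sketch, and omega_N stands for the N-bit truncation of Ω as an exact Fraction:

    from fractions import Fraction

    def omega_lower_bound(t):
        # t-th term of a computable monotone increasing sequence of rationals
        # converging to Omega: sum 2^(-|p|) over the self-delimiting programs p
        # observed to halt within t steps.
        total = Fraction(0)
        for p in programs_of_size_at_most(t):   # hypothetical enumerator
            if halted_within(p, t):             # hypothetical step-runner
                total += Fraction(1, 2 ** len(p))
        return total

    def halts_from_omega(p, omega_N, N):
        # Decides halting for any program p of size < N.  Once the lower
        # bound reaches the truncation omega_N, every program of size < N
        # that will ever halt has already halted: a further halting program
        # of size < N would add at least 2^(1-N), forcing
        # Omega >= omega_N + 2^(1-N), which contradicts Omega < omega_N + 2^(-N).
        t = 0
        while omega_lower_bound(t) < omega_N:
            t += 1
        return halted_within(p, t)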
It is possible to dress this incompleteness theorem involving Ω so
that no direct mention is made of halting probabilities, in fact, in rather
straight-forward number-theoretic terms making no mention of com-
puter programs at all. Ω can be represented as the limit of a monot-
one increasing computable sequence of rational numbers. Its Nth bit
is therefore the limit as T tends to infinity of a computable function
of N and T. Thus the Nth bit of Ω can be expressed in the form
∃X ∀Y [computable predicate of X, Y, and N]. Complete chaos is only
two quantifiers away from computability! Ω can also be expressed via
a polynomial P in, say, one hundred variables, with integer coefficients
and exponents (Davis et al., 1976): the Nth bit of Ω is a 1 if and only
if there are infinitely many natural numbers K such that the equation
P(N, K, X1, ..., X98) = 0 has a solution in natural numbers.
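Writing g(N, T) for the computable function whose limit as T tends to infinity gives the Nth bit, the two-quantifier form mentioned above can be spelled out as (a sketch):

    \text{the } N\text{th bit of } \Omega \text{ is } 1 \iff \exists X \,\forall Y \,\bigl( Y \ge X \Rightarrow g(N, Y) = 1 \bigr).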
Of course, Ω has the very serious problem that it takes much too
long to deduce theorems from it, and this is also the case with the
other two axioms we considered. So the ideal, perfect mathematical
axiom is in fact useless! One does not really want the most compact
axiom for deducing a given set of assertions. Just as there is a trade-off
between program size and running time, there is a trade-off between the
number of bits of axioms one assumes and the size of proofs. Of course,
random or irreducible truths cannot be compressed into axioms shorter
than themselves. If, however, a set of assertions is not algorithmically
independent, then it takes fewer bits of axioms to deduce them all
than the sum of the number of bits of axioms it takes to deduce them
separately, and this is desirable as long as the proofs do not get too
long. This suggests a pragmatic attitude toward mathematical truth,
somewhat more like that of physicists.
Ours has indeed been a long stroll through a gallery of incomplete-
ness theorems. What is the conclusion or moral? It is time to make a
final statement about the meaning of Gödel's theorem.
5. The Meaning of Gödel's Theorem
Information theory suggests that the Gödel phenomenon is natural and
widespread, not pathological and unusual. Strangely enough, it does
this via counting arguments, and without exhibiting individual asser-
tions which are true but unprovable! Of course, it would help to have
more proofs that particular interesting and natural true assertions are
not demonstrable within fashionable formal axiomatic systems.
The real question is this: Is Gödel's theorem a mandate for revolu-
tion, anarchy, and license?! Can one give up after trying for two months
to prove a theorem, and add it as a new axiom? This sounds ridicu-
lous, but it is sort of what number theorists have done with Bernhard
Riemann's conjecture (Pólya, 1959). Of course, two months is not
enough. New axioms should be chosen with care, because of their use-
fulness and large amounts of evidence suggesting that they are correct,
in the same careful manner, say, in practice in the physics community.
Gödel himself has espoused this view with remarkable vigor and
clarity, in his discussion of whether Cantor's continuum hypothesis
should be added to set theory as a new axiom (Gödel, 1964):
...even disregarding the intrinsic necessity of some new axiom, and even in case it has no intrinsic necessity at all, a probable decision about its truth is possible also in another way, namely, inductively by studying its "success." Success here means fruitfulness in consequences, in particular in "verifiable" consequences, i.e., consequences demonstrable without the new axiom, whose proofs with the help of the new axiom, however, are considerably simpler and easier to discover, and make it possible to contract into one proof many different proofs. The axioms for the system of real numbers, rejected by intuitionists, have in this sense been verified to some extent, owing to the fact that analytical number theory frequently allows one to prove number-theoretical theorems which, in a more cumbersome way, can subsequently be verified by elementary methods. A much higher degree of verification than that, however, is conceivable. There might exist axioms so abundant in their verifiable consequences, shedding so much light upon a whole field, and yielding such powerful methods for solving problems (and even solving them constructively, as far as that is possible) that, no matter whether or not they are intrinsically necessary, they would have to be accepted at least in the same sense as any well-established physical theory.
Later in the same discussion Gödel refers to these ideas again:
It was pointed out earlier... that, besides mathematical intuition, there exists another (though only probable) criterion of the truth of mathematical axioms, namely their fruitfulness in mathematics and, one may add, possibly also in physics... The simplest case of an application of the criterion under discussion arises when some... axiom has number-theoretical consequences verifiable by computation up to any given integer.
Gödel also expresses himself in no uncertain terms in a discussion
of Russell's mathematical logic (Gödel, 1964):
The analogy between mathematics and a natural science is enlarged upon by Russell also in another respect... axioms need not be evident in themselves, but rather their justification lies (exactly as in physics) in the fact that they make it possible for these "sense perceptions" to be deduced... I think that... this view has been largely justified by subsequent developments, and it is to be expected that it will be still more so in the future. It has turned out that the solution of certain arithmetical problems requires the use of assumptions essentially transcending arithmetic... Furthermore it seems likely that for deciding certain questions of abstract set theory and even for certain related questions of the theory of real numbers new axioms based on some hitherto unknown idea will be necessary. Perhaps also the apparently insurmountable difficulties which some other mathematical problems have been presenting for many years are due to the fact that the necessary axioms have not yet been found. Of course, under these circumstances mathematics may lose a good deal of its "absolute certainty" but, under the influence of the modern criticism of the foundations, this has already happened to a large extent...
I end as I began, with a quotation from Weyl (1949): "A truly real-
istic mathematics should be conceived, in line with physics, as a branch
of the theoretical construction of the one real world, and should adopt
the same sober and cautious attitude toward hypothetic extensions of
its foundations as is exhibited by physics."
6. Directions for Future Research
a. Prove that a famous mathematical conjecture is unsolvable in
the usual formalizations of number theory. Problem: if Pierre
Fermat's "last theorem" is undecidable then it is true, so this is
hard to do.
b. Formalize all of college mathematics in a practical way. One
wants to produce textbooks that can be run through a practical
formal proof checker and that are not too much larger than the
usual ones. LISP (Levin, 1974) and SETL (Dewar et al., 1981)
might be good for this.
c. Is algorithmic information theory relevant to physics, in partic-
ular, to thermodynamics and statistical mechanics? Explore the
thermodynamics of computation (Bennett, 1982) and determine
the ultimate physical limitations of computers.
d. Is there a physical phenomenon that computes something non-
computable? Contrariwise, does Turing's thesis that anything
computable can be computed by a Turing machine constrain the
physical universe we are in?
e. Develop measures of self-organization and formal proofs that life
must evolve (Chaitin, 1979; Eigen and Winkler, 1981; von Neu-
mann, 1966).
f. Develop formal definitions of intelligence and measures of its vari-
ous components; apply information theory and complexity theory
to AI.
References

Let me give a few pointers to the literature. The following are my pre-
vious publications on Gödel's theorem: Chaitin, 1974a, 1974b, 1975a,
1977, 1982; Chaitin and Schwartz, 1978. Related publications by other
authors include Davis, 1978; Gardner, 1979; Hofstadter, 1979; Levin,
1974; Post, 1965. For discussions of the epistemology of mathematics
and science, see Einstein, 1944, 1954; Feynman, 1965; Gödel, 1964;
Pólya, 1959; von Neumann, 1956, 1963; Taub, 1961; Weyl, 1946, 1949.
• Bell, E. T. (1951). Mathematics, Queen and Servant of Science, McGraw-Hill, New York.

• Bennett, C. H. (1982). The thermodynamics of computation—a review, International Journal of Theoretical Physics, 21, 905–940.

• Chaitin, G. J. (1974a). Information-theoretic computational complexity, IEEE Transactions on Information Theory, IT-20, 10–15.

• Chaitin, G. J. (1974b). Information-theoretic limitations of formal systems, Journal of the ACM, 21, 403–424.

• Chaitin, G. J. (1975a). Randomness and mathematical proof, Scientific American, 232 (5) (May 1975), 47–52. (Also published in the French, Japanese, and Italian editions of Scientific American.)

• Chaitin, G. J. (1975b). A theory of program size formally identical to information theory, Journal of the ACM, 22, 329–340.

• Chaitin, G. J. (1977). Algorithmic information theory, IBM Journal of Research and Development, 21, 350–359, 496.

• Chaitin, G. J., and Schwartz, J. T. (1978). A note on Monte Carlo primality tests and algorithmic information theory, Communications on Pure and Applied Mathematics, 31, 521–527.

• Chaitin, G. J. (1979). Toward a mathematical definition of "life," in The Maximum Entropy Formalism, R. D. Levine and M. Tribus (eds.), MIT Press, Cambridge, Massachusetts, pp. 477–498.

• Chaitin, G. J. (1982). Algorithmic information theory, Encyclopedia of Statistical Sciences, Vol. 1, Wiley, New York, pp. 38–41.

• Cole, C. A., Wolfram, S., et al. (1981). SMP: a symbolic manipulation program, California Institute of Technology, Pasadena, California.

• Courant, R., and Robbins, H. (1941). What is Mathematics?, Oxford University Press, London.
• Davis, M., Matijasevič, Y., and Robinson, J. (1976). Hilbert's tenth problem. Diophantine equations: positive aspects of a negative solution, in Mathematical Developments Arising from Hilbert Problems, Proceedings of Symposia in Pure Mathematics, Vol. XXVII, American Mathematical Society, Providence, Rhode Island, pp. 323–378.

• Davis, M. (1978). What is a computation?, in Mathematics Today: Twelve Informal Essays, L. A. Steen (ed.), Springer-Verlag, New York, pp. 241–267.

• Dewar, R. B. K., Schonberg, E., and Schwartz, J. T. (1981). Higher Level Programming: Introduction to the Use of the Set-Theoretic Programming Language SETL, Courant Institute of Mathematical Sciences, New York University, New York.

• Eigen, M., and Winkler, R. (1981). Laws of the Game, Knopf, New York.

• Einstein, A. (1944). Remarks on Bertrand Russell's theory of knowledge, in The Philosophy of Bertrand Russell, P. A. Schilpp (ed.), Northwestern University, Evanston, Illinois, pp. 277–291.

• Einstein, A. (1954). Ideas and Opinions, Crown, New York, pp. 18–24.

• Feynman, R. (1965). The Character of Physical Law, MIT Press, Cambridge, Massachusetts.

• Gardner, M. (1979). The random number Ω bids fair to hold the mysteries of the universe, Mathematical Games Dept., Scientific American, 241 (5) (November 1979), 20–34.

• Gödel, K. (1964). Russell's mathematical logic, and What is Cantor's continuum problem?, in Philosophy of Mathematics, P. Benacerraf and H. Putnam (eds.), Prentice-Hall, Englewood Cliffs, New Jersey, pp. 211–232, 258–273.

• Hofstadter, D. R. (1979). Gödel, Escher, Bach: an Eternal Golden Braid, Basic Books, New York.
• Levin, M. (1974). Mathematical Logic for Computer Scientists,
  MIT Project MAC report MAC TR-131, Cambridge, Massachusetts.
• Pólya, G. (1959). Heuristic reasoning in the theory of numbers,
  American Mathematical Monthly, 66, 375–384.
• Post, E. (1965). Recursively enumerable sets of positive integers
  and their decision problems, in The Undecidable: Basic Papers on
  Undecidable Propositions, Unsolvable Problems and Computable
  Functions, M. Davis (ed.), Raven Press, Hewlett, New York, pp.
  305–337.
• Russell, B. (1967). Mathematical logic as based on the theory of
  types, in From Frege to Gödel: A Source Book in Mathematical
  Logic, 1879–1931, J. van Heijenoort (ed.), Harvard University
  Press, Cambridge, Massachusetts, pp. 150–182.
• Taub, A. H. (ed.) (1961). J. von Neumann – Collected Works,
  Vol. I, Pergamon Press, New York, pp. 1–9.
• von Neumann, J. (1956). The mathematician, in The World of
  Mathematics, Vol. 4, J. R. Newman (ed.), Simon and Schuster,
  New York, pp. 2053–2063.
• von Neumann, J. (1963). The role of mathematics in the sciences
  and in society, and Method in the physical sciences, in J. von
  Neumann – Collected Works, Vol. VI, A. H. Taub (ed.), Macmillan,
  New York, pp. 477–498.
• von Neumann, J. (1966). Theory of Self-Reproducing Automata,
  A. W. Burks (ed.), University of Illinois Press, Urbana, Illinois.
• Weyl, H. (1946). Mathematics and logic, American Mathematical
  Monthly, 53, 1–13.
• Weyl, H. (1949). Philosophy of Mathematics and Natural Science,
  Princeton University Press, Princeton, New Jersey.
• Wilf, H. S. (1982). The disk with the college education, American
  Mathematical Monthly, 89, 4–8.
Received April 14, 1982

RANDOMNESS AND
GÖDEL'S THEOREM

Mondes en Développement,
No. 54–55 (1986), pp. 125–128

G. J. Chaitin
IBM Research Division
Abstract

Complexity, non-predictability and randomness not only occur in
quantum mechanics and non-linear dynamics, they also occur in pure
mathematics and shed new light on the limitations of the axiomatic
method. In particular, we discuss a Diophantine equation exhibiting
randomness, and how it yields a proof of Gödel's incompleteness
theorem.
Our view of the physical world has certainly changed radically during
the past hundred years, as unpredictability, randomness and complexity
have replaced the comfortable world of classical physics. Amazingly
enough, the same thing has occurred in the world of pure mathematics,
in fact, in number theory, a branch of mathematics that is concerned
with the properties of the positive integers. How can an uncertainty
principle apply to number theory, which has been called the queen of
mathematics, and is a discipline that goes back to the ancient Greeks
and is concerned with such things as the primes and their properties?
Following Davis (1982), consider an equation of the form

    P(x, n, y1, ..., ym) = 0,

where P is a polynomial with integer coefficients, and x, n, y1, ..., ym
are positive integers. Here n is to be regarded as a parameter, and
for each value of n we are interested in the set Dn of those values of
x for which there exist y1 to ym such that P = 0. Thus a particular
polynomial P with integer coefficients in m + 2 variables serves to define
a set Dn of values of x as a function of the choice of the parameter n.
The study of equations of this sort goes back to the ancient Greeks,
and the particular type of equation we have described is called a poly-
nomial Diophantine equation.
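To make the parameter's role concrete, here is a minimal sketch in
Python (with a toy polynomial of our own choosing, not the universal
one discussed below) of how membership in Dn can be witnessed by a
bounded search over x and the unknowns:

    # Hedged sketch: collect the members of D_n for a toy P that are
    # witnessed by a bounded search over x and the unknowns y_1, ..., y_m.
    from itertools import product

    def members(P, n, m, bound):
        found = set()
        for x in range(1, bound + 1):
            for ys in product(range(1, bound + 1), repeat=m):
                if P(x, n, *ys) == 0:
                    found.add(x)
        return found

    # Toy example (purely illustrative): P = x - n*y1, so D_n is the
    # set of positive multiples of n.
    P = lambda x, n, y1: x - n * y1
    print(members(P, 3, 1, 20))   # {3, 6, 9, 12, 15, 18}

Of course, such a search can only confirm membership; the point of
what follows is that no algorithm can settle non-membership in general.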
One of the most remarkable mathematical results of this century has
been the discovery that there is a "universal" polynomial P such that
by varying the parameter n, the corresponding set Dn of solutions that
is obtained can be any set of positive integers that can be generated
by a computer program. In particular, there is a value of n such that
the set of prime numbers is obtained. This immediately yields a
prime-generating polynomial

    x [1 - (P(x, n, y1, ..., ym))^2],

whose set of positive values, as the values of x and y1 to ym vary over
all the positive integers, is precisely equal to the primes. This is a
remarkable result that surely would have amazed Fermat and Euler,
and it is obtained as a trivial corollary to a much more general theorem!
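The arithmetic behind this trick is elementary: when P = 0 the
bracketed factor equals 1 and the whole expression equals x, while
when P is nonzero we have P^2 >= 1 and the expression is zero or
negative. In sketch form (toy P again, hypothetical):

    # Returns x exactly when P(x, n, y_1, ..., y_m) = 0; otherwise <= 0.
    def positive_value(P, x, n, ys):
        return x * (1 - P(x, n, *ys) ** 2)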
The proof that there is such a universal P may be regarded as
the culmination of Gödel's original proof of his famous incompleteness
theorem. In thinking about P, it is helpful to regard the parameter
n as the Gödel number of a computer program, to regard the set
of solutions x as the output of this computer program, and to think
of the auxiliary variables y1 to ym as a kind of multidimensional time
variable. In other words,

    P(x, n, y1, ..., ym) = 0

if and only if the nth computer program outputs the positive integer x
at time (y1, ..., ym).
Let us prove Gödel's incompleteness theorem by making use of this
universal polynomial P and Cantor's famous diagonal method, which
Cantor originally used to prove that the real numbers are more
numerous than the integers. Recall that Dn denotes the set of positive
integers x for which there exist positive integers y1 to ym such that
P = 0. I.e.,

    Dn = {x | (∃ y1, ..., ym) [P(x, n, y1, ..., ym) = 0]}.

Consider the "diagonal" set

    V = {n | n ∉ Dn}

of all those positive integers n that are not contained in the
corresponding set Dn. It is easy to see that V cannot be generated by a
computer program, because V differs from the set generated by the nth
computer program regarding the membership of n. It follows that there
can be no algorithm for deciding, given n, whether or not the equation

    P(n, n, y1, ..., ym) = 0

has a solution. And if there cannot be an algorithm for deciding if
this equation has a solution, no fixed system of axioms and rules of
inference can permit one to prove whether or not it has a solution. For
if there were a formal axiomatic theory for proving whether or not there
is a solution, given any particular value of n one could in principle use
this formal theory to decide if there is a solution, by searching through
all possible proofs within the formal theory in size order, until a proof
is found one way or another. It follows that no single set of axioms
and rules of inference suffices to enable one to prove whether or not a
polynomial Diophantine equation has a solution. This is a version of
Gödel's incompleteness theorem.
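The proof-search argument of the last paragraph is completely
mechanical, as the following schematic sketch may help make vivid
(theorems is a hypothetical enumerator of a formal theory's theorems
in order of proof size; the diagonal argument shows that no such theory
can settle every case):

    # If the theory settled every case, this would be a decision
    # procedure for the solvability of P(n, n, y_1, ..., y_m) = 0 --
    # which cannot exist.
    def decide(n, theorems):
        for t in theorems():
            if t == ("solvable", n):
                return True
            if t == ("unsolvable", n):
                return False
        # never returns if the theory is silent about n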
What does this have to do with randomness, uncertainty and
unpredictability? The point is that the solvability or unsolvability of
the equation

    P(n, n, y1, ..., ym) = 0

in positive integers is in a sense mathematically uncertain and jumps
around unpredictably as the parameter n varies. In fact, it is possible
to construct another polynomial P′ with integer coefficients for which
the situation is much more dramatic.
Instead of asking whether P′ = 0 can be solved, consider the question
of whether or not there are infinitely many solutions. Let D′n be
the set of positive integers x such that

    P′(x, n, y1, ..., ym) = 0

has a solution. P′ has the remarkable property that the truth or falsity
of the assertion that the set D′n is infinite is completely random.
Indeed, this infinite sequence of true/false values is indistinguishable
from the result of successive independent tosses of an unbiased coin. In
other words, the truth or falsity of each of these assertions is an
independent mathematical fact with probability one-half! These
independent facts cannot be compressed into a smaller amount of
information, i.e., they are irreducible mathematical information. In
order to be able to prove whether or not D′n is infinite for the first k
values of the parameter n, one needs at least k bits of axioms and rules
of inference, i.e., the formal theory must be based on at least k
independent choices between equally likely alternative assumptions. In
other words, a system of axioms and rules of inference, considered as a
computer program for generating theorems, must be at least k bits in
size if it enables one to prove whether or not D′n is infinite for
n = 1, 2, 3, ..., k.
This is a dramatic extension of Gödel's theorem. Number theory,
the queen of mathematics, is infected with uncertainty and randomness!
Simple properties of Diophantine equations escape the power of
any particular formal axiomatic theory! To mathematicians,
accustomed as they often are to believe that mathematics offers absolute
certainty, this may appear to be a serious blow. Mathematicians often
deride the non-rigorous reasoning used by physicists, but perhaps
they have something to learn from them. Physicists know that new
experiments, new domains of experience, often require fundamentally
new physical principles. They have a more pragmatic attitude to truth
than mathematicians do. Perhaps mathematicians should acquire some
of this flexibility from their colleagues in the physical sciences!
Appendix
Let me say a few words about where P′ comes from. P′ is closely
related to the fascinating random real number which I like to call Ω. Ω
is defined to be the halting probability of a universal Turing machine
when its program is chosen by coin tossing, more precisely, when a
program n bits in size has probability 2^-n [see Gardner (1979)]. One
could in principle try running larger and larger programs for longer
and longer amounts of time on the universal Turing machine. Thus if a
program ever halts, one would eventually discover this; if the program
is n bits in size, this would contribute 2^-n more to the total halting
probability Ω. Hence Ω can be obtained as the limit from below of a
computable sequence r1, r2, r3, ... of rational numbers:

    Ω = lim_{k→∞} rk;

this sequence converges very slowly, in fact, in a certain sense, as slowly
as possible. The polynomial P′ is constructed from the sequence rk by
using the theorem that "a set of tuples of positive integers is
Diophantine if and only if it is recursively enumerable" [see Davis
(1982)]: the equation

    P′(k, n, y1, ..., ym) = 0

has a solution if and only if the nth bit of the base-two expansion of rk
is a "1". Thus D′n, the set of x such that

    P′(x, n, y1, ..., ym) = 0

has a solution, is infinite if and only if the nth bit of the base-two
expansion of Ω is a "1". Knowing whether or not D′n is infinite for
n = 1, 2, 3, ..., k is therefore equivalent to knowing the first k bits of
Ω.
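For concreteness, here is a minimal dovetailing sketch of the rationals
rk (halts_within is a hypothetical stand-in for simulating the
self-delimiting universal machine, on which only a prefix-free set of
programs halt, so the sum stays below 1):

    # r_k adds 2^(-size) for every program of size <= k seen to halt
    # within k steps; r_1 <= r_2 <= ... converges to Omega from below.
    from fractions import Fraction
    from itertools import product

    def all_programs(size):
        return (''.join(bits) for bits in product('01', repeat=size))

    def r(k, halts_within):
        # halts_within(p, t): does the universal machine halt on
        # program p within t steps?  (hypothetical simulator)
        total = Fraction(0)
        for size in range(1, k + 1):
            for p in all_programs(size):
                if halts_within(p, k):
                    total += Fraction(1, 2 ** size)
        return total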
References
• G. J. Chaitin (1975), "Randomness and mathematical proof,"
  Scientific American 232 (5), pp. 47–52.
• M. Davis (1978), "What is a computation?", Mathematics Today:
  Twelve Informal Essays, L. A. Steen (ed.), Springer-Verlag, New
  York, pp. 241–267.
• D. R. Hofstadter (1979), Gödel, Escher, Bach: an Eternal Golden
  Braid, Basic Books, New York.
• M. Gardner (1979), "The random number Ω bids fair to hold the
  mysteries of the universe," Mathematical Games Dept., Scientific
  American 241 (5), pp. 20–34.
• G. J. Chaitin (1982), "Gödel's theorem and information,"
  International Journal of Theoretical Physics 22, pp. 941–954.
• M. Davis (1982), "Hilbert's Tenth Problem is Unsolvable,"
  Computability & Unsolvability, Dover, New York, pp. 199–235.
AN ALGEBRAIC
EQUATION FOR THE
HALTING PROBABILITY

In R. Herken, The Universal Turing Machine,
Oxford University Press, 1988, pp. 279–283

Gregory J. Chaitin
Abstract

We outline our construction of a single equation involving only
addition, multiplication, and exponentiation of non-negative integer
constants and variables with the following remarkable property. One of
the variables is considered to be a parameter. Take the parameter to
be 0, 1, 2, ..., obtaining an infinite series of equations from the original
one. Consider the question of whether each of the derived equations has
finitely or infinitely many non-negative integer solutions. The original
equation is constructed in such a manner that the answers to these
questions about the derived equations are independent mathematical facts
that cannot be compressed into any finite set of axioms. To produce
this equation, we start with a universal Turing machine in the form of
the Lisp universal function Eval written as a register machine program
about 300 lines long. Then we "compile" this register machine program
into a universal exponential Diophantine equation. The resulting
equation is about 200 pages long and has about 17,000 variables. Finally,
we substitute for the program variable in the universal Diophantine
equation the Gödel number of a Lisp program for Ω, the halting
probability of a universal Turing machine if n-bit programs have measure
2^-n. Full details appear in a book.1
More than half a century has passed since the famous papers of Gödel
(1931) and Turing (1936) that shed so much light on the foundations
of mathematics, and that simultaneously promulgated mathematical
formalisms for specifying algorithms, in one case via primitive recursive
function definitions, and in the other case via Turing machines. The
development of computer hardware and software technology during this
period has been phenomenal, and as a result we now know much better
how to do the high-level functional programming of Gödel, and how
to do the low-level machine language programming found in Turing's
paper. And we can actually run our programs on machines and debug
them, which Gödel and Turing could not do.
I believe that the best way to actually program a universal Turing
machine is John McCarthy's universal function Eval. In 1960
McCarthy proposed Lisp as a new mathematical foundation for the
theory of computation (McCarthy 1960). But by a quirk of fate Lisp
has largely been ignored by theoreticians and has instead become the
standard programming language for work on artificial intelligence. I
believe that pure Lisp is in precisely the same role in computational
mathematics that set theory is in theoretical mathematics, in that it

1 This article is the introduction of the book G. J. Chaitin, Algorithmic
Information Theory, copyright © 1987 by Cambridge University Press, and is
reprinted by permission.
provides a beautifully elegant and extremely powerful formalism which
enables concepts such as that of numbers and functions to be defined
from a handful of more primitive notions.
Simultaneously there have been profound theoretical advances.
Gödel and Turing's fundamental undecidable proposition, the question
of whether an algorithm ever halts, is equivalent to the question
of whether it ever produces any output. In another paper (Chaitin
1987a) I have shown that much more devastating undecidable
propositions arise if one asks whether an algorithm produces an infinite
amount of output or not.
Gödel expended much effort to express his undecidable proposition
as an arithmetical fact. Here too there has been considerable progress.
In my opinion the most beautiful proof is the recent one of Jones and
Matijasevič (1984), based on three simple ideas:
1. the observation that 11^0 = 1, 11^1 = 11, 11^2 = 121, 11^3 = 1331,
11^4 = 14641 reproduces Pascal's triangle, which makes it possible
to express binomial coefficients as the digits of powers of 11
written in high enough bases;
2. an appreciation of E. Lucas's hundred-year-old remarkable theorem
that the binomial coefficient (n choose k) is odd if and only if
each bit in the base-two numeral for k implies the corresponding
bit in the base-two numeral for n (both facts are checked in the
sketch after this list);
3. the idea of using register machines rather than Turing machines,
and of encoding computational histories via variables which are
vectors giving the contents of a register as a function of time.
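The following minimal Python check of ideas 1 and 2 is our own
illustration, not anything from Jones and Matijasevič's paper:

    from math import comb   # the binomial coefficient C(n, k)

    # Idea 1: (b+1)^n written in base b has digits C(n,0), ..., C(n,n),
    # provided b is large enough that no carrying occurs.
    def binomials_from_power(n, b):
        digits, x = [], (b + 1) ** n
        while x:
            digits.append(x % b)
            x //= b
        return digits

    n, b = 6, 100                     # b = 100 > C(6,3) = 20: no carries
    assert binomials_from_power(n, b) == [comb(n, k) for k in range(n + 1)]

    # Idea 2 (Lucas): C(n,k) is odd iff every 1-bit of k is a 1-bit of n,
    # i.e. iff k AND n == k in binary.
    for n in range(64):
        for k in range(n + 1):
            assert (comb(n, k) % 2 == 1) == (k & n == k)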
Their work gives a simple straightforward proof, using almost no
number theory, that there is an exponential Diophantine equation with
one parameter p which has a solution if and only if the pth computer
program (i.e., the program with Gödel number p) ever halts. Similarly,
one can use their method to arithmetize my undecidable proposition.
The result is an exponential Diophantine equation with the parameter
n and the property that it has infinitely many solutions if and only if
the nth bit of Ω is a 1. Here Ω is the halting probability of a universal
Turing machine if an n-bit program has measure 2^-n (Chaitin 1986a,
1986b).
Ω is an algorithmically random real number in the sense that the first
N bits of the base-two expansion of Ω cannot be compressed into a
program shorter than N bits, from which it follows that the successive
bits of Ω cannot be distinguished from the result of independent tosses
of a fair coin. It can also be shown that an N-bit program cannot
calculate the positions and values of more than N scattered bits of Ω,
not just the first N bits (Chaitin 1987a). This implies that there are
exponential Diophantine equations with one parameter n which have
the property that no formal axiomatic theory can enable one to settle
whether the number of solutions of the equation is finite or infinite for
more than a finite number of values of the parameter n.
What is gained by asking if there are infinitely many solutions rather
than whether or not a solution exists? The question of whether or
not an exponential Diophantine equation has a solution is in general
undecidable, but the answers to such questions are not independent.
Indeed, if one considers such an equation with one parameter k, and
asks whether or not there is a solution for k = 0, 1, 2, ..., N - 1, the
N answers to these N questions really only constitute log2 N bits of
information. The reason for this is that we can in principle determine
which equations have a solution if we know how many of them are
solvable, for the set of solvable equations is r.e. On the other hand, if
we ask whether the number of solutions is finite or infinite, then the
answers can be independent, if the equation is constructed properly.
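Why only log2 N bits? Because from the count alone the individual
answers can be recovered by dovetailing, along the lines of this
schematic sketch (solvable_within is a hypothetical bounded search for
a solution of the kth equation):

    # Given how many of equations 0..N-1 are solvable, recover which:
    # dovetail the searches until exactly 'count' of them have succeeded.
    def recover(N, count, solvable_within):
        found, t = set(), 0
        while len(found) < count:
            t += 1
            found |= {k for k in range(N) if solvable_within(k, t)}
        return found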
In view of the philosophical impact of exhibiting an algebraic
equation with the property that the number of solutions jumps from
finite to infinite at random as a parameter is varied, I have taken the
trouble of explicitly carrying out the construction outlined by Jones and
Matijasevič. That is to say, I have encoded the halting probability Ω
into an exponential Diophantine equation. To be able to actually do
this, one has to start with a program for calculating Ω, and the only
language I can think of in which actually writing such a program would
not be an excruciating task is pure Lisp. It is in fact necessary to go
beyond the ideas of McCarthy in three fundamental ways:
1. First of all, we simplify Lisp by only allowing atoms to be one
character long. (This is similar to McCarthy's "linear Lisp.")
2. Secondly, Eval must not lose control by going into an infinite
loop. In other words, we need a safe Eval that can execute
garbage for a limited amount of time, and always results in an
error message or a valid value of an expression. This is similar
to the notion in modern operating systems that the supervisor
should be able to give a user task a time slice of Cpu, and that
the supervisor should not abort if the user task has an abnormal
error termination.
3. Lastly, in order to program such a safe time-limited Eval, it
greatly simplifies matters if we stipulate "permissive" Lisp
semantics with the property that the only way a syntactically valid
Lisp expression can fail to have a value is if it loops forever. Thus,
for example, the head (Car) and tail (Cdr) of an atom is defined
to be the atom itself, and the value of an unbound variable is the
variable.
Proceeding in this spirit, we have defined a class of abstract
computers which, as in Jones and Matijasevič's treatment, are register
machines. However, our machine's finite set of registers each contain a
Lisp S-expression in the form of a character string with balanced left
and right parentheses to delimit the list structure. And we use a small
set of machine instructions, instructions for testing, moving, erasing,
and setting one character at a time. In order to be able to use
subroutines more effectively, we have also added an instruction for
jumping to a subroutine after putting into a register the return address,
and an indirect branch instruction for returning to the address contained
in a register. The complete register machine program for a safe
time-limited Lisp universal function (interpreter) Eval is about 300
instructions long. To test this Lisp interpreter written for an abstract
machine, we have written in 370 machine language a register machine
simulator. We have also rewritten this Lisp interpreter directly in 370
machine language, representing Lisp S-expressions by binary trees of
pointers rather than as character strings, in the standard manner used
in practical Lisp implementations. We have then run a large suite of
tests through the very slow interpreter on the simulated register machine,
and also through the extremely fast 370 machine language interpreter,
in order to make sure that identical results are produced by both
implementations of the Lisp interpreter.
Our version of pure Lisp also has the property that in it we can write
a short program to calculate Ω in the limit from below. The program
for calculating Ω is only a few pages long, and by running it (on the
370 directly, not on the register machine!), we have obtained a lower
bound of 127/128-ths for the particular definition of Ω we have chosen,
which depends on our choice of a self-delimiting universal computer.
The final step was to write a compiler that compiles a register
machine program into an exponential Diophantine equation. This
compiler consists of about 700 lines of code in a very nice and easy to
use programming language invented by Mike Cowlishaw called Rexx
(Cowlishaw 1985). Rexx is a pattern-matching string processing
language which is implemented by means of a very efficient interpreter.
It takes the compiler only a few minutes to convert the 300-line Lisp
interpreter into a 200-page 17,000-variable universal exponential
Diophantine equation. The resulting equation is a little large, but the
ideas used to produce it are simple and few, and the equation results
from the straightforward application of these ideas.
I have published the details of this adventure (but not the full
equation!) as a book (Chaitin 1987b). My hope is that this book will
convince mathematicians that randomness not only occurs in non-linear
dynamics and quantum mechanics, but that it even happens in rather
elementary branches of number theory.
References

Chaitin, G.J.
  1986a Randomness and Gödel's theorem. Mondes en Développement
        No. 54–55 (1986) 125–128.
  1986b Information-theoretic computational complexity and Gödel's
        theorem and information. In: New Directions in the Philosophy
        of Mathematics, ed. T. Tymoczko. Boston: Birkhäuser (1986).
  1987a Incompleteness theorems for random reals. Adv. Appl. Math.
        8 (1987) 119–146.
  1987b Algorithmic Information Theory. Cambridge, England:
        Cambridge University Press (1987).
Cowlishaw, M.F.
  1985 The REXX Language. Englewood Cliffs, NJ: Prentice-Hall
       (1985).
Gödel, K.
  1931 On formally undecidable propositions of Principia mathematica
       and related systems I. In: Kurt Gödel: Collected Works,
       Volume I: Publications 1929–1936, ed. S. Feferman. New York:
       Oxford University Press (1986).
Jones, J.P., and Y.V. Matijasevič
  1984 Register machine proof of the theorem on exponential
       Diophantine representation of enumerable sets. J. Symb. Log.
       49 (1984) 818–829.
McCarthy, J.
  1960 Recursive functions of symbolic expressions and their
       computation by machine, Part I. ACM Comm. 3 (1960) 184–195.
Turing, A.M.
  1936 On computable numbers, with an application to the
       Entscheidungsproblem. P. Lond. Math. Soc. (2) 42 (1936)
       230–265; with a correction, Ibid. (2) 43 (1936–7) 544–546;
       reprinted in: The Undecidable, ed. M. Davis. Hewlett, NY:
       Raven Press (1965).
COMPUTING THE BUSY
BEAVER FUNCTION

In T. M. Cover and B. Gopinath, Open
Problems in Communication and Computation,
Springer, 1987, pp. 108–112

Gregory J. Chaitin
IBM Research Division, P.O. Box 218
Yorktown Heights, NY 10598, U.S.A.
Abstract

Efforts to calculate values of the noncomputable Busy Beaver function
are discussed in the light of algorithmic information theory.
I would like to talk about some impossible problems that arise when one
combines information theory with recursive function or computability
theory. That is to say, I'd like to look at some unsolvable problems
which arise when one examines computation unlimited by any practical
bound on running time, from the point of view of information theory.
The result is what I like to call "algorithmic information theory" [5].
In the Computer Recreations department of a recent issue of
Scientific American [7], A. K. Dewdney discusses efforts to calculate the
Busy Beaver function Σ. This is a very interesting endeavor for a
number of reasons.
First of all, the Busy Beaver function is of interest to information
theorists, because it measures the capability of computer programs as a
function of their size, as a function of the amount of information which
they contain. Σ(n) is defined to be the largest number which can be
computed by an n-state Turing machine; to information theorists it
is clear that the correct measure is bits, not states. Thus it is more
correct to define Σ(n) as the largest natural number whose program-size
complexity or algorithmic information content is less than or equal
to n. Of course, the use of states has made it easier and a definite and
fun problem to calculate values of Σ(number of states); to deal with
Σ(number of bits) one would need a model of a binary computer as
simple and compelling as the Turing machine model, and no obvious
natural choice is at hand.
Perhaps the most fascinating aspect of Dewdney's discussion is that
it describes successful attempts to calculate the initial values Σ(1),
Σ(2), Σ(3), ... of an uncomputable function Σ. Not only is Σ
uncomputable, but it grows faster than any computable function can. In
fact, it is not difficult to see that Σ(n) is greater than the computable
function f(n) as soon as n is greater than (the program-size complexity
or algorithmic information content of f) + O(1). Indeed, to compute
f(n) + 1 it is sufficient to know (a minimum-size program for f), and
the value of the integer (n - the program-size complexity of f). Thus
the program-size complexity of f(n) + 1 is ≤ (the program-size
complexity of f) + O(log |n - the program-size complexity of f|), which
is < n if n is greater than O(1) + the program-size complexity of f.
Hence f(n) + 1 is included in Σ(n), that is, Σ(n) ≥ f(n) + 1, if n is
greater than O(1) + the program-size complexity of f.
Yet another reason for interest in the Busy Beaver function is that,
when properly defined in terms of bits, it immediately provides an
information-theoretic proof of an extremely fundamental fact of
recursive function theory, namely Turing's theorem that the halting
problem is unsolvable [2]. Turing's original proof involves the notion of
a computable real number, and the observation that it cannot be decided
whether or not the nth computer program ever outputs an nth digit,
because otherwise one could carry out Cantor's diagonal construction
and calculate a paradoxical real number whose nth digit is chosen to
differ from the nth digit output by the nth program, and which
therefore cannot actually be a computable real number after all. To use
the noncomputability of Σ to demonstrate the unsolvability of the halting
problem, it suffices to note that in principle, if one were very patient,
one could calculate Σ(n) by checking each program of size less than or
equal to n to determine whether or not it halts, and then running each
of the programs which halt to determine what their output is, and then
taking the largest output. Contrariwise, if Σ were computable, then it
would provide a solution to the halting problem, for an n-bit program
either halts in time less than Σ(n + O(1)), or else it never halts.
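In schematic form (Σ is of course not computable; the sketch only
makes the contrapositive vivid, and the constant C stands for the O(1)
term of the argument):

    # If Sigma were computable, halting would be decidable: simulate
    # the program for Sigma(n + C) steps; if it has not halted by then,
    # it never will.
    C = 1   # placeholder for the O(1) constant (not effectively known)

    def halts(program, Sigma, run_for):
        # run_for(p, t): True iff program p halts within t steps
        return run_for(program, Sigma(len(program) + C))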
The Busy Beaver function is also of considerable metamathematical
interest; in principle it would be extremely useful to know larger values
of Σ(n). For example, this would enable one to settle the Goldbach
conjecture and the Riemann hypothesis, and in fact any conjecture such
as Fermat's which can be refuted by a numerical counterexample. Let P
be a computable predicate of a natural number, so that for any specific
natural number n it is possible to compute in a mechanical fashion
whether P(n), P of n, is true or false, that is, to determine
whether or not the natural number n has property P. How could one
use the Busy Beaver function to decide if the conjecture that P is true
for all natural numbers is correct? An experimental approach is to
use a fast computer to check whether or not P is true, say for the
first billion natural numbers. To convert this empirical approach into a
proof, it would suffice to have a bound on how far it is necessary to test
P before settling the conjecture in the affirmative if no counterexample
has been found, and of course rejecting it if one was discovered. Σ
provides this bound, for if P has program-size complexity or algorithmic
information content k, then it suffices to examine the first Σ(k + O(1))
natural numbers to decide whether or not P is always true. Note that
the program-size complexity or algorithmic information content of a
famous conjecture P is usually quite small; it is hard to get excited
about a conjecture that takes a hundred pages to state.
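In sketch form, with a toy bound standing in for the unknowable
Σ(k + O(1)) and a Goldbach-style predicate as the example conjecture:

    # Settle "for all n, P(n)" by testing P up to the bound B.
    def settle(P, B):
        for n in range(1, B + 1):
            if not P(n):
                return ("refuted at", n)
        return "no counterexample up to the bound"

    def is_prime(q):
        return q > 1 and all(q % d for d in range(2, int(q ** 0.5) + 1))

    def goldbach(n):                  # the (n+1)-st even number >= 4
        m = 2 * n + 2                 # is a sum of two primes?
        return any(is_prime(a) and is_prime(m - a)
                   for a in range(2, m // 2 + 1))

    print(settle(goldbach, 10000))    # only Sigma would make this a proof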
For all these reasons, it is really quite fascinating to contemplate
the successful efforts which have been made to calculate some of the
initial values of Σ(n). In a sense these efforts simultaneously penetrate
to "mathematical bedrock" and are "storming the heavens," to use
images of E. T. Bell. They amount to a systematic effort to settle all
finitely refutable mathematical conjectures, that is, to determine all
constructive mathematical truth. And these efforts fly in the face of
fundamental information-theoretic limitations on the axiomatic method
[1,2,6], which amount to an information-theoretic version of Gödel's
famous incompleteness theorem [3].
Here is the Busy Beaver version of Gödel's incompleteness theorem:
n bits of axioms and rules of inference cannot enable one to prove what
is the value of Σ(k) for any k greater than n + O(1). The proof of
this fact is along the lines of the Berry paradox. Contrariwise, there
is an n-bit axiom which does enable one to demonstrate what is the
value of Σ(k) for any k less than n - O(1). To get such an axiom,
one either asks God for the number of programs less than n bits in
size which halt, or one asks God for a specific n-bit program which
halts and has the maximum possible running time or the maximum
possible output before halting. Equivalently, the divine revelation is a
conjecture ∀k P(k) (with P of program-size complexity or algorithmic
information content ≤ n) which is false and for which (the smallest
counterexample i with ¬P(i)) is as large as possible. Such an axiom
would pack quite a wallop, but only in principle, because it would take
about Σ(n) steps to deduce from it whether or not a specific program
halts and whether or not a specific mathematical conjecture is true for
all natural numbers.
These considerations involving the Busy Beaver function are closely
related to another fascinating noncomputable object, the halting
probability of a universal Turing machine on random input, which I like
to call Ω, and which is the subject of an essay by my colleague Charles
Bennett that was published in the Mathematical Games department of
Scientific American some years ago [4].
References

[1] G. J. Chaitin, "Randomness and mathematical proof," Scientific
    American 232, No. 5 (May 1975), 47–52.
[2] M. Davis, "What is a computation?" in Mathematics Today:
    Twelve Informal Essays, L. A. Steen (ed.), Springer-Verlag, New
    York, 1978, 241–267.
[3] D. R. Hofstadter, Gödel, Escher, Bach: an Eternal Golden Braid,
    Basic Books, New York, 1979.
[4] M. Gardner, "The random number Ω bids fair to hold the mysteries
    of the universe," Mathematical Games Dept., Scientific American
    241, No. 5 (Nov. 1979), 20–34.
[5] G. J. Chaitin, "Algorithmic information theory," in Encyclopedia
    of Statistical Sciences, Volume 1, Wiley, New York, 1982, 38–41.
[6] G. J. Chaitin, "Gödel's theorem and information," International
    Journal of Theoretical Physics 22 (1982), 941–954.
[7] A. K. Dewdney, "A computer trap for the busy beaver, the
    hardest-working Turing machine," Computer Recreations Dept.,
    Scientific American 251, No. 2 (Aug. 1984), 19–23.
Part III
Applications to Biology
TO A MATHEMATICAL
DEFINITION OF "LIFE"

ACM SICACT News, No. 4
(January 1970), pp. 12–18

G. J. Chaitin
Abstract

"Life" and its "evolution" are fundamental concepts that have not yet
been formulated in precise mathematical terms, although some efforts
in this direction have been made. We suggest a possible point of
departure for a mathematical definition of "life." This definition is
based on the computer and is closely related to recent analyses of
"inductive inference" and "randomness." A living being is a unity; it
is simpler to view a living organism as a whole than as the sum of its
parts. If we want to compute a complete description of the region of
space-time that is a living being, the program will be smaller in size if
the calculation is done all together, than if it is done by independently
calculating descriptions of parts of the region and then putting them
together.
1. The Problem

"Life" and its "evolution" from the lifeless are fundamental concepts of
science. According to Darwin and his followers, we can expect living
organisms to evolve under very general conditions. Yet this theory
has never been formulated in precise mathematical terms. Supposing
Darwin is right, it should be possible to formulate a general definition
of "life" and to prove that under certain conditions we can expect it
to "evolve." If mathematics can be made out of Darwin, then we will
have added something basic to mathematics; while if it cannot, then
Darwin must be wrong, and life remains a miracle which has not been
explained by science.
The point is that the view that life has spontaneously evolved, and
the very concept of life itself, are very general concepts, which it should
be possible to study without getting involved in, for example, the
details of quantum chemistry. We can idealize the laws of physics and
simplify them and make them complete, and then study the resulting
universe. It is necessary to do two things in order to study the evolution
of life within our model universe. First of all, we must define "life"; we
must characterize a living organism in a precise fashion. At the same
time it should become clear what the complexity of an organism is, and
how to distinguish primitive forms of life from advanced forms. Then
we must study our universe in the light of the definition. Will an
evolutionary process occur? What is the expected time for a certain
level of complexity to be reached? Or can we show that life will probably
not evolve?
2. Previous Work

Von Neumann devoted much attention to the analysis of fundamental
biological questions from a mathematical point of view.1 He considered

1 See in particular his fifth lecture delivered at the University of Illinois in
December of 1949, "Re-evaluation of the problem of complicated automata –
Problems of hierarchy and evolution," and his unfinished The Theory of Automata:
Construction, Reproduction, Homogeneity. Both are posthumously published in
von Neumann (1966).
a universe consisting of an infinite plane divided into squares. Time
is quantized, and at any moment each square is in one of 29 states.
The state of a square at any time depends only on its previous state
and the previous states of its four neighboring squares. The universe
is homogeneous; the state transitions of all squares are governed by
the same law. It is a deterministic universe. Von Neumann showed
that a self-reproducing general-purpose computer can exist in his model
universe.
A large amount of work on these questions has been done since von
Neumann's initial investigations, and a complete bibliography would
be quite lengthy. We may mention Moore (1962), Arbib (1966, 1967),
and Codd (1968).
The point of departure of all this work has been the identification of
"life" with "self-reproduction," and this identification has both helped
and hindered. It has helped, because it has not allowed fundamental
conceptual difficulties to tie up work, but has instead permitted much
that is very interesting to be accomplished. But it has hindered
because, in the end, these fundamental difficulties must be faced. At
present the problem has evidenced itself as a question of "good taste."
As von Neumann remarks,2 good taste is required in building one's
universe. If its elementary parts are assumed to be very powerful,
self-reproduction is immediate. Arbib (1966) is an intermediate case.
What is the relation between self-reproduction and life? A man may
be sterile, but no one would doubt he is alive. Children are not identical
to their parents. Self-reproduction is not exact; if it were, evolution
would be impossible. What's more, a crystal reproduces itself, yet we
would not consider it to have much life. As von Neumann comments,3
the matter is the other way around. We can deduce self-reproduction
as a property which must be possessed by many living beings, if we ask
ourselves what kinds of living beings are likely to be around. Obviously,
a species that did not reproduce would die out. Thus, if we ask what
kinds of living organisms are likely to evolve, we can draw conclusions
concerning self-reproduction.

2 See pages 76–77 of von Neumann (1966).
3 See page 78 of von Neumann (1966).
3. Simplicity and Complexity

"Complexity" is a concept whose importance and vagueness von
Neumann emphasized many times in his lectures.4 Due to the work of
Solomonoff, Kolmogorov, Chaitin, Martin-Löf, Willis, and Loveland,
we now understand this concept a great deal better than it was
understood while von Neumann worked. Obviously, to understand the
evolution of the complexity of living beings from primitive, simple life
to today's very complex organisms, we need to make precise a measure
of complexity. But it also seems that perhaps a precise concept
of complexity will enable us to define "living organism" in an exact
and general fashion. Before suggesting the manner in which this may
perhaps be done, we shall review the recent developments which have
converted "simplicity" and "complexity" into precise concepts.
We start by summarizing Solomonoff's work.5 Solomonoff proposes
the following model of the predicament of the scientist. A scientist is
continually observing increasingly larger initial segments of an infinite
sequence of 0's and 1's. This is his experimental data. He tries to
find computer programs which compute infinite binary sequences which
begin with the observed sequence. These are his theories. In order
to predict his future observations, he could use any of the theories.
But there will always be one theory that predicts that all succeeding
observations will be 1's, as well as others that take more account of the
previous observations. Which of the infinitely many theories should he
use to make the prediction? According to Solomonoff, the principle
that the simplest theory is the best should guide him.6 What is the
simplicity of a theory in the present context? It is the size of the
computer program. Larger computer programs embody more complex
theories, and smaller programs embody simpler theories.
Willis has further studied the above proposal, and also has
introduced the idea of a hierarchy of finite approximations to it. To my

4 See especially pages 78–80 of von Neumann (1966).
5 The earliest generally available appearance in print of Solomonoff's ideas of
which we are aware is Minsky's summary of them on pages 41–43 of Minsky (1962).
A more recent reference is Solomonoff (1964).
6 Solomonoff actually proposes weighing together all the theories into the
prediction, giving the simplest theories the largest weight.
knowledge, however, the success which predictions made on this basis
will have has not been made completely clear.
We must discuss a more technical aspect of Solomonoff's work. He
realized that the simplicity of theories, and thus also the predictions,
will depend on the computer which one is using. Let us consider only
computers whose programs are finite binary sequences, and measure
the size of a binary sequence by its length. Let us denote by C(T) the
complexity of a theory T. By definition, C(T) is the size of the smallest
program which makes our computer C compute T. Solomonoff showed
that there are "optimal" binary computers C that have the property
that for any other binary computer C′, C(T) ≤ C′(T) + d, for all T.
Here d is a constant that depends on C and C′, not on T. Thus,
these are the most efficient binary computers, for their programs are
shortest. Any two of these optimal binary computers C1 and C2 result
in almost the same complexity measure, for from C1(T) ≤ C2(T) + d12
and C2(T) ≤ C1(T) + d21, it follows that the difference between C1(T)
and C2(T) is bounded. The optimal binary computers are transparent
theoretically; they are enormously convenient from the technical point
of view. What's more, their optimality makes them a very natural
choice.7 Kolmogorov and Chaitin later independently hit upon the
same kind of computer in their search for a suitable computer upon
which to base a definition of "randomness."
However, the naturalness and technical convenience of the
Solomonoff approach should not blind us to the fact that it is by no means
the only possible one. Chaitin first based his definition of randomness
on Turing machines, taking as the complexity measure the number
of states in the machine, and he later used bounded-transfer Turing
machines. Although these computers are quite different, they lead to
similar definitions of randomness. Later it became clear that using
the usual 3-tape-symbol Turing machine and taking its size to be the
number of states leads to a complexity measure C3(T) which is
asymptotically just a Solomonoff measure C(T) with its scale changed:
C(T) is asymptotic to 2 C3(T) log2 C3(T). It appears that people
interested in computers may still study other complexity measures, but
to apply

7 Solomonoff's approach to the size of programs has been extended in Chaitin
(1969a) to the speed of programs.
these concepts of simplicity/complexity it is at present most convenient
to use Solomonoff measures.
We now turn to Kolmogorov's and Chaitin's proposed definition of
randomness or patternlessness. Let us consider once more the scientist
confronted by experimental data, a long binary sequence. This time
he is not interested in predicting future observations, but only in
determining if there is a pattern in his observations, if there is a simple
theory that explains them. If he found a way of compressing his
observations into a short computer program which makes the computer
calculate them, he would say that the sequence follows a law, that it
has pattern. But if there is no short program, then the sequence has no
pattern – it is random. That is to say, the complexity C(S) of a finite
binary sequence S is the size of the smallest program which makes the
computer calculate it. Those binary sequences S of a given length n
for which C(S) is greatest are the most complex binary sequences of
length n, the random or patternless ones. This is a general formulation
of the definition. If we use one of Solomonoff's optimal binary
computers, this definition becomes even clearer. Most binary sequences
of any given length n require programs of about length n. These are
the patternless or random sequences. Those binary sequences which
can be compressed into programs appreciably shorter than themselves
are the sequences which have pattern. Chaitin and Martin-Löf have
studied the statistical properties of these sequences, and Loveland has
compared several variants of the definition.
This completes our summary of the new rigorous meaning which
has been given to simplicity/complexity – the complexity of something
is the size of the smallest program which computes it or a complete
description of it. Simpler things require smaller programs. We have
emphasized here the relation between these concepts and the philosophy
of the scientific method. In the theory of computing the word
"complexity" is usually applied to the speed of programs or the amount
of auxiliary storage they need for scratch-work. These are completely
different meanings of complexity. When one speaks of a simple
scientific theory, one refers to the fact that few arbitrary choices have
been made in specifying the theoretical structure, not to the rapidity
with which predictions can be made.
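As a purely conceptual sketch of the definition just summarized (run
is a hypothetical interpreter with a step budget; without one the search
would not terminate, and indeed C(S) is not computable):

    # C(S): the length of a shortest program making the computer print S.
    from itertools import product

    def complexity(S, run, budget, max_size=24):
        for size in range(1, max_size + 1):
            for bits in product('01', repeat=size):
                if run(''.join(bits), budget) == S:
                    return size       # first hit is a shortest program
        return None                   # nothing found within the limits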
4. What is Life?

Let us once again consider a scientist in a hypothetical situation. He
wishes to understand a universe very different from his own which he
has been observing. As he observes it, he comes eventually to
distinguish certain objects. These are highly interdependent regions of
the universe he is observing, so much so, that he comes to view them as
wholes. Unlike a gas, which consists of independent particles that do
not interact, these regions of the universe are unities, and for this
reason he has distinguished them as single entities.
We believe that the most fundamental property of living organisms
is the enormous interdependence between their components. A living
being is a unity; it is much simpler to view it as a whole than as
the sum of parts. That is to say, if we want to compute a complete
description of a region of space-time that is a living being, the program
will be smaller in size if the calculation is done all together, than if it
is done by independently calculating descriptions of parts of the region
and then putting them together. What is the complexity of a living
being, how can we distinguish primitive life from complex forms? The
interdependence in a primitive unicellular organism is far less than that
in a human being.
A living being is indeed a unity. All the atoms in it cooperate and
work together. If Mr. Smith is afraid of missing the train to his office,
all his incredibly many molecules, all his organs, all his cells, will be
cooperating so that he finishes breakfast quickly and runs to the train
station. If you cut the leg of an animal, all of it will cooperate to escape
from you, or to attack you and scare you away, in order to protect its
leg. Later the wound will heal. How different from what happens if you
cut the leg of a table. The whole table will neither come to the defense
of its leg, nor will it help it to heal. In the more intelligent living
creatures, there is also a very great deal of interdependence between
an animal's past experience and its present behavior; that is to say,
it learns, its behavior changes with time depending on its experiences.
Such enormous interdependence must be a monstrously rare occurrence
in a universe, unless it has evolved gradually.
If the whole is very much simpler than the sum of its parts, we have
the interdependence that characterizes a living being.8 Note nally
that we have introduced something new into the study of the size of
programs (= complexity). Before we compared the sizes of programs
that calculate dierent things. Now we are interested in comparing
the sizes of programs that calculate the same things in dierent ways.
That is to say, the method by which a calculation is done is now of
importance to us in the previous section it was not.

5. Numerical Examples

In this paper, unfortunately, we can only suggest a possible point of
departure for a mathematical definition of life. A great amount of
work must be done; it is not even clear what is the formal mathematical
counterpart to the informal definition of the previous section. A
possibility is sketched here.
Consider a computer C1 which accepts programs P which are binary
sequences consisting of a number of subsequences B, C, P1, ..., Pk, A.
B, the leftmost subsequence, is a program for breaking the remainder
of P into C, P1, ..., Pk, and A. B is self-describing; it starts with a
binary sequence which results from writing the length of B in base-two
notation, doubling each of its bits, and then placing a pair of unequal
bits at the right end. Also, B is not allowed to see whether any of the
remaining bits of P are 0's or 1's, only to separate them into groups.9
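The doubled-bits header is easy to make concrete; here is a small
sketch of the encoding and of how a reader recovers the length (our own
illustration of the scheme just described):

    # Write |B| in binary, double each bit, then append the unequal pair
    # '01'; a reader can recover the length without knowing it beforehand.
    def encode_length(n):
        return ''.join(b + b for b in bin(n)[2:]) + '01'

    def decode_length(s):
        i, bits = 0, []
        while s[i] == s[i + 1]:       # doubled bits spell out the length
            bits.append(s[i])
            i += 2
        return int(''.join(bits), 2), i + 2   # (length, chars consumed)

    assert decode_length(encode_length(13) + '0100110')[0] == 13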
C is the description of a computer C2. For example, C2 could be
one of Solomonoff's optimal binary computers, or a computer which
emits the program without processing it.
P1, ..., Pk are programs which are processed by k different copies of
the computer C2. R1, ..., Rk are the resulting outputs. These outputs
would be regions of space-time, a space-time which, like von Neumann's,
has been cut up into little cubes with a finite number of states.
A is a program for adding together R1, ..., Rk to produce R, a single
region of space-time. A merely juxtaposes the intermediate results

8 The whole cannot be more complex than the sum of its parts, because one of
the ways of looking at it is as the sum of its parts, and this bounds its complexity.
9 The awkwardness of this part of the definition is apparently its chief defect.
R1, ..., Rk (perhaps with some overlapping); it is not allowed to change
any of the intermediate results. In the examples below, we shall only
compute regions R which are one-dimensional strings of 0's and 1's, so
that A need only indicate that R is the concatenation of R1, ..., Rk, in
that order.
R is the output of the computer C1 produced by processing the
program P.
We now define a family of complexity measures C(d, R), the
complexity of a region R of space-time when it is viewed as the sum of
independent regions of diameter not greater than d. C(d, R) is the
length of the smallest program P which makes the computer C1 output
R, among all those P such that the intermediate results R1 to Rk are
all less than or equal to d in diameter. C(d, R) where d equals the
diameter of R is to within a bounded difference just the usual Solomonoff
complexity measure. But as d decreases, we may be forced to forget
any patterns in R that are more than d in diameter, and the complexity
C(d, R) increases.
We present below a table with four examples. In each of the four
cases, R is a 1-dimensional region, a binary sequence of length n. R1
is a random binary sequence of length n ("gas"). R2 consists of n
repetitions of 1 ("crystal"). The left half of R3 is a random binary
sequence of length n/2. The right half of R3 is produced by rotating the
left half about R3's midpoint ("bilateral symmetry"). R4 consists of two
identical copies of a random binary sequence of length n/2 ("twins").
C(d, R) = approx.?   R = R1    R = R2        R = R3        R = R4
                     "gas"     "crystal"     "bilateral    "twins"
                                             symmetry"

d = n                n         log2 n        n/2           n/2
                               (Note 1)

d = n/k              n         k log2 n      n - (n/2k)    n
(k > 1 fixed,                  (Notes 1,2)   (Note 2)      (Note 2)
n large)

d = 1                n         n             n             n
Note 1. This supposes that n is represented in base-two notation
by a random binary sequence. These values are too high in those rare
cases where this is not true.
162 Part III|Applications to Biology
Note 2. These are conjectured values. We can only show that
C(d, R) is approximately less than or equal to these values.
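The "twins" row is the easiest to appreciate empirically. With an
ordinary file compressor standing in, very loosely, for program-size
complexity, a doubled random string compresses to about half as many
bits as its two halves compressed separately:

    # Rough empirical analogue only: zlib is not program-size complexity.
    import os, zlib
    half = os.urandom(20000)                 # a random "half"
    twins = half + half                      # R4: two identical copies
    print(len(zlib.compress(twins, 9)))      # whole: ~20 KB, repeat found
    print(2 * len(zlib.compress(half, 9)))   # parts: ~40 KB, no sharing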
Bibliography
• Arbib, M. A. (1966). "Simple self-reproducing automata,"
  Information and Control.
• Arbib, M. A. (1967). "Automata theory and development: Part
  1," Journal of Theoretical Biology.
• Arbib, M. A. "Self-reproducing automata – some implications for
  theoretical biology."
• Biological Science Curriculum Study. (1968). Biological Science:
  Molecules to Man, Houghton Mifflin Co.
• Chaitin, G. J. (1966). "On the length of programs for computing
  finite binary sequences," Journal of the Association for Computing
  Machinery.
• Chaitin, G. J. (1969a). "On the length of programs for computing
  finite binary sequences: Statistical considerations," ibid.
• Chaitin, G. J. (1969b). "On the simplicity and speed of programs
  for computing infinite sets of natural numbers," ibid.
• Chaitin, G. J. (1970). "On the difficulty of computations," IEEE
  Transactions on Information Theory.
• Codd, E. F. (1968). Cellular Automata. Academic Press.
• Kolmogorov, A. N. (1965). "Three approaches to the definition
  of the concept 'amount of information'," Problemy Peredachi
  Informatsii.
• Kolmogorov, A. N. (1968). "Logical basis for information theory
  and probability theory," IEEE Transactions on Information
  Theory.
• Loveland, D. W. "A variant of the Kolmogorov concept of
  complexity," report 69-4, Math. Dept., Carnegie-Mellon University.
• Loveland, D. W. (1969). "On minimal program complexity
  measures," Conference Record of the ACM Symposium on Theory of
  Computing, May 1969.
• Martin-Löf, P. (1966). "The definition of random sequences,"
  Information and Control.
• Minsky, M. L. (1962). "Problems of formulation for artificial
  intelligence," Mathematical Problems in the Biological Sciences,
  American Math. Society.
• Moore, E. F. (1962). "Machine models of self-reproduction," ibid.
• von Neumann, J. (1966). Theory of Self-Reproducing Automata.
  (Edited by A. W. Burks.) University of Illinois Press.
• Solomonoff, R. J. (1964). "A formal theory of inductive
  inference," Information and Control.
• Willis, D. G. (1969). "Computational complexity and probability
  constructions," Stanford University.
TOWARD A
MATHEMATICAL
DEFINITION OF "LIFE"

In R. D. Levine and M. Tribus, The
Maximum Entropy Formalism, MIT Press,
1979, pp. 477–498

Gregory J. Chaitin
Abstract

In discussions of the nature of life, the terms "complexity,"
"organism," and "information content," are sometimes used in ways
remarkably analogous to the approach of algorithmic information theory,
a mathematical discipline which studies the amount of information
necessary for computations. We submit that this is not a coincidence
and that it is useful in discussions of the nature of life to be able to
refer to analogous precisely defined concepts whose properties can be
rigorously studied. We propose and discuss a measure of degree of
organization and structure of geometrical patterns which is based on the
algorithmic version of Shannon's concept of mutual information. This
paper is intended as a contribution to von Neumann's program of
formulating mathematically the fundamental concepts of biology in a very
general setting, i.e. in highly simplified model universes.
1. Introduction
Here are two quotations from works dealing with the origins of life and
exobiology:
These vague remarks can be made more precise by in-
troducing the idea of information. Roughly speaking, the
information content of a structure is the minimum number
of instructions needed to specify the structure. Once can
see intuitively that many instructions are needed to specify
a complex structure. On the other hand, a simple repeating
structure can be speci ed in rather few instructions. 1]
The traditional concept of life, therefore, may be too narrow for our purpose... We should try to break away from the four properties of growth, feeding, reaction, and reproduction... Perhaps there is a clue in the way we speak of living organisms. They are highly organized, and perhaps this is indeed their essence... What, then, is organization? What sets it apart from other similarly vague concepts? Organization is perhaps viewed best as "complex interrelatedness"... A book is complex; it only resembles an organism in that passages in one paragraph or chapter refer to others elsewhere. A dictionary or thesaurus shows more organization, for every entry refers to others. A telephone directory shows less, for although it is equally elaborate, there is little cross-reference between its entries... [2]
If one compares the first quotation with any introductory article on algorithmic information theory (e.g. [3-4]), and compares the second quotation with a preliminary version of this paper [5], one is struck by the similarities. As these quotations show, there has been a great
deal of thought about how to define "life," "complexity," "organism," and "information content of organism." The attempted contribution of this paper is that we propose a rigorous quantitative definition of these concepts and are able to prove theorems about them. We do not claim that our proposals are in any sense definitive, but, following von Neumann [6-7], we submit that a precise mathematical definition must be given.
Some preliminary considerations: We shall find it useful to distin-
guish between the notion of degree of interrelatedness, interdependence,
structure, or organization, and that of information content. Two ex-
treme examples are an ideal gas and a perfect crystal. The complete
microstate at a given time of the first one is very difficult to describe
fully, and for the second one this is trivial to do, but neither is or-
ganized. In other words, white noise is the most informative message
possible, and a constant pitch tone is least informative, but neither is
organized. Neither a gas nor a crystal should count as organized (see
Theorems 1 and 2 in Section 5), nor should a whale or elephant be con-
sidered more organized than a person simply because it requires more
information to specify the precise details of the current position of each
molecule in its much larger bulk. Also note that following von Neu-
mann 7] we deal with a discrete model universe, a cellular automata
space, each of whose cells has only a nite number of states. Thus we
impose a certain level of granularity in our idealized description of the
real world.
We shall now propose a rigorous theoretical measure of degree of or-
ganization or structure. We use ideas from the new algorithmic formu-
lation of information theory, in which one considers individual objects
and the amount of information in bits needed to compute, construct,
describe, generate or produce them, as opposed to the classical for-
mulation of information theory in which one considers an ensemble of
possibilities and the uncertainty as to which of them is actually the
case. In that theory the uncertainty or "entropy" of a distribution is defined to be

    −Σ_{i<k} pi log pi

and is a measure of one's ignorance of which of the k possibilities actually holds given that the a priori probability of the ith alternative is
pi. (Throughout this paper "log" denotes the base-two logarithm.) In
contrast, in the newer formulation of information theory one can speak
of the information content of an individual book, organism, or picture,
without having to imbed it in an ensemble of all possible such objects
and postulate a probability distribution on them.
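(As a concrete aside, the ensemble formula is trivial to compute. Here is a minimal Python sketch of ours, with the usual convention that terms with pi = 0 contribute nothing.)

    import math

    def ensemble_entropy(p):
        # -sum of p_i log2 p_i, the uncertainty of the distribution in bits
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(ensemble_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform on 4
    print(ensemble_entropy([0.97, 0.01, 0.01, 0.01]))  # about 0.24 bits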
We believe that the concepts of algorithmic information theory are
extremely basic and fundamental. Witness the light they have shed on
the scientific method [8], the meaning of randomness and the Monte Carlo method [9], the limitations of the deductive method [3-4], and now, hopefully, on theoretical biology. An information-theoretic proof of Euclid's theorem that there are infinitely many prime numbers should
also be mentioned (see Appendix 2).
The fundamental notion of algorithmic information theory is H(X), the algorithmic information content (or, more briefly, "complexity") of the object X. H(X) is defined to be the smallest possible number of bits in a program for a general-purpose computer to print out X. In other words, H(X) is the amount of information necessary to describe X sufficiently precisely for it to be constructed.
are said to be (algorithmically) independent if the best way to describe
them both is simply to describe each of them separately. That is to say,
X and Y are independent if H(X, Y) is approximately equal to H(X) + H(Y), i.e. if the joint information content of X and Y is just the sum of the individual information contents of X and Y. If, however, X and Y are related and have something in common, one can take advantage of this to describe X and Y together using much fewer bits than the total number that would be needed to describe them separately, and so H(X, Y) is much less than H(X) + H(Y). The quantity H(X : Y) which is defined as follows

    H(X : Y) = H(X) + H(Y) − H(X, Y)

is called the mutual information of X and Y and measures the degree of interdependence between X and Y. This concept was defined, in an ensemble rather than an algorithmic setting, in Shannon's original paper [10] on information theory, noisy channels, and coding.
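(An illustrative aside of ours: H is not computable, but the flavor of the definition can be conveyed by substituting a real compressor for the ideal one. The sketch below uses zlib's compressed size as a crude stand-in for H; the function names are our own.)

    import os, zlib

    def H(s):
        # crude computable stand-in for algorithmic information content
        return len(zlib.compress(s, 9))

    def mutual_information(x, y):
        # H(X) + H(Y) - H(X, Y), measured in compressed bytes
        return H(x) + H(y) - H(x + y)

    x, y = os.urandom(1000), os.urandom(1000)
    print(mutual_information(x, y))  # near 0: unrelated objects
    print(mutual_information(x, x))  # near H(x): an object and itself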
We now explain our definition of the degree of organization or struc-
ture in a geometrical pattern. The d-diameter complexity Hd (X ) of an
object X is defined to be the minimum number of bits needed to describe X as the "sum" of separate parts each of diameter not greater
than d. Let us be more precise. Given d and X , consider all possi-
ble ways of partitioning X into nonoverlapping pieces each of diameter ≤ d. Then Hd(X) is the sum of the number of bits needed to describe
each of the pieces separately, plus the number of bits needed to spec-
ify how to reassemble them into X . Each piece must have a separate
description which makes no cross-references to any of the others. And
one is interested in those partitions of X and reassembly techniques
which minimize this sum. That is to say,
    Hd(X) = min [H(π) + Σ_{i<k} H(Xi)],

the minimization being taken over all partitions of X into nonoverlapping pieces X0, X1, X2, ..., X_{k−1}, all of diameter ≤ d. (Here π denotes the instructions for reassembling the pieces.)
Thus Hd(X ) is the minimum number of bits needed to describe X
as if it were the sum of independent pieces of size ≤ d. For d larger
than the diameter of X , Hd (X ) will be the same as H (X ). If X is
unstructured and unorganized, then as d decreases Hd(X ) will stay
close to H (X ). However if X has structure, then Hd(X ) will rapidly
increase as d decreases and one can no longer take advantage of patterns
of size > d in describing X . Hence Hd (X ) as a function of d is a kind
of "spectrum" or "Fourier transform" of X. Hd(X) will increase as d decreases past the diameter of significant patterns in X, and if X is
organized hierarchically this will happen at each level in the hierarchy.
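(Another illustrative aside of ours: the "spectrum" can be estimated with the same compressor proxy, restricting attention to contiguous pieces of size at most d and ignoring the reassembly information. The doubled random string below is the "twins" pattern studied in Section 5.)

    import os, zlib

    def H(s):
        return len(zlib.compress(s, 9))

    def Hd(x, d):
        # charge each contiguous piece of size <= d separately, so that
        # patterns of size > d can no longer be exploited
        return sum(H(x[i:i + d]) for i in range(0, len(x), d))

    u = os.urandom(512)
    twins = u + u
    for d in (1024, 512, 256):
        print(d, Hd(twins, d))  # roughly doubles once d <= 512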
Thus the faster the difference increases between Hd(X) and H(X)
as d decreases, the more interrelated, structured, and organized X is.
Note however that X may be a "scene" containing many independent
structures or organisms. In that case their degrees of organization are
summed together in the measure
    Hd(X) − H(X).
Thus the organisms can be de ned as the minimal parts of the scene for
which the amount of organization of the whole can be expressed as the
sum of the organization of the parts, i.e. pieces for which the organiza-
tion decomposes additively. Alternatively, one can use the notion of the
mutual information of two pieces to obtain a theoretical prescription
of how to separate a scene into independent patterns and distinguish a
pattern from an unstructured background in which it is imbedded (see
Section 6).
Let us enumerate what we view as the main points in favor of this definition of organization: It is general, i.e. following von Neumann the details of the physics and chemistry of this universe are not involved; it measures organized structure rather than unstructured details; and it passes the spontaneous generation or "Pasteur" test, i.e. there is a very low probability of creating organization by chance without a long evolutionary process (this may be viewed as a way of restating Theorem 1 in Section 5). The second point is worth elaborating: The
information content of an organism includes much irrelevant detail, and
a bigger animal is necessarily more complex in this sense. But if it were
possible to calculate the mutual information of two arbitrary cells in a
body at a given moment, we surmise that this would give a measure of
the genetic information in a cell. This is because the irrelevant details
in each of them, such as the exact position and velocity of each molecule,
are uncorrelated and would cancel each other out.
In addition to providing a definition of information content and of degree of organization, this approach also provides a definition of "organism" in the sense that a theoretical prescription is given for dissecting a scene into organisms and determining their boundaries, so that the measure of degree of organization can then be applied separately to each organism. However a strong note of caution is in order: We agree with [1] that a definition of "life" is valid as long as anything that satisfies the definition and is likely to appear in the universe under consideration, either is alive or is a by-product of living beings or their activities. There certainly are structures satisfying our definition that are not alive (see Theorems 3 to 6 in Section 5); however, we believe
that they would only be likely to arise as by-products of the activities
of living beings.
In the succeeding sections we shall do the following: give a more formal presentation of the basic concepts of algorithmic information theory; discuss the notions of the independence and mutual information
of groups of more than two objects; formally define Hd; evaluate Hd(R) for some typical one-dimensional geometrical patterns R which we dub "gas," "crystal," "twins," "bilateral symmetry," and "hierarchy"; consider briefly the problem of decomposing scenes containing several independent patterns, and of determining the boundary of a pattern which is imbedded in an unstructured background; discuss briefly the two and higher dimension cases; and mention some alternative definitions of mutual information which have been proposed.
The next step in this program of research would be to proceed from
static snapshots to time-varying situations, in other words, to set up a
discrete universe with probabilistic state transitions and to show that
there is a certain probability that a certain level of organization will be
reached by a certain time. More generally, one would like to determine
the probability distribution of the maximum degree of organization of
any organism at time t + Δ, as a function of it at time t. Let us pro-
pose an initial proof strategy for setting up a nontrivial example of the
evolution of organisms: construct a series of intermediate evolutionary
forms [11], argue that increased complexity gives organisms a selec-
tive advantage, and show that no primitive organism is so successful
or lethal that it diverts or blocks this gradual evolutionary pathway.
What would be the intellectual flavor of the theory we desire? It would
be a quantitative formulation of Darwin's theory of evolution in a very
general model universe setting. It would be the opposite of ergodic the-
ory. Instead of showing that things mix and become uniform, it would
show that variety and organization will probably increase.
Some final comments: Software is fast approaching biological lev-
els of complexity, and hardware, thanks to very large scale integration,
is not far behind. Because of this, we believe that the computer is
now becoming a valid metaphor for the entire organism, not just for
the brain [12]. Perhaps the most interesting example of this is the evolutionary phenomenon suffered by extremely large programs such as operating systems. It becomes very difficult to make changes in such programs, and the only alternative is to add new features rather than modify existing ones. The genetic program has been "patched up" much more and over a much longer period of time than even the
largest operating systems, and Nature has accomplished this in much
the same manner as systems programmers have, by carrying along all
the previous code as new code is added [11]. The experimental proof
of this is that ontogeny recapitulates phylogeny, i.e. each embryo to a
certain extent recapitulates in the course of its development the evo-
lutionary sequence that led to it. In this connection we should also
mention the thesis developed in [13] that the information contained in
the human brain is now comparable with the amount of information in
the genes, and that intelligence plus education may be characterized as
a way of getting around the limited modifiability and channel capacity
of heredity. In other words, Nature, like computer designers, has de-
cided that it is much more flexible to build general-purpose computers than to use heredity to "hardwire" each behavior pattern instinctively
into a special-purpose computer.

2. Algorithmic Information Theory


We first summarize some of the basic concepts of algorithmic information theory in its most recent formulation [14-16].
This new approach leads to a formalism that is very close to that
of classical probability theory and information theory, and is based on
the notion that the tape containing the Turing machine's program is
infinite and entirely filled with 0's and 1's. This forces programs to be self-delimiting, i.e. they must contain within themselves information
about their size, since the computer cannot rely on a blank at the end
of the program to indicate where it ends.
Consider a universal Turing machine U whose programs are in bi-
nary and are self-delimiting. By "self-delimiting" we mean, as was just explained, that they do not have blanks appended as endmarkers. By "universal" we mean that for any other Turing machine M whose programs p are in binary and are self-delimiting, there is a prefix π such that U(πp) always carries out the same computation as M(p).
H(X), the algorithmic information content of the finite object X, is defined to be the size in bits of the smallest self-delimiting program for
U to compute X . This includes the proviso that U halt after printing
X . There is absolutely no restriction on the running time or storage
space used by this program. For example, X can be a natural number
or a bit string or a tuple of natural numbers or bit strings. Note that
variations in the definition of U give rise to at most O(1) differences in the resulting H, by the definition of universality.
The self-delimiting requirement is adopted so that one gets the following basic subadditivity property of H:

    H(⟨X, Y⟩) ≤ H(X) + H(Y) + O(1).

This inequality holds because one can concatenate programs. It expresses the notion of "adding information," or, in computer jargon, "using subroutines."
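(To illustrate why self-delimiting descriptions can be concatenated, here is a toy prefix-free encoding of our own devising, much cruder than the optimal codes of [14]: write the length of s in binary, preceded by a unary count of that numeral's length, then s itself. Decoding never needs an endmarker.)

    def encode(s):
        # prefix-free: 1^len(numeral), then 0, then numeral, then the bits of s
        numeral = bin(len(s))[2:]
        return '1' * len(numeral) + '0' + numeral + s

    def decode(stream):
        # return the first encoded string and the unread remainder
        k = stream.index('0')                    # length of the numeral
        n = int(stream[k + 1:2 * k + 1], 2)      # length of s
        return stream[2 * k + 1:2 * k + 1 + n], stream[2 * k + 1 + n:]

    two = encode('1101') + encode('001')         # concatenated "subroutines"
    first, rest = decode(two)
    print(first, decode(rest)[0])                # -> 1101 001

This toy code costs about |s| + 2 lg |s| bits per string, worse than what is achievable in principle, but it exhibits the prefix property that makes subadditivity work.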
Another important consequence of this requirement is that a natural
probability measure P , which we shall refer to as the algorithmic prob-
ability, can be associated with the result of any computation. P (X ) is
the probability that X is obtained as output if the standard universal
computer U is started running on a program tape filled with 0's and 1's by separate tosses of a fair coin. The algorithmic probability P and the algorithmic information content H are related as follows [14]:

    H(X) = −log P(X) + O(1).    (1)
Consider a binary string s. Define the function L as follows:

    L(n) = max{H(s) : length(s) = n}.
It can be shown [14] that L(n) = n + H(n) + O(1), and that an overwhelming majority of the s of length n have H(s) very close to L(n). Such s have maximum information content and are highly random, patternless, incompressible, and typical. They are said to be "algorithmically random." The greater the difference between H(s) and L(length(s)), the less random s is. It is convenient to say that "s is k-random" if H(s) ≥ L(n) − k, where n = length(s). There are at most

    2^(n−k+O(1))

n-bit strings which aren't k-random. As for natural numbers, most n have H(n) very close to L(floor(log n)). Here floor(x) is the greatest integer ≤ x. Strangely enough, though most strings are random it is impossible to prove that specific strings have this property. For an
explanation of this paradox and further references, see the section on metamathematics in [15], and also see [9].
We now make a few observations that will be needed later. First of all, H(n) is a smooth function of n:

    |H(n) − H(m)| = O(log |n − m|).    (2)

(Note that this is not strictly true if |n − m| is equal to 0 or 1, unless one considers the log of 0 and 1 to be 1; this convention is therefore adopted throughout this paper.) For a proof, see [16]. The following upper bound on H(n) is an immediate corollary of this smoothness property: H(n) = O(log n). Hence if s is an n-bit string, then H(s) ≤ n + O(log n). Finally, note that changes in the value of the argument of the function L produce nearly equal changes in the value of L. Thus, for any δ there is a Δ such that L(n) ≤ L(m) + Δ if n ≤ m + δ. This is because of the fact that L(n) = n + H(n) + O(1) and the smoothness property (2) of H.
An important concept of algorithmic information theory that hasn't been mentioned yet is the conditional probability P(Y|X), which by definition is P(⟨X, Y⟩)/P(X). To the conditional probability there corresponds the relative information content H(Y|X), which is defined to be the size in bits of the smallest programs for the standard universal computer U to output Y if it is given X*, a canonical minimum-size program for calculating X. X* is defined to be the first H(X)-bit program to compute X that one encounters in a fixed recursive enumeration of the graph of U (i.e. the set of all ordered pairs of the form ⟨p, U(p)⟩). Note that there are partial recursive functions which map X* to ⟨X, H(X)⟩ and back again, and so X* may be regarded as an abbreviation for the ordered pair whose first element is the string X and whose second element is the natural number that is the complexity of X. We should also note the immediate corollary of (1) that minimum-size or nearly minimum-size programs are essentially unique: For any Δ there is a c such that for all X the cardinality of {the set of all programs for U to calculate X that are within Δ bits of the minimum size H(X)} is less than c. It is possible to prove the following theorem relating the conditional probability and the relative information content [14]:

    H(Y|X) = −log P(Y|X) + O(1).    (3)
From (1) and (3) and the definition P(⟨X, Y⟩) = P(X)P(Y|X), one obtains this very basic decomposition:

    H(⟨X, Y⟩) = H(X) + H(Y|X) + O(1).    (4)

3. Independence and Mutual Information


It is an immediate corollary of (4) that the following four quantities are all within O(1) of each other:

    H(X) − H(X|Y),
    H(Y) − H(Y|X),
    H(X) + H(Y) − H(⟨X, Y⟩),
    H(Y) + H(X) − H(⟨Y, X⟩).
These four quantities are known as the mutual information H(X : Y) of X and Y; they measure the extent to which X and Y are interdependent. For if P(⟨X, Y⟩) ≈ P(X)P(Y), then H(X : Y) = O(1); and if Y is a recursive function of X, then H(Y|X) = O(1) and H(X : Y) = H(Y) + O(1). In fact,

    H(X : Y) = −log [P(X)P(Y)/P(⟨X, Y⟩)] + O(1),

which shows quite clearly that H(X : Y) is a symmetric measure of the independence of X and Y. Note that in algorithmic information theory,
what is of importance is an approximate notion of independence and
a measure of its degree (mutual information), rather than the exact
notion. This is because the algorithmic probability may vary within
a certain percentage depending on the choice of universal computer
U . Conversely, information measures in algorithmic information theory
should not vary by more than O(1) depending on the choice of U .
To motivate the definition of the d-diameter complexity, we now discuss how to generalize the notion of independence and mutual information from a pair to an n-tuple of objects. In what follows classical and algorithmic probabilities are distinguished by using curly brackets for the first one and parentheses for the second. In probability theory
the mutual independence of a set of n events {Ak : k < n} is defined by the following 2^n equations:

    ∏_{k∈S} P{Ak} = P{∩_{k∈S} Ak}

for all S ⊆ n. Here the set-theoretic convention due to von Neumann is used that identifies the natural number n with the set {k : k < n}.
In algorithmic probability theory the analogous condition would be to require that

    ∏_{k∈S} P(Ak) ≈ P(⊔_{k∈S} Ak)    (5)

for all S ⊆ n. Here ⊔ Ak denotes the tuple forming operation for a variable length tuple, i.e.

    ⊔_{k<n} Ak = ⟨A0, A1, A2, ..., A_{n−1}⟩.
It is a remarkable fact that these 2^n conditions (5) are equivalent to the single requirement that

    ∏_{k<n} P(Ak) ≈ P(⊔_{k<n} Ak).    (6)
To demonstrate this it is necessary to make use of special properties of algorithmic probability that are not shared by general probability measures. In the case of a general probability space,

    P{A ∩ B} ≥ P{A} + P{B} − 1

is the best lower bound on P{A ∩ B} that can in general be formulated in terms of P{A} and P{B}. For example, it is possible for P{A} and P{B} to both be 1/2, while P{A ∩ B} = 0. In algorithmic information theory the situation is quite different. In fact one has:

    P(⟨A, B⟩) ≥ c2 P(A)P(B),

and this generalizes to any fixed number of objects:

    P(⊔_{k<n} Ak) ≥ cn ∏_{k<n} P(Ak).
Thus if the joint algorithmic probability of a subset of the n-tuple of objects were significantly greater than the product of their individual algorithmic probabilities, then this would also hold for the entire n-tuple of objects. More precisely, for any S ⊆ n one has

    P(⊔_{k<n} Ak) ≥ c′n P(⊔_{k∈S} Ak) P(⊔_{k∈n−S} Ak) ≥ c″n P(⊔_{k∈S} Ak) ∏_{k∈n−S} P(Ak).

Then if one assumes that

    P(⊔_{k∈S} Ak) ≫ ∏_{k∈S} P(Ak)

(here ≫ denotes "much greater than"), it follows that

    P(⊔_{k<n} Ak) ≫ ∏_{k<n} P(Ak).
We conclude that in algorithmic probability theory (5) and (6) are equivalent and thus (6) is a necessary and sufficient condition for an n-tuple to be mutually independent. Therefore the following measure of mutual information for n-tuples accurately characterizes the degree of interdependence of n objects:

    [Σ_{k<n} H(Ak)] − H(⊔_{k<n} Ak).

This measure of mutual information subsumes all others in the following precise sense:

    [Σ_{k<n} H(Ak)] − H(⊔_{k<n} Ak) = max{[Σ_{k∈S} H(Ak)] − H(⊔_{k∈S} Ak)} + O(1),

where the maximum is taken over all S ⊆ n.
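(In the same compressed-size approximation as before, our illustration only, with zlib standing in for H and concatenation standing in for the tuple-forming operation, the n-tuple measure reads:)

    import os, zlib

    def H(s):
        return len(zlib.compress(s, 9))

    def tuple_mutual_information(parts):
        # [sum of H(A_k)] - H(tuple of the A_k)
        return sum(H(a) for a in parts) - H(b''.join(parts))

    u = os.urandom(400)
    print(tuple_mutual_information([os.urandom(400) for _ in range(3)]))  # near 0
    print(tuple_mutual_information([u, u, u]))  # large: the parts are interdependent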

4. Formal Definition of Hd
We can now present the definition of the d-diameter complexity Hd(R). We assume a geometry: graph paper of some finite number of dimen-
sions that is divided into unit cubes. Each cube is black or white,
opaque or transparent, in other words, contains a 1 or a 0. Instead
of requiring an output tape which is multidimensional, our universal
Turing machine U outputs tuples giving the coordinates and the con-
tents (0 or 1) of each unit cube in a geometrical object that it wishes
to print. Of course geometrical objects are considered to be the same
if they are translation equivalent. We choose for this geometry the
city-block metric
    D(X, Y) = max_i |xi − yi|,
which is more convenient for our purposes than the usual metric. By
a region we mean a set of unit cubes with the property that from any
cube in it to any other one there is a path that only goes through
other cubes in the region. To this we add the constraint which in the
3-dimensional case is that the connecting path must only pass through
the interior and faces of cubes in the region, not through their edges or
vertices. The diameter of an arbitrary region R is denoted by |R|, and is defined to be the minimum diameter 2r of a "sphere"

    {X : D(X, X0) ≤ r}
which contains R. Hd(R), the size in bits of the smallest programs which calculate R as the "sum" of independent regions of diameter ≤ d, is defined as follows:

    Hd(R) = min [α + Σ_{i<k} H(Ri)]

where

    α = H(R | ⊔_{i<k} Ri) + H(k),

the minimization being taken over all k and all partitions of R into k-tuples ⊔ Ri of nonoverlapping regions with the property that |Ri| < d for all i < k.
The discussion in Section 3 of independence and mutual informa-
tion shows that Hd (R) is a natural measure to consider. Excepting
the α term, Hd(R) − H(R) is simply the minimum attainable mutual information over any partition of R into nonoverlapping pieces all of
size not greater than d. We shall see in Section 5 that in practice the
min is attained with a small number of pieces and the α term is not very significant.
A few words about α, the number of bits of information needed to know how to assemble the pieces: The H(k) term is included in α, as illustrated in Lemma 1 below, because it is the number of bits needed to tell U how many descriptions of pieces are to be read. The H(R | ⊔ Ri) term is included in α because it is the number of bits needed to tell U how to compute R given the k-tuple of its pieces. This is perhaps the most straightforward formulation, and the one that is closest in spirit to Section 5 of [5]. However, less information may suffice, e.g.

    H(R | ⟨k, ⊔_{i<k} (Ri)*⟩) + H(k)

bits. In fact, one could define α to be the minimum number of bits in a string which yields a program to compute the entire region when it is concatenated with minimum-size programs for all the pieces of the region; i.e. one could take

    α = min{|p| : U(p R0* R1* R2* ... R_{k−1}*) = R}.
Here are two basic properties of Hd: If d ≥ |R|, then Hd(R) = H(R) + O(1); Hd(R) increases monotonically as d decreases. Hd(R) = H(R) + O(1) if d ≥ |R| because we have included the α term in the definition of Hd(R). Hd(R) increases as d decreases because one can no longer take advantage of patterns of diameter greater than d to describe R. The curve showing Hd(R) as a function of d may be considered a kind of "Fourier spectrum" of R. Interesting things will happen to the curve at the d which are the sizes of significant patterns in R.
Lemma 1. ("Subadditivity for n-tuples")

    H(⊔_{k<n} Ak) ≤ cn + Σ_{k<n} H(Ak).

Proof.

    H(⊔_{k<n} Ak) = H(⟨n, ⊔_{k<n} Ak⟩) + O(1)
                  = H(n) + H(⊔_{k<n} Ak | n) + O(1)
                  ≤ c′ + H(n) + Σ_{k<n} H(Ak).

Hence one can take

    cn = c′ + H(n).

5. Evaluation of Hd for Typical One-Dimensional Geometrical Patterns
Before turning to the examples, we present a lemma needed for esti-
mating Hd(R). The idea is simply that sufficiently large pieces of a random string are also random. It is required that the pieces be sufficiently large for the following reason: It is not difficult to see that for any j, there is an n so large that random strings of size greater than n must contain all 2^j possible subsequences of length j. In fact, for n sufficiently large the relative frequency of occurrence of all 2^j possible subsequences must approach the limit 2^(−j).
Lemma 2. ("Random parts of random strings")
Consider an n-bit string s to be a loop. For any natural numbers i and j between 1 and n, consider the sequence u of contiguous bits from s starting at the ith and continuing around the loop to the jth. Then if s is k-random, its subsequence u is (k + O(log n))-random.
Proof. The number of bits in u is j − i + 1 if j is ≥ i, and is n + j − i + 1 if j is < i. Let v be the remainder of the loop s after u has been excised. Then we have H(u) + H(v) + H(i) + O(1) ≥ H(s). Thus H(u) + n − |u| + O(log n) ≥ H(s), or H(u) ≥ H(s) − n + |u| + O(log n). Thus if s is k-random, i.e. H(s) ≥ L(n) − k = n + H(n) − k + O(1), then u is x-random, where x is determined as follows: H(u) ≥ n + H(n) − k − n + |u| + O(log n) = |u| + H(|u|) − k + O(log n). That is to say, if s is k-random, then its subsequence u is (k + O(log n))-random.
Lemma 3. ("Random prefixes of random strings")
Consider an n-bit string s. For any natural number j between 1 and n, consider the sequence u consisting of the first j bits of s. Then if s is k-random, its j-bit prefix u is (O(log j) + k)-random.
Proof. Let the (n − j)-bit string v be the remainder of s after u is excised. Then we have H(u) + H(v) + O(1) ≥ H(s), and therefore H(u) ≥ H(s) − L(n − j) + O(1) = L(n) − k − L(n − j) + O(1) since s is k-random. Note that L(n) − L(n − j) = j + H(n) − H(n − j) + O(1)
= j + O(log j), by the smoothness property (2) of H. Hence H(u) ≥ j + O(log j) − k. Thus if u is x-random (x as small as possible), we have L(j) − x = j + O(log j) − x ≥ j + O(log j) − k. Hence x ≤ O(log j) + k.
Remark. Conversely, any random n-bit string can be extended
by concatenating k bits to it in such a manner that the result is a
random (n + k)-bit string. We shall not use this converse result, but it
is included here for the sake of completeness.
Lemma 4. ("Random extensions of random strings")
Assume the string s is x-random. Consider a natural number k. Then there is a k-bit string e such that se is y-random, as long as k, x, and y satisfy a condition of the following form:

    y ≥ x + O(log x) + O(log k).

Proof. Assume on the contrary that the x-random string s has no y-random k-bit extension and y ≥ x + O(log x) + O(log k), i.e. x < y + O(log y) + O(log k). From this assumption we shall derive a contradiction by using the fact that most strings of any particular size are y-random, i.e. the fraction of them that are y-random is at least

    1 − 2^(−y+O(1)).

It follows that the fraction of |s|-bit strings which have no y-random k-bit extension is less than

    2^(−y+O(1)).

Since by hypothesis no k-bit extension of s is y-random, we can uniquely determine s if we are given y and k and the ordinal number of the position of s in {the set of all |s|-bit strings which have no y-random k-bit extension} expressed as an (|s| − y + O(1))-bit string. Hence H(s) is less than L(|s| − y + O(1)) + H(y) + H(k) + O(1). In as much as L(n) = n + H(n) + O(1) and |H(n) − H(m)| = O(log |n − m|), it follows that H(s) is less than L(|s|) − [y + O(log y) + O(log k)]. Since s is by assumption x-random, i.e. H(s) ≥ L(|s|) − x, we obtain a lower bound on x of the form y + O(log y) + O(log k), which contradicts our original assumption that x < y + O(log y) + O(log k).
Theorem 1. ("Gas")
Suppose that the region R is an O(log n)-random n-bit string. Consider d = n/k, where n is large, and k is fixed and greater than zero. Then

    H(R) = n + O(log n) and Hd(R) = H(R) + O(log H(R)).

Proof that Hd(R) ≤ H(R) + O(log H(R))
Let σ be concatenation of tuples of strings, i.e.

    σ(⊔_{i≤k} Ri) = R0R1R2...Rk.

Note that

    H(σ(⊔_{i≤k} Ri) | ⊔_{i≤k} Ri) = O(1).

Divide R into k successive strings of size floor(|R|/k), with one (possibly null) string of size less than k left over at the end. Taking this choice of partition ⊔ Ri in the definition of Hd(R), and using the fact that H(s) ≤ |s| + O(log |s|), we see that

    Hd(R) ≤ O(1) + H(k + 1) + Σ_{i≤k} {|Ri| + O(log |Ri|)}
          ≤ O(1) + n + (k + 2)O(log n)
          = n + O(log n).

Proof that Hd(R) ≥ H(R) + O(log H(R))
This follows immediately from the fact that H_{|R|}(R) = H(R) + O(1) and Hd(R) increases monotonically as d decreases.
Theorem 2. ("Crystal")
Suppose that the region R is an n-bit string consisting entirely of 1's, and that the base-two numeral for n is O(log log n)-random. Consider d = n/k, where n is large, and k is fixed and greater than zero. Then

    H(R) = log n + O(log log n) and Hd(R) = H(R) + O(log H(R)).

Proof that Hd(R) ≤ H(R) + O(log H(R))
If one considers using the concatenation function for assembly as was done in the proof of Theorem 1, and notes that H(1^n) = H(n) + O(1), one sees that it is sufficient to partition the natural number n into
O(k) summands none of which is greater than n/k in such a manner that H(n) + O(log log n) upper bounds the sum of the complexities of the summands. Division into equal size pieces will not do, because H(floor(n/k)) = H(n) + O(1), and one only gets an upper bound of kH(n) + O(1). It is necessary to proceed as follows: Let m be the greatest natural number such that 2^m ≤ n/k. And let p be the smallest natural number such that 2^p > n. By converting n to base-two notation, one can express n as the sum of ≤ p distinct non-negative powers of two. Divide all these powers of two into two groups: those that are less than 2^m and those that are greater than or equal to 2^m. Let f be the sum of all the powers in the first group. f is < 2^m ≤ n/k. Let s be the sum of all the powers in the second group. s is a multiple of 2^m; in fact, it is of the form t2^m with t = O(k). Thus n = f + s = f + t2^m, where f ≤ n/k, 2^m ≤ n/k, and t = O(k). The complexity of 2^m is H(m) + O(1) = O(log m) = O(log log n). Thus the sum of the complexities of the t summands 2^m is also O(log log n). Moreover, f when expressed in base-two notation has log k + O(1) fewer bit positions on the left than n does. Hence the complexity of f is H(n) + O(1). In summary, we have O(k) quantities ni with the following properties:

    n = Σ ni,  ni ≤ n/k,  Σ H(ni) ≤ H(n) + O(log log n).

Thus Hd(R) ≤ H(R) + O(log H(R)).
Proof that Hd(R) ≥ H(R) + O(log H(R))
This follows immediately from the fact that H_{|R|}(R) = H(R) + O(1) and Hd(R) increases monotonically as d decreases.
Theorem 3. ("Twins")
For convenience assume n is even. Suppose that the region R consists of two repetitions of an O(log n)-random n/2-bit string u. Consider d = n/k, where n is large, and k is fixed and greater than unity. Then

    H(R) = n/2 + O(log n) and Hd(R) = 2H(R) + O(log H(R)).

Proof that Hd(R) ≤ 2H(R) + O(log H(R))
The reasoning is the same as in the case of the "gas" (Theorem 1). Partition R into k successive strings of size floor(|R|/k), with one (possibly null) string of size less than k left over at the end.
Proof that Hd(R) ≥ 2H(R) + O(log H(R))
By the definition of Hd(R), there is a partition ⊔ Ri of R into nonoverlapping regions which has the property that

    Hd(R) = α + Σ H(Ri),  α = H(R | ⊔ Ri) + H(k),  |Ri| ≤ d.

Classify the non-null Ri into three mutually exclusive sets A, B, and C: A is the set of all non-null Ri which come from the left half of R ("the first twin"), B is the (empty or singleton) set of all non-null Ri which come from both halves of R ("straddles the twins"), and C is the set of all non-null Ri which come from the right half of R ("the second twin"). Let A′, B′, and C′ be the sets of indices i of the regions Ri in A, B, and C, respectively. And let A″, B″, and C″ be the three portions of R which contained the pieces in A, B, and C, respectively. Using the idea of Lemma 1, one sees that

    H(A″) ≤ O(1) + H(#(A)) + Σ_{i∈A′} H(Ri)
    H(B″) ≤ O(1) + H(#(B)) + Σ_{i∈B′} H(Ri)
    H(C″) ≤ O(1) + H(#(C)) + Σ_{i∈C′} H(Ri).

Here # denotes the cardinality of a set. Now A″, B″, and C″ are each a substring of an O(log n)-random n/2-bit string. This assertion holds for B″ for the following two reasons: the n/2-bit string is considered to be a loop, and |B″| ≤ d = n/k ≤ n/2 since k is assumed to be greater than 1. Hence, applying Lemma 2, one obtains the following inequalities:

    |A″| + O(log n) ≤ H(A″)
    |B″| + O(log n) ≤ H(B″)
    |C″| + O(log n) ≤ H(C″).

Adding both of the above sets of three inequalities and using the facts that

    |A″| + |B″| + |C″| = |R| = n,  #(A) ≤ n/2,  #(B) ≤ 1,  #(C) ≤ n/2
and that H(m) = O(log m), one sees that

    n + O(log n) ≤ H(A″) + H(B″) + H(C″)
                 ≤ O(1) + H(#(A)) + H(#(B)) + H(#(C)) + Σ {H(Ri) : i ∈ A′ ∪ B′ ∪ C′}
                 ≤ O(log n) + Σ H(Ri).

Hence

    Hd(R) ≥ Σ H(Ri) ≥ n + O(log n) = 2H(R) + O(log H(R)).
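(Theorems 1 to 3 can be checked qualitatively with the same compressor proxy; this is our rough experiment, not part of the formal argument. For the gas, chopping into pieces of size n/k changes the cost hardly at all; for the crystal both quantities stay negligible next to n; for the twins the cost roughly doubles.)

    import os, zlib

    def H(s):
        return len(zlib.compress(s, 9))

    def Hd(x, d):
        # contiguous pieces of size <= d, each charged separately
        return sum(H(x[i:i + d]) for i in range(0, len(x), d))

    n, k = 4096, 4
    u = os.urandom(n // 2)
    for name, R in [('gas', os.urandom(n)),
                    ('crystal', b'\x01' * n),
                    ('twins', u + u)]:
        print(name, H(R), Hd(R, n // k))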
Theorem 4. ("Bilateral Symmetry")
For convenience assume n is even. Suppose that the region R consists of an O(log n)-random n/2-bit string u concatenated with its reversal. Consider d = n/k, where n is large, and k is fixed and greater than zero. Then

    H(R) = n/2 + O(log n) and Hd(R) = (2 − 1/k)H(R) + O(log H(R)).

Proof. The proof is along the lines of that of Theorem 3, with one new idea. In the previous proof we considered B″ which is the region Ri in the partition of R that straddles R's midpoint. Before, B″ was O(log |R|)-random, but now it can be compressed into a program about half its size, i.e. about |B″|/2 bits long. Hence the maximum departure from randomness for B″ is for it to only be (O(log |R|) + |R|/2k)-random, and this is attained by making B″ as large as possible and having its midpoint coincide with that of R.
Theorem 5. ("Hierarchy")
For convenience assume n is a power of two. Suppose that the region
R is constructed in the following fashion. Consider an O(1)-random
log n-bit string s. Start with the one-bit string 1, and successively
concatenate the string with itself or with its bit by bit complement, so
that its size doubles at each stage. At the ith stage, the string or its
complement is chosen depending on whether the ith bit of s is a 0 or
a 1, respectively. Consider the resulting n-bit string R and d = n/k, where n is large, and k is fixed and greater than zero. Then

    H(R) = log n + O(log log n) and Hd(R) = kH(R) + O(log H(R)).
Proof that Hd(R) ≤ kH(R) + O(log H(R))
The reasoning is similar to the case of the upper bounds on Hd(R) in Theorems 1 and 3. Partition R into k successive strings of size floor(|R|/k), with one (possibly null) string of size less than k left over at the end.
Proof that Hd(R) ≥ kH(R) + O(log H(R))
Proceeding as in the proof of Theorem 3, one considers a partition ⊔ Ri of R that realizes Hd(R). Using Lemma 3, one can easily see that the following lower bound holds for any substring Ri of R:

    H(Ri) ≥ max{1, log |Ri| − c log log |Ri|}.

The max{1, ...} is because H is always greater than or equal to unity; otherwise U would have only a single output. Hence the following expression is a lower bound on Hd(R):

    Σ Λ(|Ri|)    (7)

where

    Λ(x) = max{1, log x − c log log x},  Σ |Ri| = |R| = n,  |Ri| ≤ d.

It follows that one obtains a lower bound on (7) and thus on Hd(R) by solving the following minimization problem: Minimize

    Σ Λ(ni)    (8)

subject to the following constraints:

    Σ ni = n,  ni ≤ n/k,  n large,  k fixed.

Now to do the minimization. Note that as x goes to infinity, Λ(x)/x goes to the limit zero. Furthermore, the limit is never attained, i.e. Λ(x)/x is never equal to zero. Moreover, for x and y sufficiently large and x less than y, Λ(x)/x is greater than Λ(y)/y. It follows that a sum of the form (8) with the ni constrained as indicated is minimized by making the ni as large as possible. Clearly this is achieved by taking
all but one of the ni equal to floor(n/k), with the last ni equal to remainder(n/k). For this choice of ni the value of (8) is

    k[log n + O(log log n)] + Λ(remainder(n/k))
    = k log n + O(log log n)
    = kH(R) + O(log H(R)).
Theorem 6. For convenience assume n is a perfect square. Suppose that the region R is an n-bit string consisting of √n repetitions of an O(log n)-random √n-bit string u. Consider d = n/k, where n is large, and k is fixed and greater than zero. Then

    H(R) = √n + O(log n) and Hd(R) = kH(R) + O(log H(R)).

Proof that Hd(R) ≤ kH(R) + O(log H(R))
The reasoning is identical to the case of the upper bound on Hd(R) in Theorem 5.
Proof that Hd(R) ≥ kH(R) + O(log H(R))
Proceeding as in the proof of Theorem 5, one considers a partition ⊔ Ri of R that realizes Hd(R). Using Lemma 2, one can easily see that the following lower bound holds for any substring Ri of R:

    H(Ri) ≥ max{1, −c log n + min{√n, |Ri|}}.

Hence the following expression is a lower bound on Hd(R):

    Σ Λn(|Ri|)    (9)

where

    Λn(x) = max{1, −c log n + min{√n, x}},  Σ |Ri| = |R| = n,  |Ri| ≤ d.

It follows that one obtains a lower bound on (9) and thus on Hd(R) by solving the following minimization problem: Minimize

    Σ Λn(ni)    (10)

subject to the following constraints:

    Σ ni = n,  ni ≤ n/k,  n large,  k fixed.
Now to do the minimization. Consider Λn(x)/x as x goes from 1 to n. It is easy to see that this ratio is much smaller, on the order of 1/√n, for x near to n than it is for x anywhere else in the interval from 1 to n. Also, for x and y both greater than √n and x less than y, Λn(x)/x is greater than Λn(y)/y. It follows that a sum of the form (10) with the ni constrained as indicated is minimized by making the ni as large as possible. Clearly this is achieved by taking all but one of the ni equal to floor(n/k), with the last ni equal to remainder(n/k). For this choice of ni the value of (10) is

    k[√n + O(log n)] + Λn(remainder(n/k))
    = k√n + O(log n)
    = kH(R) + O(log H(R)).

6. Determining Boundaries of Geometrical Patterns
What happens to the structures of Theorems 3 to 6 if they are imbedded in a gas or crystal, i.e. in a random or constant 0 background? And what about scenes with several independent structures imbedded in them: do their degrees of organization sum together? Is our definition sufficiently robust to work properly in these circumstances?
This raises the issue of determining the boundaries of structures. It is easy to pick out the hierarchy of Theorem 5 from an unstructured background. Any two "spheres" of diameter δ will have a high mutual information given δ if and only if they are both in the hierarchy instead of in the background. Here we are using the notion of the mutual information of X and Y given Z, which is denoted H(X : Y | Z), and is defined to be H(X|Z) + H(Y|Z) − H(⟨X, Y⟩|Z). The special case of this concept that we are interested in, however, can be expressed more simply: for if X and Y are both strings of length n, then it can be shown that H(X : Y | n) = H(X : Y) − H(n). This is done by using the decomposition (4) and the fact that since X and Y are both of length n, H(⟨n, X⟩) = H(X) + O(1), H(⟨n, Y⟩) = H(Y) + O(1), and
H(⟨n, ⟨X, Y⟩⟩) = H(⟨X, Y⟩) + O(1), and thus

    H(X | n) = H(X) − H(n) + O(1)
    H(Y | n) = H(Y) − H(n) + O(1)
    H(⟨X, Y⟩ | n) = H(⟨X, Y⟩) − H(n) + O(1).
How can one dissect a structure from a comparatively unorganized background in the other cases, the structures of Theorems 3, 4, and 6? The following definition is an attempt to provide a tool for doing this: An ε-δ-pattern R is a maximal region ("maximal" means not extensible, not contained in a bigger region R′ which is also an ε-δ-pattern) with the property that for any δ-diameter sphere R1 in R there is a disjoint δ-diameter sphere R2 in R such that

    H(R1 : R2 | δ) ≥ ε.

The following questions immediately arise: What is the probability of having an ε-δ-pattern in an n-bit string, i.e. what proportion of the n-bit strings contain an ε-δ-pattern? This is similar to asking what is the probability that an n-bit string s satisfies

    H_{n/k}(s) − H(s) > x.

A small upper bound on the latter probability can be derived from Theorem 1.

7. Two and Higher Dimension Geometrical Patterns
We make a few brief remarks.
In the general case, to say that a geometrical object O is "random" means H(O | shape(O)*) ≈ volume(O), or H(O) ≈ volume(O) + H(shape(O)). Here shape(O) denotes the object O with all the 1's that it contains in its unit cubes changed to 0's. Here are some examples: A random n by n square has complexity

    n^2 + H(n) + O(1).
A random n by m rectangle doesn't have complexity nm + H(n) + H(m) + O(1), for if m = n this states that a random n by n square has complexity

    n^2 + 2H(n) + O(1),

which is false. Instead a random n by m rectangle has complexity nm + H(⟨n, m⟩) + O(1) = nm + H(n) + H(m|n) + O(1), which gives the right answer for m = n, since H(n|n) = O(1). One can show that most n by m rectangles have complexity nm + H(⟨n, m⟩) + O(1), and fewer than 2^(nm−k+O(1)) have complexity less than nm + H(⟨n, m⟩) − k.
Here is a two-dimensional version of Lemma 2: Any large chunk of
a random square which has a shape that is easy to describe, must itself
be random.

8. Common Information
We should mention some new concepts that are closely related to the
notion of mutual information. They are called measures of common
information. Here are three different expressions defining the common information content of two strings X and Y. In them the parameter ε denotes a small tolerance, and as before H(X : Y | Z) denotes H(X|Z) + H(Y|Z) − H(⟨X, Y⟩|Z).

    max{H(Z) : H(Z|X) < ε & H(Z|Y) < ε}
    min{H(⟨X, Y⟩ : Z) : H(X : Y | Z) < ε}
    min{H(Z) : H(X : Y | Z) < ε}
Thus the first expression for the common information of two strings defines it to be the maximum information content of a string that can be extracted easily from both, the second defines it to be the minimum of the mutual information of the given strings and any string in the light of which the given strings look nearly independent, and the third defines it to be the minimum information content of a string in the light of which the given strings appear nearly independent. Essentially these definitions of common information are given in [17-19]. [17] considers an algorithmic formulation of its common information measure, while [18] and [19] deal exclusively with the classical ensemble setting.
Appendix 1: Errors in [5]
... The definition of the d-diameter complexity given in [5] has a basic flaw which invalidates the entries for R = R2, R3 and R4 and d = n/k in the table in [5]: It is insensitive to changes in the diameter d...
There is also another error in the table in [5], even if we forget the flaw in the definition of the d-diameter complexity. The entry for the crystal is wrong, and should read log n rather than k log n (see Theorem 2 in Section 5 of this paper).

Appendix 2: An Information-Theoretic Proof That There Are Infinitely Many Primes
It is of methodological interest to use widely differing techniques in elementary proofs of Euclid's theorem that there are infinitely many primes. For example, see Chapter II of Hardy and Wright [20], and also [21-23]. Recently Billingsley [24] has given an information-theoretic
proof of Euclid's theorem. The purpose of this appendix is to point out
that there is an information-theoretic proof of Euclid's theorem that
utilizes ideas from algorithmic information theory instead of the classi-
cal measure-theoretic setting employed by Billingsley. We consider the
algorithmic entropy H (n), which applies to individual natural numbers
n instead of to ensembles.
The proof is by reductio ad absurdum. Suppose on the contrary that there are only finitely many primes p1, ..., pk. Then one way to specify algorithmically an arbitrary natural number

    n = ∏ pi^(ei)

is by giving the k-tuple ⟨e1, ..., ek⟩ of exponents in any of its prime factorizations (we pretend not to know that the prime factorization is unique). Thus we have

    H(n) ≤ H(⟨e1, ..., ek⟩) + O(1).
By the subadditivity of algorithmic entropy we have

    H(n) ≤ Σ H(ei) + O(1).

Let us examine this inequality. Most n are algorithmically random and so the left-hand side is usually log n + O(log log n). As for the right-hand side, since

    n ≥ pi^(ei) ≥ 2^(ei),

each ei is ≤ log n. Thus H(ei) ≤ log log n + O(log log log n). So for random n we have

    log n + O(log log n) ≤ k[log log n + O(log log log n)],

where k is the assumed finite number of primes. This last inequality is false for large n, as it assuredly is not the case that log n = O(log log n). Thus our initial assumption that there are only k primes is refuted, and there must in fact be infinitely many primes.
This proof is merely a formalization of the observation that if there
were only finitely many primes, the prime factorization of a number
would usually be a much more compact representation for it than its
base-two numeral, which is absurd. This proof appears, formulated as
a counting argument, in Section 2.6 of the 1938 edition of Hardy and Wright [20]; we believe that it is also quite natural to present it in an
information-theoretic setting.
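(A back-of-the-envelope version of the observation, with hypothetical numbers of our choosing: if there were only k = 10 primes, a typical n around 2^1000 would be specified by 10 exponents of about 10 bits each, roughly 100 bits in all, whereas such an n needs about 1000 bits.)

    import math

    k = 10            # supposed finite number of primes (for contradiction)
    n_bits = 1000     # consider typical n around 2**1000

    exponent_bits = math.ceil(math.log2(n_bits))  # each e_i <= log n: ~10 bits
    factorization_bits = k * exponent_bits        # ~100 bits for the whole tuple
    print(factorization_bits, '<<', n_bits)       # absurd: most n are incompressible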

References
[1] L. E. Orgel, The Origins of Life: Molecules and Natural Selection, Wiley, New York, 1973, pp. 187-197.
[2] P. H. A. Sneath, Planets and Life, Funk and Wagnalls, New York, 1970, pp. 54-71.
[3] G. J. Chaitin, "Information-Theoretic Computational Complexity," IEEE Trans. Info. Theor. IT-20 (1974), pp. 10-15.
[4] G. J. Chaitin, "Randomness and Mathematical Proof," Sci. Amer. 232, No. 5 (May 1975), pp. 47-52.
[5] G. J. Chaitin, "To a Mathematical Definition of 'Life'," ACM SICACT News 4 (Jan. 1970), pp. 12-18.
[6] J. von Neumann, "The General and Logical Theory of Automata," John von Neumann: Collected Works, Volume V, A. H. Taub (ed.), Macmillan, New York, 1963, pp. 288-328.
[7] J. von Neumann, Theory of Self-Reproducing Automata, Univ. Illinois Press, Urbana, 1966, pp. 74-87; edited and completed by A. W. Burks.
[8] R. J. Solomonoff, "A Formal Theory of Inductive Inference," Info. & Contr. 7 (1964), pp. 1-22, 224-254.
[9] G. J. Chaitin and J. T. Schwartz, "A Note on Monte Carlo Primality Tests and Algorithmic Information Theory," Comm. Pure & Appl. Math., to appear.
[10] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, Univ. Illinois Press, Urbana, 1949.
[11] H. A. Simon, The Sciences of the Artificial, MIT Press, Cambridge, MA, 1969, pp. 90-97, 114-117.
[12] J. von Neumann, The Computer and the Brain, Silliman Lectures Series, Yale Univ. Press, New Haven, CT, 1958.
[13] C. Sagan, The Dragons of Eden: Speculations on the Evolution of Human Intelligence, Random House, New York, 1977, pp. 19-47.
[14] G. J. Chaitin, "A Theory of Program Size Formally Identical to Information Theory," J. ACM 22 (1975), pp. 329-340.
[15] G. J. Chaitin, "Algorithmic Information Theory," IBM J. Res. Develop. 21 (1977), pp. 350-359, 496.
[16] R. M. Solovay, "On Random R.E. Sets," Non-Classical Logics, Model Theory, and Computability, A. I. Arruda, N. C. A. da Costa, and R. Chuaqui (eds.), North-Holland, Amsterdam, 1977, pp. 283-307.
[17] P. Gács and J. Körner, "Common Information Is Far Less Than Mutual Information," Prob. Contr. & Info. Theor. 2, No. 2 (1973), pp. 149-162.
[18] A. D. Wyner, "The Common Information of Two Dependent Random Variables," IEEE Trans. Info. Theor. IT-21 (1975), pp. 163-179.
[19] H. S. Witsenhausen, "Values and Bounds for the Common Information of Two Discrete Random Variables," SIAM J. Appl. Math. 31 (1976), pp. 313-333.
[20] G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers, Clarendon Press, Oxford, 1962.
[21] G. H. Hardy, A Mathematician's Apology, Cambridge University Press, 1967.
[22] G. H. Hardy, Ramanujan: Twelve Lectures on Subjects Suggested by His Life and Work, Chelsea, New York, 1959.
[23] H. Rademacher and O. Toeplitz, The Enjoyment of Mathematics, Princeton University Press, 1957.
[24] P. Billingsley, "The Probability Theory of Additive Arithmetic Functions," Ann. of Prob. 2 (1974), pp. 749-791.
[25] A. W. Burks (ed.), Essays on Cellular Automata, Univ. Illinois Press, Urbana, 1970.
[26] M. Eigen, "The Origin of Biological Information," The Physicist's Conception of Nature, J. Mehra (ed.), D. Reidel Publishing Co., Dordrecht-Holland, 1973, pp. 594-632.
[27] R. Landauer, "Fundamental Limitations in the Computational Process," Ber. Bunsenges. Physik. Chem. 80 (1976), pp. 1048-1059.
[28] H. P. Yockey, "A Calculation of the Probability of Spontaneous Biogenesis by Information Theory," J. Theor. Biol. 67 (1977), pp. 377-398.
Part IV
Technical Papers on
Self-Delimiting Programs

A THEORY OF PROGRAM
SIZE FORMALLY
IDENTICAL TO
INFORMATION THEORY
Journal of the ACM 22 (1975),
pp. 329-340

Gregory J. Chaitin¹
IBM Thomas J. Watson Research Center
Yorktown Heights, New York

Abstract
A new definition of program-size complexity is made. H(A, B/C, D) is defined to be the size in bits of the shortest self-delimiting program for calculating strings A and B if one is given a minimal-size self-delimiting program for calculating strings C and D. This differs from previous definitions: (1) programs are required to be self-delimiting, i.e. no program is a prefix of another, and (2) instead of being given C and D directly, one is given a program for calculating them that is minimal in size. Unlike previous definitions, this one has precisely the formal
properties of the entropy concept of information theory. For example, H(A, B) = H(A) + H(B/A) + O(1). Also, if a program of length k is assigned measure 2^(−k), then H(A) = −log2 (the probability that the standard universal computer will calculate A) + O(1).

Key Words and Phrases:


computational complexity, entropy, information theory, instantaneous
code, Kraft inequality, minimal program, probability theory, program
size, random string, recursive function theory, Turing machine

CR Categories:
5.25, 5.26, 5.27, 5.5, 5.6

1. Introduction
There is a persuasive analogy between the entropy concept of informa-
tion theory and the size of programs. This was realized by the rst
workers in the eld of program-size complexity, Solomono 1], Kol-
mogorov 2], and Chaitin 3,4], and it accounts for the large measure of
success of subsequent work in this area. However, it is often the case
that results are cumbersome and have unpleasant error terms. These
ideas cannot be a tool for general use until they are clothed in a pow-
erful formalism like that of information theory.
This opinion is apparently not shared by all workers in this eld
(see Kolmogorov 5]), but it has led others to formulate alternative
¹Copyright © 1975, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided
that ACM's copyright notice is given and that reference is made to the publica-
tion, to its date of issue, and to the fact that reprinting privileges were granted by
permission of the Association for Computing Machinery.
This paper was written while the author was a visitor at the IBM Thomas J. Wat-
son Research Center, Yorktown Heights, New York, and was presented at the IEEE
International Symposium on Information Theory, Notre Dame, Indiana, October
1974.
Author's present address: Rivadavia 3580, Dpto. 10A, Buenos Aires, Argentina.
definitions of program-size complexity, for example, Loveland's uniform complexity [6] and Schnorr's process complexity [7]. In this paper we present a new concept of program-size complexity. What train of thought led us to it?
Following [8, Sec. VI, p. 7], think of a computer as decoding equipment at the receiving end of a noiseless binary communications channel. Think of its programs as code words, and of the result of the computation as the decoded message. Then it is natural to require that the programs/code words form what is called an "instantaneous code," so that successive messages sent across the channel (e.g. subroutines) can be separated. Instantaneous codes are well understood by information theorists [9-12]; they are governed by the Kraft inequality, which therefore plays a fundamental role in this paper.
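(For reference, the Kraft inequality states that an instantaneous code with word lengths n1, n2, ... exists if and only if Σ 2^(−ni) ≤ 1. A quick sketch of ours checks the sum for candidate length multisets:)

    def kraft_sum(lengths):
        # a prefix-free code with these word lengths exists iff the sum is <= 1
        return sum(2.0 ** -n for n in lengths)

    print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> realized by {0, 10, 110, 111}
    print(kraft_sum([1, 1, 2]))     # 1.25 -> no instantaneous code possible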
One is thus led to define the relative complexity H(A, B/C, D) of A and B with respect to C and D to be the size of the shortest self-delimiting program for producing A and B from C and D. However, this is still not quite right. Guided by the analogy with information theory, one would like

    H(A, B) = H(A) + H(B/A) + Δ

to hold with an error term Δ bounded in absolute value. But, as is shown in the Appendix, |Δ| is unbounded. So we stipulate instead that H(A, B/C, D) is the size of the smallest self-delimiting program that produces A and B when it is given a minimal-size self-delimiting program for C and D. Then it can be shown that |Δ| is bounded.
In Sections 2-4 we define this new concept formally, establish the basic identities, and briefly consider the resulting concept of randomness or maximal entropy.
We recommend reading Willis [13]. In retrospect it is clear that he was aware of some of the basic ideas of this paper, though he developed them in a different direction. Chaitin's study [3,4] of the state complexity of Turing machines may be of interest, because in his formalism programs can also be concatenated. To compare the properties of our entropy function H with those it has in information theory, see [9-12]; to contrast its properties with those of previous definitions of program-size complexity, see [14]. Cover [15] and Gewirtz [16] use our new
definition. See [17-32] for other applications of information/entropy concepts.
2. Definitions
X = {Λ, 0, 1, 00, 01, 10, 11, 000, ...} is the set of finite binary strings, and X^∞ is the set of infinite binary strings. Henceforth we shall merely say "string" instead of "binary string," and a string will be understood to be finite unless the contrary is explicitly stated. X is ordered as indicated, and |s| is the length of the string s. The variables p, q, s, and t denote strings. The variables α and ω denote infinite strings. α_n is the prefix of α of length n. N = {0, 1, 2, ...} is the set of natural numbers. The variables c, i, j, k, m, and n denote natural numbers. R is the set of positive rationals. The variable r denotes an element of R. We write "r.e." instead of "recursively enumerable," "lg" instead of "log2," and sometimes "2↑(x)" instead of "2^x." #(S) is the cardinality of the set S.
Concrete Definition of a Computer. A computer C is a Turing machine with two tapes, a program tape and a work tape. The program tape is finite in length. Its leftmost square is called the dummy square and always contains a blank. Each of its remaining squares contains either a 0 or a 1. It is a read-only tape, and has one read head on it which can move only to the right. The work tape is two-way infinite and each of its squares contains either a 0, a 1, or a blank. It has one read-write head on it.
At the start of a computation the machine is in its initial state, the program p occupies the whole program tape except for the dummy square, and the read head is scanning the dummy square. The work tape is blank except for a single string q whose leftmost symbol is being scanned by the read-write head. Note that q can be equal to Λ. In that case the read-write head initially scans a blank square. p can also be equal to Λ. In that case the program tape consists solely of the dummy square. See Figure 1.
[Figure 1. The start of a computation: p = 0011010 and q = 1100.]
During each cycle of operation the machine may halt, move the read head of the program tape one square to the right, move the read-write head of the work tape one square to the left or to the right, erase the square of the work tape being scanned, or write a 0 or a 1 on the square of the work tape being scanned. Then the machine changes state. The action performed and the next state are both functions of the present state and the contents of the two squares being scanned, and are indicated in two finite tables with nine columns and as many rows as there are states.
If the Turing machine eventually halts with the read head of the program tape scanning its rightmost square, then the computation is a success. If not, the computation is a failure. C(p, q) denotes the result of the computation. If the computation is a failure, then C(p, q) is undefined. If it is a success, then C(p, q) is the string extending to the right from the square of the work tape that is being scanned to the first blank square. Note that C(p, q) = Λ if the square of the work tape being scanned is blank. See Figure 2.
[Figure 2. The end of a successful computation: C(p, q) = 010.]
Definition of an Instantaneous Code. An instantaneous code is a set of strings S with the property that no string in S is a prefix of another.
Abstract Definition of a Computer. A computer is a partial recursive function C: X × X → X with the property that for each q the domain of C(·, q) is an instantaneous code; i.e. if C(p, q) is defined and p is a proper prefix of p′, then C(p′, q) is not defined.
Theorem 2.1. The two definitions of a computer are equivalent.
Proof. Why does the concrete definition satisfy the abstract one? The program must indicate within itself where it ends, since the machine is not allowed to run off the end of the tape or to ignore part of the program. Thus no program for a successful computation is the prefix of another.
Why does the abstract definition satisfy the concrete one? We show how a concrete computer C can simulate an abstract computer C′. The idea is that C should read another square of its program tape only when it is sure that this is necessary.
Suppose C found the string q on its work tape. C then generates the r.e. set S = {p | C′(p, q) is defined} on its work tape.
As it generates S, C continually checks whether or not that part p of the program that it has already read is a prefix of some known element s of S. Note that initially p = Λ.
Whenever C finds that p is a prefix of an s ∈ S, it does the following. If p is a proper prefix of s, C reads another square of the program tape. And if p = s, C calculates C′(p, q) and halts, indicating this to be the result of the computation. Q.E.D.
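The loop performed by the concrete computer C can be sketched as follows (a hypothetical Python rendering; enumerate_domain(q) stands for an enumeration of the r.e. set S, and read_bit() for reading one more square of the program tape):

    def simulate(enumerate_domain, C_prime, q, read_bit):
        p = ""                      # the part of the program read so far
        known = set()               # elements of S generated so far
        for s in enumerate_domain(q):
            known.add(s)
            while True:
                match = next((w for w in known if w.startswith(p)), None)
                if match is None:
                    break           # no known extension: enumerate more of S
                if p == match:      # p itself is in S: safe to compute
                    return C_prime(p, q)
                p += read_bit()     # p is a proper prefix: safe to read on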
Definition of an Optimal Universal Computer. U is an optimal universal computer iff for each computer C there is a constant sim(C) with the following property: if C(p, q) is defined, then there is a p′ such that U(p′, q) = C(p, q) and |p′| ≤ |p| + sim(C).
Theorem 2.2. There is an optimal universal computer U.
Proof. U reads its program tape until it gets to the first 1. If U has read i 0's, it then simulates C_i, the ith computer (i.e. the computer with the ith pair of tables in a recursive enumeration of all possible pairs of defining tables), using the remainder of the program tape as the program for C_i. Thus if C_i(p, q) is defined, then U(0^i 1 p, q) = C_i(p, q). Hence U satisfies the definition of an optimal universal computer with sim(C_i) = i + 1. Q.E.D.
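The header 0^i 1 is self-delimiting, so U can parse its program tape from left to right without lookahead. A sketch of the parsing step (illustrative only):

    def parse_universal_program(bits):
        # A program for U has the form 0^i 1 p: count the leading 0's to
        # select the computer C_i, then pass the rest of the tape to C_i.
        i = 0
        while bits[i] == 0:
            i += 1
        return i, bits[i + 1:]   # (index i of the simulated computer, p)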
We somehow pick out a particular optimal universal computer U as
the standard one for use throughout the rest of this paper.
Definition of Canonical Programs, Complexities, and Probabilities.
(a) The canonical program. s* = min p (U(p, Λ) = s). I.e., s* is the first element in the ordered set X of all strings that is a program for U to calculate s.
(b) Complexities.
H_C(s) = min |p| (C(p, Λ) = s) (may be ∞),
H(s) = H_U(s),
H_C(s/t) = min |p| (C(p, t*) = s) (may be ∞),
H(s/t) = H_U(s/t).
(c) Probabilities.
P_C(s) = Σ 2^{-|p|} (C(p, Λ) = s),
P(s) = P_U(s),
P_C(s/t) = Σ 2^{-|p|} (C(p, t*) = s),
P(s/t) = P_U(s/t).
Remark on Nomenclature. There are two different sets of terminology for these concepts, one derived from computational complexity and the other from information theory. H(s) may be referred to as the information-theoretic or program-size complexity, and H(s/t) may be referred to as the relative information-theoretic or program-size complexity. Or H(s) and H(s/t) may be termed the algorithmic entropy and the conditional algorithmic entropy, respectively. Similarly, this field might be referred to as "information-theoretic complexity" or as "algorithmic information theory."
Remark on the Definition of Probabilities. There is a very intuitive way of looking at the definition of P_C. Change the definition of the computer C so that the program tape is infinite to the right, and remove the (now impossible) requirement for a computation to be successful that the rightmost square of the program tape is being scanned when C halts. Imagine each square of the program tape except for the dummy square to be filled with a 0 or a 1 by a separate toss of a fair coin. Then the probability that the result s is obtained when the work tape is initially blank is P_C(s), and the probability that the result s is obtained when the work tape initially has t* on it is P_C(s/t).
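Under this reading, P_C(s) can even be estimated empirically. The sketch below (hypothetical; run(tape) stands for running the modified computer on a coin-tossed program tape with a blank work tape, returning the result or None) is a Monte Carlo version of the definition:

    import random

    def estimate_P(run, s, trials=100000, max_bits=64):
        # Fill the program tape by fair coin tosses and count how often
        # the result of the computation is s.
        hits = 0
        for _ in range(trials):
            tape = [random.randint(0, 1) for _ in range(max_bits)]
            if run(tape) == s:
                hits += 1
        return hits / trials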
Theorem 2.3.
(a) H(s) ≤ H_C(s) + sim(C),
(b) H(s/t) ≤ H_C(s/t) + sim(C),
(c) s* ≠ Λ,
(d) s = U(s*, Λ),
(e) H(s) = |s*|,
(f) H(s) ≠ ∞,
(g) H(s/t) ≠ ∞,
(h) 0 ≤ P_C(s) ≤ 1,
(i) 0 ≤ P_C(s/t) ≤ 1,
(j) 1 ≥ Σ_s P_C(s),
(k) 1 ≥ Σ_s P_C(s/t),
(l) P_C(s) ≥ 2^{-H_C(s)},
(m) P_C(s/t) ≥ 2^{-H_C(s/t)},
(n) 0 < P(s) < 1,
(o) 0 < P(s/t) < 1,
(p) #({s | H_C(s) < n}) < 2^n,
(q) #({s | H_C(s/t) < n}) < 2^n,
(r) #({s | P_C(s) > r}) < 1/r,
(s) #({s | P_C(s/t) > r}) < 1/r.
Proof. These are immediate consequences of the definitions. Q.E.D.
Definition of Tuples of Strings. Somehow pick out a particular recursive bijection b: X × X → X for use throughout the rest of this paper. The 1-tuple ⟨s1⟩ is defined to be the string s1. For n ≥ 2 the n-tuple ⟨s1, ..., sn⟩ is defined to be the string b(⟨s1, ..., s(n-1)⟩, sn).
Extensions of the Previous Concepts to Tuples of Strings. (n ≥ 1, m ≥ 1).
• H_C(s1, ..., sn) = H_C(⟨s1, ..., sn⟩),
• H_C(s1, ..., sn / t1, ..., tm) = H_C(⟨s1, ..., sn⟩ / ⟨t1, ..., tm⟩),
• H(s1, ..., sn) = H_U(s1, ..., sn),
• H(s1, ..., sn / t1, ..., tm) = H_U(s1, ..., sn / t1, ..., tm),
• P_C(s1, ..., sn) = P_C(⟨s1, ..., sn⟩),
• P_C(s1, ..., sn / t1, ..., tm) = P_C(⟨s1, ..., sn⟩ / ⟨t1, ..., tm⟩),
• P(s1, ..., sn) = P_U(s1, ..., sn),
• P(s1, ..., sn / t1, ..., tm) = P_U(s1, ..., sn / t1, ..., tm).
Definition of the Information in One Tuple of Strings About Another. (n ≥ 1, m ≥ 1).
• I_C(s1, ..., sn : t1, ..., tm) = H_C(t1, ..., tm) − H_C(t1, ..., tm / s1, ..., sn),
• I(s1, ..., sn : t1, ..., tm) = I_U(s1, ..., sn : t1, ..., tm).
Extensions of the Previous Concepts to Natural Numbers. We have defined H, P, and I for tuples of strings. This is now extended to tuples each of whose elements may either be a string or a natural number. We do this by identifying the natural number n with the nth string (n = 0, 1, 2, ...). Thus, for example, "H(n)" signifies "H(the nth element of X)," and "U(p, Λ) = n" stands for "U(p, Λ) = the nth element of X."
3. Basic Identities
This section has two objectives. The first is to show that H and I satisfy the fundamental inequalities and identities of information theory to within error terms of the order of unity. For example, the information in s about t is nearly symmetrical. The second objective is to show that P is approximately a conditional probability measure: P(t/s) and P(s, t)/P(s) are within a constant multiplicative factor of each other.
The following notation is convenient for expressing these approximate relationships. O(1) denotes a function whose absolute value is less than or equal to c for all values of its arguments. And f ≃ g means that the functions f and g satisfy the inequalities cf ≥ g and cg ≥ f for all values of their arguments. In both cases c ∈ N is an unspecified constant.
Theorem 3.1.
(a) H(s, t) = H(t, s) + O(1),
(b) H(s/s) = O(1),
(c) H(H(s)/s) = O(1),
(d) H(s) ≤ H(s, t) + O(1),
(e) H(s/t) ≤ H(s) + O(1),
(f) H(s, t) ≤ H(s) + H(t/s) + O(1),
(g) H(s, t) ≤ H(s) + H(t) + O(1),
(h) I(s : t) ≥ O(1),
(i) I(s : t) ≤ H(s) + H(t) − H(s, t) + O(1),
(j) I(s : s) = H(s) + O(1),
(k) I(Λ : s) = O(1),
(l) I(s : Λ) = O(1).
Proof. These are easy consequences of the definitions. The proof of Theorem 3.1(f) is especially interesting, and is given in full below. Also, note that Theorem 3.1(g) follows immediately from Theorem 3.1(f,e), and Theorem 3.1(i) follows immediately from Theorem 3.1(f) and the definition of I.
Now for the proof of Theorem 3.1(f). We claim that there is a computer C with the following property. If U(p, s*) = t and |p| = H(t/s) (i.e. if p is a minimal-size program for calculating t from s), then C(s*p, Λ) = ⟨s, t⟩. By using Theorem 2.3(e,a) we see that H_C(s, t) ≤ |s*p| = |s*| + |p| = H(s) + H(t/s), and H(s, t) ≤ H_C(s, t) + sim(C) ≤ H(s) + H(t/s) + O(1).
It remains to verify the claim that there is such a computer. C does the following when it is given the program s*p on its program tape and the string Λ on its work tape. First it simulates the computation that U performs when given the same program and work tapes. In this manner C reads the program s* and calculates s. Then it simulates the computation that U performs when given s* on its work tape and the remaining portion of C's program tape. In this manner C reads the program p and calculates t from s*. The entire program tape has now been read, and both s and t have been calculated. C finally forms the pair ⟨s, t⟩ and halts, indicating this to be the result of the computation. Q.E.D.
Remark. The rest of this section is devoted to showing that the "≤" in Theorem 3.1(f) and 3.1(i) can be replaced by "=." The arguments used to do this are more probabilistic than information-theoretic in nature.
Theorem 3.2. (Extension of the Kraft inequality condition for the existence of an instantaneous code).
Hypothesis. Consider an effectively given list of finitely or infinitely many "requirements" ⟨s_k, n_k⟩ (k = 0, 1, 2, ...) for the construction of a
computer. The requirements are said to be "consistent" if 1 ≥ Σ_k 2^{-n_k}, and we assume that they are consistent. Each requirement ⟨s_k, n_k⟩ requests that a program of length n_k be "assigned" to the result s_k. A computer C is said to "satisfy" the requirements if there are precisely as many programs p of length n such that C(p, Λ) = s as there are pairs ⟨s, n⟩ in the list of requirements. Such a C must have the property that P_C(s) = Σ_{s_k = s} 2^{-n_k} and H_C(s) = min_{s_k = s} n_k.
Conclusion. There are computers that satisfy these requirements. Moreover, if we are given the requirements one by one, then we can simulate a computer that satisfies them. Hereafter we refer to the particular computer that the proof of this theorem shows how to simulate as the one that is "determined" by the requirements.
Proof.
(a) First we give what we claim is the (abstract) definition of a particular computer C that satisfies the requirements. In the second part of the proof we justify this claim.
As we are given the requirements, we assign programs to results. Initially all programs for C are available. When we are given the requirement ⟨s_k, n_k⟩ we assign the first available program of length n_k to the result s_k (first in the ordering which X was defined to have in Section 2). As each program is assigned, it and all its prefixes and extensions become unavailable for future assignments. Note that a result can have many programs assigned to it (of the same or different lengths) if there are many requirements involving it.
How can we simulate C? As we are given the requirements, we make the above assignments, and we simulate C by using the technique that was given in the proof of Theorem 2.1 for a concrete computer to simulate an abstract one.
(b) Now to justify the claim. We must show that the above rule for making assignments never fails, i.e. we must show that it is never the case that all programs of the requested length are unavailable. The proof we sketch is due to N. J. Pippenger.
A geometrical interpretation is necessary. Consider the unit interval [0, 1). The kth program of length n (0 ≤ k < 2^n) corresponds to the interval [k 2^{-n}, (k + 1) 2^{-n}). Assigning a program corresponds to assigning all the points in its interval. The condition that the set of assigned programs must be an instantaneous code corresponds to the rule that an interval is available for assignment iff no point in it has already been assigned. The rule we gave above for making assignments is to assign that interval [k 2^{-n}, (k + 1) 2^{-n}) of the requested length 2^{-n} that is available that has the smallest possible k. Using this rule for making assignments gives rise to the following fact.
Fact. The set of those points in [0, 1) that are unassigned can always be expressed as the union of a finite number of intervals [k_i 2^{-n_i}, (k_i + 1) 2^{-n_i}) with the following properties: n_i > n_{i+1}, and
(k_i + 1) 2^{-n_i} ≤ k_{i+1} 2^{-n_{i+1}}.
I.e., these intervals are disjoint, their lengths are distinct powers of 2, and they appear in [0, 1) in order of increasing length.
We leave to the reader the verification that this fact is always the case and that it implies that an assignment is impossible only if the interval requested is longer than the total length of the unassigned part of [0, 1), i.e. only if the requirements are inconsistent. Q.E.D.
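The assignment rule is in effect first-fit allocation of dyadic subintervals of [0, 1); thanks to the Fact above, keeping the free space as a list of disjoint dyadic intervals suffices. The following Python sketch (an illustration of the rule, not the formal construction) returns an n_k-bit program for each requirement or reports inconsistency:

    def assign_programs(requirements):
        # requirements: pairs (s_k, n_k); a dyadic interval
        # [k/2^n, (k+1)/2^n) is encoded as the pair (k, n).
        free = [(0, 0)]                       # initially all of [0, 1)
        assigned = []
        for s, n in requirements:
            big_enough = [(k, m) for (k, m) in free if m <= n]
            if not big_enough:
                raise ValueError("requirements are inconsistent")
            # first fit: the big-enough free interval with least left endpoint
            k, m = min(big_enough, key=lambda km: km[0] * 2 ** (n - km[1]))
            free.remove((k, m))
            k *= 2 ** (n - m)                 # leftmost piece of length 2^-n
            assigned.append((s, format(k, "0%db" % n)))
            # return the unused remainder of the split interval to free
            free.extend((k // 2 ** (n - j) + 1, j) for j in range(m + 1, n + 1))
        return assigned

The programs produced are the binary expansions of the left endpoints of the assigned intervals, and since the intervals are disjoint, no program is a prefix of another.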
Theorem 3.3. (Recursive "estimates" for H_C and P_C). Consider a computer C.
(a) The set of all true propositions of the form "H_C(s) ≤ n" is r.e. Given t* one can recursively enumerate the set of all true propositions of the form "H_C(s/t) ≤ n."
(b) The set of all true propositions of the form "P_C(s) > r" is r.e. Given t* one can recursively enumerate the set of all true propositions of the form "P_C(s/t) > r."
Proof. This is an easy consequence of the fact that the domain of C is an r.e. set. Q.E.D.
Remark. The set of all true propositions of the form "H(s/t) ≤ n" is not r.e.; for if it were r.e., it would easily follow from Theorems 3.1(c) and 2.3(q) that Theorem 5.1(f) is false, which is a contradiction.
Theorem 3.4. For each computer C there is a constant c such that
(a) H(s) ≤ −lg P_C(s) + c,
(b) H(s/t) ≤ −lg P_C(s/t) + c.
Proof. It follows from Theorem 3.3(b) that the set T of all true propositions of the form "P_C(s) > 2^{-n}" is r.e., and that given t* one can recursively enumerate the set T_t of all true propositions of the form "P_C(s/t) > 2^{-n}." This will enable us to use Theorem 3.2 to show that there is a computer C′ with these properties:
H_{C′}(s) = ⌈−lg P_C(s)⌉ + 1 and P_{C′}(s) = 2^{−⌈−lg P_C(s)⌉},    (1)
H_{C′}(s/t) = ⌈−lg P_C(s/t)⌉ + 1 and P_{C′}(s/t) = 2^{−⌈−lg P_C(s/t)⌉}.    (2)
Here ⌈x⌉ denotes the least integer greater than x. By applying Theorem 2.3(a,b) to (1) and (2), we see that Theorem 3.4 holds with c = sim(C′) + 2.
How does the computer C′ work? First of all, it checks whether it has been given Λ or t* on its work tape. These two cases can be distinguished, for by Theorem 2.3(c) it is impossible for t* to be equal to Λ.
(a) If C′ has been given Λ on its work tape, it enumerates T and simulates the computer determined by all requirements of the form
⟨s, n + 1⟩  ("P_C(s) > 2^{-n}" ∈ T).    (3)
Thus ⟨s, n⟩ is taken as a requirement iff n ≥ ⌈−lg P_C(s)⌉ + 1. Hence the number of programs p of length n such that C′(p, Λ) = s is 1 if n ≥ ⌈−lg P_C(s)⌉ + 1 and is 0 otherwise, which immediately yields (1).
However, we must check that the requirements (3) are consistent. Σ 2^{-|p|} (over all programs p we wish to assign to the result s) = 2^{−⌈−lg P_C(s)⌉} < P_C(s). Hence Σ 2^{-|p|} (over all p we wish to assign) < Σ_s P_C(s) ≤ 1 by Theorem 2.3(j). Thus the hypothesis of Theorem 3.2 is satisfied, the requirements (3) indeed determine a computer, and the proof of (1) and Theorem 3.4(a) is complete.
(b) If C′ has been given t* on its work tape, it enumerates T_t and simulates the computer determined by all requirements of the form
⟨s, n + 1⟩  ("P_C(s/t) > 2^{-n}" ∈ T_t).    (4)
Thus ⟨s, n⟩ is taken as a requirement iff n ≥ ⌈−lg P_C(s/t)⌉ + 1. Hence the number of programs p of length n such that C′(p, t*) = s is 1 if n ≥ ⌈−lg P_C(s/t)⌉ + 1 and is 0 otherwise, which immediately yields (2).
However, we must check that the requirements (4) are consistent. Σ 2^{-|p|} (over all programs p we wish to assign to the result s) = 2^{−⌈−lg P_C(s/t)⌉} < P_C(s/t). Hence Σ 2^{-|p|} (over all p we wish to assign) < Σ_s P_C(s/t) ≤ 1 by Theorem 2.3(k). Thus the hypothesis of Theorem 3.2 is satisfied, the requirements (4) indeed determine a computer, and the proof of (2) and Theorem 3.4(b) is complete. Q.E.D.
Theorem 3.5.
(a) For each computer C there is a constant c such that P(s) ≥ 2^{-c} P_C(s), P(s/t) ≥ 2^{-c} P_C(s/t).
(b) H(s) = −lg P(s) + O(1), H(s/t) = −lg P(s/t) + O(1).
Proof. Theorem 3.5(a) follows immediately from Theorem 3.4 using the fact that P(s) ≥ 2^{-H(s)} and P(s/t) ≥ 2^{-H(s/t)} (Theorem 2.3(l,m)). Theorem 3.5(b) is obtained by taking C = U in Theorem 3.4 and also using these two inequalities. Q.E.D.
Remark. Theorem 3.4(a) extends Theorem 2.3(a,b) to probabilities. Note that Theorem 3.5(a) is not an immediate consequence of our weak definition of an optimal universal computer.
Theorem 3.5(b) enables one to reformulate results about H as results concerning P, and vice versa; it is the first member of a trio of formulas that will be completed with Theorem 3.9(e,f). These formulas are closely analogous to expressions in information theory for the information content of individual events or symbols [10, Secs. 2.3, 2.6, pp. 27-28, 34-37].
Theorem 3.6.
(a) #({p | U(p, Λ) = s & |p| ≤ H(s) + n}) ≤ 2^{n + O(1)}.
(b) #({p | U(p, t*) = s & |p| ≤ H(s/t) + n}) ≤ 2^{n + O(1)}.
Proof. This follows immediately from Theorem 3.5(b). Q.E.D.
Theorem 3.7. P(s) ≃ Σ_t P(s, t).
Proof. On the one hand, there is a computer C such that C(p, Λ) = s if U(p, Λ) = ⟨s, t⟩. Thus P_C(s) ≥ Σ_t P(s, t). Using Theorem 3.5(a), we see that P(s) ≥ 2^{-c} Σ_t P(s, t).
On the other hand, there is a computer C such that C(p, Λ) = ⟨s, s⟩ if U(p, Λ) = s. Thus Σ_t P_C(s, t) ≥ P_C(s, s) ≥ P(s). Using Theorem 3.5(a), we see that Σ_t P(s, t) ≥ 2^{-c} P(s). Q.E.D.
Theorem 3.8. There is a computer C and a constant c such that
H_C(t/s) = H(s, t) − H(s) + c.
Proof. The set of all programs p such that U(p, Λ) is defined is r.e. Let p_k be the kth program in a particular recursive enumeration of this set, and define s_k and t_k by ⟨s_k, t_k⟩ = U(p_k, Λ). By Theorems 3.7 and 3.5(b) there is a c such that 2^{H(s) − c} Σ_t P(s, t) ≤ 1 for all s. Given s* on its work tape, C simulates the computer C_s determined by the requirements ⟨t_k, |p_k| − |s*| + c⟩ for k = 0, 1, 2, ... such that s_k = U(s*, Λ). Recall Theorem 2.3(d,e). Thus for each p such that U(p, Λ) = ⟨s, t⟩ there is a corresponding p′ such that C(p′, s*) = C_s(p′, Λ) = t and |p′| = |p| − H(s) + c. Hence
H_C(t/s) = H(s, t) − H(s) + c.
However, we must check that the requirements for C_s are consistent. Σ 2^{-|p′|} (over all programs p′ we wish to assign to any result t) = Σ 2^{-|p| + H(s) − c} (over all p such that U(p, Λ) = ⟨s, t⟩) = 2^{H(s) − c} Σ_t P(s, t) ≤ 1 because of the way c was chosen. Thus the hypothesis of Theorem 3.2 is satisfied, and these requirements indeed determine C_s. Q.E.D.
Theorem 3.9.
(a) H(s, t) = H(s) + H(t/s) + O(1),
(b) I(s : t) = H(s) + H(t) − H(s, t) + O(1),
(c) I(s : t) = I(t : s) + O(1),
(d) P(t/s) ≃ P(s, t)/P(s),
(e) H(t/s) = lg P(s)/P(s, t) + O(1),
(f) I(s : t) = lg P(s, t)/P(s)P(t) + O(1).
Proof.
(a) Theorem 3.9(a) follows immediately from Theorems 3.8, 2.3(b), and 3.1(f).
(b) Theorem 3.9(b) follows immediately from Theorem 3.9(a) and the definition of I(s : t).
(c) Theorem 3.9(c) follows immediately from Theorems 3.9(b) and 3.1(a).
(d,e) Theorem 3.9(d,e) follows immediately from Theorems 3.9(a) and 3.5(b).
(f) Theorem 3.9(f) follows immediately from Theorems 3.9(b) and 3.5(b). Q.E.D.
Remark. We thus have at our disposal essentially the entire formalism of information theory. Results such as these can now be obtained effortlessly:
• H(s1) ≤ H(s1/s2) + H(s2/s3) + H(s3/s4) + H(s4) + O(1),
• H(s1, s2, s3, s4) = H(s1/s2, s3, s4) + H(s2/s3, s4) + H(s3/s4) + H(s4) + O(1).
However, there is an interesting class of identities satisfied by our H function that has no parallel in information theory. The simplest of these is H(H(s)/s) = O(1) (Theorem 3.1(c)), which with Theorem 3.9(a) immediately yields H(s, H(s)) = H(s) + O(1). This is just one pair of a large family of identities, as we now proceed to show.
Keeping Theorem 3.9(a) in mind, consider modifying the computer C used in the proof of Theorem 3.1(f) so that it also measures the lengths H(s) and H(t/s) of its subroutines s* and p, and halts indicating ⟨s, t, H(s), H(t/s)⟩ to be the result of the computation instead of ⟨s, t⟩. It follows that H(s, t) = H(s, t, H(s), H(t/s)) + O(1) and H(H(s), H(t/s)/s, t) = O(1). In fact, it is easy to see that
H(H(s), H(t), H(t/s), H(s/t), H(s, t)/s, t) = O(1),
which implies H(I(s : t)/s, t) = O(1). And of course these identities generalize to tuples of three or more strings.
4. A Random Infinite String
The undecidability of the halting problem is a fundamental theorem of recursive function theory. In algorithmic information theory the corresponding theorem is as follows: The base-two representation of the probability that U halts is a random (i.e. maximally complex) infinite string. In this section we formulate this statement precisely and prove it.
Theorem 4.1. (Bounds on the complexity of natural numbers).
(a) Σ_n 2^{-H(n)} ≤ 1.
Consider a recursive function f: N → N.
(b) If Σ_n 2^{-f(n)} diverges, then H(n) > f(n) infinitely often.
(c) If Σ_n 2^{-f(n)} converges, then H(n) ≤ f(n) + O(1).
Proof.
(a) By Theorem 2.3(l,j), Σ_n 2^{-H(n)} ≤ Σ_n P(n) ≤ 1.
(b) If Σ_n 2^{-f(n)} diverges, and H(n) ≤ f(n) held for all but finitely many values of n, then Σ_n 2^{-H(n)} would also diverge. But this would contradict Theorem 4.1(a), and thus H(n) > f(n) infinitely often.
(c) If Σ_n 2^{-f(n)} converges, there is an n0 such that Σ_{n ≥ n0} 2^{-f(n)} ≤ 1. By Theorem 3.2 there is a computer C determined by the requirements ⟨n, f(n)⟩ (n ≥ n0). Thus H(n) ≤ f(n) + sim(C) for all n ≥ n0. Q.E.D.
Theorem 4.2. (Maximal complexity of finite and infinite strings).
(a) max H(s) (|s| = n) = n + H(n) + O(1).
(b) #({s : |s| = n & H(s) < n + H(n) − k}) ≤ 2^{n − k + O(1)}.
(c) Imagine that the infinite string α is generated by tossing a fair coin once for each of its bits. Then, with probability one, H(α_n) > n for all but finitely many n.
Proof.
(a,b) Consider a string s of length n. By Theorem 3.9(a), H(s) = H(n, s) + O(1) = H(n) + H(s/n) + O(1). We now obtain Theorem 4.2(a,b) from this estimate for H(s).
There is a computer C such that C(p, |p|*) = p for all p. Thus H(s/n) ≤ n + sim(C), and H(s) ≤ n + H(n) + O(1). On the other hand, by Theorem 2.3(q), fewer than 2^{n−k} of the s satisfy H(s/n) < n − k. Hence fewer than 2^{n−k} of the s satisfy H(s) < n − k + H(n) + O(1). Thus we have obtained Theorem 4.2(a,b).
(c) Now for the proof of Theorem 4.2(c). By Theorem 4.2(b), at most a fraction of 2^{−H(n)+c} of the strings s of length n satisfy H(s) ≤ n. Thus the probability that α satisfies H(α_n) ≤ n is ≤ 2^{−H(n)+c}. By Theorem 4.1(a), Σ_n 2^{−H(n)+c} converges. Invoking the Borel-Cantelli lemma, we obtain Theorem 4.2(c). Q.E.D.
Definition of Randomness. A string s is random iff H(s) is approximately equal to |s| + H(|s|). An infinite string α is random iff ∃c ∀n H(α_n) > n − c.
Remark. In the case of infinite strings there is a sharp distinction between randomness and nonrandomness. In the case of finite strings it is a matter of degree. To the question "How random is s?" one must reply indicating how close H(s) is to |s| + H(|s|).
C. P. Schnorr (private communication) has shown that this complexity-based definition of a random infinite string and P. Martin-Löf's statistical definition of this concept [7, pp. 379-380] are equivalent.
Definition of Base-Two Representations. The base-two representation of a real number x ∈ (0, 1] is that unique string b1 b2 b3 ... with infinitely many 1's such that x = Σ_k b_k 2^{-k}.
Definition of the Probability ω that U Halts. ω = Σ_s P(s) = Σ 2^{-|p|} (U(p, Λ) is defined).
By Theorem 2.3(j,n), ω ∈ (0, 1]. Therefore the real number ω has a base-two representation. Henceforth ω denotes both the real number and its base-two representation. Similarly, ω_n denotes a string of length n and a rational number m/2^n with the property that ω > ω_n and ω − ω_n ≤ 2^{-n}.
Theorem 4.3. (Construction of a random infinite string).
(a) There is a recursive function w: N → R such that w(n) ≤ w(n + 1) and ω = lim_{n→∞} w(n).
(b) ω is random.
(c) There is a recursive predicate D: N × N × N → {true, false} such that the kth bit of ω is a 1 iff ∃i ∀j D(i, j, k) (k = 0, 1, 2, ...).
Proof.
(a) {p | U(p, Λ) is defined} is r.e. Let p_k (k = 0, 1, 2, ...) denote the kth p in a particular recursive enumeration of this set. Let w(n) = Σ_{k ≤ n} 2^{-|p_k|}. w(n) tends monotonically to ω from below, which proves Theorem 4.3(a).
(b) In view of the fact that ω > ω_n ≥ ω − 2^{-n} (see the definition of ω), if one is given ω_n one can find an m such that ω ≥ w(m) > ω_n ≥ ω − 2^{-n}. Thus ω − w(m) < 2^{-n}, and {p_k | k ≤ m} contains all programs p of length less than or equal to n such that U(p, Λ) is defined. Hence {U(p_k, Λ) | k ≤ m & |p_k| ≤ n} = {s | H(s) ≤ n}. It follows there is a computer C with the property that if U(p, Λ) = ω_n, then C(p, Λ) equals the first string s such that H(s) > n. Thus n < H(s) ≤ H(ω_n) + sim(C), which proves Theorem 4.3(b).
(c) To prove Theorem 4.3(c), define D as follows: D(i, j, k) iff j ≥ i implies the kth bit of the base-two representation of w(j) is a 1. Q.E.D.
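For part (a), the approximations w(n) can be written down directly once one has an enumeration of the halting programs; the following sketch (with that enumeration abstracted as a hypothetical generator halting_programs()) computes w(n) as an exact rational:

    from fractions import Fraction

    def w(n, halting_programs):
        # halting_programs() generates p_0, p_1, p_2, ..., a recursive
        # enumeration of {p | U(p, Λ) is defined}; each p is a bit string.
        total = Fraction(0)
        programs = halting_programs()
        for k in range(n + 1):
            total += Fraction(1, 2 ** len(next(programs)))
        return total   # w(n) increases monotonically to ω from below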
5. Appendix. The Traditional Concept of Relative Complexity
In this Appendix programs are required to be self-delimiting, but the relative complexity H(s/t) of s with respect to t will now mean that one is directly given t, instead of being given a minimal-size program for t.
The standard optimal universal computer U remains the same as before. H and P are redefined as follows:
• H_C(s/t) = min |p| (C(p, t) = s) (may be ∞),
• H_C(s) = H_C(s/Λ),
• H(s/t) = H_U(s/t),
• H(s) = H_U(s),
• P_C(s/t) = Σ 2^{-|p|} (C(p, t) = s),
• P_C(s) = P_C(s/Λ),
• P(s/t) = P_U(s/t),
• P(s) = P_U(s).
These concepts are extended to tuples of strings and natural numbers as before. Finally, Δ(s, t) is defined as follows:
• H(s, t) = H(s) + H(t/s) + Δ(s, t).
Theorem 5.1.
(a) H(s, H(s)) = H(s) + O(1),
(b) H(s, t) = H(s) + H(t/s, H(s)) + O(1),
(c) −H(H(s)/s) − O(1) ≤ Δ(s, t) ≤ O(1),
(d) Δ(s, s) = O(1),
(e) Δ(s, H(s)) = −H(H(s)/s) + O(1),
(f) H(H(s)/s) ≠ O(1).
Proof.
(a) On the one hand, H(s, H(s)) ≤ H(s) + c because a minimal-size program for s also tells one its length H(s), i.e. because there is a computer C such that C(p, Λ) = ⟨U(p, Λ), |p|⟩ if U(p, Λ) is defined. On the other hand, obviously H(s) ≤ H(s, H(s)) + c.
(b) On the one hand, H(s, t) ≤ H(s) + H(t/s, H(s)) + c follows from Theorem 5.1(a) and the obvious inequality H(s, t) ≤ H(s, H(s)) + H(t/s, H(s)) + c. On the other hand, H(s, t) ≥ H(s) + H(t/s, H(s)) − c follows from the inequality H(t/s, H(s)) ≤ H(s, t) − H(s) + c analogous to Theorem 3.8 and obtained by adapting the methods of Section 3 to the present setting.
(c) This follows from Theorem 5.1(b) and the obvious inequality H(t/s, H(s)) − c ≤ H(t/s) ≤ H(H(s)/s) + H(t/s, H(s)) + c.
(d) If t = s, H(s, t) − H(s) − H(t/s) = H(s, s) − H(s) − H(s/s) = H(s) − H(s) + O(1) = O(1), for obviously H(s, s) = H(s) + O(1) and H(s/s) = O(1).
(e) If t = H(s), H(s, t) − H(s) − H(t/s) = H(s, H(s)) − H(s) − H(H(s)/s) = −H(H(s)/s) + O(1) by Theorem 5.1(a).
(f) The proof is by reductio ad absurdum. Suppose on the contrary that H(H(s)/s) < c for all s. First we adapt an idea of A. R. Meyer and D. W. Loveland [6, pp. 525-526] to show that there is a partial recursive function f: X → N with the property that if f(s) is defined it is equal to H(s), and this occurs for infinitely many values of s. Then we obtain the desired contradiction by showing that such a function f cannot exist.
Consider the set K_s of all natural numbers k such that H(k/s) < c and H(s) ≤ k. Note that min K_s = H(s), #(K_s) < 2^c, and given s one can recursively enumerate K_s. Also, given s and #(K_s) one can recursively enumerate K_s until one finds all its elements, and, in particular, its smallest element, which is H(s). Let m = lim sup #(K_s), and let n be such that |s| ≥ n implies #(K_s) ≤ m.
Knowing m and n one calculates f(s) as follows. First one checks if |s| < n. If so, f(s) is undefined. If not, one recursively enumerates K_s until m of its elements are found. Because of the way n was chosen, K_s cannot have more than m elements. If it has less than m, one never finishes searching for m of them, and so f(s) is undefined. However, if #(K_s) = m, which occurs for infinitely many values of s, then one eventually realizes all of them have been found, including f(s) = min K_s = H(s). Thus f(s) is defined and equal to H(s) for infinitely many values of s.
It remains to show that such an f is impossible. As the length of s increases, H(s) tends to infinity, and so f is unbounded. Thus given n and H(n) one can calculate a string s_n such that H(n) + n < f(s_n) = H(s_n), and so H(s_n/n, H(n)) is bounded. Using Theorem 5.1(b) we obtain H(n) + n < H(s_n) ≤ H(n, s_n) + c′ ≤ H(n) + H(s_n/n, H(n)) + c′′ ≤ H(n) + c′′′, which is impossible for n ≥ c′′′′. Thus f cannot exist, and our initial assumption that H(H(s)/s) < c for all s must be false. Q.E.D.
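The search procedure used to define f can be sketched as follows (hypothetical interface: enumerate_K(s) generates the elements of K_s, possibly without ever terminating):

    def f(s, m, n, enumerate_K):
        # m = lim sup #(K_s); n is such that |s| >= n implies #(K_s) <= m.
        if len(s) < n:
            return None              # f(s) is undefined
        found = set()
        elements = enumerate_K(s)
        while len(found) < m:        # never finishes if #(K_s) < m
            found.add(next(elements))
        return min(found)            # = min K_s = H(s) when #(K_s) = m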
Remark. Theorem 5.1 makes it clear that the fact that H(H(s)/s) is unbounded implies that H(t/s) is less convenient to use than H(t/s, H(s)). In fact, R. Solovay (private communication) has announced that max H(H(s)/s) taken over all strings s of length n is asymptotic to lg n. The definition of the relative complexity of s with respect to t given in Section 2 is equivalent to H(s/t, H(t)).
Acknowledgments
The author is grateful to the following for conversations that helped to crystallize these ideas: C. H. Bennett, T. M. Cover, R. P. Daley, M. Davis, P. Elias, T. L. Fine, W. L. Gewirtz, D. W. Loveland, A. R. Meyer, M. Minsky, N. J. Pippenger, R. J. Solomonoff, and S. Winograd. The author also wishes to thank the referees for their comments.
References
[1] Solomonoff, R. J. A formal theory of inductive inference. Inform. and Contr. 7 (1964), 1-22, 224-254.
[2] Kolmogorov, A. N. Three approaches to the quantitative definition of information. Problems of Inform. Transmission 1, 1 (Jan.-March 1965), 1-7.
[3] Chaitin, G. J. On the length of programs for computing finite binary sequences. J. ACM 13, 4 (Oct. 1966), 547-569.
[4] Chaitin, G. J. On the length of programs for computing finite binary sequences: Statistical considerations. J. ACM 16, 1 (Jan. 1969), 145-159.
[5] Kolmogorov, A. N. On the logical foundations of information theory and probability theory. Problems of Inform. Transmission 5, 3 (July-Sept. 1969), 1-4.
[6] Loveland, D. W. A variant of the Kolmogorov concept of complexity. Inform. and Contr. 15 (1969), 510-526.
[7] Schnorr, C. P. Process complexity and effective random tests. J. Comput. and Syst. Scis. 7 (1973), 376-388.
[8] Chaitin, G. J. On the difficulty of computations. IEEE Trans. IT-16 (1970), 5-9.
[9] Feinstein, A. Foundations of Information Theory. McGraw-Hill, New York, 1958.
[10] Fano, R. M. Transmission of Information. Wiley, New York, 1961.
[11] Abramson, N. Information Theory and Coding. McGraw-Hill, New York, 1963.
[12] Ash, R. Information Theory. Wiley-Interscience, New York, 1965.
[13] Willis, D. G. Computational complexity and probability constructions. J. ACM 17, 2 (April 1970), 241-259.
[14] Zvonkin, A. K., and Levin, L. A. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russ. Math. Survs. 25, 6 (Nov.-Dec. 1970), 83-124.
[15] Cover, T. M. Universal gambling schemes and the complexity measures of Kolmogorov and Chaitin. Rep. No. 12, Statistics Dep., Stanford U., Stanford, Calif., 1974. Submitted to Ann. Statist.
[16] Gewirtz, W. L. Investigations in the theory of descriptive complexity. Ph.D. Thesis, New York University, 1974 (to be published as a Courant Institute rep.).
[17] Weiss, B. The isomorphism problem in ergodic theory. Bull. Amer. Math. Soc. 78 (1972), 668-684.
[18] Renyi, A. Foundations of Probability. Holden-Day, San Francisco, 1970.
[19] Fine, T. L. Theories of Probability: An Examination of Foundations. Academic Press, New York, 1973.
[20] Cover, T. M. On determining the irrationality of the mean of a random variable. Ann. Statist. 1 (1973), 862-871.
[21] Chaitin, G. J. Information-theoretic computational complexity. IEEE Trans. IT-20 (1974), 10-15.
[22] Levin, M. Mathematical Logic for Computer Scientists. Rep. TR-131, M.I.T. Project MAC, 1974, pp. 145-147, 153.
[23] Chaitin, G. J. Information-theoretic limitations of formal systems. J. ACM 21, 3 (July 1974), 403-424.
[24] Minsky, M. L. Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs, N.J., 1967, pp. 54, 55, 66.
[25] Minsky, M., and Papert, S. Perceptrons: An Introduction to Computational Geometry. M.I.T. Press, Cambridge, Mass., 1969, pp. 150-153.
[26] Schwartz, J. T. On Programming: An Interim Report on the SETL Project. Installment I: Generalities. Lecture Notes, Courant Institute, New York University, New York, 1973, pp. 1-20.
[27] Bennett, C. H. Logical reversibility of computation. IBM J. Res. Develop. 17 (1973), 525-532.
[28] Daley, R. P. The extent and density of sequences within the minimal-program complexity hierarchies. J. Comput. and Syst. Scis. (to appear).
[29] Chaitin, G. J. Information-theoretic characterizations of recursive infinite strings. Submitted to Theoretical Comput. Sci.
[30] Elias, P. Minimum times and memories needed to compute the values of a function. J. Comput. and Syst. Scis. (to appear).
[31] Elias, P. Universal codeword sets and representations of the integers. IEEE Trans. IT (to appear).
[32] Hellman, M. E. The information theoretic approach to cryptography. Center for Systems Research, Stanford U., Stanford, Calif., 1974.
[33] Chaitin, G. J. Randomness and mathematical proof. Sci. Amer. 232, 5 (May 1975), in press. (Note. Reference [33] is not cited in the text.)
Received April 1974; revised December 1974
INCOMPLETENESS
THEOREMS FOR
RANDOM REALS
Advances in Applied Mathematics 8 (1987), pp. 119-146
G. J. Chaitin
IBM Thomas J. Watson Research Center, P.O. Box 218,
Yorktown Heights, New York 10598
Abstract
We obtain some dramatic results using statistical mechanics-thermodynamics kinds of arguments concerning randomness, chaos, unpredictability, and uncertainty in mathematics. We construct an equation involving only whole numbers and addition, multiplication, and exponentiation, with the property that if one varies a parameter and asks whether the number of solutions is finite or infinite, the answer to this question is indistinguishable from the result of independent tosses of a fair coin. This yields a number of powerful Gödel incompleteness-type results concerning the limitations of the axiomatic method, in which entropy-information measures are used. © 1987 Academic Press, Inc.
1. Introduction
It is now half a century since Turing published his remarkable paper On Computable Numbers, with an Application to the Entscheidungsproblem (Turing [15]). In that paper Turing constructs a universal Turing machine that can simulate any other Turing machine. He also uses Cantor's method to diagonalize over the countable set of computable real numbers and construct an uncomputable real, from which he deduces the unsolvability of the halting problem and as a corollary a form of Gödel's incompleteness theorem. This paper has penetrated into our thinking to such a point that it is now regarded as obvious, a fate which is suffered by only the most basic conceptual contributions. Speaking as a mathematician, I cannot help noting with pride that the idea of a general purpose electronic digital computer was invented in order to cast light on a fundamental question regarding the foundations of mathematics, years before such objects were actually constructed. Of course, this is an enormous simplification of the complex genesis of the computer, to which many contributed, but there is as much truth in this remark as there is in many other historical "facts."
In another paper [5], I used ideas from algorithmic information theory to construct a diophantine equation whose solutions are in a sense random. In the present paper I shall try to give a relatively self-contained exposition of this result via another route, starting from Turing's original construction of an uncomputable real number.
Following Turing, consider an enumeration r1, r2, r3, ... of all computable real numbers between zero and one. We may suppose that r_k is the real number, if any, computed by the kth computer program. Let .d_{k1} d_{k2} d_{k3} ... be the successive digits in the decimal expansion of r_k. Following Cantor, consider the diagonal of the array of r_k:
r1 = .d_{11} d_{12} d_{13} ...
r2 = .d_{21} d_{22} d_{23} ...
r3 = .d_{31} d_{32} d_{33} ...
This gives us a new real number with decimal expansion .d_{11} d_{22} d_{33} ... Now change each of these digits, avoiding the digits zero and nine. The result is an uncomputable real number, because its first digit is different from the first digit of the first computable real, its second digit is different from the second digit of the second computable real, etc. It is necessary to avoid zero and nine, because real numbers with different digit sequences can be equal to each other if one of them ends with an infinite sequence of zeros and the other ends with an infinite sequence of nines, for example, .3999999... = .4000000...
Having constructed an uncomputable real number by diagonalizing over the computable reals, Turing points out that it follows that the halting problem is unsolvable. In particular, there can be no way of deciding if the kth computer program ever outputs a kth digit. Because if there were, one could actually calculate the successive digits of the uncomputable real number defined above, which is impossible. Turing also notes that a version of Gödel's incompleteness theorem is an immediate corollary, because if there cannot be an algorithm for deciding if the kth computer program ever outputs a kth digit, there also cannot be a formal axiomatic system which would always enable one to prove which of these possibilities is the case, for in principle one could run through all possible proofs to decide. Using the powerful techniques which were developed in order to solve Hilbert's tenth problem (see Davis et al. [7] and Jones and Matijasevič [11]), it is possible to encode the unsolvability of the halting problem as a statement about an exponential diophantine equation. An exponential diophantine equation is one of the form
P(x1, ..., xm) = P′(x1, ..., xm),
where the variables x1, ..., xm range over natural numbers and P and P′ are functions built up from these variables and natural number constants by the operations of addition, multiplication, and exponentiation. The result of this encoding is an exponential diophantine equation P = P′ in m + 1 variables n, x1, ..., xm with the property that
P(n, x1, ..., xm) = P′(n, x1, ..., xm)
has a solution in natural numbers x1, ..., xm if and only if the nth computer program ever outputs an nth digit. It follows that there can be no algorithm for deciding as a function of n whether or not P = P′ has a solution, and thus there cannot be any complete proof system for settling such questions either.
Up to now we have followed Turing's original approach, but now we will set off into new territory. Our point of departure is a remark of Courant and Robbins [6] that another way of obtaining a real number that is not on the list r1, r2, r3, ... is by tossing a coin. Here is their measure-theoretic argument that the real numbers are uncountable. Recall that r1, r2, r3, ... are the computable reals between zero and one. Cover r1 with an interval of length ε/2, cover r2 with an interval of length ε/4, cover r3 with an interval of length ε/8, and in general cover r_k with an interval of length ε/2^k. Thus all computable reals in the unit interval are covered by this infinite set of intervals, and the total length of the covering intervals is
Σ_{k=1}^∞ ε/2^k = ε.
Hence if we take ε sufficiently small, the total length of the covering is arbitrarily small. In summary, the reals between zero and one constitute an interval of length one, and the subset that are computable can be covered by intervals whose total length is arbitrarily small. In other words, the computable reals are a set of measure zero, and if we choose a real in the unit interval at random, the probability that it is computable is zero. Thus one way to get an uncomputable real with probability one is to flip a fair coin, using independent tosses to obtain each bit of the binary expansion of its base-two representation.
If this train of thought is pursued, it leads one to the notion of a random real number, which can never be a computable real. Following Martin-Löf [12], we give a definition of a random real using constructive measure theory. We say that a set of real numbers X is a constructive measure zero set if there is an algorithm A which given n generates a (possibly infinite) set of intervals whose total length is less than or equal to 2^{-n} and which covers the set X. More precisely, the covering is in the form of a set C of finite binary strings s such that
Σ_{s∈C} 2^{-|s|} ≤ 2^{-n}
(here |s| denotes the length of the string s), and each real in the covered set X has a member of C as the initial part of its base-two expansion. In other words, we consider sets of real numbers with the property that there is an algorithm A for producing arbitrarily small coverings of the set. Such sets of reals are constructively of measure zero. Since there are only countably many algorithms A for constructively covering measure zero sets, it follows that almost all real numbers are not contained in any set of constructive measure zero. Such reals are called (Martin-Löf) random reals. In fact, if the successive bits of a real number are chosen by coin flipping, with probability one it will not be contained in any set of constructive measure zero, and hence will be a random real number.
Note that no computable real number r is random. Here is how we get a constructive covering of arbitrarily small measure. The covering algorithm, given n, yields the n-bit initial sequence of the binary digits of r. This covers r and has total length or measure equal to 2^{-n}. Thus there is an algorithm for obtaining arbitrarily small coverings of the set consisting of the computable real r, and r is not a random real number. We leave to the reader the adaptation of the argument in Feller [9] proving the strong law of large numbers to show that reals in which all digits do not have equal limiting frequency have constructive measure zero. It follows that random reals are normal in Borel's sense, that is, in any base all digits have equal limiting frequency.
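The covering algorithm for a computable real is a one-liner; in a Python sketch (r_digits(n) being a hypothetical procedure that returns the first n binary digits of r):

    def cover(r_digits, n):
        # A single interval of length 2^-n covering r: the set of reals
        # whose binary expansion begins with the first n digits of r.
        return "".join(str(b) for b in r_digits(n))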
Let us consider the real number p whose nth bit in base-two notation is a zero or a one depending on whether or not the exponential diophantine equation
P(n, x1, ..., xm) = P′(n, x1, ..., xm)
has a solution in natural numbers x1, ..., xm. We will show that p is not a random real. In fact, we will give an algorithm for producing coverings of measure (n + 1) 2^{-n}, which can obviously be changed to one for producing coverings of measure not greater than 2^{-n}. Consider the first N values of the parameter n. If one knows for how many of these values of n, P = P′ has a solution, then one can find for which values of n < N there are solutions. This is because the set of solutions of P = P′ is recursively enumerable, that is, one can try more and more solutions and eventually find each value of the parameter n for which there is a solution. The only problem is to decide when to give up further searches because all values of n < N for which there are solutions have been found. But if one is told how many such n there are, then one knows when to stop searching for solutions. So one can assume each of the N + 1 possibilities ranging from p has all of its initial N bits off to p has all of them on, and each one of these assumptions determines the actual values of the first N bits of p. Thus we have determined N + 1 different possibilities for the first N bits of p, that is, the real number p is covered by a set of intervals of total length (N + 1) 2^{-N}, and hence is a set of constructive measure zero, and p cannot be a random real number.
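The step of recovering which values of n < N yield solutions from how many of them do is just a dovetailed search with a counter; schematically (enumerate_solvable() being a hypothetical generator that emits, in some dovetailed order, each parameter value n for which P = P′ is found to have a solution):

    def solvable_below(N, count, enumerate_solvable):
        # count = how many distinct n < N give P = P' a solution;
        # knowing count tells us when to stop searching.
        found = set()
        produced = enumerate_solvable()
        while len(found) < count:
            n = next(produced)
            if n < N:
                found.add(n)
        return found    # determines the first N bits of p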
Thus asking whether an exponential diophantine equation has a solution as a function of a parameter cannot give us a random real number. However, asking whether or not the number of solutions is infinite can give us a random real. In particular, there is an exponential diophantine equation Q = Q′ such that the real number q is random whose nth bit is a zero or a one depending on whether or not there are infinitely many natural numbers x1, ..., xm such that
Q(n, x1, ..., xm) = Q′(n, x1, ..., xm).
The equation P = P′ that we considered before encoded the halting problem, that is, the nth bit of the real number p was zero or one depending on whether the nth computer program ever outputs an nth digit. To construct an equation Q = Q′ such that q is random is somewhat more difficult; we shall limit ourselves to giving an outline of the proof:1
1. First show that if one had an oracle for solving the halting problem, then one could compute the successive bits of the base-two representation of a particular random real number q.
2. Then show that if a real number q can be computed using an oracle for the halting problem, it can be obtained without using an oracle as the limit of a computable sequence of dyadic rational numbers (rationals of the form K/2^L).
3. Finally show that any real number q that is the limit of a computable sequence of dyadic rational numbers can be encoded into an exponential diophantine equation Q = Q′ in such a manner that
Q(n, x1, ..., xm) = Q′(n, x1, ..., xm)
has infinitely many solutions x1, ..., xm if and only if the nth bit of the real number q is a one. This is done using the fact "that every r.e. set has a singlefold exponential diophantine representation" (Jones and Matijasevič [11]).
1 The full proof is given later in this paper (Theorems R6 and R7), but is slightly different; it uses a particular random real number, Ω, that arises naturally in algorithmic information theory.
Q = Q′ is quite a remarkable equation, as it shows that there is a kind of uncertainty principle even in pure mathematics, in fact, even in the theory of whole numbers. Whether or not Q = Q′ has infinitely many solutions jumps around in a completely unpredictable manner as the parameter n varies. It may be said that the truth or falsity of the assertion that there are infinitely many solutions is indistinguishable from the result of independent tosses of a fair coin. In other words, these are independent mathematical facts with probability one-half! This is where our search for a probabilistic proof of Turing's theorem that there are uncomputable real numbers has led us, to a dramatic version of Gödel's incompleteness theorem.
In Section 2 we define the real number Ω, and we develop as much of algorithmic information theory as we shall need in the rest of the paper. In Section 3 we compare a number of definitions of randomness, we show that Ω is random, and we show that Ω can be encoded into an exponential diophantine equation. In Section 4 we develop incompleteness theorems for Ω and for its exponential diophantine equation.
2. Algorithmic Information Theory [3]
First a piece of notation. By log x we mean the integer part of the base-two logarithm of x. That is, if 2^n ≤ x < 2^{n+1}, then log x = n. Thus 2^{log x} ≤ x, even if x < 1.
Our point of departure is the observation that the series
Σ 1/n,  Σ 1/(n log n),  Σ 1/(n log n log log n),  ...
all diverge. On the other hand,
Σ 1/n^2,  Σ 1/(n (log n)^2),  Σ 1/(n log n (log log n)^2),  ...
all converge. To show this we use the Cauchy condensation test (Hardy [10]): if φ(n) is a nonincreasing function of n, then the series Σ φ(n) is convergent or divergent according as Σ 2^n φ(2^n) is convergent or divergent.
Here is a proof of the Cauchy condensation test:
Σ φ(k) ≥ Σ_n [φ(2^n + 1) + ... + φ(2^{n+1})] ≥ Σ_n 2^n φ(2^{n+1}) = (1/2) Σ_n 2^{n+1} φ(2^{n+1}),
Σ φ(k) ≤ Σ_n [φ(2^n) + ... + φ(2^{n+1} − 1)] ≤ Σ_n 2^n φ(2^n).
Thus Σ 1/n behaves the same as Σ 2^n (1/2^n) = Σ 1, which diverges. Σ 1/(n log n) behaves the same as Σ 2^n (1/(2^n n)) = Σ 1/n, which diverges. Σ 1/(n log n log log n) behaves the same as Σ 2^n (1/(2^n n log n)) = Σ 1/(n log n), which diverges, etc.
On the other hand, Σ 1/n^2 behaves the same as Σ 2^n (1/2^{2n}) = Σ 1/2^n, which converges. Σ 1/(n (log n)^2) behaves the same as Σ 2^n (1/(2^n n^2)) = Σ 1/n^2, which converges. Σ 1/(n log n (log log n)^2) behaves the same as Σ 2^n (1/(2^n n (log n)^2)) = Σ 1/(n (log n)^2), which converges, etc.
For the purposes of this paper, it is best to think of the algorithmic information content H, which we shall now define, as the borderline between Σ 2^{-f(n)} converging and diverging!
Definition. Define an information content measure H(n) to be a function of the natural number n having the property that
Ω = Σ_n 2^{-H(n)} ≤ 1    (1)
and that H(n) is computable as a limit from above, so that the set
{"H(n) ≤ k"}    (2)
of all upper bounds is r.e. We also allow H(n) = +∞, which contributes zero to the sum (1) since 2^{−∞} = 0. It contributes no elements to the set of upper bounds (2).
Note. If H is an information content measure, then it follows immediately from Σ 2^{-H(n)} = Ω ≤ 1 that
#{k | H(k) ≤ n} ≤ 2^n.
That is, there are at most 2^n natural numbers with information content less than or equal to n.
Theorem I. There is a minimal information content measure H, i.e., an information content measure with the property that for any other information content measure H′, there exists a constant c depending only on H and H′ but not on n such that
H(n) ≤ H′(n) + c.
That is, H is smaller, within O(1), than any other information content measure.
Proof. Define H as
H(n) = min_{k≥1} [H_k(n) + k],    (3)
where H_k denotes the information content measure resulting from taking the kth (k ≥ 1) computer algorithm and patching it, if necessary, so that it gives limits from above and does not violate the Ω ≤ 1 condition (1). Then (3) gives H as a computable limit from above, and
Ω = Σ_n 2^{-H(n)} ≤ Σ_{k≥1} 2^{-k} [Σ_n 2^{-H_k(n)}] ≤ Σ_{k≥1} 2^{-k} = 1.
Q.E.D.
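Equation (3) amounts to a weighted minimum over all candidate measures, and any finite stage of it is computable; a sketch (hypothetical interface: candidate(k, n, stage) returns the stage-th upper bound on H_k(n), nonincreasing in stage):

    def H_upper_bound(n, stage, candidate, K=50):
        # Truncated version of equation (3): min over k = 1..K of H_k(n) + k.
        # As stage grows this converges to H(n) from above, once K is large
        # enough to include the minimizing k.
        return min(candidate(k, n, stage) + k for k in range(1, K + 1))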
Definition. Henceforth we use this minimal information content measure H, and we refer to H(n) as the information content of n. We also consider each natural number n to correspond to a bit string s and vice versa, so that H is defined for strings as well as numbers.2 In addition, let ⟨n, m⟩ denote a fixed computable one-to-one correspondence between natural numbers and ordered pairs of natural numbers. We define the joint information content of n and m to be H(⟨n, m⟩). Thus H is defined for ordered pairs of natural numbers as well as individual natural numbers. We define the relative information content H(m|n) of m relative to n by the equation
H(⟨n, m⟩) ≡ H(n) + H(m|n).
That is,
H(m|n) ≡ H(⟨n, m⟩) − H(n).
And we define the mutual information content I(n : m) of n and m by the equation
I(n : m) ≡ H(m) − H(m|n) ≡ H(n) + H(m) − H(⟨n, m⟩).
Note. Ω = Σ 2^{-H(n)} is just on the borderline between convergence and divergence:
• Σ 2^{-H(n)} converges.
• If f(n) is computable and unbounded, then Σ 2^{-H(n)+f(n)} diverges.
• If f(n) is computable and Σ 2^{-f(n)} converges, then H(n) ≤ f(n) + O(1).
• If f(n) is computable and Σ 2^{-f(n)} diverges, then H(n) > f(n) infinitely often.
Let us look at a real-valued function ν(n) that is computable as a limit of rationals from below. And suppose that Σ ν(n) ≤ 1. Then H(n) ≤ −log ν(n) + O(1). So 2^{-H(n)} can be thought of as a maximal function ν(n) that is computable in the limit from below and has Σ ν(n) ≤ 1, instead of thinking of H(n) as a minimal function f(n) that is computable in the limit from above and has Σ 2^{-f(n)} ≤ 1.
2 It is important to distinguish between the length of a string and its information content! However, a possible source of confusion is the fact that the "natural unit" for both length and information content is the "bit." Thus one often speaks of an n-bit string, and also of a string whose information content is ≤ n bits.
Lemma I. For all n,
H (n)  2 log n + c
 log n + 2 log log n + c0
 log n + log log n + 2 log log log n + c00 : : :
For in nitely many values of n,
H (n)  log n
 log n + log log n
 log n + log log n + log log log n : : :
Lemma I2. H (s)  jsj + H (jsj) + O(1). jsj = the length in bits of
the string s.
Proof.
X X X
1$= 2;H (n) = 2;H (n) 2;n ]
n n
X X ;jnsj+=Hn(n)]
= 2
n jsj=n
X ;jsj+H (jsj)]
= 2 :
s
The lemma follows by the minimality of H . Q.E.D.
Lemma I3. There are < 2^(n−k+c) n-bit strings s such that H(s) < n + H(n) − k. Thus there are < 2^(n−H(n)−k+c) n-bit strings s such that H(s) < n − k.
Proof.

  Σ_n Σ_{|s|=n} 2^(−H(s)) = Σ_s 2^(−H(s)) = Ω ≤ 1.

Hence, by the minimality of H,

  2^(−H(n)+c) ≥ Σ_{|s|=n} 2^(−H(s)),

which yields the lemma. Q.E.D.
Lemma I4. If ψ(n) is a computable partial function, then

  H(ψ(n)) ≤ H(n) + c_ψ.

Proof.

  1 ≥ Ω = Σ_n 2^(−H(n)) ≥ Σ_y Σ_{ψ(x)=y} 2^(−H(x)).

Note that

  2^(−a) ≥ Σ_i 2^(−b_i) ⇒ a ≤ min_i b_i.   (4)

The lemma follows by the minimality of H. Q.E.D.
Lemma I5. H(⟨n, m⟩) = H(⟨m, n⟩) + O(1).
Proof.

  Σ_{⟨n,m⟩} 2^(−H(⟨n,m⟩)) = Σ_{⟨m,n⟩} 2^(−H(⟨n,m⟩)) = Ω ≤ 1.

The lemma follows by using the minimality of H in both directions. Q.E.D.
Lemma I6. H(⟨n, m⟩) ≤ H(n) + H(m) + O(1).
Proof.

  Σ_{⟨n,m⟩} 2^(−[H(n)+H(m)]) = Ω² ≤ 1.

The lemma follows by the minimality of H. Q.E.D.
Lemma I7. H(n) ≤ H(⟨n, m⟩) + O(1).
Proof.

  Σ_n Σ_m 2^(−H(⟨n,m⟩)) = Σ_{⟨n,m⟩} 2^(−H(⟨n,m⟩)) = Ω ≤ 1.

The lemma follows from (4) and the minimality of H. Q.E.D.
Lemma I8. H(⟨n, H(n)⟩) = H(n) + O(1).
Proof. By Lemma I7,

  H(n) ≤ H(⟨n, H(n)⟩) + O(1).

On the other hand, consider

  Σ_{⟨n,i⟩ : i≥H(n)} 2^(−i−1) = Σ_n Σ_{j≥0} 2^(−H(n)−j−1) = Σ_n Σ_{k≥1} 2^(−H(n)−k) = Σ_n 2^(−H(n)) = Ω ≤ 1.

By the minimality of H,

  H(⟨n, H(n) + j⟩) ≤ H(n) + j + O(1).

Take j = 0. Q.E.D.
Lemma I9. H(⟨n, n⟩) = H(n) + O(1).
Proof. By Lemma I7,

  H(n) ≤ H(⟨n, n⟩) + O(1).

On the other hand, consider ψ(n) = ⟨n, n⟩. By Lemma I4,

  H(ψ(n)) ≤ H(n) + c_ψ.

That is,

  H(⟨n, n⟩) ≤ H(n) + O(1).

Q.E.D.
Lemma I10. H(⟨n, 0⟩) = H(n) + O(1).
Proof. By Lemma I7,

  H(n) ≤ H(⟨n, 0⟩) + O(1).

On the other hand, consider ψ(n) = ⟨n, 0⟩. By Lemma I4,

  H(ψ(n)) ≤ H(n) + c_ψ.

That is,

  H(⟨n, 0⟩) ≤ H(n) + O(1).

Q.E.D.
Lemma I11. H(m|n) ≡ H(⟨n, m⟩) − H(n) ≥ −c.
(Proof: use Lemma I7.)
Lemma I12. I(n : m) ≡ H(n) + H(m) − H(⟨n, m⟩) ≥ −c.
(Proof: use Lemma I6.)
Lemma I13. I(n : m) = I(m : n) + O(1).
(Proof: use Lemma I5.)
Lemma I14. I(n : n) = H(n) + O(1).
(Proof: use Lemma I9.)
Lemma I15. I(n : 0) = O(1).
(Proof: use Lemma I10.)
Note. The further development of this algorithmic version of information theory³ requires the notion of the size in bits of a self-delimiting computer program (Chaitin [3]), which, however, we can do without in this paper.

³ Compare the original ensemble version of information theory given in Shannon and Weaver [13].

3. Random Reals
Definition (Martin-Löf [12]). Speaking geometrically, a real r is Martin-Löf random if it is never the case that it is contained in each set of an r.e. infinite sequence A_i of sets of intervals with the property that the measure⁴ of the ith set is always less than or equal to 2^(−i),

  μ(A_i) ≤ 2^(−i).   (5)

Here is the definition of a Martin-Löf random real r in a more compact notation:

  ∀i [μ(A_i) ≤ 2^(−i)] ⇒ ¬∀i [r ∈ A_i].

⁴ I.e., the sum of the lengths of the intervals, being careful to avoid counting overlapping intervals twice.
An equivalent definition, if we restrict ourselves to reals in the unit interval 0 ≤ r ≤ 1, may be formulated in terms of bit strings rather than geometrical notions, as follows. Define a covering to be an r.e. set of ordered pairs consisting of a natural number i and a bit string s,

  Covering = {⟨i, s⟩},

with the property that if ⟨i, s⟩ ∈ Covering and ⟨i, s′⟩ ∈ Covering, then it is not the case that s is an extension of s′ or that s′ is an extension
of s.⁵ We simultaneously consider A_i to be a set of (finite) bit strings

  {s | ⟨i, s⟩ ∈ Covering},

and to be a set of real numbers, namely those which in base-two notation have a bit string in A_i as an initial segment.⁶ Then condition (5) becomes

  μ(A_i) = Σ_{⟨i,s⟩∈Covering} 2^(−|s|) ≤ 2^(−i),   (6)

where |s| = the length in bits of the string s.
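To make the covering formalism concrete, here is a small Python sketch (an illustration only; the particular covering shown is hypothetical). It checks the no-extension requirement of footnote 5 and evaluates the measure in condition (6).

from fractions import Fraction

def is_prefix_free(strings):
    """No string in the set may extend another (footnote 5's requirement)."""
    return not any(a != b and b.startswith(a) for a in strings for b in strings)

def measure(strings):
    """mu(A_i) = sum of 2^-|s| over the bit strings s in A_i."""
    return sum(Fraction(1, 2 ** len(s)) for s in strings)

# A hypothetical covering: A_i is a set of bit strings with mu(A_i) <= 2^-i.
covering = {1: {"00"},                 # measure 1/4 <= 1/2
            2: {"010", "0110"},        # measure 1/8 + 1/16 <= 1/4
            3: {"0111000"}}            # measure 1/128 <= 1/8

for i, A_i in covering.items():
    assert is_prefix_free(A_i)
    assert measure(A_i) <= Fraction(1, 2 ** i)   # condition (6)
    print(i, measure(A_i))

Representing each A_i by a prefix-free set of strings is what makes the sum in (6) the exact measure of the covered set of reals.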
Note. This is equivalent to stipulating the existence of an arbitrary "regulator of convergence" f → ∞ that is computable and nondecreasing such that μ(A_i) ≤ 2^(−f(i)). A₀ is only required to have measure ≤ 1 and is sort of useless, since we are working within the unit interval 0 ≤ r ≤ 1.⁷
Any real number, considered as a singleton set, is a set of measure zero, but not constructively so! Similarly, the notion of a von Mises collective,⁸ which is an infinite bit string such that any place selection rule based on the preceding bits picks out a substring with the same limiting frequency of 0's and 1's as the whole string has, is contradictory. But Alonzo Church's idea, to allow only computable place selection rules, saves the concept.
⁵ This is to avoid overlapping intervals and enable us to use the formula (6). It is easy to convert a covering which does not have this property into one that covers exactly the same set and does have this property. How this is done depends on the order in which overlaps are discovered: intervals which are subsets of ones which have already been included in the enumeration of A_i are eliminated, and intervals which are supersets of ones which have already been included in the enumeration must be split into disjoint subintervals, and the common portion must be thrown away.
⁶ I.e., the geometrical statement that a point is covered by (the union of) a set of intervals corresponds in bit string language to the statement that an initial segment of an infinite bit string is contained in a set of finite bit strings. The fact that some reals correspond to two infinite bit strings, e.g., .100000… = .011111…, causes no problems. We are working with closed intervals, which include their endpoints.
⁷ It makes Σ μ(A_i) ≤ 2 instead of what it should be, namely, ≤ 1. So A₀ really ought to be abolished!
⁸ See Feller [9].
Definition (Solovay [14]). A real r is Solovay random if for any r.e. infinite sequence A_i of sets of intervals with the property that the sum of the measures of the A_i converges,

  Σ μ(A_i) < ∞,

r is contained in at most finitely many of the A_i. In other words,

  Σ μ(A_i) < ∞ ⇒ ∃N ∀(i > N) [r ∉ A_i].

A real r is weakly Solovay random ("Solovay random with a regulator of convergence") if for any r.e. infinite sequence A_i of sets of intervals with the property that the sum of the measures of the A_i converges constructively, r is contained in at most finitely many of the A_i. In other words, a real r is weakly Solovay random if the existence of a computable function f(n) such that for each n,

  Σ_{i≥f(n)} μ(A_i) ≤ 2^(−n),

implies that r is contained in at most finitely many of the A_i. That is to say,

  ∀n [Σ_{i≥f(n)} μ(A_i) ≤ 2^(−n)] ⇒ ∃N ∀(i > N) [r ∉ A_i].
Definition (Chaitin [3]). A real r is Chaitin random if (the information content of the initial segment r_n of length n of the base-two expansion of r) does not drop arbitrarily far below n: lim inf H(r_n) − n > −∞.⁹ In other words,

  ∃c ∀n [H(r_n) ≥ n − c].

A real r is strongly Chaitin random if (the information content of the initial segment r_n of length n of the base-two expansion of r) eventually becomes and remains arbitrarily greater than n: lim inf H(r_n) − n = ∞. In other words,

  ∀k ∃N_k ∀(n ≥ N_k) [H(r_n) ≥ n + k].

⁹ Thus

  n − c ≤ H(r_n) ≤ n + H(n) + c′ ≤ n + log n + 2 log log n + c″

by Lemmas I2 and I.
Note. All these definitions hold with probability one (see Theorem R4).
Theorem R1. Martin-Löf random ⇔ Chaitin random.
Proof. ¬Martin-Löf ⇒ ¬Chaitin. Suppose that a real number r has the property that

  ∀i [μ(A_i) ≤ 2^(−i) & r ∈ A_i].

The series

  Σ_n 2^n 2^(−n²) = Σ 2^(−n²+n) = 2^0 + 2^0 + 2^(−2) + 2^(−6) + 2^(−12) + 2^(−20) + ⋯

obviously converges, so define N so that

  Σ_{n≥N} 2^(−n²+n) ≤ 1.

(In fact, we can take N = 2.) Let the variable s range over bit strings, and consider

  Σ_{n≥N} Σ_{s∈A_{n²}} 2^(−[|s|−n]) = Σ_{n≥N} 2^n μ(A_{n²}) ≤ Σ_{n≥N} 2^(−n²+n) ≤ 1.

It follows from the minimality of H that

  s ∈ A_{n²} and n ≥ N ⇒ H(s) ≤ |s| − n + c.

Thus, since r ∈ A_{n²} for all n ≥ N, there will be infinitely many initial segments r_k of length k of the base-two expansion of r with the property that r_k ∈ A_{n²} and n ≥ N, and for each of these r_k we have

  H(r_k) ≤ |r_k| − n + c.

Thus the information content of an initial segment of the base-two expansion of r can drop arbitrarily far below its length.
Proof. ¬Chaitin ⇒ ¬Martin-Löf. Suppose that H(r_n) − n can go arbitrarily negative. There are < 2^(n−k+c) n-bit strings s such that H(s) < n + H(n) − k (Lemma I3). Thus there are < 2^(n−H(n)−k) n-bit strings s such that H(s) < n − k − c. That is, the probability that an n-bit string s has H(s) < n − k − c is < 2^(−H(n)−k). Summing this over all n, we get

  Σ_n 2^(−H(n)−k) = 2^(−k) Σ_n 2^(−H(n)) = 2^(−k) Ω ≤ 2^(−k),

since Ω ≤ 1. Thus if a real r has the property that H(r_n) dips below n − k − c for even one value of n, then r is covered by an r.e. set A_k of intervals with μ(A_k) ≤ 2^(−k). Thus if H(r_n) − n goes arbitrarily negative, for each k we can compute an A_k with μ(A_k) ≤ 2^(−k) & r ∈ A_k, and r is not Martin-Löf random. Q.E.D.
Theorem R2. Solovay random ⇔ strong Chaitin random.
Proof. ¬Solovay ⇒ ¬(strong Chaitin). Suppose that a real number r has the property that it is in infinitely many A_i and

  Σ μ(A_i) < ∞.

Then there must be an N such that

  Σ_{i≥N} μ(A_i) ≤ 1.

Hence

  Σ_{i≥N} Σ_{s∈A_i} 2^(−|s|) = Σ_{i≥N} μ(A_i) ≤ 1.

It follows from the minimality of H that

  s ∈ A_i and i ≥ N ⇒ H(s) ≤ |s| + c,

i.e., if a bit string s is in A_i and i ≥ N, then its information content is less than or equal to its size in bits +c. Thus H(r_n) ≤ |r_n| + c = n + c for infinitely many initial segments r_n of length n of the base-two expansion of r, and it is not the case that H(r_n) − n → ∞.
Proof. ¬(strong Chaitin) ⇒ ¬Solovay. ¬(strong Chaitin) says that there is a k such that for infinitely many values of n we have H(r_n) − n < k. The probability that an n-bit string s has H(s) < n + k is < 2^(−H(n)+k+c) (Lemma I3). Let A_n be the r.e. set of all n-bit strings s such that H(s) < n + k. Then

  Σ μ(A_n) ≤ Σ_n 2^(−H(n)+k+c) = 2^(k+c) Σ_n 2^(−H(n)) = 2^(k+c) Ω ≤ 2^(k+c),

since Ω ≤ 1. Hence Σ μ(A_n) < ∞ and r is in infinitely many of the A_n, and thus r is not Solovay random. Q.E.D.
Theorem R3. Martin-Löf random ⇔ weak Solovay random.
Proof. ¬Martin-Löf ⇒ ¬(weak Solovay). We are given that ∀i [r ∈ A_i] and ∀i [μ(A_i) ≤ 2^(−i)]. Hence Σ μ(A_i) converges, and the inequality

  Σ_{i>N} μ(A_i) ≤ 2^(−N)

gives us a regulator of convergence.
Proof. ¬(weak Solovay) ⇒ ¬Martin-Löf. Suppose

  Σ_{i≥f(n)} μ(A_i) ≤ 2^(−n)

and the real number r is in infinitely many of the A_i. Let

  B_n = ⋃_{i≥f(n)} A_i.

Then μ(B_n) ≤ 2^(−n) and r ∈ B_n, so r is not Martin-Löf random. Q.E.D.
Note. In summary, the five definitions of randomness reduce to at most two:

• Martin-Löf random ⇔ Chaitin random ⇔ weak Solovay random.¹⁰
• Solovay random ⇔ strong Chaitin random.¹¹
• Solovay random ⇒ Martin-Löf random.¹²
• Martin-Löf random ⇒ Solovay random???

¹⁰ Theorems R1 and R3.
¹¹ Theorem R2.
¹² Because strong Chaitin ⇒ Chaitin.
Theorem R4. With probability one, a real number r is Martin-Löf random and Solovay random.
Proof 1. Since Solovay random ⇒ Martin-Löf random (is the converse true?), it is sufficient to show that r is Solovay random with probability one. Suppose

  Σ μ(A_i) < ∞,

where the A_i are an r.e. infinite sequence of sets of intervals. Then (this is the Borel–Cantelli lemma (Feller [9]))

  lim_{N→∞} Pr{⋃_{i≥N} A_i} ≤ lim_{N→∞} Σ_{i≥N} μ(A_i) = 0,

and the probability is zero that a real r is in infinitely many of the A_i. But there are only countably many choices for the r.e. sequence of A_i, since there are only countably many algorithms. Since the union of a countable number of sets of measure zero is also of measure zero, it follows that with probability one r is Solovay random.
Proof 2. We use the Borel–Cantelli lemma again. This time we show that the strong Chaitin criterion for randomness, which is equivalent to the Solovay criterion, is true with probability one. Since for each k,

  Σ_n Pr{H(r_n) < n + k} ≤ 2^(k+c)

and thus converges,¹³ it follows that for each k, with probability one, H(r_n) < n + k only finitely often. Thus, with probability one,

  lim_{n→∞} H(r_n) − n = ∞.

Q.E.D.

¹³ See the second half of the proof of Theorem R2.
Theorem R5. r Martin-Löf random ⇒ H(r_n) − n is unbounded. (Does r Martin-Löf random ⇒ lim H(r_n) − n = ∞?)
Proof. We shall prove the theorem by assuming that H(r_n) − n < c for all n and deducing that r cannot be Martin-Löf random. Let c′ be the constant of Lemma I3, so that the number of k-bit strings s with H(s) < k + H(k) − i is < 2^(k−i+c′).
Consider r_k for k = 1 to 2^(n+c+c′). We claim that the probability of the event A_n that r simultaneously satisfies the 2^(n+c+c′) inequalities

  H(r_k) < k + c   (k = 1, …, 2^(n+c+c′))

is < 2^(−n). (See the next paragraph for the proof of this claim.) Thus we have an r.e. infinite sequence A_n of sets of intervals with measure μ(A_n) ≤ 2^(−n) which all contain r. Hence r is not Martin-Löf random.
Proof of Claim. Since Σ 2^(−H(k)) = Ω ≤ 1, there is a k between 1 and 2^(n+c+c′) such that H(k) ≥ n + c + c′. For this value of k,

  Pr{H(r_k) < k + c} ≤ 2^(−H(k)+c+c′) ≤ 2^(−n),

since the number of k-bit strings s with H(s) < k + H(k) − i is < 2^(k−i+c′) (Lemma I3). Q.E.D.


Theorem R6. Ω is a Martin-Löf–Chaitin–weak Solovay random real number. More generally, if N is an infinite r.e. set of natural numbers, then

  ω = Σ_{n∈N} 2^(−H(n))

is a Martin-Löf–Chaitin–weak Solovay random real.¹⁴
Proof. Since H(n) can be computed as a limit from above, 2^(−H(n)) can be computed as a limit from below. It follows that given ω_k, the first k bits of the base-two expansion without infinitely many consecutive trailing zeros¹⁵ of the real number ω, one can calculate the finite set of all n ∈ N such that H(n) ≤ k, and then, since N is infinite, one can calculate an n ∈ N with H(n) > k. That is, there is a computable partial function ψ such that

  ψ(ω_k) = a natural number n with H(n) > k.

But by Lemma I4,

  H(ψ(ω_k)) ≤ H(ω_k) + c_ψ.

Hence

  k < H(ψ(ω_k)) ≤ H(ω_k) + c_ψ

and

  H(ω_k) > k − c_ψ.

Thus ω is Chaitin random, and by Theorems R1 and R3 it is also Martin-Löf random and weakly Solovay random. Q.E.D.

¹⁴ Incidentally, this implies that ω is not a computable real number, from which it follows that 0 < ω < 1, that ω is irrational, and even that ω is transcendental.
¹⁵ If there is a choice between ending the base-two expansion of ω with infinitely many consecutive zeros or with infinitely many consecutive ones (i.e., if ω is a dyadic rational), then we must choose the infinity of consecutive ones. This is to ensure that, considered as real numbers,

  ω_k < ω < ω_k + 2^(−k).

Of course, it will follow from this theorem that ω must be an irrational number, so this situation cannot actually occur, but we don't know that yet!
Theorem R7. There is an exponential diophantine equation

  L(n, x₁, …, x_m) = R(n, x₁, …, x_m)

which has only finitely many solutions x₁, …, x_m if the nth bit of Ω is a 0, and which has infinitely many solutions x₁, …, x_m if the nth bit of Ω is a 1.
Proof. Since H(n) can be computed as a limit from above, 2^(−H(n)) can be computed as a limit from below. It follows that

  Ω = Σ 2^(−H(n))

is the limit from below of a computable sequence ω₁ ≤ ω₂ ≤ ω₃ ≤ ⋯ of rational numbers,

  Ω = lim_{k→∞} ω_k.

This sequence converges extremely slowly! The exponential diophantine equation L = R is constructed from the sequence ω_k by using the theorem that "every r.e. relation has a singlefold exponential diophantine representation" (Jones and Matijasevič [11]). Since the assertion that

  "the nth bit of ω_k is a 1"

is an r.e. relation between n and k (in fact, it is a recursive relation), the theorem of Jones and Matijasevič yields an equation

  L(n, k, x₂, …, x_m) = R(n, k, x₂, …, x_m)

involving only additions, multiplications, and exponentiations of natural number constants and variables, and this equation has exactly one solution x₂, …, x_m in natural numbers if the nth bit of the base-two expansion of ω_k is a 1, and it has no solution x₂, …, x_m in natural numbers if the nth bit of the base-two expansion of ω_k is a 0. The number of different m-tuples x₁, …, x_m of natural numbers which are solutions of the equation

  L(n, x₁, …, x_m) = R(n, x₁, …, x_m)

is therefore infinite if the nth bit of the base-two expansion of Ω is a 1, and it is finite if the nth bit of the base-two expansion of Ω is a 0. Q.E.D.
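The behavior of such a sequence ω_k can be illustrated with a toy model in Python. The sketch is purely schematic: H_approx is a hypothetical stand-in for the dovetailed upper approximations of H(n), nonincreasing in t, and the resulting ω_t increase to the (toy) value of Ω from below.

# Toy illustration of Omega = lim omega_k from below. H_approx(n, t) is a
# hypothetical stand-in for a computable upper approximation of H(n) after
# t steps of dovetailing: nonincreasing in t, converging to the final value.

def H_approx(n, t):
    return max(2 * n - t, n) + 2   # any nonincreasing-in-t bound works here

def omega(t):
    """omega_t = sum over n <= t of 2^-H_approx(n, t): a rational lower
    bound on the toy Omega, nondecreasing in t."""
    return sum(2.0 ** -H_approx(n, t) for n in range(1, t + 1))

approximations = [omega(t) for t in range(1, 30)]
assert all(a <= b for a, b in zip(approximations, approximations[1:]))
print(approximations[-1])

In the genuine construction there is no computable bound on how long one must wait for a given bit of Ω to settle, which is exactly the slowness that the proof exploits.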

4. Incompleteness Theorems
Having developed the necessary information-theoretic formalism in Section 2, and having studied the notion of a random real in Section 3, we can now begin to derive incompleteness theorems.
The setup is as follows. The axioms of a formal theory are considered to be encoded as a single finite bit string, the rules of inference are considered to be an algorithm for enumerating the theorems given the axioms, and in general we shall fix the rules of inference and vary the axioms. More formally, the rules of inference F may be considered to be an r.e. set of propositions of the form

  "Axioms ⊢_F Theorem."

The r.e. set of theorems deduced from the axiom A is determined by selecting from the set F the theorems in those propositions which have the axiom A as an antecedent. In general we will consider the rules of inference F to be fixed and study what happens as we vary the axioms A. By an n-bit theory we shall mean the set of theorems deduced from an n-bit axiom.

4.1. Incompleteness Theorems for Lower Bounds on Information Content
Let us start by rederiving within our current formalism an old and very basic result, which states that even though most strings are random, one can never prove that a specific string has this property.
If one produces a bit string s by tossing a coin n times, 99.9% of the time it will be the case that H(s) ≈ n + H(n) (Lemmas I2 and I3). In fact, if one lets n go to infinity, with probability one H(s) > n for all but finitely many n (Theorem R4). However,
Theorem LB (Chaitin [1,2,4]). Consider a formal theory all of whose theorems are assumed to be true. Within such a formal theory a specific string cannot be proven to have information content more than O(1) greater than the information content of the axioms of the theory. That is, if "H(s) ≥ n" is a theorem only if it is true, then it is a theorem only if n ≤ H(axioms) + O(1). Conversely, there are formal theories whose axioms have information content n + O(1) in which it is possible to establish all true propositions of the form "H(s) ≥ n" and of the form "H(s) = k" with k < n.
Proof. Consider the enumeration of the theorems of the formal axiomatic theory in order of the size of their proofs. For each natural number k, let s be the string in the theorem of the form "H(s) ≥ n" with n > H(axioms) + k which appears first in the enumeration. On the one hand, if all theorems are true, then

  H(axioms) + k < H(s).

On the other hand, the above prescription for calculating s shows that

  s = ψ(⟨⟨axioms, H(axioms)⟩, k⟩)   (ψ partial recursive),

and thus

  H(s) ≤ H(⟨⟨axioms, H(axioms)⟩, k⟩) + c_ψ ≤ H(axioms) + H(k) + O(1).

Here we have used the subadditivity of information, H(⟨s, t⟩) ≤ H(s) + H(t) + O(1) (Lemma I6), and the fact that H(⟨s, H(s)⟩) ≤ H(s) + O(1) (Lemma I8). It follows that

  H(axioms) + k < H(s) ≤ H(axioms) + H(k) + O(1),

and thus

  k < H(k) + O(1).

However, this inequality is false for all k ≥ k₀, where k₀ depends only on the rules of inference. A contradiction is avoided only if s does not exist for k = k₀, i.e., it is impossible to prove in the formal theory that a specific string has H greater than H(axioms) + k₀.
Proof of Converse. The set T of all true propositions of the form "H(s) ≤ k" is r.e. Choose a fixed enumeration of T without repetitions, and for each natural number n, let s∗ be the string in the last proposition of the form "H(s) ≤ k" with k < n in the enumeration. Let

  Δ = n − H(s∗) > 0.

Then from s∗, H(s∗) & Δ we can calculate n = H(s∗) + Δ, then all strings s with H(s) < n, and then a string s_n with H(s_n) ≥ n. Thus

  n ≤ H(s_n) = H(ψ(⟨⟨s∗, H(s∗)⟩, Δ⟩))   (ψ partial recursive),

and so

  n ≤ H(⟨⟨s∗, H(s∗)⟩, Δ⟩) + c_ψ ≤ H(s∗) + H(Δ) + O(1)   (7)
    ≤ n + H(Δ) + O(1)

by Lemmas I6 and I8. The first line of (7) implies that

  Δ ≡ n − H(s∗) ≤ H(Δ) + O(1),

which implies that Δ and H(Δ) are both bounded. Then the second line of (7) implies that

  H(⟨⟨s∗, H(s∗)⟩, Δ⟩) = n + O(1).

The triple ⟨⟨s∗, H(s∗)⟩, Δ⟩ is the desired axiom: it has information content n + O(1), and by enumerating T until all true propositions of the form "H(s) ≤ k" with k < n have been discovered, one can immediately deduce all true propositions of the form "H(s) ≥ n" and of the form "H(s) = k" with k < n. Q.E.D.

4.2. Incompleteness Theorems for Random Reals: First Approach
In this section we begin our study of incompleteness theorems for random reals. We show that any particular formal theory can enable one to determine at most a finite number of bits of Ω. In the following sections (4.3 and 4.4) we express the upper bound on the number of bits of Ω which can be determined in terms of the axioms of the theory; for now, we just show that an upper bound exists. We shall not use any ideas from algorithmic information theory until Section 4.4; for now (Sections 4.2 and 4.3) we only make use of the fact that Ω is Martin-Löf random.
If one tries to guess the bits of a random sequence, the average number of correct guesses before failing is exactly 1 guess! Reason: if we use the fact that the expected value of a sum is equal to the sum of the expected values, the answer is the sum of the chance of getting the first guess right, plus the chance of getting the first and the second guesses right, plus the chance of getting the first, second and third guesses right, etc.:

  1/2 + 1/4 + 1/8 + 1/16 + ⋯ = 1.

Or if we directly calculate the expected value as the sum of (the # right till first failure) × (the probability):

  0·(1/2) + 1·(1/4) + 2·(1/8) + 3·(1/16) + 4·(1/32) + ⋯
  = 1·Σ_{k>1} 2^(−k) + 1·Σ_{k>2} 2^(−k) + 1·Σ_{k>3} 2^(−k) + ⋯
  = 1/2 + 1/4 + 1/8 + ⋯ = 1.

On the other hand (see the next section), if we are allowed to try 2^n times a series of n guesses, one of them will always get it right, if we try all 2^n different possible series of n guesses.
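The two computations of this expected value are easy to verify numerically. The following short Python check (an illustration; it is not part of the argument) evaluates both series and also estimates the expectation by simulating guesses against coin flips.

import random

# First computation: sum over n >= 1 of Pr{first n guesses all right} = 2^-n.
series1 = sum(2.0 ** -n for n in range(1, 60))

# Second computation: sum over n >= 0 of n * Pr{exactly n right, then a miss}
#                    = sum over n of n * 2^-(n+1).
series2 = sum(n * 2.0 ** -(n + 1) for n in range(0, 60))

# Monte Carlo: guess bits of a fair-coin sequence and count correct guesses
# before the first failure.
random.seed(0)
trials = 100000
total = 0
for _ in range(trials):
    run = 0
    while random.random() < 0.5:   # each guess is right with probability 1/2
        run += 1
    total += run

print(series1, series2, total / trials)   # all approximately 1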
Theorem X. Any given formal theory T can yield only finitely many (scattered) bits of (the base-two expansion of) Ω.
When we say that a theory yields a bit of Ω, we mean that it enables us to determine its position and its 0/1 value.
Proof. Consider a theory T, an r.e. set of true assertions of the form

  "The nth bit of Ω is 0."
  "The nth bit of Ω is 1."

Here n denotes specific natural numbers.
If T provides k different (scattered) bits of Ω, then that gives us a covering A_k of measure 2^(−k) which includes Ω: Enumerate T until k bits of Ω are determined; then the covering is all bit strings up to the last determined bit with all determined bits okay. If n is the last determined bit, this covering will consist of 2^(n−k) n-bit strings, and will have measure 2^(n−k)/2^n = 2^(−k).
It follows that if T yields infinitely many different bits of Ω, then for any k we can produce, by running through all possible proofs in T, a covering A_k of measure 2^(−k) which includes Ω. But this contradicts the fact that Ω is Martin-Löf random. Hence T yields only finitely many bits of Ω. Q.E.D.
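The covering constructed in this proof can be made explicit. In the Python sketch below (an illustration; the three determined bits are hypothetical), the covering consists of all bit strings up to the last determined position that agree with the determined bits, and its measure comes out to exactly 2^(−k).

from fractions import Fraction
from itertools import product

def covering_from_bits(known):
    """known: dict position -> bit (0/1), positions counted from 0.
    Returns all n-bit strings consistent with the known bits, where n is
    one past the last determined position."""
    n = max(known) + 1
    strings = ["".join(str(known.get(i, b[i])) for i in range(n))
               for b in product((0, 1), repeat=n)]
    return sorted(set(strings))

known = {0: 1, 3: 0, 5: 1}           # three hypothetical determined bits
A = covering_from_bits(known)
measure = sum(Fraction(1, 2 ** len(s)) for s in A)
assert measure == Fraction(1, 2 ** len(known))
print(len(A), measure)               # 2^(6-3) = 8 strings, measure 2^-3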
Corollary X. Since by Theorem R7 Ω can be encoded into an exponential diophantine equation

  L(n, x₁, …, x_m) = R(n, x₁, …, x_m),   (8)

it follows that any given formal theory can permit one to determine whether (8) has finitely or infinitely many solutions x₁, …, x_m for only finitely many specific values of the parameter n.

4.3. Incompleteness Theorems for Random Reals: |Axioms|
Theorem A. If Σ 2^(−f(n)) ≤ 1 and f is computable, then there is a constant c_f with the property that no n-bit theory ever yields more than n + f(n) + c_f bits of Ω.
Proof. Let A_k be the event that there is at least one n such that there is an n-bit theory that yields n + f(n) + k or more bits of Ω. Then

  Pr{A_k} ≤ Σ_n [(2^n n-bit theories) × (2^(−[n+f(n)+k]) probability that one yields n + f(n) + k bits of Ω)]
         = 2^(−k) Σ_n 2^(−f(n)) ≤ 2^(−k),

since Σ 2^(−f(n)) ≤ 1. Hence Pr{A_k} ≤ 2^(−k), and Σ Pr{A_k} also converges. Thus only finitely many of the A_k occur (Borel–Cantelli lemma (Feller [9])). That is,

  lim_{N→∞} Pr{⋃_{k>N} A_k} ≤ Σ_{k>N} Pr{A_k} ≤ 2^(−N) → 0.
More Detailed Proof. Assume the opposite of what we want to prove, namely that for every k there is at least one n-bit theory that yields n + f(n) + k bits of Ω. From this we shall deduce that Ω cannot be Martin-Löf random, which is impossible.
To get a covering A_k of Ω with measure ≤ 2^(−k), consider a specific n and all n-bit theories. Start generating theorems in each n-bit theory until it yields n + f(n) + k bits of Ω (it does not matter if some of these bits are wrong). The measure of the set of possibilities for Ω covered by the n-bit theories is thus ≤ 2^n × 2^(−n−f(n)−k) = 2^(−f(n)−k). The measure μ(A_k) of the union of the sets of possibilities for Ω covered by n-bit theories with any n is thus

  ≤ Σ_n 2^(−f(n)−k) = 2^(−k) Σ_n 2^(−f(n)) ≤ 2^(−k)   (since Σ 2^(−f(n)) ≤ 1).

Thus Ω is covered by A_k and μ(A_k) ≤ 2^(−k) for every k if there is always an n-bit theory that yields n + f(n) + k bits of Ω, which is impossible. Q.E.D.
Corollary A. If Σ 2^(−f(n)) converges and f is computable, then there is a constant c_f with the property that no n-bit theory ever yields more than n + f(n) + c_f bits of Ω.
Proof. Choose c so that Σ 2^(−f(n)) ≤ 2^c. Then Σ 2^(−[f(n)+c]) ≤ 1, and we can apply Theorem A to f′(n) = f(n) + c. Q.E.D.
Corollary A2. Let Σ 2^(−f(n)) converge and f be computable as before. If g(n) is computable, then there is a constant c_{fg} with the property that no g(n)-bit theory ever yields more than g(n) + f(n) + c_{fg} bits of Ω. For example, consider N of the form 2^(2^n). For such N, no N-bit theory ever yields more than N + f(log log N) + c_{fg} bits of Ω.
Note. Thus for n of special form, i.e., which have concise descriptions, we get better upper bounds on the number of bits of Ω which are yielded by n-bit theories. This is a foretaste of the way algorithmic information theory will be used in Theorem C and Corollary C2 (Sect. 4.4).
Lemma for Second Borel–Cantelli Lemma! For any finite set {x_k} of non-negative real numbers,

  Π (1 − x_k) ≤ 1 / Σ x_k.

Proof. If x is a real number, then

  1 − x ≤ 1/(1 + x).

Thus

  Π (1 − x_k) ≤ 1 / Π (1 + x_k) ≤ 1 / Σ x_k,

since if all the x_k are non-negative,

  Π (1 + x_k) ≥ Σ x_k.

Q.E.D.
Second Borel–Cantelli Lemma (Feller [9]). Suppose that the events A_n have the property that it is possible to determine whether or not the event A_n occurs by examining the first f(n) bits of Ω, where f is a computable function. If the events A_n are mutually independent and Σ Pr{A_n} diverges, then Ω has the property that infinitely many of the A_n must occur.
Proof. Suppose on the contrary that Ω has the property that only finitely many of the events A_n occur. Then there is an N such that the event A_n does not occur if n ≥ N. The probability that none of the events A_N, A_{N+1}, …, A_{N+k} occur is, since the A_n are mutually independent, precisely

  Π_{i=0}^{k} (1 − Pr{A_{N+i}}) ≤ 1 / [Σ_{i=0}^{k} Pr{A_{N+i}}],

which goes to zero as k goes to infinity. This would give us arbitrarily small covers for Ω, which contradicts the fact that Ω is Martin-Löf random. Q.E.D.
Theorem B. If Σ 2^(n−f(n)) diverges and f is computable, then infinitely often there is a run of f(n) zeros between bits 2^n & 2^(n+1) of Ω (2^n ≤ bit < 2^(n+1)). Hence there are rules of inference which have the property that there are infinitely many N-bit theories that yield (the first) N + f(log N) bits of Ω.
Proof. We wish to prove that infinitely often Ω must have a run of k = f(n) consecutive zeros between its 2^n th & its 2^(n+1) th bit position. There are 2^n bits in the range in question. Divide this into nonoverlapping blocks of 2k bits each, giving a total of 2^n/2k blocks. The chance of having a run of k consecutive zeros in each block of 2k bits is

  ≥ k·2^(k−2) / 2^(2k).   (9)

Reason:

• There are 2k − k + 1 ≥ k different possible choices for where to put the run of k zeros in the block of 2k bits.
• Then there must be a 1 at each end of the run of 0's, but the remaining 2k − k − 2 = k − 2 bits can be anything.
• This may be an underestimate if the run of 0's is at the beginning or end of the 2k bits, and there is no room for endmarker 1's.
• There is no room for another 10^k 1 to fit in the block of 2k bits, so we are not overestimating the probability by counting anything twice.

Summing (9) over all 2^n/2k blocks and over all n, we get

  Σ_n [(k·2^(k−2)/2^(2k)) (2^n/2k)] = (1/8) Σ_n 2^(n−k) = (1/8) Σ 2^(n−f(n)) = ∞.

Invoking the second Borel–Cantelli lemma (if the events A_i are independent and Σ Pr{A_i} diverges, then infinitely many of the A_i must occur), we are finished. Q.E.D.
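The lower bound (9) can be checked empirically. The following Monte Carlo sketch in Python (an added illustration, not part of the proof) counts blocks of 2k bits containing a run of exactly k zeros delimited by 1's or by the ends of the block, and compares the observed frequency with k·2^(k−2)/2^(2k).

import random

def has_isolated_run(bits, k):
    """True if the block contains a run of exactly k zeros bounded by 1's
    (the ends of the block count as delimiters)."""
    s = "1" + "".join(map(str, bits)) + "1"
    return "1" + "0" * k + "1" in s

random.seed(1)
k = 4
trials = 200000
hits = sum(has_isolated_run([random.randint(0, 1) for _ in range(2 * k)], k)
           for _ in range(trials))
bound = k * 2 ** (k - 2) / 2 ** (2 * k)
print(hits / trials, ">=", bound)    # observed frequency vs. lower bound (9)
assert hits / trials >= bound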
Corollary B. If Σ 2^(−f(n)) diverges and f is computable and nondecreasing, then infinitely often there is a run of f(2^(n+1)) zeros between bits 2^n & 2^(n+1) of Ω (2^n ≤ bit < 2^(n+1)). Hence there are infinitely many N-bit theories that yield (the first) N + f(N) bits of Ω.
Proof. If Σ 2^(−f(n)) diverges and f is computable and nondecreasing, then by the Cauchy condensation test

  Σ 2^n 2^(−f(2^n))

also diverges, and therefore so does

  Σ 2^n 2^(−f(2^(n+1))).

Hence, by Theorem B, infinitely often there is a run of f(2^(n+1)) zeros between bits 2^n and 2^(n+1). Q.E.D.
Corollary B2. If Σ 2^(−f(n)) diverges and f is computable, then infinitely often there is a run of n + f(n) zeros between bits 2^n & 2^(n+1) of Ω (2^n ≤ bit < 2^(n+1)). Hence there are infinitely many N-bit theories that yield (the first) N + log N + f(log N) bits of Ω.
Proof. Take f(n) = n + f′(n) in Theorem B. Q.E.D.
Theorem AB. (a) There is a c with the property that no n-bit theory ever yields more than n + log n + 2 log log n + c (scattered) bits of Ω.
(b) There are infinitely many n-bit theories that yield (the first) n + log n + log log n bits of Ω.
Proof. Using the Cauchy condensation test, we have seen (beginning of Sect. 2) that

  (a) Σ 1/(n (log n)²) converges, and
  (b) Σ 1/(n log n) diverges.

The theorem follows immediately from Corollaries A and B. Q.E.D.
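Both series facts are easy to observe numerically. In the Python fragment below (an illustration), the partial sums of Σ 1/(n (log n)²) settle toward a limit, while those of Σ 1/(n log n) keep creeping upward, in line with the Cauchy condensation test.

import math

def partial_sums(f, N):
    """Accumulate sum of f(n) for n = 2..N, recording values at powers of 2."""
    total, checkpoints = 0.0, {}
    for n in range(2, N + 1):
        total += f(n)
        if n & (n - 1) == 0:          # n is a power of two
            checkpoints[n] = total
    return checkpoints

convergent = partial_sums(lambda n: 1 / (n * math.log(n) ** 2), 2 ** 20)
divergent  = partial_sums(lambda n: 1 / (n * math.log(n)), 2 ** 20)
for n in (2 ** 10, 2 ** 15, 2 ** 20):
    print(n, round(convergent[n], 4), round(divergent[n], 4))
# The first column of sums approaches a limit; the second grows like log log N.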

4.4. Incompleteness Theorems for Random Reals: H(Axioms)
Theorem C is a remarkable extension of Theorem R6:

• We have seen that [the information content of knowing the first n bits of Ω] is ≥ n − c.
• Now we show that [the information content of knowing any n bits of Ω (their positions and 0/1 values)] is ≥ n − c.
Lemma C. Σ_n #{s | H(s) < n} 2^(−n) = Ω ≤ 1.
Proof.

  1 ≥ Ω = Σ_s 2^(−H(s)) = Σ_n #{s | H(s) = n} 2^(−n)
        = Σ_n #{s | H(s) = n} 2^(−n) Σ_{k≥1} 2^(−k)
        = Σ_n Σ_{k≥1} #{s | H(s) = n} 2^(−n−k)
        = Σ_n #{s | H(s) < n} 2^(−n).

Q.E.D.
Theorem C. If a theory has H(axiom) < n, then it can yield at most n + c (scattered) bits of Ω.
Proof. Consider a particular k and n. If there is an axiom with H(axiom) < n which yields n + k scattered bits of Ω, then even without knowing which axiom it is, we can cover Ω with an r.e. set of intervals of measure

  (#{s | H(s) < n} axioms with H < n) × (2^(−n−k) measure of set of possibilities for Ω) = #{s | H(s) < n} 2^(−n−k).

But by the preceding lemma, we see that

  Σ_n #{s | H(s) < n} 2^(−n−k) = 2^(−k) Σ_n #{s | H(s) < n} 2^(−n) ≤ 2^(−k).

Thus if even one theory with H < n yields n + k bits of Ω, for any n, we get a cover for Ω of measure ≤ 2^(−k). This can only be true for finitely many values of k, or Ω would not be Martin-Löf random. Q.E.D.
Corollary C. No n-bit theory ever yields more than n + H(n) + c bits of Ω.
(Proof: Theorem C and, by Lemma I2, H(axiom) ≤ |axiom| + H(|axiom|) + c.)
Lemma C2. If g(n) is computable and unbounded, then H(n) < g(n) for infinitely many values of n.
Proof. Define the inverse of g as

  g⁻¹(n) = min_{g(k)≥n} k.

Then using Lemmas I and I4 we see that for all sufficiently large values of n,

  H(g⁻¹(n)) ≤ H(n) + O(1) ≤ O(log n) < n ≤ g(g⁻¹(n)).

That is, H(k) < g(k) for all k = g⁻¹(n) and n sufficiently large. Q.E.D.
Corollary C2. Let g(n) be computable and unbounded. For infinitely many n, no n-bit theory yields more than n + g(n) + c bits of Ω.
(Proof: Corollary C and Lemma C2.)
Note. In appraising Corollaries C and C2, the trivial formal systems in which there is always an n-bit axiom that yields the first n bits of Ω should be kept in mind. Also, compare Corollaries C and A, and Corollaries C2 and A2.
In summary,
Theorem D. There is an exponential diophantine equation

  L(n, x₁, …, x_m) = R(n, x₁, …, x_m)   (10)

which has only finitely many solutions x₁, …, x_m if the nth bit of Ω is a 0, and which has infinitely many solutions x₁, …, x_m if the nth bit of Ω is a 1. Let us say that a formal theory "settles k cases" if it enables one to prove that the number of solutions of (10) is finite or that it is infinite for k specific values (possibly scattered) of the parameter n. Let f(n) and g(n) be computable functions.

• Σ 2^(−f(n)) < ∞ ⇒ all n-bit theories settle ≤ n + f(n) + O(1) cases.
• Σ 2^(−f(n)) = ∞ and f(n) ≤ f(n+1) ⇒ for infinitely many n, there is an n-bit theory that settles ≥ n + f(n) cases.
• H(theory) < n ⇒ it settles ≤ n + O(1) cases.
• n-bit theory ⇒ it settles ≤ n + H(n) + O(1) cases.
• g unbounded ⇒ for infinitely many n, all n-bit theories settle ≤ n + g(n) + O(1) cases.

Proof. The theorem combines Theorem R7, Corollaries A and B, Theorem C, and Corollaries C and C2. Q.E.D.

5. Conclusion
In conclusion, we have seen that proving whether particular exponential diophantine equations have finitely or infinitely many solutions is absolutely intractable. Such questions escape the power of mathematical reasoning. This is a region in which mathematical truth has no discernible structure or pattern and appears to be completely random. These questions are completely beyond the power of human reasoning. Mathematics cannot deal with them.
Quantum physics has shown that there is randomness in nature. I believe that we have demonstrated in this paper that randomness is already present in pure mathematics. This does not mean that the universe and mathematics are lawless; it means that laws of a different kind apply: statistical laws.

References
[1] G. J. Chaitin, Information-theoretic computational complexity, IEEE Trans. Inform. Theory 20 (1974), 10–15.
[2] G. J. Chaitin, Randomness and mathematical proof, Sci. Amer. 232, No. 5 (1975), 47–52.
[3] G. J. Chaitin, A theory of program size formally identical to information theory, J. Assoc. Comput. Mach. 22 (1975), 329–340.
[4] G. J. Chaitin, Gödel's theorem and information, Internat. J. Theoret. Phys. 22 (1982), 941–954.
[5] G. J. Chaitin, Randomness and Gödel's theorem, "Mondes en Développement," Vol. 14, No. 53, in press.
[6] R. Courant and H. Robbins, "What is Mathematics?," Oxford Univ. Press, London, 1941.
[7] M. Davis, H. Putnam, and J. Robinson, The decision problem for exponential diophantine equations, Ann. Math. 74 (1961), 425–436.
[8] M. Davis, "The Undecidable: Basic Papers on Undecidable Propositions, Unsolvable Problems and Computable Functions," Raven, New York, 1965.
[9] W. Feller, "An Introduction to Probability Theory and Its Applications, I," Wiley, New York, 1970.
[10] G. H. Hardy, "A Course of Pure Mathematics," 10th ed., Cambridge Univ. Press, London, 1952.
[11] J. P. Jones and Y. V. Matijasevič, Register machine proof of the theorem on exponential diophantine representation of enumerable sets, J. Symbolic Logic 49 (1984), 818–829.
[12] P. Martin-Löf, The definition of random sequences, Inform. Control 9 (1966), 602–619.
[13] C. E. Shannon and W. Weaver, "The Mathematical Theory of Communication," Univ. of Illinois Press, Urbana, 1949.
[14] R. M. Solovay, private communication, 1975.
[15] A. M. Turing, On computable numbers, with an application to the Entscheidungsproblem, Proc. London Math. Soc. 42 (1937), 230–265; also in [8].
ALGORITHMIC ENTROPY OF SETS
Computers & Mathematics with Applications 2 (1976), pp. 233–245
Gregory J. Chaitin
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598, U.S.A.

Abstract
In a previous paper a theory of program size formally identical to information theory was developed. The entropy of an individual finite object was defined to be the size in bits of the smallest program for calculating it. It was shown that this is −log₂ of the probability that the object is obtained by means of a program whose successive bits are chosen by flipping an unbiased coin. Here a theory of the entropy of recursively enumerable sets of objects is proposed which includes the previous theory as the special case of sets having a single element. The primary concept in the generalized theory is the probability that a computing machine enumerates a given set when its program is manufactured by coin flipping. The entropy of a set is defined to be −log₂ of this probability.

1. Introduction
In a classical paper on computability by probabilistic machines [1], de Leeuw et al. showed that if a machine with a random element can enumerate a specific set of natural numbers with positive probability, then there is a deterministic machine that also enumerates this set. We propose to throw further light on this matter by bringing into play the concepts of algorithmic information theory [2,3].
As in [3], we require a computing machine to read the successive bits of its program from a semi-infinite tape that has been filled with 0's and 1's by flipping an unbiased coin, and to decide by itself where to stop reading the program, for there is no endmarker. In [3] this convention has the important consequence that a program can be built up from subroutines by concatenating them.
In this paper we turn from finite computations to unending computations. The computer is used to enumerate a set of objects instead of a single one. An important difference between this paper and [3] is that here it is possible for the machine to read the entire program tape, so that in a sense infinite programs are permitted. However, following [1] it is better to think of these as cases in which a nondeterministic machine uses coin-flipping infinitely often.
Here, as in [3], we pick a universal computer that makes the probability of obtaining any given machine output as high as possible.
We are thus led to define three concepts: P(A), the probability that the standard machine enumerates the set A, which may be called the algorithmic probability of the set A; H(A), the entropy of the set A, which is −log₂ of P(A); and the amount of information that must be specified to enumerate A, denoted I(A), which is the size in bits of the smallest program for A. In other words, I(A) is the least number n such that for some program tape contents the standard machine enumerates the set A and in the process of doing so reads precisely n bits of the program tape.
One may also wish to use the standard machine to simultaneously enumerate two sets A and B, and this leads to the joint concepts P(A, B), H(A, B), and I(A, B). In [3] programs could be concatenated, and this fact carries over here to programs that enumerate singleton sets (i.e. sets with a single element). What about arbitrary sets? Programs
that enumerate arbitrary sets can be merged by interweaving their bits in the order that they are read when running at the same time, that is, in parallel. This implies that the joint probability P(A, B) is not less than the product of the individual probabilities P(A) and P(B), from which it is easy to show that H has all the formal properties of the entropy concept of classical information theory [4]. This also implies that I(A, B) is not greater than the sum of I(A) and I(B).
The purpose of this paper is to propose this new approach and to determine what is the number of sets A that have probability P(A) greater than 2^(−n), in other words, that have entropy H(A) less than n. It must be emphasized that we do not present a complete theory. For example, the relationship between H(A) and I(A) requires further study. In [3] we proved that the difference between H(A) and I(A) is bounded for singleton sets A, but we shall show that even for finite A this is no longer the case.

2. Definitions and Their Elementary Properties
The formal definition of computing machine that we use is the Turing machine. However, we have made a few changes in the standard definition [5, pp. 13–16].
Our Turing machines have three tapes: a program tape, a work tape and an output tape. The program tape is only infinite to the right. It can be read by the machine and it can be shifted to the left. Each square of the program tape contains a 0 or a 1. The program tape is initially positioned at its leftmost square. The work tape is infinite in both directions, can be read, written and erased, and can be shifted in either direction. Each of its squares may contain a blank, a 0, or a 1. Initially all squares are blank. The output tape is infinite in both directions and it can be written on and shifted to the left. Each square may contain a blank or a $. Initially all squares are blank.
A Turing machine with n states, the first of which is its initial state, is defined in a table with 6n entries which is consulted each machine cycle. Each entry corresponds to one of the 6 possible contents of the 2 squares being read, and to one of the n states. All entries must be present, and each specifies an action to be performed and the next state. There are 8 possible actions: program tape left, output tape left, work tape left/right, write blank/0/1 on work tape, and write $ on output tape.
Each way of filling this 6n table produces a different n-state Turing machine M. We imagine M to be equipped with a clock that starts with time 1 and advances one unit each machine cycle. We call a unit of time a quantum. Starting at its initial state M carries out an unending computation, in the course of which it may read all or part of the program tape. The output from this computation is a set of natural numbers A. n is in A iff a $ is written by M on the output tape that is separated by exactly n blank squares from the previous $ on the tape. The time at which M outputs n is defined to be the clock reading when two $'s separated by n blanks appear on the output tape for the first time.
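The machine model just defined can be summarized in a small data-structure sketch. The following Python fragment is only a skeleton of the definition (the one-state example table is hypothetical); it records the 6n-entry table, the 8 actions, and one step of the consultation cycle, without simulating the tapes themselves.

# Skeleton of the three-tape machine model defined above. The table has 6n
# entries: one per state and per combination of (program bit, work symbol).
# This is an illustrative sketch, not a full simulator.

from dataclasses import dataclass

ACTIONS = ("program_left", "output_left", "work_left", "work_right",
           "write_blank", "write_0", "write_1", "write_dollar")

@dataclass
class Machine:
    # table[(state, program_bit, work_symbol)] = (action, next_state);
    # program_bit in "01", work_symbol in " 01": 2 x 3 = 6 combinations.
    table: dict
    state: int = 1                      # state 1 is the initial state

    def step(self, program_bit, work_symbol):
        action, self.state = self.table[(self.state, program_bit, work_symbol)]
        assert action in ACTIONS
        return action

# A hypothetical one-state machine fragment that echoes program bits:
table = {(1, b, w): ("write_0" if b == "0" else "write_1", 1)
         for b in "01" for w in " 01"}
m = Machine(table)
print(m.step("1", " "))                 # -> "write_1"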
Let p be a finite binary sequence (henceforth string) or an infinite binary sequence (henceforth sequence). M(p) denotes the set of natural numbers output (enumerated) by M with p as the contents of the program tape if p is a sequence, and with p written at the beginning of the program tape if p is a string. M(p) is always defined if p is a sequence, but if p is a string and M reads beyond the end of p, then M(p) is undefined. However, instead of saying that M(p) is undefined, we shall say that M(p) halts. Thus for any string p, M(p) is either defined or halts. If M(p) halts, the clock reading when M reads past the end of p is said to be the time at which M(p) halts.
Definition.

• P_M(A) is the probability that M(p) = A if each bit of the sequence p is obtained by a separate toss of an unbiased coin. In other words, P_M(A) is the probability that a program tape produced by coin flipping makes M enumerate A.
• H_M(A) = −log₂ P_M(A) (= ∞ if P_M(A) = 0).
• I_M(A) is the number of bits in the smallest string p such that M(p) = A (= ∞ if no such p exists).

We now pick a particular universal Turing machine U, having the ability to simulate any other machine, as the standard one for use throughout this paper. U has the property that for each M there is a string π_M such that for all sequences p, U(π_M p) = M(p) and U reads exactly as much of p as M does. To be more precise, π_M = 0^g 1, where g is the Gödel number for M. That is to say, g is the position of M in a standard list of all possible Turing machine defining tables.
Definition.

• P(A) = P_U(A) is the algorithmic probability of the set A.
• H(A) = H_U(A) is the algorithmic entropy of the set A.
• I(A) = I_U(A) is the algorithmic information of the set A.

The qualification "algorithmic" is usually omitted below.
We say that a string or sequence p is a program for A if U(p) = A. If U(p) = A and p is a string of I(A) bits, then p is said to be a minimal-size program for A. The recursively enumerable (r.e.) sets are defined to be those sets of natural numbers A for which I(A) < ∞. (This is equivalent to the standard definition [5, p. 58].) As there are nondenumerably many sets of natural numbers and only denumerably many r.e. sets, most A have I(A) = ∞.
The following theorem, whose proof is immediate, shows why U is a good machine to use. First some notation must be explained. f(x) ≲ g(x) means that ∃c ∀x f(x) ≤ c g(x). f(x) ≳ g(x) means that g(x) ≲ f(x). And f(x) ≍ g(x) means that f(x) ≲ g(x) and f(x) ≳ g(x). O(f(x)) denotes an F(x) with the property that there are constants c₁ and c₂ such that for all x, |F(x)| ≤ |c₁ f(x)| + |c₂|, where f(x) is to be replaced by 0 if it is undefined for a particular value of x.
Theorem 1. P(A) ≳ P_M(A), H(A) ≤ H_M(A) + O(1), and I(A) ≤ I_M(A) + O(1).
Definition.

• A join B = {2n : n ∈ A} ∪ {2n + 1 : n ∈ B} [5, pp. 81, 168]. Enumerating A join B is equivalent to simultaneously enumerating A and B (see the sketch following these definitions).
• P(A, B) = P(A join B) (joint probability)
• H(A, B) = H(A join B) (joint entropy)
• I(A, B) = I(A join B) (joint information)
• P(A/B) = P(A, B)/P(B) (conditional probability)
• H(A/B) = −log₂ P(A/B) = H(A, B) − H(B) (conditional entropy)
• P(A : B) = P(A)P(B)/P(A, B) (mutual probability)
• H(A : B) = −log₂ P(A : B) = H(A) + H(B) − H(A, B) (mutual entropy).
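The join operation, and the merging of two enumerations in parallel that lies behind the inequality P(A, B) ≳ P(A)P(B), can be sketched in Python as follows (an illustration; the two example sets are arbitrary).

from itertools import count, islice

def join(A, B):
    """A join B = {2n : n in A} union {2n+1 : n in B}; the generators A and
    B are interleaved, so the joint enumeration proceeds in parallel."""
    A, B = iter(A), iter(B)
    while True:
        yield 2 * next(A)
        yield 2 * next(B) + 1

evens = (2 * n for n in count())        # the set {0, 2, 4, ...}
squares = (n * n for n in count())      # the set {0, 1, 4, 9, ...}
print(list(islice(join(evens, squares), 10)))

Interleaving the two generators is the programming counterpart of interweaving the bits of two programs in the order in which they are read.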
Theorem 2.
(a) P(A) ≥ 2^(−I(A));
(b) H(A) ≤ I(A);
(c) for singleton A, H(A) = I(A) + O(1);
(d) H(A) < ∞ implies I(A) < ∞.
Proof. (a) and (b) are immediate; (c) is Theorem 3.5(b) [3]; (d) follows from Theorem 2 [1].
Theorem 3.
(a) P(A, B) ≍ P(B, A);
(b) P(A, A) ≍ P(A);
(c) P(A, ∅) ≍ P(A);
(d) P(A/A) ≍ 1;
(e) P(A/∅) ≍ P(A);
(f) P(A, B) ≳ P(A)P(B);
(g) P(A/B) ≳ P(A);
(h) Σ_A P(A, B) ≍ P(B);
(i) P(A, B) ≲ P(B);
(j) Σ_A P(A/B) ≍ 1.
Proof. The proof is straightforward. For example, (f) was shown in Section 1. And (h) follows from the fact that there is a π = 0^g 1 such that n ∈ U(πp) iff 2n + 1 ∈ U(p). Thus P(B) ≥ 2^(−|π|) Σ_A P(A, B), which taken together with (b) yields (h). Here, and henceforth, the absolute value |s| of a string s signifies the number of bits in s.
The remainder of the proof is omitted.
Theorem 4.
(a) H(A, B) = H(B, A) + O(1);
(b) H(A, A) = H(A) + O(1);
(c) H(A, ∅) = H(A) + O(1);
(d) H(A/A) = O(1);
(e) H(A/∅) = H(A) + O(1);
(f) H(A, B) ≤ H(A) + H(B) + O(1);
(g) H(A/B) ≤ H(A) + O(1);
(h) H(A) ≤ H(A, B) + O(1);
(i) H(A : ∅) = O(1);
(j) H(A : A) = H(A) + O(1);
(k) H(A : B) = H(B : A) + O(1);
(l) H(A : B) = H(A) − H(A/B) + O(1).
Theorem 5.
(a) I(A, B) = I(B, A) + O(1);
(b) I(A, A) = I(A) + O(1);
(c) I(A, B) ≤ I(A) + I(B) + O(1);
(d) I(A) ≤ I(A, B) + O(1);
(e) I(A) = I(A, {n : n < I(A)}) + O(1).
The proofs of Theorems 4 and 5 are straightforward and are omitted.

2′. The Oracle Machine U′
In order to study P, H, and I, which are defined in terms of U, we shall actually need to study a more powerful machine called U′, which, unlike U, could never actually be built. U′ is almost identical to U, but it cannot be built because it contains one additional feature, an oracle that gives U′ yes/no answers to specific questions of the form "Does U(p) halt?" U′ can ask the oracle such questions whenever it likes. An oracle is needed because of a famous theorem on the undecidability of the halting problem [5, pp. 24–26], which states that there is no algorithm for answering these questions. U′ is a special case of the general concept of relative recursiveness [5, pp. 128–134].
As a guide to intuition it should be stated that the properties of U′ are precisely analogous to those of U; one simply imagines a universe exactly like ours except that sealed oracle boxes can be computer subsystems. We now indicate how to modify Section 2 so that it applies to U′ instead of U.
One begins by allowing an oracle machine M to indicate in each entry of its table one of 9 possible actions (before there were 8). The new possibility is to ask the oracle if the string s currently being read on the work tape has the property that U(s) halts. In response the oracle instantly writes a 1 on the work tape if the answer is yes and writes a 0 if the answer is no.
After defining an arbitrary oracle machine M, and P′_M, H′_M, and I′_M, one then defines the standard oracle machine U′ which can simulate any M. The next step is to define P′(A), H′(A), and I′(A), which are the probability, entropy, and information of the set A relative to the halting problem. Furthermore, p is said to be an oracle program for A if U′(p) = A, and a minimal-size one if in addition |p| = I′(A). Then A is defined to be r.e. in the halting problem if I′(A) < ∞. One sees as before that P′ is maximal and H′ and I′ are minimal, and then defines the corresponding joint, conditional, and mutual concepts. Lastly one formulates and proves the corresponding Theorems 2′, 3′, 4′, and 5′.
Theorem 6. P′(A) ≳ P(A), H′(A) ≤ H(A) + O(1), and I′(A) ≤ I(A) + O(1).
Proof. There is a π = 0^g 1 such that for all sequences p, U′(πp) = U(p) and U′ reads precisely as much of p as U does.

3. Summary and Discussion of Results
The remainder of this paper is devoted to counting the number of sets A of different kinds having information I(A) less than n and having entropy H(A) less than n. The kinds of A we shall consider are: singleton sets, consecutive sets, finite sets, cofinite sets, and arbitrary sets. A is consecutive if it is finite and n + 1 ∈ A implies n ∈ A. A is cofinite if it contains all but finitely many natural numbers.
The following 4 pairs of estimates will be demonstrated in this paper. The first pair is due to Solovay [6]. #X denotes the cardinality of X. S_n denotes the Singleton set {n}. C_n denotes the Consecutive set {k : k < n}.

  log₂ #{singleton A : I(A) < n} = n − I(S_n) + O(1).
  log₂ #{singleton A : H(A) < n} = n − I(S_n) + O(1).

  log₂ #{consecutive A : I(A) < n} = n − I(C_n) + O(log I(C_n)).
  log₂ #{A : I(A) < n} = n − I(C_n) + O(log I(C_n)).

  log₂ #{consecutive A : H(A) < n} = n − I′(S_n) + O(1).
  log₂ #{finite A : H(A) < n} = n − I′(S_n) + O(1).

  log₂ #{cofinite A : H(A) < n} = n − I′(C_n) + O(log I′(C_n)).
  log₂ #{A : H(A) < n} = n − I′(C_n) + O(log I′(C_n)).
270 Part IV|Technical Papers on Self-Delimiting Programs
These estimates are expressed in terms of I(S_n), I(C_n), I′(S_n) and I′(C_n). These quantities are variations on a theme: specifying the natural number n in a more or less constructive manner. I(S_n) is the number of bits of information needed to directly calculate n. I(C_n) is the number of bits of information needed to obtain n in the limit from below. I′(S_n) is the number of bits of information needed to directly calculate n using an oracle for the halting problem. And I′(C_n) is the number of bits of information needed to obtain n in the limit from below using an oracle for the halting problem. The following theorem, whose straightforward proof is omitted, gives some facts about these quantities and the relationship between them.
Theorem 7. Properties of I(S_n), I(C_n), I′(S_n) and I′(C_n):
(a) All four quantities vary smoothly. For example, |I(S_n) − I(S_m)| ≤ O(log |n − m|), and the same inequality holds for the other three quantities.
(b) For most n all four quantities are log₂ n + O(log log n). Such n are said to be random because they are specified by table look-up without real computation.
(c) The four ways of specifying n are increasingly indirect:

  I′(C_n) ≤ I′(S_n) + O(1),
  I′(S_n) ≤ I(C_n) + O(1), and
  I(C_n) ≤ I(S_n) + O(1).

(d) Occasionally n is random with respect to one kind of specification, but has a great deal of pattern and its description can be considerably condensed if more indirect means of specification are allowed. For example, the least n ≥ 2^k such that I(S_n) ≥ k has the following properties: n < 2^(k+1), I(S_n) = k + O(log k), and I(C_n) ≤ log₂ k + O(log log k). This relationship between I(S_n) and I(C_n) also holds for I(C_n) and I′(S_n), and for I′(S_n) and I′(C_n).
We see from Theorem 7(b) that all 4 pairs of estimates for log₂ #n are usually n − log₂ n + O(log log n) and thus close to each other. But Theorem 7(c) shows that the 4 pairs are shown above in what is essentially ascending numerical order. In fact, by Theorem 7(d), for each k there is an n such that k = log₂ n + O(1) and one pair of estimates is that

  log₂ #n = n − log₂ n + O(log log n),

while the next pair is that

  log₂ #n = n − (a quantity ≤ log₂ log₂ n) + O(log log log n).

Hence each pair of cardinalities can be an arbitrarily small fraction of the next pair.
Having examined the comparative magnitude of these cardinalities, we obtain two corollaries.
As was pointed out in Theorem 2(c), for singleton sets I(A) = H(A) + O(1). Suppose consecutive sets also had this property. Then using the fifth estimate and Theorem 7(a) one would immediately conclude that #{consecutive A : I(A) < n} ≍ #{consecutive A : H(A) < n}. But we have seen that the first of these cardinalities can be an arbitrarily small fraction of the second one. This contradiction shows that consecutive sets do not have the property that I(A) = H(A) + O(1). Nevertheless, in Section 5 it is shown that these sets do have the property that I(A) = H(A) + O(log H(A)). Further research is needed to clarify the relationship between I(A) and H(A) for A that are neither singleton nor consecutive.
It is natural to ask what is the relationship between the probabilities of sets and the probabilities of their unions, intersections, and complements. P(A ∪ B) ≳ P(A, B) ≳ P(A)P(B), and the same inequality holds for P(A ∩ B). But is P(Ā) ≳ P(A)? If this were the case, since the complement of a cofinite set is finite, using the sixth estimate and Theorem 7(a) it would immediately follow that #{finite A : H(A) < n} is ≳ #{cofinite A : H(A) < n}. But we have seen that the first of these cardinalities can be an arbitrarily small fraction of the second. Hence it is not true that P(Ā) ≳ P(A). However, in Section 7 it is shown that P′(Ā) ≳ P(A).
Corollary 1.
(a) For consecutive A it is not true that I(A) = H(A) + O(1).
(b) For cofinite A it is not true that P(Ā) ≳ P(A).
4. The Estimates Involving Singleton Sets
The following theorem and its proof are due to Solovay [6], who formulated them in a string-entropy setting.
Definition. Consider a program p for a singleton set A. The bits of p which have not been read by U by the time the element of A is output are said to be superfluous.
Theorem 8.
(a) log₂ #{singleton A : I(A) < n} = n − I(S_n) + O(1).
(b) log₂ #{singleton A : H(A) < n} = n − I(S_n) + O(1).
Proof. (b) follows immediately from (a) by using Theorems 2(c) and 7(a). To prove (a) we break it up into two assertions: an upper bound on log₂ #, and a lower bound.
Let us start by explaining how to mend a minimal-size program for a singleton set. The program is mended by replacing each of its superfluous bits by a 0 and adding an endmarker 1 bit.
There is a π = 0^g 1 such that if p is a mended minimal-size program for S_j, then U(πp) = {|p| − 1} = {I(S_j)}. π accomplishes this by instructing U to execute p in a special way: when U would normally output the first number, it instead immediately advances the program tape to the endmarker 1 bit, outputs the amount of tape that has been read, and goes to sleep.
The crux of the matter is that with this π,

  P(S_m) ≥ #{j : I(S_j) = m} 2^(−|π|−m−1),

and so

  #{j : I(S_j) = m} ≲ P(S_m) 2^m.

Substituting n − k for m and summing over all k from 1 to n, we obtain

  #{j : I(S_j) < n} ≲ P(S_n) 2^n Σ_{k=1}^{n} (P(S_{n−k})/P(S_n)) 2^(−k).

It is easy to see that P(S_n) ≳ P(S_k) P(S_{n−k}), and so P(S_{n−k})/P(S_n) ≲ 1/P(S_k) ≲ k². Hence the above summation is

  ≲ Σ_{k=1}^{n} k² 2^(−k),

which converges for n = ∞. Thus

  #{j : I(S_j) < n} ≲ P(S_n) 2^n.

Taking logarithms of both sides and using Theorem 2(c) we finally obtain

  log₂ #{j : I(S_j) < n} ≤ n − I(S_n) + O(1).

This upper bound is the first half of the proof of (a). To complete the proof we now obtain the corresponding lower bound.
There is a π = 0^g 1 with the following property. Concatenate π, a minimal-size program p for S_n with all superfluous bits deleted, and an arbitrary string s that brings the total number of bits up to n − 1. π is chosen so that U(πps) = S_k, where k has the property that s is a binary numeral for it.
π instructs U to proceed as follows with the rest of its program, which consists of the subroutine p followed by n − 1 − |πp| bits of data s. First U executes p to obtain n. Then U calculates the size of s, reads s, converts s to a natural number k, outputs k, and goes to sleep.
The reason for considering this π is that log₂ of the number of possible choices for s is |s| = |πps| − |πp| = n − 1 − |π| − |p| ≥ n − 1 − |π| − I(S_n). And each choice of s yields a different singleton set S_k = U(πps) such that I(S_k) ≤ |πps| = n − 1. Hence

  log₂ #{k : I(S_k) < n} ≥ n − 1 − |π| − I(S_n) = n − I(S_n) + O(1).

The proof of (a), and thus of (b), is now complete.
Theorem 8′.
(a) log₂ #{singleton A : I′(A) < n} = n − I′(S_n) + O(1).
(b) log₂ #{singleton A : H′(A) < n} = n − I′(S_n) + O(1).
Proof. Imagine that the proofs of Theorem 8 and its auxiliary theorems refer to U′ instead of U.

5. The Remaining Estimates Involving I(A)

Definition.
• Q(n) = Σ P(A) (#A < n) is the probability that a set has less than n elements.
• Q(n)^t is the probability that with a program tape produced by coin flipping, U outputs less than n different numbers by time t. Note that Q(n)^t can be calculated from n and t, and is a rational number of the form k/2^t because U can read at most t bits of program by time t.
Lemma 1.
(a) Q(0) = 0, Q(n) ≤ Q(n + 1), lim_{n→∞} Q(n) < 1.
(b) Q(0)^t = 0, Q(n)^t ≤ Q(n + 1)^t, lim_{n→∞} Q(n)^t = 1.
(c) For n > 0, Q(n)^0 = 1, Q(n)^t ≥ Q(n)^{t+1}, lim_{t→∞} Q(n)^t = Q(n).
(d) If A is finite, then Q(#A + 1) − Q(#A) ≥ P(A).
Theorem 9. If A is consecutive and P(A) > 2^{−n}, then I(A) ≤ n + I(C_n) + O(1).
Proof. There is a ρ = 0^g 1 with the following property. After reading ρ, U expects to find on its program tape a string of length I(C_n) + n which consists of a minimal-size program p for C_n appropriately merged with the binary expansion of a rational number x = j/2^n (0 ≤ j < 2^n). In parallel U executes p to obtain C_n, reads x, and outputs a consecutive set. This is done in stages.
U begins stage t (t = 1, 2, 3, …) by simulating one more time quantum of the computation that yields C_n. During this simulation, whenever it is necessary to read another bit of the program U supplies this bit by reading the next square of the actual program tape. And whenever the simulated computation produces a new output (this will occur n times), U instead takes this as a signal to read the next bit of x from the program tape. Let x_t denote the value of x based on what U has read from its program tape by stage t. Note that 0 ≤ x_t ≤ x_{t+1} and lim_{t→∞} x_t = x < 1.
In the remaining portion of stage t U does the following. It calculates Q(k)^t for k = 0, 1, 2, … until Q(k)^t = 1. Then it determines m_t, which is the greatest value of k for which Q(k)^t ≤ x_t. Note that
since Q(0)^t = 0 there is always such a k. Also, since Q(k)^t is monotone decreasing in t, and x_t is monotone increasing, it follows that m_t is also monotone increasing in t. Finally U outputs the m_t natural numbers less than m_t, and proceeds to stage t + 1.
This concludes the description of the instructions incorporated in ρ. ρ is now used to prove the theorem by showing that if A is consecutive and P(A) > 2^{−n}, then I(A) ≤ n + I(C_n) + |ρ|.
As pointed out in the lemma, Q(#A + 1) − Q(#A) ≥ P(A) > 2^{−n}. It follows that the open interval of real numbers between Q(#A) and Q(#A + 1) contains a rational number x of the form j/2^n (0 ≤ j < 2^n). It is not difficult to see that one obtains a program for A that is |ρ| + I(C_n) + n bits long by concatenating ρ and the result of merging in an appropriate fashion a minimal-size program for C_n with the binary expansion of x. Hence I(A) ≤ n + I(C_n) + |ρ| = n + I(C_n) + O(1).
Theorem 10. If A is consecutive, I(A) = H(A) + O(log H(A)) and H(A) = I(A) + O(log I(A)).
Proof. Consider a consecutive set A. By Theorem 7(a), I(C_n) = O(log n). Restating Theorem 9, if H(A) < n then I(A) ≤ n + I(C_n) + O(1) = n + O(log n). Taking n = H(A) + 1, we see that I(A) ≤ H(A) + O(log H(A)). Moreover, H(A) ≤ I(A) (Theorem 2(b)). Hence I(A) = H(A) + O(log H(A)), and thus I(A) = H(A) + O(log I(A)).
Theorem 11. log₂ #{A : I(A) < n} ≤ n − I(C_n) + O(log I(C_n)).
Proof. There is a ρ = 0^g 1 with the following property. Let p be an arbitrary sequence, and suppose that U reads precisely m bits of the program p. Then U(ρp) = C_m. ρ accomplishes this by instructing U to execute p in a special way: normal output is replaced by a continually updated indication of how many bits of program have been read.
The crux of the matter is that with this ρ

P(C_m) ≥ #{A : I(A) = m} 2^{−|ρ|−m},

and so

#{A : I(A) = m} ≾ P(C_m) 2^m.

Replacing m by n − k and summing over all k from 1 to n, we obtain

#{A : I(A) < n} ≾ P(C_n) 2^n Σ_{k=1}^{n} (P(C_{n−k})/P(C_n)) 2^{−k}.
It is easy to see that P(C_n) ≿ P(C_k)P(C_{n−k}), and so P(C_{n−k})/P(C_n) ≾ 1/P(C_k) ≾ k². Hence the above summation is ≾

Σ_{k=1}^{n} k² 2^{−k},

which converges for n = ∞. Thus

#{A : I(A) < n} ≾ P(C_n) 2^n.

Taking logarithms of both sides and using

log₂ P(C_n) = −I(C_n) + O(log I(C_n))

(Theorem 10), we finally obtain

log₂ #{A : I(A) < n} ≤ n − I(C_n) + O(log I(C_n)).
Theorem 12.

log₂ #{consecutive A : I(A) < n} ≥ n − I(C_n) + O(log I(C_n)).

Proof. There is a ρ = 0^g 1 that is used in the following manner. Concatenate these strings: ρ, a minimal-size program for {I(C_n)} with all superfluous bits deleted, a minimal-size program for C_n, and an arbitrary string s of size sufficient to bring the total number of bits up to n − 1. Call the resulting (n − 1)-bit string p. Note that s is at least n − 1 − |ρ| − I({I(C_n)}) − I(C_n) bits long. Hence log₂ of the number of possible choices for s is, taking Theorem 7(a) into account, at least n − I(C_n) + O(log I(C_n)).
ρ instructs U to proceed as follows with the rest of p, which consists of two subroutines and the data s. First U executes the first subroutine in order to calculate the size of the second subroutine and know where s begins. Then U executes the second subroutine, and uses each new number output by it as a signal to read another bit of the data s. Note that U will never know when it has finished reading s. As U reads the string s, it interprets s as the reversal of the binary numeral for a natural number m. And U contrives to enumerate the set C_m by outputting 2^k consecutive natural numbers each time the kth bit of s that is read is a 1.
To recapitulate, for each choice of s one obtains an (n − 1)-bit program p for a different consecutive set (in fact, the set C_m, where s is the reversal of a binary numeral for m). Inasmuch as log₂ of the number of possible choices for s was shown to be at least n − I(C_n) + O(log I(C_n)), we conclude that log₂ #{consecutive A : I(A) < n} is ≥ n − I(C_n) + O(log I(C_n)).
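The signaling trick at the end of this proof is easy to make concrete; a minimal sketch (Python, illustrative only; the function name is ours):

def enumerate_C_m(s):
    # output 2**k consecutive numbers whenever the kth bit read is a 1;
    # the total output is C_m, where the reversal of s is a binary numeral for m
    out, nxt = [], 0
    for k, bit in enumerate(s):
        if bit == "1":
            out.extend(range(nxt, nxt + 2 ** k))
            nxt += 2 ** k
    return out

assert enumerate_C_m("011") == list(range(6))   # reversal "110" is the numeral for 6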
Theorem 13.
(a) log₂ #{consecutive A : I(A) < n} = n − I(C_n) + O(log I(C_n)).
(b) log₂ #{A : I(A) < n} = n − I(C_n) + O(log I(C_n)).
Proof. Since #{consecutive A : I(A) < n} ≤ #{A : I(A) < n}, this follows immediately from Theorems 12 and 11.
Theorem 13′.
(a) log₂ #{consecutive A : I′(A) < n} = n − I′(C_n) + O(log I′(C_n)).
(b) log₂ #{A : I′(A) < n} = n − I′(C_n) + O(log I′(C_n)).
Proof. Imagine that the proofs of Theorem 13 and its auxiliary theorems refer to U′ instead of U.
6. The Remaining Lower Bounds

In this section we construct many consecutive sets and cofinite sets with probability greater than 2^{−n}. To do this, computations using an oracle for the halting problem are simulated using a fake oracle that answers that U(p) halts iff it does so within time t. As t goes to infinity, any finite set of questions will eventually be answered correctly by the fake oracle. This simulation in the limit ρ is used to: (a) take any n-bit oracle program p for a singleton set and construct from it a consecutive set U(ρpx) with probability greater than or equal to 2^{−|ρ|−n}, and (b) take any n-bit oracle program p for a consecutive set and construct from it a cofinite set U(ρpx) with probability greater than or equal to 2^{−|ρ|−n}.
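The fake oracle itself is trivial to express; a minimal sketch (Python, with halts_by a hypothetical table of halting times, None meaning the program never halts):

halts_by = {"p1": 7, "p2": None, "p3": 2}

def fake_oracle(program, t):
    # stage-t answer: the program halts iff it halts within t time quanta
    h = halts_by[program]
    return h is not None and h <= t

# any finite set of questions is answered correctly for all large enough t:
assert [fake_oracle(p, 10) for p in ("p1", "p2", "p3")] == [True, False, True]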
The critical feature of the simulation in the limit that accomplishes (a) and (b) can best be explained in terms of two notions: harmless overshoot and erasure. The crux of the matter is that although in the limit the fake oracle realizes its mistakes and changes its mind, U may already have read beyond p into x. This is called overshoot, and could make the probability of the constructed set fall far below 2^{−|ρ|−n}. But the construction process contrives to make overshoot harmless by eventually forgetting bits in x and by erasing its mistakes. In case (a) erasure is accomplished by moving the end of the consecutive set. In case (b) erasure is accomplished by filling in holes that were left in the cofinite set. As a result bits in x do not affect which set is enumerated; they can only affect the time at which its elements are output.
Lemma 2. With our Turing machine model, if k is output at time ≤ t, then k < t < 2^t.
Theorem 14. There is a ρ = 0^g 1 with the following property. Suppose the string p is an oracle program for S_k. Let t1 be the time at which k is output. Consider the finite set of questions that are asked to the oracle during these t1 time quanta. Let t2 be the maximum time taken to halt by any program that the oracle is asked about. (t2 = 0 if none of them halt or if no questions are asked.) Finally, let t = max(t1, t2). Then for all sequences x, ρpx is a program for the set C_l, where l = 2^t + k. By the lemma k can be recovered from l.
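The recovery of k from l = 2^t + k is elementary, since Lemma 2 guarantees k < 2^t; a one-line sketch (Python, for illustration only):

def recover(l):
    # l = 2**t + k with k < 2**t: t is the position of the leading 1 bit of l
    t = l.bit_length() - 1
    return t, l - 2 ** t

assert recover(2 ** 7 + 5) == (7, 5)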
Proof. ρ instructs U to act as follows on px. Initially U sets i = 0. Then U works in stages. At stage t (t = 1, 2, 3, …) U simulates t time quanta of the computation U′(px), but truncates the simulation immediately if U′ outputs a number. U fakes the halting-problem oracle used by U′ by answering that a program halts iff it takes ≤ t time quanta to do so. Did an output k occur during the simulated computation? If not, nothing more is done at this stage. If so, U does the following. First it sets i = i + 1. Let L_i be the chronological list of yes/no answers given by the fake oracle during the simulation. U checks whether i = 1 or L_{i−1} ≠ L_i. (Note that L_{i−1} = L_i iff the same questions were asked in the same order and all the answers are the same.) If i > 1 and L_{i−1} = L_i, U does nothing at this stage. If i = 1 or L_{i−1} ≠ L_i, U outputs all natural numbers less than 2^t + k, and proceeds to stage t + 1.
It is not difficult to see that this ρ proves the theorem.
Theorem 15. log₂ #{consecutive A : H(A) < n} ≥ n − I′(S_n) + O(1).
Proof. By Theorem 14, c = |ρ| has the property that for each singleton set S_k such that I′(S_k) < n − c there is a different l such that P(C_l) > 2^{−n}. Hence in view of Theorems 8(a)′ and 7(a),

log₂ #{consecutive A : H(A) < n}
 ≥ log₂ #{singleton A : I′(A) < n − c}
 ≥ n − c − I′(S_{n−c}) + O(1)
 = n − I′(S_n) + O(1).
Theorem 16. There is a ρ = 0^g 1 with the following property. Suppose the string p is an oracle program for the finite set A. For each k ∈ A, let t1_k be the time at which it is output. Also, let t2_k be the maximum time taken to halt by any program that the oracle is asked about during these t1_k time quanta. Finally, let t_k = max(t1_k, t2_k), and l_k = 2^{t_k} + k. Then for all sequences x, ρpx is a program for the cofinite set B = all natural numbers not of the form l_k (k ∈ A). By the lemma each k in A can be recovered from the corresponding l_k.
Proof. ρ instructs U to act as follows on px in order to produce B. U works in stages. At stage t (t = 1, 2, 3, …) U simulates t time quanta of the computation U′(px). U fakes the halting-problem oracle used by U′ by answering that a program halts iff it takes ≤ t time quanta to do so. While simulating U′(px), U notes the time at which each output k occurs. U also keeps track of the latest stage at which a change occurred in the chronological list of yes/no answers given by the fake oracle during the simulation before k is output. Thus at stage t there are current estimates for t1_k, for t2_k, and for t_k = max(t1_k, t2_k), for each k that currently seems to be in U′(px). As t goes to infinity these estimates will attain the true values for k ∈ A, and will not exist or will go to infinity for k ∉ A.
Meanwhile U enumerates B. That part of B output by stage t consists precisely of all natural numbers less than 2^{t+1} that are not of the form 2^{t_k} + k, for any k in the current approximation to U′(px); here t_k denotes the current estimate of its value.
It is not difficult to see that this ρ proves the theorem.
Theorem 17.

log₂ #{cofinite A : H(A) < n} ≥ n − I′(C_n) + O(log I′(C_n)).
Proof. By Theorem 16, c = |ρ| has the property that for each consecutive set A such that I′(A) < n − c there is a different cofinite set B such that P(B) > 2^{−n}. Hence in view of Theorems 13(a)′ and 7(a),

log₂ #{cofinite B : H(B) < n}
 ≥ log₂ #{consecutive A : I′(A) < n − c}
 ≥ n − c − I′(C_{n−c}) + O(log I′(C_{n−c}))
 = n − I′(C_n) + O(log I′(C_n)).

Corollary 2. There is a ρ = 0^g 1 with the property that for every sequence p, the complement of U(ρp) has exactly #U′(p) elements.
Proof. The ρ in the proof of Theorem 16 has this property.
7. The Remaining Upper Bounds

In this section we use several approximations to P(A), and the notion of the canonical index of a finite set A [5, pp. 69–71]. This is defined to be Σ 2^k (k ∈ A), and it establishes a one-to-one correspondence between the natural numbers and the finite sets of natural numbers. Let D_i be the finite set whose canonical index is i. We also need to use the concept of a recursive real number, which is a real x for which one can compute a convergent sequence of nested open intervals with rational end-points that contain x [5, pp. 366, 371]. This is the formal definition corresponding to the intuitive notion that a computable real number is one whose decimal expansion can be calculated. The recursive reals constitute a field.
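As a concrete illustration (a sketch in Python, not from the paper), the canonical index and its inverse simply read the binary expansion as a bit-set:

def canonical_index(A):
    # the canonical index of a finite set A: the sum of 2**k for k in A
    return sum(2 ** k for k in A)

def D(i):
    # D_i: the finite set whose canonical index is i, i.e. the positions
    # of the 1 bits in the binary expansion of i
    return {k for k in range(i.bit_length()) if (i >> k) & 1}

assert canonical_index({0, 2, 3}) == 13 and D(13) == {0, 2, 3}
assert canonical_index(set(range(5))) == 2 ** 5 - 1   # C_5 has index 2^5 - 1

The last assertion is the fact used in the proof of Theorem 23(f) below: the canonical index of C_k is 2^k − 1.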
Definition. Consider a sequence p produced by flipping an unbiased coin.
• P(A)^t = the probability that (the output by time t of U(p)) = A.
Let s be an arbitrary string.
• P(s)^t = the probability that (∀k < |s|) [k ∈ (the output by time t of U(p)) iff the kth bit of s is a 1].
• P(s) = the probability that (∀k < |s|) [k ∈ U(p) iff the kth bit of s is a 1].
Note that P(D_i)^t is a rational number that can be calculated from i and t, and P(s)^t is a rational number that can be calculated from s and t.
Lemma 3.
(a) If A is finite, then P(A) = lim_{t→∞} P(A)^t.
(b) P(s) = lim_{t→∞} P(s)^t.
(c) P(Λ) = 1.
(d) P(s) = P(s0) + P(s1).
(e) Consider a set A. Let a_n be the n-bit string whose kth bit is a 1 iff k ∈ A. Then P(a_0) = 1, P(a_n) ≥ P(a_{n+1}), and lim_{n→∞} P(a_n) = P(A).
Theorem 18.
(a) P(D_i) is a real recursive in the halting problem, uniformly in i.
(b) P(s) is a real recursive in the halting problem, uniformly in s.
This means that given i and s one can use the oracle to obtain these real numbers as the limit of a convergent sequence of nested open intervals with rational end-points.
Proof. Note that P(D_i) > n/m iff there is a k such that P(D_i)^k > n/m and for all t > k, P(D_i)^t ≥ P(D_i)^k. One can use the oracle to check whether or not a given i, n, m and k have this property, for there is a ρ = 0^g 1 such that U(ρ0^i10^n10^m10^k1) does not halt iff P(D_i)^t ≥ P(D_i)^k > n/m for all t > k. Thus if P(D_i) > n/m one will eventually discover this by systematically checking all possible quadruples i, n, m and k. Similarly, one can use the oracle to discover that P(D_i) < n/m, that P(s) > n/m, and that P(s) < n/m. This is equivalent to the assertion that P(D_i) and P(s) are reals recursive in the halting problem, uniformly in i and s.
Theorem 19. P′(S_i) ≿ P(D_i).
Proof. It follows from Theorem 18(a) that there is a ρ = 0^g 1 with the following property. Consider a real number x in the interval between 0 and 1 and the sequence p_x that is its binary expansion. Then U′(ρp_x) = S_i if x is in the open interval I_i of real numbers between Σ_{k<i} P(D_k) and Σ_{k≤i} P(D_k). This shows that c = |ρ| has the property that P′(S_i) ≥ 2^{−c} (the length of the interval I_i) = 2^{−c} P(D_i). (See [7, pp. 14–15] for a construction that is analogous.)
Theorem 20.
(a) log₂ #{consecutive A : H(A) < n} = n − I′(S_n) + O(1).
(b) log₂ #{finite A : H(A) < n} = n − I′(S_n) + O(1).
Proof. From Theorems 15, 19, 8(b)′, and 7(a), we see that

n − I′(S_n) + O(1)
 ≤ log₂ #{consecutive A : H(A) < n}
 ≤ log₂ #{finite A : H(A) < n}
 ≤ log₂ #{singleton A : H′(A) < n + c}
 ≤ n + c − I′(S_{n+c}) + O(1)
 = n − I′(S_n) + O(1).
Theorem 21. I′(Ā join A) ≤ H(A) + O(1).
Proof. Let us start by associating with each string s an interval I_s of length P(s). First of all, I_Λ is the interval of reals between 0 and 1, which is okay because P(Λ) = 1. Then each I_s is partitioned into two parts: the subinterval I_{s0} of length P(s0), followed by the subinterval I_{s1} of length P(s1). This works because P(s) = P(s0) + P(s1).
There is a ρ = 0^g 1 which makes U′ behave as follows. After reading ρ, U′ expects to find the sequence p_x, the binary expansion of a real number x between 0 and 1. Initially U′ sets s = Λ. U′ then works in stages. At stage k (k = 0, 1, 2, …) U′ initially knows that x is in the interval I_s, and contrives to decide whether it is in the subinterval I_{s0} or in the subinterval I_{s1}. To do this U′ uses the oracle to calculate the end-points of these intervals with arbitrarily high precision, by means of the technique indicated in the proof of Theorem 18(b). And of course U′ also has to read p_x to know the value of x, but it only reads the
program tape when it is forced to do so in order to make a decision (this is the crux of the proof). If U′ decides that x is in I_{s0} it outputs 2k and sets s = s0. If it decides that x is in I_{s1} it outputs 2k + 1 and sets s = s1. Then U′ proceeds to the next stage.
Why does this show that I′(Ā join A) ≤ H(A) + O(1)? From part (e) of the lemma it is not difficult to see that to each r.e. set A there corresponds an open interval I_A of length P(A) consisting of reals x with the property that U′(ρp_x) = Ā join A. Moreover U′ only reads as much of p_x as is necessary; in fact, if P(A) > 2^{−n} there is an x in I_A for which this is at most n + O(1) bits. Hence I′(Ā join A) ≤ |ρ| + H(A) + O(1) = H(A) + O(1).
Theorem 22.
(a) log₂ #{cofinite A : H(A) < n} = n − I′(C_n) + O(log I′(C_n)).
(b) log₂ #{A : H(A) < n} = n − I′(C_n) + O(log I′(C_n)).
Proof. From Theorems 17, 21, 5(a)′, 5(d)′, 13(b)′ and 7(a) we see that

n − I′(C_n) + O(log I′(C_n))
 ≤ log₂ #{cofinite A : H(A) < n}
 ≤ log₂ #{A : H(A) < n}
 ≤ log₂ #{A : I′(A) < n + c}
 ≤ n + c − I′(C_{n+c}) + O(log I′(C_{n+c}))
 = n − I′(C_n) + O(log I′(C_n)).

Corollary 3. P′(Ā) ≿ P(A).
Proof. By Theorems 2(b)′, 5(d)′ and 21, H′(Ā) ≤ I′(Ā) ≤ I′(Ā join A) + O(1) ≤ H(A) + O(1). Hence P′(Ā) ≿ P(A).
8. The Probability of the Set of Natural Numbers Less than N

In the previous sections we established the results that were announced in Section 3. The techniques that were used to do this can also be applied to a topic of a somewhat different nature, P(C_n).
P(C_n) sheds light on two interesting quantities: Q₁(n), the probability that a set has cardinality n, and Q₀(n), the probability that the complement of a set has cardinality n. We also consider a gigantic function G(n), which is the greatest natural number that can be obtained in the limit from below with probability greater than 2^{−n}.
Definition.
• Q₁(n) = Σ P(A) (#A = n).
• Q₀(n) = Σ P(A) (#Ā = n).
• G(n) = max k (P(C_k) > 2^{−n}).
• Let μ be the defective probability measure on the sets of natural numbers that is defined as follows: μA = Σ Q₁(n) (n ∈ A).
• Let ν be an arbitrary probability measure, possibly defective, on the sets of natural numbers. ν is said to be a C-measure if there is a function u(n, t) such that u(n, t) ≥ u(n, t + 1) and νC_n = lim_{t→∞} u(n, t). Here it is required that u(n, t) be a rational number that can be calculated from n and t. In other words, ν is a C-measure if νC_n can be obtained as a monotone limit from above uniformly in n.
Theorem 23.
(a) Q₁(n) ≍ P(C_n).
(b) Q₀(n) ≍ P′(C_n).
(c) μ is a C-measure.
(d) If ν is a C-measure, then νA ≾ μA.
(e) If H′(S_k) < n + O(1), then k < G(n).
(f) H′(S_{G(n)}) = n + O(1).
Proof.
(a) Note that Q₁(n) ≥ P(C_n). Also, there is a ρ = 0^g 1 such that U(ρp) = C_{#U(p)} for all sequences p. Hence P(C_n) ≿ Q₁(n).
(b) Keep part (a) in mind. By Corollary 2, Q₀(n) ≿ Q₁′(n) ≍ P′(C_n). And since P′(Ā) ≿ P(A) (Corollary 3), Q₀(n) ≾ Q₁′(n) ≍ P′(C_n).
(c) Lemma 1(c) states that the function Q(n)^t defined in Section 5 plays the role of u(n, t).
(d) A construction similar to the proof of Theorem 9 shows that there is a ρ = 0^g 1 with the following property. Consider a real number x between 0 and 1 and the sequence p_x that is its binary expansion. U(ρp_x) = C_n if x is in the open interval I_n of reals between νC_n and νC_{n+1}.
This proves part (d) because the length of the interval I_n is precisely νS_n, and hence μS_n = Q₁(n) ≥ 2^{−|ρ|} νS_n.
(e) By Theorem 2(c)′, if P′(S_k) > 2^{−n} then I′(S_k) < n + O(1). Hence by Theorem 14, there is an l > k such that P(C_l) ≿ 2^{−n}. Thus k < l ≤ G(n + O(1)).
(f) Note that the canonical index of C_k is 2^k − 1. It follows from Theorem 19 that if P(C_k) > 2^{−n}, then P′({2^k − 1}) ≿ 2^{−n}. There is a ρ = 0^g 1 such that U′(ρp) = S_k if U′(p) = {2^k − 1}. Hence if P(C_k) > 2^{−n}, then P′(S_k) ≿ P′({2^k − 1}) ≿ 2^{−n}. In other words, if P(C_k) > 2^{−n} then H′(S_k) ≤ n + O(1). Note that by definition P(C_{G(n)}) > 2^{−n}. Hence H′(S_{G(n)}) ≤ n + O(1). Thus in view of (e), H′(S_{G(n)}) = n + O(1).
Addendum

An important advance in the line of research proposed in this paper has been achieved by Solovay [8] with the aid of a crucial lemma of D. A. Martin; he shows that

I(A) ≤ 3H(A) + O(log H(A)).

In [9] and [10] certain aspects of the questions treated in this paper are examined from a somewhat different point of view.
References

[1] K. de Leeuw, E. F. Moore, C. E. Shannon and N. Shapiro, Computability by probabilistic machines, in Automata Studies, C. E. Shannon and J. McCarthy (Eds.), pp. 183–212. Princeton University Press, N.J. (1956).
[2] G. J. Chaitin, Randomness and mathematical proof, Scient. Am. 232 (5), 47–52 (May 1975).
[3] G. J. Chaitin, A theory of program size formally identical to information theory, J. Ass. Comput. Mach. 22 (3), 329–340 (July 1975).
[4] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. University of Illinois, Urbana (1949).
[5] H. Rogers, Jr., Theory of Recursive Functions and Effective Computability. McGraw-Hill, N.Y. (1967).
[6] R. M. Solovay, unpublished manuscript on [3] dated May 1975.
[7] S. K. Leung-Yan-Cheong and T. M. Cover, Some inequalities between Shannon entropy and Kolmogorov, Chaitin, and extension complexities, Technical Report 16, Dept. of Statistics, Stanford University, CA (October 1975).
[8] R. M. Solovay, On random r.e. sets, Proceedings of the Third Latin American Symposium on Mathematical Logic, Campinas, Brazil (July 1976), (to appear).
[9] G. J. Chaitin, Information-theoretic characterizations of recursive infinite strings, Theor. Comput. Sci. 2, 45–48 (1976).
[10] G. J. Chaitin, Program size, oracles, and the jump operation, Osaka J. Math. (to appear).
Communicated by J. T. Schwartz
Received July 1976
Part V
Technical Papers on
Blank-Endmarker Programs
INFORMATION-THEORETIC LIMITATIONS OF FORMAL SYSTEMS

Journal of the ACM 21 (1974), pp. 403–424

Gregory J. Chaitin¹
Buenos Aires, Argentina
Abstract

An attempt is made to apply information-theoretic computational complexity to metamathematics. The paper studies the number of bits of instructions that must be given to a computer for it to perform finite and infinite tasks, and also the amount of time that it takes the computer to perform these tasks. This is applied to measuring the difficulty of proving a given set of theorems, in terms of the number of bits of axioms that are assumed, and the size of the proofs needed to deduce the theorems from the axioms.
Key Words and Phrases: complexity of sets, computational complexity, difficulty of theorem-proving, entropy of sets, formal systems, Gödel's incompleteness theorem, halting problem, information content of sets, information content of axioms, information theory, information/time trade-offs, metamathematics, random strings, recursive functions, recursively enumerable sets, size of proofs, universal computers

CR Categories: 5.21, 5.25, 5.27, 5.6
1. Introduction

This paper attempts to study information-theoretic aspects of computation in a very general setting. It is concerned with the information that must be supplied to a computer for it to carry out finite or infinite computational tasks, and also with the time it takes the computer to do this. These questions, which have come to be grouped under the heading of abstract computational complexity, are considered to be of interest in themselves. However, the motivation for this investigation is primarily its metamathematical applications.
Computational complexity differs from recursive function theory in that, instead of just asking whether it is possible to compute something, one asks exactly how much effort is needed to do this. Similarly, instead of the usual metamathematical approach, we propose to measure the difficulty of proving something. How many bits of axioms are needed
¹ Copyright © 1974, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.
An early version of this paper was presented at the Courant Institute Computational Complexity Symposium, New York, October 1971. [28] includes a nontechnical exposition of some results of this paper. [1] and [2] announce related results.
Author's address: Rivadavia 3580, Dpto. 10A, Buenos Aires, Argentina.
to be able to obtain a set of theorems? How long are the proofs needed to demonstrate them? What is the trade-off between how much is assumed and the size of the proofs?
We consider the axioms of a formal system to be a program for listing the set of theorems, and the time at which a theorem is written out to be the length of its proof.
We believe that this approach to metamathematics may yield valuable dividends. Mathematicians were at first greatly shocked at, and then ignored almost completely, Gödel's announcement that no set of axioms for number theory is complete. It wasn't clear what, in practice, was the significance of Gödel's theorem, how it should affect the everyday activities of mathematicians. Perhaps this was because the unprovable propositions appeared to be very pathological singular points.²,³
The approach of this paper, in contrast, is to measure the power of a set of axioms, to measure the information that it contains. We shall see that there are circumstances in which one only gets out of a set of axioms what one puts in, and in which it is possible to reason in the following manner. If a set of theorems constitutes t bits of information, and a set of axioms contains less than t bits of information, then it is impossible to deduce these theorems from these axioms.
We consider that this paper is only a first step in the direction of such an approach to metamathematics;⁴ a great deal of work remains to be done to clarify these matters. Nevertheless, we would like to sketch here the conclusions which we have tentatively drawn.⁵
² In [3] and [4] von Neumann analyzes the effect of Gödel's theorem upon mathematicians. Weyl's reaction to Gödel's theorem is quoted by Bell [5]. The original source is [6]. See also Weyl's discussion [7] of Gödel's views regarding his incompleteness theorem.
³ For nontechnical expositions of Gödel's incompleteness theorem, see [8; 9; 10, Sec. 1, pp. xv–xviii; 11; and 12]. [28] contains a nontechnical exposition of an incompleteness theorem analogous to Berry's paradox that is Theorem 4.1 of this paper.
⁴ [13–16] are related in approach to this paper. [13, 15, and 16] are concerned with measuring the size of proofs and the effect of varying the axioms upon their size. In [14] Cohen "measures the strength of a [formal] system by the ordinals which can be handled in the system."
⁵ The analysis that follows of the possible significance of the results of this paper has been influenced by [17 and 18], in addition to the references cited in Footnote
After empirically exploring, in the tradition of Euler and Gauss, the properties of the natural numbers one may discover interesting regularities. One then has two options. The first is to accept the conjectures one has formulated on the basis of their empirical corroboration, as an experimental scientist might do. In this way one may have a great many laws to remember, but will not have to bother to deduce them from other principles. The other option is to try to find a theory for one's observations, or to see if they follow from existing theory. In this case it may be possible to reduce a great many observations into a few general principles from which they can be deduced. But there is a cost: one can now only arrive at the regularities one observed by means of long demonstrations.
Why use formal systems, instead of proceeding empirically? First of all, if the empirically derived conjectures aren't independent facts, reducing them to a few common principles allows one to have to remember less assumptions, and this is easier to do, and is much safer, as one is assuming less. The cost is, of course, the size of the proofs.
What attitude, then, does this suggest toward Gödel's theorem that any formalization of number theory is incomplete? It tends to provide theoretical justification for the attitude that number theorists have in fact adopted when they extensively utilize in their work hypotheses such as that of Riemann concerning the zeta function. Gödel's theorem does not mean that mathematicians must give up hope of understanding the properties of the natural numbers; it merely means that one may have to adopt new axioms as one seeks to order and interrelate, to organize and comprehend, ever more extensive mathematical observations. I.e. the mathematician shouldn't be more upset than the physicist when he needs to assume a new axiom, nor should he be too horrified when an axiom must be abandoned because it is found that it contradicts previously existing theory, or because it predicts properties of the natural numbers that are not corroborated empirically. In a word, we propose that there may be theoretical justification for regarding number theory somewhat more like a dynamic empirical science than as a closed static body of theory.
This paper grew out of work on the concept of an individual random,
2. Incidentally, it is interesting to examine [19, p. 112] in the light of this analysis.
patternless, chaotic, unpredictable string of bits. This concept has been rigorously defined in several ways, and the properties of these random strings have been studied by several authors (see, for example, [20–28]). Most strings are random; they have no special distinguishing features; they are typical and hard to tell apart. But can it be proved that a particular string is random? The answer is that about n bits of axioms are needed to be able to prove that a particular n-bit string is random.
More precisely, the train of thought was as follows. The entropy, or information content, or complexity, of a string is defined to be the number of bits needed to specify it so effectively that it can be constructed. A random n-bit string is about n bits of information, i.e. has complexity/entropy/information content ≈ n; there is essentially nothing better to do if one wishes to specify such a string than just show it directly. But the string consisting of 1,000,000 repetitions of the 6-bit pattern 000101 has far less than 6,000,000 bits of complexity. We have just specified it using far fewer bits.
What if one wishes to be able to determine each string of complexity ≤ n and its complexity? It turns out that this requires n + O(1) bits of axioms: at least n − c bits are necessary (Theorem 4.1), and n + c bits are sufficient (Theorem 4.3). But the proofs will be enormously long unless one essentially directly takes as axioms all the theorems that one wishes to prove, and in that case there will be an enormously great number of bits of axioms (Theorem 7.6(c)).
Another theme of this paper arises from the following metamathematical considerations, which are well known (see, for example, [29]). In a formal system without a decision method, it is impossible to bound the size of a proof of a theorem by a recursive function of the number of characters in the statement of the theorem. For if there were such a function f, one could decide whether or not an arbitrary proposition p is a theorem, by merely checking if a proof for it appears among the finitely many possible proofs of size bounded by f of the number of characters in p.
Thus, in a formal system having no decision method, there are very profound theorems, theorems that have short statements, but need immensely long proofs. In Section 10 we study the function e(n), necessarily nonrecursive, defined to be the least s such that all theorems of the formal system with ≤ n characters have proofs of size ≤ s.
To close this introduction, we would like to mention without proof an example that shows particularly clearly the relationship between the number of bits of axioms that are assumed and what can be deduced. This example is based on the work of M. Davis, Ju. V. Matisjasevič, H. Putnam, and J. Robinson that settled Hilbert's tenth problem (cf. [30]). There is a polynomial P in k + 2 variables with integer coefficients that has the following property. Consider the infinite string whose ith bit is 1 or 0 depending on whether or not the set

S_i = {n ∈ N | ∃x₁, …, x_k ∈ N, P(i, n, x₁, …, x_k) = 0}

is infinite. Here N denotes the natural numbers. This infinite binary sequence is random, i.e. the complexity of an initial segment is asymptotic to its length. What is the number of bits of axioms that is needed to be able to prove for each natural number i < n whether or not the set S_i is infinite? By using the methods of Section 4, it is easy to see that the number of bits of axioms that is needed is asymptotic to n.
2. Definitions Related to Computers and Complexity

This paper is concerned with measuring the difficulty of computing finite and infinite sets of binary strings. The binary strings are considered to be ordered in the following fashion: Λ, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, … In order to be able to also study the difficulty of computing finite or infinite sets of natural numbers, we consider each binary string to simultaneously be a natural number: the nth binary string corresponds to the natural number n. Ordinal numbers are considered to start with 0, not 1. For example, we speak of the 0th string of length n.
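This ordering and its inverse are easy to compute; the following sketch (Python, added here for illustration and not part of the original paper) pairs each natural number n with the nth binary string by writing n + 1 in base two and deleting the leading 1.

def nth_string(n):
    # the nth binary string: 0 -> "", 1 -> "0", 2 -> "1", 3 -> "00", ...
    return bin(n + 1)[3:]

def string_to_number(s):
    # inverse map: the string s corresponds to int("1" + s, 2) - 1
    return int("1" + s, 2) - 1

assert [nth_string(n) for n in range(8)] == ["", "0", "1", "00", "01", "10", "11", "000"]
assert all(string_to_number(nth_string(n)) == n for n in range(100))

Note that the length of the nth string is then len(nth_string(n)) = ⌊log₂(n + 1)⌋, in agreement with the definition of lg below.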
In order to be able to study the difficulty of computing finite and infinite sets of mathematical propositions, we also consider that each binary string is simultaneously a proposition. Propositions use a finite alphabet of characters which we suppose includes all the usual mathematical symbols. We consider the nth binary string to correspond to the nth proposition, where the propositions are in lexicographical order defined by an arbitrary ordering of the symbols of their alphabet.
Henceforth, we say \string" instead of \binary string," it being un-
derstood that this refers to a binary string. It should be clear from the
context whether we are considering something to be a string, a natural
number, or a proposition.
Operations with strings include exponentiation: 0k and 1k denote
the string of k 0's and k 1's, respectively. lg(s) denotes the length of a
string s. Note that the length lg(n) of a natural number n is therefore
blog2 (n + 1)c. The maximum element of a nite set of strings S is
denoted by max S , and we stipulate that max  = 0. #(S ) denotes the
number of elements in a nite set S .
We use these notational conventions in a somewhat tricky way to indicate how to compactly code several pieces of information into a single string. Two coding techniques are used (both are made concrete in the code sketch following (b) below).
(a) Consider two natural numbers n and k such that 0 ≤ k < 2^n. We code n and k into the string s = 0^n + k, i.e. the kth string of length n. Given the string s, one recovers n and k as follows: n = lg(s), k = s − 0^{lg(s)}. This technique is used in the proofs of Theorems 4.3, 6.1, 7.4, and 10.1. In three of these proofs k is #(S), where S is a subset of the strings having length < n; n and #(S) are coded into the string s = 0^n + #(S). In the case of Theorem 6.1, k is the number that corresponds to a string s of length < n (thus 0 ≤ k < 2^n − 1); n and s are coded into the string s′ = 0^n + s.
(b) Consider a string p and a natural number k. We code p and k into the string s = 0^{lg(k)}1kp, i.e. the string consisting of lg(k) 0's followed by a 1, followed by the kth string, followed by the string p. The length of the initial run of 0's is the same as the length of the kth string and is used to separate kp in two and recover k and p from s. Note that lg(s) = lg(p) + 2 lg(k) + 1. This technique is used in the proof of Theorem 10.4. The proof of Theorem 4.1 uses a simpler technique: p and k are coded into the string s = 0^k 1p. But this coding is less economical, for lg(s) = lg(p) + k + 1.
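The following sketch makes the two codings concrete (Python; the function names are ours, not the paper's, and nth_string is the helper from the earlier sketch).

def nth_string(k):
    return bin(k + 1)[3:]                   # the kth binary string, as before

def code_a(n, k):
    # coding (a): n and k (0 <= k < 2**n) become the kth string of length n
    assert 0 <= k < 2 ** n
    return format(k, "b").zfill(n) if n > 0 else ""

def decode_a(s):
    return len(s), int(s, 2) if s else 0    # n = lg(s), k = s - 0^lg(s)

def code_b(p, k):
    # coding (b): s = 0^lg(k) 1 (kth string) p, of length lg(p) + 2 lg(k) + 1
    kth = nth_string(k)
    return "0" * len(kth) + "1" + kth + p

def decode_b(s):
    run = len(s) - len(s.lstrip("0"))       # length of the initial run of 0's
    return s[2 * run + 1:], int("1" + s[run + 1: 2 * run + 1], 2) - 1

assert decode_a(code_a(5, 19)) == (5, 19)
assert decode_b(code_b("1101", 6)) == ("1101", 6)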
We use an extremely general definition of computer; this has the advantage that if one can show that something is difficult to compute using any such computer, this will be a very strong result. A computer is defined by indicating whether it has halted and what it has output, as a function of its program and the time. The formal definition of a computer C is an ordered pair ⟨C, H_C⟩ consisting of two total recursive functions

C : X* × N → {S ∈ 2^{X*} | S is finite}  and  H_C : X* × N → X.

Here X = {0, 1}, X* is the set of all strings, and N is the set of all natural numbers. It is assumed that the functions C and H_C have the following two properties:
(a) C(p, t) ⊆ C(p, t + 1), and
(b) if H_C(p, t) = 1, then H_C(p, t + 1) = 1 and C(p, t) = C(p, t + 1).
C(p, t) is the finite set of strings output by the computer C up to time t when its program is p. If H_C(p, t) = 1 the computer C is halted at time t when its program is p. If H_C(p, t) = 0 the computer C isn't halted at time t when its program is p. Henceforth whether H_C(p, t) = 1 or 0 will be indicated by stating that "C(p, t) is halted" or that "C(p, t) isn't halted." Property (a) states that C(p, t) is the cumulative output, and property (b) states that a computer that is halted remains halted and never outputs anything else.
C(p), the output of the computation that C performs when it is given the program p, is defined to be ⋃_t C(p, t). It is said that "C(p) halts" iff there is a t such that C(p, t) is halted. Furthermore, if C(p) halts, the time at which it halts is defined to be the least t such that C(p, t) is halted. We say that the program p calculates the finite set S when run on C if C(p) = S and halts. We say that the program p enumerates the finite or infinite set S when run on C if C(p) = S.
We now define a class of computers that are especially suitable to use for measuring the information needed to specify a computation. A computer U is said to be universal if it has the following property. For any computer C, there must be a natural number, denoted sim(C) (the cost of simulating C), such that the following holds. For any program p, there exists a program p′ such that: lg(p′) ≤ lg(p) + sim(C), U(p′) halts iff C(p) halts, and U(p′) = C(p).
The idea of this definition is as follows. The universal computers are information-theoretically the most economical ones; their programs
Information-Theoretic Limitations of Formal Systems 299
are shortest. More precisely, a universal computer U is able to simulate any other computer, and the program p′ for U that simulates the program p for C need not be much longer than p. If there are instructions of length n for computing something using the computer C, then there are instructions of length ≤ n + sim(C) for carrying out the same computation using U; i.e. at most sim(C) bits must be added to the length of the instructions, to indicate the computer that is to be simulated. Note that we do not assume that there is an effective procedure for obtaining p′ given C and p. We have no need for the concept of an effectively universal computer in this paper. Nevertheless, the most natural examples of universal computers are effectively universal. See the Appendix for examples of universal computers.
We shall suppose that a particular universal computer U has somehow been chosen, and shall use it as our standard computer for measuring the information needed to specify a computation. The choice of U corresponds to the choice of the standard of measurement.
We now define I(S), the information needed to calculate the finite set S, and I_e(S), the information needed to enumerate the finite or infinite set S:

I(S) = min lg(p) (U(p) = S and halts),
I_e(S) = min lg(p) (U(p) = S),

in both cases ∞ if there are no such p.

We say that I(S) is the complexity of the finite set S, and that I_e(S) is the e-complexity (enumeration complexity) of the finite or infinite set S. Note that I(S) is the number of bits in the shortest program for U that calculates S, and I_e(S) is the number of bits in the shortest program for U that enumerates S. Also, I_e(S), the e-complexity of a set S, is ∞ if S isn't r.e. (recursively enumerable).
We say that a program p such that U(p) = S and halts is a description of S, and a program p such that U(p) = S is an e-description (enumeration description) of S. Moreover, if U(p) = S and halts and lg(p) = I(S), then we say that p is a minimal description of S. Likewise, if U(p) = S and lg(p) = I_e(S), then we say that p is a minimal e-description of S.
300 Part V|Technical Papers on Blank-Endmarker Programs
Finally, we define I_e(f), the e-complexity of a partial function f. This is defined to be the e-complexity of the graph of f, i.e. the set of all ordered pairs of the form (n, f(n)). Here the ordered pair (i, j) is defined to be the natural number (2i + 1)2^j − 1; this is an effective 1–1 correspondence between the ordered pairs of natural numbers and the natural numbers. Note that I_e(f) = ∞ if f isn't partial recursive.
Before considering basic properties of these concepts, we introduce an abbreviated notation. Instead of I({s}) and I_e({s}) we shall write I(s) and I_e(s); i.e. the complexity or e-complexity of a string is defined to be the complexity or e-complexity of its singleton set.
We now present basic properties of these concepts. First of all, note that there are precisely 2^n programs of length n, and 2^{n+1} − 1 programs of length ≤ n. It follows that the number of different sets of complexity n and the number of different sets of e-complexity n are both ≤ 2^n. Also, the number of different sets of complexity ≤ n and the number of different sets of e-complexity ≤ n are both ≤ 2^{n+1} − 1; i.e. the number of different objects of complexity or e-complexity ≤ n is bounded by the number of different descriptions or e-descriptions of length ≤ n, which is 2^{n+1} − 1. Thus it might be said that almost all sets are arbitrarily complex.
It is immediate from the definition of complexity and of a universal computer that I_e(C(p)) ≤ lg(p) + sim(C), and I(C(p)) ≤ lg(p) + sim(C) if C(p) halts. This is used often, and without explicit mention. The following theorem lists for reference other basic properties of complexity and e-complexity that are used in this paper.
Theorem 2.1.
(a) There is a c such that for all strings s, I(s) ≤ lg(s) + c.
(b) There is a c such that for all finite sets S, I(S) ≤ max S + c.
(c) For any computer C, there is a c such that for all programs p, I_e(C(p)) ≤ I(p) + c, and I(C(p)) ≤ I(p) + c if C(p) halts.
Proof. (a) There is a computer C such that C(s) = {s} and halts for all programs s. Thus I(s) ≤ lg(s) + sim(C).
(b) There is a computer C such that C(p) halts for all programs p, and n ∈ C(p) iff n < lg(p) and the nth bit of p is a 1. Thus I(S) ≤ max S + 1 + sim(C).
(c) There is a computer C′ that does the following when it is given the program s. First C′ simulates running s on U, i.e. it simulates U(s). If and when U(s) halts, C′ has determined the set calculated by U when it is given the program s. If this isn't a singleton set, C′ halts. If it is a singleton set {p}, C′ then simulates running p on C. As C′ determines the strings output by C(p), it also outputs them. And C′ halts if C halts during the simulated run.
In summary, C′(s) = C(p) and halts iff C(p) does, if s is a description of the program p. Thus, if s is a minimal description of the string p, then

I_e(C(p)) = I_e(C′(s)) ≤ lg(s) + sim(C′) = I(p) + sim(C′), and
I(C(p)) = I(C′(s)) ≤ lg(s) + sim(C′) = I(p) + sim(C′) if C(p) halts.

Q.E.D.
It follows from Theorem 2.1(a) that all strings of length n have complexity ≤ n + c. In conjunction with the fact that < 2^{n−k} strings are of complexity < n − k, this shows that the great majority of the strings of length n are of complexity ≈ n. These are the random strings of length n. By taking C = U in Theorem 2.1(c), it follows that there is a c such that for any minimal description p, I(p) + c ≥ I(U(p)) = lg(p). Thus minimal descriptions are highly random strings. Likewise, minimal e-descriptions are highly random. This corresponds in information theory to the fact that the most informative messages are the most unexpected ones, the ones with least regularities and redundancies, and appear to be noise, not meaningful messages.
3. Definitions Related to Formal Systems

This paper deals with the information and time needed to carry out computations. However, we wish to apply these results to formal systems. This section explains how this is done.
The abstract definition used by Post, that a formal system is an r.e. set of propositions, is close to the viewpoint of this paper (see [31]).⁶

⁶ For standard definitions of formal systems, see, for example, [32–34] and [10, p. 117].
However, we are not quite this unconcerned with the internal details of formal systems.
The historical motivation for formal systems was of course to construct deductive theories with completely objective, formal criteria for the validity of a demonstration. Thus, a fundamental characteristic of a formal system is an algorithm for checking the validity of proofs. From the existence of this proof verification algorithm, it follows that the set of all theorems that can be deduced from the axioms p by means of the rules of inference by proofs ≤ t characters in length is given by a total recursive function C of p and t. To calculate C(p, t) one applies the proof verification algorithm to each of the finitely many possible demonstrations having ≤ t characters.
These considerations motivate the following definition. The rules of inference of a class of formal systems is a total recursive function C : X* × N → {S ∈ 2^{X*} | S is finite} with the property that C(p, t) ⊆ C(p, t + 1). The value of C(p, t) is the finite (possibly empty) set of the theorems that can be proven from the axioms p by means of proofs ≤ t in size. Here p is a string and t is a natural number. C(p) = ⋃_t C(p, t) is the set of theorems that are consequences of the axioms p. The ordered pair ⟨C, p⟩, which implies both the choice of rules of inference and axioms, is a particular formal system.
Note that this definition is the same as the definition of a computer with the notion of "halting" omitted. Thus given any rules of inference, there is a computer that never halts whose output up to time t consists precisely of those propositions that can be deduced by proofs of size ≤ t from the axioms the computer is given as its program. And given any computer, there are rules of inference such that the set of theorems that can be deduced by proofs of size ≤ t from the program, is precisely the set of strings output by the computer up to time t. For this reason we consider the following notions to be synonymous: "computer" and "rules of inference," "program" and "axioms," and "output up to time t" and "theorems with proofs of size ≤ t."
The rules of inference that correspond to the universal computer U are especially interesting, because they permit axioms to be very economical. When using the rules of inference U, the number of bits of axioms needed to deduce a given set of propositions is precisely the e-complexity of the set of propositions. If n bits of axioms are needed to
obtain a set T of theorems using the rules of inference U, then at least n − sim(C) bits of axioms are needed to obtain them using the rules of inference C; i.e. if C(a) = T, then lg(a) ≥ I_e(T) − sim(C). Thus it could be said that U is among the rules of inference that permit axioms to be most economical. In Section 4 we are interested exclusively in the number of bits needed to deduce certain sets of propositions, not in the size of the proofs. We shall therefore only consider the rules of inference U in Section 4, i.e. formal systems of the form ⟨U, p⟩.
As a final comment regarding the rules of inference U, we would like to point out the interesting fact that if these rules of inference are used, then a minimal set of axioms for obtaining a given set of theorems must necessarily be random. This is just another way of saying that a minimal e-description is a highly random string, which was mentioned at the end of Section 2.
The following theorem also plays a role in the interpretation of our results in terms of formal systems.
Theorem 3.1. Let f be a recursive function, and g be a recursive predicate.
(a) Let C be a computer. There is a computer C′ that never halts such that C′(p, t) = {f(s) | s ∈ C(p, t) & g(s)} for all p and t.
(b) There is a c such that I_e({f(s) | s ∈ S & g(s)}) ≤ I_e(S) + c for all r.e. sets S.
Proof. (a) is immediate; (b) follows by taking C = U in part (a). Q.E.D.
The following is an example of the use of Theorem 3.1. Suppose we wish to study the size of the proofs that "n ∈ H" in a formal system ⟨C, p⟩, where n is a numeral for a natural number. If we have a result concerning the speed with which any computer can enumerate the set H, we apply this result to the computer C′ that has the property that n ∈ C′(p, t) iff "n ∈ H" ∈ C(p, t) for all n, p, and t. In this case the predicate g selects those strings that are propositions of the form "n ∈ H," and the function f transforms "n ∈ H" to n.
Here is another kind of example. Suppose there is a computer C that enumerates a set H very quickly. Then there is a computer C′
that enumerates propositions of the form "n ∈ H" just as quickly. In this case the predicate g is taken to be always true, and the function f transforms n to "n ∈ H."
4. The Number of Bits of Axioms Needed to Determine the Complexity of Specific Strings

The set of all programs that halt when run on U is r.e. Similarly, the set of all true propositions of the form "I(s) ≤ n" where s is a string and n is a natural number, is an r.e. set. In other words, if a program halts, or if a string is of complexity less than or equal to n, one will eventually find this out. To do this one need only try on U longer and longer test runs of more and more programs, and do this in a systematic way.
The problem is proving that a program doesn't halt, or that a string is of complexity greater than n. In this section we study how many bits of axioms are needed, and, as was pointed out in Section 3, it is sufficient to consider only the rules of inference U. We shall see that with n bits of axioms it is impossible to prove that a particular string is of complexity greater than n + c, where c doesn't depend on the particular axioms chosen (Theorem 4.1). It follows from this that if a formal system has n bits of axioms, then there is a program of length ≤ n + c that doesn't halt, but the fact that this program doesn't halt can't be proven in this formal system (Theorem 4.2).
Afterward, we show that n + c bits of axioms suffice to be able to determine each program of length not greater than n that halts (Theorem 4.4), and thus to determine each string of complexity less than or equal to n, and its complexity (Theorem 4.3). Furthermore, the remaining strings must be of complexity greater than n.
Next, we construct an r.e. set of strings P that has the property that infinitely many strings aren't in it, but a formal system with n bits of axioms can't prove that a particular string isn't an element of P if the length of this string is greater than n + c (Theorem 4.5). It follows that P is what Post called a simple set; that is, P is r.e., and
its complement is infinite, but contains no infinite r.e. subset (see [31], Sec. 5, pp. 319–320). Moreover, n + c bits suffice to determine each string of length not greater than n that isn't an element of P.
Finally, we show that not only are n bits of axioms insufficient to exhibit a string of complexity > n + c, they are also insufficient to exhibit (by means of an e-description) an r.e. set of e-complexity greater than n + c (Theorem 4.6). This is because no set can be of e-complexity much greater than the complexity of one of its e-descriptions, and thus I_e(U(s)) > k implies I(s) > k − c, where c is a constant that doesn't depend on s.
Although these results clarify how many bits of axioms are needed to determine the complexity of individual strings, they raise several questions regarding the size of proofs.
n + c bits of axioms suffice to determine each string of complexity ≤ n and its complexity, but the method used here to do this appears to be extremely slow; that is, the proofs appear to be extremely long. Is this necessarily the case? The answer is "yes," as is shown in Section 7.
We have pointed out that there is a formal system having as theorems all true propositions of the form "U(p) halts." The size of the proof that U(p) halts must grow faster than any recursive function f of lg(p). For suppose that such a recursive bound on the length of proofs existed. Then all true propositions of the form "U(p) doesn't halt" could be enumerated by checking to see if there is no proof that U(p) halts of size < f(lg(p)). This is impossible, by Theorem 4.2. The size of these proofs is studied in Section 10.
Theorem 4.1. (a) There is a c such that for all programs p, if a proposition of the form "I(s) > n" (s a string, n a natural number) is in U(p) only if I(s) > n, then "I(s) > n" is in U(p) only if n < lg(p) + c.
In other words: (b) There is a c such that for all formal systems ⟨U, p⟩, if "I(s) > n" is a theorem of ⟨U, p⟩ only if it is true, then "I(s) > n" is a theorem of ⟨U, p⟩ only if n < lg(p) + c.
For any r.e. set of propositions T, one obtains the following from (a) by taking p to be a minimal e-description of T: (c) If T has the property that "I(s) > n" is in T only if I(s) > n, then T has the property that "I(s) > n" is in T only if n < I_e(T) + c.
Idea of Proof. The following is essentially the classical Berry
paradox.7 For each natural number n greater than 1, consider \the
least natural number that can't be de ned in less than N characters."
Here N denotes the numeral for the number n. This is a blog10 nc + c
character phrase de ning a number that supposedly needs at least n
characters to be de ned. This is a paradox if blog10 nc + c < n, which
holds for all suciently great values of n.
The following is a sharper version of Berry's paradox. Consider:
\the least natural number whose de nition requires more characters
than there are in this phrase." This c-character phrase de nes a number
that supposedly needs more than c characters to be de ned.
The following version is analogous to our proof. Consider this pro-
gram: \Calculate the rst string that can be proven in hU pi to be of
complexity greater than the number of bits in this program, where p is
the following string: : : : " Here \ rst" refers to rst in the recursive enu-
meration U (p t) (t = 0 1 2 : : :) of the theorems of the formal system
hU pi. This program is only a constant number c of bits longer than the
number of bits in p. It is no longer a paradox it shows that in hU pi no
string can be proven to be of complexity greater than c + lg(p) = c +
the number of bits of axioms of hU pi.
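In modern terms this self-referential program can be sketched in a few lines of Python; here theorems(p) is a hypothetical generator enumerating U(p), the theorems deducible from the axioms p, and c stands for the constant number of bits by which the program exceeds p:

    import re

    def berry(p, c):
        # Search the theorems for a proposition "I(s) > n" whose lower
        # bound n is at least this program's own size, roughly len(p) + c.
        for theorem in theorems(p):        # hypothetical enumeration of U(p)
            m = re.fullmatch(r"I\((.+)\) > (\d+)", theorem)
            if m and int(m.group(2)) >= len(p) + c:
                return m.group(1)          # the first such string s

If the formal system proves only true propositions of this form, the search can never succeed, which is exactly the conclusion of the theorem.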
Proof. Consider the computer C that does the following when it is given the program p′. First, it solves the equation p′ = 0^k 1 p. If this isn't possible (i.e. p′ = 0^k), then C halts without outputting anything. If this is possible, C continues by simulating running the program p on U. It generates U(p) searching for a proposition of the form "I(s) > n" in which s is a string, n is a natural number, and n ≥ lg(p′) + k. If and when it finds such a proposition "I(s) > n" in U(p), it outputs s and halts.

Suppose that p satisfies the hypothesis of the theorem, i.e. "I(s) > n" is in U(p) only if I(s) > n. Consider C(0^sim(C) 1 p). If C(0^sim(C) 1 p) = {s}, then I(s) ≤ lg(0^sim(C) 1 p) + sim(C) = lg(p) + 2 sim(C) + 1. But C outputs s and halts because it found the proposition "I(s) > n" in U(p) and

n ≥ lg(p′) + k = lg(0^sim(C) 1 p) + sim(C) = lg(p) + 2 sim(C) + 1.

Thus, by the hypothesis of the theorem, I(s) > n ≥ lg(p) + 2 sim(C) + 1, which contradicts the upper bound on I(s). Consequently, C(0^sim(C) 1 p) doesn't output anything (i.e. equals ∅), for there is no proposition "I(s) > n" in U(p) with n ≥ lg(p) + 2 sim(C) + 1. The theorem is proved with c = 2 sim(C) + 1. Q.E.D.
Definition 4.1. H = {p | U(p) halts}. (H is r.e., as was pointed out in the first paragraph of this section.)

Theorem 4.2. There is a c such that for all formal systems ⟨U, p⟩, if a proposition of the form "s ∈ H" or "s ∉ H" (s a string) is a theorem of ⟨U, p⟩ only if it is true, then there is a string s of length ≤ lg(p) + c such that neither "s ∈ H" nor "s ∉ H" is a theorem of ⟨U, p⟩.

Proof. Consider the computer C that does the following when it is given the program p. It simulates running p on U, and as it generates U(p), it checks each string in it to see if it is a proposition of the form "s ∈ H" or "s ∉ H," where s is a string. As soon as C has determined in this way for each string s of length less than or equal to some natural number n whether or not "s ∈ H" or "s ∉ H" is in U(p), it does the following.

C supposes that these propositions are true, and thus that it has determined the set {s ∈ H | lg(s) ≤ n}. Then it simulates running each of the programs in this set on U until U halts, and thus determines the set S = ∪ U(s) (s ∈ H & lg(s) ≤ n). C then outputs the proposition "I(f) > n," where f is the first string not in S, and then continues generating U(p) as was indicated in the first paragraph of this proof. Inasmuch as f isn't output by any program of length ≤ n that halts, it must in fact be of complexity > n.

Thus, C(p) enumerates true propositions of the form "I(f) > n" if p satisfies the hypothesis of the theorem. Hence, by Theorem 4.1, "I(f) > n" is in C(p) only if n < I_e(C(p)) + c′ ≤ lg(p) + sim(C) + c′. It is easy to see that the theorem is proved with c = sim(C) + c′. Q.E.D.
Theorem 4.3. Consider the set T_n consisting of all true propositions of the form "I(s) = k" (s a string, k a natural number ≤ n) and all true propositions of the form "I(s) > n." I_e(T_n) = n + O(1).

In other words, a formal system ⟨U, p′⟩ whose theorems consist precisely of all true propositions of the form "I(s) = k" with k ≤ n, and all true propositions of the form "I(s) > n," requires n + O(1) bits of axioms; i.e. n − c bits are necessary and n + c bits are sufficient to obtain this set of theorems.

Idea of Proof. If one knows n and how many programs of length ≤ n halt when run on U, then one can find them all, and see what they calculate. n and this number h can be coded into an (n + 1)-bit string. In other words, the axiom of this formal system with theorem set T_n is essentially "the number of programs of length ≤ n that halt when run on U is h," where n and h are particular natural numbers. This axiom is n + O(1) bits of information.
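The following sketch shows how the (n + 1)-bit axiom is used; run(p, t) is a hypothetical step-bounded simulator for U returning a pair (output, halted), and programs(n) enumerates the finitely many programs of length ≤ n. Note that nothing here bounds the time t; Section 7 shows that this is unavoidable.

    def census(n, h, run, programs):
        # h = the number of programs of length <= n that halt when run on U
        t = 0
        while True:
            done = [p for p in programs(n) if run(p, t)[1]]
            if len(done) == h:       # every halting program has now halted
                # from these pairs one can read off each string of
                # complexity <= n and its complexity
                return [(p, run(p, t)[0]) for p in done]
            t += 1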
Proof. By Theorem 4.1, I_e(T_n) ≥ n − c. It remains to show that I_e(T_n) ≤ n + c.

Consider the computer C that does the following when it is given the program p of length ≥ 1. It generates the r.e. set H until it has found p − 0^lg(p) programs of length ≤ lg(p) − 1 that halt when run on U. If and when it has found this set S of programs, it simulates running each program in S on U until it halts. C then examines each string that is calculated by a program in S, and determines the length of the shortest program in S that calculates it. If p − 0^lg(p) = the number h of programs of length ≤ lg(p) − 1 that halt when run on U, then C has determined each string of complexity ≤ lg(p) − 1 and its complexity. If p − 0^lg(p) < h, then C's estimates of the complexity of strings are too high. And if p − 0^lg(p) > h, then C never finishes generating H. Finally, C outputs its estimates as propositions of the form "I(s) = k" with k ≤ lg(p) − 1, and as propositions of the form "I(s) > k" with k = lg(p) − 1, indicating that all other strings are of complexity > lg(p) − 1.

We now show how C can be used to enumerate T_n economically. Consider h = #({s ∈ H | lg(s) ≤ n}). As there are precisely 2^(n+1) − 1 strings of length ≤ n, 0 ≤ h ≤ 2^(n+1) − 1. Let p be 0^(n+1) + h, that is, the hth string of length n + 1. Then C(p) = T_n, and thus I_e(T_n) ≤ lg(p) + sim(C) = n + 1 + sim(C). Q.E.D.
Theorem 4.4. Let T_n be the set of all true propositions of the form "s ∈ H" or "s ∉ H" with s a string of length ≤ n. I_e(T_n) = n + O(1).

In other words, a formal system ⟨U, p⟩ whose theorems consist precisely of all true propositions of the form "s ∈ H" or "s ∉ H" with lg(s) ≤ n, requires n + O(1) bits of axioms; i.e. n − c bits are necessary and n + c bits are sufficient to obtain this set of theorems.

Proof. Theorem 4.2 shows that I_e(T_n) ≥ n − c. The proof that I_e(T_n) ≤ n + c is obtained from the proof of Theorem 4.3 by simplifying the definition of the computer C so that it outputs T_n, instead of, in effect, using T_n to determine each string of complexity ≤ n and its complexity. Q.E.D.
Definition 4.2. P = {s | I(s) < lg(s)}; i.e. P contains each string s whose complexity I(s) is less than its length lg(s).

Theorem 4.5. (a) P is r.e., i.e. there is a formal system with the property that "s ∈ P" is a theorem iff s ∈ P.

(b) P̄, the complement of P, is infinite, because for each n there is a string of length n that isn't an element of P.

(c) There is a c such that for all formal systems ⟨U, p⟩, if "s ∉ P" is a theorem only if it is true, then "s ∉ P" is a theorem only if lg(s) < lg(p) + c. Thus, by the definition of e-complexity, if an r.e. set T of propositions has the property that "s ∉ P" is in T only if it is true, then "s ∉ P" is in T only if lg(s) < I_e(T) + c.

(d) There is a c such that for all r.e. sets of strings S, if S contains no string in P (i.e. S ⊆ P̄), then lg(max S) < I_e(S) + c. Thus max S < 0^(I_e(S)+c) = 2^(I_e(S)+c) − 1, and #(S) < 2^(I_e(S)+c).

(e) Let T_n be the set of all true propositions of the form "s ∉ P" with lg(s) ≤ n. I_e(T_n) = n + O(1). In other words, a formal system ⟨U, p⟩ whose theorems consist precisely of all true propositions of the form "s ∉ P" with lg(s) ≤ n, requires n + O(1) bits of axioms; i.e. n − c bits are necessary and n + c bits are sufficient to obtain this set of theorems.

Proof. (a) This is an immediate consequence of the fact that the set of all true propositions of the form "I(s) ≤ n" is r.e.

(b) We must show that for each n there is a string of length n whose complexity is greater than or equal to its length. There are 2^n strings of length n. As there are exactly 2^n − 1 programs of length < n, there are < 2^n strings of complexity < n. Thus at least one string of length n must be of complexity ≥ n.
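The counting in (b) can be checked directly; identifying programs with bit strings, in Python:

    n = 16
    programs_shorter_than_n = sum(2**k for k in range(n))  # lengths 0, ..., n-1
    assert programs_shorter_than_n == 2**n - 1             # < the 2**n strings of length n

so the 2^n strings of length n cannot all be the outputs of distinct programs of length < n.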
(c) Consider the computer C that does the following when it is given the program p. It simulates running p on U. As C generates U(p), it examines each string in it to see if it is a proposition of the form "s ∉ P," where s is a string of length ≥ 1. If it is, C outputs the proposition "I(s) > n" where n = lg(s) − 1.

If p satisfies the hypothesis, i.e. "s ∉ P" is in U(p) only if it is true, then C(p) enumerates true propositions of the form "I(s) > n" with n = lg(s) − 1. It follows by Theorem 4.1 that n must be < I_e(C(p)) + c′ ≤ lg(p) + sim(C) + c′. Thus lg(s) − 1 < lg(p) + sim(C) + c′, and part (c) of the theorem is proved with c = sim(C) + c′ + 1.

(d) Consider the computer C that does the following when it is given the program p. It simulates running p on U. As C generates U(p), it takes each string s in U(p), and outputs the proposition "s ∉ P."

Suppose S contains no string in P. Let p be a minimal e-description of S, i.e. U(p) = S and lg(p) = I_e(S). Then C(p) enumerates true propositions of the form "s ∉ P" with s ∈ S. By part (c) of this theorem,

lg(s) < I_e(C(p)) + c′ ≤ lg(p) + sim(C) + c′ = I_e(S) + sim(C) + c′.

Part (d) of the theorem is proved with c = sim(C) + c′.

(e) That I_e(T_n) ≥ n − c follows from part (c) of this theorem. The proof that I_e(T_n) ≤ n + c is obtained by changing the definition of the computer C in the proof of Theorem 4.3 in the following manner. After C has determined each string of complexity ≤ n and its complexity, C determines each string s of complexity ≤ n whose complexity is greater than or equal to its length, and then C outputs each such s in a proposition of the form "s ∉ P." Q.E.D.
Theorem 4.6. (a) There is a c such that for all programs p, if a proposition of the form "I_e(U(s)) > n" (s a string, n a natural number) is in U(p) only if I_e(U(s)) > n, then "I_e(U(s)) > n" is in U(p) only if n < lg(p) + c.

In other words: (b) There is a c such that for all formal systems ⟨U, p⟩, if "I_e(U(s)) > n" is a theorem of ⟨U, p⟩ only if it is true, then "I_e(U(s)) > n" is a theorem of ⟨U, p⟩ only if n < lg(p) + c.

For any r.e. set of propositions T, one obtains the following from (a) by taking p to be a minimal e-description of T: (c) If T has the property that "I_e(U(s)) > n" is in T only if I_e(U(s)) > n, then T has the property that "I_e(U(s)) > n" is in T only if n < I_e(T) + c.

Proof. By Theorem 2.1(c), there is a c′ such that I_e(U(s)) > n implies I(s) > n − c′.

Consider the computer C that does the following when it is given the program p. It simulates running p on U. As C generates U(p), it checks each string in it to see if it is a proposition of the form "I_e(U(s)) > n" with s a string and n a natural number. Each time it finds such a proposition in which n ≥ c′, C outputs the proposition "I(s) > m" where m = n − c′ ≥ 0.

If p satisfies the hypothesis of the theorem, then C(p) enumerates true propositions of the form "I(s) > m." "I(s) > m" (m = n − c′ ≥ 0) is in C(p) iff "I_e(U(s)) > n" (n ≥ c′) is in U(p). By Theorem 4.1, "I(s) > m" is in C(p) only if

m < I_e(C(p)) + c″ ≤ lg(p) + sim(C) + c″.

Thus "I_e(U(s)) > n" (n ≥ c′) is in U(p) only if n − c′ < lg(p) + sim(C) + c″. The theorem is proved with c = sim(C) + c″ + c′. Q.E.D.
5. The Greatest Natural Number of Complexity ≤ N

The growth of a(n), the greatest natural number of complexity ≤ n, as a partial function of n, serves as a benchmark for measuring a number of computational phenomena. The general approach in Sections 6 to 10 will be to use a partial function of n to measure some quantity of computational interest, and to compare the growth of this partial function as n increases with that of a(n).

We compare rates of growth in the following fashion.

Definition 5.1. We say that a partial function f grows at least as quickly as another partial function g, written f ⪰ g or g ⪯ f, when a shift of f overbounds g. That is, when there is a c such that for all n, if g(n) is defined, then f(n + c) is defined and f(n + c) ≥ g(n). Note that f ⪰ g and g ⪰ h implies f ⪰ h.

Definition 5.2. We say that the partial functions f and g grow equally quickly, written f ≈ g, iff f ⪰ g and g ⪰ f.

We now formally define a(n), and list its basic properties for future reference.

Definition 5.3. a(n) = max k (I(k) ≤ n). The maximum is taken over all natural numbers k of complexity ≤ n. If there are no such k, then a(n) is undefined.
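The real a(n) is not computable, but Definition 5.3 can be illustrated in a toy model (an assumption made purely for illustration): take "programs" to be Python arithmetic expressions over the alphabet below, and take the complexity of k to be the length of the shortest expression with value k. Then the analogue of a(n) is computable by brute force:

    from itertools import product

    def toy_a(n):
        # greatest value of any well-formed expression of length <= n
        best = None
        for length in range(1, n + 1):
            for chars in product("123456789+*", repeat=length):
                try:
                    k = eval("".join(chars))    # "run" the toy program
                except Exception:
                    continue                    # not a well-formed program
                if isinstance(k, int) and (best is None or k > best):
                    best = k
        return best

    print(toy_a(4))   # "9**9" has length 4, so toy_a(4) = 387420489

Even in the toy model the growth is violent; the theorems that follow make this precise for the real a(n).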
Theorem 5.1.

(a) If a(n) is defined, then I(a(n)) ≤ n.

(b) If a(n) is defined, then a(n + 1) is defined and a(n) ≤ a(n + 1).

(c) If I(n) ≤ m, then n ≤ a(m).

(d) n ≤ a(I(n)).

(e) If a(m) is defined, then n > a(m) implies I(n) > m.

(f) If I(n) > i for all n ≥ m, then m > a(i) if a(i) is defined.

(g) I(a(n)) = n + O(1).

(h) There is a c such that for all finite sets S of strings, max S ≤ a(I(S) + c).

(i) There is a c such that for all finite sets S of strings and all n, if a(n) is defined and a(n) ∈ S, then I(S) > n − c.

Proof. (a) to (f) follow immediately from the definition of a(n).

(g) Consider the two computers C and C′ that always halt and such that C(n) = {n + 1} and C′(n + 1) = {n}. It follows by Theorem 2.1(c) that I(n) = I(n + 1) + O(1). By part (e) of this theorem, if a(n) is defined then I(a(n) + 1) > n. By part (a) of this theorem, I(a(n)) ≤ n. Hence if a(n) is defined we have I(a(n)) = I(a(n) + 1) + O(1), I(a(n)) ≤ n, and I(a(n) + 1) > n. It follows that I(a(n)) = n + O(1).

(h) Consider the computer C such that C(p) = {max S} and halts if p is a description of S, i.e. if U(p) = S and halts. It follows that I(max S) ≤ I(S) + sim(C). Thus by part (c) of this theorem, max S ≤ a(I(S) + c), where c = sim(C).

(i) Consider the computer C such that C(p) = {1 + max S} and halts if p is a description of S, i.e. if U(p) = S and halts. It follows that I(1 + max S) ≤ I(S) + sim(C). If a(n) ∈ S, then 1 + max S > a(n), and thus by part (e) of this theorem I(1 + max S) > n. Hence n < I(1 + max S) ≤ I(S) + sim(C), and thus I(S) > n − c, where c = sim(C). Q.E.D.
6. How Fast Does the Greatest Natural Number of Complexity ≤ N Grow with Increasing N?

In Theorem 6.2 we show that an equivalent definition of a(n) is the greatest value at n of any partial recursive function of complexity ≤ n. In Theorem 6.3 we use this to show that any partial function ⪰ a eventually overtakes any partial recursive function. This will apply directly to all the functions that will be shown in succeeding sections to be ≈ to a.

In Theorem 6.4 it is shown that for any partial recursive function f, f(a(·)) ⪯ a. Thus there is a c such that for all n, if a(n) is defined, then a(n) < a(n + c) (Theorem 6.5).
Theorem 6.1. There is a c such that if f : N → N is a partial recursive function defined at n and n ≥ I_e(f), then I(f(n)) ≤ n + c.

Proof. Given a minimal e-description s of the graph of f, we add it to 0^(n+1). As n ≥ I_e(f) = lg(s), the resulting string p = 0^(n+1) + s has both n (= lg(p) − 1) and the graph of f (= U(s) = U(p − 0^lg(p))) coded into it. Given this string p as its program, a computer C generates the graph of f searching for the pair (n, f(n)). If and when it is found, the computer outputs f(n) and halts. Thus, f(n), if defined, is of complexity ≤ lg(p) + sim(C) = n + 1 + sim(C). Q.E.D.

Definition 6.1. b(n) = max f(n) (I_e(f) ≤ n). The maximum is taken over all partial recursive functions f : N → N that are defined at n and are of e-complexity ≤ n. If there are no such functions, then b(n) is undefined.
Theorem 6.2. a ≈ b.

Proof. First we show that b ⪯ a. If b(n) is defined, then there is a partial recursive function f : N → N defined at n with I_e(f) ≤ n, such that f(n) = b(n). By Theorem 6.1, I(f(n)) ≤ n + c, and thus f(n) ≤ a(n + c) by Theorem 5.1(c). Hence if b(n) is defined, b(n) = f(n) ≤ a(n + c), and thus b ⪯ a.

Now we show that a ⪯ b. Suppose that a(n) is defined, and consider the constant function f_n : N → N whose value is always a(n), and the computer C such that C(n) = {(0, n), (1, n), (2, n), ...}. It follows by Theorem 2.1(c) that I_e(f_n) ≤ I(a(n)) + c, which by Theorem 5.1(a) is ≤ n + c. Thus if a(n) is defined, a(n) = f_n(n + c) ≤ max f(n + c) (I_e(f) ≤ n + c) = b(n + c). Hence a ⪯ b. Q.E.D.
Theorem 6.3. Let the partial function x : N → N have the property that x ⪰ a. There is a constant c0 such that the following holds for all partial recursive functions f : N → N. If f(n) is defined and n ≥ I_e(f) + c0, then x(n) is defined and x(n) ≥ f(n).

Proof. By Theorem 6.2 and the transitivity of ⪰, x ⪰ a ⪰ b. Thus there is a c such that x(n + c) is defined and x(n + c) ≥ b(n) if b(n) is defined. Consider the shifted function f′(n) = f(n + c). The existence of a computer C such that (i, j) ∈ C(p) iff (i + c, j) ∈ U(p) shows that I_e(f′) ≤ I_e(f) + sim(C). By the definition of b, x(n + c) ≥ b(n) ≥ f′(n) if f′ is defined at n and I_e(f′) ≤ n. Thus x(n + c) ≥ f(n + c) if f is defined at n + c and I_e(f′) ≤ I_e(f) + sim(C) ≤ n. In other words, x(n) ≥ f(n) if f is defined at n and I_e(f) + sim(C) + c ≤ n. The theorem is proved with c0 = sim(C) + c. Q.E.D.
Theorem 6.4. Let f : N → N be a partial recursive function. f(a(·)) ⪯ a.

Proof. There is a computer C such that C(n) = {f(n)} and halts if f(n) is defined. Thus by Theorem 2.1(c), if f(n) is defined, I(f(n)) ≤ I(n) + c. Substituting a(n) for n, we obtain I(f(a(n))) ≤ I(a(n)) + c ≤ n + c, for by Theorem 5.1(a), I(a(n)) ≤ n. Thus if f(a(n)) is defined, f(a(n)) ≤ a(n + c), by Theorem 5.1(c). Q.E.D.

Theorem 6.5. There is a c such that for all n, if a(n) is defined, then a(n) < a(n + c).

Proof. Taking f(n) = n + 1 in Theorem 6.4, we obtain a(·) + 1 ⪯ a. Q.E.D.
7. The Resources Needed to Calculate/Enumerate the Set of All Strings of Complexity ≤ N

7.1

We first discuss the metamathematical implications of the material in this section.

The basic fact used in this section (see the proof of Theorem 7.3) is that for any computer C there is a c such that for all n, if a(n) is defined then max ∪ C(p, a(n)) (lg(p) ≤ a(n)) is less than a(n + c). Thus a(n + c) cannot be output by programs of length ≤ a(n) in time ≤ a(n).

If we use Theorem 3.1(a) to take C to be such that s ∈ C(p, t) iff "I(s) = k" ∈ C*(p, t), and we recall that a(n + c) is a string of complexity ≤ n + c, we obtain the following result. Any formal system ⟨C*, p⟩ whose theorems include all true propositions of the form "I(s) = k" with k ≤ n + c, must either have more than a(n) bits of axioms, or need proofs of size greater than a(n) to be able to demonstrate these propositions. Here c depends only on the rules of inference C*. This is a strong result, in view of the fact that a(n) is greater than or equal to any partial recursive function f(n) for n ≥ I_e(f) + c0 (Theorem 6.3).

The idea of Section 9 is to show that both extremes are possible and there is a drastic trade-off. We can deduce these results from a few bits of axioms (≤ n + c bits by Theorem 4.3) by means of enormous proofs, or we can directly take as axioms all that we wish to prove. This gives short proofs, but we are assuming an enormous number of bits of axioms.

From the fact that a(n + c) > max ∪ C(p, a(n)) (lg(p) ≤ a(n)), it also follows that if one wishes to prove a numerical upper bound on a(n + c), one faces the same drastic alternatives. Lin and Rado, in trying to determine particular values of Σ(n) and SH(n), have, in fact, essentially been trying to do this (see [36]). In their paper they explain the difficulties they encountered and overcame for n = 3, and expect them to be insurmountable for greater values of n.
7.2

Now we begin the formal exposition, which is couched exclusively in terms of computers.

In this section we study the set K(n) consisting of all strings of complexity ≤ n. This set turns out to be extremely difficult to calculate, or even to enumerate a superset of; either the program or the time needed must be extremely large. In order to measure this difficulty, we will first measure the resources needed to output a(n).

Definition 7.1. K(n) = {s | I(s) ≤ n}. Note that this set may be empty, and #(K(n)) isn't greater than 2^(n+1) − 1, inasmuch as there are exactly 2^(n+1) − 1 programs of length ≤ n.

We shall show that a(n) and the resources required to calculate/enumerate K(n) grow equally quickly. What do we mean by the resources required to calculate a finite set, or to enumerate a superset of it? It is assumed that the computer C is being used to do this.

Definition 7.2. Let S be a finite set of strings. r(S), the resources required to calculate S, is the least r such that there is a program p of length ≤ r having the property that C(p, r) = S and is halted. If there is no such r, r(S) is undefined. r_e(S), the resources required to enumerate a superset of S, is the least r such that there is a program p of length ≤ r with the property that S ⊆ C(p, r). If there is no such r, r_e(S) is undefined. We abbreviate r({s}) and r_e({s}) as r(s) and r_e(s).

We shall find very useful the notion of the set of all output produced by the computer C with information and time resources limited to r. We denote this by C_r.

Definition 7.3. C_r = ∪ C(p, r) (lg(p) ≤ r).
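Definitions 7.2 and 7.3 transcribe directly; in this sketch C is a hypothetical function C(p, t) returning a pair (output_set, halted), and for a real computer the search in r_of need not terminate:

    from itertools import product

    def all_programs(max_len):
        for n in range(max_len + 1):
            for bits in product("01", repeat=n):
                yield "".join(bits)

    def C_r(C, r):
        # C_r = the union of C(p, r) over all programs p with lg(p) <= r
        out = set()
        for p in all_programs(r):
            out |= C(p, r)[0]
        return out

    def r_of(C, S):
        # r(S): least r such that some p of length <= r has C(p, r) = S, halted
        r = 0
        while True:
            if any(C(p, r) == (S, True) for p in all_programs(r)):
                return r
            r += 1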
We now list for future reference basic properties of these concepts.

Theorem 7.0.

(a) a(n) = max K(n) if K(n) ≠ ∅, and a(n) is undefined if K(n) = ∅.

(b) K(n) ≠ ∅, and a(n) is defined, iff n ≥ n*. Here n* = min I(s), where the minimum is taken over all strings s.

(c) For all r, C_r ⊆ C_(r+1).

In (d) to (k), S and S′ are arbitrary finite sets of strings.

(d) S ⊆ C_(r_e(S)) if r_e(S) is defined.

(e) r_e(S) ≤ r(S) if r(S) is defined.

(f) If r_e(S′) is defined, then S ⊆ S′ implies r_e(S) ≤ r_e(S′).

(g) If S ⊆ C(p, t), then either lg(p) ≥ r_e(S) or t ≥ r_e(S).

(h) If C(p) = S and halts, then either lg(p) ≥ r(S), or the time at which C(p) halts is ≥ r(S).

(i) If C(p) = S and halts, and lg(p) < r(S), then C(p) halts at time ≥ r(S).

In (j) and (k) it is assumed that C is U. Thus r(S) and r_e(S) are always defined.

(j) If r(S) > I(S), then there is a program p of length I(S) such that U(p) = S and halts at time ≥ r(S).

(k) r(S) ≥ I(S).

Proof. These results follow immediately from the definitions. Q.E.D.
Theorem 7.1. There is a c such that for all finite sets S of strings,

(a) I(r(S)) ≤ I(S) + c if r(S) is defined, and

(b) I(r_e(S)) ≤ I(S) + c if r_e(S) is defined.

Proof. (a) Consider the computer C′ that does the following when it is given the program p. First, it simulates running p on U. If and when U halts during the simulated run, C′ has determined the finite set S = U(p) of strings. Then C′ repeats the following operations for r = 0, 1, 2, ...

C′ determines C(p′, r) for each program p′ of length ≤ r. It checks those C(p′, r) (lg(p′) ≤ r) that are halted to see if one of them is equal to S. If none of them are, C′ adds 1 to r and repeats this operation. If one of them is equal to S, C′ outputs r and halts.

Let p be a minimal description of a finite set S of strings, i.e. U(p) = S and halts, and lg(p) = I(S). Then if r(S) is defined, C′(p) = {r(S)} and halts, and thus I(r(S)) ≤ lg(p) + sim(C′) = I(S) + sim(C′). This proves part (a) of the theorem.

(b) The proof of part (b) of the theorem is obtained from the proof of part (a) by changing the definition of the computer C′ so that it checks all C(p′, r) (lg(p′) ≤ r) to see if one of them includes S, instead of checking all those C(p′, r) (lg(p′) ≤ r) that are halted to see if one of them is equal to S. Q.E.D.
Theorem 7.2. max C_(a(·)) ⪯ a.

Proof. Theorem 2.1(c) and the existence of a computer C′ such that C′(r) = C_r and halts, shows that there is a c such that for all r, I(C_r) ≤ I(r) + c. Thus by Theorem 5.1(h) and (b) there is a c′ such that for all r, max C_r ≤ a(I(r) + c′). Hence if a(n) is defined, max C_(a(n)) ≤ a(I(a(n)) + c′) ≤ a(n + c′) by Theorem 5.1(a) and (b). Q.E.D.
Theorem 7.3.

(a) If r_e(a(n)) is defined when a(n) is, then r_e(a(·)) ≈ a.

(b) If r(a(n)) is defined when a(n) is, then r(a(·)) ≈ a.

Proof. By Theorem 7.1, if r_e(a(n)) and r(a(n)) are defined, I(r_e(a(n))) ≤ I(a(n)) + c and I(r(a(n))) ≤ I(a(n)) + c. By Theorem 5.1(a), I(a(n)) ≤ n. Thus I(r_e(a(n))) ≤ n + c and I(r(a(n))) ≤ n + c. Applying Theorem 5.1(c), we obtain r_e(a(n)) ≤ a(n + c) and r(a(n)) ≤ a(n + c). Thus we have shown that r_e(a(·)) ⪯ a and r(a(·)) ⪯ a, no matter what C is.

r(S), if defined, is ≥ r_e(S) (Theorem 7.0(e)), and thus to finish the proof it is sufficient to show that a ⪯ r_e(a(·)) if r_e(a(n)) is defined when a(n) is. By Theorems 7.2 and 6.5 there is a c such that for all n, if a(n) is defined, then max C_(a(n)) < a(n + c), and thus a(n + c) ∉ C_(a(n)). And inasmuch as for all finite sets S, S ⊆ C_(r_e(S)) (Theorem 7.0(d)), it follows that a(n + c) ∈ C_(r_e(a(n+c))).

In summary, there is a c such that for all n, if a(n) is defined, then a(n + c) ∉ C_(a(n)), and a(n + c) ∈ C_(r_e(a(n+c))).

As for all r, C_r ⊆ C_(r+1) (Theorem 7.0(c)), it follows that if a(n) is defined then a(n) < r_e(a(n + c)). Thus a ⪯ r_e(a(·)). Q.E.D.
Theorem 7.4. I(K(n)) = n + O(1).

Proof. As was essentially shown in the proof of Theorem 4.3, there is a computer C′ such that C′(0^(n+1) + #({p ∈ H | lg(p) ≤ n})) = K(n) and halts, for all n. Thus I(K(n)) ≤ n + 1 + sim(C′) for all n.

K(n) = ∅ can hold for only finitely many values of n, by Theorem 7.0(b). By Theorem 7.0(a), for all other values of n, a(n) ∈ K(n), and thus, by Theorem 5.1(i), there is a c such that I(K(n)) ≥ n − c for all n. Q.E.D.
Theorem 7.5.

(a) If r_e(K(n)) is defined for all n, then r_e(K(·)) ≈ a.

(b) If r(K(n)) is defined for all n, then r(K(·)) ≈ a.

Proof. By Theorem 7.1, if r_e(K(n)) and r(K(n)) are defined, I(r_e(K(n))) ≤ I(K(n)) + c, and I(r(K(n))) ≤ I(K(n)) + c. I(K(n)) = n + O(1) (Theorem 7.4), and thus there is a c′ that doesn't depend on n such that I(r_e(K(n))) ≤ n + c′, and I(r(K(n))) ≤ n + c′. Applying Theorem 5.1(c), we obtain r_e(K(n)) ≤ a(n + c′), and r(K(n)) ≤ a(n + c′). Thus we have shown that r_e(K(·)) ⪯ a and r(K(·)) ⪯ a, no matter what C is.

For all finite sets S and S′ of strings, if S ⊆ S′, then r_e(S) ≤ r_e(S′) ≤ r(S′) if these are defined (Theorem 7.0(f), (e)). As a(n) ∈ K(n) if a(n) is defined (Theorem 7.0(a)), we have r_e(a(n)) ≤ r_e(K(n)) ≤ r(K(n)) if these are defined. By Theorem 7.3(a), r_e(a(·)) ≈ a if r_e(a(n)) is defined when a(n) is, and thus r_e(K(·)) ⪰ a and r(K(·)) ⪰ a if these are defined for all n. Q.E.D.
Theorem 7.6. Suppose r_e(a(n)) is defined when a(n) is. There is a c such that the following holds for all partial recursive functions f : N → N. If n ≥ I_e(f) + c and f(n) is defined, then

(a) a(n) is defined,

(b) if a(n) ∈ C(p, t), then either lg(p) ≥ f(n) or t ≥ f(n), and

(c) if K(n) ⊆ C(p, t), then either lg(p) ≥ f(n), or t ≥ f(n).

Proof. (a) and (b) By Theorem 7.3(a), r_e(a(·)) ≈ a. Taking r_e(a(·)) to be the partial function x(·) in the hypothesis of Theorem 6.3, we deduce that if n ≥ I_e(f) + c and f(n) is defined, then a(n) is defined and r_e(a(n)) ≥ f(n). Here c doesn't depend on f.

Thus if a(n) ∈ C(p, t), then by Theorem 7.0(g) it follows that either lg(p) ≥ r_e(a(n)) ≥ f(n) or t ≥ r_e(a(n)) ≥ f(n).

(c) Part (c) of this theorem is an immediate consequence of parts (a) and (b) and the fact that if a(n) is defined then a(n) ∈ K(n) (see Theorem 7.0(a)). Q.E.D.
8. The Minimum Time Such That All Programs of Length ≤ N That Halt Have Done So

In this section we show that for any computer, a ⪰ the minimum time such that all programs of length ≤ n that halt have done so (Theorem 8.1). Moreover, in the case of U this is true with "≈" instead of "⪰" (Theorem 8.2).

The situation revealed in the proof of Theorem 8.2 can be stated in the following vague but suggestive manner. Suppose that one wishes to calculate a(n) or K(n) using the standard computer U. To do this one only needs about n bits of information. But a program of length n + O(1) for calculating a(n) is among the programs of length ≤ n + O(1) that take the most time to halt. Likewise, an (n + O(1))-bit program for calculating K(n) is among the programs of length ≤ n + O(1) that take the most time to halt. These are among the most difficult calculations that can be accomplished by programs having not more than n bits.

Definition 8.1. d_C(n) = the least t such that for all p of length ≤ n, if C(p) halts, then C(p, t) is halted. This is the minimum time at which all programs of length ≤ n that halt have done so. Although it is 0 if no program of length ≤ n halts, we stipulate that d_C(n) is undefined in this case.
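Definition 8.1 as a sketch: halted(p, t) is a hypothetical step-bounded simulator, and halts(p) is an oracle for the halting problem, which is precisely why d_C(n) grows so quickly (Theorems 8.1 and 8.2 below):

    def d_C(n, programs, halts, halted):
        # least t by which every program of length <= n that halts has done so
        halting = [p for p in programs(n) if halts(p)]   # needs an oracle
        if not halting:
            return None                                  # d_C(n) undefined
        t = 0
        while not all(halted(p, t) for p in halting):
            t += 1
        return t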
Theorem 8.1. d_C ⪯ a.

Proof. Consider the computer C′ that does the following when it is given the program p. C′ simulates C(p, t) for t = 0, 1, 2, ... until C(p, t) is halted. If and when this occurs, C′ outputs the final value of t, which is the time at which C(p) halts. Finally, C′ halts.

If d_C(n) is defined, then there is a program p of length ≤ n that halts when run on C and does this at time d_C(n). Then C′(p) = {d_C(n)} and halts. Thus I(d_C(n)) ≤ lg(p) + sim(C′) ≤ n + sim(C′). By Theorem 5.1(c), we conclude that d_C(n) is, if defined, ≤ a(n + sim(C′)). Q.E.D.
Theorem 8.2. d_U ≈ a.

Proof. In view of Theorem 8.1, d_U ⪯ a. Thus we need only show that d_U ⪰ a.

Recall that a(n) is defined iff n ≥ n* (Theorem 7.0(b)). As C = U is a universal computer, r(a(n)) is defined if a(n) is defined. Thus Theorem 7.3(b) applies to this choice of C, and r(a(·)) ≈ a. That is to say, there is a c such that for all n ≥ n*, r(a(n + c)) ≥ a(n).

As a ⪰ a, taking x = a and f(n) = n + c + 1 in Theorem 6.3, we obtain the following. There is a c′ such that for all n ≥ n* + c′, a(n) ≥ f(n) = n + c + 1. We conclude that for all n ≥ n* + c′, a(n) > n + c.

By Theorem 5.1(a), n + c ≥ I(a(n + c)) for all n ≥ n*.

The preceding results may be summarized in the following chain of inequalities. For all n ≥ n* + c′, r(a(n + c)) ≥ a(n) > n + c ≥ I(a(n + c)).

As r(a(n + c)) > I(a(n + c)), the hypothesis of Theorem 7.0(j) is satisfied, and we conclude the following. There is a program p of length I(a(n + c)) ≤ n + c such that U(p) = {a(n + c)} and halts at time ≥ r(a(n + c)) ≥ a(n). Thus for all n ≥ n* + c′, d_U(n + c) ≥ a(n).

Applying Theorem 5.1(b) to this lower bound on d_U(n + c), we conclude that for all n ≥ n*, d_U(n + c′ + c) ≥ a(n + c′) ≥ a(n). Q.E.D.
9. Examples of Trade-Offs Between Information and Time

Consider calculating a(n) using the computer U and the computer C defined as follows. For all programs p, C(p, 0) = {p} and is halted.

Since I(a(n)) ≤ n (Theorem 5.1(a)), there is a program ≤ n bits long for calculating a(n) using U. But inasmuch as r(a(·)) ≈ a (Theorem 7.3(b)) and d_U ⪯ a (Theorem 8.1), this program takes "about" a(n) units of time to halt (see the proof of Theorem 8.2). More precisely, with finitely many exceptions, this program takes between a(n − c) and a(n + c) units of time to halt.

What happens if one uses C to calculate a(n)? Inasmuch as C(a(n)) = {a(n)} and halts at time 0, C can calculate a(n) immediately. But this program, although fast, is lg(a(n)) = ⌊log2(a(n) + 1)⌋ bits long. Thus r(a(n)) is precisely lg(a(n)) if one uses this computer.
Now for our second example. Suppose one wishes to enumerate a superset of K(n), and is using the following two computers, which never halt: C(p, t) = {s | lg(s) ≤ t} and C′(p, t) = {s | lg(s) ≤ lg(p)}. These two computers have the property that K(n), if not empty, is included in C(p, t), or is included in C′(p, t), iff t ≥ lg(a(n)), or iff lg(p) ≥ lg(a(n)), respectively. Thus for these two computers, r_e(K(n)), which we know by Theorem 7.5(a) must be ≈ to a, is precisely given by the following: r_e(K(n)) = 0 if a(n) is undefined, and lg(a(n)) otherwise.

It is also interesting to slow down or speed up the computer C by changing its time scale recursively. Let f : N → N be an arbitrary unbounded total recursive function with the property that for all n, f(n) ≤ f(n + 1). C^f, the f speed-up/slowdown of C, is defined as follows: C^f(p, t) = {s | f(lg(s)) ≤ t}. For the computer C^f, r_e(K(n)) is precisely given by the following: r_e(K(n)) = 0 if a(n) is undefined, and f(lg(a(n))) otherwise. The fact that by Theorem 7.5(a) this must be ≈ to a, is related to Theorem 6.4, that for any partial recursive function f, f(a(·)) ⪯ a.
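The computer C^f transcribes directly if strings are identified with natural numbers by their ordinal position in the list Λ, 0, 1, 00, 01, ..., so that the nth string has length ⌊log2(n + 1)⌋; since the output set is infinite, this sketch truncates the enumeration at an arbitrary bound:

    def lg(n):
        # length of the nth string in the ordering "", 0, 1, 00, 01, ...
        return (n + 1).bit_length() - 1

    def C_f(f, p, t, bound=10**6):
        # C^f(p, t) = { s : f(lg(s)) <= t }, truncated to the first `bound` strings
        return {n for n in range(bound) if f(lg(n)) <= t}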
Now, for our third example, we consider trade-offs in calculating K(n). We use U and the computer C defined as follows. For all p and n, C(p, 0) is halted, and n ∈ C(p, 0) iff lg(p) > n and the nth bit of the string p is a 1.

Inasmuch as I(K(n)) = n + O(1) (Theorem 7.4), if we use U there is a program about n bits long for calculating K(n). But this program takes "about" a(n) units of time to halt, in view of the fact that r(K(·)) ≈ a (Theorem 7.5(b)) and d_U ⪯ a (Theorem 8.1) (see the proof of Theorem 8.2). More precisely, with finitely many exceptions, this program takes between a(n − c) and a(n + c) units of time to halt.

On the other hand, using the computer C we can calculate K(n) immediately. But the shortest program for doing this has length precisely 1 + max K(n) = a(n) + 1 if a(n) is defined, and has length 0 otherwise. In other words, for this computer r(K(n)) = 0 if a(n) is undefined, and a(n) + 1 otherwise.

We have thus seen three examples of a drastic trade-off between information and time resources. In this setting information and time play symmetrical roles, especially in the case of the resources needed to enumerate a superset.
10. The Speed of Recursive Enumerations

10.1

We first discuss the metamathematical implications of the material in this section.

Consider a particular formal system, and a particular r.e. set of strings R. Suppose that a proposition of the form "s ∈ R" is a theorem of this formal system iff it is true, i.e. iff the string s is an element of R. Define e(n) to be the least m such that all theorems of the formal system of the form "s ∈ R" with lg(s) ≤ n have proofs of size ≤ m. By using Theorem 3.1(a) we can draw the following conclusions from the results of this section. First, e ⪯ a for any R. Second, e ≈ a iff

I({s ∈ R | lg(s) ≤ n}) = n + O(1).   (*)

Thus r.e. sets R for which e ≈ a are the ones that require the longest proofs to show that "s ∈ R," and this is the case iff R satisfies (*). It is shown in this section that the r.e. set of strings {p | p ∈ U(p)} has property (*), and the reader can show without difficulty that H and P are also r.e. sets of strings that have property (*). Thus we have three examples of R for which e ≈ a.
10.2

Now we begin the formal exposition, which is couched exclusively in terms of computers.

Consider an r.e. set of strings R and a particular computer C* and program p* such that C*(p*) = R. How quickly is R enumerated? That is, what is the time e(n) that it takes to output all elements of R of length ≤ n?

Definition 10.1. R^n = {s ∈ R | lg(s) ≤ n}. e(n) = the least t such that R^n ⊆ C*(p*, t).
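Definition 10.1 as a sketch: C_star(p_star, t) is a hypothetical function giving the finite set that C* has output by time t, and R_n is the finite set R^n (forming R^n is exactly the costly information discussed below). Since C*(p*) = R includes R^n, the loop terminates:

    def e(n, R_n, C_star, p_star):
        t = 0
        while not R_n <= C_star(p_star, t):   # <= is the subset test for sets
            t += 1
        return t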
We shall see that the rate of growth of the total function e(n) can be related to the growth of the complexity of R^n. In this way we shall show that some r.e. sets R are the most difficult to enumerate, i.e. take the most time.

Theorem 10.1.8 There is a c such that for all n, I(R^n) ≤ n + c.

8 This theorem, with a different proof, is due to Loveland [37, p. 64].

Proof. 0 ≤ #(R^n) ≤ 2^(n+1) − 1, for there are precisely 2^(n+1) − 1 strings of length ≤ n. Consider p, the #(R^n)-th string of length n + 1; i.e. p = 0^(n+1) + #(R^n). This string has both n (= lg(p) − 1) and #(R^n) (= p − 0^lg(p)) coded into it. When this string p is its program, the computer C generates the r.e. set R by simulating C*(p*), until it has found #(R^n) strings of length ≤ n in R. C then outputs this set of strings, which is R^n, and halts. Thus I(R^n) ≤ lg(p) + sim(C) = n + 1 + sim(C). Q.E.D.
Theorem 10.2.

(a) There is a c such that for all n, e(n) ≤ a(I(R^n) + c).

(b) e ⪯ a.

Proof. (a) Consider the computer C that does the following. Given a description p of R^n as its program, the computer C first simulates running p on U in order to determine R^n. Then it simulates C*(p*, t) for t = 0, 1, 2, ... until R^n ⊆ C*(p*, t). C then outputs the final value of t, which is e(n), and halts.

This shows that ≤ sim(C) bits need be added to the length of a description of R^n to bound the length of a description of e(n); i.e. if U(p) = R^n and halts, then C(p) = {e(n)} and halts, and thus I(e(n)) ≤ lg(p) + sim(C). Taking p to be a minimal description of R^n, we have lg(p) = I(R^n), and thus I(e(n)) ≤ I(R^n) + sim(C). By Theorem 5.1(c), this gives us e(n) ≤ a(I(R^n) + sim(C)). Part (a) of the theorem is proved with c = sim(C).

(b) By part (a) of this theorem, e(n) ≤ a(I(R^n) + c). And by Theorem 10.1, I(R^n) ≤ n + c′ for all n. Applying Theorem 5.1(b), we obtain e(n) ≤ a(I(R^n) + c) ≤ a(n + c′ + c) for all n. Thus e ⪯ a. Q.E.D.
Theorem 10.3. If a ⪯ e, then there is a c such that I(R^n) ≥ n − c for all n.

Proof. By Theorem 7.0(b) and the definition of ⪯, if a ⪯ e, then there is a c0 such that for all n ≥ n*, a(n) ≤ e(n + c0). And by Theorem 10.2(a), there is a c1 such that e(n + c0) ≤ a(I(R^(n+c0)) + c1) for all n. We conclude that for all n ≥ n*, a(n) ≤ a(I(R^(n+c0)) + c1).

By Theorems 6.5 and 5.1(b), there is a c2 such that if a(m) is defined and m ≤ n − c2, then a(m) < a(n). As we have shown in the first paragraph of this proof that for all n ≥ n*, a(n) ≤ a(I(R^(n+c0)) + c1), it follows that I(R^(n+c0)) + c1 > n − c2.

In other words, for all n ≥ n*, I(R^(n+c0)) > (n + c0) − c0 − c1 − c2. And thus for all n, I(R^n) ≥ n − c0 − c1 − c2 − M, where M = max_(n < n* + c0) (n − c0 − c1 − c2) if this is positive, and 0 otherwise. The theorem is proved with c = c0 + c1 + c2 + M. Q.E.D.
Theorem 10.4. If there is a c such that I(R^n) ≥ n − c for all n, then

(a) there is a c′ such that if t ≥ e(n), then I(t) > n − c′, and

(b) e ⪰ a.

Proof. By Theorem 5.1(f) it follows from (a) that e(n) > a(n − c′) if a(n − c′) is defined. Hence e(n + c′) ≥ a(n) if a(n) is defined, i.e. e ⪰ a. Thus to complete the proof we need only show that (a) follows from the hypothesis.

We consider the case in which t ≥ e(n) and n ≥ I(t) = n − k, for if I(t) > n then any c′ will do.

There is a computer C that does the following when it is given the program 0^lg(k) 1 k p, where p is a minimal description of t. First, C determines lg(p) + k = I(t) + k = (n − k) + k = n. Second, C simulates running p on U in order to determine U(p) = {t}. C now uses its knowledge of n and t in order to calculate R^n. To do this C first simulates running p* on C* in order to determine C*(p*, t), and finally C outputs all strings in C*(p*, t) that are of length ≤ n, which is R^n, and halts.

In summary, C has the property that if t ≥ e(n), I(t) = n − k, and p is a minimal description of t, then C(0^lg(k) 1 k p) = R^n and halts, and thus

I(R^n) ≤ lg(0^lg(k) 1 k p) + sim(C)
       = lg(p) + 2 lg(k) + sim(C) + 1
       = I(t) + 2 lg(k) + sim(C) + 1
       = n − k + 2 lg(k) + sim(C) + 1.

Taking into account the hypothesis of this theorem, we obtain the following for all n: if t ≥ e(n) and I(t) = n − k, then n − c ≤ I(R^n) ≤ n − k + 2 lg(k) + sim(C) + 1, and thus c + sim(C) + 1 ≥ k − 2 lg(k). As lg(k) = ⌊log2(k + 1)⌋, this implies that there is a c′ such that for all n, if t ≥ e(n) and I(t) = n − k, then k < c′. We conclude that for all n, if t ≥ e(n), then either I(t) > n or I(t) = n − k > n − c′. Thus in either case I(t) > n − c′. Q.E.D.
Theorem 10.5. If R = {p | p ∈ U(p)}, then there is a c such that I(R^n) > n − c for all n.

Proof. Consider the following computer C. When given the program p, C first simulates running p on U until U halts. If and when it finishes doing this, C then outputs each string s ∉ U(p), and never halts.

If the program p is a minimal description of R^n, then C enumerates a set that cannot be enumerated by any program p′ run on U having ≤ n bits. The reason is that if lg(p′) ≤ n, then p′ ∈ C(p) iff p′ ∉ R^n iff p′ ∉ U(p′). Thus ≤ sim(C) bits need be added to the length I(R^n) of a minimal description p of R^n to bound the length of an e-description of the set C(p) of e-complexity > n; i.e. n < I_e(C(p)) ≤ lg(p) + sim(C) = I(R^n) + sim(C). Hence n < I(R^n) + c, where c = sim(C). Q.E.D.
Theorem 10.6.

(a) e ⪯ a and ∃c ∀n I(R^n) ≤ n + c.

(b) e ≈ a iff ∃c ∀n I(R^n) ≥ n − c.

(c) If R = {p | p ∈ U(p)}, then e ≈ a and I(R^n) = n + O(1).

Proof. (a) is Theorems 10.2(b) and 10.1. (b) is Theorems 10.4(b) and 10.3. And (c) follows immediately from parts (a) and (b) and Theorem 10.5. Q.E.D.
Appendix. Examples of Universal Computers

In this Appendix we use the formalism of Rogers.9 In particular, P_x denotes the xth Turing machine, φ_x^(2) denotes the partial recursive function N × N → N that P_x calculates, and D_x denotes the xth finite set of natural numbers. Here the index x is an arbitrary natural number.

9 See [38, pp. 13-15, 21, 70].

First we give a more formal definition of computer than in Section 2.

A partial recursive function c : N × N → N is said to be adequate (as a defining function for a computer C) iff it has the following three properties:

(a) it is a total function;

(b) D_c(p,t) ⊆ D_c(p,t+1);

(c) if the natural number 0 is an element of D_c(p,t), then D_c(p,t) = D_c(p,t+1).

A computer C is defined by means of an adequate function c : N × N → N as follows.

(a) C(p, t) is halted iff the natural number 0 is an element of D_c(p,t).

(b) C(p, t) is the set of strings {n | n + 1 ∈ D_c(p,t)}; i.e. the nth string is in C(p, t) iff the natural number n + 1 is an element of D_c(p,t).

We now give a name to each computer. The natural number i is said to be an adequate index iff φ_i^(2) is an adequate function. If i is an adequate index, C_i denotes the computer whose defining function is φ_i^(2). If i isn't an adequate index, then "C_i" isn't the name of a computer.
We now define a universal computer U in such a way that it has the property that if i is an adequate index, then U(0^i 1 p) = C_i(p) and halts iff C_i(p) halts. In what follows i and t denote arbitrary natural numbers, and p denotes an arbitrary string. U(0^i, t) is defined to be equal to ∅ and to be halted. U(0^i 1 p, t) is defined recursively. If t ≥ 1 and U(0^i 1 p, t − 1) is halted, then U(0^i 1 p, t) = U(0^i 1 p, t − 1) and is halted. Otherwise U(0^i 1 p, t) is the set of strings {n | n + 1 ∈ W} and is halted iff 0 ∈ W. Here

W = ∪_(t′ < t0) D_(φ_i^(2)(p, t′))

and t0 is the greatest natural number ≤ t such that if t′ < t0 then P_i applied to ⟨p, t′⟩ yields an output in ≤ t steps.

The universal computer U that we have just defined is, in fact, effectively universal: to simulate the computation that C_i performs when it is given the program p, one gives U the program p′ = 0^i 1 p, and thus p′ can be obtained from p in an effective manner. Our second example of a universal computer, U′, is not effectively universal, i.e. there is no effective procedure for obtaining p′ from p.10

10 The definition of U′ is an adaptation of [38, p. 42, Exercise 2-11].

U′ is defined as follows:

U′(Λ, t) = ∅ and is halted,
U′(0p, t) = U(p, t) − {1} and is halted iff U(p, t) is, and
U′(1p, t) = U(p, t) ∪ {1} and is halted iff U(p, t) is.

I.e. U′ is almost identical to U, except that it eliminates the string 1 from the output, or forces the string 1 to be included in the output, depending on whether the first bit of its program is 0 or not. It is easy to see that U′ cannot be effectively universal. If it were, given any program p for U, by examining the first bit of the program p′ for U′ that simulates it, one could decide whether or not the string 1 is in U(p). But there cannot be an effective procedure for deciding, given any p, whether or not the string 1 is in U(p).
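U′ is easy to transcribe as a wrapper around a hypothetical U(p, t) that returns a pair (set_of_strings, halted); here "1" denotes the string 1:

    def U_prime(q, t, U):
        if q == "":                       # U'(Lambda, t) = empty set, halted
            return set(), True
        out, halted = U(q[1:], t)
        if q[0] == "0":
            return out - {"1"}, halted    # delete the string 1 from the output
        else:
            return out | {"1"}, halted    # force the string 1 into the output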
Added in Proof

The following additional references have come to our attention.

Part of Gödel's analysis of Cantor's continuum problem [39] is highly relevant to the philosophical considerations of Section 1. Cf. especially [39, pp. 265, 272].

Schwartz [40, pp. 26-28] first reformulates our Theorem 4.1 using the hypothesis that the formal system in question is a consistent extension of arithmetic. He then considerably extends Theorem 4.1 [40, pp. 32-34]. The following is a paraphrase of these pages.

Consider a recursive function f : N → N that grows very quickly, say f(n) = n!!!!!!!!!!. A string s is said to have property f if the fact that p is a description of {s} either implies that lg(p) ≥ lg(s) or that U(p) halts at time > f(lg(s)). Clearly a 1000-bit string with property f is very difficult to calculate. Nevertheless, a counting argument shows that there are strings of all lengths with property f, and they can be found in an effective manner [40, Lemma 7, p. 32]. In fact, the first string of length n with property f is given by a recursive function of n, and is therefore of complexity ≤ log2 n + c. This is thus an example of an extreme trade-off between program size and the length of computation. Furthermore, an argument analogous to the demonstration of Theorem 4.1 shows that proofs that specific strings have property f must necessarily be extremely tedious (if some natural hypotheses concerning U and the formal system in question are satisfied) [40, Theorem 8, pp. 33-34].

[41, Item 2, pp. 12-20] sheds light on the significance of these results. Cf. especially the first unitalicized paragraphs of answers numbers 4 and 8 to the question "What is programming?" [41, pp. 13, 15-16]. Cf. also [40, Appendix, pp. 63-69].
Index of Symbols

Section 2: lg(s), max S, #(S), X*, N, C(p, t), C(p), U, sim(C), I(S), I_e(S)
Section 3: ⟨C, p⟩, ⟨U, p⟩
Section 4: H, P
Section 5: ⪯, ⪰, ≈, a(n)
Section 6: b(n)
Section 7: K(n), r(S), r_e(S), C_r, n*
Section 8: d_C(n)
Section 10: R^n, e(n)
References

[1] Chaitin, G. J. Information-theoretic aspects of Post's construction of a simple set. On the difficulty of generating all binary strings of complexity less than n. (Abstracts.) AMS Notices 19 (1972), pp. A-712, A-764.

[2] Chaitin, G. J. On the greatest natural number of definitional or information complexity ≤ n. There are few minimal descriptions. (Abstracts.) Recursive Function Theory: Newsletter, no. 4 (1973), pp. 11-14, Dep. of Math., U. of California, Berkeley.

[3] von Neumann, J. Method in the physical sciences. In J. von Neumann: Collected Works, Vol. VI, A. H. Taub, Ed., MacMillan, New York, 1963, No. 36, pp. 491-498.

[4] von Neumann, J. The mathematician. In The World of Mathematics, Vol. 4, J. R. Newman, Ed., Simon and Schuster, New York, 1956, pp. 2053-2063.

[5] Bell, E. T. Mathematics: Queen and Servant of Science. McGraw-Hill, New York, 1951, pp. 414-415.

[6] Weyl, H. Mathematics and logic. Amer. Math. Mon. 53 (1946), 1-13.

[7] Weyl, H. Philosophy of Mathematics and Natural Science. Princeton U. Press, Princeton, N.J., 1949, pp. 234-235.

[8] Turing, A. M. Solvable and unsolvable problems. In Science News, no. 31 (1954), A. W. Heaslett, Ed., Penguin Books, Harmondsworth, Middlesex, England, pp. 7-23.

[9] Nagel, E., and Newman, J. R. Gödel's Proof. Routledge & Kegan Paul, London, 1959.

[10] Davis, M. Computability and Unsolvability. McGraw-Hill, New York, 1958.

[11] Quine, W. V. Paradox. Scientific American 206, 4 (April 1962), 84-96.

[12] Kleene, S. C. Mathematical Logic. Wiley, New York, 1968, Ch. V, pp. 223-282.

[13] Gödel, K. On the length of proofs. In The Undecidable, M. Davis, Ed., Raven Press, Hewlett, N.Y., 1965, pp. 82-83.

[14] Cohen, P. J. Set Theory and the Continuum Hypothesis. Benjamin, New York, 1966, p. 45.

[15] Arbib, M. A. Speed-up theorems and incompleteness theorems. In Automata Theory, E. R. Caianiello, Ed., Academic Press, New York, 1966, pp. 6-24.

[16] Ehrenfeucht, A., and Mycielski, J. Abbreviating proofs by adding new axioms. AMS Bull. 77 (1971), 366-367.

[17] Polya, G. Heuristic reasoning in the theory of numbers. Amer. Math. Mon. 66 (1959), 375-384.

[18] Einstein, A. Remarks on Bertrand Russell's theory of knowledge. In The Philosophy of Bertrand Russell, P. A. Schilpp, Ed., Northwestern U., Evanston, Ill., 1944, pp. 277-291.

[19] Hawkins, D. Mathematical sieves. Scientific American 199, 6 (Dec. 1958), 105-112.

[20] Kolmogorov, A. N. Logical basis for information theory and probability theory. IEEE Trans. IT-14 (1968), 662-664.

[21] Martin-Löf, P. Algorithms and randomness. Rev. of Internat. Statist. Inst. 37 (1969), 265-272.

[22] Loveland, D. W. A variant of the Kolmogorov concept of complexity. Inform. and Contr. 15 (1969), 510-526.

[23] Chaitin, G. J. On the difficulty of computations. IEEE Trans. IT-16 (1970), 5-9.

[24] Willis, D. G. Computational complexity and probability constructions. J. ACM 17, 2 (April 1970), 241-259.

[25] Zvonkin, A. K., and Levin, L. A. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Math. Surveys 25, 6 (Nov.-Dec. 1970), 83-124.

[26] Schnorr, C. P. Zufälligkeit und Wahrscheinlichkeit: Eine algorithmische Begründung der Wahrscheinlichkeitstheorie. Springer, Berlin, 1971.

[27] Fine, T. L. Theories of Probability: An Examination of Foundations. Academic Press, New York, 1973.

[28] Chaitin, G. J. Information-theoretic computational complexity. IEEE Trans. IT-20 (1974), 10-15.

[29] DeLong, H. A Profile of Mathematical Logic. Addison-Wesley, Reading, Mass., 1970, Sec. 28.2, pp. 208-209.

[30] Davis, M. Hilbert's tenth problem is unsolvable. Amer. Math. Mon. 80 (1973), 233-269.

[31] Post, E. Recursively enumerable sets of positive integers and their decision problems. In The Undecidable, M. Davis, Ed., Raven Press, Hewlett, N.Y., 1965, pp. 305-337.

[32] Minsky, M. L. Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs, N.J., 1967, Sec. 12.2-12.5, pp. 222-232.

[33] Shoenfield, J. R. Mathematical Logic. Addison-Wesley, Reading, Mass., 1967, Sec. 1.2, pp. 2-6.

[34] Mendelson, E. Introduction to Mathematical Logic. Van Nostrand Reinhold, New York, 1964, pp. 29-30.

[35] Russell, B. Mathematical logic as based on the theory of types. In From Frege to Gödel, J. van Heijenoort, Ed., Harvard U. Press, Cambridge, Mass., 1967, pp. 150-182.

[36] Lin, S., and Rado, T. Computer studies of Turing machine problems. J. ACM 12, 2 (April 1965), 196-212.

[37] Loveland, D. W. On minimal-program complexity measures. Conf. Rec. of the ACM Symposium on Theory of Computing, Marina del Rey, California, May 1969, pp. 61-65.

[38] Rogers, H. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York, 1967.

[39] Gödel, K. What is Cantor's continuum problem? In Philosophy of Mathematics, Benacerraf, P., and Putnam, H., Eds., Prentice-Hall, Englewood Cliffs, N.J., 1964, pp. 258-273.

[40] Schwartz, J. T. A short survey of computational complexity theory. Notes, Courant Institute of Mathematical Sciences, NYU, New York, 1972.

[41] Schwartz, J. T. On Programming: An Interim Report on the SETL Project. Installment I: Generalities. Lecture Notes, Courant Institute of Mathematical Sciences, NYU, New York, 1973.

[42] Chaitin, G. J. A theory of program size formally identical to information theory. Res. Rep. RC4805, IBM Res. Center, Yorktown Heights, N.Y., 1974.

Received October 1971; revised July 1973.
A NOTE ON MONTE CARLO PRIMALITY TESTS AND ALGORITHMIC INFORMATION THEORY

Communications on Pure and Applied Mathematics 31 (1978), pp. 521-527

Gregory J. Chaitin
IBM Thomas J. Watson Research Center

Jacob T. Schwartz1
Courant Institute of Mathematical Sciences

1 The second author has been supported by US DOE, Contract EY-76-C-02-3077*000. We wish to thank John Gill III and Charles Bennett for helpful discussions. Reproduction in whole or in part is permitted for any purpose of the United States Government.
Abstract

Solovay and Strassen, and Miller and Rabin have discovered fast algorithms for testing primality which use coin-flipping and whose conclusions are only probably correct. On the other hand, algorithmic information theory provides a precise mathematical definition of the notion of random or patternless sequence. In this paper we shall describe conditions under which if the sequence of coin tosses in the Solovay-Strassen and Miller-Rabin algorithms is replaced by a sequence of heads and tails that is of maximal algorithmic information content, i.e., has maximal algorithmic randomness, then one obtains an error-free test for primality. These results are only of theoretical interest, since it is a manifestation of the Gödel incompleteness phenomenon that it is impossible to "certify" a sequence to be random by means of a proof, even though most sequences have this property. Thus by using certified random sequences one can in principle, but not in practice, convert probabilistic tests for primality into deterministic ones.

1. Algorithmic Information Theory


To prepare for discussion of the Solovay{Strassen and Miller{Rabin
algorithms, we rst summarize some of the basic concepts of algorithmic
information theory 1]{ 4].2
Consider a universal Turing machine U whose programs are in bi-
nary. By \universal" we mean that for any other Turing machine M
whose programs p are in binary there is a pre x  such that U (p)
always carries out the same computation as M (p).
I (X ), the algorithmic information content of X , is de ned to be
the size in bits of the smallest programs for U to compute X . There
is absolutely no restriction on the running time or storage space used
by these programs. If X is a nite object such as a natural number
or bit string, this includes the proviso that U halt after printing X .
If X is an in nite object such as a set of natural numbers or of bit
strings, then of course U does not halt. Sets, as opposed to sequences,
1 The second author has been supported by US DOE, Contract EY-76-C-02-
3077*000. We wish to thank John Gill III and Charles Bennett for helpful discus-
sions. Reproduction in whole or in part is permitted for any purpose of the United
States Government.
2 We could equally well have used in this paper the newer formalism of 7], in
which programs are \self-delimiting."
A Note on Monte Carlo Primality Tests 337
may have their members printed in arbitrary order. X can also be an
r.e. function f  in that case U prints the graph of f , i.e., the set of all
ordered pairs hx f (x)i. Note that variations in the de nition of U give
rise to at most O(1) dierences in the resulting I , by the de nition of
universality.
It is easy to show (cf. 1]{ 4]) that the maximum value of I (s) taken
over all n-bit strings s is equal to n + O(1), and that an overwhelm-
ing majority of the s of length n have I (s) very close to n. Such s
have maximum information content or \entropy" and are highly ran-
dom, patternless, incompressible, and typical. They are said to be
\algorithmically random." The greater the dierence between I (s) and
the length of s, the less random s is, the more atypical it is, and the
more pattern it has. It is convenient to say that \s is c-random" if
I (s)  n ; c, where n is the length of s. Less than 2n;c n-bit strings
are not c-random. As for natural numbers, I (n)  log2 n + O(1) and
most n have I (n) very close to log2 n. Strangely enough, though most
strings are random, it is impossible to prove that speci c strings have
this property! For an explanation of this paradox see 1]{ 6].
2. The Solovay-Strassen and Miller-Rabin Algorithms [8]-[10]

The general form of these algorithms is as follows: To test whether n is prime, take k natural numbers uniformly distributed between 1 and n − 1, inclusive, and for each one i check whether the predicate W(i, n) holds. (Read "i is a witness of n's compositeness.") If so, n is composite. If not, n is prime with probability 1 − 2^(−k). This is because, as proved in [8]-[10], at least half the i's from 1 to n − 1 are in fact witnesses of n's compositeness, if n is indeed composite, and none of them are if n is prime. The definition of W is different in the Solovay-Strassen and the Miller-Rabin algorithms, but both algorithms are of this form, where W(i, n) can be computed quickly, i.e., the running time of a program which computes W(i, n) is bounded by a polynomial in the size of n, in other words, by a polynomial in log n.
We shall now show that if suciently long random sequences are
338 Part V|Technical Papers on Blank-Endmarker Programs
supplied, the probabilistic reasoning of 8]{ 10] can be converted into
a rigorous proof of primality. To state our precise results, we need to
make the following de nition:
De nition 1. Let s be an m-bit sequence, and let J be an integer.
Find the smallest integer k such that (J ; 1)k+1 > 2m ; 1, and (by
converting s to base J ; 1 representation) nd the unique sequence
dk dk;1 : : :d0 of base J ; 1 digits such that
X
di(J ; 1)i = s:
0ik
Calculate
Z (s J ) = :W (1 + d0 J )&    &:W (1 + dk;1  J )
where W (i n) is as above. Then we say that J passes the s-test for
primality if and only if Z (s J ) is true.
Lemma 1. Let m, J , k, and Z be as in De nition 1. Then the
number of m-bit sequences s for which Z (s J ) is true is 2m if J is prime,
but is not more than 2m+1;k if J is not a prime.
Proof. If J is prime, then W (i J ) is always false, so our assertion
is trivial. Now suppose J is composite, so that W (i J ) is true for at
least (J ; 1)=2 values of i. Since dk (J ; 1)k  2m , i.e.,
0  dk  2m(J ; 1);k 
it follows that the number of s satisfying Z (s J ) is at most
hm i h i
2 (J ; 1);k + 1 (J ; 1)=2]k = 2m + (J ; 1)k 2;k 
or 2m+1;k since (J ; 1)k  2m .
It is now easy to prove our results:
Theorem 1. For all suciently large c, if s is any c-random j (j +
2c)-bit sequence and J any integer whose binary representation is j bits
long, then Z (s J ) if and only if J is a prime.
Theorem 2. For all suciently large c, if s is any c-random
2j (i + c)-bit sequence and J any integer whose binary representation
is j bits long and whose information content I (J ) is not more than i,
then Z (s J ) if and only if J is a prime.
A Note on Monte Carlo Primality Tests 339
Proof of Theorem 1. Denote the cardinality of a set e by writing jej,
and let (m) be the set of all m-bit sequences. Let J be a non-prime
integer j bits long. By Lemma 1,
jfs 2  (j (j + 2c)) : Z (s J )gj  2j (j +2c)+1;(j +2c) :

Hence
jfs 2  (j (j + 2c)) : 9J 2  (j ) J is composite & Z (s J )]gj
 2j (j +2c)+1;2c :
(1)
Since any member s of the set S appearing in (1) can be calculated
uniquely if we are given c and the ordinal number of the position of s
in S expressed as a j (j + 2c) + 1 ; 2c bit string, it follows that
I (s)  j (j + 2c) + 1 ; 2c + 2I (c) + O(1)  j (j + 2c) ; 2c + O(log c):
(The coecient 2 in the term 2I (c) is present because when two strings
are encoded into a single one by concatenating them, it is necessary to
add information indicating where to separate them. The most straight-
forward technique for providing punctuation doubles the length of the
shorter string.) Hence if c is suciently large, no c-random j (j + 2c)
bit string can belong to S .
Proof of Theorem 2. Arguing as in the proof of Theorem 1, let J
be a non-prime integer j bits long such that I (J )  i. By Lemma 1,
jfs 2  (2j (i + c)) : Z (s J )gj  22j (i+c)+1;2(i+c) : (10)
Since any member s of the set S 0 appearing in (10) can be calculated
uniquely if we are given J and the ordinal number of the position of s
in S 0 expressed as a 2j (i + c) + 1 ; 2(i + c) bit string, it follows that
I (s)  2j (i + c) + 1 ; 2(i + c) + 2I (J ) + O(1)  2j (i + c) ; 2c + O(1):
(The coecient 2 in the term 2I (J ) is present for the same reason as in
the proof of Theorem 1.) Hence if c is suciently large, no c-random
2j (i + c) bit sequence can belong to S 0.
340 Part V|Technical Papers on Blank-Endmarker Programs
3. Applications of the Foregoing Results
Let s be a probabilistically determined sequence in which 0's and 1's
appear independently with probabilities  1 ; , where 0 < < 1.
Group s into successive pairs of bits, and then drop all 00 and 11 pairs
and convert each 01 (respectively 10) pair into a 0 (respectively, a 1).
This gives a sequence s0 in which 0's and 1's appear independently with
exactly equal probabilities. If s0 is n bits long, then the probability that
I (s0) < n ; c is less than 2;c  thus c-random sequences can be derived
easily from probabilistic experiments. Theorem 2 gives the number of
potential witnesses of compositeness which must be checked to ensure
that primality for numbers of special form is determined correctly with
high probability (or with certainty, if some oracle gave us a long bit
string known to satisfy the randomness criterion of algorithmic infor-
mation theory). Mersenne numbers N = 2n ; 1 only require checking
O(log n) = O(log log N ) potential witnesses. Fermat numbers
N = 22 + 1 n

only require checking O(log n) = O(log log log N ) potential witnesses.


Eisenstein{Bell3 numbers
N = 222 (n 2's altogether) + 1
:::

only require checking O(log n) = O(logk N ) (for any k) potential


witnesses. A number of the form 10n + k only requires checking
O(log n) + O(log k) potential witnesses.
Concerning Theorem 1 it is worthwhile to remark the following:
Using the extended Riemann Hypothesis, Miller was able to show that
if n is composite, then the rst natural number that is a witness of
n's compositeness (under the Miller{Rabin version of the predicate W )
3 Quotation from Bell 11]: \F. M. G. Eisenstein (1823{1852), a rst-rate arith-
metician, stated (1844) as a problem that there are an innity of primes in the
sequence 2
22 + 1 222 + 1 222 + 1 : : :
Doubtless he had a proof. This looks like the sort of thing an ingenious amateur
might settle. If anyone asks why I have not done it myself|I am neither an amateur
nor ingenious."
A Note on Monte Carlo Primality Tests 341
is always less than O((log n)2). In contrast, we only need to check a
\certi ed" random sample of log2 n + O(1) potential witnesses.

4. Additional Remarks
The central idea of the Solovay{Strassen and Miller{Rabin algorithms
and of the preceding discussion can be expressed as follows: Consider
a speci c propositional formula F in n variables for which we somehow
know that the percentage of satis ability is greater than 75% or less
than 25%. We wish to decide which of these two possibilities is in fact
the case. The obvious way of deciding is to evaluate F at all 2n possible
n-tuples of the variables. But only O(I (F )) data points are necessary
to decide which case holds by sampling, if one posses an algorithmically
random sequence O(nI (F )) bits long. Thus one need only evaluate F
for O(I (F )) n-tuples of its variables, if the random sample is \certi ed."
These algorithms would be even more interesting if it were possible
to show that they are faster than any deterministic algorithms which
accomplish the same task. Gill 12], 13] in fact attacked the problem
of showing that there are tasks which can be accomplished faster by a
Monte Carlo algorithm than deterministically, before the current surge
of interest in these matters caused by the discovery of several proba-
bilistic algorithms which are much better than any known deterministic
ones for the same task.
The discussion of extensible formal systems given in 14] raises the
question of how to nd systematic sources of new axioms, likely to be
consistent with the existing axioms of logic and set theory, which can
shorten the proofs of interesting theorems. From the metamathematical
results of 1]{ 3], we know that no statement of the form \s is c-random"
can be proved if s has a length signi cantly greater than c. This raises
the question of whether statements of the form \s is c-random" are
generally useful new axioms. (Note that Ehrenfeucht and Mycielski 15]
show that by adding any previously unprovable statement X to a formal
system, one always shortens very greatly the lengths of in nitely many
proofs. Their argument is roughly as follows: Consider a proposition
of the form \either X or algorithm A halts," where A in fact halts but
takes a very long time to do so. Previously the proof of this assertion
342 Part V|Technical Papers on Blank-Endmarker Programs
was very long one had to simulate A's computation until it halted.
Now the proof is immediate, for X is an axiom. See also G odel 16].)
Hence it is reasonable to ask whether the addition of axioms \s
is c-random" is likely either to allow interesting new theorems to be
proved, or to shorten the proof of interesting theorems which could
have been proved anyhow (but perhaps by unreachably long proofs).
The following discussion of this issue is very informal and is intended to
be merely suggestive. On the one hand, it is easy to see that interesting
new theorems are probably not obtained in this manner. The argument
is as follows. If it were highly probable that a particular theorem T can
be deduced from axioms of the form \s is c-random," then T could in
fact be proved without extending the axiom system. For even without
extending the axiom system one could show that \if s is random, then
T " holds for many s, and thus T would follow from the fact that most
s are indeed random. In other words, we would have before us a proof
by cases in which we do not know which case holds, but can show that
most do. Hence it seems that interesting new theorems will probably
not be obtained by extending a formal system in this way.
As to the possibility of interesting proof-shortenings, we can note
that Ehrenfeucht{Mycielski theorems are not very interesting ones.
Quick Monte Carlo algorithms for primality suggest another possibility.
Perhaps adding axioms of the form \s is random" makes it possible to
obtain shorter proofs of primality? Pratt's work 17] suggests caution,
but the following more general conjecture seems reasonable. If it is
in fact the case that for some tasks Monte Carlo algorithms are much
better than deterministic ones, then it may also be the case that some
interesting theorems have much shorter proofs when a formal system is
extended by adding axioms of the form \s is random."

References
1] Chaitin, G. J., Information-theoretic computational complexity,
IEEE Trans. Info. Theor. IT-20, 1974, pp. 10{15.
2] Chaitin, G. J., Information-theoretic limitations of formal sys-
tems, J. ACM 21, 1974, pp. 403{424.
A Note on Monte Carlo Primality Tests 343
3] Chaitin, G. J., Randomness and mathematical proof, Sci. Amer.
232, 5, May 1975, pp. 47{52.
4] Schwartz, J. T., Complexity of statement, computation and proof,
AMS Audio Recordings of Mathematical Lectures 67, 1972.
5] Levin, M., Mathematical logic for computer scientists, MIT
Project MAC TR-131, June 1974, pp. 145{147, 153.
6] Davis, M., What is a computation? in Mathematics Today |
Twelve Informal Essays, Springer-Verlag, New York, to appear
in 1978.
7] Chaitin, G. J., Algorithmic information theory, IBM J. Res. De-
velop. 21, 1977, pp. 350{359, 496.
8] Solovay, R., and Strassen, V., A fast Monte-Carlo test for primal-
ity, SIAM J. Comput. 6, 1977, pp. 84{85.
9] Miller, G. L., Riemann's hypothesis and tests for primality, J.
Comput. Syst. Sci. 13, 1976, pp. 300{317.
10] Rabin, M. O., Probabilistic algorithms in Algorithms and Com-
plexity | New Directions and Recent Results, J. F. Traub (ed.),
Academic Press, New York, 1976, pp. 21{39.
11] Bell, E. T., Mathematics | Queen and Servant of Science, Mc-
Graw-Hill, New York, 1951, pp. 225{226.
12] Gill, J. T. III, Computational complexity of probabilistic Turing
machines, Proc. 6th Annual ACM Symp. Theory of Computing,
Seattle, Washington, April 1974, pp. 91{95.
13] Gill, J. T. III, Computational complexity of probabilistic Turing
machines, SIAM J. Comput. 6, 1977, pp. 675{695.
14] Davis, M., and Schwartz, J. T., Correct-Program Technology/
Extensibility of Veriers|Two Papers on Program Verication,
Courant Computer Science Report #12, Courant Institute of
Mathematical Sciences, New York University, September 1977.
344 Part V|Technical Papers on Blank-Endmarker Programs
15] Ehrenfeucht, A., and Mycielski, J., Abbreviating proofs by adding
new axioms, AMS Bull. 77, 1971, pp. 366{367.
16] G odel, K., On the length of proofs in The Undecidable|Basic Pa-
pers on Undecidable Propositions, Unsolvable Problems and Com-
putable Functions, M. Davis (ed.), Raven Press, Hewlett, New
York, 1965, pp. 82{83.
17] Pratt, V. R., Every prime has a succinct certicate, SIAM J.
Comput. 4, 1975, pp. 214{220.

Received January, 1978.


INFORMATION-
THEORETIC
CHARACTERIZATIONS
OF RECURSIVE INFINITE
STRINGS
Theoretical Computer Science 2 (1976),
pp. 45{48

Gregory J. Chaitin
IBM Thomas J. Watson Research Center
Yorktown Heights, N.Y. 10598, USA

Abstract
Loveland and Meyer have studied necessary and su!cient conditions
for an innite binary string x to be recursive in terms of the program-
size complexity relative to n of its n-bit prexes xn . Meyer has shown
that x is recursive i 9c 8n K (xn=n)  c, and Loveland has shown
that this is false if one merely stipulates that K (xn =n)  c for innitely

345
346 Part V|Technical Papers on Blank-Endmarker Programs
many n. We strengthen Meyer's theorem. From the fact that there
are few minimal-size programs for calculating a given result, we obtain
a necessary and su!cient condition for x to be recursive in terms of
the absolute program-size complexity of its prexes: x is recursive i
9c 8n K (xn)  K (n) + c. Again Loveland's method shows that this
is no longer a su!cient condition for x to be recursive if one merely
stipulates that K (xn)  K (n) + c for innitely many n.

N = f0 1 2 : : :g is the set of natural numbers, S = f(, 0, 1, 00, 01,


10, 11, 000 : : :g is the set of strings, and X is the set of in nite strings.
All strings and in nite strings are binary. The variables c, i, m and n
range over N  the variables p, q, s and t range over S  and the variable
x ranges over X .
jsj is the length of a string s, and sn and xn are the pre xes of
length n of s and x. x is recursive i there is a recursive function
f : N ! S such that xn = f (n) for all n. B (n) is the nth element
of S  the function B : N ! S is a recursive bijection. The quantity
jB (n)j = blog2 (n + 1)c plays an important role in this paper.
A computer C is a partial recursive function C : S S ! S .
C (p q) is the output resulting from giving C the program p and the
data q. The relative complexity KC : S S ! N is de ned as follows:
KC (s=t) = min jpj (C (p t) = s). The complexity KC : S ! N is
de ned as follows: KC (s) = KC (s=(). KC (s) is the length of the
shortest program for calculating s on C without any data. A computer
U is universal i for each computer C there is a constant c such that
KU (s=t)  KC (s=t) + c for all s and t.
Pick a standard G odel numbering of the partial recursive functions
C : S S ! S , and de ne the computer U as follows. U (0i  q) = (,
and U (0i 1p q) = Ci(p q), where Ci is the ith computer. U is universal,
and is our standard computer for measuring complexities. The \U " in
\KU " is henceforth omitted.
The following situation occurs in the proofs of our main theorems,
Theorems 3 and 6. There is an algorithm A for enumerating a set of
strings. There are certain inputs to A, e.g. n and m. A(n m) denotes
Characterizations of Recursive Innite Strings 347
the enumeration (ordered set) produced by A from the inputs n and m.
And ind(s A(n m)) denotes the index of s in the enumeration A(n m)
ind(: A(: :)) : S N N ! N is a partial recursive function. A key
step in the proofs of Theorems 3 and 6 is that if s 2 A(n m) and one
knows A, n, m, and ind(s A(n m)), then one can calculate s.
Theorem 1. (a) 9c 8s t K (s=t)  jsj + c.
(b) 8n t 9s jsj = n and K (s=t)  n.
(c) There is a recursive function f : S S N ! N such that
f (s t n)  f (s t n + 1) and K (s=t) = limn!1 f (s t n).
(d) 9c 8s K (B (jsj)) ; c  K (s)  jsj + c.
Proof. (a) There is a computer C such that C (p q ) = p for all p
and q. Thus KC (s=t) = jsj, and K (s=t)  KC (s=t) + c = jsj + c.
(b) There are 2n strings s of length n, but only 2n ; 1 programs for
U (: t) of length < n. Thus at least one s of length n needs a program
of length  n.
(c) Since U is a partial recursive function, its graph fhp q U (p q)ig
is an r.e. set. Let Un be the rst n triples in a xed recursive enumer-
ation of the graph of U . Recall the upper bound jsj + c on K (s=t).
Take
f (s t n) = minfjsj + cg  fjpj : hp t si 2 Un g:
(d) Theorem 1(a) yields K (s) = K (s=()  jsj + c. And there is a
computer C such that C (p q) = B (jU (p q)j). Thus KC (B (jsj))  K (s)
and K (B (jsj))  K (s) + c. 2
Theorem 2. (a) If x is recursive, then there is a c such that
K (xn=B (n))  c and K (xn)  K (B (n)) + c for all n.
(b) For each c there are only nitely many x such that 8n,
K (xn=B (n))  c, and each of these x is recursive (Meyer).
(c) There is a c such that nondenumerably many x have the property
that K (xn=B (n))  c and K (xn )  K (B (n)) + c for in nitely many n
(Loveland).
Proof. (a) By de nition, if x is recursive there is a recursive function
f : N ! S such that f (n) = xn . There is a computer C such that
C (p q) = f (B ;1(q)). Thus KC (xn =B (n)) = 0 and K (xn=B (n))  c.
There is also a computer C such that C (p q) = f (B ;1(U (p q))). Thus
KC (xn) = K (B (n)) and K (xn)  K (B (n)) + c.
(b) See 4, pp. 525{526].
348 Part V|Technical Papers on Blank-Endmarker Programs
(c) See 4, pp. 515{516]. 2
Theorem 3. Consider a computer D. A t-description of s is a string
p such that D(p t) = s. There is a recursive function fD : N ! N with
the property that no string s has more than fD (n) t-descriptions of
length < K (s=t) + n.
Proof. There are < 2n t-descriptions of length < n. Thus there
are < 2n;m strings s with  2m t-descriptions of length < n. Since
the graph of D is an r.e. set, given n, m and t, one can recursively
enumerate the < 2n;m strings having  2m t-descriptions of length
< n. Pick an algorithm A(n m t) for doing this.
There is a computer C with the following property. Suppose s has
 2m t-descriptions of length < n. Then s = C (0jpj1pq t), where
p = B (m) and q is the ind(s A(n m t)) th string of length n ; m. C
recovers information from this program in the following order: jB (m)j,
m, n ; m, n, ind(s A(n m t)), and s. Thus
KC (s=t)  j0jpj1pqj = jqj + 2jpj + 1 = n ; m + 2jB (m)j + 1
and K (s=t)  n ; m + 2jB (m)j + c. Note that A, C and c all depend
on D.
We restate this. Suppose s has  2m t-descriptions of length <
K (s=t) + n. Then K (s=t)  K (s=t) + n ; m + 2jB (m)j + c. I.e.,
m ; 2jB (m)j ; c  n. Let fD (n) be two raised to the least m such that
m ; 2jB (m)j ; c > n. s cannot have  fD (n) t-descriptions of length
< K (s=t) + n. 2
Theorem 4. There is a recursive function f4 : N ! N with the
property that for no n and m are there more than f4(m) strings s of
length n with K (s) < K (B (n)) + m.
Proof. In Theorem 3 consider only (-descriptions and take D to be
the computer de ned as follows: D(p q) = B (jU (p q)j). It follows that
there is a recursive function fD : N ! N with the property that for no n
and m are there more than fD (m) programs p of length < K (B (n))+ m
such that jU (p ()j = n. f4 = fD is the function whose existence we
wished to prove. 2
Theorem 5. (a) There is an algorithm A5(n m) that enumerates
the pre xes of length n of strings s of length 2n such that K (si) 
jB (i)j + m for all i 2 n 2n].
Characterizations of Recursive Innite Strings 349
(b) There is a recursive function f5 : N ! N with the property that
A5(n m) never has more than f5(m) elements.
Proof. (a) The existence of A5 is an immediate consequence of the
fact that K (s) can be recursively approximated arbitrarily closely from
above (Theorem 1(c)).
(b) By using the counting argument that established Theorem 1(b),
it also follows that 9c 8n 9i 2 n 2n] such that K (B (i)) > jB (i)j ; c.
For such i the condition that K (si)  jB (i)j + m implies K (si) <
K (B (i)) + m + c, which by Theorem 4 can hold for at most f4(m + c)
dierent strings si of length i. Thus there are at most f5(m) elements
in A5(n m), where f5(m) = f4(m + c). 2
Theorem 6. For each c there are only nitely many x such that
8n K (xn)  jB (n)j + c, and each of these x is recursive.
Proof. By hypothesis x has the property that 8n K (xn)  jB (n)j +
c. Thus 8n xn 2 A5(n c), where A5 is the enumeration algorithm of
Theorem 5(a). There is a computer C (depending on c) such that
xn = C (B (ind(xn A5(n c))) B (n)) for all n:
Thus
KC (xn=B (n))  jB (ind(xn A5(n c)))j  jB (f5(c))j by Theorem 5(b).
As KC (xn=B (n))  jB (f5(c))j for all n, it follows that there is a c0 such
that K (xn=B (n))  c0 for all n. Applying Theorem 2(b) we conclude
that x is recursive and there can only be nitely many such x. 2
Theorem 7. x is recursive i 9c 8n K (xn)  K (B (n)) + c.
Proof. The \only if" is Theorem 2(a). The \if" follows from
Theorem 6 and the fact that 9c 8n K (xn)  K (B (n)) + c implies
9c 8n K (xn)  jB (n)j + c, which is an immediate consequence of The-
orem 1(a). 2
Can this information-theoretic characterization of recursive in nite
strings be reformulated in terms of other de nitions of program-size
complexity? It is easy to see that Theorem 7 also holds for Schnorr's
process complexity 5]. This is not the case for the algorithmic entropy
H (see 3]). Although recursive x satisfy 9c 8n H (xn)  H (B (n)) + c,
Solovay (private communication) has announced there is a nonrecursive
x that also has this property.
350 Part V|Technical Papers on Blank-Endmarker Programs
Theorems 6 and 7 reveal a complexity gap, because K (B (n)) is
sometimes much smaller than jB (n)j.

References
1] G. J. Chaitin, Information-theoretic aspects of the Turing de-
grees, Abstract 72T-E77, AMS Notices 19 (1972) A-601, A-602.
2] G. J. Chaitin, There are few minimal descriptions, A necessary
and sucient condition for an in nite binary string to be recur-
sive, (abstracts), Recursive Function Theory Newsletter (January
1973) 13{14.
3] G. J. Chaitin, A theory of program size formally identical to in-
formation theory, J. ACM 22 (1975) 329{340.
4] D. W. Loveland, A variant of the Kolmogorov concept of com-
plexity, Information and Control 15 (1969) 510{526.
5] C. P. Schnorr, Process complexity and eective random tests, J.
Comput. System Sci. 7 (1973) 376{388.

Communicated by A. Meyer
Received November 1974
Revised March 1975
PROGRAM SIZE,
ORACLES, AND THE
JUMP OPERATION
Osaka Journal of Mathematics 14 (1977),
pp. 139{149

Gregory J. Chaitin

Abstract
There are a number of questions regarding the size of programs for cal-
culating natural numbers, sequences, sets, and functions, which are best
answered by considering computations in which one is allowed to con-
sult an oracle for the halting problem. Questions of this kind suggested
by work of T. Kamae and D. W. Loveland are treated.

351
352 Part V|Technical Papers on Blank-Endmarker Programs
1. Computer Programs, Oracles, Informa-
tion Measures, and Codings
In this paper we use as much as possible Rogers' terminology and no-
tation 1, pp. xv{xix]. Thus N = f0 1 2 : : :g is the set of (natural)
numbers i, j , k, n, v, w, x, y, z are elements of N  A, B , X are
subsets of N  f , g, h are functions from N into N  ',  are partial
functions from N into N  hx1 : : :  xki denotes the ordered k-tuple con-
sisting of the numbers x1 : : :  xk  the lambda notation x : : :x : : :] is
used to denote the partial function of x whose value is : : :x : : : and the
mu notation x : : : x : : :] is used to denote the least x such that : : : x : : :
is true.
The size of the number x, denoted lg(x), is de ned to be the number
of bits in the xth binary string. The binary strings are (, 0, 1, 00, 01,
10, 11, 000 : : : Thus lg(x) is the integer part of log2(x + 1). Note that
there are 2n numbers x of size n, and 2n ; 1 numbers x of size less than
n.
We are interested in the size of programs for a certain class of com-
puters. The zth computer in this class is de ned in terms of '(2) z
X
1, pp. 128{134], which is the two-variable partial X -recursive func-
tion with G odel number z. These computers use an oracle for deciding
membership in the set X , and the zth computer produces the output
'(2)
z (x y ) when given the program x and the data y . Thus the output
X
depends on the set X as well as the numbers x and y.
We now choose the standard universal computer U that can simulate
any other computer. U is de ned as follows:
U X ((2x + 1)2z ; 1 y) = '(2)X
z (x y ):
Thus for each computer C there is a constant c such that any program
of size n for C can be simulated by a program of size  n + c for U .
Having picked the standard computer U , we can now de ne the
program size measures that will be used throughout this paper.
The fundamental concept we shall deal with is I (=X ), which is the
number of bits of information needed to specify an algorithm relative
to X for the partial function , or, more briey, the information in 
relative to X . This is de ned to be the size of the smallest program for
Program Size, Oracles, and the Jump Operation 353
:
I (=X ) = minlg(x) ( = y U X (x y)]):
Here it is understood that I (=X ) = 1 if  is not partial X -recursive.
I (x ! y=X ), which is the information relative to X to go from the
number x to the number y, is de ned as follows:
I (x ! y=X ) = min I (=X ) ((x) = y):
And I (x=X ), which is the information in the number x relative to the
set X , is de ned as follows:
I (x=X ) = I (0 ! x=X ):
Finally I (=X ) is used to de ne three versions I (A=X ), Ir(A=X ),
and If (A=X ) of the information relative to X of a set A. These corre-
spond to the three ways of naming a set 1, pp. 69{71]: by r.e. indices,
by characteristic indices, and by canonical indices. The rst de nition
is as follows:
I (A=X ) = I (x if x 2 A then 1 else unde ned]=X ):
Thus I (A=X ) < 1 i A is r.e. in X . The second de nition is as follows:
Ir (A=X ) = I (x if x 2 A then 1 else 0]=X ):
Thus Ir(A=X ) < 1 i A is recursive in X . And the third de nition,
which applies only to nite sets, is as follows:
X
If (A=X ) = I ( 2x=X ):
x2A
The following notational convention is used: I (), I (x ! y), I (x),
I (A), Ir (A), and If (A) are abbreviations for I (=), I (x ! y=),
I (x=), I (A=), Ir (A=), and If (A=), respectively.
We use the coding   of nite sequences of numbers into individual
numbers 1, p. 71]:   is an eective one-one mapping from S1k=0 N k
onto N . And we also use the notation f (x) for   of the sequence
hf (0) f (1) : : :  f (x ; 1)i 1, p. 377] for any function f , f (x) is the
code number for the nite initial segment of f of length x.
The following theorems, whose straight-forward proofs are omitted,
give some basic properties of these concepts.
Theorem 1.
354 Part V|Technical Papers on Blank-Endmarker Programs
(a) I (x=X )  lg(x) + c
(b) There are less than 2n numbers x with I (x=X ) < n.
(c) jI (x=X ) ; I (y=X )j  2 lg(jx ; y j) + c

(d) The set of all true propositions of the form \I (x ! y=X )  z" is
r.e. in X .
(e) I (x ! y=X )  I (y=X ) + c
Recall that there are 2n numbers x of size n, that is, there are 2n
numbers x with lg(x) = n. In view of (a) and (b) most x of size n have
I (x=X ) n. Such x are said to be X -random. In other words, x is
said to be X -random if I (x=X ) is approximately equal to lg(x) most
x have this property.
Theorem 2.
(a) I ( (hx yi))  I ( (hy xi)) + c
(b) I ( (hx yi) !  (hy xi))  c
(c) I (x)  I ( (hx yi)) + c
(d) I ( (hx yi) ! x)  c
Theorem 3.
(a) I (x ! (x)=X )  I (=X )
(b) For each  that is partial X -recursive there is a c such that
I ((x)=X )  I (x=X ) + c.
(c) I (x ! f (x)=X )  I (f=X ) + c
(d) I (f (x) ! x=X )  c and I (x=X )  I (f (x)=X ) + c
Theorem 4.
(a) I (x=X )  I (y x]=X ) and I (y x]=X )  I (x=X ) + c
(b) I (x=X )  If (fxg=X ) + c and If (fxg=X )  I (x=X ) + c
Program Size, Oracles, and the Jump Operation 355
(c) I (x=X )  Ir (fxg=X ) + c and Ir (fxg=X )  I (x=X ) + c
(d) I (x=X )  I (fxg=X ) + c and I (fxg=X )  I (x=X ) + c
(e) Ir(A=X )  If (A=X ) + c and I (A=X )  Ir(A=X ) + c
See 2] for a dierent approach to de ning program size measures
for functions, numbers, and sets.

2. The Jump and Limit Operations


The jump X 0 of a set X is de ned in such a manner that having an
oracle for deciding membership in X 0 is equivalent to being able to solve
the halting problem for algorithms relative to X 1, pp. 254{265].
In this paper we study a number of questions regarding the infor-
mation I () in  relative to the empty set, that are best answered
by considering I (=0) and I (=00), which are the information in 
relative to the halting problem and relative to the jump of the halt-
ing problem. The thesis of this paper is that in order to understand
I (=X ) with X = , which is the case of practical signi cance, it is
sometimes necessary to jump higher in the arithmetical hierarchy to
X = 0 or X = 00.
The following theorem, whose straight-forward proof is omitted,
gives some facts about how the jump operation aects program size
measures.
Theorem 5.
(a) xy I (x ! y=X )] and x I (x=X )] are X 0 -recursive.
(b) I (=X 0)  I (=X ) + c
(c) For each n consider the least x such that lg(x)  n and I (x=X ) 
n. This x has the property that lg(x) = n, I (x=X )  n + c1, and
I (x=X 0)  I (n=X 0) + c2  lg(n) + c3.
(d) I (A=X 0)  I (A=X ) + c
(e) Ir(A=X 0)  I (A=X ) + c
356 Part V|Technical Papers on Blank-Endmarker Programs
(f) If A is nite If (A=X 0)  I (A=X ) + c
(g) I (X 0=X )  c and Ir(X 0=X ) = 1
It follows from (b) that X 0-randomness implies X -randomness.
However (c) shows that the converse is false: there are X -random num-
bers that are not at all X 0-random.
Having examined the jump operation, we now introduce the limit
operation. The following theorem shows that the limit operation is in
a certain sense analogous to the jump operation. This theorem is the
tool we shall use to study work of Kamae and Loveland in the following
sections.
De nition. Consider a function f . limx f (x) denotes a number z
having the property that there is an x0 such that f (x) = z if x  x0.
If no such z exists, limx f (x) is unde ned. In other words limx f (x) is
the value that f (x) assumes for almost all x (if there is such a value).
Theorem 6.
(a) If I (z=X 0) < n, then there is a function f such that z = limx f (x)
and I (f=X ) < n + c.
(b) If I (f=X ) < n and limx f (x) = z, then I (z=X 0) < n + c.
Proof.
(a) By hypothesis there is a program w of size less than n such that
U X (w 0) = z. Given w and an arbitrary number x, one cal-
0

culates f (x) using the oracle for membership in X as follows.


Choose a xed algorithm relative to X for enumerating X 0.
One performs x steps of the computation U X (w 0). This is done
0

using a fake oracle for X 0 that answers that v is in X 0 i v is


obtained during the rst x steps of the algorithm relative to X
for enumerating X 0. If a result is obtained by performing x steps
of U X (w 0) in this manner, that is the value of f (x). If not, f (x)
0

is 0.
It is easy to see that limx f (x) = U X (w 0) = z and I (f=X ) 
0

lg(w) + c < n + c.
Program Size, Oracles, and the Jump Operation 357
(b) By hypothesis there is a program w of size less than n such that
limx U X (w x) = z. Given w one can use the oracle for X 0 to
calculate z. At stage i one asks the oracle whether there is a
j > i such that U X (w j ) 6= U X (w i). If so, one goes to stage
i + 1 and tries again. If not, one is nished because U X (w i) = z.
This shows that I (z=X 0)  lg(w) + c < n + c. Q.E.D.
See 3] for applications of oracles and the jump operation in the
context of self-delimiting programs for sets and probability constructs
in this paper we are only interested in programs with endmarkers.

3. The Kamae Information Measure


In this section we study an information measure K (x) suggested by
work of Kamae 4] (see also 5]).
I (y ! x) is less than or equal to I (x) + c, and it is natural to call
I (x) ; I (y ! x) the degree to which y is helpful to x. Let us look at
some examples. By de nition I (x) = I (0 ! x), and so 0 is no help at
all. On the other hand some y are very helpful: I (y ! x) < c for all
those y whose prime factorization has 2 raised to the power x. Thus
every x has in nitely many y that are extremely helpful to it.
Kamae proves in 4] that for each c there is an x such that I (y !
x) < I (x) ; c holds for almost all y. In other words, for each c there
is an x such that almost all y are helpful to x more than c. This is
surprising one would have expected there to be a c with the property
that every x has in nitely many y that are helpful to it less than c,
that is, in nitely many y with I (y ! x) > I (x) ; c.
We shall now study how I (y ! x=X ) varies when x is held xed
and y goes to in nity. Note that I (y ! x=X ) is bounded (in fact, by
I (x=X ) + c). This suggests the following de nition: K (x=X ) is the
greatest w such that I (y ! x=X ) = w holds for in nitely many y. In
other words, K (x=X ) is the least v such that I (y ! x=X )  v holds
for almost all y.
Note that there are less than 2n numbers x with K (x=X ) < n, so
that K (x=X ) clearly measures bits of information in some sense. The
trivial inequality K (x=X )  I (x=X ) + c relates K (x=X ) and I (x=X ),
but the following theorem shows that K (x=X ) is actually much more
358 Part V|Technical Papers on Blank-Endmarker Programs
intimately related to the information measures I (x=X 0) and I (x=X 00)
than to I (x=X ).
Theorem 7.
(a) K (x=X )  I (x=X 0) + c
(b) I (x=X 00)  K (x=X ) + c
Proof.
(a) Consider a number x0. By Theorem 6a there is a function f such
that limy f (y) = x0 and I (f=X )  I (x0=X 0) + c. Hence I (y !
f (y)=X )  I (f=X )  I (x0=X 0) + c. In as much as f (y) = x0
for almost all y, it follows that I (y ! x0=X )  I (x0=X 0 ) + c for
almost all y. Hence K (x0=X )  I (x0=X 0 ) + c.
(b) By using an oracle for membership in X 0 one can decide whether
or not I (y ! x=X ) < n. Thus by using an oracle for membership
in X 00 one can decide whether or not y0 has the property that
I (y ! x=X ) < n for all y  y0. It follows that the set An =
fxjK (x=X ) < ng is r.e. in X 00 uniformly in n.
Suppose that x0 2 An. Consider j = 2n + k, where k=(the
position of x0 in a xed X 00-recursive enumeration of An uniform
in n). Since there are less than 2n numbers in An, k < 2n and
2n  j < 2n+1. Thus one can recover from j the values of n
and k. And if one is given j one can calculate x0 using an oracle
for membership in X 00. Thus if K (x0=X ) < n, then I (x0=X 00) 
lg(j ) + c1  n + c2. Q.E.D.
What is the signi cance of this theorem? First of all, note that
most x are 00-random and thus have lg(x) I (x=00) I (x=0)
I (x) K (x). In other words, there is a c such that every n has the
property that at least 99% of the x of size n have all four quantities
I (x=00), I (x=0), I (x), and K (x) inside the interval between n ; c
and n + c. These x are \normal" because there are in nitely many
y that do not help x at all, that is, there are in nitely many y with
I (y ! x) > I (x) ; c.
Now let us look at the other extreme, at the \abnormal" x discov-
ered by Kamae.
Program Size, Oracles, and the Jump Operation 359
Consider the rst -random number of size n, where n itself is 00-
random. More precisely, let xn be the rst x such that lg(x) = n and
I (x)  n. (There is such an xn, because there are 2n numbers x of size
n, and at most 2n ; 1 of these x have I (x) < n:) Moreover, we stipulate
that n itself be 00-random, so that I (n=00) lg(n).
It is easy to see that these xn have the property that lg(n)
I (xn=00) I (xn=0) K (xn) and I (xn) lg(xn) = n. Thus most
y help these xn a great deal, because I (xn) n and for almost all y,
I (y ! xn ) < log n.
2
Theorem 7 enables us to make very precise statements about K (x)
when I (x=00) I (x=0). But where is K (x) in the interval between
I (x=00) and I (x=0) when this interval is wide? The following theorem
shows that if I (x=00) and I (x=0) are many orders of magnitude apart,
then K (x) will be of the same order of magnitude as I (x=0). To be
more precise, Theorems 7a and 8 show that
1 I (x=0) ; c  K (x)  I (x=0) + c:
2
Theorem 8. If K (x0=X ) < n, then I (x0=X 0 ) < 2n + c.
Proof. Consider a xed number n and a xed set X . Let #x be
the cardinality of the set Bx = fzjI (x ! z=X ) < ng. Note that #x is
bounded (in fact, by 2n ; 1). Let i be the greatest w such that #x = w
holds for in nitely many x, which is also the least v such that #x  v
holds for almost all x. Let j = z #x  i if x  z], and let A be the
in nite set of x greater than or equal to j such that #x = i. Thus Bx
has exactly i elements if x 2 A.
It is not dicult to see that if one knows n and i, then one can
calculate j by using an oracle for membership in X 0. And if one knows
n, i, and j , by using an oracle for membership in X one can enumerate
A
P 2and simultaneously calculate for each x 2 A the canonical index
z (z 2 Bx ) of the i-element set Bx .
De ne J (x) as follows: J (x) = (the greatest w such that I (y !
x=X ) = w holds for in nitely many y 2 A) = (the least v such that
I (y ! x=X )  v holds for almost all y 2 A). It is not dicult to see
from the previous paragraph that if one is given n and i and uses an
oracle for membership in X 0, one can enumerate the set of all x such
that J (x) < n.
360 Part V|Technical Papers on Blank-Endmarker Programs
Note that there are less than 2n numbers x with J (x) < n, and that
if K (x=X ) < n, then J (x) < n. Suppose that x0 has the property that
K (x0=X ) < n. Consider the number k = (2n + i)2n + i2, where i2 =
(the position of x0 in the above-mentioned X 0-recursive enumeration of
fxjJ (x) < ng). Since i < 2n and i2 < 2n , one can recover n, i, and i2
from k.
It is not dicult to see that if one is given k, then one can calculate
x0 using an oracle for membership in X 0. Thus if K (x0=X ) < n, then
I (x0=X 0)  lg(k) + c1 < 2n + c2. Q.E.D.

4. The Loveland Information Measure


De ne L(f=X ) to be maxx I (x ! f (x)=X ), and to be 1 if I (x !
f (x)=X ) is unbounded. This concept is suggested by work of Loveland
6]. Since there are less than 2n functions f with L(f=X ) < n, it is
clear that in some sense L(f=X ) measures bits of information. I (x !
f (x)=X ) is bounded if f is X -recursive, and conversely A. R. Meyer 6,
pp. 525{526] has shown that if I (x ! f (x)=X ) is bounded then f is
X -recursive. Thus L(f=X ) < 1 i I (f=X ) < 1.
But can something more precise be said about the relationship be-
tween L(f ) and I (f )? L(f )  I (f ) + c, but as is pointed out in 6, p.
515], the proof that I (f ) < 1 if L(f ) < 1 is nonconstructive and does
not give an upper bound on I (f ) in terms of L(f ). We shall show that
in fact I (f ) can be enormous for reasonable values of L(f ). The proof
that I (f ) < 1 if L(f ) < 1 may therefore be said to be extremely
nonconstructive.
In 7] it is shown that I (f ) < 1 i there is a c such that
I (f (x)) ; I (x)  c for all x. This result it now also seen to be extremely
nonconstructive, because I (f ) may be enormous for reasonable c.
Furthermore, R. M. Solovay has studied in 8] what is the situation
if the endmarker program size measure I used here is replaced by the
self-delimiting program size measure H of 9]. He shows that there is
a nonrecursive function f such that H (f (x)) ; H (x) is bounded. This
result previously seemed to contrast sharply with the fact that f is re-
cursive if I (x ! f (x)) is bounded 6] or if I (f (x)) ; I (x) is bounded 7].
But now a harmonious whole is perceived since the sucient conditions
Program Size, Oracles, and the Jump Operation 361
for f to be recursive just barely manage to keep I (f ) from being 1.
Theorem 9. If I (k=X 0)  n, then there is a function f such that
L(f=X )  n + c and I (f=X )  k ; c.
Proof. First we de ne the function g as follows: g (x) is the rst
non-zero y such that I (y=X )  x. Note that g is X 0-recursive.
By hypothesis I (k=X 0)  n. Hence I (g(k)=X 0)  I (k=X 0) + c1 
n + c1. By Theorem 6a, there is a function h such that I (h=X )  n + c2
and limx h(x) = g(k). Let x0 = z h(x) = g(k) if x  z]. Thus
h(x) = g(k) if x  x0.
The function f whose existence is claimed is de ned as follows:
(
f (x) = 0h(x) ifif xx < x0, and
 x0 .

Thus f (x) = g(k) if x  x0.


First we obtain a lower bound for I (f=X ). The following inequality
holds for any function f :
I (f (x f (x) 6= 0])=X )  I (f=X ) + c3:
Hence for this particular f we see that I (f=X ) + c3  I (f (x0)=X ) =
I (g(k)=X ). Thus, by the de nition of g, I (f=X )+ c3  I (g(k)=X )  k.
Now to obtain an upper bound for I (x ! f (x)=X ). There are two
cases: x  x0 and x > x0. If x  x0, then f (x) is the code number for a
sequence of x 0's and thus I (x ! f (x)=X )  I (x  (h0ix )]=X ) = c4,
where h0ix denotes a sequence of x 0's. If x > x0, then
I (x ! f (x)=X )
 I (x !  (hh(x) x0 i)=X ) + c5
= I (x !  (hh(x) z h(x) = h(y) if x  y  z]i)=X ) + c5
 I (h=X ) + c6
 n + c2 + c6 :

Thus I (x ! f (x)=X ) is either bounded by c4 or by n + c2 + c6. Hence


I (x ! f (x)=X )  n + c7 and L(f=X )  n + c7.
To recapitulate, we have shown that this f has the property that
I (f=X )  k ; c3 and L(f=X )  n + c7. Taking c = max c3 c7, we see
that I (f=X )  k ; c and L(f=X )  n + c. Q.E.D.
362 Part V|Technical Papers on Blank-Endmarker Programs
Why does Theorem 9 show that I (f ) can be enormous even though
L(f ) has a reasonable value? Consider the function g(x) de ned to
x !'s
z }| {
be (: : : ((x!)!)! : : :!). g(x) quickly becomes astronomical as x increases.
However, I (g(n)=0)  I (g(n)) + c1  I (n) + c2  lg(n) + c3, and
lg(n) + c3 is less than n for almost all n. Hence almost all n have the
property that there is a function f with L(f )  n + c and I (f ) 
g(n) ; c.
In fact the situation is much worse. It is easy to de ne a function h
that is 0-recursive and grows more quickly than any recursive function.
In other words, h is recursive in the halting problem and for any re-
cursive function g, h(x) > g(x) for almost all x. As before we see that
I (h(n)=0) < n for almost all n. Hence almost all n have the property
that there is a function f with L(f )  n + c and I (f )  h(n) ; c.

5. Other Applications
In this section some other applications of oracles and the jump opera-
tion are presented without proof.
First of all, we would like to examine a question raised by C. P.
Schnorr 10, p. 189] concerning the relationship between I (x) and the
limiting relative frequency of programs for x. However, it is more ap-
propriate to ask what is the relationship between the self-delimiting
program size measure H (x) 9] and the limiting relative frequency
of programs for x (with endmarkers). De ne F (x n) to be ; log2
of (the number of programs w less than or equal to n such that
U (w 0) = x)=(n + 1). Then Theorem 12 of 10] is analogous to the
following:
Theorem 10. There is a c such that every x satis es F (x n) 
H (x) ; c for almost all n.
This shows that if H (x) is small, then x has many programs.
Schnorr asks whether the converse is true. In fact it is not:
Theorem 11. There is a c such that every x satis es F (x n) 
H (x=0) ; c for almost all n.
Thus even though H (x) is large, x will have many programs if
H (x=0) is small.
Program Size, Oracles, and the Jump Operation 363
We would like to end by examining the maximum nite cardinality
#A and co-cardinality #A attainable by a set A of bounded program
size. First we de ne the partial function G :
G(x=X ) = max z (I (z=X )  x):
The following easily established results show how gigantic G is:
(a) If  is partial X -recursive and x > I (=X ) + c, then (x), if
de ned, is less than G(x=X ).
(b) If  is partial X -recursive, then there is a c such that (G(x=X )),
if de ned, is less than G(x + c=X ).
Theorem 12.
(a) G(x ; c) < max#A (If (A)  x) < G(x + c)
(b) G(x ; c=0) < max #A (Ir(A)  x) < G(x + c=0)
(c) G(x ; c=0) < max #A (Ir (A)  x) < G(x + c=0)
(d) G(x ; c=0) < max #A (I (A)  x) < G(x + c=0)
(e) G(x ; c=00) < max #A (I (A)  x) < G(x + c=00)
Here it is understood that the maximizations are only taken over
those cardinalities which are nite.
The proof of (e) is beyond the scope of the method used in this
paper (e) is closely related to the fact that fxjWx is co- niteg is *3-
complete 1, p. 328] and to Theorem 16 of 3].

Appendix
Theorem 3b can be strengthened to the following:
I ((x)=X )  I (x=X ) + I (=X ) +
lg(I (=X )) + lg(lg(I (=X ))) +
2 lg(lg(lg(I (=X )))) + c:
364 Part V|Technical Papers on Blank-Endmarker Programs
There are many other similar inequalities.
To formulate sharp results of this kind it is necessary to abandon the
formalism of this paper, in which programs have endmarkers. Instead
one must use the self-delimiting program formalism of 9] and 3] in
which programs can be concatenated and merged. In that setting the
following inequalities are immediate:
H ((x)=X )  H (x=X ) + H (=X ) + c
H (x ('(x))]=X )  H ('=X ) + H (=X ) + c:

References
1] H. Rogers, Jr.: Theory of recursive functions and eective com-
putability, McGraw-Hill, New York, 1967.
2] G. J. Chaitin: Information-theoretic limitations of formal sys-
tems, J. ACM. 21 (1974), 403{424.
3] G. J. Chaitin: Algorithmic entropy of sets, Comput. Math. Appl.
2 (1976), 233{245.
4] T. Kamae: On Kolmogorov's complexity and information, Osaka
J. Math. 10 (1973), 305{307.
5] R. P. Daley: A note on a result of Kamae, Osaka J. Math 12
(1975), 283{284.
6] D. W. Loveland: A variant of the Kolmogorov concept of com-
plexity, Information and Control 15 (1969), 510{526.
7] G. J. Chaitin: Information-theoretic characterizations of recur-
sive innite strings, Theoretical Comput. Sci. 2 (1976), 45{48.
8] R. M. Solovay: unpublished manuscript on 9] dated May 1975.
9] G. J. Chaitin: A theory of program size formally identical to in-
formation theory, J. ACM 22 (1975), 329{340.
Program Size, Oracles, and the Jump Operation 365
10] C. P. Schnorr: Optimal enumerations and optimal Godel number-
ings, Math. Systems Theory 8 (1975), 182{191.
11] G. J. Chaitin: Algorithmic information theory, IBM J. Res. De-
velop. 21 (1977), in press.

IBM Thomas J. Watson Research Center


(Received January 17, 1976)
366 Part V|Technical Papers on Blank-Endmarker Programs
Part VI
Technical Papers on Turing
Machines & LISP

367
ON THE LENGTH OF
PROGRAMS FOR
COMPUTING FINITE
BINARY SEQUENCES
Journal of the ACM 13 (1966),
pp. 547{569

Gregory J. Chaitin1
The City College of the City University of New York
New York, N.Y.

Abstract
The use of Turing machines for calculating nite binary sequences is
studied from the point of view of information theory and the theory of
recursive functions. Various results are obtained concerning the number
of instructions in programs. A modied form of Turing machine is
studied from the same point of view. An application to the problem of
dening a patternless sequence is proposed in terms of the concepts here

369
370 Part VI|Technical Papers on Turing Machines & LISP
developed.

Introduction
In this paper the Turing machine is regarded as a general purpose
computer and some practical questions are asked about programming
it. Given an arbitrary nite binary sequence, what is the length of the
shortest program for calculating it? What are the properties of those
binary sequences of a given length which require the longest programs?
Do most of the binary sequences of a given length require programs of
about the same length?
The questions posed above are answered in Part 1. In the course of
answering them, the logical design of the Turing machine is examined
as to redundancies, and it is found that it is possible to increase the
eciency of the Turing machine as a computing instrument without
a major alteration in the philosophy of its logical design. Also, the
following question raised by C. E. Shannon 1] is partially answered:
What eect does the number of dierent symbols that a Turing machine
can write on its tape have on the length of the program required for a
given calculation?
In Part 2 a major alteration in the logical design of the Turing
machine is introduced, and then all the questions about the lengths of
programs which had previously been asked about the rst computer
are asked again. The change in the logical design may be described in
the following terms: Programs for Turing machines may have transfers
from any part of the program to any other part, but in the programs
for the Turing machines which are considered in Part 2 there is a xed
upper bound on the length of transfers.
Part 3 deals with the somewhat philosophical problem of de ning
a random or patternless binary sequence. The following de nition is
proposed: Patternless nite binary sequences of a given length are se-
quences which in order to be computed require programs of approxi-
mately the same length as the longest programs required to compute
1 This paper was written in part with the help of NSF Undergraduate Research
Participation Grant GY-161.
Computing Finite Binary Sequences 371
any binary sequences of that given length. Previous work along these
lines and its relationship to the present proposal are discussed briey.

Part 1
1.1. We de ne an N -state M -tape-symbol Turing machine by an N -
row by M -column table. Each of the NM places in this table must
have an entry consisting of an ordered pair (i j ) of natural numbers,
where i goes from 0 to N and j goes from 1 to M + 2. These entries
constitute, when speci ed, the program of the N -state M -tape-symbol
Turing machine. They are to be interpreted as follows: An entry (i j )
in the kth row and the pth column of the table means that when the
machine is in its kth state and the square of its one-way in nite tape
which is being scanned is marked with the pth symbol, then the machine
is to go to its ith state if i 6= 0 (the machine is to halt if i = 0) after
performing the operation of
1. moving the tape one square to the right if j = M + 2,
2. moving the tape one square to the left if j = M + 1, and
3. marking (overprinting) the square of the tape being scanned with
the j th symbol if 1  j  M .
Special names are given to the rst, second and third symbols. They
are, respectively, the blank (for unmarked square), 0 and 1.
A Turing machine may be represented schematically as follows:

1 0 0 1 0 0 ::::::
End of Tape Scanner 6 Tape
Black Box

It is stipulated that
(1.1A) Initially the machine is in its rst state and scanning the rst
square of the tape.
372 Part VI|Technical Papers on Turing Machines & LISP
(1.1B) No Turing machine may in the course of a calculation scan the
end square of the tape and then move the tape one square to the
right.
(1.1C) Initially all squares of the tape are blank.
Since throughout this paper we shall be concerned with computing
nite binary sequences, when we say that a Turing machine calculates
a particular nite binary sequence (say, 01111000), we shall mean that
the machine stops with the sequence written at the end of the tape,
with all other squares of the tape blank and with its scanner on the rst
blank square of the tape. For example, the following Turing machine
has just calculated the sequence mentioned above:

0 1 1 1 1 0 0 0 ::::::
6
Halted

1.2. There are exactly


((N + 1)(M + 2))NM
possible programs for an N -state M -tape-symbol Turing machine.
Thus to specify a single one of these programs requires
log2(((N + 1)(M + 2))NM )
bits of information, which is asymptotic to NM log2 N bits for M xed
and N large. Therefore a program for an N -state M -tape-symbol Tur-
ing machine (considering M to be xed and N to be large) can be
regarded as consisting of about NM log2 N bits of information. It may
be, however, that in view of the fact that dierent programs may cause
the machine to behave in exactly the same way, a substantial portion
of the information necessary to specify a program is redundant in its
speci cation of the behavior of the machine. This in fact turns out to
be the case. It will be shown in what follows that for M xed and
Computing Finite Binary Sequences 373
N large at least 1=M of the bits of information of a program are re-
dundant. Later we shall be in a position to ask to what extent the
remaining portion of (1 ; 1=M ) of the bits is redundant.
The basic reason for this redundancy is that any renumbering of
the rows of the table (this amounts to a renaming of the states of the
machine) in no way changes the behavior that a given program will
cause the machine to have. Thus the states can be named in a manner
determined by the sequencing of the program, and this makes possible
the omission of state numbers from the program. This idea is by no
means new. It may be seen in most computers with random access
memories. In these computers the address of the next instruction to be
executed is usually 1 more than the address of the current instruction,
and this makes it generally unnecessary to use memory space in order
to give the address of the next instruction to be executed. Since we
are not concerned with the practical engineering feasibility of a logical
design, we can take this idea a step farther.
1.3. In the presentation of the redesigned Turing machine let us
begin with an example of the manner in which one can take a program
for a Turing machine and reorder its rows (rename its states) until it
is in the format of the redesigned machine. In the process, several row
numbers in the program are removed and replaced by + or ++ |this is
how redundant information in the program is removed. The \operation
codes" (which are 1 for \print blank," 2 for \print zero," 3 for \print
one," 4 for \shift tape left" and 5 for \shift tape right") are omitted
from the program every time the rows are reordered, the op-codes are
just carried along. The program used as an example is as follows:
row 1 1 9 7
row 2 8 8 8
row 3 9 6 1
row 4 3 2 0
row 5 7 7 8
row 6 6 5 4
row 7 8 6 9
row 8 9 8 1
row 9 9 1 8
To prevent confusion later, letters instead of numbers are used in
374 Part VI|Technical Papers on Turing Machines & LISP
the program:
row A A I G
row B H H H
row C I F A
row D C B J
row E G G H
row F F E D
row G H F I
row H I H A
row I I A H
Row A is the rst row of the table and shall remain so. Replace A
by 1 throughout the table:
row 1 1 I G
row B H H H
row C I F 1
row D C B J
row E G G H
row F F E D
row G H F I
row H I H 1
row I I 1 H
To nd to which row of the table to assign the number 2, read across
the rst row of the table until a letter is reached. Having found an I,
1. replace it by a +,
2. move row I so that it becomes the second row of the table, and
3. replace I by 2 throughout the table:
Computing Finite Binary Sequences 375
row 1 1 + G
row 2 2 1 H
row B H H H
row C 2 F 1
row D C B J
row E G G H
row F F E D
row G H F 2
row H 2 H 1
To nd to which row of the table to assign the number 3, read across
the second row of the table until a letter is found. Having found an H,
1. replace it by a +,
2. move row H so that it becomes the third row of the table, and
3. replace H by 3 throughout the table:
row 1 1 + G
row 2 2 1 +
row 3 2 3 1
row B 3 3 3
row C 2 F 1
row D C B J
row E G G 3
row F F E D
row G 3 F 2
To nd to which row of the table to assign the number 4, read across
the third row of the table until a letter is found. Having failed to nd
one, read across rows 1, 2 and 3, respectively, until a letter is found.
(A letter must be found, for otherwise rows 1, 2 and 3 are the whole
program.) Having found a G in row 1,
1. replace it by a ++,
2. move row G so that it becomes the fourth row of the table, and
3. replace G by 4 throughout the table:
376 Part VI|Technical Papers on Turing Machines & LISP
row 1 1 + ++
row 2 2 1 +
row 3 2 3 1
row 4 3 F 2
row B 3 3 3
row C 2 F 1
row D C B J
row E 4 4 3
row F F E D
The next two assignments proceed as in the case of rows 2 and 3:
row 1 1 + ++
row 2 2 1 +
row 3 2 3 1
row 4 3 + 2
row 5 5 E D
row B 3 3 3
row C 2 5 1
row D C B J
row E 4 4 3
row 1 1 + ++
row 2 2 1 +
row 3 2 3 1
row 4 3 + 2
row 5 5 + D
row 6 4 4 3
row B 3 3 3
row C 2 5 1
row D C B J
To nd to which row of the table to assign the number 7, read across
the sixth row of the table until a letter is found. Having failed to nd
one, read across rows 1, 2, 3, 4, 5 and 6, respectively, until a letter is
found. (A letter must be found, for otherwise rows 1, 2, 3, 4, 5 and 6
are the whole program.) Having found a D in row 5,
Computing Finite Binary Sequences 377
1. replace it by a ++,
2. move row D so that it becomes the seventh row of the table, and
3. replace D by 7 throughout the table:
row 1 1 + ++
row 2 2 1 +
row 3 2 3 1
row 4 3 + 2
row 5 5 + ++
row 6 4 4 3
row 7 C B J
row B 3 3 3
row C 2 5 1
After three more assignments the following is nally obtained:
row 1 1 + ++
row 2 2 1 +
row 3 2 3 1
row 4 3 + 2
row 5 5 + ++
row 6 4 4 3
row 7 + ++ ++
row 8 2 5 1
row 9 3 3 3
row 10
This example is atypical in several respects: The state order could
have needed a more elaborate scrambling (instead of which the row of
the table to which a number was assigned always happened to be the
last row of the table at the moment), and the fictitious state used for
the purposes of halting (state 0 in the formulation of Section 1.1) could
have ended up as any one of the rows of the table except the first row
(instead of which it ended up as the last row of the table).
The reader will note, however, that 9 row numbers have been
eliminated (and replaced by + or ++) in a program of 9 (actual) rows,
and that, in general, this process will eliminate a row number from the
program for each row of the program. Note too that if a program is
"linear" (i.e., the machine executes the instruction in storage address
1, then the instruction in storage address 2, then the instruction in
storage address 3, etc.), only + will be used; departures from linearity
necessitate use of ++.
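The renumbering procedure lends itself to a mechanical statement. The
following is a sketch (ours, not the paper's; the function and variable
names are our own conventions) which carries out the reordering on the
example above. A program is given as a dictionary mapping row labels to
their three next-row labels; op-codes are carried along unchanged, so
they are omitted here.

def renumber(table, first):
    # 'first' names row 1; exactly one label (the fictitious halting
    # row) has no row of its own.  Returns the reordered rows, with
    # eliminated row numbers replaced by '+' or '++'.
    assigned, rows = {}, []

    def add_row(label):
        raw = table.get(label, ['', '', ''])   # fictitious row: blank
        rows.append([assigned.get(e, e) for e in raw])

    def subst(label, k):                       # renumber throughout
        assigned[label] = k
        for r in rows:
            for c in range(3):
                if r[c] == label:
                    r[c] = k

    def is_letter(e):
        return isinstance(e, str) and e not in ('', '+', '++')

    add_row(first)
    subst(first, 1)
    for k in range(2, len(table) + 2):
        # Read across row k-1 until a letter is found (a '+'); failing
        # that, read across rows 1, 2, ..., k-1 (a '++').
        hit = next(((k - 2, c, '+') for c in range(3)
                    if is_letter(rows[k - 2][c])), None)
        if hit is None:
            hit = next(((r, c, '++') for r in range(k - 1)
                        for c in range(3) if is_letter(rows[r][c])), None)
        r, c, mark = hit
        label = rows[r][c]
        rows[r][c] = mark
        add_row(label)                         # move it into the kth row
        subst(label, k)
    return rows

example = {'A': list('AIG'), 'B': list('HHH'), 'C': list('IFA'),
           'D': list('CBJ'), 'E': list('GGH'), 'F': list('FED'),
           'G': list('HFI'), 'H': list('IHA'), 'I': list('IAH')}
for row in renumber(example, 'A'):
    print(row)   # reproduces the final table above; row 10 is blank

Here J stands for the fictitious halting state (state 0 of Section 1.1),
exactly as in the hand-worked example.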
There follows a description of the redesigned machine. In the
formalism of that description the program given above is as follows:
     (1, ,0) (0, ,1) (0, ,2)
     (2, ,0) (1, ,0) (0, ,1)
     (2, ,0) (3, ,0) (1, ,0)
     (3, ,0) (0, ,1) (2, ,0)
10   (5, ,0) (0, ,1) (0, ,2)
     (4, ,0) (4, ,0) (3, ,0)
     (0, ,1) (0, ,2) (0, ,2)
     (2, ,0) (5, ,0) (1, ,0)
     (3, ,0) (3, ,0) (3, ,0)
Here the third member of a triple is the number of +'s, the second
member is the op-code (blank in this example), and the first member
is the number of the next state of the machine if there are no +'s (if
there are +'s, the first member of the triple is 0). The number outside
the table is the number of the fictitious row of the program used for
the purposes of halting.
We define an N-state M-tape-symbol Turing machine by an (N+1) x M
table and a natural number n (2 ≤ n ≤ N+1). Each of the (N+1)M
places in this table (with the exception of those in the nth row) must
have an entry consisting of an ordered triple (i, j, k) of natural
numbers, where k is 0, 1 or 2; j goes from 1 to M+2; and i goes from
1 to N+1 if k = 0, i = 0 if k ≠ 0. (Places in the nth row are left
blank.) In addition:
(1.3.1) The entries in which k = 1 or k = 2 are N in number.
Entries are interpreted as follows:
(1.3.2) An entry (i, j, 0) in the pth row and the mth column of the
table means that when the machine is in the pth state and the square
of its one-way infinite tape which is being scanned is marked with
the mth symbol, then the machine is to go to its ith state if i ≠ n
(if i = n, the machine is instead to halt) after performing the
operation of
1. moving the tape one square to the right if j = M+2,
2. moving the tape one square to the left if j = M+1, and
3. marking (overprinting) the square of the tape being scanned
with the jth symbol if 1 ≤ j ≤ M.
(1.3.3) An entry (0, j, 1) in the pth row and mth column of the table
is to be interpreted in accordance with (1.3.2) as if it were the
entry (p+1, j, 0).
(1.3.4) For an entry (0, j, 2) in the pth row and mth column of the
table the machine proceeds as follows:
(1.3.4a) It determines the number p' of entries of the form (0, ,2)
in rows of the table preceding the pth row or to the left of the mth
column in the pth row.
(1.3.4b) It determines the first p' + 1 rows of the table which have
no entries of the form (0, ,1) or (0, ,2). Suppose the last of these
p' + 1 rows is the p''th row of the table.
(1.3.4c) It interprets the entry in accordance with (1.3.2) as if it
were the entry (p'' + 1, j, 0).
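To make the rules (1.3.2)-(1.3.4) concrete, here is a sketch (ours,
not the paper's; the helper name and the use of 0 for the unspecified
op-codes are our conventions) of resolving an entry to its effective
next state, checked against the example program above:

def next_state(table, p, m):
    # table[p-1][m-1] is the triple (i, j, k) in row p, column m
    # (1-based); the blank fictitious row is represented by (0, 0, 0)s.
    i, j, k = table[p - 1][m - 1]
    if k == 0:                 # (1.3.2): next state given explicitly
        return i
    if k == 1:                 # (1.3.3): one +, read as (p+1, j, 0)
        return p + 1
    # (1.3.4a): the number p1 of entries (0,_,2) before this one.
    p1 = sum(1 for r, row in enumerate(table) for c, t in enumerate(row)
             if (r < p - 1 or (r == p - 1 and c < m - 1)) and t[2] == 2)
    # (1.3.4b): the first p1 + 1 rows with no entries (0,_,1) or
    # (0,_,2); p2 is the last of them.
    plain = [r + 1 for r, row in enumerate(table)
             if all(t[2] == 0 for t in row)]
    p2 = plain[p1]
    return p2 + 1              # (1.3.4c): read as the entry (p2+1, j, 0)

program = [[(1,0,0), (0,0,1), (0,0,2)], [(2,0,0), (1,0,0), (0,0,1)],
           [(2,0,0), (3,0,0), (1,0,0)], [(3,0,0), (0,0,1), (2,0,0)],
           [(5,0,0), (0,0,1), (0,0,2)], [(4,0,0), (4,0,0), (3,0,0)],
           [(0,0,1), (0,0,2), (0,0,2)], [(2,0,0), (5,0,0), (1,0,0)],
           [(3,0,0), (3,0,0), (3,0,0)], [(0,0,0)] * 3]
print(next_state(program, 1, 3), next_state(program, 7, 3))  # 4 and 10

Row 10 being the fictitious row of this program, a resolved next state
of 10 means halt.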
1.4. In Section 1.2 it was stated that the programs of the N-state
M-tape-symbol Turing machines of Section 1.3 require in order to be
specified (1 - 1/M) times the number of bits of information required
to specify the programs of the N-state M-tape-symbol Turing machines
of Section 1.1. (As before, M is regarded to be fixed and N to be
large.) This assertion is justified here. In view of (1.3.1), at most

N (3(M+2))^(NM) (N+1)^(N(M-1))

ways of making entries in the table of an N-state M-tape-symbol
Turing machine of Section 1.3 count as programs. Thus only log2 of
this number, or asymptotically N(M-1) log2 N bits, are required to
specify the program of an N-state M-tape-symbol machine of Section 1.3.
Henceforth, in speaking of an N-state M-tape-symbol Turing
machine, one of the machines of Section 1.3 will be meant.
1.5. We now define two sets of functions which play a fundamental
role in all that follows.
The members LM(.) of the first set are defined for M = 3, 4, 5, ...
on the set of all finite binary sequences S as follows: An N-state
M-tape-symbol Turing machine can be programmed to calculate S if and
only if N ≥ LM(S).
The second set LM(Cn) (M = 3, 4, 5, ...) is defined by

LM(Cn) = max LM(S),

where the maximum is taken over all binary sequences S of length n.
Finally, we denote by CnM (M = 3, 4, 5, ...) the set of all binary
sequences S of length n satisfying LM(S) = LM(Cn).
1.6. In this section it is shown that for M = 3, 4, 5, ...,

LM(Cn) ~ n/((M-1) log2 n).

We first show that LM(Cn) is greater than a function of n which
is asymptotically equal to n/((M-1) log2 n). From Section 1.4 it is
clear that there are at most

2^((1+ε_N) N(M-1) log2 N)

different programs for an N-state M-tape-symbol Turing machine,
where ε_x denotes a (not necessarily positive) function of x and
possibly other variables which tends to zero as x goes to infinity
with any other variables held fixed. Since a different program is
required to calculate each of the 2^n different binary sequences of
length n, we see that an N-state M-tape-symbol Turing machine can be
programmed to calculate any binary sequence of length n only if

(1+ε_N) N(M-1) log2 N ≥ n, or N ≥ (1+ε_n) n/((M-1) log2 n).

It follows from the definition of LM(Cn) that

LM(Cn) ≥ (1+ε_n) n/((M-1) log2 n).
Next we show that LM(Cn) is less than a function of n which is
asymptotically equal to n/((M-1) log2 n). This is done by showing
how to construct for any binary sequence S of length not greater than
(1+ε_N) N(M-1) log2 N a program which causes an N-state M-tape-
symbol Turing machine to calculate S. The main idea is illustrated in
the case where M = 3.
Row     Column 1    Column 2    Column 3
1       2,4         2,4         2,4
2       ...,2       ...,3       3,4
3       ...,2       ...,3       4,4        Section I:
4       ...,2       ...,3       5,4        approximately
5       ...,2       ...,3       6,4        (1 - 1/log2 N)N
6       ...,2       ...,3       7,4        rows
7       ...,2       ...,3       8,4
8       ...,2       ...,3       9,4
...     ...         ...         ...
d       d+1,4       d+1,4       d+1,4
d+1     d+2,4       d+2,4       d+2,4
d+2     d+3,4       d+3,4       d+3,4      Section II:
d+3     d+4,4       d+4,4       d+4,4      approximately
d+4     d+5,4       d+5,4       d+5,4      N/log2 N
d+5     d+6,4       d+6,4       d+6,4      rows
d+6     d+7,4       d+7,4       d+7,4
d+7     d+8,4       d+8,4       d+8,4
...     ...         ...         ...
f       This section is the same
f+1     (except for the changes            Section III:
f+2     in row numbers caused by           a fixed number
f+3     relocation) regardless             of rows
f+4     of the value of N.
...     ...
This program is in the format of the machines of Section 1.1. There
are N rows in this table. The unspecified row numbers in Section I
are all in the range from d to f - 1, inclusive. The manner in which
they are specified determines the finite binary sequence S which the
program computes.
The execution of this program is divided into phases. There are
twice as many phases as there are rows in Section I. The current phase
is determined by a binary sequence P which is written out starting on
the second square of the tape. The nth phase starts in row 1 with the
scanner on the first square of the tape and with

P = 111...1  (i 1's)  if n = 2i + 1,
P = 111...10 (i 1's)  if n = 2i + 2.
Control then passes down column three through the (i+1)-th row of
the table, and then control passes to

row i+2, column 1  if n = 2i + 1,
row i+2, column 2  if n = 2i + 2,

which
1. changes P to what it must be at the start of the (n+1)-th phase,
and
2. transfers control to a row in Section II.
Suppose this row to be the mth row of Section II from the end of
Section II.
Once control has passed to the row in Section II, control then passes
down Section II until row f is reached. Each row in Section II causes
the tape to be shifted one square to the left, so that when row f
finally assumes control, the scanner will be on the mth blank square
to the right of P. The following diagram shows the way things may look
at this point if n is 7 and m happens to be 11:
[Diagram: the tape holds P (here 1110), then 10 blank squares, with
the scanner on the 11th blank square; far to the right, after a long
blank region, the sequences S1, S2, ..., S6 (here 00101, 11011, ...).]
Now control has been passed to Section III. First of all, Section III
accumulates in base-two on the tape a count of the number of blank
squares between the scanner and P when row f assumes control. (This
number is m - 1.) This base-two count, which is written on the tape,
is simply a binary sequence with a 1 at its left end. Section III then
removes this 1 from the left end of the binary sequence. The resulting
sequence is called Sn.
Note that if the row numbers entered in

row i+2, column 1  if n = 2i + 1,
row i+2, column 2  if n = 2i + 2,

of Section I are suitably specified, this binary sequence Sn can be
made any one of the 2^v binary sequences of length v = (the greatest
integer not greater than log2(f-d) - 1). Finally, Section III writes
Sn in a region of the tape far to the right where all the previous Sj
(j = 1, 2, ..., n-1) have been written during previous phases, cleans
up the tape so that only the sequences P and Sj (j = 1, 2, 3, ..., n)
remain on it, positions the scanner back on the square at the end of
the tape and, as the last act of phase n, passes control back to row 1
again.
The foregoing description of the workings of the program omits some
important details for the sake of clarity. These follow.
It must be indicated how Section III knows when the last phase
(phase 2(d-2)) has occurred. During the nth phase, P is copied just
to the right of S1, S2, ..., Sn (of course a blank square is left
between Sn and the copy of P). And during the (n+1)-th phase, Section
III checks whether or not P is currently different from what it was
during the nth phase when the copy of it was made. If it isn't
different, then Section III knows that phasing has in fact stopped and
that a termination routine must be executed.
The termination routine first forms the finite binary sequence S*
consisting of

S1, S2, ..., S_{2(d-2)}

each immediately following the other. As each of the Sj can be any one
of the 2^v binary sequences of length v if the row numbers in the
entries in Section I are appropriately specified, it follows that S*
can be any one of the 2^w binary sequences of length w = 2(d-2)v. Note
that

2(d-2)(log2(f-d) - 1) ≥ w > 2(d-2)(log2(f-d) - 2),

so that

w ~ 2(1 - 1/log2 N) N log2(N/log2 N) ~ 2N log2 N.

As we want the program to be able to compute any sequence S of length
not greater than (2+ε_N) N log2 N, we have S* consist of S followed to
the right by a single 1 and then a string of 0's, and the termination
routine removes the rightmost 0's and first 1 from S*. Q.E.D.
The result just obtained shows that it is impossible to make further
improvement in the logical design of the Turing machine of the kind
described in Section 1.2 and actually effected in Section 1.3, if we
let the number of tape symbols be fixed and speak asymptotically as
the number of states goes to infinity; in our present Turing machines
100 percent of the bits required to specify a program also serve to
specify the behavior of the machine.
Note too that the argument presented in the first paragraph of this
section in fact establishes that, say, for any fixed s greater than
zero, at most n^(-s) 2^n binary sequences S of length n satisfy

LM(S) ≤ (1+ε_n) n/((M-1) log2 n).

Thus we have: For any fixed s greater than zero, at most n^(-s) 2^n
binary sequences of length n fail to satisfy the double inequality

(1+ε_n) n/((M-1) log2 n) ≤ LM(S) ≤ (1+ε'_n) n/((M-1) log2 n).
1.7. It may be desirable to have some idea of the "local" as well
as the "global" behavior of LM(Cn). The following program of 8 rows
causes an 8-state 3-tape-symbol Turing machine to compute the binary
sequence 01100101 of length 8 (this program is in the format of the
machines of Section 1.1):
1,2 2,4 2,4
2,3 3,4 3,4
3,3 4,4 4,4
4,2 5,4 5,4
5,2 6,4 6,4
6,3 7,4 7,4
7,2 8,4 8,4
8,3 0,4 0,4
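A sketch (ours, not the paper's) of the generic version of this
program, which yields the inequality (1.7.1) below: n rows of this
kind compute any given sequence of length n.

def trivial_program(S):
    # Section 1.1 format: row k prints the kth bit when scanning a
    # blank, and shifts the tape left otherwise; the last row transfers
    # to the fictitious state 0 to halt.
    n, rows = len(S), []
    for k, bit in enumerate(S, 1):
        op = 2 if bit == '0' else 3     # op-codes: 2 print 0, 3 print 1
        nxt = k + 1 if k < n else 0
        rows.append([(k, op), (nxt, 4), (nxt, 4)])
    return rows

for row in trivial_program('01100101'):
    print(row)                          # reproduces the 8-row table above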
And in general:
(1.7.1) LM(Cn) ≤ n.
From this it is easy to see that for m greater than n:
(1.7.2) LM(Cm) ≤ LM(Cn) + (m - n).
Also, for m greater than n:
(1.7.3) LM(Cm) + 1 ≥ LM(Cn).
For if one can calculate any binary sequence of length m greater
than n with an M-tape-symbol Turing machine having LM(Cm) states,
one can certainly program any M-tape-symbol Turing machine having
LM(Cm) + 1 states to calculate the binary sequence consisting of
[(any particular sequence of length n) followed by a 1 followed by a
sequence of (m - n - 1) 0's], and then (instead of immediately
halting) to first erase all the 0's and the first 1 on the right end
of the sequence. This last part of the program takes up only a single
row of the table in the format of the machines of Section 1.1; this
row r is:

row r    r,5  r,1  0,1

Together (1.7.2) and (1.7.3) yield:
(1.7.4) |LM(Cn+1) - LM(Cn)| ≤ 1.
From (1.7.1) it is obvious that LM(C1) = 1, and with (1.7.4) and the
fact that LM(Cn) goes to infinity with n it finally is concluded that:
(1.7.5) For any positive integer p there is at least one solution n of
LM(Cn) = p.
1.8. In this section a certain amount of insight is obtained into the
properties of finite binary sequences S of length n for which LM(S)
is close to LM(Cn). M is considered to be fixed throughout this
section. There is some connection between the present subject and that
of Shannon in [2, Pt. I, especially Th. 9].
The main result is as follows:
(1.8.1) For any e > 0 and d > 1 one has for all sufficiently large n:
If S is any binary sequence of length n satisfying the statement
that
(1.8.2) the ratio of the number of 0's in S to n differs from 1/2 by
more than e,
then LM(S) < LM(C[ndH(1/2+e, 1/2-e)]).
Here H(p, q) (p ≥ 0, q ≥ 0, p + q = 1) is a special case of the
entropy function of Boltzmann statistical mechanics and information
theory and equals 0 if p = 0 or 1, and -p log2 p - q log2 q otherwise.
Also, a real number enclosed in brackets denotes the least integer
greater than the enclosed real. The H function comes up because the
logarithm to the base-two of the number

Σ_{|k/n - 1/2| > e} n!/(k!(n-k)!)

of binary sequences of length n satisfying (1.8.2) is asymptotic to
nH(1/2+e, 1/2-e). This may be shown easily by considering the ratio of
successive binomial coefficients and using the fact that
log(n!) ~ n log n.
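A quick numerical check of this asymptotic count (ours, not the
paper's):

from math import comb, log2

def H(p, q):
    return 0.0 if p in (0.0, 1.0) else -p*log2(p) - q*log2(q)

e = 0.1
for n in (100, 1000, 10000):
    # sequences whose fraction of 0's is off from 1/2 by more than e
    count = sum(comb(n, k) for k in range(n + 1) if abs(k/n - 0.5) > e)
    print(n, round(log2(count) / (n * H(0.5 + e, 0.5 - e)), 4))
# The printed ratios approach 1 as n grows.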
To prove (1.8.1), first construct a class of effectively computable
functions Mn(.) with the natural numbers from 1 to 2^n as range and
all binary sequences of length n as domain. Mn(S) is defined to be
the ordinal number of the position of S in an ordering of the binary
sequences of length n defined as follows:
1. If two binary sequences S and S' have, respectively, m and m'
0's, then S comes before (after) S' according as |m/n - 1/2| is
greater (less) than |m'/n - 1/2|.
2. If 1 does not settle which comes first, take S to come before
(after) S' according as S represents (ignoring 0's to the left) a
larger (smaller) number in base-two notation than S' represents.
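A sketch (ours) of this ordering for small n; it is exponential in n
and serves only to make the definition concrete:

from itertools import product

def M_n(n):
    # Sort by decreasing |m/n - 1/2| (m = number of 0's), ties broken
    # by the larger base-two value first; ordinals run from 1 to 2^n.
    seqs = [''.join(bits) for bits in product('01', repeat=n)]
    seqs.sort(key=lambda S: (-abs(S.count('0')/n - 0.5), -int(S, 2)))
    return {S: i + 1 for i, S in enumerate(seqs)}

ranks = M_n(8)
print(ranks['11111111'], ranks['00000000'], ranks['01100101'])
# Heavily biased sequences receive small ordinal numbers.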
The only essential feature of this ordering is that it gives small
ordinal numbers to sequences for which |m/n - 1/2| has large values.
In fact, as there are only

2^((1+ε_n) nH(1/2+e, 1/2-e))

binary sequences S of length n satisfying (1.8.2), it follows that at
worst Mn(S) is a number which in base-two notation is represented by
a binary sequence of length ~ nH(1/2+e, 1/2-e). Thus in order to
obtain a short program for computing an S of length n satisfying
(1.8.2), let us just give a program of fixed length r the values of n
and Mn(S) and have it compute S (= Mn^(-1)(Mn(S))) from this data.
The manner in which for n sufficiently large we give the values of n
and Mn(S) to the program is to pack them into a single binary sequence
of length at most

[n(1 + (d-1)/2) H(1/2+e, 1/2-e)] + 2(1 + [log2 n])
as follows: Consider (the binary sequence representing Mn(S) in base-
two notation) followed by 01 followed by (the binary sequence
representing n with each of its bits doubled (e.g., if n = 43, this is
110011001111)). Clearly both n and Mn(S) can be recovered from this
sequence. And this sequence can be computed by a program of

LM(C_([n(1+(d-1)/2)H(1/2+e, 1/2-e)] + 2(1+[log2 n])))

rows. Thus for n sufficiently large this many rows plus r is all that
is needed to compute any binary sequence S of length n satisfying
(1.8.2). And by the asymptotic formula for LM(Cn) of Section 1.6, it
is seen that the total number of rows of program required is, for n
sufficiently large, less than

LM(C[ndH(1/2+e, 1/2-e)]).

Q.E.D.
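The packing device used in this proof is easy to make concrete. A
sketch (ours, not the paper's): the numeral for Mn(S), then 01, then
the numeral for n with each bit doubled; reading bit-pairs from the
right end, the doubled bits are 00 or 11, and the first 01 pair marks
the separator.

def pack(m, n):
    doubled = ''.join(b + b for b in bin(n)[2:])
    return bin(m)[2:] + '01' + doubled

def unpack(s):
    bits = []
    while s[-2:] in ('00', '11'):      # strip the doubled bits of n
        bits.append(s[-2]); s = s[:-2]
    assert s[-2:] == '01'              # the separator
    return int(s[:-2], 2), int(''.join(reversed(bits)), 2)

print(pack(1, 43))                     # 1, then 01, then 110011001111
print(unpack(pack(714, 43)))           # (714, 43)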
From (1.8.1) and the fact that H(p, q) ≤ 1 with equality if and only
if p = q = 1/2, it follows from LM(Cn) ~ n/((M-1) log2 n) that, for
example,
(1.8.3) For any e > 0, all binary sequences S in CnM, n sufficiently
large, violate (1.8.2);
and more generally,
(1.8.4) Let

S_{n_1}, S_{n_2}, S_{n_3}, ...

be any infinite sequence of distinct finite binary sequences of
lengths, respectively, n_1, n_2, n_3, ..., which satisfies

LM(S_{n_k}) ~ LM(C_{n_k}).

Then as k goes to infinity, the ratio of the number of 0's in S_{n_k}
to n_k tends to the limit 1/2.
We now wish to apply (1.8.4) to programs for Turing machines. In order
to do this we need to be able to represent the table of entries
defining any program as a single binary sequence. A method is sketched
here for coding any program TNM occupying the table of an N-state
M-tape-symbol Turing machine into a single binary sequence C(TNM) of
length (1+ε_N) N(M-1) log2 N.
First, write all the members of the ordered triples entered in the
table in base-two notation, adding a sufficient number of 0's to the
left of the numerals for all numerals to be
1. as long as the base-two numeral for N+1 if they result from the
first member of a triple,
2. as long as the base-two numeral for M+2 if they result from the
second member, and
3. as long as the base-two numeral for 2 if they result from the third
member.
The only exception to this rule is that if the third member of a
triple is 1 or 2, then the first member of the triple is not written
in base-two notation; no binary sequences are generated from the first
members of such triples. Last, all the binary sequences that have just
been obtained are joined together, starting with the binary sequence
that was generated from the first member of the triple entered at the
intersection of row 1 with column 1 of the table, then with the binary
sequence generated from the second member of the triple..., ... from
the third member..., ... from the first member of the triple entered
at the intersection of row 1 with column 2, ... from the second
member..., ... from the third member..., and so on across the first
row of the table, then across the second row of the table, then the
third, ..., and finally across the Nth row.
The result of all this is a single binary sequence of length
(1+ε_N) N(M-1) log2 N (in view of (1.3.1)) from which one can
effectively determine the whole table of entries which was coded into
it, if only one is given the values of N and M. But it is possible to
code in these last pieces of information using only the rightmost

2(1 + [log2 N]) + 2(1 + [log2 M])

bits of a binary sequence consequently of total length

(1+ε_N) N(M-1) log2 N + 2(1 + [log2 N]) + 2(1 + [log2 M])
= (1+ε'_N) N(M-1) log2 N,

by employing the same trick that was used to pack two pieces of
information into a single binary sequence earlier in this section.
Thus we have a simple procedure for coding the whole table of
entries TNM defining a program of an N-state M-tape-symbol Turing
machine and the parameters N and M of the machine into a binary
sequence C(TNM) of (1+ε_N) N(M-1) log2 N bits.
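The table-coding step is simple enough to sketch (ours; the paper's
exact layout for appending N and M via the doubling trick is not
reproduced here):

def code_table(table, N, M):
    # table: N rows of M columns of triples (i, j, k).  Members are
    # written in base two, left-padded to the numeral lengths of N+1,
    # M+2 and 2; first members are omitted whenever k is 1 or 2, so by
    # (1.3.1) exactly N(M-1) first members appear; these give the
    # dominant N(M-1) log2 N bits.
    w_i = len(bin(N + 1)) - 2
    w_j = len(bin(M + 2)) - 2
    w_k = len(bin(2)) - 2
    bits = []
    for row in table:
        for (i, j, k) in row:
            if k == 0:
                bits.append(format(i, '0{}b'.format(w_i)))
            bits.append(format(j, '0{}b'.format(w_j)))
            bits.append(format(k, '0{}b'.format(w_k)))
    return ''.join(bits)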
We now obtain the result:
(1.8.5) Let

T_{LM(S_1)M}, T_{LM(S_2)M}, ...

be an infinite sequence of tables of entries which define programs
for computing, respectively, the distinct finite binary sequences
S_1, S_2, .... Then

LM(C(T_{LM(S_k)M})) ~ LM(C_{n_k}),

where n_k is the length of C(T_{LM(S_k)M}).
With (1.8.4) this gives the proposition:
(1.8.6) On the hypothesis of (1.8.5), as k goes to infinity, the ratio
of the number of 0's in C(T_{LM(S_k)M}) to its length tends to the
limit 1/2.
The proof of (1.8.5) depends on three facts:
(1.8.7a) There is an effective procedure for coding the table of
entries TNM defining the program of an N-state M-tape-symbol Turing
machine together with the two parameters N and M into a single binary
sequence C(TNM) of length (1+ε_N) N(M-1) log2 N.
(1.8.7b) Any binary sequence of length not greater than

(1+ε_N) N(M-1) log2 N

can be calculated by a suitably programmed N-state M-tape-symbol
Turing machine.
(1.8.7c) From a universal Turing machine program it is possible to
construct a program for a Turing machine (with a fixed number r of
rows) to take C(TNM) and decode it and to then imitate the
calculations of the machine whose table of entries TNM it then knows,
until it finally calculates the finite binary sequence S which the
program being imitated calculates, if S exists.
(1.8.7a) has just been demonstrated. (1.8.7b) was shown in Section
1.6. (The concept of a universal program is due to Turing [3].)
The proof of (1.8.5) follows. From (1.8.7a) and (1.8.7b),

LM(C(T_{LM(S_k)M})) ≤ (1+ε_k) LM(S_k);

and from (1.8.7c) and the hypothesis of (1.8.5),

LM(C(T_{LM(S_k)M})) + r ≥ LM(S_k).

It follows that

LM(C(T_{LM(S_k)M})) = (1+ε_k) LM(S_k),

which is (since the length of C(T_{LM(S_k)M}) is

(1+ε_k) LM(S_k)(M-1) log2 LM(S_k)

and

LM(C_{(1+ε_k) LM(S_k)(M-1) log2 LM(S_k)}) = (1+ε'_k) LM(S_k))

simply the conclusion of (1.8.5).
1.9. The topic of this section is an application of everything that
precedes with the exception of Section 1.7 and the first half of
Section 1.8. C. E. Shannon suggests [1, p. 165] that the state-symbol
product NM is a good measure of the calculating abilities of an
N-state M-tape-symbol Turing machine. If one is interested in
comparing the calculating abilities of large Turing machines whose M
values vary over a finite range, the results that follow suggest that
N(M-1) is a good measure of calculating abilities. We have as an
application of a slight generalization of the ideas used to prove
(1.8.5):
(1.9.1a) Any calculation which an N-state M-tape-symbol Turing machine
can be programmed to perform can be imitated by any N'-state M'-tape-
symbol Turing machine satisfying

(1+ε_N) N(M-1) log2 N < (1+ε''_N) N'(M'-1) log2 N',

if it is suitably programmed.
And directly from the asymptotic formula for LM(Cn) we have:
(1.9.1b) If

(1+ε_N) N(M-1) log2 N < (1+ε''_N) N'(M'-1) log2 N',

then there exist finite binary sequences which an N'-state M'-tape-
symbol Turing machine can be programmed to calculate and which it is
impossible to program an N-state M-tape-symbol Turing machine to
calculate.
As

(1+ε_N) N(M-1) log2 N = ((1+ε'_N) N(M-1)) log2 ((1+ε'_N) N(M-1)),

and for x and x' greater than one, x log2 x is greater (less) than
x' log2 x' according as x is greater (less) than x', it follows that
the inequalities of (1.9.1a) and (1.9.1b) give the same ordering of
calculating abilities as do inequalities involving functions of the
form (1+ε_N) N(M-1).
Part 2
2.1. In this section we return to the Turing machines of Section 1.1
and add to the conventions (1.1A), (1.1B) and (1.1C),
(2.1D) An entry (i, j) in the pth row of the table of a Turing machine
must satisfy |i - p| ≤ b. In addition, while a fictitious state is
used (as before) for the purpose of halting, the row of the table
for this fictitious state is now considered to come directly after
the actual last row of the program.
Here b is a constant whose value is to be regarded as fixed throughout
Part 2. In Section 2.2 it will be shown that b can be chosen
sufficiently large that the Turing machines thus defined (which we
take the liberty of naming "bounded-transfer Turing machines") have
all the calculating capabilities that are basically required of Turing
machines for theoretical purposes (e.g., such purposes as defining
what one means by "effective process for determining..."), and hence
have calculating abilities sufficient for the proofs of Part 2 to be
carried out.
(2.1D) may be regarded as a mere convention, but it is more properly
considered as a change in the basic philosophy of the logical design
of the Turing machine (i.e., the philosophy expressed by A. M. Turing
[3, Sec. 9]).
Here in Part 2 there will be little point in considering the general
M-tape-symbol machine. It will be understood that we are always
speaking of 3-tape-symbol machines.
There is a simple and convenient notational change which can be
made at this point; it makes all programs for bounded-transfer Turing
machines instantly relocatable (which is convenient if one puts
together a program from subroutines) and it saves a great deal of
superfluous writing. Entries in the tables of machines will from now
on consist of ordered pairs (i', j'), where i' goes from -b to b and
j' goes from 1 to 5. A "new" entry (i', j') is to be interpreted in
terms of the functioning of the machine in a manner depending on the
number p of the row of the table it is in; this entry has the same
meaning as the "old" entry (p+i', j') used to have.
Thus, halting is now accomplished by entries of the form (k, j)
(1 ≤ k ≤ b) in the kth row (from the end) of the table. Such an entry
causes the machine to halt after performing the operation indicated
by j.
2.2. In this section we attempt to give an idea of the versatility
of the bounded-transfer Turing machine. It is here shown in two ways
that b can be chosen sufficiently large so that any calculation which
one of the Turing machines of Section 1.1 can be programmed to perform
can be imitated by a suitably programmed bounded-transfer Turing
machine.
As the first proof, b is taken to be the number of rows in a 3-tape-
symbol universal Turing machine program for the machines of Section
1.1. This universal program (with its format changed to that of the
bounded-transfer Turing machines) occupies the last rows of a program
for a bounded-transfer Turing machine, a program which is mainly
devoted to writing out on the tape the information which will enable
the universal program to imitate any calculation which any one of the
Turing machines of Section 1.1 can be programmed to perform. One
row of the program is used to write out each symbol of this
information (as in the program in Section 1.7), and control passes
straight through the program row after row until it reaches the
universal program.
Now for the second proof. To program a bounded-transfer Turing
machine in such a manner that it imitates the calculations performed
by a Turing machine of Section 1.1, consider alternate squares on the
tape of the bounded-transfer Turing machine to be the squares of the
tape of the machine being imitated. Thus

[Tape diagram: 1 0 1 0 ......, with the scanner marked]

is imitated by

[Tape diagram: the same contents 1 0 1 0 ...... written on alternate
squares, with the scanner marked]

After the operation of a state (i.e., write 0, write 1, write blank,
shift tape left, shift tape right) has been imitated, as many 1's as
the number of the next state to be imitated are written on the squares
of the tape of the bounded-transfer Turing machine which are not used
to imitate the squares of the other machine's tape, starting on the
square immediately to the right of the one on which is the scanner of
the bounded-transfer Turing machine. Thus if in the foregoing
situation the next state to be imitated is state number three, then
the tape of the bounded-transfer Turing machine becomes

[Tape diagram: 1 0 1 1 1 0 1 ......, with the scanner marked]
The rows of the table which cause the bounded-transfer Turing machine
to do the foregoing (type I rows) are interwoven or braided with two
other types of rows. The first of these (type II rows) is used for the
sole purpose of putting the bounded-transfer Turing machine back in
its initial state (row 1 of the table; this row is a type III row).
They appear (as do the other two types of rows) periodically
throughout the table, and each of them does nothing but transfer
control to the preceding one. The second of these (type III rows)
serve to pass control back in the other direction; each time control
is about to pass a block of type I rows that imitate a particular
state of the other machine while traveling through type III rows, the
type III rows erase the rightmost of the 1's used to write out the
number of the next state to be imitated. When finally none of these
place-marking 1's is left, control is passed to the group of type I
rows that was about to be passed, which then proceeds to imitate the
appropriate state of the Turing machine of Section 1.1.
Thus the obstacle of the upper bound on the length of transfers
in bounded-transfer Turing machines is overcome by passing up and
down the table by small jumps; keeping track of the progress to the
desired destination is achieved by subtracting a unit from a count
written on the tape just prior to departure.
Although bounded-transfer Turing machines have been shown to
be versatile, it is not true that as the number of states goes to
infinity, asymptotically 100 percent of the bits required to specify a
program also serve to specify the behavior of the bounded-transfer
Turing machine.
2.3. In this section the following fundamental result is proved.
(2.3.1) L(Cn) ~ an, where a is, of course, a positive constant.
First it is shown that there exists an a greater than zero such that:
(2.3.2) L(Cn) ≥ an.
It is clear that there are exactly

((5)(2b+1))^(3N)

different ways of making entries in the table of an N-state bounded-
transfer Turing machine; that is, there are

2^((3 log2(10b+5))N)

different programs for an N-state bounded-transfer Turing machine.
Since a different program is required to have the machine calculate
each of the 2^n different binary sequences of length n, it can be seen
that an N-state bounded-transfer Turing machine can be programmed
to calculate any binary sequence of length n only if

(3 log2(10b+5))N ≥ n, or N ≥ n/(3 log2(10b+5)).

Thus one can take a = 1/(3 log2(10b+5)).
Next it is shown that:
(2.3.3) L(Cn) + L(Cm) ≥ L(Cn+m).
To do this we present a way of making entries in a table with at most
L(Cn) + L(Cm) rows which causes the bounded-transfer Turing machine
thus programmed to calculate any particular binary sequence S of
length n+m. S can be expressed as a binary sequence S' of length n
followed by a binary sequence S'' of length m. The table is then
formed from two sections which are numbered in the order in which they
are encountered in reading from row 1 to the last row of the table.
Section I consists of at most L(Cn) rows. It is a program which
calculates S'. Section II consists of at most L(Cm) rows. It is a
program which calculates S''. It follows from this construction and
the definitions that (2.3.3) holds.
(2.3.2) and (2.3.3) together imply (2.3.1).² This will be shown by a
demonstration of the following general proposition:
(2.3.4) Let A_1, A_2, A_3, ... be an infinite sequence of natural
numbers satisfying
(2.3.5) A_n + A_m ≥ A_{n+m}.
Then as n goes to infinity, (A_n/n) tends to a limit from above.
² [As stated in the preface of this book, it is straightforward to
apply to LISP the techniques used here to study bounded-transfer
Turing machines. Let us define HLISP(x), where x is a bit string, to
be the size in characters of the smallest LISP S-expression whose
value is the list x of 0's and 1's. Consider the LISP S-expression
(APPEND P Q), where P is a minimal LISP S-expression for the bit
string x and Q is a minimal S-expression for the bit string y. I.e.,
the value of P is the list of bits x and P is HLISP(x) characters
long, and the value of Q is the list of bits y and Q is HLISP(y)
characters long. (APPEND P Q) evaluates to the concatenation of the
bit strings x and y and is HLISP(x) + HLISP(y) + 10 characters long.
Therefore, let us define H'LISP(x) to be HLISP(x) + 10. Now H'LISP is
subadditive like L(S). The discussion of bounded-transfer Turing
machines in this paper and the next therefore applies practically word
for word to H'LISP = HLISP + 10. In particular, let B(n) be the
maximum of H'LISP(x) taken over all n-bit strings x. Then B(n)/n is
bounded away from zero, B(n+m) ≤ B(n) + B(m), and B(n) is asymptotic
to a nonzero constant times n.]
For all n, A_n ≥ 0, so that (A_n/n) ≥ 0; that is, {(A_n/n)} is a set
of reals bounded from below. It is concluded that this set has a
greatest lower bound a. We now show that

lim_{n->∞} (A_n/n) = a.

Since a is the greatest lower bound of the set {(A_n/n)}, for any e
greater than zero there is a d for which
(2.3.6) (A_d/d) < a + e.
Every natural number n can be expressed in the form n = qd + r, where
0 ≤ r < d. From (2.3.5) it can be seen that for any n_1, n_2, n_3,
..., n_{q+1},

Σ_{k=1}^{q+1} A_{n_k} ≥ A_{(n_1 + n_2 + ... + n_{q+1})}.

Taking n_k = d (k = 1, 2, ..., q) and n_{q+1} = r in this, we obtain

qA_d + A_r ≥ A_{qd+r} = A_n,

which with (2.3.6) gives

qd(a+e) = (n-r)(a+e) ≥ A_n - A_r,

or

(1 - r/n)(a+e) ≥ A_n/n - A_r/n,

which implies

a + e ≥ A_n/n + ε_n,

or

lim sup_{n->∞} (A_n/n) ≤ a + e.

Since e > 0 is arbitrary, it can be concluded that

lim sup_{n->∞} (A_n/n) ≤ a,

which with the fact that (A_n/n) ≥ a for all n gives

lim_{n->∞} (A_n/n) = a.
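(2.3.4) is easy to illustrate numerically; a toy sketch (ours), with
an arbitrary subadditive sequence standing in for L(Cn):

from math import ceil

def A(n):
    # ceil(0.37*n) + 5 satisfies A(n) + A(m) >= A(n+m)
    return ceil(0.37 * n) + 5

for n in (10, 100, 1000, 10000, 100000):
    print(n, A(n) / n)   # decreases toward the greatest lower bound 0.37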
2.4. In Section 2.3 an asymptotic formula analogous to a part of
Section 1.6 was demonstrated; in this section a result is obtained
which completes the analogy. This result is most conveniently stated
with the aid of the notation B(m) (where m is a natural number) for
the binary sequence which is the numeral representing m in base-two
notation (e.g., B(6) = 110).
(2.4.1) There exists a constant c such that those binary sequences S
of length n satisfying
(2.4.2)

L(S) ≤ L(Cn) - L(B(L(Cn))) - [log2 L(B(L(Cn)))]
       - L(Cm) - [log2 L(Cm)] - c

are less than 2^(n-m) in number.
The proof of (2.4.1) is by contradiction. We suppose that those S of
length n satisfying (2.4.2) are 2^(n-m) or more in number; and we
conclude that for any particular binary sequence S* of length n there
is a program of at most L(Cn) - 1 rows that causes a bounded-transfer
Turing machine to calculate S*. This table consists of 11 sections
which come one after the other. The first section consists of a single
row which moves the tape one square to the left (1,4 1,4 1,4 will
certainly do this). The second section consists of exactly L(B(L(Cn)))
rows; it is a program for computing B(L(Cn)) consisting of the
smallest possible number of rows. The third section is merely a
repetition of the first section. The fourth section consists of
exactly [log2 L(B(L(Cn)))] rows. Its function is to write out on the
tape the binary sequence which represents the number L(B(L(Cn))) in
base-two notation. Since this is a sequence of exactly
[log2 L(B(L(Cn)))] bits, a simple program exists for calculating it
consisting of exactly [log2 L(B(L(Cn)))] rows, each of which causes
the machine to write out a single bit of the sequence and then shift
the tape a single square to the left (e.g., 0,2 1,4 1,4 will do for a
0 in the sequence). The fifth section is merely a repetition of the
first section. The sixth section consists of at most L(Cm) rows; it is
a program consisting of the smallest possible number of rows for
computing the sequence S^R of the m rightmost bits of S*. The seventh
section is merely a repetition of the first section. The eighth
section consists of exactly [log2 L(Cm)] rows. Its function is to
write out on the tape the binary sequence which represents the number
L(Cm) in base-two notation. Since this is a sequence of exactly
[log2 L(Cm)] bits, a simple program exists for calculating it
consisting of exactly [log2 L(Cm)] rows, each of which causes the
machine to write out a single bit of the sequence and then shift the
tape a single square to the left. The ninth section is merely a
repetition of the first section. The tenth section consists of at most
as many rows as the expression on the right-hand side of the
inequality (2.4.2). It is a program for calculating one (out of not
less than 2^(n-m)) of the sequences of length n satisfying (2.4.2)
(which one it is depends on S* in a manner which will become clear
from the discussion of the eleventh section; for now we merely denote
it by S^L).
We now come to the last and crucial eleventh section, which consists
by definition of (c - 6) rows, and which therefore brings the total
number of rows up to at most 1 + L(B(L(Cn))) + 1 +
[log2 L(B(L(Cn)))] + 1 + L(Cm) + 1 + [log2 L(Cm)] + 1 + (the
expression on the right-hand side of the inequality (2.4.2)) +
(c - 6) = L(Cn) - 1. When this section of the program takes over, the
numbers and sequences L(Cn), L(B(L(Cn))), S^R, L(Cm), S^L are written,
in the above order, on the tape. Note, first of all, that section 11
can:
1. compute the value v of the right-hand side of the inequality
(2.4.2) from this data,
2. find the value of n from this data (simply by counting the number
of bits in the sequence S^L), and
3. find the value of m from this data (simply by counting the number
of bits in S^R).
Using its knowledge of v, m and n, section 11 then computes from the
sequence S^L a new sequence S^L' which is of length (n-m). The manner
in which it does this is discussed in the next paragraph. Finally,
section 11 adjoins the sequence S^R to the right of S^L', positions
this sequence, which is in fact S*, properly for it to be able to be
regarded as calculated, cleans up the rest of the tape, and halts
scanning the square just to the right of S*. S* has been calculated.
To finish the proof of (2.4.1) we must now only indicate how section
11 arrives at S^L' (of length (n-m)) from v, m, n, and S^L. (And it
must be here that it is made clear how the choice of S^L depends on
S*.) By assumption, S^L satisfies
(2.4.3) L(S^L) ≤ v and S^L is of length n.
Also by assumption there are at least 2^(n-m) sequences which satisfy
(2.4.3). Now section 11 contains a procedure which, when given any one
of some particular serially ordered set Q_nv of 2^(n-m) sequences
satisfying (2.4.3), will find the ordinal number of its position in
Q_nv. And the number of the position of S^L in Q_nv is the number of
the position of S^L' in the natural ordering of all binary sequences
of length (n-m) (i.e., 000...00, 000...01, 000...10, 000...11, ...,
111...00, 111...01, 111...10, 111...11). In the next and final
paragraph of this proof, the foregoing italicized sentence is
explained.
It is sufficient to give here a procedure for serially calculating the
members of Q_nv in order. (That is, we define a serially ordered Q_nv
for which there is a procedure.) By assumption we know that the
predicate which is satisfied by all members of Q_nv, namely,

(L(...) ≤ v) & (... is of length n),

is satisfied by at least 2^(n-m) sequences. It should also be clear to
the reader on the basis of some background in Turing machine and
recursive function theory (see especially Davis [4], where recursive
function theory is developed from the concept of the Turing machine)
that the set Q of

all natural numbers of the form 2^n 3^v 5^e, where e is the
natural number represented in base-two notation by a binary
sequence S satisfying (L(S) ≤ v) & (S is of length n),

is recursively enumerable. Let T denote some particular Turing machine
which is programmed in such a manner that it recursively enumerates
(or, to use E. Post's term, generates) Q. The definition of Q_nv can
now be given:

Q_nv is the set of binary sequences of length n which represent
in base-two notation the exponents of 5 in the prime
factorization of the first 2^(n-m) members of Q generated by
T whose prime factorizations have 2 with an exponent of n
and 3 with an exponent of v, and their order in Q_nv is the
order in which T generates them.

Q.E.D.
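The coding 2^n 3^v 5^e and its inverse are elementary; a sketch
(ours, not the paper's):

def encode(n, v, e):
    return 2**n * 3**v * 5**e

def exponent_of(q, p):          # the exponent of the prime p in q
    a = 0
    while q % p == 0:
        q //= p
        a += 1
    return a

q = encode(7, 3, 22)
print(exponent_of(q, 2), exponent_of(q, 3), exponent_of(q, 5))  # 7 3 22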
It can be proved by contradiction that the set Q is not recursive.
For were Q recursive, there would be a program which given any finite
binary sequence S would calculate L(S). Hence there would be a program
which given any natural number n would calculate the members of Cn.
Giving n to this program can be done by a program of length [log2 n].
Thus there would be a program of length [log2 n] + c which would
calculate an element of Cn. But we know that the shortest program for
calculating an element of Cn is of length ~ an, so that we would have
for n sufficiently large an impossibility.
It should be emphasized that if L(Cn) is an effectively computable
function of n, then the method of this section yields the far stronger
result: There exists a constant c such that those binary sequences S
of length n satisfying L(S) ≤ L(Cn) - L(Cm) - c are less than 2^(n-m)
in number.³
2.5. The purpose of this section is to investigate the behavior of the
right-hand side of (2.4.2). We start by showing a result which is
stronger for n sufficiently large than the inequality L(Cn) ≤ n,
namely, that the constant a in the asymptotic evaluation L(Cn) ~ an
of Section 2.3 is less than 1. This is done by deriving:
³ [For LISP we also obtain this much neater form of result: that most
n-bit strings have close to the maximum complexity HLISP. The reason
is that by using EVAL a quoted LISP S-expression tells us its size as
well as its value. In other words,
HLISP(x) = HLISP(x, HLISP(x)) + O(1).]
(2.5.1) For any s there exist n and m such that

L(Cs) ≤ L(Cn) + L(Cm) + c,

and (n+m) is the smallest integral solution x of the inequality

s ≤ x + [log2 x] - 1.

From (2.5.1) it will follow immediately that if e(n) denotes the
function satisfying L(Cn) = an + e(n) (note that by Section 2.3
(e(n)/n) tends to 0 from above as n goes to infinity), then for any s,
L(Cs) ≤ L(Cn) + L(Cm) + c for some n and m satisfying
(n+m) = s - (1+ε_s) log2 s, which implies

as ≤ a(s - (1+ε_s) log2 s) + e(n) + e(m),

or

(a + ε_s) log2 s ≤ e(n) + e(m).

Hence, as n and m are both less than s and at least one of e(n), e(m)
is greater than (1/2)(a + ε_s) log2 s, there are an infinity of n for
which e(n) ≥ (1/2)(a + ε_n) log2 n. That is,

(2.5.2) lim sup_{n->∞} (L(Cn) - an)/(a log2 n) ≥ 1/2.

From (2.5.2) with L(Cn) ≤ n follows immediately
(2.5.3) a < 1.
The proof of (2.5.1) is presented by examples. The notation TU is
used, where T and U are finite binary sequences, for the sequence
resulting from adjoining U to the right of T. Suppose it is desired to
calculate some finite binary sequence S of length s, say
S = 010110010100110 and s = 15. The smallest integral solution x of
s ≤ x + [log2 x] - 1 for this value of s is 12. Then S is expressed
as S'S^T, where S' is of length x = 12 and S^T is of length
s - x = 15 - 12 = 3, so that S' = 010110010100 and S^T = 110. Next
S' is expressed as S^L S^R, where the length m of S^L satisfies
AB(m) = S^T for some (possibly null) sequence A consisting entirely of
0's, and the length n of S^R is x - m. In this case AB(m) = 110, so
that m = 6, S^L = 010110 and S^R = 010100. The final result is that
one has obtained the sequences S^L and S^R from the sequence S. And
(this is the crucial point) if one is given the S^L and S^R resulting
by the foregoing process from some unknown sequence S, one can reverse
the procedure and determine S. Thus suppose S^L = 1110110 and
S^R = 01110110000 are given. Then the length m of S^L is 7, the length
n of S^R is 11, and the sum x of m and n is 7 + 11 = 18. Therefore the
length s of S must be s = x + [log2 x] - 1 = 18 + 5 - 1 = 22. Thus
S = S^L S^R S^T, where S^T is of length s - x = 22 - 18 = 4, and so
from AB(m) = S^T, or 0B(7) = S^T, one finds S^T = 0111. It is
concluded that

S = S^L S^R S^T = 1110110011101100000111.

(For x of the form 2^h what precedes is not strictly correct. In such
cases s may equal the foregoing indicated quantity or the foregoing
indicated quantity minus one. It will be indicated later how such
cases are to be dealt with.)
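A sketch (ours, not the paper's) of the two directions of this
correspondence, checked against the worked examples above; br(x)
computes [x], the least integer greater than x, and the x = 2^h
adjustment is ignored, as in the text:

from math import floor, log2

def br(x):                      # [x]: the least integer greater than x
    return floor(x) + 1

def split(S):                   # S -> (S^L, S^R)
    s = len(S)
    x = 1                       # smallest solution of s <= x+[log2 x]-1
    while x + br(log2(x)) - 1 < s:
        x += 1
    S_T = S[x:]                 # S = S' S^T, with S' of length x
    m = int(S_T, 2) if S_T else 0   # A B(m) = S^T, A a run of 0's
    return S[:m], S[m:x]

def F(S_L, S_R):                # (S^L, S^R) -> S
    m, n = len(S_L), len(S_R)
    x = m + n
    s = x + br(log2(x)) - 1
    S_T = bin(m)[2:].zfill(s - x)   # A B(m), left-padded with 0's
    return S_L + S_R + S_T

print(split('010110010100110'))     # ('010110', '010100')
print(F('1110110', '01110110000'))  # '1110110011101100000111'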
Let us now denote by F the function carrying (S^L, S^R) into S, and
by F_R^(-1) the function carrying S into S^R, defining F_L^(-1)
similarly. Then for any particular binary sequence S of length s the
following program consists of at most

1 + L(F_L^(-1)(S)) + 1 + L(F_R^(-1)(S)) + 2 + (c-4) ≤ L(Cn) + L(Cm) + c

rows, with m + n = x being the smallest integral solution of
s ≤ x + [log2 x] - 1.
Section I:
    1,4 1,4 1,4
Section II consists of L(F_L^(-1)(S)) rows. It is a program with the
smallest possible number of rows for calculating F_L^(-1)(S).
Section III:
    1,4 1,4 1,4
Section IV consists of L(F_R^(-1)(S)) rows. It is a program with the
smallest possible number of rows for calculating F_R^(-1)(S).
(Should x be of the form 2^h, another section is added at this point
to tell Section V which of the two possible values s happens to have.
This section consists of two rows; it is either
    1,4 1,4 1,4
    1,2 1,2 1,2
or
    1,4 1,4 1,4
    1,3 1,3 1,3.)
Section V consists of c-4 rows, by definition. It is a program that is
able to compute F. It computes F(F_L^(-1)(S), F_R^(-1)(S)) = S,
positions S properly on the tape, cleans up the rest of the tape,
positions the scanner on the square just to the right of S, and halts.
As this program causes S to be calculated, the proof is easily seen to
be complete.
The second result is:
(2.5.4) Let f(n) be any effectively computable function that goes to
infinity with n and satisfies f(n+1) - f(n) = 0 or 1. Then there are
an infinity of distinct n_k for which L(B(L(C_{n_k}))) < f(n_k).
This is proved from (2.5.5), the proof of which is identical with that
of (1.7.5).
(2.5.5) For any positive integer p there is at least one solution n of
L(Cn) = p.
Let the n_k satisfy L(C_{n_k}) = f^(-1)(k), where f^(-1)(k) is defined
to be the smallest value of j for which f(j) = k. Then since
L(Cn) ≤ n, f^(-1)(k) ≤ n_k. Noting that f^(-1) is an effectively
computable function, it is easily seen that

L(B(L(C_{n_k}))) = L(B(f^(-1)(k))) ≤ L(B(k)) + c ≤ [log2 k] + c.

Hence, for all sufficiently large k,

L(B(L(C_{n_k}))) ≤ [log2 k] + c < k = f(f^(-1)(k)) ≤ f(n_k).

Q.E.D.
(2.5.4) and (2.4.1) yield:
(2.5.6) Let f(n) be any effectively computable function that goes to
infinity with n and satisfies f(n+1) - f(n) = 0 or 1. Then there are
an infinity of distinct n_k for which less than 2^(n_k - f(n_k))
binary sequences S of length n_k satisfy
L(S) ≤ L(C_{n_k}) - (a + ε_k) f(n_k).
Part 3
3.1. Consider a scientist who has been observing a closed system that
once every second either emits a ray of light or does not. He
summarizes his observations in a sequence of 0's and 1's in which a
zero represents "ray not emitted" and a one represents "ray emitted."
The sequence may start

0110101110...

and continue for a few thousand more bits. The scientist then examines
the sequence in the hope of observing some kind of pattern or law.
What does he mean by this? It seems plausible that a sequence of 0's
and 1's is patternless if there is no better way to calculate it than
just by writing it all out at once from a table giving the whole
sequence:
My Scientific Theory
0
1
1
0
1
0
1
1
1
0
...
This would not be considered acceptable. On the other hand, if the
scientist should hit upon a method by which the whole sequence could
be calculated by a computer whose program is short compared with the
sequence, he would certainly not consider the sequence to be entirely
patternless or random. And the shorter the program, the greater the
pattern he might ascribe to the sequence.
There are many genuine parallels between the foregoing and the way
scientists actually think. For example, a simple theory that accounts
for a set of facts is generally considered better or more likely to be true
than one that needs a large number of assumptions. By "simplicity" is
not meant "ease of use in making predictions." For although General
or Extended Relativity is considered to be the simple theory par ex-
cellence, very extended calculations are necessary to make predictions
from it. Instead, one refers to the number of arbitrary choices which
have been made in specifying the theoretical structure. One naturally
is suspicious of a theory the number of whose arbitrary elements is of
an order of magnitude comparable to the amount of information about
reality that it accounts for.
On the basis of these considerations it may perhaps not appear
entirely arbitrary to define a patternless or random finite binary
sequence as a sequence which in order to be calculated requires,
roughly speaking, at least as long a program as any other binary
sequence of the same length. A patternless or random infinite binary
sequence is then defined to be one whose initial segments are all
random. In making these definitions mathematically approachable it is
necessary to specify the kind of computer referred to in them. This
would seem to involve a rather arbitrary choice, and thus to make our
definitions less plausible, but in fact both of the kinds of Turing
machines which have been studied by such different methods in Parts 1
and 2 lead to precise mathematical definitions of patternless
sequences (namely, the patternless or random finite binary sequences
are those sequences S of length n for which L(S) is approximately
equal to L(Cn), or, fixing M, those for which LM(S) is approximately
equal to LM(Cn)) whose provable statistical properties start with
forms of the law of large numbers. Some of these properties will be
established in a paper of the author to appear.⁴
A final word. In scientific research it is generally considered better
for a proposed new theory to account for a phenomenon which had not
previously been contained in a theoretical structure, before the
discovery of that phenomenon rather than after. It may therefore be of
some interest to mention that the intuitive considerations of this
section antedated the investigations of Parts 1 and 2.
3.2. The definition which has just been proposed⁵ is one of many
attempts which have been made to define what one means by a
patternless or random sequence of numbers. One of these was begun by
R. von Mises [5] with contributions by A. Wald [6], and was brought to
its culmination by A. Church [7]. K. R. Popper [8] criticized this
definition. The definition given here deals with the concept of a
patternless binary sequence, a concept which corresponds roughly in
intuitive intent with the random sequences associated with probability
half of Church. However, the author does not follow the basic
philosophy of the von Mises-Wald-Church definition; instead, the
author is in accord with the opinion of Popper [8, Sec. 57, footnote
1]:

    I come here to the point where I failed to carry out fully
    my intuitive program--that of analyzing randomness as far
    as it is possible within the region of finite sequences, and of
    proceeding to infinite reference sequences (in which we need
    limits of relative frequencies) only afterwards, with the aim
    of obtaining a theory in which the existence of frequency
    limits follows from the random character of the sequence.

Nonetheless the methods given here are similar to those of Church; the
concept of effective computability is here made the central one.

⁴ The author has subsequently learned of work of P. Martin-Löf ("The
Definition of Random Sequences," research report of the Institutionen
för Försäkringsmatematik och Matematisk Statistik, Stockholm, Jan.
1966, 21 pp.) establishing statistical properties of sequences defined
to be patternless on the basis of a type of machine suggested by A. N.
Kolmogorov. Cf. footnote 5.
⁵ The author has subsequently learned of the paper of A. N.
Kolmogorov, Three approaches to the definition of the concept "amount
of information," Problemy Peredachi Informatsii [Problems of
Information Transmission] 1, 1 (1965), 3-11 [in Russian], in which
essentially the definition offered here is put forth.
A discussion can be given of just how patternless or random the
sequences given in this paper appear to be for practical purposes. How
do they perform when subjected to statistical tests of randomness?
Can they be used in the Monte Carlo method? Here the somewhat
tantalizing remark of J. von Neumann [9] should perhaps be mentioned:

    Any one who considers arithmetical methods of producing
    random digits is, of course, in a state of sin. For, as has
    been pointed out several times, there is no such thing as a
    random number--there are only methods to produce random
    numbers, and a strict arithmetical procedure of course
    is not such a method. (It is true that a problem that we
    suspect of being solvable by random methods may be solvable
    by some rigorously defined sequence, but this is a deeper
    mathematical question than we can now go into.)
Acknowledgment
The author is indebted to Professor Donald Loveland of New York
University, whose constructive criticism enabled this paper to be much
clearer than it would have been otherwise.
References
[1] Shannon, C. E. A universal Turing machine with two internal
states. In Automata Studies, Shannon and McCarthy, Eds., Princeton
U. Press, Princeton, N.J., 1956.
[2] Shannon, C. E. A mathematical theory of communication. Bell Syst.
Tech. J. 27 (1948), 379-423.
[3] Turing, A. M. On computable numbers, with an application to the
Entscheidungsproblem. Proc. London Math. Soc. {2} 42 (1936-37),
230-265; Correction, ibid. 43 (1937), 544-546.
[4] Davis, M. Computability and Unsolvability. McGraw-Hill, New York,
1958.
[5] von Mises, R. Probability, Statistics and Truth. MacMillan, New
York, 1939.
[6] Wald, A. Die Widerspruchsfreiheit des Kollektivbegriffes der
Wahrscheinlichkeitsrechnung. Ergebnisse eines mathematischen
Kolloquiums 8 (1937), 38-72.
[7] Church, A. On the concept of a random sequence. Bull. Amer. Math.
Soc. 46 (1940), 130-135.
[8] Popper, K. R. The Logic of Scientific Discovery. U. of Toronto
Press, Toronto, 1959.
[9] von Neumann, J. Various techniques used in connection with random
digits. In John von Neumann, Collected Works, Vol. V, A. H. Taub, Ed.,
MacMillan, New York, 1963.
[10] Chaitin, G. J. On the length of programs for computing finite
binary sequences by bounded-transfer Turing machines. Abstract 66T-26,
Notic. Amer. Math. Soc. 13 (1966), 133.
[11] Chaitin, G. J. On the length of programs for computing finite
binary sequences by bounded-transfer Turing machines II. Abstract
631-6, Notic. Amer. Math. Soc. 13 (1966), 228-229. (Erratum, p. 229,
line 5: replace "P" by "L".)
Received October, 1965; Revised March, 1966
ON THE LENGTH OF
PROGRAMS FOR
COMPUTING FINITE
BINARY SEQUENCES:
STATISTICAL
CONSIDERATIONS
Journal of the ACM 16 (1969),
pp. 145{159
Gregory J. Chaitin¹
Buenos Aires, Argentina
Abstract
An attempt is made to carry out a program (outlined in a previous
paper) for defining the concept of a random or patternless, finite
binary sequence, and for subsequently defining a random or
patternless, infinite binary sequence to be a sequence whose initial
segments are all random or patternless finite binary sequences. A
definition based on the bounded-transfer Turing machine is given
detailed study, but insufficient understanding of this computing
machine precludes a complete treatment. A computing machine is
introduced which avoids these difficulties.
Key Words and Phrases: computational complexity, sequences, random
sequences, Turing machines
CR Categories: 5.22, 5.5, 5.6
1. Introduction
In this section a definition is presented of the concept of a random
or patternless binary sequence based on 3-tape-symbol bounded-transfer
Turing machines.² These computing machines have been introduced and
studied in [1], where a proposal to apply them in this manner is made.
The results from [1] which are used in studying the definition are
listed for reference at the end of this section.
An N-state, 3-tape-symbol bounded-transfer Turing machine is defined
by an N-row, 3-column table. Each of the 3N places in this table must
contain an ordered pair (i, j) of natural numbers where i takes on
values from -b to b, and j from 1 to 5.³ These entries constitute,
when specified, the program of the N-state, 3-tape-symbol bounded-
transfer Turing machine and are to be interpreted as follows.
1 Address: Mario Bravo 249, Buenos Aires, Argentina.
2 The choice of 3-tape-symbol machines is made merely for the purpose of xing
ideas.
3 Here b is a constant whose value is to be regarded as xed throughout this
paper. Its exact value is not important as long as it is not \too small." For an
explanation of the meaning of \too small," and proofs that b can be chosen so that
it is not too small, see 1, Secs. 2.1 and 2.2]. (b will not be mentioned again.)
Computing Finite Binary Sequences: Statistical Considerations 413

1 0 0 1 0 0 ::::::
End of Tape Scanner 6 Tape
Black Box

Figure 1. A Turing machine

0 1 1 1 1 0 0 0 ::::::
6
Halted

Figure 2. The end of a computation

which is being scanned contains the pth symbol, then if 1  k + i  N ,


the machine is to go to its (k + i)-th state (otherwise, the machine is
to halt) after performing one of the following operations:
(a) moving the tape one square to the right if j = 5
(b) moving the tape one square to the left if j = 4
(c) marking (overprinting) the square being scanned with the j th
symbol if 1  j  3.
The rst, second, and third symbols are called, respectively, the
blank (for unmarked square), 0, and 1.
A bounded-transfer Turing machine may be represented schemati-
cally as shown in Figure 1. We make the following stipulations: initially
the machine is in its rst state and scanning the rst square of the tape
no bounded-transfer Turing machine may in the course of a calculation
scan the end square of the tape and then move the tape one square to
the right initially all squares of the tape are blank only orders to trans-
fer to state N +1 may be used to halt the machine. A bounded-transfer
414 Part VI|Technical Papers on Turing Machines & LISP
Turing machine is said to calculate a particular nite binary sequence
(e.g. 01111000) if the machine stops with that sequence written at the
end of its tape, with all other squares of the tape blank, and with its
scanner on the rst blank square of the tape. Figure 2 illustrates a ma-
chine which has calculated the particular sequence mentioned above.
Before proceeding we would like to make a comment from the point
of view of the programmer. The logical design of the bounded-transfer
Turing machine provides automatically for relocation of programs, and
the preceding paragraph establishes linkage conventions for subroutines
which calculate nite binary sequences.
Two functions are now de ned which play fundamental roles in all
that follows. L, the rst function,4 is de ned on the set of all nite
binary sequences S as follows: An N -state, 3-tape-symbol bounded-
transfer Turing machine can be programmed to calculate S if and only
if N  L(S ).
The second function L(Cn) is de ned as
L(Cn) = S ofmax
length n
L (S )
where the maximum is taken (as indicated) over all binary sequences
S of length n. Also denote by5 Cn the set of all binary sequences S of
length n satisfying L(S ) = L(Cn ).
An attempt is made in 1, Sec. 3.1] to make it plausible, on the
basis of various philosophical considerations, that the patternless or
random nite binary sequences of length n are those sequences S for
which L(S ) is approximately equal to L(Cn ). Here an attempt is made
to clarify this (somewhat informal) de nition and to make it plausible
by proving various results concerning what may be termed statistical
properties of such nite binary sequences. The set C1 of patternless
or random, in nite binary sequences is formally de ned to be the set
of all in nite binary sequences S which satisfy the following inequality
for all suciently large values of n:
L(Sn ) > L(Cn) ; f (n)
4 Use of the letter \L" is suggested by the phrase \the Length of program neces-
sary for computing: : :".
5 Use of the letter \C" is suggested by the phrase \the most Complex binary
sequences of length: : :".
Computing Finite Binary Sequences: Statistical Considerations 415
where f (n) = 3 log2 n and Sn is the sequence of the rst n bits of S .
This de nition, unlike the rst, is quite precise but is also somewhat
arbitrary. The failure to state the exact cut-o point at which L(S )
becomes too small for S to be considered random or patternless gives
to the rst de nition its informal character. But in the case of nite
binary sequences, no gain in clarity is achieved by arbitrarily settling
on a cut-o point, while the opposite is true for in nite sequences.
The results from 1] which we need are as follows:
L(S S 0)  L(S ) + L(S 0) (1)
where S and S 0 are nite binary sequence and is the concatenation
operation.
L(Cn+m )  L(Cn) + L(Cm ) (2)
There exists a positive real constant a such that (3)
(L(Cn)=n)  a (3a)

!1(L(Cn )=n) = a :
nlim (3b)
There exists an integer c such that there are less than 2n;m binary
sequences S of length n satisfying the inequality
L(S )  L(Cn ) ; log2 n ; m ; c: (4)
Inequalities (1), (2), and (3a) are used only in Section 6, and (4) is used
only in Section 7. For the proofs of (1), (2), and (3) see 1, Sec. 2.3].
The validity of inequality (4) is easily demonstrated using the method
of 1, Sec. 2.4].
The following notational conventions are used throughout this pa-
per:
(a) denotes the concatenation operation.
(b) Let S be a nite binary sequence. S n denotes the result of con-
catenating S with itself n ; 1 times.
(c) Let m be a positive integer. B (m) denotes the binary sequence
which is the numeral representing m in base-two notation e.g.
B (37) = 100101. Note that the bit at the left end of B (m) is
always 1.
416 Part VI|Technical Papers on Turing Machines & LISP
(d) Let x be a real number. x] denotes the least integer greater than
the enclosed real x. Note that this is not the usual convention,
and that the length of B (m) is equal to log2 m]. This last fact
will be used but not explicitly mentioned.
(e) x denotes a (not necessarily positive) function of x, and possibly
other variables, which approaches zero as x approaches in nity
with any other variables held xed.
(f) Let S be an in nite binary sequence. Sk denotes the binary se-
quence consisting of the rst k bits of S .

2. The Fundamental Theorem


All of our results concerning the statistical properties of random binary
sequences will be established by applying the result which is proved in
this section.
Theorem 1. Let q be an eective ordering of the nite binary
sequences of any given length among themselves i.e. let q be an ef-
fectively computable function with domain consisting of the set of all
nite binary sequences and with range consisting of the set of all pos-
itive integers, and let the restriction of q to the domain of the set of
all binary sequences of length n have the range f1 2 3 : : :  2n g. Then
there exists a positive integer c such that for all binary sequences S of
length n,
L(S )  L(Clog2 q(S)]) + L(Clog2 n]) + c:
Proof. The program in Figure 3 calculates S and consists of the
following number of rows:
1+ L(B (q(S ))) + 1 + L(B (n)) + (c ; 2)  L(Clog2 q(S)]) + L(Clog2 n] ) + c:

3. An Application: Matching Pennies


The following example of an application of Theorem 1 concerns the
game of Matching Pennies.
Computing Finite Binary Sequences: Statistical Considerations 417

Section I:
1,4 1,4 1,4
Section II consists of L(B (q(S ))) rows. It is a
program for calculating B (q(S )) consisting of the
smallest possible number of rows.
Section III:
1,4 1,4 1,4
Section IV consists of L(B (n)) rows. It is a
program for calculating B (n) consisting of the
smallest possible number of rows.
Section V consists by de nition of c ; 2 rows.
It calculates the eectively computable function
q;1(q(S ) n) = S 
it nds the two arguments on the tape.
Figure 3. Proof of Theorem 1

According to the von Neumann and Morgenstern theory 2] of the


mixed strategy for nonstrictly determined, zero-sum two-person games,
a rational player will choose heads and tails equiprobably by some \de-
vice subject to chance".
Theorem 2. Let S1 S2 S3 : : : be a sequence of distinct, nite
binary sequences of lengths, respectively, n1 n2 n3 : : : which satis es
L(Sk ) ! L(Cn ). Let st be an eectively computable binary function
de ned on the set of all nite binary sequences and the null sequence.
k

For each positive integer k consider a sequence of nk plays of the game


of Matching Pennies. There are two players: A who attempts to avoid
matches and B who attempts to match A's penny. The players employ
the following strategies for the mth (1  m  nk ) play of the sequence.
A's strategy is to choose heads (tails) if the mth bit of Sk is 1 (0).
B 's strategy is to choose heads (tails) if 1 (or 0) = st(the sequence
consisting of the m ; 1 successive choices of A up to this point on this
sequence of plays, heads being represented by 1 and tails by 0). Then
as k approaches in nity, the ratio of the two quantities (the sum of
418 Part VI|Technical Papers on Turing Machines & LISP
the payos to A (B ) during the kth sequence of plays of the game of
Matching Pennies) approaches the limit 1.
In other words, a random or patternless sequence of choices of heads
or tails will be matched about half the time by an opponent who at-
tempts to predict the next choice in an eective manner from the pre-
vious choices.
The proof of Theorem 2 is similar to the proof of Theorem 3 below,
and therefore is omitted.

4. An Application: Simple Normality


In analogy to Borel's concept of the simple normality of a real number
r in the base b (see 3, Ch. 9] for a de nition of this concept of Borel),
let a sequence S1 S2 S3 : : : of nite b-ary sequences be called simply
normal if
occurrences of a in Sk = 1
lim the numbertheof length
k!1 of Sk b
for each of the b possible values of a. The application of Theorem 1 given
in this section concerns the simple normality of a sequence S10  S20  S30  : : :
of nite b-ary sequences in which each of the Sk0 is associated with a
binary sequence Sk in a manner de ned in the next paragraph. It will
turn out that L(Sk ) ! L(Cn ), where nk is the length of Sk , is a su-
cient condition for the simple normality of the sequence of associated
k

sequences.
Given a nite binary sequence, we may place a binary point to its
left and consider it to be the base-two notation for a nonnegative real
number r less than 1. Having done so it is natural to consider, say, the
ternary sequence used to represent r to the same degree of precision
in base-three notation. Let us de ne this formally for an arbitrary
base b. Suppose that the binary sequence S of length n represents a
real number r when a binary point is axed to its left. Let n0 be the
smallest positive integer for which 2n  bn . Now consider the set of all
0

reals written in base-b notation as a \decimal" point followed by any


of the bn b-ary sequences of length n0, including those with 0's at the
0

right end. Let r0 be the greatest of these reals which is less than or
Computing Finite Binary Sequences: Statistical Considerations 419
equal to r, and let the b-ary sequence S 0 be the one used to represent
r0 in base-b notation. S 0 is the b-ary sequence which we will associate
with the binary sequence S . Note that no two binary sequences of the
same length are associated with the same b-ary sequence.
It is now possible to state the principal result of this section.
Theorem 3. Let S1 S2 S3 : : : be a sequence of distinct, nite
binary sequences of lengths, respectively, n1 n2 n3 : : : which satis es
L(Sk ) ! L(Cn ). Then the sequence S10  S20  S30  : : : of associated b-ary
sequences is simply normal.
k

We rst prove a subsidiary result.


Lemma 1. For any real number e > 0, any real number d > 1,
b, and 0  j < b, for all suciently large values of n, if S is a binary
sequence of length n whose associated b-ary sequence S 0 of length n0
satis es the following condition
 
 the number of occurrences of j in S 0 ; 1  > e (5)
 n0 b
then
L(S ) < L(C ( 1 1 1 1 1 + ) ): e e

nd ]
H ;   ;  e
b b; b b; b
log 2 b
Here
X
b
H (p1  p2 : : : pb) (p1  0 p2  0 : : :  pb  0 pi = 1)
i=1
is de ned to be equal to
X
b
; pi log2 pi
i=1
where in this sum any terms 0 log2 0 are to be replaced by 0.
The H function occurs because the logarithm to the base two of
X  0!
(b ; 1) nk 
k
j ; 1 j>e
k
n0
b;
b
420 Part VI|Technical Papers on Turing Machines & LISP
the number of b-ary sequences S 0 of length n0 which satisfy (5) is as-
ymptotic, as n approaches in nity, to
 
n0H 1b ; b ;e 1      1b ; b ;e 1  1b + e 
which is in turn asymptotic to nH= log2 b, for n0 ! n= log2 b. This may
be shown by considering the ratio of successive terms of the sum and
using Stirling's approximation, log(n!) ! n log n 4, Ch. 6, Sec. 3].
To prove Lemma 1 we rst de ne an ordering q by the following two
conditions:
(a) Consider two binary sequences (of length n) S and T whose associ-
ated b-ary sequences (of length n0) S 0 and T 0 contain, respectively,
s and t occurrences of j . S comes before (after) T if
 s 1   
 0 ;  is greater (less) than  t0 ; 1  :
n b n b
(b) If condition (a) doesn't settle which of the two sequences of length
n comes rst, take S to come before (after) T if S 0 represents (ig-
noring 0's to the left) a larger (smaller) number in base-b notation
than T 0 represents.6
Proof. We now apply Theorem 1 to any binary sequence S of length
n such that its associated b-ary sequence S 0 of length n0 satis es (5).
Theorem 1 gives us
L(S )  L(Clog2 q(S)]) + L(Clog2 n] ) + c (6)
where, as we know from the paragraph before the last, for all suciently
large values of n,
 1  nH
log2 q(S ) < 1 + 4 (d ; 1) log b : (7)
2
From (3b) and (7) we obtain for large values of n,
  nH
L(Clog2 q(S)]) < a 1 + 12 (d ; 1) log : (8)
2b
6 This condition was chosen arbitrarily for the sole purpose of \breaking ties."
Computing Finite Binary Sequences: Statistical Considerations 421
And eq. (3b) implies that for large values of n,
nH :
L(Clog2 n]) + c < a 41 (d ; 1) log (9)
2b
Adding ineqs. (8) and (9), we see that ineq. (6) yields, for large values
of n,  3  nH

L(S ) < a 1 + 4 (d ; 1) log b :
2
Applying eq. (3b) to this last inequality, we see that for all suciently
large values of n,
L(S ) < L(Cnd log2 ])
H
b

which was to be proved.


Having demonstrated Lemma 1 we need only point out that Theo-
rem 3 follows immediately from Lemma 1, eq. (3b), and the fact that
H (p1  p2 : : : pb )  log2 b
with equality if and only if
p1 = p2 = : : : = pb = 1b
(for a proof of this inequality, see 5, Sec. 2.2]).

5. Applications of a von Mises Place Selec-


tion V
In this section we consider the nite binary sequence S 0 resulting from
the application of a von Mises place selection V to a nite binary se-
quence S which is random in the sense of Section 1. For S not to be
rejected as random in the sense of von Mises 6] (i.e. in von Mises' ter-
minology, for S not to be rejected as a collective7), S 0 must contain
about as many 0's as 1's.
7Strictly speaking we cannot employ von Mises' terminology here for von Mises
was interested only in innite sequences. Kolmogorov 7] considers nite sequences.
422 Part VI|Technical Papers on Turing Machines & LISP
A place selection V is de ned to be a binary function (following
Church 8], it must be eectively computable) de ned on the set of all
nite binary sequences and the null sequence. If S = S 0 S 00 is a nite
binary sequence, then V (S 0) = 0 (1) is interpreted to mean that the
rst (i.e. the leftmost) bit of S 00 is not (is) selected from S by the place
selection V .
By applying Theorem 1 and eq. (3b) we obtain the principal result
of this section.
Theorem 4. Let S1 S2 S3 : : : be a sequence of distinct nite bi-
nary sequences of lengths, respectively, n1 n2 n3 : : : which satis es
L(Sk ) ! L(Cn ). Let V be any place selection such that
k

 !
inf length of subsequence of S selected by V >0 (10)
length of S
where the in num is taken over all nite binary sequences S . Then as
k approaches in nity, the ratio of the number of 0's in the subsequence
of Sk which is selected by V to the number of 1's in this subsequence
tends to the limit 1.
Before proceeding to the proof it should be mentioned that a sim-
ilar result can be obtained for the generalized place selections due to
Loveland 9{11].
The proof of Theorem 4 runs parallel to the proof of Theorem 3. The
subsidiary result which is proved by taking in Theorem 1 the ordering
q de ned below is
Corollary 1. Let e be a real number greater than 0, d be a real
number greater than 1, S be a binary sequence of length n, and let V
be a place selection which selects from S a subsequence S 0 of length n0.
Suppose that  
 the number of 0's in S 0 ; 1  > e: (a)
 n0 2
Then for n0 greater than N we have
L(S )  L(Clog2 q(S)]) + L(Clog2 n]) + c
where
log2 q(S ) < n0dH ( 12 + e 12 ; e) + (n ; n0):
Computing Finite Binary Sequences: Statistical Considerations 423
Here N depends only on e and d, and c depends only on V .
De nition.8 Let S be a binary sequence of length n, let S 0 of length
n0 be the subsequence of S selected by the place selection V , and let
S 00 be the subsequence of S which is not selected by V . Let9
Q = F (S 0) S 00 01 B12 B22 B32   
where each Bi is a single bit and
1 B1 B2 B3    = B (the length of F (S 0)):
We then de ne q(S ) to be the unique solution of B (q(S )) = Q.
De nition. (Let us emphasize that F (S 0) is never more than about
n0H ( 12 + e 21 ; e)
bits long for S 0 which satisfy supposition (a) of Cor. 1: this is the crux
of the proof.) Consider the \padded" numerals for the integers from
0 to 2n ; 1 padded to a length of n0 bits by adding 0's on the left.
0

Arrange these in order of decreasing


 m 1 
 0 ; 
n 2
where m is the number of 0's in the padded numeral, and when this
does not decide the order, in numerical order (e.g. the list starts
0n  1n  0n ;1 1 0n ;2 10 : : :). Suppose that S 0 is the kth padded
0 0 0 0

numeral in the list. We de ne F (S 0) to be equal to B (k). Further


details are omitted.
Strictly speaking, this denition is incorrect. S , reconstructed from F(S ) and
8 0 0

n, and S can be \pieced together" to form S using V to dictate the intermixing,


00

and thus q(S) = q(T ) for S and T of the same length only if S = T. But q(S)
is greater than 2n for some binary sequences S of length n. To correct this it is
necessary to obtain the \real" ordering q from the ordering q that we dene here
0

by \pressing the function q down so as to eliminate gaps in its range." Formally,


consider the restriction of q to the domain of all binary sequences of length n. Let
the kth element in the range of this restriction of q, ordered according to magnitude,
be denoted by rk . Let S satisfy q(S) = rk . We dene q (S) to be equal to k. As,
0

however, the result of this redenition is to decrease the value of q(S) for some S,
this is a quibble.
9 Our superscript notation for concatenation is invoked here for the rst time.
424 Part VI|Technical Papers on Turing Machines & LISP
6. Fundamental Properties of the L-Func-
tion
In Sections 3{5 the random or patternless nite binary sequences have
been studied. Before turning our attention to the random or patternless
in nite binary sequences, we would like to show that many fundamental
properties of the L-function are simple consequences of the inequality
L(S S 0)  L(S )+L(S 0) taken in conjunction with the simple normality
of sequences of random nite binary sequences.
In Theorem 3 take b = 2k and let the in nite sequence S1 S2 S3 : : :
consist of all the elements of the various Cn's. We obtain
Corollary 2. For any e > 0, k, and for all suciently large values
of n, consider any element S of Cn to be divided into between (n=k) ; 1
and (n=k) nonoverlapping binary subsequences of length k with not
more than k ; 1 bits left over at the right end of S . Then the ratio
of the number of occurrences of any particular one of the 2k possible
binary subsequences of length k to (n=k) diers from 2;k by less than
e.
Keeping in mind the hypothesis of Corollary 2, let S be some ele-
ment of Cn. Then we have L(Cn ) = L(S ), and from Corollary 2 with
L(S ) = L(S 0 S 00 S 000   )  L(S 0 ) + L(S 00) + L(S 000) +   

(this inequality is an immediate consequence of (1)) this gives us


X
L(Cn )  nk (1 + n)(2;k L(S ))
where the sum is taken over the set of all binary sequences of length k.
That is, X
(L(Cn )=n)k  (1 + n)(2;k L(S ))
with which (3a) or (L(Cn )=n)  a gives
X
ak  (1 + n )(2;k L(S )):
We conclude from this last inequality the following theorem.
Computing Finite Binary Sequences: Statistical Considerations 425
Theorem 5. For all positive integers k,10
X
ak  2;k L(S ):
S of length k
Note that the right-hand side of the inequality of Theorem 5 is
merely the expected value of the random variable L = L(S ) where the
sample space is the set of all binary sequences of length k to which
equal probabilities have been assigned. With this probabilistic frame-
work understood, we can denote the right-hand side of the inequality
of Theorem 5 by EfLg and use the notation Prf: : :g for the probability
of the enclosed event. Recalling eq. (3b) and the de nition of L(Ck ) as
max L, we thus have for any e > 0,
ak  EfLg = P PrfS gL(S )
 PrfL  (1 ; e)ak g ((1 ; e)ak ) +
(1 ; PrfL  (1 ; e)akg) L(Ck )
= PrfL  (1 ; e)akg ((1 ; e)ak) +
(1 ; PrfL  (1 ; e)akg) ((1 + k )ak)
or
k ; (e + k ) PrfL  (1 ; e)akg  0:
Thus for any real e > 0,
lim PrfL  (1 ; e)akg = 0:
k!1
(11)
Although eq. (11) is weaker than (4), it is reached by a completely
dierent route. It must be admitted, however, that it is easy to prove
Theorem 5 from (4) by taking into account the subadditivity of the
right-hand side of the inequality of Theorem 5.
From Theorem 5 we now demonstrate
Corollary 3. For all positive integers n, (L(Cn )=n) > a.
Proof. Since L(0n )  L(B (n)) + c  log2 n]+ c, for large n, L(S ) <
an for at least one binary sequence of length n, and we therefore may
conclude from Theorem 5 that for large n there must be at least one
This statement remains true, as can be proved in several ways, if \<" replaces
10
\".
426 Part VI|Technical Papers on Turing Machines & LISP
binary sequence S 0 of length n for which L(S 0) > an that is, for large
n,
(L(Cn )=n) > a: (12)
We now nish the proof of Corollary 3 by contradiction. Suppose
that Corollary 3 is false, and there exists an n0 such that
(L(Cn0 )=n0) = a:
((L(Cn0 )=n0) < a is impossible by (3a).) Then from (2) and (3a) it
would follow that for all positive integers k,
(L(Ckn0 )=kn0) = a
which contradicts (12).
The nal topic of this section is a derivation of
Theorem 6. L(Cn) ; an is unbounded.
Proof. Consider some particular binary sequence S which is a
member of Cn. Then from Corollary 2, for large values of n there
must certainly be a sequence of k consecutive 0's in S . Suppose that
S = R 0k T . Then we have
L(Cn) = L(S ) = L(R 0k T )  L(R) + L(0k ) + L(T )
 L(R) + L(B (k)) + c + L(T )
 L(Ci) + L(Cj ) + log2 k] + c
where i is the length of R, j is the length of T , and n ; k = i + j . That
is,
Lemma 2. For any positive integer k, for all suciently large values
of n there exist i and j such that
L(Cn)  L(Ci) + L(Cj ) + log2 k] + c
and n ; k = i + j .
Theorem 6 follows immediately from Lemma 2 through proof by
contradiction.
Computing Finite Binary Sequences: Statistical Considerations 427
7. Random or Patternless In
nite Binary
Sequences
This section and Section 8 are devoted to a study of the set C1 of
random or patternless in nite binary sequences de ned in Section 1.
Two proofs that C1 is nonempty, both based on (4), are presented
here. The rst proof is measure theoretic the measure space employed
may be de ned in probabilistic terms as follows: the successive bits
of an in nite binary sequence are independent random variables which
assume the values 0 and 1 equiprobably. The second proof exhibits an
element from C1 .
Theorem 7. C1 is nonempty.
First Proof. From (4) and the Borel-Cantelli lemma, it follows im-
mediately that
C1 is a set of measure 1: (13)
Second Proof. It is easy to see from (4) that we can nd an N so
large that X ;k
Nk 2 < 1 (14)
k>N
where Nk is the number of binary sequences S of length k for which
L(S )  L(Ck ) ; 3 log2 k:
Consider the following process which never terminates (for that would
contradict (14)).
Start: Set k = 0, set S = null sequence, go to Loop1.
Loop1: Is k  N or L(S ) > L(Ck ) ; 3 log2 k?
If so, set k = k + 1, set S = S 0, go to Loop1.
If not, go to Loop2.
Loop2: If S = S 0 0, set S = S 0 1, go to Loop1.
If S = S 0 1, set k = k ; 1, set S = S 0.
If k 6= 0, go to Loop2.
If k = 0, stop.
428 Part VI|Technical Papers on Turing Machines & LISP
Then from Dirichlet's box principle (if an in nity of letters is placed in
a nite number of pigeonholes, then there is always a pigeonhole which
receives an in nite number of letters) it is clear that from some point
on, the rst bit of S will remain xed from some point on, the rst
two bits of S will remain xed : : :  from some point on (depending on
n), the rst n bits of S will remain xed : : : Let us denote by Slim the
in nite binary sequence whose nth bit is 0 (1) if from some point on,
the nth bit of S remains 0 (1). It is clear that Slim is in C1 .
Remark. When C1 was de ned in Section 1, we pointed out that
this de nition contains an arbitrary element, i.e. the choice of 3 log2 n
as the function f (n). In de ning C1 it is desirable to choose an f
which goes to in nity as slowly as possible and which results in a C1
of measure 1. We will call such f 's \suitable." From results in 1, Secs.
2.4 and 2.5], which are more powerful than (5), it follows that there is
an f which is suitable and satis es the equations
(
lim sup(f (n)= log2 n) = 2a
lim inf(f (n)= log2 n) = a:
The question of obtaining lower bounds on the growth of an f which is
suitable will be considered in Section 10, but there a dierent computing
machine is used as the basis for the de nition of random or patternless
in nite binary sequence.

8. Statistical Properties of In
nite, Ran-
dom or Patternless Binary Sequences
Results concerning the statistical properties of in nite, random or pat-
ternless binary sequences follow from the corresponding results for nite
sequences. Thus Theorem 8 is an immediate consequence of Theorem
3, and Corollary 1 and eq. (3b) yield Theorem 9.
Theorem 8. Real numbers whose binary expansions are sequences
in C1 are simply normal in every base.11
11It is known from probability theory that a real r which is simply normal in
every base has the following property. Let b be a base, and denote by an the nth
\digit" in the base-b expansion of r. Consider a b-ary sequence c1  c2 : : : cm. As n
Computing Finite Binary Sequences: Statistical Considerations 429
Theorem 9. Any in nite binary sequence in C1 is a collective with
respect to the set of place selections12 which are eectively computable
and satisfy the following condition: For any in nite binary sequence S ,
lim inf the number of bits in Skkwhich are selected by V > 0:

9. A General Formulation: Binary Com-


puting Machines
Throughout the study of random or patternless binary sequences which
has been attempted in the preceding sections, there has been a recurring
diculty. Theorem 1 and the relationship L(Cn) ! an have been
used as the cornerstones of our treatment, but the assumption that
L(Cn ) ! an does not ensure that L(Cn) behaves suciently smoothly
to make really eective use of Theorem 1. Indeed it is conceivable that
greater understanding of the bounded-transfer Turing machine would
reveal that L(Cn ) behaves rather roughly and irregularly. Therefore a
new computing machine is now introduced.13
To understand the logical design of this computing machine, it is
helpful to provide a general formulation of computing machines for cal-
culating nite binary sequences whose programs are also nite binary
sequences. We call these binary computing machines. Formally, a bi-
nary computing machine is a partial recursive function M of the nite
binary sequences which is nite binary sequence valued. The argument
of M is the program, and the partial recursive function gives the out-
put (if any) resulting from that program. LM (S ) and LM (Cn) (if the
approaches innity the ratio of (the number of those positive integers k less than n
which satisfy ak = c1  ak+1 = c2 : : : ak+m 1 = cm ) to n tends to the limit b m .
;
;

12Wald 12] introduced the notion of a collective with respect to a set of place
selections von Mises had originally permitted \all place selections which depend
only on elements of the sequence previous to the one being considered for selection."
13 The author has subsequently learned of Kolmogorov 13], in which a similar
kind of computing machine is used in essentially the same manner for the purpose
of dening a nite random sequence. Martin-Lof 14{15] studies the statistical prop-
erties of these random sequences and puts forth a denition of an innite random
sequence.
430 Part VI|Technical Papers on Turing Machines & LISP
computing machine is understood, the subscript will be omitted) are
de ned as follows:
(
LM (S ) = min M (P )=S (length of P )
1 if there are no such P
LM (Cn ) = S ofmax L (S ):
length n M
In this general setting the program for the de nition of a random or
patternless binary sequence assumes the following form: The pattern-
less or random nite binary sequences of length n are those sequences
S for which L(S ) is approximately equal to L(Cn). The patternless or
random in nite binary sequences S are those whose truncations Sn are
all patternless or random nite sequences. That is, it is necessary that
for large values of n, L(Sn ) > L(Cn ) ; f (n) where f approaches in nity
slowly.
We de ne below a binary computing machine M  which has, as is
easily seen, the following very convenient properties.
(a) L(Cn) = n + 1.
(b) Those binary sequences S of length n for which L(S ) < L(Cn );m
are less than 2n;m in number.
(c) For any binary computer M there exists a constant c such that
for all nite binary sequences S , LM (S )  LM (S ) + c.


The computing machine M  is constructed from the two-argument


partial recursive function U (P M 0 ), a universal binary computing ma-
chine. That is, U is characterized (following Turing) by the property
that for any binary computer M there exists a nite binary sequence
M 0 such that for all programs P , U (P M 0 ) = M (P ) where both sides
of the equation are unde ned whenever one of them is.
De nition. If possible14 let P = P 0 B where B is a single bit. If
B = 1 then we de ne M (P ) to be equal to P 0. If B = 0 then let the
following equation be examined for a solution: P 0 = S T 01 B12
B22 B32    where each Bi is a single bit, 1 B1 B2 B3    = B (n),
and T is of length n. If this equation has a solution then the solution
must be unique, and we de ne M (P ) to be equal to U (S T ).
14 That is, if P is a single bit this is not possible. M (P ) is therefore undened.

Computing Finite Binary Sequences: Statistical Considerations 431
10. Bounds on Suitable Functions
In Section 7 we promised to provide bounds on any f which is suitable
(i.e. suitable for de ning a C1 of measure 1). We prove here that
lim sup f (k)= log2 k  1
the constant being best possible.
We use the result 4 (1950 ed.), p. 163, prob. 4] that the set # of
those in nite binary sequences S for which r(Sk ) > log2 k] in nitely
often is of measure 1 here r denotes the length of the run of 0's at the
right end of the sequence.15 As # and C1 are both of measure 1, they
have an element S in common. Then for in nitely many values of k,
(
L(Sk ) > L(Ck ) ; f (k)
r(Sk ) > log2 k]:
But taking into account property (c) of M , we see that Sk =    0log2 k]
implies that L(Sk )  L(Ck;log2 k]) + c. Thus for in nitely many values
of k,
L(Ck;log2 k]) + c  L(Sk ) > L(Ck ) ; f (k)
or
k ; log2 k] + 1 + c  L(Sk ) > k + 1 ; f (k)
which implies that f (k) > log2 k] ; c. Hence lim sup f (k)= log2 k must
be greater than or equal to 1.
Now it is necessary to show that the constant is the best possible.
From the Borel-Cantelli lemma and property (b) of M , we see at once
that for f to be suitable, it is sucient that
X
1
2;f (k)
k=1
converges. Thus f (k) = log2(k(log k)2) is suitable, and this f (k) is
asymptotic to log2 k.
We are indebted to Professor Leonard Cohen of the City University of New
15
York for pointing out to us the existence of such results.
432 Part VI|Technical Papers on Turing Machines & LISP
11. Two Analogues to the Fundamental
Theorem
To study the statistical properties of binary sequences which are de ned
to be random or patternless on the basis of the computing machine M ,
it is necessary to have, in addition to properties (a) and (b) of M , an
analogue to Theorem 1. We state two, the second of which is just a
re nement of the rst. Both are proved using property (c) of M .
Theorem 10. On the hypothesis of Theorem 1, for all binary
sequences S of length n,
L(S )  L(B (q(S )) B (n) 01 B12 B22 B32   ) + c

where each Bi is a single bit and 1 B1 B2 B3    = B ( log2 n]).


Thus
L(S )  L(Cg(q(S)n)) + c  g(q(S ) n) + c0
where
g(q(S ) n) = log2 q(S )] + log2 n] + 2 log2 log2 n]]:
Theorem 11. On the hypothesis of Theorem 1, for all binary
sequences S of length n,
L(S )  L(B (q(S )) B (n + 2 ; log2 q(S )]) 01 B12 B22 B32   ) + c

where each Bi is a single bit, 1 B1 B2 B3    = B ( log2 g(q(S ) n)]),


and
g(q(S ) n) = n + 2 ; log2 q(S )]:
Thus
L(S )  L(Ch(q(S)n)) + c  h(q(S ) n) + c0
where
h(q(S ) n) = log2 q(S )] + log2 g(q(S ) n)] + 2 log2 log2 g(q(S ) n)]]:
On comparing property (a) of M  , property (b) of M , and Theo-
rem 11 with, respectively, (3b), (4), and Theorem 1, we see that they
are analogous but far more powerful. It therefore follows that Sections
Computing Finite Binary Sequences: Statistical Considerations 433
3{5, 7, and 8 can be applied almost verbatim to the present comput-
ing machine. In particular, Theorem 2, Lemma 1, Theorem 4, (13),
Theorem 8, and Theorem 9 hold, without any change whatsoever, for
the random sequences de ned on the basis of M . In all cases, how-
ever, much stronger assertions can be made. For example, in place of
Theorem 9 we can state that
Theorem 12.16 The set C1 of all in nite binary sequences S which
have the property that for all suciently large values of k, L(Sk ) >
L(Ck ) ; log2(k(log k)2), is of measure 1, and each element of C1 is
a collective with respect to the set of place selections V which are
eectively computable and satisfy the following condition:17 For any
in nite binary sequence S ,
Sk which are selected by V = 1:
lim the number of bits in log
k!1 2k

References
1] Chaitin, G. J. On the length of programs for computing nite
binary sequences. J. ACM 13, 4 (Oct. 1966), 547{569.
2] von Neumann, J., and Morgenstern, O. Theory of Games
and Economic Behavior. Princeton U. Press, Princeton, N. J.,
1953.
3] Hardy, G. H., and Wright, E. M. An Introduction to the
Theory of Numbers. Oxford U. Press, Oxford, 1962.
4] Feller, W. An Introduction to Probability Theory and Its Ap-
plications, Vol. I. Wiley, New York, 1964.
5] Feinstein, A. Foundations of Information Theory. McGraw-
Hill, New York, 1958.
6] von Mises, R. Probability, Statistics, and Truth. Macmillan,
New York, 1939.
16Compare the last paragraph of Section 10.
17In view of Section 10, it apparently is not possible by the methods of this paper
to replace the \log2 k" here by a signicantly smaller function.
434 Part VI|Technical Papers on Turing Machines & LISP
7] Kolmogorov, A. N. On tables of random numbers. Sankhya
A], 25 (1963), 369{376.
8] Church, A. On the concept of a random sequence. Bull. Amer.
Math. Soc. 46 (1940), 130{135.
9] Loveland, D. W. Recursively Random Sequences. Ph.D. Diss.,
N.Y.U., June 1964.
10] |. The Kleene hierarchy classi cation of recursively random se-
quences. Trans. Amer. Math. Soc. 125 (1966), 487{510.
11] |. A new interpretation of the von Mises concept of random
sequence. Z. Math. Logik Grundlagen Math. 12 (1966), 279{294.
12] Wald, A. Die Widerspruchsfreiheit des Kollectivbegries der
Wahrsheinlichkeitsrechnung. Ergebnisse eines mathematischen
Kolloquiums 8 (1937), 38{72.
13] Kolmogorov, A. N. Three approaches to the de nition of the
concept \quantity of information." Problemy Peredachi Infor-
matsii 1 (1965), 3{11. (in Russian)
14] Martin-Lof, P. The de nition of random sequences. Res. Rep.,
Inst. Math. Statist., U. of Stockholm, Stockholm, 1966, 21 pp.
15] |. The de nition of random sequences. Inform. Contr. 9 (1966),
602{619.
16] Lofgren, L. Recognition of order and evolutionary systems. In
Computer and Information Sciences|II, Academic Press, New
York, 1967, pp. 165{175.
17] Levin, M., Minsky, M., and Silver, R. On the problem of
the eective de nition of \random sequence". Memo 36 (revised),
RLE and MIT Comput. Center, 1962, 10 pp.

Received November, 1965 Revised November, 1966


ON THE SIMPLICITY
AND SPEED OF
PROGRAMS FOR
COMPUTING INFINITE
SETS OF NATURAL
NUMBERS
Journal of the ACM 16 (1969),
pp. 407{422

Gregory J. Chaitin1
Buenos Aires, Argentina

Abstract
It is suggested that there are innite computable sets of natural numbers
with the property that no innite subset can be computed more simply
or more quickly than the whole set. Attempts to establish this without
restricting in any way the computer involved in the calculations are not
435
436 Part VI|Technical Papers on Turing Machines & LISP
entirely successful. A hypothesis concerning the computer makes it pos-
sible to exhibit sets without simpler subsets. A second and analogous
hypothesis then makes it possible to prove that these sets are also with-
out subsets which can be computed more rapidly than the whole set. It
is then demonstrated that there are computers which satisfy both hy-
potheses. The general theory is momentarily set aside and a particular
Turing machine is studied. Lastly, it is shown that the second hypoth-
esis is more restrictive then requiring the computer to be capable of
calculating all innite computable sets of natural numbers.

Key Words and Phrases:


computational complexity, computable set, recursive set, Turing ma-
chine, constructive ordinal, partially ordered set, lattice

CR Categories:
5.22

Introduction
Call a set of natural numbers perfect if there is no way to compute in-
nitely many of its members essentially better (i.e. simpler or quicker)
than computing the whole set. The thesis of this paper is that per-
fect sets exist. This thesis was suggested by the following vague and
imprecise considerations.
One of the most profound problems of the theory of numbers is that
of calculating large primes. While the sieve of Eratosthenes appears to
be as simple and as quick an algorithm for calculating all the primes as
is possible, in recent times hope has centered on calculating large primes
by calculating a subset of the primes, those that are Mersenne numbers.
Lucas's test is simple and can test whether or not a Mersenne number is
a prime with rapidity far greater than is furnished by the sieve method.
If there are an in nity of Mersenne primes, then it appears that Lucas
1 Address: Mario Bravo 249, Buenos Aires, Argentina.
Computing Innite Sets of Natural Numbers 437
has achieved a decisive advance in this classical problem of the theory
of numbers.2
An opposing point of view is that there is no way to calculate large
primes essentially better than to calculate them all. If this is the case
it apparently follows that there must be only nitely many primes.

1. General Considerations
The notation and terminology of this paper are largely taken from Davis
3].
De nition 1. A computing machine * is de ned by a 2-ary non-
vanishing computable function  in the following manner. The natural
number n is part of the output *(p t) of the computer * at time t re-
sulting from the program p if and only if the nth prime3 divides (p t).
The in nite set *(p) of natural numbers which the program p causes
the computing machine * to calculate is de ned to be

*(p t)
t
if in nitely many numbers are put out by the computer in numerical
order and without any repetition. Otherwise, *(p) is unde ned.
De nition 2. A program complexity measure 1 is a computable
1-ary function with the property that only nitely many programs p
have the same complexity 1(p).
De nition 3. The complexity 1(S ) of an in nite computable
set S of natural numbers as computed by the computer * under the
complexity measure 1 is de ned to be equal to
(
min(p)=S 1(p) if there are such p
1 otherwise.
2 For Lucas's test, cf. Hardy and Wright 1, Sec. 15.5]. For a history of number
theory, cf. Dantzig 2], especially Sections 3.12 and B.8.
3 The 0th prime is 2, the 1st prime is 3, etc. The primes are, of course, used here
only for the sake of convenience.
438 Part VI|Technical Papers on Turing Machines & LISP
I.e. 1(S ) is the complexity of the simplest program which causes the
computer to calculate S , and if there is no such program,4 the com-
plexity is in nite.5
In this section we do not see any compelling reason for regarding
any particular computing machine and program complexity measure as
most closely representing the state of aairs with which number theo-
rists are confronted in their attempts to compute large primes as simply
and as quickly as possible.6 The four theorems of this section and their
extensions hold for any computer * and any program complexity mea-
sure 1. Thus, although we don't know which computer and complexity
measure to select, as this section holds true for all of them, we are
covered.
Theorem 1. For any natural number n, there exists an in nite
computable set S of natural numbers which has the following properties:
(a) 1(S ) > n.
(b) For any in nite computable set R of natural numbers, R  S
implies 1(R)  1(S ).
Proof. We rst prove the existence of an in nite computable set A
of natural numbers having no in nite computable subset B such that
1(B )  n. The in nite computable sets C of natural numbers for
which 1 (C )  n are nite in number. Each such C has a smallest
element c. Let the ( nite) set of all these c be denoted by D. We take
A = D.
Now let A0 A1 A2 : : : be the in nite computable subsets of A. Con-
sider the following set:
E = f1(A0) 1(A1) 1(A2) : : :g:
4 This possibility can never arise for the simple-program computers or the quick-
program computers introduced later such computers can be programmed to com-
pute any innite computable set of natural numbers.
5 A more formal denition would perhaps use !, the rst transnite ordinal,
instead of 1.
6 In Sections 2 and 3 the point of view is dierent some computing machines are
dismissed as degenerate cases and an explicit choice of program complexity function
is suggested.
Computing Innite Sets of Natural Numbers 439
From the manner in which A was constructed, we know that each mem-
ber of E is greater than n. And as the natural numbers are well-ordered,
we also know that E has a smallest element r. There exists a natural
number s such that 1(As) = r. We take S = As, and we are nished.
Q.E.D.
Theorem 2. For any natural number n and any in nite computable
set T of natural numbers with in nite complement, there exists a com-
putable set S of natural numbers which has the following property:
T  S and 1(S ) > n.
Proof. There are in nitely many computable sets of natural num-
bers which have T as a subset, but the in nite computable sets F of
natural numbers for which 1(F )  n are nite in number. Q.E.D.
Theorem 3. For any 1-ary computable function f , there exists an
in nite computable set S of natural numbers which has the following
property: *(p)  S implies the existence of a t0 such that for t > t0,
n 2 *(p t) only if t > f (n).
Proof. We describe a procedure for computing S in successive stages
(each stage being divided into two successive steps) during the kth
stage it is determined in the following manner whether or nor k 2
S . Two subsets of the computing machine programs p such that p <
k=4] are considered: set A, consisting of those programs which have
been \eliminated" during some stage previous to the kth and set B ,
consisting of those programs not in A which cause * to output the
natural number k during the rst f (k) time units of calculation.
Step 1. Put k in S if and only if B is empty.
Step 2. Eliminate all programs in B (i.e. during all future stages
they will be in A).
The above constructs S . That S contains in nitely many natural
numbers follows from the fact that up to the kth stage at most k=4
programs have been eliminated, and thus at most k=4 natural numbers
less than or equal to k can fail to be in S .7
7I.e. the Schnirelman density d(S) of S is greater than or equal to 3/4. It follows
from d(S)  3=4 that S is basis of the second order i.e. every natural number can
be expressed as the sum of two elements of S. Cf. Gelfond and Linnik 4, Sec. 1.1].
We conclude that the mere fact that a set is a basis of the second order for the
natural numbers does not provide a quick means for computing innitely many of
its members.
440 Part VI|Technical Papers on Turing Machines & LISP
It remains to show that *(p)  S implies the existence of a t0 such
that for t > t0, n 2 *(p t) only if t > f (n). Note that for n  4p + 4,
n 2 *(p t) only if t > f (n). For a value of n for which this failed to be
the case would assure p's being in A, which is impossible. Thus given
a program p such that *(p)  S , we can calculate a point at which the
program has become slow and will remain so i.e. we can calculate a
permissible value for t0. In fact, t0(p) = maxj<4p+4 f (j ). Q.E.D.
The following theorem and the type of diagonal process used in its
proof are similar in some ways to Blum's exposition of a theorem of
Rabin in 5, pp. 241{242].
Theorem 4. For any 1-ary computable function f and any in nite
computable set T of natural numbers with in nite complement, there
exists an in nite computable set S of natural numbers which is a su-
perset of T and which has the following property: *(p) = S implies the
existence of a t0 such that for t > t0, n 2 *(p t) only if t > f (n).
Proof. First we de ne three functions: a(n) is equal to the nth
natural number in T  b(n) is equal to the smallest natural number j
greater than or equal to n such that j 2 T and j + 1 62 T  and c(n) is
equal to maxnkb(n) f (k). As proof, we give a process for computing
S \ T in successive stages during the kth stage it is determined in the
following manner whether or not a(k) 2 S . Consider the computing
machine programs 0 1 2 : : :  k to fall into two mutually exclusive sets:
set A, consisting of those programs which have been eliminated during
some stage previous to the kth and set B , consisting of all others.
Step 1. Determine the set C consisting of the programs in B which
cause the computing machine * to output during the rst c(a(k)) time
units of calculation any natural numbers greater than or equal to a(k)
and less then or equal to b(a(k)).
Step 2. Check whether C is empty. Should C = , we neither
eliminate programs nor put a(k) in S  we merely proceed to the next
(the (k + 1)-th) stage. Should C = , however, we proceed to step 3.
Step 3. We determine p0 , the smallest natural number in C .
Step 4. We ask, \Does the program p0 cause * to output the num-
ber a(k) during the rst c(a(k)) time units of calculation?" According
as the answer is \no" or \yes" we do or don't put a(k) in S .
Step 5. Eliminate p0 (i.e. during future stages p0 will be in A).
The above constructs S . We leave to the reader the veri cation that
Computing Innite Sets of Natural Numbers 441
the constructed S has the desired properties. Q.E.D.
We now make a number of remarks.
Remark 1. We have actually proved somewhat more. Let U be
any in nite computable set of natural numbers. Theorems 1 and 3
hold even if it is required that the set S whose existence is asserted
be a subset of U . And if in Theorems 2 and 4 we make the additional
assumption that T is a subset of U , and U \ T is in nite, then we can
also require that S be a subset of U .
The above proofs can practically be taken word for word (with obvi-
ous changes which may loosely be summarized by the command \ignore
natural numbers not in U ") as proofs for these extended theorems. It
is only necessary to keep in mind the essential point, which in the case
of Theorem 3 assumes the following form. If during the kth stage of
the diagonal process used to construct S we decide whether to put in
S the kth element of U , we are still sure that *(p)  S is impossible
for all the p which were eliminated before. For if *(p)  U , then p
is eliminated as before while if *(p) has elements not in U , then it is
clear that *(p)  S is impossible, for S is a subset of U .
Remark 2. In Theorems 1 and 2 we see two possible extremes
for S . In Theorem 1 we contemplate an arbitrarily complex in nite
computable set of natural numbers that has the property that there
is no way to compute in nitely many of its members which is simpler
than computing the whole set. On the other hand, in Theorem 2 we
contemplate an in nite computable set of natural numbers that has the
property that there is a way to compute in nitely many of its members
which is very much simpler than computing the whole set. Theorems
3 and 4 are analogous to Theorems 1 and 2, but Theorem 3 does not
go as far as Theorem 1. Although Theorem 3 asserts the existence
of in nite computable sets of natural numbers which have no in nite
subsets which can be computed quickly, it does not establish that no
in nite subset can be computed more quickly than the whole set. In this
generality we are unable to demonstrate a Theorem 3 truly analogous
to Theorem 1, although an attempt to do so is made in Remark 5.
Remark 3. The restriction in the conclusions of Theorems 3 and
4 that t be greater than t0 is necessary. For as Arbib remarks in 6,
p. 8], in some computers * any nite part of S can be computed very
quickly by a table look-up procedure.
442 Part VI|Technical Papers on Turing Machines & LISP
Remark 4. The 1-ary computable function f of Theorems 3 and
4 can go to in nity very quickly indeed with increasing values of its
argument. For example, let f0(n) = 2n , fk+1 (n) = fk (fk (n)). For each
k, fk+1(n) is greater than fk (n) for all but a nite number of values
of n. We may now proceed from nite ordinal subscripts to the rst
trans nite ordinal by a diagonal process: f! (n) = maxkn fk (n). We
choose to continue the process up to !2 in the following manner, which
is a natural way to proceed (i.e. the fundamental sequences can be
computed by simple programs) but which is by no means the only way
to get to !2. i and j denote nite ordinals.
f!i+j+1 (n) = f!i+j (f!i+j (n))
f!(i+1)(n) = max f (n)
kn !i+k
f!2 (n) = max f (n):
kn !k
Taking f = f!2 in Theorem 3 yields an S such that any attempt
to compute in nitely many of its elements requires an amount of time
which increases almost incomprehensibly quickly with the size of the
elements computed.
More generally, the above process may be continued through to
any constructive ordinal.8 For example, there are more or less natural
manners to reach 0, the rst epsilon-number the territory up to it is
very well charted.9
The above is essentially a constructive version of remarks by Borel
8] in an appendix on a theorem of P. du Bois-Reymond. These remarks
are partly reproduced in Hardy 9].
Remark 5. Remark 4 suggests the following approach to the speed
of programs. For any constructive ordinal there is a computable 2-ary
function f (by no means unique) with the property that the set of 1-ary
functions fk de ned by fk (n) = f (k n) is a representative of when
ordered in such a manner that a function g comes before a function h
if and only if g(n) < h(n) holds for all but a nite number of values of
n. We now associate an ordinal Ord (S ) with each in nite computable
set S of natural numbers in accordance with the following rules:
8Cf. Davis 3, Sec. 11.4] for a denition of the concept of a constructive ordinal
number.
9 Cf. Fraenkel 7, pp. 207{208].
Computing Innite Sets of Natural Numbers 443
(a) Ord(S ) equals the smallest ordinal < such that fk0 , the
th element of the set of functions fk , has the following property:
There exists a program p and a time t0 such that *(p) = S and
for t > t0, n 2 *(p t) only if t  fk0 (n).
(b) If (a) fails to de ne Ord (S ) (i.e. if the set of ordinals is empty),
then Ord (S ) = .
Then for any constructive ordinal we have the following analogue
to Theorem 1.
Theorem 10. Any in nite computable set T of natural numbers
has an in nite computable subset S with the following properties:
(a) Ord(S )  Ord(T ).
(b) For any in nite computable set R of natural numbers, R  S
implies Ord(S )  Ord (R).
Proof. Let T0 T1 T2 : : : be the in nite computable subsets of T .
Consider the following set of ordinal numbers less than or equal to :
fOrd (T0) Ord (T1 ) Ord (T2) : : :g:

As the ordinal numbers less than or equal to are well-ordered, this


set has a smallest element . There exists a natural number s such
that Ord(Ts) = . We take S = Ts. Q.E.D.
However, we must admit that this approach to the speed of pro-
grams does not seem to be a convincing support for the thesis of this
paper.

2. Connected Sets, Simple-Program Com-


puters, and Quick-Program Computers
The principal results of subsections 2.A and 2.B, namely, Theorems 6
and 8, hold only for certain computers, but we argue that all other
computing machines are degenerate cases of computers which in view
of their unnecessarily restricted capabilities do not merit consideration.
444 Part VI|Technical Papers on Turing Machines & LISP
In this section and the next we attempt to make plausible the con-
tention that some connected sets (de ned below) may well be consid-
ered to be perfect sets. In subsection 2.A we study the complexity of
subsets of connected sets, and in subsection 2.B we study the speed
of programs for computing subsets of connected sets. The treatments
are analogous but we nd the second more convincing, because in the
rst treatment one explicit choice is made for the program complexity
measure 1. 1(p) is taken to be log2(p + 1)].
The concept of a connected set is analogous to the concept of a
retraceable set, cf. Dekker and Myhill 10].
De nition 4. A connecting function  is a one-to-one onto map-
ping carrying the set of all nite sets of natural numbers onto the set of
all natural numbers. The monotonicity conditions  (V  W )   (W )
must be satis ed andQthere must be a 1-ary computable function g
such that  (W ) = g( n2W pn ), where pn denotes the nth prime.2 Let
S = fs0 s1 s2 : : :g (s0 < s1 < s2 <   ) be a computable set of natural
numbers with m members (0  m  @0 ). From a connecting function
 we de ne a secondary connecting function ; as follows:10
 
;(S ) = f ( fsj g)g:
k<m j k
A  -connected set is de ned to be any in nite computable set of natural
numbers which is in the range of ;.
Remark 6. Consider a connecting function  . Note that any two
 -connected sets which have an in nite intersection must be identical.
In fact, two  -connected sets which have an element in common must
be identical up to that element.
The following important results concerning  -connected sets are es-
tablished by the methods of Section 1 and thus hold for any computer
* and complexity measure 1. For any natural number n there exists
a  -connected set S such that 1(S ) > n. This follows from the fact
that there are in nitely many  -connected sets, while the in nite com-
putable sets H of natural numbers such that 1(H )  n are only nite
in number. Theorem 3 remains true if we require that the set S whose
10Thus ;(S) always has the same number of elements as S, be S empty, nite or
innite.
Computing Innite Sets of Natural Numbers 445
existence is asserted be a  -connected set. S may be constructed by
a procedure similar to that of the proof of Theorem 3 during the kth
stage instead of deciding whether or not k 2 S , it is decided whether
or not k 2 ;;1(S ). These two results should be kept in mind while
appraising the extent to which the theorems of this section and the
next corroborate the thesis of this paper.

2.A. Simplicity
In this subsection we make one explicit choice for the program com-
plexity measure 1. We consider programs to be nite binary sequences
as well as natural numbers:
Programs
Binary Sequence ( 0 1 00 01 10 11 000 001 010 011 : : :
Natural Number 0 1 2 3 4 5 6 7 8 9 10 : : :
Henceforth, when we denote a program by a lowercase (uppercase)
Latin letter, we are referring to the program considered as a natural
number (binary sequence). Next we de ne the complexity of a program
P to be the number of bits in P (i.e. its length). I.e. the complexity
1(p) of a program p is equal to log2(p + 1)], the greatest integer not
greater than the base-2 logarithm of p + 1.
We now introduce the simple-program computers. Computers sim-
ilar to them have been used in Solomono 11], Kolmogorov 12], and
in 13].
De nition 5. A simple-program computer * has the following
property: For any computer 3, there exists a natural number c such
that 1(S )  1(S ) + c for all in nite computable sets S of natural
numbers.
To the extent that it is plausible to consider all computer programs
to be binary sequences, it seems plausible to consider all computers
which are not simple-program computers as unnecessarily awkward de-
generate cases which are unworthy of attention.
Remark 7. Note that if * and 3 are two simple-program com-
puters, then there exists a natural number c which has the following
property: j1(S ) ; 1(S )j  c for all in nite computable sets S of
446 Part VI|Technical Papers on Turing Machines & LISP
natural numbers. In fact we can take
c = max(c c):
Theorem 5. For any connecting function  , there exists a simple-
program computer *
which has the following property: For any  -
connected set S and any in nite computable subset R of S ,
1 (S )  1 (R):
 

Proof. Taking for granted the existence of a simple program com-


puter * (cf. Theorem 9), we construct the computer *
from it as
follows:
8

< *
(( t) = 
>
> * (P 0 t) = * (P t) (1)
: *
(P 1 t) = Tt <t *
(P 1 t0) \ ;(Sn2 (Pt)  ;1(n)):
0 

As * is a simple-program computer, so is *
, for *
(P 0 t) = *(P t).
*
also has the following very important property: For all programs P 0
for which *
(P 0) is a subset of some  -connected set S , *
(P 1) = S .
Moreover, *
(P 1) cannot be a proper subset of any  -connected set.
In summary, given a P such that *
(P ) is a proper subset of a  -
connected set S , then by changing the rightmost bit of P to a 1 we get
a program P 0 with the property that *
(P 0) = S . This implies that for
any in nite computable subset R of a  -connected set S ,
1 (S )  1 (R):
 

Q.E.D.
In view of Remark 7, the following theorem is merely a corollary to
Theorem 5.
Theorem 6. Consider a simple-program computer *. For any
connecting function  , there exists a natural number c
which has the
following property: For any  -connected set S and any in nite com-
putable subset R of S , 1(S )  1(R) + c
. In fact, we can take11
c
= 2 max(c  c ):
 

11 That
c = c  + c 
will do follows upon taking a slightly closer look at the matter.
Computing Innite Sets of Natural Numbers 447
2.B. Speed
This treatment runs parallel to that of subsection 2.A.
De nition 6. A quick-program computer * has the following prop-
erty: For any computer 3, there exists a 1-ary computable function s
such that for all programs p for which 3(p) is de ned, there exists a
program p0 such that *(p0) = 3(p) and
 
3(p t0)  *(p0 t0)
t t0 t s
( )
0
t

for all but a nite number of values of t.


Theorem 7. For any connecting function  , there exists a quick-
program computer *
which has the following property: For any pro-
gram P such that *
(P ) is a proper subset of a  -connected set S , there
exists a program P 0 such that *
(P 0) = S and *
(P t)  *
(P 0 t) for
all t. In fact, P 0 is just P with the 0 at its right end changed to a 1, as
the reader has no doubt guessed.
Proof. Taking for granted the existence of a quick-program com-
puter * (cf. Theorem 9), we construct *
from it exactly as in the
proof of Theorem 5. I.e. *
is de ned, as before, by eqs. (1). The
remainder of the proof parallels the proof of Theorem 5. Q.E.D.
Theorem 7 yields the following corollary in a manner analogous to
the manner in which Theorem 5 yields Theorem 6.
Theorem 8. Consider a quick-program computer *. For any con-
necting function  there exists a 1-ary computable function s
which
has the following property: For any program p such that *(p) is a
subset of a  -connected set S , there exists a program p0 such that
0
 *(p0) = S
*(p t )  *(p0  t0)
t t
0
t s (t)
0


for all but a nite number of values of t. In fact we can take


s
(n) = s (s  (n)):
 

Remark 8. Arbib and Blum 14] base their treatment of program


speed upon the idea that if two computers can imitate act by act the
448 Part VI|Technical Papers on Turing Machines & LISP
computations of the other, and not take too many time units of calcu-
lation to imitate the rst several time units of calculation of the other,
then these computers are essentially equivalent. The idea used to de-
rive Theorem 8 from Theorem 7 is similar: Any two quick-program
computers (and in particular *
and *) can imitate act by act each
other's computations and are thus in a sense equivalent.
In order to clarify the above, let us formally de ne within the frame-
work of Arbib and Blum a concept analogous to that of the quick-
program computer. In what remains of this remark we use the notation
and terminology of Arbib and Blum, not that of this paper. However, in
order to prove that this analogous concept is not vacuous, it is necessary
to make explicit an assumption which is implicit in their framework.
For any machine M there exists a total recursive function m such that
m(i x t) = 2y if and only if Mi (x) = y and .Mi (x) = t.
De nition AB. A quick-program machine M is a machine with
the following property. Consider any machine N . There exists a to-
tal recursive function fNM increasing in both its variables such that
N (f ) M  i.e. M is at least as complex as N (modulo (fNM )). Here,
a two-variable function enclosed in parentheses denotes the monoid with
NM

multiplication and identity e(x y) = y, which is generated by the


function.
Then by (ii) of Theorem 2 14] we have
Theorem AB. Consider two quick-program machines M and N .
There exists a total recursive function gNM increasing in both of its
variables such that N
(g ) M  i.e. N and M are (gNM )-equivalent.
Remark 9. In an eort to make this subsection more comprehen-
NM

sible, we now cast it into the framework of lattice theory, cf. Birkho
15].
De nition L1. Let *1 and *2 be computing machines. *1 im *2
(*1 can be imitated by *2) if and only if there exists a 1-ary computable
function f which has the following property: For any program p for
which *1(p) is de ned, there exists a program p0 such that *2(p0) =
*1(p) and  
*1(p t0)  *2(p0 t0)
t t
0
t f (t)
0

for all but a nite number of values of t.


Lemma L1. The binary relation im is reexive and transitive.
Computing Innite Sets of Natural Numbers 449
De nition L2. Let *1 and *2 be computing machines. *1 eq *2
if and only if *1 im *2 and *2 im *1.
Lemma L2. The binary relation eq is an equivalence relation.
De nition L3. L is the set of equivalence classes induced by the
equivalence relation eq. For any computer *, (*) is the equivalence
class of *, i.e. the set of all computers *0 such that *0 eq *. For any
(*1) (*2) 2 L, (*1)  (*2) if and only if *1 im *2.
Lemma L3. L is partially ordered by the binary relation .
Lemma L4. Consider a computer which cannot be programmed
to compute any in nite set of natural numbers, e.g. the computer *0
de ned by *0(p t) = . Denote by 0 the equivalence class of this
computer i.e. denote by 0 the computers which compute no in nite
sets of natural numbers. 0 bounds L from below i.e. 0  A for all
A 2 L.
Lemma L5. Consider a quick-program computer, e.g. the com-
puter * of Theorem 9. Denote by 1 the equivalence class of this
computer i.e. denote by 1 the quick-program computers. 1 bounds L
from above i.e. A  1 for all A 2 L.
Lemma L6. Let *1 and *2 be computers. De ne the computer
*3 as follows: *3(( t) = , *3(P 0 t) = *1(P t), *3(P 1 t) = *2(P t).
(*3) is the l.u.b. of (*1) and (*2).
Lemma L7. Let *1 and *2 be computers. De ne the computer
*3 as follows: Consider the sets

S1 = *1(K (p) t0)
t t
0


S2 = *2(L(p) t0)
t t
0

where (K (p) L(p)) is the pth ordered pair in an eective enumeration


of the ordered pairs of natural numbers (cf. Davis 3, pp. 43{45]). If *1
and *2 output in size order and without repetitions the elements of,
respectively, S1 and S2, and S1  S2 or S2  S1, then

*3(p t) = S1 \ S2 \ *3(p t0):
t <t
0

Otherwise, *3(p t) = . (*3) is the g.l.b. of (*1) and (*2).


450 Part VI|Technical Papers on Turing Machines & LISP
Theorem L. L is a denumerable, distributive lattice with zero el-
ement and one element.
We may describe the g.l.b. and l.u.b. operations of this lattice as
follows. The l.u.b. of two computers is the slowest computer which is
faster than both of them, and the g.l.b. of two computers is the fastest
computer which is slower than both of them.

3. A Simple, Quick-Program Computer


This section is the culmination of this paper. A computer is constructed
which is both a simple-program computer and a quick-program com-
puter.
If it is believed that programs are essentially binary sequences and
that the only natural measure of the complexity of a program consid-
ered as a binary sequence is its length, then apparently the conclusion
would have to be drawn that only simple, quick-program computers are
worthy of attention, all other computers being degenerate cases.
It would seem to follow that the connected sets indeed corroborate
this paper's thesis. For there is a simple, quick-program computer
which best represents mathematically the possibilities open to number
theorists in their attempts to calculate large primes. We do not know
which it may happen to be, but we do know (cf. Remark 6) that there
are connected sets which are very complex and which must be computed
very slowly when one is using this computer. In view of Theorems 6
and 8 it would seem to be appropriate to consider these connected sets
to be perfect sets. Thus our quest for perfect sets comes to a close.
Theorem 9. There exists a simple, quick-program computer,
namely *.
Proof. We take it for granted that there is a computer *$ which
can compute every 2-ary computable function f in the following sense:
There exists a binary sequence Pf and a 2-ary computable function #f
increasing in its second argument such that
ff (n m)g = *$ (B (n)Pf  #f (n m))

for all natural numbers n and m. Moreover, *$ (B (n)Pf  t) is nonempty


only if there exists an m such that t = #f (n m). Here B is the function
Computing Innite Sets of Natural Numbers 451
carrying each natural number into its associated binary sequence, as in
Section 2.
From *$ we now construct the computer *: n 2 *(p t) if and
only if *$ (p t) has only a single element, this element is not zero, and
the nth prime divides it.2
We now verify that * is a simple, quick-program computer. Con-
sider a computer 3. We give the natural number c  explicitly: c 
 

is the length of P . We also give the 1-ary computable function s  

explicitly:
s (n) = max
 # (k n):
kn
Here  is, of course, the 2-ary computable function which de nes the
computer 3 as in De nition 1. Q.E.D.

Appendix A. A Turing Machine


The contents of this appendix have yet to be tted into the general
framework which we attempted to develop in Sections 1{3.
De nition A. , is a Turing machine. ,'s \tape" is a quarter-plane
or quadrant divided into squares. It has a single scanner which scans
one of the squares. If the scanner runs o the quadrant, , halts. , can
perform any one of the following operations: quadrant one square left
(L), right (R), up (U), or down (D) or the scanner can overprint a 0
(0), a 1 (1), or erase (E) the square of the quadrant being scanned. The
programs of , are tables with three columns headed \blank," \0," and
\1," and consecutively numbered rows. Each place in the table must
have an ordered pair the rst member of the pair gives the operation
to be performed and the second member gives the number of the next
row of the table to be obeyed. As program complexity measure $,
we take the number of rows in the program's table. One operation
(L, R, U, D, 0, 1, or E) is performed per unit time. The computing
machine , begins calculating with its scanner on the corner square,
with the quadrant completely erased, and obeying the last row of its
program's table. The Turing machine outputs a natural number n when
the binary sequence which represents n in base-2 notation appears at
the bottom of the quadrant, starting in the corner square, ending in
452 Part VI|Technical Papers on Turing Machines & LISP
the square being scanned, and with , obeying the next to last row of
its program's table.
Theorem A1. For any connecting function  there exists a natural
number c
and a 1-ary computable function s
which have the following
property: For any program p for which ,(p) is a subset of a  -connected
set S , there is a program p0 such that
(a) ,(p0) = S , $(p0) = $(p) + c
12 and
(b) for all natural numbers t,
 
,(p t0)  ,(p0 t0)
t t
0
t t+s (n)
0


where n stands for the largest element of the left-hand side of the
relation, if this set is not empty (otherwise, n stands for 0).
Proof. p0 is obtained from p in the following manner. c
rows are
added to the table de ning the program p. All transfers to the next
to the last row in the program p are replaced by transfers to the rst
row of the added section. The new rows of the table use the program
p as a subroutine. They make the program p think that it is working
as usual, but actually p is using neither the quadrant's three edge rows
nor the three edge columns p has been fooled into thinking that these
squares do not exist because the new rows moved the scanner to the
fourth square on the diagonal of the quadrant before turning control
over to p for the rst time by transferring to the last row of p. This
protected region is used by the new rows to do its scratch-work, and
also to keep permanent records of all natural numbers which it causes
, to output.
Every time the subroutine thinks it is making , output a natural
number n, it actually only passes n and control to the new rows. These
proceed to nd out which natural numbers are in ;( ;1(n)). Then
12 This implies
 (S)   (&(p)) + c :
I.e.
 (S)   (R) + c
for any innite computable subset R of the -connected set S.
Computing Innite Sets of Natural Numbers 453
the new rows eliminate those elements of ;( ;1 (n)) which , put out
previously. Finally, they make , output those elements which remain,
move the scanner back to what the subroutine last thought was its
position, and return control to the subroutine. Q.E.D.
Remark A. Assuming that only the computer , and program
complexity measure $ are of interest, it appears that we have before
us some connected sets which are in a very strong sense perfect sets.
For, as was mentioned in Remark 6, there are  -connected sets which
, must compute very slowly. For such sets, the term s
(n) in (b) above
is negligible compared with t.
Theorem A2. Consider a simple-program computer * and the
program complexity measure 1(p) = log2(p + 1)]. Let S0 S1 S2 : : :
be a sequence of distinct in nite computable sets of natural numbers.
Then we may conclude that
lim 1(Sk )
k!1 2$ (Sk ) log2 $ (Sk )

exists and is in fact unity.


Proof. Apply the technique of 16, Pt. 1].
Of course,
Theorem A3. , is a quick-program computer.

Appendix B. A Lattice of Computer Spe-


eds
The purpose of this appendix is to study L, the lattice of speeds of
computers which calculate all in nite computable sets of natural num-
bers. L is a sublattice (in fact, a lter) of the lattice L of Remark 9.
It will be shown that L has a rich structure: every countable partially
ordered set is imbeddable in L.13 Thus to require a computer to be
a quick-program computer is more than to require that it be able to
compute all in nite computable sets of natural numbers.
13An analogous result is due to Sacks 17, p. 53]. If P is a countable partially
ordered set, then P is imbeddable in the upper semilattice of degrees of recursively
enumerable sets. Cf. also Sacks 17, p. 21].
454 Part VI|Technical Papers on Turing Machines & LISP
De nition B1. L is the sublattice of L consisting of the (*) such
that * can be programmed to compute all in nite computable sets of
natural numbers.
In several respects the following theorem is quite similar to Theorem
9 of Hartmanis and Stearns 18] and to Theorem 8 of Blum 19]. The
diagonal process of the proof of Theorem 3 is built into a computer's
circuits.
Theorem B1. There exists a quick-program computer *1 with
the property that for any 1-ary computable function f and any in nite
computable set U of natural numbers, there exists a 1-ary computable
function g and an in nite computable set S of natural numbers such
that
(a) S  U ,
(b) g(n) > f (n) for all but a nite number of values of n,
(c) there exists a program p such that *1(p) = S and n 2 *1(p g(n)+
1) for all n 2 S ,
(d) for all programs p0 such that *1(p0)  S , n 2 *1(p0 t) only if
t > g(n), with the possible exception of a nite number of values
of n.
Proof. Let * be a quick-program computer. We construct *1 from
it. *1(( t) = , *1(P 0 t) = *(P t), *1(P 1 0) = , and *1(P 1 t + 1)
is a subset of *(P t). For each element n# of *(P t), it is determined
in the following manner whether or not n# 2 *1(P 1 t + 1). De ne m,
nk (0  k  m), m0, and tk (0  k  m) as follows:

*(P t0) = fn0 n1 n2 : : : nm g (n0 < n1 < n2    < nm )
t t
0

n# = nm  0

nk 2 *(P tk ) (0  k  m):
De ne A(i j ) (the predicate \the program j is eliminated during the
ith stage"),14 A (the set of programs eliminated before the m0 th stage),
14During the ith stage of this diagonal process it is decided whether or not the
ith element of '(P) is in '1(P1).
Computing Innite Sets of Natural Numbers 455
and A0 (the set of programs eliminated before or during the m0th stage)
as follows: 
A(i j ) i j < i=4] and ni 2 *1(j t0)
t t
0
i

A = fj jA(i j ) for some i < m0g


A0 = fj jA(i j ) for some i  m0g:
n# 2 *1(P 1 t + 1) i A0 = A.
That the above indeed constructs *1 follows from the fact that each
of the tk is less than t +1, and thus *1(P 1 t +1) is de ned only in terms
of *1(p0 t0), for which t0 is less than t + 1. I.e. that *1(p t) is de ned
follows by induction on t. Also, *1 is a quick-program computer, for
*1(P 0 t) = *(P t).
We now de ne the function g and the set S , whose existence is
asserted by the theorem. By one of the extensions of Theorem 3, there
exists a program P which has the following properties:
1. *(P )  U .
2. For all but a nite number of values of n, n 2 *(P t) only if
t > f (n).
S = *1(P 1). That S is in nite follows from the fact that at most k=4
of the rst k elements of *(P ) fail to be in S . g(n) is de ned for all
n 2 *(P ) by n 2 *(P g(n)). It is irrelevant how g(n) is de ned for
n 62 *(P ), as long as g(n) > f (n).
Part (a) of the conclusion follows from the fact that *1(P 1 t + 1) 
*(P t)  U for all t. Part (c) follows from the fact that if n 2 *1(P 1),
then
n 2 *1(P 1 g(n) + 1):
Part (d) follows from the fact that if *1(p0) is de ned and n is the rst
element of *1(p0) \ *(p) which is greater than or equal to the (4p0 + 4)-
th element15 of *(P ) and which is contained in a *1(p0 t) such that
t  g(n), then n is not an element of S . Q.E.D.
Corollary B1. On the hypothesis of Theorem B1, not only do the
g and S whose existence is asserted have the properties (a) to (d), they
15 I.e. it is greater than or equal to n4p +4 .
0
456 Part VI|Technical Papers on Turing Machines & LISP
also, as follows immediately from (c) and (d), have the property that
for any quick-program computer *:
(e) There exists a program p2 such that *(p2) = S and n 2 *(p2  t)
with t  s1 (g(n) + 1) for all but a nite number of values of
n 2 S.
(f) For all programs p3 such that *(p3 )  S , n 2 *(p3 t) only if
s1 (t) > g(n), with the possible exception of a nite number of
values of n.
Remark B. Theorem B1 is a \no speed-up" theorem i.e. it con-
trasts with Blum's speed-up theorem (cf. 5, 6, 19]). Each S whose
existence is asserted by Theorem B1 has a program for *1 to compute
it which is as fast as possible. I.e. no other program for *1 to compute
S can output more than a nite number of elements more quickly. Thus
it is not possible to speed up every program for computing S by the
computer *1. And, as is pointed out by Corollary B1, this also holds
for any other quick-program computer, but with the slight \fogginess"
that always results in passing from a statement about one particular
quick-program computer to a statement about another.
De nition B2. Let S be a computable set of natural numbers and
let * be a computer. *S denotes the computer which can compute only
subsets of S , but which is otherwise identical to *. I.e.
( S *(p t0)  S
S
* (p t) = *( p t)  if
0
t t
otherwise.
Theorem B2. There is a computer *0 such that (*0) 2 L and
(*0) < 1. Moreover, for any computable sets T and R of natural
numbers,
(a) if T ; R and R ; T are both in nite, then l.u.b. (*0) (*T1 ) and
l.u.b. (*0) (*R1 ) are incomparable members of L, and
(b) if T  R and R ; T is in nite, then the rst of these two members
of L is less than the second.
Computing Innite Sets of Natural Numbers 457
Proof. *0 is constructed from the computer *1 of Theorem B1 as
follows. n 2 *0(p t) if and only if there exist t0 and t00 with max(t0 t00) =
t such that
n 2 *1(p t0) st 2 *1(p t00)
0

where

*1(p t3) = fs0 s1 s2 : : :g (s0 < s1 < s2 <   )
t3 t
and for no n1  n2, t1 < t2  t is it simultaneously the case that
n1 2 *1(p t1) and n2 2 *1(p t2). Note that for all p, *1(p) = *0(p),
both sides of the equation being unde ned if one of them is.
Sets S whose existence is asserted by Theorem B1 which must be
computed very slowly by *1 must be computed very much more slowly
indeed by *0, and thus *1 im *0 cannot be the case. Moreover, within
any in nite computable set U of natural numbers, there are such sets
S.
We now show in greater detail that (*0) < (*1) = 1 by a reductio
ad absurdum of *1 im *0. Suppose *1 im *0. Then by de nition
there exists a 1-ary computable function h such that for any program p
for which *1(p) is de ned, there exists a program p0 such that *0(p0) =
*1(p) and  
*1(p t0)  *0(p0 t0)
t t
0
t h(t)
0

for all but a nite number of values of t.


In Theorem B1 we now take f (n) = max(n maxkn h(k)). We
obtain g and S satisfying
1. g(n) > n,
2. g(n) > maxkn h(k) for all but nitely many n. From (1) it follows
that for all but a nite number of values of n 2 S , the fastest
program for *1 to compute S outputs n at time t0 = g(n) + 1,
while the fastest program for *0 to compute S outputs n at time
t00 = g(sg(n)+1 ) + 1 = g(st ) + 1 > st + 1  t0 + 1:
0 0

Here, as usual, S = fs0 s1 s2 : : :g (s0 < s1 < s2 <   ). Note
that sk , the kth element of S , must be greater than or equal to
k:
458 Part VI|Technical Papers on Turing Machines & LISP
3. sk  k.
By the de nition of (*1) im (*0) we must have
h(g(n) + 1)  g(sg(n)+1) + 1
for all but nitely many n 2 S . By (2) this implies
h(g(n) + 1) > kmax
s
h(k) + 1
( )+1
g n

hence g(n) + 1 > sg(n)+1 for all but nitely many n 2 S . Invoking (3)
we obtain g(n) + 1 > sg(n)+1  g(n) + 1, which is impossible. Q.E.D.
A slightly dierent way of obtaining the following theorem was an-
nounced in 20].
Theorem B3. Any countable partially ordered set is order-
isomorphic with a subset of L. That is, L is a \universal" countable
partially ordered set.
Proof. We show that an example of a universal partially ordered set
is C , the computable sets of natural numbers ordered by set inclusion.
Thus the theorem is established if we can nd in L an isomorphic image
of C . This isomorphic image is obtained in the following manner. Let
S be a computable set of natural numbers. Let S 0 be the set of all odd
multiples of 2n , where n ranges over all elements of S . The isomorphic
image of the element S of C is the element l.u.b. (*0) (*S1 ) of L. Here
0

*0 is the computer of Theorem B2, *1 is the computer of Theorem B1,


and \*S1 " is written in accordance with the notational convention of
0

De nition B2.
It only remains to prove that C is a universal partially ordered set.
Sacks 17, p. 53] attributes to Mostowski 21] the following result: There
is a universal countable partially ordered set A = fa0 a1 a2 : : :g with
the property that the predicate an  am is computable. We nish the
proof by constructing in C an isomorphic image A0 = fA0 A1 A2 : : :g
of A as follows:
Ai = fkjak  aig:
It is easy to see that Ai  Aj if and only if ai  aj .] Q.E.D.
Corollary B2. L has exactly @0 elements.
Computing Innite Sets of Natural Numbers 459
References
1] Hardy, G. H., and Wright, E. M. An Introduction to the
Theory of Numbers. Clarendon Press, Oxford, 1962.
2] Dantzig, T. Number, the Language of Science. Macmillan, New
York, 1954.
3] Davis, M. Computability and Unsolvability. McGraw-Hill, New
York, 1958.
4] Gelfond, A. O., and Linnik, Yu. V. Elementary Methods
in Analytic Number Theory. Rand McNally, Chicago, 1965.
5] Blum, M. Measures on the computation speed of partial recur-
sive functions. Quart. Prog. Rep. 72, Res. Lab. Electronics, MIT,
Cambridge, Mass., Jan. 1964, pp. 237{253.
6] Arbib, M. A. Speed-up theorems and incompleteness theorems.
In Automata Theory, E. R. Cainiello (Ed.), Academic Press, New
York, 1966, pp. 6{24.
7] Fraenkel, A. A. Abstract Set Theory. North-Holland, Amster-
dam, The Netherlands, 1961.
8] Borel, E . Le"cons sur la Theorie des Fonctions. Gauthier-
Villars, Paris, 1914.
9] Hardy, G. H. Orders of Innity. Cambridge Math. Tracts, No.
12, U. of Cambridge, Cambridge, Eng., 1924.
10] Dekker, J. C. E., and Myhill, J. Retraceable sets. Canadian
J. Math. 10 (1958), 357{373.
11] Solomonoff, R. J. A formal theory of inductive inference, Pt.
I. Inform. Contr. 7 (1964), 1{22.
12] Kolmogorov, A. N. Three approaches to the de nition of the
concept \amount of information." Problemy Peredachi Informat-
sii 1 (1965), 3{11. (Russian)
460 Part VI|Technical Papers on Turing Machines & LISP
13] Chaitin, G. J. On the length of programs for computing nite
binary sequences: statistical considerations. J. ACM 16, 1 (Jan.
1969), 145{159.
14] Arbib, M. A., and Blum, M. Machine dependence of degrees
of diculty. Proc. Amer. Math. Soc. 16 (1965), 442{447.
15] Birkhoff, G. Lattice Theory. Amer. Math. Soc. Colloq. Publ.
Vol. 25, Amer. Math. Soc., Providence, R. I., 1967.
16] Chaitin, G. J. On the length of programs for computing nite
binary sequences. J. ACM 13, 4 (Oct. 1966), 547{569.
17] Sacks, G. E. Degrees of Unsolvability. No. 55, Annals of Math.
Studies, Princeton U. Press, Princeton, N. J., 1963.
18] Hartmanis, J., and Stearns, R. E. On the computational
complexity of algorithms. Trans. Amer. Math. Soc. 117 (1965),
285{306.
19] Blum, M. A machine-independent theory of the complexity of
recursive functions. J. ACM 14, 2 (Apr. 1967), 322{336.
20] Chaitin, G. J. A lattice of computer speeds. Abstract 67T-397,
Notices Amer. Math. Soc. 14 (1967), 538.
21] Mostowski, A. U ber gewisse universelle Relationen. Ann. Soc.
Polon. Math. 17 (1938), 117{118.
22] Blum, M. On the size of machines. Inform. Contr. 11 (1967),
257{265.
23] Chaitin, G. J. On the diculty of computations. Panamerican
Symp. of Appl. Math., Buenos Aires, Argentina, Aug. 10, 1968.
(to be published)

Received October, 1966 Revised December, 1968


Part VII
Abstracts

461
ON THE LENGTH OF
PROGRAMS FOR
COMPUTING FINITE
BINARY SEQUENCES BY
BOUNDED-TRANSFER
TURING MACHINES
AMS Notices 13 (1966), p. 133

Abstract 66T{26. G. J. Chaitin, The City College of the City Univer-


sity of New York, 819 Madison Avenue, New York, New York 10021. On
the length of programs for computing nite binary sequences by bounded-
transfer Turing machines. Preliminary report.
Consider Turing machines with one-way in nite tapes, n numbered
internal states, the tape symbols blank, 0, and 1, starting in state 1 and
halting in state n, and in which the number j of the next internal state
after being in state i satis es ji;j j  b. b can be chosen suciently large
that any eectively computable in nite binary sequence is computable
by such a machine. Such a Turing machine is said to compute a nite
binary sequence S if starting with its tape blank and scanning the end
square of the tape, it nally halts with S written at the end of the tape,

463
464 Part VII|Abstracts
with the rest of the tape blank, and scanning the rst blank square of
the tape. De ne L(S ) for any nite binary sequence S by: A Turing
machine with n internal states can be programmed to compute S if and
only if n  L(S ). De ne L(Cn) by L(Cn) = max L(S ), where S is any
binary sequence of length n. Let Cn be the set of all binary sequences
of length n satisfying L(S ) = L(Cn).
Then
(1) L(Cn ) ! an:
(2) There exists a constant c such that for all m and n, those binary
sequences S of length n satisfying
L(S ) < L(Cn ) ; log2 n] ; m ; c
are less than 2n;m in number.
(3) For any e > 0 and d > 1, for all n suciently large, if S is a
binary sequence of length n such that the ratio of the number of 0's in
S to n diers from 12 by more than e, then
L(S ) < L(CndH ( 21 +e 12 ;e)]):
Here
H (p q) = ;p log2 p ; q log2 q:
We propose also that elements of Cn be considered the most pattern-
less or random binary sequences of length n. This leads to a de nition
and theory of randomness related to the R. von Mises{A. Wald{A.
Church theory, but in accord with some criticisms of K. R. Popper.
(Received October 19, 1965.)
ON THE LENGTH OF
PROGRAMS FOR
COMPUTING FINITE
BINARY SEQUENCES BY
BOUNDED-TRANSFER
TURING MACHINES II
AMS Notices 13 (1966), pp. 228{229

Abstract 631{6. G. J. Chaitin, 819 Madison Avenue, New York,


New York 10021. On the length of programs for computing nite binary
sequences by bounded-transfer Turing machines. II.
Refer to Abstract 66T{26, these Notices 13 (1966), 133. There
it is proposed that elements of Cn may be considered patternless or
random. This is applied some properties of the L function are derived
by using what may be termed the simple normality (Borel) of these
binary sequences.
Note that
(4) L(S S 0)  L(S ) + L(S 0)
where S and S 0 are nite binary sequences and is the concatenation

465
466 Part VII|Abstracts
operation. Hence
L(Cn+m )  L(Cn ) + L(Cm):
With (1) this subadditivity property yields
(5) an  L(Cn)
(actually, subadditivity is used in the proof of (1)). Also,
(6) for any natural number k, if an element of Cn is partitioned
into successive subsequences of length k, then each of the 2k possible
subsequences will occur ! 2;k (n=k) times.
(6) follows from (1) and a generalization of (3). (4), (5) and (6) give
immediately X
(7) an  2;n L(S )
where the summation is over binary sequences S of length n. Denote
the binary sequence of length n consisting entirely of zeros by 0n . As
L(0n ) = O(log n), for n suciently large
X
L(Cn ) > 2;n L(S )  an
or
(8) an < L(Cn):
For each k it follows from (4) and (6) that for s suciently large
L(Cs) = L(S ) = L(S 0 0k S 00)
where
S 0 0k S 00 = S 2 Cs
so that
L(Cs)  L(Cn) + L(Cm ) + L(0k ) (n + m = s ; k):
This last inequality yields
(9) (L(Cn ) ; an) is unbounded.
(Received January 6, 1966.)
COMPUTATIONAL
COMPLEXITY AND
GO DEL'S
INCOMPLETENESS
THEOREM
AMS Notices 17 (1970), p. 672

Abstract 70T{E35. Gregory J. Chaitin, Mario Bravo 249, Buenos


Aires, Argentina. Computational complexity and Godel's incomplete-
ness theorem. Preliminary report.
Given any simply consistent formal theory F of the state complexity
L(S ) of nite binary sequences S as computed by 3-tape-symbol Turing
machines, there exists a natural number L(F ) such that L(S ) > n is
provable in F only if n < L(F ). On the other hand, almost all nite
binary sequences S satisfy L(S ) > L(F ). The proof resembles Berry's
paradox, not the Epimenides nor Richard paradoxes.
(Received April 6, 1970.)

467
468 Part VII|Abstracts
COMPUTATIONAL
COMPLEXITY AND
GO DEL'S
INCOMPLETENESS
THEOREM
ACM SIGACT News, No. 9
(April 1971), pp. 11{12

G. J. Chaitin
IBM World Trade, Buenos Aires

Abstract
Given any simply consistent formal theory F of the state complexity
L(S ) of nite binary sequences S as computed by 3-tape-symbol Turing
machines, there exists a natural number L(F ) such that L(S ) > n is
provable in F only if n < L(F ). On the other hand, almost all nite
binary sequences S satisfy L(S ) > L(F ). The proof resembles Berry's
paradox, not the Epimenides nor Richard paradoxes.

469
470 Part VII|Abstracts

Computational complexity has many points of view, and many points


of contact with other elds. The purpose of this note is to show that a
strong version of G odel's classical incompleteness theorem follows very
naturally if one considers the limitations of formal theories of compu-
tational complexity.
The state complexity L(S ) of a nite binary sequence S as com-
puted by 3-tape-symbol Turing machines is de ned to be the number
of states that a 3-tape-symbol Turing machine must have in order to
compute S . This concept is a variant of the descriptive or information
complexity. Note that there are (6n)3n n-state 3-tape-symbol Turing
machines. (The 6 is because there are six operations: tape left, tape
right, halt, write 0, write 1, write blank.) Thus only nitely many -
nite binary sequences S have a given state complexity n, that is, satisfy
L(S ) = n.
Any simply consistent formal theory F of the state complexity of
nite binary sequences will have the property that L(S ) > n is provable
only if true, unless the methods of deduction of the theory are extremely
weak. For if L(S ) > n isn't true then there is an n-state 3-tape-symbol
Turing machine that computes S , and as this computation is nite, by
carrying it out step by step in F it can be proved that it works, and
thus that L(S )  n.
Suppose that there is at least one nite binary sequence S such that
L(S ) > n is a theorem of F . Then there is a (blog2 nc + 1 + cF )-state
3-tape-symbol Turing machine that computes a nite binary sequence
S satisfying L(S ) > n. Here cF is independent of n and depends only
on F . How is the Turing machine constructed? Its rst blog2 nc + 1
states write the number n in binary notation on the Turing machine's
tape. The remaining cF states then do the following. By checking in
order each nite string of letters in the alphabet of the formal theory
F (the machine codes the alphabet in binary) to see if it is a proof,
the machine generates each theorem provable in F . As each theorem is
produced it is checked to see if it is of the form L(S ) > n. The rst such
theorem encountered provides the nite binary sequence S computed
by the Turing machine.
Thus we have shown that if there were nite binary sequences which
Computational Complexity and Godel's Incompleteness Theorem 471
in F can be shown to be of state complexity greater than n, then there
would be a (blog2 nc +1+ cF )-state 3-tape-symbol Turing machine that
computes a nite binary sequence S satisfying L(S ) > n. In other
words, we would have
n < L(S )  blog2 nc + 1 + cF
which implies
n < blog2 nc + 1 + cF :
As this is impossible for
n  L(F ) cF + log2 cF 
we conclude that L(S ) > n can be proved in F only if n < L(F ).
Q.E.D.1
Why does this resemble Berry's paradox of \the least natural num-
ber not nameable in fewer than 10000000 characters"? Because it may
be paraphrased as follows. \The nite binary sequence S with the rst
proof that S cannot be described by a Turing machine with n states or
less" is a (log2 n + cF )-state description of S .
As a nal comment, it should be mentioned that an incomplete-
ness theorem may also be obtained by considering the time complexity
of in nite computations, instead of the descriptive complexity of -
nite computations. But this is much less interesting, as the resulting
proof is, essentially, just one of the classical proofs resembling Richard's
paradox, and requires that !-consistency be hypothesized.

Brief Bibliography of Godel's Theorem


Davis, M. (Ed.) The Undecidable, Raven Press, 1965.
Weyl, H. Philosophy of Mathematics and Natural Science,
Princeton University Press, 1949, pp. 219{236.
Kleene, S. C. Introduction to Metamathematics,
Van Nostrand, 1950, pp. 204{213.
1 I couldn't verify the original argument, but I got n < L(F) = 2cF + 2, by
rather loose arguments. Readers' comments will be passed on to Mr. Chaitin.|ed.]
472 Part VII|Abstracts
Turing, A. M. \Solvable and Unsolvable Problems,"
in Science News, No. 31, Penguin Books, 1954, pp. 7{23.
Nagel, E., Newman, J. R. Godel's Proof,
New York University Press, 1958.
Davis, M. Computability and Unsolvability,
McGraw-Hill, 1958, pp. 120{129.
Cohen, P. J. Set Theory and the Continuum Hypothesis,
Benjamin, 1966, pp. 39{46.
INFORMATION-
THEORETIC ASPECTS OF
THE TURING DEGREES
AMS Notices 19 (1972), pp. A{601, A{602

Abstract 72T{E77. Gregory J. Chaitin, Mario Bravo 249, Buenos


Aires, Argentina. Information-theoretic aspects of the Turing degrees.
Preliminary report.
Use is made of the concept of the relative complexity of a nite
binary string in one or more in nite binary strings. An in nite binary
string is recursive in another i 9c 8n the relative complexity of its
initial segment of length n is less than c + log2 n. With positive prob-
ability, an in nite binary string has the property that the complexity
of its initial segment of length n relative to the rest of the string is
asymptotic to n. One such string R recursive in 0 is de ned. This
in nite string R is separated into @0 independent in nite strings, i.e.
the complexity of the initial segment of length n of any of these strings
relative to all the rest of these strings is asymptotic to n. By joining
these independent in nite strings one obtains Turing degrees greater
than 0 and less than 00 with any denumerable partial order. From
the fact that R is recursive in 0 it follows that there is a recursive
predicate P such that asymptotically n bits of axioms are needed to
determine which of the following n propositions are true and which are

473
474 Part VII|Abstracts
false: 9x 8y P (x y a) (a < n).
(Received June 19, 1972.)
INFORMATION-
THEORETIC ASPECTS OF
POST'S CONSTRUCTION
OF A SIMPLE SET
AMS Notices 19 (1972), p. A{712

Abstract 72T{E85. Gregory J. Chaitin, Mario Bravo 249, Buenos


Aires, Argentina. Information-theoretic aspects of Post's construction
of a simple set. Preliminary report.
The complexity of a nite binary string is taken to be the number of
bits in the shortest program for computing it on the standard universal
computer. De ne as follows two functions P and Q from the natural
numbers into the sets of nite binary strings. P (n) is the set containing
the rst string output by any program such that the length of the string
is greater than n + the length of the program. Q(n) is the set of the
nite binary strings S such that (n + the complexity of S ) is less than
the length of S . P (n) is a version of Post's original construction of a
simple set. Thus for all n, P (n) is a simple set. Q(n) is based entirely
on information-theoretic concepts.
Theorem. There is a c such that for all n, P (n + c) is contained in
Q(n), and Q(n) is contained in P (n).
Corollary. For all n, Q(n) is a simple set.

475
476 Part VII|Abstracts
(Received June 19, 1972.)
ON THE DIFFICULTY OF
GENERATING ALL
BINARY STRINGS OF
COMPLEXITY LESS
THAN N
AMS Notices 19 (1972), p. A{764

Abstract 72T{E101. Gregory J. Chaitin, Mario Bravo 249, Buenos


Aires, Argentina. On the di!culty of generating all binary strings of
complexity less than n. Preliminary report.
Complexity is taken in the information-theoretic sense, i.e. the com-
plexity of something is the number of bits in the shortest program for
computing it on the standard universal computer.
Let
(n) = min max(the length of P in bits, the time it takes P to halt)
where the minimum is taken over all programs P whose output is the
set of all binary strings of complexity less than n.
Let
(n) = max f (n)

477
478 Part VII|Abstracts
where the maximum is taken over all number-theoretic functions f of
complexity less than n.
Let X
 (n) = (the length of S )
where the sum is taken over all binary strings S of complexity less than
n.
Take f g to mean that there are c and c0 such that for all n,
f (n)  g(n + c) and g(n)  f (n + c0).
Theorem.  .
(Received June 19, 1972.)
ON THE GREATEST
NATURAL NUMBER OF
DEFINITIONAL OR
INFORMATION
COMPLEXITY  N
Recursive Function Theory: Newsletter,
No. 4 (Jan. 1973), pp. 11{13

The growth of this number a(n) as a function of n serves to measure a


number of very general phenomena.
For example, consider the time t(n) it takes that program with not
more than n bits to halt that takes the longest time to halt. This grows
with n approximately in the same way as a(n). More exactly, for all n,
a(n)  t(n + c) and t(n)  a(n + c0).
Consider those programs that halt and whose output is the set S (n)
of all binary strings of complexity not greater than n. Any program
that halts and whose output includes S (n), must either have more than
a(n ; c) bits, or must take a time to halt exceeding a(n ; c). Both
extremes are possible: few bits of program and very long running time,
or vice versa. Thus those programs with about n bits which halt and
whose output set is precisely S (n) are among the programs of length n
479
480 Part VII|Abstracts
that take most time to halt.
Or consider a program that outputs the r.e. but not recursive set
of all programs that halt. The time it takes this program to output all
programs of length not greater than n that halt, grows with n approx-
imately like a(n).
Or consider the set P having a binary string i the string's infor-
mation or de nitional complexity is less than its length. P is \simple",
that is, P is r.e. and its complement with respect to the set of all bi-
nary strings is in nite and contains no in nite r.e. subset. In fact, P is
closely related to Post's original construction of a simple set. The time
that it takes a program that outputs P to output all P 's elements of
length not greater than n, grows with n approximately like a(n).
Each of these results can be interpreted as the precise measure of
a limitation of formal systems. For example, a formal system can be
suciently powerful for it to be possible to prove within it that each
program that halts in fact does so. Suppose that it is only possible to
prove that a program halts if this is true. Then the maximum length of
the proofs needed to establish that each program of length not greater
than n that halts in fact does so, grows with n in approximately the
same manner as a(n).
G. J. Chaitin, Mario Bravo 249, Buenos Aires, Argentina]
A NECESSARY AND
SUFFICIENT CONDITION
FOR AN INFINITE
BINARY STRING TO BE
RECURSIVE
Recursive Function Theory: Newsletter,
No. 4 (Jan. 1973), p. 13

Loveland and Meyer1 have provided a necessary and sucient condi-


tion for an in nite binary string to be recursive, in terms of the relative
information or de nitional complexity of its initial segments of length
n, given n. In their notation, x is an in nite binary string for which
there exists a constant c > 0 such that K (xn =n)  c for all n, i x
is recursive. Based on this result and other considerations we provide
a necessary and sucient condition using the absolute complexity of
the initial segments, instead of the conditional complexity. An in nite
binary string x is recursive i there exists a constant c such that for all
n the complexity K (xn) of its initial segment of length n is bounded
1 D. W. Loveland, A variant of the Kolmogorov concept of complexity, Report
69-4, Math. Dept., Carnegie-Mellon Univ.

481
482 Part VII|Abstracts
by c + log2 n.
G. J. Chaitin, Mario Bravo 249, Buenos Aires, Argentina]
THERE ARE FEW
MINIMAL DESCRIPTIONS
Recursive Function Theory: Newsletter,
No. 4 (Jan. 1973), p. 14

We are concerned with the descriptive/de nitional/information com-


plexity, i.e. the complexity of something is the number of bits in the
program for calculating it whose size is smallest. Thus the complexity
of something is the number of bits in a minimal (complete) description.
How many dierent programs for calculating something are of nearly
optimal size, i.e. how many minimal or nearly minimal descriptions of
something are there?
We give a bound b(n) on the number of programs for calculating a
nite binary string which are of size not greater than the complexity of
the string +n. I.e. a bound b(n) on the number of dierent descriptions
of a particular string which are within n bits of a minimal description.
The bound is a function of n, i.e. does not depend on the particular
string nor its complexity. The particular b(n) established has the prop-
erty that log2 b(n) is asymptotic to n. An application of this result is
given in the announcement \A necessary and sucient condition for an
in nite binary string to be recursive."
G. J. Chaitin, Mario Bravo 249, Buenos Aires, Argentina]

483
484 Part VII|Abstracts
INFORMATION-
THEORETIC
COMPUTATIONAL
COMPLEXITY
Abstracts of Papers, 1973 IEEE Inter-
national Symposium on Information The-
ory, June 25{29, 1973, King Saul Hotel,
Ashkelon, Israel, IEEE Catalog No. 73
CHO 753{4 IT, p. F1{1

Information-Theoretic Computational Complexity, G. J.


Chaitin (Mario Bravo 249, Buenos Aires, Argentina).
This paper attempts to describe, in non-technical language, some
of the concepts and methods of one school of thought regarding com-
putational complexity. It applies the viewpoint of information theory
to computers. This will rst lead us to a de nition of the degree of
randomness of individual binary strings, and then to an information-
theoretic version of G odel's theorem on the limitations of the axiomatic
method. Finally, we will examine in the light of these ideas the scienti c
method and von Neumann's views on the basic conceptual problems of

485
486 Part VII|Abstracts
biology.
A THEORY OF PROGRAM
SIZE FORMALLY
IDENTICAL TO
INFORMATION THEORY
Abstracts of Papers, 1974 IEEE Interna-
tional Symposium on Information Theory,
October 28{31, 1974, University of Notre
Dame, Notre Dame, Indiana, USA, IEEE
Catalog No. 74 CHO 883{9 IT, p. 2

Monday, October 28, 11:00 AM|Plenary Session A (continued)


A Theory of Program Size Formally Identical to Infor-
mation Theory, Gregory J. Chaitin (IBM World Trade Corporation,
Buenos Aires, Argentina).
A new de nition of the program-size complexity is made.
H (A B=C D)
is de ned to be the size in bits of the shortest self-delimiting pro-
gram for calculating the strings A and B if one is given a minimal-
size self-delimiting program for calculating the strings C and D. This
487
488 Part VII|Abstracts
diers from previous de nitions: (1) programs are required to be self-
delimiting, i.e., no program is a pre x of another, and (2) instead of
being given C and D directly, one is given a program for calculating
them that is minimal in size. Unlike previous de nitions, this one has
precisely the formal properties of the entropy concept of information
theory. For example,
H (A=B ) = H (A B ) ; H (B ) + O(1):
Also, if a program of length k is assigned measure 2;k , then
 !
the probability that the standard
H (A) = ; log2 universal computer will calculate A + O(1):
RECENT WORK ON
ALGORITHMIC
INFORMATION THEORY
Abstracts of Papers, 1977 IEEE Interna-
tional Symposium on Information Theory,
October 10{14, 1977, Cornell University,
Ithaca, New York, USA, IEEE Catalog No.
77 CH 1277{3 IT, p. 129

Recent Work on Algorithmic Information Theory, Gregory


J. Chaitin (IBM Research, Yorktown Heights, NY 10598).
Algorithmic information theory is an attempt to apply information-
theoretic and probabilistic ideas to recursive function theory. Typical
concerns in this approach are, for example, the number of bits of in-
formation required to specify an algorithm, or the probability that a
program whose bits are chosen by coin ipping produces a given output.
During the past few years the de nitions of algorithmic information the-
ory have been reformulated. We review some of the recent work in this
area.

489
490 Part VII|Abstracts
Part VIII
Bibliography

491
PUBLICATIONS OF
G J CHAITIN
1. \On the length of programs for computing nite binary sequences
by bounded-transfer Turing machines," AMS Notices 13 (1966),
p. 133.
2. \On the length of programs for computing nite binary sequences
by bounded-transfer Turing machines II," AMS Notices 13 (1966),
pp. 228{229.
3. \On the length of programs for computing nite binary se-
quences," Journal of the ACM 13 (1966), pp. 547{569.
4. \On the length of programs for computing nite binary sequences:
statistical considerations," Journal of the ACM 16 (1969), pp.
145{159.
5. \On the simplicity and speed of programs for computing in nite
sets of natural numbers," Journal of the ACM 16 (1969), pp.
407{422.
6. \On the diculty of computations," IEEE Transactions on In-
formation Theory IT-16 (1970), pp. 5{9.
7. \To a mathematical de nition of `life'," ACM SICACT News, No.
4 (Jan. 1970), pp. 12{18.
8. \Computational complexity and G odel's incompleteness theo-
rem," AMS Notices 17 (1970), p. 672.

493
494 Part VIII|Bibliography
9. \Computational complexity and G odel's incompleteness theo-
rem," ACM SIGACT News, No. 9 (April 1971), pp. 11{12.
10. \Information-theoretic aspects of the Turing degrees," AMS No-
tices 19 (1972), pp. A-601, A-602.
11. \Information-theoretic aspects of Post's construction of a simple
set," AMS Notices 19 (1972), p. A-712.
12. \On the diculty of generating all binary strings of complexity
less than n" AMS Notices 19 (1972), p. A-764.
13. \On the greatest natural number of de nitional or information
complexity  n" Recursive Function Theory: Newsletter, No. 4
(Jan. 1973), pp. 11{13.
14. \A necessary and sucient condition for an in nite binary string
to be recursive," Recursive Function Theory: Newsletter, No. 4
(Jan. 1973), p. 13.
15. \There are few minimal descriptions," Recursive Function The-
ory: Newsletter, No. 4 (Jan. 1973), p. 14.
16. \Information-theoretic computational complexity," Abstracts of
Papers, 1973 IEEE International Symposium on Information
Theory, p. F1{1.
17. \Information-theoretic computational complexity," IEEE Trans-
actions on Information Theory IT-20 (1974), pp. 10{15.
Reprinted in T. Tymoczko, New Directions in the Philosophy of
Mathematics, Birkh auser, 1986.
18. \Information-theoretic limitations of formal systems," Journal of
the ACM 21 (1974), pp. 403{424.
19. \A theory of program size formally identical to information the-
ory," Abstracts of Papers, 1974 IEEE International Symposium
on Information Theory, p. 2.
20. \Randomness and mathematical proof," Scientic American 232,
No. 5 (May 1975), pp. 47{52.
Publications of G J Chaitin 495
21. \A theory of program size formally identical to information the-
ory," Journal of the ACM 22 (1975), pp. 329{340.
22. \Information-theoretic characterizations of recursive in nite stri-
ngs," Theoretical Computer Science 2 (1976), pp. 45{48.
23. \Algorithmic entropy of sets," Computers & Mathematics with
Applications 2 (1976), pp. 233{245.
24. \Program size, oracles, and the jump operation," Osaka Journal
of Mathematics 14 (1977), pp. 139{149.
25. \Algorithmic information theory," IBM Journal of Research and
Development 21 (1977), pp. 350{359, 496.
26. \Recent work on algorithmic information theory," Abstracts of
Papers, 1977 IEEE International Symposium on Information
Theory, p. 129.
27. \A note on Monte Carlo primality tests and algorithmic informa-
tion theory," with J.T. Schwartz, Communications on Pure and
Applied Mathematics 31 (1978), pp. 521{527.
28. \Toward a mathematical de nition of `life'," in R.D. Levine and
M. Tribus, The Maximum Entropy Formalism, MIT Press, 1979,
pp. 477{498.
29. \Algorithmic information theory," in Encyclopedia of Statistical
Sciences, Volume 1, Wiley, 1982, pp. 38{41.
30. \G odel's theorem and information," International Journal of
Theoretical Physics 22 (1982), pp. 941{954. Reprinted in T.
Tymoczko, New Directions in the Philosophy of Mathematics,
Birkh auser, 1986.
31. \Randomness and G odel's theorem," Mondes en Developpement,
No. 54{55 (1986), pp. 125{128.
32. \Incompleteness theorems for random reals," Advances in Applied
Mathematics 8 (1987), pp. 119{146.
496 Part VIII|Bibliography
33. Algorithmic Information Theory, Cambridge University Press,
1987.
34. Information, Randomness & Incompleteness, World Scienti c,
1987.
35. \Randomness in arithmetic," Scientic American 259, No. 1
(July 1988), pp. 80{85.
36. Algorithmic Information Theory, 2nd printing (with revisions),
Cambridge University Press, 1988.
37. Information, Randomness & Incompleteness, 2nd edition, World
Scienti c, 1990.
38. Algorithmic Information Theory, 3rd printing (with revisions),
Cambridge University Press, 1990.
39. \A random walk in arithmetic," New Scientist 125, No. 1709 (24
March 1990), pp. 44{46. Reprinted in N. Hall, The New Scientist
Guide to Chaos, Penguin, 1992, and in N. Hall, Exploring Chaos,
Norton, 1993.
40. \Algorithmic information & evolution," in O.T. Solbrig and G.
Nicolis, Perspectives on Biological Complexity, IUBS Press, 1991,
pp. 51{60.
41. \Le hasard des nombres," La Recherche 22, N 232 (mai 1991),
pp. 610{615.
42. \Complexity and biology," New Scientist 132, No. 1789 (5 Octo-
ber 1991), p. 52.
43. \LISP program-size complexity," Applied Mathematics and Com-
putation 49 (1992), pp. 79{93.
44. \Information-theoretic incompleteness," Applied Mathematics
and Computation 52 (1992), pp. 83{101.
45. \LISP program-size complexity II," Applied Mathematics and
Computation 52 (1992), pp. 103{126.
Publications of G J Chaitin 497
46. \LISP program-size complexity III," Applied Mathematics and
Computation 52 (1992), pp. 127{139.
47. \LISP program-size complexity IV," Applied Mathematics and
Computation 52 (1992), pp. 141{147.
48. \A Diary on Information Theory," The Mathematical Intelli-
gencer 14, No. 4 (Fall 1992), pp. 69{71.
49. Information-Theoretic Incompleteness, World Scienti c, 1992.
50. Algorithmic Information Theory, 4th printing, Cambridge Uni-
versity Press, 1992. (Identical to 3rd printing.)
51. \Randomness in arithmetic and the decline and fall of reduction-
ism in pure mathematics," Bulletin of the European Association
for Theoretical Computer Science, No. 50 (June 1993), pp. 314{
328. Reprinted in J.L. Casti and A. Karlqvist, Cooperation and
Conict in General Evolutionary Processes, Wiley, 1995. Also
reprinted in Chaos, Solitons & Fractals, Vol. 5, No. 2, pp. 143{
159, 1995.
52. \On the number of n-bit strings with maximum complexity," Ap-
plied Mathematics and Computation 59 (1993), pp. 97{100.
53. \The limits of mathematics|course outline & software," 127 pp.,
December 1993. To obtain, send e-mail to \chao-dyn @ xyz.lanl.-
gov" with \Subject: get 9312006".
54. \Randomness and complexity in pure mathematics," Interna-
tional Journal of Bifurcation and Chaos 4 (1994), pp. 3{15.
55. \Responses to `Theoretical mathematics:: : '," Bulletin of the
American Mathematical Society 30 (1994), pp. 181{182.
56. Foreword in C. Calude, Information and Randomness, Springer-
Verlag, 1994, pp. ix{x.
57. \The limits of mathematics," 270 pp., July 1994. To obtain, send
e-mail to \chao-dyn @ xyz.lanl.gov" with \Subject: get 9407003".
498 Part VIII|Bibliography
58. \The limits of mathematics IV," 231 pp., July 1994. To ob-
tain, send e-mail to \chao-dyn @ xyz.lanl.gov" with \Subject:
get 9407009".
59. \The limits of mathematics|extended abstract," 7 pp., July
1994. To obtain, send e-mail to \chao-dyn @ xyz.lanl.gov" with
\Subject: get 9407010".
60. \Randomness in arithmetic and the decline and fall of reduction-
ism in pure mathematics," in J. Cornwell, Nature's Imagination,
Oxford University Press, 1995, pp. 27{44.
61. \The Berry paradox," Complexity 1, No. 1 (1995), pp. 26{30.
62. \A new version of algorithmic information theory," 12 pp., June
1995. To obtain, send e-mail to \chao-dyn @ xyz.lanl.gov" with
\Subject: get 9506003".
63. \The limits of mathematics|tutorial version," 143 pp., Septem-
ber 1995. To obtain, send e-mail to \chao-dyn @ xyz.lanl.gov"
with \Subject: get 9509010".
64. \How to run algorithmic information theory on a computer," 21
pp., September 1995. To obtain, send e-mail to \chao-dyn @
xyz.lanl.gov" with \Subject: get 9509014".
65. \The limits of mathematics," 45 pp., September 1995. To ob-
tain, send e-mail to \chao-dyn @ xyz.lanl.gov" with \Subject:
get 9509021".
DISCUSSIONS OF
CHAITIN'S WORK
1. M. Davis, "What is a computation?," in L.A. Steen, Mathematics Today, Springer-Verlag, 1978.
2. R. Rucker, Mind Tools, Houghton Mifflin, 1987.
3. J.L. Casti, Searching for Certainty, Morrow, 1990.
4. J.A. Paulos, Beyond Numeracy, Knopf, 1991.
5. J.D. Barrow, Theories of Everything, Oxford University Press, 1991.
6. D. Ruelle, Chance and Chaos, Princeton University Press, 1991.
7. P. Davies, The Mind of God, Simon & Schuster, 1992.
8. J.D. Barrow, Pi in the Sky, Oxford University Press, 1992.
9. L. Brisson and F.W. Meyerstein, Puissance et Limites de la Raison, Les Belles Lettres, 1995.
10. G. Johnson, Fire in the Mind, Knopf, 1995.
11. P. Coveney and R. Highfield, Frontiers of Complexity, Fawcett Columbine, 1995.
Epilogue
UNDECIDABILITY &
RANDOMNESS IN PURE
MATHEMATICS
This is a lecture that was given 28 September 1989 at the Europalia 89 Conference on Self-Organization in Brussels. The lecture was filmed by EuroPACE; this is an edited transcript. Published in G. J. Chaitin, Information, Randomness & Incompleteness, 2nd Edition, World Scientific, 1990, pp. 307–313.
G. J. Chaitin
Abstract
I have shown that God not only plays dice in physics, but even in pure
mathematics, in elementary number theory, in arithmetic! My work is a
fundamental extension of the work of Gödel and Turing on undecidabil-
ity in pure mathematics. I show that not only does undecidability occur,
but in fact sometimes there is complete randomness, and mathematical
truth becomes a perfect coin toss.
Randomness in Physics
What I'd like to talk about today is taking an important and fundamen-
tal idea from physics and applying it to mathematics. The fundamental
idea that I'm referring to is the notion of randomness, which I think
it is fair to say obsesses physicists. That is to say, the question of to
what extent is the future predictable, to what extent is our inability to
predict the future our limitation, or whether it is in principle impossible
to predict the future.
This idea has of course a long history in physics. In Newtonian
physics there was Laplacian determinism. Then came quantum me-
chanics. One of the controversial features of quantum mechanics was
that probability and randomness were introduced at a fundamental
level to physics. This greatly upset Einstein. And then surprisingly
enough with the modern study of nonlinear dynamics we realize that
classical physics after all really did have randomness and unpredictabil-
ity at its very core. So the notion of randomness and unpredictability
begins to look like a unifying principle, and I would like to suggest that
this even extends to mathematics.
I would like to suggest that the situation in mathematics is related to
the one in physics: If we can't prove something, if we don't see a pattern
or a law, or we cannot prove a theorem, the question is, is this our fault,
is it just a human limitation because we're not bright enough or we
haven't worked long enough on the question to be able to settle it? Or
is it possible that sometimes there simply is no mathematical structure
to be discovered, no mathematical law, no mathematical theorem, and
in fact no answer to a mathematical question? This is the question
about randomness and unpredictability in physics, transferred to the
domain of mathematics.
The Hilbert Problems
One way to orient our thinking on this question, is to recall the famous
lecture given by Hilbert ninety years ago in 1900 in which he proposed a
famous list of twenty-three problems as a challenge to the new century,
a century which is now almost over.
One of the questions, his sixth question, had to do with axiomatizing
physics. And one of the points in this question was probability theory.
Because to Hilbert, probability was a notion that came from physics,
having to do with the real world.
Another question that he talked about was his tenth problem, hav-
ing to do with solving so-called \diophantine" equations, that is to say,
algebraic equations where you're dealing only with integers. He asked,
"Is there a way to decide whether an algebraic equation has a solution in whole numbers or not?"
Little did Hilbert imagine that these two questions have a close
connection!
There was something so basic to Hilbert's thinking that he didn't
formulate it as a question in his 1900 talk. That was the idea that every
mathematical problem has a solution, that if you ask a clear question
you will get a clear answer. Maybe we're not smart enough to do it or
haven't worked long enough on the problem, but Hilbert believed that
in principle it was possible to settle every mathematical question, that
it's a black or white situation. Later he formulated this as a problem
to be studied, but in 1900 it was a principle that he emphasized and
did not question.
What I would like to explain in this lecture is that randomness
does occur in pure mathematics, it occurs in number theory, it occurs
in arithmetic. And the way that it occurs ties together these three
issues that Hilbert considered, because you can find randomness in
connection with the problem of solving algebraic equations in whole
numbers. That's Hilbert's tenth problem about diophantine equations.
Then looking at Hilbert's sixth question where he refers to probabil-
ity theory, one sees that probability is not just a field of applied mathematics. It certainly is a field of applied mathematics, but that's not the only context in which probability occurs. It's perhaps more surprising that one finds probability and randomness even in pure mathematics,
in number theory, the theory of whole numbers, which is one of the
oldest branches of pure mathematics going back to the ancient Greeks.
That's the point I'm going to be making.
This touches also on the basic assumption of Hilbert's talk of the
year 1900 because it turns out that it isn't always the case that clear
simple mathematical questions have clear answers.
In fact, I'll talk about some conceptually simple questions that arise
in elementary arithmetic, in elementary number theory, involving dio-
phantine equations, where the answer is completely random and looks
gray rather than black or white. The answer is random not just because
I can't solve it today or tomorrow or in a thousand years, but because
it doesn't matter what methods of reasoning you use, the answer will
always look random.
So a fancy way to summarize what I'll be talking about, going back to Einstein's displeasure with quantum mechanics, is to say, "Not only does God play dice in quantum mechanics and in nonlinear dynamics, which is to say in quantum and in classical physics, but even in arithmetic, even in pure mathematics!"
Formal Axiom Systems
What is the evolution of ideas leading to this surprising conclusion?
A first point that I'd like to make, which is surprising but is very
easy to understand, has to do with the notion of axiomatic reasoning,
of formal mathematical reasoning based on axioms, which was studied
by many people including Hilbert. In particular Hilbert demanded that
when one sets up a formal axiom system there should be a mechanical
procedure for deciding if a proof is correct or not. That's a requirement
of clarity really, and of objectivity.
Here is a simple surprising fact: If one sets up a system of axioms and it's consistent, which means that you can't prove a result and its contrary, and also it's complete, which means that for any assertion you can either prove that it's true or that it's false, then it follows immediately that the so-called "decision problem" is solvable. In other words, the whole subject becomes trivial because there is a mechanical procedure that in principle would enable you to settle any question that can be formulated in the theory. There's a colorful way to explain this, the so-called "British Museum algorithm."
What one does—it can't be done in practice, it would take forever—but in principle one could run through all possible proofs in the formal language, in the formal axiom system, in order of their size,
in lexicographic order. That is, you simply look through all possible
proofs. And check which ones are correct, which ones follow the rules, which ones are accepted as valid. That way in principle one can find all theorems. One will find everything that can be proven from this set of axioms. And if it's consistent and complete, well then any question that one wants to settle, eventually one will either find a proof or else one will find a proof of the contrary and know that it's false.
This gives a mechanical procedure for deciding whether any asser-
tion is correct or not, can be proven or not. Which means that in a
sense one no longer needs ingenuity or inspiration and in principle the
subject becomes mechanical.
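Here is a minimal sketch, in Python, of the British Museum algorithm, purely to fix ideas. The helpers is_valid_proof, conclusion_of and negation_of are hypothetical stand-ins for the mechanical proof-checker of some fixed formal axiom system; they are not part of the lecture.

    from itertools import count, product

    def decide(statement, is_valid_proof, conclusion_of, negation_of):
        # Run through all candidate proofs in order of size,
        # lexicographically within each size.
        for n in count(1):
            for symbols in product("01", repeat=n):  # proofs coded as bit strings
                proof = "".join(symbols)
                if is_valid_proof(proof):
                    if conclusion_of(proof) == statement:
                        return True   # found a proof of the statement
                    if conclusion_of(proof) == negation_of(statement):
                        return False  # found a proof of its contrary
        # In a consistent and complete system one of the two returns
        # above is eventually reached, so the loop terminates.

In a consistent and complete system this loop always halts, which is exactly the decision procedure just described; it is of course hopelessly slow.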
I'm sure you all know that in fact mathematics isn't trivial. We
know due to the absolutely fundamental work of Gödel and Turing
that this cannot be done: One cannot get a consistent and complete
axiomatic theory of mathematics, and one cannot get a mechanical
procedure for deciding if an arbitrary mathematical assertion is true or
false, is provable or not.
The Halting Problem
Gödel originally came up with a very ingenious proof of this, but I think
that Turing's approach in some ways is more fundamental and easier
to understand. I'm talking about the halting problem, Turing's funda-
mental theorem on the unsolvability of the halting problem, which says
there is no mechanical procedure for deciding if a computer program
will ever halt.
Here it's important that the program have all its data inside, that
it be self contained. One sets the program running on a mathematical
idealization of a digital computer. There is no time limit, so this is a
very ideal mathematical question. One simply asks, "Will the program go on forever, or at some point will it say `I'm finished' and halt?"
What Turing showed is that there is no mechanical procedure for
doing this, there is no algorithm or computer program that will decide
this. Gödel's incompleteness theorem follows immediately. Because if
there is no mechanical procedure for deciding if a program will ever halt,
then there also cannot be a set of axioms to deduce whether a program
will halt or not. If one had it, then that would give one a mechanical
procedure by running through all possible proofs. In principle that
would work, although it would all be incredibly slow.
I don't want to give too many details, but let me outline one way
to prove that the halting problem is unsolvable, by means of a reductio
ad absurdum.
Let's assume that there is a mechanical procedure for deciding if a
program will ever halt. If there is, one can construct a program which
contains the number N, and that program will look at all programs
up to N bits in size, and check for each one whether it halts. It then
simulates running the ones that halt, all programs up to N bits in size,
and looks at the output. Let's assume the output is natural numbers.
Then what you do is you maximize over all of this, that is, you take
the biggest output produced by any program that halts that has up to
N bits in size, and let's double the result. I'm talking about a program
that given N does this.
However the program that I've just described is really only about log N bits long, because to know N you only need log_2 N bits in binary, right? This program is about log_2 N bits long, but it's producing a result which is two times greater than any output produced by a program up to N bits in size. And since log_2 N is much smaller than N, this program is itself one of the programs up to N bits in size. So this program is producing an output which is at least twice as big as its own output, which is impossible.
Therefore the halting problem is unsolvable. This is an information-
theoretic way of proving the unsolvability of the halting problem.
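To make the argument concrete, here is the same construction as a Python sketch. The functions halts and output are hypothetical: halts is the assumed halting decider and output runs a halting program and returns its natural-number output. The contradiction below is precisely what rules them out.

    from itertools import product

    def all_programs_up_to(n_bits):
        # Every bit string of length at most n_bits, viewed as a program.
        for k in range(1, n_bits + 1):
            for bits in product("01", repeat=k):
                yield "".join(bits)

    def twice_the_maximum(N, halts, output):
        # Twice the biggest output of any halting program up to N bits.
        results = [output(p) for p in all_programs_up_to(N) if halts(p)]
        return 2 * max(results)

Specifying this program takes only about log_2 N bits beyond a fixed amount of code, so for large N it is itself far smaller than N bits, yet its output is twice the largest output of any program up to N bits in size, itself included.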
The Halting Probability Ω
Okay, so I start with Turing's fundamental result on the unsolvabil-
ity of the halting problem, and to get my result on randomness in
mathematics, what I do is I just change the wording. It's sort of a
mathematical pun. From the unsolvability of the halting problem, I go
to the randomness of the halting probability.
What is the halting probability? How do I transform Turing's prob-
lem, the halting problem, in order to get my stronger result, that not only do you have undecidability in pure mathematics but in fact you even have complete randomness?
Well the halting probability is just this idea: Instead of asking for a
specific program whether or not it halts in principle given an arbitrary
amount of time, one looks at the ensemble of all possible computer
programs. One does this thought experiment using a general-purpose
computer, which in mathematical terms is a universal Turing machine.
And you have to have a probability associated with each computer
program in order to talk about what is the probability that a computer
program will halt.
One chooses each bit of the computer program by tossing a fair coin,
an independent toss for each bit, so that an N-bit program will have probability 2^(-N). Once you've chosen the probability measure this way and you choose your general-purpose computer (which is a universal Turing machine) this will define a specific halting probability.
This puts in one big bag the question of whether every program halts. It's all combined into this one number, the halting probability. So it takes all of Turing's problems and combines them into one real number. I call this number Ω by the way. The halting probability Ω is determined once you specify the general-purpose computer, but the choice of computer doesn't really matter very much.
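In the standard notation of algorithmic information theory (my gloss; the lecture does not write it out), this definition reads

    \Omega = \sum_{p \;\mathrm{halts\ on}\; U} 2^{-|p|}

where U is the chosen general-purpose computer, the sum ranges over the (self-delimiting) programs p that halt on U, and |p| is the size of p in bits. Choosing each program bit by an independent fair coin toss yields exactly this measure.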
My number Ω is a probability, and therefore it's a real number between 0 and 1. And one could write it in binary or any other base, but it's particularly convenient to take it in binary. And when one defines this halting probability Ω and writes it in binary, then the question arises, "What is the Nth bit of the halting probability?"
My claim is that to Turing's assertion that the halting problem is undecidable corresponds my result that the halting probability is random or irreducible mathematical information. In other words, each bit in base-two of this real number Ω is an independent mathematical fact. To know whether that bit is 0 or 1 is an irreducible mathematical fact which cannot be compressed or reduced any further.
The technical way of saying this is to say that the halting probability is algorithmically random, which means that to get N bits of the real number in binary out of a computer program, one needs a program at least N bits long. That's a technical way of saying this. But a simpler way to say it is this: The assertion that the Nth bit of Ω is a 0 or a 1 for a particular N, to know which way each of the bits goes, is an irreducible independent mathematical fact, a random mathematical
fact, that looks like tossing a coin.
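Stated precisely, in the usual notation, the algorithmic randomness of Ω means that there is a constant c such that

    H(\Omega_N) \geq N - c \quad \text{for every } N,

where Ω_N denotes the first N bits of Ω and H is program-size complexity: no program appreciably shorter than N bits can produce those N bits.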
Arithmetization
Now you will of course immediately say, "This is not the kind of mathematical assertion that I normally encounter in pure mathematics."
What one would like, of course, is to translate it into number theory,
the bedrock of mathematics.
And you know Gödel had the same problem. When he originally constructed his unprovable true assertion, it was bizarre. It said, "I'm unprovable!" Now that is not the kind of mathematical assertion that one normally considers as a working mathematician. Gödel devoted a lot of ingenuity, some very clever, brilliant and dense mathematics, to making "I'm unprovable" into an assertion about whole numbers. This includes the trick of Gödel numbering and a lot of number theory.
There has been a lot of work deriving from that original work of Gödel's. In fact that work was ultimately used to show that Hilbert's tenth problem is unsolvable. A number of people worked on that. I can take advantage of all that work that's been done over the past sixty years. There is a particularly dramatic development, the work of Jones and Matijasevič which was published about five years ago.
They discovered that the whole subject is really easy, which is surprising because it had been very intricate and messy. They discovered in fact that there was a theorem proved by Édouard Lucas a hundred years ago, a very simple theorem that does the whole job, if one knows how to use it properly, as Jones and Matijasevič showed how to do.
So one needs very little number theory to convert the assertion about Ω that I talked about into an assertion about whole numbers, an arithmetical assertion. Let me just state this result of Lucas because it's delightful, and it's surprisingly powerful. That was of course the achievement of Jones and Matijasevič, to realize this.
The hundred-year old theorem of Lucas has to do with when is a binomial coefficient even and when is it odd. If one asks what is the coefficient of X^K in the expansion of (1+X)^N, in other words, what is the Kth binomial coefficient of order N, well the answer is that it's odd if and only if K implies N on a bit by bit basis, considered as bit strings. In other words, to know if a binomial coefficient (N choose K) is odd, what you have to do is look at each bit in the lower number K that's on, and check if the corresponding bit in the upper number N is also on. If that's always the case on a bit by bit basis, then, and only then, will the binomial coefficient be odd. Otherwise it'll be even.
This is a remarkable fact, and it turns out to be all the number
theory one really needs to know, amazingly enough.
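Lucas's criterion is easy to check by machine. Here is a small Python sanity check (my illustration, not from the lecture); the bitwise test (K & N) == K is exactly the condition that every bit that is on in K is also on in N:

    from math import comb

    def lucas_says_odd(N, K):
        # C(N, K) is odd iff every bit set in K is also set in N.
        return (K & N) == K

    # Compare the criterion with the actual parity of the binomial
    # coefficient over a range of small cases.
    assert all(lucas_says_odd(N, K) == (comb(N, K) % 2 == 1)
               for N in range(64) for K in range(N + 1))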
Randomness in Arithmetic
So what is the result of using this technique of Jones and Matijasevič based on this remarkable theorem of Lucas?
Well, the result of this is a diophantine equation. I thought it would be fun to actually write it down, since my assertion that there is randomness in pure mathematics would have more force if I can exhibit it as concretely as possible. So I spent some time and effort on a large computer and with the help of the computer I wrote down a two-hundred page equation with seventeen-thousand variables.
This is what is called an exponential diophantine equation. That
is to say, it involves only whole numbers, in fact, non-negative whole
numbers, 0, 1, 2, 3, 4, 5, ... the natural numbers. All the variables and
constants are non-negative integers. It's called "exponential diophantine," "exponential" because in addition to addition and multiplication
one allows also exponentiation, an integer raised to an integer power.
That's why it's called an exponential diophantine equation. That's also
allowed in normal polynomial diophantine equations but the power has
to be a constant. Here the power can be a variable. So in addition to
seeing X^3 one can also see X^Y in this equation.
So it's a single equation with 17,000 variables and everything is
considered to be non-negative integers, unsigned whole numbers. And
this equation of mine has a single parameter, the variable N. For any particular value of this parameter, I ask the question, "Does this equation have a finite number of whole-number solutions or does this equation have an infinite number of solutions?"
The answer to this question is my random arithmetical fact—it
turns out to correspond to tossing a coin. It "encodes" arithmetically whether the Nth bit of Ω is a 0 or a 1. If the Nth bit of Ω is a 0, then this equation, for that particular value of N, has finitely many solutions. If the Nth bit of the halting probability Ω is a 1, then this equation for that value of the parameter N has an infinite number of solutions.
The change from Hilbert is twofold: Hilbert looked at polynomial diophantine equations. One was never allowed to raise X to the Yth power, only X to the 5th power. Second, Hilbert asked "Is there a solution? Does a solution exist or not?" This is undecidable, but it is not completely random, it only gives a certain amount of randomness. To get complete randomness, like an independent fair coin toss, one needs to ask, "Is there an infinite number of solutions or a finite number of solutions?"
Let me point out, by the way, that if there are no solutions, that's a finite number of solutions, right? So it's one way or the other. It either has to be an infinite number or a finite number of solutions. The problem is to know which. And my assertion is that we can never know!
In other words, to decide whether the number of solutions is finite or infinite (the number of solutions in whole numbers, in non-negative integers) in each particular case, is in fact an irreducible arithmetical mathematical fact.
So let me emphasize what I mean when I say "irreducible mathematical facts." What I mean, is that it's just like independent coin tosses, like a fair coin. What I mean, is that essentially the only way to get out as theorems whether the number of solutions is finite or infinite in particular cases, is to assume this as axioms.
In other words, if we want to be able to settle K cases of this question—whether the number of solutions is finite or not for K particular values of the parameter N—that would require that K bits of information be put into the axioms that we use in our formal axiom system. That's a very strong sense of saying that these are irreducible mathematical facts.
I think it's fair to say that whether the number of solutions is finite or infinite can therefore be considered to be a random mathematical or arithmetical fact.
To recapitulate, Hilbert's tenth problem asks "Is there a solution?"
and doesn't allow exponentiation. I ask "Is the number of solutions finite?" and I do allow exponentiation.
In the sixth question, it is proposed to axiomatize probability theory
as part of physics, as part of Hilbert's program to axiomatize physics.
But I have found an extreme form of randomness, of irreducibility, in
pure mathematics—in a part of elementary number theory associated
with the name of Diophantos and which goes back two thousand years
to classical Greek mathematics.
Moreover, my work is an extension of the work of Gödel and Turing which refuted Hilbert's basic assumption in his 1900 lecture, that every mathematical question has an answer—that if you ask a clear question
there is a clear answer. Hilbert believed that mathematical truth is
black or white, that something is either true or false. It now looks
like it's gray, even when you're just thinking about the unsigned whole
numbers, the bedrock of mathematics.
The Philosophy of Mathematics
This has been a century of much excitement in the philosophy and in the foundations of mathematics. Part of it was the effort to understand how the calculus (the notion of real number, of limit) could be made rigorous—that goes back even more than a hundred years.
Modern mathematical self-examination really starts, I believe it is fair to say, with Cantor's theory of the infinite and the paradoxes and surprises that it engendered, and with the efforts of people like Peano and Russell and Whitehead to give a firm foundation for mathematics.
Much hope was placed on set theory, which seemed very wonderful and promising, but it was a pyrrhic victory—set theory does not help! Originally the effort was made to define the whole numbers 1, 2, 3, ... in terms of sets, in order to make the whole numbers clearer and more definite.
However, it turns out that the notion of set is full of all kinds of
paradoxes. For example the notion of the universal set turns out to be
inadmissible. And there are problems having to do with large infinities
in set theory. Set theory is fascinating and a vital part of mathematics,
but I think it is fair to say that there was a retreat away from set theory
and back to 1, 2, 3, 4, 5, ... Please don't touch them!
I think that the work I've described, and in particular my own work on randomness, has not spared the whole numbers. I always believed, I think most mathematicians probably do, in a kind of Platonic universe. "Does a diophantine equation have an infinite number of solutions or a finite number?" This question has very little concrete computational meaning, but I certainly used to believe in my heart, that even if we will never find out, God knew, and either there were a finite number of solutions or an infinite number of solutions. It was black or white in the Platonic universe of mathematical reality. It was one way or the other.
I think that my work makes things look gray, and that mathemati-
cians are joining the company of their theoretical physics colleagues. I
don't think that this is necessarily bad. We've seen that in classical and
quantum physics randomness and unpredictability are fundamental. I
believe that these concepts are also found at the very heart of pure
mathematics.
Future Work: In this discussion the probabilities that arise are
all real numbers. Can the probability amplitudes of quantum mechan-
ics, which are complex numbers, be used instead?
Further Reading
• I. Stewart, "The ultimate in undecidability," Nature, 10 March 1988, pp. 115–116.
• J. P. Delahaye, "Une extension spectaculaire du théorème de Gödel: l'équation de Chaitin," La Recherche, juin 1988, pp. 860–862. English translation, AMS Notices, October 1989, pp. 984–987.
• G. J. Chaitin, "Randomness in arithmetic," Scientific American, July 1988, pp. 80–85.
• G. J. Chaitin, Information, Randomness & Incompleteness—Papers on Algorithmic Information Theory, World Scientific, Singapore, 1987.
• G. J. Chaitin, Algorithmic Information Theory, Cambridge University Press, Cambridge, 1987.
ALGORITHMIC
INFORMATION &
EVOLUTION
This is a revised version of a lecture presented April 1988 in
Paris at the International Union of Biological Sciences Work-
shop on Biological Complexity. Published in O. T. Solbrig
and G. Nicolis, Perspectives on Biological Complexity, IUBS
Press, 1991, pp. 51–60.
G. J. Chaitin
IBM Research Division, P.O. Box 218
Yorktown Heights, NY 10598, U.S.A.
Abstract
A theory of information and computation has been developed: "algorithmic information theory." Two books [11–12] have recently been published on this subject, as well as a number of nontechnical discussions [13–16]. The main thrust of algorithmic information theory is twofold: (1) an information-theoretic mathematical definition of random sequence via algorithmic incompressibility, and (2) strong information-theoretic versions of Gödel's incompleteness theorem. The halting probability Ω of a universal Turing machine plays a fundamental role. Ω is an abstract example of evolution: it is of infinite complexity and the limit of a computable sequence of rational numbers.
1. Algorithmic information theory
Algorithmic information theory [11–16] is a branch of computational complexity theory concerned with the size of computer programs rather than with their running time. In other words, it deals with the difficulty of describing or specifying algorithms, rather than with the resources
needed to execute them. This theory combines features of probability
theory, information theory, statistical mechanics and thermodynamics,
and recursive function or computability theory.
It has so far had two principal applications. The first is to provide a new conceptual foundation for probability theory based on the notion of an individual random or unpredictable sequence, instead of the usual measure-theoretic formulation in which the key notion is the distribution of measure among an ensemble of possibilities. The second major application of algorithmic information theory has been the dramatic new light it throws on Gödel's famous incompleteness theorem and on the limitations of the axiomatic method.
The main concept of algorithmic information theory is that of the
program-size complexity or algorithmic information content of an ob-
ject (usually just called its "complexity"). This is defined to be the size
in bits of the shortest computer program that calculates the object, i.e.,
the size of its minimal algorithmic description. Note that we consider
computer programs to be bit strings and we measure their size in bits.
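In symbols, in the standard formulation, the complexity of an object x is

    H(x) = \min \{\, |p| : U(p) = x \,\},

where U is a fixed universal computer, p ranges over programs, and |p| is the size of p in bits.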
If the object being calculated is itself a finite string of bits, and its minimal description is no smaller than the string itself, then the bit string is said to be algorithmically incompressible, algorithmically irreducible, or algorithmically random. Such strings have the statistical properties that one would expect. For example, 0's and 1's must occur with nearly equal relative frequency; otherwise the bit string could be compressed.
An infinite bit string is said to be algorithmically incompressible, algorithmically irreducible, or algorithmically random if all its initial segments are algorithmically random finite bit strings.
A related concept is the mutual algorithmic information content of
two objects. This is the extent to which it is simpler to calculate them
together than to calculate them separately, i.e., the extent to which
their joint algorithmic information content is less than the sum of their
individual algorithmic information contents. Two objects are algorith-
mically independent if their mutual algorithmic information content is
zero, i.e., if calculating them together doesn't help.
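In symbols, the mutual algorithmic information content of x and y is

    H(x : y) = H(x) + H(y) - H(x, y),

which is non-negative up to a bounded error term, and which vanishes, again up to a bounded error term, exactly when x and y are algorithmically independent.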
These concepts provide a new conceptual foundation for probability
theory based on the notion of an individual random string of bits, rather
than the usual measure-theoretic approach. They also shed new light
on Gödel's incompleteness theorem, for in some circumstances it is
possible to argue that the unprovability of certain true assertions follows
naturally from the fact that their algorithmic information content is
greater than the algorithmic information content of the axioms and
rules of inference being employed.
For example, the N-bit string of outcomes of N successive independent tosses of a fair coin almost certainly has algorithmic information content greater than N and is algorithmically incompressible or random. But to prove this in the case of a particular N-bit string turns out to require at least N bits of axioms, even though it is almost always true. In other words, most finite bit strings are random, but individual bit strings cannot be proved to be random [3].
Here is an even more dramatic example of this information-theoretic approach to the incompleteness of formal systems of axioms. I have shown that there is sometimes complete randomness in elementary number theory [11, 13, 15–16]. I have constructed [11] a two-hundred page exponential diophantine equation with the property that the number of solutions jumps from finite to infinite at random as a parameter is varied.
In other words, whether the number of solutions is finite or infinite in each case cannot be distinguished from independent tosses of a fair coin. This is an infinite amount of independent, irreducible mathematical information that cannot be compressed into any finite number of axioms. I.e., essentially the only way to prove these assertions is to assume them as axioms!
This completes our sketch of algorithmic information theory. Now
let's turn to biology.
2. Evolution
The origin of life and its evolution from simpler to more complex forms, the origin of biological complexity and diversity, and more generally the reason for the essential difference in character between biology and physics, are of course extremely fundamental scientific questions.
While Darwinian evolution, Mendelian genetics, and modern molecular biology have immensely enriched our understanding of these questions, it is surprising to me that such fundamental scientific ideas should not be reflected in any substantive way in the world of mathematical ideas. In spite of the persuasiveness of the informal considerations that adorn biological discussions, it has not yet been possible to extract any nuggets of rigorous mathematical reasoning, to distill any fundamental new rigorous mathematical concepts.
In particular, by historical coincidence the extraordinary recent progress in molecular biology has coincided with parallel progress in the emergent field of computational complexity, a branch of theoretical computer science. But in spite of the fact that the word "complexity" springs naturally to mind in both fields, there is at present little contact between these two worlds of ideas!
The ultimate goal, in fact, would be to set up a toy world, to define mathematically what is an organism and how to measure its complexity, and to prove that life will spontaneously arise and increase in complexity with time.
3. Does algorithmic information theory apply to biology?
Can the concepts of algorithmic information theory help us to define mathematically the notion of biological complexity?
One possibility is to ask what is the algorithmic information content of the sequence of bases in a particular strand of DNA. Another possibility is to ask what is the algorithmic information content of the organism as a whole (it must be in discrete symbolic form, e.g., imbedded in a cellular automata model).
Mutual algorithmic information might also be useful in biology. For example, it could be used for pattern recognition, to determine the physical boundaries of an organism. This approach to a task which is sort of like defining the extent of a cloud defines an organism to be a region whose parts have high mutual algorithmic information content, i.e., a region of space that is highly correlated in an information-theoretic sense.
Another application of the notion of mutual algorithmic information
content might be to measure how closely related are two strands of
DNA, two cells, or two organisms. The higher the mutual algorithmic
information content, the more closely related they are.
These would be one's initial hopes. But, as we shall see in reviewing
previous work, it is not that easy!
4. Previous work
I have been concerned with these extremely difficult questions for the past twenty years, and have a series of publications [1–2, 7–13] devoted in whole or in part to searching for ties between the concepts of algorithmic information theory and the notion of biological information and complexity.
In spite of the fact that a satisfactory definition of randomness or lack of structure has been achieved in algorithmic information theory, the first thing that one notices is that it is not ipso facto useful in biology. For applying this notion to physical structures, one sees that a gas is the most random, and a crystal the least random, but neither has any significant biological organization.
My first thought was therefore that the notion of mutual or common information, which measures the degree of correlation between two structures, might be more appropriate in a biological context. I developed these ideas in a 1970 paper [1], and again in a 1979 paper [8] using the more-correct self-delimiting program-size complexity measures.
In the concluding chapter of my Cambridge University Press book [11] I turned to these questions again, with a number of new thoughts, among them to determine where biological questions fall in what logicians call the "arithmetical hierarchy."
The concluding remarks of my 1988 Scientific American article [13] emphasize what I think is probably the main contribution of the chapter at the end of my book [11]. This is the fact that in a sense there is a kind of evolution of complexity taking place in algorithmic information theory, and indeed in a very natural context.
The remaining publications [2, 7, 9–10, 12] emphasize the importance of the problem, but do not make new suggestions.
5. The halting probability Ω as a model of evolution
What is this natural and previously unappreciated example of the evolution of complexity in algorithmic information theory?
In this theory the halting probability Ω of a universal Turing machine plays a fundamental role. Ω is used to construct the two-hundred page equation mentioned above. If the value of its parameter is K, this equation has finitely or infinitely many solutions depending on whether the Kth bit of the base-two expansion of Ω is a 0 or a 1.
Indeed, to Turing's fundamental theorem in computability theory that the halting problem is unsolvable, there corresponds in algorithmic information theory my theorem [4] that the halting probability Ω is a random real number. In other words, any program that calculates N bits of the binary expansion of Ω is no better than a table look-up, because it must itself be at least N bits long. I.e., Ω is incompressible, irreducible information.
And it is Ω itself that is our abstract example of evolution! For even though Ω is of infinite complexity, it is the limit of a computable sequence of rational numbers, each of which is of finite but eventually increasing complexity. Here of course I am using the word "complexity" in the technical sense of algorithmic information theory, in which the complexity of something is measured by the size in bits of the smallest program for calculating it. However this computable sequence of rational numbers converges to Ω very, very slowly.
In precisely what sense are we getting infinite complexity in the limit of infinite time?
Well, it is trivial that in any infinite set of objects, almost all of them are arbitrarily complex, because there are only finitely many objects of bounded complexity. (In fact, there are less than 2^N objects of complexity less than N.) So we should not look at the complexity of each of the rational numbers in the computable sequence that gives Ω in the limit.
The right way to see the complexity increase is to focus on the first K bits of each of the rational numbers in the computable sequence. The complexity of this sequence of K bits initially jumps about but will eventually stay above K.
What precisely is the origin of this metaphor for evolution? Where does this computable sequence of approximations to Ω come from? It arises quite naturally, as I explain in my 1988 Scientific American article [13].
The Nth approximation to Ω, that is to say, the Nth stage in the computable evolution leading in the infinite limit to the violently uncomputable infinitely complex number Ω, is determined as follows. One merely considers all programs up to N bits in size and runs each member of this finite set of programs for N seconds on the standard universal Turing machine. Each program K bits long that halts before its time runs out contributes measure 2^(-K) to the halting probability Ω. Indeed, this is a computable monotone increasing sequence of lower bounds on the value of Ω that converges to Ω, but very, very slowly indeed.
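A Python sketch of the Nth stage makes the construction concrete. The function halts_within(p, t) is a hypothetical step-bounded simulator for the standard universal machine, and, as in the text, the requirement that programs be self-delimiting is glossed over; exact rational arithmetic keeps the lower bounds exact.

    from fractions import Fraction
    from itertools import product

    def omega_stage(N, halts_within):
        # Run every program of at most N bits for N steps; each K-bit
        # program that halts in time contributes measure 2^(-K).
        bound = Fraction(0)
        for k in range(1, N + 1):
            for bits in product("01", repeat=k):
                p = "".join(bits)
                if halts_within(p, N):
                    bound += Fraction(1, 2**k)
        return bound  # monotone increasing in N, converges to Omega from below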
This "evolutionary" model for computing Ω shows that one way to produce algorithmic information or complexity is by doing immense amounts of computation. Indeed, biology has been "computing" using molecular-size components in parallel across the entire surface of the earth for several billion years, which is an awful lot of computing.
On the other hand, an easier way to produce algorithmic information or complexity is, as we have seen, to simply toss a coin. This would seem to be the predominant biological source of algorithmic information, the frozen accidents of the evolutionary trail of mutations that are preserved in DNA.
So two different sources of algorithmic information do seem biologically plausible, and would seem to give rise to different kinds of
algorithmic information.
6. Technical note: A finite version of this model
There is also a "finite" version of this abstract model of evolution. In it one fixes N and constructs a computable infinite sequence s_t = s(t) of N-bit strings, with the property that for all sufficiently large times t, s_t = s_{t+1} is a fixed random N-bit string, i.e., one for which its program-size complexity H(s_t) is not less than its size in bits N. In fact, we can take s_t to be the first N-bit string that cannot be produced by any program less than N bits in size in less than t seconds.
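Written out (my formulation of the construction just described):

    s_t = \min\{\, s \in \{0,1\}^N : \text{no program of size} < N
                  \text{ outputs } s \text{ within } t \text{ steps} \,\}.

Since there are fewer than 2^N programs smaller than N bits, at every time t some N-bit string has not yet been produced, so s_t is always well defined; and for all sufficiently large t it freezes at the first N-bit string never produced by any program smaller than N bits, which therefore satisfies H(s_t) >= N.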
In a sense, the N bits of information in s_t for t large are coming from t itself. So one way to state this, is that knowing a sufficiently large natural number t is "equivalent to having an oracle for the halting problem" (as a logician would put it). That is to say, it provides as much information as one wants.
By the way, computations in the limit are extensively discussed in my two papers [5–6], but in connection with questions of interest in algorithmic information theory rather than in biology.
7. Conclusion
To conclude, I must emphasize a number of disclaimers.
First of all, Ω is a metaphor for evolution only in an extremely
abstract mathematical sense. The measures of complexity that I use,
while very pretty mathematically, pay for this prettiness by having
limited contact with the real world.
In particular, I postulate no limit on the amount of time that may
be taken to compute an object from its minimal-size description, as
long as the amount of time is finite. Nine months is already a long
time to ask a woman to devote to producing a working human infant
from its DNA description. A pregnancy of a billion years, while okay
in algorithmic information theory, is ridiculous in a biological context.
Yet I think it would also be a mistake to underestimate the signif-
icance of these steps in the direction of a fundamental mathematical
theory of evolution. For it is important to start bringing rigorous con-
cepts and mathematical proofs into the discussion of these absolutely
fundamental biological questions, and this, although to a very limited
extent, has been achieved.
References
Items 1 to 10 are reprinted in item 12.
[1] G. J. Chaitin, "To a mathematical definition of `life'," ACM SICACT News, January 1970, pp. 12–18.
[2] G. J. Chaitin, "Information-theoretic computational complexity," IEEE Transactions on Information Theory IT-20 (1974), pp. 10–15.
[3] G. J. Chaitin, "Randomness and mathematical proof," Scientific American, May 1975, pp. 47–52.
[4] G. J. Chaitin, "A theory of program size formally identical to information theory," Journal of the ACM 22 (1975), pp. 329–340.
[5] G. J. Chaitin, "Algorithmic entropy of sets," Computers & Mathematics with Applications 2 (1976), pp. 233–245.
[6] G. J. Chaitin, "Program size, oracles, and the jump operation," Osaka Journal of Mathematics 14 (1977), pp. 139–149.
[7] G. J. Chaitin, "Algorithmic information theory," IBM Journal of Research and Development 21 (1977), pp. 350–359, 496.
[8] G. J. Chaitin, "Toward a mathematical definition of `life'," in R.D. Levine and M. Tribus, The Maximum Entropy Formalism, MIT Press, 1979, pp. 477–498.
[9] G. J. Chaitin, "Algorithmic information theory," in Encyclopedia of Statistical Sciences, Volume 1, Wiley, 1982, pp. 38–41.
[10] G. J. Chaitin, "Gödel's theorem and information," International Journal of Theoretical Physics 22 (1982), pp. 941–954.
[11] G. J. Chaitin, Algorithmic Information Theory, Cambridge University Press, 1987.
[12] G. J. Chaitin, Information, Randomness & Incompleteness—Papers on Algorithmic Information Theory, World Scientific, 1987.
[13] G. J. Chaitin, "Randomness in arithmetic," Scientific American, July 1988, pp. 80–85.
[14] P. Davies, "A new science of complexity," New Scientist, 26 November 1988, pp. 48–50.
[15] J. P. Delahaye, "Une extension spectaculaire du théorème de Gödel: l'équation de Chaitin," La Recherche, juin 1988, pp. 860–862. English translation, AMS Notices, October 1989, pp. 984–987.
[16] I. Stewart, "The ultimate in undecidability," Nature, 10 March 1988, pp. 115–116.
About the author
Gregory J Chaitin is a member of the theoretical physics group at the IBM Thomas J Watson Research Center in Yorktown Heights, New York. He created algorithmic information theory in the mid 1960's when he was a teenager. In the two decades since he has been the principal architect of the theory. His contributions include: the definition of a random sequence via algorithmic incompressibility, the reformulation of program-size complexity in terms of self-delimiting programs, the definition of the relative complexity of one object given a minimal-size program for another, the discovery of the halting probability Omega and its significance, the information-theoretic approach to Gödel's incompleteness theorem, the discovery that the question of whether an exponential diophantine equation has finitely or infinitely many solutions is in some cases absolutely random, and the theory of program size for Turing machines and for LISP. He is the author of the monograph "Algorithmic Information Theory" published by Cambridge University Press in 1987.
INFORMATION,
RANDOMNESS &
INCOMPLETENESS
Papers on Algorithmic Information Theory
— Second Edition
World Scientific Series in Computer Science — Vol. 8
by Gregory J Chaitin (IBM)
This book is an essential companion to Chaitin's monograph ALGORITHMIC INFORMATION THEORY and includes in easily accessible
form all the main ideas of the creator and principal architect of algorith-
mic information theory. This expanded second edition has added thir-
teen abstracts, a 1988 SCIENTIFIC AMERICAN article, a transcript
of a EUROPALIA 89 lecture, an essay on biology, and an extensive
bibliography. Its larger format makes it easier to read. Chaitin's ideas
are a fundamental extension of those of Gödel and Turing and have exploded some basic assumptions of mathematics and thrown new light on the scientific method, epistemology, probability theory, and of course computer science and information theory.
531
532 Back Cover
Readership: Computer scientists, mathematicians, physicists, philosophers and biologists.