Lecture Notes on Descriptional Complexity and Randomness
Peter Gács
Boston University
[email protected]
Contents

1 Complexity
  1.1 Introduction
    1.1.1 Formal results
    1.1.2 Applications
    1.1.3 History of the problem
  1.2 Notation
  1.3 Kolmogorov complexity
    1.3.1 Invariance
    1.3.2 Simple quantitative estimates
  1.4 Simple properties of information
  1.5 Algorithmic properties of complexity
  1.6 The coding theorem
    1.6.1 Self-delimiting complexity
    1.6.2 Universal semimeasure
    1.6.3 Prefix codes
    1.6.4 The coding theorem for H(x)
    1.6.5 Algorithmic probability
  1.7 The statistics of description length

2 Randomness
  2.1 Uniform distribution
  2.2 Computable distributions
    2.2.1 Two kinds of test
    2.2.2 Randomness via complexity
    2.2.3 Conservation of randomness
  2.3 Infinite sequences

3 Information
  3.1 Information-theoretic relations
    3.1.1 The information-theoretic identity
    3.1.2 Information non-increase
  3.2 The complexity of decidable and enumerable sets
  3.3 The complexity of complexity
    3.3.1 Complexity is sometimes complex
    3.3.2 Complexity is rarely complex

4 Generalizations
  4.1 Continuous spaces, noncomputable measures
    4.1.1 Introduction
    4.1.2 Uniform tests
    4.1.3 Sequences
    4.1.4 Conservation of randomness
  4.2 Test for a class of measures
    4.2.1 From a uniform test
    4.2.2 Typicality and class tests
    4.2.3 Martin-Löf’s approach
  4.3 Neutral measure
  4.4 Monotonicity, quasi-convexity/concavity
  4.5 Algorithmic entropy
    4.5.1 Entropy
    4.5.2 Algorithmic entropy
    4.5.3 Addition theorem
    4.5.4 Information
  4.6 Randomness and complexity
    4.6.1 Discrete space
    4.6.2 Non-discrete spaces

B Constructivity
  B.1 Computable topology
    B.1.1 Representations
    B.1.2 Constructive topological space
    B.1.3 Computable functions
    B.1.4 Effective compactness
    B.1.5 Computable elements and sequences
    B.1.6 Semicomputability
    B.1.7 Computable metric space
  B.2 Constructive measure theory
    B.2.1 Space of measures
    B.2.2 Computable and semicomputable measures
    B.2.3 Random transitions

Bibliography
VERSION HISTORY
February 2010: chapters introduced, and recast using the memoir class.
April 2009: besides various corrections, a section is added on infinite
sequences. This is not new material, just places the most classical results on
randomness of infinite sequences before the more abstract theory.
January 2008: major rewrite.
• Added formal definitions throughout.
• Made corrections in the part on uniform tests and generalized complexity,
based on remarks of Hoyrup, Rojas and Shen.
• Rearranged material.
• Incorporated material on uniform tests from the work of Hoyrup-Rojas.
• Added material on class tests.
1 COMPLEXITY
1.1 INTRODUCTION
The present section can be read as an independent survey on the problems of randomness. It serves as some motivation for the drier stuff to follow.
If we toss a coin 100 times and it shows heads each time, we feel lifted to a land of wonders, like Rosencrantz and Guildenstern in [48]. The argument
that 100 heads are just as probable as any other outcome convinces us only
that the axioms of Probability Theory, as developed in [27], do not solve all
mysteries they are sometimes supposed to. We feel that the sequence con-
sisting of 100 heads is not random, though others with the same probability
are. Due to the somewhat philosophical character of this paradox, its history
is marked by an amount of controversy unusual for ordinary mathematics.
Before the reader hastens to propose a solution, let us consider a less trivial
example, due to L. A. Levin.
Suppose that in some country, the share of votes for the ruling party in 30 consecutive elections formed a sequence 0.99x_i where for every even i, the number x_i is the i-th digit of π = 3.1415 . . .. Though many of us would feel that the election results were manipulated, it turns out to be surprisingly difficult to prove this by referring to some general principle.
In a sequence of n fair elections, every sequence ω of n digits has approximately the probability Q_n(ω) = 10^{−n} to appear as the actual sequence of third digits. Let us fix n. We are given a particular sequence ω and want to test the validity of the government’s claim that the elections were fair. We interpret the assertion “ω is random with respect to Q_n” as a synonym for “there is no reason to challenge the claim that ω arose from the distribution Q_n”.
How can such a claim be challenged at all? The government, just like the
weather forecaster who announces 30% chance of rain, does not guarantee
any particular set of outcomes. However, to stand behind its claim, it must
agree to any bet based on the announced distribution. Let us call a payoff function with respect to the distribution P any nonnegative function t(ω) with ∑_ω P(ω)t(ω) ≤ 1. If a “nonprofit” gambling casino asks 1 dollar for a game and claims that each outcome has probability P(ω) then it must agree to pay t(ω) dollars on outcome ω. We would propose to the government the following payoff function t_0 with respect to Q_n: let t_0(ω) = 10^{n/2} for all sequences ω whose even digits are given by π, and 0 otherwise. This bet would cost the government 10^{n/2} − 1 dollars.
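To make the fairness constraint concrete, here is a minimal simulation sketch; the digit table, the 0-based indexing convention and the helper names are illustrative assumptions, not from the text. Under the fair distribution Q_n the expected payoff of t_0 is exactly 1, since the 10^{n/2} prize is collected with probability 10^{−n/2}.

    import random

    PI_DIGITS = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]  # illustrative: first decimal digits of pi

    def t_0(omega):
        # pays 10^(n/2) when every even position (0-based here) shows the
        # corresponding pi digit, and 0 otherwise
        n = len(omega)
        matches = all(omega[i] == PI_DIGITS[i // 2] for i in range(0, n, 2))
        return 10 ** (n // 2) if matches else 0

    n, trials = 6, 200_000
    payoff = sum(t_0([random.randrange(10) for _ in range(n)]) for _ in range(trials))
    print("average payoff under Q_n:", payoff / trials)  # fluctuates around 1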
Unfortunately, we must propose the bet before the elections take place, and it is unlikely that we would have come up exactly with the payoff function t_0. Is the suspicion then unjustifiable?
No. Though the function t_0 is not natural enough to be guessed in advance, it is still highly “regular”. And already Laplace assumed in [15] that the number
still highly “regular”. And already Laplace assumed in [15] that the number
of “regular” bets is so small we can afford to make them all in advance and
still win by a wide margin.
Kolmogorov discovered in [26] and [28] (probably without knowing
about [15] but following a four decades long controversy on von Mises’ con-
cept of randomness, see [51]) that to make this approach work we must
define “regular” or “simple” as “having a short description” (in some formal
sense to be specified below). There cannot be many objects having a short
description because there are not many short strings of symbols to be used
as descriptions.
We thus come to the principle saying that on a random outcome, all suf-
ficiently simple payoff functions must take small values. It turns out below
that this can be replaced by the more elegant principle saying that a random
outcome itself is not too simple. If descriptions are written in a 2-letter alpha-
bet then a typical sequence of n digits takes n log 10 letters to describe (if not
stated otherwise, all logarithms in these notes are to the base 2). The digits
of π can be generated by an algorithm whose description takes up only some
constant length. Therefore the sequence x_1 . . . x_n above can be described with approximately (n/2) log 10 letters, since every other digit comes from
π. It is thus significantly simpler than a typical sequence and can be declared
nonrandom.
Definition 1.1.4. Denote by x∗ the first one among the shortest descriptions of x. ù
hence only a few objects x can have small complexity. The converse of the
same argument goes as follows. Let µ be a computable probability distribu-
tion, one for which there is a binary program computing µ(ω) for each ω to
any given degree of accuracy. Let H(µ) be the length of the shortest one of
these programs. Then
Here c is a universal constant. These two inequalities are the key to the esti-
mates of complexity and the characterization of randomness by complexity.
Denote
d_µ(ω) = − log µ(ω) − H(ω).
Inequality (1.1.1) implies that
t_µ(ω) = 2^{d_µ(ω)}
can be viewed as a payoff function. Now we are in a position to solve the election paradox. We propose the payoff function
2^{− log Q_n(ω) − H(ω | n)}.
number, suppose that t(ω) has complexity < m/2, and that t(ω_0) > 2^m. Then inequality (1.1.1) can be applied to ν(ω) = µ(ω)t(ω), and we get
1.1.2 APPLICATIONS
Algorithmic Information Theory (AIT) justifies the intuition of random se-
quences as nonstandard analysis justifies infinitely small quantities. Any
statement of classical probability theory is provable without the notion of
randomness, but some of them are easier to find using this notion. Due to
the incomputability of the universal randomness test, only its approximations
can be used in practice.
Pseudorandom sequences are sequences generated by some algorithm,
with some randomness properties with respect to the coin-tossing distribu-
tion. They have very low complexity (depending on the strength of the tests
they withstand, see for example [14]), hence are not random. Useful pseu-
dorandom sequences can be defined using the notion of computational com-
plexity, for example the number of steps needed by a Turing machine to
compute a certain function. The existence of such sequences can be proved
using some difficult unproven (but plausible) assumptions of computation
theory. See [6], [57], [24], [33].
INDUCTIVE INFERENCE
The quantity
M(xy) / M(x) (1.1.2)
is an estimate of the conditional probability that the next terms of the outcome will be given by y provided that the first terms are given by x. It converges to the actual conditional probability µ(xy)/µ(x) with µ-probability 1 for any computable distribution µ (see for example [45]). To understand the
for any computable distribution µ (see for example [45]). To understand the
surprising generality of the formula (1.1.2), suppose that some infinite se-
quence z(i) is given in the following way. Its even terms are the subsequent
digits of π, its odd terms are uniformly distributed, independently drawn
random digits. Let
z(1 : n) = z(1) · · · z(n).
Then M(z(1 : 2i)a)/M(z(1 : 2i)) converges to 0.1 for a = 0, . . . , 9, while M(z(1 : 2i + 1)a)/M(z(1 : 2i + 1)) converges to 1 if a is the i-th digit of π, and to 0 otherwise.
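Although M itself is incomputable, this convergence phenomenon can be imitated with a toy Bayes mixture over just two hypotheses. The hypothesis class, the prior weights and the 0-based indexing below are illustrative assumptions, not part of the text:

    from fractions import Fraction

    PI = [1, 4, 1, 5, 9, 2, 6, 5, 3, 5]   # digits of pi after "3.", for illustration

    def p_pi(seq):    # hypothesis 1: even-indexed terms follow pi, the rest uniform
        p = Fraction(1)
        for i, d in enumerate(seq):
            p *= (1 if d == PI[i // 2] else 0) if i % 2 == 0 else Fraction(1, 10)
        return p

    def p_unif(seq):  # hypothesis 2: all terms uniform
        return Fraction(1, 10) ** len(seq)

    def M_toy(seq):   # mixture with prior weights 1/2, 1/2
        return Fraction(1, 2) * p_pi(seq) + Fraction(1, 2) * p_unif(seq)

    def predict(seq, a):          # analogue of M(xa)/M(x)
        return M_toy(seq + [a]) / M_toy(seq)

    z = [1, 7, 4, 2, 1, 9]        # even indices carry the pi digits 1, 4, 1
    print(predict(z, PI[3]))      # next even position: probability near 1 for the pi digit

On an even (deterministic) position the posterior weight of the π-hypothesis dominates, so the prediction concentrates on the correct digit; on the randomized positions it stays near 1/10, mirroring the behavior of M described above.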
The inductive inference formula using conditional apriori probability can
be viewed as a mathematical form of “Occam’s Razor”: the advice to predict
by the simplest rule fitting the data. It can also be viewed as a realization of Bayes’ Rule, with a universally applicable apriori distribution. Since the distribution M is incomputable, we view it as the main open problem of inductive inference to find maximally efficient approximations to it. Sometimes, even a simple approximation gives nontrivial results (see [3]).
INFORMATION THEORY
LOGIC
ACKNOWLEDGMENT
The form of exposition of the results in these notes and the general point
of view represented in them were greatly influenced by Leonid Levin. More
recent communication with Paul Vitányi, Mathieu Hoyrup, Cristóbal Rojas
and Alexander Shen has also been very important.
1.2 NOTATION
When not stated otherwise, log means base 2 logarithm. The cardinality of a set A will be denoted by |A|. (Occasionally there will be inconsistency, sometimes denoting it by |A|, sometimes by #A.) If A is a set then 1_A(x) is its indicator function:
1_A(x) = 1 if x ∈ A, and 0 otherwise.
The empty string is denoted by Λ. The set A∗ is the set of all finite strings
of elements of A, including the empty string. Let l(x) denote the length of
string x. (Occasionally there will be inconsistency, sometimes denoting it by
|x|, sometimes by l(x)). For sequences x and y, let x ⊑ y denote that x is a prefix of y. For a string x and a (finite or infinite) sequence y, we denote their concatenation by xy. For a sequence x and n ≤ l(x), the n-th element of x is x(n), and
x(i : j) = x(i) · · · x(j).
Sometimes we will also write
x^{≤n} = x(1 : n).
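For readers who like conventions executable, here is a small sketch of the string notation above (a hypothetical rendering of ours; the text uses 1-based indexing, which the helpers follow):

    def l(x):                 # l(x): the length of the string x
        return len(x)

    def is_prefix(x, y):      # the relation  x ⊑ y
        return y[: len(x)] == x

    def segment(x, i, j):     # x(i : j) = x(i) · · · x(j), 1-based, inclusive
        return x[i - 1 : j]

    def initial(x, n):        # x^{≤n} = x(1 : n)
        return segment(x, 1, n)

    print(is_prefix("01", "0110"), segment("abcdef", 2, 4))   # True bcd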
1.3 KOLMOGOROV COMPLEXITY
1.3.1 INVARIANCE
It is natural to try to define the information content of some text as the size of the smallest string (code) from which it can be reproduced by some decoder, or interpreter. We do not want too much information “to be hidden in the decoder”; we want it to be a “simple” function. Therefore, unless stated otherwise, we require that our interpreters be computable, that is, partial recursive functions.
Partial recursive functions are relatively simple; on the other hand, the
class of partial recursive functions has some convenient closure properties.
Definition 1.3.2. For any binary interpreter A and strings x, y ∈ S, the number
K_A(x | y) = min{ l(p) : A(p, y) = x }
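Since the minimum ranges over finitely many candidate programs of each length, K_A can be computed by brute force for toy interpreters. The two-branch interpreter below is an invented example of ours, not the optimal machine of the text:

    from itertools import product

    def A(p, y):                        # toy interpreter: "1"+z prints z, "0"+z prints y+z
        if p.startswith("1"):
            return p[1:]
        if p.startswith("0"):
            return y + p[1:]
        return None                     # undefined on the empty program

    def K_A(x, y="", max_len=12):       # K_A(x | y) = min{ l(p) : A(p, y) = x }
        for n in range(max_len + 1):
            for bits in product("01", repeat=n):
                if A("".join(bits), y) == x:
                    return n
        return float("inf")             # no description of length <= max_len found

    print(K_A("0011"))                  # 5: the program must spell the string out
    print(K_A("0011", y="00"))          # 3: the condition y supplies a prefix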
The number K_A(x) measures the length of the shortest description for x when the algorithm A is used to interpret the descriptions.
The value K_A(x) depends, of course, on the underlying function A. But, as Theorem 1.3.1 shows, if we restrict ourselves to sufficiently powerful interpreters A, then switching between them can change the complexity function only by amounts bounded by some additive constant. Therefore complexity can be considered an intrinsic characteristic of finite objects.
Definition 1.3.3. A binary p.r. interpreter U is called optimal if for any binary p.r. interpreter A there is a constant c < ∞ such that for all x, y we have
K_U(x | y) ≤ K_A(x | y) + c. (1.3.1)
Proof. The idea is to use interpreters that come from universal partial recursive functions. However, not all such functions can be used for universal interpreters. Let us introduce an appropriate pairing function. For x ∈ B^n, let
x^o = x(1)0x(2)0 . . . x(n−1)0x(n)1
where x^o = x1 for l(x) = 1. Any binary string p can be uniquely represented in the form p = a^o b.
We know there is a p.r. function V : S_2 × S_2 × S → S that is universal: for any p.r. binary interpreter A, there is a string a such that for all p, x, we have A(p, x) = V(a, p, x). Let us define the function U(p, x) as follows. We represent the string p in the form p = u^o v and define U(p, x) = V(u, v, x). Then U is a p.r. interpreter. Let us verify that it is optimal. Let A be a p.r. binary interpreter, and a a binary string such that A(p, x) = V(a, p, x) for all p, x. Let x, y be two strings. If K_A(x | y) = ∞, then (1.3.1) holds trivially. Otherwise, let p be a binary string of length K_A(x | y) with A(p, y) = x. Then we have
and
K_U(x | y) ≤ 2l(a) + K_A(x | y).
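The encoding x ↦ x^o and the corresponding unique decomposition p = a^o b are easy to implement; a sketch (the function names are ours):

    def circ(x):                  # x^o = x(1)0 x(2)0 ... x(n-1)0 x(n)1
        return "".join(b + ("1" if i == len(x) - 1 else "0")
                       for i, b in enumerate(x))

    def split(p):                 # recover (a, b) from p = a^o b
        a, i = [], 0
        while i + 1 < len(p) or i + 1 == len(p):
            a.append(p[i])
            if p[i + 1] == "1":   # marker bit: everything after it is b
                return "".join(a), p[i + 2:]
            i += 2
        raise ValueError("not of the form a^o b")

    x = "101"
    p = circ(x) + "0111"
    print(split(p))               # ('101', '0111')

The marker bit doubles the length of a, which is exactly where the factor 2l(a) in the bound above comes from.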
Definition 1.3.4. We fix an optimal binary p.r. interpreter U and write K(x | y) for K_U(x | y). ù
Theorem 1.3.1 (as well as other invariance theorems) is used in AIT for much more than just to show that K_A(x) is a proper concept. It is the principal tool to find upper bounds on K(x); this is why most such upper bounds are proved to hold only to within an additive constant.
The optimal interpreter U(p, x) defined in the proof of Theorem 1.3.1 is obviously a universal partial recursive function. Because of its convenient properties we will use it from now on as our standard universal p.r. function, and we will refer to an arbitrary p.r. function as U_p(x) = U(p, x).
l(cnv_s^r(x)) ≤ (log r / log s) l(x) + 1.
Now, with r-ary strings as descriptions, we must define K_A(x) as the minimum of l(p) log r over all programs p in S_r with A(p) = x. The equivalence of the definitions for different values of r is left as an exercise.
(b) For any positive real number v and string y, every finite set E of size m has at least m(1 − 2^{−v+1}) elements x with K(x | y) ≥ log m − v.
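The counting behind part (b) can be checked mechanically: whatever the interpreter, there are fewer than 2^k binary descriptions of length below k, so only a small fraction of a large set can be compressed by v bits. A sketch with concrete numbers:

    n, v = 10, 3
    num_strings = 2 ** n                      # take E = all binary strings of length n
    short_descriptions = 2 ** (n - v) - 1     # all binary programs of length < n - v
    fraction_incompressible = 1 - short_descriptions / num_strings
    print(f"at least {fraction_incompressible:.4f} of the strings have K >= {n - v}")
    # prints roughly 0.8751, i.e. the 1 - 2^{-v} flavour of the bound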
Theorem 1.3.2 suggests that if there are so few strings of low complexity,
then K(n) converges fast to ∞. In fact this convergence is extremely slow, as
shown in a later section.
Proof of Theorem 1.3.2. First we prove (a). Let the interpreter A be defined such that if β(m) = p then A(p, y) = m. Since |β(m)| = ⌈log m⌉, we have
K(m) <⁺ K_A(m) < log m + 1.
Part (b) says that the trivial estimate (1.3.3) is quite sharp for most numbers under m. The reason for this is that since there are only few short programs, there are only few objects of low complexity. For any string y, and any positive real number u, we have
To see this, let n = ⌊log u⌋. The number 2^{n+1} − 1 of different binary strings of length ≤ n is an upper bound on the number of different shortest programs of length ≤ u. Now (b) follows immediately from (1.3.4).
Indeed, let the interpreter A be defined such that if p = β(r)^o cnv_2^r(x) then A(p, y) = x. We have K_A(x) ≤ 2l(β(r)) + n log r + 1 ≤ (n + 2) log r + 3. Since K(x) <⁺ K_A(x), we are done.
We will have many more upper bound proofs of this form, and will not always explicitly write out the reference to K(x | y) <⁺ K_A(x | y), needed for the last step.
Apart from the conversion, inequality (1.3.5) says that since the begin-
ning of a program can command the optimal interpreter to copy the rest, the
complexity of a binary sequence x of length n is not larger than n.
1.4 SIMPLE PROPERTIES OF INFORMATION
Inequality (1.3.5) contains the term 2 log r instead of log r since we have to apply the encoding w^o to the string w = β(r); otherwise the interpreter cannot detach it from the program p. We could use the code β(|w|)^o w, which is also a prefix code, since the first, detachable part of the codeword tells its length.
For binary strings x, natural numbers n and real numbers u ≥ 1 we define
J(u) = u + 2 log u,
ι(x) = β(|x|)^o x,
ι(n) = ι(β(n)).
We have l(ι(x)) <⁺ J(l(x)) for binary sequences x, and l(ι(r)) <⁺ J(log r) for numbers r. Therefore (1.3.5) is true with J(log r) in place of 2 log r. Of course, J(x) could be replaced by still smaller functions, for example x + J(log x). We return to this topic later.
Further,
K(x | y, z) <⁺ K(x | U_p(y), z) + J(l(p)),
K(U_p(x)) <⁺ K(x) + J(l(p)),
K(x | z) <⁺ K(x, y | z), (1.4.1)
K(x | y, z) <⁺ K(x | y),
K(x, x) =⁺ K(x), (1.4.2)
K(x, y | z) =⁺ K(y, x | z),
K(x | y, z) =⁺ K(x | z, y),
K(x, y | x, z) =⁺ K(y | x, z),
K(x | x, z) =⁺ K(x | x) =⁺ 0.
and in particular,
K(x, y) <⁺ J(K(x)) + K(y).
This corollary implies the following continuity property of the function K(n): for any natural numbers n, h, we have
|K(n + h) − K(n)| <⁺ J(log h). (1.4.3)
Indeed, n + h is a recursive function of n and h. The term 2 log K(x) making up the difference between K(x) and J(K(x)) in Theorem 1.4.2 is attributable to the fact that minimal descriptions cannot be concatenated without losing an “endmarker”. It cannot be eliminated, since there is a constant c such that for all n, there are binary strings x, y of length ≤ n with
K(x) + K(y) + log n < K(x, y) + c.
Indeed, there are n2^n pairs of binary strings whose sum of lengths is n. Hence by Theorem 1.3.2(b), there will be a pair (x, y) of binary strings whose sum of lengths is n, with K(x, y) > n + log n − 1. For these strings, inequality (1.3.5) implies K(x) + K(y) <⁺ l(x) + l(y). Hence K(x) + K(y) + log n <⁺ n + log n < K(x, y) + 1.
Regularities in a string will, in general, decrease its complexity radically. If the whole string x of length n is given by some rule, that is we have x(k) = U_p(k) for some recursive function U_p, then
K(x) <⁺ K(n) + J(l(p)).
Indeed, let us define the p.r. function V(q, k) = U_q(1) . . . U_q(k). Then x = V(p, n) and the above estimate follows from Corollary 1.4.1.
For another example, let x = y_1 y_1 y_2 y_2 · · · y_n y_n, and y = y_1 y_2 · · · y_n. Then K(x) =⁺ K(y) even though the string x is twice as long. This follows from Corollary 1.4.1, since x and y can be obtained from each other by a simple recursive operation.
Not all “regularities” decrease the complexity of a string, only those which distinguish it from the mass of all strings, putting it into some class which is both small and algorithmically definable. For a binary string of length n, it is nothing unusual to have its number of 0’s between n/2 − √n and n/2. Therefore such strings can have maximal or almost maximal complexity. If x has k zeroes then the inequality
K(x) <⁺ log (n choose k) + J(log n) + J(log k)
W_p = { U(p, x) : x ∈ N }. (1.4.4)
It follows from Theorem 1.3.2 that whenever the set E_a is finite, the estimate of Theorem 1.4.3 is sharp for most of its elements x.
The shortest descriptions of a string x seem to carry some extra information in them, above the description of x. This is suggested by the following strange identities.
Theorem 1.4.4. We have K(x, K(x)) =⁺ K(x).
Proof. The inequality >⁺ follows from (1.4.1). To prove <⁺, let A(p) = ⟨U(p), l(p)⟩. Then K_A(x, K(x)) ≤ K(x).
Theorem 1.4.5.
K(y | x, i − K(y | x, i)) <⁺ K(y | x, i).
1.5 ALGORITHMIC PROPERTIES OF COMPLEXITY
β = { (p, q) : p, q ∈ Q, p < q }.
This says that if f(x) ∈ J then sooner or later we will find an interval I ∈ F_J with the property that f(z) ∈ J for all z ∈ I. Note that computability implies continuity.
This says that if f(x) > r then sooner or later we will find an interval I ∈ F_J with the property that f(z) > q for all z ∈ I.
The following facts are useful and simple to verify:
Proposition 1.5.1.
(a) The function ⌈·⌉ : R → R is lower semicomputable.
(b) If f , g : R → R are computable then their composition f (g(·)) is, too.
(c) If f : R → R is lower semicomputable and g : R → R (or S → R) is
computable then f (g(·)) is lower semicomputable.
(d) If f : R → R is lower semicomputable and monotonic, and g : R → R (or
S → R) is lower semicomputable then f (g(·)) is lower semicomputable.
Proof. By Theorem 1.3.2 there is a constant c such that for any strings x, y, we have K(x | y) < log⟨x⟩ + c. Let some Turing machine compute our optimal binary p.r. interpreter. Let U^t(p, y) be defined as U(p, y) if this machine, when started on input (p, y), gives an output within t steps, and undefined otherwise. Let K^t(x | y) be the smaller of log⟨x⟩ + c and
Theorem 1.5.2. Let U_p(n) be a partial recursive function whose values are numbers smaller than K(n) whenever defined. Then
U_p(n) <⁺ J(l(p)).
Proof. The proof of this theorem resembles “Berry’s paradox ”, which says:
“The least number that cannot be defined in less than 100 words” is a def-
inition of that number in 12 words. The paradox can be used to prove,
whenever we agree on a formal notion of “definition”, that the sentence in
quotes is not a definition.
Let x(p, k) for k = 1, 2, . . . be an enumeration of the domain of U_p. For any natural number m in the range of U_p, let f(p, m) be equal to the first x(p, k) with U_p(x(p, k)) = m. Then by definition, m < K(f(p, m)). On the other hand, f(p, m) is a p.r. function, hence applying Corollary 1.4.2 and (1.3.5) we get
m < K(f(p, m)) <⁺ K(p) + J(K(m)) <⁺ l(p) + 1.5 log m.
The paradox used in the above proof is not equivalent in any trivial way to Russell’s paradox, since the undecidable computably enumerable sets we get from Theorem 1.5.2 are simple. Let f(n) ≤ log n be any recursive function with lim_{n→∞} f(n) = ∞. We show that the set E = { n : K(n) ≤ f(n) } is simple. It follows from Theorem 1.5.1 that E is recursively enumerable, and from Theorem 1.3.2 that its complement is infinite. Let A be any computably enumerable set disjoint from E. The restriction f_A(n) of the function f(n) to A is a p.r. lower bound for K(n). It follows from Theorem 1.5.2 that f_A is bounded. Since f(n) → ∞, this is possible only if A is finite.
Gödel used the diagonal technique to prove the incompleteness of any
sufficiently rich logical theory with a recursively enumerable axiom system.
His technique provides for any sufficiently rich computably enumerable the-
ory a concrete example of an undecidable sentence. Theorem 1.5.2 leads to
a new proof of this result, with essentially different examples of undecidable
propositions. Let us consider any first-order language L containing the lan-
guage of standard first-order arithmetic. Let 〈·〉 be a standard encoding of
the formulas of this theory into natural numbers. Let the computably enumerable theory T_p be the theory for which the set of codes ⟨Φ⟩ of its axioms Φ is the computably enumerable set W_p of natural numbers.
Corollary 1.5.2. There is a constant c < ∞ such that for any p, if any sentences with the meaning “m < K(n)” for m > J(l(p)) + c are provable in theory T_p then some of these sentences are false.
Proof. For some p, suppose that all sentences “m < K(n)” provable in T_p are true. For a sentence Φ, let T_p ⊢ Φ denote that Φ is a theorem of the theory T_p. Let us define the function
where q = bap. (Formally, this statement follows, for example, from the so-called s-m-n theorem.) The function S_q(n) is a lower bound on the complexity function K(n). By an immediate generalization of Theorem 1.5.2 (for semicomputable functions), we have S(q, n) <⁺ J(l(q)) <⁺ J(l(p)).
Proof. Suppose that K(x | y) <⁺ F(x, y) holds. Then (1.5.1) follows from part (b) of Theorem 1.3.2.
Now suppose that (1.5.1) holds for all y, m. Then let E be the computably enumerable set of triples (x, y, m) with F(x, y) < m. It follows from (1.5.1) and Theorem 1.4.3 that for all (x, y, m) ∈ E we have K(x | y, m) <⁺ m. By Theorem 1.4.5 then K(x | y) <⁺ m.
1.6 THE CODING THEOREM

1.6.1 SELF-DELIMITING COMPLEXITY
We would like to see just K(x) on the right-hand side, but the inequality K(x, y) <⁺ K(x) + K(y) does not always hold (see Exercise 5). The problem
is that if we compose a program for (x, y) from those of x and y then we have
to separate them from each other somehow. In this section, we introduce a
variant of Kolmogorov’s complexity discovered by Levin and independently
by Chaitin and Schnorr, which has the property that “programs” are self-
delimiting: this will free us of the necessity to separate them from each other.
Proof. The proof of this theorem is similar to the proof of Theorem 1.3.1. We take the universal partial recursive function V(a, p, x). We transform each function V_a(p, x) = V(a, p, x) into a self-delimiting function W_a(p, x) = W(a, p, x), but so as not to change the functions V_a which are already self-delimiting. Then we form T from W just as we formed U from V. It is easy to check that the function T thus defined has the desired properties. Therefore we are done if we construct W.
Let some Turing machine M compute y = V_a(p, x) in t steps. We define W_a(p, x) = y if l(p) ≤ t and if M does not compute in t or fewer steps any output V_a(q, x) from any extension or prefix q of length ≤ t of p. If V_a is self-delimiting then W_a = V_a. But W_a is always self-delimiting. Indeed, suppose that p_0 ⊑ p_1 and that M computes V_a(p_i, x) in t_i steps. The value W_a(p_i, x) can be defined only if l(p_i) ≤ t_i. The definition of W_a guarantees that if t_0 ≤ t_1 then W_a(p_1, x) is undefined, otherwise W_a(p_0, x) is undefined.
Let us remark that the s.d. interpreter defined in Theorem 1.6.1 has the
following stronger property. For any other s.d. interpreter G there is a string
g such that we have
G(x) = T (g x) (1.6.2)
for all x. We could say that the interpreter T is universal.
Let us show that the functions K and H are asymptotically equal.
Theorem 1.6.2.
K <⁺ H <⁺ J(K). (1.6.3)
Proof. Obviously, K <⁺ H. We define a self-delimiting p.r. function F(p, x). The machine computing F tries to decompose p into the form p = u^o v such that u is the number l(v) in binary notation. If it succeeds, it outputs U(v, x). We have
H(y | x) <⁺ K_F(y | x) <⁺ K(y | x) + 2 log K(y | x).
Let us review some of the relations proved in Sections 1.3 and 1.4. Instead of the simple estimate K(x) <⁺ n for a binary string x of length n, only the obvious consequence of Theorem 1.6.2 holds, that is
H(x) <⁺ J(n).
We must similarly change Theorem 1.4.3. We will use the universal p.r. function T_p(x) = T(p, x). The r.e. set V_p is the range of the function T_p. Let E = V_p be an enumerable set of pairs of strings. The section E_a is defined as in (1.4.5). Then for all x ∈ E_a, we have
H(x | a) <⁺ J(log |E_a|) + l(p). (1.6.4)
all the µ_p.
With this notation, we can make it a little more precise in which sense this measure dominates all other constructive semimeasures.
Similar notation will be used for other objects: for example, now m(f) makes sense for a recursive function f. ù
Theorem 1.6.4. For all constructive semimeasures ν and for all strings x, we have
m(ν) ν(x) <* m(x). (1.6.7)
Proof. Let us repeat the proof of Theorem 1.6.3, using m(p) in place of δ(p). We obtain a new constructive semimeasure m′(x) with the property m(ν) ν(x) ≤ m′(x) for all x. Noting m′(x) =* m(x) finishes the proof.
Corollary 1.6.5. We have − log m(x) <⁺ H(x).
|p_j| ≤ − log w_j + 2. (1.6.11)
Proof. We follow the construction above and cut off disjoint, adjacent (not necessarily binary) intervals I_j of length w_j from the left end of [0, 1]. Let v_j be the length of the longest binary intervals contained in I_j. Let p_j be the binary word corresponding to the first one of these. Four or fewer intervals of length v_j cover I_j. Therefore (1.6.11) holds.
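The interval construction can be carried out directly. The following sketch is our own rendering under the simplifying assumption that we pick a binary interval of size 2^{−k} with 2^{−k} ≤ w_j/2, which surely fits; it produces a prefix-free code obeying the bound (1.6.11):

    from math import ceil, log2

    def codewords(weights):                  # weights w_j with sum <= 1
        codes, left = [], 0.0
        for w in weights:
            k = ceil(-log2(w)) + 1           # 2^-k <= w/2, so a binary interval fits
            j = ceil(left * 2 ** k)          # first multiple of 2^-k at or after `left`
            codes.append(format(j, "0{}b".format(k)))
            left += w                        # move to the next adjacent interval
        return codes

    print(codewords([0.5, 0.25, 0.125]))     # ['00', '100', '1100'], prefix-free

The codeword lengths are ⌈−log w_j⌉ + 1 ≤ −log w_j + 2, and the code is prefix-free because distinct codewords name disjoint binary subintervals of [0, 1].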
Proof. We construct a self-delimiting p.r. function F(p) with the property that K_F(x) ≤ − log m(x) + 4. The function F to be constructed is the decoding
Let us cut off consecutive adjacent, disjoint intervals I_t of length 2^{−k_t−1} from the left side of the interval [0, 1]. We define F as follows. If [p] is a largest binary subinterval of some I_t then F(p) = z_t. Otherwise F(p) is undefined.
The function F is obviously self-delimiting and partial recursive. It follows from the construction that for every x there is a t with z_t = x and 0.5 m(x) < 2^{−k_t}. Therefore, for every x there is a p such that F(p) = x and |p| ≤ − log m(x) + 4.
Definition 1.6.10. Let P_T(x) be the probability that the self-delimiting machine T gives out result x after receiving random bits as inputs. ù
We can write P_T(x) = ∑ W_T(x) where W_T(x) is the set { 2^{−l(p)} : T(p) = x }. Since 2^{−H(x)} = max W_T(x), we have − log P_T(x) ≤ H(x). The semimeasure P_T is constructive, hence P_T <* m, hence
H <⁺ − log m <⁺ − log P_T ≤ H,
hence
− log P_T =⁺ − log m =⁺ H.
Hence the sum P_T(x) of the set W_T(x) is at most a constant times larger than its maximal element 2^{−H(x)}. This relation was not obvious in advance. The outcome x might have high probability because it has many long descriptions. But we found that then it must have a short description too. In what follows it will be convenient to fix the definition of m(x) as follows.
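The behavior of P_T as an output distribution can be checked on a toy prefix machine, exactly as the definition says, by feeding it coin flips. The three-branch machine below is an invented stand-in of ours for the optimal T, not the text’s construction:

    import random
    from collections import Counter

    def toy_T(flip):                 # a tiny self-delimiting machine
        if flip() == "0":            # program "0"  -> output "a"
            return "a"
        if flip() == "0":            # program "10" -> output "b"
            return "b"
        return "c"                   # program "11" -> output "c"

    counts = Counter()
    trials = 100_000
    for _ in range(trials):
        counts[toy_T(lambda: random.choice("01"))] += 1
    print({x: round(c / trials, 3) for x, c in counts.items()})
    # approximately {'a': 0.5, 'b': 0.25, 'c': 0.25}, the exact values of P_T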
1.7 THE STATISTICS OF DESCRIPTION LENGTH

Theorem 1.7.1. Let f(x, n) be the number of binary strings p of length n with T(p) = x, for the universal s.d. interpreter T defined in Theorem 1.6.1. Then for every n ≥ H(x), we have
log f(x, n) =⁺ n − H(x, n). (1.7.1)
Using H(x) <⁺ H(x, n), which holds just as (1.4.1), and substituting n = H(x), we obtain log f(x, H(x)) =⁺ H(x) − H(x, H(x)) <⁺ 0, implying the following.
Since the number log f(x, H(x)) is nonnegative, we also derived the identity
H(x, H(x)) =⁺ H(x) (1.7.2)
which could have been proven also in the same way as Theorem 1.4.4.
Proof of Theorem 1.7.1. It is more convenient to prove equation (1.7.1) in the form
2^{−n} f(x, n) =* m(x, n) (1.7.3)
where m(x, y) = m(⟨x, y⟩). First we prove <*. Let us define a p.r. self-delimiting function F by F(p) = ⟨T(p), l(p)⟩. Applying F to a coin-tossing argument, 2^{−n} f(x, n) is the probability that the pair ⟨x, n⟩ is obtained. Therefore the left-hand side of (1.7.3), as a function of ⟨x, n⟩, is a constructive semimeasure, dominated by m(x, n).
Now we prove >*. We define a self-delimiting p.r. function G as follows. The machine computing G(p) tries to decompose p into three segments p = β(c)^o v w in such a way that T(v) is a pair ⟨x, l(p) + c⟩. If it succeeds then it outputs x. By the universality of T, there is a binary string g such that T(gp) = G(p) for all p. Let r = l(g). For an arbitrary pair x, n, let q be a shortest binary string with T(q) = ⟨x, n⟩, and w an arbitrary string of length
l = n − l(q) − r − l(β(r)^o).
Then G(β(r)^o q w) = T(gβ(r)^o q w) = x. Since w is arbitrary here, there are 2^l binary strings p of length n with T(p) = x.
How many objects are there with a given complexity n? We can answer
this question with a good approximation.
Lemma 1.7.2. ∑_y m(x, y) =* m(x).
Proof of Theorem 1.7.2: Let d_n = |D_n|. Using Lemma 1.7.2 and Theorem 1.7.1 we have
g_T(n) ≤ d_n = ∑_x f(x, n) =* 2^n ∑_x m(x, n) =* 2^n m(n).
otherwise; then K_F(p) ≤ n for p ∈ D_n. By H(p, l(p)) =⁺ H(p) and a variant of Lemma 1.7.2, we have
∑_{p∈D_n} m(p) =* ∑_{p∈D_n} m(p, n) =* m(n).
Using the expression (1.7.4) for d_n, we see that there is a constant c_1 such that
∑_{p∈D_n} d_n^{−1} 2^{−H(p)} ≤ 2^{−n+c_1}.
From this, Markov’s Inequality gives that the number of strings p ∈ B^n with H(p) < n − k is at most c 2^{n−k}.
There is a more polished way to express this result.
Theorem 1.7.3. We have H^+(n) =⁺ log n + H(⌊log n⌋).
Proof. To prove <⁺, we construct a s.d. interpreter as follows. The machine computing F(p) finds a decomposition uv of p such that T(u) = l(v), then outputs the number whose binary representation (with possible leading zeros) is the binary string v. With this F, we have
H(k) <⁺ K_F(k) ≤ log n + H(⌊log n⌋) (1.7.7)
for all k ≤ n. The numbers between n/2 and n are the ones whose binary representation has length ⌊log n⌋ + 1. Therefore if the bound (1.7.6) is sharp for most x then the bound (1.7.7) is sharp for most k.
The sharp monotonic estimate log n + H(blog nc) for H(n) is less satisfac-
tory than the sharp estimate log n for Kolmogorov’s complexity K(n), because
it is not a computable function. We can derive several computable upper
bounds from it, for example log n+2 log log n, log n+log log n+2 log log log n,
etc. but none of these is sharp for large n. We can still hope to find comput-
able upper bounds of H which are sharp infinitely often. The next theorem
provides one.
Theorem 1.7.4 (see [46]). There is a computable upper bound G(n) of the
function H(n) with the property that G(n) = H(n) holds for infinitely many n.
For the proof, we make use of a s.d. Turing machine T computing the
function T . Similarly to the proof of Theorem 1.5.1, we introduce a time-
restricted complexity.
Definition 1.7.3. We know that for some constant c, the function H(n) is bounded by 2 log n + c. Let us fix such a c.
For any number n and s.d. Turing machine M, let H_M(n; t) be the minimum of 2 log n + c and the lengths of descriptions from which M computes n in t or fewer steps. ù
Definition 1.7.4. Let us agree now that the universal p.r. function is com-
puted by a universal Turing machine V with the property that for any Turing
With these definitions, for each s.d. Turing machine M there are constants m_0, m_1 with
H(n; m_0 t) ≤ H_M(n; t) + m_1. (1.7.8)
The function H(n; t) is nonincreasing in t.
Proof of Theorem 1.7.4. First we prove that there are constants c_0, c_1 such that H(n; t) < H(n; t − 1) implies
2 RANDOMNESS
2.1 UNIFORM DISTRIBUTION

saying that there be at most 2^{n−k} binary sequences x of length n with d(x) > k. Under this condition, we even allow d(x) to take the value ∞.
To avoid arbitrariness in the distinction between random and nonrandom, the function d(x) must be simple. We assume therefore that the set { (n, k, x) : l(x) = n, d(x) > k } is recursively enumerable, or, which is the same, that the function
d : Ω → (−∞, ∞]
is lower semicomputable.
Remark 2.1.1. We do not assume that the set of strings x ∈ Bn with small
deficiency of randomness is enumerable, because then it would be easy to
construct a “random” sequence. ù
Definition 2.1.2. Let us define the following function for binary strings x:
Theorem 2.1.1 says that under very general assumptions about random-
ness, those strings x are random whose descriptional complexity is close to
its maximum, l(x). The more “regularities” are discoverable in a sequence,
the less random it is. However, this is true only of regularities which decrease
the descriptional complexity. “Laws of randomness” are regularities whose
probability is high, for example the law of large numbers, the law of iterated
logarithm, the arcsine law, etc. A random sequence will satisfy all such laws.
Example 2.1.2. The law of large numbers says that in most binary sequences of length n, the number of 0’s is close to the number of 1’s. We prove this law of probability theory by constructing a ML-test d(x) taking large values on sequences in which the number of 0’s is far from the number of 1’s.
Instead of requiring (2.1.1) we require a somewhat stronger property of d(x): for all n,
∑_{x∈B^n} P(x) 2^{d(x)} ≤ 1. (2.1.3)
We will also use Markov’s inequality: for any nonnegative function f and λ > 0,
∑{ P(x) : f(x) > E_P(f)/λ } ≤ λ. (2.1.4)
For any string x ∈ S, and natural number i, let N(i | x) denote the number of occurrences of i in x. For a binary string x of length n define p_x = N(1 | x)/n, and
P_x(y) = p_x^{N(1|y)} (1 − p_x)^{N(0|y)},
d(x) = log P_x(x) + n − log(n + 1).
The test d(x) expresses the (weak) law of large numbers in a rather strong version. We rewrite d(x) as
d(x) = n(1 − h(p_x)) − log(n + 1)
where h(p) = −p log p − (1 − p) log(1 − p). The entropy function h(p) achieves its maximum 1 at p = 1/2. Therefore the test d(x) tells us that the probability of sequences x with p_x < p < 1/2 for some constant p is bounded by the exponentially decreasing quantity (n + 1)2^{−n(1−h(p))}. We also see that since for some constant c we have
1 − h(p) > c(p − 1/2)^2,
therefore if the difference |p_x − 1/2| is much larger than
√(log n / (c n))
then the sequence x is “not random”, and hence, by Theorem 2.1.1, its complexity is significantly smaller than n. ù
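The test is directly computable; a short sketch evaluating d(x) = n(1 − h(p_x)) − log(n + 1) on a balanced and on a constant string:

    from math import log2

    def h(p):                                 # the binary entropy function
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    def d(x):                                 # the test of Example 2.1.2
        n = len(x)
        p_x = x.count("1") / n
        return n * (1 - h(p_x)) - log2(n + 1)

    print(d("01" * 50))     # balanced string: negative, nothing to object to
    print(d("1" * 100))     # all ones: about 93.3, a gross violation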
2.2 COMPUTABLE DISTRIBUTIONS

Later we will consider probability distributions P over the space N^N, and they will also be characterized by a nonnegative function P(x) over S. However, the above condition will be replaced by a different one.
We assume P to be computable, with some Gödel number e. We want to define a test d(x) of randomness with respect to P. It will measure how justified is the assumption that x is the outcome of an experiment with distribution P.
Theorem 2.2.1. The function d_P(x) = − log P(x) − H(x) is a universal integrable test for any fixed computable probability distribution P. More exactly, it is lower semicomputable, satisfies (2.2.1), and for all integrable tests d(x) for P, we have
d(x) <⁺ d_P(x) + H(d) + H(P). (2.2.3)
Proof. Let d(x) be an integrable test for P. Then ν(x) = 2^{d(x)} P(x) is a constructive semimeasure, and it has a self-delimiting program of length <⁺ H(d) + H(P). It follows (using the definition (1.6.6)) that m(ν) >* 2^{−H(d)−H(P)}; hence inequality (1.6.7) gives
ν(x) 2^{−H(d)−H(P)} <* m(x).
m(x) ≤ k P(x)
holds with probability not smaller than 1 − 1/k. If P is computable then the inequality
m(P) P(x) <* m(x)
holds for all x. Therefore the following applications are obtained.
• If we assume x to be the outcome of an experiment with some simple
computable probability distribution P then m(x) is a good estimate of
P(x). The goodness depends on how simple P is to define and how
random x is with respect to P: how justified is the assumption. Of
course, we cannot compute m(x) but it is nevertheless well defined.
d_0(x) = n − K(x | n)
for x ∈ B^n, and set it ∞ for x outside B^n. Now the similarity to the expression is striking, because n = − log P_n(x). Together with the remark made in the previous paragraph, this observation suggests that the value of − log m(x) is close to the complexity K(x).
Hence d̄_P(x) is a test, and hence d̄_P(x) <⁺ d_P(x) + H(f) + H(P), by the universality of the test d_P(x). (Strictly speaking, we must modify the proof of Theorem 2.2.1 and see that a program of length H(f) + H(P) is also sufficient to define the semimeasure P(x)d(x | P).)
P(x) = 2^{−n} / (n(n + 1)).
This distribution is uniform on strings of equal length. Let f(x) be the function that erases the first k bits. Then for n ≥ 1 and for any string x of length n we have
(f∗P)(x) = 2^{−n} / ((n + k)(n + k + 1)).
(The empty string has the rest of the weight of f∗P.) The new distribution is still uniform on strings of equal length, though the total probability of strings of length n has changed somewhat. We certainly expect randomness conservation here: after erasing the first k bits of a random string of length n + k, we should still get a random string of length n. ù
The computable function f applied to a string x can be viewed as some kind of transformation x ↦ f(x). As we have seen, it does not make x less random. Suppose now that we introduce randomization into the transformation process itself: for example, the machine computing f(x) can also toss some coins. We want to say that randomness is also conserved under such more general transformations, but first of all, how do we express such a transformation mathematically? We will describe it by a “matrix”: a computable probability transition function T(x, y) ≥ 0 with the property that
∑_y T(x, y) = 1.
Now, the image of the distribution P under this transformation can be written as T∗P, and defined as follows:
(T∗P)(y) = ∑_x P(x) T(x, y).
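For finitely many strings this is just a matrix-vector product; a sketch (the noisy bit-flip kernel is our illustrative choice, not from the text):

    def push_forward(P, T):                    # (T*P)(y) = sum_x P(x) T(x, y)
        out = {}
        for x, px in P.items():
            for y, t in T[x].items():
                out[y] = out.get(y, 0.0) + px * t
        return out

    P = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
    flip = lambda b: "1" if b == "0" else "0"
    T = {x: {x: 0.9, x[0] + flip(x[1]): 0.1} for x in P}   # flip the last bit w.p. 0.1
    print(push_forward(P, T))                  # still uniform, by symmetry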
t_P(x) = 2^{d_P(x)} = m(x) / P(x). (2.2.7)
Proof. Let us denote the function on the left-hand side of (2.2.8) by t̄_P(x). It is lower semicomputable by its very construction, using a program of length <⁺ H(T) + H(P). Let us check that it satisfies ∑_x P(x) t̄_P(x) ≤ 1.
It follows that d̄_P(x) = log t̄_P(x) is an integrable test and hence d̄_P(x) <⁺ d_P(x) + H(T) + H(P). (See the remark at the end of the proof of Theorem 2.2.2.)
Corollary 2.2.5. There is a constant c_0 such that for every integer k ≥ 0, for all x we have
∑{ T(x, y) : d_{T∗P}(y) − d_P(x) > k + H(T) + H(P) } ≤ 2^{−k+c_0}.
2.3 INFINITE SEQUENCES
These lecture notes treat the theory of randomness over continuous spaces
mostly separately, starting in Section 4.1. But the case of computable mea-
sures over infinite sequences is particularly simple and appealing, so we give
some of its results here, even if most follow from the more general results
given later.
In this section, we will fix a finite or countable alphabet Σ = {s_1, s_2, . . .}, and consider probability distributions over the set
X = Σ^N.
Our goal here is to illustrate the measure theory developed in Section A.2, through Subsection A.2.3, on the concrete example of the set of infinite sequences, and then to develop the theory of randomness in it.
We will distinguish a few simple kinds of subsets of the set X of sequences, those for which probability can be defined especially easily.
Definition 2.3.1.
• For a string z ∈ Σ∗, we will denote by zX the set of elements of X with prefix z. Such sets will be called cylinder sets.
51
2. RANDOMNESS
Example 2.3.1. An example of an open set that is not finitely determined is the set E of all sequences that contain the substring 11. ù
The following observations are easy to prove.
Proposition 2.3.2.
(a) Every open set G can be represented as the union of a sequence of disjoint
cylinder sets.
(b) The set of open sets is closed with respect to finite intersection and arbitrary union.
(c) Each finitely determined set is both open and closed.
(d) The class F of finitely determined sets forms an algebra.
µ(sX ) = µ(s)
also an s_1 ∈ Σ that is not good. But then there is also an s_2 ∈ Σ such that s_1s_2 is not good. And so on, we obtain an infinite sequence s_1s_2 · · · such that s_1 · · · s_n is not good for any n. But then the sequence zs_1s_2 · · · is not contained in ∪_i x_iX, contrary to the assumption.
Corollary 2.3.4. Two disjoint union representations of the same open set
∪_i x_iX = ∪_i y_iX
imply ∑_i µ(x_i) = ∑_i µ(y_i).
Proof. The set ∪_i x_iX can also be written as
∪_i x_iX = ∪_{i,j} x_iX ∩ y_jX.
The right-hand side is also equal similarly to ∑_j µ(y_j).
Examples 2.3.6. Let Σ = {0, 1}, and let us define the measure λ by λ(x) = 2^{−n} for all n and for all x ∈ Σ^n.
1. For every infinite sequence ξ ∈ X, the one-element set {ξ} is a null set with respect to λ. Indeed, for each natural number n, let H_n = { η ∈ X : η(0) = ξ(0), η(1) = ξ(1), . . . , η(n) = ξ(n) }. Then λ(H_n) = 2^{−n−1}, and {ξ} = ∩_n H_n.
2. For a less trivial example, consider the set E of those elements t ∈ X that are 1 in each positive even position, that is
Then E is a null set. Indeed, for each natural number n, let G_n = { t ∈ X : t(2) = 1, t(4) = 1, . . . , t(2n) = 1 }. This helps expressing E as E = ∩_n G_n, where λ(G_n) = 2^{−n}.
ù
the measure λ) every sequence might be contained in a null set, namely the
one-element set consisting of itself. But most of these sets are not defined
simply at all. An example of a simply defined null set is given in part 2 of Example 2.3.6. These reflections justify the following definition, in which
“simple” is specified as “constructive”.
It is easy to see that the set of nonrandom sequences is a null set. Indeed, there is only a countable number of constructive null sets, so even their union is a null set. The following theorem strengthens this observation significantly.
Theorem 2.3.1. Let us fix a computable probability measure P. The set of all
nonrandom sequences is a constructive null set.
Thus, there is a universal constructive null set, containing all other con-
structive null sets. A sequence is random when it does not belong to this
set.
Proof of Theorem 2.3.1. The proof uses the projection technique that has ap-
peared several times in this book, for example in proving the existence of a
universal lower semicomputable semimeasure.
We know that it is possible to list all recursively enumerable subsets of
the set N × Σ∗ into a sequence, namely that there is a recursively enumerable
set ∆ ⊆ N² × Σ∗ with the property that for every recursively enumerable set Γ ⊆ N × Σ∗ there is an e with Γ = { (m, x) : (e, m, x) ∈ ∆ }. We will write
∆_{e,m} = { x : (e, m, x) ∈ ∆ },
D_{e,m} = ∪_{x∈∆_{e,m}} xX.
We transform the set ∆ into another set ∆′ with the following property.
• For each e, m we have ∑_{(e,m,x)∈∆′} µ(x) ≤ 2^{−m+1}.
• If for some e we have ∑_{(e,m,x)∈∆} µ(x) ≤ 2^{−m} for all m, then for all m we have ∆′_{e,m} = ∆_{e,m}.
Let
Γ̂ = { (m, x) : ∃e (e, m + e + 2, x) ∈ ∆′ }.
Then Ĝ_m = ∪_e D′_{e,m+e+2}. For all m we have
∑_{x∈Γ̂_m} µ(x) = ∑_e ∑_{(e,m+e+2,x)∈∆′} µ(x) ≤ ∑_e 2^{−m−e−1} = 2^{−m}.
This shows that Γ̂ defines a constructive null set. Let Γ be any other recursively enumerable subset of N × Σ∗ that defines a constructive null set. Then there is an e such that for all m we have G_m = D′_{e,m}. The universality follows now from
∩_m G_m = ∩_m D′_{e,m} ⊆ ∩_m D′_{e,m+e+2} ⊆ ∩_m Ĝ_m.
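The weighted shift m ↦ m + e + 2 in this proof is the standard trick for merging countably many tests into one. For finitely many tests the bookkeeping looks as in the sketch below (our own rendering): since ∑_e 2^{−m−e−2} ≤ 2^{−m}, the combined test obeys the same tail bound as each ingredient.

    def combine(tests):
        # tests: a list of functions d_e with P(d_e(x) > m) <= 2^-m for all m.
        # Charging the e-th test a penalty of e + 2 keeps the union bound:
        # P(combined(x) > m) <= sum_e 2^{-(m+e+2)} <= 2^-m.
        return lambda x: max(d(x) - e - 2 for e, d in enumerate(tests))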
2.3.3 COMPUTABILITY
Measurable functions are quite general; it is worth introducing some more restricted kinds of function over the set of sequences.
2.3.4 INTEGRAL
The definition of integral over a measure space is given in the Appendix, in
Subsection A.2.3. Here, we give an exposition specialized to infinite sequen-
ces.
Proposition A.2.7, when specified to our situation here, says the follow-
ing.
It is called a Martin-Löf test if, instead of the latter condition, only the weaker one is satisfied saying that P(d(x) > m) < 2^{−m} for each positive integer m.
It is universal if it dominates all other integrable tests to within an additive constant. Universal Martin-Löf tests are defined in the same way. ù
Proof. Let d be a randomness test: then for each k the set G_k = { ξ : d(ξ) > k } is a constructive open set, and by Markov’s inequality (which is proved in the continuous case just as in the discrete case) we have P(G_k) ≤ 2^{−k}. The sets G_k witness that N is a constructive null set.
Let N be a constructive null set with N ⊆ ∩_{k=1}^∞ G_k, where G_k is a uniform sequence of constructive open sets with P(G_k) = 2^{−k}. Without loss of generality assume that the sequence G_k is decreasing. Then the function d(ξ) = sup{ k : ξ ∈ G_k } is lower semicomputable and satisfies P{ ξ : d(ξ) ≥ k } ≤ 2^{−k}, so it is a Martin-Löf test. Just as in Proposition 2.2.2, it is easy to check that d(x) − 2 log d(x) − c is an integrable test for some constant c.
Just as there is a universal constructive null set, there are universal ran-
domness tests.
Theorem 2.3.4. Let X = Σ^N be the set of infinite sequences. For all computable measures µ over X, we have
d_µ(ξ) =⁺ sup_n ( − log µ(ξ_{1:n}) − H(ξ_{1:n}) ). (2.3.4)
Here, the constant in =⁺ depends on the computable measure µ.
Corollary 2.3.13. Let λ be the uniform distribution over the set of infinite binary sequences. Then
d_λ(ξ) =⁺ sup_n ( n − H(ξ_{1:n}) ). (2.3.5)
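H, and hence d_λ, is incomputable, but the shape of (2.3.5) can be illustrated heuristically by substituting a practical compressor for H. This is a crude, purely illustrative stand-in of ours, not the theory’s notion:

    import os, zlib

    def approx_deficiency(bits):                    # heuristic sup_n (n - C(x_{1:n}))
        best = 0
        for n in range(100, len(bits) + 1, 100):    # sample prefixes for speed
            compressed = 8 * len(zlib.compress(bits[:n].encode()))
            best = max(best, n - compressed)
        return best

    periodic = "01" * 2000
    coin = "".join(format(b, "08b") for b in os.urandom(500))
    print(approx_deficiency(periodic), approx_deficiency(coin))
    # the periodic sequence shows a huge "deficiency"; the coin flips almost none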
with the property that if i < j and 1_{y_i}(ξ) = 1_{y_j}(ξ) = 1 then k_i < k_j. The inequality follows from the fact that for any finite sequence n_1 < n_2 < . . . we have ∑_j 2^{n_j} ≤ 2 max_j 2^{n_j}. The function γ(y) = ∑_{y_i = y} 2^{k_i} is lower semicomputable. With it, we have
∑_i 2^{k_i} 1_{y_i}(ξ) = ∑_{y∈Σ∗} 1_y(ξ) γ(y).
Since µ 2^{⌈d⌉} ≤ 2, we have ∑_y µ(y) γ(y) ≤ 2, hence µ(y) γ(y) <* m(y), that is γ(y) <* m(y)/µ(y). It follows that
2^{d(ξ)} <* sup_{y∈Σ∗} 1_y(ξ) m(y)/µ(y) = sup_n m(ξ_{1:n})/µ(ξ_{1:n}).
Taking logarithms, we obtain the <⁺ part of the theorem.
µ(Λ) ≤ 1.
If it is lower semicomputable we will call the semimeasure constructive. ù
We can compute
P_T(x) = ∑_{T(p) ⊒ x, p minimal} 2^{−l(p)},
Fix a computable measure P(x) over the space X . Just as in the discrete case,
the universal semimeasure gives rise to a universal Martin-Löf test.
The above theorem suggests that − log M (x) should also be considered a
kind of description complexity:
It follows that m(x) >* ν(x) and hence H(x) <⁺ K_M(x) + H(l(x)).
Just as in the discrete case, the universality of M(x) implies
K_M(x) <⁺ − log P(x) + H(P).
3 INFORMATION
3.1 INFORMATION-THEORETIC RELATIONS

∑_i a_i log (a_i / b_i) ≥ a log (a / b), (3.1.1)
On the other hand, since m(x) is a universal semimeasure, we have m(x) >* P(x): more precisely, H(x) <⁺ − log P(x) + H(P), leading to the following theorem.
Note that ℋ(P) denotes the entropy of the distribution P, while H(P) is just the description complexity of the function x ↦ P(x); it has nothing to do directly with the magnitudes of the values P(x) for each x as real numbers. These two relations give
ℋ(P) ≤ ∑_x P(x) H(x) <⁺ ℋ(P) + H(P).
IDENTITIES, INEQUALITIES
Definition 3.1.2. Let X and Y be two discrete random variables with a joint distribution. The conditional entropy of Y with respect to X is defined as
ℋ(Y | X) = − ∑_x P[X = x] ∑_y P[Y = y | X = x] log P[Y = y | X = x].
ℋ(X, Y) = ℋ(X) + ℋ(Y | X). (3.1.3)
ℋ(Y | X) ≤ ℋ(Y).
I(X : Y) = ℋ(Y) − ℋ(Y | X).
I(X : Y) = I(Y : X) = ℋ(X) + ℋ(Y) − ℋ(X, Y).
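These identities are finite sums and can be verified numerically on any small joint distribution; a sketch (the example distribution is an arbitrary choice of ours):

    from math import log2

    p_xy = {("a", 0): 0.3, ("a", 1): 0.2, ("b", 0): 0.1, ("b", 1): 0.4}

    def entropy(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0) + p            # marginal of X
        p_y[y] = p_y.get(y, 0) + p            # marginal of Y

    H_XY, H_X, H_Y = entropy(p_xy), entropy(p_x), entropy(p_y)
    H_Y_given_X = H_XY - H_X                  # the chain rule (3.1.3)
    I_XY = H_X + H_Y - H_XY                   # mutual information
    print(H_Y_given_X <= H_Y + 1e-12, I_XY >= -1e-12)   # True True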
Theorem 3.1.5 (see [17]). There is a constant c such that for all n there is a
binary string x of length n with
We will prove this theorem in Section 3.3. For the study of information
relations, let us introduce yet another notation.
f g mod h
mean that there is a constant c such that H(f(x) | g(x), h(x)) ≤ c for all x. We will omit mod h if h is not in the condition. The sign will mean inequality in both directions. ù
Since the quantity I(x : y) is not quite equal to H(x) − H(x | y) (in fact,
Levin showed (see [17]) that they are not even asymptotically equal), we
prefer to define information also by a symmetrical formula.
Let us show that classical information is indeed the expected value of algo-
rithmic information, for a computable probability distribution p:
where H(p) is the length of the shortest prefix-free program that computes
p(x, y) from input (x, y).
Proof. We have
∑_{x,y} p(x, y) I∗(x : y) =⁺ ∑_{x,y} p(x, y) [H(x) + H(y) − H(x, y)].
∑_{x,y} p(x, y) H(x, y) <⁺ ℋ(p) + H(p). On the other hand, the probabilistic
I∗(x : y) =⁺ log ( m(x, y) / (m(x) m(y)) ).
and
I(H(x) : x) =⁺ I(H(x) : x∗).
It follows that symmetry breaks down in an exponentially great measure either on the pair (x, H(x)) or the pair (x∗, H(x)).
DATA PROCESSING
The following useful identity is also classical, and is called the data processing identity:
I(Z : (X, Y)) = I(Z : X) + I(Z : Y | X). (3.1.11)
Let us see what corresponds to this for algorithmic information.
Here is the correct conditional version of the addition theorem:
ℋ(X | Y) ≤ ℋ(Z | Y) + ℋ(X | Z).
Definition 3.1.8.
Then we have
I∗(x : y | z) =⁺ H(y | z) − H(y | x, H(x | z), z)
     =⁺ H(x | z) − H(x | y, H(y | z), z).
Proof. We have
I∗(z : (x, y)) =⁺ H(x, y) + H(z) − H(x, y, z),
I∗(z : x) =⁺ H(x) + H(z) − H(x, z),
I∗(z : (x, y)) − I∗(z : x) =⁺ H(x, y) + H(x, z) − H(x, y, z) − H(x)
     =⁺ H(y | x∗) + H(z | x∗) − H(y, z | x∗)
     =⁺ I∗(z : y | x∗),
Proof. Using the algorithmic data-processing identity (3.1.15) and the definitions, we have
which gives, after cancelling H(z) + H(x | z∗) on the left-hand side with H(x, z) on the right-hand side, to which it is equal according to additivity, and changing sign:
H(x, z | y∗) =⁺ H(z | y∗) + H(x | y, H(y | z∗), z∗).
which says that for each x, the expected value of 2^{I(x:y)} is small, even with respect to the universal constructive semimeasure m(y). The proof is immediate: we write
2^{I(x:y)} =* 2^{−H(y|x)} / m(y)
and use the estimate (1.6.10).
We prove a strong version of the information non-increase law under
deterministic processing (later we need the attached corollary):
This says that the m(· | x∗)-expectation of the exponential of the increase is bounded by a constant. It follows that, for example, the probability of an increase of mutual information by the amount d is <* 2^{−d}. ù
Putting both sides into the exponent gives the statement of the theorem.
Remark 3.1.9. The theorem on information non-increase and its proof look
similar to the theorems on randomness conservation. There is a formal con-
nection, via the observation made in Remark 3.1.4. Due to the difficulties
mentioned there, we will explore the connection only later in our notes. ù
Examples showing that each of these inequalities can be strict are left to the exercises.
Remark 3.2.1. If ω(1 : n) is a segment of an infinite sequence ω then we can
write
K(ω; n)
instead of K(ω(1 : n); n) without confusion, since there is only one way to
understand this expression. ù
Decision complexity offers an easy characterization of decidable infinite
sequences.
Let us use now the new tool to prove a somewhat more surprising result.
Let ω be a 0-1 sequence that is the indicator function of a recursively enu-
merable set. In other words, the set { i : ω(i) = 1 } is recursively enumerable.
As we know such sequences are not always decidable, so K(ω(1 : n); n) is
not going to be bounded. How fast can it grow? The following theorem gives an exact answer, showing that it grows only logarithmically.
(b) The set E can be chosen such that for all n we have $K(\omega; n) \ge \log n$.
Proof. Let us prove (a) first. Let $n'$ be the first power of 2 larger than n. Let $k(n) = \sum_{i \le n'} \omega(i)$. For a constant d, let $p = p(n, d)$ be a program that contains an initial segment q of size d followed by the number $k(n)$ in binary notation, padded to length $\lceil \log n \rceil + d$ with 0's. The program q is self-delimiting, so it sees where $k(n)$ begins.
The machine $T_0(p, i)$ works as follows, under the control of q. From the length of the whole p, it determines $n'$. It begins to enumerate $\omega(1 : n')$ until it finds $k(n)$ 1's. Then it knows the whole $\omega(1 : n')$, so it outputs $\omega(i)$.
Let us now prove (b). Let us list all possible programs q for the machine $T_0$ as $q = 1, 2, \ldots$. Let $\omega(q) = 1$ if and only if $T_0(q, q) = 0$. The sequence ω is obviously the indicator function of a recursively enumerable set. To show that ω has the desired property, assume that for some n there is a p with $T_0(p, i) = \omega(i)$ for all $i \le n$. Then $p > n$, since $\omega(p)$ is defined to be different from $T_0(p, p)$. It follows that $l(p) \ge \log n$.
Decision complexity has been especially convenient for the above theo-
rem. Neither K(ω(1 : n)) nor K(ω(1 : n) | n) would have been suitable
to formulate such a sharp result. To analyze the phenomenon further, we
introduce some more concepts.
Definition 3.2.2. Let us denote, for this section, by E the set of those binary
strings p on which the optimal prefix machine T halts:
Let
$$\chi = \chi_E \tag{3.2.3}$$
be the infinite sequence that is the indicator function of the set E, when the latter is viewed as a set of numbers. ù
Let Ω(1 : n) be the sequence of the first n binary digits in the expansion
of Ω, and let it also denote the binary number 0.Ω(1) · · · Ω(n). Then we have
Proof. Let us show first that given Ω as an oracle, a Turing machine can compute χ. Suppose that we want to know for a given string p of length n whether $\chi(p) = 1$, that is whether $T(p)$ halts. Let $t(n)$ be the first t for which $\Omega_t > \Omega(1:n)$. If a program p of length n is not in $E_{t(n)}$ then it is not in E at all, since $2^{-l(p)} = 2^{-n} > \Omega - \Omega_{t(n)}$. It follows that $\chi(1 : 2^n)$ can be computed from $\Omega(1:n)$.
To show that Ω can be computed from E, let us define the recursively enumerable set
$$E' = \{\, r : r \text{ rational},\ \Omega > r \,\}.$$
The set $E'$ is reducible to E since the latter is complete among recursively enumerable sets with respect to many-one reduction. On the other hand, Ω is obviously computable from $E'$.
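A toy version of the first half of this argument, where the universal prefix machine is replaced by a hypothetical finite table of halting programs (all names and data here are illustrative, not the real machine):

```python
from fractions import Fraction

# Toy model: each halting "program" (a binary string) comes with a halting
# time, so Omega and its approximations Omega_t can be computed exactly.
halts_at = {'0': 3, '10': 1, '110': 7}                    # program -> halting time
omega = sum(Fraction(1, 2**len(p)) for p in halts_at)      # Omega = sum 2^{-l(p)}

def omega_t(t):
    """Omega_t: weight of the programs halting within t steps."""
    return sum(Fraction(1, 2**len(p)) for p, s in halts_at.items() if s <= t)

def halts(p):
    """Decide halting of a program of length n using only Omega(1:n)."""
    n = len(p)
    prefix = Fraction(int(omega * 2**n), 2**n)   # Omega(1:n) read as a number
    t = 1
    while omega_t(t) < prefix:                   # wait until Omega_t >= Omega(1:n)
        t += 1
    # now Omega - Omega_t < 2^{-n}: a still-pending program of length n would
    # contribute a full 2^{-n}, which no longer fits, so p halts iff it did by t
    return p in halts_at and halts_at[p] <= t

assert halts('10') and halts('110') and not halts('111')
```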
The following theorem shows that Ω stores the information more densely.
Theorem 3.2.4. We have $H(\Omega(1:n)) \stackrel{+}{>} n$.
Let us say that $p \in \{0,1\}^s$ is suitable for $x \in \{0,1\}^n$ if there exists a $k = U(p, x)$ and a $q \in \{0,1\}^k$ with $U(q, \Lambda) = x$. Thus, p is suitable for x if it produces the length of some program of x, not necessarily a shortest program.
Let $M_i$ denote the set of those x of length n for which there exist at least i different suitable p of length s. We will examine the sequence
We can assume
$$\log |M_i \setminus M_{i+1}| \ge \log |M_i| - 1, \tag{3.3.4}$$
otherwise (3.3.3) is trivial. We will write a program that finds an element x of $M_i \setminus M_{i+1}$ with the property $H(x) \ge \log |M_i| - 1$. The program works as follows.
• It finds i, n with the help of descriptions of length $\log n + 2\log\log n$, and s with the help of a description of length $2\log\log n$.
• It finds $|M_{i+1}|$ with the help of a description of length $\log |M_{i+1}| + \log n + 2\log\log n$.
• From these data, it determines the set $M_{i+1}$, and then begins to enumerate the set $M_i \setminus M_{i+1}$ as $x_1, x_2, \ldots$. For each of these elements $x_r$, it knows that there are exactly i programs suitable for $x_r$; it finds all of those, and finds the shortest program length for $x_r$ that they produce. Therefore it can compute $H(x_r)$.
• According to the assumption (3.3.4), there is an $x_r$ with $H(x_r) \ge \log |M_i| - 1$. The program outputs the first such $x_r$.
The construction of the program shows that its length is $\stackrel{+}{<} \log |M_{i+1}| + 4\log n$, hence for the x we found
$$\log |M_i| - 1 \le H(x) \stackrel{+}{<} \log |M_{i+1}| + 4\log n,$$
which proves (3.3.3). This implies $j \ge n/(4\log n)$, and hence (3.3.2).
Let
$$f(x) = H(H(x) \mid x).$$
We have seen that f (x) is sometimes large. Here, we will show that the se-
quences x for which this happens are rare. Recall that we defined χ in (3.2.3)
as the indicator sequence of the halting problem.
In view of the inequality (3.1.18), this shows that such sequences are
rare, even in terms of the universal probability, so they are certainly rare if
we measure them by any computable distribution. So, we may even call such
sequences “exotic”.
In the proof, we start by showing that the sequences in question are rare.
Then the theorem will follow when we make use of the fact that f (x) is
computable from χ.
We need a lemma about the approximation of one measure by another
from below.
Clearly, $H^{t(k)}(x) \stackrel{+}{>} H(x)$. Let us show that $H^{t(k)}(x)$ is a good approximation of $H(x)$ for most x. Let
$$Y(k) = \{\, x : H(x) \le H^{t(k)}(x) - 1 \,\}.$$
By definition, for $x \notin Y(k)$ we have
$$|H^{t(k)}(x) - H(x)| \stackrel{+}{=} 0.$$
On the other hand, applying Lemma 3.3.1 with $\mu = \mathbf m^{t(k)}$, $\nu = \mathbf m$, we obtain
$$\mathbf m(Y(k)) < 2^{-k+1}. \tag{3.3.5}$$
Note that
$$H(H^{t(k)}(x) \mid x, \Omega(1:k)) \stackrel{+}{=} 0,$$
therefore for $x \notin Y(k)$ we have
$$H(H(x) \mid x, \Omega(1:k)) \stackrel{+}{=} 0,$$
$$H(H(x) \mid x) \stackrel{+}{<} k + 1.2\log k.$$
If $n = k + 1.2\log k$ then $k \stackrel{+}{=} n - 1.2\log n$, and hence, if $x \notin Y(n - 1.2\log n)$ then $H(H(x) \mid x) \stackrel{+}{<} n$. Thus, there is a constant c such that
$$X(n) \subseteq Y(n - 1.2\log n - c).$$
Using (3.3.5) this gives the statement of the lemma.
Proof of Theorem 3.3.1. Since f (x) is computable from Ω, the function
$$\nu(x) = \mathbf m(x)\, 2^{f(x)} (f(x))^{-2.4}$$
is computable from Ω. Let us show that it is a semimeasure (within a multi-
plicative constant). Indeed, using the above lemma:
$$\sum_x \nu(x) = \sum_k \sum_{x \in X(k)} \nu(x) = \sum_k 2^k k^{-2.4}\, \mathbf m(X(k)) \stackrel{*}{<} \sum_k k^{-1.2} < 1.$$
Since $\mathbf m(\cdot \mid \Omega)$ is the universal semimeasure relative to Ω we find $\mathbf m(x \mid \Omega) \stackrel{*}{>} \nu(x)$, hence
$$H(x \mid \Omega) \stackrel{+}{<} -\log \nu(x) = H(x) - f(x) + 2.4\log f(x),$$
$$I(\Omega : x) \stackrel{+}{>} f(x) - 2.4\log f(x).$$
Since Ω is equivalent to χ, the proof is complete.
4 GENERALIZATIONS
4.1 CONTINUOUS SPACES, NONCOMPUTABLE MEASURES
4.1.1 INTRODUCTION
The algorithmic theory of randomness is well developed when the underly-
ing space is the set of finite or infinite sequences and the underlying prob-
ability distribution is the uniform distribution or a computable distribution.
These restrictions seem artificial. Some progress has been made to extend
the theory to arbitrary Bernoulli distributions by Martin-Löf in [37], and to
arbitrary distributions, by Levin in [29, 31, 32]. The paper [25] by Hertling
and Weihrauch also works in general spaces, but it is restricted to comput-
able measures. Similarly, Asarin’s thesis [1] defines randomness for sample
paths of the Brownian motion: a fixed random process with computable dis-
tribution.
The exposition here has been inspired mainly by Levin’s early paper [31]
(and the much more elaborate [32] that uses different definitions): let us
summarize part of the content of [31]. The notion of a constructive topolog-
ical space X and the space of measures over X is introduced. Then the paper
defines the notion of a uniform test. Each test is a lower semicomputable function $(\mu, x) \mapsto f_\mu(x)$, satisfying $\int f_\mu(x)\,\mu(dx) \le 1$ for each measure µ.
There are also some additional conditions. The main claims are the follow-
ing.
(a) There is a universal test $t_\mu(x)$, a test such that for each other test f there is a constant $c > 0$ with $f_\mu(x) \le c \cdot t_\mu(x)$. The deficiency of randomness is defined as $d_\mu(x) = \log t_\mu(x)$.
(b) The universal test has some strong properties of “randomness conservation”: these say, essentially, that a computable mapping or a computable randomized transition does not decrease randomness.
(c) There is a measure M with the property that for every outcome x we have $t_M(x) \le 1$. In the present work, we will call such measures neutral.
(d) Semimeasures (semi-additive measures) are introduced and it is shown that there is a lower semicomputable semimeasure that is neutral (so we may assume that the M introduced above is lower semicomputable).
(e) Mutual information $I(x : y)$ is defined with the help of (an appropriate version of) Kolmogorov complexity, between outcomes x and y. It is shown that $I(x : y)$ is essentially equal to $d_{M \times M}(x, y)$. This interprets mutual information as a kind of “deficiency of independence”.
This impressive theory leaves a number of issues unresolved:
1. The space of outcomes is restricted to be a compact topological space,
moreover, a particular compact space: the set of sequences over a finite
alphabet (or, implicitly in [32], a compactified infinite alphabet). How-
ever, a good deal of modern probability theory happens over spaces that
are not even locally compact: for example, in case of the Brownian mo-
tion, over the space of continuous functions.
2. The definition of a uniform randomness test includes some conditions
(different ones in [31] and in [32]) that seem somewhat arbitrary.
3. No simple expression is known for the general universal test in terms
of description complexity. Such expressions are nice to have if they are
available.
Here, we intend to carry out as much of Levin's program as seems possible after removing the restrictions. A number of questions remain open, but we feel that they are at least worth formulating. A fairly large part of the exposition is devoted to the necessary conceptual machinery. This will also allow us to carry further some other initiatives started in the works [37] and [29]: the study of tests that test nonrandomness with respect to a whole class of measures (like the Bernoulli measures).
Constructive analysis has been developed by several authors, converging approximately on the same concepts. We will make use of a simplified version of the theory introduced in [55]. As we have not found a constructive measure theory in the literature fitting our purposes, we will develop this theory here, over (constructive) complete separable metric spaces. This generality
…it follows that $h''_{e,i}(x, y, \mu) = h_{e,i}(x, y, \mu)$ for all i and hence $\varphi_{e'}(x, y, \mu) = \varphi_e(x, y, \mu)$.
Definition 4.1.2. A uniform test u is called universal if for every other test t there is a constant $c_t > 0$ such that for all x, µ we have $t_\mu(x) \le c\, u_\mu(x)$. ù
Definition 4.1.3. Let us fix a universal uniform test, called $t_\mu(x)$. An element $x \in X$ is called random with respect to measure $\mu \in \mathcal M(X)$ if $t_\mu(x) < \infty$. The deficiency of randomness is defined as $d_\mu(x) = \log t_\mu(x)$. ù
If the space is discrete then typically all elements are random with respect
to µ, but they will still be distinguished according to their different values of
dµ (x).
4.1.3 SEQUENCES
Let our computable metric space $\mathbf X = (X, d, D, \alpha)$ be the Cantor space of Example B.1.19.2: the set $\Sigma^{\mathbb N}$ of sequences over a (finite or countable) alphabet Σ. We may want to measure the non-randomness of finite sequences, viewing them as initial segments of infinite sequences. Take the universal test $t_\mu(x)$. For this, it is helpful to apply the representation of Proposition B.1.16,
taking into account that adding the extra parameter µ does not change the
validity of the theorem:
The properties of the function $g_\mu(w)$ clearly imply that $\sup_n g_\mu(\xi_{\le n})$ is a uniform test.
The existence of a universal function among the functions g can be
proved by the usual methods:
Proposition 4.1.4. Among the functions gµ (w) satisfying the properties listed
in Proposition 4.1.3, there is one that dominates to within a multiplicative
constant.
Definition 4.1.4 (Extended test). Over the Cantor space, we extend the definition of a universal test $t_\mu(x)$ to finite sequences as follows. We fix a function $t_\mu(w)$ with $w \in \Sigma^*$ whose existence is assured by Proposition 4.1.4. For infinite sequences ξ we define $t_\mu(\xi) = \sup_n t_\mu(\xi_{\le n})$. The test with values defined also on finite sequences will be called an extended test. ù
We could try to define extended tests also over arbitrary constructive metric spaces, extending them to the canonical balls, with the monotonicity property that $t_\mu(v) \le t_\mu(w)$ if ball w is manifestly included in ball v. But there is nothing simple and obvious corresponding to the integral requirement (c).
Over the space ΣN for a finite alphabet Σ, an extended test could also be
extracted directly from a test, using the following formula (as observed by
Vyugin and Shen).
Definition 4.1.5.
Proof. Let $t_\nu(x)$ be the universal test over $\mathbf X_0$. The left-hand side of (4.1.1) can be written as
$$u_\mu = \Lambda t_{\Lambda^* \mu}.$$
4.2 TEST FOR A CLASS OF MEASURES
Example 4.2.1. It is easy to show that the class $\mathcal B$ is effectively compact. One way is to appeal to the general theorem in Proposition B.1.10, which says that the image of an effectively compact set (in this case the interval [0, 1]) under a computable function is also effectively compact. ù
For the case of infinite sequences, we can also define extended tests.
Definition 4.2.3 (Extended class test). Let our space X be the Cantor space of infinite sequences $\Sigma^{\mathbb N}$. Consider a class $\mathcal C$ of measures that is effectively compact in the sense of Definition B.1.10 or (equivalently for metric spaces) in the sense of Theorem B.1.1. A lower semicomputable function $f : \Sigma^* \to \mathbb R_+$ is called an extended $\mathcal C$-test if it is monotonic with respect to the prefix relation and for all $\mu \in \mathcal C$ and integer $n \ge 0$ we have
$$\sum_{x \in \Sigma^n} \mu(x) f(x) \le 1.$$
For the case of sequences, the same statement can be made for extended tests. (This is not completely automatic, since a test is obtained from an extended test via a supremum, while a class test is obtained, according to the theorem above, via an infimum.)
Setting $n = 2^k$:
$$B_p\Big(\Big\{\, x \in \mathbb B^{2^k} : \Big|\sum_i x(i) - 2^k p\Big| > 2^{0.6k} \,\Big\}\Big) < 2^{-0.2k}. \tag{4.2.1}$$
i
Then we have
X
B pξ g p (ξ) ¶ k · 2−0.2k = c < ∞
k
for some constant c, so $s_p(\xi) = g_p(\xi)/c$ is a test for each p. The property $s_p(\xi) < \infty$ implies that $2^{-k}\sum_{i=1}^{2^k} \xi(i)$ converges to p. For a given ξ this is impossible for both p and q when $p \ne q$, hence $s_p(\xi) \vee s_q(\xi) = \infty$. ù
The following structure theorem gives considerable insight.
$$t_\mu(x) \stackrel{*}{=} t_{\mathcal C}(x) \vee s_\mu(x)$$
for all $\mu \in \mathcal C$, $x \in X$.
Proof. First, we have $t_{\mathcal C}(x) \vee s_\mu(x) \stackrel{*}{<} t_\mu(x)$. Indeed, as we know from the Uniform Extension Corollary 4.1.1, we can extend $s_\mu(x)/2$ to a uniform test, hence $s_\mu(x) \stackrel{*}{<} t_\mu(x)$. Also, by definition $t_{\mathcal C}(x) \le t_\mu(x)$.
On the other hand, let us show $t_{\mathcal C}(x) \vee s_\mu(x) \ge t_\mu(x)$. Suppose first that x is not random with respect to any $\nu \in \mathcal C$: then $t_{\mathcal C}(x) = \infty$. Suppose now that x is random with respect to some $\nu \in \mathcal C$, $\nu \ne \mu$. Then $s_\mu(x) = \infty$. Finally, suppose $t_\mu(x) < \infty$. Then $t_\nu(x) = \infty$ for all $\nu \in \mathcal C$, $\nu \ne \mu$, hence $t_{\mathcal C}(x) = \inf_{\nu \in \mathcal C} t_\nu(x) = t_\mu(x)$, so the inequality holds again.
The above theorem separates the randomness test into two parts. One
part tests randomness with respect to the class C , the other one tests typi-
cality with respect to the measure µ. In the Bernoulli example,
Martin-Löf also gave a definition of Bernoulli tests in [37]. For its definition
let us introduce the following notation.
Notation 4.2.5. The set of sequences with a given frequency of 1’s will be
denoted as follows:
$$B(n, k) = \Big\{\, x \in \mathbb B^n : \sum_i x(i) = k \,\Big\}.$$
ù
Martin-Löf’s definition translates to the integral constraint version as fol-
lows:
Definition 4.2.6. Let $X = \mathbb B^{\mathbb N}$ be the set of infinite binary sequences with the usual metric. A combinatorial Bernoulli test is a function $f : \mathbb B^* \to \mathbb R_+$ with the following constraints:
(a) It is lower semicomputable.
(b) It is monotonic with respect to the prefix relation.
(c) For all $0 \le k \le n$ we have
$$\sum_{x \in B(n,k)} f(x) \le \binom{n}{k}. \tag{4.2.2}$$
ù
The following observation is useful.
Proposition 4.2.6. If a combinatorial Bernoulli test f (x) is given on strings x
of length less than n, then extending it to longer strings using monotonicity we
get a function that is still a combinatorial Bernoulli test.
Proof. It is sufficient to check the relation (4.2.2). We have (even for $k = 0$, when $B(n-1, k-1) = \emptyset$):
$$\sum_{x \in B(n,k)} f(x) \le \sum_{y \in B(n-1,k-1)} f(y) + \sum_{y \in B(n-1,k)} f(y) \le \binom{n-1}{k-1} + \binom{n-1}{k} = \binom{n}{k}.$$
$$\sum_{x \in \mathbb B^n} B_p(x) f(x) = \sum_{k=0}^{n} p^k (1-p)^{n-k} \sum_{x \in B(n,k)} f(x) \le \sum_{k=0}^{n} p^k (1-p)^{n-k} \binom{n}{k} = 1.$$
On the other hand, let $f(x) = \sqrt{t_B(x)}$, $x \in \mathbb B^*$, be the extended test for $t_B(\xi)$. For all integers $N > 0$ let $n = \lfloor \sqrt N / 2 \rfloor$. Then as N runs through the integers, n also runs through all integers. For $x \in \mathbb B^N$ let $F(x) = f(x_{\le n})$. Since f(x) is lower semicomputable and monotonic with respect to the prefix relation, this is also true of F(x).
We need to estimate $\sum_{x \in B(N,K)} F(x)$. For this, note that for $y \in B(n,k)$ we have
$$|\{\, x \in B(N,K) : y \sqsubseteq x \,\}| = \binom{N-n}{K-k}. \tag{4.2.3}$$
Let us estimate $\binom{N-n}{K-k} \big/ \binom{N}{K}$. If $K = k = 0$ then this is 1. If $k = n$ then it is $\frac{K \cdots (K-n+1)}{N \cdots (N-n+1)}$. Otherwise, using $p = K/N$:
$$\frac{\binom{N-n}{K-k}}{\binom{N}{K}} = \frac{(N-n)(N-n-1)\cdots(N-K-(n-k)+1)\,/\,(K-k)!}{N(N-1)\cdots(N-K+1)\,/\,K!} = \frac{K\cdots(K-k+1)\cdot(N-K)\cdots(N-K-(n-k)+1)}{N\cdots(N-n+1)},$$
so that
$$\frac{\binom{N-n}{K-k}}{\binom{N}{K}} \le \frac{K^k (N-K)^{n-k}}{(N-n)^n} = p^k (1-p)^{n-k} \Big(\frac{N}{N-n}\Big)^n \tag{4.2.5}$$
holds. We have
$$\Big(\frac{N}{N-n}\Big)^n = \Big(1 + \frac{n}{N-n}\Big)^n \le e^{n^2/(N-n)} \le e^2,$$
hence
$$\binom{N}{K}^{-1} \sum_{x \in B(N,K)} F(x) \le e^2 \sum_{k=0}^{n} \sum_{y \in B(n,k)} p^k (1-p)^{n-k} f(y) \le e^2,$$
Theorem 4.3.1. If the space X is compact then there is a neutral measure over X.
Lemma 4.3.2. For every closed set $A \subseteq X$ and measure µ, if $\mu(A) = 1$ then there is a point $x \in A$ with $t_\mu(x) \le 1$.
same holds for every subset of the indices $\{1, \ldots, k\}$. Sperner's Lemma 4.3.1 implies $F_{x_1} \cap \cdots \cap F_{x_k} \ne \emptyset$.
When the space is not compact, there are generally no neutral probability
measures, as shown by the following example.
Remark 4.3.4.
1. It is easy to see that Theorem 4.6.1 characterizing randomness in terms of complexity holds also for the space $\mathbb N$.
2. The topological space of semimeasures over $\mathbb N$ is not compact, and there is no neutral one among them. Its topology is not the same as what we get when we restrict the topology of probability measures over $\overline{\mathbb N}$ to $\mathbb N$. The difference is that over $\mathbb N$, for example the set of measures $\{\, \mu : \mu(\mathbb N) > 1/2 \,\}$ is closed, since $\mathbb N$ (as the whole space) is a closed set. But over $\overline{\mathbb N}$, this set is not necessarily closed, since $\mathbb N$ is not a closed subset of $\overline{\mathbb N}$.
ù
Neutral measures are not too simple, even over N, as the following theo-
rem shows.
Suppose now that ν is lower semicomputable over $\mathbb N$. The proof for this case is longer. We know that ν is the monotonic limit of a recursive sequence $i \mapsto \nu_i(x)$ of recursive semimeasures with rational values $\nu_i(x)$. For every $k = 0, \ldots, 2^n - 2$, let
Consider $x = x_{n,k}$. Conditions (a) and (c) are satisfied by definition. Let us show that condition (b) is also satisfied. Let $j = j(n,k)$. By definition, we have $\nu_j(x) < 2^{-n+1}$. Since by definition $\nu_j \in V_{n,k}$ and $\nu_j \le \nu \in V_{n,k}$, we have
shown (4.3.4). Now we define
$$g_\mu(x) = \sum_{n \ge 2} \frac{1}{n(n+1)} \sum_k f_\mu(x, n, k).$$
$$t_\nu(x_{n,k}) \stackrel{*}{>} g_\nu(x_{n,k}) \ge \frac{1}{n(n+1)}\, f_\nu(x_{n,k}, n, k) \ge \frac{2^{n-2}}{n(n+1)}.$$
Remark 4.3.5. In [31] and [32], Levin imposed extra conditions on tests, which make it possible to find a lower semicomputable neutral semimeasure. ù
can serve as a reasonable deficiency of randomness. (We will also use the test $\bar t = 2^{\bar d}$.) If we substitute $\mathbf m$ for µ in $\bar d_\mu(x)$, we get 0. This substitution is not justified, of course. The fact that $\mathbf m$ is not a probability measure can be helped, using compactification as above, and extending the notion of randomness tests. But the test $\bar d_\mu$ can replace $d_\mu$ only for computable µ, while $\mathbf m$ is not computable. Anyway, this is the sense in which all outcomes might be considered random with respect to $\mathbf m$, and the heuristic sense in which $\mathbf m$ may be considered “neutral”.
4.4 MONOTONICITY, QUASI-CONVEXITY/CONCAVITY
Some people find that µ-tests as defined in Definition 4.1.1 are too gen-
eral, in case µ is a non-computable measure. In particular, randomness
with respect to computable measures has a certain—intuitively meaningful—
monotonicity property: roughly, if ν is greater than µ then if x is random
with respect to µ, it should also be random with respect to ν.
Proposition 4.4.1. For computable measures µ, ν we have for all integers $k > 0$:
$$2^{-k}\mu \le \nu \;\Rightarrow\; d_\nu(x) \stackrel{+}{<} d_\mu(x) + k + H(k). \tag{4.4.1}$$
Here the constant in $\stackrel{+}{<}$ depends on µ, ν, but not on k.
Proof. We have $1 \ge \nu t_\nu \ge 2^{-k}\mu t_\nu$, hence $2^{-k} t_\nu$ is a µ-test. Using the method of Theorem 4.1.1 in finding universal tests, one can show that the sum
$$\sum_{k \,:\, 2^{-k}\mu t_\nu < 1} 2^{-k-H(k)}\, t_\nu$$
is a µ-test, and hence $\stackrel{*}{<} t_\mu$. Therefore this is true of each member of the sum, which is just what the theorem claims.
There are other properties true for tests on computable measures that we
may want to require for all measures. For the following properties, let us
define quasi-convexity, which is a weakening of the notion of convexity.
Proof. The function $t_{\mu_1} \wedge t_{\mu_2}$ is lower semicomputable, and is a $\mu_i$-test for each i. Therefore it is also a ν-test, and as such is $\stackrel{*}{<} t_\nu$. Here, the constant in the $\stackrel{*}{<}$ depends only on (the programs for) $\mu_1, \mu_2$, and not on λ.
These properties do not survive for arbitrary measures and arbitrary con-
stants.
Example 4.4.4. Let measure $\mu_1$ be uniform over the interval [0, 1/2], let $\mu_2$ be uniform over [1/2, 1]. For $0 \le x \le 1$ let
$$\nu_x = (1-x)\mu_1 + x\mu_2.$$
Then $\nu_{1/2}$ is uniform over [0, 1]. Let $\varphi_x$ be the uniform distribution over $[x, x+1/2]$, and $\psi_x$ the uniform distribution over $[0, x] \cup [x+1/2, 1]$. Let $p < 1/2$ be random with respect to the uniform distribution $\nu_{1/2}$.
1. The relations
$$\nu_{1/2} = (\nu_p + \nu_{1-p})/2, \qquad d_{\nu_{1/2}}(p) < \infty, \qquad d_{\nu_p}(p) = d_{\nu_{1-p}}(p) = \infty$$
4.5 ALGORITHMIC ENTROPY
4.5.1 ENTROPY
The entropy of a discrete probability distribution µ is defined as
$$\mathcal H(\mu) = -\sum_x \mu(x) \log \mu(x).$$
$$\frac{d\mu}{d\nu} = \frac{\mu(dx)}{\nu(dx)} =: f(x)$$
for the (Radon-Nikodym) derivative (density) of µ with respect to ν, we define
$$\mathcal H_\nu(\mu) = -\int \log \frac{d\mu}{d\nu}\, d\mu = -\mu^x \log \frac{\mu(dx)}{\nu(dx)} = -\nu^x f(x) \log f(x).$$
ù
In information theory and statistics, when both µ and ν are probability measures, then $-\mathcal H_\nu(\mu)$ is also denoted $D(\mu \parallel \nu)$, and called (after Kullback) the information divergence of the two measures. It is frequently used in the role of a distance between µ and ν, though it is not symmetric and does not obey the triangle inequality. It is, however, nonnegative; let us prove this property: in our terms, it says that relative entropy is nonpositive when both µ and ν are probability measures.
Proposition 4.5.2. Over a space X, we have
$$\mathcal H_\nu(\mu) \le -\mu(X) \log \frac{\mu(X)}{\nu(X)}. \tag{4.5.1}$$
In particular, if $\mu(X) \ge \nu(X)$ then $\mathcal H_\nu(\mu) \le 0$.
Proof. The inequality $-a \ln a \le -a \ln b + b - a$ expresses the concavity of the logarithm function. Substituting $a = f(x)$ and $b = \mu(X)/\nu(X)$ and integrating by ν:
$$(\ln 2)\,\mathcal H_\nu(\mu) = -\nu^x f(x) \ln f(x) \le -\mu(X) \ln \frac{\mu(X)}{\nu(X)} + \frac{\mu(X)}{\nu(X)}\,\nu(X) - \mu(X) = -\mu(X) \ln \frac{\mu(X)}{\nu(X)},$$
giving (4.5.1).
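For probability measures, this nonpositivity (equivalently, nonnegativity of the Kullback divergence) is easy to spot-check numerically on random finite distributions; a minimal sketch:

```python
import math, random

# Numeric illustration of Proposition 4.5.2 for probability measures:
# H_nu(mu) = -sum_x mu(x) log2(mu(x)/nu(x)) is nonpositive, i.e. the
# divergence D(mu || nu) = -H_nu(mu) is nonnegative. Random toy distributions.
random.seed(1)
for _ in range(1000):
    w1 = [random.random() for _ in range(5)]
    w2 = [random.random() for _ in range(5)]
    mu = [w / sum(w1) for w in w1]
    nu = [w / sum(w2) for w in w2]
    H_nu_mu = -sum(m * math.log2(m / n) for m, n in zip(mu, nu))
    assert H_nu_mu <= 1e-12            # relative entropy is nonpositive
```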
Let us allow, for a moment, measures µ that are not probability measures: they may not even be finite. Metric and computability can be extended to this case, and the universal test $t_\mu(x)$ can also be generalized. The Coding Theorem 1.6.5 and other considerations suggest the introduction of the following notation, for an arbitrary measure µ:
Definition 4.5.1. We define the algorithmic entropy of a point x with respect to measure µ as
$$H_\mu(x) = -d_\mu(x) = -\log t_\mu(x). \tag{4.5.2}$$
ù
Then, with # defined as the counting measure over the discrete set $\Sigma^*$ (that is, $\#(S) = |S|$), we have
$$H(x) \stackrel{+}{=} H_\#(x).$$
$$\mathcal H_\nu(\mu) = -\mu^x \log \frac{\mu(dx)}{\nu(dx)},$$
$$-\mu^x H_\nu(x \mid \mu) = \mu^x \log \frac{\delta(dx)}{\nu(dx)} = -\mu^x \log \frac{\nu(dx)}{\delta(dx)},$$
$$\mathcal H_\nu(\mu) - \mu^x H_\nu(x \mid \mu) = -\mu^x \log \frac{\mu(dx)}{\delta(dx)} \le -\mu(X) \log \frac{\mu(X)}{\delta(X)} \le 0.$$
$2^{-H_\mu(x \mid y)}$. The left-hand side is a sum, hence the inequality holds for each element of the sum: just what we had to prove.
or in logarithmic notation $-\log f_\nu(x, y) \stackrel{+}{=} F_\nu(y) + H_\nu(x \mid y, F_\nu(y))$.
Proof. To prove the inequality $\stackrel{*}{>}$, define
$$\nu^x 2^{-m}\, t_\nu(x \mid y, m) \le 2^{-m},$$
$$\nu^x g_\nu(x, y) \stackrel{*}{<} 2^{-F_\nu(y)},$$
implying $g_\nu(x, y) \stackrel{*}{<} f_\nu(x, y)$ by the optimality of $f_\nu(x, y)$.
To prove the upper bound, consider all lower semicomputable functions $\varphi_e(x, y, m, \nu)$ ($e = 1, 2, \ldots$). By Theorem 4.1.1, there is a recursive mapping $e \mapsto e'$ with the property that $\nu^x \varphi_{e'}(x, y, m, \nu) \le 2^{-m+1}$, and for each y, m, ν, if $\nu^x \varphi_e(x, y, m, \nu) < 2^{-m+1}$ then $\varphi_{e'} = \varphi_e$. Let us apply this transformation to the function $\varphi_e(x, y, m, \nu) = f_\nu(x, y)$. The result is a lower semicomputable function $f'_\nu(x, y, m) = \varphi_{e'}(x, y, m, \nu)$ with the property that $\nu^x f'_\nu(x, y, m) \le 2^{-m+1}$; further, $\nu^x f_\nu(x, y) \le 2^{-m}$ implies $f'_\nu(x, y, m) = f_\nu(x, y)$. Now $(x, y, m, \nu) \mapsto 2^{m-1} f'_\nu(x, y, m)$ is a uniform test of x conditional on y, m, and hence it is $\stackrel{*}{<} t_\nu(x \mid y, m)$. Substituting $F_\nu(y)$ for m, the relation $\nu^x f_\nu(x, y) \le 2^{-m}$ is satisfied and hence we have
Now, we have
Integrating over x by µ gives $\mu^x \nu^y 2^{-G} \stackrel{*}{<} 1$, implying $H_{\mu,\nu}(x, y) \stackrel{+}{<} G_{\mu,\nu}(x, y)$ by the minimality property of $H_{\mu,\nu}(x, y)$. This proves the $\stackrel{+}{<}$ half of the theorem.
To prove the inequality $\stackrel{+}{>}$, let $f_\nu(x, y, \mu) = 2^{-H_{\mu,\nu}(x, y)}$. Proposition 4.5.4 implies that there is a constant c with $\nu^y f_\nu(x, y, \mu) \le 2^{-H_\mu(x \mid \nu) + c}$. Let
$$F_\nu(x, \mu) = H_\mu(x \mid \nu).$$
4.5.4 INFORMATION
Mutual information has been defined in Definition 3.1.7 as $I^*(x : y) = H(x) + H(y) - H(x, y)$. By the Addition theorem, we have $I^*(x : y) \stackrel{+}{=} H(y) - H(y \mid x, H(x)) \stackrel{+}{=} H(x) - H(x \mid y, H(y))$. The two latter expressions show that, in some sense, $I^*(x : y)$ is the information held in x about y as
hence also the function $(x, y) \mapsto \mathbf m(x, y) = \mathbf m(\langle x, y \rangle)$. Using this knowledge, it is possible to develop an argument similar to the proof of Theorem 4.3.2, showing that $H(x, y \mid \mathbf m \times \mathbf m) \stackrel{+}{=} H(x, y)$ does not hold.
4.6 RANDOMNESS AND COMPLEXITY
We have seen in the discrete case that complexity and randomness are closely
related. The connection is more delicate technically in the continuous case,
but its exploration led to some nice results.
It is known that for computable µ, the test $d_\mu(x)$ can be expressed in terms of the description complexity of x (we will prove these expressions below).²
² We cannot use the test $t_\mu$ for this, since it can be shown easily that it does not obey randomness conservation.
Assume that X is the (discrete) space of all binary strings. Then we have
The meaning of this equation is the following. The expression − log µ(x)
is an upper bound (within O(H(µ))) of the complexity H(x), and nonran-
domness of x is measured by the difference between the complexity and this
upper bound. Assume that X is the space of infinite binary sequences. Then
equation (4.6.1) must be replaced with
Proof. Let us treat the domain of our measure µ as a set of pairs (x, y). Let $x_n = 0^n$, for $n = 1, 2, \ldots$. For each n, let $y_n$ be some binary string of length n with the property $H(x_n, y_n) > n$. Let $\mu(x_n, y_n) = 2^{-n}$. Then $-\log \mu(x_n, y_n) - H(x_n, y_n) \le n - n = 0$. On the other hand, let $\bar t_\mu(x, y)$ be the test that is nonzero only on strings x of the form $x_n$:
$$\bar t_\mu(x_n, y) = \frac{\mathbf m(n)}{\sum_{z \in \mathbb B^n} \mu(x_n, z)};$$
therefore $\bar t_\mu$ is indeed a test. Hence $t_\mu(x, y) \stackrel{*}{>} \bar t_\mu(x, y)$. Taking logarithms,
$$d_\mu(x_n, y_n) \stackrel{+}{>} n - H(n).$$
The same example shows that the test defined as $\exp(-\log \mu(x) - H(x))$ over discrete sets does not satisfy the randomness conservation property.
Proposition 4.6.2. The test defined as $f_\mu(x) = \exp(-\log \mu(x) - H(x))$ over discrete spaces X does not obey the conservation of randomness.
Proof. Let us use the example of Proposition 4.6.1. Consider the function $\pi : (x, y) \mapsto x$. The image of the measure µ under the projection is $(\pi\mu)(x) = \sum_y \mu(x, y)$. Thus, $(\pi\mu)(x_n) = \mu(x_n, y_n) = 2^{-n}$. Then we have seen that $\log f_\mu(x_n, y_n) \le 0$. On the other hand,
$$\log f_{\pi\mu}(\pi(x_n, y_n)) \stackrel{+}{=} -\log(\pi\mu)(x_n) - H(x_n) = n - H(n).$$
In the example, we have the abnormal situation that a pair is random but one of its elements is nonrandom. Therefore even if we did not insist on universality, the test $\exp(-\log \mu(x) - H(x))$ would be unsatisfactory.
Looking into the reasons for the nonconservation in the example, we notice that it could only have happened because the test $f_\mu$ is too special. The fact that $-\log(\pi\mu)(x_n) - H(x_n)$ is large should show that the pair $(x_n, y_n)$ can be enclosed into the “simple” set $\{x_n\} \times Y$ of small probability; unfortunately, this observation is not reflected in $-\log \mu(x, y) - H(x, y)$ (it is for computable µ).
It is a natural idea to modify equation (4.6.1) in such a way that the
complexity H(x) is replaced with H(x | µ). However, this expression must
be understood properly. We need to use the definition of H(x) as − log m(x)
directly, and not as prefix complexity.
Let us mention the following easy fact:
Proposition 4.6.3. If µ is a computable measure then $H(x \mid \mu) \stackrel{+}{=} H(x)$. The constant in $\stackrel{+}{=}$ depends on the description complexity of µ.
$\mathbf m(x \mid \mu)/\mu(x)$. Let us prove $\stackrel{*}{>}$ first. We will show that the right-hand side of this inequality is a test, and hence $\stackrel{*}{<} t_\mu(x)$. However, the right-hand side is clearly lower semicomputable in (x, µ) and when we “integrate” it (multiply it by µ(x) and sum it), its sum is $\le 1$; thus, it is a test.
Let us prove $\stackrel{*}{<}$ now. The expression $t_\mu(x)\mu(x)$ is clearly lower semicomputable in (x, µ), and its sum is $\le 1$. Hence, it is $\stackrel{*}{<} \mathbf m(x \mid \mu)$.
Remark 4.6.4. It is important not to consider relative computation with respect
to µ as oracle computation in the ordinary sense. Theorem 4.3.1 below will
show the existence of a measure with respect to which every element is ran-
dom. If randomness is defined using µ as an oracle then we can always find
elements nonrandom with respect to µ.
For similar reasons, the proof of the Coding Theorem does not transfer to
the function H(x | µ) since an interpreter function should have the property
of intensionality, depending only on µ and not on the sequence representing
it. (It does transfer without problem to an oracle version of H µ (x).) The
Coding Theorem still may hold, at least in some cases: this is currently not
known. Until we know this, we cannot interpret H(x | µ) as description
complexity in terms of interpreters and codes.
(Thanks to Alexander Shen for this observation: this remark corrects an
error in the paper [21].) ù
Theorem 4.6.2. Suppose that the space X is effectively compact. Then for all computable measures $\mu \in M_R^0(\mathbf X)$, for the deficiency of randomness $d_\mu(x)$, the characterization (2.3.4) holds.
Proof. The proof of the part $\stackrel{+}{>}$ of the inequality follows directly from Proposition 4.6.7, just as in the proof of Theorem 2.3.4.
The proof of $\stackrel{+}{<}$ is also similar to the proof of that theorem. The only part that needs to be reproved is the statement that for every lower semicomputable function f over X, there are computable sequences $y_i \in \mathbb N^*$ and $q_i \in \mathbb Q$ with $f(x) = \sup_i q_i 1_{y_i}(x)$. This follows now, since according to Proposition 4.7.3, the cells $\Gamma_y$ form a basis of the space X.
holds for all $0 \le p \le 1$. A gap function D(n, k) is optimal if for every other gap function $D'(n, k)$ there is a $c_{D'}$ with $D(n, k) \le D'(n, k) + c_{D'}$. ù
Proposition 4.6.8. There is an optimal gap function $D(n, k) \stackrel{+}{<} H(n)$.
Proof. The existence is proved using the technique of Theorem 4.1.1. For the inequality it is sufficient to note that H(n) is a gap function. Indeed, we have
$$\sum_{n \ge 1} \sum_{k=0}^{n} B_p(n, k)\, 2^{-H(n)} = \sum_{n \ge 1} 2^{-H(n)} \le 1.$$
Definition 4.6.3. Let us fix an optimal gap function and denote it by ∆(n, k).
ù
Now we can state the test characterization theorem for Bernoulli tests.
Theorem 4.6.3. Denoting by b(ξ) the universal class test for the Bernoulli sequences, we have $b(\xi) \stackrel{*}{=} \bar b(\xi)$, where
$$\log \bar b(\xi) = \sup_n \Big[\log \binom{n}{k} - H(\xi_{\le n} \mid n, k, \Delta(n,k)) - \Delta(n,k)\Big],$$
with $k = S_n(\xi)$.
Among these functions there is one that is optimal (maximal to within a multi-
plicative constant). Calling it δ( y) we have
$$\bar b(\xi) = \sup_{n \ge 1} \binom{n}{k}\, \mathbf m(\xi_{\le n} \mid n, k, \Delta(n,k))\, \delta(n,k)$$
$$= \sup_{n \ge 1} \sum_{k=0}^{n} \binom{n}{k}\, \delta(n,k) \sum_{y \in B(n,k)} 1_y(\xi)\, \mathbf m(y \mid n, k, \Delta(n,k))$$
$$\le \sum_{n \ge 1} \sum_{k=0}^{n} \binom{n}{k} \sum_{y \in B(n,k)} 1_y(\xi)\, \delta(n,k)\, \mathbf m(y \mid n, k, \Delta(n,k))$$
$$\stackrel{*}{=} \sum_{n \ge 1} \sum_{k=0}^{n} \binom{n}{k} \sum_{y \in B(n,k)} 1_y(\xi)\, \delta(y),$$
using the notation of Claim 4.6.9 above. Let t(ξ) denote the right-hand side here, which is thus a lower semicomputable function. We have for all p:
$$B_p^\xi\, t(\xi) \stackrel{*}{=} \sum_{n \ge 1} \sum_{k=0}^{n} B_p(n, k)\, \delta(n,k) \sum_{y \in B(n,k)} \mathbf m(y \mid n, k, \Delta(n,k)) \le 1,$$
so $t(\xi) \stackrel{*}{>} \bar b(\xi)$ is a Bernoulli test, showing $\bar b(\xi) \stackrel{*}{<} b(\xi)$.
To show $b(\xi) \stackrel{*}{<} \bar b(\xi)$, we will follow the method of the proof of Theorem 2.3.4. Replace b(ξ) with a rougher version:
$$b'(\xi) = \max\{\, 2^n : 2^n < \tfrac{1}{2}\, b(\xi) \,\},$$
then we have $2b' < b$. There are computable sequences $y_i \in \mathbb B^*$ and $k_i \in \mathbb N$ with $b'(\xi) = \sup_i 2^{k_i} 1_{y_i}(\xi)$, with the property that if $i < j$ and $1_{y_i}(\xi) = 1_{y_j}(\xi) = 1$ then $k_i < k_j$. As in the imitated proof, we have
$2b'(\xi) \ge \sum_i 2^{k_i} 1_{y_i}(\xi)$. The function $\gamma(y) = \sum_{y_i = y} 2^{k_i}$ is lower semicomputable. We have
$$b \ge 2b'(\xi) \ge \sum_i 2^{k_i} 1_{y_i}(\xi) = \sum_{y \in \mathbb B^*} 1_y(\xi)\gamma(y) = \sum_n \sum_{k=0}^{n} \sum_{y \in B(n,k)} 1_y(\xi)\gamma(y). \tag{4.6.7}$$
By Theorem 4.2.3 we can assume $\sum_{y \in B(n,k)} \gamma(y) \le \binom{n}{k}$. Let
$$\delta'(y) = \gamma(y)\binom{n}{k}^{-1}, \qquad \delta'(n, k) = \sum_{y \in B(n,k)} \delta'(y) \le 1.$$
Thus $\delta'(n, k)$ is a gap function, hence $\delta'(n, k) \stackrel{*}{<} 2^{-\Delta(n,k)}$, and by Claim 4.6.9 we have
$$\gamma(y)\binom{n}{k}^{-1} = \delta'(y) \stackrel{*}{<} \delta(y) \stackrel{*}{=} \delta(n, k) \cdot \mathbf m(y \mid n, k, \Delta(n,k)).$$
Substituting back into (4.6.7) finishes the proof of $b(\xi) \stackrel{*}{<} \bar b(\xi)$.
4.7 CELLS
This section allows us to transfer some of the results on sequence spaces to more general spaces, by encoding the elements into sequences. The reader who is only interested in sequences can skip this section.
4.7.1 PARTITIONS
The coordinates of the sequences into which we want to encode our elements
will be obtained via certain partitions.
Recall from Definition A.2.16 that a measurable set A is said to be a µ-
continuity set if µ(∂ A) = 0 where ∂ A is the boundary of A.
A partition of this kind will be called a continuity partition. The sets $V_{j_1,\ldots,j_n}$ will be called the cells of this partition. ù
Proposition 4.7.1. In a partition as given above, the values $\mu V_{j_1,\ldots,j_n}$ are computable from the names of the functions $f_i$ and the cutoff points $\alpha_{ij}$.
Proof. Straightforward.
Γs
The three properties above say that if we restrict ourselves to the set $X^0$ then the canonical cells behave somewhat like binary subintervals: they divide $X^0$ in half, then each half again in half, etc. Moreover, around each point, these canonical cells become “arbitrarily small”, in some sense (though they may not be a basis of neighborhoods). It is easy to see that if $\Gamma_{s_1}, \Gamma_{s_2}$ are two canonical cells then they are either disjoint or one of them contains the other. If $\Gamma_{s_1} \subseteq \Gamma_{s_2}$ then $s_2$ is a prefix of $s_1$. If, for a moment, we write $\Gamma^0_s = \Gamma_s \cap X^0$ then we have the disjoint union $\Gamma^0_s = \Gamma^0_{s0} \cup \Gamma^0_{s1}$.
Let us use the following notation.
$$s = x_{\le n} = x_1 \cdots x_n, \qquad \mu(s) = \mu(\Gamma_s).$$
The $2^n$ cells (some of them possibly empty) of the form $\Gamma_s$ for $l(s) = n$ form a partition $\mathcal P_n$ of $X^0$. ù
Thus, for elements of $X^0$, we can talk about the n-th bit $x_n$ of the description of x: it is uniquely determined.
Examples 4.7.2.
1. If $\mathbf X$ is the set of infinite binary sequences with its usual topology, the functions $b_n(x) = x_n - 1/2$ generate the usual cells, and $X^0 = X$.
2. If $\mathbf X$ is the interval [0, 1], let $b_n(x) = -\sin(2^n \pi x)$. Then the cells are open intervals of the form $(k \cdot 2^{-n}, (k+1) \cdot 2^{-n})$, and the correspondence between infinite binary strings and elements of $X^0$ is just the usual binary representation $x = 0.x_1 x_2 \ldots$.
ù
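In Example 2, the sign of $b_n$ directly reads off the n-th binary digit; here is a small sketch checking this claim for a few non-dyadic points (chosen arbitrarily):

```python
import math

# In Example 4.7.2.2, b_n(x) = -sin(2^n pi x) is positive exactly when the
# n-th digit of the binary expansion of x is 1 (away from dyadic endpoints).
def bit_via_b(x, n):
    return 1 if -math.sin(2**n * math.pi * x) > 0 else 0

def bit_via_expansion(x, n):
    return int(x * 2**n) % 2       # n-th digit of the binary expansion

for x in [0.1, 0.3, 1/3, 0.71, 0.9]:
    for n in range(1, 12):
        assert bit_via_b(x, n) == bit_via_expansion(x, n)
```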
When we fix canonical cells, we will generally assume that the partition
chosen is also “natural”. The bits x 1 , x 2 , . . . could contain information about
the point x in decreasing order of importance from a macroscopic point of
view. For example, for a container of gas, the first few bits may describe,
to a reasonable degree of precision, the amount of gas in the left half of
the container, the next few bits may describe the amounts in each quarter,
the next few bits may describe the temperature in each half, the next few
bits may describe again the amount of gas in each half, but now to more
precision, etc. From now on, whenever Γ denotes a subset of X, it means a canonical cell.
The following observation will prove useful.
Proof. We need to prove that for every ball B(x, r) there is a cell $x \in \Gamma_s \subseteq B(x, r)$. Let C be the complement of B(x, r). For each point y of C, there is an i such that $b_i(x) \cdot b_i(y) < 0$. In this case, let $J^0 = \{\, z : b_i(z) < 0 \,\}$, $J^1 = \{\, z : b_i(z) > 0 \,\}$. Let $J(y) = J^p$ be such that $y \in J^p$. Then $C \subseteq \bigcup_y J(y)$, and compactness implies that there is a finite sequence $y_1, \ldots, y_k$ with $C \subseteq \bigcup_{j=1}^k J(y_j)$. Clearly, there is a cell
$$x \in \Gamma_s \subseteq B(x, r) \setminus \bigcup_{j=1}^k J(y_j).$$
Proof. Let us list all balls $B(s_i, r_n)$ into a single sequence $B(s_{i_k}, r_{n_k})$. The functions
For the proof of the theorem, we use some preparation. Recall from Defi-
nition A.2.3 that an atom is a point with positive measure.
Proof. Consider the uniformly computable measures $\mu_i = \mu \circ f_i^{-1}$ and define $\nu = \sum_i 2^{-i} \mu_i$. It is easy to see that ν is a computable measure and then, by
5 EXERCISES AND PROBLEMS
$$|\mathrm{cnv}_s^r(x)| \le |x| \frac{\log r}{\log s} + 1. \tag{5.0.1}$$
$$K_U(x \mid y) \le K_A(x \mid y) + c. \tag{5.0.2}$$
Since
$$|\mathrm{cnv}_s^r(p)| \le |p| \log r / \log s + 1 = K_A(x \mid y)/\log s + 1,$$
we have
Exercise 3 (Snm -theorem). Prove that there is a binary string b such that
U(p, q, x) = U(b o p o q, x) holds for all binary strings p, q and arbitrary strings
x.
Exercise 4. (Schnorr) Notice that, apart from the conversion between rep-
resentations, what our optimal interpreter does is the following. It treats
the program p as a pair (p(1), p(2)) whose first member is the Gödel num-
ber of some interpreter for the universal p.r. function V (p1 , p2 , x), and the
second argument as a program for this interpreter. Prove that indeed, for
any recursive pairing function w(p, q), if there is a function f such that
|w(p, q)| ¶ f (p) + |q| then w can be used to define an optimal interpreter.
Exercise 5. Refute the inequality $K(x, y) \stackrel{+}{<} K(x) + K(y)$.
Exercise 8. (Schnorr) Prove that if $m < n$ then $m + K(m) \stackrel{+}{<} n + K(n)$.
Exercise 9. Prove
$$\log \binom{n}{k} \stackrel{+}{=} k \log \frac{n}{k} + (n-k) \log \frac{n}{n-k} + \frac{1}{2} \log \frac{n}{k(n-k)}.$$
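The claimed equality, up to an additive constant, is easy to watch numerically (the check below uses lgamma to avoid overflow; ranges are arbitrary):

```python
import math

# Numeric look at Exercise 9: log2 C(n,k) minus the displayed expression
# stays bounded by a small constant (this is Stirling's formula at work).
def log2_comb(n, k):
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

def approx(n, k):
    return (k * math.log2(n / k) + (n - k) * math.log2(n / (n - k))
            + 0.5 * math.log2(n / (k * (n - k))))

gaps = [abs(log2_comb(n, k) - approx(n, k))
        for n in range(10, 3000, 97) for k in range(1, n)]
print(max(gaps))    # bounded by a small constant, as the exercise claims
```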
Exercise 10. (Kamae) Prove that for any natural number k there is a string x such that for all but finitely many strings y, we have
$$K(x) - K(x \mid y) \ge k.$$
In other words, there are some strings x such that almost any string y con-
tains a large amount of information about them.
Solution. Strings x with this property are for example the ones which contain information obtainable with the help of any sufficiently large number. Let E be an r.e. set of integers. Let $e_0 e_1 e_2 \ldots$ be the infinite string which is the characteristic function of E, and let $x(k) = e_0 \ldots e_{2^k}$. We can suppose that $K(x(k)) \ge k$. Let $n_1, n_2, \ldots$ be a recursive enumeration of E without repetition, and let $\alpha(k) = \max\{\, i : n_i \le 2^k \,\}$. Then for any number $t \ge \alpha(k)$ we have $K(x(k) \mid t) \stackrel{+}{<} \log k$. Indeed, a binary string of length $\log k$ describes the number k. Knowing t we can enumerate $n_1, \ldots, n_t$ and thus learn $x(k)$. Therefore with any string y of length $\ge \alpha(k)$ we have $K(x) - K(x \mid y) \stackrel{+}{>} k - \log k$.
Exercise 11.
(a) Prove that a real function f is computable iff there is a recursive function
g(x, n) with rational values, and | f (x) − g(x, n)| < 1/n.
(b) Prove that a function f is semicomputable iff there exists a recursive function with rational values (or, equivalently, a computable real function) g(x, n), nondecreasing in n, with $f(x) = \lim_{n\to\infty} g(x, n)$.
Exercise 12. Prove that in Theorem 1.5.2 one can write “semicomputable”
for “partial recursive”.
Exercise 15. (Solovay) Show that we cannot effectively find infinitely many places where some recursive upper bound of H(n) is sharp. Moreover, suppose that F(n) is a recursive upper bound of H(n). Then there is no recursive function D(n) assigning to each n a finite set of natural numbers larger than n (represented, for example, as a string) such that for each n there is an x in D(n) with $F(x) \le H(x)$. Notice that the function log n (or one almost equal to it) has this property for K(n).
Solution. Suppose that such a function exists. Then we can select a sequence $n_1 < n_2 < \ldots$ of integers with the property that the sets $D(n_i)$ are all disjoint. Let
$$a(n) = \sum_{x \in D(n)} 2^{-F(x)}.$$
Then the sequence $a(n_k)$ is computable and $\sum_k a(n_k) < 1$. It is now easy
Exercise 16. Prove that there is a recursive upper bound F(n) of H(n) and a constant c with the property that there are infinitely many natural numbers n such that for all $k > 0$, the number of integers $x \le n$ with $H(x) < F(x) - k$ is less than $c n 2^{-k}$.
Solution. Use the upper bound G found in Theorem 1.7.4 and define $F(n) = \log n + G(\lfloor \log n \rfloor)$. The property follows from the facts that $\log n + H(\lfloor \log n \rfloor)$ is a sharp upper bound on H(x) for “most” x less than n, and that G(k) is infinitely often close to H(k).
Hint: Let $r_n$ be a recursive, increasing sequence of rational numbers with $\lim_n r_n = \sum_x \mathbf m(x)$ and let $a_n = r_{n+1} - r_n$.
Exercise 19. (Cover) Let $\log_2^* n = \log_2 n + \log_2 \log_2 n + \ldots$ (all positive terms). Prove that
$$\sum_n 2^{-\log_2^* n} < \infty,$$
hence $H(n) \stackrel{+}{<} \log_2^* n$. For which bases of the logarithm does $H(n) \stackrel{+}{<} \log^* n$ hold?
Exercise 20. Prove that Kamae’s result in Exercise 10 does not hold for H(x |
y).
Exercise 21. Prove that for each ε there is an m such that if $\mathcal H(P) > m$ then
$$\Big| \sum_x P(x) H(x) \big/ \mathcal H(P) - 1 \Big| < \varepsilon.$$
A BACKGROUND FROM MATHEMATICS
A.1 TOPOLOGY
In this section, we summarize the notions and results of topology that are
needed in the main text.
Having a set of open and closed sets allows us to speak about closure
operations.
Examples A.1.1.
1. Let X be a set, and let β be the set of all points of X . The topology with
basis β is the discrete topology of the set X . In this topology, every subset
of X is open (and closed).
2. Let X be the real line R, and let βR be the set of all open intervals (a, b).
The topology τR obtained from this basis is the usual topology of the real
line. When we refer to R as a topological space without qualification, this
is the topology we will always have in mind.
3. Let $X = \overline{\mathbb R} = \mathbb R \cup \{-\infty, \infty\}$, and let $\beta_{\overline{\mathbb R}}$ consist of all open intervals (a, b) and in addition of all intervals of the forms $[-\infty, a)$ and $(a, \infty]$. It is clear how the space $\overline{\mathbb R}_+$ is defined.
4. Let X be the real line $\mathbb R$. Let $\beta^>_{\mathbb R}$ be the set of all open intervals $(-\infty, b)$. The topology with basis $\beta^>_{\mathbb R}$ is also a topology of the real line, different from the usual one. Similarly, let $\beta^<_{\mathbb R}$ be the set of all open intervals $(b, \infty)$.
5. Let Σ be a finite or countable alphabet. On the space $\Sigma^{\mathbb N}$ of infinite sequences with elements in Σ, let $\tau_C = \{\, A\Sigma^{\mathbb N} : A \subseteq \Sigma^* \,\}$ be called the topology of the Cantor space (over Σ). Note that if a set E has the form $E = A\Sigma^{\mathbb N}$ where A is finite then E is both open and closed.
ù
Starting from open sets, we can define some other kinds of set that are
still relatively simple:
Different topologies over the same space have a natural partial order re-
lation among them:
Both our above example topologies of the real line have this property. All
topologies considered in this survey will have the T0 property. A stronger
such property is the following:
A.1.2 CONTINUITY
We introduced topology in order to be able to speak about continuity:
Given some topological spaces, we can also form larger ones using for
example the product operation:
Examples A.1.2.
1. The space R × R with the product topology has the usual topology of the
Euclidean plane.
2. Let X be a set with the discrete topology, and X N the set of infinite se-
quences with elements from X , with the product topology. A basis of this
topology is provided by all sets of the form uX N where u ∈ X ∗ . The ele-
ments of this basis are closed as well as open. When X = {0, 1} then this
topology is the usual topology of infinite binary sequences.
ù
In some special cases, one of the topologies in the definition of continuity
is fixed and known in advance:
A.1.3 SEMICONTINUITY
For real functions, a restricted notion of continuity is often useful.
In the important special case of Cantor spaces, the basis is given by the
set of finite sequences. In this case we can also require the function g(w) to
be monotonic in the words w:
A.1.4 COMPACTNESS
There is an important property of topological spaces that, when satisfied, has
many useful implications.
Metric spaces are topological spaces with more structure: in them, the close-
ness concept is quantifiable. In our examples for metric spaces, and later in
our treatment of the space of probability measures, we refer to [5].
are called the open and closed balls with radius r and center x.
A metric space is bounded when d(x, y) has an upper bound on X . ù
A metric space is also a topological space, with the basis that is the set
of all open balls. Over this space, the distance function d(x, y) is obviously
continuous.
Each metric space is a Hausdorff space; moreover, it has the following
stronger property.
Examples A.1.10.
1. A discrete topological space X can be turned into a metric space as fol-
lows: d(x, y) = 0 if x = y and 1 otherwise.
2. The real line with the distance d(x, y) = |x − y| is a metric space. The
topology of this space is the usual topology τR of the real line.
3. The space R × R with the Euclidean distance gives the same topology as
the product topology of R × R.
4. An arbitrary set X with the distance d(x, y) = 1 for all pairs x, y of dif-
ferent elements, is a metric space that induces the discrete topology on
X.
5. Let X be a bounded metric space, and let $Y = X^{\mathbb N}$ be the set of infinite sequences $x = (x_1, x_2, \ldots)$ with distance function $d^{\mathbb N}(x, y) = \sum_i 2^{-i} d(x_i, y_i)$.
ù
In metric spaces, certain previously defined topological objects have
richer properties.
Examples A.1.11. Each of the following facts holds in metric spaces and is
relatively easy to prove.
1. Every closed set is a Gδ set (and every open set is an Fσ set).
2. A set is compact if and only if it is sequentially compact.
3. A set is compact if and only if it is closed and for every ε there is a finite set of ε-balls (balls of radius ε) covering it.
ù
In metric spaces, the notion of continuity can be strengthened.
Let $\mathrm{Lip}(X) = \mathrm{Lip}(X, \mathbb R) = \bigcup_\beta \mathrm{Lip}_\beta$ be the set of real Lipschitz functions over X. ù
This is a continuous function that is 1 in the ball B(u, r), is 0 outside the ball $B(u, r+\varepsilon)$, and takes intermediate values in between. It is clearly a Lipschitz$(1/\varepsilon)$ function.
If a dense set D is fixed, let $\mathcal F_0 = \mathcal F_0(D)$ be the set of functions of the form $g_{u,r,1/n}$ where $u \in D$, r is rational, $n = 1, 2, \ldots$. Let $\mathcal F_1 = \mathcal F_1(D)$ be the set of maxima of finite numbers of elements of $\mathcal F_0(D)$. Each element f of $\mathcal F_1$ is bounded between 0 and 1. Let $\mathcal E$ be the smallest set of functions containing $\mathcal F_0$ and the constant 1, and closed under ∨, ∧ and rational linear combinations. For each element of $\mathcal E$, from its definition we can compute a bound β such that $f \in \mathrm{Lip}_\beta$. ù
Every metric space has the first countability property since we can restrict
ourselves to balls with rational radius.
For example, the space $\mathbb R$ has the second countability property for all three topologies $\tau_{\mathbb R}$, $\tau^<_{\mathbb R}$, $\tau^>_{\mathbb R}$. Indeed, we also get a basis if, instead of taking all intervals, we only take intervals with rational endpoints.
intervals, we only take intervals with rational endpoints. On the other hand,
the metric space of Example A.1.10.7 does not have the second countability
property.
For example, if X is the real line with the point 0 removed then X is not
complete, since there are Cauchy sequences converging to 0, but 0 is not in
X.
It is well-known that every metric space can be embedded (as an every-
where dense subspace) into a complete space.
Example A.1.13. Consider the set D[0, 1] of functions over [0, 1] that are right continuous and have left limits everywhere. The book [5] introduces two different metrics for them: the Skorohod metric d and the $d_0$ metric. In both metrics, two functions are close if a slight monotonic continuous deformation of the coordinates makes them uniformly close. But in the $d_0$ metric, the slope of the deformation must be close to 1. It is shown that the two metrics give rise to the same topology; however, the space with metric d is not complete, while the space with metric $d_0$ is. ù
We will develop the theory of randomness over separable complete metric spaces. This is a wide class of spaces encompassing most spaces of practical interest. The theory would be simpler if we restricted it to compact or locally compact spaces; however, some important spaces like C[0, 1] (the set of continuous functions over the interval [0, 1], with the maximum difference as their distance) are not locally compact.
A.2 MEASURES
For a survey of measure theory, see for example [39].
A.2.2 MEASURES
Probability is an example of the more general notion of a measure.
148
A.2. Measures
if the whole space is the union of a countable set of subsets whose measure
is finite. It is finite if µ(X ) < ∞. It is a probability measure if µ(X ) = 1. ù
Example A.2.2 (Delta function). For any point x, the measure $\delta_x$ is defined as follows:
$$\delta_x(A) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases}$$
Examples A.2.4.
1. Let x be point and let µ(A) = 1 if x ∈ A and 0 otherwise. In this case, we
say that µ is concentrated on the point x.
2. Consider the real line $\mathbb R$, and the algebra $\mathcal L_1$ defined in Example A.2.1.1. Let $f : \mathbb R \to \mathbb R$ be a monotonic real function. We define a set function over $\mathcal L_1$ as follows. Let $[a_i, b_i)$ ($i = 1, \ldots, n$) be a set of disjoint left-closed intervals. Then $\mu(\bigcup_i [a_i, b_i)) = \sum_i f(b_i) - f(a_i)$. It is easy to see that µ is additive. It is σ-additive if and only if f is left-continuous.
3. Let $B = \{0, 1\}$, and consider the set $B^{\mathbb N}$ of infinite 0-1 sequences, and the semialgebra $\mathcal L_3$ of Example A.2.1.3. Let $\mu : B^* \to \mathbb R_+$ be a function. Let us write $\mu(u B^{\mathbb N}) = \mu(u)$ for all $u \in B^*$. Then it can be shown that the
A.2.3 INTEGRAL
The notion of integral also generalizes to arbitrary measures, and is some-
times also used to define measures.
First we define integral on very simple functions.
Proposition A.2.7. Let $\mathcal E$ be any Riesz space of functions with the property that $1 \in \mathcal E$. Let µ be a positive linear functional on $\mathcal E$, continuous on monotonic sequences, with $\mu 1 = 1$. The functional µ can be extended, by continuity, to the set $\mathcal E_+$ of monotonic limits of nonnegative elements of $\mathcal E$. In the case when $\mathcal E$ is the set of all step functions, the set $\mathcal E_+$ is the set of all nonnegative measurable functions.
The set of integrable functions is a Riesz space, and the positive linear
functional µ on it is continuous with respect to monotonic sequences. The
continuity over monotonic sequences also implies the following theorem.
A.2.4 DENSITY
When does one measure have a density function with respect to another?
$$f(x) = \frac{\mu(dx)}{\nu(dx)} = \frac{d\mu}{d\nu}.$$
Example A.2.13. Suppose that the space X can be partitioned into disjoint
sets A, B such that ν(A) = µ(B) = 0. Then D(µ, ν) = µ(A) + ν(B) = µ(X ) +
ν(X ). ù
Λ∗ µ. (A.2.2)
f = Λg. (A.2.3)
h∗ µ = Λ∗h.
WEAK TOPOLOGY
$$A_{f,c} = \{\, \mu : \mu f < c \,\}$$
for open sets G and real numbers c. Let us find some countable subbases. Since the space X is separable, there is a sequence $U_1, U_2, \ldots$ of open sets that forms a basis of X. Then we can restrict the subbasis of the space of measures to those sets $B_{G,c}$ where G is the union of a finite number of basis elements $U_i$ of X and c is rational. This way, the space $(\mathcal M, \tau_w)$ itself has the second countability property.
It is more convenient to define a countable subbasis using bounded con-
tinuous functions f , since the function µ 7→ µ f is continuous on such func-
tions, while µ 7→ µU is typically not continuous when U is an open set.
Example A.2.17. If $\mathbf X = \mathbb R$ and U is the open interval (0, 1), the sequence of probability measures $\delta_{1/n}$ (concentrated on 1/n) converges to $\delta_0$, but $\delta_{1/n}(U) = 1$ while $\delta_0(U) = 0$. ù
For some fixed dense set D, let F1 = F1 (D) be the set of functions intro-
duced in Definition A.1.21.
3. For every Borel set A that is a continuity set of µ, we have $\mu_n(A) \to \mu(A)$.
4. For every closed set F, $\limsup_n \mu_n(F) \le \mu(F)$.
5. For every open set G, $\liminf_n \mu_n(G) \ge \mu(G)$.
σM (A.2.6)
Having a topology over the set of measures we can also extend Proposi-
tion A.2.10:
As mentioned above, for an open set G the value µ(G) is not a continuous
function of the measure µ. We can only say the following:
property. It can happen that the function µ we end up with in the limit is not continuous with respect to monotone convergence. Let us therefore metrize the space of measures: then an arbitrary measure can be defined as the limit of a Cauchy sequence of simple measures.
One metric that generates the topology of weak convergence is the fol-
lowing.
It can be shown that this is a metric and it generates the weak topology.
In computer science, it has been reinvented under the name of “earth mover distance”. The following important result helps visualize it:
Since weak topology has the second countability property, the metric
space defined by the distance ρ(·, ·) is separable. This can also be seen di-
rectly: let us define a dense set in weak topology.
Definition A.2.19. For each point x, let us denote by $\delta_x$ the measure which concentrates its total weight 1 in the point x. Let D be a countable everywhere dense set of points in X. Let $D_{\mathcal M}$ be the set of finite convex rational combinations of measures of the form $\delta_{x_i}$ where $x_i \in D$, that is those probability measures that are concentrated on finitely many points of D and assign rational values to them. ù
The definition of the Prokhorov distance uses quantification over all Borel
sets. However, in an important simple case, it can be handled efficiently.
Another useful distance for measures over a bounded space is the Wasser-
stein distance.
$$W(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1(X)} |\mu f - \nu f|.$$
Proof. Let $M = \sup_{x,y \in X} d(x, y)$. The proof actually shows, for $\varepsilon < 1$:
Now suppose $W(\mu, \nu) \le \varepsilon^2 < 1$. For a Borel set A define $g_\varepsilon^A(x) = |1 - \rho(x, A)/\varepsilon|^+$. Then we have $\mu(A) \le \mu(g_\varepsilon^A)$ and $\nu g_\varepsilon^A \le \nu(A^\varepsilon)$. Further $\varepsilon g_\varepsilon^A \in \mathrm{Lip}_1$, and hence $W(\mu, \nu) \le \varepsilon^2$ implies $\varepsilon \mu(g_\varepsilon^A) \le \varepsilon \nu(g_\varepsilon^A) + \varepsilon^2$. This concludes the proof by
RELATIVE COMPACTNESS
So, if (X , d) is not compact then the set of measures is not compact. But
still, by Proposition A.2.27, each finite measure is “almost” concentrated on
a compact set.
SEMIMEASURES
Let us generalize the notion of semimeasure to the case of general Polish spaces. We use Proposition A.2.19 as a starting point.
FUNCTIONALS OF MEASURES
If the function f is unbounded, we can choose points x_n with f(x_n) > 2ⁿ; let µ be the measure Σ_n 2^{−n} δ_{x_n}. Then by linearity we have L(µ) = Σ_n 2^{−n} f(x_n) ≥ Σ_n 1 = ∞, which is not allowed.
The function µ ↦ µf is continuous and coincides with F(µ) on the dense set D_M, so they are equal.
B CONSTRUCTIVITY
B.1 Computable topology

B.1.1 REPRESENTATIONS
There are several equivalent ways in which notions of computability can be extended to spaces like the real numbers, metric spaces, measures, and so on. We use the concepts of numbering (notation) and representation, as defined in [55].
Note that
l(ι(x)) = (2l(x) + 5) ∨ 6. (B.1.2)
For strings x, x_i ∈ Σ*, p, p_i ∈ Σ^ℕ, k ≥ 1, appropriate tupling functions 〈x₁, …, x_k〉, 〈x, p〉, 〈p, x〉, and so on can be defined with the help of 〈·, ·〉 and ι(·). ⌟
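For illustration, here is a hypothetical encoding ι with exactly the length behavior of (B.1.2); the actual ι of [55] is defined earlier in the appendix and may differ in detail.

    def iota(x: str) -> str:
        """A self-delimiting code: length 2*len(x)+5 for nonempty x, and 6 for the empty word."""
        if not x:
            return "111111"
        return "110" + "".join(c + "0" for c in x) + "11"

    def decode_iota(stream: str):
        """Strip one iota-codeword off the front of a bit stream; return (word, rest)."""
        if stream.startswith("111111"):
            return "", stream[6:]
        assert stream.startswith("110")
        i, bits = 3, []
        while stream[i:i + 2] != "11":   # chunks "c0" each carry one payload bit c
            bits.append(stream[i])
            i += 2
        return "".join(bits), stream[i + 2:]

    for x in ("", "0", "0110"):
        assert len(iota(x)) == max(2 * len(x) + 5, 6)
        assert decode_iota(iota(x) + "rest") == (x, "rest")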
¹ Any function g realizing f via (B.1.3) automatically has a certain extensionality property: if δ₁(y) = δ₁(y′) then g(y) = g(y′).
Let us start with the definition of a topology with the help of a subbasis of open sets.
Definition B.1.6. Due to the T₀ property, every point in our space is determined uniquely by the set of open sets containing it. Thus, there is a representation γ_X of X defined as follows. We say that γ_X(p) = x if En_Σ(p) = { w : x ∈ ν(w) }. If γ_X(p) = x then we say that the infinite sequence p is a complete name of x: it encodes all names of all subbasis elements containing x. From now on, we will call γ_X the complete standard representation of the space X. ⌟
Proof. The proof is not difficult, but it relies on the discrete nature of the space Σ* and on the fact that the space Σ^ℕ is compact and its basis consists of sets that are open and closed at the same time.
It is easy to see that if two sets are constructive open then so is their union. The above remark implies that in a space of the form Y₁ × ⋯ × Y_n, where each Y_i is either Σ* or Σ^ℕ, the intersection of two recursively open sets is also recursively open. We will see that this statement holds, more generally, in all computable metric spaces.
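In the Cantor space this is quite concrete: a constructive open set can be presented as an enumeration of cylinder sets [w]. A minimal Python sketch (an added illustration, assuming both enumerations are infinite, repetitions allowed):

    from itertools import count, islice

    def union(u, v):
        """Enumerate the cylinders of U ∪ V by alternating the two enumerations."""
        u, v = iter(u), iter(v)
        while True:
            yield next(u)
            yield next(v)

    def intersection(u, v):
        """[w] ∩ [w'] is the cylinder of the longer word when one word extends
        the other, and empty otherwise; dovetail so every pair gets examined."""
        seen_u, seen_v, u, v = [], [], iter(u), iter(v)
        while True:
            seen_u.append(next(u))
            seen_v.append(next(v))
            pairs = [(a, seen_v[-1]) for a in seen_u] + [(seen_u[-1], b) for b in seen_v[:-1]]
            for a, b in pairs:
                if a.startswith(b) or b.startswith(a):
                    yield max(a, b, key=len)

    U = ("0" * n + "1" for n in count())   # [1], [01], [001], ...: sequences containing a 1
    V = ("0" * n for n in count(1))        # [0], [00], ...: the set [0], presented redundantly
    print(list(islice(intersection(U, V), 5)))   # ['01', '001', '001', '0001', '0001']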
We say that the space has recognizable covers if the set of finite sequences of basis elements covering the space is recursively enumerable. A compact space with recognizable covers will be called effectively compact. ⌟
Example B.1.6. Let α ∈ [0, 1] be a real number such that the set of rationals less than α is recursively enumerable but the set of rationals larger than α is not. (It is known that there are such numbers, for example Σ_{x∈ℕ} 2^{−H(x)}.) Let X be the subspace of the real line on the interval [0, α], with the induced topology. A basis of this topology is the set β of all nonempty intervals of the form I ∩ [0, α], where I is an open interval with rational endpoints.
This space is compact, but not effectively so. ⌟
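The one-sided effectivity in this example can be sketched in code (with an assumed stand-in series, since H itself is not computable): enumerating the terms of a series yields increasing rational lower bounds on its sum α, so q < α is eventually recognized, while no analogous procedure confirms q > α.

    from fractions import Fraction

    def lower_bounds(terms):
        """Yield the increasing partial sums of an enumerated series of nonnegative rationals."""
        s = Fraction(0)
        for t in terms:
            s += t
            yield s

    def recognize_less(q, terms):
        """Halts (returns True) when q < alpha; for the true infinite series it
        would run forever when q >= alpha -- the recursively enumerable side."""
        for s in lower_bounds(terms):
            if q < s:
                return True

    demo = (Fraction(1, 3 ** k) for k in range(1, 60))  # stand-in series with sum alpha = 1/2
    print(recognize_less(Fraction(49, 100), demo))      # True, after four terms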
The following is immediate.
Proof. For any rational number r we must be able to recognize that the minimum is greater than r. For each r, the set { x : f(x) > r } is a lower semicomputable open set. It is the union of an enumerated sequence of canonical basis elements. If it covers X, this will be recognized in a finite number of steps.
Proof. Let β_X and β_Y be the enumerated bases of the spaces X and Y respectively. Since f is computable, there is a recursively enumerable set E ⊆ β_X × β_Y such that f⁻¹(V) = ⋃_{(U,V)∈E} U holds for all V ∈ β_Y. Let E_V = { U : (U, V) ∈ E }; then f⁻¹(V) = ⋃ E_V.
B.1.6 SEMICOMPUTABILITY
Lower semicomputability is a constructive version of lower semicontinuity, as given in Definition A.1.12, but the sets that are required to be open there are required to be constructive open here. The analogues of Proposition A.1.5 and Corollary A.1.7 also hold: a lower semicomputable function is the supremum of simple constant functions defined on basis elements, and a lower semicomputable function defined on a subset can always be extended to one over the whole space.
Let Y = (R₊, σ_{R<}, ν_{R<}) be the effective topological space introduced in Example B.1.2.2, in which ν_{R<} is an enumeration of all open intervals of the form (q, ∞) with rational q.
In the important special case of Cantor spaces, the basis is given by the
set of finite sequences. In this case we can also require the function g(w) to
be monotonic in the words w:
2. Consider the metric space of Example A.1.10.6: the Cantor space (X, d). Let s₀ ∈ Σ be a distinguished element. For each finite sequence x ∈ Σ*, define the infinite sequence ξ_x = x s₀ s₀ ⋯. The elements ξ_x form a (naturally enumerated) dense set in the space X, turning it into a constructive metric space. ⌟
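Distances between these dense points are computable exactly from the finite words, as in this small sketch (assuming the usual Cantor-space metric d(p, q) = 2^{−n} with n the length of the longest common prefix):

    def xi_distance(x: str, y: str, s0: str = "0") -> float:
        """d(xi_x, xi_y): pad both words with s0; beyond the padding they agree."""
        m = max(len(x), len(y))
        p, q = x.ljust(m, s0), y.ljust(m, s0)
        n = next((i for i in range(m) if p[i] != q[i]), None)
        return 0.0 if n is None else 2.0 ** (-n)

    print(xi_distance("01", "0"))    # xi_01 = 0100... and xi_0 = 000... differ at index 1: 0.5
    print(xi_distance("1", "100"))   # the same point of X, so the distance is 0.0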
Let us point out a property of metric spaces that we use frequently.
Note that if the space has the manifest inclusion property, then for every pair (x, b) with x ∈ b there is a manifest representation of x beginning with b.
A constructive metric space has the manifest inclusion property as a topo-
logical space, and Cauchy representations are manifest.
Similarly to the definition of a computable sequence of computable functions in Subsection B.1.5, we can define the notion of a computable sequence of bounded computable functions, or a computable sequence f_i of computable Lipschitz functions: the bound and the Lipschitz constant of f_i are required to be computable from i. The following statement shows, in an effective form, that a function is lower semicomputable if and only if it is the supremum of a computable sequence of computable functions.
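The easy direction is that such a supremum is lower semicomputable: running more stages can only raise the current lower bound. A toy sketch (names illustrative, not from the text):

    def sup_lower_bound(fs, x, stage):
        """Best lower bound on sup_n f_n(x) obtainable from the first `stage` functions."""
        return max(f(x) for f in fs[:stage])

    # a computable sequence f_n(x) = min(n*x, 1); on [0,1] its supremum is the
    # indicator of (0,1], a lower semicontinuous but discontinuous function
    fs = [lambda x, n=n: min(n * x, 1.0) for n in range(1, 11)]

    for stage in (1, 3, 10):
        print(stage, sup_lower_bound(fs, 0.25, stage))   # nondecreasing: 0.25, 0.75, 1.0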
If U is a constructive open set, then there is a computably enumerable set S_U ⊂ D × ℚ with U = ⋃_{(x,r)∈S_U} B(x, r). Let S′_U = ⋃{ Γ(x, r) : (x, r) ∈ S_U }; then

U ∩ V = ⋃_{(x,r)∈S′_U ∩ S′_V} B(x, r).
Proof. Suppose first that the space is effectively compact. For each ε, let B₁, B₂, … be a list of all canonical balls with radius ε. This sequence covers the space, so already some B₁, …, B_n covers the space, and this will be recognized.
Suppose now that for every rational ε one can find a finite set of ε-balls covering the space. Let S ⊆ β be a finite set of basis elements (balls) covering the space. For each element G = B(u, r) ∈ S, let G_ε = B(u, r − ε) be its ε-interior, and S_ε = { G_ε : G ∈ S }. Then G = ⋃_{ε>0} G_ε, and X = ⋃_{ε>0} ⋃ S_ε. Compactness implies that there is an ε > 0 such that already S_ε covers the space. Let B₁, …, B_n be a finite set of ε/2-balls B_i = B(c_i, r_i) covering the space that can be computed from ε. Each of these balls B_i intersects one of the sets G_ε = B(u, r − ε), giving d(u, c_i) ≤ r − ε/2. But then B_i ⊆ B(u, r) in a recognizable way. Once all the relations d(u, c_i) ≤ r − ε/2 have been recognized, we will also recognize that S covers the space.
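For the Cantor space, recognizing covers is even decidable, which makes effective compactness concrete. A small added check:

    from itertools import product

    def covers(words):
        """Do the cylinders [w], w in words, cover the whole space {0,1}^N ?
        Decidable: every word of the maximal length must have a prefix in words."""
        m = max((len(w) for w in words), default=0)
        if m == 0:
            return bool(words)   # only the empty word: its cylinder is the whole space
        return all(any("".join(u).startswith(w) for w in words)
                   for u in product("01", repeat=m))

    print(covers(["0", "10", "11"]))   # True
    print(covers(["0", "10"]))         # False: sequences starting with 11 are missed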
B.2 Constructive measure theory

The basic concepts and results of measure theory are recalled in Section A.2.
For the theory of measures over metric spaces, see Subsection A.2.6. We
introduce a certain fixed, enumerated sequence of Lipschitz functions that
will be used frequently. Let E be the set of functions introduced in Defini-
tion A.1.21. The following construction will prove useful later.
(M, ρ, D_M, α_M) (B.2.1)
Proof sketch. To prove the theorem for bounded Lipschitz functions, we can invoke the Strassen coupling theorem A.2.22.
The function f can be obtained as the limit of a computable monotone increasing sequence of computable Lipschitz functions f_n^<, and also as the limit of a computable monotone decreasing sequence of computable Lipschitz functions f_n^>. In step n of our computation of µf, we can approximate µf_n^> from above to within 1/n, and µf_n^< from below to within 1/n. Let these bounds be a_n^> and a_n^<. To approximate µf to within ε, find a stage n with a_n^> − a_n^< + 2/n < ε.
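This stopping rule is directly programmable. A sketch (the two oracles below are assumptions standing in for the approximations of µf_n^> and µf_n^<):

    from fractions import Fraction

    def integral_within(upper, lower, eps):
        """Trap mu(f) in the bracket [lower(n), upper(n)] and stop once the
        bracket, with the 1/n slack on each side, fits inside eps."""
        n = 1
        while True:
            hi, lo = upper(n), lower(n)
            if hi - lo + Fraction(2, n) < eps:   # the condition from the proof
                return (hi + lo) / 2             # any point of the bracket will do
            n += 1

    # toy oracles for a measure with mu(f) = 1/2 (an assumption for the demo)
    hi = lambda n: Fraction(1, 2) + Fraction(1, n)
    lo = lambda n: Fraction(1, 2) - Fraction(1, n)
    print(integral_within(hi, lo, Fraction(1, 10)))   # a rational within 1/10 of 1/2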
Proof. The “only if” part follows from Proposition B.2.2. For the “if” part, note that in order to trap µ within some Prokhorov neighborhood of size ε, it is sufficient to compute µg_i to within a small enough δ, for all i ≤ n, for a large enough n.
BIBLIOGRAPHY
[13] Imre Csiszár and János Körner. Information Theory. Academic Press,
New York, 1981. 1.6.4, 1.6.6, 3.1.1
[16] Terrence Fine. Theories of Probability. Academic Press, New York, 1973.
1.1.3
[18] Peter Gács. Exact expressions for some randomness tests. Z. Math. Logik Grundl. Math., 26:385–394, 1980. Short version: Springer Lecture Notes in Computer Science 67 (1979) 124–131. 1.1.3, 2.3.6
[19] Peter Gács. On the relation between descriptional complexity and algorithmic probability. Theoretical Computer Science, 22:71–93, 1983. Short version: Proc. 22nd IEEE FOCS (1981) 296–303. 1.1.3
[22] Peter Gács and János Körner. Common information is far less than
mutual information. Problems of Control and Inf. Th., 2:149–162, 1973.
3.1.1
[23] Peter Gács, John Tromp, and Paul Vitányi. Algorithmic statistics. IEEE Transactions on Information Theory, 47:2443–2463, 2001. arXiv:math/0006233 [math.PR]. 3.1, 3.1.2
[24] Oded Goldreich, Shafi Goldwasser, and Silvio Micali. How to construct
random functions. Journal of the Association for Computing Machinery,
33(4):792–807, 1986. 1.1.2
[25] Peter Hertling and Klaus Weihrauch. Randomness spaces. In Proc.
of ICALP’98, volume 1443 of Lecture Notes in Computer Science, pages
796–807. Springer, 1998. 4.1.1, 4.1.4
[26] Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Inform. Transmission, 1(1):1–7, 1965. 1.1, 1.1.3
[27] Andrei N. Kolmogorov. Foundations of the Theory of Probability.
Chelsea, New York, 1956. 1.1
[28] Andrei N. Kolmogorov. A logical basis for information theory and probability theory. IEEE Transactions on Information Theory, IT-14:662–664, 1968. 1.1, 1.1.3
[29] Leonid A. Levin. On the notion of a random sequence. Soviet Math.
Dokl., 14(5):1413–1416, 1973. 1.1.3, 4.1.1, 4.1.1
[30] Leonid A. Levin. Laws of information conservation (nongrowth) and
aspects of the foundations of probability theory. Problems of Inform.
Transm., 10(3):206–210, 1974. 1.1.2, 1.1.3, 1.6.5, 3.1.2
[31] Leonid A. Levin. Uniform tests of randomness. Soviet Math. Dokl.,
17(2):337–340, 1976. 4.1.1, 2, 4.3.5
[32] Leonid A. Levin. Randomness conservation inequalities: Information
and independence in mathematical theories. Information and Control,
61(1):15–37, 1984. 1.1.2, 1.1.3, 2.2.3, 3.1.4, 3.1.2, 4.1.1, 1, 2, 4.3.5
[33] Leonid A. Levin. One-way functions and pseudorandom generators. Combinatorica, 7(4):357–363, 1987. 1.1.2
[34] Leonid A. Levin and V.V. V’yugin. Invariant properties of information bulks. In Proceedings of the Math. Found. of Comp. Sci. Conf., volume 53 of Lecture Notes in Computer Science, pages 359–364. Springer, 1977. 1.1.3
[35] Ming Li and Paul M. B. Vitányi. Introduction to Kolmogorov Complexity
and its Applications (Second edition). Springer Verlag, New York, 1997.
1.1.3, 2.2.1