arXiv:0708.1362v2 [cond-mat.stat-mech] 23 Oct 2008
Physical limits of inference
David H. Wolpert
MS 269-1, NASA Ames Research Center, Moffett Field, CA 94035, USA
Abstract
I show that physical devices that perform observation, prediction, or recollection share an underlying mathematical structure. I call devices with that structure "inference devices". I present a set of existence and impossibility results concerning inference devices. These results hold independent of the precise physical laws governing our universe. In a limited sense, the impossibility results establish that Laplace was wrong to claim that even in a classical, non-chaotic universe the future can be unerringly predicted, given sufficient knowledge of the present. Alternatively, these impossibility results can be viewed as a non-quantum mechanical "uncertainty principle". Next I explore the close connections between the mathematics of inference devices and of Turing Machines. In particular, the impossibility results for inference devices are similar to the Halting theorem for TMs. Furthermore, one can define an analog of Universal TMs (UTMs) for inference devices. I call those analogs "strong inference devices". I use strong inference devices to define the "inference complexity" of an inference task, which is the analog of the Kolmogorov complexity of computing a string. However no universe can contain more than one strong inference device. So whereas the Kolmogorov complexity of a string is arbitrary up to specification of the UTM, there is no such arbitrariness in the inference complexity of an inference task. I end by discussing the philosophical implications of these results, e.g., for whether the universe "is" a computer.
Key words: Turing machine, automata, observation, prediction, multiverse, Kolmogorov complexity
PACS: 03.65.Ta, 89.20.Ff, 02.70.-c, 07.05.Tp, 89.70.Eg, 01.70.+w
1. Introduction
Some of the most fruitful investigations of the foundations of physics began by identifying a
set of features that are present in all physical realizations of a particular type of information pro-
cessing. The next step in these investigations was to abstract and formalize those shared features.
Once that was done, one could explore the mathematical properties of those features, and thereby
Email address: [email protected] (David H. Wolpert).
URL: ti.arc.nasa.gov/people/dhw (David H. Wolpert).
Preprint submitted to Elsevier 23 October 2008
analyze some aspects of the relationship between physics and information processing. Examples
of such investigations include the many decades of work on the relationship between physics
and computation [11,12,13,14,15,16,17,18,19,20,21,22,23,22,24], the work on observation that
started with Everett's seminal paper [25], and more recent work that considers what possible
forms physical reality might have [26,27,28,29,30,31,32,33,34,35,36].
In this spirit, here we first present archetypal examples of physical devices that perform observation, of physical devices that perform prediction, and of physical devices that perform recollection. We then identify a set of features common to those examples. This is our first contribution, that such physical devices share those features.
Next we formalize those features, defining any device possessing them to be an "inference device". To do this requires our second contribution: a formalization of the concept of semantic information content.¹ Loosely speaking, we define the semantic information content of a variable s concerning a variable r to be what an external scientist can infer about what the value of r is in their particular universe by knowing the state of s. Note the central role in this definition of the scientist external to the device. As discussed below, in the context of using inference devices for observation, this central role of the external scientist is in some ways more consistent with Wigner's view of the observation process than with the many-worlds view of that process.
For the remainder of the paper we develop the theory of inference devices, thereby analyzing
numerous aspects of the relationship between physics and information processing. Our goal in
this endeavor is to illustrate the breadth of the theory of inference devices; an exhaustive analysis
of any one aspect of that theory is beyond what can fit into this single paper.
A recurring theme in our analysis of inference devices is their relationship with Turing Machines (TMs). In particular, there are impossibility results for inference devices that are similar
to the Halting theorem for TMs. Furthermore, one can define an analog of Universal TMs (UTMs) for inference devices. We call those analogs "strong inference devices".
A central result of this paper is how to use strong inference devices to define the "inference complexity" of an inference task, which is the analog of the Kolmogorov complexity of computing a string. A task-independent bound is derived on how much the inference complexity of an inference task can differ for two different inference devices. This is analogous to the "encoding" bound governing how much the Kolmogorov complexity of a string can differ between two UTMs used to compute that string. However no universe can contain more than one strong inference device. So whereas the Kolmogorov complexity of a string is arbitrary up to specification of the UTM, there is no such arbitrariness in the inference complexity of an inference task.
After presenting inference complexity, we informally discuss the philosophical implications of all of our results to that point. In particular, we discuss what it might mean for the universe to "be" a computer. We also show how much of philosophy can be reduced to constraint satisfaction problems, potentially involving infinite-dimensional spaces. We follow this discussion by deriving some graph-theoretic properties governing the possible inference relationships among any set of multiple inference devices in the same universe.
Our next contribution is an extension of the inference devices framework to include physical
devices that are used for control. Associated impossibility results provide fundamental limits on
the capabilities of physical control systems. After this we present an extension of the framework
to probabilistic inference devices. Of all the results in this paper, it is the impossibility results concerning probabilistic inference devices that are the most similar to quantum mechanical impossibility results. We end by presenting an extension of the framework that clarifies its relation with semantic information.

¹ In contrast to the concept of syntactic information content, whose formalization by Shannon is the basis of conventional information theory [37].
The crucial property underlying our results is that inference devices are embodied in the very
physical system (namely the universe) about which they are making inferences. This embedding
property and its consequences have nothing to do with the precise laws governing the underlying universe. In particular, those consequences do not involve chaotic dynamics as in [17,18], nor quantum mechanical indeterminism. Similarly, they apply independent of the values of any physical constants (in contrast, for example, to the work in [12]), and more generally apply to every universe in a multiverse. Nor do the results presume limitations on where in the Chomsky hierarchy an inference device lies. So for example they would apply to oracles, if there can be oracles in our universe. In the limited sense of our impossibility results, Laplace was wrong to claim that even in a classical, non-chaotic universe the future can be unerringly predicted, given sufficient knowledge of the present [38]. Alternatively, these impossibility results can be viewed as a non-quantum mechanical "uncertainty principle".
All non-trivial proofs are in App. A. An earlier analysis addressing some of the issues considered in this paper can be found in [26].
1.1. Notation
We will take the set of binary numbers B to equal {−1, 1}, so that logical negation is indicated by the minus sign. We will also take Θ to be the Heaviside theta function that equals 1 if its argument is non-negative, 0 otherwise. N is the natural numbers, 1, 2, . . .. For any function Γ with domain U, we will write the image of U under Γ as Γ(U). For any function Γ with domain U that we will consider, we implicitly assume that Γ(U) contains at least two distinct elements. For any (potentially infinite) set W, |W| is the cardinality of W. For any real number a ∈ ℝ, ⌈a⌉ is the smallest integer greater than or equal to a. Given two functions Γ_1 and Γ_2 with the same domain U, we write Γ_1 ⊗ Γ_2 for the function with domain U obeying ∀u ∈ U : u ↦ (Γ_1(u), Γ_2(u)), and with some abuse of terminology refer to this as the "product" of Γ_1 and Γ_2.
Given a function Γ with domain U, we say that the partition induced by Γ is the family of subsets {Γ⁻¹(γ) : γ ∈ Γ(U)}. Intuitively, it is the family of subsets of U each of which consists of all elements having the same image under Γ. We will say that a partition A over a space U is a fine-graining of a partition B over U (or equivalently, that B is a coarse-graining of A) iff every a ∈ A is a subset of some b ∈ B. Two partitions A and B are fine-grainings of each other iff A = B. Say a partition A is finite and a fine-graining of a partition B. Then |A| = |B| iff A = B.
Given a probability measure, the mutual information between two associated random variables a, b conditioned on event c is written M(a, b | c). The Shannon entropy of random variable a is H(a).
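These partition notions are easy to verify mechanically on a finite U. The following sketch is ours, not the paper's; all function and variable names are illustrative assumptions:

```python
def induced_partition(U, Gamma):
    """The partition of U induced by Gamma: one block per value in Gamma(U),
    each block holding all u with the same image under Gamma."""
    blocks = {}
    for u in U:
        blocks.setdefault(Gamma(u), set()).add(u)
    return {frozenset(b) for b in blocks.values()}

def is_fine_graining(A, B):
    """True iff partition A fine-grains partition B, i.e. every block of A
    is a subset of some block of B."""
    return all(any(a <= b for b in B) for a in A)

U = {0, 1, 2, 3}
G1 = lambda u: u % 2             # induces {{0, 2}, {1, 3}}
G2 = lambda u: u // 2            # induces {{0, 1}, {2, 3}}
G12 = lambda u: (G1(u), G2(u))   # the "product" of G1 and G2

# The partition induced by the product refines both factors' partitions.
A = induced_partition(U, G12)
assert is_fine_graining(A, induced_partition(U, G1))
assert is_fine_graining(A, induced_partition(U, G2))
assert len(A) == 4  # here the product separates every element of U
```

Note that the product's induced partition is always a fine-graining of each factor's induced partition, which is the sense in which the product carries at least as much information as either factor.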
2. Archetypal examples
We now illustrate that many (if not all) physical realizations of the processes of observation,
prediction, and memory share a certain mathematical structure. We do this by semi-formally
describing each of those processes, one after the other. Each such description uses language
that is purposely very similar to the other descriptions. It is that very similarity of language that
demonstrates that the same mathematical structure arises as part of each of the processes. In the following sections of this paper we will formalize that mathematical structure, and then present our formal results concerning it.²
If the reader becomes convinced of this shared mathematical structure before reading through all the examples, (s)he is encouraged to skip to the next section. It is in that section that we formalize the shared mathematical structure, as an "inference device".
In all of the examples in this section, U is the space of all worldlines of the entire universe that are consistent with the laws of physics (whatever they may be), and u indicates an element of U.³
Example 1: We start by describing a physical system that is a general-purpose observation device, capable of observing different aspects of the universe. Let S be some particular variable concerning the universe whose value at some time t_2 we want our device to observe. If the universe's worldline is u, then the value of S at t_2 is given by some function of u (e.g., it could be given by a component of u). Write that function as φ; S(t_2) = φ(u).
The observation device consists of two parts: an observation apparatus, and a scientist who uses (and interprets) that apparatus. To make our observation, the scientist must first configure the observation apparatus to be in some appropriate state at some time t_1 < t_2. (The idea is that by changing how the observation apparatus is configured the scientist can change what aspect of the universe he observes.) That configuration of the observation apparatus at t_1 is also given by a function of the entire universe's worldline u, since the observation apparatus exists in the universe. Write that function as χ, with range χ(U).
The goal is that if the apparatus has been properly configured, then sometime after t_1 it couples with S in such a way that at some time t_3 > t_2, the output display of the observation apparatus accurately reflects S(t_2). Again, that output display exists in the universe. So its state at t_3 is a function of u; write that function as ζ.
The scientist reads the output of the apparatus and interprets that output as this attempted observation of S(t_2). It is this interpretation that imbues that output with semantic information. Without such interpretation the output is just a meaningless (!) pattern, one that happens to be physically coupled with the variable being observed. (As an extreme example of such meaningless coupling, if a tree falls in a forest, but the video that recorded the fall is encrypted in a way that the scientist cannot undo, then the scientist does not observe that the tree fell by watching the video.)
To formalize what such interpretation means, we must define semantic information. As mentioned above, we want the semantic information of a variable s concerning a variable r to be what an external scientist can infer about r by knowing the state of s. In the current example this means we require that the scientist can ask questions of the sort, "Does S(t_2) = K?" at t_3, and that ζ(u) provides the scientist with (possibly erroneous) answers to such questions. As an example, say that ζ(u) is a display presenting integers from 0 to 1000, inclusive, with a special "error" symbol for integers outside of that range. Since the scientist interprets the value on that display at t_3 as the outcome of the observation of S(t_2), by looking at the display at t_3 the scientist is provided with (possibly erroneous) answers to the question "Does S(t_2) = K?" for all 1001 values of K that can be on the display.

² Some might quibble that one or another of these examples should involve additional structure, that what is presented in that example does not fully capture the physical processes it claims to describe. (See App. B.) The important point is that the structure presented in these examples is always found in real-world instances of the associated physical processes. Whether or not there is additional structure that should be assumed is not relevant. The structure that is assumed in the examples is sufficient to establish our formal results.

³ For expository simplicity we use the language of non-quantum mechanical systems in this paper. However most of what follows holds just as well for a quantum-mechanical universe, if we interpret quantum mechanics appropriately.
To make this more precise, first note that any question like "Does S(t_2) = K?" can either be answered "yes" or "no", and therefore is a binary function of u. For every K, write this associated binary function of u as q_K; ∀K, ∀u ∈ U, q_K(u) = 1 if S(t_2) = φ(u) = K, and it equals −1 otherwise. Next, note that the brain of the scientist exists in the universe. So which (if any) of a set of such possible binary questions concerning the universe the scientist is asking at t_3 is also a function of u. We write that function as Q. In particular, we presume that any question q_K is one of the elements in the range of Q, i.e., it is one of the questions that (depending on the state of the scientist's brain then) the scientist might be asking at t_3.
Now for any particular question q_K the scientist might be asking at t_3, the answer that the scientist provides by interpreting the apparatus' output is a bit. The value of that bit is specified by the state of the scientist's brain at t_3. (The premise being that the state of the scientist's brain was affected by the scientist's reading and then interpreting the apparatus' output.) So again, since the scientist's brain exists in the universe, the value of that answer bit is a function of u. We write that function as Y.
It is the combination of Q and Y that comprise the scientist's interpretation of ζ, and thereby imbue any particular ζ(u) with semantic content. Q(u) specifies a question q_K. ζ(u) then causes Y(u) to have some associated value. We take that value to be (the scientist's interpretation of) the apparatus' answer to the question of whether q_K(u) = 1 or q_K(u) = −1 (i.e., of whether S(t_2) = K). Combining, ζ(u) causes Y(u) to have a value that we take to be (the scientist's interpretation of) the apparatus' answer to whether [Q(u)](u) = 1 or [Q(u)](u) = −1.
This scenario provides a set of requirements for what it means for the combination of the observation apparatus and the scientist using that apparatus to be able to successfully observe the state of S at t_2: First, we require that the scientist can configure the apparatus in such a way that its output at t_3 gives φ(u). We also require that the scientist can read and interpret that output. This means at a minimum that for any question of the form "Does φ(u) = K?" the scientist can both ask that question at t_3 and interpret ζ(u) to accurately answer it.
To make this fully formal, we introduce a set of binary functions with domain φ(U): for every K ∈ φ(U), f_K is the function taking φ′ ∈ φ(U) to 1 if φ′ = K, and to −1 otherwise. Note that we have one such function for every K ∈ φ(U). Our requirement for successful observation is that the observation apparatus can be configured so that, for any f_K, if the scientist were to consider an associated binary question at t_3 and interpret ζ(u) to answer the question, then the scientist's answer would necessarily equal f_K(φ(u)). In other words, there is a value c ∈ χ(U) such that for any K ∈ φ(U), there is an associated q_K ∈ Q(U) such that the combination of χ(u) = c and Q(u) = q_K implies that Y(u) = f_K(φ(u)).
Intuitively, for the scientist to use the apparatus to observe S(t_2) only means the scientist must configure the apparatus appropriately; the scientist must force the universe to have a worldline u such that χ(u) = c, and that must in turn cause ζ(u) to accurately give φ(u). In particular, to observe S(t_2) does not require that the scientist impose any particular value on Q(u). Rather Q's role is to provide a way to interpret ζ(u). The only requirement made of Q is that if the scientist were to ask a question like "Does S(t_2) equal K?", then Q(u) (as determined by the state of the scientist's brain at t_3) would equal that question, and the scientist's answer Y(u) would be appropriately set by ζ(u). It is by using Q this way that we formalize the notion that ζ(u) conveys information to the scientist concerning S(t_2). The observation is successful if for any such question the scientist might pose (as reflected in Q(u)), their associated answer (as reflected in Y(u)) properly matches the state of S at t_2.
We can motivate this use of Q in a less nuanced, more direct way. Consider a scenario where the scientist cannot both pose all binary-valued questions f_K concerning S(t_2) and correctly answer them using the apparatus' output, ζ(u). It would seem hard to justify the view that in this scenario the combination of the scientist with the apparatus makes a "successful observation" concerning S(t_2).
Note that by defining an observation device as the combination of an observation apparatus with the external scientist who is using that apparatus, we are in a certain sense arriving at a Wignerian approach to observation. In contrast to a more straight-forward many-worlds approach, we require that the state of the observation apparatus not just be correlated with the variable being observed, but in fact contain semantic information concerning the variable being observed. This makes the external scientist using the observation apparatus crucial in our approach, in contrast to the case with the many-worlds approach.
Example 2: This example is a slight variant of Ex. 1. In this variant, there is no scientist, just inanimate pieces of hardware.
We change the apparatus of Ex. 1 slightly. First, we make the output be binary-valued. We also change the configuration function χ, so that in addition to its previous duties, it also specifies a question of the form, "Does φ(u) equal K?". Then observation is successful if for any K ∈ φ(U), the apparatus can be configured appropriately, so that its output correctly answers the question of whether S(t_2) equals K. In other words, observation is successful if for any K ∈ φ(U) there is an associated c ∈ χ(U) such that having χ(u) = c implies that Y(u) = f_K(φ(u)).
Example 3: We now describe a physical system that is a general-purpose prediction device, capable of correctly predicting different aspects of the universe's future. Let S be some particular variable concerning the universe whose value at some time t_2 we want our device to predict. If the universe's worldline is u, then the value of S at t_2 is given by some function of u which we write as φ; S(t_2) = φ(u).
The prediction device consists of two parts, a physical computer, and a scientist who programs that computer to make the prediction and interprets the computer's output as that prediction. To "program the computer" means that the scientist initializes it at some time t_1 < t_2 to contain some information concerning the state of the universe and to run a simulation of the dynamics of the universe that uses that information. Accordingly, to program the computer to perform the prediction means making it be in some appropriate state at t_1. (The idea is that by changing how the computer is programmed, the scientist can change what aspect of the universe the computer predicts.) That initialization of the computer is also given by a function of the entire universe's worldline u, since the computer exists in the universe. Write that function as χ, with range χ(U).
The hope is that if the computer is properly programmed at t_1, then it runs a simulation concerning the evolution of the universe that completes at some time t_3 > t_1, and at that time displays a correct prediction of S(t_2) on its output. (In general we would like to also have t_3 < t_2, so that the simulation completes before the event being predicted actually occurs, but we don't require that.) Again, that output display exists in the universe. So its state at t_3 is a function of u; write that function as ζ.
The scientist reads the output of the computer and interprets it as this attempted prediction of S(t_2), thereby imbuing that output with semantic meaning. More precisely, for the value ζ(u) to convey information to the scientist at t_3, we require that the scientist can ask questions of the sort, "Does S(t_2) = K?" at t_3, and that ζ(u) provides the scientist with (possibly erroneous) answers to such questions.
As in Ex. 1, to make this more formal, we note that any question like "Does S(t_2) = K?" is a binary function of u, of the sort q_K presented in Ex. 1. Also as in Ex. 1, the brain of the scientist exists in the universe. So which (if any) of a set of possible questions concerning the universe the scientist is asking at t_3 is also a function of u, which we again write as Q. Also as in Ex. 1, the answer of the scientist to any such question is a bit that the scientist generates by interpreting ζ(u). Since that answer is given by the state of the scientist's brain at t_3, it is a function of u, which as before we write as Y.
So for the combination of the computer and the scientist using that computer to be able to successfully predict the state of S at t_2 means two things: First, we require that the scientist can program the computer in such a way that its output at t_3 gives φ(u). We also require that the scientist can read and interpret that output. More precisely, our requirement for successful prediction is that the computer can be programmed so that, for any f_K, if the scientist were to consider an associated binary question at t_3 and interpret ζ(u) to answer the question, then the scientist's answer would necessarily equal f_K(φ(u)). In other words, there is a value c ∈ χ(U) such that for any K ∈ φ(U), there is an associated q_K ∈ Q(U) such that the combination of χ(u) = c and Q(u) = q_K implies that Y(u) = f_K(φ(u)).
Just as in Ex. 1, for the scientist to use the apparatus to predict S(t_2) only means the scientist must program the computer appropriately; the scientist must force the universe to have a worldline u such that χ(u) = c, and that must in turn cause ζ(u) to accurately give φ(u). In particular, to predict S(t_2) does not require that the scientist impose any particular value on Q(u). As before, Q's role is to provide a way to interpret ζ(u).
Note that the computer in this example is defined in terms of what it does, not in terms of how it does it. This allows our formalization of prediction to avoid all issues of where exactly in the Chomsky hierarchy some particular physical computer might lie.
Nothing in the formalizations ending Ex.'s 1 - 3 relies on the precise choices of time-ordering imposed on the values t_1, t_2, t_3, t_4. Those formalizations only concern relations between the functions χ, ζ, φ, f_K, Q, and Y, each having the entire worldline across all time as its domain. This fact means that the same sort of formalization can be applied to retrodiction, as elaborated in the following example.
Example 4: Say we have a system that we want to serve as a general-purpose recording and recollection device, capable of correctly recording different aspects of the universe and recalling them at a later time. Let S be some particular variable concerning the universe whose value at some time t_2 we want our device to record. If the universe's worldline is u, then the value of S at t_2 is given by some function of u which we write as the function φ; S(t_2) = φ(u).
The recording device consists of two parts. The first is a physical recording apparatus that records many characteristics of the universe. The second is a scientist who queries that apparatus to see what it has recorded concerning some particular characteristic of the universe, and interprets the apparatus' response as that recording. To "query the apparatus" means that the scientist makes some variable concerning the apparatus be in an appropriate state at some time t_1 > t_2. (The idea is that by changing how the apparatus is queried, the scientist can change what aspect of the universe's past the apparatus displays to the scientist.) That state imposed on the variable concerning the apparatus at t_1 is also given by a function of the entire universe's worldline u, since the apparatus exists in the universe. Write that function as χ, with range χ(U).
The hope is that if the apparatus functions properly and is properly queried at t_1, then it retrieves an accurate recording of S(t_2), and displays that recording on its output at some time t_3 > t_1. Again, that output display of the apparatus exists in the universe. So its state at t_3 is a function of u; write that function as ζ.
The scientist reads the output of the apparatus and interprets it as this recording of S(t_2), thereby imbuing that output with semantic meaning. More precisely, for the value ζ(u) to convey information to the scientist at t_3, we require that the scientist can ask questions of the sort, "Does S(t_2) = K?" at t_3, and that ζ(u) provides the scientist with (possibly erroneous) answers to such questions.
As in Ex. 1, to make this more formal, we note that any such question is a binary function of u, of the sort q_K presented in Ex. 1. Also as in Ex. 1, the brain of the scientist exists in the universe. So which (if any) of a set of possible questions concerning the universe the scientist is asking at t_3 is also a function of u, which we again write as Q. Also as in Ex. 1, the answer of the scientist to any such question is a bit that the scientist generates by interpreting ζ(u). Since that answer is given by the state of the scientist's brain at t_3, it is a function of u, which as before we write as Y.
So for the combination of the apparatus and the scientist using that apparatus to be able to successfully record and recall the state of S at t_2 means two things: First, we require that the scientist can query the apparatus in such a way that its output at t_3 gives φ(u). We also require that the scientist can read and interpret that output. More precisely, our requirement for successful recording and recollection is that the apparatus can be queried so that, for any f_K, if the scientist were to consider an associated binary question at t_3 and interpret ζ(u) to answer the question, then the scientist's answer would necessarily equal f_K(φ(u)). In other words, there is a value c ∈ χ(U) such that for any K ∈ φ(U), there is an associated q_K ∈ Q(U) such that the combination of χ(u) = c and Q(u) = q_K implies that Y(u) = f_K(φ(u)).
Just as in Ex. 1, for the scientist to use the apparatus to recall S(t_2) only means the scientist must query the apparatus appropriately; the scientist must force the universe to have a worldline u such that χ(u) = c, and that must in turn cause ζ(u) to accurately give φ(u). In particular, to recall S(t_2) does not require that the scientist impose any particular value on Q(u). As before, Q's role is to provide a way to interpret ζ(u).
Note that nothing in this example specifies how the recording process operates. This is just like how nothing in Ex. 1 specifies how the observation apparatus couples with S, and how nothing in Ex. 3 specifies what simulation the computer runs.
See [39,11,30] for discussion about the crucial role that recollection devices play in the psychological arrow of time, and of the crucial dependence of such devices on the second law of thermodynamics. As a result of their playing such a role, the limitations on recollection devices derived below have direct implications for the psychological and thermodynamic arrows of time.
Just as Ex. 2 varies Ex. 1 by removing the scientist, so Ex.'s 3 and 4 can be varied to remove the scientist.
3. Basic concepts
In this section we first formalize the mathematical structure that is shared among Ex.'s 1-4 of Sec. 2. In doing so we substantially simplify that structure. After this formalization of the shared structure in the examples we present some elementary results concerning that structure.
3.1. Inference devices
Definition 1: An (inference) device over a set U is a pair of functions (X, Y), both with domain U. Y is called the conclusion function of the device, and is surjective onto B. X is called the setup function of the device.
As an illustration, in all of Ex.'s 1-4, the setup function is the composite function (χ, Q), and the conclusion function is Y. The value of X(u) can loosely be interpreted as how the device is initialized / configured.⁴ The value of Y(u) should instead be viewed as all that the device predicts / observes / recollects when it is done. A priori, we assume nothing about how X and Y are related. Note that we do not require that the compound map (X, Y) : u ∈ U ↦ (X, Y)(u) be surjective. There can be pairs of values x ∈ X(U), y ∈ Y(U) that never arise for the same u.
Given some function Γ with domain U and some γ ∈ Γ(U), we are interested in setting up a device so that it is assured of correctly answering whether Γ(u) = γ for the actual universe u. Loosely speaking, we will formalize this with the condition that Y(u) = 1 iff Γ(u) = γ for all u that are consistent with some associated setup value of the device, i.e., for all u such that X(u) = x. If this condition holds, then setting up the device to have setup value x guarantees that the device will make the correct conclusion concerning whether Γ(u) = γ. (Hence the terms "setup function" and "conclusion function" in Def. 1.)
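On a finite toy U this setup/conclusion condition is a one-line check. The worldline encoding and names below are our illustrative assumptions, not the paper's:

```python
def answers_correctly(U, X, Y, Gamma, gamma, x):
    """True iff for every worldline u consistent with the setup value x
    (i.e. X(u) = x), the conclusion satisfies Y(u) = 1 iff Gamma(u) = gamma."""
    sub = [u for u in U if X(u) == x]
    return bool(sub) and all((Y(u) == 1) == (Gamma(u) == gamma) for u in sub)

# Worldlines u = (setup value, value of Gamma). The conclusion bit is wired
# so that setup "ask0" answers "Gamma(u) = 0?" and "ask1" answers "Gamma(u) = 1?".
U = [("ask0", 0), ("ask0", 1), ("ask1", 0), ("ask1", 1)]
X = lambda u: u[0]
Gamma = lambda u: u[1]
Y = lambda u: 1 if u[0] == "ask" + str(u[1]) else -1

assert answers_correctly(U, X, Y, Gamma, 0, "ask0")
assert answers_correctly(U, X, Y, Gamma, 1, "ask1")
assert not answers_correctly(U, X, Y, Gamma, 1, "ask0")
```

In this toy universe, fixing X(u) = "ask0" does not fix Y(u): both conclusion values occur among the consistent worldlines, which is exactly the non-uniqueness allowed by the condition.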
Note that this desired relationship between X, Y and Γ can hold even if X(u) = x doesn't fix a unique value for Y(u). Such non-uniqueness is typical when the device is being used for observation. Setting up a device to observe a variable outside of that device restricts the set of possible universes; only those u are allowed that are consistent with the observation device being set up that way to make the desired observation. But typically just setting up an observation device to observe what value a variable has doesn't uniquely fix the value of that variable.
In general we will want to predict / observe / recollect a function Γ that can take on more than two values. This is done by appropriately choosing X(u). As mentioned, X(u) specifies what is known about the outside world together with a simulation program (in the case of computer-based prediction), or a specification of how to set up an observation apparatus (in the case of observation), or a specification of what to remember (in the case of a memory device). But in addition, in all those cases X(u) specifies one of the possible values of Γ(u) (i.e., it specifies a question of the form "Does Γ(u) = γ?"). We then view the device's conclusion bit as saying whether Γ(u) does / doesn't have that specified value. So for example if our device is a computer being used to predict the value of some variable concerning the state of the world, then formally speaking, the setup of the computer specifies a particular one of the possible values of that variable (in addition to specifying other information like what simulation to run, what is known about the outside world, etc.). Our hope is that the computer's conclusion bit correctly answers whether the variable has that value specified in how the computer is set up.
Intuitively, this amounts to using a unary representation of Γ(U). To formalize this with minimal notation, we will use the following shorthand:

Definition 2: Let A be a set having at least two elements. A probe of A is a mapping from A onto B that equals 1 for one and only one argument a ∈ A.
[4] Care should be taken with this interpretation though. For example, in Ex. 1 the first component of the setup function concerns the state of u at time t₁, and Q concerns the state of u at t₃. So X straddles multiple times.
So a probe of A is a function that picks out a single one of A's possible values, i.e., it is a Kronecker delta function whose second argument is fixed, and whose image value 0 is replaced by −1.
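As a concrete sketch of Def. 2 (the set A and its elements below are invented for illustration, not taken from the text), a probe is just a Kronecker delta with its 0 output replaced by −1:

```python
def probe(a0):
    """The probe of a set centred on a0: maps the set onto B = {-1, +1},
    equalling +1 for exactly one argument."""
    return lambda a: 1 if a == a0 else -1

A = ['red', 'green', 'blue']          # a toy set A with at least two elements
f = probe('green')
print([f(a) for a in A])              # -> [-1, 1, -1]: exactly one +1
```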
3.2. Notation for inference devices
We now have the tools to define what it means for an inference device to successfully observe / predict / recall. Before presenting that definition we introduce some useful notation.
Unless specified otherwise, a device written as Cᵢ for any integer i is implicitly presumed to have domain U, with setup function Xᵢ and conclusion function Yᵢ (and similarly for no subscript). Similarly, unless specified otherwise, expressions like min_{xᵢ} mean min_{xᵢ ∈ Xᵢ(U)}.
We define a probe of a device to be a probe of the image of the device's conclusion function. Given a function Γ with domain U and a probe f of Γ(U), we write f(Γ) as shorthand for the function u ∈ U → f(Γ(u)). We write 𝒫(A) to indicate the set of all probes of a set A, and 𝒫(Γ) to indicate the set of functions over U, {f(Γ) : f ∈ 𝒫(Γ(U))}.
Probes are a shorthand way of posing queries concerning membership in a set (e.g., queries like 'is it true that u ∈ Y₁⁻¹(y) for some particular value y?'). All such queries are binary-valued (which is why the range of probes is B). So couching the analysis in terms of probes essentially amounts to representing all associated spaces in terms of bits. This has the advantage that it allows us to avoid considering the ranges of any functions that arise in the analysis. In particular, it allows us to avoid concern for whether one such range matches up with the domains and/or ranges of other functions. For example, it allows us to avoid concern for such matching between the spaces defining two different inference devices when considering whether they infer each other. (See [26] for a more elaborate way of circumventing the need of those ranges to match.)
Say we are given a set of functions over U, {D₁, d₁, D₂, d₂, . . ., E₁, e₁, E₂, e₂, . . .}. Then with some abuse of terminology, we write 'D₁ = d₁, D₂ = d₂, . . . ⇒ E₁ = e₁, E₂ = e₂, . . .' as shorthand for 'there exists a u ∈ U such that D₁(u) = d₁(u), D₂(u) = d₂(u), . . ., and for all u ∈ U such that D₁(u) = d₁(u), D₂(u) = d₂(u), . . ., it is the case that E₁(u) = e₁(u), E₂(u) = e₂(u), . . .'. We will often abuse notation even further by allowing d₁ to be an element of D₁'s range. In this case, 'D₁ = d₁ ⇒ E₁ = e₁' is shorthand for 'there exists a u ∈ U such that D₁(u) = d₁, and for all u ∈ U such that D₁(u) = d₁, it is also the case that E₁(u) = e₁(u)'.
3.3. Weak inference
We can now formalize inference as follows:
Definition 3: A device C (weakly) infers a function Γ over U iff ∀ f ∈ 𝒫(Γ(U)), ∃ x such that X = x ⇒ Y = f(Γ).

So using the definitions in the previous subsection, C weakly infers Γ iff ∀ f ∈ 𝒫(Γ(U)), ∃ x ∈ X(U) such that for all u ∈ U for which X(u) = x, Y(u) = f(Γ(u)).

Recall our stipulation that all functions over U take on at least two values, and so in particular Γ must. Therefore 𝒫(Γ(U)) is non-empty. We will write C > Γ if C infers Γ. Expanding our shorthand notation, C > Γ means that for all γ ∈ Γ(U), there is an x ∈ X(U) with the following property: for all u ∈ U such that X(u) = x, it must be that Y(u) = f_γ(Γ(u)), where f_γ is the probe of Γ(U) that equals 1 only for the argument γ.
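To make Def. 3 concrete, here is a brute-force sketch on a toy finite universe. The universe, device, and function below are invented for illustration (devices are encoded as Python dicts over U), not taken from the text:

```python
U = [0, 1, 2, 3]
Gamma = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}              # the function to infer
X = {0: 'ask-a', 1: 'ask-b', 2: 'ask-a', 3: 'ask-b'}  # setup: which value is asked about
Y = {0: 1, 1: -1, 2: -1, 3: 1}                        # +1 iff the asked value is Gamma(u)

def weakly_infers(X, Y, Gamma, U):
    """Def. 3: for every probe f of Gamma(U) there must be a setup value
    x such that X(u) = x forces Y(u) = f(Gamma(u))."""
    for gamma in set(Gamma[u] for u in U):
        f = lambda a: 1 if a == gamma else -1          # the probe centred on gamma
        if not any(
            all(Y[u] == f(Gamma[u]) for u in U if X[u] == x)
            for x in set(X[u] for u in U)
        ):
            return False
    return True

print(weakly_infers(X, Y, Gamma, U))   # -> True
```

Note that, as in the text, a different setup value x may be needed for each probe; no single x need force Y to answer every probe at once.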
As an illustration, let C be a physical computer with an output display, and let t be the time at which its prediction appears on that display. Let G be the set of all time-t states of the universe in which C's output display is +1. The laws of physics can be used to evolve G forward to a later time t′ > t; denote the resulting set of time-t′ states of the universe as H. Let Γ be the binary-valued question, 'does the state of the universe at t′ lie outside of H?'.

There is no information concerning H that can be programmed into C at some time t′′ < t that guarantees that the resultant prediction that C makes at t is a correct answer to that question. This is true no matter what t′′ is, i.e., no matter how much time C has to run that program before making its answer at time t. It is also true no matter how much time there is between t and t′. It is even true if the program with which C is initialized explicitly gives the correct answer to the question.
Similar results hold if t′ < t, i.e., if C is being used for recollection rather than prediction.

Define C′ ≡ (X′, Y′), where X′ : u → (X₁(u), X₂(u)) and Y′ ≡ Y₁. Then under our second formalization of emulation, for C₁ to emulate C₂ implies that ∀ f ∈ 𝒫(Y₂(U)) and ∀ x₂, ∃ x′ such that X′⁻¹(x′) ⊆ X₂⁻¹(x₂) and X′ = x′ ⇒ X₂ = x₂, Y′ = f(Y₂). However X′⁻¹(x′) ⊆ X₂⁻¹(x₂) already means that X′ = x′ ⇒ X₂ = x₂, by definition of X′. So the second formalization imposes a relation between C′ and C₂ that is identical to the relation between C₁ and C₂ under the first formalization. In this sense, our second formalization reduces to our first. Accordingly, we concentrate on the first formalization, and make the following definition:
Definition 5: A device (X₁, Y₁) strongly infers a device (X₂, Y₂) iff ∀ f ∈ 𝒫(Y₂(U)) and all x₂, ∃ x₁ such that X₁ = x₁ ⇒ X₂ = x₂, Y₁ = f(Y₂).
If (X₁, Y₁) strongly infers (X₂, Y₂) we write (X₁, Y₁) ≫ (X₂, Y₂).[7] See App. B for a discussion of how minimal the definition of strong inference really is.
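On a finite toy universe, Def. 5 can likewise be checked exhaustively. In the sketch below (all data invented for illustration), the first device's setup refines the second's and additionally encodes which probe to apply, so C₁ strongly infers C₂ but not vice versa:

```python
# Each u encodes (C2's setup, a probe selector, C2's conclusion).
U = [(x2, s, y) for x2 in 'pq' for s in (1, -1) for y in (1, -1)]
X2 = {u: u[0] for u in U}
Y2 = {u: u[2] for u in U}
X1 = {u: (u[0], u[1]) for u in U}      # refines X2 and picks a probe
Y1 = {u: u[1] * u[2] for u in U}       # applies that probe to Y2

def strongly_infers(dev1, dev2, U):
    """Def. 5: for each of the two probes of B and each setup value x2,
    some x1 must force both X2 = x2 and Y1 = f(Y2)."""
    (Xa, Ya), (Xb, Yb) = dev1, dev2
    for f in (lambda y: y, lambda y: -y):          # identity and negation probes
        for x2 in set(Xb[u] for u in U):
            if not any(
                all(Xb[u] == x2 and Ya[u] == f(Yb[u])
                    for u in U if Xa[u] == x1)
                for x1 in set(Xa[u] for u in U)
            ):
                return False
    return True

print(strongly_infers((X1, Y1), (X2, Y2), U))   # -> True
print(strongly_infers((X2, Y2), (X1, Y1), U))   # -> False
```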
Say we have a TM T₁ that can emulate another TM T₂, e.g., T₁ is a UTM. This means that T₁ can calculate anything that T₂ can. The analogous property holds for strong and weak inference.
[7] Note that there are only two probes of Y₂, the identity probe f(y₂) = y₂ and the negation probe f(y₂) = −y₂. Indicate those two probes by f = 1 and f = −1, respectively. Then we can express 'X₁ = x₁ ⇒ X₂ = x₂, Y₁ = f(Y₂)' in set-theoretic terms, as X₁⁻¹(x₁) ⊆ X₂⁻¹(x₂) ∩ (Y₁Y₂)⁻¹(f), where Y₁Y₂ is the function u ∈ U → Y₁(u)Y₂(u).
In addition, like UTM-style emulation (but unlike weak inference), strong inference is transitive. These results are formalized as follows:

Theorem 2: Let C₁, C₂ and C₃ be a set of inference devices over U and Γ a function over U. Then:
i) C₁ ≫ C₂ and C₂ > Γ ⇒ C₁ > Γ.
ii) C₁ ≫ C₂ and C₂ ≫ C₃ ⇒ C₁ ≫ C₃.
Strong inference implies weak inference, i.e., C₁ ≫ C₂ ⇒ C₁ > C₂. We also have the following strong inference analogs of Prop. 1(ii) and Coroll. 1 (which concern weak inference):

Proposition 2: Let C₁ be a device over U.
i) There is a device C₂ such that C₁ does not strongly infer C₂.
ii) Say that ∀ x₁, |X₁⁻¹(x₁)| > 2. Then there is a device C₂ such that C₂ ≫ C₁.
Recall that the Halting problem concerns whether there is a UTM T with the following property: given any TM T′ and any input string s′ to T′, T can determine whether T′ halts on input s′.

Definition 6: The inference complexity of Γ with respect to C is

C(Γ | C) ≡ Σ_{f ∈ 𝒫(Γ(U))} min_{x : X=x ⇒ Y=f(Γ)} [L(x)].
The inference complexity of Γ with respect to C is the sum of a set of complexities, one for each probe f of Γ. Loosely speaking, each of those complexities is the minimal amount of Shannon information that must be imposed in C's setup function in order to ensure that C correctly concludes what value f has. In particular, if Γ corresponds to a potential future state of some system S external to C, then C(Γ | C) is a measure of how difficult it is for C to predict that future state of S. Loosely speaking, the more sensitively that future state depends on current conditions, the more complex is the computation of that future state.
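As an illustration of this sum-of-minima structure, the following sketch computes the inference complexity by brute force on an invented finite universe, assuming for concreteness that L(x) is taken to be −ln of the normalized counting measure of X⁻¹(x) (that choice of measure is an assumption made here, not something the text fixes):

```python
import math

U = [0, 1, 2, 3]
Gamma = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
X = {0: 'ask-a', 1: 'ask-b', 2: 'ask-a', 3: 'ask-b'}
Y = {0: 1, 1: -1, 2: -1, 3: 1}

def length(x):
    # Assumed form of L(x): -ln of the fraction of U lying in X^-1(x).
    return -math.log(sum(1 for u in U if X[u] == x) / len(U))

def inference_complexity():
    total = 0.0
    for gamma in set(Gamma.values()):
        f = lambda a: 1 if a == gamma else -1      # the probe for this gamma
        feasible = [x for x in set(X.values())
                    if all(Y[u] == f(Gamma[u]) for u in U if X[u] == x)]
        total += min(length(x) for x in feasible)  # one minimum per probe
    return total

print(inference_complexity())   # 2 * ln 2, about 1.386
```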
Example 6: Consider a conventional real-world computer, with a subsection of its RAM set aside to contain the program it will run, and a separate subsection set aside to contain the conclusion that the program will produce. Say the total number of bits in the program subsection of the RAM is 2^k + k for some integer k. Refer to any set of 2^k + k bits as a complete string; the set of all complete strings is the set of all possible bit strings in the program subsection of the RAM.

Let S_k be the set of all bit strings s consisting of at least k bits such that the first k bits are a binary encoding of the total number of bits in s beyond those first k bits. So every element of S_k can be read into the beginning of the RAM's program subsection. For any s ∈ S_k define an associated partial string as the set of all complete strings whose first bits are s. Intuitively, for any such complete string, all of its bits beyond s are 'wild cards'. (Such partial strings are just 'the files' of real-world operating systems.) With some abuse of terminology, when we write s we will sometimes actually mean the partial string that s specifies.
We can identify a particular program input to the computer as such a partial string in its program subsection. If we append certain bits to such an s (modifying the contents of the first k bits appropriately) to get a new, longer program partial string s′, then the set of complete strings consistent with s′ is a proper subset of the set consistent with s. Accordingly, if we take logarithms to have base 2, the length of s′ exceeds the length of s.

One can also consider a variant of inference complexity in which the minimum over setup values is replaced by a sum:

Ĉ(Γ | C) ≡ −Σ_{f ∈ 𝒫(Γ(U))} ln[ μ( ∪_{x : X=x ⇒ Y=f(Γ)} X⁻¹(x) ) ] = −Σ_{f ∈ 𝒫(Γ(U))} ln[ Σ_{x : X=x ⇒ Y=f(Γ)} e^{−L(x)} ],

where the equality follows from the fact that for any x, x′ ≠ x, X⁻¹(x) ∩ X⁻¹(x′) = ∅. The argument of the ln in this modified version of inference complexity has a direct analog in TM theory: the sum, over all input strings s to a UTM that generate a desired output string s′, of 2^{−n(s)}, where n(s) is the bit length of s.
We now bound how much more complex a function can appear to C₁ than to C₂ if C₁ can strongly infer C₂.

Theorem 4: Let C₁ and C₂ be two devices and Γ a function over U where Γ(U) is finite, C₁ ≫ C₂, and C₂ > Γ. Then

C(Γ | C₁) − C(Γ | C₂) ≤ |Γ(U)| max_{x₂} min_{x₁ : X₁=x₁ ⇒ X₂=x₂, Y₁=Y₂} [L(x₁) − L(x₂)].
Note that since L(x₁) − L(x₂) = ln[μ(X₂⁻¹(x₂)) / μ(X₁⁻¹(x₁))], the bound in Thm. 4 is independent of the units with which one measures volume in U. (Cf. footnote 8.) Furthermore, recall that X₁ = x₁ ⇒ X₂ = x₂, Y₁ = Y₂ iff X₁⁻¹(x₁) ⊆ X₂⁻¹(x₂) ∩ (Y₁Y₂)⁻¹(1). (Cf. footnote 7.) Accordingly, for all (x₁, x₂) pairs arising in the bound in Thm. 4, μ(X₂⁻¹(x₂)) / μ(X₁⁻¹(x₁)) ≥ 1. So the bound in Thm. 4 is always non-negative.
An important result in the theory of UTMs is an upper bound on the difference between the Kolmogorov complexity of a string using a particular UTM T₁ and its complexity if using a different UTM, T₂. This bound is independent of the computation to be performed, and can be viewed as the Kolmogorov complexity of T₁ emulating T₂.

The bound in Thm. 4 is the analog of this UTM result, for inference devices. In particular, the bound in Thm. 4 is independent of all aspects of Γ except the cardinality of Γ(U). Intuitively, the bound is |Γ(U)| times the worst-case amount of computational work that C₁ has to do to emulate C₂'s behavior for some particular value of x₂.
6. Realities and copies of devices
In this section the discussion is broadened to allow sets of many functions to be inferred and / or many inference devices. Some of the philosophical implications of the ensuing results are then discussed.
6.1. Formal results
To analyze relationships among multiple devices and functions, define a reality as a pair (U; {F_φ}), where U is a space and {F_φ} is a set of functions all having domain U. We will often write a reality as (U; {(X_α, Y_α)}; {Γ_β}) ≡ (U; {C_α}; {Γ_β}), where {C_α} is a set of devices in the reality and {Γ_β} is a set of other functions in the reality. Define the reduced form of a reality as the range of the map u ∈ U → ×_φ F_φ(u). Expanding, this equals ∪_{u∈U} [×_α (X_α(u), Y_α(u))] × [×_β Γ_β(u)], the union over all u of the tuples formed by a Cartesian product, running over all α and β, of the values (X_α(u), Y_α(u)) and Γ_β(u). All of the relationships among devices and functions considered in this paper can be read off from the reduced form of the reality.
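The reduced form is straightforward to compute for a finite reality. The sketch below assumes devices and functions encoded as dicts over a finite U; all data are invented for illustration:

```python
def reduced_form(U, devices, functions):
    """The reduced form: the set of tuples of device (setup, conclusion)
    pairs and function values, swept over all u in U."""
    return {
        tuple((X[u], Y[u]) for (X, Y) in devices)
        + tuple(G[u] for G in functions)
        for u in U
    }

# A toy reality with one device and one other function.
U = [0, 1, 2, 3]
X = {0: 'p', 1: 'q', 2: 'p', 3: 'q'}
Y = {0: 1, 1: -1, 2: -1, 3: 1}
Gamma = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
rf = reduced_form(U, [(X, Y)], [Gamma])
print(len(rf))   # -> 4: here every u yields a distinct tuple
```

Note that distinct u's can yield the same tuple, which is exactly why the reduced form can carry less information than U itself.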
As an alternative, we can view the reduced form of the reality as encapsulating the physical meaning of the universe. In this alternative, u does not have any physical meaning. It is only the relationships among the inferences about u that one might want to make and the devices with which to try to make those inferences that have physical meaning. One could completely change the space U and the functions defined over it, but if the associated reduced form of the reality does not change, then there is no way that the devices in that reality, when considering the functions in that reality, can tell that they are now defined over a different U. In this view, the laws of physics, i.e., a choice for the set U, are simply a calculational shortcut for encapsulating patterns in the reduced form of the reality. It is a particular instantiation of those patterns that has physical meaning, not some particular element u ∈ U.
Given a reality (U; {(X₁, Y₁), (X₂, Y₂), . . .}), we say that a pair of devices in it are pairwise distinguishable if they are distinguishable. We say that a device (Xᵢ, Yᵢ) in that reality is outside distinguishable iff ∀ xᵢ ∈ Xᵢ(U) and all x₋ᵢ in the range of ×_{j≠i} Xⱼ, there is a u ∈ U such that simultaneously Xᵢ(u) = xᵢ and Xⱼ(u) = xⱼ ∀ j ≠ i. (Note that that range may be a proper subset of ×_{j≠i} Xⱼ(U).) We say that the reality as a whole is mutually (setup) distinguishable iff ∀ x₁ ∈ X₁(U), x₂ ∈ X₂(U), . . . ∃ u ∈ U s.t. X₁(u) = x₁, X₂(u) = x₂, . . ..
Proposition 3:
i) There exist realities (U; {C₁, C₂, C₃}) where each pair of devices is setup distinguishable and C₁ > C₂ > C₃ > C₁.
ii) There exists no reality (U; {Cᵢ : i ∈ N ⊆ ℕ}) where the devices are mutually distinguishable and for some integer n, C₁ > C₂ > . . . > Cₙ > C₁.
iii) There exists no reality (U; {Cᵢ : i ∈ N ⊆ ℕ}) where for some integer n, C₁ ≫ C₂ ≫ . . . ≫ Cₙ ≫ C₁.
Consider a reality with a countable set of devices {Cᵢ}. There are many ways to view such a reality as a graph, for example by having each node be a device while the edges between the nodes concern distinguishability of the associated devices, or concern whether one weakly infers the other, etc. There are restrictions on what graphs of those various sorts can exist. As an example, given a countable reality, define an associated directed graph by identifying each device with a separate node in the graph, and by identifying each relationship of the form Cᵢ ≫ Cⱼ with a directed edge going from node i to node j. We call this the strong inference graph of the reality.
Thm. 3 means that a universal device in a reality must be a root node of the strong inference graph of the reality. Applying Thm. 3 again shows that the strong inference graph of a reality with a universal device must contain exactly one root. In addition, by Thm. 2(ii), we know that every node in a reality's strong inference graph has edges that lead directly to every one of its successor nodes (whether or not there is a universal device in the reality). By Prop. 3(iii) we also know that a reality's strong inference graph is acyclic. This latter fact establishes the following:

Proposition 4: Let D be a finite subset of the devices in a reality, where the strong inference graph of the reality is weakly connected over D. Say that any pair of distinct devices in D that are not connected by an edge of the strong inference graph are setup distinguishable. Then the strong inference graph of the reality has one and only one root over D.

Results of this sort mean there are unavoidable asymmetries in the strong inference graphs of realities. These asymmetries provide a preferred direction of strong inference in realities, akin to the preferred direction in time provided by the second law of thermodynamics.
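The two structural properties just discussed, acyclicity and a unique root, are easy to check mechanically for any finite candidate strong-inference graph. The sketch below is a standard Kahn's-algorithm pass; the example edges are invented, with edge (i, j) standing for 'device i strongly infers device j':

```python
from collections import deque

def unique_root_and_acyclic(nodes, edges):
    """Return (has exactly one root, is acyclic) for a directed graph."""
    indeg = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for i, j in edges:
        indeg[j] += 1
        adj[i].append(j)
    roots = [n for n in nodes if indeg[n] == 0]
    # Kahn's algorithm: if every node gets processed, the graph is acyclic.
    q = deque(roots)
    seen = 0
    while q:
        n = q.popleft()
        seen += 1
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                q.append(m)
    return len(roots) == 1, seen == len(nodes)

# A reality whose device 1 is universal: 1 >> 2, 1 >> 3, and 2 >> 3.
print(unique_root_and_acyclic([1, 2, 3], [(1, 2), (1, 3), (2, 3)]))  # -> (True, True)
```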
Note that even if a device C₁ can strongly infer all other devices Cᵢ, i > 1, in a reality, it may not be able to infer them simultaneously (strongly or weakly). For example, define Γ : u → (Y₂(u), Y₃(u), . . .). Then the fact that C₁ is a universal device does not mean that ∀ f ∈ 𝒫(Γ(U)), ∃ x₁ : X₁ = x₁ ⇒ Y₁ = f(Γ). See the discussion in [26] on omniscient devices for more on this point.
We now define what it means for two devices to operate in an identical manner:

Definition 7: Let U and Û be two (perhaps identical) sets. Let C₁ be a device in a reality with domain U. Let R₁ be the relation between X₁ and Y₁ specified by the reduced form of that reality, i.e., x₁R₁y₁ iff the pair (x₁, y₁) occurs in some tuple in the reduced form of the reality. Similarly let R₂ be the relation between X₂ and Y₂ for some separate device C₂ in the reduced form of a reality having domain Û.

Then we say that C₁ mimics C₂ iff there is an injection ρ_X : X₂(Û) → X₁(U) and a bijection ρ_Y : Y₂(Û) → Y₁(U), such that for all x₂, y₂: x₂R₂y₂ ⇔ ρ_X(x₂)R₁ρ_Y(y₂). If both C₁ mimics C₂ and vice-versa, we say that C₁ and C₂ are copies of each other.
Note that because the injection ρ_X in Def. 7 may not be surjective, one device may mimic multiple other devices. (Surjectivity of ρ_Y simply reflects the fact that since we're considering devices, Y₁(U) = Y₂(Û) = B.) The relation of one device mimicking another is reflexive and transitive. The relation of two devices being copies is an equivalence relation.
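For small finite relations, Def. 7 can be checked by brute force, searching over all injections on setup values and all bijections on conclusion values. The relations R1 and R2 below are invented examples:

```python
from itertools import permutations

def mimics(R2, R1):
    """Does a device with setup/conclusion relation R1 mimic one with
    relation R2?  Relations are sets of (setup, conclusion) pairs."""
    X2v = sorted({x for x, _ in R2}); Y2v = sorted({y for _, y in R2})
    X1v = sorted({x for x, _ in R1}); Y1v = sorted({y for _, y in R1})
    if len(Y1v) != len(Y2v):
        return False
    for xs in permutations(X1v, len(X2v)):        # injections on setup values
        rx = dict(zip(X2v, xs))
        for ys in permutations(Y1v):              # bijections on conclusions
            ry = dict(zip(Y2v, ys))
            if all(((x, y) in R2) == ((rx[x], ry[y]) in R1)
                   for x in X2v for y in Y2v):
                return True
    return False

R2 = {('p', 1), ('p', -1), ('q', 1)}              # setup q never concludes -1
R1 = {('a', -1), ('a', 1), ('b', -1)}             # setup b never concludes +1
print(mimics(R2, R1))   # -> True: map p->a, q->b and negate conclusions
```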
Intuitively, when expressed as devices, two physical systems are copies if they follow the same inference algorithm, with the translation maps of Def. 7 relating those systems' setup and conclusion values. In particular, say a reality contains two separate physical computers that are inference devices, both being used for prediction. If those devices are copies of each other, then they form the same conclusion for the same value of their setup function, i.e., they perform the same computation for the same input.

As another example, say that the states of some physical system S at a particular time t and shortly thereafter at t + δ are identified as the setup and conclusion values of a device C₁. In other words, C₁ is given by the functions (X₁(u), Y₁(u)) ≡ (S(u_t), S(u_{t+δ})). In addition, let R_S be the relation between X₁ and Y₁ specified by the reduced form of the reality containing the system. Say that the time-translation of C₁, given by the two functions S(u_{t′}) and S(u_{t′+δ}), also obeys the relation R_S. Then the pair of functions (X₂(u), Y₂(u)) ≡ (S(u_{t′}), S(u_{t′+δ})) is another device that is a copy of C₁. So for example, the same physical computer at two separate pairs of moments is two separate devices, devices that are copies of each other, assuming they have the same set of allowed computations.
Say that an inference device C₂ is being used for observation, and that C₁ mimics C₂. The fact that C₁ mimics C₂ does not imply that C₁ can emulate the observation that C₂ makes of some outside function Γ. The mimicry property only relates C₁ and C₂, with no concern for relationships with any third function. (This is why for one device to 'emulate' another is defined in terms of strong inference rather than in terms of mimicry.)
Next, for future use, we note the following fact that is almost obvious (despite being so complicated):

Lemma 1: Let K₁ be the set of reduced forms of all device realities. Let K₂ be the set of all sets k with the following property: k can be written as {[×_{α∈A} (s^α_r, t^α_r)] × [×_{β∈B} v^β_r] : r ∈ R} for some associated A, B and R such that for all α, ∪_r t^α_r = B and |∪_r s^α_r| ≥ 2. Then K₁ = K₂. In particular, any k ∈ K₂ is the reduced form of a reality (U; {C^α}; {Γ^β}) with U = R, where for all r ∈ R, X^α(r) = s^α_r, Y^α(r) = t^α_r, and Γ^β(r) = v^β_r.
Next, fix a counting number m and a set of m cardinalities, {κᵢ : i = 1, . . . , m}. Let M be the set of all realities each of which comprises m functions, where the ranges of those m functions have the associated cardinalities {κᵢ : i = 1, . . . , m}.

Now say we ask whether there is a reality in M whose m functions have some particular relationship(s) with one another. (Answers to such questions form most of the results of the earlier parts of this paper.) Lemma 1 allows us to transform this question into a constraint satisfaction problem over an associated space of tuples. This transformation changes the set of specified relationship(s) into a set of simultaneous constraints over the associated space of tuples. The precise type of constraint satisfaction problem produced by the transformation (integer-valued, real-valued, etc.) is determined by the space of tuples under consideration, i.e., by the cardinalities of the images of the functions that constitute the reality.
Often, though, we can use Lemma 1 more directly to answer questions concerning realities, without invoking any techniques for solving constraint satisfaction problems. An example occurs in the proof of the following result:

Proposition 5: Let C₁ be a copy of C₂.
i) It is possible that C₁ and C₂ are distinguishable and C₁ > C₂, even for finite X₁(U), X₂(U).
ii) It is possible that C₁ ≫ C₂, but only if X₁(U) and X₂(U) are both infinite.
6.2. Philosophical implications
Return now to the case where U is a set of laws of physics (i.e., the set of all worldlines consistent with a set of such laws). The results of this subsection provide general restrictions that must relate any devices in such a universe, regardless of the detailed nature of the laws of that universe. In particular, these results would have to be obeyed by all universes in a multiverse [27,28,29].
Accordingly, it is interesting to consider these results from an informal philosophical perspective. Say we have a device C in a reality that is outside distinguishable. Such a device can be viewed as having 'free will', in that the way the other devices are set up does not restrict how C can be set up. Under this interpretation, Thm. 1 means that if two devices both have free will, then they cannot predict / recall / observe each other with guaranteed complete accuracy. A reality can have at most one of its devices that has free will and can predict / recall / observe the other devices in that reality with guaranteed complete accuracy. (Similar conclusions hold for whether the devices can control each other; see Sec. 7 below.)
Thm. 3 then goes further and considers devices that can emulate each other. It shows that independent of concerns of free will, no two devices can unerringly emulate each other. (In other words, no reality can have more than one universal device.) Somewhat tongue in cheek, taken together, these results could be called a 'monotheism theorem'.
Now suppose that the domain of a reality is a set of worldlines extending across time, and consider physical devices that are identified with systems evolving in time. (See the discussion just after Def. 7.) Prop. 5 tells us that any universal device must be infinite (have infinite X(U)) if there are other devices in the reality that are copies of it. Since the time-translation of a physical device is a copy of that device, this means any physical device that is ever universal must be infinite. In addition, the impossibility of multiple universal devices in a reality means that if any physical device is universal, it can only be so at one moment in time. (Its time-translation cannot be universal.) Again somewhat tongue in cheek, taken together this second set of results could be called an 'intelligent design theorem'. (See Sec. 7 for related limitations concerning devices that are used to control one another.)
In addition to the questions addressed by the monotheism and intelligent design theorems, there are many other semi-philosophical questions one can ask of the form 'Can there be a reality with the following properties?'. As mentioned above, Lemma 1 can be used to reduce all such questions to a constraint satisfaction problem, potentially involving infinite-dimensional spaces. In other words, much of philosophy can be reduced to constraint satisfaction problems.
As a final comment, while it is most straightforward to apply the results of this subsection to physical universes, they can be applied more widely. In particular, somewhat speculatively, one can consider applying them to mathematical logic itself. In such an application each u ∈ U would be a (perhaps infinite) string over some alphabet. For example, U might be defined as the set of all strings that are 'true' under some encoding that translates a string into axioms and associated logical implications. Then an inference device would be a (perhaps fallible) theorem-proving algorithm, embodied within U itself. The results of this subsection would then concern the relation among such theorem-proving algorithms.
7. Control devices
In weak inference there is no causal arrow from Γ to X. In fact, the only causal arrow goes from the device to the function being inferred (in that X's value forces something about Γ's value), rather than vice-versa. This reflects what it means for us to be able to set up a device so that it is guaranteed correct in its prediction / observation / memory.
This causal arrow from the device to the function does not mean that the device controls the function. The reason is that X's value doesn't set Γ's value, but only forces that value to be consistent with Y. This motivates the following definition:
Definition 8: A device C controls a function Γ over U iff ∀ f ∈ 𝒫(Γ(U)) and ∀ b ∈ B, ∃ x such that X = x ⇒ Y = f(Γ) = b. C semi-controls Γ iff ∀ γ ∈ Γ(U), ∃ x such that X = x ⇒ Γ = γ.

Semi-control has nothing to do with the conclusion function Y of the device; that function enters when one strengthens the definition of semi-control to get the definition of control. To see this, note that C semi-controls Γ iff ∀ f ∈ 𝒫(Γ(U)), ∃ x such that X = x ⇒ f(Γ) = 1. However if X = x forces f(Γ) = 1, then for any probe f′ ≠ f, X = x forces f′(Γ) = −1. So C semi-controls Γ iff ∀ f ∈ 𝒫(Γ(U)) and ∀ b ∈ B, ∃ x such that X = x ⇒ f(Γ) = b. This is just the definition of control, without the extra condition that control imposes on the value of Y. We say that one device C (semi-)controls another if it (semi-)controls the conclusion function of that second device.
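The semi-control clause of Def. 8 is mechanical to verify on a finite toy universe (all data below are invented; note the contrast with the weak-inference condition, which constrains Y rather than Γ):

```python
def semi_controls(X, Gamma, U):
    """Def. 8, second clause: for every gamma in Gamma(U) there is a
    setup value x with X = x forcing Gamma = gamma."""
    return all(
        any(all(Gamma[u] == g for u in U if X[u] == x)
            for x in set(X[u] for u in U))
        for g in set(Gamma[u] for u in U)
    )

U = [0, 1, 2, 3]
Gamma = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
X_fine   = {0: 's0', 1: 's1', 2: 's2', 3: 's3'}   # refines Gamma's partition
X_coarse = {0: 't0', 1: 't1', 2: 't0', 3: 't1'}   # each cell straddles both values
print(semi_controls(X_fine, Gamma, U))     # -> True
print(semi_controls(X_coarse, Gamma, U))   # -> False
```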
The weakness of the semi-control concept is that it stipulates nothing concerning whether C knows (infers) that some value x forces Γ into the state f⁻¹(b). In this, it doesn't capture the intuitive notion of 'control'. Accordingly, in the formalization of Def. 8, we stipulate that you do not fully control a function if you force it to have some value but don't know what that value is.
If the partition induced by X is a refinement of the partition induced by Γ [50], and in particular if it is a fine-graining of that partition, then C semi-controls Γ. Note also that if Γ is binary-valued, then having C semi-control Γ means there is both an x such that X(u) = x ⇒ u ∈ Γ⁻¹(1) and an x′ such that X(u) = x′ ⇒ u ∈ Γ⁻¹(−1). In the language of formal epistemology [42,43,45,44], this means that X⁻¹(x) and X⁻¹(x′) are both kens of the device.

… ≤ 2 max_{z∈H} | α²[k(z)]² + αβ k(z)m(z) + αβ k(z)n(z) + β² m(z)n(z) |.

In particular, if α = β = 1/2, then this bound becomes

max_{z∈H} | (z₁ + z₄)² − (z₂ + z₃)² | / 4 = 1/4.
The maximum for α = β = 1/2 can occur in several ways. One is when z₁ = 1, and z₂, z₃, z₄ all equal 0. At these values, both devices have an inference accuracy of 1/2 at inferring each other. Each device achieves that accuracy by perfectly inferring one probe of the other device, while performing randomly for the remaining probe.
Similarly, say that we have a volume measure dμ over U, as in Sec. 5, together with a probability measure P over U. Then we can modify the definition of the length of x to be H(U | x), the negative of the Shannon entropy, under prior dμ, of P(u | x). If, as in statistical physics, P is proportional to dμ across the support of P, then P(u | x) ∝ dμ(u | x), and these two definitions of the length of x are the same.
There are several ways to combine this new definition of length with the concept of inference accuracy to define a stochastic analog of inference complexity. In particular, we can define the stochastic inference complexity of a function Γ with respect to C, for accuracy ε, as

C_ε(Γ | C) ≡ Σ_{f ∈ 𝒫(Γ(U))} min_{x : E_P(Y f(Γ) | x) ≥ ε} [H(U | x)],

assuming the sum exists for that ε. So for example, if P is proportional to dμ across the support of P and C > Γ, then for ε = 1, C_ε(Γ | C) = C(Γ | C).
One can extend this stochastic framework to include inference of the probability of an event, e.g., have the device say whether P(Γ = γ) has some specified value. Such inference contrasts with inference accuracy, which (like non-stochastic inference) simply concerns a device's concluding whether an event occurs, e.g., concluding whether Γ(u) = γ. One can also define stochastic analogs of (semi-)control, strong inference, etc. Such extensions are beyond the scope of this paper.
9. Self-aware devices
We now return to scenarios where U has no associated probability measure. We consider devices that know what question they are trying to answer, or at least think they do. Rather than encode that knowledge in the conclusion function of the device, we split the conclusion function into two parts. The value of one of those parts is (explicitly) a question for the device, and the other part is a possible associated answer. We formalize this as follows:

Definition 12: A self-aware device is a triple (X, Y, Q) where (X, Y) is an inference device, Q is a question function with domain U where each q ∈ Q(U) is a binary function of U, and (Y, Q) is surjective onto B × Q(U).

Intuitively, a self-aware device is one that (potentially) knows what question it is answering in its conclusion. When the universe is u, we interpret q = Q(u) as the question about the state of the universe (i.e., about which subset of U contains the actual u) that the conclusion Y(u) is supposed to answer. The reason we require that (Y, Q) be surjective onto B × Q(U) is so that the device is allowed to have any conclusion for any of its questions; it's the appropriate setting of X(u) that should determine what conclusion it actually makes.
So one way to view successful inference is the mapping of any q ∈ Q(U) to an x such that X(u) = x both implies that the device's conclusion to question q is correct, i.e., Y(u) = q(u), and also implies that the device is sure it is asking question q, i.e., Q(u) = q. As an example, say we have a computer that we want to use to make a prediction. That computer can be viewed as an inference device. In this case the question q that the device is addressing is specified in the mind of the external scientist. This means that the question is a function of u (since the scientist exists in the universe), but need not be stored directly in the inference device. Accordingly, the combination of the computer with the external scientist who programs the computer is a self-aware device.
To formalize this concept, we must first introduce some notation that is frankly cumbersome, but necessary for complete precision. Let b be a value in some space. Then we define b̲ as the constant function over U whose value is b, i.e., u ∈ U → b. Intuitively, the underline operator takes any constant and produces an associated constant-valued function over U. As a particular example, let Γ be a function with domain U. Then Γ̲ is the constant function over U whose value is the function Γ, i.e., u ∈ U → Γ. Similarly, let B be a set of functions with domain U, and let A be a function with domain U whose range is B (so each A(u) is a function over U). Then we define A̅ as the function taking u ∈ U → [A(u)](u). So the overline operator turns any function over U whose range is functions over U into a single function over U. Both the underline and overline operators turn mathematical structures into functions over U; they differ in what type of argument they take. In particular, for any function Γ over U, applying the overline operator to Γ̲ recovers Γ. (Using this notation is more intuitive in practice than these complicated definitions might suggest.)
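These two operators are easy to mirror in code. The sketch below (with an invented toy U and Γ) checks the identity just stated, that applying the overline to the underline of Γ recovers Γ:

```python
def underline(b):
    """The constant function over U whose value is b."""
    return lambda u: b

def overline(A):
    """A(u) is itself a function over U; overline evaluates it at u."""
    return lambda u: A(u)(u)

U = [0, 1, 2]
Gamma = lambda u: u % 2          # a toy function over U
recovered = overline(underline(Gamma))
print([recovered(u) == Gamma(u) for u in U])   # -> [True, True, True]
```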
Next, recall from Sec. 1.1 that for any probe f of a function Γ with domain U, f(Γ) is the function u ∈ U → f(Γ(u)).
Definition 13: Let D = (X, Y, Q) be a self-aware device.
i) A function Γ is intelligible to D iff 𝒫(Γ) ⊆ Q(U).
ii) D is infallible iff ∀ u ∈ U, Y(u) = [Q(u)](u).
We say that D is infallible for Q′ ⊆ Q(U) iff ∀ q ∈ Q′, Y(u) = q(u) for all u such that Q(u) = q. Finally, we say that a self-aware device (X′, Y′, Q′) is intelligible to (X, Y, Q) iff the function u → [Q′(u)](u) is intelligible to (X, Y, Q).
Def. 13 provides the extra concepts needed to analyze inference with self-aware devices. Def. 13(i) means that D is able to ask what the value is of every probe of Γ. Def. 13(ii) ensures that whatever question D is asking, it is correctly answering that question. Finally, the third part of successful inference (having the device be sure it is asking the question q) arises if D semi-controls its question function.
These definitions are related to inference by the following results:

Theorem 6: Let D₁ be an infallible, self-aware device.
i) Let Γ be a function intelligible to D₁, and say that D₁ semi-controls Q₁. Then (X₁, Y₁) > Γ.
ii) Let D₂ be a device where Y₂ is intelligible to D₁, D₁ semi-controls (Q₁, X₂), and (Q₁, X₂) is surjective onto Q₁(U) × X₂(U). Then (X₁, Y₁) ≫ (X₂, Y₂).

Thm. 6 allows us to apply results concerning weak and strong inference to self-aware devices. Note that a special case of having D₁ semi-control Q₁ is where X₁ = (Q₁, γ) for some function γ, as in Ex. 1. For such a case, Y and X share a component, namely the question being asked, specified in Q₁.
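The mechanism behind Thm. 6(i) can be sketched on a toy finite model, in which the device's possible questions are the functions f(Γ), the setup semi-controls Q, and infallibility fixes Y to be the overline of Q. This is a hypothetical illustration, not the paper's formal construction:

```python
# Toy model of Thm. 6(i): an infallible self-aware device that semi-controls
# its question function weakly infers an intelligible Gamma. Hypothetical
# finite sketch; questions are functions over U (here, the functions f(Gamma)).
U = range(4)
Gamma = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}

def f_gamma(t):
    """The function f(Gamma) for the probe f that equals 1 iff its argument is t."""
    return {u: (1 if Gamma[u] == t else -1) for u in U}

questions = {'ask_a': f_gamma('a'), 'ask_b': f_gamma('b')}
X = {0: 'ask_a', 1: 'ask_b', 2: 'ask_b', 3: 'ask_a'}   # setup semi-controls Q
Q = {u: questions[X[u]] for u in U}
Y = {u: Q[u][u] for u in U}                            # infallibility: Y = overline(Q)

# Weak inference of Gamma: each probe f has a setup value forcing Y = f(Gamma).
for x, fG in questions.items():
    assert all(Y[u] == fG[u] for u in U if X[u] == x)
```

Setting X(u) pins down which probe-question Q(u) is being asked, and infallibility then makes Y(u) the correct answer to it, which is exactly the weak-inference condition.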
The following result concerns just intelligibility, without any concern for semi-control or infallibility.

Theorem 7: Consider a pair of self-aware devices D = (X, Y, Q) and D′ = (X′, Y′, Q′) for which there are functions R, P, R′, P′ such that Q = R(P) and Q′ = R′(P′). Say that P is intelligible to D′ and P′ is intelligible to D. Then:
i) |Q(U)| = |Q′(U)| = |P(U)| = |P′(U)|.
ii) If Q(U) is finite, σ(P) = σ(Q) and σ(P′) = σ(Q′).

In particular, take R and R′ to be identity functions, so that P = Q and P′ = Q′. Using this choice, Thm. 7 says that if each self-aware device can try to determine what question the other one is considering, then neither device can try to determine anything else.
An immediate corollary of Thm. 7 is the following:

Corollary 4: No two self-aware devices whose question functions have finite ranges are intelligible to each other.

Note that Coroll. 4 does not rely on the devices being distinguishable (unlike Thm. 1). Indeed, it holds even if the two devices are identical; a self-aware device whose question function has a finite range cannot be intelligible to itself.
Coroll. 4 is a powerful limitation on any pair of self-aware devices, D and D′. It says that for at least one of the devices, say D, there is some question q′ concerning D′ that D cannot even ask, e.g., "Is D′ considering question q′?". So whether D could correctly answer such a question is moot.
To circumvent Coroll. 4 we can consider self-aware devices whose conclusion functions alone are intelligible to each other. However combining Thms. 1 and 3(i) gives the following result:

Corollary 5: Let D₁ and D₂ be two self-aware devices that are infallible, semi-control their questions, and are distinguishable. If in addition they infer each other, then it is not possible that both Y₂ is intelligible to D₁ and Y₁ is intelligible to D₂.
With self-aware devices, a device D₁ may be able to infer whether a self-aware device D₂ correctly answers the question that D₂ is considering. To analyze this issue we start with the following definition:
Definition 14: If D₁ is a device and D₂ a self-aware device, then D₁ corrects D₂ iff ∃ x₁ such that X₁ = x₁ ⇒ Y₁ = Y₂Q̄₂.

Def. 14 means that Y₁ = 1 iff Y₂ = Q̄₂, i.e., iff Y₂(u) = [Q₂(u)](u). Intuitively, if a device D₁ corrects D₂, then there is an x₁ where having X₁ = x₁ means that D₁'s conclusion tells us whether D₂ correctly answers q₂.¹⁰
¹⁰ Say that D₁ is also self-aware, and that Y₂Q̄₂ has both bits in its range (so that probes of it are well-defined). Then we can modify the definition to say that D₁ corrects D₂ iff two conditions are met: all probes f of Y₂Q̄₂ are intelligible to D₁, and D₁ is infallible for the set of all functions f(Y₂Q̄₂).
Note how weak Def. 14 is. In particular, there is no sense in which it requires that D₁ can assess whether Y₂(u) = q₂(u) for all questions q₂ ∈ Q₂(U). So long as D₁ can make that assessment for whichever question in Q₂(U) is actually being asked, we say that D₁ corrects D₂. Despite this weakness, we have the following impossibility result, which is similar to Prop. 2(i):
Proposition 7: For any device D₁ there is a self-aware device D₂ that D₁ does not correct.

There are similar results for the definition of correction in footnote 10, and for the (im)possibility of correction among multiple devices.
Finally, while there is not room to do so here, many of the concepts investigated above for inference devices can be extended to self-aware devices. For example, one might want to modify the definition of inference complexity slightly for self-aware devices. Let D be a self-aware infallible device that semi-controls its question function, and let Γ be a function over U where Γ(U) is countable and Γ is intelligible to D. Then rather than C(Γ | (X, Y)), it may be more appropriate to consider the self-aware inference complexity of Γ with respect to D, defined as

C(Γ | (X, Y, Q)) ≡ Σ_{f ∈ 𝒫(Γ)} min_{x : X=x ⇒ Q=f(Γ)} [L(x)].
Similarly, consider a reality that includes self-aware devices, i.e., a reality (U; {Fφ}) that can be written as (U; {Cφ}; {Dφ′}; {Γφ″}), where in addition to the functions {Γφ″} and devices {Cφ}, we have a set of self-aware devices {Dφ′}. The reduced form of such a reality is the set of tuples

{ ( {(Xφ(u), Yφ(u))}; {Γφ″(u)}; {(Xφ′(u), Yφ′(u), Qφ′(u))}; {[Qφ′(u)](u′) : u′ ∈ U} ) : u ∈ U }.

The last term means we include in the tuples all instances of the form [Q(u)](u′) in which a self-aware device's question for one u is evaluated at a different u′ ≠ u.

Due to page limits the analysis of such extensions is beyond the scope of this paper.
We close with some comments on the relation between inference with self-aware devices and work in other fields. Loosely speaking, in the many-worlds interpretation of quantum mechanics [25], observation only involves the relationship between Y and Γ (in general, for a Y whose range is more than binary). As discussed above, such relationships cannot imbue the observation with semantic meaning. It is by introducing X and Q into the definition of self-aware devices that we allow an act of observation to have semantic meaning. This is formalized in Thm. 6, when it is applied to scenarios where weak inference is interpreted as successful observation.

Much of formal epistemology concerns "knowledge functions", which are maps from subsets of U to other subsets of U [42,43,44,45]. Kᵢ(A), the knowledge function Kᵢ evaluated for an argument A ⊆ U, is interpreted as the set of possible worlds in which individual i knows that A is true. The set A is analogous to the specification of the question being asked by a self-aware device. So by requiring the specification of A, knowledge functions involve semantic meaning, in contrast to the process of observation in the many-worlds interpretation.

A major distinction between inference devices and both the theory of knowledge functions and the many-worlds definition of observation is that inference devices require that the individual / observer be able to answer multiple questions (one for each probe concerning the function being inferred). As mentioned above, this requirement certainly holds in all real-world instances of knowledge or observation. Yet it is this seemingly innocuous requirement that drives many of the results presented above.
Future work involves exploring what inference device theory has to say about issues of interest in the theory of knowledge functions. For example, analysis of common knowledge starts with a formalization of what it means for individual i to know that individual j knows A. The inference devices analog would be a formalization of what it means for device D to infer that device C infers Γ. Now for this analog to be meaningful, since D can only infer functions with at least two values in their range, there must be some sense in which the set U both contains u under which C infers Γ and contains u under which it does not. Formally, this means two things. First, it must not be the case simply that C > Γ, since that means that C infers Γ under all u. Second, there must be a proper subset U_C ⊂ U such that if U were redefined to be U_C (and C and Γ were redefined to have U_C as their domains in the obvious way), then it would be the case that C > Γ. This proper subset specifies a binary-valued function, γ_C, by γ_C(u) = 1 ⇔ u ∈ U_C. The question of whether D knows that C knows then becomes whether D can infer γ_C.
ACKNOWLEDGEMENTS: I would like to thank Nihat Ay, Charlie Bennett, John Doyle,
Michael Gogins, and Walter Read for helpful discussion.
APPENDIX A: Proofs
This section presents miscellaneous proofs. Since many of the results may be counter-intuitive, the proofs are presented in elaborate detail. The reader should bear in mind, though, that many of the proofs simply amount to higher-order versions of the Cretan liar paradox, Cantor diagonalization, or the like (just like many proofs in Turing machine theory). At the same time, in the interest of space, little pedagogical discussion is inserted. Unfortunately, the combination makes many of the proofs a bit of a slog.
Proof of Prop. 1: To prove (i), choose a device (X, Y) where Y(u) = −1 ∀ u ∈ W. Also have X(u) take on a separate unique value for each u ∈ W, i.e., ∀ w ∈ W, u ∈ U : w ≠ u, X(w) ≠ X(u). (Note that by definition of W, it contains at least two elements.) So by appropriate choice of an x, X(u) = x forces u to be any desired element of W.

Choose any i. Pick any γ ∈ Γᵢ(U), and examine the probe f that equals 1 iff its argument is γ. If for no u ∈ W does Γᵢ(u) = γ, then choose any x that forces u ∈ W. By construction, X(u) = x ⇒ Y(u) = −1, and in addition X(u) = x ⇒ f(Γᵢ(u)) = −1. So X(u) = x ⇒ Y(u) = f(Γᵢ(u)), as desired.

Now say that there is a u ∈ W such that Γᵢ(u) = γ. By hypothesis, ∃ u′ ∈ W : Γᵢ(u′) ≠ γ. By construction, there is an x such that X(u) = x ⇒ u = u′. So X(u) = x ⇒ u ∈ W, Γᵢ(u) ≠ γ. The first of those two conclusions means that Y(u) = −1, and the second means that f(Γᵢ(u)) = −1. So again, X(u) = x ⇒ Y(u) = f(Γᵢ(u)), as desired. There are no more cases to consider.

To prove (ii), choose b ∈ B and let Γ be a function with domain U where Γ(u) = b for all u obeying Y(u) = 1 and for no others. (The surjectivity of Y ensures there is at least one such u.) Consider the probe f of Γ(U) that equals +1 iff its argument equals b. For all u ∈ U, f(Γ(u)) = Y(u). QED.
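The construction in part (i) can be exercised numerically. This is an illustrative sketch (hypothetical names; weakly_infers is a minimal rendering of the weak-inference condition in Def. 3) on a toy universe satisfying the proposition's hypothesis:

```python
# Numerical check of the Prop. 1(i) construction on a toy universe obeying
# the hypothesis (every gamma misses some u' in W). All names hypothetical.
U = list(range(6))
W = [0, 1, 2]                               # proper subset with >= 2 elements
X = {u: (u if u in W else 99) for u in U}   # unique setup value on each w in W
Y = {u: (-1 if u in W else 1) for u in U}   # conclusion -1 across W

Gamma = {0: 'p', 1: 'q', 2: 'p', 3: 'r', 4: 'r', 5: 'q'}

def weakly_infers(X, Y, Gamma):
    """For every probe f of Gamma there is an x with X(u) = x => Y(u) = f(Gamma(u))."""
    return all(any(all(Y[u] == (1 if Gamma[u] == gamma else -1)
                       for u in U if X[u] == x)
                   for x in set(X.values()))
               for gamma in set(Gamma.values()))

assert weakly_infers(X, Y, Gamma)
assert not weakly_infers({u: 0 for u in U}, Y, Gamma)   # a constant setup fails
```

For each probe, the forcing value x is chosen so that the preimage sits inside W at a point where Gamma avoids the probed value, exactly as in the case analysis above.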
Proof of Coroll. 2: To prove the first part of the corollary, let α and β be the partitions induced by X and Y, respectively. If |X(U)| = |α| = 2, then |α| = |β|. Since α is a fine-graining of β, this means that α = β. So without loss of generality we can label the elements of X(U) so that X = Y.

Now hypothesize that C > Γ for some Γ. Recall that we require that |Γ(U)| ≥ 2. Let γ and γ′ be two distinct elements of Γ(U), define f_γ to be the probe of Γ(U) that equals 1 iff its argument is γ, and define f_γ′ analogously. Since C > Γ, there is an x_γ ∈ X(U) such that X(u) = x_γ ⇒ Y(u) = f_γ(Γ(u)). Since ∃ u ∈ X⁻¹(1) such that Γ(u) = γ, and since Y(u) = 1 ∀ u ∈ X⁻¹(1), x_γ must equal 1. This means that Γ(u) equals γ across all of X⁻¹(x_γ) ⊆ U. Therefore there is no u ∈ X⁻¹(x_γ) such that Γ(u) = γ′. Moreover, since x_γ = 1, Y(X⁻¹(x_γ)) = 1. Therefore there is no u ∈ X⁻¹(x_γ) such that Y(u) = f_γ′(Γ(u)) [...].

[...] Let a′ be one of the other elements of α that are contained in that element of β with label b. Form the union of a with all elements of α that are contained in the element of β with label b. That union is a proper subset of all the elements of α. Therefore it picks out a proper subset of U, W. (Note that W has non-empty overlap with both partition elements of Y.) So choose Γ to be binary-valued, with values given by Γ(u) = b iff u ∈ W. Then for X(u) = a, Γ(u) = b = Y(u). On the other hand, for X(u) = a′ [...] there exists u′ ∈ U s.t. X₁(u′) = x₁, X₂(u′) = x₂. Combining, we get the contradiction Y₁(u′) = Y₂(u′) = −Y₁(u′). QED.
Proof of Thm. 2: To establish (i), let f be any probe of Γ(U). C₂ > Γ ⇒ ∃ x₂ such that X₂(u) = x₂ ⇒ Y₂(u) = f(Γ(u)). In turn, C₁ ≫ C₂ ⇒ ∃ x₁ such that X₁ = x₁ ⇒ X₂ = x₂, Y₁ = Y₂ (by choosing the identity probe of Y₂(U)). Combining, X₁ = x₁ ⇒ Y₁ = f(Γ). So C₁ > Γ, as claimed in (i).

To establish (ii), let f be any probe of Y₃(U), and x₃ any member of X₃(U). C₂ ≫ C₃ ⇒ ∃ x₂ ∈ X₂(U) such that X₂(u) = x₂ ⇒ X₃(u) = x₃, Y₂(u) = f(Y₃(u)). C₁ ≫ C₂ then implies that ∃ x₁ such that X₁(u) = x₁ ⇒ X₂(u) = x₂, Y₁(u) = Y₂(u) (by choosing the identity probe of Y₂(U)). Combining, X₁(u) = x₁ ⇒ X₃(u) = x₃, Y₁(u) = f(Y₃(u)), as desired. QED.
Proof of Prop. 2: To establish the first claim, simply take Y₂ to be the function Γ constructed in Prop. 1(ii).

To establish the second claim, focus attention on any x₁ ∈ X₁(U), and define W ≡ X₁⁻¹(x₁). Choose X₂ so that X₂(u) takes on a separate unique value for each u ∈ W, i.e., ∀ w ∈ W, u ∈ U : w ≠ u, X₂(w) ≠ X₂(u).

First consider the case where Y₁(W) has a single element, i.e., Y₁(u) is the same bit across all of X₁⁻¹(x₁). Without loss of generality take that bit to be 1. Choose Y₂(w′) = 1 for some w′ ∈ W, and Y₂(w) = −1 for all other w ∈ W. Then choose x₂ so that X₂(u) = x₂ ⇒ u = w′. Therefore X₂(u) = x₂ ⇒ X₁(u) = x₁, Y₂(u) = 1. So for the probe f of Y₁(U) that equals Y₁, X₂(u) = x₂ ⇒ Y₂(u) = f(Y₁(u)). On the other hand, by hypothesis there is a w″ ∈ W with w″ ≠ w′, and there is an x₂′ ∈ X₂(U) such that X₂(u) = x₂′ ⇒ u = w″. Moreover, Y₂(w″) = −1, by construction of Y₂. So consider the probe f′ of Y₁(U) that equals −Y₁. For all u ∈ W, f′(Y₁(u)) = −1. In particular, this is the case for u = w″. Combining, X₂(u) = x₂′ ⇒ X₁(u) = x₁, Y₂(u) = f′(Y₁(u)). Since f and f′ are the only probes of Y₁(U), this establishes strong inference over W for this case.

Next consider the case where Y₁(W) has two elements, i.e., there are elements of W on which Y₁ takes both values. So by Prop. 1(i) there is a device C over W that infers the restriction of Y₁ to domain W. Define (X₂, Y₂) to be the same as that C for all u ∈ W, with all members of X₂(W) given values that are not found in X₂(U \ W). Since X₁(w) = x₁ for all w ∈ W, this means that ∀ f(Y₁), ∃ x₂ such that X₂(u) = x₂ ⇒ X₁(u) = x₁, Y₂(u) = f(Y₁(u)).

Combining, since Y₁(X₁⁻¹(x₁)) either is or is not a singleton for each x₁ ∈ X₁(U), we can build a "partial" device C₂ that strongly infers C₁ for each region X₁⁻¹(x₁). Furthermore, those regions form a partition of U. So by appropriately "stitching together" the partial C₂'s built for each x₁ ∈ X₁(U), we build an aggregate device C₂ that strongly infers C₁ over all U, as claimed. QED.
Proof of Thm. 3: Let C₁ and C₂ be two devices and hypothesize that they can strongly infer each other. Since C₁ can strongly infer C₂, it can force X₂ to have any desired value and simultaneously correctly infer the value of Y₂ under the identity probe. In other words, there is a function ξ₁ : X₂(U) → X₁(U) such that for all x₂, X₁ = ξ₁(x₂) ⇒ X₂ = x₂ and Y₁ = Y₂. Let x̂₁ be any element of ξ₁(X₂(U)).

Similarly, by hypothesis C₂ can force X₁ to have any desired value and simultaneously correctly infer the value of Y₁ under the negation probe. In other words, there is a function ξ₂ : X₁(U) → X₂(U) such that for all x₁, X₂ = ξ₂(x₁) ⇒ X₁ = x₁ and Y₂ = −Y₁.

Define x̂₂ ≡ ξ₂(x̂₁). Then X₁(u) = ξ₁(x̂₂) ⇒ X₂(u) = x̂₂ = ξ₂(x̂₁) and Y₁(u) = Y₂(u). The first of those two conclusions in turn means that Y₂(u) = −Y₁(u). Combining, we see that X₁(u) = ξ₁(x̂₂) ⇒ Y₂(u) = Y₁(u) = −Y₂(u), which is impossible. QED
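The impossibility in Thm. 3 can also be confirmed by exhaustive search over a tiny universe. The following sketch is illustrative only (hypothetical encoding; conclusion functions are required to be surjective onto B):

```python
# Brute-force confirmation of Thm. 3 on a 3-element universe: no pair of
# devices strongly infers each other (not even a device and itself).
from itertools import product

U = range(3)
B = (-1, 1)

def strongly_infers(X1, Y1, X2, Y2):
    """C1 >> C2: for every x2 and every probe f of Y2(U), some x1 satisfies
    X1(u) = x1  =>  X2(u) = x2 and Y1(u) = f(Y2(u))."""
    probes = [lambda y, t=t: (1 if y == t else -1) for t in set(Y2)]
    return all(any(all(X2[u] == x2 and Y1[u] == f(Y2[u])
                       for u in U if X1[u] == x1)
                   for x1 in set(X1))
               for x2 in set(X2) for f in probes)

devices = [(X, Y) for X in product((0, 1), repeat=3)
                  for Y in product(B, repeat=3) if set(Y) == set(B)]

# One-way strong inference does exist ...
assert any(strongly_infers(*c1, *c2) for c1 in devices for c2 in devices)
# ... but no two devices strongly infer each other, matching Thm. 3.
assert not any(strongly_infers(*c1, *c2) and strongly_infers(*c2, *c1)
               for c1 in devices for c2 in devices)
```

The search also confirms the asymmetry the theorem turns on: strong inference in one direction is easy to realize, while mutuality always fails on the negation probe.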
Proof of Thm. 4: Since C₂ > Γ, ∀ f(Γ), ∃ x₂ such that X₂ = x₂ ⇒ Y₂ = f(Γ). Therefore the set argmin_{x₂ : X₂=x₂ ⇒ Y₂=f(Γ)} [L(x₂)] is non-empty. Accordingly, ∀ f(Γ), we can define an associated value x₂^f ∈ X₂(U) as some particular element of argmin_{x₂ : X₂=x₂ ⇒ Y₂=f(Γ)} [L(x₂)].

Now since C₁ ≫ C₂, ∀ x₂, ∃ x₁ such that X₁ = x₁ ⇒ X₂ = x₂, Y₁ = Y₂ (by choosing the identity probe of Y₂(U)). In particular, ∀ f(Γ), ∃ x₁ : X₁ = x₁ ⇒ X₂ = x₂^f, Y₁ = Y₂. So by definition of x₂^f, ∀ f(Γ), ∃ x₁ : X₁ = x₁ ⇒ X₂ = x₂^f, Y₁ = f(Γ).

Combining, ∀ f(Γ),

min_{x₁ : X₁=x₁ ⇒ Y₁=f(Γ)} [L(x₁)] ≤ min_{x₁ : X₁=x₁ ⇒ X₂=x₂^f, Y₁=Y₂} [L(x₁)].

Accordingly,

C(Γ | C₁) − C(Γ | C₂) ≤ Σ_{f(Γ)} min_{x₁ : X₁=x₁ ⇒ X₂=x₂^f, Y₁=Y₂} [L(x₁) − L(x₂^f)]
≤ Σ_{f(Γ)} max_{x₂} { min_{x₁ : X₁=x₁ ⇒ X₂=x₂, Y₁=Y₂} [L(x₁) − L(x₂)] }
= |𝒫(Γ)| max_{x₂} { min_{x₁ : X₁=x₁ ⇒ X₂=x₂, Y₁=Y₂} [L(x₁) − L(x₂)] }.

Using the equality |𝒫(Γ)| = |Γ(U)| completes the proof. QED.
Proof of Thm. 5: By hypothesis, for any x₂ ∈ X₂(U), ∃ x₁ such that X₁ = x₁ ⇒ X₂ = x₂. This is true for any such x₂. Write the function mapping any such x₂ to the associated x₁ as ξ₁. Similarly, there is a function ξ₂ that maps any x₁ ∈ X₁(U) to an x₂ ∈ X₂(U) such that X₂ = ξ₂(x₁) ⇒ X₁ = x₁. Using the axiom of choice, this provides us with a single-valued mapping from X₁(U) into X₂(U), and vice-versa.

Since having X₂(u) = ξ₂(x₁) forces X₁(u) = x₁, the set of u ∈ U such that X₂(u) = ξ₂(x₁) must be a subset of those u ∈ U such that X₁(u) = x₁, i.e., ∀ x₁, X₂⁻¹[ξ₂(x₁)] ⊆ X₁⁻¹(x₁). Similarly, ∀ x₂, X₁⁻¹[ξ₁(x₂)] ⊆ X₂⁻¹(x₂). This second inclusion means in particular that X₁⁻¹[ξ₁(ξ₂(x₁))] ⊆ X₂⁻¹(ξ₂(x₁)). Combining, X₁⁻¹[ξ₁(ξ₂(x₁))] ⊆ X₁⁻¹(x₁).

However ∀ x₁, X₁⁻¹[ξ₁(ξ₂(x₁))] is non-empty. Since X₁ is single-valued, this means that ∀ x₁, ξ₁(ξ₂(x₁)) = x₁. Combining, we see that ∀ x₁, X₁⁻¹(x₁) ⊆ X₂⁻¹[ξ₂(x₁)], and therefore X₂⁻¹[ξ₂(x₁)] = X₁⁻¹(x₁). This in turn means that the set X₂[X₁⁻¹(x₁)] equals the singleton ξ₂(x₁) for any x₁ ∈ X₁(U). Accordingly, ∀ u ∈ X₁⁻¹(x₁), X₂(u) = ξ₂(x₁) = ξ₂(X₁(u)). In addition, every u ∈ U obeys u ∈ X₁⁻¹(x₁) for some x₁. Therefore we conclude that for all u ∈ U, ξ₂(X₁(u)) = X₂(u).

This establishes that the partition induced by X₁ is a fine-graining of the partition induced by X₂. Similar reasoning establishes that the partition induced by X₂ is a fine-graining of the partition induced by X₁. This means that the two partitions must be identical. QED.
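The finite version of Thm. 5 is easy to confirm exhaustively; the following is an illustrative sketch (hypothetical encoding over a 4-element universe):

```python
# Exhaustive check of (the finite version of) Thm. 5: whenever each setup
# function can force every value of the other, the induced partitions agree.
from itertools import product

U = range(4)

def forces_all(X1, X2):
    """For every x2 in X2(U) there is an x1 with X1(u) = x1 => X2(u) = x2."""
    return all(any(all(X2[u] == x2 for u in U if X1[u] == x1)
                   for x1 in set(X1))
               for x2 in set(X2))

def partition(X):
    return frozenset(frozenset(u for u in U if X[u] == x) for x in set(X))

for X1 in product((0, 1, 2), repeat=4):
    for X2 in product((0, 1, 2), repeat=4):
        if forces_all(X1, X2) and forces_all(X2, X1):
            assert partition(X1) == partition(X2)
```

The exhaustive loop plays the role of the axiom-of-choice argument in the proof: in the finite setting the maps ξ₁ and ξ₂ can simply be read off, and mutual forcing collapses the two partitions into one.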
Proof of Coroll. 3: By Thm. 5, we can relabel the image values of the two devices' setup functions to express them as C₁ = (X, Y₁) and C₂ = (X, Y₂).

To prove (i), note that C₁ > C₂ means ∃ x ∈ X(U) such that X = x ⇒ Y₁ = Y₂, and ∃ x′ ∈ X(U) such that X = x′ ⇒ Y₁ = −Y₂. But those two properties in turn mean that C₂ > C₁. A similar argument establishes that C₂ > C₁ ⇒ C₁ > C₂.

To prove (ii), note that C₁ ≫ C₂ means that ∀ x ∈ X(U), ∀ f(Y₂), ∃ x′ such that X = x′ ⇒ X = x, Y₁ = f(Y₂). In particular, ∀ x ∈ X(U), ∃ x′ such that X = x′ ⇒ X = x, Y₁ = Y₂, and ∃ x″ such that X = x″ ⇒ X = x, Y₁ = −Y₂. The only way both conditions can hold is if x′ = x″. But that means it is impossible to have both Y₁ = Y₂ and Y₁ = −Y₂.

To prove (iii), hypothesize that C₁ can control X. This means in particular that ∀ x ∈ X(U), ∃ x′ such that X = x′ ⇒ Y₁ = δ_{X,x} = 1 (choose b = 1 and have f be the probe that equals 1 iff its argument equals x). To have δ_{X,x} = 1 means X = x, which in turn means x′ = x. So X = x ⇒ Y₁ = 1. This is true for all x ∈ X(U), so Y₁(u) = 1 ∀ u ∈ U. However by definition, the range of Y₁ must be B. Therefore the hypothesis is wrong. The same argument shows that C₂ cannot control X. QED.
Proof of Thm. 6: To prove (i), let f be any probe of Γ. Intelligibility means f(Γ) ∈ Q₁(U). Since D₁ semi-controls its question function, ∃ x₁ : X₁ = x₁ ⇒ Q₁ = f(Γ). Infallibility then implies that for any u such that X₁(u) = x₁, Y₁(u) = [Q₁(u)](u) = f(Γ(u)). This proves (i).

Next, let f be any probe of Y₂, and x₂ any element of X₂(U). Intelligibility means f(Y₂) ∈ Q₁(U). Since D₁ semi-controls (Q₁, X₂) and (Q₁, X₂) is surjective, ∃ x₁ such that X₁ = x₁ ⇒ Q₁ = f(Y₂), X₂ = x₂. Infallibility then implies that for any u such that X₁(u) = x₁, Y₁(u) = [Q₁(u)](u) = f(Y₂(u)). This proves (ii). QED.
Proof of Thm. 7: The cardinality of the set of probes of P is the cardinality of P(U), |P(U)|. Let f₁ and f₂ be two separate such probes, so that f₁ : P(U) → B differs from f₂ : P(U) → B. Then as functions over U, f₁(P) and f₂(P) differ. Therefore by hypothesis they correspond to two distinct q's in Q′(U). So |Q′(U)| ≥ |P(U)|. Similar reasoning, using the fact that P′ is intelligible to D, establishes that |Q(U)| ≥ |P′(U)|. On the other hand, Q = R(P) and Q′ = R′(P′), so |Q(U)| ≤ |P(U)| and |Q′(U)| ≤ |P′(U)|. Combining, |Q(U)| = |Q′(U)| = |P(U)| = |P′(U)|.

Next, suppose Q(U) is finite, and recall that since P′ is intelligible to D, every f(P′) is a member of Q(U). The equality |Q′(U)| = |R′(P′(U))| = |P′(U)| then establishes that R′ is a bijection from P′(U) onto Q′(U), so the partition induced by P′ is the same as that induced by Q′ = R′(P′). So σ(P′) = σ(Q′). Similar reasoning establishes that σ(P) = σ(Q). QED.

Proof of Prop. 4: For any subset D′ of the nodes of the DAG D, define S(D′) as the union of D′ with all successors of nodes in D′, and define P(D′) as the union of D′ with all predecessors of nodes in D′. S({C₁}) ≠ D, since by hypothesis there is more than one root node. Since D is weakly connected, this means that S({C₁}) ≠ P[S({C₁})]. Since D is acyclic and finite, this means that there is a node C_j ∈ S({C₁}) that has a root node predecessor C_k where C_k ∉ S({C₁}).

So C_j is a successor of two separate root nodes, C_k and C₁. By transitivity of strong inference, this means that C₁ ≫ C_j and C_k ≫ C_j. By the hypothesis of the proposition, since C_k ≠ C₁, those two devices are distinguishable. This means it is possible for C₁ to force X_j to have one value while at the same time C_k forces X_j to have a different value. This is a contradiction. QED.
Proof of Prop. 5: The proof of (i) is by example. Consider the following set of five quadruples:

V ≡ {(1, 1, 1, 1); (1, −1, −1, −1); (−1, 1, 1, −1); (−1, −1, −1, 1); (1, 1, −1, 1)}.

By Lemma 1, V is the reduced form of a reality consisting of two devices C₁ and C₂, where we identify any quadruple in V as the value (x₁, y₁, x₂, y₂), so that X₁(U) = X₂(U) = B. By inspection, C₁ > C₂ (e.g., X₁ = 1 ⇒ Y₁ = Y₂, and X₁ = −1 ⇒ Y₁ = −Y₂). Similarly, by inspection C₁ and C₂ are distinguishable, and copies of each other. This completes the proof of (i).
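The three claimed properties of V can be verified mechanically. This sketch assumes the sign assignment of the quadruples given above (one consistent choice; the published table may differ), with quadruples read as (x1, y1, x2, y2):

```python
# Mechanical check of the three properties claimed for V in Prop. 5(i).
V = [(1, 1, 1, 1), (1, -1, -1, -1), (-1, 1, 1, -1), (-1, -1, -1, 1), (1, 1, -1, 1)]

# C1 > C2: X1 = 1 forces Y1 = Y2, and X1 = -1 forces Y1 = -Y2.
assert all(y1 == y2 for (x1, y1, x2, y2) in V if x1 == 1)
assert all(y1 == -y2 for (x1, y1, x2, y2) in V if x1 == -1)

# Distinguishable: every joint setup value (x1, x2) occurs.
assert {(x1, x2) for (x1, y1, x2, y2) in V} == {(a, b) for a in (1, -1) for b in (1, -1)}

# Copies: the relation between x1 and y1 equals that between x2 and y2.
assert {(x1, y1) for (x1, y1, x2, y2) in V} == {(x2, y2) for (x1, y1, x2, y2) in V}
```

Each assertion corresponds to one of the three "by inspection" claims in the proof.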
To prove the first part of (ii), first note that C₁ ≫ C₂ requires that for all x₂, there is (an x₁ that forces X₂ = x₂ and Y₁ = Y₂), and (an x₁ that forces X₂ = x₂ and Y₁ = −Y₂). In other words, there is a single-valued map ξ : X₂(U) → X₁(U) such that the quadruple (X₁ = ξ(x₂), Y₁ = y₁, X₂ = x₂, Y₂ = y₁) occurs for some y₁ in some tuple in the reduced form of the reality, while (X₁ = ξ(x₂), Y₁ = y₁, X₂ = x₂′, Y₂ = y₂) does not occur for any y₂ if x₂′ ≠ x₂, and also does not occur for y₂ = −y₁ if x₂′ = x₂. Similarly, there is a single-valued map ξ′ : X₂(U) → X₁(U) such that the quadruple (X₁ = ξ′(x₂), Y₁ = y₁, X₂ = x₂, Y₂ = −y₁) occurs for some y₁ in some tuple in the reduced form of the reality, while (X₁ = ξ′(x₂), Y₁ = y₁, X₂ = x₂′, Y₂ = y₂) does not occur for any y₂ if x₂′ ≠ x₂, and also does not occur for y₂ = y₁ if x₂′ = x₂. By construction, ξ and ξ′ are both injective and have disjoint ranges: ∀ x₂, x₂′, ξ(x₂) ≠ ξ′(x₂′). So |X₁(U)| ≥ 2|X₂(U)|. On the other hand, |X₁(U)| = |X₂(U)| because C₁ and C₂ are copies of each other. Therefore they must have infinite setup functions.
The existence proof for (ii) is by example. Define a set of quadruples

T ≡ {(1, −1, 1, −1); (2, 1, 1, −1); (3, −1, 2, 1); (4, 1, 2, 1); (5, −1, 3, −1); (6, 1, 3, −1); . . .}
  = {(i, 1 − 2(i mod 2), ⌈i/2⌉, 1 − 2(⌈i/2⌉ mod 2)) : i ∈ ℕ}.

Next, fix any set of spaces Ω, where the spaces {y₁} = {y₂} ≡ B and {x₁} = {x₂} ≡ ℕ all occur in Ω. Let S be a subset of the Cartesian product of the spaces in Ω. Say that for every t ∈ T, (x₁, y₁, x₂, y₂) = t for exactly one element of S, and no element of S contains a quadruple (x₁, y₁, x₂, y₂) ∉ T. (So there is a bijection between S and T, given by projecting any element of S onto its four components corresponding to the spaces {x₁}, {y₁}, {x₂} and {y₂}.)

By Lemma 1, S is the reduced form of a reality, where we can define X₁(U) ≡ {x₁}, Y₁(U) ≡ {y₁}, X₂(U) ≡ {x₂}, Y₂(U) ≡ {y₂}. Accordingly, group (X₁, Y₁) into a device C₁ and (X₂, Y₂) into a device C₂. By inspection, the relation in T between pairs x₁ and y₁ is identical to the relation in T between pairs x₂ and y₂. (Those relations are the pairs {(1, −1); (2, 1); (3, −1); . . .}.) So the devices C₁ and C₂ in the reality are copies of each other.

Next, note that ∀ x₂ ∈ ℕ, y₁ ∈ B, the quadruple (2x₂ + (y₁ − 1)/2, y₁, x₂, 1 − 2(x₂ mod 2)) occurs (once) in T. Accordingly, X₁ = 2x₂ + (y₁ − 1)/2 ⇒ X₂ = x₂. Also, for any fixed x₂, choosing either X₁ = 2x₂ or X₁ = 2x₂ − 1 forces y₁ to be either 1 or −1, respectively. Therefore, given that x₂ is fixed, it also forces either y₁ = 1 − 2(x₂ mod 2) or y₁ = −[1 − 2(x₂ mod 2)]. (For example, X₁ = 5 forces X₂ = 3 and Y₁ = Y₂, while X₁ = 6 forces X₂ = 3 and Y₁ = −Y₂.) So the choice of X₁ forces either Y₁ = Y₂ or Y₁ = −Y₂. Therefore C₁ ≫ C₂. QED.
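The closed form for T can be checked mechanically (an illustrative sketch assuming the formula t(i) = (i, 1 − 2(i mod 2), ⌈i/2⌉, 1 − 2(⌈i/2⌉ mod 2)) given above):

```python
# Check of the example set T in the existence proof of Prop. 5(ii).
from math import ceil

def t(i):
    x2 = ceil(i / 2)
    return (i, 1 - 2 * (i % 2), x2, 1 - 2 * (x2 % 2))

# X1 = 5 forces X2 = 3 with Y1 = Y2; X1 = 6 forces X2 = 3 with Y1 = -Y2.
assert t(5) == (5, -1, 3, -1)
assert t(6) == (6, 1, 3, -1)

# For every x2 and y1 in B, the tuple (2*x2 + (y1 - 1)//2, y1, x2, 1 - 2*(x2 % 2))
# occurs in T: the choice of X1 forces X2 and forces Y1 = Y2 or Y1 = -Y2.
for x2 in range(1, 50):
    for y1 in (1, -1):
        i = 2 * x2 + (y1 - 1) // 2
        assert t(i) == (i, y1, x2, 1 - 2 * (x2 % 2))
```

The loop verifies exactly the forcing property used in the last paragraph of the proof, for the first 49 values of x₂.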
Proof of Prop. 6: Plugging in, the product of the two inference accuracies is

[ Σ_{f₁(Y₂)} max_{x₁} [E_P(Y₁ f₁(Y₂) | x₁)] / 2 ] × [ Σ_{f₂(Y₁)} max_{x₂} [E_P(Y₂ f₂(Y₁) | x₂)] / 2 ].

Define g ≡ Y₁Y₂. Then we can rewrite our product as

[ max_{x₁} [E_P(g | x₁)] / 2 + max_{x₁} [E_P(−g | x₁)] / 2 ] × [ max_{x₂} [E_P(g | x₂)] / 2 + max_{x₂} [E_P(−g | x₂)] / 2 ].

For |X₁(U)| = |X₂(U)| = 2, we can rewrite this as

[ |E_P(g | X₁ = 1) − E_P(g | X₁ = −1)| / 2 ] × [ |E_P(g | X₂ = 1) − E_P(g | X₂ = −1)| / 2 ].
Next, since the distinguishability is 1.0, X₁ and X₂ are statistically independent under P. Therefore we can write P(g, x₁, x₂) = P(g | x₁, x₂)P(x₁)P(x₂). So for example, P(g | x₁) = Σ_{x₂} P(g | x₁, x₂)P(x₂), and

E_P(g | x₁) = Σ_{x₂} [P(g = 1 | x₁, x₂) − P(g = −1 | x₁, x₂)]P(x₂)
           = 2[Σ_{x₂} P(g = 1 | x₁, x₂)P(x₂)] − 1.

Now define z₁ ≡ P(g = 1 | x₁ = 1, x₂ = 1), z₂ ≡ P(g = 1 | x₁ = 1, x₂ = −1), z₃ ≡ P(g = 1 | x₁ = −1, x₂ = 1), z₄ ≡ P(g = 1 | x₁ = −1, x₂ = −1), and write α ≡ P(x₂ = 1), β ≡ P(x₁ = 1). Note that the 4-tuple (z₁, z₂, z₃, z₄) ∈ H so long as none of its components equals 0. Plugging in,

E_P(g | X₁ = 1) = 2[z₁α + z₂(1 − α)] − 1,
E_P(g | X₁ = −1) = 2[z₃α + z₄(1 − α)] − 1,
E_P(g | X₂ = 1) = 2[z₁β + z₃(1 − β)] − 1,
E_P(g | X₂ = −1) = 2[z₂β + z₄(1 − β)] − 1.

So the product of inference accuracies is

|[k(z) + m(z)][k(z) + n(z)]| = |[k(z)]² + k(z)m(z) + k(z)n(z) + m(z)n(z)|.

This establishes the first part of the proposition. Note that depending on the structure of the mapping from (X₁, X₂) → (Y₁, Y₂), if we require that both Yᵢ be stochastically surjective, there may be constraints on which quadruples z ∈ H are allowed. Such restrictions would make our bound be loose.

When α = β = 1/2, the product of inference accuracies reduces to

| (z₁² − z₂² − z₃² + z₄²)/4 + (z₂z₃ − z₁z₄)/2 | = | [(z₁ − z₄)² − (z₂ − z₃)²]/4 |.

This establishes the second claim. The final claim is established by maximizing this expression over H. QED.
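The closing algebraic identity (the α = β = 1/2 case) can be confirmed numerically over a grid of conditional-probability values:

```python
# Numerical check of the final identity in the proof of Prop. 6:
# |(z1^2 - z2^2 - z3^2 + z4^2)/4 + (z2*z3 - z1*z4)/2| = |((z1-z4)^2 - (z2-z3)^2)/4|.
def lhs(z1, z2, z3, z4):
    return abs((z1**2 - z2**2 - z3**2 + z4**2) / 4 + (z2 * z3 - z1 * z4) / 2)

def rhs(z1, z2, z3, z4):
    return abs(((z1 - z4)**2 - (z2 - z3)**2) / 4)

grid = [i / 4 for i in range(5)]
for z1 in grid:
    for z2 in grid:
        for z3 in grid:
            for z4 in grid:
                assert abs(lhs(z1, z2, z3, z4) - rhs(z1, z2, z3, z4)) < 1e-12
```

Expanding [(z₁ − z₄)² − (z₂ − z₃)²]/4 term by term reproduces the left-hand side exactly, which is what the grid check confirms.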
Proof of Prop. 7: Given any C₁ = (X₁, Y₁), the proposition is proven if we can construct an associated D₂ that C₁ does not correct. To do that, choose Y₂ = Y₁, and have Q₂(U) consist of two elements, q₁ = −Y₁ and q₂ = Y₁. Define Q₂'s dependence on u ∈ U by requiring that Y₁ = 1 ⇔ Q₂ = q₁ (i.e., ∀ u ∈ U such that Y₁(u) = 1, Q₂(u) = q₁ = −Y₁), and by requiring that Y₁ = −1 ⇔ Q₂ = q₂. (Since Y₁ is surjective onto B, this defines Q₂'s dependence on all of U, and guarantees that |Q₂(U)| ≥ 2, as required.)

Plugging in, Q̄₂ = −1. Now the square of both 1 and −1 equals 1. Since Y₁ = Y₂, this means that Y₁Y₂ = 1. Combining, Q̄₂ = −Y₂Y₁. Therefore Y₂Q̄₂ = −Y₁. Therefore it is impossible that Y₁ = Y₂Q̄₂, i.e., there is no x₁ that implies this equality. QED.
APPENDIX B: The lack of restrictions in the definition of weak inference
Note that there is additional structure in Ex. 1 that is missing in Def. 3. Most obviously, no analog of γ appears in Def. 3. In addition, Def. 3 does not require that there be a component of X and/or Y that can be interpreted as a question-valued function like Q. Moreover, even if it is the case that X = (Q, γ), Def. 3 allows the value imposed on γ to vary depending on what probe one is considering, in contrast to the case in Ex. 1. Alternatively, it may be that the question Q(u) does not equal the associated probe f_K that is being answered, but so long as Y(u) = f_K(Γ(u)) whenever Q(u) has a certain value, the device gets credit for being able to answer question f_K. In this, the definition of weak inference does not fully impose the mathematical structure underpinning the concept of semantic information. Phrased differently, the impossibility results for weak inference hold even though weak inference only uses some of the structure needed to define semantic information. (See Sec. 9 for results that involve all of that structure.)

In addition, it may be that the scientist cannot read the apparatus' output display accurately. In this case the scientist would give incorrect answers as to what is on that display. However, so long as that inaccuracy was compensated for, say by a mistake in the observation apparatus, we would still say that the device infers Γ. Any such extra structure that is in Ex. 1 can be added to the definition of weak inference in Def. 3 if desired, and the impossibility results presented here for weak inference will still obtain. (See Sec. 9 for a formalization of inference that contains additional structure much like that in Ex. 1.)
The other examples in Sec. 2 can be cast as instances of weak inference in similar fashions. In
particular, all of them have additional structure beyond that required in Def. 3.
It is worth elaborating further on just how unrestrictive Def. 3 is. One might argue that to apply to things like computers being used for prediction, a definition of inference should involve additional formal structure like time-ordering, or stipulations about the Chomsky hierarchy power of the device, or stipulations about physical limits restricting the device's operation, like the speed of light, quantum mechanical uncertainties, etc. More abstractly, one might argue that for a conclusion of a device to be physically meaningful, it should be possible to act upon that conclusion, and then test through the universe's response to that action whether the conclusion is correct. None of this is required.
Note also that Def. 3 does not require that the device be used to infer some aspect of the world outside of the device. For example, no restrictions are imposed concerning the physical coupling (or lack thereof) at any particular instant of time between the device and what the device infers. The device and what it is inferring can be anything from tightly coupled with each other to completely isolated from each other, at any moment.
As an extreme version of the first end of that spectrum, one can even have the device and what it is inferring be the same system. For example, this is the case if X and/or Y depend on every degree of freedom in the universe at some moment in time (in some associated reference frame). In such a situation, the entire universe is the inference device, and it is being used to infer something concerning itself.
As another example of the generality of the definition, note that time does not appear in Def. 3. Ultimately, this is the basis for the fact that the definition of inference applies to both prediction and recollection, a.k.a. "retrodiction". This absence of time in Def. 3 also means that not only might the device be the entire universe, but it might be the entire universe across all time. In such a situation, the device is not localized either spatially or temporally; the setup and/or conclusion of the device is jointly specified by all degrees of freedom of the universe at all moments.
In addition, X = x ⇒ Y = f(Γ) does not mean that Y(u) is the same for every u ∈ X⁻¹(x). It simply means that whatever values Y(u) has as u varies across X⁻¹(x) are the same as the values that f(Γ(u)) has. This weakness in the definition of inference is necessary for it to accommodate observation devices. (Recall that in such devices X(u) is how the observation device is set up, and the conclusion of the device depends on characteristics of the external universe.)
Along the same lines, C > Γ does not imply that there is exactly one probe of Γ for which the associated conclusion value is 1. (This is true even though the set of probes of Γ constitutes a full unary representation of Γ(U).) Formally, C > Γ does not imply that there is exactly one probe f of Γ such that ∃ x : X = x ⇒ Y = f(Γ) = 1. There may be more than one such f, or even none. So as embodied in weak inference, for C to "predict Γ" (something concerning the future state of the universe, as encapsulated in the function Γ) does not mean that for each γ ∈ Γ(U) there is some associated question x that, if embodied in X, guarantees that Y correctly says, "yes, in this universe u, γ is the value that will occur; Γ(u) = γ". Weak inference only requires that for each γ and associated probe, X can be set up so that the device's answer Y(u) must be correct, not that it can be set up to be correct and answer in the affirmative.
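This point is easy to make concrete: a device can weakly infer Γ while never answering any probe in the affirmative. A hypothetical finite sketch:

```python
# Toy version of the point above: a device that weakly infers Gamma while
# never answering any probe in the affirmative at its forcing setup values.
U = range(4)
Gamma = {0: 'a', 1: 'b', 2: 'c', 3: 'a'}
X = {u: u for u in U}                 # setup singles out each u
Y = {0: -1, 1: -1, 2: -1, 3: 1}       # surjective onto B = {-1, 1}

for gamma in set(Gamma.values()):
    f = lambda g, t=gamma: 1 if g == t else -1
    # there is an x whose forced answer is correct AND is "no":
    assert any(all(Y[u] == f(Gamma[u]) and Y[u] == -1 for u in U if X[u] == x)
               for x in set(X.values()))
```

For every probe, the device simply points its setup at a u where the probed value does not occur, so its correct answer is always −1.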
Similarly, $C > \Gamma$ does not imply that $C$ can infer a coarse-grained version of $\Gamma$. It implies
that $C$ can correctly answer "does $\Gamma(u)$ equal $\gamma_1$?" for any $\gamma_1 \in \Gamma(U)$, and that it can correctly
answer "does $\Gamma(u)$ equal $\gamma_2$?" for any $\gamma_2 \in \Gamma(U)$. However it does not imply that $C$ can correctly
answer "does $\Gamma(u)$ equal either $\gamma_1$ or $\gamma_2$ or both?". In particular, for two functions $\Gamma_1$ and
$\Gamma_2$ over $U$, $C > \Gamma_1$ and $C > \Gamma_2$ does not imply $C > (\Gamma_1, \Gamma_2)$, i.e., that $C$ infers the function $u \mapsto (\Gamma_1(u), \Gamma_2(u))$.
Next, note that nothing in Def. 3 fixes how the setup value $x$ associated with a given probe is to be found. That mapping can be folded into the device itself: define a composite device $C_\Gamma \equiv (X_\Gamma, Y_\Gamma)$ where $X_\Gamma(u)$ itself
specifies the $f(\Gamma)$ that we wish to answer using the original device $(X, Y)$. So for example, say
$(X, Y)$ is a computer running a physical simulation program whose initialized state is given by
$X(u)$. Then $C_\Gamma$ is that computer modified by having a front-end program that runs first to figure
out how to initialize the simulation so that the bit it produces as a conclusion answers the question
of interest. In this case, trivially, there is no issue in mapping from $\gamma$ to $x$; that mapping is part of
the setup function of our new device, $X_\Gamma(\cdot)$.
In particular, say that there is an external scientist who types into the computer $C$ a specification
of the system whose evolution is to be simulated in the computer (i.e., forces $X(u)$ to have
a value that is interpreted as that specification). Then one can define $C_\Gamma$ so that this typing by
the scientist is part of the composite setup function $X_\Gamma(\cdot)$. In this definition, we view the human scientist as part of the device (s)he is
using.
In summary, and speaking very colloquially, one can view weak inference as a necessary
condition for saying that a device "knows" the actual value of a function of the state of the
universe. Whatever else such knowledge entails, it means that the device can, by whatever means,
correctly answer (with a yes or a no) "Does the value of the function of the state of the universe
equal $z$?" for any value $z$ in the codomain of the function.
As with weak inference, there is no requirement that a device knows how it has been set up
for it to strongly infer another device. Similarly, there is no requirement that it be able to strongly
infer the unions of probes, no requirements concerning its position in the Chomsky hierarchy,
etc. Despite being so pared-down, the definition of strong inference is still sufficient to exhibit
some non-trivial behavior.
APPENDIX C: Alternative definitions of weak inference
There are alternatives to Def. 3 that accommodate the case where $|\Gamma(U)| > 2$ without employing
multiple probes. One such alternative uses multiple devices in concert, each sharing the
same setup function, and each device's conclusion giving a different bit concerning $\Gamma$'s value. As
an example, say that $\Gamma$'s range is $R$. Then we could assign each device to a separate real number,
and require that for all $u$ one and only one device's conclusion equals 1, namely the device
corresponding to the value of $\Gamma(u)$.
To formalize this, say we have a set of devices $\{C_z : z \in R\}$ and some function $\Gamma : U \to R$. In
addition suppose there is some vector $x$ with components $x_z$ running over all $z \in R$ such that

i) $\bigcap_{z \in R} X_z^{-1}(x_z) \equiv \hat{U} \neq \emptyset$;

ii) $\forall u \in \hat{U}$, $\forall z \in R$: $Y_z(u) = 1$ iff $\Gamma(u) = z$;

iii) $\forall \gamma \in \Gamma(U)$, $\exists u \in \hat{U}$ such that $Y_\gamma(u) = 1$.

Then we can jointly set up the set of devices so that their joint conclusion gives $\Gamma(u)$, and we can
do so without precluding any element of $\Gamma(U)$. In this sense, the set of devices jointly infers $\Gamma$.
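Conditions (i)–(iii) can be checked mechanically in a small finite model (a hypothetical sketch; the constant setup functions and the toy universe are my assumptions, chosen so that the intersection in (i) is all of $U$):

```python
# Joint inference by a family of devices {C_z : z in R}, one per value in
# Gamma's range, all sharing the same (here constant) setup.
U = [0, 1, 2, 3, 4, 5]
Gamma = lambda u: u % 3
R = sorted({Gamma(u) for u in U})      # Gamma's range

X = {z: (lambda u: 0) for z in R}                      # setup of each device C_z
Y = {z: (lambda u, z=z: 1 if Gamma(u) == z else -1) for z in R}   # conclusions

x = {z: 0 for z in R}                                  # joint setup vector
# Condition (i): the intersection of the setup preimages is non-empty.
U_hat = [u for u in U if all(X[z](u) == x[z] for z in R)]
assert U_hat

# Condition (ii): on U_hat, exactly the device indexed by Gamma(u) concludes 1.
assert all(sum(Y[z](u) == 1 for z in R) == 1 and Y[Gamma(u)](u) == 1
           for u in U_hat)
# Condition (iii): every value in Gamma(U) is realized somewhere in U_hat.
assert all(any(Y[g](u) == 1 for u in U_hat) for g in R)
```

Note the `z=z` default-argument idiom, which freezes each `z` at definition time; without it every conclusion function would close over the same final loop variable.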
Alternatively, we could use a single device, where we modify the definition of "device" to
allow arbitrary cardinality of the range of $Y$. With this modification, the conclusion function of
the device does not answer the question of what the value of a particular function of $\Gamma(U)$ is.
Rather it directly encodes the value of $\Gamma(u)$.
It would appear that under such an alternative we do not need to have the value of $X(u)$ specify
the bit concerning $\Gamma(u)$ that we want to infer, and do not need to consider multiple probes. So for
example, it would appear that when the device is being used for prediction, under this alternative
$X(u)$ need only specify what is known concerning the current state of the system whose future
state is being predicted, without specifying a particular bit concerning that future state that we
wish our device to predict. The conclusion $Y$ (or set of conclusions, as the case might be) would
specify the prediction in full.
Things are not so simple, unfortunately. If we wish to allow the device to infer functions with
different ranges, then under this alternative we have to allow different functions relating $Y(u)$ and
$\Gamma(u)$. This need is especially acute if we want to allow $|\Gamma(U)|$ to vary.
Such functions should be surjective, to ensure that our device can conclude every possible
value of $\Gamma(U)$. (This surjectivity is analogous to the requirement that we consider all probes in
Def. 3.) For any such function $\phi : Y(U) \to \Gamma(U)$, we would interpret a particular value $Y(u)$ as
saying that $\Gamma(u) = \phi(Y(u))$. (This contrasts with the situation where $Y(U) = B$, where we interpret
$Y(u) = +1/-1$ to mean "yes"/"no", respectively, in response to the question of whether some
associated probe has the value $+1$.)
One immediate problem with this alternative definition of inference is that it does not allow a
device $(X, Y)$ to infer any function $\Gamma$ where $|\Gamma(U)| > |Y(U)|$. Such difficulties do not hold for
Def. 3. For example, if $|X(U)| = 3$, $X$ is a fine-graining of $Y$ with two of its elements contained in
$Y^{-1}(-1)$, and $\Gamma$ is a fine-graining of $X$, then $(X, Y) > \Gamma$. (For every probe $f_\gamma$ of $\Gamma(U)$, $x$ is chosen to
be one of the two elements that cause $Y(u) = -1$. The precise $x$ chosen for a particular probe $f_\gamma$
is the one that lies in $(f_\gamma(\Gamma))^{-1}(-1)$.)
Other difficulties arise when we try to specify this alternative definition in full. For example,
one possible such definition is that $C$ infers $\Gamma$ iff $\exists x$ and a function $\phi : Y(U) \to \Gamma(U)$ such that
$X = x \Rightarrow \phi(Y) = \Gamma$. Such a definition is unsatisfying in that, by not fixing $\phi$ ahead of time, it
leaves unspecified how the conclusion of the device is to be physically interpreted as an encoding
of $\Gamma(u)$. (This is in addition to the lack of a fixed mapping from $\gamma$ to $x$, a lack which also arises
in Def. 3.)
To get around this problem we could pre-fix a set of $\phi$'s, one for every member of a set of
ranges $\{\Gamma(U)\}$. We could then have $u$ pick out the precise $\phi$ to use. This requires the introduction
of substantial additional structure into the definition of devices, however. (A somewhat related
notion is considered in Sec. 9.) Another possible solution would be along the lines of: $\forall \phi :
Y(U) \to \Gamma(U)$, $\exists x$ such that $X = x \Rightarrow \phi(Y) = \Gamma$. But this returns us to a definition of inference
involving multiple functions relating $Y$ and $\Gamma$.
All of these other difficulties also apply to the definition above of joint inference involving
multiple devices. In particular, say we wish to use the same set of devices to jointly infer functions
having different ranges from one another. Then we have to specify something about how to map
the joint conclusion of the devices into an inference in any of those ranges. For example, if the
set of devices is $\{C_z : z \in R\}$ and $\Gamma(U)$ is non-numeric, we would need to specify something
about how a joint conclusion $\{Y_z(u)\}$ gets mapped into that non-numeric space.
As a final possibility, we could stick with a single device and have $Y(U) = B$, but use some
representation of $\Gamma(U)$ in $X$ other than the unary representation implicit in Def. 3. For example,
we could require that for all binary representations of $\Gamma(U)$, and for all bits $i$ in that representation,
there is an $x$ such that $X = x \Rightarrow Y = b_i(\Gamma)$, where $b_i(\gamma)$ is the $i$'th bit of the representation of $\gamma$. This would allow smaller spaces $X(U)$ in general. But
it would still require consideration of multiple functions relating $Y$ and $\Gamma$. It would also raise the
issue of how to encode the elements of $\Gamma(U)$ as bits.
For simplicity, in the text we avoid these issues and restrict attention to the original definitions.