Structured Probabilistic Reasoning: (Incomplete Draft)
Structured Probabilistic Reasoning: (Incomplete Draft)
(Incomplete draft)
Bart Jacobs
Institute for Computing and Information Sciences, Radboud University Nijmegen,
P.O. Box 9010, 6500 GL Nijmegen, The Netherlands.
[email protected] http://www.cs.ru.nl/∼ bart
Preface page v
1 Collections 1
1.1 Cartesian products 2
1.2 Lists 6
1.3 Subsets 16
1.4 Multisets 25
1.5 Multisets in summations 35
1.6 Binomial coefficients of multisets 41
1.7 Multichoose coefficents 48
1.8 Channels 56
1.9 The role of category theory 61
2 Discrete probability distributions 69
2.1 Probability distributions 70
2.2 Probabilistic channels 81
2.3 Frequentist learning: from multisets to distributions 92
2.4 Parallel products 98
2.5 Projecting and copying 109
2.6 A Bayesian network example 116
2.7 Divergence between distributions 122
3 Drawing from an urn 127
3.1 Accumlation and arrangement, revisited 132
3.2 Zipping multisets 136
3.3 The multinomial channel 144
3.4 The hypergeometric and Pólya channels 156
3.5 Iterated drawing from an urn 168
3.6 The parallel multinomial law: four definitions 181
3.7 The parallel multinomial law: basic properties 189
iii
iv Chapter 0. Contents
iv
Preface
v
vi Chapter 0. Preface
tage is that there is no deeply engrained familiarity with the field and with its
development. But at the same time this distance may be an advantage, since
it provides a fresh perspective, without sacred truths and without adherence to
common practices and notations. For instance, the terminology and notation in
this book are influenced by quantum theory, e.g. in using the words ‘state’, ‘ob-
servable’ and ‘test’ — as synonyms for ‘distribution’, for an R-valued function
on a sample space and for compatible (summable) predicates — in using ket
notation | − i in writing discrete probability distributions, or in using daggers
as reversals, in analogy with conjugate transposes (for Hilbert spaces).
It should be said: for someone trained in formal methods, the area of prob-
ability theory can be rather sloppy: everything is called ‘P’, types are hardly
ever used, crucial ingredients (like distributions in expected values) are left
implicit, basic notions (like conjugate prior) are introduced only via examples,
calculation recipes and algorithms are regularly just given, without explana-
tion, goal or justification, etc. This hurts, especially because there is so much
beautiful mathematical structure around. For instance, the notion of a channel
(see below) formalises the idea of a conditional probability and carries a rich
mathematical structure that can be used in compositional reasoning, with both
sequential and parallel composition. The Bayesian inversion (‘dagger’) of a
channel does not only come with appealing mathematical (categorical) proper-
ties — e.g. smooth interaction with sequential and parallel composition — but
is also extremely useful in inference and learning. Via this dagger we can con-
nect forward and backward inference (see Theorem 6.6.3: backward inference
is forward inference with the dagger, and vice-versa) and capture the difference
between Pearl’s and Jeffrey’s update rules (see Theorem 6.7.4: Pearl increases
validity, whereas Jeffrey decreases divergence).
We even dare to think that this ‘sloppiness’ is ultimately a hindrance to fur-
ther development of the field, especially in computer science, where computer-
assisted reasoning requires a clear syntax and semantics. For instance, it is
hard to even express the above-mentioned theorems 6.6.3 and 6.7.4 in stan-
dard probabilistic notation. One can speculate that states/distributions are kept
implicit in traditional probability theory because in many examples they are
used as a fixed implicit assumption in the background. Indeed, in mathemati-
cal notation one tends to omit — for efficiency — the least relevant (implicit)
parameters. But the essence of probabilistic computation is state transforma-
tion, where it has become highly relevant to know explicitly in which state one
is working at which stage. The notation developed in this book helps in such
situations — and in many other situations as well, we hope.
Apart from having beautiful structure, probability theory also has magic. It
can be found in the following two points.
vi
vii
The combination of these two points is very powerful and forms the basis for
probabilistic reasoning. For instance, if we know that two phenomena are re-
lated, and we have new information about one of them, then we also learn
something new about the other phenomenon, after updating. We shall see that
such crossover ripple effects can be described in two equivalent ways, starting
from a joint distribution with evidence in one component.
vii
viii Chapter 0. Preface
viii
ix
In any case, such processing should be subject to suitable safeguards, which should
include specific information to the data subject and the right to obtain human interven-
tion, to express his or her point of view, to obtain an explanation of the decision reached
after such assessment and to challenge the decision.
It is not acceptable that your mortgage request is denied because you drive a
blue car — in presence of a correlation between driving blue cars and being
late on one’s mortgage payments.
These and other developments have led to a new area called Explainable
Artificial Intelligence (XAI), which strives to provide decisions with explana-
tions that can be understood easily by humans, without bias or discrimination.
Although this book will not contribute to XAI as such, it aims to provide a
mathematically solid basis for such explanations.
In this context it is appropriate to quote Judea Pearl [134] from 1989 about
a divide that is still wide today.
To those trained in traditional logics, symbolic reasoning is the standard, and non-
monotonicity a novelty. To students of probability, on the other hand, it is symbolic
reasoning that is novel, not nonmonotonicity. Dealing with new facts that cause proba-
bilities to change abruptly from very high values to very low values is a commonplace
phenomenon in almost every probabilistic exercise and, naturally, has attracted special
attention among probabilists. The new challenge for probabilists is to find ways of ab-
stracting out the numerical character of high and low probabilities, and cast them in
linguistic terms that reflect the natural process of accepting and retracting beliefs.
This book does not pretend to fill this gap. One of the big embarrassments of
the field is that there is no widely accepted symbolic logic for probability, to-
gether with proof rules and a denotational semantics. Such a logic for symbolic
reasoning about probability will be non-trivial, because it will have to be non-
monotonic1 — a property that many logicians shy away from. This book does
aim to contribute towards bridging the divide mentioned by Pearl, by provid-
ing a mathematical basis for such a symbolic probabilistic logic, consisting of
channels, states, predicates, transformations, conditioning, disintegration, etc.
From the perspective of this book, the structured categorical approach to
probability theory began with the work of Bill Lawvere (already in the 1960s)
and his student Michèle Giry. They recognised that taking probability dis-
tributions has the structure of a monad, which was published in the early
1980s in [58]. Roughly at the same time Dexter Kozen started the systematic
investigation of probabilistic programming languages and logics, published
in [106, 107]. The monad introduced back then is now called the Giry mo-
1 Informally, a logic is non-monotonic if adding assumptions may make a conclusion less true.
For instance, I may think that scientists are civilised people, until, at some conference dinner,
a heated scientific debate ends in a fist fight.
ix
x Chapter 0. Preface
Contents overview
The first chapter of the book covers introductory material that is meant to set
the scene. It starts from basic collection types like lists and subsets, and con-
tinues with multisets, which receive most attention. The chapter discusses the
x
xi
(free) monoid structure on all these collection types and introduces ‘unit’ and
‘flatten’ maps as their common, underlying structure. It also introduces the
basic concept of a channel, for these three collection types, and shows how
channels can be used for state transformation and how they can be composed,
both sequentially and in parallel. At the end, the chapter provides definitions
of the relevant notions from category theory.
In the second chapter (discrete) probability distributions first emerge, as a
special collection type, with their own associated form of (probabilistic) chan-
nel. The subtleties of parallel products of distributions (states), with entwined-
ness/correlation between components and the non-naturality of copying, are
discussed at this early stage. This culminates in an illustration of Bayesian net-
works in terms of (probabilistic) channels. It shows how predictions are made
within such Bayesian networks via state transformation and via compositional
reasoning, basically by translating the network structure into (sequential and
parallel) composites of channels.
Blindly drawing coloured balls from an urn is a basic model in discrete prob-
ability. Such draws are analysed systematically in Chapter 3, not only for the
two familar multinomial and hypergeometric forms (with or without replace-
ment), but also in the less familiar Pólya and Ewens forms. By describing these
draws as probabilistic channels we can derive the well known formulations for
these draw-distributions via channel-composition. Once formulated in terms of
channels, these distributions satisfy various compositionality properties. They
are typical for our approach and are (largely) absent in traditional treatments
of this topic. Urns and draws from urns are both described as multisets. The
interplay between multisets and distributions is an underlying theme in this
chapter. There is a fundamental distributive law between multisets and distri-
butions that expresses basic structural properties.
The fourth chapter is more logically oriented, via observables X → R (in-
cluding factors, predicates and events) that can be defined on sample spaces
X, providing numerical information. The chapter concentrates on validity of
obervables in states and on transformation of observables. Where the second
chapter introduces state transformation along a probabilistic channel in a for-
ward direction, this fourth chapter adds observable (predicate) transformation
in a backward direction. These two operations are of fundamental importance
in program semantics, and also in quantum computation — where they are dis-
tinguished as Schrödinger’s (forward) and Heisenberg’s (backward) approach.
In this context, a random variable is a combination of a state and an observable,
on the same underlying sample space. The statistical notions of variance and
covariance are described in terms of of validity for such random variables in
xi
xii Chapter 0. Preface
xii
xiii
els. But the most fundamental technique that is introduced in this chapter, via
string diagrams, is disintegration. In essence, it is the well known procedure
of extracting a conditional probability P(y | x) from a joint probability P(x, y).
One of the themes running through this book is how ‘crossover’ influence can
be captured via channels — extracted from joint states via disintegration —
in particular via forward and backward inference. This phenomenon is what
makes (reasoning in) Bayesian networks work. Disintegration is of interest in
itself, but also provides an intuitive formalisation of the Bayesian inversion of
a channel.
Almost all of the material in these chapters is known from the literature, but
typically not in the channel-based form in which it is presented here. This book
includes many examples, often copied from familiar sources, with the deliber-
ate aim of illustrating how the channel-based approach actually works. Since
many of these examples are taken from the literature, the interested reader may
wish to compare the channel-based description used here with the original de-
scription.
• The (non-trivial) calculations in this book have been carried out with the
EfProb library [20] for channel-based probability. Several calculations in
this book can be done by hand, typically when the outcomes are described
117
as fractions, like 2012 . Such calculations are meant to be reconstructable by
a motivated reader who really wishes to learn the ‘mechanics’ of the field.
Doing such calculations is a great way to really understand the topic — and
the approach of this book2 . Outcomes written in decimal notation 0.1234, as
approximations, or as plots, serve to give an impression of the results of a
computation.
• For the rest of this book, beyond Chapter 7, several additional chapters exist
2 Doing the actual calculations can be a bit boring and time consuming, but there are useful
online tools for calculating fractions, such as
https://www.mathpapa.com/fraction-calculator.html. Recent versions of EfProb also allow
calculations in fractional form.
xiii
xiv Chapter 0. Preface
xiv
1
Collections
There are several ways to put elements from a given set together, for instance
as lists, subsets, multisets, and as probability distributions. This introductory
chapter takes a systematic look at such collections and seeks to bring out nu-
merous similarities. For instance, lists, subsets and multisets all form monoids,
by suitable unions of collections. Unions of distributions are more subtle and
take the form of convex combinations. Also, subsets, multisets and distribu-
tions can be combined naturally via parallel products ⊗, though lists cannot.
In this first chapter, we collect some basic operations and properties of tuples,
lists, subsets and multisets — where multisets are ‘sets’ in which elements
may occur multiple times. Probability distributions will be postponed to the
next chapter. Especially, we collect several basic definitions and results for
multisets, since they play an important role in the sequel, as urns, filled with
coloured balls, as draws from such urns, and as data in learning.
The main differences between lists, subsets and multisets are summarised in
the table below.
lists subsets multisets
For instance, the lists [a, a, b], [a, b, a] and [a, b] are all different. The multisets
2| ai+1| bi and 1| bi+2| a i, with the element a occurring twice and the element
b occurring once, are the same. However, 1|a i + 1| b i is a different multiset.
Similarly, the sets {a, b}, {b, a}, and {a} ∪ {a, b} are the same.
These collections are important in themselves, in many ways, and primar-
ily (in this book) as outputs of channels. Channels are functions of the form
input → T (output), where T is a ‘collection’ operator, for instance, combin-
ing elements as lists, subsets, multisets, or distributions. Such channels capture
1
2 Chapter 1. Collections
X1 × X2 B {(x1 , x2 ) | x1 ∈ X1 and x2 ∈ X2 }.
X1 × · · · × Xn B {(x1 , . . . , xn ) | x1 ∈ X1 , . . . , xn ∈ Xn }.
2
1.1. Cartesian products 3
Q
differently using the symbol , as:
Y Y
Xi or more informally as: Xi .
1≤i≤n
In the latter case it is left implicit what the range is of the index element i.
We allow n = 0. The resulting ‘empty’ product is then written as a singleton
set, written as 1, containing the empty tuple () as sole element, as in:
1 B {()}.
For n = 1 the product X1 × · · · × Xn is (isomorphic to) the set X1 . Note that we
are overloading the symbol 1 and using it both as numeral and as singleton set.
If one of the sets Xi in a product X1 × · · · × Xn is empty, then the whole
product is empty. Also, if all of the sets Xi are finite, then so is the product
X1 × · · · × Xn . In fact, the number of elements of X1 × · · · × Xn is then obtained
by multiplying all the numbers of elements of the sets Xi .
πi ◦ h f1 , . . . , fn i = fi . (1.1)
This is an equality of functions. It can be proven easily by applying both sides
to an arbitrary element y ∈ Y.
There are some more ‘obvious’ equations about tupling of functions:
h f1 , . . . , fn i ◦ g = h f1 ◦ g, . . . , fn ◦ gi hπ1 , . . . , πn i = id , (1.2)
where g : Z → Y is an arbitrary function. In the last equation, id is the identity
function on the product X1 × · · · × Xn .
In a Cartesian product we place sets ‘in parallel’. We can also place functions
3
4 Chapter 1. Collections
X1 × · · · × Xn
f1 ×···× fn
/ Y1 × · · · × Yn
via:
f1 × · · · × fn = h f1 ◦ π1 , . . . , fn ◦ πn i
so that:
f1 × · · · × fn (x1 , . . . , xn ) = ( f1 (x1 ), . . . , fn (xn )).
The latter formulation clearly shows how the functions fi are applied in parallel
to the elements xi .
We overload the product symbol ×, since we use it both for sets and for
functions. This may be a bit confusing at first, but it is in fact quite convenient.
This new set X Y is sometimes called the function space or the exponent of X
and Y. Notice that this exponent notation is consistent with the above one for
powers, since functions n → X can be identified with n-tuples of elements in
X.
These exponents X Y are related to products in an elementary and useful way,
namely via a bijective correspondence:
Z×Y /Xf
=============Y=== (1.3)
Z /X
g
4
1.1. Cartesian products 5
Exercises
1.1.1 Check what a tuple function hπ2 , π3 , π6 i does on a product set X1 ×
· · · × X8 . What is the codomain of this function?
1.1.2 Check that, in general, the tuple function h f1 , . . . , fn i is the unique
function h : Y → X1 × · · · × Xn with πi ◦ h = fi for each i.
1.1.3 Prove, using Equations (1.1) and (1.2) for tuples and projections, that:
g1 × · · · × gn ◦ f1 × · · · × fn = (g1 ◦ f1 ) × · · · × (gn ◦ fn ).
1.1.4 Check that for each set X there is a unique function X → 1. Because
of this property the set 1 is sometimes called ‘final’ or ‘terminal’. The
unique function is often denoted by !.
Check also that a function 1 → X corresponds to an element of X.
1.1.5 Define functions in both directions, using tuples and projections, that
yield isomorphisms:
X×Y Y ×X 1×X X X × (Y × Z) (X × Y) × Z.
X1 X 1Y 1 (X × Y)Z X Z × Y Z X Y×Z X Y Z .
XK × Y K
zip[K]
/ (X × Y)K
by:
5
6 Chapter 1. Collections
1.2 Lists
The datatype of (finite) lists of elements from a given set is well-known in
computer science, especially in functional programming. This section collects
some basic constructions and properties, especially about the close relationship
between lists and monoids.
For an arbitrary set X we write L(X) for the set of all finite lists [x1 , . . . , xn ]
of elements xi ∈ X, for arbitrary n ∈ N. Notice that we use square brackets
[−] for lists, to distinguish them from tuples, which are typically written with
round brackets (−).
Thus, the set of lists over X can be defined as a union of all powers of X, as
in:
[
L(X) B Xn.
n∈N
When the elements of X are letters of an alphabet, then L(X) is the set of words
— the language — over this alphabet. The set L(X) is alternatively written as
X ? , and called the Kleene star of X.
We zoom in on some trivial cases. One has L(0) 1, since one can only
form the empty word over the empty alphabet 0 = ∅. If the alphabet contains
only one letter, a word consists of a finite number of occurrences of this single
letter. Thus: L(1) N.
We consider lists as an instance of what we call a collection data type, since
L(X) collects elements of X in a certain manner. What distinguishes lists from
other collection types is that elements may occur multiple times, and that the
order of occurrence matters. The three lists [a, b, a], [a, a, b], and [a, b] differ.
As mentioned in the introduction to this chapter, within a subset orders and
multiplicities do not matter, see Section 1.3; and in a multiset the order of
elements does not matter, but multiplicities do matter, see Section 1.4.
Let f : X → Y be an arbitrary function. It can be used to map lists over X into
lists over Y by applying f element-wise. This is what functional programmers
call map-list. Here we like overloading, so we write L( f ) : L(X) → L(Y) for
this function, defined as:
Thus, L is an operation that not only sends sets to sets, but also functions to
functions. It does so in such a way that identity maps and compositions are
preserved:
L(id ) = id L(g ◦ f ) = L(g) ◦ L( f ).
6
1.2. Lists 7
1.2.1 Monoids
A monoid is a very basic mathematical structure. For convenience we define it
explicitly.
Definition 1.2.1. A monoid consists of a set M with a binary operation M ×
M → M, written for instance as infix +, together with an identity element, say
written as 0 ∈ M. The binary operation + is associative and has 0 as identity
on both sides. That is, for all a, b, c ∈ M,
a + (b + c) = (a + b) + c and 0 + a = a = a + 0.
The monoid is called commutative if a + b = b + a, for all a, b ∈ M. It is called
idempotent if a + a = a for all a ∈ M.
Let (M, 0, +) and (N, 1, ·) be two monoids. A function f : M → N is called a
homomorphism of monoids if f preserves the unit and binary operation, in the
sense that:
7
8 Chapter 1. Collections
Thus, lists are monoids via concatenation. But there is more to say: lists are
free monoids. We shall occasionally make use of this basic property and so we
like to make it explicit. We shall encounter similar freeness properties for other
collection types.
Each element x ∈ X yields a singleton list unit(x) B [x] ∈ L(X). The result-
ing function unit : X → L(X) plays a special role, see also the next subsection.
X
unit / L(X)
f , homomorphism (1.4)
f
* M
f [] = 0 f ([x]) = f (x).
and
Further, on an list [x1 , . . . , xn ] of length n ≥ 2 we necessarily have:
f [x1 , . . . , xn ] = f [x1 ] ++ · · · ++ [xn ]
= f [x1 ] + · · · + f [xn ]
= f (x1 ) + · · · + f (xn ).
The exercise below illustrate this result. For future use we introduce monoid
actions and their homomorphisms.
8
1.2. Lists 9
f α(a, x) = β a, f (x)
for all a ∈ M, x ∈ X.
M×X
id × f
/ M×Y
α
β
X
f
/Y
Monoid actions are quite common in mathematics. For instance, scalar mul-
tiplication of a vector space forms an action. Also, as we shall see, probabilistic
updating can be described via monoid actions. The action map α : M × X → X
can be understood intuitively as pushing the elements in X forward with a
quantity from M. It then makes sense that the zero-push is the identity, and
that a sum-push is the composition of two individual pushes.
The next result contains some basic properties about unit and flatten. These
properties will first be formulated in terms of equations, and then, alternatively
as commuting diagrams. The latter style is preferred in this book.
9
10 Chapter 1. Collections
X
unit / L(X) L(L(X))
flat / L(X)
f L( f ) L(L( f )) L( f )
Y / L(Y) L(L(Y)) / L(Y)
unit flat
L(X)
unit / L(L(X)) o L(unit) L(X) L(L(L(X)))
flat / L(L(X))
L(flat)
flat flat
L(X) L(L(X)) / L(X)
flat
Proof. We shall do the first cases of each item, leaving the second cases to the
interested reader. First, for f : X → Y and x ∈ X one has:
Next, for the second item we take an arbitrary list [x1 , . . . , xn ] ∈ L(X). Then:
= [x1 , . . . , xn ]
flat ◦ L(unit) ([x1 , . . . , xn ]) = flat [unit(x1 ), . . . , unit(xn )]
= [x1 , . . . , xn ].
10
1.2. Lists 11
L(α), as in:
X
unit / L(X) L(L(X))
L(α)
/ L(X)
α flat α (1.5)
id $ α
X L(X) /X
L(M1 )
L( f )
/ L(M2 )
α1 α2 (1.6)
M1
f
/ M2 .
commutes.
This result says that instead of giving a binary operation + with an identity
element u we can give a single operation α that works on all sequences of el-
ements. This is not so surprising, since we can apply the sum multiple times.
The more interesting part is that the monoid equations can be captured uni-
formly by the diagrams/equations (1.5). We shall see that same diagrams also
work for other types of monoids (and collection types).
11
12 Chapter 1. Collections
similar manner:
(1.5)
x + (y + z) = α [x, α([y, z])] = α [α(unit(x)), α([y, z])]
= α L(α) [ [x], [y, z] ]
(1.5)
= α flat [ [x], [y, z] ]
= α flat [ [x, y], [z] ]
(1.5)
= α L(α) [ [x, y], [z] ]
= α [α([x, y]), α(unit(z))]
(1.5)
= α [α([x, y]), z] = (x + y) + z.
= f (x1 ) + · · · + f (xn )
= f (x1 + · · · + xn ) since f is a homomorphism
= f ◦ α1 ([x1 , . . . , xn ]).
12
1.2. Lists 13
For instance, for N = 4 this inverse image contains the eight lists:
[1, 1, 1, 1], [1, 1, 2], [1, 2, 1], [2, 1, 1], [2, 2], [1, 3], [3, 1], [4]. (1.7)
We can interpret the situation as follows. Suppose we have coins with value
n ∈ N>0 , for each n. Then we can ask, for an amount N: how many (ordered)
ways are there to lay out the amount N in coins? For N = 4 the different layouts
are given above. Other interpretations are possible: one can also think of the
sequences (1.7) as partitions of the numbers {1, 2, 3, 4}.
Here is a first, easy counting result.
Lemma 1.2.7. For N ∈ N>0 , the subset sum −1 (N) ⊆ L(N>0 ) has 2N−1 ele-
ments.
X 1 2K+1 − 1
= . (∗)
0≤k≤K
2 k 2K
13
14 Chapter 1. Collections
1 1 1 1 1 1 1 1
24 12 12 12 8 6 6 4
| {z }
with sum: 1
Obviously, elements in a lists are ordered. Thus, in (1.7) we distinghuish
between coin layouts [1, 1, 2], [1, 2, 1] and [2, 1, 1]. However, when we are
commonly discussing which coins add up to 4 we do not take the order into
account, for instance in saying that we use two coins of value 1 and one coin
of 2, without caring about the order. In doing so, we are not using lists as col-
lection type, but multisets — in which the order of elements does not matter.
14
1.2. Lists 15
These multisets form an important alternative collection type; they are dis-
cussed from Section 1.4 onwards.
Exercises
1.2.1 Let X = {a, b, c} and Y = {u, v} be sets with a function f : X → Y
given by f (a) = u = f (c) and f (b) = v. Write `1 = [c, a, b, a] and
`2 = [b, c, c, c]. Compute consecutively:
• `1 ++ `2
• `2 ++ `1
• `1 ++ `1
• `1 ++ (`2 ++ `1 )
• (`1 ++ `2 ) ++ `1
• L( f )(`1 )
• L( f )(`2 )
• L( f )(`1 ) ++ L( f )(`2 )
• L( f )(`1 ++ `2 ).
1.2.2 We write log for the logarithm function with some base b > 0, so that
log(x) = y iff x = by . Verify that the logarithm function log is a map
of monoids:
(R>0 , 1, ·)
log
/ (R, 0, +).
flat [`1 , . . . , `n ] = `1 ++ · · · ++ `n .
15
16 Chapter 1. Collections
L L(X)
L(k−k)
/ L(N)
flat sum
L(X)
k−k
/N
1.3 Subsets
The next collection type that will be studied is powerset. The symbol P is
commonly used for the powerset operator. We will see that there are many
similarities with lists L from the previous section. We again pay much attention
to monoid structures.
For an arbitrary set X we write P(X) for the set of all subsets of X, and
Pfin (X) for the set of finite subsets. Thus:
16
1.3. Subsets 17
If X is a finite set itself, there is no difference between P(X) and Pfin (X). In the
sequel we shall speak mostly about P, but basically all properties of interest
hold for Pfin as well.
First of all, P is a functor: it works both on sets and on functions. Given
a function f : X → Y we can define a new function P( f ) : P(X) → P(Y) by
taking the image of f on a subset. Explicitly, for U ⊆ X,
P(π1 )(R) = {π1 (z) | z ∈ R} = {π1 (x, y) | (x, y) ∈ R} = {x | ∃y. (x, y) ∈ R}.
The next topic is the monoid structure on powersets. The first result is an
analogue of Lemma 1.2.2 and its proof is left to the reader.
Lemma 1.3.1. 1 For each set X, the powerset P(X) is a commutative and
idempotent monoid, with empty subset ∅ ∈ P(X) as identity element and
union ∪ of subsets of X as binary operation.
2 Each P( f ) : P(X) → P(Y) is a map of monoids, for f : X → Y.
Next we define unit and flatten maps for subsets, much like for lists. The
function unit : X → P(X) sends an element to a singleton subset: unit(x) B
{x}. The flatten function flat : P(P(X)) → P(X) is given by union: for A ⊆
P(X),
[
flat(A) B A = {x ∈ X | ∃U ∈ A. x ∈ U}.
X
unit / P(X) P(P(X))
flat / P(X)
f P( f ) P(P( f )) P( f )
Y / P(Y) P(P(Y)) / P(Y)
unit flat
commute.
17
18 Chapter 1. Collections
P(X)
unit / P(P(X)) o P(unit) P(X) P(P(P(X)))
flat / P(P(X))
P(flat)
flat flat
P(X) P(P(X)) / P(X)
flat
Thus, the support of a list is the subset of elements occurring in the list. The
support function removes order and multiplicities. The latter happens implic-
itly, via the set notation, above on the right-hand side. For instance,
supp([b, a, b, b, b]) = {a, b} = {b, a}.
Notice that there is no way to go in the other direction, namely Pfin (X) → L(X).
Of course, one can for each subset choose an order of the elements in order to
turn the subset into a list. However, this process is completely arbitrary and is
not uniform (natural).
The support function interacts nicely with the structures that we have seen
so far. This is expressed in the result below, where we use the same notation
unit and flat for different functions, namely for L and for P. The context, and
especially the type of an argument, will make clear which one is meant.
Lemma 1.3.3. Consider the support map supp : L(X) → Pfin (X) defined above.
1 It is a map of monoids (L(X), [], ++) → (P(X), ∅, ∪).
2 It is natural, in the sense that for f : X → Y one has:
L(X)
supp
/ Pfin (X)
L( f ) Pfin ( f )
L(Y) / Pfin (Y)
supp
3 It commutes with the unit and flatten maps of list and powerset, as in:
X X L(L(X))
L(supp)
/ L(Pfin (X)) supp
/ Pfin (Pfin (X))
unit
unit flat
flat
L(X) / Pfin (X) L(X) / Pfin (X)
supp supp
18
1.3. Subsets 19
= Pfin ( f )({x1 , . . . , xn })
= { f (x1 ), . . . , f (xn )}
= supp([ f (x1 ), . . . , f (xn )])
= supp(L( f )([x1 , . . . , xn ]))
= supp ◦ L( f ) ([x1 , . . . , xn ]).
The second diagram requires a bit more work. Starting from a list of lists we
get:
X
unit / Pfin (X)
f , homomorphism (1.9)
* M
f
19
20 Chapter 1. Collections
The order in the above sum f (x1 ) + · · · + f (xn ) does not matter since M is
commutative. The function f sends unions to sums since + is idempotent.
X
unit / Pfin (X) Pfin (Pfin (X))
Pfin (α)
/ Pfin (X)
α flat α (1.10)
id % α
X Pfin (X) /X
commute.
2 Let (M1 , u1 , +1 ) and (M2 , u2 , +2 ) be two commutative idempotent monoids,
with corresponding Pfin -algebras α1 : Pfin (M1 ) → M1 and α2 : Pfin (M2 ) →
M2 . A function f : M1 → M2 is a map of monoids if and only if the rectangle
Pfin (M1 )
Pfin ( f )
/ Pfin (M2 )
α1 α2 (1.11)
M1
f
/ M2 .
commutes.
Proof. This works very much like in the proof of Proposition 1.2.6. If (X, u, +)
is a monoid, we define α : Pfin (X) → X by freeness as α({x1 , . . . , xn }) B x1 +
· · · + xn . In the other direction, given α : Pfin (X) → X we define a sum as
x + y B α({x, y}) with unit u B α(∅). Clearly, thi sum + on X is commutative
and idempotent.
This result concentrates on the finite powerset functor Pfin . One can also con-
sider algebras P(X) → X for the (general) powerset functor P. Such algebras
turn the set X into a complete lattice, see [116, 9, 92] for details.
1.3.3 Extraction
So far we have concentrated on how similar lists and subsets are: the only struc-
tural difference that we have seen up to now is that subsets form an idempotent
20
1.3. Subsets 21
and commutative monoid. But there are other important differences. Here we
look at subsets of product sets, also known as relations.
The observation is that one can extract functions from a relation R ⊆ X × Y,
namely functions of the form extr 1 (R) : X → P(Y) and extr 2 (R) : Y → P(X),
given by:
extr 1 (R)(x) = {y ∈ Y | (x, y) ∈ R} extr 2 (R)(y) = {x ∈ X | (x, y) ∈ R}.
In fact, one can easily reconstruct the relation R from extr 1 (R), and also from
extr 2 (R), via:
R = {(x, y) | y ∈ extr 1 (R)(x)} = {(x, y) | x ∈ extr 2 (R)(y)}.
This all looks rather trivial, but such function extraction is less trivial for other
data types, as we shall see later on for distributions, where it will be called
disintegration.
Using the exponent notation from Subsection 1.1.2 we can summarise the
situation as follows. There are two isomorphisms:
P(Y)X P(X × Y) P(X)Y . (1.12)
Functions of the form X → P(Y) will later be called ‘channels’ from X to
Y, see Section 1.8. What we have just seen will then be described in terms of
‘extraction of channels’.
21
22 Chapter 1. Collections
The next lemma makes a basic result explicit, partly because it provides
valuable insight in itself, but also because we shall see generalisations later
on, for multisets instead of subsets. We use the familiar binomial and multi-
nomial coefficients. We recall their definitions, for natural numbers k ≤ n and
m1 , . . . , m` with ` ≥ 2 and i mi = m.
P
! !
n n! m m!
B and B . (1.13)
k k! · (n − k)! m1 , . . . , m` m1 ! · . . . · m` !
Recall that a partition of a set X is a disjoint cover: a finite collection of subsets
U1 , . . . , Uk ⊆ X satisfying Ui ∩ U j = ∅ for i , j and i Ui = X. We do not
S
assume that the subsets Ui are non-empty.
22
1.3. Subsets 23
Exercises
1.3.1 Continuing Exercise 1.2.1, compute:
• supp(`1 )
• supp(`2 )
• supp(`1 ++ `2 )
• supp(`1 ) ∪ supp(`2 )
• supp(L( f )(`1 ))
• Pfin ( f )(supp(`1 )).
1.3.2 We have used finite unions (∅, ∪) as monoid structure on P(X) in
Lemma 1.3.1 (1). Intersections (X, ∩) give another monoid structure
on P(X).
23
24 Chapter 1. Collections
24
1.4. Multisets 25
1.4 Multisets
So far we have discussed two collection data types, namely lists and subsets of
elements. In lists, elements occur in a particular order, and may occur multiple
times (at different positions). Both properties are lost when moving from lists
to sets. In this section we look at multisets, which are ‘sets’ in which elements
may occur multiple times. Hence multisets are in between lists and subsets,
since they do allow multiple occurrences, but the order of their elements is
irrelevant.
The list and powerset examples are somewhat remote from probability the-
ory. But multisets are much more directly relevant: first, because we use a
similar notation for multisets and distributions; and second, because observed
data can be organised nicely in terms of multisets. For instance, for statistical
analysis, a document is often seen as a multiset of words, in which one keeps
track of the words that occur in the document together with their frequency
(multiplicity); in that case, the order of the words is ignored. Also, tables with
observed data can be organised naturally as multisets, see Subsection 1.4.1 be-
low. Learning from such tables will be described in Section 2.1 as a (natural)
transformation from multisets to distributions.
Despite their importance, multisets do not have a prominent role in pro-
gramming, like lists have. Eigenvalues of a matrix form a clear example where
the ‘multi’ aspect is ignored in mathematics: eigenvalues may occur multiple
times, so the proper thing to say is that a matrix has a multiset of eigenvalues.
One reason for not using multisets may be that there is no established notation.
We shall use a ‘ket’ notation | − i that is borrowed from quantum theory, but
interchangeably also a functional notation. Since multisets are less familiar,
we take time to introduce the basic definitions and properties, in Sections 1.4
– 1.7.
We start with an introduction about notation, terminology, and conventions
for multisets. Consider a set C = {R, G, B} for the three colours Red, Green,
Blue. An example of a multiset over C is:
2| Ri + 5|G i + 0| Bi.
In this multiset the element R occurs 2 times, G occurs 5 times, and B occurs
0 times. The latter means that B does not occur, that is, B is not an element
of the multiset. From a multiset perspective, we have 2 + 5 + 0 = 7 elements
— and not just 2. A multiset like this may describe an urn containing 2 red
balls, 5 green ones, and no blue balls. Such multisets are quite common. For
instance, the chemical formula C2 H3 O2 for vinegar may be read as a multiset
25
26 Chapter 1. Collections
26
1.4. Multisets 27
This expression kϕk gives the size of the multiset, that is, its total number of
elements.
All of M, N, M∗ , N∗ , M[K], N[K] are functorial, in the same way. Hence
we concentrate on M. For a function f : X → Y we can define M( f ) : M(X) →
M(Y). When we see a multiset ϕ ∈ M(X) as an urn containing coloured balls,
with colours from X, then M( f )(ϕ) ∈ M(Y) is the urn with ‘repainted’ balls,
where the new colours are taken from the set Y. The function f : X → Y defines
the transformation of colours. It tells that a ball of colour x ∈ X in ϕ should be
repainted with colour f (x) ∈ Y.
The urn M( f )(ϕ) with repainted balls can be defined in two equivalent ways:
X X X
M( f ) ri | xi i B ri | f (xi ) i or as M( f )(ϕ)(y) B ϕ(x).
i i
x∈ f −1 (y)
It may take a bit of effort to see that these two descriptions are the same, see
P
Exercise 1.4.1 below. Notice that in the sum i ri | f (xi ) i it may happen that
f (xi1 ) = f (xi2 ) for xi1 , xi2 , so that ri1 and ri2 are added together. Thus, the sup-
P P
port of M( f ) i ri | xi i may have fewer elements than the support of i ri | xi i,
P P
but the sum of all multiplicities is the same in M( f ) i ri | xi i and i ri | xi i,
see Exercise 1.5.2 below.
27
28 Chapter 1. Collections
(r · ϕ)(x) B r · ϕ(x).
This scalar multiplication r · (−) : M(X) → M(X) preserves the sums (0, +)
from the previous item, and is thus a map of monoids.
3 For each f : X → Y, the function M( f ) : M(X) → M(Y) is a map of
monoids and also of cones. The latter means: M( f )(r · ϕ) = r · M( f )(ϕ).
28
1.4. Multisets 29
0 1 2 3 4 5 6 7 8 9 10
2 0 4 3 5 3 2 5 5 2 4
We can represent this table as a natural multiset over the set of ages {0, 1, . . . , 10}.
2| 0i + 4| 2 i + 3| 3 i + 5|4 i + 3| 5i + 2| 6i + 5| 7 i + 5| 8 i + 2|9 i + 4| 10 i.
Notice that there is no summand for age 1 because of our convention to ommit
expressions like 0| 1 i with multiplicity 0. We can visually represent the above
age data/multiset in the form of a histogram:
(1.16)
Here is another example, not with numerical data, in the form of ages, but
with nominal data, in the form of blood types. Testing the blood type of 50
individuals gives the following table.
A B O AB
10 15 18 7
This corresponds to a (natural) multiset over the set {A, B, O, AB} of blood
types, namely to:
10| A i + 15| Bi + 18| Oi + 7| ABi.
It gives rise to the following bar graph, in which there is no particular ordering
of elements. For convenience, we follow the order of the above table.
(1.17)
29
30 Chapter 1. Collections
Next, consider the two-dimensional table (1.18) below where we have com-
bined numeric information about blood pressure (either high H, or low L) and
certain medicines (either type 1, type 2, or no medicine, indicated as 0). There
is data about 100 study participants:
high 10 35 25 70
(1.18)
low 5 10 15 30
totals 15 45 40 100
We see that Table (1.18) contains ‘totals’ in its vertical and horizontal mar-
gins. They can be obtained from τ as marginals, using the functoriality of N.
This works as follows. Applying the natural multiset functor N to the two pro-
jections π1 : B × M → B and π2 : B × M → M yields marginal distributions on
30
1.4. Multisets 31
B and M, namely:
N(π1 )(τ) = 10|π1 (H, 0)i + 35| π1 (H, 1)i + 25| π1 (H, 2)i
+ 5| π1 (L, 0)i + 10| π1 (L, 1)i + 15| π1 (L, 2)i
= 10| H i + 35| H i + 25| H i + 5| L i + 10| L i + 15| L i
= 70| H i + 30| L i.
N(π2 )(τ) = (10 + 5)| 0i + (35 + 10)| 1 i + (25 + 15)| 2 i
= 15|0 i + 45| 1 i + 40|2 i.
The expression ‘marginal’ is used to describe such totals in the margin of a
multidimensional table. In Section 2.1 we describe how to obtain probabilities
from tables in a systematic manner.
X
unit / M(X) M(M(X))
flat / M(X)
f M( f ) M(M( f )) M( f )
Y / M(Y) M(M(Y)) / M(Y)
unit flat
commute.
2 The next two diagrams also commute.
M(X)
unit / M(M(X)) M(unit)
o M(X) M(M(M(X)))
flat / M(M(X))
M(flat)
flat flat
M(X) M(M(X)) / M(X)
flat
31
32 Chapter 1. Collections
The next result shows that natural multisets are free commutative monoids.
Arbitrary multisets are also free, but for other algebraic structures, see Exer-
cise 1.4.9.
Proposition 1.4.4. Let X be a set and (M, 0, +) a commutative monoid. Each
function f : X → M has a unique extension to a homomorphism of monoids
f : (N(X), 0, +) → (M, 0, +) with f ◦ unit = f . The diagram below captures
the situation, where the dashed arrow is used for uniqueness.
X
unit / N(X)
f , homomorphism (1.19)
f
* M
The unit and flatten operations for (natural) multisets can be used to cap-
ture commutative monoids more precisely, in analogy with Propositions 1.2.6
and 1.3.5
Proposition 1.4.5. Let X be an arbitrary set.
X
unit / N(X) N(N(X))
N(α)
/ N(X)
α flat α (1.20)
id % α
X N(X) /X
N(M1 )
N( f )
/ N(M2 )
α1 α2 (1.21)
M1
f
/ M2 .
commutes.
Proof. Analogously to the proof Proposition 1.2.6: if (X, u, +) is a commu-
tative monoid, we define α : N(X) → X by turning formal sums into actual
32
1.4. Multisets 33
1.4.3 Extraction
At the end of the previous section we have seen how to extract a function
(channel) from a binary subset, that is, from a relation. It turns out that one
can do the same for a binary multiset, that is, for a table. More specifically, in
terms of exponents, there are isomorphisms:
Notice that we are — conveniently — mixing ket and function notation for
multisets. Conversely, σ can be reconstructed from extr 1 (σ), and also from
extr 2 (σ), via σ(x, y) = extr 1 (σ)(x)(y) = extr 2 (σ)(y)(x).
Functions of the form X → M(Y) will also be used as channels from X to
Y, see Section 1.8. That’s why we often speak about ‘channel extraction’.
As illustration, we apply extraction to the medicine - blood pressure Ta-
ble 1.18, described as the multiset τ ∈ M(B × M). It gives rise to two channels
extr 1 (τ) : B → M(M) and extr 2 (τ) : M → M(B). Explicitly:
X
extr 1 (τ)(H) = τ(H, x)| x i = 10| 0 i + 35| 1i + 25| 2 i
x∈M
X
extr 1 (τ)(L) = τ(L, x)| x i = 5| 0i + 10| 1 i + 15| 2i.
x∈M
We see that this extracted function captures the two rows of Table 1.18. Simi-
larly we get the columns via the second extracted function:
33
34 Chapter 1. Collections
Exercises
1.4.1 In the setting of Exercise 1.2.1, consider the multisets ϕ = 3| ai +
2| bi + 8| c i and ψ = 3|b i + 1| ci. Compute:
• ϕ+ψ
• ψ+ϕ
• M( f )(ϕ), both in ket-formulation and in function-formulation
• idem for M( f )(ψ)
• M( f )(ϕ + ψ)
• M( f )(ϕ) + M( f )(ψ).
1.4.2 Consider, still in the context of Exercise 1.2.1, the ‘joint’ multiset
ϕ ∈ M(X × Y) given by ϕ = 2|a, ui + 3| a, v i + 5| c, vi. Determine the
marginals M(π1 )(ϕ) ∈ M(X) and M(π2 )(ϕ) ∈ M(Y).
1.4.3 Consider the chemical equation for burning methane:
CH4 + 2 O2 −→ CO2 + 2 H2 O.
34
1.5. Multisets in summations 35
35
36 Chapter 1. Collections
There are several ways to associate a natural number with a multiset ϕ. For
instance, we can look at the size of its support | supp(ϕ) | ∈ N, or at its size, as
total number of elements kϕk = x ϕ(x) ∈ R>0 . This size is a natural number
P
when ϕ is a natural multiset. Below we will introduce several more such num-
multisets ϕ, namely ϕ and ( ϕ ), and later on also a binomial
bers for natural
coefficient ψϕ .
ϕ ≤ ψ ⇐⇒ ∀x ∈ X. ϕ(x) ≤ ψ(x).
ϕ ≤K ψ ⇐⇒ ϕ ∈ N[K](X) and ϕ ≤ ψ.
rϕ(x)
Y
rϕ B x = ~r ϕ .
x∈X
For instance,
5!
(3| R i + 2| Bi) = 3! · 2! = 12 and ( 3|R i + 2| Bi ) = = 10.
12
The multinomial coefficient (ϕ ) in item (5) counts the number of ways of
putting kϕk items in supp(ϕ) = {x1 , . . . , xn } urns, with the restriction that ϕ(xi )
items go into urn xi . Alternatively, ( ϕ ) is the numbers of partitions (Ui ) of a
N| Ui | = ϕ(xi ).
set X with kϕk elements, where
The traditional notation m1 ,...,mk
for multinomial coefficients in (1.13) is
36
1.5. Multisets in summations 37
suboptimal for two reasons: first, the number N is superflous, since it is deter-
mined by the mi as N = i mi ; second, the order of the mi is irrelevant. These
P
disadvantages are resolved by the multiset variant ( ϕ ). It has our preference.
We recall the recurrence relations:
! ! !
K−1 K−1 K
+ ··· + = (1.24)
k1 − 1, . . . , kn k1 , . . . , kn − 1 k1 , . . . , kn
for multinomial coefficients. A snappy re-formulation, for a natural multiset ϕ,
is: X
(ϕ − 1| x i ) = (ϕ ). (1.25)
x∈supp(ϕ)
K + i ni
P ! Y
X 1
· rini = P K+1 .
n , ..., n ≥0
K, n1 , . . . , nm i (1 − i ri )
1 m
37
38 Chapter 1. Collections
Equivalently,
K + kϕk
!
X 1
· ( ϕ ) · ~r ϕ = P K+1 .
ϕ∈N({1,...,m})
K (1 − i ri )
P (n)
Proof. 1 The equation arises as the Taylor series f (x) = n f n!(0) · xn of the
function f (x) = (1−x)1 K+1 . One can show, by induction on n, that the n-th
derivative of f is:
(n + K)! 1
f (n) (x) = · .
K! (1 − x)n+K+1
2 The second equation is a special case of the first one, for K = 0. There is also
a simple direct proof. Define sn = r0 + r1 + · · · + rn . Then sn − r · sn = 1 − rn+1 ,
n+1
so that sn = 1−r 1
1−r . Hence sn → 1−r as n → ∞.
3 We choose to use the first item, but there are other ways to prove this result,
see Exercise 1.5.10.
r X n + 1!
= r · · rn by item (1), with K = 1
(1 − r)2 n≥0
1
X X
= (n + 1) · rn+1 = n · rn .
n≥0 n≥1
4 The trick is to turn the multiple sums into a single ‘leading’ one, in:
K + i ni
X P ! Y
· r ni
n1 , ..., nm ≥0
K, n 1 , . . . , n m i i
K+n
X X ! Y
= · r ni
n≥0 n1 , ..., nm , i ni =n
P K, n 1 , . . . , nm i i
K+n
! ! Y
X X n
= · · r ni
n≥0 n1 , ..., nm , i ni =n
P K n 1 , . . . , nm i i
X K + n! X n
! Y
= · · r ni
n≥0
K n1 , ..., nm , i ni =n
P n 1 , . . . , nm i i
(1.26)
X K + n! P
n
= · i ri
n≥0
K
1
= P K+1 , by item (1).
(1 − i ri )
Exercises
1.5.1 Consider the function f : {a, b, c} → {0, 1} given by f (a) = f (b) = 1
and f (c) = 0.
38
1.5. Multisets in summations 39
ϕ ≤ ψ ⇐⇒ ∃ϕ0 ∈ N(X). ϕ + ϕ0 = ψ.
ϕ ϕ(y)!
X X Q
y
=
(ϕ(x) − 1)! · y,x ϕ(y)!
Q
x∈supp(ϕ)
(ϕ − 1| x i) x∈supp(ϕ)
X ϕ(x)!
=
x∈supp(ϕ)
(ϕ(x) − 1)!
X
= ϕ(x)
x∈supp(ϕ)
= kϕk.
39
40 Chapter 1. Collections
This generalises
the well known sum-formula for binomial coeffi-
cients: 0≤k≤K Kk = 2K , for n = 2.
P
1.5.10 Elaborate the details of the following two (alternative) proofs of the
equation in Theorem 1.5.2 (3).
1 Use the derivate ddr on both sides of Theorem 1.5.2 (2).
2 Write s B n≥1 n · rn and exploit the following recursive equation.
P
40
1.6. Binomial coefficients of multisets 41
For instance:
3| Ri + 2| Bi
! ! !
3! · 2! 3 2
= = 6 = 3·2 = · .
2| Ri + 1| Bi
2! · 1! · 1! · 1! 2 1
This describes the number of possible ways of drawing 2|R i + 1| Bi from an
urn 3|R i + 2| Bi.
The following result is a generalisation of Vandermonde’s formula.
These fractions adding up to one will form the probabilities of the so-called
hypergeometric distributions, see Example 2.1.1 (3) and Section 3.4 later on.
41
42 Chapter 1. Collections
Then:
B+G
! ! !
X B G
= · . (1.29)
K b≤B, g≤G, b+g=K
b g
Intuitively: if you select K children out of B boys and G girls, the number of
options is given by the sum over the options for b ≤ B boys times the options
for g ≤ G girls, with b + g = K.
can be proven by induction on G. When G = 0 both
The equation (1.29)
sides amount to KB so we quickly proceed to the induction step. The case
K = 0 is trivial, so we may assume K > 0.
X B G+1
b · g
b≤B, g≤G+1, b+g=K
B G+1 B G+1 B G+1
= KB · G+1 0 + K−1 · 1 + · · · + K−G · G + K−G−1 · G+1
B G B G B G
= K · 0 + K−1 · 1 + K−1 · 0
B G B G B G
+ · · · + K−G · G + K−G · G−1 + K−G−1 · G
X B G X B G
= b · g + b · g
b≤B, g≤G, b+g=K b≤B, g≤G, b+g=K−1
(IH) B+G
= B+G K + K−1
(1.14) B+G+1
= K .
For the induction step, let supp(ψ) = {x1 , . . . , xn , y}, for n ≥ 2. Writing
` = ψ(y), L0 = L − ` and ψ0 = ψ − `|y i ∈ N[L0 ](X) gives:
ψ ψ(x) ` ψ (xi )
X X Y X X Y 0
ϕ = ϕ(x) x
= n · ϕ(xi ) i
ϕ≤K ψ ϕ≤K ψ n≤` ϕ≤K−n ψ0
(IH)
X ` L−` (1.29) L
= n · K−n = K.
n≤`, K−n≤L−`
L(X)
supp
/ Pfin (X)
B
(1.30)
acc , N(X) supp
The ‘accumulator’ map acc : L(X) → N(X) counts (accumulates) how many
42
1.6. Binomial coefficients of multisets 43
times an element occurs in a list, while ignoring the order of occurrences. Thus,
for a list ` ∈ L(X),
Alternatively,
acc x1 , . . . , xn = 1| x1 i + · · · + 1| xn i.
acc a, b, a, b, c, b, b = 2| ai + 4| bi + 1| c i.
The above diagram (1.30) expresses an earlier informal statement, namely that
multisets are somehow in between lists and subsets.
The starting point in this section is the question: how many (ordered) se-
quences of coloured balls give rise to a specific urn? More technically, given a
natural multiset ϕ, how many lists ` statisfy acc(`) = ϕ? In yet another form,
what is the size | acc −1 (ϕ) | of the inverse image?
As described in the beginning of this section, one aim is to relate the multiset
coefficient (ϕ ) of a multiset ϕ to the number of lists that accumlate to ϕ, as
defined in (1.31). Here we use a K-ary version of accumulation, for K ∈ N,
restricted to K-many elements. It then becomes a mapping:
XK
acc[K]
/ N[K](X). (1.32)
The parameter K will often be omitted when it is clear from the context. We are
now ready for a basic combinatorial / counting result. It involves the multiset
coefficient from Definition 1.5.1 (5).
43
44 Chapter 1. Collections
K
additions is m . Thus:
!
K
acc −1 (ϕ) = · ( ϕ0 )
m
K! (K − m)!
= ·Q
i≤n ϕ (xi )!
m!(K − m)! 0
K!
= Q since m = ϕ(xn+1 ) and ϕ0 (xi ) = ϕ(xi )
i≤n+1 ϕ(xi )!
= (ϕ)
/ M[L·K](X)
flat
M[L] M[K](X) (1.33)
We then find that the sum of these outcomes equals the coefficient of ψ:
( ψ) = 60
Y Y Y
= ( Ψ1 )· (ϕ )Ψ1 (ϕ) + (Ψ2 )· ( ϕ )Ψ2 (ϕ) + ( Ψ3 )· ( ϕ )Ψ3 (ϕ) .
ϕ ϕ ϕ
44
1.6. Binomial coefficients of multisets 45
The general result is formulated in Theorem 1.6.5 below. For its proof we need
an intermediate step.
2 Now:
X
( ψ) = ( ϕ ) · ( ψ − ϕ ).
ϕ≤K ψ
Proof. 1 Because:
(ϕ) · (ψ − ϕ) K! (L − K)! ψ
= · ·
( ψ) ϕ (ψ − ϕ) L!
ψ
K! · (L − K)! ψ ϕ
= · = L .
L! ϕ · (ψ − ϕ)
K
This equation turns out to be essential for proving that multinomial distribu-
tions are suitably closed under composition, see Theorem 3.3.6 in Chapter 3.
Proof. Since ψ ∈ N[L · K](X) we can apply Lemma 1.6.4 (2) L-many times,
giving:
X X X
( ψ) = ··· ( ϕ1 ) · ( ϕ2 ) · . . . · ( ϕ L )
ϕ1 ≤K ψ ϕ2 ≤K ψ−ϕ1 ϕL ≤K ψ−ϕ1 −···−ϕL−1
X Y
= ( ϕi ).
i
ϕ1 , ..., ϕL ≤K ψ
ϕ1 + ··· +ϕL = ψ
45
46 Chapter 1. Collections
The flatten map preserves sums of multisets, see Exercise 1.4.7, and thus maps
Ψ to ψ, via the flatten-unit law of Lemma 1.4.3.
flat Ψ = flat 1|ϕ1 i + · · · + 1| ϕL i
= flat 1| ϕ1 i + · · · + flat 1| ϕL i
Exercises
K
= 2K to:
P
1.6.1 Generalise the familiar equation 0≤k≤K k
X ψ!
= 2kψk .
ϕ≤ψ
ϕ
1.6.2 Show that kacc(`)k = k`k, using the length k`k of a list ` from Exer-
cise 1.2.3.
1.6.3 Let ϕ ∈ N[K](X) and ψ ∈ N[L](X) be given.
1 Prove that:
K+L
K
( ϕ + ψ) = ϕ+ψ · ( ϕ ) · ( ψ).
ϕ
46
1.6. Binomial coefficients of multisets 47
1.6.5 Check that the accumulator map acc : L(X) → M(X) is a homomor-
phism of monoids. Do the same for the support map supp : M(X) →
Pfin (X).
1.6.6 Prove that the accumulator and support maps acc : L(X) → M(X)
and supp : M(X) → Pfin (X) are natural: for an arbitrary function
f : X → Y both rectangles below commute.
L(X)
acc / M(X) supp
/ Pfin (X)
L( f ) M( f ) Pfin ( f )
L(Y) / M(Y) / Pfin (Y)
acc supp
is the L-fold sum of multisets — see (1.33) for the fixed size flatten
operation.
1.6.8 In analogy with the powerset operator, with type P : P(X) → P P(X) ,
a powerbag operator PB : N(X) → N N(X) is introduced in [114]
(see also [35]). It can be defined as:
X ψ!
PB(ψ) B
ϕ .
ϕ≤ψ
ϕ
PB 1| ai + 3| bi
= 1 0 + 1 1| ai + 3 1| b i + 3 1| ai + 1| b i + 3 2|b i
+ 3 1|a i + 2| bi + 1 3| b i + 1 1| ai + 3| bi .
ψ ψ
!
B .
ϕ1 , . . . , ϕ N ϕ1 · . . . · ϕ N
1 Check that for N ≥ 3, in analogy with Exercise 1.3.6,
ψ ψ ψ − ϕ1
! ! !
= ·
ϕ1 , . . . , ϕN ϕ1 ϕ2 , . . . , ϕ N
47
48 Chapter 1. Collections
5 the set N[3]({a, b, c}) of multisets of size 3 over {a, b, c}. It has
3Consider
3 = 3 = 2 = 10 elements, namely:
4·5
48
1.7. Multichoose coefficents 49
B+G
!! !! !!
X B G
= · . (1.36)
K 0≤k≤K
k K−k
In particular:
B+K B+1
! !! !! !!
X B X B
= = = . (1.37)
K K 0≤k≤K
k 0≤k≤K
K−k
n+1 n+1
n
K+1 + K = K+1 . (∗)
We shall prove the first equation (1.36) in the lemma by induction on B ≥1. In
both the base case B = 1 and the induction step we shall use induction on K.
We shall try to keep the structure clear by using nested bullets.
• Now assume Equation (1.36) holds for B (for all G, K). In order to show that
it then also holds for B + 1 we use induction on K.
49
50 Chapter 1. Collections
– Now assume that Equation (1.36) holds for K, and for B. Then:
X
B+1 G
k · (K+1)−k
0≤k≤K+1
X
= G
K+1 + B+1 G
k+1 · K−k
0≤k≤K
(∗) X h B+1 i G
= G
K+1 + B
k+1 + k · K−k
0≤k≤K
X G X
= G
K+1 + B
k+1 · K−k + B+1
k
G
· K−k
(IH, K)
X 0≤k≤K
G
0≤k≤K
(B+1)+G
= B
k · (K+1)−k + K
0≤k≤K+1
(IH, B) B+G (B+1)+G
= K+1 + K
(∗) (B+1)+G
= K+1 .
ψ
ψ
!! !!
X L X ϕ
= so L = 1.
ϕ∈N[K](X)
ϕ K ϕ∈N[K](X) K
The fractions in this equation will show up later in so-called Pólya distri-
butions, see Example 2.1.1 (4) and Section 3.4. These fractions capture the
probability of drawing a multiset ϕ from an urn ψ when for each drawn ball an
extra ball is added to the urn (of the same colour).
50
1.7. Multichoose coefficents 51
n+K−1
!! !
n
N[K](X) = = .
K K
n
Proof. The statement holds for K = 0 since there is precisely 1 = n−1 0 = 0
multiset set of 0, namely the empty multiset 0. Hence we may assume K ≥ 1,
so that Lemma 1.7.2 can be used.
We proceed
K by
induction
on n ≥ 1. For n = 1 the statement holds since there
is only 1 = K = K multiset of size K over 1 = {0}, namely K| 0i.
1
The induction step works as follows. Let X have n elements, and y < X. For
a multiset ϕ ∈ N[K] X ∪ {y} there are K + 1 possible multiplicities ϕ(n). If
ϕ(n) = k, then the number possibilities for ϕ(0), . . . , ϕ(n − 1) is the number of
multisets in N[K −k](X). Thus:
X
N[K] X ∪ {y} = N[K −k](X)
0≤k≤K !!
(IH)
X n
=
0≤k≤K
K−k
n+1
!!
= , by Lemma 1.7.2.
K
There is also a visual proof of this result, described in terms of stars and bars,
see e.g. [47, II, proof of (5.2)], where multiplicities of multisets are described
in terms of ‘occupancy numbers’.
51
52 Chapter 1. Collections
The proof below lends itself to an easy implementation, see Exercise 1.7.12
below. It uses a (chosen) order on the elements of the support of the multiset.
The above formulation shows that the outcome is independent of such an order.
52
1.7. Multichoose coefficents 53
Proof. We use induction on the size | supp(ϕ) | of the support of ϕ. If this size is
1, then ϕ is of the form m| x1 i, for some number m. Hence there are m multisets
strictly below ϕ, namely 0, 1| x1 i, . . . , (m − 1)| x1 i. Since supp(ϕ) = {x1 }, the
only non-empty subset U ⊆ supp(ϕ) is the singleton U = {x1 } and x∈U ϕ(x) =
Q
ϕ(x1 ) = m.
Next assume that supp(ϕ) = X ∪ {y} with y 6 X is of size n + 1. We write
m = ϕ(y) and ϕ0 = ϕ − m| y i so that supp(ϕ0 ) = X. The number M = | ↓= ϕ0 | is
then given by the formula in (1.38), with ϕ0 instead of ϕ. We count the multisets
strictly below ϕ as follows.
(*)
= ↓= ϕ .
Exercises
1.7.1 Check that N[K](1) has one element by Proposition 1.7.4. Describe
this element. How many elements are there in N[K](2)? Describe
them all.
1.7.2 Let ϕ, ψ be natural multiset on the same finite set X.
1 Show that:
ϕ ≺ ψ ⇐⇒ ϕ + 1 ≤ ψ ⇐⇒ ϕ ≤ ψ − 1,
where 1 =
P
x∈X 1| x i is the multiset of singletons on X.
53
54 Chapter 1. Collections
54
1.7. Multichoose coefficents 55
2 Prove next:
n+m
!! X !! !
X m n
+ = .
i<n
i j<m
j n
1.7.10 Check that Theorem 1.5.2 (1) can be reformulated as: for a real num-
ber x ∈ (0, 1) and K ≥ 1,
X K !! 1
· xn =
n≥0
n (1 − x)K
j<m
j (1− s) j<m j m−1
55
56 Chapter 1. Collections
result = 0
for x ∈ [x1 , . . . , xn ] :
result B (ϕ(x) + 1) ∗ result + ϕ(x)
return result
1.8 Channels
The previous sections covered the collection types of lists, subsets, and multi-
sets, with much emphasis on the similarities between them. In this section we
will exploit these similarities in order to introduce the concept of channel, in a
uniform approach, for all of these collection types at the same time. This will
show that these data types are not only used for certain types of collections,
but also for certain types of computation. Much of the rest of this book builds
on the concept of a channel, especially for probabilistic distributions, which
are introduced in the next chapter. The same general approach to channels that
will be described in this section will work for distributions.
Let T be one of the collection functors L, P, or M. A state of type Y is an
element ω ∈ T (Y); it collects a number of elements of Y in a certain way. In this
section we abstract away from the particular type of collection. A channel, or
sometimes more explicitly T -channel, is a collection of states, parameterised
by a set. Thus, a channel is a function of the form c : X → T (Y). Such a
channel turns an element x ∈ X into a certain collection c(x) of elements of
Y. Just like an ordinary function f : X → Y can be seen as a computation, we
see a T -channel as a computation of type T . For instance, a channel X → P(Y)
is a non-deterministic computation and a channel X → M(Y) is a resource-
sensitive computation.
56
1.8. Channels 57
When it is clear from the context what T is, we often write a channel using
functional notation, as c : X → Y, with a circle on the shaft of the arrrow.
Notice that a channel 1 → X from the singleton set 1 = {0} can be identified
with a state on X.
Definition 1.8.1. Let T ∈ {L, P, M}, each of which is functorial, with its own
flatten operation, as described in previous sections.
1 For a state ω ∈ T (X) and a channel c : X → T (Y) we can form a new state
c = ω of type Y. It is defined as:
= [v, u, v, u, u, u, u, u, v].
2 We consider the analogous example for T = P. We thus take as state ω =
{a, b, c} and as channel f : X → P(Y) with:
57
58 Chapter 1. Collections
Then:
[
f = ω = flat P( f )(ω) =
f (a), f (b), f (c)
[
= {u, v}, {u}, {u, v} = {u, v}.
f = ω = flat M( f )(ω)
We now prove some general properties about state transformation and about
composition of channels, based on the abstract description in Definition 1.8.1.
e ◦· (d ◦· c) = (e ◦· d) ◦· c,
58
1.8. Channels 59
d ◦· c = ω = d = c = ω .
f = ω = T ( f )(ω),
g ◦· f = g ◦ f f ◦· c = T ( f ) ◦ c d ◦· f = d ◦ f,
Proof. We can give generic proofs, without knowing the type T ∈ {L, P, M}
of the channel, by using earlier results like Lemma 1.2.5, 1.3.2, and 1.4.3. No-
tice that we carefully distinguish channel composition ◦· and ordinary function
composition ◦.
1 Both equations follow from the flat − unit law. By Definition 1.8.1 (2):
2 The proof of associativity uses naturality and also the commutation of flatten
with itself (the ‘flat − flat law’), expressed as flat ◦ flat = flat ◦ T (flat).
e ◦· (d ◦· c) = flat ◦ T (e) ◦ (d ◦· c)
= flat ◦ T (e) ◦ flat ◦ T (d) ◦ c
= flat ◦ flat ◦ T (T (e)) ◦ T (d) ◦ c by naturality of flat
= flat ◦ T (flat) ◦ T (T (e)) ◦ T (d) ◦ c by the flat − flat law
=
flat ◦ T flat ◦ T (e) ◦ d ◦ c by functoriality of T
=
flat ◦ T e ◦· d ◦ c
= (e ◦· d) ◦· c
59
60 Chapter 1. Collections
= flat ◦ T (d) c = ω
= d = c = ω .
4 All these properties follow from elementary facts that we have seen before:
f = ω = flat ◦ T (unit ◦ f ) (ω)
Exercises
1.8.1 For a function f : X → Y define an inverse image P-channel f −1 : Y →
X by:
f −1 (y) B {x ∈ X | f (x) = y}.
Prove that:
(g ◦ f )−1 = f −1 ◦· g−1 and id −1 = unit.
60
1.9. The role of category theory 61
1.8.2 Notice that a state of type X can be identified with a channel 1 → T (Y)
with singleton set 1 as domain. Check that under this identification,
state transformation c = ω corresponds to channel composition c ◦· ω.
1.8.3 Let f : X → Y be a channel.
1 Prove that if f is a Pfin -channel, then the state transformation func-
tion f = (−) : Pfin (X) → Pfin (Y) can also be defined via freeness,
namely as the unique function f in Proposition 1.3.4.
2 Similarly, show that f = (−) = f when f is an M-channel, as in
Exercise 1.4.9.
1.8.4 1 Describe how (non-deterministic) powerset channels can be reversed,
via a bijective correspondence between functions:
X −→ P(Y)
===========
Y −→ P(X)
(A description of this situation in terms of ‘daggers’ will appear in
Example 7.8.1.)
2 Show that for finite sets X, Y there is a similar correspondence for
multiset channels.
61
62 Chapter 1. Collections
found in e.g. the early notes [111]. This line of work was picked up, extended,
and published by his PhD student Michèle Giry. Her name continues in the
‘Giry monad’ G of continuous probability distributions, see Section ??. The
precise source of the distribution monad D for discrete probability theory, that
will be introduced in Section 2.1 in the next chapter, is less clear, but it can be
regarded as the discrete version of G. Probabilistic automata have been studied
in categorical terms as coalgebras, see Chapter ??, and e.g. [152] and [70] for
general background information on coalgebra. There is a recent surge in inter-
est in more foundational, semantically oriented studies in probability theory,
through the rise of probabilistic programming languages [59, 153], probabilis-
tic Bayesian reasoning [28, 89], and category theory ??. Probabilistic methods
have received wider attention, for instance, via the current interest in data an-
alytics (see the essay [4]), in quantum probability [128, 25], and in cognition
theory [64, 151].
Readers who know category theory will have recognised its implicit use in
earlier sections. For readers who are not familiar (yet) with category theory,
some basic concepts will be explained informally in this section. This is in no
way a serious introduction to the area. The remainder of this book will continue
to make implicit use of category theory, but will make this usage increasingly
explicit. Hence it is useful to know the basic concepts of category, functor,
natural transformation, and monad. Category theory is sometimes seen as a
difficult area to get into. But our experience is that it is easiest to learn category
theory by recognising its concepts in constructions that you already know. That
is why this chapter started with concrete descriptions of various collections and
their use in channels. For more solid expositions of category theory we refer to
the sources listed above.
1.9.1 Categories
A category is a mathematical structure given by a collection of ‘objects’ with
‘morphisms’ between them. The requirements are that these morphisms are
closed under (associative) composition and that there is an identity morphism
on each object. Morphisms are also called ‘maps’, and are written as f : X →
Y, where X, Y are objects and f is a homomorphism from X to Y. It is tempting
to think of morphisms in a category as actual functions, but there are plenty of
examples where this is not the case.
A category is like an abstract context of discourse, giving a setting in which
one is working, with properties of that setting depending on the category at
hand. We shall give a number of examples.
62
1.9. The role of category theory 63
1 There is the category Sets, whose objects are sets and whose morphisms are
ordinary functions between them. This is a standard example.
2 One can also restrict to finite sets as objects, in the category FinSets, with
functions between them. This category is more restrictive, since for instance
it contains objects n = {0, 1, . . . , n − 1} for each n ∈ N, but not N itself. Also,
Q
in Sets one can take arbitrary products i∈I Xi of objects Xi , over arbitrary
index sets I, whereas in FinSets only finite products exist. Hence FinSets is
a more restrictive world.
3 Monoids and monoid maps have been mentioned in Definition 1.2.1. They
can be organised in a category Mon, whose objects are monoids, and whose
homomorphisms are monoid maps. We now have to check that monoid maps
are closed under composition and that identity functions are monoid maps;
this is easy. Many mathematical structures can be organised into categories
in this way, where the morphisms preserve the relevant structure. For in-
stance, one can form a category PoSets, with partially ordered sets (posets)
as objects, and monotone functions between them as morphisms (also closed
under composition, with identity).
4 For T ∈ {L, P, M} we can form the category Chan(T ). Its objects are arbi-
trary sets X, but its morphisms X to Y are T -channels, X → T (Y), written as
X → Y. We have already seen that channels are closed under composition ◦·
and have unit as identity, see Lemma 1.8.3. We can now say that Chan(T )
is a category.
These categories of channels form good examples of the idea that a cate-
gory forms a universe of discourse. For instance, in Chan(P) we are in the
world of non-deterministic computation, whereas Chan(M) is the world of
computation in which resources are counted.
1.9.2 Functors
Category theorists like abstraction, hence the question: if categories are so im-
portant, then why not organise them as objects themselves in a superlarge cat-
63
64 Chapter 1. Collections
egory Cat, with morphisms between them preserving the relevant structure?
The latter morphisms between categories are called ‘functors’. More precisely,
given categories C and D, a functor F : C → D between them consists of two
mappings, both written F, sending an object X in C to an object F(X) in D,
and a morphism f : X → Y in C to a morphism F( f ) : F(X) → F(Y) in D.
This mapping F should preserve composition and identities, as in: F(g ◦ f ) =
F(g) ◦ F( f ) and F(id X ) = id F(X) .
Earlier we have already called some operations ‘functorial’ for the fact that
they preserve composition and identities. We can now be a bit more precise.
64
1.9. The role of category theory 65
in D commutes.
Such a natural transformation is often denoted by a double arrow α : F ⇒ G.
We briefly review some of the examples of natural transformations that we
have seen.
1.9.4 Monads
A monad on a category C is a functor T : C → C that comes with two natural
transformations unit : id ⇒ T and flat : (T ◦ T ) ⇒ T satisfying:
65
66 Chapter 1. Collections
66
1.9. The role of category theory 67
The writer monads from Lemma 1.9.1 give simple examples of maps of
monads: if f : M1 → M2 is a map of monoids, then the maps α B f ×id : M1 ×
X → M2 × X form a map of monoids.
For a historical account of monads and their applications we refer to [66].
Exercises
1.9.1 We have seen the functor J : Sets → Chan(T ). Check that there is
also a functor Chan(T ) → Sets in the opposite direction, which is
X 7→ T (X) on objects, and c 7→ c = (−) on morphisms. Check ex-
plicitly that composition is preserved, and find the earlier result that
stated that fact implicitly.
1.9.2 Recall from (1.15) the subset N[K](X) ⊆ M(X) of natural multisets
with K elements. Prove that N[K] is a functor Sets → Sets.
1.9.3 Show that Exercise 1.8.1 implicitly describes a functor Setsop →
Chan(P), which is the identity on objects.
1.9.4 Show that the zip function from Exercise 1.1.7 is natural: for each
pair of functions f : X → U and g : Y → V the following diagram
commutes.
XK × Y K
zip
/ (X × Y)K
f K ×gK K
( f ×g)
UK × VK
zip
/ (U × V)K
1.9.5 Fill in the remaining details in the proof of Lemma 1.9.1: that T is
a functor, that unit and flat are natural transformation, and that the
flatten equation holds.
1.9.6 For arbitrary sets X, A, write X + A for the disjoint union (coproduct)
of X and A, which may be described explicitly by tagging elements
with 1, 2 in order to distinguish them:
67
68 Chapter 1. Collections
1.9.7 Check that the support and accumulation functions form maps of
monads in the situations:
1 supp : M(X) ⇒ P(X);
2 acc : L(X) ⇒ M(X).
1.9.8 Let T = (T, unit, flat) be a monad. By definition, it involves T as
a functor T : Sets → Sets. Show that T can be ‘lifted’ to a functor
T : Chan(T ) → Chan(T ). It is defined on objects as T (X) B T (X)
and on a morphism f : X → Y as:
T(f)
/ T (T (Y)) flat / T (Y) unit / T (T (Y)) .
T ( f ) B T (X)
68
2
The previous chapter has introduced products, lists, subsets and multisets as
basic collection types and has made some of their basic properties explicit.
This serves as preparation for the current first chapter on probability distribu-
tions. We shall see that distributions also form a collection type, with much
analogous structure. In fact distributions are special multisets, where multi-
plicities add up to one.
This chapter introduces the basics of probability distributions and of prob-
abilistic channels (as indexed / conditional distributions). These notions will
play a central role in the rest of this book. Distributions will be defined as spe-
cial multisets, so that there is a simple inclusion of the set of distributions in
the set of multisets multisets, on a particular space. In the other direction, this
chapter describes the ‘frequentist learning’ construction, which turns a multi-
set into a distribution, essentially by normalisation. The chapter also describes
several constructions on distributions, like the convex sum, parallel product,
and addition of distributions (in the special case when the underlying space
happens to be a commutative monoid). Parallel products ⊗ of distributions and
joint distributions (on product spaces) are rather special, for instance, because
a joint distribution is typically not equal to the product of its marginals. In-
deed, joint distributions may involve correlations between the different (prod-
uct) components, so that updates in one component have ‘crossover’ effect in
other components. This magical phenonenom will be elaborated in later chap-
ters.
The chapter closes with an example of a Bayesian network. It shows that
the conditional probability tables that are associated with nodes in a Bayesian
network are instances of probabilistic channels. As a result, one can system-
atically organise computations in Bayesian networks as suitable (sequential
and/or parallel) compositions of channels. This is illustrated via calculations
of various predicted probabilities.
69
70 Chapter 2. Discrete probability distributions
70
2.1. Probability distributions 71
Figure 2.1 Plots of a slightly biased coin distribution 0.51| H i + 0.49| T i and a
fair (uniform) dice distribution on {1, 2, 3, 4, 5, 6} in the top row, together with the
distribution of letter frequencies in English at the bottom.
71
72 Chapter 2. Discrete probability distributions
remains unaltered (0), and in the Pólya case the drawn ball is returned to the
urn together with an extra ball of the same colour (+1).
Example 2.1.1. We shall explicitly describe several familiar distributions us-
ing the above formal convex sum notation.
1 The coin that we have seen above can be parametrised via a ‘bias’ probabil-
ity r ∈ [0, 1]. The resulting coin is often called flip and is defined as:
flip(r) B r| 1i + (1 − r)|0 i
where 1 may understood as ‘head’ and 0 as ‘tail’. We may thus see flip as a
function flip : [0, 1] → D({0, 1}) from probabilities to distributions over the
sample space 2 = {0, 1} of Booleans. This flip(r) is often called the Bernoulli
distribution, with parameter r ∈ [0, 1].
2 For each number K ∈ N and probability r ∈ [0, 1] there is the familiar
binomial distribution bn[K](r) ∈ D({0, 1, . . . , K}). It captures probabilities
for iterated coin flips, and is given by the convex sum:
X
bn[K](r) B K
k · r k
· (1 − r) k .
K−k
0≤k≤K
72
2.1. Probability distributions 73
where ( ϕ ) is the multinomial coefficient QxK! ϕ(x)! , see Definition 1.5.1 (5)
and where ωϕ B ω(x) ϕ(x)
Q
x . In the sequel we shall standardly use the
multinomial distribution and view the binomial distribution as a special case.
To see an example, for space X = {a, b, c} and urn ω = 13 | ai + 12 | b i + 16 | ci
the draws of size 3 form
a distribution over multisets, described below within
the outer ‘big’ kets − .
mn[3](ω) = 27 3| a i + 6 2| a i + 1|b i + 4 1|a i + 2| bi + 8 3| bi
1 1 1 1
+ 18 1
2| a i + 1| c i + 16 1| a i + 1| bi + 1| ci + 18 2| b i + 1| c i
+ 36 1
1| a i + 2| c i + 241
1| b i + 2|c i + 216
1
3|c i
Via the Multinomial Theorem (1.27) we see that the probabilities in the
above expression (2.2) for mn[K](ω) add up to one:
X X Y (1.26) P K
(ϕ ) · ωϕ = (ϕ) · ω(x)ϕ(x) = x ω(x) = 1K = 1.
x
ϕ∈N[K](X) ϕ∈N[K](X)
N[L](X)
hg[K]
/ D N[K](X). (2.3)
X ψϕ X x ψ(x)
Q
ϕ(x)
hg[K] ψ B L ϕ = L ϕ .
ϕ≤K ψ K ϕ≤K ψ K
73
74 Chapter 2. Discrete probability distributions
4 We have seen that in multinomial mode the drawn ball is returned to the
urn, whereas in hypergeometric mode the drawn ball is removed. There is
a logical third option where the drawn ball is returned, together with one
additional ball of the same colour. This leads to what is called the Pólya
distribution. We shall describe it as a function of the form:
N∗ (X)
pl[K]
/ D N[K](X). (2.4)
The middle and last rows in Figure 2.2 give bar plots for hypergeometric and
Pólya distributions over 2 = {0, 1}. They show the probabilities for numbers
0 ≤ k ≤ 10 in a drawn multiset k| 0 i + (10 − k)| 1i.
So far we have described distributions as formal convex sums. But they can
be described, equivalently, in functional form. This is done in the definition
below, which is much like Definition 1.4.1 for multisets. Also for distributions
we shall freely switch between the above ket-formulation and the function-
formulation.
74
2.1. Probability distributions 75
bn[10]( 13 ) bn[10]( 43 )
hg[10] 10| 0 i + 20| 1 i hg[10] 16| 0 i + 14| 1 i
pl[10] 1| 0 i + 2| 1i pl[10] 8| 0 i + 7| 1 i
Definition 2.1.2. The set D(X) of all distributions over a set X can be defined
as:
ω(x) = 1}.
P
D(X) B {ω : X → [0, 1] | supp(ω) is finite, and x
Such a function ω : X → [0, 1], with finite support and values adding up to
one, is often called a probability mass function, abbreviated as pmf.
This D is functorial: for a function f : X → Y we have D( f ) : D(X) → D(Y)
defined either as:
X
ω(x).
P P
D( f ) i ri | xi i B i ri | f (xi ) i or as: D( f )(ω)(y) B
x∈ f −1 (y)
One has to check that D( f )(ω) is a distribution again, that is, that its multi-
75
76 Chapter 2. Discrete probability distributions
ω= 1
12 | H, 0 i + 61 | H, 1i + 13 | H, 2 i + 16 | T, 0i + 1
12 | T, 1 i + 16 | T, 2 i
= 12 | H i + 6 | H i + 3 | H i + 6 | T i + 12 | T i + 6 | T i
1 1 1 1 1 1
= 12 | H i + 12 | T i.
7 5
is used to indicate the probability that the random variable R takes value
r ∈ R.
Since we now know that D is functorial, we may apply it to the function
R : X → R. It gives another function D(R) : D(X) → D(R), so that D(R)(ω)
is a distribution on R. We observe:
X
P[R = r] = D(R)(ω)(r) = ω(x).
x∈R−1 (r)
76
2.1. Probability distributions 77
D(M)(unif) = 36
1
| 1i + 36
3
|2 i + 5 | 3 i + 36
7
| 4 i + 9 | 5i + 11
36|6 i
36 36
= P[M = 1] 1 + P[M = 2] 2 + P[M = 3] 3
+ P[M = 4] 4 + P[M = 5] 5 + P[M = 6] 6 .
In the previous chapter we have seen that the sets L(X), P(X) and M(X) of
lists, powersets and multisets all carry a monoid structure. Hence one may ex-
pect a similar result saying that D(X) forms a monoid, via an elementwise sum,
like for multisets. But that does not work. Instead, one can take convex sums
of distributions. This works as follows. Suppose we have two distributions
ω, ρ ∈ D(X) and a number s ∈ [0, 1]. Then we can form a new distribution
σ ∈ D(X), as convex combination of ω and ρ, namely:
77
78 Chapter 2. Discrete probability distributions
This does not fit in D(N). Therefore we sometimes use D∞ instead of D, where
the finiteness of support requirement has been dropped:
D∞ (X) B {ω : X → [0, 1] | x ω(x) = 1}.
P
(2.8)
Notice by the way that the multiplicities add up to one in the Poisson distri-
bution because of the basic formula: eλ = k∈N λk! . The Poisson distribution is
P k
typically used for counts of rare events. The rate or intensity parameter λ is the
average number of events per time period. The Poisson distribution then gives
for each k ∈ N the probability of having k events per time period. This works
when events occur independently.
Exercises 2.1.9 and 2.1.10 contain other examples of discrete distributions
with infinite support, namely the geometric and negative-binomial distribution.
Another illustration is the zeta (or zipf) distribution, see e.g. [145].
Exercises
2.1.1 Check that a marginal of a uniform distribution is again a uniform
distribution; more precisely, D(π1 )(unif X×Y ) = unif X .
2.1.2 1 Prove that flip : [0, 1] → D(2) is an isomorphism.
2 Check that flip(r) is the same as bn[1](r).
3 Describe the distribution bn[3]( 41 ) concretely and interepret this
distribution in terms of coin flips.
2.1.3 We often write n B {0, . . . , n − 1} so that 0 = ∅, 1 = {0} and 2 = {0, 1}.
Verify that:
L(0) 1 P(0) 1 N(0) 1 M(0) 1 D(0) 0
L(1) N P(1) 2 N(1) N M(1) R≥0 D(1) 1
P(n) 2n N(n) Nn M(n) Rn≥0 D(2) [0, 1].
The set D(n + 1) is often called the n-simplex. Describe it as a subset
of Rn+1 , and also as a subset of Rn .
2.1.4 Let X = {a, b, c} with draw ϕ = 2| ai + 3| b i + 2| c i ∈ N[7](X).
78
2.1. Probability distributions 79
mn[7](ω)(ϕ) = 432 .
35
2.1.5 Let a number r ∈ [0, 1] and a finite set X be given. Show that:
X
r| U | · (1 − r)| X\U | = 1.
U∈P(X)
2.1.7 Let X be a non-empty finite set with n elements. Use Exercise 1.5.7
to check that, for the following multiset distribution yields a well-
defined distribution on N[K](X).
X (ϕ )
mltdst[K]X B ϕ .
n K
ϕ∈N[K](X)
It captures the probability of being successful for the first time after
79
80 Chapter 2. Discrete probability distributions
Use Theorem 1.5.2 (or Exercise 1.5.9) to show that this forms a distri-
bution. We shall describe negative multinomial distributions in Sec-
tion ??.
2.1.11 Prove that a distribution ω ∈ D∞ (X) necessarily has countable sup-
port.
Hint: Use that each set {x ∈ X | ω(x) > 1n }, for n > 0, can have only
finitely many elements.
2.1.12 Let ω ∈ D(X) be an arbitrary distribution on a set X. We extend it to a
distribution ω? on the set L(X) of lists of elements from X. We define
the function ω? : L(X) → [0, 1] by:
ω(x1 ) · . . . · ω(xn )
ω? [x1 , . . . , xn ] B .
2n+1
(−)?
D(X) / D∞ L(X)
80
2.2. Probabilistic channels 81
i ri · ωi (x) = x ωi (x) = i ri · 1 = 1.
P P P P P
x i ri ·
The familiar properties of unit and flatten hold for distributions too: the ana-
logue of Lemma 1.4.3 holds, with ‘multiset’ replaced by ‘distribution’. We
conclude that D, with these unit and flat, is a monad. The same holds for the
‘infinite’ variation D∞ from Remark 2.1.4.
A probabilistic channel c : X → Y is a function c : X → D(Y). For a state /
distribution ω ∈ D(X) we can do state transformation and produce a new state
81
82 Chapter 2. Discrete probability distributions
c = ω ∈ D(Y). This happens via the standard definition for = from Section 1.8,
see especially (1.39) and (1.40):
XX
c = ω B flat D(c)(ω) = c(x)(y) · ω(x) y . (2.9)
y∈Y x∈X
c / D(Y) D(d)
/ D(D(Z)) flat / D(Z) .
d ◦· c B X
[0, 1]
flip
◦ / {0, 1} and [0, 1]
bn[K]
◦ / {0, 1, . . . , K}.
◦ / ◦ / ◦ /
mn[K] hg[K] pl[K]
D(X) N[K](X) N[L](X) N[K](X). N∗ (X) N[K](X).
82
2.2. Probabilistic channels 83
f = ω =
flat D( f )(ω)
= flat 61 | f (a) i + 12 | f (b) i + 31 | f (c) i
= flat 16 12 | u i + 21 | vi + 12 1| u i + 13 34 | ui + 14 |v i
= 12 |u i + 12 | v i + 2 | u i + 4 | u i + 12 | vi
1 1 1 1 1
= 6 | ui + 6 | v i.
5 1
f = ω = 6 · 2 | ui + 2 | vi
1 1 1
+ 1
2 · 1| ui + 1
3 · 3
4|ui + 14 | v i
= 6 | ui + 6 | vi.
5 1
This ‘mixture’ terminology is often used in a clustering context where the ele-
ments a, b, c are the components.
Example 2.2.2 (Copied from [79]). Let’s assume we wish to capture the mood
of a teacher, as a probabilistic mixture of three possible options namely: pes-
simistic (p), neutral (n) or optimistic (o). We thus have a three-element proba-
bility space X = {p, n, o}. We assume a mood distribution:
σ = 81 | p i + 38 | n i + 21 | o i with plot
83
84 Chapter 2. Discrete probability distributions
{p, n, o}.
c(p)
= 501
| 1 i + 50
2
| 2 i + 50
10
| 3i + 15 50 |4 i + 50 | 5 i
10
+ 50 6
| 6i + 503
| 7i + 50 1
| 8 i + 501
| 9 i + 50
1
| 10 i
pessimistic mood marks
c(n)
= 501
| 1 i + 50
2
| 2 i + 50
4
| 3i + 10
50 |4 i + 50 | 5 i
15
+ 50 | 6i + 50 | 7i + 50 | 8 i + 50 | 9 i + 50
10 5 1 1 1
| 10 i
neutral mood marks
c(o)
= 501
| 1 i + 50
1
| 2 i + 50
1
| 3i + 50
2
|4 i + 50
4
|5i
+ 50 | 6i + 50 | 7i + 50 | 8 i + 50 | 9 i + 50
10 15 10 4 2
| 10 i.
optimistic mood marks
Now that the state σ ∈ D(X) and the channel c : X → Y have been described,
we can form the transformed state c = σ ∈ D(Y). Following the formula-
tion (2.9) we get for each mark i ∈ Y the probability:
This example will return in later chapters. There, the teacher will be confronted
with the marks that the pupils actually obtain. This will lead the teacher to an
84
2.2. Probabilistic channels 85
In the next example we show how products, multisets and distributions come
together in an elementary combinatorial situation.
This is a sum over (ϕ )-many lists ~x with acc(~x) = ϕ. The size K of the multiset
involved has disappeared from this formulation, so that we can view arr simply
as a channel arr : N(X) → L(X).
instance, for X = {a, b} with multiset ϕ = 3| a i + 1| bi there are (ϕ ) =
4 For
3,1 = 3!·1! = 4 arrangements of ϕ, namely [a, a, a, b], [a, a, b, a], [a, b, a, a],
4!
Our next question is: how are accumulator acc and arrangement arr related?
85
86 Chapter 2. Discrete probability distributions
N(X)
arr
◦ / L(X)
◦ ◦ acc (2.11)
unit
N(X)
It combines a probabilistic channel arr : N(X) → D(L(X)) with an ordinary
function acc : L(X) → N(X), promoted to a deterministic channel acc. For
the channel composition ◦· we make use of Lemma 1.8.3 (4):
acc ◦· arr (ϕ) = D(acc) arr(ϕ)
X 1
= D(acc) ~x
(ϕ )
~x∈acc −1 (ϕ)
X 1
=
acc(~x)
(ϕ )
~x∈acc −1 (ϕ)
X 1 1
= ϕ = (ϕ) · ϕ = 1 ϕ = unit(ϕ).
(ϕ ) (ϕ)
~x∈acc (ϕ)
−1
The above triangle (2.11) captures a very basic relationship between sequences,
multisets and distributions, via the notion of (probabilistic) channel.
In the other direction, arr(acc(~x)) does not return ~x. It yields a uniform dis-
tribution over all permutations of the sequence ~x ∈ X K .
Deleting and adding an element to a natural multiset are basic operations
that are naturally described as channels.
Definition 2.2.4. For a set X and a number K ∈ N we define a draw-delete
channel DD and also a draw-add channel DA in a situation:
DD
r ◦
N[K](X)
◦
2 N[K +1](X)
DA
86
2.2. Probabilistic channels 87
in the urn ψ. This drawn element x is subsequently removed from the urn via
subtraction ψ − 1| x i.
In the draw-add case DA one also draws single elements x from the urn χ,
but instead of deleting x one adds an extra copy of x, via the sum χ + 1| x i. This
is typical for Pólya style urns, see [115] or Section 3.4 for more information.
These channels are not each other’s inverses. For instance:
DD 3|a i + 1| bi = 43 2| a i + 1| b i + 14 3| a i
DA 2|a i + 1| bi = 2 3| a i + 1| b i + 1 2| a i + 2|b i .
3 3
DA ◦· DD 3| ai + 1| b i
= DA = 34 2| a i + 1| b i + 41 3| a i
= 34 23 3| a i + 1| bi + 31 2| ai + 2| bi + 14 1 4| ai
= 12 3| ai + 1| b i + 14 2| a i + 2| b i + 14 4| a i
DD ◦· DA 2| ai + 1| bi
= DD = 23 3| a i + 1|b i + 31 2| ai + 2| bi
= 23 34 2| a i + 1| bi + 41 3| ai + 13 12 1| a i + 2| b i + 12 2|a i + 1| bi
= 21 2| ai + 1| b i + 16 3| a i + 16 1| a i + 2| b i + 16 2|a i + 1| bi .
DD K
r ◦
N[L](X) 2 N[L+K](X)
◦
DA K
87
88 Chapter 2. Discrete probability distributions
ψ χ
ϕ ϕ
X X
DD K (ψ) = ψ − ϕ DA K (χ) = L χ + ϕ .
L+K
ϕ≤K ψ K ϕ≤K χ K
Proof. For both equations we use induction on K ∈ N. In both cases the only
option for ϕ in N[0](X) is the empty multiset 0, so that DD 0 (ψ) = 1| ψi and
DA 0 (χ) = 1|χ i.
For the induction steps we make extensive use of the equations in Exer-
cise 1.7.6. In those cases we shall put ‘E’ above the equation. We start with the
induction step for draw-delete. For ψ ∈ N[L+K + 1](X),
We reason basically in the same way for draw-add, and now also use Exer-
88
2.2. Probabilistic channels 89
Exercises
2.2.1 Consider the D-channel f : {a, b, c} → {u, v} from Example 2.2.1,
together with a new D-channel g : : {u, v} → {1, 2, 3, 4} given by:
g(u) = 14 |1 i + 18 | 2 i + 21 | 3 i + 18 | 4 i g(v) = 14 | 1i + 18 | 3i + 58 |3 i.
2.2.4 Identify the channel f and the state ω in Example 2.2.1 with matrices:
1
1
1 3
6
M f = 21 =
4
1
1 M ω 2
2 0 4
1
3
Such matrices are called stochastic, since each of their columns has
non-negative entries that add up to one.
Check that the matrix associated with the transformed state f = ω
is the matrix-column multiplication M f · Mω .
(A general description appears in Remark 4.3.5.)
89
90 Chapter 2. Discrete probability distributions
N[K +1](X)
DD
◦ / N[K](X)
arr ◦ ◦ arr
K+1 π / XX
X ◦
2.2.7 Consider the arrangement channels arr : N(X) → L(X) from Exam-
ple 2.2.3. Prove that arr is natural: for each function f : X → Y the
following diagram commutes.
N(X)
arr / D L(X))
N( f ) D(L( f ))
N(Y)
arr / D L(Y)
(This is far from easy; you may wish to check first what this means in
a simple case and then content yourself with a ‘proof by example’.)
2.2.8 Show that the draw-and-delete and draw-and-add maps DD and DA
in Definition 2.2.4 are natural, from N[K+1] to D ◦ N[K], and from
N[K] to D ◦ N[K +1].
2.2.9 Recall from (1.34) that the set N[K](X)
of K-sized natural multisets
over a set X with n has contains Kn members. We write unif K ∈
D N[K](X) for the uniform distribution over such multisets.
1 Check that:
X 1
unif K = n ϕ .
ϕ∈N[K](n) K
90
2.2. Probabilistic channels 91
2.2.10 Let X be a finite set, say with n elements, of the form X = {x1 , . . . , xn }.
Define for K ≥ 1,
X
σK B 1
∈ D N[K](X) .
n K| xi i
1≤i≤n
Thus:
σ1 = 1n 1| x1 i + · · · + 1
n 1| xn i
σ = 1 2| x i + · · · + n 2| xn i ,
1
2 n 1 etc.
Show that these σK form a cone, both for draw-delete and for draw-
add:
DD = σK+1 = σK and DA = σK = σK+1 .
Thus, the whole sequence σK K∈N>0 can be generated from σ1 =
unifN[1](X) by repeated application of DA.
2.2.11 Recall the bijective correspondences from Exercise 1.8.4.
1 Let X, Y be finite sets and c : X → D(Y) be a D-channel. We
can then define an M-channel c† : Y → M(X) by swapping ar-
guments: c† (y)(x) = c(x)(y). We call c a ‘bi-channel’ if c† is also a
D-channel, i.e. if x c(x)(y) = 1 for each y ∈ Y.
P
Prove that the identity channel is a bi-channel and that bi-channels
are closed under composition.
2 Take A = {a0 , a1 } and B = {b0 , b1 } and define a channel bell : A ×
2 → D(B × 2) as:
bell(a0 , 0) = 21 | b0 , 0i + 8 |b1 , 0 i + 8 | b1 , 1 i
3 1
bell(a0 , 1) = 2 | b0 , 1i + 8 |b1 , 0 i + 8 | b1 , 1 i
1 1 3
bell(a1 , 0) = 83 | b0 , 0i + 18 | b0 , 1 i + 18 | b1 , 0 i + 38 | b1 , 1 i
bell(a1 , 1) = 81 | b0 , 0i + 38 | b0 , 1 i + 38 | b1 , 0 i + 18 | b1 , 1 i
Check that bell is a bi-channel.
(It captures the famous Bell table from quantum theory; we have
deliberately used open spaces in the above description of the chan-
nel bell so that non-zero entries align, giving a ‘bi-stochastic’ ma-
trix, from which one can read bell † vertically.)
2.2.12 Check that the inclusions D(X) ,→ M(X) form a map of monads, as
described in Definition 1.9.2.
2.2.13 Recall from Exercise 1.9.6 that for a fixed set A the mapping X 7→
X + A is a monad. Prove that X 7→ D(X + A) is also a monad.
(The latter monad will be used in Chapter ?? to describe Markov
models with outputs. A composition of two monads is not necessarily
91
92 Chapter 2. Discrete probability distributions
The normalisation step forces the sum on the right-hand side to be a convex
sum, with factors adding up to one. Clearly, from an empty multiset we cannot
learn a distribution — technically because the above sum s is then zero so that
we cannot divide by s.
Using scalar multiplication from Lemma 1.4.2 (2) we can define the Flrn
92
2.3. Frequentist learning: from multisets to distributions 93
Flrn(ϕ) = 20
20+30 | H i + 30
20+30 | T i = 25 | H i + 53 |T i.
Thus, the bias (twowards head) is 25 . In this simple case we could have ob-
tained this bias immediately from the data, but the Flrn map captures the
general mechanism.
Notice that with frequentist learning, more (or less) consistent data gives
the same outcome. For instance if we knew that 40 out of 100 tosses were
head, or 2 out of 5, we would still get the same bias. Intuitively, these data
give more (or less) confidence about the data. These aspects are not cov-
ered by frequentist learning, but by a more sophisticated form of ‘Bayesian’
learning. Another disadvantage of the rather primitive form of frequentist
learning is that prior knowledge, if any, about the bias is not taken into ac-
count.
2 Recall the medical table (1.18) captured by the multiset τ ∈ N(B × M).
Learning from τ yields the following joint distribution:
Flrn(τ) = 0.1| H, 0 i + 0.35| H, 1 i + 0.25| H, 2 i
+ 0.05| L, 0 i + 0.1| L, 1 i + 0.15| L, 2 i.
Such a distribution, directly derived from a table, is sometimes called an
empirical distribution [33].
In the above coin example we saw a property that is typical of frequentist
learning, namely that learning from more of the same does not have any effect.
We can make this precise via the equation:
93
94 Chapter 2. Discrete probability distributions
Lemma 2.3.2. The frequentist learning maps Flrn : M∗ (X) → D(X) from (2.13)
are natural in X. This means that for each function f : X → Y the following
diagram commutes.
M∗ (X)
Flrn / D(X)
M∗ ( f ) D( f )
M∗ (Y) / D(Y)
Flrn
= i s | f (xi ) i
P ri
= D( f ) i rsi | xi i
P
= D( f ) ◦ Flrn (ϕ).
We can apply this basic result to the medical data in Table (1.18), via the
multiset τ ∈ N(B × M). We have already seen in Section 1.4 that the multiset-
marginals N(πi )(τ) produce the marginal columns and rows, with their totals.
We can learn the distributions from the colums as:
= 0.7| H i + 0.3| L i.
94
2.3. Frequentist learning: from multisets to distributions 95
of, at an intuitive level, but maybe not in the mathematically precise form of
Lemma 2.3.2.
Drawing an object from an urn is an elementary operation in probability
theory, which involves frequentist learning Flrn. For instance, the draw-and-
delete and draw-and-add operations DD and DA from Definition 2.2.4 can be
described, for urns ψ and χ with kψk = K + 1 and kχk = K, as:
X ψ(x) X
DD(ψ) = ψ − 1| x i = Flrn(ψ)(x) ψ − 1| x i
x∈supp(ψ)
K + 1 x∈supp(ψ)
X χ(x) X
DA(χ) = χ + 1| x i = Flrn(χ)(x) χ + 1| x i .
x∈supp(χ)
K x∈supp(χ)
Since this drawing takes the multiplicities into account, the urns after the
DD and DA draws have the same frequentist distribution as before, but only
if we interpret “after” as channel composition ◦·. This is the content of the
following basic result.
◦ ◦ ◦ ◦
Flrn +Xr Flrn Flrn +Xr Flrn
95
96 Chapter 2. Discrete probability distributions
First taking (multiset) multiplication, and then doing frequentist learning gives:
Flrn flat(Φ) = Flrn 4| ai + 2| b i + 6| c i = 31 | a i + 16 | bi + 12 | ci.
However, first (outer en inner) learning and then doing (distribution) multipli-
cation yields:
flat Flrn M(Flrn)(Φ) = 13 13 | a i + 23 | c i + 32 13 | ai + 13 |b i + 13 | c i
= 13 | a i + 29 | b i + 49 | c i.
Exercises
2.3.1 Recall the data / multisets about child ages and blood types in the
beginning of Subsection 1.4.1. Compute the associated (empirical)
distributions.
Plot these distributions as a graph. How do they compare to the
plots (1.16) and (1.17)?
96
2.3. Frequentist learning: from multisets to distributions 97
2.3.2 Check that frequentist learning from a constant multiset yields a uni-
form distribution. And also that frequentist learning is invariant under
(non-zero) scalar multiplication for multisets: Flrn(s · ϕ) = Flrn(ϕ)
for s ∈ R>0 .
2.3.3 1 Prove that for multisets ϕ, ψ ∈ M∗ (X) one has:
kϕk kψk
Flrn ϕ + ψ = · Flrn(ϕ) +
· Flrn(ψ).
kϕk + kψk kϕk + kψk
This means that when one has already learned Flrn(ϕ) and new
data ψ arrives, all probabilities have to be adjusted, as in the above
convex sum of distributions.
2 Check that the following formulation for natural multisets of fixed
sizes K > 0, L > 0 is a special case of the previous item.
+ / N[K +L](X)
N[K](X) × N[L](X)
Flrn×Flrn
K L
K+L (−)+ K+L (−)
Flrn
D(X) × D(X) / D(X)
L∗ (X)
supp
/ Pfin (X)
F
(2.16)
acc * M (X) / D(X) supp
∗
Flrn
2.3.6 Let X be a finite set and K ∈ N be an arbitrary number. Show that for
σ ∈ D N[K +1](X) and ϕ ∈ N[K](X) one has:
DD = σ (ϕ) X σ(ϕ+1| x i)
= .
(ϕ) x∈X
( ϕ+1| x i)
2.3.7 Recall the uniform distributions unif K ∈ D N[K](X) from Exer-
cise 2.2.9, where the set X is finite. Use Lemma 1.7.5 to prove that
Flrn = unif K = unif X ∈ D(X).
97
98 Chapter 2. Discrete probability distributions
Definition 2.4.1. Tensors, also called parallel products, will be defined first for
states, for each collection type separately, and then for channels, in a uniform
manner.
98
2.4. Parallel products 99
The right-hand side uses the tensor product of the appropriate type T , as
defined in the previous item.
We shall use tensors not only in binary form ⊗, but also in n-ary form ⊗ · · · ⊗,
both for states and for channels.
We see that tensors ⊗ involve the products × of the underlying domains. A
simple illustration of a (probabilistic) tensor product is:
6 | ui + 3 | vi + 2 |w i ⊗ 4 | 0 i + 4 | 1 i
1 1 1 3 1
= 18 | u, 0 i + 24
1
| u, 1 i + 14 | v, 0i + 12
1
|v, 1 i + 38 | w, 0i + 18 | w, 1i.
These tensor products tend to grow really quickly in size, since the number of
entries of the two parts have to be multiplied.
Parallel products are well-behaved, as described below.
Lemma 2.4.2. 1 Transformation of parallel states via parallel channels is the
parallel product of the individual transformations:
(c ⊗ d) = (ω ⊗ ρ) = (c = ω) ⊗ (d = ρ).
2 Parallel products of channels interact nicely with unit and composition:
unit ⊗ unit = unit (c1 ⊗ d1 ) ◦· (c2 ⊗ d2 ) = (c1 ◦· c2 ) ⊗ (d1 ◦· d2 ).
3 The tensor of trivial, deterministic channels is obtained from their product:
f ⊗ g = f × g
99
100 Chapter 2. Discrete probability distributions
As promised we look into why parallel products don’t work for lists.
Remark 2.4.3. Suppose we have two list [a, b, c] and [u, v] and we wish to
form their parallel product. Then there are many ways to do so. For instance,
two obvious choices are:
[ha, ui, ha, vi, hb, ui, hb, vi, hc, ui, hc, vi]
[ha, ui, hb, ui, hc, ui, ha, vi, hb, vi, hc, vi]
There are many other possibilities. The problem is that there is no canonical
choice. Since the order of elements in a list matter, there is no commutativity
property which makes all options equivalent, like for multisets. Technically,
the tensor for L does not exist because L is not a commutative (i.e. monoidal)
monad; this is an early result in category theory going back to [102].
X
split
/ X×X f ⊗g
/ Y ×Y join
/Y (2.17)
The split and join operations depend on the situation. In the next result they
are copy and sum.
Lemma 2.4.4. Recall the flip (or Bernoulli) distribution flip(r) = r|1 i + (1 −
r)| 0 i and consider it as a channel flip : [0, 1] → N. The convolution of two
such flip’s is then a binomial distribution, as described in the following dia-
gram of channels.
∆ / [0, 1] × [0, 1] flip⊗flip
/ N×N
[0, 1] ◦ ◦
◦+
◦
bn[2] /N
100
2.4. Parallel products 101
The structure used in this result is worth making explicit. We have intro-
duced probability distributions on sets, as sample spaces. It turns out that if
this underlying set, say M, happens to be a commutative monoid, then D(M)
also forms a commutative monoid. This makes for instance the set D(N) of
distributions on the natural numbers a commutative monoid — even in two
ways, using either the additive or multiplicative monoid structure on N. The
construction occurs in [99, p.82], but does not seem to be widely used and/or
familiar. It is an instance of an abstract form of convolution in [104, §10].
Since it plays an important role here we make it explicit. It can be described
via parallel products ⊗ of distributions.
CMon
D / CMon
(2.19)
Sets
D / Sets
The vertical arrows ‘forget’ the monoid structure, by sending a monoid to its
underlying set.
101
102 Chapter 2. Discrete probability distributions
We can also express this via commutation of the following diagrams of chan-
nels.
+ / R≥0 0 / R≥0
R≥0 × R≥0 ◦ 1 ◦
102
2.4. Parallel products 103
R≥0 and k ∈ N.
(2.18)
pois[λ1 ] + pois[λ2 ] (k) = D(+) pois[λ1 ] ⊗ pois[λ2 ] (k)
X
= pois[λ1 ] ⊗ pois[λ2 ] (k1 , k2 )
k1 ,k2 ,k1 +k2 =k
X
= pois[λ1 ](m) · pois[λ2 ](k − m)
0≤m≤k
λm λk−m
!
(2.7)
X
= e−λ1 · 1 · e−λ2 · 2
0≤m≤k
m! (k − m)!
X e−(λ1 +λ2 ) k!
= · · λm
1 · λ2
k−m
0≤m≤k
k! m!(k − m)!
e−(λ1 +λ2 ) X k
!
= · · λm1 · λ2
k−m
k! 0≤m≤k
m
(1.26) e−(λ1 +λ2 )
= · (λ1 + λ2 )k
k!
(2.7)
= pois[λ1 + λ2 ](k).
k
Finally, in the expression pois[0] = k e0 · 0k! | k i everything vanishes except
P
for k = 0, since only 00 = 1. Hence pois[0] = 1| 0 i.
The next illustration shows that tensors for different collection types are
related.
This says that support and frequentist learning are ‘monoidal’ natural trans-
formations.
Proof. We shall do the Flrn-case and leave supp as exercise. We first note that:
ϕ ⊗ ψ
= z (ϕ ⊗ ψ)(z) = x,y (ϕ ⊗ ψ)(x, y)
P P
Then:
(ϕ ⊗ ψ)(x, y) ϕ(x) ψ(y)
Flrn ϕ ⊗ ψ (x, y) = =
·
kϕ ⊗ ψk kϕk kψk
= Flrn(ϕ)(x) · Flrn(ψ)(y)
= Flrn(ϕ) ⊗ Flrn(ψ) (x, y).
103
104 Chapter 2. Discrete probability distributions
We can form a K-ary product X K for a set X. But also for a distribution
ω ∈ D(X) we can form ωK = ω ⊗ · · · ⊗ ω ∈ D(X K ). This lies at the heart
of the notion of ‘independent and identical distributions’. We define a separate
function for this construction:
iid [K]
/ D(X K ) iid [K](ω) = ω ··· ⊗ ω
D(X) with | ⊗ {z }. (2.20)
K times
We can describe iid [K] diagrammatically as composite, making the copying
of states explicit:
D(X)
iid [K]
/ D(X K )
[K](ω1 , . . . , ωK )
N
B
where (2.21)
, D(X)K B ω1 ⊗ · · · ⊗ ω K .
∆
N
[K]
We often omit the parameter K when it is clear from the context. This map iid
will pop-up occasionally in the sequel. At this stage we only mention a few of
its properties, in combination with zipping.
Lemma 2.4.8. Consider the iid maps from (2.20), and also the zip function
from Exercise 1.1.7.
1 These iid maps are natural, which we shall describe in two ways. For a
function f : X → Y and for a channel c : X → Y the following two diagrams
commutes.
D(X)
iid / D(X K ) D(X)
iid
◦ / XK
D( f ) K c = (−) ◦ cK = c ⊗ ··· ⊗ c
D( f )
D(Y)
iid / D(Y K ) D(Y)
iid
◦ / YK
D(X) × D(Y)
iid ⊗iid
◦ / XK × Y K
◦ zip
⊗
D(X × Y)
iid
◦ / (X × Y)K
104
2.4. Parallel products 105
cK ◦· iid (ω) = (c ⊗ · · · ⊗ c) = (ω ⊗ · · · ⊗ ω)
= (c = ω) ⊗ · · · ⊗ (c = ω)
= iid (c = ω)
= iid ◦ (c = (−)) (ω).
[(x1 , y1 ), . . . , (xK , yK )]
= D(zip) ω1 ⊗ · · · ⊗ ωK ⊗ ρ1 ⊗ · · · ρK [(x1 , y1 ), . . . , (xK , yK )]
= ω1 ⊗ · · · ⊗ ωK [x1 , . . . , xK ] · ρ1 ⊗ · · · ⊗ ρK [y1 , . . . , yK ]
[(x1 , y1 ), . . . , (xK , yK )] .
Exercises
2.4.1 Prove the equation for supp in Proposition 2.4.7.
2.4.2 Show that tensoring of multisets is linear, in the sense that for σ ∈
M(X) the operation ‘tensor with σ’ σ ⊗ (−) : M(Y) → M(X × Y) is
105
106 Chapter 2. Discrete probability distributions
The same hold in the other coordinate, for (−)⊗τ. As a special case we
obtain that when σ is a probability distribution, then σ⊗(−) preserves
convex sums of distributions.
2.4.3 Extend Lemma 2.4.4 in the following two ways.
1 Show that the K-fold convolution (2.17) of flip’s is a binomial of
size K, as in:
∆K flip K
[0, 1] ◦ / [0, 1]K ◦ / NK
◦+
◦
bn[K] .N
ω + ρ = D(+) ω ⊗ ρ ω ? ρ = D(·) ω ⊗ ρ .
and
Consider the following three distributions on N.
ρ1 = 12 | 0i + 13 | 1i + 16 |2 i ρ2 = 12 | 0 i + 12 | 1 i ω = 32 | 0i + 13 | 1i.
Show consecutively:
1 ρ1 + ρ2 = 14 | 0 i + 125
| 1i + 14 | 2i + 12
1
| 3 i;
2 ω ? (ρ1 + ρ2 ) = 4 | 0i + 36 | 1i + 12 | 2 i + 36
3 5 1 1
| 3 i;
3 ω ? ρ1 = 6 | 0i + 9 | 1i + 18 | 2 i;
5 1 1
4 ω ? ρ2 = 65 | 0i + 16 | 1i;
5 (ω ? ρ1 ) + (ω ? ρ2 ) = 36 25
| 0i + 108
25
| 1 i + 108
7
|2i + 1
108 | 3i.
106
2.4. Parallel products 107
Show that the tensor σ⊗τ can be reconstructed from these two strength
maps via the following diagram:
D(D(X × Y))
flat / D(X × Y) o flat
D(D(X × Y))
noise( f, c) B f + c.
107
108 Chapter 2. Discrete probability distributions
2 Show also that it can be described abstractly via strength (from the
previous exercise) as composite:
D D(X)
iid
◦ / D(X)K
= iid (Ω) = iid flat(Ω) .
N
that is
N
◦
flat
D(X)
iid
◦ / XK
N
2.4.10 Show that the big tensor : D(X)K → D(X K ) from (2.21) com-
mutes with unit and flatten, as described below.
N N
D( )
X K
D (X)2 K / D D(X)K / D2 (X K )
unit
K
unit flat K flat
N N
D(X)K / D(X K ) D(X)K / D(X K )
Abstractly this shows that the functor (−)K distributes over the monad
D.
2.4.11 Check that Lemma 2.4.2 (4) can be read as: the tensor ⊗ of distribu-
tions, considered as a function ⊗ : D(X) × D(Y) → D(X × Y), is a
natural transformation from the composite functor:
Sets × Sets
D×D / Sets × Sets × / Sets
to the functor
Sets × Sets
× / Sets D / Sets.
M×M
unit×unit / D(M) × D(M)
+ +
M
unit / D(M)
108
2.5. Projecting and copying 109
The sum + on the on the left-hand side is the one in D(M), from
the beginning of this exercise. The sum + on the right-hand side is
the one in D(D(M)), using that D(M) is a commutative monoid —
and thus D(D(M)) too.
3 Check the functor D : CMon → CMon in (2.19) is also a monad.
πi = ω (y)
(2.23)
X X
= ω x1 , . . . , xi−1 , y, xi+1 , . . . , xn .
x1 ∈X1 ,..., xi−1 ∈Xi−1 xi+1 ∈Xi+1 ,..., xn ∈Xn
Below we shall introduce special notation for such marginalisation, but first
we look at some of its properties.
The next result only works in the probabilistic case, for D. The exercises
below will provide counterexamples for P and M.
109
110 Chapter 2. Discrete probability distributions
πi = ω1 ⊗ · · · ⊗ ωn = D(πi ) ω1 ⊗ · · · ⊗ ωn = ωi .
X1 × · · · × Xn
c1 ⊗ ··· ⊗ cn
◦ / Y1 × · · · × Yn
πi ◦ ◦ πi
Xi ◦ / Yi
ci
Proof. We only do item (1), since item (2) then follows easily, using that
parallel products of channels are defined pointwise, see Definition 2.4.1 (2).
The first equation in item (1) follows from Lemma 1.8.3 (4), which yields
πi = (−) = D(πi )(−). We restrict to the special case where n = 2 and i = 1.
Then:
(2.23) P
D(π1 ) ω1 ⊗ ω2 )(x) = y ω1 ⊗ ω2 )(x, y)
= y ω1 (x) · ω2 (y)
P
The last line of this proof relies on the fact that probabilistic states (distri-
butions) involve a convex sum, with multiplicities adding up to one. This does
not work for subsets or multisets, see Exercise 2.5.1 below.
We introduce special, post-fix notation for marginalisation via ‘masks’. It
corresponds to the idea of listing only the relevant variables in traditional no-
tation, where a distribution on a product set is often written as ω(x, y) and its
first marginal as ω(x).
Definition 2.5.2. A mask M is a finite list of 0’s and 1’s, that is an element
M ∈ L({0, 1}). For a state ω of type T ∈ {P, M, D} on X1 × · · · × Xn and a mask
M of length n we write:
ωM
for the marginal with mask M. Informally, it keeps all the parts from ω at a
position where there is 1 in M and it projects away parts where there is 0. This
is best illustrated via an example:
= hπ1 , π3 , π5 i = ω.
110
2.5. Projecting and copying 111
ω 1, 0 = 83 | ui + 58 |v i ω 0, 1 = 12 | a i + 12 | b i.
and
The original state ω differs from the product of its marginals:
ω 1, 0 ⊗ ω 0, 1 = 16 3
| u, a i + 16
3
| u, b i + 16
5
| v, a i + 16
5
| v, bi.
This entwinedness follows from a general characterisation, see Exercise 2.5.9
below.
111
112 Chapter 2. Discrete probability distributions
2.5.2 Copying
For an arbitrary set X there is a K-ary copy function ∆[K] : X → X K = X ×
· · · × X (K times), given as:
We often omit the subscript K, when it is clear from the context, especially
when K = 2. This copy function can be turned into a copy channel ∆[K] =
unit ◦ ∆K : X → X K , but, recall, we often omit writing − for simplicity.
These ∆[K] are alternatively called copiers or diagonals.
As functions one has πi ◦ ∆[K] = id , and thus as channels πi ◦· ∆[K] = unit.
Fact 2.5.5. State transformation with a copier ∆ = ω differs from the tensor
product ω ⊗ ω. In general, for K ≥ 2,
(2.20)
∆[K] = ω , ω ··· ⊗ ω
| ⊗ {z } = iid [K](ω).
K times
The following simple example illustrates this fact. For clarity we now write
the brackets − explicitly. First,
∆ = 13 | 0i + 23 | 1i = 13 ∆(0) + 23 ∆(1) = 13 0, 0 + 23 1, 1 .
(2.24)
In contrast:
3|0i + 3|1i ⊗ 3|0i + 3|1i
1 2 1 2
(2.25)
= 19 |0, 0i + 29 | 0, 1 i + 92 | 1, 0 i + 49 | 1, 1 i.
One might expect a commuting diagram like in Lemma 2.5.1 (2) for copiers,
but that does not work: diagonals do not commute with arbitrary channels,
see Exercise 2.5.10 below for a counterexample. Copiers do commute with
deterministic channels of the form f = unit ◦ f , as in:
∆ ◦· f = ( f ⊗ f ) ◦· ∆ because ∆ ◦ f = ( f × f ) ◦ ∆.
In fact, this commutation with diagonals may be used as definition for a chan-
nel to be deterministic, see Exercise 2.5.12.
112
2.5. Projecting and copying 113
hc, di B (c ⊗ d) ◦· ∆ : X → Y × Z.
We are overloading the tuple notation hc, di. Above we use it for the tuple of
channels, so that hc, di has type X → D(Y × Z). Interpreting hc, di as a tuple of
functions, like in Subsection 1.1.1, would give a type X → D(Y) × D(Z). For
channels we standardly use the channel-interpretation for the tuple.
In the literature a probabilistic channel is often called a discriminative model.
Let’s consider a channel whose codomain is a finite set, say (isomorphic to) the
set n = {0, 1, . . . , n − 1}, whose elements i ∈ n can be seen as labels or classes.
Hence we can write the channel as c : X → n. It can then be understood as a
classifier: it produces for each element x ∈ X a distribution c(x) on n, giving
the likelihood c(x)(i) that x belongs to class i ∈ n. The class i with the highest
likelihood, obtained via argmax, may simply be used as x’s class.
The term generative model is used for a pair of a state σ ∈ D(Y) and a chan-
nel g : Y → Z, giving rise to a joint state hid , gi = σ ∈ D(X × Y). This shows
how the joint state can be generated. Later on we shall see that in principle
each joint state can be described in such generative form, via marginalisation
and disintegration, see Section 7.6.
Exercises
2.5.1 1 Let U ∈ P(X) and V ∈ P(Y). Show that the equation
(U ⊗ V) 1, 0 = U
(ϕ ⊗ ψ) 0, 1 = ϕ.
It fails when ψ is but ϕ is not the empty multiset 0. But it also fails
in non-zero cases, e.g. for the multisets ϕ = 2| ai + 4| b i + 1| ci and
ψ = 3| ui + 2| v i. Check this by doing the required computations.
113
114 Chapter 2. Discrete probability distributions
2.5.9 Let X = {u, v} and A = {a, b} as in Example 2.5.4. Prove that a state
ω = r1 |u, ai + r2 | u, bi + r3 | v, a i + r4 | v, b i ∈ D(X × A), where r1 +
r2 + r3 + r4 = 1, is non-entwined if and only if r1 · r4 = r2 · r3 .
2.5.10 Consider the probabilistic channel f : X → Y from Example 2.2.1
and show that on the one hand ∆ ◦· f : X → Y × Y is given by:
a −7 → 21 | u, u i + 12 | v, vi
b 7−→ 1|u, ui
c 7−→ 34 | u, u i + 14 | v, vi.
114
2.5. Projecting and copying 115
D(X)
D(∆)
/ D(X × X)
>
∆ ( ⊗
D(X) × D(X)
2.5.14 Show that the tuple operation h−, −i for channels, from Definition 2.5.6,
satisfies both:
πi ◦· hc1 , c2 i = ci and hπ1 , π2 i = unit
but not, essentially as in Exercise 2.5.10:
hc1 , c2 i ◦· d = hc1 ◦· d, c2 ◦· di.
2.5.15 Consider the joint state Flrn(τ) from Example 2.3.1 (2).
1 Take σ = 10 7
|Hi + 3
10 | L i ∈ D({H, L}) and c : {H, L} → {0, 1, 2}
defined by:
c(H) = 71 | 0 i + 21 | 1 i + 5
14 | 2i c(T ) = 16 | 0i + 13 | 1i + 12 |2 i.
Check that gr(σ, c) = Flrn(τ).
2.5.16 Consider a state σ ∈ D(X) and an ‘endo’ channel c : X → X.
1 Check that the following two statements are equivalent.
115
116 Chapter 2. Discrete probability distributions
c = σ = σ.
116
2.6. A Bayesian network example 117
wet grass slippery road
(D) (E)
9
d
9
sprinkler rain
(B) (C)
e :
winter
(A)
Figure 2.3 The wetness Bayesian network (from [33, Chap. 6]), with only the
nodes and edges between them; the conditional probability tables associated with
the nodes are given separately in the text.
A b b⊥ A c c⊥
winter
sprinkler
a a⊥ rain
a 1/5 4/5 a 4/5 1/5
3/5 2/5
slippery road
b c 19/20 1/20 C e e⊥
wet grass
b ⊥
c 4/5 1/5 ⊥
c 0 1
⊥ ⊥
b c 0 1
This Bayesian network is thus given by nodes, each with a conditional proba-
bility table, describing likelihoods in terms of previous ‘ancestor’ nodes in the
network (if any).
How to interpret all this data? How to make it mathematically precise? It is
not hard to see that the first ‘winter’ table describes a probability distribution
on the set A, which, in the notation of this book, is given by:
wi = 35 | ai + 25 |a⊥ i.
Thus we are assuming with probability of 60% that we are in a winter situation.
This is often called the prior distribution, or also the initial state.
Notice that the ‘sprinkler’ table contains two distributions on B, one for the
117
118 Chapter 2. Discrete probability distributions
We read them as: if it’s winter, there is a 20% chance that the sprinkler is on,
but if it’s not winter, there is 75% chance that the sprinkler is on.
Similarly, the ‘rain’ table corresponds to a channel:
ra / ra(a) = 54 |c i + 15 | c⊥ i
A ◦ C with
ra(a⊥ ) = 1 | c i + 9 | c⊥ i.
10 10
Before continuing we can see that the formalisation (partial, so far) of the
wetness Bayesian network in Figure 2.3 in terms of states and channels, al-
ready allows us to do something meaningful, namely state transformation =.
Indeed, we can form distributions:
sp = wi on B and ra = wi on C.
These (transformed) distributions capture the derived, predicted probabilities
that the sprinkler is on, and that it rains. Using the definition of state transfor-
mation, see (2.9), we get:
sp = wi (b ) = x sp(x)(b ) · wi(x)
⊥ P ⊥
sp = wi = 21
50 | b i + 29 ⊥
50 | b i.
In a similar way one can compute the probability distribution for rain as:
ra = wi = 13
25 |c i + 12 ⊥
25 | c i.
Such distributions for non-initial nodes of a Bayesian network are called pre-
dictions. As will be shown here, they can be obtained via forward state trans-
formation, following the structure of the network.
But first we still have to translate the upper two nodes of the network from
Figure 2.3 into channels. In the conditional probability table for the ‘wet grass’
118
2.6. A Bayesian network example 119
node we see 4 distributions on the set D, one for each combination of elements
from the sets B and C. The table thus corresponds to a channel:
wg(b, c) = 1920 | d i + 20 | d i
1 ⊥
wg(b, c⊥ ) = 10 | d i + 10
9 1
| d⊥ i
B×C ◦ / D
wg
with
wg(b , c) = 5 | d i + 5 | d⊥ i
⊥ 4 1
wg(b⊥ , c⊥ ) = 1|d⊥ i.
We illustrate how to obtain predictions for ‘rain’ and for ‘slippery road’. We
start from the latter. Looking at the network in Figure 2.3 we see that there are
two arrows between the initial node ’winter’ and our node of interest ‘slippery
road’. This means that we have to do two state successive transformations,
giving:
sr ◦· ra = wi = sr = ra = wi
The first equation follows from Lemma 1.8.3 (3). The second one involves
elementary calculations, where we can use the distribution ra = wi that we
calculated earlier.
Getting the predicted wet grass probability requires some care. Inspection of
the network in Figure 2.3 is of some help, but leads to some ambiguity — see
below. One might be tempted to form the parallel product ⊗ of the predicted
distributions for sprinkler and rain, and do state transformation on this product
along the wet grass channel wg, as in:
But this is wrong, since the winter probabilities are now not used consistently,
see the different outcomes in the calculations (2.24) and (2.25). The correct
way to obtain the wet grass prediction involves copying the winter state, via
the copy channel ∆, see:
wg ◦· (sp ⊗ ra) ◦· ∆ = wi = wg = (sp ⊗ ra) = ∆ = wi
= wg = hsp, rai = wi
= 1399
2000 | d i + 2000 | d i.
601 ⊥
119
120 Chapter 2. Discrete probability distributions
summations are automatically done at the right place. We proceed in two steps,
where for each step we only elaborate the first case.
We conclude that:
(sp ⊗ ra) = ∆ = wi = hsp, rai = wi
= 500
63
| b, ci + 500
147
| b, c⊥ i + 500 | b , ci
197 ⊥
+ 500 | b , c i.
93 ⊥ ⊥
= 19
20 · 500 + 10 · 500 + 5 · 500 + 0 · 500
63 9 147 4 197 93
= 1399
2000
wg = (sp ⊗ ra) = (∆ = wi) (d⊥ )
= 2000 .
601
120
2.6. A Bayesian network example 121
•O •O
D E
wet grass slippery road
@ d 8
B
•O
C
sprinkler rain
f :
•O
A
winter
Figure 2.4 The wetness Bayesian network from Figure 2.3 redrawn in a manner
that better reflects the underlying channel-based semantics, via explicit copiers
and typed wires. This anticipates the string-diagrammatic approach of Chapter 7.
Exercises
2.6.1 In [33, §6.2] the (predicted) joint distribution on D × E that arises
from the Bayesian network example in this section is reprented as a
table. It translates into:
30,443
100,000 |d, e i + 39,507 ⊥
100,000 | d, e i + 100,000 | d , e i
5,957 ⊥
+ 100,000 | d , e i.
24,093 ⊥ ⊥
121
122 Chapter 2. Discrete probability distributions
122
2.7. Divergence between distributions 123
DKL (ω, ρ) = 0 ⇐⇒ ω = ρ.
Proof. 1 The direction (⇐) is easy. For (⇒), let 0 = DKL (ω, ρ) = x ω(x) ·
P
log ω(x)/ρ(x) . This means that if ω(x) , 0, one has log ω(x)/ρ(x) = 0, and thus
ω(x)/ρ(x) = 1 and ω(x) = ρ(x). In particular:
X X
1= ω(x) = ρ(x). (∗)
x∈supp(ω) x∈supp(ω)
Hence U = ∅.
2 By unwrapping the relevant definitions:
DKL ω ⊗ ω0 , ρ ⊗ ρ0
(ω ⊗ ω0 )(x, y)
X !
= (ω ⊗ ω0 )(x, y) · log
x,y
(ρ ⊗ ρ0 )(x, y)
ω(x) ω0 (y)
X !
= ω(x) · ω0 (y) · log · 0
x,y
ρ(x) ρ (y)
ω(x) ω0 (y)
X ! !!
= ω(x) · ω (y) · log
0
+ log 0
x,y
ρ(x) ρ (y)
ω(x) ω0 (y)
X ! X !
= ω(x) · ω (y) · log
0
+ ω(x) · ω (y) · log 0
0
x,y
ρ(x) x,y
ρ (y)
ω(x) ω 0
! !
X X (y)
= ω(x) · log + ω0 (y) · log 0
x
ρ(x) y
ρ (y)
= DKL ω, ρ + DKL ω , ρ . 0 0
123
124 Chapter 2. Discrete probability distributions
124
2.7. Divergence between distributions 125
= log 1 = 0.
Exercises
2.7.1 Take ω = 14 | a i + 34 | b i and ρ = 12 | a i + 12 | b i. Check that:
125
126 Chapter 2. Discrete probability distributions
≤ r1 · DKL ω, ρ1 + · · · + rn · DKL ω, ρn .
126
3
– In the “-1” scenario the drawn ball is removed from the urn. This is cov-
ered by the hypergeometric distribution.
– In the “0” scenario the drawn ball is returned to the urn, so that the urn
remains unchanged. In that case the multinomial distribution applies.
– In the “+1” scenario not only the drawn ball is returned to the urn, but one
additional ball of the same colour is added to the urn. This is covered by
the Pólya distribution.
In the “-1” and “+1” scenarios the probability of drawing a ball of a certain
colour changes after each draw. In the “-1” case only finitely many draws
are possible, until the urn is empty.
This yields six variations in total, namely with ordered / unordered draws and
with -1, 0, 1 as replacement scenario. These six options are represented in a
3 × 2 table, see (3.3) below.
A single draw, in any of the three scenarios, is a transition of the form:

Urn ⟶ ⟨ single-colour, Urn′ ⟩.    (3.1)
We use the ad hoc notation Urn′ to describe the urn after the draw. It may
be the same urn as before, in case of a draw with replacement, or it may be
a different urn, with one ball less/more, namely the original urn minus/plus a
ball as drawn.
The above transition (3.1) will be described as a probabilistic channel. It
gives for each single draw the associated probability. In this description we
combine multisets and distributions. For instance, an urn with three red balls
and two blue ones will be described as a (natural) multiset 3| R i + 2| Bi. The
transition associated with drawing a single ball without replacement (scenario
“-1”) gives a mapping:
3| R ⟩ + 2| B ⟩  ⟼  3/5 · | R, (2| R ⟩ + 2| B ⟩) ⟩ + 2/5 · | B, (3| R ⟩ + 1| B ⟩) ⟩.
It gives the 35 probability of drawing a red ball, together with the remaining
urn, and a 25 probability of drawing a blue one, with a different new urn.
The “+1” scenario with double replacement gives a mapping:

3| R ⟩ + 2| B ⟩  ⟼  3/5 · | R, (4| R ⟩ + 2| B ⟩) ⟩ + 2/5 · | B, (3| R ⟩ + 3| B ⟩) ⟩.
Finally, the “0” scenario with single replacement gives a mapping in which the urn stays as it is:

3| R ⟩ + 2| B ⟩  ⟼  3/5 · | R, (3| R ⟩ + 2| B ⟩) ⟩ + 2/5 · | B, (3| R ⟩ + 2| B ⟩) ⟩.

In this last, third case we see that the urn/multiset does not change. An im-
portant first observation is that in that case we may as well use a distribution
as urn, instead of a multiset. The distribution represents an abstract urn. In the
above example we would use the distribution 53 | Ri+ 25 | Bi as abstract urn, when
we draw with single replacement (case “0”). The distribution contains all the
relevant information. Clearly, it is obtained via frequentist learning from the
original multiset. Using distributions instead of multisets gives more flexibility,
since not all distributions are obtained via frequentist learning — in particular
when the probabilities are proper real numbers and not fractions.
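The following Python sketch makes the three single-draw scenarios concrete for the urn 3|R⟩ + 2|B⟩; the representation of urns as dictionaries of counts and of draw outcomes as (colour, new urn) pairs is our own choice, not notation from the text.

    from fractions import Fraction as F

    # A sketch of frequentist learning and of the three single-draw scenarios
    # (-1: delete, 0: replace, +1: add an extra ball) on the urn 3|R> + 2|B>.

    def flrn(urn):
        """Frequentist learning: normalise a multiset into a distribution."""
        total = sum(urn.values())
        return {x: F(n, total) for x, n in urn.items()}

    def draw(urn, delta):
        """Single draw; delta = -1 (delete), 0 (replace), +1 (double replace)."""
        total = sum(urn.values())
        out = {}
        for x, n in urn.items():
            new_urn = dict(urn)
            new_urn[x] += delta
            out[(x, tuple(sorted(new_urn.items())))] = F(n, total)
        return out

    urn = {'R': 3, 'B': 2}
    print(flrn(urn))            # {'R': 3/5, 'B': 2/5}, the abstract urn
    for delta in (-1, 0, +1):
        print(delta, draw(urn, delta))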
We formulate this approach explicitly.
• In the “-1” and “+1” situations without or with double replacement, an urn
is a (non-empty, natural) multiset, which changes with every draw, via re-
moval or double replacement of the drawn ball. These scenarios will also be
described in terms of deletion or addition, using the draw-delete and draw-
add channels DD and DA that we already saw in Definition 2.2.4.
• In a “0” situation with replacement, an urn is a probability distribution; it
does not change when balls are drawn.
This covers the above second question, about what happens to the urn. For
the first question concerning ordered and unordered draws one has to go be-
yond single draw transitions. Hence we need to suitably iterate the single-draw
transition (3.1) to:
Urn ⟶ ⟨ multiple-colours, Urn′ ⟩    (3.2)
Now we can make the distinction between ordered and unordered draws ex-
plicit. Let X be the set of colours, for the balls in the urn — so X = {R, B} in
the above illustration.
Thus, in the latter case, both the urn and the handful of balls drawn from it, are
represented as a multiset.
In the end we are interested in assigning probabilities to draws, ordered or
not, in “-1”, “0” or “1” mode. These probabilities on draws are obtained by tak-
ing the first marginal/projection of the iterated transition map (3.2). It yields a
mapping from an urn to multiple draws. The following table gives an overview
of the types of these operations, where X is the set of colours of the balls.
            ordered                          unordered
  0     O0 : D(X) → X^K                  U0 : D(X) → N[K](X)
 −1     O− : N[L](X) → X^K               U− : N[L](X) → N[K](X)
 +1     O+ : N∗(X) → X^K                 U+ : N∗(X) → N[K](X)        (3.3)
We see that in the replacement scenario “0” the inputs of these channels are
distributions in D(X), as abstract urns. In the deletion scenario “-1” the in-
put (urns) are multisets in N[L](X), of size L. In the ordered case the outputs
are tuples in X K of length K and in the unordered case they are multisets in
N[K](X) of size K. Implicitly in this table we assume that L ≥ K, so that the
urn is full enough for K single draws. In the addition scenario “+1” we only
require that the urn is a non-empty multiset, so that at least one ball can be
drawn. Sometimes it is required that there is at least one ball of each colour in
the urn, so that all colours can occur in draws.
The above table uses systematic names for the six different draw maps. In
the unordered case the following (historical) names are common: U0 is the
multinomial channel, U− the hypergeometric channel, and U+ the Pólya channel.
In this chapter we shall see that these three draw maps have certain properties
in common, such as:
• Frequentist learning applied to the draws yields the same outcome as fre-
quentist learning from the urn, see Theorem 3.3.5, Corollary 3.4.2 (2) and
Proposition 3.4.5 (1).
• Doing a draw-delete DD after a draw of size K + 1 is the same as doing a
K-sized draw, see Proposition 3.3.8, Corollary 3.4.2 (4) and Theorem 3.4.6.
But we shall also see many differences between the three forms of drawing.
This chapter describes and analyses the six probabilistic channels in Ta-
ble 3.3 for drawing from an urn. It turns out that many of the relevant prop-
erties can be expressed via composition of such channels, either sequentially
(via ◦·) or in parallel (via ⊗). In this analysis the operations of accumulation
acc and arrangement arr, for going back-and-forth between products and mul-
tisets, play an important role. For instance, in each case one has commuting
diagrams of the form.
Such commuting diagrams are unusual in the area of probability theory but we
like to use them because they are very expressive. They involve multiple equa-
tions and clearly capture the types and order of the various operations involved.
Moreover, to emphasise once more, these are diagrams of channels with chan-
nel composition ◦·. As ordinary functions, with ordinary function composition
◦, there are no such diagrams / equations.
The chapter first takes a fresh look at accumulation and arrangement, in-
troduced earlier in Sections 1.6 and 2.2. These are the operations for turning
a list into a multiset, and a multiset into a uniform distribution of lists (that
accumulate to the multiset). In Section 3.2 we will use accumulation and ar-
rangement for the powerful operation for “zipping” two multisets of the same
size together. It works analogously to zipping of two lists of the same length,
into a single list of pairs, but this ‘multizip’ produces a distribution over (com-
bined) multisets. The multizip operation turns out to interact smoothly with
multinomial and hypergeometric distributions, as we shall see later on in this
chapter.
Section 3.3 investigates multinomial channels and Section 3.4 covers both
hypergeometric and Pólya channels, all in full, multivariate generality. We have
briefly seen the multinomial and hypergeometric channels in Example 2.1.1.
Now we take a much closer look, based on [78], and describe how these chan-
nels commute with basic operations such as accumulation and arrangement,
with frequentist learning, with multizip, and with draw-and-delete. All these
commutation properties involve channels and channel composition ◦·.
Subsequently, Section 3.5 elaborates how the channels in Table 3.3 actu-
ally arise. It makes the earlier informal descriptions in (3.1) and (3.2) mathe-
matically precise. What happens is a bit sophisticated. We recall that for any
monoid M, the mapping X 7→ M × X is a monad, called the writer monad,
see Lemma 1.9.1. This can be combined with the distribution monad D, giv-
ing a combined monad X 7→ D(M × X). It comes with an associated ‘Kleisli’
composition. It is precisely this composition that we use for iterating a single
draw, that is, for going from (3.1) to (3.2). Moreover, for ordered draws we use
the monoid M = L(X) of lists, and for unordered draws we use the monoid
M = N(X) of multisets. It is rewarding, from a formal perspective, to see that
from this abstract principled approach, common distributions for different sorts
of drawing arise, including the well known multinomial, hypergeometric and
Pólya distributions. This is based on [81].
The subsequent two sections 3.6 and 3.7 of this chapter focus on a non-
trivial operation from [78], namely turning a multiset of distributions into a
distribution of multisets. Technically, this operation is a distributive law, called
the parallel multinomial law. We spend ample time introducing it: Section 3.6
contains no less than four different definitions — all equivalent. Subsequently,
various properties are demonstrated of this parallel multinomial law, including
assigns equal probability to each sequence. Here we shall also use arrangement
for a specific size, as a channel arr[K] : N[K](X) → X K . We recall:
arr[K](ϕ) = Σ_{~x ∈ acc⁻¹(ϕ)} 1/(ϕ) · | ~x ⟩,   where   (ϕ) = kϕk! / ∏_x ϕ(x)! = K! / ∏_x ϕ(x)!.

The vectors ~y take the multiplicities of elements in ~x into account, which leads
to the factor 1/(acc(~x)).
Permutations are an important part of the story of accumulation and arrange-
ment. This is very explicit in the following result. Categorically it can be de-
scribed in terms of a so-called coequaliser, but here we prefer a more concrete
description. The axiomatic approach of [80] is based on this coequaliser.
Proposition 3.1.1. Each function f : X^K → Y that is stable under permutations
factors through the accumulation map acc : X^K ↠ N[K](X), via a unique map
f̄ : N[K](X) → Y with f̄ ◦ acc = f.

The double head ↠ for acc is used to emphasise that it is a surjective func-
tion.
This result will be used both as a definition principle and as a proof principle.
The existence part can be used to define a function N[K] → Y by specifying a
function X K → Y that is stable under permutation. The uniqueness part yields
a proof principle: if two functions g1 , g2 : N[K](X) → Y satisfy g1 ◦ acc =
g2 ◦ acc, then g1 = g2 . This is quite powerful, as we shall see. Notice that acc
is stable under permutation itself, see also Exercise 1.6.4.
Proof. Take ϕ = Σ_{1≤i≤ℓ} n_i · | x_i ⟩ ∈ N[K](X). Then we can define f̄(ϕ) via any
arrangement, such as:

f̄(ϕ) := f( x_1, . . . , x_1, . . . , x_ℓ, . . . , x_ℓ ),   where x_i occurs n_i times.
Example 3.1.2. We illustrate the use of Proposition 3.1.1 to re-define the ar-
rangement map and to prove one of its basic properties. Assume for a moment
that we do not already know about arrangement, only about accumulation. Now
consider the permutation channel perm : X^K → D(X^K), given by:

perm(~x) := Σ_{~y a permutation of ~x} 1/K! · | ~y ⟩.

This channel is stable under permutations, so by Proposition 3.1.1 it factors uniquely
through acc : X^K ↠ N[K](X), via a channel arr : N[K](X) → D(X^K) with
arr ◦ acc = perm. Next, both the channels acc ◦· arr and unit appear as dashed
fillers in a second triangle over acc, with composite unit ◦ acc : X^K → D(N[K](X)).
We show that both dashed arrows fit. The first equation below holds by con-
struction of arr, in the above triangle.
(acc ◦· arr)(acc(~x)) = (acc ◦· perm)(~x)
= D(acc)( Σ_{~y a permutation of ~x} 1/K! · | ~y ⟩ )
= Σ_{~y a permutation of ~x} 1/K! · | acc(~y) ⟩
= Σ_{~y a permutation of ~x} 1/K! · | acc(~x) ⟩
= 1 · | acc(~x) ⟩
= (unit ◦ acc)(~x).
The fact that the composite arr ◦· acc produces all permutations is used in the
following result. Recall that the ‘big’ tensor ⊗ : D(X)^K → D(X^K) is defined
by ⊗(ω_1, . . . , ω_K) = ω_1 ⊗ · · · ⊗ ω_K, see (2.21).
Proposition 3.1.3. The composite arr ◦· acc commutes with tensors, in the sense
that the following equation of channels D(X)^K → X^K holds:

(arr ◦· acc) ◦· ⊗ = ⊗ ◦· (arr ◦· acc),

where arr ◦· acc on the left-hand side is taken at X and on the right-hand side at D(X).

Proof. Evaluating both channels at ~ω ∈ D(X)^K and ~x ∈ X^K yields the same
outcome (arr ◦· acc ◦· ⊗)(~ω)(~x).
Exercises
3.1.1 Show that the permutation channel perm from Example 3.1.2 is an
idempotent, i.e. satisfies perm ◦· perm = perm. Prove this both con-
cretely, via the definition of perm, and abstractly, via acc and arr.
3.1.2 Consider Proposition 3.1.3 for X = {a, b} and K = 4. Check that:
◦·arr ◦· acc ω1 , ω1 , ω2 , ω2 a, b, b, b
N
N[K](X) × N[K](Y)
mzip[K]
◦ / N[K](X × Y).
N[K](X) × N[K](Y)
arr⊗arr / D XK × Y K
D(zip) (3.7)
D (X × Y)K
/ D N[K](X × Y)
D(acc)
Example 3.2.1. Let's use two sets X = {a, b} and Y = {0, 1} with two multisets
of size three:

ϕ = 1| a ⟩ + 2| b ⟩   and   ψ = 2| 0 ⟩ + 1| 1 ⟩.

Then (ϕ) = 3!/(1!·2!) = 3 and (ψ) = 3!/(2!·1!) = 3, so each multiset has three
arrangements. Zipping these arrangements pairwise gives nine sequences:

(a, 0), (b, 0), (b, 1)    (b, 0), (a, 0), (b, 1)    (b, 0), (b, 0), (a, 1)
(a, 0), (b, 1), (b, 0)    (b, 0), (a, 1), (b, 0)    (b, 0), (b, 1), (a, 0)
(a, 1), (b, 0), (b, 0)    (b, 1), (a, 0), (b, 0)    (b, 1), (b, 0), (a, 0)

By applying the accumulation function acc to each of these we get multisets:
three of them accumulate to 1| a, 1 ⟩ + 2| b, 0 ⟩ and the other six to
1| a, 0 ⟩ + 1| b, 0 ⟩ + 1| b, 1 ⟩, so that:

mzip(ϕ, ψ) = 1/3 · | 1| a, 1 ⟩ + 2| b, 0 ⟩ ⟩ + 2/3 · | 1| a, 0 ⟩ + 1| b, 0 ⟩ + 1| b, 1 ⟩ ⟩.
This shows that calculating mzip is laborious. But it is quite mechanical and
easy to implement. The picture below suggests to look at mzip as a funnel with
two input pipes in which multiple elements from both sides can be combined
into a probabilistic mixture.
[Funnel picture: the two multisets 1| a ⟩ + 2| b ⟩ and 2| 0 ⟩ + 1| 1 ⟩ enter the funnel
and come out as the mixture 1/3 · | 1| a, 1 ⟩ + 2| b, 0 ⟩ ⟩ + 2/3 · | 1| a, 0 ⟩ + 1| b, 0 ⟩ + 1| b, 1 ⟩ ⟩.]
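Indeed, the computation is mechanical. The following Python sketch implements mzip by brute force, exactly along the arrange–zip–accumulate recipe of (3.7); the tuple encoding of multisets is our own choice, and the code is meant for small multisets only.

    from fractions import Fraction as F
    from itertools import permutations

    # Brute-force multiset zip: arrange both multisets uniformly, zip the
    # arrangements, and accumulate.  Multisets are tuples such as ('a','b','b').

    def acc(seq):
        """Accumulate a sequence into a multiset (sorted tuple)."""
        return tuple(sorted(seq))

    def arrangements(phi):
        return set(permutations(phi))

    def mzip(phi, psi):
        out = {}
        arr_phi, arr_psi = arrangements(phi), arrangements(psi)
        p = F(1, len(arr_phi) * len(arr_psi))
        for xs in arr_phi:
            for ys in arr_psi:
                chi = acc(zip(xs, ys))
                out[chi] = out.get(chi, 0) + p
        return out

    print(mzip(('a', 'b', 'b'), (0, 0, 1)))
    # {(('a',0),('b',0),('b',1)): 2/3, (('a',1),('b',0),('b',0)): 1/3}

Running the same function on the multisets of Exercise 3.2.1 below reproduces the probabilities 1/4, 1/2 and 1/4 as well.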
N[K](X) × N[K](Y)
mzip
/ D N[K](X × Y)
N[K]( f )×N[K](g) D(N[K]( f ×g))
N[K](U) × N[K](V)
mzip
/ D N[K](U × V)
N[K](X) × N[K](Y)
arr⊗arr
◦ / XX × Y K
mzip ◦ ◦ zip
N[K](X × Y)
arr
◦ / (X × Y)K
Proof. 1 Easy, via the diagrammatic formulation (3.7), using naturality of arr
(Exercise 2.2.7), of zip (Exercise 1.9.4), and of acc (Exercise 1.6.6).
2 Since:
X 1
mzip(ϕ, K| y i) = acc zip(~x, hy, . . . , yi)
( ϕ )
~x∈acc −1 (ϕ)
1
acc(−x−i→
X
= , y)
( ϕ )
~x∈acc −1 (ϕ)
X 1
=
acc(~x) ⊗ 1|y i
( ϕ )
~x∈acc −1 (ϕ)
X 1
= ϕ ⊗ 1| y i = 1 ϕ ⊗ 1| y i .
( ϕ )
−1 ~x∈acc (ϕ)
N[K](X)×N[K](Y)×N[K](Z)
id ⊗mzip
◦ / N[K](X)×N[K](Y ×Z)
arr⊗arr⊗arr ◦ arr⊗arr
◦*
mzip⊗id ◦ K
X ×Y ×Z K K id ⊗zip
◦ / X ×(Y ×Z)K
K
mzip ◦ ◦ ◦ acc
arr
N[K](X×Y ×Z) ◦ N[K](X×Y ×Z) o
We can see that the outer diagram commutes by going through the internal
subdiagrams. In the middle we use associativity of the (ordinary) zip func-
tion, formulated in terms of (deterministic) channels. Three of the (other)
internal subdiagrams commute by item (4). The acc-arr triangle at the bot-
tom commutes by (2.11).
The following result deserves a separate status. It tells that what we learn
from a multiset zip is the same as what we learn from a parallel product (of
multisets).
Theorem 3.2.3. Multiset zip and frequentist learning interact well, namely as:
N[K](X) × N[K](Y)
mzip
◦ / N[K](X × Y)
⊗
◦ Flrn
2
N[K ](X × Y)
Flrn
◦ / X×Y
in ~y is ψ(b) x, ~y)
K = Flrn(ψ)(b). Hence the fraction of occurrences of (a, b) in zip(~
is Flrn(ϕ)(a) · Flrn(ψ)(b) = Flrn(ϕ ⊗ ψ)(a, b).
N[K](X) × N[L](X)
arr⊗arr / D XK × XL
D(++)
D X K+L / D N[K +L](X)
D(acc)
XK × XK
acc[K]×acc[K]
/ N[K](X) × N[K](X) +
(
zip N[2K](X)
6
acc[2]K
(X × X)K / N[2](X)K +
acc[K]L
XK
L / N[K](X)L +
'
zip L N[L·K](X)
7
acc[L]K
XL K
/ N[L](X)K +
exemplary proof. The two paths in the diagram yield the same outcomes in:
= 2| a i + 1|b i + 2| ci + 2| a i + 2| bi + 1| ci
= 4| a i + 3| b i + 3| ci.
+ ◦ acc[2]K ◦ zip [a, b, c, a, c], [b, b, a, a, c]
= + ◦ acc[2]K [(a, b), (b, b), (c, a), (a, a), (c, c)]
= 4| a i + 3|b i + 3| ci.
2 Similarly.
+ / N[2K](X) unit
N[K](X) × N[K](X)
(
mzip D N[2K](X)
6
D N[K](X × X)
D(N[K](acc[2]))
/ D N[K] N[2](X) D(flat)
+ / N[L·K](X)
N[K](X)L unit
(
mzip L D N[L·K](X)
6
D(N[K](acc[L]))/
D N[K](X L )
D N[K] N[L](X) D(flat)
Via the associativity of Proposition 3.2.2 (5) the actual arrangement of these
multiple multizips does not matter.
Exercises
3.2.1 Show that:

mzip[4]( 1| a ⟩ + 2| b ⟩ + 1| c ⟩, 3| 0 ⟩ + 1| 1 ⟩ )
= 1/4 · | 1| a, 1 ⟩ + 2| b, 0 ⟩ + 1| c, 0 ⟩ ⟩
+ 1/2 · | 1| a, 0 ⟩ + 1| b, 0 ⟩ + 1| b, 1 ⟩ + 1| c, 0 ⟩ ⟩
+ 1/4 · | 1| a, 0 ⟩ + 2| b, 0 ⟩ + 1| c, 1 ⟩ ⟩.
∆ / N[K](X) × N[K](X)
N[K](X)
, mzip
N[K](∆) - D N[K](X × X)
2 Check that mzip and zip do not commute with accumulation, as in:
XK × Y K
acc × acc / N[K](X) × N[K](Y)
zip , mzip
acc / D N[K](X × Y)
(X × Y)K
Hint: Take sequences [a, b, b], [0, 0, 1] and re-use Example 3.2.1.
The number K ∈ N represents the number of objects that is drawn. The distri-
bution mn[K](ω) assigns a probability to a K-object draw ϕ ∈ N[K](X). There
is no bound on K, since the idea is that drawn objects are replaced.
Clearly, the above function (3.8) forms a channel mn[K] : D(X) → N[K](X).
In this section we collect some basic properties of these channels. They are
interesting in themselves, since they capture basic relationships between mul-
tisets and distributions, but they will also be useful in the sequel.
For convenience we repeat the definition of the multinomial channel (3.8)
from Example 2.1.1 (2): for a set X and a natural number K ∈ N,

mn[K](ω) := Σ_{ϕ∈N[K](X)} (ϕ) · ∏_x ω(x)^{ϕ(x)} · | ϕ ⟩,   where   (ϕ) = K! / ∏_x ϕ(x)!.
Theorem 3.3.1. 1 Arrangement after a multinomial draw gives independent and
identically distributed samples, as channels D(X) → X^K:

arr[K] ◦· mn[K] = iid[K].

2 In the other direction, accumulation after iid sampling gives the multinomial
channel, as channels D(X) → N[K](X):

acc[K] ◦· iid[K] = mn[K].
2 A direct consequence of the previous item, using that acc ◦· arr = unit,
see (2.11) in Example 2.2.3.
D(X)
mn[K]
/ D N[K](X)
D( f ) D(N[K]( f ))
D(X)
mn[K]
/ D N[K](X)
Proof. By Theorem 3.3.1 (2) and naturality of acc and of iid , as expressed by
the diagram on the left in Lemma 2.4.8 (1).
D(N[K]( f )) ◦ mn[K] = D(N[K]( f )) ◦ D(acc) ◦ iid
= D(acc) ◦ D( f K ) ◦ iid
= D(acc) ◦ iid ◦ D( f )
= mn[K] ◦ D( f ).
For the next result, recall the multizip operation mzip from Section 3.2. It
may have looked a bit unusual at the time, but the next result demonstrates that
it behaves quite well — as in other such results.
Corollary 3.3.3. Multinomial channels commute with tensor and multizip:
mzip = mn[K](ω) ⊗ mn[K](ρ) = mn[K](ω ⊗ ρ).
Diagrammatically this amounts to:
D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
⊗ ◦ mzip
D(X × Y)
mn[K]
◦ / N[K](X × Y)
D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
◦ arr⊗arr
iid ⊗iid / X × YK
K
◦ zip
⊗
◦ mzip
iid / (X × Y)K
◦ acc
D(X × Y)
mn[K]
◦ / N[K](X × Y) o
The three subdiagrams, from top to bottom, commute by Theorem 3.3.1 (1),
by Lemma 2.4.8 (2), and by Theorem 3.3.1 (2).
Actually computing mn[K](ω ⊗ ρ)(χ) is very fast, but computing the equal
expression:
mzip = mn[K](ω) ⊗ mn[K](ρ) (χ)
is much, much slower. The reason is that one has to sum over all pairs (ϕ, ψ)
that mzip to χ.
We move on to a next fundamental fact, namely that frequentist learning
after a multinomial is the identity, in the theorem below. We first need an aux-
iliary result.
Lemma 3.3.4. Fix a distribution ω ∈ D(X) and a number K. For each y ∈ X,
X
mn[K](ω)(ϕ) · ϕ(y) = K · ω(y).
ϕ∈N[K](X)
Proof. The equation holds for K = 0, since then ϕ(y) = 0. Hence we may
assume K > 0. Then:
X
mn[K](ω)(ϕ) · ϕ(y)
ϕ∈N[K](X)
X K! Y
= ϕ(y) · Q · ω(x)ϕ(x)
x ϕ(x)!
x
ϕ∈N[K](X), ϕ(y),0
X K · (K −1)! Y
= · ω(y) · ω(y)ϕ(y)−1 · ω(x)ϕ(x)
ϕ(x)!
Q
ϕ∈N[K](X), ϕ(y),0
(ϕ(y)−1)! · x,y x,y
X (K − 1)! Y
= K · ω(y) · · ω(x)ϕ(x)
ϕ(x)!
Q
x
ϕ∈N[K−1](X) x
X
= K · ω(y) · mn[K −1](ω)(ϕ)
ϕ∈N[K−1](X)
= K · ω(y).
Theorem 3.3.5. Frequentist learning from a multinomial gives the original
distribution:
Flrn = mn[K](ω) = ω. (3.10)
This means that the following diagram of channels commutes.
D(X)
mn[K]
◦ / N[K](X)
◦
Flrn
◦
id /X
Proof. For y ∈ X we compute:

(Flrn ◦· mn[K])(ω)(y) = Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · Flrn(ϕ)(y)
= Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y)/kϕk
= K · ω(y) · 1/K        by Lemma 3.3.4, using kϕk = K
= ω(y).
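The identity Flrn ◦· mn[K] = id can also be checked by direct computation. The Python sketch below enumerates all K-sized multisets over the support of ω (our own representation, as Counters) and confirms that frequentist learning after a multinomial draw returns ω.

    from fractions import Fraction as F
    from math import factorial
    from itertools import combinations_with_replacement
    from collections import Counter

    # A sketch checking Flrn after mn[K] on a small example.

    def multisets(support, K):
        return [Counter(c) for c in combinations_with_replacement(support, K)]

    def mn(K, omega):
        """Multinomial channel: a distribution over K-sized multisets."""
        out = {}
        for phi in multisets(list(omega), K):
            coeff = factorial(K)
            prob = F(1)
            for x, n in phi.items():
                coeff //= factorial(n)
                prob *= omega[x] ** n
            out[tuple(sorted(phi.items()))] = coeff * prob
        return out

    omega = {'a': F(1, 4), 'b': F(1, 2), 'c': F(1, 4)}
    K = 3
    dist = mn(K, omega)
    assert sum(dist.values()) == 1

    # frequentist learning after a multinomial draw returns omega itself
    learned = {x: sum(p * F(dict(phi).get(x, 0), K) for phi, p in dist.items())
               for x in omega}
    print(learned)   # {'a': 1/4, 'b': 1/2, 'c': 1/4}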
Theorem 3.3.6. Multinomial channels compose, with a bit of help of the (fixed-
size) flatten operation for multisets (1.33), as in:
mn[K]
/ D M[K](X) mn[L]
/ D M[L] M[K](X)
D(X)
D(flat)
mn[L·K] . D M[L·K](X)
Proof. Since:
X X X
mn[K](ω)(ϕ) · ϕ =
mn[K](ω)(ϕ) · ϕ(x) x
ϕ∈N[K](X) x∈X ϕ∈N[K](X)
X
= K · ω(x) x by Lemma 3.3.4
x∈X X
= K· ω(x) x
x∈X
= K · ω.
N[K](X) o DD
◦ N[K; +1](X)
a
◦ ◦
mn[K] mn[K+1]
D(X)
Proof. For ω ∈ D(X) and ϕ ∈ N[K](X) we have:
DD ◦· mn[K +1] (ω)(ϕ)
X
= mn[K +1](ω)(ψ) · DD[K](ψ)(ϕ)
ψ∈N[K+1](X)
X ϕ(x) + 1
= mn[K +1](ω)(ϕ + 1| x i) ·
x∈X
K+1
X (K + 1)! Y ϕ(x) + 1
= · ω(y)(ϕ+1| x i)(y) ·
y (ϕ + 1| x i)(y)! K+1
Q
y
x∈X
X K! Y
ϕ(y)
= · ω(y) · ω(x)
y ϕ(y)!
Q
y
x∈X Y
= (ϕ) · ω(y)ϕ(y) · x ω(x)
P
y
= mn[K](ω)(ϕ).
The above triangles exist for each K. This means that the collection of chan-
nels mn[K], indexed by K ∈ N, forms a cone for the infinite chain of draw-
and-delete channels. This situation is further investigated in [85] in relation to
de Finetti’s theorem [37], which is reformulated there in terms of multinomial
channels forming a limit cone.
The multinomial channels do not commute with draw-add channels DA. For
instance, for ω = 1/3 · | a ⟩ + 2/3 · | b ⟩ one has:

mn[2](ω) = 1/9 · | 2| a ⟩ ⟩ + 4/9 · | 1| a ⟩ + 1| b ⟩ ⟩ + 4/9 · | 2| b ⟩ ⟩
mn[3](ω) = 1/27 · | 3| a ⟩ ⟩ + 2/9 · | 2| a ⟩ + 1| b ⟩ ⟩ + 4/9 · | 1| a ⟩ + 2| b ⟩ ⟩ + 8/27 · | 3| b ⟩ ⟩.

But:

DA =≪ mn[2](ω) = 1/9 · | 3| a ⟩ ⟩ + 2/9 · | 2| a ⟩ + 1| b ⟩ ⟩ + 2/9 · | 1| a ⟩ + 2| b ⟩ ⟩ + 4/9 · | 3| b ⟩ ⟩.
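Both claims — that draw-and-delete maps mn[K+1](ω) back to mn[K](ω), while draw-and-add does not map mn[K](ω) to mn[K+1](ω) — can be confirmed with a short computation. The sketch below encodes multisets over {a, b} as pairs of counts, a representation chosen here only for brevity.

    from fractions import Fraction as F
    from math import comb

    # Check: push-forward of mn(3) along DD equals mn(2); push-forward of
    # mn(2) along DA differs from mn(3).  Multisets are pairs (#a's, #b's).

    omega_a = F(1, 3)

    def mn(K):
        return {(i, K - i): comb(K, i) * omega_a**i * (1 - omega_a)**(K - i)
                for i in range(K + 1)}

    def push(chan, dist):
        out = {}
        for phi, p in dist.items():
            for psi, q in chan(phi).items():
                out[psi] = out.get(psi, 0) + p * q
        return out

    def DD(phi):   # remove one ball, drawn proportionally to its multiplicity
        a, b = phi
        return {k: F(v, a + b) for k, v in (((a - 1, b), a), ((a, b - 1), b)) if v > 0}

    def DA(phi):   # add one extra ball of a colour drawn from the urn
        a, b = phi
        return {k: F(v, a + b) for k, v in (((a + 1, b), a), ((a, b + 1), b)) if v > 0}

    print(push(DD, mn(3)) == mn(2))   # True
    print(push(DA, mn(2)) == mn(3))   # False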
There is one more point that we like to address. Since mn[K](ω) is a distri-
P
bution, the sum over all draws ϕ mn[K](ω)(ϕ) equals one. But what if we re-
strict this sum to draws ϕ of certain colours only, that is, with supp(ϕ) ⊆ S , for
a proper subset S ⊆ supp(ω)? And what if we then let the size of these draws
K go to infinity? The result below describes what happens. It turns out that
the same behaviour exists in the hypergeometric and Pólya cases, see Proposi-
tions 3.4.9.
ω(x) in:
P
Proof. Write r B x∈S
(1.27)
X X Y
MK = mn[K](ω)(ϕ) = (ϕ) · ω(x)ϕ(x) = r K .
ϕ∈M[K](S ) ϕ∈M[K](S ) x∈S
Since 0 < r < 1 we get MK = r K > r K+1 = MK+1 and lim MK = lim r K = 0.
K→∞ K→∞
Exercises
3.3.1 Let's throw a fair die 12 times. What is the probability that each
number appears twice? Show that it is 12!/72^6.
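A quick numerical check of this number, under the usual multinomial formula, could look as follows.

    from fractions import Fraction as F
    from math import factorial

    # Exercise 3.3.1: twelve throws of a fair die, each face exactly twice, has
    # multinomial probability 12!/(2!^6 * 6^12), which equals 12!/72^6.

    prob = F(factorial(12), factorial(2) ** 6) * F(1, 6) ** 12
    print(prob == F(factorial(12), 72 ** 6))   # True
    print(float(prob))                         # about 0.0034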
3.3.2 Take the function f : {a, b, c} → {0, 1} with f(a) = f(b) = 0 and
f(c) = 1, together with:

ω = 1/4 · | a ⟩ + 1/2 · | b ⟩ + 1/4 · | c ⟩    and    ψ = 2| 0 ⟩ + 1| 1 ⟩.

1 Show that:

mn[3](D(f)(ω))(ψ) = 27/64.
2 Check that:
D(N[3]( f )) mn[3](ω) (ψ) = mn[3](ω)(2|a i + 1| ci)
+ mn[3](ω)(1| ai + 1| b i + 1| c i)
+ mn[3](ω)(2| bi + 1| c i)
yields the same outcome 27/64.
3.3.3 Check that:

mn[2]( 1/2 · | a ⟩ + 1/2 · | b ⟩ ) = 1/4 · | 2| a ⟩ ⟩ + 1/2 · | 1| a ⟩ + 1| b ⟩ ⟩ + 1/4 · | 2| b ⟩ ⟩.
Conclude that the multinomial map does not preserve uniform distri-
butions.
3.3.4 Check that:

mn[1](ω) = Σ_{x∈supp(ω)} ω(x) · | 1| x ⟩ ⟩.
3.3.5 Use Exercise 1.6.3 to show that for ϕ ∈ N[K](X) and ψ ∈ N[L](X)
one has:
K+L
K
mn[K +L](ω)(ϕ + ψ) = ϕ+ψ · mn[K](ω)(ϕ) · mn[L](ω)(ψ),
ϕ
D(X) ×O D(X)
mn[K]⊗mn[L]
◦ / N[K](X) × N[L](X)
∆ ◦+
D(X)
mn[K+L]
◦ / N[K +L](X)
Notice that this is essentially Theorem 3.3.1 (2), via the isomor-
phism M[1](X) X.
3.3.9 Show that for a natural multiset ϕ of size K one has:
flat mn[K] Flrn(ϕ) = ϕ.
3.3.10 Check that multinomial channels do not commute with tensors, as in:
D(X) × D(Y)
mn[K]⊗mn[L]
◦ / M[K](X) × M[L](Y)
⊗ , ◦⊗
D(X × Y) ◦ / M[K ·L](X × Y)
mn[K·L]
3.3.12 Check that the multiset distribution mltdst[K]X from Exercise 2.1.7
equals:
mltdst[K]X = mn[K](unif X ).
3.3.13 1 Check that accumulation does not commute with draw-and-delete,
as in:
acc / N[K +1](X)
X K+1 ◦
π , ◦ DD
XK ◦ / N[K](X)
acc
ppr ◦ ◦ DD
acc / N[K](X)
XK ◦
X K+1
acc / / N[K +1](X)
ppr *
D(X K ) DD
D(acc)
+
D N[K](X)
4 Show that probabilistic projection also commutes with arrange-
ment:
X K+1 o
arr
◦ N[K +1](X)
ppr ◦ ◦ DD
XK o
arr
◦ N[K](X)
3.3.14 Show that the probabilistic projection channel ppr from the previous
exercise makes the following diagram commute.
N
D(X) K+1 ◦ / X K+1
ppr ◦ ◦ ppr
N
D(X)K ◦ / XK
in a situation:
arr⊗arr /
N[K +1](X) × N[K +1](Y) ◦ X K+1 × Y K+1
zip ◦
ppr
◦ ,
(X × Y)K+1 2 (X × Y)
K
◦
π
3.3.16 Recall the set D∞ (X) of discrete distributions from (2.8), with infinite
(countable) support, see Exercise 2.1.11. For ρ ∈ D∞ (N>0 ) and K ∈ N
define:
X Y
ρ(i)ϕ(i) ϕ .
mn [K](ρ) B (ϕ) ·
∞
ϕ∈N[K](N>0 ) i∈supp(ϕ)
1 Show that this yields a distribution in D∞ N[K](N>0 ) , i.e. that
X
mn [K](ρ)(ϕ) = 1.
∞
ϕ∈N[K](N>0 )
We first repeat from Example 2.1.1 (3) the definition of the hypergeometric
channel. For a set X and natural numbers K ≤ L we have, for a multiset/urn
ψ ∈ N[L](X),

hg[K](ψ) := Σ_{ϕ ≤_K ψ} [ (ψ choose ϕ) / (L choose K) ] · | ϕ ⟩,
where (ψ choose ϕ) = ∏_x (ψ(x) choose ϕ(x)).     (3.13)

Recall that we write ϕ ≤_K ψ for: kϕk = K and ϕ ≤ ψ, see Definition 1.5.1 (2).
Lemma 1.6.2 shows that the probabilities in the above definition add up to one.

The Pólya channel resembles the above hypergeometric one, except that
multichoose coefficients are used instead of ordinary binomial coefficients:

pl[K](ψ) := Σ_{ϕ ∈ N[K](supp(ψ))} [ ((ψ choose ϕ)) / ((L choose K)) ] · | ϕ ⟩,
where ((ψ choose ϕ)) = ∏_{x∈supp(ψ)} ((ψ(x) choose ϕ(x))).     (3.14)
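As a computational aid, here is a Python sketch of both channels for urns given as Counters; the functions hg and pl below are our own small implementations of (3.13) and (3.14), not code from the text, and multichoose(n, k) is computed as C(n+k−1, k).

    from fractions import Fraction as F
    from math import comb, prod
    from itertools import combinations_with_replacement
    from collections import Counter

    # Sketches of the hypergeometric and Pólya channels for urns as Counters.

    def multichoose(n, k):
        return comb(n + k - 1, k)

    def draws(support, K):
        return [Counter(c) for c in combinations_with_replacement(sorted(support), K)]

    def hg(K, urn):
        L = sum(urn.values())
        return {tuple(sorted(phi.items())):
                F(prod(comb(urn[x], phi[x]) for x in phi), comb(L, K))
                for phi in draws(urn, K) if all(phi[x] <= urn[x] for x in phi)}

    def pl(K, urn):
        L = sum(urn.values())
        return {tuple(sorted(phi.items())):
                F(prod(multichoose(urn[x], phi[x]) for x in phi), multichoose(L, K))
                for phi in draws(urn, K)}

    urn = Counter({'R': 3, 'B': 2})
    print(hg(2, urn))   # two reds have probability 3/10
    print(pl(2, urn))   # two reds have probability 2/5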
N[K +L](X)
hg[K]
◦ / N[K](X)
@
DD ◦ & ◦
DD
N[K +L−1](X) N[K: +1](X)
◦
DD ◦· ··· ◦· DD
| {z }
L−2 times
To emphasise, this result says that the draws can be obtained via iterated
draw-and-delete. The full picture emerges in Theorem 3.5.10 where we include
the urn.
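Before turning to the formal proof, the claim can be tested on a small example; the sketch below iterates a two-colour draw-and-delete map and compares the result with the hypergeometric formula. The binary encoding of urns and draws as pairs of counts is an assumption made for brevity.

    from fractions import Fraction as F
    from math import comb

    # Iterating draw-and-delete L times on an urn of size K+L gives hg[K].

    def DD(phi):
        a, b = phi
        return {k: F(v, a + b) for k, v in (((a - 1, b), a), ((a, b - 1), b)) if v > 0}

    def push(chan, dist):
        out = {}
        for phi, p in dist.items():
            for psi, q in chan(phi).items():
                out[psi] = out.get(psi, 0) + p * q
        return out

    def hg(K, urn):
        a, b = urn
        return {(i, K - i): F(comb(a, i) * comb(b, K - i), comb(a + b, K))
                for i in range(K + 1) if i <= a and K - i <= b}

    urn, K = (4, 3), 2          # size 7, so L = 5 deletions down to size K = 2
    dist = {urn: F(1)}
    for _ in range(sum(urn) - K):
        dist = push(DD, dist)
    print(dist == hg(K, urn))   # True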
Proof. Write ψ ∈ N[K + L](X) for the urn. The proof proceeds by induction
on the number of iterations L, starting with L = 0. Then ϕ ≤K ψ means ϕ = ψ.
Hence:
ψ ψ
X ϕ ψ
hg[K] ψ = K+0 ϕ = K ψ = 1 ψ = DD 0 (ψ) = DD L (ψ).
ϕ≤K ψ K K
For the induction step we use ψ ∈ N[K +(L+1)](X) = N[(K +1)+ L](X) and
ϕ ≤K ψ. Then:
DD L+1 (ψ)(ϕ) = DD = DD L (ψ) (ϕ)
X
= DD L (ψ)(χ) · DD(χ)(ϕ)
χ∈N[K+1](X)
X ϕ(y) + 1
= DD L (ψ)(ϕ + 1| y i) ·
y∈X
K+1
ψ
(IH)
X ϕ+1| y i ϕ(y) + 1
= K+L+1 ·
y, ϕ(y)<ψ(y)
K+1
K+1
(ψ(y) − ϕ(y)) · ψϕ
(*)
X
= K+L+1 by Exercise 1.7.6 (1),(2)
y, ϕ(y)<ψ(y) (L + 1) ·
K
((K + L + 1) − K) · ψϕ
=
(L + 1) · K+L+1K
ψ
ϕ
= K+L+1
K
= hg[K](ψ)(ϕ).
From this result we can deduce many additional facts about hypergeometric
distributions.
N[N](X)
hg[K]
◦ / N[K](X)
◦ ◦
Flrn -Xq Flrn
N[L](X) o
DD
◦ N[L+1](X)
◦ ◦
hg[K] * t hg[K]
N[K](X)
4 Also:
N[K](X) o DD
◦ N[K= +1](X)
_
◦ ◦
hg[K] hg[K+1]
N[L](X)
N[K +L](X)
hg[K]
◦ / N[K](X)
` @
◦ ◦
mn[K+L] D(X) mn[K]
The distribution of small hypergeometric draws from a large urn looks like
a multinomial distribution. This is intuitively clear, but will be made precise
below.
Remark 3.4.3. When the urn from which we draw in hypergeometric mode is
very large and the draw involves only a small number of balls, the withdrawals
do not really affect the urn. Hence in this case the hypergeometric distribution
behaves like a multinomial distribution, where the urn (as distribution) is ob-
tained via frequentist learning. This is elaborated below, where the urn ψ is
And also:
X
pl[K](ψ)(ϕ) · ϕ(y) = K · Flrn(ψ)(y). (3.16)
ϕ∈M[K](supp(ψ))
(∗)
Proof. We use Exercises 1.7.5 and 1.7.6 in the marked equation = below. We
start with equation (3.15), for which we assume K ≤ L.
X ψϕ · ϕ(y)
X
hg[K](ψ)(ϕ) · ϕ(y) = L
ϕ≤K ψ ϕ≤K ψ K
ψ−1| y i
(∗)
X ψ(y) · ϕ−1| y i
= L−1
L
ϕ≤K ψ K · K−1
ψ−1| y i
ψ(y) X ϕ
= K· ·
L ϕ≤ ψ−1| y i L−1
K−1 K−1
= K · Flrn(ψ)(y).
N∗ (X)
pl[K]
◦ / N[K](X)
◦ ◦
Flrn (Xu Flrn
N[L](X)
DA
◦ / N[L+1](X)
◦ ◦
pl[K] * t pl[K]
N[K](X)
This second point is the Pólya analogue of Theorem 3.4.2 (3), with draw-
and-add instead of draw-and-delete.
(∗)
2 We use Exercises 1.7.5 and 1.7.6 in the marked equation = below. For urn
N[K](X) o DD
◦ N[K? +1](X) N[K](X) o hg[K]
◦ N[L](X)
] ] A
◦ ◦ ◦ ◦
pl[K] pl[K+1] pl[K] pl[L]
N∗ (X) N∗ (X)
Later on we shall see that the Pólya channel factors through the multinomial
channel, see Exercise 6.4.5 and ??. The Pólya channel does not commute with
multizip.
Earlier in Proposition 3.3.7 we have seen an ‘average’ result for multinomi-
als. There are similar results for hypergeometric and Pólya distributions.
1 For L ≥ K ≥ 1,
X K
flat hg[K](ψ) = hg[K](ψ)(ϕ) · ϕ = · ψ = K · Flrn(ψ).
ϕ≤ ψ
L
K
2 Similarly, for K ≥ 1,
X K
flat pl[K](ψ) = pl[K](ψ)(ϕ) · ϕ = · ψ = K · Flrn(ψ).
ϕ∈N[K](supp(ψ))
L
Proof. In both cases we rely on Lemma 3.4.4. For the first item we get:
X X X
hg[K](ψ)(ϕ) · ϕ =
hg[K](ψ)(ϕ) · ϕ(x) x
ϕ≤K ψ x∈X ϕ≤K ψ
(3.15)
X
= K · Flrn(ψ)(x) x
x∈X
= K · Flrn(ψ).
Then: lim an = 0.
n→∞
Proof. We switch to the (natural) logarithm ln and prove the equivalent state-
ment lim ln(an ) = −∞. We use that the logarithm turns products into sums,
n→∞
see Exercise 1.2.2, and that the derivative of ln(x) is 1x . Then:
X N +i X
ln an = = ln N + i − ln M + i
ln
i<n
M+i i<n
X Z M+1 1
= − dx
i<n N+1 x
(∗) X (M + i) − (N + i)
≤ −
i<n
M+i
X 1
= (N − M) ·
i<n
M+i
X 1
= (N − M) · .
i<n
M+i
It is well-known that the harmonic series n>0 n1 is infinite. Since M > N the
P
above sequence ln(an ) thus goes to −∞.
The validity of the marked inequality ≤ follows from an inspection of the
graph of the function 1x : the integral from N + i to M + i is the surface under 1x
between the points N + i < M + i. Since 1x is a decreasing function, this surface
is bigger than the rectangle with height M+i1
and length (M + i) − (N + i).
1 Write for K ≤ L,
X
HK B hg[K](ψ)(ϕ).
ϕ∈N[K](S ), ϕ≤ψ
We define:
PK+1 (L−1)! (LS +(K +1)−1)! (LS −1)! (L+K −1)!
aK B = · · ·
PK (LS −1)! (L+(K +1)−1)! (L−1)! (LS +K −1)!
LS +K
= < 1, since LS < L.
L+K
Thus PK+1 = aK · PK < PK and also:
PK = aK−1 · PK−1 = aK−1 · aK−2 · PK−2 = · · · = aK−1 · aK−2 · . . . · a0 · P0 .
Our goal is to prove lim PK = 0. This follows from lim i<K ai = 0, which
Q
K→∞ K→∞
we obtain from Lemma 3.4.8.
Exercises
3.4.1 Consider an urn with 5 red, 6 green, and 8 blue balls. Suppose 6 balls
are drawn from the urn, resulting in two balls of each colour.
1 Describe both the urn and the draw as a multiset.
Show that the probability of the six-ball draw is:
5184000
2 47045881≈ 0.11, when balls are drawn one by one and are replaced
before each next single-ball draw;
50
3 323 ≈ 0.15, when the drawn balls are deleted;
405
4 4807 ≈ 0.08 when the drawn balls are replaced and each time an
extra ball of the same colour is added.
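The three numbers can be verified with a few lines of Python, using the formulas for the multinomial, hypergeometric and Pólya probabilities of this single draw; the script below is only a numerical check of the exercise's claims.

    from fractions import Fraction as F
    from math import comb, factorial

    # Check of Exercise 3.4.1: urn with 5 red, 6 green, 8 blue balls; the draw
    # consists of two balls of each colour.

    urn, draw = (5, 6, 8), (2, 2, 2)
    L, K = sum(urn), sum(draw)

    def multichoose(n, k):
        return comb(n + k - 1, k)

    # multinomial: drawn balls are replaced, urn as distribution Flrn(urn)
    coeff = factorial(K) // (factorial(2) ** 3)
    p_mn = coeff * F(5**2 * 6**2 * 8**2, L**K)
    # hypergeometric: drawn balls are deleted
    p_hg = F(comb(5, 2) * comb(6, 2) * comb(8, 2), comb(L, K))
    # Pólya: drawn balls are returned together with an extra ball of that colour
    p_pl = F(multichoose(5, 2) * multichoose(6, 2) * multichoose(8, 2), multichoose(L, K))

    print(p_mn, float(p_mn))   # 5184000/47045881  ~ 0.11
    print(p_hg, float(p_hg))   # 50/323            ~ 0.15
    print(p_pl, float(p_pl))   # 405/4807          ~ 0.08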
3.4.2 Draw-delete preserves hypergeometric and Pólya distributions, see
Corollary 3.4.2 (4) and Theorem 3.4.6. Check that draw-add does
not preserve hypergeometric or Pólya distributions, for instance by
checking that for ϕ = 3| a ⟩ + 1| b ⟩,

hg[2](ϕ) = 1/2 · | 2| a ⟩ ⟩ + 1/2 · | 1| a ⟩ + 1| b ⟩ ⟩
hg[3](ϕ) = 1/4 · | 3| a ⟩ ⟩ + 3/4 · | 2| a ⟩ + 1| b ⟩ ⟩
DA =≪ hg[2](ϕ) = 1/2 · | 3| a ⟩ ⟩ + 1/4 · | 2| a ⟩ + 1| b ⟩ ⟩ + 1/4 · | 1| a ⟩ + 2| b ⟩ ⟩,

and:

pl[2](ϕ) = 3/5 · | 2| a ⟩ ⟩ + 3/10 · | 1| a ⟩ + 1| b ⟩ ⟩ + 1/10 · | 2| b ⟩ ⟩
pl[3](ϕ) = 1/2 · | 3| a ⟩ ⟩ + 3/10 · | 2| a ⟩ + 1| b ⟩ ⟩ + 3/20 · | 1| a ⟩ + 2| b ⟩ ⟩ + 1/20 · | 3| b ⟩ ⟩
DA =≪ pl[2](ϕ) = 3/5 · | 3| a ⟩ ⟩ + 3/20 · | 2| a ⟩ + 1| b ⟩ ⟩ + 3/20 · | 1| a ⟩ + 2| b ⟩ ⟩ + 1/10 · | 3| b ⟩ ⟩.
3.4.3 Let X be a finite set, say of size n. Write 1 = Σ_{x∈X} 1| x ⟩ for the multiset
of single occurrences of elements of X. Show that for each K ≥ 0,

pl[K](1) = unif_{N[K](X)}.
3.4.7 Prove in analogy with Exercise 3.3.7, for an urn ψ ∈ N[L](X) and
elements y, z ∈ X, the following points.
1 When y , z and K ≤ L,
X
hg[K](ψ)(ϕ) · ϕ(y) · ϕ(z)
ϕ∈N[K](X)
ψ(z)
= K · (K −1) · Flrn(ψ)(y) · .
L−1
2 When K ≤ L,
X
hg[K](ψ)(ϕ) · ϕ(y)2
ϕ∈N[K](X)
(K −1) · ψ(y) + (L−K)
= K · Flrn(ψ)(y) · .
L−1
3 And in the Pólya case, when y , z,
X
pl[K](ψ)(ϕ) · ϕ(y) · ϕ(z)
ϕ∈N[K](X)
ψ(z)
= K · (K −1) · Flrn(ψ)(y) · .
L+1
4 Finally,
X
pl[K](ψ)(ϕ) · ϕ(y)2
ϕ∈N[K](X)
(K −1) · ψ(y) + (L+K)
= K · Flrn(ψ)(y) · .
L+1
3.4.8 Use Theorem 3.4.1 to prove the following two recurrence relations
for hypergeometric distributions.
X
hg[K](ψ) = Flrn(ψ)(x) · hg[K −1] ψ−1| x i
x∈supp(ψ)
X ϕ(x) + 1
hg[K](ψ)(ϕ) = · hg[K −1](ψ) ϕ+1| x i
x
K+1
3.4.9 Fix numbers N, M ∈ N and write ψ = N| 0 i + M| 1 i for an urn with N
balls of colour 0 and M of colour 1. Let n ≤ N. Show that:
X N+M+1
hg[n+m](ψ) n| 0 i + m| 1 i = .
0≤m≤M
N+1
Hint: Use Exercise 1.7.3. Notice that the right-hand side does not
depend on n.
3.4.10 This exercise elaborates that draws from an urn excluding one partic-
ular colour can be expressed in binary form. This works for all three
modes of drawing: multinomial, hypergeometric, and Pólya.
Let X be a set with at least two elements, and let x ∈ X be an arbi-
trary but fixed element. We write x⊥ for an element not in X. Assume
k ≤ K.
/ multiple-colours, Urn 0
Urn
0 D(X)
Ot 0
◦ / L(X) × D(X) D(X)
Ut 0
◦ / N(X) × D(X)
(3.17)
-1 N∗ (X)
Ot −
◦ / L(X) × N∗ (X) N∗ (X)
Ut −
◦ / N(X) × N∗ (X)
+1 N(X)
Ot +
◦ / L(X) × N(X) N∗ (X)
Ut +
◦ / N(X) × N∗ (X)
Notice that the draw maps in Table 3.3 in the introduction are written as chan-
nels O0 : D(X) → N[K](X), where the above table contains the corresponding
transition maps, with an extra ‘t’ in the name, as in Ot 0 : D(X) → N(X) ×
D(X).
In each case, the drawn elements are accumulated in the left product-com-
ponent of the codomain of the transition map. They are organised as lists, in
L(X), in the ordered case, and as multisets, in N(X) in the unordered case.
As we have seen before, in the scenarios with deletion/addition the urn is a
multiset, but with replacement it is a distribution.
The crucial observation is that the list and multiset data types that we use for
accumulating drawn elements are monoids, as we have observed early on, in
Lemmas 1.2.2 and 1.4.2. In general, for a monoid M, the mapping X 7→ M × X
is a monad, called the writer monad, see Lemma 1.9.1. It turns out that the
combination of the writer monad with the distribution monad D is again a
monad. This forms the basis for iterating transitions, via Kleisli composition of
this combined monad. We relegate the details of the monad’s flatten operation
to Exercise 3.5.6 below, and concentrate on the unit and Kleisli composition
involved.
Notice the occurrence of the sum + of the monoid M in the first component
of the ket | −, − ⟩ in (3.18). When M is the list monoid, this sum is the (non-
commutative) concatenation ++ of lists, producing an ordered list of drawn
elements. When M is the multiset monoid, this sum is the (commutative) +
D(X)
Ot 0
/ D L(X) × D(X) N(X)
Ut 0
/ D N(X) × N(X)
ω
/
X
ω(x) [x], ω
ω
/
X
ω(x) 1| x i, ω
x∈supp(ω) x∈supp(ω)
N∗ (X)
Ot −
/ D L(X) × N(X) N∗ (X)
Ut −
/ D N(X) × N(X)
X ψ(x) ψ(x)
/ /
X
ψ [x], ψ − 1| x i ψ 1| x i, ψ − 1| x i
x∈supp(ψ)
kψk x∈supp(ψ)
kψk
N∗ (X)
Ot +
/ D L(X) × N∗ (X) N∗ (X)
Ut +
/ D N(X) × N∗ (X)
X ψ(x) X ψ(x)
ψ / [x], ψ + 1| x i ψ / 1| x i, ψ + 1| x i
x∈supp(ψ)
kψk x∈supp(ψ)
kψk
Figure 3.1 Definitions of the six transition channels in Table (3.17) for draw-
ing a single element from an urn. In the “ordered” column on the left the list
monoid L(X) is used, whereas in the “unordered” column the (commutative) mul-
tiset monoid N(X) occurs. In the first row the urn (as distribution ω) remains
unchanged, whereas in the second (resp. third) row the drawn element x is
removed from (resp. added to) the urn ψ. Implicitly it is assumed that the multiset ψ is
non-empty.
Now that we know how to iterate, we need the actual transition maps that can
be iterated, that is, we need concrete definitions of the transition channels in
Table (3.17). They are given in Figure 3.1. In the subsections below we analyse
what iteration means for these six channels. Subsequently, we can describe the
associated K-sized draw channels, as in Table 3.3, as first projection π1 ◦· t K ,
going from urns to drawn elements.
D(X) with replacement. By definition we have as first iteration.
X
Ot 10 (ω) = Ot 0 (ω) = ω(x1 ) [x1 ], ω .
x1 ∈supp(ω)
Accumulation of drawn elements in the first coordinate of −, − starts in the
second iteration:
Ot 20 (ω) = Ot 0 =X
Ot 0 (ω)
= ω(x1 ) · Ot 0 (ω)(`, ω) [x1 ] ++ `, ω
`∈L(X), x1 ∈supp(ω)
X
= ω(x1 ) · ω(x2 ) [x1 ] ++ [x2 ], ω
x1 ,x2 ∈supp(ω)
X
= (ω ⊗ ω)(x1 , x2 ) [x1 , x2 ], ω .
x1 ,x2 ∈supp(ω)
Ot 2− (ψ) = Ot − = Ot − (ψ)
X ψ(x1 ) (ψ−1| x1 i)(x2 )
= x1 , x2 , ψ−1| x1 i−1| x2 i .
·
kψk kψk − 1
x1 ∈supp(ψ),
x2 ∈supp(ψ−1| x1 i
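The iteration pattern can also be carried out mechanically. The following Python sketch implements the unordered-with-deletion transition Ut− for a two-colour urn and iterates it with the Kleisli composition described above; the encoding of urns and drawn multisets as pairs of counts is our own simplification.

    from fractions import Fraction as F

    # Iterating the unordered-with-deletion transition Ut- via the Kleisli
    # composition of the combined monad X |-> D(N(X) x X).

    def ut_minus(urn):
        """One unordered draw with deletion: {(drawn, new urn): probability}."""
        a, b = urn
        out = {}
        if a:
            out[((1, 0), (a - 1, b))] = F(a, a + b)
        if b:
            out[((0, 1), (a, b - 1))] = F(b, a + b)
        return out

    def kleisli_step(f, dist):
        """One more Kleisli iteration, adding newly drawn balls to the record."""
        out = {}
        for (drawn, urn), p in dist.items():
            for (extra, new_urn), q in f(urn).items():
                key = ((drawn[0] + extra[0], drawn[1] + extra[1]), new_urn)
                out[key] = out.get(key, 0) + p * q
        return out

    def iterate(urn, K):
        dist = {((0, 0), urn): F(1)}
        for _ in range(K):
            dist = kleisli_step(ut_minus, dist)
        return dist

    for (drawn, rest), p in sorted(iterate((4, 2), 3).items()):
        print(drawn, rest, p)
    # the first marginal (on `drawn`) is the hypergeometric distribution hg[3]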
This independence means that any order of the elements of the same multiset
of balls gets the same (draw) probability. This is not entirely trivial.
does not depend on the order of the elements in ~x: each element y j occurs n j
times in this product, with multiplicities ψ(y j ), . . . , ψ(y j ) − n j + 1, indepen-
dently of the exact occurrences of the y j in ~x. Thus:
Y Y
ψ − acc(x1 , . . . , xi ) (xi+1 ) = ψ(y j ) · . . . · (ψ(y j ) − n j + 1)
j
0≤i<K Y
= ψ(y j ) · . . . · (ψ(y j ) − ϕ(y j ) + 1)
j
Y ψ(y j )!
=
j (ψ(y j ) − ϕ(y j ))!
Y ψ(y)!
= .
y∈X
(ψ(y) − ϕ(y))!
We can extend the product over j to a product over all y ∈ X since if y <
supp(ϕ), then, even if ψ(y) = 0,
ψ(y)! ψ(y)!
= = 1.
(ψ(y) − ϕ(y))! ψ(y)!
Theorem 3.5.4. Consider the ordered-transition-with-deletion channel Ot −
on ψ ∈ N[L](X).
1 For K ≤ L,
X X ( ψ − ϕ )
Ot −K (ψ) = ~x, ψ − ϕ .
ϕ≤K ψ ~x∈acc −1 (ϕ)
( ψ)
we get:
X X (L−K)! Y ψ(y)!
Ot −K (ψ) = ~x, ψ−ϕ
·
L! y (ψ(y)−ϕ(y))!
ϕ≤K ψ ~x∈acc −1 (ϕ)
y ψ(y)!
Q
X X (L−K)!
= ~x, ψ−ϕ
Q ·
ϕ≤K ψ ~x∈acc −1 (ϕ) y (ψ(y)−ϕ(y))! L!
X X (L−K)! ψ
= · ~x, ψ−ϕ
ϕ≤K ψ ~x∈acc −1 (ϕ)
(ψ−ϕ) L!
X X ( ψ −ϕ )
= ~x, ψ−ϕ .
ϕ≤ ψ
( ψ)
K ~x∈acc (ϕ)
−1
We now move from deletion to addition, that is, from Ot − to Ot + , still in the
ordered case. The analysis is very much as in Lemma 3.5.3.
Proof. The first item is easy, so we concentrate on the second, in line with the
proof of Lemma 3.5.3. If ϕ = acc(~x) = j n j | y j i we now have:
P
Y Y
ψ + acc(x1 , . . . , xi ) (xi+1 ) = ψ(y j ) · . . . · (ψ(y j ) + ϕ(y j ) − 1)
j
0≤i<K
Y (ψ(y j ) + ϕ(y j ) − 1)!
=
j (ψ(y j ) − 1)!
Y (ψ(y) + ϕ(y) − 1)!
= .
y∈supp(ψ)
(ψ(y) − 1)!
ψ
X X 1 ϕ
Ot +K (ψ) = · kψk ~x, ψ + ϕ .
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ)
(ϕ)
K
ψ
X X 1 ϕ
O+ [K](ψ) = · kψk ~x .
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ)
(ϕ)
K
ψ kψk
The multichoose coefficients ϕ sum to K for all ϕ with kϕk = K, see
Proposition 1.7.3.
Y (L + K − 1)!
(L + i) = L · (L + 1) · . . . · (L + K − 1) = .
0≤i<K
(L − 1)!
Ot +K (ψ)
X X (kψk−1)! Y (ψ(y)+ϕ(y)−1)!
= · ~x, ψ+ϕ
ϕ∈N[K](supp(ψ))
(kψk+K −1)! (ψ(y) − 1)!
~x∈acc −1 (ϕ) y∈supp(ψ)
Q (ψ(y)+ϕ(y)−1)!
X X y∈supp(ψ) (ψ(y)−1)!
= ~x, ψ+ϕ
(kψk+K−1)!
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ) (kψk−1)!
(ψ(y)+ϕ(y)−1)!
ϕ(y)!
Q Q
X X y y∈supp(ψ) ϕ(y)!·(ψ(y)−1)!
= ~x, ψ+ϕ
· (kψk+K−1)!
ϕ∈N[K](supp(ψ))
K!
~x∈acc −1 (ϕ) K!·(kψk−1)!
ψ
X X 1 ϕ
= · kψk ~x, ψ+ϕ .
ϕ∈N[K](supp(ψ)) ~x∈acc (ϕ)
−1
( ϕ )
K
2 For ψ ∈ N[L+K](X),
Q ψ(x)
X ψϕ
X x ϕ(x)
Ut −K (ψ) = L+K ϕ, ψ − ϕ = kψk ϕ, ψ − ϕ .
ϕ≤K ψ K ϕ≤K ψ K
3 For ψ ∈ N∗ (X),
Q ψ(x)
ψ
ϕ(x)
x∈supp(ψ) ϕ
X X
Ut +K (ψ) = ϕ, ψ + ϕ = kψk ϕ, ψ + ϕ .
kψk
ϕ∈N[K](X) K ϕ∈N[K](X) K
2 For K = 0 both sides are equal 1 0, ψ . Next, for a multiset ψ ∈ N[L+ K +
1](X) we have:
(3.18)
X ψ(y)
= Ut −K (ψ − 1| y i)(ϕ, χ) · ϕ + 1| y i, χ
L+K +1
y∈supp(ψ), χ∈N[L](X),
ϕ≤K ψ−1| y i
ψ−1| y i
(IH)
X ϕ ψ(y)
= ϕ + 1| yi, ψ − 1| y i − ϕ
L+K ·
L+K +1
y∈supp(ψ), K
ϕ≤K ψ−1| y i
ψ
X ϕ(y) + 1 ϕ+1| y i
= · L+K+1 ϕ + 1| yi, ψ − (ϕ + 1| y i)
K +1
y∈supp(ψ), K+1
ϕ≤K ψ−1| y i
by Exercise 1.7.6
ψ
X χ(y) χ
= · L+K+1 χ, ψ − χ
χ≤K+1 ψ
K +1
K+1
ψ
χ
X
= L+K+1 χ, ψ − χ .
χ≤K+1 ψ K+1
3 The case K = 0 is immediate, so we look at the induction step, for urn ψ
(3.18)
X ψ(y)
= Ut +K (ψ + 1| y i)(ϕ, χ) · ϕ + 1| y i, χ
y∈supp(ψ), χ
L
ψ+1| y i
(IH)
X ϕ ψ(y)
= ϕ + 1| y i, ψ + 1| y i + ϕ
L+1 ·
y∈supp(ψ), ϕ∈N[K](X)
L
K
ψ
X ϕ(y) + 1 ϕ+1| y i
= · L ϕ + 1| y i, ψ + (ϕ + 1| yi)
y∈supp(ψ), ϕ∈N[K](X)
K +1
K+1
by Exercise 1.7.6
ψ
X χ(y) χ
= · L χ, ψ + χ
K +1
y, χ∈N[K+1](X) K+1
ψ
χ
X
= L χ, ψ + χ .
χ∈N[K+1](X) K+1
mn[K] = π1 ◦· Ut 0K C U0 [K].
hg[K] = π1 ◦· Ut −K C U− [K].
Together, Theorems 3.5.2, 3.5.4, 3.5.6 and 3.5.8 give precise descriptions of
the six channels, for ordered / unordered drawing with replacement / deletion
/ addition, as originally introduced in Table (3.3), in the introduction to this
chapter. What remains to do is show that the diagrams (3.4) mentioned there
commute — which relate the various forms of drawing via accumulation and
arrangement. We repeat them below for convenience, but now enriched with
multinomial, hypergeometric, and Pólya channels.
Theorem 3.3.1 precisely says that the diagram on the left commutes. For the
two other diagram we still have to do some work. This provides new character-
isations of unordered draws with deletion / addition, namely as hypergeometric
/ Pólya followed by arrangement.
(acc ⊗ id ) ◦· Ot −K = Ut −K (acc ⊗ id ) ◦· Ot +K = Ut +K .
2 Permuting ordered draws with deletion / addition has no effect: the permu-
tation channel perm = arr ◦· acc : X K → X K satisfies:
(perm ⊗ id ) ◦· Ot −K = Ot −K (perm ⊗ id ) ◦· Ot +K = Ot +K .
= Ut −K (ψ).
= Ut +K (ψ).
acc ◦· O− = acc ◦· π1 ◦· Ot −K
= π1 ◦· (acc ⊗ id ) ◦· Ot −K
= π1 ◦· Ut −K as shown in the first item
= hg[K] by Theorem 3.5.8 (2).
But then:
arr ◦· hg[K] = arr ◦· acc ◦· π1 ◦· Ut −K as just shown
= perm ◦· π1 ◦· Ot −K
= π1 ◦· (perm ⊗ id ) ◦· Ot −K
= π1 ◦· Ot −K see item (2)
= O− .
We conclude with one more result that clarifies the different roles that the
draw-and-delete DD and draw-and-add DA channels play for hypergeometric
and Pólya distributions. So far we have looked only at the first marginal after
iteration. Including also the second marginal gives a fuller picture.
Theorem 3.5.10. 1 In the hypergeometric case, both the first and the second
marginal after iterating Ut − can be expressed in terms of the draw-and-
delete map, as in:
N[L+K](X)
hg[K] = DD L DD K
◦ ◦ Ut −K ◦
z π1 π2 $
N[K](X) o ◦ N[K](X) × N[L](X) ◦ / N[L](X)
2 In the Pólya case, (only) the second marginal can be expressed via the
draw-and-add map:
N[L](X)
pl[K] DA K
◦ ◦
◦ Ut +K
y π1 π2 %
N[K](X) o ◦ N[K](X) × N[L+K](X) ◦ / N[L+K](X)
Proof. 1 The triangle on the left commutes by Theorem 3.5.8 (2) and Theo-
rem 3.4.1. The one on the right follows from the equation:
X ψϕ
DD K (ψ) = ψ − ϕ .
kψk
(3.20)
ϕ≤K ψ K
Exercises
3.5.1 Consider the multiset/urn ψ = 4| a i + 2|b i of size 6 over {a, b}. Com-
pute O− [3](ψ) ∈ D {a, b}3 , both by hand and via the O− -formula in
Theorem 3.5.4 (2)
3.5.2 Check that:
pl[2]( 1| a ⟩ + 2| b ⟩ + 1| c ⟩ )
= 1/10 · | 2| a ⟩ ⟩ + 1/5 · | 1| a ⟩ + 1| b ⟩ ⟩ + 3/10 · | 2| b ⟩ ⟩
+ 1/10 · | 1| a ⟩ + 1| c ⟩ ⟩ + 1/5 · | 1| b ⟩ + 1| c ⟩ ⟩ + 1/10 · | 2| c ⟩ ⟩.
3.5.3 Show that pl[K] : N∗ (X) → D N[K](X) is natural: for each function
f : X → Y one has:
N∗ (X)
pl[K]
/ D N[K](X)
N∗ ( f ) D(N[K]( f ))
N∗ (Y)
pl[K]
/ D N[K](Y)
2 Idem for:
hid ,pl[K]i
N[L](X) ◦ / N[L](X) × N[K](X)
◦+
DA K - N[L+K](X)
3.5.6 This exercise fills in the details of Lemma 3.5.1 and describes the
relevant monad, slightly more generally, for a general monoid M, and
not only for the monoid of multisets.
For an arbitrary set X define the flatten map:
flat / D M × X
D M×D M×X ◦
/
X P
Ω ω Ω(m, ω) · ω(m , x) m + m , x .
0 0
m,m0 ,x
Recall
that we additionally have unit : X → D(M × X) with unit(x) =
1 0, x .
1 Show that unit and flat are both natural transformations.
2 Prove the monad equations from (1.41) for T (X) = D(M × X).
3 Check that the induced Kleisli composition ◦· is the one given in
Lemma 3.5.1: for f : A → D(M × B) and g : B → D(M × C),
g ◦· f (a) = X
flat ◦ T (g) ◦ f (a)
P
= b f (a)(ϕ, b) · g(b)(ψ, c) m + m , c .
0
m,m0 ,c
pml[K]
/ D N[K](X) .
N[K] D(X) (3.22)
The dependence of pml on K (and X) is often left implicit. Notice that pml
can also be written as channel N[K] D(X) → N[K](X). We shall frequently
encounter it in this form in commuting diagrams.
This map pml turns a K-element multiset of distributions over X into a dis-
tribution over K-element multisets over X. It is not immediately clear how to
do this. It turns out that there are several ways to describe pml. This section is
solely devoted to defining this law, in four different manners — yielding each
time the same result. The subsequent section collects basic properties of pml.
First definition
Since the law (3.22) is rather complicated, we start with an example.
Example 3.6.1. Take X = {a, b}, with distributions ω = 1/3 · | a ⟩ + 2/3 · | b ⟩ and
ρ = 3/4 · | a ⟩ + 1/4 · | b ⟩ as in (3.23), and the multiset 2| ω ⟩ + 1| ρ ⟩ ∈ N[3](D(X)).
There are four multisets of size 3 over X:

3| a ⟩    2| a ⟩ + 1| b ⟩    1| a ⟩ + 2| b ⟩    3| b ⟩.
The goal is to assign a probability to each of them. The map pml does this in
the following way:

pml(2| ω ⟩ + 1| ρ ⟩)
= ω(a) · ω(a) · ρ(a) · | 3| a ⟩ ⟩
+ ( ω(a) · ω(a) · ρ(b) + ω(a) · ω(b) · ρ(a) + ω(b) · ω(a) · ρ(a) ) · | 2| a ⟩ + 1| b ⟩ ⟩
+ ( ω(a) · ω(b) · ρ(b) + ω(b) · ω(a) · ρ(b) + ω(b) · ω(b) · ρ(a) ) · | 1| a ⟩ + 2| b ⟩ ⟩
+ ω(b) · ω(b) · ρ(b) · | 3| b ⟩ ⟩
= 1/3 · 1/3 · 3/4 · | 3| a ⟩ ⟩ + ( 1/3 · 1/3 · 1/4 + 1/3 · 2/3 · 3/4 + 2/3 · 1/3 · 3/4 ) · | 2| a ⟩ + 1| b ⟩ ⟩
+ ( 1/3 · 2/3 · 1/4 + 2/3 · 1/3 · 1/4 + 2/3 · 2/3 · 3/4 ) · | 1| a ⟩ + 2| b ⟩ ⟩ + 2/3 · 2/3 · 1/4 · | 3| b ⟩ ⟩
= 1/12 · | 3| a ⟩ ⟩ + 13/36 · | 2| a ⟩ + 1| b ⟩ ⟩ + 4/9 · | 1| a ⟩ + 2| b ⟩ ⟩ + 1/9 · | 3| b ⟩ ⟩.
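The first definition translates directly into code. The sketch below multiplies out the product state and accumulates, reproducing the four probabilities just computed; the list-of-(distribution, multiplicity)-pairs representation of the multiset 2|ω⟩ + 1|ρ⟩ is an implementation choice of ours.

    from fractions import Fraction as F
    from itertools import product
    from collections import Counter

    # Brute-force sketch of the first definition of pml: multiply out the
    # product state of the distributions in the multiset and accumulate.

    def pml(multiset_of_dists):
        """multiset_of_dists: list of (distribution, multiplicity) pairs."""
        factors = [d for d, n in multiset_of_dists for _ in range(n)]
        out = {}
        for xs in product(*[list(d) for d in factors]):
            prob = F(1)
            for d, x in zip(factors, xs):
                prob *= d[x]
            phi = tuple(sorted(Counter(xs).items()))
            out[phi] = out.get(phi, 0) + prob
        return out

    omega = {'a': F(1, 3), 'b': F(2, 3)}
    rho = {'a': F(3, 4), 'b': F(1, 4)}
    print(pml([(omega, 2), (rho, 1)]))
    # {(('a',3),): 1/12, (('a',2),('b',1)): 13/36, (('a',1),('b',2)): 4/9, (('b',3),): 1/9}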
There is a pattern. Let’s try to formulate the law pml from (3.22) in general,
for arbitrary K and X. It is defined on a natural multiset i ni | ωi i with multi-
P
plicities ni ∈ N satisfying i ni = K, and with distributions ωi ∈ D(X). The
P
number pml i ni | ωi i (ϕ) describes the probability of the K-sized multiset ϕ
P
over X, by using for each element occurring in ϕ the probability of that element
in the corresponding distribution in i ni | ωi i.
P
In order to make this description precise we assume that the indices i are
somehow ordered, say as i1 , . . . , im and use this ordering to form a product
state:
Second definition
There is an alternative formulation of the parallel multinomial law, using sums
of parallel multinomial distributions via ⊗. This formulation is the basis for the
phrase ‘parallel multinomial’.
+
P N
pml i ni |ωi i B D i mn[ni ](ωi )
X Q P (3.25)
= i mn[ni ](ωi )(ϕi ) i ϕi .
i, ϕi ∈N[ni ](X)
Note that this sum has type ∏_i N[n_i](X) → N[Σ_i n_i](X).
This definition (3.25) may be seen as a sum of multinomials, in the style of
Proposition 2.4.5. But this is justified via inclusions N[ni ](X) ,→ N(X). This
perspective is exploited in the fourth definition below.
Proposition 3.6.2. The definitions of the law pml in (3.24) and (3.25) are
equivalent.
Proof. Because:
(3.24) X N ni
ni | ωi i = ωi (~x) acc(~x)
P
pml i
~x∈X K
X Q P
= ωni i (~xi ) i acc(~xi )
i see Exercise 1.6.5
i, ~xi ∈X ni
X Q P P
= i ~xi ∈acc(ϕi ) ωni i (~xi ) i ϕi
i, ϕi ∈N[ni ](X)
X Q P
= i D(acc)(ωni i )(ϕi ) i ϕi
i, ϕi ∈N[ni ](X)
X Q P
= i acc ◦· iid ωi )(ϕi ) i ϕi
i, ϕi ∈N[ni ](X)
X Q P
= i mn[ni ](ωi )(ϕi ) i ϕi .
i, ϕi ∈N[ni ](X)
This last equation is an instance of Theorem 3.3.1 (2). It leads to the second
formulation of the pml in (3.25).
Example 3.6.3. We continue Example 3.6.1 but now we describe the applica-
tion of the parallel multinomial law pml in terms of multinomials, as in (3.25).
We use the same multiset 2| ω i + 1| ρi of distributions ω, ρ from (3.23). The
calculation of pml on this multiset, according to the second definition (3.25),
is a bit more complicated than in Example 3.6.1 according to the first defini-
tion, since we have to evaluate the multinomial expressions. But of course the
pml 2| ω i + 1| ρi
X
= mn[2](ω)(ϕ) · mn[1](ρ)(ψ) ϕ + ψ
ϕ∈N[2](X), ψ∈N[1](X)
= mn[2](ω)(2| a i) · mn[1](ρ)(1|a i) 3| a i
+ mn[2](ω)(2| ai) · mn[1](ρ)(1| b i)
+ mn[2](ω)(1| a i + 1| b i) · mn[1](ρ)(1| ai) 2| ai + 1| b i
+ mn[2](ω)(1| a i + 1| b i) · mn[1](ρ)(1|b i)
+ mn[2](ω)(2| b i) · mn[1](ρ)(1|a i) 1| ai + 2| bi
+ mn[2](ω)(2| b i) · mn[1](ρ)(1| bi) 3| b i
2 1
= 2,0 · ω(a)2 · 1,0 · ρ(a) 3| ai
2 1 2 1
+ 2,0 · ω(a)2 · 1,0 · ρ(b) + 1,1 · ω(a) · ω(b) · 1,0 · ρ(a) 2| ai + 1| bi
2 1 2 1
+ 1,1 · ω(a) · ω(b) · 1,0 · ρ(b) + 2,0 · ω(b)2 · 1,0 · ρ(a) 1| ai + 2| b i
2 1
+ 2,0 · ω(b)2 · 1,0 · ρ(b) 3| bi
= 31 2 · 34 3| a i + 13 2 · 14 + 2 · 31 · 23 · 34 2| ai + 1| bi
+ 2 · 1 · 2 · 1 + 2 2 · 3 1| a i + 2| b i + 2 2 · 1 3| bi
3 3 4 3 4 3 4
= 1
+ 13
+ 1|b i + 4
+ 2| bi + 19 3|b i .
12 3|a i 36 2| a i 9 1| a i
Third definition
Our third definition of pml is more abstract than the two earlier ones. It uses
the coequaliser property of Proposition 3.1.1. It determines pml as the unique
(dashed) map in:
D(X)K
acc / N[K] D(X)
N '
D(X K ) pml (3.26)
D(acc) )
D N[K](X)
π : {1, . . . , K} →
{1, . . . , K},
X
f ω1 , . . . , ωK (ϕ) = ω1 ⊗ · · · ⊗ ωK (x1 , . . . , xK )
~x∈acc −1 (ϕ)
(∗)
X
= ω1 ⊗ · · · ⊗ ωK (yπ−1 (1) , . . . , yπ−1 (K) )
~y∈acc −1 (ϕ)
X
= ωπ(1) ⊗ · · · ⊗ ωπ(K) (y1 , . . . , yK )
~y∈acc −1 (ϕ)
= f ωπ(1) , . . . , ωπ(K) (ϕ).
(∗)
The marked equation = holds because accumulation is stable under permuta-
tion. This implies that if ~x is in acc −1 (ϕ), then each permutation of ~x is also in
acc −1 (ϕ).
Proposition 3.6.4. The definitions of pml in (3.24), (3.25) and (3.26) are all
equivalent.
Proof. By the uniqueness property in the triangle (3.26) it suffices to prove
that pml as described in the first (3.24) or second (3.25) formulation makes
this triangle commute. For this we use the first version. We rely again on
the fact that accumulation is stable under permutation. Assume we have ω ~ =
(ω1 , . . . , ωK ) ∈ D(X)K with acc(~ω) = i∈S ni |ωi i, for S ⊆ {1, . . . , K}. Then:
P
ω) = pml i∈S ni | ωi i
P
pml ◦ acc (~
(3.24)
X N
= i ωi (~
ni
x) acc(~x)
~x∈X K
X
= ω1 ⊗ · · · ⊗ ωK (~x) acc(~x)
~x∈X K
= D(acc) ω) .
N
(~
Implicitly, for well-definedness of the first definitionN
3.24 of pml we already
used that the precise ordering of states in the tensor ω) is irrelevant in the
(~
formulation of pml.
This third formulation of the parallel multinomial law is not very useful for
actual calculations, like in Examples 3.6.1 and 3.6.3. But it is useful for proving
properties about pml, via the uniqueness part of the third definition. This will
be illustrated in Exercise 3.6.3 below. More generally, the fact that we have
three equivalent formulations of the same law allows us to switch freely and
use whichever formulation is most convenient in a particular situation.
Fourth definition
For our fourth and last definition we have to piece together some earlier obser-
vations.
2 Recall from Proposition 1.4.5 that such commutative monoid structure cor-
responds to an N-algebra α : N(D(M)) → D(M), given by:
X
α i ni | ωi i = ni · ωi
P
Xi N P (3.27)
= ωni (~x) ~x where K = n .
P
i i i i
~x∈M K
α / DN(X)
N DN(X) (3.28)
ND(unit N ) α
/ NDN(X) / DN(X) .
pml B ND(X) (3.29)
ni | ωi i.
P
Proof. We elaborate formulation (3.29), on a K-sized multiset i
(3.29) P
pml i ni | ωi i = α i ni D(unit)(ωi )
P
(3.27)
X N P
= ni
i D(unit)(ωi ) (~ ~
ϕ) ϕ
~ ∈N(X)K
ϕ
X N
= ωni i (x1 , . . . , xK ) 1| x1 i + · · · + 1| xK i
i
x1 ,...,xK ∈X
X N
= ωni i (~x) acc(~x)
i
~x∈X K
The last line coincides with the first formulation (3.24) of pml.
Exercises
3.6.1 Check that the multinomial channel can be obtained via the parallel
multinomial law, in two different ways.
1 Use the first or second formulation, (3.24) or (3.25), of pml to com-
pute that for a distribution ω ∈ D(X) and number K ∈ N one has:
mn[K](ω) = pml K| ωi ,
that is:
D(X)
mn[K]
/ D N[K](X)
;
K·unit * pml
N[K] D(X)
2 Prove the same thing via the third formulation (3.26), and via The-
orem 3.3.1 (2).
3.6.2 Apply the multiset flatten map flat : M(M(X)) → M(X) in the setting
of Example 3.6.1 to show that:
flat 2| ωi + 1| ρi = 12
17
| ai + 19
12 |b i = flat pml 2| ωi + 1| ρi .
N[K] D(X)
pml
/ D N[K](X)
N[K](D( f )) D(N[K]( f ))
N[K] D(Y)
pml
/ D N[K](Y).
1 Prove this claim via a direct calculation using the first (3.24) or
second (3.25) formulation of pml.
2 Give an alternative proof using the uniqueness part of the third for-
mulation (3.26), as suggested in the diagram:
D(X)K
acc / N[K] D(X)
D( f )K
)
D(Y)K
N )
D(Y K )
D(acc) *
D N[K](Y)
D(X)K
acc
◦ / N[K] D(X) arr
◦ / D(X)K
N N
◦ pml ◦ ◦
XK
acc
◦ / N[K](X) arr
◦ / XK
Proof. The rectangle on the left is the third formulation of pml in (3.26)
and thus provides uniqueness. Commutation of the rectangle of the right fol-
lows from a uniqueness argument, using that the outer rectangle commutes by
Proposition 3.1.3
◦· arr ◦· acc = arr ◦· acc ◦·
N N
by Proposition 3.1.3
= arr ◦· pml ◦· acc by (3.26).
N
This result shows that pml is squeezed betweenN , both on the left and on
the right. We have seen in Exercise 2.4.10 that is a distributive law. We
shall prove the same about pml below.
But first we show how pml interacts with frequentist learning.
Theorem 3.7.2. The distributive law pml commutes with frequentist learning,
in the sense that for Ψ ∈ N[K] D(X) ,
N[K] D(X)
pml
◦ / N[K](X)
Flrn ◦ ◦ Flrn
D(X)
id
◦ /X
The channel D(X) → X at the bottom is the identity function D(X) → D(X).
Proof. We use the second formulation of the parallel multinomial law (3.25).
Let Ψ = i ni | ωi i ∈ N[K] D(X) .
P
= flat Flrn(Ψ) .
= n1 · ω1 + · · · + nk · ωk by Proposition 3.3.7
= flat(Ψ).
Theorem 3.7.4. 1 The parallel multinomial law pml commutes with multino-
mials in the following manner.
D D(X)
mn[K]
◦ / M[K] D(X)
flat ◦ pml
D(X)
mn[K]
◦ / M[K](X)
Proof. 1 We use that mn[K] = acc ◦· iid [K], see Theorem 3.3.1 (2), in:
D D(X)
iid
◦ / D(X)K acc
◦ / M[K] D(X)
N
◦ ◦ pml
flat
D(X)
iid
◦ / XK acc
◦ / M[K](X)
The rectangle on the left commutes by Exercise 2.4.9, and the one on the
right by Proposition 3.7.1.
2 We show that precomposing both legs in the diagram with the accumula-
tion map acc : D(X)K → M[K]D(X) yields an equality. This suffices by
Proposition 3.1.1.
K+1
Proof. We use the probabilistic projectionN channel ppr : X → X K from
Exercise 3.3.13 and its interaction with in Exercise 3.3.14.
Corollary 3.7.6. The parallel multinomial law commutes with the hypergeo-
metric channel: for L ≥ K one has:
N[L] D(X)
hg[K]
◦ / N[K] D(X)
pml ◦ ◦ pml
N[L](X)
hg[K]
◦ / N[K](X)
Proof. Theorem 3.4.1 shows that the hypergeometric distribution can be ex-
pressed via iterated draw-and-deletes. Hence the result follows from (iterated
application of) Proposition 3.7.5.
We continue to show that the parallel multinomial law pml commutes with
the unit and flatten operations of the distribution monad. This shows that pml
is an instance of what is called a distributive law in category theory. Such laws
are important in combining different forms of computation. A notorious result,
noted some twenty years ago by Gordon Plotkin, is that the powerset monad P
does not distribute over the probability distributions monad D. He never pub-
lished this important no-go result himself. Instead, it appeared in [161, 162]
(with full credits). This negative result is interpreted as: there is no semanti-
cally solid way to combine non-deterministic and probabilistic computation.
The fact that a distributive law for multisets and distributions does exist shows
that multiset-computations and probability can be combined. Indeed, in Corol-
lary 3.7.8 we shall see that the K-sized multiset functor N[K] can be ‘lifted’
to the category of probabilistic channels.
But first we have to show that pml is a distributive law. We have already
seen in Exercise 3.6.3 that it is natural.
Proposition 3.7.7. The parallel multinomial law pml is a distributive law of
the K-sized multiset functor N[K] over the distribution monad D. This means
that pml commutes with the unit and flatten operations of D, as expressed by
the following two diagrams.
N[K](X) unit
N[K](unit)
#
N[K] D(X)
pml
/ D N[K](X)
N[K] D2 (X)
pml
/ D N[K] D(X) D(pml)
/ D2 N[K](X)
N[K](flat)
flat
N[K] D(X)
pml
/ D N[K](X)
N
Proof. In Exercise 2.4.10 we have seen that the big tensor : D(X)K →
D(X K ) is a distributive law. These properties will be used to show that pml is
a distributive law too. We exploit the uniqueness property of the third formu-
lation 3.26.
pml ◦ N[K](unit) ◦ acc
= pml ◦ acc ◦ unit K by naturality of acc, see Exercise 1.6.6
= D(acc) ◦ ◦ unit K
N
by (3.26)
= D(acc) ◦ unit via the first diagram in Exercise 2.4.10
= unit ◦ acc by naturality of unit.
Similarly for the flatten-diagram:
N[K](X)
N[K]( f )
/ N[K] D(Y) pml
/ D N[K](Y).
N[K] D(X) × N[K] D(Y)
pml⊗pml
◦ / N[K](X) × N[K](Y)
mzip ◦
N[K] D(X) × D(Y) ◦ mzip
N[K](⊗) ◦
N[K] D(X × Y)
pml
◦ / N[K](X × Y)
Proof. The result follows from a big diagram chase in which the mzip opera-
tions on the left and on the right are expanded, according to (3.7).
N[K] D(X) × N[K] D(Y)
pml⊗pml
◦ / N[K](X) × N[K](Y)
arr⊗arr ◦ arr⊗arr
N N
⊗
D(X) × D(Y)K
K ◦ / XK × Y K
zip ◦ ◦ zip
◦ mzip
N
D(X) × D(Y) K
⊗K
◦ / D(X × Y)K ◦ / (X × Y)K mzip ◦
acc ◦
/ N[K] D(X) × D(Y) ◦ acc
acc
N[K](⊗) ◦
v
N[K] D(X × Y)
pml
◦ / N[K](X × Y) o
N[K](X) × N[K](Y)
mzip
◦ / N[K](X × Y)
N[K]( f )⊗N[K](g) ◦ ◦ N[K]( f ⊗g)
N[K](U) × N[K](V)
mzip
◦ / N[K](U × V)
In combination with the unit and associativity of Proposition 3.2.2 (2) and (5)
this means that the lifted functor N[K] : Chan(D) → Chan(D) is monoidal
via mzip.
195
196 Chapter 3. Drawing from an urn
Proof. This result is rather subtle, since f, g are used as channels. So when we
write N[K]( f ) we mean application of the lifted functor N[K] : Chan(D) →
Chan(D), as described in the last line of the proof of Corollary 3.7.8, produc-
ing another channel.
The left-hand side of the equation (3.30) thus expands as in the first equation
below.
mzip ◦· N[K]( f ) ⊗ N[K](g)
=
mzip ◦· (pml ⊗ pml) ◦ N[K]( f ) × N[K](g)
=
pml ◦· N[K](⊗) ◦· mzip ◦ N[K]( f ) × N[K](g) by Lemma 3.7.9
= pml ◦· N[K](⊗) ◦· N[K]( f × g) ◦· mzip by Proposition 3.2.2 (1)
= pml ◦· N[K]( f ⊗ g) ◦· mzip
=
N[K] f ⊗ g ◦· mzip.
For this result we really need the multizip operation mzip. One may think
that one can use tensors ⊗ instead, but the tensor-version of Lemma 3.7.9 does
not hold, see Exercise 3.7.5 below.
We show that the other operations are natural with respect to channels,
namely arrangement and draw-delete.
N[K]
N[K+1]
$
w
*
w
arr
w
(−)K
/ Chan
w
Chan Chan 4 Chan
w
; DD
w w
w
acc
w
N[K]
N[K]
Proposition 3.2.2 (4) and Exercise 3.3.15 (3) say that the natural transforma-
tions arr and DD are monoidal.
196
3.7. The parallel multinomial law: basic properties 197
N[K](X)
arr / D(X K )
K
N[K]( f )
D( f )
N[K]D(Y)
arr / D(D(Y)K )
N
N[K]( f )
D( ) f K = (−)
= (−) DD(Y K )
N
pml
! flat
/ DN[K](Y) D(arr)
/ DD(Y K ) flat / D(Y K ) o
O
arr = (−)
The upper rectangle is naturality of arr, for ordinary functions, see Exercise 2.2.7.
The lower rectangle commutes by Proposition 3.7.1.
For accumulation the situation is a bit simpler. The required equality acc ◦·
f = N[K]( f ) ◦· acc is obtained in:
K
XK
acc / N[K](X)
fK
N[K]( f )
f K
D(Y)K
acc / N[K]D(Y) N[K]( f )
N
pml
/ D(Y K ) D(acc)
/ DN[K](Y) o
The upper part is ordinary naturality of acc and the lower is the third formula-
tion (3.26) of pml.
For naturality of draw-delete we need to prove DD◦·N[K+1]( f ) = N[K]( f )◦·
DD, that is DD = (−) ◦ N[K + 1]( f ) = N[K]( f ) = (−) ◦ DD.
N[K +1](X)
DD / N[K](X)
N[K+1]( f )
DN[K]( f )
N[K +1]D(Y)
DD / DN[K]D(Y)
N[K+1]( f ) D(pml) f K = (−)
pml pml = (−) DDN[K](Y)
& flat
/ DN[K +1](Y) D(DD)
/ DDN[K](Y) flat / DN[K](Y) o
O
DD = (−)
The upper diagram is naturality of draw-delete, see Exercise 2.2.8. The lower
rectangle commutes by Proposition 3.7.5.
We have started this section with a 3 × 2 table (3.3) with the six options for
197
198 Chapter 3. Drawing from an urn
drawing balls from an urn, namely orderered or unordered, and with replace-
ment, deletion or addition. We have described these six draw maps as channels,
for instance of the form U0 : D(X) → N[K](X) and we saw at various stages
that these maps are natural in X. But this meant naturality with respect to func-
tions. But now that we know that N[K] is a functor Chan → Chan, we can
also ask if these draw maps are natural with respect to channels.
This turns out to be the case. Besides N[K] there are other functors involved
in these draw channels, namely distribution D and power (−)K . They also have
to be lifted to functors Chan → Chan. For D we recall Exercise 1.9.8 which
that D lifts to D : Chan → Chan. The power functor (−)K lifts by using
tells N
that is a distributive law, see Exercise 2.4.10.
Theorem 3.7.12. The following four draw maps, out of the six in Table 3.3,
are natural w.r.t. channels, and thus form natural transformations in diagrams:
Chan Chan
O0 O−
D =⇒ (−)K N[L] =⇒ (−)K
ordered
Chan Chan
“independent identical” (3.31)
Chan Chan
U0 U−
D =⇒ N[K] N[L] =⇒ N[K]
unordered
Chan Chan
“multinomial” “hypergeometric”
These natural transformations are all monoidal, by Lemma 2.4.8 (2), by Corol-
laries 3.3.3 and 3.4.2 (7), and finally by Proposition 3.2.2 (4).
The Pólya channel from Section 3.4 does not fit in this table since it is not
natural w.r.t. channels. Intuitively, this can be explained from the fact that Pólya
involves copying, and copying does not commute with channels, as we have
seen early on in Exercises 2.5.10. Pólya channels are natural w.r.t. functions,
see Exercise 3.5.3, and indeed, functions do commute with copying, see Exer-
cise 2.5.12.
Proof. Naturality of O0 = iid in the above table (3.31) is given by the diagram
on the right in Lemma 2.4.8 (1). Combining this fact with naturality of acc
from Lemma 3.7.11 gives natuality for O− = mn[K]. This uses that mn[K] =
acc ◦· iid , see Theorem 3.3.1.
198
3.7. The parallel multinomial law: basic properties 199
Exercises
3.7.1 Consider the two distributions ω, ρ in (3.23) and check yourself the
following equation, which is an instance of Theorem 3.7.2.
3.7.3 Check that the construction of Corollary 3.7.8 indeed yields a functor
N[K] : Chan(D) → Chan(D).
3.7.4 Show that the lifted functors N[K] : K`(D) → K`(D) commute with
sums of multisets: for a channel f : X → Y,
+ / N[K +L](X)
N[K](X) × N[L](Y) ◦
3.7.5 The parallel multinomial law pml does not commute with tensors (of
multisets and distributions), as in the following diagram.
N[K] D(X) × N[L] D(Y)
pml⊗pml
◦ / N[K](X) × N[K](Y)
⊗◦
N[K ·L] D(X) × D(Y) , ◦⊗
N[K·L](⊗) ◦
N[K ·L] D(X × Y)
pml
◦ / N[K ·L](X × Y)
199
200 Chapter 3. Drawing from an urn
1 Calculate:
2 And also:
N[K+1]
w *
Chan 4 Chan
w
Flrn
w
id
This allows us to see that the unrestricted version pml : ND(X) → DN(X) is
natural in X, using that N preserves size, see Exercise 1.5.2. Thus, let f : X →
Y and ϕ = i ni |ωi i ∈ N(D(X)) with kϕk = i ni = K. Then:
P P
P
P
ND( f )(ϕ)
=
i ni D( f )(ωi )
= i ni = K.
200
3.8. Parallel multinomials as law of monads 201
Hence:
pml ND( f )(ϕ) = pml kN[K]D( f )(ϕ)k N[K]D( f )(ϕ)
= pml[K] N[K]D( f )(ϕ)
= DN[K]( f ) pml[K](ϕ) by naturality of pml[K]
= DN( f ) pml(ϕ) .
We have seen in Proposition 3.7.7 that pml interacts appropriately with the
unit and flatten operations of the distribution monad D. Since we now use pml
in unrestricted form we can also look at interaction with the unit and flatten of
the (natural) multiset monad N.
Lemma 3.8.1. The parallel multinomial law pml : ND(X) → DN(X) com-
mutes in the following way with the unit and flatten operations of the monad
N.
Proof. The equation for the units is easy. For ω ∈ D(X) we have, by Exer-
cise 3.6.1,
X
pml ◦ unit (ω) = pml(1| ωi) = mn[1](ω) = ω(x) 1| x i
x∈X
= D(unit)(ω).
For flatten we have to do a bit more work. We recall the above N-algebra
α : N(D(M)) → D(M) in (3.27), for a commutative monoid M. We also re-
call that by Proposition 1.4.5 the following two diagrams commute, where
f : M1 → M2 is a map of (commutative) monoids.
N 2 D(M)
N(α)
/ ND(M) ND(M1 )
ND( f )
/ ND(M2 )
flat α α α (3.32)
α
ND(M) / D(M) D(M1 )
D( f )
/ D(M2 )
201
202 Chapter 3. Drawing from an urn
In order to show that the parallel multinomial law pml : ND(X) → DN(X) is a
map of monoids it suffices by Proposition 1.4.5 to show that the diagram (1.21)
commutes. This is the outer rectangle in:
N(pml)
N 2 D(unit)
N 2 D(X) / N 2 DN(X) N(α)
/ NDN(X)
flat flat α
α
ND(X)
ND(unit)
/ NDN(X) / DN(X)
O
pml
The rectangle on the left commutes by naturality of flat; the one on the right is
an instance of the rectangle on the left in (3.32).
2 D(X) D(unit)
unit
(
unit B X DN(X)
6
, N(X)
unit N(unit)
1 DN 2 (X)
flat D(flat)
)
D(pml)
/ D2 N 2 (X)
flat B DNDN(X) DN(X)
5
- D2 N(X)
D2 (flat) flat
Proof. This is a standard result in category, originally from Beck, see e.g. [12,
9, 71]. The diamonds in the above descriptions of unit and flatten commute by
naturality.
202
3.8. Parallel multinomials as law of monads 203
203
204 Chapter 3. Drawing from an urn
(3.34)
its ◦ flat DN = flat M ◦ σ ◦ D(τ) ◦ D(flat N ) ◦ flat D ◦ D(pml)
= flat M ◦ σ ◦ D(flat M ) ◦ DM(τ) ◦ D(τ) ◦ flat D ◦ D(pml)
= flat M ◦ M(flat M ) ◦ M2 (τ) ◦ M(τ) ◦ σ ◦ flat D ◦ D(pml)
= flat M ◦ flat M ◦ M2 (τ) ◦ M(τ) ◦ flat M ◦ σ ◦ D(σ) ◦ D(pml)
= flat M ◦ flat M ◦ flat M ◦ M3 (τ) ◦ M2 (τ) ◦ σ ◦ D(σ) ◦ D(pml)
= flat M ◦ flat M ◦ M(flat M ) ◦ M3 (τ) ◦ σ ◦ DM(τ) ◦ D(σ) ◦ D(pml)
= flat M ◦ flat M ◦ M2 (τ) ◦ M(flat M ) ◦ σ ◦ DM(τ) ◦ D(σ) ◦ D(pml)
= flat M ◦ flat M ◦ σ ◦ DM(τ) ◦ D(flat M ) ◦ DM(τ) ◦ D(σ) ◦ D(pml)
(3.33)
= flat M ◦ M(flat M ) ◦ σ ◦ DM(τ) ◦ D(flat M ) ◦ D(τ) ◦ DN(σ)
= flat M ◦ σ ◦ D(flat M ) ◦ D(flat M ) ◦ DM2 (τ) ◦ D(τ) ◦ DN(σ)
= flat M ◦ σ ◦ D(flat M ) ◦ DM(flat M ) ◦ DM2 (τ) ◦ D(τ) ◦ DN(σ)
= flat M ◦ M(flat M ) ◦ σ ◦ DM(flat M ) ◦ DM2 (τ) ◦ D(τ) ◦ DN(σ)
= flat M ◦ flat M ◦ σ ◦ D(τ) ◦ DN(flat M ) ◦ DNM(τ) ◦ DN(σ)
(3.34)
= flat M ◦ its ◦ DN(its).
Figure 3.2 Equational proof that the intensity natural transformation its
from (3.34) commutes with flattens, as part of the proof of Theorem 3.8.3.
ND(X), +, 0
pml
/ DN(X), +, 0
its
- M(X), +, 0.
Hence its 1| 0 i = 0.
204
3.8. Parallel multinomials as law of monads 205
for each x ∈ X,
X
its(ω + ρ)(x) = flat(ω + ρ)(x) = (ω + ρ)(ϕ) · ϕ(x)
ϕ∈N(X)
(2.18)
X
= D(+) ω ⊗ ρ (ϕ) · ϕ(x)
ϕ∈N(X)
X
= ω(ψ) · ρ(χ) · (ψ + χ)(x)
ψ,χ∈N(X)
X
= ω(ψ) · ρ(χ) · (ψ(x) + χ(x))
ψ,χ∈N(X)
X X X X
= ω(ψ) · ρ(χ) · ψ(x) +
ω(ψ) · ρ(χ) · χ(x)
ψ∈N(X) χ∈N(X) χ∈N(X) ψ∈N(X)
= flat(ω)(x) + flat(ρ)(x)
= its(ω) + its(ρ) (x).
Exercises
its mn[K](ω) = K · ω
its hg[K](ψ) = K · Flrn(ψ)
its pl[K](ψ) = K · Flrn(ψ).
3.8.3 In Corollary 3.7.8 we have seen the lifted functor N[K] : Chan(D) →
Chan(D). The flatten operation flat : NN ⇒ N for (natural) mul-
tisets, from Subsection 1.4.2, restricts to N[K]N[L] ⇒ N[K · L],
making N[K] : Sets → Sets into what is called a graded monad, see
e.g. [123, 54]. Also the lifted functor N[K] : Chan(D) → Chan(D)
is such a graded monad, essentially by Lemma 3.8.1.
The aim of this exercise is to show that these (lifted) flattens do
not commute with multizip. This means that the following diagram of
205
206 Chapter 3. Drawing from an urn
N[K](mzip) ◦
N[K]N[L](X × Y)
flat
◦ / N[K · L](X × Y)
3 Show next:
mzip(ϕ1 , ψ1 ) = 2
+ 1| a, 1 i + 1|b, 0i + 13 2| a, 0 i + 1|b, 1i
3 1| a, 0i
mzip(ϕ1 , ψ2 ) = 1 2| a, 1 i + 1| b, 1i
2
mzip(ϕ2 , ψ1 ) = 3 1| a, 1i + 2| b, 0 i + 3 1| a, 0i + 1| b, 0 i + 1|b, 1i
1
mzip(ϕ2 , ψ2 ) = 1 1| a, 1 i + 2| b, 1i .
206
3.9. Ewens distributions 207
This differs from what we get in the first item, via the east-south
route.
We shall write E(K) ⊆ N(N) for the set of Ewens multisets with mean K. It is
easy to see that kϕk ≤ K for ϕ ∈ E(K).
207
208 Chapter 3. Drawing from an urn
For instance, for mean K = 4 there are the following Ewens multisets with
mean K.
4| 1 i 2| 1 i + 1| 2 i 2| 2 i 1| 1i + 1| 3 i 1| 4 i. (3.35)
Recall that we have been counting lists of coin values in Subsection 1.2.3.
There we saw in (1.7) all lists of coins with values 1, 2, 3, 4 that add up to 4.
As we see, the accumulations of these lists are the Ewens multisets of mean 4.
In these multiset representations we do not care about the order of the coins.
There are several possible ways of using an Ewens multiset ϕ = k nk |k i ∈
P
E(K), so that k nk · k = K.
P
• We can think of nk as the number of coins with value k that are used in ϕ to
form an amount K.
• We can also think of nk as the number of tables with k customers in an
arrangement of K guests in a restaurant, see [6] (or [2, §11.19]) for an early
description, and also Exercise 3.9.7 below.
• In genetics, each gene has an ‘allelic type’ ϕ, where each nk is the number
of alleles appearing k times.
• An Ewens multiset with mean K is type of a partition of the set {1, 2, . . . , K}
in [100]. It tells how many subsets in a partitition have k elements.
Definition 3.9.2. Let X be an arbitrary set. For each number K there is a mul-
tiplicity count function:
mc /
X
N[K](X) E(K) given by mc(ϕ) B ϕ−1 (k) | k i.
1≤k≤kϕk
Thus, the function mc is defined via the size | − | of inverse image subsets,
on k ≥ 1 as:
mc(ϕ)(k) B ϕ−1 (k) = {x ∈ X | ϕ(x) = k} . (3.36)
208
3.9. Ewens distributions 209
We then use sums of multisets implicitly. For instance, for X = {a, b, c} and
K = 10,
mc 2| a i + 6| b i + 2|c i = 1| 2i + 1| 6 i + 1| 2 i = 2|2 i + 1| 6i
mc 2| a i + 3| b i + 5|c i = 1| 2i + 1| 3 i + 1| 5 i.
2 For arbitrary x ∈ X,
mc ϕ + 1| x i = mc ϕ + 1|ϕ(x)+1i − 1| ϕ(x) i.
Y
4 ϕ = k! mc(ϕ)(k) .
1≤k≤K
5 For each number K and for each set X with at least K elements, the multi-
plicity count function mc : N[K](X) → E(K) is surjective.
Proof. 1 Obvious, since permuting the elements of a multiset does not change
the multiplicities that it has.
2 Let k = ϕ(x). If we add another element x to the multiset ϕ, then the number
mc(ϕ)(k) of elements in ϕ occuring k times is decreased by one, and the
number mc(ϕ)(k + 1) of elements occurring k + 1 times is increased by one.
3 By a similar argument.
4 By induction on K ≥ 1. The statement clearly holds when K = 1. Next, by
209
210 Chapter 3. Drawing from an urn
+1)! if ϕ = K| x i
(K
= (ϕ(x)+1)!
Y
k! mc(ϕ)(k) ·
otherwise
ϕ(x)!
1≤k≤K
(ϕ + 1| x i)
(IH) if ϕ = K| x i
=
ϕ · (ϕ(x)+1) otherwise
= (ϕ + 1| x i) .
There are similar channels for Ewens multisets, which we shall write with an
extra ‘E’ for ‘Ewens’:
EDD
t ◦
E(K)
◦
3 E(K +1) (3.37)
EDA(t)
210
3.9. Ewens distributions 211
to shift in the other direction. This is achieved in the formulations in the next
definition. The deletion construction comes from [100, 101] and the addition
construction is used in the ‘Chinese restaurant’ illustration, see Exercise 3.9.7
below. Both constructions can be found in [27, §3.1 and §4.5].
Later on, in Proposition 6.6.6, we shall show that the channels EDD and
EDA(t) are closely related, in the sense that they are each other’s ‘daggers’.
Definition 3.9.4. For Ewens multisets ϕ ∈ E(K) and ψ ∈ E(K+1) and for t > 0
define:
t X ϕ(k) · k
ϕ + 1| 1 i + ϕ − 1| k i + 1| k+1 i
EDA(t)(ϕ) B
K +t 1≤k≤K
K +t
ψ(1) X ψ(k) · k
ψ − 1| 1 i + ψ + 1| k−1 i − 1| k i
EDD(ψ) B
K +1 2≤k≤K+1
K +1
N[K](X) o
DD
◦ N[K +1](X)
mc ◦ ◦ mc
E(K) o ◦ E(K +1)
EDD
Proof. We recall from Definition 2.2.4 that for ψ ∈ N[K +1](X) one has:
X ψ(x)
ψ − 1| x i
DD(ψ) B
x∈supp(ψ)
K +1
211
212 Chapter 3. Drawing from an urn
D(mc) DD(ψ)
X ψ(x)
= mc(ψ−1| x i)
x∈supp(ψ)
K +1
X ψ(x) if ψ(x) = 1
mc(ψ)−1| 1 i
=
K +1 mc(ψ) + 1| ψ(x)−1i − 1| ψ(x) i otherwise.
x∈supp(ψ)
X ψ(x) X ψ(x)
= mc(ψ)−1| 1i + mc(ψ) + 1| ψ(x)−1 i − 1| ψ(x) i
x, ψ(x)=1
K +1 x, ψ(x)>1
K +1
x, ψ(x)=k ψ(x)
P
mc(ψ)(1) X
= mc(ψ)−1|1 i + mc(ψ) + 1| k−1i − 1| k i
K +1 2≤k≤K+1
K +1
mc(ψ)(1) X mc(ψ)(k) · k
= mc(ψ)−1|1 i + mc(ψ) + 1|k−1 i + 1|k i
K +1 2≤k≤K+1
K +1
= EDD mc(ψ) .
We now come to a crucial channel ew[K] : R>0 → E(K) called after Ewens
who introduced the distributions ew[K](t) ∈ D E(K) in [46]. These distribu-
tions will be introduced by induction on K. Subsequently, an explicit formula
is obtained, for the special case when the parameter t is a natural number.
ew[1](t) B 1 1| 1i
(3.38)
ew[K +1](t) B EDA(t) = ew[K](t).
The first distribution ew[1](t) = 1 1|1 i ∈ D E(1) contains a lot of 1’s. It
is the singleton distribution for the singleton Ewens multiset 1| 1i on {1} with
mean 1. In this first step the parameter t ∈ R>0 does not play a role. The two
next steps yield:
ew[2](t) = EDA(t) = 1 1| 1 i = EDA(t)(1| 1i)
t 1
= 2| 1i + 1| 2 i .
1+t 1+t
212
3.9. Ewens distributions 213
And:
ew[3](t) !
t 1
= EDA(t) = 2| 1i +
1| 2 i
1+t 1+t
t 1
= · EDA(t) 2| 1 i +
· EDA(t) 1|2 i
1+t 1+t
t t t 2
= 3| 1i + 1| 1i + 1| 2 i
· ·
1+t 2+t 1+t 2+t
1 t 1 2
+ 1| 1i + 1| 2i +
· · 1| 3i
1+t 2+t 1+t 2+t
t2 3t 2
= 3| 1i + 1|1 i + 1| 2i + 1|3 i .
(1 + t)(2 + t) (1 + t)(2 + t) (1 + t)(2 + t)
Example 3.9.7. For numbers t ∈ R>0 and K ∈ N>0 define the Hoppe channel:
N[K] {1, . . . , K}
hop[K](t)
◦ / N[K +1] {1, . . . , K +1}
X ϕ(k)
hop[K](t)(ϕ) B ϕ + 1| k i + t ϕ + 1| K +1 i .
K +t K +t (3.39)
1≤k≤K
213
214 Chapter 3. Drawing from an urn
The following result can be seen as an analogue of Lemma 3.9.5, for addition
instead of for deletion. It is however more restricted, since it involves sample
spaces of the form {1, . . . , K}, and not arbitrary sets.
N[K] {1, . . . , K}
hop(t)
◦ / N[K +1] {1, . . . , K +1}
mc ◦ ◦ mc
E(K) ◦ / E(K +1)
EDA(t)
Proof. Let a multiset ϕ ∈ N[K] {1, . . . , K} be given. Via Lemma 3.9.3 (2):
D(mc) hop(t)(ϕ)
X ϕ(`)
(3.39)
= mc ϕ + 1|` i + t mc ϕ + 1| K +1i
1≤`≤K
K +t K +t
X ϕ(`)
= mc ϕ + 1| ϕ(`)+1 i − 1| ϕ(`) i + t mc ϕ + 1|1 i
1≤`≤K
K +t K +t
(∗) t X mc(ϕ)(k) · k
= mc(ϕ) + 1|1 i + mc(ϕ) − 1|k i + 1| k+1 i
K +t 1≤k≤K
K +t
= EDA(t) mc(ϕ) .
(∗)
The marked equation = holds since by definition, mc(ϕ)(k) = {` | ϕ(`) = k} ,
see (3.36).
214
3.9. Ewens distributions 215
Corollary 3.9.9. Ewens distributions can also be obtained via repeated Hoppe
draws followed by multiplicity count: for t ∈ R>0 and K ∈ N>0 , one has:
ew[K](t) = mc ◦· hop(t) ◦· · · · ◦· hop(t) 1| 1i .
| {z }
K−1 times
It turns out that there is a single formula for Ewens distributions when the
mutation rate t is a natural number. This formula is often called Ewens sam-
pling formula.
Theorem 3.9.10. When the parameter t > 0 is a natural, the Ewens distribu-
tion ew[K](t) ∈ D E(K) can be described by:
X Q1≤k≤K t ϕ(k)
ew[K](t) = kt ϕ . (3.40)
ϕ∈E(K) ϕ · K
t 1 t t (3.38)
1
1| 1 i = 1|1 i = 1|1 i = 1 1| 1 i = ew[1](t).
1! · 1t 1
t t
For the induction step we assume that Equation (3.40) holds for all numbers
below K. We first note that we can write for ϕ ∈ E(K) and ψ ∈ E(K +1),
t
if ψ = ϕ + 1| 1 i
K + t
EDA(t)(ϕ)(ψ) =
ϕ(k) · k
if ψ = ϕ − 1| k i + 1| k+1 i
K+t
t
if ϕ = ψ − 1| 1i
K + t
=
(ψ(k) + 1) · k
if ϕ = ψ + 1| k i − 1|k+1 i
K+t
215
216 Chapter 3. Drawing from an urn
where 1 ≤ k ≤ K. Hence:
(3.38)
ew[K +1](t)(ψ) = EDA(t) = ew[K](t) (ψ)
X
= EDA(t)(ϕ)(ψ) · ew[K](t)(ϕ)
ϕ∈E(K)
t
= if ψ(1) > 0
· ew[K](t)(ψ − 1|1 i)
K+t
X (ψ(k) + 1) · k
+ · ew[K](t)(ψ + 1| k i − 1| k+1i)
1≤k≤K
K+t
Q t (ψ−1| 1 i)(`)
(IH) t 1≤`≤K `
= ·
K + t (ψ − 1| 1 i) · t
K
X (ψ(k) + 1) · k Q t (ψ+1| k i−1| k+1 i)(`)
1≤`≤K `
+ ·
K+t
1≤k≤K (ψ + 1| k i − 1| k+1 i) · Kt
t ψ(`)
ψ(1)
Q
1≤`≤K+1 `
= ·
K+1
t
ψ · K+1
X ψ(k + 1) · (k + 1) Q1≤`≤K+1 t ψ(`)
+ · t `
1≤k≤K
K + 1 ψ · K+1
t ψ(`)
ψ(1) + 1≤k≤K ψ(k + 1) · (k + 1)
P Q
1≤`≤K+1 `
= ·
K+1
t
ψ · K+1
Q t ψ(`)
1≤`≤K+1 `
= t .
ψ · K+1
This result has a number of consequences. We start with an obvious one, re-
formulating that the probabilities in (3.40) add up to one. This result resembles
earlier results, in Lemma 1.6.2 and Proposition 1.7.3, underlying the multino-
mial and Pólya distributions.
216
3.9. Ewens distributions 217
o EDD
E(K) @ +1)
E(K
◦
[
◦ ◦
ew[K] ew[K+1]
N>0
Recall from Exercise 2.2.10 that there are also ‘fixed point’ distributions for
the ordinary (non-Ewens) draw-delete/add maps, but the ones given there are
not as rich as the Ewens distributions.
Proof. We first extract the following characterisation from the formulation of
EDD in Definition 3.9.4: for ψ ∈ E(K +1) and ϕ ∈ E(K),
ψ(1)
if ϕ = ψ − 1| 1 i
K + 1
EDD(ψ)(ϕ) =
ψ(k) · k
if ϕ = ψ + 1| k−1 i − 1| k i
K+1
ϕ(1) + 1
if ψ = ϕ + 1|1 i
K+1
=
(ϕ(k) + 1) · k
if ψ = ϕ − 1|k−1 i + 1| k i
K+1
where 2 ≤ k ≤ K + 1. Hence:
EDD = ew[K](t) (ϕ)
X
= EDD(ψ)(ϕ) · ew[K +1](t)(ψ)
ψ∈E(K+1)
ϕ(1) + 1
= · ew[K +1](t) ϕ + 1| 1i
K+1
X (ϕ(k + 1) + 1) · (k + 1)
+ · ew[K +1](t) ϕ − 1| k i + 1| k+1 i
1≤k≤K
K + 1
t (ϕ+1| 1 i)(`)
ϕ(1) + 1
Q
(3.40) 1≤`≤K+1 `
= ·
K+1
t
(ϕ + 1| 1i) · K+1
X (ϕ(k + 1) + 1) · (k + 1) Q t (ϕ−1| k i+1| k+1 i)(`)
1≤`≤K+1 `
+ ·
K+1
t
1≤k≤K (ϕ − 1| k i + 1| k+1i) · K+1
Q t ϕ(`) X ϕ(k) · k Q1≤`≤K t ϕ(`)
t 1≤`≤K `
= · + · t`
K+t ϕ · Kt K + t ϕ ·
P 1≤k≤K K
t 1≤k≤K ϕ(k) · k
= · ew[K](t)(ϕ) + · ew[K](t)(ϕ)
K+t K+t
= ew[K](t)(ϕ).
We consider another consequence of Theorem 3.9.10. It fills a gap that
217
218 Chapter 3. Drawing from an urn
we left open at the very beginning of this book, namely the proof of Theo-
rem 1.2.8. There we looked at sums sum(`) and products prod (`) of sequences
` ∈ L(N>0 ) of positive natural numbers. By accumulating such lists into mul-
tisets we get a connection with Ewens multisets, namely:
in which all probabilities add up to one. This is what was claimed earlier in
Theorem 1.2.8, at the time without proof.
Proof. We elaborate:
X X
arr = ew[K](t) =
arr(ϕ)(`) · ew[K](t)(ϕ) `
`∈L({1,...,K}) ϕ∈E(K)
t ϕ(k)
Q
(3.40)
X X 1 1≤k≤K
= t `
k
·
`∈L({1,...,K}) ϕ∈E(K), acc(`)=ϕ
(ϕ) ϕ · K
(∗)
X X 1 t kϕk
= · `
`∈L({1,...,K}) ϕ∈E(K), acc(`)=ϕ
kϕk! prod (`) · t
K
(3.41)
X t kϕk
= t ` .
`∈sum −1 (K) k`k! · prod (`) · K
218
3.9. Ewens distributions 219
Exercises
3.9.1 Check that the sets of Ewens multisets E(1), E(2), E(3), E(5) have,
respectively, 1, 2, 3, 5 and 7 elements; list them all.
3.9.2 Use the construction from the proof of Lemma 3.9.3 (5) to construct
for ϕ = 2| 1i + 1| 4 i + 2| 5 i ∈ E(16) a multiset ψ with mc(ψ) = ϕ.
3.9.3 Use the isomorphism N N(1) to describe the mean function from
Definition 3.9.1 as an instance of flattening.
3.9.4 Show that for ` ∈ L(N>0 ) one has:
mean acc(`) = sum(`).
3.9.5 Recall from Lemma 1.2.7 that for N ∈ N>0 there are 2N−1 many lists
` ∈ L(N>0 ) with sum(`) = N. Hence we have a uniform distribution:
X 1
unifsum −1 (N) = N−1
` .
2
−1 `∈sum (N)
and:
D(mc) DA(ϕ) = 25 1| 1 i + 1| 2 i + 1| 3 i + 35 2|1 i + 1| 4i .
219
220 Chapter 3. Drawing from an urn
220
4
P
We have seen (discrete probability) distributions as formal convex sums i ri | xi i
in D(X) and (probabilistic) channels X → Y, as functions X → D(Y), describ-
ing probabilistic states and computations. This section develops the tools to
reason about such distributions and channels, via what are called observables.
They are functions from a set / sample space X to (a subset of) the real num-
bers R that associate some ‘observable’ numerical information with an element
x ∈ X. The following table gives an overview of terminology and types, where
X is a set (used as sample space).
name type
We shall use the term of ‘observable’ as generic expression for all the entries in
this table. A function X → R is thus the most general type of observable, and
a sharp predicate X → {0, 1} is the most specific one. Predicates are the most
appropriate observable for probabilistic logical reasoning. Often attention is
restricted to subsets U ⊆ X as predicates (or events [144]), but here, in this
book, the fuzzy versions X → [0, 1] are the default. Such fuzzy predicates may
also be called belief functions — or effects, in a quantum setting. A technical
reason using fuzzy, [0, 1]-valued predicates is that these fuzzy predicates are
closed under predicate transformation =, and the sharp predicates are not, see
below for details.
221
222 Chapter 4. Observables and validity
4.1 Validity
This section introduces the basic facts and terminology for observables, as
described in Table 4.1 and defines their validity in a state. Recall that we write
Y X for the set of functions from X to Y. We will use notations:
222
4.1. Validity 223
The first set SPred (X) = {0, 1}X of sharp predicates can be identified with
the powerset P(X) of subsets of X, see below. We first define some special
observables.
for all x ∈ X.
3 For a singleton subset {x} we simply write 1 x for 1{x} . Such functions 1 x : X →
[0, 1] are also called point predicates, where the element x ∈ X is seen as a
point.
4 There is a (sharp) equality predicate Eq : X × X → [0, 1] defined in the
obvious way as:
1 if x = x
0
Eq(x, x0 ) B
0 otherwise.
223
224 Chapter 4. Observables and validity
2 If X is a finite set, with size | X | ∈ N, one can define the average avg(p) of
the observable p as its validity in the uniform state unif X on X, i.e.,
X p(x)
avg(p) B unif X |= p = .
x∈X
|X|
Definition 4.1.3. In the presence of a map incl X : X ,→ R, one can define the
mean mean(ω), also known as average, of a distribution ω on X as the validity
of incl X , considered as an observable:
X
mean(ω) B ω |= incl X = ω(x) · x.
x∈X
224
4.1. Validity 225
is a game where you can throw the coin and win €100 if head (1) comes up,
but you lose €50 if the outcome is tail (0). Is it a good idea to play the game?
The possible gain can be formalised as an observable v : {0, 1} → R with
v(0) = −50 and v(1) = 100. We get an anwer to the above question by
computing the validity:
X 3
3
flip( 10 ) |= v = flip( )(x) · v(x)
x∈{0,1}
10
= flip( 10
3
)(0) · v(0) + flip( 10
3
)(1) · v(1)
= 7
10 · −50 + 3
10 · 100 = −35 + 30 = −5.
Now consider a more subtle claim that the even pips are more likely than
the odd ones, where the precise likelihoods are described by the (non-sharp,
fuzzy) predicate p : pips → [0, 1] with:
p(1) = 1
10 p(2) = 9
10 p(3) = 3
10
p(4) = 8
10 p(5) = 2
10 p(6) = 10 .
7
This new claim p happens to be equally probable as the even claim e, since:
dice |= p = 6 · 10 + 6 · 10 + 6
1 1 1 9 1
· 8
10 + 1
6 · 8
10 + 1
6 · 2
10 + 1
6 · 7
10
= 1+9+3+8+2+7
60 = 30
60 = 2.
1
4 Recall the binomial distribution bn[K](r) on the set {0, 1, . . . , K} from Ex-
ample 2.1.1 (2), for r ∈ [0, 1]. There is an inclusion function {0, 1, . . . , K} ,→
R that allows us to compute the mean of the binomial distribution. In this
computation we treat the binomial distribution as a special instance of the
225
226 Chapter 4. Observables and validity
mean bn[K](r)
X
= bn[K](r)(k) · k
0≤k≤K
X
= mn[K] flip(r) (ϕ) · ϕ(1) for flip(r) = r|1 i + (1 − r)| 0 i
ϕ∈N[K](2)
For the next result we recall from Proposition 2.4.5 that if M is a commuta-
tive monoid, so is the set D(M) of distributions on M. This is used below where
M is the additive monoid N of natural numbers. The result can be generalised
to any additive submonoid of the reals.
D(N), +, 0
mean / R≥0 , +, 0.
mean 1| 0i = 1 · 0 = 0.
226
4.1. Validity 227
227
228 Chapter 4. Observables and validity
The pet distribution ω is on the left and the costs per pet is in the middle. The
plot on the right describes 207
| 10 i + 13
20 | 50i, which is the distribution of pet
costs in the neighbourhood. It is obtained via the functoriality of D, as image
distribution D(q)(ω) ∈ D(R). In more traditional notation it is described as:
P[q = 10] = 7
20 P[q = 50] = 20 .
13
228
4.1. Validity 229
In the second item of the above definition, Equation (4.4) holds since:
X
mean(R = ω) = (R = ω)(r) · r see Definition 4.1.3
r∈R
X
= D(R)(ω)(r) · r
r∈R
X X
=
ω(x) · r
r∈R x∈R−1 (r)
X
= ω(x) · R(x)
x∈X
= ω |= R.
Example 4.1.8. We consider the expected value for the sum of two dices.
In this situation we have an observable S : pips × pips → R, on the sam-
ple space pips = {1, 2, 3, 4, 5, 6}, given by S (i, j) = i + j. It forms a random
variable together with the product state dice ⊗ dice ∈ D(pips × pips). Recall,
229
230 Chapter 4. Observables and validity
dice = unif = i∈pips 16 |i i is the uniform distribution unif on pips. First, the
P
distribution for the sum of the pips is the image distribution:
+ 36 5
| 8 i + 91 | 9 i + 12
1
| 10 i + 18
1
| 11 i + 36
1
| 12 i.
The expected value of the random variable (dice ⊗ dice, S ) is, according to
Definition 4.1.7 (3),
(ω + ρ) |= p = (ω |= p) + (ρ |= p),
ω + ρ, p ω, p ρ, p .
230
4.1. Validity 231
Exercises
4.1.1 Check that the average of the set n + 1 = {0, 1, . . . , n}, considered as
random variable, is 2n .
4.1.2 In Example 4.1.8 we have seen that dice ⊗ dice |= S = 7, for the
observable S : pips × pips → R by S (x, y) = x + y.
1 Now define T : pips 3 → R by T (x, y, z) = x + y + z. Prove that
pips ⊗ pips ⊗ pips |= T = 21
2 .
2 Can you generalise and show that summing on pips n yields validity
7n
2 ?
4.1.3 The birthday paradox tells that with at least 23 people in a room there
is a more than 50% chance that at least to of them have the same
birthday. This is called a ‘paradox’, because the number 23 looks sur-
prisingly low.
Let us scale this down so that the problem becomes manageable.
Suppose there are three people, each with their birthday in the same
week.
1 Show that the probability that they all have different birthdays is
30
49 .
2 Conclude that the probability that at least two of the three have the
19
same birthday is 49 .
3 Consider the set days B {1, 2, 3, 4, 5, 6, 7}. The set N[3](days) of
multisets of size three contains the possible combinations of the
231
232 Chapter 4. Observables and validity
mn[3](unifdays ) |= p = 49 .
19
f = ω |= q = ω |= q ◦ f
mean(p = ω) = p = ω |= id = ω |= p,
2 For channels c, d : X → R,
mean (c = ω) + (d = ρ) = ω |= mean ◦ c + ρ |= mean ◦ d .
mean pois[λ] = λ.
232
4.1. Validity 233
From the first two items we can conclude that mean is an (Eilenberg-
Moore) algebra of the distribution monad, see Subsection 1.9.4. The
third item says that (−) |= p : D(X) → R is a homomorphism of
algebras.
4.1.9 For a distribution ω ∈ D(X) write I(ω) : X → R≥0 for the information
content or surprise of the distribution ω, defined as factor:
− log ω(x) if ω(x) , 0, i.e. if x ∈ supp(ω)
I(ω)(x) B
0
otherwise.
1 Check that Kullback-Leibler divergence, see Definition 2.7.1, can
be described as validity: for ω, ρ ∈ D(X) with supp(ω) ⊆ sup(ρ),
DKL (ω, ρ) = ω |= I(ρ) − I(ω).
2 The Shannon entropy H(ω) of ω and the cross entropy H(ω, ρ) of
ω, ρ, are defined as: validities:
H(ω) B ω |= I(ω) = − x ω(x) · log ω(x)
P
233
234 Chapter 4. Observables and validity
The least well-known structures in this table are effect modules. They will thus
be described in greatest detail, in Subsection 4.2.3.
234
4.2. The structure of observables 235
4.2.1 Observables
Observables are R-valued functions on a set (or sample space) which are often
written as capitals X, Y, . . .. Here, these letters are typically used for sets and
spaces. We shall use letters p, q, . . . for observables in general, and in particular
for predicates. In some settings one allows observables X → Rn to the n-
dimensional space of real numbers. Whenever needed, we shall use such maps
as n-ary tuples hp1 , . . . , pn i : X → Rn of observables pi : X → R, see also
Section 1.1.
Let us fix a set X, and consider the collection Obs(X) = RX of observables
on X. What structure does it have?
4.2.2 Factors
We recall that a factor is a non-negative observables and that we write Fact(X) =
(R≥0 )X for the set of factors on a set X. Factors occur in probability theory for
instance in the so-called sum-product algorithms for computing marginals and
posterior distributions in graphical models. These matters are postponed to
Chapter ??; here we concentrate on the mathematical properties of factors.
The set Fact(X) looks like a vector space, except that there are no negatives.
235
236 Chapter 4. Observables and validity
Factors can be added pointwise, with identity element 0 ∈ Fact(X), like ran-
dom variables. But a factor p ∈ Fact(X) cannot be re-scaled with an ar-
bitrary real number, but only with a non-negative number s ∈ R≥0 , giving
s · p ∈ Fact(X). These structures are often called cones. The cone Fact(X) is
positive, since p + q = 0 implies p = q = 0. It is also cancellative: p + r = q + r
implies p = q.
The monoid (1, &) on Obs(X) restricts to Fact(X), since 1 ≥ 0 and if p, q ≥ 0
then also p & q ≥ 0.
4.2.3 Predicates
We first note that the set Pred (X) = [0, 1]X of predicates on a set X contains
falsity 0 and truth 1, which are always 0 (resp. 1). There are some noteworthy
differences between predicates on the one hand and observables and factors on
the other hand.
• Predicates are not closed under addition, since the sum of two numbers in
[0, 1] may ly outside [0, 1]. Thus, addition of predicates is a partial opera-
tion, and is then written as p > q. Thus: p > q is defined if p(x) + q(x) ≤ 1
for all x ∈ X, and in that case (p > q)(x) = p(x) + q(x).
This operation > is commutative and associative in a suitably partial
sense. Moreover, it has 0 as identity element: p > 0 = p = 0 > p. This
is structure (Pred (X), 0, >) is called a partial commutative monoid.
• There is a ‘negation’ of predicates, written as p⊥ , and called orthosupple-
ment. It is defined as p⊥ = 1− p, that is, as p⊥ (x) = 1− p(x). Then: p> p⊥ = 1
and p⊥⊥ = p.
• Predicates are closed under scalar multiplication s· p, but only if one restricts
the scalar s to be in the unit interval [0, 1]. Such scalar multiplication inter-
acts nicely with partial addition >, in the sense that s·(p>q) = (s· p)>(s·q).
The combination of these items means that the set Pred (X) carries the structure
of an effect module [68], see also [44]. These structures arose in mathemati-
cal physics [49] in order to axiomatise the structure of quantum predicates on
Hilbert spaces.
The effect module Pred (X) also carries a commutative monoid structure for
conjunction, namely (1, &). Indeed, when p, q ∈ Pred (X), then also p & q ∈
Pred (X). We have p & 0 = 0 and p & (q1 > q2 ) = (p & q1 ) > (p & q2 ).
236
4.2. The structure of observables 237
Since these effect structures are not so familiar, we include more formal
descriptions.
1·x = x (r · s) · x = r · (s · x),
0·x = 0 (r + s) · x = r · x > s · x
s·0 = 0 s · (x > y) = s · x > s · y.
We write EMod for the category of effect modules, where morphisms are
maps of effect algebras that preserve scalar multiplication (i.e. are ‘equivari-
ant’).
The following notion of ‘test’ comes from a quantum context and captures
‘compatible’ observations, see e.g. [36, 130]. It will be used occasionally later
on, for instance in Exercise 6.1.4 or Proposition 6.7.10.
237
238 Chapter 4. Observables and validity
1
3|ai + 61 | bi + 12 | ci 2
3 · 1a > 5
6 · 1b > 1 · 1c .
Writing > looks a bit pedantic, so we often simply write + instead. Recall that
in a predicate the probabilities need not add up to one, see Remark 4.1.6.
A set of predicates p1 , . . . , pn on the same space, can be turned into a test
via (pointwise) normalisation. Below we describe an alternative construction
which is sometimes called stick breaking, see e.g. [94, Defn. 1]. It is commonly
used for probabilities r1 , . . . , rn ∈ [0, 1], but it applies to predicates as well —
but it may be described even more generally, inside an arbitrary effect algebra
with a commutative monoid structure for conjunction. The idea is to break up
a stick in two pieces, multiple times, first according to the first ratio r1 . You
then put one part, say on the left, aside and break up the other part according
to ratio r2 . Etcetera.
Lemma 4.2.4. Let p1 , . . . , pn be arbitrary predicates, all on the same set, for
n ≥ 1. We can turn them via “stick breaking” into an n+1-test q1 , . . . , qn+1 via
definitions:
q1 B p1
qi+1 B p⊥1 & · · · & p⊥i & pi+1 for 0 < i < n
qn+1 B p⊥1 & · · · & p⊥n .
238
4.2. The structure of observables 239
· · · > p⊥1 & p⊥2 & · · · & p⊥n−1 & pn > p⊥1 & · · · & p⊥n+1
h
= p1 > p⊥1 & p2 > p⊥2 & p3 > · · ·
i
· · · > p⊥2 & p⊥3 & · · · & p⊥n−1 & pn > p⊥2 & · · · & p⊥n+1
(IH)
= p1 > p⊥1 & 1
= 1.
[0, 1]n
stbr / D n+1 = D {0, 1, . . . , n},
with:
stbr(r1 , . . . , rn ) B r1 0 + (1−r1 )r2 1 + · · · +
Q
i<n (1−ri ) rn n − 1
(4.7)
+ i<n+1 (1−ri ) n .
Q
239
240 Chapter 4. Observables and validity
f , homomorphism
f
+E
240
4.2. The structure of observables 241
Finally, for uniqueness, let g : Pred (X) → E be a map of effect modules with
g(1U ) = f (U) for each U ∈ P(X). Then g = f since:
Now that we have seen observables, with factors and (sharp) predicates as
special cases, we can say that all these subsets of observables all share the same
multiplicative structure (1, &) for conjunction, but that their additive structures
and scalar multiplications differ. The latter, however, are preserved under tak-
ing validity — and not (1, &).
1 ω |= 0 = 0;
2 ω |= (p + q) = (ω |= p) + (ω |= q);
ω |= p⊥ = 1 − ω |= p ;
3
4 ω |= (s · p) = s · (ω |= p).
241
242 Chapter 4. Observables and validity
1| ⊗ {z · · · ⊗ }1 .
1 ⊗ p ⊗ 1| ⊗ {z
· · · ⊗}
i−1 times n−i times
ω |= 1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1 = D(πi )(ω) |= p
(4.8)
= ω 0, . . . , 0, 1, 0, . . . , 0 |= p,
where the 1 in the latter marginalisation mask is at position i. Soon we shall see
an alternative description of weakening in terms of predicate transformation.
The above equation then appears as a special case of a more general result,
namely of Proposition 4.3.3.
We mention some basic results about parallel products of observables. More
such ‘logical’ results can be found in the exercises.
σ ⊗ τ |= p ⊗ q = σ |= p · τ |= q .
242
4.2. The structure of observables 243
Proof. Easy:
X
σ ⊗ τ |= p ⊗ q = (σ ⊗ τ)(z) · (p ⊗ q)(z)
z∈X×Y
X
= (σ ⊗ τ)(x, y) · (p ⊗ q)(x, y)
x∈X,y∈Y
X
= σ(x) · τ(y) · p(x) · q(y)
x∈X,y∈Y
X X
= σ(x) · p(x) · τ(y) · q(y)
x∈X y∈Y
= σ |= p · τ |= q .
= p1 ⊗ · · · ⊗ pn (x1 , . . . , xn ) · q1 ⊗ · · · ⊗ qn (x1 , . . . , xn )
4.2.6 Functoriality
We like to conclude this section with some categorical observations. They are
not immediately relevant for the sequel and may be skipped. We shall be using
four (new) categories:
• Vect, with vector spaces (over the real numbers) as objects and linear maps
as morphisms between them (preserving addition and scalar multiplication);
• Cone, with cones as objects and also with linear maps as morphisms, but
this time preserving scalar multiplication with non-negative reals only;
243
244 Chapter 4. Observables and validity
= x 7→ p( f (x)) + q( f (x))
244
4.2. The structure of observables 245
This yields Obs(g ◦ f ) = Obs(g) ◦op Obs( f ), so that we get a functor of the
form Obs : Sets → Vectop .
Notice that saying that we have a functor like Pred : Sets → EModop con-
tains remarkably much information, about the mathematical structure on ob-
jects Pred (X), about preservation of this structure by maps Pred ( f ), and about
preservation of identity maps and composition by Pred (−) on morphisms (see
also Exercise 4.3.10). This makes the language of category theory both power-
ful and efficient.
Exercises
4.2.1 Consider the following question:
An urn contains 10 balls of which 4 are red and 6 are blue. A second urn
contains 16 red balls and an unknown number of blue balls. A single ball is
11
drawn from each urn. The probability that both balls are the same color is 25 .
Find the number of blue balls in the second urn.
1 Check that the givens can be expressed in terms of validity as:
2 Prove, by solving the above equation, that there are 4 blue balls in
the second urn.
4.2.2 Find examples of predicates p, q on a set X and a distribution ω on X
such that ω |= p & q and (ω |= p) · (ω |= q) are different.
4.2.3 1 Check that (sharp) indicator predicates 1E : X → [0, 1], for subsets
E ⊆ X, satisfy:
• 1E∩D = 1E & 1D , and thus 1 ⊗ 1 = 1, 1 ⊗ 0 = 0 = 0 ⊗ 1;
• 1E∪D = 1E > 1D , if E, D are disjoint;
245
246 Chapter 4. Observables and validity
σ |= p) ≤ τ |= p =⇒ σ |= p & q) ≤ τ |= p & q .
σ = 19
100 | a i + 47
100 | b i + 17
50 | ci τ = 15 | ai + 9
20 | b i + 7
20 | c i
with predicates:
p = 1 · 1a + 7
10 · 1b + 1
2 · 1c q= 1
10 · 1a + 1
2 · 1b + 1
5 · 1c .
Now check consecutively that:
1 σ |= p = 1000 < 100 = τ |= p.
689 69
2 p& q = 10
1
· 1a + 20
7
· 1b + 10
1
· 1c .
3 σ |= p & q = 400 > 400 = τ |= p &
87 85
q.
Find a counterexample yourself in which the predicate q is sharp.
4.2.5 Consider a state σ ∈ D(X), a factor p on X, and a predicate q on X
which is non-zero on the support of σ. Show that:
σ |= qp ≥ 1 =⇒ σ |= p ≥ σ |= q ,
246
4.2. The structure of observables 247
2 ω |= 1E = 1;
3 ω |= p & 1E = ω |= p for each p ∈ Obs(X).
4.2.7 Let p1 , p2 , p3 be predicates on the same set.
1 Show that:
⊥
p⊥1 ⊗ p⊥2 = (p1 ⊗ p2 ) > (p1 ⊗ p⊥2 ) > (p⊥1 ⊗ p2 ).
2 Show also that:
⊥
p⊥1 ⊗ p⊥2 ⊗ p⊥3 = (p1 ⊗ p2 ⊗ p3 ) > (p1 ⊗ p2 ⊗ p⊥3 )
> (p1 ⊗ p⊥2 ⊗ p3 ) > (p1 ⊗ p⊥2 ⊗ p⊥3 )
> (p⊥1 ⊗ p2 ⊗ p3 ) > (p⊥1 ⊗ p2 ⊗ p⊥3 )
> (p⊥1 ⊗ p⊥2 ⊗ p3 ).
3 Generalise to n.
4.2.8 For predicates p, q on the same set, define Reichenbach implication ⊃
as:
p ⊃ q B p⊥ > (p & q).
1 Check that:
p ⊃ q = (p & q⊥ )⊥
from which it easily follows that:
p ⊃ 0 = p⊥ 1⊃q = q p⊃1 = 1 0 ⊃ q = 1.
2 Check also that:
p⊥ ≤ p ⊃ q q ≤ p ⊃ q.
3 Show that:
p1 ⊃ (p2 ⊃ q) = (p1 & p2 ) ⊃ q.
4 For subsets E, D of the same set,
1E ⊃ 1D = 1¬(E∩¬D) = 1¬E∪D .
The subset ¬(E ∩ ¬D) = ¬E ∪ D is the standard interpretation of
‘E implies D’ in Boolean logic (of subsets).
4.2.9 Let p be a predicate on a set X. Prove that the following statements
are equivalent.
1 p is sharp;
2 p & p = p;
3 p & p⊥ = 0;
247
248 Chapter 4. Observables and validity
4.2.12 For a random variable (ω, p), show that the validity of the observable
p − (ω |= p) · 1 is zero, i.e.,
ω |= p − (ω |= p) · 1 = 0.
M(X) × Fact(X)
• / M(X)
e ? d B e⊥ > d⊥ ⊥ if e⊥ ⊥ d⊥
e d B e⊥ > d ⊥ = e ? d⊥ if e ≥ d.
Show that:
1 (E, ?, 1) is a partial commutative monoid;
248
4.3. Transformation of observables 249
249
250 Chapter 4. Observables and validity
Proof. 1 If q ∈ Fact(Y) then q(y) ≥ 0 for all y. But then also (c = q)(x) =
y c(x)(y) · q(y) ≥ 0, so that c = q ∈ Fact(X). If in addition q ∈ Pred (Y), so
P
that q(y) ≤ 1 for all y, then also (c = q)(x) = y c(x)(y) · q(y) ≤ y c(x)(y) =
P P
1, since c(x) ∈ D(Y), so that c = q ∈ Pred (X).
The fact that a transformation c = p of a sharp predicate p need not be
sharp is demonstrated in Exercise 4.3.2.
2 Easy.
250
4.3. Transformation of observables 251
7 For a function f : X → Y,
X
f = q (x) = unit( f (x))(y) · q(y) = q( f (x)) = (q ◦ f )(x).
y∈Y
c = ω |= q = ω |= c = q. (4.9)
• Earlier we mentioned that marginalisation (of states) and weakening (of ob-
servables) are dual to each other, see Equation (4.8). We can now see this as
251
252 Chapter 4. Observables and validity
p(x) = unit(x) |= p
≤ c = unit(x) |= q = unit(x) |= c = q = (c = q)(x).
As a result, p ≤ c = q.
Lemma 4.3.2 (6) expresses a familiar compositionality property in the the-
ory of weakest preconditions:
We close this section with two topics that dig deeper into the nature of trans-
formations. First, we relate transformation of states and observables in terms
of matrix operations. Then we look closer at the categorical aspects of trans-
formation of observables.
Remark 4.3.5. Let c be a channel with finite sets as domain and codomain.
For convenience we write these as n = {0, . . . , n − 1} and m = {0, . . . , m − 1},
so that the channel c has type n → m. For each i ∈ n we have that c(i) ∈ D(m)
252
4.3. Transformation of observables 253
Indeed,
X X
(c = ω)( j) = c(i)( j) · ω(i) = = Mc · Mω j .
Mc ji · Mω i
i i
253
254 Chapter 4. Observables and validity
Exercises
4.3.1 Consider the channel f : {a, b, c} → {u, v} from Example 1.8.2, given
by:
f (a) = 12 | u i + 21 | v i f (b) = 1| u i f (c) = 34 | u i + 14 | v i.
Take as predicates p, q : {u, v} → [0, 1],
p(u) = 1
2 p(v) = 2
3 q(u) = 1
4 q(v) = 16 .
Alternatively, in the normal-form notation of Lemma 4.2.3 (2),
p = 1
2 · 1u + 2
3 · 1v q = 1
4 · 1u + 1
6 · 1v .
Compute:
• f = p
• f = q
• f = (p > q)
• ( f = p) > ( f = q)
• f = (p & q)
• ( f = p) & ( f = q)
This will show that predicate transformation = does not preserve con-
junction &.
4.3.2 1 Still in the context of the previous exercise, consider the sharp
(point) predicate 1u on {u, v}. Show that the transformed predicate
f = 1u on {a, b, c} is not sharp. This proves that sharp predicates
are not closed under predicate transformation. This is a good mo-
tivation for working with fuzzy predicates instead of with sharp
predicates only.
2 Let h : X → Y be a function, considered as deterministic channel,
and let V ⊆ Y be an arbitrary subset (event). Check that:
h = 1V = 1h−1 (V) .
Conclude that predicate transformation along deterministic chan-
nels does preserve sharpness.
254
4.3. Transformation of observables 255
q1 & q2 = ∆ = (q1 ⊗ q2 ).
255
256 Chapter 4. Observables and validity
Chan
Pred / EModop
256
4.4. Validity and drawing 257
natural multisets of size K, and not on (real) numbers. Hence the requirement
for a mean, see Definition 4.1.3, does not apply: the space is not a subset of
the reals. Still, we have described Proposition 3.3.7 as a generalised mean for
multinomials.
Alternatively, we can describe a pointwise mean for multinomials via point-
evaluation observables. For a set X with an element y ∈ X we can define an
observable:
πy
M(X) /R by πy (ϕ) B ϕ(y). (4.10)
Definition 4.4.1. Let X be a set (of colours) and K be a number. The means of
the three draw distributions are defined as multisets over X in:
X X
mean mn[K](ω) B mn[K](ω) |= πy y = K · ω(y) y
y∈X y∈X
= K · ω.
X X
hg[K](ψ) |= πy y = K ·
mean hg[K](ψ) B Flrn(ψ)(y) y
y∈X y∈X
= K · Flrn(ψ)
X X
mean pl[K](ψ) B pl[K](ψ) |= πy y = K · Flrn(ψ)(y) y
y∈X y∈X
= K · Flrn(ψ).
These formulations coincide with the ones that we have given earlier in
Propositions 3.3.7 and 3.4.7.
We recall a historical example, known as the paradox of the Chevalier De
Méré, from the 17th century. He argued informally that the following two out-
comes have equal probability.
257
258 Chapter 4. Observables and validity
= 1 − 65 4
= 1296
671
≈ 0.518.
Similarly one can compute (DM2) as:
24 ⊥
dice ⊗ dice 24 |= 16 ⊗ 16 ⊥ = 1− 35 24
36
≈ 0.491.
Hence indeed, betting on (DM2) is a bad idea.
In this section we look at validities over distributions of draws from an urn,
like in (DM1) and (DM2). In particular, we establish connections between
such validities and validities over the urn, as distribution. This involves free
extensions of observables to multisets.
We notice that there are two ways to extend an observable X → R on a set
X to natural multisets over X, since we can choose to use either the additive
258
4.4. Validity and drawing 259
p + (ϕ) =
X X
ϕ(x) · p(x) = ϕ(x) · p(x).
x∈supp(ϕ) x∈X
+ + + +
p (0) = 0 p ϕ + ψ = p (ϕ) + p (ψ)
and (4.13)
p• (0) = 1. p ϕ + ψ = p (ϕ) · p (ψ)
• • •
1 The additive extension p + of p forms new random variable, with the multi-
nomial distribution mn[K](ω) on N[K](X) as state. The associated validity
is:
mn[K](ω) |= p + = K · ω |= p .
mn[K](ω) |= p • = ω |= p K .
259
260 Chapter 4. Observables and validity
ϕ∈N[K](X)
X X
= ϕ(x) · p(x)
mn[K](ω)(ϕ) ·
ϕ∈N[K](X) x∈X
X X
= p(x) · mn[K](ω)(ϕ) · ϕ(x)
x∈X ϕ∈N[K](X)
X
= p(x) · K · ω(x)
x∈X
= K · (ω |= p).
= ω |= p K .
For the hypergeometric and Pólya distributions we have results for the addi-
tive extension. They resemble the formulation in the above first item.
Proof. Both equations are obtained as for Proposition 4.4.3 (1), this time using
Lemma 3.4.4.
For the parallel multinomial law pml : M[K] D(X) → D M[K](X) from
Section 3.6 there is a similar result, in the multiplicative case.
260
4.4. Validity and drawing 261
The probability ω(x) equals the validity ω |= 1 x of the point predicate 1 x , for
the element x ∈ X with multiplicity ϕ(x) in the mulitset (draw) ϕ. We ask
ourselves the question: can we replace these point predicates with arbitrary
predicates? Does such a generalisation make sense?
As we shall see in Chapter ?? on probabilistic learning, this makes perfectly
good sense. It involves a generalisation of multisets over points, as elements of
a set X, to multisets over predicates on X. Such predicates can express uncer-
tainties over the data (elements) from which we wish to learn.
In the current setting we describe only how to adapt multinomial distribu-
tions to predicates. This works for tests, which are finite sets of predicates
adding up to the truth predicate 1, see Definition 4.2.2. It turns out that there
are two forms of “logical” multinomial for predicates.
Definition 4.4.6. Fix a state ω ∈ D(X) and a number K ∈ N. Let T ⊆ Pred (X)
be a test.
ω |= p ϕ(p) ϕ ,
X Y
mn E [K](ω, T ) = (ϕ) ·
ϕ∈N[K](T ) p∈T
261
262 Chapter 4. Observables and validity
In the first, external case the product is outside the validity |=, whereas in
Q
the second, internal case the conjunction (product) & is inside the validity |=.
This explains the names.
It is easy to see how the internal logical formulation generalises the standard
multinomial distribution. For a distribution ω with support X = supp(ω), say
of the form X = {x1 , . . . , xn }, the point predicates 1 x1 , . . . , 1 xn form a test on
X, since 1 x1 > · · · > 1 xn = 1. A multiset over these predicates 1 x1 , . . . , 1 xn can
easily be identified with a multiset over the points x1 , . . . , xn .
The external logical formulation only makes sense for proper fuzzy predi-
cates. In the sharp case, say with a test 1U , 1¬U things trivialise, as in:
1U n & 1¬U m = 1U & 1¬U = 0.
It is probably for this reason that the internal version is not so common. But it
is quite natural in Bayesian learning, see Chapter ??.
We need to check that the above definitions yield actual distributions, with
probabilities adding up to one. In both cases this follows from the Multinomial
Theorem (1.27). Externally:
K
X K
X Y ϕ(p) (1.27) X
(ϕ) · ω |= p = ω |= p = ω |=
p
ϕ∈N[K](P) p∈P p∈P p∈P
K
= ω |= 1 = 1K = 1.
In the internal case we get:
X ! X !
ϕ(p) ϕ(p)
( ϕ ) · ω |= & p = ω |= (ϕ) · & p
p∈P p∈P
ϕ∈N[K](P) ϕ∈N[K](P)
X X Y
= ω(x) · (ϕ) · p(x)ϕ(p)
x∈X ϕ∈N[K](P) p∈P
K
(1.27)
X X X
= ω(x) · p(x) = ω(x) · 1K = 1.
x∈X p∈P x∈X
Exercises
4.4.1 Show that the additive/multiplicative extension preserves the addi-
tive/multiplicative structure of observables:
+ +
p + q = p+ + q+ 0 =0
262
4.4. Validity and drawing 263
and:
• •
p & q = p• & q• 1 = 1
(ω, p)(n) B ω |= p + · · · + p
P
(n times).
p1 = 1
8 · 1a + 1
2 · 1b p2 = 1
4 · 1a + 1
6 · 1b p2 = 5
8 · 1a + 1
3 · 1b .
1 Show that:
mn E [2](ω, T )
7
= 64
9
2| p1 i + 48 1| p1 i + 1| p2 i + 1296
49
2| p2 i
+ 3196 1| p1 i + 1| p3 i + 1296 1| p2 i + 1| p3 i +
217
5184 2| p3 i .
961
mn I [2](ω, T )
19 17
= 11
64 2| p1 i + 144 1| p1 i + 1| p2 i + 432 2| p2 i
+ 288
79
1| p1 i + 1| p3 i + 432
77
1| p2 i + 1| p3 i + 1728 2| p3 i .
353
263
264 Chapter 4. Observables and validity
Note that the above formulations involve predicates only, not observables in
general.
264
4.5. Validity-based distances 265
This proves Proposition 4.5.1 since the inequality in the middle is trivial.
We start with some preparatory definitions. Let U ⊆ X be an arbitrary subset.
We shall write ωi (U) = x∈U ωi (x) = (ω |= 1U ). We partition U in three
P
disjoint parts, and take the relevant sums:
U> = {x ∈ U | ω1 (x) > ω2 (x)}
U↑ = ω1 (U> ) − ω2 (U> ) ≥ 0
= ω = ω
{x ∈ | (x) (x)}
U = U 1 2 U↓ = ω (U ) − ω (U ) ≥ 0.
< <
2 1
U< = {x ∈ U | ω1 (x) < ω2 (x)}
That is,
As a result:
1 ω1 (x) − ω2 (x)
P
2 x∈X
= 1 P
x∈X> ω1 (x) − ω2 (x) + x∈X< ω2 (x) − ω1 (x)
P
2
= 2 ω1 (X> ) − ω2 (X> ) + ω2 (X< ) − ω1 (X< )
1 (4.16)
= 2 X↑ + X↓
1
= X↑
We have prepared the ground for proving the above inequalities (a) and (b).
(a) We will see that the above maximum is actually reached for the subset U =
X> , first of all because:
(4.16)
x∈X ω1 (x) − ω2 (x) = X↑ = ω1 (X> ) − ω2 (X> )
1 P
2
= ω1 |= 1X> − ω2 |= 1X>
≤ max ω1 |= 1U − ω2 |= 1U .
U⊆X
(b) Let p ∈ Pred (X) be an arbitrary predicate. We have: 1U & p (x) = 1U (x) ·
265
266 Chapter 4. Observables and validity
ω1 |= p − ω2 |= p
= ω1 |= 1X> & p + ω1 |= 1X= & p + ω1 |= 1X< & p
− ω2 |= 1X> & p + ω2 |= 1X= & p + ω2 |= 1X< & p
= ω1 |= 1X & p − ω2 |= 1X & p − ω2 |= 1X & p − ω1 |= 1X & p
> > < <
= X↑
(4.16) 1
= 2 ω1 (x) − ω2 (x) .
P
x∈X
266
4.5. Validity-based distances 267
d(ω1 , ω3 ) = 1
| ω1 (x) − ω3 (x) |
P
2 x∈X
1
| ω1 (x) − ω2 (x) | + | ω2 (x) − ω3 (x)|
P
≤ 2 x∈X
= 1
x∈X | ω1 (x) − ω2 (x) | + 2
1 P
x∈X | ω2 (x) − ω3 (x) |
P
2
= d(ω1 , ω2 ) + d(ω2 , ω3 ).
Proof. Since:
P
d c = ω1 , c = ω2 = 1
y (c = ω1 )(y) − (c = ω2 (y)
2
P P
= 1
x ω1 (x) · c(x)(y) − x ω2 (x) · c(x)(y)
P
2 y
P P
= 1
2 y x c(x)(y) · (ω1 (x) − ω2 (x))
x c(x)(y) · ω1 (x) − ω2 (x)
1 P P
≤ 2 y
= 1
x ( y c(x)(y)) · ω1 (x) − ω2 (x)
P P
2
= d ω1 , ω2 .
Proof. We use the notation and results from the proof of Proposition 4.5.1. We
first prove the inequality (≤). Let X> = {x ∈ X | ω1 (x) > ω2 (x)}. Then:
≤ ω2 (x) + σ |= (1 x ⊗ 1) & Eq ⊥ .
267
268 Chapter 4. Observables and validity
have:
ω2 (x) = σ(y, x) = σ(x, x) + y,x σ(y, x)
P P
x
≤ ω1 (x) + σ |= (1 ⊗ 1 x ) & Eq ⊥ .
Hence ω2 (x) − ω1 (x) for x < X> . Putting this together gives:
P
d(ω1 , ω2 ) = 12 x∈X ω1 (x) − ω3 (x)
= 12 x∈X> ω1 (x) − ω2 (x) + 12 x∈¬X> ω2 (x) − ω1 (x)
P P
= 1
2 σ |= (1X> ⊗ 1) & Eq ⊥ + 1
2 σ |= (1 ⊗ 1¬X> ) & Eq ⊥
≤ 1
2 σ |= Eq ⊥ + 1
2 σ |= Eq ⊥
= σ |= Eq ⊥ .
For the inequality (≥) one uses what is called an optimal coupling ρ ∈ D(X ×
X) of ω1 , ω2 . It can be defined as:
min ω1 (x), ω2 (x) if x = y
ρ(x, y) B
max ω1 (x) − ω2 (x), 0 · max ω2 (y) − ω1 (y), 0
otherwise.
d(ω1 , ω2 )
(4.17)
We first check that this ρ is a coupling. Let x ∈ X> so that ω1 (x) > ω2 (x); then:
y ρ(x, y)
P
X max ω2 (y) − ω1 (y), 0
= ω2 (x) + (ω1 (x) − ω2 (x)) ·
y,x
d(ω1 , ω2 )
y∈X< ω2 (y) − ω1 (y)
P
= ω2 (x) + (ω1 (x) − ω2 (x)) ·
d(ω1 , ω2 )
X↓
= ω2 (x) + (ω1 (x) − ω2 (x)) ·
d(ω1 , ω2 )
= ω2 (x) + (ω1 (x) − ω2 (x)) · 1 see the proof of Proposition 4.5.1
= ω1 (x).
If x < X> , so that ω1 (x) ≤ ω2 (x), then it is obvious that y ρ(x, y) = ω1 (x)+0 =
P
ω1 (x). This shows σ 1, 0 = ω1 . In a similar way one obtains σ 0, 1 = ω2 .
Finally,
ρ |= Eq = x ρ(x, x) = x min ω1 (x), ω2 (x)
P P
= ω2 (X> ) + 1 − ω1 (X> )
= 1 − ω1 (X> ) − ω2 (X> ) = 1 − d(ω1 , ω2 ).
Hence d(ω1 , ω2 ) = 1 − ρ |= Eq = ρ |= Eq ⊥ .
268
4.5. Validity-based distances 269
Figure 4.1 Visual comparance of distance and divergence between flip states, see
Remark 4.5.5 for details.
for r, s ∈ [0, 1], where, recall, flip(r) = r| 1 i + (1 − r)|0 i. The distance and
divergence are zero when r = s and increases on both sides of the diagonal.
The distance ascends via straight planes, but the divergence has a more baroque
shape.
Remark 4.5.6. Let a distribution ω ∈ D(X) be given. One can ask: is there a
squence of natural multisets ϕK ∈ N[K](X) with Flrn(ϕK ) getting closer and
closer, in the total variation distance d, as K goes to infinity?
The answer is yes. Here is one way to do it. Assume the distribution ω has
269
270 Chapter 4. Observables and validity
• ω0 = 0 and σ1 = 1| x2 i
• ω1 = 61 | x1 i + 12 | x2 i + 31 | x3 i and σ2 = 1| x2 i + 1| x3 i
• ω2 = 13 | x1 i + 1| x2 i + 23 | x3 i and σ3 = 1| x1 i + 1| x2 i + 1| x3 i
• ω3 = 21 | x1 i + 32 | x2 i + 1| x3 i and σ4 = 1| x1 i + 2| x2 i + 1| x3 i
• ω4 = 32 | x1 i + 2| x2 i + 43 | x3 i and σ5 = 1| x1 i + 2| x2 i + 2| x3 i
• ω5 = 65 | x1 i + 52 | x2 i + 35 | x3 i and σ6 = 1| x1 i + 3| x2 i + 2| x3 i.
• ϕ10 = 2| x1 i + 5| x2 i + 3| x3 i
• ϕ100 = 15| x1 i + 56| x2 i + 29| x3 i
• ϕ1000 = 143| x1 i + 571| x2 i + 286| x3 i
• ϕ10000 = 1429| x1 i + 5713| x2 i + 2858| x3 i.
Theorem 4.5.7. 1 If X is a finite set, then D(X), with the total variation dis-
tance d, is a complete metric space.
2 For an arbitrary set X, the set D(X) has a dense subset:
[
D[K](X) ⊆ D(X) where D[K](X) B {Flrn(ϕ) | ϕ ∈ N[K](X)}.
K∈N
270
4.5. Validity-based distances 271
The combination of the these two points says that for a finite set X, the set
D(X) of distributions on X is a Polish space: a complete metric space with a
countable dense subset, see ?? for more details.
Hence, the sequence ωi (xn ) ∈ [0, 1] is a Cauchy sequence, say with limit
rn ∈ [0, 1]. Take ω = n rn | xn i ∈ D(X). This is the limit of the distributions
P
ωi .
2 Let ω ∈ D(X) and ε > 0. We need to find a multiset ϕ ∈ N(X) with
d(ω, Flrn(ϕ)) < ε. There is a systematic way to find such multisets via the
decimal representation of the probabilities in ω. This works as follows. As-
sume we have:
ω = 0.383914217 . . . a + 0.406475610 . . . b + 0.209610173 . . . c .
For each n we chop off after n decimals and multiply with 10n , giving:
ϕ1 B 3 a + 4 b + 2 c with d ω, Flrn(ϕ1 ) ≤ 21 · 3 · 10−1
ϕ2 B 38 a + 40 b + 20 c with d ω, Flrn(ϕ2 ) ≤ 21 · 3 · 10−2
ϕ3 B 383 a + 406 b + 209 c with d ω, Flrn(ϕ3 ) ≤ 21 · 3 · 10−3
etc.
sufficiently large.
When X is finite, say with
n elements, we know from Proposition 1.7.4
that N[K](X) contains Kn multisets, so that D[K](X) contains Kn distri-
S
butions. Hence a countable union K D[K](X) of such finite sets is count-
able.
271
272 Chapter 4. Observables and validity
Figure 4.2 Fractional distributions from D[K] {a, b, c} as points in the cube
3K = 20 on the left and K = 50 on the right. The plot
[0, 1]3 , for 3 on
the left
contains 20 = 231 dots (distributions) and the one on the right 50 = 1326,
see Proposition 1.7.4.
This distance function d makes the set Pred (X) into a metric space.
(4.15)
_
d(p1 , p2 ) = ω |= p − ω |= p
1 2
ω∈D(X)
≥ unit(x) |= p1 − unit(x) |= p2
= p (x) − p (x) .
1 2
_
Hence d(p1 , p2 ) ≥ p (x) − p (x) .
1 2
x
272
4.5. Validity-based distances 273
Exercises
4.5.1 1 Prove that for states ω, ω0 ∈ D(X) and ρ, ρ0 ∈ D(Y) there is an
inequality:
d ω ⊗ ρ, ω0 ⊗ ρ0 ≤ d ω, ω0 + d ρ, ρ0 .
273
274 Chapter 4. Observables and validity
Show that:
274
5
275
276 Chapter 5. Variance and covariance
This short chapter first introduces the basic definitions and results for vari-
ance, and for covariance in shared-state form. They are applied to draw distri-
butions, from Chapter 3, in Section 5.2. The joint-state version of covariance is
introduced in Section 5.3 and illustrated in several examples. Section 5.4 then
establishes the equivalence between:
p − (ω |= p) · 1 2 = p − (ω |= p) · 1 & p − (ω |= p) · 1 ∈ Obs(X).
1 Subtraction expressions like these occur more frequently in mathematics. For instance, an
eigenvalue λ of a matrix M may be defined as the scalar that forms a solution to the equation
M − λ · 1 = 0, where 1 is the identity matrix. A similar expression is used to define the
elements in the spectrum of a C ∗ -algebra. See also Excercise 4.2.12.
276
5.1. Variance and shared-state covariance 277
R≥0 . It is thus a factor. Its validity in the original state ω is called variance. It
captures how far the values of p are spread out from their expected value.
Definition 5.1.1. For a random variable (ω, p), the variance Var(ω, p) is the
non-negative number defined by:
   Var(ω, p) ≔ ω |= (p − (ω |= p) · 1)².
   Var(ω, p) = (ω |= p²) − (ω |= p)².
   ω |= p² ≥ (ω |= p)².
Proof. We have:
   Var(ω, p)
      = ω |= (p − (ω |= p) · 1)²
      = Σ_x ω(x) · (p(x) − (ω |= p))²
      = Σ_x ω(x) · (p(x)² − 2(ω |= p)·p(x) + (ω |= p)²)
      = Σ_x ω(x)·p(x)² − 2(ω |= p)·Σ_x ω(x)·p(x) + Σ_x ω(x)·(ω |= p)²
      = (ω |= p²) − 2(ω |= p)(ω |= p) + (ω |= p)²
      = (ω |= p²) − (ω |= p)².
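As a quick numerical sanity check of this identity, here is a small Python sketch; the state and the observable are made up for illustration only.

    # a small distribution (state) and an observable on the same space
    omega = {'a': 0.25, 'b': 0.5, 'c': 0.25}
    p = {'a': 1.0, 'b': 4.0, 'c': 6.0}

    def validity(state, obs):
        # omega |= p  =  sum_x omega(x) * p(x)
        return sum(state[x] * obs[x] for x in state)

    mean = validity(omega, p)

    # variance via the definition:  omega |= (p - (omega |= p) * 1)^2
    var_def = validity(omega, {x: (p[x] - mean) ** 2 for x in omega})

    # variance via the lemma:  (omega |= p^2) - (omega |= p)^2
    var_lem = validity(omega, {x: p[x] ** 2 for x in omega}) - mean ** 2

    print(var_def, var_lem)   # both print 3.1875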
• The two random variables are of the form (ω, p1) and (ω, p2), where they
share their state ω.
• There is a joint state τ ∈ D(X1 × X2) together with two observables q1 : X1 → R
and q2 : X2 → R on the two components X1, X2. This situation can be seen
as a special case of the previous point by first weakening the two observables
to the product space, via: π1 = q1 = q1 ⊗ 1 and π2 = q2 = 1 ⊗ q2. Thus we
obtain two random variables with a shared state:
   (τ, π1 = q1)   and   (τ, π2 = q2).
   Cov(ω, p1, p2) ≔ ω |= (p1 − (ω |= p1) · 1) & (p2 − (ω |= p2) · 1).
   Cov(ω, p1, p2) = (ω |= p1 & p2) − (ω |= p1) · (ω |= p2).
Example 5.1.6. 1 We have seen in Definition 4.1.2 (2) how the average of
an observable can be computed as its validity in a uniform state. The same
approach is used to compute the covariance (and correlation) in a uniform
joint state. Consider the following two lists a and b of numerical data, of the
same length.
We will calculate the covariance between a and b wrt. the uniform state
unif5, as:
   Cov(unif5, a, b) = Σ_i 1/5 · (a(i) − 15) · (b(i) − 11) = 11.
For the correlation we also need the two variances:
   Var(unif5, a) = Σ_i 1/5 · (a(i) − 15)² = 50
   Var(unif5, b) = Σ_i 1/5 · (b(i) − 11)² = 5.6.
Then:
   Cor(unif5, a, b) = Cov(unif5, a, b) / ( √Var(unif5, a) · √Var(unif5, b) ) = 11 / ( √50 · √5.6 ) ≈ 0.66.
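This computation is easy to mechanise. The data lists a, b from the example are not reproduced here, so the following Python sketch uses two made-up lists of length five; only the pattern of the calculation (validity in the uniform state) matters.

    # two illustrative data lists (not the ones from the example)
    a = [12, 14, 15, 16, 18]
    b = [8, 12, 10, 11, 14]
    n = len(a)

    def mean(xs):
        # validity of the list, viewed as observable, in the uniform state
        return sum(xs) / len(xs)

    ma, mb = mean(a), mean(b)
    cov = sum((a[i] - ma) * (b[i] - mb) for i in range(n)) / n
    var_a = sum((x - ma) ** 2 for x in a) / n
    var_b = sum((x - mb) ** 2 for x in b) / n
    cor = cov / (var_a ** 0.5 * var_b ** 0.5)
    print(cov, cor)   # 3.4 and 0.85 for these illustrative lists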
The next result collects several linearity properties for (co)variance and cor-
relation. This shows that one can do a lot of re-scaling and stretching of ob-
servables without changing the outcome.
1 Covariance satisfies:
Cov(ω, p1 , p2 ) = Cov(ω, p2 , p1 )
Cov(ω, p1 , 1) = 0
Cov(ω, r · p1 , p2 ) = r · Cov(ω, p1 , p2 )
Cov(ω, p, p1 + p2 ) = Cov(ω, p, p1 ) + Cov(ω, p, p2 )
Cov(ω, p1 + r · 1, p2 + s · 1) = Cov(ω, p1 , p2 ).
2 Variance satisfies:
Var(ω, r · p) = r2 · Var(ω, p)
Var(ω, p + r · 1) = Var(ω, p)
Var(ω, p1 + p2 ) = Var(ω, p1 ) + 2 · Cov(ω, p1 , p2 ) + Var(ω, p2 ).
   Cov(ω, r · p1, p2) = (ω |= (r · p1) & p2) − (ω |= r · p1) · (ω |= p2)
      = r · (ω |= p1 & p2) − r · (ω |= p1) · (ω |= p2)
      = r · Cov(ω, p1, p2).
   Cov(ω, p, p1 + p2)
      = (ω |= p & (p1 + p2)) − (ω |= p) · (ω |= p1 + p2)
      = (ω |= (p & p1) + (p & p2)) − (ω |= p) · ((ω |= p1) + (ω |= p2))
      = (ω |= p & p1) + (ω |= p & p2) − (ω |= p) · (ω |= p1) − (ω |= p) · (ω |= p2)
      = Cov(ω, p, p1) + Cov(ω, p, p2)
   Var(ω, p1 + p2)
      = Cov(ω, p1 + p2, p1 + p2)
      = Cov(ω, p1 + p2, p1) + Cov(ω, p1 + p2, p2)
      = Cov(ω, p1, p1) + Cov(ω, p2, p1) + Cov(ω, p1, p2) + Cov(ω, p2, p2)
      = Var(ω, p1) + 2 · Cov(ω, p1, p2) + Var(ω, p2).
   Cor(ω, r · p1, s · p2) = Cov(ω, r · p1, s · p2) / ( √Var(ω, r · p1) · √Var(ω, s · p2) )
      = r · s · Cov(ω, p1, p2) / ( √(r² · Var(ω, p1)) · √(s² · Var(ω, p2)) )
      = r · s · Cov(ω, p1, p2) / ( |r| · √Var(ω, p1) · |s| · √Var(ω, p2) )
      = Cor(ω, p1, p2)    if r, s have the same sign
      = −Cor(ω, p1, p2)   otherwise.
(The same sign means: either both r ≥ 0 and s ≥ 0, or both r ≤ 0 and s ≤ 0.)
The final equation Cor(ω, p1 + r · 1, p2 + s · 1) = Cor(ω, p1, p2) holds since
both variance and covariance are unchanged under the addition of constants.
Exercises
5.1.1 Let ω be a state and p be a factor on the same set. Define for v ∈ R≥0 ,
   f(v) ≔ ω |= (p − v · 1)².
Show that the function f : R≥0 → R≥0 takes its minimum value at
ω |= p.
5.1.2 Let τ ∈ D(X × Y) be a joint state with an observable p : X → R. We
can turn them into a random variable in two ways, by marginalisation
and weakening:
   (τ[1, 0], p)   and   (τ, π1 = p),
where π1 = p = p ⊗ 1 is an observable on X × Y.
These two random variables have the same expected value, by (4.8).
Show that they have the same variance too:
Use this inequality to prove that correlation is in the interval [−1, 1].
5.1.8 Let ω ∈ D(X) be a state with an n-test ~p = p1, . . . , pn, see Definition 4.2.2.
1 Define the (symmetric) covariance matrix as the n × n matrix:
   CovMat(ω, ~p) ≔ ( Cov(ω, pi, pj) )_{1 ≤ i, j ≤ n},
with first row Cov(ω, p1, p1) · · · Cov(ω, p1, pn) and last row Cov(ω, pn, p1) · · · Cov(ω, pn, pn).
Prove that all rows and all columns add up to 0 and that the entries
on the diagonal are non-negative and have a sum below 1.
2 Next consider the vector v of validities and the symmetric matrix A
of conjunctions:
   v ≔ ( ω |= p1, . . . , ω |= pn )ᵀ    A ≔ ( ω |= pi & pj )_{1 ≤ i, j ≤ n}.
   a ≔ (1/n) Σ_i ai    b ≔ (1/n) Σ_i bi.
2 Show that one can also write the slope v̂ of the best line as:
   v̂ = ( Σ_i (ai − a)(bi − b) ) / ( Σ_i (ai − a)² ).
   Var(mn[K](ω), πy) = K · ω(y) · (1 − ω(y)).
   Var(hg[K](ψ), πy) = K · (L−K)/(L−1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).
   Var(pl[K](ψ), πy) = K · (L+K)/(L+1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).
2 For the hypergeometric case:
   Var(hg[K](ψ), πy)
      = Σ_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y)² − ( Σ_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y) )²
      = K · Flrn(ψ)(y) · ( (K−1)·ψ(y) + (L−K) ) / (L−1) − ( K · Flrn(ψ)(y) )²
      = K · Flrn(ψ)(y) · ( ( L(K−1)·ψ(y) + L(L−K) ) / ( L(L−1) ) − ( (L−1)K·ψ(y) ) / ( L(L−1) ) )
      = K · Flrn(ψ)(y) · ( (L−K)·(L − ψ(y)) ) / ( L(L−1) )
      = K · (L−K)/(L−1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).
3 Similarly:
   Var(pl[K](ψ), πy)
      = Σ_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y)² − ( Σ_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y) )²
      = K · Flrn(ψ)(y) · ( (K−1)·ψ(y) + (L+K) ) / (L+1) − ( K · Flrn(ψ)(y) )²
      = K · Flrn(ψ)(y) · ( ( L(K−1)·ψ(y) + L(L+K) ) / ( L(L+1) ) − ( (L+1)K·ψ(y) ) / ( L(L+1) ) )
      = K · Flrn(ψ)(y) · ( (L+K)·(L − ψ(y)) ) / ( L(L+1) )
      = K · (L+K)/(L+1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).
With these pointwise results we can turn variances into multisets, like in
Definition 4.4.1. In the literature one finds descriptions of such variances as
vectors but they presuppose an ordering on the points of the underlying space.
The multiset description given below does not require such an ordering.
Definition 5.2.2. Let X be a set (of colours). The variances of the three draw
distributions are defined as multisets over X in:
   Var(mn[K](ω)) ≔ Σ_{y∈X} Var(mn[K](ω), πy) | y i
                 = K · Σ_{y∈X} ω(y) · (1 − ω(y)) | y i
   Var(hg[K](ψ)) ≔ Σ_{y∈X} Var(hg[K](ψ), πy) | y i
                 = K · (L−K)/(L−1) · Σ_{y∈X} Flrn(ψ)(y) · (1 − Flrn(ψ)(y)) | y i
   Var(pl[K](ψ)) ≔ Σ_{y∈X} Var(pl[K](ψ), πy) | y i
                 = K · (L+K)/(L+1) · Σ_{y∈X} Flrn(ψ)(y) · (1 − Flrn(ψ)(y)) | y i.
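The multinomial case of this definition can be checked by brute force. The following Python sketch (with an illustrative three-point distribution and a small draw size, both chosen arbitrarily) enumerates all draws of size K, computes the variance of each count πy directly, and compares it with K · ω(y) · (1 − ω(y)).

    from itertools import product
    from math import factorial, prod

    # an illustrative distribution omega on three colours and a draw size K
    omega = {'a': 0.5, 'b': 0.3, 'c': 0.2}
    K = 4
    xs = list(omega)

    def mn_prob(phi):
        # multinomial probability of the draw (multiset) phi of size K
        coef = factorial(K) // prod(factorial(n) for n in phi.values())
        return coef * prod(omega[x] ** phi[x] for x in xs)

    # enumerate all multisets of size K over xs as count vectors
    draws = [dict(zip(xs, ns)) for ns in product(range(K + 1), repeat=len(xs))
             if sum(ns) == K]

    for y in xs:
        mean = sum(mn_prob(phi) * phi[y] for phi in draws)
        second = sum(mn_prob(phi) * phi[y] ** 2 for phi in draws)
        var = second - mean ** 2
        print(y, var, K * omega[y] * (1 - omega[y]))  # the two numbers agree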
Proof. 1 Using the covariance formula of Lemma 5.1.5 and Exercise 3.3.7 we
get:
   Cov(mn[K](ω), πy, πz) = K·(K−1)·ω(y)·ω(z) − K·ω(y) · K·ω(z) = −K · ω(y) · ω(z).
2 Similarly:
   Cov(hg[K](ψ), πy, πz)
      = K·(K−1)·Flrn(ψ)(y) · ψ(z)/(L−1) − K·Flrn(ψ)(y) · K·Flrn(ψ)(z)
      = K·Flrn(ψ)(y) · ( L(K−1)·ψ(z) − K(L−1)·ψ(z) ) / ( L(L−1) )
      = −K · (L−K)/(L−1) · Flrn(ψ)(y) · Flrn(ψ)(z).
3 And:
   Cov(pl[K](ψ), πy, πz)
      = K·(K−1)·Flrn(ψ)(y) · ψ(z)/(L+1) − K·Flrn(ψ)(y) · K·Flrn(ψ)(z)
      = K·Flrn(ψ)(y) · ( L(K−1)·ψ(z) − K(L+1)·ψ(z) ) / ( L(L+1) )
      = −K · (L+K)/(L+1) · Flrn(ψ)(y) · Flrn(ψ)(z).
Definition 5.2.4. For a set X we define the covariances of the three draw distributions as 2-dimensional multisets over X × X.
On the diagonal in these joint multisets, where y = z, there are the variances.
When the elements of the space X are ordered, these covariance multisets can
be seen as matrices. We elaborate an illustration.
Example 5.2.5. Consider a group of 50 people of which 25 vote for the green
party (G), 15 vote liberal (L) and the remaining 10 vote for the christian-democratic
party (C). We thus have a set of vote options V = {G, L, C} with a
(natural) voter multiset ν = 25|G i + 15| L i + 10|C i.
We select five people from the group and look at their votes. These five
people are obtained in hypergeometric mode, where selected individuals step
out of the group and are no longer available for subsequent selection.
The hypergeometric mean is introduced in Definition 4.4.1. It gives the multiset:
   mean(hg[5](ν)) = 5 · Flrn(ν) = 5/2 |G i + 3/2 | L i + 1|C i.
The deviations from the mean are given by the variance, see Definition 5.2.2:
   Var(hg[5](ν)) = 5 · (50 − 5)/(50 − 1) · Σ_{x∈V} Flrn(ν)(x) · (1 − Flrn(ν)(x)) | x i
      = 225/196 |G i + 27/28 | L i + 36/49 |C i
      ≈ 1.148|G i + 0.9643| L i + 0.7347|C i.
The covariances give a 2-dimensional multiset over V × V, see Definition 5.2.4:
   Cov(hg[5](ν)) = 225/196 |G, G i − 135/196 |G, L i − 45/98 |G, C i
                 − 135/196 | L, G i + 27/28 | L, L i − 27/98 | L, C i
                 − 45/98 |C, G i − 27/98 |C, L i + 36/49 |C, C i
               ≈ 1.148|G, G i − 0.6888|G, L i − 0.4592|G, C i
                 − 0.6888| L, G i + 0.9643| L, L i − 0.2755| L, C i
                 − 0.4592|C, G i − 0.2755|C, L i + 0.7347|C, C i.
When these covariances are seen as a matrix, we recognise that the matrix is
symmetric and has the variances on its diagonal.
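These numbers are easy to recompute. The Python sketch below (an illustration only) evaluates the formulas from Definitions 5.2.2 and 5.2.4 for this urn.

    # urn of 50 voters: 25 Green, 15 Liberal, 10 Christian-democratic
    nu = {'G': 25, 'L': 15, 'C': 10}
    K = 5
    L_size = sum(nu.values())
    flrn = {x: n / L_size for x, n in nu.items()}
    factor = K * (L_size - K) / (L_size - 1)

    # hypergeometric mean and variance, as multisets over the vote options
    mean = {x: K * flrn[x] for x in nu}
    var = {x: factor * flrn[x] * (1 - flrn[x]) for x in nu}

    # covariance multiset over pairs: variances on the diagonal,
    # -K*(L-K)/(L-1)*Flrn(nu)(y)*Flrn(nu)(z) off the diagonal
    cov = {}
    for y in nu:
        for z in nu:
            cov[(y, z)] = var[y] if y == z else -factor * flrn[y] * flrn[z]

    print(mean)             # {'G': 2.5, 'L': 1.5, 'C': 1.0}
    print(var)              # approx 1.148, 0.964, 0.735
    print(cov[('G', 'L')])  # approx -0.689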
Var mn[K](ω), p +
2 Similarly.
on R via:
   dval[K](ω, p) ≔ D(Flrn(−) |= p)(mn[K](ω)) = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) | Flrn(ϕ) |= p i
   dvar[K](ω, p) ≔ D(Var(Flrn(−), p))(mn[K](ω)) = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) | Var(Flrn(ϕ), p) i.    (5.1)
Similarly,
   dvar[2](ω, p) = 1/36 | Var(1| a i, p) i + 1/6 | Var(1/2| a i + 1/2| b i, p) i + 1/4 | Var(1| b i, p) i
      + 1/9 | Var(1/2| a i + 1/2| c i, p) i + 1/3 | Var(1/2| b i + 1/2| c i, p) i + 1/9 | Var(1| c i, p) i
   = 1/36 | 6² − 6² i + 1/6 | 26 − 5² i + 1/4 | 4² − 4² i + 1/9 | 90 − 9² i + 1/3 | 80 − 8² i + 1/9 | 12² − 12² i
   = 7/18 | 0 i + 1/6 | 1 i + 1/9 | 9 i + 1/3 | 16 i.
The rectangle on the right commutes by Exercise 4.1.8 (3), and the triangle
on the left by Theorem 3.3.5.
2 The second item requires more work. Let’s assume supp(ω) = {x1 , . . . , xn }.
We first prove the auxiliary result (∗) below, via the Multinomial Theorem (1.27)
and Exercise 3.3.7, the latter denoted below by the marked equation (E):
   Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · (Flrn(ϕ) |= p)² = ( (K−1)·(ω |= p)² + (ω |= p²) ) / K.    (∗)
We reason as follows.
   Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · (Flrn(ϕ) |= p)²
      = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · ( Σ_i (ϕ(xi)/K) · p(xi) )²
  (1.27)
      = (1/K²) · Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · Σ_{ψ∈N[2]({1,...,n})} (ψ) · Π_i (ϕ(xi) · p(xi))^{ψ(i)}
      = (1/K²) · ( 2 · Σ_{i<j} Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · ϕ(xi) · p(xi) · ϕ(xj) · p(xj)
                   + Σ_i Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · ϕ(xi)² · p(xi)² )
  (E)
      = (1/K²) · ( 2 · Σ_{i<j} K·(K−1) · ω(xi) · p(xi) · ω(xj) · p(xj)
                   + Σ_i ( K·(K−1) · ω(xi)² · p(xi)² + K · ω(xi) · p(xi)² ) )
  (1.27)
      = ( (K−1)/K ) · ( Σ_i ω(xi) · p(xi) )² + (1/K) · Σ_i ω(xi) · p(xi)²
      = ( (K−1)·(ω |= p)² + (ω |= p²) ) / K.
Now we are ready to prove the formula for the variance of the distribution
of validities in item (2) of the proposition. We use item (1) and the auxiliary
equation (∗):
   Var(dval[K](ω, p))
      = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · (Flrn(ϕ) |= p)² − mean(dval[K](ω, p))²
  (∗)
      = ( (K−1)·(ω |= p)² + (ω |= p²) ) / K − (ω |= p)²
      = ( −(ω |= p)² + (ω |= p²) ) / K
      = Var(ω, p) / K.
Exercises
5.2.1 In the distribution of validities (5.1) we have used multinomial sam-
ples. We can also use hypergeometric ones. Show that in that case the
analogue of Proposition 5.2.7 (1) still holds: for an urn υ ∈ N(X), an
observable p : X → R and a number K ≤ kυk,
   mean( D(Flrn(−) |= p)(hg[K](υ)) ) = Flrn(υ) |= p.
JCov(τ, q1 , q2 ) B Cov(τ, π1 = q1 , π2 = q2 ).
JCor(τ, q1 , q2 ) B Cor(τ, π1 = q1 , π2 = q2 ).
In both these cases, if there are inclusions X1 ,→ R and X2 ,→ R, then one can
use these inclusions as random variables and write just JCov(τ) and JCor(τ).
   JCov(τ, q1, q2) = τ |= ( q1 − (τ[1, 0] |= q1) · 1 ) ⊗ ( q2 − (τ[0, 1] |= q2) · 1 )
                   = (τ |= q1 ⊗ q2) − (τ[1, 0] |= q1) · (τ[0, 1] |= q2).
We see that if the joint state τ is non-entwined, that is, if it is the product
of its marginals τ[1, 0] and τ[0, 1], then the joint covariance is 0, whatever
the observables q1, q2 are. This situation will be investigated further in the next
section.
   JCov(τ, q1, q2) = Cov(τ, π1 = q1, π2 = q2)
      = τ |= ( (π1 = q1) − (τ |= π1 = q1) · 1 ) & ( (π2 = q2) − (τ |= π2 = q2) · 1 )
      = τ |= ( (π1 = q1) − (τ[1, 0] |= q1) · (π1 = 1) ) & ( (π2 = q2) − (τ[0, 1] |= q2) · (π2 = 1) )
      = τ |= ( q1 − (τ[1, 0] |= q1) · 1 ) ⊗ ( q2 − (τ[0, 1] |= q2) · 1 ).
   JCov(τ, q1, q2)
      = τ |= ( q1 − (τ[1, 0] |= q1) · 1 ) ⊗ ( q2 − (τ[0, 1] |= q2) · 1 )
Now we can compute the joint covariance in the joint state τ as:
   JCov(τ) = τ |= ( incl − mean(τ[1, 0]) · 1 ) ⊗ ( incl − mean(τ[0, 1]) · 1 )
      = Σ_{x,y} τ(x, y) · (x − 3/2) · (y − 2)
      = 1/4 · ( (−1/2) · (−1) + (1/2) · 1 )
      = 1/4.
2 In order to calculate the (joint) correlation of τ we first need the variances
of its marginals:
   Var(τ[1, 0]) = Σ_x τ[1, 0](x) · (x − 3/2)² = 1/4
   Var(τ[0, 1]) = Σ_y τ[0, 1](y) · (y − 2)² = 1/2.
Then:
   JCor(τ) = JCov(τ) / ( √Var(τ[1, 0]) · √Var(τ[0, 1]) ) = (1/4) / ( (1/2) · (1/√2) ) = (1/2)·√2 ≈ 0.707.
Proof. We prove the second equation since the first one is a special case,
namely when c, d are identity channels. Via Lemma 5.3.2 we get:
   JCov(hc, di = ω, p, q)
      = hc, di = ω |= ( p − ((hc, di = ω)[1, 0] |= p) · 1 ) ⊗ ( q − ((hc, di = ω)[0, 1] |= q) · 1 )
      = ω |= hc, di = ( (p − (c = ω |= p) · 1) ⊗ (q − (d = ω |= q) · 1) )
      = ω |= ( c = (p − (ω |= c = p) · 1) ) & ( d = (q − (ω |= d = q) · 1) )
        by Exercise 4.3.8
      = ω |= ( (c = p) − (ω |= c = p) · 1 ) & ( (d = q) − (ω |= d = q) · 1 )
JCov(τ, q1 , q2 ) = JCov(τ, q2 , q1 )
JCov(τ, q1 , 1) = 0
JCov(τ, r · q1 , q2 ) = r · JCov(τ, q1 , q2 )
   JCov(τ, q1, q2 + q3) = JCov(τ, q1, q2) + JCov(τ, q1, q3).
JCov(τ, q1 + r · 1, q2 + s · 1) = JCov(τ, q1 , q2 ).
Proof. This follows directly from Theorem 5.1.7, using that predicate trans-
formation πi = (−) is linear and thus preserves sums and scalar multiplications
(and also truth), see Lemma 4.3.2 (2).
We conclude this section with a medical example about the correlation be-
tween disease and test.
Example 5.3.6. We start with a space D = {d, d⊥} for occurrence of a disease
or not (for a particular person) and a space T = {p, n} for a positive or negative
test outcome. Prevalence is used to indicate the prior likelihood of occurrence
of the disease, for instance in the whole population, before a test. It can be
described via a flip-like channel:
   prev : [0, 1] → D    with    prev(r) ≔ r| d i + (1 − r)| d⊥ i.
We assume that there is a test for the disease with the following characteristics.
• (‘sensitivity’) If someone has the disease, then the test is positive with prob-
ability of 90%.
• (‘specificity’) If someone does not have the disease, there is a 95% chance
that the test is negative.
We formalise this via a channel test : D → T with:
   test(d) = 9/10 | p i + 1/10 | n i    test(d⊥) = 1/20 | p i + 19/20 | n i.
Exercise 5.3.3 below tells us that it does not really matter which observables we
choose, so we simply take 1d : D → [0, 1] and 1p : T → [0, 1], mapping d
and p to 1, and d⊥ and n to 0. We are thus interested in the (joint-state) correlation
function:
   [0, 1] ∋ r ↦ JCor(joint(r), 1d, 1p) ∈ [−1, 1].
This is plotted in Figure 5.1, on the left. We see that, with the sensitivity and
specificity values as given above, there is a clear positive correlation between
disease and test, but less so in the corner cases with minimal and maximal
prevalence.
We now fix a prevalence of 20% and wish to understand correlation as a
function of sensitivity and specificity. We thus parameterise the above test
channel to test(se, sp) : D → T with parameters se, sp ∈ [0, 1].
test(se, sp)(d) = se| p i + (1 − se)| n i
test(se, sp)(d⊥ ) = (1 − sp)| p i + sp| n i.
Exercises
5.3.1 Find examples of covariance and correlation computations in the liter-
ature (or online) and determine if they are of shared-state or joint-state
form.
5.3.2 Prove that:
   JCor(τ, q1, q2) = JCov(τ, q1, q2) / ( StDev(τ[1, 0], q1) · StDev(τ[0, 1], q2) ).
JCor(τ, p1 , p2 ) = ±JCor(τ, q1 , q2 ),
But this says that the joint state hp1 , p2 i = ω on R × R, transformed along
hp1 , p2 i : X → R×R, is non-entwined: it is the product of its marginals. Indeed,
its (first) marginal is:
   (hp1, p2i = ω)[1, 0] = π1 = (hp1, p2i = ω)
      = (π1 ◦· hp1, p2i) = ω
      = p1 = ω.
Definition 5.4.1. Let (ω, p1 ) and (ω, p2 ) be two random variables with a com-
mon, shared state ω. They will be called independent if the transformed state
hp1 , p2 i = ω on R × R is non-entwined, as in Equation (5.4): it is the product
of its marginals:
It is not hard to see that the parallel product ⊗ of these two marginal distribu-
tions differs from hp1 , p2 i = ω.
Proposition 5.4.3. The shared-state covariance of shared-state independent
random variables is zero: if random variables (ω, p1 ) and (ω, p2 ) are indepen-
dent, then Cov(ω, p1 , p2 ) = 0.
The converse does not hold.
Proof. If (ω, p1) and (ω, p2) are independent, then, by definition, hp1, p2i = ω
= (p1 = ω) ⊗ (p2 = ω). The calculation below shows that the covariance is
then zero. It uses multiplication & : R × R → R as observable. It can also be
described as the parallel product id ⊗ id of the observable id : R → R with
itself.
   Cov(ω, p1, p2)
      = (ω |= p1 & p2) − (ω |= p1) · (ω |= p2)                          by Lemma 5.1.5
      = (ω |= & ◦ hp1, p2i) − (p1 = ω |= id) · (p2 = ω |= id)           by Exercise 4.1.4
      = (hp1, p2i = ω |= &) − ( (p1 = ω) ⊗ (p2 = ω) |= id ⊗ id )        by Lemma 4.2.8, 4.1.4
      = (hp1, p2i = ω |= &) − (hp1, p2i = ω |= &)                        by assumption
      = 0.
The claim that the converse does not hold follows from Example 5.4.4, right
below.
Example 5.4.4. We continue Example 5.4.2. The set-up used there involves
two dependent random variables (ω, p1 ) and (ω, p2 ), with shared state ω =
flip ⊗ flip. We show here that they (nevertheless) have covariance zero. This
proves the second claim of Proposition 5.4.3, namely that zero-covariance need
not imply independence, in the shared-state context.
We first compute the validities (expected values):
   ω |= p1 = 1/4 · p1(1, 1) + 1/4 · p1(1, 0) + 1/4 · p1(0, 1) + 1/4 · p1(0, 0)
           = 1/4 · 100 + 1/4 · 100 + 1/4 · 50 + 1/4 · 50 = 75
   ω |= p2 = 1/4 · 100 + 1/4 · (−100) + 1/4 · 50 + 1/4 · (−50) = 0.
Then:
   Cov(ω, p1, p2)
      = ω |= ( p1 − (ω |= p1) · 1 ) & ( p2 − (ω |= p2) · 1 )
      = 1/4 · ( (100−75) · 100 + (100−75) · (−100) + (50−75) · 50 + (50−75) · (−50) )
      = 0.
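Both claims are easy to verify mechanically. The Python sketch below (illustration only) recomputes the covariance and also checks that the transformed joint state on R × R is not the product of its marginals.

    from itertools import product

    # shared state: two independent fair coin flips
    omega = {(x, y): 0.25 for x, y in product([1, 0], repeat=2)}
    # the two observables from the example
    p1 = {(1, 1): 100, (1, 0): 100, (0, 1): 50, (0, 0): 50}
    p2 = {(1, 1): 100, (1, 0): -100, (0, 1): 50, (0, 0): -50}

    def validity(state, obs):
        return sum(state[x] * obs[x] for x in state)

    e1, e2 = validity(omega, p1), validity(omega, p2)
    cov = validity(omega, {x: (p1[x] - e1) * (p2[x] - e2) for x in omega})
    print(e1, e2, cov)   # 75.0  0.0  0.0

    # yet the transformed joint state along <p1, p2> is entwined:
    joint = {}
    for x, w in omega.items():
        joint[(p1[x], p2[x])] = joint.get((p1[x], p2[x]), 0) + w
    marg1 = {a: sum(w for (u, v), w in joint.items() if u == a) for a, _ in joint}
    marg2 = {b: sum(w for (u, v), w in joint.items() if v == b) for _, b in joint}
    product_state = {(a, b): marg1[a] * marg2[b] for a in marg1 for b in marg2}
    print(joint != product_state)   # True: zero covariance, but not independent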
We now turn to independence in joint-state form, in analogy with joint-state
covariance in Definition 5.3.1.
Definition 5.4.5. Let τ ∈ D(X1 × X2) be a joint state with two observables
q1 ∈ Obs(X1) and q2 ∈ Obs(X2). We say that there is joint-state independence
of q1, q2 if the two random variables (τ, π1 = q1) and (τ, π2 = q2) are (shared-state)
independent, as described in Definition 5.4.1.
Concretely, this means that:
   (q1 ⊗ q2) = τ = ( q1 = τ[1, 0] ) ⊗ ( q2 = τ[0, 1] ).    (5.5)
Equation (5.5) is an instance of the formulation (5.4) used in Definition 5.4.1
since πi = qi = qi ◦ πi and:
   hq1 ◦ π1, q2 ◦ π2i = τ = D(q1 × q2)(τ) = (q1 ⊗ q2) = τ
   (q1 ◦ π1) = τ ⊗ (q2 ◦ π2) = τ = ( q1 = (π1 = τ) ) ⊗ ( q2 = (π2 = τ) )
                                  = ( q1 = τ[1, 0] ) ⊗ ( q2 = τ[0, 1] ).
= τ |= q1 ⊗ q2 .
In the same way one proves ( (q1 = τ1) ⊗ (q2 = τ2) |= & ) = (τ1 |= q1) · (τ2 |= q2).
But then we are done via the formulation of binary covariance from Lemma 5.3.2:
   JCov(τ, q1, q2) = (τ |= q1 ⊗ q2) − (τ1 |= q1) · (τ2 |= q2) = 0.
(3) ⇒ (1). Let the joint-state covariance JCov(τ, q1, q2) be zero for all observables
q1, q2. In order to prove that τ is non-entwined, we have to show τ(x, y) =
τ1(x) · τ2(y) for all (x, y) ∈ X1 × X2. We choose as observables the point predicates
1x and 1y and use again the formulation of covariance from Lemma 5.3.2.
Since, by assumption, this covariance is zero, we get:
   τ(x, y) = τ |= 1x ⊗ 1y = (τ1 |= 1x) · (τ2 |= 1y) = τ1(x) · τ2(y).
In essence this result says that joint-state independence and joint-state co-
variance being zero are not properties of observables, but of joint states.
Exercises
5.4.1 Prove, in the setting of Definition 5.4.5, that the first marginal
( (q1 ⊗ q2) = τ )[1, 0] of the transformed state along q1 ⊗ q2 is equal to the
transformed marginal q1 = τ[1, 0].
Proof. 1 Because:
   ω |= p ≥ ω |= p & (p ≥ a) ≥ ω |= (a · 1) & (p ≥ a)
      = ω |= a · (p ≥ a)
      = a · ( ω |= (p ≥ a) ).
   ⟹ q(x)² ≥ a² ⟺ (q² ≥ a²)(x) = 1.
   ≤ a² · ( ω |= (q² ≥ a²) )
   ≤ ω |= q²    by item (1)
   = Var(ω, p).
We turn to the weak law of large numbers, in binary form. To start, fix the
single and parallel coin states:
   sum_n : 2ⁿ → [0, 1]    defined by    sum_n(x1, . . . , xn) ≔ (x1 + · · · + xn) / n.
The predicate sum n captures the average number of heads (as 1) in n coin flips.
Then:
   σ |= sum_1 = 1/2 · 1 + 1/2 · 0 = 1/2
   σ² |= sum_2 = ( 1/4 · (1 + 1) + 1/4 · (1 + 0) + 1/4 · (0 + 1) + 1/4 · (0 + 0) ) / 2 = 1/2
   ...
   σⁿ |= sum_n = ( Σ_{1≤k≤n} k · (n over k) / 2ⁿ ) / n = (1/n) · mean(bn[n](1/2)) = 1/2.
For the last equation, see Example 4.1.4 (4). These equations express that in n
coin flips the average fraction of heads is 1/2. This makes perfect sense.
The weak law of large numbers involves a more subtle statement, namely
that for each ε > 0 the validity
   σⁿ |= ( |sum_n − (1/2) · 1| ≥ ε )    (5.7)
goes to zero as n goes to infinity. One may think that this convergence to zero
is obvious, but it is not, see Figure 5.2. The above validities (5.7) do go down,
but not monotonically.
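These validities can be computed exactly via binomial probabilities, as in the Python sketch below (an illustration; the value ε = 1/10 is chosen here only for concreteness and need not be the one used in Figure 5.2).

    from math import comb

    def lln_validity(n, eps):
        # sigma^n |= [ |sum_n - 1/2| >= eps ]  for the fair coin sigma
        return sum(comb(n, k) / 2**n
                   for k in range(n + 1)
                   if abs(k / n - 0.5) >= eps)

    for n in range(1, 13):
        # not monotone: e.g. n = 2 gives 0.5 while n = 3 gives 1.0
        print(n, lln_validity(n, 0.1))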
Theorem 5.5.2 (Weak law of large numbers). Using the fair coin σ and the
predicate sum_n : 2ⁿ → [0, 1] defined above, one has, for each ε > 0,
   lim_{n→∞} σⁿ |= ( |sum_n − (1/2) · 1| ≥ ε ) = 0.
   σⁿ |= sum_n & sum_n = (1/n²) · Σ_{i,j} σⁿ |= πi & πj
      = (1/n²) · ( Σ_i σⁿ |= πi & πi + Σ_{i,j, i≠j} σⁿ |= πi & πj )
      = (1/n²) · ( n/2 + (n² − n)/4 )
      = (n + 1)/(4n).
Hence, by Lemma 5.1.3,
   Var(σⁿ, sum_n) = (σⁿ |= sum_n & sum_n) − (σⁿ |= sum_n)² = (n+1)/(4n) − 1/4 = 1/(4n).
This proves the marked equation (∗), and thus the theorem.
Thus, the predicate acc_i yields the average number of occurrences of the element
xi in its input sequence. As one may expect, its value converges to ω(xi)
in increasing product states ωⁿ = ω ⊗ · · · ⊗ ω. This is the content of item (3)
below, which is the multivariate version of the weak law of large numbers.
   ωⁿ |= acc_i = ω(xi).
   Var(ωⁿ, acc_i) = ω(xi) · (1 − ω(xi)) / n.
where π_{xi} is the point evaluation map from (4.10). Then, using acc as a
deterministic channel,
   ωⁿ |= acc_i = (1/n) · ( ωⁿ |= acc = π_{xi} )
      = (1/n) · ( acc = iid[n](ω) |= π_{xi} )
      = (1/n) · ( mn[n](ω) |= π_{xi} )    by Theorem 3.3.1 (2)
      = ω(xi)                              by (4.11).
so that, for instance, acc(a, c, a, a, c, b)/6 = 1/2 | a i + 1/6 | b i + 1/3 | c i.
For a given state ω ∈ D(X) we can look at the total variation distance d, see
Subsection 4.5.1, as a predicate Xⁿ → [0, 1], given by:
   ~x ↦ d( acc(~x)/n, ω ) = 1/2 · Σ_{y∈X} | acc(~x)(y)/n − ω(y) |,
Theorem 5.5.4. For a state ω ∈ D(X) and a number ε > 0 one has:
   lim_{n→∞} ωⁿ |= ( d(acc(−)/n, ω) ≥ ε ) = 0.
Figure 5.3 Initial segment of the limit from Theorem 5.5.4, with ω = 1/4 | a i +
5/12 | b i + 1/3 | c i and ε = 1/10, from n = 1 to n = 12.
Now we can prove the limit result in the theorem via Proposition 5.5.3 (3):
   lim_{n→∞} ωⁿ |= ( d(acc(−)/n, ω) ≥ ε )
      ≤ lim_{n→∞} ωⁿ |= Σ_{y∈supp(ω)} ( |acc(−)(y)/n − ω(y)| ≥ 2ε/ℓ )
      = lim_{n→∞} Σ_{y∈supp(ω)} ωⁿ |= ( |acc(−)(y)/n − ω(y)| ≥ 2ε/ℓ )
      = 0.
Figure 5.3 gives an impression of the terms in a limit like in Theorem 5.5.4,
for ω = 1/4 | a i + 5/12 | b i + 1/3 | c i. The plot covers only a small initial fragment
because of the exponential growth of the sample space. For instance, the space
of the parallel product ω¹² has 3¹² elements, which is more than half a million.
This initial segment suggests that the validities go down, but the suggestion is
weak; Theorem 5.5.4 provides certainty.
Exercises
5.5.1 For a number a ∈ R, write 1≥a : R → [0, 1] for the sharp predicate
with 1≥a (r) = 1 iff r ≥ a. Consider an observable p : X → R as a
6
Updating distributions
ω| p |= q.
Often, the state ω before updating is called the prior or the a priori state,
whereas the updated state ω| p is called the posterior or the a posteriori state.
The posterior incorporates the evidence given by the factor p. One may thus
expect that in the updated state ω| p the factor p is more true than in ω. This is
indeed the case, as will be shown in Theorem 6.1.5 below.
The conditioning c|q of a channel in item (3) is in fact a generalisation of the
conditioning of a state ω| p in item (1), since the state ω ∈ D(X) can be seen as
a channel ω : 1 → X with a one-element set 1 = {0} as domain. We shall see
the usefulness of conditioning of channels in Subsection 6.3.1.
The standard conditional probability notation is: P(E | D) for events E, D ⊆
X, where the distribution involved is left implicit. If ω is this implicit distri-
bution, then P(E | D) corresponds to the conditional expectation expressed by
the validity ω|1D |= 1E of the sharp predicate 1E in the state ω updated with the
Pred(pips) = [0, 1]^pips expressing that we are fairly certain of the outcome
being even:
   evenish(1) = 1/5    evenish(3) = 1/10    evenish(5) = 1/10
   evenish(2) = 9/10   evenish(4) = 9/10    evenish(6) = 4/5.
   dice |= evenish = 1/6 · 1/5 + 1/6 · 9/10 + 1/6 · 1/10 + 1/6 · 9/10 + 1/6 · 1/10 + 1/6 · 4/5
      = (2 + 9 + 1 + 9 + 1 + 8) / 60 = 1/2.
If we take evenish as evidence, we can update our dice state and get:
   dice|_evenish = Σ_x ( dice(x) · evenish(x) / (dice |= evenish) ) | x i
      = (1/6 · 1/5)/(1/2) | 1 i + (1/6 · 9/10)/(1/2) | 2 i + (1/6 · 1/10)/(1/2) | 3 i
        + (1/6 · 9/10)/(1/2) | 4 i + (1/6 · 1/10)/(1/2) | 5 i + (1/6 · 4/5)/(1/2) | 6 i
      = 1/15 | 1 i + 3/10 | 2 i + 1/30 | 3 i + 3/10 | 4 i + 1/30 | 5 i + 4/15 | 6 i.
As expected, the probabilities for the even pips are now, in the posterior,
higher than for the odd ones.
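The update formula is easy to implement. The Python sketch below (an illustration; the helper names are ad hoc) defines a generic validity and update function and reproduces the dice computation exactly, using fractions.

    from fractions import Fraction as F

    # uniform dice state and the fuzzy 'evenish' predicate from the example
    dice = {x: F(1, 6) for x in range(1, 7)}
    evenish = {1: F(1, 5), 2: F(9, 10), 3: F(1, 10),
               4: F(9, 10), 5: F(1, 10), 6: F(4, 5)}

    def validity(state, pred):
        return sum(state[x] * pred[x] for x in state)

    def update(state, pred):
        # omega|_p (x) = omega(x) * p(x) / (omega |= p)
        v = validity(state, pred)
        return {x: state[x] * pred[x] / v for x in state}

    print(validity(dice, evenish))   # 1/2
    print(update(dice, evenish))     # 1/15, 3/10, 1/30, 3/10, 1/30, 4/15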
2 The following alarm example is due to Pearl [133]. It involves an ‘alarm’ set
A = {a, a⊥ } and a ‘burglary’ set B = {b, b⊥ }, with the following a priori joint
distribution ω ∈ D(A × B).
0.000095| a, b i + 0.009999| a, b⊥ i + 0.000005| a⊥ , b i + 0.989901| a⊥ , b⊥ i.
Someone reports that the alarm went off, but with only 80% certainty
because of deafness. This can be described as a predicate p : A → [0, 1]
with p(a) = 0.8 and p(a⊥ ) = 0.2. We see that there is a mismatch between
the p on A and ω on A × B. This can be solved via weakening p to p ⊗ 1, so
that it becomes a predicate on A × B. Then we can take the update:
ω| p⊗1 ∈ D(A × B).
In order to do so we first compute the validity:
   ω |= p ⊗ 1 = Σ_{x,y} ω(x, y) · p(x) = 0.206.
We see that the burglary probability is four times higher in the posterior
than in the prior. What happens is noteworthy: evidence about one compo-
nent A changes the probabilities in another component B. This ‘crossover
influence’ (in the terminology of [88]) or ‘crossover inference’ happens pre-
cisely because the joint distribution ω is entwined, so that the different parts
can influence each other. We shall return to this theme repeatedly, see for
instance in Corollary 6.3.13 below.
One of the main results about conditioning is Bayes’ theorem. We present it
here for factors, and not just for sharp predicates (events), as is common.
Theorem 6.1.3. Let ω be distribution on a sample space X, and let p, q be
factors on X.
1 The product rule holds:
   ω|_p |= q = (ω |= p & q) / (ω |= p).
ω| p |= p ≥ ω |= p.
   ω|_p |= p = (ω |= p & p) / (ω |= p) ≥ (ω |= p)² / (ω |= p) = ω |= p.
We add a few more basic facts about conditioning.
ω|1 = ω.
ω| p |q = ω| p&q = ω|q | p .
ω| s·p = ω| p .
When we ignore undefinedness issues, we see that items 1 and 3 show that
conditioning is an action on distributions, namely of the monoid of factors with
conjunction (1, &), see Definition 1.2.4.
3 It suffices to prove:
   ω|_p|_q(x) = ( ω|_p(x) · q(x) ) / ( ω|_p |= q )
              = ( (ω(x)·p(x)/(ω |= p)) · q(x) ) / ( (ω |= p & q)/(ω |= p) )    by Proposition 6.1.3 (1)
Exercises
6.1.1 Let ω ∈ D(X) with a ∈ supp(ω). Prove that:
6.1.2 Consider the following girls/boys riddle: given that a family with two
children has a boy, what is the probability that the other child is a girl?
Take as space {G, B}. On it we use the uniform distribution unif =
1/2 |G i + 1/2 | B i since there is no prior knowledge.
1 Take as 'at least one girl' and 'at least one boy' predicates on
{G, B} × {G, B}:
   g ≔ (1B ⊗ 1B)⊥ = (1G ⊗ 1) > (1B ⊗ 1G)
   ω = Σ_i (ω |= pi) · ω|_{pi}.    (6.1)
4 Prove now:
   ω|_q |= pi = (ω |= q & pi) / ( Σ_j ω |= q & pj ).
6.1.5 Show that conditioning a convex sum of states yields another convex
sum of conditioned states: for σ, τ ∈ D(X), p ∈ Fact(X) and r, s ∈
[0, 1] with r + s = 1,
   (r · σ + s · τ)|_p
      = ( r · (σ |= p) / ( r · (σ |= p) + s · (τ |= p) ) ) · σ|_p + ( s · (τ |= p) / ( r · (σ |= p) + s · (τ |= p) ) ) · τ|_p
      = ( r · (σ |= p) / ( (r · σ + s · τ) |= p ) ) · σ|_p + ( s · (τ |= p) / ( (r · σ + s · τ) |= p ) ) · τ|_p.
6.1.6 Consider ω ∈ D(X) and p ∈ Fact(X) where p is non-zero, at least
on the support of ω. Check that updating ω with p can be undone via
updating with 1/p.
6.1.7 Let c : X → Y be a channel, with state ω ∈ D(X × Z).
1 Prove that for a factor p ∈ Fact(Z),
   ( (c ⊗ id) = ω )|_{1⊗p} = (c ⊗ id) = ( ω|_{1⊗p} ).
6.1.8 This exercise will demonstrate that conditioning may both create and
remove entwinedness.
is entwined.
2 Consider the state ω ∈ D(2 × 2 × 2) given by:
   ω[1, 0, 1] ≠ ω[1, 0, 0] ⊗ ω[0, 0, 1].
   ρ ≔ ω|_{1⊗yes⊗1}.
f |q = f .
(e ⊗ f )| p⊗q = (e| p ) ⊗ ( f |q ).
6.1.12 Consider a state ω and predicate p on the same set. For r ∈ [0, 1]
we use the (ad hoc) abbreviation ωr for the convex combination of
updated states:
ωr B r · ω| p + (1 − r) · ω| p⊥ .
1 Prove that:
r ≤ (ω |= p) =⇒ r ≤ (ωr |= p) ≤ (ω |= p)
r ≥ (ω |= p) =⇒ r ≥ (ωr |= p) ≥ (ω |= p).
2 Show that, still for all r ∈ [0, 1],
   r · (ω |= p)/(ωr |= p) + (1 − r) · (ω |= p⊥)/(ωr |= p⊥) ≤ 1.
Hint: Write 1 = r + (1 − r) on the right-hand side of the inequality
≤ and move the two summands to the left.
This inequality occurs in n-ary form in Proposition 6.7.10 below. It
is due to [50].
   pml( Σ_i ni | ωi i )|_{p•} = pml( Σ_i ni | ωi|_p i ).
1 If ‖ψ • 1E‖ ≥ K, then
   hg[K](ψ)|_{1E•} = hg[K](ψ • 1E).
2 Similarly, if ψ • 1E is non-empty,
   pl[K](ψ)|_{1E•} = pl[K](ψ • 1E).
Then:
   hg[K](ψ)|_{1E•} = Σ_{ϕ ≤K ψ} ( hg[K](ψ)(ϕ) · 1E•(ϕ) / ( hg[K](ψ) |= 1E• ) ) | ϕ i
      = Σ_{ϕ ≤K ψ•1E} ( ( Π_x (ψ(x) over ϕ(x)) / (L over K) ) / ( (‖ψ•1E‖ over K) / (L over K) ) ) | ϕ i
      = Σ_{ϕ ≤K ψ•1E} ( Π_x ((ψ•1E)(x) over ϕ(x)) / (‖ψ•1E‖ over K) ) | ϕ i
      = hg[K](ψ • 1E).
2 Similarly.
We illustrate how to solve a famous riddle via updating a draw distribution.
Example 6.2.3. We start by recalling the ordered draw channel Ot− : N∗(X) →
L∗(X) × N(X) from Figure 3.1. For convenience, we recall that it is given by:
   Ot−(ψ) = Σ_{x∈supp(ψ)} ( ψ(x)/‖ψ‖ ) | [x], ψ − 1| x i i.
We recall that these maps can be iterated, via Kleisli composition ◦·, as de-
scribed in (3.18).
Below we would like to incorporate the evidence that the last drawn colour is in a
subset U ⊆ X. We thus extend U to last(U) ⊆ L∗ (X), given by:
last(U) = { [x1 , . . . , xn ] ∈ L∗ (X) | xn ∈ U}.
We now come to the so-called Monty Hall problem. It is a famous riddle in
probability due to [150], see also e.g. [61, 158]:
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one
door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who
knows what’s behind the doors, opens another door, say No. 3, which has a goat. He
then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch
your choice?
We describe the three doors via a set D = {1, 2, 3} and the situation in which
to make the first choice as a multiset ψ = 1| 1 i + 1| 2i + 1| 3 i. One may also
view ψ as an urn from which to draw numbered balls. Let’s assume without
loss of generality that the car is behind door 2. We shall represent this logically
by using a (sharp) ‘goat’ predicate 1G : {1, 2, 3} → [0, 1], for G = {1, 3}.
Your first draw yields a distribution:
   Ot−(ψ) = 1/3 | [1], 1| 2 i + 1| 3 i i + 1/3 | [2], 1| 1 i + 1| 3 i i + 1/3 | [3], 1| 1 i + 1| 2 i i.
Next, the host’s opening of a door with a goat is modeled by a new draw, but
we know that a goat appears. The resulting distribution is described by another
draw followed by an update:
   ( (Ot− ◦· Ot−)(ψ) )|_{last(G)⊗1}
      = 1/4 | [1, 3], 1| 2 i i + 1/4 | [2, 1], 1| 3 i i + 1/4 | [2, 3], 1| 1 i i + 1/4 | [3, 1], 1| 2 i i.
Inspection of this outcome shows that for two of your three choices — when
you choose 1 or 3, in the first and last term — it makes sense to change your
choice, because the car is behind the remaining door 2. Hence changing is
better than sticking to your original choice.
We have given a formal account of the situation. More informally, the host
knows where the car is, so his choice is not arbitrary. By opening a door with
a goat behind it, the host is giving you information that you can exploit to
improve your choice: two of your possible choices are wrong, but in those two
out of three cases the host gives you information on how to correct your choice.
We conclude with a well known result, namely that a multinomial distri-
bution can be obtained from conditioning parallel Poisson distributions. This
result can be found for instance in [145, §6.4], in binary form. Recall that a
   sum_K(n1, . . . , nm) = 1 ⟺ n1 + · · · + nm = K.
   ( pois[λ1] ⊗ · · · ⊗ pois[λm] )|_{sum_K}
      = Σ_{n1,...,nm, Σ_i ni = K} mn[K]( Σ_i (λi/λ) | i i )( Σ_i ni | i i ) | n1, . . . , nm i.
   Σ_{n1,...,nm, Σ_i ni = K} mn[K]( Σ_i (λi/λ) | i i )( Σ_i ni | i i ) | n1, . . . , nm i.
   sum_ψ(ϕ, ϕ′) = 1 ⟺ ϕ + ϕ′ = ψ.
   = Σ_{ϕ ≤K ψ} ( (ϕ) · ( Π_x ω(x)^{ϕ(x)} ) · (ψ − ϕ) · ( Π_x ω(x)^{ψ(x)−ϕ(x)} ) / mn[L](ω)(ψ) ) | ϕ, ψ − ϕ i
   = Σ_{ϕ ≤K ψ} ( (ϕ) · (ψ − ϕ) · ( Π_x ω(x)^{ψ(x)} ) / ( (ψ) · ( Π_x ω(x)^{ψ(x)} ) ) ) | ϕ, ψ − ϕ i
   = Σ_{ϕ ≤K ψ} ( (ψ over ϕ) / (L over K) ) | ϕ, ψ − ϕ i,    see Lemma 1.6.4 (1).
Exercises
6.2.1 Prove Theorem 6.2.2 (2) yourself.
6.2.2 Let K1, . . . , KN ∈ N with ψ ∈ N[K](X) be given, where K = Σ_i Ki.
Write sum_ψ : N[K1](X) × · · · × N[KN](X) → {0, 1} for the sharp predicate
defined by:
   sum_ψ(ϕ1, . . . , ϕN) = 1 ⟺ ϕ1 + · · · + ϕN = ψ.
Show that, for any ω ∈ D(X),
   ( mn[K1](ω) ⊗ · · · ⊗ mn[KN](ω) )|_{sum_ψ}
      = Σ_{ϕ1 ≤K1 ψ, ..., ϕN ≤KN ψ, Σ_i ϕi = ψ} ( 1 / (ψ over ϕ1, . . . , ϕN) ) | ϕ1, . . . , ϕN i.
Recall from Exercise 1.6.9 that the inverses of the multinomial coefficients
of multisets in the above sum add up to one.
The next definition captures the two basic patterns — first formulated in this
form in [87]. We shall refer to them jointly as channel-based inference, or as
reasoning along channels.
c = ω| p .
In both cases the distribution ω is often called the prior distribution or simply
the prior. Similarly, c = ω| p and ω|c = q are called posterior distributions or just
posteriors.
Thus, with forward inference one first conditions and then performs (for-
ward, state) transformation, whereas for backward inference one first performs
(backward, factor) transformation, and then one conditions. We shall illustrate
these fundamental inferences mechanisms in several examples. They mostly
involve backward inference, since that is the more useful technique. An impor-
tant first step in these examples is to recognise the channel that is hidden in the
description of the problem. It is instructive to try and do this, before reading
the analysis and the solution.
Example 6.3.2. We start with the following question from [144, Example 1.12].
Consider two urns. The first contains two white and seven black balls, and the second
contains five white and six black balls. We flip a coin and then draw a ball from the
first urn or the second urn depending on whether the outcome was heads or tails. What
is the conditional probability that the outcome of the toss was heads given that a white
ball was selected?
Our analysis involves two sample spaces {H, T } for the sides of the coin and
{W, B} for the colours of the balls in the urns. The coin distribution is uni-
form: unif = 1/2 | H i + 1/2 | T i. The above description implicitly contains a channel
c : {H, T} → {W, B}, namely:
   c(H) = 2/9 | W i + 7/9 | B i    c(T) = 5/11 | W i + 6/11 | B i.
As in the above quote, the first urn is associated with heads and the second one
with tails.
The evidence that we have is described in the quote after the word ‘given’.
It is the point predicate 1W on the set of colours {W, B}, indicating that a white
ball was selected. This evidence can be pulled back (transformed) along the
channel c, to a predicate c = 1W on the sample space {H, T }. It is given by:
   (c = 1W)(H) = Σ_{x∈{W,B}} c(H)(x) · 1W(x) = c(H)(W) = 2/9,
and similarly (c = 1W)(T) = c(T)(W) = 5/11.
Then:
   unif|_{c = 1W} = (1/2 · 2/9)/(67/198) | H i + (1/2 · 5/11)/(67/198) | T i = 22/67 | H i + 45/67 | T i.
A cab was involved in a hit and run accident at night. Two cab companies, Green and
Blue, operate in the city. You are given the following data:
• 85% of the cabs in the city are Green and 15% are Blue
• A witness identified the cab as Blue. The court tested the reliability of the witness
under the circumstances that existed on the night of the accident, and concluded that
the witness correctly identified each one of the two colors 80% of the time and failed
20% of the time.
What is the probability that the cab involved in the accident was Blue rather than Green?
We use as colour set C = {G, B} for Green and Blue. There is a prior 'base
rate' distribution ω = 17/20 |G i + 3/20 | B i ∈ D(C), as in the first bullet above.
The reliability information in the second bullet translates into a 'correctness'
channel c : {G, B} → {G, B} given by:
   c(G) = 4/5 |G i + 1/5 | B i    c(B) = 1/5 |G i + 4/5 | B i.
The second bullet also gives evidence of a Blue car. It translates into a point
predicate 1B on {G, B}. It can be used for backward inference, giving the answer
to the query, as posterior:
   ω|_{c = 1B} = 17/29 |G i + 12/29 | B i ≈ 0.5862|G i + 0.4138| B i.
Thus the probability that the Blue car was actually involved in the incident is
a bit more than 41%. This may seem like a relatively low probability, given
that the evidence says 'Blue taxicab' and that observations are 80% accurate.
But this low percentage is explained by the fact that there are relatively few Blue
taxicabs in the first place, namely only 15%. This is in the prior, base rate
distribution ω. It is argued in [159] that humans find it difficult to take such base
rates (or priors) into account. They call this phenomenon "base rate neglect",
see also [57].
Notice that the channel c is used to accommodate the uncertainty of observations:
the sharp point observation 'Blue taxicab' is transformed into a fuzzy
predicate c = 1B. The latter is G ↦ 1/5 and B ↦ 4/5. Updating ω with this
predicate gives the claimed outcome.
Example 6.3.4. We continue in the setting of Example 2.2.2, with a teacher in
a certain mood, and predictions about pupils’ performances depending on the
mood. We now assume that the pupils have done rather poorly, with no-one
scoring above 5, as described by the following evidence/predicate q on the set
of grades Y = {1, 2, . . . , 10}.
   q = 1/10 · 1_1 + 3/10 · 1_2 + 3/10 · 1_3 + 2/10 · 1_4 + 1/10 · 1_5.
The validity of this predicate q in the predicted state c = σ is:
   c = σ |= q = σ |= c = q = 299/4000 = 0.07475.
The interested reader may wish to check that the updated state σ′ = σ|_{c = q} via
backward inference is:
   σ′ = 77/299 | p i + 162/299 | n i + 60/299 | o i ≈ 0.2575| p i + 0.5418| n i + 0.2007| o i.
Interestingly, after updating the teacher has a more realistic view, in the sense
that the validity of the predicate q has risen to c = σ′ |= q = 15577/149500 ≈ 0.1042.
This is one way in which the mind can adapt to external evidence.
Example 6.3.5. Recall the Medicine-Blood Table (1.18) with data on different
types of medicine via a set M = {0, 1, 2} and blood pressure via the set B =
{H, L}. From the table we can extract a channel b : M → B describing the
blood pressure distribution for each medicine type. This channel is obtained
by column-wise frequentist learning:
   b = (ω|_{1E}) = b = ( 9/17 | 1 i + 8/17 | 2 i )
      = ( 9/17 · 7/9 + 8/17 · 5/8 ) | H i + ( 9/17 · 2/9 + 8/17 · 3/8 ) | L i
      = 12/17 | H i + 5/17 | L i.
This shows the distribution of high and low blood pressure among people using
medicine 1 or 2.
We turn to backward reasoning. Suppose that we have evidence 1H on {H, L}
of high blood pressure. What is then the associated distribution of medicine
usage? It is obtained in several steps:
   (b = 1H)(x) = Σ_y b(x)(y) · 1H(y) = b(x)(H)
   ω |= b = 1H = Σ_x ω(x) · (b = 1H)(x) = Σ_x ω(x) · b(x)(H)
      = 3/20 · 2/3 + 9/20 · 7/9 + 2/5 · 5/8 = 7/10
   ω|_{b = 1H} = Σ_x ( ω(x) · (b = 1H)(x) / (ω |= b = 1H) ) | x i
      = (3/20 · 2/3)/(7/10) | 0 i + (9/20 · 7/9)/(7/10) | 1 i + (2/5 · 5/8)/(7/10) | 2 i
      = 1/7 | 0 i + 1/2 | 1 i + 5/14 | 2 i ≈ 0.1429| 0 i + 0.5| 1 i + 0.3571| 2 i.
We can also reason with ‘soft’ evidence, using the full power of fuzzy pred-
icates. Suppose we are only 95% sure that the blood pressure is high, due to
some measurement uncertainty. Then we can use as evidence the predicate
Example 6.3.6. The following question comes from [158, §6.1.3] (and is also
used in [129]).
One fish is contained within the confines of an opaque fishbowl. The fish is equally
likely to be a piranha or a goldfish. A sushi lover throws a piranha into the fish bowl
alongside the other fish. Then, immediately, before either fish can devour the other,
one of the fish is blindly removed from the fishbowl. The fish that has been removed
from the bowl turns out to be a piranha. What is the probability that the fish that was
originally in the bowl by itself was a piranha?
Let’s use the letters ‘p’ and ‘g’ for piranha and goldfish. We are looking at a
situation with multiple fish in a bowl, where we cannot distinguish the order.
Hence we describe the contents of the bowl as a (natural) muliset over {p, g},
that is, as an element of N({p, g}). The initial situation can then be described
as a distribution ω ∈ D(N({p, g})) with:
ω = 12 1| p i + 21 1| g i .
Adding a piranha to the bowl involves a function A : N({p, g}) → N({p, g}),
such that A(σ) = (σ(p) + 1)| p i + σ(g)| gi. It forms a deterministic channel.
We use a piranha predicate P : N({p, g}) → [0, 1] that gives the likelihood
P(σ) of taking a piranha from a multiset/bowl σ. Thus:
(2.13) σ(p)
P(σ) B Flrn(σ)(p) = .
σ(p) + σ(g)
We have now collected all ingredients to answer the question via backward
inference along the deterministic channel A. It involves the following steps.
   (A = P)(σ) = P(A(σ)) = (σ(p) + 1) / (σ(p) + 1 + σ(g))
   ω |= A = P = 1/2 · (A = P)(1| p i) + 1/2 · (A = P)(1| g i)
      = 1/2 · (1+1)/(1+1+0) + 1/2 · (0+1)/(0+1+1) = 1/2 + 1/4 = 3/4
   ω|_{A = P} = ( (1/2 · 1)/(3/4) ) | 1| p i i + ( (1/2 · 1/2)/(3/4) ) | 1| g i i
      = 2/3 | 1| p i i + 1/3 | 1| g i i.
Hence the answer to the question in the beginning of this example is: a 2/3
probability that the original fish is a piranha.
Example 6.3.7. In [146, §20.1] a situation is described with five different bags,
numbered 1, . . . , 5, each containing its own mixture of cherry (C) and lime (L)
candies. This situation can be described via a candy channel:
   c : B = {1, 2, 3, 4, 5} → {C, L}    where
      c(1) = 1|C i
      c(2) = 3/4 |C i + 1/4 | L i
      c(3) = 1/2 |C i + 1/2 | L i
      c(4) = 1/4 |C i + 3/4 | L i
      c(5) = 1| L i.
   c = 1L = >_i c(i)(L) · 1_i = 1/4 · 1_2 > 1/2 · 1_3 > 3/4 · 1_4 > 1 · 1_5.
The question is what we learn about the bag distribution after observing this
predicate 10 consecutive times. This involves computing:
   ω|_{c = 1L} = 1/10 | 2 i + 2/5 | 3 i + 3/10 | 4 i + 1/5 | 5 i
   ω|_{c = 1L}|_{c = 1L} = ω|_{(c = 1L)&(c = 1L)} = ω|_{(c = 1L)²}
      = 1/26 | 2 i + 4/13 | 3 i + 9/26 | 4 i + 4/13 | 5 i
      ≈ 0.0385| 2 i + 0.308| 3 i + 0.346| 4 i + 0.308| 5 i
   ω|_{c = 1L}|_{c = 1L}|_{c = 1L} = ω|_{(c = 1L)&(c = 1L)&(c = 1L)} = ω|_{(c = 1L)³}
      = 1/76 | 2 i + 4/19 | 3 i + 27/76 | 4 i + 8/19 | 5 i
low false positives but high false negatives, and moreover these false negatives
depend on the day after infection.
The Covid-19 PCR-test has almost no false positives. This means that if you
do not have the disease, then the likelihood of a (false) positive test is very
low. In our calculations below we put it at 1%, independently of the day that
you get tested. In contrast, the PCR-test has considerable false negative rates,
which depend on the day after infection. The plot at the top in Figure 6.2 gives
an indication; it does not precisely reflect the medical reality, but it provides
a reasonable approximation, good enough for our calculation. This plot shows
that if you are infected at day 0, then a test at this day or the day after (day 1)
will surely be negative. On the second day after your infection a PCR-test
might start to detect, but still there is only a 20% chance of a positive outcome.
This probability increases and after on day 6 the likelihood of a positive test
has risen to 80%.
How to formalise this situation? We use the following three sample spaces,
for Covid (C), days after infection (D), and test outcome (T ).
C = {c, c⊥ } D = {0, 1, 2, 3, 4, 5, 6} T = {p, n}.
The test probabilities are then captured via a test channel t : C × D → T , given
Figure 6.2 Covid-19 false negatives and posteriors after tests. In the lower plot P2
means: positive test after 2 days, that is, in state (r| ci + (1 − r)|c⊥ i) ⊗ ϕ2 , where
r ∈ [0, 1] is the prior Covid probability. Similarly for N2, P5, N5.
in the following way. The first equation captures the false positives, and the
second one the false negatives, as in the plot at the top of Figure 6.2.
   t(c⊥, i) = 1/100 | p i + 99/100 | n i
   t(c, i) = 1| n i                     if i = 0 or i = 1
             2/10 | p i + 8/10 | n i    if i = 2
             3/10 | p i + 7/10 | n i    if i = 3
             4/10 | p i + 6/10 | n i    if i = 4
             6/10 | p i + 4/10 | n i    if i = 5
             8/10 | p i + 2/10 | n i    if i = 6
   ϕ2 = 1/4 | 1 i + 1/2 | 2 i + 1/4 | 3 i    ϕ5 = 1/4 | 4 i + 1/2 | 5 i + 1/4 | 6 i.
Let’s consider the case where we have no (prior) information about the like-
lihood that the person that is going to be tested has the disease. Therefore we
use as prior ω = unifC = 12 |c i + 12 | c⊥ i.
Suppose in this situation we have a positive test, say after two days. What
do we then learn about the disease probability? Our evidence is the postive test
predicate 1 p on T , which can be transformed to t = 1 p on C × D. We can use
this predicate to update the joint state ω ⊗ ϕ2 . The distribution that we are after
is the first marginal of this updated state, as in:
(ω ⊗ ϕ2 )t = 1 1, 0 ∈ D(C).
p
Then:
   ω ⊗ ϕ2 |= t = 1p = Σ_{x∈C, y∈D} ω(x) · ϕ2(y) · (t = 1p)(x, y)
      = 1/2 · 1/2 · 2/10 + 1/2 · 1/4 · 3/10 + 1/2 · 1/4 · 1/100 + 1/2 · 1/2 · 1/100 + 1/2 · 1/4 · 1/100
      = (40 + 30 + 1 + 2 + 1)/800 = 37/400.
But then:
   (ω ⊗ ϕ2)|_{t = 1p} = (1/2 · 1/2 · 2/10)/(37/400) | c, 2 i + (1/2 · 1/4 · 3/10)/(37/400) | c, 3 i
      + (1/2 · 1/4 · 1/100)/(37/400) | c⊥, 1 i + (1/2 · 1/2 · 1/100)/(37/400) | c⊥, 2 i + (1/2 · 1/4 · 1/100)/(37/400) | c⊥, 3 i
      = 20/37 | c, 2 i + 15/37 | c, 3 i + 1/74 | c⊥, 1 i + 1/37 | c⊥, 2 i + 1/74 | c⊥, 3 i.
Finally:
   ( (ω ⊗ ϕ2)|_{t = 1p} )[1, 0] = 35/37 | c i + 2/37 | c⊥ i ≈ 0.946| c i + 0.054| c⊥ i.
Hence a positive test changes the a priori likelihood of 50% to about 95%. In
a similar way one can compute that:
   ( (ω ⊗ ϕ2)|_{t = 1n} )[1, 0] = 165/363 | c i + 198/363 | c⊥ i ≈ 0.455| c i + 0.545| c⊥ i.
n
We see that a negative test, 2 days after infection, reduces the prior disease
probability of 50% only slightly, namely to 45%.
Doing the test around 5 days after infection gives a bit more certainty, espe-
cially in the case of a negative test:
   ( (ω ⊗ ϕ5)|_{t = 1p} )[1, 0] = 60/61 | c i + 1/61 | c⊥ i ≈ 0.984| c i + 0.016| c⊥ i
   ( (ω ⊗ ϕ5)|_{t = 1n} )[1, 0] = 40/139 | c i + 99/139 | c⊥ i ≈ 0.288| c i + 0.712| c⊥ i.
The lower plot in Figure 6.2 gives a more elaborate description, for different
disease probabilities r in a prior ω = r| c i + (1 − r)| c⊥ i as used above. We see
that a positive test outcome quickly gives certainty about having the disease.
But a negative test outcome gives only a little bit of information with respect to
the prior — which in this plot can be represented as the diagonal. For this
reason, if you get a negative PCR-test, often a second test is done a few days
later.
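The positive-test-after-two-days computation can be redone as a joint-state update followed by marginalisation, as in the Python sketch below (illustration only; the helper names are ad hoc).

    from fractions import Fraction as F

    # prior disease state and 'days since infection' distribution
    omega = {'c': F(1, 2), 'c_not': F(1, 2)}
    phi2 = {1: F(1, 4), 2: F(1, 2), 3: F(1, 4)}

    def test_pos(disease, day):
        # probability of a positive test outcome, as in the channel t above
        if disease == 'c_not':
            return F(1, 100)
        pos = {0: F(0), 1: F(0), 2: F(2, 10), 3: F(3, 10),
               4: F(4, 10), 5: F(6, 10), 6: F(8, 10)}
        return pos[day]

    # joint state omega (x) phi2, updated with the predicate t << 1_p
    joint = {(x, d): omega[x] * phi2[d] for x in omega for d in phi2}
    pred = {(x, d): test_pos(x, d) for (x, d) in joint}
    v = sum(joint[k] * pred[k] for k in joint)
    posterior = {k: joint[k] * pred[k] / v for k in joint}

    # first marginal: the updated disease distribution
    marginal = {x: sum(p for (y, d), p in posterior.items() if y == x) for x in omega}
    print(marginal)   # {'c': Fraction(35, 37), 'c_not': Fraction(2, 37)}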
In the next two examples we look at estimating the number of fish in a pond
by counting marked fish, first in multinomial mode and then in hypergeometric
mode.
sample space F together with the uniform ‘prior’ state unif F is:
Once this is set up, we construct a posterior state by updating the prior with
the information that five marked fish have been found. The latter is expressed
as point predicate 15 ∈ Pred (K +1) on the codomain of the channel c. We can
then do backward inference, as in Definition 6.3.1 (2), and obtain the updated
uniform distribution:
   unif_F|_{c = 1_5} = Σ_{i∈F} ( (20/i)⁵ · ((i−20)/i)¹⁵ / Σ_j (20/j)⁵ · ((j−20)/j)¹⁵ ) | i i.
The bar chart of this posterior is in Figure 6.3; it indicates the likelihoods of
the various numbers of fish in the pond. One can also compute the expected
value (mean) of this posterior; it’s 116.5 fish. In case we had caught 10 marked
fish out of 20, the expected number would be 47.5.
Note that taking a uniform prior corresponds to the idea that we have no
idea about the number of fish in the pond. But possibly we already had a good
estimate from previous years. Then we could have used such an estimate as
prior distribution, and updated it with this year’s evidence.
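The posterior over pond sizes is easy to compute numerically. In the Python sketch below the range of possible pond sizes is an assumption made only for illustration (the set F is not fixed in the text above), so the reported mean depends on it; the mode does not.

    # capture-recapture in multinomial (binomial) mode:
    # 20 fish were marked; we catch 20 fish and find 5 marked ones.
    # The range of pond sizes below is an illustrative assumption.
    F_sizes = range(20, 401)

    def likelihood(i):
        # probability, up to the binomial coefficient (which cancels),
        # of catching 5 marked fish out of 20 when the pond holds i fish
        p = 20 / i
        return p**5 * (1 - p)**15

    weights = {i: likelihood(i) for i in F_sizes}
    total = sum(weights.values())
    posterior = {i: w / total for i, w in weights.items()}

    mean = sum(i * p for i, p in posterior.items())
    mode = max(posterior, key=posterior.get)
    print(mode, mean)   # the mode is 80; the mean depends on the chosen range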
Example 6.3.10. We take another look at the previous example. There we used
the multinomial distribution (in binomial form) for the probability of catching
five marked fish, per pond size. This multinomial mode is appropriate for draw-
ing with replacement, which corresponds to returning each fish that we catch
to the pond. This is probably not what happens in practice. So let’s try to re-
describe the capture-recapture model in hypergeometric mode (like in [145,
§4.8.3, Ex. 8h]).
Let’s write M = {m, m⊥ } for the space with elements m for marked and m⊥ for
Figure 6.3 The posterior fish number distribution after catching 5 marked fish, in
multinomial mode, see Example 6.3.9.
given by:
   d(i) ≔ hg[K]( 20| m i + (i − 20)| m⊥ i ).
   unif_F|_{d = 1_κ} = Σ_{i∈F} ( ( (i−20 over 15) / (i over 20) ) / Σ_j ( (j−20 over 15) / (j over 20) ) ) | i i.
Its bar-chart is in Figure 6.4. It differs minimally from the multinomial one in
Figure 6.3. In the hypergeometric case the expected value is 113 fish, against
116.5 in the multinomial case. When the recapture involves 10 marked fish,
the expected values are 45.9, against 47.5. As we have already seen in Re-
mark 3.4.3, the hypergeometric distribution on small draws from a large urn
looks very much like a multinomial distribution.
Figure 6.4 The posterior fish number distribution after catching 5 marked fish, in
hypergeometric mode, see Example 6.3.10.
After all these examples we develop some general results. We start with a fun-
damental result about conditioning of a transformed state. It can be reformu-
lated as a combination of forward and backward reasoning. This has many
consequences.
   (c = ω)|_q = c|_q = (ω|_{c = q}).    (6.2)
Earlier we have seen ‘crossover update’ for a joint state, whereby one up-
dates in one component, and then marginalises in the other, see for instance Ex-
ample 6.1.2 (2). We shall do this below for joint states gr(σ, c) = hid , ci = σ
that arise as graph, see Definition 2.5.6, via a channel. The result below ap-
peared in [21], see also [87, 88]. It says that crossover updates on joint graph
states can also be done via forward and backward inferences. This is also a
consequence of Theorem 6.3.11.
   = π2 = (ω|_{π1 = p}).
2 For a factor q on Y,
   σ|_{c = q} = (ω|_{1⊗q})[1, 0] = π1 = (ω|_{π2 = q}).
= σ|c = q .
This result will help us perform inference in a Bayesian network, see
Section 7.10.
Exercises
6.3.1 We consider some disease with an a priori probability (or ‘preva-
lence’) of 1%. There is a test for the disease with the following char-
acteristics.
• (‘sensitivity’) If someone has the disease, then the test is positive
with probability of 90%.
• (‘specificity’) If someone does not have the disease, there is a 95%
chance that the test is negative.
1 Take as disease space D = {d, d⊥ }; describe the prior as a distribu-
tion on D;
2 Take as test space T = {p, n} and describe the combined sensitivity
and specificity as a channel c : D → T ;
Check that:
   PPV ≔ ω|_{c = 1p}(d) = (p · se) / ( p · se + (1 − p) · (1 − sp) ).
This is commonly expressed in medical textbooks as:
   PPV = (prevalence · sensitivity) / ( prevalence · sensitivity + (1 − prevalence) · (1 − specificity) ).
Check similarly that:
   NPV ≔ ω|_{c = 1n}(d⊥) = ( (1 − p) · sp ) / ( p · (1 − se) + (1 − p) · sp ).
As an aside, the (positive) likelihood ratio LR is the fraction:
   LR ≔ c(d)(p) / c(d⊥)(p) = se / (1 − sp).
6.3.3 Give a channel-based analysis and answer to the following question
from [144, Chap. I, Exc. 39].
Stores A, B, and C have 50, 75, and 100 employees, and respectively, 50, 60,
and 70 percent of these are women. Resignations are equally likely among
all employees, regardless of sex. One employee resigns and this is a woman.
What is the probability that she works in store C?
6.3.4 The multinomial and hypergeometric charts in Figures 6.3 and 6.4
are very similar. Still, if we look at the probability for the situation
when there are 40 fish in the pond, there is a clear difference. Give a
conceptual explanation for this difference.
6.3.5 The following situation about the relationship between eating hamburgers
and having Kreuzfeld-Jacob disease is inspired by [8, §1.2].
We have two sets: E = {H, H⊥} about eating Hamburgers (or not),
and D = {K, K⊥} about having Kreuzfeld-Jacob disease (or not). The
following distributions on these sets are given: half of the people eat
hamburgers, and only one in a hundred thousand have Kreuzfeld-Jacob
disease, which we write as:
   ω = 1/2 | H i + 1/2 | H⊥ i    and    σ = 1/100,000 | K i + 99,999/100,000 | K⊥ i.
1 Suppose that we know that 90% of the people who have Kreuzfeld-
Jacob disease eat Hamburgers. Use this additional information to
define a channel c : D → E with c = σ = ω.
2 Compute the probability of getting Kreuzfeld-Jacob for someone
eating hamburgers (via backward inference).
6.3.6 Consider in the context of Example 6.3.8 a negative Covid-19 test
obtained after 2 days, via the distribution ϕ2 = 1/4 | 1 i + 1/2 | 2 i + 1/4 | 3 i,
assuming a uniform disease prior ω. Show that the posterior 'days'
distribution is:
   ( (ω ⊗ ϕ2)|_{t = 1n} )[0, 1] = 199/726 | 1 i + 179/363 | 2 i + 169/726 | 3 i
1 We write [a, b]N ⊆ [a, b] ⊆ R for the interval [a, b] reduced to N elements:
Often we combine the notations from these two items and use discretised states
of the form Disc( f, [a, b]N ).
To see an example of item (1), consider the interval [1, 2] with N = 3. The
step size s is then s = (2−1)/3 = 1/3, so that:
   [1, 2]_3 = { 1 + 1/2 · 1/3, 1 + 3/2 · 1/3, 1 + 5/2 · 1/3 } = { 1 + 1/6, 1 + 1/2, 1 + 5/6 }.
We choose to use internal points only and exclude the end-points in this finite
subset, since the end-points sometimes give rise to boundary problems, with
functions being undefined. When N goes to infinity, the smallest and largest
elements in [a, b]_N will approximate the end-points a and b — from above and
from below, respectively.
The ‘total’ number t in item (2) normalises the formal sum and ensures that
the multiplicities add up to one. In this way we can define a uniform distri-
bution on [a, b]N as unif[a,b]N , like before, or alternatively as Disc(1, [a, b]N ),
where 1 is the constant-one function.
Example 6.4.2. We look at the following classical question: suppose we are
given a coin with an unknown bias, we flip it eight times, and observe the
following list of heads (H) and tails (T ):
[T, H, H, H, T, T, H, H].
What can we then say about the bias of the coin?
The frequentist approach that we have seen in Section 2.3 would turn the
above list into a multiset and then into a distribution, by frequentist learning,
see also Diagram (2.16). This gives:
[T, H, H, H, T, T, H, H] 7−→ 5| H i + 3| T i 7−→ 85 | H i + 38 | T i.
Here we do not use this frequentist approach to learning the bias parameter,
but take a Bayesian route. We assume that the bias parameter itself is given
by a distribution, describing the likelihoods of various bias values. We assume
no prior knowledge and therefore start from the uniform distribution. It will
be updated based on successive observations, using the technique of backward
inference, see Definition 6.3.1 (2).
The bias b of a coin is a number in the unit interval [0, 1], giving rise to a
coin distribution flip(b) = b| H i + (1 − b)| T i. Thus we can see flip as a chan-
nel flip : [0, 1] → {H, T }. At this stage we avoid continuous distributions and
discretise the unit interval. We choose N = 100 in the chop up, giving as under-
lying space [0, 1]N with N points, on which we take the uniform distribution
unif as prior:
X X
unif B Disc(1, [0, 1]N ) = 1
N | x i = 1 2i+1
N 2N
x∈[0,1]N 0≤i<N
= 1 1
+ N1 2N
3
+ · · · + N1 2N−1 .
N 2N 2N
347
348 Chapter 6. Updating distributions
[0, 1]N
flip
◦ / {H, T } given by flip(b) = b| H i + (1 − b)| T i.
There are the two (sharp, point) predicates 1H and 1T on the codomain
{H, T }. It is not hard to show, see Exercise 6.4.2 below, that:
Predicate transformation along flip yields two predicates on [0, 1]N given by:
unif|flip = 1T
unif|flip = 1T |flip = 1H = unif|(flip = 1T )&(flip = 1H ) = unif|(flip = 1H )&(flip = 1T )
unif|flip = 1T |flip = 1H |flip = 1H = unif|(flip = 1T )&(flip = 1H )&(flip = 1H )
= unif|(flip = 1H )2 &(flip = 1T )
..
.
unif|(flip = 1H )5 &(flip = 1T )3
An overview of the resulting distributions is given in Figure 6.5. These distri-
butions approximate (continuous) beta distributions, see Section ?? later on.
These beta functions form a smoothed out version of the bar charts in Fig-
ure 6.5. We shall look more systematically into ‘learning along a channel’ in
Section ??, where the current coin-bias computation re-appears in Example ??.
After these eight updates, let’s write ρ = unif|(flip = 1H )5 &(flip = 1T )3 for the re-
sulting distribution. We now ask three questions.
1 Where does ρ reach its highest value, and what is it? The answers are given
by:
argmax(ρ) = { 25
40 } with ρ( 25
40 ) ≈ 0.025347.
2 What is the predicted coin distribution? The outcome, with truncated multi-
plicities, is:
flip = ρ = 0.6| 1 i + 0.4|0 i.
348
6.4. Discretisation, and coin bias learning 349
Figure 6.5 Coin bias distributions arising from the prior uniform distribution by
successive updates after coin observations [0, 1, 1, 1, 0, 0, 1, 1].
For mathematical reasons1 the exact outcome is 0.6. However, we have used
approximation via discretisation. The value computed with this discretisation
is 0.599999985316273. We can conclude that chopping the unit interval up
with N = 100 already gives a fairly good approximation.
1 The distribution ρ is an approximation of the probability density function β(5, 3), which has
5+1
mean (5+1)+(3+1) = 0.6.
349
350 Chapter 6. Updating distributions
This last equation says that the βN distributions are closed under updating
via flip. This is the essence of the fact that the βN channel is ‘conjugate prior’
to the flip channel, see Section ?? for more details. For now we note that this
relationship is convenient because it means that we don’t have to perform all
the state updates explicitly; instead we can just adapt the inputs a, b of the
channel βN . These inputs are often called hyperparameters.
= βN (a + n, b + m).
350
6.4. Discretisation, and coin bias learning 351
1| a i, 4
5|ai + 51 | b i, 3
5|ai + 25 | b i, 2
5|ai + 35 | b i, 1
5|ai + 45 | b i, 1| bi.
candidates a b c
poll numbers 52 28 20
What is the probability that candidate a then wins in the election? Notice that
these are poll numbers, before the election, not actual votes. If they were actual
votes, candidate a gets more than 50%, and thus wins. However, these are poll
numbers, from which we like to learn about voter preferences.
A state in D(X) is used as distribution of preferences over the three candi-
dates in X = {a, b, c}. In this example we discretise the set of states and work
with the finite subset D[K](X) ,→ D(X), for a number K. There is no ini-
tial knowledge about voter preferences, so our prior is the uniform distribution
over these preference distributions:
υK B unifD[K](X) ∈ D D[K](X) .
351
352 Chapter 6. Updating distributions
σ ∈ D[K](X) as:
aw(σ) = 1 ⇐⇒ σ(a) > 1
2
bw(σ) = 1 ⇐⇒ σ(b) > 1
2
cw(σ) = 1 ⇐⇒ σ(c) > 1
2
aw(σ) = 1 ⇐⇒ σ(a) ≤ 1
2 and σ(b) ≤ 1
2 and σ(c) ≤ 12 .
The prior validities:
υK |= aw υK |= bw υK |= cw υK |= nw
all approximate 14 , as K goes to infinity.
We use the inclusion D[K](X) ,→ D(X) as a channel i : D[K](X) → X. Like
in the above coin scenario, we perform successive backward inferences, using
the above poll numbers, to obtain a posterior ρK obtained via updates:
ρK B υK | (i = 1a )52 | (i = 1b )28 | (i = 1c )20 = υK | (i = 1a )52 & (i = 1b )28 & (i = 1c )20 .
We illustrate that the posterior probability ρK |= aw takes the following values,
for several numbers K.
N = 100 N = 500 N = 1000
Exercises
6.4.1 Recall the the N-element set [a, b]N from Definition 6.4.1 (1).
1 Show that its largest element is b − 21 s, where s = b−a
N .
2 Prove that x∈[a,b]N x = N(a+b) = N(N−1)
P P
2 . Remember: i∈N i 2 .
6.4.2 Consider coin parameter learning in Example 6.4.2.
1 Show that the predicition in the prior (uniform) state unif on [0, 1]N
gives a fair coin, i.e.
flip = unif = 12 | 1 i + 12 | 0i.
This equality is independent of N.
352
6.4. Discretisation, and coin bias learning 353
i2 = N(N−1)(2N−1)
P
4 Use the ‘square pyramidal’ formula i∈N 6 to prove
that:
flip = (unif|flip = 1H ) (1) = 2 1
.
3 − 6N 2
Show that:
1 The map add is an action of the monoid N({H, T }) on N × N.
2 Show that the above rectangle commutes — and thus that βN is a
homomorphism of monoid actions. (This description of a conjugate
prior relationship as monoid action comes from [81].)
6.4.6 We accept, without proof, the following equation. For a, b ∈ N,
1 X a a! · b!
lim · r · (1 − r)b = . (∗)
N→∞ N
r∈[0,1]
(a + b + 1)!
N
Use (∗) to prove that the binary Pólya distribution can be obtained
from the binomial, via the limit of the discretised β distribution (6.3):
lim bn[K] = βN (a, b) (i)
N→∞
= pl[K] (a+1)| H i + (b+1)|T i i| H i + (K −i)| T i .
353
354 Chapter 6. Updating distributions
354
6.5. Inference in Bayesian networks 355
O O
D X
dysp k xray
O O
•O
E
B
either
L
? O
T
bronc lung tub
d ; O
•O A
S
smoke asia
Figure 6.6 The graph of the Asia Bayesian network, with node abbreviations:
bronc = bronchitis, dysp = dyspnea, lung = lung cancer, tub = tuberculosis. The
wires all have 2-element sets of the form A = {a, a⊥ } as types.
lung : A → L via state transformation. Thus we follow the ‘V’ shape in the
relevant part of the graph.
Combining this down-update-up steps gives the required outcome:
lung = smoke bronc = 1 ⊥ = 0.0427| ` i + 0.9573|`⊥ i.
b
We see that this calculation combines forward and backward inference, see
Definition 6.3.1.
355
356 Chapter 6. Updating distributions
P(smoke) P(asia)
0.5 0.01
smoke P(lung) asia P(tub)
s 0.1 a 0.05
s⊥ 0.01 a⊥ 0.01
s 0.6 e 0.98
s ⊥
0.3 e⊥ 0.05
Figure 6.7 The conditional probability tables of the Asia Bayesian network
Figure 6.8 The conditional probability tables from Figure 6.7 described as states
and channels.
Thus we compute:
(smoke ⊗ asia)(lung⊗tub) = (either = (xray = 1 )) 1, 0
x
= (smoke ⊗ asia) 1, 0
(xray ◦· either ◦· (lung⊗tub)) = 1 x
= 0.6878| si + 0.3122| s⊥ i.
Thus, a positive xray makes it more likely — w.r.t. the uniform prior — that the
patient smokes — as is to be expected. This is obtained by backward inference.
356
6.5. Inference in Bayesian networks 357
Recall that we write ∆ for the copy channel, in this expression of type S →
S × S.
Going in the backward direction we can form a predicate, called p below,
on the set B × L × T , by predicate transformation and conjunction:
The result that we are after is now obtained via updating and marginalisation:
σ| p 0, 1, 0 = 0.0558| ` i + 0.9442|`⊥ i.
(6.4)
There is an alternative way to describe the same outcome, using that certain
channels can be ‘shifted’. In particular, in the definition of the above state
σ, the channel bronc is used for state transformation. It can also be used in
a different role, namely for predicate transformation. We then use a slightly
different state, now on S × L × T ,
The bronc channel is now used for predicate transformation in the predicate:
The reason why the outcomes (6.4) and (6.5) are the same is the topic of Exer-
cise 6.5.2.
We conclude that inference in Bayesian networks can be done composi-
tionally via a combination of forward and backward inference, basically by
following the network structure.
357
358 Chapter 6. Updating distributions
Exercises
6.5.1 Consider the wetness Bayesian network from Section 2.6. Write down
the channel-based inference formulas for the following inference ques-
tions and check the outcomes that are given below.
1 The updated sprinkler distribution, given evidence of a slippery
63
road, is 260 |b i + 260
197
| b i.
2 The updated wet grass distribution, given evidence of a slippery
5200 |d i + 5200 | d i.
road, is 4349 851 ⊥
358
6.6. Bayesian inversion: the dagger of a channel 359
Thus, the dagger is obtained by updating the prior ω via the likelihood func-
tion Y → Pred (X), given by y 7→ c = 1y , see Definition 4.3.1. We have already
implicitly used the dagger of a channel in several situations.
Example 6.6.2. 1 In Example 6.3.2 we used a channel c : {H, T } → {W, B}
from sides of a coin to colors of balls in an urn, together with a fair coin
unif ∈ D({H, T }). We had evidence of a white ball and wanted to know the
updated coin distribution. The outcome that we computed can be described
via a dagger channel, since:
c†unif (W) = unif|c = 1W = 22
67 | H i + 45
67 | T i.
The same redescription in terms of Bayesian inversion can be used for Ex-
amples 6.3.3 and 6.3.9, 6.3.10. In Example 6.3.5 we can use Bayesian in-
version for the point evidence case, but not for the illustration with soft
evidence, i.e. with fuzzy predicates. This matter will be investigated further
in Section 6.7.
2 In Exercises 6.3.1 and 6.3.2 we have seen a channel c : {d, d⊥ } → {p, n} that
combines the sensitivity and specificity of a medical test, in a situation with
a prior / prevalence of 1% for the disease. The associated Positive Prediction
Value (PPV) and Negative Predication Value (NPV) can be expressed via a
dagger as:
PPV = c†ω (p)(d) = ω|c = 1 p (d) = 18
117
NPV = c†ω (n)(d⊥ ) = ω|c = 1n (d⊥ ) = 1883 ,
1881
where ω = 1
100 | d i + 99 ⊥
100 | d i is the prior distribution.
In Definition 6.3.1 we have carefully distinguished forward inference c =
(ω| p ) from backward inference ω|c = q . Since Bayesian inversion turns channels
around, the question arises whether it also turns inference around: can one ex-
press forward (resp. backward) inference along c in terms of backward (resp.
forward) inference along the Bayesian inversion of c? The next result from [77]
shows that this is indeed the case. It demonstrates that the directions of infer-
ence and of channels are intimately connected and it gives the channel-based
framework internal consistency.
359
360 Chapter 6. Updating distributions
2 Similarly, for y ∈ Y,
(c = σ)(y) · (c†σ = p)(y)
τc† = p (y) =
σ
c = σ |= c†σ = p
X (c = σ)(y) · c† (y)(x) · p(x)
σ
=
x∈X cσ = (c = σ) |= p
†
X (c = σ)(y) · σ(x)·c(x)(y)
(c = σ)(y) · p(x)
=
x∈X
σ |= p
X σ(x) · p(x)
= c(x)(y) ·
x∈X
σ |= p
X
= c(x)(y) · σ| p (x)
x∈X
= c = (σ| p ) (y).
We include two further results with basic properties of the dagger of a chan-
nel. More such properties are investigated in Section 7.8, where the dagger is
put in a categorical perspective.
360
6.6. Bayesian inversion: the dagger of a channel 361
Then:
1 The joint state ω is the graph of both these channels:
hid , ci = ω1 = ω = hd, id i = ω2 .
361
362 Chapter 6. Updating distributions
1 For x ∈ X and y ∈ Y,
hid , ci = ω1 (x, y) = ω1 (x) · π2 ◦· π1 †ω (x)(y)
X
†
= ω1 (x) · π1 ω (x)(z, y)
z∈X
(6.6)
X ω(z, y) · π1 ((z, y), x)
= (π1 = ω)(x) ·
z∈X
(π1 = ω)(x)
= ω(x, y).
2 Similarly,
(6.6) ω1 (x) · c(x)(y) (hid , ci = ω1 )(x, y)
c†ω1 (y)(x) = =
(c = ω1 )(y) ω2 (y)
ω(x, y)
=
(π2 = ω)(y)
X
= π2 †ω (y)(x, z)
z∈Y
†
= π1 ◦· π2 ω (y)(x)
= d(y)(x).
As announced there, these EDD and EDA maps are each other’s daggers. We
can now make this precise.
362
6.6. Bayesian inversion: the dagger of a channel 363
Proposition 6.6.6. The Ewens draw-delete/add channels are each other’s dag-
gers, via the Ewens distributions: for each K, t ∈ N>0 one has:
Proof. Fix Ewens multisets ϕ ∈ E(K) and ψ ∈ E(K +1). A useful observation
to start with is that the commuting diagrams in Corollary 3.9.12 give us the
following validities of transformed point predicates.
= ew[K](t)(ϕ)
ew[K](t) |= EDA(t) = 1ψ = EDA(t) = ew[K](t) (ψ)
= ew[K +1](t)(ψ).
For the first equation in the proposition we mimick the proof of Corollary 3.9.12
and follow the reasoning there:
363
364 Chapter 6. Updating distributions
ψ(1) X ψ(k) · k
= ψ − 1| 1 i + ψ + 1| k−1 i − 1| k i
K +1 2≤k≤K+1
K +1
= EDD(ϕ).
Exercises
6.6.1 Let σ, τ ∈ D(X) be distributions with supp(τ) ⊆ supp(σ). Show that
for the identity channel id one has:
id †σ = τ = τ.
6.6.2 Let c : X → Y be a channel with state ω ∈ D(X), where c = ω has
full support. Consider the following distribution of distributions:
X
ΩB (c = ω)(y) c†ω (y) ∈ D D(X) .
y∈Y
364
6.6. Bayesian inversion: the dagger of a channel 365
Recall the probabilistic ‘flatten’ operation from Section 2.2 and show:
flat(Ω) = ω.
This says that Ω is ‘Bayes plausible’ in the terminology of [97] on
Bayesian persuasion. The construction of Ω is one part of the bijective
correspondence in Exercise 7.6.7.
6.6.3 For ω ∈ D(X), c : X → Y, p ∈ Obs(X) and q ∈ Obs(Y) prove:
c = ω |= (c†ω = p) & q = ω |= p & (c = q).
This is a reformulation of [24, Thm. 6]. If we ignore the validity signs
|= and the states before them, then we can recognise in this equation
the familiar ‘adjointness property’ of daggers (adjoint transposes) in
the theory of Hilbert spaces: hA† (x) | yi = hx | A(y)i.
6.6.4 Consider the draw-delete channel DD : N[K + 1](X) → N[K](X)
from Definition 2.2.4. Let X be a finite set, say with n elements. Con-
sider the the uniform state unif on N[K + 1](X), see Exercise 2.3.7.
Show that DD’s dagger, with respect to unif, can be described on
ϕ ∈ N[K](X) as:
X ϕ(x) + 1
DD †unif (ϕ) = ϕ + 1| x i .
x∈X
K + n
This dagger differs from the draw-add map DA : N[K](X) → N[K +
1](X).
6.6.5 Let ω ∈ D(X × Y) be a joint state, whose two marginals ω1 B
ω 1, 0 ∈ D(X) and ω2 B ω 0, 1 ∈ D(Y) have full support. Let
p ∈ Pred (X) and q ∈ Pred (Y) be arbitrary predicates.
Prove that there are predicates q1 on X and p2 on Y such that:
ω1 |= p & q1 = ω |= p ⊗ q = ω2 |= p1 & q.
This is a discrete version of [131, Prop. 6.7].
Hint: Disintegrate!
6.6.6 Let c : X → Y and d : X → Z be channels with a state ω ∈ D(X) on
their domain such that both predicted states c = ω and d = ω have
full support. Show that:
hc, di†ω (y, z) = ω|(c = 1y )&(d = 1z ) = d†† (z)
cω (y)
= ω|(d = 1z )&(c = 1y ) = c†† (y).
cω (z)
365
366 Chapter 6. Updating distributions
σ|c = q ∈ D(X).
2 Jeffrey’s rule involves update of the prior σ with evidence in form of a state
τ ∈ D(Y) to the posterior:
c†σ = τ ∈ D(X).
We shall illustrate the difference between Pearl’s and Jeffreys’s rules in two
examples.
366
6.7. Pearl’s and Jeffrey’s update rules 367
Example 6.7.2. Consider the situation from Exercise 6.3.1 with set D = {d, d⊥ }
for disease (or not) and T = {p, n} for a positive or negative test outcome, with
a test channel c : D → T given by:
c(d) = 9
10 | p i + 1
10 | n i c(d⊥ ) = 1
20 | p i + 19
20 | n i.
The test thus has a sensitivity of 90% and a specificity of 95%. We assume a
prevalence of 10% via a prior state ω = 101
| d i + 10
9
| d⊥ i.
The test is performed, under unfavourable circumstances like bad light, and
we are only 80% sure that the test is positive. With Pearl’s update rule we thus
use as evidence predicate q = 45 · 1 p + 51 · 1n . It gives as posterior disease
probability:
ω|c = q = 74
281 | d i + 207
281 | d i ≈ 0.26| d i + 0.73| d i.
This gives a disease likelihood of 26%.
When we decide to use Jeffrey’s rule we translate the 80% certainty of a
positive test into a state τ = 54 | p i + 15 | ni. Then we compute:
(6.6) 4
= 5 · ω|c = 1 p + 15 · ω|c = 1n
= 45 · 23 | d i + 13 | d i + + 45 · 2
173 | d i + 171
173 | di
= 519278
| d i + 241
519 | d i
≈ 0.54| d i + 0.46| d i.
The disease likelihood is now 54%, more than twice as high as with Pearl’s
update rule. This is a serious difference, which may have serious consequences.
Should we start asking our doctors: does your computer use Pearl’s or Jeffrey’s
rule to calculate likelihoods, and come to an advice about medical treatment?
We also review an earlier illustration, now with Jeffrey’s approach.
Example 6.7.3. Recall the situation of Example 2.2.2, with a teacher predict-
ing the performance of pupils, depending on the teacher’s mood. The evidence
q from Example 6.3.4 is used for Pearl’s update. It can be translated into a state
τ on the set G of grades:
τ= 1
10 | 1 i + 3
10 | 2i + 3
10 |3 i + 2
10 | 4 i + 1
10 | 5 i.
There is an a priori divergence DKL (τ, c = σ) ≈ 1.336. With some effort one
can prove that the Jeffrey-update of σ is:
σ0 = c†σ = τ = 3913520
972795
| p i + 1966737
3913520 | ni + 3913520 | o i
973988
367
368 Chapter 6. Updating distributions
prior mood
Pearl-update Jeffrey-update
The prior mood is reproduced from Example 2.2.2 for easy comparison. The
Pearl and Jeffrey updates differ only slightly in this case.
The following theorem (from [79]) captures the essences of the update rules
of Pearl and Jeffrey: validity increase or divergence decrease.
Theorem 6.7.4. Consider the situation in Definition 6.7.1, with a prior state
σ ∈ D(X) and a channel c : X → Y.
368
6.7. Pearl’s and Jeffrey’s update rules 369
≥ σ |= c = q
= c = σ |= q.
The fact that Jeffrey’s rule involves correction of prediction errors, as stated
in Theorem 6.7.4 (2), has led to the view that Jeffrey’s update rule should
be used in situations where one is confronted with ‘surprises’ [41] or with
‘unanticipated knowledge’ [40]. Here we include an example from [41] (also
used in [77]) that involves such error correction after a ‘surprise’.
Example 6.7.5. Ann must decide about hiring Bob, whose characteristics are
described in terms of a combination of competence (c or c⊥ ) and experience (e
or e⊥ ). The prior, based on experience with many earlier candidates, is a joint
distribution on the product space C × E, for C = {c, c⊥ } and E = {e, e⊥ }, given
as:
ω= 4
10 | c, e i + 1 ⊥
10 | c, e i + 10 |c , ei
1 ⊥
+ 10 | c , e i.
4 ⊥ ⊥
Ann reads Bob’s letter to find out if he actually has relevant experience. We
quote from [41]:
Bob’s answer reveals right from the beginning that his written English is poor. Ann no-
tices this even before figuring out what Bob says about his work experience. In response
to this unforeseen learnt input, Ann lowers her probability that Bob is competent from
1
2
to 18 . It is natural to model this as an instance of Jeffrey revision.
369
370 Chapter 6. Updating distributions
Bob’s poor English is a new state of affairs: a surprise that causes Ann to switch
to error reduction mode, via Jeffrey’s rule. Bob’s poor command of the English
language translates into a competence state ρ = 18 | ci + 78 | c⊥ i. Ann wants to
adapt to this new surprising situation, so she uses Jeffrey’s rule, giving a new
joint state:
ω0 = (π1 )†ω = ρ = 1
10 | c, e i + 1 ⊥
40 | c, e i + 40 | c , e i
7 ⊥
+ 10 | c , e i.
7 ⊥ ⊥
If the letter now tells that Bob has work experience, Ann will factor this in, in
this new situation ω0 , giving, like above, via Pearl’s rule followed by marginal-
isation,
ω0 |π2 = 1e 1, 0 = 11
4
| c i + 11
7 ⊥
| c i.
The likelihood of Bob being competent is now lower than in the prior state,
4
since 11 < 21 . This example reconstructs the illustration from [41] in channel-
based form, with the associated formulations of Pearl’s and Jeffrey’s rules from
Definition 6.7.1, and produces exactly the same outcomes as in loc. cit.
1 Pearl’s rule and Jeffrey’s rule agree on point predicate / states: for y ∈ Y,
with associated point predicate 1y and point state 1| y i, one has:
2 Pearl’s updating with a constant predicate (no information) does not change
the prior state ω:
σ| c = r ·1 = σ, for r > 0.
Jeffrey’s update does not change anything when we update with what we
already predict:
c†σ = c = σ = σ.
This is in essence the law of total probability, see Exercise 6.7.6 below.
3 For a factor q ∈ Fact(Y) we can update the state σ according to Pearl, and
also the channel c so that the updated predicted state is predicted:
c|q = (σ|c = q ) = c = σ |q .
In Jeffrey’s case, with an evidence state τ ∈ D(Y), we can update both the
370
6.7. Pearl’s and Jeffrey’s update rules 371
state and the channel, via a double-dagger, so that the evidence state τ is
predicted:
† †
cσ = c†σ = τ = τ.
τ
4 Multiple updates with Pearl’s rule commute, but multiple updates with Jef-
frey’s rule do not commute.
Proof. These items follow from earlier results.
1 Directly by Definition 6.7.1.
2 The first equation follows from Lemma 6.1.6 (1) and (4); the second one is
Proposition 6.6.4 (1).
3 The first claim is Theorem 6.3.11 and the second one is an instance of Propo-
sition 6.6.4 (2).
4 By Lemma 6.1.6 (3) we have:
σ|c = q1 |c = q2 = σ|(c = q1 )&(c = q2 ) = σ|c = q2 |c = q1 .
However, in general, for evidence states τ1 , τ2 ∈ D(Y),
c†† = τ2 , c†† = τ1 .
cσ = τ1 cσ = τ2
371
372 Chapter 6. Updating distributions
c†σ = τ = σ| c = τ/(c=σ) .
Proof. The first item is exactly Theorem 6.6.3 (1). For the second item we first
note that (c = σ) |= τ/(c = σ) = 1, since (c = σ) is a state:
X τ(y) X
(c = σ) |= τ/(c = σ) = (c = σ)(y) · = τ(y) = 1. (6.7)
y∈Y
(c = σ)(y) y∈Y
This result can be interpreted as follows. With my prior state σ I can predict
c = σ. I can update this prediction with evidence q to (c = σ)|q . The divergence
between this update and my prediction is bigger than the divergence between
the update and the prediction from the Pearl/Bayes posterior σ|c = q . Thus, the
posterior gives a correction.
372
6.7. Pearl’s and Jeffrey’s update rules 373
Remarks 6.7.9. 1 In the above corrolary we use that Pearl’s rule can be ex-
pressed as Jeffrey’s, see point (1) in Proposition 6.7.7. The other way around,
Jeffrey’s rule is expressed via Pearl’s in point (2). This leads to a validity in-
crease property for Jeffrey’s rule: let c : X → Y be a channel with states
σ ∈ D(X) and τ ∈ D(Y). Assuming that c = σ has full support we can form
the factor q = τ/(c = σ). We then get an inequality:
(6.7)
c = c†σ = τ |= q = c = σ|c = q |= q ≥ c = σ |= q = 1.
373
374 Chapter 6. Updating distributions
X ri · (ω |= pi )
≤ 1. (6.8)
j r j · (ω| p j |= pi )
P
i
This weighted update inequality follows from [50, Thm. 4.1]. The proof is
non-trivial, but involves some interesting properties of matrices of validities.
We skip this proof now and first show how it is used in a proof of the diver-
gence reduction of Jeffrey’s rule. A binary version, for n = 2, of this result has
appeared already in Exercise 6.1.12.
We use the name ‘weigthed update’, since the denominator in (6.8) can also
be written as the validity of the predicate pi in the weighted sum of updates:
P
r j · (ω| p j |= pi = j r j · ω| p j |= pi .
P
j
374
6.7. Pearl’s and Jeffrey’s update rules 375
The first inequality is an instance of Jensen’s inequality, see Lemma 2.7.3. The
second one follows from:
X τ(y) · c = σ(y)
y c = (c† = τ) (y)
σ
X τ(y) · (σ |= c = 1y )
=
x (cσ = τ)(x) · c(x)(y)
y
P †
X τ(y) · (σ |= c = 1y )
=
x,z τ(z) · cσ (z)(x) · (c = 1y )(x)
y
P †
(6.6)
X τ(y) · (σ |= c = 1y )
= σ(x)·(c = 1z )(x)
x,z τ(z) · · (c = 1y )(x)
y P
σ|=c = 1z
X τ(y) · (σ |= c = 1y )
=
x σ(x)·((c = 1z )&(c = 1y ))(x) σ|=(c = 1z )&(c = 1y )
P
z τ(z) ·
y P
σ|=(c = 1z )&(c = 1y ) · σ|=c = 1z
X τ(y) · (σ |= c = 1y )
= by Proposition 6.1.3 (1)
z τ(z) · (σ|c = 1z |= c = 1y )
P
y
≤ 1, by Proposition 6.7.10.
In the last line we apply Proposition 6.7.10 with test pi B c = 1yi , where Y =
{y1 , . . . , yn }. The point predicates 1yi form a test on Y. Predicate transformation
preserves tests, see Exercise 4.3.4. As is common with daggers c†σ , we assume
that c = σ has full support, so that σ |= pi = σ |= c = 1yi = (c = σ)(yi ) is
non-zero for each i.
375
376 Chapter 6. Updating distributions
teresting properties. The proof that we present is extracted2 from [50] and is
applied only to this conditional expectation matrix C. The original proof is
formulated more generally.
We recall that for a square matrix A the spectral radius ρ(A) is the maximum
of the absolute values of its eigenvalues:
n o
ρ(A) B max |λ| λ is an eigenvalue of A .
We shall make use the following result. The first item is known as Gelfand’s
formula, originally from 1941. The proof is non-trivial and is skipped here; for
details, see e.g. [112, Appendix 10]. For convenience we include short (stan-
dard) proofs of the other two items.
2 Here we shall use the 1-norm k A k1 B max j i |Ai j |. It yields that ρ(A) = 1
P
for each (left/column-)stochastic matrix A.
3 Let square matrix A now be non-negative, that is, satisfy Ai j ≥ 0, and let x
be a positive vector, so each xi > 0. If Ax ≤ r · x with r > 0, then ρ(A) ≤ r.
Since stochastic matrices are closed under matrix multiplication, one gets k An k1 =
1 for each n. Hence ρ(A) = 1 via Gelfand’s formula.
We next show how the third item can be obtained from the first one, as
in [98, Cor. 8.2.2]. By assumption, each entry xi in the (finite) vector x is
positive. Let’s write x− for the least one and x+ for the greatest one. Then
0 < x− ≤ xi ≤ x+ for each i. For each n we have:
P
An
1 · x− = max j i An i j · x−
≤ max j i An i j · xi
P
= max j An x j
≤ max j rn · x j
≤ r n · x+ .
2 With the help of Harald Woracek and Ana Sokolova.
376
6.7. Pearl’s and Jeffrey’s update rules 377
Hence:
!1/n !1/n
1/n x x+
An
1 ≤ rn · + = r· .
x− x−
For the remainder of this subsection the setting is as follows. Let ω ∈ D(X)
be a fixed state, with an n-test p1 , . . . , pn ∈ Pred (X) so that >i pi = 1, by
definition (of test). We shall assume that the validities vi B ω |= pi are non-
zero. Notice that i vi = 1. We organise these validities in a vector v and in
P
diagonal matrix V:
In the last line we use Bayes’ product rule from Theorem 6.1.3 (1). The next
series of facts is extracted from [50].
Lemma 6.7.12. The above matrices B and C satisfy the following properties.
i di · vi ≤ ρ(DC).
P
377
378 Chapter 6. Updating distributions
= ω |= q & q for q = i vi · pi
P zi
> 0.
We have a strict inequality > here since q ≥ vz11 · p1 and ω |= p1 > 0, by
assumption; in fact this holds for each pi . Thus:
z21 z21
ω |= q & q ≥ v21
· (ω |= p1 & p1 ) ≥ v21
· (ω |= p1 )2 > 0.
2 The square root B1/2 and inverse B−1 are obtained in the standard way via
spectral decomposition B = QΛQT where Λ is the diagonal matrix of eigen-
values λi > 0 and Q is an orthogonal matrix (so QT = Q−1 ). Then: B1/2 =
QΛ1/2 QT where Λ1/2 has entries λi/2 . Similarly, B−1 = QΛ−1 QT , and B−1/2 =
1
QΛ− /2 QT .
1
i Ci j = i ω| p j |= pi = ω| p j |= i pi = ω| p j |= 1 = 1.
P P P
= j ω |= pi & p j
P
by Proposition 6.1.3 (1)
= ω |= pi & ( j p j ) = ω |= pi & 1 = ω |= pi = vi .
P
Further, V B i j = vi · Bi j = Ci j .
4 We show that DC and B1/2 DV B1/2 have the same eigenvalues, which gives
ρ(DC) = ρ(B1/2 DV B1/2 ). First, let DC z = λz. Take z0 = B1/2 z one gets:
B1/2 DV B1/2 z0 = B1/2 DV B1/2 B1/2 z = B1/2 DCz = B1/2 λz = λB1/2 z = λz0 .
In the other direction, let B1/2 DV B1/2 w = λw. Now take w0 = B−1/2 w so that:
378
6.7. Pearl’s and Jeffrey’s update rules 379
5 We use the standard fact that for non-zero vectors z one has:
| (Az, z) |
≤ ρ(A),
(z, z)
where (−, −) is inner product. In particular,
| (B1/2 DV B1/2 z, z) |
≤ ρ B1/2 DV B1/2 = ρ(DC).
(z, z)
We instantiate with z = B1/2 v and use that V Bv = Cv = v and Bv = 1 in:
| (B1/2 DV B1/2 B1/2 v, B1/2 v) | | (DV Bv, B1/2 B1/2 v) |
ρ(DC) ≥ =
(B1/2 v, B1/2 v) (v, B1/2 B1/2 v)
P
| (Dv, Bv) | | (Dv, 1) | | i di · vi |
= = = = i di · vi .
P
P
(v, Bv) (v, 1) i vi
Exercises
6.7.1 Check for yourself the claimed outcomes in Example 6.7.5:
1 ω|π2 = 1e 1, 0 = 45 | ci + 51 | c⊥ i;
2 ω0 B (π1 )†ω = ρ = 10 1
| c, e i + 401
| c, e⊥ i + 40
7 ⊥
| c , e i + 10 | c , e i;
7 ⊥ ⊥
3 ω |π2 = 1e 1, 0 = 11 | ci + 11 | c i.
0 4 7 ⊥
379
380 Chapter 6. Updating distributions
6.7.3 The next illustration is attributed to Jeffrey, and reproduced for in-
stance in [19, 33]. We consider three colors: green (g), blue (b) and
violet (v), which are combined in a space C = {g, b, v}. These colors
apply to cloths, which can additionally be sold or not, as represented
by the space S = {s, s⊥ }. There is a prior joint distribution τ on C × S ,
namely:
ω= 3
25 | g, si + 9
50 | g, s⊥ i + 3
25 | b, si + 9
50 | b, s⊥ i + 8
25 | v, si + 2
25 | v, s⊥ i.
A cloth is inspected by candlelight and the following likelihoods are
reported per color: 70% certainty that it is green, 25% that it is blue,
and 5% that it is violet.
1 Compute the two marginals ω1 B ω 1, 0 ∈ D(C) and ω2 B
ω 0, 1 ∈ D(S ) and show that we can write the joint state ω in
two ways as:
hc, id i = ω2 = ω = hid , di = ω1
for channels c : S → C and d : C → S , given by:
d(g) = 52 | s i + 35 | s⊥ i
c(s) = 14 | bi + 14 |b i + 14 | v i
3 3 8
d(b) = 52 | s i + 35 | s⊥ i
c(s⊥ ) = 9 | bi + 9 |b i + 2 | v i
d(v) = 4 | s i + 1 | s⊥ i.
22 22 11
5 5
These channels c, d are each other’s daggers, see Theorem 6.6.5 (1).
2 Capture the above inspection evidence as a predicate q on C and
show that Pearl’s rule gives:
ω2 |c = q = ω|q⊗1 0, 1 = d = ω1 |q = 26 61 | s i + 61 | s i.
35 ⊥
380
6.7. Pearl’s and Jeffrey’s update rules 381
ω= 1
200 | a, b i + 7 ⊥
500 | a, b i + 1000 | a , bi
1 ⊥
+ 100 |a , b i.
98 ⊥ ⊥
Someone reports that the alarm went off, but with only 80% certainty
because of deafness.
1 Translate the alarm information into a predicate p : A → [0, 1] and
show that crossover updating leads to a burglary distribution:
ω| p⊗1 0, 1 = 151
3
| b i + 148
⊥
151 | b i
≈ 0.02| b i + 0.98| b⊥ i.
d = 4
+ 15 |a⊥ i = 93195
19639
| b i + 73556
⊥
5 | ai 93195 | b i
≈ 0.21| b i + 0.79| b⊥ i.
σ1 B c†σ = ρ = 278
519 | d i + 278 ⊥
519 | d i.
381
382 Chapter 6. Updating distributions
382
6.8. Frequentist and Bayesian discrete probability 383
6.7.11 Consider the set {a, b, c} with distribution and test of predicates:
p1 = 1
2 · 1a + 1
2 · 1b
ω= 1
4 | ai + 1
2 | bi + 1
4 |c i p2 = 1
2 · 1a + 1
4 · 1b + 1
2 · 1c
p3 = 1
4 · 1b + 1
2 · 1c .
1 Check that the two matrices B, C defined in (6.9) are:
4/3 8/9 2/3 1/2 1/3 1/4
383
384 Chapter 6. Updating distributions
is truth and falsity 1, 0, orthocomplement p⊥ , partial sum p>q, and scalar mul-
tiplication r · p, see Subsection 4.2.3.
The next result is a Riesz-style representation result, representing distribu-
X
tions on X via a double dual [0, 1][0,1] .
Theorem 6.8.1. Let X be a finite set. There is then a ‘representation’ isomor-
phism,
In the last line we use that h is a homomorphism of effect modules and that the
predicate p has a normal form > x p(x) · 1 x , see Lemma 4.2.3 (2).
There are two leading interpretations of probability, namely a frequentist
interpretation and a Bayesian interpretation. The frequentist approach treats
probability distributions as records of probabilities of occurrences, obtained
from long-term accumulations, e.g. via frequentist learning. One can asso-
ciate this view with the set D(X) of distributions on the left-hand-side of
the representation isomorphism (6.10). The right-hand-side fits the Bayesian
view, which focuses on assigning probabilities to belief functions (predicates),
see [38], in this situation via special functions Pred (X) → [0, 1] that preserve
384
6.8. Frequentist and Bayesian discrete probability 385
2 Let h : Pred (X) → [0, 1] and k : Pred (Y) → [0, 1] be maps of effect modules.
We define their tensor product h ⊗ k : Pred (X × Y) → [0, 1] as:
(h ⊗ k)(r) B h x 7→ k r(x, −)
= k y 7→ h r(−, y) .
3 Let h : Pred (X) → [0, 1] be a map of effect modules and let p ∈ Pred (X) be
a predicate with h(p) , 0. We define an update h| p : Pred (X) → [0, 1] as:
h(p & q)
h| p (q) B .
h(p)
Then: V−1 h| p = V−1 (h) p .
Proof. 1 We consider the first marginal only. It is easy to see that the map
h 1, 0 : Pred (X) → [0, 1], as defined above, preserves the effect module
385
386 Chapter 6. Updating distributions
The same can be shown for the other formulation of h ⊗ k in item (2).
3 Again, h| p : Pred (X) → [0, 1] is a map of effect modules. Next, for x ∈ X,
h(p ⊗ 1 x )
V−1 h| p (x) = h| p (1 x ) =
h(p)
V−1 (h) |= p ⊗ 1 x
=
V−1 (h) |= p
= V−1 (h) p |= 1 x by Bayes, see Theorem 6.1.3 (2)
= V−1 (h) (x).
p
386
6.8. Frequentist and Bayesian discrete probability 387
For each set X, the powerset P(X) is an effect algebra with arbitrary joins, so
in particular ω-joins. The unit interval [0, 1] is an ω-effect module, and more
generally, each set of predicates Pred (X) = [0, 1]X is an effect module, via
pointwise joins. Later on, in continuous probability, we shall see measurable
spaces whose sets of measurable subsets form examples of ω-effect algebras.
We recall from Exercise 4.2.15 that the sum operation > of an ω-effect al-
W
gebra is continuous in both arguments separately. Explicitly: n x > yn ≤
W W W
x > n yn . Further, if there are joins , there are also meets via orthosup-
plement.
Proof. Much of this works as in the proof of Theorem 6.8.1, except that we
have to prove that V(ω) : Pred (X) → [0, 1] is ω-continuous, now that we have
ω ∈ D∞ (X). So let (pn ) be an ascending chain of predicates, with pointwise
join p = n pn . Then:
W
_ _ X _
V(ω) pn = ω |= pn = ω(x) · pn (x)
n n n
x∈X
X _
= ω(x) · pn (x)
n
x∈X
X _
= ω(x) · pn (x)
n
x∈X
(∗)
_ X _
= ω(x) · pn (x) = V(ω)(pn ).
n n
x∈X
(∗)
The direction (≥) of the marked equation = holds by monotonicity. For (≤) we
reason as follows. Since X is countable, we can write it as X = {x1 , x2 , . . .}. For
387
388 Chapter 6. Updating distributions
Hence:
X_ X_ _ X
ω(x) · pn (x) = lim ω(xk ) · pn (xk ) ≤ ω(x) · pn (x).
n N→∞ n n
x∈X k≤N x∈X
In order to see that the probabilities h(1 x ) add up to one, we write X = {x1 , x2 , . . .}.
We express the sum over these xi via a join of an ascending chain:
X _ X _
h(1 x ) = h(1 xk ) = h > 1 xk
n n
x∈X k≤n k≤n
_
= h 1{x1 ,...,xn } = h(1) = 1.
n
The argument that V and V−1 are each other’s inverses works essentially in
the same way as in the proof of Theorem 6.8.1.
Exercises
6.8.1 Consider the representation isomorphism in Theorem 6.8.1.
1 Show that a uniform distribution corresponds to the mapping that
sends a predicate to its average value.
2 Show that for finite sets X, Y and maps h ∈ EMod Pred (X), [0, 1]
and k ∈ EMod Pred (Y), [0, 1] one has:
h ⊗ k 1, 0 = h.
6.8.2 Recall from Theorem 4.2.5 that Pred (X) is the free effect module on
the effect algebra P(X), for a finite set X.
1 Deduce from this fact that there is an isomorphism:
EA P(X), [0, 1] EMod Pred (X), [0, 1] .
388
6.8. Frequentist and Bayesian discrete probability 389
389
7
390
391
391
392 Chapter 7. Directed Graphical Models
Definition 7.1.1. The collection Type of types is the smallest set with:
In fact, the first item can easily be seen as a special case of the second one.
392
7.1. String diagrams 393
dom /
SD(Σ) / L(Type)
cod
(1) Σ ⊆ SD(Σ), that is, every box in f ∈ Σ in the signature as in (7.1) is a string
diagram in itself, with dom( f ) = [A1 , . . . , An ] and cod ( f ) = [B1 , . . . , Bm ].
(2) For all A, B ∈ Type, the following diagrams are in SD(Σ).
393
394 Chapter 7. Directed Graphical Models
(g) Boxes for the logical operations conjunction, orthosupplement and scalar
multiplication, of the form:
2 2 2
& ⊥ r
2 2 2 2
where r ∈ [0, 1]. These string diagrams have obvious domains and co-
domains. We recall that predicates on A can be identified with channels
A → 2, see Exercise 4.3.5. Hence they can be included in string diagrams.
(3) If S 1 , S 2 ∈ SD(Σ) are string diagrams, then the parallel composition (jux-
taposition) S 1 S 2 is also string diagram, with dom(S 1 S 2 ) = dom(S 1 ) ++
dom(S 2 ) and cod (S 1 S 2 ) = cod (S 1 ) ++ cod (S 2 ).
394
7.1. String diagrams 395
S2
···
dom(T )
= dom(S 1 ) ++ dom(S 2 ) − C~
···
T =
with (7.2)
C~
cod (T )
= cod (S 1 ) − C~ ++ cod (S 2 ).
···
S1
···
D E dysp xray
A A
S asia
winter
smoking
395
396 Chapter 7. Directed Graphical Models
diagrams make it possible to write a joint state, on an n-ary product type. For
instance, for n = 2 we can have a write a joint state as:
(7.4)
We shall use this marginal notation with masks also for boxes in general — and
not just for states, without incoming wires. This is in line with Definition 2.5.2
where marginalisation is defined both for states and for channels.
An additional advantage of string diagrams is that they can be used for equa-
tional reasoning. This will be explained in the next section. First we give mean-
ing to string diagrams.
396
7.1. String diagrams 397
cod (S ). The product over the empty list is the singleton element 1 = {0}. Such
an interpretation [[ S ]] is parametrised by an interpretation of the boxes in the
signature Σ as channels.
Given such an interpretation of the boxes in Σ, the items below describe
this interpretation function [[ − ]] inductively, acting on the whole of SD(Σ), by
precisely following the above four items in the inductive definition of string
diagrams.
(f) The point string diagram point A (a), for a ∈ A is interpreted as the point
state 1| a i ∈ D(A).
(g) The ‘logical’ channels for conjunction, orthosupplement, and scalar mul-
tiplication are interpreted by the corresponding channels conj, orth and
scal(r) from Exercise 4.3.5.
(3) The tensor product ⊗ of channels is used to interpret juxtaposition of string
diagrams [[ S 1 S 2 ]] = [[ S 1 ]] ⊗ [[ S 2 ]].
(4) If we have string diagrams S 1 , S 2 in a composition T as in (7.2), then:
[[ T ]] = id ⊗ [[ S 2 ]] ◦· ([[ S 1 ]] ⊗ id ),
At this stage we can say more precisely what a Bayesian network is, within
the setting of string diagrams. This string diagrammatic perspective started
in [48].
397
398 Chapter 7. Directed Graphical Models
The string diagram G in the first item is commonly referred to as the (underly-
ing) graph of the Bayesian network. The channels that are used as interpreta-
tions are the conditional probability tables of the network.
Given the definition of the boxes in (the signature of) a Bayesian network,
the whole network can be interpreted as a state/distribution [[ G ]]. This is in
general not the same thing as the joint distribution associated with the network,
see Definition 7.3.1 (3) later on. In addition, for each node A in G, we can look
at the subgraph G A of cumulative parents of A. The interpretation [[ G A ]] is
then also a state, namely the one that one obtains via state transformation. This
has been described as ‘prediction’ in Section 2.6.
The above definition of Bayesian network is more general than usual, for
instance because it allows the graph to contain joint states, like in (7.4), or
discarding , see also the discussion in [82].
Exercises
7.1.1 For a state ω ∈ D(X1 × X2 × X3 × X4 × X5 ), describe its marginalisation
ω 1, 0, 1, 0, 0 as string diagram.
7.1.2 Check in detail how the two string diagrams in (7.3) are obtained by
following the construction rules for string diagrams.
7.1.3 Consider the boxes in the wetness string diagram in (7.3) as a signa-
ture. An interpretation of the elements in this signature is described
in Section 2.6 as states and channels wi, sp etc. Check that this in-
terpretation extends to the whole wetness string diagram in (7.3) as a
channel 1 → D × E given by:
(wg ⊗ sr) ◦· (id ⊗ ∆) ◦· (sp ⊗ ra) ◦· ∆ ◦· wi.
7.1.4 Extend the interpretation from Section 6.5 of the signature of the Asia
string diagram in (7.3) to a similar interpretation of the whole Asia
string diagram.
7.1.5 For A = {a, b, c} check that:
hh ii
A = 3 | a, ai + 3 | b, b i + 3 | c, c i.
1 1 1
[[ A B
]] = [[ A×B
]] [[ A B ]] = [[ A×B ]]
398
7.2. Equations for string diagrams 399
7.1.7 Verify that for each box, interpreted as a channel, one has:
hh B ii hh ii
= A .
A
399
400 Chapter 7. Directed Graphical Models
Convention 7.2.1. 1 We will write the arrows ↑ on the wires of string dia-
grams simply as | and assume that the flow is allways upwards in a string
diagram, from bottom to top. Thus, even though we are not writing edges as
arrows, we still consider string diagrams as directed graphs.
2 We drop the types of wires when they are clear from the context. Thus, wires
in string diagrams are always typed, but we do not always write these types
explicitly.
3 We become a bit sloppy about writing the interpretation function [[ − ]] for
string diagrams explicitly. Sometimes we’ll say “the interpretation of S ”
instead of just [[ S ]], but also we sometimes simply write a string diagram,
where we mean its interpretation as a channel. This can be confusing at first,
so we will make it explicit when we start blurring the distinction between
syntax and semantics.
Shift equations
String diagrams allow quite a bit of topological freedom, for instance, in the
sense that parallel boxes may be shifted up and down, as expressed by the
following equations.
f g
= f g =
g f
[[ f ]] ⊗ id ◦· id ⊗ [[ g ]] = [[ f ]] ⊗ [[ g ]] = id ⊗ [[ g ]] ◦· [[ f ]] ⊗ id .
f g
= =
f g
These four equations do not only hold for boxes, but also for and , since
we consider them simply as boxes but with a special notation.
400
7.2. Equations for string diagrams 401
= =
Wires of product type correspond to parallel wires, and wires of the empty
product type 1 do nothing — as represented by the empty dashed box.
A×B = A B 1 =
= = = (7.5)
= =
A×B A B 1
We should be careful that boxes can in general not be “pulled through”
copiers, as expressed on the left below (see also Exercise 2.5.10).
,
f f
but =
f a a a
401
402 Chapter 7. Directed Graphical Models
A = 1 = = 1
For product wires we have the following discard equations, see Exercise 7.1.6.
A×B = A B = A B
Similarly we have the following uniform state equations.
A×B = A B = A B
=
(a, b) a b
& &
&
= = &
& &
1 In a quantum setting one uses ‘causal’ instead of ‘unital’, see e.g. [25].
402
7.2. Equations for string diagrams 403
Moreover, conjunction has truth and falsity as unit and zero elements:
& & 0
= =
1 0
⊥ ⊥
= =
0
⊥ 1
r & r
= r·s 1 = 0 =
0 =
s r &
& & r
& r
= = r = & =
r r &
All of the above equations between string diagrams are sound, in the sense
that they hold under all interpretations of string diagrams — as described at
the end of the previous section. There are also completeness results in this
area, but they require a deeper categorical analysis. For a systematic account
of string diagrams we refer to the overview paper [149]. In this book we use
string diagrams as convenient, intuitive notation with a precise semantics.
Exercises
7.2.1 Give a concrete description of the n-ary copy channel ∆n : A → An
in (7.6).
403
404 Chapter 7. Directed Graphical Models
& r·s
=
r s &
A A
Prove that it satisfies the following ‘barycentric’ equations, due to [155].
r rs
1 = r = 1−r =
r(1−s)
s 1−rs
··· ···
What does this amount to when there are no input (or no output)
wires?
404
7.3. Accessibility and joint states 405
B
C
sprinkler sprinkler
rain rain (7.7)
A
winter winter
The non-accessible parts of a string diagram are often called latent or unob-
served.
We are interested in such accessible versions because they lead in an easy
way to joint states (and vice-versa, see below): the joint state for the wetness
Bayesian network is the interpretation of the above accessible string diagram
on the right, giving a distribution on the product set A × B × D × C × E. Such
joint distributions are conceptually important, but they are often impractical to
work with because their size quickly grows out of hand. For instance, the joint
distribution ω for the above wetness network is (with zero-probability terms
omitted, as usual):
+ 3125
672
| a, b⊥ , d, c, ei + 3125
288
| a, b⊥ , d, c, e⊥ i + 3125168
| a, b⊥ , d⊥ , c, e i
+ 3125 | a, b , d , c, e i + 125 | a, b , d , c , e i + 20000
72 ⊥ ⊥ ⊥ 12 ⊥ ⊥ ⊥ ⊥ 399
| a⊥ , b, d, c, e i
+ 20000
171
| a⊥ , b, d, c, e⊥ i + 1000
243
| a⊥ , b, d, c⊥ , e⊥ i + 20000
21
| a⊥ , b, d⊥ , c, e i
+ 20000 | a , b, d , c, e i + 1000 | a , b, d , c , e i + 1250 |a⊥ , b⊥ , d, c, ei
9 ⊥ ⊥ ⊥ 27 ⊥ ⊥ ⊥ ⊥ 7
+ 12503
| a⊥ , b⊥ , d, c, e⊥ i + 5000
7
| a⊥ , b⊥ , d⊥ , c, e i + 5000
3
| a⊥ , b⊥ , d⊥ , c, e⊥ i
+ 100 | a , b , d , c , e i
9 ⊥ ⊥ ⊥ ⊥ ⊥
ω = id ⊗ id ⊗ wg ⊗ id ⊗ sr ◦· id ⊗ ∆2 ⊗ ∆3 ◦· id ⊗ sp ⊗ sr ◦· ∆3 ◦· wi.
Recall that ∆n is written for the n-ary copy channel, see (7.6).
Later on in this section we use our new insights into accessible string dia-
grams to re-examine the relation between crossover inference on joint states
405
406 Chapter 7. Directed Graphical Models
and channel-based inference, see Corollary 6.3.13. But we start with a more
precise description.
S2
···
···
(7.8)
···
S1
···
In this way each ‘internal’ wire between string diagrams S 1 and S 2 is ‘ex-
ternally’ accessible.
2 Each string diagram S can be made accessible by replacing each occur-
rence (7.2) of step (4) in its construction with the above modified step (7.8).
We shall write S for a choice of accessible version of S .
3 A joint state or joint distribution associated with a Bayesian network G ∈
SD(Σ) is the interpretation [[ G ]] of an accessible version G ∈ SD(Σ) of the
string diagram G. It re-uses G’s interpretation of its boxes.
406
7.3. Accessibility and joint states 407
407
408 Chapter 7. Directed Graphical Models
For the second inference question we use the same evidence, but we now
marginalise on the wet grass set D, in third position:
ω 0, 0, 1, 0, 0 = 4349 | d i + 851 | d⊥ i
1⊗1⊗1⊗1⊗1e 5200 5200
≈ 0.8363| d i + 0.1637|d⊥ i.
We see that crossover inference is easier to express than channel-based in-
ference, since we do not have to form the more complicated expressions with =
and = in the above two points. Instead, we just have to update and marginalise
at the right positions. Thus, it is easier from a mathematical perspective. But
from a computational perspective crossover inference is more complicated,
since these joint states grow exponentially in size (in the number of nodes in
the graph). This topic will be continued in Section 7.10.
Exercises
7.3.1 Draw an accessible verion of the Asia string diagram in (7.3). Write
also the corresponding composition of channels that produces the as-
sociated joint state.
Cloudy
0.5
0.2 Stay-in
0.15 0.2
0.3
0.9
Sunny 0.8 (7.9)
0.2 0.5
0.05 0.2 0.8
Go-out
Rainy 0.1
0.6
This model has three ‘hidden’ elements, namely Cloudy, Sunny, and Rainy,
representing the weather condition on a particular day. There are ‘temporal’
transitions with associated probabilities between these elements, as indicated
408
7.4. Hidden Markov models 409
by the labeled arrows. For instance, if it is cloudy today, then there is a 50%
chance that it will be cloudy again tomorrow. There are also two ‘visible’ el-
ements on the right: Stay-in, and Stay-out, describing two possible actions of
a person, depending on the weather condition. There are transitions with prob-
abilities from the hidden to the visible elements. The idea is that with every
time step a transition is made between hidden elements, resulting in a visible
outcome. Such steps may be repeated for a finite number of times — or even
forever. The interaction between what is hidden and what can be observed is
a key element of hidden Markov models. For instance, one may ask: given a
certain initial state, how likely is it to see a consecutive sequence of the four
visible elements: Stay-in, Stay-in, Go-out, Stay-in?
Hidden Markov models are simple statistical models that have many ap-
plications in temporal pattern recognition, in speech, handwriting or gestures,
but also in robotics and in biological sequences. This section will briefly look
into hidden Markov models, using the notation and terminology of channels
and string diagrams. Indeed, a hidden Markov model can be defined easily in
terms of channels and forward and backward transformation of states and ob-
servables. In addition, conditioning of states by observables can be used to for-
mulate and answer elementary questions about hidden Markov models. Learn-
ing for Markov models will be described separately in Sections ?? and ??.
Markov models are examples of probabilistic automata. Such automata will be
studied separately, in Chapter ??.
409
410 Chapter 7. Directed Graphical Models
σ
t = σ
unit
if n = 0
t = (t = σ) = (t ◦· t) = σ t =
n
where
.. t ◦· tn−1
if n > 0.
.
tn = σ
In these transitions the state at stage n + 1 only depends on the state at stage n:
in order to predict a future step, all we need is the immediate predecessor state.
This makes HMMs relatively easy dynamical models. Multi-stage dependen-
cies can be handled as well, by enlarging the sample space, see Exercise 7.4.6
below.
One interesting problem in the area of Markov chains is to find a ‘stationary’
state σ∞ with t = σ∞ = σ∞ , see Exercise 7.4.2 for an illustration, and also
Exercise 2.5.16 for a sufficient condition.
σ t e
Here we are more interested in hidden Markov models 1 → X → X → Y. The
elements of the set Y are observable — and hence sometimes called signals —
whereas the elements of X are hidden. Thus, many questions related to hidden
Markov models concentrate on what one can learn about X via Y, in a finite
number of steps. Hidden Markov models are examples of models with latent
variables.
We briefly discuss some basic issues related to HMMs in separate subsec-
410
7.4. Hidden Markov models 411
We recall that the tuple he, ti of channels is (e ⊗ t) ◦· ∆, see Definition 2.5.6 (1).
With these tuples we can form a joint state he, tin = σ ∈ D(Y n × X). As an
(interpreted) string diagram it looks as follows.
Y Y Y X
e e ··· e t
..
.
(7.11)
t
411
412 Chapter 7. Directed Graphical Models
σ t e
Definition 7.4.2. Let H = 1 → X → X → Y be a hidden Markov model and
let ~p = p1 , . . . , pn be a list of observables on Y. The validity H |= ~p of this
sequence ~p in the model H is defined via the tuples (7.10) as:
H |= ~p B he, tin = σ 1, . . . , 1, 0 |= p1 ⊗ · · · ⊗ pn
(4.8)
= he, tin = σ |= p1 ⊗ · · · ⊗ pn ⊗ 1 (7.12)
(4.9)
= σ |= he, tin = p1 ⊗ · · · ⊗ pn ⊗ 1 .
H |= ~p = σ |= (e = p1 ) &
t = (e = p2 ) &
(7.13)
t = (e = p3 ) & · · ·
t = (e = pn ) · · · .
= ∆ = (e = p1 ) ⊗ (t = q)
by Exercise 4.3.8
= (e = p1 ) & (t = q) by Exercise 4.3.7.
412
7.4. Hidden Markov models 413
H |= ~p
(7.12)
= σ |= he, tin = p1 ⊗ · · · ⊗ pn ⊗ 1
(∗)
= (e = p1 ) & t = ((e = p2 ) & t = (· · · t = ((e = pn ) & t = 1) · · · ))
= (e = p1 ) & t = ((e = p2 ) & t = (· · · t = (e = pn ) · · · )).
7.4.2 Filtering
Given a sequence ~p of factors, one can compute their validity H |= ~p in a
HMM H, as described above. But we can also use these factors to ‘guide’
the evolution of the HMM. At each state i the factor pi is used to update the
current state, via backward inference. The new state is then moved forward via
the transition function. This process is called filtering, after the Kalman filter
from the 1960s that is used for instance in trajectory optimisation in navigation
and in rocket control (e.g. for the Apollo program). The system can evolve
autonomously via its transition function, but observations at regular intervals
can update (correct) the current state.
σ t e
Definition 7.4.4. Let H = 1 → X → X → Y be a hidden Markov model and
let ~p = p1 , . . . , pn be a list of factors on Y. It gives rise to the filtered sequence
of states σ1 , σ1 , . . . , σn+1 ∈ D(X) following the observe-update-proceed prin-
ciple:
σ1 B σ σi+1 B t = σi e = p .
and
i
In the terminology of Definition 6.3.1, the definition of the state σi+1 in-
volves both forward and backward inference. Below we show that the final
state σn+1 in the filtered sequence can also be obtained via crossover inference
on a joint state, obtained via the tuple channels (7.10). This fact gives a theoret-
ical justification, but is of little practical relevance — since joint states quickly
become too big to handle.
413
414 Chapter 7. Directed Graphical Models
Proof. By induction on n ≥ 1. The base case with he, ti1 = he, ti is handled as
follows.
he, ti = σ p ⊗1 0, 1
1
= σ2 .
= σn+2 .
he, tin = σ |= p1 ⊗ · · · ⊗ pn ⊗ q
σn+1 |= q = .
H |= ~p
414
7.4. Hidden Markov models 415
σn+1 |= q = he, tin = σ p ⊗···⊗p ⊗1 0, . . . , 0, 1 |= q
1 n
The last equation uses Lemma 4.2.9 (1) and Definition 7.4.2.
X X X X Y Y Y
··· t e ··· e e
..
.
(7.14)
t
We proceed, much like in the beginning of this section, this time not using
tuples but triples. We define the above state as element of the subset:
hid ,t,ein
hid , t, ein = σ for X / Xn × X × Y n
415
416 Chapter 7. Directed Graphical Models
σ t e
Definition 7.4.7. Let 1 → X → X → Y be a HMM and let ~p = p1 , . . . , pn be a
sequence of factors on Y. Given these factors as successive observations, the
most likely path of elements of the sample space X, as an n-tuple in X n , is
obtained as:
argmax hid , t, ein = σ 1⊗ ··· ⊗1⊗1⊗p ⊗ ··· ⊗p 1, . . . , 1, 0, 0, . . . , 0 .
(7.15)
n 1
This expression involves updating the joint state (7.14) with the factors pi ,
in reversed order, and then taking its marginal so that the state with the first n
outcomes in X remains. The argmax of the latter state gives the sequence in X n
that we are after.
The expression (7.15) is important conceptually, but not computationally.
It is inefficient to compute, first of all because it involves a joint state that
grows exponentionally in n, and secondly because the conditioning involves
normalisation that is irrelevant when we take the argmax.
We given an impression of what (7.15) amounts to for n = 3. We first focus
on the expression within argmax − . The marginalisation and conditioning
amount produce an state on X × X × X of the form:
X (hid , t, ei3 = σ)(x1 , x2 , x3 , x, y3 , y2 , y1 ) · p3 (y3 ) · p2 (y2 ) · p1 (y1 )
Exercises
7.4.1 Consider the HMM example 7.9 with initial state σ = 1| Cloudyi.
1 Compute successive states tn = σ for n = 0, 1, 2, 3.
416
7.4. Hidden Markov models 417
7.4.2 Consider the transition channel t associated with the HMM exam-
ple 7.9. Check that in order to find a stationary state σ∞ = x| Cloudyi+
y| Sunny i + z| Rainy i one has to solve the equations:
1 2 3 4 5
t(1) = 3
4 |1 i + 1
4|2i e(1) = 1| 3 i
t(2) = 1
4 |1 i + 1
2|2i + 1
4|3i e(2) = 1| 2 i
t(3) = 1
4 |2 i + 1
2|3i + 1
4|4i e(3) = 1| 2 i
t(4) = 1
4 |3 i + 1
2|3i + 1
4|5i e(4) = 1| 2 i
t(5) = 1
4 |4 i + 3
4|5i e(5) = 1| 3 i.
417
418 Chapter 7. Directed Graphical Models
σ1 B σ = 1| 3i
σ2 = 4 |2 i + 2 | 3 i + 4 | 4 i
1 1 1
σ3 = 16 | 1 i + 4 | 2 i + 8 | 3 i + 4 | 4i + 16 | 5i
1 1 3 1 1
σ4 = 8 |1 i + 8 | 2 i + 8 | 4 i + 8 | 5 i
3 1 1 3
σ5 = 8 |1 i + 4 | 2 i + 4 | 3 i + 4 | 4 i + 8 | 5i
1 1 1 1 1
σ6 = 8 |1 i + 8 | 2 i + 8 | 4 i + 8 | 5 i
3 1 1 3
σ7 = 8 |1 i + 8 | 2 i + 8 | 4 i + 8 | 5 i.
3 1 1 3
3 Show that a most likely sequence of states giving rise to the se-
quence of observations α is [3, 2, 1, 2, 1, 1]. Is there an alternative?
7.4.4 Apply Bayes’ rule to the validity formulation (7.13) in order to prove
the correctness of the following HMM validity algorithm.
(σ, t, e) |= [] B 1
(σ, t, e) |= [p] ++ ~p B σ |= e = p · (t = (σ|e = p ), t, e) |= ~p .
418
7.5. Disintegration 419
it rained yesterday but not today, then it will rain tomorrow with probability
0.4; if it has not rained in the past to days, then it will rain tomorrow with
probability 0.2.
1 Write R = {r, r⊥ } for the state space of rain and no-rain outcomes,
and capture the above probabilities via a channel c : R × R → R.
2 Turn this channel c into a Markov chain hπ2 , ci : R × R → R × R,
where the second component of R × R describes whether or not it
rains on the current day, and the first component on the previous
day. Describe hπ2 , ci both as a function and as a string diagram.
3 Generalise this approach to a history of length N > 1: turn a chan-
nel X N → X into a Markov model X N → X N , where the relevant
history is incorporated into the sample space.
7.4.7 Use the approach of the prevous exercise to turn a hidden Markov
model into a Markov model.
7.4.8 In Definition 7.4.4 we have seen filtering for hidden Markov models,
starting from a sequence of factors ~p as observations. In practice these
factors are often point predicates 1yi for a sequence of elements ~y of
the visible space Y. Show that in that case one can describe filtering
via Bayesian inversion as follows, for a hidden Markov model with
transition channel t : X → X, emission channel e : X → Y and initial
state σ ∈ D(X).
σ1 = σ and σi+1 = (t ◦· e†σi )(yi ).
7.4.9 Let H1 and H2 be two HMMs. Define their parallel product H1 ⊗ H2
using the tensor operation ⊗ on states and channels.
7.5 Disintegration
In Subsections 1.3.3 and 1.4.3 we have seen how a binary relation R ∈ P(A×B)
on A × B corresponds to a P-channel A → P(B), and similarly how a multiset
ψ ∈ M(A × B) corresponds to an M-channel A → M(B). This phenomenon
was called: extraction of a channel from a joint state. This section describes the
analogue for probabilistic binary/joint states and channels (like in [52]). It turns
out to be more subtle because probabilistic extraction requires normalisation
— in order to ensure the unitality requirement of a D-channel: multiplicities
must add up to one.
In the probabilistic case this extraction is also called disintegration. It corre-
sponds to a well known phenomenon, namely that one can write a joint prob-
ability P(x, y) in terms of a conditional probability, via the familiar formula
P(x, y) = P(y | x) · P(x).
The mapping x ↦ P(y | x) is the channel involved, and P(x) refers to the
first marginal of P(x, y). We have already seen this extraction of a channel via
daggers, in Theorem 6.6.5.
Our current description of disintegration (and daggers) makes systematic
use of the graphical calculus that we introduced earlier in this chapter. The
formalisation of disintegration in category theory is not there (yet), but the
closely related concept of Bayesian inversion — the dagger of a channel —
has a nice categorical description, see Section 7.8 below.
We start by formulating the property of ‘full support’ in string diagrammatic
terms. It is needed as a pre-condition for disintegration.
Lemma 7.5.2. Let ω ∈ D(X), where X is non-empty and finite. The following
three points are equivalent:
1 ω has full support;
2 there are a predicate p on X and a non-zero scalar r with ω(x) · p(x) = r for each x ∈ X;
3 ω|p = unif_X for some predicate p on X.
Proof. Suppose that the finite set X has N ≥ 1 elements. For the implication
(1) ⇒ (2), let ω ∈ D(X) have full support, so that ω(x) > 0 for each x ∈ X.
Since ω(x) ≤ 1, we get 1/ω(x) ≥ 1, so that t ≔ Σ_x 1/ω(x) ≥ N and t · ω(x) ≥ 1, for
each x ∈ X. Take r = 1/t ≤ 1/N ≤ 1, and put p(x) = 1/(t · ω(x)). Then p(x) ∈ [0, 1] and:

ω(x) · p(x) = 1/t = r.

Next, for (2) ⇒ (3), suppose ω(x) · p(x) = r, for some predicate p and
non-zero probability r. Then ω |= p = Σ_x ω(x) · p(x) = N · r. Hence:

ω|p(x) = ω(x) · p(x) / (ω |= p) = r / (N · r) = 1/N = unif_X(x).

For the final step (3) ⇒ (1), let ω|p = unif_X. This requires that ω |= p ≠ 0
and thus that ω(x) · p(x) = (ω |= p)/N > 0 for each x ∈ X. But then ω(x) > 0.
Definition 7.5.3. A state box s has full support if there is a predicate box p
and a scalar box r, for some r ∈ [0, 1], with an equation as on the left below.
[string diagram equations omitted]
Similarly, we say that a channel box f has full support if there are predicates
p, q with an equation as above on the right.
Uniqueness of f′ in this definition means: for each box g from B, A to C and
each box h from A to B, one has:

(id ⊗ g) ◦· (∆ ⊗ id) ◦· (h ⊗ id) ◦· ∆ = f    =⇒    h = f[1, 0] and g = f′.

[string diagrams omitted]
f′(b, a) ≔ Σ_c ( f(a)(b, c) / f[1, 0](a)(b) ) | c⟩.    (7.17)
= f (a)(b, c).
For uniqueness, assume (id ⊗ g) ◦· (∆ ⊗ id) ◦· (h ⊗ id) ◦· ∆ = f. As already men-
tioned, we can obtain h = f[1, 0] via diagrammatic reasoning. Here we reason
element-wise. By assumption, g(b, a)(c) · h(a)(b) = f(a)(b, c). This gives:

h(a)(b) = 1 · h(a)(b) = (Σ_c g(b, a)(c)) · h(a)(b) = Σ_c g(b, a)(c) · h(a)(b)
        = Σ_c f(a)(b, c)
        = f[1, 0](a)(b).

Hence:

g(b, a)(c) = f(a)(b, c) / h(a)(b) = f(a)(b, c) / f[1, 0](a)(b) = f′(b, a)(c).
Without the full support requirement, disintegrations may still exist, but they
are not unique, see Example 7.6.1 (2) for an illustration.
Next we elaborate on the notation that we will use for disintegration.
Remark 7.5.6. In traditional notation in probability theory one simply omits
variables to express marginalisation. For instance, for a distribution ω ∈ D(X1 ×
X2 × X3 × X4 ), considered as function ω(x1 , x2 , x3 , x4 ) in four variables xi , one
writes:
ω(x2, x3)    for the marginal    Σ_{x1, x4} ω(x1, x2, x3, x4).
How this really works is best illustrated via a concrete example. Consider the
box f on the left below, with five output wires. We elaborate the disintegration
f[1, 0, 0, 0, 1 | 0, 1, 0, 1, 0], as on the right.

[string diagrams omitted: the box f with output wires B1, . . . , B5, and the
disintegration box f[1, 0, 0, 0, 1 | 0, 1, 0, 1, 0] with inputs B2, B4, A and outputs B1, B5]

The disintegration box f[1, 0, 0, 0, 1 | 0, 1, 0, 1, 0] is unique in satisfying:

[string diagram equation omitted]
• The above disintegration f[1, 0, 0, 0, 1 | 0, 1, 0, 1, 0] can be obtained from the
‘simple’, one-wire version of disintegration in Definition 7.5.4 by first suit-
ably rearranging wires and combining them via products. How to do this
precisely is left as an exercise (see below).
[shaded-box diagram omitted]
This ‘shaded box’ notation can also be used for more complicated forms of
disintegration, as described above. This notation is useful to express some basic
properties of disintegration, see Exercise 7.5.4.
Exercises
7.5.1 1 Prove that if a joint state has full support, then each of its marginals
has full support.
2 Consider the joint state ω = 1/2| a, b⟩ + 1/2| a⊥, b⊥⟩ ∈ D(A × B) for
A = {a, a⊥} and B = {b, b⊥}. Check that both marginals ω[1, 0] ∈
D(A) and ω[0, 1] ∈ D(B) have full support, but ω itself does not.
Conclude that the converse of the previous point does not hold.
7.5.2 Consider the box f in Remark 7.5.6. Write down the equation for the
disintegration f[0, 1, 0, 1, 0 | 1, 0, 1, 0, 1]. Formulate also what unique-
ness means.
7.5.3 Show how to obtain the disintegration f[0, 0, 0, 1, 1 | 0, 1, 1, 0, 0] in
Remark 7.5.6 from the formulation in Definition 7.5.4 via rearranging
and combining wires (via ×).
7.5.4 (From [52]) Prove the following ‘sequential’ and ‘parallel’ proper-
ties of disintegration. It is best to give a diagrammatic proof, us-
ing uniqueness of disintegration. But you may also check that these
equations are sound, i.e. hold for all interpretations of the boxes as
prtr(f) ≔ [string diagram (7.18) omitted]

1 Check that:

prtr(f)(b)(c) = Σ_a (1/#A) · f(a, b)(a, c) / Σ_y f(a, b)(a, y),
[string diagrams omitted: prtr((g ⊗ id) ◦· f) and prtr(f ◦· g), expressed via (7.18)]
a ↦ 11/24| a, 0⟩ + 3/8| b, 0⟩ + 1/6| c, 0⟩
b ↦ 2/15| a, 0⟩ + 3/10| a, 1⟩ + 3/10| b, 1⟩ + 4/15| c, 0⟩
c ↦ 1/4| a, 0⟩ + 1/6| a, 1⟩ + 1/4| b, 0⟩ + 1/3| c, 1⟩.

0 ↦ 1/12| 0, 0⟩ + 1/3| 0, 1⟩ + 7/12| 1, 0⟩
1 ↦ 13/40| 0, 0⟩ + 3/8| 1, 0⟩ + 3/10| 1, 1⟩.
4 Now show that the disintegration ((g ⊗ id) ◦· f)[0, 1 | 1, 0] : A × A → 2
is:

(a, a) ↦ 1| 0⟩          (b, a) ↦ 4/13| 0⟩ + 9/13| 1⟩     (c, a) ↦ 3/5| 0⟩ + 2/5| 1⟩
(a, b) ↦ 1| 0⟩          (b, b) ↦ 1| 1⟩                   (c, b) ↦ 1| 0⟩
(a, c) ↦ 1| 0⟩          (b, c) ↦ 1| 0⟩                   (c, c) ↦ 1| 1⟩.

And that (f ◦· g)[0, 1 | 1, 0] : 2 × 2 → 2 is:

(0, 0) ↦ 1/5| 0⟩ + 4/5| 1⟩     (1, 0) ↦ 1| 0⟩
(0, 1) ↦ 1| 0⟩                 (1, 1) ↦ 5/9| 0⟩ + 4/9| 1⟩.

5 Conclude that:

prtr((g ⊗ id) ◦· f) = 1/3| 0⟩ + 2/3| 1⟩,

whereas:

prtr(f ◦· g) = 17/45| 0⟩ + 28/45| 1⟩.
From a joint state ω on A × B we extract a channel c ≔ ω[0, 1 | 1, 0] : A → B
so that:

[string diagram equation (7.19) omitted: the graph of the extracted channel c over the
first marginal of ω gives back ω]
like in Theorem 6.6.5. This is a special case of Equation (7.16), where the
incoming wire is trivial (equal to 1). We should not forget the side-condition
of disintegration, which in this case is: the first marginal ω[1, 0] must have full
support. Then one can define the channel c = dis₁(ω) = ω[0, 1 | 1, 0] : A → B
as disintegration, explicitly, as a special case of (7.17):

c(a) ≔ Σ_{b∈B} ( ω(a, b) / Σ_y ω(a, y) ) | b⟩ = Σ_{b∈B} ( ω(a, b) / ω[1, 0](a) ) | b⟩.    (7.21)
From a joint state ω ∈ D(X × Y) we can extract a channel ω[0, 1 | 1, 0] : X →
Y and also a channel ω[1, 0 | 0, 1] : Y → X in the opposite direction, provided
that ω’s marginals have full support, see again Theorem 6.6.5. The direction of
the channels is thus in a certain sense arbitrary, and does not reflect any form
of causality, see Chapter ??.
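Formula (7.21) is easy to implement. The following Python fragment is only a sketch of our own (none of its names occur in the text), with joint states represented as dictionaries from pairs to probabilities; it assumes the first marginal has full support.

def first_marginal(omega):
    # omega[1, 0](x) = sum_y omega(x, y)
    marg = {}
    for (x, _y), p in omega.items():
        marg[x] = marg.get(x, 0.0) + p
    return marg

def disintegrate(omega):
    # extract the channel c with c(x)(y) = omega(x, y) / omega[1, 0](x)
    marg = first_marginal(omega)
    chan = {x: {} for x in marg}
    for (x, y), p in omega.items():
        chan[x][y] = chan[x].get(y, 0.0) + p / marg[x]
    return chan

# For the state omega = 1/4|a,b> + 1/2|a,b'> + 1/4|a',b'> discussed below, this
# yields c(a) = 1/3|b> + 2/3|b'> and c(a') = 1|b'>.
omega = {("a", "b"): 1/4, ("a", "b'"): 1/2, ("a'", "b'"): 1/4}
c = disintegrate(omega)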
ω = 1/4| a, b⟩ + 1/2| a, b⊥⟩ + 1/4| a⊥, b⊥⟩.

c(a⊥) = (ω(a⊥, b) / σ(a⊥)) | b⟩ + (ω(a⊥, b⊥) / σ(a⊥)) | b⊥⟩
      = (0 / (1/4)) | b⟩ + ((1/4) / (1/4)) | b⊥⟩ = 1| b⊥⟩.

ω = 1/3| a, b⟩ + 2/3| a, b⊥⟩.
(ω ⋈ ρ)[1, 1, 0] = ω    and    (ω ⋈ ρ)[1, 0, 1] = ρ.

We show how such natural joins can be constructed via disintegration of states.
Let's assume we have joint states ω and ρ as described above, with common
marginal written as σ ≔ ω[1, 0] = ρ[1, 0]. We extract channels:

c ≔ ω[0, 1 | 1, 0] : X → Y        d ≔ ρ[0, 1 | 1, 0] : X → Z.
Now we define:
ω ⋈ ρ ≔ [string diagram (7.22) omitted: the state σ is copied and the channels c and d
are applied to the two copies, alongside the X wire]

[string diagram computation omitted, checking that the X, Y marginal of ω ⋈ ρ gives back ω]
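Unfolded elementwise, the join construction reads (ω ⋈ ρ)(x, y, z) = σ(x) · c(x)(y) · d(x)(z) = ω(x, y) · ρ(x, z) / σ(x), assuming σ has full support. The following Python sketch is our own illustration, with invented names:

def natural_join(omega, rho):
    # (omega |><| rho)(x, y, z) = omega(x, y) * rho(x, z) / sigma(x),
    # where sigma = omega[1, 0] = rho[1, 0] is the shared first marginal
    sigma = {}
    for (x, _y), p in omega.items():
        sigma[x] = sigma.get(x, 0.0) + p
    joined = {}
    for (x, y), p in omega.items():
        for (x2, z), q in rho.items():
            if x2 == x:
                joined[(x, y, z)] = joined.get((x, y, z), 0.0) + p * q / sigma[x]
    return joined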
The next result illustrates two bijective correspondences resulting from dis-
integration. These correspondences will be indicated with a double line ,
like in (1.3), expressing that one can translate in two directions, from what
is above the lines to what is below, and vice-versa. The first bijective corre-
spondence is basically a reformulation of disintegration itself (for states). The
second one is new and involves distributions over distributions — sometimes
called hyperdistributions [121, 122].
Furthermore, ρ's second marginal has full support, since for each i ∈ N,

ρ[0, 1](i) = Σ_{x∈X} ρ(x, i) = Σ_{x∈X} Ω(ωi) · ωi(x) = Ω(ωi) > 0.

Since ρ[0, 1](i) > 0 for each i, this Ω's support has N elements.
Starting from Ω with support {ω0, . . . , ω_{N−1}}, we can get ρ as in (7.23)
with extracted channel d satisfying:

d(i)(x) = ρ(x, i) / ρ[0, 1](i) = Ω(ωi) · ωi(x) / Ω(ωi) = ωi(x).

Hence the definition (7.24) yields the original state Ω ∈ D(D(X)):

Σ_{i∈N} ρ[0, 1](i)| d(i)⟩ = Σ_{i∈N} Ω(ωi)| ωi⟩ = Ω.

In the other direction, starting from a joint state ρ ∈ D(X × N) whose second
marginal has full support, we can form Ω as in (7.24) and turn it into a joint
state again via (7.23), whose probability at (x, i) is:

Ω(d(i)) · d(i)(x) = ρ[0, 1](i) · ρ[1, 0 | 0, 1](i)(x)
                  = (⟨id, ρ[1, 0 | 0, 1]⟩ = ρ[0, 1])(x, i)
                  = ρ(x, i).
2 And as a result:

(⟨id, c⟩ = σ)| p⊗q = ⟨id, c|q⟩ = σ| p & (c = q).

2 An (updated) joint state like (⟨id, c⟩ = σ)| p⊗q can always be written as a graph
⟨id, d⟩ = τ. The channel d is obtained by disintegration of the joint state,
and equals c|q by the previous point. The state τ is the first marginal of the
joint state. Hence we are done by characterising this first marginal, in:

((⟨id, c⟩ = σ)| p⊗q)[1, 0]
   = ((⟨id, c⟩ = σ)| (1⊗q) & (p⊗1))[1, 0]        by Exercise 4.3.7
   = (((⟨id, c⟩ = σ)| 1⊗q)| p⊗1)[1, 0]           by Lemma 6.1.6 (3)
   = (((⟨id, c⟩ = σ)| 1⊗q)[1, 0])| p             by Lemma 6.1.6 (6)
   = (σ| c = q)| p                               by Corollary 6.3.13 (2)
   = σ| p & (c = q)                              by Lemma 6.1.6 (3).
[string diagram equation (7.25) omitted: ω| p and ω| p⊥ arise as the values at 1 ∈ 2 and
0 ∈ 2 of the channel extracted from the joint state ⟨p, id⟩ = ω on 2 × A]
Proof. First, let’s write σ B hp, id i = ω ∈ D(2 × A) for the joint state
that is disintegrated in the above string diagrams. The pre-condition for
disintegration is that the marginal σ[1, 0] ∈ D(2) has full support. Explicitly,
this means for b ∈ 2 = {0, 1},

σ[1, 0](b) = Σ_{a∈A} σ(b, a) = Σ_{a∈A} ω(a) · p(a)(b) = { ω |= p    if b = 1
                                                         ω |= p⊥   if b = 0.

The full support requirement that σ[1, 0](b) > 0 for each b ∈ 2 means that both
ω |= p and ω |= p⊥ = 1 − (ω |= p) are non-zero. This holds by assumption.
We elaborate the above string diagram on the left. Let's write c for the ex-
tracted channel. It is, according to (7.21),

c(1) = Σ_{a∈A} ( σ(1, a) / Σ_{a'∈A} σ(1, a') ) | a⟩ = Σ_{a∈A} ( ω(a) · p(a) / (ω |= p) ) | a⟩ = ω| p.

1 for a factor p on X,    (ω| p⊗1)[0, 1] = c = (σ| p);
2 for a factor q on Y,    (ω| 1⊗q)[1, 0] = σ| c = q.
[string diagrams omitted: the Bayesian inversion c†ω of a channel c with respect to a
state ω is defined by disintegration, as the unique box satisfying equation (7.26)]
By construction, c†ω is the unique box giving the above equation. By applying
to the left outgoing line on both sides of the above equation we get the
equation that we saw earlier in Proposition 6.6.4 (1):
ω = c†ω = (c = ω).
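In element form the Bayesian inversion comes down to c†ω(y)(x) = ω(x) · c(x)(y) / (c = ω)(y), assuming c = ω has full support; this is just disintegration of the associated joint state in the other direction (see also Exercise 7.6.10). The following Python sketch is ours, not a prescribed implementation:

def dagger(chan, omega):
    # c^dagger_omega(y)(x) = omega(x) * c(x)(y) / (c = omega)(y)
    pushed = {}                                  # the pushforward (c = omega)(y)
    for x, p in omega.items():
        for y, q in chan(x).items():
            pushed[y] = pushed.get(y, 0.0) + p * q
    inv = {y: {} for y in pushed}
    for x, p in omega.items():
        for y, q in chan(x).items():
            inv[y][x] = inv[y].get(x, 0.0) + p * q / pushed[y]
    return inv

# One can check numerically that pushing c = omega back through the dagger
# returns omega, as in the equation displayed above.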
π₂ ◦· (π₁)†ω,    for the dagger channel    (π₁)†ω : A → A × B,    where π₁ = [string diagram omitted].
We need to prove the equation in (7.19). It can be obtained via purely diagram-
matic reasoning:
[string diagram computation omitted]
We see that the state σ and channel c : X → Y together form a graph state
ω ≔ ⟨c, id⟩ = σ ∈ D(Y × X). This joint state ω is what's really used in
the above two diagrams. Hence we can also reformulate the above two update
diagrams in terms of this joint state ω.

[string diagrams omitted]
Exercises
7.6.1 Check that Proposition 7.6.5 can be reformulated as:
7.6.2 Let σ ∈ D(X) have full support and consider ω B σ ⊗ τ for some
τ ∈ D(Y). Check that the channel X → Y extracted from ω by disin-
tegration is the constant function x 7→ τ. Give a string diagrammatic
account of this situation.
7.6.3 Consider an extracted channel ω[1, 0, 0 | 0, 0, 1] for some state ω.
1 Write down the defining equation for this channel, as a string dia-
gram.
2 Check that ω[1, 0, 0 | 0, 0, 1] is the same as (ω[1, 0, 1])[1, 0 | 0, 1].
7.6.4 Check that a marginalisation ω[M] can also be described as a disintegra-
tion ω[M | 0, . . . , 0] where the number of 0's equals the length of the
mask/list M.
7.6.5 Disintegrate the distribution Flrn(τ) ∈ D({H, L} × {1, 2, 3}) in Subsec-
tion 1.4.1 to a channel {H, L} → {1, 2, 3}.
7.6.6 Consider sets X = {x1, x2}, Y = {y1, y2, y3}, Z = {z1, z2, z3} with distri-
butions ω ∈ D(X × Y) and ρ ∈ D(X × Z) given by:

ω = 1/4| x1, y1⟩ + 1/4| x1, y2⟩ + 1/6| x2, y1⟩ + 1/6| x2, y2⟩ + 1/6| x2, y3⟩
ρ = 1/12| x1, z1⟩ + 1/3| x1, z2⟩ + 1/12| x1, z3⟩ + 1/8| x2, z1⟩ + 1/8| x2, z2⟩ + 1/4| x2, z3⟩.

Check that the natural join ω ⋈ ρ ∈ D(X × Y × Z) is:

1/24| x1, y1, z1⟩ + 1/6| x1, y1, z2⟩ + 1/24| x1, y1, z3⟩
+ 1/24| x1, y2, z1⟩ + 1/6| x1, y2, z2⟩ + 1/24| x1, y2, z3⟩
+ 1/24| x2, y1, z1⟩ + 1/24| x2, y1, z2⟩ + 1/12| x2, y1, z3⟩
+ 1/24| x2, y2, z1⟩ + 1/24| x2, y2, z2⟩ + 1/12| x2, y2, z3⟩
+ 1/24| x2, y3, z1⟩ + 1/24| x2, y3, z2⟩ + 1/12| x2, y3, z3⟩.
7.6.7 Combining the two items of Theorem 7.6.3 gives a bijective corre-
spondence between:
Ω ∈ D(D(X)) with supp(Ω) = N
===============================================
ω ∈ D(X) and c : X → N such that c = ω has full support
Define this correspondence in detail and check that it is bijective.
7.6.8 Prove the following possibilistic analogues of Theorem 7.6.3.
can be obtained both from Theorem 6.3.11 and from Theorem 7.6.4.
7.6.10 Prove that for a state ω ∈ D(X × Y) with full support one has:

ω[1, 0 | 0, 1] = (ω[0, 1 | 1, 0])† : Y → X.
the table in Figure 7.1. We choose obvious abbreviations for the entries in the
table:
S ≔ O × T × H × W × P.
It combines the five columns in Figure 7.1. The table itself can now be consid-
ered as a multiset in M(S ) with 14 elements, each with multiplicity one. We
will turn it immediately into an empirical distribution — formally via frequen-
tist learning. It yields a distribution τ ∈ D(S ), with 14 entries, each with the
same probability, written as:
τ = 1/14| s, h, h, f, n⟩ + 1/14| s, h, h, t, n⟩ + · · · + 1/14| r, m, h, t, n⟩.

[string diagram omitted: the naive Bayes shape, with Play as the single parent of the
four feature columns]
This model oversimplifies the situation, but still it often leads to good (enough)
outcomes.
π ≔ τ[0, 0, 0, 0, 1] = 9/14| y⟩ + 5/14| n⟩.

Next, we extract four channels cO, cT, cH, cW via appropriate disintegrations,
from the Play column to the Outlook / Temperature / Humidity / Windy columns.

cO ≔ τ[1, 0, 0, 0, 0 | 0, 0, 0, 0, 1]        cH ≔ τ[0, 0, 1, 0, 0 | 0, 0, 0, 0, 1]
cT ≔ τ[0, 1, 0, 0, 0 | 0, 0, 0, 0, 1]        cW ≔ τ[0, 0, 0, 1, 0 | 0, 0, 0, 0, 1].
Recall the question that we started from: what is the probability of playing
if the outlook is sunny, the temperature is Cold, the humidity is High and it
is Windy? These features can be translated into an element (s, c, h, t) of the
codomain O × T × H × W of this tuple channel — and thus into a point predicate.
Hence our answer can be obtained by Bayesian inversion of the tuple channel,
as:

c†π(s, c, h, t) = π| c = 1_(s, c, h, t) = 125/611| y⟩ + 486/611| n⟩
                ≈ 0.2046| y⟩ + 0.7954| n⟩,

where the first equation is an instance of (6.6).
This corresponds to the probability 20.5% calculated in [165] — without any
disintegration or Bayesian inversion.
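The Bayesian inversion used here boils down to multiplying the prior with the likelihoods of the observed features and normalising. A minimal Python sketch of ours (all names invented; channels are dictionaries mapping a class to a distribution over feature values):

def classify(prior, channels, features):
    # Naive Bayesian classification as Bayesian inversion of the tuple channel:
    # posterior(y) is proportional to prior(y) * prod_i channels[i][y](features[i])
    weights = {}
    for y, p in prior.items():
        w = p
        for chan, feat in zip(channels, features):
            w *= chan[y].get(feat, 0.0)
        weights[y] = w
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# With the prior pi = 9/14|y> + 5/14|n>, the four channels cO, cT, cH, cW extracted
# from Figure 7.1, and the features (s, c, h, t), this returns approximately
# 0.2046 for y and 0.7954 for n, as computed above.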
The classification that we have just performed works via what is called a
naive Bayesian classifier. In our set-up this classifier is the dagger channel:

c†π : O × T × H × W → P.
This state differs considerably from the original table/state τ. It shows that the
shape (7.29) does not really fit the data that we have in Figure 7.1. But recall
that this approach is called naive. We shall soon look closer into such matters
of shape in Section 7.9.
Remark 7.7.1. Now that we have seen the above illustration we can give a
more abstract recipe of how to obtain a Bayesian classifier. The starting point
is a joint state ω ∈ D(X1 × · · · × Xn × Y) where X1, . . . , Xn are the sets describing
the ‘input features’ and Y is the set of ‘target features’ that are used in clas-
sification: its elements represent the different classes. The recipe involves the
following steps, in which disintegration and Bayesian inversion play a promi-
nent role.
c1 ≔ ω[1, 0, . . . , 0 | 0, . . . , 0, 1] : Y → X1
c2 ≔ ω[0, 1, 0, . . . , 0 | 0, . . . , 0, 1] : Y → X2
. . .
cn ≔ ω[0, . . . , 0, 1, 0 | 0, . . . , 0, 1] : Y → Xn.
[string diagrams omitted: π is the Y-marginal of ω, and each channel ci is extracted
from ω by disintegration]
Figure 7.2 Data about circumstances for reading or skipping of articles, copied
from [140].
construction:
[string diagram omitted: the naive Bayesian classifier as the dagger, with respect to π,
of the tuple of channels c1, . . . , cn]
The twisting at the bottom happens implicitly when we consider the product
X1 × · · · × Xn as a single set and take the dagger wrt. this set.
new thread or was a follow-up, the length of the article, and whether it is read
at home or at work.”
We formalise the table in Figure 7.2 via the following five sets, with ele-
ments corresponding in an obvious way to the entries in the table: k = known,
u = unknown, etc.

A = {k, u}    T = {n, f}    L = {l, s}    W = {h, w}    U = {s, r}.

We interpret the table as a joint distribution ω ∈ D(A × T × L × W × U) with
the same probability for each of the 18 entries in the table:

ω = 1/18| k, n, l, h, s⟩ + 1/18| u, n, s, w, r⟩ + · · · + 1/18| u, n, s, w, r⟩.    (7.31)
So far there is no real difference with the naive Bayesian classification example
in the previous subsection. But the aim now is not classification but deciding:
given a 4-tuple of input features (x1 , x2 , x3 , x4 ) ∈ A × T × L × W we wish to
decide as quickly as possible whether this tuple leads to a read or a skip action.
The way to take such a decision is to use a decision tree as described in
Figure 7.3. It puts Length as dominant feature on top and tells that a long article
is skipped immediately. Indeed, this is what we see in the table in Figure 7.2:
we can quickly take this decision without inspecting the other features. If the
article is short, the next most relevant feature, after Length, is Thread. Again
we can see in Figure 7.2 that a short and new article is read. Finally, we see in
Figure 7.3 that if the article is short and a follow-up, then it is read if the author
is known, and skipped if the author is unknown. Apparently the location of the
user (the Where feature) is irrelevant for the read/skip decision.
We see that such decision trees provide a (visually) clear and efficient method
for reaching a decision about the target feature, starting from input features.
The question that we wish to address here is: how to learn (derive) such a
decision tree (in Figure 7.3) from the table (in Figure 7.2)?
The recipe (algorithm) for learning the tree involves iteratively going through
the following three steps, acting on a joint state σ.
1 Check if we are done, that is, if the U-marginal (for User action) of the
current joint state σ is a point state; if so, we are done and write this point
state 1| x⟩ in a box as a leaf in the tree.
2 If we are not done, determine the ‘dominant’ feature X for σ, and write X as
a circle in the tree.
3 For each element x ∈ X, update (and marginalise) σ with the point predi-
cate 1 x and re-start with step 1.
We shall describe these steps in more detail, starting from the state ω in (7.31),
corresponding to the table in Figure 7.2.
Figure 7.3 Decision tree for reading (r) or skipping (s) derived from the data in
Figure 7.2.
cA ≔ ω[0, 0, 0, 0, 1 | 1, 0, 0, 0, 0] : A → U        ωA ≔ ω[1, 0, 0, 0, 0] ∈ D(A)
cT ≔ ω[0, 0, 0, 0, 1 | 0, 1, 0, 0, 0] : T → U        ωT ≔ ω[0, 1, 0, 0, 0] ∈ D(T)
cL ≔ ω[0, 0, 0, 0, 1 | 0, 0, 1, 0, 0] : L → U        ωL ≔ ω[0, 0, 1, 0, 0] ∈ D(L)
cW ≔ ω[0, 0, 0, 0, 1 | 0, 0, 0, 1, 0] : W → U        ωW ≔ ω[0, 0, 0, 1, 0] ∈ D(W).
Note that these channels go in the opposite direction with respect to the
+ 1/3 · (1/2 · −log(1/2) + 1/2 · −log(1/2))
= 2/3 + 1/3
= 1.
In the same way one computes the other expected entropies as:
ωT |= H ◦ cT = 0.85 ωL |= H ◦ cL = 0.42 ωW |= H ◦ cW = 1.
One then picks the lowest entropy value, which is 0.42, for feature / com-
ponent L. Hence L = Length is the dominant feature at this first stage.
Therefore it is put on top in the decision tree in Figure 7.3.
1.3 The set L has two elements, l for long and s for short. We update the
current state ω with each of these, via suitably weakened point predicates
1l and 1 s , and marginalise out the L component. This gives new states for
which we use the following ad hoc notation.
ω/l ≔ (ω| 1⊗1⊗1_l⊗1⊗1)[1, 1, 0, 1, 1] ∈ D(A × T × W × U)
ω/s ≔ (ω| 1⊗1⊗1_s⊗1⊗1)[1, 1, 0, 1, 1] ∈ D(A × T × W × U).
We now go into a recursive loop and repeat the previous steps for both these
states ω/l and ω/s. Notice that they are 'shorter' than ω: they only
have 4 components instead of 5, because we marginalised out the dominant
component L.
2.1.1 In the l-branch we are now done, since the U-marginal of the l-update
is a point state:

(ω/l)[0, 0, 0, 1] = 1| s⟩.

It means that a long article is skipped immediately. This is indicated via
the l-box as (left) child of the Length node in the decision tree in Figure 7.3.
2.1.2 We continue with the s-branch. The U-marginal of ω/s is not a point
state.
2.2.2 We will now have to determine the dominant feature in ω/s. We com-
pute the three expected entropies for A, T, W as:
(ω/s)[1, 0, 0, 0] |= H ◦ (ω/s)[0, 0, 0, 1 | 1, 0, 0, 0] = 0.44
(ω/s)[0, 1, 0, 0] |= H ◦ (ω/s)[0, 0, 0, 1 | 0, 1, 0, 0] = 0.36
(ω/s)[0, 0, 1, 0] |= H ◦ (ω/s)[0, 0, 0, 1 | 0, 0, 1, 0] = 0.68.

The second value is the lowest, so that T = Thread is now the dominant fea-
ture. It is added as the codomain node of the s-edge out of Length in the decision
tree in Figure 7.3.
2.3.2 The set T has two elements, n for new and f for follow-up. We take
the corresponding updates:
ω/s/n ≔ ((ω/s)| 1⊗1_n⊗1⊗1)[1, 0, 1, 1] ∈ D(A × W × U)
ω/s/f ≔ ((ω/s)| 1⊗1_f⊗1⊗1)[1, 0, 1, 1] ∈ D(A × W × U).

(ω/s/n)[0, 0, 1] = 1| r⟩.
This settles the left branch under the Thread node in Figure 7.3.
3.1.2 The f -branch is not done, since the U-marginal of the state ω/s/ f is
not a point state.
3.2.2 We thus start computing expected entropies again, in order to find out
which of the remaining input features A, W is dominant.

(ω/s/f)[1, 0, 0] |= H ◦ (ω/s/f)[0, 0, 1 | 1, 0, 0] = 0
(ω/s/f)[0, 1, 0] |= H ◦ (ω/s/f)[0, 0, 1 | 0, 1, 0] = 1.
Hence the A feature is dominant, so that the Author node is added to the
f -edge out of Thread in Figure 7.3.
3.3.2 The set A has two elements, k for known and u for unknown. We
form the corresponding two updates of the current state ω/s/f.

ω/s/f/k ≔ ((ω/s/f)| 1_k⊗1⊗1)[0, 1, 1] ∈ D(W × U)
ω/s/f/u ≔ ((ω/s/f)| 1_u⊗1⊗1)[0, 1, 1] ∈ D(W × U).

(ω/s/f/k)[0, 1] = 1| r⟩.
(ω/s/f/u)[0, 1] = 1| s⟩.
This gives the last two boxes, so that the decision tree in Figure 7.3 is
finished.
There are many variations on the above learning algorithm for decision trees.
The one that we just described is sometimes called the 'classification' version,
as in [91], since it works with discrete distributions. There is also a 'regres-
sion' version, for continuous distributions. The key part of the above al-
gorithm is deciding which feature is dominant (in step 2). We have described
the so-called ID3 version from [142], which uses expected entropies (intrinsic
values). Sometimes it is described in terms of 'gains', see [125], where in the
above step 1.2 we can define, for a feature X ∈ {A, T, L, W},

gain(X) ≔ H(ωU) − (ωX |= H ◦ cX).

One then looks for the feature with the highest gain. But one may as well look
for the lowest expected entropy — given by the validity expression after the
minus sign — as we do above. There are alternatives to using gain, such as
what is called 'gini', but that is out of scope.
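The selection of the dominant feature — the heart of the recipe described above — can be sketched in a few lines of code. The Python fragment below is our own illustration (all names invented): joint states are dictionaries from tuples to probabilities, entropies are taken with base-2 logarithms as in the example values above, and full support of the relevant marginals is assumed.

import math

def entropy(dist):
    # Shannon entropy H of a distribution (in bits)
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(omega, idx):
    out = {}
    for xs, p in omega.items():
        out[xs[idx]] = out.get(xs[idx], 0.0) + p
    return out

def conditional(omega, idx, target_idx):
    # the channel from the values of feature idx to distributions on the target
    marg = marginal(omega, idx)
    chan = {v: {} for v in marg}
    for xs, p in omega.items():
        chan[xs[idx]][xs[target_idx]] = chan[xs[idx]].get(xs[target_idx], 0.0) + p / marg[xs[idx]]
    return chan

def expected_entropy(omega, idx, target_idx):
    # omega_X |= H o c_X, the expected entropy used in step 1.2
    marg = marginal(omega, idx)
    chan = conditional(omega, idx, target_idx)
    return sum(marg[v] * entropy(chan[v]) for v in marg)

def dominant_feature(omega, feature_idxs, target_idx):
    # the feature with the lowest expected entropy (equivalently, the highest gain)
    return min(feature_idxs, key=lambda i: expected_entropy(omega, i, target_idx))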
To conclude, running the decision tree learning algorithm on the distribution
associated with the weather and play table from Figure 7.1 — with Play as
target feature — yields the decision tree in Figure 7.4, see also [165, Fig. 4.4].
Exercises
7.7.1 Write σ ∈ M(O × T × H × W × P) for the weather-play table in
Figure 7.1, as multidimensional multiset.
1 Compute the marginal multiset σ[1, 0, 0, 0, 1] ∈ M(O × P).
Figure 7.4 Decision tree for playing (y) or not (n) derived from the data in Fig-
ure 7.1.
yes   2   4   3
no    3   0   2
3 Deduce a channel P → O from this table, and compare it to the
description (7.30) of the channel cO given in Subsection 7.7.1 —
see also Lemma 2.3.2 (and Proposition ??).
4 Do the same for the marginal tables σ[0, 1, 0, 0, 1], σ[0, 0, 1, 0, 1],
σ[0, 0, 0, 1, 1], and the corresponding channels cT, cH, cW in Sub-
section 7.7.1.
7.7.2 Continuing with the weather-play table σ, notice that instead of tak-
ing the dagger of the tuple of channels ⟨cO, cT, cH, cW⟩ : P → O ×
T × H × W in Example 7.7.1 we could have used instead the 'direct'
disintegration:

σ[0, 0, 0, 0, 1 | 1, 1, 1, 1, 0] : O × T × H × W → P
Consider the following table of six words, together with the likeli-
hoods of them being spam or ham.
            spam    ham
review      1/4     1
send        3/4     1/2
us          3/4     1/2
your        3/4     1/2
password    1/2     1/2
account     1/4     0

Let's write S = {s, h} for the probability space for spam and ham, and,
as usual, 2 = {0, 1}.
1 Translate the above table into six channels:
7.7.4 Show in detail that we get point states in items 2.1.1 , 3.1.1 , 4.1.1
and 4.1.2 in Subsection 7.7.2.
7.7.5 In Subsection 7.7.1 we classified the play distribution as 125/611| y⟩ +
486/611| n⟩, given the input features (s, c, h, t). Check what the play de-
cision is for these inputs in the decision tree in Figure 7.4.
7.7.6 As mentioned, the table in Figure 7.2 is copied from [140]. There
the following two questions are asked: what can be said about read-
ing/skipping for the two queries:
Q1 = (unknown, new, long, work)
Q2 = (unknown, follow-up, short, home).
1 Answer Q1 and Q2 via the decision tree in Figure 7.3.
2 Compute the distributions on U for both queries Q1 and Q2 via
naive Bayesian classification.
(The outcome for Q2 obtained via Bayesian classification is 3/5| r⟩ +
2/5| s⟩; it gives reading the highest probability. This does not coincide
with the outcome via the decision tree. Thus, one should be careful about
relying on such classification methods for important decisions.)
It is not hard to see that ◦· is associative and has unit as identity element. In
fact, this has already been proven more generally, in Lemma 1.8.3.
There are two aspects of the category Chan(P) that we wish to illustrate,
namely (1) that it has an ‘inversion’ operation, in the form of a dagger functor,
and (2) that it is isomorphic to the category Rel of sets with relations between
them (as morphisms). Probabilistic analogues of these two points will be de-
scribed later.
X → Y                                        f : X → P(Y)
==========    that is, between functions:    ==========
Y → X                                        g : Y → P(X)
We shall prove the first equation and leave the second one to the interested
reader. The proof is obtained by carefully unpacking the right definition at
each stage. For a function f : X → P(Y) and elements x ∈ X, y ∈ Y,
Proof. We first have to check that the dagger functor is well-defined, i.e. that
the above mapping yields another morphism in Krn. This follows from Propo-
sition 6.6.4 (1):

f†σ = τ = f†σ = (f = σ) = σ.

For preservation of composition we compute, for each z:

(g ◦· f)†σ(z) = σ| (g ◦· f) = 1z = σ| f = (g = 1z)
             = f†σ = (τ| g = 1z)
             = (f†σ ◦· g†τ)(z).

Finally we have (f†σ)†τ = f, by Proposition 6.6.4 (2).
One can also prove this result via equational reasoning with string diagrams.
For instance preservation of composition ◦· by the dagger follows by unique-
ness from:
[string diagram computation omitted]
The essence of the following result is due to [24], but there it occurs in
slightly different form, namely in a setting of continuous probability. Here it is
translated to the discrete situation.
1 This category carries a dagger functor (−)† : Cpl → Cplop which is the
identity on objects; on morphisms it is defined via swapping:
ϕ† ≔ (⟨π2, π1⟩ = ϕ : (Y, τ) → (X, σ)),    for a morphism    ϕ : (X, σ) → (Y, τ).
Let ϕ : (X, σ) → (Y, τ), ψ : (Y, τ) → (Z, ρ), χ : (Z, ρ) → (W, κ) be morphisms in
Cpl. Then:
(χ • (ψ • ϕ))(x, w) = Σ_z (ψ • ϕ)(x, z) · χ(z, w) / ρ(z)
                    = Σ_{y,z} ϕ(x, y) · ψ(y, z) · χ(z, w) / (τ(y) · ρ(z))
                    = Σ_y ϕ(x, y) · (χ • ψ)(y, w) / τ(y)
                    = ((χ • ψ) • ϕ)(x, w).
We turn to the dagger. It is obvious that (−)†† is the identity functor, and also
that (−)† preserves identity maps. It also preserves composition in Cpl since:
(ψ • ϕ)†(z, x) = (ψ • ϕ)(x, z)
              = Σ_y ϕ(x, y) · ψ(y, z) / τ(y)
              = Σ_y ψ†(z, y) · ϕ†(y, x) / τ(y)
              = (ϕ† • ψ†)(z, x).
The graph operation gr on channels from Definition 2.5.6 gives rise to an
identity-on-objects ‘graph’ functor G : Krn → Cpl via:
G(f : (X, σ) → (Y, τ)) ≔ (gr(σ, f) : (X, σ) → (Y, τ)),

where gr(σ, f) = ⟨id, f⟩ = σ. This yields a functor since:

G(id_(X,σ)) = ⟨id, id⟩ = σ = Eq_(X,σ)

G(g ◦· f)(x, z) = gr(σ, g ◦· f)(x, z)
   = σ(x) · (g ◦· f)(x)(z)
   = Σ_y σ(x) · f(x)(y) · g(y)(z)
   = Σ_y σ(x) · f(x)(y) · τ(y) · g(y)(z) / τ(y)
   = Σ_y gr(σ, f)(x, y) · gr(τ, g)(y, z) / τ(y)
   = (gr(τ, g) • gr(σ, f))(x, z)
In the other direction we define a functor F : Cpl → Krn which is the iden-
tity on objects and uses disintegration on morphisms: for ϕ : (X, σ) → (Y, τ) in
Cpl we get a channel F(ϕ) ≔ dis₁(ϕ) = ϕ[0, 1 | 1, 0] : X → Y which satisfies,
by construction (7.20):

ϕ = gr(ϕ[1, 0], dis₁(ϕ)) = gr(σ, F(ϕ)) = GF(ϕ).
We still need to prove that F preserves identities and composition. This fol-
lows by uniqueness of disintegration:
F(ϕ†)(y)(x) = ϕ†(y, x) / ϕ†[1, 0](y)          by (7.21)
            = ϕ(x, y) / ϕ[0, 1](y)
            = ϕ[1, 0 | 0, 1](y)(x)
            = (ϕ[0, 1 | 1, 0])†(y)(x)          by Exercise 7.6.10
            = F(ϕ)†(y)(x).
We have done a lot of work in order to be able to say that Krn and Cpl are
isomorphic dagger categories, or, more informally, that there is a one-one cor-
respondence between probabilistic computations (channels) and probabilistic
relations.
Exercises
7.8.1 Prove in the context of the powerset channels of Example 7.8.1 that:
1 unit † = unit and (g ◦· f )† = f † ◦· g† .
2 G(unit X) = Eq X and G(g ◦· f) = G(g) • G(f).
3 F(Eq X ) = unit X and F(S • R) = F(S ) ◦· F(R).
4 (S • R)† = R† • S † .
5 F(R† ) = F(R)† .
7.8.2 Give a string diagrammatic proof of the property f †† = f in Theo-
rem 7.8.3. Prove also that ( f ⊗ g)† = f † ⊗ g† .
7.8.3 Prove the equation in the definition (7.32) of composition • in the
category Cpl. Give also a string diagrammatic description of •.
ω = 1/4| a, b⟩ + 1/2| a, b⊥⟩ + 1/12| a⊥, b⟩ + 1/6| a⊥, b⊥⟩
ω has shape [diagram omitted], or: ω factorises as [diagram omitted], and write this
as: ω |≈ [diagram omitted].
We shall give a formal definition of |≈ below, but at this stage it suffices to read
σ |≈ S , for a state σ and a string diagram S , as: there is an interpretation of the
boxes in S such that σ = [[ S ]].
In the above case of ω |≈ we obtain ω = ω1 ⊗ ω2, for some state ω1 that
interprets the box on the left, and some ω2 interpreting the box on the right.
But then:

ω[1, 0] = (ω1 ⊗ ω2)[1, 0] = ω1.
This state has this form in multiple ways, for instance as:

c1 = σ1 = 14/25| H⟩ + 11/25| T⟩ = c2 = σ2

for:

σ1 = 1/5| 1⟩ + 4/5| 0⟩                  σ2 = 2/5| 1⟩ + 3/5| 0⟩
c1(1) = 4/5| H⟩ + 1/5| T⟩    and    c2(1) = 1/2| H⟩ + 1/2| T⟩
c1(0) = 1/2| H⟩ + 1/2| T⟩           c2(0) = 3/5| H⟩ + 2/5| T⟩.
We note that this shape is not an accessible string diagram: the wire in between the two
boxes cannot be accessed from the outside. If these wires are accessible, then
we can access the individual boxes of a string diagram and use disintegration
to compute them. We illustrate how this works.
ω = 1/25| a, c, d, b⟩ + 9/50| a, c, d, b⊥⟩ + 3/50| a, c, d⊥, b⟩ + 1/50| a, c, d⊥, b⊥⟩
  + 1/25| a, c⊥, d, b⟩ + 9/50| a, c⊥, d, b⊥⟩ + 3/50| a, c⊥, d⊥, b⟩ + 1/50| a, c⊥, d⊥, b⊥⟩
  + 3/125| a⊥, c, d, b⟩ + 9/500| a⊥, c, d, b⊥⟩ + 9/250| a⊥, c, d⊥, b⟩ + 1/500| a⊥, c, d⊥, b⊥⟩
  + 12/125| a⊥, c⊥, d, b⟩ + 9/125| a⊥, c⊥, d, b⊥⟩ + 18/125| a⊥, c⊥, d⊥, b⟩ + 1/125| a⊥, c⊥, d⊥, b⊥⟩
[string diagram omitted: the shape S, in which the joint state σ is copied componentwise
and the channels f and g are applied to the two copies]
σ = ω[1, 0, 0, 1]
  = (1/25 + 3/50 + 1/25 + 3/50)| a, b⟩ + (9/50 + 1/50 + 9/50 + 1/50)| a, b⊥⟩
    + (3/125 + 9/250 + 12/125 + 18/125)| a⊥, b⟩ + (9/500 + 1/500 + 9/125 + 1/125)| a⊥, b⊥⟩
  = 1/5| a, b⟩ + 2/5| a, b⊥⟩ + 3/10| a⊥, b⟩ + 1/10| a⊥, b⊥⟩.
[string diagram equations omitted: marginalising the shape S leaves a diagram involving
only the box f]
The string diagram on the right tells us that we can obtain f via disintegration
from the marginal ω[1, 1, 0, 0], using that extracted channels are unique, in
diagrams of this form, see (7.19). In the same way one obtains g from the
marginal ω[0, 0, 1, 1]. Thus:
f = ω[0, 1, 0, 0 | 1, 0, 0, 0]

  =  a  ↦ (Σ_{v,y} ω(a, c, v, y) / Σ_{u,v,y} ω(a, u, v, y)) | c⟩ + (Σ_{v,y} ω(a, c⊥, v, y) / Σ_{u,v,y} ω(a, u, v, y)) | c⊥⟩
     a⊥ ↦ (Σ_{v,y} ω(a⊥, c, v, y) / Σ_{u,v,y} ω(a⊥, u, v, y)) | c⟩ + (Σ_{v,y} ω(a⊥, c⊥, v, y) / Σ_{u,v,y} ω(a⊥, u, v, y)) | c⊥⟩

  =  a  ↦ 1/2| c⟩ + 1/2| c⊥⟩
     a⊥ ↦ 1/5| c⟩ + 4/5| c⊥⟩
g = ω[0, 0, 1, 0 | 0, 0, 0, 1]

  =  b  ↦ (Σ_{x,u} ω(x, u, d, b) / Σ_{x,u,v} ω(x, u, v, b)) | d⟩ + (Σ_{x,u} ω(x, u, d⊥, b) / Σ_{x,u,v} ω(x, u, v, b)) | d⊥⟩
     b⊥ ↦ (Σ_{x,u} ω(x, u, d, b⊥) / Σ_{x,u,v} ω(x, u, v, b⊥)) | d⟩ + (Σ_{x,u} ω(x, u, d⊥, b⊥) / Σ_{x,u,v} ω(x, u, v, b⊥)) | d⊥⟩

  =  b  ↦ 2/5| d⟩ + 3/5| d⊥⟩
     b⊥ ↦ 9/10| d⟩ + 1/10| d⊥⟩.
At this stage one can check that the joint state ω can be reconstructed from
these extracted state and channels, namely as:
[[ S ]] = (id ⊗ f ⊗ g ⊗ id ) = (∆ ⊗ ∆) = σ = ω.
This proves ω |≈ S .
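Elementwise the reconstruction reads [[S]](x, u, v, y) = σ(x, y) · f(x)(u) · g(y)(v), which can be checked mechanically. The Python snippet below is our own sketch of this check, with a⊥, b⊥, c⊥, d⊥ rendered as "a_", "b_", "c_", "d_":

def interpret_S(sigma, f, g):
    # [[S]](x, u, v, y) = sigma(x, y) * f(x)(u) * g(y)(v)
    out = {}
    for (x, y), p in sigma.items():
        for u, pu in f[x].items():
            for v, pv in g[y].items():
                out[(x, u, v, y)] = p * pu * pv
    return out

sigma = {("a", "b"): 1/5, ("a", "b_"): 2/5, ("a_", "b"): 3/10, ("a_", "b_"): 1/10}
f = {"a": {"c": 1/2, "c_": 1/2}, "a_": {"c": 1/5, "c_": 4/5}}
g = {"b": {"d": 2/5, "d_": 3/5}, "b_": {"d": 9/10, "d_": 1/10}}
omega = interpret_S(sigma, f, g)   # reproduces the 16 probabilities of omega above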
We now come to the definition of |≈. We will use it for channels, and not just
for states, as used above.
c |≈ S

1 ω |≈ [fork string diagram omitted] ⟺ ω |≈ [chain string diagram omitted] ⟺ ω |≈ [chain string diagram omitted];
2 ω |≈ [fork string diagram omitted] ⟺ (ω| 1⊗1_b⊗1)[1, 0, 1] |≈ [product string diagram omitted] for all b ∈ B.

The string diagrams on the left in item (1) are often called a fork; the other
two are called a chain. In item (2) this shape is related to non-entwinedness,
pointwise.
Proof. 1 Since ω has full support, so do all its marginals, see Exercise 7.5.1.
This allows us to perform all disintegrations below. We start on the left-hand
side, and assume an interpretation ω = ⟨c, id, d⟩ = τ, consisting of a state
τ = ω[0, 1, 0] ∈ D(B) and channels c : B → A and d : B → C. We write
σ = c = τ = ω[1, 0, 0] and take the Bayesian inversion c†τ : A → B. We now
have an interpretation of the string diagram in the middle, which is equal to
ω, since by (7.26):
[string diagram computation omitted]

Similarly one obtains an interpretation of the string diagram on the right via
ρ = d = τ = ω[0, 0, 1] and the inversion d†τ : C → B.
f(b) ≔ σ_b        g(b) ≔ τ_b        ρ ≔ ω[0, 1, 0].

(ω| 1⊗1_b⊗1)[1, 0, 1](x, z) = Σ_y ω(x, y, z) · 1_b(y) / (ω |= 1 ⊗ 1_b ⊗ 1)
                            = ω(x, b, z) / (ω[0, 1, 0] |= 1_b)
                            = ω(x, b, z) / ρ(b).
But now we see that ω has a fork shape:
What the first item of this result shows is that (sub)shapes of the form:
Proposition 7.9.4. Let c be a channel with full support, of the same type as
an accessible string diagram S ∈ SD(Σ) that does not contain . Then there
is a unique interpretation of Σ that can be obtained from c. In this way we can
factorise c according to S .
In the special case that c |≈ S holds, the factorisation interpretation of S
obtained in this way from c is the same one that gives [[ S ]] = c because of
c |≈ S and uniqueness of disintegrations.
S = [string diagram with boxes g and h omitted]    so that    S′ ≔ [string diagram omitted].
This result gives a way of testing whether a channel c has shape S : fac-
torise c according to S and check if the resulting interpretation [[ S ]] equals c.
Unfortunately, this is computationally rather expensive.
(ω| 1⊗1_a⊗1) |≈ [string diagram omitted].

ω |≈ [string diagram omitted]    then    ω| 1⊗q⊗1 |≈ [string diagram omitted]

ω| 1⊗q⊗1 = (⟨c, id, d⟩ = σ)| 1⊗q⊗1
= (c ⊗ id ⊗ d) = (1| a⟩ ⊗ 1| a⟩ ⊗ 1| a⟩)

ω| 1⊗q⊗1 = ((id ⊗ c ⊗ id) = (∆ ⊗ ∆) = (σ ⊗ τ))| 1⊗q⊗1                           by Theorem 6.3.11
         = (id ⊗ c|q ⊗ id) = (∆ ⊗ ∆) = ((σ ⊗ τ)| (∆⊗∆) = (1⊗(c = q)⊗1))         by Exercise 6.1.10.

The updated state (σ ⊗ τ)| (∆⊗∆) = (1⊗(c = q)⊗1) is typically entwined, even if q is
a point predicate, see also Exercise 6.1.8.
The fact that conditioning with a point predicate destroys the shape is an
important phenomenon since it allows us to break entwinedness / correlations.
This is relevant in statistical analysis, esp. w.r.t. causality [136, 137], see the
causal surgery procedure in Section ??. In such a context, conditioning on a
point predicate is often expressed in terms of ‘controlling for’. For instance, if
there is a gender component G = {m, f } with elements m for male and f for
female, then conditioning with a point predicate 1_m or 1_f, suitably weakened
via tensoring with truth 1, amounts to controlling for gender. Via restriction to
one gender value one fragments the shape and thus controls the influence of
gender in the situation at hand.
Exercises
7.9.1 Check the aim of Exercise 2.5.2 really is to prove the shape statement
ω |≈
ω |≈ [string diagram omitted]    =⇒    ω[1, 0, 1] |≈ [string diagram omitted]

Decomposition:    ω |≈ [string diagram omitted]    implies    ω[0, 1, 1, 1] |≈ [string diagram omitted]

2 ω |≈ [string diagram omitted].
extend the factor p to a factor on the whole product space, by weakening it to:
πi = p = 1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1.
After the update with this factor we take the marginal at j in the form of:

π_j = (ω| π_i = p) = (ω| π_i = p)[0, . . . , 0, 1, 0, . . . , 0].

π_j = (ω| π_i = p)        (7.34)
for a joint state ω with an evidence factor p on its i-th component and with the
j-th marginal as conclusion.
In words: an inference query is a marginalisation of a joint state conditioned
with a weakened factor. What is called inference is the activity of calculating
the outcome of an inference query.
By suitably swapping components and using products one can reduce such a
query to one of the form (7.34). Similarly one may marginalise on several com-
ponents at the same time and again reduce this to the canonical form (7.34).
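On joint states the canonical query (7.34) is straightforward to compute, if expensive. The Python sketch below is our own illustration (all names invented): it weakens the factor to the i-th component, updates the joint state, and marginalises at the j-th component.

def infer(omega, i, p, j):
    # computes pi_j = omega|_{pi_i = p}: omega is a dict from tuples to
    # probabilities, p is a factor on the i-th component, j is the marginal index
    weights = {xs: prob * p(xs[i]) for xs, prob in omega.items()}
    total = sum(weights.values())            # assumed non-zero
    marg = {}
    for xs, w in weights.items():
        marg[xs[j]] = marg.get(xs[j], 0.0) + w / total
    return marg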
The notion of inference that we use is formulated in terms of joint states.
This gives mathematical clarity, but not a practical method to actually compute
queries. If the joint state has the shape of a Bayesian network, we can use this
network structure to guide the computations. This is formalised in the next
result: it describes quite generally what we have been illustrating many times
already, namely that inference can be done along channels — both forward and
Theorem 7.10.2. Let joint state ω have the shape S of a Bayesian network:
ω |≈ S . An inference query π j = ω|πi = p can then be computed via forward
and backward inference along channels in the string diagram S .
and where S′ is the string diagram obtained from S by removing the single
box whose interpretation we have written as channel c, say of type Y → Xk.
The induction hypothesis applies to ρ and S′.
Consider the situation below wrt. the underlying product space X1 × · · · × Xn
of ω,
[diagram omitted: the evidence factor p is attached to component Xi, the marginal of
interest is component Xj, and the channel c produces component Xk]
The positive xray evidence translates into a point predicate 1_x on the set X =
{x, x⊥} of xray outcomes. It has to be weakened to π3 = 1_x = 1 ⊗ 1 ⊗ 1_x
on the product space S × D × X. We are interested in the updated smoking
distribution, which can be obtained as first marginal. Thus we compute the
following inference query, where the number k in =^k refers to the k-th of the
Figure 7.5 Two ‘stretchings’ of the Asia Bayesian network from Section 6.5, with
the same semantics.
It is an exercise below to compute the same inference query for the stretching
on the right in Figure 7.5. The outcome is necessarily the same, by Theo-
rem 7.10.2, since there it is shown that all such inference computations are
equal to the computation on the joint state.
The inference calculation in Example 7.10.3 is quite mechanical in nature
and can thus be implemented easily, giving a channel-based inference algo-
rithm, see [75] for Bayesian networks. The algorithm consists of two parts.
1 It first finds a stretching of the Bayesian network with a minimal width,
that is, a description of the network as a sequence of channels such that
the state space in between the channels has minimal size. As can be seen
in Figure 7.5, there can be quite a bit of freedom in choosing the order of
channels.
2 Perform the calculation as in Example 7.10.3, following the steps in the
proof of Theorem 7.10.2.
The resulting algorithm’s performance compares favourably to the performance
of the pgmpy Python library for Bayesian networks [5]. By design, it uses
(fuzzy) factors and not (sharp) events and thus solves the “soft evidential up-
date” problem [160].
Exercises
7.10.1 In Example 7.10.3 we have identified a state on a space X with a chan-
nel 1 → X, as we have done before, e.g. in Exercise 1.8.2. Consider
states σ ∈ D(X) and τ ∈ D(Y) with a factor p ∈ Fact(X × Y). Prove
that, under this identification,
σ| (id⊗τ) = p = ((σ ⊗ τ)| p)[1, 0].
7.10.2 Compute the inference query from Example 7.10.3 for the stretching
of the Asia Bayesian network on the right in Figure 7.5.
7.10.3 From Theorem 7.10.2 one can conclude that for the channel-based
calculation of inference queries the particular stretching of a Bayesian
network does not matter — since they are all equal to inference on
the joint state. Implicitly, there is a ‘shift’ result about forward and
backward inference.
Let c : A → X and d : B → Y be channels, together with a joint
state ω ∈ D(A × B) and a factor q : Y → R≥0 on Y. Then the following
marginal distributions are the same:

(((c ⊗ id) = ω)| (id⊗d) = (1⊗q))[1, 0] = ((c ⊗ id) = (((id ⊗ d) = ω)| 1⊗q))[1, 0].
References
[34] S. Dash and S. Staton. A monad for probabilistic point processes. In D. Spi-
vak and J. Vicary, editors, Applied Category Theory Conference, Elect. Proc. in
Theor. Comp. Sci., 2020.
[35] S. Dash and S. Staton. Monads for measurable queries in probabilistic databases.
In A. Sokolova, editor, Math. Found. of Programming Semantics, 2021.
[36] E. Davies and J. Lewis. An operational approach to quantum probability. Com-
munic. Math. Physics, 17:239–260, 1970.
[37] B. de Finetti. Funzione caratteristica di un fenomeno aleatorio. Memorie della
R. Accademia Nazionale dei Lincei, IV, fasc. 5:86–113, 1930. Available at www.
brunodefinetti.it/Opere/funzioneCaratteristica.pdf.
[38] B. de Finetti. Theory of Probability: A critical introductory treatment. Wiley,
2017.
[39] P. Diaconis and S. Zabell. Updating subjective probability. Journ. American
Statistical Assoc., 77:822–830, 1982.
[40] P. Diaconis and S. Zabell. Some alternatives to Bayes’ rule. Technical Report
339, Stanford Univ., Dept. of Statistics, 1983.
[41] F. Dietrich, C. List, and R. Bradley. Belief revision generalized: A joint charac-
terization of Bayes’ and Jeffrey’s rules. Journ. of Economic Theory, 162:352–
371, 2016.
[42] E. Dijkstra. A Discipline of Programming. Prentice Hall, Englewood Cliffs, NJ,
1976.
[43] E. Dijkstra and C. Scholten. Predicate Calculus and Program Semantics.
Springer, Berlin, 1990.
[44] A. Dvurečenskij and S. Pulmannová. New Trends in Quantum Structures.
Kluwer Acad. Publ., Dordrecht, 2000.
[45] M. Erwig and E. Walkingshaw. A DSL for explaining probabilistic reasoning.
In W. Taha, editor, Domain-Specific Languages, number 5658 in Lect. Notes
Comp. Sci., pages 335–359. Springer, Berlin, 2009.
[46] W. Ewens. The sampling theory of selectively neutral alleles. Theoret. Popula-
tion Biology, 3:87–112, 1972.
[47] W. Feller. An Introduction to Probability Theory and Its applications, volume I.
Wiley, 3rd rev. edition, 1970.
[48] B. Fong. Causal theories: A categorical perspective on Bayesian networks. Mas-
ter’s thesis, Univ. of Oxford, 2012. see arxiv.org/abs/1301.6201.
[49] D. J. Foulis and M.K. Bennett. Effect algebras and unsharp quantum logics.
Found. Physics, 24(10):1331–1352, 1994.
[50] S. Friedland and S. Karlin. Some inequalities for the spectral radius of non-
negative matrices and applications. Duke Math. Journ., 42(3):459–490, 1975.
[51] K. Friston. The free-energy principle: a unified brain theory? Nature Reviews
Neuroscience, 11(2):127–138, 2010.
[52] T. Fritz. A synthetic approach to Markov kernels, conditional independence, and
theorems on sufficient statistics. Advances in Math., 370:107239, 2020.
[53] T. Fritz and E. Rischel. Infinite products and zero-one laws in categorical prob-
ability. Compositionality, 2(3), 2020.
[54] S. Fujii, S. Katsumata, and P. Melliès. Towards a formal theory of graded mon-
ads. In B. Jacobs and C. Löding, editors, Foundations of Software Science and
Computation Structures, number 9634 in Lect. Notes Comp. Sci., pages 513–
530. Springer, Berlin, 2016.
[55] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks.
Networks, 20:507–534, 1990.
[56] A. Gibbs and F. Su. On choosing and bounding probability metrics. Int. Statis-
tical Review, 70(3):419–435, 2002.
[57] G. Gigerenzer and U. Hoffrage. How to improve Bayesian reasoning without
instruction: Frequency formats. Psychological Review, 102(4):684–704, 1995.
[58] M. Giry. A categorical approach to probability theory. In B. Banaschewski,
editor, Categorical Aspects of Topology and Analysis, number 915 in Lect. Notes
Math., pages 68–85. Springer, Berlin, 1982.
[59] A. Gordon, T. Henzinger, A. Nori, and S. Rajamani. Probabilistic programming.
In Int. Conf. on Software Engineering, 2014.
[60] T. Griffiths, C. Kemp, and J. Tenenbaum. Bayesian models of cognition. In
R. Sun, editor, Cambridge Handbook of Computational Cognitive Modeling,
pages 59–100. Cambridge Univ. Press, 2008.
[61] J. Halpern. Reasoning about Uncertainty. MIT Press, Cambridge, MA, 2003.
[62] M. Hayhoe, F. Alajaji, and B. Gharesifard. A pólya urn-based model for epi-
demics on networks. In American Control Conference, pages 358–363, 2017.
[63] W. Hino, H. Kobayashi, I. Hasuo, and B. Jacobs. Healthiness from duality. In
Logic in Computer Science. IEEE, Computer Science Press, 2016.
[64] J. Hohwy. The Predictive Mind. Oxford Univ. Press, 2013.
[65] F. Hoppe. Pólya-like urns and the Ewens’ sampling formula. Journ. Math.
Biology, 20:91–94, 1984.
[66] M. Hyland and J. Power. The category theoretic understanding of universal alge-
bra: Lawvere theories and monads. In L. Cardelli, M. Fiore, and G. Winskel, ed-
itors, Computation, Meaning, and Logic: Articles dedicated to Gordon Plotkin,
number 172 in Elect. Notes in Theor. Comp. Sci., pages 437–458. Elsevier, Am-
sterdam, 2007.
[67] B. Jacobs. Convexity, duality, and effects. In C. Calude and V. Sassone, editors,
IFIP Theoretical Computer Science 2010, number 82(1) in IFIP Adv. in Inf. and
Comm. Techn., pages 1–19. Springer, Boston, 2010.
[68] B. Jacobs. New directions in categorical logic, for classical, probabilistic and
quantum logic. Logical Methods in Comp. Sci., 11(3), 2015.
[69] B. Jacobs. Affine monads and side-effect-freeness. In I. Hasuo, editor, Coalge-
braic Methods in Computer Science (CMCS 2016), number 9608 in Lect. Notes
Comp. Sci., pages 53–72. Springer, Berlin, 2016.
[70] B. Jacobs. Introduction to Coalgebra. Towards Mathematics of States and Ob-
servations. Number 59 in Tracts in Theor. Comp. Sci. Cambridge Univ. Press,
2016.
[71] B. Jacobs. Introduction to coalgebra. Towards mathematics of states and obser-
vations. Cambridge Univ. Press, to appear, 2016.
[72] B. Jacobs. Hyper normalisation and conditioning for discrete probability dis-
tributions. Logical Methods in Comp. Sci., 13(3:17), 2017. See https:
//lmcs.episciences.org/3885.
[90] R. Jeffrey. The Logic of Decision. The Univ. of Chicago Press, 2nd rev. edition,
1983.
[91] F. Jensen and T. Nielsen. Bayesian Networks and Decision Graphs. Statistics
for Engineering and Information Science. Springer, 2nd rev. edition, 2007.
[92] P. Johnstone. Stone Spaces. Number 3 in Cambridge Studies in Advanced Math-
ematics. Cambridge Univ. Press, 1982.
[93] C. Jones. Probabilistic Non-determinism. PhD thesis, Edinburgh Univ., 1989.
[94] P. Joyce. Partition structures and sufficient statistics. Journ. of Applied Proba-
bility, 35(3):622–632, 1998.
[95] A. Jung and R. Tix. The troublesome probabilistic powerdomain. In A. Edalat,
A. Jung, K. Keimel, and M. Kwiatkowska, editors, Comprox III, Third Workshop
on Computation and Approximation, number 13 in Elect. Notes in Theor. Comp.
Sci., pages 70–91. Elsevier, Amsterdam, 1998.
[96] D. Jurafsky and J. Martin. Speech and language processing. Third Edition draft,
available at https://web.stanford.edu/~jurafsky/slp3/, 2018.
[97] E. Kamenica and M. Gentzkow. Bayesian persuasion. American Economic Re-
view, 101(6):2590–2615, 2011.
[98] S. Karlin. Mathematical Methods and Theory in Games, Programming, and
Economics. Vol I: Matrix Games, Programming, and Mathematical Economics.
Addison-Wesley, 1959.
[99] K. Keimel and G. Plotkin. Mixed powerdomains for probability and nondeter-
minism. Logical Methods in Comp. Sci., 13(1), 2017. See https://lmcs.
episciences.org/2665.
[100] J. Kingman. Random partitions in population genetics. Proc. Royal Soc., Series
A, 361:1–20, 1978.
[101] J. Kingman. The representation of partition structures. Journ. London Math.
Soc., 18(2):374–380, 1978.
[102] A. Kock. Closed categories generated by commutative monads. Journ. Austr.
Math. Soc., XII:405–424, 1971.
[103] A. Kock. Monads for which structures are adjoint to units. Journ. of Pure &
Appl. Algebra, 104:41–59, 1995.
[104] A. Kock. Commutative monads as a theory of distributions. Theory and Appl.
of Categories, 26(4):97–131, 2012.
[105] D. Koller and N. Friedman. Probabilistic Graphical Models. Principles and
Techniques. MIT Press, Cambridge, MA, 2009.
[106] D. Kozen. Semantics of probabilistic programs. Journ. Comp. Syst. Sci,
22(3):328–350, 1981.
[107] D. Kozen. A probabilistic PDL. Journ. Comp. Syst. Sci, 30(2):162–178, 1985.
[108] P. Lau, T. Koo, and C. Wu. Spatial distribution of tourism activities: A Pólya urn
process model of rank-size distribution. Journ. of Travel Research, 59(2):231–
246, 2020.
[109] S. Lauritzen. Graphical models. Oxford Univ. Press, Oxford, 1996.
[110] S. Lauritzen and D. Spiegelhalter. Local computations with probabilities on
graphical structures and their application to expert systems. Journ. Royal Statis-
tical Soc., 50(2):157–224, 1988.
[111] F. Lawvere. The category of probabilistic mappings. Unpublished manuscript,
see ncatlab.org/nlab/files/lawvereprobability1962.pdf, 1962.
[112] P. Lax. Linear Algebra and Its Applications. John Wiley & Sons, 2nd edition,
2007.
[113] T. Leinster. Basic Category Theory. Cambridge Studies in Advanced Mathe-
matics. Cambridge Univ. Press, 2014. Available online via arxiv.org/abs/
1612.09375.
[114] L. Libkin and L. Wong. Some properties of query languages for bags. In
C. Beeri, A. Ohori, and D. Shasha, editors, Database Programming Languages,
Workshops in Computing, pages 97–114. Springer, Berlin, 1993.
[115] H. Mahmoud. Pólya Urn Models. Chapman and Hall, 2008.
[116] E. Manes. Algebraic Theories. Springer, Berlin, 1974.
[117] R. Mardare, P. Panangaden, and G. Plotkin. Quantitative algebraic reasoning. In
Logic in Computer Science. IEEE, Computer Science Press, 2016.
[118] S. Mac Lane. Categories for the Working Mathematician. Springer, Berlin, 1971.
[119] S. Mac Lane. Mathematics: Form and Function. Springer, Berlin, 1986.
[120] A. McIver and C. Morgan. Abstraction, refinement and proof for probabilistic
systems. Monographs in Comp. Sci. Springer, 2004.
[121] A. McIver, C. Morgan, and T. Rabehaja. Abstract hidden Markov models: A
monadic account of quantitative information flow. In Logic in Computer Science,
pages 597–608. IEEE, Computer Science Press, 2015.
[122] A. McIver, C. Morgan, G. Smith, B. Espinoza, and L. Meinicke. Abstract chan-
nels and their robust information-leakage ordering. In M. Abadi and S. Kremer,
editors, Princ. of Security and Trust, number 8414 in Lect. Notes Comp. Sci.,
pages 83–102. Springer, Berlin, 2014.
[123] S. Milius, D. Pattinson, and L. Schröder. Generic trace semantics and graded
monads. In L. Moss and P. Sobocinski, editors, Conference on Algebra and
Coalgebra in Computer Science (CALCO 2015), volume 35 of LIPIcs, pages
253–269. Schloss Dagstuhl, 2015.
[124] H. Minc. Nonnegative Matrices. John Wiley & Sons, 1998.
[125] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[126] E. Moggi. Notions of computation and monads. Information & Computation,
93(1):55–92, 1991.
[127] R. Nagel. Order unit and base norm spaces. In A. Hartkämper and H. Neu-
mann, editors, Foundations of Quantum Mechanics and Ordered Linear Spaces,
number 29 in Lect. Notes Physics, pages 23–29. Springer, Berlin, 1974.
[128] M. Nielsen and I. Chuang. Quantum Computation and Quantum Information.
Cambridge Univ. Press, 2000.
[129] F. Olmedo, F. Gretz, B. Lucien Kaminski, J-P. Katoen, and A. McIver. Con-
ditioning in probabilistic programming. ACM Trans. on Prog. Lang. & Syst.,
40(1):4:1–4:50, 2018.
[130] M. Ozawa. Quantum measuring processes of continuous observables. Journ.
Math. Physics, 25:79–87, 1984.
[131] P. Panangaden. The category of Markov kernels. In C. Baier, M. Huth,
M. Kwiatkowska, and M. Ryan, editors, Workshop on Probabilistic Methods in
Verification, number 21 in Elect. Notes in Theor. Comp. Sci., pages 171–187.
Elsevier, Amsterdam, 1998.
[132] V. Paulsen and M. Tomforde. Vector spaces with an order unit. Indiana Univ.
Math. Journ., 58-3:1319–1359, 2009.