
Structured Probabilistic Reasoning

(Incomplete draft)

Bart Jacobs
Institute for Computing and Information Sciences, Radboud University Nijmegen,
P.O. Box 9010, 6500 GL Nijmegen, The Netherlands.
[email protected]    http://www.cs.ru.nl/~bart

Version of: November 30, 2021


Contents

Preface page v
1 Collections 1
1.1 Cartesian products 2
1.2 Lists 6
1.3 Subsets 16
1.4 Multisets 25
1.5 Multisets in summations 35
1.6 Binomial coefficients of multisets 41
1.7 Multichoose coefficients 48
1.8 Channels 56
1.9 The role of category theory 61
2 Discrete probability distributions 69
2.1 Probability distributions 70
2.2 Probabilistic channels 81
2.3 Frequentist learning: from multisets to distributions 92
2.4 Parallel products 98
2.5 Projecting and copying 109
2.6 A Bayesian network example 116
2.7 Divergence between distributions 122
3 Drawing from an urn 127
3.1 Accumulation and arrangement, revisited 132
3.2 Zipping multisets 136
3.3 The multinomial channel 144
3.4 The hypergeometric and Pólya channels 156
3.5 Iterated drawing from an urn 168
3.6 The parallel multinomial law: four definitions 181
3.7 The parallel multinomial law: basic properties 189


3.8 Parallel multinomials as law of monads 200
3.9 Ewens distributions 207
4 Observables and validity 221
4.1 Validity 222
4.2 The structure of observables 234
4.3 Transformation of observables 249
4.4 Validity and drawing 256
4.5 Validity-based distances 264
5 Variance and covariance 275
5.1 Variance and shared-state covariance 276
5.2 Draw distributions and their (co)variances 284
5.3 Joint-state covariance and correlation 293
5.4 Independence for random variables 299
5.5 The law of large numbers, in weak form 304
6 Updating distributions 311
6.1 Update basics 312
6.2 Updating draw distributions 322
6.3 Forward and backward inference 328
6.4 Discretisation, and coin bias learning 346
6.5 Inference in Bayesian networks 354
6.6 Bayesian inversion: the dagger of a channel 358
6.7 Pearl’s and Jeffrey’s update rules 366
6.8 Frequentist and Bayesian discrete probability 383
7 Directed Graphical Models 390
7.1 String diagrams 392
7.2 Equations for string diagrams 399
7.3 Accessibility and joint states 404
7.4 Hidden Markov models 408
7.5 Disintegration 419
7.6 Disintegration for states 428
7.7 Disintegration and inversion in machine learning 438
7.8 Categorical aspects of Bayesian inversion 450
7.9 Factorisation of joint states 457
7.10 Inference in Bayesian networks, reconsidered 466
References 473

Preface

No victor believes in chance.


Friedrich Nietzsche,
Die fröhliche Wissenschaft, §258, 1882.
Originally in German:
Kein Sieger glaubt an den Zufall.

Probability is for losers — a defiant rephrasing of the above aphorism of the
German philosopher Friedrich Nietzsche. According to him, winners do not
reason with probabilities, but with certainties, via Boolean logic, one could say.
However, this goes against the current trend, in which reasoning with proba-
bilities has become the norm, in large scale data analytics and in artificial
intelligence (AI); it is Boolean reasoning that is now losing influence, see also
the famous End of Theory article [4] from 2008. This book is about the mathe-
matical structures underlying the reasoning, not of Nietzsche’s winners, but of
today’s apparent winners.
The phrase ‘structure in probability’ in the title of this book may sound
like a contradictio in terminis: it seems that probability is about randomness,
like in the tossing of coins, in which one may not expect to find much struc-
ture. Still, as we know since the seventeenth century, via the pioneering work
of Christiaan Huygens, Pierre Fermat, and Blaise Pascal, there is quite some
mathematical structure in the area of probability. The raison d’être of this book
is that there is more structure — especially algebraic and categorical — than
is commonly emphasised.
The scientific roots of this book’s author lie outside probability theory, in
type theory and logic (including some quantum logic), in semantics and speci-
fication of programming languages, in computer security and privacy, in state-
based computation (coalgebra), and in category theory. This scientific distance
to probability theory has advantages and disadvantages. Its obvious disadvan-


tage is that there is no deeply engrained familiarity with the field and with its
development. But at the same time this distance may be an advantage, since
it provides a fresh perspective, without sacred truths and without adherence to
common practices and notations. For instance, the terminology and notation in
this book are influenced by quantum theory, e.g. in using the words ‘state’, ‘ob-
servable’ and ‘test’ — as synonyms for ‘distribution’, for an R-valued function
on a sample space and for compatible (summable) predicates — in using ket
notation |−⟩ in writing discrete probability distributions, or in using daggers
as reversals, in analogy with conjugate transposes (for Hilbert spaces).
It should be said: for someone trained in formal methods, the area of prob-
ability theory can be rather sloppy: everything is called ‘P’, types are hardly
ever used, crucial ingredients (like distributions in expected values) are left
implicit, basic notions (like conjugate prior) are introduced only via examples,
calculation recipes and algorithms are regularly just given, without explana-
tion, goal or justification, etc. This hurts, especially because there is so much
beautiful mathematical structure around. For instance, the notion of a channel
(see below) formalises the idea of a conditional probability and carries a rich
mathematical structure that can be used in compositional reasoning, with both
sequential and parallel composition. The Bayesian inversion (‘dagger’) of a
channel does not only come with appealing mathematical (categorical) proper-
ties — e.g. smooth interaction with sequential and parallel composition — but
is also extremely useful in inference and learning. Via this dagger we can con-
nect forward and backward inference (see Theorem 6.6.3: backward inference
is forward inference with the dagger, and vice-versa) and capture the difference
between Pearl’s and Jeffrey’s update rules (see Theorem 6.7.4: Pearl increases
validity, whereas Jeffrey decreases divergence).
We even dare to think that this ‘sloppiness’ is ultimately a hindrance to fur-
ther development of the field, especially in computer science, where computer-
assisted reasoning requires a clear syntax and semantics. For instance, it is
hard to even express the above-mentioned theorems 6.6.3 and 6.7.4 in stan-
dard probabilistic notation. One can speculate that states/distributions are kept
implicit in traditional probability theory because in many examples they are
used as a fixed implicit assumption in the background. Indeed, in mathemati-
cal notation one tends to omit — for efficiency — the least relevant (implicit)
parameters. But the essence of probabilistic computation is state transforma-
tion, where it has become highly relevant to know explicitly in which state one
is working at which stage. The notation developed in this book helps in such
situations — and in many other situations as well, we hope.
Apart from having beautiful structure, probability theory also has magic. It
can be found in the following two points.


1 Probability distributions can be updated, making it possible to absorb infor-
mation (evidence) into them and learn from it. Multiple updates, based on
data, can be used for training, so that a distribution absorbs more and more
information and can subsequently be used for prediction or classification.
2 The components of a joint distribution, over a product space, can ‘listen’
to each other, so that updating in one (product) component has crossover
ripple effects in other components. This ripple effect looks like what happens
in quantum physics, where measuring one part of an entangled quantum
system changes other parts.

The combination of these two points is very powerful and forms the basis for
probabilistic reasoning. For instance, if we know that two phenomena are re-
lated, and we have new information about one of them, then we also learn
something new about the other phenomenon, after updating. We shall see that
such crossover ripple effects can be described in two equivalent ways, starting
from a joint distribution with evidence in one component.

• We can use the “weaken-update-marginalise” approach, where we first wea-
ken the evidence from one component to the whole product space, so that
it fits the joint distribution and can be used for updating; subsequently, we
marginalise the updated state to the component that we wish to learn more
about. That’s where the ripple effect through the joint distribution becomes
visible.
• We can also use the “extract-infer” technique, where we first extract a condi-
tional probability (channel) from the joint distribution and then do (forward
or backward) inference with the evidence, along the channel. This is what
happens if we reason in Bayesian networks, when information at one point
in the network is transported up and down the connections to draw conclu-
sions at another point in the network.

The equivalence of these two approaches will be demonstrated and exploited
at various places in this book.
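The two approaches can be checked against each other on a tiny example. The following sketch (not from the book; the numbers and the NumPy encoding are illustrative) represents a joint state as a 2-D array and fuzzy evidence on the second component as a vector:

```python
import numpy as np

# A joint distribution on a product space X x Y, as a 2-D array of
# probabilities summing to one (illustrative numbers).
joint = np.array([[0.10, 0.30],
                  [0.25, 0.35]])          # rows: x in X, columns: y in Y

q = np.array([0.8, 0.1])                  # fuzzy evidence (predicate) on Y

# Approach 1: weaken-update-marginalise.
weakened = np.ones(2)[:, None] * q[None, :]   # weaken q from Y to X x Y
updated = joint * weakened
updated /= updated.sum()                      # update the joint state with q
posterior_x1 = updated.sum(axis=1)            # marginalise onto X

# Approach 2: extract-infer.
prior_x = joint.sum(axis=1)                   # marginal on X
channel = joint / prior_x[:, None]            # conditional P(y|x): a channel X -> D(Y)
likelihood = channel @ q                      # predicate transformation along the channel
posterior_x2 = prior_x * likelihood
posterior_x2 /= posterior_x2.sum()            # backward inference with q

# The crossover ripple effect on X is the same either way.
assert np.allclose(posterior_x1, posterior_x2)
```

The first approach never mentions a channel; the second never touches the full joint after extraction, yet both yield the same posterior on X.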
Here is a characteristic illustration of our structure-based approach. A well
known property of the Poisson distribution pois is commonly expressed as:
if X1 ∼ pois[λ1 ] and X2 ∼ pois[λ2 ] then X1 + X2 ∼ pois[λ1 + λ2 ]. This
formulation uses random variables X1 , X2 , which are Poisson-distributed. We
shall formulate this fact as an (algebraic) structure preservation property of
the Poisson distribution, without using any random variables. The property is:
pois[λ1 + λ2 ] = pois[λ1 ] + pois[λ2 ]. It says that the Poisson channel pois is
a homomorphism of monoids, from the non-negative reals R≥0 to distributions
D(N) on the natural numbers, see Proposition 2.4.6 for details. This result uses


a commutative monoid structure on distributions whose underlying space is it-
self a commutative monoid. This monoid structure plays a role in many other
situations, for instance in the fundamental distributive law that turns multisets
of distributions into distributions of multisets.
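This structure-preservation property is easy to test numerically. The sketch below (illustrative, not the book's notation) computes the Poisson probability mass function directly; the monoid sum of two distributions on the natural numbers is the distribution of the sum of independent draws, i.e. a convolution:

```python
from math import exp, factorial

def pois(lam, n):
    """Poisson probability of outcome n, for rate lam."""
    return exp(-lam) * lam**n / factorial(n)

l1, l2 = 1.5, 2.0
N = 40  # truncation; the tail beyond N is negligible for these rates

# Left-hand side: the Poisson distribution with the summed rate.
lhs = [pois(l1 + l2, n) for n in range(N)]

# Right-hand side: the monoid sum pois[l1] + pois[l2], as a convolution.
rhs = [sum(pois(l1, k) * pois(l2, n - k) for k in range(n + 1))
       for n in range(N)]

assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```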
The following aspects characterise the approach of this book.
1 Channels are used as a cornerstone in probabilistic reasoning. The con-
cept of (communication) channel is widely used elsewhere, under various
names, such as conditional probability, stochastic matrix, probabilistic clas-
sifier, Markov kernel, conditional probability table (in Bayesian network),
probabilistic function/computation, signal (in Bayesian persuasion theory),
and finally as Kleisli map (in category theory). Channels can be composed
sequentially and in parallel, and can transform both states and predicates.
Channels exist for all relevant collection types (lists, subsets, multisets, dis-
tributions), for instance for non-deterministic, and for probabilistic compu-
tation. However, after the first chapter about collection types, channels will
be used exclusively for distributions, in probabilistic form.
2 Multisets play a central role to capture various forms of data, like coloured
balls in an urn, draws from such an urn, tables, inputs for successive learn-
ing steps, etc. The interplay between multisets and distributions, notably in
learning, is a recurring theme.
3 States (distributions) are treated as separate from, and dual to, predicates.
These predicates are the ingredients of an (implicit) probabilistic logic, with,
for instance conjunction and negation operations. States are really different
entities, with their own operations, without, for instance conjunction and
negation. In this book, predicates standardly occur in fuzzy (soft, non-sharp)
form, taking values in the unit interval [0, 1]. Along a channel one can trans-
fer states forward, and predicates backward. A central notion is the validity
of a predicate in a state, written as |=. It is standardly called expected value.
Conditioning involves updating a state with a predicate.
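The channel-based ingredients in these three points can be sketched in a few lines of plain Python (an illustrative ad-hoc dictionary encoding, not the book's EfProb library):

```python
# States are distributions (dicts mapping outcomes to probabilities summing
# to one); predicates are fuzzy, [0,1]-valued functions on outcomes; a
# channel maps each input to a state on the outputs.
state = {'x1': 0.3, 'x2': 0.7}
channel = {'x1': {'y1': 0.9, 'y2': 0.1},
           'x2': {'y1': 0.2, 'y2': 0.8}}
pred = {'y1': 0.75, 'y2': 0.25}           # a fuzzy predicate on the outputs

def push_forward(c, s):
    """State transformation along a channel (forward)."""
    out = {}
    for x, px in s.items():
        for y, py in c[x].items():
            out[y] = out.get(y, 0.0) + px * py
    return out

def pull_back(c, p):
    """Predicate transformation along a channel (backward)."""
    return {x: sum(d[y] * p[y] for y in d) for x, d in c.items()}

def validity(s, p):
    """Validity s |= p, the expected value of p in s."""
    return sum(s[x] * p[x] for x in s)

# Validity of the pulled-back predicate in the input state equals validity
# of the predicate in the pushed-forward state.
assert abs(validity(state, pull_back(channel, pred))
           - validity(push_forward(channel, state), pred)) < 1e-12
```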
It is not just mathematical aesthetics that drives the developments in this
book. Probability theory nowadays forms the basis of large parts of big data
analytics and of artificial intelligence. These areas are of increasing societal
relevance and provide the basis of the modern view of the world — more based
on correlation than on causation — and also provide the basis for much of mod-
ern decision making, that may affect the lives of billions of people in profound
ways. There are increasing demands for justification of such probabilistic rea-
soning methods and decisions, for instance in the legal setting provided by
Europe’s General Data Protection Regulation (GDPR). Its recital 71 is about
automated decision-making and talks about a right to obtain an explanation:


In any case, such processing should be subject to suitable safeguards, which should
include specific information to the data subject and the right to obtain human interven-
tion, to express his or her point of view, to obtain an explanation of the decision reached
after such assessment and to challenge the decision.

It is not acceptable that your mortgage request is denied because you drive a
blue car — in the presence of a correlation between driving blue cars and being
late on one’s mortgage payments.
These and other developments have led to a new area called Explainable
Artificial Intelligence (XAI), which strives to provide decisions with explana-
tions that can be understood easily by humans, without bias or discrimination.
Although this book will not contribute to XAI as such, it aims to provide a
mathematically solid basis for such explanations.
In this context it is appropriate to quote Judea Pearl [134] from 1989 about
a divide that is still wide today.
To those trained in traditional logics, symbolic reasoning is the standard, and non-
monotonicity a novelty. To students of probability, on the other hand, it is symbolic
reasoning that is novel, not nonmonotonicity. Dealing with new facts that cause proba-
bilities to change abruptly from very high values to very low values is a commonplace
phenomenon in almost every probabilistic exercise and, naturally, has attracted special
attention among probabilists. The new challenge for probabilists is to find ways of ab-
stracting out the numerical character of high and low probabilities, and cast them in
linguistic terms that reflect the natural process of accepting and retracting beliefs.

This book does not pretend to fill this gap. One of the big embarrassments of
the field is that there is no widely accepted symbolic logic for probability, to-
gether with proof rules and a denotational semantics. Such a logic for symbolic
reasoning about probability will be non-trivial, because it will have to be non-
monotonic1 — a property that many logicians shy away from. This book does
aim to contribute towards bridging the divide mentioned by Pearl, by provid-
ing a mathematical basis for such a symbolic probabilistic logic, consisting of
channels, states, predicates, transformations, conditioning, disintegration, etc.
From the perspective of this book, the structured categorical approach to
probability theory began with the work of Bill Lawvere (already in the 1960s)
and his student Michèle Giry. They recognised that taking probability dis-
tributions has the structure of a monad, which was published in the early
1980s in [58]. Roughly at the same time Dexter Kozen started the systematic
investigation of probabilistic programming languages and logics, published
in [106, 107]. The monad introduced back then is now called the Giry monad
G, whose restriction to finite discrete probability distributions is written as
D. Much of this book, certainly in the beginning, concentrates on this discrete
form. The language and notation that is used, however, covers both discrete
and continuous probability — and quantum probability too (inspired by the
general categorical notion of effectus, see [22, 68]).
1 Informally, a logic is non-monotonic if adding assumptions may make a conclusion less true.
  For instance, I may think that scientists are civilised people, until, at some conference dinner,
  a heated scientific debate ends in a fist fight.
Since the early 1980s the area of categorical probability theory remained
relatively calm. It is only in the new millennium that there is renewed attention,
sparked in particular by several developments.
• The grown interest in probabilistic programming languages that incorporate
updating (conditioning) and/or higher order features, see e.g. [29, 30, 32,
129, 154, 153].
• The compositional approach to Bayesian networks [26, 48] and to Bayesian
reasoning [28, 87, 89].
• The use of categorical and diagrammatic methods in quantum foundations,
including quantum probability, see [25] for an overview.
• The efforts to develop ‘synthetic’ probability theory via a categorical ax-
iomatisation, see e.g. [52, 53, 147, 20, 80].
This book builds on these developments.
The intended audience consists of students and professionals — in math-
ematics, computer science, artificial intelligence and related fields — with a
basic background in probability and in algebra and logic — and with an inter-
est in formal, logically oriented approaches. This book’s goal is not to provide
intuitive explanations of probability, like [158], but to provide clear and precise
formalisations of the relevant structures. Mathematical abstraction (esp. cate-
gorical air guitar playing) is not a goal in itself (except maybe in Chapter 3):
instead, the book tries to uncover relevant abstractions in concrete problems. It
includes several basic algorithms, with a focus on the algorithms’ correctness,
not their efficiency. Each section ends with a series of exercises, so that the
book can also be used for teaching and/or self-study. It aims at an undergradu-
ate level. No familiarity with category theory is assumed. The basic, necessary
notions are explained along the way. People who wish to learn more about cat-
egory theory can use the references in the text, consult modern introductory
texts like [7, 113], or use online resources such as ncatlab.org or Wikipedia.

Contents overview
The first chapter of the book covers introductory material that is meant to set
the scene. It starts from basic collection types like lists and subsets, and con-
tinues with multisets, which receive most attention. The chapter discusses the


(free) monoid structure on all these collection types and introduces ‘unit’ and
‘flatten’ maps as their common, underlying structure. It also introduces the
basic concept of a channel, for these three collection types, and shows how
channels can be used for state transformation and how they can be composed,
both sequentially and in parallel. At the end, the chapter provides definitions
of the relevant notions from category theory.
In the second chapter (discrete) probability distributions first emerge, as a
special collection type, with their own associated form of (probabilistic) chan-
nel. The subtleties of parallel products of distributions (states), with entwined-
ness/correlation between components and the non-naturality of copying, are
discussed at this early stage. This culminates in an illustration of Bayesian net-
works in terms of (probabilistic) channels. It shows how predictions are made
within such Bayesian networks via state transformation and via compositional
reasoning, basically by translating the network structure into (sequential and
parallel) composites of channels.
Blindly drawing coloured balls from an urn is a basic model in discrete prob-
ability. Such draws are analysed systematically in Chapter 3, not only for the
two familiar multinomial and hypergeometric forms (with or without replace-
ment), but also in the less familiar Pólya and Ewens forms. By describing these
draws as probabilistic channels we can derive the well known formulations for
these draw-distributions via channel-composition. Once formulated in terms of
channels, these distributions satisfy various compositionality properties. They
are typical for our approach and are (largely) absent in traditional treatments
of this topic. Urns and draws from urns are both described as multisets. The
interplay between multisets and distributions is an underlying theme in this
chapter. There is a fundamental distributive law between multisets and distri-
butions that expresses basic structural properties.
The fourth chapter is more logically oriented, via observables X → R (in-
cluding factors, predicates and events) that can be defined on sample spaces
X, providing numerical information. The chapter concentrates on validity of
observables in states and on transformation of observables. Where the second
chapter introduces state transformation along a probabilistic channel in a for-
ward direction, this fourth chapter adds observable (predicate) transformation
in a backward direction. These two operations are of fundamental importance
in program semantics, and also in quantum computation — where they are dis-
tinguished as Schrödinger’s (forward) and Heisenberg’s (backward) approach.
In this context, a random variable is a combination of a state and an observable,
on the same underlying sample space. The statistical notions of variance and
covariance are described in terms of validity for such random variables in


Chapter 5. This chapter distinguishes two forms of covariance, with a ‘shared’
or a ‘joint’ state, which satisfy different properties.
A very special technique in the area of probability theory is conditioning,
also known as belief updating, or simply as updating. It involves the incor-
poration of evidence into a distribution (state), so that the distribution better
fits the evidence. In traditional probability such conditioning is only indirectly
available, via a rule P(B | A) for computing conditional probabilities. In Chap-
ter 6 we formulate conditioning as an explicit operation, mapping a state ω
and a predicate p to a new updated state ω| p . A key result is that the validity
of p in ω| p is higher than the validity of p in the original state ω. This means
that we have learned from p and adapted our state (of mind) from ω to ω| p .
This updating operation ω| p forms an action (of predicates on states) and sat-
isfies Bayes’ rule, in fuzzy form. The combination with forward and backward
transformation along a channel leads to the techniques of forward inference
(causal reasoning) and backward inference (evidential reasoning). These in-
ference techniques are illustrated in many examples, including Bayesian net-
works. The backward inference rule is also called Pearl’s rule. It turns out that
there exists an alternative update mechanism, called Jeffrey’s rule. It can give
completely different outcomes. The formalisation of Jeffrey’s rule is based on
the reversal of the direction of a channel. This corresponds to turning a con-
ditional probability P(y | x) into P(x | y), essentially via Bayes’ rule. This
Bayesian inversion is also called the ‘dagger’ of a channel, since it satisfies
the properties of a dagger operation (or conjugate transpose), in Hilbert spaces
(and in quantum theory). This chapter includes a mathematical characterisation
of the different goals of these two update rules: Pearl’s rule increases validity
and Jeffrey’s rule decreases divergence. More informally, one learns via Pearl’s
rule by improving what’s going well and via Jeffrey’s rule by reducing what’s
going wrong. Jeffrey’s rule is thus an error correction mechanism. This fits the
basic idea in predictive coding theory [64, 23] that the human mind is seen as
a Bayesian prediction engine that operates by reducing prediction errors.
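The update operation and the validity-increase result described above can be illustrated concretely (a sketch with made-up numbers; the update formula is conditioning with a fuzzy predicate):

```python
omega = {'a': 0.5, 'b': 0.3, 'c': 0.2}   # a state (distribution)
p = {'a': 0.9, 'b': 0.5, 'c': 0.1}       # a fuzzy predicate as evidence

def validity(s, q):
    """s |= q, the expected value of q in s."""
    return sum(s[x] * q[x] for x in s)

def update(s, q):
    """The updated state s|_q: multiply pointwise by q and renormalise."""
    v = validity(s, q)
    return {x: s[x] * q[x] / v for x in s}

omega_p = update(omega, p)

# Key property: the validity of p can only go up after updating with p,
# i.e. (omega|_p |= p) >= (omega |= p).
assert validity(omega_p, p) >= validity(omega, p)
```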
Since channels can be composed both sequentially and in parallel we can
use graphical techniques to represent composites of channels. So-called string
diagrams have been developed in physics and in category theory to deal with
the relevant compositional structure (symmetric monoidal categories). Chap-
ter 7 introduces these (directed) graphical techniques. It first describes string
diagrams. They are similar to the graphs used for Bayesian networks, but they
have explicit operations for copying and discarding and are thus more expres-
sive. But most importantly, string diagrams have a clear semantics, namely in
terms of channels. The chapter illustrates these string diagrams in a channel-
based description of the basics of Markov chains and of hidden Markov mod-


els. But the most fundamental technique that is introduced in this chapter, via
string diagrams, is disintegration. In essence, it is the well known procedure
of extracting a conditional probability P(y | x) from a joint probability P(x, y).
One of the themes running through this book is how ‘crossover’ influence can
be captured via channels — extracted from joint states via disintegration —
in particular via forward and backward inference. This phenomenon is what
makes (reasoning in) Bayesian networks work. Disintegration is of interest in
itself, but also provides an intuitive formalisation of the Bayesian inversion of
a channel.
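In the discrete case, disintegration and the resulting Bayesian inversion amount to a few array manipulations. The sketch below (illustrative numbers, NumPy encoding assumed) recovers a channel from a joint state and forms its dagger:

```python
import numpy as np

# A channel c : X -> D(Y) as a row-stochastic matrix, with a prior on X.
prior = np.array([0.4, 0.6])
c = np.array([[0.7, 0.3],
              [0.1, 0.9]])

# The joint state formed from prior and channel: P(x, y) = P(x) P(y|x).
joint = prior[:, None] * c

# Disintegration recovers the channel from the joint: P(y|x) = P(x,y) / P(x).
assert np.allclose(joint / joint.sum(axis=1, keepdims=True), c)

# Bayesian inversion (the 'dagger'): the channel Y -> D(X) with
# P(x|y) = P(x,y) / P(y), computed from the same joint.
dagger = (joint / joint.sum(axis=0, keepdims=True)).T
assert np.allclose(dagger.sum(axis=1), 1.0)   # each row is a distribution
```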
Almost all of the material in these chapters is known from the literature, but
typically not in the channel-based form in which it is presented here. This book
includes many examples, often copied from familiar sources, with the deliber-
ate aim of illustrating how the channel-based approach actually works. Since
many of these examples are taken from the literature, the interested reader may
wish to compare the channel-based description used here with the original de-
scription.

Status of the current incomplete version


An incomplete version of this book is made available online, in order to gen-
erate feedback and to justify a pause in the writing process. Feedback is most
welcome, both positive and negative, especially when it suggests concrete im-
provements of the text. This may lead to occasional updates of this text. The
date on the title page indicates the current version.
Some additional points.

• The (non-trivial) calculations in this book have been carried out with the
EfProb library [20] for channel-based probability. Several calculations in
this book can be done by hand, typically when the outcomes are described
as fractions, like 117/2012. Such calculations are meant to be reconstructable by
a motivated reader who really wishes to learn the ‘mechanics’ of the field.
Doing such calculations is a great way to really understand the topic — and
the approach of this book2 . Outcomes written in decimal notation 0.1234, as
approximations, or as plots, serve to give an impression of the results of a
computation.
• For the rest of this book, beyond Chapter 7, several additional chapters exist
2 Doing the actual calculations can be a bit boring and time consuming, but there are useful
online tools for calculating fractions, such as
https://www.mathpapa.com/fraction-calculator.html. Recent versions of EfProb also allow
calculations in fractional form.


in unfinished form, for instance on learning, probabilistic automata, causal-
ity and on continuous probability. They will be incorporated in due course.

Bart Jacobs, August 26, 2021.

1

Collections

There are several ways to put elements from a given set together, for instance
as lists, subsets, multisets, and as probability distributions. This introductory
chapter takes a systematic look at such collections and seeks to bring out nu-
merous similarities. For instance, lists, subsets and multisets all form monoids,
by suitable unions of collections. Unions of distributions are more subtle and
take the form of convex combinations. Also, subsets, multisets and distribu-
tions can be combined naturally via parallel products ⊗, though lists cannot.
In this first chapter, we collect some basic operations and properties of tuples,
lists, subsets and multisets — where multisets are ‘sets’ in which elements
may occur multiple times. Probability distributions will be postponed to the
next chapter. Especially, we collect several basic definitions and results for
multisets, since they play an important role in the sequel, as urns, filled with
coloured balls, as draws from such urns, and as data in learning.
The main differences between lists, subsets and multisets are summarised in
the table below.
                                     lists   subsets   multisets
order of elements matters              +        -          -
multiplicity of elements matters       +        -          +

For instance, the lists [a, a, b], [a, b, a] and [a, b] are all different. The multisets
2|a⟩ + 1|b⟩ and 1|b⟩ + 2|a⟩, with the element a occurring twice and the element
b occurring once, are the same. However, 1|a⟩ + 1|b⟩ is a different multiset.
Similarly, the sets {a, b}, {b, a}, and {a} ∪ {a, b} are the same.
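In programming terms, the three collection types in the table can be told apart with Python's standard library (an illustrative aside, not part of the book):

```python
from collections import Counter

# Lists: order and multiplicity both matter.
assert ['a', 'a', 'b'] != ['a', 'b', 'a']
assert ['a', 'b', 'a'] != ['a', 'b']

# Multisets (Counter): multiplicity matters, order does not.
assert Counter({'a': 2, 'b': 1}) == Counter({'b': 1, 'a': 2})  # 2|a> + 1|b> = 1|b> + 2|a>
assert Counter({'a': 1, 'b': 1}) != Counter({'a': 2, 'b': 1})  # 1|a> + 1|b> differs

# Sets: neither order nor multiplicity matters.
assert {'a', 'b'} == {'b', 'a'} == {'a'} | {'a', 'b'}
```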
These collections are important in themselves, in many ways, and primar-
ily (in this book) as outputs of channels. Channels are functions of the form
input → T (output), where T is a ‘collection’ operator, for instance, combin-
ing elements as lists, subsets, multisets, or distributions. Such channels capture


a form of computation, directly linked to the form of collection that is produced
on the outputs. For instance, channels where T is powerset are used as interpre-
tations of non-deterministic computations, where each input element produces
a subset of possible output elements. In the probabilistic case these channels
produce distributions, with a suitable instantiation of the operator T . Channels
can be used as elementary units of computation, which can be used to build
more complicated computations via sequential and parallel composition.
First, the reader is exposed to some general considerations that set the scene
for what is coming. This requires some level of patience. But it is useful to
see the similarities between probability distributions (in Chapter 2) and other
collections first, so that constructions, techniques, notation, terminology and
intuition that we use for distributions can be put in a wider perspective and
thus may become natural.
The final section of this chapter explains where the abstractions that we use
come from, namely from category theory, especially referring to the notion of
monad. It gives a quick overview of the most relevant parts of this theory and
also how category theory will be used in the remainder of this book to elicit
mathematical structure. We use category theory pragmatically, as a tool, and
not as a goal in itself.

1.1 Cartesian products


This section briefly reviews some (standard) terminology and notation related
to Cartesian products of sets.
Let X1 and X2 be two arbitrary sets. We can form their Cartesian product
X1 × X2 , as the new set containing all pairs of elements from X1 and X2 , as in:

X1 × X2 B {(x1 , x2 ) | x1 ∈ X1 and x2 ∈ X2 }.

(We use the sign B for mathematical definitions.)


We thus write (x1 , x2 ) for the ‘pair’ or ‘tuple’ of elements x1 ∈ X1 and
x2 ∈ X2 . We have just defined a binary product set, constructed from two given
sets X1 , X2 . We can also do this in n-ary form, for n sets X1 , . . . , Xn . We then
get an n-ary Cartesian product:

X1 × · · · × Xn B {(x1 , . . . , xn ) | x1 ∈ X1 , . . . , xn ∈ Xn }.

The tuple (x1 , . . . , xn ) is sometimes called an n-tuple. For convenience, it may


be abbreviated as a vector x⃗. The product X1 × · · · × Xn is sometimes written
differently using the symbol ∏, as:

    ∏_{1≤i≤n} Xi    or more informally as:    ∏i Xi .

In the latter case it is left implicit what the range is of the index element i.
We allow n = 0. The resulting ‘empty’ product is then a singleton
set, written as 1, containing the empty tuple () as sole element, as in:

1 B {()}.
For n = 1 the product X1 × · · · × Xn is (isomorphic to) the set X1 . Note that we
are overloading the symbol 1 and using it both as numeral and as singleton set.
If one of the sets Xi in a product X1 × · · · × Xn is empty, then the whole
product is empty. Also, if all of the sets Xi are finite, then so is the product
X1 × · · · × Xn . In fact, the number of elements of X1 × · · · × Xn is then obtained
by multiplying all the numbers of elements of the sets Xi .

1.1.1 Projections and tuples


If we have sets X1 , . . . , Xn as above, then for each number i with 1 ≤ i ≤ n
there is a projection function πi out of the product to the set Xi , as in:
    πi : X1 × · · · × Xn → Xi    given by    πi (x1 , . . . , xn ) B xi .


This gives us functions out of a product. We also wish to be able to define


functions into a product, via tuples of functions: if we have a set Y and n
functions f1 : Y → X1 , . . . , fn : Y → Xn , then we can form a new function
Y → X1 × · · · × Xn , namely:
    ⟨ f1 , . . . , fn ⟩ : Y → X1 × · · · × Xn    via    ⟨ f1 , . . . , fn ⟩(y) B ( f1 (y), . . . , fn (y)).

There is an obvious result about projecting after tupling of functions:

    πi ◦ ⟨ f1 , . . . , fn ⟩ = fi .    (1.1)
This is an equality of functions. It can be proven easily by applying both sides
to an arbitrary element y ∈ Y.
There are some more ‘obvious’ equations about tupling of functions:

    ⟨ f1 , . . . , fn ⟩ ◦ g = ⟨ f1 ◦ g, . . . , fn ◦ g⟩    and    ⟨π1 , . . . , πn ⟩ = id ,    (1.2)
where g : Z → Y is an arbitrary function. In the last equation, id is the identity
function on the product X1 × · · · × Xn .
In a Cartesian product we place sets ‘in parallel’. We can also place functions
between them in parallel. Suppose we have n functions fi : Xi → Yi . Then we


can form the parallel composition:

    f1 × · · · × fn : X1 × · · · × Xn → Y1 × · · · × Yn

via:

    f1 × · · · × fn = ⟨ f1 ◦ π1 , . . . , fn ◦ πn ⟩    so that:    ( f1 × · · · × fn )(x1 , . . . , xn ) = ( f1 (x1 ), . . . , fn (xn )).


The latter formulation clearly shows how the functions fi are applied in parallel
to the elements xi .
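These operations can be sketched in Python; the helper names `proj`, `tup` and `par` are hypothetical stand-ins (with 0-based indexing) for the projections, tuples ⟨f1, . . . , fn⟩ and parallel products f1 × · · · × fn of the text:

```python
# Hypothetical helpers for projections, tuples and parallel products.
def proj(i):
    return lambda xs: xs[i]                      # pi_i (0-based)

def tup(*fs):
    return lambda y: tuple(f(y) for f in fs)     # <f1, ..., fn>

def par(*fs):
    return lambda xs: tuple(f(x) for f, x in zip(fs, xs))  # f1 x ... x fn

f1 = lambda y: y + 1
f2 = lambda y: y * 2
# Equation (1.1): pi_i after tupling gives back the i-th component.
assert proj(0)(tup(f1, f2)(5)) == f1(5) == 6
assert proj(1)(tup(f1, f2)(5)) == f2(5) == 10
# The parallel product applies each function in its own coordinate.
assert par(f1, f2)((3, 4)) == (f1(3), f2(4)) == (4, 8)
```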
We overload the product symbol ×, since we use it both for sets and for
functions. This may be a bit confusing at first, but it is in fact quite convenient.

1.1.2 Powers and exponents


Let X be an arbitrary set. A power of X is an n-product of X’s, for some n. We
write the n-th power of X as X^n, in:

    X^n B X × · · · × X (n times) = {(x1 , . . . , xn ) | xi ∈ X for each i}.

As special cases we have X^1 = X and X^0 = 1, where 1 = {()} is the singleton
set with the empty tuple () as sole inhabitant. Since powers are special cases of
Cartesian products, they come with projection functions πi : X^n → X and tuple
functions ⟨ f1 , . . . , fn ⟩ : Y → X^n for n functions fi : Y → X. Finally, for a
function f : X → Y we write f^n : X^n → Y^n for the obvious n-fold parallelisation
of f .
More generally, for two sets X, Y we shall occasionally write:
    X^Y B { f | f : Y → X }.

This new set X^Y is sometimes called the function space or the exponent of X
and Y. Notice that this exponent notation is consistent with the above one for
powers, since functions n → X can be identified with n-tuples of elements in
X.
These exponents X^Y are related to products in an elementary and useful way,
namely via a bijective correspondence:

    f : Z × Y → X
    ===============    (1.3)
    g : Z → X^Y

This means that for a function f : Z × Y → X there is a corresponding function
f̄ : Z → X^Y, and vice-versa, for g : Z → X^Y there is a function ḡ : Z × Y → X,
in such a way that these two operations are each other's inverses. It is not hard
to see that we can take f̄ (z) ∈ X^Y to be the function f̄ (z)(y) = f (z, y), for
z ∈ Z and y ∈ Y. Similarly, we use ḡ(z, y) = g(z)(y).
The correspondence (1.3) is characteristic for so-called Cartesian closed cat-
egories.
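In a programming language, the two directions of the correspondence (1.3) are exactly currying and uncurrying, sketched here in Python:

```python
# Currying and uncurrying: the two directions of the bijection (1.3)
# between functions Z × Y → X and functions Z → X^Y.
def curry(f):
    return lambda z: (lambda y: f(z, y))   # f-bar

def uncurry(g):
    return lambda z, y: g(z)(y)            # g-bar

f = lambda z, y: 10 * z + y
assert curry(f)(2)(3) == f(2, 3) == 23
# The two operations are each other's inverses (pointwise).
assert uncurry(curry(f))(2, 3) == f(2, 3)
```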

Exercises
1.1.1 Check what a tuple function ⟨π2 , π3 , π6 ⟩ does on a product set X1 ×
· · · × X8 . What is the codomain of this function?
1.1.2 Check that, in general, the tuple function ⟨ f1 , . . . , fn ⟩ is the unique
function h : Y → X1 × · · · × Xn with πi ◦ h = fi for each i.
1.1.3 Prove, using Equations (1.1) and (1.2) for tuples and projections, that:

    (g1 × · · · × gn ) ◦ ( f1 × · · · × fn ) = (g1 ◦ f1 ) × · · · × (gn ◦ fn ).

1.1.4 Check that for each set X there is a unique function X → 1. Because
of this property the set 1 is sometimes called ‘final’ or ‘terminal’. The
unique function is often denoted by !.
Check also that a function 1 → X corresponds to an element of X.
1.1.5 Define functions in both directions, using tuples and projections, that
yield isomorphisms:

    X × Y ≅ Y × X        1 × X ≅ X        X × (Y × Z) ≅ (X × Y) × Z.

Try to use Equations (1.1) and (1.2) to prove these isomorphisms,


without reasoning with elements.
1.1.6 Similarly, show that exponents satisfy:

    X^1 ≅ X        1^Y ≅ 1        (X × Y)^Z ≅ X^Z × Y^Z        X^(Y×Z) ≅ (X^Y)^Z .


1.1.7 For K ∈ N and sets X, Y define:

    zip[K] : X^K × Y^K → (X × Y)^K

by:

    zip[K]((x1 , . . . , xK ), (y1 , . . . , yK )) B ((x1 , y1 ), . . . , (xK , yK )).

Show that zip[K] is an isomorphism, with inverse function unzip[K] B
⟨(π1 )^K , (π2 )^K ⟩.
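A small Python sketch of this exercise; the names `zip_k` and `unzip_k` are illustrative stand-ins for zip[K] and unzip[K]:

```python
# zip[K] and unzip[K] from Exercise 1.1.7, on Python tuples.
def zip_k(xs, ys):
    return tuple(zip(xs, ys))

def unzip_k(ps):
    # unzip[K] = <pi1^K, pi2^K>: apply each projection coordinatewise.
    return tuple(p[0] for p in ps), tuple(p[1] for p in ps)

xs, ys = (1, 2, 3), ('a', 'b', 'c')
assert zip_k(xs, ys) == ((1, 'a'), (2, 'b'), (3, 'c'))
assert unzip_k(zip_k(xs, ys)) == (xs, ys)   # unzip[K] ∘ zip[K] = id
```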


1.2 Lists
The datatype of (finite) lists of elements from a given set is well-known in
computer science, especially in functional programming. This section collects
some basic constructions and properties, especially about the close relationship
between lists and monoids.
For an arbitrary set X we write L(X) for the set of all finite lists [x1 , . . . , xn ]
of elements xi ∈ X, for arbitrary n ∈ N. Notice that we use square brackets
[−] for lists, to distinguish them from tuples, which are typically written with
round brackets (−).
Thus, the set of lists over X can be defined as a union of all powers of X, as
in:
    L(X) B ⋃_{n∈N} X^n .

When the elements of X are letters of an alphabet, then L(X) is the set of words
— the language — over this alphabet. The set L(X) is alternatively written as
X ? , and called the Kleene star of X.
We zoom in on some trivial cases. One has L(0) ≅ 1, since one can only
form the empty word over the empty alphabet 0 = ∅. If the alphabet contains
only one letter, a word consists of a finite number of occurrences of this single
letter. Thus: L(1) ≅ N.
We consider lists as an instance of what we call a collection data type, since
L(X) collects elements of X in a certain manner. What distinguishes lists from
other collection types is that elements may occur multiple times, and that the
order of occurrence matters. The three lists [a, b, a], [a, a, b], and [a, b] differ.
As mentioned in the introduction to this chapter, within a subset orders and
multiplicities do not matter, see Section 1.3; and in a multiset the order of
elements does not matter, but multiplicities do matter, see Section 1.4.
Let f : X → Y be an arbitrary function. It can be used to map lists over X into
lists over Y by applying f element-wise. This is what functional programmers
call map-list. Here we like overloading, so we write L( f ) : L(X) → L(Y) for
this function, defined as:

L( f )([x1 , . . . , xn ]) B [ f (x1 ), . . . , f (xn )].




Thus, L is an operation that not only sends sets to sets, but also functions to
functions. It does so in such a way that identity maps and compositions are
preserved:
    L(id ) = id    and    L(g ◦ f ) = L(g) ◦ L( f ).

We shall say: the operation L is functorial, or simply L is a functor.
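The functorial action of L is the familiar map on lists; a quick Python spot check on a sample list (this checks the equations on one input, it is not a proof):

```python
# The action of L on functions ('map-list'), with functoriality checks.
def L(f):
    return lambda xs: [f(x) for x in xs]

f = lambda x: x + 1
g = lambda x: x * 2
xs = [1, 2, 3]
assert L(lambda x: x)(xs) == xs                       # L(id) = id
assert L(lambda x: g(f(x)))(xs) == L(g)(L(f)(xs))     # L(g ∘ f) = L(g) ∘ L(f)
assert L(f)(xs) == [2, 3, 4]
```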

Functoriality can be used to define the marginal of a list on a product set,


via L(πi ), where πi is a projection map. For instance, let ` ∈ L(X × Y) be of
the form ` = [(x1 , y1 ), . . . , (xn , yn )]. The first marginal L(π1 )(`) ∈ L(X) is then
computed as:

    L(π1 )(`) = L(π1 )([(x1 , y1 ), . . . , (xn , yn )])
              = [π1 (x1 , y1 ), . . . , π1 (xn , yn )]
              = [x1 , . . . , xn ].

1.2.1 Monoids
A monoid is a very basic mathematical structure. For convenience we define it
explicitly.
Definition 1.2.1. A monoid consists of a set M with a binary operation M ×
M → M, written for instance as infix +, together with an identity element, say
written as 0 ∈ M. The binary operation + is associative and has 0 as identity
on both sides. That is, for all a, b, c ∈ M,

a + (b + c) = (a + b) + c and 0 + a = a = a + 0.
The monoid is called commutative if a + b = b + a, for all a, b ∈ M. It is called
idempotent if a + a = a for all a ∈ M.
Let (M, 0, +) and (N, 1, ·) be two monoids. A function f : M → N is called a
homomorphism of monoids if f preserves the unit and binary operation, in the
sense that:

f (0) = 1 and f (a + b) = f (a) · f (b), for all a, b ∈ M.


For brevity we also say that such an f is a map of monoids, or simply a monoid
map.
The natural numbers N with addition form a commutative monoid (N, 0, +).
But also with multiplication they form a commutative monoid (N, 1, ·). The
function f (n) = 2n is a homomorphism of monoids f : (N, 0, +) → (N, 1, ·).
Various forms of collection types form monoids, with ‘union’ as binary op-
eration. We start with lists, in the next result. The proof is left as (an easy)
exercise to the reader.
Lemma 1.2.2. 1 For each set X, the set L(X) of lists over X is a monoid,
with the empty list [] ∈ L(X) as identity element, and with concatenation
++ : L(X) × L(X) → L(X) as binary operation:

[x1 , . . . , xn ] ++ [y1 , . . . , ym ] B [x1 , . . . , xn , y1 , . . . , ym ].

7
8 Chapter 1. Collections

This monoid (L(X), [], ++) is neither commutative nor idempotent.


2 For each function f : X → Y the associated map L( f ) : L(X) → L(Y) is a
homomorphism of monoids.

Thus, lists are monoids via concatenation. But there is more to say: lists are
free monoids. We shall occasionally make use of this basic property and so we
like to make it explicit. We shall encounter similar freeness properties for other
collection types.
Each element x ∈ X yields a singleton list unit(x) B [x] ∈ L(X). The result-
ing function unit : X → L(X) plays a special role, see also the next subsection.

Proposition 1.2.3. Let X be an arbitrary set and let (M, 0, +) be an arbitrary


monoid, with a function f : X → M. Then there is a unique homomorphism of
monoids f̄ : (L(X), [], ++) → (M, 0, +) with f̄ ◦ unit = f .
The homomorphism f̄ is called the free extension of f . Its freeness can be
expressed via a diagram, as below, where the vertical arrow is dashed, to indi-
cate uniqueness.

         unit
    X ---------> L(X)
       \          :
      f  \        : f̄ , homomorphism          (1.4)
          \       v
           ----->  M

Proof. Since f̄ preserves the identity element and satisfies f̄ ◦ unit = f , it is
determined on empty and singleton lists as:

    f̄ ([]) = 0    and    f̄ ([x]) = f (x).

Further, on a list [x1 , . . . , xn ] of length n ≥ 2 we necessarily have:

    f̄ ([x1 , . . . , xn ]) = f̄ ([x1 ] ++ · · · ++ [xn ])
                          = f̄ ([x1 ]) + · · · + f̄ ([xn ])
                          = f (x1 ) + · · · + f (xn ).

Thus, there is only one way to define f̄ . By construction, this f̄ : L(X) → M is
a homomorphism of monoids.
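The free extension is easy to sketch in Python; the helper `free_ext` is a hypothetical name, taking a monoid by its unit and binary operation:

```python
# The free extension f-bar of f : X → M along unit, as a fold over the list.
def free_ext(f, unit_elt, plus):
    def f_bar(xs):
        acc = unit_elt
        for x in xs:
            acc = plus(acc, f(x))
        return acc
    return f_bar

# Extending x ↦ 1 into the monoid (N, 0, +) yields the length function
# of Exercise 1.2.3.
length = free_ext(lambda x: 1, 0, lambda a, b: a + b)
assert length(['a', 'b', 'a']) == 3 and length([]) == 0
# Extending the identity into (N, 1, ·) yields the product of a list.
prod_list = free_ext(lambda n: n, 1, lambda a, b: a * b)
assert prod_list([2, 3, 4]) == 24
```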

The exercises below illustrate this result. For future use we introduce monoid
actions and their homomorphisms.

Definition 1.2.4. Let (M, 0, +) be a monoid.

1 An action of the monoid M on a set X is a function α : M×X → X satisfying:

α(0, x) = x and α(a + b, x) = α(a, α(b, x)),

for all a, b ∈ M and x ∈ X.


2 A homomorphism or map of monoid actions, from α : M × X → X to
  β : M × Y → Y, is a function f : X → Y satisfying:

    f (α(a, x)) = β(a, f (x))    for all a ∈ M, x ∈ X.

This equation corresponds to commutation of the following diagram.

    M × X ---id × f---> M × Y
      |                   |
    α |                   | β
      v                   v
      X --------f-------> Y

Monoid actions are quite common in mathematics. For instance, scalar mul-
tiplication of a vector space forms an action. Also, as we shall see, probabilistic
updating can be described via monoid actions. The action map α : M × X → X
can be understood intuitively as pushing the elements in X forward with a
quantity from M. It then makes sense that the zero-push is the identity, and
that a sum-push is the composition of two individual pushes.
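A minimal concrete instance of this intuition, sketched in Python: the additive monoid (N, 0, +) acting on the integers by translation.

```python
# A monoid action alpha : N × Z → Z, pushing an integer forward by a.
alpha = lambda a, x: a + x

assert alpha(0, 7) == 7                           # the zero-push is the identity
assert alpha(2 + 3, 7) == alpha(2, alpha(3, 7))   # a sum-push composes pushes
```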

1.2.2 Unit and flatten for lists


We proceed to describe more elementary structure for lists, in terms of special
‘unit’ and ‘flatten’ functions. In subsequent sections we shall see that this same
structure exists for other collection types, like powerset, multiset and distribu-
tion. This unit and flatten structure will turn out to be essential for sequential
composition. At the end of this chapter we will see that it is characteristic for
what is called a ‘monad’.
We have already seen the singleton-list function unit : X → L(X), given by
unit(x) B [x]. There is also a ‘flatten’ function which turns a list of lists into
a list by removing inner brackets. This function is written as flat : L(L(X)) →
L(X). It is defined as:

flat([[x11 , . . . , x1n1 ], . . . , [xk1 , . . . , xknk ]]) B [x11 , . . . , x1n1 , . . . , xk1 , . . . , xknk ].




The next result contains some basic properties about unit and flatten. These
properties will first be formulated in terms of equations, and then, alternatively
as commuting diagrams. The latter style is preferred in this book.

Lemma 1.2.5. 1 For each function f : X → Y one has:

unit ◦ f = L( f ) ◦ unit and flat ◦ L(L( f )) = L( f ) ◦ flat.


Equivalently, the following two diagrams commute.

      X ---unit---> L(X)              L(L(X)) ---flat---> L(X)
    f |              | L( f )       L(L( f )) |             | L( f )
      v              v                        v             v
      Y ---unit---> L(Y)              L(L(Y)) ---flat---> L(Y)

2 One further has:

    flat ◦ unit = id = flat ◦ L(unit)    and    flat ◦ flat = flat ◦ L(flat).

These two equations can equivalently be expressed via commutation of:

    L(X) --unit--> L(L(X)) <--L(unit)-- L(X)        L(L(L(X))) ---flat---> L(L(X))
        \             |               /                  |                    |
         \         flat              /           L(flat) |                    | flat
          \           |             /                    v                    v
           --------> L(X) <--------                  L(L(X)) -----flat-----> L(X)

Proof. We shall do the first cases of each item, leaving the second cases to the
interested reader. First, for f : X → Y and x ∈ X one has:

    (L( f ) ◦ unit)(x) = L( f )(unit(x)) = L( f )([x])
                       = [ f (x)] = unit( f (x)) = (unit ◦ f )(x).




Next, for the second item we take an arbitrary list [x1 , . . . , xn ] ∈ L(X). Then:

        (flat ◦ unit)([x1 , . . . , xn ]) = flat([[x1 , . . . , xn ]])
                                          = [x1 , . . . , xn ]
    (flat ◦ L(unit))([x1 , . . . , xn ]) = flat([unit(x1 ), . . . , unit(xn )])
                                          = flat([[x1 ], . . . , [xn ]])
                                          = [x1 , . . . , xn ].

The equations in item 1 of this lemma are so-called naturality equations.


They express that unit and flat work uniformly, independent of the set X in-
volved. The equations in item 2 show that L is a monad, see Section 1.9 for
more information.
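The unit and flatten maps, with the equations of Lemma 1.2.5 (2) spot-checked on sample data in Python (a check on examples, not a proof):

```python
# unit and flat for lists, plus the list functor L.
unit = lambda x: [x]
flat = lambda xss: [x for xs in xss for x in xs]
L = lambda f: (lambda xs: [f(x) for x in xs])

xs = [1, 2, 3]
assert flat(unit(xs)) == xs            # flat ∘ unit = id
assert flat(L(unit)(xs)) == xs         # flat ∘ L(unit) = id
xsss = [[[1], [2, 3]], [[4]]]
assert flat(flat(xsss)) == flat(L(flat)(xsss)) == [1, 2, 3, 4]
```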
The next result connects monoids with the unit and flatten maps.

Proposition 1.2.6. Let X be an arbitrary set.

1 To give a monoid structure (u, +) on X is the same as giving an L-algebra,


that is, a map α : L(X) → X satisfying α ◦ unit = id and α ◦ flat = α ◦

L(α), as in:

        X ---unit---> L(X)           L(L(X)) ---L(α)---> L(X)
          \            |                 |                 |
        id \           | α          flat |                 | α         (1.5)
            \          v                 v                 v
             ------->  X               L(X) ------α-----> X

2 Let (M1 , u1 , +1 ) and (M2 , u2 , +2 ) be two monoids, with corresponding L-


algebras α1 : L(M1 ) → M1 and α2 : L(M2 ) → M2 . A function f : M1 → M2
is then a homomorphism of monoids if and only if the diagram

      L(M1 ) ---L( f )---> L(M2 )
        |                    |
     α1 |                    | α2            (1.6)
        v                    v
       M1 --------f-------> M2

commutes.

This result says that instead of giving a binary operation + with an identity
element u we can give a single operation α that works on all sequences of el-
ements. This is not so surprising, since we can apply the sum multiple times.
The more interesting part is that the monoid equations can be captured uni-
formly by the diagrams/equations (1.5). We shall see that same diagrams also
work for other types of monoids (and collection types).

Proof. 1 If (X, u, +) is a monoid, we can define α : L(X) → X in one go as


α([x1 , . . . , xn ]) B x1 + · · · + xn . The latter sum equals the identity element
u when n = 0. Notice that the bracketing of the elements in the expression
x1 + · · · + xn does not matter, since a monoid is associative. The order does
matter, since we do not assume that the monoid is commutative. It is easy
to check the equations (1.5). From a more abstract perspective we define
α : L(X) → X via freeness (1.4).
In the other direction, assume an L-algebra α : L(X) → X. We then
define an identity element u B α([]) ∈ X and the sum of x, y ∈ X as
x + y B α([x, y]) ∈ X. We have to check that u is identity for + and that
+ is associative. This requires some fiddling with the equations (1.5):
    x + u = α([x, α([])]) = α([α(unit(x)), α([])])     by (1.5)
          = α(L(α)([ [x], [] ]))
          = α(flat([ [x], [] ]))                       by (1.5)
          = α([x]) = x.

Similarly one shows u + y = y. Next, associativity of + is obtained in a
similar manner:
    x + (y + z) = α([x, α([y, z])]) = α([α(unit(x)), α([y, z])])     by (1.5)
                = α(L(α)([ [x], [y, z] ]))
                = α(flat([ [x], [y, z] ]))                           by (1.5)
                = α(flat([ [x, y], [z] ]))
                = α(L(α)([ [x, y], [z] ]))                           by (1.5)
                = α([α([x, y]), α(unit(z))])
                = α([α([x, y]), z]) = (x + y) + z.                   by (1.5)


2 Now let f : M1 → M2 be a homomorphism of monoids. Diagram (1.6) then


commutes:
    (α2 ◦ L( f ))([x1 , . . . , xn ]) = α2 ([ f (x1 ), . . . , f (xn )])
                                     = f (x1 ) + · · · + f (xn )
                                     = f (x1 + · · · + xn )      since f is a homomorphism
                                     = ( f ◦ α1 )([x1 , . . . , xn ]).


In the other direction, if (1.6) commutes for a function f : M1 → M2 , then


f is a homomorphism of monoids, since:
    f (u1 ) = f (α1 ([])) = α2 (L( f )([])) = α2 ([]) = u2 ,      using (1.6)


Similarly one checks that sums are preserved:


    f (x +1 y) = f (α1 ([x, y])) = α2 (L( f )([x, y]))            using (1.6)
               = α2 ([ f (x), f (y)]) = f (x) +2 f (y).


We see that the algebraic structure (M, u, +) on the set M is expressed as an
algebra α : L(M) → M, namely a certain map to M. This will be a recurring
theme in the coming sections.

1.2.3 List combinatorics


Combinatorics is a subarea of mathematics focused on advanced forms of
counting. It is relevant for probability theory, since frequencies of occurrences
play an important role. We give a first taste of this, using lists.
We shall use the length ‖`‖ ∈ N of a list `, see Exercise 1.2.3 for more
details, and also the sum and product of a list of natural numbers, defined as:

    sum([n1 , . . . , nk ]) B n1 + · · · + nk = ∑i ni
    prod([n1 , . . . , nk ]) B n1 · . . . · nk = ∏i ni .

See also Exercise 1.2.6.


We restrict ourselves to the subset N>0 = {n ∈ N | n > 0} of positive


natural numbers. Clearly, we obtain restrictions sum, prod : L(N>0 ) → N>0 of
the above sum and product functions.
Now fix N ∈ N>0 . We are interested in lists ` ∈ L(N>0 ) with sum(`) = N.
These lists are in the inverse image:

sum −1 (N) B {` ∈ L(N>0 ) | sum(`) = N}.

For instance, for N = 4 this inverse image contains the eight lists:

[1, 1, 1, 1], [1, 1, 2], [1, 2, 1], [2, 1, 1], [2, 2], [1, 3], [3, 1], [4]. (1.7)

We can interpret the situation as follows. Suppose we have coins with value
n ∈ N>0 , for each n. Then we can ask, for an amount N: how many (ordered)
ways are there to lay out the amount N in coins? For N = 4 the different layouts
are given above. Other interpretations are possible: one can also think of the
sequences (1.7) as partitions of the numbers {1, 2, 3, 4}.
Here is a first, easy counting result.

Lemma 1.2.7. For N ∈ N>0 , the subset sum −1 (N) ⊆ L(N>0 ) has 2^(N−1)
elements.

Proof. We use induction on N, starting with N = 1. Obviously, only the list


[1] sums to 1, and indeed 2^(N−1) = 2^0 = 1.
For the induction step we use the familiar fact that for K ∈ N,

    ∑_{0≤k≤K} 1/2^k = (2^(K+1) − 1) / 2^K .    (∗)

This can be shown easily by induction on K.


We anticipate the notation | A | for the number of elements of a finite set A,
see Definition 1.3.6. Then, for N > 1,



    | sum −1 (N) | = | { [N] } ∪ { ` ++ [n] | 1 ≤ n ≤ N−1 and ` ∈ sum −1 (N−n) } |
                   = 1 + ∑_{1≤n≤N−1} | sum −1 (N−n) |
                   = 1 + ∑_{1≤n≤N−1} 2^(N−n−1)                       by (IH)
                   = 1 + 2^(N−1) · ( ∑_{0≤n≤N−1} 1/2^n − 1 )
                   = 1 + 2^(N−1) · ( (2^N − 1)/2^(N−1) − 1 )         by (∗)
                   = 1 + 2^N − 1 − 2^(N−1)
                   = 2^(N−1) .
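This count is easy to verify by brute force in Python, enumerating all lists of positive numbers with a given sum:

```python
# Brute-force check of Lemma 1.2.7: there are 2^(N-1) lists of positive
# natural numbers summing to N.
def compositions(N):
    if N == 0:
        return [[]]
    return [[n] + rest for n in range(1, N + 1) for rest in compositions(N - n)]

assert len(compositions(4)) == 8          # the eight lists of (1.7)
for N in range(1, 10):
    assert len(compositions(N)) == 2 ** (N - 1)
```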

Here is an elementary fact about coin lists. It has a definite probabilistic


flavour, since it involves what we later call a convex sum of probabilities ri ∈
[0, 1] with ∑i ri = 1. The proof of this result is postponed. It involves rather
complex probability distributions about coins, see Corollary 3.9.13. We are not
aware of an elementary proof.

Theorem 1.2.8. For each N ∈ N>0 ,


    ∑_{` ∈ sum −1 (N)}  1 / ( ‖`‖! · prod (`) )  =  1.    (1.8)

At this stage we only give an example, for N = 4, using the corresponding


lists ` in (1.7). The associated sum (1.8) is illustrated below.

    `                1/(‖`‖! · prod (`))      value

    [1, 1, 1, 1]     1/(4! · 1)               1/24
    [1, 1, 2]        1/(3! · 2)               1/12
    [1, 2, 1]        1/(3! · 2)               1/12
    [2, 1, 1]        1/(3! · 2)               1/12
    [2, 2]           1/(2! · 4)               1/8
    [1, 3]           1/(2! · 3)               1/6
    [3, 1]           1/(2! · 3)               1/6
    [4]              1/(1! · 4)               1/4

    with sum: 1/24 + 3 · 1/12 + 1/8 + 2 · 1/6 + 1/4 = 1
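Equation (1.8) can be checked exactly for small N in Python (this also covers Exercise 1.2.5 for N = 5); note that `math.prod` requires Python 3.8 or later:

```python
from fractions import Fraction
from math import factorial, prod

# Exact numerical check of Equation (1.8) for small N.
def compositions(N):
    if N == 0:
        return [[]]
    return [[n] + rest for n in range(1, N + 1) for rest in compositions(N - n)]

for N in range(1, 9):
    total = sum(Fraction(1, factorial(len(l)) * prod(l)) for l in compositions(N))
    assert total == 1
```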
Obviously, elements in a list are ordered. Thus, in (1.7) we distinguish
between coin layouts [1, 1, 2], [1, 2, 1] and [2, 1, 1]. However, when we
discuss informally which coins add up to 4 we do not take the order into
account, for instance in saying that we use two coins of value 1 and one coin
of 2, without caring about the order. In doing so, we are not using lists as col-
lection type, but multisets — in which the order of elements does not matter.


These multisets form an important alternative collection type; they are dis-
cussed from Section 1.4 onwards.

Exercises
1.2.1 Let X = {a, b, c} and Y = {u, v} be sets with a function f : X → Y
given by f (a) = u = f (c) and f (b) = v. Write `1 = [c, a, b, a] and
`2 = [b, c, c, c]. Compute consecutively:
• `1 ++ `2
• `2 ++ `1
• `1 ++ `1
• `1 ++ (`2 ++ `1 )
• (`1 ++ `2 ) ++ `1
• L( f )(`1 )
• L( f )(`2 )
• L( f )(`1 ) ++ L( f )(`2 )
• L( f )(`1 ++ `2 ).
1.2.2 We write log for the logarithm function with some base b > 0, so that
log(x) = y iff x = b^y . Verify that the logarithm function log is a map
of monoids:

    log : (R>0 , 1, ·) → (R, 0, +).

Often the log function is used to simplify a computation, by turning


multiplications into additions. Then one uses that log is precisely this
homomorphism of monoids. (An additional useful property is that log
is monotone: it preserves the order.)
1.2.3 Define a length function ‖−‖ : L(X) → N on lists via the freeness
property of Proposition 1.2.3.
1 Describe ‖`‖ for ` ∈ L(X) explicitly.
2 Elaborate what it means that ‖−‖ is a homomorphism of monoids.
3 Write ! for the unique function X → 1 and check that ‖−‖ is L(!).
Notice that the previous item can then be seen as an instance of
Lemma 1.2.2 (2).
1.2.4 1 Check that the list-flatten operation L(L(X)) → L(X) can be de-
scribed in terms of concatenation ++ as:

flat [`1 , . . . , `n ] = `1 ++ · · · ++ `n .



2 Now consider the correspondence of Proposition 1.2.6 (1). Con-


clude that the algebra α : L(L(X)) → L(X) associated with the
monoid (L(X), [], ++) from Lemma 1.2.2 (1) is flat.
1.2.5 Check Equation (1.8) yourself for N = 5.
1.2.6 Consider the set N of natural numbers with its additive monoid struc-
ture (0, +) and also with its multiplicative monoid structure (1, ·). Ap-
ply freeness from Proposition 1.2.3 with these two structures to define
two monoid homomorphisms:
    sum : L(N) → N    and    prod : L(N) → N.

1 Describe these maps explicitly on a sequence [n1 , . . . , nk ] of natural


numbers ni .
2 Make explicit what it means that they preserve the monoid struc-
ture.
3 Prove that for an arbitrary set X, the list-length function ‖−‖ from
  Exercise 1.2.3 satisfies:

      sum ◦ L(‖−‖) = ‖−‖ ◦ flat.

  In other words, the following diagram commutes.

        L(L(X)) ---L(‖−‖)---> L(N)
           |                    |
       flat|                    |sum
           v                    v
         L(X) ------‖−‖------> N

A fancy way to prove that length is such an algebra homomorphism


is to use the uniqueness in Proposition 1.2.3.

1.3 Subsets
The next collection type that will be studied is powerset. The symbol P is
commonly used for the powerset operator. We will see that there are many
similarities with lists L from the previous section. We again pay much attention
to monoid structures.
For an arbitrary set X we write P(X) for the set of all subsets of X, and
Pfin (X) for the set of finite subsets. Thus:

P(X) B {U | U ⊆ X} and Pfin (X) B {U ∈ P(X) | U is finite}.


If X is a finite set itself, there is no difference between P(X) and Pfin (X). In the
sequel we shall speak mostly about P, but basically all properties of interest
hold for Pfin as well.
First of all, P is a functor: it works both on sets and on functions. Given
a function f : X → Y we can define a new function P( f ) : P(X) → P(Y) by
taking the image of f on a subset. Explicitly, for U ⊆ X,

P( f )(U) = { f (x) | x ∈ U}.

The right-hand side is clearly a subset of Y, and thus an element of P(Y). We


have two equalities:

    P(id ) = id    and    P(g ◦ f ) = P(g) ◦ P( f ).

We can use functoriality for marginalisation: for a subset (relation) R ⊆ X × Y


on a product set we get its first marginal P(π1 )(R) ∈ P(X) as the subset:

P(π1 )(R) = {π1 (z) | z ∈ R} = {π1 (x, y) | (x, y) ∈ R} = {x | ∃y. (x, y) ∈ R}.
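The powerset action on functions and the resulting marginalisation can be sketched in Python, with frozensets standing in for finite subsets:

```python
# The powerset functor on functions: P(f) takes direct images.
def P(f):
    return lambda U: frozenset(f(x) for x in U)

f = lambda x: x % 2
assert P(f)({1, 2, 3, 4}) == frozenset({0, 1})

# Marginalising a relation R ⊆ X × Y via P(pi1).
R = {(1, 'a'), (1, 'b'), (2, 'a')}
pi1 = lambda p: p[0]
assert P(pi1)(R) == frozenset({1, 2})
```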

The next topic is the monoid structure on powersets. The first result is an
analogue of Lemma 1.2.2 and its proof is left to the reader.

Lemma 1.3.1. 1 For each set X, the powerset P(X) is a commutative and
idempotent monoid, with empty subset ∅ ∈ P(X) as identity element and
union ∪ of subsets of X as binary operation.
2 Each P( f ) : P(X) → P(Y) is a map of monoids, for f : X → Y.

Next we define unit and flatten maps for subsets, much like for lists. The
function unit : X → P(X) sends an element to a singleton subset: unit(x) B
{x}. The flatten function flat : P(P(X)) → P(X) is given by union: for A ⊆
P(X),
    flat(A) B ⋃ A = {x ∈ X | ∃ U ∈ A. x ∈ U}.

We mention, without proof, the following analogue of Lemma 1.2.5.

Lemma 1.3.2. 1 For each function f : X → Y the ‘naturality’ diagrams

      X ---unit---> P(X)              P(P(X)) ---flat---> P(X)
    f |              | P( f )       P(P( f )) |             | P( f )
      v              v                        v             v
      Y ---unit---> P(Y)              P(P(Y)) ---flat---> P(Y)

commute.


2 Additionally, the ‘monad’ diagrams below commute.

    P(X) --unit--> P(P(X)) <--P(unit)-- P(X)        P(P(P(X))) ---flat---> P(P(X))
        \             |               /                  |                    |
         \         flat              /           P(flat) |                    | flat
          \           |             /                    v                    v
           --------> P(X) <--------                  P(P(X)) -----flat-----> P(X)

1.3.1 From list to powerset


We have seen that lists and subsets behave in a similar manner. We strengthen
this connection by defining a support function supp : L(X) → Pfin (X) between
them, via:
supp([x1 , . . . , xn ]) B {x1 , . . . , xn }.


Thus, the support of a list is the subset of elements occurring in the list. The
support function removes order and multiplicities. The latter happens implic-
itly, via the set notation, above on the right-hand side. For instance,
supp([b, a, b, b, b]) = {a, b} = {b, a}.
Notice that there is no way to go in the other direction, namely Pfin (X) → L(X).
Of course, one can for each subset choose an order of the elements in order to
turn the subset into a list. However, this process is completely arbitrary and is
not uniform (natural).
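In Python the support map is simply a conversion to (frozen)sets; a few spot checks of its behaviour on sample lists:

```python
# The support map on Python lists: forget order and multiplicities.
supp = lambda xs: frozenset(xs)

assert supp(['b', 'a', 'b', 'b', 'b']) == frozenset({'a', 'b'})
# supp is a monoid map: supp(l1 ++ l2) = supp(l1) ∪ supp(l2), supp([]) = ∅.
l1, l2 = ['a', 'b'], ['b', 'c']
assert supp(l1 + l2) == supp(l1) | supp(l2)
assert supp([]) == frozenset()
```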
The support function interacts nicely with the structures that we have seen
so far. This is expressed in the result below, where we use the same notation
unit and flat for different functions, namely for L and for P. The context, and
especially the type of an argument, will make clear which one is meant.
Lemma 1.3.3. Consider the support map supp : L(X) → Pfin (X) defined above.
1 It is a map of monoids (L(X), [], ++) → (P(X), ∅, ∪).
2 It is natural, in the sense that for f : X → Y one has:

      L(X) ---supp---> Pfin (X)
        |                 |
  L( f )|                 | Pfin ( f )
        v                 v
      L(Y) ---supp---> Pfin (Y)

3 It commutes with the unit and flatten maps of list and powerset, as in:

        X ========== X          L(L(X)) --L(supp)--> L(Pfin (X)) --supp--> Pfin (Pfin (X))
    unit|            |unit         |                                           |
        v            v         flat|                                           |flat
      L(X) --supp--> Pfin (X)      v                                           v
                                 L(X) -------------------supp-------------> Pfin (X)


Proof. The first item is easy and skipped. For item 2,

    (Pfin ( f ) ◦ supp)([x1 , . . . , xn ]) = Pfin ( f )(supp([x1 , . . . , xn ]))
                                           = Pfin ( f )({x1 , . . . , xn })
                                           = { f (x1 ), . . . , f (xn )}
                                           = supp([ f (x1 ), . . . , f (xn )])
                                           = supp(L( f )([x1 , . . . , xn ]))
                                           = (supp ◦ L( f ))([x1 , . . . , xn ]).


In item 3, commutation of the first diagram is easy:

    (supp ◦ unit)(x) = supp([x]) = {x} = unit(x).




The second diagram requires a bit more work. Starting from a list of lists we
get:

    (flat ◦ supp ◦ L(supp))([[x11 , . . . , x1n1 ], . . . , [xk1 , . . . , xknk ]])
        = (⋃ ◦ supp)([supp([x11 , . . . , x1n1 ]), . . . , supp([xk1 , . . . , xknk ])])
        = (⋃ ◦ supp)([{x11 , . . . , x1n1 }, . . . , {xk1 , . . . , xknk }])
        = ⋃ ({{x11 , . . . , x1n1 }, . . . , {xk1 , . . . , xknk }})
        = {x11 , . . . , x1n1 , . . . , xk1 , . . . , xknk }
        = supp([x11 , . . . , x1n1 , . . . , xk1 , . . . , xknk ])
        = (supp ◦ flat)([[x11 , . . . , x1n1 ], . . . , [xk1 , . . . , xknk ]]).


1.3.2 Finite powersets and idempotent commutative monoids


We briefly review the relation between monoids and (finite) powersets. At
an abstract level the situation is much like for lists, as described in Subsec-
tion 1.2.1. For instance, the commutative idempotent monoids Pfin (X) are free,
like lists, in Proposition 1.2.3.

Proposition 1.3.4. Let X be a set and (M, 0, +) a commutative idempotent


monoid, with a function f : X → M between them. Then there is a unique
homomorphism of monoids f̄ : (Pfin (X), ∅, ∪) → (M, 0, +) with f̄ ◦ unit = f .
We represent this situation in the diagram below.

         unit
    X ---------> Pfin (X)
       \          :
      f  \        : f̄ , homomorphism          (1.9)
          \       v
           ----->  M


Proof. Given the requirements, the only way to define f̄ is as:

    f̄ ({x1 , . . . , xn }) B f (x1 ) + · · · + f (xn ),    with special case f̄ (∅) = 0.

The order in the above sum f (x1 ) + · · · + f (xn ) does not matter since M is
commutative. The function f̄ sends unions to sums since + is idempotent.

Commutative idempotent monoids can be described as algebras, in analogy


with Proposition 1.2.6.

Proposition 1.3.5. Let X be an arbitrary set.

1 To specify a commutative idempotent monoid structure (u, +) on X is the


same as giving a Pfin -algebra α : Pfin (X) → X, namely so that the diagrams

        X ---unit---> Pfin (X)            Pfin (Pfin (X)) ---Pfin (α)---> Pfin (X)
          \             |                        |                           |
        id \            | α                 flat |                           | α        (1.10)
            \           v                        v                           v
             -------->  X                     Pfin (X) ---------α---------> X

commute.
2 Let (M1 , u1 , +1 ) and (M2 , u2 , +2 ) be two commutative idempotent monoids,
with corresponding Pfin -algebras α1 : Pfin (M1 ) → M1 and α2 : Pfin (M2 ) →
M2 . A function f : M1 → M2 is a map of monoids if and only if the rectangle

      Pfin (M1 ) ---Pfin ( f )---> Pfin (M2 )
         |                            |
      α1 |                            | α2            (1.11)
         v                            v
        M1 ------------f-----------> M2

commutes.

Proof. This works very much like in the proof of Proposition 1.2.6. If (X, u, +)
is a monoid, we define α : Pfin (X) → X by freeness as α({x1 , . . . , xn }) B x1 +
· · · + xn . In the other direction, given α : Pfin (X) → X we define a sum as
x + y B α({x, y}) with unit u B α(∅). Clearly, this sum + on X is commutative
and idempotent.

This result concentrates on the finite powerset functor Pfin . One can also con-
sider algebras P(X) → X for the (general) powerset functor P. Such algebras
turn the set X into a complete lattice, see [116, 9, 92] for details.

1.3.3 Extraction
So far we have concentrated on how similar lists and subsets are: the only struc-
tural difference that we have seen up to now is that subsets form an idempotent


and commutative monoid. But there are other important differences. Here we
look at subsets of product sets, also known as relations.
The observation is that one can extract functions from a relation R ⊆ X × Y,
namely functions of the form extr 1 (R) : X → P(Y) and extr 2 (R) : Y → P(X),
given by:
extr 1 (R)(x) = {y ∈ Y | (x, y) ∈ R} extr 2 (R)(y) = {x ∈ X | (x, y) ∈ R}.
In fact, one can easily reconstruct the relation R from extr 1 (R), and also from
extr 2 (R), via:
R = {(x, y) | y ∈ extr 1 (R)(x)} = {(x, y) | x ∈ extr 2 (R)(y)}.
This all looks rather trivial, but such function extraction is less trivial for other
data types, as we shall see later on for distributions, where it will be called
disintegration.
Using the exponent notation from Subsection 1.1.2 we can summarise the
situation as follows. There are two isomorphisms:
P(Y)X  P(X × Y)  P(X)Y . (1.12)
Functions of the form X → P(Y) will later be called ‘channels’ from X to
Y, see Section 1.8. What we have just seen will then be described in terms of
‘extraction of channels’.
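To make function extraction concrete, here is a small Python sketch; the names extr1, extr2 and reconstruct are ours, chosen to mirror the text. A relation is a set of pairs, and an extracted channel is a dictionary mapping each element to a subset:

```python
def extr1(R):
    """Extract a channel X -> P(Y) from a relation R, given as a set of pairs."""
    channel = {}
    for x, y in R:
        channel.setdefault(x, set()).add(y)
    return channel

def extr2(R):
    """Extract a channel Y -> P(X) in the other direction."""
    channel = {}
    for x, y in R:
        channel.setdefault(y, set()).add(x)
    return channel

def reconstruct(channel):
    """Rebuild the relation from a channel X -> P(Y)."""
    return {(x, y) for x, ys in channel.items() for y in ys}

R = {(0, 'a'), (0, 'b'), (1, 'b')}
assert extr1(R) == {0: {'a', 'b'}, 1: {'b'}}
assert extr2(R) == {'a': {0}, 'b': {0, 1}}
assert reconstruct(extr1(R)) == R   # R is recovered, as in the text
```

The dictionaries are partial: an element that is related to nothing is simply absent, which corresponds to the channel value ∅.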

1.3.4 Powerset combinatorics


We describe the basics of counting subsets of finite sets. This will lead to the
binomial and multinomial coefficients that show up in many places in probability
theory.
Definition 1.3.6. For a finite set A we write | A| ∈ N for the number of elements
in A. Then, for an arbitrary set X and a number K ∈ N we define:
P[K](X) B {U ∈ Pfin (X) | |U | = K}.
For a number n ∈ N we also write n for the n-element set {0, 1, . . . , n − 1}. Thus
0 is also the empty set, and 1 = {0} is a singleton.
We recall that from a categorical perspective the sets 0 and 1 are special,
because they are, respectively, initial and final in the category of sets and func-
tions. We have | A × B| = | A| · | B| and | A + B| = | A| + | B|, where A + B is the
disjoint union (coproduct) of sets A and B. It is well known that | P(A) | = 2| A |
for a finite set A. Below we are interested in the number of subsets of a fixed
size.


The next lemma makes a basic result explicit, partly because it provides
valuable insight in itself, but also because we shall see generalisations later
on, for multisets instead of subsets. We use the familiar binomial and multi-
nomial coefficients. We recall their definitions, for natural numbers k ≤ n and
m1 , . . . , mℓ with ℓ ≥ 2 and Σi mi = m:

      \binom{n}{k} B \frac{n!}{k! \cdot (n-k)!}
      \qquad\text{and}\qquad
      \binom{m}{m1 , . . . , mℓ} B \frac{m!}{m1 ! \cdot \ldots \cdot mℓ !}.        (1.13)

Recall that a partition of a set X is a disjoint cover: a finite collection of subsets
U1 , . . . , Uk ⊆ X satisfying Ui ∩ U j = ∅ for i ≠ j and ∪i Ui = X. We do not
assume that the subsets Ui are non-empty.

Lemma 1.3.7. Let X be a (finite) set with n elements.

1 The binomial coefficient counts the number of subsets of size K ≤ n,

      | P[K](X) | = \binom{n}{K},    and especially:    | P[K](n) | = \binom{n}{K}.

2 For K1 , . . . , Kℓ ∈ N with ℓ ≥ 2 and n = Σi Ki , the multinomial coefficient
  counts the number of partitions, of appropriate sizes:

      | { Ū ∈ P[K1](X) × · · · × P[Kℓ](X) | U1 , . . . , Uℓ is a partition of X } |
          = \binom{n}{K1 , . . . , Kℓ}.

Each subset U ⊆ X forms a partition (U, ¬U) of X, with its complement
¬U = X \ U. Correspondingly the binomial coefficient can be described as a
multinomial coefficient:

      \binom{n}{K} = \binom{n}{K, n−K}.
Proof. 1 We use induction on the number n of elements in X. If n = 0, also
K = 0 and so P[K](X) contains only the empty set. At the same time, also
\binom{0}{0} = 1. Next, let X have n + 1 elements, say X = Y ∪ {y} where Y has n
elements and y ∉ Y. If K = 0 or K = n + 1, there is only one subset in X of
size K, and indeed \binom{n+1}{0} = 1 = \binom{n+1}{n+1}. We may thus assume 0 < K ≤ n. A
subset of size K in X is thus either a subset of Y, or it contains y, and is then
determined by a subset of size K − 1. Hence:

      | P[K](X) | = | P[K](Y) | + | P[K−1](Y) |
          (IH)
           =  \binom{n}{K} + \binom{n}{K−1}  =  \binom{n+1}{K},    by Exercise 1.3.5.


2 We now use induction on ℓ ≥ 2. The case ℓ = 2 has just been covered.
  Next:

      | { Ū ∈ P[K1](X) × · · · × P[Kℓ+1](X) | U1 , . . . , Uℓ+1 is a partition of X } |
        = | { Ū ∈ P[K1](X) × · · · × P[Kℓ+1](X) | U2 , . . . , Uℓ+1 is a partition of X \ U1 } |
          (IH)
           =  \binom{n}{K1} · \binom{n − K1}{K2 , . . . , Kℓ+1}
           =  \binom{n}{K1 , . . . , Kℓ+1},    by Exercise 1.3.6.
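Both counts in Lemma 1.3.7 are easy to confirm by brute-force enumeration; the following sketch (standard library only, with set sizes chosen arbitrarily by us) checks item (1) and a small instance of item (2):

```python
from itertools import combinations
from math import comb, factorial

n = 5
X = range(n)

# item (1): the number of K-element subsets is the binomial coefficient
for K in range(n + 1):
    assert len(list(combinations(X, K))) == comb(n, K)

# item (2): partitions of X into blocks of sizes K1, K2, K3 with K1+K2+K3 = n
K1, K2, K3 = 2, 2, 1
count = 0
for U1 in combinations(X, K1):
    rest = sorted(set(X) - set(U1))
    for U2 in combinations(rest, K2):
        count += 1          # the last block U3 is forced by U1 and U2
multinomial = factorial(n) // (factorial(K1) * factorial(K2) * factorial(K3))
assert count == multinomial == 30
```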
The following result will be useful later.

Lemma 1.3.8. Fix a number m ∈ N. Then:

      \lim_{n→∞} \binom{n}{m} / n^m = 1 / m! .

Proof. Since:

      \lim_{n→∞} \binom{n}{m} / n^m
        = \lim_{n→∞} n! / ( m! · (n−m)! · n^m )
        = (1/m!) · \lim_{n→∞} (n/n) · ((n−1)/n) · . . . · ((n−m+1)/n)
        = (1/m!) · ( \lim_{n→∞} n/n ) · ( \lim_{n→∞} (n−1)/n ) · . . . · ( \lim_{n→∞} (n−m+1)/n )
        = 1 / m! .
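The convergence in Lemma 1.3.8 is visible already for moderate n; a quick numeric check (with m = 3 chosen by us, so the limit is 1/3!):

```python
from math import comb, factorial

m = 3
for n in (10, 100, 10000):
    print(n, comb(n, m) / n**m)     # the ratios creep up towards 1/6

assert abs(comb(10000, m) / 10000**m - 1 / factorial(m)) < 1e-3
```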

Exercises
1.3.1 Continuing Exercise 1.2.1, compute:
• supp(`1 )
• supp(`2 )
• supp(`1 ++ `2 )
• supp(`1 ) ∪ supp(`2 )
• supp(L( f )(`1 ))
• Pfin ( f )(supp(`1 )).
1.3.2 We have used finite unions (∅, ∪) as monoid structure on P(X) in
Lemma 1.3.1 (1). Intersections (X, ∩) give another monoid structure
on P(X).


1 Show that the negation/complement function ¬ : P(X) → P(X), given by:

      ¬U = X \ U = {x ∈ X | x ∉ U},

is a homomorphism of monoids between (P(X), ∅, ∪) and (P(X), X, ∩),
in both directions. In fact, it forms an isomorphism of monoids,
since ¬¬U = U.
2 Prove that the intersections monoid structure is not preserved by
maps P( f ) : P(X) → P(Y).
Hint: Look at preservation of the unit X ∈ P(X).
1.3.3 Check that, in general,

      | P( f )(U) | ≠ | U |.

Conclude that the powerset functor P does not restrict to a functor
P[K], for K ∈ N.
1.3.4 Check that an ordered partition of a set X into m subsets can be identified
with a function X → m. This assumes that the partition consists of
non-empty subsets.
1.3.5 1 Prove what is called Pascal’s rule: for 0 < m ≤ n,

            \binom{n+1}{m} = \binom{n}{m} + \binom{n}{m−1}.        (1.14)

      2 Use this equation to prove:

            \sum_{0≤i≤n} \binom{m+i}{m} = \binom{n+m+1}{m+1}.

      3 Now use both previous equations to prove:

            \sum_{0≤i≤n} (n + 1 − i) · \binom{m+i}{m} = \binom{n+m+2}{m+2}.

1.3.6 Let K1 , . . . , Kℓ ∈ N be given, for ℓ > 2, with K = Σi Ki . Show that
      the multinomial coefficient can be written as a product of binomial
      coefficients, via:

            \binom{K}{K1 , . . . , Kℓ} = \binom{K}{K1} · \binom{K − K1}{K2 , . . . , Kℓ}.

1.3.7 Let K1 , . . . , Kℓ ∈ N be given, for ℓ ≥ 2, with K = Σi Ki . Show that:

            | P[K1](K) × P[K2](K − K1) × · · · × P[Kℓ−1](K − K1 − · · · − Kℓ−2) |
                = \binom{K}{K1 , . . . , Kℓ}.


1.4 Multisets
So far we have discussed two collection data types, namely lists and subsets of
elements. In lists, elements occur in a particular order, and may occur multiple
times (at different positions). Both properties are lost when moving from lists
to sets. In this section we look at multisets, which are ‘sets’ in which elements
may occur multiple times. Hence multisets are in between lists and subsets,
since they do allow multiple occurrences, but the order of their elements is
irrelevant.
The list and powerset examples are somewhat remote from probability the-
ory. But multisets are much more directly relevant: first, because we use a
similar notation for multisets and distributions; and second, because observed
data can be organised nicely in terms of multisets. For instance, for statistical
analysis, a document is often seen as a multiset of words, in which one keeps
track of the words that occur in the document together with their frequency
(multiplicity); in that case, the order of the words is ignored. Also, tables with
observed data can be organised naturally as multisets, see Subsection 1.4.1 be-
low. Learning from such tables will be described in Section 2.1 as a (natural)
transformation from multisets to distributions.
Despite their importance, multisets do not have a prominent role in pro-
gramming, like lists have. Eigenvalues of a matrix form a clear example where
the ‘multi’ aspect is ignored in mathematics: eigenvalues may occur multiple
times, so the proper thing to say is that a matrix has a multiset of eigenvalues.
One reason for not using multisets may be that there is no established notation.
We shall use a ‘ket’ notation | − i that is borrowed from quantum theory, but
interchangeably also a functional notation. Since multisets are less familiar,
we take time to introduce the basic definitions and properties, in Sections 1.4
– 1.7.
We start with an introduction about notation, terminology, and conventions
for multisets. Consider a set C = {R, G, B} for the three colours Red, Green,
Blue. An example of a multiset over C is:

2| Ri + 5|G i + 0| Bi.

In this multiset the element R occurs 2 times, G occurs 5 times, and B occurs
0 times. The latter means that B does not occur, that is, B is not an element
of the multiset. From a multiset perspective, we have 2 + 5 + 0 = 7 elements
— and not just 2. A multiset like this may describe an urn containing 2 red
balls, 5 green ones, and no blue balls. Such multisets are quite common. For
instance, the chemical formula C2 H3 O2 for vinegar may be read as a multiset


2|C i + 3| H i + 2| O i, containing 2 carbon (C) atoms, 3 hydrogen (H) atoms and


2 oxygen (O) atoms.
In a situation where we have multiple data items, say arising from succes-
sive experiments, a basic question to ask is: does the order of the experiments
matter? If so, we need to order the data elements as a list. If the order does not
matter we should use a multiset. More concretely, if six successive experiments
yield data items d, e, d, f, e, d and their order is relevant we should model the
data as the list [d, e, d, f, e, d]. When the order is irrelevant, we can capture the
data as the multiset 3| d i + 2|e i + 1| f i.
The funny brackets | − i are called ket notation; this is frequently used in
quantum theory. Here it is meaningless notation, used to separate the natural
numbers, called multiplicities, and the elements in the multiset.
Let X be an arbitrary set. Following the above ket-notation, a (finite) multiset
over X is an expression of the form:
n1 | x1 i + · · · + nk | xk i where ni ∈ N and xi ∈ X.
This expression is a formal sum, not an actual sum (for instance in R). We may
write it as Σi ni | xi i. We use the convention:
• 0| x i may be omitted; but it may also be written explicitly in order to empha-
sise that the element x does not occur in a multiset;
• a sum n| x i + m| x i is the same as (n + m)| x i;
• the order and brackets (if any) in a sum do not matter.
Thus, for instance, there is an equality of multisets:
2| a i + 5|b i + 0| ci + 4| b i = 9| bi + 2| ai.


There is an alternative description of multisets. A multiset can be defined as


a function ϕ : X → N that has finite support. The support supp(ϕ) ⊆ X is the
set supp(ϕ) = {x ∈ X | ϕ(x) ≠ 0}. For each element x ∈ X the number ϕ(x) ∈ N
indicates how often x occurs in the multiset ϕ. Such a function ϕ can also be
written as a formal sum Σx ϕ(x)| x i, where x ranges over supp(ϕ).
For instance, the multiset 9| b i + 2|a i over A = {a, b, c} corresponds to the
function ϕ : A → N given by ϕ(a) = 2, ϕ(b) = 9, ϕ(c) = 0. Its support is thus
{a, b} ⊆ A, with two elements. The number kϕk of elements in ϕ is 11.
We shall freely switch back and forth between the ket-description and the
function-description of multisets, and use whichever form is most convenient
for the goal at hand.
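Both descriptions translate directly to code; the sketch below uses Python's Counter for the function form and a list of (multiplicity, element) pairs for the ket form (this encoding is ours, not the book's):

```python
from collections import Counter

# the formal sum 2|a> + 5|b> + 0|c> + 4|b> as (multiplicity, element) pairs
ket_sum = [(2, 'a'), (5, 'b'), (0, 'c'), (4, 'b')]

# convert to the function form: merge equal kets, drop multiplicity 0
phi = Counter()
for n, x in ket_sum:
    phi[x] += n
phi = Counter({x: n for x, n in phi.items() if n != 0})

assert phi == Counter({'b': 9, 'a': 2})     # the multiset 9|b> + 2|a>
assert set(phi) == {'a', 'b'}               # its support
assert sum(phi.values()) == 11              # its size ||phi||
```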
Having said this, we stretch the idea of a multiset and do not only allow
natural numbers n ∈ N as multiplicities, but also allow non-negative numbers
r ∈ R≥0 . Thus we can have a multiset of the form (2/3)| a i + π| b i where π ∈ R≥0


is the famous constant of Archimedes: the ratio of a circle’s circumference


to its diameter. This added generality will be useful at times, although many
examples of multisets will simply have natural numbers as multiplicities. We
call such multisets natural.
Definition 1.4.1. For a set X we shall write M(X) for the set of all multisets
over X. Thus, using the function approach:
M(X) B {ϕ : X → R≥0 | supp(ϕ) is finite}.
The elements of M(X) may be called mass functions, as in [147].
We shall write N(X) ⊆ M(X) for the subset of natural multisets, with natu-
ral numbers as multiplicities — also called bags or urns. Thus, N(X) contains
functions ϕ ∈ M(X) with ϕ(x) ∈ N, for all x ∈ X.
We shall write M∗ (X) for the set of non-empty multisets. Thus:
M∗ (X) B {ϕ ∈ M(X) | supp(ϕ) is non-empty}
= {ϕ : X → R≥0 | supp(ϕ) is finite and non-empty}.
Similarly, N∗ (X) ⊆ M∗ (X) contains the non-empty natural multisets.
For a number K we shall write M[K](X) ⊆ M(X) and N[K](X) ⊆ N(X) for
the subsets of multisets with K elements. Thus:

      M[K](X) B {ϕ ∈ M(X) | kϕk = K}    where    kϕk B Σx ϕ(x).        (1.15)

This expression kϕk gives the size of the multiset, that is, its total number of
elements.
All of M, N, M∗ , N∗ , M[K], N[K] are functorial, in the same way. Hence
we concentrate on M. For a function f : X → Y we can define M( f ) : M(X) →
M(Y). When we see a multiset ϕ ∈ M(X) as an urn containing coloured balls,
with colours from X, then M( f )(ϕ) ∈ M(Y) is the urn with ‘repainted’ balls,
where the new colours are taken from the set Y. The function f : X → Y defines
the transformation of colours. It tells that a ball of colour x ∈ X in ϕ should be
repainted with colour f (x) ∈ Y.
The urn M( f )(ϕ) with repainted balls can be defined in two equivalent ways:
      M( f )( Σi ri | xi i ) B Σi ri | f (xi ) i      or as      M( f )(ϕ)(y) B Σ_{x ∈ f −1 (y)} ϕ(x).

It may take a bit of effort to see that these two descriptions are the same, see
Exercise 1.4.1 below. Notice that in the sum Σi ri | f (xi ) i it may happen that
f (xi1 ) = f (xi2 ) for xi1 ≠ xi2 , so that ri1 and ri2 are added together. Thus, the sup-
port of M( f )( Σi ri | xi i ) may have fewer elements than the support of Σi ri | xi i,
but the sum of all multiplicities is the same in M( f )( Σi ri | xi i ) and Σi ri | xi i,
see Exercise 1.5.2 below.
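The two descriptions of M( f ) can be played off against each other in code; a small dictionary-based sketch, with a sample function f of our own choosing:

```python
from collections import defaultdict

def M(f, phi):
    """Ket-style description: repaint each ball x as f(x), adding multiplicities."""
    out = defaultdict(float)
    for x, r in phi.items():
        out[f(x)] += r
    return dict(out)

def M_fibre(f, phi, codomain):
    """Function-style description: M(f)(phi)(y) = sum of phi(x) over x in f^-1(y)."""
    return {y: sum(r for x, r in phi.items() if f(x) == y) for y in codomain}

phi = {'a': 3.0, 'b': 2.0, 'c': 8.0}
f = {'a': 1, 'b': 1, 'c': 0}.get            # a sample function {a,b,c} -> {0,1}
assert M(f, phi) == {1: 5.0, 0: 8.0}        # a and b are merged; total mass 13 is kept
assert M_fibre(f, phi, {0, 1}) == M(f, phi)
```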


Applying M to a projection function πi : X1 ×· · ·× Xn → Xi yields a function


M(πi ), from the set M(X1 ×· · ·×Xn ) of multisets over a product to the set M(Xi )
of multisets over a component. This M(πi ) is called a marginal function or
simply a marginal. It computes what is ‘on the side’, in the marginal of a table,
as will be illustrated in Subsection 1.4.1 below.
Multisets, like lists and subsets, form a monoid. In terms of urns with coloured
balls, taking the sum of two multisets corresponds to pouring the balls from
two urns into a new urn.

Lemma 1.4.2. 1 The set M(X) of multisets over X is a commutative monoid.


In functional form, addition and zero (identity) element 0 ∈ M(X) are de-
fined as:

(ϕ + ψ)(x) B ϕ(x) + ψ(x) 0(x) B 0.

These sums restrict to N(X).


2 The set M(X) is also a cone: it is closed under ‘scalar’ multiplication with
non-negative numbers r ∈ R≥0 , via:

(r · ϕ)(x) B r · ϕ(x).

This scalar multiplication r · (−) : M(X) → M(X) preserves the sums (0, +)
from the previous item, and is thus a map of monoids.
3 For each f : X → Y, the function M( f ) : M(X) → M(Y) is a map of
monoids and also of cones. The latter means: M( f )(r · ϕ) = r · M( f )(ϕ).

The fact that M( f ) preserves sums can be understood informally as follows.


If we have two urns, we can first combine their contents and then repaint ev-
erything. Alternatively, we can first repaint the balls in the two urns separately,
and then throw them together. The result is the same, in both cases.
The element 0 ∈ M(X) used in item (1) is the empty multiset, that is, the
urn containing no balls. Similarly, the sum of multisets + is implicit in the ket-
notation. The set N(X) of natural multisets is not closed in general under scalar
multiplication with r ∈ R≥0 . It is closed under scalar multiplication with n ∈ N,
but such multiplications add nothing new since they can also be described via
repeated addition.

1.4.1 Tables of data as multisets


Let’s assume that a group of 35 children in the age range 0 − 10 is participat-
ing in some study, where the number of children of each age is given by the
following table.


      age        0   1   2   3   4   5   6   7   8   9   10
      children   2   0   4   3   5   3   2   5   5   2   4

We can represent this table as a natural multiset over the set of ages {0, 1, . . . , 10}:

      2| 0 i + 4| 2 i + 3| 3 i + 5| 4 i + 3| 5 i + 2| 6 i + 5| 7 i + 5| 8 i + 2| 9 i + 4| 10 i.

Notice that there is no summand for age 1 because of our convention to omit
expressions like 0| 1 i with multiplicity 0. We can visually represent the above
age data/multiset in the form of a histogram:

      [histogram of the age data omitted]        (1.16)

Here is another example, not with numerical data, in the form of ages, but
with nominal data, in the form of blood types. Testing the blood type of 50
individuals gives the following table.

      blood type    A    B    O    AB
      individuals   10   15   18   7

This corresponds to a (natural) multiset over the set {A, B, O, AB} of blood
types, namely to:

      10| A i + 15| B i + 18| O i + 7| AB i.

It gives rise to the following bar graph, in which there is no particular ordering
of elements. For convenience, we follow the order of the above table.

      [bar graph of the blood type data omitted]        (1.17)


Next, consider the two-dimensional table (1.18) below where we have com-
bined numeric information about blood pressure (either high H, or low L) and
certain medicines (either type 1, type 2, or no medicine, indicated as 0). There
is data about 100 study participants:

                no medicine   medicine 1   medicine 2   totals
      high           10            35           25        70
      low             5            10           15        30        (1.18)
      totals         15            45           40       100

We claim that we can capture this table as a (natural) multiset. To do so, we
first form sets B = {H, L} for blood pressure values, and M = {0, 1, 2} for types
of medicine. The above table can then be described as a natural multiset τ over
the product set/space B × M, that is, as an element τ ∈ N(B × M), namely:

      τ = 10| H, 0 i + 35| H, 1 i + 25| H, 2 i + 5| L, 0 i + 10| L, 1 i + 15| L, 2 i.

Such a multiset can be plotted in three dimensions as:

      [three-dimensional bar plot of τ omitted]

We see that Table (1.18) contains ‘totals’ in its vertical and horizontal mar-
gins. They can be obtained from τ as marginals, using the functoriality of N.
This works as follows. Applying the natural multiset functor N to the two pro-
jections π1 : B × M → B and π2 : B × M → M yields marginal distributions on


B and M, namely:
N(π1 )(τ) = 10|π1 (H, 0)i + 35| π1 (H, 1)i + 25| π1 (H, 2)i
+ 5| π1 (L, 0)i + 10| π1 (L, 1)i + 15| π1 (L, 2)i
= 10| H i + 35| H i + 25| H i + 5| L i + 10| L i + 15| L i
= 70| H i + 30| L i.
N(π2 )(τ) = (10 + 5)| 0i + (35 + 10)| 1 i + (25 + 15)| 2 i
= 15|0 i + 45| 1 i + 40|2 i.
The expression ‘marginal’ is used to describe such totals in the margin of a
multidimensional table. In Section 2.1 we describe how to obtain probabilities
from tables in a systematic manner.
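The marginal computations above are mechanical; a sketch reproducing them with Counter (the encoding of τ as a Counter over pairs is ours):

```python
from collections import Counter

tau = Counter({('H', 0): 10, ('H', 1): 35, ('H', 2): 25,
               ('L', 0): 5,  ('L', 1): 10, ('L', 2): 15})

def marginal(tau, proj):
    """Apply N(pi) to a joint multiset: repaint each pair by the projection pi."""
    out = Counter()
    for pair, n in tau.items():
        out[proj(pair)] += n
    return out

assert marginal(tau, lambda p: p[0]) == Counter({'H': 70, 'L': 30})
assert marginal(tau, lambda p: p[1]) == Counter({0: 15, 1: 45, 2: 40})
```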

1.4.2 Unit and flatten for multisets


As may be expected by now, there are also unit and flatten maps for multisets.
The unit function unit : X → M(X) is simply unit(x) B 1| x i. Flattening in-
volves turning a multiset of multisets into a multiset. Concretely, this is done
as:
 
      flat( (1/3)| 2| a i + 2| c i i + 5| 1| b i + (1/6)| c i i ) = (2/3)| a i + 5| b i + (3/2)| c i.

More generally, flattening is the function flat : M(M(X)) → M(X) with:

      flat( Σi ri | ϕi i ) B Σx ( Σi ri · ϕi (x) ) | x i.

Notice that the big outer sum Σx is a formal one, whereas the inner sum Σi is
an actual one, in R≥0 , see the earlier example.
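The flatten map is thus a weighted, pointwise sum of the inner multisets; the sketch below reproduces the example above, using Fraction so that 1/3 and 1/6 stay exact (the pair-list encoding of the outer multiset is ours):

```python
from collections import defaultdict
from fractions import Fraction as F

def flat(Phi):
    """Phi is an outer multiset, given as (r_i, phi_i) pairs with phi_i a dict."""
    out = defaultdict(F)
    for r, phi in Phi:
        for x, n in phi.items():
            out[x] += r * n     # actual sum in R>=0, indexed by the formal ket x
    return dict(out)

Phi = [(F(1, 3), {'a': 2, 'c': 2}),
       (F(5),    {'b': 1, 'c': F(1, 6)})]
assert flat(Phi) == {'a': F(2, 3), 'b': F(5), 'c': F(3, 2)}
```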
The following result, about unit and flatten, does not come as a surprise
anymore. We formulate it for general multisets M, but it restricts to natural
multisets N.
Lemma 1.4.3. 1 For each function f : X → Y the two rectangles

         unit                            flat
    X ----------> M(X)      M(M(X)) ----------> M(X)
    |               |           |                 |
  f |               | M( f )    | M(M( f ))       | M( f )
    v               v           v                 v
    Y ----------> M(Y)      M(M(Y)) ----------> M(Y)
         unit                            flat

commute.
2 The next two diagrams also commute.

          unit               M(unit)                          flat
   M(X) -------> M(M(X)) <--------- M(X)      M(M(M(X))) ----------> M(M(X))
        \           |           /                  |                    |
      id \          | flat     / id        M(flat) |                    | flat
          v         v         v                    v                    v
                  M(X)                        M(M(X)) ----------------> M(X)
                                                           flat


The next result shows that natural multisets are free commutative monoids.
Arbitrary multisets are also free, but for other algebraic structures, see Exer-
cise 1.4.9.
Proposition 1.4.4. Let X be a set and (M, 0, +) a commutative monoid. Each
function f : X → M has a unique extension to a homomorphism of monoids
f̄ : (N(X), 0, +) → (M, 0, +) with f̄ ◦ unit = f . The diagram below captures
the situation, where the dashed arrow is used for uniqueness.

              unit
       X -------------> N(X)
         \                :
          \               :  f̄ , homomorphism                    (1.19)
         f \              v
            '-----------> M

Proof. One defines:

      f̄ ( n1 | x1 i + · · · + nk | xk i ) B n1 · f (x1 ) + · · · + nk · f (xk ),

where we write n · a for the n-fold sum a + · · · + a in a monoid.

The unit and flatten operations for (natural) multisets can be used to cap-
ture commutative monoids more precisely, in analogy with Propositions 1.2.6
and 1.3.5.
Proposition 1.4.5. Let X be an arbitrary set.

1 A commutative monoid structure (u, +) on X corresponds to an N-algebra
α : N(X) → X making the two diagrams below commute.

         unit                          N(α)
    X ----------> N(X)      N(N(X)) ----------> N(X)
      \             |            |                |
       \            | α     flat |                | α        (1.20)
     id \           v            v                v
         '--------> X        N(X) --------------> X
                                        α

2 Let (M1 , u1 , +1 ) and (M2 , u2 , +2 ) be two commutative monoids, with corre-


sponding N-algebras α1 : N(M1 ) → M1 and α2 : N(M2 ) → M2 . A function
f : M1 → M2 is a map of monoids if and only if the rectangle

              N( f )
    N(M1 ) ----------> N(M2 )
       |                  |
    α1 |                  | α2                               (1.21)
       v                  v
       M1 --------------> M2
                f

commutes.
Proof. Analogously to the proof of Proposition 1.2.6: if (X, u, +) is a commu-
tative monoid, we define α : N(X) → X by turning formal sums into actual
sums: α( Σi ni | xi i ) B Σi ni · xi , see (the proof of) Proposition 1.4.4. In the
other direction, given α : N(X) → X we define a sum as x + y B α(1| x i + 1| y i)
with unit u B α(0). Obviously, + is commutative.

1.4.3 Extraction
At the end of the previous section we have seen how to extract a function
(channel) from a binary subset, that is, from a relation. It turns out that one
can do the same for a binary multiset, that is, for a table. More specifically, in
terms of exponents, there are isomorphisms:

M(Y)X  M(X × Y)  M(X)Y . (1.22)

This is analogous to (1.12) for powerset.


How does this work in detail? Suppose we have an arbitrary multiset/table
σ ∈ M(X × Y). From σ one can extract a function extr 1 (σ) : X → M(Y), and
also extr 2 (σ) : Y → M(X), via:

      extr 1 (σ)(x) = Σ_{y∈Y} σ(x, y)| y i          extr 2 (σ)(y) = Σ_{x∈X} σ(x, y)| x i.

Notice that we are — conveniently — mixing ket and function notation for
multisets. Conversely, σ can be reconstructed from extr 1 (σ), and also from
extr 2 (σ), via σ(x, y) = extr 1 (σ)(x)(y) = extr 2 (σ)(y)(x).
Functions of the form X → M(Y) will also be used as channels from X to
Y, see Section 1.8. That’s why we often speak about ‘channel extraction’.
As illustration, we apply extraction to the medicine - blood pressure Ta-
ble (1.18), described as the multiset τ ∈ M(B × M). It gives rise to two channels
extr 1 (τ) : B → M(M) and extr 2 (τ) : M → M(B). Explicitly:

      extr 1 (τ)(H) = Σ_{x∈M} τ(H, x)| x i = 10| 0 i + 35| 1 i + 25| 2 i
      extr 1 (τ)(L) = Σ_{x∈M} τ(L, x)| x i = 5| 0 i + 10| 1 i + 15| 2 i.

We see that this extracted function captures the two rows of Table (1.18). Simi-
larly we get the columns via the second extracted function:

      extr 2 (τ)(0) = 10| H i + 5| L i
      extr 2 (τ)(1) = 35| H i + 10| L i
      extr 2 (τ)(2) = 25| H i + 15| L i.
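Channel extraction from τ amounts to grouping the table by one coordinate; a sketch, again with τ encoded (by us) as a Counter over pairs:

```python
from collections import Counter

tau = Counter({('H', 0): 10, ('H', 1): 35, ('H', 2): 25,
               ('L', 0): 5,  ('L', 1): 10, ('L', 2): 15})

def extr1(sigma):
    """First extraction: x |-> multiset over Y, i.e. the rows of the table."""
    out = {}
    for (x, y), n in sigma.items():
        out.setdefault(x, Counter())[y] += n
    return out

def extr2(sigma):
    """Second extraction: y |-> multiset over X, i.e. the columns."""
    out = {}
    for (x, y), n in sigma.items():
        out.setdefault(y, Counter())[x] += n
    return out

assert extr1(tau)['H'] == Counter({0: 10, 1: 35, 2: 25})
assert extr2(tau)[0] == Counter({'H': 10, 'L': 5})
# sigma is recovered via sigma(x, y) = extr1(sigma)(x)(y)
assert all(extr1(tau)[x][y] == n for (x, y), n in tau.items())
```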


Exercises
1.4.1 In the setting of Exercise 1.2.1, consider the multisets ϕ = 3| ai +
2| bi + 8| c i and ψ = 3|b i + 1| ci. Compute:
• ϕ+ψ
• ψ+ϕ
• M( f )(ϕ), both in ket-formulation and in function-formulation
• idem for M( f )(ψ)
• M( f )(ϕ + ψ)
• M( f )(ϕ) + M( f )(ψ).
1.4.2 Consider, still in the context of Exercise 1.2.1, the ‘joint’ multiset
ϕ ∈ M(X × Y) given by ϕ = 2|a, ui + 3| a, v i + 5| c, vi. Determine the
marginals M(π1 )(ϕ) ∈ M(X) and M(π2 )(ϕ) ∈ M(Y).
1.4.3 Consider the chemical equation for burning methane:

CH4 + 2 O2 −→ CO2 + 2 H2 O.

Check that there is an underlying equation of multisets:


 
      flat( 1| 1|C i + 4| H i i + 2| 2| O i i )
          = flat( 1| 1|C i + 2| O i i + 2| 2| H i + 1| O i i ).

It expresses the law of conservation of mass.


1.4.4 Check that M(id ) = id and M(g ◦ f ) = M(g) ◦ M( f ).
1.4.5 Show that for each natural number K the mappings X 7→ M[K](X)
and X 7→ N[K](X) are functorial. Notice the difference with P[K],
see Exercise 1.3.3.
1.4.6 Prove Lemma 1.4.3.
1.4.7 Consider both Lemma 1.4.3 and Proposition 1.4.5.
1 Notice that an abstract way of seeing that N(X) is a commutative
monoid is via the properties of the flatten map flat : N(N(X)) →
N(X).
2 Notice also that this flatten map is a homomorphism of monoids.
1.4.8 Verify that the support map supp : M(X) → Pfin (X) commutes with
extraction functions, in the sense that the following diagram com-
mutes.
                     supp
      M(X × Y) --------------> Pfin (X × Y)
          |                         |
   extr 1 |                         | extr 1
          v                         v
      M(Y)X  ----------------> Pfin (Y)X
                    supp X


Equationally, this amounts to showing that for τ ∈ M(X × Y) and
x ∈ X one has:

      extr 1 ( supp(τ) )(x) = supp( extr 1 (τ)(x) ).

Here we use that supp X ( f ) B supp ◦ f , so that supp X ( f )(x) =
supp( f (x)).
1.4.9 In Proposition 1.4.4 we have seen that natural multisets N(X) form
free commutative monoids. What about general multisets M(X)? They
form free cones. Briefly, a cone is a commutative monoid M with
scalar multiplication r · (−) : M → M, for each r ∈ R≥0 , that is a
homomorphism of monoids. It is like a vector space, not over all re-
als, but only over the non-negative reals. Homomorphisms of cones
preserve such scalar multiplications.
Let X be a set and (M, 0, +) a cone, with a function f : X → M.
Prove that there is a unique homomorphism of cones f : M(X) → M
with f ◦ unit = f .

1.5 Multisets in summations


The current and next section will dive deeper into the use of natural multisets,
in combinatorics as preparation for later use. This section will focus on the
use of multisets in summations, such as the multinomial theorem — which is
probably best known in binary form, as the binomial theorem for expanding
sums (a + b)n . We shall describe various extensions, both for finite and infinite
sums. All material in this section is standard, but its presentation in terms of
multisets is not.
Recall that a multiset ϕ = Σi ri | xi i is called natural when all the multiplici-
ties ri are natural numbers. Such a multiset is also called a bag or an urn. Urns
are often used as illustrations in simple probabilistic arguments, starting with:
what is the probability of drawing two red balls — with or without replacement
— from an urn with initially three red balls and two blue ones. In this book we
shall describe such an urn as the multiset 3| R i + 2| Bi. In such arguments one
often encounters elementary combinatorial questions about (combinations of)
numbers of balls in urns. This section provides some standard results, building
on Subsection 1.3.4, where subsets are counted — instead of multisets.
In this section we focus on counting with multisets, in particular in (infinite)
sums of powers. The next section focuses on counting multisets themselves,
where we ask, for instance, how many multisets of size K are there on a set
with n elements?


There are several ways to associate a natural number with a multiset ϕ. For
instance, we can look at the size of its support | supp(ϕ) | ∈ N, or at its size, as
total number of elements kϕk = Σx ϕ(x) ∈ R≥0 . This size is a natural number
when ϕ is a natural multiset. Below we will introduce several more such num-
bers for natural multisets ϕ, namely the factorial ϕ! and the multiset coefficient
( ϕ ), and later on also a binomial coefficient \binom{ψ}{ϕ}.

Definition 1.5.1. 1 For two multisets ϕ, ψ ∈ N(X) we write:

ϕ ≤ ψ ⇐⇒ ∀x ∈ X. ϕ(x) ≤ ψ(x).

When ϕ ≤ ψ we define subtraction ψ − ϕ of multisets as the obvious multiset,
defined pointwise as: (ψ − ϕ)(x) = ψ(x) − ϕ(x).


2 For an additional number K ∈ N we use:

ϕ ≤K ψ ⇐⇒ ϕ ∈ N[K](X) and ϕ ≤ ψ.

3 For a collection of numbers r = (r x ) x∈X we write:

      r^ϕ B Π_{x∈X} (r x)^{ϕ(x)} = ~r^ϕ .

The latter vector notation is appropriate in a situation with a particular order.


4 The factorial ϕ! of a natural multiset ϕ ∈ N(X) is the product of the factorials
of its multiplicities:

      ϕ! B Π_{x∈supp(ϕ)} ϕ(x)!        (1.23)

5 The multiset coefficient ( ϕ ) is defined as:

      ( ϕ ) B kϕk! / ϕ! = kϕk! / Πx ϕ(x)! = \binom{kϕk}{ϕ(x1 ), . . . , ϕ(xn )}.

The latter formulation, using the multinomial coefficient (1.13), assumes
that the support of ϕ is ordered as [x1 , . . . , xn ].

For instance,

      ( 3| R i + 2| B i )! = 3! · 2! = 12      and      ( 3| R i + 2| B i ) = 5! / 12 = 10.
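The factorial and the multiset coefficient are one-liners in code; a sketch (the names mfact and mcoeff are ours, not the book's):

```python
from math import factorial, prod

def mfact(phi):
    """The factorial of a natural multiset: product of factorials of multiplicities."""
    return prod(factorial(n) for n in phi.values())

def mcoeff(phi):
    """The multiset coefficient ( phi ) = ||phi||! divided by the factorial."""
    return factorial(sum(phi.values())) // mfact(phi)

urn = {'R': 3, 'B': 2}          # the urn 3|R> + 2|B>
assert mfact(urn) == 12         # 3! · 2!
assert mcoeff(urn) == 10        # 5! / 12
```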
The multiset coefficient ( ϕ ) in item (5) counts the number of ways of
putting kϕk items in supp(ϕ) = {x1 , . . . , xn } urns, with the restriction that ϕ(xi )
items go into urn xi . Alternatively, ( ϕ ) is the number of partitions (Ui ) of a
set X with kϕk elements, where | Ui | = ϕ(xi ).

The traditional notation \binom{N}{m1 , . . . , mk} for multinomial coefficients in (1.13) is
suboptimal for two reasons: first, the number N is superfluous, since it is deter-
mined by the mi as N = Σi mi ; second, the order of the mi is irrelevant. These
disadvantages are resolved by the multiset variant ( ϕ ). It has our preference.
We recall the recurrence relations:

      \binom{K−1}{k1 − 1, . . . , kn} + · · · + \binom{K−1}{k1 , . . . , kn − 1} = \binom{K}{k1 , . . . , kn}        (1.24)

for multinomial coefficients. A snappy re-formulation, for a natural multiset ϕ,
is:

      Σ_{x∈supp(ϕ)} ( ϕ − 1| x i ) = ( ϕ ).        (1.25)

Multinomial coefficients (1.13) are useful, for instance in the Multinomial
Theorem (see e.g. [145]):

      ( r1 + · · · + rn )^K = Σ_{k1 +···+kn = K} \binom{K}{k1 , . . . , kn} · r1^{k1} · . . . · rn^{kn} .        (1.26)

An equivalent formulation using multisets is:

      ( r1 + · · · + rn )^K = Σ_{ϕ ∈ N[K]({1,...,n})} ( ϕ ) · ~r^ϕ .        (1.27)
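Equation (1.27) can be verified numerically by enumerating all natural multisets of size K on {1, . . . , n}; a brute-force sketch, with r and K chosen arbitrarily by us:

```python
from itertools import product
from math import factorial, prod

def msets(n, K):
    """Natural multisets of size K on n elements, as tuples of multiplicities."""
    return [ks for ks in product(range(K + 1), repeat=n) if sum(ks) == K]

def mcoeff(ks):
    return factorial(sum(ks)) // prod(factorial(k) for k in ks)

r = (0.5, 0.3, 0.2)
K = 4
lhs = sum(r) ** K
rhs = sum(mcoeff(ks) * prod(ri ** ki for ri, ki in zip(r, ks))
          for ks in msets(len(r), K))
assert abs(lhs - rhs) < 1e-12
assert len(msets(3, 2)) == 6    # six multisets of size 2 on three elements
```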

There is an ‘infinite’ version of this result, known as the (Binomial) Series


Theorem. It holds more generally than formulated in the first item below, for
complex numbers, with adapted meaning of the binomial coefficient, but that’s
beyond the current scope.
Theorem 1.5.2. Fix a natural number K.

1 For a real number r ∈ [0, 1),

      Σ_{n≥0} \binom{n+K}{K} · r^n = 1 / (1 − r)^{K+1} .

2 As a special case,

      Σ_{n≥0} r^n = 1 / (1 − r) .

3 Another consequence is, still for r ∈ [0, 1),

      Σ_{n≥1} n · r^n = r / (1 − r)^2 .

4 For r1 , . . . , rm ∈ [0, 1] with Σi ri < 1,

      Σ_{n1 , ..., nm ≥ 0} \binom{K + Σi ni}{K, n1 , . . . , nm} · Πi ri^{ni} = 1 / (1 − Σi ri)^{K+1} .


Equivalently,

      Σ_{ϕ ∈ N({1,...,m})} \binom{K + kϕk}{K} · ( ϕ ) · ~r^ϕ = 1 / (1 − Σi ri)^{K+1} .

Proof. 1 The equation arises as the Taylor series f (x) = Σn ( f^{(n)}(0) / n! ) · x^n of the
function f (x) = 1 / (1 − x)^{K+1} . One can show, by induction on n, that the n-th
derivative of f is:

      f^{(n)}(x) = ( (n + K)! / K! ) · 1 / (1 − x)^{n+K+1} .

2 The second equation is a special case of the first one, for K = 0. There is also
a simple direct proof. Define sn = r^0 + r^1 + · · · + r^n . Then sn − r · sn = 1 − r^{n+1} ,
so that sn = (1 − r^{n+1}) / (1 − r). Hence sn → 1 / (1 − r) as n → ∞.

3 We choose to use the first item, but there are other ways to prove this result,
see Exercise 1.5.10.

      r / (1 − r)^2 = r · Σ_{n≥0} \binom{n+1}{1} · r^n        by item (1), with K = 1
                    = Σ_{n≥0} (n + 1) · r^{n+1}
                    = Σ_{n≥1} n · r^n .

4 The trick is to turn the multiple sums into a single ‘leading’ one, in:

      Σ_{n1 , ..., nm ≥ 0} \binom{K + Σi ni}{K, n1 , . . . , nm} · Πi ri^{ni}
        = Σ_{n≥0} Σ_{n1 +···+nm = n} \binom{K + n}{K, n1 , . . . , nm} · Πi ri^{ni}
        = Σ_{n≥0} Σ_{n1 +···+nm = n} \binom{K + n}{K} · \binom{n}{n1 , . . . , nm} · Πi ri^{ni}
        = Σ_{n≥0} \binom{K + n}{K} · Σ_{n1 +···+nm = n} \binom{n}{n1 , . . . , nm} · Πi ri^{ni}
        = Σ_{n≥0} \binom{K + n}{K} · ( Σi ri )^n        by (1.26)
        = 1 / (1 − Σi ri)^{K+1} ,        by item (1).
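Partial sums make the identities of Theorem 1.5.2 easy to sanity-check numerically; a quick sketch for items (1) and (3), with r = 1/2 and K = 2 chosen arbitrarily by us:

```python
from math import comb

r, K, N = 0.5, 2, 200   # N terms approximate the infinite sums well for this r

# item (1): sum_{n>=0} C(n+K, K) · r^n = 1 / (1-r)^(K+1)
s1 = sum(comb(n + K, K) * r**n for n in range(N))
assert abs(s1 - 1 / (1 - r) ** (K + 1)) < 1e-9     # both sides are 8

# item (3): sum_{n>=1} n · r^n = r / (1-r)^2
s3 = sum(n * r**n for n in range(1, N))
assert abs(s3 - r / (1 - r) ** 2) < 1e-9           # both sides are 2
```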

Exercises
1.5.1 Consider the function f : {a, b, c} → {0, 1} given by f (a) = f (b) = 1
and f (c) = 0.


1 Take the natural multiset ϕ = 1| a i + 3|b i + 1| ci ∈ N({a, b, c}) and


compute consecutively:
• (ϕ)
• N( f )(ϕ)
• ( N( f )(ϕ) ).
Conclude that ( ϕ ) ≠ ( N( f )(ϕ) ), in general.
2 Now take ψ = 2| 0 i + 3|1 i ∈ N({0, 1}).
• Compute ( ψ).
• Show that there are four multisets ϕ1 , ϕ2 , ϕ3 , ϕ4 ∈ N({a, b, c})
with M( f )(ϕi ) = ψ, for each i.
• Check that ( ψ ) ≠ ( ϕ1 ) + ( ϕ2 ) + ( ϕ3 ) + ( ϕ4 ).
What is the general formulation now?
1.5.2 Check that:
1 the size map ‖−‖ : M(X) → R≥0 is a homomorphism of monoids, preserving rescaling — and thus a homomorphism of cones, see Exercise 1.4.9;
2 ‖M(f)(ϕ)‖ = ‖ϕ‖.
1.5.3 Show that for natural multisets ϕ, ψ ∈ N(X),
$$ \varphi \leq \psi \iff \exists \varphi' \in \mathcal{N}(X).\ \varphi + \varphi' = \psi. $$
1.5.4 Let Ψ ∈ N(N(X)) be given with ψ = flat(Ψ). Show that for ϕ ∈ supp(Ψ) one has ϕ ≤ ψ and flat(Ψ − 1|ϕ⟩) = ψ − ϕ.
1.5.5 Let ϕ be a natural multiset.
1 Show that:
$$ \sum_{x\in supp(\varphi)} \frac{\varphi!}{(\varphi - 1|x\rangle)!} \,=\, \|\varphi\|, $$
via:
$$ \sum_{x\in supp(\varphi)} \frac{\varphi!}{(\varphi - 1|x\rangle)!}
\,=\, \sum_{x\in supp(\varphi)} \frac{\prod_y \varphi(y)!}{(\varphi(x)-1)! \cdot \prod_{y\neq x} \varphi(y)!}
\,=\, \sum_{x\in supp(\varphi)} \frac{\varphi(x)!}{(\varphi(x)-1)!}
\,=\, \sum_{x\in supp(\varphi)} \varphi(x)
\,=\, \|\varphi\|. $$
2 Derive the recurrence relation (1.25) from this equation.


1.5.6 Prove the multiset formulation (1.27) of the Multinomial Theorem, by induction on K.
1.5.7 Let K ≥ 0 and let set X have n ≥ 1 elements. Prove that:
$$ \sum_{\varphi\in\mathcal{N}[K](X)} (\varphi) \,=\, n^K. $$
This generalises the well-known sum formula for binomial coefficients: $\sum_{0\leq k\leq K} \binom{K}{k} = 2^K$, for n = 2.
Hint: Use n = 1 + ··· + 1 in (1.27).


1.5.8 Let ϕ be a natural multiset. Show that:
1 (ϕ) = 1 ⟺ supp(ϕ) is a singleton;
2 (ϕ) = ‖ϕ‖! ⟺ ∀x ∈ supp(ϕ). ϕ(x) = 1 ⟺ ϕ consists of singletons.
1.5.9 Let n ≥ 1 and r ∈ (0, 1). Show that:
$$ \sum_{k\geq n} \binom{k-1}{n-1} \cdot r^n \cdot (1-r)^{k-n} \,=\, 1. $$

1.5.10 Elaborate the details of the following two (alternative) proofs of the equation in Theorem 1.5.2 (3).
1 Apply the derivative d/dr to both sides of Theorem 1.5.2 (2).
2 Write $s := \sum_{n\geq 1} n\cdot r^n$ and exploit the following recursive equation:
$$ \begin{array}{rcl}
s & = & r + 2r^2 + 3r^3 + 4r^4 + \cdots \\
& = & r + (1+1)r^2 + (1+2)r^3 + (1+3)r^4 + \cdots \\
& = & \big(r + r^2 + r^3 + r^4 + \cdots\big) + r\cdot\big(r + 2r^2 + 3r^3 + \cdots\big) \\
& = & \displaystyle\frac{r}{1-r} + r\cdot s, \quad\text{by Theorem 1.5.2 (2).}
\end{array} $$
1.5.11 In the proof of Theorem 1.5.2 we have used Taylor's formula for a single-variable function. For multi-variable functions we can use multisets for a compact, 'multi-index' formulation. For an n-ary function f(x_1, ..., x_n) and a natural multiset ϕ ∈ N({1, ..., n}) write:
$$ \partial^{\varphi} f \,:=\, \partial x_1^{\varphi(1)} \cdots \partial x_n^{\varphi(n)}\, f. $$
Check that Taylor's expansion formula (around 0 ∈ R^n) then becomes:
$$ f(\vec{x}) \,=\, \sum_{\varphi\in\mathcal{N}(\{1,\ldots,n\})} \frac{(\partial^{\varphi} f)(0)}{\varphi!} \cdot \vec{x}^{\,\varphi}. $$


1.6 Binomial coefficients of multisets



Binomial coefficients $\binom{n}{k}$ for numbers n ≥ k are a standard tool in many areas of mathematics, see Subsection 1.3.4 for the definition and the most basic properties. Here we extend binomial coefficients from numbers to natural multisets: we define $\binom{\psi}{\varphi}$ for natural multisets ψ, ϕ with ψ ≥ ϕ. In the next section we shall also look into the less familiar 'multichoose' coefficients $\left(\!\binom{n}{m}\!\right)$ and their extension to multisets $\left(\!\binom{\psi}{\varphi}\!\right)$. The coefficients $\binom{\psi}{\varphi}$ and $\left(\!\binom{\psi}{\varphi}\!\right)$ will play an important role in (multivariate) hypergeometric and Pólya distributions.

Definition 1.6.1. For natural multisets ϕ, ψ ∈ N(X) with ϕ ≤ ψ, the multiset binomial is defined as:
$$ \binom{\psi}{\varphi} \,:=\, \frac{\psi!}{\varphi!\cdot(\psi-\varphi)!}
\,=\, \frac{\prod_x \psi(x)!}{\big(\prod_x \varphi(x)!\big)\cdot\big(\prod_x (\psi(x)-\varphi(x))!\big)}
\,=\, \prod_{x\in supp(\psi)} \binom{\psi(x)}{\varphi(x)}. \qquad (1.28) $$
For instance:
$$ \binom{3|R\rangle + 2|B\rangle}{2|R\rangle + 1|B\rangle} \,=\, \frac{3!\cdot 2!}{2!\cdot 1!\cdot 1!\cdot 1!} \,=\, 6 \,=\, 3\cdot 2 \,=\, \binom{3}{2}\cdot\binom{2}{1}. $$
This describes the number of possible ways of drawing 2|R⟩ + 1|B⟩ from an urn 3|R⟩ + 2|B⟩.
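For multisets represented as dictionaries of multiplicities (our own encoding, for illustration), the multiset binomial of (1.28) is a one-line product; the sketch below reproduces the urn example:

```python
from math import comb, prod

def mset_binom(psi, phi):
    # Multiset binomial (1.28): product over x of C(psi(x), phi(x)),
    # for multisets encoded as dicts mapping elements to multiplicities.
    assert all(phi.get(x, 0) <= psi.get(x, 0) for x in set(phi) | set(psi)), \
        "requires phi <= psi"
    return prod(comb(n, phi.get(x, 0)) for x, n in psi.items())

urn  = {'R': 3, 'B': 2}     # 3|R> + 2|B>
draw = {'R': 2, 'B': 1}     # 2|R> + 1|B>
assert mset_binom(urn, draw) == 6    # = C(3,2) * C(2,1)
```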
The following result is a generalisation of Vandermonde's formula.

Lemma 1.6.2. Let ψ ∈ N(X) be a multiset of size L = ‖ψ‖, with a number K ≤ L. Then:
$$ \sum_{\varphi\leq_K\psi} \binom{\psi}{\varphi} \,=\, \binom{L}{K}
\qquad\text{so that}\qquad
\sum_{\varphi\leq_K\psi} \frac{\binom{\psi}{\varphi}}{\binom{L}{K}} \,=\, 1. $$
These fractions adding up to one will form the probabilities of the so-called hypergeometric distributions, see Example 2.1.1 (3) and Section 3.4 later on.

Proof. We use induction on the number of elements in supp(ψ). We go through some initial values explicitly. If the number of elements is 0, then ψ = 0 and so L = 0 = K, and ϕ ≤_K ψ means ϕ = 0, so that the result holds. Similarly, if supp(ψ) is a singleton, say {x}, then L = ψ(x). For K ≤ L and ϕ ≤_K ψ we get supp(ϕ) = {x} and K = ϕ(x). The result then obviously holds.
The case where supp(ψ) = {x, y} captures the ordinary form of Vandermonde's formula. We reformulate it for numbers B, G ∈ N and K ≤ B + G. Then:
$$ \binom{B+G}{K} \,=\, \sum_{b\leq B,\ g\leq G,\ b+g=K} \binom{B}{b}\cdot\binom{G}{g}. \qquad (1.29) $$
Intuitively: if you select K children out of B boys and G girls, the number of options is given by the sum over the options for b ≤ B boys times the options for g ≤ G girls, with b + g = K.
The equation (1.29) can be proven by induction on G. When G = 0 both sides amount to $\binom{B}{K}$, so we quickly proceed to the induction step. The case K = 0 is trivial, so we may assume K > 0.
$$ \begin{array}{rcl}
\lefteqn{\displaystyle\sum_{b\leq B,\ g\leq G+1,\ b+g=K} \binom{B}{b}\cdot\binom{G+1}{g}} \\
& = & \binom{B}{K}\cdot\binom{G+1}{0} + \binom{B}{K-1}\cdot\binom{G+1}{1} + \cdots + \binom{B}{K-G}\cdot\binom{G+1}{G} + \binom{B}{K-G-1}\cdot\binom{G+1}{G+1} \\
& = & \binom{B}{K}\cdot\binom{G}{0} + \binom{B}{K-1}\cdot\binom{G}{1} + \binom{B}{K-1}\cdot\binom{G}{0} \\
&   & \qquad + \cdots + \binom{B}{K-G}\cdot\binom{G}{G} + \binom{B}{K-G}\cdot\binom{G}{G-1} + \binom{B}{K-G-1}\cdot\binom{G}{G} \\
& = & \displaystyle\sum_{b\leq B,\ g\leq G,\ b+g=K} \binom{B}{b}\cdot\binom{G}{g} \;+\; \sum_{b\leq B,\ g\leq G,\ b+g=K-1} \binom{B}{b}\cdot\binom{G}{g} \\
& \stackrel{(IH)}{=} & \binom{B+G}{K} + \binom{B+G}{K-1} \\
& \stackrel{(1.14)}{=} & \binom{B+G+1}{K}.
\end{array} $$
For the induction step, let supp(ψ) = {x_1, ..., x_n, y}, for n ≥ 2. Writing ℓ = ψ(y), L′ = L − ℓ and ψ′ = ψ − ℓ|y⟩ ∈ N[L′](X) gives:
$$ \sum_{\varphi\leq_K\psi} \binom{\psi}{\varphi}
\,=\, \sum_{\varphi\leq_K\psi}\ \prod_x \binom{\psi(x)}{\varphi(x)}
\,=\, \sum_{n\leq\ell} \binom{\ell}{n} \cdot \sum_{\varphi\leq_{K-n}\psi'}\ \prod_i \binom{\psi'(x_i)}{\varphi(x_i)}
\,\stackrel{(IH)}{=}\, \sum_{n\leq\ell,\ K-n\leq L-\ell} \binom{\ell}{n}\cdot\binom{L-\ell}{K-n}
\,\stackrel{(1.29)}{=}\, \binom{L}{K}. $$
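Lemma 1.6.2 can be checked by brute force for small multisets: enumerate all sub-multisets ϕ ≤ ψ of size K and sum their multiset binomials. A Python sketch, with dicts as our illustrative encoding of multisets:

```python
from math import comb, prod
from itertools import product

def sub_multisets(psi, K):
    # All phi <= psi with ||phi|| = K, for psi a dict x -> multiplicity.
    xs = list(psi)
    for counts in product(*(range(psi[x] + 1) for x in xs)):
        if sum(counts) == K:
            yield dict(zip(xs, counts))

def mset_binom(psi, phi):
    # Multiset binomial (1.28), assuming phi <= psi.
    return prod(comb(psi[x], phi.get(x, 0)) for x in psi)

psi = {'a': 3, 'b': 2, 'c': 4}                 # size L = 9
L, K = sum(psi.values()), 4
assert sum(mset_binom(psi, phi) for phi in sub_multisets(psi, K)) == comb(L, K)
```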

For a multiset ϕ we have already used the 'support' definition supp(ϕ) = {x | ϕ(x) ≠ 0}. This yields a map supp : M(X) → Pfin(X), which is well-behaved, in the sense that it is natural and preserves the monoid structures on M(X) and Pfin(X).
We have also seen a support map from lists L to finite powerset Pfin. This support map factorises through multisets, as described with a new function acc in the following triangle.

             supp
    L(X) -----------> Pfin(X)
        \            ↗
     acc \          / supp                  (1.30)
          ↘        /
           N(X)

The 'accumulator' map acc : L(X) → N(X) counts (accumulates) how many times an element occurs in a list, while ignoring the order of occurrences. Thus, for a list ℓ ∈ L(X),
$$ acc(\ell)(x) \,:=\, n \quad\text{if } x \text{ occurs } n \text{ times in the list } \ell. \qquad (1.31) $$
Alternatively,
$$ acc\big([x_1, \ldots, x_n]\big) \,=\, 1|x_1\rangle + \cdots + 1|x_n\rangle. $$
Or, more concretely, in an example:
$$ acc\big([a, b, a, b, c, b, b]\big) \,=\, 2|a\rangle + 4|b\rangle + 1|c\rangle. $$
The above diagram (1.30) expresses an earlier informal statement, namely that multisets are somehow in between lists and subsets.
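Both maps in the triangle (1.30) are immediate to program; the sketch below, with Python's `Counter` standing in for natural multisets (our own choice of encoding), checks on the example word above that supp ∘ acc coincides with the direct support of a list:

```python
from collections import Counter

def acc(word):
    # Accumulator map acc : L(X) -> N(X) from (1.31): count occurrences,
    # forgetting their order.
    return Counter(word)

def supp(phi):
    # Support map N(X) -> Pfin(X): elements with non-zero multiplicity.
    return {x for x, n in phi.items() if n != 0}

word = ['a', 'b', 'a', 'b', 'c', 'b', 'b']
assert acc(word) == Counter({'a': 2, 'b': 4, 'c': 1})   # 2|a> + 4|b> + 1|c>
assert supp(acc(word)) == set(word)                     # the triangle commutes
```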
The starting point in this section is the question: how many (ordered) sequences of coloured balls give rise to a specific urn? More technically, given a natural multiset ϕ, how many lists ℓ satisfy acc(ℓ) = ϕ? In yet another form, what is the size |acc⁻¹(ϕ)| of the inverse image?
As described in the beginning of this section, one aim is to relate the multiset coefficient (ϕ) of a multiset ϕ to the number of lists that accumulate to ϕ, as defined in (1.31). Here we use a K-ary version of accumulation, for K ∈ N, restricted to K-many elements. It then becomes a mapping:
$$ X^K \xrightarrow{\;acc[K]\;} \mathcal{N}[K](X). \qquad (1.32) $$
The parameter K will often be omitted when it is clear from the context. We are now ready for a basic combinatorial / counting result. It involves the multiset coefficient from Definition 1.5.1 (5).

Proposition 1.6.3. For ϕ ∈ N[K](X) one has:
$$ (\varphi) \,=\, \big|\,acc^{-1}(\varphi)\,\big| \,=\, \text{the number of lists } \ell\in X^K \text{ with } acc(\ell) = \varphi. $$
Proof. We use induction on the number of elements of the support supp(ϕ) of the multiset ϕ. If this number is 0, then ϕ = 0, with (0) = 1. And indeed, there is precisely one list ℓ with acc(ℓ) = 0, namely the empty list [].
Next suppose that supp(ϕ) = {x_1, ..., x_n, x_{n+1}}. Take m := ϕ(x_{n+1}) and ϕ′ := ϕ − m|x_{n+1}⟩ so that ‖ϕ′‖ = K − m and supp(ϕ′) = {x_1, ..., x_n}. By the induction hypothesis there are (ϕ′)-many lists ℓ′ ∈ X^{K−m} with acc(ℓ′) = ϕ′. Each such list ℓ′ can be extended to a list ℓ with acc(ℓ) = ϕ by m times adding x_{n+1} to ℓ′. How many such additions are there? It is not hard to see that this number of additions is $\binom{K}{m}$. Thus:
$$ \begin{array}{rcl}
\big|\,acc^{-1}(\varphi)\,\big|
& = & \displaystyle\binom{K}{m} \cdot (\varphi') \\
& = & \displaystyle\frac{K!}{m!\,(K-m)!} \cdot \frac{(K-m)!}{\prod_{i\leq n} \varphi'(x_i)!} \\
& = & \displaystyle\frac{K!}{\prod_{i\leq n+1} \varphi(x_i)!} \qquad\text{since } m = \varphi(x_{n+1}) \text{ and } \varphi'(x_i) = \varphi(x_i) \\
& = & (\varphi).
\end{array} $$
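Proposition 1.6.3 is easy to test by enumeration: the distinct permutations of any list accumulating to ϕ are exactly the lists in acc⁻¹(ϕ). A small Python check (our own illustration):

```python
from math import factorial, prod
from itertools import permutations
from collections import Counter

def coeff(phi):
    # Multiset coefficient ( phi ) = ||phi||! / prod_x phi(x)!
    return factorial(sum(phi.values())) // prod(factorial(n) for n in phi.values())

phi = Counter({'a': 2, 'b': 1, 'c': 1})          # size K = 4
word = list(phi.elements())                      # one list with acc(word) = phi
inverse_image = {p for p in permutations(word)}  # all lists accumulating to phi
assert len(inverse_image) == coeff(phi) == 12
```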

We conclude this part with a remarkable fact, combining multinomial coefficients with the flatten operation for multisets, and also with accumulation. We start with an illustration, for which we recall from Subsection 1.4.2 that the flatten operation has type flat : M(M(X)) → M(X). When we restrict it to natural multisets and include sizes K, L ∈ N it takes the form:
$$ \mathcal{N}[L]\big(\mathcal{N}[K](X)\big) \xrightarrow{\;flat\;} \mathcal{N}[L\cdot K](X). \qquad (1.33) $$
Now suppose a natural multiset is given:
$$ \psi \,=\, 3|a\rangle + 1|b\rangle + 2|c\rangle \,\in\, \mathcal{N}[2\cdot 3](X) \qquad\text{for } K = 3,\ L = 2,\ X = \{a, b, c\}. $$
Its multinomial coefficient is $(\psi) = \frac{6!}{3!\cdot 1!\cdot 2!} = 60$.
We now ask ourselves the question: which $\Psi \in \mathcal{N}[L]\big(\mathcal{N}[K](X)\big)$ satisfy flat(Ψ) = ψ? A little reflection shows that there are three such Ψ, namely:
$$ \begin{array}{rcl}
\Psi_1 & = & 1\big|\, 2|a\rangle + 1|c\rangle \,\big\rangle + 1\big|\, 1|a\rangle + 1|b\rangle + 1|c\rangle \,\big\rangle \\
\Psi_2 & = & 1\big|\, 2|a\rangle + 1|b\rangle \,\big\rangle + 1\big|\, 1|a\rangle + 2|c\rangle \,\big\rangle \\
\Psi_3 & = & 1\big|\, 3|a\rangle \,\big\rangle + 1\big|\, 1|b\rangle + 2|c\rangle \,\big\rangle.
\end{array} $$
Next we are going to compute the expression $(\Psi)\cdot\prod_{\varphi} (\varphi)^{\Psi(\varphi)}$ in each case.
$$ \begin{array}{rcl}
(\Psi_1)\cdot\prod_{\varphi} (\varphi)^{\Psi_1(\varphi)} & = & 2\cdot 3^1\cdot 6^1 \,=\, 36 \\
(\Psi_2)\cdot\prod_{\varphi} (\varphi)^{\Psi_2(\varphi)} & = & 2\cdot 3^1\cdot 3^1 \,=\, 18 \\
(\Psi_3)\cdot\prod_{\varphi} (\varphi)^{\Psi_3(\varphi)} & = & 2\cdot 1^1\cdot 3^1 \,=\, 6.
\end{array} $$
We then find that the sum of these outcomes equals the coefficient of ψ:
$$ (\psi) \,=\, 60 \,=\, \Big((\Psi_1)\cdot\prod_{\varphi} (\varphi)^{\Psi_1(\varphi)}\Big) + \Big((\Psi_2)\cdot\prod_{\varphi} (\varphi)^{\Psi_2(\varphi)}\Big) + \Big((\Psi_3)\cdot\prod_{\varphi} (\varphi)^{\Psi_3(\varphi)}\Big). $$


The general result is formulated in Theorem 1.6.5 below. For its proof we need an intermediate step.

Lemma 1.6.4. Let ψ ∈ N[L](X) be a natural multiset of size L ∈ N, and let K ≤ L.
1 For ϕ ≤_K ψ one has:
$$ \frac{(\varphi)\cdot(\psi-\varphi)}{(\psi)} \,=\, \frac{\binom{\psi}{\varphi}}{\binom{L}{K}}. $$
2 Now:
$$ (\psi) \,=\, \sum_{\varphi\leq_K\psi} (\varphi)\cdot(\psi-\varphi). $$
The earlier equation (1.25) is a special case, for K = 1.

Proof. 1 Because:
$$ \frac{(\varphi)\cdot(\psi-\varphi)}{(\psi)}
\,=\, \frac{K!}{\varphi!} \cdot \frac{(L-K)!}{(\psi-\varphi)!} \cdot \frac{\psi!}{L!}
\,=\, \frac{K!\cdot(L-K)!}{L!} \cdot \frac{\psi!}{\varphi!\cdot(\psi-\varphi)!}
\,=\, \frac{\binom{\psi}{\varphi}}{\binom{L}{K}}. $$
2 By the previous item and Vandermonde's formula from Lemma 1.6.2:
$$ \sum_{\varphi\leq_K\psi} (\varphi)\cdot(\psi-\varphi)
\,=\, (\psi)\cdot\sum_{\varphi\leq_K\psi} \frac{\binom{\psi}{\varphi}}{\binom{L}{K}}
\,=\, (\psi)\cdot\frac{\binom{L}{K}}{\binom{L}{K}}
\,=\, (\psi). $$

Theorem 1.6.5. Each ψ ∈ N[L·K](X), for numbers L, K ≥ 1, satisfies:
$$ (\psi) \,=\, \sum_{\substack{\Psi\in\mathcal{N}[L](\mathcal{N}[K](X)) \\ flat(\Psi) = \psi}} (\Psi)\cdot\prod_{\varphi} (\varphi)^{\Psi(\varphi)}. $$
This equation turns out to be essential for proving that multinomial distributions are suitably closed under composition, see Theorem 3.3.6 in Chapter 3.

Proof. Since ψ ∈ N[L·K](X) we can apply Lemma 1.6.4 (2) L-many times, giving:
$$ (\psi) \,=\, \sum_{\varphi_1\leq_K\psi}\ \sum_{\varphi_2\leq_K\psi-\varphi_1} \cdots \sum_{\varphi_L\leq_K\psi-\varphi_1-\cdots-\varphi_{L-1}} (\varphi_1)\cdot(\varphi_2)\cdot\ldots\cdot(\varphi_L)
\,=\, \sum_{\substack{\varphi_1,\ldots,\varphi_L\leq_K\psi \\ \varphi_1+\cdots+\varphi_L = \psi}}\ \prod_i (\varphi_i). $$
We can accumulate the sequence of multisets ϕ_1, ..., ϕ_L ∈ N[K](X) into a multiset of multisets:
$$ \Psi \,:=\, acc\big([\varphi_1, \ldots, \varphi_L]\big) \,=\, 1|\varphi_1\rangle + \cdots + 1|\varphi_L\rangle \,\in\, \mathcal{N}[L]\big(\mathcal{N}[K](X)\big). $$
The flatten map preserves sums of multisets, see Exercise 1.4.7, and thus maps Ψ to ψ, via the flatten-unit law of Lemma 1.4.3:
$$ \begin{array}{rcl}
flat\big(\Psi\big) & = & flat\big(1|\varphi_1\rangle + \cdots + 1|\varphi_L\rangle\big) \\
& = & flat\big(1|\varphi_1\rangle\big) + \cdots + flat\big(1|\varphi_L\rangle\big) \\
& = & flat\big(unit(\varphi_1)\big) + \cdots + flat\big(unit(\varphi_L)\big) \,=\, \varphi_1 + \cdots + \varphi_L \,=\, \psi.
\end{array} $$
When we sum over the accumulations $\Psi \in \mathcal{N}[L]\big(\mathcal{N}[K](X)\big)$ with flat(Ψ) = ψ, instead of over the sequences ϕ_1, ..., ϕ_L ≤_K ψ with $\sum_i \varphi_i = \psi$, we have to take into account that (Ψ)-many sequences accumulate to the same Ψ, see Proposition 1.6.3. This explains the factor (Ψ) in the above theorem.
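Theorem 1.6.5 can be verified by brute force on the worked example above: enumerate all L-tuples of size-K multisets summing to ψ, group them into big multisets Ψ, and compare. A Python sketch, with multisets encoded as multiplicity tuples over a fixed ordering of X (our own encoding):

```python
from math import factorial, prod
from itertools import product
from collections import Counter

def coeff(phi):
    # ( phi ) for phi given as a dict or as a tuple of multiplicities.
    vals = phi.values() if isinstance(phi, dict) else phi
    return factorial(sum(vals)) // prod(factorial(n) for n in vals)

def msets(n, K):
    # All multisets of size K over n elements, as multiplicity tuples.
    if n == 1:
        yield (K,)
        return
    for k in range(K + 1):
        for rest in msets(n - 1, K - k):
            yield (k,) + rest

K, L = 3, 2
psi = (3, 1, 2)     # psi = 3|a> + 1|b> + 2|c>, of size L*K = 6

rhs, seen = 0, set()
for seq in product(list(msets(3, K)), repeat=L):
    if tuple(map(sum, zip(*seq))) != psi:
        continue
    Psi = Counter(seq)                     # Psi = acc(seq), a multiset of multisets
    if frozenset(Psi.items()) in seen:
        continue                           # count each Psi only once
    seen.add(frozenset(Psi.items()))
    rhs += coeff(dict(Psi)) * prod(coeff(phi)**m for phi, m in Psi.items())

assert rhs == coeff(psi) == 60
```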

Exercises
1.6.1 Generalise the familiar equation $\sum_{0\leq k\leq K} \binom{K}{k} = 2^K$ to:
$$ \sum_{\varphi\leq\psi} \binom{\psi}{\varphi} \,=\, 2^{\|\psi\|}. $$

1.6.2 Show that ‖acc(ℓ)‖ = ‖ℓ‖, using the length ‖ℓ‖ of a list ℓ from Exercise 1.2.3.
1.6.3 Let ϕ ∈ N[K](X) and ψ ∈ N[L](X) be given.
1 Prove that:
$$ (\varphi+\psi) \,=\, \frac{\binom{K+L}{K}}{\binom{\varphi+\psi}{\varphi}} \cdot (\varphi)\cdot(\psi). $$
2 Now assume that ϕ, ψ have disjoint supports, that is, supp(ϕ) ∩ supp(ψ) = ∅. Show that now:
$$ (\varphi+\psi) \,=\, \binom{K+L}{K} \cdot (\varphi)\cdot(\psi). $$
1.6.4 Consider the K-ary accumulator function (1.32), for K > 0. Check that acc is permutation-invariant, in the sense that for each permutation π of {1, ..., K} one has:
$$ acc\big([x_1, \ldots, x_K]\big) \,=\, acc\big([x_{\pi(1)}, \ldots, x_{\pi(K)}]\big). $$
1.6.5 Check that the accumulator map acc : L(X) → M(X) is a homomorphism of monoids. Do the same for the support map supp : M(X) → Pfin(X).
1.6.6 Prove that the accumulator and support maps acc : L(X) → M(X) and supp : M(X) → Pfin(X) are natural: for an arbitrary function f : X → Y both rectangles below commute.

            acc            supp
    L(X) ---------> M(X) ---------> Pfin(X)
      |              |                |
  L(f)|          M(f)|                | Pfin(f)
      ↓              ↓                ↓
    L(Y) ---------> M(Y) ---------> Pfin(Y)
            acc            supp

1.6.7 Convince yourself that the following composite
$$ \mathcal{N}[K](X)^L \xrightarrow{\;acc\;} \mathcal{N}[L]\big(\mathcal{N}[K](X)\big) \xrightarrow{\;flat\;} \mathcal{N}[L\cdot K](X) $$
is the L-fold sum of multisets — see (1.33) for the fixed size flatten operation.

1.6.8 In analogy with the powerset operator, with type P : P(X) → P(P(X)), a powerbag operator PB : N(X) → N(N(X)) is introduced in [114] (see also [35]). It can be defined as:
$$ PB(\psi) \,:=\, \sum_{\varphi\leq\psi} \binom{\psi}{\varphi}\, \Big|\, \varphi \,\Big\rangle. $$
1 Take X = {a, b} and show that:
$$ \begin{array}{rcl}
PB\big(1|a\rangle + 3|b\rangle\big)
& = & 1\big|\,0\,\big\rangle + 1\big|\,1|a\rangle\,\big\rangle + 3\big|\,1|b\rangle\,\big\rangle + 3\big|\,1|a\rangle + 1|b\rangle\,\big\rangle + 3\big|\,2|b\rangle\,\big\rangle \\
&   & \quad +\; 3\big|\,1|a\rangle + 2|b\rangle\,\big\rangle + 1\big|\,3|b\rangle\,\big\rangle + 1\big|\,1|a\rangle + 3|b\rangle\,\big\rangle.
\end{array} $$
2 Check that one can compute the powerbag of ψ as follows. Take a list of elements that accumulates to ψ, such as [a, b, b, b] in the previous item. Take the accumulation of all its subsequences.
1.6.9 For N ≥ 2 many natural multisets ϕ_1, ..., ϕ_N ∈ N(X), with ψ := $\sum_i \varphi_i$, define a multinomial coefficient of multisets as:
$$ \binom{\psi}{\varphi_1, \ldots, \varphi_N} \,:=\, \frac{\psi!}{\varphi_1!\cdot\ldots\cdot\varphi_N!}. $$
1 Check that for N ≥ 3, in analogy with Exercise 1.3.6,
$$ \binom{\psi}{\varphi_1, \ldots, \varphi_N} \,=\, \binom{\psi}{\varphi_1}\cdot\binom{\psi-\varphi_1}{\varphi_2, \ldots, \varphi_N}. $$
2 For K_1, ..., K_N ∈ N write K = $\sum_i K_i$ and assume that ψ ∈ N[K](X) is given. Show that:
$$ \sum_{\substack{\varphi_1\leq_{K_1}\psi,\ \ldots,\ \varphi_N\leq_{K_N}\psi \\ \sum_i \varphi_i = \psi}} \binom{\psi}{\varphi_1, \ldots, \varphi_N} \,=\, \binom{K}{K_1, \ldots, K_N}. $$

1.7 Multichoose coefficients


We proceed with another counting challenge. Let X be a finite set, say with n elements. How many elements are there then in N[K](X)? That is, how many multisets of size K are there over n elements? This is sometimes formulated informally as: how many ways are there to divide K balls over n urns? It is the multiset-analogue of Lemma 1.3.7, where the number of subsets of size K of an n-element set is identified as $\binom{n}{K}$. Below we show that the answer for multisets is given by the multiset number or multichoose number $\left(\!\binom{n}{K}\!\right)$, see e.g. [145]. We introduce this new number below, and immediately extend its standard definition from numbers to multisets, in analogy with the extension of (ordinary) binomials to multisets in Definition 1.5.1. We first need some preparations before coming to the multiset counting result in Proposition 1.7.4.

Definition 1.7.1. 1 For numbers n ≥ 1 and K ≥ 0, put:
$$ \left(\!\binom{n}{K}\!\right) \,:=\, \binom{n+K-1}{K} \,=\, \binom{n+K-1}{n-1} \,=\, \frac{(n+K-1)!}{K!\cdot(n-1)!} \,=\, \frac{n}{n+K}\cdot\binom{n+K}{n}. \qquad (1.34) $$
This $\left(\!\binom{n}{K}\!\right)$ is sometimes called multichoose.
2 Let ψ, ϕ be natural multisets on the same space, where ψ is non-empty and supp(ϕ) ⊆ supp(ψ). Then:
$$ \left(\!\binom{\psi}{\varphi}\!\right) \,:=\, \prod_{x\in supp(\psi)} \left(\!\binom{\psi(x)}{\varphi(x)}\!\right) \,=\, \prod_{x\in supp(\psi)} \binom{\psi(x)+\varphi(x)-1}{\varphi(x)}. \qquad (1.35) $$
Consider the set N[3]({a, b, c}) of multisets of size 3 over {a, b, c}. It has $\left(\!\binom{3}{3}\!\right) = \binom{5}{3} = \binom{5}{2} = \frac{4\cdot 5}{2} = 10$ elements, namely:
$$ 3|a\rangle,\quad 3|b\rangle,\quad 3|c\rangle,\quad 2|a\rangle + 1|b\rangle,\quad 2|a\rangle + 1|c\rangle,\quad 1|a\rangle + 2|b\rangle, $$
$$ 2|b\rangle + 1|c\rangle,\quad 1|a\rangle + 2|c\rangle,\quad 1|b\rangle + 2|c\rangle,\quad 1|a\rangle + 1|b\rangle + 1|c\rangle. $$
Our aim is to prove that the multiset number $\left(\!\binom{n}{K}\!\right)$ is the total number of multisets of size K over a non-empty n-element set. We use the following multichoose analogue of Vandermonde's (binary) formula (1.29).


Lemma 1.7.2. Fix B ≥ 1 and G ≥ 1. For all K one has:
$$ \left(\!\binom{B+G}{K}\!\right) \,=\, \sum_{0\leq k\leq K} \left(\!\binom{B}{k}\!\right)\cdot\left(\!\binom{G}{K-k}\!\right). \qquad (1.36) $$
In particular:
$$ \binom{B+K}{K} \,=\, \left(\!\binom{B+1}{K}\!\right) \,=\, \sum_{0\leq k\leq K} \left(\!\binom{B}{k}\!\right) \,=\, \sum_{0\leq k\leq K} \left(\!\binom{B}{K-k}\!\right). \qquad (1.37) $$
Proof. The second equation (1.37) easily follows from the first one by taking G = 1 and using that $\left(\!\binom{1}{n}\!\right) = \binom{n}{n} = 1$.
We shall make frequent use of the following equation, whose proof is left to the reader (in Exercise 1.7.5) below:
$$ \left(\!\binom{n}{K+1}\!\right) + \left(\!\binom{n+1}{K}\!\right) \,=\, \left(\!\binom{n+1}{K+1}\!\right). \qquad (*) $$

We shall prove the first equation (1.36) in the lemma by induction on B ≥ 1. In both the base case B = 1 and the induction step we shall use induction on K. We shall try to keep the structure clear by using nested bullets.

• We first prove Equation (1.36) for B = 1, by induction on K.
– When K = 0 both sides in (1.36) are equal to 1.
– Assume Equation (1.36) holds for K (and B = 1). Then:
$$ \begin{array}{rcl}
\displaystyle\sum_{0\leq k\leq K+1} \left(\!\binom{1}{k}\!\right)\cdot\left(\!\binom{G}{(K+1)-k}\!\right)
& = & \displaystyle\sum_{0\leq k\leq K+1} \left(\!\binom{G}{K-(k-1)}\!\right)
\,=\, \left(\!\binom{G}{K+1}\!\right) + \sum_{0\leq \ell\leq K} \left(\!\binom{G}{K-\ell}\!\right) \\
& \stackrel{(IH)}{=} & \left(\!\binom{G}{K+1}\!\right) + \left(\!\binom{G+1}{K}\!\right)
\,\stackrel{(*)}{=}\, \left(\!\binom{G+1}{K+1}\!\right).
\end{array} $$
• Now assume Equation (1.36) holds for B (for all G, K). In order to show that it then also holds for B + 1 we use induction on K.
– When K = 0 both sides in (1.36) are equal to 1.
– Now assume that Equation (1.36) holds for K, and for B. Then:
$$ \begin{array}{rcl}
\lefteqn{\displaystyle\sum_{0\leq k\leq K+1} \left(\!\binom{B+1}{k}\!\right)\cdot\left(\!\binom{G}{(K+1)-k}\!\right)} \\
& = & \displaystyle\left(\!\binom{G}{K+1}\!\right) + \sum_{0\leq k\leq K} \left(\!\binom{B+1}{k+1}\!\right)\cdot\left(\!\binom{G}{K-k}\!\right) \\
& \stackrel{(*)}{=} & \displaystyle\left(\!\binom{G}{K+1}\!\right) + \sum_{0\leq k\leq K} \Big[ \left(\!\binom{B}{k+1}\!\right) + \left(\!\binom{B+1}{k}\!\right) \Big]\cdot\left(\!\binom{G}{K-k}\!\right) \\
& = & \displaystyle\left(\!\binom{G}{K+1}\!\right) + \sum_{0\leq k\leq K} \left(\!\binom{B}{k+1}\!\right)\cdot\left(\!\binom{G}{K-k}\!\right) \,+\, \sum_{0\leq k\leq K} \left(\!\binom{B+1}{k}\!\right)\cdot\left(\!\binom{G}{K-k}\!\right) \\
& \stackrel{(IH,\,K)}{=} & \displaystyle\sum_{0\leq k\leq K+1} \left(\!\binom{B}{k}\!\right)\cdot\left(\!\binom{G}{(K+1)-k}\!\right) \,+\, \left(\!\binom{(B+1)+G}{K}\!\right) \\
& \stackrel{(IH,\,B)}{=} & \displaystyle\left(\!\binom{B+G}{K+1}\!\right) + \left(\!\binom{(B+1)+G}{K}\!\right) \\
& \stackrel{(*)}{=} & \displaystyle\left(\!\binom{(B+1)+G}{K+1}\!\right).
\end{array} $$

We now get the double-bracket analogue of Lemma 1.6.2.

Proposition 1.7.3. Let ψ be a non-empty natural multiset. Write X = supp(ψ) and L = ‖ψ‖. Then, for each K ∈ N,
$$ \sum_{\varphi\in\mathcal{N}[K](X)} \left(\!\binom{\psi}{\varphi}\!\right) \,=\, \left(\!\binom{L}{K}\!\right)
\qquad\text{so}\qquad
\sum_{\varphi\in\mathcal{N}[K](X)} \frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{L}{K}\!\right)} \,=\, 1. $$
The fractions in this equation will show up later in so-called Pólya distributions, see Example 2.1.1 (4) and Section 3.4. These fractions capture the probability of drawing a multiset ϕ from an urn ψ when for each drawn ball an extra ball is added to the urn (of the same colour).

Proof. We use induction on the number of elements in the support X of ψ, like in the proof of Lemma 1.6.2. By assumption X cannot be empty, so the induction starts when X is a singleton, say X = {x}. But then ψ(x) = ‖ψ‖ = L and ϕ(x) = ‖ϕ‖ = K, so the result obviously holds.
Now let supp(ψ) = X ∪ {y} where y ∉ X and X is not empty. Write:
$$ L = \|\psi\| \qquad \ell = \psi(y) > 0 \qquad \psi' = \psi - \ell|y\rangle \qquad L' = L - \ell > 0. $$
By construction X = supp(ψ′) and L′ = ‖ψ′‖. Now:
$$ \begin{array}{rcl}
\displaystyle\sum_{\varphi\in\mathcal{N}[K](X\cup\{y\})} \left(\!\binom{\psi}{\varphi}\!\right)
& \stackrel{(1.35)}{=} & \displaystyle\sum_{\varphi\in\mathcal{N}[K](X\cup\{y\})}\ \prod_{x\in X\cup\{y\}} \left(\!\binom{\psi(x)}{\varphi(x)}\!\right) \\
& = & \displaystyle\sum_{0\leq k\leq K} \left(\!\binom{\psi(y)}{k}\!\right) \cdot \sum_{\varphi\in\mathcal{N}[K-k](X)}\ \prod_{x\in X} \left(\!\binom{\psi(x)}{\varphi(x)}\!\right) \\
& = & \displaystyle\sum_{0\leq k\leq K} \left(\!\binom{\ell}{k}\!\right) \cdot \sum_{\varphi\in\mathcal{N}[K-k](X)} \left(\!\binom{\psi'}{\varphi}\!\right) \\
& \stackrel{(IH)}{=} & \displaystyle\sum_{0\leq k\leq K} \left(\!\binom{\ell}{k}\!\right)\cdot\left(\!\binom{L'}{K-k}\!\right) \\
& \stackrel{(1.36)}{=} & \displaystyle\left(\!\binom{\ell+L'}{K}\!\right) \,=\, \left(\!\binom{L}{K}\!\right).
\end{array} $$
We finally come to our multiset counting result. It is the multiset-analogue of Proposition 1.3.7 (1) for subsets.

Proposition 1.7.4. Let X be a non-empty set with n ≥ 1 elements. The number of natural multisets of size K over X is $\left(\!\binom{n}{K}\!\right)$, that is:
$$ \Big|\, \mathcal{N}[K](X) \,\Big| \,=\, \left(\!\binom{n}{K}\!\right) \,=\, \binom{n+K-1}{K}. $$
Proof. The statement holds for K = 0 since $\left(\!\binom{n}{0}\!\right) = \binom{n-1}{0} = 1$ and there is precisely one multiset of size 0, namely the empty multiset 0. Hence we may assume K ≥ 1, so that Lemma 1.7.2 can be used.
We proceed by induction on n ≥ 1. For n = 1 the statement holds since there is only $1 = \binom{K}{K} = \left(\!\binom{1}{K}\!\right)$ multiset of size K over 1 = {0}, namely K|0⟩.
The induction step works as follows. Let X have n elements, and y ∉ X. For a multiset ϕ ∈ N[K](X ∪ {y}) there are K + 1 possible multiplicities ϕ(y). If ϕ(y) = k, then the number of possibilities for ϕ on X is the number of multisets in N[K−k](X). Thus:
$$ \Big|\, \mathcal{N}[K]\big(X\cup\{y\}\big) \,\Big|
\,=\, \sum_{0\leq k\leq K} \Big|\, \mathcal{N}[K-k](X) \,\Big|
\,\stackrel{(IH)}{=}\, \sum_{0\leq k\leq K} \left(\!\binom{n}{K-k}\!\right)
\,=\, \left(\!\binom{n+1}{K}\!\right), \quad\text{by Lemma 1.7.2.} $$
There is also a visual proof of this result, described in terms of stars and bars, see e.g. [47, II, proof of (5.2)], where multiplicities of multisets are described in terms of 'occupancy numbers'.
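The recursion in this proof translates directly into code; the sketch below (an illustration of ours) compares it against the closed formula $\left(\!\binom{n}{K}\!\right) = \binom{n+K-1}{K}$:

```python
from math import comb

def multichoose(n, K):
    # (( n over K )) from (1.34)
    return comb(n + K - 1, K)

def count_msets(n, K):
    # Count multisets of size K over n elements by splitting on the
    # multiplicity k of a fixed element, as in the induction step above.
    if n == 1:
        return 1
    return sum(count_msets(n - 1, K - k) for k in range(K + 1))

for n in range(1, 6):
    for K in range(0, 7):
        assert count_msets(n, K) == multichoose(n, K)
```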

An associated question is: given a fixed element a in an n-element set X, what is the sum of all multiplicities ϕ(a), for multisets ϕ over X with size K?

Lemma 1.7.5. For an element a ∈ X, where X has n ≥ 1 elements,
$$ \sum_{\varphi\in\mathcal{N}[K](X)} \varphi(a) \,=\, \frac{K}{n}\cdot\left(\!\binom{n}{K}\!\right). $$
Proof. When we sum over a we get by Proposition 1.7.4:
$$ \sum_{a\in X}\ \sum_{\varphi\in\mathcal{N}[K](X)} \varphi(a)
\,=\, \sum_{\varphi\in\mathcal{N}[K](X)}\ \sum_{a\in X} \varphi(a)
\,=\, \sum_{\varphi\in\mathcal{N}[K](X)} K
\,=\, K\cdot\left(\!\binom{n}{K}\!\right). $$
Since a ∈ X is arbitrary, the outcome should be the same for a different b ∈ X. Hence we have to divide by n, giving the equation in the lemma.
We look at one more counting problem, namely the number of multisets below a given multiset. Recall from Definition 1.5.1 (1) the pointwise order ≤ on multisets.

Definition 1.7.6. 1 For multisets ϕ, ψ on the same set X we define the strict order < as:
$$ \varphi < \psi \iff \varphi \leq \psi \text{ and } \varphi \neq \psi. $$
We say that ϕ is fully below ψ, written as ϕ ≺ ψ, when for each element the number of occurrences in ϕ is strictly less than the number of occurrences in ψ. Thus:
$$ \varphi \prec \psi \iff \forall x\in X.\ \varphi(x) < \psi(x). $$
2 For ϕ ∈ N(X) we use downset notation in the following way:
$$ \downarrow\varphi \,=\, \{\psi\in\mathcal{N}(X) \mid \psi\leq\varphi\} \qquad\text{and}\qquad \downarrow^{=}\varphi \,=\, \{\psi\in\mathcal{N}(X) \mid \psi<\varphi\}. $$
Notice that < is not the same as ≺, as shown by this example:
$$ 2|a\rangle + 1|b\rangle + 2|c\rangle \,<\, 2|a\rangle + 2|b\rangle + 2|c\rangle
\qquad\text{but}\qquad
2|a\rangle + 1|b\rangle + 2|c\rangle \,\nprec\, 2|a\rangle + 2|b\rangle + 2|c\rangle. $$
Proposition 1.7.7. The number of multisets strictly below a natural multiset ϕ can be described as:
$$ \Big|\, \downarrow^{=}\varphi \,\Big| \,=\, \sum_{\emptyset\neq U\subseteq supp(\varphi)}\ \prod_{x\in U} \varphi(x). \qquad (1.38) $$
The proof below lends itself to an easy implementation, see Exercise 1.7.12 below. It uses a (chosen) order on the elements of the support of the multiset. The above formulation shows that the outcome is independent of such an order.
52
1.7. Multichoose coefficents 53

Proof. We use induction on the size | supp(ϕ) | of the support of ϕ. If this size is
1, then ϕ is of the form m| x1 i, for some number m. Hence there are m multisets
strictly below ϕ, namely 0, 1| x1 i, . . . , (m − 1)| x1 i. Since supp(ϕ) = {x1 }, the
only non-empty subset U ⊆ supp(ϕ) is the singleton U = {x1 } and x∈U ϕ(x) =
Q
ϕ(x1 ) = m.
Next assume that supp(ϕ) = X ∪ {y} with y 6 X is of size n + 1. We write
m = ϕ(y) and ϕ0 = ϕ − m| y i so that supp(ϕ0 ) = X. The number M = | ↓= ϕ0 | is
then given by the formula in (1.38), with ϕ0 instead of ϕ. We count the multisets
strictly below ϕ as follows.

• For each 0 ≤ i ≤ m each ψ < ϕ0 gives ψ + i| y i < ϕ; this gives N · (m + 1)


multisets below ϕ;
• But for each 0 ≤ i < m also ϕ0 + i| y i, which gives another m multisets below
ϕ.
We thus get (m + 1) · N + m multisets ϕ, so that:

↓= ϕ = ϕ(y) + 1 · ↓= ϕ0 + ϕ(y). (∗)
Now let’s look at the subset formulation in (1.38). Each non-empty subset
U ⊆ X ∪ {y} is either a subset of X, or of the form U = V ∪ {y} with V ⊆ X
non-empty, or contains only y. Hence:
   
X Y  X Y   X Y 
ϕ(x) =  ϕ(x) +  ϕ(y) · ϕ(x) + ϕ(y)
∅,U⊆supp(ϕ) x∈U ∅,U⊆X x∈U ∅,U⊆X x∈U
   
 X Y   X Y 
=  ϕ0 (x) + ϕ(y) ·  ϕ0 (x) + ϕ(y)
∅,U⊆X x∈U ∅,U⊆X x∈U
(IH)
= ↓= ϕ + ϕ(y) · ↓= ϕ + ϕ(y)
0 0

(*)
= ↓= ϕ .
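Formula (1.38) is again easy to test against a brute-force count; the following sketch (with dicts as our illustrative encoding of multisets) compares both sides on a small example:

```python
from math import prod
from itertools import combinations, product

def downset_formula(phi):
    # Right-hand side of (1.38): sum over non-empty U of prod_{x in U} phi(x).
    xs = [x for x in phi if phi[x] > 0]
    return sum(prod(phi[x] for x in U)
               for r in range(1, len(xs) + 1)
               for U in combinations(xs, r))

def downset_brute(phi):
    # Directly count all psi with psi <= phi and psi != phi.
    xs = list(phi)
    return sum(1 for counts in product(*(range(phi[x] + 1) for x in xs))
               if any(c < phi[x] for c, x in zip(counts, xs)))

phi = {'a': 2, 'b': 1, 'c': 3}
assert downset_formula(phi) == downset_brute(phi) == 23
```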

Exercises
1.7.1 Check that N[K](1) has one element by Proposition 1.7.4. Describe
this element. How many elements are there in N[K](2)? Describe
them all.
1.7.2 Let ϕ, ψ be natural multisets on the same finite set X.
1 Show that:
$$ \varphi \prec \psi \iff \varphi + 1 \leq \psi \iff \varphi \leq \psi - 1, $$
where $1 = \sum_{x\in X} 1|x\rangle$ is the multiset of singletons on X.
2 Now let ψ ≥ 1. Show that one has:
$$ \left(\!\binom{\psi}{\varphi}\!\right) \,=\, \binom{\psi+\varphi-1}{\varphi}. $$
3 Conclude, analogously to Lemma 1.6.4 (1), that:
$$ \frac{(\varphi)\cdot(\psi-1)}{(\psi+\varphi-1)} \,=\, \frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\binom{L}{K}}, \qquad\text{where } K = \|\varphi\| \text{ and } L = \|\psi+\varphi-1\|. $$

1.7.3 Let N ≥ 0 and M ≥ m ≥ 1 be given. Use the multichoose Vandermonde Equation (1.37) to prove:
$$ \sum_{0\leq i\leq N} \left(\!\binom{m}{i}\!\right)\cdot\binom{M-m+N-i}{N-i} \,=\, \binom{N+M}{N}. $$

1.7.4 Let n ≥ 1 and m ≥ 0.
1 Show that:
$$ \left(\!\binom{n+1}{m+1}\!\right) \,=\, \left(\!\binom{n+1}{m}\!\right) + \left(\!\binom{n}{m+1}\!\right). $$
2 Generalise this to:
$$ \left(\!\binom{n+k}{m+k}\!\right) \,=\, \sum_{0\leq i\leq k} \binom{k}{i}\cdot\left(\!\binom{n+i}{m+k-i}\!\right). $$

1.7.5 Prove the following properties.
$$ \begin{array}{rl}
1 & \left(\!\binom{n-k}{m-(k+1)}\!\right) \,=\, \left(\!\binom{m-k}{n-(k+1)}\!\right) \\
2 & \left(\!\binom{n}{m}\!\right) + \left(\!\binom{m}{n}\!\right) \,=\, \binom{n+m}{n} \\
3 & m\cdot\left(\!\binom{n}{m}\!\right) \,=\, n\cdot\left(\!\binom{n+1}{m-1}\!\right) \\
4 & n\cdot\left(\!\binom{n+1}{m}\!\right) \,=\, (n+m)\cdot\left(\!\binom{n}{m}\!\right) \,=\, (m+1)\cdot\left(\!\binom{n}{m+1}\!\right).
\end{array} $$
1.7.6 1 Show that for numbers m ≤ n − 1,
$$ n\cdot\binom{n-1}{m} \,=\, (m+1)\cdot\binom{n}{m+1} \,=\, (n-m)\cdot\binom{n}{m}. $$
2 Show similarly that for natural multisets ϕ, ψ with x ∈ supp(ψ) and ϕ ≤ ψ − 1|x⟩,
$$ \psi(x)\cdot\binom{\psi-1|x\rangle}{\varphi} \,=\, \big(\varphi(x)+1\big)\cdot\binom{\psi}{\varphi+1|x\rangle} \,=\, \big(\psi(x)-\varphi(x)\big)\cdot\binom{\psi}{\varphi}. $$
3 For x ∈ supp(ϕ) ⊆ supp(ψ),
$$ \varphi(x)\cdot\left(\!\binom{\psi}{\varphi}\!\right) \,=\, \psi(x)\cdot\left(\!\binom{\psi+1|x\rangle}{\varphi-1|x\rangle}\!\right). $$
4 For supp(ϕ) ⊆ supp(ψ) and x ∈ supp(ψ),
$$ \psi(x)\cdot\left(\!\binom{\psi+1|x\rangle}{\varphi}\!\right) \,=\, \big(\varphi(x)+\psi(x)\big)\cdot\left(\!\binom{\psi}{\varphi}\!\right) \,=\, \big(\varphi(x)+1\big)\cdot\left(\!\binom{\psi}{\varphi+1|x\rangle}\!\right). $$
1.7.7 Let n ≥ 1 and m ≥ 1.
1 Show that:
$$ \sum_{j<m} \left(\!\binom{n}{j}\!\right) \,=\, \left(\!\binom{m}{n}\!\right). $$
2 Prove next:
$$ \sum_{i<n} \left(\!\binom{m}{i}\!\right) + \sum_{j<m} \left(\!\binom{n}{j}\!\right) \,=\, \binom{n+m}{n}. $$

1.7.8 1 Extend Exercise 1.3.5 (2) to: for k ≥ 1,
$$ \sum_{0\leq j\leq m} \binom{k+n+j}{n} \,=\, \left(\!\binom{k+m+1}{n+1}\!\right) - \left(\!\binom{k}{n+1}\!\right). $$
Hint: One can use induction on m.
2 Show that one also has:
$$ \sum_{0\leq i\leq n} \left(\!\binom{k}{i}\!\right)\cdot\binom{m+1+n-i}{m} \,=\, \left(\!\binom{k+m+1}{n+1}\!\right) - \left(\!\binom{k}{n+1}\!\right). $$
Hint: Use the multichoose version (1.36) of Vandermonde's formula.
1.7.9 Show that for 0 < k ≤ m + 1,
$$ \sum_{i<k} \left(\!\binom{k-i}{i}\!\right)\cdot\binom{n+m-k+1}{m-i} \,=\, \binom{n+m}{n}. $$

1.7.10 Check that Theorem 1.5.2 (1) can be reformulated as: for a real number x ∈ (0, 1) and K ≥ 1,
$$ \sum_{n\geq 0} \left(\!\binom{K}{n}\!\right)\cdot x^n \,=\, \frac{1}{(1-x)^K}. $$

1.7.11 Let s ∈ [0, 1] and n, m ≥ 1.
1 Prove first the following auxiliary result.
$$ \sum_{j<m} \left(\!\binom{n+1}{j}\!\right)\cdot s^j \,=\, \frac{1}{1-s}\cdot\Big( \sum_{j<m} \left(\!\binom{n}{j}\!\right)\cdot s^j \,-\, \left(\!\binom{n+1}{m-1}\!\right)\cdot s^m \Big). $$
2 Take r = 1 − s, so that r + s = 1, and prove:
$$ r^n\cdot\sum_{j<m} \left(\!\binom{n}{j}\!\right)\cdot s^j \,+\, s^m\cdot\sum_{i<n} \left(\!\binom{m}{i}\!\right)\cdot r^i \,=\, 1. $$
3 Show also that:
$$ \sum_{i\geq 0} \left(\!\binom{n}{m+i}\!\right)\cdot s^i + \left(\!\binom{m}{n+i}\!\right)\cdot r^i \,=\, \frac{1}{r^n\cdot s^m}. $$

1.7.12 Let ϕ be a natural multiset with support {x_1, ..., x_n}. We assume that this support is ordered as a list [x_1, ..., x_n]. Use the proof of Proposition 1.7.7 to show that the number |↓=ϕ| of multisets strictly below ϕ can be computed via the following algorithm.

    result := 0
    for x in [x_1, ..., x_n]:
        result := (ϕ(x) + 1) * result + ϕ(x)
    return result
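A direct Python transcription of the algorithm in Exercise 1.7.12 might look as follows (a sketch of ours, with dicts as multisets). Note that the result also equals $\prod_x (\varphi(x)+1) - 1$, since the sum in (1.38) is the expansion of that product without its empty-subset term:

```python
from math import prod

def below(phi):
    # Number of multisets strictly below phi, following the algorithm
    # of Exercise 1.7.12; phi is a dict x -> multiplicity.
    result = 0
    for x in phi:                         # any fixed order of the support
        result = (phi[x] + 1) * result + phi[x]
    return result

phi = {'a': 2, 'b': 1, 'c': 3}
assert below(phi) == prod(n + 1 for n in phi.values()) - 1 == 23
```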

1.8 Channels
The previous sections covered the collection types of lists, subsets, and multi-
sets, with much emphasis on the similarities between them. In this section we
will exploit these similarities in order to introduce the concept of channel, in a
uniform approach, for all of these collection types at the same time. This will
show that these data types are not only used for certain types of collections,
but also for certain types of computation. Much of the rest of this book builds
on the concept of a channel, especially for probabilistic distributions, which
are introduced in the next chapter. The same general approach to channels that
will be described in this section will work for distributions.
Let T be one of the collection functors L, P, or M. A state of type Y is an
element ω ∈ T (Y); it collects a number of elements of Y in a certain way. In this
section we abstract away from the particular type of collection. A channel, or
sometimes more explicitly T -channel, is a collection of states, parameterised
by a set. Thus, a channel is a function of the form c : X → T (Y). Such a
channel turns an element x ∈ X into a certain collection c(x) of elements of
Y. Just like an ordinary function f : X → Y can be seen as a computation, we
see a T -channel as a computation of type T . For instance, a channel X → P(Y)
is a non-deterministic computation and a channel X → M(Y) is a resource-
sensitive computation.


When it is clear from the context what T is, we often write a channel using functional notation, as c : X → Y, with a circle on the shaft of the arrow. Notice that a channel 1 → X from the singleton set 1 = {0} can be identified with a state on X.
Definition 1.8.1. Let T ∈ {L, P, M}, each of which is functorial, with its own flatten operation, as described in previous sections.
1 For a state ω ∈ T(X) and a channel c : X → T(Y) we can form a new state c =≪ ω of type Y. It is defined as:
$$ c \mathrel{=\!\ll} \omega \,:=\, \big(flat \circ T(c)\big)(\omega)
\qquad\text{where}\qquad
T(X) \xrightarrow{\;T(c)\;} T(T(Y)) \xrightarrow{\;flat\;} T(Y). $$
This operation c =≪ ω is called state transformation, sometimes more explicitly state transformation along the channel c. In functional programming it is called bind¹.
2 Let c : X → Y and d : Y → Z be two channels. Then we can compose them and get a new channel d ◦· c : X → Z via:
$$ \big(d \mathbin{\circ\!\cdot} c\big)(x) \,:=\, d \mathrel{=\!\ll} c(x)
\qquad\text{so that}\qquad
d \mathbin{\circ\!\cdot} c \,=\, flat \circ T(d) \circ c. $$
We first look at some examples of state transformation.
Example 1.8.2. Take X = {a, b, c} and Y = {u, v}.
1 For T = L an example of a state ω ∈ L(X) is ω = [c, b, b, a]. An L-channel f : X → L(Y) can for instance be given by:
$$ f(a) = [u, v] \qquad f(b) = [u, u] \qquad f(c) = [v, u, v]. $$
State transformation f =≪ ω amounts to 'map list' with f and then flattening. It turns a list of lists into a list, as in:
$$ \begin{array}{rcl}
f \mathrel{=\!\ll} \omega \,=\, flat\big(\mathcal{L}(f)(\omega)\big)
& = & flat\big([f(c), f(b), f(b), f(a)]\big) \\
& = & flat\big([[v, u, v], [u, u], [u, u], [u, v]]\big) \\
& = & [v, u, v, u, u, u, u, u, v].
\end{array} $$
2 We consider the analogous example for T = P. We thus take as state ω = {a, b, c} and as channel f : X → P(Y) with:
$$ f(a) = \{u, v\} \qquad f(b) = \{u\} \qquad f(c) = \{u, v\}. $$
Then:
$$ f \mathrel{=\!\ll} \omega \,=\, flat\big(\mathcal{P}(f)(\omega)\big)
\,=\, \bigcup\,\big\{f(a), f(b), f(c)\big\}
\,=\, \bigcup\,\big\{\{u, v\}, \{u\}, \{u, v\}\big\}
\,=\, \{u, v\}. $$
3 For multisets, a state in M(X) could be of the form ω = 3|a⟩ + 2|b⟩ + 5|c⟩ and a channel f : X → M(Y) could have:
$$ f(a) = 10|u\rangle + 5|v\rangle \qquad f(b) = 1|u\rangle \qquad f(c) = 4|u\rangle + 1|v\rangle. $$
We then get as state transformation:
$$ \begin{array}{rcl}
f \mathrel{=\!\ll} \omega \,=\, flat\big(\mathcal{M}(f)(\omega)\big)
& = & flat\big(3|f(a)\rangle + 2|f(b)\rangle + 5|f(c)\rangle\big) \\
& = & 3\cdot\big(10|u\rangle + 5|v\rangle\big) + 2\cdot\big(1|u\rangle\big) + 5\cdot\big(4|u\rangle + 1|v\rangle\big) \\
& = & 30|u\rangle + 15|v\rangle + 2|u\rangle + 20|u\rangle + 5|v\rangle \\
& = & 52|u\rangle + 20|v\rangle.
\end{array} $$
¹ Our bind notation c =≪ ω differs from the one used in the functional programming language Haskell; there one writes the state first, as in ω ≫= c. For us, the channel c acts on the state ω and is thus written before the argument. This is in line with standard notation f(x) in mathematics, for a function f acting on an argument x. Later on, we shall use predicate transformation along a channel, where we also write the channel c first, since it acts on the predicate p. Similarly, in categorical logic the corresponding pullback (or substitution) is written as c*(p) with the channel before the predicate.

We shall mostly be using multiset channels — and probabilistic channels, as a special case — and so we explicitly describe state transformation =≪ in those cases. So let c : X → M(Y) be an M-channel. Transformation of a state ω of type X can be described as:
$$ \big(c \mathrel{=\!\ll} \omega\big)(y) \,=\, \sum_{x\in X} c(x)(y)\cdot\omega(x). \qquad (1.39) $$
Equivalently, we can describe the transformed state c =≪ ω as a formal sum:
$$ c \mathrel{=\!\ll} \omega \,=\, \sum_{y\in Y} \Big(\sum_{x\in X} c(x)(y)\cdot\omega(x)\Big) \Big|\, y \,\Big\rangle. \qquad (1.40) $$
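Formula (1.39) for multiset channels is a small nested loop; the following Python sketch, with `Counter` as multisets and a dict-backed channel (both our own encoding), reproduces Example 1.8.2 (3):

```python
from collections import Counter

def transform(c, omega):
    # State transformation c =<< omega for an M-channel c, following (1.39):
    # (c =<< omega)(y) = sum_x c(x)(y) * omega(x).
    out = Counter()
    for x, m in omega.items():
        for y, n in c(x).items():
            out[y] += n * m
    return out

omega = Counter({'a': 3, 'b': 2, 'c': 5})
f = {'a': Counter({'u': 10, 'v': 5}),
     'b': Counter({'u': 1}),
     'c': Counter({'u': 4, 'v': 1})}.get
assert transform(f, omega) == Counter({'u': 52, 'v': 20})
```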

We now prove some general properties about state transformation and about
composition of channels, based on the abstract description in Definition 1.8.1.

Lemma 1.8.3. 1 Channel composition ◦· has a unit, namely unit : Y → Y, so that:
$$ unit \mathbin{\circ\!\cdot} c \,=\, c \qquad\text{and}\qquad d \mathbin{\circ\!\cdot} unit \,=\, d, $$
for all channels c : X → Y and d : Y → Z. Another way to write the second equation is: d =≪ unit(y) = d(y).
2 Channel composition ◦· is associative:
$$ e \mathbin{\circ\!\cdot} (d \mathbin{\circ\!\cdot} c) \,=\, (e \mathbin{\circ\!\cdot} d) \mathbin{\circ\!\cdot} c, $$
for all channels c : X → Y, d : Y → Z and e : Z → W.
3 State transformation via a composite channel is the same as two consecutive transformations:
$$ \big(d \mathbin{\circ\!\cdot} c\big) \mathrel{=\!\ll} \omega \,=\, d \mathrel{=\!\ll} \big(c \mathrel{=\!\ll} \omega\big). $$
4 Each ordinary function f : Y → Z gives rise to a 'trivial' or 'deterministic' channel ⟨f⟩ := unit ∘ f : Y → Z. This construction ⟨−⟩ satisfies:
$$ \langle f \rangle \mathrel{=\!\ll} \omega \,=\, T(f)(\omega), $$
where T is the type of channel involved. Moreover:
$$ \langle g \rangle \mathbin{\circ\!\cdot} \langle f \rangle \,=\, \langle g \circ f \rangle \qquad\quad \langle f \rangle \mathbin{\circ\!\cdot} c \,=\, T(f) \circ c \qquad\quad d \mathbin{\circ\!\cdot} \langle f \rangle \,=\, d \circ f, $$
for all functions g : Z → W and channels c : X → Y and d : Y → W.

Proof. We can give generic proofs, without knowing the type T ∈ {L, P, M}
of the channel, by using earlier results like Lemma 1.2.5, 1.3.2, and 1.4.3. No-
tice that we carefully distinguish channel composition ◦· and ordinary function
composition ◦.

1 Both equations follow from the flat − unit law. By Definition 1.8.1 (2):

unit ◦· c = flat ◦ T (unit) ◦ c = id ◦ c = c.

For the second equation we use naturality of unit in:

d ◦· unit = flat ◦ T (d) ◦ unit = flat ◦ unit ◦ d = id ◦ d = d.

2 The proof of associativity uses naturality and also the commutation of flatten
with itself (the ‘flat − flat law’), expressed as flat ◦ flat = flat ◦ T (flat).

e ◦· (d ◦· c) = flat ◦ T (e) ◦ (d ◦· c)
= flat ◦ T (e) ◦ flat ◦ T (d) ◦ c
= flat ◦ flat ◦ T (T (e)) ◦ T (d) ◦ c by naturality of flat
= flat ◦ T (flat) ◦ T (T (e)) ◦ T (d) ◦ c by the flat − flat law
=

flat ◦ T flat ◦ T (e) ◦ d ◦ c by functoriality of T
=

flat ◦ T e ◦· d ◦ c
= (e ◦· d) ◦· c

59
60 Chapter 1. Collections

3 Along the same lines:


    (d ◦· c) =≪ ω = (flat ◦ T (d ◦· c))(ω)
                 = (flat ◦ T (flat ◦ T (d) ◦ c))(ω)
                 = (flat ◦ T (flat) ◦ T (T (d)) ◦ T (c))(ω)   by functoriality of T
                 = (flat ◦ flat ◦ T (T (d)) ◦ T (c))(ω)       by the flat-flat law
                 = (flat ◦ T (d) ◦ flat ◦ T (c))(ω)           by naturality of flat
                 = (flat ◦ T (d))((flat ◦ T (c))(ω))
                 = (flat ◦ T (d))(c =≪ ω)
                 = d =≪ (c =≪ ω).


4 All these properties follow from elementary facts that we have seen before:

    ⟨ f ⟩ =≪ ω = (flat ◦ T (unit ◦ f ))(ω)
              = (flat ◦ T (unit) ◦ T ( f ))(ω)    by functoriality of T
              = T ( f )(ω)                        by a flat-unit law

    ⟨g⟩ ◦· ⟨ f ⟩ = flat ◦ T (unit ◦ g) ◦ unit ◦ f
               = flat ◦ unit ◦ (unit ◦ g) ◦ f     by naturality of unit
               = unit ◦ g ◦ f                     by a flat-unit law
               = ⟨g ◦ f ⟩

    ⟨ f ⟩ ◦· c = flat ◦ T (unit ◦ f ) ◦ c
              = flat ◦ T (unit) ◦ T ( f ) ◦ c     by functoriality of T
              = T ( f ) ◦ c                       by a flat-unit law

    d ◦· ⟨ f ⟩ = flat ◦ T (d) ◦ unit ◦ f
              = flat ◦ unit ◦ d ◦ f               by naturality of unit
              = d ◦ f                             by a flat-unit law.
In the sequel we often omit writing the brackets ⟨−⟩ that turn an ordinary
function f : X → Y into a channel ⟨ f ⟩. For instance, in a state transformation
f =≪ ω, it is clear that we use f as a channel, so that the expression should be
read as ⟨ f ⟩ =≪ ω.
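The unit and associativity laws above can be tested on a small concrete case. The sketch below (our own illustration in Python, not from the book) represents M-channels as functions returning dictionaries of multiplicities; `transform` plays the role of state transformation along a channel, and `kleisli` the role of channel composition ◦·.

```python
# Sketch: an M-channel X -> M(Y) is a Python function returning a dict
# {y: r, ...}, standing for the multiset r|y> + ...  All names are our own.

def unit(x):
    # singleton multiset 1|x>
    return {x: 1}

def transform(c, omega):
    # state transformation along channel c: (flat o M(c))(omega)
    out = {}
    for x, r in omega.items():
        for y, s in c(x).items():
            out[y] = out.get(y, 0) + r * s
    return out

def kleisli(d, c):
    # channel composition d o. c, defined pointwise via transformation
    return lambda x: transform(d, c(x))

c = lambda x: {x: 2, x + 1: 1}     # a sample channel
d = lambda y: {10 * y: 3}          # another sample channel
omega = {0: 1, 1: 3}               # a sample state (multiset)
```

For instance, `transform(kleisli(d, c), omega)` and `transform(d, transform(c, omega))` yield the same dictionary, mirroring item 3 of the lemma, while composing with `unit` on either side leaves a channel unchanged, as in item 1.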

Exercises
1.8.1 For a function f : X → Y define an inverse image P-channel f −1 : Y →
X by:
f −1 (y) B {x ∈ X | f (x) = y}.
Prove that:
(g ◦ f )−1 = f −1 ◦· g−1 and id −1 = unit.


1.8.2 Notice that a state of type X can be identified with a channel 1 → X,
that is, a function 1 → T (X), with singleton set 1 as domain. Check that
under this identification, state transformation c =≪ ω corresponds to
channel composition c ◦· ω.
1.8.3 Let f : X → Y be a channel.
1 Prove that if f is a Pfin -channel, then the state transformation func-
tion f =≪ (−) : Pfin (X) → Pfin (Y) can also be defined via freeness,
namely as the unique function f in Proposition 1.3.4.
2 Similarly, show that f =≪ (−) coincides with the lifted function f
from Exercise 1.4.9, when f is an M-channel.
1.8.4 1 Describe how (non-deterministic) powerset channels can be reversed,
via a bijective correspondence between functions:
X −→ P(Y)
===========
Y −→ P(X)
(A description of this situation in terms of ‘daggers’ will appear in
Example 7.8.1.)
2 Show that for finite sets X, Y there is a similar correspondence for
multiset channels.

1.9 The role of category theory


The previous sections have highlighted several structural properties of, and
similarities between, the collection types list, powerset, multiset — and later
on also distribution. By now readers may ask: what is the underlying structure?
Surely someone must have axiomatised what makes all of this work!
Indeed, this is called category theory. It provides a foundational language
for mathematics, which was first formulated in the 1950s by Saunders Mac
Lane and Samuel Eilenberg (see the first overview book [118]). Category the-
ory focuses on the structural aspects of mathematics and shows that many
mathematical constructions have the same underlying structure. It brings for-
ward many similarities between different areas (see e.g. [119]). Category the-
ory has become very useful in (theoretical) computer science too, since it in-
volves a clear distinction between specification and implementation, see books
like [7, 10, 113, 138]. We refer to those sources for more information.
The role of category theory in capturing the mathematical essentials and
establishing connections also applies to probability theory. William Lawvere,
another founding father of the area, first worked in this direction. Lawvere him-
self published little on this approach to probability theory, but his ideas can be


found in e.g. the early notes [111]. This line of work was picked up, extended,
and published by his PhD student Michèle Giry. Her name continues in the
‘Giry monad’ G of continuous probability distributions, see Section ??. The
precise source of the distribution monad D for discrete probability theory, that
will be introduced in Section 2.1 in the next chapter, is less clear, but it can be
regarded as the discrete version of G. Probabilistic automata have been studied
in categorical terms as coalgebras, see Chapter ??, and e.g. [152] and [70] for
general background information on coalgebra. There is a recent surge in inter-
est in more foundational, semantically oriented studies in probability theory,
through the rise of probabilistic programming languages [59, 153], probabilis-
tic Bayesian reasoning [28, 89], and category theory ??. Probabilistic methods
have received wider attention, for instance, via the current interest in data an-
alytics (see the essay [4]), in quantum probability [128, 25], and in cognition
theory [64, 151].
Readers who know category theory will have recognised its implicit use in
earlier sections. For readers who are not familiar (yet) with category theory,
some basic concepts will be explained informally in this section. This is in no
way a serious introduction to the area. The remainder of this book will continue
to make implicit use of category theory, but will make this usage increasingly
explicit. Hence it is useful to know the basic concepts of category, functor,
natural transformation, and monad. Category theory is sometimes seen as a
difficult area to get into. But our experience is that it is easiest to learn category
theory by recognising its concepts in constructions that you already know. That
is why this chapter started with concrete descriptions of various collections and
their use in channels. For more solid expositions of category theory we refer to
the sources listed above.

1.9.1 Categories
A category is a mathematical structure given by a collection of ‘objects’ with
‘morphisms’ between them. The requirements are that these morphisms are
closed under (associative) composition and that there is an identity morphism
on each object. Morphisms are also called ‘maps’, and are written as f : X →
Y, where X, Y are objects and f is a homomorphism from X to Y. It is tempting
to think of morphisms in a category as actual functions, but there are plenty of
examples where this is not the case.
A category is like an abstract context of discourse, giving a setting in which
one is working, with properties of that setting depending on the category at
hand. We shall give a number of examples.


1 There is the category Sets, whose objects are sets and whose morphisms are
ordinary functions between them. This is a standard example.
2 One can also restrict to finite sets as objects, in the category FinSets, with
functions between them. This category is more restrictive, since for instance
it contains objects n = {0, 1, . . . , n − 1} for each n ∈ N, but not N itself. Also,
in Sets one can take arbitrary products ∏_{i∈I} Xi of objects Xi , over arbitrary
index sets I, whereas in FinSets only finite products exist. Hence FinSets is
a more restrictive world.
3 Monoids and monoid maps have been mentioned in Definition 1.2.1. They
can be organised in a category Mon, whose objects are monoids, and whose
homomorphisms are monoid maps. We now have to check that monoid maps
are closed under composition and that identity functions are monoid maps;
this is easy. Many mathematical structures can be organised into categories
in this way, where the morphisms preserve the relevant structure. For in-
stance, one can form a category PoSets, with partially ordered sets (posets)
as objects, and monotone functions between them as morphisms (also closed
under composition, with identity).
4 For T ∈ {L, P, M} we can form the category Chan(T ). Its objects are arbi-
trary sets X, but its morphisms X to Y are T -channels, X → T (Y), written as
X → Y. We have already seen that channels are closed under composition ◦·
and have unit as identity, see Lemma 1.8.3. We can now say that Chan(T )
is a category.
These categories of channels form good examples of the idea that a cate-
gory forms a universe of discourse. For instance, in Chan(P) we are in the
world of non-deterministic computation, whereas Chan(M) is the world of
computation in which resources are counted.

We will encounter several more examples of categories later on in the book.


Occasionally, the following construction will be used. Given a category C, a
new ‘opposite’ category Cop is formed. It has the same objects as C, but its
morphisms are reversed. Thus f : Y → X in Cop means f : X → Y in C.
Also, given two categories C, D one can form a product category C × D. Its
objects are pairs (X, A) with X an object in C and A an object in D. Similarly,
an arrow (X, A) → (Y, B) in C × D is given by a pair ( f, g) of arrows f : X → Y
in C and g : A → B in D.

1.9.2 Functors
Category theorists like abstraction, hence the question: if categories are so im-
portant, then why not organise them as objects themselves in a superlarge
category Cat, with morphisms between them preserving the relevant structure?
The latter morphisms between categories are called ‘functors’. More precisely,
given categories C and D, a functor F : C → D between them consists of two
mappings, both written F, sending an object X in C to an object F(X) in D,
and a morphism f : X → Y in C to a morphism F( f ) : F(X) → F(Y) in D.
This mapping F should preserve composition and identities, as in: F(g ◦ f ) =
F(g) ◦ F( f ) and F(id X ) = id F(X) .
Earlier we have already called some operations ‘functorial’ for the fact that
they preserve composition and identities. We can now be a bit more precise.

1 Each T ∈ {L, P, Pfin , M, M∗ , N, N∗ , N[K]} is a functor T : Sets → Sets.


This has been described in the beginning of each of the sections 1.2 – 2.1.
2 Taking lists is also a functor L : Sets → Mon. This is in essence the content
of Lemma 1.2.2. One can also view P, Pfin and M as functors Sets → Mon,
see Lemmas 1.3.1 and 1.4.2. Moreover, one can describe P, Pfin as a functor
Sets → PoSets, by considering each set of subsets P(X) and Pfin (X) with
its subset relation ⊆ as partial order. In order to verify this claim one has
to check that P( f ) : P(X) → P(Y) is a morphism of posets, that is, forms a
monotone function. But that is easy.
3 There is also a functor J : Sets → Chan(T ), for each T . It is the identity
on sets/objects: J(X) B X. But it sends a function f : X → Y to the channel
J( f ) B ⟨ f ⟩ = unit ◦ f : X → Y. We have seen, in Lemma 1.8.3 (4), that
J(g ◦ f ) = J(g) ◦· J( f ) and that J(id ) = id , where the latter identity id is
unit in the category Chan(T ). This functor J shows how to embed the world
of ordinary computations (functions) into the world of computations of type
T (channels).
4 Taking the product of two sets can be described as a functor × : Sets×Sets →
Sets. Its action on morphisms was already described at the end of Subsec-
tion 1.1.1, see also Exercise 1.1.3.
5 If we have functors Fi : Ci → Di , for i = 1, 2, then we also have a product
functor F1 × F2 : C1 × C2 → D1 × D2 between product categories, simply
by (F1 × F2 )(X1 , X2 ) = (F1 (X1 ), F2 (X2 )), and similarly for morphisms.
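Functoriality is easy to check mechanically for the powerset example. In the sketch below (our own illustration, with hypothetical names), `P(f)` is the direct-image function; it preserves composition and identities on any finite test set.

```python
def P(f):
    # Direct-image action of the powerset functor: P(f)(U) = {f(x) | x in U}
    return lambda U: {f(x) for x in U}

f = lambda x: x + 1
g = lambda x: 2 * x
U = {0, 1, 2}
```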

1.9.3 Natural transformations


Let us move one further step up the abstraction ladder and look at morphisms
between functors. These are called natural transformations. We have already
seen examples of those as well. Given two functors F, G : C → D, a natural
transformation α from F to G is a collection of maps αX : F(X) → G(X) in D,
indexed by objects X in C. Naturality means that α works in the same way on


all objects and is expressed as follows: for each morphism f : X → Y in C, the


rectangle

                   αX
        F(X) ----------→ G(X)
          |                |
    F( f )|                | G( f )
          ↓                ↓
        F(Y) ----------→ G(Y)
                   αY

in D commutes, that is, G( f ) ◦ αX = αY ◦ F( f ).
Such a natural transformation is often denoted by a double arrow α : F ⇒ G.
We briefly review some of the examples of natural transformations that we
have seen.

1 The various support maps can now be described as natural transformations


supp : L ⇒ Pfin , supp : M ⇒ Pfin and supp : D ⇒ Pfin , see the overview
diagram 1.30.
2 For each T ∈ {L, P, M} we have described maps unit : X → T (X) and
flat : T (T (X)) → T (X) and have seen naturality results about them. We can
now state more precisely that they are natural transformations unit : id ⇒
T and flat : (T ◦ T ) ⇒ T . Here we have used id as the identity functor
Sets → Sets, and T ◦ T as the composite of T with itself, also as a functor
Sets → Sets.
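Naturality of the support map can likewise be tested concretely. The following sketch (our own illustration) represents multisets as dictionaries and checks the square supp ◦ M( f ) = P( f ) ◦ supp on an example.

```python
def M(f):
    # Action of the multiset functor on a function f: sum multiplicities
    def Mf(phi):
        out = {}
        for x, n in phi.items():
            out[f(x)] = out.get(f(x), 0) + n
        return out
    return Mf

def supp(phi):
    # Support of a multiset, as a finite set
    return {x for x, n in phi.items() if n > 0}

f = lambda x: x % 2
phi = {0: 2, 1: 1, 2: 3}
```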

1.9.4 Monads
A monad on a category C is a functor T : C → C that comes with two natural
transformations unit : id ⇒ T and flat : (T ◦ T ) ⇒ T satisfying:

    flat ◦ unit = id = flat ◦ T (unit)
                                                            (1.41)
    flat ◦ flat = flat ◦ T (flat).
All the collection functors L, P, Pfin , M, M∗ , N, N∗ , D, D∞ that we have seen
so far are monads, see e.g., Lemma 1.2.5, 1.3.2, or 1.4.3. For each monad T we
can form a category Chan(T ) of T -channels, that capture computations of type
T , see Subsection 1.9.1. In category theory this is called the Kleisli category of
T . Composition in this category Chan(T ) is called Kleisli composition. In this
book it is written as ◦·, where the context should make clear what the monad T
at hand is.
Monads have become popular in functional programming [126] as mecha-
nisms for including special effects (e.g., for input-output, writing, side-effects,
continuations) into a functional programming language2 . The structure of
probabilistic computation is also given by monads, namely by the discrete
distribution monads D, D∞ and by the continuous distribution monad G.
2 See the online overview https://wiki.haskell.org/Monad_tutorials_timeline
We thus associate the (Kleisli) category Chan(T ) of channels with a mo-
nad T . A second category is associated with a monad T , namely the category
EM(T ) of “Eilenberg-Moore” algebras. The objects of EM(T ) are algebras
α : T (X) → X, satisfying α ◦ unit = id and α ◦ flat = α ◦ T (α). We have seen
algebras for the monads L, Pfin , and M in Propositions 1.2.6, 1.3.5, and 1.4.5.
They capture monoids, commutative idempotent monoids, and commutative
monoids respectively. A morphism in EM(T ) is a morphism of algebras, given
by a commuting rectangle, as described in these propositions. In general, alge-
bras of a monad capture algebraic structure in a uniform manner.
Here is an easy result that describes so-called writer monads.

Lemma 1.9.1. Let M = (M, +, 0) be an arbitrary monoid. The mapping X 7→


M × X forms a monad on the category Sets.

Proof. Let us write T (X) = M × X. For a function f : X → Y we define
T ( f ) : M × X → M × Y by T ( f )(m, x) = (m, f (x)). There is a unit map
unit : X → M × X, namely unit(x) = (0, x), and a flattening map flat : M ×
(M × X) → M × X, given by flat(m, m′ , x) = (m + m′ , x). We skip naturality
and concentrate on the monad equations (1.41). First, for (m, x) ∈ T (X) = M × X,

    (flat ◦ unit)(m, x) = flat(0, m, x) = (0 + m, x) = (m, x)
    (flat ◦ T (unit))(m, x) = flat(m, unit(x)) = flat(m, 0, x) = (m, x).

Next, the flatten-equation holds by associativity of the monoid addition +. This
is left to the reader.
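The proof can be replayed concretely for one specific monoid. The sketch below (our own illustration) takes M to be strings under concatenation, with the empty string as 0, and checks the equations (1.41) on sample values.

```python
# Writer monad T(X) = M x X, for the monoid of strings under concatenation

def T(f):
    return lambda mx: (mx[0], f(mx[1]))

def unit(x):
    return ("", x)                 # the monoid's zero is the empty string

def flat(mmx):
    m, (m2, x) = mmx
    return (m + m2, x)             # monoid addition is concatenation

mx = ("log:", 42)                  # an element of T(X)
mmmx = ("a", ("b", ("c", 7)))      # an element of T(T(T(X)))
```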

We have seen natural transformations as maps between functors. In the special
case where the functors involved are monads, these natural transformations
can be called maps of monads if they additionally commute with the unit and
flatten maps.

Definition 1.9.2. Let T 1 = (T 1 , unit 1 , flat 1 ) and T 2 = (T 2 , unit 2 , flat 2 ) be two
monads (on Sets). A map/homomorphism of monads from T 1 to T 2 is a natural
transformation α : T 1 ⇒ T 2 that commutes with unit and flatten, in the sense
that the following two equations hold, for each set X:

    α ◦ unit 1 = unit 2        and        α ◦ flat 1 = flat 2 ◦ T 2 (α) ◦ α.

In the second equation the composite T 2 (α) ◦ α goes from T 1 (T 1 (X)), via
T 2 (T 1 (X)), to T 2 (T 2 (X)).


The writer monads from Lemma 1.9.1 give simple examples of maps of
monads: if f : M1 → M2 is a map of monoids, then the maps α B f ×id : M1 ×
X → M2 × X form a map of monads.
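For a concrete instance of such a monad map, take the monoid map sending a string to its length. The sketch below (our own illustration, with hypothetical names) checks the two commutation conditions of Definition 1.9.2 on sample values.

```python
# The monoid map len : (strings, +, "") -> (N, +, 0) induces a map of
# writer monads alpha = len x id

def alpha(mx):
    m, x = mx
    return (len(m), x)

def unit1(x): return ("", x)
def unit2(x): return (0, x)

def flat1(mmx):
    m, (m2, x) = mmx
    return (m + m2, x)

def flat2(nnx):
    n, (n2, x) = nnx
    return (n + n2, x)

def T2(f):
    return lambda nx: (nx[0], f(nx[1]))

mmx = ("ab", ("cde", 5))           # an element of T1(T1(X))
```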
For a historical account of monads and their applications we refer to [66].

Exercises
1.9.1 We have seen the functor J : Sets → Chan(T ). Check that there is
also a functor Chan(T ) → Sets in the opposite direction, which is
X 7→ T (X) on objects, and c 7→ c =≪ (−) on morphisms. Check ex-
plicitly that composition is preserved, and find the earlier result that
stated that fact implicitly.
1.9.2 Recall from (1.15) the subset N[K](X) ⊆ M(X) of natural multisets
with K elements. Prove that N[K] is a functor Sets → Sets.
1.9.3 Show that Exercise 1.8.1 implicitly describes a functor Setsop →
Chan(P), which is the identity on objects.
1.9.4 Show that the zip function from Exercise 1.1.7 is natural: for each
pair of functions f : X → U and g : Y → V the following diagram
commutes.

                       zip
    X^K × Y^K ----------------→ (X × Y)^K
        |                           |
 f^K × g^K|                          | ( f × g)^K
        ↓                           ↓
    U^K × V^K ----------------→ (U × V)^K
                       zip

1.9.5 Fill in the remaining details in the proof of Lemma 1.9.1: that T is
a functor, that unit and flat are natural transformation, and that the
flatten equation holds.
1.9.6 For arbitrary sets X, A, write X + A for the disjoint union (coproduct)
of X and A, which may be described explicitly by tagging elements
with 1, 2 in order to distinguish them:

X + A = {(x, 1) | x ∈ X} ∪ {(a, 2) | a ∈ A}.

Write κ1 : X → X + A and κ2 : A → X + A for the two obvious func-


tions.
1 Keep the set A fixed and show that the mapping X 7→ X + A can be
extended to a functor Sets → Sets.
2 Show that it is actually a monad; it is sometimes called the excep-
tion monad, where the elements of A are seen as exceptions in a
computation.


1.9.7 Check that the support and accumulation functions form maps of
monads in the situations:
1 supp : M(X) ⇒ P(X);
2 acc : L(X) ⇒ M(X).
1.9.8 Let T = (T, unit, flat) be a monad. By definition, it involves T as
a functor T : Sets → Sets. Show that T can be ‘lifted’ to a functor
T : Chan(T ) → Chan(T ). It is defined on objects as T (X) B T (X)
and on a morphism f : X → Y as the composite:

    T ( f ) B unit ◦ flat ◦ T ( f ) : T (X) −→ T (T (Y)),

that is, T (X) → T (T (Y)) → T (Y) → T (T (Y)), via T ( f ), then flat, then unit.

Prove that T is a functor, i.e. that it preserves (channel) identities and


composition.

2

Discrete probability distributions

The previous chapter has introduced products, lists, subsets and multisets as
basic collection types and has made some of their basic properties explicit.
This serves as preparation for the current chapter, the first on probability
distributions. We shall see that distributions also form a collection type, with much
analogous structure. In fact distributions are special multisets, where multi-
plicities add up to one.
This chapter introduces the basics of probability distributions and of prob-
abilistic channels (as indexed / conditional distributions). These notions will
play a central role in the rest of this book. Distributions will be defined as spe-
cial multisets, so that there is a simple inclusion of the set of distributions in
the set of multisets multisets, on a particular space. In the other direction, this
chapter describes the ‘frequentist learning’ construction, which turns a multi-
set into a distribution, essentially by normalisation. The chapter also describes
several constructions on distributions, like the convex sum, parallel product,
and addition of distributions (in the special case when the underlying space
happens to be a commutative monoid). Parallel products ⊗ of distributions and
joint distributions (on product spaces) are rather special, for instance, because
a joint distribution is typically not equal to the product of its marginals. In-
deed, joint distributions may involve correlations between the different (prod-
uct) components, so that updates in one component have ‘crossover’ effect in
other components. This magical phenomenon will be elaborated in later chap-
ters.
The chapter closes with an example of a Bayesian network. It shows that
the conditional probability tables that are associated with nodes in a Bayesian
network are instances of probabilistic channels. As a result, one can system-
atically organise computations in Bayesian networks as suitable (sequential
and/or parallel) compositions of channels. This is illustrated via calculations
of various predicted probabilities.


2.1 Probability distributions


This section is the first one about probability, in elementary form. It introduces
finite discrete probability distributions, which we often simply call distribu-
tions or states. In the literature they are also called ‘multinomial’ or ‘categor-
ical’ distributions. The notation and definitions that we use for distributions
are very much like for multisets, since distributions form a subset of multisets,
namely where multiplicities add up to one.
What we call distribution, is in fact a finite discrete probability distribution.
We will use state as synonym for distribution. A distribution over a set X is a
finite formal convex sum of the form:
    r1 | x1 i + · · · + rn | xn i    where xi ∈ X and ri ∈ [0, 1] with Σ_{i} ri = 1.

We can write such an expression as a sum Σ_{i} ri | xi i. It is called a convex sum
since the ri add up to one. Thus, a distribution is a special ‘probabilistic’ mul-
tiset.
We write D(X) for the set of distributions on a space / set X, so that there
is an inclusion D(X) ⊆ M(X) of distributions on X in the set of multisets
on X. Via this inclusion, we use the same conventions for distributions, as
for multisets; they were described in the three bullet points in the beginning
of Section 1.4. This set X is often called the sample space, see e.g. [144],
the outcome space, the underlying space, or simply the underlying set. Each
element x ∈ X gives rise to a distribution 1| x i ∈ D(X), which is 1 on x and 0
everywhere else. It is called a Dirac distribution, a point mass, or also a point
distribution. The mapping x 7→ 1| x i is the unit function unit : X → D(X).
For a coin we can use the set {H, T } with elements for head and tail as sample
space. A fair coin is described on the left below, as a distribution over this set;
the distribution on the right gives a coin with a slight bias.
    (1/2)| H i + (1/2)| T i            0.51| H i + 0.49| T i.

In general, for a finite set X = {x1 , . . . , xn } there is a uniform distribution
unif X ∈ D(X) that assigns the same probability to each element. Thus, it
is given by unif X = Σ_{1≤i≤n} (1/n)| xi i. The above fair coin is a uniform distribu-
tion on the two-element set {H, T }. Similarly, a fair dice can be described as
unif pips = (1/6)| 1 i + (1/6)| 2 i + (1/6)| 3 i + (1/6)| 4 i + (1/6)| 5 i + (1/6)| 6 i,
where pips = {1, 2, 3, 4, 5, 6}.
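Such distributions are easy to represent concretely as dictionaries from elements to probabilities. The sketch below (our own illustration, using exact fractions) builds the fair coin and the fair dice as uniform distributions.

```python
from fractions import Fraction

def uniform(xs):
    # The uniform distribution unif_X on a finite set, as a dictionary
    n = len(xs)
    return {x: Fraction(1, n) for x in xs}

fair_coin = uniform(["H", "T"])
fair_dice = uniform([1, 2, 3, 4, 5, 6])
```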
Figure 2.1 shows bar charts of several distributions. The last one describes the
letter frequencies in English for the latin alphabet. One commonly does not dis-
tinguish upper and lower cases in such frequencies, so we take the 26-element
set A = {a, b, c, . . . , z} of lower cases as sample space. The distribution itself


Figure 2.1 Plots of a slightly biased coin distribution 0.51| H i + 0.49| T i and a
fair (uniform) dice distribution on {1, 2, 3, 4, 5, 6} in the top row, together with the
distribution of letter frequencies in English at the bottom.

can be described as formal sum:


0.082|a i + 0.015| b i + 0.028| c i + 0.043| d i + 0.13|e i + 0.022| f i
+ 0.02|g i + 0.061| h i + 0.07| i i + 0.0015| j i + 0.0077| k i
+ 0.04|l i + 0.024|m i + 0.067| n i + 0.075| oi + 0.019| pi
+ 0.00095|q i + 0.06| r i + 0.063| s i + 0.091| t i + 0.028|u i
+ 0.0098| v i + 0.024| wi + 0.0015| x i + 0.02|y i + 0.0074| z i.
These frequencies have been copied from Wikipedia. Interestingly, they do not
precisely add up to 1, but to 1.01085, probably due to rounding. Thus, strictly
speaking, this is not a probability distribution but a multiset.
Below we describe several standard examples of distributions — which will
play an important role in the rest of the book. Especially the three ‘draw’ dis-
tributions — multinomial, hypergeometric, Pólya — capture probabilities as-
sociated with drawing coloured balls from an urn. The differences between
these distributions involve the changes in the urn after the draws and can be
characterised as −1, 0, and +1. In the hypergeometric case a drawn ball is re-
moved (−1), in the multinomial case a drawn ball is returned so that the urn


remains unaltered (0), and in the Pólya case the drawn ball is returned to the
urn together with an extra ball of the same colour (+1).
Example 2.1.1. We shall explicitly describe several familiar distributions us-
ing the above formal convex sum notation.
1 The coin that we have seen above can be parametrised via a ‘bias’ probabil-
ity r ∈ [0, 1]. The resulting coin is often called flip and is defined as:
flip(r) B r| 1i + (1 − r)|0 i
where 1 may understood as ‘head’ and 0 as ‘tail’. We may thus see flip as a
function flip : [0, 1] → D({0, 1}) from probabilities to distributions over the
sample space 2 = {0, 1} of Booleans. This flip(r) is often called the Bernoulli
distribution, with parameter r ∈ [0, 1].
2 For each number K ∈ N and probability r ∈ [0, 1] there is the familiar
binomial distribution bn[K](r) ∈ D({0, 1, . . . , K}). It captures probabilities
for iterated coin flips, and is given by the convex sum:

    bn[K](r) B Σ_{0≤k≤K} (K choose k) · r^k · (1 − r)^{K−k} | k i.

The multiplicity (probability) before | k i in this expression is the chance of
getting k heads out of K coin flips, where each flip has bias r ∈ [0, 1].

getting k heads of out K coin flips, where each flip has bias r ∈ [0, 1].
In this way we obtain a function bn[K] : [0, 1] → D({0, 1, . . . , K}). These
multiplicities are plotted as bar charts in the top row of Figure 2.2, for two
binomial distributions, both on the sample space {0, 1, . . . , 10}.
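The binomial formula can be computed directly. The following sketch (our own illustration, with exact fractions) builds bn[K](r) as a dictionary and confirms that its probabilities add up to one.

```python
from fractions import Fraction
from math import comb

def bn(K, r):
    # The binomial distribution bn[K](r) on {0, 1, ..., K}
    return {k: comb(K, k) * r**k * (1 - r)**(K - k) for k in range(K + 1)}

d = bn(10, Fraction(1, 3))
```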
There are isomorphisms [0, 1] ≅ D(2), via the above flip function, and
{0, 1, . . . , K} ≅ N[K](2), via k 7→ k| 0 i + (K − k)| 1 i. With these isomor-
phisms we can describe binomials as a probabilistic function bn[K] : D(2) →
D(N[K](2)). This re-formulation opens the door to a multivariate version
of this binomial distribution, called the multinomial distribution. It can be
described as a function:

    mn[K] : D(X) −→ D(N[K](X)).                                    (2.1)

The number K ∈ N represents the number of objects that is drawn, from


a distribution ω ∈ D(X), seen as abstract urn. The distribution mn[K](ω)
assigns a probability to a K-sized draw ϕ ∈ N[K](X). There is no bound on
K, since the idea behind multinomial distributions is that drawn objects are
replaced. More details will appear in Chapter 3; at this stage it suffices to
define for ω ∈ D(X),
    mn[K](ω) B Σ_{ϕ∈N[K](X)} (ϕ) · ω^ϕ | ϕ i,                      (2.2)

where (ϕ) is the multinomial coefficient K! / ∏_{x} ϕ(x)!, see Definition 1.5.1 (5),
and where ω^ϕ B ∏_{x} ω(x)^{ϕ(x)} . In the sequel we shall standardly use the
multinomial distribution and view the binomial distribution as a special case.
To see an example, for space X = {a, b, c} and urn ω = (1/3)| a i + (1/2)| b i + (1/6)| c i
the draws of size 3 form a distribution over multisets, described below within
the outer ‘big’ kets | − i.

    mn[3](ω) = (1/27)| 3| a i i + (1/6)| 2| a i + 1| b i i + (1/4)| 1| a i + 2| b i i + (1/8)| 3| b i i
             + (1/18)| 2| a i + 1| c i i + (1/6)| 1| a i + 1| b i + 1| c i i + (1/8)| 2| b i + 1| c i i
             + (1/36)| 1| a i + 2| c i i + (1/24)| 1| b i + 2| c i i + (1/216)| 3| c i i

Via the Multinomial Theorem (1.27) we see that the probabilities in the
above expression (2.2) for mn[K](ω) add up to one:

    Σ_{ϕ∈N[K](X)} (ϕ) · ω^ϕ = Σ_{ϕ∈N[K](X)} (ϕ) · ∏_{x} ω(x)^{ϕ(x)} = (Σ_{x} ω(x))^K = 1^K = 1,

where the middle step uses (1.26).

In addition, we note that the multisets ϕ in (2.2) can be restricted to those


with supp(ϕ) ⊆ supp(ω). Indeed, if ω(x) = 0, but ϕ(x) , 0, for some x ∈ X,
then ω(x)ϕ(x) = 0, so that the whole product becomes zero, and so that ϕ
Q
does not contribute to the above multinomial distribution.
3 The multinomial distribution captures draws with replacement, where drawn
balls are returned to the urn. Hence the urn itself can be described as a
probability distribution. A useful variation is the hypergeometric distribu-
tion which captures draws from an urn without replacement. We briefly in-
troduce this hypergeometric distribution, in multivariate form, and postpone
further analysis to Chapter 3.
When drawn balls are not replaced, the urn in question changes with each
draw. In the hypergeometric case the urn is a multiset, say of size L ∈ N.
Draws are then multisets of size K, with K ≤ L. The hypergeometric func-
tion thus takes the form:

    hg[K] : N[L](X) −→ D(N[K](X)).                                  (2.3)

This function/channel is defined on an L-sized multiset ψ ∈ N[L](X) as:

    hg[K](ψ) B Σ_{ϕ≤K ψ} ( (ψ choose ϕ) / (L choose K) ) | ϕ i
             = Σ_{ϕ≤K ψ} ( ∏_{x} (ψ(x) choose ϕ(x)) / (L choose K) ) | ϕ i.

This yields a probability distribution by Lemma 1.6.2. To see an example,


with ψ = 4| a i + 6| b i + 2| c i as urn, the draws of size 3 give a distribution:

    hg[3](ψ) = (1/55)| 3| a i i + (9/55)| 2| a i + 1| b i i + (3/11)| 1| a i + 2| b i i + (1/11)| 3| b i i
             + (3/55)| 2| a i + 1| c i i + (12/55)| 1| a i + 1| b i + 1| c i i + (3/22)| 2| b i + 1| c i i
             + (1/55)| 1| a i + 2| c i i + (3/110)| 1| b i + 2| c i i.
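The hypergeometric probabilities can be recomputed mechanically. The sketch below (our own illustration, with exact fractions) evaluates ∏_{x} (ψ(x) choose ϕ(x)) / (L choose K) for every draw ϕ of size 3 from the urn 4| a i + 6| b i + 2| c i.

```python
from fractions import Fraction
from math import comb
from itertools import combinations_with_replacement
from collections import Counter

def hg(K, urn):
    # Hypergeometric distribution: draws of size K without replacement;
    # probability of phi is prod_x C(urn(x), phi(x)) / C(L, K)
    L = sum(urn.values())
    dist = {}
    for draw in combinations_with_replacement(sorted(urn), K):
        phi = Counter(draw)
        num = 1
        for x, k in phi.items():
            num *= comb(urn[x], k)          # comb gives 0 when k > urn[x]
        if num > 0:
            dist[frozenset(phi.items())] = Fraction(num, comb(L, K))
    return dist

urn = {"a": 4, "b": 6, "c": 2}
d = hg(3, urn)
```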

4 We have seen that in multinomial mode the drawn ball is returned to the
urn, whereas in hypergeometric mode the drawn ball is removed. There is
a logical third option where the drawn ball is returned, together with one
additional ball of the same colour. This leads to what is called the Pólya
distribution. We shall describe it as a function of the form:

    pl[K] : N∗ (X) −→ D(N[K](X)).                                   (2.4)

Its description is very similar to the hypergeometric distribution, with as
main difference that it uses multichoose binomial coefficients instead of
ordinary binomial coefficients:

    pl[K](ψ) = Σ_{ϕ∈N[K](supp(ψ))} ( (ψ multichoose ϕ) / (‖ψ‖ multichoose K) ) | ϕ i.   (2.5)

Notice that the draws ϕ satisfy supp(ϕ) ⊆ supp(ψ). In this situation it is


most natural to use urns ψ with full support, that is with ψ(x) > 0 for each
x ∈ X. In that case the urn contains at least one ball of each colour.
The above Pólya formula in (2.5) yields a proper distribution by Proposi-
tion 1.7.3. Here is an example, with the same urn ψ = 4| a i + 6| bi + 2| c i as
before, for the hypergeometric distribution.
    pl[3](ψ) = (5/91)| 3| a i i + (15/91)| 2| a i + 1| b i i + (3/13)| 1| a i + 2| b i i + (2/13)| 3| b i i
             + (5/91)| 2| a i + 1| c i i + (12/91)| 1| a i + 1| b i + 1| c i i + (3/26)| 2| b i + 1| c i i
             + (3/91)| 1| a i + 2| c i i + (9/182)| 1| b i + 2| c i i + (1/91)| 3| c i i

The middle and last rows in Figure 2.2 give bar plots for hypergeometric and
Pólya distributions over 2 = {0, 1}. They show the probabilities for numbers
0 ≤ k ≤ 10 in a drawn multiset k| 0 i + (10 − k)| 1i.
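The Pólya probabilities from the example can be checked in the same style. The sketch below (our own illustration) uses the identity ((n over k)) = C(n + k − 1, k) for multichoose coefficients.

```python
from fractions import Fraction
from math import comb
from itertools import combinations_with_replacement
from collections import Counter

def multichoose(n, k):
    # 'multichoose' coefficient ((n over k)) = C(n + k - 1, k)
    return comb(n + k - 1, k)

def pl(K, urn):
    # Polya distribution pl[K](urn), following formula (2.5)
    size = sum(urn.values())
    dist = {}
    for draw in combinations_with_replacement(sorted(urn), K):
        phi = Counter(draw)
        num = 1
        for x, k in phi.items():
            num *= multichoose(urn[x], k)
        dist[frozenset(phi.items())] = Fraction(num, multichoose(size, K))
    return dist

urn = {"a": 4, "b": 6, "c": 2}
d = pl(3, urn)
```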

So far we have described distributions as formal convex sums. But they can
be described, equivalently, in functional form. This is done in the definition
below, which is much like Definition 1.4.1 for multisets. Also for distributions
we shall freely switch between the above ket-formulation and the function-
formulation.


    bn[10](1/3)                          bn[10](3/4)

    hg[10](10| 0 i + 20| 1 i)            hg[10](16| 0 i + 14| 1 i)

    pl[10](1| 0 i + 2| 1 i)              pl[10](8| 0 i + 7| 1 i)

Figure 2.2 Plots of binomial (top), hypergeometric (middle) and Pólya (bottom) distributions

Definition 2.1.2. The set D(X) of all distributions over a set X can be defined
as:

    D(X) B {ω : X → [0, 1] | supp(ω) is finite, and Σ_{x} ω(x) = 1}.

Such a function ω : X → [0, 1], with finite support and values adding up to
one, is often called a probability mass function, abbreviated as pmf.
This D is functorial: for a function f : X → Y we have D( f ) : D(X) → D(Y)
defined either as:
    D( f )(Σ_{i} ri | xi i) B Σ_{i} ri | f (xi ) i     or as:     D( f )(ω)(y) B Σ_{x∈ f⁻¹(y)} ω(x).

A distribution of the form D( f )(ω) ∈ D(Y), for ω ∈ D(X), is sometimes called


an image distribution. One also says that ω is pushed forward along f .
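Pushing forward along f is a one-pass computation. The sketch below (our own illustration, with exact fractions) implements D( f ) on dictionary-represented distributions.

```python
from fractions import Fraction

def D(f):
    # Functorial action of D: push a distribution forward along f
    def Df(omega):
        out = {}
        for x, r in omega.items():
            out[f(x)] = out.get(f(x), 0) + r
        return out
    return Df

omega = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
image = D(lambda n: n % 2)(omega)     # image distribution under parity
```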

One has to check that D( f )(ω) is a distribution again, that is, that its
multiplicities add up to one. This works as follows.

    Σ_{y∈Y} D( f )(ω)(y) = Σ_{y∈Y} Σ_{x∈ f⁻¹(y)} ω(x) = Σ_{x∈X} ω(x) = 1.

We present two examples where functoriality of D is frequently used, but not always in explicit form.

Example 2.1.3. 1 Computing marginals of ‘joint’ distributions involves functoriality of D. In general, one speaks of a joint distribution if its sample space is a product set, of the form X1 × X2, or more generally, X1 × · · · × Xn, for n ≥ 2. The i-th marginal of ω ∈ D(X1 × · · · × Xn) is defined as D(πi)(ω), via the i-th projection function πi : X1 × · · · × Xn → Xi.
For instance, the first marginal of the joint distribution,

ω = 1/12 | H, 0 ⟩ + 1/6 | H, 1 ⟩ + 1/3 | H, 2 ⟩ + 1/6 | T, 0 ⟩ + 1/12 | T, 1 ⟩ + 1/6 | T, 2 ⟩

on the product space {H, T} × {0, 1, 2} is obtained as:

D(π1)(ω) = 1/12 | π1(H, 0) ⟩ + 1/6 | π1(H, 1) ⟩ + 1/3 | π1(H, 2) ⟩ + 1/6 | π1(T, 0) ⟩ + 1/12 | π1(T, 1) ⟩ + 1/6 | π1(T, 2) ⟩
= 1/12 | H ⟩ + 1/6 | H ⟩ + 1/3 | H ⟩ + 1/6 | T ⟩ + 1/12 | T ⟩ + 1/6 | T ⟩
= 7/12 | H ⟩ + 5/12 | T ⟩.

2 Let ω ∈ D(X) be a distribution. In Chapter 4 we shall discuss random variables in detail, but here it suffices to know that a random variable involves a function R : X → ℝ from the sample space of the distribution ω to the real numbers. Often, then, the notation

P[R = r] ∈ [0, 1]

is used to indicate the probability that the random variable R takes value r ∈ ℝ.
Since we now know that D is functorial, we may apply it to the function R : X → ℝ. It gives another function D(R) : D(X) → D(ℝ), so that D(R)(ω) is a distribution on ℝ. We observe:

P[R = r] = D(R)(ω)(r) = Σ_{x ∈ R⁻¹(r)} ω(x).

Thus, P[R = (−)] is an image distribution, on ℝ. In the notation P[R = r] the distribution ω on the sample space X is left implicit.
Here is a concrete example. Recall that we write pips = {1, 2, 3, 4, 5, 6} for the sample space of a die. Let M : pips × pips → pips take the maximum, so M(i, j) = max(i, j). We consider M as a function pips × pips → ℝ via the inclusion pips ↪ ℝ. Then, using the uniform distribution unif ∈ D(pips × pips),

P[M = k] = the probability that the maximum of two dice throws is k
= D(M)(unif)(k)
= Σ_{i, j with max(i, j) = k} unif(i, j)
= Σ_{i ≤ k} unif(i, k) + Σ_{j < k} unif(k, j)
= (2k − 1)/36.
In the notation of this book the image distribution D(M)(unif) on pips is written as in the first line below.

D(M)(unif) = 1/36 | 1 ⟩ + 3/36 | 2 ⟩ + 5/36 | 3 ⟩ + 7/36 | 4 ⟩ + 9/36 | 5 ⟩ + 11/36 | 6 ⟩
= P[M = 1] | 1 ⟩ + P[M = 2] | 2 ⟩ + P[M = 3] | 3 ⟩ + P[M = 4] | 4 ⟩ + P[M = 5] | 5 ⟩ + P[M = 6] | 6 ⟩.

Since in our notation the underlying (uniform) distribution unif is explicit,


we can also change it to another distribution ω ∈ D(pips × pips) and still
do the same computation. In fact, once we have seen product distributions in
Section 2.4 we can compute D(M)(ω1 ⊗ω2 ) where ω1 is a distribution for the
first die (say a uniform, fair one) and ω2 a distribution for the second die
(which may be unfair). This notation in which states are written explicitly
gives much flexibility in what we wish to express and compute, see also
Subsection 4.1.1 below.
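For readers who like to experiment, the image-distribution computation above can be replayed in a few lines of Python. This is a sketch with our own helper name `pushforward` (not a name used in the book), using exact fractions:

```python
from fractions import Fraction
from collections import defaultdict

def pushforward(f, omega):
    """Image distribution D(f)(omega): sum omega(x) over each fibre f^(-1)(y)."""
    out = defaultdict(Fraction)
    for x, p in omega.items():
        out[f(x)] += p
    return dict(out)

pips = range(1, 7)
# The uniform distribution on pips x pips, as a dictionary of probabilities.
unif = {(i, j): Fraction(1, 36) for i in pips for j in pips}

# Push forward along the maximum M(i, j) = max(i, j); by the computation
# above, the probability of outcome k should be (2k - 1)/36.
dist_max = pushforward(lambda ij: max(ij), unif)
```

Replacing unif by any other distribution on pairs of pips reuses the same `pushforward` unchanged, which is exactly the flexibility that functoriality provides.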

In the previous chapter we have seen that the sets L(X), P(X) and M(X) of lists, powersets and multisets all carry a monoid structure. Hence one may expect a similar result saying that D(X) forms a monoid, via an elementwise sum, like for multisets. But that does not work: the elementwise sum of two distributions has multiplicities adding up to two, not one. Instead, one can take convex sums of distributions. This works as follows. Suppose we have two distributions ω, ρ ∈ D(X) and a number s ∈ [0, 1]. Then we can form a new distribution σ ∈ D(X), as convex combination of ω and ρ, namely:

σ B s · ω + (1 − s) · ρ    that is    σ(x) = s · ω(x) + (1 − s) · ρ(x). (2.6)

This obviously generalises to an n-ary convex sum.
At this stage we shall not axiomatise structures with such convex sums; they are sometimes called simply ‘convex sets’ or ‘barycentric algebras’, see [155] or [67] for details. A brief historical account occurs in [99, Remark 2.9].


Remark 2.1.4. We have restricted ourselves to finite probability distributions, by requiring that the support of a distribution is a finite set. This is fine in many situations of practical interest, such as in Bayesian networks. But there are relevant distributions that have infinite support, like the Poisson distribution pois[λ] on N, with ‘mean’, ‘rate’ or ‘intensity’ parameter λ ≥ 0. It can be described as an infinite formal convex sum:

pois[λ] B Σ_{k∈N} e^{−λ} · (λ^k / k!) | k ⟩. (2.7)

This does not fit in D(N). Therefore we sometimes use D∞ instead of D, where the finiteness of support requirement has been dropped:

D∞(X) B {ω : X → [0, 1] | Σ_x ω(x) = 1}. (2.8)

Notice by the way that the multiplicities add up to one in the Poisson distribution because of the basic formula: e^λ = Σ_{k∈N} λ^k / k!. The Poisson distribution is typically used for counts of rare events. The rate or intensity parameter λ is the average number of events per time period. The Poisson distribution then gives for each k ∈ N the probability of having k events per time period. This works when events occur independently.
Exercises 2.1.9 and 2.1.10 contain other examples of discrete distributions with infinite support, namely the geometric and negative binomial distribution. Another illustration is the zeta (or Zipf) distribution, see e.g. [145].
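As a quick numerical sanity check on the Poisson formula (2.7), assuming nothing beyond the standard library (the function name `pois` is ours):

```python
import math

def pois(lam, k):
    """Poisson probability e^(-lam) * lam^k / k!, as in formula (2.7)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# The support is infinite, so we can only look at partial sums; by the
# formula e^lam = sum_k lam^k / k! they converge to one.
partial = sum(pois(4.0, k) for k in range(100))
```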

Exercises
2.1.1 Check that a marginal of a uniform distribution is again a uniform
distribution; more precisely, D(π1 )(unif X×Y ) = unif X .
2.1.2 1 Prove that flip : [0, 1] → D(2) is an isomorphism.
2 Check that flip(r) is the same as bn[1](r).
3 Describe the distribution bn[3](1/4) concretely and interpret this distribution in terms of coin flips.
2.1.3 We often write n B {0, . . . , n − 1} so that 0 = ∅, 1 = {0} and 2 = {0, 1}. Verify that:

L(0) ≅ 1    P(0) ≅ 1    N(0) ≅ 1    M(0) ≅ 1      D(0) ≅ 0
L(1) ≅ N    P(1) ≅ 2    N(1) ≅ N    M(1) ≅ R≥0    D(1) ≅ 1
            P(n) ≅ 2^n  N(n) ≅ N^n  M(n) ≅ (R≥0)^n  D(2) ≅ [0, 1].

The set D(n + 1) is often called the n-simplex. Describe it as a subset of R^{n+1}, and also as a subset of R^n.
2.1.4 Let X = {a, b, c} with draw ϕ = 2| a ⟩ + 3| b ⟩ + 2| c ⟩ ∈ N[7](X).

1 Consider the distribution ω = 1/3 | a ⟩ + 1/2 | b ⟩ + 1/6 | c ⟩ and show that:

mn[7](ω)(ϕ) = 35/432.

Explain this outcome in terms of iterated single draws.
2 Consider the urn ψ = 4| a ⟩ + 6| b ⟩ + 2| c ⟩ ∈ N[12](X) and compute:

hg[7](ψ)(ϕ) = 5/33.

2.1.5 Let a number r ∈ [0, 1] and a finite set X be given. Show that:

Σ_{U∈P(X)} r^{|U|} · (1 − r)^{|X\U|} = 1.

Can you generalise to numbers r1, . . . , rn and partitions of X of size n?
Hint: Recall the binomial and multinomial distribution from Example 2.1.1 (2) and recall Lemma 1.3.7.
2.1.6 Check that the powerbag operation from Exercise 1.6.8 can be turned into a probabilistic powerbag PPB via:

PPB(ψ) B Σ_{ϕ≤ψ} ( (ψ over ϕ) / 2^{kψk} ) | ϕ ⟩.

2.1.7 Let X be a non-empty finite set with n elements. Use Exercise 1.5.7 to check that the following multiset distribution yields a well-defined distribution on N[K](X):

mltdst[K]_X B Σ_{ϕ∈N[K](X)} ( (ϕ) / n^K ) | ϕ ⟩.

Describe mltdst[4]_X for X = {a, b, c}.


2.1.8 1 Recall Theorem 1.5.2 (1) and conclude that lim_{n→∞} (n+K over K) · r^n = 0, for r ∈ [0, 1). (This is a general result: if the partial sums of a series converge, then the terms of the series tend to zero.)
2 Conclude that for r ∈ (0, 1] one has:

lim_{n→∞} bn[n+m](r)(m) = 0.

Explain in your own words what this means.


2.1.9 For a (non-zero) probability r ∈ (0, 1) one defines the geometric distribution geo[r] ∈ D∞(N) as:

geo[r] B Σ_{k∈N>0} r · (1 − r)^{k−1} | k ⟩.

It captures the probability of being successful for the first time after k − 1 unsuccessful tries. Prove that this is a distribution indeed: its multiplicities add up to one.
Hint: Recall the summation formula for geometric series, but don’t confuse this geometric distribution with the hypergeometric distribution from Example 2.1.1 (3).
2.1.10 The negative binomial distribution is of the form nbn[K](s) ∈ D∞(N), for K ≥ 1 and s ∈ (0, 1). It captures the probability of reaching K successes, with success probability s, in n + K trials.

nbn[K](s) B Σ_{n≥0} ((K over n)) · s^K · (1 − s)^n | K + n ⟩
= Σ_{n≥0} (n+K−1 over K−1) · s^K · (1 − s)^n | K + n ⟩
= Σ_{m≥K} (m−1 over K−1) · s^K · (1 − s)^{m−K} | m ⟩.

Use Theorem 1.5.2 (or Exercise 1.5.9) to show that this forms a distribution. We shall describe negative multinomial distributions in Section ??.
2.1.11 Prove that a distribution ω ∈ D∞ (X) necessarily has countable sup-
port.
Hint: Use that each set {x ∈ X | ω(x) > 1/n}, for n > 0, can have only finitely many elements.
2.1.12 Let ω ∈ D(X) be an arbitrary distribution on a set X. We extend it to a distribution ω? on the set L(X) of lists of elements from X. We define the function ω? : L(X) → [0, 1] by:

ω?([x1, . . . , xn]) B ( ω(x1) · . . . · ω(xn) ) / 2^{n+1}.

1 Prove that ω? ∈ D∞(L(X)).
2 Consider the function f : {a, b, c} → {1, 2} with f (a) = 1, f (b) = 1, f (c) = 2. Take ω = 1/3 | a ⟩ + 1/4 | b ⟩ + 5/12 | c ⟩ ∈ D({a, b, c}) and ℓ = [1, 2, 1] ∈ L({1, 2}). Show that:

D( f )(ω)?(ℓ) = 245/27648 = D∞(L( f ))(ω?)(ℓ).

Using the ?-operation as a function:

(−)? : D(X) → D∞(L(X))

we can describe the above equation as the commuting square:

D({a, b, c}) ∋ ω  ↦  ω? ∈ D∞(L({a, b, c}))
      |                        |
    D( f )                D∞(L( f ))
      ↓                        ↓
D({1, 2}) ∋ D( f )(ω)  ↦  D( f )(ω)? ∈ D∞(L({1, 2}))

3 Prove in general that (−)? is a natural transformation from D to D∞ ◦ L.

2.2 Probabilistic channels


In the previous chapter we have seen that taking lists / powersets / multisets is
functorial, giving functors written respectively as L, P, M. Further, we have
seen that these functors all have ‘unit’ and ‘flatten’ operations that satisfy stan-
dard equations that make L, P and M into a monad. Moreover, in terms of
these unit and flatten maps we have defined categories of channels Chan(L),
Chan(P) and Chan(M). In this section we will show that the same monad
structure exists for the distribution functor D and that we thus also have prob-
abilistic channels, in a category Chan(D). In the remainder of this book the
emphasis will be almost exclusively on this probabilistic case, so that ‘chan-
nel’ will standardly mean ‘probabilistic channel’.
The unit and flatten maps for multisets from Subsection 1.4.2 restrict to distributions. The unit function unit : X → D(X) is simply unit(x) B 1| x ⟩. Flattening is the function flat : D(D(X)) → D(X) with:

flat( Σ_i ri | ωi ⟩ ) B Σ_{x∈X} ( Σ_i ri · ωi(x) ) | x ⟩ = Σ_i ri · ωi.

The latter formulation uses a convex sum of distributions (2.6).


The only thing that needs to be checked is that flattening yields a convex sum, i.e. that its probabilities add up to one. This is easy:

Σ_x Σ_i ri · ωi(x) = Σ_i ri · Σ_x ωi(x) = Σ_i ri · 1 = 1.

The familiar properties of unit and flatten hold for distributions too: the analogue of Lemma 1.4.3 holds, with ‘multiset’ replaced by ‘distribution’. We conclude that D, with these unit and flat, is a monad. The same holds for the ‘infinite’ variation D∞ from Remark 2.1.4.
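In the dictionary representation of distributions used in the sketches of this chapter, unit and flat look as follows. Since Python dictionaries cannot serve as keys themselves, the outer distribution handed to `flat` is encoded as a list of (inner distribution, probability) pairs; the helper names are ours:

```python
from fractions import Fraction
from collections import defaultdict

def unit(x):
    """unit(x) = 1|x>, the point (Dirac) distribution."""
    return {x: Fraction(1)}

def flat(outer):
    """flat(sum_i r_i |omega_i>) = sum_i r_i . omega_i, a convex sum of the
    inner distributions; outer is a list of (omega_i, r_i) pairs."""
    out = defaultdict(Fraction)
    for omega, r in outer:
        for x, p in omega.items():
            out[x] += r * p
    return dict(out)

# flat of  1/2 |1|a>>  +  1/2 |1/2|a> + 1/2|b>>  is  3/4|a> + 1/4|b>.
mixed = flat([({'a': Fraction(1)}, Fraction(1, 2)),
              ({'a': Fraction(1, 2), 'b': Fraction(1, 2)}, Fraction(1, 2))])
```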
A probabilistic channel c : X → Y is a function c : X → D(Y). For a state / distribution ω ∈ D(X) we can do state transformation and produce a new state c =≪ ω ∈ D(Y). This happens via the standard definition for =≪ from Section 1.8, see especially (1.39) and (1.40):

c =≪ ω B flat( D(c)(ω) ) = Σ_{y∈Y} ( Σ_{x∈X} c(x)(y) · ω(x) ) | y ⟩. (2.9)

In the probabilistic case we get a probability distribution again, with probabilities adding up to one, since:

Σ_{y∈Y} Σ_{x∈X} c(x)(y) · ω(x) = Σ_{x∈X} ( Σ_{y∈Y} c(x)(y) ) · ω(x) = Σ_{x∈X} 1 · ω(x) = 1.

Moreover, if we have another channel d : Y → D(Z) we can form the composite channel:

d ◦· c B ( X --c--> D(Y) --D(d)--> D(D(Z)) --flat--> D(Z) ).

More explicitly, this gives (d ◦· c)(x) = d =≪ c(x).


As a result we can form a category Chan(D) of probabilistic channels. Its objects are sets X and its morphisms X → Y are probabilistic channels. They can be composed via ◦·, see Lemma 1.8.3, with unit : X → X as identity channel. Recall that an ordinary function f : X → Y can be turned into a (probabilistic) channel ⟨ f ⟩ : X → Y. Explicitly, ⟨ f ⟩(x) = 1| f (x) ⟩. This can be formalised in terms of a functor Sets → Chan(D). Often, we don’t write the ⟨−⟩ angles when the context makes clear what is meant.
We have already seen several examples of such probabilistic channels in Example 2.1.1, namely:

flip : [0, 1] → {0, 1}    and    bn[K] : [0, 1] → {0, 1, . . . , K}.

The multinomial, hypergeometric, and Pólya distributions from (2.1), (2.3) and (2.4) can be organised as channels of the form:

mn[K] : D(X) → N[K](X)    hg[K] : N[L](X) → N[K](X)    pl[K] : N∗(X) → N[K](X).

Example 2.2.1. We now present an example of a probabilistic channel, in the style of the earlier list/powerset/multiset channels in Example 1.8.2. We use again the sets X = {a, b, c} and Y = {u, v}, with state ω = 1/6 | a ⟩ + 1/2 | b ⟩ + 1/3 | c ⟩ ∈ D(X) and channel f : X → D(Y) given by:

f (a) = 1/2 | u ⟩ + 1/2 | v ⟩    f (b) = 1| u ⟩    f (c) = 3/4 | u ⟩ + 1/4 | v ⟩.

We then get as state transformation:

f =≪ ω = flat( D( f )(ω) )
= flat( 1/6 | f (a) ⟩ + 1/2 | f (b) ⟩ + 1/3 | f (c) ⟩ )
= flat( 1/6 | 1/2 | u ⟩ + 1/2 | v ⟩ ⟩ + 1/2 | 1| u ⟩ ⟩ + 1/3 | 3/4 | u ⟩ + 1/4 | v ⟩ ⟩ )
= 1/12 | u ⟩ + 1/12 | v ⟩ + 1/2 | u ⟩ + 1/4 | u ⟩ + 1/12 | v ⟩
= 5/6 | u ⟩ + 1/6 | v ⟩.

This state transformation f =≪ ω describes a mixture (convex sum) of the distributions f (a), f (b), f (c), where the weights of the components of the mixture are given by the distribution ω, see also Exercise 2.2.3 (2). In practice we often compute transformation directly as a convex sum of states, as in:

f =≪ ω = 1/6 · ( 1/2 | u ⟩ + 1/2 | v ⟩ ) + 1/2 · ( 1| u ⟩ ) + 1/3 · ( 3/4 | u ⟩ + 1/4 | v ⟩ ) = 5/6 | u ⟩ + 1/6 | v ⟩.

This ‘mixture’ terminology is often used in a clustering context where the elements a, b, c are the components.
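The computation in the example above can be checked mechanically. The sketch below (our own encoding: a channel is a function into dictionaries) forms the convex sum of the states f(a), f(b), f(c) with weights taken from ω:

```python
from fractions import Fraction
from collections import defaultdict

def transform(c, omega):
    """State transformation along a channel: the mixture sum_x omega(x) . c(x)."""
    out = defaultdict(Fraction)
    for x, p in omega.items():
        for y, q in c(x).items():
            out[y] += p * q
    return dict(out)

omega = {'a': Fraction(1, 6), 'b': Fraction(1, 2), 'c': Fraction(1, 3)}
f = {'a': {'u': Fraction(1, 2), 'v': Fraction(1, 2)},
     'b': {'u': Fraction(1)},
     'c': {'u': Fraction(3, 4), 'v': Fraction(1, 4)}}.get   # channel as a function

result = transform(f, omega)   # expected: 5/6 on u and 1/6 on v
```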

Example 2.2.2 (Copied from [79]). Let’s assume we wish to capture the mood
of a teacher, as a probabilistic mixture of three possible options namely: pes-
simistic (p), neutral (n) or optimistic (o). We thus have a three-element proba-
bility space X = {p, n, o}. We assume a mood distribution:

σ = 1/8 | p ⟩ + 3/8 | n ⟩ + 1/2 | o ⟩ (displayed as a bar plot in the original).

This mood thus tends towards optimism.


Associated with these different moods the teacher has different views on
how pupils perform in a particular test. This performance is expressed in terms
of marks, which can range from 1 to 10, where 10 is best. The probability space
for these marks is written as Y = {1, 2, . . . , 10}.
The view of the teacher is expressed via a channel c : X → Y. It is defined via the following three mark distributions, one for each element in X = {p, n, o}.

c(p) = 1/50 | 1 ⟩ + 2/50 | 2 ⟩ + 10/50 | 3 ⟩ + 15/50 | 4 ⟩ + 10/50 | 5 ⟩ + 6/50 | 6 ⟩ + 3/50 | 7 ⟩ + 1/50 | 8 ⟩ + 1/50 | 9 ⟩ + 1/50 | 10 ⟩    (pessimistic mood marks)

c(n) = 1/50 | 1 ⟩ + 2/50 | 2 ⟩ + 4/50 | 3 ⟩ + 10/50 | 4 ⟩ + 15/50 | 5 ⟩ + 10/50 | 6 ⟩ + 5/50 | 7 ⟩ + 1/50 | 8 ⟩ + 1/50 | 9 ⟩ + 1/50 | 10 ⟩    (neutral mood marks)

c(o) = 1/50 | 1 ⟩ + 1/50 | 2 ⟩ + 1/50 | 3 ⟩ + 2/50 | 4 ⟩ + 4/50 | 5 ⟩ + 10/50 | 6 ⟩ + 15/50 | 7 ⟩ + 10/50 | 8 ⟩ + 4/50 | 9 ⟩ + 2/50 | 10 ⟩    (optimistic mood marks)
Now that the state σ ∈ D(X) and the channel c : X → Y have been described, we can form the transformed state c =≪ σ ∈ D(Y). Following the formulation (2.9) we get for each mark i ∈ Y the probability:

(c =≪ σ)(i) = Σ_x σ(x) · c(x)(i) = σ(p) · c(p)(i) + σ(n) · c(n)(i) + σ(o) · c(o)(i).

The outcome is shown as a bar plot in the original. It is a convex combination of the above three distributions, for c(p), c(n) and c(o), where the weights are determined by the mood distribution σ. This combination contains the ‘predicted’ marks, corresponding to the state of mind of the teacher.

This example will return in later chapters. There, the teacher will be confronted with the marks that the pupils actually obtain. This will lead the teacher to an update of his/her mood. This situation forms an illustration of a predictive coding model, in which the human mind actively predicts the situation in the outside world, depending on its internal state — and updates it when confronted with the actual situation.

In the next example we show how products, multisets and distributions come together in an elementary combinatorial situation.

Example 2.2.3. Let X be an arbitrary set and K be a fixed positive natural number. The K-fold product X^K = X × · · · × X contains the lists of length K. As we have seen in (1.32), the accumulator function acc : L(X) → N(X) from (1.30) restricts to acc : X^K → N[K](X), where, recall, N[K](X) is the set of multisets over X with K elements.
We ask ourselves: do these maps acc : X^K → N[K](X) have inverses? Let ϕ = Σ_i ki | xi ⟩ ∈ N[K](X) be a K-element natural multiset, so kϕk = Σ_i ki = K, where we assume that i ranges over 1, 2, . . . , n. We can surely choose a list ℓ ∈ X^K with acc(ℓ) = ϕ. All we have to do is put the elements in ϕ in a certain order. We can do so in many ways. How many? In Proposition 1.6.3 we have seen that the answer is given by the coefficient (ϕ) of the multiset ϕ ∈ N[K](X), where:

(ϕ) = K! / Π_x ϕ(x)! = K! / (k1! · · · kn!) = (K over k1, . . . , kn).

Returning to our question about an inverse to acc : X^K → N[K](X) we see that we can construct a channel in the reversed direction. We call it arr for ‘arrange’ and define it as arr : N[K](X) → X^K via uniform distributions:

arr(ϕ) B Σ_{~x ∈ acc⁻¹(ϕ)} 1/(ϕ) | ~x ⟩. (2.10)

This is a sum over (ϕ)-many lists ~x with acc(~x) = ϕ. The size K of the multiset involved has disappeared from this formulation, so that we can view arr simply as a channel arr : N(X) → L(X).
For instance, for X = {a, b} with multiset ϕ = 3| a ⟩ + 1| b ⟩ there are (ϕ) = (4 over 3, 1) = 4!/(3! · 1!) = 4 arrangements of ϕ, namely [a, a, a, b], [a, a, b, a], [a, b, a, a], and [b, a, a, a], so that:

arr( 3| a ⟩ + 1| b ⟩ ) = 1/4 | [a, a, a, b] ⟩ + 1/4 | [a, a, b, a] ⟩ + 1/4 | [a, b, a, a] ⟩ + 1/4 | [b, a, a, a] ⟩.

Our next question is: how are accumulator acc and arrangement arr related?

The following commuting triangle gives an answer:

  N(X) --arr--> L(X)
      unit ↘      | ⟨acc⟩        (2.11)
              N(X)

It combines a probabilistic channel arr : N(X) → D(L(X)) with an ordinary function acc : L(X) → N(X), promoted to a deterministic channel ⟨acc⟩. For the channel composition ◦· we make use of Lemma 1.8.3 (4):

( ⟨acc⟩ ◦· arr )(ϕ) = D(acc)( arr(ϕ) )
= D(acc)( Σ_{~x∈acc⁻¹(ϕ)} 1/(ϕ) | ~x ⟩ )
= Σ_{~x∈acc⁻¹(ϕ)} 1/(ϕ) | acc(~x) ⟩
= Σ_{~x∈acc⁻¹(ϕ)} 1/(ϕ) | ϕ ⟩ = (ϕ) · 1/(ϕ) | ϕ ⟩ = 1| ϕ ⟩ = unit(ϕ).

The above triangle (2.11) captures a very basic relationship between sequences, multisets and distributions, via the notion of (probabilistic) channel.
In the other direction, arr(acc(~x)) does not return ~x. It yields a uniform distribution over all permutations of the sequence ~x ∈ X^K.
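The accumulate/arrange relationship of the triangle above can be tried out concretely. In this sketch (our own encoding), a natural multiset is a frozenset of (element, multiplicity) pairs so that it can serve as a dictionary key; `acc` uses Counter and `arr` enumerates the distinct permutations:

```python
from fractions import Fraction
from collections import Counter
from itertools import permutations

def acc(xs):
    """Accumulate a sequence into a natural multiset."""
    return frozenset(Counter(xs).items())

def arr(phi):
    """Uniform distribution on the distinct arrangements of the multiset phi."""
    elems = [x for x, n in phi for _ in range(n)]
    perms = set(permutations(elems))
    return {p: Fraction(1, len(perms)) for p in perms}

phi = acc(['a', 'a', 'a', 'b'])        # the multiset 3|a> + 1|b>
arrangements = arr(phi)                # four lists, each with probability 1/4
```

Every arrangement accumulates back to phi, which is the triangle (2.11) in miniature.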
Deleting and adding an element to a natural multiset are basic operations
that are naturally described as channels.
Definition 2.2.4. For a set X and a number K ∈ N we define a draw-delete channel DD and also a draw-add channel DA in a situation:

DD : N[K +1](X) → N[K](X)    and    DA : N[K](X) → N[K +1](X).

On ψ ∈ N[K +1](X) and χ ∈ N[K](X) they are defined as:

DD(ψ) B Σ_{x∈supp(ψ)} ( ψ(x) / (K +1) ) | ψ − 1| x ⟩ ⟩
DA(χ) B Σ_{x∈supp(χ)} ( χ(x) / K ) | χ + 1| x ⟩ ⟩    (2.12)

For DA we need to assume that χ is non-empty, or equivalently, that K > 0.


In the draw-delete case DD one draws single elements x ∈ X from the urn ψ, with probability determined by the number of occurrences ψ(x) ∈ N of x in the urn ψ. This drawn element x is subsequently removed from the urn via subtraction ψ − 1| x ⟩.
In the draw-add case DA one also draws single elements x from the urn χ, but instead of deleting x one adds an extra copy of x, via the sum χ + 1| x ⟩. This is typical for Pólya style urns, see [115] or Section 3.4 for more information.
These channels are not each other’s inverses. For instance:

DD( 3| a ⟩ + 1| b ⟩ ) = 3/4 | 2| a ⟩ + 1| b ⟩ ⟩ + 1/4 | 3| a ⟩ ⟩
DA( 2| a ⟩ + 1| b ⟩ ) = 2/3 | 3| a ⟩ + 1| b ⟩ ⟩ + 1/3 | 2| a ⟩ + 2| b ⟩ ⟩.

In a next step we see that neither DA ◦· DD nor DD ◦· DA is the identity.

( DA ◦· DD )( 3| a ⟩ + 1| b ⟩ )
= DA =≪ ( 3/4 | 2| a ⟩ + 1| b ⟩ ⟩ + 1/4 | 3| a ⟩ ⟩ )
= 3/4 ( 2/3 | 3| a ⟩ + 1| b ⟩ ⟩ + 1/3 | 2| a ⟩ + 2| b ⟩ ⟩ ) + 1/4 ( 1| 4| a ⟩ ⟩ )
= 1/2 | 3| a ⟩ + 1| b ⟩ ⟩ + 1/4 | 2| a ⟩ + 2| b ⟩ ⟩ + 1/4 | 4| a ⟩ ⟩

( DD ◦· DA )( 2| a ⟩ + 1| b ⟩ )
= DD =≪ ( 2/3 | 3| a ⟩ + 1| b ⟩ ⟩ + 1/3 | 2| a ⟩ + 2| b ⟩ ⟩ )
= 2/3 ( 3/4 | 2| a ⟩ + 1| b ⟩ ⟩ + 1/4 | 3| a ⟩ ⟩ ) + 1/3 ( 1/2 | 1| a ⟩ + 2| b ⟩ ⟩ + 1/2 | 2| a ⟩ + 1| b ⟩ ⟩ )
= 1/2 | 2| a ⟩ + 1| b ⟩ ⟩ + 1/6 | 3| a ⟩ ⟩ + 1/6 | 1| a ⟩ + 2| b ⟩ ⟩ + 1/6 | 2| a ⟩ + 1| b ⟩ ⟩.
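These draw channels, and the sample computations just given, can be reproduced in a few lines (a sketch with our own urn encoding, again using frozen multisets as dictionary keys):

```python
from fractions import Fraction

def DD(psi):
    """Draw-delete: remove one element, drawn proportionally to its multiplicity."""
    total = sum(psi.values())
    out = {}
    for x, n in psi.items():
        rest = frozenset((y, m - (y == x)) for y, m in psi.items() if m - (y == x) > 0)
        out[rest] = out.get(rest, 0) + Fraction(n, total)
    return out

def DA(chi):
    """Draw-add: draw one element and put an extra copy back (Polya style)."""
    total = sum(chi.values())
    out = {}
    for x, n in chi.items():
        bigger = dict(chi)
        bigger[x] = n + 1
        key = frozenset(bigger.items())
        out[key] = out.get(key, 0) + Fraction(n, total)
    return out

after_delete = DD({'a': 3, 'b': 1})    # 3/4 on 2a+1b, 1/4 on 3a
after_add = DA({'a': 2, 'b': 1})       # 2/3 on 3a+1b, 1/3 on 2a+2b
```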

When we iterate the draw-and-delete and the draw-and-add channels we


obtain distributions that strongly remind us of the hypergeometric and Pólya
distribution, see Example 2.1.1 (3) and (4). But they are not the same. The
iterations below describe not what is drawn from the urn — as in the hyper-
geometric and Pólya cases — but what is left in the urn after such draws. The
full picture appears in Chapter 3.
The iteration of draws is intuitively very natural, leading to subtraction and
addition of the draws ϕ, in the formulas below. The important thing to note is
that the mathematical formalisation of this intuition works because these draws
are channels and are composed via channel composition ◦·.

Theorem 2.2.5. Iterating K ∈ N times the draw-and-delete and draw-and-add channels from Definition 2.2.4 yields channels:

DD^K : N[L+K](X) → N[L](X)    and    DA^K : N[L](X) → N[L+K](X).

On ψ ∈ N[L+K](X) and χ ∈ N[L](X) they satisfy:

DD^K(ψ) = Σ_{ϕ ≤K ψ} ( (ψ over ϕ) / (L+K over K) ) | ψ − ϕ ⟩
DA^K(χ) = Σ_{ϕ ≤K χ} ( ((χ over ϕ)) / ((L over K)) ) | χ + ϕ ⟩.

The probabilities in these expressions add up to one by Lemma 1.6.2 and Proposition 1.7.3.

Proof. For both equations we use induction on K ∈ N. In both cases the only option for ϕ in N[0](X) is the empty multiset 0, so that DD^0(ψ) = 1| ψ ⟩ and DA^0(χ) = 1| χ ⟩.
For the induction steps we make extensive use of the equations in Exercise 1.7.6. In those cases we shall put ‘(E)’ next to the equation. We start with the induction step for draw-delete. For ψ ∈ N[L+K +1](X),

DD^{K+1}(ψ) = DD^K =≪ DD(ψ)
= DD^K =≪ ( Σ_{x∈supp(ψ)} ( ψ(x) / (L+K+1) ) | ψ − 1| x ⟩ ⟩ )    [by (2.12)]
= Σ_{ϕ∈N[L](X)} ( Σ_{x∈supp(ψ)} ( ψ(x) / (L+K+1) ) · DD^K(ψ − 1| x ⟩)(ϕ) ) | ϕ ⟩
= Σ_{x∈supp(ψ)} Σ_{χ ≤K ψ−1| x ⟩} ( ψ(x) / (L+K+1) ) · ( (ψ−1| x ⟩ over χ) / (L+K over K) ) | ψ − 1| x ⟩ − χ ⟩    [by (IH)]
= Σ_{x∈supp(ψ)} Σ_{χ ≤K ψ−1| x ⟩} ( (χ(x)+1) / (K+1) ) · ( (ψ over χ+1| x ⟩) / (L+K+1 over K+1) ) | ψ − (χ + 1| x ⟩) ⟩    [by (E)]
= Σ_{ϕ ≤K+1 ψ} ( Σ_{x∈supp(ψ)} ϕ(x) / (K+1) ) · ( (ψ over ϕ) / (L+K+1 over K+1) ) | ψ − ϕ ⟩
= Σ_{ϕ ≤K+1 ψ} ( (ψ over ϕ) / (L+K+1 over K+1) ) | ψ − ϕ ⟩,    since kϕk = K + 1.

We reason basically in the same way for draw-add, and now also use Exercise 1.7.5. For χ ∈ N[L](X),

DA^{K+1}(χ) = DA^K =≪ ( Σ_{x∈supp(χ)} ( χ(x) / L ) | χ + 1| x ⟩ ⟩ )    [by (2.12)]
= Σ_{x∈supp(χ)} Σ_{ψ ≤K χ+1| x ⟩} ( χ(x) / L ) · ( ((χ+1| x ⟩ over ψ)) / ((L+1 over K)) ) | χ + 1| x ⟩ + ψ ⟩    [by (IH)]
= Σ_{x∈supp(χ)} Σ_{ψ ≤K χ+1| x ⟩} ( (ψ(x)+1) / (K+1) ) · ( ((χ over ψ+1| x ⟩)) / ((L over K+1)) ) | χ + ψ + 1| x ⟩ ⟩    [by (E)]
= Σ_{ϕ ≤K+1 χ} ( Σ_{x∈supp(χ)} ϕ(x) / (K+1) ) · ( ((χ over ϕ)) / ((L over K+1)) ) | χ + ϕ ⟩
= Σ_{ϕ ≤K+1 χ} ( ((χ over ϕ)) / ((L over K+1)) ) | χ + ϕ ⟩,    since kϕk = K + 1.
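The draw-delete half of the theorem can be spot-checked numerically: iterate the one-step DD channel K times and compare with the closed formula, reading the coefficient (ψ over ϕ) as the product of the pointwise binomial coefficients. This is a sketch under that reading, for the urn 3|a⟩ + 2|b⟩ with K = 2:

```python
from fractions import Fraction
from math import comb

def DD(psi):
    """One draw-delete step on an urn given as a dictionary of multiplicities."""
    total = sum(psi.values())
    out = {}
    for x, n in psi.items():
        rest = frozenset((y, m - (y == x)) for y, m in psi.items() if m - (y == x) > 0)
        out[rest] = out.get(rest, 0) + Fraction(n, total)
    return out

def iterate(channel, state, k):
    """k-fold Kleisli iteration of the channel, pushed through a state."""
    for _ in range(k):
        new = {}
        for urn, p in state.items():
            for urn2, q in channel(dict(urn)).items():
                new[urn2] = new.get(urn2, 0) + p * q
        state = new
    return state

psi = {'a': 3, 'b': 2}    # urn of size L + K = 5, with L = 3 and K = 2
K, L = 2, 3
iterated = iterate(DD, {frozenset(psi.items()): Fraction(1)}, K)

# Closed formula: each draw phi of size K gets probability
#   (prod_x C(psi(x), phi(x))) / C(L + K, K),  leaving the urn psi - phi.
closed = {}
for ka in range(K + 1):
    kb = K - ka
    if ka <= psi['a'] and kb <= psi['b']:
        rest = frozenset((x, n) for x, n in
                         [('a', psi['a'] - ka), ('b', psi['b'] - kb)] if n > 0)
        closed[rest] = Fraction(comb(psi['a'], ka) * comb(psi['b'], kb),
                                comb(L + K, K))
```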

Exercises
2.2.1 Consider the D-channel f : {a, b, c} → {u, v} from Example 2.2.1, together with a new D-channel g : {u, v} → {1, 2, 3, 4} given by:

g(u) = 1/4 | 1 ⟩ + 1/8 | 2 ⟩ + 1/2 | 3 ⟩ + 1/8 | 4 ⟩    g(v) = 1/4 | 1 ⟩ + 1/8 | 3 ⟩ + 5/8 | 3 ⟩.

Describe the composite channel g ◦· f : {a, b, c} → {1, 2, 3, 4} concretely.


2.2.2 Formulate and prove the analogue of Lemma 1.4.3 for D, instead of
for M.
2.2.3 Notice that for a probabilistic channel c one can describe state transformation as a (convex) sum / mixture of states c(x), that is, as:

c =≪ ω = Σ_x ω(x) · c(x).

2.2.4 Identify the channel f and the state ω in Example 2.2.1 with matrices:

M_f = | 1/2  1  3/4 |      M_ω = | 1/6 |
      | 1/2  0  1/4 |            | 1/2 |
                                 | 1/3 |

Such matrices are called stochastic, since each of their columns has non-negative entries that add up to one.
Check that the matrix associated with the transformed state f =≪ ω is the matrix-column multiplication M_f · M_ω.
(A general description appears in Remark 4.3.5.)
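The matrix reading of this exercise, in exact arithmetic (matvec is our own helper):

```python
from fractions import Fraction

# Columns of M_f are the distributions f(a), f(b), f(c); rows correspond to u, v.
M_f = [[Fraction(1, 2), Fraction(1), Fraction(3, 4)],   # row u
       [Fraction(1, 2), Fraction(0), Fraction(1, 4)]]   # row v
M_omega = [Fraction(1, 6), Fraction(1, 2), Fraction(1, 3)]

def matvec(M, v):
    """Matrix-column multiplication M . v."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in M]

transformed = matvec(M_f, M_omega)   # matches the transformed state of Example 2.2.1
```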


2.2.5 Prove that state transformation along D-channels preserves convex combinations (2.6), that is, for r ∈ [0, 1],

f =≪ ( r · ω + (1 − r) · ρ ) = r · ( f =≪ ω ) + (1 − r) · ( f =≪ ρ ).
2.2.6 Write π : X^{K+1} → X^K for the obvious projection function. Show that the following diagram of channels commutes.

N[K +1](X) --DD--> N[K](X)
     | arr              | arr
     ↓                  ↓
  X^{K+1}  --⟨π⟩-->   X^K

2.2.7 Consider the arrangement channels arr : N(X) → L(X) from Example 2.2.3. Prove that arr is natural: for each function f : X → Y the following diagram commutes.

N(X) --arr--> D(L(X))
  | N( f )        | D(L( f ))
  ↓               ↓
N(Y) --arr--> D(L(Y))

(This is far from easy; you may wish to check first what this means in
a simple case and then content yourself with a ‘proof by example’.)
2.2.8 Show that the draw-and-delete and draw-and-add maps DD and DA
in Definition 2.2.4 are natural, from N[K+1] to D ◦ N[K], and from
N[K] to D ◦ N[K +1].
2.2.9 Recall from (1.34) that the set N[K](X) of K-sized natural multisets over a set X with n elements contains ((n over K)) members. We write unif_K ∈ D(N[K](X)) for the uniform distribution over such multisets.
1 Check that:

unif_K = Σ_{ϕ∈N[K](n)} ( 1 / ((n over K)) ) | ϕ ⟩.

2 Prove that these unif_K distributions match the chain of draw-and-delete maps DD : N[K +1](X) → N[K](X) in the sense that:

DD =≪ unif_{K+1} = unif_K.

In categorical language this means that these uniform distributions form a cone.
3 Check that the draw-add maps do not preserve uniform distributions. Check for instance that for X = {a, b},

DA =≪ unif_3 = 1/4 | 4| a ⟩ ⟩ + 1/6 | 3| a ⟩ + 1| b ⟩ ⟩ + 1/6 | 2| a ⟩ + 2| b ⟩ ⟩ + 1/6 | 1| a ⟩ + 3| b ⟩ ⟩ + 1/4 | 4| b ⟩ ⟩.

2.2.10 Let X be a finite set, say with n elements, of the form X = {x1, . . . , xn}. Define for K ≥ 1,

σ_K B Σ_{1≤i≤n} 1/n | K| xi ⟩ ⟩ ∈ D(N[K](X)).

Thus:

σ_1 = 1/n | 1| x1 ⟩ ⟩ + · · · + 1/n | 1| xn ⟩ ⟩
σ_2 = 1/n | 2| x1 ⟩ ⟩ + · · · + 1/n | 2| xn ⟩ ⟩,    etc.

Show that these σ_K form a cone, both for draw-delete and for draw-add:

DD =≪ σ_{K+1} = σ_K    and    DA =≪ σ_K = σ_{K+1}.

Thus, the whole sequence (σ_K)_{K∈N>0} can be generated from σ_1 = unif_{N[1](X)} by repeated application of DA.
2.2.11 Recall the bijective correspondences from Exercise 1.8.4.
1 Let X, Y be finite sets and c : X → D(Y) be a D-channel. We can then define an M-channel c† : Y → M(X) by swapping arguments: c†(y)(x) = c(x)(y). We call c a ‘bi-channel’ if c† is also a D-channel, i.e. if Σ_x c(x)(y) = 1 for each y ∈ Y.
Prove that the identity channel is a bi-channel and that bi-channels are closed under composition.
2 Take A = {a0, a1} and B = {b0, b1} and define a channel bell : A × 2 → D(B × 2) as:

bell(a0, 0) = 1/2 | b0, 0 ⟩                             + 3/8 | b1, 0 ⟩ + 1/8 | b1, 1 ⟩
bell(a0, 1) =              1/2 | b0, 1 ⟩                + 1/8 | b1, 0 ⟩ + 3/8 | b1, 1 ⟩
bell(a1, 0) = 3/8 | b0, 0 ⟩ + 1/8 | b0, 1 ⟩ + 1/8 | b1, 0 ⟩ + 3/8 | b1, 1 ⟩
bell(a1, 1) = 1/8 | b0, 0 ⟩ + 3/8 | b0, 1 ⟩ + 3/8 | b1, 0 ⟩ + 1/8 | b1, 1 ⟩

Check that bell is a bi-channel.
(It captures the famous Bell table from quantum theory; we have deliberately used open spaces in the above description of the channel bell so that non-zero entries align, giving a ‘bi-stochastic’ matrix, from which one can read bell† vertically.)
2.2.12 Check that the inclusions D(X) ↪ M(X) form a map of monads, as described in Definition 1.9.2.
2.2.13 Recall from Exercise 1.9.6 that for a fixed set A the mapping X ↦ X + A is a monad. Prove that X ↦ D(X + A) is also a monad.
(The latter monad will be used in Chapter ?? to describe Markov models with outputs. A composition of two monads is not necessarily again a monad, but it is when there is a so-called distributive law between the monads, see e.g. [70, Chap. 5] for details.)

2.3 Frequentist learning: from multisets to distributions


We have introduced distributions as special multisets, namely as multisets in which the multiplicities add up to one, so that D(X) ⊆ M(X). There are many other relations and interactions between distributions and multisets. As we have mentioned, an urn containing coloured balls can be described aptly as a multiset. The distribution for drawing a ball can then be derived from the multiset. In this section we shall describe this situation in terms of a mapping from multisets to distributions, called ‘frequentist learning’ since it basically involves counting. Later on we shall see that other forms of learning from data can be described in terms of passages from multisets to distributions. This forms a central topic.
In earlier sections we have seen several (natural) mappings between collection types, in the form of support maps, see the overview in Diagram (1.30). We are now adding the mapping Flrn : M∗(X) → D(X), see [76]. The name Flrn stands for ‘frequentist learning’, and may be pronounced as ‘eff-learn’. The frequentist interpretation of probability theory views probabilities as long term frequencies of occurrences. Here, these occurrences are given via multisets, which form the inputs of the Flrn function. Later on, in Theorem 4.5.7 we show that these outcomes of frequentist learning lie dense in the set of distributions, which means that we can approximate each distribution with arbitrary precision via frequentist learning of (natural) multisets.
Recall that M∗(X) is the collection of non-empty multisets Σ_i ri | xi ⟩, with ri ≠ 0 for at least one index i. Equivalently, one can require that the sum s B Σ_i ri = k Σ_i ri | xi ⟩ k is non-zero.
The Flrn map turns a (non-empty) multiset into a distribution, essentially by normalisation:

Flrn( r1 | x1 ⟩ + · · · + rk | xk ⟩ ) B (r1/s) | x1 ⟩ + · · · + (rk/s) | xk ⟩    where s B Σ_i ri. (2.13)

The normalisation step forces the sum on the right-hand side to be a convex sum, with factors adding up to one. Clearly, from an empty multiset we cannot learn a distribution — technically because the above sum s is then zero so that we cannot divide by s.
Using scalar multiplication from Lemma 1.4.2 (2) we can define the Flrn function more succinctly:

Flrn(ϕ) B (1/kϕk) · ϕ    where    kϕk B Σ_x ϕ(x). (2.14)
Example 2.3.1. We present two illustrations of frequentist learning.
1 Suppose we have some coin of which the bias is unknown. There are experimental data showing that out of 50 tosses of the coin, 20 came up head (H) and 30 yielded tail (T). We can present these data as a multiset ϕ = 20| H ⟩ + 30| T ⟩ ∈ M∗({H, T}). When we wish to learn the resulting probabilities, we apply the frequentist learning map Flrn and get a distribution in D({H, T}), namely:

Flrn(ϕ) = 20/(20+30) | H ⟩ + 30/(20+30) | T ⟩ = 2/5 | H ⟩ + 3/5 | T ⟩.

Thus, the bias (towards head) is 2/5. In this simple case we could have obtained this bias immediately from the data, but the Flrn map captures the general mechanism.
Notice that with frequentist learning, more (or less) of the same data gives the same outcome. For instance if we knew that 40 out of 100 tosses were head, or 2 out of 5, we would still get the same bias. Intuitively, these data give more (or less) confidence in the outcome. These aspects are not covered by frequentist learning, but by a more sophisticated form of ‘Bayesian’ learning. Another disadvantage of the rather primitive form of frequentist learning is that prior knowledge, if any, about the bias is not taken into account.
2 Recall the medical table (1.18) captured by the multiset τ ∈ N(B × M). Learning from τ yields the following joint distribution:

Flrn(τ) = 0.1| H, 0 ⟩ + 0.35| H, 1 ⟩ + 0.25| H, 2 ⟩ + 0.05| L, 0 ⟩ + 0.1| L, 1 ⟩ + 0.15| L, 2 ⟩.

Such a distribution, directly derived from a table, is sometimes called an empirical distribution [33].
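Frequentist learning is a one-liner in the dictionary representation used in this chapter's sketches; the function name `flrn` is ours. Applied to the coin data of the example:

```python
from fractions import Fraction

def flrn(phi):
    """Frequentist learning: normalise a non-empty multiset, as in (2.14)."""
    total = sum(phi.values())
    if total == 0:
        raise ValueError("cannot learn from the empty multiset")
    return {x: Fraction(n, total) for x, n in phi.items()}

bias = flrn({'H': 20, 'T': 30})    # the learned coin bias 2/5|H> + 3/5|T>
```

Note that scaled data gives the same outcome: flrn({'H': 40, 'T': 60}) returns the same distribution.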
In the above coin example we saw a property that is typical of frequentist learning, namely that learning from more of the same does not have any effect. We can make this precise via the equation:

Flrn( K · ϕ ) = Flrn(ϕ)    for K > 0. (2.15)
In [57] it is argued that in general, people are not very good at probabilistic (esp. Bayesian) reasoning, but that they are much better at reasoning with “frequency formats”. To put it simply: the information (from D) that there is a 0.04 probability of getting a disease is more difficult to process than the information (from M) that 4 out of 100 people get the disease. In the current setting these frequency formats would correspond to natural multisets; they can be turned into distributions via the frequentist learning map Flrn. In the sequel we regularly return to the interaction between multisets and distributions in relation to drawing from an urn (in Chapter 3) and to learning (in Chapter ??).
It turns out that the learning map Flrn is ‘natural’, in the sense that it works uniformly for each set.

Lemma 2.3.2. The frequentist learning maps Flrn : M∗ (X) → D(X) from (2.13)
are natural in X. This means that for each function f : X → Y the following
diagram commutes.

              Flrn
    M∗(X) ----------> D(X)
      |                 |
M∗(f) |                 | D(f)
      v                 v
    M∗(Y) ----------> D(Y)
              Flrn

As a result, frequentist learning commutes with marginalisation (via projection


functions), see also Subsection 2.5.1.

Proof. Pick an arbitrary non-empty multiset ϕ = Σᵢ rᵢ|xᵢ⟩ in M∗(X) and write s ≔ ‖ϕ‖ = Σᵢ rᵢ. By non-emptiness of ϕ we have s ≠ 0. Then:

    (Flrn ∘ M∗(f))(ϕ) = Flrn(Σᵢ rᵢ |f(xᵢ)⟩)
                      = Σᵢ (rᵢ/s) |f(xᵢ)⟩
                      = D(f)(Σᵢ (rᵢ/s) |xᵢ⟩)
                      = (D(f) ∘ Flrn)(ϕ).
We can apply this basic result to the medical data in Table (1.18), via the
multiset τ ∈ N(B × M). We have already seen in Section 1.4 that the multiset-
marginals N(πi )(τ) produce the marginal columns and rows, with their totals.
We can learn the distributions from the columns as:

    Flrn(M(π₁)(τ)) = Flrn(70|H⟩ + 30|L⟩) = 0.7|H⟩ + 0.3|L⟩.

We can also take the distribution-marginal of the 'learned' distribution from the table, as described in Example 2.3.1 (2):

    D(π₁)(Flrn(τ)) = (0.1 + 0.35 + 0.25)|H⟩ + (0.05 + 0.1 + 0.15)|L⟩
                   = 0.7|H⟩ + 0.3|L⟩.

Hence the basic operations of learning and marginalisation commute. This


is a simple result, which many practitioners in probability are surely aware

2.3. Frequentist learning: from multisets to distributions 95

of, at an intuitive level, but maybe not in the mathematically precise form of
Lemma 2.3.2.
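The commutation of learning and marginalisation can also be checked numerically. The sketch below uses the same dict-based encoding as before; the helper names flrn and pushforward, and the pair-keyed representation of the table τ, are illustrative assumptions of ours:

```python
from fractions import Fraction

def flrn(phi):
    """Normalise a non-empty multiset (dict of counts) into a distribution."""
    total = sum(phi.values())
    return {x: Fraction(n, total) for x, n in phi.items()}

def pushforward(f, phi):
    """M(f) / D(f): transport a multiset or distribution along f,
    adding up multiplicities that land on the same element."""
    out = {}
    for x, n in phi.items():
        y = f(x)
        out[y] = out.get(y, 0) + n
    return out

# tau ~ the medical table: counts over B x M = {H, L} x {0, 1, 2}
tau = {("H", 0): 10, ("H", 1): 35, ("H", 2): 25,
       ("L", 0): 5,  ("L", 1): 10, ("L", 2): 15}

pi1 = lambda pair: pair[0]
lhs = flrn(pushforward(pi1, tau))     # marginalise first, then learn
rhs = pushforward(pi1, flrn(tau))     # learn first, then marginalise
assert lhs == rhs == {"H": Fraction(7, 10), "L": Fraction(3, 10)}
```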
Drawing an object from an urn is an elementary operation in probability
theory, which involves frequentist learning Flrn. For instance, the draw-and-
delete and draw-and-add operations DD and DA from Definition 2.2.4 can be
described, for urns ψ and χ with ‖ψ‖ = K + 1 and ‖χ‖ = K, as:

    DD(ψ) = Σ_{x∈supp(ψ)} ψ(x)/(K+1) |ψ − 1|x⟩⟩ = Σ_{x∈supp(ψ)} Flrn(ψ)(x) |ψ − 1|x⟩⟩

    DA(χ) = Σ_{x∈supp(χ)} χ(x)/K |χ + 1|x⟩⟩ = Σ_{x∈supp(χ)} Flrn(χ)(x) |χ + 1|x⟩⟩.

Since this drawing takes the multiplicities into account, the urns after the
DD and DA draws have the same frequentist distribution as before, but only
if we interpret “after” as channel composition ◦·. This is the content of the
following basic result.

Theorem 2.3.3. One has both:

Flrn ◦· DD = Flrn and Flrn ◦· DA = Flrn.


Equivalently, the following two triangles of channels commute.

    N[K+1](X) ----DD----> N[K](X)          N[K](X) ----DA----> N[K+1](X)
          \               /                      \               /
      Flrn \             / Flrn              Flrn \             / Flrn
            v           v                          v           v
                  X                                      X

Proof. For the proof of commutation of draw-delete and frequentist learning, we take ψ ∈ N[K+1](X) and y ∈ X. Then:

    (Flrn ◦· DD)(ψ)(y)
        = Σ_{ϕ∈N[K](X)} DD(ψ)(ϕ) · Flrn(ϕ)(y)
        = Σ_{x∈X} ψ(x)/(K+1) · Flrn(ψ − 1|x⟩)(y)                     by (2.12)
        = ψ(y)/(K+1) · (ψ(y)−1)/K + Σ_{x≠y} ψ(x)/(K+1) · ψ(y)/K
        = ψ(y)/(K(K+1)) · (ψ(y) − 1 + Σ_{x≠y} ψ(x))
        = ψ(y)/(K(K+1)) · (Σ_{x} ψ(x) − 1)
        = ψ(y)/(K(K+1)) · ((K+1) − 1)
        = ψ(y)/(K+1)
        = Flrn(ψ)(y).


Similarly, for χ ∈ N[K](X), where K > 0, and y ∈ X,

    (Flrn ◦· DA)(χ)(y)
        = Σ_{x∈X} Flrn(χ + 1|x⟩)(y) · χ(x)/K                         by (2.12)
        = (χ(y)+1)/(K+1) · χ(y)/K + Σ_{x≠y} χ(y)/(K+1) · χ(x)/K
        = χ(y)/(K(K+1)) + χ(y)/(K+1) · (χ(y)/K + Σ_{x≠y} χ(x)/K)
        = χ(y)/(K(K+1)) + χ(y)/(K+1) · Σ_{x} χ(x)/K
        = χ(y)/(K(K+1)) + χ(y)/(K+1)
        = (χ(y) + K · χ(y))/(K(K+1))
        = χ(y)/K
        = Flrn(χ)(y).
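Theorem 2.3.3 can also be verified on a concrete urn. The sketch below enumerates the draw-and-delete channel explicitly; the generator-based encoding of DD and the dict urns are illustrative choices of ours, not the book's definitions:

```python
from fractions import Fraction

def flrn(phi):
    total = sum(phi.values())
    return {x: Fraction(n, total) for x, n in phi.items()}

def draw_delete(psi):
    """DD: map an urn of size K+1 to a distribution over urns of size K,
    yielded as (probability, smaller_urn) pairs."""
    total = sum(psi.values())
    for x, n in psi.items():
        smaller = {y: m for y, m in psi.items() if y != x}
        if n > 1:
            smaller[x] = n - 1
        yield Fraction(n, total), smaller

# Flrn after DD (channel composition) equals Flrn directly: Theorem 2.3.3
psi = {"R": 3, "G": 2, "B": 1}               # an urn with 6 balls
composite = {}
for p, urn in draw_delete(psi):
    for y, q in flrn(urn).items():
        composite[y] = composite.get(y, Fraction(0)) + p * q
assert composite == flrn(psi)
```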
Remark 2.3.4. In Lemma 2.3.2 we have seen that Flrn : M∗ → D is a natural
transformation. Since both M∗ and D are monads, one can ask if Flrn is also
a map of monads. This would mean that Flrn also commutes with the unit and
flatten maps, see Definition 1.9.2. This is not the case.
It is easy to see that Flrn commutes with the unit maps, simply because
Flrn(1|x⟩) = 1|x⟩. But commutation with flatten fails. Here is a simple counterexample. Consider the multiset of multisets Φ ∈ M(M({a, b, c})) given by:

    Φ ≔ 1|2|a⟩ + 4|c⟩⟩ + 2|1|a⟩ + 1|b⟩ + 1|c⟩⟩.

First taking (multiset) multiplication, and then doing frequentist learning gives:

    Flrn(flat(Φ)) = Flrn(4|a⟩ + 2|b⟩ + 6|c⟩) = 1/3|a⟩ + 1/6|b⟩ + 1/2|c⟩.

However, first (outer and inner) learning and then doing (distribution) multiplication yields:

    flat(Flrn(M(Flrn)(Φ))) = flat(1/3|1/3|a⟩ + 2/3|c⟩⟩ + 2/3|1/3|a⟩ + 1/3|b⟩ + 1/3|c⟩⟩)
                           = 1/3|a⟩ + 2/9|b⟩ + 4/9|c⟩.
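This counterexample can be replayed mechanically. In the sketch below a multiset of multisets is represented as a list of (outer weight, inner dict) pairs; this encoding, like the helper names, is our own:

```python
from fractions import Fraction

def flrn(phi):
    total = sum(phi.values())
    return {x: Fraction(n, total) for x, n in phi.items()}

def flat(Phi):
    """Flatten a weighted family of multisets/distributions."""
    out = {}
    for r, inner in Phi:
        for x, s in inner.items():
            out[x] = out.get(x, Fraction(0)) + r * s
    return out

Phi = [(1, {"a": 2, "c": 4}), (2, {"a": 1, "b": 1, "c": 1})]

# flatten first, then learn
lhs = flrn(flat(Phi))

# learn (outer and inner) first, then flatten
outer = flrn({i: r for i, (r, _) in enumerate(Phi)})
learned = [(outer[i], flrn(inner)) for i, (_, inner) in enumerate(Phi)]
rhs = flat(learned)

assert lhs == {"a": Fraction(1, 3), "b": Fraction(1, 6), "c": Fraction(1, 2)}
assert rhs == {"a": Fraction(1, 3), "b": Fraction(2, 9), "c": Fraction(4, 9)}
assert lhs != rhs    # Flrn is not a map of monads
```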

Exercises
2.3.1 Recall the data / multisets about child ages and blood types in the
beginning of Subsection 1.4.1. Compute the associated (empirical)
distributions.
Plot these distributions as a graph. How do they compare to the
plots (1.16) and (1.17)?


2.3.2 Check that frequentist learning from a constant multiset yields a uni-
form distribution. And also that frequentist learning is invariant under
(non-zero) scalar multiplication for multisets: Flrn(s · ϕ) = Flrn(ϕ)
for s ∈ R>0 .
2.3.3 1 Prove that for multisets ϕ, ψ ∈ M∗ (X) one has:
        Flrn(ϕ + ψ) = ‖ϕ‖/(‖ϕ‖+‖ψ‖) · Flrn(ϕ) + ‖ψ‖/(‖ϕ‖+‖ψ‖) · Flrn(ψ).

      This means that when one has already learned Flrn(ϕ) and new
      data ψ arrives, all probabilities have to be adjusted, as in the above
      convex sum of distributions.
   2 Check that the following formulation for natural multisets of fixed
      sizes K > 0, L > 0 is a special case of the previous item.

                                     +
          N[K](X) × N[L](X) ------------------> N[K+L](X)
               |                                    |
     Flrn×Flrn |                                    | Flrn
               v                                    v
          D(X) × D(X) ------------------------> D(X)
                        (K/(K+L))(−) + (L/(K+L))(−)

2.3.4 Show that Diagram (1.30) can be refined to:

                       supp
          L∗(X) --------------------> Pfin(X)
            |                            ^
        acc |                            | supp              (2.16)
            v                            |
          M∗(X) ----Flrn----> D(X) ------+

      where L∗(X) ⊆ L(X) is the subset of non-empty lists.


2.3.5 Let c : X → D(Y) be a D-channel and ϕ ∈ M(X) be a multiset.
      Because D(Y) ⊆ M(Y) we can also consider c as an M-channel, and
      use c =≪ ϕ. Prove that:

          Flrn(c =≪ ϕ) = c =≪ Flrn(ϕ) = Flrn(c =≪ Flrn(ϕ)).

2.3.6 Let X be a finite set and K ∈ N be an arbitrary number. Show that for
      σ ∈ D(N[K+1](X)) and ϕ ∈ N[K](X) one has:

          (DD =≪ σ)(ϕ) / (ϕ) = Σ_{x∈X} σ(ϕ + 1|x⟩) / (ϕ + 1|x⟩).

2.3.7 Recall the uniform distributions unif_K ∈ D(N[K](X)) from Exercise 2.2.9,
      where the set X is finite. Use Lemma 1.7.5 to prove that
      Flrn =≪ unif_K = unif_X ∈ D(X).


2.4 Parallel products


In the very first section of the first chapter we have seen Cartesian, parallel
products X × Y of sets X, Y. Here we shall look at parallel products of states,
and also at parallel products of channels. These new products will be written
as tensors ⊗. They express parallel combination. These tensors exist for P, M
and D, but not for lists L. The reason for this absence will be explained below,
in Remark 2.4.3.
In this section we start with a brief uniform description of parallel products,
for multiple collection types — in the style of the first chapter — but we shall
quickly zoom in on the probabilistic case. Products ⊗ of distributions have their
own dynamics, due to the requirement that probabilities, also over a product,
must add up to one. This means that the two components of a ‘joint’ distri-
bution, over a product space, can be correlated. Indeed, a joint distribution is
typically not equal to the product of its marginals: the whole is more than the
product of its parts. It also means, as we shall see in Chapter 6, that updating
in one part has effect in other parts: the two parts of a joint distribution ‘listen’
to each other.

Definition 2.4.1. Tensors, also called parallel products, will be defined first for
states, for each collection type separately, and then for channels, in a uniform
manner.

1 Let X, Y be arbitrary sets.


a For U ∈ P(X) and V ∈ P(Y) we define U ⊗ V ∈ P(X × Y) as:

    U ⊗ V ≔ {(x, y) ∈ X × Y | x ∈ U and y ∈ V}.

This product U ⊗ V is often written simply as a product of sets U × V, but


we prefer to have a separate notation for this product of subsets.
b For ϕ ∈ M(X) and ψ ∈ M(Y) we get ϕ ⊗ ψ ∈ M(X × Y) as:
    ϕ ⊗ ψ ≔ Σ_{x∈X, y∈Y} ϕ(x) · ψ(y) |x, y⟩,   that is,   (ϕ ⊗ ψ)(x, y) = ϕ(x) · ψ(y).

c For ω ∈ D(X) and ρ ∈ D(Y) we use ω ⊗ ρ as in the previous item,


for multisets. This is well-defined, with outcome in D(X × Y), since the
relevant multiplicities add up to one. This tensor also works for infinite
support, i.e. for D∞ .
2 Let c : X → Y and d : A → B be two channels, both of the same type
T ∈ {P, M, D}. A channel c ⊗ d : X × A → Y × B is defined via:

    (c ⊗ d)(x, a) ≔ c(x) ⊗ d(a).


The right-hand side uses the tensor product of the appropriate type T , as
defined in the previous item.
We shall use tensors not only in binary form ⊗, but also in n-ary form ⊗ · · · ⊗,
both for states and for channels.
We see that tensors ⊗ involve the products × of the underlying domains. A
simple illustration of a (probabilistic) tensor product is:

    (1/6|u⟩ + 1/3|v⟩ + 1/2|w⟩) ⊗ (3/4|0⟩ + 1/4|1⟩)
        = 1/8|u, 0⟩ + 1/24|u, 1⟩ + 1/4|v, 0⟩ + 1/12|v, 1⟩ + 3/8|w, 0⟩ + 1/8|w, 1⟩.
These tensor products tend to grow really quickly in size, since the number of
entries of the two parts have to be multiplied.
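In code, the tensor of two dict-represented distributions is a one-line comprehension; the example below reproduces the illustration above (the representation and names are our own):

```python
from fractions import Fraction as F

def tensor(omega, rho):
    """Parallel product of two distributions (dicts), with
    (omega ⊗ rho)(x, y) = omega(x) · rho(y)."""
    return {(x, y): p * q for x, p in omega.items() for y, q in rho.items()}

omega = {"u": F(1, 6), "v": F(1, 3), "w": F(1, 2)}
rho = {0: F(3, 4), 1: F(1, 4)}
prod = tensor(omega, rho)
assert prod[("u", 1)] == F(1, 24) and prod[("w", 0)] == F(3, 8)
assert sum(prod.values()) == 1      # still a distribution
```

Note how the support size multiplies: three entries times two entries gives six, which is the quick growth mentioned above.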
Parallel products are well-behaved, as described below.
Lemma 2.4.2. 1 Transformation of parallel states via parallel channels is the parallel product of the individual transformations:

    (c ⊗ d) =≪ (ω ⊗ ρ) = (c =≪ ω) ⊗ (d =≪ ρ).

2 Parallel products of channels interact nicely with unit and composition:

    unit ⊗ unit = unit   and   (c₁ ⊗ d₁) ◦· (c₂ ⊗ d₂) = (c₁ ◦· c₂) ⊗ (d₁ ◦· d₂).
3 The tensor of trivial, deterministic channels is obtained from their product:

    «f» ⊗ «g» = «f × g»

where f × g = ⟨f ∘ π₁, g ∘ π₂⟩, see Subsection 1.1.1.


4 The image distribution along a product function is a tensor of images:
D( f × g)(ω ⊗ ρ) = D( f )(ω) ⊗ D(g)(ρ).
Proof. We shall do the proofs for M and for D of item (1) at once, using
Exercise 2.2.3 (1), and leave the remaining cases to the reader.
    ((c ⊗ d) =≪ (ω ⊗ ρ))(y, b)
        = Σ_{x∈X, a∈A} (c ⊗ d)(x, a)(y, b) · (ω ⊗ ρ)(x, a)
        = Σ_{x∈X, a∈A} (c(x) ⊗ d(a))(y, b) · ω(x) · ρ(a)
        = Σ_{x∈X, a∈A} c(x)(y) · d(a)(b) · ω(x) · ρ(a)
        = (Σ_{x∈X} c(x)(y) · ω(x)) · (Σ_{a∈A} d(a)(b) · ρ(a))
        = (c =≪ ω)(y) · (d =≪ ρ)(b)
        = ((c =≪ ω) ⊗ (d =≪ ρ))(y, b).



As promised we look into why parallel products don’t work for lists.

Remark 2.4.3. Suppose we have two list [a, b, c] and [u, v] and we wish to
form their parallel product. Then there are many ways to do so. For instance,
two obvious choices are:
[ha, ui, ha, vi, hb, ui, hb, vi, hc, ui, hc, vi]
[ha, ui, hb, ui, hc, ui, ha, vi, hb, vi, hc, vi]
There are many other possibilities. The problem is that there is no canonical
choice. Since the order of elements in a list matter, there is no commutativity
property which makes all options equivalent, like for multisets. Technically,
the tensor for L does not exist because L is not a commutative (i.e. monoidal)
monad; this is an early result in category theory going back to [102].

From now on we concentrate on parallel products ⊗ for distributions and


for probabilistic channels. We illustrate the use of ⊗ of probabilistic channels
for two standard ‘summation’ results. We start with a simple property, which
is studied in more complicated form in Exercise 2.4.3. Abstractly, it can be
understood in terms of convolutions. In general, a convolution of parallel maps
f, g : X → Y is a composite of the form:

    X ----split----> X × X ----f ⊗ g----> Y × Y ----join----> Y          (2.17)

The split and join operations depend on the situation. In the next result they
are copy and sum.

Lemma 2.4.4. Recall the flip (or Bernoulli) distribution flip(r) = r|1⟩ + (1 − r)|0⟩ and consider it as a channel flip : [0, 1] → N. The convolution of two such flip's is then a binomial distribution, as described in the following diagram of channels.
    [0, 1] ----∆----> [0, 1] × [0, 1] ----flip ⊗ flip----> N × N
        \                                                    |
         \                                                   | +
          \------------------- bn[2] ----------------------> N

Proof. For r ∈ [0, 1] we compute:

    D(+)(flip(r) ⊗ flip(r))
        = D(+)(r²|1, 1⟩ + r·(1−r)|1, 0⟩ + (1−r)·r|0, 1⟩ + (1−r)²|0, 0⟩)
        = r²|2⟩ + 2·r·(1−r)|1⟩ + (1−r)²|0⟩
        = Σ_{0≤i≤2} (2 choose i) · rⁱ · (1−r)^{2−i} |i⟩
        = bn[2](r).
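A quick numerical check of this convolution, using the dict representation from before (the helper names are illustrative, not the book's):

```python
from fractions import Fraction as F

def flip(r):
    return {1: r, 0: 1 - r}

def tensor(omega, rho):
    return {(x, y): p * q for x, p in omega.items() for y, q in rho.items()}

def pushforward(f, omega):
    out = {}
    for x, p in omega.items():
        out[f(x)] = out.get(f(x), F(0)) + p
    return out

def bn2(r):
    """Binomial distribution of size 2 with bias r."""
    return {0: (1 - r) ** 2, 1: 2 * r * (1 - r), 2: r ** 2}

r = F(1, 3)
conv = pushforward(sum, tensor(flip(r), flip(r)))   # D(+)(flip(r) ⊗ flip(r))
assert conv == bn2(r)
```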


The structure used in this result is worth making explicit. We have intro-
duced probability distributions on sets, as sample spaces. It turns out that if
this underlying set, say M, happens to be a commutative monoid, then D(M)
also forms a commutative monoid. This makes for instance the set D(N) of
distributions on the natural numbers a commutative monoid — even in two
ways, using either the additive or multiplicative monoid structure on N. The
construction occurs in [99, p.82], but does not seem to be widely used and/or
familiar. It is an instance of an abstract form of convolution in [104, §10].
Since it plays an important role here we make it explicit. It can be described
via parallel products ⊗ of distributions.

Proposition 2.4.5. Let M = (M, +, 0) be a commutative monoid. Define a sum


and zero element on D(M) via:

    ω + ρ ≔ D(+)(ω ⊗ ρ)   and   0 ≔ 1|0⟩,                                 (2.18)

using the sum + : M × M → M and zero element 0 ∈ M of the monoid M. Alternative descriptions of this sum of distributions are:

    ω + ρ = Σ_{a,b∈M} ω(a) · ρ(b) |a + b⟩
    (Σᵢ rᵢ|aᵢ⟩) + (Σⱼ sⱼ|bⱼ⟩) = Σ_{i,j} rᵢ · sⱼ |aᵢ + bⱼ⟩.

1 Via these sums +, 0 of distributions, the set D(M) forms a commutative


monoid.
2 If f : M → N is a homomorphism of commutative monoids, then so is
D( f ) : D(M) → D(N).

Let CMon be the category of commutative monoids, with monoid homomorphisms as arrows between them. The above two items tell us that the distribution functor D : Sets → Sets can be restricted to a functor D : CMon → CMon in a commuting diagram:

               D
    CMon -----------> CMon
      |                 |
      v                 v                            (2.19)
    Sets -----------> Sets
               D

The vertical arrows ‘forget’ the monoid structure, by sending a monoid to its
underlying set.

Exercise 2.4.4 below contains some illustrations of this definition. In Ex-


ercise 2.4.12 we shall see that the restricted (or lifted) functor D : CMon →
CMon is also a monad. The same construction works for D∞ .
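The sum (2.18) on D(N) can be sketched as follows; the resulting values for ρ₁ + ρ₂ agree with Exercise 2.4.4 below. The function name dsum and the dict encoding are our own:

```python
from fractions import Fraction as F

def dsum(omega, rho):
    """Sum on D(M) from (2.18): push the tensor forward along +."""
    out = {}
    for a, p in omega.items():
        for b, q in rho.items():
            out[a + b] = out.get(a + b, F(0)) + p * q
    return out

rho1 = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}
rho2 = {0: F(1, 2), 1: F(1, 2)}
assert dsum(rho1, rho2) == {0: F(1, 4), 1: F(5, 12), 2: F(1, 4), 3: F(1, 12)}
assert dsum(rho1, {0: F(1)}) == rho1        # 1|0⟩ is the unit
```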

101
102 Chapter 2. Discrete probability distributions

Proof. 1 Commutativity and associativity of + on D(M) follow from commutativity and associativity of + on M, and of multiplication · on [0, 1]. Next,

    ω + 0 = ω + 1|0⟩ = Σ_{a∈M} ω(a) · 1 |a + 0⟩ = Σ_{a∈M} ω(a) |a⟩ = ω.

2 Clearly, D(f)(1|0⟩) = 1|f(0)⟩ = 1|0⟩, and:

    D(f)(ω + ρ) = (D(f) ∘ D(+))(ω ⊗ ρ)
                = (D(+) ∘ D(f × f))(ω ⊗ ρ)
                = D(+)(D(f)(ω) ⊗ D(f)(ρ))            by Lemma 2.4.2 (4)
                = D(f)(ω) + D(f)(ρ).

As mentioned, we use this sum of distributions in Lemma 2.4.4. We may


now reformulate it as:

flip(r) + flip(r) = bn[2](r).

We consider a similar property of Poisson distributions pois[λ], see (2.7).


This property is commonly expressed in terms of random variables Xi as: if
X1 ∼ pois[λ1 ] and X2 ∼ pois[λ2 ] then X1 + X2 ∼ pois[λ1 + λ2 ]. We have not
discussed random variables yet, nor the meaning of ∼, but we do not need them
for the channel-based reformulation below.
Recall that the Poisson distribution has infinite support, so that we need to
use D∞ instead of D, see Remark 2.1.4, but that difference is immaterial here.
We now use the mapping λ 7→ pois[λ] as a function R≥0 → D∞ (N) and as a
D∞ -channel pois : R≥0 → N.

Proposition 2.4.6. The Poisson channel pois : R≥0 → N is a homomorphism


of monoids.
Thus, for λ1 , λ2 ∈ R≥0 ,

    pois[λ₁ + λ₂] = pois[λ₁] + pois[λ₂]   and   pois[0] = 1|0⟩.

We can also express this via commutation of the following diagrams of channels.

                      +                               0
    R≥0 × R≥0 --------------> R≥0           1 --------------> R≥0
        |                      |            |                  |
        | pois ⊗ pois          | pois       | unit             | pois
        v                      v            v                  v
      N × N ----------------> N             1 ----------------> N
                      +                               0

Proof. We first do preservation of sums +, for which we pick arbitrary λ₁, λ₂ ∈ R≥0 and k ∈ N.

    (pois[λ₁] + pois[λ₂])(k)
        = D(+)(pois[λ₁] ⊗ pois[λ₂])(k)                                   by (2.18)
        = Σ_{k₁+k₂=k} (pois[λ₁] ⊗ pois[λ₂])(k₁, k₂)
        = Σ_{0≤m≤k} pois[λ₁](m) · pois[λ₂](k−m)
        = Σ_{0≤m≤k} (e^{−λ₁} · λ₁^m/m!) · (e^{−λ₂} · λ₂^{k−m}/(k−m)!)    by (2.7)
        = Σ_{0≤m≤k} e^{−(λ₁+λ₂)}/k! · k!/(m!(k−m)!) · λ₁^m · λ₂^{k−m}
        = e^{−(λ₁+λ₂)}/k! · Σ_{0≤m≤k} (k choose m) · λ₁^m · λ₂^{k−m}
        = e^{−(λ₁+λ₂)}/k! · (λ₁ + λ₂)^k                                  by (1.26)
        = pois[λ₁ + λ₂](k).

Finally, in the expression pois[0] = Σ_k e⁰ · 0^k/k! |k⟩ everything vanishes except for k = 0, since only 0⁰ = 1. Hence pois[0] = 1|0⟩.
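This Poisson property can be checked numerically by truncating the infinite support; the cutoff below is an arbitrary illustrative choice, so the comparison is approximate up to floating-point error:

```python
from math import exp, factorial, isclose

def pois(lam, cutoff=60):
    """Truncated Poisson pmf as a dict; an approximation of the
    infinite-support distribution pois[lam]."""
    return {k: exp(-lam) * lam ** k / factorial(k) for k in range(cutoff)}

def dsum(omega, rho):
    """Sum of distributions from (2.18), for numeric dicts."""
    out = {}
    for a, p in omega.items():
        for b, q in rho.items():
            out[a + b] = out.get(a + b, 0.0) + p * q
    return out

lhs = dsum(pois(1.5), pois(2.5))     # pois[1.5] + pois[2.5]
rhs = pois(4.0)                      # pois[1.5 + 2.5]
assert all(isclose(lhs[k], rhs[k], abs_tol=1e-9) for k in range(40))
```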

The next illustration shows that tensors for different collection types are
related.

Proposition 2.4.7. The maps supp : D → P and Flrn : M∗ → D commute with tensors, in the sense that:

    supp(σ ⊗ τ) = supp(σ) ⊗ supp(τ)   and   Flrn(ϕ ⊗ ψ) = Flrn(ϕ) ⊗ Flrn(ψ).

This says that support and frequentist learning are 'monoidal' natural transformations.

Proof. We shall do the Flrn-case and leave the supp-case as an exercise. We first note that:

    ‖ϕ ⊗ ψ‖ = Σ_z (ϕ ⊗ ψ)(z) = Σ_{x,y} (ϕ ⊗ ψ)(x, y)
            = Σ_{x,y} ϕ(x) · ψ(y)
            = (Σ_x ϕ(x)) · (Σ_y ψ(y)) = ‖ϕ‖ · ‖ψ‖.

Then:

    Flrn(ϕ ⊗ ψ)(x, y) = (ϕ ⊗ ψ)(x, y) / ‖ϕ ⊗ ψ‖
                      = (ϕ(x)/‖ϕ‖) · (ψ(y)/‖ψ‖)
                      = Flrn(ϕ)(x) · Flrn(ψ)(y)
                      = (Flrn(ϕ) ⊗ Flrn(ψ))(x, y).


We can form a K-ary product X^K for a set X. But also for a distribution ω ∈ D(X) we can form ω^K = ω ⊗ · · · ⊗ ω ∈ D(X^K). This lies at the heart of the notion of 'independent and identical distributions'. We define a separate function for this construction:

    iid[K] : D(X) → D(X^K)   with   iid[K](ω) = ω ⊗ · · · ⊗ ω  (K times).      (2.20)

We can describe iid[K] diagrammatically as a composite, making the copying of states explicit:

    iid[K] = ⨂[K] ∘ ∆ : D(X) → D(X)^K → D(X^K)
    where   ⨂[K](ω₁, . . . , ω_K) ≔ ω₁ ⊗ · · · ⊗ ω_K.                          (2.21)

We often omit the parameter K when it is clear from the context. This map iid will pop up occasionally in the sequel. At this stage we only mention a few of its properties, in combination with zipping.

Lemma 2.4.8. Consider the iid maps from (2.20), and also the zip function
from Exercise 1.1.7.

1 These iid maps are natural, which we shall describe in two ways. For a function f : X → Y and for a channel c : X → Y the following two diagrams commute.

            iid                                      iid
    D(X) ----------> D(X^K)               D(X) ◦----------> X^K
      |                 |                    |                 |
 D(f) |                 | D(f^K)    c =≪ (−) |                 ◦ c^K = c ⊗ · · · ⊗ c
      v                 v                    v                 v
    D(Y) ----------> D(Y^K)               D(Y) ◦----------> Y^K
            iid                                      iid

2 The iid maps interact suitably with tensors of distributions, as expressed by the following diagram.

                     iid ⊗ iid
    D(X) × D(Y) ◦--------------> X^K × Y^K
        |                            |
      ⊗ |                            ◦ zip
        v                            v
     D(X × Y)   ◦--------------> (X × Y)^K
                        iid

3 The zip function commutes with tensors in the following way.

                         ⨂ × ⨂                          ⊗
    D(X)^K × D(Y)^K --------------> D(X^K) × D(Y^K) ----------> D(X^K × Y^K)
         |                                                          |
     zip |                                                          | D(zip)
         v                                                          v
    (D(X) × D(Y))^K --------------> D(X × Y)^K ----------------> D((X × Y)^K)
                         ⊗^K                           ⨂


Proof. 1 Commutation of the diagram on the left, for a function f, is a direct consequence of Lemma 2.4.2 (4), so we concentrate on the one on the right, for a channel c : X → Y. We use Lemma 2.4.2 (1) in:

    (c^K ◦· iid)(ω) = (c ⊗ · · · ⊗ c) =≪ (ω ⊗ · · · ⊗ ω)
                    = (c =≪ ω) ⊗ · · · ⊗ (c =≪ ω)
                    = iid(c =≪ ω)
                    = (iid ∘ (c =≪ (−)))(ω).

2 For ω ∈ D(X) and ρ ∈ D(Y),

    («zip» ◦· (iid ⊗ iid))(ω, ρ)
        = D(zip)(ω^K ⊗ ρ^K)
        = Σ_{x⃗∈X^K} Σ_{y⃗∈Y^K} ω^K(x⃗) · ρ^K(y⃗) |zip(x⃗, y⃗)⟩
        = Σ_{z⃗∈(X×Y)^K} (ω ⊗ ρ)^K(z⃗) |z⃗⟩
        = (ω ⊗ ρ)^K
        = iid(ω ⊗ ρ).

3 For distributions ωᵢ ∈ D(X), ρᵢ ∈ D(Y) and elements xᵢ ∈ X, yᵢ ∈ Y we elaborate:

    (D(zip) ∘ ⊗ ∘ (⨂ × ⨂))([ω₁, . . . , ω_K], [ρ₁, . . . , ρ_K])([(x₁, y₁), . . . , (x_K, y_K)])
        = D(zip)((ω₁ ⊗ · · · ⊗ ω_K) ⊗ (ρ₁ ⊗ · · · ⊗ ρ_K))([(x₁, y₁), . . . , (x_K, y_K)])
        = (ω₁ ⊗ · · · ⊗ ω_K)([x₁, . . . , x_K]) · (ρ₁ ⊗ · · · ⊗ ρ_K)([y₁, . . . , y_K])
        = ω₁(x₁) · . . . · ω_K(x_K) · ρ₁(y₁) · . . . · ρ_K(y_K)
        = (ω₁ ⊗ ρ₁)(x₁, y₁) · . . . · (ω_K ⊗ ρ_K)(x_K, y_K)
        = ((ω₁ ⊗ ρ₁) ⊗ · · · ⊗ (ω_K ⊗ ρ_K))([(x₁, y₁), . . . , (x_K, y_K)])
        = (⨂ ∘ ⊗^K)([(ω₁, ρ₁), . . . , (ω_K, ρ_K)])([(x₁, y₁), . . . , (x_K, y_K)])
        = (⨂ ∘ ⊗^K ∘ zip)([ω₁, . . . , ω_K], [ρ₁, . . . , ρ_K])([(x₁, y₁), . . . , (x_K, y_K)]).


Exercises
2.4.1 Prove the equation for supp in Proposition 2.4.7.
2.4.2 Show that tensoring of multisets is linear, in the sense that for σ ∈ M(X)
      the operation 'tensor with σ', that is σ ⊗ (−) : M(X × Y) ← M(Y), is
      linear w.r.t. the cone structure of Lemma 1.4.2 (2): for τ₁, . . . , τₙ ∈ M(Y)
      and r₁, . . . , rₙ ∈ R≥0 one has:

          σ ⊗ (r₁ · τ₁ + · · · + rₙ · τₙ) = r₁ · (σ ⊗ τ₁) + · · · + rₙ · (σ ⊗ τₙ).

      The same holds in the other coordinate, for (−) ⊗ τ. As a special case we
      obtain that when σ is a probability distribution, then σ ⊗ (−) preserves
      convex sums of distributions.
2.4.3 Extend Lemma 2.4.4 in the following two ways.
1 Show that the K-fold convolution (2.17) of flip's is a binomial of size K, as in:

    [0, 1] ----∆[K]----> [0, 1]^K ----flip^K----> N^K
        \                                           |
         \                                          | +
          \--------------- bn[K] -----------------> N

   where ∆[K] is a K-fold copy and + returns the sum of a K-tuple of numbers.
2 Show that binomials are closed under convolution, in the following sense: for numbers K, L ∈ N,

    [0, 1] ----∆----> [0, 1] × [0, 1] ----bn[K] ⊗ bn[L]----> N × N
        \                                                      |
         \                                                     | +
          \--------------------- bn[K+L] --------------------> N

Hint: Remember Vandermonde’s binary formula (1.29).


2.4.4 The set of natural numbers N has two commutative monoid structures,
      one additive with +, 0, and one multiplicative with ·, 1. Accordingly
      Proposition 2.4.5 gives two commutative monoid structures on D(N),
      namely:

          ω + ρ = D(+)(ω ⊗ ρ)   and   ω ⋆ ρ = D(·)(ω ⊗ ρ).

      Consider the following three distributions on N.

          ρ₁ = 1/2|0⟩ + 1/3|1⟩ + 1/6|2⟩     ρ₂ = 1/2|0⟩ + 1/2|1⟩     ω = 2/3|0⟩ + 1/3|1⟩.

      Show consecutively:
      1 ρ₁ + ρ₂ = 1/4|0⟩ + 5/12|1⟩ + 1/4|2⟩ + 1/12|3⟩;
      2 ω ⋆ (ρ₁ + ρ₂) = 3/4|0⟩ + 5/36|1⟩ + 1/12|2⟩ + 1/36|3⟩;
      3 ω ⋆ ρ₁ = 5/6|0⟩ + 1/9|1⟩ + 1/18|2⟩;
      4 ω ⋆ ρ₂ = 5/6|0⟩ + 1/6|1⟩;
      5 (ω ⋆ ρ₁) + (ω ⋆ ρ₂) = 25/36|0⟩ + 25/108|1⟩ + 7/108|2⟩ + 1/108|3⟩.


      Observe that ⋆ does not distribute over + on D(N). More generally,
      conclude that the construction of Proposition 2.4.5 does not extend to
      commutative semirings.
2.4.5 Recall that for N ∈ N>0 we write N = {0, 1, . . . , N − 1} for the set
of natural numbers (strictly) below N. It is an additive monoid, via
addition modulo N. As such it is sometimes written as ZN or as Z/NZ.
Prove that:

          unif_N + unif_N = unif_N,   with + from Proposition 2.4.5.

You may wish to check this equation first for N = 4 or N = 5. It works


for the modular sum, not for the ordinary sum (on N), as one can see
from the sum dice + dice, when dice is considered as distribution on
N, see also Example 4.5. See [148] for more info.
2.4.6 Let M = (M, +, 0) and N = (N, +, 0) be two commutative monoids,
so that D(M) and D(N) are also commutative monoids. Let chan-
nel c : M → D(N) be a homomorphism of monoids. Show that state
      transformation c =≪ (−) : D(M) → D(N) is then also a homomorphism of monoids.
2.4.7 For sets X, Y with arbitrary elements a ∈ X, b ∈ Y and with distributions
      σ ∈ D(X) and τ ∈ D(Y) we define strength maps st₁ : D(X) × Y → D(X × Y)
      and st₂ : X × D(Y) → D(X × Y) via:

          st₁(σ, b) ≔ σ ⊗ unit(b) = Σ_{x∈X} σ(x) |x, b⟩
                                                                          (2.22)
          st₂(a, τ) ≔ unit(a) ⊗ τ = Σ_{y∈Y} τ(y) |a, y⟩.

      Show that the tensor σ ⊗ τ can be reconstructed from these two strength
      maps via the following diagram:

          D(X × D(Y)) <---st₁--- D(X) × D(Y) ---st₂---> D(D(X) × Y)
               |                      |                      |
        D(st₂) |                    ⊗ |                      | D(st₁)
               v                      v                      v
          D(D(X × Y)) ---flat---> D(X × Y) <---flat--- D(D(X × Y))

2.4.8 Consider a function f : X → M where X is an ordinary set and M is a
      monoid. We can add noise to f via a channel c : X → M. The result
      is a channel noise(f, c) : X → M given by:

          noise(f, c) ≔ «f» + c.

      This formula promotes f to the deterministic channel «f» = unit ∘ f
      and uses the sum + of distributions from (2.18) in a pointwise manner.

      1 Check that we can concretely describe this noise channel as:

            noise(f, c)(x) = Σ_{y∈M} c(x)(y) |f(x) + y⟩.

      2 Show also that it can be described abstractly via strength (from the
        previous exercise) as the composite:

            noise(f, c) = (X ---⟨f, c⟩---> M × D(M) ---st₂---> D(M × M) ---D(+)---> D(M)).

2.4.9 Show that the following diagram commutes.

          D(D(X)) ◦----iid----> D(X)^K
              |                    |
         flat |                    ◦ ⨂
              v                    v
            D(X) ◦----iid----> X^K

      that is,   ⨂ =≪ iid(Ω) = iid(flat(Ω)).
2.4.10 Show that the big tensor ⨂ : D(X)^K → D(X^K) from (2.21) commutes
      with unit and flatten, as described below.

              unit^K                        ⨂                 D(⨂)
        X^K --------> D(X)^K    (D²(X))^K -----> D(D(X)^K) ---------> D²(X^K)
          |              |           |                                   |
     unit |              | ⨂   flat^K|                                   | flat
          v              v           v                                   v
        D(X^K) ===== D(X^K)       D(X)^K -------------⨂--------------> D(X^K)

      Abstractly this shows that the functor (−)^K distributes over the monad D.
2.4.11 Check that Lemma 2.4.2 (4) can be read as: the tensor ⊗ of distribu-
tions, considered as a function ⊗ : D(X) × D(Y) → D(X × Y), is a
natural transformation from the composite functor:

          Sets × Sets ---D × D---> Sets × Sets ---×---> Sets

      to the functor

          Sets × Sets ---×---> Sets ---D---> Sets.

2.4.12 Consider the situation described in Proposition 2.4.5, with a commu-


tative monoid M, and induced monoid structure on D(M).
      1 Check that 1|a⟩ + 1|b⟩ = 1|a + b⟩, for all a, b ∈ M. This says that
        unit : M → D(M) is a homomorphism of monoids, which can also
        be expressed via commutation of the diagram:

                  unit × unit
          M × M -------------> D(M) × D(M)
            |                       |
          + |                       | +
            v                       v
            M ------------------> D(M)
                     unit

      2 Check also that flat : D(D(M)) → D(M) is a monoid homomorphism.
        This means that for Ω, Θ ∈ D(D(M)) one has:

          flat(Ω) + flat(Θ) = flat(Ω + Θ)   and   flat(1|1|0⟩⟩) = 1|0⟩.

        The sum + on the left-hand side is the one in D(M), from
        the beginning of this exercise. The sum + on the right-hand side is
        the one in D(D(M)), using that D(M) is a commutative monoid, and
        thus D(D(M)) too.
      3 Check that the functor D : CMon → CMon in (2.19) is also a monad.

2.5 Projecting and copying


For cartesian products × there are two projection functions π1 : X1 × X2 → X1 ,
π2 : X1 × X2 → X2 and a diagonal function ∆ : X → X × X for copying, see
Section 1.1. There are analogues for tensors ⊗ of distributions. But they behave
differently. Since these differences are fundamental in probability theory we
devote a separate section to them.

2.5.1 Marginalisation and entwinedness


In Section 1.1 we have seen projection maps πᵢ : X₁ × · · · × Xₙ → Xᵢ for Cartesian products of sets. Since these πᵢ are ordinary functions, they can be turned into channels «πᵢ» = unit ∘ πᵢ : X₁ × · · · × Xₙ → Xᵢ. We will be sloppy and typically omit these brackets «−». The kind of arrow, → or ◦→, or the type of operation at hand, will then indicate whether πᵢ is used as a function or as a channel. State transformation πᵢ =≪ (−) = D(πᵢ) : D(X₁ × · · · × Xₙ) → D(Xᵢ) with such projections is called marginalisation. It is given by summing over all variables, except the one that we wish to keep: for y ∈ Xᵢ we get, by Definition 2.1.2,

    (πᵢ =≪ ω)(y) = Σ_{x₁, …, x_{i−1}} Σ_{x_{i+1}, …, xₙ} ω(x₁, . . . , x_{i−1}, y, x_{i+1}, . . . , xₙ),     (2.23)

with each x_j ranging over X_j.

Below we shall introduce special notation for such marginalisation, but first
we look at some of its properties.
The next result only works in the probabilistic case, for D. The exercises
below will provide counterexamples for P and M.


Lemma 2.5.1. 1 Projection channels take parallel products of probabilistic states apart, that is, for ωᵢ ∈ D(Xᵢ) we have:

    πᵢ =≪ (ω₁ ⊗ · · · ⊗ ωₙ) = D(πᵢ)(ω₁ ⊗ · · · ⊗ ωₙ) = ωᵢ.

Thus, marginalisation of a parallel product yields its components.
2 Similarly, projection channels commute with parallel products of probabilistic channels, in the following manner:

                        c₁ ⊗ · · · ⊗ cₙ
    X₁ × · · · × Xₙ ◦------------------> Y₁ × · · · × Yₙ
          |                                    |
       πᵢ ◦                                    ◦ πᵢ
          v                                    v
          Xᵢ ◦------------cᵢ------------> Yᵢ

Proof. We only do item (1), since item (2) then follows easily, using that
parallel products of channels are defined pointwise, see Definition 2.4.1 (2).
The first equation in item (1) follows from Lemma 1.8.3 (4), which yields
«πᵢ» =≪ (−) = D(πᵢ)(−). We restrict to the special case where n = 2 and i = 1. Then:

    D(π₁)(ω₁ ⊗ ω₂)(x) = Σ_y (ω₁ ⊗ ω₂)(x, y)                      by (2.23)
                      = Σ_y ω₁(x) · ω₂(y)
                      = ω₁(x) · Σ_y ω₂(y) = ω₁(x) · 1 = ω₁(x).
The last line of this proof relies on the fact that probabilistic states (distri-
butions) involve a convex sum, with multiplicities adding up to one. This does
not work for subsets or multisets, see Exercise 2.5.1 below.
We introduce special, post-fix notation for marginalisation via ‘masks’. It
corresponds to the idea of listing only the relevant variables in traditional no-
tation, where a distribution on a product set is often written as ω(x, y) and its
first marginal as ω(x).

Definition 2.5.2. A mask M is a finite list of 0's and 1's, that is, an element M ∈ L({0, 1}). For a state ω of type T ∈ {P, M, D} on X₁ × · · · × Xₙ and a mask M of length n we write:

    ω[M]

for the marginal with mask M. Informally, it keeps all the parts from ω at a position where there is a 1 in M and it projects away the parts where there is a 0. This is best illustrated via an example:

    ω[1, 0, 1, 0, 1] = T(⟨π₁, π₃, π₅⟩)(ω) ∈ T(X₁ × X₃ × X₅)
                     = ⟨π₁, π₃, π₅⟩ =≪ ω.

More generally, for a channel c : Y → X₁ × · · · × Xₙ and a mask M of length n we use pointwise marginalisation via the same postcomposition:

    c[M] is the channel y ↦ c(y)[M].
With Cartesian products one can take an arbitrary tuple t ∈ X × Y apart
into π1 (t) ∈ X, π2 (t) ∈ Y. By re-assembling these parts the original tuple is
recovered: ⟨π₁(t), π₂(t)⟩ = t. This does not work for collections: a joint state
is typically not the parallel product of its marginals, see Example 2.5.4 below.
We introduce a special name for this.
Definition 2.5.3. A joint state ω on X × Y is called non-entwined if it is the parallel product of its marginals:

    ω = ω[1, 0] ⊗ ω[0, 1].

Otherwise it is called entwined. This notion of entwinedness may be formulated with respect to n-ary states, via a mask, see for example Exercise 2.5.2, but it may then need some re-arrangement of components.
Lemma 2.5.1 (1) shows that a probabilistic product state of the form ω1 ⊗ ω2
is non-entwined. But in general, joint states are entwined, so that the different
parts are correlated and can influence each other. This is a mechanism that
will play an important role in the sequel. A joint distribution is more than the
product of its parts, and its different parts can influence each other.
Non-entwined states are called separable in [25]. Sometimes they are also
called independent, although independence is also used for random variables.
We like to have separate terminology for states only, so we use the phrase
(non-)entwinedness, which is a new expression. Independence for random vari-
ables is described in Section 5.4.
Example 2.5.4. Take X = {u, v} and A = {a, b} and consider the state ω ∈ D(X × A) given by:

    ω = 1/8|u, a⟩ + 1/4|u, b⟩ + 3/8|v, a⟩ + 1/4|v, b⟩.

We claim that ω is entwined. Indeed, ω has first and second marginals ω[1, 0] = D(π₁)(ω) ∈ D(X) and ω[0, 1] = D(π₂)(ω) ∈ D(A), namely:

    ω[1, 0] = 3/8|u⟩ + 5/8|v⟩   and   ω[0, 1] = 1/2|a⟩ + 1/2|b⟩.

The original state ω differs from the product of its marginals:

    ω[1, 0] ⊗ ω[0, 1] = 3/16|u, a⟩ + 3/16|u, b⟩ + 5/16|v, a⟩ + 5/16|v, b⟩.

This entwinedness follows from a general characterisation, see Exercise 2.5.9 below.
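Masked marginalisation and the entwinedness check of this example can be sketched as follows; the 0/1-list encoding of masks mirrors Definition 2.5.2, and the helper names are our own:

```python
from fractions import Fraction as F

def marginal(omega, mask):
    """omega[mask]: keep the coordinates where the mask has a 1."""
    out = {}
    for xs, p in omega.items():
        key = tuple(x for x, keep in zip(xs, mask) if keep)
        out[key] = out.get(key, F(0)) + p
    return out

omega = {("u", "a"): F(1, 8), ("u", "b"): F(1, 4),
         ("v", "a"): F(3, 8), ("v", "b"): F(1, 4)}

m1 = marginal(omega, [1, 0])        # 3/8|u⟩ + 5/8|v⟩
m2 = marginal(omega, [0, 1])        # 1/2|a⟩ + 1/2|b⟩
product = {(x[0], y[0]): p * q for x, p in m1.items() for y, q in m2.items()}
assert product[("u", "a")] == F(3, 16)
assert omega != product             # omega is entwined
```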


2.5.2 Copying
For an arbitrary set X there is a K-ary copy function ∆[K] : X → X^K = X × · · · × X (K times), given as:

    ∆[K](x) ≔ ⟨x, . . . , x⟩   (K times x).

We often omit the subscript K when it is clear from the context, especially when K = 2. This copy function can be turned into a copy channel «∆[K]» = unit ∘ ∆[K] : X → X^K, but, recall, we often omit writing «−» for simplicity. These ∆[K] are alternatively called copiers or diagonals.
As functions one has πᵢ ∘ ∆[K] = id, and thus as channels πᵢ ◦· ∆[K] = unit.

Fact 2.5.5. State transformation with a copier, ∆ =≪ ω, differs from the tensor product ω ⊗ ω. In general, for K ≥ 2,

∆[K] =≪ ω ≠ ω ⊗ · · · ⊗ ω (K times) = iid[K](ω).    (2.20)

The following simple example illustrates this fact. For clarity we now write the brackets «−» explicitly. First,

«∆» =≪ (1/3|0⟩ + 2/3|1⟩) = 1/3|∆(0)⟩ + 2/3|∆(1)⟩ = 1/3|0, 0⟩ + 2/3|1, 1⟩.    (2.24)

In contrast:

(1/3|0⟩ + 2/3|1⟩) ⊗ (1/3|0⟩ + 2/3|1⟩) = 1/9|0, 0⟩ + 2/9|0, 1⟩ + 2/9|1, 0⟩ + 4/9|1, 1⟩.    (2.25)
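The contrast between (2.24) and (2.25) can be reproduced in a few lines. In the sketch below (a quick check of our own, with distributions as `Fraction`-valued dictionaries), `copy_push` is the pushforward of ω along the copy channel and `tensor` is the product state ω ⊗ ω; the two are different.

```python
from fractions import Fraction as F

omega = {0: F(1, 3), 1: F(2, 3)}

# State transformation along the copy channel: Δ =≪ ω, as in (2.24)
copy_push = {(x, x): p for x, p in omega.items()}

# Tensor product ω ⊗ ω, as in (2.25)
tensor = {(x, y): p * q for x, p in omega.items() for y, q in omega.items()}

print(copy_push)            # only diagonal entries: 1/3 and 2/3
print(copy_push == tensor)  # False
```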
One might expect a commuting diagram like in Lemma 2.5.1 (2) for copiers, but that does not work: diagonals do not commute with arbitrary channels, see Exercise 2.5.10 below for a counterexample. Copiers do commute with deterministic channels of the form «f» = unit ∘ f, as in:

«∆» ∘· «f» = («f» ⊗ «f») ∘· «∆»   because   ∆ ∘ f = (f × f) ∘ ∆.

In fact, this commutation with diagonals may be used as definition for a channel to be deterministic, see Exercise 2.5.12.

2.5.3 Joint states and channels


For an ordinary function f : X → Y one can form the graph of f as the relation
gr( f ) ⊆ X × Y given by gr( f ) = {(x, y) | f (x) = y}. There is a similar thing that
one can do for probabilistic channels, given a state. But please keep in mind
that this kind of ‘graph’ has nothing to do with the graphical models that we
will be using later, like Bayesian networks or string diagrams.


Definition 2.5.6. 1 For two channels c : X → Y and d : X → Z with the same domain we can define a tuple channel:

⟨c, d⟩ := (c ⊗ d) ∘· «∆» : X → Y × Z.

2 In particular, for a state σ on Y and a channel d : Y → Z we define the graph as the joint state gr(σ, d) on Y × Z defined by:

gr(σ, d) := ⟨id, d⟩ =≪ σ   so that   (⟨id, d⟩ =≪ σ)(y, z) = σ(y) · d(y)(z).

We are overloading the tuple notation ⟨c, d⟩. Above we use it for the tuple of channels, so that ⟨c, d⟩ has type X → D(Y × Z). Interpreting ⟨c, d⟩ as a tuple of functions, like in Subsection 1.1.1, would give a type X → D(Y) × D(Z). For channels we standardly use the channel interpretation for the tuple.
In the literature a probabilistic channel is often called a discriminative model.
Let’s consider a channel whose codomain is a finite set, say (isomorphic to) the
set n = {0, 1, . . . , n − 1}, whose elements i ∈ n can be seen as labels or classes.
Hence we can write the channel as c : X → n. It can then be understood as a
classifier: it produces for each element x ∈ X a distribution c(x) on n, giving
the likelihood c(x)(i) that x belongs to class i ∈ n. The class i with the highest
likelihood, obtained via argmax, may simply be used as x’s class.
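This classifier reading is easy to make concrete. The sketch below uses a made-up two-element channel `c` into the three-element set of classes {0, 1, 2} (the names `c` and `classify` are ours); `classify` just takes the argmax of the distribution c(x).

```python
from fractions import Fraction as F

# A hypothetical classifier channel c : X → 3, for X = {'x1', 'x2'}
c = {'x1': {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)},
     'x2': {0: F(1, 10), 1: F(3, 10), 2: F(3, 5)}}

def classify(channel, x):
    # pick the class i with the highest likelihood channel(x)(i)
    dist = channel[x]
    return max(dist, key=dist.get)

print(classify(c, 'x1'))  # 0
print(classify(c, 'x2'))  # 2
```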
The term generative model is used for a pair of a state σ ∈ D(Y) and a channel g : Y → Z, giving rise to a joint state ⟨id, g⟩ =≪ σ ∈ D(Y × Z). This shows how the joint state can be generated. Later on we shall see that in principle each joint state can be described in such generative form, via marginalisation and disintegration, see Section 7.6.
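The defining formula gr(σ, d)(y, z) = σ(y) · d(y)(z) for the graph of a generative model is directly computable. The sketch below uses a hypothetical prior σ and channel d (all names and numbers are ours, for illustration only), with distributions as `Fraction`-valued dictionaries.

```python
from fractions import Fraction as F

def graph(sigma, d):
    """gr(σ, d) = ⟨id, d⟩ =≪ σ, i.e. (y, z) ↦ σ(y)·d(y)(z)."""
    return {(y, z): p * q for y, p in sigma.items() for z, q in d[y].items()}

# a hypothetical generative model: prior σ and channel d
sigma = {'y0': F(1, 4), 'y1': F(3, 4)}
d = {'y0': {'z0': F(1, 2), 'z1': F(1, 2)},
     'y1': {'z0': F(1, 3), 'z1': F(2, 3)}}

joint = graph(sigma, d)
print(joint[('y1', 'z1')])       # 3/4 · 2/3 = 1/2
print(sum(joint.values()) == 1)  # a proper joint state
```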

Exercises
2.5.1  1 Let U ∈ P(X) and V ∈ P(Y). Show that the equation

(U ⊗ V)[1, 0] = U

does not hold in general, in particular not when V = ∅ (and U ≠ ∅).

2 Similarly, check that for arbitrary multisets ϕ ∈ M(X) and ψ ∈ M(Y) one does not have:

(ϕ ⊗ ψ)[1, 0] = ϕ.

It fails when ψ is but ϕ is not the empty multiset 0. But it also fails in non-zero cases, e.g. for the multisets ϕ = 2|a⟩ + 4|b⟩ + 1|c⟩ and ψ = 3|u⟩ + 2|v⟩. Check this by doing the required computations.


2.5.2 Check that the distribution ω ∈ D({a, b} × {a, b} × {a, b}) given by:

ω = 1/24|aaa⟩ + 1/12|aab⟩ + 1/12|aba⟩ + 1/6|abb⟩
  + 1/6|baa⟩ + 1/3|bab⟩ + 1/24|bba⟩ + 1/12|bbb⟩

satisfies:

ω = ω[1, 1, 0] ⊗ ω[0, 0, 1].

2.5.3 Check that for finite sets X, Y, the uniform distribution on X × Y is non-entwined — see also Exercise 2.1.1.
2.5.4 Find different joint states σ, τ with equal marginals:

σ[1, 0] = τ[1, 0] and σ[0, 1] = τ[0, 1] but σ ≠ τ.

Hint: Use Example 2.5.4.


2.5.5 Check that:

ω[0, 1, 1, 0, 1, 1][0, 1, 1, 0] = ω[0, 0, 1, 0, 1, 0].

What is the general result behind this?


2.5.6 Show that a (binary) joint distribution is non-entwined when its first marginal is a unit (Dirac) distribution. Formulated more explicitly, if ω ∈ D(X × Y) satisfies ω[1, 0] = 1|x⟩, then ω = 1|x⟩ ⊗ ω[0, 1]. (This property shows that D is a ‘strongly affine monad’, see [69] for details.)
2.5.7 Prove that the following equation about iid and concatenation ++ holds, for K, L ∈ N:

«++» ∘· (iid[K] ⊗ iid[L]) ∘· «∆» = iid[K + L] : D(X) → X^{K+L},

where ∆ : D(X) → D(X) × D(X) copies a distribution and ++ : X^K × X^L → X^{K+L} concatenates lists, both used as deterministic channels.

2.5.8 Prove that for a channel c with suitably typed codomain,

c[1, 0, 1, 0, 1] = ⟨π₁, π₃, π₅⟩ ∘· c.

2.5.9 Let X = {u, v} and A = {a, b} as in Example 2.5.4. Prove that a state ω = r₁|u, a⟩ + r₂|u, b⟩ + r₃|v, a⟩ + r₄|v, b⟩ ∈ D(X × A), where r₁ + r₂ + r₃ + r₄ = 1, is non-entwined if and only if r₁ · r₄ = r₂ · r₃.
2.5.10 Consider the probabilistic channel f : X → Y from Example 2.2.1 and show that on the one hand «∆» ∘· f : X → Y × Y is given by:

a ↦ 1/2|u, u⟩ + 1/2|v, v⟩
b ↦ 1|u, u⟩
c ↦ 3/4|u, u⟩ + 1/4|v, v⟩.


On the other hand, (f ⊗ f) ∘· «∆» : X → Y × Y is described by:

a ↦ 1/4|u, u⟩ + 1/4|u, v⟩ + 1/4|v, u⟩ + 1/4|v, v⟩
b ↦ 1|u, u⟩
c ↦ 9/16|u, u⟩ + 3/16|u, v⟩ + 3/16|v, u⟩ + 1/16|v, v⟩.
2.5.11 Check that for ordinary functions f : X → Y and g : X → Z one can relate channel tupling and ordinary tupling via the operation «−» as:

⟨«f», «g»⟩ = «⟨f, g⟩»   i.e.   ⟨unit ∘ f, unit ∘ g⟩ = unit ∘ ⟨f, g⟩.

Conclude that the copy channel «∆» is ⟨unit, unit⟩ and show that it does commute with deterministic channels:

«∆» ∘· «f» = ⟨«f», «f»⟩ ∘· «∆».
2.5.12 Deterministic channels are in fact the only ones that commute with copy (see for an abstract setting [18]). Let c : X → Y commute with diagonals, in the sense that «∆» ∘· c = (c ⊗ c) ∘· «∆». Prove that c is deterministic, i.e. of the form c = «f» = unit ∘ f for a function f : X → Y.
2.5.13 Check that the fact that in general ∆ =≪ ω ≠ ω ⊗ ω, as described in Subsection 2.5.2, can also be expressed as: the following diagram does not commute:

D(∆) ≠ ⊗ ∘ ∆,   as functions D(X) → D(X × X),

where ∆ on the right is the copy function D(X) → D(X) × D(X) and ⊗ : D(X) × D(X) → D(X × X) forms product states.
2.5.14 Show that the tuple operation ⟨−, −⟩ for channels, from Definition 2.5.6, satisfies both:

πᵢ ∘· ⟨c₁, c₂⟩ = cᵢ   and   ⟨π₁, π₂⟩ = unit

but not, essentially as in Exercise 2.5.10:

⟨c₁, c₂⟩ ∘· d = ⟨c₁ ∘· d, c₂ ∘· d⟩.
2.5.15 Consider the joint state Flrn(τ) from Example 2.3.1 (2). Take σ = 7/10|H⟩ + 3/10|L⟩ ∈ D({H, L}) and c : {H, L} → {0, 1, 2} defined by:

c(H) = 1/7|0⟩ + 1/2|1⟩ + 5/14|2⟩   c(L) = 1/6|0⟩ + 1/3|1⟩ + 1/2|2⟩.

Check that gr(σ, c) = Flrn(τ).
2.5.16 Consider a state σ ∈ D(X) and an ‘endo’ channel c : X → X.

1 Check that the following two statements are equivalent.

• ⟨id, c⟩ =≪ σ = ⟨c, id⟩ =≪ σ in D(X × X);
• σ(x) · c(x)(y) = σ(y) · c(y)(x) for all x, y ∈ X.

2 Show that the conditions in the previous item imply that σ is a fixed point for state transformation, that is:

c =≪ σ = σ.

Such a σ is also called a stationary state.


2.5.17 Consider the bi-channel bell : A × 2 → B × 2 from Exercise 2.2.11 (2). Show that both bell[1, 0] and bell†[1, 0] are channels with uniform states only.
2.5.18 Prove that marginalisation (−)[1, 0] : M(X × Y) → M(X) is linear — and preserves convex sums when restricted to distributions.

2.6 A Bayesian network example


This section uses the previous first descriptions of probabilistic channels to
give semantics for a Bayesian network example. At this stage we just describe
an example, without giving the general approach, but we hope that it clarifies
what is going on and illustrates to the reader the relevance of (probabilistic)
channels and their operations. A general description of what a Bayesian net-
work is appears much later, in Definition 7.1.2.
The example that we use is a standard one, copied from the literature, namely
from [33]. Bayesian networks were introduced in [133], see also [109, 8, 13,
14, 91, 105]. They form a popular technique for displaying probabilistic con-
nections and for efficiently presenting joint states, without state explosion.
Consider the diagram/network in Figure 2.3. It is meant to capture proba-
bilistic dependencies between several wetness phenomena in the oval boxes.
For instance, in winter it is more likely to rain (than when it’s not winter), and
also in winter it is less likely that a sprinkler is on. Still the grass may be wet
by a combination of these occurrences. Whether a road is slippery depends on
rain, not on sprinklers.
The letters A, B, C, D, E in this diagram are written exactly as in [33]. Here
they are not used as sets of Booleans, with inhabitants true and false, but in-
stead we use these sets with elements:

A = {a, a⊥ } B = {b, b⊥ } C = {c, c⊥ } D = {d, d⊥ } E = {e, e⊥ }.


The notation a⊥ is read as ‘not a’. In this way the name of an element suggests
to which set the element belongs.


 
[Figure 2.3 shows a network of five oval nodes: winter (A) at the bottom, with edges upward to sprinkler (B) and rain (C); edges from sprinkler (B) and rain (C) to wet grass (D); and an edge from rain (C) to slippery road (E).]

Figure 2.3 The wetness Bayesian network (from [33, Chap. 6]), with only the nodes and edges between them; the conditional probability tables associated with the nodes are given separately in the text.

The diagram in Figure 2.3 becomes a Bayesian network when we provide it with conditional probability tables. For the lower three nodes they look as follows.

    winter              sprinkler               rain
    a     a⊥            A  | b    b⊥            A  | c     c⊥
    3/5   2/5           a  | 1/5  4/5           a  | 4/5   1/5
                        a⊥ | 3/4  1/4           a⊥ | 1/10  9/10

And for the upper two nodes we have:

    wet grass                       slippery road
    B   C  | d      d⊥              C  | e     e⊥
    b   c  | 19/20  1/20            c  | 7/10  3/10
    b   c⊥ | 9/10   1/10            c⊥ | 0     1
    b⊥  c  | 4/5    1/5
    b⊥  c⊥ | 0      1
This Bayesian network is thus given by nodes, each with a conditional proba-
bility table, describing likelihoods in terms of previous ‘ancestor’ nodes in the
network (if any).
How to interpret all this data? How to make it mathematically precise? It is
not hard to see that the first ‘winter’ table describes a probability distribution
on the set A, which, in the notation of this book, is given by:

wi = 3/5|a⟩ + 2/5|a⊥⟩.
Thus we are assuming with a probability of 60% that we are in a winter situation.
This is often called the prior distribution, or also the initial state.
Notice that the ‘sprinkler’ table contains two distributions on B, one for the element a ∈ A and one for a⊥ ∈ A. Here we recognise a channel, namely a channel A → B. This is a crucial insight! We abbreviate this channel as sp, and define it explicitly as:

sp : A → B   with   sp(a) = 1/5|b⟩ + 4/5|b⊥⟩   and   sp(a⊥) = 3/4|b⟩ + 1/4|b⊥⟩.

We read them as: if it’s winter, there is a 20% chance that the sprinkler is on, but if it’s not winter, there is a 75% chance that the sprinkler is on.
Similarly, the ‘rain’ table corresponds to a channel:

ra : A → C   with   ra(a) = 4/5|c⟩ + 1/5|c⊥⟩   and   ra(a⊥) = 1/10|c⟩ + 9/10|c⊥⟩.

Before continuing we can see that the formalisation (partial, so far) of the wetness Bayesian network in Figure 2.3 in terms of states and channels already allows us to do something meaningful, namely state transformation =≪. Indeed, we can form distributions:

sp =≪ wi on B   and   ra =≪ wi on C.
These (transformed) distributions capture the derived, predicted probabilities
that the sprinkler is on, and that it rains. Using the definition of state transfor-
mation, see (2.9), we get:

(sp =≪ wi)(b) = Σ_x sp(x)(b) · wi(x)
             = sp(a)(b) · wi(a) + sp(a⊥)(b) · wi(a⊥)
             = 1/5 · 3/5 + 3/4 · 2/5 = 21/50
(sp =≪ wi)(b⊥) = Σ_x sp(x)(b⊥) · wi(x)
              = sp(a)(b⊥) · wi(a) + sp(a⊥)(b⊥) · wi(a⊥)
              = 4/5 · 3/5 + 1/4 · 2/5 = 29/50.
Thus the overall distribution for the sprinkler (being on or not) is:

sp =≪ wi = 21/50|b⟩ + 29/50|b⊥⟩.

In a similar way one can compute the probability distribution for rain as:

ra =≪ wi = 13/25|c⟩ + 12/25|c⊥⟩.

Such distributions for non-initial nodes of a Bayesian network are called pre-
dictions. As will be shown here, they can be obtained via forward state trans-
formation, following the structure of the network.
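Forward state transformation is a short computation to implement. The sketch below (our own helper `push`, with states and channels as `Fraction`-valued dictionaries) reproduces the two predictions just computed.

```python
from fractions import Fraction as F

wi = {'a': F(3, 5), 'a⊥': F(2, 5)}
sp = {'a':  {'b': F(1, 5), 'b⊥': F(4, 5)},
      'a⊥': {'b': F(3, 4), 'b⊥': F(1, 4)}}
ra = {'a':  {'c': F(4, 5), 'c⊥': F(1, 5)},
      'a⊥': {'c': F(1, 10), 'c⊥': F(9, 10)}}

def push(c, state):
    """State transformation c =≪ state: sum over x of state(x)·c(x)(y)."""
    out = {}
    for x, p in state.items():
        for y, q in c[x].items():
            out[y] = out.get(y, F(0)) + p * q
    return out

print(push(sp, wi))  # sprinkler prediction: 21/50, 29/50
print(push(ra, wi))  # rain prediction: 13/25, 12/25
```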
But first we still have to translate the upper two nodes of the network from
Figure 2.3 into channels. In the conditional probability table for the ‘wet grass’


node we see 4 distributions on the set D, one for each combination of elements
from the sets B and C. The table thus corresponds to a channel:




wg : B × C → D   with   wg(b, c) = 19/20|d⟩ + 1/20|d⊥⟩
                        wg(b, c⊥) = 9/10|d⟩ + 1/10|d⊥⟩
                        wg(b⊥, c) = 4/5|d⟩ + 1/5|d⊥⟩
                        wg(b⊥, c⊥) = 1|d⊥⟩.

Finally, the table for the ‘slippery road’ node gives:

sr : C → E   with   sr(c) = 7/10|e⟩ + 3/10|e⊥⟩   and   sr(c⊥) = 1|e⊥⟩.
 ⊥

We illustrate how to obtain predictions for ‘rain’ and for ‘slippery road’. We start with the latter. Looking at the network in Figure 2.3 we see that there are two arrows between the initial node ‘winter’ and our node of interest ‘slippery road’. This means that we have to do two successive state transformations, giving:

(sr ∘· ra) =≪ wi = sr =≪ (ra =≪ wi)
                = sr =≪ (13/25|c⟩ + 12/25|c⊥⟩) = 91/250|e⟩ + 159/250|e⊥⟩.

The first equation follows from Lemma 1.8.3 (3). The second one involves elementary calculations, where we can use the distribution ra =≪ wi that we calculated earlier.
Getting the predicted wet grass probability requires some care. Inspection of
the network in Figure 2.3 is of some help, but leads to some ambiguity — see
below. One might be tempted to form the parallel product ⊗ of the predicted
distributions for sprinkler and rain, and do state transformation on this product
along the wet grass channel wg, as in:

wg =≪ ((sp =≪ wi) ⊗ (ra =≪ wi)).

But this is wrong, since the winter probabilities are now not used consistently, see the different outcomes in the calculations (2.24) and (2.25). The correct way to obtain the wet grass prediction involves copying the winter state, via the copy channel ∆, see:

(wg ∘· (sp ⊗ ra) ∘· ∆) =≪ wi = wg =≪ ((sp ⊗ ra) =≪ (∆ =≪ wi))
                             = wg =≪ (⟨sp, ra⟩ =≪ wi)
                             = 1399/2000|d⟩ + 601/2000|d⊥⟩.

Such calculations are laborious, but essentially straightforward. We shall do one in detail, just to see how it works. Especially, it becomes clear that all summations are automatically done at the right place. We proceed in two steps, where for each step we only elaborate the first case.

(⟨sp, ra⟩ =≪ wi)(b, c) = Σ_x sp(x)(b) · ra(x)(c) · wi(x)
    = sp(a)(b) · ra(a)(c) · 3/5 + sp(a⊥)(b) · ra(a⊥)(c) · 2/5
    = 1/5 · 4/5 · 3/5 + 3/4 · 1/10 · 2/5 = 63/500
(⟨sp, ra⟩ =≪ wi)(b, c⊥) = · · · = 147/500
(⟨sp, ra⟩ =≪ wi)(b⊥, c) = 197/500
(⟨sp, ra⟩ =≪ wi)(b⊥, c⊥) = 93/500.

We conclude that:

(sp ⊗ ra) =≪ (∆ =≪ wi) = ⟨sp, ra⟩ =≪ wi
    = 63/500|b, c⟩ + 147/500|b, c⊥⟩ + 197/500|b⊥, c⟩ + 93/500|b⊥, c⊥⟩.

This distribution is used in the next step:

(wg =≪ ((sp ⊗ ra) =≪ (∆ =≪ wi)))(d)
    = Σ_{x,y} wg(x, y)(d) · ((sp ⊗ ra) =≪ (∆ =≪ wi))(x, y)
    = wg(b, c)(d) · 63/500 + wg(b, c⊥)(d) · 147/500
      + wg(b⊥, c)(d) · 197/500 + wg(b⊥, c⊥)(d) · 93/500
    = 19/20 · 63/500 + 9/10 · 147/500 + 4/5 · 197/500 + 0 · 93/500
    = 1399/2000
(wg =≪ ((sp ⊗ ra) =≪ (∆ =≪ wi)))(d⊥) = 601/2000.

We have thus shown that:

wg =≪ ((sp ⊗ ra) =≪ (∆ =≪ wi)) = 1399/2000|d⟩ + 601/2000|d⊥⟩.
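The two steps can be replayed mechanically. The sketch below (our own loops and names, exact arithmetic with `Fraction`) first computes ⟨sp, ra⟩ =≪ wi by sharing one copy of the winter state between both channels, and then pushes the result forward along wg.

```python
from fractions import Fraction as F

wi = {'a': F(3, 5), 'a⊥': F(2, 5)}
sp = {'a': {'b': F(1, 5), 'b⊥': F(4, 5)}, 'a⊥': {'b': F(3, 4), 'b⊥': F(1, 4)}}
ra = {'a': {'c': F(4, 5), 'c⊥': F(1, 5)}, 'a⊥': {'c': F(1, 10), 'c⊥': F(9, 10)}}
wg = {('b', 'c'):   {'d': F(19, 20), 'd⊥': F(1, 20)},
      ('b', 'c⊥'):  {'d': F(9, 10),  'd⊥': F(1, 10)},
      ('b⊥', 'c'):  {'d': F(4, 5),   'd⊥': F(1, 5)},
      ('b⊥', 'c⊥'): {'d': F(0),      'd⊥': F(1)}}

# ⟨sp, ra⟩ =≪ wi: one shared copy of the winter state feeds both channels
pair = {}
for x, p in wi.items():
    for b, q in sp[x].items():
        for c, r in ra[x].items():
            pair[(b, c)] = pair.get((b, c), F(0)) + p * q * r

# wg =≪ (⟨sp, ra⟩ =≪ wi): the wet grass prediction
pred = {}
for bc, p in pair.items():
    for d, q in wg[bc].items():
        pred[d] = pred.get(d, F(0)) + p * q

print(pair[('b', 'c')])  # 63/500
print(pred)              # d: 1399/2000, d⊥: 601/2000
```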

2.6.1 Redrawing Bayesian networks


We have illustrated how prediction computations for Bayesian networks can
be done, basically by following the graph structure and translating it into suit-
able sequential and parallel compositions (◦· and ⊗) of channels. The match
between the graph and the computation is not perfect, and requires some care,
especially wrt. copying. Since we have a solid semantics, we like to use it in
order to improve the network drawing and achieve a better match between the
underlying mathematical operations and the graphical representation. There-
fore we prefer to draw Bayesian networks slightly differently, making some
minor changes:


[Figure 2.4 shows the same network drawn as a string diagram: the winter state at the bottom with an explicit copier on its outgoing A-wire, feeding the sprinkler and rain boxes; the C-wire out of rain is copied again, going both to wet grass (together with the B-wire out of sprinkler) and to slippery road; the D- and E-wires at the top are outgoing.]

Figure 2.4 The wetness Bayesian network from Figure 2.3 redrawn in a manner
that better reflects the underlying channel-based semantics, via explicit copiers
and typed wires. This anticipates the string-diagrammatic approach of Chapter 7.

• copying is written explicitly, via a copier symbol with one incoming and several outgoing wires; in general one can have n-ary copying, for n ≥ 2;
• the relevant sets/types — like A, B, C, D, E — are not included in the nodes, but are associated with the arrows (wires) between the nodes;
• final nodes have outgoing arrows, labeled with their type.
The original Bayesian network in Figure 2.3 is then changed according to these
points in Figure 2.4. In this way the nodes (with their conditional probability
tables) are clearly recognisable as channels, of type A1 × · · · × An → B, where
A1 , . . . , An are the types on the incoming wires, and B is the type of the out-
going wire. Initial nodes have no incoming wires, which formally leads to a
channel 1 → B, where 1 is the empty product. As we have seen, such channels
1 → B correspond to distributions/states on B. In the adapted diagram one
easily forms sequential and parallel compositions of channels, see the exercise
below.

Exercises
2.6.1 In [33, §6.2] the (predicted) joint distribution on D × E that arises from the Bayesian network example in this section is represented as a table. It translates into:

30,443/100,000|d, e⟩ + 39,507/100,000|d, e⊥⟩ + 5,957/100,000|d⊥, e⟩ + 24,093/100,000|d⊥, e⊥⟩.

Following the structure of the diagram in Figure 2.4, it is obtained in the present setting as:

((wg ⊗ sr) ∘· (id ⊗ ∆) ∘· (sp ⊗ ra) ∘· ∆) =≪ wi
    = (wg ⊗ sr) =≪ ((id ⊗ ∆) =≪ (⟨sp, ra⟩ =≪ wi)).

Perform the calculations and check that this expression equals the above distribution. (Readers may wish to compare the different calculation methods, using sequential and parallel composition of channels — as here — or using multiplications of tables — as in [33].)

2.7 Divergence between distributions


In many situations it is useful to know how unequal / different / apart prob-
ability distributions are. This can be used for instance in learning, where one
can try to bring a distribution closer to target via iterative adaptations. Such
comparison of distributions can be defined via a metric / distance function.
Here we shall describe a different comparison, called divergence, or more fully
Kullback-Leibler divergence, written as DKL. It is not a distance function
since it is not symmetric: DKL(ω, ρ) ≠ DKL(ρ, ω), in general. But it does sat-
isfy DKL (ω, ρ) = 0 iff ω = ρ. In Section 4.5 we describe ‘total variation’ as a
proper distance function between states.
In the sequel we shall make frequent use of this divergence. This section
collects the definition and some basic facts. It assumes rudimentary familiarity
with logarithms.
Definition 2.7.1. Let ω, ρ be two distributions/states on the same set X with supp(ω) ⊆ supp(ρ). The Kullback-Leibler divergence, or KL-divergence, or simply divergence, of ω from ρ is:

DKL(ω, ρ) := Σ_{x∈X} ω(x) · log(ω(x)/ρ(x)).

The convention is that r · log(r) = 0 when r = 0. Sometimes the natural logarithm ln is used instead of log = log₂.
The inclusion supp(ω) ⊆ supp(ρ) is equivalent to: ρ(x) = 0 implies ω(x) =
0. This requirement immediately implies that divergence is not symmetric. But
even when ω and ρ do have the same support, the divergences DKL (ω, ρ) and
DKL (ρ, ω) are different, in general, see Exercise 2.7.1 below for an easy illus-
tration.
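The definition translates directly into code. The sketch below (our own helper `dkl`, base-2 logarithm, distributions as dictionaries of floats; it assumes supp(ω) ⊆ supp(ρ)) also makes the asymmetry of divergence visible on the pair of states from Exercise 2.7.1.

```python
from math import log2

def dkl(omega, rho):
    """Kullback-Leibler divergence DKL(ω, ρ); assumes supp(ω) ⊆ supp(ρ)."""
    return sum(p * log2(p / rho[x]) for x, p in omega.items() if p > 0)

omega = {'a': 0.25, 'b': 0.75}
rho   = {'a': 0.5,  'b': 0.5}
print(dkl(omega, omega))  # 0.0: zero divergence on equal states
print(dkl(omega, rho))    # ≈ 0.19
print(dkl(rho, omega))    # ≈ 0.21, so not symmetric
```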


Whenever we write an expression DKL (ω, ρ) we will implicitly assume an


inclusion supp(ω) ⊆ supp(ρ).
We start with some easy properties of divergence.

Lemma 2.7.2. Let ω, ρ ∈ D(X) and ω′, ρ′ ∈ D(Y) be distributions.

1 Zero-divergence is the same as equality:

DKL(ω, ρ) = 0 ⟺ ω = ρ.

2 Divergence of products is a sum of divergences:

DKL(ω ⊗ ω′, ρ ⊗ ρ′) = DKL(ω, ρ) + DKL(ω′, ρ′).

Proof. 1 The direction (⇐) is easy. For (⇒), let 0 = DKL(ω, ρ) = Σ_x ω(x) · log(ω(x)/ρ(x)). This means that if ω(x) ≠ 0, one has log(ω(x)/ρ(x)) = 0, and thus ω(x)/ρ(x) = 1 and ω(x) = ρ(x). In particular:

1 = Σ_{x∈supp(ω)} ω(x) = Σ_{x∈supp(ω)} ρ(x).    (∗)

By assumption we have supp(ω) ⊆ supp(ρ). Write supp(ρ) as disjoint union supp(ω) ∪ U for some U ⊆ supp(ρ). It suffices to show U = ∅. We have:

1 = Σ_{x∈supp(ρ)} ρ(x) = Σ_{x∈supp(ω)} ρ(x) + Σ_{x∈U} ρ(x) =(∗) 1 + Σ_{x∈U} ρ(x).

Hence Σ_{x∈U} ρ(x) = 0, and since ρ(x) > 0 for every x in its support, U = ∅.
2 By unwrapping the relevant definitions:

DKL(ω ⊗ ω′, ρ ⊗ ρ′)
  = Σ_{x,y} (ω ⊗ ω′)(x, y) · log((ω ⊗ ω′)(x, y) / (ρ ⊗ ρ′)(x, y))
  = Σ_{x,y} ω(x) · ω′(y) · log((ω(x)/ρ(x)) · (ω′(y)/ρ′(y)))
  = Σ_{x,y} ω(x) · ω′(y) · (log(ω(x)/ρ(x)) + log(ω′(y)/ρ′(y)))
  = Σ_{x,y} ω(x) · ω′(y) · log(ω(x)/ρ(x)) + Σ_{x,y} ω(x) · ω′(y) · log(ω′(y)/ρ′(y))
  = Σ_x ω(x) · log(ω(x)/ρ(x)) + Σ_y ω′(y) · log(ω′(y)/ρ′(y))
  = DKL(ω, ρ) + DKL(ω′, ρ′).
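Item 2 of the lemma can also be confirmed numerically on sample states; the sketch below (our own helpers, floating-point arithmetic) compares both sides of the product rule up to rounding.

```python
from math import log2

def dkl(omega, rho):
    # Kullback-Leibler divergence, base 2; assumes supp(ω) ⊆ supp(ρ)
    return sum(p * log2(p / rho[x]) for x, p in omega.items() if p > 0)

def tensor(s, t):
    # product state s ⊗ t
    return {(x, y): p * q for x, p in s.items() for y, q in t.items()}

w,  r  = {'a': 0.25, 'b': 0.75}, {'a': 0.5, 'b': 0.5}
w2, r2 = {'u': 0.1, 'v': 0.9},   {'u': 0.3, 'v': 0.7}

lhs = dkl(tensor(w, w2), tensor(r, r2))
rhs = dkl(w, r) + dkl(w2, r2)
print(abs(lhs - rhs) < 1e-12)  # the product rule, up to rounding
```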



In order to prove further properties about divergence we need a powerful


classical result called Jensen’s inequality about functions acting on convex
combinations of non-negative reals. We shall use Jensen’s inequality below,
but also later on in learning.
Lemma 2.7.3 (Jensen’s inequality). Let f : R>0 → R be a function that satis-
fies f 00 < 0. Then for all a1 , . . . , an ∈ R>0 and r1 , . . . , rn ∈ [0, 1] with i ri = 1
P
there is an inequality:
X  X
f ri · ai ≥ ri · f (ai ). (2.26)
i i

The inequality is strict, except in trivial cases.


The inequality holds in particular for logarithms, when f = log, or f = ln.
The proof is standard but is included, for convenience.
Proof. We shall provide a proof for n = 2. The inequality is easily extended to n > 2, by induction. So let a, b ∈ R>0 be given, with r ∈ [0, 1]. We need to prove f(ra + (1 − r)b) ≥ r f(a) + (1 − r) f(b). The result is trivial if a = b or r = 0 or r = 1. So let, without loss of generality, a < b and r ∈ (0, 1). Write c := ra + (1 − r)b = b − r(b − a), so that a < c < b. By the mean value theorem we can find a < u < c and c < v < b with:

(f(c) − f(a)) / (c − a) = f′(u)   and   (f(b) − f(c)) / (b − c) = f′(v).

Since f′′ < 0 we have that f′ is strictly decreasing, so f′(u) > f′(v) because u < v. We can write:

c − a = (r − 1)a + (1 − r)b = (1 − r)(b − a)   and   b − c = r(b − a).

From f′(u) > f′(v) we deduce inequalities:

(f(c) − f(a)) / ((1 − r)(b − a)) > (f(b) − f(c)) / (r(b − a)),   i.e.   r(f(c) − f(a)) > (1 − r)(f(b) − f(c)).

By reorganising the latter inequality we get f(c) > r f(a) + (1 − r) f(b), as required.
We can now say a bit more about divergence. For instance, that it is non-
negative, as one expects.
Proposition 2.7.4. Let ω, ρ ∈ D(X) be states on the same space X.

1 DKL(ω, ρ) ≥ 0.
2 State transformation is DKL-non-expansive: for a channel c : X → Y and states ω, ρ ∈ D(X) one has:

DKL(c =≪ ω, c =≪ ρ) ≤ DKL(ω, ρ).


Proof. 1 Via Jensen’s inequality we get:

−DKL(ω, ρ) = Σ_x ω(x) · log(ρ(x)/ω(x))
           ≤ log(Σ_x ω(x) · ρ(x)/ω(x)) = log(Σ_x ρ(x)) = log(1) = 0.

2 Again via Jensen’s inequality:

DKL(c =≪ ω, c =≪ ρ) = Σ_y (c =≪ ω)(y) · log((c =≪ ω)(y) / (c =≪ ρ)(y))
    = Σ_{x,y} ω(x) · c(x)(y) · log((c =≪ ω)(y) / (c =≪ ρ)(y))
    ≤ Σ_x ω(x) · log(Σ_y c(x)(y) · (c =≪ ω)(y) / (c =≪ ρ)(y)).

Hence it suffices to prove:

Σ_x ω(x) · log(Σ_y c(x)(y) · (c =≪ ω)(y) / (c =≪ ρ)(y)) ≤ DKL(ω, ρ) = Σ_x ω(x) · log(ω(x)/ρ(x)).

This inequality follows from another application of Jensen’s inequality:

Σ_x ω(x) · log(Σ_y c(x)(y) · (c =≪ ω)(y) / (c =≪ ρ)(y)) − Σ_x ω(x) · log(ω(x)/ρ(x))
    = Σ_x ω(x) · [log(Σ_y c(x)(y) · (c =≪ ω)(y) / (c =≪ ρ)(y)) − log(ω(x)/ρ(x))]
    = Σ_x ω(x) · log(Σ_y c(x)(y) · ((c =≪ ω)(y) / (c =≪ ρ)(y)) · (ρ(x)/ω(x)))
    ≤ log(Σ_{x,y} ω(x) · c(x)(y) · ((c =≪ ω)(y) / (c =≪ ρ)(y)) · (ρ(x)/ω(x)))
    = log(Σ_y (Σ_x c(x)(y) · ρ(x)) · (c =≪ ω)(y) / (c =≪ ρ)(y))
    = log(Σ_y (c =≪ ω)(y))
    = log(1) = 0.
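The non-expansiveness in item 2, often called the data-processing inequality, can be checked numerically on randomly generated data; the sketch below (all helper names ours) builds a random channel and two random full-support states and confirms that the divergence does not increase under state transformation.

```python
from math import log2
import random

def dkl(omega, rho):
    return sum(p * log2(p / rho[x]) for x, p in omega.items() if p > 0)

def push(c, state):
    # state transformation c =≪ state
    out = {}
    for x, p in state.items():
        for y, q in c[x].items():
            out[y] = out.get(y, 0.0) + p * q
    return out

random.seed(1)

def rand_dist(keys):
    w = [random.random() for _ in keys]
    s = sum(w)
    return {k: v / s for k, v in zip(keys, w)}

X, Y = ['x1', 'x2', 'x3'], ['y1', 'y2']
omega, rho = rand_dist(X), rand_dist(X)
c = {x: rand_dist(Y) for x in X}

print(dkl(push(c, omega), push(c, rho)) <= dkl(omega, rho))  # True
```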


Exercises
2.7.1 Take ω = 1/4|a⟩ + 3/4|b⟩ and ρ = 1/2|a⟩ + 1/2|b⟩. Check that:


1 DKL(ω, ρ) = 3/4 · log(3) − 1 ≈ 0.19.
2 DKL(ρ, ω) = 1 − 1/2 · log(3) ≈ 0.21.
2.7.2 Check that for y ∈ supp(ρ) one has:

DKL(1|y⟩, ρ) = −log(ρ(y)).
2.7.3 Show that the function DKL(ω, −) behaves as follows on convex combinations of states:

DKL(ω, r₁ · ρ₁ + · · · + rₙ · ρₙ) ≤ r₁ · DKL(ω, ρ₁) + · · · + rₙ · DKL(ω, ρₙ).
2.7.4 Use Jensen’s inequality to prove what is known as the inequality of arithmetic and geometric means: for rᵢ, aᵢ ∈ R≥0 with Σᵢ rᵢ = 1,

Σᵢ rᵢ · aᵢ ≥ Πᵢ aᵢ^rᵢ.

3  Drawing from an urn

One of the basic topics covered in textbooks on probability is drawing from


an urn, see e.g. [139, 145, 115] and many other references. An urn is a container
which is typically filled with balls of different colours. The idea is that the balls
in the urn are thoroughly shuffled so that the urn forms a physical representa-
tion of a probability distribution. The challenge is to describe the probabilities
associated with the experiment of (blindly) drawing one or more balls from
the urn and registering the colour(s). Commonly two questions can be asked in
this setting.

• Do we consider the drawn balls in order, or not? More specifically, do we


see a draw of multiple balls as a list, or as a multiset?
• What happens to the urn after a ball is drawn? We consider three scenarios,
with short names “-1”, “0”, “+1”, see Examples 2.1.1.

– In the “-1” scenario the drawn ball is removed from the urn. This is cov-
ered by the hypergeometric distribution.
– In the “0” scenario the drawn ball is returned to the urn, so that the urn
remains unchanged. In that case the multinomial distribution applies.
– In the “+1” scenario not only the drawn ball is returned to the urn, but one
additional ball of the same colour is added to the urn. This is covered by
the Pólya distribution.
In the “-1” and “+1” scenarios the probability of drawing a ball of a certain
colour changes after each draw. In the “-1” case only finitely many draws
are possible, until the urn is empty.

This yields six variations in total, namely with ordered / unordered draws and
with -1, 0, 1 as replacement scenario. These six options are represented in a
3 × 2 table, see (3.3) below.


An important goal of this chapter is to describe these six cases in a princi-


pled manner via probabilistic channels. When drawing from an urn one can
concentrate on the draw — on what is in your hand after the draw — or on the
urn — on what is left in the urn after the draw. It turns out to be most fruitful to
combine these two perspectives in a single channel, acting as a transition map.
This transition aspect will be studied more systematically in terms of proba-
bilistic automata in Chapter ??. Here we look at single (ball) draws, which may
be described informally as:

Urn ↦ (single-colour, Urn′)    (3.1)

We use the ad hoc notation Urn′ to describe the urn after the draw. It may
be the same urn as before, in case of a draw with replacement, or it may be
a different urn, with one ball less/more, namely the original urn minus/plus a
ball as drawn.
The above transition (3.1) will be described as a probabilistic channel. It
gives for each single draw the associated probability. In this description we
combine multisets and distributions. For instance, an urn with three red balls
and two blue ones will be described as a (natural) multiset 3|R⟩ + 2|B⟩. The
transition associated with drawing a single ball without replacement (scenario
“-1”) gives a mapping:
3|R⟩ + 2|B⟩ ↦ 3/5|R, 2|R⟩ + 2|B⟩⟩ + 2/5|B, 3|R⟩ + 1|B⟩⟩.

It gives the 3/5 probability of drawing a red ball, together with the remaining urn, and a 2/5 probability of drawing a blue one, with a different new urn.
The “+1” scenario with double replacement gives a mapping:

3|R⟩ + 2|B⟩ ↦ 3/5|R, 4|R⟩ + 2|B⟩⟩ + 2/5|B, 3|R⟩ + 3|B⟩⟩.

Finally, the situation with single replacement (“0”) is given by:

3|R⟩ + 2|B⟩ ↦ 3/5|R, 3|R⟩ + 2|B⟩⟩ + 2/5|B, 3|R⟩ + 2|B⟩⟩.

In this last, third case we see that the urn/multiset does not change. An im-
portant first observation is that in that case we may as well use a distribution
as urn, instead of a multiset. The distribution represents an abstract urn. In the
above example we would use the distribution 3/5|R⟩ + 2/5|B⟩ as abstract urn, when
we draw with single replacement (case “0”). The distribution contains all the
relevant information. Clearly, it is obtained via frequentist learning from the
original multiset. Using distributions instead of multisets gives more flexibility,


since not all distributions are obtained via frequentist learning — in particular
when the probabilities are proper real numbers and not fractions.
We formulate this approach explicitly.

• In the “-1” and “+1” situations without or with double replacement, an urn
is a (non-empty, natural) multiset, which changes with every draw, via re-
moval or double replacement of the drawn ball. These scenarios will also be
described in terms of deletion or addition, using the draw-delete and draw-
add channels DD and DA that we already saw in Definition 2.2.4.
• In a “0” situation with replacement, an urn is a probability distribution; it
does not change when balls are drawn.
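The three single-draw transitions fit naturally into one parametrised function. The sketch below (our own helper, with urns as colour-count dictionaries) reproduces the three mappings above for the urn 3|R⟩ + 2|B⟩; the successor urn is encoded as a sorted tuple so that it can serve as a dictionary key.

```python
from fractions import Fraction as F

def draw(urn, mode):
    """Single-draw transition on an urn (a multiset, as a dict colour ↦ count).

    mode −1: remove the drawn ball; mode 0: replace it; mode +1: add an
    extra copy. Returns a distribution over pairs (colour, new urn)."""
    total = sum(urn.values())
    out = {}
    for colour, n in urn.items():
        new_urn = dict(urn)
        new_urn[colour] = n + mode           # n−1, n, or n+1
        if new_urn[colour] == 0:
            del new_urn[colour]
        out[(colour, tuple(sorted(new_urn.items())))] = F(n, total)
    return out

urn = {'R': 3, 'B': 2}
print(draw(urn, -1))  # 3/5 of R with urn 2R+2B, 2/5 of B with urn 3R+1B
print(draw(urn, 0))   # urn unchanged in both branches
print(draw(urn, +1))  # 3/5 of R with urn 4R+2B, 2/5 of B with urn 3R+3B
```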

This covers the above second question, about what happens to the urn. For
the first question concerning ordered and unordered draws one has to go be-
yond single draw transitions. Hence we need to suitably iterate the single-draw
transition (3.1) to:
Urn ↦ (multiple-colours, Urn′)    (3.2)

Now we can make the distinction between ordered and unordered draws ex-
plicit. Let X be the set of colours, for the balls in the urn — so X = {R, B} in
the above illustration.

• An ordered draw of multiple balls, say K many, is represented via a list


X K = X × · · · × X of length K.
• An unordered draw of K-many balls is represented as a K-sized multiset, in
N[K](X).

Thus, in the latter case, both the urn and the handful of balls drawn from it, are
represented as a multiset.
In the end we are interested in assigning probabilities to draws, ordered or
not, in “-1”, “0” or “1” mode. These probabilities on draws are obtained by tak-
ing the first marginal/projection of the iterated transition map (3.2). It yields a
mapping from an urn to multiple draws. The following table gives an overview
of the types of these operations, where X is the set of colours of the balls.

    mode    ordered                    unordered
      0     O0 : D(X) → X^K            U0 : D(X) → N[K](X)
     −1     O− : N[L](X) → X^K         U− : N[L](X) → N[K](X)       (3.3)
     +1     O+ : N∗(X) → X^K           U+ : N∗(X) → N[K](X)

(All six draw maps are channels.)


We see that in the replacement scenario “0” the inputs of these channels are
distributions in D(X), as abstract urns. In the deletion scenario “-1” the inputs
(urns) are multisets in N[L](X), of size L. In the ordered case the outputs
are tuples in X K of length K and in the unordered case they are multisets in
N[K](X) of size K. Implicitly in this table we assume that L ≥ K, so that the
urn is full enough for K single draws. In the addition scenario “+1” we only
require that the urn is a non-empty multiset, so that at least one ball can be
drawn. Sometimes it is required that there is at least one ball of each colour in
the urn, so that all colours can occur in draws.
The above table uses systematic names for the six different draw maps. In
the unordered case the following (historical) names are common:

U0 = multinomial U− = hypergeometric U+ = Pólya.
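Although the closed formulas for these channels only appear later in the chapter, the unordered draw maps can already be computed by iterating the single-draw transition (3.1) and then accumulating the drawn colours into a multiset. The sketch below (our own helper functions, exact arithmetic) does this for two “-1” draws from 3|R⟩ + 2|B⟩, yielding the expected hypergeometric probabilities.

```python
from fractions import Fraction as F

def draw(urn, mode):
    # single-draw transition: list of (colour, new urn, probability)
    total = sum(urn.values())
    out = []
    for colour, n in urn.items():
        new_urn = dict(urn)
        new_urn[colour] = n + mode
        if new_urn[colour] == 0:
            del new_urn[colour]
        out.append((colour, new_urn, F(n, total)))
    return out

def draws(urn, mode, k):
    """Iterate the single-draw transition k times and accumulate the drawn
    colours into a multiset: the first marginal of (3.2)."""
    states = [((), urn, F(1))]
    for _ in range(k):
        states = [(hand + (c,), u2, p * q)
                  for hand, u, p in states
                  for c, u2, q in draw(u, mode)]
    acc = {}
    for hand, _, p in states:
        key = tuple(sorted(hand))         # accumulation: forget the order
        acc[key] = acc.get(key, F(0)) + p
    return acc

# the U− (hypergeometric) case: two draws without replacement
print(draws({'R': 3, 'B': 2}, -1, 2))
# two reds: 3/10, one of each: 3/5, two blues: 1/10
```

Changing `mode` to 0 or +1 gives the multinomial and Pólya cases for the same urn.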

In this chapter we shall see that these three draw maps have certain properties
in common, such as:

• Frequentist learning applied to the draws yields the same outcome as fre-
quentist learning from the urn, see Theorem 3.3.5, Corollary 3.4.2 (2) and
Proposition 3.4.5 (1).
• Doing a draw-delete DD after a draw of size K + 1 is the same as doing a
K-sized draw, see Proposition 3.3.8, Corollary 3.4.2 (4) and Theorem 3.4.6.

But we shall also see many differences between the three forms of drawing.
This chapter describes and analyses the six probabilistic channels in Ta-
ble 3.3 for drawing from an urn. It turns out that many of the relevant prop-
erties can be expressed via composition of such channels, either sequentially
(via ◦·) or in parallel (via ⊗). In this analysis the operations of accumulation
acc and arrangement arr, for going back-and-forth between products and mul-
tisets, play an important role. For instance, in each case one has commuting
diagrams of the form.

U0 6 N[K](X) U− 4 N[K](X) U+ 6 N[K](X)


◦ ◦ ◦
arr ◦ arr ◦ arr ◦
  
D(X)
O0
◦ / XK ◦ N[L](X)
O−
◦ / XK ◦ N∗ (X)
O+
◦ / XK ◦ (3.4)
acc ◦ acc ◦ acc ◦
U0

(  U−

*  U+

( 
N[K](X) N[K](X) N[K](X)

Such commuting diagrams are unusual in the area of probability theory but we
like to use them because they are very expressive. They involve multiple equa-
tions and clearly capture the types and order of the various operations involved.

130
131

Moreover, to emphasise once more, these are diagrams of channels with chan-
nel composition ◦·. As ordinary functions, with ordinary function composition
◦, there are no such diagrams / equations.
The chapter first takes a fresh look at accumulation and arrangement, in-
troduced earlier in Sections 1.6 and 2.2. These are the operations for turning
a list into a multiset, and a multiset into a uniform distribution of lists (that
accumulate to the multiset). In Section 3.2 we will use accumulation and ar-
rangement for the powerful operation for “zipping” two multisets of the same
size together. It works analogously to zipping of two lists of the same length,
into a single list of pairs, but this ‘multizip’ produces a distribution over (com-
bined) multisets. The multizip operation turns out to interact smoothly with
multinomial and hypergeometric distributions, as we shall see later on in this
chapter.
Section 3.3 investigates multinomial channels and Section 3.4 covers both
hypergeometric and Pólya channels, all in full, multivariate generality. We have
briefly seen the multinomial and hypergeometric channels in Example 2.1.1.
Now we take a much closer look, based on [78], and describe how these chan-
nels commute with basic operations such as accumulation and arrangement,
with frequentist learning, with multizip, and with draw-and-delete. All these
commutation properties involve channels and channel composition ◦·.
Subsequently, Section 3.5 elaborates how the channels in Table 3.3 actu-
ally arise. It makes the earlier informal descriptions in (3.1) and (3.2) mathe-
matically precise. What happens is a bit sophisticated. We recall that for any
monoid M, the mapping X 7→ M × X is a monad, called the writer monad,
see Lemma 1.9.1. This can be combined with the distribution monad D, giv-
ing a combined monad X 7→ D(M × X). It comes with an associated ‘Kleisli’
composition. It is precisely this composition that we use for iterating a single
draw, that is, for going from (3.1) to (3.2). Moreover, for ordered draws we use
the monoid M = L(X) of lists, and for unordered draws we use the monoid
M = N(X) of multisets. It is rewarding, from a formal perspective, to see that
from this abstract principled approach, common distributions for different sorts
of drawing arise, including the well known multinomial, hypergeometric and
Pólya distributions. This is based on [81].
The subsequent two sections 3.6 and 3.7 of this chapter focus on a non-
trivial operation from [78], namely turning a multiset of distributions into a
distribution of multisets. Technically, this operation is a distributive law, called
the parallel multinomial law. We spend ample time introducing it: Section 3.6
contains no less than four different definitions — all equivalent. Subsequently,
various properties are demonstrated of this parallel multinomial law, including

131
132 Chapter 3. Drawing from an urn

commutation with hypergeometric channels, with frequentist learning and with


muliset-zip.
The final section 3.9 briefly looks at yet another form of drawing coloured
balls from an urn. The draws considered there are like Pólya urns — where an
additional ball of the same colour as the drawn ball is returned to the urn —
but with a new twist: after each draw another ball is added to the urn with a
fresh colour, not already occurring in the urn. This gives an entirely new dy-
namics. The new colour works like a new mutation in genetics, and indeed,
these kind of “Hoppe” urns have arisen in mathematical biology, first in the
work of Ewens [46]. We describe the resulting distributions via probabilistic
channels and show that there are some similarities and differences with the
classical draws — multinomial, hypergeometric, and Pólya — where the num-
ber of colours of balls in the urn does not increase.
This chapter make more use of the language of category theory than other
chapters in this book. It demonstrates that that there is ample (categorical)
structure in basic probabilistic operations. The various properties of multino-
mial, hypergeometric and Pólya channels that occur in this chapter wil be used
frequently in the sequel. Other, more technical topics could be skipped at first
reading. An even more categorical, axiomatic approach can be found in [80].

3.1 Accumlation and arrangement, revisited


Accumulation and arrangement are the operations that turn a sequence (list)
of elements into a multiset and vice-versa. These operations turn out to play
an important role in connecting multisets and distributions. That’s why we use
this first section to look closer than we have done before. We shall see the
role that permutations play. The next section introduces a ‘zip’ operation for
multisets, in terms of accumulation and arrangement.
In the previous two chapters we have introduced the accumulation function
acc : L(X) → N(X) that turns a list into a multiset by counting the occurrences
of each element in the list, see (1.31). For instance, acc(a, b, a, a) = 3| a i+1| bi.
Clearly, acc is a surjective function. Accumulation forms a map of monads,
see Exercise 1.9.7, which means that it is natural and commutes with the unit
and flatten maps (of L and M). In this chapter we shall use the accumlation
map restricted to a particular ‘size’, via a number K ∈ N and via a restriction
acc[K] : X K → N[K](X) to lists and multisets of size K. The parameter K in
acc[K] is often omitted if it is clear from the context.
In the other direction, one can go from multisets to lists/sequences via what
we have called arrangement (2.10). This is not a function but channel, which

132
3.1. Accumlation and arrangement, revisited 133

assigns equal probability to each sequence. Here we shall also use arrangement
for a specific size, as a channel arr[K] : N[K](X) → X K . We recall:
X 1 kϕk! K!
arr[K](ϕ) = ~x where (ϕ ) = = Q .
(ϕ) ϕ x ϕ(x)!
~x∈acc (ϕ)
−1

We recall also that (ϕ ) is called the multiset coefficient of ϕ. It is the number


of sequences that accumulate to ϕ, see Proposition 1.6.3.
One has acc ◦· arr = id , see (2.11). Composition in the other direction does
not give an identity but a (uniform) distribution of permutations:
X 1 X 1
arr ◦· acc (~x) = ~y = ~y . (3.5)

(acc(~x) ) K!
~y∈acc −1 (acc(~x)) ~y is a permutation of ~x

The vectors ~y take the multiplicities of elements in ~x into account, which leads
1
to the factor ( acc(~
x) )
.
Permutations are an important part of the story of accumulation and arrange-
ment. This is very explicit in the following result. Categorically it can be de-
scribed in terms of a so-called coequaliser, but here we prefer a more concrete
description. The axiomatic approach of [80] is based on this coequaliser.

Proposition 3.1.1. For a number K ∈ N, let f : X K → Y be a function which


is stable under permutation: for each permutation π : {1, . . . , K} → 
{1, . . . , K}
one has

f x1 , . . . , xK = f xπ(1) , . . . , xπ(K) , for all sequences (x1 , . . . , xK ) ∈ X K .


 

Then there is a unique function f : N[K](X) → Y with f ◦ acc = f . In a


diagram, we write this unique existence as a dashed arrow:

XK
acc / / N[K](X)

f
f 
+Y

The double head / / for acc is used to emphasise that it is a surjective func-
tion.

This result will be used both as a definition principle and as a proof principle.
The existence part can be used to define a function N[K] → Y by specifying a
function X K → Y that is stable under permutation. The uniqueness part yields
a proof principle: if two functions g1 , g2 : N[K](X) → Y satisfy g1 ◦ acc =
g2 ◦ acc, then g1 = g2 . This is quite powerful, as we shall see. Notice that acc
is stable under permutation itself, see also Exercise 1.6.4.

133
134 Chapter 3. Drawing from an urn

Proof. Take ϕ = 1≤i≤` ni | xi i ∈ N[K](X). Then we can define f (ϕ) via any
P
arrangement, such as:

f (ϕ) B f x1 , . . . , x1 , . . . , x` , . . . , x` .

| {z } | {z }
n1 times n` times

Since f is stable under permutation, any other arrangement of the elements xi


gives the same outcome, that is, f (ϕ) = f (~x) for each ~x ∈ X K with acc(~x) = ϕ.
With this definition, f ◦ acc = f holds, since for a sequence ~x = (x1 , . . . , xK ) ∈
X , say with acc(~x) = ϕ,
K

f (x1 , . . . , xK ) = f (ϕ) = f acc(~x) = f ◦ acc (x1 , . . . , xK ).


 

For uniqueness, suppose g : N[K](X) → Y also satisfies g ◦ acc = f . Then


g = f , by the following argument. Take ϕ ∈ N[K](X); since acc is surjective,
we can find ~x ∈ X K with acc(~x) = ϕ. Then:

g(ϕ) = g acc(~x) = f (~x) = f (ϕ).




Example 3.1.2. We illustrate the use of Proposition 3.1.1 to re-define the ar-
rangement map and to prove one of its basic properties. Assume for a moment
that we do not already now about arrangement, only about accumulation. Now
consider the following situation:

XK
acc / / N[K](X)
X 1
arr where perm(~x) B ~y .
perm
'  ~y is a permutation of ~x
K!
D(X K )

By construction, the function perm is stable under permutation. Hence the


function acc exists, by using Proposition 3.1.1 as a definition principle. As
a result, perm = arr ◦ acc.
Next we use Proposition 3.1.1 as proof principle. Our aim is to re-prove the
equality acc  ◦· arr = unit : N[K](X) → D N[K](X)), which we already know
from (2.11). Via our new proof principle such an equality can be obtained when
both sides of the equation fit in a situation where there is a unique solution. This
situation is given below.

XK
acc / / N[K](X)

unit acc ◦·arr


acc 
*  
D N[K](X)

134
3.1. Accumlation and arrangement, revisited 135

We show that both dashed arrows fit. The first equation below holds by con-
struction of arr, in the above triangle.
 
x) = acc  ◦· perm (~x)
acc ◦· arr ◦ acc (~
 
 
 X 1 
= D(acc)  ~y 
K! 
~y is a permutation of ~x
X 1
=

acc(~y)
K!
~y is a permutation of ~x
X 1
=

acc(~x)
K!
~y is
a permutation of ~x
=

1 acc(~x)
= acc (~ x)
=

unit ◦ acc (~x).

The fact that the composite arr ◦· acc produces N all permutation is used in the
K K
following result. Recall that the ‘big’ tensor : D(X) → D(X ) is defined
(ω1 , . . . , ωK ) = ω1 ⊗ · · · ⊗ ωK , see 2.21.
N
by

Proposition 3.1.3. The composite arr ◦· acc commutes with tensors, as in the
following diagram of channels.

D(X)K
acc
◦ / N[K] D(X) arr
◦ / D(X)K
N N
◦ ◦
 
XK
acc
◦ / N[K](X) arr
◦ / XK

Proof. For distributions ω1 , . . . , ωK ∈ D(X) and elements x1 , . . . , xK ∈ X we


have:
X 1
ω)(~x) = · ρ1 ⊗ · · · ⊗ ρK (~x)
N  
◦·arr ◦· acc (~
( acc(~ ω) )
~ρ∈acc −1 (acc(~
ω))
X 1
= · ρ1 ⊗ · · · ⊗ ρK (~x)

K!
~ρ is a permutation of ω
~
X 1
= · ω1 ⊗ · · · ⊗ ωK (~y)

K!
~y is a permutation of ~x
X 1
= · ω1 ⊗ · · · ⊗ ωK (~y)

(acc(~x) )
~y∈acc −1 (acc(~x))

= ω)(~x).
N
arr ◦· acc ◦· (~

135
136 Chapter 3. Drawing from an urn

Exercises
3.1.1 Show that the permutation channel perm from Example 3.1.2 is an
idempotent, i.e. satisfies perm ◦· perm = perm. Prove this both con-
cretely, via the definition of perm, and abstractly, via acc and arr.
3.1.2 Consider Proposition 3.1.3 for X = {a, b} and K = 4. Check that:

◦·arr ◦· acc ω1 , ω1 , ω2 , ω2 a, b, b, b
N   

= 12 · ω1 (a) · ω1 (b) · ω2 (b) · ω2 (b) + 21 · ω1 (b) · ω1 (b) · ω2 (a) · ω2 (b)


= arr ◦· acc ◦· ω1 , ω1 , ω2 , ω2 a, b, b, b .
N  

3.2 Zipping multisets


For two multisets ϕ ∈ N[K](X) and ψ ∈ N[L](Y) we can form their tensor
product ϕ ⊗ ψ ∈ N[K · L](X × Y). The fact that it has size K · L is shown
in the first lines of the proof of Proposition 2.4.7. For sequences there is the
familiar zip combination map X K ×Y K → (X ×Y)K that does maintain size, see
Exercise 1.1.7. Interestingly, there is also a zip-like operation for combining
two multisets of the same size. It makes systematic use of accumulation and
arrangement. This will be described next, based on [78].
Zipping two lists of the same length is a standard operation in (functional)
programming. It produces a new list, consisting of pairs of elements from the
two lists. We have seen it for products in Exercise 1.1.7, as an isomorphism
zip[K] : X K ×Y K −→ 
(X ×Y)K . Our aim below is to describe a similar function,
called multizip and written as mzip, for (natural) multisets of size K. It is
not immediately obvious how mzip should work, since there are many ways
to combine elements from two multisets. Analogously to the arrange channel
arr[K] : N[K](X) → X K we shall describe mzip on multisets as a (uniform)
channel:

N[K](X) × N[K](Y)
mzip[K]
◦ / N[K](X × Y).

So let multisets ϕ ∈ N[K](X) and ψ ∈ N[K](Y) be given. For all sequences


~x ∈ X K and ~y ∈ Y K with acc(~x) = ϕ and acc(~y) = ψ we can form their ordinary
zip, as zip(~x, ~y) ∈ (X × Y)K . Accumulating the pairs in this zip gives a multiset
over X × Y. Thus we define:
X X 1
mzip(ϕ, ψ) B acc zip(~x, ~y) .

(3.6)
( ϕ ) · (ψ )
~x∈acc −1 (ϕ) ~y∈acc −1 (ψ)

136
3.2. Zipping multisets 137

Diagrammatically we may describe mzip as the following composite.

N[K](X) × N[K](Y)
arr⊗arr / D XK × Y K 
 D(zip) (3.7)

D (X × Y)K
 / D N[K](X × Y)
D(acc)

An illustration may help to see what happens here.

Example 3.2.1. Let’s use two set X = {a, b} and Y = {0, 1} with two multisets
of size three:

ϕ = 1| ai + 2| bi and ψ = 2| 0 i + 1| 1 i.

Then:
3 3
(ϕ) = 1,2 =3 ( ψ) = 2,1 =3

The sequences in X 3 and Y 3 that accumulate to ϕ and ψ are:


 


 a, b, b 

 0, 0, 1
and 0, 1, 0
 

 b, a, b 

 b, b, a
  1, 0, 0

Zipping them together gives the following nine sequences in (X × Y)3 .

(a, 0), (b, 0), (b, 1) (b, 0), (a, 0), (b, 1) (b, 0), (b, 0), (a, 1)
(a, 0), (b, 1), (b, 0) (b, 0), (a, 1), (b, 0) (b, 0), (b, 1), (a, 0)
(a, 1), (b, 0), (b, 0) (b, 1), (a, 0), (b, 0) (b, 1), (b, 0), (a, 0)
By applying the accumulation function acc to each of these we get multisets:

1|a, 0i+1| b, 0i+1| b, 1 i 1| b, 0 i+1| a, 0 i+1| b, 1i 2| b, 0 i+1| a, 1 i


1| a, 0 i+1| b, 1i+1| b, 0 i 2| b, 0 i+1|a, 1i 1| b, 0 i+1| b, 1 i+1| a, 0 i
1|a, 1 i+2| b, 0i 1| b, 1 i+1| a, 0 i+1|b, 0i 1| b, 1i+1| b, 0 i+1| a, 0 i
We see that are only two different multisets involved. Counting them and mul-
tiplying with ( ϕ )·(1 ψ ) = 19 gives:
 
mzip[3] 1|a i+2| bi, 2| 0i+1| 1 i

= 1 1| a, 1 i+2| b, 0 i + 2 1| a, 0 i+1|b, 0i+1| b, 1i .

3 3

This shows that calculating mzip is laborious. But it is quite mechanical and
easy to implement. The picture below suggests to look at mzip as a funnel with
two input pipes in which multiple elements from both sides can be combined
into a probabilistic mixture.

137
138 Chapter 3. Drawing from an urn
1|a i+2| b i 2| 0 i+1| 1i
@ @
@ @


1
+ 2

3 1| a, 1 i+2| b, 0i 3 1|a, 0i+1| b, 0i+1| b, 1 i

The mzip operation satisfies several useful properties.


Proposition 3.2.2. Consider mzip : N[K](X) × N[K](Y) → N[K](X × Y),
either in equational formulation (3.6) or in diagrammatic form (3.7).
1 The mzip map is natural in X and Y: for functions f : X → U and g : Y → V
the following diagram commutes.

N[K](X) × N[K](Y)
mzip
/ D N[K](X × Y)
N[K]( f )×N[K](g) D(N[K]( f ×g))
 
N[K](U) × N[K](V)
mzip
/ D N[K](U × V)

2 For ϕ ∈ N[K](X) and y ∈ Y,



mzip(ϕ, K| y i) = 1 ϕ ⊗ 1| y i .

And similarly in symmetric form.


3 Multizip commutes with projections in the following sense.
π1 π2
N[K](X) o N[K](X) × N[K](Y) / N[K](Y)
mzip

unit
  unit
D N[K](X) o / D N[K](Y)
 D(N[K](π1 ))  D(N[K](π2 ))
D N[K](X × Y)

4 Arrangement arr relates zip and mzip as in:

N[K](X) × N[K](Y)
arr⊗arr
◦ / XX × Y K
mzip ◦ ◦ zip
 
N[K](X × Y)
arr
◦ / (X × Y)K

5 mzip is associative, as given by:

N[K](X) × N[K](Y) × N[K](Z)


id ⊗mzip
◦ / N[K](X) × N[K](Y × Y)
mzip⊗id ◦ ◦ mzip
 
N[K](X × Y) × N[K](Z)
mzip
◦ / N[K](X × Y × Z)

Here we take associativity of × for granted.

138
3.2. Zipping multisets 139

Proof. 1 Easy, via the diagrammatic formulation (3.7), using naturality of arr
(Exercise 2.2.7), of zip (Exercise 1.9.4), and of acc (Exercise 1.6.6).
2 Since:
X 1
mzip(ϕ, K| y i) = acc zip(~x, hy, . . . , yi)

( ϕ )
~x∈acc −1 (ϕ)
1
acc(−x−i→
X
= , y)

( ϕ )
~x∈acc −1 (ϕ)
X 1
=

acc(~x) ⊗ 1|y i
( ϕ )
~x∈acc −1 (ϕ)
X 1
= ϕ ⊗ 1| y i = 1 ϕ ⊗ 1| y i .

( ϕ )
−1 ~x∈acc (ϕ)

3 By naturality of acc and zip:


 
D(N[K](π1 )) mzip(ϕ, ψ)
X X 1
= N[K](π1 ) acc(zip(~x, ~y))

( ϕ ) · ( ψ)
~x∈acc −1 (ϕ) ~y∈acc −1 (ψ)
X X 1
= acc (π1 )K (zip(~x, ~y))

( ϕ ) · ( ψ)
~x∈acc −1 (ϕ) ~y∈acc −1 (ψ)
X X 1 X 1
= acc(~x) = ϕ = 1 ϕ .

( ϕ ) · ( ψ) (ϕ)
~x∈acc (ϕ) ~y∈acc (ψ)
−1 −1 −1 ~x∈acc (ϕ)

4 We have for ϕ ∈ N[K](X), ψ ∈ N[K](Y),


 
X  X 
arr ◦· mzip (ϕ, ψ) =
  arr(χ)(~ z ) · mzip(ϕ, ψ)(χ)  ~z
 
~z∈(X×Y)K χ∈N[K](X×Y)
X 1
= · mzip(ϕ, ψ)(acc(~z)) ~z

( acc(~ z ) )
~z∈(X×Y)K  
X 1  X 1 1 
 ~z
= ·  ·
( acc(~z) ) ( ϕ ) ( ψ) 
~z∈(X×Y) K ~x∈acc (ϕ),~y∈acc (ψ),acc(zip(~x,~y))=acc(~z)
−1 −1
X X 1 1 X 1
= · · ~z
( ϕ ) ( ψ) ( acc(zip(~x, ~y)) )
~x∈acc −1 (ϕ) ~y∈acc −1 (ψ) ~z∈acc −1 (acc(zip(~x,~y)))
X X 1 1
= zip(~x, ~y)

·
( ϕ ) ( ψ)
~x∈acc (ϕ) ~y∈acc (ψ)
−1 −1

since permuting a zip of permuted inputs does not add anything


= zip  ◦· (arr ⊗ arr) (ϕ, ψ).


5 Consider the following diagram chase, obtained by unpacking the (diagram-

139
140 Chapter 3. Drawing from an urn

matic) definition of mzip on the right-hand side.

N[K](X)×N[K](Y)×N[K](Z)
id ⊗mzip
◦ / N[K](X)×N[K](Y ×Z)
arr⊗arr⊗arr ◦ arr⊗arr
◦* 
mzip⊗id ◦ K
X ×Y ×Z K K id ⊗zip
◦ / X ×(Y ×Z)K
K

zip⊗id ◦ ◦ zip mzip ◦


  
arr⊗arr /
2/ (X×Y ×Z)
zip
N[K](X)×N[K](Y ×Z) ◦ (X×Y)K ×Z K ◦ K

mzip ◦ ◦ ◦ acc
 arr 
N[K](X×Y ×Z) ◦ N[K](X×Y ×Z) o
We can see that the outer diagram commutes by going through the internal
subdiagrams. In the middle we use associativity of the (ordinary) zip func-
tion, formulated in terms of (deterministic) channels. Three of the (other)
internal subdiagrams commute by item (4). The acc-arr triangle at the bot-
tom commutes by (2.11).

The following result deserves a separate status. It tells that what we learn
from a multiset zip is the same as what we learn from a parallel product (of
multisets).

Theorem 3.2.3. Multiset zip and frequentist learning interact well, namely as:

Flrn = mzip(ϕ, ψ) = Flrn(ϕ ⊗ ψ).


Equivalently, in diagrammatic form:

N[K](X) × N[K](Y)
mzip
◦ / N[K](X × Y)

 
◦ Flrn
2
N[K ](X × Y)
Flrn
◦ / X×Y

Proof. Let multisets ϕ ∈ N[K](X) and ψ ∈ N[K](Y) be given and let a ∈ X


and b ∈ Y be arbitrary elements. We need to show that the probability:
  X X acc zip(~x, ~y)(a, b)
Flrn = mzip(ϕ, ψ) (a, b) =
K · (ϕ ) · ( ψ)
~x∈acc (ϕ) ~y∈acc (ψ)
−1 −1

is the same as the probability:


ϕ(a) · ψ(a)
Flrn(ϕ ⊗ ψ)(a, b) = .
K·K
We reason informally, as follows. For arbitrary ~x ∈ acc −1 (ϕ) and ~y ∈ acc −1 (ψ)
we need to find the fraction of occurrences (a, b) in zip(~x, ~y). The fraction of
occurrences of a in ~x is ϕ(a)
K = Flrn(ϕ)(a), and the fraction of occurrences of b

140
3.2. Zipping multisets 141

in ~y is ψ(b) x, ~y)
K = Flrn(ψ)(b). Hence the fraction of occurrences of (a, b) in zip(~
is Flrn(ϕ)(a) · Flrn(ψ)(b) = Flrn(ϕ ⊗ ψ)(a, b).

Once we have seen the definition of mzip, via ‘deconstruction’ of multisets


into lists, a zip operation on lists, and ‘reconstruction’ to a multiset result, we
can try to apply this approach more widely. For instance, instead of using a
zip on lists we can simply concatenate (++) the lists — assuming they contain
elements from the same set. This yields, like in (3.7), a composite channel:

N[K](X) × N[L](X)
arr⊗arr / D XK × XL
 D(++)
 
D X K+L / D N[K +L](X)
D(acc)

It is easy to see that this yields addition of multisets, as a deterministic channel.


We don’t get the tensor ⊗ of multisets in this way, because there is no tensor
of lists, see Remark 2.4.3.
We conclude with several useful observations about accumulation in two
dimensions.

Lemma 3.2.4. Fix a number K ∈ N and a set X.

1 We can mix K-ary and 2-ary accumulation in the following way.

XK × XK
acc[K]×acc[K]
/ N[K](X) × N[K](X) +
(
zip  N[2K](X)
6
 acc[2]K
(X × X)K / N[2](X)K +

2 When we generalise from 2 to L ≥ 2 we get:

acc[K]L
XK
L / N[K](X)L +

'
zip L  N[L·K](X)
7

acc[L]K
XL K
 / N[L](X)K +

Proof. 1 Commutation of the diagram is ‘obvious’, so we provide only an

141
142 Chapter 3. Drawing from an urn

exemplary proof. The two paths in the diagram yield the same outcomes in:

+ ◦ (acc[K] × acc[K]) [a, b, c, a, c], [b, b, a, a, c]


 

= 2| a i + 1|b i + 2| ci + 2| a i + 2| bi + 1| ci
 

= 4| a i + 3| b i + 3| ci.
+ ◦ acc[2]K ◦ zip [a, b, c, a, c], [b, b, a, a, c]
 

= + ◦ acc[2]K [(a, b), (b, b), (c, a), (a, a), (c, c)]
 

= + [1| a i + 1|b i, 2| bi, 1| a i + 1| c i, 2| ai, 2| c i]




= 4| a i + 3|b i + 3| ci.

2 Similarly.

We now transfer these results to multizip.

Proposition 3.2.5. In binary form one has:

+ / N[2K](X) unit
N[K](X) × N[K](X)
( 
mzip D N[2K](X)
6

D N[K](X × X)
 D(N[K](acc[2]))
/ D N[K] N[2](X) D(flat)

More generally, we have for L ≥ 2,

+ / N[L·K](X)
N[K](X)L unit

( 
mzip L D N[L·K](X)
6
 D(N[K](acc[L]))/
D N[K](X L )

D N[K] N[L](X) D(flat)

The map mzip L is the L-ary multizip, obtained via:

mzip 2 B mzip and mzip L+1 B mzip ◦· (mzip L ⊗ id )

Via the associativity of Proposition 3.2.2 (5) the actual arrangement of these
multiple multizips does not matter.

142
3.2. Zipping multisets 143

Proof. We only do the binary case, via an equational proof:

D(flat) ◦ D(N[K](acc[2])) ◦ mzip


(3.7)
= D(flat) ◦ D(N[K](acc[2])) ◦ D(acc[K]) ◦ D(zip) ◦ (arr ⊗ arr)
= D(flat) ◦ D(acc[K]) ◦ D(acc[2]K ) ◦ D(zip) ◦ (arr ⊗ arr)
by naturality of acc[K] : X K → N[K](X)
= D(+) ◦ D(acc[2]K ) ◦ D(zip) ◦ (arr ⊗ arr) by Exercise 1.6.7
= D(+) ◦ D(acc[K] × acc[K]) ◦ (arr ⊗ arr) by Lemma 3.2.4 (1)
= D(+) ◦ (D(acc[K]) ◦ arr) ⊗ (D(acc[K]) ◦ arr)

(2.11)
= D(+) ◦ (unit ⊗ unit)
= D(+) ◦ unit
= unit ◦ +.

Exercises
3.2.1 Show that:
 
mzip[4] 1| a i + 2|b i + 1| ci, 3|0 i + 1| 1i

= 41 1| a, 1 i + 2| b, 0i + 1| c, 0 i


+ 21 1| a, 0 i + 1| b, 0i + 1| b, 1 i + 1|c, 0 i


+ 41 1| a, 0 i + 2| b, 0i + 1| c, 1 i .

3.2.2 Show in the context of the previous exercise that:


 
Flrn = mzip[4] 1| a i + 2| b i + 1|c i, 3| 0 i + 1|1 i
= 3
16 | a, 0i + 1
16 |a, 1 i + 38 | b, 0 i + 18 | b, 1 i + 3
16 | c, 0 i + 1
16 | c, 1i.

Compute also: Flrn 1| a i + 2| b i + 1| c i ⊗ Flrn 3|0 i + 1| 1 i , and re-


 
member Theorem 3.2.3.
3.2.3 1 Check that mzip does not commute with diagonals, in the sense
that the following triangle does not commute.

∆ / N[K](X) × N[K](X)
N[K](X)
, mzip

N[K](∆) - D N[K](X × X)

Hint: Take for instance 1| a i + 1|b i.

143
144 Chapter 3. Drawing from an urn

2 Check that mzip and zip do not commute with accumulation, as in:

XK × Y K
acc × acc / N[K](X) × N[K](Y)
zip  , mzip
 
acc  / D N[K](X × Y)
(X × Y)K

Hint: Take sequences [a, b, b], [0, 0, 1] and re-use Example 3.2.1.

3.3 The multinomial channel


Multinomial distributions appear when one draws multiple coloured balls from
an urn, with replacement, as described briefly in Example 2.1.1 (2). These dis-
tributions assign a probability to such a draw, where the draws themselves will
be represented as multisets, over colours. We shall standardly describe multi-
nomial distributions in multivariate form, for multiple colours. The binomial
distribution is then a special case, for two colours only, see Example 2.1.1 (2).
A multinomial draw is a draw ‘with replacement’, so that the draw of mul-
tiple, say K, balls can be understood as K consecutive draws from the same
urn. Such draws may also be understood as samples from the distribution, of
size K. This addititive nature of multinomials is formalised as Exercise 3.3.8
below. In different form it occurs in Section 3.5.
When one thinks in terms of drawing from an urn with replacement, the
urn itself never changes. For that reason, the urn is most conveniently — and
abstractly — represented as a distribution, rather than as a multiset. When the
urn is written as ω ∈ D(X), the underlying set/space X is the set of types
(think: colours) of objects in the urn. To sum up, multinomial distributions are
described as a function of the following type.
mn[K]
/ D N[K](X) .
 
D(X) (3.8)

The number K ∈ N represents the number of objects that is drawn. The distri-
bution mn[K](ω) assigns a probability to a K-object draw ϕ ∈ N[K](X). There
is no bound on K, since the idea is that drawn objects are replaced.
Clearly, the above function (3.8) forms a channel mn[K] : D(X) → N[K](X).
In this section we collect some basic properties of these channels. They are
interesting in themselves, since they capture basic relationships between mul-
tisets and distributions, but they will also be useful in the sequel.
For convenience we repeat the definition of the multinomial channel (3.8)
from Example 2.1.1 (2) For a set X and a natural number K ∈ N we define the

144
3.3. The multinomial channel 145

multinomial channel (3.8) on ω ∈ D(X) as:


X Y
ω(x)ϕ(x) ϕ ,

mn[K](ω) B (ϕ) · (3.9)
x
ϕ∈N[K](X)

where, recall, ( ϕ ) is the multinomial coefficient Qi K!


ϕ(xi )! , see Definition 1.5.1 (5).
The Multinomial Theorem (1.26) ensures that the probabilities in mn[K](ω)
add up to one.
The next result describes the fundamental relationships between multino-
mial distributions, accumulation/arrangement (acc / arr) and indedependent
and identical distributions (iid ).

Theorem 3.3.1. 1 Arrangement after a multinomial yields the indedependent


and identical version of the original distribution:

D(X)
mn[K]
◦ / N[K](X)
◦ arr[K]


iid [K] / XK

2 Multinomial channels can be described as composite:

D(X)
mn[K]
◦ / N[K](X)
A

iid [K] . XK ◦
acc[K]

where iid (ω) = ωK = ω ⊗ · · · ⊗ ω ∈ D(X K ).

Proof. 1 For ω ∈ D(X) and ~x = (x1 , . . . , xK ) ∈ X K ,


X
arr ◦· mn[K] (ω)(~x) =

arr(ϕ)(~x) · mn[K](ω)(ϕ)
ϕ∈N[K](X)
1
= · mn[K](ω)(acc(~x))
( acc(~x) )
1 Y
= · ( acc(~x) ) · ω(y)acc(~x)(y)
( acc(~x) ) y
Y
= ω(xi ) = ωK (~x) = iid (ω)(~x).
i

2 A direct consequence of the previous item, using that acc ◦· arr = unit,
see (2.11) in Example 2.2.3.

This theorem has non-trivial consequences, such as naturality of multino-


mial channels and commutation with tensors.

145
146 Chapter 3. Drawing from an urn

Corollary 3.3.2. Multinomial channels are natural: for each function f : X →


Y the following diagram commutes.

D(X)
mn[K]
/ D N[K](X)
D( f ) D(N[K]( f ))
 
D(X)
mn[K]
/ D N[K](X)

Proof. By Theorem 3.3.1 (2) and naturality of acc and of iid , as expressed by
the diagram on the left in Lemma 2.4.8 (1).
D(N[K]( f )) ◦ mn[K] = D(N[K]( f )) ◦ D(acc) ◦ iid
= D(acc) ◦ D( f K ) ◦ iid
= D(acc) ◦ iid ◦ D( f )
= mn[K] ◦ D( f ).
For the next result, recall the multizip operation mzip from Section 3.2. It
may have looked a bit unusual at the time, but the next result demonstrates that
it behaves quite well — as in other such results.
Corollary 3.3.3. Multinomial channels commute with tensor and multizip:
 
mzip = mn[K](ω) ⊗ mn[K](ρ) = mn[K](ω ⊗ ρ).
Diagrammatically this amounts to:

D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
⊗ ◦ mzip
 
D(X × Y)
mn[K]
◦ / N[K](X × Y)

Proof. We give a proof via diagram-chasing. Commutation of the outer dia-


gram below follows from the commuting subparts of the diagram, which arise
by unfolding the definition of mzip in (3.7), below on the right.

D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
◦ arr⊗arr

iid ⊗iid / X × YK
K

◦ zip 

 ◦ mzip

iid / (X × Y)K
◦ acc 
 
D(X × Y)
mn[K]
◦ / N[K](X × Y) o

The three subdiagrams, from top to bottom, commute by Theorem 3.3.1 (1),
by Lemma 2.4.8 (2), and by Theorem 3.3.1 (2).

146
3.3. The multinomial channel 147

Actually computing mn[K](ω ⊗ ρ)(χ) is very fast, but computing the equal
expression:
 
mzip = mn[K](ω) ⊗ mn[K](ρ) (χ)
is much much slower. The reason is that one has to sum over all pairs (ϕ, χ)
that mzip to χ.
We move on to a next fundamental fact, namely that frequentist learning
after a multinomial is the identity, in the theorem below. We first need an aux-
iliary result.
Lemma 3.3.4. Fix a distribution ω ∈ D(X) and a number K. For each y ∈ X,
X
mn[K](ω)(ϕ) · ϕ(y) = K · ω(y).
ϕ∈N[K](X)

Proof. The equation holds for K = 0, since then ϕ(y) = 0. Hence we may
assume K > 0. Then:
X
mn[K](ω)(ϕ) · ϕ(y)
ϕ∈N[K](X)
X K! Y
= ϕ(y) · Q · ω(x)ϕ(x)
x ϕ(x)!
x
ϕ∈N[K](X), ϕ(y),0
X K · (K −1)! Y
= · ω(y) · ω(y)ϕ(y)−1 · ω(x)ϕ(x)
ϕ(x)!
Q
ϕ∈N[K](X), ϕ(y),0
(ϕ(y)−1)! · x,y x,y
X (K − 1)! Y
= K · ω(y) · · ω(x)ϕ(x)
ϕ(x)!
Q
x
ϕ∈N[K−1](X) x
X
= K · ω(y) · mn[K −1](ω)(ϕ)
ϕ∈N[K−1](X)
= K · ω(y).
Theorem 3.3.5. Frequentist learning from a multinomial gives the original
distribution:
Flrn = mn[K](ω) = ω. (3.10)
This means that the following diagram of channels commutes.

D(X)
mn[K]
◦ / N[K](X)

 Flrn

id /X

The identity function D(X) → D(X) is used as channel D(X) → X.


This last result is important when one thinks of draws as samples from the
distribution. It says that what we learn from samples is the same as what we

147
148 Chapter 3. Drawing from an urn

learn from the distribution itself. This is of course an elementary correctness


property for sampling.

Proof. By Lemma 3.3.4:

X
Flrn ◦· mn[K] (ω)(y) =

mn[K](ω)(ϕ) · Flrn(ϕ)(y)
ϕ∈N[K](X)
X ϕ(y)
= mn[K](ω)(ϕ) ·
ϕ∈N[K](X)
kϕk
1
= kϕk · ω(y) ·
kϕk
= ω(y).

Since a multinomial mn[K](ω) is a distribution we can use it as an ab-


stract urn, not containing single balls, but containing multisets of balls (draws).
Hence we can draw from mn[K](ω) as well, giving a distribution on draws of
draws. We show that this can also be done with a single multinomial.

Theorem 3.3.6. Multinomial channels compose, with a bit of help of the (fixed-
size) flatten operation for multisets (1.33), as in:

mn[K]
/ D M[K](X) mn[L]
/ D M[L] M[K](X)
 
D(X)
D(flat)

mn[L·K] . D M[L·K](X)
 

148
3.3. The multinomial channel 149

Proof. For ω ∈ D(X) and ψ ∈ M[L·K](X),


 
D(flat) ◦ mn[L] ◦ mn[K] (ω)(ψ)
X
=

mn[L] mn[K](ω) (Ψ)
Ψ∈M[L](M[K](X)), flat(Ψ)=ψ
X Y Y Ψ(ϕ)
ω(x)ϕ(x)

= (Ψ) · (ϕ ) ·
x
Ψ∈M[L](M[K](X)), flat(Ψ)=ψ ϕ∈supp(Ψ)
X Y Y Y
= (Ψ) · ( ϕ )Ψ(ϕ) · ω(x)ϕ(x)·Ψ(ϕ)
x
Ψ∈M[L](M[K](X)), flat(Ψ)=ψ ϕ∈supp(Ψ) ϕ∈supp(Ψ)
X Y Y
Ψ(ϕ) ϕ(x)·Ψ(ϕ)
= (Ψ) · (ϕ) ω(x)
P
· ϕ∈supp(Ψ)
x
Ψ∈M[L](M[K](X)), flat(Ψ)=ψ ϕ∈supp(Ψ)
X Y Y
= (Ψ) · ( ϕ )Ψ(ϕ) · ω(x)flat(Ψ)(x)
x
Ψ∈M[L](M[K](X)), flat(Ψ)=ψ ϕ∈supp(Ψ)
 
 X Y  Y
=  (Ψ ) · ( ϕ )Ψ(ϕ)  · ω(x)ψ(x)
x
Ψ∈M[L](M[K](X)), flat(Ψ)=ψ ϕ∈supp(Ψ)
Y
ψ(x)
= ( ψ) · ω(x) by Theorem 1.6.5
x
= mn[L·K](ω)(ψ).

We mention another consequence of Lemma 3.3.4; it may be understood as


an ‘average’ or ‘mean’ result for multinomials, see also Definition 4.1.3. We

can describe it via a flatten map flat : M M(X) → M(X), using inclusions
D(X) ,→ M(X) and N[K](X) ,→ M(X).

Proposition 3.3.7. Fix a distribution ω ∈ D(X) and a number K. Each natural


multiset ϕ ∈ N[K](X) can be regarded as an element of the set of all multisets
M(X). Similarly, ω can be understood as an element of M(X). With these
inclusions in mind one has:
X
flat mn[K](ω) = mn[K](ω)(ϕ) · ϕ = K · ω.

ϕ∈N[K](X)

Proof. Since:
 
X X  X 
mn[K](ω)(ϕ) · ϕ = 
 mn[K](ω)(ϕ) · ϕ(x)  x

ϕ∈N[K](X) x∈X ϕ∈N[K](X)
X
= K · ω(x) x by Lemma 3.3.4
x∈X X

= K· ω(x) x
x∈X
= K · ω.

149
150 Chapter 3. Drawing from an urn

Recall the draw-and-delete channel DD : N[K+1](X) → N[K](X) from Definition 2.2.4 for drawing a single ball from a non-empty multiset. It commutes with multinomial channels.

Proposition 3.3.8. The following triangles commute, for K ∈ N:

    DD ◦· mn[K+1] = mn[K] : D(X) → N[K](X).

Proof. For ω ∈ D(X) and ϕ ∈ N[K](X) we have:

    (DD ◦· mn[K+1])(ω)(ϕ)
      = Σ_{ψ∈N[K+1](X)} mn[K+1](ω)(ψ) · DD(ψ)(ϕ)
      = Σ_{x∈X} mn[K+1](ω)(ϕ + 1| x ⟩) · (ϕ(x)+1)/(K+1)
      = Σ_{x∈X} ( (K+1)! / ∏_y (ϕ+1| x ⟩)(y)! ) · ∏_y ω(y)^{(ϕ+1| x ⟩)(y)} · (ϕ(x)+1)/(K+1)
      = Σ_{x∈X} ( K! / ∏_y ϕ(y)! ) · ∏_y ω(y)^{ϕ(y)} · ω(x)
      = ( ϕ ) · ∏_y ω(y)^{ϕ(y)} · Σ_x ω(x)
      = mn[K](ω)(ϕ).
The above triangles exist for each K. This means that the collection of chan-
nels mn[K], indexed by K ∈ N, forms a cone for the infinite chain of draw-
and-delete channels. This situation is further investigated in [85] in relation to
de Finetti’s theorem [37], which is reformulated there in terms of multinomial
channels forming a limit cone.
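The triangle of Proposition 3.3.8 can also be tested by enumeration. A minimal sketch follows; the names mn, DD and push are ad hoc, with push the Kleisli extension of a channel along a distribution:

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial, prod

def mn(K, omega):
    # multinomial channel; distributions as dicts, multisets as sorted tuples
    dist = {}
    for phi in combinations_with_replacement(sorted(omega), K):
        coef = factorial(K)
        for x in set(phi):
            coef //= factorial(phi.count(x))
        dist[phi] = coef * prod(omega[x] ** phi.count(x) for x in set(phi))
    return dist

def DD(chi):
    # draw-and-delete: remove one uniformly drawn ball from the multiset chi
    out = {}
    for x in set(chi):
        rest = list(chi)
        rest.remove(x)
        out[tuple(rest)] = out.get(tuple(rest), 0) + Fraction(chi.count(x), len(chi))
    return out

def push(dist, chan):
    # Kleisli extension: push a distribution forward along a channel
    out = {}
    for a, p in dist.items():
        for b, q in chan(a).items():
            out[b] = out.get(b, 0) + p * q
    return out

omega = {'a': Fraction(1, 4), 'b': Fraction(1, 2), 'c': Fraction(1, 4)}
K = 3
assert push(mn(K + 1, omega), DD) == mn(K, omega)   # DD after mn[K+1] is mn[K]
```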
The multinomial channels do not commute with draw-add channels DA. For instance, for ω = 1/3| a ⟩ + 2/3| b ⟩ one has:

    mn[2](ω) = 1/9| 2| a ⟩ ⟩ + 4/9| 1| a ⟩ + 1| b ⟩ ⟩ + 4/9| 2| b ⟩ ⟩
    mn[3](ω) = 1/27| 3| a ⟩ ⟩ + 2/9| 2| a ⟩ + 1| b ⟩ ⟩ + 4/9| 1| a ⟩ + 2| b ⟩ ⟩ + 8/27| 3| b ⟩ ⟩.

But:

    DA =≪ mn[2](ω) = 1/9| 3| a ⟩ ⟩ + 2/9| 2| a ⟩ + 1| b ⟩ ⟩ + 2/9| 1| a ⟩ + 2| b ⟩ ⟩ + 4/9| 3| b ⟩ ⟩.
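This failure of commutation is easy to confirm mechanically. A sketch, reusing the same ad hoc encoding (distributions as dicts, multisets as sorted tuples):

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial, prod

def mn(K, omega):
    # multinomial channel, as before
    dist = {}
    for phi in combinations_with_replacement(sorted(omega), K):
        coef = factorial(K)
        for x in set(phi):
            coef //= factorial(phi.count(x))
        dist[phi] = coef * prod(omega[x] ** phi.count(x) for x in set(phi))
    return dist

def DA(chi):
    # draw-and-add: draw a ball uniformly and return it with one extra copy
    return {tuple(sorted(chi + (x,))): Fraction(chi.count(x), len(chi))
            for x in set(chi)}

def push(dist, chan):
    # Kleisli extension: push a distribution forward along a channel
    out = {}
    for a, p in dist.items():
        for b, q in chan(a).items():
            out[b] = out.get(b, 0) + p * q
    return out

omega = {'a': Fraction(1, 3), 'b': Fraction(2, 3)}
lhs = push(mn(2, omega), DA)
assert lhs[('a', 'a', 'a')] == Fraction(1, 9)        # DA after mn[2]
assert mn(3, omega)[('a', 'a', 'a')] == Fraction(1, 27)
assert lhs != mn(3, omega)                           # so the square fails
```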

There is one more point that we would like to address. Since mn[K](ω) is a distribution, the sum Σ_ϕ mn[K](ω)(ϕ) over all draws ϕ equals one. But what if we restrict this sum to draws ϕ of certain colours only, that is, with supp(ϕ) ⊆ S, for a proper subset S ⊆ supp(ω)? And what if we then let the size K of these draws go to infinity? The result below describes what happens. It turns out that the same behaviour exists in the hypergeometric and Pólya cases, see Proposition 3.4.9.

Proposition 3.3.9. Let ω ∈ D(X) be given with a proper, non-empty subset S ⊆ supp(ω), so S ≠ ∅ and S ≠ supp(ω). For K ∈ N, write:

    M_K := Σ_{ϕ∈M[K](S)} mn[K](ω)(ϕ).

Then M_K > M_{K+1} and lim_{K→∞} M_K = 0.

Proof. Write r := Σ_{x∈S} ω(x). By the Multinomial Theorem (1.27):

    M_K = Σ_{ϕ∈M[K](S)} mn[K](ω)(ϕ) = Σ_{ϕ∈M[K](S)} ( ϕ ) · ∏_{x∈S} ω(x)^{ϕ(x)} = r^K.

Since 0 < r < 1 we get M_K = r^K > r^{K+1} = M_{K+1} and lim_{K→∞} M_K = lim_{K→∞} r^K = 0.
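The identity M_K = r^K underlying this proof can be confirmed by enumeration; a small sketch with ad hoc names:

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial, prod

def mn(K, omega):
    # multinomial channel; distributions as dicts, multisets as sorted tuples
    dist = {}
    for phi in combinations_with_replacement(sorted(omega), K):
        coef = factorial(K)
        for x in set(phi):
            coef //= factorial(phi.count(x))
        dist[phi] = coef * prod(omega[x] ** phi.count(x) for x in set(phi))
    return dist

omega = {'a': Fraction(1, 2), 'b': Fraction(1, 3), 'c': Fraction(1, 6)}
S = {'a', 'b'}                      # a proper, non-empty subset of the support
r = sum(omega[x] for x in S)        # here r = 5/6
for K in range(1, 7):
    # total mass of the size-K draws that use colours from S only
    MK = sum(p for phi, p in mn(K, omega).items() if set(phi) <= S)
    assert MK == r ** K             # exactly r^K, hence strictly decreasing
```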
K→∞ K→∞

In this section we have assumed that an urn is given, namely as a distribution ω on colours, and we have looked at probabilities of multiple draws ϕ from the urn, as multisets. This will be turned around in the context of learning, see Chapter ??. There we ask, among other things, the question: suppose that ϕ of size K is given, as 'data'; which distribution ω best fits ϕ, in the sense that it maximises the multinomial probability mn[K](ω)(ϕ)? Perhaps unsurprisingly, the answer is: the distribution Flrn(ϕ), obtained from ϕ via frequentist learning, see Proposition ??.

Exercises
3.3.1 Let's throw a fair die 12 times. What is the probability that each number appears exactly twice? Show that it is 12!/72^6.

3.3.2 This exercise illustrates naturality of the multinomial channel, see Corollary 3.3.2. Take sets X = {a, b, c} and Y = 2 = {0, 1} with function f : X → Y given by f(a) = f(b) = 0 and f(c) = 1. Consider the following distribution and multiset:

    ω = 1/4| a ⟩ + 1/2| b ⟩ + 1/4| c ⟩        ψ = 2| 0 ⟩ + 1| 1 ⟩.

1 Show that:

    mn[3](D(f)(ω))(ψ) = 27/64.

2 Check that:

    D(N[3](f))(mn[3](ω))(ψ) = mn[3](ω)(2| a ⟩ + 1| c ⟩) + mn[3](ω)(1| a ⟩ + 1| b ⟩ + 1| c ⟩) + mn[3](ω)(2| b ⟩ + 1| c ⟩)

yields the same outcome 27/64.
3.3.3 Check that:

    mn[2](1/2| a ⟩ + 1/2| b ⟩) = 1/4| 2| a ⟩ ⟩ + 1/2| 1| a ⟩ + 1| b ⟩ ⟩ + 1/4| 2| b ⟩ ⟩.

Conclude that the multinomial map does not preserve uniform distributions.
3.3.4 Check that:

    mn[1](ω) = Σ_{x∈supp(ω)} ω(x) | 1| x ⟩ ⟩.

3.3.5 Use Exercise 1.6.3 to show that for ϕ ∈ N[K](X) and ψ ∈ N[L](X) one has:

    mn[K+L](ω)(ϕ + ψ) = ( \binom{K+L}{K} / \binom{ϕ+ψ}{ϕ} ) · mn[K](ω)(ϕ) · mn[L](ω)(ψ),

where \binom{ϕ+ψ}{ϕ} := ∏_x \binom{ϕ(x)+ψ(x)}{ϕ(x)}, and if ϕ, ψ have disjoint support, then:

    mn[K+L](ω)(ϕ + ψ) = \binom{K+L}{K} · mn[K](ω)(ϕ) · mn[L](ω)(ψ).
3.3.6 The aim of this exercise is to prove recurrence relations for multinomials: for each ω ∈ D(X) and ϕ ∈ N[K](X) with K > 0 one has:

    mn[K](ω)(ϕ) = Σ_{x∈supp(ϕ)} ω(x) · mn[K−1](ω)(ϕ − 1| x ⟩).

Here are two possible avenues:

1 use the recurrence relations (1.25) for multiset coefficients ( ϕ );
2 show first:

    ω(x) · mn[K−1](ω)(ϕ − 1| x ⟩) = Flrn(ϕ)(x) · mn[K](ω)(ϕ).

Follow both avenues.


3.3.7 Prove the following items, in the style of Lemma 3.3.4, for a distribution ω ∈ D(X) and a number K > 1.

1 For two elements y ≠ z in X,

    Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y) · ϕ(z) = K · (K−1) · ω(y) · ω(z).

2 For a single element y ∈ X,

    Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y) · (ϕ(y) − 1) = K · (K−1) · ω(y)².

3 Again, for a single element y ∈ X,

    Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y)² = K · (K−1) · ω(y)² + K · ω(y).

Hint: Write a² = a · (a−1) + a and then use the previous item and Lemma 3.3.4.
3.3.8 1 Prove the following multivariate generalisation of Exercise 2.4.3, demonstrating that not only binomials but, more generally, multinomials are closed under convolution (2.17): for K, L ∈ N,

    + ◦· (mn[K] ⊗ mn[L]) ◦· ∆ = mn[K+L] : D(X) → N[K+L](X),

with copier ∆ : D(X) → D(X) × D(X). On the right, the sum + : N[K](X) × N[L](X) → N[K+L](X) is the usual sum of multisets, promoted to a channel. Hint: Use Exercises 1.6.5 and 2.5.7.

2 Conclude that K-sized draws can be reduced to parallel single draws, as in:

    + ◦· mn[1]^K ◦· ∆ = mn[K] : D(X) → N[K](X),

with K-fold copier ∆ : D(X) → D(X)^K and K-fold tensor mn[1]^K = mn[1] ⊗ ⋯ ⊗ mn[1]. Notice that this is essentially Theorem 3.3.1 (2), via the isomorphism M[1](X) ≅ X.
3.3.9 Show that for a natural multiset ϕ of size K one has:

    flat(mn[K](Flrn(ϕ))) = ϕ.

3.3.10 Check that multinomial channels do not commute with tensors: in general,

    mn[K·L] ◦· ⊗ ≠ ⊗ ◦· (mn[K] ⊗ mn[L]),

as channels D(X) × D(Y) → M[K·L](X × Y), where on the right ⊗ : M[K](X) × M[L](Y) → M[K·L](X × Y) is the tensor of multisets.


3.3.11 Check that the following diagram does not commute: in general,

    + ◦· (DD ⊗ DD) ≠ DD ◦· DD ◦· +,

as channels N[K+1](X) × N[L+1](X) → N[K+L](X).

3.3.12 Check that the multiset distribution mltdst[K]_X from Exercise 2.1.7 equals:

    mltdst[K]_X = mn[K](unif_X).
3.3.13 1 Check that accumulation does not commute with draw-and-delete: for the projection π : X^{K+1} → X^K that drops the last element, in general,

    DD ◦· acc ≠ acc ◦· π : X^{K+1} → N[K](X).

2 Now define the probabilistic projection channel ppr : X^{K+1} → X^K by:

    ppr(x_1, …, x_{K+1}) := Σ_{1≤k≤K+1} 1/(K+1) | x_1, …, x_{k−1}, x_{k+1}, …, x_{K+1} ⟩.    (3.11)

Show that with ppr instead of π we do get a commuting diagram:

    DD ◦· acc = acc ◦· ppr : X^{K+1} → N[K](X).

3 Conclude that we could have introduced DD via the definition principle of Proposition 3.1.1, as the unique channel with:

    DD ◦ acc = D(acc) ◦ ppr : X^{K+1} → D(N[K](X)),

along the surjection acc : X^{K+1} ↠ N[K+1](X).

4 Show that probabilistic projection also commutes with arrangement:

    ppr ◦· arr = arr ◦· DD : N[K+1](X) → X^K.


3.3.14 Show that the probabilistic projection channel ppr from the previous exercise commutes with the tensor channel ⊗ : D(X)^{K+1} → X^{K+1}, in the sense that:

    ppr ◦· ⊗ = ⊗ ◦· ppr : D(X)^{K+1} → X^K.

3.3.15 1 Prove that ppr ◦· arr = π ◦· arr : N[K+1](X) → X^K.

2 Use this result to show that:

    ppr ◦· zip ◦· (arr ⊗ arr) = π ◦· zip ◦· (arr ⊗ arr),

as channels N[K+1](X) × N[K+1](Y) → (X × Y)^K. Hint: Use Proposition 3.2.2 (4).

3 Use the last item, (3.7), and Exercises 2.2.6 and 3.3.13 (3), to show that multizip commutes with draw-and-delete:

    DD ◦· mzip = mzip ◦· (DD ⊗ DD) : N[K+1](X) × N[K+1](Y) → N[K](X × Y).

3.3.16 Recall the set D∞(X) of discrete distributions from (2.8), with infinite (countable) support, see Exercise 2.1.11. For ρ ∈ D∞(N_{>0}) and K ∈ N define:

    mn_∞[K](ρ) := Σ_{ϕ∈N[K](N_{>0})} ( ϕ ) · ∏_{i∈supp(ϕ)} ρ(i)^{ϕ(i)} | ϕ ⟩.

1 Show that this yields a distribution in D∞(N[K](N_{>0})), i.e. that:

    Σ_{ϕ∈N[K](N_{>0})} mn_∞[K](ρ)(ϕ) = 1.

2 Check that DD ◦· mn_∞[K+1] = mn_∞[K], like in Proposition 3.3.8.


3.4 The hypergeometric and Pólya channels


We have seen that a multinomial channel assigns probabilities to draws, with replacement. In addition, we have distinguished two draws without replacement, namely the "-1" hypergeometric version, where a drawn ball is actually removed from the urn, and the "+1" Pólya version, where the drawn ball is returned to the urn together with an additional ball of the same colour. The additional ball has a strengthening effect that can capture situations with a cluster dynamics, like in the spread of contagious diseases [62] or the flow of tourists [108]. This section describes the main properties of these "-1" and "+1" draws. We have already briefly seen hypergeometric and Pólya distributions, namely in Example 2.1.1, in items (3) and (4).

Since drawn balls are removed or added in the hypergeometric and Pólya modes, the urn in question changes with each draw. It is thus not a distribution, like in the multinomial case, but a multiset, say of size L ∈ N. Draws are described as multisets of size K, with the restriction K ≤ L in hypergeometric mode. The hypergeometric and Pólya channels thus take the form:

    hg[K], pl[K] : N[L](X) → D(N[K](X)).    (3.12)

We first repeat from Example 2.1.1 (3) the definition of the hypergeometric channel. For a set X and natural numbers K ≤ L we have, for a multiset/urn ψ ∈ N[L](X),

    hg[K](ψ) := Σ_{ϕ≤_K ψ} ( \binom{ψ}{ϕ} / \binom{L}{K} ) | ϕ ⟩ = Σ_{ϕ≤_K ψ} ( ∏_x \binom{ψ(x)}{ϕ(x)} / \binom{L}{K} ) | ϕ ⟩.    (3.13)

Recall that we write ϕ ≤_K ψ for: ∥ϕ∥ = K and ϕ ≤ ψ, see Definition 1.5.1 (2). Lemma 1.6.2 shows that the probabilities in the above definition add up to one.

The Pólya channel resembles the above hypergeometric one, except that multichoose coefficients ((m; n)) := \binom{m+n−1}{n} are used instead of ordinary binomial coefficients, again applied pointwise, as ((ψ; ϕ)) := ∏_{x∈supp(ψ)} ((ψ(x); ϕ(x))):

    pl[K](ψ) := Σ_{ϕ∈N[K](supp(ψ))} ( ((ψ; ϕ)) / ((L; K)) ) | ϕ ⟩ = Σ_{ϕ∈N[K](supp(ψ))} ( ∏_{x∈supp(ψ)} ((ψ(x); ϕ(x))) / ((L; K)) ) | ϕ ⟩.    (3.14)

This yields a proper distribution by Proposition 1.7.3. In Section 3.5 we shall see how this distribution arises from iterated drawing and addition.
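Both channels are straightforward to implement and test on small urns. The sketch below (the names hg, pl, multichoose are ad hoc; urns are dicts colour -> positive count, draws are sorted tuples) checks that the probabilities sum to one and reproduces the numbers of Exercise 3.4.1 below:

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb, prod

def hg(K, urn):
    # hypergeometric channel (3.13); math.comb(n, k) is 0 for k > n, which
    # silently enforces the requirement phi <= urn
    L = sum(urn.values())
    dist = {}
    for phi in combinations_with_replacement(sorted(urn), K):
        num = prod(comb(urn[x], phi.count(x)) for x in urn)
        if num:
            dist[phi] = Fraction(num, comb(L, K))
    return dist

def multichoose(m, n):
    return comb(m + n - 1, n)

def pl(K, urn):
    # Polya channel (3.14), with multichoose instead of binomial coefficients
    L = sum(urn.values())
    dist = {}
    for phi in combinations_with_replacement(sorted(urn), K):
        num = prod(multichoose(urn[x], phi.count(x)) for x in urn)
        dist[phi] = Fraction(num, multichoose(L, K))
    return dist

urn = {'r': 5, 'g': 6, 'b': 8}          # 19 balls in three colours
assert sum(hg(6, urn).values()) == 1
assert sum(pl(6, urn).values()) == 1
draw = ('b', 'b', 'g', 'g', 'r', 'r')   # two balls of each colour
assert hg(6, urn)[draw] == Fraction(50, 323)
assert pl(6, urn)[draw] == Fraction(405, 4807)
```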


Pólya distributions have their own dynamics, different from hypergeomet-


ric distributions. One commonly refers to this well-studied approach as Pólya
urns, see e.g. [115]. This section introduces the basics of such urns and the next
section shows how they come about via iteration of a transition map, followed
by marginalisation. Later on in Section ??, within the continuous context, the
Pólya channel shows up again, in relation to Dirichlet-multinomials and de
Finetti’s Theorem.
A subtle point in the above definitions (3.13) and (3.14) is that the draws ϕ should be multisets over the support supp(ψ) of the urn ψ. Indeed, only balls that occur in the urn can be drawn. In the hypergeometric case this is handled via the requirement ϕ ≤_K ψ, which ensures an inclusion supp(ϕ) ⊆ supp(ψ). In the Pólya case we ensure this inclusion by requiring that draws ϕ are in N[K](supp(ψ)). Notice also that the product in (3.14) is restricted to range over x ∈ supp(ψ), so that ψ(x) > 0. This is needed since the multichoose coefficient ((m; n)) is defined only for m > 0, see Definition 1.7.1 (2).

We start our analysis with hypergeometric channels. A basic fact is that they can be described as iterations of draw-and-delete channels from Definition 2.2.4. As we shall see soon afterwards, this result has many useful consequences.

Theorem 3.4.1. For L, K ∈ N, the hypergeometric channel hg[K] : N[K+L](X) → N[K](X) equals L consecutive draw-and-delete's:

    hg[K] = DD ◦· ⋯ ◦· DD (L times) = DD^L : N[K+L](X) → N[K](X).

To emphasise, this result says that the draws can be obtained via iterated
draw-and-delete. The full picture emerges in Theorem 3.5.10 where we include
the urn.

Proof. Write ψ ∈ N[K+L](X) for the urn. The proof proceeds by induction on the number of iterations L, starting with L = 0. Then ϕ ≤_K ψ means ϕ = ψ. Hence:

    hg[K](ψ) = Σ_{ϕ≤_K ψ} ( \binom{ψ}{ϕ} / \binom{K+0}{K} ) | ϕ ⟩ = ( \binom{ψ}{ψ} / \binom{K}{K} ) | ψ ⟩ = 1| ψ ⟩ = DD^0(ψ) = DD^L(ψ).

For the induction step we use ψ ∈ N[K+(L+1)](X) = N[(K+1)+L](X) and ϕ ≤_K ψ. Then:

    DD^{L+1}(ψ)(ϕ) = (DD ◦· DD^L)(ψ)(ϕ)
      = Σ_{χ∈N[K+1](X)} DD^L(ψ)(χ) · DD(χ)(ϕ)
      = Σ_{y∈X} DD^L(ψ)(ϕ + 1| y ⟩) · (ϕ(y)+1)/(K+1)
      (IH) = Σ_{y, ϕ(y)<ψ(y)} ( \binom{ψ}{ϕ+1| y ⟩} / \binom{K+L+1}{K+1} ) · (ϕ(y)+1)/(K+1)
      (∗) = Σ_{y, ϕ(y)<ψ(y)} ( (ψ(y)−ϕ(y)) · \binom{ψ}{ϕ} ) / ( (L+1) · \binom{K+L+1}{K} )    by Exercise 1.7.6 (1),(2)
      = ( ((K+L+1)−K) · \binom{ψ}{ϕ} ) / ( (L+1) · \binom{K+L+1}{K} )
      = \binom{ψ}{ϕ} / \binom{K+L+1}{K}
      = hg[K](ψ)(ϕ).
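Theorem 3.4.1 can be confirmed by actually iterating a draw-and-delete channel; a sketch with the same ad hoc encoding as before (urns as dicts or sorted tuples, push the Kleisli extension):

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb, prod

def hg(K, urn):
    # hypergeometric channel (3.13); urn is a dict colour -> count
    L = sum(urn.values())
    dist = {}
    for phi in combinations_with_replacement(sorted(urn), K):
        num = prod(comb(urn[x], phi.count(x)) for x in urn)
        if num:                       # zero precisely when phi is not below urn
            dist[phi] = Fraction(num, comb(L, K))
    return dist

def DD(chi):
    # draw-and-delete on a multiset encoded as a sorted tuple
    out = {}
    for x in set(chi):
        rest = list(chi)
        rest.remove(x)
        out[tuple(rest)] = out.get(tuple(rest), 0) + Fraction(chi.count(x), len(chi))
    return out

def push(dist, chan):
    out = {}
    for a, p in dist.items():
        for b, q in chan(a).items():
            out[b] = out.get(b, 0) + p * q
    return out

urn = {'a': 3, 'b': 2, 'c': 2}        # seven balls: K = 3 plus L = 4
psi = tuple(sorted(x for x, n in urn.items() for _ in range(n)))
dist = {psi: Fraction(1)}
for _ in range(4):                    # four consecutive draw-and-deletes
    dist = push(dist, DD)
assert dist == hg(3, urn)             # equals the hypergeometric channel
```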

From this result we can deduce many additional facts about hypergeometric
distributions.

Corollary 3.4.2. 1 Hypergeometric channels (3.12) are natural in X.

2 Frequentist learning from hypergeometric draws is like learning from the urn:

    Flrn ◦· hg[K] = Flrn : N[L](X) → X.

3 For L ≥ K one has:

    hg[K] ◦· DD = hg[K] : N[L+1](X) → N[K](X).

4 Also:

    DD ◦· hg[K+1] = hg[K] : N[L](X) → N[K](X).

5 Hypergeometric channels compose, as in:

    hg[K] ◦· hg[K+L] = hg[K] : N[K+L+M](X) → N[K](X).

6 Hypergeometric and multinomial channels commute, as in:

    hg[K] ◦· mn[K+L] = mn[K] : D(X) → N[K](X).

7 Hypergeometric channels commute with multizip:

    hg[K] ◦· mzip = mzip ◦· (hg[K] ⊗ hg[K]) : N[K+L](X) × N[K+L](Y) → N[K](X × Y).

Proof. 1 By naturality of draw-and-delete, see Exercise 2.2.8.
2 By Theorem 2.3.3.
3 By Theorem 3.4.1.
4 Idem.
5 Similarly, since DD^{L+M} = DD^L ◦· DD^M.
6 By Proposition 3.3.8.
7 By Exercise 3.3.15 (3).

The distribution of small hypergeometric draws from a large urn looks like
a multinomial distribution. This is intuitively clear, but will be made precise
below.

Remark 3.4.3. When the urn from which we draw in hypergeometric mode is very large and the draw involves only a small number of balls, the withdrawals do not really affect the urn. Hence in this case the hypergeometric distribution behaves like a multinomial distribution, where the urn (as distribution) is obtained via frequentist learning. This is elaborated below, where the urn ψ, of size L, is very large in comparison to the draw ϕ ≤_K ψ:

    hg[K](ψ)(ϕ) = \binom{ψ}{ϕ} / \binom{L}{K}
      = ( K!/∏_x ϕ(x)! ) · ( (L−K)!/L! ) · ∏_x ψ(x)!/(ψ(x)−ϕ(x))!
      = ( ϕ ) · ( ∏_x ψ(x)·(ψ(x)−1) · ⋯ · (ψ(x)−ϕ(x)+1) ) / ( L·(L−1) · ⋯ · (L−K+1) )
      ≈ ( ϕ ) · ∏_x ψ(x)^{ϕ(x)} / L^K        for big ψ
      = ( ϕ ) · ∏_x ( ψ(x)/L )^{ϕ(x)}
      = ( ϕ ) · ∏_x Flrn(ψ)(x)^{ϕ(x)}
      = mn[K](Flrn(ψ))(ϕ).

The essence of this observation is in Lemma 1.3.8.
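The quality of this approximation is easy to observe numerically. A sketch (ad hoc names, exact rationals) with urns of growing size but fixed colour proportions:

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb, factorial, prod

def hg(K, urn):
    # hypergeometric channel; urn is a dict colour -> count
    L = sum(urn.values())
    dist = {}
    for phi in combinations_with_replacement(sorted(urn), K):
        num = prod(comb(urn[x], phi.count(x)) for x in urn)
        if num:
            dist[phi] = Fraction(num, comb(L, K))
    return dist

def mn(K, omega):
    # multinomial channel on a distribution given as a dict x -> probability
    dist = {}
    for phi in combinations_with_replacement(sorted(omega), K):
        coef = factorial(K)
        for x in set(phi):
            coef //= factorial(phi.count(x))
        dist[phi] = coef * prod(omega[x] ** phi.count(x) for x in set(phi))
    return dist

K = 3
draw = ('a', 'a', 'b')
for N in (10, 100, 10_000):
    urn = {'a': 2 * N, 'b': N}                  # fixed proportions 2 : 1
    flrn = {x: Fraction(n, 3 * N) for x, n in urn.items()}
    diff = abs(hg(K, urn)[draw] - mn(K, flrn)[draw])
    assert diff < Fraction(1, N)                # hg[K](psi) -> mn[K](Flrn(psi))
```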
We continue with Pólya distributions. Since hypergeometric draw channels arise from repeated draw-and-delete's, one may expect that Pólya draw channels arise from repeated draw-and-add's. This is not the case. The connection will be clarified further in the next section. At this stage our first point below is that frequentist learning from its draws is the same as learning from the urn. We shall first prove the following analogues of Lemma 3.3.4.
Lemma 3.4.4. For a non-empty urn ψ ∈ N(X) of size L := ∥ψ∥ and a fixed element y ∈ X,

    Σ_{ϕ≤_K ψ} hg[K](ψ)(ϕ) · ϕ(y) = K · Flrn(ψ)(y),    (3.15)

and also:

    Σ_{ϕ∈M[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ(y) = K · Flrn(ψ)(y).    (3.16)

Proof. We use Exercises 1.7.5 and 1.7.6 in the equations marked (∗) below. We start with equation (3.15), for which we assume K ≤ L:

    Σ_{ϕ≤_K ψ} hg[K](ψ)(ϕ) · ϕ(y) = Σ_{ϕ≤_K ψ} \binom{ψ}{ϕ} · ϕ(y) / \binom{L}{K}
      (∗) = Σ_{ϕ≤_K ψ} ψ(y) · \binom{ψ−1| y ⟩}{ϕ−1| y ⟩} / ( (L/K) · \binom{L−1}{K−1} )
      = K · (ψ(y)/L) · Σ_{ϕ′ ≤_{K−1} ψ−1| y ⟩} \binom{ψ−1| y ⟩}{ϕ′} / \binom{L−1}{K−1}
      = K · Flrn(ψ)(y).

Similarly we obtain (3.16), via the additional use of Exercise 1.7.5, for the pointwise multichoose coefficients ((−; −)) of (3.14):

    Σ_{ϕ∈M[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ(y) = Σ_{ϕ} ((ψ; ϕ)) · ϕ(y) / ((L; K))
      (∗) = Σ_{ϕ} ψ(y) · ((ψ+1| y ⟩; ϕ−1| y ⟩)) / ( (L/K) · ((L+1; K−1)) )
      = K · (ψ(y)/L) · Σ_{ϕ′∈M[K−1](supp(ψ+1| y ⟩))} ((ψ+1| y ⟩; ϕ′)) / ((L+1; K−1))
      = K · Flrn(ψ)(y).

Proposition 3.4.5. Let K > 0.

1 Frequentist learning and Pólya satisfy the following equation:

    Flrn ◦· pl[K] = Flrn : N∗(X) → X.

2 Doing a draw-and-add before Pólya has no effect: for L > 0,

    pl[K] ◦· DA = pl[K] : N[L](X) → N[K](X).

This second point is the Pólya analogue of Corollary 3.4.2 (3), with draw-and-add instead of draw-and-delete.

Proof. 1 For a multiset/urn ψ ∈ N(X) with ∥ψ∥ = L > 0, and for x ∈ X,

    (Flrn ◦· pl[K])(ψ)(x) = Σ_{ϕ∈N[K](supp(ψ))} Flrn(ϕ)(x) · pl[K](ψ)(ϕ)
      = (1/K) · Σ_{ϕ∈N[K](supp(ψ))} ϕ(x) · pl[K](ψ)(ϕ)
      (3.16) = Flrn(ψ)(x).

2 We use Exercises 1.7.5 and 1.7.6 in the equation marked (∗) below, for the pointwise multichoose coefficients ((−; −)) of (3.14). For an urn ψ ∈ N[L](X) and a draw ϕ ∈ N[K](X),

    (pl[K] ◦· DA)(ψ)(ϕ) = Σ_{x∈supp(ψ)} (ψ(x)/L) · pl[K](ψ + 1| x ⟩)(ϕ)
      = Σ_{x∈supp(ψ)} (ψ(x)/L) · ((ψ+1| x ⟩; ϕ)) / ((L+1; K))
      (∗) = Σ_{x∈supp(ψ)} ((ψ(x)+ϕ(x))/(L+K)) · ((ψ; ϕ)) / ((L; K))
      = ((ψ; ϕ)) / ((L; K))
      = pl[K](ψ)(ϕ).
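Point 2 of Proposition 3.4.5 can be tested directly. A sketch with urns encoded as sorted tuples (the names pl, DA, push are ad hoc):

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb, prod

def multichoose(m, n):
    return comb(m + n - 1, n)

def pl(K, psi):
    # Polya channel on an urn given as a sorted tuple of balls
    counts = Counter(psi)
    L = len(psi)
    dist = {}
    for phi in combinations_with_replacement(sorted(counts), K):
        num = prod(multichoose(counts[x], phi.count(x)) for x in counts)
        dist[phi] = Fraction(num, multichoose(L, K))
    return dist

def DA(psi):
    # draw-and-add: draw a ball uniformly, return it with one extra copy
    return {tuple(sorted(psi + (x,))): Fraction(psi.count(x), len(psi))
            for x in set(psi)}

def push(dist, chan):
    # Kleisli extension: push a distribution forward along a channel
    out = {}
    for a, p in dist.items():
        for b, q in chan(a).items():
            out[b] = out.get(b, 0) + p * q
    return out

psi = ('a', 'a', 'a', 'b')
# pl[2] after DA equals pl[2] on the original urn
assert push(DA(psi), lambda chi: pl(2, chi)) == pl(2, psi)
```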

Next we look at the interaction of the Pólya channel with draw-and-delete, like in Corollary 3.4.2 (4), and then also with hypergeometric channels. This illustrates a further relationship between Pólya and the single-draw channels from Definition 2.2.4.

Theorem 3.4.6. For each K and L ≥ K the triangles below commute:

    DD ◦· pl[K+1] = pl[K]    and    hg[K] ◦· pl[L] = pl[K],

as channels N∗(X) → N[K](X).

Proof. We concentrate on commutation of the triangle on the left, since it implies commutation of the one on the right by Theorem 3.4.1. The equation marked (∗) below uses Exercises 1.7.5 and 1.7.6, for the pointwise multichoose coefficients ((−; −)) of (3.14). For an urn ψ ∈ N[L](X) of size L > 0, and for ϕ ∈ N[K](X),

    (DD ◦· pl[K+1])(ψ)(ϕ) = Σ_{χ∈N[K+1](supp(ψ))} DD(χ)(ϕ) · pl[K+1](ψ)(χ)
      = Σ_{x∈supp(ψ)} ((ϕ(x)+1)/(K+1)) · pl[K+1](ψ)(ϕ + 1| x ⟩)
      = Σ_{x∈supp(ψ)} ((ϕ(x)+1)/(K+1)) · ((ψ; ϕ+1| x ⟩)) / ((L; K+1))
      (∗) = Σ_{x∈supp(ψ)} ((ψ(x)+ϕ(x))/(L+K)) · ((ψ; ϕ)) / ((L; K))
      = pl[K](ψ)(ϕ).


Later on we shall see that the Pólya channel factors through the multinomial
channel, see Exercise 6.4.5 and ??. The Pólya channel does not commute with
multizip.
Earlier in Proposition 3.3.7 we have seen an ‘average’ result for multinomi-
als. There are similar results for hypergeometric and Pólya distributions.

Proposition 3.4.7. Let ψ ∈ N(X) be an urn of size ∥ψ∥ = L ≥ 1.

1 For L ≥ K ≥ 1,

    flat(hg[K](ψ)) = Σ_{ϕ≤_K ψ} hg[K](ψ)(ϕ) · ϕ = (K/L) · ψ = K · Flrn(ψ).

2 Similarly, for K ≥ 1,

    flat(pl[K](ψ)) = Σ_{ϕ∈N[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ = (K/L) · ψ = K · Flrn(ψ).

Proof. In both cases we rely on Lemma 3.4.4. For the first item we get:

    Σ_{ϕ≤_K ψ} hg[K](ψ)(ϕ) · ϕ = Σ_{x∈X} ( Σ_{ϕ≤_K ψ} hg[K](ψ)(ϕ) · ϕ(x) ) | x ⟩
      (3.15) = Σ_{x∈X} K · Flrn(ψ)(x) | x ⟩
      = K · Flrn(ψ).

In the second, Pólya case we proceed analogously:

    Σ_{ϕ∈N[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ = Σ_{x∈X} ( Σ_{ϕ∈N[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ(x) ) | x ⟩
      (3.16) = Σ_{x∈X} K · Flrn(ψ)(x) | x ⟩
      = K · Flrn(ψ).

There is another point of analogy with multinomial distributions that we wish to elaborate, namely the (limit) behaviour when balls of specific colours only are drawn, see Proposition 3.3.9. In the hypergeometric case it does not make sense to look at limit behaviour, since after a certain number of steps the urn is empty. But in the Pólya case one can continue drawing indefinitely, so that it is not trivial what happens in the limit. We use the following auxiliary result¹.

¹ With thanks to Bas and Bram Westerbaan for help.


Lemma 3.4.8. Let 0 < N < M be given. Define, for n ∈ N,

    a_n := ∏_{i<n} (N+i)/(M+i).

Then: lim_{n→∞} a_n = 0.

Proof. We switch to the (natural) logarithm ln and prove the equivalent statement lim_{n→∞} ln(a_n) = −∞. We use that the logarithm turns products into sums, see Exercise 1.2.2, and that the derivative of ln(x) is 1/x. Then:

    ln(a_n) = Σ_{i<n} ln( (N+i)/(M+i) ) = Σ_{i<n} ( ln(N+i) − ln(M+i) )
      = − Σ_{i<n} ∫_{N+i}^{M+i} (1/x) dx
      (∗) ≤ − Σ_{i<n} ( (M+i)−(N+i) ) / (M+i)
      = (N−M) · Σ_{i<n} 1/(M+i).

It is well known that the harmonic series Σ_{n>0} 1/n is infinite. Since M > N, the above sequence ln(a_n) thus goes to −∞.

The validity of the marked inequality (∗) follows from an inspection of the graph of the function 1/x: the integral from N+i to M+i is the area under 1/x between the points N+i < M+i. Since 1/x is a decreasing function, this area is bigger than the rectangle with height 1/(M+i) and width (M+i) − (N+i).

Proposition 3.4.9. Consider an urn ψ ∈ N[L](X) with a proper non-empty subset S ⊆ supp(ψ).

1 Write, for K ≤ L,

    H_K := Σ_{ϕ∈N[K](S), ϕ≤ψ} hg[K](ψ)(ϕ).

Then H_K > H_{K+1}; this stops at K = L, when the urn is empty.

2 Now write, for arbitrary K ∈ N,

    P_K := Σ_{ϕ∈N[K](S)} pl[K](ψ)(ϕ).

Then P_K > P_{K+1} and lim_{K→∞} P_K = 0.

Proof. We write L_S := Σ_{x∈S} ψ(x) for the number of balls in the urn whose colour is in S.

1 By separating S and its complement ¬S we can write, via Vandermonde's formula from Lemma 1.6.2:

    H_K = Σ_{ϕ∈N[K](S), ϕ≤ψ} ( ∏_{x∈S} \binom{ψ(x)}{ϕ(x)} · ∏_{x∉S} \binom{ψ(x)}{0} ) / \binom{L}{K}
        = \binom{L_S}{K} / \binom{L}{K} = (L_S!/L!) · ((L−K)!/(L_S−K)!).

Using a similar description for H_{K+1} we get:

    H_K > H_{K+1} ⟺ (L−K)!/(L_S−K)! > (L−(K+1))!/(L_S−(K+1))! ⟺ (L−K)/(L_S−K) > 1.

The latter holds because L > L_S, since S is a proper subset of supp(ψ).

2 In the Pólya case we get, as before, but now via the Vandermonde formula for multichoose, see Proposition 1.7.3, with ((m; n)) = \binom{m+n−1}{n}:

    P_K = ((L_S; K)) / ((L; K)) = ((L−1)!/(L_S−1)!) · ((L_S+K−1)!/(L+K−1)!).

We define:

    a_K := P_{K+1}/P_K = (L_S+K)/(L+K) < 1,    since L_S < L.

Thus P_{K+1} = a_K · P_K < P_K, and also:

    P_K = a_{K−1} · a_{K−2} · ⋯ · a_0 · P_0.

Our goal is to prove lim_{K→∞} P_K = 0. This follows from lim_{K→∞} ∏_{i<K} a_i = 0, which we obtain from Lemma 3.4.8, with N = L_S and M = L.
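The closed form P_K = ((L_S; K))/((L; K)) from the proof, and the strict decrease, can be checked on a small urn. A sketch with ad hoc names:

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb, prod

def multichoose(m, n):
    return comb(m + n - 1, n)

def pl(K, urn):
    # Polya channel; urn is a dict colour -> positive count
    L = sum(urn.values())
    dist = {}
    for phi in combinations_with_replacement(sorted(urn), K):
        num = prod(multichoose(urn[x], phi.count(x)) for x in urn)
        dist[phi] = Fraction(num, multichoose(L, K))
    return dist

urn = {'a': 2, 'b': 1, 'c': 1}   # L = 4
S = {'a', 'b'}                   # L_S = 3
LS, L = 3, 4
prev = Fraction(1)               # P_0 = 1
for K in range(1, 8):
    PK = sum(p for phi, p in pl(K, urn).items() if set(phi) <= S)
    # closed form from the proof, via Vandermonde for multichoose
    assert PK == Fraction(multichoose(LS, K), multichoose(L, K))
    assert PK < prev             # strictly decreasing in K
    prev = PK
```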

Exercises
3.4.1 Consider an urn with 5 red, 6 green, and 8 blue balls. Suppose 6 balls are drawn from the urn, resulting in two balls of each colour.

1 Describe both the urn and the draw as a multiset.

Show that the probability of the six-ball draw is:

2 5184000/47045881 ≈ 0.11, when balls are drawn one by one and are replaced before each next single-ball draw;
3 50/323 ≈ 0.15, when the drawn balls are deleted;
4 405/4807 ≈ 0.08, when the drawn balls are replaced and each time an extra ball of the same colour is added.
3.4.2 Draw-and-delete preserves hypergeometric and Pólya distributions, see Corollary 3.4.2 (4) and Theorem 3.4.6. Check that draw-and-add does not preserve hypergeometric or Pólya distributions, for instance by checking that for ϕ = 3| a ⟩ + 1| b ⟩,

    hg[2](ϕ) = 1/2| 2| a ⟩ ⟩ + 1/2| 1| a ⟩ + 1| b ⟩ ⟩
    hg[3](ϕ) = 1/4| 3| a ⟩ ⟩ + 3/4| 2| a ⟩ + 1| b ⟩ ⟩
    DA =≪ hg[2](ϕ) = 1/2| 3| a ⟩ ⟩ + 1/4| 2| a ⟩ + 1| b ⟩ ⟩ + 1/4| 1| a ⟩ + 2| b ⟩ ⟩,

and:

    pl[2](ϕ) = 3/5| 2| a ⟩ ⟩ + 3/10| 1| a ⟩ + 1| b ⟩ ⟩ + 1/10| 2| b ⟩ ⟩
    pl[3](ϕ) = 1/2| 3| a ⟩ ⟩ + 3/10| 2| a ⟩ + 1| b ⟩ ⟩ + 3/20| 1| a ⟩ + 2| b ⟩ ⟩ + 1/20| 3| b ⟩ ⟩
    DA =≪ pl[2](ϕ) = 3/5| 3| a ⟩ ⟩ + 3/20| 2| a ⟩ + 1| b ⟩ ⟩ + 3/20| 1| a ⟩ + 2| b ⟩ ⟩ + 1/10| 3| b ⟩ ⟩.

3.4.3 Let X be a finite set, say of size n. Write 1 = Σ_{x∈X} 1| x ⟩ for the multiset of single occurrences of elements of X. Show that for each K ≥ 0,

    pl[K](1) = unif_{N[K](X)}.

3.4.4 1 Give a direct proof of Theorem 3.4.1 for L = 1.
2 Elaborate also the case K = 1 in Theorem 3.4.1 and show that in that case DD^L(ψ) = Flrn(ψ), at least when we identify the (isomorphic) sets N[1](X) and X.

3.4.5 Let unif L ∈ D N[L](X) be the uniform distribution on multisets of
size L, from Exercise 2.2.9, where X is a finite set. Show that hg[K] =≪ unif_L = unif_K, for K ≤ L.
3.4.6 Recall from Remark 3.4.3 that small hypergeometric draws from a large urn can be described multinomially. Extend this in the following way. Let σ ∈ D(N[L](X)) be a distribution over large urns. Argue that for small numbers K,

    hg[K] =≪ σ ≈ mn[K] =≪ D(Flrn)(σ).

3.4.7 Prove, in analogy with Exercise 3.3.7, for an urn ψ ∈ N[L](X) and elements y, z ∈ X, the following points.

1 When y ≠ z and K ≤ L,

    Σ_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y) · ϕ(z) = K · (K−1) · Flrn(ψ)(y) · ψ(z)/(L−1).

2 When K ≤ L,

    Σ_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y)² = K · Flrn(ψ)(y) · ( (K−1) · ψ(y) + (L−K) ) / (L−1).

3 And in the Pólya case, when y ≠ z,

    Σ_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y) · ϕ(z) = K · (K−1) · Flrn(ψ)(y) · ψ(z)/(L+1).

4 Finally,

    Σ_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y)² = K · Flrn(ψ)(y) · ( (K−1) · ψ(y) + (L+K) ) / (L+1).
3.4.8 Use Theorem 3.4.1 to prove the following two recurrence relations for hypergeometric distributions:

    hg[K](ψ)(ϕ) = Σ_{x∈supp(ϕ)} Flrn(ψ)(x) · hg[K−1](ψ − 1| x ⟩)(ϕ − 1| x ⟩)

    hg[K](ψ)(ϕ) = Σ_x ( (ϕ(x)+1)/(K+1) ) · hg[K+1](ψ)(ϕ + 1| x ⟩).
3.4.9 Fix numbers N, M ∈ N and write ψ = N| 0 ⟩ + M| 1 ⟩ for an urn with N balls of colour 0 and M of colour 1. Let n ≤ N. Show that:

    Σ_{0≤m≤M} hg[n+m](ψ)(n| 0 ⟩ + m| 1 ⟩) = (N+M+1)/(N+1).

Hint: Use Exercise 1.7.3. Notice that the right-hand side does not depend on n.
3.4.10 This exercise elaborates that draws from an urn, when all colours except one particular colour are lumped together, can be expressed in binary form. This works for all three modes of drawing: multinomial, hypergeometric, and Pólya.

Let X be a set with at least two elements, and let x ∈ X be an arbitrary but fixed element. We write x⊥ for an element not in X. Assume k ≤ K.

1 For ω ∈ D(X) with x ∈ supp(ω), use the Multinomial Theorem (1.27) to show that:

    Σ_{ϕ∈N[K−k](X∖{x})} mn[K](ω)(k| x ⟩ + ϕ)
      = mn[K]( ω(x)| x ⟩ + (1−ω(x))| x⊥ ⟩ )( k| x ⟩ + (K−k)| x⊥ ⟩ )
      = bn[K](ω(x))(k).

2 Prove, via Vandermonde's formula, that for an urn ψ of size L ≥ K one has:

    Σ_{ϕ ≤_{K−k} ψ−ψ(x)| x ⟩} hg[K](ψ)(k| x ⟩ + ϕ)
      = hg[K]( ψ(x)| x ⟩ + (L−ψ(x))| x⊥ ⟩ )( k| x ⟩ + (K−k)| x⊥ ⟩ ).

3 Show again, for an urn ψ of size L,

    Σ_{ϕ∈N[K−k](supp(ψ)∖{x})} pl[K](ψ)(k| x ⟩ + ϕ)
      = pl[K]( ψ(x)| x ⟩ + (L−ψ(x))| x⊥ ⟩ )( k| x ⟩ + (K−k)| x⊥ ⟩ ).

3.5 Iterated drawing from an urn


This section deals with the iterative aspects of drawing from an urn. It makes
the informal descriptions in the introduction to this chapter precise, for six
different forms of drawing: in “-1” or “0” or “+1” mode, and with ordered
or unordered draws. The iteration involved will be formalised via a special
monad, combining the multiset and the writer monad. This infrastructure for
iteration will be set up first. Subsequently, ordered and unordered draws will
be described in separate subsections.
We recall the transition operation (3.2) from the beginning of this chapter:

    Urn ──◦──→ ( multiple colours, new urn ),

mapping an urn to a distribution over pairs of multiple draws and remaining urn. The actual mapping from urns to draws would then arise as the first marginal of this transition channel. Concretely, the above transition map, used for describing our intuition, takes one of the following six forms, all channels:

    mode    ordered                               unordered
    0       Ot_0 : D(X) → L(X) × D(X)             Ut_0 : D(X) → N(X) × D(X)
    -1      Ot_− : N∗(X) → L(X) × N∗(X)           Ut_− : N∗(X) → N(X) × N∗(X)    (3.17)
    +1      Ot_+ : N∗(X) → L(X) × N∗(X)           Ut_+ : N∗(X) → N(X) × N∗(X)

Notice that the draw maps in Table 3.3 in the introduction are written as channels, such as O_0 : D(X) → N[K](X), whereas the above table contains the corresponding transition maps, with an extra 't' in the name, as in Ot_0 : D(X) → L(X) × D(X). In each case, the drawn elements are accumulated in the left product-component of the codomain of the transition map. They are organised as lists, in L(X), in the ordered case, and as multisets, in N(X), in the unordered case. As we have seen before, in the scenarios with deletion/addition the urn is a multiset, but with replacement it is a distribution.
The crucial observation is that the list and multiset data types that we use for
accumulating drawn elements are monoids, as we have observed early on, in
Lemmas 1.2.2 and 1.4.2. In general, for a monoid M, the mapping X 7→ M × X
is a monad, called the writer monad, see Lemma 1.9.1. It turns out that the
combination of the writer monad with the distribution monad D is again a
monad. This forms the basis for iterating transitions, via Kleisli composition of
this combined monad. We relegate the details of the monad’s flatten operation
to Exercise 3.5.6 below, and concentrate on the unit and Kleisli composition
involved.

Lemma 3.5.1. Let M = (M, 0, +) be a monoid. The mapping X ↦ D(M × X) is a monad on Sets, with unit : X → D(M × X) given by:

    unit(x) = 1| 0, x ⟩,    where 0 ∈ M is the zero element.

For Kleisli maps f : A → D(M × B) and g : B → D(M × C) there is the Kleisli composition g ◦· f : A → D(M × C) given by:

    (g ◦· f)(a) = Σ_{m,m′,c} ( Σ_b f(a)(m, b) · g(b)(m′, c) ) | m + m′, c ⟩.    (3.18)

Notice the occurrence of the sum + of the monoid M in the first component of the ket | −, − ⟩ in (3.18). When M is the list monoid, this sum is the (non-commutative) concatenation ++ of lists, producing an ordered list of drawn elements. When M is the multiset monoid, this sum is the (commutative) +
elements. When M is the multiset monoid, this sum is the (commutative) +

169
170 Chapter 3. Drawing from an urn

    Ot_0 : D(X) → D(L(X) × D(X)),     Ot_0(ω) = Σ_{x∈supp(ω)} ω(x) | [x], ω ⟩
    Ut_0 : D(X) → D(N(X) × D(X)),     Ut_0(ω) = Σ_{x∈supp(ω)} ω(x) | 1| x ⟩, ω ⟩

    Ot_− : N∗(X) → D(L(X) × N(X)),    Ot_−(ψ) = Σ_{x∈supp(ψ)} (ψ(x)/∥ψ∥) | [x], ψ − 1| x ⟩ ⟩
    Ut_− : N∗(X) → D(N(X) × N(X)),    Ut_−(ψ) = Σ_{x∈supp(ψ)} (ψ(x)/∥ψ∥) | 1| x ⟩, ψ − 1| x ⟩ ⟩

    Ot_+ : N∗(X) → D(L(X) × N∗(X)),   Ot_+(ψ) = Σ_{x∈supp(ψ)} (ψ(x)/∥ψ∥) | [x], ψ + 1| x ⟩ ⟩
    Ut_+ : N∗(X) → D(N(X) × N∗(X)),   Ut_+(ψ) = Σ_{x∈supp(ψ)} (ψ(x)/∥ψ∥) | 1| x ⟩, ψ + 1| x ⟩ ⟩

Figure 3.1 Definitions of the six transition channels in Table (3.17) for drawing a single element from an urn. In the "ordered" versions on the left the list monoid L(X) is used, whereas in the "unordered" versions the (commutative) multiset monoid N(X) occurs. In the first row the urn (as distribution ω) remains unchanged, whereas in the second (resp. third) row the drawn element x is removed from (resp. added to) the urn ψ. Implicitly it is assumed that the multiset ψ is non-empty.

of multisets, so that the accumulation of drawn elements yields a multiset, in which the order of elements is irrelevant.
If we have an 'endo' Kleisli map for the combined monad of Lemma 3.5.1, of the form t : A → D(M × A), we can iterate it K times, giving t^K : A → D(M × A). This iteration is defined via the above unit and Kleisli composition:

    t^0 = unit    and    t^{K+1} = t^K ◦· t = t ◦· t^K.

Now that we know how to iterate, we need the actual transition maps that can be iterated, that is, we need concrete definitions of the transition channels in Table (3.17). They are given in Figure 3.1. In the subsections below we analyse what iteration means for these six channels. Subsequently, we can describe the associated K-sized draw channels, as in Table 3.3, as the first projection π₁ ◦· t^K, going from urns to drawn elements.
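The unit, the Kleisli composition (3.18) and the iteration t^K can be sketched concretely for the unordered draw-and-delete transition Ut_− of Figure 3.1. The names kleisli, Ut_minus and iterate below are hypothetical, with urns and draws encoded as sorted tuples:

```python
from fractions import Fraction

def kleisli(g, f):
    # Kleisli composition (3.18) for the monad X -> D(M x X), with M the
    # multiset monoid: drawn balls are accumulated as sorted tuples
    def gf(a):
        out = {}
        for (m, b), p in f(a).items():
            for (m2, c), q in g(b).items():
                key = (tuple(sorted(m + m2)), c)
                out[key] = out.get(key, 0) + p * q
        return out
    return gf

def Ut_minus(psi):
    # unordered draw-and-delete transition: draw one ball uniformly,
    # record it, and remove it from the urn (a sorted tuple)
    out = {}
    for x in set(psi):
        rest = list(psi)
        rest.remove(x)
        out[((x,), tuple(rest))] = Fraction(psi.count(x), len(psi))
    return out

def unit(a):
    # unit of the monad: empty draw, unchanged urn
    return {((), a): Fraction(1)}

def iterate(t, K):
    tK = unit
    for _ in range(K):
        tK = kleisli(t, tK)
    return tK

psi = ('a', 'a', 'a', 'b')
two = iterate(Ut_minus, 2)(psi)
# first marginal = the draw channel; here it matches hg[2](3|a> + 1|b>)
draws = {}
for (m, rest), p in two.items():
    draws[m] = draws.get(m, 0) + p
assert draws == {('a', 'a'): Fraction(1, 2), ('a', 'b'): Fraction(1, 2)}
```

Taking the first marginal of the iterate recovers the unordered draw channel; for this urn it coincides with the hypergeometric distribution hg[2](3| a ⟩ + 1| b ⟩).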

3.5.1 Ordered draws from an urn


We start by looking at the upper two 'ordered' transition channels in the column on the left in Figure 3.1. Towards a general formula for their iteration, let's look first at the easiest case, namely the ordered transition with replacement, Ot_0 : D(X) → D(L(X) × D(X)). By definition we have as first iteration:

    Ot_0^1(ω) = Ot_0(ω) = Σ_{x₁∈supp(ω)} ω(x₁) | [x₁], ω ⟩.

Accumulation of drawn elements in the first coordinate of | −, − ⟩ starts in the second iteration:

    Ot_0^2(ω) = (Ot_0 ◦· Ot_0)(ω)
      = Σ_{ℓ∈L(X), x₁∈supp(ω)} ω(x₁) · Ot_0(ω)(ℓ, ω) | [x₁] ++ ℓ, ω ⟩
      = Σ_{x₁,x₂∈supp(ω)} ω(x₁) · ω(x₂) | [x₁] ++ [x₂], ω ⟩
      = Σ_{x₁,x₂∈supp(ω)} (ω ⊗ ω)(x₁, x₂) | [x₁, x₂], ω ⟩.

The formula for subsequent iterations is beginning to appear.

Theorem 3.5.2. Consider in Figure 3.1 the ordered-transition-with-replacement channel Ot_0 : D(X) → L(X) × D(X), with distribution ω ∈ D(X).

1 Iterating K ∈ N times yields:

    Ot_0^K(ω) = Σ_{~x∈X^K} ω^K(~x) | ~x, ω ⟩.

2 The associated K-draw channel O_0[K] := π₁ ◦· Ot_0^K : D(X) → X^K satisfies:

    O_0[K](ω) = ω^K = iid[K](ω),

where iid is the identical and independent channel from (2.20).

The situation for ordered transition with deletion is less straightforward. We look at two iterations explicitly, starting from a multiset ψ ∈ N(X):

    Ot_−^1(ψ) = Σ_{x₁∈supp(ψ)} (ψ(x₁)/∥ψ∥) | [x₁], ψ − 1| x₁ ⟩ ⟩

    Ot_−^2(ψ) = (Ot_− ◦· Ot_−)(ψ)
      = Σ_{x₁∈supp(ψ), x₂∈supp(ψ−1| x₁ ⟩)} (ψ(x₁)/∥ψ∥) · ( (ψ−1| x₁ ⟩)(x₂)/(∥ψ∥−1) ) | [x₁, x₂], ψ − 1| x₁ ⟩ − 1| x₂ ⟩ ⟩.

Etcetera. We first collect some basic observations in an auxiliary result.

Lemma 3.5.3. Let ψ ∈ N[L](X) be a multiset/urn of size L = ∥ψ∥.

1 Iterating K ≤ L times satisfies:

    Ot_−^K(ψ) = Σ_{~x∈X^K, acc(~x)≤ψ} ( ∏_{0≤i<K} ( (ψ − acc(x₁,…,x_i))(x_{i+1}) / (L−i) ) ) | ~x, ψ − acc(~x) ⟩.

2 For ~x ∈ X^K write ϕ = acc(~x). Then:

    ∏_{0≤i<K} (ψ − acc(x₁,…,x_i))(x_{i+1}) = ∏_y ψ(y)!/(ψ(y)−ϕ(y))!.

The right-hand side is thus independent of the sequence ~x.

This independence means that any order of the elements of the same multiset of balls gets the same (draw) probability. This is not entirely trivial.

Proof. 1 Directly from the definition of the transition channel $Ot_{-}$, using
Kleisli composition (3.18).

2 Write $\varphi = acc(\vec{x})$ as $\varphi = \sum_{j} n_{j}|y_{j}\rangle$. Then each element $y_{j}\in X$ occurs $n_{j}$
times in the sequence $\vec{x}$. The product
$$ \prod_{0\leq i<K}\big(\psi - acc(x_{1},\ldots,x_{i})\big)(x_{i+1}) $$
does not depend on the order of the elements in $\vec{x}$: each element $y_{j}$ occurs $n_{j}$
times in this product, with multiplicities $\psi(y_{j}), \ldots, \psi(y_{j})-n_{j}+1$, independently
of the exact occurrences of the $y_{j}$ in $\vec{x}$. Thus:
$$ \begin{array}{rcl}
\displaystyle\prod_{0\leq i<K}\big(\psi - acc(x_{1},\ldots,x_{i})\big)(x_{i+1})
& = & \displaystyle\prod_{j} \psi(y_{j})\cdot\ldots\cdot\big(\psi(y_{j})-n_{j}+1\big)
\\[0.9em]
& = & \displaystyle\prod_{j} \psi(y_{j})\cdot\ldots\cdot\big(\psi(y_{j})-\varphi(y_{j})+1\big)
\\[0.9em]
& = & \displaystyle\prod_{j} \frac{\psi(y_{j})!}{(\psi(y_{j})-\varphi(y_{j}))!}
\\[0.9em]
& = & \displaystyle\prod_{y\in X} \frac{\psi(y)!}{(\psi(y)-\varphi(y))!}.
\end{array} $$
We can extend the product over $j$ to a product over all $y\in X$ since if $y\notin supp(\varphi)$,
then, even if $\psi(y) = 0$,
$$ \frac{\psi(y)!}{(\psi(y)-\varphi(y))!} \;=\; \frac{\psi(y)!}{\psi(y)!} \;=\; 1. $$
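The order-independence claimed in item 2 can be verified mechanically. A small Python sketch (our own encoding, with urns as `Counter` objects) compares the step-by-step product against the closed factorial form, for every permutation of a draw:

```python
from math import factorial
from collections import Counter
from itertools import permutations

def sequential_product(psi, xs):
    # prod_i (psi - acc(x_1..x_i))(x_{i+1}), tracking deletions step by step
    urn = Counter(psi)
    prod = 1
    for x in xs:
        prod *= urn[x]
        urn[x] -= 1
    return prod

psi = Counter({'a': 4, 'b': 2})
draw = ('a', 'a', 'b')
phi = Counter(draw)
closed = 1
for y in psi:
    # closed form of Lemma 3.5.3 (2): prod_y psi(y)!/(psi(y)-phi(y))!
    closed *= factorial(psi[y]) // factorial(psi[y] - phi[y])
# every ordering of the same multiset of drawn balls yields the same product
assert all(sequential_product(psi, p) == closed for p in permutations(draw))
```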
Theorem 3.5.4. Consider the ordered-transition-with-deletion channel $Ot_{-}$
on $\psi\in\mathcal{N}[L](X)$.

1 For $K\leq L$,
$$ Ot_{-}^{K}(\psi) \;=\; \sum_{\varphi\leq_{K}\psi}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{(\psi-\varphi)}{(\psi)}\,\big|\,\vec{x},\;\psi-\varphi\,\big\rangle. $$

2 The associated K-draw channel $O_{-}[K] := \pi_{1}\mathbin{\circ\!\cdot} Ot_{-}^{K} \colon \mathcal{N}[L](X) \to X^{K}$ satisfies:
$$ O_{-}[K](\psi) \;=\; \sum_{\varphi\leq_{K}\psi}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{(\psi-\varphi)}{(\psi)}\,\big|\,\vec{x}\,\big\rangle \;=\; \sum_{\vec{x}\in X^{K},\; acc(\vec{x})\leq\psi} \frac{\big(\psi-acc(\vec{x})\big)}{(\psi)}\,\big|\,\vec{x}\,\big\rangle. $$

Proof. 1 By combining the two items of Lemma 3.5.3 and using:
$$ \prod_{0\leq i<K}(L-i) \;=\; L\cdot(L-1)\cdot\ldots\cdot(L-K+1) \;=\; \frac{L!}{(L-K)!}, $$
we get:
$$ \begin{array}{rcl}
Ot_{-}^{K}(\psi)
& = & \displaystyle\sum_{\varphi\leq_{K}\psi}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{(L-K)!}{L!}\cdot\prod_{y}\frac{\psi(y)!}{(\psi(y)-\varphi(y))!}\,\big|\,\vec{x},\;\psi-\varphi\,\big\rangle
\\[0.9em]
& = & \displaystyle\sum_{\varphi\leq_{K}\psi}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{\prod_{y}\psi(y)!}{\prod_{y}(\psi(y)-\varphi(y))!}\cdot\frac{(L-K)!}{L!}\,\big|\,\vec{x},\;\psi-\varphi\,\big\rangle
\\[0.9em]
& = & \displaystyle\sum_{\varphi\leq_{K}\psi}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{(L-K)!}{(\psi-\varphi)!}\cdot\frac{\psi!}{L!}\,\big|\,\vec{x},\;\psi-\varphi\,\big\rangle
\\[0.9em]
& = & \displaystyle\sum_{\varphi\leq_{K}\psi}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{(\psi-\varphi)}{(\psi)}\,\big|\,\vec{x},\;\psi-\varphi\,\big\rangle.
\end{array} $$

2 Directly by the previous item.
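As a sanity check on item 2, the coefficients $(\psi-acc(\vec{x}))/(\psi)$, with $(-)$ the multinomial coefficient of a multiset, should form a probability distribution on the admissible sequences. A small Python sketch in our own encoding:

```python
from fractions import Fraction
from math import factorial
from collections import Counter
from itertools import product

def mult_coeff(phi):
    # (phi) = ||phi||! / prod_x phi(x)!
    c = factorial(sum(phi.values()))
    for v in phi.values():
        c //= factorial(v)
    return c

psi = Counter({'a': 4, 'b': 2})
K = 3
probs = {}
for xs in product(psi.keys(), repeat=K):
    phi = Counter(xs)
    if all(phi[y] <= psi[y] for y in phi):
        rest = Counter({y: psi[y] - phi[y] for y in psi})
        # coefficient from Theorem 3.5.4: (psi - acc(xs)) / (psi)
        probs[xs] = Fraction(mult_coeff(rest), mult_coeff(psi))
# ordered-draw-with-deletion probabilities sum to 1
assert sum(probs.values()) == 1
```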

We now move from deletion to addition, that is, from $Ot_{-}$ to $Ot_{+}$, still in the
ordered case. The analysis is very much as in Lemma 3.5.3.

Lemma 3.5.5. Let $\psi\in\mathcal{N}_{*}(X)$ be a non-empty multiset.

1 Iterating $K$ times gives:
$$ Ot_{+}^{K}(\psi) \;=\; \sum_{\vec{x}\in supp(\psi)^{K}} \;\prod_{0\leq i<K} \frac{\big(\psi + acc(x_{1},\ldots,x_{i})\big)(x_{i+1})}{\|\psi\|+i}\,\big|\,\vec{x},\;\psi+acc(\vec{x})\,\big\rangle. $$

2 For $\vec{x}\in X^{K}$ with $\varphi = acc(\vec{x})$,
$$ \prod_{0\leq i<K}\big(\psi + acc(x_{1},\ldots,x_{i})\big)(x_{i+1}) \;=\; \prod_{y\in supp(\psi)} \frac{(\psi(y)+\varphi(y)-1)!}{(\psi(y)-1)!}. $$
This expression on the right is independent of the sequence $\vec{x}$.

Proof. The first item is easy, so we concentrate on the second, in line with the
proof of Lemma 3.5.3. If $\varphi = acc(\vec{x}) = \sum_{j} n_{j}|y_{j}\rangle$ we now have:
$$ \begin{array}{rcl}
\displaystyle\prod_{0\leq i<K}\big(\psi+acc(x_{1},\ldots,x_{i})\big)(x_{i+1})
& = & \displaystyle\prod_{j}\psi(y_{j})\cdot\ldots\cdot\big(\psi(y_{j})+\varphi(y_{j})-1\big)
\\[0.9em]
& = & \displaystyle\prod_{j}\frac{(\psi(y_{j})+\varphi(y_{j})-1)!}{(\psi(y_{j})-1)!}
\\[0.9em]
& = & \displaystyle\prod_{y\in supp(\psi)}\frac{(\psi(y)+\varphi(y)-1)!}{(\psi(y)-1)!}.
\end{array} $$

Theorem 3.5.6. Let $\psi\in\mathcal{N}_{*}(X)$ be a non-empty multiset and let $K\in\mathbb{N}$. Then:
$$ Ot_{+}^{K}(\psi) \;=\; \sum_{\varphi\in\mathcal{N}[K](supp(\psi))}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{1}{(\varphi)}\cdot\frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{\|\psi\|}{K}\!\right)}\,\big|\,\vec{x},\;\psi+\varphi\,\big\rangle. $$
The associated draw channel $O_{+}[K] := \pi_{1}\mathbin{\circ\!\cdot} Ot_{+}^{K} \colon \mathcal{N}_{*}(X) \to X^{K}$ then is:
$$ O_{+}[K](\psi) \;=\; \sum_{\varphi\in\mathcal{N}[K](supp(\psi))}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{1}{(\varphi)}\cdot\frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{\|\psi\|}{K}\!\right)}\,\big|\,\vec{x}\,\big\rangle. $$

We recall from Definition 1.7.1 (2) that:
$$ \left(\!\binom{\psi}{\varphi}\!\right) \;=\; \prod_{x\in supp(\psi)}\left(\!\binom{\psi(x)}{\varphi(x)}\!\right) \;=\; \prod_{x\in supp(\psi)}\binom{\psi(x)+\varphi(x)-1}{\varphi(x)}. $$
The multichoose coefficients $\left(\!\binom{\psi}{\varphi}\!\right)$ sum to $\left(\!\binom{\|\psi\|}{K}\!\right)$, over all $\varphi$ with $\|\varphi\| = K$, see
Proposition 1.7.3.

Proof. Write $L = \|\psi\| > 0$. We first note that:
$$ \prod_{0\leq i<K}(L+i) \;=\; L\cdot(L+1)\cdot\ldots\cdot(L+K-1) \;=\; \frac{(L+K-1)!}{(L-1)!}. $$
In combination with Lemma 3.5.5 we get:
$$ \begin{array}{rcl}
Ot_{+}^{K}(\psi)
& = & \displaystyle\sum_{\varphi\in\mathcal{N}[K](supp(\psi))}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{(L-1)!}{(L+K-1)!}\cdot\prod_{y\in supp(\psi)}\frac{(\psi(y)+\varphi(y)-1)!}{(\psi(y)-1)!}\,\big|\,\vec{x},\;\psi+\varphi\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\varphi\in\mathcal{N}[K](supp(\psi))}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{\prod_{y\in supp(\psi)}\frac{(\psi(y)+\varphi(y)-1)!}{(\psi(y)-1)!}}{\frac{(L+K-1)!}{(L-1)!}}\,\big|\,\vec{x},\;\psi+\varphi\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\varphi\in\mathcal{N}[K](supp(\psi))}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{\prod_{y}\varphi(y)!}{K!}\cdot\frac{\prod_{y\in supp(\psi)}\frac{(\psi(y)+\varphi(y)-1)!}{\varphi(y)!\,(\psi(y)-1)!}}{\frac{(L+K-1)!}{K!\,(L-1)!}}\,\big|\,\vec{x},\;\psi+\varphi\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\varphi\in\mathcal{N}[K](supp(\psi))}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{1}{(\varphi)}\cdot\frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{L}{K}\!\right)}\,\big|\,\vec{x},\;\psi+\varphi\,\big\rangle.
\end{array} $$
The formula for $O_{+}[K](\psi)$ is then an easy consequence.
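For the addition case one can likewise compare the step-by-step Pólya draw probability with the closed form of Theorem 3.5.6, on a concrete urn; a hedged Python sketch (our own encoding):

```python
from fractions import Fraction
from math import comb, factorial
from collections import Counter

def seq_prob_addition(psi, xs):
    # step-by-step probability of drawing the sequence xs from the
    # Polya urn psi, adding an extra copy of each drawn ball
    urn = Counter(psi)
    p = Fraction(1)
    for x in xs:
        p *= Fraction(urn[x], sum(urn.values()))
        urn[x] += 1
    return p

def closed_form(psi, xs):
    # (1/(phi)) * multichoose(psi, phi) / multichoose(||psi||, K)
    phi = Counter(xs)
    K = len(xs)
    mult = factorial(K)
    for v in phi.values():
        mult //= factorial(v)          # multinomial coefficient (phi)
    num = 1
    for y in psi:
        num *= comb(psi[y] + phi[y] - 1, phi[y])   # multichoose(psi(y), phi(y))
    den = comb(sum(psi.values()) + K - 1, K)       # multichoose(||psi||, K)
    return Fraction(num, den) / mult

psi = Counter({'a': 1, 'b': 2})
xs = ('a', 'a', 'b')
assert seq_prob_addition(psi, xs) == closed_form(psi, xs)
```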

3.5.2 Unordered draws from an urn

We now concentrate on the transition channels on the right in Figure 3.1, for
unordered draws. Notice that we are now using $M = \mathcal{N}(X)$ as the monoid in the
setting of Lemma 3.5.1. We combine all the mathematical ‘heavy lifting’
in one preparatory lemma.

Lemma 3.5.7. 1 For $\omega\in\mathcal{D}(X)$ and $K\in\mathbb{N}$,
$$ Ut_{0}^{K}(\omega) \;=\; \sum_{\varphi\in\mathcal{N}[K](X)} (\varphi)\cdot\prod_{x}\omega(x)^{\varphi(x)}\,\big|\,\varphi,\;\omega\,\big\rangle. $$

2 For $\psi\in\mathcal{N}[L+K](X)$,
$$ Ut_{-}^{K}(\psi) \;=\; \sum_{\varphi\leq_{K}\psi} \frac{\prod_{x}\binom{\psi(x)}{\varphi(x)}}{\binom{L+K}{K}}\,\big|\,\varphi,\;\psi-\varphi\,\big\rangle \;=\; \sum_{\varphi\leq_{K}\psi} \frac{\binom{\psi}{\varphi}}{\binom{\|\psi\|}{K}}\,\big|\,\varphi,\;\psi-\varphi\,\big\rangle. $$

3 For $\psi\in\mathcal{N}_{*}(X)$,
$$ Ut_{+}^{K}(\psi) \;=\; \sum_{\varphi\in\mathcal{N}[K](X)} \frac{\prod_{x\in supp(\psi)}\left(\!\binom{\psi(x)}{\varphi(x)}\!\right)}{\left(\!\binom{\|\psi\|}{K}\!\right)}\,\big|\,\varphi,\;\psi+\varphi\,\big\rangle \;=\; \sum_{\varphi\in\mathcal{N}[K](X)} \frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{\|\psi\|}{K}\!\right)}\,\big|\,\varphi,\;\psi+\varphi\,\big\rangle. $$

Proof. 1 We use induction on $K\in\mathbb{N}$. For $K=0$ we have $\mathcal{N}[K](X) = \{\mathbf{0}\}$ and
so:
$$ Ut_{0}^{0}(\omega) \;=\; unit(\omega) \;=\; 1\,\big|\,\mathbf{0},\;\omega\,\big\rangle \;=\; \sum_{\varphi\in\mathcal{N}[0](X)} (\varphi)\cdot\prod_{x}\omega(x)^{\varphi(x)}\,\big|\,\varphi,\;\omega\,\big\rangle. $$
For the induction step:
$$ \begin{array}{rcl}
Ut_{0}^{K+1}(\omega)
& = & \big(Ut_{0}^{K} \mathbin{\circ\!\cdot} Ut_{0}\big)(\omega)
\\[0.3em]
& \smash{\stackrel{(3.18)}{=}} & \displaystyle\sum_{\psi\in\mathcal{N}[1](X),\;\varphi\in\mathcal{N}[K](X)} Ut_{0}^{K}(\omega)(\varphi,\omega)\cdot Ut_{0}(\omega)(\psi,\omega)\,\big|\,\psi+\varphi,\;\omega\,\big\rangle
\\[0.9em]
& \smash{\stackrel{(\mathrm{IH})}{=}} & \displaystyle\sum_{y\in X,\;\varphi\in\mathcal{N}[K](X)} (\varphi)\cdot\prod_{x}\omega(x)^{\varphi(x)}\cdot\omega(y)\,\big|\,1|y\rangle+\varphi,\;\omega\,\big\rangle
\\[0.9em]
& = & \displaystyle\sum_{\psi\in\mathcal{N}[K+1](X)} \Big(\sum_{y}\big(\psi-1|y\rangle\big)\Big)\cdot\prod_{x}\omega(x)^{\psi(x)}\,\big|\,\psi,\;\omega\,\big\rangle
\\[0.9em]
& \smash{\stackrel{(1.25)}{=}} & \displaystyle\sum_{\psi\in\mathcal{N}[K+1](X)} (\psi)\cdot\prod_{x}\omega(x)^{\psi(x)}\,\big|\,\psi,\;\omega\,\big\rangle.
\end{array} $$


2 For $K = 0$ both sides are equal to $1\,\big|\,\mathbf{0},\;\psi\,\big\rangle$. Next, for a multiset $\psi\in\mathcal{N}[L+K+1](X)$ we have:
$$ \begin{array}{rcl}
Ut_{-}^{K+1}(\psi)
& = & \big(Ut_{-}^{K} \mathbin{\circ\!\cdot} Ut_{-}\big)(\psi)
\\[0.3em]
& \smash{\stackrel{(3.18)}{=}} & \displaystyle\sum_{\substack{y\in supp(\psi),\;\chi\in\mathcal{N}[L](X),\\ \varphi\leq_{K}\psi-1|y\rangle}} \frac{\psi(y)}{L+K+1}\cdot Ut_{-}^{K}\big(\psi-1|y\rangle\big)(\varphi,\chi)\,\big|\,\varphi+1|y\rangle,\;\chi\,\big\rangle
\\[1.3em]
& \smash{\stackrel{(\mathrm{IH})}{=}} & \displaystyle\sum_{\substack{y\in supp(\psi),\\ \varphi\leq_{K}\psi-1|y\rangle}} \frac{\psi(y)}{L+K+1}\cdot\frac{\binom{\psi-1|y\rangle}{\varphi}}{\binom{L+K}{K}}\,\big|\,\varphi+1|y\rangle,\;\psi-1|y\rangle-\varphi\,\big\rangle
\\[1.3em]
& = & \displaystyle\sum_{\substack{y\in supp(\psi),\\ \varphi\leq_{K}\psi-1|y\rangle}} \frac{\varphi(y)+1}{K+1}\cdot\frac{\binom{\psi}{\varphi+1|y\rangle}}{\binom{L+K+1}{K+1}}\,\big|\,\varphi+1|y\rangle,\;\psi-(\varphi+1|y\rangle)\,\big\rangle
\qquad\text{by Exercise 1.7.6}
\\[1.3em]
& = & \displaystyle\sum_{\chi\leq_{K+1}\psi}\;\sum_{y\in supp(\chi)} \frac{\chi(y)}{K+1}\cdot\frac{\binom{\psi}{\chi}}{\binom{L+K+1}{K+1}}\,\big|\,\chi,\;\psi-\chi\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\chi\leq_{K+1}\psi} \frac{\binom{\psi}{\chi}}{\binom{L+K+1}{K+1}}\,\big|\,\chi,\;\psi-\chi\,\big\rangle.
\end{array} $$

3 The case $K = 0$ is as before, so we immediately look at the induction step, for an urn $\psi$
with $\|\psi\| = L > 0$.
$$ \begin{array}{rcl}
Ut_{+}^{K+1}(\psi)
& = & \big(Ut_{+}^{K} \mathbin{\circ\!\cdot} Ut_{+}\big)(\psi)
\\[0.3em]
& \smash{\stackrel{(3.18)}{=}} & \displaystyle\sum_{y\in supp(\psi),\;\varphi,\;\chi} \frac{\psi(y)}{L}\cdot Ut_{+}^{K}\big(\psi+1|y\rangle\big)(\varphi,\chi)\,\big|\,\varphi+1|y\rangle,\;\chi\,\big\rangle
\\[1.1em]
& \smash{\stackrel{(\mathrm{IH})}{=}} & \displaystyle\sum_{y\in supp(\psi),\;\varphi\in\mathcal{N}[K](X)} \frac{\psi(y)}{L}\cdot\frac{\left(\!\binom{\psi+1|y\rangle}{\varphi}\!\right)}{\left(\!\binom{L+1}{K}\!\right)}\,\big|\,\varphi+1|y\rangle,\;\psi+1|y\rangle+\varphi\,\big\rangle
\\[1.3em]
& = & \displaystyle\sum_{y\in supp(\psi),\;\varphi\in\mathcal{N}[K](X)} \frac{\varphi(y)+1}{K+1}\cdot\frac{\left(\!\binom{\psi}{\varphi+1|y\rangle}\!\right)}{\left(\!\binom{L}{K+1}\!\right)}\,\big|\,\varphi+1|y\rangle,\;\psi+(\varphi+1|y\rangle)\,\big\rangle
\qquad\text{by Exercise 1.7.6}
\\[1.3em]
& = & \displaystyle\sum_{\chi\in\mathcal{N}[K+1](X)}\;\sum_{y\in supp(\chi)} \frac{\chi(y)}{K+1}\cdot\frac{\left(\!\binom{\psi}{\chi}\!\right)}{\left(\!\binom{L}{K+1}\!\right)}\,\big|\,\chi,\;\psi+\chi\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\chi\in\mathcal{N}[K+1](X)} \frac{\left(\!\binom{\psi}{\chi}\!\right)}{\left(\!\binom{L}{K+1}\!\right)}\,\big|\,\chi,\;\psi+\chi\,\big\rangle.
\end{array} $$

We are now in a position to describe the multinomial, hypergeometric and
Pólya distributions, using iterations of the $Ut_{0}$, $Ut_{-}$ and $Ut_{+}$ transition maps,
followed by marginalisation.

Theorem 3.5.8. 1 The K-draw multinomial is the first marginal of K iterations
of the unordered-with-replacement transition:
$$ mn[K] \;=\; \pi_{1} \mathbin{\circ\!\cdot} Ut_{0}^{K} \;=:\; U_{0}[K]. $$

2 Similarly the hypergeometric distribution arises from iterated unordered-with-deletion
transitions:
$$ hg[K] \;=\; \pi_{1} \mathbin{\circ\!\cdot} Ut_{-}^{K} \;=:\; U_{-}[K]. $$

3 Finally, the Pólya distribution comes from iterated unordered-with-addition
transitions, as:
$$ pl[K] \;=\; \pi_{1} \mathbin{\circ\!\cdot} Ut_{+}^{K} \;=:\; U_{+}[K]. $$

Proof. Directly by Lemma 3.5.7.
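For instance, item 2 can be tested by iterating the one-ball unordered transition $Ut_{-}$ and taking the first marginal; the result should match the hypergeometric formula $\binom{\psi}{\varphi}/\binom{\|\psi\|}{K}$. A minimal Python sketch in our own encoding:

```python
from fractions import Fraction
from math import comb

def ut_minus(state):
    # one unordered draw-with-deletion step on states (drawn, urn),
    # both encoded as sorted tuples of ball colours
    out = {}
    for (drawn, urn), p in state.items():
        n = len(urn)
        for y in set(urn):
            q = Fraction(urn.count(y), n)
            rest = list(urn)
            rest.remove(y)
            key = (tuple(sorted(drawn + (y,))), tuple(rest))
            out[key] = out.get(key, Fraction(0)) + p * q
    return out

def first_marginal_after(psi, K):
    state = {((), tuple(sorted(psi))): Fraction(1)}
    for _ in range(K):
        state = ut_minus(state)
    marginal = {}
    for (drawn, _), p in state.items():
        marginal[drawn] = marginal.get(drawn, Fraction(0)) + p
    return marginal

psi = ('a',) * 4 + ('b',) * 2
m = first_marginal_after(psi, 2)
# hypergeometric check: binom(4,2)*binom(2,0)/binom(6,2) for drawing 'aa'
assert m[('a', 'a')] == Fraction(comb(4, 2) * comb(2, 0), comb(6, 2))
```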

Together, Theorems 3.5.2, 3.5.4, 3.5.6 and 3.5.8 give precise descriptions of
the six channels, for ordered / unordered drawing with replacement / deletion
/ addition, as originally introduced in Table (3.3), in the introduction to this
chapter. What remains to do is show that the diagrams (3.4) mentioned there
commute; these relate the various forms of drawing via accumulation and
arrangement. We repeat them below for convenience, but now enriched with
multinomial, hypergeometric, and Pólya channels. All arrows are channels.

              N[K](X)                     N[K](X)                     N[K](X)
      mn=U_0 ↗    | arr            hg=U_- ↗    | arr            pl=U_+ ↗    | arr
             /    v                      /     v                      /     v
      D(X) --O_0--> X^K        N[L](X) --O_---> X^K        N_*(X) --O_+--> X^K        (3.19)
      mn=U_0 ↘    | acc            hg=U_- ↘    | acc            pl=U_+ ↘    | acc
              \   v                       \    v                       \    v
              N[K](X)                     N[K](X)                     N[K](X)

Theorem 3.3.1 precisely says that the diagram on the left commutes. For the
two other diagrams we still have to do some work. This provides new characterisations
of unordered draws with deletion / addition, namely as hypergeometric
/ Pólya followed by arrangement.

Proposition 3.5.9. 1 Accumulating ordered draws with deletion / addition yields
unordered draws with deletion / addition:
$$ (acc \otimes id) \mathbin{\circ\!\cdot} Ot_{-}^{K} \;=\; Ut_{-}^{K} \qquad\qquad (acc \otimes id) \mathbin{\circ\!\cdot} Ot_{+}^{K} \;=\; Ut_{+}^{K}. $$

2 Permuting ordered draws with deletion / addition has no effect: the permutation
channel $perm = arr \mathbin{\circ\!\cdot} acc \colon X^{K} \to X^{K}$ satisfies:
$$ (perm \otimes id) \mathbin{\circ\!\cdot} Ot_{-}^{K} \;=\; Ot_{-}^{K} \qquad\qquad (perm \otimes id) \mathbin{\circ\!\cdot} Ot_{+}^{K} \;=\; Ot_{+}^{K}. $$

3 The diagrams in the middle and on the right in (3.19) commute.

Proof. 1 By combining Theorem 3.5.4 (1) and Lemma 3.5.7 (2):
$$ \begin{array}{rcl}
\big((acc\otimes id)\mathbin{\circ\!\cdot} Ot_{-}^{K}\big)(\psi)
& = & \displaystyle\sum_{\varphi\leq_{K}\psi}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{(\psi-\varphi)}{(\psi)}\,\big|\,acc(\vec{x}),\;\psi-\varphi\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\varphi\leq_{K}\psi} \frac{(\varphi)\cdot(\psi-\varphi)}{(\psi)}\,\big|\,\varphi,\;\psi-\varphi\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\varphi\leq_{K}\psi} \frac{\binom{\psi}{\varphi}}{\binom{L}{K}}\,\big|\,\varphi,\;\psi-\varphi\,\big\rangle
\qquad\text{by Lemma 1.6.4 (1)}
\\[1.1em]
& = & Ut_{-}^{K}(\psi).
\end{array} $$
Similarly, by Theorem 3.5.6,
$$ \begin{array}{rcl}
\big((acc\otimes id)\mathbin{\circ\!\cdot} Ot_{+}^{K}\big)(\psi)
& = & \displaystyle\sum_{\varphi\in\mathcal{N}[K](supp(\psi))}\;\sum_{\vec{x}\in acc^{-1}(\varphi)} \frac{1}{(\varphi)}\cdot\frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{\|\psi\|}{K}\!\right)}\,\big|\,acc(\vec{x}),\;\psi+\varphi\,\big\rangle
\\[1.3em]
& = & \displaystyle\sum_{\varphi\in\mathcal{N}[K](supp(\psi))} (\varphi)\cdot\frac{1}{(\varphi)}\cdot\frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{\|\psi\|}{K}\!\right)}\,\big|\,\varphi,\;\psi+\varphi\,\big\rangle
\\[1.3em]
& = & Ut_{+}^{K}(\psi).
\end{array} $$

2 The probability for a K-tuple $\vec{x}$ in $Ot_{-}^{K}$ in Theorem 3.5.4 depends only on
$acc(\vec{x})$. Hence it does not change if $\vec{x}$ is permuted. The same holds for $Ot_{+}^{K}$
in Theorem 3.5.6.

3 By the above first item we get:
$$ \begin{array}{rcll}
acc \mathbin{\circ\!\cdot} O_{-}[K]
& = & acc \mathbin{\circ\!\cdot} \pi_{1} \mathbin{\circ\!\cdot} Ot_{-}^{K} \\
& = & \pi_{1} \mathbin{\circ\!\cdot} (acc\otimes id) \mathbin{\circ\!\cdot} Ot_{-}^{K} \\
& = & \pi_{1} \mathbin{\circ\!\cdot} Ut_{-}^{K} & \text{as shown in the first item} \\
& = & hg[K] & \text{by Theorem 3.5.8 (2).}
\end{array} $$
But then:
$$ \begin{array}{rcll}
arr \mathbin{\circ\!\cdot} hg[K]
& = & arr \mathbin{\circ\!\cdot} acc \mathbin{\circ\!\cdot} \pi_{1} \mathbin{\circ\!\cdot} Ot_{-}^{K} & \text{as just shown} \\
& = & perm \mathbin{\circ\!\cdot} \pi_{1} \mathbin{\circ\!\cdot} Ot_{-}^{K} \\
& = & \pi_{1} \mathbin{\circ\!\cdot} (perm\otimes id) \mathbin{\circ\!\cdot} Ot_{-}^{K} \\
& = & \pi_{1} \mathbin{\circ\!\cdot} Ot_{-}^{K} & \text{see item (2)} \\
& = & O_{-}[K].
\end{array} $$
The same argument works with $O_{+}$ instead of $O_{-}$.

We conclude with one more result that clarifies the different roles that the
draw-and-delete DD and draw-and-add DA channels play for hypergeometric
and Pólya distributions. So far we have looked only at the first marginal after
iteration; including also the second marginal gives a fuller picture.

Theorem 3.5.10. 1 In the hypergeometric case, both the first and the second
marginal after iterating $Ut_{-}$ can be expressed in terms of the draw-and-delete
map, as in:

                              N[L+K](X)
        hg[K] = DD^L ↙        | Ut_-^K        ↘ DD^K
                              v
      N[K](X) <--π_1--  N[K](X) × N[L](X)  --π_2--> N[L](X)

2 In the Pólya case, (only) the second marginal can be expressed via the
draw-and-add map:

                              N[L](X)
              pl[K] ↙         | Ut_+^K         ↘ DA^K
                              v
      N[K](X) <--π_1--  N[K](X) × N[L+K](X)  --π_2--> N[L+K](X)

Proof. 1 The triangle on the left commutes by Theorem 3.5.8 (2) and Theorem 3.4.1.
The one on the right follows from the equation:
$$ DD^{K}(\psi) \;=\; \sum_{\varphi\leq_{K}\psi} \frac{\binom{\psi}{\varphi}}{\binom{\|\psi\|}{K}}\,\big|\,\psi-\varphi\,\big\rangle. \qquad(3.20) $$
The proof is left as exercise to the reader.

2 Commutation of the triangle on the left holds by Theorem 3.5.8 (3). The
triangle on the right follows from:
$$ DA^{K}(\psi) \;=\; \sum_{\varphi\in\mathcal{N}[K](supp(\psi))} \frac{\left(\!\binom{\psi}{\varphi}\!\right)}{\left(\!\binom{\|\psi\|}{K}\!\right)}\,\big|\,\psi+\varphi\,\big\rangle. \qquad(3.21) $$
This one is also left as exercise.

Exercises

3.5.1 Consider the multiset/urn $\psi = 4|a\rangle + 2|b\rangle$ of size 6 over $\{a, b\}$. Compute
$O_{-}[3](\psi) \in \mathcal{D}\big(\{a, b\}^{3}\big)$, both by hand and via the $O_{-}$-formula in
Theorem 3.5.4 (2).

3.5.2 Check that:
$$ \begin{array}{rcl}
pl[2]\big(1|a\rangle + 2|b\rangle + 1|c\rangle\big)
& = & \tfrac{1}{10}\,\big|\,2|a\rangle\,\big\rangle + \tfrac{1}{5}\,\big|\,1|a\rangle+1|b\rangle\,\big\rangle + \tfrac{3}{10}\,\big|\,2|b\rangle\,\big\rangle \\[0.4em]
& & +\; \tfrac{1}{10}\,\big|\,1|a\rangle+1|c\rangle\,\big\rangle + \tfrac{1}{5}\,\big|\,1|b\rangle+1|c\rangle\,\big\rangle + \tfrac{1}{10}\,\big|\,2|c\rangle\,\big\rangle.
\end{array} $$

3.5.3 Show that $pl[K] \colon \mathcal{N}_{*}(X) \to \mathcal{D}\big(\mathcal{N}[K](X)\big)$ is natural: for each function
$f \colon X \to Y$ one has:

      N_*(X) ----pl[K]----> D(N[K](X))
        | N_*(f)                | D(N[K](f))
        v                       v
      N_*(Y) ----pl[K]----> D(N[K](Y))

Hint: It suffices to show that the function $Ut_{+}$ in Figure 3.1 is natural.

3.5.4 Prove Equations (3.20) and (3.21).

3.5.5 Consider Theorem 3.5.10.
1 Show that the following diagram commutes, involving subtraction of multisets.

      N[L+K](X) --⟨id, hg[K]⟩--> N[L+K](X) × N[K](X)
              ↘ DD^K                    | −
                        N[L](X) <-------'

2 Idem for:

      N[L](X) --⟨id, pl[K]⟩--> N[L](X) × N[K](X)
              ↘ DA^K                 | +
                      N[L+K](X) <----'

3.5.6 This exercise fills in the details of Lemma 3.5.1 and describes the
relevant monad, slightly more generally, for a general monoid M, and
not only for the monoid of multisets.
For an arbitrary set X define the flatten map:
$$ \mathcal{D}\big(M\times\mathcal{D}(M\times X)\big) \xrightarrow{\;flat\;} \mathcal{D}\big(M\times X\big), \qquad flat(\Omega) \;=\; \sum_{m,m',x}\;\sum_{\omega} \Omega(m,\omega)\cdot\omega(m',x)\,\big|\,m+m',\;x\,\big\rangle. $$
Recall that we additionally have $unit \colon X \to \mathcal{D}(M \times X)$ with $unit(x) = 1\,\big|\,0,\;x\,\big\rangle$.
1 Show that unit and flat are both natural transformations.
2 Prove the monad equations from (1.41) for $T(X) = \mathcal{D}(M \times X)$.
3 Check that the induced Kleisli composition ∘· is the one given in
Lemma 3.5.1: for $f \colon A \to \mathcal{D}(M \times B)$ and $g \colon B \to \mathcal{D}(M \times C)$,
$$ \big(g \mathbin{\circ\!\cdot} f\big)(a) \;=\; \big(flat \circ T(g) \circ f\big)(a) \;=\; \sum_{m,m',c}\;\sum_{b} f(a)(m, b)\cdot g(b)(m', c)\,\big|\,m+m',\;c\,\big\rangle. $$
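The Pólya distribution of Exercise 3.5.2 can be checked mechanically against the formula $pl[K](\psi)(\varphi) = \left(\!\binom{\psi}{\varphi}\!\right)/\left(\!\binom{\|\psi\|}{K}\!\right)$ from Lemma 3.5.7 (3); a small Python sketch, in our own encoding:

```python
from fractions import Fraction
from math import comb
from itertools import combinations_with_replacement
from collections import Counter

def polya(psi, K):
    # pl[K](psi)(phi) = multichoose(psi, phi) / multichoose(||psi||, K)
    L = sum(psi.values())
    den = comb(L + K - 1, K)
    dist = {}
    for xs in combinations_with_replacement(sorted(psi), K):
        phi = Counter(xs)
        num = 1
        for y in psi:
            num *= comb(psi[y] + phi[y] - 1, phi[y])
        dist[xs] = Fraction(num, den)
    return dist

pl2 = polya(Counter({'a': 1, 'b': 2, 'c': 1}), 2)
assert pl2[('a', 'a')] == Fraction(1, 10)
assert pl2[('a', 'b')] == Fraction(1, 5)
assert pl2[('b', 'b')] == Fraction(3, 10)
assert sum(pl2.values()) == 1
```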

3.6 The parallel multinomial law: four definitions

We have already seen the close connection between multisets and distributions.
This section focuses on a very special ‘distributivity’ relation between them.
It shows how a (natural) multiset of distributions can be transformed into a
distribution over multisets. This is a rather complicated operation, but it is a
fundamental one. It can be described via a tensor product ⊗ of multinomials,
and will therefore be called the parallel multinomial law, abbreviated as pml.

This law pml is an instance of what is called a distributive law in category
theory. It has popped up, without explicit description, in [99, 31] and
in [34], for continuous probability, to describe ‘point processes’ as distributions
over multisets. The explicit descriptions of the law that we use below
come from [78]. This law satisfies several elementary properties that combine
basic elements of probability theory.

The parallel multinomial law pml that we are after has the following type.
For a number $K \in \mathbb{N}$ and a set $X$ it is a function:
$$ \mathcal{N}[K]\big(\mathcal{D}(X)\big) \xrightarrow{\;pml[K]\;} \mathcal{D}\big(\mathcal{N}[K](X)\big). \qquad(3.22) $$
The dependence of pml on K (and X) is often left implicit. Notice that pml
can also be written as a channel $\mathcal{N}[K]\big(\mathcal{D}(X)\big) \to \mathcal{N}[K](X)$. We shall frequently
encounter it in this form in commuting diagrams.

This map pml turns a K-element multiset of distributions over X into a distribution
over K-element multisets over X. It is not immediately clear how to
do this. It turns out that there are several ways to describe pml. This section is
solely devoted to defining this law, in four different manners, yielding each
time the same result. The subsequent section collects basic properties of pml.

First definition
Since the law (3.22) is rather complicated, we start with an example.

Example 3.6.1. Let $X = \{a, b\}$ be a sample space with two distributions $\omega, \rho \in \mathcal{D}(X)$, given by:
$$ \omega = \tfrac{1}{3}|a\rangle + \tfrac{2}{3}|b\rangle \qquad\text{and}\qquad \rho = \tfrac{3}{4}|a\rangle + \tfrac{1}{4}|b\rangle. \qquad(3.23) $$
We will define pml on the multiset of distributions $2|\omega\rangle + 1|\rho\rangle$ of size $K = 3$.
The result should be a distribution on multisets of size $K = 3$ over X. There
are four such multisets, namely:
$$ 3|a\rangle \qquad 2|a\rangle + 1|b\rangle \qquad 1|a\rangle + 2|b\rangle \qquad 3|b\rangle. $$
The goal is to assign a probability to each of them. The map pml does this in
the following way:
$$ \begin{array}{rcl}
\lefteqn{pml\big(2|\omega\rangle + 1|\rho\rangle\big)} \\[0.3em]
& = & \omega(a)\cdot\omega(a)\cdot\rho(a)\,\big|\,3|a\rangle\,\big\rangle \\[0.3em]
& & +\; \big(\omega(a)\cdot\omega(a)\cdot\rho(b) + \omega(a)\cdot\omega(b)\cdot\rho(a) + \omega(b)\cdot\omega(a)\cdot\rho(a)\big)\,\big|\,2|a\rangle+1|b\rangle\,\big\rangle \\[0.3em]
& & +\; \big(\omega(a)\cdot\omega(b)\cdot\rho(b) + \omega(b)\cdot\omega(a)\cdot\rho(b) + \omega(b)\cdot\omega(b)\cdot\rho(a)\big)\,\big|\,1|a\rangle+2|b\rangle\,\big\rangle \\[0.3em]
& & +\; \omega(b)\cdot\omega(b)\cdot\rho(b)\,\big|\,3|b\rangle\,\big\rangle \\[0.3em]
& = & \tfrac{1}{3}\cdot\tfrac{1}{3}\cdot\tfrac{3}{4}\,\big|\,3|a\rangle\,\big\rangle + \big(\tfrac{1}{3}\cdot\tfrac{1}{3}\cdot\tfrac{1}{4} + \tfrac{1}{3}\cdot\tfrac{2}{3}\cdot\tfrac{3}{4} + \tfrac{2}{3}\cdot\tfrac{1}{3}\cdot\tfrac{3}{4}\big)\,\big|\,2|a\rangle+1|b\rangle\,\big\rangle \\[0.3em]
& & +\; \big(\tfrac{1}{3}\cdot\tfrac{2}{3}\cdot\tfrac{1}{4} + \tfrac{2}{3}\cdot\tfrac{1}{3}\cdot\tfrac{1}{4} + \tfrac{2}{3}\cdot\tfrac{2}{3}\cdot\tfrac{3}{4}\big)\,\big|\,1|a\rangle+2|b\rangle\,\big\rangle + \tfrac{2}{3}\cdot\tfrac{2}{3}\cdot\tfrac{1}{4}\,\big|\,3|b\rangle\,\big\rangle \\[0.3em]
& = & \tfrac{1}{12}\,\big|\,3|a\rangle\,\big\rangle + \tfrac{13}{36}\,\big|\,2|a\rangle+1|b\rangle\,\big\rangle + \tfrac{4}{9}\,\big|\,1|a\rangle+2|b\rangle\,\big\rangle + \tfrac{1}{9}\,\big|\,3|b\rangle\,\big\rangle.
\end{array} $$
There is a pattern. Let’s try to formulate the law pml from (3.22) in general,
for arbitrary K and X. It is defined on a natural multiset $\sum_i n_i|\omega_i\rangle$ with multiplicities
$n_i\in\mathbb{N}$ satisfying $\sum_i n_i = K$, and with distributions $\omega_i\in\mathcal{D}(X)$. The
number $pml\big(\sum_i n_i|\omega_i\rangle\big)(\varphi)$ describes the probability of the K-sized multiset $\varphi$
over X, by using for each element occurring in $\varphi$ the probability of that element
in the corresponding distribution in $\sum_i n_i|\omega_i\rangle$.

In order to make this description precise we assume that the indices i are
somehow ordered, say as $i_1, \ldots, i_m$, and use this ordering to form a product
state:
$$ \bigotimes_i \omega_i^{n_i} \;=\; \underbrace{\omega_{i_1}\otimes\cdots\otimes\omega_{i_1}}_{n_{i_1}\text{ times}} \otimes\cdots\otimes \underbrace{\omega_{i_m}\otimes\cdots\otimes\omega_{i_m}}_{n_{i_m}\text{ times}} \;\in\; \mathcal{D}\big(X^K\big). $$
Now we formulate the first definition:
$$ pml\Big(\sum_i n_i|\omega_i\rangle\Big) \;:=\; \sum_{\vec{x}\in X^K} \Big(\bigotimes_i \omega_i^{n_i}\Big)(\vec{x})\,\Big|\,acc(\vec{x})\,\Big\rangle \;=\; \sum_{\varphi\in\mathcal{N}[K](X)} \bigg(\sum_{\vec{x}\in acc^{-1}(\varphi)} \Big(\bigotimes_i \omega_i^{n_i}\Big)(\vec{x})\bigg)\,\Big|\,\varphi\,\Big\rangle. \qquad(3.24) $$
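Definition (3.24) can be evaluated by brute force over all draw sequences; the Python sketch below (our own encoding of distributions as dictionaries and multisets as sorted item tuples) reproduces the outcome of Example 3.6.1:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

def pml_bruteforce(dists):
    # dists: list of distributions, one entry per occurrence in the multiset;
    # sum the product probabilities of all sequences with the same accumulation
    out = {}
    for xs in product(*[d.keys() for d in dists]):
        p = Fraction(1)
        for d, x in zip(dists, xs):
            p *= d[x]
        key = tuple(sorted(Counter(xs).items()))
        out[key] = out.get(key, Fraction(0)) + p
    return out

omega = {'a': Fraction(1, 3), 'b': Fraction(2, 3)}
rho = {'a': Fraction(3, 4), 'b': Fraction(1, 4)}
r = pml_bruteforce([omega, omega, rho])   # the multiset 2|omega> + 1|rho>
assert r[(('a', 3),)] == Fraction(1, 12)
assert r[(('a', 2), ('b', 1))] == Fraction(13, 36)
assert r[(('a', 1), ('b', 2))] == Fraction(4, 9)
assert r[(('b', 3),)] == Fraction(1, 9)
```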

Second definition
There is an alternative formulation of the parallel multinomial law, using sums
of parallel multinomial distributions via ⊗. This formulation is the basis for the
phrase ‘parallel multinomial’.
$$ pml\Big(\sum_i n_i|\omega_i\rangle\Big) \;:=\; \mathcal{D}(+)\Big(\bigotimes_i mn[n_i](\omega_i)\Big) \;=\; \sum_{i,\;\varphi_i\in\mathcal{N}[n_i](X)} \Big(\prod_i mn[n_i](\omega_i)(\varphi_i)\Big)\,\Big|\,\sum_i \varphi_i\,\Big\rangle. \qquad(3.25) $$
The sum + that we use here has type:
$$ \mathcal{N}[n_{i_1}](X)\times\cdots\times\mathcal{N}[n_{i_m}](X) \xrightarrow{\;+\;} \mathcal{N}\big[\underbrace{n_{i_1}+\cdots+n_{i_m}}_{K}\big](X). $$
That is, this sum has type $\prod_i \mathcal{N}[n_i](X) \to \mathcal{N}\big[\sum_i n_i\big](X)$.

This definition (3.25) may be seen as a sum of multinomials, in the style of
Proposition 2.4.5. But this has to be justified via the inclusions $\mathcal{N}[n_i](X) \hookrightarrow \mathcal{N}(X)$. This
perspective is exploited in the fourth definition below.
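The second definition suggests a different computation: form each multinomial $mn[n_i](\omega_i)$ separately and combine them by convolution along multiset addition. A hedged Python sketch in our own encoding:

```python
from fractions import Fraction
from math import factorial
from itertools import combinations_with_replacement
from collections import Counter

def multinomial(omega, n):
    # mn[n](omega)(phi) = (phi) * prod_x omega(x)^phi(x)
    dist = {}
    for xs in combinations_with_replacement(sorted(omega), n):
        phi = Counter(xs)
        coeff = factorial(n)
        p = Fraction(1)
        for x, k in phi.items():
            coeff //= factorial(k)
            p *= omega[x] ** k
        dist[tuple(sorted(phi.items()))] = coeff * p
    return dist

def convolve(d1, d2):
    # distribution of the sum of two independent multiset-valued draws
    out = {}
    for phi1, p in d1.items():
        for phi2, q in d2.items():
            s = Counter(dict(phi1)) + Counter(dict(phi2))
            key = tuple(sorted(s.items()))
            out[key] = out.get(key, Fraction(0)) + p * q
    return out

omega = {'a': Fraction(1, 3), 'b': Fraction(2, 3)}
rho = {'a': Fraction(3, 4), 'b': Fraction(1, 4)}
# pml(2|omega> + 1|rho>) = D(+)(mn[2](omega) ⊗ mn[1](rho)), as in (3.25)
pml_dist = convolve(multinomial(omega, 2), multinomial(rho, 1))
assert pml_dist[(('a', 2), ('b', 1))] == Fraction(13, 36)
```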

Proposition 3.6.2. The definitions of the law pml in (3.24) and (3.25) are
equivalent.

Proof. Because:
$$ \begin{array}{rcll}
pml\Big(\sum_i n_i|\omega_i\rangle\Big)
& \smash{\stackrel{(3.24)}{=}} & \displaystyle\sum_{\vec{x}\in X^K} \Big(\bigotimes_i \omega_i^{n_i}\Big)(\vec{x})\,\Big|\,acc(\vec{x})\,\Big\rangle
\\[1.1em]
& = & \displaystyle\sum_{i,\;\vec{x}_i\in X^{n_i}} \Big(\prod_i \omega_i^{n_i}(\vec{x}_i)\Big)\,\Big|\,\sum_i acc(\vec{x}_i)\,\Big\rangle
& \text{see Exercise 1.6.5}
\\[1.1em]
& = & \displaystyle\sum_{i,\;\varphi_i\in\mathcal{N}[n_i](X)} \Big(\prod_i \sum_{\vec{x}_i\in acc^{-1}(\varphi_i)} \omega_i^{n_i}(\vec{x}_i)\Big)\,\Big|\,\sum_i \varphi_i\,\Big\rangle
\\[1.1em]
& = & \displaystyle\sum_{i,\;\varphi_i\in\mathcal{N}[n_i](X)} \Big(\prod_i \mathcal{D}(acc)\big(\omega_i^{n_i}\big)(\varphi_i)\Big)\,\Big|\,\sum_i \varphi_i\,\Big\rangle
\\[1.1em]
& = & \displaystyle\sum_{i,\;\varphi_i\in\mathcal{N}[n_i](X)} \Big(\prod_i \big(acc \mathbin{\circ\!\cdot} iid\big)(\omega_i)(\varphi_i)\Big)\,\Big|\,\sum_i \varphi_i\,\Big\rangle
\\[1.1em]
& = & \displaystyle\sum_{i,\;\varphi_i\in\mathcal{N}[n_i](X)} \Big(\prod_i mn[n_i](\omega_i)(\varphi_i)\Big)\,\Big|\,\sum_i \varphi_i\,\Big\rangle.
\end{array} $$
This last equation is an instance of Theorem 3.3.1 (2). It leads to the second
formulation of pml in (3.25).

Example 3.6.3. We continue Example 3.6.1, but now we describe the application
of the parallel multinomial law pml in terms of multinomials, as in (3.25).
We use the same multiset $2|\omega\rangle + 1|\rho\rangle$ of distributions $\omega, \rho$ from (3.23). The
calculation of pml on this multiset, according to the second definition (3.25),
is a bit more complicated than in Example 3.6.1 according to the first definition,
since we have to evaluate the multinomial expressions. But of course the
outcome is the same.
$$ \begin{array}{rcl}
\lefteqn{pml\big(2|\omega\rangle+1|\rho\rangle\big)} \\[0.3em]
& = & \displaystyle\sum_{\varphi\in\mathcal{N}[2](X),\;\psi\in\mathcal{N}[1](X)} mn[2](\omega)(\varphi)\cdot mn[1](\rho)(\psi)\,\big|\,\varphi+\psi\,\big\rangle \\[1.1em]
& = & mn[2](\omega)(2|a\rangle)\cdot mn[1](\rho)(1|a\rangle)\,\big|\,3|a\rangle\,\big\rangle \\[0.3em]
& & +\;\big(mn[2](\omega)(2|a\rangle)\cdot mn[1](\rho)(1|b\rangle) + mn[2](\omega)(1|a\rangle+1|b\rangle)\cdot mn[1](\rho)(1|a\rangle)\big)\,\big|\,2|a\rangle+1|b\rangle\,\big\rangle \\[0.3em]
& & +\;\big(mn[2](\omega)(1|a\rangle+1|b\rangle)\cdot mn[1](\rho)(1|b\rangle) + mn[2](\omega)(2|b\rangle)\cdot mn[1](\rho)(1|a\rangle)\big)\,\big|\,1|a\rangle+2|b\rangle\,\big\rangle \\[0.3em]
& & +\; mn[2](\omega)(2|b\rangle)\cdot mn[1](\rho)(1|b\rangle)\,\big|\,3|b\rangle\,\big\rangle \\[0.3em]
& = & \binom{2}{2,0}\,\omega(a)^2\cdot\binom{1}{1,0}\,\rho(a)\,\big|\,3|a\rangle\,\big\rangle \\[0.3em]
& & +\;\Big(\binom{2}{2,0}\,\omega(a)^2\cdot\binom{1}{1,0}\,\rho(b) + \binom{2}{1,1}\,\omega(a)\,\omega(b)\cdot\binom{1}{1,0}\,\rho(a)\Big)\,\big|\,2|a\rangle+1|b\rangle\,\big\rangle \\[0.3em]
& & +\;\Big(\binom{2}{1,1}\,\omega(a)\,\omega(b)\cdot\binom{1}{1,0}\,\rho(b) + \binom{2}{2,0}\,\omega(b)^2\cdot\binom{1}{1,0}\,\rho(a)\Big)\,\big|\,1|a\rangle+2|b\rangle\,\big\rangle \\[0.3em]
& & +\;\binom{2}{2,0}\,\omega(b)^2\cdot\binom{1}{1,0}\,\rho(b)\,\big|\,3|b\rangle\,\big\rangle \\[0.3em]
& = & \big(\tfrac{1}{3}\big)^2\cdot\tfrac{3}{4}\,\big|\,3|a\rangle\,\big\rangle + \Big(\big(\tfrac{1}{3}\big)^2\cdot\tfrac{1}{4} + 2\cdot\tfrac{1}{3}\cdot\tfrac{2}{3}\cdot\tfrac{3}{4}\Big)\,\big|\,2|a\rangle+1|b\rangle\,\big\rangle \\[0.3em]
& & +\;\Big(2\cdot\tfrac{1}{3}\cdot\tfrac{2}{3}\cdot\tfrac{1}{4} + \big(\tfrac{2}{3}\big)^2\cdot\tfrac{3}{4}\Big)\,\big|\,1|a\rangle+2|b\rangle\,\big\rangle + \big(\tfrac{2}{3}\big)^2\cdot\tfrac{1}{4}\,\big|\,3|b\rangle\,\big\rangle \\[0.3em]
& = & \tfrac{1}{12}\,\big|\,3|a\rangle\,\big\rangle + \tfrac{13}{36}\,\big|\,2|a\rangle+1|b\rangle\,\big\rangle + \tfrac{4}{9}\,\big|\,1|a\rangle+2|b\rangle\,\big\rangle + \tfrac{1}{9}\,\big|\,3|b\rangle\,\big\rangle.
\end{array} $$
Indeed, this is what has been calculated in Example 3.6.1.

Third definition
Our third definition of pml is more abstract than the two earlier ones. It uses
the coequaliser property of Proposition 3.1.1. It determines pml as the unique
(dashed) map in:

      D(X)^K ----acc----> N[K](D(X))
         | ⊗                  :
         v                    : pml                  (3.26)
      D(X^K) --D(acc)--> D(N[K](X))

There is an important side-condition in Proposition 3.1.1, namely that the composite
$f := \mathcal{D}(acc) \circ \bigotimes \colon \mathcal{D}(X)^{K} \to \mathcal{D}\big(\mathcal{N}[K](X)\big)$ is stable under permutations.
This is easy: for $\omega_i \in \mathcal{D}(X)$ and $\varphi \in \mathcal{N}[K](X)$, and for a permutation
$\pi \colon \{1, \ldots, K\} \to \{1, \ldots, K\}$,
$$ \begin{array}{rcl}
f\big(\omega_1,\ldots,\omega_K\big)(\varphi)
& = & \displaystyle\sum_{\vec{x}\in acc^{-1}(\varphi)} \big(\omega_1\otimes\cdots\otimes\omega_K\big)(x_1,\ldots,x_K)
\\[1.1em]
& \smash{\stackrel{(*)}{=}} & \displaystyle\sum_{\vec{y}\in acc^{-1}(\varphi)} \big(\omega_1\otimes\cdots\otimes\omega_K\big)\big(y_{\pi^{-1}(1)},\ldots,y_{\pi^{-1}(K)}\big)
\\[1.1em]
& = & \displaystyle\sum_{\vec{y}\in acc^{-1}(\varphi)} \big(\omega_{\pi(1)}\otimes\cdots\otimes\omega_{\pi(K)}\big)(y_1,\ldots,y_K)
\\[1.1em]
& = & f\big(\omega_{\pi(1)},\ldots,\omega_{\pi(K)}\big)(\varphi).
\end{array} $$
The marked equation $\smash{\stackrel{(*)}{=}}$ holds because accumulation is stable under permutation.
This implies that if $\vec{x}$ is in $acc^{-1}(\varphi)$, then each permutation of $\vec{x}$ is also in
$acc^{-1}(\varphi)$.
Proposition 3.6.4. The definitions of pml in (3.24), (3.25) and (3.26) are all
equivalent.

Proof. By the uniqueness property in the triangle (3.26) it suffices to prove
that pml as described in the first (3.24) or second (3.25) formulation makes
this triangle commute. For this we use the first version. We rely again on
the fact that accumulation is stable under permutation. Assume we have $\vec{\omega} = (\omega_1, \ldots, \omega_K) \in \mathcal{D}(X)^K$ with $acc(\vec{\omega}) = \sum_{i\in S} n_i|\omega_i\rangle$, for $S \subseteq \{1, \ldots, K\}$. Then:
$$ \begin{array}{rcl}
\big(pml \circ acc\big)(\vec{\omega})
& = & pml\Big(\sum_{i\in S} n_i|\omega_i\rangle\Big)
\\[0.5em]
& \smash{\stackrel{(3.24)}{=}} & \displaystyle\sum_{\vec{x}\in X^K} \Big(\bigotimes_i \omega_i^{n_i}\Big)(\vec{x})\,\big|\,acc(\vec{x})\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\vec{x}\in X^K} \big(\omega_1\otimes\cdots\otimes\omega_K\big)(\vec{x})\,\big|\,acc(\vec{x})\,\big\rangle
\\[1.1em]
& = & \Big(\mathcal{D}(acc)\circ\bigotimes\Big)(\vec{\omega}).
\end{array} $$
Implicitly, for well-definedness of the first definition (3.24) of pml we already
used that the precise ordering of states in the tensor $\bigotimes_i \omega_i^{n_i}$ is irrelevant in the
formulation of pml.

This third formulation of the parallel multinomial law is not very useful for
actual calculations, like in Examples 3.6.1 and 3.6.3. But it is useful for proving
properties about pml, via the uniqueness part of the third definition. This will
be illustrated in Exercise 3.6.3 below. More generally, the fact that we have
three equivalent formulations of the same law allows us to switch freely and
use whichever formulation is most convenient in a particular situation.

Fourth definition
For our fourth and last definition we have to piece together some earlier observations.

1 Recall from Proposition 2.4.5 that if M is a commutative monoid, then so is
the set $\mathcal{D}(M)$ of distributions on M, with sum:
$$ \omega + \rho \;=\; \mathcal{D}(+)\big(\omega\otimes\rho\big) \;=\; \sum_{x_1,x_2\in M} \big(\omega\otimes\rho\big)(x_1,x_2)\,\big|\,x_1+x_2\,\big\rangle. $$

2 Recall from Proposition 1.4.5 that such commutative monoid structure corresponds
to an $\mathcal{N}$-algebra $\alpha \colon \mathcal{N}\big(\mathcal{D}(M)\big) \to \mathcal{D}(M)$, given by:
$$ \alpha\Big(\sum_i n_i|\omega_i\rangle\Big) \;=\; \sum_i n_i\cdot\omega_i \;=\; \sum_{\vec{x}\in M^K} \Big(\bigotimes_i \omega_i^{n_i}\Big)(\vec{x})\,\big|\,x_1+\cdots+x_K\,\big\rangle \quad\text{where } K = \sum_i n_i. \qquad(3.27) $$

3 For an arbitrary set X, the set $\mathcal{N}(X)$ of natural multisets on X is a commutative
monoid, see Lemma 1.4.2. Applying the previous two items with
$M = \mathcal{N}(X)$ yields an $\mathcal{N}$-algebra:
$$ \mathcal{N}\big(\mathcal{D}(\mathcal{N}(X))\big) \xrightarrow{\;\alpha\;} \mathcal{D}\big(\mathcal{N}(X)\big). \qquad(3.28) $$
It interacts with $\mathcal{N}$’s unit and flatten operations as described in Proposition 1.4.5.

We can now formulate the fourth definition:
$$ pml \;:=\; \Big(\mathcal{N}\mathcal{D}(X) \xrightarrow{\;\mathcal{N}\mathcal{D}(unit_{\mathcal{N}})\;} \mathcal{N}\mathcal{D}\mathcal{N}(X) \xrightarrow{\;\alpha\;} \mathcal{D}\mathcal{N}(X)\Big). \qquad(3.29) $$

Proposition 3.6.5. The definition of pml in (3.29) restricts to $\mathcal{N}[K]\big(\mathcal{D}(X)\big) \to \mathcal{D}\big(\mathcal{N}[K](X)\big)$, for each $K \in \mathbb{N}$. This restriction is the same pml as defined
in (3.24), (3.25) and (3.26).

Proof. We elaborate formulation (3.29), on a K-sized multiset $\sum_i n_i|\omega_i\rangle$.
$$ \begin{array}{rcl}
pml\Big(\sum_i n_i|\omega_i\rangle\Big)
& \smash{\stackrel{(3.29)}{=}} & \alpha\Big(\sum_i n_i\big|\mathcal{D}(unit)(\omega_i)\big\rangle\Big)
\\[0.5em]
& \smash{\stackrel{(3.27)}{=}} & \displaystyle\sum_{\vec{\varphi}\in\mathcal{N}(X)^K} \Big(\bigotimes_i \mathcal{D}(unit)(\omega_i)^{n_i}\Big)(\vec{\varphi})\,\big|\,\varphi_1+\cdots+\varphi_K\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{x_1,\ldots,x_K\in X} \Big(\bigotimes_i \omega_i^{n_i}\Big)(x_1,\ldots,x_K)\,\big|\,1|x_1\rangle+\cdots+1|x_K\rangle\,\big\rangle
\\[1.1em]
& = & \displaystyle\sum_{\vec{x}\in X^K} \Big(\bigotimes_i \omega_i^{n_i}\Big)(\vec{x})\,\big|\,acc(\vec{x})\,\big\rangle.
\end{array} $$
The last line coincides with the first formulation (3.24) of pml.

Exercises

3.6.1 Check that the multinomial channel can be obtained via the parallel
multinomial law, in two different ways.
1 Use the first or second formulation, (3.24) or (3.25), of pml to compute
that for a distribution $\omega \in \mathcal{D}(X)$ and number $K \in \mathbb{N}$ one has:
$$ mn[K](\omega) \;=\; pml\big(K|\omega\rangle\big), $$
that is, the following triangle commutes:

      D(X) ------mn[K]------> D(N[K](X))
          ↘ K·unit             ↗ pml
              N[K](D(X))

2 Prove the same thing via the third formulation (3.26), and via Theorem 3.3.1 (2).

3.6.2 Apply the multiset flatten map $flat \colon \mathcal{M}\big(\mathcal{M}(X)\big) \to \mathcal{M}(X)$ in the setting
of Example 3.6.1 to show that:
$$ flat\big(2|\omega\rangle + 1|\rho\rangle\big) \;=\; \tfrac{17}{12}|a\rangle + \tfrac{19}{12}|b\rangle \;=\; flat\Big(pml\big(2|\omega\rangle + 1|\rho\rangle\big)\Big). $$
(The general formulation appears later on in Proposition 3.7.3.)

3.6.3 We claim pml is natural: for each function $f \colon X \to Y$ the following
diagram commutes.

      N[K](D(X)) ----pml----> D(N[K](X))
          | N[K](D(f))            | D(N[K](f))
          v                       v
      N[K](D(Y)) ----pml----> D(N[K](Y))

1 Prove this claim via a direct calculation using the first (3.24) or
second (3.25) formulation of pml.
2 Give an alternative proof using the uniqueness part of the third formulation
(3.26), as suggested in the diagram:

      D(X)^K ----acc----> N[K](D(X))
         | D(f)^K
         v
      D(Y)^K --⊗--> D(Y^K) --D(acc)--> D(N[K](Y))

3.6.4 Generalise Exercise 3.3.8 from multinomials to parallel multinomials,
as in the following diagram of channels.

      N[K](D(X)) × N[L](D(X)) --pml⊗pml--> N[K](X) × N[L](X)
          | +                                   | +
          v                                     v
      N[K+L](D(X)) ----------pml----------> N[K+L](X)

3.7 The parallel multinomial law: basic properties

This section continues the investigation of the parallel multinomial law pml,
introduced in the previous section, in four different forms. This section focuses
on some key properties of this law, including its interaction with frequentist
learning, multizip and hypergeometric distributions. As we have seen,
actual calculations with the parallel multinomial law pml quickly grow out of
hand. So we might worry that proving properties also becomes tedious. But
abstraction will help us. Since there is a characterisation of pml with a uniqueness
property, in its third formulation (3.26), we can reason with the associated
uniqueness proof principle. In its most general form it says that $f = g$ follows
from $f \circ acc = g \circ acc$, where acc is the accumulation map, see Section 3.1,
especially Proposition 3.1.1.

The next result enriches what we already know.

Proposition 3.7.1. The parallel multinomial law pml is the unique channel
making both rectangles below commute.

      D(X)^K --acc--> N[K](D(X)) --arr--> D(X)^K
         | ⊗               | pml              | ⊗
         v                 v                  v
        X^K ----acc---> N[K](X) ----arr---> X^K

Proof. The rectangle on the left is the third formulation of pml in (3.26)
and thus provides uniqueness. Commutation of the rectangle on the right follows
from a uniqueness argument, using that the outer rectangle commutes by
Proposition 3.1.3:
$$ \begin{array}{rcll}
\bigotimes \mathbin{\circ\!\cdot} arr \mathbin{\circ\!\cdot} acc
& = & arr \mathbin{\circ\!\cdot} acc \mathbin{\circ\!\cdot} \bigotimes & \text{by Proposition 3.1.3} \\
& = & arr \mathbin{\circ\!\cdot} pml \mathbin{\circ\!\cdot} acc & \text{by (3.26).}
\end{array} $$

This result shows that pml is squeezed between ⊗, both on the left and on
the right. We have seen in Exercise 2.4.10 that ⊗ is a distributive law. We
shall prove the same about pml below.
But first we show how pml interacts with frequentist learning.

Theorem 3.7.2. The distributive law pml commutes with frequentist learning,
in the sense that for $\Psi \in \mathcal{N}[K]\big(\mathcal{D}(X)\big)$,
$$ Flrn =\!\!\ll pml(\Psi) \;=\; flat\big(Flrn(\Psi)\big). $$
Equivalently, in diagrammatic form:

      N[K](D(X)) ----pml----> N[K](X)
          | Flrn                  | Flrn
          v                       v
         D(X) -------id-------> X

The channel $\mathcal{D}(X) \to X$ at the bottom is the identity function $\mathcal{D}(X) \to \mathcal{D}(X)$.

Proof. We use the second formulation of the parallel multinomial law (3.25).
Let $\Psi = \sum_i n_i|\omega_i\rangle \in \mathcal{N}[K]\big(\mathcal{D}(X)\big)$.
$$ \begin{array}{rcll}
Flrn =\!\!\ll pml(\Psi)
& = & Flrn =\!\!\ll \mathcal{D}(+)\Big(\bigotimes_i mn[n_i](\omega_i)\Big)
\\[0.7em]
& = & \displaystyle\sum_i \tfrac{n_i}{K}\cdot\Big(Flrn =\!\!\ll mn[n_i](\omega_i)\Big)
& \text{by Exercise 3.7.2, for } K = \sum_i n_i
\\[0.9em]
& = & \displaystyle\sum_i \tfrac{n_i}{K}\cdot\omega_i
& \text{by Theorem 3.3.5}
\\[0.9em]
& = & flat\Big(\sum_i \tfrac{n_i}{K}\,|\omega_i\rangle\Big)
\\[0.7em]
& = & flat\big(Flrn(\Psi)\big).
\end{array} $$
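On the running example $\Psi = 2|\omega\rangle + 1|\rho\rangle$ from Example 3.6.1, both sides of the theorem equal the mixture $\tfrac{2}{3}\cdot\omega + \tfrac{1}{3}\cdot\rho$; a small Python check, in our own encoding:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

omega = {'a': Fraction(1, 3), 'b': Fraction(2, 3)}
rho = {'a': Fraction(3, 4), 'b': Fraction(1, 4)}
dists = [omega, omega, rho]  # the multiset Psi = 2|omega> + 1|rho>, K = 3

# left-hand side: push frequentist learning Flrn forward along pml(Psi);
# Flrn of a drawn multiset of size 3 gives each element weight 1/3
lhs = Counter()
for xs in product(*[d.keys() for d in dists]):
    p = Fraction(1)
    for d, x in zip(dists, xs):
        p *= d[x]
    for x in xs:
        lhs[x] += p * Fraction(1, 3)

# right-hand side: flat(Flrn(Psi)) = (2/3)*omega + (1/3)*rho
rhs = {x: Fraction(2, 3) * omega[x] + Fraction(1, 3) * rho[x] for x in omega}
assert dict(lhs) == rhs
```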


We include a result that generalises Exercise 3.6.2. It is a discrete version
of [34, Lem. 13]. The proof below uses the ‘mean’ of multinomials, from
Proposition 3.3.7.

Proposition 3.7.3. Consider inclusions $\mathcal{D}(X) \hookrightarrow \mathcal{M}(X)$ and $\mathcal{N}[K](X) \hookrightarrow \mathcal{M}(X)$, together with the multiset flatten map $flat \colon \mathcal{M}\big(\mathcal{M}(X)\big) \to \mathcal{M}(X)$. Via
these inclusions, one has, for $\Psi \in \mathcal{N}[K]\big(\mathcal{D}(X)\big)$,
$$ flat\big(pml(\Psi)\big) \;=\; flat(\Psi). $$




Proof. For $\Psi = n_1|\omega_1\rangle + \cdots + n_k|\omega_k\rangle \in \mathcal{N}[K]\big(\mathcal{D}(X)\big)$,
$$ \begin{array}{rcl}
flat\big(pml(\Psi)\big)
& = & flat\bigg(\displaystyle\sum_{\varphi_1\in\mathcal{N}[n_1](X),\,\ldots,\,\varphi_k\in\mathcal{N}[n_k](X)} \Big(\prod_i mn[n_i](\omega_i)(\varphi_i)\Big)\,\Big|\,\sum_i \varphi_i\,\Big\rangle\bigg)
\\[1.3em]
& = & \displaystyle\sum_{\varphi_1\in\mathcal{N}[n_1](X),\,\ldots,\,\varphi_k\in\mathcal{N}[n_k](X)} \Big(\prod_i mn[n_i](\omega_i)(\varphi_i)\Big)\cdot\Big(\sum_i \varphi_i\Big)
\\[1.3em]
& = & \displaystyle\sum_{\varphi_1,\ldots,\varphi_k} \Big(\prod_i mn[n_i](\omega_i)(\varphi_i)\Big)\cdot\varphi_1 \;+\;\cdots\;+\; \sum_{\varphi_1,\ldots,\varphi_k} \Big(\prod_i mn[n_i](\omega_i)(\varphi_i)\Big)\cdot\varphi_k
\\[1.3em]
& = & \displaystyle\sum_{\varphi_1\in\mathcal{N}[n_1](X)} mn[n_1](\omega_1)(\varphi_1)\cdot\varphi_1 \;+\;\cdots\;+\; \sum_{\varphi_k\in\mathcal{N}[n_k](X)} mn[n_k](\omega_k)(\varphi_k)\cdot\varphi_k
\\[1.3em]
& = & n_1\cdot\omega_1 + \cdots + n_k\cdot\omega_k \qquad\text{by Proposition 3.3.7}
\\[0.5em]
& = & flat(\Psi).
\end{array} $$
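Concretely, on Exercise 3.6.2 this says that the pml-expected multiset equals $2\cdot\omega + \rho$, of total mass 3; a brief Python check (our own encoding):

```python
from fractions import Fraction
from itertools import product
from collections import Counter

omega = {'a': Fraction(1, 3), 'b': Fraction(2, 3)}
rho = {'a': Fraction(3, 4), 'b': Fraction(1, 4)}
dists = [omega, omega, rho]  # Psi = 2|omega> + 1|rho>

# flat(pml(Psi)): the pml-expected multiset, computed by brute force
expected = Counter()
for xs in product(*[d.keys() for d in dists]):
    p = Fraction(1)
    for d, x in zip(dists, xs):
        p *= d[x]
    for x in xs:
        expected[x] += p

# flat(Psi) = 2*omega + rho, a multiset of total mass 3
flat_psi = {x: 2 * omega[x] + rho[x] for x in omega}
assert dict(expected) == flat_psi
assert expected['a'] == Fraction(17, 12) and expected['b'] == Fraction(19, 12)
```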

The parallel multinomial law pml contains multinomial distributions. But it
also interacts with the multinomial channel, as described next.

Theorem 3.7.4. 1 The parallel multinomial law pml commutes with multinomials
in the following manner.

      D(D(X)) --mn[K]--> N[K](D(X))
         | flat               | pml
         v                    v
        D(X) ----mn[K]--> N[K](X)

(The horizontal and right-hand arrows are channels.)

2 There is a second form of exchange between pml and multinomials, as
expressed by the equality of the two composite functions
$$ \mathcal{D}(flat) \circ mn[L] \circ pml \;=\; \mathcal{D}(flat) \circ pml \circ \mathcal{N}[K]\big(mn[L]\big) \;\colon\; \mathcal{N}[K]\mathcal{D}(X) \longrightarrow \mathcal{D}\mathcal{N}[L\cdot K](X), $$
where the first composite passes through $\mathcal{D}\mathcal{N}[K](X)$ and $\mathcal{D}\mathcal{N}[L]\mathcal{N}[K](X)$,
and the second through $\mathcal{N}[K]\mathcal{D}\mathcal{N}[L](X)$ and $\mathcal{D}\mathcal{N}[K]\mathcal{N}[L](X)$.

Proof. 1 We use that $mn[K] = acc \mathbin{\circ\!\cdot} iid[K]$, see Theorem 3.3.1 (2), in:

      D(D(X)) --iid--> D(X)^K --acc--> N[K](D(X))
         | flat            | ⊗               | pml
         v                 v                 v
        D(X) ----iid---> X^K -----acc---> N[K](X)

The rectangle on the left commutes by Exercise 2.4.9, and the one on the
right by Proposition 3.7.1.

2 We show that precomposing both legs in the diagram with the accumulation
map $acc \colon \mathcal{D}(X)^K \to \mathcal{N}[K]\big(\mathcal{D}(X)\big)$ yields an equality. This suffices by
Proposition 3.1.1.
$$ \begin{array}{rcll}
\lefteqn{\mathcal{D}(flat) \circ mn[L] \circ pml \circ acc} \\[0.3em]
& \smash{\stackrel{(3.29)}{=}} & \mathcal{D}(flat) \circ mn[L] \circ \mathcal{D}(acc) \circ \bigotimes \\[0.4em]
& = & \mathcal{D}(flat) \circ \mathcal{D}\mathcal{N}[L](acc) \circ mn[L] \circ \bigotimes
& \text{by naturality of } mn[L] \\[0.4em]
& = & \mathcal{D}(flat) \circ \mathcal{D}\mathcal{N}[L](acc) \circ \big(mzip_K \mathbin{\circ\!\cdot} (mn[L] \otimes \cdots \otimes mn[L])\big) \\[0.2em]
& & \qquad\qquad\text{by a generalisation of Corollary 3.3.3} \\[0.4em]
& = & \mathcal{D}(flat) \circ \mathcal{D}\mathcal{N}[L](acc) \circ flat \circ \mathcal{D}(mzip_K) \circ \bigotimes \circ\, mn[L]^K \\[0.4em]
& = & flat \circ \mathcal{D}^2(flat) \circ \mathcal{D}^2\mathcal{N}[L](acc) \circ \mathcal{D}(mzip_K) \circ \bigotimes \circ\, mn[L]^K \\[0.4em]
& = & flat \circ \mathcal{D}(unit) \circ \mathcal{D}(+) \circ \bigotimes \circ\, mn[L]^K
& \text{by Proposition 3.2.5} \\[0.4em]
& = & \mathcal{D}(+) \circ \bigotimes \circ\, mn[L]^K \\[0.4em]
& = & \mathcal{D}(flat) \circ \mathcal{D}(acc) \circ \bigotimes \circ\, mn[L]^K
& \text{by Exercise 1.6.7} \\[0.4em]
& \smash{\stackrel{(3.29)}{=}} & \mathcal{D}(flat) \circ pml \circ acc \circ mn[L]^K \\[0.4em]
& = & \mathcal{D}(flat) \circ pml \circ \mathcal{N}[K]\big(mn[L]\big) \circ acc
& \text{by naturality of } acc.
\end{array} $$

We turn to pml and hypergeometric channels, which, as we have seen in
Theorem 3.4.1, are composites of draw-and-delete maps. We know from Proposition
3.3.8 that multinomial channels commute with draw-and-delete. The
same holds for the parallel multinomial law.

Proposition 3.7.5. The following diagram of channels commutes.

      N[K+1](D(X)) --DD--> N[K](D(X))
          | pml                 | pml
          v                     v
      N[K+1](X) -----DD----> N[K](X)

Proof. We use the probabilistic projection channel $ppr \colon X^{K+1} \to X^{K}$ from
Exercise 3.3.13 and its interaction with ⊗ in Exercise 3.3.14.
$$ \begin{array}{rcll}
DD \mathbin{\circ\!\cdot} pml \mathbin{\circ\!\cdot} acc
& = & DD \mathbin{\circ\!\cdot} acc \mathbin{\circ\!\cdot} \bigotimes & \text{by (3.26)} \\[0.3em]
& = & acc \mathbin{\circ\!\cdot} ppr \mathbin{\circ\!\cdot} \bigotimes & \text{by Exercise 3.3.13} \\[0.3em]
& = & acc \mathbin{\circ\!\cdot} \bigotimes \mathbin{\circ\!\cdot} ppr & \text{by Exercise 3.3.14} \\[0.3em]
& = & pml \mathbin{\circ\!\cdot} acc \mathbin{\circ\!\cdot} ppr & \text{by (3.26)} \\[0.3em]
& = & pml \mathbin{\circ\!\cdot} DD \mathbin{\circ\!\cdot} acc & \text{by Exercise 3.3.13.}
\end{array} $$
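The proposition can be tested by brute force on the multiset $2|\omega\rangle + 1|\rho\rangle$ of Example 3.6.1: deleting a uniformly drawn distribution first and then applying pml gives the same result as applying pml first and then draw-and-delete. A Python sketch in our own encoding:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

def pml(dists):
    # brute-force pml on a list of distributions (one entry per occurrence);
    # multisets of drawn elements are encoded as sorted tuples
    out = {}
    for xs in product(*[d.keys() for d in dists]):
        p = Fraction(1)
        for d, x in zip(dists, xs):
            p *= d[x]
        key = tuple(sorted(xs))
        out[key] = out.get(key, Fraction(0)) + p
    return out

def dd(phi):
    # draw-and-delete on a multiset given as a sorted tuple
    out = {}
    n = len(phi)
    for i in range(n):
        rest = phi[:i] + phi[i + 1:]
        out[rest] = out.get(rest, Fraction(0)) + Fraction(1, n)
    return out

omega = {'a': Fraction(1, 3), 'b': Fraction(2, 3)}
rho = {'a': Fraction(3, 4), 'b': Fraction(1, 4)}

# left: draw-and-delete on the multiset 2|omega> + 1|rho>, then pml
lhs = Counter()
for dists, p in (([omega, rho], Fraction(2, 3)), ([omega, omega], Fraction(1, 3))):
    for phi, q in pml(dists).items():
        lhs[phi] += p * q

# right: pml first, then draw-and-delete on the resulting multisets
rhs = Counter()
for phi, p in pml([omega, omega, rho]).items():
    for psi, q in dd(phi).items():
        rhs[psi] += p * q

assert dict(lhs) == dict(rhs)
```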

The next consequence expresses a fundamental relationship between multinomial
and hypergeometric distributions.

Corollary 3.7.6. The parallel multinomial law commutes with the hypergeometric
channel: for $L \geq K$ one has:

      N[L](D(X)) --hg[K]--> N[K](D(X))
          | pml                  | pml
          v                      v
      N[L](X) -----hg[K]----> N[K](X)

Proof. Theorem 3.4.1 shows that the hypergeometric distribution can be expressed
via iterated draw-and-deletes. Hence the result follows from (iterated
application of) Proposition 3.7.5.
We continue by showing that the parallel multinomial law pml commutes with
the unit and flatten operations of the distribution monad. This shows that pml
is an instance of what is called a distributive law in category theory. Such laws
are important in combining different forms of computation. A notorious result,
noted some twenty years ago by Gordon Plotkin, is that the powerset monad P
does not distribute over the probability distribution monad D. He never published
this important no-go result himself. Instead, it appeared in [161, 162]
(with full credits). This negative result is interpreted as: there is no semantically
solid way to combine non-deterministic and probabilistic computation.
The fact that a distributive law for multisets and distributions does exist shows
that multiset computations and probability can be combined. Indeed, in Corollary
3.7.8 we shall see that the K-sized multiset functor N[K] can be ‘lifted’
to the category of probabilistic channels.
But first we have to show that pml is a distributive law. We have already
seen in Exercise 3.6.3 that it is natural.
Proposition 3.7.7. The parallel multinonial law pml is a distributive law of
the K-sized multiset functor N[K] over the distribution monad D. This means
that pml commutes with the unit and flatten operations of D, as expressed by
the following two diagram.
N[K](X) unit
N[K](unit)
 #
N[K] D(X)
 pml
/ D N[K](X)

N[K] D2 (X)
 pml
/ D N[K] D(X) D(pml)
/ D2 N[K](X)
N[K](flat)
  flat
N[K] D(X)
 pml
/ D N[K](X)
N
Proof. In Exercise 2.4.10 we have seen that the big tensor : D(X)K →

193
194 Chapter 3. Drawing from an urn

D(X K ) is a distributive law. These properties will be used to show that pml is
a distributive law too. We exploit the uniqueness property of the third formu-
lation 3.26.
    pml ◦ N[K](unit) ◦ acc
      = pml ◦ acc ◦ unit^K        by naturality of acc, see Exercise 1.6.6
      = D(acc) ◦ ⊗ ◦ unit^K       by (3.26)
      = D(acc) ◦ unit             via the first diagram in Exercise 2.4.10
      = unit ◦ acc                by naturality of unit.
Similarly for the flatten-diagram:

    flat ◦ D(pml) ◦ pml ◦ acc
      = flat ◦ D(pml) ◦ D(acc) ◦ ⊗      by (3.26)
      = flat ◦ D(D(acc)) ◦ D(⊗) ◦ ⊗     again by (3.26)
      = D(acc) ◦ flat ◦ D(⊗) ◦ ⊗        by naturality of flat
      = D(acc) ◦ ⊗ ◦ flat^K             via Exercise 2.4.10
      = pml ◦ acc ◦ flat^K              once again by (3.26)
      = pml ◦ N[K](flat) ◦ acc          by naturality of acc.
This result says that pml is a so-called K`-law of the functor N[K] over the
monad D. In general such a K`-law corresponds to a lifting of the functor to
the Kleisli category of channels for the monad, see [71, 84] for details.
Corollary 3.7.8. The K-sized multiset functor N[K] : Sets → Sets lifts to a
functor N[K] : Chan(D) → Chan(D).
The fact that we use the same notation for two different functors may be
confusing, but usually the context will tell which one is meant. These functors
act in the same way on objects (sets), but differ on morphisms, see the end of
the proof below.

Proof. The functor N[K] : Chan(D) → Chan(D) is defined on objects/sets
as X ↦ N[K](X). Its action on morphisms f : X → Y in Chan(D) is more
interesting. Such an f is a function X → D(Y). We have to produce a function
N[K](X) → D(N[K](Y)). This is done via the parallel multinomial law, as the
composite:

    N[K](X) --N[K](f)--> N[K](D(Y)) --pml--> D(N[K](Y)).
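To make this lifting concrete, here is a minimal Python sketch (my own encoding, not the book's notation): distributions are dicts, multisets of size K are sorted K-tuples, pml draws independently from each distribution in a multiset and accumulates the results, and `lift` is the composite pml ∘ N[K](f). The channel `f` below is a made-up example.

```python
from collections import Counter
from itertools import product
from fractions import Fraction as F

def pml(dists):
    """Parallel multinomial law: a list of distributions (one entry per
    occurrence in the multiset) becomes a distribution over multisets,
    represented as sorted tuples."""
    out = Counter()
    for combo in product(*(d.items() for d in dists)):
        key = tuple(sorted(x for x, _ in combo))   # accumulate: forget order
        p = F(1)
        for _, q in combo:
            p *= q
        out[key] += p
    return dict(out)

def lift(f, K):
    """Lift a channel f : X -> D(Y) to N[K](X) -> D(N[K](Y)),
    as the composite pml after N[K](f)."""
    def lifted(phi):                               # phi: tuple of K elements
        assert len(phi) == K
        return pml([f(x) for x in phi])
    return lifted

f = lambda x: {x: F(1, 2), x + 1: F(1, 2)}         # a hypothetical channel
d = lift(f, 2)((0, 0))
assert d == {(0, 0): F(1, 4), (0, 1): F(1, 2), (1, 1): F(1, 4)}
```

The assertion shows the multiset 2|0⟩ being sent to the distribution over multisets obtained from two independent coin-like draws.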

Next we show that parallel multinomials and multizipping commute. This
is a non-trivial technical result, which plays a crucial role in what follows.

3.7. The parallel multinomial law: basic properties 195

Lemma 3.7.9. Parallel multinomials commute with multizipping, in the sense
that the following equation of channels holds:

    mzip ◦· (pml ⊗ pml) = pml ◦· N[K](⊗) ◦· mzip,

as channels N[K](D(X)) × N[K](D(Y)) → N[K](X × Y).

Proof. The result follows from a big diagram chase in which the mzip operations
on the left and on the right are expanded, according to (3.7), via arrangement,
zipping and accumulation. In the expanded diagram, the upper rectangle
commutes by Proposition 3.7.1 and the middle one by Lemma 2.4.8 (3). The
lower-left subdiagram commutes by naturality of acc and the lower-right one
via the third definition of pml in (3.26).

Theorem 3.7.10. The lifted functor N[K] : Chan(D) → Chan(D) commutes
with multizipping: for channels f : X → U and g : Y → V one has:

    mzip ◦· (N[K](f) ⊗ N[K](g)) = N[K](f ⊗ g) ◦· mzip.        (3.30)

In combination with the unit and associativity of Proposition 3.2.2 (2) and (5)
this means that the lifted functor N[K] : Chan(D) → Chan(D) is monoidal
via mzip.


Proof. This result is rather subtle, since f, g are used as channels. So when we
write N[K](f) we mean application of the lifted functor N[K] : Chan(D) →
Chan(D), as described in the last line of the proof of Corollary 3.7.8, producing
another channel. The left-hand side of the equation (3.30) thus expands as in the
first equation below.

    mzip ◦· (N[K](f) ⊗ N[K](g))
      = mzip ◦· (pml ⊗ pml) ◦ (N[K](f) × N[K](g))
      = pml ◦· N[K](⊗) ◦· mzip ◦ (N[K](f) × N[K](g))   by Lemma 3.7.9
      = pml ◦· N[K](⊗) ◦· N[K](f × g) ◦· mzip          by Proposition 3.2.2 (1)
      = pml ◦· N[K](f ⊗ g) ◦· mzip
      = N[K](f ⊗ g) ◦· mzip.

For this result we really need the multizip operation mzip. One may think
that one can use tensors ⊗ instead, but the tensor-version of Lemma 3.7.9 does
not hold, see Exercise 3.7.5 below.
Next we show that the other operations are natural with respect to channels,
namely arrangement, accumulation, and draw-delete.

Lemma 3.7.11. Arrangement, accumulation, and draw-delete are natural
transformations between the lifted functors Chan → Chan, in the situations:

    arr : N[K] ⇒ (−)^K        acc : (−)^K ⇒ N[K]        DD : N[K+1] ⇒ N[K].

Proposition 3.2.2 (4) and Exercise 3.3.15 (3) say that the natural transformations
arr and DD are monoidal.

Proof. Let f : X → Y be a channel. We need to prove f^K ◦· arr = arr ◦· N[K](f);
expressed via functions this reads (f^K =≪ (−)) ◦ arr = (arr =≪ (−)) ◦ N[K](f).
This follows from a diagram chase in which the upper rectangle is naturality of
arr for ordinary functions, see Exercise 2.2.7, and the lower rectangle commutes
by Proposition 3.7.1.

For accumulation the situation is a bit simpler. The required equality
acc ◦· f^K = N[K](f) ◦· acc follows from ordinary naturality of acc, together
with the third formulation (3.26) of pml.

For naturality of draw-delete we need to prove DD ◦· N[K+1](f) = N[K](f) ◦· DD,
that is, (DD =≪ (−)) ◦ N[K+1](f) = (N[K](f) =≪ (−)) ◦ DD. Here the upper part
of the chase is naturality of draw-delete, see Exercise 2.2.8, and the lower
rectangle commutes by Proposition 3.7.5.

We started this section with a 3 × 2 table (3.3) with the six options for
drawing balls from an urn, namely ordered or unordered, and with replacement,
deletion or addition. We have described these six draw maps as channels, for
instance of the form U0 : D(X) → N[K](X), and we saw at various stages
that these maps are natural in X. But this meant naturality with respect to
functions. Now that we know that N[K] is a functor Chan → Chan, we can
also ask if these draw maps are natural with respect to channels.
This turns out to be the case. Besides N[K] there are other functors involved
in these draw channels, namely distribution D and power (−)^K. They also have
to be lifted to functors Chan → Chan. For D we recall Exercise 1.9.8, which
tells us that D lifts to D : Chan → Chan. The power functor (−)^K lifts by using
that the big tensor ⊗ is a distributive law, see Exercise 2.4.10.

Theorem 3.7.12. The following four draw maps, out of the six in Table 3.3,
are natural w.r.t. channels, and thus form natural transformations between
lifted functors Chan → Chan:

    K-sized draws    with replacement             with deletion (L ≥ K)

    ordered          O0 : D ⇒ (−)^K               O− : N[L] ⇒ (−)^K
                     "independent identical"                             (3.31)
    unordered        U0 : D ⇒ N[K]                U− : N[L] ⇒ N[K]
                     "multinomial"                "hypergeometric"

These natural transformations are all monoidal, by Lemma 2.4.8 (2), by Corollaries
3.3.3 and 3.4.2 (7), and finally by Proposition 3.2.2 (4).

The Pólya channel from Section 3.4 does not fit in this table since it is not
natural w.r.t. channels. Intuitively, this can be explained from the fact that Pólya
involves copying, and copying does not commute with channels, as we have
seen early on in Exercise 2.5.10. Pólya channels are natural w.r.t. functions,
see Exercise 3.5.3, and indeed, functions do commute with copying, see Exercise 2.5.12.

Proof. Naturality of O0 = iid in the above table (3.31) is given by the diagram
on the right in Lemma 2.4.8 (1). Combining this fact with naturality of acc
from Lemma 3.7.11 gives naturality for U0 = mn[K]. This uses that mn[K] =
acc ◦· iid, see Theorem 3.3.1.
Naturality of U− = hg[K] follows from (channel) naturality of draw-delete
in Lemma 3.7.11, since hg[K] is an iteration of draw-delete's, see Theorem 3.4.1.
For naturality of O− we use the equation O− = arr ◦· hg[K] from the middle
diagram in (3.19), see Proposition 3.5.9 (3). Hence we are done by (channel)
naturality of arr from Lemma 3.7.11.
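The naturality of the multinomial channel U0 = mn[K] with respect to channels can be checked numerically. The following Python sketch (with a made-up channel `f` and state `omega`; encodings are my own) verifies mn[2] ◦· f = N[2](f) ◦· mn[2], where the lifted N[2](f) is computed via the parallel multinomial law.

```python
from collections import Counter
from itertools import product
from fractions import Fraction as F

def mn(K, omega):
    """Multinomial channel: a distribution (dict) becomes a distribution
    over K-sized multisets, represented as sorted tuples."""
    out = Counter()
    for xs in product(omega, repeat=K):
        p = F(1)
        for x in xs:
            p *= omega[x]
        out[tuple(sorted(xs))] += p
    return dict(out)

def pml(dists):
    """Parallel multinomial law on a list of distributions (dicts)."""
    out = Counter()
    for combo in product(*(d.items() for d in dists)):
        p = F(1)
        for _, q in combo:
            p *= q
        out[tuple(sorted(x for x, _ in combo))] += p
    return dict(out)

f = {'a': {0: F(1, 2), 1: F(1, 2)}, 'b': {0: F(1), 1: F(0)}}  # a channel X -> D(Y)
omega = {'a': F(3, 4), 'b': F(1, 4)}

# Left-hand side: transform omega along f, then draw multinomially.
push = Counter()
for x, p in omega.items():
    for y, q in f[x].items():
        push[y] += p * q
lhs = mn(2, dict(push))

# Right-hand side: draw multinomially, then apply the lifted N[2](f) = pml ∘ N[2](f).
rhs = Counter()
for phi, p in mn(2, omega).items():
    for psi, q in pml([f[x] for x in phi]).items():
        rhs[psi] += p * q

assert lhs == {k: v for k, v in rhs.items() if v}
```

The exact-fraction equality of both sides is an instance of the "multinomial" naturality square in (3.31).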

Exercises
3.7.1 Consider the two distributions ω, ρ in (3.23) and check for yourself the
following equation, which is an instance of Theorem 3.7.2.

    Flrn =≪ pml(2|ω⟩ + 1|ρ⟩) = 17/36 |a⟩ + 19/36 |b⟩ = (flat ◦ Flrn)(2|ω⟩ + 1|ρ⟩).


3.7.2 Let Ω ∈ D(N[K](X)) and Θ ∈ D(N[L](X)) be given. Use Exercise 2.3.3
to prove that:

    Flrn =≪ D(+)(Ω ⊗ Θ) = K/(K+L) · (Flrn =≪ Ω) + L/(K+L) · (Flrn =≪ Θ).
3.7.3 Check that the construction of Corollary 3.7.8 indeed yields a functor
N[K] : Chan(D) → Chan(D).
3.7.4 Show that the lifted functors N[K] : K`(D) → K`(D) commute with
sums of multisets: for a channel f : X → Y,

    N[K+L](f) ◦· + = + ◦· (N[K](f) ⊗ N[L](f)),

as channels N[K](X) × N[L](X) → N[K+L](Y).

3.7.5 The parallel multinomial law pml does not commute with tensors (of
multisets and distributions): the east-south route ⊗ ◦· (pml ⊗ pml) differs
in general from the south-east route pml ◦· N[K·L](⊗) ◦· ⊗, as channels
N[K](D(X)) × N[L](D(Y)) → N[K·L](X × Y). Take for instance X = {a, b},
Y = {0, 1} with K = 2, L = 1, with distributions and multisets:

    ω = 3/4 |a⟩ + 1/4 |b⟩        ϕ = 2|ω⟩
    ρ = 2/3 |0⟩ + 1/3 |1⟩        ψ = 1|ρ⟩


1 Calculate:

    (⊗ ◦· (pml ⊗ pml))(ϕ, ψ)
      = 3/8 |2|a,0⟩⟩ + 1/4 |1|a,0⟩ + 1|b,0⟩⟩ + 1/24 |2|b,0⟩⟩
        + 3/16 |2|a,1⟩⟩ + 1/8 |1|a,1⟩ + 1|b,1⟩⟩ + 1/48 |2|b,1⟩⟩.

2 And also:

    (pml ◦· N[K·L](⊗) ◦· ⊗)(ϕ, ψ)
      = 1/4 |2|a,0⟩⟩ + 1/16 |2|a,1⟩⟩ + 1/36 |2|b,0⟩⟩ + 1/144 |2|b,1⟩⟩
        + 1/4 |1|a,0⟩ + 1|a,1⟩⟩ + 1/6 |1|a,0⟩ + 1|b,0⟩⟩
        + 1/12 |1|a,0⟩ + 1|b,1⟩⟩ + 1/12 |1|a,1⟩ + 1|b,0⟩⟩
        + 1/24 |1|a,1⟩ + 1|b,1⟩⟩ + 1/36 |1|b,0⟩ + 1|b,1⟩⟩.
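The two routes of this exercise can be compared mechanically. The following Python sketch (my own encoding, with exact fractions) reproduces the difference between items 1 and 2.

```python
from collections import Counter
from itertools import product
from fractions import Fraction as F

def pml(dists):
    """Parallel multinomial law on a list of distributions (dicts);
    multisets are represented as sorted tuples."""
    out = Counter()
    for combo in product(*(d.items() for d in dists)):
        p = F(1)
        for _, q in combo:
            p *= q
        out[tuple(sorted(x for x, _ in combo))] += p
    return dict(out)

omega = {'a': F(3, 4), 'b': F(1, 4)}
rho = {0: F(2, 3), 1: F(1, 3)}

# East-south: apply pml to both multisets, then tensor the resulting multisets.
lhs = Counter()
for (m1, p1), (m2, p2) in product(pml([omega, omega]).items(), pml([rho]).items()):
    lhs[tuple(sorted(product(m1, m2)))] += p1 * p2   # multiset tensor m1 ⊗ m2

# South-east: tensor the distributions first, then apply pml.
omega_rho = {(x, y): p * q for x, p in omega.items() for y, q in rho.items()}
rhs = pml([omega_rho, omega_rho])

assert lhs[(('a', 0), ('a', 0))] == F(3, 8)   # as in item 1
assert rhs[(('a', 0), ('a', 0))] == F(1, 4)   # as in item 2
assert dict(lhs) != rhs                       # the two routes differ
```
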

3.7.6 Show that frequentist learning forms a natural transformation
Flrn : N[K+1] ⇒ id between the lifted functors N[K+1], id : Chan → Chan.

3.8 Parallel multinomials as law of monads


There is one thing we still wish to do in relation to the parallel multinomial
law. We have described it as a map pml[K] : N[K]D(X) → DN[K](X) in a
restricted manner, namely restricted to multisets of size K. What if we drop
this restriction?
This question makes sense since the fourth formulation (3.29) describes pml
simply as a map ND(X) → DN(X), without size restrictions. It is in fact
pml[K], for the appropriate size K determined by the input. Indeed, we describe
pml in (3.29) as:

    pml(ϕ) = pml[‖ϕ‖](ϕ),    where, recall, ‖ϕ‖ = Σ_x ϕ(x).

This allows us to see that the unrestricted version pml : ND(X) → DN(X) is
natural in X, using that N preserves size, see Exercise 1.5.2. Thus, let f : X →
Y and ϕ = Σ_i n_i |ω_i⟩ ∈ N(D(X)) with ‖ϕ‖ = Σ_i n_i = K. Then:

    ‖ND(f)(ϕ)‖ = ‖Σ_i n_i |D(f)(ω_i)⟩‖ = Σ_i n_i = K.


Hence:

    pml(ND(f)(ϕ)) = pml[‖ND(f)(ϕ)‖](ND(f)(ϕ))
                  = pml[K](N[K]D(f)(ϕ))
                  = DN[K](f)(pml[K](ϕ))        by naturality of pml[K]
                  = DN(f)(pml(ϕ)).

We have seen in Proposition 3.7.7 that pml interacts appropriately with the
unit and flatten operations of the distribution monad D. Since we now use pml
in unrestricted form we can also look at interaction with the unit and flatten of
the (natural) multiset monad N.

Lemma 3.8.1. The parallel multinomial law pml : ND(X) → DN(X) commutes
in the following way with the unit and flatten operations of the monad N:

    pml ◦ unit = D(unit) : D(X) → DN(X)
    pml ◦ flat = D(flat) ◦ pml ◦ N(pml) : N²D(X) → DN(X).

Moreover, the law pml : ND(X) → DN(X) is a homomorphism of monoids.

Proof. The equation for the units is easy. For ω ∈ D(X) we have, by Exercise 3.6.1,

    (pml ◦ unit)(ω) = pml(1|ω⟩) = mn[1](ω) = Σ_{x∈X} ω(x) |1|x⟩⟩ = D(unit)(ω).

For flatten we have to do a bit more work. We recall the above N-algebra
α : N(D(M)) → D(M) from (3.27), for a commutative monoid M. We also recall
that by Proposition 1.4.5 the following two equations hold, where f : M1 → M2
is a map of (commutative) monoids:

    α ◦ N(α) = α ◦ flat : N²D(M) → D(M)
    α ◦ ND(f) = D(f) ◦ α : ND(M1) → D(M2).        (3.32)


We use the fourth formulation (3.29), namely pml = α ◦ ND(unit). Then:

    D(flat) ◦ pml ◦ N(pml)
      = D(flat) ◦ α ◦ ND(unit) ◦ N(pml)     by (3.29)
      = α ◦ ND(flat) ◦ ND(unit) ◦ N(pml)    by (3.32), on the right
      = α ◦ N(pml)                          by a flatten-unit law
      = α ◦ N(α ◦ ND(unit))                 by (3.29) again
      = α ◦ flat ◦ NND(unit)                by (3.32), on the left
      = α ◦ ND(unit) ◦ flat                 by naturality
      = pml ◦ flat                          once again by (3.29).

In order to show that the parallel multinomial law pml : ND(X) → DN(X) is a
map of monoids it suffices by Proposition 1.4.5 to show that the diagram (1.21)
commutes. Expanding pml = α ◦ ND(unit), this is the outer rectangle in:

    α ◦ N(α) ◦ N²D(unit) = α ◦ ND(unit) ◦ flat,

that is, α ◦ N(pml) = pml ◦ flat. The rectangle on the left commutes by
naturality of flat; the one on the right is an instance of the rectangle on the
left in (3.32).

Theorem 3.8.2. The composite DN : Sets → Sets is a monad, with unit : X →
DN(X) and flat : DNDN(X) → DN(X) operations:

    unit ≔ D(unit) ◦ unit = unit ◦ unit,
        via the diamond X → D(X) → DN(X), equivalently X → N(X) → DN(X);

    flat ≔ flat ◦ D²(flat) ◦ D(pml) = D(flat) ◦ flat ◦ D(pml),
        via the diamond DNDN(X) → D²N²(X) → D²N(X) → DN(X),
        equivalently DNDN(X) → D²N²(X) → DN²(X) → DN(X).

Proof. This is a standard result in category theory, originally from Beck, see
e.g. [12, 9, 71]. The diamonds in the above descriptions of unit and flatten
commute by naturality.


We now know that the composite functor DN : Sets → Sets is a monad.
Our next step is to show that there is a map of monads DN ⇒ M, where
M : Sets → Sets is the multiset monad, with non-negative numbers as
multiplicities, and not just natural numbers as for natural multisets in N. This
map of monads is introduced in [34], under the name 'intensity'. Therefore we
shall write it as its : DN ⇒ M.
There are obvious inclusions D(X) ↪ M(X) and N(X) ↪ M(X) that we
used before. We now need to formalise the situation and so we shall use explicit
names for these inclusions, as natural transformations, namely:

    σ : D ⇒ M        τ : N ⇒ M.

These σ and τ are maps of monads. They are used implicitly for instance in
the equation flat(pml(Ψ)) = flat(Ψ) in Proposition 3.7.3. As an equation of the
corresponding composites, this reads:

    flat ◦ τ ◦ N(σ) = flat ◦ M(τ) ◦ σ ◦ pml : ND(X) → M(X).        (3.33)
Theorem 3.8.3 (From [34]). The intensity natural transformation its : DN ⇒ M
has components:

    its ≔ flat ◦ M(τ) ◦ σ = flat ◦ σ ◦ D(τ) : DN(X) → M(X),        (3.34)

via the diamond DN(X) → MN(X) → MM(X) → M(X), equivalently
DN(X) → DM(X) → MM(X) → M(X). This intensity its is a map of monads.


This definition (3.34) looks impressive, but for practical purposes we can just
write its(Ψ) = flat(Ψ), leaving the inclusions σ, τ implicit. However, in the proof
below we will be very precise and make them explicit.
Proof. Commutation of intensity with units is easy, with unit_DN for the monad
DN from Theorem 3.8.2:

    its ◦ unit_DN = flat_M ◦ M(τ) ◦ σ ◦ D(unit_N) ◦ unit_D    by (3.34)
                  = flat_M ◦ M(τ) ◦ M(unit_N) ◦ σ ◦ unit_D    by naturality of σ
                  = flat_M ◦ M(unit_M) ◦ unit_M               since τ, σ are maps of monads
                  = unit_M.

Commutation with flatten maps is much more laborious and involves a long
calculation. Figure 3.2 provides all required equational steps, using the standard
equations for (maps of) monads.

    its ◦ flat_DN
      = flat_M ◦ σ ◦ D(τ) ◦ D(flat_N) ◦ flat_D ◦ D(pml)                 by (3.34)
      = flat_M ◦ σ ◦ D(flat_M) ◦ DM(τ) ◦ D(τ) ◦ flat_D ◦ D(pml)
      = flat_M ◦ M(flat_M) ◦ M²(τ) ◦ M(τ) ◦ σ ◦ flat_D ◦ D(pml)
      = flat_M ◦ flat_M ◦ M²(τ) ◦ M(τ) ◦ flat_M ◦ σ ◦ D(σ) ◦ D(pml)
      = flat_M ◦ flat_M ◦ flat_M ◦ M³(τ) ◦ M²(τ) ◦ σ ◦ D(σ) ◦ D(pml)
      = flat_M ◦ flat_M ◦ M(flat_M) ◦ M³(τ) ◦ σ ◦ DM(τ) ◦ D(σ) ◦ D(pml)
      = flat_M ◦ flat_M ◦ M²(τ) ◦ M(flat_M) ◦ σ ◦ DM(τ) ◦ D(σ) ◦ D(pml)
      = flat_M ◦ flat_M ◦ σ ◦ DM(τ) ◦ D(flat_M) ◦ DM(τ) ◦ D(σ) ◦ D(pml)
      = flat_M ◦ M(flat_M) ◦ σ ◦ DM(τ) ◦ D(flat_M) ◦ D(τ) ◦ DN(σ)       by (3.33)
      = flat_M ◦ σ ◦ D(flat_M) ◦ D(flat_M) ◦ DM²(τ) ◦ D(τ) ◦ DN(σ)
      = flat_M ◦ σ ◦ D(flat_M) ◦ DM(flat_M) ◦ DM²(τ) ◦ D(τ) ◦ DN(σ)
      = flat_M ◦ M(flat_M) ◦ σ ◦ DM(flat_M) ◦ DM²(τ) ◦ D(τ) ◦ DN(σ)
      = flat_M ◦ flat_M ◦ σ ◦ D(τ) ◦ DN(flat_M) ◦ DNM(τ) ◦ DN(σ)
      = flat_M ◦ its ◦ DN(its)                                          by (3.34)

Figure 3.2 Equational proof that the intensity natural transformation its
from (3.34) commutes with flattens, as part of the proof of Theorem 3.8.3.

There is more to say about this situation.

Proposition 3.8.4. For each set X, the intensity map is a homomorphism of
monoids, so that we can rephrase (3.33) as a triangle of monoid homomorphisms:

    pml : (ND(X), +, 0) → (DN(X), +, 0),    its : (DN(X), +, 0) → (M(X), +, 0),

whose composite its ◦ pml : (ND(X), +, 0) → (M(X), +, 0) is the map from (3.33).

Proof. We recall that the monoid structure on DN(X) is defined in Proposition
2.4.5, and induces the map α : NDN(X) → DN(X) used above. The zero
element in DN(X) is 1|0⟩, where 0 ∈ N(X) is the empty multiset. It satisfies,
for x ∈ X,

    its(1|0⟩)(x) = flat(1|0⟩)(x) = 1 · 0(x) = 0.

Hence its(1|0⟩) = 0.


For distributions ω, ρ ∈ DN(X) we have its(ω + ρ) = its(ω) + its(ρ) since
for each x ∈ X,

    its(ω + ρ)(x) = flat(ω + ρ)(x)
      = Σ_{ϕ∈N(X)} (ω + ρ)(ϕ) · ϕ(x)
      = Σ_{ϕ∈N(X)} D(+)(ω ⊗ ρ)(ϕ) · ϕ(x)        by (2.18)
      = Σ_{ψ,χ∈N(X)} ω(ψ) · ρ(χ) · (ψ + χ)(x)
      = Σ_{ψ,χ∈N(X)} ω(ψ) · ρ(χ) · (ψ(x) + χ(x))
      = (Σ_{χ∈N(X)} ρ(χ)) · (Σ_{ψ∈N(X)} ω(ψ) · ψ(x))
        + (Σ_{ψ∈N(X)} ω(ψ)) · (Σ_{χ∈N(X)} ρ(χ) · χ(x))
      = flat(ω)(x) + flat(ρ)(x)
      = (its(ω) + its(ρ))(x).
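As a small sanity check of the intensity map, and of its 'mean' reformulation in Exercise 3.8.2 below, here is a Python sketch (my own dict/tuple encodings): intensity turns a distribution over multisets into one multiset of expected multiplicities, and applied to a multinomial it recovers K · ω.

```python
from collections import Counter
from itertools import product
from fractions import Fraction as F

def mn(K, omega):
    """Multinomial channel mn[K]: a distribution (dict) becomes a
    distribution over K-sized multisets (sorted tuples)."""
    out = Counter()
    for xs in product(omega, repeat=K):
        p = F(1)
        for x in xs:
            p *= omega[x]
        out[tuple(sorted(xs))] += p
    return dict(out)

def its(Omega):
    """Intensity: its(Omega)(x) = sum_phi Omega(phi) * phi(x), turning a
    distribution over multisets into a single multiset with rational
    multiplicities."""
    m = Counter()
    for phi, p in Omega.items():
        for x in phi:                    # iterate with multiplicity
            m[x] += p
    return dict(m)

omega = {'a': F(3, 4), 'b': F(1, 4)}
assert its(mn(3, omega)) == {'a': F(9, 4), 'b': F(3, 4)}   # = 3 · ω
```
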

Exercises

3.8.1 Check that the intensity natural transformation its : DN ⇒ M de-


fined in (3.34) restricts to DN[K] ⇒ M[K], for each K ∈ N.
3.8.2 Check that the ‘mean’ results for multinomial, hypergeometric and
Pólya distributions from Propositions 3.3.7 and 3.4.7 can be reformu-
lated in terms of intensity as:

 
    its(mn[K](ω)) = K · ω
    its(hg[K](ψ)) = K · Flrn(ψ)
    its(pl[K](ψ)) = K · Flrn(ψ).

3.8.3 In Corollary 3.7.8 we have seen the lifted functor N[K] : Chan(D) →
Chan(D). The flatten operation flat : NN ⇒ N for (natural) mul-
tisets, from Subsection 1.4.2, restricts to N[K]N[L] ⇒ N[K · L],
making N[K] : Sets → Sets into what is called a graded monad, see
e.g. [123, 54]. Also the lifted functor N[K] : Chan(D) → Chan(D)
is such a graded monad, essentially by Lemma 3.8.1.
The aim of this exercise is to show that these (lifted) flattens do
not commute with multizip: the east-south route mzip ◦· (flat ⊗ flat)
differs in general from the south-east route flat ◦· N[K](mzip) ◦· mzip,
as channels N[K](N[L](X)) × N[K](N[L](Y)) → N[K·L](X × Y).

Elaborating a counterexample is quite intimidating, so we proceed
step by step. We take as spaces:

    X = {a, b}    and    Y = {0, 1}.

We use K = 2 and L = 3 for the multisets of multisets:

    Φ = 1|2|a⟩ + 1|b⟩⟩ + 1|1|a⟩ + 2|b⟩⟩ ∈ N[2](N[3](X))
    Ψ = 1|2|0⟩ + 1|1⟩⟩ + 1|3|1⟩⟩ ∈ N[2](N[3](Y)).

1 Check that going east-south yields:

    mzip(flat(Φ), flat(Ψ)) = mzip(3|a⟩ + 3|b⟩, 2|0⟩ + 4|1⟩)
      = 1/5 |3|a,1⟩ + 2|b,0⟩ + 1|b,1⟩⟩
        + 3/5 |1|a,0⟩ + 2|a,1⟩ + 1|b,0⟩ + 2|b,1⟩⟩
        + 1/5 |2|a,0⟩ + 1|a,1⟩ + 3|b,1⟩⟩.

2 The other path, south-east, will be done in several steps. Write Φ =
1|ϕ1⟩ + 1|ϕ2⟩ and Ψ = 1|ψ1⟩ + 1|ψ2⟩ where:

    ϕ1 = 2|a⟩ + 1|b⟩        ψ1 = 2|0⟩ + 1|1⟩
    ϕ2 = 1|a⟩ + 2|b⟩        ψ2 = 3|1⟩.

Show then that:

    mzip(Φ, Ψ) = 1/2 |1|(ϕ1, ψ1)⟩ + 1|(ϕ2, ψ2)⟩⟩ + 1/2 |1|(ϕ1, ψ2)⟩ + 1|(ϕ2, ψ1)⟩⟩.

3 Show next:

    mzip(ϕ1, ψ1) = 2/3 |1|a,0⟩ + 1|a,1⟩ + 1|b,0⟩⟩ + 1/3 |2|a,0⟩ + 1|b,1⟩⟩
    mzip(ϕ1, ψ2) = 1 |2|a,1⟩ + 1|b,1⟩⟩
    mzip(ϕ2, ψ1) = 2/3 |1|a,1⟩ + 2|b,0⟩⟩ + 1/3 |1|a,0⟩ + 1|b,0⟩ + 1|b,1⟩⟩
    mzip(ϕ2, ψ2) = 1 |1|a,1⟩ + 2|b,1⟩⟩.


4 Show now that:

    pml(1|mzip(ϕ1, ψ1)⟩ + 1|mzip(ϕ2, ψ2)⟩)
      = 2/3 |1|1|a,0⟩ + 1|a,1⟩ + 1|b,0⟩⟩ + 1|1|a,1⟩ + 2|b,1⟩⟩⟩
        + 1/3 |1|2|a,0⟩ + 1|b,1⟩⟩ + 1|1|a,1⟩ + 2|b,1⟩⟩⟩
    pml(1|mzip(ϕ1, ψ2)⟩ + 1|mzip(ϕ2, ψ1)⟩)
      = 1/3 |1|1|a,1⟩ + 2|b,0⟩⟩ + 1|2|a,1⟩ + 1|b,1⟩⟩⟩
        + 2/3 |1|1|a,0⟩ + 1|b,0⟩ + 1|b,1⟩⟩ + 1|2|a,1⟩ + 1|b,1⟩⟩⟩.

5 Finally, check that the south-east path yields:

    (flat ◦· N[2](mzip) ◦· mzip)(Φ, Ψ)
      = 2/3 |1|a,0⟩ + 2|a,1⟩ + 1|b,0⟩ + 2|b,1⟩⟩
        + 1/6 |2|a,0⟩ + 1|a,1⟩ + 3|b,1⟩⟩
        + 1/6 |3|a,1⟩ + 2|b,0⟩ + 1|b,1⟩⟩.

This differs from what we get in the first item, via the east-south route.

3.9 Ewens distributions


We conclude this chapter by looking into certain natural multisets and distri-
butions on them that arise in population biology. They were first introduced by
Ewens [46]. The aim is to capture situations where new mutations may intro-
duce new elements. One way to model this is to use so-called Hoppe urns, see
Example 3.9.7 below. A Hoppe draw is like a Pólya draw — where an addi-
tional ball of the same colour as the drawn ball is returned to the urn — but
where another ball is added with a new colour, not already occurring in the
urn. The latter ball of a new colour corresponds to a new mutation.
We start with the special multisets that are used in this setting.

Definition 3.9.1. An Ewens multiset with mean K is a natural multiset ϕ ∈
N(N) with supp(ϕ) ⊆ {1, 2, . . . , K} and mean(ϕ) = K. For such a multiset ϕ
the mean (average) is defined as:

    mean(ϕ) ≔ Σ_{1≤k≤K} ϕ(k) · k.

We shall write E(K) ⊆ N(N) for the set of Ewens multisets with mean K. It is
easy to see that ‖ϕ‖ ≤ K for ϕ ∈ E(K).


For instance, for mean K = 4 there are the following five Ewens multisets:

    4|1⟩    2|1⟩ + 1|2⟩    2|2⟩    1|1⟩ + 1|3⟩    1|4⟩.        (3.35)

Recall that we have been counting lists of coin values in Subsection 1.2.3.
There we saw in (1.7) all lists of coins with values 1, 2, 3, 4 that add up to 4.
As we see, the accumulations of these lists are the Ewens multisets of mean 4.
In these multiset representations we do not care about the order of the coins.
There are several possible ways of using an Ewens multiset ϕ = Σ_k n_k |k⟩ ∈
E(K), so that Σ_k n_k · k = K.

• We can think of nk as the number of coins with value k that are used in ϕ to
form an amount K.
• We can also think of nk as the number of tables with k customers in an
arrangement of K guests in a restaurant, see [6] (or [2, §11.19]) for an early
description, and also Exercise 3.9.7 below.
• In genetics, each gene has an ‘allelic type’ ϕ, where each nk is the number
of alleles appearing k times.
• An Ewens multiset with mean K is the type of a partition of the set {1, 2, . . . , K}
in [100]. It tells how many subsets in a partition have k elements.
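Ewens multisets of mean K are exactly the integer partitions of K, written by part-multiplicities (the coin lists above, with order forgotten). A small Python enumeration sketch (my own representation, with multisets as dicts):

```python
def ewens_multisets(K):
    """Enumerate E(K): multisets phi on {1..K} with sum_k phi(k)*k = K,
    represented as dicts {k: multiplicity}."""
    def parts(n, max_part):
        # all integer partitions of n with parts of size at most max_part
        if n == 0:
            yield []
            return
        for p in range(min(n, max_part), 0, -1):
            for rest in parts(n - p, p):
                yield [p] + rest
    for partition in parts(K, K):
        phi = {}
        for p in partition:
            phi[p] = phi.get(p, 0) + 1
        yield phi

assert len(list(ewens_multisets(4))) == 5   # the five multisets in (3.35)
```
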

In this section we are interested in distributions on Ewens multisets, just like


the earlier multinomial / hypergeometric / Pólya draw distributions are distri-
butions on multisets.
We start with a general 'multiplicity count' construction for turning a multiset
of size K, on an arbitrary space X, into an Ewens multiset of mean K, with
support in {1, . . . , K}. This multiplicity count produces a multiset Σ_k n_k |k⟩
where n_k is the number of elements in the original multiset that occur k times.
As mentioned above, an Ewens multiset can be seen as a type, capturing how
many elements occur how many times. It is a type, as an abstraction, just like
the size of a multiset can also be seen as a type. The multiplicity count function
calculates this type of an arbitrary multiset, as an Ewens multiset.

Definition 3.9.2. Let X be an arbitrary set. For each number K there is a
multiplicity count function mc : N[K](X) → E(K), given by:

    mc(ϕ) ≔ Σ_{1≤k≤‖ϕ‖} |ϕ⁻¹(k)| · |k⟩.

Thus, the function mc is defined via the size |−| of inverse image subsets,
on k ≥ 1, as:

    mc(ϕ)(k) ≔ |ϕ⁻¹(k)| = |{x ∈ X | ϕ(x) = k}|.        (3.36)

Alternatively, we can use:

    mc(ϕ) = Σ_{x∈supp(ϕ)} 1|ϕ(x)⟩.

We then use sums of multisets implicitly. For instance, for X = {a, b, c} and
K = 10,

    mc(2|a⟩ + 6|b⟩ + 2|c⟩) = 1|2⟩ + 1|6⟩ + 1|2⟩ = 2|2⟩ + 1|6⟩
    mc(2|a⟩ + 3|b⟩ + 5|c⟩) = 1|2⟩ + 1|3⟩ + 1|5⟩.
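In Python the multiplicity count is essentially one line, with multisets as dicts; the two examples above come out as expected (a sketch with my own encoding):

```python
from collections import Counter

def mc(phi):
    """Multiplicity count (3.36): how many elements occur how many times."""
    return dict(Counter(phi.values()))

assert mc({'a': 2, 'b': 6, 'c': 2}) == {2: 2, 6: 1}
assert mc({'a': 2, 'b': 3, 'c': 5}) == {2: 1, 3: 1, 5: 1}
```
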

We collect some basic results about multiplicity count.

Lemma 3.9.3. Let a non-empty multiset ϕ ∈ N[K](X) be given.

1 Taking multiplicity counts is invariant under permutation: when the set X is
finite and π : X → X is a permutation (isomorphism), then

    mc(N[K](π)(ϕ)) = mc(ϕ).

2 For arbitrary x ∈ X,

    mc(ϕ + 1|x⟩) = mc(ϕ) + 1|ϕ(x)+1⟩ − 1|ϕ(x)⟩.

3 Similarly, for x ∈ supp(ϕ),

    mc(ϕ − 1|x⟩) = mc(ϕ) − 1|1⟩                         if ϕ(x) = 1
    mc(ϕ − 1|x⟩) = mc(ϕ) + 1|ϕ(x)−1⟩ − 1|ϕ(x)⟩          otherwise.

4 ϕ! = Π_{1≤k≤K} (k!)^{mc(ϕ)(k)}.

5 For each number K and for each set X with at least K elements, the
multiplicity count function mc : N[K](X) → E(K) is surjective.

Proof. 1 Obvious, since permuting the elements of a multiset does not change
the multiplicities that it has.
2 Let k = ϕ(x). If we add another element x to the multiset ϕ, then the number
mc(ϕ)(k) of elements in ϕ occurring k times is decreased by one, and the
number mc(ϕ)(k + 1) of elements occurring k + 1 times is increased by one.
3 By a similar argument.
4 By induction on K ≥ 1. The statement clearly holds when K = 1. For the
induction step, distinguish two cases. If ϕ = K|x⟩, then mc(ϕ + 1|x⟩) = 1|K+1⟩,
so that:

    Π_{1≤k≤K+1} (k!)^{mc(ϕ+1|x⟩)(k)} = (K+1)! = (ϕ + 1|x⟩)!.

Otherwise, by item 2, adding x turns one element occurring ϕ(x) times into one
occurring ϕ(x) + 1 times, so that, using the induction hypothesis,

    Π_{1≤k≤K+1} (k!)^{mc(ϕ+1|x⟩)(k)}
      = (Π_{1≤k≤K} (k!)^{mc(ϕ)(k)}) · (ϕ(x)+1)!/ϕ(x)!
      = ϕ! · (ϕ(x)+1)
      = (ϕ + 1|x⟩)!.

5 Let ψ = Σ_{1≤k≤K} n_k |k⟩ ∈ E(K) be given. We construct a multiset ϕ_K ∈
N[K](X) with mc(ϕ_K) = ψ as follows. We start with the empty multiset ϕ_0.
For each number k with 1 ≤ k ≤ K we set ϕ_k = ϕ_{k−1} + k|x_1⟩ + · · · + k|x_{n_k}⟩,
where x_1, . . . , x_{n_k} are freshly chosen elements from X, not occurring in ϕ_{k−1}.
This guarantees that ϕ_k has n_k many elements occurring k times. By
construction mc(ϕ_K) = ψ. This construction uses ‖ψ‖ many elements from X.
Since K = Σ_k n_k · k ≥ Σ_k n_k = ‖ψ‖, the construction always works for a set
X with at least K elements.

In Definition 2.2.4 we have seen draw-and-delete and draw-and-add channels:

    DD : N[K+1](X) → N[K](X)        DA : N[K](X) → N[K+1](X).

There are similar channels for Ewens multisets, which we shall write with an
extra 'E' for 'Ewens':

    EDD : E(K+1) → E(K)        EDA(t) : E(K) → E(K+1).        (3.37)

Notice that the draw-add map EDA(t) involves a parameter t. It is a positive
real number, which is used as mutation rate in population genetics. Its role will
become clear below.
These maps in (3.37) are channels. Thus, EDA(t) and EDD produce a
distribution over Ewens multisets, with increased or decreased mean. These
increases and decreases are a bit subtle, because we have to make sure that the
right mean emerges. The trick is to shift one occurrence from |i⟩ to |i+1⟩, or


to shift in the other direction. This is achieved in the formulations in the next
definition. The deletion construction comes from [100, 101] and the addition
construction is used in the ‘Chinese restaurant’ illustration, see Exercise 3.9.7
below. Both constructions can be found in [27, §3.1 and §4.5].
Later on, in Proposition 6.6.6, we shall show that the channels EDD and
EDA(t) are closely related, in the sense that they are each other’s ‘daggers’.

Definition 3.9.4. For Ewens multisets ϕ ∈ E(K) and ψ ∈ E(K+1) and for t > 0
define:

    EDA(t)(ϕ) ≔ t/(K+t) · |ϕ + 1|1⟩⟩
                + Σ_{1≤k≤K} (ϕ(k) · k)/(K+t) · |ϕ − 1|k⟩ + 1|k+1⟩⟩

    EDD(ψ) ≔ ψ(1)/(K+1) · |ψ − 1|1⟩⟩
             + Σ_{2≤k≤K+1} (ψ(k) · k)/(K+1) · |ψ + 1|k−1⟩ − 1|k⟩⟩

For instance, for ϕ = 1|1⟩ + 2|3⟩ + 1|6⟩ ∈ E(13) we get:

    EDA(1)(ϕ) = 1/14 |2|1⟩ + 2|3⟩ + 1|6⟩⟩ + 1/14 |1|2⟩ + 2|3⟩ + 1|6⟩⟩
                + 3/7 |1|1⟩ + 1|3⟩ + 1|4⟩ + 1|6⟩⟩ + 3/7 |1|1⟩ + 2|3⟩ + 1|7⟩⟩

    EDD(ϕ) = 1/13 |2|3⟩ + 1|6⟩⟩ + 6/13 |1|1⟩ + 1|2⟩ + 1|3⟩ + 1|6⟩⟩
             + 6/13 |1|1⟩ + 2|3⟩ + 1|5⟩⟩.
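The two Ewens channels of Definition 3.9.4 are easy to implement; the following Python sketch (my own encoding: Ewens multisets as dicts, outcomes keyed by frozensets of (k, count) pairs; integer t) reproduces the worked example above.

```python
from fractions import Fraction as F

def EDA(t, phi):
    """Ewens draw-and-add (Definition 3.9.4); phi is a dict {k: count}."""
    K = sum(n * k for k, n in phi.items())
    out = {}
    def put(psi, p):
        key = frozenset((k, n) for k, n in psi.items() if n)
        out[key] = out.get(key, 0) + p
    psi = dict(phi); psi[1] = psi.get(1, 0) + 1
    put(psi, F(t, K + t))
    for k, n in list(phi.items()):
        psi = dict(phi); psi[k] -= 1; psi[k + 1] = psi.get(k + 1, 0) + 1
        put(psi, F(n * k, K + t))
    return out

def EDD(psi0):
    """Ewens draw-and-delete (Definition 3.9.4); psi0 is in E(K+1)."""
    K1 = sum(n * k for k, n in psi0.items())        # this is K + 1
    out = {}
    def put(psi, p):
        key = frozenset((k, n) for k, n in psi.items() if n)
        out[key] = out.get(key, 0) + p
    if psi0.get(1, 0):
        psi = dict(psi0); psi[1] -= 1
        put(psi, F(psi0[1], K1))
    for k, n in list(psi0.items()):
        if k >= 2:
            psi = dict(psi0); psi[k] -= 1; psi[k - 1] = psi.get(k - 1, 0) + 1
            put(psi, F(n * k, K1))
    return out

phi = {1: 1, 3: 2, 6: 1}                            # 1|1> + 2|3> + 1|6> in E(13)
a = EDA(1, phi)
assert a[frozenset({(1, 2), (3, 2), (6, 1)})] == F(1, 14)
assert a[frozenset({(1, 1), (3, 2), (7, 1)})] == F(3, 7)
d = EDD(phi)
assert d[frozenset({(3, 2), (6, 1)})] == F(1, 13)
assert d[frozenset({(1, 1), (2, 1), (3, 1), (6, 1)})] == F(6, 13)
```
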

The multiplicity count function mc from Definition 3.9.2 interacts smoothly
with draw-delete. This is what we show first.

Lemma 3.9.5. Multiplicity count commutes with draw-delete:

    EDD ◦· mc = mc ◦· DD : N[K+1](X) → E(K).

In this equation the function mc is used as a deterministic channel.

Proof. We recall from Definition 2.2.4 that for ψ ∈ N[K+1](X) one has:

    DD(ψ) ≔ Σ_{x∈supp(ψ)} ψ(x)/(K+1) · |ψ − 1|x⟩⟩.


Then, via Lemma 3.9.3 (3),

    D(mc)(DD(ψ))
      = Σ_{x∈supp(ψ)} ψ(x)/(K+1) · |mc(ψ − 1|x⟩)⟩
      = Σ_{x, ψ(x)=1} ψ(x)/(K+1) · |mc(ψ) − 1|1⟩⟩
        + Σ_{x, ψ(x)>1} ψ(x)/(K+1) · |mc(ψ) + 1|ψ(x)−1⟩ − 1|ψ(x)⟩⟩
      = mc(ψ)(1)/(K+1) · |mc(ψ) − 1|1⟩⟩
        + Σ_{2≤k≤K+1} (Σ_{x, ψ(x)=k} ψ(x))/(K+1) · |mc(ψ) + 1|k−1⟩ − 1|k⟩⟩
      = mc(ψ)(1)/(K+1) · |mc(ψ) − 1|1⟩⟩
        + Σ_{2≤k≤K+1} (mc(ψ)(k) · k)/(K+1) · |mc(ψ) + 1|k−1⟩ − 1|k⟩⟩
      = EDD(mc(ψ)).

This completes the proof.

We now come to a crucial channel ew[K] : R>0 → E(K), named after Ewens,
who introduced the distributions ew[K](t) ∈ D(E(K)) in [46]. These distributions
will be introduced by induction on K. Subsequently, an explicit formula
is obtained, for the special case when the parameter t is a natural number.

Definition 3.9.6. Fix a parameter t ∈ R>0 and define by induction:

    ew[1](t) ≔ 1|1|1⟩⟩
    ew[K+1](t) ≔ EDA(t) =≪ ew[K](t).        (3.38)


The first distribution ew[1](t) = 1 1|1 i ∈ D E(1) contains a lot of 1’s. It

is the singleton distribution for the singleton Ewens multiset 1| 1i on {1} with
mean 1. In this first step the parameter t ∈ R>0 does not play a role. The two
next steps yield:


ew[2](t) = EDA(t) = 1 1| 1 i = EDA(t)(1| 1i)

t 1
= 2| 1i + 1| 2 i .

1+t 1+t

212
3.9. Ewens distributions 213

And:

ew[3](t) !
t 1
= EDA(t) = 2| 1i +

1| 2 i
1+t 1+t
t 1
= · EDA(t) 2| 1 i +
 
· EDA(t) 1|2 i
1+t 1+t
t t t 2
= 3| 1i + 1| 1i + 1| 2 i

· ·
1+t 2+t 1+t 2+t
1 t 1 2
+ 1| 1i + 1| 2i +

· · 1| 3i
1+t 2+t 1+t 2+t
t2 3t 2
= 3| 1i + 1|1 i + 1| 2i + 1|3 i .

(1 + t)(2 + t) (1 + t)(2 + t) (1 + t)(2 + t)
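The inductive definition (3.38) translates directly into code. The following Python sketch (my own encoding, assuming an integer mutation rate t) iterates a draw-and-add step from 1|1⟩ and checks the ew[3] coefficients computed above at t = 1.

```python
from fractions import Fraction as F

def EDA(t, phi):
    """One Ewens draw-and-add step (Definition 3.9.4) on phi, a dict
    {k: multiplicity}; returns (new multiset, probability) pairs."""
    K = sum(n * k for k, n in phi.items())
    out = []
    psi = dict(phi)
    psi[1] = psi.get(1, 0) + 1
    out.append((psi, F(t, K + t)))
    for k, n in list(phi.items()):
        psi = dict(phi)
        psi[k] -= 1
        psi[k + 1] = psi.get(k + 1, 0) + 1
        out.append((psi, F(n * k, K + t)))
    return out

def ew(K, t):
    """Ewens distribution ew[K](t) via the inductive definition (3.38);
    outcomes keyed by frozensets of (k, multiplicity) pairs."""
    dist = {frozenset({(1, 1)}): F(1)}              # ew[1](t) = 1|1|1>>
    for _ in range(K - 1):
        new = {}
        for phi, p in dist.items():
            for psi, q in EDA(t, dict(phi)):
                key = frozenset((k, n) for k, n in psi.items() if n)
                new[key] = new.get(key, 0) + p * q
        dist = new
    return dist

d = ew(3, 1)
assert d[frozenset({(1, 3)})] == F(1, 6)            # 3|1>
assert d[frozenset({(1, 1), (2, 1)})] == F(1, 2)    # 1|1> + 1|2>
assert d[frozenset({(3, 1)})] == F(1, 3)            # 1|3>
```
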

There are several alternative ways to describe Ewens distributions, certainly
when the mutation rate t is a natural number. The first alternative involves
drawing from an urn; it justifies putting the topic of Ewens multisets / distributions
in this chapter on drawing. The following drawing method was introduced
by Hoppe [65]. It extends Pólya urns with the option of introducing a
ball with a new colour. It corresponds to the idea of a new mutation in genetics.

Example 3.9.7. For numbers t ∈ R>0 and K ∈ N>0 define the Hoppe channel:

    hop[K](t) : N[K]({1, . . . , K}) → N[K+1]({1, . . . , K+1})

on ϕ ∈ N[K]({1, . . . , K}) as:

    hop[K](t)(ϕ) ≔ Σ_{1≤k≤K} ϕ(k)/(K+t) · |ϕ + 1|k⟩⟩ + t/(K+t) · |ϕ + 1|K+1⟩⟩.    (3.39)

On the right-hand side we implicitly promote ϕ to a multiset on {1, . . . , K+1}.
In these Hoppe draws we use ϕ as an urn filled with coloured balls, where
the colours are numbers 1, . . . , K. Drawn elements of colour k are returned,
together with an extra copy of the same colour, like in Pólya urns. But what's
different is that an additional ball of a new colour K+1 is added, in the term
ϕ + 1|K+1⟩ in (3.39).
These Hoppe draws can be repeated and give, starting from 1|1⟩ ∈ N[1]({1}),
the following distributions. For convenience, we omit the parameter K.

    hop(t)(1|1⟩) = 1/(1+t) |2|1⟩⟩ + t/(1+t) |1|1⟩ + 1|2⟩⟩

    (hop(t) ◦· hop(t))(1|1⟩)
      = 2/((1+t)(2+t)) |3|1⟩⟩ + t/((1+t)(2+t)) |2|1⟩ + 1|3⟩⟩
        + t/((1+t)(2+t)) |2|1⟩ + 1|2⟩⟩ + t/((1+t)(2+t)) |1|1⟩ + 2|2⟩⟩
        + t²/((1+t)(2+t)) |1|1⟩ + 1|2⟩ + 1|3⟩⟩.

Notice that the multisets in these distributions are not Ewens multisets. But
we can turn them into such Ewens multisets via multiplicity count mc from
Definition 3.9.2. Interestingly, we then get the Ewens distributions from
Definition 3.9.6.
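Repeated Hoppe draws followed by multiplicity count can be simulated exactly. The following Python sketch (my own encoding: urns as dicts, integer t) performs two Hoppe draws from 1|1⟩ and checks, at t = 1, that pushing the result through mc yields the ew[3] coefficients computed earlier, in line with Corollary 3.9.9 below.

```python
from collections import Counter
from fractions import Fraction as F

def hop(t, urn):
    """One Hoppe draw (3.39): urn is a dict {colour: count} on {1..K};
    returns a list of (new urn, probability) pairs."""
    K = sum(urn.values())
    out = []
    for k, n in urn.items():
        psi = dict(urn)
        psi[k] += 1                      # Pólya step: return an extra copy
        out.append((psi, F(n, K + t)))
    psi = dict(urn)
    psi[K + 1] = 1                       # ball of the fresh colour K+1
    out.append((psi, F(t, K + t)))
    return out

def mc(phi):
    """Multiplicity count, as a hashable frozenset of (k, count) pairs."""
    return frozenset(Counter(phi.values()).items())

dist = {((1, 1),): F(1)}                 # the urn 1|1>, as sorted item-tuples
for _ in range(2):                       # two draws: from size 1 to size 3
    new = Counter()
    for urn, p in dist.items():
        for psi, q in hop(1, dict(urn)):
            new[tuple(sorted(psi.items()))] += p * q
    dist = dict(new)

ewens = Counter()
for urn, p in dist.items():
    ewens[mc(dict(urn))] += p

assert ewens[frozenset({(1, 3)})] == F(1, 6)           # 3|1>
assert ewens[frozenset({(1, 1), (2, 1)})] == F(1, 2)   # 1|1> + 1|2>
assert ewens[frozenset({(3, 1)})] == F(1, 3)           # 1|3>
```
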

The following result can be seen as an analogue of Lemma 3.9.5, for addition
instead of for deletion. It is however more restricted, since it involves sample
spaces of the form {1, . . . , K}, and not arbitrary sets.

Lemma 3.9.8. Multiplicity count mc connects Hoppe draws and the Ewens
draw-and-add channel:

    EDA(t) ◦· mc = mc ◦· hop(t) : N[K]({1, . . . , K}) → E(K+1).

This equation of channels holds for arbitrary t ∈ R>0.

Proof. Let a multiset ϕ ∈ N[K]({1, . . . , K}) be given. Via Lemma 3.9.3 (2):

    D(mc)( hop(t)(ϕ) )
      =(3.39) Σ_{1≤ℓ≤K} ϕ(ℓ)/(K+t) · | mc(ϕ + 1|ℓ⟩) ⟩ + t/(K+t) · | mc(ϕ + 1|K+1⟩) ⟩
      = Σ_{1≤ℓ≤K} ϕ(ℓ)/(K+t) · | mc(ϕ) + 1|ϕ(ℓ)+1⟩ − 1|ϕ(ℓ)⟩ ⟩ + t/(K+t) · | mc(ϕ) + 1|1⟩ ⟩
      =(∗) t/(K+t) · | mc(ϕ) + 1|1⟩ ⟩ + Σ_{1≤k≤K} (mc(ϕ)(k) · k)/(K+t) · | mc(ϕ) − 1|k⟩ + 1|k+1⟩ ⟩
      = EDA(t)( mc(ϕ) ).

The marked equation (∗) holds since, by definition, mc(ϕ)(k) = |{ℓ | ϕ(ℓ) = k}|,
see (3.36).
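For small instances the rectangle can be checked mechanically. Below is a sketch in Python, under our own encoding assumptions: urns and Ewens multisets are dictionaries (frozen into sets of pairs for use as keys), and mc, hop and eda are hypothetical names for the maps of the text.

```python
from collections import Counter
from fractions import Fraction

def mc(phi):
    """Multiplicity count: how many colours occur k times, for each k >= 1."""
    return frozenset(Counter(n for n in phi.values() if n > 0).items())

def hop(t, phi, K):
    """Hoppe draw (3.39) on an urn phi in N[K]({1,...,K}); returns urn -> prob."""
    out = {}
    for k, n in phi.items():
        psi = dict(phi); psi[k] = n + 1
        out[frozenset(psi.items())] = Fraction(n, 1) / (K + t)
    psi = dict(phi); psi[K + 1] = 1
    out[frozenset(psi.items())] = Fraction(t, 1) / (K + t)
    return out

def eda(t, phi, K):
    """Ewens draw-add on mc-vectors phi (dict: table size k -> number of tables)."""
    out = {}
    grow = dict(phi); grow[1] = grow.get(1, 0) + 1     # new table, prob t/(K+t)
    out[frozenset(grow.items())] = Fraction(t, K + t)
    for k, n in phi.items():                           # a k-table becomes a (k+1)-table
        if n > 0:
            psi = dict(phi)
            psi[k] = n - 1
            psi[k + 1] = psi.get(k + 1, 0) + 1
            if psi[k] == 0:
                del psi[k]
            out[frozenset(psi.items())] = Fraction(n * k, K + t)
    return out

# Compare D(mc)(hop(t)(phi)) with EDA(t)(mc(phi)) for the urn 2|1> + 1|2> on {1,2,3}:
t, K = 1, 3
phi = {1: 2, 2: 1}
lhs = {}
for urn, p in hop(t, phi, K).items():
    key = mc(dict(urn))
    lhs[key] = lhs.get(key, 0) + p
rhs = eda(t, dict(mc(phi)), K)
```

Both paths around the square produce the same distribution on E(4), illustrating the lemma for this instance.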

3.9. Ewens distributions 215

Corollary 3.9.9. Ewens distributions can also be obtained via repeated Hoppe
draws followed by multiplicity count: for t ∈ R>0 and K ∈ N>0 , one has:

    ew[K](t) = ( mc ◦· hop(t) ◦· · · · ◦· hop(t) )(1|1⟩),    with K − 1 occurrences of hop(t).

Proof. By induction on K ≥ 1, using that mc(1|1⟩) = 1|1⟩ and Lemma 3.9.8.

It turns out that there is a single formula for Ewens distributions when the
mutation rate t is a natural number. This formula is often called the Ewens
sampling formula.

Theorem 3.9.10. When the parameter t > 0 is a natural number, the Ewens
distribution ew[K](t) ∈ D(E(K)) can be described by:

    ew[K](t) = Σ_{ϕ∈E(K)}  Π_{1≤k≤K} (t/k)^ϕ(k) / ( ϕ! · (( t K )) )  · |ϕ⟩.        (3.40)

Proof. We fix t ∈ N>0 and use induction on K ≥ 1. When K = 1 the only
possible Ewens multiset is ϕ = 1|1⟩ ∈ E(1), so that:

    t¹ / ( 1! · (( t 1 )) ) · | 1|1⟩ ⟩ = t/t · | 1|1⟩ ⟩ = 1 · | 1|1⟩ ⟩ =(3.38) ew[1](t).

For the induction step we assume that Equation (3.40) holds for K, and derive
it for K + 1. We first note that we can write, for ϕ ∈ E(K) and ψ ∈ E(K +1),

    EDA(t)(ϕ)(ψ) = { t/(K+t)               if ψ = ϕ + 1|1⟩
                   { ϕ(k)·k/(K+t)          if ψ = ϕ − 1|k⟩ + 1|k+1⟩

                 = { t/(K+t)               if ϕ = ψ − 1|1⟩
                   { (ψ(k)+1)·k/(K+t)      if ϕ = ψ + 1|k⟩ − 1|k+1⟩

where 1 ≤ k ≤ K. Hence:

    ew[K+1](t)(ψ)
      =(3.38) ( EDA(t) ≫ ew[K](t) )(ψ)
      = Σ_{ϕ∈E(K)} EDA(t)(ϕ)(ψ) · ew[K](t)(ϕ)
      = t/(K+t) · ew[K](t)(ψ − 1|1⟩)    [present only if ψ(1) > 0]
        + Σ_{1≤k≤K} (ψ(k)+1)·k/(K+t) · ew[K](t)(ψ + 1|k⟩ − 1|k+1⟩)
      =(IH) t/(K+t) · Π_{1≤ℓ≤K} (t/ℓ)^(ψ−1|1⟩)(ℓ) / ( (ψ − 1|1⟩)! · (( t K )) )
        + Σ_{1≤k≤K} (ψ(k)+1)·k/(K+t) · Π_{1≤ℓ≤K} (t/ℓ)^(ψ+1|k⟩−1|k+1⟩)(ℓ) / ( (ψ + 1|k⟩ − 1|k+1⟩)! · (( t K )) )
      = ψ(1)/(K+1) · Π_{1≤ℓ≤K+1} (t/ℓ)^ψ(ℓ) / ( ψ! · (( t K+1 )) )
        + Σ_{1≤k≤K} ψ(k+1)·(k+1)/(K+1) · Π_{1≤ℓ≤K+1} (t/ℓ)^ψ(ℓ) / ( ψ! · (( t K+1 )) )
      = ( ψ(1) + Σ_{1≤k≤K} ψ(k+1)·(k+1) ) / (K+1) · Π_{1≤ℓ≤K+1} (t/ℓ)^ψ(ℓ) / ( ψ! · (( t K+1 )) )
      = Π_{1≤ℓ≤K+1} (t/ℓ)^ψ(ℓ) / ( ψ! · (( t K+1 )) ),

since ψ(1) + Σ_{1≤k≤K} ψ(k+1)·(k+1) = Σ_{1≤ℓ≤K+1} ψ(ℓ)·ℓ = K + 1, for ψ ∈ E(K +1).

This result has a number of consequences. We start with an obvious one, re-
formulating that the probabilities in (3.40) add up to one. This result resembles
earlier results, in Lemma 1.6.2 and Proposition 1.7.3, underlying the multino-
mial and Pólya distributions.

Corollary 3.9.11. For K, t ∈ N>0 ,

    Σ_{ϕ∈E(K)}  t^‖ϕ‖ / ( ϕ! · Π_{1≤k≤K} k^ϕ(k) )  =  (( t K )).
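For natural t both the sampling formula and this identity can be verified by brute-force enumeration. In the following sketch (our own code, not from the text), ewens_multisets enumerates E(K) as integer partitions and multichoose stands for the double-binomial coefficient in the statement; all arithmetic is exact.

```python
from fractions import Fraction
from math import comb, factorial, prod

def ewens_multisets(K):
    """E(K): tuples (phi(1), ..., phi(K)) with sum over k of k*phi(k) equal to K."""
    def go(k, rem):
        if k > K:
            if rem == 0:
                yield ()
            return
        for n in range(rem // k + 1):
            for rest in go(k + 1, rem - n * k):
                yield (n,) + rest
    return list(go(1, K))

def multichoose(t, K):
    """The multiset coefficient (( t K )) = C(t+K-1, K), for natural t."""
    return comb(t + K - 1, K)

def ewens(K, t):
    """The Ewens distribution ew[K](t) via the sampling formula (3.40)."""
    return {phi: prod(Fraction(t, k) ** phi[k - 1] for k in range(1, K + 1))
                 / (prod(factorial(n) for n in phi) * multichoose(t, K))
            for phi in ewens_multisets(K)}
```

Summing the probabilities confirms that ew[K](t) is a distribution, which is exactly the identity of Corollary 3.9.11.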

By construction, in Definition 3.9.6, Ewens channels commute with Ewens


draw-addition maps EDA. With their explicit formulation (3.40) we can show
that Ewens channels, when restricted to positive natural numbers, commute
with Ewens draw-delete maps EDD. Thus these channels form a cone for the
infinite chain of draw-delete maps. This resembles the situation for multino-
mial channels in Proposition 3.3.8.


Corollary 3.9.12. For K ≥ 1 the following triangles commute:

                  EDD
        E(K) ◦←─────── E(K+1)
           ↖               ↗
    ew[K] ◦ ╲             ╱ ◦ ew[K+1]
                  N>0
Recall from Exercise 2.2.10 that there are also ‘fixed point’ distributions for
the ordinary (non-Ewens) draw-delete/add maps, but the ones given there are
not as rich as the Ewens distributions.
Proof. We first extract the following characterisation from the formulation of
EDD in Definition 3.9.4: for ψ ∈ E(K +1) and ϕ ∈ E(K),

    EDD(ψ)(ϕ) = { ψ(1)/(K+1)              if ϕ = ψ − 1|1⟩
                { ψ(k)·k/(K+1)            if ϕ = ψ + 1|k−1⟩ − 1|k⟩

              = { (ϕ(1)+1)/(K+1)          if ψ = ϕ + 1|1⟩
                { (ϕ(k)+1)·k/(K+1)        if ψ = ϕ − 1|k−1⟩ + 1|k⟩
where 2 ≤ k ≤ K + 1. Hence:

    ( EDD ≫ ew[K+1](t) )(ϕ)
      = Σ_{ψ∈E(K+1)} EDD(ψ)(ϕ) · ew[K+1](t)(ψ)
      = (ϕ(1)+1)/(K+1) · ew[K+1](t)(ϕ + 1|1⟩)
        + Σ_{1≤k≤K} (ϕ(k+1)+1)·(k+1)/(K+1) · ew[K+1](t)(ϕ − 1|k⟩ + 1|k+1⟩)
      =(3.40) (ϕ(1)+1)/(K+1) · Π_{1≤ℓ≤K+1} (t/ℓ)^(ϕ+1|1⟩)(ℓ) / ( (ϕ + 1|1⟩)! · (( t K+1 )) )
        + Σ_{1≤k≤K} (ϕ(k+1)+1)·(k+1)/(K+1) · Π_{1≤ℓ≤K+1} (t/ℓ)^(ϕ−1|k⟩+1|k+1⟩)(ℓ) / ( (ϕ − 1|k⟩ + 1|k+1⟩)! · (( t K+1 )) )
      = t/(K+t) · Π_{1≤ℓ≤K} (t/ℓ)^ϕ(ℓ) / ( ϕ! · (( t K )) )
        + Σ_{1≤k≤K} ϕ(k)·k/(K+t) · Π_{1≤ℓ≤K} (t/ℓ)^ϕ(ℓ) / ( ϕ! · (( t K )) )
      = t/(K+t) · ew[K](t)(ϕ) + ( Σ_{1≤k≤K} ϕ(k)·k )/(K+t) · ew[K](t)(ϕ)
      = ew[K](t)(ϕ),

since Σ_{1≤k≤K} ϕ(k)·k = K for ϕ ∈ E(K), so that the two coefficients add up to
(t + K)/(K + t) = 1.
We consider another consequence of Theorem 3.9.10. It fills a gap that


we left open at the very beginning of this book, namely the proof of Theo-
rem 1.2.8. There we looked at sums sum(ℓ) and products prod(ℓ) of sequences
ℓ ∈ L(N>0 ) of positive natural numbers. By accumulating such lists into mul-
tisets we get a connection with Ewens multisets, namely:

    sum(ℓ) = K ⇐⇒ acc(ℓ) ∈ E(K).        (3.41)

Thus if we arrange the multisets (with mean K) in the Ewens distribution we


get a distribution on lists that sum to a certain number K.

Corollary 3.9.13. For t, K ∈ N>0 ,

    arr ≫ ew[K](t) = Σ_{ℓ∈sum⁻¹(K)}  t^‖ℓ‖ / ( ‖ℓ‖! · prod(ℓ) · (( t K )) )  · |ℓ⟩.

In particular, for t = 1 we get a distribution:

    arr ≫ ew[K](1) = Σ_{ℓ∈sum⁻¹(K)}  1 / ( ‖ℓ‖! · prod(ℓ) )  · |ℓ⟩,

in which all probabilities add up to one. This is what was claimed earlier in
Theorem 1.2.8, at the time without proof.

Proof. We elaborate:

    arr ≫ ew[K](t)
      = Σ_{ℓ∈L({1,...,K})} ( Σ_{ϕ∈E(K)} arr(ϕ)(ℓ) · ew[K](t)(ϕ) ) · |ℓ⟩
      =(3.40) Σ_{ℓ∈L({1,...,K})} ( Σ_{ϕ∈E(K), acc(ℓ)=ϕ} 1/(ϕ) · Π_{1≤k≤K} (t/k)^ϕ(k) / ( ϕ! · (( t K )) ) ) · |ℓ⟩
      =(∗) Σ_{ℓ∈L({1,...,K})} ( Σ_{ϕ∈E(K), acc(ℓ)=ϕ} 1/‖ϕ‖! · t^‖ϕ‖ / ( prod(ℓ) · (( t K )) ) ) · |ℓ⟩
      =(3.41) Σ_{ℓ∈sum⁻¹(K)}  t^‖ℓ‖ / ( ‖ℓ‖! · prod(ℓ) · (( t K )) )  · |ℓ⟩.

The marked equation (∗) holds because (ϕ) · ϕ! = ‖ϕ‖! and because, if
acc(ℓ) = ϕ ∈ E(K), then:

    1/prod(ℓ) = Π_{1≤k≤K} (1/k)^ϕ(k).

For instance, for ℓ = [1, 2, 2, 3] with ϕ = acc(ℓ) = 1|1⟩ + 2|2⟩ + 1|3⟩ ∈ E(8)
we have:

    1/prod(ℓ) = 1/(1·2·2·3) = 1/12 = (1/1)¹ · (1/2)² · (1/3)¹ = Π_{1≤k≤K} (1/k)^ϕ(k).


Exercises
3.9.1 Check that the sets of Ewens multisets E(1), E(2), E(3), E(4), E(5) have,
respectively, 1, 2, 3, 5 and 7 elements; list them all.
3.9.2 Use the construction from the proof of Lemma 3.9.3 (5) to construct
for ϕ = 2|1⟩ + 1|4⟩ + 2|5⟩ ∈ E(16) a multiset ψ with mc(ψ) = ϕ.
3.9.3 Use the isomorphism N ≅ N(1) to describe the mean function from
Definition 3.9.1 as an instance of flattening.
3.9.4 Show that for ℓ ∈ L(N>0 ) one has:

           mean(acc(ℓ)) = sum(ℓ).

3.9.5 Recall from Lemma 1.2.7 that for N ∈ N>0 there are 2^(N−1) many lists
ℓ ∈ L(N>0 ) with sum(ℓ) = N. Hence we have a uniform distribution:

           unif_{sum⁻¹(N)} = Σ_{ℓ∈sum⁻¹(N)} 1/2^(N−1) · |ℓ⟩.

       Via accumulation we get a distribution on E(N), namely:

           D(acc)( unif_{sum⁻¹(N)} ) = Σ_{ℓ∈sum⁻¹(N)} 1/2^(N−1) · |acc(ℓ)⟩.

       Use (1.7) to show that for N = 4 this construction yields

           D(acc)( unif_{sum⁻¹(4)} )
             = 1/8 · | 4|1⟩ ⟩ + 3/8 · | 2|1⟩ + 1|2⟩ ⟩ + 1/8 · | 2|2⟩ ⟩ + 1/4 · | 1|1⟩ + 1|3⟩ ⟩ + 1/8 · | 1|4⟩ ⟩.

       (This is not ew[4](1).)


3.9.6 Take ϕ = 3|a⟩ + 1|b⟩ + 1|c⟩. Show that:

           EDA(1)( mc(ϕ) ) = 1/6 · | 3|1⟩ + 1|3⟩ ⟩ + 1/3 · | 1|1⟩ + 1|2⟩ + 1|3⟩ ⟩ + 1/2 · | 2|1⟩ + 1|4⟩ ⟩

       and:

           D(mc)( DA(ϕ) ) = 2/5 · | 1|1⟩ + 1|2⟩ + 1|3⟩ ⟩ + 3/5 · | 2|1⟩ + 1|4⟩ ⟩.

       This demonstrates that a “draw” analogue of Lemma 3.9.5 does not
       work, at least not for mutation rate t = 1.
3.9.7 The following Chinese Restaurant process is a standard construction
in the theory of Ewens distributions, see e.g. [6] or [27, §4.5]. We
quote from the latter.
Imagine a restaurant in which customers are labeled according to the order
in which they arrive: the first customer is labeled 1, the second is labeled 2,
and so on. If the first K customers are seated at m ≥ 1 different tables, the
(K + 1)st customer sits


• at a table occupied by k ≥ 1 customers with probability k/(t + K) or


• at an unoccupied table with probability t/(t + K),
where t > 0.
(We have adapted the symbols to make them consistent with the no-
tation used here, e.g. with t instead of θ.)
Check that the draw-add channel EDA(t) from Definition 3.9.4
captures the above process. Keep in mind that an Ewens multiset
ϕ = Σ_k n_k · |k⟩ ∈ E(K) describes a seating arrangement with n_k tables
having k customers.
3.9.8 Give a proof of the equivalence in (3.41).
3.9.9 Show that for ϕ ∈ E(K),

           Σ_{i∈supp(ϕ)} ew[K](1)( ϕ − 1|i⟩ ) = K · ew[K](1)(ϕ).

This property is called non-interference, see [100, 27]. It resembles


the recurrence relation for multinomials in Exercise 3.3.6.

3.9.10 Recall the infinite multinomial mn[K] : D∞(N>0) → D∞(N[K](N>0))
from Exercise 3.3.16. We define:

           ε[K] ≔ ( D∞(N>0) ──mn[K]──→ D∞(N[K](N>0)) ──D∞(mc)──→ D∞(E(K)) ).

    1 Check that for ρ ∈ D∞(N>0) one actually gets ε[K](ρ) in D(E(K))
      instead of in D∞(E(K)).
    2 Show that EDD ◦· ε[K +1] = ε[K], so that the ε’s form a cone for
      the infinite chain of EDD maps.

       Such a cone is called a partition structure in [100, 101]. The above
       ε[K] maps correspond to the maps φK in [100, (2.10)]; to see this
       Lemma 3.9.3 (4) is useful. They carry a De Finetti structure via the
       Poisson-Dirichlet distribution.

4

Observables and validity

We have seen (discrete probability) distributions as formal convex sums Σᵢ rᵢ·|xᵢ⟩
in D(X), and (probabilistic) channels X ◦→ Y as functions X → D(Y), describ-
ing probabilistic states and computations. This chapter develops the tools to
reason about such distributions and channels, via what are called observables.
bers R that associate some ‘observable’ numerical information with an element
x ∈ X. The following table gives an overview of terminology and types, where
X is a set (used as sample space).

    name                                              type
    ──────────────────────────────────────────────────────────
    observable / utility function                     X → R
    factor / potential function                       X → R≥0        (4.1)
    (fuzzy) predicate / (soft / uncertain) evidence   X → [0, 1]
    sharp predicate / event                           X → {0, 1}
We shall use the term ‘observable’ as a generic expression for all the entries in
this table. A function X → R is thus the most general type of observable, and
a sharp predicate X → {0, 1} is the most specific one. Predicates are the most
appropriate observables for probabilistic logical reasoning. Often attention is
restricted to subsets U ⊆ X as predicates (or events [144]), but here, in this
book, the fuzzy versions X → [0, 1] are the default. Such fuzzy predicates may
also be called belief functions — or effects, in a quantum setting. A technical
reason for using fuzzy, [0, 1]-valued predicates is that these fuzzy predicates are
closed under predicate transformation ≪, and the sharp predicates are not; see
below for details.


The commonly used notion of random variable can now be described as a


pair, consisting of an observable together with a state (distribution), both on
the same sample space. Often, this state is left implicit; it may be obvious
in a particular context what it is. But it may sometimes also be confusing,
for instance when we deal with two random variables and we need to make
explicit whether they involve different states, or share a state. Like elsewhere in
this book, we like to be explicit about the state(s) that we are using.
This chapter starts with the definition of what can be seen as probabilistic
truth ω |= p, namely the validity of an observable p in a state ω. It is the
expected value of p in ω. We shall see that many basic concepts can be defined
in terms of validity, including mean, average, entropy, distance, (co)variance.
The algebraic, logical and categorical structure of the various observables in
Table 4.1 will be investigated in Section 4.2.
In the previous chapter we have seen that a state ω on the domain X of a
channel c : X ◦→ Y can be transformed into a state c ≫ ω on the channel’s
codomain Y. Analogous to such state transformation ≫ there is also observ-
able transformation ≪, acting in the opposite direction: for an observable q on
the codomain Y of a channel c : X ◦→ Y, there is a transformed observable
c ≪ q on the domain X. When ≪ is applied to predicates, it is called pred-
icate transformation. It is a basic operation in programming logics that will
be investigated in this chapter. These transformations in different directions
are an aspect of the duality between states and predicates. The two transfor-
mations correspond to Schrödinger’s (forward) and Heisenberg’s (backward) ap-
proach in quantum foundations, see [74]. In the end, in Section 4.5, validity is
used to give (dual) formulations of distances between states and between pred-
icates. Via this distance function we can make important properties precise,
like: each distribution can be approximated, with arbitrary precision, via fre-
quentist learning of natural multisets. Technically, the ‘rational’ distributions
of the form Flrn(ϕ), for natural multisets ϕ, form a dense subset of the (metric)
space of all distributions.

4.1 Validity
This section introduces the basic facts and terminology for observables, as
described in Table 4.1 and defines their validity in a state. Recall that we write
Y^X for the set of functions from X to Y. We will use the notations:

• Obs(X) ≔ R^X for the set of observables on a set X;
• Fact(X) ≔ (R≥0)^X for the factors on X;


• Pred(X) ≔ [0, 1]^X for the predicates on X;
• SPred(X) ≔ {0, 1}^X for the sharp predicates (events) on X.

There are inclusions:

SPred (X) ⊆ Pred (X) ⊆ Fact(X) ⊆ Obs(X).

The first set SPred (X) = {0, 1}X of sharp predicates can be identified with
the powerset P(X) of subsets of X, see below. We first define some special
observables.

Definition 4.1.1. Let X be an arbitrary set.

1 For a subset U ⊆ X we write 1_U : X → {0, 1} for the characteristic function
  of U, defined as:

       1_U(x) ≔ 1 for x ∈ U, and 1_U(x) ≔ 0 otherwise.

  This function 1_U : X → [0, 1] is the (sharp) predicate associated with the
  subset U ⊆ X.
2 We use special notation for two extreme cases U = X and U = ∅, giving the
  truth predicate 1 : X → [0, 1] and the falsity predicate 0 : X → [0, 1] on X.
  Explicitly:

       1 ≔ 1_X and 0 ≔ 1_∅, so that 1(x) = 1 and 0(x) = 0, for all x ∈ X.

3 For a singleton subset {x} we simply write 1_x for 1_{x}. Such functions 1_x : X →
  [0, 1] are also called point predicates, where the element x ∈ X is seen as a
  point.
4 There is a (sharp) equality predicate Eq : X × X → [0, 1] defined in the
  obvious way as:

       Eq(x, x′) ≔ 1 if x = x′, and Eq(x, x′) ≔ 0 otherwise.

5 If the set X can be mapped into R in an obvious manner, we write this
  inclusion function as incl_X : X ↪ R and may consider it as an observable
  on X. This applies for instance if X = n = {0, 1, . . . , n − 1}.

The mapping U ↦ 1_U forms the isomorphism P(X) ≅ {0, 1}^X that we
mentioned just before Definition 4.1.1. In analogy with item (3) we sometimes
call a state of the form unit(x) = 1|x⟩ a point state.
We now come to the basic definition of validity.


Definition 4.1.2. Let X be a set with an observable p : X → R on X.

1 For a distribution/state ω = Σᵢ rᵢ·|xᵢ⟩ on X we define its validity ω |= p as:

       ω |= p ≔ Σᵢ rᵢ · p(xᵢ) = Σ_{x∈X} ω(x) · p(x).        (4.2)

2 If X is a finite set, with size |X| ∈ N, one can define the average avg(p) of
  the observable p as its validity in the uniform state unif_X on X, i.e.,

       avg(p) ≔ unif_X |= p = Σ_{x∈X} p(x) / |X|.
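With distributions encoded as dictionaries, validity and average are one-liners. The following is a minimal sketch in Python, with our own function names; it is an illustration, not code from the text.

```python
def validity(omega, p):
    """omega |= p: the expected value (4.2) of observable p in state omega."""
    return sum(r * p(x) for x, r in omega.items())

def avg(space, p):
    """avg(p): validity of p in the uniform state on a finite space."""
    n = len(space)
    return validity({x: 1 / n for x in space}, p)
```

For the fair dice of the examples below, validity(dice, incl) gives 3.5 and the even predicate has validity 1/2.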

Notice that validity ω |= p involves a finite sum, since a distribution ω ∈


D(X) has by definition finite support. The sample space X may well be infinite.
Validity is often called expected value and written as E[p], where the state ω is
left implicit. The notation ω |= p makes the state explicit, which is important
in situations where we change the state, for instance via state transformation.
The validity ω |= p is non-negative (is in R≥0 ) if p is a factor and lies in
the unit interval [0, 1] if p is a predicate (whether sharp or not). The fact that
the multiplicities rᵢ in a distribution ω = Σᵢ rᵢ·|xᵢ⟩ add up to one means that
the validity ω |= 1 of truth is one. Notice that for a point predicate one has
ω |= 1_x = ω(x) and similarly, for a point state, 1|x⟩ |= p = p(x).
For a fixed state ω ∈ D(X) we can view ω |= (−) as a function Pred (X) →
[0, 1] that assigns a likelihood to a belief function (predicate). This is at the
heart of the Bayesian interpretation of probability, see Section 6.8 for more
details.
As an aside, we typically do not write brackets in equations like (ω |= p) =
a, but use the convention that |= has higher precedence than =, so that the
equation can simply be written as ω |= p = a. Similarly, one can have validity
c ≫ ω |= p in a transformed state, which should be read as (c ≫ ω) |= p.

Definition 4.1.3. In the presence of a map incl_X : X ↪ R, one can define the
mean mean(ω), also known as average, of a distribution ω on X as the validity
of incl_X , considered as an observable:

    mean(ω) ≔ ω |= incl_X = Σ_{x∈X} ω(x) · x.

We will use the same definition of mean when ω is a multiset instead of a


distribution. In fact, we have already been using that extension to multisets in
Definition 3.9.1, where we defined Ewens multisets.
Example 4.1.4. 1 Let flip(3/10) = 3/10·|1⟩ + 7/10·|0⟩ be a biased coin. Suppose there

is a game where you can throw the coin and win €100 if head (1) comes up,
but you lose €50 if the outcome is tail (0). Is it a good idea to play the game?
The possible gain can be formalised as an observable v : {0, 1} → R with
v(0) = −50 and v(1) = 100. We get an answer to the above question by
computing the validity:

    flip(3/10) |= v = Σ_{x∈{0,1}} flip(3/10)(x) · v(x)
                    = flip(3/10)(0) · v(0) + flip(3/10)(1) · v(1)
                    = 7/10 · −50 + 3/10 · 100 = −35 + 30 = −5.

Hence it is wiser not to play.


2 Write pips for the set {1, 2, 3, 4, 5, 6}, considered as a subset of R, via the
  map incl : pips ↪ R, which is used as an observable. The average of the
  latter is its validity dice |= incl for the (uniform) fair dice state dice =
  unif_pips = 1/6·|1⟩ + 1/6·|2⟩ + 1/6·|3⟩ + 1/6·|4⟩ + 1/6·|5⟩ + 1/6·|6⟩. This average is
  21/6 = 7/2. It is the expected outcome for throwing a dice.


3 Suppose that we claim that in a throw of a (fair) dice the outcome is even.
  How likely is this claim? We formalise it as a (sharp) predicate e : pips →
  [0, 1] with e(1) = e(3) = e(5) = 0 and e(2) = e(4) = e(6) = 1. Then, as
  expected:

    dice |= e = Σ_{x∈pips} dice(x) · e(x)
              = 1/6·0 + 1/6·1 + 1/6·0 + 1/6·1 + 1/6·0 + 1/6·1 = 1/2.

  Now consider a more subtle claim that the even pips are more likely than
  the odd ones, where the precise likelihoods are described by the (non-sharp,
  fuzzy) predicate p : pips → [0, 1] with:

    p(1) = 1/10    p(2) = 9/10    p(3) = 3/10
    p(4) = 8/10    p(5) = 2/10    p(6) = 7/10.

  This new claim p happens to be equally probable as the even claim e, since:

    dice |= p = 1/6·1/10 + 1/6·9/10 + 1/6·3/10 + 1/6·8/10 + 1/6·2/10 + 1/6·7/10
              = (1+9+3+8+2+7)/60 = 30/60 = 1/2.

4 Recall the binomial distribution bn[K](r) on the set {0, 1, . . . , K} from Ex-
ample 2.1.1 (2), for r ∈ [0, 1]. There is an inclusion function {0, 1, . . . , K} ,→
R that allows us to compute the mean of the binomial distribution. In this
computation we treat the binomial distribution as a special instance of the

225
226 Chapter 4. Observables and validity

multinomial distribution, via the isomorphism {0, 1, . . . , K} ≅ N[K](2). This
allows us to re-use an earlier result.

    mean(bn[K](r)) = Σ_{0≤k≤K} bn[K](r)(k) · k
                   = Σ_{ϕ∈N[K](2)} mn[K](flip(r))(ϕ) · ϕ(1)    for flip(r) = r·|1⟩ + (1 − r)·|0⟩
                   = K · flip(r)(1)    by Lemma 3.3.4
                   = K · r.

In the two plots of binomial distributions bn[10](1/3) and bn[10](3/4) in Fig-
ure 2.2 (on page 75) one can see that the associated means 10/3 = 3.333 . . .
and 30/4 = 7.5 make sense.
4 = 7.5 make sense.

Means for multivariate drawing will be considered in Section 4.4.
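Items 1 and 4 of the example above can be replayed numerically. The sketch below uses our own helper names and the standard closed form for binomial probabilities; it is an illustration under these assumptions, not code from the text.

```python
from math import comb

def validity(omega, p):
    """omega |= p, with omega a dict from outcomes to probabilities."""
    return sum(r * p(x) for x, r in omega.items())

# Item 1: the biased coin flip(3/10) against the gain observable v.
flip = {1: 0.3, 0: 0.7}
v = {1: 100, 0: -50}
gain = validity(flip, v.__getitem__)        # expected gain of the game

# Item 4: the binomial bn[K](r) has mean K * r.
def bn(K, r):
    return {k: comb(K, k) * r ** k * (1 - r) ** (K - k) for k in range(K + 1)}

mean_bn = validity(bn(10, 1 / 3), lambda k: k)
```

Running this gives gain = −5 and mean_bn = 10/3, matching the computations above.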

For the next result we recall from Proposition 2.4.5 that if M is a commuta-
tive monoid, so is the set D(M) of distributions on M. This is used below where
M is the additive monoid N of natural numbers. The result can be generalised
to any additive submonoid of the reals.

Lemma 4.1.5. The mean function, for distributions on natural numbers, is a
homomorphism of monoids of the form:

    mean : ( D(N), +, 0 ) ──→ ( R≥0 , +, 0 ).

Proof. Preservation of the zero element is easy, since:

    mean(1|0⟩) = 1 · 0 = 0.

Next, for ω, ρ ∈ D(N),

    mean(ω + ρ) =(2.18) Σ_{n∈N} D(+)(ω ⊗ ρ)(n) · n
                = Σ_{m,k∈N} (ω ⊗ ρ)(m, k) · (m + k)
                = Σ_{m,k∈N} ω(m) · ρ(k) · (m + k)
                = ( Σ_{m,k∈N} ω(m) · ρ(k) · m ) + ( Σ_{m,k∈N} ω(m) · ρ(k) · k )
                = ( Σ_{m∈N} ω(m) · m ) · ( Σ_{k∈N} ρ(k) ) + ( Σ_{m∈N} ω(m) ) · ( Σ_{k∈N} ρ(k) · k )
                = ( Σ_{m∈N} ω(m) · m ) + ( Σ_{k∈N} ρ(k) · k )
                = mean(ω) + mean(ρ).
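The lemma invites a quick numerical check: encode the sum of distributions as the convolution pushed forward along +, as in (2.18), and compare means. A minimal sketch, with our own names dsum and mean:

```python
from fractions import Fraction

def dsum(omega, rho):
    """Sum of distributions on N: push omega ⊗ rho forward along +."""
    out = {}
    for m, p in omega.items():
        for k, q in rho.items():
            out[m + k] = out.get(m + k, 0) + p * q
    return out

def mean(omega):
    """The mean of a distribution on (a subset of) the reals."""
    return sum(p * n for n, p in omega.items())
```

With exact fractions the identity mean(dsum(ω, ρ)) = mean(ω) + mean(ρ) holds on the nose.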
Remark 4.1.6. What is the difference between a multiset and a factor, and
between a distribution and a predicate? A multiset on X is a function X → R≥0
with finite support, and thus a factor. In the same way a distribution on X is
a function X → [0, 1] with finite support, and thus a predicate on X. Hence,
there are inclusions:

    M(X) ⊆ Fact(X) = (R≥0)^X    and    D(X) ⊆ Pred(X) = [0, 1]^X .


The first inclusion ⊆ may actually be seen as an equality = when the set X is
finite. A predicate p on a finite set, however, need not be a state, since its val-
ues p(x) need not add up to one. But such a predicate can be normalised and
thereby turned into a state.
We are however reluctant to identify states/distributions with certain predi-
cates, since they belong to different universes and have quite different algebraic
properties. For instance, distributions form convex sets, whereas predicates form
effect modules carrying a commutative monoid structure, see the next section. Keeping
states and predicates apart is a matter of mathematical hygiene¹. We have
already seen state transformation ≫ along a channel; it preserves the convex
structure. Later on in this chapter we shall also see predicate transformation
≪ along a channel, in the opposite direction; this operation preserves the effect
module structure on predicates.

¹ Moreover, in continuous probability there are no inclusions of states in predicates.

Apart from these mathematical differences, states and predicates play entirely
different roles and are understood in different ways: states play an ontological
role and describe ‘states of affairs’; predicates are epistemological


in nature, and describe evidence (what we know).
Whenever we do make use of the above inclusions, we shall make this usage
explicit.

4.1.1 Random variables


So far we have avoided discussing the concept of ‘random variable’. It can be
a confusing notion for people who start studying probability theory. One often
encounters phrases like: let Y be a random variable with expectation E[Y].
How should this be understood? For clarity, we define a random variable to be
a pair, consisting of a distribution and an observable, with a shared underlying
space. For such a pair it makes sense to talk about expectation, namely as their
validity.
Consider a space Pet = {d, c, r, h} for d = dog, c = cat, r = rabbit and h =
hamster. We look at the cost (e.g. of food) of a pet per month, via an observable
q : Pet → R given by q(d) = q(c) = 50 and q(r) = q(h) = 10. Let us assume
that the distribution of pets in a certain neighbourhood is given by ω = 2/5·|d⟩ +
1/4·|c⟩ + 3/20·|r⟩ + 1/5·|h⟩. We can describe the situation via three plots: the pet
distribution ω on the left, the costs per pet in the middle, and on the right the
distribution 7/20·|10⟩ + 13/20·|50⟩ of pet costs in the neighbourhood. [Plots
omitted in this draft.] The latter is obtained via the functoriality of D, as image
distribution D(q)(ω) ∈ D(R). In more traditional notation it is described as:

    P[q = 10] = 7/20        P[q = 50] = 13/20.

Sometimes, a random variable is described as such an (image) distribution on


the reals, with the underlying distribution ω and observable q left implicit.
Alternatively, a random variable is sometimes described via a tilde ∼, as: q ∼
ω, like in phrases such as: “let q ∼ pois[λ] with . . . ”. This means that q : N →
R is an observable, on the underlying space N of the Poisson distribution.
It comes close to what our convention will be, namely to describe a random
variable as a pair of a state and an observable, with the same underlying space.
It should solve what is called in [103, §16] the “notational confusion between
a random variable and its distribution”.


Definition 4.1.7. 1 A random variable on a sample space (set) X consists of
  two parts:

  • a state ω ∈ D(X);
  • an observable R : X → R.

2 The probability mass function P[R = (−)] : R → [0, 1] associated with the
  random variable (ω, R) is the image distribution on R given by:

       P[R = (−)] ≔ D(R)(ω) = R ≫ ω,        (4.3)

  where R is understood as a deterministic channel in the last expression R ≫
  ω, see Lemma 1.8.3 (4).
3 The expected value E[R] of a random variable (ω, R) is the validity of R in
  ω, which can be expressed as mean of the image distribution:

       E[R] ≔ ω |= R = mean(R ≫ ω) = mean(D(R)(ω)).        (4.4)

The image distribution in (4.3) can be described in several (equivalent) ways:

    P[R = r] = D(R)(ω)(r) = ( R ≫ ω )(r) = ( R ≫ ω ) |= 1_r
             = Σ_{x, R(x)=r} ω(x)
             = Σ_{x∈R⁻¹(r)} ω(x)
             = ω |= 1_{R⁻¹(r)} .

In the third item of the above definition, Equation (4.4) holds since:

    mean(R ≫ ω) = Σ_{r∈R} (R ≫ ω)(r) · r    see Definition 4.1.3
                = Σ_{r∈R} D(R)(ω)(r) · r
                = Σ_{r∈R} ( Σ_{x∈R⁻¹(r)} ω(x) ) · r
                = Σ_{x∈X} ω(x) · R(x)
                = ω |= R.
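The pet example makes these definitions executable: the probability mass function is the image distribution D(q)(ω), and the expected value is both ω |= q and the mean of that image. A sketch in our own encoding (the helper name image is ours):

```python
def image(omega, f):
    """Image distribution D(f)(omega): the pushforward of omega along f."""
    out = {}
    for x, r in omega.items():
        out[f(x)] = out.get(f(x), 0) + r
    return out

omega = {'d': 2 / 5, 'c': 1 / 4, 'r': 3 / 20, 'h': 1 / 5}   # pet distribution
q = {'d': 50, 'c': 50, 'r': 10, 'h': 10}                    # monthly cost observable

pmf = image(omega, q.__getitem__)               # {10: 7/20, 50: 13/20}
expected = sum(r * c for c, r in pmf.items())   # mean of the image = E[q]
```

Computing ω |= q directly gives the same expected value, illustrating (4.4).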

Example 4.1.8. We consider the expected value for the sum of two dices.
In this situation we have an observable S : pips × pips → R, on the sam-
ple space pips = {1, 2, 3, 4, 5, 6}, given by S (i, j) = i + j. It forms a random
variable together with the product state dice ⊗ dice ∈ D(pips × pips). Recall,
dice = unif = Σ_{i∈pips} 1/6·|i⟩ is the uniform distribution unif on pips. First, the
distribution for the sum of the pips is the image distribution:

    S ≫ (dice ⊗ dice) = D(+)(dice ⊗ dice)
      = Σ_{2≤n≤12} ( Σ_{i+j=n} (dice ⊗ dice)(i, j) ) · |n⟩        (4.5)
      = 1/36·|2⟩ + 1/18·|3⟩ + 1/12·|4⟩ + 1/9·|5⟩ + 5/36·|6⟩ + 1/6·|7⟩
        + 5/36·|8⟩ + 1/9·|9⟩ + 1/12·|10⟩ + 1/18·|11⟩ + 1/36·|12⟩.

The expected value of the random variable (dice ⊗ dice, S ) is, according to
Definition 4.1.7 (3),

    mean( S ≫ (dice ⊗ dice) ) = dice ⊗ dice |= S
      = Σ_{i,j∈pips} (dice ⊗ dice)(i, j) · S (i, j)
      = Σ_{i,j∈pips} 1/6 · 1/6 · (i + j)
      = ( Σ_{i,j∈pips} (i + j) ) / 36 = 252/36 = 7.
There is a more abstract way to look at this example, using Lemma 4.1.5.
We have used dice as a distribution on pips = {1, . . . , 6}, but we can also see it
as a distribution dice ∈ D(N) on the natural numbers. The sum of pips that we
are interested in can then be described via a sum of distributions dice + dice,
using the sum + from Proposition 2.4.5. Then, by Lemma 4.1.5, we also get:

    mean(dice + dice) = mean(dice) + mean(dice) = 7/2 + 7/2 = 7.
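The same computation as a sketch in Python (our own encoding): build the image distribution (4.5) of the sum observable and take its mean.

```python
from fractions import Fraction
from itertools import product

dice = {i: Fraction(1, 6) for i in range(1, 7)}

# Image distribution of S(i, j) = i + j under dice ⊗ dice, as in (4.5).
sum_dist = {}
for (i, p), (j, q) in product(dice.items(), dice.items()):
    sum_dist[i + j] = sum_dist.get(i + j, 0) + p * q

mean_sum = sum(p * n for n, p in sum_dist.items())
```

This yields sum_dist[7] = 1/6 and mean_sum = 7, agreeing with both computations above.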

We conclude with a result for which it is relevant to know in which state we
are evaluating an observable.

Lemma 4.1.9. Let M = (M, +, 0) be a commutative monoid, with two distri-
butions ω, ρ ∈ D(M), and with an observable p : M → R that is a map of
(additive) monoids. Then:

    (ω + ρ) |= p = (ω |= p) + (ρ |= p),

where the sum of distributions ω + ρ is defined in (2.18).

This result involves three different random variables, namely:

    (ω + ρ, p)        (ω, p)        (ρ, p).


Proof. By unravelling the definitions:

    (ω + ρ) |= p = Σ_{x∈M} D(+)(ω ⊗ ρ)(x) · p(x)
                 = Σ_{x∈M} Σ_{y,z∈M, y+z=x} ω(y) · ρ(z) · p(x)
                 = Σ_{y,z∈M} ω(y) · ρ(z) · p(y + z)
                 = Σ_{y,z∈M} ω(y) · ρ(z) · ( p(y) + p(z) )
                 = ( Σ_{y∈M} ω(y) · p(y) ) · ( Σ_{z∈M} ρ(z) ) + ( Σ_{y∈M} ω(y) ) · ( Σ_{z∈M} ρ(z) · p(z) )
                 = Σ_{y∈M} ω(y) · p(y) + Σ_{z∈M} ρ(z) · p(z)
                 = (ω |= p) + (ρ |= p).

Exercises
4.1.1 Check that the average of the set n + 1 = {0, 1, . . . , n}, considered as
random variable, is n/2.
4.1.2 In Example 4.1.8 we have seen that dice ⊗ dice |= S = 7, for the
observable S : pips × pips → R with S (x, y) = x + y.

    1 Now define T : pips³ → R by T (x, y, z) = x + y + z. Prove that
      dice ⊗ dice ⊗ dice |= T = 21/2.
    2 Can you generalise, and show that summing on pipsⁿ yields validity
      7n/2?
4.1.3 The birthday paradox tells that with at least 23 people in a room there
is a more than 50% chance that at least two of them have the same
birthday. This is called a ‘paradox’, because the number 23 looks sur-
prisingly low.
Let us scale this down so that the problem becomes manageable.
Suppose there are three people, each with their birthday in the same
week.
1 Show that the probability that they all have different birthdays is
30
49 .
2 Conclude that the probability that at least two of the three have the
19
same birthday is 49 .
    3 Consider the set days ≔ {1, 2, 3, 4, 5, 6, 7}. The set N[3](days) of
      multisets of size three contains the possible combinations of the
      three birthdays. Define the sharp predicate p : N[3](days) → {0, 1}
      by p(ϕ) = 1 iff (ϕ) ≤ 3. Check that p holds in those cases where at
      least two birthdays coincide — see also Exercise 1.5.8.
    4 Show that the probability 19/49 of at least two coinciding birthdays
      can also be obtained via validity, namely as:

           mn[3](unif_days) |= p = 19/49.

4.1.4 Consider an arbitrary distribution ω ∈ D(X).

    1 Check that for a function f : X → Y and an observable q ∈ Obs(Y),

           f ≫ ω |= q = ω |= q ◦ f.

    2 Observe that we get as special case, for an observable p : X → R,

           mean(p ≫ ω) = p ≫ ω |= id = ω |= p,

      where the identity function id : R → R is used as observable.


4.1.5 Let ω, ρ ∈ D(X) be given distributions. Prove the following two equa-
tions, in the style of Lemma 4.1.5.

    1 For observables p, q : X → R, and thus for random variables (ω, p)
      and (ρ, q), one has:

           mean( D(p)(ω) + D(q)(ρ) ) = (ω |= p) + (ρ |= q).

      In the notation of Definition 4.1.7 (2) we can also write the left-hand
      side as: mean( P[p = (−)] + P[q = (−)] ).
    2 For channels c, d : X ◦→ R,

           mean( (c ≫ ω) + (d ≫ ρ) ) = (ω |= mean ◦ c) + (ρ |= mean ◦ d).

4.1.6 Let Ω ∈ D(D(X)) be a distribution of distributions, with an ob-
servable p : X → R. We turn p into a ‘validity of p’ observable
(−) |= p : D(X) → R on D(X). Show that:

           flat(Ω) |= p = Ω |= ( (−) |= p ).

4.1.7 We have introduced the mean in Definition 4.1.3 for distributions


whose space is a subset of the reals. By definition, these distributions
have finite support. But the definition carries over to distributions with
infinite support, as described in Remark 2.1.4. Apply this to the Pois-
son distribution pois[λ] ∈ D∞ (N) and show that

           mean(pois[λ]) = λ.



4.1.8 Consider the ‘mean’ operation from Definition 4.1.3 as a function
mean : D(R) → R. Show that it interacts in the following way with
the unit and flatten maps for the distribution monad D, see Section 2.1.

    1 mean ◦ unit = id ;
    2 mean ◦ flat = mean ◦ D(mean);
    3 Let p : X → R be an observable, giving a function (−) |= p : D(X) →
      R, sending ω ∈ D(X) to ω |= p in R. Show that the following dia-
      gram commutes.

                       D((−)|=p)
            D(D(X)) ─────────────→ D(R)
              flat │                 │ mean
                   ↓                 ↓
              D(X)  ─────────────→   R
                        (−)|=p

      From the first two items we can conclude that mean is an (Eilenberg-
      Moore) algebra of the distribution monad, see Subsection 1.9.4. The
      third item says that (−) |= p : D(X) → R is a homomorphism of
      algebras.
4.1.9 For a distribution ω ∈ D(X) write I(ω) : X → R≥0 for the information content or surprise of the distribution ω, defined as the factor:

    I(ω)(x) := { −log ω(x)   if ω(x) ≠ 0, i.e. if x ∈ supp(ω)
               { 0           otherwise.

1 Check that Kullback-Leibler divergence, see Definition 2.7.1, can be described as validity: for ω, ρ ∈ D(X) with supp(ω) ⊆ supp(ρ),

    DKL(ω, ρ) = ω |= (I(ρ) − I(ω)).

2 The Shannon entropy H(ω) of ω and the cross entropy H(ω, ρ) of ω, ρ are defined as validities:

    H(ω)    := ω |= I(ω) = − Σx ω(x) · log ω(x)
    H(ω, ρ) := ω |= I(ρ) = − Σx ω(x) · log ρ(x).

Thus, Shannon entropy is ‘expected surprise’. Show that:

    H(ω, ρ) = H(ω) + DKL(ω, ρ).
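These definitions are easy to test numerically. The sketch below is our own rendering, with base-2 logarithms and distributions as Python dictionaries; it checks the identity H(ω, ρ) = H(ω) + DKL(ω, ρ) on a small example.

```python
import math

def surprise(w):
    """I(w)(x) = -log w(x) on the support of w (base-2 logarithm here)."""
    return {x: -math.log(p, 2) for x, p in w.items() if p > 0}

def validity(w, obs):
    """w |= p  =  sum_x w(x) * p(x)."""
    return sum(p * obs[x] for x, p in w.items() if p > 0)

def H(w):            # Shannon entropy: w |= I(w), the 'expected surprise'
    return validity(w, surprise(w))

def cross_H(w, r):   # cross entropy: w |= I(r); needs supp(w) contained in supp(r)
    return validity(w, surprise(r))

def KL(w, r):        # DKL(w, r) = w |= (I(r) - I(w))
    Iw, Ir = surprise(w), surprise(r)
    return validity(w, {x: Ir[x] - Iw[x] for x in Iw})

w = {'a': 0.5, 'b': 0.25, 'c': 0.25}
r = {'a': 0.25, 'b': 0.25, 'c': 0.5}
```

With these dyadic probabilities, H(w) comes out as exactly 1.5 bits and the exercise's identity holds up to rounding.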
4.1.10 Consider the entropy function H from the previous exercise.
1 Show that H(ω) = 0 implies that ω ∈ D(X) is a point state ω = 1|x⟩, for a unique element x ∈ X.
2 Show that for σ ∈ D(X) and τ ∈ D(Y),

    H(σ ⊗ τ) = H(σ) + H(τ).

233
234 Chapter 4. Observables and validity

3 Strengthen the previous item to non-entwinedness: for ω ∈ D(X × Y),

    H(ω) = H(ω[1, 0]) + H(ω[0, 1]) ⟺ ω = ω[1, 0] ⊗ ω[0, 1].

  Hint: Use for finitely many numbers ri, si ∈ [0, 1], where Σi ri = 1 = Σi si, that Σi ri · log(ri) = Σi ri · log(si) implies ri = si for each i.

4.1.11 Let ω ∈ D(X) be an arbitrary state on a finite set X. Use Proposition 2.7.4 (1) to prove that uniform states have maximal Shannon entropy, that is, H(ω) ≤ H(unif), where unif ∈ D(X) is the uniform distribution.

4.2 The structure of observables


This section describes the algebraic structures that the various types of observ-
ables have, without going too much into mathematical details. We don’t wish
to turn this section into a sequence of formal definitions. Hence we present the
essentials and refer to the literature for details. We include the basic fact that
mapping a set X to observables on X is functorial, in a suitable sense. This
will give rise to the notion of weakening. It plays the same (or dual) role for
observables that marginalisation plays for states.
It turns out that our four types of observables are all (commutative) monoids,
via multiplication / conjunction, but in different universes. The findings are
summarised in the following table.

name type monoid in

observable X→R ordered vector spaces

factor X → R≥0 ordered cones (4.6)


predicate X → [0, 1] effect modules

sharp predicate X → {0, 1} join semilattices

The least well-known structures in this table are effect modules. They will thus
be described in greatest detail, in Subsection 4.2.3.


4.2.1 Observables
Observables are R-valued functions on a set (or sample space) which are often
written as capitals X, Y, . . .. Here, these letters are typically used for sets and
spaces. We shall use letters p, q, . . . for observables in general, and in particular
for predicates. In some settings one allows observables X → R^n to the n-dimensional space of real numbers. Whenever needed, we shall use such maps as n-ary tuples ⟨p1, . . . , pn⟩ : X → R^n of observables pi : X → R, see also Section 1.1.
Let us fix a set X, and consider the collection Obs(X) = R^X of observables
on X. What structure does it have?

• Given two observables p, q ∈ Obs(X), we can add them pointwise, giving


p + q ∈ Obs(X) via (p + q)(x) = p(x) + q(x).
• Given an observable p ∈ Obs(X) and a scalar s ∈ R we can form a ‘re-
scaled’ observable s · p ∈ Obs(X) via (s · p)(x) = s · p(x). In this way we get
−p = (−1) · p and 0 = 0 · p for any observable p, where 0 = 1∅ ∈ Obs(X) is
taken from Definition 4.1.1 (2).
• For observables p, q ∈ R^X we have a partial order p ≤ q defined by: p(x) ≤ q(x) for all x ∈ X.

Together these operations of sum + and scalar multiplication · make R^X into a vector space over the real numbers, since + and · satisfy the appropriate axioms of vector spaces. Moreover, this is an ordered vector space by the third bullet. One can restrict to bounded observables p : X → R for which there is a bound B ∈ R>0 such that −B ≤ p(x) ≤ B for all x ∈ X. The collection of such bounded observables forms an order unit space [3, 83, 127, 132].
On Obs(X) = R^X there is also a commutative monoid structure (1, &), where
& is pointwise multiplication: (p & q)(x) = p(x) · q(x). We prefer to write this
operation as logical conjunction because that is what it is when restricted to
predicates. Besides, having yet another operation that is written as dot · might
be confusing. We will occasionally write p^n for p & · · · & p (n times).
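The vector space and monoid operations above are all pointwise, which a few lines of code make concrete; representing observables on a finite set as Python dictionaries is our own choice, not the book's.

```python
# Observables on a finite set X, represented as dictionaries X -> R.
X = ['a', 'b', 'c']

def add(p, q):    return {x: p[x] + q[x] for x in X}    # pointwise sum
def scale(s, p):  return {x: s * p[x] for x in X}       # scalar multiplication
def conj(p, q):   return {x: p[x] * q[x] for x in X}    # p & q, pointwise product
def leq(p, q):    return all(p[x] <= q[x] for x in X)   # pointwise order

p = {'a': 1.0, 'b': -2.0, 'c': 0.5}
q = {'a': 0.5, 'b': 3.0, 'c': 0.5}
one = {x: 1.0 for x in X}    # the truth observable, unit for &
```

For instance, conj(p, one) returns p itself, reflecting that 1 is the unit of the monoid (1, &).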

4.2.2 Factors
We recall that a factor is a non-negative observable and that we write Fact(X) = (R≥0)^X for the set of factors on a set X. Factors occur in probability theory for instance in the so-called sum-product algorithms for computing marginals and posterior distributions in graphical models. These matters are postponed to Chapter ??; here we concentrate on the mathematical properties of factors.
The set Fact(X) looks like a vector space, except that there are no negatives.


Using the notation introduced in the previous subsection, we have:

Fact(X) = {p ∈ Obs(X) | p ≥ 0}.

Factors can be added pointwise, with identity element 0 ∈ Fact(X), like ran-
dom variables. But a factor p ∈ Fact(X) cannot be re-scaled with an ar-
bitrary real number, but only with a non-negative number s ∈ R≥0 , giving
s · p ∈ Fact(X). These structures are often called cones. The cone Fact(X) is
positive, since p + q = 0 implies p = q = 0. It is also cancellative: p + r = q + r
implies p = q.
The monoid (1, &) on Obs(X) restricts to Fact(X), since 1 ≥ 0 and if p, q ≥ 0
then also p & q ≥ 0.

4.2.3 Predicates
We first note that the set Pred (X) = [0, 1]^X of predicates on a set X contains falsity 0 and truth 1, which are always 0 (resp. 1). There are some noteworthy differences between predicates on the one hand and observables and factors on the other hand.

• Predicates are not closed under addition, since the sum of two numbers in [0, 1] may lie outside [0, 1]. Thus, addition of predicates is a partial operation, and is then written as p > q. Thus: p > q is defined if p(x) + q(x) ≤ 1 for all x ∈ X, and in that case (p > q)(x) = p(x) + q(x).
  This operation > is commutative and associative in a suitably partial sense. Moreover, it has 0 as identity element: p > 0 = p = 0 > p. This structure (Pred (X), 0, >) is called a partial commutative monoid.
• There is a ‘negation’ of predicates, written as p⊥, and called orthosupplement. It is defined as p⊥ = 1 − p, that is, as p⊥(x) = 1 − p(x). Then: p > p⊥ = 1 and p⊥⊥ = p.
• Predicates are closed under scalar multiplication s · p, but only if one restricts the scalar s to be in the unit interval [0, 1]. Such scalar multiplication interacts nicely with partial addition >, in the sense that s · (p > q) = (s · p) > (s · q).

The combination of these items means that the set Pred (X) carries the structure
of an effect module [68], see also [44]. These structures arose in mathemati-
cal physics [49] in order to axiomatise the structure of quantum predicates on
Hilbert spaces.
The effect module Pred (X) also carries a commutative monoid structure for
conjunction, namely (1, &). Indeed, when p, q ∈ Pred (X), then also p & q ∈
Pred (X). We have p & 0 = 0 and p & (q1 > q2 ) = (p & q1 ) > (p & q2 ).
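A minimal sketch of this partial structure, under our own representation of fuzzy predicates as dictionaries; returning None models undefinedness of the partial sum >.

```python
# Fuzzy predicates on a finite set X as dictionaries X -> [0, 1].
X = ['a', 'b']

def ovee(p, q):
    """Partial sum p > q: defined only when p(x) + q(x) <= 1 everywhere."""
    if any(p[x] + q[x] > 1 for x in X):
        return None                       # undefined: p and q are not orthogonal
    return {x: p[x] + q[x] for x in X}

def orth(p):
    """Orthosupplement p⊥, i.e. pointwise 1 - p."""
    return {x: 1 - p[x] for x in X}

p = {'a': 0.3, 'b': 0.6}
# p > p⊥ is the truth predicate 1, while p > p is undefined since p('b') + p('b') > 1.
```

This mirrors the effect-algebra axioms: p > p⊥ = 1 always succeeds, whereas a sum exceeding 1 at some point is simply not formed.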


Since these effect structures are not so familiar, we include more formal
descriptions.

Definition 4.2.1. 1 A partial commutative monoid (PCM) consists of a set M


with a zero element 0 ∈ M and a partial binary operation > : M × M → M
satisfying the three requirements below. They involve the notation x ⊥ y for:
x > y is defined; in that case x, y are called orthogonal.
• Commutativity: x ⊥ y implies y ⊥ x and x > y = y > x;
• Associativity: y ⊥ z and x ⊥ (y > z) implies x ⊥ y and (x > y) ⊥ z and
also x > (y > z) = (x > y) > z;
• Zero: 0 ⊥ x and 0 > x = x;
2 An effect algebra is a PCM (E, 0, >) with an orthosupplement. The latter is
a total unary ‘negation’ operation (−)⊥ : E → E satisfying:
• x⊥ ∈ E is the unique element in E with x > x⊥ = 1, where 1 = 0⊥ ;
• x ⊥ 1 ⇒ x = 0.
A homomorphism E → D of effect algebras is given by a function f : E → D between the underlying sets satisfying f(1) = 1, and if x ⊥ x′ in E then both f(x) ⊥ f(x′) in D and f(x > x′) = f(x) > f(x′). Effect algebras and their homomorphisms form a category, denoted by EA.
3 An effect module is an effect algebra E with a scalar multiplication s · x, for
s ∈ [0, 1] and x ∈ E forming an action:

1·x = x (r · s) · x = r · (s · x),

and preserving sums (that exist) in both arguments:

0·x = 0 (r + s) · x = r · x > s · x
s·0 = 0 s · (x > y) = s · x > s · y.

We write EMod for the category of effect modules, where morphisms are
maps of effect algebras that preserve scalar multiplication (i.e. are ‘equivari-
ant’).

The following notion of ‘test’ comes from a quantum context and captures
‘compatible’ observations, see e.g. [36, 130]. It will be used occasionally later
on, for instance in Exercise 6.1.4 or Proposition 6.7.10.

Definition 4.2.2. A test or, more explicitly, an n-test on a set X is an n-tuple


of predicates p1 , . . . , pn : X → [0, 1] satisfying p1 > · · · > pn = 1.
This notion of test can be formulated in an arbitrary effect algebra. Here we
do it in Pred (X) only.


Each predicate p forms a 2-test p, p⊥ . Exercise 4.2.10 asks to show that an


n-test of predicates on X corresponds to a channel X → n. In particular, each
predicate on X corresponds to a channel X → 2, see Exercise 4.3.5.
The probabilities ω(xi ) ∈ [0, 1] of a distribution form a test on a singleton
set 1.
The following easy observations give a normal form for predicates on a finite
set.

Lemma 4.2.3. Consider a finite set X = {x1 , . . . , xn }.

1 The point predicates 1 x1 , . . . , 1 xn form a test on X.


2 Each predicate p : X → [0, 1] has a normal form p = >i ri · 1 xi , with scalars
ri = p(xi ) ∈ [0, 1].

On a set X = {a, b, c} we can write a state and a predicate as:

    1/3|a⟩ + 1/6|b⟩ + 1/2|c⟩        2/3 · 1a > 5/6 · 1b > 1 · 1c.

Writing > looks a bit pedantic, so we often simply write + instead. Recall that
in a predicate the probabilities need not add up to one, see Remark 4.1.6.
A set of predicates p1, . . . , pn on the same space can be turned into a test via (pointwise) normalisation. Below we describe an alternative construction which is sometimes called stick breaking, see e.g. [94, Defn. 1]. It is commonly used for probabilities r1, . . . , rn ∈ [0, 1], but it applies to predicates as well; indeed, it may be described even more generally, inside an arbitrary effect algebra with a commutative monoid structure for conjunction. The idea is to break up a stick in two pieces, multiple times: first according to the ratio r1; you then put one part, say the left one, aside and break up the other part according to ratio r2; and so on.

Lemma 4.2.4. Let p1 , . . . , pn be arbitrary predicates, all on the same set, for
n ≥ 1. We can turn them via “stick breaking” into an n+1-test q1 , . . . , qn+1 via
definitions:

    q1 := p1
    qi+1 := p⊥1 & · · · & p⊥i & pi+1    for 0 < i < n
    qn+1 := p⊥1 & · · · & p⊥n.

Proof. We prove >i≤n+1 qi = 1 by induction on n ≥ 1. For n = 1 we get qn+1 = q2 = p⊥1, and then clearly: q1 > q2 = p1 > p⊥1 = 1. Next, using that multiplication & distributes over sum >, we get:

    q1 > q2 > · · · > qn+1 > qn+2
        = p1 > (p⊥1 & p2) > (p⊥1 & p⊥2 & p3) > · · ·
              · · · > (p⊥1 & p⊥2 & · · · & p⊥n & pn+1) > (p⊥1 & · · · & p⊥n+1)
        = p1 > p⊥1 & [ p2 > (p⊥2 & p3) > · · ·
              · · · > (p⊥2 & p⊥3 & · · · & p⊥n & pn+1) > (p⊥2 & · · · & p⊥n+1) ]
    (IH)
        = p1 > (p⊥1 & 1)
        = 1.

As mentioned above, the stick breaking construction is commonly used for probabilities ri ∈ [0, 1]. That is a special case of the above construction, with singleton sample space. We can reformulate this construction as a stick breaking channel stbr of the form:

    stbr : [0, 1]^n → D(n+1) = D({0, 1, . . . , n}),

with:

    stbr(r1, . . . , rn) := r1|0⟩ + (1−r1)·r2|1⟩ + · · · + (∏i<n (1−ri))·rn|n−1⟩
                              + (∏i<n+1 (1−ri))|n⟩.                          (4.7)
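The stick breaking channel (4.7) can be sketched iteratively: at step i a fraction ri of the remaining stick is broken off. The code below is our own illustration, not the book's.

```python
def stbr(rs):
    """Stick breaking channel (4.7): probabilities r1..rn in [0, 1] are turned
    into a distribution on {0, 1, ..., n}."""
    dist = {}
    remaining = 1.0                  # length of the stick still unbroken
    for i, r in enumerate(rs):
        dist[i] = remaining * r      # break off fraction r of what is left
        remaining *= (1 - r)
    dist[len(rs)] = remaining        # the final leftover piece
    return dist

d = stbr([0.5, 0.5, 0.5])
# d == {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}; the masses sum to one.
```

Halving three times leaves a final piece equal to the last broken-off piece, which matches the last two coefficients of (4.7).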

4.2.4 Sharp predicates


The structure of the set SPred (X) = {0, 1}^X is most familiar to logicians: it
forms a Boolean algebra. The join p ∨ q of p, q ∈ SPred (X) is the pointwise
join (p ∨ q)(x) = p(x) ∨ q(x), where the latter disjunction ∨ is the obvious one
on {0, 1}. One can see these disjunctions (0, ∨) as forming an additive structure
with negation (−)⊥ . Conjunction (1, &) forms a commutative structure on top.
Formally one can say that SPred (X) also has scalar multiplication, with
scalars from {0, 1}, in such a way that 0 · p = 0 and 1 · p = p.
Exercise 4.2.9 below contains several alternative characterisations of sharp-
ness for predicates.
The idea of a free structure as a ‘minimal extension’ has occurred earlier, e.g. in Propositions 1.2.3 and 1.4.4. It also applies in the context of predicates,
where fuzzy predicates are a free extension of sharp predicates. This is a funda-
mental insight that will be formulated in terms of effect algebras and modules.
Later on in Subsection ?? we will see that such a free extension also exists in
continuous probability and forms the essence of integration (as validity).


Theorem 4.2.5. 1 For an arbitrary set X, the indicator function

    1(−) : P(X) = SPred (X) → Pred (X)

is a homomorphism of effect algebras.
2 Let X now be finite. This indicator map makes Pred (X) into the free effect module on P(X), as described below: for each effect module E with a map of effect algebras f : P(X) → E, there is a unique map of effect modules f̄ : Pred (X) → E in:

                           1(−)
       P(X) = SPred (X) ---------> Pred (X)
                   \                  |
                  f \                 | f̄ , homomorphism of effect modules
                     \                v
                      `------------>  E

Proof. 1 See Exercise 4.2.3.
2 We use that each predicate p ∈ Pred (X) can be written in normal form as p = >x∈X p(x) · 1x, see Lemma 4.2.3 (2). So we define f̄ as:

    f̄(p) := >x∈X p(x) · f({x}).

Obviously, f̄(0) = 0. Also:

    f̄(1) = >x∈X 1 · f({x}) = f( ∪x∈X {x} ) = f(X) = 1.

If p ⊥ q, then p(x) + q(x) ≤ 1 for each x ∈ X. Hence:

    f̄(p > q) = >x∈X (p > q)(x) · f({x}) = >x∈X (p(x) + q(x)) · f({x})
             = >x∈X (p(x) · f({x})) > (q(x) · f({x}))
             = ( >x∈X p(x) · f({x}) ) > ( >x∈X q(x) · f({x}) )
             = f̄(p) > f̄(q).

Scalar multiplication is also preserved:

    f̄(r · p) = >x∈X (r · p)(x) · f({x}) = >x∈X r · (p(x) · f({x}))
             = r · >x∈X p(x) · f({x})
             = r · f̄(p).

Finally, for uniqueness, let g : Pred (X) → E be a map of effect modules with g(1U) = f(U) for each U ∈ P(X). Then g = f̄ since:

    f̄(p) = >x∈X p(x) · f({x}) = >x∈X p(x) · g(1x)
         = g( >x∈X p(x) · 1x ) = g(p).

Now that we have seen observables, with factors and (sharp) predicates as special cases, we can say that these classes of observables all share the same multiplicative structure (1, &) for conjunction, but that their additive structures and scalar multiplications differ. The latter, however, are preserved under taking validity — and not (1, &).

Lemma 4.2.6. Let ω ∈ D(X) be a distribution on a set X. Operations on observables p, q on X satisfy, whenever defined,

1 ω |= 0 = 0;
2 ω |= (p + q) = (ω |= p) + (ω |= q);
3 ω |= p⊥ = 1 − (ω |= p);
4 ω |= (s · p) = s · (ω |= p).
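The preservation properties of Lemma 4.2.6 can be checked mechanically; `validity` as a Python function is our own shorthand, and the example uses dyadic numbers so that all floating point arithmetic is exact.

```python
def validity(w, p):
    """w |= p  =  sum_x w(x) * p(x)."""
    return sum(wx * p[x] for x, wx in w.items())

w = {'a': 0.25, 'b': 0.75}
p = {'a': 0.5, 'b': 0.125}
q = {'a': 0.25, 'b': 0.5}

orth_p = {x: 1 - px for x, px in p.items()}      # the orthosupplement p⊥
sum_pq = {x: p[x] + q[x] for x in p}             # p + q (values stay in [0, 1] here)

lhs = validity(w, sum_pq)                        # w |= (p + q)
rhs = validity(w, p) + validity(w, q)            # (w |= p) + (w |= q), item 2
```

Item 3 of the lemma comes out as validity(w, orth_p) == 1 - validity(w, p) on this example.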

4.2.5 Parallel products and weakening


Earlier we have seen parallel products ⊗ of states and of channels. This ⊗ can
be defined for observables too, and is then often called parallel conjunction.
The difference between parallel conjunction ⊗ and sequential conjunction &
is that ⊗ acts on observables on different sets X, Y and yields an outcome on
the product set X × Y, whereas & works for observables on the same set Z,
and produces a conjunction observable again on Z. These ⊗ and & are inter-
definable, via transformation = of observables, see Section 4.3 — in particular
Exercise 4.3.7.

Definition 4.2.7. 1 Let p be an observable on a set X, and q on Y. Then we define a new observable p ⊗ q on X × Y by:

    (p ⊗ q)(x, y) := p(x) · q(y).

2 Suppose we have an observable p on a set X and we like to use p on the product X × Y. This can be done by taking p ⊗ 1 instead. This p ⊗ 1 is called a weakening of p. It satisfies (p ⊗ 1)(x, y) = p(x).
More generally, consider a product X1 × · · · × Xn. For an observable p on the i-th set Xi, we weaken p to a predicate on the whole product X1 × · · · × Xn, namely:

    1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1,

with i−1 copies of 1 before p and n−i copies after it.

Weakening is a structural operation in logic which makes it possible to use


a predicate p(x) depending on a single variable x in a larger context where
one has for instance two variables x, y by ignoring the additional variable y.
Weakening is usually not an explicit operation, except in settings like linear
logic where one has to be careful about the use of resources. Here, we need
weakening as an explicit operation in order to avoid type mismatches between
observables and underlying sets.
Recall that marginalisation of states is an operation that moves a state to a smaller underlying set by projecting away. Weakening can be seen as a dual operation, moving an observable to a larger context. There is a close relationship between marginalisation and weakening via validity: for a state ω ∈ D(X1 × · · · × Xn) and an observable p on Xi we have:

    ω |= 1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1 = D(πi)(ω) |= p
                                            = ω[0, . . . , 0, 1, 0, . . . , 0] |= p,    (4.8)

where the 1 in the latter marginalisation mask is at position i. Soon we shall see
an alternative description of weakening in terms of predicate transformation.
The above equation then appears as a special case of a more general result,
namely of Proposition 4.3.3.
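Equation (4.8) can be illustrated on a small joint state; the dictionary encoding and the helper names (`marginal_fst`, `weaken_fst`) are our own, and the dyadic numbers keep the floating point arithmetic exact.

```python
def validity(w, p):
    """w |= p  =  sum_x w(x) * p(x)."""
    return sum(wx * p[x] for x, wx in w.items())

# A (possibly entwined) joint state on X × Y.
w = {('a', 'u'): 0.25, ('a', 'v'): 0.25, ('b', 'u'): 0.5}

def marginal_fst(w):
    """First marginal D(π1)(w), written w[1, 0] in the text."""
    m = {}
    for (x, _), pr in w.items():
        m[x] = m.get(x, 0.0) + pr
    return m

def weaken_fst(p, ys):
    """Weakening p ⊗ 1 of an observable p on X to the product X × Y."""
    return {(x, y): px for x, px in p.items() for y in ys}

p = {'a': 3.0, 'b': 5.0}
lhs = validity(w, weaken_fst(p, ['u', 'v']))    # w |= p ⊗ 1
rhs = validity(marginal_fst(w), p)              # w[1, 0] |= p
```

Both sides evaluate to the same number, as (4.8) predicts, even though the joint state is not a product state.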
We mention some basic results about parallel products of observables. More
such ‘logical’ results can be found in the exercises.

Lemma 4.2.8. For states σ ∈ D(X), τ ∈ D(Y) and observables p ∈ Obs(X), q ∈ Obs(Y) one has:

    σ ⊗ τ |= p ⊗ q = (σ |= p) · (τ |= q).

Proof. Easy:

    σ ⊗ τ |= p ⊗ q = Σz∈X×Y (σ ⊗ τ)(z) · (p ⊗ q)(z)
                   = Σx∈X,y∈Y (σ ⊗ τ)(x, y) · (p ⊗ q)(x, y)
                   = Σx∈X,y∈Y σ(x) · τ(y) · p(x) · q(y)
                   = ( Σx∈X σ(x) · p(x) ) · ( Σy∈Y τ(y) · q(y) )
                   = (σ |= p) · (τ |= q).

Lemma 4.2.9. 1 For observables pi, qi ∈ Obs(Xi) one has:

    (p1 ⊗ · · · ⊗ pn) & (q1 ⊗ · · · ⊗ qn) = (p1 & q1) ⊗ · · · ⊗ (pn & qn).

2 Parallel composition p ⊗ q of observables p ∈ Obs(X) and q ∈ Obs(Y) can be defined in terms of weakening and conjunction, namely as:

    p ⊗ q = (p ⊗ 1) & (1 ⊗ q).

Proof. 1 For elements xi ∈ Xi,

    ((p1 ⊗ · · · ⊗ pn) & (q1 ⊗ · · · ⊗ qn))(x1, . . . , xn)
        = (p1 ⊗ · · · ⊗ pn)(x1, . . . , xn) · (q1 ⊗ · · · ⊗ qn)(x1, . . . , xn)
        = (p1(x1) · . . . · pn(xn)) · (q1(x1) · . . . · qn(xn))
        = (p1(x1) · q1(x1)) · . . . · (pn(xn) · qn(xn))
        = (p1 & q1)(x1) · . . . · (pn & qn)(xn)
        = ((p1 & q1) ⊗ · · · ⊗ (pn & qn))(x1, . . . , xn).

2 Directly from the previous item:

    (p ⊗ 1) & (1 ⊗ q) = (p & 1) ⊗ (1 & q) = p ⊗ q.

4.2.6 Functoriality
We would like to conclude this section with some categorical observations. They are not immediately relevant for the sequel and may be skipped. We shall be using four (new) categories:
• Vect, with vector spaces (over the real numbers) as objects and linear maps
as morphisms between them (preserving addition and scalar multiplication);
• Cone, with cones as objects and also with linear maps as morphisms, but
this time preserving scalar multiplication with non-negative reals only;


• EMod, with effect modules as objects and homomorphisms of effect mod-


ules as maps between them (see Definition 4.2.1);
• BA, with Boolean algebras as objects and homomorphisms of Boolean al-
gebras (preserving finite joins and negations, and then also finite meets).
Recall from Subsection 1.9.1 that we write C^op for the opposite of a category C, with arrows reversed. This opposite is needed in the following result.
Proposition 4.2.10. Taking particular observables on a set is functorial: there are functors:

1 Obs : Sets → Vect^op;
2 Fact : Sets → Cone^op;
3 Pred : Sets → EMod^op;
4 SPred : Sets → BA^op.

On maps f : X → Y in Sets these functors are all defined by the ‘pre-compose with f’ operation q ↦ q ◦ f. They thus reverse the direction of morphisms, which necessitates the use of opposite categories (−)^op.
The above functors all preserve the partial order on observables and also
the commutative monoid structure (1, &), since they are defined pointwise.
Proof. We consider the first instance of observables in some detail. The other cases are similar. For a set X we have seen that Obs(X) = R^X is a vector space, and thus an object of the category Vect. Each function f : X → Y in Sets gives rise to a function Obs(f) : Obs(Y) → Obs(X) in the opposite direction. It maps an observable q : Y → R on Y to the observable q ◦ f : X → R on X. It is not hard to see that this function Obs(f) = (−) ◦ f preserves the vector space structure. For instance, it preserves sums, since they are defined pointwise. We shall prove this in a precise, formal manner. First, Obs(f)(0) is the function that maps x ∈ X to:

    Obs(f)(0)(x) = (0 ◦ f)(x) = 0(f(x)) = 0.
Hence Obs( f )(0) maps everything to 0 and is thus equal to the zero function
itself: Obs( f )(0) = 0. Next, addition + is preserved since:
    Obs(f)(p + q) = (p + q) ◦ f
                  = ( x ↦ (p + q)(f(x)) )
                  = ( x ↦ p(f(x)) + q(f(x)) )
                  = ( x ↦ Obs(f)(p)(x) + Obs(f)(q)(x) )
                  = Obs(f)(p) + Obs(f)(q).

We leave preservation of scalar multiplication to the reader and conclude that


Obs(f) is a linear function, and thus a morphism Obs(f) : Obs(Y) → Obs(X) in Vect. Hence Obs(f) is a morphism Obs(X) → Obs(Y) in the opposite category Vect^op. We still need to check that identity maps and composition are preserved. We do the latter. For f : X → Y and g : Y → Z in Sets we have, for r ∈ Obs(Z),

    Obs(g ◦ f)(r) = r ◦ (g ◦ f) = (r ◦ g) ◦ f
                  = Obs(g)(r) ◦ f
                  = Obs(f)( Obs(g)(r) )
                  = ( Obs(f) ◦ Obs(g) )(r)
                  = ( Obs(g) ◦^op Obs(f) )(r).

This yields Obs(g ◦ f) = Obs(g) ◦^op Obs(f), so that we get a functor of the form Obs : Sets → Vect^op.

Notice that saying that we have a functor like Pred : Sets → EMod^op contains a remarkable amount of information: about the mathematical structure on objects Pred (X), about preservation of this structure by maps Pred ( f ), and about preservation of identity maps and composition by Pred (−) on morphisms (see also Exercise 4.3.10). This makes the language of category theory both powerful and efficient.

Exercises
4.2.1 Consider the following question:

    An urn contains 10 balls of which 4 are red and 6 are blue. A second urn contains 16 red balls and an unknown number of blue balls. A single ball is drawn from each urn. The probability that both balls are the same color is 11/25. Find the number of blue balls in the second urn.

1 Check that the givens can be expressed in terms of validity as:

    11/25 = Flrn(4|R⟩ + 6|B⟩) ⊗ Flrn(16|R⟩ + x|B⟩) |= (1R ⊗ 1R) > (1B ⊗ 1B).

2 Prove, by solving the above equation, that there are 4 blue balls in the second urn.
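A brute-force check of this exercise with exact rational arithmetic; the helper `same_color_prob` is our own sketch, not from the text.

```python
from fractions import Fraction as F

def same_color_prob(blue):
    """The validity (1R ⊗ 1R) > (1B ⊗ 1B) in the product of the two urn states,
    with `blue` unknown blue balls in the second urn."""
    urn1 = {'R': F(4, 10), 'B': F(6, 10)}
    urn2 = {'R': F(16, 16 + blue), 'B': F(blue, 16 + blue)}
    return urn1['R'] * urn2['R'] + urn1['B'] * urn2['B']

# Solve by searching over plausible urn contents.
solutions = [x for x in range(1, 100) if same_color_prob(x) == F(11, 25)]
```

The linear equation (32 + 3x)/(5(16 + x)) = 11/25 has the unique solution x = 4, which the search confirms.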
4.2.2 Find examples of predicates p, q on a set X and a distribution ω on X
such that ω |= p & q and (ω |= p) · (ω |= q) are different.
4.2.3 1 Check that (sharp) indicator predicates 1E : X → [0, 1], for subsets
E ⊆ X, satisfy:
• 1E∩D = 1E & 1D , and thus 1 ⊗ 1 = 1, 1 ⊗ 0 = 0 = 0 ⊗ 1;
• 1E∪D = 1E > 1D , if E, D are disjoint;


• (1E)⊥ = 1¬E, where ¬E = X \ E = {x ∈ X | x ∉ E} is the complement of E.
Formally, the function 1(−) : P(X) → Pred (X) = [0, 1]^X is a homomorphism of effect algebras, see Theorem 4.2.5 (1).
2 Show that this indicator function is natural, in the sense that for f : X → Y the following diagram commutes.

               1(−)
       P(X) ---------> Pred (X)
         ^                ^
     f⁻¹ |                | (−) ◦ f
         |      1(−)      |
       P(Y) ---------> Pred (Y)

3 Now consider subsets E ⊆ X and D ⊆ Y of different sets X, Y


together with their product subset E × D ⊆ X ×Y. Show that 1E×D =
1E ⊗ 1D , with as special case 1(x,y) = 1 x ⊗ 1y .
4.2.4 One may expect the following implication between inequalities of validities:

    (σ |= p) ≤ (τ |= p) =⇒ (σ |= p & q) ≤ (τ |= p & q).

However, it fails. This exercise elaborates a counterexample. Take X = {a, b, c} with states:

    σ = 19/100|a⟩ + 47/100|b⟩ + 17/50|c⟩        τ = 1/5|a⟩ + 9/20|b⟩ + 7/20|c⟩

with predicates:

    p = 1 · 1a + 7/10 · 1b + 1/2 · 1c        q = 1/10 · 1a + 1/2 · 1b + 1/5 · 1c.

Now check consecutively that:

1 σ |= p = 689/1000 < 69/100 = τ |= p.
2 p & q = 1/10 · 1a + 7/20 · 1b + 1/10 · 1c.
3 σ |= p & q = 87/400 > 85/400 = τ |= p & q.

Find a counterexample yourself in which the predicate q is sharp.
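The fractions in this counterexample can be verified with exact rational arithmetic; the dictionary encoding and helper names (`validity`, `conj`) below are ours.

```python
from fractions import Fraction as F

X = ['a', 'b', 'c']
sigma = {'a': F(19, 100), 'b': F(47, 100), 'c': F(17, 50)}
tau   = {'a': F(1, 5),    'b': F(9, 20),   'c': F(7, 20)}
p = {'a': F(1), 'b': F(7, 10), 'c': F(1, 2)}
q = {'a': F(1, 10), 'b': F(1, 2), 'c': F(1, 5)}

def validity(w, r):
    """w |= r  =  sum_x w(x) * r(x)."""
    return sum(w[x] * r[x] for x in X)

def conj(r, s):
    """r & s, pointwise multiplication."""
    return {x: r[x] * s[x] for x in X}

# sigma |= p < tau |= p, and yet sigma |= p & q > tau |= p & q.
```

So conjunction with a fixed predicate q need not preserve inequalities of validities.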
4.2.5 Consider a state σ ∈ D(X), a factor p on X, and a predicate q on X which is non-zero on the support of σ. Show that:

    σ |= p/q ≥ 1 =⇒ (σ |= p) ≥ (σ |= q),

where (p/q)(x) = p(x)/q(x).
4.2.6 Prove that the following items are equivalent, for a state ω ∈ D(X)
and event E ⊆ X.
1 supp(ω) ⊆ E;


2 ω |= 1E = 1;
3 ω |= p & 1E = ω |= p for each p ∈ Obs(X).
4.2.7 Let p1, p2, p3 be predicates on the same set.
1 Show that:

    (p⊥1 ⊗ p⊥2)⊥ = (p1 ⊗ p2) > (p1 ⊗ p⊥2) > (p⊥1 ⊗ p2).

2 Show also that:

    (p⊥1 ⊗ p⊥2 ⊗ p⊥3)⊥ = (p1 ⊗ p2 ⊗ p3) > (p1 ⊗ p2 ⊗ p⊥3)
                        > (p1 ⊗ p⊥2 ⊗ p3) > (p1 ⊗ p⊥2 ⊗ p⊥3)
                        > (p⊥1 ⊗ p2 ⊗ p3) > (p⊥1 ⊗ p2 ⊗ p⊥3)
                        > (p⊥1 ⊗ p⊥2 ⊗ p3).

3 Generalise to n.
4.2.8 For predicates p, q on the same set, define Reichenbach implication ⊃ as:

    p ⊃ q := p⊥ > (p & q).

1 Check that:

    p ⊃ q = (p & q⊥)⊥,

  from which it easily follows that:

    p ⊃ 0 = p⊥        1 ⊃ q = q        p ⊃ 1 = 1        0 ⊃ q = 1.

2 Check also that:

    p⊥ ≤ p ⊃ q        q ≤ p ⊃ q.

3 Show that:

    p1 ⊃ (p2 ⊃ q) = (p1 & p2) ⊃ q.

4 For subsets E, D of the same set,

    1E ⊃ 1D = 1¬(E∩¬D) = 1¬E∪D.

  The subset ¬(E ∩ ¬D) = ¬E ∪ D is the standard interpretation of ‘E implies D’ in Boolean logic (of subsets).
4.2.9 Let p be a predicate on a set X. Prove that the following statements
are equivalent.
1 p is sharp;
2 p & p = p;
3 p & p⊥ = 0;


4 q ≤ p and q ≤ p⊥ implies q = 0, for each q ∈ Pred (X).


4.2.10 Show that an n-test p0 , . . . , pn−1 on a set X, see Definition 4.2.2, can
be identified with a channel c : X → n, with pi = c = 1i , for i ∈ n.
4.2.11 Let p and q be two arbitrary predicates. Prove that the following use
of the partial sum operation > for predicates is justified, that is, yields
a well-defined new predicate.
(p ⊗ q) > (p⊥ ⊗ q⊥ ).

4.2.12 For a random variable (ω, p), show that the validity of the observable p − (ω |= p) · 1 is zero, i.e.,

    ω |= ( p − (ω |= p) · 1 ) = 0.

Observables of this form are used later on to define (co)variance in Section 5.1.
4.2.13 For a multiset ϕ ∈ M(X) on a set X and a factor p ∈ Fact(X) on the same set, define the multiset ϕ • p ∈ M(X) as (ϕ • p)(x) = ϕ(x) · p(x). Show that this gives a monoid action:

    • : M(X) × Fact(X) → M(X)

wrt. the multiplicative monoid structure (1, &) on factors.


4.2.14 Let E be an arbitrary effect algebra. Prove, from the axioms of an effect algebra, that for elements e, e′, d, d′, f ∈ E the following properties hold.
1 Orthosupplement is an involution: e⊥⊥ = e;
2 Cancellation holds, of the form: e > d = e > d′ implies d = d′;
3 Zero-sum freeness holds: e > d = 0 implies e = d = 0;
4 Define an order ≤ on E via: e ≤ d iff e > f = d for some f ∈ E. This is a partial order with 1 as top and 0 as bottom element;
5 e ≤ d implies d⊥ ≤ e⊥;
6 e > d is defined iff e ⊥ d iff e ≤ d⊥ iff d ≤ e⊥;
7 e ≤ d and d ⊥ f implies e ⊥ f and e > f ≤ d > f;
8 if e ≤ e′ and d ≤ d′ and e′ > d′ is defined, then also e > d is defined.
4.2.15 In an effect algebra E, as above, define for elements e, d ∈ E,

    e ⩓ d := (e⊥ > d⊥)⊥        if e⊥ ⊥ d⊥
    e ⊖ d := (e⊥ > d)⊥ = e ⩓ d⊥        if e ≥ d.

Show that:
1 (E, ⩓, 1) is a partial commutative monoid;
2 e ≤ d iff e = d ⩓ f for some f;
3 e ⊖ 0 = e and 1 ⊖ e = e⊥ and e ⊖ e = 0;
4 e > d = f iff e = f ⊖ d; in particular, (f ⊖ d) > d = f;
5 e > d ≤ f iff e ≤ f ⊖ d;
6 e > d iff e ⊖ d > 0;
7 f ⊖ e = f ⊖ d implies e = d;
8 e ≤ d implies d ⊖ e ≤ d and d ⊖ (d ⊖ e) = e;
9 the function e > (−) preserves all joins ⋁i di that exist in E: if e ⊥ di for each i, then e ⊥ ⋁i di and e > (⋁i di) = ⋁i (e > di);
10 similarly, e ⩓ (−) preserves meets;
11 a homomorphism f : E → D of effect algebras E, D preserves orthosupplement: f(e⊥) = f(e)⊥, and thus also f(0) = 0, f(e ⩓ d) = f(e) ⩓ f(d) and f(e ⊖ d) = f(e) ⊖ f(d).
4.2.16 Let E be an effect module. The aim of this exercise is to show that subconvex sums exist in E, see [141, Lemma 2.1]. This means that for arbitrary elements e1, . . . , en ∈ E and scalars r1, . . . , rn ∈ [0, 1] with Σi ri ≤ 1 the sum r1 · e1 > · · · > rn · en exists in E.
1 Use induction on n ≥ 1, and check the base step.
2 For the induction step let e1, . . . , en+1 and r1, . . . , rn+1 ∈ [0, 1] with Σi≤n+1 ri ≤ 1 be given. By induction hypothesis the sum >i≤n ri · ei exists. Use Exercise 4.2.14 to check that the following chain of inequalities holds and suffices to prove that the sum >i≤n+1 ri · ei exists too.

    rn+1 · en+1 ≤ rn+1 · 1 ≤ (Σi≤n ri)⊥ · 1 = ( >i≤n ri · 1 )⊥ ≤ ( >i≤n ri · ei )⊥.

4.3 Transformation of observables


One of the basic operations that we have seen so far is state transformation =.
It is used to transform a state ω on the domain X of a channel c : X → Y into
a state c = ω on the codomain Y of the channel. This section introduces trans-
formation of observables =. It works in the opposite direction of a channel: an
observable q on the codomain Y of the channel is transformed into an observ-
able c = q on the domain X of the channel. Thus, state transformation works
forwardly, in the direction of the channel, whereas observable transformation
= works backwardly, against the direction of the channel. This operation = is


often applied only to predicates — and then called predicate transformation —


but here we apply it more generally to observables.
This section introduces observable transformation = and lists its key math-
ematical properties. In the next chapter it will be used for probabilistic reason-
ing, especially in combination with updating.

Definition 4.3.1. Let c : X → Y be a channel. An observable q ∈ Obs(Y) is transformed into c = q ∈ Obs(X) via the definition:

    (c = q)(x) := c(x) |= q = Σy∈Y c(x)(y) · q(y).

When q is a point predicate 1y, for an element y ∈ Y, we get:

    (c = 1y)(x) = c(x)(y),    so that    c = 1y = c(−)(y) : X → [0, 1].

The resulting function Y → Pred (X), given by y ↦ c = 1y, is sometimes called the likelihood function.
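A channel with finite domain and codomain is just a dictionary of distributions, and observable transformation is then a one-line sum; the function name `transform` and the encoding are our own sketch.

```python
def transform(c, q):
    """Observable transformation: (c = q)(x) = c(x) |= q = sum_y c(x)(y) * q(y)."""
    return {x: sum(pr * q[y] for y, pr in dist.items()) for x, dist in c.items()}

# A channel {0, 1} -> {'u', 'v'}, given pointwise by distributions.
c = {0: {'u': 0.75, 'v': 0.25},
     1: {'u': 0.5, 'v': 0.5}}

point_u = {'u': 1.0, 'v': 0.0}    # point predicate 1_u
truth = {'u': 1.0, 'v': 1.0}      # truth predicate 1

lik = transform(c, point_u)       # likelihood: x |-> c(x)('u')
```

Transforming the point predicate recovers the channel's own probabilities, and transforming truth yields the constant-1 observable, as Lemma 4.3.2 (3) states.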

There is a whole series of basic facts about =.

Lemma 4.3.2. 1 The operation c = (−) : Obs(Y) → Obs(X) of transforming observables along a channel c : X → Y restricts, first to factors c = (−) : Fact(Y) → Fact(X), and then to fuzzy predicates c = (−) : Pred (Y) → Pred (X), but not to sharp predicates.
2 Observable transformation c = (−) is linear: it preserves sums (0, +) of
observables and scalar multiplication of observables.
3 Observable transformation preserves truth 1, but not conjunction &.
4 Observable transformation preserves the (pointwise) order on observables:
q1 ≤ q2 implies (c = q1 ) ≤ (c = q2 ).
5 Transformation along the unit channel is the identity: unit = q = q.
6 Transformation along a composite channel is successive transformation:
(d ◦· c) = q = c = (d = q).
7 Transformation along a trivial, deterministic channel f is pre-composition:
f = q = q ◦ f .

Proof. 1 If q ∈ Fact(Y) then q(y) ≥ 0 for all y. But then also (c = q)(x) = Σy c(x)(y) · q(y) ≥ 0, so that c = q ∈ Fact(X). If in addition q ∈ Pred (Y), so that q(y) ≤ 1 for all y, then also (c = q)(x) = Σy c(x)(y) · q(y) ≤ Σy c(x)(y) = 1, since c(x) ∈ D(Y), so that c = q ∈ Pred (X).
    The fact that a transformation c = p of a sharp predicate p need not be sharp is demonstrated in Exercise 4.3.2.
2 Easy.


3 (c = 1)(x) = Σy c(x)(y) · 1(y) = Σy c(x)(y) · 1 = 1. The fact that & is not preserved follows from Exercise 4.3.1.
4 Easy.
5 Recall that unit(x) = 1|x⟩, so that (unit = q)(x) = Σy unit(x)(y) · q(y) = q(x).
6 For c : X → Y and d : Y → Z and q ∈ Obs(Z) we have:

    ((d ◦· c) = q)(x) = Σz∈Z (d ◦· c)(x)(z) · q(z)
                      = Σz∈Z ( Σy∈Y c(x)(y) · d(y)(z) ) · q(z)
                      = Σy∈Y c(x)(y) · ( Σz∈Z d(y)(z) · q(z) )
                      = Σy∈Y c(x)(y) · (d = q)(y)
                      = ( c = (d = q) )(x).

7 For a function f : X → Y,

    (f = q)(x) = Σy∈Y unit(f(x))(y) · q(y) = q(f(x)) = (q ◦ f)(x).

There is the following fundamental relationship between the transformations =≫, =≪ and validity |=.

Proposition 4.3.3. Let c : X → Y be a channel with a state ω ∈ D(X) on its domain and an observable q ∈ Obs(Y) on its codomain. Then:

  (c =≫ ω) |= q  =  ω |= (c =≪ q).   (4.9)

Proof. The result follows simply by unpacking the relevant definitions:

  (c =≫ ω) |= q = Σ_{y∈Y} (c =≫ ω)(y) · q(y)
    = Σ_{y∈Y} ( Σ_{x∈X} c(x)(y) · ω(x) ) · q(y)
    = Σ_{x∈X} ω(x) · ( Σ_{y∈Y} c(x)(y) · q(y) )
    = Σ_{x∈X} ω(x) · (c =≪ q)(x)
    = ω |= (c =≪ q).

We have already seen several instances of this basic result.

• Earlier we mentioned that marginalisation (of states) and weakening (of observables) are dual to each other, see Equation (4.8). We can now see this as an instance of (4.9), using a projection πi : X1 × · · · × Xn → Xi as (trivial) channel, in:

  (πi =≫ ω) |= p  =  ω |= (πi =≪ p).

On the left-hand side the i-th marginal πi =≫ ω ∈ D(Xi) of the state ω ∈ D(X1 × · · · × Xn) is used, whereas on the right-hand side the observable p ∈ Obs(Xi) is weakened to πi =≪ p on the product set X1 × · · · × Xn.
• The first equation in Exercise 4.1.4 is also an instance of (4.9), namely for a trivial channel ⟨f⟩ : X → Y given by a function f : X → Y, as in:

  (⟨f⟩ =≫ ω) |= q = ω |= (⟨f⟩ =≪ q)   by (4.9)
                  = ω |= q ◦ f        by Lemma 4.3.2 (7).

Remark 4.3.4. In a programming context, where a channel c : X → Y is seen as a program taking inputs from X to outputs in Y, one may call c =≪ q the weakest precondition of q, commonly written as wp(c, q), see e.g. [43, 106, 120]. We briefly explain this view.
A precondition of q, w.r.t. the channel c : X → Y, may be defined as an observable p on the channel's domain X for which:

  ω |= p ≤ (c =≫ ω) |= q,   for all states ω.

Proposition 4.3.3 tells us that c =≪ q is then a precondition of q. It is also the weakest one, since if p is a precondition of q, as described above, then in particular:

  p(x) = unit(x) |= p ≤ (c =≫ unit(x)) |= q = unit(x) |= (c =≪ q) = (c =≪ q)(x).

As a result, p ≤ c =≪ q.
Lemma 4.3.2 (6) expresses a familiar compositionality property in the theory of weakest preconditions:

  wp(d ◦· c, q) = (d ◦· c) =≪ q = c =≪ (d =≪ q) = wp(c, wp(d, q)).

We close this section with two topics that dig deeper into the nature of trans-
formations. First, we relate transformation of states and observables in terms
of matrix operations. Then we look closer at the categorical aspects of trans-
formation of observables.

Remark 4.3.5. Let c be a channel with finite sets as domain and codomain. For convenience we write these as n = {0, . . . , n−1} and m = {0, . . . , m−1}, so that the channel c has type n → m. For each i ∈ n we have that c(i) ∈ D(m) is given by an m-tuple of numbers in [0, 1] that add up to one. Thus we can associate an m × n matrix Mc with the channel c, namely:

       ( c(0)(0)     · · ·  c(n−1)(0)   )
  Mc = (    ...               ...       )
       ( c(0)(m−1)   · · ·  c(n−1)(m−1) )

By construction, the columns of this matrix add up to one. Such matrices are often called stochastic.
A state ω ∈ D(n) may be identified with a column vector Mω of length n, with entries ω(0), . . . , ω(n−1). It is then easy to see that the column vector of the transformed state c =≫ ω is obtained by matrix-column multiplication:

  M_{c =≫ ω} = Mc · Mω.

Indeed,

  (c =≫ ω)(j) = Σ_i c(i)(j) · ω(i) = Σ_i (Mc)_{j,i} · (Mω)_i = (Mc · Mω)_j.

An observable q : m → R on m can be identified with a row vector Mq = ( q(0) · · · q(m−1) ). Transformation c =≪ q then corresponds to row-matrix multiplication:

  M_{c =≪ q} = Mq · Mc.

Again, this is checked easily:

  (c =≪ q)(i) = Σ_j q(j) · c(i)(j) = Σ_j (Mq)_j · (Mc)_{j,i} = (Mq · Mc)_i.
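This matrix view can be checked with a short script; the channel, state, and observable below are made up for illustration, and no matrix library is needed:

```python
# State transformation as matrix-column multiplication, observable
# transformation as row-matrix multiplication (Remark 4.3.5).

def mat_vec(M, v):
    # M is an m x n matrix (list of rows), v a length-n column vector
    return [sum(M[j][i] * v[i] for i in range(len(v))) for j in range(len(M))]

def vec_mat(w, M):
    # w is a length-m row vector, M an m x n matrix
    return [sum(w[j] * M[j][i] for j in range(len(M))) for i in range(len(M[0]))]

Mc = [[0.5, 0.1],
      [0.3, 0.2],
      [0.2, 0.7]]               # columns are the distributions c(0), c(1)
omega = [0.25, 0.75]            # a state on n = {0, 1}
q = [1.0, 2.0, 3.0]             # an observable on m = {0, 1, 2}

pushed = mat_vec(Mc, omega)     # state transformation c =>> omega
pulled = vec_mat(q, Mc)         # observable transformation c =<< q

lhs = sum(p * v for p, v in zip(pushed, q))      # (c =>> omega) |= q
rhs = sum(p * v for p, v in zip(omega, pulled))  # omega |= (c =<< q)
assert abs(lhs - rhs) < 1e-12                    # Proposition 4.3.3, eq. (4.9)
```

Since the columns of Mc sum to one, the pushed-forward vector is again a distribution.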

Finally, we make the functoriality of observable transformation =≪ explicit, in the style of Proposition 4.2.10. The latter deals with functions, but we now consider functoriality w.r.t. channels, using the category Chan = Chan(D) of probabilistic channels. Notice that w.r.t. Proposition 4.2.10 the case of sharp predicates is omitted, simply because sharp predicates are not closed under predicate transformation (see Exercise 4.3.2). Also, conjunction & is not preserved under transformation, see Exercise 4.3.1 below.

Proposition 4.3.6. Taking particular observables on a set is functorial, namely via functors:

1 Obs : Chan → Vect^op;
2 Fact : Chan → Cone^op;
3 Pred : Chan → EMod^op.

On a channel c : X → Y, all these functors are given by transformation c =≪ (−), acting in the opposite direction.

Proof. Essentially, all the ingredients are already in Lemma 4.3.2: transformation restricts appropriately (item (1)), preserves identities (item (5)) and composition (item (6)), and preserves the relevant structure (items (2) and (3)).

Exercises
4.3.1 Consider the channel f : {a, b, c} → {u, v} from Example 1.8.2, given by:

  f(a) = 1/2|u⟩ + 1/2|v⟩,   f(b) = 1|u⟩,   f(c) = 3/4|u⟩ + 1/4|v⟩.

Take as predicates p, q : {u, v} → [0, 1],

  p(u) = 1/2,  p(v) = 2/3,   q(u) = 1/4,  q(v) = 1/6.

Alternatively, in the normal-form notation of Lemma 4.2.3 (2),

  p = 1/2 · 1u + 2/3 · 1v,   q = 1/4 · 1u + 1/6 · 1v.

Compute:
• f =≪ p
• f =≪ q
• f =≪ (p > q)
• (f =≪ p) > (f =≪ q)
• f =≪ (p & q)
• (f =≪ p) & (f =≪ q)
This will show that predicate transformation =≪ does not preserve conjunction &.
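A quick numeric check of this exercise (with exact fractions, so no rounding issues) confirms that the two conjunction computations come apart:

```python
from fractions import Fraction as F

# The channel f and predicates p, q from Exercise 4.3.1.
f = {'a': {'u': F(1, 2), 'v': F(1, 2)},
     'b': {'u': F(1, 1), 'v': F(0, 1)},
     'c': {'u': F(3, 4), 'v': F(1, 4)}}
p = {'u': F(1, 2), 'v': F(2, 3)}
q = {'u': F(1, 4), 'v': F(1, 6)}

def pull(c, r):
    # predicate transformation: (c =<< r)(x) = sum_y c(x)(y) * r(y)
    return {x: sum(pr * r[y] for y, pr in d.items()) for x, d in c.items()}

conj = {y: p[y] * q[y] for y in p}       # p & q, the pointwise product
lhs = pull(f, conj)                      # f =<< (p & q)
fp, fq = pull(f, p), pull(f, q)
rhs = {x: fp[x] * fq[x] for x in fp}     # (f =<< p) & (f =<< q)
assert lhs != rhs                        # conjunction is not preserved
```

At b the two sides happen to agree (f(b) is a point state), but at a and c they differ.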
4.3.2 1 Still in the context of the previous exercise, consider the sharp (point) predicate 1u on {u, v}. Show that the transformed predicate f =≪ 1u on {a, b, c} is not sharp. This proves that sharp predicates are not closed under predicate transformation. This is a good motivation for working with fuzzy predicates instead of with sharp predicates only.
2 Let h : X → Y be a function, considered as a deterministic channel, and let V ⊆ Y be an arbitrary subset (event). Check that:

  h =≪ 1_V = 1_{h⁻¹(V)}.

Conclude that predicate transformation along deterministic channels does preserve sharpness.


4.3.3 Let h : X → Y be an ordinary function. Recall from Lemma 4.3.2 (7) that h =≪ q = q ◦ h, when h is considered as a deterministic channel. Show that transformation along such deterministic channels does preserve conjunctions:

  h =≪ (q1 & q2) = (h =≪ q1) & (h =≪ q2),

in contrast to the findings in Exercise 4.3.1 for arbitrary channels. Conclude that weakening preserves conjunction:

  1 ⊗ (p & q) = (1 ⊗ p) & (1 ⊗ q).

4.3.4 Let predicates qi ∈ Pred(Y) form a test, see Definition 4.2.2, and let c : X → Y be a channel. Check that the transformed predicates c =≪ qi form a test on X.
4.3.5 1 Check that a predicate p : X → [0, 1] can be identified with a channel p̂ : X → 2, see also Exercise 4.2.10. Describe this p̂ explicitly in terms of p.
2 Define a channel orth : 2 → 2 such that orth ◦· p̂ is the channel associated with the orthosupplement p⊥.
3 Define also a channel conj : 2 × 2 → 2 such that conj ◦· ⟨p̂, q̂⟩ is the channel associated with p & q.
4 Finally, define also a channel scal(r) : 2 → 2, for r ∈ [0, 1], such that scal(r) ◦· p̂ is the channel associated with r · p.
4.3.6 Recall that a state ω ∈ D(X) can be identified with a channel 1 → X with a trivial domain, and also that a predicate p : X → [0, 1] can be identified with a channel X → 2, see Exercise 4.3.5. Check that under these identifications the validity ω |= p can be identified with:
• state transformation p =≫ ω;
• predicate transformation ω =≪ p;
• channel composition p ◦· ω.
4.3.7 This exercise shows how parallel conjunction ⊗ and sequential conjunction & are inter-definable via transformation =≪, using projection channels πi and copy channels ∆, that is, using weakening and contraction.
1 Let observables p1 on X1 and p2 on X2 be given. Show that on X1 × X2,

  p1 ⊗ p2 = (π1 =≪ p1) & (π2 =≪ p2) = (p1 ⊗ 1) & (1 ⊗ p2).

The last equation occurred already in Lemma 4.2.9 (2).
2 Let q1, q2 be observables on the same set Y. Prove that on Y,

  q1 & q2 = ∆ =≪ (q1 ⊗ q2).

4.3.8 Prove that:

  ⟨c, d⟩ =≪ (p ⊗ q) = (c =≪ p) & (d =≪ q)
  (e ⊗ f) =≪ (p ⊗ q) = (e =≪ p) ⊗ (f =≪ q).

4.3.9 Let c : X → Y be a channel, with observables p on Z and q on Y × Z. Check that:

  (c ⊗ id) =≪ ((1 ⊗ p) & q) = (1 ⊗ p) & ((c ⊗ id) =≪ q).

(Recall that & is not preserved by =≪, see Exercise 4.3.1.)


4.3.10 We take a closer look at the functor Pred : Chan → EMod^op from Proposition 4.3.6. It sends a map c : X → Y in Chan to Pred(c) : Pred(Y) → Pred(X) via:

  Pred(c)(q) ≔ c =≪ q.

1 Check in detail that Pred(c) is a morphism of effect modules, see Definition 4.2.1 (3).
2 Show that the functor Pred is faithful, in the sense that Pred(c) = Pred(c′) implies c = c′, for channels c, c′ : X → Y.
3 Let Y be a finite set. Show that for each map h : Pred(Y) → Pred(X) in the category EMod there is a unique channel c : X → Y with Pred(c) = h.
Hint: Write a predicate p as finite sum >_y p(y) · 1_y, i.e. as the normal form of Lemma 4.2.3 (2), and use the relevant preservation property.
One says that the functor Pred is full and faithful when restricted to the (sub)category with finite sets as objects. In the context of programming (logics) this property is called healthiness, see [42, 43, 120], or [63] for an abstract account.

4.4 Validity and drawing


In Chapter 3 we have studied various distributions associated with drawing
coloured balls from an urn, such as the multinomial, hypergeometric and Pólya
distributions. In this section we look at validity with respect to these draw
distributions, in particular at their means.
In Example 4.1.4 (4) we have seen the mean of a binomial distribution. More
generally, a multinomial mn[K](ω) is a distribution on the set N[K](X) of


natural multisets of size K, and not on (real) numbers. Hence the requirement
for a mean, see Definition 4.1.3, does not apply: the space is not a subset of
the reals. Still, we have described Proposition 3.3.7 as a generalised mean for
multinomials.
Alternatively, we can describe a pointwise mean for multinomials via point-evaluation observables. For a set X with an element y ∈ X we can define an observable:

  π_y : M(X) → R   by   π_y(ϕ) ≔ ϕ(y).   (4.10)

Then we can compute the validity of this observable as:

  mn[K](ω) |= π_y = Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · π_y(ϕ)   by (4.2)
    = Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y)   (4.11)
    = K · ω(y),   by Lemma 3.3.4.

In a similar way one gets from Lemma 3.4.4,

  hg[K](ψ) |= π_y = K · Flrn(ψ)(y) = pl[K](ψ) |= π_y.   (4.12)

Based on the above equations we define the mean of draw distributions as a multiset of point-validities.

Definition 4.4.1. Let X be a set (of colours) and K be a number. The means of the three draw distributions are defined as multisets over X in:

  mean(mn[K](ω)) ≔ Σ_{y∈X} (mn[K](ω) |= π_y) |y⟩ = Σ_{y∈X} K · ω(y) |y⟩ = K · ω
  mean(hg[K](ψ)) ≔ Σ_{y∈X} (hg[K](ψ) |= π_y) |y⟩ = Σ_{y∈X} K · Flrn(ψ)(y) |y⟩ = K · Flrn(ψ)
  mean(pl[K](ψ)) ≔ Σ_{y∈X} (pl[K](ψ) |= π_y) |y⟩ = Σ_{y∈X} K · Flrn(ψ)(y) |y⟩ = K · Flrn(ψ).

These formulations coincide with the ones that we have given earlier in Propositions 3.3.7 and 3.4.7.
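The point-validity formula (4.11) behind these means can be verified by brute-force enumeration on a small made-up example:

```python
from fractions import Fraction as F
from itertools import combinations_with_replacement
from math import factorial

# Check of (4.11): mn[K](omega) |= pi_y = K * omega(y), by enumerating
# all natural multisets of size K over a small space.

omega = {'a': F(1, 6), 'b': F(1, 2), 'c': F(1, 3)}
K = 4

def multisets(space, K):
    # all multisets of size K over the (sorted) keys of `space`, as dicts
    for combo in combinations_with_replacement(sorted(space), K):
        yield {x: combo.count(x) for x in space}

def mn_prob(omega, phi, K):
    # multinomial probability of drawing the multiset phi
    coeff, prob = factorial(K), F(1)
    for x, n in phi.items():
        coeff //= factorial(n)
        prob *= omega[x] ** n
    return coeff * prob

for y in omega:
    validity = sum(mn_prob(omega, phi, K) * phi[y] for phi in multisets(omega, K))
    assert validity == K * omega[y]
```

With three colours and K = 4 there are only 15 multisets, so exhaustive enumeration is cheap.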
We recall a historical example, known as the paradox of the Chevalier De
Méré, from the 17th century. He argued informally that the following two out-
comes have equal probability.


DM1 Throw a dice 4 times; you get at least one 6.


DM2 Throw a pair of dice 24 times; you get at least one double six, i.e. (6, 6).
However, when De Méré bet on (DM2) he mostly lost. Puzzled, he wrote to Pascal, who showed that the probabilities differ.
Let us write (−)(6) ≥ 1 : N(pips) → {0, 1} for the sharp predicate that sends
a multiset ϕ, over the dice space pips = {1, 2, 3, 4, 5, 6}, to 1 if ϕ(6) ≥ 1 and to
0 otherwise. It thus tells that the number 6 occurs at least once in the draw ϕ.
We can thus model the first option (DM1) as validity:
mn[4](dice) |= (−)(6) ≥ 1. (DM1)
The second option (DM2) then becomes:
mn[24](dice ⊗ dice) |= (−)(6, 6) ≥ 1, (DM2)
where (−)(6, 6) ≥ 1 : N(dice × dice) → {0, 1} is the obvious predicate.
Using a computer to calculate the validity (DM1) is relatively easy, giving as outcome 671/1296. It involves summing over the 126 multisets of size 4 over 6 elements, see Proposition 1.7.4. However, the sum in (DM2) is huge, involving all multisets of size 24 over 36 elements, a number in the order of 2 · 10^16.
The common way to compute (DM1) and (DM2) is to switch to validity over the distribution dice, using orthosupplements (negations). Getting at least one 6 in 4 throws is the orthosupplement of getting 4 times no 6. This can be represented and computed as:

  dice ⊗ dice ⊗ dice ⊗ dice |= (1_6⊥ ⊗ 1_6⊥ ⊗ 1_6⊥ ⊗ 1_6⊥)⊥
    = 1 − (dice |= 1_6⊥)^4
    = 1 − (5/6)^4
    = 671/1296
    ≈ 0.518.

Similarly one can compute (DM2) as:

  (dice ⊗ dice)^⊗24 |= (((1_6 ⊗ 1_6)⊥)^⊗24)⊥ = 1 − (35/36)^24 ≈ 0.491.
Hence indeed, betting on (DM2) is a bad idea.
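The two De Méré validities are easily reproduced in code, following the orthosupplement computation above:

```python
# De Méré's two bets, via orthosupplements (negations).
p_dm1 = 1 - (5 / 6) ** 4       # at least one 6 in 4 throws of one die
p_dm2 = 1 - (35 / 36) ** 24    # at least one double six in 24 throws of two dice

assert abs(p_dm1 - 671 / 1296) < 1e-12
assert p_dm1 > 0.5 > p_dm2     # betting on (DM2) loses on average
```

Switching to the negated event replaces a sum over astronomically many multisets by a single power.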
In this section we look at validities over distributions of draws from an urn,
like in (DM1) and (DM2). In particular, we establish connections between
such validities and validities over the urn, as distribution. This involves free
extensions of observables to multisets.
We notice that there are two ways to extend an observable X → R on a set
X to natural multisets over X, since we can choose to use either the additive

structure or the multiplicative structure on R. Both these extensions are based on Proposition 1.4.4.

Definition 4.4.2. Let p : X → R be an observable on a set X.

1 The additive extension p⁺ : N(X) → R of p to multisets over X is defined as:

  p⁺(ϕ) = Σ_{x∈supp(ϕ)} ϕ(x) · p(x) = Σ_{x∈X} ϕ(x) · p(x).

2 The multiplicative extension p• : N(X) → R of p is:

  p•(ϕ) = Π_{x∈supp(ϕ)} p(x)^ϕ(x) = Π_{x∈X} p(x)^ϕ(x).

By construction, these extensions are homomorphisms of monoids, so that:

  p⁺(0) = 0   and   p⁺(ϕ + ψ) = p⁺(ϕ) + p⁺(ψ)
  p•(0) = 1   and   p•(ϕ + ψ) = p•(ϕ) · p•(ψ).   (4.13)

Once extended to multisets, we can investigate the validity of these extensions in distributions obtained from drawing. We concentrate first on the multinomial case, where the validity of both the additive and the multiplicative extensions can be formulated in terms of the validity in the underlying state (urn).

Proposition 4.4.3. Consider a random variable given by an observable p : X → R and a state ω ∈ D(X). Fix K ∈ N.

1 The additive extension p⁺ of p forms a new random variable, with the multinomial distribution mn[K](ω) on N[K](X) as state. The associated validity is:

  mn[K](ω) |= p⁺ = K · (ω |= p).

2 The multiplicative extension gives:

  mn[K](ω) |= p• = (ω |= p)^K.



Proof. 1 We use Lemma 3.3.4 in:

  mn[K](ω) |= p⁺ = Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · p⁺(ϕ)
    = Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ( Σ_{x∈X} ϕ(x) · p(x) )
    = Σ_{x∈X} p(x) · Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(x)
    = Σ_{x∈X} p(x) · K · ω(x)
    = K · (ω |= p).

2 By unfolding the multinomial distribution, see (3.9):

  mn[K](ω) |= p• = Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · p•(ϕ)
    = Σ_{ϕ∈N[K](X)} (ϕ) · ( Π_x ω(x)^ϕ(x) ) · ( Π_x p(x)^ϕ(x) )
    = Σ_{ϕ∈N[K](X)} (ϕ) · Π_x (ω(x) · p(x))^ϕ(x)
    = ( Σ_x ω(x) · p(x) )^K   by (1.26)
    = (ω |= p)^K.
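Both items of Proposition 4.4.3 can be checked by brute-force enumeration on a small made-up example:

```python
from fractions import Fraction as F
from itertools import combinations_with_replacement
from math import factorial

# Check: mn[K](omega) |= p+ = K * (omega |= p)  and
#        mn[K](omega) |= p. = (omega |= p)^K.

omega = {'a': F(1, 4), 'b': F(3, 4)}
p = {'a': F(1, 3), 'b': F(1, 2)}     # an observable on {a, b}
K = 3

def mn(omega, K):
    # the multinomial distribution, with multisets as sorted tuples
    dist = {}
    for phi in combinations_with_replacement(sorted(omega), K):
        coeff, prob = factorial(K), F(1)
        for x in omega:
            coeff //= factorial(phi.count(x))
            prob *= omega[x] ** phi.count(x)
        dist[phi] = coeff * prob
    return dist

def plus(phi):                       # additive extension p+
    return sum(p[x] for x in phi)

def times(phi):                      # multiplicative extension p.
    out = F(1)
    for x in phi:
        out *= p[x]
    return out

dist = mn(omega, K)
mean_p = sum(omega[x] * p[x] for x in omega)   # omega |= p
assert sum(pr * plus(phi) for phi, pr in dist.items()) == K * mean_p
assert sum(pr * times(phi) for phi, pr in dist.items()) == mean_p ** K
```

Representing a multiset as a sorted tuple makes the additive and multiplicative extensions one-line loops.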


For the hypergeometric and Pólya distributions we have results for the additive extension. They resemble the formulation in the above first item.

Proposition 4.4.4. Consider an observable p and an urn ψ. Then:

  hg[K](ψ) |= p⁺ = K · (Flrn(ψ) |= p) = pl[K](ψ) |= p⁺.

Proof. Both equations are obtained as for Proposition 4.4.3 (1), this time using Lemma 3.4.4.
 
For the parallel multinomial law pml : M[K] D(X) → D M[K](X) from
Section 3.6 there is a similar result, in the multiplicative case.

Proposition 4.4.5. For an observable p : X → R and for a multiset of distri-


P
butions i ni |ωi i ∈ M[K](D(X)),
Y
pml i ni | ωi i |= p• = ωi |= p ni .
P  
i


Proof. We use the second formulation (3.25) of pml in:

  pml(Σ_i n_i|ω_i⟩) |= p•
    = Σ_{(ϕ_i), ϕ_i∈N[n_i](X)} ( Π_i mn[n_i](ω_i)(ϕ_i) ) · p•( Σ_i ϕ_i )
    = Σ_{(ϕ_i), ϕ_i∈N[n_i](X)} ( Π_i mn[n_i](ω_i)(ϕ_i) ) · ( Π_i p•(ϕ_i) )   by (4.13)
    = Π_i Σ_{ϕ_i∈N[n_i](X)} mn[n_i](ω_i)(ϕ_i) · p•(ϕ_i)
    = Π_i ( mn[n_i](ω_i) |= p• )
    = Π_i (ω_i |= p)^{n_i},   by Proposition 4.4.3 (2).

We conclude this section by taking a fresh look at multinomial distributions. For convenience, we recall the formulation from (3.9), for a distribution (as urn) ω ∈ D(X) and a number K ∈ N:

  mn[K](ω) = Σ_{ϕ∈N[K](X)} (ϕ) · Π_x ω(x)^ϕ(x) |ϕ⟩.

The probability ω(x) equals the validity ω |= 1_x of the point predicate 1_x, for the element x ∈ X with multiplicity ϕ(x) in the multiset (draw) ϕ. We ask ourselves the question: can we replace these point predicates with arbitrary predicates? Does such a generalisation make sense?
As we shall see in Chapter ?? on probabilistic learning, this makes perfectly good sense. It involves a generalisation of multisets over points, as elements of a set X, to multisets over predicates on X. Such predicates can express uncertainties over the data (elements) from which we wish to learn.
In the current setting we describe only how to adapt multinomial distributions to predicates. This works for tests, which are finite sets of predicates adding up to the truth predicate 1, see Definition 4.2.2. It turns out that there are two forms of "logical" multinomial for predicates.

Definition 4.4.6. Fix a state ω ∈ D(X) and a number K ∈ N. Let T ⊆ Pred(X) be a test.

1 Define the external logical multinomial as:

  mn_E[K](ω, T) = Σ_{ϕ∈N[K](T)} (ϕ) · Π_{p∈T} (ω |= p)^ϕ(p) |ϕ⟩.

2 There is also the internal logical multinomial:

  mn_I[K](ω, T) = Σ_{ϕ∈N[K](T)} (ϕ) · ( ω |= &_{p∈T} p^ϕ(p) ) |ϕ⟩.

In the first, external case the product Π is outside the validity |=, whereas in the second, internal case the conjunction (product) & is inside the validity |=. This explains the names.
It is easy to see how the internal logical formulation generalises the standard multinomial distribution. For a distribution ω with support X = supp(ω), say of the form X = {x1, . . . , xn}, the point predicates 1_{x1}, . . . , 1_{xn} form a test on X, since 1_{x1} > · · · > 1_{xn} = 1. A multiset over these predicates 1_{x1}, . . . , 1_{xn} can easily be identified with a multiset over the points x1, . . . , xn.
The internal logical formulation only makes sense for properly fuzzy predicates. In the sharp case, say with a test {1_U, 1_{¬U}}, things trivialise, since for n, m ≥ 1:

  (1_U)^n & (1_{¬U})^m = 1_U & 1_{¬U} = 0.

It is probably for this reason that the internal version is not so common. But it is quite natural in Bayesian learning, see Chapter ??.
We need to check that the above definitions yield actual distributions, with probabilities adding up to one. In both cases this follows from the Multinomial Theorem (1.27). Externally:

  Σ_{ϕ∈N[K](T)} (ϕ) · Π_{p∈T} (ω |= p)^ϕ(p) = ( Σ_{p∈T} ω |= p )^K   by (1.27)
    = ( ω |= >_{p∈T} p )^K = (ω |= 1)^K = 1^K = 1.

In the internal case we get:

  Σ_{ϕ∈N[K](T)} (ϕ) · ( ω |= &_{p∈T} p^ϕ(p) )
    = ω |= Σ_{ϕ∈N[K](T)} (ϕ) · &_{p∈T} p^ϕ(p)
    = Σ_{x∈X} ω(x) · Σ_{ϕ∈N[K](T)} (ϕ) · Π_{p∈T} p(x)^ϕ(p)
    = Σ_{x∈X} ω(x) · ( Σ_{p∈T} p(x) )^K   by (1.27)
    = Σ_{x∈X} ω(x) · 1^K = 1.

Exercises
4.4.1 Show that the additive/multiplicative extension preserves the additive/multiplicative structure of observables:

  (p + q)⁺ = p⁺ + q⁺,   0⁺ = 0

and:

  (p & q)• = p• & q•,   1• = 1.

4.4.2 For a random variable p : X → R with state ω ∈ D(X), consider the validity mn[−](ω) |= p⁺ as an observable N → R. Show that for a distribution σ ∈ D(N) one has:

  σ |= ( mn[−](ω) |= p⁺ ) = mean(σ) · (ω |= p).

4.4.3 For a random variable (ω, p) write Σ(ω, p) : N → R for the summation observable defined by:

  Σ(ω, p)(n) ≔ ω |= p + · · · + p   (n times).

Show that for a distribution σ ∈ D(N) one has:

  σ |= Σ(ω, p) = mean(σ) · (ω |= p).

4.4.4 The following is often used as an illustration of Wald's identity. Roll a dice, and let n ∈ pips = {1, . . . , 6} be the number that comes up; then roll the dice n more times and record the sum of the resulting pips. What is the expected sum?
1 Use Exercise 4.4.3 to show that the expected sum is 49/4.
2 Obtain this same outcome via Proposition 4.4.3 (1).
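A sketch of the computation behind this exercise, with exact fractions (uniform dice, identity observable):

```python
from fractions import Fraction as F

# Wald-style computation: roll once to get n, then roll n more times and
# sum the pips; the expected sum is mean(sigma) * (omega |= p), here with
# sigma = omega = uniform on {1,...,6} and p the identity observable.

pips = range(1, 7)
mean_pips = F(sum(pips), 6)          # omega |= p = 7/2
expected = mean_pips * mean_pips     # (7/2) * (7/2) = 49/4
assert expected == F(49, 4)

# Direct computation, conditioning on the first roll n:
direct = sum(F(1, 6) * n * mean_pips for n in pips)
assert direct == expected
```

Conditioning on the first roll gives Σ_n (1/6) · n · (7/2), which factors as (7/2)², in line with Exercise 4.4.3.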
4.4.5 Consider the set X = {a, b} with state ω = 1/3|a⟩ + 2/3|b⟩ and the 3-test of predicates T = {p1, p2, p3} where:

  p1 = 1/8 · 1a + 1/2 · 1b,   p2 = 1/4 · 1a + 1/6 · 1b,   p3 = 5/8 · 1a + 1/3 · 1b.

1 Show that:

  mn_E[2](ω, T) = 9/64| 2|p1⟩ ⟩ + 7/48| 1|p1⟩ + 1|p2⟩ ⟩ + 49/1296| 2|p2⟩ ⟩
      + 31/96| 1|p1⟩ + 1|p3⟩ ⟩ + 217/1296| 1|p2⟩ + 1|p3⟩ ⟩ + 961/5184| 2|p3⟩ ⟩.

2 And similarly that:

  mn_I[2](ω, T) = 11/64| 2|p1⟩ ⟩ + 19/144| 1|p1⟩ + 1|p2⟩ ⟩ + 17/432| 2|p2⟩ ⟩
      + 79/288| 1|p1⟩ + 1|p3⟩ ⟩ + 77/432| 1|p2⟩ + 1|p3⟩ ⟩ + 353/1728| 2|p3⟩ ⟩.
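The values in this exercise can be checked mechanically; for K = 2 the multisets over T are just unordered pairs of predicates:

```python
from fractions import Fraction as F

# Check of the two logical multinomials of Exercise 4.4.5, for K = 2.
omega = {'a': F(1, 3), 'b': F(2, 3)}
T = {'p1': {'a': F(1, 8), 'b': F(1, 2)},
     'p2': {'a': F(1, 4), 'b': F(1, 6)},
     'p3': {'a': F(5, 8), 'b': F(1, 3)}}

def validity(pred):
    return sum(omega[x] * pred[x] for x in omega)

names = sorted(T)
pairs = [(i, j) for i in names for j in names if i <= j]

mnE, mnI = {}, {}
for i, j in pairs:
    coeff = 1 if i == j else 2                      # multinomial coefficient
    mnE[(i, j)] = coeff * validity(T[i]) * validity(T[j])   # product outside |=
    conj = {x: T[i][x] * T[j][x] for x in omega}            # conjunction inside |=
    mnI[(i, j)] = coeff * validity(conj)

assert mnE[('p1', 'p1')] == F(9, 64) and mnI[('p1', 'p1')] == F(11, 64)
assert sum(mnE.values()) == 1 and sum(mnI.values()) == 1
```

Both families of probabilities sum to one, as shown above via the Multinomial Theorem.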


4.5 Validity-based distances


This section describes two standard notions of distance between states, and also between predicates. It shows that these distances can both be formulated in terms of validity |=, in a dual form. Basic properties of these distances are included. Earlier, in Section 2.7, we have seen Kullback-Leibler divergence as a measure of difference between states. But divergence does not form a distance function since it is not symmetric; see Remark 4.5.5 for an illustration of the difference.
The distance between two states ω1, ω2 ∈ D(X), on the same set X, can be defined as the join of the distances in [0, 1] between validities:

  d(ω1, ω2) ≔ ⋁_{p∈Pred(X)} | (ω1 |= p) − (ω2 |= p) |.   (4.14)

Similarly, for two predicates p1, p2 ∈ Pred(X), one can define:

  d(p1, p2) ≔ ⋁_{ω∈D(X)} | (ω |= p1) − (ω |= p2) |.   (4.15)

Note that the above formulations involve predicates only, not observables in
general.

4.5.1 Distance between states


The distance defined in (4.14) is commonly called the total variation distance,
which is a special case of the Kantorovich distance, see e.g. [56, 15, 121, 117].
Its two alternative characterisations below are standard. We refer to [86] for
more information about the validity-based approach.
Proposition 4.5.1. Let X be an arbitrary set, with states ω1, ω2 ∈ D(X). Then:

  d(ω1, ω2) = max_{U⊆X} | (ω1 |= 1_U) − (ω2 |= 1_U) | = ½ Σ_{x∈X} | ω1(x) − ω2(x) |.

We write maximum 'max' instead of join ⋁ to express that the supremum is actually reached by a subset (sharp predicate).
Proof. Let ω1, ω2 ∈ D(X) be two discrete probability distributions on the same set X. We will prove the two inequalities labeled (a) and (b) in:

  ½ Σ_{x∈X} | ω1(x) − ω2(x) |
    ≤(a) max_{U⊆X} | (ω1 |= 1_U) − (ω2 |= 1_U) |
    ≤ ⋁_{p∈Pred(X)} | (ω1 |= p) − (ω2 |= p) |
    ≤(b) ½ Σ_{x∈X} | ω1(x) − ω2(x) |.

This proves Proposition 4.5.1, since the inequality in the middle is trivial.
We start with some preparatory definitions. Let U ⊆ X be an arbitrary subset. We shall write ωi(U) = Σ_{x∈U} ωi(x) = (ωi |= 1_U). We partition U into three disjoint parts, and take the relevant sums:

  U_> = {x ∈ U | ω1(x) > ω2(x)},   U↑ = ω1(U_>) − ω2(U_>) ≥ 0
  U_= = {x ∈ U | ω1(x) = ω2(x)},
  U_< = {x ∈ U | ω1(x) < ω2(x)},   U↓ = ω2(U_<) − ω1(U_<) ≥ 0.

We use this notation in particular for U = X. In that case we can use:

  1 = ω1(X) = ω1(X_>) + ω1(X_=) + ω1(X_<)
  1 = ω2(X) = ω2(X_>) + ω2(X_=) + ω2(X_<).

Hence by subtraction we obtain, since ω1(X_=) = ω2(X_=),

  0 = ( ω1(X_>) − ω2(X_>) ) + ( ω1(X_<) − ω2(X_<) ).

That is,

  X↑ = ω1(X_>) − ω2(X_>) = ω2(X_<) − ω1(X_<) = X↓.

As a result:

  ½ Σ_{x∈X} | ω1(x) − ω2(x) |
    = ½ ( Σ_{x∈X_>} (ω1(x) − ω2(x)) + Σ_{x∈X_<} (ω2(x) − ω1(x)) )
    = ½ ( (ω1(X_>) − ω2(X_>)) + (ω2(X_<) − ω1(X_<)) )   (4.16)
    = ½ ( X↑ + X↓ )
    = X↑.

We have prepared the ground for proving the above inequalities (a) and (b).

(a) We will see that the above maximum is actually reached for the subset U = X_>, first of all because:

  ½ Σ_{x∈X} | ω1(x) − ω2(x) | = X↑   by (4.16)
    = ω1(X_>) − ω2(X_>)
    = (ω1 |= 1_{X_>}) − (ω2 |= 1_{X_>})
    ≤ max_{U⊆X} | (ω1 |= 1_U) − (ω2 |= 1_U) |.

(b) Let p ∈ Pred(X) be an arbitrary predicate. We have (1_U & p)(x) = 1_U(x) · p(x), which is p(x) if x ∈ U and 0 otherwise. Since ω1 and ω2 agree on X_=, the terms over X_= cancel, and so:

  | (ω1 |= p) − (ω2 |= p) |
    = | ( (ω1 |= 1_{X_>} & p) − (ω2 |= 1_{X_>} & p) ) − ( (ω2 |= 1_{X_<} & p) − (ω1 |= 1_{X_<} & p) ) |
    ≤ max( (ω1 |= 1_{X_>} & p) − (ω2 |= 1_{X_>} & p), (ω2 |= 1_{X_<} & p) − (ω1 |= 1_{X_<} & p) )
    = max( Σ_{x∈X_>} (ω1(x) − ω2(x)) · p(x), Σ_{x∈X_<} (ω2(x) − ω1(x)) · p(x) )
    ≤ max( Σ_{x∈X_>} (ω1(x) − ω2(x)), Σ_{x∈X_<} (ω2(x) − ω1(x)) )
    = max( X↑, X↓ )
    = X↑
    = ½ Σ_{x∈X} | ω1(x) − ω2(x) |   by (4.16).

In the second step we use that | A − B | ≤ max(A, B) for A, B ≥ 0; in the fourth step we use that p(x) ≤ 1.

This completes the proof.
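Proposition 4.5.1 is easy to check mechanically on a small made-up example, comparing the sum formula with the maximum over all sharp predicates:

```python
from itertools import chain, combinations
from fractions import Fraction as F

# Total variation distance: the sum formula agrees with the maximum over
# subsets U, reached at U = {x | w1(x) > w2(x)}.

w1 = {'a': F(1, 2), 'b': F(1, 3), 'c': F(1, 6)}
w2 = {'a': F(1, 4), 'b': F(1, 4), 'c': F(1, 2)}

d_sum = F(1, 2) * sum(abs(w1[x] - w2[x]) for x in w1)

# enumerate all subsets of the (three-element) sample space
subsets = chain.from_iterable(combinations(sorted(w1), r) for r in range(len(w1) + 1))
d_max = max(abs(sum(w1[x] - w2[x] for x in U)) for U in subsets)

assert d_sum == d_max
```

Here X_> = {a, b}, and indeed the maximum is attained at that subset.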

The sum-formulation in Proposition 4.5.1 is useful in many situations, for


instance in order to prove that the above distance function d between states is
a metric.

Lemma 4.5.2. The distance d(ω1 , ω2 ) between states ω1 , ω2 ∈ D(X) in (4.14)


turns the set of distributions D(X) into a metric space, with [0, 1]-valued met-
ric.

Proof. If d(ω1, ω2) = ½ Σ_{x∈X} | ω1(x) − ω2(x) | = 0, then | ω1(x) − ω2(x) | = 0 for each x ∈ X, so that ω1(x) = ω2(x), and thus ω1 = ω2. Obviously, d(ω1, ω2) = d(ω2, ω1). The triangle inequality holds for d since it holds for the standard distance on [0, 1]:

  d(ω1, ω3) = ½ Σ_{x∈X} | ω1(x) − ω3(x) |
    ≤ ½ Σ_{x∈X} ( | ω1(x) − ω2(x) | + | ω2(x) − ω3(x) | )
    = ½ Σ_{x∈X} | ω1(x) − ω2(x) | + ½ Σ_{x∈X} | ω2(x) − ω3(x) |
    = d(ω1, ω2) + d(ω2, ω3).

We use this same sum-formulation for the following result.

Lemma 4.5.3. State transformation is non-expansive: for a channel c : X → Y one has:

  d(c =≫ ω1, c =≫ ω2) ≤ d(ω1, ω2).

Proof. Since:

  d(c =≫ ω1, c =≫ ω2) = ½ Σ_y | (c =≫ ω1)(y) − (c =≫ ω2)(y) |
    = ½ Σ_y | Σ_x ω1(x) · c(x)(y) − Σ_x ω2(x) · c(x)(y) |
    = ½ Σ_y | Σ_x c(x)(y) · (ω1(x) − ω2(x)) |
    ≤ ½ Σ_y Σ_x c(x)(y) · | ω1(x) − ω2(x) |
    = ½ Σ_x ( Σ_y c(x)(y) ) · | ω1(x) − ω2(x) |
    = d(ω1, ω2).


For two states ω1, ω2 ∈ D(X) a coupling is a joint state σ ∈ D(X × X) that marginalises to ω1 and ω2, i.e. that satisfies σ[1, 0] = ω1 and σ[0, 1] = ω2. Such couplings give an alternative formulation of the distance between states, which is often called the Wasserstein distance, see [164]. These couplings will also be used in Section 7.8 as probabilistic relations. The proof of the next result is standard and is included in order to be complete.

Proposition 4.5.4. For states ω1, ω2 ∈ D(X),

  d(ω1, ω2) = ⋀ { σ |= Eq⊥ | σ is a coupling between ω1, ω2 },

where Eq : X × X → [0, 1] is the equality predicate from Definition 4.1.1 (4).

Proof. We use the notation and results from the proof of Proposition 4.5.1. We first prove the inequality (≤). Let σ be a coupling of ω1, ω2 and let X_> = {x ∈ X | ω1(x) > ω2(x)}. Then:

  ω1(x) = Σ_y σ(x, y) = σ(x, x) + Σ_{y≠x} σ(x, y) ≤ ω2(x) + ( σ |= (1_x ⊗ 1) & Eq⊥ ).

This means that ω1(x) − ω2(x) ≤ σ |= (1_x ⊗ 1) & Eq⊥ for x ∈ X_>. We similarly have:

  ω2(x) = Σ_y σ(y, x) = σ(x, x) + Σ_{y≠x} σ(y, x) ≤ ω1(x) + ( σ |= (1 ⊗ 1_x) & Eq⊥ ).

Hence ω2(x) − ω1(x) ≤ σ |= (1 ⊗ 1_x) & Eq⊥ for x ∉ X_>. Putting this together gives:

  d(ω1, ω2) = ½ Σ_{x∈X} | ω1(x) − ω2(x) |
    = ½ Σ_{x∈X_>} (ω1(x) − ω2(x)) + ½ Σ_{x∈¬X_>} (ω2(x) − ω1(x))
    ≤ ½ Σ_{x∈X_>} σ |= (1_x ⊗ 1) & Eq⊥ + ½ Σ_{x∈¬X_>} σ |= (1 ⊗ 1_x) & Eq⊥
    = ½ ( σ |= (1_{X_>} ⊗ 1) & Eq⊥ ) + ½ ( σ |= (1 ⊗ 1_{¬X_>}) & Eq⊥ )
    ≤ ½ ( σ |= Eq⊥ ) + ½ ( σ |= Eq⊥ )
    = σ |= Eq⊥.
For the inequality (≥) one uses what is called an optimal coupling ρ ∈ D(X × X) of ω1, ω2. It can be defined as:

  ρ(x, y) ≔ min( ω1(x), ω2(x) )   if x = y
  ρ(x, y) ≔ max( ω1(x) − ω2(x), 0 ) · max( ω2(y) − ω1(y), 0 ) / d(ω1, ω2)   otherwise.   (4.17)
We first check that this ρ is a coupling. Let x ∈ X_>, so that ω1(x) > ω2(x); then:

  Σ_y ρ(x, y) = ω2(x) + (ω1(x) − ω2(x)) · Σ_{y≠x} max( ω2(y) − ω1(y), 0 ) / d(ω1, ω2)
    = ω2(x) + (ω1(x) − ω2(x)) · ( Σ_{y∈X_<} (ω2(y) − ω1(y)) ) / d(ω1, ω2)
    = ω2(x) + (ω1(x) − ω2(x)) · X↓ / d(ω1, ω2)
    = ω2(x) + (ω1(x) − ω2(x)) · 1   see the proof of Proposition 4.5.1
    = ω1(x).

If x ∉ X_>, so that ω1(x) ≤ ω2(x), then it is obvious that Σ_y ρ(x, y) = ω1(x) + 0 = ω1(x). This shows ρ[1, 0] = ω1. In a similar way one obtains ρ[0, 1] = ω2. Finally,

  ρ |= Eq = Σ_x ρ(x, x) = Σ_x min( ω1(x), ω2(x) )
    = Σ_{x∈X_>} ω2(x) + Σ_{x∉X_>} ω1(x)
    = ω2(X_>) + 1 − ω1(X_>)
    = 1 − ( ω1(X_>) − ω2(X_>) ) = 1 − d(ω1, ω2).

Hence d(ω1, ω2) = 1 − (ρ |= Eq) = ρ |= Eq⊥.
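The optimal coupling (4.17) and the identity d(ω1, ω2) = ρ |= Eq⊥ can be checked on a small made-up example:

```python
from fractions import Fraction as F

# The optimal coupling rho of (4.17), checked to marginalise correctly
# and to realise the total variation distance.

w1 = {'a': F(1, 2), 'b': F(1, 3), 'c': F(1, 6)}
w2 = {'a': F(1, 4), 'b': F(1, 4), 'c': F(1, 2)}
X = sorted(w1)

d = F(1, 2) * sum(abs(w1[x] - w2[x]) for x in X)

def rho(x, y):
    if x == y:
        return min(w1[x], w2[x])
    return max(w1[x] - w2[x], F(0)) * max(w2[y] - w1[y], F(0)) / d

# rho is a coupling: it marginalises to w1 and w2
for x in X:
    assert sum(rho(x, y) for y in X) == w1[x]
    assert sum(rho(y, x) for y in X) == w2[x]

# the validity of the negated equality predicate equals the distance
assert 1 - sum(rho(x, x) for x in X) == d
```

The diagonal of ρ carries min(ω1, ω2), exactly the overlap of the two states, so the off-diagonal mass is the distance.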



Figure 4.1 Visual comparison of distance and divergence between flip states, see Remark 4.5.5 for details.

Remark 4.5.5. In Definition 2.7.1 we have seen Kullback-Leibler divergence DKL as a measure of difference between states. However, this DKL is not a proper metric, since it is not symmetric, see Exercise 2.7.1. Nevertheless, it is often used as a distance between states, especially in minimisation problems (see e.g. Exercise ??).
Figure 4.1 compares the total variation distance and the Kullback-Leibler divergence between two flip states:

  d( flip(r), flip(s) )   and   DKL( flip(r), flip(s) ),

for r, s ∈ [0, 1], where, recall, flip(r) = r|1⟩ + (1 − r)|0⟩. The distance and divergence are zero when r = s and increase on both sides of the diagonal. The distance ascends via straight planes, but the divergence has a more baroque shape.

The relation between distributions and multisets is a recurring theme. Now that we have a distance function on distributions we can speak about approximation of a distribution via a chain of multisets. This is what the next remark is about. This topic returns as a law of large numbers in Section 5.5, see esp. Theorem 5.5.4. Here we take an algorithmic perspective.

Remark 4.5.6. Let a distribution ω ∈ D(X) be given. One can ask: is there a sequence of natural multisets ϕK ∈ N[K](X) with Flrn(ϕK) getting closer and closer to ω, in the total variation distance d, as K goes to infinity?
The answer is yes. Here is one way to do it. Assume the distribution ω has support {x1, . . . , xn}, of course with n > 0. Define ω0 ≔ 0, the empty multiset in M({x1, . . . , xn}).

• Pick σ1 = 1|xj⟩ where ω takes a maximum at xj, i.e., ω(xj) ≥ ω(xi) for all i. Set the variable cp ≔ j, for 'current position'.
• Set ωn+1 ≔ ωn + ω, in M(X). Look for the first position i after cp where σK(xi) < ωn+1(xi). Then set σK+1 ≔ σK + 1|xi⟩, and cp ≔ i. This search wraps around, if needed. When no i is found we are done and have Flrn(σK) = ω.

Concretely, if ω = 1/6|x1⟩ + 1/2|x2⟩ + 1/3|x3⟩, then, consecutively,

• ω0 = 0 and σ1 = 1|x2⟩
• ω1 = 1/6|x1⟩ + 1/2|x2⟩ + 1/3|x3⟩ and σ2 = 1|x2⟩ + 1|x3⟩
• ω2 = 1/3|x1⟩ + 1|x2⟩ + 2/3|x3⟩ and σ3 = 1|x1⟩ + 1|x2⟩ + 1|x3⟩
• ω3 = 1/2|x1⟩ + 3/2|x2⟩ + 1|x3⟩ and σ4 = 1|x1⟩ + 2|x2⟩ + 1|x3⟩
• ω4 = 2/3|x1⟩ + 2|x2⟩ + 4/3|x3⟩ and σ5 = 1|x1⟩ + 2|x2⟩ + 2|x3⟩
• ω5 = 5/6|x1⟩ + 5/2|x2⟩ + 5/3|x3⟩ and σ6 = 1|x1⟩ + 3|x2⟩ + 2|x3⟩.

Indeed, Flrn(σ6) = ω. In this case we get a finite sequence of multisets approaching ω. The sequence is infinite for, e.g.,

  ω = 1/7|x1⟩ + 4/7|x2⟩ + 2/7|x3⟩ ≈ 0.1429|x1⟩ + 0.5714|x2⟩ + 0.2857|x3⟩.

Running the above algorithm gives, for instance:

• ϕ10 = 2|x1⟩ + 5|x2⟩ + 3|x3⟩
• ϕ100 = 15|x1⟩ + 56|x2⟩ + 29|x3⟩
• ϕ1000 = 143|x1⟩ + 571|x2⟩ + 286|x3⟩
• ϕ10000 = 1429|x1⟩ + 5713|x2⟩ + 2858|x3⟩.
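A sketch of this greedy procedure (under the reading that the accumulated target after K rounds is K · ω; the variable names below are ours). It reproduces σ6 for the first example above:

```python
from fractions import Fraction as F

# Greedy multiset approximation of a distribution (Remark 4.5.6):
# run a fixed number of steps, always incrementing the first position
# after the current one whose count is below the accumulated target.

def approximate(omega, steps):
    xs = sorted(omega)
    sigma = {x: 0 for x in xs}
    target = {x: F(0) for x in xs}
    cp = max(xs, key=lambda x: omega[x])   # start at a maximum of omega
    sigma[cp] += 1
    for _ in range(steps - 1):
        for x in xs:
            target[x] += omega[x]          # target grows by omega each round
        start = xs.index(cp)
        for k in range(1, len(xs) + 1):    # wrap-around search after cp
            x = xs[(start + k) % len(xs)]
            if sigma[x] < target[x]:
                sigma[x] += 1
                cp = x
                break
    return sigma

omega = {'x1': F(1, 6), 'x2': F(1, 2), 'x3': F(1, 3)}
assert approximate(omega, 6) == {'x1': 1, 'x2': 3, 'x3': 2}
```

Exact fractions keep the comparison `sigma[x] < target[x]` free of rounding artefacts.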

The metric on distributions is complete, when the sample space is finite.


This is a standard result, see e.g. [86, Lemma 2.4]. Moreover, natural multisets
can be used to get a dense subset.

Theorem 4.5.7. 1 If X is a finite set, then D(X), with the total variation distance d, is a complete metric space.
2 For an arbitrary set X, the set D(X) has a dense subset:

  ⋃_{K∈N} D[K](X) ⊆ D(X)   where   D[K](X) ≔ { Flrn(ϕ) | ϕ ∈ N[K](X) }.

This dense subset is countable when X is finite.

We often refer to the elements of D[K](X) as fractional distributions.

4.5. Validity-based distances 271

The combination of these two points says that for a finite set X, the set
D(X) of distributions on X is a Polish space: a complete metric space with a
countable dense subset, see ?? for more details.

Proof. 1 Let X = {x1, . . . , xℓ} and ωi ∈ D(X) be a Cauchy sequence. Fix n.
Then for all i, j,

|ωi(xn) − ωj(xn)| ≤ 2 · d(ωi, ωj).

Hence the sequence ωi(xn) ∈ [0, 1] is a Cauchy sequence, say with limit
rn ∈ [0, 1]. Take ω = Σ_n rn|xn⟩ ∈ D(X). This is the limit of the distributions ωi.
2 Let ω ∈ D(X) and ε > 0. We need to find a multiset ϕ ∈ N(X) with
d(ω, Flrn(ϕ)) < ε. There is a systematic way to find such multisets via the
decimal representation of the probabilities in ω. This works as follows. As-
sume we have:

ω = 0.383914217 . . . |a⟩ + 0.406475610 . . . |b⟩ + 0.209610173 . . . |c⟩.

For each n we chop off after n decimals and multiply with 10^n, giving:

ϕ1 := 3|a⟩ + 4|b⟩ + 2|c⟩    with  d(ω, Flrn(ϕ1)) ≤ 1/2 · 3 · 10^−1
ϕ2 := 38|a⟩ + 40|b⟩ + 20|c⟩    with  d(ω, Flrn(ϕ2)) ≤ 1/2 · 3 · 10^−2
ϕ3 := 383|a⟩ + 406|b⟩ + 209|c⟩    with  d(ω, Flrn(ϕ3)) ≤ 1/2 · 3 · 10^−3

etc.

In general, for a distribution ω with supp(ω) = {x1, . . . , xℓ} we can thus
construct a sequence of multisets ϕn ∈ N({x1, . . . , xℓ}) with d(ω, Flrn(ϕn)) ≤
1/2 · ℓ · 10^−n. This distance becomes less than any given ε > 0, by choosing n
sufficiently large.
When X is finite, say with n elements, we know from Proposition 1.7.4
that N[K](X) contains ((n K)) = C(n+K−1, K) multisets, so that D[K](X)
contains ((n K)) distributions. Hence a countable union ⋃_K D[K](X) of such
finite sets is countable.
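The chopping construction in the proof is easy to check mechanically. The sketch below uses our own helper names (`chop`, `flrn`, `tv_dist`) on the example distribution above, and tests the stated bound d(ω, Flrn(ϕn)) ≤ 1/2 · 3 · 10^−n:

```python
def chop(omega, n):
    # phi_n: keep the first n decimals of each probability, scaled up by 10^n
    return {x: int(p * 10**n) for x, p in omega.items()}

def flrn(phi):
    # frequentist learning: normalise a multiset into a distribution
    total = sum(phi.values())
    return {x: c / total for x, c in phi.items()}

def tv_dist(o1, o2):
    # total variation distance: half the sum of the pointwise differences
    return sum(abs(o1[x] - o2[x]) for x in o1) / 2

omega = {'a': 0.383914217, 'b': 0.406475610, 'c': 0.209610173}
```

Running `chop(omega, n)` for n = 1, 2, 3 reproduces ϕ1, ϕ2, ϕ3 from the proof.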

Notice that the sequence of multisets ϕK approaching ω described in Re-


mark 4.5.6 has the special property that ∥ϕK∥ = K. This does not hold for the
sequence ϕn in the above proof. Such a size property is not needed there.
The denseness of the fractional distributions is illustrated in Figure 4.2.
A different way to approximate a distribution ω is described in Section 5.5,
via what is called the law of large numbers.



Figure 4.2 Fractional distributions from D[K]({a, b, c}) as points in the cube
[0, 1]³, for K = 20 on the left and K = 50 on the right. The plot on the left
contains ((3 20)) = 231 dots (distributions) and the one on the right ((3 50)) = 1326,
see Proposition 1.7.4.

4.5.2 Distance between predicates


What we have to say about the validity-based distance (4.15) between predi-
cates is rather brief. First, there is also a pointwise formulation.

Lemma 4.5.8. For two predicates p1, p2 ∈ Pred(X),

d(p1, p2) = ⋁_{x∈X} |p1(x) − p2(x)|.

This distance function d makes the set Pred(X) into a metric space.

Proof. First, we have for each x ∈ X,

d(p1, p2) = ⋁_{ω∈D(X)} |(ω |= p1) − (ω |= p2)|   by (4.15)
          ≥ |(unit(x) |= p1) − (unit(x) |= p2)|
          = |p1(x) − p2(x)|.

Hence d(p1, p2) ≥ ⋁_{x∈X} |p1(x) − p2(x)|.


The other direction follows from:

|(ω |= p1) − (ω |= p2)| = |Σ_z ω(z) · p1(z) − Σ_z ω(z) · p2(z)|
                        ≤ Σ_z ω(z) · |p1(z) − p2(z)|
                        ≤ Σ_z ω(z) · ⋁_x |p1(x) − p2(x)|
                        = ⋁_x |p1(x) − p2(x)|.

The fact that we get a metric space is now straightforward.
There is an analogue of Lemma 4.5.3.
Lemma 4.5.9. Predicate transformation is also non-expansive: for a channel
c : X → Y one has, for predicates p1, p2 ∈ Pred(Y),

d(c = p1, c = p2) ≤ d(p1, p2).

Proof. Via the formulation of Lemma 4.5.8 we get:

d(c = p1, c = p2) = ⋁_x |(c = p1)(x) − (c = p2)(x)|
                  = ⋁_x |Σ_y c(x)(y) · p1(y) − Σ_y c(x)(y) · p2(y)|
                  ≤ ⋁_x Σ_y c(x)(y) · |p1(y) − p2(y)|
                  ≤ ⋁_x Σ_y c(x)(y) · d(p1, p2)
                  = d(p1, p2).
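Both the pointwise distance formula and this non-expansiveness can be checked on a small example. The helper names and the concrete channel below are our own choices, not the book's:

```python
def pred_dist(p1, p2, space):
    # pointwise formulation of Lemma 4.5.8: join of |p1(x) - p2(x)|
    return max(abs(p1[x] - p2[x]) for x in space)

def transform(chan, p, dom):
    # predicate transformation along a channel chan : dom -> codom,
    # sending x to the expected value of p under the distribution chan(x)
    return {x: sum(pr * p[y] for y, pr in chan[x].items()) for x in dom}

X, Y = ['u', 'v'], ['a', 'b', 'c']
chan = {'u': {'a': 0.5, 'b': 0.5}, 'v': {'b': 0.2, 'c': 0.8}}
p1 = {'a': 0.9, 'b': 0.1, 'c': 0.4}
p2 = {'a': 0.3, 'b': 0.6, 'c': 0.2}
```

Here the transformed predicates are much closer to each other (distance 0.06) than the original ones (distance 0.6), illustrating the inequality.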


Exercises

4.5.1 1 Prove that for states ω, ω′ ∈ D(X) and ρ, ρ′ ∈ D(Y) there is an
inequality:

d(ω ⊗ ρ, ω′ ⊗ ρ′) ≤ d(ω, ω′) + d(ρ, ρ′).

(In this situation there is an equality for Kullback-Leibler diver-
gence, see Lemma 2.7.2 (2).)
2 Prove similarly that for predicates p, p′ ∈ Pred(X) and q, q′ ∈
Pred(Y) one gets:

d(p ⊗ q, p′ ⊗ q′) ≤ d(p, p′) + d(q, q′).


4.5.2 In the context of Remark 4.5.5, check that:

d(flip(0), flip(1)) = 1 = d(flip(1), flip(0))
DKL(flip(0), flip(1)) = 0 = DKL(flip(1), flip(0)).


4.5.3 In [88] the influence of a predicate p on a state ω is measured via the
distance d(ω, ω|p). This influence can be zero, for the truth predicate
p = 1.
Consider the set {H, T} with state σ(ε) = ε|H⟩ + (1 − ε)|T⟩, and
with predicate p = 1H. Prove that d(σ(ε), σ(ε)|p) → 1 as ε → 0.
4.5.4 This exercise uses the distance between a joint state and the product
of its marginals as a measure of entwinedness, like in [73].
1 Take σ2 := 1/2|00⟩ + 1/2|11⟩ ∈ D(2 × 2), for 2 = {0, 1}. Show that:

d(σ2, σ2[1, 0] ⊗ σ2[0, 1]) = 1/2.

2 Take σ3 := 1/2|000⟩ + 1/2|111⟩ ∈ D(2 × 2 × 2). Show that:

d(σ3, σ3[1, 0, 0] ⊗ σ3[0, 1, 0] ⊗ σ3[0, 0, 1]) = 3/4.

3 Now define σn ∈ D(2^n) for n ≥ 2 as:

σn := 1/2|0 · · · 0⟩ + 1/2|1 · · · 1⟩   (n times 0, resp. n times 1).

Show that:

• each marginal πi = σn equals 1/2|0⟩ + 1/2|1⟩;
• the product ⊗_i (πi = σn) of the marginals is the uniform state
  on 2^n;
• d(σn, ⊗_i (πi = σn)) = (2^{n−1} − 1) / 2^{n−1}.

Note that the latter distance goes to 1 as n goes to infinity.
4.5.5 Show that D(X), with the total variation distance (4.5), is a complete
metric space when the set X is finite.
4.5.6 The next ‘splitting lemma’ is attributed to Jones [93], see e.g. [95,
117]. For ω1, ω2 ∈ D(X) with distance d := d(ω1, ω2) one can find
distributions ω′1, ω′2, σ ∈ D(X) so that both ω1 and ω2 can be written
as a convex sum:

ωi = d · ω′i + (1 − d) · σ.

Prove this result.
Hint: Use the optimal coupling ρ from (4.17) to define σ(x) = ρ(x, x) / (1 − d).

5

Variance and covariance

The previous chapter introduced validity ω |= p, of an observable p in a
state/distribution ω. The current chapter uses validity to define the standard
statistical concepts of variance and covariance, and the associated notions of
standard deviation and correlation. Informally, for a random variable (ω, p),
the variance Var(ω, p) describes to which extent the observable p differs from
the expected value ω |= p, that is, how much p varies or is spread out.
Together, ω |= p and Var(ω, p) are representational values that capture the
statistical essence of a random variable. The standard deviation of a random
variable is the square root of its variance.
The notion of covariance is used to compare two random variables. What do
we mean by two? We can have:
1 two random variables (ω, p1 ) and (ω, p2 ), with (possibly) different observ-
ables p1 , p2 : X → R, but with the same shared state ω ∈ D(X);
2 a joint state τ ∈ D(X1 × X2 ) together with two observables q1 : X1 → R and
q2 : X2 → R on the two components X1 , X2 . Via weakening of the observ-
ables we get two random variables:
(τ, q1 ⊗ 1)   and   (τ, 1 ⊗ q2)
 

like in the previous point, involving two observables q1 ⊗ 1 and 1 ⊗ q2 , now


on the same set X1 × X2 .
These differences are significant, but the two cases are not always clearly dis-
tinguished in the literature. One of the principles in this book is to make states
explicit. Hence we shall clearly distinguish between the first shared-state form
of covariance and the second joint-state form.
Apart from that, covariance captures to what extent two random variables
change together. Covariance may be positive, when the variables change to-
gether, or negative, meaning that they change in opposite directions.


This short chapter first introduces the basic definitions and results for vari-
ance, and for covariance in shared-state form. They are applied to draw distri-
butions, from Chapter 3, in Section 5.2. The joint-state version of covariance is
introduced in Section 5.3 and illustrated in several examples. Section 5.4 then
establishes the equivalence between:

• non-entwinedness of a joint state, meaning that it is the product of its marginals;


• joint-state independence of random variables on this state;
• joint-state covariance being zero.
See Theorem 5.4.6 for details. Such equivalences do not hold for shared-state
formulations. This is the very reason for being careful about the distinction
between a shared state and a joint state.
Covariance and correlation of (observables on) joint states is relevant in the
setting of updating, in the next chapter. In presence of such correlation, updat-
ing in one (product) component has crossover influence in the other compo-
nent.
At the end of this chapter we use what we have learned about variance to
formulate what is called the weak law of large numbers. It shows that by accu-
mulating repeated draws from a distribution one comes arbitrarily close to that
distribution. This is an alternative way of expressing the denseness of fractional
distributions among all distributions, as formulated in Theorem 4.5.7.

5.1 Variance and shared-state covariance


This section describes the standard notions of variance, covariance and corre-
lation within the setting of this book. It uses the validity relation |= and the op-
erations on observables from Section 4.2. Recall that we understand a random
variable here as a pair (ω, p) consisting of a state ω ∈ D(X) and an observable
p : X → R. The validity ω |= p is a real number, and can thus be used as
a scalar, in the sense of Section 4.2. The truth predicate forms an observable
1 ∈ Obs(X); scalar multiplication yields a new observable (ω |= p)·1 ∈ Obs(X).
It can be subtracted¹ from p, and then squared, giving an observable:

(p − (ω |= p) · 1)² = (p − (ω |= p) · 1) & (p − (ω |= p) · 1) ∈ Obs(X).

This observable denotes the function that sends x ∈ X to (p(x) − (ω |= p))² ∈

1 Subtraction expressions like these occur more frequently in mathematics. For instance, an
eigenvalue λ of a matrix M may be defined as a scalar for which the equation (M − λ · 1)x = 0
has a non-zero solution x, where 1 is the identity matrix. A similar expression is used to define
the elements in the spectrum of a C∗-algebra. See also Exercise 4.2.12.


R≥0. It is thus a factor. Its validity in the original state ω is called variance. It
captures how far the values of p are spread out from their expected value.

Definition 5.1.1. For a random variable (ω, p), the variance Var(ω, p) is the
non-negative number defined by:

Var(ω, p) := ω |= (p − (ω |= p) · 1)².

When the underlying sample space X is a subset of R, say via incl : X ↪ R,
we simply write Var(ω) for Var(ω, incl).
The name standard deviation is used for the square root of the variance;
thus:

StDev(ω, p) := √Var(ω, p).

Example 5.1.2. 1 We recall Example 4.1.4 (1), with distribution flip(3/10) =
3/10|1⟩ + 7/10|0⟩ and observable v(0) = −50 and v(1) = 100. We had
ω |= v = −5, and so we get:

Var(flip(3/10), v) = Σ_{x∈{0,1}} flip(3/10)(x) · (v(x) + 5)²
                   = 3/10 · (100 + 5)² + 7/10 · (−50 + 5)² = 4725.

The standard deviation is around 68.7.
2 For a (fair) dice we have pips = {1, 2, 3, 4, 5, 6} ↪ R and mean(dice) = 7/2,
so that:

Var(dice) = Σ_{x∈pips} dice(x) · (x − 7/2)²
          = 1/6 · ((5/2)² + (3/2)² + (1/2)² + (1/2)² + (3/2)² + (5/2)²) = 35/12.
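The two computations in this example can be replayed directly from Definition 5.1.1; the helper functions `validity` and `variance` below are our own transcriptions of the definitions, not the book's code:

```python
def validity(omega, p):
    # omega |= p: expected value of the observable p in the state omega
    return sum(pr * p[x] for x, pr in omega.items())

def variance(omega, p):
    # Definition 5.1.1, written out pointwise
    m = validity(omega, p)
    return sum(pr * (p[x] - m) ** 2 for x, pr in omega.items())

flip = {1: 0.3, 0: 0.7}                  # flip(3/10)
v = {0: -50, 1: 100}
dice = {x: 1/6 for x in range(1, 7)}     # uniform state on the pips
pips = {x: x for x in range(1, 7)}       # the inclusion pips -> R
```

This reproduces the validity −5 and the variances 4725 and 35/12 above.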

Via a suitable shift-and-rescale one can standardise an observable so that its


validity becomes 0 and its variance becomes 1, see Exercise 5.1.6.
The following result is often useful in calculations.

Lemma 5.1.3. Variance satisfies:

Var(ω, p) = (ω |= p²) − (ω |= p)².

Since variance is non-negative, we get as a byproduct an inequality:

ω |= p² ≥ (ω |= p)².


Proof. We have:

Var(ω, p) = ω |= (p − (ω |= p) · 1)²
          = Σ_x ω(x) · (p(x) − (ω |= p))²
          = Σ_x ω(x) · (p(x)² − 2(ω |= p) · p(x) + (ω |= p)²)
          = (Σ_x ω(x) · p(x)²) − 2(ω |= p) · (Σ_x ω(x) · p(x)) + (Σ_x ω(x)) · (ω |= p)²
          = (ω |= p²) − 2(ω |= p) · (ω |= p) + (ω |= p)²
          = (ω |= p²) − (ω |= p)².

This result is used to obtain the variances of the draw distributions in a later
section.
We continue with covariance and correlation, which involve two random
variables, instead of one, as for variance. We can distinguish two forms of these,
namely:

• The two random variables are of the form (ω, p1 ) and (ω, p2 ), where they
share their state ω.
• There is a joint state τ ∈ D(X1 ×X2 ) together with two observables q1 : X1 →
R and q2 : X2 → R on the two components X1 , X2 . This situation can be seen
as a special case of the previous point by first weakening the two observables
to the product space, via: π1 = q1 = q1 ⊗ 1 and π2 = q2 = 1 ⊗ q2 . Thus we
obtain two random variables with a shared state:
(τ, π1 = q1 ) and (τ, π2 = q2 ).

These observable transformations πi = qi along a deterministic channel


πi : X1 × X2 → Xi can also be described simply as function composition
qi ◦ πi : X1 × X2 → R, see Lemma 4.3.2 (7).
We start with the situation in the first bullet above, and deal with the second
bullet in Definition 5.3.1 in Section 5.3.
Definition 5.1.4. Let (ω, p1) and (ω, p2) be two random variables with a shared
state ω ∈ D(X).

1 The covariance of these random variables is defined as the validity:

Cov(ω, p1, p2) := ω |= (p1 − (ω |= p1) · 1) & (p2 − (ω |= p2) · 1).

2 The correlation between (ω, p1) and (ω, p2) is the covariance divided by
the product of their standard deviations:

Cor(ω, p1, p2) := Cov(ω, p1, p2) / (StDev(ω, p1) · StDev(ω, p2)).


Notice that variance Var(ω, p) is a special case of covariance Cov(ω, p, p),
namely with equal observables. Hence if there is an inclusion incl : X ↪ R
and we would use this inclusion twice to compute covariance, we are in fact
computing variance.
Correlation is normalised covariance, so that the outcome is in the interval
[−1, 1], see Exercise 5.1.7 below. In ordinary language two phenomena are
called correlated when there is a relation between them. More technically, two
random variables are called correlated if their correlation, as defined above,
is non-zero — or equivalently, when their covariance is non-zero. Positive
correlation means that the observables move together in the same direction,
whereas negative correlation means that they move in opposite directions.
Before we go on, we state the following analogue of Lemma 5.1.3, leaving
the proof to the reader.

Lemma 5.1.5. Covariance can be reformulated as:

Cov(ω, p1, p2) = (ω |= p1 & p2) − (ω |= p1) · (ω |= p2).

Example 5.1.6. 1 We have seen in Definition 4.1.2 (2) how the average of
an observable can be computed as its validity in a uniform state. The same
approach is used to compute the covariance (and correlation) in a uniform
joint state. Consider the following two lists a and b of numerical data, of the
same length.

a = [5, 10, 15, 20, 25]    b = [10, 8, 10, 15, 12]

We will identify a and b with random variables, namely with a, b : 5 →
R, where 5 = {0, 1, 2, 3, 4}. Hence there are obvious definitions: a(0) = 5,
a(1) = 10, a(2) = 15, a(3) = 20, a(4) = 25, and similarly for b. Then we
can compute their averages as validities in the uniform state unif5 on the set
5:

avg(a) = unif5 |= a = 15    avg(b) = unif5 |= b = 11.

We will calculate the covariance between a and b w.r.t. the uniform state
unif5, as:

Cov(unif5, a, b) = unif5 |= (a − (unif5 |= a) · 1) & (b − (unif5 |= b) · 1)
                 = Σ_i 1/5 · (a(i) − 15) · (b(i) − 11)
                 = 11.

2 In order to obtain the correlation between a, b, we first need to compute their
variances:

Var(unif5, a) = Σ_i 1/5 · (a(i) − 15)² = 50
Var(unif5, b) = Σ_i 1/5 · (b(i) − 11)² = 5.6

Then:

Cor(unif5, a, b) = Cov(unif5, a, b) / (√Var(unif5, a) · √Var(unif5, b))
                 = 11 / (√50 · √5.6) ≈ 0.66.
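This example can be transcribed in a few lines; the helpers below are our own direct renderings of the covariance and correlation formulas:

```python
from math import sqrt

def validity(omega, p):
    return sum(pr * p[x] for x, pr in omega.items())

def covariance(omega, p1, p2):
    # Definition 5.1.4 (1), written out pointwise
    m1, m2 = validity(omega, p1), validity(omega, p2)
    return sum(pr * (p1[x] - m1) * (p2[x] - m2) for x, pr in omega.items())

def correlation(omega, p1, p2):
    # covariance divided by the product of the standard deviations
    return covariance(omega, p1, p2) / sqrt(
        covariance(omega, p1, p1) * covariance(omega, p2, p2))

unif5 = {i: 1/5 for i in range(5)}
a = dict(enumerate([5, 10, 15, 20, 25]))
b = dict(enumerate([10, 8, 10, 15, 12]))
```

This reproduces the averages 15 and 11, the covariance 11 and the correlation ≈ 0.66.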
The next result collects several linearity properties for (co)variance and cor-
relation. This shows that one can do a lot of re-scaling and stretching of ob-
servables without changing the outcome.

Theorem 5.1.7. Consider a state ω ∈ D(X), with observables p, p1, p2 ∈
Obs(X) and numbers r, s ∈ R.

1 Covariance satisfies:

Cov(ω, p1, p2) = Cov(ω, p2, p1)
Cov(ω, p1, 1) = 0
Cov(ω, r · p1, p2) = r · Cov(ω, p1, p2)
Cov(ω, p, p1 + p2) = Cov(ω, p, p1) + Cov(ω, p, p2)
Cov(ω, p1 + r · 1, p2 + s · 1) = Cov(ω, p1, p2).

2 Variance satisfies:

Var(ω, r · p) = r² · Var(ω, p)
Var(ω, p + r · 1) = Var(ω, p)
Var(ω, p1 + p2) = Var(ω, p1) + 2 · Cov(ω, p1, p2) + Var(ω, p2).

3 For correlation we have:

Cor(ω, p1, p2) = Cor(ω, p2, p1)
Cor(ω, r · p1, s · p2) = Cor(ω, p1, p2)    if r, s have the same sign
Cor(ω, r · p1, s · p2) = −Cor(ω, p1, p2)   otherwise
Cor(ω, p1 + r · 1, p2 + s · 1) = Cor(ω, p1, p2).

Proof. 1 Obviously, covariance is symmetric and covariance with truth is 0.
Covariance preserves scalar multiplication in each (observable) argument,
since by Lemma 5.1.5:

Cov(ω, r · p1, p2) = (ω |= (r · p1) & p2) − (ω |= r · p1) · (ω |= p2)
                   = r · (ω |= p1 & p2) − r · (ω |= p1) · (ω |= p2)
                   = r · Cov(ω, p1, p2).

For preservation of sums we reason from the definition:

Cov(ω, p, p1 + p2)
= (ω |= p & (p1 + p2)) − (ω |= p) · (ω |= p1 + p2)
= (ω |= (p & p1) + (p & p2)) − (ω |= p) · ((ω |= p1) + (ω |= p2))
= (ω |= p & p1) + (ω |= p & p2) − (ω |= p) · (ω |= p1) − (ω |= p) · (ω |= p2)
= Cov(ω, p, p1) + Cov(ω, p, p2).

The equation Cov(ω, p1 + r · 1, p2 + s · 1) = Cov(ω, p1, p2) follows from the
previous equations.
2 The first property holds by what we have just seen:

Var(ω, r · p) = Cov(ω, r · p, r · p) = r² · Cov(ω, p, p) = r² · Var(ω, p).

Similarly, Var(ω, p + r · 1) = Var(ω, p). Next:

Var(ω, p1 + p2)
= Cov(ω, p1 + p2, p1 + p2)
= Cov(ω, p1 + p2, p1) + Cov(ω, p1 + p2, p2)
= Cov(ω, p1, p1) + Cov(ω, p2, p1) + Cov(ω, p1, p2) + Cov(ω, p2, p2)
= Var(ω, p1) + 2 · Cov(ω, p1, p2) + Var(ω, p2).

3 Symmetry of correlation is obvious. By unpacking the definition of correla-
tion and using the previous two items we get:

Cor(ω, r · p1, s · p2) = Cov(ω, r · p1, s · p2) / (√Var(ω, r · p1) · √Var(ω, s · p2))
                       = r · s · Cov(ω, p1, p2) / (√(r² · Var(ω, p1)) · √(s² · Var(ω, p2)))
                       = r · s · Cov(ω, p1, p2) / (|r| · √Var(ω, p1) · |s| · √Var(ω, p2))
                       = Cor(ω, p1, p2) if r, s have the same sign, and
                         −Cor(ω, p1, p2) otherwise.

(The same sign means: either both r ≥ 0 and s ≥ 0, or both r ≤ 0 and s ≤ 0.)
The final equation Cor(ω, p1 + r · 1, p2 + s · 1) = Cor(ω, p1, p2) holds since
both variance and covariance are unchanged by the addition of constants.
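The algebraic identities above lend themselves to quick numerical sanity checks; the state and observables below are arbitrary test data of our own choosing:

```python
def validity(omega, p):
    return sum(pr * p[x] for x, pr in omega.items())

def cov(omega, p1, p2):
    m1, m2 = validity(omega, p1), validity(omega, p2)
    return sum(pr * (p1[x] - m1) * (p2[x] - m2) for x, pr in omega.items())

omega = {'a': 0.2, 'b': 0.5, 'c': 0.3}        # an arbitrary test state
p1 = {'a': 1.0, 'b': -2.0, 'c': 4.0}
p2 = {'a': 0.5, 'b': 3.0, 'c': -1.0}
p_sum = {x: p1[x] + p2[x] for x in omega}     # p1 + p2
p_scaled = {x: -3 * p1[x] for x in omega}     # (-3) . p1
p_shift = {x: p2[x] + 7 for x in omega}       # p2 + 7 . 1
```

The assertions check scalar multiplication, invariance under constant shifts, and the sum formula for variance.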


Exercises

5.1.1 Let ω be a state and p be a factor on the same set. Define for v ∈ R≥0,

f(v) := ω |= (p − v · 1)².

Show that the function f : R≥0 → R≥0 takes its minimum value at
v = ω |= p.
5.1.2 Let τ ∈ D(X × Y) be a joint state with an observable p : X → R. We
can turn them into a random variable in two ways, by marginalisation
and weakening:

(τ[1, 0], p)   and   (τ, π1 = p),

where π1 = p = p ⊗ 1 is an observable on X × Y.
These two random variables have the same expected value, by (4.8).
Show that they have the same variance too:

Var(τ[1, 0], p) = Var(τ, π1 = p).

As a result, the standard deviations are also the same.
5.1.3 Prove Lemma 5.1.5 along the lines of the proof of Lemma 5.1.3.
5.1.4 Show for a predicate p,
1 Cov(ω, p, p⊥ ) = −Var(ω, p) ≤ 0;
2 Var(ω, p⊥ ) = Var(ω, p).
5.1.5 Let h : X → Y be a function, with a state ω ∈ D(X) on its domain and
an observable q : Y → R on its codomain. Show that:

Var(ω, q ◦ h) = Var(D(h)(ω), q).

Hint: Recall Exercise 4.3.3.


5.1.6 Let (ω, p) be a random variable on a space X. Define a new standard
score observable StSc(ω, p) : X → R by:

StSc(ω, p)(x) := (p(x) − (ω |= p)) / StDev(ω, p).

The pair of ω with StSc(ω, p) is also called the Z-random variable.
It is normalised in the sense that:
1 ω |= StSc(ω, p) = 0;
2 Var(ω, StSc(ω, p)) = StDev(ω, StSc(ω, p)) = 1.
Prove these two items.


5.1.7 Recall the Cauchy-Schwarz inequality, for real numbers ai, bi ∈ R,

(Σ_i ai · bi)² ≤ (Σ_i ai²) · (Σ_i bi²).

Use this inequality to prove that correlation is in the interval [−1, 1].
5.1.8 Let ω ∈ D(X) be a state with an n-test p⃗ = p1, . . . , pn, see Defini-
tion 4.2.2.
1 Define the (symmetric) covariance matrix CovMat(ω, p⃗) as the
n × n matrix with entry Cov(ω, pi, pj) at position (i, j), so with
first row Cov(ω, p1, p1), . . . , Cov(ω, p1, pn), and so on, down to
last row Cov(ω, pn, p1), . . . , Cov(ω, pn, pn).
Prove that all rows and all columns add up to 0 and that the entries
on the diagonal are non-negative and have a sum below 1.
2 Next consider the vector v of validities and the symmetric matrix A
of conjunctions:

v := (ω |= p1, . . . , ω |= pn)ᵀ
A := the n × n matrix with entry ω |= pi & pj at position (i, j).

Check that CovMat(ω, p⃗) = A − v · vᵀ, where (−)ᵀ is transpose.


5.1.9 In linear regression a finite collection (ai, bi)_{1≤i≤n} of real numbers
ai, bi ∈ R is given. The aim is to find coefficients v, w ∈ R of a
line y = vx + w that best approximates these points. The error that
is minimised is the ‘sum of squared residuals’ given as:

f(v, w) := Σ_i (bi − (v · ai + w))².

We redescribe the ai, bi as observables a, b : {1, 2, . . . , n} → R, with
a(i) = ai, b(i) = bi, together with the uniform distribution unif on the
space {1, 2, . . . , n}. Thus we have two random variables (unif, a) and
(unif, b) with a shared state. Write:

ā := 1/n · Σ_i ai    b̄ := 1/n · Σ_i bi.

By taking partial derivatives ∂f/∂v and ∂f/∂w, setting them to zero,
and using some elementary calculus, one obtains the best linear ap-
proximation of the (ai, bi) via coefficients given by the familiar for-
mulas:

v̂ = (Σ_i ai (bi − b̄)) / (Σ_i ai (ai − ā))    ŵ = b̄ − v̂ · ā.

1 Derive the above formulas for v̂ and ŵ.
2 Show that one can also write the slope v̂ of the best line as:

v̂ = (Σ_i (ai − ā)(bi − b̄)) / (Σ_i (ai − ā)²).

3 Check that this yields:

v̂ = Cov(unif, a, b) / Var(unif, a).

4 Thus, with the above values v̂, ŵ the sum of squares of b − (v̂ · a + ŵ · 1)
is minimal. Show that the latter expression can also be written in
terms of standard scores, see Exercise 5.1.6, namely as:

b − (v̂ · a + ŵ · 1) = StDev(b) · (StSc(b) − Cor(a, b) · StSc(a)),

where we have omitted the uniform distribution for convenience.
The right-hand side shows that by using correlation as a scalar one
can bring the standard score of a closest to the standard score of b.
Linear regression is described here as a technique for obtaining the
‘best’ line, from data points (ai, bi). Once this line is found, one can
use it for prediction: if we have an arbitrary first coordinate a we
can predict the corresponding second coordinate as v̂ · a + ŵ. For in-
stance, if ai is the number of hours spent learning by student i, and bi is
the resulting mark of student i, then linear regression may give a rea-
sonable prediction of the mark given a (new) amount a of time spent
on learning. Chapter ?? is devoted to learning techniques, of which
linear regression is a simple instance.
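The covariance formulation of the slope in item 3 can be tested against the sum-of-squared-residuals objective; the data points and helper names below are our own:

```python
def mean(xs):
    return sum(xs) / len(xs)

def fit(a, b):
    # slope and intercept via the covariance/variance formulas of the exercise
    abar, bbar = mean(a), mean(b)
    v = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) \
        / sum((ai - abar) ** 2 for ai in a)
    return v, bbar - v * abar

def ssr(a, b, v, w):
    # the 'sum of squared residuals' f(v, w)
    return sum((bi - (v * ai + w)) ** 2 for ai, bi in zip(a, b))

a = [5, 10, 15, 20, 25]
b = [10, 8, 10, 15, 12]
v_hat, w_hat = fit(a, b)
```

Since f is a convex quadratic, any perturbation of (v̂, ŵ) can only increase the error, which the assertions below probe.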

5.2 Draw distributions and their (co)variances

This section establishes standard (co)variance results for the multinomial, hy-
pergeometric and Pólya draw distributions.
With the results that we have collected so far it is easy to compute point-
variances of the draw distributions, using the point-evaluation observables πy
from (4.10).

Proposition 5.2.1. Let X be a set with a distinguished element y ∈ X and with
a number K ∈ N.

1 For an urn-distribution ω ∈ D(X),

Var(mn[K](ω), πy) = K · ω(y) · (1 − ω(y)).

2 For a non-empty urn ψ ∈ N[L](X) of size L ≥ K,

Var(hg[K](ψ), πy) = K · (L−K)/(L−1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).

3 Similarly, in the Pólya case we have:

Var(pl[K](ψ), πy) = K · (L+K)/(L+1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).

Proof. 1 Using the variance formula of Lemma 5.1.3 we get:

Var(mn[K](ω), πy)
= (mn[K](ω) |= πy & πy) − (mn[K](ω) |= πy)²
= (Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y)²) − (Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y))²
= K · (K−1) · ω(y)² + K · ω(y) − (K · ω(y))²
     by Exercise 3.3.7 (3) and Lemma 3.3.4
= K · (ω(y) − ω(y)²) = K · ω(y) · (1 − ω(y)).

2 We now use Exercise 3.4.7 and Lemma 3.4.4 in:

Var(hg[K](ψ), πy)
= (Σ_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y)²) − (Σ_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y))²
= K · Flrn(ψ)(y) · ((K−1) · ψ(y) + (L−K)) / (L−1) − (K · Flrn(ψ)(y))²
= K · Flrn(ψ)(y) · (L(K−1) · ψ(y) + L(L−K) − (L−1)K · ψ(y)) / (L(L−1))
= K · Flrn(ψ)(y) · (L−K) · (L − ψ(y)) / (L(L−1))
= K · (L−K)/(L−1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).

3 Similarly:

Var(pl[K](ψ), πy)
= (Σ_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y)²) − (Σ_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y))²
= K · Flrn(ψ)(y) · ((K−1) · ψ(y) + (L+K)) / (L+1) − (K · Flrn(ψ)(y))²
= K · Flrn(ψ)(y) · (L(K−1) · ψ(y) + L(L+K) − (L+1)K · ψ(y)) / (L(L+1))
= K · Flrn(ψ)(y) · (L+K) · (L − ψ(y)) / (L(L+1))
= K · (L+K)/(L+1) · Flrn(ψ)(y) · (1 − Flrn(ψ)(y)).
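Item 1 can be verified by brute force for small K, by enumerating all multisets of size K together with their multinomial probabilities (the helper names below are ours):

```python
from math import factorial, prod
from itertools import combinations_with_replacement
from collections import Counter

def multinomial(omega, K):
    # mn[K](omega): probabilities of all multisets phi of size K,
    # represented as count tuples over a fixed ordering of the points
    points = sorted(omega)
    dist = {}
    for draw in combinations_with_replacement(points, K):
        phi = Counter(draw)
        coeff = factorial(K) // prod(factorial(phi[x]) for x in points)
        dist[tuple(phi[x] for x in points)] = \
            coeff * prod(omega[x] ** phi[x] for x in points)
    return points, dist

def var_point(points, dist, y):
    # variance of the point-evaluation observable pi_y: phi |-> phi(y)
    i = points.index(y)
    m = sum(pr * phi[i] for phi, pr in dist.items())
    return sum(pr * (phi[i] - m) ** 2 for phi, pr in dist.items())

omega = {'a': 0.5, 'b': 0.3, 'c': 0.2}
K = 3
points, dist = multinomial(omega, K)
```

The assertions compare the enumerated variances with the closed formula K · ω(y) · (1 − ω(y)).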
With these pointwise results we can turn variances into multisets, like in
Definition 4.4.1. In the literature one finds descriptions of such variances as
vectors, but these presuppose an ordering on the points of the underlying space.
The multiset description given below does not require such an ordering.

Definition 5.2.2. Let X be a set (of colours). The variances of the three draw
distributions are defined as multisets over X in:

Var(mn[K](ω)) := Σ_{y∈X} Var(mn[K](ω), πy) |y⟩
              = K · Σ_{y∈X} ω(y) · (1 − ω(y)) |y⟩
Var(hg[K](ψ)) := Σ_{y∈X} Var(hg[K](ψ), πy) |y⟩
              = K · (L−K)/(L−1) · Σ_{y∈X} Flrn(ψ)(y) · (1 − Flrn(ψ)(y)) |y⟩
Var(pl[K](ψ)) := Σ_{y∈X} Var(pl[K](ψ), πy) |y⟩
              = K · (L+K)/(L+1) · Σ_{y∈X} Flrn(ψ)(y) · (1 − Flrn(ψ)(y)) |y⟩.

There are analogous results and definitions for covariance.

Proposition 5.2.3. Let X be a set with two different elements y ≠ z from X
and with a number K ∈ N.

1 For an urn-distribution ω ∈ D(X),

Cov(mn[K](ω), πy, πz) = −K · ω(y) · ω(z).

2 For an urn ψ ∈ N[L](X) of size L ≥ K,

Cov(hg[K](ψ), πy, πz) = −K · (L−K)/(L−1) · Flrn(ψ)(y) · Flrn(ψ)(z).

3 In the Pólya case:

Cov(pl[K](ψ), πy, πz) = −K · (L+K)/(L+1) · Flrn(ψ)(y) · Flrn(ψ)(z).

Proof. 1 Using the covariance formula of Lemma 5.1.5 and Exercise 3.3.7 we
get:

Cov(mn[K](ω), πy, πz)
= (mn[K](ω) |= πy & πz) − (mn[K](ω) |= πy) · (mn[K](ω) |= πz)
= K · (K−1) · ω(y) · ω(z) − K · ω(y) · K · ω(z)
= −K · ω(y) · ω(z).

2 In the same way, using Exercise 3.4.7 and Lemma 3.4.4:

Cov(hg[K](ψ), πy, πz)
= (hg[K](ψ) |= πy & πz) − (hg[K](ψ) |= πy) · (hg[K](ψ) |= πz)
= K · (K−1) · Flrn(ψ)(y) · ψ(z)/(L−1) − K · Flrn(ψ)(y) · K · Flrn(ψ)(z)
= K · Flrn(ψ)(y) · (L(K−1) · ψ(z) − K(L−1) · ψ(z)) / (L(L−1))
= −K · (L−K)/(L−1) · Flrn(ψ)(y) · Flrn(ψ)(z).

3 And:

Cov(pl[K](ψ), πy, πz)
= (pl[K](ψ) |= πy & πz) − (pl[K](ψ) |= πy) · (pl[K](ψ) |= πz)
= K · (K−1) · Flrn(ψ)(y) · ψ(z)/(L+1) − K · Flrn(ψ)(y) · K · Flrn(ψ)(z)
= K · Flrn(ψ)(y) · (L(K−1) · ψ(z) − K(L+1) · ψ(z)) / (L(L+1))
= −K · (L+K)/(L+1) · Flrn(ψ)(y) · Flrn(ψ)(z).

Definition 5.2.4. For a set X we define the covariances of the three draw dis-
tributions as (joint) multisets over X × X in:

Cov(mn[K](ω)) := Σ_{y,z∈X} Cov(mn[K](ω), πy, πz) |y, z⟩
Cov(hg[K](ψ)) := Σ_{y,z∈X} Cov(hg[K](ψ), πy, πz) |y, z⟩
Cov(pl[K](ψ)) := Σ_{y,z∈X} Cov(pl[K](ψ), πy, πz) |y, z⟩.

On the diagonal in these joint multisets, where y = z, there are the variances.
When the elements of the space X are ordered, these covariance multisets can
be seen as matrices. We elaborate an illustration.

Example 5.2.5. Consider a group of 50 people of which 25 vote for the green
party (G), 15 vote liberal (L) and the remaining 10 vote for the christian-
democratic party (C). We thus have a set of vote options V = {G, L, C} with a
(natural) voter multiset ν = 25|G⟩ + 15|L⟩ + 10|C⟩.
We select five people from the group and look at their votes. These five
people are obtained in hypergeometric mode, where selected individuals step
out of the group and are no longer available for subsequent selection.
The hypergeometric mean is introduced in Definition 4.4.1. It gives the mul-
tiset:

mean(hg[5](ν)) = 5 · Flrn(ν) = 5/2|G⟩ + 3/2|L⟩ + 1|C⟩.

The deviations from the mean are given by the variance, see Definition 5.2.2.

Var(hg[5](ν)) = 5 · (50−5)/(50−1) · Σ_{x∈V} Flrn(ν)(x) · (1 − Flrn(ν)(x)) |x⟩
              = 225/196|G⟩ + 27/28|L⟩ + 36/49|C⟩
              ≈ 1.148|G⟩ + 0.9643|L⟩ + 0.7347|C⟩.

The covariances give a 2-dimensional multiset over V × V, see Definition 5.2.4.

Cov(hg[5](ν)) = 225/196|G, G⟩ − 135/196|G, L⟩ − 45/98|G, C⟩
              − 135/196|L, G⟩ + 27/28|L, L⟩ − 27/98|L, C⟩
              − 45/98|C, G⟩ − 27/98|C, L⟩ + 36/49|C, C⟩
              ≈ 1.148|G, G⟩ − 0.6888|G, L⟩ − 0.4592|G, C⟩
              − 0.6888|L, G⟩ + 0.9643|L, L⟩ − 0.2755|L, C⟩
              − 0.4592|C, G⟩ − 0.2755|C, L⟩ + 0.7347|C, C⟩.

When these covariances are seen as a matrix, we recognise that the matrix is
symmetric and has variances on its diagonal.
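The hypergeometric numbers for the green count can be reproduced by enumerating all draws of five people, using the standard multivariate-hypergeometric formula P(ϕ) = ∏x C(ν(x), ϕ(x)) / C(L, K); the urn counts match the totals used in the calculations, and the helper names are ours:

```python
from math import comb, prod
from itertools import product

def hypergeometric(urn, K):
    # hg[K](urn): count vectors phi <= urn with |phi| = K, each with
    # probability prod_x C(urn(x), phi(x)) / C(L, K)
    points = list(urn)
    L = sum(urn.values())
    ranges = [range(min(K, urn[x]) + 1) for x in points]
    dist = {}
    for phi in product(*ranges):
        if sum(phi) == K:
            num = prod(comb(urn[x], k) for x, k in zip(points, phi))
            dist[phi] = num / comb(L, K)
    return points, dist

def moments(points, dist, y):
    # mean and variance of the count of colour y
    i = points.index(y)
    m = sum(pr * phi[i] for phi, pr in dist.items())
    var = sum(pr * (phi[i] - m) ** 2 for phi, pr in dist.items())
    return m, var

nu = {'G': 25, 'L': 15, 'C': 10}
points, dist = hypergeometric(nu, 5)
```

The assertions recover the mean 5/2 and variance 225/196 for G, and check that the count means add up to the draw size 5.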


We recall from Definition 4.4.2 the extension of an observable p on a set X to
an observable on natural multisets N(X) over X. This can be done additively
and multiplicatively. Interestingly, the additive extension interacts well with
(co)variance in the multinomial case, like with validity in Proposition 4.4.3 (1).

Proposition 5.2.6. Let p, q : X → R be observables on a set X, with their
additive extensions p⁺, q⁺ : N(X) → R. For a distribution ω ∈ D(X) and a
number K ∈ N,

1 Variance of p⁺ over multinomial draws is related to variance of p over X,
via:

Var(mn[K](ω), p⁺) = K · Var(ω, p).

2 Similarly for covariance:

Cov(mn[K](ω), p⁺, q⁺) = K · Cov(ω, p, q).

Proof. 1 By Proposition 4.4.3 (1) and Exercise 3.3.7:

Var(mn[K](ω), p⁺)
= (mn[K](ω) |= p⁺ & p⁺) − (mn[K](ω) |= p⁺)²
= (Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · p⁺(ϕ) · p⁺(ϕ)) − K² · (ω |= p)²
= (Σ_{x,y∈X} p(x) · p(y) · Σ_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(x) · ϕ(y)) − K² · (ω |= p)²
= (Σ_{x,y∈X} p(x) · p(y) · { K · (K−1) · ω(x) · ω(y)           if x ≠ y
                             K · (K−1) · ω(x)² + K · ω(x)      if x = y })
  − K² · (ω |= p)²
= K · (K−1) · (ω |= p)² + K · (ω |= p & p) − K² · (ω |= p)²
= K · (ω |= p & p) − K · (ω |= p)²
= K · Var(ω, p).

2 Similarly.

5.2.1 Distributions of validities and of variances

Fix a random variable (ω, p) on a set X and a number K. We can form the multi-
nomial distribution mn[K](ω) on the set N[K](X) of natural multisets ϕ of size
K, over X. Applying frequentist learning to such multisets ϕ gives new distri-
butions Flrn(ϕ) ∈ D(X), and thus new random variables (Flrn(ϕ), p). We can
look at the validity and variance of the latter. This gives what we call distribu-
tions of validity and distributions of variance. They are defined as distributions
on R via:

dval[K](ω, p) := D(Flrn(−) |= p)(mn[K](ω))
              = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) | Flrn(ϕ) |= p ⟩
                                                                      (5.1)
dvar[K](ω, p) := D(Var(Flrn(−), p))(mn[K](ω))
              = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) | Var(Flrn(ϕ), p) ⟩.

Such distributions are useful in hypothesis testing in statistics, for instance
when ω is a huge distribution for which we wish to check the validity of the
observable (or predicate) p. We can then take small samples ϕ from ω via the
multinomial distribution and check the validity of p in the normalised sample
Flrn(ϕ). Proposition 5.2.7 (1) below says that the mean of all such samples
equals the validity ω |= p.
We illustrate the above definitions (5.1). Consider a three-element set X =
{a, b, c} with distribution ω = 1/6|a⟩ + 1/2|b⟩ + 1/3|c⟩ and observable p =
6 · 1a + 4 · 1b + 12 · 1c. We get:

  dval[2](ω, p)
    = 1/36 | Flrn(2|a⟩) |= p ⟩ + 1/6 | Flrn(1|a⟩ + 1|b⟩) |= p ⟩
      + 1/4 | Flrn(2|b⟩) |= p ⟩ + 1/9 | Flrn(1|a⟩ + 1|c⟩) |= p ⟩
      + 1/3 | Flrn(1|b⟩ + 1|c⟩) |= p ⟩ + 1/9 | Flrn(2|c⟩) |= p ⟩
    = 1/36 | 1|a⟩ |= p ⟩ + 1/6 | 1/2|a⟩ + 1/2|b⟩ |= p ⟩
      + 1/4 | 1|b⟩ |= p ⟩ + 1/9 | 1/2|a⟩ + 1/2|c⟩ |= p ⟩
      + 1/3 | 1/2|b⟩ + 1/2|c⟩ |= p ⟩ + 1/9 | 1|c⟩ |= p ⟩
    = 1/36 |6⟩ + 1/6 |5⟩ + 1/4 |4⟩ + 1/9 |9⟩ + 1/3 |8⟩ + 1/9 |12⟩.

Similarly,

  dvar[2](ω, p)
    = 1/36 | Var(1|a⟩, p) ⟩ + 1/6 | Var(1/2|a⟩ + 1/2|b⟩, p) ⟩
      + 1/4 | Var(1|b⟩, p) ⟩ + 1/9 | Var(1/2|a⟩ + 1/2|c⟩, p) ⟩
      + 1/3 | Var(1/2|b⟩ + 1/2|c⟩, p) ⟩ + 1/9 | Var(1|c⟩, p) ⟩
    = 1/36 | 6² − 6² ⟩ + 1/6 | 26 − 5² ⟩ + 1/4 | 4² − 4² ⟩
      + 1/9 | 90 − 9² ⟩ + 1/3 | 80 − 8² ⟩ + 1/9 | 12² − 12² ⟩
    = 7/18 |0⟩ + 1/6 |1⟩ + 1/9 |9⟩ + 1/3 |16⟩.
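The two distributions above can be replayed exactly in Python. This is only an illustrative sketch (the names omega, p, validity, etc. are ours, not from the text): multisets of size K are enumerated as sorted tuples and mn[K] is computed with the usual multinomial formula, in exact arithmetic.

```python
from itertools import combinations_with_replacement
from collections import Counter
from fractions import Fraction
from math import factorial

# State and observable from the example above.
omega = {'a': Fraction(1, 6), 'b': Fraction(1, 2), 'c': Fraction(1, 3)}
p = {'a': 6, 'b': 4, 'c': 12}
K = 2

def validity(state, obs):                 # state |= obs
    return sum(state[x] * obs[x] for x in state)

def variance(state, obs):                 # Var(state, obs)
    m = validity(state, obs)
    return sum(state[x] * (obs[x] - m) ** 2 for x in state)

dval, dvar = Counter(), Counter()
for draw in combinations_with_replacement(sorted(omega), K):
    phi = Counter(draw)                   # a multiset of size K over X
    prob = Fraction(factorial(K))         # mn[K](omega)(phi)
    for x, k in phi.items():
        prob = prob * omega[x] ** k / factorial(k)
    flrn = {x: Fraction(k, K) for x, k in phi.items()}   # Flrn(phi)
    dval[validity(flrn, p)] += prob
    dvar[variance(flrn, p)] += prob

# The outcomes match the two distributions computed by hand above.
assert dval == {6: Fraction(1, 36), 5: Fraction(1, 6), 4: Fraction(1, 4),
                9: Fraction(1, 9), 8: Fraction(1, 3), 12: Fraction(1, 9)}
assert dvar == {0: Fraction(7, 18), 1: Fraction(1, 6),
                9: Fraction(1, 9), 16: Fraction(1, 3)}
```

The same loop works for any finite state, observable and sample size K.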

Diagrammatically the definitions (5.1) involve the composites:

              mn[K]                D(Flrn)               D((−)|=p)
  D(X) ------------> D(M[K](X)) ------------> D(D(X)) =============> D(R)    (5.2)
                                                         D(Var(−,p))

where the double arrow stands for the pair of maps D((−)|=p) and D(Var(−, p)).


The distribution of validities dval[K] generalises the distribution of means that


is often used in hypothesis testing in statistics (see e.g. [157]). This distribution
of means uses the validity (−) |= p to compute the mean, by taking the inclusion
incl : X ↪ R as observable p, as in Definition 4.1.3. This works of course
only when X is a set of numbers. The above approach (5.1) with an arbitrary
observable p is more general.
Once we have formed a distribution of validities/variances, we can ask what
its validity/variance is. It turns out that these can be expressed in terms of the
validity/variance of the original random variable. These results resemble The-
orem 3.3.5, which says that transforming a multinomial distribution along fre-
quentist learning yields the original (urn) distribution.
Proposition 5.2.7. Let (ω, p) be a random variable on a set X, with a number
K > 0.

1 The mean of the distribution of validities is the validity of the original ran-
  dom variable:

    mean(dval[K](ω, p)) = ω |= p.

2 The variance of the distribution of validities satisfies:

    Var(dval[K](ω, p)) = Var(ω, p) / K.

3 The mean of the distribution of variances is:

    mean(dvar[K](ω, p)) = ((K−1) / K) · Var(ω, p).
There is a fourth option to consider, namely the variance of the distribution
of variances, but that doesn’t seem to be very interesting.
Proof. 1 Easy, via the following diagrammatic proof.

            mn[K]               D(Flrn)               D((−)|=p)
  D(X) ----------> D(M[K](X)) ----------> D(D(X)) ----------> D(R)
     \                                        | flat             | mean
      \_______________________________________v      (−)|=p      v
                                            D(X) -------------->  R

The rectangle on the right commutes by Exercise 4.1.8 (3), and the triangle
on the left by Theorem 3.3.5.
2 The second item requires more work. Let us assume supp(ω) = {x1, ..., xn}.
We first prove the auxiliary result (∗) below, via the Multinomial Theo-
rem (1.27) and Exercise 3.3.7, whose use is marked below by (E).

  Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · (Flrn(ϕ) |= p)²  =  ((K−1)·(ω |= p)² + (ω |= p²)) / K.    (∗)


We reason as follows.

  Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · (Flrn(ϕ) |= p)²
    = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · ( Σ_i (ϕ(x_i)/K) · p(x_i) )²
    =(1.27) (1/K²) · Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · Σ_{ψ∈N[2]({1,...,n})} (ψ) · Π_i (ϕ(x_i) · p(x_i))^{ψ(i)}
    = (1/K²) · [ Σ_{i≠j} Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · ϕ(x_i) · p(x_i) · ϕ(x_j) · p(x_j)
               + Σ_i Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · ϕ(x_i)² · p(x_i)² ]
    =(E) (1/K²) · [ Σ_{i≠j} K·(K−1) · ω(x_i) · p(x_i) · ω(x_j) · p(x_j)
                  + Σ_i ( K·(K−1) · ω(x_i)² + K · ω(x_i) ) · p(x_i)² ]
    =(1.27) ((K−1)/K) · ( Σ_i ω(x_i) · p(x_i) )² + (1/K) · Σ_i ω(x_i) · p(x_i)²
    = ((K−1)·(ω |= p)² + (ω |= p²)) / K.

Here (ψ) denotes the multinomial coefficient of ψ.

Now we are ready to prove the formula for the variance of the distribution
of validities in item (2) of the proposition. We use item (1) and the auxiliary
equation (∗).

  Var(dval[K](ω, p))
    = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · (Flrn(ϕ) |= p)² − mean(dval[K](ω, p))²
    =(∗) ((K−1)·(ω |= p)² + (ω |= p²)) / K − (ω |= p)²
    = (−(ω |= p)² + (ω |= p²)) / K
    = Var(ω, p) / K.


3 We use item (1) and (∗).

  mean(dvar[K](ω, p))
    = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · Var(Flrn(ϕ), p)
    = Σ_{ϕ∈M[K](X)} mn[K](ω)(ϕ) · ( (Flrn(ϕ) |= p²) − (Flrn(ϕ) |= p)² )
    =(∗) (ω |= p²) − ((K−1)·(ω |= p)² + (ω |= p²)) / K
    = ((K−1)·(ω |= p²)) / K − ((K−1)·(ω |= p)²) / K
    = ((K−1) / K) · Var(ω, p).
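The three identities of Proposition 5.2.7 can be checked mechanically on the running example. The sketch below (our own helper names, not from the book) uses exact arithmetic via Fraction, here with K = 3:

```python
from itertools import combinations_with_replacement
from collections import Counter
from fractions import Fraction
from math import factorial

omega = {'a': Fraction(1, 6), 'b': Fraction(1, 2), 'c': Fraction(1, 3)}
p = {'a': 6, 'b': 4, 'c': 12}
K = 3

def validity(state, obs):
    return sum(state[x] * obs[x] for x in state)

def variance(state, obs):
    m = validity(state, obs)
    return sum(state[x] * (obs[x] - m) ** 2 for x in state)

pairs = []   # (Flrn(phi), mn[K](omega)(phi)) for all multisets phi of size K
for draw in combinations_with_replacement(sorted(omega), K):
    phi = Counter(draw)
    prob = Fraction(factorial(K))
    for x, k in phi.items():
        prob = prob * omega[x] ** k / factorial(k)
    pairs.append(({x: Fraction(k, K) for x, k in phi.items()}, prob))

mean_dval = sum(pr * validity(fl, p) for fl, pr in pairs)
var_dval = sum(pr * (validity(fl, p) - mean_dval) ** 2 for fl, pr in pairs)
mean_dvar = sum(pr * variance(fl, p) for fl, pr in pairs)

assert mean_dval == validity(omega, p)                        # item 1
assert var_dval == variance(omega, p) / K                     # item 2
assert mean_dvar == Fraction(K - 1, K) * variance(omega, p)   # item 3
```

For this ω and p one finds ω |= p = 7 and Var(ω, p) = 13, so the three quantities are 7, 13/3 and 26/3.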

Exercises
5.2.1 In the distribution of validities (5.1) we have used multinomial sam-
ples. We can also use hypergeometric ones. Show that in that case the
analogue of Proposition 5.2.7 (1) still holds: for an urn υ ∈ N(X), an
observable p : X → R and a number K ≤ ‖υ‖,

  mean( D(Flrn(−) |= p)(hg[K](υ)) ) = Flrn(υ) |= p.


5.3 Joint-state covariance and correlation


Section 5.1 has introduced variance and shared-state covariance / correlation.
This section looks at a slightly different version, which we call joint-state co-
variance / correlation. It is subtly different from the shared-state version since
the observables involved are defined not on the sample space of the shared
underlying state, but on the components of a product space.
We will thus first define covariance and correlation for a joint state with
observables on its component spaces, as a special case of what we have seen
so far.

Definition 5.3.1. Let τ ∈ D(X1 × X2) be a joint state on sets X1, X2 and let q1 ∈
Obs(X1) and q2 ∈ Obs(X2) be two observables on these two sets separately. In
this situation the joint covariance is defined as:

  JCov(τ, q1, q2) ≔ Cov(τ, π1 ≫= q1, π2 ≫= q2).

Thus we use the weakenings π1 ≫= q1 = q1 ⊗ 1 and π2 ≫= q2 = 1 ⊗ q2 to turn


the observables q1 , q2 on different sets X1 , X2 into observables on the same


(product) set X1 × X2 — so that Definition 5.1.4 applies.
Similarly, the joint correlation is:

  JCor(τ, q1, q2) ≔ Cor(τ, π1 ≫= q1, π2 ≫= q2).
In both these cases, if there are inclusions X1 ↪ R and X2 ↪ R, then one can
use these inclusions as random variables and write just JCov(τ) and JCor(τ).

Joint covariance can be reformulated in different ways, including in the style


that we have seen before, in Lemma 5.1.3 and 5.1.5.

Lemma 5.3.2. Joint covariance can be reformulated as:

  JCov(τ, q1, q2) = τ |= (q1 − (τ[1, 0] |= q1) · 1) ⊗ (q2 − (τ[0, 1] |= q2) · 1)
                  = (τ |= q1 ⊗ q2) − (τ[1, 0] |= q1) · (τ[0, 1] |= q2).

We see that if the joint state τ is non-entwined, that is, if it is the product
of its marginals τ[1, 0] and τ[0, 1], then the joint covariance is 0, whatever
the observables q1, q2 are. This situation will be investigated further in the next
section.

Proof. The first equation follows from:

  JCov(τ, q1, q2)
    = Cov(τ, π1 ≫= q1, π2 ≫= q2)
    = τ |= ((π1 ≫= q1) − (τ |= π1 ≫= q1) · 1) & ((π2 ≫= q2) − (τ |= π2 ≫= q2) · 1)
    = τ |= ((π1 ≫= q1) − (τ[1, 0] |= q1) · (π1 ≫= 1)) & ((π2 ≫= q2) − (τ[0, 1] |= q2) · (π2 ≫= 1))
    = τ |= (π1 ≫= (q1 − (τ[1, 0] |= q1) · 1)) & (π2 ≫= (q2 − (τ[0, 1] |= q2) · 1))
    = τ |= (q1 − (τ[1, 0] |= q1) · 1) ⊗ (q2 − (τ[0, 1] |= q2) · 1).

The last equation follows from Exercise 4.3.7.

Via this first equation we prove the second one. Write a ≔ τ[1, 0] |= q1 and
b ≔ τ[0, 1] |= q2. Then:

  JCov(τ, q1, q2)
    = τ |= (q1 − a · 1) ⊗ (q2 − b · 1)
    = Σ_{x,y} τ(x, y) · (q1 − a · 1)(x) · (q2 − b · 1)(y)
    = Σ_{x,y} τ(x, y) · (q1(x) − a) · (q2(y) − b)
    = Σ_{x,y} τ(x, y) · q1(x) · q2(y) − Σ_{x,y} τ(x, y) · q1(x) · b
      − Σ_{x,y} τ(x, y) · a · q2(y) + Σ_{x,y} τ(x, y) · a · b
    = (τ |= q1 ⊗ q2) − a·b − a·b + a·b
    = (τ |= q1 ⊗ q2) − (τ[1, 0] |= q1) · (τ[0, 1] |= q2).


We turn to some illustrations. Many examples of covariance are actually of


joint form, especially if the underlying sets are subsets of the real numbers.
In the joint case it makes sense to leave these inclusions implicit, as will be
illustrated below.
Example 5.3.3. 1 Consider sets X = {1, 2} and Y = {1, 2, 3} as subsets of R,
together with a joint distribution τ ∈ D(X × Y) given by:

  τ = 1/4|1, 1⟩ + 1/4|1, 2⟩ + 1/4|2, 2⟩ + 1/4|2, 3⟩.

Its two marginals are:

  τ[1, 0] = 1/2|1⟩ + 1/2|2⟩        τ[0, 1] = 1/4|1⟩ + 1/2|2⟩ + 1/4|3⟩.

Since both X ↪ R and Y ↪ R we get means as validities of the inclusions:

  mean(τ[1, 0]) = 3/2        mean(τ[0, 1]) = 2.

Now we can compute the joint covariance in the joint state τ as:

  JCov(τ) = τ |= (incl − mean(τ[1, 0]) · 1) ⊗ (incl − mean(τ[0, 1]) · 1)
          = Σ_{x,y} τ(x, y) · (x − 3/2) · (y − 2)
          = 1/4 · ( −1/2 · −1 + 1/2 · 1 )
          = 1/4.

2 In order to calculate the (joint) correlation of τ we first need the variances
of its marginals:

  Var(τ[1, 0]) = Σ_x τ[1, 0](x) · (x − 3/2)² = 1/4
  Var(τ[0, 1]) = Σ_y τ[0, 1](y) · (y − 2)² = 1/2.

Then:

  JCor(τ) = JCov(τ) / √( Var(τ[1, 0]) · Var(τ[0, 1]) )
          = (1/4) / (1/2 · 1/√2) = 1/2 · √2.
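These numbers are easy to confirm in exact arithmetic. The snippet below (our own sketch, not from the book) recomputes the marginals, JCov(τ), and the square of JCor(τ), squaring to avoid an inexact square root:

```python
from fractions import Fraction

tau = {(1, 1): Fraction(1, 4), (1, 2): Fraction(1, 4),
       (2, 2): Fraction(1, 4), (2, 3): Fraction(1, 4)}

# Marginals tau[1,0] and tau[0,1].
m1, m2 = {}, {}
for (x, y), pr in tau.items():
    m1[x] = m1.get(x, 0) + pr
    m2[y] = m2.get(y, 0) + pr

mean1 = sum(x * pr for x, pr in m1.items())
mean2 = sum(y * pr for y, pr in m2.items())
jcov = sum(pr * (x - mean1) * (y - mean2) for (x, y), pr in tau.items())
var1 = sum(pr * (x - mean1) ** 2 for x, pr in m1.items())
var2 = sum(pr * (y - mean2) ** 2 for y, pr in m2.items())

assert (mean1, mean2) == (Fraction(3, 2), 2)
assert jcov == Fraction(1, 4)
assert (var1, var2) == (Fraction(1, 4), Fraction(1, 2))
# JCor(tau) = jcov / sqrt(var1 * var2); squaring keeps the check exact:
assert jcov ** 2 / (var1 * var2) == Fraction(1, 2)   # so JCor(tau) = sqrt(2)/2
```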

We have defined joint covariance as a special case of ordinary covariance.
We now show that shared-state covariance can also be seen as joint-state co-
variance, namely for a copied state. Recall that copying of states is a subtle
matter, since ∆ =≪ ω ≠ ω ⊗ ω, in general; see Subsection 2.5.2.
Proposition 5.3.4. Shared-state covariance can be expressed as joint covari-
ance:

  Cov(ω, p1, p2) = JCov(∆ =≪ ω, p1, p2).

More generally, for suitably typed channels c, d,

  Cov(ω, c ≫= p, d ≫= q) = JCov(⟨c, d⟩ =≪ ω, p, q).


Proof. We prove the second equation, since the first one is a special case,
namely when c, d are identity channels. Via Lemma 5.3.2 we get:

  JCov(⟨c, d⟩ =≪ ω, p, q)
    = ⟨c, d⟩ =≪ ω |= (p − ((⟨c, d⟩ =≪ ω)[1, 0] |= p) · 1)
                      ⊗ (q − ((⟨c, d⟩ =≪ ω)[0, 1] |= q) · 1)
    = ω |= ⟨c, d⟩ ≫= ( (p − (c =≪ ω |= p) · 1) ⊗ (q − (d =≪ ω |= q) · 1) )
    = ω |= (c ≫= (p − (ω |= c ≫= p) · 1)) & (d ≫= (q − (ω |= d ≫= q) · 1))
          by Exercise 4.3.8
    = ω |= ((c ≫= p) − (ω |= c ≫= p) · 1) & ((d ≫= q) − (ω |= d ≫= q) · 1)
          since c ≫= (−) is linear and preserves 1
    = Cov(ω, c ≫= p, d ≫= q).


We started with ‘ordinary’ covariance in Definition 5.1.4 for two random


variables with a shared state. It was subsequently used to define the ‘joint’
version in Definition 5.3.1. The above result shows that we could have done
this the other way around: obtain the ordinary formulation in terms of the joint
version. As we shall see below, there are notable differences between shared-
state and joint-state versions, see Proposition 5.4.3 and Theorem 5.4.6 in the
next section.
But first we formulate a joint-state analogue for the linearity properties of
Theorem 5.1.7.

Theorem 5.3.5. Consider a state τ ∈ D(X1 × X2), with observables q1 ∈
Obs(X1), q2, q3 ∈ Obs(X2) and numbers r, s ∈ R.

1 Joint-state covariance satisfies:

  JCov(τ, q1, q2) = JCov(τ, q2, q1)
  JCov(τ, q1, 1) = 0
  JCov(τ, r · q1, q2) = r · JCov(τ, q1, q2)
  JCov(τ, q1, q2 + q3) = JCov(τ, q1, q2) + JCov(τ, q1, q3)
  JCov(τ, q1 + r · 1, q2 + s · 1) = JCov(τ, q1, q2).

2 For joint-state correlation we have:

  JCor(τ, r · q1, s · q2) =  JCor(τ, q1, q2)    if r, s have the same sign
                          = −JCor(τ, q1, q2)    otherwise
  JCor(τ, q1 + r · 1, q2 + s · 1) = JCor(τ, q1, q2).

Proof. This follows directly from Theorem 5.1.7, using that predicate trans-
formation πi ≫= (−) is linear and thus preserves sums and scalar multiplications
(and also truth), see Lemma 4.3.2 (2).
We conclude this section with a medical example about the correlation be-
tween disease and test.
Example 5.3.6. We start with a space D = {d, d⊥} for occurrence of a disease
or not (for a particular person) and a space T = {p, n} for a positive or negative
test outcome. Prevalence is used to indicate the prior likelihood of occurrence
of the disease, for instance in the whole population, before a test. It can be
described via a flip-like channel:

  prev : [0, 1] → D    with    prev(r) ≔ r|d⟩ + (1 − r)|d⊥⟩.

We assume that there is a test for the disease with the following characteristics.
• (‘sensitivity’) If someone has the disease, then the test is positive with prob-
ability of 90%.
• (‘specificity’) If someone does not have the disease, there is a 95% chance
that the test is negative.
We formalise this via a channel test : D → T with:
  test(d) = 9/10|p⟩ + 1/10|n⟩        test(d⊥) = 1/20|p⟩ + 19/20|n⟩.

We can now form the joint 'graph' state:

  joint(r) ≔ ⟨id, test⟩ =≪ prev(r) ∈ D(D × T).

Exercise 5.3.3 below tells us that it does not really matter which observables
we choose, so we simply take 1_d : D → [0, 1] and 1_p : T → [0, 1], mapping d
and p to 1, and d⊥ and n to 0. We are thus interested in the (joint-state) correlation
function:

  [0, 1] ∋ r ↦ JCor(joint(r), 1_d, 1_p) ∈ [−1, 1].
This is plotted in Figure 5.1, on the left. We see that, with the sensitivity and
specificity values as given above, there is a clear positive correlation between
disease and test, but less so in the corner cases with minimal and maximal
prevalence.
We now fix a prevalence of 20% and wish to understand correlation as a
function of sensitivity and specificity. We thus parameterise the above test
channel to test(se, sp) : D → T with parameters se, sp ∈ [0, 1].
  test(se, sp)(d) = se|p⟩ + (1 − se)|n⟩
  test(se, sp)(d⊥) = (1 − sp)|p⟩ + sp|n⟩.

297
298 Chapter 5. Variance and covariance

Figure 5.1 Disease-test correlations, on the left as a function of prevalence (with
fixed sensitivity and specificity) and on the right as a function of sensitivity and
specificity (with fixed prevalence); see Example 5.3.6 for details.

As before we form a joint state, but now with different parameters:

  joint(se, sp) ≔ ⟨id, test(se, sp)⟩ =≪ prev(1/5) ∈ D(D × T).

We are then interested in the function:

  [0, 1] × [0, 1] ∋ (se, sp) ↦ JCor(joint(se, sp), 1_d, 1_p) ∈ [−1, 1].
It is described on the right in Figure 5.1. We see that with maximal sensitivity
and specificity (both 1) the correlation between disease and test is also maximal
(actually 1), and dually with minimal sensitivity and specificity (both 0) the
correlation is minimal (namely −1). These (unrealistic) extremes correspond
to an optimal test and an inferior one.
Exercise 5.3.4 makes some intuitive properties of this (parameterised) cor-
relation between disease and test explicit.
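The left-hand curve of Figure 5.1 can be reproduced with a few lines of floating-point code. The sketch below is our own (the label 'd~' for d⊥ is a hypothetical encoding); it exploits that for the indicator observables 1_d and 1_p the joint covariance is simply τ(d, p) minus the product of the marginal validities.

```python
from math import sqrt

def jcor_disease_test(r, se=0.9, sp=0.95):
    """JCor(joint(r), 1_d, 1_p) as a function of the prevalence r in (0, 1)."""
    # joint(r) = <id, test> applied to prev(r): a distribution on D x T
    tau = {('d', 'p'): r * se,              ('d', 'n'): r * (1 - se),
           ('d~', 'p'): (1 - r) * (1 - sp), ('d~', 'n'): (1 - r) * sp}
    vd = tau[('d', 'p')] + tau[('d', 'n')]    # tau[1,0] |= 1_d  (= r)
    vp = tau[('d', 'p')] + tau[('d~', 'p')]   # tau[0,1] |= 1_p
    jcov = tau[('d', 'p')] - vd * vp
    return jcov / sqrt(vd * (1 - vd) * vp * (1 - vp))

# Corner cases of the right-hand plot, at prevalence 1/5: a perfect test
# (se = sp = 1) gives correlation 1, an inverted one (se = sp = 0) gives -1.
assert abs(jcor_disease_test(0.2, se=1.0, sp=1.0) - 1.0) < 1e-9
assert abs(jcor_disease_test(0.2, se=0.0, sp=0.0) + 1.0) < 1e-9
# With the given test characteristics the correlation is clearly positive:
assert jcor_disease_test(0.2) > 0.5
```

Plotting jcor_disease_test(r) for r ranging over (0, 1) yields the left-hand curve of Figure 5.1.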

Exercises
5.3.1 Find examples of covariance and correlation computations in the liter-
ature (or online) and determine if they are of shared-state or joint-state
form.
5.3.2 Prove that:

  JCor(τ, q1, q2) = JCov(τ, q1, q2) / ( StDev(τ[1, 0], q1) · StDev(τ[0, 1], q2) ).


5.3.3 Consider two two-element sample spaces X1 = {a1 , b1 } and X2 =


{a2 , b2 } with a joint state τ ∈ D(X1 × X2 ). Prove that in this binary
case joint-state covariance and correlation do not depend on the ob-
servables, that is:

JCor(τ, p1 , p2 ) = ±JCor(τ, q1 , q2 ),

for all non-constant observables p1 , q1 ∈ Obs(X1 ), p2 , q2 ∈ Obs(X2 ).


Hint: Use Theorem 5.3.5 to massage p1 , p2 to the observables which
send a1 , b1 to 0 and a2 , b2 to 1.
5.3.4 Prove, in the context of Example 5.3.6,

  1 for each s ∈ [0, 1] one has:

      JCor(joint(s, 1 − s), 1_d, 1_p) = 0;

  2 for all se, sp ∈ [0, 1],

      se + sp ≥ 1 ⟺ JCor(joint(se, sp), 1_d, 1_p) ≥ 0.

5.4 Independence for random variables


Earlier we have called a joint state entwined when it cannot be written as
product of its marginals. This is sometimes called dependence. But we like
to reserve the word dependence for random variables in this setting. Such de-
pendence (or independence) is the topic of this section. We shall follow the
approach of the previous two sections and introduce two versions of depen-
dence, also called shared-state and joint-state. Independence is related to the
property ‘covariance is zero’, but in a subtle manner. This will be elaborated
below.
Suppose we have two random variables describing the number of icecream
sales and the temperature. Intuitively one expects a dependence between the
two, and that the two variables are correlated (in an informal sense). The op-
posite, namely independence is usually formalised as follows. Two random
variables p1, p2 are called independent if the equation

  P[p1 = a, p2 = b] = P[p1 = a] · P[p2 = b]    (5.3)

holds for all real numbers a, b. In this formulation a distribution is assumed


in the background. We like to use it explicitly. How should the above equa-
tion (5.3) then be read?


As we described in Subsection 4.1.1, the expression P[p = a] can be inter-
preted via transformation along the observable p, considered as a deterministic
channel:

  P[p = a] = (p =≪ ω |= 1_a) =(4.9) (ω |= p ≫= 1_a),

where ω is the implicit distribution.
We can then translate the above equation (5.3) into:

  ⟨p1, p2⟩ =≪ ω = (p1 =≪ ω) ⊗ (p2 =≪ ω).    (5.4)

But this says that the joint state ⟨p1, p2⟩ =≪ ω on R × R, obtained by trans-
formation along ⟨p1, p2⟩ : X → R × R, is non-entwined: it is the product of
its marginals. Indeed, its (first) marginal is:

  (⟨p1, p2⟩ =≪ ω)[1, 0] = π1 =≪ (⟨p1, p2⟩ =≪ ω)
                        = (π1 ◦· ⟨p1, p2⟩) =≪ ω
                        = p1 =≪ ω.

This brings us to the following formulation. We formulate it for two random


variables, but it can easily be extended to n-ary form.

Definition 5.4.1. Let (ω, p1) and (ω, p2) be two random variables with a com-
mon, shared state ω. They will be called independent if the transformed state
⟨p1, p2⟩ =≪ ω on R × R is non-entwined, as in Equation (5.4): it is the product
of its marginals:

  ⟨p1, p2⟩ =≪ ω = (p1 =≪ ω) ⊗ (p2 =≪ ω).

We sometimes call this the shared-state form of independence, in order to


distinguish it from a later joint-state version.

We give an illustration, of non-independence, that is, of dependence.

Example 5.4.2. Consider a fair coin flip = 1/2|1⟩ + 1/2|0⟩. We are going to
use it twice: first to determine how much we will bet (either €100 or €50),
and secondly to determine whether the bet is won or not. Thus we use the
distribution ω = flip ⊗ flip with underlying set 2 × 2, where 2 = {0, 1}. We will
define two observables p1, p2 : 2 × 2 → R, to be used as random variables for
this same distribution ω.
We first define an auxiliary observable p : 2 → R for the amount of the bet:

  p(1) = 100        p(0) = 50.

We then define p1 = p ⊗ 1 = p ◦ π1 : 2 × 2 → R, as on the left below. The
observable p2 is defined on the right.

  p1(x, y) ≔ { 100  if x = 1        p2(x, y) ≔ {  p(x)  if y = 1
             {  50  if x = 0                   { −p(x)  if y = 0.
We claim that (ω, p1) and (ω, p2) are not independent. Intuitively this may be
clear, since the observable p forms a connection between p1 and p2. Formally,
we can see this by doing the calculations:

  ⟨p1, p2⟩ =≪ ω = D(⟨p1, p2⟩)(flip ⊗ flip)
    = 1/4|p1(1, 1), p2(1, 1)⟩ + 1/4|p1(1, 0), p2(1, 0)⟩
      + 1/4|p1(0, 1), p2(0, 1)⟩ + 1/4|p1(0, 0), p2(0, 0)⟩
    = 1/4|100, 100⟩ + 1/4|100, −100⟩ + 1/4|50, 50⟩ + 1/4|50, −50⟩.

The two marginals are:

  p1 =≪ ω = (⟨p1, p2⟩ =≪ ω)[1, 0] = 1/2|100⟩ + 1/2|50⟩
  p2 =≪ ω = (⟨p1, p2⟩ =≪ ω)[0, 1] = 1/4|100⟩ + 1/4|−100⟩ + 1/4|50⟩ + 1/4|−50⟩.

It is not hard to see that the parallel product ⊗ of these two marginal distribu-
tions differs from ⟨p1, p2⟩ =≪ ω.
Proposition 5.4.3. The shared-state covariance of shared-state independent
random variables is zero: if random variables (ω, p1 ) and (ω, p2 ) are indepen-
dent, then Cov(ω, p1 , p2 ) = 0.
The converse does not hold.
Proof. If (ω, p1) and (ω, p2) are independent, then, by definition, ⟨p1, p2⟩ =≪
ω = (p1 =≪ ω) ⊗ (p2 =≪ ω). The calculation below shows that the covariance
is then zero. It uses multiplication & : R × R → R as observable. It can also be
described as the parallel product id ⊗ id of the observable id : R → R with
itself.

  Cov(ω, p1, p2)
    = (ω |= p1 & p2) − (ω |= p1) · (ω |= p2)                          by Lemma 5.1.5
    = (ω |= & ◦ ⟨p1, p2⟩) − (p1 =≪ ω |= id) · (p2 =≪ ω |= id)         by Exercise 4.1.4
    = (⟨p1, p2⟩ =≪ ω |= &) − ((p1 =≪ ω) ⊗ (p2 =≪ ω) |= id ⊗ id)       by Lemma 4.2.8, 4.1.4
    = (⟨p1, p2⟩ =≪ ω |= &) − (⟨p1, p2⟩ =≪ ω |= &)                     by assumption
    = 0.

The claim that the converse does not hold follows from Example 5.4.4, right
below.


Example 5.4.4. We continue Example 5.4.2. The set-up used there involves
two dependent random variables (ω, p1) and (ω, p2), with shared state ω =
flip ⊗ flip. We show here that they (nevertheless) have covariance zero. This
proves the second claim of Proposition 5.4.3, namely that zero covariance need
not imply independence, in the shared-state context.
We first compute the validities (expected values):

  ω |= p1 = 1/4 · p1(1, 1) + 1/4 · p1(1, 0) + 1/4 · p1(0, 1) + 1/4 · p1(0, 0)
          = 1/4 · 100 + 1/4 · 100 + 1/4 · 50 + 1/4 · 50 = 75
  ω |= p2 = 1/4 · 100 + 1/4 · (−100) + 1/4 · 50 + 1/4 · (−50) = 0.

Then:

  Cov(ω, p1, p2)
    = ω |= (p1 − (ω |= p1) · 1) & (p2 − (ω |= p2) · 1)
    = 1/4 · ( (100−75) · 100 + (100−75) · (−100) + (50−75) · 50 + (50−75) · (−50) )
    = 0.
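Both phenomena of the last two examples — dependence of the random variables, yet vanishing covariance — can be verified directly. The following sketch (names are ours) computes the pushforward of ω along ⟨p1, p2⟩, compares it with the product of its marginals, and evaluates the covariance:

```python
from fractions import Fraction
from collections import Counter

flip = {1: Fraction(1, 2), 0: Fraction(1, 2)}
omega = {(x, y): flip[x] * flip[y] for x in flip for y in flip}  # flip (x) flip

p = {1: 100, 0: 50}
def p1(x, y): return p[x]
def p2(x, y): return p[x] if y == 1 else -p[x]

joint = Counter()                     # <p1, p2> pushed forward along omega
for (x, y), pr in omega.items():
    joint[(p1(x, y), p2(x, y))] += pr
marg1, marg2 = Counter(), Counter()
for (u, v), pr in joint.items():
    marg1[u] += pr
    marg2[v] += pr

# Dependence: the pushforward differs from the product of its marginals ...
product = {(u, v): marg1[u] * marg2[v] for u in marg1 for v in marg2}
assert dict(joint) != product

# ... yet the shared-state covariance is zero, as computed in the text:
e1 = sum(pr * u for u, pr in marg1.items())   # omega |= p1
e2 = sum(pr * v for v, pr in marg2.items())   # omega |= p2
cov = sum(pr * (u - e1) * (v - e2) for (u, v), pr in joint.items())
assert (e1, e2, cov) == (75, 0, 0)
```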
We now turn to independence in joint-state form, in analogy with joint-state
covariance in Definition 5.3.1.
Definition 5.4.5. Let τ ∈ D(X1 × X2) be a joint state with two observables
q1 ∈ Obs(X1) and q2 ∈ Obs(X2). We say that there is joint-state independence
of q1, q2 if the two random variables (τ, π1 ≫= q1) and (τ, π2 ≫= q2) are (shared-
state) independent, as described in Definition 5.4.1.
Concretely, this means that:

  (q1 ⊗ q2) =≪ τ = (q1 =≪ τ[1, 0]) ⊗ (q2 =≪ τ[0, 1]).    (5.5)
Equation (5.5) is an instance of the formulation (5.4) used in Definition 5.4.1,
since πi ≫= qi = qi ◦ πi and:

  ⟨q1 ◦ π1, q2 ◦ π2⟩ =≪ τ = D(⟨q1 ◦ π1, q2 ◦ π2⟩)(τ)
                          = D(q1 × q2)(τ)
                          = (q1 ⊗ q2) =≪ τ
  ((q1 ◦ π1) =≪ τ) ⊗ ((q2 ◦ π2) =≪ τ) = (q1 =≪ (π1 =≪ τ)) ⊗ (q2 =≪ (π2 =≪ τ))
                                      = (q1 =≪ τ[1, 0]) ⊗ (q2 =≪ τ[0, 1]).

In the joint-state case — unlike in the shared-state situation — there is a


tight connection between non-entwinedness, independence and covariance be-
ing zero.
Theorem 5.4.6. For a joint state τ the following three statements are equiva-
lent.


1 τ is non-entwined, i.e. is the product of its marginals;


2 all observables qi ∈ Obs(Xi ) are joint-state independent wrt. τ;
3 the joint-state covariance JCov(τ, q1 , q2 ) is zero, for all observables qi ∈
Obs(Xi ) — or equivalently, all correlations JCor(τ, q1 , q2 ) are zero.
Proof. Let the joint state τ ∈ D(X1 × X2) be given. We write τi ≔ πi =≪ τ ∈ D(Xi)
for its marginals.
(1) ⇒ (2). If τ is non-entwined, then τ = τ1 ⊗ τ2. Hence for all observ-
ables q1 ∈ Obs(X1) and q2 ∈ Obs(X2) we have that σ ≔ (q1 ⊗ q2) =≪ τ
is non-entwined. To see this, first note that πi =≪ σ = qi =≪ τi. Then, by
Lemma 2.4.2 (1):

  (π1 =≪ σ) ⊗ (π2 =≪ σ) = (q1 =≪ τ1) ⊗ (q2 =≪ τ2)
                         = (q1 ⊗ q2) =≪ (τ1 ⊗ τ2) = (q1 ⊗ q2) =≪ τ = σ.
(2) ⇒ (3). Let q1 ∈ Obs(X1) and q2 ∈ Obs(X2) be two observables. We
may assume that q1, q2 are independent wrt. τ, that is, (q1 ⊗ q2) =≪ τ = (q1 =≪
τ1) ⊗ (q2 =≪ τ2), as in (5.5). We must prove JCov(τ, q1, q2) = 0. Consider, like
in the proof of Proposition 5.4.3, the multiplication map & : R × R → R, given
by &(r, s) = r · s, as an observable on R × R. We consider its validity:

  (q1 ⊗ q2) =≪ τ |= &
    = Σ_{r,s} ((q1 ⊗ q2) =≪ τ)(r, s) · &(r, s)
    = Σ_{r,s} D(q1 × q2)(τ)(r, s) · r · s
    = Σ_{r,s} Σ_{(x,y)∈(q1×q2)⁻¹(r,s)} τ(x, y) · r · s
    = Σ_{r,s} Σ_{x∈q1⁻¹(r), y∈q2⁻¹(s)} τ(x, y) · q1(x) · q2(y)
    = Σ_{x,y} τ(x, y) · (q1 ⊗ q2)(x, y)
    = τ |= q1 ⊗ q2.

In the same way one proves ((q1 =≪ τ1) ⊗ (q2 =≪ τ2) |= &) = (τ1 |= q1) ·
(τ2 |= q2). But then we are done via the formulation of binary covariance from
Lemma 5.3.2:

  JCov(τ, q1, q2) = (τ |= q1 ⊗ q2) − (τ1 |= q1) · (τ2 |= q2)
                  = ((q1 ⊗ q2) =≪ τ |= &) − ((q1 =≪ τ1) ⊗ (q2 =≪ τ2) |= &)
                  = ((q1 ⊗ q2) =≪ τ |= &) − ((q1 ⊗ q2) =≪ τ |= &)
                  = 0.
(3) ⇒ (1). Let the joint-state covariance JCov(τ, q1, q2) be zero for all observ-
ables q1, q2. In order to prove that τ is non-entwined, we have to show τ(x, y) =
τ1(x) · τ2(y) for all (x, y) ∈ X1 × X2. We choose as random variables the observ-
ables 1_x and 1_y and use again the formulation of covariance from Lemma 5.3.2.
Since, by assumption, this covariance is zero, we get:

  τ(x, y) = (τ |= 1_x ⊗ 1_y) = (τ1 |= 1_x) · (τ2 |= 1_y) = τ1(x) · τ2(y).
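The step (3) ⇒ (1) suggests a finite test for entwinedness: on finite spaces a joint state is non-entwined exactly when all indicator covariances JCov(τ, 1_x, 1_y) vanish. A small sketch (our own helper names, not from the book):

```python
from fractions import Fraction

def marginals(tau):
    xs = {x for x, _ in tau}
    ys = {y for _, y in tau}
    m1 = {x: sum(tau.get((x, y), 0) for y in ys) for x in xs}
    m2 = {y: sum(tau.get((x, y), 0) for x in xs) for y in ys}
    return m1, m2

def indicator_jcov(tau, x0, y0):
    """JCov(tau, 1_{x0}, 1_{y0}) = tau(x0, y0) - tau1(x0) * tau2(y0)."""
    m1, m2 = marginals(tau)
    return tau.get((x0, y0), 0) - m1[x0] * m2[y0]

def non_entwined(tau):
    m1, m2 = marginals(tau)
    return all(tau.get((x, y), 0) == m1[x] * m2[y] for x in m1 for y in m2)

# The entwined state from Example 5.3.3: some indicator covariance is non-zero.
tau = {(1, 1): Fraction(1, 4), (1, 2): Fraction(1, 4),
       (2, 2): Fraction(1, 4), (2, 3): Fraction(1, 4)}
assert not non_entwined(tau)
assert any(indicator_jcov(tau, x, y) != 0 for x in (1, 2) for y in (1, 2, 3))

# A product state: all indicator covariances vanish.
s1 = {0: Fraction(1, 3), 1: Fraction(2, 3)}
s2 = {0: Fraction(1, 4), 1: Fraction(3, 4)}
rho = {(x, y): s1[x] * s2[y] for x in s1 for y in s2}
assert non_entwined(rho)
assert all(indicator_jcov(rho, x, y) == 0 for x in s1 for y in s2)
```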


In essence this result says that joint-state independence and joint-state co-
variance being zero are not properties of observables, but of joint states.

Exercises
5.4.1 Prove, in the setting of Definition 5.4.5, that the first marginal
      ((q1 ⊗ q2) =≪ τ)[1, 0] of the transformed state along q1 ⊗ q2 is
      equal to the transformed marginal q1 =≪ τ[1, 0].

5.5 The law of large numbers, in weak form


In this section we describe what is called the weak law of large numbers. The
strong version appears later on, in ??. This weak law captures limit behaviour
of probabilistic operations, such as: if we throw a fair dice many, many times,
we expect to see each number of pips 1/6 of the time. It is sometimes also called
Bernoulli's Theorem.
We shall describe this law of large numbers in two forms: as binary version,
which is most familiar, and as multivariate version. Both versions use results
about means and variances that we have seen before.
First we need to introduce two famous inequalities, due to Markov and to
Chebyshev. If we have an observable p : X → R and a number a ∈ R we
introduce a sharp predicate p ≥ a on X, defined in an obvious way as:

  (p ≥ a)(x) ≔ { 1 if p(x) ≥ a
               { 0 otherwise.                                  (5.6)

We can similarly write p > a, p ≤ a and p < a.

Lemma 5.5.1. Let ω ∈ D(X) be a state, p ∈ Obs(X) an observable, and a ∈ R
an arbitrary number. Then:

1 Markov's inequality holds (for p taking non-negative values):
  a · (ω |= (p ≥ a)) ≤ ω |= p;
2 Chebyshev's inequality holds:
  a² · (ω |= (|p − (ω |= p) · 1| ≥ a)) ≤ Var(ω, p).

Proof. 1 Because:

  ω |= p ≥ ω |= p & (p ≥ a)
         ≥ ω |= (a · 1) & (p ≥ a)
         = ω |= a · (p ≥ a)
         = a · (ω |= (p ≥ a)).

2 Write q ≔ p − (ω |= p) · 1 : X → R, so that q(x) = p(x) − (ω |= p). Then:

  (|q| ≥ a)(x) = 1 ⟺ |q(x)| ≥ a ⟹ q(x)² ≥ a² ⟺ (q² ≥ a²)(x) = 1.

This gives an inequality of (sharp) predicates: (|q| ≥ a) ≤ (q² ≥ a²). Hence,
by the previous item (Markov's inequality),

  a² · (ω |= (|p − (ω |= p) · 1| ≥ a)) = a² · (ω |= (|q| ≥ a))
                                       ≤ a² · (ω |= (q² ≥ a²))
                                       ≤ ω |= q²        by item (1)
                                       = Var(ω, p).
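Both inequalities are easy to test numerically. The sketch below (names are ours) checks them on a small state over X = {0, 1, 4} ⊆ R, with the inclusion as non-negative observable:

```python
from fractions import Fraction

omega = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}
p = {x: x for x in omega}                 # inclusion observable, p >= 0

mean = sum(omega[x] * p[x] for x in omega)                  # omega |= p
var = sum(omega[x] * (p[x] - mean) ** 2 for x in omega)     # Var(omega, p)

for a in (Fraction(1, 2), Fraction(1), Fraction(2), Fraction(3)):
    markov = a * sum(omega[x] for x in omega if p[x] >= a)
    assert markov <= mean                                   # item 1
    cheby = a ** 2 * sum(omega[x] for x in omega if abs(p[x] - mean) >= a)
    assert cheby <= var                                     # item 2
```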

We turn to the weak law of large numbers, in binary form. To start, fix the
single and parallel coin states:

  σ ≔ 1/2|1⟩ + 1/2|0⟩ ∈ D(2)    and    σⁿ ≔ σ ⊗ · · · ⊗ σ ∈ D(2ⁿ),

with a sum predicate on 2ⁿ, namely sum_n : 2ⁿ → [0, 1], defined by:

  sum_n(x1, ..., xn) ≔ (x1 + · · · + xn) / n.

The predicate sum_n captures the average number of heads (as 1) in n coin flips.
Then:

  σ |= sum_1 = 1/2 · 1 + 1/2 · 0 = 1/2
  σ² |= sum_2 = 1/4 · (1+1)/2 + 1/4 · (1+0)/2 + 1/4 · (0+1)/2 + 1/4 · (0+0)/2 = 1/2
  ...
  σⁿ |= sum_n = Σ_{1≤k≤n} ( C(n, k) / 2ⁿ ) · (k / n) = (1/n) · mean(bn[n](1/2)) = 1/2.

For the last equation, see Example 4.1.4 (4). These equations express that in n
coin flips the average number of heads is 1/2. This makes perfect sense.
The weak law of large numbers involves a more subtle statement, namely
that for each ε > 0 the validity:

  σⁿ |= ( |sum_n − 1/2 · 1| ≥ ε )    (5.7)

goes to zero as n goes to infinity. One may think that this convergence to zero
is obvious, but it is not, see Figure 5.2. The above validities (5.7) do go down,
but not monotonously.


Figure 5.2 Example validities (5.7) with ε = 1/10, from n = 1 to n = 20.

Theorem 5.5.2 (Weak law of large numbers). Using the fair coin σ and pred-
icate sum_n : 2ⁿ → [0, 1] defined above one has, for each ε > 0,

  lim_{n→∞} σⁿ |= ( |sum_n − 1/2 · 1| ≥ ε ) = 0.

Proof. We use Chebyshev's inequality from Lemma 5.5.1 (2). Since σⁿ |=
sum_n = 1/2, as we have seen above, it gives the inequality:

  σⁿ |= ( |sum_n − 1/2 · 1| ≥ ε ) ≤ Var(σⁿ, sum_n) / ε² =(∗) 1 / (4nε²).

It suffices to prove the marked equality (∗), since then it is clear that the limit in
the theorem goes to zero, as n goes to infinity.
The product 2ⁿ comes with n projection functions πi : 2ⁿ → 2. Using these
we can write the predicate sum_n : 2ⁿ → [0, 1] as (1/n) · Σ_i πi. Thus:

  σⁿ |= sum_n & sum_n = (1/n²) · Σ_{i,j} (σⁿ |= πi & πj)
    = (1/n²) · ( Σ_i (σⁿ |= πi & πi) + Σ_{i≠j} (σⁿ |= πi & πj) )
    = (1/n²) · ( n/2 + (n² − n)/4 )
    = (n + 1) / 4n.

Hence, by Lemma 5.1.3,

  Var(σⁿ, sum_n) = (σⁿ |= sum_n & sum_n) − (σⁿ |= sum_n)²
                 = (n + 1)/4n − 1/4 = 1/4n.

This proves the marked equation (∗), and thus the theorem.
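The validities (5.7) plotted in Figure 5.2 can be computed exactly through the binomial distribution, since sum_n only depends on the number k of heads. A sketch (the function name is ours):

```python
from fractions import Fraction
from math import comb

def lln_validity(n, eps=Fraction(1, 10)):
    """sigma^n |= (|sum_n - 1/2 . 1| >= eps), via the binomial distribution."""
    return sum(Fraction(comb(n, k), 2 ** n) for k in range(n + 1)
               if abs(Fraction(k, n) - Fraction(1, 2)) >= eps)

vals = [lln_validity(n) for n in range(1, 21)]
assert vals[0] == 1      # n = 1: the average is always 0 or 1
assert vals[2] == 1      # n = 3: every k has |k/3 - 1/2| >= 1/10
# The sequence goes down in the limit, but not monotonously (cf. Figure 5.2):
assert any(vals[i] < vals[i + 1] for i in range(len(vals) - 1))
assert vals[-1] < Fraction(3, 4) < vals[0]
```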


An obvious next step is to extend this result to arbitrary distributions. So let
us fix a distribution ω, with supp(ω) = {x1, ..., x`} = X. For a number n ∈ N
and for 1 ≤ i ≤ ` we use the accumulation predicate acc_i : Xⁿ → [0, 1], given
by:

  acc_i(y1, ..., yn) ≔ acc(y1, ..., yn)(xi) / n = |{ j | yj = xi }| / n.

Thus, the predicate acc_i yields the average number of occurrences of the ele-
ment xi in its input sequence. As one may expect, its value converges to ω(xi)
in increasing product states ωⁿ = ω ⊗ · · · ⊗ ω. This is the content of item (3)
below, which is the multivariate version of the weak law of large numbers.

Proposition 5.5.3. Consider the above situation, with a distribution ω and a
predicate acc_i.

1 The validity is given by:

  ωⁿ |= acc_i = ω(xi).

2 The formula for the variance is:

  Var(ωⁿ, acc_i) = ω(xi) · (1 − ω(xi)) / n.

3 For each ε > 0,

  lim_{n→∞} ωⁿ |= ( |acc_i − ω(xi) · 1| ≥ ε ) = 0.

Proof. 1 Notice that we can write:

  acc_i = (1/n) · ( Xⁿ --acc--> N[n](X) --π_{xi}--> R≥0 ),

where π_{xi} is the point evaluation map from (4.10). Then, using acc as a
deterministic channel,

  ωⁿ |= acc_i = (1/n) · (ωⁿ |= acc ≫= π_{xi})
              = (1/n) · (acc =≪ iid[n](ω) |= π_{xi})
              = (1/n) · (mn[n](ω) |= π_{xi})        by Theorem 3.3.1 (2)
              = ω(xi)                               by (4.11).


2 Via a combination of several earlier results we get:

  Var(ωⁿ, acc_i) = Var(ωⁿ, (1/n) · (π_{xi} ◦ acc))
                 = (1/n²) · Var(ωⁿ, π_{xi} ◦ acc)        by Theorem 5.1.7 (2)
                 = (1/n²) · Var(D(acc)(ωⁿ), π_{xi})      by Exercise 5.1.5
                 = (1/n²) · Var(mn[n](ω), π_{xi})        see Theorem 3.3.1 (2)
                 = ω(xi) · (1 − ω(xi)) / n               by Proposition 5.2.1 (1).

3 Via Chebyshev's inequality from Lemma 5.5.1 (2), in combination with the
previous two items, we get:

  lim_{n→∞} ωⁿ |= ( |acc_i − ω(xi) · 1| ≥ ε ) ≤ lim_{n→∞} Var(ωⁿ, acc_i) / ε²
                                              = lim_{n→∞} ω(xi) · (1 − ω(xi)) / (n · ε²)
                                              = 0.

In the above description we use the accumulation map acc : Xⁿ → N[n](X).
We shall now write:

  acc/n ≔ Flrn ◦ acc : Xⁿ → D(X),

so that, for instance, acc/6 (a, c, a, a, c, b) = 1/2|a⟩ + 1/6|b⟩ + 1/3|c⟩.
For a given state ω ∈ D(X) we can look at the total variation distance d, see
Subsection 4.5.1, as a predicate Xⁿ → [0, 1], given by:

  ~x ↦ d( acc(~x)/n, ω ) = 1/2 · Σ_{y∈X} | acc(~x)(y)/n − ω(y) |,

see Proposition 4.5.1.


This predicate is used in the following multivariate weak large number the-
orem. Informally it says: the distribution that is obtained by accumulating n
samples from ω has a total variation distance to ω that goes to zero as n goes
to infinity.

Theorem 5.5.4. For a state ω ∈ D(X) and a number ε > 0 one has:
 
lim ωn |= d acc(−) , ω ε = 0.

n ≥
n→∞

Proof. Let supp(ω) ⊆ X have ` ∈ N>0 elements. There is an inequality of
sharp predicates on Xⁿ of the form:

  Σ_{y∈supp(ω)} ( |acc(−)(y)/n − ω(y)| ≥ 2ε/` )  ≥(∗)  ( d(acc(−)/n, ω) ≥ ε ).


Figure 5.3 Initial segment of the limit from Theorem 5.5.4, with ω = 1/4|a⟩ +
5/12|b⟩ + 1/3|c⟩ and ε = 1/10, from n = 1 to n = 12.

Indeed, if the left-hand side in (∗) is 0, for ~x ∈ Xⁿ, then for each y ∈ supp(ω)
we have:

  | acc(~x)(y)/n − ω(y) | < 2ε/`.

Hence:

  d( acc(~x)/n, ω ) = 1/2 · Σ_{y∈supp(ω)} | acc(~x)(y)/n − ω(y) |
                    < 1/2 · Σ_{y∈supp(ω)} 2ε/` = ε.

Now we can prove the limit result in the theorem via Proposition 5.5.3 (3):

  lim_{n→∞} ωⁿ |= ( d(acc(−)/n, ω) ≥ ε )
    ≤ lim_{n→∞} ωⁿ |= Σ_{y∈supp(ω)} ( |acc(−)(y)/n − ω(y)| ≥ 2ε/` )
    = Σ_{y∈supp(ω)} lim_{n→∞} ωⁿ |= ( |acc(−)(y)/n − ω(y)| ≥ 2ε/` )
    = 0.

Figure 5.3 gives an impression of the terms in a limit like in Theorem 5.5.4,
for ω = 1/4|a⟩ + 5/12|b⟩ + 1/3|c⟩. The plot covers only a small initial fragment
because of the exponential growth of the sample space. For instance, the space
of the parallel product ω¹² has 3¹² elements, which is more than half a million.
This initial segment suggests that the validities go down, but the suggestion is
weak; Theorem 5.5.4 provides certainty.
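Despite the exponential sample space, the validities of Theorem 5.5.4 remain computable for moderate n: the predicate factors through acc, so one can sum the multinomial probabilities over count vectors instead of over all of Xⁿ. A sketch of this (our own code, for the ω and ε of Figure 5.3):

```python
from fractions import Fraction
from math import factorial

omega = {'a': Fraction(1, 4), 'b': Fraction(5, 12), 'c': Fraction(1, 3)}

def tvd_validity(n, eps=Fraction(1, 10)):
    """omega^n |= (d(acc(-)/n, omega) >= eps), via multinomial count vectors."""
    xs = sorted(omega)
    total = Fraction(0)
    for ka in range(n + 1):
        for kb in range(n + 1 - ka):
            counts = dict(zip(xs, (ka, kb, n - ka - kb)))
            tvd = Fraction(1, 2) * sum(abs(Fraction(k, n) - omega[x])
                                       for x, k in counts.items())
            if tvd >= eps:
                prob = Fraction(factorial(n))   # mn[n](omega) of this multiset
                for x, k in counts.items():
                    prob = prob * omega[x] ** k / factorial(k)
                total += prob
    return total

vals = [tvd_validity(n) for n in range(1, 13)]
assert vals[0] == 1          # a single sample always sits at distance >= 1/10
assert vals[-1] < vals[0]    # and the validities do tend downwards
assert all(0 <= v <= 1 for v in vals)
```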

Exercises
5.5.1 For a number a ∈ R, write 1_{≥a} : R → [0, 1] for the sharp predicate
      with 1_{≥a}(r) = 1 iff r ≥ a. Consider an observable p : X → R as a
      deterministic channel. Check that the sharp predicate p ≥ a from (5.6)
      can also be described as the predicate transformation p ≫= 1_{≥a}.
5.5.2 Show how Theorem 5.5.2 is an instance of Proposition 5.5.3 (3).

6

Updating distributions

One of the most interesting and magical aspects of probability distributions


(states) is that they can be ‘updated’. Informally, this means that in the light
of evidence, one can revise a state to a new state, so that the new state better
matches the evidence. This updating is also called belief update, conditioning,
revision, learning or inference.
Less informally, for a factor (or predicate or event) p and a state ω, both
on the same sample space, one can form a new updated (conditioned) state,
written as ω| p . It absorbs the evidence p into ω. This updating satisfies various
properties, including the famous rule of Thomas Bayes, and also a less well
known but also important validity-increase property: ω| p |= p ≥ ω |= p. It
states that after absorbing p the validity of p increases. This makes perfectly
good sense and is crucial in learning, see Chapter ??.
Updating is particularly interesting for joint states. A theme that will run
through this chapter is that probabilistic updating (conditioning) has ‘crossover’
influence. This means that if we have a joint state on a product sample space,
then updating in one component typically changes the state in the other com-
ponent (the marginal). This crossover influence is another magical aspect in a
probabilistic setting and depends on correlation between the two components.
As we shall see, channels play an important role in this phenomenon.
The combination of updating and transformation along a channel adds its
own dynamics to the topic. An update ω| p involves both a state ω and a factor
p. In presence of a channel we can first transform the state ω or the factor p
and then update. But we can also transform an updated state ω| p . How are all
these related? The power of conditioning becomes apparent when it is com-
bined with transformation, especially for inference in probabilistic reasoning.
We shall distinguish two forms of inference: forward inference involves con-
ditioning followed by state transformation; it is commonly called causal rea-
soning. There is also backward inference, which is observable transformation


followed by conditioning, and is known as evidential reasoning. We illustrate


the usefulness of both these inference techniques in many examples in Sec-
tions 6.3 and 6.4, but also for Bayesian networks, in Section 6.5.
Via backward inference we can turn a channel c : X → Y into a reversed
channel c† : Y → X, which is commonly called the dagger of c. The construc-
tion also involves a state on X, but it is left out at this stage for convenience.
This reversed channel plays an important role in Theorem 6.6.3, which is one
of the more elegant results in Bayesian updating: it relates forward (resp. back-
ward) inference along a channel c to backward (resp. forward) inference along
the reversed channel c† . In this chapter we first introduce dagger channels for
probabilistic reasoning. A more thorough analysis appears in Chapter 7, using
the graphical language of string diagrams.
A special topic is backward inference, which is a form of updating along a
channel. It arises in a situation with a channel c : X → Y with a state σ ∈ D(X)
on its domain. If we have evidence on the codomain Y, we would like to update
σ after transporting the evidence along the channel c. It turns out that there are
two fundamentally different mechanisms for doing so.
1 In the first case the evidence is given in the form of a predicate (or factor)
on Y. One can then apply backward inference. This is called Pearl’s rule,
after [135] (and Bayes).
2 In the second case the evidence is given by a distribution on Y. One can then
do state transformation with the inverted channel (dagger) c† . This is called
Jeffrey’s rule, after [90].
In general, these two update mechanisms produce different outcomes. This is
uncomfortable, especially because it is not clear under which circumstances
one should use which update rule. The mathematical characterisation of these
update rules gives some guidance. Briefly, Pearl’s update rule gives an increase
of validity and Jeffrey’s update rule yields a decrease of divergence. This will
be made mathematically precise in Section 6.7 below. Informally, this involves
the difference between learning as reinforcing what goes well (go higher), or
learning as steering away from what goes wrong (go less low).

6.1 Update basics


We shall use updating and conditioning synonymously. These terms refer to
the incorporation of evidence into a distribution, where the evidence is given
by a predicate (or more generally by a factor). This section describes the def-
inition and basic results, including Bayes’ theorem and validity-increase. The


relevance of conditioning in probabilistic reasoning will be demonstrated in


many examples later on in this chapter.

Definition 6.1.1. Let ω ∈ D(X) be a distribution on a sample space X and let


p ∈ Fact(X) = (R≥0 )^X be a factor, on the same space X.

1 If the validity ω |= p is non-zero, we define a new distribution ω| p ∈ D(X)


as normalised pointwise product of ω and p:

    ω| p (x) B ω(x) · p(x) / (ω |= p)    i.e.    ω| p = Σ_{x∈X} ( ω(x) · p(x) / (ω |= p) ) | x i.

This distribution ω| p may be pronounced as: ω, given p.


2 The conditional expectation or conditional validity of an observable q on X,
given p and ω, is the validity:

ω| p |= q.

3 For a channel c : X → Y and a factor q on Y we define the updated channel


c|q : X → Y via pointwise updating:

c|q (x) B c(x)|q .

In writing c|q we assume that the validity c(x) |= q = (c = q)(x) is non-zero,
for each x ∈ X.

Often, the state ω before updating is called the prior or the a priori state,
whereas the updated state ω| p is called the posterior or the a posteriori state.
The posterior incorporates the evidence given by the factor p. One may thus
expect that in the updated state ω| p the factor p is more true than in ω. This is
indeed the case, as will be shown in Theorem 6.1.5 below.
The conditioning c|q of a channel in item (3) is in fact a generalisation of the
conditioning of a state ω| p in item (1), since the state ω ∈ D(X) can be seen as
a channel ω : 1 → X with a one-element set 1 = {0} as domain. We shall see
the usefulness of conditioning of channels in Subsection 6.3.1.
The standard conditional probability notation is: P(E | D) for events E, D ⊆
X, where the distribution involved is left implicit. If ω is this implicit
distribution, then P(E | D) corresponds to the conditional expectation expressed by
the validity ω|1D |= 1E of the sharp predicate 1E in the state ω updated with the


sharp predicate 1D . Indeed,


    ω|1D |= 1E = Σ_x ω|1D (x) · 1E (x)
               = Σ_{x∈E} ω(x) · 1D (x) / (ω |= 1D )
               = ( Σ_{x∈E∩D} ω(x) ) / ( Σ_{x∈D} ω(x) ) = P(E ∩ D) / P(D) = P(E | D).
The formulation ω| p of conditioning that is used above is not restricted to sharp
predicates, but works much more generally for fuzzy predicates/factors p. This
is sometimes called updating with soft or uncertain evidence [19, 33, 77]. It is
what we use right from the start.
Example 6.1.2. 1 Let’s take the numbers of a dice as sample space: pips =
{1, 2, 3, 4, 5, 6}, with a fair/uniform dice distribution dice = unif_pips =
1/6| 1 i + 1/6| 2 i + 1/6| 3 i + 1/6| 4 i + 1/6| 5 i + 1/6| 6 i. We consider the predicate
evenish ∈ Pred(pips) = [0, 1]^pips expressing that we are fairly certain of the
outcome being even:

    evenish(1) = 1/5     evenish(3) = 1/10    evenish(5) = 1/10
    evenish(2) = 9/10    evenish(4) = 9/10    evenish(6) = 4/5

We first compute the validity of evenish for our fair dice:

    dice |= evenish = Σ_x dice(x) · evenish(x)
                    = 1/6 · 1/5 + 1/6 · 9/10 + 1/6 · 1/10 + 1/6 · 9/10 + 1/6 · 1/10 + 1/6 · 4/5
                    = (2 + 9 + 1 + 9 + 1 + 8)/60 = 1/2.

If we take evenish as evidence, we can update our dice state and get:

    dice|evenish = Σ_x ( dice(x) · evenish(x) / (dice |= evenish) ) | x i
                 = (1/6 · 1/5)/(1/2) | 1 i + (1/6 · 9/10)/(1/2) | 2 i + (1/6 · 1/10)/(1/2) | 3 i
                   + (1/6 · 9/10)/(1/2) | 4 i + (1/6 · 1/10)/(1/2) | 5 i + (1/6 · 4/5)/(1/2) | 6 i
                 = 1/15| 1 i + 3/10| 2 i + 1/30| 3 i + 3/10| 4 i + 1/30| 5 i + 4/15| 6 i.

As expected, the probabilities for the even pips are now, in the posterior,
higher than for the odd ones.
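The above dice computation can be replayed mechanically. In the following Python sketch (an illustration, not part of the text) both distributions and factors are represented as dictionaries, and updating is the normalised pointwise product of Definition 6.1.1:

```python
from fractions import Fraction

def validity(omega, p):
    """The validity omega |= p."""
    return sum(omega[x] * p[x] for x in omega)

def update(omega, p):
    """The updated state omega|_p: normalised pointwise product."""
    v = validity(omega, p)
    return {x: omega[x] * p[x] / v for x in omega}

dice = {x: Fraction(1, 6) for x in range(1, 7)}
evenish = {1: Fraction(1, 5), 2: Fraction(9, 10), 3: Fraction(1, 10),
           4: Fraction(9, 10), 5: Fraction(1, 10), 6: Fraction(4, 5)}

assert validity(dice, evenish) == Fraction(1, 2)
posterior = update(dice, evenish)
assert posterior == {1: Fraction(1, 15), 2: Fraction(3, 10), 3: Fraction(1, 30),
                     4: Fraction(3, 10), 5: Fraction(1, 30), 6: Fraction(4, 15)}
```

Since exact fractions are used, the assertions reproduce the posterior probabilities computed above on the nose.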
2 The following alarm example is due to Pearl [133]. It involves an ‘alarm’ set
A = {a, a⊥ } and a ‘burglary’ set B = {b, b⊥ }, with the following a priori joint
distribution ω ∈ D(A × B).
0.000095| a, b i + 0.009999| a, b⊥ i + 0.000005| a⊥ , b i + 0.989901| a⊥ , b⊥ i.


The a priori burglary distribution is the second marginal:


    ω[ 0, 1 ] = 0.0001| b i + 0.9999| b⊥ i.

Someone reports that the alarm went off, but with only 80% certainty
because of deafness. This can be described as a predicate p : A → [0, 1]
with p(a) = 0.8 and p(a⊥ ) = 0.2. We see that there is a mismatch between
the p on A and ω on A × B. This can be solved via weakening p to p ⊗ 1, so
that it becomes a predicate on A × B. Then we can take the update:
ω| p⊗1 ∈ D(A × B).
In order to do so we first compute the validity:
    ω |= p ⊗ 1 = Σ_{x,y} ω(x, y) · p(x) = 0.206.

We can then compute the updated joint state as:


    ω| p⊗1 = Σ_{x,y} ( ω(x, y) · p(x) / (ω |= p ⊗ 1) ) | x, y i
           = 0.0003688| a, b i + 0.03882| a, b⊥ i + 0.000004853| a⊥ , b i + 0.9608| a⊥ , b⊥ i.
The resulting posterior burglary distribution — with the alarm evidence
taken into account — is obtained by taking the second marginal of the up-
dated distribution:
    ( ω| p⊗1 )[ 0, 1 ] = 0.0004| b i + 0.9996| b⊥ i.
 

We see that the burglary probability is four times higher in the posterior
than in the prior. What happens is noteworthy: evidence about one compo-
nent A changes the probabilities in another component B. This ‘crossover
influence’ (in the terminology of [88]) or ‘crossover inference’ happens pre-
cisely because the joint distribution ω is entwined, so that the different parts
can influence each other. We shall return to this theme repeatedly, see for
instance Corollary 6.3.13 below.
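The crossover computation of this alarm example can be sketched in the same dictionary style; the names a, a⊥, b, b⊥ are encoded as short strings, and floating point numbers are used, so the outcomes match the rounded values above only approximately:

```python
# joint prior on A x B; a_ and b_ stand for the negated elements
omega = {('a', 'b'): 0.000095, ('a', 'b_'): 0.009999,
         ('a_', 'b'): 0.000005, ('a_', 'b_'): 0.989901}
p = {'a': 0.8, 'a_': 0.2}      # 80%-certain report that the alarm went off

# validity of the weakened predicate p (x) 1, then update and marginalise
v = sum(pr * p[x] for (x, y), pr in omega.items())
post = {(x, y): pr * p[x] / v for (x, y), pr in omega.items()}
burglary = {y: sum(pr for (x, y2), pr in post.items() if y2 == y)
            for y in ('b', 'b_')}

assert abs(v - 0.206) < 1e-3
# the posterior burglary probability is about four times the prior 0.0001
assert 3.5 < burglary['b'] / 0.0001 < 4.0
```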
One of the main results about conditioning is Bayes’ theorem. We present it
here for factors, and not just for sharp predicates (events), as is common.
Theorem 6.1.3. Let ω be a distribution on a sample space X, and let p, q be
factors on X.

1 The product rule holds:

    ω| p |= q = ( ω |= p & q ) / ( ω |= p ).


2 Bayes’ rule holds:

    ω| p |= q = ( ω|q |= p ) · ( ω |= q ) / ( ω |= p ).

This result carefully distinguishes a product rule, in item (1), and Bayes’ rule,
in item (2). This distinction is not always made, and the product rule is some-
times also called Bayes’ rule.

Proof. 1 We straightforwardly compute:

    ω| p |= q = Σ_x ω| p (x) · q(x) = Σ_x ( ω(x) · p(x) / (ω |= p) ) · q(x)
             = ( Σ_x ω(x) · p(x) · q(x) ) / ( ω |= p )
             = ( Σ_x ω(x) · (p & q)(x) ) / ( ω |= p ) = ( ω |= p & q ) / ( ω |= p ).

2 This follows directly by using the previous item twice, in combination with
the commutativity of &, in:

    ω| p |= q =(1) ( ω |= p & q ) / ( ω |= p ) = ( ω |= q & p ) / ( ω |= p ) =(1) ( ω|q |= p ) · ( ω |= q ) / ( ω |= p ).
Example 6.1.4. We instantiate Theorem 6.1.3 with sharp predicates 1E , 1D
for subsets/events E, D ⊆ X. Then the familiar formulations of the prod-
uct/Bayes rule appear.

1 The product rule specialises to the definition of conditional probability:

    P(E | D) = ω|1D |= 1E = ( ω |= 1D & 1E ) / ( ω |= 1D ) = ( ω |= 1D∩E ) / P(D) = P(D ∩ E) / P(D).

2 Bayes’ rule, in the general formulation of Theorem 6.1.3 (2), specialises to
the well known inversion property of conditional probabilities:

    P(E | D) = ω|1D |= 1E = ( ω|1E |= 1D ) · ( ω |= 1E ) / ( ω |= 1D ) = P(D | E) · P(E) / P(D).
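Both rules are easily checked numerically, on an arbitrary small example (the particular numbers below carry no meaning); the last assertion anticipates the validity-increase of Theorem 6.1.5 below:

```python
omega = {1: 0.2, 2: 0.5, 3: 0.3}
p = {1: 0.9, 2: 0.1, 3: 0.4}
q = {1: 0.3, 2: 0.8, 3: 0.5}

def validity(w, f):
    return sum(w[x] * f[x] for x in w)

def update(w, f):
    v = validity(w, f)
    return {x: w[x] * f[x] / v for x in w}

conj = {x: p[x] * q[x] for x in p}            # the conjunction p & q
lhs = validity(update(omega, p), q)           # omega|_p |= q

# product rule
assert abs(lhs - validity(omega, conj) / validity(omega, p)) < 1e-12
# Bayes' rule
rhs = validity(update(omega, q), p) * validity(omega, q) / validity(omega, p)
assert abs(lhs - rhs) < 1e-12
# validity increase, see Theorem 6.1.5 below
assert validity(update(omega, p), p) >= validity(omega, p)
```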
We have explained updating ω| p as incorporating the evidence p into the
distribution ω. Thus, one expects p to be ‘more true’ in ω| p than in ω. The next
result shows that this is indeed the case. It plays an important role in learning,
see Chapter ??.

Theorem 6.1.5 (Validity-increase). For a distribution ω and a factor p on the
same set, if the validity ω |= p is non-zero, one has:

    ω| p |= p ≥ ω |= p.

Proof. We recall the ‘byproduct’ ω |= p & p ≥ (ω |= p)^2 from Lemma 5.1.3.
Then, by the product rule from Theorem 6.1.3 (1),

    ω| p |= p = ( ω |= p & p ) / ( ω |= p ) ≥ ( ω |= p )^2 / ( ω |= p ) = ω |= p.
We add a few more basic facts about conditioning.

Lemma 6.1.6. Let ω be a distribution on X, with factors p, q ∈ Fact(X).

1 Conditioning with truth has no effect:

ω|1 = ω.

2 Conditioning with a point predicate gives a point state: for a ∈ X,

    ω|1a = 1| a i, assuming ω(a) ≠ 0.

3 Iterated conditionings commute:

    ( ω| p )|q = ω| p&q = ( ω|q )| p .
 

4 Conditioning is stable under multiplication of the factor with a scalar s > 0:

ω| s·p = ω| p .

5 Conditioning can be done component-wise, for product states and factors:

    (σ ⊗ τ)| p⊗q = σ| p ⊗ τ|q .
 

6 Marginalisation of a conditioning with a (similarly) weakened predicate is
conditioning of the marginalised state:

    ( ω|1⊗q )[ 0, 1 ] = ( ω[ 0, 1 ] )|q .
 

7 For a function f : X → Y, used as deterministic channel,

    ( f = ω)|q = f = ( ω|q◦ f ).

When we ignore undefinedness issues, we see that items 1 and 3 show that
conditioning is an action on distributions, namely of the monoid of factors with
conjunction (1, &), see Definition 1.2.4.

Proof. 1 Trivial since ω |= 1 = 1.

2 Assuming ω(a) ≠ 0 we get for each x ∈ X,

    ω|1a (x) = ω(x) · 1a (x) / (ω |= 1a ) = ω(a) · 1| a i(x) / ω(a) = 1| a i(x).


3 It suffices to prove:

    ( ω| p )|q (x) = ω| p (x) · q(x) / (ω| p |= q)
                  = ( ω(x) · p(x) / (ω |= p) ) · q(x) / ( (ω |= p & q) / (ω |= p) )    by Theorem 6.1.3 (1)
                  = ω(x) · (p & q)(x) / ( ω |= p & q ) = ω| p&q (x).
4 First we have:

    ω |= s · p = Σ_x ω(x) · (s · p)(x) = Σ_x ω(x) · s · p(x) = s · Σ_x ω(x) · p(x) = s · (ω |= p).

Next:

    ω| s·p (x) = ω(x) · (s · p)(x) / (ω |= s · p) = ω(x) · s · p(x) / ( s · (ω |= p) ) = ω(x) · p(x) / (ω |= p) = ω| p (x).
5 For states σ ∈ D(X), τ ∈ D(Y) and factors p on X and q on Y one has:

    ( (σ ⊗ τ)| p⊗q )(x, y) = (σ ⊗ τ)(x, y) · (p ⊗ q)(x, y) / ( (σ ⊗ τ) |= p ⊗ q )
                          = σ(x) · τ(y) · p(x) · q(y) / ( (σ |= p) · (τ |= q) )    by Lemma 4.2.8
                          = ( σ(x) · p(x) / (σ |= p) ) · ( τ(y) · q(y) / (τ |= q) )
                          = σ| p (x) · τ|q (y)
                          = ( σ| p ⊗ τ|q )(x, y).


6 Let ω ∈ D(X × Y) and q be a factor on Y; then:

    ( ω|1⊗q )[ 0, 1 ](y) = Σ_x ω|1⊗q (x, y)
                        = Σ_x ω(x, y) · (1 ⊗ q)(x, y) / ( ω |= 1 ⊗ q )
                        =(4.8) ω[ 0, 1 ](y) · q(y) / ( ω[ 0, 1 ] |= q ) = ( ω[ 0, 1 ]|q )(y).
 

7 For f : X → Y, ω ∈ D(X) and q ∈ Fact(Y),

    ( f = ω)|q (y) = D( f )(ω)|q (y) = D( f )(ω)(y) · q(y) / ( D( f )(ω) |= q )
                  = Σ_{x∈ f⁻¹(y)} ω(x) · q(y) / ( Σ_x ω(x) · q( f (x)) )
                  = Σ_{x∈ f⁻¹(y)} ω(x) · q( f (x)) / ( Σ_x ω(x) · q( f (x)) )
                  = Σ_{x∈ f⁻¹(y)} ω|q◦ f (x)
                  = D( f )( ω|q◦ f )(y) = ( f = (ω|q◦ f ) )(y).
 


In the beginning of this section we have defined updating ω| p for a state
ω ∈ D(X) and a factor p : X → R≥0 . Now let’s assume that this factor p
is bounded: there is a bound B ∈ R>0 such that p(x) ≤ B for all x ∈ X.
The rescaled factor (1/B) · p is then a predicate. Lemma 6.1.6 (4) shows that
updating with the factor p is the same as updating with the predicate (1/B) · p.
Hence we do not lose much if we restrict updating to predicates. Nevertheless
it is most convenient to define updating for factors, so that we do not have to
bother about any rescaling.
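Items (3) and (5) of Lemma 6.1.6 can be checked numerically in the same dictionary style as before, again on arbitrary small examples (a sketch; the numbers carry no meaning):

```python
def validity(w, f):
    return sum(w[x] * f[x] for x in w)

def update(w, f):
    v = validity(w, f)
    return {x: w[x] * f[x] / v for x in w}

# item (3): iterated conditionings commute
omega = {0: 0.25, 1: 0.75}
p = {0: 0.6, 1: 0.2}
r = {0: 0.4, 1: 0.8}
lhs = update(update(omega, p), r)
rhs = update(omega, {x: p[x] * r[x] for x in omega})   # omega|_{p & r}
assert all(abs(lhs[x] - rhs[x]) < 1e-12 for x in omega)

# item (5): componentwise conditioning of a product state
sigma = {0: 0.3, 1: 0.7}
tau = {0: 0.5, 1: 0.5}
q = {0: 0.1, 1: 0.9}
joint = {(x, y): sigma[x] * tau[y] for x in sigma for y in tau}
pq = {(x, y): p[x] * q[y] for x in p for y in q}
left = update(joint, pq)
sp, tq = update(sigma, p), update(tau, q)
assert all(abs(left[x, y] - sp[x] * tq[y]) < 1e-12 for (x, y) in joint)
```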

Exercises
6.1.1 Let ω ∈ D(X) with a ∈ supp(ω). Prove that:

    ω |= 1a = ω(a)    and    ω|1a = 1| a i.

6.1.2 Consider the following girls/boys riddle: given that a family with two
children has a boy, what is the probability that the other child is a girl?
Take as space {G, B}. On it we use the uniform distribution unif =
1/2|G i + 1/2| B i since there is no prior knowledge.

1 Take as ‘at least one girl’ and ‘at least one boy’ predicates on
{G, B} × {G, B}:

    g B (1B ⊗ 1B )⊥ = (1G ⊗ 1) > (1B ⊗ 1G )
    b B (1G ⊗ 1G )⊥ = (1B ⊗ 1) > (1G ⊗ 1B ).

Compute unif ⊗ unif |= g and unif ⊗ unif |= b.
2 Check that ( unif ⊗ unif )|b |= g = 2/3.

(For a description and solution of this problem in a special library for
probabilistic programming of the functional programming language
Haskell, see [45].)
6.1.3 In the setting of Example 6.1.2 (1) define a new predicate oddish =
evenish⊥ = 1 − evenish.

1 Compute dice|oddish .
2 Prove the equation below, involving a convex sum of states on the
left-hand side.

    (dice |= evenish) · dice|evenish + (dice |= oddish) · dice|oddish = dice.

6.1.4 Let p1 , . . . , pn ∈ Pred(X) be a test, i.e. an n-tuple of predicates on X
with p1 > · · · > pn = 1. Let ω ∈ D(X) and q ∈ Fact(X).

1 Check that 1 = Σ_i (ω |= pi ).
2 Prove what is called the law of total probability:

    ω = Σ_i (ω |= pi ) · ω| pi .    (6.1)

What happens to the expression on the right-hand side if one of
the pi has validity zero? Check that this equation generalises Exer-
cise 6.1.3.
(The expression on the right-hand side in (6.1) is used to turn a
test into a ‘denotation’ function D(X) → D(D(X)) in [121, 122],
namely as ω ↦ Σ_i (ω |= pi ) | ω| pi i. This process is described more
abstractly in terms of ‘hypernormalisation’ in [72].)
3 Show that:

    ω |= q = Σ_i ( ω |= q & pi ).

4 Prove now:

    ω|q |= pi = ( ω |= q & pi ) / ( Σ_j ω |= q & p j ).

6.1.5 Show that conditioning a convex sum of states yields another convex
sum of conditioned states: for σ, τ ∈ D(X), p ∈ Fact(X) and r, s ∈
[0, 1] with r + s = 1,

    (r · σ + s · τ)| p
      = ( r · (σ |= p) / ( r · (σ |= p) + s · (τ |= p) ) ) · σ| p + ( s · (τ |= p) / ( r · (σ |= p) + s · (τ |= p) ) ) · τ| p
      = ( r · (σ |= p) / ( (r · σ + s · τ) |= p ) ) · σ| p + ( s · (τ |= p) / ( (r · σ + s · τ) |= p ) ) · τ| p .
6.1.6 Consider ω ∈ D(X) and p ∈ Fact(X) where p is non-zero, at least
on the support of ω. Check that updating ω with p can be undone via
updating with the factor 1/p.
6.1.7 Let c : X → Y be a channel, with state ω ∈ D(X × Z).

1 Prove that for a factor p ∈ Fact(Z),

    ( (c ⊗ id ) = ω )|1⊗p = (c ⊗ id ) = ( ω|1⊗p ).

2 Show also that for q ∈ Fact(Y × Z),

    ( ( (c ⊗ id ) = ω )|q )[ 0, 1 ] = ( ω|(c⊗id ) = q )[ 0, 1 ].

6.1.8 This exercise will demonstrate that conditioning may both create and
remove entwinedness.

1 Write yes = 1_1 : 2 → [0, 1], where 2 = {0, 1}, and no = yes⊥ = 1_0 .
Prove that the following conditioning of a non-entwined state,

    τ B (flip ⊗ flip)|(yes⊗yes)>(no⊗no)

is entwined.
2 Consider the state ω ∈ D(2 × 2 × 2) given by:

    ω = 1/18| 111 i + 1/9| 110 i + 2/9| 101 i + 1/9| 100 i
        + 1/9| 011 i + 2/9| 010 i + 1/9| 001 i + 1/18| 000 i

Prove that ω’s first and third component are entwined:

    ω[ 1, 0, 1 ] ≠ ω[ 1, 0, 0 ] ⊗ ω[ 0, 0, 1 ].

3 Now let ρ be the following conditioning of ω:

    ρ B ω|1⊗yes⊗1 .

Prove that ρ’s first and third component are non-entwined.

The phenomenon that entwined states become non-entwined via con-
ditioning is called screening-off, whereas the opposite, non-entwined
states becoming entwined via conditioning, is called explaining away.
6.1.9 Show that for ω ∈ D(X) and p1 , p2 ∈ Fact(X) one has:

    ( ∆ = ω )| p1 ⊗p2 = ∆ = ( ω| p1 &p2 ).

Note that this is a consequence of Lemma 6.1.6 (7).


6.1.10 We have mentioned (right after Definition 6.1.1) that updating of the
identity channel has no effect. Prove more generally that for an ordi-
nary function f : X → Y, updating the associated deterministic chan-
nel  f  : X → Y has no effect:

 f |q =  f .

6.1.11 Let c : Z → X and d : Z → Y be two channels with a common domain


Z, and with factors p ∈ Fact(X) and q ∈ Fact(Y) on their codomains.
Prove that the update of a tuple channel is the tuple of the updates:

    ⟨c, d⟩| p⊗q = ⟨c| p , d|q ⟩.

Prove also that for e : U → X and f : V → Y,

(e ⊗ f )| p⊗q = (e| p ) ⊗ ( f |q ).


6.1.12 Consider a state ω and predicate p on the same set. For r ∈ [0, 1]
we use the (ad hoc) abbreviation ωr for the convex combination of
updated states:
ωr B r · ω| p + (1 − r) · ω| p⊥ .
1 Prove that:
r ≤ (ω |= p) =⇒ r ≤ (ωr |= p) ≤ (ω |= p)
r ≥ (ω |= p) =⇒ r ≥ (ωr |= p) ≥ (ω |= p).
2 Show that, still for all r ∈ [0, 1],

    r · ( (ω |= p) / (ωr |= p) ) + (1 − r) · ( (ω |= p⊥ ) / (ωr |= p⊥ ) ) ≤ 1.
Hint: Write 1 = r + (1 − r) on the right-hand side of the inequality
≤ and move the two summands to the left.
This inequality occurs in n-ary form in Proposition 6.7.10 below. It
is due to [50].

6.2 Updating draw distributions


In this section we apply updating to distributions that are obtained by drawing
from an urn, as described in Chapter 3. We first show that conditioning com-
mutes with multinomial channels. The proof is not particularly difficult, but
the result is noteworthy because it shows how well basic probabilistic notions
are integrated.
Recall that p• : N(X) → R is the multiplicative extension of an observable
p : X → R, see Definition 4.4.2. In Proposition 4.4.3 (2) we already saw that
the validity mn[K](ω) |= p• in a multinomial distribution equals the K-th power
(ω |= p)^K of the validity in the original distribution. We now extend this result
to updating.
Theorem 6.2.1. Let ω ∈ D(X) be given. For a factor (or predicate) p : X →
R≥0 and a number K ∈ N,

1 mn[K](ω)| p• = mn[K]( ω| p );
2 similarly, for the parallel multinomial law pml from Section 3.6,

    pml( Σ_i ni | ωi i )| p• = pml( Σ_i ni | ωi | p i ).

An example that combines multinomials and updating occurs in [145, §6.4],


but without a general formulation.


Proof. 1 Via the validity formula of Proposition 4.4.3 (2):

    mn[K](ω)| p• = Σ_{ϕ∈N[K](X)} ( mn[K](ω)(ϕ) · p• (ϕ) / ( mn[K](ω) |= p• ) ) | ϕ i
                = Σ_{ϕ∈N[K](X)} (ϕ) · ( Π_x ( ω(x) · p(x) )^{ϕ(x)} / (ω |= p)^K ) | ϕ i
                = Σ_{ϕ∈N[K](X)} (ϕ) · Π_x ( ω(x) · p(x) / (ω |= p) )^{ϕ(x)} | ϕ i    since ∥ϕ∥ = K
                = Σ_{ϕ∈N[K](X)} (ϕ) · Π_x ω| p (x)^{ϕ(x)} | ϕ i
                = mn[K]( ω| p ).

2 Similarly, via Proposition 4.4.5, writing K = Σ_i ni ,

    pml( Σ_i ni | ωi i )| p•
      = Σ_{ϕ∈N[K](X)} ( pml( Σ_i ni | ωi i )(ϕ) · p• (ϕ) / ( pml( Σ_i ni | ωi i ) |= p• ) ) | ϕ i
      =(3.25) Σ_{ϕi ∈N[ni ](X), for each i} ( Π_i mn[ni ](ωi )(ϕi ) · p• (ϕi ) / Π_i (ωi |= p)^{ni} ) | Σ_i ϕi i
      = Σ_{ϕi ∈N[ni ](X), for each i} ( Π_i (ϕi ) ) · ( Π_i Π_x ( ωi (x) · p(x) )^{ϕi (x)} / Π_i (ωi |= p)^{ni} ) | Σ_i ϕi i
      = Σ_{ϕi ∈N[ni ](X), for each i} ( Π_i (ϕi ) ) · Π_i Π_x ( ωi (x) · p(x) / (ωi |= p) )^{ϕi (x)} | Σ_i ϕi i
      = Σ_{ϕi ∈N[ni ](X), for each i} Π_i mn[ni ]( ωi | p )(ϕi ) | Σ_i ϕi i
      = pml( Σ_i ni | ωi | p i ).
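Theorem 6.2.1 (1) can be tested on a small example by enumerating all multisets of a fixed size K as count vectors; the state, factor and size below are arbitrary choices for illustration:

```python
import itertools
import math
from fractions import Fraction

X = ('a', 'b', 'c')
omega = {'a': Fraction(1, 2), 'b': Fraction(1, 3), 'c': Fraction(1, 6)}
p = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 1)}
K = 3

# multisets of size K over X, represented as count vectors aligned with X
phis = [phi for phi in itertools.product(range(K + 1), repeat=len(X))
        if sum(phi) == K]

def mn(w):
    """The multinomial distribution mn[K](w) on count vectors."""
    out = {}
    for phi in phis:
        coeff = math.factorial(K)
        prob = Fraction(1)
        for i, x in enumerate(X):
            coeff //= math.factorial(phi[i])
            prob *= w[x] ** phi[i]
        out[phi] = coeff * prob
    return out

def validity(w, f):
    return sum(w[k] * f[k] for k in w)

def update(w, f):
    v = validity(w, f)
    return {k: w[k] * f[k] / v for k in w}

# the multiplicative extension p^bullet on count vectors
p_bullet = {phi: math.prod(p[x] ** phi[i] for i, x in enumerate(X))
            for phi in phis}

# Theorem 6.2.1 (1): conditioning commutes with the multinomial channel
assert update(mn(omega), p_bullet) == mn(update(omega, p))
```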

The action ϕ • p of a factor p on a multiset ϕ from Exercise 4.2.13 can


be used to get an analogue of the first item in the previous proposition for
hypergeometric and Pólya distributions. However, this result is more restricted
and works only for a sharp predicate p. The sharp predicate restricts both the
urn and the draws from it to balls of certain colours only.

Theorem 6.2.2. Let ψ ∈ N(X) be an urn and E ⊆ X an event/subset, forming
a sharp indicator predicate 1E : X → [0, 1].

1 If ∥ψ • 1E ∥ ≥ K,

    hg[K](ψ)|1E• = hg[K]( ψ • 1E ).

2 Similarly, if ψ • 1E is non-empty,

    pl[K](ψ)|1E• = pl[K]( ψ • 1E ).

Proof. 1 We first notice that for ϕ ∈ N(X),

    1E• (ϕ) = 1 ⇐⇒ Π_{x∈X} 1E (x)^{ϕ(x)} = 1
            ⇐⇒ ∀x ∈ X. x ∉ E ⇒ ϕ(x) = 0
            ⇐⇒ supp(ϕ) ⊆ E.

Next, let’s write L = ∥ψ∥ for the size of the urn and LE = ∥ψ • 1E ∥ for the
size of the restricted urn — where it is assumed that LE ≥ K.

    hg[K](ψ) |= 1E• = Σ_{ϕ≤K ψ} hg[K](ψ)(ϕ) · 1E• (ϕ)
                   = Σ_{ϕ≤K ψ, supp(ϕ)⊆E} hg[K](ψ)(ϕ)
                   = Σ_{ϕ≤K ψ•1E } hg[K](ψ)(ϕ)
                   = ( Σ_{ϕ≤K ψ•1E } (ψ•1E choose ϕ) ) / (L choose K)
                   = (LE choose K) / (L choose K)    by Lemma 1.6.2.

Then:

    hg[K](ψ)|1E• = Σ_{ϕ≤K ψ} ( hg[K](ψ)(ϕ) · 1E• (ϕ) / ( hg[K](ψ) |= 1E• ) ) | ϕ i
                = Σ_{ϕ≤K ψ•1E } ( Π_x (ψ(x) choose ϕ(x)) / (L choose K) ) · ( (L choose K) / (LE choose K) ) | ϕ i
                = Σ_{ϕ≤K ψ•1E } ( (ψ•1E choose ϕ) / (LE choose K) ) | ϕ i = hg[K]( ψ • 1E ).

2 Similarly.
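Theorem 6.2.2 (1) can be tested similarly, by representing an urn as a dictionary of counts and draws as count vectors; the urn and event below are arbitrary illustrations:

```python
import itertools
import math
from fractions import Fraction

X = ('a', 'b', 'c')
psi = {'a': 3, 'b': 2, 'c': 2}        # an urn with 7 balls
E = {'a', 'c'}                        # the event behind the sharp predicate 1_E
K = 2

def hg(urn):
    """The hypergeometric distribution hg[K](urn) on count vectors."""
    L = sum(urn.values())
    out = {}
    for phi in itertools.product(*(range(urn[x] + 1) for x in X)):
        if sum(phi) == K:
            num = math.prod(math.comb(urn[x], phi[i]) for i, x in enumerate(X))
            out[phi] = Fraction(num, math.comb(L, K))
    return out

def update(w, f):
    v = sum(w[k] * f[k] for k in w)
    return {k: w[k] * f[k] / v for k in w}

# the predicate 1_E^bullet: 1 on draws whose support lies inside E
ind = {phi: 1 if all(phi[i] == 0 for i, x in enumerate(X) if x not in E) else 0
       for phi in hg(psi)}
restricted = {x: psi[x] if x in E else 0 for x in X}    # the urn psi . 1_E

lhs = {k: v for k, v in update(hg(psi), ind).items() if v}
rhs = {k: v for k, v in hg(restricted).items() if v}
assert lhs == rhs
```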
We illustrate how to solve a famous riddle via updating a draw distribution.
Example 6.2.3. We start by recalling the ordered draw channel Ot− : N∗ (X) →
L∗ (X) × N(X) from Figure 3.1. For convenience, we recall that it is given by:

    Ot− (ψ) = Σ_{x∈supp(ψ)} ( ψ(x) / ∥ψ∥ ) | ( [x], ψ − 1| x i ) i.


We recall that these maps can be iterated, via Kleisli composition ◦·, as de-
scribed in (3.18).
Below we would like to incorporate the evidence that the last drawn colour is in a
subset U ⊆ X. We thus extend U to last(U) ⊆ L∗ (X), given by:
last(U) = { [x1 , . . . , xn ] ∈ L∗ (X) | xn ∈ U}.
We now come to the so-called Monty Hall problem. It is a famous riddle in
probability due to [150], see also e.g. [61, 158]:
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one
door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who
knows what’s behind the doors, opens another door, say No. 3, which has a goat. He
then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch
your choice?

We describe the three doors via a set D = {1, 2, 3} and the situation in which
to make the first choice as a multiset ψ = 1| 1 i + 1| 2i + 1| 3 i. One may also
view ψ as an urn from which to draw numbered balls. Let’s assume without
loss of generality that the car is behind door 2. We shall represent this logically
by using a (sharp) ‘goat’ predicate 1G : {1, 2, 3} → [0, 1], for G = {1, 3}.
Your first draw yields a distribution:

    Ot− (ψ) = 1/3| ( [1], 1| 2 i + 1| 3 i ) i + 1/3| ( [2], 1| 1 i + 1| 3 i ) i + 1/3| ( [3], 1| 1 i + 1| 2 i ) i.

Next, the host’s opening of a door with a goat is modeled by a new draw, but
we know that a goat appears. The resulting distribution is described by another
draw followed by an update:

    ( Ot− ◦· Ot− )(ψ)|1last(G) ⊗1
      = 1/4| ( [1, 3], 1| 2 i ) i + 1/4| ( [2, 1], 1| 3 i ) i + 1/4| ( [2, 3], 1| 1 i ) i + 1/4| ( [3, 1], 1| 2 i ) i.

Inspection of this outcome shows that for two of your three choices — when
you choose 1 or 3, in the first and last term — it makes sense to change your
choice, because the car is behind the remaining door 2. Hence changing is
better than sticking to your original choice.
We have given a formal account of the situation. More informally, the host
knows where the car is, so his choice is not arbitrary. By opening a door with
a goat behind it, the host is giving you information that you can exploit to
improve your choice: two of your possible choices are wrong, but in those two
out of three cases the host gives you information on how to correct your choice.
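The Monty Hall computation can be replayed exactly with fractions; the following sketch enumerates the two ordered draws directly, instead of going through the channel Ot−:

```python
from fractions import Fraction

G = {1, 3}                     # the car is behind door 2, goats behind 1 and 3
doors = [1, 2, 3]

# two ordered draws without replacement from the urn 1|1> + 1|2> + 1|3>
pairs = {(d1, d2): Fraction(1, 6)
         for d1 in doors for d2 in doors if d1 != d2}

# condition on the second drawn door (opened by the host) hiding a goat
v = sum(pr for (d1, d2), pr in pairs.items() if d2 in G)
post = {pq: pr / v for pq, pr in pairs.items() if pq[1] in G}

assert v == Fraction(2, 3)
assert post == {(1, 3): Fraction(1, 4), (2, 1): Fraction(1, 4),
                (2, 3): Fraction(1, 4), (3, 1): Fraction(1, 4)}
```

The four equally likely outcomes match the conditioned distribution above; in the two outcomes where the first pick was door 1 or 3, the car is behind the remaining door.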
We conclude with a well known result, namely that a multinomial distri-
bution can be obtained from conditioning parallel Poisson distributions. This
result can be found for instance in [145, §6.4], in binary form. Recall that a


Poisson distribution has a rate/intensity λ as parameter, giving the number of


events per time unit. The Poisson distribution then describes the probability of
k events, per time unit. When we put n Poisson processes in parallel, with rates
λ1 , . . . , λn and condition on getting K events in total, the resulting probability
can also be obtained as a multinomial with K-sized draws, and with state/urn
proportional to the rates λ1 , . . . , λn .

Theorem 6.2.4. Let λ1 , . . . , λm ∈ R≥0 be given, with λ B Σ_i λi > 0. For
K ∈ N, consider the sharp predicate sumK : N^m → {0, 1} given by:

    sumK (n1 , . . . , nm ) = 1 ⇐⇒ n1 + · · · + nm = K.

Then:

    ( pois[λ1 ] ⊗ · · · ⊗ pois[λm ] )|sumK
      = Σ_{n1 ,...,nm with Σ_i ni =K} mn[K]( Σ_i (λi /λ)| i i )( Σ_i ni | i i ) | n1 , . . . , nm i.

Proof. We first compute the validity:

    pois[λ1 ] ⊗ · · · ⊗ pois[λm ] |= sumK
      = Σ_{n1 ,...,nm with Σ_i ni =K} pois[λ1 ](n1 ) · . . . · pois[λm ](nm )
      = Σ_{n1 ,...,nm with Σ_i ni =K} e^{−λ1 } · λ1^{n1 }/n1 ! · . . . · e^{−λm } · λm^{nm }/nm !
      = ( e^{−(λ1 +···+λm )} / K! ) · Σ_{n1 ,...,nm with Σ_i ni =K} (K choose n1 , . . . , nm ) · λ1^{n1 } · . . . · λm^{nm }
      =(1.26) ( e^{−λ} / K! ) · λ^K
      = pois[λ](K).


The update then becomes:

    ( pois[λ1 ] ⊗ · · · ⊗ pois[λm ] )|sumK
      = Σ_{n1 ,...,nm with Σ_i ni =K} ( pois[λ1 ](n1 ) · . . . · pois[λm ](nm ) / ( pois[λ1 ] ⊗ · · · ⊗ pois[λm ] |= sumK ) ) | n1 , . . . , nm i
      = Σ_{n1 ,...,nm with Σ_i ni =K} ( ( e^{−λ1 } · λ1^{n1 }/n1 ! · . . . · e^{−λm } · λm^{nm }/nm ! ) / ( (e^{−λ}/K!) · λ^K ) ) | n1 , . . . , nm i
      = Σ_{n1 ,...,nm with Σ_i ni =K} (K choose n1 , . . . , nm ) · Π_i (λi /λ)^{ni } | n1 , . . . , nm i
      = Σ_{n1 ,...,nm with Σ_i ni =K} mn[K]( Σ_i (λi /λ)| i i )( Σ_i ni | i i ) | n1 , . . . , nm i.
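Theorem 6.2.4 can be tested numerically for m = 2, where the multinomial specialises to a binomial; the rates below are arbitrary:

```python
import math

lam1, lam2, K = 2.0, 3.0, 4    # arbitrary rates and total count
lam = lam1 + lam2

def pois(l, k):
    """The Poisson probability pois[l](k)."""
    return math.exp(-l) * l ** k / math.factorial(k)

# restrict to the (finite) event n1 + n2 = K and normalise
joint = {(n, K - n): pois(lam1, n) * pois(lam2, K - n) for n in range(K + 1)}
v = sum(joint.values())
post = {nm: pr / v for nm, pr in joint.items()}

# the conditioned distribution is the binomial with probabilities lam_i / lam
for n in range(K + 1):
    b = math.comb(K, n) * (lam1 / lam) ** n * (lam2 / lam) ** (K - n)
    assert abs(post[n, K - n] - b) < 1e-12
# and the normalisation constant is pois[lam](K), as in the proof
assert abs(v - pois(lam, K)) < 1e-12
```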

The next result shows how hypergeometric distributions can be expressed in


terms of multinomial distributions, where, somewhat remarkably, the state ω
is completely irrelevant.

Theorem 6.2.5. Let K, L ∈ N be given with K ≤ L. Fix ψ ∈ N[L](X) and
consider the sum of natural multisets as a sharp predicate sumψ : N[K](X) ×
N[L−K](X) → {0, 1} given by:

    sumψ (ϕ, ϕ′) = 1 ⇐⇒ ϕ + ϕ′ = ψ.

1 For each state ω ∈ D(X),

    mn[K](ω) ⊗ mn[L−K](ω) |= sumψ = mn[L](ω)(ψ).

2 Again for each state ω ∈ D(X),

    hg[K](ψ) = ( ( mn[K](ω) ⊗ mn[L−K](ω) )|sumψ )[ 1, 0 ].

Proof. 1 By Exercise 3.3.8 we first get:

    mn[K](ω) ⊗ mn[L−K](ω) |= sumψ
      = Σ_{ϕ∈N[K](X), ϕ′∈N[L−K](X), ϕ+ϕ′=ψ} ( mn[K](ω) ⊗ mn[L−K](ω) )(ϕ, ϕ′)
      = ( sum = ( mn[K](ω) ⊗ mn[L−K](ω) ) )(ψ)
      = mn[L](ω)(ψ).

2 Next we do the update:

    ( mn[K](ω) ⊗ mn[L−K](ω) )|sumψ
      = Σ_{ϕ∈N[K](X), ϕ′∈N[L−K](X), ϕ+ϕ′=ψ} ( mn[K](ω)(ϕ) · mn[L−K](ω)(ϕ′) / ( mn[K](ω) ⊗ mn[L−K](ω) |= sumψ ) ) | ϕ, ϕ′ i
      = Σ_{ϕ≤K ψ} ( (ϕ) · ( Π_x ω(x)^{ϕ(x)} ) · (ψ − ϕ) · ( Π_x ω(x)^{ψ(x)−ϕ(x)} ) / mn[L](ω)(ψ) ) | ϕ, ψ − ϕ i
      = Σ_{ϕ≤K ψ} ( (ϕ) · (ψ − ϕ) · Π_x ω(x)^{ψ(x)} / ( (ψ) · Π_x ω(x)^{ψ(x)} ) ) | ϕ, ψ − ϕ i
      = Σ_{ϕ≤K ψ} ( (ψ choose ϕ) / (L choose K) ) | ϕ, ψ − ϕ i,    see Lemma 1.6.4 (1).

The first marginal of the latter distribution is the hypergeometric distribu-


tion.

Exercises
6.2.1 Prove Theorem 6.2.2 (2) yourself.
6.2.2 Let K1 , . . . , KN ∈ N with ψ ∈ N[K](X) be given, where K = Σ_i Ki .
Write sumψ : N[K1 ](X) × · · · × N[KN ](X) → {0, 1} for the sharp predicate
defined by:

    sumψ (ϕ1 , . . . , ϕN ) = 1 ⇐⇒ ϕ1 + · · · + ϕN = ψ.

Show that, for any ω ∈ D(X),

    ( mn[K1 ](ω) ⊗ · · · ⊗ mn[KN ](ω) )|sumψ
      = Σ_{ϕ1 ≤K1 ψ, ..., ϕN ≤KN ψ, Σ_i ϕi =ψ} ( 1 / (ψ choose ϕ1 , . . . , ϕN ) ) | ϕ1 , . . . , ϕN i.

Recall from Exercise 1.6.9 that the inverses of the multinomial coef-
ficients of multisets in the above sum add up to one.

6.3 Forward and backward inference


Forward transformation of states and backward transformation of observ-
ables can be combined with updating of states. It gives rise to the powerful
techniques of forward and backward inference. This will be illustrated in the
current section. At the end, these channel-based inference mechanisms are re-
lated to crossover inference for joint states.


The next definition captures the two basic patterns — first formulated in this
form in [87]. We shall refer to them jointly as channel-based inference, or as
reasoning along channels.

Definition 6.3.1. Let ω ∈ D(X) be a state on the domain of a channel c : X →


Y.

1 For a factor p ∈ Fact(X), we define forward inference as transformation


along c of the state ω updated with p, as in:

c = ω| p .

This is also called prediction or causal reasoning.


2 For a factor q ∈ Fact(Y), backward inference is updating of ω with the
transformed factor:
ω|c = q .

This is also called explanation or evidential reasoning. We shall also refer to
this operation as Pearl’s update rule, in contrast with Jeffrey’s rule, to be
discussed in Section 6.7.

In both cases the distribution ω is often called the prior distribution or simply
the prior. Similarly, c = ω| p and ω|c = q are called posterior distributions or just
posteriors.

Thus, with forward inference one first conditions and then performs (for-
ward, state) transformation, whereas for backward inference one first performs
(backward, factor) transformation, and then one conditions. We shall illustrate
these fundamental inference mechanisms in several examples. They mostly
involve backward inference, since that is the more useful technique. An impor-
tant first step in these examples is to recognise the channel that is hidden in the
description of the problem. It is instructive to try and do this, before reading
the analysis and the solution.

Example 6.3.2. We start with the following question from [144, Example 1.12].

Consider two urns. The first contains two white and seven black balls, and the second
contains five white and six black balls. We flip a coin and then draw a ball from the
first urn or the second urn depending on whether the outcome was heads or tails. What
is the conditional probability that the outcome of the toss was heads given that a white
ball was selected?

Our analysis involves two sample spaces {H, T } for the sides of the coin and

330 Chapter 6. Updating distributions

{W, B} for the colours of the balls in the urns. The coin distribution is uniform: unif = 1/2|H⟩ + 1/2|T⟩. The above description implicitly contains a channel c : {H, T} → {W, B}, namely:

    c(H) = Flrn(2|W⟩ + 7|B⟩) = 2/9|W⟩ + 7/9|B⟩
    c(T) = Flrn(5|W⟩ + 6|B⟩) = 5/11|W⟩ + 6/11|B⟩.

As in the above quote, the first urn is associated with heads and the second one
with tails.
The evidence that we have is described in the quote after the word ‘given’.
It is the point predicate 1W on the set of colours {W, B}, indicating that a white
ball was selected. This evidence can be pulled back (transformed) along the
channel c, to a predicate c =≪ 1_W on the sample space {H, T}. It is given by:

    (c =≪ 1_W)(H) = Σ_{x∈{W,B}} c(H)(x) · 1_W(x) = c(H)(W) = 2/9.

Similarly we get (c =≪ 1_W)(T) = c(T)(W) = 5/11.
The answer that we are interested in is obtained by updating the prior unif with the transformed evidence c =≪ 1_W, as given by unif|(c =≪ 1_W). This is an instance of backward inference.
In order to get this answer, we first compute the validity:

    unif |= c =≪ 1_W = unif(H) · (c =≪ 1_W)(H) + unif(T) · (c =≪ 1_W)(T)
        = 1/2 · 2/9 + 1/2 · 5/11 = 1/9 + 5/22 = 67/198.

Then:

    unif|(c =≪ 1_W) = (1/2 · 2/9)/(67/198) |H⟩ + (1/2 · 5/11)/(67/198) |T⟩ = 22/67|H⟩ + 45/67|T⟩.

Thus, the conditional probability of heads is 22/67. The same outcome is obtained in [144], of course, but there via an application of Bayes' rule.
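The whole computation fits in a few lines of exact rational arithmetic. The following standalone sketch (our own, with ad-hoc names) reproduces the posterior 22/67:

```python
from fractions import Fraction as F

unif = {'H': F(1, 2), 'T': F(1, 2)}
c = {'H': {'W': F(2, 9), 'B': F(7, 9)},     # Flrn(2|W> + 7|B>)
     'T': {'W': F(5, 11), 'B': F(6, 11)}}   # Flrn(5|W> + 6|B>)

# predicate transformation c =<< 1_W, followed by updating (backward inference)
pulled = {x: c[x]['W'] for x in unif}
validity = sum(unif[x] * pulled[x] for x in unif)              # 67/198
posterior = {x: unif[x] * pulled[x] / validity for x in unif}
print(posterior['H'])   # 22/67
```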

Example 6.3.3. Consider the following classical question from [159].

A cab was involved in a hit and run accident at night. Two cab companies, Green and
Blue, operate in the city. You are given the following data:
• 85% of the cabs in the city are Green and 15% are Blue
• A witness identified the cab as Blue. The court tested the reliability of the witness
under the circumstances that existed on the night of the accident, and concluded that
the witness correctly identified each one of the two colors 80% of the time and failed
20% of the time.
What is the probability that the cab involved in the accident was Blue rather than Green?


We use as colour set C = {G, B} for Green and Blue. There is a prior 'base rate' distribution ω = 17/20|G⟩ + 3/20|B⟩ ∈ D(C), as in the first bullet above. The reliability information in the second bullet translates into a 'correctness' channel c : {G, B} → {G, B} given by:

    c(G) = 4/5|G⟩ + 1/5|B⟩        c(B) = 1/5|G⟩ + 4/5|B⟩.

The second bullet also gives evidence of a Blue car. It translates into a point predicate 1_B on {G, B}. It can be used for backward inference, giving the answer to the query, as posterior:

    ω|(c =≪ 1_B) = 17/29|G⟩ + 12/29|B⟩ ≈ 0.5862|G⟩ + 0.4138|B⟩.
Thus the probability that the Blue car was actually involved in the incident is a bit more than 41%. This may seem like a relatively low probability, given that the evidence says 'Blue taxicab' and that observations are 80% accurate. But this low percentage is explained by the fact that there are relatively few Blue taxicabs in the first place, namely only 15%. This is in the prior, base rate distribution ω. It is argued in [159] that humans find it difficult to take such base rates (or priors) into account. They call this phenomenon "base rate neglect", see also [57].

Notice that the channel c is used to accommodate the uncertainty of observations: the sharp point observation 'Blue taxicab' is transformed into a fuzzy predicate c =≪ 1_B. The latter is G ↦ 1/5 and B ↦ 4/5. Updating ω with this predicate gives the claimed outcome.
Example 6.3.4. We continue in the setting of Example 2.2.2, with a teacher in a certain mood, and predictions about pupils' performances depending on the mood. We now assume that the pupils have done rather poorly, with no-one scoring above 5, as described by the following evidence/predicate q on the set of grades Y = {1, 2, . . . , 10}:

    q = 1/10 · 1_1 + 3/10 · 1_2 + 3/10 · 1_3 + 2/10 · 1_4 + 1/10 · 1_5.

The validity of this predicate q in the predicted state c =≫ σ is:

    c =≫ σ |= q = σ |= c =≪ q = 299/4000 = 0.07475.

The interested reader may wish to check that the updated state σ′ = σ|(c =≪ q) via backward inference is:

    σ′ = 77/299|p⟩ + 162/299|n⟩ + 60/299|o⟩ ≈ 0.2575|p⟩ + 0.5418|n⟩ + 0.2007|o⟩.

Interestingly, after updating, the teacher has a more realistic view, in the sense that the validity of the predicate q has risen to c =≫ σ′ |= q = 15577/149500 ≈ 0.1042. This is one way in which the mind can adapt to external evidence.


Example 6.3.5. Recall the Medicine-Blood Table (1.18) with data on different types of medicine via a set M = {0, 1, 2} and blood pressure via the set B = {H, L}. From the table we can extract a channel b : M → B describing the blood pressure distribution for each medicine type. This channel is obtained by column-wise frequentist learning:

    b(0) = 2/3|H⟩ + 1/3|L⟩      b(1) = 7/9|H⟩ + 2/9|L⟩      b(2) = 5/8|H⟩ + 3/8|L⟩.

The prior medicine distribution ω = 3/20|0⟩ + 9/20|1⟩ + 2/5|2⟩ is obtained from the totals row in the table.
The predicted state b =≫ ω is 7/10|H⟩ + 3/10|L⟩. It is the distribution that is learnt from the totals column in Table (1.18). Suppose we wish to focus on the people that take either medicine 1 or 2. We do so by conditioning, via the subset E = {1, 2} ⊆ M, with associated sharp predicate 1_E : M → [0, 1]. Then:

    ω |= 1_E = 9/20 + 2/5 = 17/20    so    ω|1_E = (9/20)/(17/20) |1⟩ + (2/5)/(17/20) |2⟩ = 9/17|1⟩ + 8/17|2⟩.

Forward reasoning, precisely as in Definition 6.3.1 (1), gives:

    b =≫ ω|1_E = b =≫ (9/17|1⟩ + 8/17|2⟩)
               = (9/17 · 7/9 + 8/17 · 5/8)|H⟩ + (9/17 · 2/9 + 8/17 · 3/8)|L⟩
               = 12/17|H⟩ + 5/17|L⟩.

This shows the distribution of high and low blood pressure among people using medicine 1 or 2.
We turn to backward reasoning. Suppose that we have evidence 1_H on {H, L} of high blood pressure. What is then the associated distribution of medicine usage? It is obtained in several steps:

    (b =≪ 1_H)(x) = Σ_y b(x)(y) · 1_H(y) = b(x)(H)

    ω |= b =≪ 1_H = Σ_x ω(x) · (b =≪ 1_H)(x) = Σ_x ω(x) · b(x)(H)
        = 3/20 · 2/3 + 9/20 · 7/9 + 2/5 · 5/8 = 7/10

    ω|(b =≪ 1_H) = Σ_x (ω(x) · (b =≪ 1_H)(x)) / (ω |= b =≪ 1_H) |x⟩
        = (3/20 · 2/3)/(7/10) |0⟩ + (9/20 · 7/9)/(7/10) |1⟩ + (2/5 · 5/8)/(7/10) |2⟩
        = 1/7|0⟩ + 1/2|1⟩ + 5/14|2⟩ ≈ 0.1429|0⟩ + 0.5|1⟩ + 0.3571|2⟩.

We can also reason with 'soft' evidence, using the full power of fuzzy predicates. Suppose we are only 95% sure that the blood pressure is high, due to some measurement uncertainty. Then we can use as evidence the predicate q : B → [0, 1] given by q(H) = 0.95 and q(L) = 0.05. It yields:

    ω|(b =≪ q) ≈ 0.1434|0⟩ + 0.4963|1⟩ + 0.3603|2⟩.

We see a slight difference with respect to the outcome with sharp evidence.
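Both the sharp and the soft update can be checked mechanically. The sketch below (our own illustration, with a throwaway helper `backward`) computes the two posteriors exactly; the soft one comes out as 39/272, 135/272, 49/136, which agrees with the rounded values above.

```python
from fractions import Fraction as F

omega = {0: F(3, 20), 1: F(9, 20), 2: F(2, 5)}      # medicine prior
b = {0: {'H': F(2, 3), 'L': F(1, 3)},
     1: {'H': F(7, 9), 'L': F(2, 9)},
     2: {'H': F(5, 8), 'L': F(3, 8)}}

def backward(state, chan, q):
    """omega|_(b =<< q): pull the factor back along the channel, then condition."""
    pulled = {x: sum(chan[x][y] * q[y] for y in q) for x in state}
    validity = sum(state[x] * pulled[x] for x in state)
    return {x: state[x] * pulled[x] / validity for x in state}

sharp = backward(omega, b, {'H': F(1), 'L': F(0)})          # point evidence 1_H
soft = backward(omega, b, {'H': F(19, 20), 'L': F(1, 20)})  # only 95% sure
```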

Example 6.3.6. The following question comes from [158, §6.1.3] (and is also
used in [129]).
One fish is contained within the confines of an opaque fishbowl. The fish is equally
likely to be a piranha or a goldfish. A sushi lover throws a piranha into the fish bowl
alongside the other fish. Then, immediately, before either fish can devour the other,
one of the fish is blindly removed from the fishbowl. The fish that has been removed
from the bowl turns out to be a piranha. What is the probability that the fish that was
originally in the bowl by itself was a piranha?

Let's use the letters 'p' and 'g' for piranha and goldfish. We are looking at a situation with multiple fish in a bowl, where we cannot distinguish the order. Hence we describe the contents of the bowl as a (natural) multiset over {p, g}, that is, as an element of N({p, g}). The initial situation can then be described as a distribution ω ∈ D(N({p, g})) with:

    ω = 1/2| 1|p⟩ ⟩ + 1/2| 1|g⟩ ⟩.

Adding a piranha to the bowl involves a function A : N({p, g}) → N({p, g}), such that A(σ) = (σ(p) + 1)|p⟩ + σ(g)|g⟩. It forms a deterministic channel. We use a piranha predicate P : N({p, g}) → [0, 1] that gives the likelihood P(σ) of taking a piranha from a multiset/bowl σ. Thus, as in (2.13):

    P(σ) ≔ Flrn(σ)(p) = σ(p) / (σ(p) + σ(g)).
We have now collected all ingredients to answer the question via backward inference along the deterministic channel A. It involves the following steps.

    (A =≪ P)(σ) = P(A(σ)) = (σ(p) + 1) / (σ(p) + 1 + σ(g))

    ω |= A =≪ P = 1/2 · (A =≪ P)(1|p⟩) + 1/2 · (A =≪ P)(1|g⟩)
        = 1/2 · (1+1)/(1+1+0) + 1/2 · (0+1)/(0+1+1) = 1/2 + 1/4 = 3/4

    ω|(A =≪ P) = (1/2 · 1)/(3/4) | 1|p⟩ ⟩ + (1/2 · 1/2)/(3/4) | 1|g⟩ ⟩
               = 2/3| 1|p⟩ ⟩ + 1/3| 1|g⟩ ⟩.

Hence the answer to the question in the beginning of this example is: probability 2/3 that the original fish is a piranha.
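Representing a bowl as a pair (number of piranhas, number of goldfish), the backward inference along the deterministic channel A can be replayed directly. This is a small standalone sketch with our own naming, not the book's code:

```python
from fractions import Fraction as F

# bowl contents as a multiset: (number of piranhas, number of goldfish)
omega = {(1, 0): F(1, 2), (0, 1): F(1, 2)}   # unknown original fish

def add_piranha(bowl):                       # the deterministic channel A
    p, g = bowl
    return (p + 1, g)

def piranha_pred(bowl):                      # P(sigma) = sigma(p)/(sigma(p)+sigma(g))
    p, g = bowl
    return F(p, p + g)

pulled = {b: piranha_pred(add_piranha(b)) for b in omega}       # A =<< P
validity = sum(omega[b] * pulled[b] for b in omega)             # 3/4
posterior = {b: omega[b] * pulled[b] / validity for b in omega}
print(posterior[(1, 0)])   # 2/3
```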


Example 6.3.7. In [146, §20.1] a situation is described with five different bags, numbered 1, . . . , 5, each containing its own mixture of cherry (C) and lime (L) candies. This situation can be described via a candy channel c : B → {C, L}, where B = {1, 2, 3, 4, 5} and:

    c(1) = 1|C⟩
    c(2) = 3/4|C⟩ + 1/4|L⟩
    c(3) = 1/2|C⟩ + 1/2|L⟩
    c(4) = 1/4|C⟩ + 3/4|L⟩
    c(5) = 1|L⟩.

The initial bag distribution is ω = 1/10|1⟩ + 1/5|2⟩ + 2/5|3⟩ + 1/5|4⟩ + 1/10|5⟩.
In the situation described in [146, §20.1] the sample space of bags B is regarded as hidden (not directly observable), in a scenario where a new bag i ∈ B is given and candies are drawn from it, inspected and returned. It turns out that 10 consecutive draws yield a lime candy; what can we then infer about the distribution of bags?

Transforming the lime point predicate 1_L along channel c yields the fuzzy predicate c =≪ 1_L : B → [0, 1] given by:

    c =≪ 1_L = Σ_i c(i)(L) · 1_i = 1/4 · 1_2 + 1/2 · 1_3 + 3/4 · 1_4 + 1 · 1_5.

The question is what we learn about the bag distribution after observing this predicate 10 consecutive times. This involves computing:

    ω|(c =≪ 1_L) = 1/10|2⟩ + 2/5|3⟩ + 3/10|4⟩ + 1/5|5⟩

    ω|(c =≪ 1_L)|(c =≪ 1_L) = ω|((c =≪ 1_L) & (c =≪ 1_L)) = ω|(c =≪ 1_L)²
        = 1/26|2⟩ + 4/13|3⟩ + 9/26|4⟩ + 4/13|5⟩
        ≈ 0.0385|2⟩ + 0.308|3⟩ + 0.346|4⟩ + 0.308|5⟩

    ω|(c =≪ 1_L)|(c =≪ 1_L)|(c =≪ 1_L) = ω|((c =≪ 1_L) & (c =≪ 1_L) & (c =≪ 1_L)) = ω|(c =≪ 1_L)³
        = 1/76|2⟩ + 4/19|3⟩ + 27/76|4⟩ + 8/19|5⟩
        ≈ 0.0132|2⟩ + 0.211|3⟩ + 0.355|4⟩ + 0.421|5⟩  . . .

Figure 20.1 in [146] gives a plot of these distributions; it is reconstructed here in Figure 6.1 via the above formulas. It shows that bag 5 quickly becomes most likely — as expected, since it contains the most lime candies — and that bag 1 is impossible after drawing the first lime.
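The successive posteriors are conveniently computed in a loop, multiplying in the lime likelihood c(i)(L) once per draw and renormalising. This is a standalone sketch with our own variable names:

```python
from fractions import Fraction as F

omega = {1: F(1, 10), 2: F(1, 5), 3: F(2, 5), 4: F(1, 5), 5: F(1, 10)}
lime = {1: F(0), 2: F(1, 4), 3: F(1, 2), 4: F(3, 4), 5: F(1)}   # c =<< 1_L

posteriors = [omega]
post = omega
for _ in range(10):                 # ten lime draws in a row
    validity = sum(post[i] * lime[i] for i in post)
    post = {i: post[i] * lime[i] / validity for i in post}
    posteriors.append(post)
```

After one draw this gives 1/10, 2/5, 3/10, 1/5 on bags 2 to 5, and bag 1 stays at probability 0 forever, as in the text.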

Example 6.3.8. Medical tests form standard examples of Bayesian reasoning,


via backward inference, see e.g. Exercises 6.3.1 and 6.3.2 below. Here we look
at Covid-19 (“Corona”) which is interesting because its standard PCR-test has


Figure 6.1 Posterior, updated bag distributions ω|(c =≪ 1_L)ⁿ for n = 0, 1, . . . , 10, aligned vertically, after n candy draws that all happen to be lime.

low false positives but high false negatives, and moreover these false negatives depend on the day after infection.

The Covid-19 PCR-test has almost no false positives. This means that if you do not have the disease, then the likelihood of a (false) positive test is very low. In our calculations below we put it at 1%, independently of the day that you get tested. In contrast, the PCR-test has considerable false negative rates, which depend on the day after infection. The plot at the top in Figure 6.2 gives an indication; it does not precisely reflect the medical reality, but it provides a reasonable approximation, good enough for our calculation. This plot shows that if you are infected at day 0, then a test at this day or the day after (day 1) will surely be negative. On the second day after your infection a PCR-test might start to detect, but still there is only a 20% chance of a positive outcome. This probability increases, and on day 6 the likelihood of a positive test has risen to 80%.
How to formalise this situation? We use the following three sample spaces,
for Covid (C), days after infection (D), and test outcome (T ).
C = {c, c⊥ } D = {0, 1, 2, 3, 4, 5, 6} T = {p, n}.
The test probabilities are then captured via a test channel t : C × D → T , given


Figure 6.2 Covid-19 false negatives and posteriors after tests. In the lower plot P2 means: positive test after 2 days, that is, in state (r|c⟩ + (1 − r)|c⊥⟩) ⊗ ϕ2, where r ∈ [0, 1] is the prior Covid probability. Similarly for N2, P5, N5.

in the following way. The first equation captures the false positives, and the


second one the false negatives, as in the plot at the top of Figure 6.2.

    t(c⊥, i) = 1/100|p⟩ + 99/100|n⟩        for all i ∈ D

    t(c, i) =
        1|n⟩                       if i = 0 or i = 1
        2/10|p⟩ + 8/10|n⟩          if i = 2
        3/10|p⟩ + 7/10|n⟩          if i = 3
        4/10|p⟩ + 6/10|n⟩          if i = 4
        6/10|p⟩ + 4/10|n⟩          if i = 5
        8/10|p⟩ + 2/10|n⟩          if i = 6

In practice it is often difficult to determine the precise date of infection. We shall consider two cases, where the infection happened (around) two and five days ago, via the two distributions:

    ϕ2 = 1/4|1⟩ + 1/2|2⟩ + 1/4|3⟩        ϕ5 = 1/4|4⟩ + 1/2|5⟩ + 1/4|6⟩.

Let's consider the case where we have no (prior) information about the likelihood that the person that is going to be tested has the disease. Therefore we use as prior ω = unif_C = 1/2|c⟩ + 1/2|c⊥⟩.

Suppose in this situation we have a positive test, say after two days. What do we then learn about the disease probability? Our evidence is the positive test predicate 1_p on T, which can be transformed to t =≪ 1_p on C × D. We can use this predicate to update the joint state ω ⊗ ϕ2. The distribution that we are after is the first marginal of this updated state, as in:

    ((ω ⊗ ϕ2)|(t =≪ 1_p))[1, 0] ∈ D(C).

We shall go through the computation step by step. First,

    t =≪ 1_p = Σ_{x∈C, y∈D} t(x, y)(p) · 1_(x,y)
        = 2/10 · 1_(c,2) + 3/10 · 1_(c,3) + 4/10 · 1_(c,4) + 6/10 · 1_(c,5) + 8/10 · 1_(c,6)
          + 1/100 · 1_(c⊥,0) + 1/100 · 1_(c⊥,1) + 1/100 · 1_(c⊥,2) + 1/100 · 1_(c⊥,3)
          + 1/100 · 1_(c⊥,4) + 1/100 · 1_(c⊥,5) + 1/100 · 1_(c⊥,6).

Then:

    ω ⊗ ϕ2 |= t =≪ 1_p = Σ_{x∈C, y∈D} ω(x) · ϕ2(y) · (t =≪ 1_p)(x, y)
        = 1/2 · 1/2 · 2/10 + 1/2 · 1/4 · 3/10 + 1/2 · 1/4 · 1/100 + 1/2 · 1/2 · 1/100 + 1/2 · 1/4 · 1/100
        = (40 + 30 + 1 + 2 + 1)/800 = 37/400.


But then:

    (ω ⊗ ϕ2)|(t =≪ 1_p)
        = (1/2 · 1/2 · 2/10)/(37/400) |c, 2⟩ + (1/2 · 1/4 · 3/10)/(37/400) |c, 3⟩
          + (1/2 · 1/4 · 1/100)/(37/400) |c⊥, 1⟩ + (1/2 · 1/2 · 1/100)/(37/400) |c⊥, 2⟩
          + (1/2 · 1/4 · 1/100)/(37/400) |c⊥, 3⟩
        = 20/37|c, 2⟩ + 15/37|c, 3⟩ + 1/74|c⊥, 1⟩ + 1/37|c⊥, 2⟩ + 1/74|c⊥, 3⟩.

Finally:

    ((ω ⊗ ϕ2)|(t =≪ 1_p))[1, 0] = 35/37|c⟩ + 2/37|c⊥⟩ ≈ 0.946|c⟩ + 0.054|c⊥⟩.

Hence a positive test changes the a priori likelihood of 50% to about 95%. In a similar way one can compute that:

    ((ω ⊗ ϕ2)|(t =≪ 1_n))[1, 0] = 165/363|c⟩ + 198/363|c⊥⟩ ≈ 0.455|c⟩ + 0.545|c⊥⟩.

We see that a negative test, 2 days after infection, reduces the prior disease probability of 50% only slightly, namely to 45%.

Doing the test around 5 days after infection gives a bit more certainty, especially in the case of a negative test:

    ((ω ⊗ ϕ5)|(t =≪ 1_p))[1, 0] = 60/61|c⟩ + 1/61|c⊥⟩ ≈ 0.984|c⟩ + 0.016|c⊥⟩
    ((ω ⊗ ϕ5)|(t =≪ 1_n))[1, 0] = 40/139|c⟩ + 99/139|c⊥⟩ ≈ 0.288|c⟩ + 0.712|c⊥⟩.

The lower plot in Figure 6.2 gives a more elaborate description, for different disease probabilities r in a prior ω = r|c⟩ + (1 − r)|c⊥⟩ as used above. We see that a positive test outcome quickly gives certainty about having the disease. But a negative test outcome gives only a little bit of information with respect to the prior — which in this plot can be represented as the diagonal. For this reason, if you get a negative PCR-test, often a second test is done a few days later.
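The joint-state computation for a positive test around two days after infection can be replayed exactly. The sketch below is our own standalone illustration; `pos` collects the positive-test probabilities of the channel t:

```python
from fractions import Fraction as F

# probability of a positive test outcome, per (disease status, day)
pos = {('c', 0): F(0), ('c', 1): F(0), ('c', 2): F(2, 10), ('c', 3): F(3, 10),
       ('c', 4): F(4, 10), ('c', 5): F(6, 10), ('c', 6): F(8, 10)}
for i in range(7):                       # 1% false positives, on every day
    pos[('c_bot', i)] = F(1, 100)

omega = {'c': F(1, 2), 'c_bot': F(1, 2)}     # uniform disease prior
phi2 = {1: F(1, 4), 2: F(1, 2), 3: F(1, 4)}  # infected around two days ago

# update the joint state omega (x) phi2 with t =<< 1_p, then take the first marginal
joint = {(x, i): omega[x] * phi2[i] * pos[(x, i)] for x in omega for i in phi2}
validity = sum(joint.values())                                     # 37/400
marginal = {x: sum(joint[(x, i)] for i in phi2) / validity for x in omega}
print(marginal['c'])   # 35/37
```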

In the next two examples we look at estimating the number of fish in a pond by counting marked fish, first in multinomial mode and then in hypergeometric mode.

Example 6.3.9. Capture and recapture is a methodology used in ecology to estimate the size of a population. So imagine we are looking at a pond and we
wish to learn the number of fish. We catch twenty of them, mark them, and
throw them back. Subsequently we catch another twenty, and find out that five
of them are marked. What do we learn about the number of fish?
The number of fish in the pond must be at least 20. Let’s assume the maximal
number is 300. We will be considering units of 10 fish. Hence the underlying


sample space F together with the uniform 'prior' state unif_F is:

    F = {20, 30, 40, . . . , 300}    with    unif_F = Σ_{x∈F} 1/29 |x⟩.
We now assume that K = 20 of the fish in the pond are marked. We can then compute for each value 20, 30, 40, . . . in the fish space F the probability of finding 5 marked fish when 20 of them are caught. In order not to complicate the calculations too much, we catch these 20 fish one by one, check if they are marked, and then throw them back. This means that the probability of catching a marked fish remains the same, and is described by a binomial distribution, see Example 2.1.1 (2). Its parameters are K = 20 with probability 20/i of catching a marked fish, where i ∈ F is the assumed total number of fish. This is incorporated in the following 'catch' channel c : F → K+1 = {0, 1, . . . , K}:

    c(i) ≔ bn[K](K/i) = Σ_{k∈K+1} (K choose k) · (K/i)^k · ((i−K)/i)^(K−k) |k⟩.

Once this is set up, we construct a posterior state by updating the prior with the information that five marked fish have been found. The latter is expressed as point predicate 1_5 ∈ Pred(K+1) on the codomain of the channel c. We can then do backward inference, as in Definition 6.3.1 (2), and obtain the updated uniform distribution:

    unif_F|(c =≪ 1_5) = Σ_{i∈F} ((20/i)^5 · ((i−20)/i)^15) / (Σ_{j∈F} (20/j)^5 · ((j−20)/j)^15) |i⟩.

The bar chart of this posterior is in Figure 6.3; it indicates the likelihoods of the various numbers of fish in the pond. One can also compute the expected value (mean) of this posterior; it's 116.5 fish. In case we had caught 10 marked fish out of 20, the expected number would be 47.5.
Note that taking a uniform prior corresponds to the idea that we have no
idea about the number of fish in the pond. But possibly we already had a good
estimate from previous years. Then we could have used such an estimate as
prior distribution, and updated it with this year’s evidence.
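The posterior over pond sizes is straightforward to compute numerically. The sketch below (our own illustration) confirms that the most likely pond size is 80, where the binomial likelihood of 5 marked among 20 peaks, and reproduces the mean of roughly 116.5 fish.

```python
from math import comb

sizes = range(20, 301, 10)          # the fish space F
K, k = 20, 5                        # recapture size, number found marked

def likelihood(i):                  # bn[K](20/i), evaluated at k
    p = 20 / i
    return comb(K, k) * p ** k * (1 - p) ** (K - k)

weights = {i: likelihood(i) for i in sizes}
total = sum(weights.values())
posterior = {i: w / total for i, w in weights.items()}

mode = max(posterior, key=posterior.get)          # most likely pond size
mean = sum(i * pr for i, pr in posterior.items())
```

Note that i = 20 automatically receives weight 0: if all fish were marked, drawing 15 unmarked ones would be impossible.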

Example 6.3.10. We take another look at the previous example. There we used the multinomial distribution (in binomial form) for the probability of catching five marked fish, per pond size. This multinomial mode is appropriate for drawing with replacement, which corresponds to returning each fish that we catch to the pond. This is probably not what happens in practice. So let's try to redescribe the capture-recapture model in hypergeometric mode (like in [145, §4.8.3, Ex. 8h]).

Let's write M = {m, m⊥} for the space with elements m for marked and m⊥ for


Figure 6.3 The posterior fish number distribution after catching 5 marked fish, in
multinomial mode, see Example 6.3.9.

unmarked. Our recapture catch (draw) of K = 20 fish, with 5 of them marked, is thus a multiset κ = 5|m⟩ + 15|m⊥⟩. The urn from which we draw is the pond, in which the total number of fish is unknown, but we do know that 20 of them are marked. The urn/pond is thus a multiset 20|m⟩ + (i − 20)|m⊥⟩ with i ∈ F.
We now use a channel:

    d : F = {20, 30, . . . , 300} → N[K](M) ≅ K+1 = {0, 1, . . . , K},

given by:

    d(i) ≔ hg[K](20|m⟩ + (i − 20)|m⊥⟩).

The updated pond distribution is then:

    unif_F|(d =≪ 1_κ) = Σ_{i∈F} ((i−20 choose 15)/(i choose 20)) / (Σ_{j∈F} (j−20 choose 15)/(j choose 20)) |i⟩.

Its bar chart is in Figure 6.4. It differs minimally from the multinomial one in Figure 6.3. In the hypergeometric case the expected value is 113 fish, against 116.5 in the multinomial case. When the recapture involves 10 marked fish, the expected values are 45.9, against 47.5. As we have already seen in Remark 3.4.3, hypergeometric distributions on small draws from a large urn look very much like multinomial distributions.


Figure 6.4 The posterior fish number distribution after catching 5 marked fish, in
hypergeometric mode, see Example 6.3.10.

6.3.1 Conditioning after state transformation

After all these examples we develop some general results. We start with a fundamental result about conditioning of a transformed state. It can be reformulated as a combination of forward and backward reasoning. This has many consequences.

Theorem 6.3.11. Let c : X → Y be a channel with a state ω ∈ D(X) on its domain and a factor q ∈ Fact(Y) on its codomain. Then:

    (c =≫ ω)|q = c|q =≫ (ω|(c =≪ q)).        (6.2)


Proof. For each y ∈ Y,

    ((c =≫ ω)|q)(y) = ((c =≫ ω)(y) · q(y)) / (c =≫ ω |= q)
        = Σ_x (ω(x) · c(x)(y) · q(y)) / (ω |= c =≪ q)
        = Σ_x (ω(x) · (c =≪ q)(x)) / (ω |= c =≪ q) · (c(x)(y) · q(y)) / ((c =≪ q)(x))
        = Σ_x (ω|(c =≪ q))(x) · (c(x)(y) · q(y)) / (c(x) |= q)
        = Σ_x (ω|(c =≪ q))(x) · (c(x)|q)(y)
        = Σ_x (ω|(c =≪ q))(x) · (c|q)(x)(y)
        = (c|q =≫ (ω|(c =≪ q)))(y).
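Equation (6.2) is also easy to test numerically. The following standalone sketch (our own throwaway example, not the book's code) computes both sides exactly on a small channel and checks that they coincide:

```python
from fractions import Fraction as F

omega = {0: F(1, 3), 1: F(2, 3)}
c = {0: {'a': F(1, 2), 'b': F(1, 2)}, 1: {'a': F(1, 4), 'b': F(3, 4)}}
q = {'a': F(3, 4), 'b': F(1, 4)}                 # a fuzzy factor on Y

def update(state, factor):
    v = sum(state[x] * factor[x] for x in state)
    return {x: state[x] * factor[x] / v for x in state}

def push(chan, state):
    out = {}
    for x, px in state.items():
        for y, py in chan[x].items():
            out[y] = out.get(y, 0) + px * py
    return out

pulled = {x: sum(c[x][y] * q[y] for y in q) for x in c}   # c =<< q
lhs = update(push(c, omega), q)                           # (c =>> omega)|_q
c_q = {x: update(c[x], q) for x in c}                     # conditioned channel c|_q
rhs = push(c_q, update(omega, pulled))                    # c|_q =>> (omega|_(c =<< q))
print(lhs == rhs)   # True
```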


This result has a number of useful consequences.

Corollary 6.3.12. For appropriately typed channels, states, and factors:

1 (d ◦· c)|q = d|q ◦· c|(d =≪ q);
2 (⟨c, d⟩ =≫ ω)|(p⊗q) = ⟨c|p, d|q⟩ =≫ ω|((c =≪ p) & (d =≪ q));
3 ((e ⊗ f) =≫ ω)|(p⊗q) = (e|p ⊗ f|q) =≫ ω|((e =≪ p) ⊗ (f =≪ q)).

Proof. 1 For each x,

    ((d ◦· c)|q)(x) = ((d ◦· c)(x))|q = (d =≫ c(x))|q
        = d|q =≫ (c(x)|(d =≪ q))                  by (6.2)
        = d|q =≫ (c|(d =≪ q))(x) = (d|q ◦· c|(d =≪ q))(x).

2 By (6.2) and Exercises 6.1.11 and 4.3.8:

    (⟨c, d⟩ =≫ ω)|(p⊗q) = ⟨c, d⟩|(p⊗q) =≫ ω|(⟨c, d⟩ =≪ (p⊗q))
        = ⟨c|p, d|q⟩ =≫ ω|((c =≪ p) & (d =≪ q)).

3 Using the same exercises we also get:

    ((e ⊗ f) =≫ ω)|(p⊗q) = (e ⊗ f)|(p⊗q) =≫ ω|((e ⊗ f) =≪ (p⊗q))         by (6.2)
        = (e|p ⊗ f|q) =≫ ω|((e =≪ p) ⊗ (f =≪ q)).

Earlier we have seen 'crossover update' for a joint state, whereby one updates in one component, and then marginalises in the other, see for instance Example 6.1.2 (2). We shall do this below for joint states gr(σ, c) = ⟨id, c⟩ =≫ σ that arise as a graph, see Definition 2.5.6, via a channel. The result below appeared in [21], see also [87, 88]. It says that crossover updates on joint graph states can also be done via forward and backward inferences. This is also a consequence of Theorem 6.3.11.


Corollary 6.3.13. Let c : X → Y be a channel with a state σ ∈ D(X) on its domain. Let ω ≔ ⟨id, c⟩ =≫ σ ∈ D(X × Y) be the associated joint state. Then:

1 for a factor p on X,

    c =≫ (σ|p) = (ω|(p⊗1))[0, 1] = π2 =≫ (ω|(π1 =≪ p)).

2 for a factor q on Y,

    σ|(c =≪ q) = (ω|(1⊗q))[1, 0] = π1 =≫ (ω|(π2 =≪ q)).

The formulation with the projections π1, π2 will be generalised to an inference query in Section 7.10.

Proof. Since:

    ((⟨id, c⟩ =≫ σ)|(p⊗1))[0, 1]
        = π2 =≫ (⟨id, c⟩|(p⊗1) =≫ σ|(⟨id, c⟩ =≪ (p⊗1)))         by (6.2)
        = (π2 ◦· ⟨id, c⟩) =≫ σ|((id =≪ p) & (c =≪ 1))            by Exercises 6.1.11 and 4.3.8
        = c =≫ (σ|p)

and:

    ((⟨id, c⟩ =≫ σ)|(1⊗q))[1, 0]
        = π1 =≫ (⟨id, c⟩|(1⊗q) =≫ σ|(⟨id, c⟩ =≪ (1⊗q)))          by (6.2)
        = (π1 ◦· ⟨id, c|q⟩) =≫ σ|((id =≪ 1) & (c =≪ q))
        = σ|(c =≪ q).

This result will help us to perform inference in a Bayesian network, see Section 7.10.

Exercises
6.3.1 We consider some disease with an a priori probability (or 'prevalence') of 1%. There is a test for the disease with the following characteristics.
      • ('sensitivity') If someone has the disease, then the test is positive with a probability of 90%.
      • ('specificity') If someone does not have the disease, there is a 95% chance that the test is negative.
      1 Take as disease space D = {d, d⊥}; describe the prior as a distribution on D;
      2 Take as test space T = {p, n} and describe the combined sensitivity and specificity as a channel c : D → T;
      3 Show that the predicted positive test probability is almost 6%.
      4 Assume that a test comes out positive. Use backward reasoning to prove that the probability of having the disease (the posterior, or 'Positive Predictive Value', PPV) is then a bit more than 15% (to be precise: 18/117). Explain why it is so low — remembering Example 6.3.3.
6.3.2 In the context of the previous exercise we can derive the familiar formulas for Positive Predictive Value (PPV) and Negative Predictive Value (NPV). Let's assume we have prevalence (prior) given by ω = p|d⟩ + (1 − p)|d⊥⟩ with parameter p ∈ [0, 1] and a channel c : {d, d⊥} → {p, n} with parameters:

          sensitivity:  c(d) = se|p⟩ + (1 − se)|n⟩
          specificity:  c(d⊥) = (1 − sp)|p⟩ + sp|n⟩.

      Check that:

          PPV ≔ (ω|(c =≪ 1_p))(d) = (p · se) / (p · se + (1 − p) · (1 − sp)).

      This is commonly expressed in medical textbooks as:

          PPV = (prevalence · sensitivity) / (prevalence · sensitivity + (1 − prevalence) · (1 − specificity)).

      Check similarly that:

          NPV ≔ (ω|(c =≪ 1_n))(d⊥) = ((1 − p) · sp) / (p · (1 − se) + (1 − p) · sp).

      As an aside, the (positive) likelihood ratio LR is the fraction:

          LR ≔ c(d)(p) / c(d⊥)(p) = se / (1 − sp).
6.3.3 Give a channel-based analysis and answer to the following question
from [144, Chap. I, Exc. 39].
Stores A, B, and C have 50, 75, and 100 employees, and respectively, 50, 60,
and 70 percent of these are women. Resignations are equally likely among
all employees, regardless of sex. One employee resigns and this is a woman.
What is the probability that she works in store C?
6.3.4 The multinomial and hypergeometric charts in Figures 6.3 and 6.4 are very similar. Still, if we look at the probability for the situation when there are 40 fish in the pond, there is a clear difference. Give a conceptual explanation for this difference.


6.3.5 The following situation about the relationship between eating hamburgers and having Kreuzfeld-Jacob disease is inspired by [8, §1.2]. We have two sets: E = {H, H⊥} about eating Hamburgers (or not), and D = {K, K⊥} about having Kreuzfeld-Jacob disease (or not). The following distributions on these sets are given: half of the people eat hamburgers, and only one in a hundred thousand have Kreuzfeld-Jacob disease, which we write as:

          ω = 1/2|H⟩ + 1/2|H⊥⟩   and   σ = 1/100,000|K⟩ + 99,999/100,000|K⊥⟩.

      1 Suppose that we know that 90% of the people who have Kreuzfeld-Jacob disease eat Hamburgers. Use this additional information to define a channel c : D → E with c =≫ σ = ω.
      2 Compute the probability of getting Kreuzfeld-Jacob for someone eating hamburgers (via backward inference).
6.3.6 Consider, in the context of Example 6.3.8, a negative Covid-19 test obtained after 2 days, via the distribution ϕ2 = 1/4|1⟩ + 1/2|2⟩ + 1/4|3⟩, assuming a uniform disease prior ω. Show that the posterior 'days' distribution is:

          ((ω ⊗ ϕ2)|(t =≪ 1_n))[0, 1] = 199/726|1⟩ + 179/363|2⟩ + 169/726|3⟩
              ≈ 0.274|1⟩ + 0.493|2⟩ + 0.233|3⟩.

      Explain why there is a shift 'forward', making the earlier days more likely in this posterior — with respect to the prior ϕ2.
6.3.7 Consider the following challenge, copied from [153].
(i) I have forgotten what day it is.
(ii) There are ten buses per hour in the week and three buses per hour at the
weekend.
(iii) I observe four buses in a given hour.
(iv) What is the probability that it is the weekend?
      Let W = {wd, we} be a set with elements wd for weekday and we for weekend, with prior distribution 5/7|wd⟩ + 2/7|we⟩. Use the Poisson distribution to define a channel bus : W → D∞(N) and use it to answer the above question via backward inference.
6.3.8 Let c : X → Y be a channel with state ω ∈ D(X) and factors p ∈ Fact(X), q ∈ Fact(Y). Check that:

          c|q =≫ (ω|(p & (c =≪ q))) = (c =≫ (ω|p))|q.

      Describe what this means for p = 1.

6.3.9 Let p ∈ Fact(X) and q ∈ Fact(Y) be two factors on spaces X, Y.
      1 Show that for two channels c : Z → X and d : Z → Y with a joint state σ ∈ D(Z) on their (common) domain one has:

          ((⟨c, d⟩ =≫ σ)|(p⊗q))[1, 0] = (c =≫ (σ|(d =≪ q)))|p
          ((⟨c, d⟩ =≫ σ)|(p⊗q))[0, 1] = (d =≫ (σ|(c =≪ p)))|q

      2 For channels e : U → X and f : V → Y with a joint state ω ∈ D(U × V) one has:

          (((e ⊗ f) =≫ ω)|(p⊗q))[1, 0] = (e =≫ ((ω|(1 ⊗ (f =≪ q)))[1, 0]))|p
          (((e ⊗ f) =≫ ω)|(p⊗q))[0, 1] = (f =≫ ((ω|((e =≪ p) ⊗ 1))[0, 1]))|q.

6.4 Discretisation, and coin bias learning


So far we have been using discrete probability distributions, with a finite domain. We shall be looking at continuous distributions later on, in Chapter ??. In the meantime we can approximate continuous distributions by discretisation, namely by chopping them up into finitely many parts, like in Riemann integration. In fact, computers doing computation in continuous probability perform such fine-grained discretisation. This section first introduces such discretisation and then uses it in an extensive example on learning the bias of a coin via successive backward inference. It is a classical illustration.

We start with discretisation.

Definition 6.4.1. Let a, b ∈ R with a < b and N ∈ N with N > 0 be given.

1 We write [a, b]_N ⊆ [a, b] ⊆ R for the interval [a, b] reduced to N elements:

    [a, b]_N ≔ {a + (i + 1/2)·s | i ∈ N}    where    s ≔ (b − a)/N
             = {a + (1/2)·s, a + (3/2)·s, . . . , a + ((2N−1)/2)·s}
             = {a + (1/2)·s, a + (3/2)·s, . . . , b − (1/2)·s}.

2 Let f : S → R≥0 be a function, defined on a finite subset S ⊆ R. We write Disc(f, S) ∈ D(S) for the discrete distribution defined as:

    Disc(f, S) ≔ Σ_{x∈S} (f(x)/t) |x⟩    where    t ≔ Σ_{x∈S} f(x).

Often we combine the notations from these two items and use discretised states of the form Disc(f, [a, b]_N).
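In code, both parts of the definition are one-liners. The sketch below mirrors them, with our own hypothetical names `interval_points` and `disc`:

```python
def interval_points(a, b, n):
    """The set [a,b]_N of n midpoints, with step s = (b - a)/n."""
    s = (b - a) / n
    return [a + (i + 0.5) * s for i in range(n)]

def disc(f, points):
    """Disc(f, S): normalise the values of f on a finite set of points."""
    total = sum(f(x) for x in points)
    return {x: f(x) / total for x in points}
```

For instance, `interval_points(1, 2, 3)` yields the three points 7/6, 3/2, 11/6 of the example below, and `disc` applied to the constant-one function gives the uniform distribution on them.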


To see an example of item (1), consider the interval [1, 2] with N = 3. The step size s is then s = (2−1)/3 = 1/3, so that:

    [1, 2]_3 = {1 + (1/2)·(1/3), 1 + (3/2)·(1/3), 1 + (5/2)·(1/3)} = {1 + 1/6, 1 + 1/2, 1 + 5/6}.

We choose to use internal points only and exclude the end-points in this finite subset, since the end-points sometimes give rise to boundary problems, with functions being undefined. When N goes to infinity, the smallest and largest elements in [a, b]_N will approximate the end-points a and b — from above and from below, respectively.

The 'total' number t in item (2) normalises the formal sum and ensures that the multiplicities add up to one. In this way we can define a uniform distribution on [a, b]_N as unif_{[a,b]_N}, like before, or alternatively as Disc(1, [a, b]_N), where 1 is the constant-one function.
Example 6.4.2. We look at the following classical question: suppose we are
given a coin with an unknown bias, we flip it eight times, and observe the
following list of heads (H) and tails (T ):
[T, H, H, H, T, T, H, H].
What can we then say about the bias of the coin?
The frequentist approach that we have seen in Section 2.3 would turn the above list into a multiset and then into a distribution, by frequentist learning, see also Diagram (2.16). This gives:

    [T, H, H, H, T, T, H, H] ↦ 5|H⟩ + 3|T⟩ ↦ 5/8|H⟩ + 3/8|T⟩.
Here we do not use this frequentist approach to learning the bias parameter,
but take a Bayesian route. We assume that the bias parameter itself is given
by a distribution, describing the likelihoods of various bias values. We assume
no prior knowledge and therefore start from the uniform distribution. It will
be updated based on successive observations, using the technique of backward
inference, see Definition 6.3.1 (2).
The bias b of a coin is a number in the unit interval [0, 1], giving rise to a coin distribution flip(b) = b|H⟩ + (1 − b)|T⟩. Thus we can see flip as a channel flip : [0, 1] → {H, T}. At this stage we avoid continuous distributions and discretise the unit interval. We choose N = 100 in the chop-up, giving as underlying space [0, 1]_N with N points, on which we take the uniform distribution unif as prior:

    unif ≔ Disc(1, [0, 1]_N) = Σ_{x∈[0,1]_N} 1/N |x⟩ = Σ_{0≤i<N} 1/N |(2i+1)/2N⟩
         = 1/N |1/2N⟩ + 1/N |3/2N⟩ + · · · + 1/N |(2N−1)/2N⟩.

347
348 Chapter 6. Updating distributions

We use the flip operation as a channel, restricted to the discretised space:

flip : [0, 1]N → {H, T }    given by    flip(b) = b| H i + (1 − b)| T i.

There are the two (sharp, point) predicates 1H and 1T on the codomain
{H, T }. It is not hard to show, see Exercise 6.4.2 below, that:

( flip = unif |= 1H ) = ( flip = unif |= 1T ) = 1/2.

Predicate transformation along flip yields two predicates on [0, 1]N given by:

( flip = 1H )(r) = r    and    ( flip = 1T )(r) = 1 − r.

Given the above sequence of head/tail observations [T, H, H, H, T, T, H, H],
we perform successive predicate transformations flip = 1(−) and update the
prior (uniform) state accordingly. This gives, via Lemma 6.1.6 (3),

unif|flip = 1T
unif|flip = 1T |flip = 1H = unif|(flip = 1T )&(flip = 1H ) = unif|(flip = 1H )&(flip = 1T )
unif|flip = 1T |flip = 1H |flip = 1H = unif|(flip = 1T )&(flip = 1H )&(flip = 1H )
    = unif|(flip = 1H )^2 &(flip = 1T )
⋮
unif|(flip = 1H )^5 &(flip = 1T )^3
An overview of the resulting distributions is given in Figure 6.5. These distri-
butions approximate (continuous) beta distributions, see Section ?? later on.
These beta functions form a smoothed out version of the bar charts in Fig-
ure 6.5. We shall look more systematically into ‘learning along a channel’ in
Section ??, where the current coin-bias computation re-appears in Example ??.
After these eight updates, let’s write ρ = unif|(flip = 1H )^5 &(flip = 1T )^3 for the
resulting distribution. We now ask three questions.

1 Where does ρ reach its highest value, and what is it? The answers are given
by:
argmax(ρ) = { 25/40 }    with    ρ( 25/40 ) ≈ 0.025347.

2 What is the predicted coin distribution? The outcome, with truncated multi-
plicities, is:
flip = ρ = 0.6| 1 i + 0.4|0 i.

3 What is the expected value of ρ? It is:

mean(ρ) = 0.6 = ρ |= flip = 1H .

6.4. Discretisation, and coin bias learning 349

Figure 6.5 Coin bias distributions arising from the prior uniform distribution by
successive updates after coin observations [0, 1, 1, 1, 0, 0, 1, 1].

For mathematical reasons¹ the exact outcome is 0.6. However, we have used
approximation via discretisation. The value computed with this discretisation
is 0.599999985316273. We can conclude that chopping the unit interval up
with N = 100 already gives a fairly good approximation.
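These numbers are easy to reproduce. The following sketch (plain Python with exact fractions; the dictionary encoding and the helper name `update` are ours, not the book’s) chops up the unit interval, performs the eight updates, and recovers the mode, its probability and the mean:

```python
from fractions import Fraction

N = 100
grid = [Fraction(2 * i + 1, 2 * N) for i in range(N)]    # the space [0, 1]_N
rho = {r: Fraction(1, N) for r in grid}                  # uniform prior unif

def update(state, lik):
    """Backward inference: multiply by a likelihood and renormalise."""
    w = {r: p * lik(r) for r, p in state.items()}
    total = sum(w.values())
    return {r: v / total for r, v in w.items()}

for obs in "THHHTTHH":                                   # the eight observations
    rho = update(rho, (lambda r: r) if obs == "H" else (lambda r: 1 - r))

mode = max(rho, key=rho.get)                 # argmax of the posterior
mean = sum(r * p for r, p in rho.items())    # expected value / predicted bias
print(float(mode), float(rho[mode]))         # 0.625 and roughly 0.0253
print(float(mean))                           # roughly 0.6
```

With N = 100 the mean comes out as 0.599999985…, matching the approximation quoted above.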

We conclude with a ‘conjugate prior’ property in this situation. It is a dis-
crete instance of a phenomenon that will be described more systematically later
on, see Section ??, in continuous probability.
We define, parameterised by N ∈ N, the channel βN : N × N → [0, 1]N as
normalisation:

βN (a, b) B Flrn( Σ_{r∈[0,1]N} r^a · (1 − r)^b | r i )                    (6.3)
    = Σ_{r∈[0,1]N}  [ r^a · (1 − r)^b / Σ_{s∈[0,1]N} s^a · (1 − s)^b ] | r i.

Then we have the following results.

Proposition 6.4.3. Let N ∈ N be the discretisation parameter, used in the
chopped subspace [0, 1]N ⊆ [0, 1].

1 The distribution ρ is an approximation of the probability density function β(5, 3), which has
mean (5+1)/((5+1)+(3+1)) = 0.6.


1 For a, b ∈ N the above distribution βN (a, b) satisfies:

βN (a, b) = unif|(flip = 1H )^a &(flip = 1T )^b .

2 For additional numbers n, m ∈ N one has:

βN (a, b)|(flip = 1H )^n &(flip = 1T )^m = βN (a + n, b + m).

This last equation says that the βN distributions are closed under updating
via flip. This is the essence of the fact that the βN channel is ‘conjugate prior’
to the flip channel, see Section ?? for more details. For now we note that this
relationship is convenient because it means that we don’t have to perform all
the state updates explicitly; instead we can just adapt the inputs a, b of the
channel βN . These inputs are often called hyperparameters.
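The hyperparameter shift is easy to check numerically. Below is a small sketch (our own Python rendering of (6.3), with exact fractions; the helper names are ours) confirming Proposition 6.4.3 (2) on an instance:

```python
from fractions import Fraction

N = 100
grid = [Fraction(2 * i + 1, 2 * N) for i in range(N)]    # the space [0, 1]_N

def beta_N(a, b):
    """The discretised beta distribution (6.3): normalise r^a * (1-r)^b."""
    w = {r: r**a * (1 - r)**b for r in grid}
    total = sum(w.values())
    return {r: v / total for r, v in w.items()}

def observe(state, heads, tails):
    """Update with the predicate (flip = 1H)^heads & (flip = 1T)^tails."""
    w = {r: p * r**heads * (1 - r)**tails for r, p in state.items()}
    total = sum(w.values())
    return {r: v / total for r, v in w.items()}

# Conjugate-prior property: updating only shifts the hyperparameters.
assert observe(beta_N(2, 1), 3, 2) == beta_N(2 + 3, 1 + 2)
```

Note that `beta_N(0, 0)` is the uniform distribution on the grid, as in Exercise 6.4.4 below.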

Proof. 1 Write D = [0, 1]N . For r ∈ D we have:

( unif|(flip = 1H )^a &(flip = 1T )^b )(r)
    = unif(r) · (flip = 1H )^a (r) · (flip = 1T )^b (r) / ( unif |= (flip = 1H )^a & (flip = 1T )^b )
    = ( 1/N · r^a · (1 − r)^b ) / ( Σ_{s∈D} 1/N · s^a · (1 − s)^b )
    = βN (a, b)(r).

2 We use this result and Lemma 6.1.6 (3) in:

βN (a, b)|(flip = 1H )^n &(flip = 1T )^m
    = unif|(flip = 1H )^a &(flip = 1T )^b |(flip = 1H )^n &(flip = 1T )^m
    = unif|(flip = 1H )^a &(flip = 1T )^b &(flip = 1H )^n &(flip = 1T )^m
    = unif|(flip = 1H )^{a+n} &(flip = 1T )^{b+m}
    = βN (a + n, b + m).

6.4.1 Discretisation of states


In the beginning of this section we have described how to chop up intervals
[a, b] of real numbers into a discrete sample space. We have used this in par-
ticular for the unit interval [0, 1] which we used as space for a coin bias. Since
there is an isomorphism [0, 1] ≅ D(2), this discretisation of [0, 1] might as
well be seen as a discretisation of the set of states on 2 = {0, 1}. Can we do
such discretisation more generally, for sets of states D(X)? There is an easy
way to do so via normalisation of natural multisets.
Recall that we write N[K](X) for the set of natural multisets — with natural


numbers as multiplicities — of size K. We recall from Theorem 4.5.7 (2) that


we write:

D[K](X) = {Flrn(ϕ) | ϕ ∈ N[K](X)} = { 1/K · ϕ | ϕ ∈ N[K](X)}.

Recall also — from Proposition 1.7.4 — that if the set X has n elements, then
N[K](X) contains (n+K−1 choose K) multisets, so that D[K](X) contains
(n+K−1 choose K) distributions. For instance, when X = {a, b}, then N[5](X)
contains the multisets:

5| a i, 4| a i + 1| b i, 3| a i + 2| b i, 2| a i + 3| b i, 1| a i + 4| b i, 5| b i.

Hence, D[5]({a, b}) contains the distributions:

1| a i, 4/5| a i + 1/5| b i, 3/5| a i + 2/5| b i, 2/5| a i + 3/5| b i, 1/5| a i + 4/5| b i, 1| b i.

By taking large K in D[K](X) we can obtain fairly good approximations, since


the union of these subsets D[K](X) ⊆ D(X) is dense, by Theorem 4.5.7 (2).
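Concretely, D[K](X) can be enumerated by listing all natural multisets of size K and normalising them, as in the following sketch (Python; the function name is ours):

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb

def D(K, X):
    """D[K](X): frequentist-learn (normalise) every multiset of size K over X."""
    states = []
    for draw in combinations_with_replacement(X, K):
        states.append({x: Fraction(draw.count(x), K) for x in X})
    return states

D5 = D(5, ['a', 'b'])
print(len(D5))    # 6 = binomial(2 + 5 - 1, 5)
assert len(D5) == comb(2 + 5 - 1, 5)
print(D5[1])      # the distribution 4/5|a> + 1/5|b>
```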

Example 6.4.4. Consider an election with three candidates a, b, c, which we


collect in a set X of candidates X = {a, b, c}. A candidate wins if (s)he gets
more than 50% of the votes. A poll has been conducted among 100 possible
voters, giving the following preferences.

candidates a b c

poll numbers 52 28 20

What is the probability that candidate a then wins in the election? Notice that
these are poll numbers, before the election, not actual votes. If they were actual
votes, candidate a gets more than 50%, and thus wins. However, these are poll
numbers, from which we like to learn about voter preferences.
A state in D(X) is used as distribution of preferences over the three candi-
dates in X = {a, b, c}. In this example we discretise the set of states and work
with the finite subset D[K](X) ,→ D(X), for a number K. There is no ini-
tial knowledge about voter preferences, so our prior is the uniform distribution
over these preference distributions:
 
υK B unifD[K](X) ∈ D( D[K](X) ).

We use the following sharp predicates on D[K](X), called: aw for a wins,
bw for b wins, cw for c wins, and nw for no-one wins. They are defined on
σ ∈ D[K](X) as:

aw(σ) = 1 ⇐⇒ σ(a) > 1/2
bw(σ) = 1 ⇐⇒ σ(b) > 1/2
cw(σ) = 1 ⇐⇒ σ(c) > 1/2
nw(σ) = 1 ⇐⇒ σ(a) ≤ 1/2 and σ(b) ≤ 1/2 and σ(c) ≤ 1/2.
The prior validities:

υK |= aw     υK |= bw     υK |= cw     υK |= nw

all approximate 1/4, as K goes to infinity.
We use the inclusion D[K](X) ,→ D(X) as a channel i : D[K](X) → X. Like
in the above coin scenario, we perform successive backward inferences, using
the above poll numbers, to obtain a posterior ρK obtained via updates:
ρK B υK | (i = 1a )^52 | (i = 1b )^28 | (i = 1c )^20 = υK | (i = 1a )^52 & (i = 1b )^28 & (i = 1c )^20 .
We illustrate that the posterior probability ρK |= aw takes the following values,
for several numbers K.
K = 100    K = 500    K = 1000

probability that a wins 0.578 0.609 0.613


Clearly, these probabilities are approximations. When modeled via continuous
probability theory, see Chapter ??, the probability that a wins can be calculated
more accurately as 0.617. This illustrates that the above discretisations of states
work reasonably well.
To give a bit more perspective, the posterior probability that b or c wins
with the above poll numbers is close to zero. The probability that no-one wins
is substantial, namely almost 0.39.
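The K = 100 entry of the table can be reproduced directly. The sketch below (our own Python, exact arithmetic) enumerates the grid behind D[100]({a, b, c}), updates the uniform prior with the poll numbers, and evaluates the predicate aw:

```python
from fractions import Fraction
from math import comb

K = 100
# The grid behind D[K]({a,b,c}): triples (i, j, k) with i + j + k = K.
grid = [(i, j, K - i - j) for i in range(K + 1) for j in range(K + 1 - i)]
assert len(grid) == comb(3 + K - 1, K)

# Posterior weights: uniform prior times the likelihood of observing
# a 52 times, b 28 times and c 20 times (normalisation constants cancel).
weights = {s: s[0]**52 * s[1]**28 * s[2]**20 for s in grid}
total = sum(weights.values())

# Validity of the sharp predicate aw: sigma(a) = i/K strictly above 1/2.
p_a_wins = Fraction(sum(w for s, w in weights.items() if 2 * s[0] > K), total)
print(float(p_a_wins))    # roughly 0.578
```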

Exercises

6.4.1 Recall the N-element set [a, b]N from Definition 6.4.1 (1).

1 Show that its largest element is b − s/2, where s = (b−a)/N.
2 Prove that Σ_{x∈[a,b]N} x = N(a+b)/2. Remember: Σ_{0≤i<N} i = N(N−1)/2.
6.4.2 Consider coin parameter learning in Example 6.4.2.

1 Show that the prediction in the prior (uniform) state unif on [0, 1]N
gives a fair coin, i.e.

flip = unif = 1/2| 1 i + 1/2| 0 i.

This equality is independent of N.
2 Prove that, also independently of N,

unif |= flip = 1H = 1/2.

3 Show next that:

unif|flip = 1H = Σ_{x∈[0,1]N} 2x/N | x i.

4 Use the ‘square pyramidal’ formula Σ_{0≤i<N} i^2 = N(N−1)(2N−1)/6 to prove
that:

( flip = (unif|flip = 1H ) )(1) = 2/3 − 1/(6N^2).

Conclude that flip = (unif|flip = 1H ) approaches 2/3| 1 i + 1/3| 0 i as N
goes to infinity.
6.4.3 Check that the probabilities ( flip = ρ )(1) and mean(ρ) are the same,
in Example 6.4.2.
6.4.4 Check that βN (0, 0) in (6.3) is the uniform distribution unif on [0, 1]N .
6.4.5 Consider the discretised beta channel βN : N × N → D([0, 1]N ) in the
following diagram:

                              βN ×id
(N × N) × N({H, T })  −−−−−−−−−−→  D([0, 1]N ) × N({H, T })
      |                                     |
  add |                                     | infer
      ↓                βN                   ↓
    N × N       −−−−−−−−−−→           D([0, 1]N )

where the maps add and infer are given by:

add( a, b, n| H i + m| T i ) = (a + n, b + m)
infer( ω, n| H i + m| T i ) = ω|(flip = 1H )^n &(flip = 1T )^m .

Show that:

1 The map add is an action of the monoid N({H, T }) on N × N.
2 Show that the above rectangle commutes — and thus that βN is a
homomorphism of monoid actions. (This description of a conjugate
prior relationship as monoid action comes from [81].)
6.4.6 We accept, without proof, the following equation. For a, b ∈ N,

lim_{N→∞} 1/N · Σ_{r∈[0,1]N} r^a · (1 − r)^b = a! · b! / (a + b + 1)! .        (∗)

Use (∗) to prove that the binary Pólya distribution can be obtained
from the binomial, via the limit of the discretised β distribution (6.3):

lim_{N→∞} ( bn[K] = βN (a, b) )(i)
    = pl[K]( (a+1)| H i + (b+1)| T i )( i| H i + (K −i)| T i ).


Later, in ?? we shall see a proper continuous version of this result.


6.4.7 Redo the coin bias learning from the beginning of this section in the
style of Example 6.4.4, via discretisation of states in D(2).

6.5 Inference in Bayesian networks


In previous sections we have seen several examples of channel-based infer-
ence, in forward and backward form. This section shows how to apply these
inference methods to Bayesian networks, via an example that is often used in
the literature: the ‘Asia’ Bayesian network, originally from [110]. It captures
the situation of patients with a certain probability of smoking and of an earlier
visit to Asia; this influences certain lung diseases and the outcome of an xray
test. Later on, in Section 7.10, we shall look more systematically at inference
in Bayesian networks.
The Bayesian network example considered here is described in several steps:
Figure 6.6 gives the underlying graph structure, in the style of Figure 2.4, in-
troduced at the end of Section 2.6, with diagonals and types of wires writ-
ten explicitly. Figure 6.7 gives the conditional probability tables associated
with the nodes of this network. The way in which these tables are written
is different from Section 2.6: we now only write the probabilities r ∈ [0, 1]
and omit the values 1 − r. Next, these conditional probability tables are de-
scribed in Figure 6.8 as states smoke ∈ D(S ) and asia ∈ D(A) and as channels
lung : S → L, tub : A → T , bronc : S → B, xray : E → X, dysp : B × E → D,
either : L × T → E.
Our aim in this section is to illustrate channel-based inference in this Bayesian
network. It is not so much the actual outcomes that we are interested in, but
rather the systematic methodology that is used to obtain these outcomes.

Probability of lung cancer, given no bronchitis


Let’s start with the question: what is the probability that someone has lung
cancer, given that this person does not have bronchitis. The latter information is
the evidence. It takes the form of a singleton predicate 1b⊥ = (1b )⊥ : B → [0, 1]
on the set B = {b, b⊥ } used for presence and absence of bronchitis.
In order to obtain this updated probability of lung cancer we ‘follow the
graph’. In Figure 6.6 we see that we can transform (pull back) the evidence
along the bronchitis channel bronc : S → B, and obtain a predicate bronc =
1b⊥ on S . The latter can be used to update the smoke distribution on S . Subse-
quently, we can push the updated distribution forward along the lung channel


[Diagram: the directed graph has edges smoke → bronc, smoke → lung,
asia → tub, lung → either, tub → either, either → xray, and
bronc → dysp, either → dysp.]
Figure 6.6 The graph of the Asia Bayesian network, with node abbreviations:
bronc = bronchitis, dysp = dyspnea, lung = lung cancer, tub = tuberculosis. The
wires all have 2-element sets of the form A = {a, a⊥ } as types.

lung : S → L via state transformation. Thus we follow the ‘V’ shape in the
relevant part of the graph.
Combining these down-update-up steps gives the required outcome:

lung = ( smoke|bronc = 1b⊥ ) = 0.0427| ℓ i + 0.9573| ℓ⊥ i.

We see that this calculation combines forward and backward inference, see
Definition 6.3.1.
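This down-update-up computation takes only a few lines. The sketch below (plain Python; the dictionary encoding of the channels from Figure 6.8 is our own) reproduces the 0.0427:

```python
smoke = {'s': 0.5, 's_': 0.5}
bronc = {'s': {'b': 0.6, 'b_': 0.4}, 's_': {'b': 0.3, 'b_': 0.7}}
lung  = {'s': {'l': 0.1, 'l_': 0.9}, 's_': {'l': 0.01, 'l_': 0.99}}

# Backward inference: update smoke with the transformed predicate bronc = 1_{b⊥}.
w = {x: smoke[x] * bronc[x]['b_'] for x in smoke}
total = sum(w.values())
updated = {x: v / total for x, v in w.items()}

# Forward inference: push the updated state through the lung channel.
prediction = {y: sum(updated[x] * lung[x][y] for x in updated) for y in ('l', 'l_')}
print(round(prediction['l'], 4))    # 0.0427
```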

Probability of smoking, given a positive xray


In Figures 6.7 and 6.8 we see a prior smoking probability of 50%. We like to
know what this probability becomes if we have evidence of a positive xray.
The latter is given by the point predicate 1 x ∈ Pred (X) for X = {x, x⊥ }.
Now there is a long path from xray to smoking, see Figure 6.6 that we need
to use for (backward) predicate transformation. Along the way there is a com-
plication, namely that the node ‘either’ has two parent nodes, so that pulling
back along the either channel yields a predicate on the product set L × T .
The only sensible thing to do is continue predicate transformation down down-
wards, but now with the parallel product channel lung ⊗ tub : S × A → L × T .
The resulting predicate on S × A can be used to update the product state
smoke⊗asia. Then we can take the first marginal to obtain the desired outcome.


P(smoke) P(asia)
0.5 0.01
smoke P(lung) asia P(tub)
s 0.1 a 0.05
s⊥ 0.01 a⊥ 0.01

smoke P(bronc) either P(xray)

s 0.6 e 0.98

s ⊥
0.3 e⊥ 0.05

bronc either P(dysp)        lung tub P(either)
b   e   0.9                 ℓ   t   1
b   e⊥  0.7                 ℓ   t⊥  1
b⊥  e   0.8                 ℓ⊥  t   1
b⊥  e⊥  0.1                 ℓ⊥  t⊥  0

Figure 6.7 The conditional probability tables of the Asia Bayesian network

smoke = 0.5| s i + 0.5| s⊥ i                 asia = 0.01| a i + 0.99| a⊥ i

lung(s) = 0.1| ℓ i + 0.9| ℓ⊥ i               tub(a) = 0.05| t i + 0.95| t⊥ i
lung(s⊥ ) = 0.01| ℓ i + 0.99| ℓ⊥ i           tub(a⊥ ) = 0.01| t i + 0.99| t⊥ i
bronc(s) = 0.6| b i + 0.4| b⊥ i              xray(e) = 0.98| x i + 0.02| x⊥ i
bronc(s⊥ ) = 0.3| b i + 0.7| b⊥ i            xray(e⊥ ) = 0.05| x i + 0.95| x⊥ i
dysp(b, e) = 0.9| d i + 0.1| d⊥ i            either(ℓ, t) = 1| e i
dysp(b, e⊥ ) = 0.7| d i + 0.3| d⊥ i          either(ℓ, t⊥ ) = 1| e i
dysp(b⊥ , e) = 0.8| d i + 0.2| d⊥ i          either(ℓ⊥ , t) = 1| e i
dysp(b⊥ , e⊥ ) = 0.1| d i + 0.9| d⊥ i        either(ℓ⊥ , t⊥ ) = 1| e⊥ i

Figure 6.8 The conditional probability tables from Figure 6.7 described as states
and channels.

Thus we compute:

( (smoke ⊗ asia)|(lung⊗tub) = (either = (xray = 1x )) )[ 1, 0 ]
    = ( (smoke ⊗ asia)|(xray ◦· either ◦· (lung⊗tub)) = 1x )[ 1, 0 ]
    = 0.6878| s i + 0.3122| s⊥ i.

Thus, a positive xray makes it more likely — w.r.t. the uniform prior — that the
patient smokes — as is to be expected. This is obtained by backward inference.
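The same answer can be obtained concretely by pulling the point predicate 1x back along the composite channel. A sketch (our own encoding, with each channel reduced to its probability of the positive outcome):

```python
smoke = {'s': 0.5, 's_': 0.5}
asia  = {'a': 0.01, 'a_': 0.99}
lung  = {'s': 0.1, 's_': 0.01}     # probability of l, per smoking status
tub   = {'a': 0.05, 'a_': 0.01}    # probability of t, per asia visit
xray  = {'e': 0.98, 'e_': 0.05}    # probability of x, per either-outcome

def lik_x(s, a):
    """The predicate (xray . either . (lung x tub)) = 1x at the point (s, a)."""
    p_e = 1 - (1 - lung[s]) * (1 - tub[a])    # either: l or t
    return p_e * xray['e'] + (1 - p_e) * xray['e_']

w = {(s, a): smoke[s] * asia[a] * lik_x(s, a) for s in smoke for a in asia}
p_smokes = sum(v for (s, a), v in w.items() if s == 's') / sum(w.values())
print(round(p_smokes, 4))    # 0.6878
```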


Probability of lung cancer, given both dyspnoea and tuberculosis


Our next inference challenge involves two evidence predicates, namely 1d on D
for dyspnoea and 1t on T for tuberculosis. We would like to know the updated
lung cancer probability.
The situation looks complicated, because of the ‘closed loop’ in Figure 6.6.
But we can proceed in a straightforward manner and combine evidence via
conjunction & at a suitable meeting point. We now more clearly separate the
forward and backward stages of the inference process. We first move the prior
states forward to a point that includes the set L — the one that we need to
marginalise on to get our conclusion. We abbreviate this state on B × L × T as:

σ B (bronc ⊗ lung ⊗ id ) = ( (∆ ⊗ tub) = (smoke ⊗ asia) ).




Recall that we write ∆ for the copy channel, in this expression of type S →
S × S.
Going in the backward direction we can form a predicate, called p below,
on the set B × L × T , by predicate transformation and conjunction:

p B ( 1 ⊗ 1 ⊗ 1t ) & ( (id ⊗ either) = (dysp = 1d ) ).


 

The result that we are after is now obtained via updating and marginalisation:

( σ| p )[ 0, 1, 0 ] = 0.0558| ℓ i + 0.9442| ℓ⊥ i.        (6.4)

There is an alternative way to describe the same outcome, using that certain
channels can be ‘shifted’. In particular, in the definition of the above state
σ, the channel bronc is used for state transformation. It can also be used in
a different role, namely for predicate transformation. We then use a slightly
different state, now on S × L × T ,

τ B (id ⊗ lung ⊗ id ) = ( (∆ ⊗ tub) = (smoke ⊗ asia) ).




The bronc channel is now used for predicate transformation in the predicate:

q B ( 1 ⊗ 1 ⊗ 1t ) & ( (bronc ⊗ either) = (dysp = 1d ) ).


 

The same updated lung cancer distribution is now obtained as:

( τ|q )[ 0, 1, 0 ] = 0.0558| ℓ i + 0.9442| ℓ⊥ i.        (6.5)

The reason why the outcomes (6.4) and (6.5) are the same is the topic of Exer-
cise 6.5.2.
We conclude that inference in Bayesian networks can be done composi-
tionally via a combination of forward and backward inference, basically by
following the network structure.
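The third inference can be replayed in the same style. One simplification is safe here: given the evidence 1t, the deterministic either-channel always outputs e, so the asia/tub part contributes only a constant factor that cancels on normalisation. A sketch (our own encoding):

```python
smoke = {'s': 0.5, 's_': 0.5}
lung  = {'s': 0.1, 's_': 0.01}         # probability of l, per smoking status
bronc = {'s': 0.6, 's_': 0.3}          # probability of b, per smoking status
dysp_given_e = {'b': 0.9, 'b_': 0.8}   # dysp probabilities when either = e

w = {}
for s in smoke:
    # Likelihood of the dyspnoea evidence, marginalising over bronchitis.
    lik_d = bronc[s] * dysp_given_e['b'] + (1 - bronc[s]) * dysp_given_e['b_']
    for l, p_l in (('l', lung[s]), ('l_', 1 - lung[s])):
        w[(s, l)] = smoke[s] * p_l * lik_d
p_lung = sum(v for (s, l), v in w.items() if l == 'l') / sum(w.values())
print(round(p_lung, 4))    # 0.0558
```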


Exercises
6.5.1 Consider the wetness Bayesian network from Section 2.6. Write down
the channel-based inference formulas for the following inference ques-
tions and check the outcomes that are given below.
1 The updated sprinkler distribution, given evidence of a slippery
road, is 63/260| b i + 197/260| b⊥ i.
2 The updated wet grass distribution, given evidence of a slippery
road, is 4349/5200| d i + 851/5200| d⊥ i.

(Answers will appear later, in Example 7.3.3, in two different ways.)


6.5.2 Check that the equality of the outcomes in (6.4) and in (6.5) can be
explained via Exercises 6.1.7 (2) and 4.3.9.

6.6 Bayesian inversion: the dagger of a channel


So far in this chapter we have seen the power of probabilistic inference, es-
pecially in backward form, as in the many examples in Section 6.3.
Applying backward inference to point predicates, as evidence, leads to chan-
nel reversal. This is a fundamental construction, known as Bayesian inversion,
which we shall describe with a ‘dagger’ c† : Y → X, for a channel c : X → Y.
The superscript-dagger notation (−)† is used since Bayesian inversion is sim-
ilar to the adjoint-transpose of a linear operator A between Hilbert spaces,
see [24] (and Exercise 6.6.3 below). This transpose is typically written as A† ,
or also as A∗ . The dagger notation is more common in quantum theory, and has
been formalised in terms of dagger categories [25]. This categorical approach
is sketched in Section 7.8 below.
There is much to say about Bayesian inversion and dagger channels. In the
remainder of this chapter we introduce the definition, together with some basic
results, and apply inversion in probabilistic reasoning, using a new update rule
due to Jeffrey. In Chapter 7 we will re-introduce Bayesian inversion as a special
case of disintegration, in the context of a graphical calculus. There we will
illustrate the use of the dagger channel in basic techniques in machine learning,
see Section 7.7.

Definition 6.6.1. Let c : X → Y be a channel with a state ω ∈ D(X) on its


domain, such that the transformed / predicted state c = ω ∈ D(Y) has full
support. The latter means that supp(c = ω) = Y, so that (c = ω)(y) > 0, for
each y ∈ Y. Implicitly, the codomain Y is then a finite set — since supports are
finite.


The dagger channel c†ω : Y → X is defined on y via backward inference
using the point predicate 1y , as in:

c†ω (y) B ω|c = 1y = Σ_{x∈X}  [ ω(x) · c(x)(y) / (c = ω)(y) ] | x i
                   = Σ_{x∈X}  [ ω(x) · c(x)(y) / Σ_z ω(z) · c(z)(y) ] | x i.        (6.6)

Thus, the dagger is obtained by updating the prior ω via the likelihood func-
tion Y → Pred (X), given by y 7→ c = 1y , see Definition 4.3.1. We have already
implicitly used the dagger of a channel in several situations.
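Formula (6.6) translates directly into code. The sketch below (our own Python helper, not from the book) computes the dagger of a finite channel and instantiates it with the 90%-sensitive, 95%-specific test and 1% prevalence used in item (2) of the next example:

```python
from fractions import Fraction

def dagger(c, omega):
    """Bayesian inversion (6.6) of a channel c : X -> Y along a prior omega."""
    ys = next(iter(c.values())).keys()
    pred = {y: sum(omega[x] * c[x][y] for x in omega) for y in ys}   # c = omega
    return {y: {x: omega[x] * c[x][y] / pred[y] for x in omega} for y in ys}

c = {'d':  {'p': Fraction(9, 10), 'n': Fraction(1, 10)},
     'd_': {'p': Fraction(1, 20), 'n': Fraction(19, 20)}}
omega = {'d': Fraction(1, 100), 'd_': Fraction(99, 100)}

inv = dagger(c, omega)
print(inv['p']['d'])     # PPV = 18/117 (= 2/13)
print(inv['n']['d_'])    # NPV = 1881/1883
```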
Example 6.6.2. 1 In Example 6.3.2 we used a channel c : {H, T } → {W, B}
from sides of a coin to colors of balls in an urn, together with a fair coin
unif ∈ D({H, T }). We had evidence of a white ball and wanted to know the
updated coin distribution. The outcome that we computed can be described
via a dagger channel, since:
c†unif (W) = unif|c = 1W = 22/67| H i + 45/67| T i.
The same redescription in terms of Bayesian inversion can be used for Ex-
amples 6.3.3 and 6.3.9, 6.3.10. In Example 6.3.5 we can use Bayesian in-
version for the point evidence case, but not for the illustration with soft
evidence, i.e. with fuzzy predicates. This matter will be investigated further
in Section 6.7.
2 In Exercises 6.3.1 and 6.3.2 we have seen a channel c : {d, d⊥ } → {p, n} that
combines the sensitivity and specificity of a medical test, in a situation with
a prior / prevalence of 1% for the disease. The associated Positive Prediction
Value (PPV) and Negative Prediction Value (NPV) can be expressed via a
dagger as:

PPV = c†ω (p)(d) = ω|c = 1 p (d) = 18/117
NPV = c†ω (n)(d⊥ ) = ω|c = 1n (d⊥ ) = 1881/1883,

where ω = 1/100| d i + 99/100| d⊥ i is the prior distribution.
In Definition 6.3.1 we have carefully distinguished forward inference c =
(ω| p ) from backward inference ω|c = q . Since Bayesian inversion turns channels
around, the question arises whether it also turns inference around: can one ex-
press forward (resp. backward) inference along c in terms of backward (resp.
forward) inference along the Bayesian inversion of c? The next result from [77]
shows that this is indeed the case. It demonstrates that the directions of infer-
ence and of channels are intimately connected and it gives the channel-based
framework internal consistency.


Theorem 6.6.3. Let c : X → Y be a channel with a state σ ∈ D(X) on its


domain, such that the transformed state τ B c = σ has full support.

1 Given a factor q on Y, we can express backward inference as forward infer-
ence with the dagger:

σ|c = q = c†σ = ( τ|q ).

2 Given a factor p on X, we can express forward inference as backward infer-
ence with the dagger:

c = ( σ| p ) = τ|c†σ = p .

Proof. 1 For x ∈ X we have:

( c†σ = ( τ|q ) )(x)
    = Σ_{y∈Y}  c†σ (y)(x) · ( (c = σ)|q )(y)
    = Σ_{y∈Y}  [ c(x)(y) · σ(x) / (c = σ)(y) ] · [ (c = σ)(y) · q(y) / ( c = σ |= q ) ]
    = σ(x) · ( Σ_{y∈Y} c(x)(y) · q(y) ) / ( σ |= c = q )
    = σ(x) · (c = q)(x) / ( σ |= c = q )
    = σ|c = q (x).

2 Similarly, for y ∈ Y,

( τ|c†σ = p )(y)
    = (c = σ)(y) · (c†σ = p)(y) / ( c = σ |= c†σ = p )
    = Σ_{x∈X}  (c = σ)(y) · c†σ (y)(x) · p(x) / ( c†σ = (c = σ) |= p )
    = Σ_{x∈X}  (c = σ)(y) · [ σ(x) · c(x)(y) / (c = σ)(y) ] · p(x) / ( σ |= p )
    = Σ_{x∈X}  c(x)(y) · σ(x) · p(x) / ( σ |= p )
    = Σ_{x∈X}  c(x)(y) · σ| p (x)
    = ( c = (σ| p ) )(y).
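The two equations can also be checked mechanically on a small instance; below is such a check for item (1), in our own Python encoding with exact fractions:

```python
from fractions import Fraction

sigma = {'u': Fraction(1, 4), 'v': Fraction(3, 4)}
c = {'u': {'y1': Fraction(1, 2), 'y2': Fraction(1, 2), 'y3': Fraction(0)},
     'v': {'y1': Fraction(1, 6), 'y2': Fraction(1, 3), 'y3': Fraction(1, 2)}}
q = {'y1': Fraction(1), 'y2': Fraction(1, 2), 'y3': Fraction(1, 4)}

tau = {y: sum(sigma[x] * c[x][y] for x in sigma) for y in q}     # c = sigma
dag = {y: {x: sigma[x] * c[x][y] / tau[y] for x in sigma} for y in tau}

# Backward inference sigma|_{c = q} ...
w = {x: sigma[x] * sum(c[x][y] * q[y] for y in q) for x in sigma}
lhs = {x: v / sum(w.values()) for x, v in w.items()}

# ... equals updating tau with q and transforming along the dagger.
u = {y: tau[y] * q[y] for y in tau}
tau_q = {y: v / sum(u.values()) for y, v in u.items()}
rhs = {x: sum(tau_q[y] * dag[y][x] for y in tau_q) for x in sigma}

assert lhs == rhs    # Theorem 6.6.3 (1), exactly
```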

We include two further results with basic properties of the dagger of a chan-
nel. More such properties are investigated in Section 7.8, where the dagger is
put in a categorical perspective.


Proposition 6.6.4. Let c : X → Y be a channel with a state σ ∈ D(X) such
that c = σ has full support. Then:

1 Transforming the predicted state along the dagger channel yields the origi-
nal state:

c†σ = ( c = σ ) = σ.

2 The double dagger of a channel is the channel itself:

( c†σ )†c = σ = c.

Proof. 1 For x ∈ X we have:

( c†σ = ( c = σ ) )(x)
    = Σ_{y∈Y}  c†σ (y)(x) · (c = σ)(y)
    = [by (6.6)]  Σ_{y∈Y}  [ σ(x) · c(x)(y) / (c = σ)(y) ] · (c = σ)(y)
    = σ(x) · Σ_{y∈Y} c(x)(y)
    = σ(x).

2 Again we unfold the definitions: for x ∈ X and y ∈ Y,

( c†σ )†c = σ (x)(y)
    = [by (6.6)]  c†σ (y)(x) · (c = σ)(y) / ( c†σ = (c = σ) )(x)
    = σ(x) · c(x)(y) · (c = σ)(y) / ( (c = σ)(y) · σ(x) )        by (6.6) and item (1)
    = c(x)(y).
Next we show the close connection between channels, and their daggers,
and joint states.
Theorem 6.6.5 (Disintegration). Consider the two projections π1 : X × Y → X
and π2 : X × Y → Y as deterministic channels. Let ω ∈ D(X × Y) be a joint
state, whose marginals ω1 B ω[ 1, 0 ] = π1 = ω ∈ D(X) and ω2 B ω[ 0, 1 ] =
π2 = ω ∈ D(Y) both have full support.
Extract from ω the two channels:

c B π2 ◦· (π1 )†ω : X → Y        d B π1 ◦· (π2 )†ω : Y → X

Then:

1 The joint state ω is the graph of both these channels:

⟨id , c⟩ = ω1 = ω = ⟨d, id ⟩ = ω2 .


2 The extracted channels are each other’s daggers:

c†ω1 = d    and    d†ω2 = c.

This process of extracting channels from a joint state is called disintegration.


It will be studied more systematically in the next chapter, see Section 7.6. The
two channels X → Y and Y → X that we can extract in different directions
from a joint state on X × Y are each other’s daggers. In traditional probability
this is expressed as equation:

P(A | B) · P(B) = P(A, B) = P(B | A) · P(A).


Proof. For both items we only do the first equations, since the second ones
follow by symmetry.

1 For x ∈ X and y ∈ Y,

( ⟨id , c⟩ = ω1 )(x, y) = ω1 (x) · ( π2 ◦· (π1 )†ω )(x)(y)
    = ω1 (x) · Σ_{z∈X}  (π1 )†ω (x)(z, y)
    = [by (6.6)]  (π1 = ω)(x) · Σ_{z∈X}  ω(z, y) · π1 (z, y)(x) / (π1 = ω)(x)
    = ω(x, y).

2 Similarly,

c†ω1 (y)(x) = [by (6.6)]  ω1 (x) · c(x)(y) / (c = ω1 )(y) = ( ⟨id , c⟩ = ω1 )(x, y) / ω2 (y)
    = ω(x, y) / (π2 = ω)(y)
    = Σ_{z∈Y}  (π2 )†ω (y)(x, z)
    = ( π1 ◦· (π2 )†ω )(y)(x)
    = d(y)(x).
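Disintegration is similarly easy to check on a concrete joint state (a sketch in our own Python encoding, exact fractions):

```python
from fractions import Fraction

omega = {('x1', 'y1'): Fraction(1, 8), ('x1', 'y2'): Fraction(3, 8),
         ('x2', 'y1'): Fraction(1, 4), ('x2', 'y2'): Fraction(1, 4)}

om1 = {x: omega[(x, 'y1')] + omega[(x, 'y2')] for x in ('x1', 'x2')}
om2 = {y: omega[('x1', y)] + omega[('x2', y)] for y in ('y1', 'y2')}

# The channels extracted in the two directions: conditioning of omega.
c = {x: {y: omega[(x, y)] / om1[x] for y in om2} for x in om1}
d = {y: {x: omega[(x, y)] / om2[y] for x in om1} for y in om2}

# Item (1): omega is the graph of c; item (2): d is the dagger of c.
assert all(om1[x] * c[x][y] == omega[(x, y)] for x in om1 for y in om2)
assert all(d[y][x] == om1[x] * c[x][y] / om2[y] for x in om1 for y in om2)
```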

We recall the Ewens draw-delete/add maps with the Ewens distributions
from Section 3.9, in a situation:

EDD : E(K +1) → E(K)    and    EDA(t) : E(K) → E(K +1),    with ew[K](t) ∈ D( E(K) ).

As announced there, these EDD and EDA maps are each other’s daggers. We
can now make this precise.


Proposition 6.6.6. The Ewens draw-delete/add channels are each other’s dag-
gers, via the Ewens distributions: for each K, t ∈ N>0 one has:

EDD †ew[K+1](t) = EDA(t) and EDA(t)†ew[K](t) = EDD.

Proof. Fix Ewens multisets ϕ ∈ E(K) and ψ ∈ E(K +1). A useful observation
to start with is that the commuting diagrams in Corollary 3.9.12 give us the
following validities of transformed point predicates.

ew[K +1](t) |= EDD = 1ϕ = ( EDD = ew[K +1](t) )(ϕ) = ew[K](t)(ϕ)
ew[K](t) |= EDA(t) = 1ψ = ( EDA(t) = ew[K](t) )(ψ) = ew[K +1](t)(ψ).

For the first equation in the proposition we mimic the proof of Corollary 3.9.12
and follow the reasoning there:

EDD †ew[K+1](t) (ϕ)
    = [by (6.6)]  Σ_{ψ∈E(K+1)}  [ ew[K +1](t)(ψ) · (EDD = 1ϕ )(ψ) / ( ew[K +1](t) |= EDD = 1ϕ ) ] | ψ i
    = Σ_{ψ∈E(K+1)}  [ ew[K +1](t)(ψ) · EDD(ψ)(ϕ) / ew[K](t)(ϕ) ] | ψ i
    = (ϕ(1)+1)/(K +1) · [ ew[K +1](t)(ϕ + 1| 1 i) / ew[K](t)(ϕ) ] | ϕ + 1| 1 i i
      + Σ_{1≤k≤K}  [ (ϕ(k+1)+1)·(k+1) / (K +1) ] · [ ew[K +1](t)(ϕ − 1| k i + 1| k+1 i) / ew[K](t)(ϕ) ] | ϕ − 1| k i + 1| k+1 i i
    = t/(K +t) | ϕ + 1| 1 i i + Σ_{1≤k≤K}  [ ϕ(k)·k / (K +t) ] | ϕ − 1| k i + 1| k+1 i i
    = EDA(t)(ϕ).

Similarly, using the approach in the proof of Theorem 3.9.10,

EDA(t)†ew[K](t) (ψ)
    = [by (6.6)]  Σ_{ϕ∈E(K)}  [ ew[K](t)(ϕ) · (EDA(t) = 1ψ )(ϕ) / ( ew[K](t) |= EDA(t) = 1ψ ) ] | ϕ i
    = Σ_{ϕ∈E(K)}  [ ew[K](t)(ϕ) · EDA(t)(ϕ)(ψ) / ew[K +1](t)(ψ) ] | ϕ i
    = t/(K +t) · [ ew[K](t)(ψ − 1| 1 i) / ew[K +1](t)(ψ) ] | ψ − 1| 1 i i        (if ψ(1) > 0)
      + Σ_{1≤k≤K}  [ (ψ(k)+1)·k / (K +t) ] · [ ew[K](t)(ψ + 1| k i − 1| k+1 i) / ew[K +1](t)(ψ) ] | ψ + 1| k i − 1| k+1 i i
    = t/(K +t) · ψ(1) · (K +t)/(K +1) · 1/t | ψ − 1| 1 i i        (if ψ(1) > 0, so ψ(K +1) = 0)
      + Σ_{1≤k≤K}  [ (ψ(k)+1)·k / (K +t) ] · [ ψ(k+1)/(ψ(k)+1) ] · (K +t)/(K +1) · [ (t/k) / (t/(k+1)) ] | ψ + 1| k i − 1| k+1 i i
    = ψ(1)/(K +1) | ψ − 1| 1 i i + Σ_{2≤k≤K+1}  [ ψ(k)·k / (K +1) ] | ψ + 1| k−1 i − 1| k i i
    = EDD(ψ).

Here the quotients of Ewens distributions are simplified by expanding them
via the explicit formula from Section 3.9 and cancelling common factors.

Exercises
6.6.1 Let σ, τ ∈ D(X) be distributions with supp(τ) ⊆ supp(σ). Show that
for the identity channel id one has:
id †σ = τ = τ.
6.6.2 Let c : X → Y be a channel with state ω ∈ D(X), where c = ω has
full support. Consider the following distribution of distributions:
Ω B Σ_{y∈Y}  (c = ω)(y) | c†ω (y) i ∈ D( D(X) ).


Recall the probabilistic ‘flatten’ operation from Section 2.2 and show:
flat(Ω) = ω.
This says that Ω is ‘Bayes plausible’ in the terminology of [97] on
Bayesian persuasion. The construction of Ω is one part of the bijective
correspondence in Exercise 7.6.7.
6.6.3 For ω ∈ D(X), c : X → Y, p ∈ Obs(X) and q ∈ Obs(Y) prove:
c = ω |= (c†ω = p) & q = ω |= p & (c = q).
This is a reformulation of [24, Thm. 6]. If we ignore the validity signs
|= and the states before them, then we can recognise in this equation
the familiar ‘adjointness property’ of daggers (adjoint transposes) in
the theory of Hilbert spaces: hA† (x) | yi = hx | A(y)i.
6.6.4 Consider the draw-delete channel DD : N[K + 1](X) → N[K](X)
from Definition 2.2.4. Let X be a finite set, say with n elements. Con-
sider the uniform state unif on N[K + 1](X), see Exercise 2.3.7.
Show that DD’s dagger, with respect to unif, can be described on
ϕ ∈ N[K](X) as:

DD †unif (ϕ) = Σ_{x∈X}  [ (ϕ(x) + 1) / (K + n) ] | ϕ + 1| x i i.
This dagger differs from the draw-add map DA : N[K](X) → N[K +
1](X).
6.6.5 Let ω ∈ D(X × Y) be a joint state, whose two marginals ω1 B
ω[ 1, 0 ] ∈ D(X) and ω2 B ω[ 0, 1 ] ∈ D(Y) have full support. Let
p ∈ Pred (X) and q ∈ Pred (Y) be arbitrary predicates.
Prove that there are predicates q1 on X and p2 on Y such that:

ω1 |= p & q1 = ω |= p ⊗ q = ω2 |= p2 & q.
This is a discrete version of [131, Prop. 6.7].
Hint: Disintegrate!
6.6.6 Let c : X → Y and d : X → Z be channels with a state ω ∈ D(X) on
their domain such that both predicted states c = ω and d = ω have
full support. Show that:

⟨c, d⟩†ω (y, z) = ω|(c = 1y )&(d = 1z ) = d†c†ω (y) (z)
              = ω|(d = 1z )&(c = 1y ) = c†d†ω (z) (y).

6.6.7 Recall from Exercise 2.2.11 what it means for c : X → Y to be a bi-
channel. Check that c† : Y → X, as defined there, is c†unif , where unif
is the uniform distribution on X (assuming that X is finite).


6.7 Pearl’s and Jeffrey’s update rules


We have seen that backward inference is a useful reasoning technique, in a
situation where we have a state σ ∈ D(X), on the domain X of a channel
c : X → Y, that we wish to update in the light of evidence on the codomain Y
of the channel. The evidence is given in the form of a predicate (or factor) on
Y. This backward inference is also called Pearl’s update rule.
It turns out that there is an alternative update mechanism in this situation,
where the evidence is not given by a predicate on the codomain of the channel,
but by a state. This alternative was introduced by Jeffrey [90], see also [19, 33,
39, 160], or [17] for a recent application in physics. Jeffrey’s rule can be for-
mulated conveniently (see [77]) in terms of the Bayesian inverse (dagger) c†
of the channel — as introduced in the previous section.
As we shall see below, the update rules of Pearl and Jeffrey can give quite
different outcomes. Hence the question arises: when to use which rule? What
is the difference? There is a clear difference, which can be summarised as fol-
lows: Pearl’s rule increases validity and Jeffrey’s rule decreases divergence.
This will be made technically precise, in Theorem 6.7.4 below. The proof in
Pearl’s case is easy, but the proof in Jeffrey’s case is remarkably difficult. We
copy it from [79] and include it in a separate subsection.
In general terms, one can learn by reinforcing what goes well, or by steering
away from what goes wrong. In the first case one improves a positive evalua-
tion and in the second case one reduces a negative outcome, as a form of correc-
tion. Thus, Jeffrey’s update rule is a correction mechanism, for correcting er-
rors. As such it is used in ‘predictive coding’, a theory in cognitive science that
views the human brain as a Bayesian prediction engine, see [143, 51, 64, 23].
This section introduces Jeffrey’s update rule, especially in contrast to Pearl’s
rule, that is, to backward inference.

Definition 6.7.1. Let c : X → Y be a channel with a prior state σ ∈ D(X).

1 Pearl’s rule is backward inference: it involves updating the prior σ with
evidence in the form of a factor q ∈ Fact(Y) to the posterior:

σ|c = q ∈ D(X).

2 Jeffrey’s rule involves updating the prior σ with evidence in the form of a
state τ ∈ D(Y) to the posterior:

c†σ = τ ∈ D(X).

We shall illustrate the difference between Pearl’s and Jeffrey’s rules in two
examples.


Example 6.7.2. Consider the situation from Exercise 6.3.1 with set D = {d, d⊥ }
for disease (or not) and T = {p, n} for a positive or negative test outcome, with
a test channel c : D → T given by:

c(d) = 9/10 | p i + 1/10 | n i        c(d⊥ ) = 1/20 | p i + 19/20 | n i.

The test thus has a sensitivity of 90% and a specificity of 95%. We assume a
prevalence of 10% via a prior state ω = 1/10 | d i + 9/10 | d⊥ i.
The test is performed, under unfavourable circumstances like bad light, and
we are only 80% sure that the test is positive. With Pearl’s update rule we thus
use as evidence predicate q = 4/5 · 1 p + 1/5 · 1n . It gives as posterior disease
probability:

ω|c = q = 74/281 | d i + 207/281 | d⊥ i ≈ 0.26| d i + 0.74| d⊥ i.
This gives a disease likelihood of 26%.
When we decide to use Jeffrey’s rule we translate the 80% certainty of a
positive test into a state τ = 4/5 | p i + 1/5 | n i. Then we compute:

c†ω = τ = 4/5 · c†ω (p) + 1/5 · c†ω (n)
        = 4/5 · ω|c = 1 p + 1/5 · ω|c = 1n        by (6.6)
        = 4/5 · (2/3 | d i + 1/3 | d⊥ i) + 1/5 · (2/173 | d i + 171/173 | d⊥ i)
        = 278/519 | d i + 241/519 | d⊥ i
        ≈ 0.54| d i + 0.46| d⊥ i.
The disease likelihood is now 54%, more than twice as high as with Pearl’s
update rule. This is a serious difference, which may have serious consequences.
Should we start asking our doctors: does your computer use Pearl’s or Jeffrey’s
rule to calculate likelihoods, and come to an advice about medical treatment?
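Both rules are easy to experiment with for finite, dictionary-based states and channels. The following Python sketch (the function names and the encoding are ours, not from the text; `d_no` stands for d⊥) reproduces the 26% versus 54% disease likelihoods computed above.

```python
def pushforward(channel, state):
    """State transformation c >>= sigma (written c = sigma in the text)."""
    out = {}
    for x, px in state.items():
        for y, py in channel[x].items():
            out[y] = out.get(y, 0.0) + px * py
    return out

def validity(state, factor):
    """The validity sigma |= q."""
    return sum(px * factor[x] for x, px in state.items())

def pearl_update(state, channel, factor):
    """Backward inference: the update of sigma with the transformed factor."""
    weights = {x: px * validity(channel[x], factor) for x, px in state.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def jeffrey_update(state, channel, evidence_state):
    """Jeffrey's rule: push the evidence state back through the dagger channel."""
    pred = pushforward(channel, state)   # must have full support
    out = {x: 0.0 for x in state}
    for y, qy in evidence_state.items():
        for x, px in state.items():
            out[x] += qy * px * channel[x][y] / pred[y]
    return out

# Example 6.7.2: disease test with sensitivity 90% and specificity 95%
c = {'d': {'p': 0.9, 'n': 0.1}, 'd_no': {'p': 0.05, 'n': 0.95}}
omega = {'d': 0.1, 'd_no': 0.9}
q = {'p': 0.8, 'n': 0.2}      # 80%-sure-positive, read as a predicate
tau = {'p': 0.8, 'n': 0.2}    # the same evidence, read as a state

pearl = pearl_update(omega, c, q)
jeffrey = jeffrey_update(omega, c, tau)
print(round(pearl['d'], 4), round(jeffrey['d'], 4))   # 0.2633 0.5356
```

The same helpers can be reused for the later examples; only the channel and the evidence change.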
We also review an earlier illustration, now with Jeffrey’s approach.
Example 6.7.3. Recall the situation of Example 2.2.2, with a teacher predict-
ing the performance of pupils, depending on the teacher’s mood. The evidence
q from Example 6.3.4 is used for Pearl’s update. It can be translated into a state
τ on the set G of grades:

τ = 1/10 | 1 i + 3/10 | 2 i + 3/10 | 3 i + 2/10 | 4 i + 1/10 | 5 i.

There is an a priori divergence DKL (τ, c = σ) ≈ 1.336. With some effort one
can prove that the Jeffrey-update of σ is:
σ0 = c†σ = τ = 972795/3913520 | p i + 1966737/3913520 | n i + 973988/3913520 | o i
            ≈ 0.2486| p i + 0.5025| n i + 0.2489| o i.


The divergence has now dropped to: DKL (τ, c = σ0 ) ≈ 1.087.


In the end it is interesting to compare the original (prior) mood with its Pearl-
and Jeffrey-updates. In both cases we see that after the bad marks of the pupils
the teacher has become less optimistic.

[Figure: bar plots of the prior mood, the Pearl-update and the Jeffrey-update.]
The prior mood is reproduced from Example 2.2.2 for easy comparison. The
Pearl and Jeffrey updates differ only slightly in this case.

The following theorem (from [79]) captures the essence of the update rules
of Pearl and Jeffrey: validity increase or divergence decrease.

Theorem 6.7.4. Consider the situation in Definition 6.7.1, with a prior state
σ ∈ D(X) and a channel c : X → Y.

1 For a factor q on Y Pearl’s update rule gives an increase of validity:

c = σP |= q ≥ c = σ |= q        for the updated state σP = σ|c = q .

2 For an evidence state τ ∈ D(Y) Jeffrey’s update rule gives a decrease of
Kullback-Leibler divergence:

DKL (τ, c = σ J ) ≤ DKL (τ, c = σ)        for σ J = c†σ = τ.

In this situation we assume that the predicted state c = σ has full support,
so that the dagger c†σ is well-defined.

In this situation c = σ is the predicted state, where we evaluate the evidence.

• In Pearl’s case we look at the validity c = σ |= q of the evidence q in this
predicted state. The above theorem tells us that by switching to the Pearl-
update σP we get a higher validity c = σP |= q.
• In Jeffrey’s case we look at the divergence / mismatch DKL (τ, c = σ) be-
tween the evidence state τ and the predicted state c = σ. By changing to the

Jeffrey-update σ J we get a lower divergence DKL (τ, c = σ J ). Thus, via Jef-
frey’s rule one reduces ‘prediction errors’, in the terminology of predictive
coding theory.

Proof. 1 Via Proposition 4.3.3 we can turn = on the left of |= into = on
the right and vice-versa, so that we can use the validity increase of Theo-
rem 6.1.5:

c = (σ|c = q ) |= q = σ|c = q |= c = q
                   ≥ σ |= c = q
                   = c = σ |= q.

2 The proof is non-trivial and relegated to Subsection 6.7.1.

The fact that Jeffrey’s rule involves correction of prediction errors, as stated
in Theorem 6.7.4 (2), has led to the view that Jeffrey’s update rule should
be used in situations where one is confronted with ‘surprises’ [41] or with
‘unanticipated knowledge’ [40]. Here we include an example from [41] (also
used in [77]) that involves such error correction after a ‘surprise’.

Example 6.7.5. Ann must decide about hiring Bob, whose characteristics are
described in terms of a combination of competence (c or c⊥ ) and experience (e
or e⊥ ). The prior, based on experience with many earlier candidates, is a joint
distribution on the product space C × E, for C = {c, c⊥ } and E = {e, e⊥ }, given
as:
ω = 4/10 | c, e i + 1/10 | c, e⊥ i + 1/10 | c⊥ , e i + 4/10 | c⊥ , e⊥ i.

The first marginal of ω is the uniform distribution 1/2 | c i + 1/2 | c⊥ i. It is the
neutral base rate for Bob’s competence.
We use the two projection functions π1 : C × E → C and π2 : C × E → E as
deterministic channels along which we reason, using both Pearl’s and Jeffrey’s
rules.
If Ann learns that Bob has relevant work experience, given by
point evidence 1e , her strategy is to factor this in via Pearl’s rule / backward
inference: this gives as posterior ω|π2 = 1e = ω|1⊗1e , whose first marginal is
4/5 | c i + 1/5 | c⊥ i. It is then more likely that Bob is competent.

Ann reads Bob’s letter to find out if he actually has relevant experience. We
quote from [41]:

Bob’s answer reveals right from the beginning that his written English is poor. Ann no-
tices this even before figuring out what Bob says about his work experience. In response
to this unforeseen learnt input, Ann lowers her probability that Bob is competent from
1/2 to 1/8. It is natural to model this as an instance of Jeffrey revision.


Bob’s poor English is a new state of affairs: a surprise that causes Ann to switch
to error reduction mode, via Jeffrey’s rule. Bob’s poor command of the English
language translates into a competence state ρ = 1/8 | c i + 7/8 | c⊥ i. Ann wants to
adapt to this new surprising situation, so she uses Jeffrey’s rule, giving a new
joint state:

ω0 = (π1 )†ω = ρ = 1/10 | c, e i + 1/40 | c, e⊥ i + 7/40 | c⊥ , e i + 7/10 | c⊥ , e⊥ i.

If the letter now tells that Bob has work experience, Ann will factor this in, in
this new situation ω0 , giving, like above, via Pearl’s rule followed by marginal-
isation,

ω0 |π2 = 1e [1, 0] = 4/11 | c i + 7/11 | c⊥ i.

The likelihood of Bob being competent is now lower than in the prior state,
since 4/11 < 1/2. This example reconstructs the illustration from [41] in channel-
based form, with the associated formulations of Pearl’s and Jeffrey’s rules from
Definition 6.7.1, and produces exactly the same outcomes as in loc. cit.
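The numbers in this example can be replayed mechanically. The following Python sketch (our own encoding; `c~` and `e~` stand for c⊥ and e⊥) checks the three updates: Pearl on the prior, Jeffrey after the surprise, and Pearl again on the revised joint state.

```python
# Joint prior on competence x experience from Example 6.7.5
omega = {('c', 'e'): 0.4, ('c', 'e~'): 0.1, ('c~', 'e'): 0.1, ('c~', 'e~'): 0.4}

def marginal(joint, i):
    out = {}
    for xy, p in joint.items():
        out[xy[i]] = out.get(xy[i], 0.0) + p
    return out

def pearl_experience(joint):
    """Condition on point evidence 1_e along pi_2, then take the first marginal."""
    kept = {xy: p for xy, p in joint.items() if xy[1] == 'e'}
    total = sum(kept.values())
    return marginal({xy: p / total for xy, p in kept.items()}, 0)

def jeffrey_competence(joint, rho):
    """Jeffrey update along pi_1: rescale each row to the new first marginal rho."""
    m = marginal(joint, 0)
    return {xy: rho[xy[0]] * p / m[xy[0]] for xy, p in joint.items()}

first = pearl_experience(omega)             # competence 4/5 after point evidence 1_e
rho = {'c': 1/8, 'c~': 7/8}                 # the surprise: poor written English
omega2 = jeffrey_competence(omega, rho)     # the revised joint state
second = pearl_experience(omega2)           # competence 4/11 after 1_e, post-surprise
print(first['c'], second['c'])
```

The row-rescaling in `jeffrey_competence` is exactly the channel composition (π1)†ω followed by pushing the state ρ forward.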

We briefly discuss some further commonalities and differences between the


rules of Pearl and Jeffrey. We have seen some of these results before, but now
they are put in the context of Pearl’s and Jeffrey’s update rule.

Proposition 6.7.6. Let c : X → Y be a channel with a prior state σ ∈ D(X).

1 Pearl’s rule and Jeffrey’s rule agree on point predicates / states: for y ∈ Y,
with associated point predicate 1y and point state 1| y i, one has:

σ|c = 1y = c†σ (y) = c†σ = 1| y i.

2 Pearl’s updating with a constant predicate (no information) does not change
the prior state σ:

σ|c = r·1 = σ,        for r > 0.

Jeffrey’s update does not change anything when we update with what we
already predict:

c†σ = (c = σ) = σ.


This is in essence the law of total probability, see Exercise 6.7.6 below.
3 For a factor q ∈ Fact(Y) we can update the state σ according to Pearl, and
also the channel c, so that the update of the predicted state is predicted:

c|q = (σ|c = q ) = (c = σ)|q .


In Jeffrey’s case, with an evidence state τ ∈ D(Y), we can update both the


state and the channel, via a double-dagger, so that the evidence state τ is
predicted:

(c†σ )†τ = (c†σ = τ) = τ.
4 Multiple updates with Pearl’s rule commute, but multiple updates with Jef-
frey’s rule do not commute.
Proof. These items follow from earlier results.
1 Directly by Definition 6.7.1.
2 The first equation follows from Lemma 6.1.6 (1) and (4); the second one is
Proposition 6.6.4 (1).
3 The first claim is Theorem 6.3.11 and the second one is an instance of Propo-
sition 6.6.4 (2).
4 By Lemma 6.1.6 (3) we have:

σ|c = q1 |c = q2 = σ|(c = q1 )&(c = q2 ) = σ|c = q2 |c = q1 .

However, in general, for evidence states τ1 , τ2 ∈ D(Y),

c†ρ1 = τ2 ≠ c†ρ2 = τ1 ,        where ρ1 = c†σ = τ1 and ρ2 = c†σ = τ2 .

Exercise 6.7.5 contains a concrete example.


Item (2) presents different views on ‘learning’, where learning is now used
in an informal sense: according to Pearl’s rule you learn nothing when you
get no information; but according to Jeffrey you learn nothing when you are
presented with what you already know. Both interpretations make sense.
Probabilistic updating may be used as a model for what is called priming in
cognitive science, see e.g. [60, 64]. It is well-known that the human mind is
sensitive to the order in which information is processed, that is, to the order
of priming/updating. Thus, the last item (4) above suggests that Jeffrey’s rule
might be more appropriate in such a setting. This strengthens the view in pre-
dictive coding that the human mind learns from error correction, as in Jeffrey’s
update rule.
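The failure of commutativity in item (4) is easy to observe numerically. The sketch below (our own helper function, reusing the test channel of Example 6.7.2 together with an arbitrarily chosen second evidence state) applies two Jeffrey updates in both orders and obtains different posteriors.

```python
def jeffrey(state, channel, tau):
    """Jeffrey update: push tau back through the dagger of the channel."""
    pred = {}
    for x, px in state.items():
        for y, py in channel[x].items():
            pred[y] = pred.get(y, 0.0) + px * py
    out = {x: 0.0 for x in state}
    for y, qy in tau.items():
        for x, px in state.items():
            out[x] += qy * px * channel[x][y] / pred[y]
    return out

# Test channel and prior of Example 6.7.2; tau2 is an arbitrary second evidence state
c = {'d': {'p': 0.9, 'n': 0.1}, 'd_no': {'p': 0.05, 'n': 0.95}}
sigma = {'d': 0.1, 'd_no': 0.9}
tau1 = {'p': 0.8, 'n': 0.2}
tau2 = {'p': 0.6, 'n': 0.4}

ab = jeffrey(jeffrey(sigma, c, tau1), c, tau2)
ba = jeffrey(jeffrey(sigma, c, tau2), c, tau1)
print(ab['d'], ba['d'])   # different: the order of Jeffrey updates matters
```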
There are translations back-and-forth between Pearl’s and Jeffrey’s rules,
due to [19]; they are adapted here to the current setting.
Proposition 6.7.7. Let c : X → Y be a channel with a prior state σ ∈ D(X) on
its domain, such that the predicted state c = σ has full support.
1 Pearl’s updating can be expressed as Jeffrey’s update, by turning a factor
q : Y → R≥0 into the state (c = σ)|q ∈ D(Y), so that:

c†σ = (c = σ)|q = σ|c = q .


2 Jeffrey’s updating can also be expressed as Pearl’s updating: for an evidence
state τ ∈ D(Y) write τ/(c = σ) for the factor y 7→ τ(y)/(c = σ)(y); then:

c†σ = τ = σ| c = τ/(c=σ) .

Proof. The first item is exactly Theorem 6.6.3 (1). For the second item we first
note that (c = σ) |= τ/(c = σ) = 1, since τ is a state:

(c = σ) |= τ/(c = σ) = Σy∈Y (c = σ)(y) · τ(y)/(c = σ)(y) = Σy∈Y τ(y) = 1.        (6.7)

But then, for x ∈ X,

(c†σ = τ)(x) = Σy∈Y τ(y) · c†σ (y)(x)
            = Σy∈Y τ(y) · σ(x) · c(x)(y) / (c = σ)(y)        by (6.6)
            = σ(x) · Σy∈Y c(x)(y) · τ(y)/(c = σ)(y)
            = σ(x) · (c = τ/(c = σ))(x)
            = σ(x) · (c = τ/(c = σ))(x) / ((c = σ) |= τ/(c = σ))        as just shown
            = σ(x) · (c = τ/(c = σ))(x) / (σ |= c = τ/(c = σ))
            = σ| c = τ/(c=σ) (x).
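The translation in item (2) can be checked on a small example. The sketch below (our own helpers, on an arbitrary two-element prior with a three-element codomain) computes a Jeffrey update both directly and as a Pearl update with the factor τ/(c = σ).

```python
def push(channel, state):
    out = {}
    for x, px in state.items():
        for y, py in channel[x].items():
            out[y] = out.get(y, 0.0) + px * py
    return out

def pearl(state, channel, factor):
    w = {x: px * sum(channel[x][y] * factor[y] for y in factor)
         for x, px in state.items()}
    total = sum(w.values())
    return {x: val / total for x, val in w.items()}

def jeffrey(state, channel, tau):
    pred = push(channel, state)
    out = {x: 0.0 for x in state}
    for y, qy in tau.items():
        for x, px in state.items():
            out[x] += qy * px * channel[x][y] / pred[y]
    return out

c = {0: {'a': 1/9, 'b': 2/3, 'c': 2/9}, 1: {'a': 7/25, 'b': 7/25, 'c': 11/25}}
sigma = {0: 0.5, 1: 0.5}
tau = {'a': 1/2, 'b': 1/3, 'c': 1/6}

pred = push(c, sigma)
q = {y: tau[y] / pred[y] for y in tau}   # the factor tau / (c = sigma)
lhs = jeffrey(sigma, c, tau)
rhs = pearl(sigma, c, q)
print(lhs[0], rhs[0])   # both equal the Jeffrey posterior at 0
```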

Combination of Theorem 6.7.4 and Proposition 6.7.7 tells us how Pearl’s


rule can also lead to a decrease of divergence. The situation is quite subtle and
will be discussed further in the subsequent remarks.

Corollary 6.7.8. Let c : X → Y be a channel with a prior state σ ∈ D(X) and


a factor q on Y. Then:

DKL ((c = σ)|q , c = (σ|c = q )) ≤ DKL ((c = σ)|q , c = σ).

This result can be interpreted as follows. With my prior state σ I can predict
c = σ. I can update this prediction with evidence q to (c = σ)|q . The divergence
between this update and my prediction is bigger than the divergence between
the update and the prediction from the Pearl/Bayes posterior σ|c = q . Thus, the
posterior gives a correction.

Proof. Take τ = (c = σ)|q . Theorem 6.7.4 (2) says that:

DKL (τ, c = (c†σ = τ)) ≤ DKL (τ, c = σ).

But Proposition 6.7.7 (1) tells us that c†σ = τ = σ|c = q .


Remarks 6.7.9. 1 In the above corollary we use that Pearl’s rule can be ex-
pressed as Jeffrey’s, see point (1) in Proposition 6.7.7. The other way around,
Jeffrey’s rule is expressed via Pearl’s in point (2). This leads to a validity in-
crease property for Jeffrey’s rule: let c : X → Y be a channel with states
σ ∈ D(X) and τ ∈ D(Y). Assuming that c = σ has full support we can form
the factor q = τ/(c = σ). We then get an inequality:

c = (c†σ = τ) |= q = c = (σ|c = q ) |= q ≥ c = σ |= q = 1,        using (6.7).

This is not very useful.


2 We have emphasised the message “Pearl increases validity” and “Jeffrey
decreases divergence”. But Corollary 6.7.8 seems to nuance this message,
since it describes a divergence decrease for Pearl too, and the previous item
gives a validity increase for Jeffrey. What is going on? We try to clarify the
situation via an example, demonstrating that the validity increase of Pearl’s
rule fails for Jeffrey’s rule and that the divergence decrease of Jeffrey’s rule
fails for Pearl’s rule.
Take sets X = {0, 1} and Y = {a, b, c} with uniform prior σ = 1/2 | 0 i + 1/2 | 1 i ∈
D(X). We use the channel c : X → Y given by:

c(0) = 1/9 | a i + 2/3 | b i + 2/9 | c i    and    c(1) = 7/25 | a i + 7/25 | b i + 11/25 | c i.

The predicted state is then c = σ = 44/225 | a i + 71/150 | b i + 149/450 | c i. We use as
‘equal’ evidence predicate and state:

q = 1/2 · 1a + 1/3 · 1b + 1/6 · 1c    and    τ = 1/2 | a i + 1/3 | b i + 1/6 | c i.
We then get the following updates, according to Pearl and Jeffrey, respec-
tively:

σP = σ|c = q = 425/839 | 0 i + 414/839 | 1 i ≈ 0.5066| 0 i + 0.4934| 1 i
σ J = c†σ = τ = 805675/1861904 | 0 i + 1056229/1861904 | 1 i ≈ 0.4327| 0 i + 0.5673| 1 i.
The validities are summarised in the following table.

    description        formula           value
    prior validity     c = σ |= q        0.31074
    after Pearl        c = σP |= q       0.31079
    after Jeffrey      c = σ J |= q      0.31019
The differences are small, but relevant. Pearl’s updating increases validity, as
Theorem 6.7.4 (1) dictates, but Jeffrey’s updating does not, in this example.
The divergences in this example are as follows.


    description          formula                value
    prior divergence     DKL (τ, c = σ)         0.238
    after Pearl          DKL (τ, c = σP )       0.240
    after Jeffrey        DKL (τ, c = σ J )      0.221

We see that Jeffrey’s updating decreases divergence, in line with Theo-


rem 6.7.4 (2), but Pearl’s updating does not. Is there a contradiction with
Corollary 6.7.8, which does involve a divergence decrease? No, since there
the state τ has a very particular shape, namely (c = σ)|q . We conclude that
for that particular state Pearl’s rule gives a divergence decrease, but there is
no such decrease in general.
Thus, we conclude that, in general, validity increase is exclusive for Pearl’s
rule, and divergence decrease is exclusive for Jeffrey’s rule.
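The two tables above can be reproduced with a short computation; the following Python sketch (our own encoding, with natural-logarithm KL divergence) prints the three validities and the three divergences.

```python
from math import log

c = {0: {'a': 1/9, 'b': 2/3, 'c': 2/9}, 1: {'a': 7/25, 'b': 7/25, 'c': 11/25}}
sigma = {0: 0.5, 1: 0.5}
q = {'a': 1/2, 'b': 1/3, 'c': 1/6}   # evidence predicate
tau = dict(q)                         # the 'equal' evidence state

def push(ch, st):
    out = {}
    for x, px in st.items():
        for y, py in ch[x].items():
            out[y] = out.get(y, 0.0) + px * py
    return out

def validity(st, pred):
    return sum(px * pred[x] for x, px in st.items())

def kl(s, t):
    # Kullback-Leibler divergence DKL(s, t), natural logarithm
    return sum(px * log(px / t[x]) for x, px in s.items())

def pearl(st, ch, pred):
    w = {x: px * validity(ch[x], pred) for x, px in st.items()}
    total = sum(w.values())
    return {x: val / total for x, val in w.items()}

def jeffrey(st, ch, ev):
    predicted = push(ch, st)
    out = {x: 0.0 for x in st}
    for y, qy in ev.items():
        for x, px in st.items():
            out[x] += qy * px * ch[x][y] / predicted[y]
    return out

sP, sJ = pearl(sigma, c, q), jeffrey(sigma, c, tau)
vals = [validity(push(c, s), q) for s in (sigma, sP, sJ)]
kls = [kl(tau, push(c, s)) for s in (sigma, sP, sJ)]
print([round(v, 5) for v in vals])   # [0.31074, 0.31079, 0.31019]
print([round(d, 3) for d in kls])    # [0.238, 0.24, 0.221]
```

Only Pearl's update raises the validity, and only Jeffrey's lowers the divergence, in line with the discussion above.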

6.7.1 Proof of Jeffrey’s divergence reduction


We have skipped a proof of the divergence reduction in Jeffrey’s update rule,
in Theorem 6.7.4 (2). Here we include a non-trivial proof, copied from [79]. It
uses the following result, which we shall call the ‘weighted update inequality’.

Proposition 6.7.10. Let ω ∈ D(X) be a state with predicates p1 , . . . , pn ∈
Pred (X) forming a ‘test’, so that >i pi = 1. We assume ω |= pi ≠ 0, for each i.
For all numbers r1 , . . . , rn ∈ R≥0 with Σi ri = 1, one has:

Σi  ri · (ω |= pi ) / ( Σ j r j · (ω| p j |= pi ) )  ≤ 1.        (6.8)

This weighted update inequality follows from [50, Thm. 4.1]. The proof is
non-trivial, but involves some interesting properties of matrices of validities.
We skip this proof for now and first show how it is used in a proof of the diver-
gence reduction of Jeffrey’s rule. A binary version, for n = 2, of this result has
appeared already in Exercise 6.1.12.
We use the name ‘weighted update’, since the denominator in (6.8) can also
be written as the validity of the predicate pi in the weighted sum of updates:

( Σ j r j · ω| p j ) |= pi = Σ j r j · (ω| p j |= pi ).
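The inequality (6.8) can be probed numerically. The sketch below (state, test and weights are arbitrary choices of ours, not from the text) evaluates the left-hand side for one instance.

```python
omega = {'a': 0.3, 'b': 0.3, 'c': 0.4}
# a 3-test: the three predicates sum to the truth predicate pointwise
p = [{'a': 0.5, 'b': 0.2, 'c': 0.1},
     {'a': 0.3, 'b': 0.3, 'c': 0.4},
     {'a': 0.2, 'b': 0.5, 'c': 0.5}]
r = [0.2, 0.5, 0.3]   # convex weights

def validity(state, pred):
    return sum(state[x] * pred[x] for x in state)

def update(state, pred):
    val = validity(state, pred)
    return {x: state[x] * pred[x] / val for x in state}

# sanity check that p really is a test
assert all(abs(p[0][x] + p[1][x] + p[2][x] - 1.0) < 1e-12 for x in omega)

v = [validity(omega, pi) for pi in p]
lhs = sum(r[i] * v[i] /
          sum(r[j] * validity(update(omega, p[j]), p[i]) for j in range(3))
          for i in range(3))
print(lhs)   # bounded by 1, as (6.8) asserts
```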


Proof. [of Theorem 6.7.4 (2)] We reason as follows:

DKL (τ, c = (c†σ = τ)) − DKL (τ, c = σ)
    = Σy τ(y) · log( τ(y) / (c = (c†σ = τ))(y) ) − Σy τ(y) · log( τ(y) / (c = σ)(y) )
    = Σy τ(y) · log( (c = σ)(y) / (c = (c†σ = τ))(y) )
    ≤ log( Σy τ(y) · (c = σ)(y) / (c = (c†σ = τ))(y) )
    ≤ log 1 = 0.

The first inequality is an instance of Jensen’s inequality, see Lemma 2.7.3. The
second one follows from:

Σy τ(y) · (c = σ)(y) / (c = (c†σ = τ))(y)
    = Σy τ(y) · (σ |= c = 1y ) / ( Σ x (c†σ = τ)(x) · c(x)(y) )
    = Σy τ(y) · (σ |= c = 1y ) / ( Σ x,z τ(z) · c†σ (z)(x) · (c = 1y )(x) )
    = Σy τ(y) · (σ |= c = 1y ) / ( Σ x,z τ(z) · (σ(x) · (c = 1z )(x) / (σ |= c = 1z )) · (c = 1y )(x) )        by (6.6)
    = Σy τ(y) · (σ |= c = 1y ) / ( Σz τ(z) · (σ |= (c = 1z )&(c = 1y )) / (σ |= c = 1z ) )
    = Σy τ(y) · (σ |= c = 1y ) / ( Σz τ(z) · (σ|c = 1z |= c = 1y ) )        by Proposition 6.1.3 (1)
    ≤ 1,        by Proposition 6.7.10.

In the last line we apply Proposition 6.7.10 with test pi B c = 1yi , where Y =
{y1 , . . . , yn }. The point predicates 1yi form a test on Y. Predicate transformation
preserves tests, see Exercise 4.3.4. As is common with daggers c†σ , we assume
that c = σ has full support, so that σ |= pi = σ |= c = 1yi = (c = σ)(yi ) is
non-zero for each i.

We now turn to the proof of Proposition 6.7.10. It relies on some basic
facts from linear algebra, especially about non-negative matrices; see [124] for
background information. We shall see that for a test (pi ) the associated conditional
expectations ω| p j |= pi form a non-negative matrix C with mathematically in-
teresting properties. The proof that we present is extracted2 from [50] and is
applied only to this conditional expectation matrix C. The original proof is
formulated more generally.
We recall that for a square matrix A the spectral radius ρ(A) is the maximum
of the absolute values of its eigenvalues:
ρ(A) B max{ |λ| | λ is an eigenvalue of A }.

We shall make use of the following result. The first item is known as Gelfand’s
formula, originally from 1941. The proof is non-trivial and is skipped here; for
details, see e.g. [112, Appendix 10]. For convenience we include short (stan-
dard) proofs of the other two items.

Theorem 6.7.11. Let A be a (finite) square matrix, and let ‖ − ‖ be a matrix norm.

1 The spectral radius satisfies:

ρ(A) = limn→∞ ( ‖ An ‖ )^(1/n) .

2 Here we shall use the 1-norm ‖ A ‖1 B max j Σi |Ai j |. It yields that ρ(A) = 1
for each (left/column-)stochastic matrix A.
3 Let the square matrix A now be non-negative, that is, satisfy Ai j ≥ 0, and let x
be a positive vector, so each xi > 0. If Ax ≤ r · x with r > 0, then ρ(A) ≤ r.

Proof. As mentioned, we skip the proof of Gelfand’s formula. If A is
stochastic, one gets:

‖ A ‖1 = max j Σi |Ai j | = max j Σi Ai j = max j 1 = 1.

Since stochastic matrices are closed under matrix multiplication, one gets ‖ An ‖1 =
1 for each n. Hence ρ(A) = 1 via Gelfand’s formula.
We next show how the third item can be obtained from the first one, as
in [98, Cor. 8.2.2]. By assumption, each entry xi in the (finite) vector x is
positive. Let’s write x− for the least one and x+ for the greatest one. Then
0 < x− ≤ xi ≤ x+ for each i. For each n we have:

‖ An ‖1 · x− = max j ( Σi (An )i j ) · x−
            ≤ max j Σi (An )i j · xi
            = max j (An x) j
            ≤ max j rn · x j
            ≤ rn · x+ .
2 With the help of Harald Woracek and Ana Sokolova.


Hence:

( ‖ An ‖1 )^(1/n) ≤ ( rn · x+ /x− )^(1/n) = r · ( x+ /x− )^(1/n) .

Thus, by Gelfand’s formula in the first item,

ρ(A) = limn→∞ ( ‖ An ‖1 )^(1/n) ≤ limn→∞ r · ( x+ /x− )^(1/n) = r · 1 = r.
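Gelfand's formula can be illustrated with a tiny stdlib-only computation (the matrix is an example of our own): for an upper-triangular A with eigenvalues 2 and 3 the estimates ( ‖A^n‖1 )^(1/n) approach the spectral radius 3 from above.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def norm1(A):
    # the 1-norm: maximal absolute column sum
    n = len(A)
    return max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))

A = [[2.0, 1.0], [0.0, 3.0]]   # upper triangular, eigenvalues 2 and 3
P, ests = A, []
for n in range(1, 61):
    ests.append(norm1(P) ** (1.0 / n))
    P = mat_mul(P, A)
print(ests[0], ests[-1])   # starts at 4.0, gets close to 3 by n = 60
```

Since ‖A^n‖ ≥ ρ(A)^n for every matrix norm, each estimate stays above the spectral radius while converging to it.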

For the remainder of this subsection the setting is as follows. Let ω ∈ D(X)
be a fixed state, with an n-test p1 , . . . , pn ∈ Pred (X), so that >i pi = 1, by
definition (of test). We shall assume that the validities vi B ω |= pi are non-
zero. Notice that Σi vi = 1. We organise these validities in a vector v and in a
diagonal matrix V:

v B (v1 , . . . , vn )T = (ω |= p1 , . . . , ω |= pn )T        V B the diagonal matrix with entries v1 , . . . , vn .

In addition, we use two n × n (real, non-negative) matrices B and C given by:

Bi j B (ω |= pi & p j ) / ( (ω |= pi ) · (ω |= p j ) )
                                                                        (6.9)
Ci j B ω| p j |= pi = (ω |= pi & p j ) / (ω |= p j ) = (ω |= pi ) · Bi j .

In the last line we use Bayes’ product rule from Theorem 6.1.3 (1). The next
series of facts is extracted from [50].

Lemma 6.7.12. The above matrices B and C satisfy the following properties.

1 The matrix B is non-negative and symmetric, and satisfies Bii ≥ 1 and B v =
1. Moreover, B is positive definite, so that its eigenvalues are positive reals.
2 As a result, the inverse B−1 and square root B1/2 exist, and B−1/2 too.
3 The matrix C of conditional expectations is (left/column-)stochastic and thus
its spectral radius ρ(C) equals 1, by Theorem 6.7.11 (2). Moreover, C satis-
fies C v = v and C = V · B.
4 For an n × n real matrix D, ρ(DC) = ρ( B1/2 DV B1/2 ).
5 Assume now that D is a diagonal matrix with numbers d1 , . . . , dn ≥ 0 on its
diagonal. Then:

Σi di · vi ≤ ρ(DC).


Proof. 1 Clearly, Bi j = B ji , and Bii ≥ 1 by (the byproduct in) Lemma 5.1.3.
Further,

(B v)i = Σ j ( (ω |= pi & p j ) / ((ω |= pi ) · (ω |= p j )) ) · (ω |= p j )
       = (ω |= pi & (> j p j )) / (ω |= pi ) = (ω |= pi & 1) / (ω |= pi ) = (ω |= pi ) / (ω |= pi ) = 1.

The matrix B is positive definite since for a non-zero vector z = (zi ) of reals:

zT B z = Σi, j zi · ( (ω |= pi & p j ) / ((ω |= pi ) · (ω |= p j )) ) · z j
       = ω |= ( Σi (zi /vi ) · pi ) & ( Σ j (z j /v j ) · p j )
       = ω |= q & q        for q = Σi (zi /vi ) · pi
       > 0.

We have a strict inequality > here since q ≥ (z1 /v1 ) · p1 and ω |= p1 > 0, by
assumption; in fact this holds for each pi . Thus:

ω |= q & q ≥ (z1 /v1 )2 · (ω |= p1 & p1 ) ≥ (z1 /v1 )2 · (ω |= p1 )2 > 0.

2 The square root B1/2 and inverse B−1 are obtained in the standard way via
spectral decomposition B = QΛQT where Λ is the diagonal matrix of eigen-
values λi > 0 and Q is an orthogonal matrix (so QT = Q−1 ). Then: B1/2 =
QΛ1/2 QT where Λ1/2 has entries λi 1/2 . Similarly, B−1 = QΛ−1 QT , and B−1/2 =
QΛ−1/2 QT .

3 It is easy to see that all C’s columns add up to one:

Σi Ci j = Σi ω| p j |= pi = ω| p j |= >i pi = ω| p j |= 1 = 1.

This makes C left-stochastic, so that ρ(C) = 1. Next:

(C v)i = Σ j (ω| p j |= pi ) · (ω |= p j )
       = Σ j ω |= pi & p j        by Proposition 6.1.3 (1)
       = ω |= pi & (> j p j ) = ω |= pi & 1 = ω |= pi = vi .

Further, (V B)i j = vi · Bi j = Ci j .


4 We show that DC and B1/2 DV B1/2 have the same eigenvalues, which gives
ρ(DC) = ρ(B1/2 DV B1/2 ). First, let DC z = λz. Taking z0 = B1/2 z one gets:

B1/2 DV B1/2 z0 = B1/2 DV B1/2 B1/2 z = B1/2 DCz = B1/2 λz = λB1/2 z = λz0 .

In the other direction, let B1/2 DV B1/2 w = λw. Now take w0 = B−1/2 w, so that:

DCw0 = B−1/2 B1/2 DV BB−1/2 w = B−1/2 B1/2 DV B1/2 w = B−1/2 λw = λw0 .





5 We use the standard fact that for non-zero vectors z one has:

| (Az, z) | / (z, z) ≤ ρ(A),

where (−, −) is the inner product. In particular,

| (B1/2 DV B1/2 z, z) | / (z, z) ≤ ρ( B1/2 DV B1/2 ) = ρ(DC).

We instantiate with z = B1/2 v and use that V Bv = Cv = v and Bv = 1 in:

ρ(DC) ≥ | (B1/2 DV B1/2 B1/2 v, B1/2 v) | / (B1/2 v, B1/2 v) = | (DV Bv, B1/2 B1/2 v) | / (v, B1/2 B1/2 v)
      = | (Dv, Bv) | / (v, Bv) = | (Dv, 1) | / (v, 1) = | Σi di · vi | / Σi vi = Σi di · vi .

We are now close to proving the inequality of Proposition 6.7.10, which is
the aim of this subsection. Consider a vector r of non-zero numbers ri ∈ (0, 1]
with Σi ri = 1. We form a diagonal matrix D with non-zero diagonal entries
d1 , . . . , dn with:

di B ri / (Cr)i = ri / ( Σ j Ci j · r j ).

A crucial observation is that r is an eigenvector of the matrix DC, with eigen-
value 1, since:

(DCr)i = Σ j (DC)i j · r j = Σ j di · Ci j · r j = ( ri / (Cr)i ) · Σ j Ci j · r j = ri .

Theorem 6.7.11 (3) now yields ρ(DC) = 1.
By Lemma 6.7.12 (5) we get the required inequality in Proposition 6.7.10:

Σi ri · vi / (Cr)i = Σi di · vi ≤ ρ(DC) = 1.
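This closing argument can be replayed concretely. The sketch below (borrowing the state and test of Exercise 6.7.11 below, with an arbitrary weight vector r of our own) builds C and D, checks that r is a fixed point of DC, and evaluates the bound Σi di · vi ≤ 1.

```python
omega = {'a': 0.25, 'b': 0.5, 'c': 0.25}
p = [{'a': 0.5, 'b': 0.5,  'c': 0.0},
     {'a': 0.5, 'b': 0.25, 'c': 0.5},
     {'a': 0.0, 'b': 0.25, 'c': 0.5}]
r = [0.3, 0.45, 0.25]   # arbitrary positive weights summing to 1

def validity(state, pred):
    return sum(state[x] * pred[x] for x in state)

def update(state, pred):
    val = validity(state, pred)
    return {x: state[x] * pred[x] / val for x in state}

v = [validity(omega, pi) for pi in p]
C = [[validity(update(omega, p[j]), p[i]) for j in range(3)] for i in range(3)]
Cr = [sum(C[i][j] * r[j] for j in range(3)) for i in range(3)]
d = [r[i] / Cr[i] for i in range(3)]

DCr = [d[i] * Cr[i] for i in range(3)]       # r is a fixed point of DC
bound = sum(d[i] * v[i] for i in range(3))   # the quantity bounded by rho(DC) = 1
print(DCr, bound)
```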

Exercises
6.7.1 Check for yourself the claimed outcomes in Example 6.7.5:
1 ω|π2 = 1e [1, 0] = 4/5 | c i + 1/5 | c⊥ i;
2 ω0 B (π1 )†ω = ρ = 1/10 | c, e i + 1/40 | c, e⊥ i + 7/40 | c⊥ , e i + 7/10 | c⊥ , e⊥ i;
3 ω0 |π2 = 1e [1, 0] = 4/11 | c i + 7/11 | c⊥ i.

6.7.2 This example is taken from [39], where it is attributed to Whitworth:
there are three contenders A, B, C for winning a race, with a priori dis-
tribution ω = 2/11 | A i + 4/11 | B i + 5/11 |C i. Surprising information comes
in that A’s chances have become 1/2. What are the adapted chances of
B and C?


1 Split up the sample space X = {A, B, C} into a suitable two-element
partition via a function f : X → 2.
2 Use the uniform distribution unif = 1/2 | 1 i + 1/2 | 0 i on 2 and show that
Jeffrey’s rule gives as adapted distribution:

f †ω = unif = 1/2 · ω|1{A} + 1/2 · ω|1{B,C} = 1/2 | A i + 2/9 | B i + 5/18 |C i.

6.7.3 The next illustration is attributed to Jeffrey, and reproduced for in-
stance in [19, 33]. We consider three colors: green (g), blue (b) and
violet (v), which are combined in a space C = {g, b, v}. These colors
apply to cloths, which can additionally be sold or not, as represented
by the space S = {s, s⊥ }. There is a prior joint distribution ω on C × S ,
namely:

ω = 3/25 | g, s i + 9/50 | g, s⊥ i + 3/25 | b, s i + 9/50 | b, s⊥ i + 8/25 | v, s i + 2/25 | v, s⊥ i.
A cloth is inspected by candlelight and the following likelihoods are
reported per color: 70% certainty that it is green, 25% that it is blue,
and 5% that it is violet.
1 Compute the two marginals ω1 B ω[1, 0] ∈ D(C) and ω2 B
ω[0, 1] ∈ D(S ) and show that we can write the joint state ω in
two ways as:

⟨c, id ⟩ = ω2 = ω = ⟨id , d⟩ = ω1

for channels c : S → C and d : C → S , given by:

c(s) = 3/14 | g i + 3/14 | b i + 8/14 | v i        d(g) = 2/5 | s i + 3/5 | s⊥ i
c(s⊥ ) = 9/22 | g i + 9/22 | b i + 2/11 | v i      d(b) = 2/5 | s i + 3/5 | s⊥ i
                                                  d(v) = 4/5 | s i + 1/5 | s⊥ i.
These channels c, d are each other’s daggers, see Theorem 6.6.5 (1).
2 Capture the above inspection evidence as a predicate q on C and
show that Pearl’s rule gives:

ω2 |c = q = ω|q⊗1 [0, 1] = d = (ω1 |q ) = 26/61 | s i + 35/61 | s⊥ i.
3 Describe the evidence now as a state τ on C and show that Jeffrey’s
rule gives:

d = τ = ( (π1 )†ω = τ )[0, 1] = 21/50 | s i + 29/50 | s⊥ i.
4 Check also that:

⟨id , d⟩ = τ = (π1 )†ω = τ = 7/25 | g, s i + 21/50 | g, s⊥ i + 1/10 | b, s i
                          + 3/20 | b, s⊥ i + 1/25 | v, s i + 1/100 | v, s⊥ i.


The latter outcome is given in [33].


6.7.4 The following alarm example is in essence due to Pearl [133], see
also [33, §3.6]; we have adapted the numbers in order to make the
calculations a little bit easier. There is an ‘alarm’ set A = {a, a⊥ } and a
‘burglary’ set B = {b, b⊥ }, with the following a priori joint distribution
on A × B.

ω = 1/200 | a, b i + 7/500 | a, b⊥ i + 1/1000 | a⊥ , b i + 98/100 | a⊥ , b⊥ i.

Someone reports that the alarm went off, but with only 80% certainty
because of deafness.
1 Translate the alarm information into a predicate p : A → [0, 1] and
show that crossover updating leads to a burglary distribution:

ω| p⊗1 [0, 1] = 3/151 | b i + 148/151 | b⊥ i ≈ 0.02| b i + 0.98| b⊥ i.

2 Compute the extracted channel c B π1 ◦· (π2 )†ω : B → A as in Theo-
rem 6.6.5, and express the answer in the previous item in terms of
Pearl’s update rule / backward inference using c.
3 Use the Bayesian inversion/dagger d B c†ω[0,1] : A → B of this
channel c to calculate the outcome of Jeffrey’s update rule as:

d = (4/5 | a i + 1/5 | a⊥ i) = 19639/93195 | b i + 73556/93195 | b⊥ i
                            ≈ 0.21| b i + 0.79| b⊥ i.

(Notice again the considerable difference in outcomes between Pearl


and Jeffrey.)
6.7.5 Recall from Example 6.7.2 the Jeffrey update:

σ1 B c†ω = τ = 278/519 | d i + 241/519 | d⊥ i.

Suppose we have another evidence state ρ = 3/5 | p i + 2/5 | n i. Compute:

1 c†σ1 = ρ;
2 σ2 B c†ω = ρ;
3 c†σ2 = τ.
The first and third items produce different results, which proves Propo-
sition 6.7.6 (4).
6.7.6 Let ω ∈ D(X) be a state.


1 Consider a predicate p : X → [0, 1] as a channel p : X → 2, using
that D(2) ≅ [0, 1], see Exercises 4.3.5 and 7.6.1. Show that the
binary version of the law of total probability, see Exercise 6.1.4,
can be expressed via Jeffrey’s rule as:

ω = p†ω = ( (ω |= p)| 1 i + (ω |= p⊥ )| 0 i ).

2 Let c : X → n be a channel and define n predicates on X by pi B
c = 1i . Show that these p0 , . . . , pn−1 form a test:

>i pi = 1        with        c†ω (i) = ω| pi .

3 Conclude that the equation c†ω = (c = ω) = ω from Proposi-
tion 6.6.4 (1) can also be understood as the (n-ary version of the)
law of total probability.
6.7.7 Consider a channel c : X → Y with initial state σ ∈ D(X) and ev-
idence state τ ∈ D(Y). Assume that τ is a point state 1| z i, for some
z ∈ Y. Write ω B ⟨id , c⟩ = σ ∈ D(X × Y).
1 Show that the Jeffrey-updated joint state ω0 B (π2 )†ω = τ ∈ D(X ×
Y) is of the form:

ω0 = σ|c = 1z ⊗ 1| z i.

2 Verify that the double-dagger channel c0 B (c†σ )†τ : X → Y is ex-
tremely trivial, namely a ‘constant-point’ channel, that is, of the
form c0 (x) = 1| z i for all x ∈ X.
6.7.8 Jeffrey’s rule is frequently formulated (notably in [61], to which we
refer for details) and used (like in Exercise 6.7.2) in situations where
the channel involved is deterministic. Consider an arbitrary function
f : X → I, giving a partition of the set X via subsets Ui B f −1 (i) =
{x ∈ X | f (x) = i}. Let ω ∈ D(X) be a prior.
1 Show that applying Jeffrey’s rule to a new state of affairs ρ ∈ D(I)
gives as posterior:

f †ω = ρ = Σi∈I ρ(i) · ω|1Ui        satisfying        f = ( f †ω = ρ) = ρ.

2 Prove the following minimal-distance result:

d( f †ω = ρ, ω) = ⋀ { d(ω, ω0 ) | ω0 ∈ D(X) with f = ω0 = ρ },

where d is the total variation distance from Section 4.5.


6.7.9 Prove that for a general, not necessarily deterministic, channel c : X → Y
with prior state ω ∈ D(X) and state ρ ∈ D(Y) there is an inequality:

d(c†ω = ρ, ω) ≤ ⋀ω0 ∈D(X) ( d(ω, ω0 ) + d(c = ω0 , ρ) ).

6.7.10 Show that:

DKL (c†σ = τ, σ) ≤ DKL (τ, c = σ)
DKL (σ, c†σ = τ) ≤ DKL (c = σ, τ).
6.7.11 Consider the set {a, b, c} with distribution and test of predicates:

                                          p1 = 1/2 · 1a + 1/2 · 1b
ω = 1/4 | a i + 1/2 | b i + 1/4 | c i     p2 = 1/2 · 1a + 1/4 · 1b + 1/2 · 1c
                                          p3 = 1/4 · 1b + 1/2 · 1c .

1 Check that the two matrices B, C defined in (6.9) are:

    B = ( 4/3   8/9   2/3 )         C = ( 1/2   1/3   1/4 )
        ( 8/9  10/9    1  )   and       ( 1/3   5/12  3/8 )
        ( 2/3    1    3/2 )             ( 1/6   1/4   3/8 ) .

2 Show that C has eigenvalues 1, of course, and (21 + 3√17)/144 ≈ 0.2317
  and (21 − 3√17)/144 ≈ 0.0599, both below 1.
3 Check that the inequality in Proposition 6.7.10 amounts to: for pos-
  itive r1 , r2 , r3 ∈ (0, 1] with r1 + r2 + r3 = 1 one has:

    1 ≥ (3/8)r1 / ( (1/2)r1 + (1/3)r2 + (1/4)r3 )
          + (3/8)r2 / ( (1/3)r1 + (5/12)r2 + (3/8)r3 )
          + (1/4)r3 / ( (1/6)r1 + (1/4)r2 + (3/8)r3 )
      = 9r1 / (12r1 + 8r2 + 6r3 ) + 9r2 / (8r1 + 10r2 + 9r3 ) + 6r3 / (4r1 + 6r2 + 9r3 ).

Try to prove this inequality via analytical means.
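The claimed matrices can be checked exactly with a short script (our own encoding, using Python's `fractions` for exact arithmetic); the trace of C also matches the sum of the three eigenvalues in item 2.

```python
from fractions import Fraction as F

omega = {'a': F(1, 4), 'b': F(1, 2), 'c': F(1, 4)}
p = [{'a': F(1, 2), 'b': F(1, 2), 'c': F(0)},      # p1
     {'a': F(1, 2), 'b': F(1, 4), 'c': F(1, 2)},   # p2
     {'a': F(0),    'b': F(1, 4), 'c': F(1, 2)}]   # p3

def validity(state, pred):
    return sum(state[x] * pred[x] for x in state)

def conj(p1, p2):
    # the conjunction p1 & p2 of fuzzy predicates: pointwise product
    return {x: p1[x] * p2[x] for x in p1}

v = [validity(omega, pi) for pi in p]
B = [[validity(omega, conj(p[i], p[j])) / (v[i] * v[j]) for j in range(3)]
     for i in range(3)]
C = [[v[i] * B[i][j] for j in range(3)] for i in range(3)]
print(B)
print(C)
```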

6.8 Frequentist and Bayesian discrete probability


Here, at the end of Chapter 6, we have seen the basic concepts in (our treatment
of) discrete probability theory, namely: distributions (states), predicates, valid-
ity, updating. We shall further explore these notions in subsequent chapters ??
– ??, but at this point we would like to sit back and reflect on what we have
seen so far, from a more abstract perspective.
We start with a fundamental result, for which we recall that the set
Pred(X) = [0, 1]^X of fuzzy predicates on a set X has the structure of an
effect module: there is truth and falsity 1, 0, orthocomplement p⊥, partial
sum p ⩒ q, and scalar multiplication r · p, see Subsection 4.2.3.
   The next result is a Riesz-style representation result, representing
distributions on X via a double dual [0, 1]^([0, 1]^X).
Theorem 6.8.1. Let X be a finite set. There is then a ‘representation’
isomorphism,

      V : D(X)  ≅  EMod(Pred(X), [0, 1])                            (6.10)

given by validity: V(ω)(p) ≔ ω |= p.



This result uses the notation EMod(Pred(X), [0, 1]) for the ‘hom’ set of
homomorphisms of effect modules Pred(X) → [0, 1].
Proof. We have already seen that V(ω) : Pred(X) → [0, 1] preserves the effect
module structure, see Lemma 4.2.6. In order to show that it is an
isomorphism, let h : Pred(X) → [0, 1] be a homomorphism of effect modules. It
gives rise to a distribution:

      V⁻¹(h) ≔ Σ_{x∈X} h(1_x) |x⟩  ∈  D(X).

These V and V⁻¹ are each other’s inverses:

      (V⁻¹ ◦ V)(ω) = Σ_{x∈X} V(ω)(1_x) |x⟩ = Σ_{x∈X} (ω |= 1_x) |x⟩
                   = Σ_{x∈X} ω(x) |x⟩ = ω

      (V ◦ V⁻¹)(h)(p) = (Σ_{x∈X} h(1_x) |x⟩) |= p = Σ_{x∈X} h(1_x) · p(x)
                      = h(⩒_{x∈X} p(x) · 1_x) = h(p).

In the last line we use that h is a homomorphism of effect modules and that
the predicate p has a normal form ⩒_x p(x) · 1_x, see Lemma 4.2.3 (2).
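The two directions of the representation isomorphism are easy to sketch in code. Below is a minimal Python rendering (dicts play the role of both distributions and fuzzy predicates; the names V and V_inv are mine, not the book's), showing that V⁻¹(V(ω)) returns ω.

```python
# Sketch of the representation isomorphism of Theorem 6.8.1: a distribution
# ω yields the validity map V(ω) = p ↦ ω |= p, and is recovered from any
# such map h via V⁻¹(h) = Σ_x h(1_x)|x⟩.

X = ['a', 'b', 'c']
omega = {'a': 0.25, 'b': 0.5, 'c': 0.25}

def V(w):                               # validity: ω |= p = Σ_x ω(x)·p(x)
    return lambda p: sum(w[x] * p[x] for x in w)

def V_inv(h):                           # evaluate h on the point predicates 1_x
    return {x: h({y: 1.0 if y == x else 0.0 for y in X}) for x in X}

recovered = V_inv(V(omega))             # equals omega again
```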
   There are two leading interpretations of probability, namely a frequentist
interpretation and a Bayesian interpretation. The frequentist approach treats
probability distributions as records of probabilities of occurrences,
obtained from long-term accumulations, e.g. via frequentist learning. One can
associate this view with the set D(X) of distributions on the left-hand-side
of the representation isomorphism (6.10). The right-hand-side fits the
Bayesian view, which focuses on assigning probabilities to belief functions
(predicates), see [38], in this situation via special functions
Pred(X) → [0, 1] that preserve the effect module structure. The
representation isomorphism demonstrates that the frequentist and Bayesian
interpretations of probability are tightly connected, at least in finite
discrete probability. We shall see a similar isomorphism below, in
Theorem 6.8.4, for infinite discrete probability. The situation for
continuous probability is more subtle and will be described in ??.
   But first we look at how operations on distributions, like
marginalisation, tensor product, and updating, work across the above
representation isomorphism (6.10). The next result establishes a close
connection between operations on distributions and logical operations, for
instance between marginalisation and weakening. In the case of updating
(conditioning) of a distribution, we see that the corresponding logical
formulation is the common formulation of conditional probability as a
fraction of probability assignments. Not surprisingly, the connection relies
on Bayes' rule.

Proposition 6.8.2. Fix finite sets X, Y.

1 Let h : Pred(X × Y) → [0, 1] be a map of effect modules. Define first and
  second marginals h[1, 0] : Pred(X) → [0, 1] and h[0, 1] : Pred(Y) → [0, 1]
  of h as:

      h[1, 0](p) ≔ h(p ⊗ 1)   and   h[0, 1](q) ≔ h(1 ⊗ q).

  Then: V⁻¹(h[1, 0]) = V⁻¹(h)[1, 0] and V⁻¹(h[0, 1]) = V⁻¹(h)[0, 1].

2 Let h : Pred(X) → [0, 1] and k : Pred(Y) → [0, 1] be maps of effect
  modules. We define their tensor product h ⊗ k : Pred(X × Y) → [0, 1] as:

      (h ⊗ k)(r) ≔ h(x ↦ k(r(x, −))) = k(y ↦ h(r(−, y))).

  Then: V⁻¹(h ⊗ k) = V⁻¹(h) ⊗ V⁻¹(k).

3 Let h : Pred(X) → [0, 1] be a map of effect modules and let p ∈ Pred(X) be
  a predicate with h(p) ≠ 0. We define an update h|_p : Pred(X) → [0, 1] as:

      h|_p(q) ≔ h(p & q) / h(p).

  Then: V⁻¹(h|_p) = V⁻¹(h)|_p.


Proof. 1 We consider the first marginal only. It is easy to see that the map
h[1, 0] : Pred(X) → [0, 1], as defined above, preserves the effect module
structure. We get an equality of distributions since for x ∈ X,

      V⁻¹(h[1, 0])(x) = h[1, 0](1_x) = h(1_x ⊗ 1)
                      = h(1_x ⊗ ⩒_{y∈Y} 1_y)
                      = Σ_{y∈Y} h(1_x ⊗ 1_y)
                      = Σ_{y∈Y} h(1_{(x,y)})
                      = Σ_{y∈Y} V⁻¹(h)(x, y) = V⁻¹(h)[1, 0](x).

2 For elements u ∈ X and v ∈ Y we have:

      V⁻¹(h ⊗ k)(u, v) = (h ⊗ k)(1_{(u,v)})
                       = h(x ↦ k(1_{(u,v)}(x, −)))
                       = h(x ↦ k(1_u(x) · 1_v))
                       = h(x ↦ 1_u(x) · k(1_v))
                       = h(k(1_v) · 1_u)
                       = k(1_v) · h(1_u)
                       = V⁻¹(h)(u) · V⁻¹(k)(v)
                       = (V⁻¹(h) ⊗ V⁻¹(k))(u, v).

  The same can be shown for the other formulation of h ⊗ k in item (2).
3 Again, h|_p : Pred(X) → [0, 1] is a map of effect modules. Next, for x ∈ X,

      V⁻¹(h|_p)(x) = h|_p(1_x) = h(p & 1_x) / h(p)
                   = (V⁻¹(h) |= p & 1_x) / (V⁻¹(h) |= p)
                   = V⁻¹(h)|_p |= 1_x      by Bayes, see Theorem 6.1.3 (2)
                   = V⁻¹(h)|_p(x).
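Item 3 of the proposition can be illustrated numerically. The sketch below uses hypothetical example data and my own helper name `validity`; it checks that updating the validity map h = V(ω) by p agrees with validity in the updated distribution ω|_p.

```python
# Numerical sketch of Proposition 6.8.2 (3): updating the validity map
# h = V(ω) by a predicate p, via h|_p(q) = h(p & q)/h(p), matches validity
# in the updated distribution ω|_p.

X = ['a', 'b', 'c']
omega = {'a': 0.2, 'b': 0.5, 'c': 0.3}
p = {'a': 0.5, 'b': 0.25, 'c': 1.0}     # fuzzy predicates on X
q = {'a': 1.0, 'b': 0.0,  'c': 0.5}

def validity(w, r):                     # w |= r
    return sum(w[x] * r[x] for x in X)

# logical side: h|_p(q) = h(p & q)/h(p), with (p & q)(x) = p(x)·q(x)
h_upd = validity(omega, {x: p[x] * q[x] for x in X}) / validity(omega, p)

# distribution side: ω|_p(x) = ω(x)·p(x) / (ω |= p)
wp = validity(omega, p)
omega_p = {x: omega[x] * p[x] / wp for x in X}
# now h_upd coincides with validity(omega_p, q)
```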

   In order to extend this result to discrete distributions in D∞(X) with
possibly infinite support we need to add ω-joins to effect algebras/modules.

Definition 6.8.3. 1 A sequence (or chain) of elements a_n ∈ A, for n ∈ N, in
  a poset A is called ascending if a_n ≤ a_{n+1} for each n. One says that
  the poset A is ω-complete if each ascending chain (a_n) has a join
  ⋁_n a_n ∈ A.
2 A monotone function f : A → B between ω-complete posets A, B is called
  ω-continuous if it preserves joins of ascending chains, that is, if
  f(⋁_n a_n) = ⋁_n f(a_n), for each ascending chain (a_n) in A.
3 An ω-effect algebra is an effect algebra which is ω-complete as a poset,
  using the order described in Exercise 4.2.14. We shall write ω-EA for the
  category with ω-effect algebras as objects, and with ω-continuous effect
  algebra maps as morphisms.
4 Similarly, an ω-effect module is an ω-complete effect module. The category
  ω-EMod contains such ω-effect modules, with ω-continuous effect module
  maps between them.

   For each set X, the powerset P(X) is an effect algebra with arbitrary
joins, so in particular ω-joins. The unit interval [0, 1] is an ω-effect
module, and more generally, each set of predicates Pred(X) = [0, 1]^X is an
ω-effect module, via pointwise joins. Later on, in continuous probability,
we shall see measurable spaces whose sets of measurable subsets form
examples of ω-effect algebras.
   We recall from Exercise 4.2.15 that the sum operation ⩒ of an ω-effect
algebra is continuous in both arguments separately. Explicitly:
⋁_n (x ⩒ y_n) ≤ x ⩒ ⋁_n y_n. Further, if there are joins ⋁, there are also
meets ⋀, via orthosupplement.

Theorem 6.8.4. For each countable set X there is a representation
isomorphism,

      V : D∞(X)  ≅  ω-EMod(Pred(X), [0, 1])                         (6.11)

also given by validity: V(ω)(p) ≔ ω |= p.

Proof. Much of this works as in the proof of Theorem 6.8.1, except that we
have to prove that V(ω) : Pred(X) → [0, 1] is ω-continuous, now that we have
ω ∈ D∞(X). So let (p_n) be an ascending chain of predicates, with pointwise
join p = ⋁_n p_n. Then:

      V(ω)(⋁_n p_n) = ω |= ⋁_n p_n = Σ_{x∈X} ω(x) · (⋁_n p_n)(x)
                    = Σ_{x∈X} ω(x) · ⋁_n p_n(x)
                    = Σ_{x∈X} ⋁_n ω(x) · p_n(x)
                    (∗)
                    =  ⋁_n Σ_{x∈X} ω(x) · p_n(x) = ⋁_n V(ω)(p_n).

The direction (≥) of the marked equation (∗) holds by monotonicity. For (≤)
we reason as follows. Since X is countable, we can write it as
X = {x_1, x_2, . . .}. For each N ∈ N we can use that finite sums preserve
ω-joins in:

      Σ_{k≤N} ⋁_n ω(x_k) · p_n(x_k) = ⋁_n Σ_{k≤N} ω(x_k) · p_n(x_k)
                                    ≤ ⋁_n Σ_{x∈X} ω(x) · p_n(x).

Hence:

      Σ_{x∈X} ⋁_n ω(x) · p_n(x) = lim_{N→∞} Σ_{k≤N} ⋁_n ω(x_k) · p_n(x_k)
                                ≤ ⋁_n Σ_{x∈X} ω(x) · p_n(x).

   The inverse of V in (6.11) is defined as a countable formal sum, using
that X is countable: for a homomorphism of ω-effect modules
h : Pred(X) → [0, 1] we take:

      V⁻¹(h) ≔ Σ_{x∈X} h(1_x) |x⟩.

In order to see that the probabilities h(1_x) add up to one, we write
X = {x_1, x_2, . . .}. We express the sum over these x_i via a join of an
ascending chain:

      Σ_{x∈X} h(1_x) = ⋁_n Σ_{k≤n} h(1_{x_k}) = ⋁_n h(⩒_{k≤n} 1_{x_k})
                     = ⋁_n h(1_{{x_1,...,x_n}}) = h(1) = 1.

   The argument that V and V⁻¹ are each other’s inverses works essentially
in the same way as in the proof of Theorem 6.8.1.

Exercises

6.8.1 Consider the representation isomorphism in Theorem 6.8.1.
      1 Show that a uniform distribution corresponds to the mapping that
        sends a predicate to its average value.
      2 Show that for finite sets X, Y and maps h ∈ EMod(Pred(X), [0, 1])
        and k ∈ EMod(Pred(Y), [0, 1]) one has:

            (h ⊗ k)[1, 0] = h.

6.8.2 Recall from Theorem 4.2.5 that Pred(X) is the free effect module on
      the effect algebra P(X), for a finite set X.
      1 Deduce from this fact that there is an isomorphism:

            EA(P(X), [0, 1]) ≅ EMod(Pred(X), [0, 1]).

        Describe this isomorphism in detail.

      2 Conclude that an alternative version of the representation
        isomorphism (6.10) is:

            D(X) ≅ EA(P(X), [0, 1]).

6.8.3 Recall the Poisson distribution pois[λ] = Σ_{k∈N} e^{−λ} · (λ^k / k!) |k⟩
      ∈ D∞(N) from (2.7). Describe the corresponding homomorphism of
      ω-effect modules Pred(N) → [0, 1] via (6.11).
6.8.4 Reformulate Proposition 4.3.3 as naturality for the representation iso-
morphism V in (6.10) with respect to channels.
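For Exercise 6.8.3, the homomorphism sends a predicate p on N to the validity pois[λ] |= p = Σ_k e^{−λ} · λ^k/k! · p(k). A truncated-series sketch of this map (the cutoff and all names are my own choices):

```python
# Sketch for Exercise 6.8.3: the ω-effect-module map Pred(N) → [0, 1]
# induced by pois[λ] is  h(p) = Σ_k e^{-λ}·λ^k/k!·p(k).  The absolutely
# convergent series is approximated by truncation.
import math

def poisson_validity(lam, p, cutoff=100):
    total, term = 0.0, math.exp(-lam)    # term = e^{-λ}·λ^k/k!  at k = 0
    for k in range(cutoff):
        total += term * p(k)
        term *= lam / (k + 1)
    return total

h = lambda p: poisson_validity(2.0, p)
truth = h(lambda k: 1.0)                          # h preserves truth: ≈ 1
even = h(lambda k: 1.0 if k % 2 == 0 else 0.0)    # = (1 + e^{-2λ})/2
```

That truth is (approximately) preserved is exactly the normalisation argument in the proof of Theorem 6.8.4.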

7 Directed Graphical Models

In the previous chapters we have seen several examples of graphs in probabilis-


tic reasoning, see for instance Figures 2.3, 2.4 or 6.6. At this stage a more sys-
tematic, principled approach will be developed in terms of so-called string di-
agrams. The latter are directed graphs that are built up inductively from boxes
(as nodes) with wires (as edges) between them. These boxes can be interpreted
as probabilistic channels. The wires are typed, where a type is a finite set that
is associated with a wire. These string diagrams are convenient for depicting
arrangements of channels and states, and for reasoning about them.
On a theoretical level we make a clear distinction between syntax and se-
mantics, where graphs are introduced as syntactic objects, defined inductively,
like terms in some formal language based on a signature or grammar. Graphs
can be given a semantics via an interpretation as suitable collections of
channels. In practice, the distinction between syntax and semantics is not so sharp:
we shall frequently use graphs in interpreted form, essentially by describing
a structured collection of channels as interpretation of a graph. This is like
what happens in (mathematical) practice with algebraic notions like group or
monoid: one can precisely formulate a syntax for monoids with operation sym-
bols that are interpreted as actual functions. But one can also use monoids as
sets with specific functions. The latter form is often most convenient. But
it is important to be aware of the difference between syntax and semantics.
String diagrams are similar to the graphs used for Bayesian networks. A no-
table difference is that they have explicit operations for copying and discard-
ing . This makes them more expressive (and more useful). String diagrams
are the language of choice to reason about a large class of mathematical mod-
els, called symmetric monoidal categories, see [149], including categories of
channels (of powerset, multiset or distribution type). Once we have seen string
diagrams, we will describe Bayesian networks only in string diagrammatic
form. The most notable difference is that we then write copying as an explicit


operation, as we have already started doing in Section 2.6. The language of


string diagrams also makes it easier to turn a Bayesian network into a joint
distribution (state). String diagrams turn out to be useful not only for repre-
sentation (of joint distributions) but also for reasoning (using both forward and
backward inference). We shall use string diagrams also for other probabilistic
models, like Markov chains and hidden Markov models in Section 7.4.
One of the main themes in this chapter is how to move back and forth be-
tween a joint state/distribution, say on A, B, and a channel A → B. The
process of extracting a channel from a joint state is called disintegration. It is
the process of turning a joint distribution P(a, b) into a conditional distribution
P(b | a). It gives direction to a probabilistic relationship. It is this disintegration
technique that allows us to represent a joint state, on a product of (multiple)
sets, via a graph structure with multiple channels, like in Bayesian networks,
by performing disintegration multiple times. More explicitly, it is via disinte-
gration that the conditional probability tables of a Bayesian network — which,
recall, correspond to channels — can be obtained from a joint distribution.
The latter can be presented as a (possibly large) empirical distribution, analo-
gously to the (small) medicine - blood pressure example in Subsections 1.4.1
and 1.4.3.
Via the correspondence between joint states and channels one can invert
channels. We have already had a first look at this Bayesian inversion — as
dagger of channel — in Section 6.6; we used it in Jeffrey’s update rule in Sec-
tion 6.7. In this chapter we take a more systematic look at inversion. It is now
obtained by first turning a channel A → B into a joint state on A, B, then into
a joint state on B, A by swapping, and then again into a channel B → A. It
amounts to turning a conditional probability P(b | a) into P(a | b), essen-
tially by Bayes’ rule. As we have seen, this channel inversion is a fundamental
operation in Bayesian probability, which bears resemblance to the adjoint transpose
(−)† of complex matrices in quantum theory. The previous chapter showed the
role of inverted channels in probabilistic reasoning. The current chapter illus-
trates their role in machine learning, esp. in naive Bayesian classification and
in learning a decision tree, see Section 7.7.
For readers interested in the more categorical aspects of Bayesian inversion,
Section 7.8 elaborates how a suitable category of channels has a well-behaved
dagger functor. This category of channels is shown to be equivalent to a cate-
gory of couplings. It mimics the categorical structure underlying the relation-
ship between functions X → P(Y) and relations on X × Y. This categorically-
oriented section can be seen as an intermezzo that is not necessary for the
remainder; it could be skipped.
Sections 7.9 and 7.10 return to (directed) graphical modelling. Section 7.9


uses string diagrams to express intrinsic graphical structure, called shape, in


joint probability distributions. This is commonly expressed in terms of inde-
pendence and d-separation, but we prefer to stick at this stage to a purely string-
diagrammatic account. String diagrams are also used in Section 7.10 to give a
precise account of Bayesian inference, including a (high-level) channel-based
inference algorithm.

7.1 String diagrams


In Chapter 1 we have seen how (probabilistic) channels can be composed se-
quentially, via ◦·, and in parallel, via ⊗. These operations ◦· and ⊗ satisfy certain
algebraic laws, such as associativity of ◦·. Parallel composition ⊗ is also asso-
ciative, in a suitable sense, but the equations involved are non-trivial to work
with. These equations are formalised via the notion of symmetric monoidal cat-
egory that axiomatises the combination of sequential and parallel composition.
Instead of working with such somewhat complicated categories we shall
work with string diagrams. These diagrams form a graphical language that of-
fers a more convenient and intuitive way of reasoning with sequential and par-
allel composition. In particular, these diagrams allow obvious forms of topo-
logical stretching and shifting that are less complicated, and more intuitive than
equations.
The main reference for string diagrams is [149]. Here we present a more
focused account, concentrating on what we need in a probabilistic setting. In
this first section we describe string diagrams with directed edges. In the next
section we simply write undirected edges | instead of directed edges ↑, with the
convention that information always flows upward. However, we have to keep
in mind that we do this for convenience only, and that the graphs involved are
actually directed.
String diagrams look a bit like Bayesian networks, but differ in subtle ways,
esp. wrt. copying, discarding and joint states, see Subsection 7.1.1. The shift
that we are making from Bayesian networks to string diagrams was already
prepared in Subsection 2.6.1.

Definition 7.1.1. The collection Type of types is the smallest set with:

1 1 ∈ Type, where 1 represents the trivial singleton type;


2 F ∈ Type for any finite set F;
3 if A, B ∈ Type, then also A × B ∈ Type.

In fact, the first item can easily be seen as a special case of the second one.


However, we like to emphasise that there is a distinguished type that we write


1. More generally, we write, as before, n = {0, 1, . . . , n − 1}, so that n ∈ Type.
The nodes of a string diagram are ‘boxes’ with input and output wires:

    [diagram (7.1): a box with input wires of types A1, ..., An entering at
    the bottom and output wires of types B1, ..., Bm leaving at the top]

The symbols A1, ..., An are the types of the input wires, and B1, ..., Bm
are the types of the output wires. We allow the cases n = 0 (no input wires)
and m = 0 (no output wires), written as a box with only output wires, and a
box with only input wires, respectively.
A diagram signature, or simply a signature, is a collection of boxes as in (7.1).
The interpretation of such a box is a probabilistic channel A1 × · · · × An →
B1 × · · · × Bm . In particular, the interpretation of a box with no input channels is
a distribution/state. In principle, we like to make a careful distinction between
syntax (boxes in a diagram) and semantics (their interpretation as channels).
Informally, a string diagram is a diagram that is built-up inductively, by con-
necting wires of the same type between several boxes, see for instance the two
diagrams (7.3) below. The connections may include copying and swapping, see
below for details. A string diagram involves the requirement that it contains no
cycles: it should be acyclic. We write string diagrams with arrows going
upwards, suggesting that there is a flow from bottom to top. This direction
is purely a convention. Elsewhere in the literature the flow may be from top
to bottom, or from left to right.
More formally, given a signature Σ of boxes of the form (7.1), the set SD(Σ)
of string diagrams over Σ is defined as the smallest set satisfying the items
below. At the same time we define two ‘domain’ and ‘codomain’ functions
from string diagrams to lists of types, as in:

      dom, cod : SD(Σ) → L(Type)

Recall that L is the list functor.

(1) Σ ⊆ SD(Σ), that is, every box in f ∈ Σ in the signature as in (7.1) is a string
diagram in itself, with dom( f ) = [A1 , . . . , An ] and cod ( f ) = [B1 , . . . , Bm ].
(2) For all A, B ∈ Type, the following diagrams are in SD(Σ).


(a) The identity diagram id_A, a single wire of type A, with
    dom(id_A) = [A] and cod(id_A) = [A].
(b) The swap diagram swap_{A,B}, crossing wires of types A and B, with
    dom(swap_{A,B}) = [A, B] and cod(swap_{A,B}) = [B, A].
(c) The copy diagram copy_A, a node splitting one wire of type A into two,
    with dom(copy_A) = [A] and cod(copy_A) = [A, A].
(d) The discard diagram discard_A, a wire of type A ending in a ground
    symbol, with dom(discard_A) = [A] and cod(discard_A) = [].
(e) The uniform state diagram unif_A, with dom(unif_A) = [] and
    cod(unif_A) = [A].
(f) For each type A and element a ∈ A a point state diagram point_A(a),
    with dom(point_A(a)) = [] and cod(point_A(a)) = [A].
(g) Boxes for the logical operations conjunction &, orthosupplement ⊥ and
    scalar multiplication r, where r ∈ [0, 1], with two input wires of
    type 2 and one output wire of type 2 for &, and one input and one
    output wire of type 2 for ⊥ and r. These string diagrams have obvious
    domains and codomains. We recall that predicates on A can be identified
    with channels A → 2, see Exercise 4.3.5. Hence they can be included in
    string diagrams.
(3) If S1, S2 ∈ SD(Σ) are string diagrams, then the parallel composition
    (juxtaposition) S1 S2 is also a string diagram, with
    dom(S1 S2) = dom(S1) ++ dom(S2) and cod(S1 S2) = cod(S1) ++ cod(S2).


(4) If S1, S2 ∈ SD(Σ) and cod(S1) ends with C⃗ = [C1, ..., Ck] whereas
    dom(S2) starts with C⃗, then there is a sequential composition string
    diagram T, obtained by placing S2 on top of S1 and connecting the last
    k output wires of S1 to the first k input wires of S2, with     (7.2)

        dom(T) = dom(S1) ++ (dom(S2) − C⃗)
        cod(T) = (cod(S1) − C⃗) ++ cod(S2).
From this basic form, different forms of composition can be obtained by


reordening wires via swapping.
In this definition of string diagrams all wires point upwards. This excludes
that string diagrams contain cycles: they are acyclic by construction. One could
add more basic constructions to item (2), such as convex combination, see
Exercise 7.2.3, but the above constructions suffice for now.
   Below we redescribe the ‘wetness’ and ‘Asia’ Bayesian networks, from
Figures 2.4 and 6.6, as string diagrams.

    [string diagrams (7.3): on the left the wetness network, with initial
    state ‘winter’ (type A) copied into the channels ‘sprinkler’ (type B)
    and ‘rain’ (type C), feeding — after copying the rain wire — the
    channels ‘wet grass’ (type D) and ‘slippery road’ (type E); on the
    right the Asia network, with initial states ‘asia’ (type A) and
    ‘smoking’ (type S) feeding the channels ‘tub’ (T), ‘lung’ (L),
    ‘bronc’ (B), ‘either’ (E), ‘xray’ (X) and ‘dysp’ (D)]

What are the signatures for these string diagrams?

7.1.1 String diagrams versus Bayesian networks


One can ask: why are these string diagrams any better than the diagrams that
we have seen so far for Bayesian networks? An important point is that string


diagrams make it possible to write a joint state, on an n-ary product type. For
instance, for n = 2 we can have a write a joint state as:

(7.4)

In Bayesian networks joint states cannot be expressed. It is possible to write an


initial node with multiple outgoing wires, but as we have seen in Section 2.6,
this should be interpreted as copy of a single outgoing wire, coming out of a
non-joint state:
g 7
_ ?
•O
  =  
initial
  initial
 
In string diagrams one can additionally express marginals via discarding :
if a joint state ω is the interpretation of the above string diagram (7.4), then the
diagram on the left below is interpreted as the marginal ω 1, 0 , whereas the
 
diagram on the right is interpreted as ω 0, 1 .
 

We shall use this marginal notation with masks also for boxes in general — and
not just for states, without incoming wires. This is in line with Definition 2.5.2
where marginalisation is defined both for states and for channels.
An additional advantage of string diagrams is that they can be used for equa-
tional reasoning. This will be explained in the next section. First we give mean-
ing to string diagrams.

7.1.2 Interpreting string diagrams as channels


String diagrams will be used as representations of (probabilistic) channels. We
thus consider string diagrams as a special, graphical term calculus (syntax).
These diagrams are formal constructions, which are given mathematical mean-
ing via their interpretation as channels.
This interpretation of string diagrams can be defined in a compositional way,
following the structure of a diagram, using sequential ◦· and parallel ⊗ compo-
sition of channels. An interpretation of a string diagram S ∈ SD(Σ) is a channel
that is written as [[ S ]]. It forms a channel from the product of the types in the
domain dom(S ) of the string diagram to the product of types in the codomain


cod (S ). The product over the empty list is the singleton element 1 = {0}. Such
an interpretation [[ S ]] is parametrised by an interpretation of the boxes in the
signature Σ as channels.
Given such an interpretation of the boxes in Σ, the items below describe
this interpretation function [[ − ]] inductively, acting on the whole of SD(Σ), by
precisely following the above four items in the inductive definition of string
diagrams.

(1) The interpretation of a box (7.1) in Σ is assumed. It is some channel of type


A1 × · · · × An → B1 × · · · × Bm .
(2) We use the following interpretation of primitive boxes.
(a) The interpretation of the identity wire of type A is the identity/unit channel
id : A → A, given by point states: id (a) = 1| ai.
(b) The swap string diagram is interpreted as the (switched) pair of projec-
tions hπ2 , π1 i : A × B → B × A. It sends (a, b) to 1|b, ai.
(c) The interpretation [[ copy A ]] of the copy string diagram is the copy chan-
nel ∆ : A → A × A with ∆(a) = 1| a, a i.
(d) The discard channel of type A is interpreted as the unique channel
! : A → 1. Using that D(1)  1, this function sends every element a ∈ A
to the sole element 1| 0 i of D(1) for 1 = {0}.
(e) The uniform string diagram of type A is interpreted as the uniform
    distribution/state unif_A ∈ D(A), considered as a channel 1 → A.
    Recall, if A has n elements, then unif_A = Σ_{a∈A} (1/n) |a⟩.

(f) The point string diagram point A (a), for a ∈ A is interpreted as the point
state 1| a i ∈ D(A).
(g) The ‘logical’ channels for conjunction, orthosupplement, and scalar mul-
tiplication are interpreted by the corresponding channels conj, orth and
scal(r) from Exercise 4.3.5.
(3) The tensor product ⊗ of channels is used to interpret juxtaposition of string
diagrams [[ S 1 S 2 ]] = [[ S 1 ]] ⊗ [[ S 2 ]].
(4) If we have string diagrams S 1 , S 2 in a composition T as in (7.2), then:

[[ T ]] = (id ⊗ [[ S2 ]]) ◦· ([[ S1 ]] ⊗ id ),


where the identity channels have the appropriate product types.

At this stage we can say more precisely what a Bayesian network is, within
the setting of string diagrams. This string diagrammatic perspective started
in [48].

Definition 7.1.2. A Bayesian network is given by:


1 a string diagram G ∈ SD(Σ) over a finite signature Σ, with dom(G) = [];


2 an interpretation of each box in Σ as a channel.

The string diagram G in the first item is commonly referred to as the (underly-
ing) graph of the Bayesian network. The channels that are used as interpreta-
tions are the conditional probability tables of the network.

Given the definition of the boxes in (the signature of) a Bayesian network,
the whole network can be interpreted as a state/distribution [[ G ]]. This is in
general not the same thing as the joint distribution associated with the network,
see Definition 7.3.1 (3) later on. In addition, for each node A in G, we can look
at the subgraph G A of cumulative parents of A. The interpretation [[ G A ]] is
then also a state, namely the one that one obtains via state transformation. This
has been described as ‘prediction’ in Section 2.6.
The above definition of Bayesian network is more general than usual, for
instance because it allows the graph to contain joint states, like in (7.4), or
discarding , see also the discussion in [82].

Exercises
7.1.1 For a state ω ∈ D(X1 × X2 × X3 × X4 × X5), describe its
      marginalisation ω[1, 0, 1, 0, 0] as a string diagram.
 

7.1.2 Check in detail how the two string diagrams in (7.3) are obtained by
following the construction rules for string diagrams.
7.1.3 Consider the boxes in the wetness string diagram in (7.3) as a signa-
ture. An interpretation of the elements in this signature is described
in Section 2.6 as states and channels wi, sp etc. Check that this in-
terpretation extends to the whole wetness string diagram in (7.3) as a
channel 1 → D × E given by:
(wg ⊗ sr) ◦· (id ⊗ ∆) ◦· (sp ⊗ ra) ◦· ∆ ◦· wi.
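The composite in Exercise 7.1.3 only needs the two channel operations that interpret string diagrams: sequential composition ◦· and parallel composition ⊗. The sketch below is a generic illustration on a hypothetical two-element example, not the book's wetness tables; `seq` and `tensor` are my names for ◦· and ⊗ on dict-valued channels.

```python
# Sketch of the two channel operations interpreting string diagrams:
# sequential composition and parallel composition, for channels given as
# functions into dict-distributions.

def seq(c, d):                    # (d ∘ c)(x) = Σ_y c(x)(y) · d(y)
    def comp(x):
        out = {}
        for y, py in c(x).items():
            for z, pz in d(y).items():
                out[z] = out.get(z, 0.0) + py * pz
        return out
    return comp

def tensor(c, d):                 # (c ⊗ d)(x1, x2) = c(x1) ⊗ d(x2)
    def pair(x):
        x1, x2 = x
        return {(y1, y2): p1 * p2
                for y1, p1 in c(x1).items()
                for y2, p2 in d(x2).items()}
    return pair

coin = lambda _: {'h': 0.5, 't': 0.5}             # a channel 1 → {h, t}
flip = lambda y: {'h': 0.2, 't': 0.8} if y == 'h' else {'h': 0.9, 't': 0.1}

state = seq(coin, flip)(None)                     # flip ∘ coin, run on 1
joint = tensor(coin, coin)((None, None))          # independent product state
```

The wetness composite above is obtained by nesting exactly these two operations, together with the copy channel Δ.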

7.1.4 Extend the interpretation from Section 6.5 of the signature of the Asia
string diagram in (7.3) to a similar interpretation of the whole Asia
string diagram.
7.1.5 For A = {a, b, c} check that the interpretation of the uniform state
      followed by copying is:

          1/3 |a, a⟩ + 1/3 |b, b⟩ + 1/3 |c, c⟩.

7.1.6 Check that there are equalities of interpretations: discarding two
      parallel wires of types A and B equals discarding a single wire of
      type A × B, and similarly for uniform states:
      unif_A ⊗ unif_B = unif_{A×B}.


7.1.7 Verify that for each box, interpreted as a channel f : A → B, one has:

          [[ discard_B ◦· f ]] = [[ discard_A ]].

7.1.8 Recall Exercise 4.3.6 and check that the interpretation of a state
      box f with a predicate box g (of output type 2) on top is the
      validity [[ f ]] |= [[ g ]]. Similarly, check that the interpretation
      of a state f with a channel g and a predicate h stacked on top is:

          ([[ g ]] =≪ [[ f ]]) |= [[ h ]] = [[ f ]] |= ([[ h ]] ◦· [[ g ]]),

      see Proposition 4.3.3.


7.1.9 Notice that the difference between ω |= p1 ⊗ p2 and σ |= q1 & q2 can
      be clarified graphically: in the first case the predicate boxes
      p1, p2 sit on two separate wires, of types A and B, with their
      outputs combined by &; in the second case a single wire of type A is
      copied and fed into both q1 and q2, whose outputs are combined by &.
      See also Exercise 4.3.7.

7.2 Equations for string diagrams


This section introduces equations S1 = S2 between string diagrams S1, S2 ∈
SD(Σ) over the same signature Σ. These equations are ‘sound’, in the sense
that they are respected by the string diagram interpretation of
Subsection 7.1.2: S1 = S2 implies [[ S1 ]] = [[ S2 ]]. From now on we
simplify the writing of string diagrams in the following way.


Convention 7.2.1. 1 We will write the arrows ↑ on the wires of string
diagrams simply as | and assume that the flow is always upwards in a string
diagram, from bottom to top. Thus, even though we are not writing edges as
arrows, we still consider string diagrams as directed graphs.
2 We drop the types of wires when they are clear from the context. Thus, wires
in string diagrams are always typed, but we do not always write these types
explicitly.
3 We become a bit sloppy about writing the interpretation function [[ − ]] for
string diagrams explicitly. Sometimes we’ll say “the interpretation of S ”
instead of just [[ S ]], but also we sometimes simply write a string diagram,
where we mean its interpretation as a channel. This can be confusing at first,
so we will make it explicit when we start blurring the distinction between
syntax and semantics.

Shift equations
String diagrams allow quite a bit of topological freedom, for instance, in the
sense that parallel boxes may be shifted up and down, as expressed by the
following equations.

f g
= f g =
g f

Semantically, this is justified by the equation:

([[ f ]] ⊗ id) ◦· (id ⊗ [[ g ]]) = [[ f ]] ⊗ [[ g ]] = (id ⊗ [[ g ]]) ◦· ([[ f ]] ⊗ id).
    

   Similarly, boxes may be pulled through swaps. These four equations do not
only hold for boxes, but also for copying and discarding, since we consider
them simply as boxes with a special notation.


Equations for wires


   Wires can be stretched arbitrarily in length, and successive swaps
cancel each other out. Wires of product type correspond to parallel wires,
and wires of the empty product type 1 do nothing — as represented by the
empty dashed box:

      A × B  =  A B        1  =  (empty diagram).

Equations for copying


   The copy string diagram also satisfies a number of equations, expressing
associativity, commutativity and a co-unit property. Abstractly this makes
copying a comonoid. This can be expressed graphically by the three comonoid
equations (7.5).

Because copying is associative, it does not matter which order of copying we


use in successive copying. Therefore we feel free to write such n-ary copying
as coming from a single node •, in:
 ··· 
···  
∆n B  (7.6)
 
and 
 

   Copying of products A × B and of the empty product 1 comes with its own
equations: copying a wire of type A × B amounts to copying the two
component wires (and reordering them), and copying on type 1 is empty.
   We should be careful that boxes can in general not be “pulled through”
copiers (see also Exercise 2.5.10): in general, copying the output of a box
f differs from copying its input and applying f on each copy.


An exception to this is when copying follows a point state 1|a⟩ for a ∈ A.
The “pulling through” equation does hold for deterministic channels f, see
Exercise 2.5.12.
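The failure of pulling a box through a copier is easy to witness concretely. In the sketch below (a hypothetical fair-coin channel; the function names are mine), copy ◦· f produces perfectly correlated pairs, while (f ⊗ f) ◦· copy produces two independent samples.

```python
# Sketch of why a box cannot, in general, be pulled through a copier:
# for a genuinely probabilistic channel f, copy ∘ f yields correlated
# pairs, while (f ⊗ f) ∘ copy yields two independent samples.

def f_then_copy(f, x):             # copy ∘ f, evaluated at x
    return {(y, y): p for y, p in f(x).items()}

def copy_then_ff(f, x):            # (f ⊗ f) ∘ copy, evaluated at x
    return {(y1, y2): p1 * p2
            for y1, p1 in f(x).items()
            for y2, p2 in f(x).items()}

f = lambda x: {'0': 0.5, '1': 0.5}                # a fair-coin channel
lhs = f_then_copy(f, 'x')          # {('0','0'): 0.5, ('1','1'): 0.5}
rhs = copy_then_ff(f, 'x')         # all four pairs with probability 1/4
```

For a deterministic channel (each f(x) a point distribution), the two expressions coincide, in line with Exercise 2.5.12.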

Equations for discarding and for uniform and point states


   The most important discarding equation says that discarding after a
box/channel is the same as discarding all of its inputs: graphically,
putting discard symbols on all outputs of a box equals putting discard
symbols directly on its input wires A1, ..., An. We refer to this as: boxes
are unital1 .
Discarding only some (not all) of the outgoing wires amounts to marginalisa-
tion.
   Discarding a uniform state also erases everything; this may be seen as a
special case of the unitality equation, but still we like to formulate it
explicitly. For product wires we have discard equations, see
Exercise 7.1.6: discarding a wire of type A × B equals discarding two
parallel wires of types A and B. Similarly we have uniform state equations:
unif_{A×B} = unif_A ⊗ unif_B.

For point states we have for elements a ∈ A and b ∈ B,

=
(a, b) a b

Equations for logical operations


As is well-known, conjunction & is associative and commutative:

[string diagram equations: associativity and commutativity of the conjunction box &]

1 In a quantum setting one uses ‘causal’ instead of ‘unital’, see e.g. [25].


Moreover, conjunction has truth and falsity as unit and zero elements:

[string diagram equations: p & 1 = p and p & 0 = 0]

Orthosupplement (−)⊥ is involutive and turns 0 into 1 (and vice-versa):

[string diagram equations: (p⊥)⊥ = p and 0⊥ = 1]

Finally we look at the rules for scalar multiplication for predicates:

[string diagram equations: s · (r · p) = (r·s) · p, 1 · p = p, 0 · p = 0, and (r · p) & q = r · (p & q)]

These equations can be used for diagrammatic equational reasoning. For instance we can now derive a variation on the last equation, where scalar multiplication on the second input of conjunction can also be done after the conjunction:

[diagrammatic derivation: p & (r · q) = r · (p & q), via commutativity of & and the last scalar equation above]

All of the above equations between string diagrams are sound, in the sense
that they hold under all interpretations of string diagrams — as described at
the end of the previous section. There are also completeness results in this
area, but they require a deeper categorical analysis. For a systematic account
of string diagrams we refer to the overview paper [149]. In this book we use
string diagrams as convenient, intuitive notation with a precise semantics.

Exercises
7.2.1 Give a concrete description of the n-ary copy channel ∆_n : A → A^n in (7.6).


7.2.2 Prove by diagrammatic equational reasoning that:

[string diagram equation: (r · p) & (s · q) = (r·s) · (p & q)]

7.2.3 Consider for r ∈ [0, 1] the convex combination channel cc(r) : A × A → A given by cc(r)(a, a′) = r|a⟩ + (1 − r)|a′⟩. It can be used as interpretation of an additional convex combination box:

[string diagram: a box labelled r with two incoming wires of type A and one outgoing wire of type A]

Prove that it satisfies the following 'barycentric' equations, due to [155].

[string diagram equations: cc(1) returns its first input; applying cc(r) to two copies of the same input is the identity; precomposing cc(r) with a swap yields cc(1 − r); and a nested combination of cc(r) and cc(s) can be re-associated into one built from cc(rs) and cc(r(1−s)/(1−rs))]

For the last equation we must assume that rs ≠ 1, i.e. not r = s = 1.


7.2.4 Prove that:

[string diagram equation, with arbitrarily many input and output wires]

What does this amount to when there are no input (or no output) wires?

7.3 Accessibility and joint states


Intuitively, a string diagram is called accessible if all its internal connections
are also accessible externally. This property will be defined formally below,
but it is best illustrated via an example. Below on the left we see the wetness
Bayesian network as string diagram, like in (7.3) — but with lines instead
of arrows. On the right is an accessible version of this string diagram, with


(single) external connections to all its internal wires.

[string diagrams (7.7): on the left, the wetness Bayesian network with boxes winter (output A), sprinkler (B), rain (C), wet grass (D) and slippery road (E); on the right, an accessible version in which every internal wire also has a single external outgoing wire, giving outputs A, B, D, C, E]

The non-accessible parts of a string diagram are often called latent or unob-
served.
We are interested in such accessible versions because they lead in an easy
way to joint states (and vice-versa, see below): the joint state for the wetness
Bayesian network is the interpretation of the above accessible string diagram
on the right, giving a distribution on the product set A × B × D × C × E. Such
joint distributions are conceptually important, but they are often impractical to
work with because their size quickly grows out of hand. For instance, the joint
distribution ω for the above wetness network is (with zero-probability terms
omitted, as usual):

399/6250 |a, b, d, c, e⟩ + 171/6250 |a, b, d, c, e⊥⟩ + 27/1250 |a, b, d, c⊥, e⊥⟩
  + 21/6250 |a, b, d⊥, c, e⟩ + 9/6250 |a, b, d⊥, c, e⊥⟩ + 3/1250 |a, b, d⊥, c⊥, e⊥⟩
  + 672/3125 |a, b⊥, d, c, e⟩ + 288/3125 |a, b⊥, d, c, e⊥⟩ + 168/3125 |a, b⊥, d⊥, c, e⟩
  + 72/3125 |a, b⊥, d⊥, c, e⊥⟩ + 12/125 |a, b⊥, d⊥, c⊥, e⊥⟩ + 399/20000 |a⊥, b, d, c, e⟩
  + 171/20000 |a⊥, b, d, c, e⊥⟩ + 243/1000 |a⊥, b, d, c⊥, e⊥⟩ + 21/20000 |a⊥, b, d⊥, c, e⟩
  + 9/20000 |a⊥, b, d⊥, c, e⊥⟩ + 27/1000 |a⊥, b, d⊥, c⊥, e⊥⟩ + 7/1250 |a⊥, b⊥, d, c, e⟩
  + 3/1250 |a⊥, b⊥, d, c, e⊥⟩ + 7/5000 |a⊥, b⊥, d⊥, c, e⟩ + 3/5000 |a⊥, b⊥, d⊥, c, e⊥⟩
  + 9/100 |a⊥, b⊥, d⊥, c⊥, e⊥⟩

This distribution is obtained as outcome of the interpreted accessible graph in (7.7):

ω = (id ⊗ id ⊗ wg ⊗ id ⊗ sr) ◦· (id ⊗ ∆_2 ⊗ ∆_3) ◦· (id ⊗ sp ⊗ ra) ◦· ∆_3 ◦· wi.

Recall that ∆_n is written for the n-ary copy channel, see (7.6).
Later on in this section we use our new insights into accessible string dia-
grams to re-examine the relation between crossover inference on joint states


and channel-based inference, see Corollary 6.3.13. But we start with a more
precise description.

Definition 7.3.1.
1 An accessible string diagram is defined inductively via the string diagram definition steps (1) – (3) from Section 7.1, with the sequential composition diagram (7.2) in step (4) extended with copiers on the internal wires:

[string diagram (7.8): the sequential composite of S1 and S2 in which each internal wire from S1 to S2 is first copied, with one copy entering S2 and the other copy running to an external outgoing wire]

In this way each 'internal' wire between string diagrams S1 and S2 is 'externally' accessible.
2 Each string diagram S can be made accessible by replacing each occurrence (7.2) of step (4) in its construction with the above modified step (7.8). We shall write S̄ for a choice of accessible version of S.
3 A joint state or joint distribution associated with a Bayesian network G ∈ SD(Σ) is the interpretation [[ Ḡ ]] of an accessible version Ḡ ∈ SD(Σ) of the string diagram G. It re-uses G's interpretation of its boxes.

There are several subtleties related to the last two points.

Remark 7.3.2.
1 The above definition of accessible string diagram allows that there are multiple external connections to an internal wire, for instance if one of the internal wires between S1 and S2 in (7.8) is made accessible inside S2, possibly because it is one of the outgoing wires of S2. Such multiple accessibility is unproblematic, but in most cases we prefer 'minimal' accessibility via single wires, as in (7.7) on the right. Anyway, we can always purge unnecessary outgoing wires via discarding.
2 The order of the outgoing wires in an accessible graph involves some level of arbitrariness. For instance, in the accessible wetness graph in (7.7) one could also imagine putting C to the left of D at the top of the diagram. This would involve some additional crossing of wires. The result would be essentially the same. The resulting joint state, obtained by interpretation,


would then be in D(A × B × C × D × E). These differences are immaterial, but do require some care if one performs marginalisation.
Thus, strictly speaking, the accessible version S̄ of a string diagram S is determined only up to permutation of outgoing wires. In concrete cases we try to write accessible string diagrams in such a way that the number of crossings of wires is minimal.
With these points in mind we allow ourselves to speak about the joint state associated with a Bayesian network, in which duplicate outgoing wires are deleted. This joint state is determined up to permutation of outgoing wires.

In the crossover inference corollary, Corollary 6.3.13, we have seen how component-wise updates on a joint state correspond to channel-based inference. This theorem involved a binary state and only one channel. Now that we have a better handle on joint states via string diagrams, we can extend this correspondence to n-ary joint states and multiple channels. This is a theme that will be pursued further in this chapter. At this stage we only illustrate how this works for the wetness Bayesian network.

Example 7.3.3. In Exercise 6.5.1 we have formulated two inference questions for the wetness Bayesian network, namely:

1 What is the updated sprinkler distribution, given evidence of a slippery road? Using channel-based inference it is computed as:

sp ≫ (wi |_{ra ≪ (sr ≪ 1_e)}) = 63/260 |b⟩ + 197/260 |b⊥⟩.

2 What is the updated wet grass distribution, given evidence of a slippery road? It's:

wg ≫ (((sp ⊗ ra) ≫ (∆ ≫ wi)) |_{1 ⊗ (sr ≪ 1_e)}) = 4349/5200 |d⟩ + 851/5200 |d⊥⟩.

Via (multiple applications) of crossover inference, see Corollary 6.3.13, these


same updated probabilities can be obtained via the joint state ω ∈ D(A × B ×
D × C × E) of the wetness Bayesian network, as mentioned at the beginning of
this section.
Concretely, this works as follows. In the above first case we have (point)
evidence 1e : E → [0, 1] on the set E = {e, e⊥ }. It is weakened (extended) to
evidence 1⊗1⊗1⊗1⊗1e on the product A×B×D×C ×E. Then it can be used to
update the joint state ω. The required outcome then appears by marginalising
on the sprinkler set B in second position, as in:

(ω|_{1⊗1⊗1⊗1⊗1_e}) [0, 1, 0, 0, 0] = 63/260 |b⟩ + 197/260 |b⊥⟩
    ≈ 0.2423|b⟩ + 0.7577|b⊥⟩.


For the second inference question we use the same evidence, but we now marginalise on the wet grass set D, in third position:

(ω|_{1⊗1⊗1⊗1⊗1_e}) [0, 0, 1, 0, 0] = 4349/5200 |d⟩ + 851/5200 |d⊥⟩
    ≈ 0.8363|d⟩ + 0.1637|d⊥⟩.

We see that crossover inference is easier to express than channel-based inference, since we do not have to form the more complicated expressions with ≫ and ≪ as in the above two points. Instead, we just have to update and marginalise at the right positions. Thus, it is easier from a mathematical perspective. But from a computational perspective crossover inference is more complicated, since these joint states grow exponentially in size (in the number of nodes in the graph). This topic will be continued in Section 7.10.
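As a sanity check on these numbers, the crossover computation can be carried out mechanically. The following Python sketch uses a dictionary-of-tuples encoding of joint states — our own representation, not notation from the book — with the fractions of the wetness joint state ω listed earlier in this section; `~a` stands for the complement a⊥.

```python
from fractions import Fraction as F

# The joint state of the wetness network, coordinates (A, B, D, C, E).
omega = {
    ('a','b','d','c','e'): F(399,6250),   ('a','b','d','c','~e'): F(171,6250),
    ('a','b','d','~c','~e'): F(27,1250),  ('a','b','~d','c','e'): F(21,6250),
    ('a','b','~d','c','~e'): F(9,6250),   ('a','b','~d','~c','~e'): F(3,1250),
    ('a','~b','d','c','e'): F(672,3125),  ('a','~b','d','c','~e'): F(288,3125),
    ('a','~b','~d','c','e'): F(168,3125), ('a','~b','~d','c','~e'): F(72,3125),
    ('a','~b','~d','~c','~e'): F(12,125), ('~a','b','d','c','e'): F(399,20000),
    ('~a','b','d','c','~e'): F(171,20000),('~a','b','d','~c','~e'): F(243,1000),
    ('~a','b','~d','c','e'): F(21,20000), ('~a','b','~d','c','~e'): F(9,20000),
    ('~a','b','~d','~c','~e'): F(27,1000),('~a','~b','d','c','e'): F(7,1250),
    ('~a','~b','d','c','~e'): F(3,1250),  ('~a','~b','~d','c','e'): F(7,5000),
    ('~a','~b','~d','c','~e'): F(3,5000), ('~a','~b','~d','~c','~e'): F(9,100),
}
assert sum(omega.values()) == 1

def update(omega, pos, p):
    """Condition a joint state with a factor p acting on coordinate pos."""
    w = {xs: v * p.get(xs[pos], 0) for xs, v in omega.items()}
    total = sum(w.values())
    return {xs: v / total for xs, v in w.items()}

def marginal(omega, pos):
    """Marginalise a joint state onto a single coordinate."""
    out = {}
    for xs, v in omega.items():
        out[xs[pos]] = out.get(xs[pos], 0) + v
    return out

# evidence of a slippery road: point predicate 1_e on the fifth coordinate
post = update(omega, 4, {'e': 1})
assert marginal(post, 1)['b'] == F(63, 260)     # updated sprinkler
assert marginal(post, 2)['d'] == F(4349, 5200)  # updated wet grass
```

Exact rational arithmetic via `fractions` avoids any rounding issues in the comparison with 63/260 and 4349/5200.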

Exercises
7.3.1 Draw an accessible version of the Asia string diagram in (7.3). Write also the corresponding composition of channels that produces the associated joint state.

7.4 Hidden Markov models


Let's start with a graphical description of an example of what is called a hidden Markov model:

[diagram (7.9): three 'hidden' positions Cloudy, Sunny, Rainy with labelled transition arrows between them, and two 'visible' elements Stay-in, Go-out with labelled emission arrows from the hidden positions; the probabilities on the arrows are listed in the channels t and e below]
This model has three ‘hidden’ elements, namely Cloudy, Sunny, and Rainy,
representing the weather condition on a particular day. There are ‘temporal’
transitions with associated probabilities between these elements, as indicated


by the labeled arrows. For instance, if it is cloudy today, then there is a 50% chance that it will be cloudy again tomorrow. There are also two 'visible' elements on the right: Stay-in and Go-out, describing two possible actions of a person, depending on the weather condition. There are transitions with probabilities from the hidden to the visible elements. The idea is that with every
time step a transition is made between hidden elements, resulting in a visible
outcome. Such steps may be repeated for a finite number of times — or even
forever. The interaction between what is hidden and what can be observed is
a key element of hidden Markov models. For instance, one may ask: given a
certain initial state, how likely is it to see a consecutive sequence of the four
visible elements: Stay-in, Stay-in, Go-out, Stay-in?
Hidden Markov models are simple statistical models that have many ap-
plications in temporal pattern recognition, in speech, handwriting or gestures,
but also in robotics and in biological sequences. This section will briefly look
into hidden Markov models, using the notation and terminology of channels
and string diagrams. Indeed, a hidden Markov model can be defined easily in
terms of channels and forward and backward transformation of states and ob-
servables. In addition, conditioning of states by observables can be used to for-
mulate and answer elementary questions about hidden Markov models. Learn-
ing for Markov models will be described separately in Sections ?? and ??.
Markov models are examples of probabilistic automata. Such automata will be
studied separately, in Chapter ??.

Definition 7.4.1. A Markov model (or a Markov chain) is given by a set X of


‘internal positions’, typically finite, with a ‘transition’ channel t : X → X and
an initial state/distribution σ ∈ D(X).
A hidden Markov model, often abbreviated as HMM, is a Markov model, as
just described, with an additional ‘emission’ channel e : X → Y, where Y is a
set of ‘outputs’.

In the above illustration (7.9) we have as sets of positions and outputs:

X = {Cloudy, Sunny, Rainy}        Y = {Stay-in, Go-out},

with transition channel t : X → X,

t(Cloudy) = 0.5|Cloudy⟩ + 0.2|Sunny⟩ + 0.3|Rainy⟩
t(Sunny) = 0.15|Cloudy⟩ + 0.8|Sunny⟩ + 0.05|Rainy⟩
t(Rainy) = 0.2|Cloudy⟩ + 0.2|Sunny⟩ + 0.6|Rainy⟩,


and emission channel e : X → Y,

e(Cloudy) = 0.5|Stay-in⟩ + 0.5|Go-out⟩
e(Sunny) = 0.2|Stay-in⟩ + 0.8|Go-out⟩
e(Rainy) = 0.9|Stay-in⟩ + 0.1|Go-out⟩.

An initial state is missing in the picture (7.9).
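For concreteness, these channels can be represented in Python as nested dictionaries, with state transformation t ≫ σ as a small function; this encoding is our own choice, not something fixed by the text.

```python
t = {"Cloudy": {"Cloudy": 0.5,  "Sunny": 0.2, "Rainy": 0.3},
     "Sunny":  {"Cloudy": 0.15, "Sunny": 0.8, "Rainy": 0.05},
     "Rainy":  {"Cloudy": 0.2,  "Sunny": 0.2, "Rainy": 0.6}}

e = {"Cloudy": {"Stay-in": 0.5, "Go-out": 0.5},
     "Sunny":  {"Stay-in": 0.2, "Go-out": 0.8},
     "Rainy":  {"Stay-in": 0.9, "Go-out": 0.1}}

def push(c, sigma):
    """State transformation c >> sigma: push a distribution forward along a channel."""
    out = {}
    for x, px in sigma.items():
        for y, py in c[x].items():
            out[y] = out.get(y, 0.0) + px * py
    return out

# one day after a certainly-cloudy day
assert push(t, {"Cloudy": 1.0}) == {"Cloudy": 0.5, "Sunny": 0.2, "Rainy": 0.3}
```

Predicted observations then arise by a further push along e, as in e ≫ (t ≫ σ).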


In the literature on Markov models, the elements of the set X are often called
states. This clashes with the terminology in this book, since we use ‘state’ as
synonym for ‘distribution’. So, here we call σ ∈ D(X) an (initial) state, and
we call elements of X (internal) positions. At the same time we may call X the
sample space. The transition channel t : X → X is an endo-channel on X, that
is a channel from X to itself. As a function, it is of the form t : X → D(X); it is
an instance of a coalgebra, that is, a map of the form A → F(A) for a functor
F, see Section ?? for more information.
In a Markov chain/model one can iteratively compute successor states. For an initial state σ and transition channel t one can form successor states via state transformation:

σ,   t ≫ σ,   t ≫ (t ≫ σ) = (t ◦· t) ≫ σ,   …,   t^n ≫ σ,   …

where t^n = unit if n = 0, and t^n = t ◦· t^{n−1} if n > 0.

In these transitions the state at stage n + 1 only depends on the state at stage n:
in order to predict a future step, all we need is the immediate predecessor state.
This makes HMMs relatively easy dynamical models. Multi-stage dependen-
cies can be handled as well, by enlarging the sample space, see Exercise 7.4.6
below.
One interesting problem in the area of Markov chains is to find a 'stationary' state σ∞ with t ≫ σ∞ = σ∞, see Exercise 7.4.2 for an illustration, and also Exercise 2.5.16 for a sufficient condition.
Here we are more interested in hidden Markov models σ : 1 → X, t : X → X, e : X → Y. The elements of the set Y are observable — and hence sometimes called signals — whereas the elements of X are hidden. Thus, many questions related to hidden Markov models concentrate on what one can learn about X via Y, in a finite number of steps. Hidden Markov models are examples of models with latent variables.
We briefly discuss some basic issues related to HMMs in separate subsections. A recurring theme is the relationship between 'joint' and 'sequential' formulations.

7.4.1 Validity in hidden Markov models


The first question that we like to address is: given a sequence of observables,
what is their probability (validity) in a HMM? Standardly in the literature, one
only looks at the probability of a sequence of point observations (elements),
but here we use a more general approach. After all, one may not be certain
about observing a specific point at a particular state, or some point observations
may be missing; in the latter case one may wish to replace them by a constant
(uniform) observation.
We proceed by defining validity of the sequence of observables in a joint
state first; subsequently we look at (standard) algorithms for computing these
validities efficiently. We thus start by defining a relevant joint state.
We fix a HMM σ : 1 → X, t : X → X, e : X → Y. For each n ∈ N a channel ⟨e, t⟩_n : X → Y^n × X is defined in the following manner:

⟨e, t⟩_0 := (X --id--> 1 × X ≅ Y^0 × X)
⟨e, t⟩_{n+1} := (X --⟨e,t⟩_n--> Y^n × X --id^n ⊗ ⟨e,t⟩--> Y^n × (Y × X) ≅ Y^{n+1} × X).     (7.10)

We recall that the tuple ⟨e, t⟩ of channels is (e ⊗ t) ◦· ∆, see Definition 2.5.6 (1). With these tuples we can form a joint state ⟨e, t⟩_n ≫ σ ∈ D(Y^n × X). As an (interpreted) string diagram it looks as follows.

[string diagram (7.11): the state ⟨e, t⟩_n ≫ σ, built from σ by n successive copy-steps, each with an emission e on one copy (yielding the Y wires) and a transition t on the other, ending in a final wire of type X]

We consider the combined likelihood of a sequence of observables on the set


Y in a hidden Markov model. In the literature these observables are typically
point predicates 1y : Y → [0, 1], for y ∈ Y, but, as mentioned, here we allow
more general observables Y → R.

Definition 7.4.2. Let H = (σ : 1 → X, t : X → X, e : X → Y) be a hidden Markov model and let ~p = p1, …, pn be a list of observables on Y. The validity H |= ~p of this sequence ~p in the model H is defined via the tuples (7.10) as:

H |= ~p := (⟨e, t⟩_n ≫ σ) [1, …, 1, 0] |= p1 ⊗ ··· ⊗ pn
    =(4.8) ⟨e, t⟩_n ≫ σ |= p1 ⊗ ··· ⊗ pn ⊗ 1                    (7.12)
    =(4.9) σ |= ⟨e, t⟩_n ≪ (p1 ⊗ ··· ⊗ pn ⊗ 1).

The marginalisation mask [1, …, 1, 0] contains n times the number 1. It ensures that the X outcome in (7.11) is discarded.

We describe an alternative way to formulate this validity without using the (big) joint state on Y^n × X. It forms the essence of the classical 'forward' and 'backward' algorithms for validity in HMMs, see e.g. [14, 144] or [96, App. A]. An alternative algorithm is described in Exercise 7.4.4.

Proposition 7.4.3. The HMM-validity (7.12) can be computed as:

H |= ~p = σ |= (e ≪ p1) & t ≪ ((e ≪ p2) & t ≪ ((e ≪ p3) & ··· t ≪ (e ≪ pn) ···)).     (7.13)

This validity can be calculated recursively in forward manner as:

σ |= α(~p)    where    α([q]) = e ≪ q
                       α([q] ++ ~q) = (e ≪ q) & (t ≪ α(~q)).

Alternatively, this validity can be calculated recursively in backward manner as:

σ |= β(~p ++ [1])    where    β([q]) = q
                              β(~q ++ [q_n, q_{n+1}]) = β(~q ++ [(e ≪ q_n) & (t ≪ q_{n+1})]).

Proof. We first prove, by induction on n ≥ 1, that for observables p_i on Y and q on X one has:

⟨e, t⟩_n ≪ (p1 ⊗ ··· ⊗ pn ⊗ q)
    = (e ≪ p1) & t ≪ ((e ≪ p2) & t ≪ (··· t ≪ ((e ≪ pn) & t ≪ q) ···))     (∗)

The base case n = 1 is easy:

⟨e, t⟩_1 ≪ (p1 ⊗ q) = ∆ ≪ ((e ⊗ t) ≪ (p1 ⊗ q))
    = ∆ ≪ ((e ≪ p1) ⊗ (t ≪ q))          by Exercise 4.3.8
    = (e ≪ p1) & (t ≪ q)                by Exercise 4.3.7.

For the induction step we reason as follows.

⟨e, t⟩_{n+1} ≪ (p1 ⊗ ··· ⊗ pn ⊗ p_{n+1} ⊗ q)
    = ⟨e, t⟩_n ≪ ((id^n ⊗ ⟨e, t⟩) ≪ (p1 ⊗ ··· ⊗ pn ⊗ p_{n+1} ⊗ q))
    = ⟨e, t⟩_n ≪ (p1 ⊗ ··· ⊗ pn ⊗ (⟨e, t⟩ ≪ (p_{n+1} ⊗ q)))
    = ⟨e, t⟩_n ≪ (p1 ⊗ ··· ⊗ pn ⊗ ((e ≪ p_{n+1}) & (t ≪ q)))        as just shown
    =(IH) (e ≪ p1) & t ≪ ((e ≪ p2) & t ≪ (···
              t ≪ ((e ≪ pn) & t ≪ ((e ≪ p_{n+1}) & (t ≪ q))) ···)).

We can now prove Equation (7.13):

H |= ~p =(7.12) σ |= ⟨e, t⟩_n ≪ (p1 ⊗ ··· ⊗ pn ⊗ 1)
    =(∗) σ |= (e ≪ p1) & t ≪ ((e ≪ p2) & t ≪ (··· t ≪ ((e ≪ pn) & t ≪ 1) ···))
    = σ |= (e ≪ p1) & t ≪ ((e ≪ p2) & t ≪ (··· t ≪ (e ≪ pn) ···)).
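The forward recursion α translates directly into code. The sketch below uses a dictionary encoding of channels, states and observables — our own choice of representation, not the book's — and reproduces the validity 0.1674 of the observations Go-out, Stay-in, Stay-in from Exercise 7.4.1.

```python
# The weather HMM (7.9) as nested dictionaries.
t = {"Cloudy": {"Cloudy": 0.5,  "Sunny": 0.2, "Rainy": 0.3},
     "Sunny":  {"Cloudy": 0.15, "Sunny": 0.8, "Rainy": 0.05},
     "Rainy":  {"Cloudy": 0.2,  "Sunny": 0.2, "Rainy": 0.6}}
e = {"Cloudy": {"Stay-in": 0.5, "Go-out": 0.5},
     "Sunny":  {"Stay-in": 0.2, "Go-out": 0.8},
     "Rainy":  {"Stay-in": 0.9, "Go-out": 0.1}}

def pull(c, q):
    """Predicate transformation c << q."""
    return {x: sum(py * q.get(y, 0.0) for y, py in d.items()) for x, d in c.items()}

def conj(p, q):
    """Pointwise conjunction p & q."""
    return {x: p[x] * q.get(x, 0.0) for x in p}

def hmm_validity(sigma, t, e, preds):
    """(sigma, t, e) |= [p1, ..., pn] via the forward recursion alpha."""
    alpha = None
    for p in reversed(preds):          # alpha is built starting from the last observable
        ep = pull(e, p)
        alpha = ep if alpha is None else conj(ep, pull(t, alpha))
    return sum(px * alpha.get(x, 0.0) for x, px in sigma.items())

go, stay = {"Go-out": 1.0}, {"Stay-in": 1.0}
assert abs(hmm_validity({"Cloudy": 1.0}, t, e, [go, stay, stay]) - 0.1674) < 1e-9
```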

7.4.2 Filtering
Given a sequence ~p of factors, one can compute their validity H |= ~p in a HMM H, as described above. But we can also use these factors to 'guide' the evolution of the HMM. At each stage i the factor p_i is used to update the current state, via backward inference. The new state is then moved forward via the transition function. This process is called filtering, after the Kalman filter from the 1960s that is used for instance in trajectory optimisation in navigation and in rocket control (e.g. for the Apollo program). The system can evolve autonomously via its transition function, but observations at regular intervals can update (correct) the current state.
Definition 7.4.4. Let H = (σ : 1 → X, t : X → X, e : X → Y) be a hidden Markov model and let ~p = p1, …, pn be a list of factors on Y. It gives rise to the filtered sequence of states σ_1, σ_2, …, σ_{n+1} ∈ D(X) following the observe-update-proceed principle:

σ_1 := σ    and    σ_{i+1} := t ≫ (σ_i |_{e ≪ p_i}).

In the terminology of Definition 6.3.1, the definition of the state σi+1 in-
volves both forward and backward inference. Below we show that the final
state σn+1 in the filtered sequence can also be obtained via crossover inference
on a joint state, obtained via the tuple channels (7.10). This fact gives a theoret-
ical justification, but is of little practical relevance — since joint states quickly
become too big to handle.
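The observe-update-proceed scheme also translates directly into code. The sketch below (same home-grown dictionary encoding as before) reproduces the filtered state of Exercise 7.4.1 (4) for the weather HMM (7.9).

```python
# Weather HMM (7.9) as nested dictionaries.
t = {"Cloudy": {"Cloudy": 0.5,  "Sunny": 0.2, "Rainy": 0.3},
     "Sunny":  {"Cloudy": 0.15, "Sunny": 0.8, "Rainy": 0.05},
     "Rainy":  {"Cloudy": 0.2,  "Sunny": 0.2, "Rainy": 0.6}}
e = {"Cloudy": {"Stay-in": 0.5, "Go-out": 0.5},
     "Sunny":  {"Stay-in": 0.2, "Go-out": 0.8},
     "Rainy":  {"Stay-in": 0.9, "Go-out": 0.1}}

def pull(c, q):
    """Predicate transformation c << q."""
    return {x: sum(py * q.get(y, 0.0) for y, py in d.items()) for x, d in c.items()}

def push(c, sigma):
    """State transformation c >> sigma."""
    out = {}
    for x, px in sigma.items():
        for y, py in c[x].items():
            out[y] = out.get(y, 0.0) + px * py
    return out

def update(sigma, p):
    """Conditioning sigma|_p (backward inference)."""
    v = sum(px * p.get(x, 0.0) for x, px in sigma.items())
    return {x: px * p.get(x, 0.0) / v for x, px in sigma.items()}

def filter_states(sigma, t, e, preds):
    """The filtered sequence sigma_1, ..., sigma_{n+1} of Definition 7.4.4."""
    states = [sigma]
    for p in preds:
        sigma = push(t, update(sigma, pull(e, p)))   # observe, update, proceed
        states.append(sigma)
    return states

go, stay = {"Go-out": 1.0}, {"Stay-in": 1.0}
final = filter_states({"Cloudy": 1.0}, t, e, [go, stay, stay])[-1]
assert abs(final["Cloudy"] - 1867/6696) < 1e-9   # cf. Exercise 7.4.1 (4)
```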


Proposition 7.4.5. In the context of Definition 7.4.4,

σ_{n+1} = ((⟨e, t⟩_n ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ 1}) [0, …, 0, 1].

The marginalisation mask [0, …, 0, 1] has n zeros.

Proof. By induction on n ≥ 1. The base case with ⟨e, t⟩_1 = ⟨e, t⟩ is handled as follows.

(⟨e, t⟩_1 ≫ σ)|_{p1 ⊗ 1} [0, 1]
    = π_2 ≫ (⟨e|_{p1}, t⟩ ≫ σ|_{⟨e,t⟩ ≪ (p1 ⊗ 1)})        by Corollary 6.3.12 (2)
    = t ≫ σ|_{e ≪ p1}
    = σ_2.

The induction step requires a bit more work:

(⟨e, t⟩_{n+1} ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ p_{n+1} ⊗ 1} [0, …, 0, 0, 1]
    = π_{n+2} ≫ ((id^n ⊗ ⟨e, t⟩) ≫ (⟨e, t⟩_n ≫ σ))|_{p1 ⊗ ··· ⊗ pn ⊗ p_{n+1} ⊗ 1}
    =(6.2) π_{n+2} ≫ ((id^n ⊗ ⟨e, t⟩)|_{p1 ⊗ ··· ⊗ pn ⊗ p_{n+1} ⊗ 1}
              ≫ (⟨e, t⟩_n ≫ σ)|_{(id^n ⊗ ⟨e,t⟩) ≪ (p1 ⊗ ··· ⊗ pn ⊗ p_{n+1} ⊗ 1)})
    = π_{n+2} ≫ ((id^n ⊗ ⟨e|_{p_{n+1}}, t⟩)
              ≫ (⟨e, t⟩_n ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ ((e ≪ p_{n+1}) ⊗ (t ≪ 1))})
    = (t ◦· π_{n+1}) ≫ (⟨e, t⟩_n ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ (e ≪ p_{n+1})}
    = t ≫ ((((⟨e, t⟩_n ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ 1})|_{1 ⊗ ··· ⊗ 1 ⊗ (e ≪ p_{n+1})}) [0, …, 0, 1])
              by Lemma 6.1.6 (3) and Lemma 4.2.9 (1)
    = t ≫ (((((⟨e, t⟩_n ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ 1}) [0, …, 0, 1])|_{e ≪ p_{n+1}})
              by Lemma 6.1.6 (6)
    =(IH) t ≫ (σ_{n+1}|_{e ≪ p_{n+1}})
    = σ_{n+2}.

The next consequence of the previous proposition may be understood as


Bayes’ rule for HMMs — or more accurately, the product rule for HMMs, see
Proposition 6.1.3.

Corollary 7.4.6. Still in the context of Definition 7.4.4, let q be a predicate on X. Its validity in the final state in the sequence σ_1, …, σ_{n+1}, filtered by factors p1, …, pn, is given by:

σ_{n+1} |= q = (⟨e, t⟩_n ≫ σ |= p1 ⊗ ··· ⊗ pn ⊗ q) / (H |= ~p).

Proof. This follows from Bayes' rule, in Proposition 6.1.3 (1):

σ_{n+1} |= q = ((⟨e, t⟩_n ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ 1}) [0, …, 0, 1] |= q
    = (⟨e, t⟩_n ≫ σ)|_{p1 ⊗ ··· ⊗ pn ⊗ 1} |= 1 ⊗ ··· ⊗ 1 ⊗ q        by (4.8)
    = (⟨e, t⟩_n ≫ σ |= (p1 ⊗ ··· ⊗ pn ⊗ 1) & (1 ⊗ ··· ⊗ 1 ⊗ q))
          / (⟨e, t⟩_n ≫ σ |= p1 ⊗ ··· ⊗ pn ⊗ 1)
    = (⟨e, t⟩_n ≫ σ |= p1 ⊗ ··· ⊗ pn ⊗ q) / (H |= ~p).

The last equation uses Lemma 4.2.9 (1) and Definition 7.4.2.

7.4.3 Finding the most likely sequence of hidden elements


We briefly consider one more question for HMMs: given a sequence of obser-
vations, what is the most likely path of internal positions that produces these
observations? This is also known as the decoding problem. It explicitly asks
for the most likely path, and not for the most likely individual position at each
stage. Thus it involves taking the argmax of a joint state. We concentrate on
giving a definition of the solution, and only sketch how to obtain it efficiently.
We start by constructing the state below, as interpreted string diagram, for a given HMM σ : 1 → X, t : X → X, e : X → Y.

[string diagram (7.14): the joint state ⟨id, t, e⟩_n ≫ σ, with n outgoing wires of type X, one final wire of type X, and n outgoing wires of type Y for the emissions]

We proceed, much like in the beginning of this section, this time not using tuples but triples. We define the above state as ⟨id, t, e⟩_n ≫ σ, for channels ⟨id, t, e⟩_n : X → X^n × X × Y^n. These maps ⟨id, t, e⟩_n are obtained by induction on n:

⟨id, t, e⟩_0 := (X --id--> 1 × X × 1 ≅ X^0 × X × Y^0)
⟨id, t, e⟩_{n+1} := (X --⟨id,t,e⟩_n--> X^n × X × Y^n --id^n ⊗ ⟨id,t,e⟩ ⊗ id^n-->
                         X^n × (X × X × Y) × Y^n ≅ X^{n+1} × X × Y^{n+1}).

Definition 7.4.7. Let (σ : 1 → X, t : X → X, e : X → Y) be a HMM and let ~p = p1, …, pn be a sequence of factors on Y. Given these factors as successive observations, the most likely path of elements of the sample space X, as an n-tuple in X^n, is obtained as:

argmax( ((⟨id, t, e⟩_n ≫ σ)|_{1 ⊗ ··· ⊗ 1 ⊗ 1 ⊗ p_n ⊗ ··· ⊗ p_1}) [1, …, 1, 0, 0, …, 0] ).     (7.15)

This expression involves updating the joint state (7.14) with the factors p_i, in reversed order, and then taking its marginal so that the state with the first n outcomes in X remains. The argmax of the latter state gives the sequence in X^n that we are after.
The expression (7.15) is important conceptually, but not computationally. It is inefficient to compute, first of all because it involves a joint state that grows exponentially in n, and secondly because the conditioning involves normalisation that is irrelevant when we take the argmax.
We give an impression of what (7.15) amounts to for n = 3. We first focus on the expression within argmax(−). The marginalisation and conditioning produce a state on X × X × X which assigns to a triple (x1, x2, x3) the probability:

Σ_{x, y1, y2, y3} (⟨id, t, e⟩_3 ≫ σ)(x1, x2, x3, x, y3, y2, y1) · p3(y3) · p2(y2) · p1(y1),

divided by the normalisation constant ⟨id, t, e⟩_3 ≫ σ |= 1 ⊗ 1 ⊗ 1 ⊗ 1 ⊗ p3 ⊗ p2 ⊗ p1. Up to this normalisation, written ∼ below, the above sum equals:

∼ Σ_{x, y1, y2, y3} σ(x1) · t(x1)(x2) · t(x2)(x3) · t(x3)(x)
        · e(x3)(y3) · e(x2)(y2) · e(x1)(y1) · p3(y3) · p2(y2) · p1(y1)
= σ(x1) · (Σ_{y1} e(x1)(y1) · p1(y1)) · t(x1)(x2) · (Σ_{y2} e(x2)(y2) · p2(y2))
        · t(x2)(x3) · (Σ_{y3} e(x3)(y3) · p3(y3)) · (Σ_x t(x3)(x))
= σ(x1) · (e ≪ p1)(x1) · t(x1)(x2) · (e ≪ p2)(x2) · t(x2)(x3) · (e ≪ p3)(x3).
This approach is known as the Viterbi algorithm, see e.g. [14, 144, 96].
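For point observations, the usual dynamic-programming shortcut keeps, per position x, only the best-scoring path ending in x, instead of the full joint state. The sketch below is our own encoding; the model is the hallway HMM of Exercise 7.4.3 below, and the sketch recovers the most likely path asked for there.

```python
# Hallway HMM of Exercise 7.4.3: positions 1..5, outputs 2 and 3 (wall counts).
t = {1: {1: 0.75, 2: 0.25},
     2: {1: 0.25, 2: 0.5, 3: 0.25},
     3: {2: 0.25, 3: 0.5, 4: 0.25},
     4: {3: 0.25, 4: 0.5, 5: 0.25},
     5: {4: 0.25, 5: 0.75}}
e = {1: {3: 1.0}, 2: {2: 1.0}, 3: {2: 1.0}, 4: {2: 1.0}, 5: {3: 1.0}}

def viterbi(sigma, t, e, obs):
    """Most likely sequence of hidden positions for point observations obs."""
    X = list(t)
    # delta[x]: probability of the best path ending in x, emitting obs so far
    delta = {x: sigma.get(x, 0.0) * e[x].get(obs[0], 0.0) for x in X}
    back = []
    for y in obs[1:]:
        prev, ptr, new = delta, {}, {}
        for x in X:
            best = max(X, key=lambda z: prev[z] * t[z].get(x, 0.0))
            ptr[x] = best
            new[x] = prev[best] * t[best].get(x, 0.0) * e[x].get(y, 0.0)
        back.append(ptr)
        delta = new
    last = max(X, key=lambda x: delta[x])   # ties resolved towards smaller positions
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

assert viterbi({3: 1.0}, t, e, [2, 2, 3, 2, 3, 3]) == [3, 2, 1, 2, 1, 1]
```

The final scores here come in ties (there is a mirror-image path), which is why the exercise asks whether there is an alternative.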

Exercises
7.4.1 Consider the HMM example (7.9) with initial state σ = 1|Cloudy⟩.
1 Compute successive states t^n ≫ σ for n = 0, 1, 2, 3.


2 Compute successive observations e ≫ (t^n ≫ σ) for n = 0, 1, 2, 3.
3 Check that the validity of the sequence of (point-predicate) observations Go-out, Stay-in, Stay-in is 0.1674.
4 Show that filtering, as in Definition 7.4.4, with these same three (point) observations yields as final outcome:

1867/6696 |Cloudy⟩ + 347/1395 |Sunny⟩ + 15817/33480 |Rainy⟩
    ≈ 0.2788|Cloudy⟩ + 0.2487|Sunny⟩ + 0.4724|Rainy⟩.

7.4.2 Consider the transition channel t associated with the HMM example (7.9). Check that in order to find a stationary state σ∞ = x|Cloudy⟩ + y|Sunny⟩ + z|Rainy⟩ one has to solve the equations:

x = 0.5x + 0.15y + 0.2z
y = 0.2x + 0.8y + 0.2z
z = 0.3x + 0.05y + 0.6z

Deduce that σ∞ = 0.25|Cloudy⟩ + 0.5|Sunny⟩ + 0.25|Rainy⟩ and double-check that t ≫ σ∞ = σ∞.
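Alternatively one can approximate σ∞ numerically, by iterating state transformation until it stabilises (power iteration); the weather chain of (7.9) is irreducible and aperiodic, so this converges. A sketch, in the dictionary encoding used earlier:

```python
t = {"Cloudy": {"Cloudy": 0.5,  "Sunny": 0.2, "Rainy": 0.3},
     "Sunny":  {"Cloudy": 0.15, "Sunny": 0.8, "Rainy": 0.05},
     "Rainy":  {"Cloudy": 0.2,  "Sunny": 0.2, "Rainy": 0.6}}

def push(c, sigma):
    """State transformation c >> sigma."""
    out = {}
    for x, px in sigma.items():
        for y, py in c[x].items():
            out[y] = out.get(y, 0.0) + px * py
    return out

sigma = {"Cloudy": 1.0}
for _ in range(200):        # power iteration: t^n >> sigma approaches sigma_inf
    sigma = push(t, sigma)

assert abs(sigma["Cloudy"] - 0.25) < 1e-9
assert abs(sigma["Sunny"] - 0.5) < 1e-9
assert abs(sigma["Rainy"] - 0.25) < 1e-9
```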
7.4.3 (The set-up of this exercise is copied from machine learning lecture
notes of Doina Precup.) Consider a 5-state hallway of the form:

1 2 3 4 5

Thus we use a space X = {1, 2, 3, 4, 5} of positions, together with a


space Y = {2, 3} of outputs, for the number of surrounding walls. The
transition and emission channels t : X → X and e : X → Y for a robot
in this hallway are given by:

t(1) = 3/4|1⟩ + 1/4|2⟩                 e(1) = 1|3⟩
t(2) = 1/4|1⟩ + 1/2|2⟩ + 1/4|3⟩        e(2) = 1|2⟩
t(3) = 1/4|2⟩ + 1/2|3⟩ + 1/4|4⟩        e(3) = 1|2⟩
t(4) = 1/4|3⟩ + 1/2|4⟩ + 1/4|5⟩        e(4) = 1|2⟩
t(5) = 1/4|4⟩ + 3/4|5⟩                 e(5) = 1|3⟩.

We use σ = 1|3⟩ as start state, and we have a sequence of observations α = [2, 2, 3, 2, 3, 3], formally as a sequence of point predicates [1_2, 1_2, 1_3, 1_2, 1_3, 1_3].

1 Check that (σ, t, e) |= α = 3/512.


2 Next we filter with the sequence α. Show that it leads successively to the following states σ_i as in Definition 7.4.4:

σ_1 := σ = 1|3⟩
σ_2 = 1/4|2⟩ + 1/2|3⟩ + 1/4|4⟩
σ_3 = 1/16|1⟩ + 1/4|2⟩ + 3/8|3⟩ + 1/4|4⟩ + 1/16|5⟩
σ_4 = 3/8|1⟩ + 1/8|2⟩ + 1/8|4⟩ + 3/8|5⟩
σ_5 = 1/8|1⟩ + 1/4|2⟩ + 1/4|3⟩ + 1/4|4⟩ + 1/8|5⟩
σ_6 = 3/8|1⟩ + 1/8|2⟩ + 1/8|4⟩ + 3/8|5⟩
σ_7 = 3/8|1⟩ + 1/8|2⟩ + 1/8|4⟩ + 3/8|5⟩.

3 Show that a most likely sequence of states giving rise to the se-
quence of observations α is [3, 2, 1, 2, 1, 1]. Is there an alternative?
7.4.4 Apply Bayes' rule to the validity formulation (7.13) in order to prove the correctness of the following HMM validity algorithm.

(σ, t, e) |= [] := 1
(σ, t, e) |= [p] ++ ~p := (σ |= e ≪ p) · ((t ≫ σ|_{e ≪ p}, t, e) |= ~p).

(Notice the connection with filtering from Definition 7.4.4.)


7.4.5 A random walk is a Markov model d : Z → Z given by d(n) = r|n − 1⟩ + (1 − r)|n + 1⟩ for some r ∈ [0, 1]. This captures the idea that a step-to-the-left or a step-to-the-right are the only possible transitions. (The letter 'd' hints at modeling a drunkard.)
1 Start from initial state σ = 1|0⟩ ∈ D(Z) and describe a couple of subsequent states d ≫ σ, d² ≫ σ, d³ ≫ σ, … Which pattern emerges?
2 Prove that for K ∈ N,

d^K ≫ σ = Σ_{0≤k≤K} bn[K](1 − r)(k) |2k − K⟩
        = Σ_{0≤k≤K} (K choose k) (1 − r)^k r^{K−k} |2k − K⟩.
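The binomial pattern of part (2) can be checked numerically; the sketch below (with an arbitrarily chosen r and K) compares K-fold state transformation with the closed formula.

```python
from math import comb

def walk_step(sigma, r):
    """One step of the random-walk channel d(n) = r|n-1> + (1-r)|n+1>."""
    out = {}
    for n, p in sigma.items():
        out[n - 1] = out.get(n - 1, 0.0) + p * r
        out[n + 1] = out.get(n + 1, 0.0) + p * (1 - r)
    return out

r, K = 0.3, 6
sigma = {0: 1.0}                 # initial state 1|0>
for _ in range(K):
    sigma = walk_step(sigma, r)

for k in range(K + 1):           # compare with the binomial expression bn[K](1 - r)
    assert abs(sigma[2*k - K] - comb(K, k) * (1 - r)**k * r**(K - k)) < 1e-12
```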

7.4.6 A Markov chain X → X has a ‘one-stage history’ only, in the sense


that the state at stage n + 1 depends only on the state at stage n. The
following situation from [144, Chap. III, Ex. 4.4] involves a two-stage
history.
Suppose that whether or not it rains today depends on weather conditions through the last two days. Specifically, suppose that if it has rained for the past two days, then it will rain tomorrow with probability 0.7; if it rained today but not yesterday, then it will rain tomorrow with probability 0.5; if

it rained yesterday but not today, then it will rain tomorrow with probability 0.4; if it has not rained in the past two days, then it will rain tomorrow with probability 0.2.
1 Write R = {r, r⊥ } for the state space of rain and no-rain outcomes,
and capture the above probabilities via a channel c : R × R → R.
2 Turn this channel c into a Markov chain ⟨π2, c⟩ : R × R → R × R, where the second component of R × R describes whether or not it rains on the current day, and the first component on the previous day. Describe ⟨π2, c⟩ both as a function and as a string diagram.
3 Generalise this approach to a history of length N > 1: turn a chan-
nel X N → X into a Markov model X N → X N , where the relevant
history is incorporated into the sample space.
7.4.7 Use the approach of the previous exercise to turn a hidden Markov model into a Markov model.
7.4.8 In Definition 7.4.4 we have seen filtering for hidden Markov models,
starting from a sequence of factors ~p as observations. In practice these
factors are often point predicates 1yi for a sequence of elements ~y of
the visible space Y. Show that in that case one can describe filtering
via Bayesian inversion as follows, for a hidden Markov model with
transition channel t : X → X, emission channel e : X → Y and initial
state σ ∈ D(X).
σ1 = σ and σi+1 = (t ◦· e†σi )(yi ).
7.4.9 Let H1 and H2 be two HMMs. Define their parallel product H1 ⊗ H2
using the tensor operation ⊗ on states and channels.

7.5 Disintegration
In Subsections 1.3.3 and 1.4.3 we have seen how a binary relation R ∈ P(A×B)
on A × B corresponds to a P-channel A → P(B), and similarly how a multiset
ψ ∈ M(A × B) corresponds to an M-channel A → M(B). This phenomenon
was called: extraction of a channel from a joint state. This section describes the
analogue for probabilistic binary/joint states and channels (like in [52]). It turns
out to be more subtle because probabilistic extraction requires normalisation
— in order to ensure the unitality requirement of a D-channel: multiplicities
must add up to one.
In the probabilistic case this extraction is also called disintegration. It corresponds to a well-known phenomenon, namely that one can write a joint probability P(x, y) in terms of a conditional probability via what is often called


the chain rule:


P(x, y) = P(y | x) · P(x).

The mapping x ↦ P(y | x) is the channel involved, and P(x) refers to the first marginal of P(x, y). We have already seen this extraction of a channel via daggers, in Theorem 6.6.5.
Our current description of disintegration (and daggers) makes systematic
use of the graphical calculus that we introduced earlier in this chapter. The
formalisation of disintegration in category theory is not there (yet), but the
closely related concept of Bayesian inversion — the dagger of a channel —
has a nice categorical description, see Section 7.8 below.
We start by formulating the property of ‘full support’ in string diagrammatic
terms. It is needed as pre-condition in disintegration.

Definition 7.5.1. Let X be a non-empty finite set. We say that a distribution or
multiset ω on X has full support if ω(x) > 0 for each x ∈ X.
More generally, we say that a D- or M-channel c : Y → X has full support
if the distribution/multiset c(y) has full support for each y. This thus means that
c(y)(x) > 0 for all y ∈ Y and x ∈ X.

Lemma 7.5.2. Let ω ∈ D(X), where X is non-empty and finite. The following
three points are equivalent:

1 ω has full support;


2 there is a predicate p on X and a non-zero probability r ∈ (0, 1] such that
ω(x) · p(x) = r for each x ∈ X.
3 there is a predicate p on X such that ω| p is the uniform distribution on X.

Proof. Suppose that the finite set X has N ≥ 1 elements. For the implication
(1) ⇒ (2), let ω ∈ D(X) have full support, so that ω(x) > 0 for each x ∈ X.
Since ω(x) ≤ 1, we get 1/ω(x) ≥ 1, so that t ≔ Σ_x 1/ω(x) ≥ N and t · ω(x) ≥ 1,
for each x ∈ X. Take r = 1/t ≤ 1/N ≤ 1, and put p(x) = 1/(t · ω(x)). Then:

    ω(x) · p(x) = ω(x) · 1/(t · ω(x)) = 1/t = r.

Next, for (2) ⇒ (3), suppose ω(x) · p(x) = r, for some predicate p and
non-zero probability r. Then ω |= p = Σ_x ω(x) · p(x) = N · r. Hence:

    ω| p (x) = ( ω(x) · p(x) ) / ( ω |= p ) = r / (N · r) = 1/N = unif_X(x).

For the final step (3) ⇒ (1), let ω| p = unif_X. This requires that ω |= p ≠ 0
and thus that ω(x) · p(x) = (ω |= p) / N > 0 for each x ∈ X. But then ω(x) > 0. □
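The (1) ⇒ (3) construction in the proof of Lemma 7.5.2 can be replayed numerically. The Python sketch below (the distribution ω is an arbitrary full-support illustration) builds the predicate p with ω(x) · p(x) = 1/t and checks that conditioning with it yields the uniform distribution.

```python
from fractions import Fraction as F

omega = {'a': F(1, 2), 'b': F(1, 3), 'c': F(1, 6)}   # full support

t = sum(1 / w for w in omega.values())               # t = sum of 1/omega(x)
p = {x: 1 / (t * w) for x, w in omega.items()}       # omega(x) * p(x) = 1/t

# conditioning omega|_p is uniform
validity = sum(omega[x] * p[x] for x in omega)       # omega |= p
cond = {x: omega[x] * p[x] / validity for x in omega}
print(cond)   # every value equals 1/3
```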

This result allows us to express the full support property in diagrammatic
terms. This will be needed as precondition later on.

Definition 7.5.3. A state box s has full support if there is a predicate box p
and a scalar box r, for some non-zero r ∈ (0, 1], with an equation as on the
left below.

    [two string-diagram equations, not reproduced here: the left one relates
    the state box s, capped with the predicate box p, to the scalar box r; the
    right one relates the channel box f, with wires A and B, capped with a
    predicate box q, to a predicate box p]

Similarly, we say that a channel box f has full support if there are predicates
p, q with an equation as above on the right.

Strictly speaking, we didn’t have to distinguish states and channels in this
definition, since the state-case is a special instance of the channel-case,
namely for B = 1. In the sequel we shall no longer make this distinction.
We now come to the main definition of this section. It involves turning a
conditional probability P(b, c | a) into P(c | b, a) such that

    P(b, c | a) = P(c | b, a) · P(b | a).

This is captured diagrammatically in (7.16) below.

Definition 7.5.4. Consider a box f from A to B, C for which the marginal
f[1, 0] has full support. We say that f admits disintegration if there is a
unique ‘disintegration’ box f′ from B, A to C satisfying the equation:

    (id ⊗ f′) ◦· (∆ ⊗ id ) ◦· ( f[1, 0] ⊗ id ) ◦· ∆ = f                (7.16)

[Equation (7.16) is drawn as a string diagram: the incoming A-wire is copied,
f[1, 0] is applied to one copy, its B-output is copied again, and f′ takes the
wires B, A to C.]
Uniqueness of f′ in this definition means: for each box g from B, A to C and
for each box h from A to B with full support one has:

    (id ⊗ g) ◦· (∆ ⊗ id ) ◦· (h ⊗ id ) ◦· ∆ = f   =⇒   h = f[1, 0] and g = f′.

The first equation h = f[1, 0] in the conclusion is obtained simply by applying
discard to the right-most outgoing wire in the assumption — on the left and on
the right of the equation — and using that g is unital. The second equation
g = f′ in the conclusion expresses uniqueness of the disintegration box f′.
We shall use the notation dis_1( f ) or f[0, 1 | 1, 0] for this disintegration
box f′. This notation will be explained in greater generality below.

The first thing to do is to check that this newly introduced construction is
sound.

Proposition 7.5.5. Disintegration as described in Definition 7.5.4 exists for
probabilistic channels.
 
Proof. Let f : A → B × C be a channel such that f[1, 0] : A → B has full
support. The latter means that for each a ∈ A and b ∈ B one has f[1, 0](a)(b) =
Σ_{c∈C} f(a)(b, c) ≠ 0. Then we can define f′ : B × A → C as:

    f′(b, a) ≔ Σ_c ( f(a)(b, c) / f[1, 0](a)(b) ) | c ⟩.               (7.17)

By the full support assumption this is well-defined.


The left-hand side of Equation (7.16) evaluates as:

    ((id ⊗ f′) ◦· (∆ ⊗ id ) ◦· ( f[1, 0] ⊗ id ) ◦· ∆)(a)(b, c)
        = Σ_{x,y} id(y)(b) · f′(y, x)(c) · f[1, 0](a)(y) · id(x)(a)
        = f′(b, a)(c) · f[1, 0](a)(b)
        = f(a)(b, c).

For uniqueness, assume (id ⊗ g) ◦· (∆ ⊗ id ) ◦· (h ⊗ id ) ◦· ∆ = f. As already
mentioned, we can obtain h = f[1, 0] via diagrammatic reasoning. Here we reason
element-wise. By assumption, g(b, a)(c) · h(a)(b) = f(a)(b, c). This gives:

    h(a)(b) = 1 · h(a)(b) = ( Σ_c g(b, a)(c) ) · h(a)(b) = Σ_c g(b, a)(c) · h(a)(b)
            = Σ_c f(a)(b, c)
            = f[1, 0](a)(b).

Hence h = f[1, 0]. But then g = f′ since:

    g(b, a)(c) = f(a)(b, c) / h(a)(b) = f(a)(b, c) / f[1, 0](a)(b) = f′(b, a)(c).
Without the full support requirement, disintegrations may still exist, but they
are not unique, see Example 7.6.1 (2) for an illustration.
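For finite channels, formula (7.17) is a one-liner per entry. The Python sketch below (the channel f is an invented example; function names are ours) computes f′ and checks the reconstruction equation (7.16) pointwise.

```python
from fractions import Fraction as F

# f : A -> D(B x C), as a dict mapping a to a distribution on pairs (b, c)
f = {0: {('b1', 'c1'): F(1, 4), ('b1', 'c2'): F(1, 4), ('b2', 'c1'): F(1, 2)},
     1: {('b1', 'c1'): F(1, 3), ('b2', 'c2'): F(2, 3)}}

def marginal1(f):
    # f[1, 0] : A -> D(B)
    m = {}
    for a, d in f.items():
        m[a] = {}
        for (b, c), p in d.items():
            m[a][b] = m[a].get(b, F(0)) + p
    return m

def dis(f):
    # f'(b, a) = sum_c ( f(a)(b,c) / f[1,0](a)(b) ) |c>  -- Equation (7.17);
    # requires full support of f[1, 0] on the entries used
    m = marginal1(f)
    fp = {}
    for a, d in f.items():
        for (b, c), p in d.items():
            fp.setdefault((b, a), {})[c] = p / m[a][b]
    return fp

fp, m = dis(f), marginal1(f)
# reconstruction: f(a)(b, c) = f'(b, a)(c) * f[1, 0](a)(b)
assert all(fp[(b, a)][c] * m[a][b] == p
           for a, d in f.items() for (b, c), p in d.items())
```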
Next we elaborate on the notation that we will use for disintegration.
Remark 7.5.6. In traditional notation in probability theory one simply omits
variables to express marginalisation. For instance, for a distribution ω ∈
D(X_1 × X_2 × X_3 × X_4), considered as a function ω(x_1, x_2, x_3, x_4) in
four variables x_i, one writes:

    ω(x_2, x_3)    for the marginal    Σ_{x_1, x_4} ω(x_1, x_2, x_3, x_4).

We have been using masks instead: lists containing only the elements 0 and 1.
We then write ω[0, 1, 1, 0] ∈ D(X_2 × X_3) to express the above marginalisation,
with a 0 (resp. a 1) at position i in the mask meaning that the i-th variable in
the distribution is discarded (resp. kept).
We like to use a similar mask-style notation for disintegration, mimicking
the traditional notation ω(x1 , x4 | x2 , x3 ). This requires two masks, separated
by the sign ‘|’ for conditional probability. We sketch how it works for a box f
from A to B1 , . . . , Bn in a string diagram. The notation
f [N | M]
will be used for a disintegration channel, in the following manner:
1 masks M, N must both be of length n;


2 M, N must be disjoint: there is no position i with a 1 both in M and in N;
3 the marginal f[M] must have full support;
4 the domain of the disintegration box f[N | M] is B⃗ ∩ M, A, where B⃗ ∩ M
contains precisely those B_i with a 1 in M at position i;
5 the codomain of f[N | M] is B⃗ ∩ N;
6 f[N | M] is unique in satisfying an “obvious” adaptation of Equation (7.16),
in which f[M ∪ N] is equal to a string diagram consisting of f[M] suitably
followed by f[N | M].

How this really works is best illustrated via a concrete example. Consider a
box f with incoming wire A and five outgoing wires B_1, B_2, B_3, B_4, B_5. We
elaborate the disintegration f[1, 0, 0, 0, 1 | 0, 1, 0, 1, 0], which is a box
with incoming wires B_2, B_4, A and outgoing wires B_1, B_5. This disintegration
box is unique in satisfying the corresponding instance of Equation (7.16),
drawn as a string diagram which is not reproduced here. In traditional notation
one could express this equation as:

    f(b_1, b_5 | b_2, b_4, a) · f(b_2, b_4 | a) = f(b_1, b_2, b_4, b_5 | a).

Two points are still worth noticing.

• It is not required that at position i there is a 1 either in mask M or in
mask N when we form f[N | M]. If there is a 0 at i both in M and in N, then the
wire at i is discarded altogether. This happens in the above illustration for
the third wire: the string diagram on the right-hand side of the equation is
f[M ∪ N] = f[1, 1, 0, 1, 1].
 

 
• The above disintegration f 1, 0, 0, 0, 1 0, 1, 0, 1, 0 can be obtained from the
‘simple’, one-wire version of disintegration in Definition 7.5.4 by first suit-
ably rearranging wires and combining them via products. How to do this
precisely is left as an exercise (see below).
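For joint states — boxes without incoming wire — the mask notation of this remark is straightforward to implement. In the Python sketch below (function names are ours) marg computes ω[M] and dis computes ω[N | M], assuming ω[M] has full support on the entries that occur.

```python
from fractions import Fraction as F

def select(xs, mask):
    # keep the coordinates with a 1 in the mask
    return tuple(x for x, m in zip(xs, mask) if m)

def marg(omega, M):
    # the marginal omega[M]
    out = {}
    for xs, p in omega.items():
        key = select(xs, M)
        out[key] = out.get(key, F(0)) + p
    return out

def dis(omega, N, M):
    # the disintegration omega[N | M]: a channel from the
    # M-coordinates to the N-coordinates
    base = marg(omega, M)
    chan = {}
    for xs, p in omega.items():
        m, n = select(xs, M), select(xs, N)
        d = chan.setdefault(m, {})
        d[n] = d.get(n, F(0)) + p / base[m]
    return chan

omega = {(0, 0, 0): F(1, 4), (0, 1, 1): F(1, 4),
         (1, 0, 1): F(1, 4), (1, 1, 1): F(1, 4)}
c = dis(omega, (0, 0, 1), (1, 0, 0))   # condition on the 1st, output the 3rd
print(c)
```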

Disintegration is not a ‘compositional’ operation that can be obtained by
combining other string-diagrammatic primitives. The reason is that
disintegration involves normalisation, via the division in (7.17). Still it
would be nice to be able to use disintegration in diagrammatic form. For this
purpose one may use a trick: in the setting of Definition 7.5.4 we ‘bend’ the
relevant wire downwards and put a gray box around the result, in order to
suggest that its interior is closed off and has become inaccessible. Thus the
disintegration f[0, 1 | 1, 0], from B, A to C, is written as the box f with its
B-wire bent downwards, inside such a shaded box. This ‘shaded box’ notation can
also be used for more complicated forms of disintegration, as described above.
This notation is useful to express some basic properties of disintegration, see
Exercise 7.5.4.

Exercises
7.5.1 1 Prove that if a joint state has full support, then each of its
        marginals has full support.
      2 Consider the joint state ω = 1/2| a, b ⟩ + 1/2| a⊥, b⊥ ⟩ ∈ D(A × B)
        for A = {a, a⊥} and B = {b, b⊥}. Check that both marginals
        ω[1, 0] ∈ D(A) and ω[0, 1] ∈ D(B) have full support, but ω itself
        does not. Conclude that the converse of the previous point does not
        hold.
7.5.2 Consider the box f in Remark 7.5.6. Write down the equation for the
      disintegration f[0, 1, 0, 1, 0 | 1, 0, 1, 0, 1]. Formulate also what
      uniqueness means.
 
7.5.3 Show how to obtain the disintegration f[0, 0, 0, 1, 1 | 0, 1, 1, 0, 0]
      in Remark 7.5.6 from the formulation in Definition 7.5.4 via rearranging
      and combining wires (via ×).
7.5.4 (From [52]) Prove the following ‘sequential’ and ‘parallel’ properties
      of disintegration, stated as string-diagram equations which are not
      reproduced here. It is best to give a diagrammatic proof, using
      uniqueness of disintegration; but you may also check that these
      equations are sound, i.e. hold for all interpretations of the boxes as
      probabilistic channels (with appropriate full support).

7.5.5 Let f : A × B → A × C be a channel, where A, B, C are finite sets.
      Define the “probabilistic trace” prtr( f ) : B → C via the string
      diagram (7.18), not reproduced here, built using the shaded-box
      disintegration notation.
      1 Check that:

          prtr( f )(b)(c) = Σ_a (1/#A) · f(a, b)(a, c) / Σ_y f(a, b)(a, y),

        where #A is the number of elements of A.


      A categorically oriented reader might now ask if the above definition
      prtr( f ) yields a proper ‘trace’ construction in the symmetric monoidal
      Kleisli category Chan(D) of probabilistic channels, but this is not the
      case: the so-called dinaturality condition fails. This will be
      illustrated in the next few points.
      2 Take space A = {a, b, c} and 2 = {0, 1} with channels f : A → 2 × 2
        and g : 2 → A defined by:

          f(a) = 1/4| 0, 0 ⟩ + 3/4| 1, 0 ⟩        g(0) = 1/3| a ⟩ + 2/3| c ⟩
          f(b) = 2/5| 0, 0 ⟩ + 3/5| 1, 1 ⟩        g(1) = 1/2| a ⟩ + 1/2| b ⟩
          f(c) = 1/2| 0, 1 ⟩ + 1/2| 1, 0 ⟩


        The aim is to show an inequality of states:

          prtr( (g ⊗ id ) ◦· f )  ≠  prtr( f ◦· g ).

        Both sides are an instance of (7.18) with B = 1.


      3 Check that (g ⊗ id ) ◦· f : A → A × 2 is:

          a ↦ 11/24| a, 0 ⟩ + 3/8| b, 0 ⟩ + 1/6| c, 0 ⟩
          b ↦ 2/15| a, 0 ⟩ + 3/10| a, 1 ⟩ + 3/10| b, 1 ⟩ + 4/15| c, 0 ⟩
          c ↦ 1/4| a, 0 ⟩ + 1/6| a, 1 ⟩ + 1/4| b, 0 ⟩ + 1/3| c, 1 ⟩.

        And that f ◦· g : 2 → 2 × 2 is:

          0 ↦ 1/12| 0, 0 ⟩ + 1/3| 0, 1 ⟩ + 7/12| 1, 0 ⟩
          1 ↦ 13/40| 0, 0 ⟩ + 3/8| 1, 0 ⟩ + 3/10| 1, 1 ⟩.

 
      4 Now show that the disintegration ((g ⊗ id ) ◦· f)[0, 1 | 1, 0] :
        A × A → 2 is:

          (a, a) ↦ 1| 0 ⟩   (b, a) ↦ 4/13| 0 ⟩ + 9/13| 1 ⟩   (c, a) ↦ 3/5| 0 ⟩ + 2/5| 1 ⟩
          (a, b) ↦ 1| 0 ⟩   (b, b) ↦ 1| 1 ⟩                   (c, b) ↦ 1| 0 ⟩
          (a, c) ↦ 1| 0 ⟩   (b, c) ↦ 1| 0 ⟩                   (c, c) ↦ 1| 1 ⟩.

        And that ( f ◦· g)[0, 1 | 1, 0] : 2 × 2 → 2 is:

          (0, 0) ↦ 1/5| 0 ⟩ + 4/5| 1 ⟩   (1, 0) ↦ 1| 0 ⟩
          (0, 1) ↦ 1| 0 ⟩                 (1, 1) ↦ 5/9| 0 ⟩ + 4/9| 1 ⟩.

      5 Conclude that:

          prtr( (g ⊗ id ) ◦· f ) = 1/3| 0 ⟩ + 2/3| 1 ⟩,

        whereas:

          prtr( f ◦· g ) = 17/45| 0 ⟩ + 28/45| 1 ⟩.
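The two traces of Exercise 7.5.5 can be computed mechanically from the formula in item 1. The Python sketch below (helper names are ours) reproduces the outcomes in item 5, confirming the failure of dinaturality.

```python
from fractions import Fraction as F

f = {'a': {(0, 0): F(1, 4), (1, 0): F(3, 4)},
     'b': {(0, 0): F(2, 5), (1, 1): F(3, 5)},
     'c': {(0, 1): F(1, 2), (1, 0): F(1, 2)}}
g = {0: {'a': F(1, 3), 'c': F(2, 3)}, 1: {'a': F(1, 2), 'b': F(1, 2)}}

def g_tensor_id_after_f(x):
    # ((g tensor id) after f)(x) : distribution on A x 2
    out = {}
    for (u, v), p in f[x].items():
        for a, q in g[u].items():
            out[(a, v)] = out.get((a, v), F(0)) + p * q
    return out

def f_after_g(u):
    # (f after g)(u) : distribution on 2 x 2
    out = {}
    for a, p in g[u].items():
        for yz, q in f[a].items():
            out[yz] = out.get(yz, F(0)) + p * q
    return out

def prtr(h, dom):
    # the formula of item 1 with B = 1: average over a of
    # h(a) conditioned on its first output being equal to a
    out = {}
    for a in dom:
        row = {c: p for (x, c), p in h(a).items() if x == a}
        tot = sum(row.values())
        for c, p in row.items():
            out[c] = out.get(c, F(0)) + p / (tot * len(dom))
    return out

left = prtr(g_tensor_id_after_f, ['a', 'b', 'c'])
right = prtr(f_after_g, [0, 1])
# left is 1/3|0> + 2/3|1>, right is 17/45|0> + 28/45|1>
```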


7.6 Disintegration for states


The previous section has described disintegration for channels, but in the liter-
ature it is mostly used for states only, that is, for boxes without incoming wires.
This section will look at this special case, following [21]. It will reformulate
the channel extraction from Theorem 6.6.5 via string diagrams.
In its most basic form disintegration involves extracting a channel from a
(binary) joint state, so that the state can be reconstructed as a graph — as in
Definition 2.5.6. This works as follows.

    From a joint state ω on A, B one extracts a channel c from A to B, in such
    a way that the graph of c over the first marginal of ω is ω again; this is
    Equation (7.19), displayed as a string diagram.

Using the notation for disintegration from Remark 7.5.6 we can write the
extracted channel c as c = ω[0, 1 | 1, 0]. Occasionally we also write it as
c = dis_1(ω). We can express Equation (7.19) as graph equation

    ω = (id ⊗ ω[0, 1 | 1, 0]) = (∆ = ω[1, 0])
      = ⟨id, ω[0, 1 | 1, 0]⟩ = ω[1, 0],                                (7.20)

like in Theorem 6.6.5. This is a special case of Equation (7.16), where the
incoming wire is trivial (equal to 1). We should not forget the side-condition
of disintegration, which in this case is: the first marginal ω[1, 0] must have
full support. Then one can define the channel c = dis_1(ω) = ω[0, 1 | 1, 0] :
A → B as disintegration, explicitly, as a special case of (7.17):

    c(a) ≔ Σ_{b∈B} ( ω(a, b) / Σ_y ω(a, y) ) | b ⟩
         = Σ_{b∈B} ( ω(a, b) / ω[1, 0](a) ) | b ⟩.                     (7.21)

From a joint state ω ∈ D(X × Y) we can extract a channel ω[0, 1 | 1, 0] : X → Y
and also a channel ω[1, 0 | 0, 1] : Y → X in the opposite direction, provided
that ω’s marginals have full support, see again Theorem 6.6.5. The direction of
the channels is thus in a certain sense arbitrary, and does not reflect any
form of causality, see Chapter ??.


Example 7.6.1. We shall look at two examples, involving spaces A = {a, a⊥}
and B = {b, b⊥}.

1 Consider the following state ω ∈ D(A × B),

      ω = 1/4| a, b ⟩ + 1/2| a, b⊥ ⟩ + 1/4| a⊥, b⊥ ⟩.

  We have as first marginal σ ≔ ω[1, 0] = 3/4| a ⟩ + 1/4| a⊥ ⟩, with full
  support. The extracted channel c ≔ ω[0, 1 | 1, 0] : A → B is given by:

      c(a) = ω(a, b)/σ(a) | b ⟩ + ω(a, b⊥)/σ(a) | b⊥ ⟩
           = (1/4)/(3/4) | b ⟩ + (1/2)/(3/4) | b⊥ ⟩ = 1/3| b ⟩ + 2/3| b⊥ ⟩
      c(a⊥) = ω(a⊥, b)/σ(a⊥) | b ⟩ + ω(a⊥, b⊥)/σ(a⊥) | b⊥ ⟩
            = 0/(1/4) | b ⟩ + (1/4)/(1/4) | b⊥ ⟩ = 1| b⊥ ⟩.

  Then indeed, ⟨id, c⟩ = σ = ω.

2 Now let’s start from:

      ω = 1/3| a, b ⟩ + 2/3| a, b⊥ ⟩.

  Then σ ≔ ω[1, 0] = 1| a ⟩. It does not have full support. Let τ ∈ D(B) be an
  arbitrary state. We define c : A → B as:

      c(a) = 1/3| b ⟩ + 2/3| b⊥ ⟩    and    c(a⊥) = τ.

  We then still get ⟨id, c⟩ = σ = ω, no matter what τ is.

  More generally, if we don’t have full support, disintegrations still exist,
  but they are not unique. We generally avoid such non-uniqueness by requiring
  full support.
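The computation in Example 7.6.1 (1) is a direct instance of (7.21); the following Python sketch redoes it with exact fractions (writing a+ and b+ for a⊥ and b⊥).

```python
from fractions import Fraction as F

omega = {('a', 'b'): F(1, 4), ('a', 'b+'): F(1, 2), ('a+', 'b+'): F(1, 4)}

sigma = {}
for (x, y), p in omega.items():                 # first marginal omega[1, 0]
    sigma[x] = sigma.get(x, F(0)) + p

c = {}
for (x, y), p in omega.items():                 # extracted channel, (7.21)
    c.setdefault(x, {})[y] = p / sigma[x]

# reconstruction <id, c> applied to sigma gives omega back
assert all(sigma[x] * c[x][y] == p for (x, y), p in omega.items())
```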

Example 7.6.2. The natural join ⋈ is a basic construction in database theory
that makes it possible to join two databases which coincide on their overlap.
Such natural joins are used in a probabilistic setting in [1, 16] — in
particular in relation to Bell tables, see Exercises 2.2.11, ??.

Let ω ∈ D(X ⊗ Y) and ρ ∈ D(X ⊗ Z) be two joint states with equal X-marginal:
ω[1, 0] = ρ[1, 0]. A natural join, if it exists, is a state ω ⋈ ρ ∈
D(X ⊗ Y ⊗ Z) which marginalises both to ω and to ρ, as in:

    (ω ⋈ ρ)[1, 1, 0] = ω    and    (ω ⋈ ρ)[1, 0, 1] = ρ.

We show how such natural joins can be constructed via disintegration of states.
Let’s assume we have joint states ω and ρ as described above, with common
marginal written as σ ≔ ω[1, 0] = ρ[1, 0]. We extract channels:

    c ≔ ω[0, 1 | 1, 0] : X → Y        d ≔ ρ[0, 1 | 1, 0] : X → Z.


Now we define:

    ω ⋈ ρ ≔ ⟨id, c, d⟩ = σ,                                            (7.22)

drawn as a string diagram with outgoing wires X, Y, Z, in which σ’s X-wire is
copied and the channels c and d are applied to two of the copies. The equation
(ω ⋈ ρ)[1, 1, 0] = ω can be obtained easily via diagrammatic reasoning:
discarding the Z-wire removes the box d, leaving ⟨id, c⟩ = σ = ω. Similarly one
proves (ω ⋈ ρ)[1, 0, 1] = ρ. For a concrete instantiation of this
construction, see Exercise 7.6.6. Such natural joins are typically non-trivial,
but the above construction (7.22) gives a clear recipe. The graphical approach
clearly describes the relevant flows and is especially useful in more
complicated situations, with multiple states which agree on multiple marginals.
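The recipe (7.22) is easy to execute. The Python sketch below (the dis helper is ours) runs it on the data of Exercise 7.6.6 below and checks two entries of the resulting join.

```python
from fractions import Fraction as F

omega = {('x1', 'y1'): F(1, 4), ('x1', 'y2'): F(1, 4),
         ('x2', 'y1'): F(1, 6), ('x2', 'y2'): F(1, 6), ('x2', 'y3'): F(1, 6)}
rho = {('x1', 'z1'): F(1, 12), ('x1', 'z2'): F(1, 3), ('x1', 'z3'): F(1, 12),
       ('x2', 'z1'): F(1, 8), ('x2', 'z2'): F(1, 8), ('x2', 'z3'): F(1, 4)}

def dis(joint):
    # first marginal and extracted channel, as in (7.21)
    sigma, chan = {}, {}
    for (x, y), p in joint.items():
        sigma[x] = sigma.get(x, F(0)) + p
    for (x, y), p in joint.items():
        chan.setdefault(x, {})[y] = p / sigma[x]
    return sigma, chan

sigma, c = dis(omega)
sigma2, d = dis(rho)
assert sigma == sigma2                          # equal X-marginals

# Equation (7.22): copy the X-value, apply c and d to the copies
join = {(x, y, z): sigma[x] * c[x][y] * d[x][z]
        for x in sigma for y in c[x] for z in d[x]}
```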

The next result illustrates two bijective correspondences resulting from
disintegration. These correspondences will be indicated with a double line,
like in (1.3), expressing that one can translate in two directions, from what
is above the lines to what is below, and vice-versa. The first bijective
correspondence is basically a reformulation of disintegration itself (for
states). The second one is new and involves distributions over distributions —
sometimes called hyperdistributions [121, 122].

Theorem 7.6.3. Let X, Y be arbitrary sets, where Y is finite.

1 There is a bijective correspondence between:

      τ ∈ D(X × Y) where τ[0, 1] has full support
   ==========================================================
      ω ∈ D(X) and c : X → Y such that c = ω has full support

2 For each natural number N ≥ 1 there is a bijective correspondence between:

      Ω ∈ D(D(X)) with |supp(Ω)| = N
   ==============================================
      ρ ∈ D(X × N) where ρ[0, 1] has full support
Recall that we write | A | ∈ N for the number of elements of a finite set A.


Proof. 1 In the downward direction, starting from a joint state τ ∈ D(X × Y)
we obtain a state ω ≔ τ[1, 0] ∈ D(X) as first marginal and a channel c ≔
τ[0, 1 | 1, 0] : X → Y by disintegration. The latter exists because the second
marginal τ[0, 1] has full support. In the upward direction we transform a
state-channel pair ω, c to the joint state τ ≔ ⟨id, c⟩ = ω ∈ D(X × Y). By
assumption, τ[0, 1] = c = ω has full support. Doing these transformations
twice, both down-up and up-down, yields the original data, by definition of
disintegration.

2 Let the distribution Ω ∈ D(D(X)) have a support with N elements, say
supp(Ω) = {ω_0, . . . , ω_{N−1}} with ω_i ∈ D(X). We take:

    ρ(x, i) ≔ Ω(ω_i) · ω_i(x).                                         (7.23)

This yields a distribution ρ ∈ D(X × N) since these probabilities add up to
one:

    Σ_{x∈X, i∈N} ρ(x, i) = Σ_{i∈N} Ω(ω_i) · Σ_{x∈X} ω_i(x) = Σ_{i∈N} Ω(ω_i) = 1.

Furthermore, ρ’s second marginal has full support, since for each i ∈ N,

    ρ[0, 1](i) = Σ_{x∈X} ρ(x, i) = Σ_{x∈X} Ω(ω_i) · ω_i(x) = Ω(ω_i) > 0.

In the upward direction, starting from ρ ∈ D(X × N) we form the channel
d ≔ ρ[1, 0 | 0, 1] : N → X by disintegration and use it to define Ω ∈ D(D(X))
as:

    Ω ≔ Σ_{i∈N} ρ[0, 1](i) | d(i) ⟩.                                   (7.24)

Since ρ[0, 1](i) > 0 for each i, this Ω’s support has N elements.

Starting from Ω with support {ω_0, . . . , ω_{N−1}}, we can get ρ as in (7.23)
with extracted channel d satisfying:

    d(i)(x) = ρ(x, i) / ρ[0, 1](i) = Ω(ω_i) · ω_i(x) / Ω(ω_i) = ω_i(x).

Hence the definition (7.24) yields the original state Ω ∈ D(D(X)):

    Σ_{i∈N} ρ[0, 1](i) | d(i) ⟩ = Σ_{i∈N} Ω(ω_i) | ω_i ⟩ = Ω.

In the other direction, starting from a joint state ρ ∈ D(X × N) whose second
marginal has full support, we can form Ω as in (7.24) and turn it into a joint
state again via (7.23), whose probability at (x, i) is:

    Ω(d(i)) · d(i)(x) = ρ[0, 1](i) · ρ[1, 0 | 0, 1](i)(x)
                      = ( ⟨ρ[1, 0 | 0, 1], id⟩ = ρ[0, 1] )(x, i)
                      = ρ(x, i).


This last equation holds by disintegration.
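The round trip in the second correspondence — from a hyperdistribution Ω to a joint state ρ via (7.23), and back via (7.24) — can be traced in a few lines; the concrete Ω below is an invented example with N = 2.

```python
from fractions import Fraction as F

# Omega in D(D(X)) with support of size N = 2,
# as a list of (probability, distribution) pairs
Omega = [(F(1, 3), {'x': F(1, 2), 'y': F(1, 2)}),
         (F(2, 3), {'x': F(1, 4), 'y': F(3, 4)})]

# (7.23): rho(x, i) = Omega(omega_i) * omega_i(x)
rho = {(x, i): w * om[x] for i, (w, om) in enumerate(Omega) for x in om}

# second marginal rho[0, 1] and disintegration d = rho[1, 0 | 0, 1]
snd = {}
for (x, i), p in rho.items():
    snd[i] = snd.get(i, F(0)) + p
d = {}
for (x, i), p in rho.items():
    d.setdefault(i, {})[x] = p / snd[i]

# (7.24): the recovered hyperdistribution is the original Omega
assert all(snd[i] == Omega[i][0] and d[i] == Omega[i][1]
           for i in range(len(Omega)))
```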

By combining the two correspondences in this theorem we can bijectively relate
a hyperdistribution Ω ∈ D(D(X)) and a state-channel pair ω ∈ D(X), c : X → Y,
see Exercise 7.6.7 below. This correspondence is used in Bayesian persuasion,
see [97], where the channel c is called a signal.
With disintegration well-understood we can formulate a follow-up of the
crossover update Corollary 6.3.13. There we looked at marginalisation after an
update. Here we look at extraction of a channel after such an update, at the
same position as the original channel. The newly extracted channel can be
expressed in terms of an updated channel, see Definition 6.1.1 (3).

Theorem 7.6.4. Let c : X → Y be a channel with a state σ ∈ D(X) on its
domain, and let p ∈ Fact(X) and q ∈ Fact(Y) be factors.

1 The extraction on an updated graph state yields:

    ( (⟨id, c⟩ = σ)| p⊗q )[0, 1 | 1, 0] = c|q    where    c|q(x) ≔ c(x)|q.

2 And as a result:

    (⟨id, c⟩ = σ)| p⊗q = ⟨id, c|q⟩ = σ| p & (c = q).

Proof. 1 For x ∈ X and y ∈ Y we have:

    ( (⟨id, c⟩ = σ)| p⊗q )[0, 1 | 1, 0](x)(y)
      = (⟨id, c⟩ = σ)| p⊗q (x, y) / ( (⟨id, c⟩ = σ)| p⊗q )[1, 0](x)     by (7.21)
      = ( (⟨id, c⟩ = σ)(x, y) · (p ⊗ q)(x, y) )
          / Σ_v (⟨id, c⟩ = σ)(x, v) · (p ⊗ q)(x, v)
      = ( σ(x) · c(x)(y) · p(x) · q(y) ) / Σ_v σ(x) · c(x)(v) · p(x) · q(v)
      = ( c(x)(y) · q(y) ) / Σ_v c(x)(v) · q(v)
      = ( c(x)(y) · q(y) ) / ( c(x) |= q )
      = c(x)|q(y)
      = c|q(x)(y).

2 An (updated) joint state like (⟨id, c⟩ = σ)| p⊗q can always be written as a
graph ⟨id, d⟩ = τ. The channel d is obtained by disintegration of the joint
state, and equals c|q by the previous point. The state τ is the first marginal
of the joint state. Hence we are done by characterising this first marginal,
in:

    ( (⟨id, c⟩ = σ)| p⊗q )[1, 0]
      = ( (⟨id, c⟩ = σ)| (1⊗q) & (p⊗1) )[1, 0]      by Exercise 4.3.7
      = ( ((⟨id, c⟩ = σ)| 1⊗q )| p⊗1 )[1, 0]        by Lemma 6.1.6 (3)
      = ( ((⟨id, c⟩ = σ)| 1⊗q )[1, 0] )| p          by Lemma 6.1.6 (6)
      = ( σ| c = q )| p                              by Corollary 6.3.13 (2)
      = σ| p & (c = q)                               by Lemma 6.1.6 (3).
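Theorem 7.6.4 (1) admits a quick numerical sanity check: disintegrating the updated joint state returns the updated channel c|q. The state, channel and factors in the Python sketch below are arbitrary toy data.

```python
from fractions import Fraction as F

sigma = {'x1': F(1, 2), 'x2': F(1, 2)}
c = {'x1': {'y1': F(1, 3), 'y2': F(2, 3)},
     'x2': {'y1': F(3, 4), 'y2': F(1, 4)}}
p = {'x1': F(1, 2), 'x2': F(1, 4)}          # factor on X
q = {'y1': F(1, 5), 'y2': F(4, 5)}          # factor on Y

# updated joint state (<id, c> applied to sigma, then conditioned by p (x) q)
joint = {(x, y): sigma[x] * c[x][y] * p[x] * q[y] for x in sigma for y in q}
tot = sum(joint.values())
joint = {k: v / tot for k, v in joint.items()}

# extract the channel at the first coordinate, as in (7.21)
fst = {}
for (x, y), v in joint.items():
    fst[x] = fst.get(x, F(0)) + v
extracted = {x: {y: joint[(x, y)] / fst[x] for y in q} for x in sigma}

# compare with c|_q, i.e. x |-> c(x)|_q
c_q = {}
for x in sigma:
    val = sum(c[x][y] * q[y] for y in q)    # validity c(x) |= q
    c_q[x] = {y: c[x][y] * q[y] / val for y in q}

assert extracted == c_q
```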

7.6.1 Disintegration of states and conditioning


With disintegration (of states) we can express conditioning (updating) of states
in a new way. Also, the earlier results on crossover inference can now be refor-
mulated via conditioning.
For the next result, recall that we can identify a predicate X → [0, 1] with a
channel X → 2, see Exercise 4.3.5.
Proposition 7.6.5. Let ω ∈ D(A) be a state and p ∈ Pred(A) a predicate, both
on the same finite set A. Assume that the validity ω |= p is neither 0 nor 1.
Then we can express the conditionings ω| p ∈ D(A) and ω| p⊥ ∈ D(A) as
(interpreted) string diagrams, as in (7.25): the joint state ⟨p, id⟩ = ω on
2 × A is put inside a shaded disintegration box, whose 2-wire is plugged with
the point 1 for ω| p, and with the point 0 for ω| p⊥.

Proof. First, let’s write σ ≔ ⟨p, id⟩ = ω ∈ D(2 × A) for the joint state that
is disintegrated in the above string diagrams. The precondition for
disintegration is that the marginal σ[1, 0] ∈ D(2) has full support.
Explicitly, this means for b ∈ 2 = {0, 1},

    σ[1, 0](b) = Σ_{a∈A} σ(b, a) = Σ_{a∈A} ω(a) · p(a)(b)
               = ω |= p     if b = 1,
               = ω |= p⊥    if b = 0.

The full support requirement that σ[1, 0](b) > 0 for each b ∈ 2 means that
both ω |= p and ω |= p⊥ = 1 − (ω |= p) are non-zero. This holds by assumption.

We elaborate the above string diagram on the left. Let’s write c for the
extracted channel. It is, according to (7.21),

    c(1) = Σ_{a∈A} ( σ(1, a) / Σ_a σ(1, a) ) | a ⟩
         = Σ_{a∈A} ( ω(a) · p(a) / (ω |= p) ) | a ⟩ = ω| p.

The diagram on the right, for ω| p⊥, is handled analogously, via c(0).
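The proof of Proposition 7.6.5 can be followed step by step in code: form the joint state σ = ⟨p, id⟩ = ω on 2 × A, disintegrate, and compare with the two conditionings. The numbers below are an invented toy instance.

```python
from fractions import Fraction as F

omega = {'a': F(1, 2), 'b': F(1, 2)}
p = {'a': F(3, 4), 'b': F(1, 4)}          # predicate, seen as channel A -> D(2)

# joint state sigma = <p, id> applied to omega, on 2 x A:
# sigma(b, a) = omega(a) * p(a)(b)
joint = {(1, a): omega[a] * p[a] for a in omega}
joint.update({(0, a): omega[a] * (1 - p[a]) for a in omega})

marg = {b: sum(v for (b2, a), v in joint.items() if b2 == b) for b in (0, 1)}
chan = {b: {a: joint[(b, a)] / marg[b] for a in omega} for b in (0, 1)}

validity = sum(omega[a] * p[a] for a in omega)        # omega |= p
cond_p = {a: omega[a] * p[a] / validity for a in omega}
cond_perp = {a: omega[a] * (1 - p[a]) / (1 - validity) for a in omega}
assert chan[1] == cond_p                               # omega|_p
assert chan[0] == cond_perp                            # omega|_{p-orthocomplement}
```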

We move to crossover inference, as described in Corollary 6.3.13. There, a
joint state ω ∈ D(X × Y) is used, which can be described as a graph ω =
⟨id, c⟩ = σ. But we can drop this graph assumption now, since it can be
obtained from ω via disintegration — assuming that ω’s first marginal has full
support. Thus we come to the following reformulation of Corollary 6.3.13.

Theorem 7.6.6. Let ω ∈ D(X × Y) be a joint state whose first marginal σ ≔
ω[1, 0] ∈ D(X) has full support, so that the channel c ≔ ω[0, 1 | 1, 0] :
X → Y exists, uniquely, by disintegration, and thus ω = ⟨id, c⟩ = σ. Then:

1 for a factor p on X,    ( ω| p⊗1 )[0, 1] = c = σ| p;
2 for a factor q on Y,    ( ω| 1⊗q )[1, 0] = σ| c = q.

7.6.2 Bayesian inversion, graphically


In Section 6.6 we have introduced Bayesian inversion as the dagger of a chan-
nel. Here we redescribe it in graphical form. So let c : A → B be a channel
with finite codomain B, and with a state ω ∈ D(A) on its domain, such that the
predicted state c = ω ∈ D(B) has full support. The Bayesian inversion (also
called: dagger) of c wrt. ω is the channel c†ω : B → A defined as the
disintegration c†ω ≔ (⟨c, id⟩ = ω)[0, 1 | 1, 0] of the joint state
⟨c, id⟩ = ω, characterised by the equation:

    ⟨id, c†ω⟩ = (c = ω)  =  ⟨c, id⟩ = ω.                               (7.26)
By construction, c†ω is the unique box satisfying the above equation. By
applying discard to the left outgoing wire on both sides of this equation we
get the equation that we saw earlier in Proposition 6.6.4 (1):

    ω = c†ω = (c = ω).
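The dagger as disintegration, in a Python sketch with invented numbers; the final check is the just-mentioned equation of Proposition 6.6.4 (1).

```python
from fractions import Fraction as F

omega = {'a1': F(1, 4), 'a2': F(3, 4)}
c = {'a1': {'b1': F(1, 2), 'b2': F(1, 2)},
     'a2': {'b1': F(1, 3), 'b2': F(2, 3)}}

# joint state <c, id> applied to omega, on B x A
joint = {(b, a): omega[a] * c[a][b] for a in omega for b in c[a]}

pred = {}                                   # predicted state: c pushed on omega
for (b, a), v in joint.items():
    pred[b] = pred.get(b, F(0)) + v

# the dagger channel, via disintegration of the joint state (7.26)
dagger = {b: {a: joint[(b, a)] / pred[b] for a in omega} for b in pred}

# pushing the predicted state forward along the dagger returns omega
back = {}
for b, v in pred.items():
    for a, w in dagger[b].items():
        back[a] = back.get(a, F(0)) + v * w
assert back == omega
```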


Above, we have described Bayesian inversion in terms of disintegration.


In fact, the two notions are inter-definable, since disintegration can also be
defined in terms of Bayesian inversion. This is in fact what we have done in
Theorem 6.6.5. We now put it in graphical perspective.

Lemma 7.6.7. Let ω ∈ D(A × B) be such that its first marginal π1 = ω =
ω[1, 0] ∈ D(A) has full support, where π1 : A × B → A is the projection
channel. The second marginal of the Bayesian inversion:

    π2 = (π1)†ω    for    (π1)†ω : A → A × B,

is then the disintegration channel A → B for ω.

Proof. We first note that the projection channel π1 : A × B → A can be written
as a string diagram: an identity wire on A next to a discard on B. We need to
prove the equation in (7.19). It can be obtained via purely diagrammatic
reasoning, in a two-step calculation whose first equation is an instance of
(7.26).

7.6.3 Pearl’s and Jeffrey’s update rules, graphically


In Section 6.7 we first described the update rules of Pearl and Jeffrey. At
this stage we can recast them in terms of string diagrams. Recall that the
setting of these rules is given by a channel c : X → Y with a state σ ∈ D(X)
on its domain. We are presented with evidence on Y and would like to transport
it along c in order to update σ.

435
436 Chapter 7. Directed Graphical Models

In Pearl’s update rule the evidence is given by a factor. Here we restrict it
to a predicate q : Y → 2, since such predicates can be represented in a string
diagram. The state σ is then updated to σ| c = q. This is represented on the
left in (7.27). It corresponds to the update diagram (on the left) in (7.25),
but now with transformed predicate c = q.

For Jeffrey’s update we have evidence in the form of a state τ ∈ D(Y) and we
update σ to c†σ = τ. Hence we can use the string diagram of the dagger, as in
(7.26), with the state τ plugged into its outgoing Y-wire; this is the diagram
on the right in (7.27). [The two string diagrams (7.27) are not reproduced
here.]

We see that the state σ and channel c : X → Y together form a graph state
ω ≔ ⟨c, id⟩ = σ ∈ D(Y × X). This joint state ω is what’s really used in the
above two diagrams. Hence we can also reformulate the above two update
diagrams in terms of this joint state ω, as in (7.28) [string diagrams not
reproduced here]. This ‘joint’ formulation is sometimes convenient when the
situation at hand is presented in terms of a joint state, like in Exercise
6.7.3.
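Pearl’s and Jeffrey’s rules from (7.27) can be compared side by side. In the Python sketch below (toy numbers; the predicate q and the state τ are given by the same table of values) the two rules produce genuinely different posteriors.

```python
from fractions import Fraction as F

sigma = {'x1': F(1, 2), 'x2': F(1, 2)}
c = {'x1': {'y1': F(9, 10), 'y2': F(1, 10)},
     'x2': {'y1': F(1, 5), 'y2': F(4, 5)}}

# Pearl: update sigma with the transformed predicate, giving sigma|_{c = q}
q = {'y1': F(3, 4), 'y2': F(1, 4)}
cq = {x: sum(c[x][y] * q[y] for y in q) for x in sigma}
val = sum(sigma[x] * cq[x] for x in sigma)
pearl = {x: sigma[x] * cq[x] / val for x in sigma}

# Jeffrey: push the evidence state tau back along the dagger of c wrt. sigma
tau = {'y1': F(3, 4), 'y2': F(1, 4)}
pred = {y: sum(sigma[x] * c[x][y] for x in sigma) for y in tau}
jeffrey = {x: sum(tau[y] * sigma[x] * c[x][y] / pred[y] for y in tau)
           for x in sigma}
# the two outcomes differ
```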

Exercises
7.6.1 Check that Proposition 7.6.5 can be reformulated as:

          ω| p = p†ω(1)    and    ω| p⊥ = p†ω(0).


7.6.2 Let σ ∈ D(X) have full support and consider ω ≔ σ ⊗ τ for some
τ ∈ D(Y). Check that the channel X → Y extracted from ω by disin-
tegration is the constant function x 7→ τ. Give a string diagrammatic
account of this situation.
7.6.3 Consider an extracted channel ω[1, 0, 0 | 0, 0, 1] for some state ω.
      1 Write down the defining equation for this channel, as a string
        diagram.
      2 Check that ω[1, 0, 0 | 0, 0, 1] is the same as ω[1, 0, 1][1, 0 | 0, 1].
7.6.4 Check that a marginalisation ω[M] can also be described as the
      disintegration ω[M | 0, . . . , 0], where the number of 0’s equals the
      length of the mask/list M.
7.6.5 Disintegrate the distribution Flrn(τ) ∈ D({H, L} × {1, 2, 3}) in Subsec-
tion 1.4.1 to a channel {H, L} → {1, 2, 3}.
7.6.6 Consider sets X = {x1, x2}, Y = {y1, y2, y3}, Z = {z1, z2, z3} with the
      distribution ω ∈ D(X × Y) given by:

          1/4| x1, y1 ⟩ + 1/4| x1, y2 ⟩ + 1/6| x2, y1 ⟩ + 1/6| x2, y2 ⟩ + 1/6| x2, y3 ⟩

      and ρ ∈ D(X × Z) by:

          1/12| x1, z1 ⟩ + 1/3| x1, z2 ⟩ + 1/12| x1, z3 ⟩
            + 1/8| x2, z1 ⟩ + 1/8| x2, z2 ⟩ + 1/4| x2, z3 ⟩.

      1 Check that ω and ρ have equal X-marginals.
      2 Explicitly describe the extracted channels c ≔ ω[0, 1 | 1, 0] : X → Y
        and d ≔ ρ[0, 1 | 1, 0] : X → Z.
 

      3 Show that the natural join ω ⋈ ρ ∈ D(X × Y × Z) according to (7.22)
        is:

          1/24| x1, y1, z1 ⟩ + 1/6| x1, y1, z2 ⟩ + 1/24| x1, y1, z3 ⟩
            + 1/24| x1, y2, z1 ⟩ + 1/6| x1, y2, z2 ⟩ + 1/24| x1, y2, z3 ⟩
            + 1/24| x2, y1, z1 ⟩ + 1/24| x2, y1, z2 ⟩ + 1/12| x2, y1, z3 ⟩
            + 1/24| x2, y2, z1 ⟩ + 1/24| x2, y2, z2 ⟩ + 1/12| x2, y2, z3 ⟩
            + 1/24| x2, y3, z1 ⟩ + 1/24| x2, y3, z2 ⟩ + 1/12| x2, y3, z3 ⟩.

7.6.7 Combining the two items of Theorem 7.6.3 gives a bijective
      correspondence between:

          Ω ∈ D(D(X)) with |supp(Ω)| = N
       ==========================================================
          ω ∈ D(X) and c : X → N such that c = ω has full support
Define this correspondence in detail and check that it is bijective.
7.6.8 Prove the following possibilitistic analogues of Theorem 7.6.3.


      1 There is a bijective correspondence between:

            R ∈ P(X × Y)
         ==============================
            U ∈ P(X) with f : U → P∗(Y)

        where P∗ is used for the subset of non-empty subsets.
      2 For each number N ≥ 1 there is a bijective correspondence:

            A ∈ P(P(X)) with |A| = N
         ====================================================
            R ∈ P(X × N) with ∀ i ≠ j. ∃ x. ¬( R(x, i) ⇔ R(x, j) )
7.6.9 Prove that the equation

          ( (⟨id, c⟩ = σ)| p⊗1 )[0, 1 | 1, 0] = c

      can be obtained both from Theorem 6.3.11 and from Theorem 7.6.4.
7.6.10 Prove that for a state ω ∈ D(X × Y) with full support one has:

           ω[1, 0 | 0, 1] = ( ω[0, 1 | 1, 0] )†σ : Y → X,    where σ = ω[1, 0].


7.7 Disintegration and inversion in machine learning


This section illustrates the role of disintegration and Bayesian inversion in two
fundamental techniques in machine learning, namely in naive Bayesian classi-
fication and in decision tree learning. These applications will be explained via
examples from the literature.

7.7.1 Naive Bayesian classification


We illustrate the use of both disintegration and Bayesian inversion in an ex-
ample of ‘naive’ Bayesian classification from [165]; we follow the analysis
of [21]. Instead of trying to explain what a naive Bayesian classification is
or does, we demonstrate via this example how it works. In the end, in Re-
mark 7.7.1 we give a more general description.
Consider the table in Figure 7.1. It collects data about certain weather condi-
tions and whether or not there is playing (outside). The question asked in [165]
is: given this table, what can be said about the probability of playing if the out-
look is sunny, the temperature is cold, the humidity is high and it is windy?
This is a typical Bayesian update question, starting from (point) evidence. We
will first analyse the situation in terms of channels.
We start by extracting the underlying spaces for the columns/categories in


Outlook Temperature Humidity Windy Play

sunny hot high false no


sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no

Figure 7.1 Weather and play data, copied from [165].

the table in Figure 7.1. We choose obvious abbreviations for the entries in
the table:

    O = {s, o, r}    T = {h, m, c}    H = {h, n}    W = {t, f}    P = {y, n}

These sets are joined into a single product space:

    S ≔ O × T × H × W × P.

It combines the five columns in Figure 7.1. The table itself can now be consid-
ered as a multiset in M(S ) with 14 elements, each with multiplicity one. We
will turn it immediately into an empirical distribution — formally via frequen-
tist learning. It yields a distribution τ ∈ D(S ), with 14 entries, each with the
same probability, written as:

    τ = 1/14| s, h, h, f, n ⟩ + 1/14| s, h, h, t, n ⟩ + · · · + 1/14| r, m, h, t, n ⟩.

We use a ‘naive’ Bayesian model in this situation, which means that we assume
that all weather features are independent. This assumption can be visualised
via a string diagram (7.29), not reproduced here, with a single node for Play
at the bottom and four separate outgoing channels to Outlook, Temperature,
Humidity and Windy. This model oversimplifies the situation, but still it
often leads to good (enough) outcomes.

439
440 Chapter 7. Directed Graphical Models

We take the above perspective on the distribution τ, that is, we ‘factorise’ τ


according to this string diagram (7.29). Obviously, the play state π ∈ D(P) is
obtained as the last marginal:

π B τ 0, 0, 0, 0, 1 = 14
9
|y i + 14
5
 
| n i.
Next, we extract four channels cO , cT , cH , cW via appropriate disintegrations,
from the Play column to the Outlook / Temperature / Humidity / Windy columns.

    cO B τ[1, 0, 0, 0, 0 | 0, 0, 0, 0, 1]        cH B τ[0, 0, 1, 0, 0 | 0, 0, 0, 0, 1]
    cT B τ[0, 1, 0, 0, 0 | 0, 0, 0, 0, 1]        cW B τ[0, 0, 0, 1, 0 | 0, 0, 0, 0, 1].

For instance the ‘outlook’ channel cO : P → O looks as follows.

    cO (y) = 2/9 | s i + 4/9 | o i + 3/9 | r i        cO (n) = 3/5 | s i + 2/5 | r i.        (7.30)

It is analysed in greater detail in Exercise 7.7.1 below.


Now we can form the tuple channel of these extracted channels, called c in:

    c B hcO , cT , cH , cW i : P → O × T × H × W

Recall the question that we started from: what is the probability of playing
if the outlook is sunny, the temperature is cool, the humidity is high and it
is windy? These features can be translated into an element (s, c, h, t) of the
codomain O × T × H × W of this tuple channel — and thus into a point predicate.
Hence our answer can be obtained by Bayesian inversion of the tuple channel,
as:

    c†π (s, c, h, t) = π|c = 1(s,c,h,t)        by (6.6)
                     = 125/611 | y i + 486/611 | n i
                     ≈ 0.2046| y i + 0.7954| n i.
This corresponds to the probability 20.5% calculated in [165] — without any
disintegration or Bayesian inversion.
The classification that we have just performed works via what is called a
naive Bayesian classifier. In our set-up this classifier is the dagger channel:

    c†π : O × T × H × W → P

It predicts playing for each 4-tuple in O × T × H × W.


In the end one can reconstruct a joint state on the space S via the extracted
channels, as graph:

    hcO , cT , cH , cW , id i = π.

This state differs considerably from the original table/state τ. It shows that the
shape (7.29) does not really fit the data that we have in Figure 7.1. But recall


that this approach is called naive. We shall soon look closer into such matters
of shape in Section 7.9.

Remark 7.7.1. Now that we have seen the above illustration we can give a
more abstract recipe of how to obtain a Bayesian classifier. The starting point
is a joint state ω ∈ D(X1 × · · · × Xn × Y), where X1 , . . . , Xn are the sets describing
the ‘input features’ and Y is the set of ‘target features’ that are used in
classification: its elements represent the different classes. The recipe involves
the following steps, in which disintegration and Bayesian inversion play a
prominent role.

1 Compute the prior classification probability π B ω[0, . . . , 0, 1] ∈ D(Y).
2 Extract n channels ci : Y → Xi via disintegration:

      c1 B ω[1, 0, . . . , 0 | 0, . . . , 0, 1] : Y → X1
      c2 B ω[0, 1, 0, . . . , 0 | 0, . . . , 0, 1] : Y → X2
          ...
      cn B ω[0, . . . , 0, 1, 0 | 0, . . . , 0, 1] : Y → Xn .

3 Form the tuple channel c B hc1 , . . . , cn i : Y → X1 × · · · × Xn .
4 Take the Bayesian inversion (dagger channel) c†π : X1 × · · · × Xn → Y, as
  classifier. It gives for each n-tuple of input features x1 , . . . , xn a distribution
  c†π (x1 , . . . , xn ) ∈ D(Y) on the set Y of classes (or target features). The
  distribution gives the probability of the tuple belonging to class y ∈ Y.

Graphically, these steps can be described as follows. First take:

    [string diagrams: π is ω with all the Xi wires discarded; ci is the
    disintegration of ω from the Y wire to the Xi wire]

The classification channel X1 × · · · × Xn → Y is then obtained via the dagger


Author Thread Length Where read User action

known new long home skip


unknown new short work read
unknown follow-up long work skip
known follow-up long home skip
known new short home read
known follow-up long work skip
unknown follow-up short work skip
unknown new short work read
known follow-up long home skip
known new long work skip
unknown follow-up short home skip
known new long work skip
known follow-up short home read
known new short work read
known new short home read
known follow-up short work read
known new short home read
unknown new short work read

Figure 7.2 Data about circumstances for reading or skipping of articles, copied
from [140].

construction:

    [string diagram: the naive Bayesian classifier as the dagger of the tuple
    channel hc1 , . . . , cn i : Y → X1 × · · · × Xn , taken with respect to the state π]
The twisting at the bottom happens implicitly when we consider the product
X1 × · · · × Xn as a single set and take the dagger wrt. this set.

7.7.2 Decision tree learning


We proceed as before, by first going through an example from the literature, and
then taking a step back to describe more abstractly what is going on. We start
from a table of data in Figure 7.2, from [140, Fig. 7.1]. It describes user actions
(read or skip) for articles that are “posted to a threaded discussion website
depending on whether the author is known or not, whether the article started a


new thread or was a follow-up, the length of the article, and whether it is read
at home or at work.”
We formalise the table in Figure 7.2 via the following five sets, with elements
corresponding in an obvious way to the entries in the table: k = known,
u = unknown, etc.

    A = {k, u}    T = {n, f }    L = {l, s}    W = {h, w}    U = {s, r}.
We interpret the table as a joint distribution ω ∈ D(A × T × L × W × U) with
the same probability for each of the 18 entries in the table:

    ω = 1/18 | k, n, l, h, s i + 1/18 | u, n, s, w, r i + · · · + 1/18 | u, n, s, w, r i.        (7.31)
So far there is no real difference with the naive Bayesian classification example
in the previous subsection. But the aim now is not classification but deciding:
given a 4-tuple of input features (x1 , x2 , x3 , x4 ) ∈ A × T × L × W we wish to
decide as quickly as possible whether this tuple leads to a read or a skip action.
The way to take such a decision is to use a decision tree as described in
Figure 7.3. It puts Length as dominant feature on top and tells that a long article
is skipped immediately. Indeed, this is what we see in the table in Figure 7.2:
we can quickly take this decision without inspecting the other features. If the
article is short, the next most relevant feature, after Length, is Thread. Again
we can see in Figure 7.2 that a short and new article is read. Finally, we see in
Figure 7.3 that if the article is short and a follow-up, then it is read if the author
is known, and skipped if the author is unknown. Apparently the location of the
user (the Where feature) is irrelevant for the read/skip decision.
We see that such decision trees provide a (visually) clear and efficient method
for reaching a decision about the target feature, starting from input features.
The question that we wish to address here is: how to learn (derive) such a
decision tree (in Figure 7.3) from the table (in Figure 7.2)?
The recipe (algorithm) for learning the tree involves iteratively going through
the following three steps, acting on a joint state σ.
1 Check if we are done, that is, if the U-marginal (for User action) of the
current joint state σ is a point state; if so, we are done and write this point
state 1| x i in a box as leaf in the tree.
2 If we are not done, determine the ‘dominant’ feature X for σ, and write X as
a circle in the tree.
3 For each element x ∈ X, update (and marginalise) σ with the point predicate
1 x and re-start with step 1.
We shall describe these steps in more detail, starting from the state ω in (7.31),
corresponding to the table in Figure 7.2.


Figure 7.3 Decision tree for reading (r) or skipping (s) derived from the data in
Figure 7.2.

1.1 The U-marginal ωU B ω[0, 0, 0, 0, 1] ∈ D(U) equals 1/2 | r i + 1/2 | s i,
because there are equal numbers of read and skip actions in Figure 7.2. Since
this is not a point state, we are not done.
1.2 We need to determine the dominant feature in ω. This is where
disintegration comes in. We first extract four channels and states:

    cA B ω[0, 0, 0, 0, 1 | 1, 0, 0, 0, 0] : A → U        ωA B ω[1, 0, 0, 0, 0] ∈ D(A)
    cT B ω[0, 0, 0, 0, 1 | 0, 1, 0, 0, 0] : T → U        ωT B ω[0, 1, 0, 0, 0] ∈ D(T )
    cL B ω[0, 0, 0, 0, 1 | 0, 0, 1, 0, 0] : L → U        ωL B ω[0, 0, 1, 0, 0] ∈ D(L)
    cW B ω[0, 0, 0, 0, 1 | 0, 0, 0, 1, 0] : W → U        ωW B ω[0, 0, 0, 1, 0] ∈ D(W).

Note that these channels go in the opposite direction with respect to the


approach for naive Bayesian classification. It is not hard to see that:

    cA (k) = 1/2 | s i + 1/2 | r i      cA (u) = 1/2 | s i + 1/2 | r i      ωA = 2/3 | k i + 1/3 | u i
    cT (n) = 3/10 | s i + 7/10 | r i    cT ( f ) = 3/4 | s i + 1/4 | r i    ωT = 5/9 | n i + 4/9 | f i
    cL (l) = 1| s i                     cL (s) = 2/11 | s i + 9/11 | r i    ωL = 7/18 | l i + 11/18 | s i
    cW (h) = 1/2 | s i + 1/2 | r i      cW (w) = 1/2 | s i + 1/2 | r i      ωW = 4/9 | h i + 5/9 | w i.
4 5

In order to determine which of the input features A, T, L, W is most
dominant we compute the expected entropy — sometimes called intrinsic
value — for each of these features. Recall the Shannon entropy function
H : D(X) → R≥0 from Exercise 4.1.9. Post-composing it with the above
channels gives a factor, like H ◦ cA : A → R≥0 . It computes the entropy
H(cA (x)) for each x ∈ A. Hence we can compute its validity in the marginal
state ωA ∈ D(A), giving the expected entropy. Explicitly:

    ωA |= H ◦ cA = Σ x∈A ωA (x) · H(cA (x))
                 = ωA (k) · H(cA (k)) + ωA (u) · H(cA (u))
                 = 2/3 · ( 1/2 · −log(1/2) + 1/2 · −log(1/2) )
                     + 1/3 · ( 1/2 · −log(1/2) + 1/2 · −log(1/2) )
                 = 2/3 + 1/3
                 = 1.

In the same way one computes the other expected entropies as:
ωT |= H ◦ cT = 0.85 ωL |= H ◦ cL = 0.42 ωW |= H ◦ cW = 1.

One then picks the lowest entropy value, which is 0.42, for feature / com-
ponent L. Hence L = Length is the dominant feature at this first stage.
Therefore it is put on top in the decision tree in Figure 7.3.
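The four expected entropies can be checked numerically from the channel and marginal values listed above. The short Python sketch below (our own computation, using Shannon entropy with logarithm base 2) confirms the values 1, 0.85, 0.42, 1 and that Length has the lowest expected entropy:

```python
import math

def H(ps):
    # Shannon entropy of a list of probabilities, in bits
    return -sum(p * math.log2(p) for p in ps if p > 0)

# omega_X |= H . c_X, assembled from the fractions computed in the text
expected = {
    'A': 2/3 * H([1/2, 1/2]) + 1/3 * H([1/2, 1/2]),
    'T': 5/9 * H([3/10, 7/10]) + 4/9 * H([3/4, 1/4]),
    'L': 7/18 * H([1]) + 11/18 * H([2/11, 9/11]),
    'W': 4/9 * H([1/2, 1/2]) + 5/9 * H([1/2, 1/2]),
}
```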
1.3 The set L has two elements, l for long and s for short. We update the
current state ω with each of these, via suitably weakened point predicates
1l and 1 s , and marginalise out the L component. This gives new states for
which we use the following ad hoc notation.

    ω/l B ω|1⊗1⊗1l ⊗1⊗1 [1, 1, 0, 1, 1] ∈ D(A × T × W × U)
    ω/s B ω|1⊗1⊗1s ⊗1⊗1 [1, 1, 0, 1, 1] ∈ D(A × T × W × U).

We now go into a recursive loop and repeat the previous steps for both these
states ω/l and ω/s. Notice that they are ‘shorter’ than ω, since they only
have 4 components instead of 5, since we marginalised out the dominant
component L.


2.1.1 In the l-branch we are now done, since the U-marginal of the l-update
is a point state:

    ω/l [0, 0, 0, 1] = 1| s i.

It means that a long article is skipped immediately. This is indicated via the
l-box as (left) child of the Length node in the decision tree in Figure 7.3.
2.1.2 We continue with the s-branch. The U-marginal of ω/s is not a point
state.
2.2.2 We will now have to determine the dominant feature in ω/s. We
compute the three expected entropies for A, T, W as:

    ω/s [1, 0, 0, 0] |= H ◦ ω/s [0, 0, 0, 1 | 1, 0, 0, 0] = 0.44
    ω/s [0, 1, 0, 0] |= H ◦ ω/s [0, 0, 0, 1 | 0, 1, 0, 0] = 0.36
    ω/s [0, 0, 1, 0] |= H ◦ ω/s [0, 0, 0, 1 | 0, 0, 1, 0] = 0.68.

The second value is the lowest, so that T = Thread is now the dominant
feature. It is added as codomain node of the s-edge out of Length in the
decision tree in Figure 7.3.
2.3.2 The set T has two elements, n for new and f for follow-up. We take
the corresponding updates:

    ω/s/n B ω/s|1⊗1n ⊗1⊗1 [1, 0, 1, 1] ∈ D(A × W × U)
    ω/s/ f B ω/s|1⊗1 f ⊗1⊗1 [1, 0, 1, 1] ∈ D(A × W × U).

Then we enter new recursions with both these states.


3.1.1 In the n-branch we are done, since we find a point state as U-marginal:

    ω/s/n [0, 0, 1] = 1| r i.
This settles the left branch under the Thread node in Figure 7.3.
3.1.2 The f -branch is not done, since the U-marginal of the state ω/s/ f is
not a point state.
3.2.2 We thus start computing expected entropies again, in order to find out
which of the remaining input features A, W is dominant.

    ω/s/ f [1, 0, 0] |= H ◦ ω/s/ f [0, 0, 1 | 1, 0, 0] = 0
    ω/s/ f [0, 1, 0] |= H ◦ ω/s/ f [0, 0, 1 | 0, 1, 0] = 1.

Hence the A feature is dominant, so that the Author node is added to the
f -edge out of Thread in Figure 7.3.


3.3.2 The set A has two elements, k for known and u for unknown. We
form the corresponding two updates of the current state ω/s/ f .

    ω/s/ f /k B ω/s/ f |1k ⊗1⊗1 [0, 1, 1] ∈ D(W × U)
    ω/s/ f /u B ω/s/ f |1u ⊗1⊗1 [0, 1, 1] ∈ D(W × U).

The next cycle continues with these two states.


4.1.1 In the k-branch we are done since:

    ω/s/ f /k [0, 1] = 1| r i.

4.1.2 Also in the u-branch we are done since:

    ω/s/ f /u [0, 1] = 1| s i.

This gives the last two boxes, so that the decision tree in Figure 7.3 is
finished.
There are many variations on the above learning algorithm for decision trees.
The one that we just described is sometimes called the ‘classification’ version,
as in [91], since it works with discrete distributions. There is also a ‘regression’
version, for continuous distributions. The key part of the above algorithm
is deciding which feature is dominant (in step 2). We have described
the so-called ID3 version from [142], which uses expected entropies (intrinsic
values). Sometimes it is described in terms of ‘gains’, see [125], where in the
above step 1.2 we can define for feature X ∈ {A, T, L, W},

    gain(X) B H(ωU ) − (ωX |= H ◦ cX ).

One then looks for the feature with the highest gain. But one may as well look
for the lowest expected entropy — given by the validity expression after the
minus sign — as we do above. There are alternatives to using gain, such as
what is called ‘gini’, but that is out of scope.
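The whole recursion can be condensed into a short program. The sketch below (our own, with features encoded as column indices rather than names) implements the ID3-style loop just described and reproduces the decision tree of Figure 7.3:

```python
import math
from collections import Counter

rows = [  # (Author, Thread, Length, Where, Action), the 18 rows of Figure 7.2
    ('k','n','l','h','s'), ('u','n','s','w','r'), ('u','f','l','w','s'),
    ('k','f','l','h','s'), ('k','n','s','h','r'), ('k','f','l','w','s'),
    ('u','f','s','w','s'), ('u','n','s','w','r'), ('k','f','l','h','s'),
    ('k','n','l','w','s'), ('u','f','s','h','s'), ('k','n','l','w','s'),
    ('k','f','s','h','r'), ('k','n','s','w','r'), ('k','n','s','h','r'),
    ('k','f','s','w','r'), ('k','n','s','h','r'), ('u','n','s','w','r'),
]

def H(labels):
    # Shannon entropy of the empirical distribution of a list of labels
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def id3(rows, features, target=4):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # point state: emit a leaf
        return labels[0]
    def groups(f):                     # split the rows on the values of feature f
        d = {}
        for r in rows:
            d.setdefault(r[f], []).append(r)
        return d
    def expected_entropy(f):           # omega_f |= H . c_f
        return sum(len(g) / len(rows) * H([r[target] for r in g])
                   for g in groups(f).values())
    best = min(features, key=expected_entropy)
    rest = [f for f in features if f != best]
    return (best, {v: id3(g, rest) for v, g in groups(best).items()})

tree = id3(rows, [0, 1, 2, 3])   # features: 0=Author, 1=Thread, 2=Length, 3=Where
```

The resulting tree puts Length (index 2) on top, with Thread and then Author below it, exactly as in Figure 7.3.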
To conclude, running the decision tree learning algorithm on the distribution
associated with the weather and play table from Figure 7.1 — with Play as
target feature — yields the decision tree in Figure 7.4, see also [165, Fig. 4.4].

Exercises
7.7.1 Write σ ∈ M(O × T × H × W × P) for the weather-play table in
Figure 7.1, as multidimensional multiset.
1 Compute the marginal multiset σ[1, 0, 0, 0, 1] ∈ M(O × P).


Figure 7.4 Decision tree for playing (y) or not (n) derived from the data in
Figure 7.1.

2 Reorganise this marginalised multiset as a 2-dimensional table with
only Outlook (horizontal) and Play (vertical) data, as given below,
and check how this ‘marginalised’ table relates to the original one
in Figure 7.1.

              Sunny   Overcast   Rainy
        yes     2        4         3
        no      3        0         2

3 Deduce a channel P → O from this table, and compare it to the
description (7.30) of the channel cO given in Subsection 7.7.1 —
see also Lemma 2.3.2 (and Proposition ??).
4 Do the same for the marginal tables σ[0, 1, 0, 0, 1], σ[0, 0, 1, 0, 1],
σ[0, 0, 0, 1, 1], and the corresponding channels cT , cH , cW in
Subsection 7.7.1.
7.7.2 Continuing with the weather-play table σ, notice that instead of taking
the dagger of the tuple of channels hcO , cT , cH , cW i : P → O ×
T × H × W in Subsection 7.7.1 we could have used instead the ‘direct’
disintegration:

    σ[0, 0, 0, 0, 1 | 1, 1, 1, 1, 0] : O × T × H × W → P

Check that this gives a division-by-zero error.


7.7.3 Naive Bayesian classification, as illustrated in Subsection 7.7.1, is
often used for classifying email messages as either ‘spam’ or ‘ham’
(not spam). One then looks for words which are typical for spam or
ham. This exercise elaborates a standard small example.


Consider the following table of six words, together with the likeli-
hoods of them being spam or ham.

spam ham

review 1/4 1
send 3/4 1/2
us 3/4 1/2
your 3/4 1/2
password 1/2 1/2
account 1/4 0

Let’s write S = {s, h} for the probability space for spam and ham, and,
as usual, 2 = {0, 1}.
1 Translate the above table into six channels:

review, send, us, your, password, account : S → 2.

Write c : S → 2^6 = 2 × 2 × 2 × 2 × 2 × 2 for the tuple channel:

c B hreview, send, us, your, password, accounti.

2 We wish to classify the message “review us now”. Explain how it


gets translated to the point predicate 1(1,0,1,0,0,0) on 2^6 .
3 Show that:

    c = 1(1,0,1,0,0,0) = (review = 11 ) & (send = 10 )
                           & (us = 11 ) & (your = 10 )
                           & (password = 10 ) & (account = 10 )
                       = 9/2048 · 1s + 1/16 · 1h

4 Let’s assume a prior spam distribution ω = 2/3 | s i + 1/3 | h i. Show that
the posterior spam distribution for our message is:

    c†ω (1, 0, 1, 0, 0, 0) = ω|c = 1(1,0,1,0,0,0) = 9/73 | s i + 64/73 | h i
                           ≈ 0.1233| s i + 0.8767| h i.

7.7.4 Show in detail that we get point states in items 2.1.1 , 3.1.1 , 4.1.1
and 4.1.2 in Subsection 7.7.2.
7.7.5 In Subsection 7.7.1 we classified the play distribution as 125/611 | y i +
486/611 | n i, given the input features (s, c, h, t). Check what the play
decision is for these inputs in the decision tree in Figure 7.4.


7.7.6 As mentioned, the table in Figure 7.2 is copied from [140]. There
the following two questions are asked: what can be said about read-
ing/skipping for the two queries:
Q1 = (unknown, new, long, work)
Q2 = (unknown, follow-up, short, home).
1 Answer Q1 and Q2 via the decision tree in Figure 7.3.
2 Compute the distributions on U for both queries Q1 and Q2 via
naive Bayesian classification.
(The outcome for Q2 obtained via Bayesian classification is 3/5 | r i + 2/5 | s i;
it gives reading the highest probability. This does not coincide
with the outcome via the decision tree. Thus, one should be careful to
rely on such classification methods for important decisions.)

7.8 Categorical aspects of Bayesian inversion


As suggested in the beginning of this section the ‘dagger’ of a channel — i.e. its
Bayesian inversion — can also be described categorically. It turns out to be a
special ‘dagger’ functor. Such reversal is quite common for non-deterministic
computation, see Example 7.8.1 below. The fact that this same abstract struc-
ture exists for probabilistic computation demonstrates once again that Bayesian
inversion is a canonical operation — and that category theory provides a useful
language for making such similarities explicit.
This section goes a bit deeper into the category-theoretic aspects of proba-
bilistic computation in general and of Bayesian inversion in particular. It is not
essential for the rest of this book, but provides deeper insight into the underly-
ing structures.
Example 7.8.1. Recall the category Chan(P) of non-deterministic compu-
tations. Its objects are sets X and its morphisms f : X → Y are functions
f : X → P(Y). The identity morphism unit X : X → X in Chan(P) is the sin-
gleton function unit(x) = {x}. Composition of f : X → Y and g : Y → Z is the
function g ◦· f : X → Z given by:

g ◦· f (x) = {z ∈ Z | ∃y ∈ Y. y ∈ f (x) and z ∈ g(y)}.




It is not hard to see that ◦· is associative and has unit as identity element. In
fact, this has already been proven more generally, in Lemma 1.8.3.
There are two aspects of the category Chan(P) that we wish to illustrate,
namely (1) that it has an ‘inversion’ operation, in the form of a dagger functor,


and (2) that it is isomorphic to the category Rel of sets with relations between
them (as morphisms). Probabilistic analogues of these two points will be de-
scribed later.

1 We start from a very basic observation, namely that morphisms in Chan(P)


can be reversed. There is a bijective correspondence, indicated by the double
lines, between morphisms in Chan(P):

    X ◦→ Y                                          X → P(Y)
    ========    that is, between functions:    ============
    Y ◦→ X                                          Y → P(X)

This correspondence sends f : X → P(Y) to the function f † : Y → P(X)


with f † (y) B {x | y ∈ f (x)}. Hence y ∈ f (x) iff x ∈ f † (y). Similarly one
sends g : Y → P(X) to g† : X → P(Y) via g† (x) B {y | x ∈ g(y)}. Clearly,
f †† = f and g†† = g.
It turns out that this dagger operation (−)† interacts nicely with compo-
sition: one has unit † = unit and also (g ◦· f )† = f † ◦· g† . This means that
the dagger is functorial. It can be described as a functor (−)† : Chan(P) →
Chan(P)op , which is the identity on objects: X † = X. The opposite (−)op
category is needed for this functor since it reverses arrows.
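This dagger on Chan(P) is easy to experiment with. Here is a small Python sketch (our own encoding, representing a channel X → P(Y) as a dict mapping each element to a set) that checks f †† = f and (g ◦· f )† = f † ◦· g† on a toy example:

```python
def compose(g, f):
    # Kleisli composition g .* f for powerset channels
    return {x: {z for y in f[x] for z in g[y]} for x in f}

def dagger(f, X, Y):
    # f_dagger(y) = {x | y in f(x)}
    return {y: {x for x in X if y in f[x]} for y in Y}

X, Y, Z = {1, 2}, {'a', 'b'}, {'u', 'v'}
f = {1: {'a'}, 2: {'a', 'b'}}          # channel X -> P(Y)
g = {'a': {'u'}, 'b': {'u', 'v'}}      # channel Y -> P(Z)
```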
2 We write Rel for the category with sets X as objects and relations R ⊆ X × Y
as morphisms X → Y. The identity X → X is given by the equality relation
Eq X ⊆ X × X, with Eq X = {(x, x) | x ∈ X}. Composition of R ⊆ X × Y and
S ⊆ Y × Z is the ‘relational’ composition S • R ⊆ X × Z given by:

S • R B {(x, z) | ∃y ∈ Y. R(x, y) and S (y, z)}.


It is not hard to see that we get a category in this way.
There is a ‘graph’ functor G : Chan(P) → Rel, which is the identity
on objects: G(X) = X. On a morphism f : X → Y, that is, on a function
f : X → P(Y), we define G( f ) ⊆ X×Y to be G( f ) = {(x, y) | x ∈ X, y ∈ f (x)}.
Then: G(unit X ) = Eq X and G(g ◦· f ) = G(g) • G( f ).
In the other direction there is also a functor F : Rel → Chan(P), which
is again the identity on objects: F(X) = X. On a morphism R : X → Y
in Rel, that is, on a relation R ⊆ X × Y, we define F(R) : X → P(Y) as
F(R)(x) = {y | R(x, y)}. This F preserves identities and composition.
These two functors G and F are each other’s inverses, in the sense that:

F ◦ G = id : Chan(P) → Chan(P) and G ◦ F = id : Rel → Rel.


This establishes an isomorphism Chan(P)  Rel of categories.
Interestingly, Rel is also a dagger category, via the familiar operation of


reversal of relations: for R ⊆ X × Y one can form R† ⊆ Y × X via R† (y, x) =


R(x, y). This yields a functor (−)† : Rel → Relop , obviously with (−)†† = id .
Moreover, the above functors G and F commute with the daggers of
Chan(P) and Rel, in the sense that:

G( f † ) = G( f )† and F(R† ) = F(R)† .

We shall prove the first equation and leave the second one to the interested
reader. The proof is obtained by carefully unpacking the right definition at
each stage. For a function f : X → P(Y) and elements x ∈ X, y ∈ Y,

G( f † )(y, x) ⇔ x ∈ f † (y) ⇔ y ∈ f (x) ⇔ G( f )(x, y) ⇔ G( f )† (y, x).

We now move from non-deterministic to probabilistic computation. Our aim


is to obtain analogous results, namely inversion in the form of a dagger functor
on a category of probabilistic channels, and an isomorphism of this category
with a category of probabilistic relations. One may expect that these results
hold for the category Chan(D) of probabilistic channels. But the situation is
a bit more subtle. Recall from the previous section that the dagger (Bayesian
inversion) c†ω : Y → X of a probabilistic channel c : X → Y requires a state
ω ∈ D(X) on the domain. In order to conveniently deal with this situation we
incorporate these states ω into the objects of our category. We follow [24] and
denote this category as Krn; its morphisms are ‘kernels’.

Definition 7.8.2. The category Krn of kernels has:

• objects: pairs (X, σ) where X is a finite set and σ ∈ D(X) is a distribution on


X with full support;
• morphisms: f : (X, σ) → (Y, τ) are probabilistic channels f : X → Y with
f = σ = τ.

Identity maps (X, σ) → (X, σ) in Krn are identity channels unit : X → X,


given by unit(x) = 1| x i, which we write simply as id . Composition in Krn is
ordinary composition ◦· of channels.

Theorem 7.8.3. Bayesian inversion forms a dagger functor (−)† : Krn →
Krnop which is the identity on objects and which sends:

    ( (X, σ) −f→ (Y, τ) )    7−→    ( (Y, τ) −fσ†→ (X, σ) )

This functor is its own inverse: f †† = f .

Proof. We first have to check that the dagger functor is well-defined, i.e. that


the above mapping yields another morphism in Krn. This follows from Propo-
sition 6.6.4 (1):
fσ† = τ = fσ† = ( f = σ) = σ.

Aside: this does not mean that f † ◦· f = id .


We next show that the dagger preserves identities and composition. We use
the concrete formulation of Bayesian inversion of Definition 6.6.1. A string-
diagrammatic proof is possible, see below. The identity map id : (X, σ) →
(X, σ) in Krn satisfies:

    id †σ (x) = σ|id = 1x = σ|1x = 1| x i = id (x),        see Exercise 6.1.1.
For morphisms (X, σ) −f→ (Y, τ) −g→ (Z, ρ) in Krn we have, for z ∈ Z,

    (g ◦· f )†σ (z) = σ|(g ◦· f ) = 1z = σ| f = (g = 1z)
                    = fσ† = ( f = σ)|g = 1z        by Theorem 6.6.3 (1)
                    = fσ† = τ|g = 1z
                    = fσ† = g†τ (z)
                    = ( fσ† ◦· g†τ )(z).

Finally we have ( fσ† )†τ = f by Proposition 6.6.4 (2).
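The theorem can also be checked numerically on small kernels. The Python sketch below (our own) implements state transformation, channel composition and the concrete dagger of Definition 6.6.1 with exact fractions, for a toy channel pair with full-support states:

```python
from fractions import Fraction as Q

def push(c, s):
    # state transformation of s along channel c
    out = {}
    for x, p in s.items():
        for y, q in c[x].items():
            out[y] = out.get(y, 0) + p * q
    return out

def compose(g, f):
    # channel composition (g .* f)(x) = g pushed along f(x)
    return {x: push(g, fx) for x, fx in f.items()}

def dagger(c, s):
    # Bayesian inversion: c_dag(y)(x) = s(x) * c(x)(y) / (c pushed on s)(y)
    t = push(c, s)
    return {y: {x: s[x] * c[x].get(y, 0) / t[y] for x in s} for y in t}

s = {'x1': Q(1, 4), 'x2': Q(3, 4)}
f = {'x1': {'y1': Q(1, 2), 'y2': Q(1, 2)}, 'x2': {'y1': Q(1, 3), 'y2': Q(2, 3)}}
g = {'y1': {'z1': Q(1, 5), 'z2': Q(4, 5)}, 'y2': {'z1': Q(3, 5), 'z2': Q(2, 5)}}
t = push(f, s)   # the state on Y, so that f : (X, s) -> (Y, t) in Krn
```

With exact fractions both functoriality equations hold on the nose, not just up to rounding.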

One can also prove this result via equational reasoning with string diagrams.
For instance preservation of composition ◦· by the dagger follows by uniqueness
from:

    [string diagrams omitted: a chain of four diagram equalities showing that
    fσ† ◦· g†τ satisfies the defining property of the dagger (g ◦· f )†σ ]

We now turn to probabilistic relations, with the goal of finding a category
of such relations that is isomorphic to Krn. For this purpose we use what are
called couplings. The definition that we use below differs only in inessential
ways from the formulation that we used in Subsection 4.5.1 for the Wasserstein
distance. For more information on couplings, see e.g. [11].


Definition 7.8.4. We introduce a category Cpl of couplings with the same
objects as Krn. A morphism (X, σ) → (Y, τ) in Cpl is a joint state ϕ ∈ D(X × Y)
with ϕ[1, 0] = σ and ϕ[0, 1] = τ. Such a distribution which marginalises to σ
and τ is called a coupling between σ and τ.
    Composition of ϕ : (X, σ) → (Y, τ) and ψ : (Y, τ) → (Z, ρ) is the distribution
ψ • ϕ ∈ D(X × Z) defined as:

    ψ • ϕ B hϕ[1, 0 | 0, 1], ψ[0, 1 | 1, 0]i = τ
          = Σ x∈X, z∈Z ( Σy∈Y ϕ(x, y) · ψ(y, z) / τ(y) ) | x, z i.        (7.32)

The identity coupling Eq (X,σ) : (X, σ) → (X, σ) is the distribution ∆ = σ.
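The composition formula (7.32) is easy to test numerically. The sketch below (our own, storing a coupling as a dict from pairs to fractions) checks that the composite of two couplings is again a distribution, that it couples the outer marginals, and that the dagger reverses composition:

```python
from fractions import Fraction as Q

def marg(phi, axis):
    # phi[1,0] for axis=0, phi[0,1] for axis=1
    out = {}
    for (x, y), p in phi.items():
        k = (x, y)[axis]
        out[k] = out.get(k, 0) + p
    return out

def bullet(psi, phi):
    # (psi . phi)(x, z) = sum_y phi(x, y) * psi(y, z) / tau(y), as in (7.32)
    tau = marg(phi, 1)
    out = {}
    for (x, y), p in phi.items():
        for (y2, z), q in psi.items():
            if y2 == y:
                out[(x, z)] = out.get((x, z), 0) + p * q / tau[y]
    return out

def dag(phi):
    # the dagger: swap the two arguments
    return {(y, x): p for (x, y), p in phi.items()}

phi = {(0, 'a'): Q(1, 6), (0, 'b'): Q(1, 3), (1, 'a'): Q(1, 3), (1, 'b'): Q(1, 6)}
psi = {('a', 'u'): Q(1, 4), ('a', 'v'): Q(1, 4), ('b', 'u'): Q(1, 2)}
comp = bullet(psi, phi)
```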

The essence of the following result is due to [24], but there it occurs in
slightly different form, namely in a setting of continuous probability. Here it is
translated to the discrete situation.

Theorem 7.8.5. Couplings as defined above indeed form a category Cpl.

1 This category carries a dagger functor (−)† : Cpl → Cplop which is the
  identity on objects; on morphisms it is defined via swapping:

      ( (X, σ) −ϕ→ (Y, τ) )†  B  ( (Y, τ) −hπ2 ,π1 i = ϕ→ (X, σ) ).

  More concretely, this dagger is defined by swapping arguments, as in:
  ϕ† = Σ x,y ϕ(x, y)| y, x i.
2 There is an isomorphism of categories Krn ≅ Cpl, in one direction by taking
  the graph of a channel, and in the other direction by disintegration. This
  isomorphism commutes with the daggers on the two categories.

Proof. We first need to prove that ψ • ϕ is a distribution:

    Σ x,z (ψ • ϕ)(x, z) = Σ x,y,z ϕ(x, y) · ψ(y, z) / τ(y)
                        = Σy,z ( Σ x ϕ(x, y) ) · ψ(y, z) / τ(y)
                        = Σy,z ϕ[0, 1](y) · ψ(y, z) / τ(y)
                        = Σy,z τ(y) · ψ(y, z) / τ(y)  =  Σy,z ψ(y, z)  =  1.

We leave it to the reader to check that Eq (X,σ) = ∆ = σ is a neutral element
for •. We do verify that • is associative — and thus that Cpl is indeed a category.
Let ϕ : (X, σ) → (Y, τ), ψ : (Y, τ) → (Z, ρ), χ : (Z, ρ) → (W, κ) be morphisms in
Cpl. Then:

    (χ • (ψ • ϕ))(x, w) = Σz (ψ • ϕ)(x, z) · χ(z, w) / ρ(z)
                        = Σy,z ϕ(x, y) · ψ(y, z) · χ(z, w) / (τ(y) · ρ(z))
                        = Σy ϕ(x, y) · (χ • ψ)(y, w) / τ(y)
                        = ((χ • ψ) • ϕ)(x, w).
We turn to the dagger. It is obvious that (−)†† is the identity functor, and also
that (−)† preserves identity maps. It also preserves composition in Cpl since:

    (ψ • ϕ)† (z, x) = (ψ • ϕ)(x, z)
                    = Σy ϕ(x, y) · ψ(y, z) / τ(y)
                    = Σy ψ† (z, y) · ϕ† (y, x) / τ(y)
                    = (ϕ† • ψ† )(z, x).
The graph operation gr on channels from Definition 2.5.6 gives rise to an
identity-on-objects ‘graph’ functor G : Krn → Cpl via:

    G( (X, σ) −f→ (Y, τ) )  B  ( (X, σ) −gr(σ, f )→ (Y, τ) ),

where gr(σ, f ) = hid , f i = σ. This yields a functor since:

    G(id (X,σ) ) = hid , id i = σ = Eq (X,σ)

    G(g ◦· f )(x, z) = gr(σ, g ◦· f )(x, z)
                     = σ(x) · (g ◦· f )(x)(z)
                     = Σy σ(x) · f (x)(y) · g(y)(z)
                     = Σy σ(x) · f (x)(y) · τ(y) · g(y)(z) / τ(y)
                     = Σy gr(σ, f )(x, y) · gr(τ, g)(y, z) / τ(y)
                     = (gr(τ, g) • gr(σ, f ))(x, z)
                     = (G(g) • G( f ))(x, z).


In the other direction we define a functor F : Cpl → Krn which is the
identity on objects and use disintegration on morphisms: for ϕ : (X, σ) → (Y, τ)
in Cpl we get a channel F(ϕ) B dis 1 (ϕ) = ϕ[0, 1 | 1, 0] : X → Y which satisfies,
by construction (7.20):

    ϕ = gr(ϕ[1, 0], dis 1 (ϕ)) = gr(σ, F(ϕ)) = GF(ϕ).

Moreover, F(ϕ) is a morphism (X, σ) → (Y, τ) in Krn since:

    F(ϕ) = σ = dis 1 (ϕ) = ϕ[1, 0]
         = gr(ϕ[1, 0], dis 1 (ϕ))[0, 1]        by (7.19)
         = ϕ[0, 1] = τ.
We still need to prove that F preserves identities and composition. This
follows by uniqueness of disintegration:

    hid , F(Eq (X,σ) )i = σ = Eq (X,σ)        by definition
                        = ∆ = σ
                        = hid , id i = σ

    hid , F(ψ • ϕ)i = σ
        = ψ • ϕ        by definition
        = hϕ[1, 0 | 0, 1], ψ[0, 1 | 1, 0]i = τ        by (7.32)
        = (id ⊗ ψ[0, 1 | 1, 0]) = hϕ[1, 0 | 0, 1], id i = ϕ[0, 1 | 1, 0] = σ
        = (id ⊗ F(ψ)) = hF(ϕ)† , id i = F(ϕ) = σ        by Exercise 7.6.10
        = (id ⊗ F(ψ)) = hid , F(ϕ)i = σ        by (7.26)
        = hid , F(ψ) ◦· F(ϕ)i = σ.

We have seen that, by construction, G ◦ F is the identity functor on the
category Cpl. In the other direction we also have F ◦ G = id : Krn → Krn.
This follows directly from uniqueness of disintegration.
    Finally we show that the functors G and F commute with the daggers, on
kernels and couplings. This works as follows. First, for f : (X, σ) → (Y, τ) we
have:

    G( f † )(y, x) = gr(τ, f † )(y, x) = τ(y) · fσ† (y)(x)
                  = ( f = σ)(y) · σ(x) · f (x)(y) / ( f = σ)(y)
                  = σ(x) · f (x)(y)
                  = gr(σ, f )(x, y)
                  = G( f )(x, y) = G( f )† (y, x).


Next, for ϕ : (X, σ) → (Y, τ) in Cpl,

    F(ϕ† )(y)(x) = dis 1 (ϕ† )(y)(x)
                 = ϕ† (y, x) / ϕ† [1, 0](y)        by (7.21)
                 = ϕ(x, y) / ϕ[0, 1](y)
                 = ϕ[1, 0 | 0, 1](y)(x)
                 = ϕ[0, 1 | 1, 0]† (y)(x)        by Exercise 7.6.10
                 = F(ϕ)† (y)(x).

We have done a lot of work in order to be able to say that Krn and Cpl are
isomorphic dagger categories, or, more informally, that there is a one-one cor-
respondence between probabilistic computations (channels) and probabilistic
relations.

Exercises
7.8.1 Prove in the context of the powerset channels of Example 7.8.1 that:
1 unit † = unit and (g ◦· f )† = f † ◦· g† .
2 unit † = unit and (g ◦· f )† = f † ◦· g† .
2 G(unit X ) = Eq X and G(g ◦· f ) = G(g) • G( f ).
3 F(Eq X ) = unit X and F(S • R) = F(S ) ◦· F(R).
4 (S • R)† = R† • S † .
5 F(R† ) = F(R)† .
7.8.2 Give a string diagrammatic proof of the property f †† = f in Theo-
rem 7.8.3. Prove also that ( f ⊗ g)† = f † ⊗ g† .
7.8.3 Prove the second equation in the definition (7.32) of composition • in the
category Cpl. Give also a string diagrammatic description of •.

7.9 Factorisation of joint states


Earlier, in Subsection 2.5.1, we have called a binary joint state non-entwined
when it is the product of its marginals. This can be seen as an intrinsic property
of the state, which we will express in terms of a string diagram, called its shape.
For instance, the state:

    ω = 1/4 | a, b i + 1/2 | a, b⊥ i + 1/12 | a⊥ , b i + 1/6 | a⊥ , b⊥ i


is non-entwined: it is the product of its marginals ω[1, 0] = 3/4 | a i + 1/4 | a⊥ i
and ω[0, 1] = 1/3 | b i + 2/3 | b⊥ i. We will formulate this as:

    ω has shape, or: ω factorises as, [the string diagram of two separate state
    boxes, one per output wire]; and we write this as: ω |≈ [two state boxes].

We shall give a formal definition of |≈ below, but at this stage it suffices to read
σ |≈ S , for a state σ and a string diagram S , as: there is an interpretation of the
boxes in S such that σ = [[ S ]].
    In the above case of ω |≈ [two state boxes] we obtain ω = ω1 ⊗ ω2 , for some
state ω1 that interprets the box on the left, and some ω2 interpreting the box on
the right. But then:

    ω[1, 0] = (ω1 ⊗ ω2 )[1, 0] = ω1 .

Similarly, ω[0, 1] = ω2 . Thus, in this case the interpretations of the boxes are
uniquely determined, namely as first and second marginal of ω.
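Non-entwinedness of this particular ω is a small finite computation. The following sketch (our own, writing a+ for the complemented element a⊥, etc.) verifies that ω coincides with the product of its two marginals:

```python
from fractions import Fraction as Q

omega = {('a', 'b'): Q(1, 4), ('a', 'b+'): Q(1, 2),
         ('a+', 'b'): Q(1, 12), ('a+', 'b+'): Q(1, 6)}

# compute both marginals
m1, m2 = {}, {}
for (x, y), p in omega.items():
    m1[x] = m1.get(x, 0) + p
    m2[y] = m2.get(y, 0) + p
```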
We conclude that non-entwinedness of an arbitrary binary joint state ω can
be expressed as: ω |≈ . Here we are interested in similar intrinsic ‘shape’
properties of states and channels that can be expressed via string diagrams.
These matters are often discussed in the literature in terms of (conditional)
independencies. Here we prefer to use shapes instead of independencies since
they are more expressive.
In general there may be several interpretations of a string diagram (as shape).
Consider for instance:
14
25 | H i + 11
25 | T i |≈

This state has this form in multiple ways, for instance as:

c1 =≫ σ1 = 14/25 |H⟩ + 11/25 |T⟩ = c2 =≫ σ2

for:

σ1 = 1/5 |1⟩ + 4/5 |0⟩              σ2 = 2/5 |1⟩ + 3/5 |0⟩
c1(1) = 4/5 |H⟩ + 1/5 |T⟩   and   c2(1) = 1/2 |H⟩ + 1/2 |T⟩
c1(0) = 1/2 |H⟩ + 1/2 |T⟩           c2(0) = 3/5 |H⟩ + 2/5 |T⟩.
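Both factorisations can be verified by a short computation. In the sketch below (plain Python; states as dictionaries and channels as nested dictionaries, a representation of our own choosing) each pair is pushed forward and yields the same coin state:

```python
def push(channel, state):
    """State transformation: push a state forward along a channel."""
    out = {}
    for x, px in state.items():
        for y, pyx in channel[x].items():
            out[y] = out.get(y, 0.0) + px * pyx
    return out

sigma1 = {1: 1/5, 0: 4/5}
c1 = {1: {"H": 4/5, "T": 1/5}, 0: {"H": 1/2, "T": 1/2}}

sigma2 = {1: 2/5, 0: 3/5}
c2 = {1: {"H": 1/2, "T": 1/2}, 0: {"H": 3/5, "T": 2/5}}

s1, s2 = push(c1, sigma1), push(c2, sigma2)
# Both pushforwards equal 14/25|H> + 11/25|T>.
assert abs(s1["H"] - 14/25) < 1e-12 and abs(s2["H"] - 14/25) < 1e-12
assert abs(s1["T"] - 11/25) < 1e-12 and abs(s2["T"] - 11/25) < 1e-12
```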

We note that this is not an accessible string diagram: the wire in between the two
boxes cannot be accessed from the outside. If these wires are accessible, then
we can access the individual boxes of a string diagram and use disintegration
to compute them. We illustrate how this works.


Example 7.9.1. Consider two-element sets A = {a, a⊥}, B = {b, b⊥}, C = {c, c⊥}
and D = {d, d⊥} and an (accessible) string diagram S of the form:

S = [a state box σ with output wires A and B, both copied; a channel box f
turns one copy of the A-wire into C and a channel box g turns one copy of
the B-wire into D, giving output wires A, C, D, B]        (7.33)

Now suppose we have a joint state ω ∈ D(A × C × D × B) given by:

ω = 1/25 |a, c, d, b⟩ + 9/50 |a, c, d, b⊥⟩ + 3/50 |a, c, d⊥, b⟩ + 1/50 |a, c, d⊥, b⊥⟩
  + 1/25 |a, c⊥, d, b⟩ + 9/50 |a, c⊥, d, b⊥⟩ + 3/50 |a, c⊥, d⊥, b⟩ + 1/50 |a, c⊥, d⊥, b⊥⟩
  + 3/125 |a⊥, c, d, b⟩ + 9/500 |a⊥, c, d, b⊥⟩ + 9/250 |a⊥, c, d⊥, b⟩ + 1/500 |a⊥, c, d⊥, b⊥⟩
  + 12/125 |a⊥, c⊥, d, b⟩ + 9/125 |a⊥, c⊥, d, b⊥⟩ + 18/125 |a⊥, c⊥, d⊥, b⟩ + 1/125 |a⊥, c⊥, d⊥, b⊥⟩

We ask ourselves: does ω |≈ S hold? More specifically, can we somehow ob-


tain interpretations of the boxes σ, f and g in S so that ω = [[ S ]]? We shall
show that by appropriately using marginalisation and disintegration we can
‘factorise’ this joint state according to the above string diagram.
First we can obtain the state σ ∈ D(A × B) by discarding the C, D outputs in
the middle, as in:

[string-diagrammatic equation: discarding the C and D wires of S turns the
boxes f and g into discard maps, leaving just the state box σ with its A and
B wires]

We can thus compute σ as:

σ = ω[1, 0, 0, 1]
  = Σu,v ω(a, u, v, b) |a, b⟩ + Σu,v ω(a, u, v, b⊥) |a, b⊥⟩
    + Σu,v ω(a⊥, u, v, b) |a⊥, b⟩ + Σu,v ω(a⊥, u, v, b⊥) |a⊥, b⊥⟩
  = (1/25 + 3/50 + 1/25 + 3/50) |a, b⟩ + (9/50 + 1/50 + 9/50 + 1/50) |a, b⊥⟩
    + (3/125 + 9/250 + 12/125 + 18/125) |a⊥, b⟩ + (9/500 + 1/500 + 9/125 + 1/125) |a⊥, b⊥⟩
  = 1/5 |a, b⟩ + 2/5 |a, b⊥⟩ + 3/10 |a⊥, b⟩ + 1/10 |a⊥, b⊥⟩.

Next we concentrate on the channels f and g, from A to C and from B to D.
We first illustrate how to restrict the string diagram to the relevant part via
marginalisation. For f we concentrate on:

[string-diagrammatic equation: discarding the D and B wires of S leaves the
channel box f on top of the first marginal of the state box σ]

The string diagram on the right tells us that we can obtain f via disintegration
from the marginal ω[1, 1, 0, 0], using that extracted channels are unique in
diagrams of this form, see (7.19). In the same way one obtains g from the
marginal ω[0, 0, 1, 1]. Thus:

f = ω[0, 1, 0, 0 | 1, 0, 0, 0], that is:

f(a) = (Σv,y ω(a, c, v, y) / Σu,v,y ω(a, u, v, y)) |c⟩
         + (Σv,y ω(a, c⊥, v, y) / Σu,v,y ω(a, u, v, y)) |c⊥⟩
     = 1/2 |c⟩ + 1/2 |c⊥⟩
f(a⊥) = (Σv,y ω(a⊥, c, v, y) / Σu,v,y ω(a⊥, u, v, y)) |c⟩
         + (Σv,y ω(a⊥, c⊥, v, y) / Σu,v,y ω(a⊥, u, v, y)) |c⊥⟩
     = 1/5 |c⟩ + 4/5 |c⊥⟩

g = ω[0, 0, 1, 0 | 0, 0, 0, 1], that is:

g(b) = (Σx,u ω(x, u, d, b) / Σx,u,v ω(x, u, v, b)) |d⟩
         + (Σx,u ω(x, u, d⊥, b) / Σx,u,v ω(x, u, v, b)) |d⊥⟩
     = 2/5 |d⟩ + 3/5 |d⊥⟩
g(b⊥) = (Σx,u ω(x, u, d, b⊥) / Σx,u,v ω(x, u, v, b⊥)) |d⟩
         + (Σx,u ω(x, u, d⊥, b⊥) / Σx,u,v ω(x, u, v, b⊥)) |d⊥⟩
     = 9/10 |d⟩ + 1/10 |d⊥⟩.

At this stage one can check that the joint state ω can be reconstructed from
these extracted state and channels, namely as:

[[S]] = (id ⊗ f ⊗ g ⊗ id) =≫ (∆ ⊗ ∆) =≫ σ = ω.

This proves ω |≈ S.
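The whole factorisation can be replayed mechanically. The sketch below (plain Python with exact fractions; the representation and helper names are our own, not the book's) recomputes σ, f and g by marginalisation and disintegration, and checks the reconstruction of ω:

```python
from fractions import Fraction as F

A, C, D, B = ["a", "a_"], ["c", "c_"], ["d", "d_"], ["b", "b_"]

# The joint state ω of Example 7.9.1 (component order: A, C, D, B; ⊥ as "_").
w = {("a","c","d","b"): F(1,25),    ("a","c","d","b_"): F(9,50),
     ("a","c","d_","b"): F(3,50),   ("a","c","d_","b_"): F(1,50),
     ("a","c_","d","b"): F(1,25),   ("a","c_","d","b_"): F(9,50),
     ("a","c_","d_","b"): F(3,50),  ("a","c_","d_","b_"): F(1,50),
     ("a_","c","d","b"): F(3,125),  ("a_","c","d","b_"): F(9,500),
     ("a_","c","d_","b"): F(9,250), ("a_","c","d_","b_"): F(1,500),
     ("a_","c_","d","b"): F(12,125),("a_","c_","d","b_"): F(9,125),
     ("a_","c_","d_","b"): F(18,125),("a_","c_","d_","b_"): F(1,125)}

def marg(mask):
    """Marginalise ω, keeping the components where mask is 1."""
    out = {}
    for xs, p in w.items():
        key = tuple(x for x, keep in zip(xs, mask) if keep)
        out[key] = out.get(key, F(0)) + p
    return out

sigma = marg((1, 0, 0, 1))                      # state on A × B
margAC, margA = marg((1, 1, 0, 0)), marg((1, 0, 0, 0))
f = {a: {c: margAC[(a, c)] / margA[(a,)] for c in C} for a in A}
margDB, margB = marg((0, 0, 1, 1)), marg((0, 0, 0, 1))
g = {b: {d: margDB[(d, b)] / margB[(b,)] for d in D} for b in B}

# Reconstruction: [[S]](a, c, d, b) = σ(a, b)·f(a)(c)·g(b)(d) equals ω.
assert all(w[(a, c, d, b)] == sigma[(a, b)] * f[a][c] * g[b][d]
           for a in A for c in C for d in D for b in B)
```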

We now come to the definition of |≈. We will use it for channels, and not just
for states, as used above.

Definition 7.9.2. Let A1, . . . , An, B1, . . . , Bm be finite sets. Consider a channel
c : A1 × · · · × An → B1 × · · · × Bm and a string diagram S ∈ SD(Σ) over signature
Σ, with domain [A1, . . . , An] and codomain [B1, . . . , Bm]. We say that c and S
have the same type.


In this situation we write:

c |≈ S

if there is an interpretation of the boxes in Σ such that c = [[ S ]].


We then say that c has shape S and also that c factorises according to S . Al-
ternative terminology is: c is Markov relative to S , or: S represents c. Given c
and S , finding an interpretation of Σ such that c = [[ S ]] is called a factorisation
problem. It may have no, exactly one, or more than one solution (interpreta-
tion), as we have seen above.

The following illustration is a classical one, showing how seemingly different
shapes are related. It is often used to describe conditional independence — in
this case of A, C, given B.

Theorem 7.9.3. Let a state ω ∈ D(A × B × C) have full support.

1 There are equivalences:

   ω |≈ [fork]   ⟺   ω |≈ [chain, read from A]   ⟺   ω |≈ [chain, read from C]

2 These equivalences can be extended as:

   ω |≈ [fork]   ⟺   ω|1⊗1b⊗1 [1, 0, 1] |≈ [two separate state boxes]   for all b ∈ B.

The string diagram on the left in item (1) is often called a fork; the other
two are called a chain. In item (2) this shape is related to non-entwinedness,
pointwise.

Proof. 1 Since ω has full support, so have all its marginals, see Exercise 7.5.1.
This allows us to perform all disintegrations below. We start on the left-hand
side, and assume an interpretation ω = ⟨c, id, d⟩ =≫ τ, consisting of a state
τ = ω[0, 1, 0] ∈ D(B) and channels c : B → A and d : B → C. We write
σ = c =≫ τ = ω[1, 0, 0] and take the Bayesian inversion c†τ : A → B. We now
have an interpretation of the string diagram in the middle, which is equal to
ω, since by (7.26):

[string-diagrammatic calculation: the chain built from the state σ with the
channels c†τ and d rewrites, using (7.26), into the fork built from the state τ
with the channels c and d, which is ω]

Similarly one obtains an interpretation of the string diagram on the right via
ρ = d =≫ τ = ω[0, 0, 1] and the inversion d†τ : C → B.

In the direction (⇐) one uses Bayesian inversion in a similar manner to
transform one interpretation into another one.
2 The direction (⇒) is easy and is left to the reader. In fact, Proposition 7.9.5 (1)
below gives a slightly stronger result.

We concentrate on (⇐). By assumption, for b ∈ B we can write:

ω|1⊗1b⊗1 [1, 0, 1] = σb ⊗ τb   for σb ∈ D(A), τb ∈ D(C).

In a next step we define channels f : B → A and g : B → C and a state
ρ ∈ D(B) as:

f(b) ≔ σb        g(b) ≔ τb        ρ ≔ ω[0, 1, 0].

Then, for x ∈ A and z ∈ C,

f(b)(x) · g(b)(z) = (f(b) ⊗ g(b))(x, z) = (σb ⊗ τb)(x, z)
  = ω|1⊗1b⊗1 [1, 0, 1] (x, z)
  = Σy (ω(x, y, z) · 1b(y)) / (ω |= 1 ⊗ 1b ⊗ 1)
  = ω(x, b, z) / (ω[0, 1, 0] |= 1b)
  = ω(x, b, z) / ρ(b).

But now we see that ω has a fork shape:

ω(x, b, z) = f(b)(x) · g(b)(z) · ρ(b) = (⟨f, id, g⟩ =≫ ρ)(x, b, z).
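The fork-to-chain passage via Bayesian inversion can also be checked concretely. Below is a small Python sketch (states and channels as dictionaries; the numbers are an illustration of our own, not from the text) that computes σ = c =≫ τ together with the inversion c†τ, and verifies that the fork and chain interpretations define the same joint state:

```python
def invert(c, tau):
    """Bayesian inversion: from a prior tau on B and a channel c : B -> X,
    compute sigma = c =>> tau and c_dag : X -> B with
    c_dag(x)(b) = tau(b) * c(b)(x) / sigma(x)."""
    sigma = {}
    for b, pb in tau.items():
        for x, q in c[b].items():
            sigma[x] = sigma.get(x, 0.0) + pb * q
    c_dag = {x: {b: tau[b] * c[b][x] / sigma[x] for b in tau} for x in sigma}
    return sigma, c_dag

tau = {"b": 0.4, "b_": 0.6}                                      # state on B
c = {"b": {"x": 0.5, "x_": 0.5}, "b_": {"x": 0.25, "x_": 0.75}}  # channel B -> X
d = {"b": {"z": 0.8, "z_": 0.2}, "b_": {"z": 0.3, "z_": 0.7}}    # channel B -> Z

# Fork interpretation: omega(x, b, z) = tau(b) * c(b)(x) * d(b)(z).
fork = {(x, b, z): tau[b] * c[b][x] * d[b][z]
        for b in tau for x in c[b] for z in d[b]}

# Chain interpretation: omega(x, b, z) = sigma(x) * c_dag(x)(b) * d(b)(z).
sigma, c_dag = invert(c, tau)
chain = {(x, b, z): sigma[x] * c_dag[x][b] * d[b][z]
         for x in sigma for b in tau for z in d["b"]}

assert all(abs(fork[k] - chain[k]) < 1e-12 for k in fork)
```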




What the first item of this result shows is that (sub)shapes of the form:

[a state box with a channel box on one of its copied output wires]

can be changed into

[the pushed-forward state box with the Bayesian inversion of that channel on the other copied wire]

and vice-versa. By applying these transformations directly to the shapes in
Theorem 7.9.3 the equivalences ⟺ can be obtained.
In the naive Bayesian model in Subsection 7.7.1 we have extracted certain
structure from a joint state, given a particular string diagram (or shape) (7.29).
This can be done quite generally, essentially as in Example 7.9.1. But note
that the resulting interpretation of the string diagram need not be equal to the
original joint state — as Subsection 7.7.1 shows.

Proposition 7.9.4. Let c be a channel with full support, of the same type as
an accessible string diagram S ∈ SD(Σ) that does not contain . Then there
is a unique interpretation of Σ that can be obtained from c. In this way we can
factorise c according to S .
In the special case that c |≈ S holds, the factorisation interpretation of S
obtained in this way from c is the same one that gives [[ S ]] = c because of
c |≈ S and uniqueness of disintegrations.

Proof. We conveniently write multiple wires as a single wire, using product


types. We use induction on the number of boxes in S . If this number is zero,
then S only consists of ‘structural’ elements, and can be interpreted irrespec-
tive of the channel c.
Now let S contain at least one box. Pick a box g at the top of S , with all of
its output wires directly coming out of S . Thus, since S is accessible, we may
assume that it has the form as described on the left below.

[diagram: S consists of a box g at the top of a remaining diagram h; the
diagram S′ is obtained by discarding the output wires of S that do not come
from g]

By discarding the wire on the left we get a diagram S′ as used in disintegration,
see Definition 7.5.4. But then a unique channel (interpretation) g can be
extracted from [[S′]] = c[0, 1, 1].

We can now apply the induction hypothesis with the channel c[1, 1, 0] and
the corresponding string diagram from which the box g has been removed. It
will give an interpretation of all the other boxes — not equal to g — in S.

This result gives a way of testing whether a channel c has shape S : fac-
torise c according to S and check if the resulting interpretation [[ S ]] equals c.
Unfortunately, this is computationally rather expensive.


7.9.1 Shapes under conditioning


We have seen that a distribution can have a certain shape. An interesting
question that arises is: what happens to such a shape when the distribution is
updated? The result below answers this question, much like in [88], for three
basic shapes, called fork, chain and collider.

Proposition 7.9.5. Let ω ∈ D(X × Y × Z) be an arbitrary distribution and let
q ∈ Fact(Y) be a factor on its middle component Y. We write a ∈ Y for an
arbitrary element with associated point predicate 1a.

1 Let ω have fork shape: if

   ω |≈ [fork]   then also   ω|1⊗q⊗1 |≈ [fork].

In the special case of conditioning with a point predicate we get:

   ω|1⊗1a⊗1 |≈ [three separate state boxes].

As a consequence of Theorem 7.9.3, if ω has chain shape:

   ω |≈ [chain]   then also   ω|1⊗q⊗1 |≈ [chain].

2 Let ω have collider shape: if

   ω |≈ [collider]   then   ω|1⊗q⊗1 |≈ [collider, over an updated joint state box].

For this shape it does not matter if q is a point predicate or not.

Proof. 1 Let’s assume we have an interpretation ω = ⟨c, id, d⟩ =≫ σ, for
c : Y → X, d : Y → Z and σ ∈ D(Y). Then:

ω|1⊗q⊗1 = (⟨c, id, d⟩ =≫ σ)|1⊗q⊗1
  = ⟨c, id, d⟩|1⊗q⊗1 =≫ (σ|⟨c,id,d⟩ =≪ (1⊗q⊗1))   by Corollary 6.3.12 (2)
  = ⟨c, id, d⟩ =≫ (σ|q)   by Exercises 6.1.11 and 4.3.8.

Hence we see the same shape that ω has.

In the special case when q = 1a we can extend the above calculation and
obtain a parallel product of states:

ω|1⊗1a⊗1 = ⟨c, id, d⟩ =≫ (σ|1a)   as just shown
  = ((c ⊗ id ⊗ d) ◦· ∆3) =≫ 1|a⟩   by Lemma 6.1.6 (2)
  = (c ⊗ id ⊗ d) =≫ (∆3 =≫ 1|a⟩)
  = (c ⊗ id ⊗ d) =≫ (1|a⟩ ⊗ 1|a⟩ ⊗ 1|a⟩)
  = (c =≫ 1|a⟩) ⊗ 1|a⟩ ⊗ (d =≫ 1|a⟩).

2 Let’s assume as interpretation of the collider shape:

ω = (id ⊗ c ⊗ id) =≫ ((∆ ⊗ ∆) =≫ (σ ⊗ τ)),

for states σ ∈ D(X) and τ ∈ D(Z) and channel c : X × Z → Y. Then:

ω|1⊗q⊗1
  = ((id ⊗ c ⊗ id) =≫ ((∆ ⊗ ∆) =≫ (σ ⊗ τ)))|1⊗q⊗1
  = (id ⊗ c|q ⊗ id) =≫ (((∆ ⊗ ∆) =≫ (σ ⊗ τ))|1⊗(c =≪ q)⊗1)   by Corollary 6.3.12 (3)
  = (id ⊗ c|q ⊗ id) =≫ ((∆ ⊗ ∆)|1⊗(c =≪ q)⊗1 =≫ ((σ ⊗ τ)|(∆⊗∆) =≪ (1⊗(c =≪ q)⊗1)))   by Theorem 6.3.11
  = (id ⊗ c|q ⊗ id) =≫ ((∆ ⊗ ∆) =≫ ((σ ⊗ τ)|(∆⊗∆) =≪ (1⊗(c =≪ q)⊗1)))   by Exercise 6.1.10.

The updated state (σ ⊗ τ)|(∆⊗∆) =≪ (1⊗(c =≪ q)⊗1) is typically entwined, even if q is
a point predicate, see also Exercise 6.1.8.

The fact that conditioning with a point predicate destroys the shape is an
important phenomenon since it allows us to break entwinedness / correlations.
This is relevant in statistical analysis, esp. w.r.t. causality [136, 137], see the
causal surgery procedure in Section ??. In such a context, conditioning on a
point predicate is often expressed in terms of ‘controlling for’. For instance, if
there is a gender component G = {m, f } with elements m for male and f for
female, then conditioning with a point predicate 1m or 1 f , suitably weakened
via tensoring with truth 1, amounts to controlling for gender. Via restriction to
one gender value one fragments the shape and thus controls the influence of
gender in the situation at hand.
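The point-predicate case of item (1) above can be illustrated numerically: conditioning a fork-shaped state on a point predicate at the middle wire and marginalising out the middle leaves a non-entwined product. A small Python sketch (with a fork of our own devising; the names and numbers are not from the text):

```python
sigma = {0: 0.3, 1: 0.7}                                   # state on Y
c = {0: {"x": 0.2, "x_": 0.8}, 1: {"x": 0.9, "x_": 0.1}}   # channel Y -> X
d = {0: {"z": 0.5, "z_": 0.5}, 1: {"z": 0.4, "z_": 0.6}}   # channel Y -> Z

# Fork-shaped joint state: omega(x, y, z) = sigma(y)·c(y)(x)·d(y)(z).
omega = {(x, y, z): sigma[y] * c[y][x] * d[y][z]
         for y in sigma for x in c[y] for z in d[y]}

# Update with the weakened point predicate 1 ⊗ 1_{y=1} ⊗ 1,
# then marginalise out the middle component.
norm = sum(p for (x, y, z), p in omega.items() if y == 1)
outer = {(x, z): p / norm for (x, y, z), p in omega.items() if y == 1}

# The result is the product state c(1) ⊗ d(1): non-entwined.
assert all(abs(outer[(x, z)] - c[1][x] * d[1][z]) < 1e-12
           for x in c[1] for z in d[1])
```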

Exercises
7.9.1 Check that the aim of Exercise 2.5.2 really is to prove the shape statement
ω |≈ [diagram] for the (ternary) state ω defined there.


7.9.2 Prove that a collider shape leads to non-entwinedness:

ω |≈ [collider]   =⇒   ω[1, 0, 1] |≈ [two separate state boxes]

7.9.3 Let S ∈ SD(Σ) be given with an interpretation of the string diagram


signature Σ. Check that then [[ S ]] |≈ S .
7.9.4 Prove the following items, which are known as the ‘semigraphoid’
properties, see [163, 55]; they are seen as the basic rules of conditional
independence.

1 Symmetry: ω |≈ [diagram] implies ⟨π3, π2, π1⟩ =≫ ω |≈ [diagram].
2 Decomposition: ω |≈ [diagram] implies ω[0, 1, 1, 1] |≈ [diagram].
3 Weak union: ω |≈ [diagram] implies ω |≈ [diagram].
   Hint: Apply disintegration to the upper-left box.
4 Contraction: if ω |≈ [diagram] and also ω[1, 1, 1, 0] |≈ [diagram] then
   ω |≈ [diagram].
   Hint: Form a suitable combination of the two upper-right boxes in the
   assumptions.

7.10 Inference in Bayesian networks, reconsidered


Inference is one of the main topics in Bayesian probability. We have seen illus-
trations of Bayesian inference in Section 6.5 for the Asia Bayesian network,
using forward and back inference along a channel (see Definition 6.3.1). We
are now in a position to approach the topic of inference more systematically,
using string diagrams. We start by defining inference itself, namely as what we
have earlier called crossover inference.
Let ω ∈ D(X1 × · · · × Xn) be a joint state on a product space X1 × · · · × Xn.
Suppose we have evidence p on one component Xi, in the form of a factor p ∈
Fact(Xi). After updating with p, we are interested in the marginal distribution
at component j. We will assume that i ≠ j, otherwise the problem is simple and
can be reduced to updating the marginal at i = j with p, see Lemma 6.1.6 (6).
In order to update ω on X1 × · · · × Xn with p ∈ Fact(Xi) we first have to
extend the factor p to a factor on the whole product space, by weakening it to:

πi =≪ p = 1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1.

After the update with this factor we take the marginal at j in the form of:

πj =≫ ω|πi =≪ p = (ω|πi =≪ p)[0, . . . , 0, 1, 0, . . . , 0].

The latter provides a convenient form.

Definition 7.10.1. A basic inference query is of the form:

πj =≫ ω|πi =≪ p        (7.34)

for a joint state ω with an evidence factor p on its i-th component and with the
j-th marginal as conclusion.

In words: an inference query is a marginalisation of a joint state conditioned
with a weakened factor. What is called inference is the activity of calculating
the outcome of an inference query.
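A basic inference query can be implemented directly on a joint state. The sketch below (plain Python; the function name and representation are our own, not the book's) weakens the factor p at component i, updates, and marginalises at component j:

```python
def inference_query(joint, i, j, p):
    """Weaken the factor p to component i, update the joint state,
    and marginalise at component j."""
    weights = {xs: prob * p(xs[i]) for xs, prob in joint.items()}
    norm = sum(weights.values())
    out = {}
    for xs, wgt in weights.items():
        out[xs[j]] = out.get(xs[j], 0.0) + wgt / norm
    return out

# A small joint state on {0,1} × {"u","v"}, with point-predicate evidence at "u".
omega = {(0, "u"): 0.1, (0, "v"): 0.3, (1, "u"): 0.2, (1, "v"): 0.4}
posterior = inference_query(omega, i=1, j=0, p=lambda y: 1.0 if y == "u" else 0.0)
# posterior is {0: 1/3, 1: 2/3} (up to rounding)
```

For point-predicate evidence, as here, this is ordinary Bayesian conditioning on an observed value of component i.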

The term ‘query’ suggests a connection to the world of databases, especially
in probabilistic form [156]. This is justified since in the above expression
πj =≫ ω|πi =≪ p in (7.34) one may read the select-from-where structure in basic
database queries:

• The marginalisation πj =≫ (−) performs the select-part, performing a restriction
  to the j-th component.
• The joint distribution ω on tuples can be seen as a probabilistic database.
• The conditioning |(−) with weakened predicate πi =≪ p corresponds to the
  where-part that imposes a condition on the database.

An inference query may also have multiple evidence factors pk ∈ Fact(Xik),
giving an inference query of the form:

πj =≫ ω|(πi1 =≪ p1) & · · · & (πim =≪ pm) = πj =≫ (ω|πi1 =≪ p1 · · · |πim =≪ pm).

By suitably swapping components and using products one can reduce such a
query to one of the form (7.34). Similarly one may marginalise on several components
at the same time and again reduce this to the canonical form (7.34).
The notion of inference that we use is formulated in terms of joint states.
This gives mathematical clarity, but not a practical method to actually compute
queries. If the joint state has the shape of a Bayesian network, we can use this
network structure to guide the computations. This is formalised in the next
result: it describes quite generally what we have been illustrating many times
already, namely that inference can be done along channels — both forward and


backward — in particular along channels that interpret edges in a Bayesian


network (i.e. conditional probability tables).

Theorem 7.10.2. Let a joint state ω have the shape S of a Bayesian network:
ω |≈ S. An inference query πj =≫ ω|πi =≪ p can then be computed via forward
and backward inference along channels in the string diagram S.

Proof. We use induction on the number of boxes in S. If this number is one,
S consists of a single state box, whose interpretation is ω. Then we are done
by the crossover inference Corollary 6.3.13.

We now assume that S contains more than one box. We pick a box at the top
of S so that we can write:

ω = (idX1×···×Xk−1 ⊗ c ⊗ idXk+1×···×Xn) =≫ ρ   where ρ |≈ S′,

and where S′ is the string diagram obtained from S by removing the single
box whose interpretation we have written as channel c, say of type Y → Xk.
The induction hypothesis applies to ρ and S′.
Consider the situation below w.r.t. the underlying product space X1 × · · · × Xn
of ω,

[the evidence p sits at component Xi, the requested marginal at component Xj,
and the channel c produces component Xk]

Below we distinguish three cases about (in)equality of i, k, j, where, recall,
we assume i ≠ j.

1 i ≠ k and j ≠ k. In that case we can compute the inference query as:

πj =≫ ω|πi =≪ p = πj =≫ ((id ⊗ c ⊗ id) =≫ ρ)|πi =≪ p
  = πj =≫ (id ⊗ c ⊗ id) =≫ (ρ|πi =≪ p)   by Exercise 6.1.7 (1)
  = πj =≫ (ρ|πi =≪ p).

Hence we are done by the induction hypothesis.

Aside: we are oversimplifying by assuming that the domain of the channel
c is not a product space; if it is, say Y1 × · · · × Ym of length m > 1, we should
not marginalise at j but at j + m − 1. A similar shift may be needed for the
weakening πi =≪ p if i < k.


2 i = k, and hence j ≠ k. Then:

πj =≫ ω|πi =≪ p = πj =≫ ((id ⊗ c ⊗ id) =≫ ρ)|πi =≪ p
  = πj =≫ (ρ|(id⊗c⊗id) =≪ (πi =≪ p))   by Exercise 6.1.7 (2)
  = πj =≫ (ρ|πi =≪ (c =≪ p)).

The induction hypothesis now applies with transformed evidence c =≪ p.

Aside: again we are oversimplifying; when the channel c has a product
space as domain, the single projection πi in the conditioning-factor πi =≪
(c =≪ p) may have to be replaced by an appropriate tuple of projections.
3 j = k, and hence i ≠ k. Then:

πj =≫ ω|πi =≪ p = πj =≫ ((id ⊗ c ⊗ id) =≫ ρ)|πi =≪ p
  = πj =≫ (id ⊗ c ⊗ id) =≫ (ρ|πi =≪ p)   by Exercise 6.1.7 (1)
  = c =≫ πj =≫ (ρ|πi =≪ p).

Again we are done by the induction hypothesis.

Instead of a channel c, as used above, one can have a diagonal or a swap
map in the string diagram S. Since diagonals and swaps are given by functions
f — forming trivial channels — Lemma 6.1.6 (7) applies, so that we
can proceed via evaluation of πi ◦ f.
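The evidence-shifting step of case (2) can be checked on a toy example: updating the pushed-forward joint state with evidence on the channel's output yields the same marginal as updating the original state with the backward-transformed evidence. A Python sketch (all numbers and names are our own illustration):

```python
def pull(channel, p):
    """Predicate transformation: pull a factor back along a channel."""
    return {x: sum(q * p[y] for y, q in channel[x].items()) for x in channel}

rho = {("a", 0): 0.12, ("a", 1): 0.28, ("a_", 0): 0.36, ("a_", 1): 0.24}
c = {0: {"y": 0.7, "y_": 0.3}, 1: {"y": 0.1, "y_": 0.9}}   # channel on the 2nd wire
p = {"y": 1.0, "y_": 0.0}                                  # point evidence on Y

# Route 1: push the state forward along (id ⊗ c), condition on 1 ⊗ p, marginalise.
omega = {}
for (a, m), q in rho.items():
    for y, r in c[m].items():
        omega[(a, y)] = omega.get((a, y), 0.0) + q * r
norm1 = sum(q * p[y] for (a, y), q in omega.items())
direct = {}
for (a, y), q in omega.items():
    direct[a] = direct.get(a, 0.0) + q * p[y] / norm1

# Route 2: transform the evidence backward along c and condition rho instead.
cp = pull(c, p)
norm2 = sum(q * cp[m] for (a, m), q in rho.items())
shifted = {}
for (a, m), q in rho.items():
    shifted[a] = shifted.get(a, 0.0) + q * cp[m] / norm2

assert all(abs(direct[a] - shifted[a]) < 1e-12 for a in direct)
```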

Example 7.10.3. We shall recompute the query ‘smoking, given a positive
xray’ in the Asia Bayesian network from Section 6.5, this time using the
mechanism of Theorem 7.10.2. In order to do this we have to choose a particular
order of the channels in the network. Figure 7.5 gives two such orders, which
we also call ‘stretchings’ of the network. These two string diagrams have the
same semantics, so it does not matter which one we take. For this
illustration we choose the one on the left.
This means that we can write the joint state ω ∈ D(S × D × X) as composite:

ω = (id ⊗ dysp ⊗ id) ◦· (id ⊗ id ⊗ id ⊗ xray) ◦· (id ⊗ id ⊗ ∆2)
    ◦· (id ⊗ id ⊗ either) ◦· (id ⊗ id ⊗ id ⊗ tub) ◦· (id ⊗ id ⊗ id ⊗ asia)
    ◦· (id ⊗ id ⊗ lung) ◦· (id ⊗ bronc ⊗ id) ◦· ∆3 ◦· smoking

The positive xray evidence translates into a point predicate 1x on the set X =
{x, x⊥} of xray outcomes. It has to be weakened to π3 =≪ 1x = 1 ⊗ 1 ⊗ 1x
on the product space S × D × X. We are interested in the updated smoking
distribution, which can be obtained as first marginal. Thus we compute the
following inference query, where the number k in =(k) refers to the k-th of the


[two vertical arrangements of the boxes smoking, asia, bronc, lung, tub,
either, dysp and xray, in different orders, each ending in the output wires
S, D, X]

Figure 7.5 Two ‘stretchings’ of the Asia Bayesian network from Section 6.5, with
the same semantics.

three distinctions in the proof of Theorem 7.10.2.

π1 =≫ ω|π3 =≪ 1x
  = π1 =≫ ((id ⊗ dysp ⊗ id) ◦· · · ·)|π3 =≪ 1x
  =(1) π1 =≫ ((id ⊗ id ⊗ id ⊗ xray) ◦· · · ·)|π4 =≪ 1x
  =(2) π1 =≫ ((id ⊗ id ⊗ ∆2) ◦· · · ·)|π4 =≪ (xray =≪ 1x)
  = π1 =≫ ((id ⊗ id ⊗ either) ◦· · · ·)|π3 =≪ (xray =≪ 1x)
  =(2) π1 =≫ ((id ⊗ id ⊗ id ⊗ tub) ◦· · · ·)|⟨π3,π4⟩ =≪ (either =≪ (xray =≪ 1x))
  =(2) π1 =≫ ((id ⊗ id ⊗ id ⊗ asia) ◦· · · ·)|⟨π3,π4⟩ =≪ ((id⊗tub) =≪ (either =≪ (xray =≪ 1x)))
  =(2) π1 =≫ ((id ⊗ id ⊗ lung) ◦· · · ·)|π3 =≪ ((id⊗asia) =≪ ((id⊗tub) =≪ (either =≪ (xray =≪ 1x))))
  =(2) π1 =≫ ((id ⊗ bronc ⊗ id) ◦· · · ·)|π3 =≪ (lung =≪ ((id⊗asia) =≪ ((id⊗tub) =≪ (either =≪ (xray =≪ 1x)))))
  =(1) π1 =≫ (∆3 ◦· smoking)|π3 =≪ (lung =≪ ((id⊗asia) =≪ ((id⊗tub) =≪ (either =≪ (xray =≪ 1x)))))
  = smoking|lung =≪ ((id⊗asia) =≪ ((id⊗tub) =≪ (either =≪ (xray =≪ 1x))))
  = smoking|(xray ◦· either ◦· (id⊗tub) ◦· (id⊗asia) ◦· lung) =≪ 1x
  = 0.6878 |s⟩ + 0.3122 |s⊥⟩.

It is an exercise below to compute the same inference query for the stretching
on the right in Figure 7.5. The outcome is necessarily the same, by
Theorem 7.10.2, since there it is shown that all such inference computations are
equal to the computation on the joint state.
The inference calculation in Example 7.10.3 is quite mechanical in nature
and can thus be implemented easily, giving a channel-based inference algo-
rithm, see [75] for Bayesian networks. The algorithm consists of two parts.
1 It first finds a stretching of the Bayesian network with a minimal width,
that is, a description of the network as a sequence of channels such that
the state space in between the channels has minimal size. As can be seen
in Figure 7.5, there can be quite a bit of freedom in choosing the order of
channels.
2 It then performs the calculation as in Example 7.10.3, following the steps
in the proof of Theorem 7.10.2.
The resulting algorithm’s performance compares favourably to the performance
of the pgmpy Python library for Bayesian networks [5]. By design, it uses
(fuzzy) factors and not (sharp) events and thus solves the “soft evidential up-
date” problem [160].

Exercises
7.10.1 In Example 7.10.3 we have identified a state on a space X with a channel
1 → X, as we have done before, e.g. in Exercise 1.8.2. Consider
states σ ∈ D(X) and τ ∈ D(Y) with a factor p ∈ Fact(X × Y). Prove
that, under this identification,

σ|(id⊗τ) =≪ p = (σ ⊗ τ)|p [1, 0].
7.10.2 Compute the inference query from Example 7.10.3 for the stretching
of the Asia Bayesian network on the right in Figure 7.5.
7.10.3 From Theorem 7.10.2 one can conclude that for the channel-based
calculation of inference queries the particular stretching of a Bayesian
network does not matter — since they are all equal to inference on
the joint state. Implicitly, there is a ‘shift’ result about forward and
backward inference.
Let c : A → X and d : B → Y be channels, together with a joint
state ω ∈ D(A × B) and a factor q : Y → R≥0 on Y. Then the following
marginal distributions are the same.

((c ⊗ id) =≫ ω)|(id⊗d) =≪ (1⊗q) [1, 0]
  = ((c ⊗ id) =≫ (((id ⊗ d) =≫ ω)|1⊗q)) [1, 0].


1 Prove this equation.


2 Give an interpretation of this equation in relation to inference cal-
culations.

References

[1] S. Abramsky. Contextual semantics: From quantum mechanics to logic,


databases, constraints, and complexity. EATCS Bulletin, 113, 2014.
[2] D. Aldous. Exchangeability and related topics. In P. Hennequin, editor, École
d’Été de Probabilités de Saint-Flour XIII — 1983, number 1117 in Lect. Notes
Math., pages 1–198. Springer, Berlin, 1985.
[3] E. Alfsen and F. Shultz. State spaces of operator algebras: basic theory, ori-
entations and C ∗ -products. Mathematics: Theory & Applications. Birkhauser
Boston, 2001.
[4] C. Anderson. The end of theory. The data deluge makes the scientific method
obsolete. Wired, 16:07, 2008. Available at https://www.wired.com/2008/
06/pb-theory/.
[5] A. Ankan and A. Panda. Mastering Probabilistic Graphical Models using
Python. Packt Publishing, Birmingham, 2015.
[6] C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian non-
parametric problems. Annals of Statistics, 2:1152–1174., 1974.
[7] S. Awodey. Category Theory. Oxford Logic Guides. Oxford Univ. Press, 2006.
[8] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge Univ.
Press, 2012. Publicly available via http://web4.cs.ucl.ac.uk/staff/D.
Barber/pmwiki/pmwiki.php?n=Brml.HomePage.
[9] M. Barr and Ch. Wells. Toposes, Triples and Theories. Springer, Berlin, 1985.
Revised and corrected version available from URL: www.tac.mta.ca/tac/
reprints/articles/12/tr12.pdf.
[10] M. Barr and Ch. Wells. Category Theory for Computing Science. Prentice Hall,
Englewood Cliffs, NJ, 1990.
[11] G. Barthe and J. Hsu. Probabilistic couplings from program logics. In G. Barthe,
J.-P. Katoen, and A. Silva, editors, Foundations of Probabilistic Programming,
pages 145–184. Cambridge Univ. Press, 2021.
[12] J. Beck. Distributive laws. In B. Eckman, editor, Seminar on Triples and Cat-
egorical Homology Theory, number 80 in Lect. Notes Math., pages 119–140.
Springer, Berlin, 1969.
[13] J. Bernardo and A. Smith. Bayesian Theory. John Wiley & Sons, 2000.
[14] C. Bishop. Pattern Recognition and Machine Learning. Information Science
and Statistics. Springer, 2006.


[15] F. van Breugel, C. Hermida, M. Makkai, and J. Worrell. Recursively defined


metric spaces without contraction. Theor. Comp. Sci., 380:143–163, 2007.
[16] P. Bruza and S. Abramsky. Probabilistic programs: Contextuality and relational
database theory. In Quantum Interaction, pages 163–174, 2016.
[17] F. Buscemi and V. Scarani. Fluctuation theorems from bayesian retrodiction.
Phys. Rev. E, 103:052111, 2021.
[18] A. Carboni and R. Walters. Cartesian bicategories I. Journ. of Pure & Appl.
Algebra, 49(1-2):11–32, 1987.
[19] H. Chan and A. Darwiche. On the revision of probabilistic beliefs using uncer-
tain evidence. Artif. Intelligence, 163:67–90, 2005.
[20] K. Cho and B. Jacobs. The EfProb library for probabilistic calculations. In
F. Bonchi and B. König, editors, Conference on Algebra and Coalgebra in Com-
puter Science (CALCO 2017), volume 72 of LIPIcs. Schloss Dagstuhl, 2017.
[21] K. Cho and B. Jacobs. Disintegration and Bayesian inversion via string dia-
grams. Math. Struct. in Comp. Sci., 29(7):938–971, 2019.
[22] K. Cho, B. Jacobs, A. Westerbaan, and B. Westerbaan. An introduction to effec-
tus theory. see arxiv.org/abs/1512.05813, 2015.
[23] A. Clark. Surfing Uncertainty. Prediction, Action, and the Embodied Mind. Ox-
ford Univ. Press, 2016.
[24] F. Clerc, F. Dahlqvist, V. Danos, and I. Garnier. Pointless learning. In J. Esparza
and A. Murawski, editors, Foundations of Software Science and Computation
Structures, number 10203 in Lect. Notes Comp. Sci., pages 355–369. Springer,
Berlin, 2017.
[25] B. Coecke and A. Kissinger. Picturing Quantum Processes. A First Course in
Quantum Theory and Diagrammatic Reasoning. Cambridge Univ. Press, 2016.
[26] B. Coecke and R. Spekkens. Picturing classical and quantum Bayesian inference.
Synthese, 186(3):651–696, 2012.
[27] H. Crane. The ubiquitous Ewens sampling formula. Statistical Science, 31(1):1–
19, 2016.
[28] J. Culbertson and K. Sturtz. A categorical foundation for Bayesian probability.
Appl. Categorical Struct., 22(4):647–662, 2014.
[29] F. Dahlqvist, V. Danos, and I. Garnier. Robustly parameterised higher-order
probabilistic models. In J. Desharnais and R. Jagadeesan, editors, Int. Conf. on
Concurrency Theory, volume 59 of LIPIcs, pages 23:1–23:15. Schloss Dagstuhl,
2016.
[30] F. Dahlqvist and D. Kozen. Semantics of higher-order probabilistic programs
with conditioning. In Princ. of Programming Languages, pages 57:1–57:29.
ACM Press, 2020.
[31] F. Dahlqvist, L. Parlant, and A. Silva. Layer by layer – combining monads. In
B. Fischer and T. Uustalu, editors, Theoretical Aspects of Computing, number
11187 in Lect. Notes Comp. Sci., pages 153–172. Springer, Berlin, 2018.
[32] V. Danos and T. Ehrhard. Probabilistic coherence spaces as a model of higher-
order probabilistic computation. Information & Computation, 209(6):966–991,
2011.
[33] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge
Univ. Press, 2009.


[34] S. Dash and S. Staton. A monad for probabilistic point processes. In D. Spi-
vak and J. Vicary, editors, Applied Category Theory Conference, Elect. Proc. in
Theor. Comp. Sci., 2020.
[35] S. Dash and S. Staton. Monads for measurable queries in probabilistic databases.
In A. Sokolova, editor, Math. Found. of Programming Semantics, 2021.
[36] E. Davies and J. Lewis. An operational approach to quantum probability. Com-
munic. Math. Physics, 17:239–260, 1970.
[37] B. de Finetti. Funzione caratteristica di un fenomeno aleatorio. Memorie della
R. Accademia Nazionale dei Lincei, IV, fasc. 5:86–113, 1930. Available at www.
brunodefinetti.it/Opere/funzioneCaratteristica.pdf.
[38] B. de Finetti. Theory of Probability: A critical introductory treatment. Wiley,
2017.
[39] P. Diaconis and S. Zabell. Updating subjective probability. Journ. American
Statistical Assoc., 77:822–830, 1982.
[40] P. Diaconis and S. Zabell. Some alternatives to Bayes’ rule. Technical Report
339, Stanford Univ., Dept. of Statistics, 1983.
[41] F. Dietrich, C. List, and R. Bradley. Belief revision generalized: A joint charac-
terization of Bayes’ and Jeffrey’s rules. Journ. of Economic Theory, 162:352–
371, 2016.
[42] E. Dijkstra. A Discipline of Programming. Prentice Hall, Englewood Cliffs, NJ,
1976.
[43] E. Dijkstra and C. Scholten. Predicate Calculus and Program Semantics.
Springer, Berlin, 1990.
[44] A. Dvurečenskij and S. Pulmannová. New Trends in Quantum Structures.
Kluwer Acad. Publ., Dordrecht, 2000.
[45] M. Erwig and E. Walkingshaw. A DSL for explaining probabilistic reasoning.
In W. Taha, editor, Domain-Specific Languages, number 5658 in Lect. Notes
Comp. Sci., pages 335–359. Springer, Berlin, 2009.
[46] W. Ewens. The sampling theory of selectively neutral alleles. Theoret. Popula-
tion Biology, 3:87–112, 1972.
[47] W. Feller. An Introduction to Probability Theory and Its applications, volume I.
Wiley, 3rd rev. edition, 1970.
[48] B. Fong. Causal theories: A categorical perspective on Bayesian networks. Mas-
ter’s thesis, Univ. of Oxford, 2012. see arxiv.org/abs/1301.6201.
[49] D. J. Foulis and M.K. Bennett. Effect algebras and unsharp quantum logics.
Found. Physics, 24(10):1331–1352, 1994.
[50] S. Friedland and S. Karlin. Some inequalities for the spectral radius of non-
negative matrices and applications. Duke Math. Journ., 42(3):459–490, 1975.
[51] K. Friston. The free-energy principle: a unified brain theory? Nature Reviews
Neuroscience, 11(2):127–138, 2010.
[52] T. Fritz. A synthetic approach to Markov kernels, conditional independence, and
theorems on sufficient statistics. Advances in Math., 370:107239, 2020.
[53] T. Fritz and E. Rischel. Infinite products and zero-one laws in categorical prob-
ability. Compositionality, 2(3), 2020.
[54] S. Fujii, S. Katsumata, and P. Melliès. Towards a formal theory of graded mon-
ads. In B. Jacobs and C. Löding, editors, Foundations of Software Science and

475
476 Chapter 7. References

Computation Structures, number 9634 in Lect. Notes Comp. Sci., pages 513–
530. Springer, Berlin, 2016.
[55] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks.
Networks, 20:507–534, 1990.
[56] A. Gibbs and F. Su. On choosing and bounding probability metrics. Int. Statis-
tical Review, 70(3):419–435, 2002.
[57] G. Gigerenzer and U. Hoffrage. How to improve Bayesian reasoning without
instruction: Frequency formats. Psychological Review, 102(4):684–704, 1995.
[58] M. Giry. A categorical approach to probability theory. In B. Banaschewski,
editor, Categorical Aspects of Topology and Analysis, number 915 in Lect. Notes
Math., pages 68–85. Springer, Berlin, 1982.
[59] A. Gordon, T. Henzinger, A. Nori, and S. Rajamani. Probabilistic programming.
In Int. Conf. on Software Engineering, 2014.
[60] T. Griffiths, C. Kemp, and J. Tenenbaum. Bayesian models of cognition. In
R. Sun, editor, Cambridge Handbook of Computational Cognitive Modeling,
pages 59–100. Cambridge Univ. Press, 2008.
[61] J. Halpern. Reasoning about Uncertainty. MIT Press, Cambridge, MA, 2003.
[62] M. Hayhoe, F. Alajaji, and B. Gharesifard. A Pólya urn-based model for
epidemics on networks. In American Control Conference, pages 358–363, 2017.
[63] W. Hino, H. Kobayashi, I. Hasuo, and B. Jacobs. Healthiness from duality. In
Logic in Computer Science. IEEE, Computer Science Press, 2016.
[64] J. Hohwy. The Predictive Mind. Oxford Univ. Press, 2013.
[65] F. Hoppe. Pólya-like urns and the Ewens’ sampling formula. Journ. Math.
Biology, 20:91–94, 1984.
[66] M. Hyland and J. Power. The category theoretic understanding of universal alge-
bra: Lawvere theories and monads. In L. Cardelli, M. Fiore, and G. Winskel, ed-
itors, Computation, Meaning, and Logic: Articles dedicated to Gordon Plotkin,
number 172 in Elect. Notes in Theor. Comp. Sci., pages 437–458. Elsevier, Am-
sterdam, 2007.
[67] B. Jacobs. Convexity, duality, and effects. In C. Calude and V. Sassone, editors,
IFIP Theoretical Computer Science 2010, number 82(1) in IFIP Adv. in Inf. and
Comm. Techn., pages 1–19. Springer, Boston, 2010.
[68] B. Jacobs. New directions in categorical logic, for classical, probabilistic and
quantum logic. Logical Methods in Comp. Sci., 11(3), 2015.
[69] B. Jacobs. Affine monads and side-effect-freeness. In I. Hasuo, editor, Coalge-
braic Methods in Computer Science (CMCS 2016), number 9608 in Lect. Notes
Comp. Sci., pages 53–72. Springer, Berlin, 2016.
[70] B. Jacobs. Introduction to Coalgebra. Towards Mathematics of States and Ob-
servations. Number 59 in Tracts in Theor. Comp. Sci. Cambridge Univ. Press,
2016.
[72] B. Jacobs. Hyper normalisation and conditioning for discrete probability
distributions. Logical Methods in Comp. Sci., 13(3:17), 2017.
See https://lmcs.episciences.org/3885.
[73] B. Jacobs. A note on distances between probabilistic and quantum distributions.
In A. Silva, editor, Math. Found. of Programming Semantics, Elect. Notes in
Theor. Comp. Sci. Elsevier, Amsterdam, 2017.
[74] B. Jacobs. A recipe for state and effect triangles. Logical Methods in Comp. Sci.,
13(2), 2017. See https://lmcs.episciences.org/3660.
[75] B. Jacobs. A channel-based exact inference algorithm for Bayesian networks.
See arxiv.org/abs/1804.08032, 2018.
[76] B. Jacobs. Learning along a channel: the Expectation part of Expectation-
Maximisation. In B. König, editor, Math. Found. of Programming Semantics,
number 347 in Elect. Notes in Theor. Comp. Sci., pages 143–160. Elsevier, Am-
sterdam, 2019.
[77] B. Jacobs. The mathematics of changing one’s mind, via Jeffrey’s or via Pearl’s
update rule. Journ. of Artif. Intelligence Research, 65:783–806, 2019.
[78] B. Jacobs. From multisets over distributions to distributions over multisets. In
Logic in Computer Science. IEEE, Computer Science Press, 2021.
[79] B. Jacobs. Learning from what’s right and learning from what’s wrong. In
A. Sokolova, editor, Math. Found. of Programming Semantics, 2021.
[80] B. Jacobs. Multinomial and hypergeometric distributions in Markov categories.
In A. Sokolova, editor, Math. Found. of Programming Semantics, 2021.
[81] B. Jacobs. Multisets and distributions, in drawing and learning. In A. Palmi-
giano and M. Sadrzadeh, editors, Samson Abramsky on Logic and Structure in
Computer Science and Beyond. Springer, 2021, to appear.
[82] B. Jacobs, A. Kissinger, and F. Zanasi. Causal inference by string diagram
surgery. In M. Bojańczyk and A. Simpson, editors, Foundations of Software
Science and Computation Structures, number 11425 in Lect. Notes Comp. Sci.,
pages 313–329. Springer, Berlin, 2019.
[83] B. Jacobs, J. Mandemaker, and R. Furber. The expectation monad in quantum
foundations. Information & Computation, 250:87–114, 2016.
[84] B. Jacobs, A. Silva, and A. Sokolova. Trace semantics via determinization.
Journ. of Computer and System Sci., 81(5):859–879, 2015.
[85] B. Jacobs and S. Staton. De Finetti’s construction as a categorical limit.
In D. Petrişan and J. Rot, editors, Coalgebraic Methods in Computer Sci-
ence (CMCS 2020), number 12094 in Lect. Notes Comp. Sci., pages 90–111.
Springer, Berlin, 2020.
[86] B. Jacobs and A. Westerbaan. Distances between states and between predicates.
Logical Methods in Comp. Sci., 16(1), 2020.
See https://lmcs.episciences.org/6154.
[87] B. Jacobs and F. Zanasi. A predicate/state transformer semantics for Bayesian
learning. In L. Birkedal, editor, Math. Found. of Programming Semantics, num-
ber 325 in Elect. Notes in Theor. Comp. Sci., pages 185–200. Elsevier, Amster-
dam, 2016.
[88] B. Jacobs and F. Zanasi. A formal semantics of influence in Bayesian reasoning.
In K. Larsen, H. Bodlaender, and J.-F. Raskin, editors, Math. Found. of Computer
Science, volume 83 of LIPIcs, pages 21:1–21:14. Schloss Dagstuhl, 2017.
[89] B. Jacobs and F. Zanasi. The logical essentials of Bayesian reasoning. In
G. Barthe, J.-P. Katoen, and A. Silva, editors, Foundations of Probabilistic Pro-
gramming, pages 295–331. Cambridge Univ. Press, 2021.
[90] R. Jeffrey. The Logic of Decision. The Univ. of Chicago Press, 2nd rev. edition,
1983.
[91] F. Jensen and T. Nielsen. Bayesian Networks and Decision Graphs. Statistics
for Engineering and Information Science. Springer, 2nd rev. edition, 2007.
[92] P. Johnstone. Stone Spaces. Number 3 in Cambridge Studies in Advanced Math-
ematics. Cambridge Univ. Press, 1982.
[93] C. Jones. Probabilistic Non-determinism. PhD thesis, Edinburgh Univ., 1989.
[94] P. Joyce. Partition structures and sufficient statistics. Journ. of Applied Proba-
bility, 35(3):622–632, 1998.
[95] A. Jung and R. Tix. The troublesome probabilistic powerdomain. In A. Edalat,
A. Jung, K. Keimel, and M. Kwiatkowska, editors, Comprox III, Third Workshop
on Computation and Approximation, number 13 in Elect. Notes in Theor. Comp.
Sci., pages 70–91. Elsevier, Amsterdam, 1998.
[96] D. Jurafsky and J. Martin. Speech and language processing. Third Edition draft,
available at https://web.stanford.edu/~jurafsky/slp3/, 2018.
[97] E. Kamenica and M. Gentzkow. Bayesian persuasion. American Economic Re-
view, 101(6):2590–2615, 2011.
[98] S. Karlin. Mathematical Methods and Theory in Games, Programming, and
Economics. Vol I: Matrix Games, Programming, and Mathematical Economics.
Addison-Wesley, 1959.
[99] K. Keimel and G. Plotkin. Mixed powerdomains for probability and
nondeterminism. Logical Methods in Comp. Sci., 13(1), 2017.
See https://lmcs.episciences.org/2665.
[100] J. Kingman. Random partitions in population genetics. Proc. Royal Soc., Series
A, 361:1–20, 1978.
[101] J. Kingman. The representation of partition structures. Journ. London Math.
Soc., 18(2):374–380, 1978.
[102] A. Kock. Closed categories generated by commutative monads. Journ. Austr.
Math. Soc., XII:405–424, 1971.
[103] A. Kock. Monads for which structures are adjoint to units. Journ. of Pure &
Appl. Algebra, 104:41–59, 1995.
[104] A. Kock. Commutative monads as a theory of distributions. Theory and Appl.
of Categories, 26(4):97–131, 2012.
[105] D. Koller and N. Friedman. Probabilistic Graphical Models. Principles and
Techniques. MIT Press, Cambridge, MA, 2009.
[106] D. Kozen. Semantics of probabilistic programs. Journ. Comp. Syst. Sci,
22(3):328–350, 1981.
[107] D. Kozen. A probabilistic PDL. Journ. Comp. Syst. Sci, 30(2):162–178, 1985.
[108] P. Lau, T. Koo, and C. Wu. Spatial distribution of tourism activities: A Pólya urn
process model of rank-size distribution. Journ. of Travel Research, 59(2):231–
246, 2020.
[109] S. Lauritzen. Graphical models. Oxford Univ. Press, Oxford, 1996.
[110] S. Lauritzen and D. Spiegelhalter. Local computations with probabilities on
graphical structures and their application to expert systems. Journ. Royal Statis-
tical Soc., 50(2):157–224, 1988.
[111] F. Lawvere. The category of probabilistic mappings. Unpublished manuscript,
see ncatlab.org/nlab/files/lawvereprobability1962.pdf, 1962.
[112] P. Lax. Linear Algebra and Its Applications. John Wiley & Sons, 2nd edition,
2007.
[113] T. Leinster. Basic Category Theory. Cambridge Studies in Advanced Mathe-
matics. Cambridge Univ. Press, 2014. Available online via
arxiv.org/abs/1612.09375.
[114] L. Libkin and L. Wong. Some properties of query languages for bags. In
C. Beeri, A. Ohori, and D. Shasha, editors, Database Programming Languages,
Workshops in Computing, pages 97–114. Springer, Berlin, 1993.
[115] H. Mahmoud. Pólya Urn Models. Chapman and Hall, 2008.
[116] E. Manes. Algebraic Theories. Springer, Berlin, 1974.
[117] R. Mardare, P. Panangaden, and G. Plotkin. Quantitative algebraic reasoning. In
Logic in Computer Science. IEEE, Computer Science Press, 2016.
[118] S. Mac Lane. Categories for the Working Mathematician. Springer, Berlin, 1971.
[119] S. Mac Lane. Mathematics: Form and Function. Springer, Berlin, 1986.
[120] A. McIver and C. Morgan. Abstraction, refinement and proof for probabilistic
systems. Monographs in Comp. Sci. Springer, 2004.
[121] A. McIver, C. Morgan, and T. Rabehaja. Abstract hidden Markov models: A
monadic account of quantitative information flow. In Logic in Computer Science,
pages 597–608. IEEE, Computer Science Press, 2015.
[122] A. McIver, C. Morgan, G. Smith, B. Espinoza, and L. Meinicke. Abstract chan-
nels and their robust information-leakage ordering. In M. Abadi and S. Kremer,
editors, Princ. of Security and Trust, number 8414 in Lect. Notes Comp. Sci.,
pages 83–102. Springer, Berlin, 2014.
[123] S. Milius, D. Pattinson, and L. Schröder. Generic trace semantics and graded
monads. In L. Moss and P. Sobocinski, editors, Conference on Algebra and
Coalgebra in Computer Science (CALCO 2015), volume 35 of LIPIcs, pages
253–269. Schloss Dagstuhl, 2015.
[124] H. Minc. Nonnegative Matrices. John Wiley & Sons, 1998.
[125] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[126] E. Moggi. Notions of computation and monads. Information & Computation,
93(1):55–92, 1991.
[127] R. Nagel. Order unit and base norm spaces. In A. Hartkämper and H. Neu-
mann, editors, Foundations of Quantum Mechanics and Ordered Linear Spaces,
number 29 in Lect. Notes Physics, pages 23–29. Springer, Berlin, 1974.
[128] M. Nielsen and I. Chuang. Quantum Computation and Quantum Information.
Cambridge Univ. Press, 2000.
[129] F. Olmedo, F. Gretz, B. Lucien Kaminski, J.-P. Katoen, and A. McIver.
Conditioning in probabilistic programming. ACM Trans. on Prog. Lang. & Syst.,
40(1):4:1–4:50, 2018.
[130] M. Ozawa. Quantum measuring processes of continuous observables. Journ.
Math. Physics, 25:79–87, 1984.
[131] P. Panangaden. The category of Markov kernels. In C. Baier, M. Huth,
M. Kwiatkowska, and M. Ryan, editors, Workshop on Probabilistic Methods in
Verification, number 21 in Elect. Notes in Theor. Comp. Sci., pages 171–187.
Elsevier, Amsterdam, 1998.
[132] V. Paulsen and M. Tomforde. Vector spaces with an order unit. Indiana Univ.
Math. Journ., 58-3:1319–1359, 2009.
[133] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, 1988.
[134] J. Pearl. Probabilistic semantics for nonmonotonic reasoning: A survey. In
R. Brachman, H. Levesque, and R. Reiter, editors, First Intl. Conf. on Prin-
ciples of Knowledge Representation and Reasoning, pages 505–516. Morgan
Kaufmann, 1989.
[135] J. Pearl. Jeffrey’s rule, passage of experience, and neo-Bayesianism. In H. E.
Kyburg Jr., editor, Knowledge Representation and Defeasible Reasoning, pages
245–265. Kluwer Acad. Publishers, 1990.
[136] J. Pearl. Causality. Models, Reasoning, and Inference. Cambridge Univ. Press,
2nd edition, 2009.
[137] J. Pearl and D. Mackenzie. The Book of Why. Penguin Books, 2019.
[138] B. Pierce. Basic Category Theory for Computer Scientists. MIT Press, Cam-
bridge, MA, 1991.
[139] H. Pishro-Nik. Introduction to probability, statistics, and random processes.
Kappa Research LLC, 2014. Available at https://www.probabilitycourse.com.
[140] D. Poole and A. Mackworth. Artificial Intelligence. Foundations of Computa-
tional Agents. Cambridge Univ. Press, 2nd edition, 2017. Publicly available via
https://www.cs.ubc.ca/~poole/aibook/2e/html/ArtInt2e.html.
[141] S. Pulmannová and S. Gudder. Representation theorem for convex effect alge-
bras. Commentationes Mathematicae Universitatis Carolinae, 39(4):645–659,
1998.
[142] J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[143] R. Rao and D. Ballard. Predictive coding in the visual cortex: a functional in-
terpretation of some extra-classical receptive-field effects. Nature Neuroscience,
2:79–87, 1999.
[144] S. Ross. Introduction to Probability Models. Academic Press, 9th edition, 2007.
[145] S. Ross. A first course in probability. Pearson Education, 10th edition, 2018.
[146] S. Russell and P. Norvig. Artificial Intelligence. A Modern Approach. Prentice
Hall, Englewood Cliffs, NJ, 2003.
[147] A. Ścibior, O. Kammar, M. Vákár, S. Staton, H. Yang, Y. Cai, K. Ostermann,
S. Moss, C. Heunen, and Z. Ghahramani. Denotational validation of higher-
order Bayesian inference. In Princ. of Programming Languages, pages 60:1–
60:29. ACM Press, 2018.
[148] P. Scozzafava. Uniform distribution and sum modulo m of independent random
variables. Statistics & Probability Letters, 18(4):313–314, 1993.
[149] P. Selinger. A survey of graphical languages for monoidal categories. In B. Co-
ecke, editor, New Structures in Physics, number 813 in Lect. Notes Physics,
pages 289–355. Springer, Berlin, 2011.
[150] S. Selvin. A problem in probability (letter to the editor). Amer. Statistician,
29(1):67, 1975.
[151] S. Sloman. Causal Models. How People Think about the World and Its Alterna-
tives. Oxford Univ. Press, 2005.
[152] A. Sokolova. Probabilistic systems coalgebraically: A survey. Theor. Comp. Sci.,
412(38):5095–5110, 2011.
[153] S. Staton. Probabilistic programs as measures. In G. Barthe, J.-P. Katoen, and
A. Silva, editors, Foundations of Probabilistic Programming, pages 43–74. Cam-
bridge Univ. Press, 2021.
[154] S. Staton, H. Yang, C. Heunen, O. Kammar, and F. Wood. Semantics for proba-
bilistic programming: higher-order functions, continuous distributions, and soft
constraints. In Logic in Computer Science. IEEE, Computer Science Press, 2016.
[155] M. Stone. Postulates for the barycentric calculus. Ann. Math., 29:25–30, 1949.
[156] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Morgan and
Claypool, 2011.
[157] Y. Suhov and M. Kelbert. Probability and Statistics by Example: Volume 1, Basic
Probability and Statistics. Cambridge Univ. Press, 2005.
[158] H. Tijms. Understanding Probability: Chance Rules in Everyday Life. Cam-
bridge Univ. Press, 2nd edition, 2007.
[159] A. Tversky and D. Kahneman. Evidential impact of base rates. In D. Kahneman,
P. Slovic, and A. Tversky, editors, Judgement under uncertainty: Heuristics and
biases, pages 153–160. Cambridge Univ. Press, 1982.
[160] M. Valtorta, Y.-G. Kim, and J. Vomlel. Soft evidential update for probabilistic
multiagent systems. Int. Journ. of Approximate Reasoning, 29(1):71–106, 2002.
[161] D. Varacca. Probability, Nondeterminism and Concurrency: Two Denotational
Models for Probabilistic Computation. PhD thesis, Univ. Aarhus, 2003. BRICS
Dissertation Series, DS-03-14.
[162] D. Varacca and G. Winskel. Distributing probability over non-determinism.
Math. Struct. in Comp. Sci., 16:87–113, 2006.
[163] T. Verma and J. Pearl. Causal networks: semantics and expressiveness. In
R. Shachter, T. Levitt, L. Kanal, and J. Lemmer, editors, Uncertainty in Artif.
Intelligence, pages 69–78. North-Holland, 1988.
[164] C. Villani. Optimal Transport — Old and New. Springer, Berlin Heidelberg,
2009.
[165] I. Witten, E. Frank, and M. Hall. Data Mining – Practical Machine Learning
Tools and Techniques. Elsevier, Amsterdam, 2011.