Quantum Mechanics and Bayesian Machines

George Chapline
Lawrence Livermore National Laboratory, USA

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data


Names: Chapline, George, author.
Title: Quantum mechanics and Bayesian machines / George Chapline,
Lawrence Livermore National Laboratory, USA.
Description: New Jersey : World Scientific Publishing Co. Pte. Ltd., [2023] |
Includes bibliographical references and index.
Identifiers: LCCN 2022042480 | ISBN 9789813232464 (hardcover) |
ISBN 9789813232471 (ebook for institutions) | ISBN 9789813232488 (ebook for individuals)
Subjects: LCSH: Quantum Bayesianism. | Quantum theory.
Classification: LCC QC174.17.Q29 C43 2023 | DDC 530.12--dc23/eng20230111
LC record available at https://lccn.loc.gov/2022042480

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

Copyright © 2023 by World Scientific Publishing Co. Pte. Ltd.


All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.

For any available supplementary material, please visit


https://www.worldscientific.com/worldscibooks/10.1142/10775#t=suppl

Desk Editors: Logeshwaran Arumugam/Steven Patt

Typeset by Stallion Press


Email: [email protected]

Printed in Singapore
Preface

Although digital computation has enjoyed enormous successes, there
remain many problems of interest where real time solutions remain
out of reach. Prominent among the problems that have largely
remained beyond the state of the art for digital computation are
machine learning problems such as pattern recognition or decision-
making in situations where the interpretation of observational data
is ambiguous. It is somewhat embarrassing in this connection that the
mammalian brain can often resolve ambiguities in the interpretation
of sensory data in real time with a footprint that is dramatically
smaller than the footprint of the large-scale computers that are typ-
ically used for automated data analysis or reinforcement learning.
A pregnant question for the future of data science is what are the
mathematical and engineering principles that underlie the data anal-
ysis capabilities of the mammalian brain. Along these lines, there is
at the present time broad interest in whether emulation of the data
analysis capabilities of the mammalian brain would benefit from the
development of quantum information processing. In this book we do
not promise a definitive answer to this question, but instead will
review some of the potential benefits of using quantum informa-
tion processing. Our primary motivation in writing this book is to
bring attention to the remarkable circumstance that in many respects
quantum mechanics provides a natural framework for Bayesian
analysis — which, apart from the benefits that a quantum perspec-
tive may provide for using Bayesian methods to solve practical data
analysis problems, may also shed light on the remarkable capabilities
of mammals to evaluate risks and elect survival strategies.

What follows is essentially an elaboration of the author’s 2001
Philosophical Magazine piece ‘Quantum mechanics as self-organized
information fusion’, 2005 International Journal of Quantum Infor-
mation paper ‘Quantum Mechanics and Pattern Recognition’, and
2007 contribution to the Berni Alder Festschrift Quantum Mechanics
and Machine Learning. This book updates these presentations in the
sense that it refines our previous notion of quantum self-organization
by emphasizing the connection between optimal control/RL and solu-
tions of integrable nonlinear partial differential equations. Our most
important result is that these equations provide a basis for describing
the reward function for a wide range of problems, including exam-
ples of reinforcement learning, which are very difficult to solve with
conventional computational resources. These integrable systems also
underscore the paramount importance of spaces of holomorphic func-
tions of a complex number lying on a Riemann surface for repre-
senting Bayesian learning. The emergence of Riemann surfaces as
an essential ingredient underscores a profound connection between
cognitive science, pure mathematics, and theoretical physics.
The evolution in our views from what was presented in our original
papers has especially benefited from the insights found in the papers
and books of Norbert Wiener, David MacKay, Thomas Kailath,
and James Rose. In addition, the 1967 Tokyo lectures of Michio
Kuga and 1997 Oxford lectures of Graeme Segal have been very
inspirational. As a result, we are now much better able to explain
why quantum theory is essential to understanding the mathemati-
cal principles involved in applying Bayes’s formula. For the future,
we hope that these insights will lead to better methods for solving
practical data analysis problems.
No knowledge will be assumed on the part of the reader with
respect to the current state of the art for machine learning, quan-
tum computing, or computer-based artificial intelligence. Instead, our
focus will be on the underlying mathematical relationship between
quantum mechanics and Bayesian inference. Our hope is that, since
our description of the mathematical foundations of Bayesian learning
has its own charm, students and researchers will, with the help of
the references we cite, gain some new insights into
why quantum devices may be useful for data analysis and artificial
intelligence.
Our presentation assumes that the reader has some familiar-
ity with the basics of probability theory and its connection with
information theory; for example, as outlined in MacKay’s outstanding
Information Theory, Inference, and Learning Algorithms. Some
familiarity with feedback control at the level of Astrom and Murray’s
or Anderson and Moore’s treatises would also be very helpful. In
addition, some exposure to quantum mechanics at the level of David
Saxon’s undergraduate textbook, or better yet Feynman and Hibbs’s
Quantum Mechanics and Path Integrals should be regarded as a pre-
requisite for this book. Hermann Weyl’s Group Theory and Quantum
Mechanics provides the definitive introduction to why the math-
ematical theory of groups and quantum mechanics are intimately
intertwined.
Previous exposure to “quantum computing” is not required. For
the most part our approach to quantum Bayesian learning is quite
different from the approaches to information processing that can be
found in the literature on quantum computing. On the other hand,
occasional perusal of Nielsen and Chuang’s Quantum Computation
and Quantum Information may prove helpful. By and large we will
reserve the term “machine learning” to mean the extensive use of
multi-layer “deep neural networks” (DNNs) for pattern recognition
and reinforcement learning. However, this book does not assume that
the reader is familiar with DNNs or their applications. Instead, our
intent is to focus on the relationship between Bayesian learning and
quantum mechanics.
Additional background for our presentation can be found in the
references cited in the text. The references cited in the text are not
intended by any measure to be an exhaustive listing of all the rele-
vant literature; but instead for the most part are simply the books
and papers that the author has found to be helpful. It is assumed
that the reader is already familiar with linear algebra and matrices,
and elementary methods for solving differential equations. A prob-
lematic aspect of our presentation for the general reader though is its
extensive dependence on sophisticated mathematical results from the
theory of analytic functions of complex number variables, algebraic
geometry, and the theory of groups. The reader may have to spend
some time “coming up to speed” with these topics in order to fully
understand our presentation. As for the theory of analytic functions
of a complex number variable, it would be very helpful if the reader
obtained one of the standard textbooks on the theory of functions of
a complex variable from Amazon. (The author is particularly fond
of Copson’s 1935 textbook.) For the most part, physicists and data
scientists regard the theory of functions of a complex variable as terra
incognita. Ironically, this area of mathematics is usually not ignored
in engineering schools, so engineers may find our presentation more
accessible.
Throughout our presentation the acronym GP will mean a vector
with many components which are independent, identically
distributed (iid) Gaussian random variables. The acronyms DNN,
HPC, ML, MDL, MDP, NLS, ODLRO, PDE, RL, TI, and TSP will
stand for deep neural network, high performance computing, maximum
likelihood, minimum description length, Markov decision process,
nonlinear Schrodinger, off-diagonal long-range order, partial
differential equation, reinforcement learning, topological insulator, and
traveling salesman problem. Throughout we reserve the
notation V(x) to mean the Bellman cost (or value) function, while
the notations v(x) or q(x) will be reserved to mean the potential
in a Schrodinger or Dirac equation. The initials BFS, GLM, HJB,
KdV, KL, RH, and TO stand for Bargmann–Fock–(Irving)Segal,
Gelfand–Levitan–Marčenko, Hamilton–Jacobi–Bellman,
Korteweg–de Vries, Kullback–Leibler, Riemann–Hilbert, and Togdan–Olver.
We depart from the standard terminology used in the mathematics
literature in one important respect: we refer to the kinematical
group for quantum mechanics as the Weyl–Heisenberg group rather
than the Heisenberg group, because it was Hermann Weyl, inspired by
the work of Pascual Jordan, who discovered this group.
About the Author

George Chapline’s scientific career began at age 15, when he wrote
a letter to Richard Feynman regarding the incompatibility of quan-
tum mechanics and general relativity. The author’s assertion – that
these two theories are incompatible because quantum mechanics is
intrinsically non-local, while general relativity is local – remains to
this day the definitive statement of this unsolved problem. In 2000,
Chapline proposed, in collaboration with Robert Laughlin and his
students, an explanation of how classical space-time can
emerge from quantum gravity, based on the concept of quantum
self-organization. This in turn has led to new perspectives in astro-
physics; e.g. the hypothesis that dark matter consists of primordial
black holes. Chapline graduated with a BA in Mathematics from
UCLA in 1961 and a PhD in physics from Cal Tech in 1966. He was
assistant professor of physics at the University of California at Santa
Cruz from 1966 to 1969. He has been a staff member of the Lawrence
Livermore Laboratory since 1970. In 1984, Chapline won the E. O.
Lawrence award for directing the experimental team that demon-
strated the world’s first X-ray laser. Since 2000 Chapline’s research
has been primarily focused on applying quantum self-organization to
Bayesian inference.

Acknowledgments

The author is especially grateful to Michael Schneider for sharing his
extensive knowledge of Gaussian processes and Fredholm integral
equations, and to James Rose for bringing his optimization principle
for inverse scattering to the author’s attention. The author would also like to
thank Jim Barbieri for discussions regarding analog neural computing,
Gennady Berman, Jonathan Dubois, and Mathew Otten for
discussions regarding quantum computing, Dongxia Qu for sharing her
understanding of topological insulators, and Bradan Soper for shar-
ing his knowledge of game theory. Finally, the author is grateful to
Steve Libby for discussions regarding all aspects of quantum physics.

Contents

Preface
About the Author
Acknowledgments

1. Introduction

2. Six Fundamental Discoveries
2.1. Bayes’s Probability Formula
2.2. The Wiener and Kalman–Bucy Filters
2.3. Bellman’s Dynamic Programming Approach to Optimal Control
2.4. Feynman’s Path Integral Approach to Quantum Mechanics
2.5. Quantum Solution of the Traveling Salesman Problem (TSP)

3. Ockham’s Razor
3.1. Bayesian Searches
3.2. A Tale of Two Costs
3.3. Hidden Factors and the Helmholtz Machine

4. Control Theory
4.1. The Hamilton–Jacobi–Bellman Equation
4.2. Pontryagin Maximum Principle
4.3. Lie–Poisson Dynamics
4.4. H∞ Control

5. Integrable Systems
5.1. RH Solution of the Airy Equation
5.2. The KdV Equation
5.3. Segal–Wilson Construction
5.4. The NLS Equation
5.5. Galois Remembered

6. Quantum Tools
6.1. Weyl Remembered
6.2. Helstrom’s Theorem and Universal Hilbert Spaces
6.3. Measurement-based Quantum Computation

7. Quantum Self-organization
7.1. Pontryagin Control and Quantum Criticality
7.2. Quantum Theory of Innovations
7.3. Quantum Helmholtz Machine
7.4. Ad Mammalian Intelligence

8. Holistic Computing
8.1. Quantum Mechanics and 3D Geometry
8.2. Cognitive Science and Quantum Physics

Appendices
A. Gaussian Processes
B. Wiener–Hopf Methods
C. Riemann Surfaces
D. The Eightfold Way
E. Quantum Theory of Brownian Motion

References
Index
Chapter 1

Introduction

One of the most important challenges for modern computer science
is how to emulate the remarkable cognitive capabilities of the mam-
malian brain; particularly in situations where there is a necessity for
rapidly selecting among multiple possible interpretations for sensory
data (cf. Fig. 1.1). At first sight the “machine learning” techniques
that are currently used to emulate the information processing capa-
bilities of the mammalian brain would appear to lack any coherent
mathematical foundation. It has long been thought [1] that “deep”
artificial neural networks (DNNs), which can be either determinis-
tic or probabilistic, might provide a pathway to emulating the cog-
nitive capabilities of the mammalian brain. However, it turns out
that these techniques are better suited to interpolation than extrap-
olation. Moreover, the Herculean efforts typically required to train
neural networks for complex reinforcement learning (RL) [2] or feed-
back control [3] problems have given pause to the assumption that
deep neural networks (DNNs) are the best path forward. An alterna-
tive view is that probability theory [4–6], and Bayes’s theorem [5–9],
provide the best framework for automating pattern recognition and
decision-making.

Fig. 1.1. Natural Bayesian machines.

The point of departure for this book is that, apart from the
possibility [10] that the mammalian brain can estimate Bayesian
probabilities, the only other circumstances in which mathematical
probabilities spontaneously appear in nature are physical phenomena
where quantum effects are important. Therefore, it is natural to imagine [11] that
the apparent ability of the mammalian brain to construct models of
Bayesian inference is somehow related to quantum theory, and this
book can be thought of as an updated version of Ref. [11]. Our belief
in the potential importance of quantum theory for understanding
mammalian cognition is abetted by Teuvo Kohonen’s discovery [13]
of the importance of self-organizing maps for understanding the orga-
nization of sensory neurons in the cerebral cortex of the mammalian
brain. However, we want to make clear from the outset that we
do not subscribe to the notion that the mammalian cerebral cortex
is a “quantum computer” in the sense that this term is commonly
used [14]. Kohonen’s maps involve holomorphic functions of
a complex variable [13] that are very similar to those that
appear in analytic solutions of the 2D Schrodinger equation [15].
(Holomorphic functions are complex-differentiable functions of a complex
variable; they have no singularities in their domain, although they may
have zeros.) Focusing on the importance of
holomorphic functions is a central theme of our presentation, and is
what allows us to tie together Bayes’s formula, integrable dynamics,
and mammalian cognition.
Putting aside the question as to how the mammalian brain actu-
ally works, we will adopt the point of view that the similarity of
the space-time structure of quantum dynamics and Markov decision
processes (MDPs) [16,17] is an important clue that should not be
ignored. At center stage here are the observations of Aharonov et al.
[18] regarding the relationship of past and future measurements in
quantum theory. Historically, it was first observed by Schrodinger
himself in the 1930s [15] that MDPs can be written in a way that
combines forward and backward propagating processes. Remarkably
both Bayes’s formula and Todorov’s extension of Bayes’s formula
[19] to include control processes [3,19,20] also involve forward and
backward MDPs.
The foundation for our presentation is Bayes’s epochal formula
for conditional probabilities [5,6], which provides a rational basis for
data analysis and decision-making in virtually all contexts. In poetic
language, Bayes’s formula can be written:

$$P(\theta \mid d, \alpha) = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}, \tag{1.1}$$

where P(θ|d, α) is the probability that the parameters θ associated
with a particular data model α, chosen from a set {α} of possible data
models, are correct given a set {d} of input data, the Likelihood
P(d|α, θ), and the Prior P(θ|α). The Prior is the a priori probability that a particular
explanation (the parameters θ of model α) is correct, and the Evidence is the probability of the
input data summed over all possible explanations. Although Bayes
put forward his formula in the 18th century, widespread appreciation
of the great scientific and practical significance of Bayes’s formula did
not arrive until late in the 20th century. Fortunately, the theoretical
importance of Bayes’s formula for data analysis is now widely recog-
nized. Indeed, it is now widely accepted [5–8] that formula (1.1) pro-
vides the basic framework for solving virtually all problems involving
understanding sensory data and observation-based decision-making.
In the following, we will refer to the enterprise of evaluating the con-
ditional probabilities on the r.h.s of Eq. (1.1) for the purposes of
solving these types of problems as “Bayesian inference”.
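
As a concrete illustration of Eq. (1.1), the following minimal Python sketch evaluates the posterior on a discrete grid of parameter values for a single data model. The coin-tossing model, the grid of θ values, the uniform Prior, and the function name posterior_on_grid are all assumptions introduced here for illustration; none of them appears in the text.

```python
from math import comb


def posterior_on_grid(thetas, prior, heads, tosses):
    """Evaluate Eq. (1.1) on a grid: posterior = likelihood * prior / evidence.

    Hypothetical data model: `tosses` independent coin flips with bias theta.
    """
    # Likelihood P(d | alpha, theta): binomial probability of the observed count.
    likelihood = [comb(tosses, heads) * t**heads * (1.0 - t) ** (tosses - heads)
                  for t in thetas]
    # Numerator of Eq. (1.1) at each grid point.
    joint = [l * p for l, p in zip(likelihood, prior)]
    # Evidence: probability of the data summed over all grid points.
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    return posterior, evidence


# Assumed example: 7 heads in 10 tosses, uniform Prior over 11 grid values.
thetas = [i / 10 for i in range(11)]
prior = [1.0 / len(thetas)] * len(thetas)
posterior, evidence = posterior_on_grid(thetas, prior, heads=7, tosses=10)
best = thetas[posterior.index(max(posterior))]
print(f"evidence = {evidence:.4f}, most probable theta = {best}")
```

The Evidence here is just the normalizing sum over the grid. For a model with many parameters the same sum runs over a grid whose size grows exponentially with the number of parameters, which is one simple way to see why all three factors are expensive to evaluate.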
Unfortunately, even though the fundamental importance of
Bayes’s formula is now widely appreciated, the use of Eq. (1.1) toward
obtaining accurate estimates for P (θ|d, α) in all circumstances of
interest has remained limited. This is especially true, for example, in
situations where there are hidden factors [7]. As has been emphasized
by MacKay [6], one reason for lack of progress in this direction is that
in general all three of the factors on the r.h.s. of Eq. (1.1) must be
evaluated to obtain a complete understanding of input data. In addi-
tion, the conditional probabilities that appear in the numerator and
denominator of the r.h.s of Eq. (1.1) cannot in general be evaluated
in real time with state-of-the-art computational algorithms.
In the 19th century, Maxwell had emphasized the importance of
probability theory for understanding physical phenomena [23], but
folklore attributes to Helmholtz the notion that the human brain can
estimate the probabilities for various alternative explanations for sen-
sory observations. As it happens, Helmholtz’s suggestion regarding
the ability of the mammalian brain to estimate probabilities lay dor-
mant for more than a century. In contrast with theoretical physics,
where there are validated models for almost all physical phenomena
of any practical importance, the mathematical principles underlying
the cognitive capabilities of the mammalian brain have remained elu-
sive. Observations of the natural behavior of mammals in the wild
(cf. Fig. 1.1) do suggest that the mammalian brain can construct
reinforcement learning-like strategies for dealing with adversity in
the real world. One of our primary motivations for writing this book
is to explain the mathematical principles underlying the observed
abilities of mammals to evaluate Bayesian probabilities.
For a glimmer of hope that improved understanding of how the
mammalian brain evaluates Bayes’s probabilities might indeed lead
to better methods for artificial data analysis, the data science com-
munity is indebted to Harvard mathematician David Mumford [24]
for having put forward the plausible hypothesis that the intellectual
capabilities of the mammalian brain evolved in such a way as to
endow the mammalian brain with the capability to construct mod-
els for the world which are the simplest from the perspective of the
amount of information needed to describe the model. Mumford’s sug-
gestion is similar to the suggestions of Kolmogorov and Chaitin [25]
that the value of any approach to data analysis can be assayed by
the algorithmic complexity of the computer program needed to implement
the model. The phrase “minimum description length” (MDL)
for these ideas was introduced by Rissanen [26]. As it happens, no
practical method for implementing the MDL principle with existing
computational resources has yet been found; however, the Helmholtz
machine described by Dayan, Hinton, Neal, and Zemel [8,9] is con-
structed around the conceptually elegant idea that the information
cost for data analysis can also be interpreted as the free energy of a
physical spin system.
A fundamental result in classical statistical physics is that a
Maxwell–Boltzmann distribution for the relative occupations of the
energy levels for a physical system in thermal equilibrium is a conse-
quence of minimizing the non-equilibrium free energy of the system.
Hinton et al. point out that Bayes’s formula, Eq. (1.1), can then
be interpreted as a Boltzmann-like distribution for configurations
of the Helmholtz machine, where the configuration “energies” are
the negative logarithms of the Likelihood factors in Eq. (1.1). This
is analogous to the fact that in ordinary statistical mechanics this
Boltzmann distribution can also be characterized as the probability
distribution Pα which minimizes the free energy F = E–TS, where
E is the average of the negative log-likelihood and S is the entropy
associated with Pα . The net result is that the Evidence factor in
Eq. (1.1) can also be regarded as the non-equilibrium Helmholtz free
energy F[P(x)] of a “physical” system:

$$\text{Evidence} \equiv \exp\left\{-\sum_{\alpha}\left[P_\alpha E_\alpha - \left(-P_\alpha \log P_\alpha\right)\right]\right\}, \tag{1.2}$$

where Pα is the conditional probability that appears on the l.h.s
of (1.1). The first term on the r.h.s of Eq. (1.2) is the expected
“reward”; i.e. the sum over paths of a negative exponential of an
integral of a function q(x) which encourages the controller to visit
more likely states (cf. [19]). The second term on the r.h.s of Eq. (1.2)
is analogous to the term that represents the contribution of entropy
to the thermodynamic free energy. In the context of classical sta-
tistical mechanics [23] minimizing the free energy can sometimes be
realized by minimizing the expected energy, and at other times real-
ized by maximizing the entropy term. In a similar way in the context
of Bayesian inference when there are multiple data models, the Evi-
dence factor in Eq. (1.1) may in some cases be maximized by choosing
a model whose expected likelihood (the 1st term in the free energy
in Eq. (1.2)) is high, while in other cases one would like to maximize
the entropy associated with the probability distribution Pα for various
models (the 2nd term on the r.h.s of Eq. (1.2)). Hinton et al.
also proposed a specific algorithm, the wake–sleep algorithm [8], for
maximizing the Evidence factor in Eq. (1.1). In practice, this algorithm
amounts to minimizing the Kullback–Leibler (KL) divergence
[27], i.e. the difference between the entropies of the two Helmholtz
machine networks expressed in terms of their respective probability
distributions for their Ising spin degrees of freedom. The KL diver-
gence plays a central role in minimizing the information description
for the recognition and data description networks in the Helmholtz
machine [8] by forcing the information cost of their respective repre-
sentations for the data to be equal.
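
The relationship between Eq. (1.2) and the KL divergence can be checked numerically. The sketch below is a toy calculation under stated assumptions: three hypothetical models with made-up weights (Likelihood × Prior), unit temperature, and "energies" defined as the negative logarithms of those weights (so the Prior is folded into the energy, which is a choice made here rather than something taken from the text). With these choices the free energy evaluated at the Bayesian posterior equals minus the logarithm of the Evidence, and any other trial distribution has a free energy that is larger by exactly its KL divergence from the posterior.

```python
from math import log


def free_energy(p, energies):
    """F[P] = sum_a P_a E_a - S[P], with S[P] = -sum_a P_a log P_a (T = 1)."""
    expected_energy = sum(pa * ea for pa, ea in zip(p, energies))
    entropy = -sum(pa * log(pa) for pa in p if pa > 0.0)
    return expected_energy - entropy


def kl_divergence(p, q):
    """KL(p || q) = sum_a p_a log(p_a / q_a)."""
    return sum(pa * log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)


# Hypothetical unnormalized weights (Likelihood * Prior) for three data models.
weights = [0.02, 0.10, 0.04]
evidence = sum(weights)
energies = [-log(w) for w in weights]       # assumed definition of E_alpha
posterior = [w / evidence for w in weights]  # the l.h.s. of Eq. (1.1)
trial = [1 / 3, 1 / 3, 1 / 3]                # an arbitrary non-optimal P_alpha

f_post = free_energy(posterior, energies)
f_trial = free_energy(trial, energies)
print(f"-log(evidence) = {-log(evidence):.6f}, F[posterior] = {f_post:.6f}")
print(f"F[trial] - F[posterior] = {f_trial - f_post:.6f}, "
      f"KL(trial || posterior) = {kl_divergence(trial, posterior):.6f}")
```

In this toy setting, driving the KL divergence to zero and minimizing the free energy are the same operation, which is the sense in which an algorithm of the wake–sleep type can be read as a free energy minimization.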
The KL divergence also plays a central role in measuring the infor-
mation cost of a Bayesian search for an object in an unknown location
[28], and in Bellman’s dynamic programming formalism [20] for opti-
mal control. A central feature of Bellman’s dynamic programming
formalism was the introduction of a performance index, or “cost func-
tion” V[x(t), u(t)], for quantifying how well the controller for feedback
control or agent for RL is performing in efforts to bring a system or
environment to a desired final state x(T). Although Bellman didn’t
initially phrase it this way, it was eventually realized [28] that, apart
from a “reward function” [30], Bellman’s value function represents
the rate at which, as a result of observations, information is gathered
about the likely results of future observations. As noted by Todorov
[30] and Kappen [31], the evolution of the Bellman function when taking
a small step along a random path can also be represented as the
negative exponential of a “reward function” q(x, t) describing the
likelihood of a state x given a previous or concurrent control action
u. The Bellman function V(t) describing the information cost of mov-
ing from a state x(0) to a state x(T) will then be given by the negative
logarithm of a sum over all paths of an exponential of minus the reward
function q(x, t):

$$e^{-V(T)} = P \exp\left\{-\int_{x(0)}^{x(T)} q[x'(t)]\,dt\right\}. \tag{1.3}$$

The symbol P means averaging over all the possible paths going
from x(0) at t = 0 to x(T) at t = T. The optimal path will be defined
by the path where the integral over q(x, t) is minimized. However,
because of the necessity of exploring many paths in Eq. (1.3), the
change in V (t) along the optimal path will have an additional cost
term, the “control cost”, which encourages the controlled history of
the system to lie near a path that might be attributed to the system
dynamics without controls and can be identified as a KL divergence.
In an earlier paper [19], Todorov showed that the loss rate for the
optimal Bellman cost function can be written as a sum of q(x, u) and
the KL divergence term, which together are minimized with respect to u(x).
This in turn leads to an expression for V (t) as the backward filtering
probability for the state of the system given all previous observations
of the system. By expressing the Bellman function V (t) in terms of a
probabilistic chain of actions (cf. [19]), the sum in Eq. (1.3) can also
be evaluated using Monte Carlo methods [31]. However, this can be
very time-consuming, if not intractable.
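
To make the path average in Eq. (1.3) concrete, the following sketch estimates it by Monte Carlo sampling and compares the result with an exact backward recursion. Everything specific here is an assumption chosen for illustration: a discrete ring of 21 states, a lazy random walk as the uncontrolled dynamics, a cosine-shaped q(x), a free (unconstrained) final state, and the function names. It is a toy version of the kind of calculation discussed in [19,31], not a reproduction of any algorithm from those references.

```python
import math
import random


def exact_desirability(q, p_step, n_steps, dt):
    """Backward recursion for z(x) = E[exp(-sum_t q(x_t) dt)] on a ring of states."""
    n = len(q)
    z = [1.0] * n  # assumed: no terminal cost, so z = 1 at the final time
    for _ in range(n_steps):
        z = [math.exp(-q[x] * dt) * (p_step[0] * z[(x - 1) % n]
                                     + p_step[1] * z[x]
                                     + p_step[2] * z[(x + 1) % n])
             for x in range(n)]
    return z


def monte_carlo_desirability(q, p_step, n_steps, dt, x0, n_paths, rng):
    """Average exp(-accumulated q) over sampled paths of the uncontrolled walk."""
    n = len(q)
    total = 0.0
    for _ in range(n_paths):
        x, cost = x0, 0.0
        # Draw all increments (-1, 0, +1) for this path at once.
        for step in rng.choices((-1, 0, 1), weights=p_step, k=n_steps):
            cost += q[x] * dt
            x = (x + step) % n
        total += math.exp(-cost)
    return total / n_paths


# Assumed toy problem.
n_states, n_steps, dt, x0 = 21, 20, 0.1, 5
q = [0.5 * (1.0 - math.cos(2.0 * math.pi * k / n_states)) for k in range(n_states)]
p_step = (0.25, 0.5, 0.25)  # probabilities of stepping -1, 0, +1

z_exact = exact_desirability(q, p_step, n_steps, dt)[x0]
z_mc = monte_carlo_desirability(q, p_step, n_steps, dt, x0,
                                n_paths=20000, rng=random.Random(0))
print(f"V exact = {-math.log(z_exact):.4f}, V Monte Carlo = {-math.log(z_mc):.4f}")
```

The exact recursion is only available here because the state space is tiny. For the Monte Carlo estimate the statistical error falls off as the inverse square root of the number of sampled paths, which gives a rough sense of why the sampling approach becomes time-consuming for large problems.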
Taking to heart the similarity between the space-time structure
of MDPs and quantum dynamics that was one of our original inspi-
rations [16], one might guess that Eq. (1.3) can be faithfully emu-
lated using Feynman’s “sum over paths” interpretation for quantum
mechanics introduced in his 1942 PhD thesis [32]. In this interpre-
tation of Eq. (1.3), the action function appearing in the exponent of
Feynman’s sum over paths replaces the reward function q(x, u) that
appears in Todorov’s formulation of optimal control [19], while the
control variable u(t) is represented by Dirac’s momentum operator
−i∂/∂x. The r.h.s of Eq. (1.3) would then be replaced by a sum over
quantum paths in which the real exponents on the r.h.s of Eq. (1.3) are
replaced by iq(x), where i = √−1. In this quantum interpretation
of Eq. (1.3) the sum over real negative exponentials is replaced by
Feynman’s sum over quantum paths expression [32] for a quantum
propagator describing the translation along a path x(t) of a
two-component wave function Ψ(x):

$$T(x, y \mid \lambda) = P \exp\left\{-i \int_{x}^{y} \tilde{q}(x', t \mid \lambda)\, dx'(t)\right\}, \tag{1.4}$$

where the P symbol means time ordering of path segments, λ is
a spectral parameter, and q̃(x, t) is a 2 × 2 matrix. (The necessity
for a two-component wave function is discussed in Section 5.5.) The
potential q̃(x, t) will guide the evolution of the wave function in much
the same way as was originally contemplated by Schrodinger for the
phase of his wave function [33]. An optimization principle that will
cause the amplitude in Eq. (1.4) to become focused on a single final
state is known as Rose optimization [34]. The two components are
necessary [35] if one wants to represent the fact that in contrast with
classical mechanics, quantum dynamics can also allow for simultane-
ous forward and backward propagation.
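
The ordered exponential in Eq. (1.4) can be approximated by multiplying together short-segment propagators, with later segments applied on the left. The sketch below does this for an assumed 2 × 2 matrix of the form q̃ = λσ_z + u(x)σ_x, where σ_x and σ_z are Pauli matrices and u(x) is a hypothetical real potential (a sech profile). This specific form, the midpoint discretization, and all function names are illustrative assumptions; the text does not specify q̃ at this point. Because the assumed q̃ is Hermitian and traceless, each segment propagator has a closed form and the product should be unitary with unit determinant.

```python
import math

IDENTITY = ((1 + 0j, 0j), (0j, 1 + 0j))


def matmul2(a, b):
    """Product of two 2x2 complex matrices stored as nested tuples."""
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))


def step_propagator(u, lam, dx):
    """Closed form of exp(-i (u*sigma_x + lam*sigma_z) dx) for real u and lam."""
    a = math.hypot(u, lam)
    if a == 0.0:
        return IDENTITY
    c = math.cos(a * dx)
    s = math.sin(a * dx) / a
    return ((c - 1j * s * lam, -1j * s * u),
            (-1j * s * u, c + 1j * s * lam))


def path_ordered_exponential(u_of_x, lam, x_start, x_end, n_segments):
    """Ordered product of segment propagators; later segments multiply on the left."""
    dx = (x_end - x_start) / n_segments
    t = IDENTITY
    for k in range(n_segments):
        x_mid = x_start + (k + 0.5) * dx  # midpoint rule for u(x) on this segment
        t = matmul2(step_propagator(u_of_x(x_mid), lam, dx), t)
    return t


def sech(x):
    """Hypothetical localized potential used only for this illustration."""
    return 1.0 / math.cosh(x)


T = path_ordered_exponential(sech, lam=0.5, x_start=-5.0, x_end=5.0, n_segments=400)
det_T = T[0][0] * T[1][1] - T[0][1] * T[1][0]
print("T =", T)
print("det T =", det_T)
```

The ordering matters because the segment matrices at different x do not commute when u(x) varies. For a constant potential they do commute, and the ordered product reduces to a single exponential over the whole interval, which is a convenient sanity check on the discretization.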
In the sum of terms corresponding to the r.h.s of (1.4), the contributions
from paths very different from the optimal path will tend to
cancel one another. Thus, in the classical limit the only term in the
sum that should survive is the one where the integral of the reward
function from the initial to the final state is minimized. Unfortunately, it is
not at all straightforward to explicitly construct a classical path from
a quantum path integral. Any attempt to construct a classical path
as the classical limit of Feynman’s path integral immediately runs
into the difficulty that the classical equation of motion
for any system treats both the position and momentum on the same
footing as classical variables, whereas in quantum mechanics these
variables are treated in very different ways. This makes evaluating
the classical limit of a Feynman path integral very tricky. Indeed,
in the 1920s the founders of quantum mechanics were very puzzled
as to why well-known results in classical mechanics could not be
obtained in any obvious way as the limit of Schrodinger’s wave equa-
tion [33] when Planck’s constant of action ℏ was assumed to approach
zero. For example, the problem of how to obtain Kepler’s laws for
planetary motion from Schrodinger’s wave mechanics wasn’t solved
until the late 1980s [37]. The root of this difficulty is that, while the posi-
tion variable is treated as a classical variable in wave mechanics, the
momentum variable is regarded as an operator acting on the wave
function. This difference can allow a control variable represented as
a derivative of the wave function to have large excursions from its
value along an optimal path. This happens, for example, if the pas-
sive dynamics for the system has a classical turning point [71]. One
of the main themes in our presentation will be that this difficulty
can be resolved if Feynman’s original path integral for the quantum
dynamics of a system is modified to allow for frequent measurements
of the control variables, which are typically represented as momen-
tum variables. This is the strategy used in adaptive optics [38] and
“measurement-based quantum computing” [39] to arrive at a spec-
ified final state. The bottom line is that by including a theory of
measurement quantum mechanics does seem to show promise for
evaluating Eq. (1.4).
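The cancellation of contributions far from the optimal path, noted at the start of this discussion, can be illustrated with a one-variable toy problem. The sketch below is an assumption-laden stand-in for a true path integral: a single coordinate x plays the role of a path, the "action" is x²/2, and a Gaussian window of width `sigma` (a hypothetical regulator) keeps the integral convergent. As ℏ decreases, the full integral approaches the contribution of the stationary point x = 0 alone.

```python
import cmath
import numpy as np


def windowed_phase_integral(hbar, sigma=1.0, half_width=8.0, points=400001):
    """Numerically integrate exp(i x^2 / (2 hbar)) * exp(-x^2 / (2 sigma^2)) over x."""
    x = np.linspace(-half_width, half_width, points)
    integrand = np.exp(1j * x**2 / (2 * hbar) - x**2 / (2 * sigma**2))
    return np.sum(integrand) * (x[1] - x[0])


def exact_value(hbar, sigma=1.0):
    """Closed form of the same Gaussian integral, sqrt(pi / a)."""
    a = 1 / (2 * sigma**2) - 1j / (2 * hbar)
    return cmath.sqrt(cmath.pi / a)


def stationary_phase_estimate(hbar):
    """Contribution of the stationary point x = 0 alone: sqrt(2 pi i hbar)."""
    return cmath.sqrt(2j * cmath.pi * hbar)
```

Comparing `stationary_phase_estimate(hbar)` with `exact_value(hbar)` for decreasing `hbar` shows the relative discrepancy shrinking roughly in proportion to ℏ/σ², which is the sense in which only the neighborhood of the optimal "path" survives.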
Quantum theory began to take form at the end of the 19th cen-
tury as a result of Max Planck’s introduction [40] of the quantum of
light in 1900 in connection with the problem of understanding the
spectrum of thermal radiation emerging from an oven. There was
already an appreciation at the time of Planck’s paper that there were
a variety of physical phenomena, e.g. the dependence of the chemical
and spectral properties of atoms on their atomic number, the spec-
trum of thermal radiation, radioactivity, etc. that were refractory
to explanations based on classical physics. However, Planck’s focus
on the problem of understanding the entropy of thermal radiation
turned out to be pivotal for the future of physics. Following Planck’s
unveiling of light quanta, it was soon realized, largely as a result of
the work of Bohr and Sommerfeld [41], that Planck’s discovery had
profound implications for our understanding of atomic matter. Quan-
tum mechanics emerged in the 1920s because of a desire to extend
the Bohr–Sommerfeld quantum theory, which had only been success-
ful for simple (actually “integrable”) physical systems, to all types
of physical systems. As prophesied by Dirac [35], quantum mechan-
ics does in fact appear to provide us with a mathematically con-
sistent framework for understanding all known natural phenomena.
Quantum mechanics made its debut [42] in 1925 with the two simul-
taneous papers of Dirac and of Born, Heisenberg, and Jordan. These
papers provided a foundation for theoretical physics where matri-
ces were used to represent physical quantities. Initially the physical
meaning of this “matrix mechanics” was rather obscure, although
it soon became clear [43] that the epistemological flaw with classi-
cal physics lay with the implicit assumption that the variables used
in classical physics, e.g. the position of a particle or the magnitude
and polarization of an electric field, could — at least in principle —
be simultaneously measured with arbitrary precision. Before 1925 it
had always been imagined that physics should be directly based on
measurable quantities. Heisenberg’s great achievement [43] was his
“uncertainty principle”, which explained that the flaw in classical
physics lay in the tension that always exists between the way exper-
imental measurements are carried out — particularly when atomic
phenomena are involved — and the desire that physics should be
based entirely on physical laws that were completely independent of
the way measurements are carried out. In the 1925 papers of Dirac
and Born, Heisenberg, and Jordan [42] it was proposed that this
tension could be resolved by using matrices which satisfied certain
commutation relations rather than scalar variables to represent phys-
ical quantities such as position and momentum. These commutation
relations were given a precise definition and elevated to a mathe-
matical group by Hermann Weyl [44]. This 3D group, which we will
refer to as the Weyl–Heisenberg group, provides a mathematically
precise definition of quantum kinematics. (In the pure mathematics
literature, this group is referred to as the Heisenberg group, even
though, inspired by the 1925 work of Pascal Jordan [42], this group
was originally introduced by Hermann Weyl. Weyl’s book [44] provides
a nice introduction to this group.)
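A finite-dimensional analogue gives a quick feel for the commutation relations that define this group. The sketch below is an illustrative assumption rather than the representation used in the book: it builds d × d "clock" and "shift" matrices Z and X, which satisfy the Weyl form of the commutation relation ZX = ωXZ with ω = exp(2πi/d). Group elements are then labelled by three integers (a, b, c), mirroring the three parameters of the continuous group. The function names are hypothetical.

```python
import numpy as np


def clock_and_shift(dimension):
    """Return omega and the d x d 'clock' (Z) and 'shift' (X) matrices."""
    omega = np.exp(2j * np.pi / dimension)
    clock = np.diag(omega ** np.arange(dimension))        # Z|k> = omega^k |k>
    shift = np.roll(np.eye(dimension), 1, axis=0)          # X|k> = |k+1 mod d>
    return omega, clock, shift


def group_element(a, b, c, dimension):
    """Element omega^c X^a Z^b, labelled by three integers (a, b, c)."""
    omega, clock, shift = clock_and_shift(dimension)
    return (omega ** c
            * np.linalg.matrix_power(shift, a)
            @ np.linalg.matrix_power(clock, b))
```

Multiplying two such elements gives (X^a Z^b)(X^a′ Z^b′) = ω^(b·a′) X^(a+a′) Z^(b+b′), so the phase label c records the failure of X and Z to commute.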
As it happens though, despite Heisenberg’s extraordinary insight
into the raison d’être for quantum mechanics, the physical mean-
ing of quantum mechanics has remained somewhat enigmatic [45].
This veil of mystery was partially lifted in the spring of 1926, when,
while trying to understand how collisions between atomic particles
can be understood within the framework of the matrix mechanics of
Heisenberg et al., Max Born hit upon the idea [46] that the absolute
square of the wave function which appears in Schrodinger’s wave
equation formulation of quantum mechanics [33] represents a prob-
ability density for the values of the argument of the wave function;
e.g. the position or momentum of a particle. This is surely one of the
most important discoveries in the entire history of science! For our
purposes we will rely on the definitive description of the relationship
of quantum theory and probability theory provided by Feynman and
Hibbs [47].
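Born's interpretation is easy to check numerically for a simple wave function. The following sketch assumes units with ℏ = 1 and a Gaussian wave packet whose width `sigma` and mean wave number `k0` are hypothetical parameters. It treats |ψ(x)|² and the squared modulus of the Fourier transform as probability densities for position and momentum, and computes their normalization, mean, and spread.

```python
import numpy as np


def gaussian_packet(x, sigma=1.0, k0=2.0):
    """Normalized Gaussian wave packet with mean wave number k0."""
    return (np.pi * sigma**2) ** -0.25 * np.exp(-x**2 / (2 * sigma**2) + 1j * k0 * x)


def density_moments(values, density, step):
    """Normalization, mean, and standard deviation of a sampled density."""
    norm = np.sum(density) * step
    mean = np.sum(values * density) * step / norm
    variance = np.sum((values - mean) ** 2 * density) * step / norm
    return norm, mean, np.sqrt(variance)


def position_and_momentum_statistics(n=4096, half_width=40.0, sigma=1.0, k0=2.0):
    """Moments of |psi(x)|^2 and of |phi(k)|^2, with phi the Fourier transform of psi."""
    x = np.linspace(-half_width, half_width, n, endpoint=False)
    dx = x[1] - x[0]
    psi = gaussian_packet(x, sigma, k0)
    k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    dk = 2 * np.pi / (n * dx)
    # The overall phase convention of the transform does not affect |phi|^2.
    phi = np.fft.fft(psi) * dx / np.sqrt(2 * np.pi)
    return (density_moments(x, np.abs(psi) ** 2, dx),
            density_moments(k, np.abs(phi) ** 2, dk))
```

For this packet both densities integrate to one, and the product of the position and momentum spreads comes out at the minimum value 1/2 allowed by the uncertainty principle.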
Although the physical meaning of quantum mechanics remains
somewhat mysterious, there is no doubt that quantum mechanics
has led to an enormously better understanding of Shannon informa-
tion. It is noteworthy in this respect that while the concept of entropy
plays a central role in both practical thermodynamics and informa-
tion theory, no way is known for defining what the absolute entropy
of a physical system means without quantum mechanics. In engi-
neering practice one can follow Carnot [22] and define the entropy
of “working fluids” using the phenomenological specific heat of these
fluids. However, if one follows Boltzmann [23] and tries to define
entropy of a physical system as the number of microscopic states cor-
responding to a macroscopic state, then in almost all cases of interest
one requires quantum mechanics to define the Boltzmann entropy.
This is true for both kitchen ovens and the universe. (The entropy
of the universe is to a very good approximation just the number of
cosmic microwave photons per gram of dark matter.) As was empha-
sized by Planck in his original work on thermal radiation [40], one of
the most satisfying consequences of quantizing the energy levels of a
system is that the absolute entropy of any physical system acquires
a well-defined combinatorial definition. The combinatorial definition
of the entropy of thermal radiation provided by Planck suggests that
quantum theory could be relevant to representing the information
theoretic aspects of Bayesian learning.
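The combinatorial character of Planck's entropy can be sketched in a few lines. The example assumes the standard counting problem associated with Planck's derivation: the number of ways W of distributing P indistinguishable energy quanta among N oscillators, W = (N + P − 1)! / (P! (N − 1)!), with entropy ln W in units of Boltzmann's constant. The function names are hypothetical.

```python
import math


def log_multiplicity(oscillators, quanta):
    """ln W for W = (N + P - 1)! / (P! (N - 1)!), evaluated with log-gamma functions."""
    return (math.lgamma(oscillators + quanta)
            - math.lgamma(quanta + 1)
            - math.lgamma(oscillators))


def stirling_entropy_per_oscillator(mean_quanta):
    """Large-N limit of ln W / N for a mean occupation n = P / N."""
    n = mean_quanta
    return (1 + n) * math.log(1 + n) - n * math.log(n)
```

For large N the exact count and the Stirling limit agree closely, and the entropy per oscillator depends only on the mean number of quanta. It is this counting of discrete states that gives the entropy an absolute rather than a relative meaning.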
At first sight this may appear implausible because the equations
of quantum mechanics, e.g. the Schrodinger wave equation, are by
themselves deterministic, and therefore there is no obvious mech-
anism for representing the gathering of information required for
Bayesian learning. On the other hand, there is an underlying random-
ness associated with the choice of quantum paths in the path integral
formulation of the Schrodinger equation. In addition, including the
measurement process into a quantum description of the dynamics
of a system apparently does offer the possibility of introducing the
randomness represented by the conditional probabilities in Bayes’s
formula. There have been several attempts (see e.g. [48]) to describe
the effects of measurements by modifying the Schrodinger equation
so that it is no longer deterministic. However, there is as yet no
universal agreement as to which of these stochastic extensions of
Schrodinger’s equation would be the canonical best choice. For our
purposes we will follow the ideas of Schwinger [49], Caldeira and
Leggett [50], and Keldysh [51] regarding how to describe relaxation
processes due to measurements within the framework of quantum
mechanics. In particular, we will make use of the double path inte-
gral description of interacting quantum systems due to Feynman and
Vernon [47] (see also Appendix D).
Kappen [31] pointed out that as a function of the level of innova-
tion noise (the difference between observations and the model predictions for
a system) the Bellman dynamic programming equations change from
being deterministic at low noise levels to being explicitly stochastic
at high noise levels. This transition is reflected in the relative con-
tributions of the reward function and KL divergence to the Bellman
function loss rate. At low noise levels, the KL divergence term can be
neglected and the Bellman loss rate will be determined by a reward
function that is independent of the innovation noise. On the high
innovation noise side of Kappen’s transition, the typical objective of
optimal control and RL will be to relax the KL divergence between
the observed controlled dynamics and desired passive dynamics of a
system, which of course has an information theoretic flavor.
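A small discrete example may help fix the idea of a loss that combines a reward term with a KL divergence between controlled and passive dynamics. The sketch below uses one common formulation, assumed here for illustration and not necessarily the one adopted later in the book. The per-step cost is a state cost q(x) plus the KL divergence of the controlled transition probabilities from the passive ones. With z(x) = exp(−V(x)), the Bellman equation then becomes linear: z(x) = exp(−q(x)) Σ p(x′|x) z(x′). The random-walk chain and all function names are hypothetical.

```python
import numpy as np


def passive_random_walk(n_states):
    """Passive dynamics: unbiased walk on a chain whose last state is an absorbing goal."""
    p = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        p[s, max(s - 1, 0)] += 0.5
        p[s, s + 1] += 0.5
    p[-1, -1] = 1.0
    return p


def solve_desirability(p, state_cost, goal, iterations=5000):
    """Iterate z = exp(-q) * (P z), holding z = 1 at the goal state."""
    z = np.ones(len(state_cost))
    for _ in range(iterations):
        z = np.exp(-state_cost) * (p @ z)
        z[goal] = 1.0
    return z


def optimal_transitions(p, z):
    """Controlled dynamics: passive probabilities reweighted by the desirability z."""
    u = p * z[None, :]
    return u / u.sum(axis=1, keepdims=True)


def kl_divergence(u_row, p_row):
    """KL divergence of one row of controlled probabilities from the passive row."""
    mask = u_row > 0
    return float(np.sum(u_row[mask] * np.log(u_row[mask] / p_row[mask])))
```

With V = −ln z, each non-goal state satisfies V(x) = q(x) + KL(u‖p) + Σ u(x′|x) V(x′), and the controlled walk steps toward the goal more often than the passive one does.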
Our quantum approach to stochastic control problems will be
based on the notion that the Helmholtz machine [8,9] provides a
kind of Rosetta Stone for translating Bayesian inference into the lan-
guage of quantum theory. The “Greek to Coptic” part of this Rosetta
Stone is the identification of the controller/agent in a feedback loop as
the recognition network in the Helmholtz machine, and the system/
environment as the unsupervised data model generation network of
the Helmholtz machine. The wake–sleep algorithm [9] reconciles these
forward and backward representations of the data in such a way
that the output of the data generation network relaxes to a good
approximation for the input data. At the same time the wake–sleep
algorithm minimizes the innovation noise seen by either network,
which can also be interpreted as minimizing the Bellman–Isaacs loss
function for a Markov game [53]. The “Coptic to Hieroglyphic” part
of the Rosetta Stone amounts to replacing the probabilistic Ising
spin degrees of freedom in the original Helmholtz machine with
Θ-functions [54–57]. These multi-variable analytic functions play an
important role in the theory of Riemann surfaces. Riemann surfaces
are smooth 2D surfaces that can also be parameterized by a sin-
gle-valued complex number, and turn up in a surprising variety of
contexts in pure mathematics. One way of visualizing Riemann sur-
faces, due to Solomon Lefschetz [54], is that Riemann surfaces can be
represented as the locus of a homogeneous polynomial in a higher
projective space. Mumford later realized [55,56] that these coordi-
nates, known as “Θ-functions with rational characteristics” [54–56],
provide a representation of the Weyl–Heisenberg group [44]. This
identification of Riemann surfaces as “algebraic varieties” (which is
the mathematical term for any manifold that can be described as
the locus of a homogeneous polynomial in a projective space of some
dimension) ushers in quantum mechanics deus ex machina, and pro-
vides a link between the theory of Riemann surfaces, optimal control,
and quantum mechanics.
Perhaps the first person to sense that there is a deep connec-
tion between feedback control and quantum mechanics was Freeman
Dyson. In 1975, Dyson noticed [58] that at low light intensities where
photon noise becomes important, the feedback equations of adaptive
optics are formally identical with the theory of inverse scattering for
the 3D Schrodinger equation. In Dyson’s approach to adaptive optics,
the effect of the atmosphere on a flat 2D wave front is observed by
using a phase sensor that allows for observation of arbitrary 2D cor-
relations between the atmospheric noise in different optical channels.
These equations are a 3D generalization of the equations developed
in the 1950s by Gelfand, Levitan, and Marčenko [59,60] for the pur-
pose of finding the potential of the 1D Schrodinger equation based
on scattering data (see Appendix B). Dyson’s discovery of a con-
nection between adaptive optics in the presence of photon noise and
the 3D Schrodinger equation naturally stimulated interest in why
such seemingly disparate topics are connected, and a full resolution of
this puzzle remains elusive to this day. Our aim, though, is somewhat different
from just understanding scattering solutions of the 3D Schrodinger
equation. As in our original paper [36], we will be focused on regres-
sion between observations and models for an entire control history
which terminates in a desired history.
In the following, we do not claim to prove that translating
probabilistic models for optimal control such as Bellman’s dynamic
programming into the language of quantum mechanics necessar-
ily provides better results than what might be achieved with conven-
tional computational resources. However, we do wish to emphasize
some ab initio advantages that quantum amplitudes enjoy in com-
parison with conventional probabilistic representations for Markov
decision processes (MDPs). One prominent advantage is the elegant
way in which quantum amplitudes can capture causal relationships.
This is very challenging [62] for conventional machine learning tech-
niques, especially in cases where the computational model aspires to
“artificial intelligence” [63]. As was noted by Feynman in his Nobel
Prize winning paper [64] introducing a relativistic quantum theory
of photons interacting with electrons and positrons, there is no nat-
ural way to combine causal and anti-causal influences within the
framework of classical electrodynamics (a footnote in [64]). On the
other hand, in his theory of quantum electrodynamics [64] Feynman
introduced a way of combining forward and backward in time prop-
agation for electrons that takes full advantage of the fact that the
relevant quantum amplitudes can be regarded as smooth functions of
the particle momenta regarded as arbitrary complex numbers. Later,
it was realized [65] that, independent of any underlying Hamiltonian,
general quantum amplitudes for transitions between elementary par-
ticle states can always be regarded as smooth analytic functions
of the momenta of the incoming and outgoing elementary particles
regarded as complex numbers. This behavior is peculiar to quantum
theory and reflects a fundamental equivalence between the analytic
behavior of quantum transition amplitudes as a function of momen-
tum variables and the causal structure of operator commutators in
relativistic quantum field theory [66]. Causality is only the tip of
the iceberg though for the shiny gloss that analytic behavior as a
function of momentum variables provides for quantum amplitudes.
During the time he was a post-doctoral fellow at the Bohr Insti-
tute, Lev Landau discovered a book in the library describing how
the special analytic functions that are important for finding exact
solutions of the Schrodinger equation, especially in two dimensions,
could be expressed as integrals where their extension to an analytic
function of a spectral parameter regarded as a complex number was
transparent. Landau’s loitering in the library at the Bohr Institute
led to his Appendix to Quantum Mechanics [15], which set the stage
for the notion that is now in full bloom that the theory of functions of
a complex number variable and quantum mechanics are deeply inter-
twined (see e.g. [54,55]). What is important for us is that the integral
representations for analytic functions rediscovered by Landau could
also be obtained as solutions of a Riemann–Hilbert (RH) problem
[67–69] (see Appendix B for a discussion of the RH problem). This
type of problem can often be solved analytically using the Cauchy
integral representation for analytic functions of a complex variable
[67]. As shown by Its [68], the Cauchy theorem for analytic functions
also provides a path from the integral representations for special
functions described in Landau’s Appendix to Quantum Mechanics
to exact solutions for certain nonlinear integrable PDEs.
A simple example of the usefulness of the special analytic func-
tions defined by an RH problem is the modified Airy functions (MAFs)
[70], which provide a good approximation to the solution of the 1D
Schrodinger equation for any potential, even one with a classi-
cal turning point [71]. This property is a prerequisite for being able
to use a quantum path integral to define a classical trajectory. As
it happens, the properties of analytic functions like those listed in
Landau’s Appendix [15] also seem to provide the necessary ingre-
dients for a universal underlying structure for feedback control and
RL. A hint that the special analytic functions listed by Landau might
indeed be useful for describing feedback control and RL is provided
by the intimate connection [72] between the KdV equation and the
Kalman filter. More generally, the analytic functions delivered by the
RH method provide a basis for representing both the reward func-
tion and innovation for feedback control and RL problems. Of course,
this leaves open the question as to how one can realize in practice the
minimization of the innovation noise, i.e. minimizing the difference
between the trajectory of observations and the expected
history for a system or environment based on a model.
The great achievement of Graeme Segal (who held the Astronomy
and Geometry chair at the University of Cambridge) and George
Wilson in this connection [73] was to “geometrize” the inverse scat-
tering method [73,75] for solving the KdV equation by constructing a
distinguished space for meromorphic functions defined on a subset of
a Hilbert space consisting of a linear sum of two vector spaces of square-
integrable holomorphic functions: “Hardy spaces” (named after the
Cambridge University mathematician who discovered Ramanujan). Meromor-
phic functions are ratios of holomorphic functions, so their only singu-
larities are isolated poles; holomorphic functions are smooth functions
of a complex variable with no singularities in their domain. With their construction, Segal and Wilson
introduced to data science the importance of basing computational
approaches to stochastic estimation and feedback control on the con-
struction of a special holomorphic function, known as the τ-function,
defined on a canonical Riemann surface. A canonical Riemann sur-
face is a universal feature of integrable dynamics [73,77], and the
recognition of its importance for integrable systems is a momentous
development for data science because the topological obstructions to
conventional Monte Carlo regression [78,79] can be “untangled” if
the labels for observed and model states for a system or environment
can be lifted to curves on a Riemann surface with sufficiently large
genus [80].
Our formal presentation of how quantum mechanics might be used
to emulate Bayesian learning begins in Chapter 2 with a recounting of
six fundamental discoveries which provide the guideposts for our path
to quantum Bayesian learning. Pride of place naturally belongs to
Thomas Bayes’s famous formula, Eq. (1.1), for how the implications
of observations can be captured as posterior probabilities [5]. After
Bayes’s formula perhaps the most notable discovery was Norbert
Wiener’s noise filter [81], which supplanted the least squares regres-
sion method for interpolating data that originated with Gauss and
Legendre [82]. The original least squares method didn’t distinguish
between signal and noise, and therefore didn’t necessarily provide a
good estimate of the underlying “signal”. This deficit was addressed
in Wiener’s wonderful 1942 paper [81] introducing the “Wiener fil-
ter”, which is the seminal spring from which essentially all the ana-
lytical methods we will discuss for data analysis flow. For example, an
immediate dividend of Wiener’s approach to signal analysis was the
very successful Kalman–Bucy model [3,83] for feedback control. The
Kalman–Bucy filter does make some simplistic assumptions, such as
linear time evolution in the absence of control actions, which in prin-
ciple were later removed by the development by Richard Bellman of
his dynamic programming approach to optimal control [20]. Unfor-
tunately, Bellman’s approach is typically very difficult to implement
in practice. In some cases, Bellman’s cost function V (t) can be deter-
mined by numerically solving a linear differential equation (see e.g.
[21]), while in other cases, e.g. complex RL problems, V (t) can only
be determined using machine learning techniques such as deep neu-
ral networks. (Bellman’s cost function is also sometimes referred
to as a “value function”. This nomenclature is awkward in that the
object of optimal control is to minimize the Bellman function.) It
happens that Bellman’s cost function is intimately connected with
information gathering [27,28]; so minimizing the cost also means
choosing actions which maximize the collection rate for information
with respect to an optimal choice of control actions. For our final
choice for a fundamental discovery to be singled out, Chapter 2 con-
cludes with the observation that Feynman’s quantum path integral
[32] can in principle be used to solve the traveling salesman problem.
This observation draws attention to the possibility of using the nat-
ural tendency of quantum paths to fill up all of configuration space to
resolve combinatorially difficult aspects of Bayesian model selection.
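The filtering ideas recounted above can be illustrated with the simplest possible case. The sketch below is a discrete-time scalar Kalman filter for a random-walk state, used here as a stand-in for the continuous-time Kalman–Bucy model. The noise variances and function names are hypothetical. The quantity `innovation` is the difference between the new observation and the model's prediction, which is the sense in which that term is used in this chapter.

```python
import math


def kalman_step(estimate, variance, measurement, process_var, measurement_var):
    """One predict/update cycle for a scalar random-walk state."""
    predicted_variance = variance + process_var            # predict
    innovation = measurement - estimate                    # observation minus prediction
    gain = predicted_variance / (predicted_variance + measurement_var)
    new_estimate = estimate + gain * innovation            # update
    new_variance = (1.0 - gain) * predicted_variance
    return new_estimate, new_variance


def run_filter(measurements, process_var, measurement_var,
               initial_estimate=0.0, initial_variance=1.0):
    """Apply kalman_step to a sequence of measurements."""
    estimate, variance = initial_estimate, initial_variance
    for measurement in measurements:
        estimate, variance = kalman_step(
            estimate, variance, measurement, process_var, measurement_var)
    return estimate, variance


def steady_state_variance(process_var, measurement_var):
    """Positive root of P^2 + Q P - Q R = 0, the fixed point of the variance recursion."""
    q, r = process_var, measurement_var
    return (-q + math.sqrt(q * q + 4.0 * q * r)) / 2.0
```

Unlike a plain least squares fit, the gain weighs each innovation by the relative sizes of the signal and noise variances, and the error variance settles to the fixed point given by `steady_state_variance`.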
Chapter 3 is an introduction to the Mumford–Rissanen Minimum
Description Length (MDL) criterion [26] for selecting from a variety of pos-
sible explanations for a given set of observations, the explanation
that from the point of view of information theory is the most eco-
nomical. This principle is a legacy of William of Ockham, who early
in the 14th century [85] put forward one of the fundamental tenets
of science: that the best explanation for a physical phenomenon is
usually the simplest. It is perhaps counterintuitive that a principle
of physical science should underlie data analysis. However, William’s
principle of minimizing the complexity of the explanation for a set
of observations is at the heart of the notion that Bayes’s formula
provides the logical basis for solving a variety of problems includ-
ing stochastic estimation, Bayesian searches, and feedback control.
The maximum likelihood method [6], which is widely used to solve
these types of problems, short circuits the full use of the Bayes for-
mula by looking only at ratios of the likelihood factor in Eq. (1.1).
However, as has been emphasized by MacKay [6], simply looking at
the likelihood that a model for the data yields a particular set of
observations can lead to serious errors when one must choose the
model from an ensemble of a priori approximately equally plausible
models. Reflecting Mumford’s insight [23] regarding the MDL princi-
ple and mammalian cognition, MacKay’s “Occam’s razor” factor [6] is
possibly the best metric yet proposed for guiding data model selec-
tion. Chapter 3 concludes with a brief account of how the search for
methods for dealing with hidden factors [7,29] led to the Helmholtz
machine [8], which provides a logical framework for how conditional
probabilities which reflect the MDL principle might be computed as
Markov decision chains.
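The difference between comparing likelihoods and comparing full Bayesian evidence shows up already in a coin-tossing example. The sketch below is a hypothetical illustration. Model A asserts a fair coin. Model B allows an unknown bias θ with a uniform prior, so its evidence is the integral of θ^h (1 − θ)^t over θ, which equals h! t! / (h + t + 1)!. The ratio of model B's evidence to its best-fit likelihood plays the role of an "Occam razor" factor penalizing the extra parameter.

```python
import math


def log_evidence_fair_coin(heads, tails):
    """ln P(data | fair coin)."""
    return (heads + tails) * math.log(0.5)


def log_evidence_unknown_bias(heads, tails):
    """ln of the integral of theta^h (1 - theta)^t, i.e. ln[h! t! / (h + t + 1)!]."""
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(heads + tails + 2))


def log_max_likelihood_unknown_bias(heads, tails):
    """ln of the likelihood at the best-fit bias theta = h / (h + t)."""
    n = heads + tails
    total = 0.0
    for count in (heads, tails):
        if count > 0:
            total += count * math.log(count / n)
    return total


def log_occam_factor(heads, tails):
    """ln of (evidence / best-fit likelihood) for the unknown-bias model."""
    return (log_evidence_unknown_bias(heads, tails)
            - log_max_likelihood_unknown_bias(heads, tails))
```

For 11 heads and 9 tails the best-fit likelihood favors the flexible model, as it always will, while the evidence favors the fair coin. Only for strongly lopsided data does the evidence switch to the model with the free parameter.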
Chapter 4 focuses on control theory [3], and in particular on the
deterministic limit of Bellman optimization, known as Pontryagin
control [86,87]. The Pontryagin procedure for realizing the determin-
istic limit of optimal control is somewhat different than the Euler–
Lagrange variational procedure described in textbooks for obtaining
the classical equations of motion by minimizing the Maupertuis
action (see e.g. [88]). The Euler–Lagrange method for obtaining the
equations of motion for classical mechanics differs from the proce-
dure for obtaining the optimal path for feedback control by minimiz-
ing the Bellman cost function in that the Euler–Lagrange method
only demands uniform convergence for the positions’ classical tra-
jectory dx/dt, whereas the Pontryagin limit of Bellman optimiza-
tion demands simultaneous uniform convergence in both the system
variables x(t), dx/dt, and control variables u(t). Understanding the
relationship between Pontryagin control, classical dynamics, and the
classical limit of a Feynman path integral for a quantum system
will be an important part of our presentation. The bottom line is
that while extracting the classical limit of a quantum path integral
can be very tricky, the resultant equations can mimic the Hamil-
tonian dynamics of Pontryagin control. Chapter 4 also introduces
“Lie–Poisson dynamics” [89]; a class of analytically solvable exam-
ples of Pontryagin control which arise from the action of a continuous
group on itself regarded as a smooth manifold. Lie–Poisson dynamics
has some interesting practical applications such as spacecraft control
and illustrates how special analytic functions such as elliptic integrals
can naturally provide solutions for control problems.
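The standard textbook instance of Lie–Poisson dynamics is the free rigid body, which is also the uncontrolled model behind spacecraft attitude problems. The sketch below is an illustration under that assumption, not an example taken from Chapter 4. It integrates Euler's equations for the body-frame angular momentum L, dL/dt = L × (L/I), with a fourth-order Runge–Kutta step, and the principal moments of inertia are hypothetical values. The exact solutions of these equations are Jacobi elliptic functions.

```python
import numpy as np


def momentum_rate(momentum, inertia):
    """Euler's equations in the body frame: dL/dt = L x (L / I)."""
    return np.cross(momentum, momentum / inertia)


def rk4_step(momentum, inertia, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = momentum_rate(momentum, inertia)
    k2 = momentum_rate(momentum + 0.5 * dt * k1, inertia)
    k3 = momentum_rate(momentum + 0.5 * dt * k2, inertia)
    k4 = momentum_rate(momentum + dt * k3, inertia)
    return momentum + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0


def integrate(momentum, inertia, dt=2e-3, steps=5000):
    """Advance the angular momentum through the given number of steps."""
    for _ in range(steps):
        momentum = rk4_step(momentum, inertia, dt)
    return momentum


def energy(momentum, inertia):
    """Rotational kinetic energy, conserved by the exact flow."""
    return 0.5 * np.sum(momentum**2 / inertia)
```

Both the energy and the squared length of L are conserved along the flow, so the trajectory lies on the intersection of an ellipsoid and a sphere. This is why the motion can be written in closed form with elliptic functions.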
Chapter 5 is the centerpiece of our presentation. Our main focus is
on the use of the inverse scattering method to solve nonlinear PDEs
such as the KdV and NLS equations. Of particular interest to us is
the construction of Segal and Wilson [72–73], who were among the
first to connect the problem of finding exact solutions of the KdV
equation with evaluating the Riemann Θ-function, which is a periodic
analytic function on a Riemann surface.
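In the simplest case of a genus-one surface (a torus), the Θ-function reduces to the classical Jacobi theta series, which can be evaluated directly. The sketch below makes that simplifying assumption. It sums θ(z|τ) = Σ exp(iπn²τ + 2πinz) over a truncated range of integers n, with the `cutoff` chosen by hand. The function is exactly periodic under z → z + 1 and periodic up to an explicit factor under z → z + τ.

```python
import cmath


def theta(z, tau, cutoff=30):
    """Truncated series sum_n exp(i pi n^2 tau + 2 i pi n z), valid for Im(tau) > 0."""
    if tau.imag <= 0:
        raise ValueError("tau must lie in the upper half-plane")
    return sum(cmath.exp(1j * cmath.pi * (n * n * tau + 2 * n * z))
               for n in range(-cutoff, cutoff + 1))
```

A quick check of the two periodicity laws, θ(z + 1|τ) = θ(z|τ) and θ(z + τ|τ) = exp(−iπτ − 2πiz) θ(z|τ), confirms that the truncated series behaves as a function on the torus with periods 1 and τ.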
In the Segal–Wilson construction, the input data and data fea-
tures are represented by elements of vector spaces, known as “Hardy
spaces”, of square integrable analytic functions of a complex number
whose domains are, respectively, the inside and outside of a closed
curve on a sphere. Obviously, the simplest version of this setup is
when the closed curve is the equator, and the two Hardy spaces
are related by complex conjugation. The KL divergence will vanish
when the probability distributions for the states constructed from
the two Hardy spaces are identical. This supports our identification
of the Helmholtz machine as a Rosetta stone for translating Bayesian
inference into quantum language, as well as providing new insights
into what mathematical principles underlie the remarkable cognitive
capabilities of the mammalian brain.
One might imagine that analytic solutions of the KdV or NLS
equations are too specific to encompass all feedback control or RL
problems of interest. However, it was realized by David Hilbert at
the beginning of the 20th century that there is a remarkable similar-
ity between Galois’s theory of solvability of algebraic equations [92]
and the role of meromorphic functions, i.e. rational combinations of
holomorphic functions, in finding exact solutions of integrable non-
linear PDEs. This connection with Galois theory transports us from
the world of special analytic function solutions of the KdV and NLS
equations to a world of profound and very general mathematical
objects such as Galois groups. As was realized by Hilbert [93], the
analog of the subgroups of the group of permutations of the roots of
a polynomial that plays a central role in Galois’s theory of solvable
algebraic equations are the subgroups of the group of homeomor-
phisms for a Riemann surface.
Chapter 6 lists some quantum tools which seem to be pertinent for
improving our understanding of Bayesian learning. The most promi-
nent of these tools is the flexibility in choosing a representation for
the Weyl–Heisenberg group [44] that is best suited for the application
at hand. For the feedback control and RL problems that are our pri-
mary focus in this book, it seems that representations whose elements
form a reproducing Hilbert space of holomorphic functions [94–96]
will be of special interest. (Holomorphic functions are smooth ana-
lytic functions of a complex number variable defined in some domain
of the complex plane, with no singularities in that domain.) Among rep-
resentations of the Weyl–Heisenberg group, those involving spaces
of holomorphic functions seem to be of particular importance for
Bayesian learning. For example, Hilbert spaces of square integrable
holomorphic functions, known as Hardy spaces as mentioned above,
play an important role in the work of Trogdon–Olver [69] and Segal–
Wilson [73], and are useful for obtaining exact solutions of the KdV equa-
tion, as well as for H∞ control of 2-dimensional fluids. H∞ control is
an extension [84] of the Kalman filter which is of inherent inter-
est because of its connection with game theory. Wiener [97] drew
attention to eigenstates of “reproducing” kernels as being of spe-
cial importance for machine learning because of their usefulness for
understanding nonlinear systems with noise. The usefulness of quan-
tum kernel spaces for machine learning has also been emphasized by
Schuld et al. [98,99]. One class of reproducing kernel spaces that is
of particular interest to us is the Θ-functions [53–55], whose repro-
ducing kernel is the quantum propagator for a closed string [100].
One important way in which quantum mechanics differs from clas-
sical mechanics is the role played by measurements. The holomorphic
states constituting Hardy spaces (“BFS states”) are not orthogonal,
which means they are not completely distinguishable by “von Neu-
mann projections” [101]. However, Helstrom’s theorem [102] provides
a way of characterizing the information content of any quantum mea-
surement when the measured states are not orthogonal. In a nod
to quantum computing, “measurement-based quantum computing”
[132] is both a mirror and an inspiration for our quantum approach
to Bayesian inference. One feature that our approach to Bayesian
inference shares with measurement-based quantum computing is that
measurements of a momentum variable act as the control variable,
guiding the system to evolve in a desirable way.
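The limit on distinguishing non-orthogonal states can be made quantitative in the two-state case. The sketch below uses the standard Helstrom bound for discriminating two pure states with prior probabilities p and 1 − p: the minimum error probability is ½[1 − √(1 − 4p(1 − p)|⟨ψ₀|ψ₁⟩|²)]. As a hypothetical example of non-orthogonal holomorphic states it takes two coherent states, whose squared overlap is exp(−|α − β|²).

```python
import math


def helstrom_error(overlap_squared, prior=0.5):
    """Minimum error probability for telling apart two pure states."""
    return 0.5 * (1.0 - math.sqrt(1.0 - 4.0 * prior * (1.0 - prior) * overlap_squared))


def coherent_overlap_squared(alpha, beta):
    """|<alpha|beta>|^2 = exp(-|alpha - beta|^2) for two coherent states."""
    return math.exp(-abs(alpha - beta) ** 2)
```

Orthogonal states give zero error, identical states give an error of one half for equal priors, and coherent states interpolate between the two as their separation |α − β| grows.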
Chapter 7 caters to the notion that our quantum path integral
interpretation of the innovation for the traveling salesman problem
(TSP) [36] opens the door to a general quantum theory of optimal
control and RL, and at the same time provides new insights into
the nature of mammalian cognition. As a first step in this direc-
tion, we observe that our path integral solution of the TSP [36] can
be generalized to a quantum version of the Helmholtz machine by
replacing the probabilistic Ising spin degrees of freedom in the origi-
nal Helmholtz machine with quantum string-like degrees of freedom
[52]. In contrast with the TSP, where the itinerary for the
salesman is regarded as fixed, the part of the quantum Helmholtz
machine representing the environment is “flexible”, which requires
introducing methods similar to those used to solve the KdV equa-
tion, e.g. the Riemann–Hilbert method for solving the KdV or NLS
equations. In our quantum version of the Helmholtz machine, the
wake–sleep algorithm which leads to the MDL models for input
data is replaced by the stochastic version of the Feynman–Vernon
[47] “influence functional”. Adapting the wake–sleep algorithm
to our quantum Helmholtz machine then leads to a Fokker–
Planck representation for the evolution of the Bellman function for
observer/controller and system/environment pieces of the Helmholtz
machine. The Gaussian processes associated with this Fokker–Planck
evolution correspond to quantum variations in the shape of a
surface.
It is interesting in this connection how Pontryagin control, which
is the deterministic limit of Bellman’s theory of optimal control,
emerges. Since quantum mechanics involves probabilistic predictions,
one might assume that quantum mechanics and Pontryagin control
would have little in common. However, many-body quantum sys-
tems have the property that under certain circumstances they can
undergo a phase transition to a collective state characterized by a
single space-dependent complex number referred to as the ODLRO
order parameter [103]. In this state quantum fluctuations are sup-
pressed, and the state of the system can be described as a quasi-
classical Eulerian fluid with a well defined density and velocity. The
precise story [104] is that a path integral description of the ther-
mal density matrix for a gas of quantum bosons requires longer and
longer paths as one approaches a quantum critical point where the
speed of sound vanishes and ODLRO appears. After the appearance
of ODLRO the flow of a bosonic fluid becomes very smooth and, in
two dimensions, describable using holomorphic functions. The control
problem then resembles H∞ control [84]. The appearance here of a
path integral description of a thermal density matrix is a replay of
our quantum theory of innovation [36] below the Kappin transition.
Chapter 7 provides a partial answer to the fundamental question
motivating our presentation as to whether quantum theory might
offer some insights into why the cognitive capabilities of the mam-
malian brain are in many respects better than the data analysis
capabilities of state-of-the-art computers. Our penultimate result
is that a quantum version of the wake–sleep algorithm for train-
ing the Helmholtz machine provides a connection between using
the Riemann–Hilbert method to obtain exact solutions of integrable
PDEs in terms of holomorphic functions on a Riemann surface and
determining the Bellman function in feedback control and RL prob-
lems. Viewed through the lens of Kohonen’s self-organizing maps
[12,13], this connection may well be the long sought “explanation”
for the remarkable cognitive capabilities of mammalian species. For
example, comparison of a self-organizing map for the outputs of
somatosensory sensors on a human hand [12,13] with multi-soliton
solutions of the KdV and similar integrable PDEs reveals that there
is an astonishing similarity between the two systems with respect
to the role that topology plays in distinguishing different kinds of
information.
Our general approach to quantum Bayesian learning makes use
of the fact that the Feynman–Keldysh double path integral [47] the-
ory of interacting quantum systems looks a lot like the Helmholtz
machine. If the dynamics of interacting quantum systems can serve
as a model for the recognition and model generation networks of a
Helmholtz machine, our goals in this book will have essentially been
achieved. In particular, if the dynamics of one of the arrays is frozen
in time, then the coupled arrays can serve as a model for feedback
control systems in general (cf. Fig. 2.1), or the Kalman–Bucy
scheme [3]. Among the advantages of such a representation is that it
allows one to use the KdV or NLS equations to determine the time-
dependent reward function by choosing initial and final states for the
Lax equation as the “initial conditions” for the KdV or NLS equation.
The connection of the KdV/Liouville models with Bayesian learning
arises from the map between the two Hilbert spaces of holomorphic
functions (which hereafter we will refer to by their eponymous
name, Hardy spaces).
In Chapter 8, we turn to explore the apparently deep questions
relating quantum mechanics, mathematics, and the ability of the
mammalian cerebral cortex to create holistic 3D representations of its
environment. The lifeblood of science is observation, and in an impor-
tant sense cognitive science is (or ought to be) also about how the
mammalian brain makes sense of observations. One key to under-
standing this is undoubtedly Kohonen’s self-organizing maps [12,13].
Self-organizing maps have the property that they can organize sen-
sory data in such a way that one can understand what the data
mean by “visual inspection”. This is perhaps a realization of the
oft-quoted admonition that a picture is worth a thousand words.
We are attracted to Kohonen self-organization not only because it
provides a possible explanation for the way sensory neurons are
organized in the mammalian cerebral cortex [107–109] but also
because self-organization is relevant [108–110] to the problem of how
to design arrays of artificial sensors that can efficiently fuse together
different kinds of information. At the end of Chapter 8, we com-
ment on what this means for the relationship of mathematics and
theoretical physics.
A notable aspect of connecting mammalian cognition and quan-
tum theory is that there is a very elegant quantum computa-
tional model for the seemingly effortless way mammals construct
3D representations of their environment. This model is based on
the work of Julian Schwinger [111] (as a graduate student!) on the
connection between quantum oscillator states and the quantum theory
of angular momentum. In Chapter 8, we show how Schwinger’s
quantum oscillator representation for quantum angular momentum
states can be used to construct a double pyramid. This construc-
tion is related to the Biedenharn–Elliott identity for the Racah 6j
symbols [112]. In particular, we make use of Schwinger’s oscilla-
tor representation of Wigner’s triangles, together with the theory
of quantum teleportation [113], to construct “by hand” a double
pyramid. This construction exposes both a deep connection between
the quantum angular momenta and the topology of three-manifolds
and the ability of quantum wave functions to represent nontrivial
knots. A perhaps deeper view of the usefulness of combining Wigner’s
representation for triangles together with quantum teleportation of
continuous variables [113] to form holistic images of 3D objects is
that this may shed light on how the mammalian brain constructs a
holistic representation of its environment.
Many technical details necessary for a complete understanding
of our presentation are relegated to Appendices. This includes a
brief description of Gaussian processes and artificial neural networks,
certain intractable wave scattering problems could be exactly solved
by viewing the wave number as a complex number valued variable
rather than just a real parameter. In a historical sense this discovery
is the fundamental mathematical result underlying our presentation.
In the 1950s and 1960s, the Wiener–Hopf method was adapted to the
problem of describing wave propagation in a channel with a flexible
boundary [116] as well as the “inverse scattering” problem in three
dimensions [117]. This application of the Wiener–Hopf method also
provides the framework for Dyson’s adaptive optics formalism, as well
as much of our quantum approach to Bayesian inference. Appendix D
is devoted to a noteworthy historical detail regarding the Riemann–
Hilbert method for finding exact solutions of the KdV and NLS
equations described in Trogdon–Olver [69]. Namely, the structure
of the Riemann Method that underlies the T–O numerical method
for solving initial value problems for the KdV and NLS equations has
an interesting connection with the relationship between varieties of
strongly interacting elementary particles developed in the early 1960s
by Gell-Mann and Ne’eman, that was based on a novel way of gen-
erating representations of the SU(3) group known as the “Eightfold
Way” [118]. The Gell-Mann–Ne’eman construction also leads to a
construction [119] of solutions of the KdV equation using fermionic
operators.

Fig. 1.2. Duality of optimal control and pattern recognition.

Figure 1.2 is a cartoon illustrating how we imagine that the topics
discussed in the following chapters are related to one another.
Our main theme is that Bayesian inference involves two fundamental
optimization principles: the Rissanen–Mumford minimum description
length (MDL) principle [26] and Bellman optimization [19,20]. As
noted by Todorov [19], these principles are
closely related, leading to a duality between pattern recogni-
tion and optimal control/RL-type problems. In the end one can
view this duality as a consequence of quantum self-organization,
i.e. the tendency of quantum mechanics to favor certain types
of holomorphic representations for input data and data features,
particularly holomorphic representations of the Weyl–Heisenberg
group related to the theory of integrable PDEs and Riemann
surfaces, thus providing a plausible beginning for understanding
the remarkable cognitive capabilities of the mammalian cerebral
cortex.
Chapter 2

Six Fundamental Discoveries

2.1. Bayes’s Probability Formula

Thomas Bayes was born in 1701 and died in 1761. He was an ordained
minister in Tunbridge Wells, about 35 miles southeast of London.
Although he published no scientific papers during his lifetime, his
mathematical abilities must have been known to his contemporaries
because he was elected a Fellow of the Royal Society in 1742. After
his death, his “Essay Towards Solving a Problem in the Doctrine
of Chances” was published in the Philosophical Transactions of the
Royal Society [82]. This essay, arguably one of the most important
papers in the history of science, introduced a rigorous methodology
for estimating the probabilities for different possible explanations for
experimental observations based on the “evidence” [5,6]. Given the
revolutionary implications of Bayes’s essay, especially compared with
what was understood in the 18th century about the scientific method,
one might have expected that his formula would have immediately
been celebrated. Unfortunately, in what may perhaps be regarded as
one of the most significant lapses in the post-Renaissance progress
of science, it took more than 200 years after its publication for the
great value of Bayes’s theorem for data analysis to be fully
appreciated. Fortunately, the fundamental usefulness of Bayes’s formula
for data analysis, Markov decision problems, and optimal control is now
widely appreciated (see e.g. [6]), even if not widely used.
The origins of Bayes’s essay are not entirely understood, but it
seems likely [82] that he was motivated by the earlier work of another
self-taught amateur mathematician, Thomas Simpson, on the problem
of how to combine multiple observations of the positions of an
astronomical body to obtain the best estimate of the true position
of the body. Simpson introduced the fundamental notion that in
order to understand the truth behind experimental observations, one
needed to have some understanding of how the errors in observations,
i.e. the difference between the observed and true position of the body,
are distributed. In modern terminology, this would mean specifying
a probability density for the observational error. The tremendous
power of Bayes’s formula lies in the fact that in a general setting it
provides a canonical approach to finding the best explanation α for a set
{d} of input data, based on a set {α} of possible data models (cf. [6]).
Each explanation consists of a data model or hypothesis α together
with ancillary parameters θ for each data model, and the best expla-
nation typically changes as data is acquired from new observations.
The canonical Bayesian prescription for data analysis is to maximize
at any particular time the posterior probability p(θ|d, α) given a par-
ticular model α and set of data {d} that has been acquired up to
that time:
\[
p(\theta \mid d, \alpha) = \frac{p(d \mid \alpha, \theta)\, p(\theta \mid \alpha)}{p(d \mid \alpha)}, \tag{2.1}
\]
where the two factors in the numerator represent the likelihood of
the data given a particular data model and the a priori
probability for that particular model. In the ideal case where the a
priori probabilities P(α) for the occurrence of various explanations α
and the conditional probability densities P(d|α) for the sensor data
d within each class are known, the best possible classification
procedure would simply be to choose the explanation α for which the
posterior probability, P(α|d) ∝ P(d|α)P(α), is the largest.
Unfortunately, in the
real world one is typically faced with the situation that neither the a
priori probabilities P(θ|α) for the various possible explanations nor
the conditional probability densities P(d|α) for the input data given
a particular explanation are precisely known. Therefore, one must in
general rely on ad hoc models for these probabilities in order to find
the best explanation for a particular dataset. In his comprehensive
2003 review, MacKay [6] describes the strategies behind the various
efforts to make practical use of Bayes’s formula for data analysis,
and has a nice discussion as to why Bayes’s prescription favors the
simplest models in the sense of Mumford–Rissanen [26].
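As a minimal sketch of how Eq. (2.1) can be evaluated in practice, the following Python fragment computes the posterior on a grid of θ values for a single fixed model α. The Gaussian error model, the flat prior, the synthetic data, and the names `grid_posterior`, `thetas`, and `sigma` are all illustrative assumptions of ours, not part of the original discussion.

```python
import numpy as np

def grid_posterior(data, thetas, sigma=1.0):
    """Evaluate Eq. (2.1) on a grid of theta values for one fixed model alpha."""
    n = len(data)
    # Flat prior p(theta | alpha) over the grid points (an illustrative choice).
    log_prior = np.full(thetas.shape, -np.log(len(thetas)))
    # Log-likelihood log p(d | alpha, theta) for independent Gaussian errors.
    resid = data[:, None] - thetas[None, :]
    log_like = (-0.5 * np.sum((resid / sigma) ** 2, axis=0)
                - n * np.log(sigma * np.sqrt(2.0 * np.pi)))
    log_joint = log_like + log_prior
    # Log-evidence log p(d | alpha), computed with a stable log-sum-exp.
    peak = log_joint.max()
    log_evidence = peak + np.log(np.sum(np.exp(log_joint - peak)))
    posterior = np.exp(log_joint - log_evidence)
    return posterior, log_evidence

rng = np.random.default_rng(0)
data = 2.0 + rng.normal(size=50)          # synthetic observations, true theta = 2
thetas = np.linspace(-5.0, 5.0, 1001)
posterior, log_evidence = grid_posterior(data, thetas)
theta_map = thetas[np.argmax(posterior)]  # maximum a posteriori estimate
```

With a flat prior the maximum of the posterior sits at the grid point nearest the sample mean. The returned log-evidence is the quantity one would compare across competing models α.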

One of the reasons for the long historical delay in making extensive
use of Bayes’s formula was apparently confusion as to how one
could estimate the generally unknown a priori probability
distributions. This uncertainty about the usefulness of Bayes’s formula
was eventually dissipated by the development of Bayesian approaches
to search and control problems, where the problem of defining the
prior probabilities for possible explanations is side-stepped by using
the probabilistic predictions from the prior step of the search or con-
trol process as the input a priori probability for the next step. The
problem with the initial a priori probability lingers, but it is often
the case that the final answer is insensitive to the exact initial a priori
probability distribution. In addition, conceptual unease with uncer-
tainties in a priori probabilities has been largely erased by the notion
of using a model generation or “adversarial” network to predict the
probabilities for observed data.
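The recycling of each posterior as the next prior can be sketched in a few lines of Python. The example below is a toy of our own devising (a coin of unknown bias, a hypothetical grid `biases` of candidate values, and a deterministic sequence of 700 heads in 1000 flips). It is meant only to illustrate why the final answer is often insensitive to the initial a priori distribution.

```python
import numpy as np

def sequential_update(prior, likelihoods):
    """Apply Bayes's rule once per observation, reusing each posterior as the next prior."""
    belief = np.asarray(prior, dtype=float)
    belief = belief / belief.sum()
    for like in likelihoods:          # like[i] = p(observation | hypothesis i)
        belief = belief * like
        belief = belief / belief.sum()
    return belief

biases = np.linspace(0.05, 0.95, 19)              # hypothetical candidate coin biases
flips = (np.arange(1000) % 10) < 7                # 700 heads in 1000 flips
likes = [biases if heads else 1.0 - biases for heads in flips]

flat = sequential_update(np.ones_like(biases), likes)       # uniform initial prior
skewed = sequential_update((1.0 - biases) ** 4, likes)      # prior favoring small bias
```

Both runs end up concentrated on the same hypothesis (bias 0.7), even though the second starts from a prior that strongly disfavors it.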

2.2. The Wiener and Kalman–Bucy Filters

In second and third place behind Bayes’s formula, we believe that
data science’s greatest historical debt belongs to Norbert
Wiener and Rudolf Kalman for introducing methods for signal
extraction and state estimation that take into account
extraneous noise associated with observational errors or errors in
choosing the best model to explain the observational data. Wiener’s
approach to signal extraction was not based directly on Bayes’s for-
mula, but instead on the difference between the correlation structure
of signals and noise. In the 19th century, Adrien-Marie Legendre and Carl
Friedrich Gauss initiated [82] the least squares estimation method
which allows one to interpolate between examples where the expla-
nation for the input data is known. Although there was a hint of the
method of least squares in a 1722 paper by Roger Cotes, Legendre’s
1805 book Nouvelles méthodes pour la détermination des orbites des
comètes contained the first clear presentation of the least squares
method. Legendre illustrated his method with the practical prob-
lem of determining the Paris meridian from survey data. Following
Legendre’s 1805 presentation, the method of least squares became
widely used in astronomy and geodesy for representing data. In these
early applications though it was not clear in what sense Legendre’s
method provided the “best” way of representing data, nor what was
its relationship to probability theory. These questions
were partially answered by Gauss, who showed that the method of
least squares did provide a probabilistic best approximation for the
data when the errors for the data could be described using Gauss’s
eponymous probability distribution. Gauss also introduced a recur-
sive least squares method to carry out the calculations he undertook
to locate the asteroid Ceres [82]. Unfortunately, in the presence of
noise the least squares method doesn’t necessarily provide the best
estimate for the signal. The solution to this problem was eventually
provided by Wiener [81] and Kalman–Bucy [83].
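For readers who prefer a concrete picture, the sketch below compares a batch least squares fit with a recursive variant that processes one observation at a time. The recursion shown is the standard textbook recursive least squares update and is offered only in the spirit of Gauss's scheme, not as a reconstruction of his actual calculation. The straight-line data, the function names, and the large initial covariance `p0` are assumptions made for illustration.

```python
import numpy as np

def batch_least_squares(A, y):
    """Legendre's method: minimize ||A c - y||^2 over the coefficients c."""
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def recursive_least_squares(A, y, p0=1e6):
    """Update the estimate one observation at a time (standard RLS recursion)."""
    n = A.shape[1]
    coeffs = np.zeros(n)
    P = p0 * np.eye(n)                 # large initial covariance: a vague prior
    for a, yk in zip(A, y):
        gain = P @ a / (1.0 + a @ P @ a)
        coeffs = coeffs + gain * (yk - a @ coeffs)
        P = P - np.outer(gain, a @ P)
    return coeffs

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 50)
A = np.column_stack([np.ones_like(t), t])          # model: y = c0 + c1 * t
y = 1.0 + 2.0 * t + 0.1 * rng.normal(size=t.size)  # synthetic noisy data
batch = batch_least_squares(A, y)
recursive = recursive_least_squares(A, y)
```

The two estimates agree up to a small bias controlled by `p0`. The recursive form already has the "old estimate plus gain times residual" structure that reappears in the Kalman filter.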
To his surprise Wiener discovered in the course of applying the
least squares method to the problem of extracting a continuous time
signal from noise that, at least in the case of white noise, the problem
could be analytically solved by solving a certain linear integral
equation that he and Hopf had introduced in 1931 in connection
with finding analytic solutions to certain scattering problems [61].
A simple way of understanding how this integral equation arises is
provided by Yaglom’s projection theorem [83]. Let us consider the
Hilbert space Hx generated by a random process X(t), where 0 <
t < T . Specifically, let us assume that X(t) is a Gaussian process
with a covariance matrix R(s, t) ≡ E[X(t)X(s)], where XY means
the dot product of vectors X and Y . The smoothing problem is: given
the Hilbert space HYT generated by observations Y (t) of signals Z(t)
perturbed by white noise N (t), i.e. Y (t) = Z(t) + N (t), find the least
squares estimator Ẑ ∈ Hx for Z(t) that satisfies the orthogonality
condition

\[
E\big[\big(Z(t) - \hat{Z}(t)\big)\, Y(s)\big] = 0 \quad \text{for } 0 < s, t < T, \tag{2.2}
\]

which ensures that E|Z(t) − Ẑ(t)|² is a minimum and Ẑ(t) aligns with Y:

\[
E\big[|\hat{Z} - Z|^2\big] = \min_{Z \in H_x} E\big[|Z - Y|^2\big]. \tag{2.3}
\]

Of greatest interest is the correlation function

\[
R(s, t) = \delta(t - s) + K(t, s), \tag{2.4}
\]

where K(t, s) ≡ E[Z(t)Z(s)] is a continuous matrix defined on
[0, T] × [0, T] and the delta function represents white noise. If we now
define a matrix H(t, s), where s < t, such that

\[
\hat{Z}(t) = \int_{-\infty}^{t} H(t, s)\, Y(s)\, ds, \tag{2.5}
\]

then Eq. (2.2) implies that

\[
K(t, \tau) = H(t, \tau) + \int_{0}^{T} H(t, s)\, K(s, \tau)\, ds. \tag{2.6}
\]
A matrix H satisfying this integral equation is called the Fredholm
resolvent of the covariance matrix K, and serves as a “filter” for
extracting a well-defined signal Ẑ(t) in the presence of white noise
N(t), as is evident from Eq. (2.6). It is the restriction to t ≤ T
that prevents Eq. (2.6) from being easily solved using Fourier
transforms. However, Wiener observed [81] that Eq. (2.6) can be solved
using Laplace transforms, with the result that the filter function
regarded as an analytic function of frequency is meromorphic; i.e. it
is a ratio of holomorphic functions (a holomorphic function being a
complex-differentiable function of a complex variable, so that the only
singularities of the ratio are poles at the zeros of its
denominator). This seminal discovery by Wiener is the spring from
which essentially all the results described in the following flow. With
respect to classical Bayesian learning, Eq. (2.2) can be regarded as
the fundamental equation for stochastic estimation when covariance
information for the signal (as opposed to just a model for the time
dependence for the signal) is available.
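A discretized version of Eq. (2.6) may help make the role of the resolvent concrete. In the sketch below the interval [0, T] is replaced by n grid points of spacing dt, so that the integral becomes a matrix product and the delta function becomes the identity divided by dt. The exponential signal covariance and the function name `fredholm_resolvent` are assumptions chosen for illustration, and the calculation is the fixed-interval (smoothing) version rather than Wiener's causal solution.

```python
import numpy as np

def fredholm_resolvent(K, dt):
    """Solve the discretized Eq. (2.6), K = H + H K dt, for the filter H."""
    n = K.shape[0]
    # H (I + K dt) = K  =>  H = K (I + K dt)^{-1}; solve the transposed system.
    return np.linalg.solve((np.eye(n) + K * dt).T, K.T).T

T, n = 1.0, 200
dt = T / n
t = (np.arange(n) + 0.5) * dt
K = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.2)   # assumed signal covariance
H = fredholm_resolvent(K, dt)

# Covariance of the sampled observations Y = Z + N: delta(t - s) -> I / dt.
R = K + np.eye(n) / dt
# Orthogonality condition (2.2): E[(Z - Z_hat) Y^T] = K - (H dt) R should vanish.
residual = K - (H * dt) @ R
```

The residual vanishes to rounding error, which is the discrete statement that the estimate Ẑ = H Y dt satisfies the orthogonality condition (2.2).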
Wiener’s analytic solution for Eq. (2.6) did find significant appli-
cations, e.g. to antiaircraft control [19]. As it happened though,
the success of the Wiener filter was eventually overtaken in
importance by the Kalman–Bucy filter [83,84], which was developed
independently by Richard Bucy and collaborators at the Johns Hopkins
Applied Physics Lab and Rudolf Kalman, and jointly published by
Kalman and Bucy in 1961 [83]. Like Wiener, Kalman and Bucy were
interested in analyzing a time series of observations in order to make
predictions about the future state of a system. As in Wiener’s filter,
the basic strategy for extracting signals in the presence of both
measurement and system noise is to use the difference between signal
and noise time correlations. In addition, by assuming that systems
evolve linearly in time, Kalman and Bucy were able to reduce the
optimal control problem to numerically solving an ordinary
differential equation.
Kalman and Bucy’s great achievement was to describe a practical
method for updating one’s knowledge of the state of a system based
on a linear model for the system dynamics and a control variable
equal to the covariance of a Gaussian process describing the stochas-
tic difference between the observed time history of a certain feature
z(t) (defined a priori as a linear combination of the components of
the vector x(t) describing the history of the system) and an under-
lying model value for this history. The probabilistic aspects of the
observations as well as the system dynamics are taken into account
by introducing white noise for these quantities. Of course, there is
a certain art to choosing the signals that are most useful in practice.
(In the Indian forest described in Rudyard Kipling’s famous book,
the antelopes rely on the calls of monkeys and birds to discern the
presence of tigers.) The great success of Kalman’s model derives from
the fact that given a choice for extracting useful signals from the raw
data, Kalman’s equations are solvable via numerical integration of
an ordinary differential equation.
Kalman assumes that the input for the system consists of an
exogenous perturbation f (t) and a feedback control variable u(t).
The aim of the Kalman filter is to make use of a series of measure-
ments {Yk } in order to minimize the uncertainty X̃(t) = (X(t) −
X̂(t)) in the state of the system, where X̂(t) is an underlying model
for the system dynamics. In the discrete case, one can speak about
a “gain” matrix K, which describes the gain in knowledge about the
state of the system after each measurement:

\[
X_k = X_{k-1} + K_k\,(Y_k - H_k X_{k-1}). \tag{2.7}
\]
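The update (2.7) can be illustrated with the textbook discrete Kalman filter. The sketch below is the standard predict/update cycle, not a transcription of any particular scheme in the references. The matrices F, Q, and R, the function name `kalman_step`, and the toy problem of estimating a constant from noisy readings are assumptions introduced for illustration. With F equal to the identity and Q = 0, the update line reduces exactly to the form of Eq. (2.7).

```python
import numpy as np

def kalman_step(x, P, y, F, H, Q, R):
    """One predict/update cycle; the update line has the form of Eq. (2.7)."""
    # Predict the state and its covariance with the linear model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # The gain K_k weighs the mismatch between measurement and prediction.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

rng = np.random.default_rng(3)
ys = 3.0 + rng.normal(size=100)        # noisy readings of a constant state
F, H = np.eye(1), np.eye(1)
Q, R = np.zeros((1, 1)), np.eye(1)
x, P = np.zeros(1), 1e6 * np.eye(1)    # vague initial knowledge
for yk in ys:
    x, P = kalman_step(x, P, np.array([yk]), F, H, Q, R)
```

In this toy case the filter reproduces the running average of the measurements, and the error covariance falls off as 1/k, so each gain K_k quantifies how much the k-th measurement adds to what is already known.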

The success of Kalman’s filter also relies on the use of two Gaussian
processes: (1) a noise source N (t) which limits the ability of an
observer to measure a signal (i.e. Y (t) = Ẑ(t)+N (t)), and (2) another
GP w(t) which represents an intrinsic randomness in the system
dynamics. Kailath [83] introduced the designation “innovation” for
u(t) in recognition of the fact that it represents that part of an
observation which yields information about the new state. The gen-
eral scheme (Fig. 4.1) can be pictured as an interaction between an
“observer–controller” and a system; e.g. a mechanical device or an
“environment” (cf. the zebras and their surroundings in Fig. 2.1).
The objective of the Kalman filter is, given the GPs w(t) and
v(t), to minimize the mean square difference between the estimated
current state of the system X(u, t) and a desired target state X_T.

Fig. 2.1. Kalman–Bucy scheme for feedback control.

In cases where both the state X(t) of the environment and
measurements Y (t) are continuous matrix functions of time, these
matrices satisfy:
\[
\hat{Z}(t) = H(t)\hat{X}(t), \qquad \dot{\hat{X}}(t) = F(t)\hat{X}(t) + K(t)\big(Y(t) - \hat{Z}(t)\big), \tag{2.8}
\]
where the observations Y(t) differ from the signal Z(t) by an
observational noise ν(t), i.e. Y(t) = Z(t) + ν(t); K(t) = P(t)H^T + G(t)H X̂
describes the increase in our knowledge of the system based on
continuous observation of particular features of the system, and it is
assumed that the coefficient matrices are all known functions of
time. Here, H_k is a matrix which defines the “features” {Z_k} of the
environment which are of greatest interest to designated controllers.
It is by combining observation of a system (or ‘environment’
in the case of RL) with the linear dynamical model, Eq. (2.8), for the
system that the controller hopes to gain enough information about
the state of the system or environment to take effective corrective actions.
The time-dependent covariance P(t) = E[x̃(t)x̃(t)^T] for the error
x̃(t) = (x(t) − x̂(t)) can then be found [118] by numerically solving
a nonlinear ordinary differential equation:
\[
\dot{P} = -P H^{T} R^{-1} H P + F P + P F^{T} + Q, \tag{2.9}
\]
where v and w represent observation and system noise with
covariances R and Q, respectively.
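To indicate what "numerically solving" the covariance equation involves, the sketch below integrates the matrix Riccati equation with a fixed-step fourth-order Runge–Kutta scheme. We assume the standard Kalman–Bucy form of Eq. (2.9), with a transpose on the second F. The constant-velocity model (F, H, Q, R), the step count, and the function names are illustrative assumptions.

```python
import numpy as np

def riccati_rhs(P, F, H, Q, R_inv):
    """Right-hand side of the Riccati equation, in the standard Kalman-Bucy form."""
    return -P @ H.T @ R_inv @ H @ P + F @ P + P @ F.T + Q

def integrate_riccati(P0, F, H, Q, R, t_end, steps=2000):
    """Fixed-step fourth-order Runge-Kutta integration of the covariance P(t)."""
    R_inv = np.linalg.inv(R)
    P, dt = P0.copy(), t_end / steps
    for _ in range(steps):
        k1 = riccati_rhs(P, F, H, Q, R_inv)
        k2 = riccati_rhs(P + 0.5 * dt * k1, F, H, Q, R_inv)
        k3 = riccati_rhs(P + 0.5 * dt * k2, F, H, Q, R_inv)
        k4 = riccati_rhs(P + dt * k3, F, H, Q, R_inv)
        P = P + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return P

F = np.array([[0.0, 1.0], [0.0, 0.0]])   # position driven by velocity
H = np.array([[1.0, 0.0]])               # only the position is observed
Q = np.diag([0.0, 1.0])                  # system noise acts on the velocity
R = np.array([[1.0]])                    # observation noise covariance
P_final = integrate_riccati(np.eye(2), F, H, Q, R, t_end=20.0)
```

For this model the covariance relaxes to the steady state with diagonal entries √2 and off-diagonal entries 1, which can be checked by substituting into the right-hand side and setting it to zero.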
The Kalman–Bucy “filter” garnered much praise in the 1960s
as a result of its successes for spacecraft control [3]. However, the
assumption that state variables evolve linearly in time is rather
restrictive, and the Kalman approach to control theory was eventu-
ally supplanted as a general approach to optimal control by Richard
Bellman’s “dynamic programming” formalism [20].

2.3. Bellman’s Dynamic Programming Approach to Optimal Control

The filters of Wiener and Kalman–Bucy might reasonably be
regarded as the seminal springs for machine learning. However, we
are indebted to Richard Bellman [20] for having been the first person
to introduce a quantity that one wants to optimize with feedback con-
trol (although in a certain sense this was implicit in the 14th century
work of William of Ockham [85]). Bellman’s great contribution was
to introduce a performance index V (t) for the efficacy of feedback
control together with a recurrence relation for determining its value.
Bellman referred to this performance index as a “cost-to-go”. As in
classical mechanics where the object is to minimize the Maupertuis
action [88], the object of optimal control is to minimize the Bellman
cost function. (One thing that is confusing about the optimal control
literature is that V (t) is often referred to as a “value” function for
a control or RL strategy, even though the object is to minimize its
magnitude.) Bellman’s dynamic programming approach to optimal
control [20] is based on the idea that the optimal cost-to-go should
not depend on whether one carries out the optimization all in one
campaign or in two steps, leading to a recursion relation for a “cost
function” V(t):

\[
V(T) = \min_{[t, t_1]} \min_{[t_1, T]} \left[ \int_{t}^{t_1} l\big(X(\tau), U(\tau), \tau\big)\, d\tau + \int_{t_1}^{T} l\big(X(\tau), U(\tau), \tau\big)\, d\tau \right]. \tag{2.10}
\]
The main burden of Bellman’s approach to feedback control is finding
practical ways of solving Eq. (2.10). To a significant extent the
frontier of data science is defined by finding better ways of calculating
the loss function l(X, U, t) for Bellman’s cost function, especially in
complex situations. Perhaps the most interesting aspect of Eq. (2.10)
is the important role played by the “loss” function in its connection
with information theory.
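A discrete-time, discrete-state analogue of Eq. (2.10) shows how the recursion is used in practice. In the sketch below the state space (seven integer positions), the three admissible controls, the quadratic loss, and the names `cost_to_go` and `brute_force` are toy assumptions of ours. The point is only that the backward recursion gives the same optimal cost as an exhaustive search over all control sequences.

```python
import itertools

def cost_to_go(loss, step, states, controls, horizon):
    """Backward recursion V_k(x) = min_u [ l(x, u, k) + V_{k+1}(step(x, u)) ]."""
    V = {x: 0.0 for x in states}               # terminal cost taken to be zero
    for k in reversed(range(horizon)):
        V = {x: min(loss(x, u, k) + V[step(x, u)] for u in controls)
             for x in states}
    return V

states = range(7)
controls = (-1, 0, 1)
horizon = 5
step = lambda x, u: min(max(x + u, 0), 6)              # move, staying on the grid
loss = lambda x, u, k: (x - 3) ** 2 + 0.5 * u ** 2     # distance from target + effort

V = cost_to_go(loss, step, states, controls, horizon)

def brute_force(x0):
    """Optimize over every control sequence in one campaign, for comparison."""
    best = float("inf")
    for seq in itertools.product(controls, repeat=horizon):
        x, total = x0, 0.0
        for k, u in enumerate(seq):
            total += loss(x, u, k)
            x = step(x, u)
        best = min(best, total)
    return best
```

The recursion visits each state once per time step, whereas the exhaustive search grows exponentially with the horizon. This is the practical content of the statement that the optimal cost should not depend on whether the optimization is done all at once or in stages.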
Although Bellman didn’t initially phrase it this way, it was even-
tually realized that the loss function represents the rate at which
information about the optimal path is accumulated. Worth mentioning
in this respect is a 2011 paper [28] prompted by the Monty
Hall controversy [5], which clarified that the rate of accumulation of
Shannon information during a Bayesian search can be measured by the
change in Bellman’s cost function (where the change in information
is identified with the negative of the change in Shannon entropy).
This information theory interpretation for Bellman’s value function
is consistent with Todorov’s interpretation [19] of the negative expo-
nential of the Bellman value function as the “backward filtering”
probability p(y(n), . . . , y(N )|x(n)) for obtaining a series of measure-
ments {y(i), i = n, n + 1, . . . , N } in the future, given the current
state x(n) of the system. Thus, any attempt to provide a quantum
interpretation for Bellman’s approach to optimal control would nec-
essarily require including a model for how future measurements will
affect one’s understanding of the state of the system. This echoes the
“time-symmetric” formulation of quantum mechanics [18], and con-
trasts with classical mechanics where one is only allowed to specify
an initial, or alternatively, a final state for a system.
Curiously, Bellman’s invention of dynamic programming followed
not long after Feynman’s development of his path integral approach
to quantum mechanics for his PhD thesis at Princeton [32]. The fact
that Bellman was a PhD student of Solomon Lefschetz at Princeton
in the mid-1940s certainly allows for the possibility that Bellman
was aware of Feynman’s ideas at the time he developed his dynamic
programming algorithm (although Feynman left Princeton during
WWII and went to Cornell University after the war). One of our
main threads in the following will be exploring the relationship
between Bellman’s dynamic programming algorithm and Feynman’s
path integral approach to quantum mechanics.
34 Quantum Mechanics and Bayesian Machines

2.4. Feynman’s Path Integral Approach to Quantum Mechanics

In the seminal papers of Dirac and of Heisenberg, Born, and Jordan
[42], it was assumed that the classical equations of motion could be
written in the form given by Hamilton’s equations of motion, where
the time derivatives of the position q and momentum p variables are
written in the form
\[
\dot{p} = -\frac{\partial H}{\partial q}, \qquad \dot{q} = \frac{\partial H}{\partial p}, \tag{2.11}
\]
where H(q, p) is a function, known as the Hamiltonian, describing
the physical system. In the matrix mechanics of Heisenberg [40],
the classical quantities q, p, and H are represented as matrices. In
addition, the variables q and p describing the state of the system
satisfy the commutation relation [q, p] = iħ, where ħ is 1/2π times
the constant that Planck introduced in connection with his quantum
treatment of thermal radiation. Indeed, the first estimates of the
value of ħ were obtained by comparing Planck’s theory of thermal
radiation [40] with experimental measurements of the spectrum of
infrared radiation from ovens. Although the canonical Eqs. (2.11)
are very simple, a perhaps more elegant way to derive the equations
of motion of classical mechanics is to make use of the Principle of
Least Action. In this way, classical mechanics is derived from an
optimization principle. The quantity that is minimized, the “action”,
is defined as the time integral of a Lagrangian function:
\[
S(T) = \int_{0}^{T} L(p, q)\, dt, \tag{2.12}
\]
where
\[
L = p\dot{q} - H(q, p). \tag{2.13}
\]
Although using a Lagrangian function of q and dq/dt rather than a
Hamiltonian function of q and p might seem to be a trivial differ-
ence, it turned out that using the Principle of Least Action as the
starting point for formulating quantum mechanics made a profound
difference.
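As a brief check of the claim that the canonical equations follow from an optimization principle, one can vary the action (2.12) with the Lagrangian (2.13), treating q(t) and p(t) as independent and assuming that the variation δq vanishes at t = 0 and t = T (the usual fixed-endpoint assumption, which is not stated explicitly above):

\[
\delta S = \int_{0}^{T} \left[ \delta p \left( \dot{q} - \frac{\partial H}{\partial p} \right) + p\, \delta\dot{q} - \frac{\partial H}{\partial q}\, \delta q \right] dt
= \int_{0}^{T} \left[ \delta p \left( \dot{q} - \frac{\partial H}{\partial p} \right) - \delta q \left( \dot{p} + \frac{\partial H}{\partial q} \right) \right] dt + \big[ p\, \delta q \big]_{0}^{T}.
\]

The boundary term vanishes under the stated assumption, so requiring δS = 0 for arbitrary δp and δq reproduces Eqs. (2.11).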
Following Dirac’s lead [35], Richard Feynman investigated in his
PhD thesis [32] what role the classical principle of least action might
play in quantum mechanics. As a result, he was led to an entirely
new way of formulating quantum mechanics. Rather than beginning
with the Schrödinger equation, which had previously been the starting
point for quantum mechanical calculations, Feynman focused on the
“propagator” K(x_b, t_b; x_a, t_a), which transforms the Schrödinger wave
function in the position representation as a function of time:

\[
\psi(x_b, t_b) = \int K(x_b, t_b; x_a, t_a)\, \psi(x_a, t_a)\, d^3x_a. \tag{2.14}
\]

Feynman began by considering the form of (2.14) in the limit where
t_b is very close to t_a. Following Dirac, Feynman assumed that when
t_b is infinitesimally close to t_a (what is known in classical mechanics
as a contact transformation), Eq. (2.14) becomes

\[
\psi(x_b, t_a + \varepsilon) = \frac{1}{A(\varepsilon)} \int \exp\!\left[\frac{i\varepsilon}{\hbar}\, L\!\left(\frac{x_b - x_a}{\varepsilon}, x_b\right)\right] \psi(x_a, t_a)\, d^3x_a, \tag{2.15}
\]

where L(ẋ, x) is the classical Lagrangian function (2.13) and A(ε)
is a normalization factor, which is required in the transition from
classical to quantum mechanics and whose exact form depends on
the specific form of the Lagrangian. By concatenating infinitesimal
transformations of the form (2.15), Feynman arrived at his path
integral for the propagator, i.e. the quantum amplitude for going from
x_a to x_b:

\[
K(x_b, t_b; x_a, t_a) = \int_{a}^{b} e^{iS[b,a]/\hbar}\, \mathcal{D}[x(t)], \tag{2.16}
\]

where S[b, a] is the classical action (cf. [47]) going from x_a to x_b, and
D[x(t)] denotes a sum over all paths leading from x_a to x_b. For a free
particle with mass m in one dimension, this propagator takes the
simple form [47]:

\[
K(x_b, t_b; x_a, t_a) = \frac{1}{A(t_b - t_a)} \exp\!\left[\frac{im}{2\hbar}\, \frac{(x_b - x_a)^2}{t_b - t_a}\right], \tag{2.17}
\]

where A(τ) = [2πiħτ/m]^{1/2}. The exponent of the exponential
factor is just (i/ħ) times the classical action for a free particle. If
instead of assuming that the particle started at a definite position,
36 Quantum Mechanics and Bayesian Machines

we assumed that the particle started with a definite momentum, then


the amplitude for the particle to arrive at a position x is
  
p2
1 ipx−i m
t /
ψ(x, t) = e . (2.18)
A(t)

The exponent of the exponential is stationary with respect to p when
$x = (p/m)t$, which is just the classical equation of motion for a free
particle. The probability of finding the particle at a point x is the
absolute square of the r.h.s. of Eq. (2.18), i.e. $1/|A|^2$, which
depends on time but is independent of x. This reflects the fact that in
quantum mechanics when
the momentum is known with certainty, the position variable is com-
pletely unknown. This complementarity would seem to be unhelpful
from the point of view of classical physics, but from the perspective of
data analysis it will turn out that this complementarity is very useful.
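
The stationary-phase statement about Eq. (2.18) can be checked numerically. The following is only a minimal sketch: the units (ħ = 1) and the values of m, x and t are illustrative assumptions, and the function name is hypothetical. It locates the momentum at which the phase is stationary and compares it with the classical relation x = (p/m)t; it also confirms that the modulus of the phase factor does not depend on x.

```python
import numpy as np

HBAR = 1.0  # natural units (an assumption made for this illustration)

def free_particle_phase(p, x, t, m):
    """Phase (p*x - p**2 * t / (2*m)) / hbar of the amplitude in Eq. (2.18)."""
    return (p * x - p**2 * t / (2.0 * m)) / HBAR

# Illustrative (hypothetical) mass, arrival position and elapsed time.
m, x, t = 1.5, 3.0, 2.0
p_grid = np.linspace(0.0, 4.0, 4001)
phase = free_particle_phase(p_grid, x, t, m)

# The phase is a concave parabola in p, so its stationary point is its maximum.
p_stationary = p_grid[np.argmax(phase)]
p_classical = m * x / t  # classical relation x = (p/m) t solved for p

# exp(i * phase) has unit modulus at every x, so |psi|^2 = 1/|A|^2 is x-independent.
x_samples = np.linspace(-5.0, 5.0, 11)
modulus = np.abs(np.exp(1j * free_particle_phase(p_classical, x_samples, t, m)))
```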

2.5. Quantum Solution of the Traveling Salesman Problem (TSP)

The primary inspiration for our presentation is the observation [36]
that quantum mechanics provides a simple solution to the prob-
lem of finding the order in which a salesman visits cities, and pro-
vides a canonical example of the type of topological obstruction that
often attends Bayesian model selection [36,78]. As was emphasized in
Kailath’s review of linear noise filters [83], a crucial ingredient needed
for these filters is the Fourier transform of the anti-causal correlation
function for the signal plus noise regarded as an analytic function
of frequency (cf. Appendix B). The crucial discovery that we would
like to highlight is the discovery [36] that the appearance of holomor-
phic functions (rational functions of smooth functions of a complex
number whose only singular points are zeros) in pattern recognition
and feedback control has a simple quantum mechanical interpretation
in terms of Feynman path integrals; at least in the case of the TSP.
The solution to the TSP we describe below also illustrates why mero-
morphic functions (ratios of holomorphic functions) naturally arise
in data analysis problems involving combinatorial optimization. Of
course, it is not always true that data analysis involves combinato-
rial optimization, but what sets the mammalian brain apart from the
brains of other animal species is that the mammalian brain has some
capability of dealing with ambiguities in the interpretation of sensory
data, which necessarily involves [78] combinatorial optimization.
Our introduction of the term quantum self-organization in con-
nection with the TSP is a pointer to the relationship between our
solution for the TSP and the appearance of holomorphic functions
in Kohonen self-organization of sensory data [12]. Our discovery was
prompted by the Durbin–Willshaw elastic net method [1] for finding
solutions to the TSP. As the name suggests, this involves adding to
the locations of the cities to be visited, indicated by the round dots
in Fig. 2.1, a trial “itinerary” for the salesman, the square points in
Fig. 2.2, and then connecting all the points to each other by springs.
The lengths d(i, μ) of the springs connecting the nodes where the
salesman is assumed to stop to the actual locations of the cities that
are to be visited are the “innovation” for the Durbin–Willshaw method;
i.e. the distance between the actual locations of the cities and a model
for the salesman’s itinerary.
In the Durbin–Willshaw approach [1], the TSP is solved by the
simple expedient of connecting the initially randomly placed movable
nodes with elastic strings, and the cities to be visited by nonlinear
strings, and allowing the system to relax to the lowest energy state
using gradient descent dynamics for an energy functional [1]:

$$E[\{w_i\}] = -\sum_\mu \log \sum_i \exp\!\left(-\frac{|\xi^\mu - w_i|^2}{2}\right) + \frac{K}{2} \sum_i |w_{i+1} - w_i|^2. \tag{2.19}$$

When the locations $\{\xi^\mu\}$ of the square points in Fig. 2.2 are not
too far from the round points $w_i$, this approach gives a satisfactory
solution for the TSP via gradient descent on the energy functional (2.19).
What makes the traveling salesman problem especially interesting from
the point of view of using quantum mechanics to solve optimal control and
RL problems is that the term in Eq. (2.19) involving the difference in
the positions of the round and square points can be replaced [36] by
a Feynman path integral over all paths marked with squares:

$$K(t - t_0) = \int \exp\!\left[\frac{i}{\hbar} \int_{t_0}^{t} \frac{m}{2}\,|\dot{x} - v(t')|^2\,dt'\right] \mathcal{D}y(t'), \tag{2.20}$$

Fig. 2.2. Durbin–Willshaw setup for solving the traveling salesman problem.

where the classical velocity v(x) is defined for all x by the actual
motion of the salesman, and y(t) = x(t) − xcl (t) is the deviation of
the Feynman path from the salesman’s itinerary.
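
The relaxation described above can be sketched in a few lines. This is a minimal illustration of gradient descent on the energy (2.19) as written, not the Durbin–Willshaw implementation: the closed-ring topology, the step size, the value of K, the random city positions and all function names are assumptions, and the annealing of a length scale that is used in practice is omitted.

```python
import numpy as np

def elastic_net_energy(cities, nodes, K):
    """Energy (2.19): soft city-to-node matching term plus elastic ring term."""
    d2 = ((cities[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=-1)  # (cities, nodes)
    a = -0.5 * d2
    a_max = a.max(axis=1, keepdims=True)                 # log-sum-exp for stability
    log_sum = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    ring = np.roll(nodes, -1, axis=0) - nodes            # w_{i+1} - w_i on a closed tour
    return float(-log_sum.sum() + 0.5 * K * (ring ** 2).sum())

def elastic_net_gradient(cities, nodes, K):
    """Gradient of the energy with respect to the node positions w_i."""
    diff = cities[:, None, :] - nodes[None, :, :]
    a = -0.5 * (diff ** 2).sum(axis=-1)
    a -= a.max(axis=1, keepdims=True)
    resp = np.exp(a)
    resp /= resp.sum(axis=1, keepdims=True)              # soft assignment of cities to nodes
    grad_match = -(resp[:, :, None] * diff).sum(axis=0)
    grad_ring = K * (2.0 * nodes - np.roll(nodes, -1, axis=0) - np.roll(nodes, 1, axis=0))
    return grad_match + grad_ring

def relax(cities, nodes, K=0.2, lr=0.02, n_steps=200):
    """Plain gradient descent; returns the relaxed node positions."""
    nodes = nodes.copy()
    for _ in range(n_steps):
        nodes -= lr * elastic_net_gradient(cities, nodes, K)
    return nodes

rng = np.random.default_rng(0)
cities = rng.random((6, 2))                              # hypothetical city locations
angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
start = cities.mean(axis=0) + 0.1 * np.column_stack([np.cos(angles), np.sin(angles)])
relaxed = relax(cities, start)
```

The matching term is evaluated with a log-sum-exp shift so that distant city and node pairs do not underflow; this changes only the numerics, not the energy.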
Equation (2.19) together with Fig. 2.2 implicitly illustrates one
aspect of the Bayesian model selection problem that is particularly
troublesome; namely, connecting data points with models will in gen-
eral involve a topologically non-trivial planar graph (i.e. a planar
graph where at least two lines defining the graph cross one another).
If one changes the order in which the cities are visited, then in
general both the solid line marking the salesman’s path and the dashed
line will cross one another. This makes improving the model for
the salesman’s itinerary using Markov chain Monte Carlo regression
methods essentially intractable (see e.g. [79]). However, because of the
general mathematical equivalence of topologically nontrivial planar
graphs and topologically simple paths on Riemann surfaces [80], the
regressions that are ill-defined for planar graphs can be carried out
on a topologically nontrivial surface.
In an almost obvious way, the Durbin–Willshaw setup can also be
interpreted as a control problem by interpreting the x and y coordi-
nates of the square dots as estimations of a time-dependent vector
X(t) describing the evolution of the state of a system in phase space
(viz. the evolution of position and velocity variables ( x, ẋ ) for a
self-driving car). In this interpretation, the round points represent
an underlying model {ẑ(t)}, for how these variables vary with time.
The distances d(i, μ) in Fig. 2.2 correspond to Kailath’s “innovation
Gaussian process” [83], which describes the difference between
observations and the predicted state of the system used in the
Kalman–Bucy filter method [3,83] for optimal control. The terms in
Eq. (2.19) involving just the estimated positions $\{w_i\}$ measure the
area swept out by the elastic string; so we see here the emergence of a
connection between optimization of Bellman’s value function for system
control [20] and the Nambu action principle in string theory [100].
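
To make the role of the innovation concrete, here is a minimal sketch of a single measurement update. It uses a scalar, discrete-time Kalman filter as a stand-in for the continuous-time Kalman–Bucy filter mentioned above; that substitution, the function name and the default parameter values are assumptions made for illustration.

```python
def kalman_update(x_pred, P_pred, y, H=1.0, R=1.0):
    """One scalar measurement update; returns new estimate, new variance, innovation."""
    innovation = y - H * x_pred           # observation minus predicted observation
    S = H * P_pred * H + R                # variance of the innovation
    gain = P_pred * H / S                 # Kalman gain
    x_new = x_pred + gain * innovation    # estimate is corrected in proportion to the innovation
    P_new = (1.0 - gain * H) * P_pred     # uncertainty shrinks after the measurement
    return x_new, P_new, innovation
```

In the Durbin–Willshaw picture, the innovation plays the role of the spring length d(i, μ) between a model node and an observed city.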
Elevating the true and estimated locations of the cities to be vis-
ited in the Durbin–Willshaw setup onto a Riemann surface allows the
cities to be connected by a simple non-self-intersecting path with only
modest changes in the direction of motion after visiting each city.
A crucial observation is that this in turn allows one to obtain
Kailath’s innovation, defined as the difference between estimated
and model histories for the salesman’s path, as a smooth function
on the surface without the topological obstructions to finding the
optimal itinerary for the salesman that arise if one has to
examine all permutations of the order in which the salesman visits
the cities. If a topologically nontrivial itinerary for the salesman
defined on a plane is lifted to a Riemann surface with sufficiently large
genus, i.e. number of “donut holes” (this is always possible [80]), then
the use of Monte Carlo relaxation to find a good approximation to
the optimal path, even with a large number of cities, would become
straightforward.
Chapter 3

Ockham’s Razor

3.1. Bayesian Searches

The modern-day impetus for the development of information-based
Bayesian learning was provided by the WWII problem of
locating submarines [122]. A general Bayesian search is defined by
a sequence of decisions $U_N \equiv \{x_{i+1} - x_i,\ i = 1, \ldots, N-1\}$
leading to a sequence $X_N \equiv \{x_i,\ i = 1, \ldots, N\}$ of compact
subsets in a d-dimensional space that will be interrogated for the
presence of the object at locations $x_i$. The possible locations $x^*$
of this object are indexed by a variable $\mu = 1, \ldots, M$. The
interrogations result in a sequence of measurements
$Y_N \equiv \{y_k,\ k = 1, \ldots, N\}$ that are intended to determine at
each step n whether the sought-after object is present in a localized
setting. In reality the result of the observation $y_n$ at step n only
determines probabilistically whether $x_n$ is the location, i.e.
$x_n = x^*$. Based on the collection $Y_n$ of measurements gathered during
the first n steps of a search and the probability density $p_{n-1}(x)$,
one can estimate the posterior probability density
$p_n(x) \equiv p_n(x = x^* \mid X_n, Y_n, p_{n-1})$ that at step n the
sought-after object is located at $x = x^*$:

$$p_n(x^* = x \mid Y_n, U_{n-1}, p_{n-1}) = \frac{p(Y_n \mid U_n, x^* = x, p_{n-1})\; p_{n-1}(x^* = x \mid Y_{n-1}, U_{n-1})}{p(Y_n \mid U_n)}. \tag{3.1}$$
Of course, future decisions are needed only if the object isn’t
detected.


A simple real-world search problem that can serve as an introduction
to the general problem of Bayesian model selection is the problem of
searching in two dimensions for an object $X^*$ located
at a position, say $x^*$. For example, one could be searching for the
location of an airplane that crashed in some region of the Atlantic
Ocean. This classic Bayesian search is characterized by a series of
compact areas {An }, and associated answers to the question “Is X ∗
located inside An ?” In order to inject an element of realism, it is fur-
ther assumed that the answers to the questions regarding the location
of the object are not communicated in an unambiguous fashion, but
only as measurement results {Yn } which are probabilistically related
to the true answers $t_n = 1$ ($x^* \in A_n$) or $t_n = 0$ ($x^* \notin A_n$).
We are interested in the posterior probability densities
$p_n(x^* \mid D_{n-1})$ for finding
the object in the vicinity of location x∗ after n-steps, given a dataset
Dn−1 ≡ {(A1 , Y1 ), . . . , (An−1 , Yn−1 )} that has been gathered regard-
ing the location of the object prior to step n. The theory of Bayesian
searches is based on the use of Bayes’s formula, Eq. (1.1), to eval-
uate at each step of the search the conditional probability density
pn (x∗ |Xn , Yn ). In the context of our search problem, the explanation
α being sought is the location x∗ of X ∗ , while the input data {d}
consists of the sequence {(A1 , Y1 ), . . . , (An−1 , Yn−1 )} of prior choices
for search locations and measurement results. The prior probability
p(α) for finding X ∗ in An is pn−1 (An ) ≡ ∫An pn−1 (x|Dn−1 )dx; i.e.
the probability that X ∗ is located in An given the information gath-
ered up to the time of the previous step. The conditional probability
P (d|α) is the probability for obtaining a response y if the search
volume is An and X ∗ is located at x∗ :

$$P(Y_n = y \mid A = A_n, x^*) = \begin{cases} f_1(y), & t_n = 1 \\ f_0(y), & t_n = 0. \end{cases} \tag{3.2}$$
The probability distributions $f_0(y)$ and $f_1(y)$ for the $Y_n$ values
will in general be different in the two cases $t_n = 1$ ($x^* \in A_n$) or
$t_n = 0$ ($x^* \notin A_n$), and depend on n, but as a simplification, we
assume that these probability distributions are independent of n. With
these assumptions Bayes’s formula for the posterior probability density
distribution for the location of $X^*$ after n search steps,
$P_n(x^* \mid D_n) \equiv p_n(x^*)$, becomes
$$p_n(x^*) = \frac{P(Y_n = y \mid A = A_n, x^*)\; p_{n-1}(x^*)}{f_1(y)\, p_{n-1}(A_n) + f_0(y)\,(1 - p_{n-1}(A_n))}. \tag{3.3}$$
The search problem corresponding to (3.3) can be visualized as a
tree of observations and decisions.
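
A single step of this search can be sketched on a discretized search area. The grid, the Gaussian forms chosen for f1 and f0 and all names below are illustrative assumptions; the update itself follows Eqs. (3.2) and (3.3).

```python
import numpy as np

def gaussian_pdf(y, mean, sigma=1.0):
    """Normal density, used here as a stand-in for the response distributions."""
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def f1(y):
    """Assumed response density when the object is inside the searched region."""
    return gaussian_pdf(y, mean=1.0)

def f0(y):
    """Assumed response density when the object is outside the searched region."""
    return gaussian_pdf(y, mean=0.0)

def search_update(prior, in_region, y):
    """Posterior over grid cells after observing y for the searched region, Eq. (3.3)."""
    p_A = prior[in_region].sum()                        # prior mass inside A_n
    likelihood = np.where(in_region, f1(y), f0(y))      # Eq. (3.2), cell by cell
    evidence = f1(y) * p_A + f0(y) * (1.0 - p_A)        # denominator of Eq. (3.3)
    return likelihood * prior / evidence

# Uniform prior on a 10 x 10 grid; the searched region A_n is the left half.
prior = np.full((10, 10), 0.01)
in_region = np.zeros((10, 10), dtype=bool)
in_region[:, :5] = True
posterior = search_update(prior, in_region, y=1.2)
```

A response closer to the mean of f1 than to that of f0 shifts probability mass into the searched region, while the posterior remains normalized.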
Since the results of our efforts are expressed as probabilities, it
is natural to inquire how much information has been gained about
the location of the object after N steps of choosing subsets of Rd
and observing whether the object lies within the chosen volume.
A natural measure of the information gathered after n steps is minus
the Shannon entropy:

$$H(p_n) = -\int p_n(x) \log p_n(x)\,dx, \tag{3.4}$$

where the integral extends over all space and the log is base 2.
The Shannon entropy, Eq. (3.4), is a measure of the progress for
Bayesian searches, optimal control, and reinforcement learning. All
three of these types of machine learning problems can be charac-
terized as the problem of finding a policy for choosing the sequence
of actions so that the Shannon entropy H(pN ) is minimized after
N -steps. Bellman’s optimization of this cost function introduced in
his paper on dynamic programming [20] minimizes at each step the
entropy (3.4):

$$V(p, n) = \min_\pi E_Y[H(p_N) \mid p = p_n], \tag{3.5}$$

where the index π denotes a particular choice for {A1 , . . . , An−1 }.


It turns out that one can also rephrase search optimization as the
problem of finding the choice for An , given a previous choice for
{A1 , . . . , An−1 } that leads to the largest expected decrease in the
Shannon entropy. As shown in an important paper by Jedynak,
Frazier, and Sznitman [28], the optimal solution for Bayesian search
problems can also be obtained by optimizing the differential increase
ΔI(pn ) in the information regarding the location x∗ of the object
X ∗ when going from step n to n + 1. As shown in [19], the expected
increase in information is

$$\Delta I(x^*, Y_n) = H(p_n) - E_{Y_n}[H(p_{n+1}) \mid A_n = A, p_n], \tag{3.6}$$

where the second term on the r.h.s. is the expected information
available after step n + 1 given the choice $A_n = A$, assuming that the
prior information is the information contained in the posterior
probability density $p_n(x)$. The l.h.s. of (3.6) also represents the
mutual information between the conditional distributions for $x^*$ and
$Y_n$. This mutual information can also be written in terms of the
information contained in the probability density for $Y_n$:
$$\Delta I(x^*, Y_n) = H(Y_n \mid A_n = A, p_n) - E_{Y_n}[H(Y_n \mid x^*, A_n = A, p_n)]. \tag{3.7}$$
Using Eq. (3.2), this can also be written in the form
$$\Delta I(x^*, Y_n) = H[f_1\, p_n(A) + f_0\,(1 - p_n(A))] - p_n(A)\, H(f_1) - (1 - p_n(A))\, H(f_0), \tag{3.8}$$
where pn (x) is the posterior probability density for finding x ∗ near
location x , and f0 (y) and f1 (y) are the probability distributions for
the yn values in the two cases (0) x ∗ is not near to x or (1) x ∗ is
near to x . The right side of (3.8) is a concave function of pn (A),
and therefore has a unique maximum as a function of pn (A) at say
$p_n(A) = p^*$. A central result of Bayesian search theory is that the
optimal search strategy is to choose the sequence $\{A_1, \ldots, A_n\}$ in
such a way that at each step $p_n(A) = p^*$. With this choice for $p_n(A)$,
the information regarding the location of $X^*$ increases at each step by
a constant $C^* \equiv \max \Delta I$:
$$E[H(p_{n+1}) \mid A_n] = H(p_n) - C^*. \tag{3.9}$$
After n steps out of a total of N steps, the value of the optimal search
can be expressed as
$$E[H(p_N) \mid p_n] = H(p_n) - (N - n)\,C^*. \tag{3.10}$$
Thus, we see that the optimal search strategy is characterized by
constant information gain at each stage of the search. By analogy
with communication theory, the constant C ∗ is sometimes referred to
as the channel capacity for the search. What is especially noteworthy
though is that (3.10) is essentially the same as the result obtained
from Bellman’s recursion relation for the cost function in his dynamic
programming approach to optimal control.
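
The existence of an optimal mass $p^*$ and of the constant $C^*$ can be illustrated for a discrete response. The sketch below assumes a binary response that reports the true answer with error probability 0.1 (a symmetric channel); this choice, the grid over $p_n(A)$ and the function names are illustrative assumptions. The expected gain is evaluated as the entropy of the mixed response distribution minus the average entropy of f1 and f0, weighted by $p_n(A)$ and $1 - p_n(A)$.

```python
import numpy as np

def entropy_bits(dist):
    """Shannon entropy (base 2) of a discrete probability distribution."""
    dist = np.asarray(dist, dtype=float)
    nonzero = dist[dist > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

def expected_information_gain(p_A, f1, f0):
    """Mutual information between the true answer and the response, for region mass p_A."""
    f1 = np.asarray(f1, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    mixture = p_A * f1 + (1.0 - p_A) * f0
    return entropy_bits(mixture) - p_A * entropy_bits(f1) - (1.0 - p_A) * entropy_bits(f0)

eps = 0.1                       # assumed probability that the response is wrong
f1 = [eps, 1.0 - eps]           # P(y=0), P(y=1) when the object is inside A
f0 = [1.0 - eps, eps]           # P(y=0), P(y=1) when the object is outside A

p_grid = np.linspace(0.0, 1.0, 1001)
gains = np.array([expected_information_gain(p, f1, f0) for p in p_grid])
p_star = p_grid[np.argmax(gains)]   # mass of the region that maximizes the gain
capacity = gains.max()              # C*, in bits per search step
```

For this symmetric case the optimum is to split the posterior mass in half, and $C^*$ equals the capacity of a binary symmetric channel with the same error rate.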
If one defines an optimal search strategy $\pi_n$ as the choice of
locations that at each step minimizes the Bellman cost function:
$$V(p, n) \equiv \min_\pi E^{\pi_n}[H(p_N) \mid p_n = p], \quad n = 0, \ldots, N, \tag{3.11}$$
where $\pi_n$ is a choice for a sequence $\{A_1, \ldots, A_n\}$, then it
follows from the theory of Markov processes that V(p, n) satisfies the
recursion relation [20]
$$V(p, n) = \min_A E[V(p_{n+1}, n + 1) \mid A_n = A, p_n = p], \quad n < N. \tag{3.12}$$

This equation is essentially identical with Bellman’s original dynamic
programming relation [20]. One way of showing that the strategy
$p_n(A) = p^*$ is optimal is to use the recursion relation (3.11) to
show that this choice allows one to attain the minimum in (3.12).
It follows from Eq. (3.12) that the universal strategy for Bayesian
searches is that one attempts to optimize the search by maximizing
at each stage of the search the decrease in the Shannon entropy
$H(p_n) = -\int p_n(x) \log p_n(x)\,dx$ associated with the posterior
conditional probability $p_n(x)$ for finding the object at various locations.
Thus, the optimal search strategy can be characterized as the prob-
lem of finding a policy for choosing a control sequence so that the
amount of Shannon information gathered at each step is maximized.
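
The greedy policy described above can be sketched for a one-dimensional search with binary, symmetric response noise, for which the optimal region mass is 1/2. Everything specific in this sketch (the cell grid, the choice of $A_n$ as the cells up to the posterior median, the noise model and the function name) is an assumption made for illustration.

```python
import numpy as np

def noisy_bisection_search(n_cells, target, n_steps, eps, rng):
    """Greedy search that queries, at each step, a region holding about half the posterior mass."""
    posterior = np.full(n_cells, 1.0 / n_cells)
    for _ in range(n_steps):
        # A_n = cells up to the posterior median, so that p_n(A_n) is close to 1/2.
        in_A = np.cumsum(posterior) <= 0.5 + 1e-12
        truth = bool(in_A[target])
        # Noisy response: the true answer is flipped with probability eps.
        y = truth ^ (rng.random() < eps)
        f1 = 1.0 - eps if y else eps        # P(y | object inside A_n)
        f0 = eps if y else 1.0 - eps        # P(y | object outside A_n)
        posterior = np.where(in_A, f1, f0) * posterior
        posterior /= posterior.sum()        # same normalization as in Eq. (3.3)
    return posterior

rng = np.random.default_rng(0)
noise_free = noisy_bisection_search(64, target=37, n_steps=6, eps=0.0, rng=rng)
noisy = noisy_bisection_search(64, target=37, n_steps=50, eps=0.1, rng=rng)
```

Without noise the policy reduces to ordinary bisection and gains exactly one bit per step, so 64 cells are resolved in 6 steps. With noise the gain per step is smaller and the posterior stays spread over several cells for longer.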
The explicit dependence of the state of a system or environment
on the history of control actions can be exhibited by recursive use of
Eq. (3.1), yielding

$$p(x_N = x^* \mid U_{N-1}, Y_N) = \frac{\int \mathcal{D}x_k \prod_{k=1}^{N} p(Y_k \mid U_{k-1}, x^*, p_{k-1})\; p_1(x^* = x)}{p(Y_N \mid U_N)}. \tag{3.13}$$
In a completely analogous manner to the way the posterior proba-
bility for a Bayesian search p(x|D) was obtained by integration over
all possible values of an interpolation function for input data labels,
the posterior probability that after N + 1 steps the new state will
be x N + ΔxN can be obtained by integrating over an interpolation
function U (x) for the controls

$$p(\Delta x \mid D_{N-1}, x_N) = \int \mathcal{D}U(x)\; p(\Delta x \mid U(x), x_N)\; p(U(x) \mid D_{N-1}). \tag{3.14}$$

Equation (3.14) by itself doesn’t determine an optimal control or RL
strategy. In order to achieve this, one must introduce an optimization
principle like Eq. (3.12). This brings us back to Bellman’s value
function Eq. (3.11), where $\pi_n$ is a “policy” for choosing a sequence
of control decisions $U_M = \{u_n\}$. It follows from the theory of Markov
processes that V(p, n) satisfies the recursion relation Eq. (3.12), which
is equivalent to Bellman’s original dynamic programming formalism.
Unfortunately, as with Bellman’s original dynamic programming
equations, this equation is difficult to solve.

3.2. A Tale of Two Costs

Although the goal of Bayesian pattern recognition is the optimization
of the posterior probability defined in Bayes’s formula Eq. (1.1), a
widely used approximation is choosing the pattern that provides the best
explanation for the data based on the product of p(α) and p(d|α).
This approximation, known as the ML method [6], focuses on maximizing
this product, or equivalently on minimizing the following negative
logarithm, which serves as the “likelihood function”:

$$L(\alpha) = -\log[p(d \mid \alpha)\, p(\alpha)]. \tag{3.15}$$

The maximum likelihood method is often quite useful for what


MacKay calls “the first level of inference” where one assumes that
one has a model for the data that represents the underlying truth,
and the data analysis task is to find the parameters for the model
that provide the best fit for the given set of input data. However,
in situations where different types of explanations for a particular
dataset are possible, e.g. the dilemma faced by the animals in the
wild trying to sense whether a predator is near, a more sophisticated
method must be employed. Finding a path to a more sophisticated
approach to problems such as those faced by animals in the wild will
be one of our main focuses for the remainder of the book. (A better
method is apparently available to mammals thanks to evolution, but
alas it is not yet available to data science.) At the level of Bayesian
model selection, one is often faced with the task of choosing from
possibly a multitude of models {Hi } which are all possible a priori
explanations for the data, the one that is the best fit for observations.
The posterior probability for the correctness of a model is propor-
tional to the product of the evidence of the model and the prior
probability of the model:

$$P(H_i \mid D) \propto P(D \mid H_i)\, P(H_i).$$

If all the a priori probabilities are roughly the same, then MacKay’s
prescription is to turn to the evidence $P(D \mid H_i)$ in order to rank
the plausibility of each model. As in the ML method, the relative
likelihood of two different models is
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)}\,\frac{P(H_1)}{P(H_2)}. \tag{3.16}$$

The ratio of prior probabilities on the r.h.s of Eq. (3.16) allows one
to input one’s personal judgement regarding the relative elegance or
simplicity of the two models. However, MacKay [6] points out that
the ratio of the evidence factors in Eq. (3.16) allows one to assay
the relative simplicity of two models in a way that is independent of
subjective judgments. Moreover, the evidence factor for any model
can be estimated from the way the model parameters needed for
the given dataset are distributed relative to the ML value for these
parameters:

$$P(D \mid H_i) \cong P(D \mid w_{ML}, H_i)\, P(w_{ML} \mid H_i)\, \sigma_{w|D}, \tag{3.17}$$

where $\sigma_{w|D}$ is the width of the distribution of the model
parameters needed for a given dataset. The product of the last two
factors on the r.h.s. of Eq. (3.17) was christened “Occam’s Razor factor”
by MacKay [6]. (In the literature on medieval science (see e.g. [85]) the
place where William, the discoverer of this principle, lived is spelled
Ockham, which is the spelling we have adopted.)
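
The evidence estimate (3.17) can be compared with a direct numerical integration in a toy setting. The sketch below assumes a one-parameter model in which the data are Gaussian with unknown mean w and known noise level, with a flat prior on w over a finite interval; the data values, the prior width and the function names are hypothetical. The factor sqrt(2π) from the Gaussian integral is written out explicitly here, whereas in (3.17) it can be regarded as absorbed into the definition of the width.

```python
import numpy as np

def log_likelihood(data, w, sigma):
    """Log of P(D | w) for independent Gaussian data with mean w and noise sigma."""
    data = np.asarray(data, dtype=float)
    return float(-0.5 * np.sum((data - w) ** 2) / sigma**2
                 - data.size * np.log(sigma * np.sqrt(2.0 * np.pi)))

def evidence_occam(data, sigma, prior_half_width):
    """Best-fit likelihood times the Occam factor, in the spirit of Eq. (3.17)."""
    data = np.asarray(data, dtype=float)
    w_ml = data.mean()                                   # maximum likelihood parameter
    sigma_w_given_d = sigma / np.sqrt(data.size)         # posterior width of w
    prior_density = 1.0 / (2.0 * prior_half_width)       # flat prior P(w_ML | H)
    occam_factor = prior_density * np.sqrt(2.0 * np.pi) * sigma_w_given_d
    return float(np.exp(log_likelihood(data, w_ml, sigma)) * occam_factor), float(occam_factor)

def evidence_numeric(data, sigma, prior_half_width, n_grid=20001):
    """Evidence obtained by summing likelihood times prior over a fine grid in w."""
    w_grid = np.linspace(-prior_half_width, prior_half_width, n_grid)
    dw = w_grid[1] - w_grid[0]
    like = np.array([np.exp(log_likelihood(data, w, sigma)) for w in w_grid])
    return float(np.sum(like) * dw / (2.0 * prior_half_width))

data = [0.9, 1.1, 1.3, 0.7, 1.0]     # hypothetical measurements
approx, occam = evidence_occam(data, sigma=0.5, prior_half_width=10.0)
exact = evidence_numeric(data, sigma=0.5, prior_half_width=10.0)
```

The Occam factor is the ratio of the posterior width to the prior width, so a model whose prior allows a much wider range of parameter values than the data require is penalized automatically.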
In contrast with the ML method [6], the necessity for considering
many possible models or even a single model with multiple parameters
as an explanation for a pattern recognition or state estimation
problem can render model selection intractable — at least in real
time — with conventional computational resources. A simple way
to visualize the extra complexity introduced by multiple models is
provided by the linked orbit problem: instead of just determining
the orbital parameters of individual astronomical objects, suppose
that there is ambiguity in the observations as to how the objects
in a set of objects should be associated with optical observations of
orbiting objects. The likelihood function for this problem has the
form (3.21). For example, suppose we want to determine the orbital
parameters of N objects distributed among M orbits. Unfortunately,
in this case the parameters linking objects to orbits are not readily
determined as part of the observational input. Assuming that a set of
observations are statistically independent, the likelihood function for
a model where the N objects are assigned a particular set of orbits
specified by linking parameters $k_m^n$ that take values in {0,1} has the
form [78]

$$p(D \mid k_m^n) = \prod_{m=1}^{M} p(D \mid \theta_m, k_m^n)\, p(\theta_m \mid k_m^n), \tag{3.18}$$

where the data D consist of a series of distinct observations of the N
objects and the $\theta_m$ are parameters for the orbit indexed by m. The
authors of [78] point out that introducing parameters like $k_\alpha^i$
reduces a model comparison problem to a parameter estimation problem in
a “product space” of model parameters θ and model selector parameters k.
In principle, improved strategies for Markov Chain Monte Carlo
regression [79] might help, although this would be impractical if
N and M are large, which serves as a reminder of the difficulties one
can encounter when the number of models becomes exponentially
large.
One way to view such problems is to imagine that each model
α in Eq. (3.15) is associated with a set of parameters θ. The free
energy defined in Eq. (1.2) can then be written as
$$F = \sum_\alpha L(\alpha) = -\sum_{\alpha,\theta} \log[p(d \mid \alpha)\, p(\alpha \mid \theta)], \tag{3.19}$$

where L(α) is the likelihood function defined in Eq. (3.15). The exact
answer for the best model is still given by maximizing the probability
defined in Eq. (3.17), and this is equivalent to minimizing the free
energy of the avatar physical system:

$$F(x) = \sum_\alpha \{E_\alpha P(\alpha) - (-P(\alpha) \log P(\alpha))\}. \tag{3.20}$$

If instead of the true probability distributions P(α) and P(x|α) one
uses model probabilities P(α; θ) and P(x|α; θ) depending on param-
eters θ to calculate an approximate probability distribution P (α; θ)
for different classifications of a dataset x, then Eq. (3.3) will no
longer necessarily be satisfied and the free energy F (x, θ) calculated
using the distribution P (α; θ) will in general differ from the true free
energy. The advantages of minimizing the free energy (3.19) vs the
ML method are captured by MacKay’s Occam Razor factor.

Unfortunately, minimizing the free energy (or equivalently maximizing
the evidence factor in Eq. (1.1)) using conventional Monte Carlo
regression techniques is in general frustrated if there are a priori
numerous plausible models for a set of input data. Examples are
problems where M objects are distributed among N locations. In
contrast with the simple Monty Hall problem considered in the last
section, these types of problems are often intractable with
conventional computational resources. Assuming that a set of
observations are statistically independent, the likelihood function for
a set of data can be written [120]:

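$$P\!\left(\{d_n\}_{n=1}^{N} \,\middle|\, \{\theta_\alpha\}_{\alpha=1}^{M}\right) = \prod_{n=1}^{N} \prod_{\alpha=1}^{M} \left[P(d_n \mid \theta_\alpha)\right]^{k_\alpha^n}, \tag{3.21}$$

where the linking parameters $k_\alpha^n$ in Eq. (3.21) are not readily
determined as part of the observational input.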
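
The combinatorial difficulty can be made explicit with a brute-force sketch. Each observation n is linked to exactly one model α (a one-hot choice of the linking parameters), so there are M^N candidate assignments to compare. The Gaussian form of the single-observation likelihood, the scalar "orbit" parameters and all names below are assumptions made for illustration.

```python
import itertools
import numpy as np

def assignment_log_likelihood(observations, orbit_params, assignment, sigma=1.0):
    """Log of the product in Eq. (3.21) when observation n is linked to orbit assignment[n]."""
    obs = np.asarray(observations, dtype=float)
    theta = np.asarray(orbit_params, dtype=float)[list(assignment)]
    return float(np.sum(-0.5 * ((obs - theta) / sigma) ** 2
                        - np.log(sigma * np.sqrt(2.0 * np.pi))))

def best_assignment(observations, orbit_params, sigma=1.0):
    """Exhaustive search over all M**N ways of linking N observations to M orbits."""
    n_obs, n_orbits = len(observations), len(orbit_params)
    candidates = list(itertools.product(range(n_orbits), repeat=n_obs))
    scores = [assignment_log_likelihood(observations, orbit_params, a, sigma)
              for a in candidates]
    return candidates[int(np.argmax(scores))], len(candidates)

# Four hypothetical observations and two hypothetical orbit parameters.
best, n_candidates = best_assignment([0.1, -0.2, 5.1, 4.9], [0.0, 5.0])
```

With 4 observations and 2 orbits there are only 16 assignments, but the count grows as M^N, which is why exhaustive enumeration is not an option for realistic N and M.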
Determining MacKay’s Occam’s razor factor is also closely related to
the problem of constructing “adversarial” networks, i.e. devising an
algorithm which will generate authentic looking input data given a
suitable choice of network parameters. The construction of
adversarial networks that are of practical use in general settings is
currently an active area of research in the data science community.
Here we will focus on the approach of Neal, Hinton, et al., known as the
Helmholtz machine [8,9].

3.3. Hidden Factors and the Helmholtz Machine

We now turn to the problem of selecting data models when hidden


factors are important. Hidden factor analysis [20] is a type of
Bayesian inference that is of considerable intrinsic as well as practical
interest. For example, factor analysis is widely used for optimizing
manufacturing processes and designing experimental trials of drugs
[7,29]. One common reason for the proliferation of models in prac-
tice is the existence of hidden variables, i.e. variables that affect the
current and future state of a system or environment, but are not
explicitly recognized as input data or an external influence. In eco-
nomic terms, the hidden variable problem can be very important,
e.g. hidden factor analysis is widely used to design experiments to
test the efficacy of experimental drugs.

The linear hidden factor analysis problem is usually formulated as a
linear matrix equation of the form [122]
$$x = g y + \nu, \tag{3.22}$$
where x is a vector of N measurable quantities and y is a vector of
hidden “factors”. These are the unrecognized factors that for example
could affect the reliability of a product or the effectiveness of a drug.
The k x m matrix g contains the parameters for generating the hid-
den factors from input data. The learning problem is how to estimate
the effect of the hidden factors on the results of experiments. Indeed,
if in addition to the familiar state variables XN hidden factors are
involved, it is mathematically inconsistent to just include the visi-
ble variables in a Bayesian analysis of the likelihood of an observed
outcome. Instead, one must maximize the information value of the
evidence including possibly many hidden factors:
$$C(\{Y_N\}) = -\sum_{n=1}^{N} \log[p(Y_n \mid U_n)]. \tag{3.23}$$

Initial attempts to solve hidden factor problems were based on the ad
hoc assumption that the outcome is a linear or quadratic function of
the hidden parameters, and this type of factor analysis is still widely
used, for example, in drug testing and industrial process optimiza-
tion. The importance of factor analysis invited the development of
more rigorous methods for solving these types of problems. In 1977,
an algorithm called the expectation maximization (EM) algorithm
[29] emerged that provided a systematic approach to all types of
model selection problems, including problems with hidden factors.
The hidden variables are typically assumed to be Gaussian random
variables, while the EM algorithm proceeds in two steps: (1) An
E-step that finds the probability distributions for the hidden vari-
ables based on current estimates for the model parameters, and (2)
an M-step that updates the matrix g in Eq. (3.22) as well as the
variances of the noise components using the log-likelihood for the
observed
data. Unfortunately, the matrix operations involved in using the EM
method don’t appear to be a plausible model for learning in the brain.
This led Dayan, Hinton, Neal, and Zemel to introduce a statistical
mechanics-like model [8] for how the human brain might construct
models for sensory data, which they called the “Helmholtz machine”.
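
The two EM steps can be sketched for the linear model (3.22) with Gaussian hidden factors. This is a minimal illustration using the standard factor analysis updates, with zero-mean data, a diagonal noise covariance and synthetic data as assumptions; the function names, dimensions and parameter values are hypothetical, and the variances updated in the M-step are taken here to be those of the noise term ν.

```python
import numpy as np

def fa_log_likelihood(X, g, psi):
    """Log-likelihood of zero-mean data under x = g y + nu, y ~ N(0, I), nu ~ N(0, diag(psi))."""
    n, d = X.shape
    C = g @ g.T + np.diag(psi)                    # model covariance of x
    _, logdet = np.linalg.slogdet(C)
    S = X.T @ X / n                               # sample second-moment matrix
    return float(-0.5 * n * (d * np.log(2.0 * np.pi) + logdet
                             + np.trace(np.linalg.solve(C, S))))

def fa_em_step(X, g, psi):
    """One EM iteration; returns the updated loading matrix and noise variances."""
    n, _ = X.shape
    k = g.shape[1]
    C = g @ g.T + np.diag(psi)
    # E-step: posterior moments of the hidden factors for the current parameters.
    beta = np.linalg.solve(C, g).T                # g^T C^{-1}, using the symmetry of C
    Ey = X @ beta.T                               # rows are E[y | x_n]
    Eyy = n * (np.eye(k) - beta @ g) + Ey.T @ Ey  # sum over n of E[y y^T | x_n]
    # M-step: re-estimate g and the noise variances from the expected statistics.
    g_new = (X.T @ Ey) @ np.linalg.inv(Eyy)
    S = X.T @ X / n
    psi_new = np.diag(S - g_new @ (Ey.T @ X) / n).copy()
    return g_new, psi_new

# Synthetic data from one hidden factor (all values hypothetical).
rng = np.random.default_rng(0)
g_true = np.array([[1.0], [0.5], [-0.7], [0.2]])
X = rng.standard_normal((500, 1)) @ g_true.T + 0.3 * rng.standard_normal((500, 4))

g_est, psi_est = 0.1 * rng.standard_normal((4, 1)), np.ones(4)
log_liks = [fa_log_likelihood(X, g_est, psi_est)]
for _ in range(20):
    g_est, psi_est = fa_em_step(X, g_est, psi_est)
    log_liks.append(fa_log_likelihood(X, g_est, psi_est))
```

Each iteration requires solving a linear system with the full model covariance, which is the kind of matrix operation referred to above as implausible for learning in the brain.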

In contrast with the much simpler problem of choosing the


parameters for a single GP, the 1994 Helmholtz machine papers of
Hinton, Neal et al. [8,9] were the first papers to provide a definite
scheme as to how one might construct a minimum description length
(MDL) data model that takes into account hidden factors. (The
name Helmholtz machine is a tip of the hat to Helmholtz’s intuition
that the human brain can estimate probabilities.) In effect, the
expectation maximization method is replaced by a gradient descent
method. They were
able to completely side-step the usual problem of a priori defining a
framework for data models by using the same structure for the “gen-
erative” network as the “recognition” network, and in alternating
epochs using the excitations in one network to modify the parame-
ters in the other network. This procedure, known as the “wake–sleep”
algorithm [9], had the miraculous properties that it could simulta-
neously be interpreted as the minimization of the free energy of the
two networks regarded as physical stochastic systems, and deliver a
probabilistic representation for the data that satisfied the
Mumford–Rissanen MDL principle. Furthermore, in the MDL limit the con-
ditional probabilities connecting observed data and model parame-
ters automatically satisfied Bayes’s formula Eq. (1.1). This symmetry
between the recognition and model generation networks provides the
inspiration for much of what follows.
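
The alternation between the two networks can be sketched for the smallest possible case. The sketch below assumes a single layer of binary hidden units, logistic activation probabilities and simple delta-rule updates, and it interleaves a wake and a sleep update for every data vector rather than alternating whole epochs; these choices, together with all names and sizes, are illustrative assumptions rather than the algorithm as specified in [9].

```python
import numpy as np

def sigmoid(a):
    """Logistic function giving the activation probability of a binary unit."""
    return 1.0 / (1.0 + np.exp(-a))

def wake_sleep_epoch(data, params, lr, rng):
    """One pass over the data; params are updated in place and returned."""
    R, c, G, b_v, b_h = params   # recognition weights/bias, generative weights/bias, hidden prior bias
    for v in data:
        # Wake phase: the recognition network explains the data and the
        # generative network is adjusted to reproduce what was observed.
        q = sigmoid(R @ v + c)
        h = (rng.random(q.shape) < q).astype(float)
        p_v = sigmoid(G @ h + b_v)
        G += lr * np.outer(v - p_v, h)
        b_v += lr * (v - p_v)
        b_h += lr * (h - sigmoid(b_h))
        # Sleep phase: the generative network produces a fantasy and the
        # recognition network is adjusted to recover its hidden cause.
        h_f = (rng.random(b_h.shape) < sigmoid(b_h)).astype(float)
        v_f = (rng.random(b_v.shape) < sigmoid(G @ h_f + b_v)).astype(float)
        q_f = sigmoid(R @ v_f + c)
        R += lr * np.outer(h_f - q_f, v_f)
        c += lr * (h_f - q_f)
    return params

rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 20, dtype=float)   # two toy binary patterns
params = (np.zeros((2, 4)), np.zeros(2), np.zeros((4, 2)), np.zeros(4), np.zeros(2))
for _ in range(10):
    params = wake_sleep_epoch(data, params, lr=0.1, rng=rng)
```

Both networks share the same layered structure, and each update uses only the activity of the units at the two ends of a connection, which is what makes the scheme attractive as a model of learning without external supervision.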

Fig. 3.1. Helmholtz Machine recognition + data generation networks [8].



Use of the wake–sleep algorithm to solve practical problems has


been severely limited because of the difficulties of calculating con-
ditional probabilities with existing conventional computers. One of
our goals in this book is to inquire whether there is a quantum reso-
lution of the Helmholtz machine difficulties. As will be discussed in
Chapter 7, there are reasons for thinking that the quantum dynamics
of two interacting oscillator arrays representing the $X_N$ and
UN variables might provide an alternative framework for implement-
ing a Helmholtz machine-like approach to hidden factors and model
selection.
Despite their practical successes, deterministic back-propagation
neural networks have the drawback that in general they neither calcu-
late the probability that the explanation they find for the input data
is correct, nor provide a guarantee that the solution is stable against
small changes in the input data. Moreover, the learning algorithm
for layered neural networks (see Appendix A) does not map in any
obvious way onto how the mammalian brain uses self-organization of
neurons to extract cognitively significant information from sensory
inputs. The Helmholtz machine is a promising alternative, with the
added advantage of being able to generate probabilities for alterna-
tive explanations for sensory data without external supervision. The
internal degrees of freedom of both the recognition and model gen-
eration networks consist of layers of “Ising spins” whose activation
levels {si } are either 0 or 1. The Ising spins in the lowest layer are
used to represent input data, while the probabilistic activation levels
of spins in the “hidden” layers represent the “explanations” for the
input data. The probabilistic activation of the Ising spins in the final
“output” layer (cf. Fig. 3.1) represents the explanation. The informa-
tion cost for describing an explanation (α) with this spin ensemble
in either the recognition or generative networks is

$$C(\{s_j^l\}) = \sum_l \sum_j \left[ -s_j^l(\alpha)\log p_j^l(\alpha) - \left(1 - s_j^l(\alpha)\right)\log\left(1 - p_j^l(\alpha)\right) \right], \tag{3.24}$$

where the slj (α) are the spin excitations in the l’th layer. Because the
excitations are stochastic variables, running the recognition network
many times over generates a probability distribution Qα . According
to Mumford–Rissanen [26], the best explanation minimizes the total

cost C(α):

$$C(\alpha) = -\sum_\alpha \left[ Q_\alpha E_\alpha + Q_\alpha \log Q_\alpha - Q_\alpha \log(P_\alpha / Q_\alpha) \right], \tag{3.25}$$

where the probability distribution Qα is inferred from the recognition network and the probability distribution Pα is inferred from the data
generation network. The objective of the “wake–sleep” algorithm [9]
is to self-consistently lower the free energy, Eq. (1.2), estimated from
Bayes’s formula to construct MDL representations of sensor data.
During the “wake phase”, the connection strengths of the top-down
model generation network are modified to make the model activation
probabilities for the units in the hidden layers align more closely with
the actual unit activations that are observed when the bottom-up
recognition network is used to interpret examples of input data that
are fed to the first layer. This learning step makes the data gener-
ation “adversarial” network better at constructing realistic models
of the world. In the “sleep phase”, the connection strengths in the
recognition network are modified so that the activation probabili-
ties for the units in the hidden layers of the recognition network
are aligned more closely with the activities of these units that are
observed when the top-down network is used to generate “fantasies”
of the world. Unfortunately, practical applications of the wake–sleep algorithm have been meager because the calculations needed to construct the “bottom-up” and “top-down” representations of the input data require Monte Carlo sampling of the parameters and variables, which cannot be carried out in real time. On the other hand,
the introduction of the Helmholtz machine has led to the concep-
tually important realization that both pattern recognition and RL
problems can in principle be approached as the regression problem
of reducing the KL divergence between Qα and Pα , without the need
for an explicit model for the system or environment.
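As a minimal numerical sketch of the two quantities used above, the description cost of Eq. (3.24) and the KL divergence between Qα and Pα can be evaluated directly for small examples. The function names and the toy numbers are illustrative assumptions and are not part of the wake–sleep algorithm itself; natural logarithms are assumed.

```python
import numpy as np

def description_cost(s, p):
    """Eq. (3.24): cost (in nats) of encoding binary activations s
    when unit j is expected to be active with probability p[j]."""
    s = np.asarray(s, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(-s * np.log(p) - (1.0 - s) * np.log(1.0 - p)))

def kl_divergence(q, p):
    """KL divergence sum_a q_a log(q_a / p_a) between two discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q_a = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Two units with activation probability 1/2 cost log(2) nats each.
cost = description_cost([1, 0], [0.5, 0.5])

# The divergence vanishes only when the two distributions agree.
gap = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

In this toy setting, lowering `gap` by adjusting either distribution is the regression problem referred to in the text.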
The Helmholtz machine also provides another way of viewing
optimal control. Indeed, the Bayesian search problem discussed in
Section 3.1 can also be regarded as a control problem by inter-
preting the choice of subsets {An } as a series of control decisions
Un ≡ {u(1), u(2), . . . , u(n)}, where u(t) is a map from {1, . . . , N } to
compact subsets of R^d, which leads to a convergence in a finite time between the choice of search area and the location of the object. We observe that these conditional probabilities for control actions can be

generated from recursive use of Bayes’s formula by assuming that the


input training data D consists of N pairs (x_n, y_n) of d-dimensional vectors x_n ≡ {x_1^(n), . . . , x_d^(n)} and scalar labels y_n for each vector.
As discussed in the previous section, this training data allows one
to immediately construct an interpolating function z(x) for the set
of measurements, making a prediction for a new measurement result y_{n+1} given a new choice for a compact volume x_{n+1}. If Gaussian
probability distributions are used for all the probability distributions
in Bayes’s formula, then Eq. (3.1) provides a probabilistic prediction
for the outcome of a control action at step N . A step in the direction
of reconciling stochastic estimation based on Bayes’s equation and
Bellman optimization is to notice that Bayes’s formula implies the
posterior probability for finding the system in the desired state is
proportional to the product of forward and backward filtering prob-
ability densities:

$$p_n(x = x^* \mid Y_n, U_{n-1}, p_{n-1}) \propto p(Y_n \mid U_n, X^* = x, p_{n-1})\; p_{n-1}(x^* = x \mid Y_{n-1}, U_{n-1}).$$

In this way, we recover Todorov’s formula, Eq. (1.3), for the Evidence
factor. Of course, in both quantum mechanics and Todorov’s formu-
lation of control theory, the devil lies in the difficulty of taking into
account many alternative paths in a path integral representation for
the Bellman function.
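The recursive use of Bayes's formula described above, with the posterior proportional to the likelihood of the latest observation times the previous posterior, can be sketched for a discrete search problem. This is only an illustration under stated assumptions: the object sits in one of a few cells, each control action searches one cell, and the detection probability `p_detect` is a hypothetical parameter not taken from the text.

```python
import numpy as np

def search_update(prior, searched_cell, p_detect):
    """Posterior over the object's location after an unsuccessful search.

    posterior(x) is proportional to p(no detection | x, action) * prior(x)."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.ones_like(prior)
    likelihood[searched_cell] = 1.0 - p_detect  # a miss is unlikely if the object is there
    posterior = likelihood * prior
    return posterior / posterior.sum()

# Start from a uniform belief over four cells and apply a series of control decisions.
belief = np.full(4, 0.25)
for cell in (0, 1, 0):
    belief = search_update(belief, cell, p_detect=0.8)
```

Each pass through the loop plays the role of one step n, with the previous posterior serving as the new prior.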
Chapter 4

Control Theory

4.1. The Hamilton–Jacobi–Bellman Equation

When the state variables are supplemented by control variables, then


Bellman’s dynamic programming condition, Eq. (2.10), becomes the
Hamilton–Jacobi–Bellman (HJB) equation [21,87]:
 
$$\frac{\partial V^*(X,t)}{\partial t} = \min_{u(t)}\left[ l(x,u,t) + \dot{x}(t,x)\,\frac{\partial V^*(X,t)}{\partial x} \right], \tag{4.1}$$
so named because of its similarity to the classical Hamilton–Jacobi
equation [88]. In (4.1), V* is the optimum value of the Bellman cost function, and l(x, u, t) is the notation we shall use from now on for the rate of change of the Bellman cost function along a controlled path of the system in state space. In the case of “linear-quadratic” control theory [19], l(x, u) = q(x) + ½ u^T R u, and Eq. (4.1) becomes
 
$$\frac{\partial V^*(X,t)}{\partial t} = \min_{u(t)}\left[ u^T R u + x^T Q x + \dot{x}(t,x)\,\frac{\partial V^*(X,t)}{\partial x} \right]. \tag{4.2}$$

If one assumes that the Bellman function is a quadratic form xT Sx,


and that Kalman’s linear dynamical law ẋ = Ax + Bu describes the
passive plus controlled dynamics, the HJB equation becomes

$$x^T \dot{S} x = -\min\left[ u^T R u + x^T Q x + 2 x^T S A x + 2 x^T S B u \right], \tag{4.3}$$

where the capital letters stand for time-dependent matrices acting


on the (x, u) vector space. If we minimize the r.h.s with respect to


u, we obtain the optimum control

$$u^* = -R^{-1} B^T S x(t). \tag{4.4}$$

This formula for the control variable, which sensibly depends on the deviation of the state variables from the passive dynamics, is useful in many circumstances of practical interest. The noise variance R is a result of intrinsic noise in the system dynamics, and as we
shall see in what follows, increasing this noise is an easy (and often
quite effective) way for an adversary to thwart successful control of
a system. Equation (4.2) can also be written as a nonlinear ordinary
differential matrix equation

$$\dot{S}(t) = -SA - A^T S + S B R^{-1} B^T S - Q. \tag{4.5}$$

The ability to rephrase the Bellman recursion relation for optimal


control as an ordinary differential equation, which can then be solved
with pedestrian numerical integration techniques, is one of the rea-
sons the Kalman filter was so successful.
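To illustrate how pedestrian the numerical integration can be, the following sketch integrates Eq. (4.5) backward in time with a fixed-step Runge–Kutta scheme and then evaluates the control of Eq. (4.4). The terminal condition S(T) = 0, the step count, and the scalar toy problem are assumptions made for illustration only.

```python
import numpy as np

def riccati_rhs(S, A, B, Q, R):
    """Right-hand side of Eq. (4.5): dS/dt = -SA - A^T S + S B R^{-1} B^T S - Q."""
    return -S @ A - A.T @ S + S @ B @ np.linalg.solve(R, B.T @ S) - Q

def integrate_riccati(A, B, Q, R, S_T, T, steps=2000):
    """Integrate Eq. (4.5) backward from S(T) = S_T to t = 0 with classical RK4."""
    dt = -T / steps  # negative step: march from t = T down to t = 0
    S = np.array(S_T, dtype=float)
    for _ in range(steps):
        k1 = riccati_rhs(S, A, B, Q, R)
        k2 = riccati_rhs(S + 0.5 * dt * k1, A, B, Q, R)
        k3 = riccati_rhs(S + 0.5 * dt * k2, A, B, Q, R)
        k4 = riccati_rhs(S + dt * k3, A, B, Q, R)
        S = S + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return S

def optimal_control(S, B, R, x):
    """Eq. (4.4): u* = -R^{-1} B^T S x."""
    return -np.linalg.solve(R, B.T @ S @ x)

# Scalar toy problem: A = 0 and B = Q = R = 1, with S(T) = 0 at T = 10.
A = np.array([[0.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
S0 = integrate_riccati(A, B, Q, R, np.zeros((1, 1)), T=10.0)
u0 = optimal_control(S0, B, R, np.array([2.0]))
```

For this scalar case Eq. (4.5) reduces to dS/dt = S² − 1, whose solution with S(T) = 0 is S(t) = tanh(T − t), so the numerical result can be checked against a closed form.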

4.2. Pontryagin Maximum Principle

The solutions of optimal control problems are often represented as


streamlines in (x, ẋ) space. In principle, these streamlines can be
found by solving the HJB equation, Eq. (4.1), for various initial
values x0 . This equation is similar to the classical Hamilton–Jacobi
equation except that the velocity ẋ is identified as a control variable
u, and the Hamiltonian function that appears in the HJB equation (the “control Hamiltonian”) is minimized w.r.t. u. Just looking at
the minimization of the control Hamiltonian is often sufficient [86,87]
to find an adequate solution to the control problem.
The deterministic limit of Bayesian feedback control or RL is encapsulated in the Pontryagin Maximum Principle, which asserts that the optimum control path is determined by the requirement that the control Hamiltonian in Eq. (4.1) vanishes identically along the control path. In fact, combining the vanishing of the r.h.s. of Eq. (4.6)
with the Eq. (4.13) for λ̇ and replacing u∗ (t) with ẋ∗ (t) does lead
to the usual classical Euler–Lagrange equations, whose solution is
the classical path x∗ (t) for which the value of the action functional

S[x(t)] is stationary w.r.t variations in the path. Although the form


of the classical equations for x and p is preserved when the position
and momentum become quantum operators, the passage from quan-
tum mechanics to classical mechanics by minimizing the phase of the
Feynman path integral involves a subtlety that doesn’t appear in the
textbook treatments of obtaining classical mechanics by minimizing
the Maupertuis action functional [88]. In particular, the usual optimization procedure certainly ensures that as Planck’s constant goes to zero, all the trajectories x(t) in some continuous neighborhood of the
optimum trajectory are close to the optimum trajectory x∗ (t). How-
ever, this doesn’t ensure that the time-dependence of the classical
momentum p∗ (t) emerges in an obvious way. (As noted in the intro-
duction, this greatly puzzled the founders of quantum mechanics.)
The opposite limit of allowing wide deviations in the state of a sys-
tem from the optimal path is to assume that the only paths which
contribute to Eq. (4.9) are paths near the optimal path. Indeed, just as it is often the case in quantum mechanics that only paths near the classical path contribute, it is often the case that the only paths that are important in the sum in Eq. (4.9) are near the optimal path. This path is determined by the
Pontryagin Maximum Principle [86,87], which is the control theory
analog of the Maupertuis principle of least action in classical mechanics. Pontryagin’s principle gives the fundamental necessary condition
for a controlled trajectory to be optimal in the sense of optimizing
the Bellman value function.
A fundamental equation for Bayesian learning is the Bellman
equation, which at least in principle allows one to solve optimal con-
trol problems by minimizing the integral of the rate of change l(x, u)
of the Bellman value function over all possible paths. This equation
is similar in form to the classical Hamilton–Jacobi equation that pro-
vides a geometric optics-like formulation for classical mechanics:
$$\frac{\partial S}{\partial t} = -L(x, \dot{x}, t) - \frac{\partial S}{\partial x}\,\dot{x}(x, t), \tag{4.6}$$
where S(x, t) is the classical action function (cf. [83]). Equation (4.6)
describes the time dependence of S(x, t) along a path x(t). This path
is determined by minimizing S[x(t)] in accordance with Maupertuis’s
principle of least action [88]. The name geometric optics refers to the
fact that in optics the path x(t) of a light ray can also be described as

the normal to a surface of constant phase for a solution of the wave


equation. The limit of geometric optics for, say, light waves is where
the wave nature of light is neglected, and the light is assumed to
propagate along a one-dimensional curve determined by the index
of refraction for the medium it is propagating through. In classical
mechanics, the index of refraction is replaced by a potential, and
the path is determined by the condition that the function L(x, ẋ, t)
(known in classical mechanics as the Lagrangian function) along the
path is stationary with respect to variations in the path. The function
L(x, ẋ, t) also plays an important role in quantum mechanics (cf.
[33]). Indeed the Hamilton–Jacobi equation can also be written in a
way that is familiar from elementary quantum mechanics:
 
$$\frac{\partial S}{\partial t} = -H\left(t, x, -\frac{\partial S}{\partial X}\right), \tag{4.7}$$
where the Hamiltonian function for a particle depends on the position
x and momentum p of the particle. In writing Eq. (4.7), we have used the Dirac equation p = −i ∂S/∂X to eliminate the dependence on p. Thus, Eq. (4.7) is a nonlinear equation for the classical path. In a similar
way in optimal control theory there is an equation for the Bellman
value function, V ∗ (t, x∗ , u∗ ); the HJB equation, which describes the
time dependence of the value function along an optimal control path
(x∗ , u∗ )
$$\frac{\partial V^*}{\partial t} = l(x^*, u^*, t) - \frac{\partial V^*}{\partial x}\, f(x, u^*, t), \tag{4.8}$$
where f plays the role of velocity, and l(x∗ , u∗ , t) is the sum of
a non-stochastic “reward” function that describes how likely the
control action leads to the desired state, and a stochastic KL divergence term. We see from Eq. (4.8) that the rate of change of the Bellman cost function is analogous to the Lagrangian function in classical mechanics. Actually, Pontryagin control theory is mathematically equivalent to classical mechanics, and in some respects it is a better formalism than the usual Euler–Lagrange formalism one finds in textbook accounts of classical mechanics (see e.g. [88]).
An important difference is that the Euler–Lagrange equations
only yield a “weak optimization” for classical paths where x∗ (t) is
the optimum trajectory among trajectories x(t) within some contin-
uous neighborhood of the optimum trajectory. On the other hand,

the Pontryagin Maximum Principle yields a “strong” optimization


x∗ (t) and u∗ (t) where the state variable x(t) and control variable
u(t) are simultaneously within a continuous neighborhood for both
x∗ (t) and u̇∗ (t). This strong minimum can be found by first identi-
fying the control variable u(t) with a velocity field ψ(x, t) = ẋ by an
equation of the form
ψ(x, t) = f (t, x, u). (4.9)
For example, in linear-quadratic regulator theory (cf. Ref. [11]), one
writes
ψ(x, t) = Ax(t) + Bu(t). (4.10)
The optimum velocity field can then be found geometrically [82] by
minimizing the Weierstrass function
$$E(t, x, \psi, y) = l(t, x, y) - l(t, x, \psi) - \frac{\partial l}{\partial x}(t, x, \psi)\,(y - \psi), \tag{4.11}$$
which is a convex function of y. This is equivalent to minimizing the
control Hamiltonian
HC (x, u, λ, t) = l(x, u, t) + λf (x, u, t), (4.12)
where the parameter λ plays much the same role as the momentum in
classical mechanics. The variables x and λ satisfy the usual Hamilton
equations w.r.t the HC . For example, if (x∗ , u∗ ) is an optimal trajec-
tory, the equation of motion of λ is
$$\dot{\lambda} = -\frac{\partial l}{\partial x} - \lambda\,\frac{\partial f}{\partial x}, \tag{4.13}$$
where the derivatives are evaluated at (x∗ , u∗ ).
In linear-quadratic regulator (LQR) theory [21], the reward func-
tion is approximated by a quadratic function in x and/or u. For
example, a model Hamiltonian might be
$$H_C(x, u, \lambda, t) = \frac{1}{2}\left[ (u^T R u) + (x^T Q x) \right] + \lambda\left( A x(t) + B u(t) \right). \tag{4.14}$$
At its minimum value, H_C is insensitive to u:

$$\frac{\partial H_C}{\partial u} = u^{*T} R(t) + \lambda B(t) = 0. \tag{4.15}$$

This “classical” optimum value for the time history of control


variable u∗ (t) leads to the insight that at least for linear-quadratic
regulator theory the control variable is essentially a “momentum”
variable

$$u^*(t) = -R^{-1} B^T \lambda^T. \tag{4.16}$$

Of course, this is a conundrum for quantum mechanics in that in


quantum mechanics the position and momentum of a particle are
treated differently. We will address this enigma in Chapter 7.
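The relation between the control and the “momentum” variable can be checked numerically. The sketch below evaluates the model Hamiltonian of Eq. (4.14), computes the control from the stationarity condition of Eq. (4.15), and confirms that the gradient with respect to u vanishes there. The matrices and vectors are arbitrary illustrative choices, and R is assumed symmetric and positive definite.

```python
import numpy as np

def control_hamiltonian(x, u, lam, A, B, Q, R):
    """Eq. (4.14), with lam treated as a row vector."""
    return 0.5 * (u @ R @ u + x @ Q @ x) + lam @ (A @ x + B @ u)

def stationary_control(lam, B, R):
    """Solve u^T R + lam B = 0 for u, i.e. u* = -R^{-1} B^T lam^T (R symmetric)."""
    return -np.linalg.solve(R, B.T @ lam)

# Illustrative two-state, one-control example (values are assumptions).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[2.0]])
x = np.array([1.0, 0.5])
lam = np.array([0.3, -1.2])

u_star = stationary_control(lam, B, R)
gradient = R @ u_star + B.T @ lam  # transpose of the left-hand side of Eq. (4.15)
h_min = control_hamiltonian(x, u_star, lam, A, B, Q, R)
h_near = control_hamiltonian(x, u_star + 0.1, lam, A, B, Q, R)
```

Because H_C is quadratic in u with a positive definite R, any displacement from `u_star` raises the Hamiltonian, which is why `h_near` exceeds `h_min`.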

4.2.1. The Moon Lander problem


The moon lander problem [87] is a simple example of how the Pontryagin Maximum Principle works in practice, and is characterized by three simple equations:

$$\dot{h} = v, \qquad \dot{v} = -g + \frac{u}{m}, \qquad \dot{m} = -ku, \tag{4.17}$$
where h is the lander’s height above the surface, v is the lander’s
vertical velocity, m is the lander mass (including fuel), u is the
control variable, and g is the acceleration of gravity. The problem
resembles the SO(3) Lie–Poisson problem, in that there are three
momentum variables; the Lagrange multipliers for v, and −g + u/m,
and ku, respectively. The control Hamiltonian consists of a piece pro-
portional to u, and the Σpi xi piece. It is inherent in this problem
that the optimum control is for u∗ = 0 until a certain time, and
then u∗ = 1 until the landing is complete. The time when one should
kick up the thrust to u∗ = 1 can be determined by back-tracking the
solutions to Eq. (4.17) with u∗ = 1 from h = 0 and v = 0 at t = T
to the point where this trajectory in {h, v} space intersects Galileo’s
parabolic trajectory for a falling body starting at some h = h0 and
v = v0 . Of course, the back-tracking can only be continued for a
time = fuel mass/k.
The state of the system is a two-dimensional vector x = (h, v) with equation of motion:

$$\dot{x} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} x + \begin{pmatrix} 0 \\ 1 \end{pmatrix} u, \tag{4.18}$$
while the Hamiltonian has the form

$$H = \lambda_0 + \lambda\left[ \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} x + \begin{pmatrix} 0 \\ 1 \end{pmatrix} u \right]. \tag{4.19}$$

The equations of motion for the Lagrange multipliers λ = (λ1 , λ2 )


that follow from Hamilton’s equation for “momenta” are
 
$$\dot{\lambda} = -\lambda \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}. \tag{4.20}$$

The complete solution to the moon landing problem involves [87]


matching the trajectory for free fall with the controlled trajectory
integrated backward from the moment the lander came to rest.
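The back-tracking construction can be sketched numerically. The code below integrates Eq. (4.17) backward from touchdown at full thrust until the state meets the free-fall curve of a body released from rest at height h0, and then replays the free-fall and burn phases forward as a check. The parameter values (g = 1, k = 0.1, landing mass 0.5, fuel 0.3), the release from rest, and the simple Euler stepping are all assumptions chosen for illustration.

```python
def backtrack_switch(h0, g=1.0, k=0.1, m_landing=0.5, fuel=0.3, dt=1e-4):
    """Integrate Eq. (4.17) backward in time from touchdown (h = 0, v = 0) with
    u = 1 until the state reaches the free-fall curve v**2 / 2 + g * h = g * h0."""
    h, v, m, tau = 0.0, 0.0, m_landing, 0.0
    max_burn = fuel / k  # back-tracking is limited to time = fuel mass / k
    while 0.5 * v * v + g * h < g * h0:
        if tau >= max_burn:
            raise ValueError("not enough fuel to land from this height")
        # Time-reversed Eq. (4.17) with u = 1 (old values are used on the right).
        h, v, m = h - v * dt, v - (-g + 1.0 / m) * dt, m + k * dt
        tau += dt
    return h, v, m, tau

def simulate_landing(h0, m0, t_switch, burn_time, g=1.0, k=0.1, dt=1e-4):
    """Forward Euler integration of Eq. (4.17): u = 0 before t_switch, u = 1 after."""
    h, v, m = h0, 0.0, m0
    n_steps = int(round((t_switch + burn_time) / dt))
    for i in range(n_steps):
        u = 1.0 if i * dt >= t_switch else 0.0
        h, v, m = h + v * dt, v + (-g + u / m) * dt, m - k * u * dt
    return h, v, m

# Switch point in (h, v) space, lander mass at the switch, and burn duration.
h_s, v_s, m_s, burn_time = backtrack_switch(h0=1.0)
t_switch = -v_s / 1.0  # free-fall time from rest to speed |v_s| when g = 1
h_end, v_end, m_end = simulate_landing(1.0, m_s, t_switch, burn_time)
```

Replaying the trajectory forward should bring the lander to rest at the surface to within the discretization error of the Euler steps.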

4.3. Lie–Poisson Dynamics

Typically, “Lie–Poisson” dynamics [89] is defined by an evolution


equation of the form

ġ = f (A + u1 E1 + · · · + ul El ), f ∈ T G, u ∈ Rl , (4.21)

where A and the Es are elements of the Lie algebra g for G (A is


called the “drift” term). A controlled trajectory is a pair (g(t), u (t))
where g ∈ G specifies the “position” within the group manifold, u is
a control vector, and g(0) = g0 and g(T ) = gT . Following Bellman’s
description of optimal control, the optimal control problem is defined
by optimizing the integral of a cost rate function l(u) over all possible
controlled trajectories (g(t), u (t)) for 0 < t < T :
$$\int_0^T l[g(t), u(t)]\, dt \to \text{minimum}. \tag{4.22}$$

Typically, l(u) is chosen to be a quadratic function of the control


variables
$$l(u) = \frac{1}{2}\left( c_1 u_1^2 + \cdots + c_l u_l^2 \right). \tag{4.23}$$
It can be shown that the controlled paths which optimize the integral of this function can be found by minimizing the control

Hamiltonian H C , which differs from the classical Hamiltonian only


in that the sign of l(u) reverses the sign of the classical Lagrangian,
so that

HC (u, q) = l(u, q) + pġ. (4.24)

In a similar way to how the Euler–Lagrange equations follow


from the classical principle of least action, it can be shown that
the Lie–Poisson variational principle yields a Lie group analog of
Hamilton’s equations:

ġ = g dHα , ṗ = ad∗ (dHα ). (4.25)

In general, the dynamics for Lie–Poisson systems can be found [89]


by numerically integrating Hamilton’s equations along the fibers in
T*G with coordinates (μ − β0 ), where both the fiber coordinate μ
and the constant β0 are elements of the dual Lie algebra g ∗ for the
symmetry group G underlying the integrable dynamics.

4.3.1. Rigid body attitude control


A simple control problem that is important in astronautics and illus-
trates how the Lie–Poisson dynamics works is provided by the prob-
lem of controlling the attitude of a 3D rigid body [3]. The classical
Hamiltonian for a freely rotating rigid body in three-dimensions has
the form
 
$$H = \frac{1}{2}\left( \frac{J_1^2}{A_1} + \frac{J_2^2}{A_2} + \frac{J_3^2}{A_3} \right), \tag{4.26}$$

where the Js are the components of the angular momenta with


respect to the axes where the moment of inertia tensor is diagonal,
which can also be identified as the Lie algebra generators for SO(3)
rotations. The coefficients of this quadratic form define the moment
of inertia tensor. Hamilton’s equations imply that the time derivative
of the angle variable conjugate to the Ji that is a constant of motion
is just the constant Ji /Ai . As was shown by Poisson in connection
with his investigations of the rotation of astronomical bodies, con-
servation of energy and angular momentum allows one to reduce the

six-dimensional phase space for a “weightless” rigid body floating in


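space to a two-dimensional phase space, where the Hamiltonian becomes

$$\frac{1}{2}\left( \frac{J_1^2}{A_1} + \frac{J_2^2}{A_2} + \frac{J_3^2}{A_3} \right) \to \frac{1}{2}\left( \frac{\sin^2\phi}{A_1} + \frac{\cos^2\phi}{A_2} \right)(J^2 - L^2) + \frac{L^2}{2A_3}. \tag{4.27}$$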
The reduced phase space here is just the 2-sphere = SO(3)/SO(2) with
latitude and longitude coordinates (L, φ), and the classical motion is
just uniform rotation about an axis that is fixed in space. The quantum
problem for rotation of a rigid body is a bit more complicated, and
involves solving a matrix equation for 2J + 1 quantum energy levels.
The problem of controlling the attitude of a rigid body involves the
entire six-dimensional phase space. However, if we have a quadratic
control function of form (4.26), the dynamics remains reducible under
three-dimensional rotations and corresponds to the intersection of the
sphere J = constant and the ellipsoid Hc = constant. It turns out that
the exact control dynamics can be expressed in terms of Jacobi elliptic
functions. This is our first example of a special analytic function
providing an exact solution for a controlled system.
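The statement that the motion lies on the intersection of a sphere of constant J and an ellipsoid of constant energy can be illustrated numerically for the uncontrolled body. The sketch below integrates the free rigid-body equations in the body frame, dJ/dt = J × (J/A), which is the SO(3) Lie–Poisson flow generated by Eq. (4.26), and monitors both invariants. The principal moments, the initial angular momentum, and the fixed-step RK4 integrator are assumptions for illustration; no control torque is included.

```python
import numpy as np

def euler_rhs(J, A):
    """Free rigid body in the body frame: dJ/dt = J x (J / A),
    where A holds the principal moments of inertia."""
    return np.cross(J, J / A)

def energy(J, A):
    """Eq. (4.26)."""
    return 0.5 * np.sum(J**2 / A)

def integrate(J0, A, dt=1e-3, steps=10000):
    """Classical RK4 integration of the rigid-body equations."""
    J = np.array(J0, dtype=float)
    for _ in range(steps):
        k1 = euler_rhs(J, A)
        k2 = euler_rhs(J + 0.5 * dt * k1, A)
        k3 = euler_rhs(J + 0.5 * dt * k2, A)
        k4 = euler_rhs(J + dt * k3, A)
        J = J + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return J

A = np.array([1.0, 2.0, 3.0])    # illustrative principal moments
J0 = np.array([1.0, 0.5, 0.2])   # illustrative initial angular momentum
J_end = integrate(J0, A)

# Both quantities should stay constant along the trajectory.
energy_drift = abs(energy(J_end, A) - energy(J0, A))
norm_drift = abs(np.dot(J_end, J_end) - np.dot(J0, J0))
```

Both drifts are limited only by the integrator's truncation error, consistent with the trajectory staying on the intersection curve of the two surfaces.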

4.4. H∞ Control

It has been recognized for some time that there is an extension of


the Kalman filter, H∞ control [84], that can be interpreted as a
2-person game. The Kalman filter approach to optimal control has
some serious shortcomings: the Kalman filter seeks to minimize the error z̃(t) = (z(t) − ẑ(t)) in a series of estimates {xk} for the state of the system based on a series of signals {zk} derived from a series of observations {yk−}. Unfortunately, the Kalman control method does
not prevent the estimated state of the system from deviating very
significantly from the actual state. One possible reason for a large
deviation is that an “adversary” can increase the level of noise in
the innovations (z(t) − ẑ(t)). This would directly increase the uncer-
tainty in the state of the system and potentially drastically impact
the effectiveness of control actions. This possibility shows up in real-life situations such as the drama in Fig. 1.1. H∞ control generalizes the Kalman filter by introducing


a limit on the magnitude of the variance of the innovation GP for
the Kalman filter. The “H” here stands for the Hardy spaces of holomorphic functions, which were used by Segal and Wilson [61] in connection
with finding exact solutions for the KdV equation. Let us rewrite the
Kalman filter equations in the form
xk+1 = Fk xk + wk , yk = Hk xk + vk , (4.28)
where vk is the measurement noise and wk is the innovation noise.
The goal of the adversary in the H∞ game is to maximize the uncer-
tainty in the innovation zk − ẑk , where these zs are combinations
of the xs that appear in Eq. (4.28) and are convenient for char-
acterizing the state of a system or environment. The goal of the
observer/controller is to minimize by his actions the largest possible
estimation error that can occur because of an adversary’s actions
to increase the innovation noise wk . The solution to this problem
is found by considering how the sum J of all the signal variances
normalized by the sum Σ of all the RMS noise variances (including
the RMS uncertainty in the initial state x0 ) depends on the filter
parameters, where
$$J = \frac{\sum_{k=0}^{N-1} |z_k - \hat{z}_k|^2}{\Sigma}. \tag{4.29}$$
The minimax procedure [84] for limiting the estimation error is to first maximize J w.r.t. x0 and wk, assuming that xk and yk satisfy Eq. (4.28), and then find a stationary point of J w.r.t. xk and yk. It can be shown that maximizing J w.r.t. x0 and wk is equivalent to finding the minimum of a control Hamiltonian
$$H_C(k) = L_k + \frac{2\lambda_k^T}{\Delta}\left( F_k x_k + w_k \right), \tag{4.30}$$
where Lk = |xx − x|2 − Jmax (|wk |2 + |yk − Hk xk |2 ) plays the same role
as the classical Lagrangian. This leads to the Pontryagin equations
of motion:
$$\frac{\partial H_k}{\partial \lambda_k} = 0, \qquad \frac{2\lambda_k^T}{\Delta} = \frac{\partial H_k}{\partial x_k}, \tag{4.31}$$
which are the control theory analogs [86,87] of Hamilton’s canonical
equations of motion for classical mechanics. The Pontryagin solution

for the state variables is an affine Lie–Poisson flow:

$$x_k = \mu_k + P_k \lambda_k^T. \tag{4.32}$$

The solution to Eq. (4.31) that minimizes Jmax w.r.t. xk and yk is x̂k = μk and yk = Hk μk. Evidently, given the
existence of a bound on the magnitude of J, we obtain a quiescent
equilibrium state. Thus, H∞ control does seem to provide relief for
an adversarial increase in the innovation noise.
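The sensitivity of the ordinary Kalman filter to the noise wk, which motivates the H∞ construction, can be seen in a scalar sketch of the model in Eq. (4.28). The code iterates the standard Kalman covariance recursion to its steady state for two noise levels. It is not an H∞ filter; the constant scalar F and H and the variances `q_w` and `r_v` are illustrative assumptions.

```python
def steady_state_variance(F, H, q_w, r_v, p0=1.0, steps=500):
    """Iterate the scalar Kalman covariance recursion for
    x_{k+1} = F x_k + w_k, y_k = H x_k + v_k with Var(w) = q_w and Var(v) = r_v."""
    p = p0
    for _ in range(steps):
        p_pred = F * p * F + q_w                     # variance after the dynamics step
        gain = p_pred * H / (H * p_pred * H + r_v)   # Kalman gain
        p = (1.0 - gain * H) * p_pred                # variance after the measurement
    return p

# An adversary who raises the noise w_k from variance 0.01 to 1.0
# degrades the best achievable estimation variance.
p_quiet = steady_state_variance(1.0, 1.0, q_w=0.01, r_v=1.0)
p_noisy = steady_state_variance(1.0, 1.0, q_w=1.0, r_v=1.0)
```

With F = H = r_v = 1 the steady state satisfies p² + p q_w − q_w = 0, so the noisy case converges to (√5 − 1)/2, several times larger than the quiet case.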
In contrast with the Kalman filter, the variance of the innovation
is bounded. In other words, the actual forward (or backward) phase
space trajectories for a system or environment are uniformly close
to the observed trajectories. (This is also the miracle of Pontryagin
control [86,87]). The geometric and topological proximity of these
trajectories allows one to picture the forward and backward “innova-
tions”, i.e. two fluctuating smooth surfaces representing differences
between the model and observed trajectories viewed from the per-
spective of the observer/controller or environment acting as the RL
agent while the degrees of freedom of the adversary are frozen. The
areas of these two surfaces are just the Bellman values V of the
actions of the two agents. In the von Neumann equilibrium state,
the two surfaces have the same shape, but the Bellman values have
opposite signs due to the reversal of the direction of time. (In quan-
tum mechanics, systems propagating backward in time have negative
energy [64].)
Chapter 5

Integrable Systems

5.1. RH Solution of the Airy Equation

It is a little-advertised fact that accurate approximate solutions of


the time-independent Schrödinger equation for any potential can be obtained by making use of “modified Airy functions” (MAFs) [70]. The Airy functions Ai and Bi are defined as exact solutions of the Airy equation, which describes quantum motion of a particle in a linear potential (see also the NBS special functions handbook):

$$u_{xx} = x u. \tag{5.1}$$

At the same time, the MAFs can be used to construct approximate


solutions to the inhomogeneous 1D Helmholtz equation:

$$\frac{d^2\psi}{dx^2} + \Gamma^2(x)\,\psi(x) = 0, \qquad \Gamma^2(x) = \frac{2m}{\hbar^2}\left[ E - v(x) \right] \tag{5.2}$$
of the form

$$\psi_{\mathrm{MAF}}(x) \equiv C_1 \frac{\mathrm{Ai}(\xi)}{\sqrt{\xi'(x)}} + C_2 \frac{\mathrm{Bi}(\xi)}{\sqrt{\xi'(x)}}, \tag{5.3}$$

where the Ai and Bi are the two kinds of Airy functions. In


an oscillatory regime Γ²(x) > 0, the argument of the MAFs is ξ(x) = −((3/2) ∫_{x0}^{x} Γ(x) dx)^{2/3}, where x0 is a classical turning point. Equation (5.3) provides a very good approximation to the exact solution to Eq. (5.2) in an oscillatory regime for any continuous potential


v(x) (even very near to a turning point where the WKB approxima-
tion fails) [70,71].
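A minimal sketch of how Eq. (5.3) can be evaluated numerically is given below, using SciPy's `airy` and `quad` routines. It assumes that the oscillatory region lies to the left of the turning point x0, takes the magnitude of the phase integral before raising it to the 2/3 power, and obtains ξ′(x) from the identity ξ′ = Γ/√(−ξ) that follows from differentiating the definition of ξ. The function name `maf` and its arguments are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import airy

def maf(x, gamma2, x0, c1=1.0, c2=0.0):
    """Evaluate the MAF of Eq. (5.3) at a point x < x0 where gamma2(x) > 0.

    gamma2 is a callable returning Gamma^2(x); x0 is the classical turning point."""
    # Magnitude of the phase integral of Gamma between x and the turning point.
    phase, _ = quad(lambda s: np.sqrt(gamma2(s)), x, x0)
    xi = -(1.5 * phase) ** (2.0 / 3.0)
    # Differentiating (-xi)^(3/2) = (3/2) * phase gives xi' = Gamma / sqrt(-xi).
    dxi = np.sqrt(gamma2(x)) / np.sqrt(-xi)
    ai, _, bi, _ = airy(xi)  # airy returns (Ai, Ai', Bi, Bi')
    return (c1 * ai + c2 * bi) / np.sqrt(dxi)

# For a linear potential, Gamma^2(x) = -x, the MAF reduces to Ai(x) itself.
value = maf(-2.0, lambda s: -s, x0=0.0)
exact = airy(-2.0)[0]
```

The linear case is a consistency check rather than a test of accuracy: there ξ(x) = x and ξ′(x) = 1, so Eq. (5.3) reproduces the Airy function exactly.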
Airy functions were originally introduced into quantum mechanics
in order to solve the problem of quantum motion in a linear potential
[14], but later made an appearance in connection with the general
problem of relating the oscillating and exponentially decaying solu-
tions of the 1D Schrodinger equation at classical turning points [71].
What is of paramount interest for us is that the solutions (5.3) pro-
vide approximate solutions for the time-independent 1D Schrodinger
equation that are accurate for all values of x, and for any poten-
tial. Thus, the MAFs may be especially useful for analytically rep-
resenting the progress of feedback control or RL, both of the past
and future. This property of MAFs is shared with solutions of the
KdV and NLS equations and reflects a universal property of func-
tions that satisfy integrable PDEs. They also share the fact that they
have meromorphic integral representations as Cauchy integrals [68].
This analytic behavior and its attendant connection with solvability
is fundamental to our approach to optimal control.
In Landau’s Appendix for Quantum Mechanics [15] (which has a
colorful history going back to the time he was a postdoctoral fellow in
Copenhagen), he points out that the solutions to the Airy equation
can also be expressed as an integral along a line in the complex plane.
When the contour is the imaginary axis, the representation for the
forward propagating solution is
$$\mathrm{Ai}(x) = \frac{1}{\pi}\int_0^\infty \cos\left( sx + \frac{s^3}{3} \right) ds. \tag{5.4}$$
A similar integral expression exists for the backward propagating
eigenstate Bi (x). Arnold Its [68] pointed out that these integral rep-
resentations for the Airy function can also be reformulated as a
Riemann–Hilbert problem (see Appendix B for an introduction to
the Riemann–Hilbert problem, which goes back to Riemann’s PhD
thesis). The Riemann–Hilbert problem [67–69,73] is to reconstruct
a function that is analytic in the complement of a closed contour Γ
from the discontinuity of the function along the contour. Applying
this to the case where the holomorphic functions are the two simple
momentum space eigenstates of the 1D Schrodinger equation [13] for
a linear potential yields the representation in Eq. (5.4). Except for
arcs at infinity, the RH contour Γ consists of the real line plus the
60◦ and 120◦ diagonal lines in the complex plane. Remarkably, these

same lines played a central role in the “Eightfold Way” scheme for
constructing representations of SU(3) [118] that played a historical
role in the early understanding of elementary particles with strong
interactions. The Riemann–Hilbert problem of interest to us consists
in recovering the 2-component analytic function Φ(z) and the solu-
tion of the nonlinear Airy equation from a jump condition across Γ:

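$$\Phi_+(z) = \Phi_-(z)\, G(z), \tag{5.5}$$

where

$$G(z) = \begin{pmatrix} 1 & \displaystyle\int_{\Gamma_k} \frac{\exp\left[ -i\left( 2xs + \tfrac{8}{3}s^3 \right) \right]}{s - z}\, ds \\ 0 & 1 \end{pmatrix}$$

and Φ+ and Φ− are the holomorphic pieces of Φ defined on the two sides of the RH contour Γ. The integral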
in Eq. (5.4) is recovered from the component of Φ corresponding to
the Ai (x) solution of the Airy equation with a wave incident from
the left:
$$u(x, t) = 2i \lim_{z \to \infty} z\, \Phi(z)_{12}. \tag{5.6}$$
Its [68] generalized this construction for the Ai(x) to a solution that includes both of the two independent solutions of the Airy equation. The jump contour Γ now consists of the real line (the “I-spin” axis) plus the entire “U-spin” and “V-spin” axes. The jump condition is modified so that along the added pieces of the “U-spin” and “V-spin” axes, the jump matrix G(z) in Eq. (5.5) is replaced by its inverse [68–69]. T–O describe a version of this construction where the analytic matrices Φ(z) involve MAFs near the turning points. They
enjoy nice analytic behavior when ξ(x) is extended to the entire
complex plane — with the exception of the origin — which induced
T–O to modify the jump contour by adding a circle around the ori-
gin where Φ is a 2 × 2 matrix that describes the two independent
MAFs that appear inside and outside this circle around the origin
in the complex plane. The matrix Φ is holomorphic inside and outside
the jump contour, which anticipates the Segal–Wilson solution for
the KdV equation.
The matrix Φ has a Cauchy integral representation in terms of its
discontinuity across the jump contour

$$\Phi(z) = \frac{1}{2\pi i} \int_\Gamma \frac{\Phi_+ - \Phi_-}{s - z}\, ds, \tag{5.7}$$

which can also be written as

$$\Phi_+(s) = \Phi_-(s)\, G(s).$$

Landau’s integral representation for g(λ)dλ for the Airy function
can be recovered by taking the limit λ → ∞ of λG12 . When the
contour is the real axis, the jump condition is
$$\Phi_+ = \Phi_- \begin{pmatrix} 1 - |r|^2 & -r e^{-2ixs} \\ r e^{2ixs} & 1 \end{pmatrix}. \tag{5.8}$$

In this formalism, the input data which allows one to extract u(x, t)
are the reflection coefficients r(s). If the contour is a polygon with
G(λ) = M1 M2 ⋯ Mk, a product of piecewise constant matrices,
then the meromorphic scattering amplitude S(λ) can be constructed
as a product of τ -functions [68]:

τk (Y ) = Y (λ)Mk , (5.9)

where

$$Y(\lambda) = \begin{pmatrix} 1 & \displaystyle\int \frac{g(\mu)}{\mu - \lambda}\, d\mu \\ 0 & 1 \end{pmatrix}.$$

5.2. The KdV Equation

The event that eventually led to the recognition of the remarkable


capability of exact solutions of completely integrable PDEs to pre-
dict the distant future was the discovery in 1834 by a Scottish marine
engineer that the waves created in a shallow canal by a sudden change
in motion of a barge can propagate for many miles without chang-
ing their shape or speed (the initial observations were limited by
the distance the engineer’s horse could gallop). Remarkably, it took
more than a century for these waves to be understood. (For a his-
tory of the efforts to obtain analytic solutions for the KdV equa-
tion, see Newell’s Solitons in Mathematics and Physics [75].) After
a prolonged debate within the mathematical physics community as
to whether the marine engineer’s observations could be believed, a

mathematical description of these waves was finally achieved at the


end of the 19th century by Korteweg and de Vries [75]:

$$\frac{\partial u}{\partial t} + \frac{\partial^3 u}{\partial x^3} + 6u\,\frac{\partial u}{\partial x} = 0, \tag{5.10}$$

where u(x, t) is the wave amplitude. Equation (5.10) has solitary


wave solutions which are in agreement with the marine engineer’s
observations. These waves can even undergo collisions with other
solitary waves without changing their shapes. It took 60 years after
the KdV equation was first written down to find a way of extract-
ing the analytic solutions of the KdV equation which represent
solitary waves, and begin to understand why the KdV equation is
solvable at all. The first exact solution for the KdV equation was
found as a result of the realization by Peter Lax [74] that nonlin-
ear integrable PDEs like the KdV equation can be replaced with
an infinite set of linear evolution equations describing the separate
evolution in x and an infinite set of times {tn }. The solution to
these equations is for historical reasons known as the “Baker func-
tion” Ψ(x, {tn }). This nomenclature goes back to the 1920s [77] and
refers to the fact that the eigenvalues of two partial differential oper-
ators describing linear evolution in x and t which commute at some
time are independent of time, and therefore for each pair of eigenval-
ues the equations define a time-independent Riemann surface. The
Riemann surfaces that are of interest in connection with Bayesian
learning are “hyper-elliptic” curves Kn , whose surfaces are parame-
terized by two variables y and z related by a simple algebraic equa-
tion of the form (see Appendix C for an introduction to Riemann
surfaces).
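The claim that the KdV equation (5.10) admits solitary wave solutions can be checked numerically. The sketch below assumes the textbook one-soliton profile for the sign convention of Eq. (5.10), $u = 2\eta^2\,\mathrm{sech}^2(\eta(x - 4\eta^2 t - x_0))$; the function names, the step size h, and the sample points are illustrative choices and not part of the text.

```python
import math

def soliton(x, t, eta=1.0, x0=0.0):
    """One-soliton profile u = 2*eta^2 * sech^2(eta*(x - 4*eta^2*t - x0)).

    Assumed textbook form for the convention u_t + u_xxx + 6*u*u_x = 0.
    """
    theta = eta * (x - 4.0 * eta**2 * t - x0)
    return 2.0 * eta**2 / math.cosh(theta) ** 2

def kdv_residual(u, x, t, h=1e-3):
    """Central-difference estimate of u_t + u_xxx + 6*u*u_x at (x, t)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xxx = (u(x + 2 * h, t) - 2 * u(x + h, t)
             + 2 * u(x - h, t) - u(x - 2 * h, t)) / (2 * h**3)
    return u_t + u_xxx + 6 * u(x, t) * u_x

# Residual of Eq. (5.10) at a few sample points; it should vanish up to
# finite-difference error.
residuals = [abs(kdv_residual(soliton, x, 0.3)) for x in (-2.0, -0.5, 0.0, 1.2, 3.0)]
print(max(residuals))
```

The wave keeps its shape while moving at speed $4\eta^2$, so taller solitary waves travel faster, which is the behavior observed in the canal.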
The operator L describing the spatial evolution of the Baker function
in x is called the Lax operator:

$$L\Psi(x, t) = \lambda\Psi(x, t), \tag{5.11}$$

where we have used t to mean the set $\{t_n\}$. In addition there is a
set of operators $B_n$ describing the evolution with respect to an
infinite set of odd-indexed times $t_n$, where n = 2j + 1 and $t_3$ is
ordinary time:

$$\frac{\partial \Psi(x, t)}{\partial t_n} = B_n \Psi(x, t). \tag{5.12}$$

As explained in Ref. [77], it is worthwhile to focus on a combination
Q of the $B_n$ operators which commutes with L:

$$Q\Psi(x, t) = \mu\Psi(x, t). \tag{5.13}$$

For the integrable differential equations of interest for us the Lax
operator L is either the Schrödinger–Hill [125] or a Dirac operator
with eigenvalue $\lambda^2$ or λ, respectively. The operators $B_n(\lambda)$ are matrix
polynomials in the “spectral parameter” λ and d/dλ. These operators
act on a “Baker” wave function ψ(x, λ) in a Hilbert space of functions
that is generally infinite dimensional, but can also be finite. This
nomenclature derives from the coincidence in the KdV case that the
Lax operator L(λ) is the time-independent Schrödinger–Hill operator
with eigenvalue $\lambda^2$. The potential for this eigenvalue problem is the
physical KdV wave amplitude u(x, t). The eigenfunctions of L(λ)
are for historical reasons [77] referred to as “Baker functions”. For
large λ, these functions have the following asymptotic meromorphic
form [73]:

$$\psi(x, \lambda) = \left(1 + \frac{a_1(x)}{\lambda} + \frac{a_2(x)}{\lambda^2} + \cdots\right) e^{i\lambda x}. \tag{5.14}$$

The asymptotic behavior of the pre-factor in Eq. (5.14) as x → ∞ is
the familiar S-matrix for the 1D non-relativistic Schrödinger equation.
If this S-matrix is known, then the exact solutions can be found
using the same inverse scattering method originally introduced by
Gelfand, Levitan, and Marčenko (see [59,60] and Appendix B) for
extracting the potential for the 1D Schrödinger equation from the
asymptotic behavior of the pre-factor in Eq. (5.14) interpreted as
“scattering data”.
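A concrete instance of the asymptotic form (5.14) may help. For a single solitary wave the series in 1/λ terminates after the first term. The sketch below assumes the sign convention $L = -d^2/dx^2 - u$ with the t = 0 one-soliton potential $u = 2\eta^2\,\mathrm{sech}^2(\eta x)$, for which $a_1(x) = i\eta\tanh(\eta x)$; these conventions, the value of ETA, and the function names are assumptions of the example rather than statements from the text.

```python
import cmath
import math

ETA = 1.0  # soliton parameter (illustrative value)

def u(x):
    """One-soliton potential at t = 0 (assumed convention L = -d^2/dx^2 - u)."""
    return 2.0 * ETA**2 / math.cosh(ETA * x) ** 2

def baker(x, lam):
    """psi = (1 + a1(x)/lam) * exp(i*lam*x) with a1(x) = i*eta*tanh(eta*x)."""
    return (1.0 + 1j * ETA * math.tanh(ETA * x) / lam) * cmath.exp(1j * lam * x)

def schrodinger_residual(x, lam, h=1e-4):
    """Finite-difference estimate of -psi'' - u*psi - lam^2*psi."""
    d2 = (baker(x + h, lam) - 2 * baker(x, lam) + baker(x - h, lam)) / h**2
    return -d2 - u(x) * baker(x, lam) - lam**2 * baker(x, lam)

# The residual should vanish (up to discretization error) for every lambda.
worst = max(abs(schrodinger_residual(x, lam))
            for x in (-1.5, 0.0, 0.7, 2.0) for lam in (0.5, 1.3, 3.0))
print(worst)
```

As x → ±∞ the pre-factor tends to $1 \pm i\eta/\lambda$, so the ratio of the two limits is a pure phase with a pole at λ = iη: a reflectionless S-matrix whose pole encodes the solitary wave.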
It is perhaps noteworthy in this connection that the original noise
filter developed by Wiener for extracting a signal from a time series
contaminated with white noise also involves using a meromorphic
function of frequency (a meromorphic function is a function of a
complex variable that is holomorphic except at isolated pole
singularities) to represent the noise filtered signal. In switching
from the real valued spectral parameter λ appearing in the Lax
equations to a complex variable ζ, we are acknowledging our fundamental
thesis that integrable PDEs represent a promising new approach to the
Bayesian model selection problem where conventional regression methods,
e.g. Markov chain Monte Carlo methods, fail. This is why machine
learning problems can potentially be greatly simplified by transporting
the problem to a Riemann surface:

$$y = \pm\sqrt{a_0 + a_1 z + a_2 z^2 + \cdots + a_n z^n}. \tag{5.15}$$
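To make the two-sheeted structure of the curve (5.15) concrete, the following minimal sketch evaluates both branches of y at a point z and computes the genus (the number of “donut holes”) from the degree of the polynomial, using the standard result that $y^2 = P(z)$ with n distinct roots has genus ⌊(n − 1)/2⌋. The polynomial and the sample point are arbitrary illustrative choices.

```python
import cmath

def sheets(coeffs, z):
    """Return the two values y = +sqrt(P(z)) and y = -sqrt(P(z)).

    coeffs = [a0, a1, ..., an] are the coefficients in Eq. (5.15).
    """
    p = sum(a * z**k for k, a in enumerate(coeffs))
    root = cmath.sqrt(p)
    return root, -root

def genus(degree):
    """Genus of y^2 = P(z) when P has `degree` distinct roots."""
    return (degree - 1) // 2

coeffs = [-1.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # P(z) = z^5 - 1 (illustrative)
y_plus, y_minus = sheets(coeffs, 2.0 + 1.0j)
print(y_plus, y_minus, genus(len(coeffs) - 1))
```

A cubic or quartic polynomial gives a torus (genus 1); each additional pair of branch points adds one handle.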

In both the KdV and NLS cases, this Riemann surface arises as a
corollary of the Burchnall–Chaundry theorem [77] for commuting
scalar differential operators. The wave function φ(z, x), as well as
the functions q(x) and p(x), can be calculated exactly in terms of
the Θ-functions associated with this Riemann surface [56].
The integrable structure for the KdV and NLS equations is largely
hidden. Indeed, initially it was not even suspected that these
equations were completely integrable. However, following a decade-long
campaign by some talented mathematicians, this hidden structure
was finally revealed (see [72–75] for nice reviews). The solution of
the KdV equation representing a single solitary wave turned out to
have the form [75]

$$u_1(x, t) = -2\eta^2\, \mathrm{sech}^2\!\left(\eta\,(x - x_0 + \eta^3 t + \eta^5 t_5 + \cdots)\right),$$

where the appearance of an infinite set of independent time variables
reflects the fact that the KdV equation is an example of an integrable
dynamical system with an infinite number of degrees of freedom. An
important development in the theory of the KdV equation was the
discovery by Hirota [76] that multi-solitary wave solutions of
the KdV equation can be represented in terms of Θ-functions:

$$u(x, t) = 2\,\frac{\partial^2}{\partial x^2} \ln \tau(\theta_1, \ldots, \theta_N), \tag{5.16}$$

where for multiple solitary wave solutions of the KdV equation:

$$\tau(x_1, t_2, \ldots, t_N) = \sum_{n_i, n_j \in \mathbb{Z}} \exp\left[i\pi\left(\sum_{i,j=1}^{N} i\, n_i n_j T_{ij} + 2i \sum_{j=1}^{N} \theta_j n_j\right)\right]$$

and

$$\Theta(\{\theta_j\}) \equiv \sum_{n \in \mathbb{Z}^g} \exp\left[i\pi\left(\sum_{ij} n_i T_{ij} n_j + 2\pi i \sum_j \theta_j n_j\right)\right]$$

is the Riemann Θ-function (see [54–57] for the definition and
properties of Θ-functions). The $T_{ij} \equiv \oint_{B_i} d\omega_j$ are the “periods” for the
associated Riemann surface (5.15). These periods are obtained by
integrating a rational holomorphic differential $d\omega_j$ around one of the
“B” cyclic paths on the Riemann surface corresponding to going around
the circumferences of the “donut holes” in the Riemann surface (see
Appendix C). For each donut hole in the Riemann surface, there are two
types of cycles corresponding to how one goes around the donut hole:
the “A” cycles correspond to a cyclic path threading a donut hole,
while the “B” cycles correspond to the cyclic paths going around the
circumference of each donut hole.
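The representation $u = 2\,\partial_x^2 \ln\tau$ in Eq. (5.16) can be tried out in the soliton limit, where the Θ-series degenerates to a finite sum of exponentials. The sketch below uses the standard two-soliton τ-function of Hirota's method for the convention of Eq. (5.10), $\tau = 1 + e^{\theta_1} + e^{\theta_2} + A_{12} e^{\theta_1 + \theta_2}$ with $\theta_i = k_i x - k_i^3 t$ and $A_{12} = ((k_1 - k_2)/(k_1 + k_2))^2$. The wave numbers K1 and K2, the sample points, and the function names are illustrative assumptions, not values from the text.

```python
import math

K1, K2 = 1.0, 1.5                      # illustrative wave numbers
A12 = ((K1 - K2) / (K1 + K2)) ** 2     # Hirota's interaction coefficient

def tau_and_x_derivs(x, t):
    """Return tau, tau_x, tau_xx for the two-soliton tau-function."""
    e1 = math.exp(K1 * x - K1**3 * t)
    e2 = math.exp(K2 * x - K2**3 * t)
    e12 = A12 * e1 * e2
    tau = 1.0 + e1 + e2 + e12
    tau_x = K1 * e1 + K2 * e2 + (K1 + K2) * e12
    tau_xx = K1**2 * e1 + K2**2 * e2 + (K1 + K2) ** 2 * e12
    return tau, tau_x, tau_xx

def u(x, t):
    """u = 2 * d^2/dx^2 ln(tau), written out with the quotient rule."""
    tau, tau_x, tau_xx = tau_and_x_derivs(x, t)
    return 2.0 * (tau * tau_xx - tau_x**2) / tau**2

def kdv_residual(x, t, h=1e-3):
    """Central-difference estimate of u_t + u_xxx + 6*u*u_x."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xxx = (u(x + 2 * h, t) - 2 * u(x + h, t)
             + 2 * u(x - h, t) - u(x - 2 * h, t)) / (2 * h**3)
    return u_t + u_xxx + 6 * u(x, t) * u_x

worst = max(abs(kdv_residual(x, 0.2)) for x in (-3.0, -1.0, 0.0, 0.8, 2.5))
print(worst)
```

Setting the second exponential to zero recovers a single solitary wave of amplitude $k_1^2/2$, so each exponential in τ adds one wave, and $A_{12}$ carries the phase shift produced by a collision.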
Given the τ-function in Eq. (5.16), the solution for Eq. (5.11), i.e.
the “Baker function”, for the KdV equation becomes

$$\psi(x, P) = e^{-i \sum_k x_k z^k}\, \frac{\tau\!\left(x_1 - \frac{1}{z},\; x_2 - \frac{1}{2z^2},\; x_3 - \frac{1}{3z^3},\; \ldots\right)}{\tau(x_1, \ldots, x_N)}. \tag{5.17}$$
In all these equations, $x_1$ represents the distance along the “canal”,
while the other $x_i$ represent an infinite sequence of times $\{t_i\}$
associated with the infinite number of commuting Hamiltonian flows
that follow from the complete integrability of the KdV equation. It can
be shown [75,76] that the expression for ψ(z, x) in Eq. (5.17) is
consistent with the expression for u(x, t) in terms of the τ-function
given in Eq. (5.16). Also, the appearance of a ratio of τ-functions in
Eq. (5.17) explains the emergence of a meromorphic (as opposed to
holomorphic) structure for ψ(x, P), since the exact τ-function in both
the KdV and NLS cases is identically a Riemann Θ-function, which has
zeros but no poles; indeed, the positions of these zeros of τ-functions
become parameters in the exact solutions for ψ(z, x) and u(x, t). Apart
from these zeros, the initial data needed to define these exact
solutions consists of the scattering data for the Baker function as a
function of the wave number of the incoming wave; for example, the
reflection coefficient as a function of the wave number. The exact
solution for the KdV equation can be found [75] using the inverse
scattering method developed in the 1950s by Marčenko, Gelfand,
and Levitan (see [61] and Appendix B for more details of the GLM
method). Lax showed that the Hamiltonian flows for the Baker function
leave the eigenvalues of the L(u) operator in Eq. (5.11) intact,
but evolve the potential u(x, t) for the Lax equation in such a way
as to generate a solution to the KdV equation (5.10).
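The asymptotic Baker function can be written in the form:

$$\psi(x, \{t_n\}, \zeta) \sim \frac{X(\zeta)\,\tau}{\tau}, \tag{5.18}$$

where

$$X(\zeta) = \exp\left(i \sum_{k=0}^{\infty} \zeta^{2k+1} t_{2k+1}\right) \exp\left(i \sum_{k=0}^{\infty} \frac{\zeta^{-2k-1}}{2k+1} \frac{\partial}{\partial t_{2k+1}}\right). \tag{5.19}$$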

The model selection problem for the KdV equation amounts to choosing
a set of solitary waves and values for the initial positions $x_{i0}$ and
“momenta” $\eta_i$ that best explain a set of observations of the wave
amplitude that, say for practical reasons, are limited in their scope
of times and locations. Finding the best choice of parameters for
the τ-function in Eq. (5.16) based on actual video observations of
wave amplitudes would be a very difficult problem for conventional
machine learning if many solitary waves were present.
In the context of using the KdV equation as an avatar for Bayesian
learning, this model selection problem amounts to choosing a
“Bäcklund transformation” [75]. This involves transforming the
τ-function to accommodate the addition of another solitary wave. Using
the formula for u(x, t) as a second derivative of the logarithm of the
τ-function one finds that the new τ-function can be expressed in terms
of asymptotic scattering wave functions for the KdV Lax equation:

$$\tilde{\tau} = \tau\,\big(A\psi_+(x, \zeta) + B\psi_-(x, -\zeta)\big) = \big(A X(\zeta) + B X(-\zeta)\big)\,\tau, \tag{5.20}$$

where A and B are different constants for each application of the
Bäcklund transformation (the ζ variables are likewise independent for
each application of the Bäcklund transformation since, after all, the
Riemann surfaces after each transformation are quite different!). To
get some idea of what is going on here, we note that the τ-function
could be written in the following form:

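$$\exp\Big(\sum_{k=1} b_k x_k z^{-k}\Big)\Big(1 + \sum_i a_i z^{-i}\Big) = \frac{\exp\Big\{\sum_i \lambda_i \big(x_i - \tfrac{1}{z^i}\big) + \sum_{ij} \mu_{ij} \big(x_i - \tfrac{1}{z^i}\big)\big(x_j - \tfrac{1}{z^j}\big)\Big\}}{\exp\Big(\sum_i \lambda_i x_i + \sum_{ij} \mu_{ij} x_i x_j\Big)}. \tag{5.21}$$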

Thus, if we introduce new terms with negative powers of ζ to the
phase of the Baker function, we end up with something that looks
like Hirota’s KdV solution with an extra solitary wave. This
manifestly changes the topology of the Riemann surface. As we shall see
in Chapter 7, we can to some extent avoid using the Bäcklund
transformations if we simply start with a Riemann surface with
sufficiently high genus.

5.3. Segal–Wilson Construction

The work of Its [68] set the data science community on the path
connecting the use of the inverse scattering method for solving
nonlinear integrable PDEs with the construction of special meromorphic
functions in a neighborhood of the north pole of a Riemann surface,
which can be used to construct a 1:1 map from input data to data
features. Following the success of the inverse scattering method for
constructing exact solutions of the KdV or NLS equation, Segal and
Wilson discovered [73] a nice geometric way of side-stepping the usual
way of solving the GLM integral equation. In the Segal–Wilson approach,
scattering amplitudes for the Schrödinger or Dirac equations appear
as a discontinuity between square-integrable holomorphic functions
defined in the upper and lower halves of the complex plane. Their
construction is based on the introduction of the “Grassmannian” of
all closed subspaces W of the Hilbert space H consisting of square
integrable functions that are close to the subspace $H_+$ spanned by
the $\{z^i\}$ where i ≥ 0. “Close” to $H_+$ means that the orthogonal
projection $W \to H_+$ is transversal to $H_-$; i.e. it consists of complex
analytic functions of the form

$$w(z) = \sum_{k=-N}^{\infty} a_k z^k. \tag{5.22}$$

The Segal–Wilson approach to solving the KdV equation begins
with the introduction of “loop maps”:

$$g : S^1 \to H_+ + H_-, \tag{5.23}$$

which map the equator of the Riemann sphere ($S^2$) into a curve W in
the space of complex n × n matrices representing $H = H_+ + H_-$. In
this language the usual least squares regression problem described
in Chapter 2 becomes the problem of factorizing the loop map g à la
Wiener–Hopf (cf. Appendix B) into the product $g_- g_+$ of two maps
$g_+$ and $g_-$ that are holomorphic in the upper and lower hemispheres
$S_+$ and $S_-$, respectively, of the Riemann sphere. Taken together H
represents a meromorphic extension of the space $H_-$ of holomorphic
functions representing input data. In the case of the 1D inhomogeneous
wave equation, the spaces $H_+$ and $H_-$ corresponding to all
possible outgoing and incoming states are related by

$$g(\lambda) = \begin{pmatrix} a_\lambda & b_\lambda \\ c_\lambda & d_\lambda \end{pmatrix}. \tag{5.24}$$

The usual scattering matrix $S_\lambda$ in 1D is defined by

$$g(\lambda) = e^{-\lambda J_z}\, S_\lambda\, e^{\lambda J_z}, \tag{5.25}$$

where

$$S_\lambda = \frac{1}{d_\lambda} \begin{pmatrix} 1 & b_\lambda \\ -c_\lambda & 1 \end{pmatrix} \equiv \begin{pmatrix} T_\lambda & R_\lambda \\ -R_\lambda & T_\lambda \end{pmatrix}. \tag{5.26}$$

$R_\lambda$ and $T_\lambda$ are the usual reflection and transmission functions for a
wave packet incident on a localized potential from either the left
or right. The factorization problem for g(λ) amounts to writing
$g_- = g\,(g_+)^{-1}$. If we write $g = 1 + \gamma$, $(g_+)^{-1} = 1 + \gamma_+$,
and $g_- = 1 + \gamma_-$, then [73]

$$\gamma_- = \gamma + \gamma_+ + \gamma\gamma_+. \tag{5.27}$$
Finding the two factors is reminiscent of the Wiener–Hopf method
of representing scattering amplitudes as a ratio of factors that are
holomorphic above and below the real axis in $k^2$ space. However,
Eq. (5.27) is a significant improvement over the usual least squares
regression method, which depends on the inversion of a large matrix
to solve the Wiener–Hopf equation. Indeed, Eq. (5.27) doesn’t require
a matrix inversion, but instead is a product of matrices.
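The algebra behind Eq. (5.27) can be made explicit with small matrices. The sketch below assumes that $1 + \gamma_+$ parameterizes the factor that multiplies g on the right (that is, $(g_+)^{-1}$), so that $g_- = (1 + \gamma)(1 + \gamma_+)$; this reading of the notation is an assumption of the example. The 2 × 2 matrices are arbitrary illustrative numbers.

```python
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(*matrices):
    """Entrywise sum of any number of 2x2 matrices."""
    return [[sum(m[i][j] for m in matrices) for j in range(2)]
            for i in range(2)]

gamma = [[0.2, -0.1], [0.05, 0.3]]        # arbitrary illustrative values
gamma_plus = [[-0.1, 0.4], [0.0, 0.25]]   # arbitrary illustrative values

g = add(IDENTITY, gamma)
right_factor = add(IDENTITY, gamma_plus)  # assumed to stand for (g_+)^{-1}
g_minus = mul(g, right_factor)

# Eq. (5.27): gamma_- = gamma + gamma_+ + gamma * gamma_+
gamma_minus = add(gamma, gamma_plus, mul(gamma, gamma_plus))

difference = max(abs(g_minus[i][j] - IDENTITY[i][j] - gamma_minus[i][j])
                 for i in range(2) for j in range(2))
print(difference)
```

Only additions and one matrix product appear, which is the point made in the text: once the right-hand factor is known, the other factor follows without any matrix inversion.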
The crucial part of the Segal–Wilson construction is understanding
how the matrices representing $g_+$ and $g_-$ act within H. The
part of the map $gH_+ \to H_-$ connecting $H_+$ and $H_-$ yields the
meromorphic pre-factor A(x, z) in Eq. (5.14) by providing a map
$w : H_+ \to H_-$ whose graph is just this sought after 1:1 connection
between the spaces $H_+$ and $H_-$. That is, $gH_+$ can be written in the
following form:

$$gH_+ = \begin{pmatrix} 1 & 0 \\ w & 1 \end{pmatrix} H_+. \tag{5.28}$$

Because $g^{-1}(gH_+)$ leaves $H_+$ invariant, we obtain

$$g^{-1} \begin{pmatrix} 1 & 0 \\ w & 1 \end{pmatrix} = \begin{pmatrix} a & b \\ 0 & c \end{pmatrix} \tag{5.29}$$

as the solution of the problem of factoring the map g into the two
factors $g_+$ and $g_-$ holomorphic on $H_+$ and $H_-$, respectively. Overall,
the Baker function is a product of an exponential factor and the
function w(z). Segal and Wilson are acknowledging that an essential
element in understanding the KdV and NLS equations is the replacement
of the real valued spectral parameter λ with a point P on a
Riemann surface, which is often referred to as the spectral curve.
The spectral curves that are of interest in connection with machine
learning are “hyper-elliptic” curves $K_n$, whose surfaces are
parameterized by two variables y and z related by the algebraic
equation in Eq. (5.15).
The Segal–Wilson approach provides a geometric construction of the
Baker function w(z) on a hyper-elliptic Riemann surface. In both the
KdV and NLS cases, the meromorphic pre-factor in the asymptotic Baker
function has the general form $\phi_W(x, P) = 1 + \sum_{i=1}^{\infty} a_i(x)/z^i$, and
can be represented as the ratio of two τ-functions $\tau_W(z)$. For certain
discrete combinations $(z_k, t_k)$, the quantity $\exp(-\sum_k z_k t_k)\, W$
is transversal to $H_-$; i.e. it has the form

$$\exp\Big(-\sum_k z_k t_k\Big)\, w = 1 + \sum_{i=1}^{\infty} a_i(x)/z^i. \tag{5.30}$$

If the $t_k$ in Eq. (5.30) are nonzero, then the action of the exponential
factor in Eq. (5.17) represents the effect of the multiple flows on the
solution $u_W(z)$, provided that these are independent linear flows for
each value of k, which of course makes sense since the KdV equation is
an integrable system with an infinite number of commuting flows. The
transversal condition means that the parameters defining w satisfy
certain conditions, which in the “Kyoto school” theory of the KdV
equation are met by demanding that the Baker function be derived as the
ratio of two τ-functions as in Eq. (5.17), which explains the
meromorphic structure of the Baker function à la Wiener filter.
In the multi-solitary wave case, these τ-functions can also be
represented as a determinant of propagators for solutions of a
Fokker–Planck equation [120]:

$$D(z) = \sum_{l=1}^{\infty} \frac{z^l}{l!} \int_a^b \!\cdots\! \int_a^b ds_1 \cdots ds_l \begin{vmatrix} K(s_1, s_1) & \cdots & K(s_1, s_l) \\ \vdots & & \vdots \\ K(s_l, s_1) & \cdots & K(s_l, s_l) \end{vmatrix}, \tag{5.31}$$

where each $K(s_1, s_2)$ can be thought of as a propagator for Gaussians
localized at points s = (x, t). The D(z) notation for the τ-function
in (5.31) has its origin in scattering theory (see Appendix B). The
$K(s_1, s_2)$ can also be interpreted as a backward filter for solutions of
the time-independent KdV equation

$$Kf \equiv \int_a^b K(s, s')\, f(s')\, ds'.$$

The solution for u(x, t) obtained from (5.31) is the same as the
analytic expression involving theta functions for a Riemann surface
obtained by Hirota [76]. This provides a simple and beautiful solution
of the model selection problem for Bayesian learning.
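The series in Eq. (5.31) can be explored numerically by replacing the integrals with a quadrature rule. On a grid of n nodes the series terminates at l = n, and each term reduces to a sum of principal minors of the matrix $A_{ij} = K(s_i, s_j)\,w$, where w is the quadrature weight. The sketch below compares that finite series with det(I + zA). It includes the conventional l = 0 term equal to 1, which is an assumption here since the sum in (5.31) is written from l = 1; the Gaussian kernel, the interval, and the grid size are illustrative choices.

```python
import math
from itertools import combinations

def det(m):
    """Determinant by Laplace expansion (adequate for tiny matrices)."""
    n = len(m)
    if n == 0:
        return 1.0
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def kernel(s1, s2):
    """Illustrative Gaussian kernel; any smooth K(s, s') would do."""
    return math.exp(-(s1 - s2) ** 2)

a, b, n = 0.0, 1.0, 5
w = (b - a) / n                                  # midpoint-rule weight
nodes = [a + (i + 0.5) * w for i in range(n)]
A = [[kernel(si, sj) * w for sj in nodes] for si in nodes]

def fredholm_series(z):
    """1 + sum_l z^l * (sum of l x l principal minors of A)."""
    total = 1.0                                  # assumed l = 0 term
    for l in range(1, n + 1):
        minors = sum(det([[A[i][j] for j in idx] for i in idx])
                     for idx in combinations(range(n), l))
        total += z**l * minors
    return total

def fredholm_det(z):
    """det(I + z*A) computed directly."""
    return det([[(1.0 if i == j else 0.0) + z * A[i][j] for j in range(n)]
                for i in range(n)])

print(fredholm_series(0.7), fredholm_det(0.7))
```

The 1/l! in (5.31) disappears in the discrete version because each unordered set of l distinct nodes appears l! times in the ordered integral, while determinants with a repeated node vanish.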
The initial clue that exact solutions for the KdV equation might
be connected to quantum mechanics was provided by Einstein [91],
who showed the Bohr–Sommerfeld quantization rules are exact for
completely integrable systems. As we shall see in Chapter 7, this
has the consequence that the high frequency limit of multi-channel
quantum mechanics can directly yield the reward function for
multi-observer control problems or multi-agent RL problems. A
significant role for multi-channel quantum mechanics also shows up in
the work of Faddeev et al. on the NLS equation [126].

5.4. The NLS Equation

The Lax equation approach for finding exact solutions also works for
the nonlinear Schrödinger (NLS) equation [69,75]. Of particular
interest to us is its “complexified form” where the scalar wave
amplitude u(x, t) of the KdV equation is replaced by two amplitudes
p(x, t) and q(x, t), which play the role of momentum and position
controls:

$$i\,\frac{\partial q}{\partial t} = -q_{xx} + |p|^2 q, \qquad i\,\frac{\partial p}{\partial t} = p_{xx} - |q|^2 p. \tag{5.32}$$

If we assume p = ±q, then we have the real form analog of the KdV
equation:

$$i\,\frac{\partial u}{\partial t} = -u_{xx} + 2|u|^2 u, \tag{5.33}$$

which is of particular interest because it has found applications in
optical fibers. The discovery that the NLS equation, like the KdV
equation, is completely integrable [69,73,75] led to the realization
that the same kind of inverse scattering approach that worked for the
KdV equation also works for the NLS equation.
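As with the KdV equation, an explicit solution of Eq. (5.33) can be checked directly. With the sign of the nonlinear term as written in (5.33), a simple exact solution is the stationary “dark” solitary wave $u = \tanh(x)\, e^{-2it}$, a dip in a background of unit amplitude. That profile is a standard result assumed for this sketch and is not quoted from the text; the function names and sample points are illustrative.

```python
import cmath
import math

def dark_soliton(x, t):
    """u = tanh(x) * exp(-2it): stationary dark soliton on a unit background."""
    return math.tanh(x) * cmath.exp(-2j * t)

def nls_residual(u, x, t, h=1e-4):
    """Finite-difference estimate of i*u_t + u_xx - 2*|u|^2*u for Eq. (5.33)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
    return 1j * u_t + u_xx - 2 * abs(u(x, t)) ** 2 * u(x, t)

worst = max(abs(nls_residual(dark_soliton, x, t))
            for x in (-2.0, -0.3, 0.0, 1.1) for t in (0.0, 0.4))
print(worst)
```

Unlike the real KdV amplitude, the solution carries a phase: it jumps by π across the dip, which illustrates why the NLS equation is more naturally read as an equation for a complex amplitude.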
The NLS equation differs from the KdV equation in that in the KdV case
the physical wave amplitude u(x, t) is real, whereas the NLS equation
is more naturally interpreted as a wave equation for a complex
amplitude. The Lax equation for the complexified NLS equation is a
Dirac-like matrix equation

$$M\Psi(x, z) = z\Psi(x, z), \tag{5.34}$$

where

$$M = i \begin{pmatrix} d/dx & -q \\ p & -d/dx \end{pmatrix}. \tag{5.35}$$

The Baker wave function has the form

$$\Psi(z, x) = \begin{pmatrix} \psi_1 \\ \psi_2 \end{pmatrix}, \qquad \psi_1 = \exp\left(-izx + \int_{x_0}^{x} q(x')\, \phi(z, x')\, dx'\right), \qquad \psi_2 = \phi(z, x)\, \psi_1, \tag{5.36}$$

where φ(z, x) is a meromorphic function with asymptotic form
$(1 + a_1(x)/z + a_2(x)/z^2 + \cdots)$. Just as in the KdV case, the variable
z refers to a point on a canonical Riemann surface associated
with eigenvalues of the Lax operator. Also, as in the KdV case,
exact solutions for all the quantities in Eq. (5.36) can be found [59]
in terms of Θ-functions, and in addition the reward function q(x)
can also be expressed in terms of a τ-function in a fashion analogous
to Eq. (5.16). As in the KdV case, the appearance of a
ratio of τ-functions in the expression for φ(z, x) explains the
emergence of a meromorphic (as opposed to holomorphic) structure for
ψ(x, P). However, in contrast with the KdV equation, the nonlinear
Schrödinger equation mixes forward and backward propagating
modes in essentially the same way that the Dirac equation connects
positive and negative energy states when there is a nonzero potential.
Dirac’s original motivation was to develop a relativistic version
of the Schrodinger equation [35]. The original version of quantum
mechanics allowed nonlocal physical effects that violated the princi-
ple of relativity. In his Theory of Positrons [64], Feynman provided a
clear explanation as to why the Dirac equation cures this difficulty.
Formally, the way the Dirac equation solves the problem with causal-
ity is to include negative energy modes that propagate backwards in
time. This also means though [127] that it is more natural to regard
the solution of the Dirac equation as a quantum field than a classical
wave function as in Schrodinger’s approach to quantum mechanics.

For similar reasons in the case of the NLS equation it is almost nec-
essary from the beginning to recognize that the ψ(x, P ) amplitudes
are quantum fields.
A “second quantized” version of the NLS equation was introduced
by Faddeev et al. [125]. Their Quantum Inverse Scattering formalism
allows one to express the τ-function and Baker function for the
second quantized NLS equation in terms of expectation values for
products of creation and annihilation operators for the oscillator
array. The Hamiltonian for the NLS model introduced by Faddeev and
co-workers is the same as the Hamiltonian for a discretized version of
a 1D gas of strongly repulsive bosons, with interactions:

$$H = \int dx \left[\partial_x \Psi^\dagger\, \partial_x \Psi + c\, \Psi^\dagger \Psi^\dagger \Psi \Psi\right], \tag{5.37}$$

where the bosonic operators $\Psi_n$, n = 1, . . . , M, satisfy the usual
commutation relations for Bose fields [127]:

$$[\Psi_m, \Psi_n] = [\Psi_m^*, \Psi_n^*] = 0, \qquad [\Psi_m, \Psi_n^*] = \frac{1}{\Delta}\, \delta_{mn}. \tag{5.38}$$
Actually, for a finite number of oscillators the Hamiltonian (5.37) can
be replaced by a many-body quantum mechanical problem defined
by a Hamiltonian

$$H_N = -\sum_{j=1}^{N} \frac{\partial^2}{\partial z_j^2}, \tag{5.39}$$

with a boundary condition

$$\left(\frac{\partial}{\partial z_{j+1}} - \frac{\partial}{\partial z_j} - c\right) \chi_N = 0, \quad \text{for } z_{j+1} = z_j + 0.$$

The energy eigenfunctions have the form

$$|\Psi_N\rangle = \frac{1}{\sqrt{N!}} \int d^N z\; \chi_N(z|\lambda)\, \Psi^\dagger(z_1) \cdots \Psi^\dagger(z_N)\, |0\rangle. \tag{5.40}$$

Apart from a normalization factor, $\chi_N$ is given by the Bethe
ansatz:

$$\chi_N \propto \prod_{N \ge j > k \ge 1} [\mathrm{sgn}(z_j - z_k)] \sum_P (-1)^{|P|} \exp\Big(i \sum_{n=1}^{N} z_n \lambda_{Pn}\Big), \tag{5.41}$$

where λ is the spectral parameter and $E_N = \sum_{j=1}^{N} (\lambda_j^2 - \mu_F)$. The
momentum eigenvalues associated with excitations of the Bethe vacuum
are determined by

$$\exp(iL\lambda_j) = (-1)^{N-1}. \tag{5.42}$$
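The quantization condition (5.42) is simple enough to solve by hand: the allowed $\lambda_j$ are integer multiples of 2π/L when N is odd and half-odd-integer multiples when N is even. The sketch below lists the N smallest distinct solutions, taken here as an assumed model of the ground state, and sums $\lambda_j^2$, omitting the $\mu_F$ offset in $E_N$. The helper name and the values of N and L are illustrative.

```python
import cmath
import math

def ground_state_momenta(n_particles, length):
    """N distinct solutions of exp(i*L*lambda) = (-1)**(N-1), centred on zero."""
    return [2.0 * math.pi / length * (j - (n_particles + 1) / 2.0)
            for j in range(1, n_particles + 1)]

def satisfies_bethe_condition(momenta, length):
    """Check Eq. (5.42) for every momentum in the list."""
    sign = (-1) ** (len(momenta) - 1)
    return all(abs(cmath.exp(1j * length * lam) - sign) < 1e-12
               for lam in momenta)

N, L = 5, 10.0
momenta = ground_state_momenta(N, L)
kinetic_energy = sum(lam**2 for lam in momenta)   # sum of lambda_j^2 only
print(momenta, kinetic_energy)
```

The momenta fill a symmetric interval around zero exactly as free fermions would, which is why a Fermi weight appears in the thermal kernel later in this section.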

The quantities that would most naturally play the role of the real
valued kernel functions K(x, y) that appear in the theory of the KdV
equation are the equal time correlation functions

$$Q(x_1, x_2) = \left\langle \int_{x_1}^{x_2} \Psi^\dagger(y)\, \Psi(y)\, dy \right\rangle, \tag{5.43}$$

where the bracket is evaluated in either the Bethe ground state or
some combination of excited Fock states. As shown in [126], the
quantum correlation function $Q(x_1, x_2)$ satisfies differential and
integral equations similar to those satisfied by the self-reproducing
kernels in the classical theory of stochastic estimation. A typical
result is that when the background state is a thermal state, then the
analog of the familiar covariance matrix K(x, y) used for stochastic
estimation is

$$K(\lambda, \mu) = \sqrt{\vartheta(\lambda)}\; \frac{\sin(\lambda - \mu)}{\lambda - \mu}\; \sqrt{\vartheta(\mu)}, \tag{5.44}$$

where ϑ(λ) is the Fermi weight $1/[1 + \exp(\lambda^2 - \beta)]$. The analog of the
conventional scattering problem is the limit |λ| → ∞. Because we have
analytic expressions for these kernel functions, we can naturally make
contact between traditional methods of data analysis and our use
of integrable models to make predictions for optimal control or RL
strategies.
The τ-function τ(λ) is defined to be the trace of the analog of the
S-matrix [125]:

$$\tau(\lambda) = \mathrm{tr}\, T(L, 1|\lambda), \tag{5.45}$$

which in turn is a product of the lattice displacement operators
L(n|λ):

$$T(L, 1|\mu) = L(L|\mu) \cdots L(1|\mu).$$

Each L(n|μ) factor is a discretization of the Lax displacement
operator:

$$L(n|\mu) = \begin{pmatrix} 1 + (-1)^n \frac{\Delta}{4} - i\mu\Delta/2 & -i\Delta \Psi_n^* \\ i\Delta \Psi_n & 1 + (-1)^n \frac{\Delta}{4} + i\mu\Delta/2 \end{pmatrix}, \tag{5.46}$$

where Δ is the “lattice spacing”. These 2 × 2 matrices play much
the same role as the classical Lax operator for integrable nonlinear
differential equations. We have thus come full circle, in the
sense that the τ-function had its origin in the mathematics of the
Ising model [120].

5.5. Galois Remembered

Of course, the foregoing discussion of integrable models raises the
question as to whether the use of the KdV or NLS equations to
define the reward function for feedback control or RL is sufficiently
general to solve all pattern recognition, feedback control, or RL
problems of interest. One possible answer to this question is provided
by the Helmholtz machine [7,8], where it is not necessary to define a
priori the models for a system or environment. Instead, the Helmholtz
machine uses the input data itself to define an adversarial network
scheme to represent the input data, in much the same way that the
Boltzmann machine [1] is used to represent a set of data as a thermal
state of spins.
The penultimate fundamental discovery that we would like to
highlight is the cornucopia of mathematical results that have flowed
from the notes that Evariste Galois scribbled in 1832 on the evening
before the duel which took his life. Ten years after Galois’s death the
scope of his accomplishment eventually came to light when Joseph
Liouville announced that upon perusal of Galois’s notes he found that
Galois had solved in a particularly elegant way the problem of deter-
mining when an algebraic equation could be solved by combining the
usual arithmetic operations of addition, subtraction, multiplication,
and division with the operation of taking the nth root of a rational
number. Galois’s great achievement was to link this problem to group
theory [92]. His method involved translating the problem of solving
an algebraic equation into a problem in group theory. Following the
publication of Galois’s idea, several mathematicians pursued the idea
of developing similar group theory approaches for solving differential
equations. However, these efforts only achieved results that are deci-
sive for our enterprise in conjunction with efforts to solve the “inverse
Galois program” of determining whether a given group is the Galois
group of some field extension.
Digging a little deeper one can perhaps glimpse that our quantum
approach to Bayesian model selection is the fruit of a marriage
between Galois’s theory of solvability and Weyl’s premonition of an
intimate link between group theory and quantum mechanics. In
particular, beneath the association of integrable differential
equations, the theory of functions on a Riemann surface, and Bayesian
inference, we are led back to a Shakespearean tragedy that occurred in
1832, when Evariste Galois lost his life in a duel, just a day after
he had scribbled some notes which are now recognized as one of the
citadels of human achievement. The focus of his notes was the problem
of solving algebraic equations using arithmetic and square root
operations in much the same way one learns in elementary algebra how
to solve quadratic equations. Galois observed that there is a
correspondence between the “normal” subgroups of the automorphism
group of the field of rational numbers extended by the roots of a
polynomial equation and the arithmetic operations used to solve the
polynomial equation (see [92]). A normal subgroup was defined by
Galois as a subgroup which leaves the rational numbers used to define
the field extensions invariant. As was first glimpsed by David Hilbert
[128], the whole structure of using group theory to solve polynomial
equations can be replicated by considering the fields of meromorphic
functions on coverings of a Riemann surface as an analog of Galois’s
field extensions of rational numbers by the square roots of prime
numbers.
One begins to get a sense that what really underlies Bayesian
machine learning is Galois’s theory of the solvability of polynomial
equations. One area of unfinished business for mathematics in the
21st century is to extend Galois’s insights regarding the solvability
of polynomial equations to the integrability of differential equations.
It is now understood through [93] that the analog of the Galois fields
that play a central role in the solvability of polynomial equations is
the field of rational functions on a Riemann surface. This is congruent
with our expectation that solvability of the KdV and/or nonlinear
Schrödinger equation is what underlies Bayesian machine learning,
since in either case the construction of a certain rational function on
a Riemann surface plays a central role in constructing both the exact
solution of these equations and the transition amplitude in Eq. (1.4).
The important role played by a characteristic rational function of a
complex variable actually goes back to the original work of Wiener
and Kalman on linear filters [83] for signal analysis or optimal control.
Indeed, one might consider our quantum approach to automating
Bayesian inference as the fruit of a marriage between Galois’s theory
of solvability and Weyl’s premonition of a close link between group
theory and quantum mechanics [44].
The business of machine learning typically amounts to deter-
mining the meromorphic prefactor in (5.30). This characterization
of machine learning in turn suggests a parallel between our quan-
tum mechanical approach to machine learning and Galois theory.
In particular, Eq. (5.30) suggests that the space of meromorphic
functions on a Riemann surface can be regarded as an extension
of the space of holomorphic functions of a complex variable z in
a manner reminiscent of the way that Galois theory provides an
elegant characterization of the solvability of algebraic equations in
terms of an extension of rational arithmetic operations by nth-root
operations. (For an entertaining introduction to the connection
between Galois theory, the solvability of ordinary differential
equations, and meromorphic functions, see Kuga’s Galois’ Dream [93].)
In the context of using representations of the quantum commuta-
tion relations (5.38) (see also Appendix D) to find the form of the
meromorphic function in (5.30), the role of group theory in Galois’s
original approach to finding the roots of algebraic equations will evi-
dently be played by quantum mechanics. To some extent this reflects
Hermann Weyl’s spotlight [44] on the importance of group theory
for quantum mechanics. An intimate connection between the contin-
uous group SU(N) and quantum mechanics will be one of the main
threads of our presentation. One of our ambitions in the following will
be to assess the plausibility that this thread will eventually lead to
practical methods for using quantum mechanics for solving machine
learning problems, although admittedly at this time we can only
offer theoretical arguments rather than experimental results to
support this expectation. Our theoretical arguments will be based on the
intuitive idea that quantum machine learning can be thought of as a
form of Galois theory. We can immediately offer the observation [42]
that the representations of SU(N) can be classified using the same
permutation groups that play an important role in Galois’s theory of
algebraic equations. The connection between Galois theory and Rie-
mann surfaces also appears [128] in the problem of Bayesian model
selection, which shows that the Bayesian model selection problem
and its connection with quantum mechanics has very deep mathe-
matical roots.
Chapter 6

Quantum Tools

6.1. Weyl Remembered

In general quantum mechanics, Hilbert spaces can be defined as finite


or infinite dimensional vector spaces that give rise to representations
for the Weyl–Heisenberg group; i.e. the continuous 3-dimensional
group obtained by exponentiation of the Heisenberg commutation
relations [44,94]. (In the mathematics literature, this group is called
the Heisenberg group. We have restored Weyl’s name because this
group was first introduced by Weyl.) The generators for the Weyl–
Heisenberg group are a shift operator S(y), and an “automorphic”
translation T (x):

\[ S(y) = \exp(-i y\cdot\partial/\partial\xi), \qquad T(x) = e^{i x\cdot y/2}\exp(i x\cdot\xi), \tag{6.1} \]

where x, y, and ξ are d-dimensional vectors of continuous real
variables. Apart from the factor $e^{i x\cdot y/2}$, the operators S(y) and
T(x) are familiar as the building blocks X(s) = exp(−isp̂) and
Z(t) = exp(itq̂) for universal quantum computations with quantum
states parameterized by continuous variables [12].
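
A finite-dimensional analogue can make the role of these operators concrete. The following sketch is an illustration only and is not part of the construction above: it assumes a d-level truncation in which the shift and phase operators become the standard d × d "shift" and "clock" matrices (the names `clock_and_shift`, `X`, `Z`, and `omega` are ours), and it checks the Weyl form of the commutation relations, ZX = ωXZ.

```python
import numpy as np


def clock_and_shift(d):
    """Return the d x d shift matrix X, clock matrix Z, and omega = exp(2*pi*i/d)."""
    omega = np.exp(2j * np.pi / d)
    # Shift: X|k> = |k+1 mod d>
    X = np.roll(np.eye(d), 1, axis=0)
    # Clock: Z|k> = omega^k |k>
    Z = np.diag(omega ** np.arange(d))
    return X, Z, omega


d = 5
X, Z, omega = clock_and_shift(d)

# Weyl form of the Heisenberg commutation relations in d dimensions
weyl_relation_holds = np.allclose(Z @ X, omega * (X @ Z))
print("Z X = omega X Z:", weyl_relation_holds)
```

In this truncated setting both operators are unitary and satisfy X^d = Z^d = 1, which is the discrete counterpart of restricting the continuous shifts to a lattice.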
What originally attracted the attention of mathematicians to the
Weyl–Heisenberg group was the observation that the original Heisen-
berg commutation relations don’t make rigorous mathematical sense
in some important cases, e.g. in the case of action/angle variables
which are used to represent completely integrable classical dynamical
systems [91]. According to the celebrated theorem of Stone and
von Neumann [94], all irreducible unitary representations of this
group have the following form (see also Mumford's Tata Lectures
[96]):
\[ \pi_\lambda(x, y, t)\,\varphi(\xi) = e^{i\lambda t}\, e^{i\lambda(x\cdot\xi + \frac{1}{2} x\cdot y)}\, \varphi(\xi + y), \tag{6.2} \]

where ϕ is a square integrable function of n real variables and x, y,
and ξ are real vectors. Thus, a general element of the Weyl–Heisenberg
group acts as a translation by y followed by a multiplication by an
exponential factor involving x as a "wave number" and a scalar t. This 3D
group can also be realized as upper triangular (n + 2) × (n + 2) matrices
of the form
\[ I + \begin{pmatrix} 0 & x & t \\ 0 & 0 & y \\ 0 & 0 & 0 \end{pmatrix}. \tag{6.3} \]

This group is "nilpotent": the Lie algebra for the Heisenberg group
has the same form as the above matrix minus the identity operator,
i.e. it consists of strictly upper triangular matrices. (Incidentally,
this is the mathematically rigorous formulation of the original matrix
mechanics of Heisenberg, Born, and Jordan [42].)
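
The matrix realization (6.3) is easy to explore numerically. The sketch below assumes the simplest case n = 1, so that x, y, and t are scalars and the matrices are 3 × 3; the helper name `heis` is ours. It checks the group multiplication law, the nilpotency of the strictly upper triangular part, and the fact that group commutators are central.

```python
import numpy as np


def heis(x, y, t):
    """3 x 3 Heisenberg group element I + [[0, x, t], [0, 0, y], [0, 0, 0]]."""
    return np.array([[1.0, x, t],
                     [0.0, 1.0, y],
                     [0.0, 0.0, 1.0]])


x1, y1, t1 = 0.3, -1.2, 0.5
x2, y2, t2 = 1.1, 0.4, -0.7
g1, g2 = heis(x1, y1, t1), heis(x2, y2, t2)

# Multiplication law: (x1, y1, t1)(x2, y2, t2) = (x1 + x2, y1 + y2, t1 + t2 + x1*y2)
product = g1 @ g2
expected_product = heis(x1 + x2, y1 + y2, t1 + t2 + x1 * y2)

# Nilpotency: the strictly upper triangular part cubes to zero
N = g1 - np.eye(3)
N_cubed = N @ N @ N

# The group commutator lies in the center (only the t entry survives)
commutator = g1 @ g2 @ np.linalg.inv(g1) @ np.linalg.inv(g2)
expected_commutator = heis(0.0, 0.0, x1 * y2 - x2 * y1)
```

The extra term x1*y2 in the multiplication law is the matrix counterpart of the phase factor that appears when two Weyl–Heisenberg operators are composed.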
Among the representations of the Weyl–Heisenberg group, the
representations related to the energy eigenstates of a harmonic
oscillator will be of special importance to us. In particular, Bargmann
introduced a type of quantum coherent state for 1D quantum
oscillators, known as the BFS states [94]. The Bargmann–Segal
transform [95] is a map f(x) → F(z) from square integrable functions on
Euclidean space $\mathbb{R}^d$ to holomorphic functions on $\mathbb{C}^d$:
\[ F(z) = \frac{1}{\pi^{d}} \int e^{(-z^2 + 2\sqrt{2}\, x\cdot z - x^2)/2}\, f(x)\, dx. \tag{6.4} \]
The power of the BS transform (6.4) is that a space of holomorphic
functions defined on a compact domain can be mapped to a compact
space of harmonic oscillators with real valued wave functions.
Because of the ubiquitous importance of holomorphic functions for
Bayesian learning, this result is potentially of great interest to us.
In 1928, Fock had observed [94,95] that, regarded as operators in
a Hilbert space of holomorphic functions, z and d/dz obey the same
commutation relations as the creation and annihilation operators for
a quantum harmonic oscillator. Following Fock, Bargmann's original
paper [95] introduced, as an alternative to the Fock space of energy
eigenstates, the vector space of holomorphic functions defined on some
domain U of $\mathbb{C}^N$ whose basis is the set $\{z^n/\sqrt{n!}\}$ where
$z \in \mathbb{C}^N$, and the normalization integral includes a factor
$\mu(z) = \frac{1}{\pi^d}\exp(-|z|^2)$. Of course, the normalization
integral must necessarily include an additional factor to make it
finite. The reproducing kernel is

\[ K(z, w) = e^{z\cdot w}. \tag{6.5} \]

This kernel is reproducing in the sense that any holomorphic function
f(z) in U can be written in the form

\[ f(z) = \int K(z, w)\, f(w)\, \mu(w)\, dw. \tag{6.6} \]

The presence of the factor μ(w) in this equation means that the
displacement operator f (z) → f (z − a) is not unitary; instead one
represents displacements with a unitary operator
\[ T_a f(z) = e^{-\frac{|z|^2}{2} + z\cdot a}\, f(z - a). \tag{6.7} \]

As first noted by Fock, the operators A = d/dz and A∗ = z satisfy
the canonical commutation relations; i.e.

\[ [A, A^{*}] = 1. \tag{6.8} \]
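
The relation (6.8) can be checked directly on polynomials, which are the simplest holomorphic functions. The following is a minimal sketch for a single complex variable, assuming a function f(z) is stored as its array of Taylor coefficients; the names `A`, `A_star`, and `commutator` are ours.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def A(coeffs):
    """Annihilation-type operator d/dz acting on polynomial coefficients."""
    return P.polyder(coeffs)


def A_star(coeffs):
    """Creation-type operator: multiplication by z."""
    return P.polymulx(coeffs)


def commutator(coeffs):
    """Coefficients of [A, A*] f = d/dz (z f) - z (df/dz)."""
    return P.polysub(A(A_star(coeffs)), A_star(A(coeffs)))


# f(z) = 2 - z + 0.5 z^2 + 3 z^3
f = np.array([2.0, -1.0, 0.5, 3.0])
z_samples = np.array([0.3 + 0.2j, -1.0 + 0.5j, 2.0 - 1.5j])

lhs = P.polyval(z_samples, commutator(f))
rhs = P.polyval(z_samples, f)
print("[A, A*] f == f:", np.allclose(lhs, rhs))
```

On the monomial z^n the two operators act as A z^n = n z^(n−1) and A∗ z^n = z^(n+1), which is the ladder structure of the oscillator number states.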

Also, as a result of these commutation relations, the translation
operators (6.7) satisfy a composition law

\[ T_a T_b f(z) = e^{-i\,\mathrm{Im}(a\cdot b)}\, T_{a+b} f(z). \tag{6.9} \]

These BFS states are defined as superpositions of the coherent states
for a quantum harmonic oscillator introduced by Schrödinger of the
form

\[ |\alpha\rangle = \exp(\alpha a^{+} - \alpha^{*} a)\,|0\rangle, \tag{6.10} \]

where $a^{+}$ and $a$ are the creation and annihilation operators for
a quantum harmonic oscillator. These states are also relevant for
the Segal–Wilson construction [75], as well as H∞ control [84].
These BFS states also turned out to be of great practical importance
in quantum optics [129].
Writing D(α) for the operator in Eq. (6.10), one has the composition
law

\[ D(\alpha + \beta) = D(\alpha)\, D(\beta)\, \exp(-i\,\mathrm{Im}[\alpha\beta^{*}]). \tag{6.11} \]

The last factor in Eq. (6.11) is a signature for the fact that the
operators $a^{+}$ and $a$ obey the Fock commutation relation (6.8). In the
position representation commonly used for the Schrödinger equation,
these states have the form
\[ |\alpha\rangle = \exp\left[-\left(\frac{x^2}{2} - \sqrt{2}\,\alpha x + \frac{\alpha^2}{2}\right)\right]. \tag{6.12} \]

These |α⟩ states are not orthogonal, but have an overlap

\[ |\langle\beta|\alpha\rangle|^2 = e^{-|\beta - \alpha|^2}. \tag{6.13} \]
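
The overlap formula (6.13) can be verified numerically in a truncated number-state basis. The sketch below assumes the standard expansion of a coherent state in Fock states, with coefficients exp(−|α|²/2) α^n/√(n!); the function name `coherent_state` and the cutoff `dim` are ours, and the cutoff must be large compared with |α|².

```python
import numpy as np
from math import factorial


def coherent_state(alpha, dim=60):
    """Fock-basis amplitudes <n|alpha> for n = 0..dim-1 (truncated)."""
    n = np.arange(dim)
    sqrt_fact = np.sqrt(np.array([factorial(int(k)) for k in n], dtype=float))
    return np.exp(-abs(alpha) ** 2 / 2) * complex(alpha) ** n / sqrt_fact


alpha, beta = 1.2 + 0.5j, -0.4 + 1.1j
state_a = coherent_state(alpha)
state_b = coherent_state(beta)

overlap = abs(np.vdot(state_b, state_a)) ** 2      # |<beta|alpha>|^2
gaussian = np.exp(-abs(beta - alpha) ** 2)         # right-hand side of (6.13)
print(overlap, gaussian)
```

Viewed as a function of α and β, the overlap is a squared exponential in the distance |β − α|, which is what suggests using it as a kernel.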

It is of course interesting that these coherent quantum states
overlap in a way that resembles the popular squared exponential kernel
that is so useful in data analysis. In many ways, the coherent states
defined by Eq. (6.12) are the canonical choice for a representation
of the Heisenberg group, and by providing an entrée for holomorphic
functions, these functions play a central role in our quantum
representations for Bayesian learning. These states are also closely
related to the radial basis functions
$\psi(x, 0) = \exp(-\frac{m\omega}{4\hbar}(x - a)^2)$ that are
commonly used in machine learning [6]. The propagator for these
states is [47]:
\[ \int_{-\infty}^{\infty} \psi_b^{*}(x_b)\, K_0(x_b, T; x_a, 0)\, \psi_a(x_a)\, dx_b\, dx_a = \exp\left[-\frac{i\omega T}{2} - \frac{m\omega}{4\hbar}\left(a^2 + b^2 - 2ab\, e^{i\omega T}\right)\right]. \tag{6.14} \]

Starting from a state ψ(x, 0), the state after time T is

\[ \psi(x, T) = \exp\left[-\frac{i\omega T}{2} - \frac{m\omega}{4\hbar}\left(x^2 - 2ax\, e^{-i\omega T} + a^2\cos(\omega T)\, e^{-i\omega T}\right)\right]. \]
If, in addition, there is a linear coupling to another oscillator,
Eq. (6.14) becomes

\[ F(b, a) = \exp\left[-\frac{i\omega T}{2} - \frac{m\omega}{4\hbar}\left(a^2 + b^2 - 2ab\, e^{i\omega T}\right) + \frac{m\omega}{2\hbar}\left(a\beta + b\beta^{*} e^{-i\omega T}\right) + \cdots\right], \]

where $\beta = \frac{1}{\sqrt{2m\omega}}\int_0^T f(t)\, dt$ and the dots are a term quadratic in the
external force f(t) acting on the oscillator.
Another notable way of relating holomorphic functions and quantum
harmonic oscillators involves N-dimensional quantum oscillator
states and the Fourier–Wigner transform [96]. In particular, any
function f(z · t) can be reconstructed from its Fourier–Wigner transform:

\[ \tilde{f}(z) = \sum_{\alpha,\beta} (f, \Phi_{\alpha,\beta})\, \Phi_{\alpha,\beta}, \tag{6.15} \]

where $f \in L^2(\mathbb{C}^N)$. Here α, β are integer multi-indices and $\Phi_{\alpha,\beta}$ has the form
\[ \Phi_{\mu,\nu}(z) = \frac{1}{(2\pi)^{N/2}} \int e^{i x\cdot\xi}\, \Phi_\mu\!\left(\xi + \frac{y}{2}\right) \Phi_\nu\!\left(\xi - \frac{y}{2}\right) d\xi, \tag{6.16} \]

where $x, y, \xi \in \mathbb{R}^N$ and $\mu, \nu \in \mathbb{Z}^N$, and $\Phi_\mu(x)$ is the wave function
for an array of quantum oscillators where just 1 energy level per
oscillator is occupied:
\[ \Phi_n = \prod_{i=1}^{N} H_{n_i}(x_i), \tag{6.17} \]

where $H_{n_i}$ is the Hermite function for a single Fock state,
$n = \{n_j\} \in \mathbb{Z}^N$ and $x = \{x_j\} \in \mathbb{R}^N$. It can be shown that the
set of functions $\{\Phi_n\}$ for all n provides a basis for the Hilbert
space $L^2(\mathbb{R}^N)$. This Hilbert space, in common with the Hilbert space
for the single quantum oscillator, is infinite dimensional. However,
it can be truncated in a natural way by restricting attention to
values of $n \in \mathbb{Z}_N$, and using $\log_2 N$ qubits to label the values of n.
These states not only form the basis for the Hilbert space
$L^2(\mathbb{C}^N)$, but also form a space of holomorphic functions that live on
a complex torus $\mathbb{C}^N/\Lambda$, where Λ is a 2D lattice and
$\mu, \nu \in \mathbb{Z}^N/(\mathbb{Z}^N/m)$ where $N = m^2$. The expansion can also be
truncated in a natural way by first projecting the sum (6.15) onto the
"radial wave functions" for an oscillator array to form a
representation for any square integrable function of r = |z| in $\mathbb{C}^N$:

\[ f(r) = \sum_{k=0}^{\infty} \left[\int_0^{\infty} f(s)\, \varphi_k(s)\, s^{2n+1}\, ds\right] \varphi_k(r), \tag{6.18} \]

where the $\varphi_k(r) = \left[\frac{2^{-n} k!}{(k+n)!}\right]^{1/2} r^{2}\, e^{-r^2/4}\, L_k^n(r)$ are the generalized Laguerre
functions which also appear in elementary quantum mechanics [14].
The expansion (6.18) can also be expressed as a projection:
\[ P_k f(z) = \sum_{|\beta|=k} \sum_{\alpha} (f, \Phi_{\alpha,\beta})\, \Phi_{\alpha,\beta}, \]

where $P_k$ projects f onto the space spanned by
$\{\Phi_{\alpha,\beta},\ \alpha, \beta \in \mathbb{Z}^N,\ |\beta| = k\}$. One interesting thing about this
expansion is that it can be naturally truncated to a finite sum

\[ Q_k f(z) = \sum_{|\mu|=k} (f, \Phi_{\mu-m,\mu})\, \Phi_{\mu-m,\mu}, \tag{6.19} \]

which transforms homogeneously under the torus subgroup of the
group of n × n unitary matrices acting on an n-dimensional Hilbert
space; i.e. under $z \to e^{i\theta} z$

\[ f(e^{i\theta} z) = \sum_{m} Q_k f(z)\, e^{i m\cdot\theta}. \tag{6.20} \]

The finite dimensional Hilbert space spanned by
$\{\Phi_{\alpha,\beta},\ \alpha, \beta \in \mathbb{Z}^N,\ |\beta| = k\}$ is evidently a close relative of the
Θ-functions that were used [54–57] to represent Riemann surfaces and
which play a central role in finding exact solutions.
The Θ-functions of interest in connection with Riemann surfaces
are defined by the relation [54–56]:
\[ \theta\begin{bmatrix} a \\ b \end{bmatrix}(z) = e^{2\pi i\, a\cdot b}\, X(b)\, Z(a)\, \theta(z), \tag{6.21} \]

where a, b are real vectors, and X and Z are the Weyl–Heisenberg
group elements (see Appendix C)

\[ X(y) = \exp(-i y\cdot\partial/\partial\xi) \quad \text{and} \quad Z(x) = \exp(i x\cdot\xi). \tag{6.22} \]

In the case of interest, the shifts x and y are restricted to the
integers mod n, and z lies on a complex torus $\mathbb{C}^g/\Lambda$, where Λ is a
2g-dimensional lattice. The indexed Θ-functions (6.21) were originally
introduced [54] by Solomon Lefschetz as the coordinates of a Riemann
surface embedded in flat projective space $P^N$. (This embedding is of
particular importance in mathematics because it means that Riemann
surfaces are "algebraic varieties".) Remarkably, the set of functions
defined in Eq. (6.21) form a "reproducing" kernel space (cf. [99])
of dimension $n^{2g}$. The term reproducing means that they are the
eigenfunctions for the defining kernel, which in the case of (6.21) is
the closed string theory propagator used in relativistic string
theory [100]. These functions can also be constructed [55] by first
identifying their value at a reference point z = 0, and then using
the Weyl–Heisenberg shift operators, Eq. (6.2), to define their
values over the entire Riemann surface. It was Lefschetz's discovery of
the embedding of Riemann surfaces in projective space using these
functions that allowed quantum mechanics to emerge from algebraic
geometry. (For a detailed discussion of theta functions with
characteristics, see Griffiths and Harris's Principles of Algebraic
Geometry [54] or Mumford's more succinct Tata on Theta [56].)

6.2. Helstrom’s Theorem and Universal Hilbert Spaces

The primary task of quantum pattern recognition might be viewed
[99] as choosing a feature Hilbert space $H_F$ and a map x → Ψ(x)
from input data to the space $H_F$ such that the features represented
in the data are easily distinguished. Because the number of quantum
states that can be represented with even a finite set of basis states
is literally infinite, it might seem that there would be an enormous
advantage to storing data features as quantum states. However, this
is probably a chimera because one must take into account that there
are strict limits in how much information can be stored in quantum
states. The key to understanding this is Helstrom’s theorem [102],
which places strict limits on the distinguishability of two quantum
states. Helstrom’s theorem plays a role in quantum Bayesian infer-
ence that is analogous to the singular role that the Neyman–Pearson
test plays in classical Bayesian approaches to data interpretation.
One of the advantages of quantum information processing is that
as a consequence of Helstrom’s theorem, one is able to immediately
attach information theoretic significance to the data features regard-
less of whether these features are Gaussian distributed variables.
One of the enigmas of quantum information processing is how to
encode experimental data as quantum states. For example, in the
classical case, if one wants to know how many measurements are needed
to distinguish two Gaussian distributed variables, one only needs
to know the estimated mean and variance for the two variables in
order to determine, for example, the probability of false alarm (PFA);
i.e. the probability that the two variables were really the same even
when the measurement suggested they were different. However, in
quantum mechanics the wave functions themselves are deterministic
rather than probabilistic quantities. Therefore, in quantum mechanics
there is no automatic way of associating information with a state in
Hilbert space. Nevertheless, there is a simple and universal way of
estimating the PFA for quantum measurements. Namely, the probability
of "false alarms" is elegantly provided by Helstrom's theorem:

\[ \mathrm{PFA} = 1 - \sqrt{1 - \eta}, \tag{6.23} \]
where $\eta = |\langle\Psi_1|\Psi_2\rangle|^2$. This estimate for the PFA is independent
of the number or type of measurements. Thus, Helstrom’s theorem
does provide a limitation on how well quantum measurements can
reproduce Bayes’s conditional probabilities. However, in practice the
statistical uncertainties associated with weak measurements typically
obscure this limitation. On the other hand, as noted in the intro-
duction, for the most part we are going to restrict our attention to
weak measurements, which allows Bayesian conditional probabilities
to appear in a completely natural way.
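
As a numerical illustration of (6.23), the sketch below computes the overlap η for two pure states and evaluates the bound. Two expressions are shown: `pfa_eq_6_23` implements the formula exactly as written in (6.23), while `helstrom_min_error` is the commonly quoted form of Helstrom's minimum error probability for two pure states with prior probabilities p1 and 1 − p1, which we include as an assumption for comparison (it differs from (6.23) by a factor of 1/2 for equal priors). All function names are ours.

```python
import numpy as np


def overlap_eta(psi1, psi2):
    """Return eta = |<psi1|psi2>|^2 for two (not necessarily normalized) state vectors."""
    psi1 = np.asarray(psi1, dtype=complex)
    psi2 = np.asarray(psi2, dtype=complex)
    psi1 = psi1 / np.linalg.norm(psi1)
    psi2 = psi2 / np.linalg.norm(psi2)
    return abs(np.vdot(psi1, psi2)) ** 2


def pfa_eq_6_23(eta):
    """PFA exactly as written in Eq. (6.23)."""
    return 1.0 - np.sqrt(1.0 - eta)


def helstrom_min_error(eta, p1=0.5):
    """Textbook minimum error probability for two pure states with priors p1, 1 - p1."""
    return 0.5 * (1.0 - np.sqrt(1.0 - 4.0 * p1 * (1.0 - p1) * eta))


# Two qubit states separated by an angle theta on a great circle
theta = np.pi / 3
psi_1 = np.array([1.0, 0.0])
psi_2 = np.array([np.cos(theta), np.sin(theta)])

eta = overlap_eta(psi_1, psi_2)
print("eta =", eta)
print("PFA (6.23) =", pfa_eq_6_23(eta))
print("Helstrom, equal priors =", helstrom_min_error(eta))
```

Both expressions depend on the two states only through the overlap η, which is the point emphasized in the text: the bound does not depend on the number or type of measurements.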
The underlying presence of quantum mechanics in any real example
of observations is revealed by the fundamental limitation (6.23)
on any measurement. In general, machine learning recognizes
patterns [4,7] by constructing a nonlinear mapping, x → z(x), between
a set of input data vectors $\{x^{(n)}\}$ and a smooth interpolation
function z(x) in feature space $H_Z$. As noted in Ref. [6], in many cases
the interpolation function z(x) can be constructed as the least
square estimator based on a correlation function for the data. In
conventional machine learning, e.g. using neural networks, the central
task can be viewed as the hierarchical construction of the nonlinear
map x → z(x) by using the kernel matrix K(x, y) which represents
the features at one level to construct the kernel matrix at the next
level. At each stage of machine learning, it is usually assumed that
the kernel describing correlations in feature space is "reproducing"
in the sense that its eigenvectors are the basis for feature space $H_F$.
Following the legacy of Wiener's 1958 essay [97], interest has
recently increased in using quantum eigenstates and associated
kernels to represent the feature spaces (see e.g. Schuld et al. [98,99]).
For example, the spatial part of the energy eigenstates for a
quantum harmonic oscillator has the form

\[ \phi_k(x) = \exp(-(c - a)x^2)\, H_k(\sqrt{2c}\, x), \tag{6.24} \]

where $a = 1/4\sigma^2$ and $c = (a^2 + 1/4l^2\sigma^2)^{1/2}$, and $H_k(x)$ is a Hermite
polynomial. As it happens, the analytic properties of these
eigenstates make them natural candidates to stand in for GPs. The
reproducing kernel in this case is

\[ K(x, x') = \exp(-(x - x')^2/2l^2). \tag{6.25} \]

As it happens, the usefulness of (6.24) for representing the features
in datasets of practical importance has been recognized for some time.
As discussed in Chapter 3, this kernel function provides a prediction
model for feature labels $l^{(n)}$ attached to a dataset $\{x^{(n)}\}$, when the
least square regression model z(x) for these labels is modeled as a sum
of "radial basis" Gaussian functions centered at discrete points. The
basis for the feature Hilbert space $H_F$ is then provided by the
eigenfunctions of the kernel $\exp(-|x - x'|^2/2\sigma^2)$, which turn out to
have the form of a Gaussian exponential times Hermite polynomials.
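
The statement that the functions (6.24) are eigenfunctions of the kernel (6.25) can be checked by direct quadrature. The sketch below makes two assumptions that are not spelled out in the text: the eigenvalue problem is taken with respect to a Gaussian weight of variance σ², and the eigenvalues are taken to be λ_k = √(2a/A) B^k with b = 1/2l², A = a + b + c, and B = b/A, as in the standard Gaussian-process literature. The names `phi`, `apply_kernel`, `b`, `A`, and `B` are ours.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

sigma, ell = 1.0, 0.7
a = 1.0 / (4.0 * sigma**2)
b = 1.0 / (2.0 * ell**2)
c = np.sqrt(a**2 + 2.0 * a * b)      # equals (a^2 + 1/(4 l^2 sigma^2))^(1/2)
A = a + b + c
B = b / A


def phi(k, x):
    """Eigenfunction (6.24): Gaussian envelope times physicists' Hermite polynomial H_k."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return np.exp(-(c - a) * x**2) * hermval(np.sqrt(2.0 * c) * x, coeffs)


# Quadrature grid and Gaussian weight p(x) = N(0, sigma^2)
x = np.linspace(-12.0, 12.0, 4001)
dx = x[1] - x[0]
p = np.exp(-x**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)


def apply_kernel(k, x_eval):
    """Approximate the integral of K(x_eval, x) phi_k(x) p(x) dx by a Riemann sum."""
    K = np.exp(-(x_eval[:, None] - x[None, :]) ** 2 / (2.0 * ell**2))
    return (K * (phi(k, x) * p)[None, :]).sum(axis=1) * dx


x_eval = np.array([-1.0, 0.3, 1.5])
for k in range(4):
    lam_k = np.sqrt(2.0 * a / A) * B**k
    print(k, np.allclose(apply_kernel(k, x_eval), lam_k * phi(k, x_eval)))
```

Because B < 1, the eigenvalues decay geometrically with k, so in practice only a modest number of oscillator levels is needed to represent the kernel to a given accuracy.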
The reproducing kernel for two quantum harmonic oscillators can
be written in the form

\[ K(x, y) = \sum_{n_1 n_2} \varphi_j(x)\, \varphi_j(y), \tag{6.26} \]

where $\varphi_j$ is an eigenfunction of the Hill–Schrödinger operator
$-\frac{\partial^2}{\partial x^2} - \frac{\partial^2}{\partial y^2} + \frac{1}{2}(x^2 + y^2)$. (This operator first appeared in the 19th
century in connection with Hill's theory of the stability of Lagrange
triangles, but reappeared in 1925 in connection with Schrödinger's
equation for an "upside down" 2D quantum oscillator.) It turns out
that the radial part of the wave function for a 2D quantum oscillator
involves a generalized Laguerre polynomial $L_n^M$ that is closely related
to the generalized Laguerre polynomial that appears in the radial wave
functions for the 2D hydrogen atom problem [14]. This brings us full
circle back to the problem that originally inspired Bayes and Gauss;
i.e. finding the orbital parameters for astronomical objects moving
under the influence of a 1/r potential. It is worth keeping this in mind
because this suggests that the model selection problem that attracted
Gauss’s interest, namely assigning multiple solar system objects to
distinctive orbits, might also be treated as a quantum problem for
multiple oscillators.
This focus on the 2D harmonic oscillator permits us an easy segue
to another very important Hilbert space related to the quantum the-
ory of angular momentum. Following the epochal 1922 discovery
by Stern and Gerlach of “spatial quantization” [131], Wigner and
Racah [113] developed a beautiful formalism for describing angular
momentum states in quantum mechanics. These states are of interest
for quantum machine learning because of a connection between the
energy eigenstates (Fock states) of a quantum oscillator and quan-
tum angular momentum states discovered by Julian Schwinger (when
he was a graduate student!). In On Angular Momentum, Schwinger
describes a very elegant way of constructing the quantum angular
momentum operators $\hat{J}_x$, $\hat{J}_y$, and $\hat{J}_z$ as well as the Wigner–Racah
algebra [113] for representing vector sums of angular momentum in
terms of the annihilation and creation operators for a 2D quantum
harmonic oscillator. (These notes are unpublished, but a brief sum-
mary can be found in [112].) Schwinger’s construction of these states
is based on representing the quantum angular momentum operators
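in terms of the raising and lowering operators for the number states
of a 2D quantum harmonic oscillator:
\[ \mathbf{J} \equiv \sum_{\varsigma,\varsigma'=\pm} a^{+}_{\varsigma} \left\langle \varsigma \left| \frac{\boldsymbol{\sigma}}{2} \right| \varsigma' \right\rangle a_{\varsigma'}, \qquad [a_{\varsigma}, a_{\varsigma'}] = [a^{+}_{\varsigma}, a^{+}_{\varsigma'}] = 0, \qquad [a_{\varsigma'}, a^{+}_{\varsigma}] = \delta_{\varsigma\varsigma'}, \tag{6.27} \]
\[ J_{+} = a^{+}_{1} a_{2}, \qquad J_{-} = a^{+}_{2} a_{1}, \qquad J_{3} = \frac{1}{2}\left(a^{+}_{1} a_{1} - a^{+}_{2} a_{2}\right). \]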
One potential advantage of using the Fock states for a 2D oscillator to
represent quantum angular momentum states is that superconduct-
ing quantum oscillators provide an analog method for representing
these states.
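
Schwinger's construction can be tested with truncated oscillator matrices. The following sketch assumes units with ħ = 1 and a cutoff of d levels per oscillator (the names `d`, `a1`, `a2`, and `safe` are ours). Because truncation spoils the commutation relations at the top level, the angular momentum algebra is only checked on states whose total occupation n1 + n2 stays below the cutoff.

```python
import numpy as np

d = 8                                             # levels kept per oscillator
a = np.diag(np.sqrt(np.arange(1, d)), k=1)        # a|n> = sqrt(n)|n-1>
I = np.eye(d)
a1 = np.kron(a, I)                                # oscillator 1
a2 = np.kron(I, a)                                # oscillator 2

# Schwinger operators (a is real, so the adjoint is the transpose)
Jp = a1.T @ a2
Jm = a2.T @ a1
J3 = 0.5 * (a1.T @ a1 - a2.T @ a2)

# Occupation numbers labelling the basis |n1, n2> (index = n1 * d + n2)
n1 = np.repeat(np.arange(d), d)
n2 = np.tile(np.arange(d), d)
safe = (n1 + n2) <= d - 2                         # states untouched by the cutoff

# [J+, J-] = 2 J3 on the safe states
comm = Jp @ Jm - Jm @ Jp
algebra_ok = np.allclose(comm[:, safe], (2 * J3)[:, safe])

# J^2 = j(j+1) with j = (n1 + n2)/2 on the safe states
J2 = J3 @ J3 + 0.5 * (Jp @ Jm + Jm @ Jp)
j = 0.5 * (n1 + n2)
casimir_ok = np.allclose(np.diag(J2)[safe], (j * (j + 1))[safe])
print(algebra_ok, casimir_ok)
```

The total occupation n1 + n2 = 2j labels the spin, and the difference n1 − n2 = 2m labels the projection, so a pair of oscillators with a few levels each already contains all the angular momentum multiplets up to the cutoff.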

6.3. Measurement-based Quantum Computation

Although our presentation has for the most part ignored the exten-
sive literature on qubit quantum computing, there is one develop-
ment in qubit quantum computing that mirrors our approach to
Bayesian inference: the “measurement-based quantum computing”
formalism of Raussendorf and Briegel [132]. Our approach to find-
ing the optimum strategies for Bayesian search and model selection
problems by encoding both observational data and the conditional
probabilities used in Bayesian inference as self-organized quantum
states is very similar in spirit to using measurements of entangled
states of qubits to carry out quantum computations. In Ref. [132], it
was shown that essentially all quantum computations that have been
contemplated using qubit quantum circuits can also be carried out
by making measurements of qubit states in a 2D array whose quantum
states have become entangled by applying a controlled phase
gate $CZ = \exp(-i\frac{\pi}{4}\sum_{\langle i,j\rangle}\sigma_i^z\sigma_j^z)$ between qubits on neighboring nodes.
Such controlled phase gates can be realized naturally by allowing an
Ising spin-like interaction between neighboring qubits to act for time
intervals analogous to the Rabi time for spin flip in a magnetic field.
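
The relation between the Ising-type evolution and the textbook controlled phase gate can be made explicit for a single pair of qubits. The sketch below is an illustration under our own conventions: it assumes the usual definition CZ = diag(1, 1, 1, −1) and shows that the two-qubit factor exp(−i(π/4)σz⊗σz) reproduces it once single-qubit z-rotations and a global phase are included. The variable names are ours.

```python
import numpy as np

# Diagonal of sigma_z (x) sigma_z in the basis |00>, |01>, |10>, |11>
zz = np.array([1, -1, -1, 1])
ising = np.diag(np.exp(-1j * np.pi / 4 * zz))          # exp(-i pi/4 sz (x) sz)

# Local rotations exp(i pi/4 sz) on each qubit: the phase depends on z1 + z2
z = np.array([1, -1])
local = np.diag(np.exp(1j * np.pi / 4 * np.add.outer(z, z).ravel()))

# Global phase, Ising evolution, and local rotations combined
cz_from_ising = np.exp(-1j * np.pi / 4) * ising @ local

CZ = np.diag([1, 1, 1, -1])
print("equals CZ:", np.allclose(cz_from_ising, CZ))
```

Since all the factors are diagonal they commute, so the single-qubit corrections can be applied before or after the interaction.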
As a simple illustration of how measurement-based quantum com-
puting works for qubits, we consider the problem of teleporting a
bipartite qubit state of the form (α1|0⟩ + β1|1⟩)(α2|0⟩ + β2|1⟩)
from one location to another.
The teleportation is accomplished using the controlled phase gate

\[ CZ = \exp\left(i\, \frac{\pi}{2}\sigma_z \otimes \frac{\pi}{2}\sigma_z\right), \tag{6.28} \]
where $\sigma_z|i\rangle = (-1)^i|i\rangle$, to entangle the states 1 and 3 and 2 and 4.
After initializing the qubits in states 1 and 2 with arbitrary initial
states and the states in 3 and 4 with the Hadamard states |0⟩ + |1⟩,
followed by the application of CZ, the wave function has the
form
\[ (\alpha_1|0\rangle_1 \sigma^z_{(3)} + \beta_1|1\rangle_1)(|0\rangle_3 + |1\rangle_3) \otimes (\alpha_2|0\rangle_2 \sigma^z_{(4)} + \beta_2|1\rangle_2)(|0\rangle_4 + |1\rangle_4). \tag{6.29} \]

After transforming to the conjugate basis $|\pm\rangle = (|0\rangle \pm |1\rangle)/\sqrt{2}$,
this wave function has the form
\[ \sum_{s_i=0,1} \exp\left(-i\frac{\pi}{2} s_i \sigma_3^x\right)(\alpha_1|-\rangle_3 + \beta_1|+\rangle_3) \otimes \sum_{s_i=0,1} \exp\left(-i\frac{\pi}{2} s_i \sigma_4^x\right)(\alpha_2|-\rangle_4 + \beta_2|+\rangle_4). \tag{6.30} \]

Then measuring the eigenvalues of $\sigma_1^x$ and $\sigma_2^x$ yields the initial wave
function defined on the 1 and 2 nodes teleported to the 3 and 4 nodes.
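
The elementary step behind this construction can be simulated for a single pair of qubits. The sketch below is not the four-qubit scheme of (6.29) and (6.30); it is the standard one-qubit building block of measurement-based computing, written under our own conventions: the input qubit is entangled with a |+⟩ ancilla by CZ = diag(1, 1, 1, −1) and then measured in the |±⟩ basis with outcome s. The state reappears on the ancilla as X^s H|ψ⟩, i.e. up to a known correction. The function name `teleport_step` is ours.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
CZ = np.diag([1, 1, 1, -1]).astype(complex)


def teleport_step(psi, s):
    """Entangle |psi> with |+>, measure qubit 1 in the |+/-> basis with outcome s.

    Returns the normalized post-measurement state of qubit 2.
    """
    plus = np.array([1, 1]) / np.sqrt(2)
    state = CZ @ np.kron(psi, plus)                 # ordering: qubit 1 (x) qubit 2
    bra = np.array([1, (-1) ** s]) / np.sqrt(2)     # <+| for s = 0, <-| for s = 1
    out = bra.conj() @ state.reshape(2, 2)          # project out qubit 1
    return out / np.linalg.norm(out)


psi = np.array([0.6, 0.8j])
for s in (0, 1):
    expected = np.linalg.matrix_power(X, s) @ H @ psi
    print(s, np.allclose(teleport_step(psi, s), expected))
```

The outcome-dependent factor X^s is the simplest example of the "known correction factor" that accompanies each measurement in this style of computation.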
As an illustration of the potential usefulness of this type of scheme
for Bayesian searches, we consider the “Monty Hall” search problem
where the location of an object of interest within a linear array of
boxes is being sought [5]. The 2D quantum oscillator array we envi-
sion using for this problem is illustrated in Fig. 6.1.
[Fig. 6.1. Quantum scheme for solving the Monty Hall problem. Top
layer: X1, X2, ..., X*, ..., XN; middle layer: ψ0 = Σ_{n1 n2} |n1 n2⟩;
bottom layer: ψ1(ϕ1, s1), ψ2(ϕ2, s2), ..., ψN(ϕN, sN).]

In this cartoon, each node of the middle layer consists of either
a single quantum oscillator plus a qubit or a pair of quantum
oscillators. This layer is the "quantum computer" which we use to find
the location of the hidden object. The quantum computations can
be carried out with either N levels of the single quantum oscillator
or with [N/2] + 1 levels in each oscillator of a pair of quantum
oscillators. The bottom layer contains the wave-functions representing
the observational data which consists of a sequence of choices
for locations to be searched in boxes and the results of the exami-
nation of the boxes at each location, while the upper “hidden” layer
contains information about possible explanations for the input data.
One might envision a quantum version of the Monty Hall problem
consisting of pairs of quantum oscillator states to encode both the
locations to be searched and the posterior probabilities for finding
the object in each location. The computation consists of introducing
prior information about possible models for the input data into the
hidden layer by teleportation, and then using a sequence of observa-
tions to relax the hidden layer wave function to find at each step of
the search the models which best match the data gathered up to that
time. Whereas in the case of classical Bayesian searches it is typically
an ad hoc assumption that the location of the object being sought
will be identified with a high probability in N or fewer steps, in a
quantum mechanical approach this result is a natural consequence
of the formalism. Indeed, to identify the location of an object that
classically might be in an infinite number of locations, in quantum
mechanical approaches success in N steps appears as a natural result
of "space quantization". As noted in the last section, the position
variable operator $Q_i$ associated with a 2D oscillator has a natural
interpretation as the z-component of the angular momentum operator J.
This opens the door to using spherical harmonic functions to define
quantized locations on a sphere. Although any location on the surface
of a sphere can be precisely defined using an infinite sum of spherical
harmonic functions, one must keep in mind that practical quantum
computations are only possible with a finite number of basis states.
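
For comparison with the quantum scheme, it is useful to have the classical Bayesian search in front of us. The sketch below is a purely classical baseline and makes its own assumptions: the object is hidden in one of N boxes, a search of the correct box detects it with probability `q`, and the searcher always opens the currently most probable box. The function names are ours.

```python
import numpy as np


def update_after_miss(p, i, q):
    """Posterior over boxes after searching box i and finding nothing.

    q is the probability of detecting the object when the correct box is searched.
    """
    post = p.copy()
    post[i] *= (1.0 - q)          # likelihood of a miss if the object is in box i
    return post / post.sum()      # Bayes's rule: renormalize


def greedy_search(prior, q, true_box, rng, max_steps=1000):
    """Search the most probable box at each step; return (steps used, final posterior)."""
    p = prior.copy()
    for step in range(1, max_steps + 1):
        i = int(np.argmax(p))
        if i == true_box and rng.random() < q:
            return step, p
        p = update_after_miss(p, i, q)
    return None, p


rng = np.random.default_rng(0)
prior = np.full(4, 0.25)
steps, posterior = greedy_search(prior, q=0.8, true_box=2, rng=rng)
print("found after", steps, "steps")
```

With perfect detection (q = 1) each miss eliminates a box and the object is found in at most N steps; with imperfect detection the number of steps is random, which is the situation the superposition α|1⟩ + β|0⟩ is meant to represent in the quantum version.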
If the presence or absence of the object at a particular location
cannot be exactly determined, then the spin part of the wave
function in Eq. (6.31) will have to be replaced with a superposition
α|1⟩ + β|0⟩ of the $s_i = 1$ and $s_i = 0$ states which represents the
measurement uncertainty. The wave function describing the input
data accumulated up to the time of the nth step of the search will
therefore have the form
\[ \Psi_{in} = \prod_{i=1}^{n} \sum_{k} \lambda_{ki}\, \big|\psi(\phi - \phi_k),\ \alpha_k|0\rangle + \beta_k|1\rangle\big\rangle \tag{6.31} \]

for i = 1, ..., n, where $\lambda_{ki}$ = 0 or 1 is a parameter which
describes which of the N possible locations is observed at the ith
step of the search. The product extends over all observations up to
the time of the measurement. Adopting the same convention used
in measurement-based quantum computing, we will assume that the
initial wave functions for the hidden layer nodes are the zero modes
|ψ(φ), |+⟩⟩; i.e.
\[ \Psi_0 = \Big|\sum_{k} \psi(\phi - \phi_k),\ |+\rangle\Big\rangle_1 \prod_{i=2} \big|\psi(\phi),\ |+\rangle\big\rangle_i, \]

where ψ(φ) is the best approximation to a delta function that can
be achieved with spherical harmonic functions with l limited to $l_{max}$
and $|+\rangle = |1\rangle_I + |0\rangle_I$. Our Bayesian search will be implemented
by first delocalizing these models within the hidden layer using the
controlled phase gate of the form (6.28) to entangle the nodal zero
mode wave functions in nodes i = 1, 2, 3, 4 (corresponding to N = 4)
of the hidden layer. This leads to an entangled wave function for the
hidden layer:
\[ \Psi_H = \frac{1}{2}\, \Big|\sum_{k} \psi(\phi - \phi_k)\, e^{i L^z_1 L^z_2},\ |0\rangle\sigma^z_2 + |1\rangle\Big\rangle \times \prod_{h=2}^{N} \psi(\phi)\left(|0\rangle\sigma^z_{h+1} + |1\rangle\right)_h, \tag{6.32} \]

where $\sigma^z_{N+1} = 1$. Expanding the r.h.s. of (6.32) after entanglement,
the search proceeds by measuring at each step of the search the
nodal wave functions in a basis corresponding to the input data. In
particular, we propose to measure the hidden layer nodes in a basis
$\psi(\phi - \phi_i)B_i(\theta_i)$ (we assume here that the measurement at each node
can readily determine which "box" is being observed), where
\[ B_i(\theta_i) = \left\{ \frac{\cos(\theta_i/2)|1\rangle - \sin(\theta_i/2)|0\rangle}{\sqrt{2}},\ \frac{\sin(\theta_i/2)|1\rangle + \cos(\theta_i/2)|0\rangle}{\sqrt{2}} \right\} \]
is a two-component spin state that represents the measured spinor
state $\alpha|1\rangle_i + \beta|0\rangle_i$, where θ is the usual Cayley–Klein parameter
for a spinor state corresponding to a rotation about the y-axis of the
initial state |1⟩ + |0⟩. After the measurement of a node state the
result is either a spin variable $s_1$ equal to the first component of
the two-component spin state $B_i(\theta_i)$, or a spin variable $s_2$ equal to the
second component of the state $B_i(\theta_i)$. After measuring the nodal wave
functions in the wave function (6.32), the hidden layer has the form
\[ = \frac{1}{2}\Big(|\psi_1 s_1(\theta_1)\rangle_1\, |\psi_0 s_2\rangle_2\, |\psi_0 s_3\rangle_3\, X(s_1, s_2, s_3) \times \prod_{k} e^{i(\theta_1 \sigma^z_k/2 + \phi_1 L^z_k)}\, |\psi_0\rangle_4\Big), \]

where X is a known correction factor depending on the measured
values of $s_1$, $s_2$, and $s_3$, while the unitary operator
$U = e^{i(\theta_k \sigma^z_k/2 + \phi_k L^z_k)}$
appearing in node 4 brings all the model wave functions closer
to the input wave function at step 1. Introducing additional input
data pushes the model states toward the input states in the order
they appear in the search strategy. Our quantum search amounts to
evolving the initial wave function (6.31) in such a way that as mea-
surement data are accumulated, the wave function in node 4 of the
hidden layer approaches a wave function representing the locations
searched and the posterior probabilities for finding the object at var-
ious locations at each step of the search. Although this excursion into
qubit information processing does not provide much of an improve-
ment over classical search algorithms, it does illustrate the impor-
tance of using measurements of a “momentum” variable to guide the
evolution of the wave function.
Chapter 7

Quantum Self-organization

7.1. Pontryagin Control and Quantum Criticality

As has been emphasized by Kappen [31], if the innovation noise in
a controlled system is below a certain critical level, then the Bellman
control theory reverts to a deterministic form known as Pontryagin
control [86,87]. That is, in the limit when the Bellman function is
optimized and the stochastic control costs are limited, Pontryagin
control takes over. In this limit, solutions of optimal control prob-
lems are typically represented as smooth flows in (x, ẋ) space, where
system and control variables are continuously well defined in the
same way as the position and momenta variables in classical mechan-
ics. The phase space diagram of the moon lander problem discussed
in Chapter 4 is a nice illustration of the qualitative nature of the
solution. In principle, these streamlines can be found by solving the
Hamilton–Jacobi–Bellman (HJB) equation [18,20] for various initial
values x0 . The HJB equation is similar to the classical Hamilton–
Jacobi equation (see e.g. [88]), except that the velocity ẋ is identified
as a control variable u. A generic feature of Pontryagin control is
that, as a function of initial conditions, the ensemble of controlled
paths parameterized by position and control variables resembles the
flow of a 2D fluid. In a quantum regime, such flows can in turn be
represented as solutions of the Gross–Pitaevskii equation [133]
describing the collective state of a multi-boson system. A potential
advantage then of using a quantum wave function to represent a
classical stochastic game is that the ensemble of paths in (x, u)
space is replaced by a single coherent state.

Because of its determinism, classical mechanics does not offer a
mechanism for storing or processing information. In contrast, quantum
dynamics does offer such a possibility because of the direct
connection between the Schrödinger equation and information theory
that was originally discovered by David Bohm [134] as a result of his
attempt to reformulate quantum mechanics as a statistical theory
of hidden variables. Despite this ill-fated expedition, the equations
he wrote down do resonate with the goal of this book. Bohm
reformulated the Schrödinger equation as a nonlinear equation for the
quantum phase that looks suspiciously like the HJB equation that
plays a central role in control theory (see e.g. [18,20]). Our
description of Bohm's result follows Frieden [135] and Reginatto [136].
The Schrödinger equation in a "hyperbolic" space-time; i.e. a space-time
that is layered in such a way as to admit a universal time, has the
form
\[ i\hbar\frac{\partial\psi}{\partial t} = -\frac{\hbar^2}{2}\sum_{i,j}^{d} g^{ij}\frac{\partial^2\psi}{\partial x^i\,\partial x^j} + v(x)\psi, \tag{7.1} \]

where the x variables represent the positions of particles and v(x)
is the potential. They then assume that the wave function ψ(x) has
the form

\[ \psi(x) = P^{1/2} e^{iS/\hbar}, \tag{7.2} \]

where P is a real function chosen so that the integral of $|\psi(x)|^2$ over
all space equals 1. Separating out the real and imaginary parts of
Eq. (7.1) leads to two coupled nonlinear equations for P and S
[135,136]:

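\[ \frac{\partial P}{\partial t} = -\sum_{i,j=1}^{d} g^{ij}\frac{\partial}{\partial x^i}\left(P\frac{\partial S}{\partial x^j}\right) \tag{7.3} \]

\[ \frac{\partial S}{\partial t} = -\frac{1}{2}\sum_{i,j}^{d} g^{ij}\frac{\partial S}{\partial x^i}\frac{\partial S}{\partial x^j} - v(x) + \frac{\hbar^2}{8}\sum_{i,j}^{d} g^{ij}\left(\frac{2}{P}\frac{\partial^2 P}{\partial x^i\,\partial x^j} - \frac{1}{P^2}\frac{\partial P}{\partial x^i}\frac{\partial P}{\partial x^j}\right). \tag{7.4} \]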
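
The combination of derivatives of P in the last term of Eq. (7.4) is just ħ²/2 times the Laplacian of √P divided by √P, which is the form in which the Bohm potential is usually quoted. The sketch below checks this identity numerically in one dimension with a flat metric, assuming a Gaussian density of width `s` and ħ = 1; the variable names are ours.

```python
import numpy as np

hbar = 1.0
s = 0.8                                     # width of the Gaussian density
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]

P = np.exp(-x**2 / (2.0 * s**2))
P /= P.sum() * dx                           # normalize the probability density

# Form appearing in Eq. (7.4): derivatives of P
dP = np.gradient(P, dx)
d2P = np.gradient(dP, dx)
term_from_P = hbar**2 / 8.0 * (2.0 * d2P / P - dP**2 / P**2)

# Equivalent form: (hbar^2 / 2) * (sqrt(P))'' / sqrt(P)
R = np.sqrt(P)
d2R = np.gradient(np.gradient(R, dx), dx)
term_from_R = hbar**2 / 2.0 * d2R / R

# Analytic value for a Gaussian, for comparison
analytic = hbar**2 / 2.0 * (x**2 / (4.0 * s**4) - 1.0 / (2.0 * s**2))

interior = np.abs(x) < 3.0                  # avoid one-sided differences at the edges
print(np.max(np.abs(term_from_P - term_from_R)[interior]))
```

The term depends only on the shape of P and not on its normalization, which is one way of seeing that it measures how sharply the probability density is localized.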
Rewriting Schrödinger's equation as in Eqs. (7.3) and (7.4) has
turned out to be of interest in a number of contexts; e.g. its
similarity to both the NLS equation and the HJB equation, which suggests
a connection with stochastic control theory (cf. [67]). The
nondeterministic nature of quantum dynamics is signaled by the
appearance of the 3rd term on the r.h.s. of Eq. (7.4), which is known
as the "Bohm potential". In Refs. [135,136], it is noted that the Bohm
potential is
an example of "Fisher information", which is closely related to the
Kullback–Leibler (KL) divergence [27] that is an information theo-
retic measure of how well an estimated probability density matches
a “true” probability density, e.g. a conditional probability that one
might obtain from Bayes’s formula. When an estimate of a desired
probability density is improved, for example as a result of an observa-
tion, then the KL divergence changes by an amount equal to the infor-
mation that has been gained as a result of the measurement. When
the Bellman function is optimized, no more information gathering is
possible, and the dynamics is described by Pontryagin control.
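
A small discrete example may help fix the idea of information gained by an observation. The sketch below assumes a four-state problem with a uniform prior and an illustrative likelihood of our own choosing; it applies Bayes's formula and measures the information gained as the KL divergence of the posterior from the prior. The function name `kl_divergence` and the numbers are ours.

```python
import numpy as np


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                             # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


prior = np.full(4, 0.25)
likelihood = np.array([0.9, 0.1, 0.1, 0.1])  # probability of the observed datum per state

posterior = prior * likelihood               # Bayes's formula (unnormalized)
posterior /= posterior.sum()

information_gain = kl_divergence(posterior, prior)
print("posterior:", posterior)
print("information gained (nats):", information_gain)
```

Repeating the same observation yields a smaller and smaller gain as the posterior sharpens, which is the discrete analogue of the statement that no further information gathering is possible once the optimum has been reached.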
When the number of particles is large and the wave function
Ψ(x1, . . . , xN) is required to be symmetric under interchange
of the coordinates of any two bosons, then the Schrödinger equation
(7.11) can often be replaced by the Gross–Pitaevskii description
[133] of a cloud of bosons with off-diagonal long range order
(ODLRO). The Gross–Pitaevskii equation for the order parameter
can be derived as the Euler–Lagrange equation for a classical
Lagrangian:

$$L = \psi^{*}\left(i\hbar\frac{\partial}{\partial t} + \mu\right)\psi - \frac{\hbar^2}{2m}|\nabla\psi|^2 - U(|\psi|^2), \tag{7.5}$$
where μ is the chemical potential. Near a critical point where the
speed of sound in the boson fluid vanishes (which implies the Pontryagin
condition), the effective Lagrangian is

$$L_{\mathrm{crit}} = \psi^{*}\left(i\hbar\frac{\partial}{\partial t} + \mu\right)\psi - \frac{\hbar^2}{2m}(|\psi| - \psi_0)^4. \tag{7.6}$$
What we would like to emphasize here is that at a critical point
where the speed of sound in the quantum fluid vanishes “time stands
still”, which is exactly the condition that the control Hamiltonian
Hc = 0, which corresponds to Bellman optimization devolving to
the Pontryagin “maximum principle”. The point of departure for
our proposed quantum approach to Pontryagin control is to replace
the Schrödinger wave function that appears in Eq. (7.11) with a
wave function satisfying the Gross–Pitaevskii equation for the order
parameter for a bosonic fluid:
$$i\hbar\frac{\partial\psi}{\partial t} = -\frac{\hbar^2}{2m}\nabla^2\psi + [U(|\psi|^2) + \phi(x)]\psi, \tag{7.7}$$
where ψ(x, t) is the complex valued order parameter whose absolute
square is the spatial density of bosons in the cloud, and φ(x) is an
external potential (e.g. due to an external spatially varying gravita-
tional field acting on the bosonic cloud in an atomic “fountain”). The
particle density and velocity field corresponding to (7.5) are smooth
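and satisfy the Euler–Khalatnikov equation [137]:

$$m\frac{\partial v(x)}{\partial t} = -\nabla\left[\frac{1}{2}mv^2 - \frac{\hbar^2}{4m}\left(\nabla^2\ln\rho + \frac{1}{2}(\nabla\ln\rho)^2\right)\right]. \tag{7.8}$$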
The semi-classical fluid dynamics described by Eq. (7.6) have the
characteristic property that the flows described by these equations
resemble Eulerian flow around obstacles. (As a historical aside, it
took observations of objects dropped from the Eiffel Tower to realize
that ordinary hydrodynamics was not Eulerian.) What makes this
work as a model for machine learning is that the phase of the order
parameter ψ(x, t) along the streamlines of the flow (7.6) satisfies the
HJB equation, where the potential U (x) plays the role of the reward
function.
The bosonic fluid described by Eq. (7.7) can also be regarded as
a multi-boson system defined by a Hamiltonian of the form
$$H_c = -\frac{\hbar^2}{2m}\sum_{i=1}^{N}\partial_i^2 + U(x_1, \ldots, x_N). \tag{7.9}$$

An interesting sidelight for the Gross–Pitaevskii model for control
is that the theory corresponding to Eq. (7.9) can also be interpreted
[138] as a quantum theory of space-time near a black hole horizon.
There is no consensus at the present time as to what a quantum
theory of gravity looks like; however, the Gross–Pitaevskii model,
defined by Eq. (7.9), may cut the “Gordian Knot” as to the quantum-theoretic
meaning of an event horizon, including a theory of black
hole entropy [139]. Of course, this raises the question as to why some
interesting RL problem might be mapped onto such a model, but
our Gross–Pitaevskii model suggests that one way to understand
optimization of the Bellman function is to observe the time history
of the order parameter falling in a gravitational field as a function of
altitude:
$$i\hbar\frac{\partial\psi}{\partial t} = -\frac{\hbar^2}{2m}\nabla^2\psi + [U(|\psi|^2) - g(t)h(x)]\psi. \tag{7.10}$$
If the altitudes hi of atoms in the cloud and their velocities vi are
controlled by varying the acceleration of gravity g, then the optimal
point is the altitude h where the speed of sound vanishes, and the
phase of the order parameter has a stationary point.
The moon lander problem described in Chapter 4 illustrates what
the solution to Eq. (7.10) might look like. In this case, turning on
the rocket thrust of the moon lander can be emulated in a cloud of
bosons falling in a gravitational field by demanding that on reaching
a surface the bosons should come to rest. Another advantage of visualizing
the system being controlled as a quantum fluid is that fluid
control is amenable to H∞ control [119], which provides an entrée
[84] for quantum self-organization to the theory of games.

7.2. Quantum Theory of Innovations

An obvious challenge for the quantum version of the Durbin–Willshaw
set-up [1] for solving the traveling salesman problem (TSP)
is how to emulate the nonlinear couplings between the locations
where the salesman stops and the observed locations of the cities.
An attendant problem is how to model, using quantum mechanics, the
evolution of the receptive fields for the cities in Fig. 2.1; i.e. the set of
cities connected by dashed lines to each city. If these receptive fields
overlap, then as the observed itinerary of the salesman regresses toward
the true trajectory, the topology of the connections between the
station-stops and the cities may change to optimize the length of the
salesman’s path (or Bellman value function in the case of feedback
control or RL). One of the pleasant consequences of using quantum
paths in a path integral to represent the model paths in the TSP is
that the receptive fields and nonlinear springs associated with the
d(i, μ) in the Durbin–Willshaw set-up are replaced by a continuous
local force acting on a particle salesman traveling along a model path
which only depends on the true velocity of the salesman at any given
time. The existence of an instantaneous influence (that may however
depend on the previous history of a system) acting to modify the current
state of the system based on observations of signals from the system
is a universal feature of Bayesian feedback control and reinforcement
learning.
The quantum solution of the TSP introduced in Chapter 2 is a
good illustration of how this process works, and in addition how the
quantum theory of propagation of a particle in a magnetic field can
be used to emulate how the differences between the estimated state
of a system and its actual state can be reduced by the introduction of
a control variable. As it turns out, there is a perfect correspondence
between the deviations in the velocity of a particle along a trial path
from the classical velocity and the action in the Feynman quantum
amplitude for motion along the path, which has the form

$$K(t) = \int \exp\left\{\frac{i}{\hbar}\left(S_{cl}[x(t')] + \Delta S[x(t')]\right)\right\} Dx(t'). \tag{7.11}$$

The factor containing Scl = (xb − xa)²/2(tb − ta) describes the
classical motion of a free particle between cities (see [47]), while the
factor containing ΔS describes the deviation of the observed motion
of the “quantum salesman” from the classical path:

$$F(t - t_0) = \int \exp\left\{\frac{i}{\hbar}\int_{t_0}^{t}\frac{m}{2}|\dot{x} - v(t')|^2\,dt'\right\} Dy(t'), \tag{7.12}$$
where the classical velocity v(x) is defined for all x by the actual
motion of the salesman, and y(t) = x(t) − xcl (t) is the deviation of
the Feynman path from the classical path. It is a crucial observation
for us that the action functional in (7.12) can also be expressed as
the effective action function for the motion of a classical particle in
a magnetic field [138]:

$$\Delta S[x(t)] = \int_{t_0}^{t}\left(\frac{m}{2}\dot{x}^2 + \dot{x}A(x)\right)dt, \tag{7.13}$$

where A(x) ≡ m v(x) is formally the familiar vector potential for a
magnetic field B = curl A. In the context of Eq. (7.13), the “control
variable” is just the velocity of the salesman at that particular
time. Thus, we arrive at the surprising result that the innovation
GP is equivalent to the quantum motion of a particle in a position-dependent
magnetic field.
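As a worked step (a sketch, assuming that ẋ and v can be treated as ordinary vectors so that the square can be expanded term by term), the integrand of Eq. (7.12) can be written out as

$$\frac{m}{2}|\dot{x} - v(x)|^2 = \frac{m}{2}\dot{x}^2 - m\,\dot{x}\cdot v(x) + \frac{m}{2}|v(x)|^2 .$$

The first term is the free kinetic term. The cross term is linear in ẋ and so has the same structure as the coupling to A(x) in Eq. (7.13), with A(x) proportional to m v(x); its overall sign depends on the convention adopted for A. The last term does not involve ẋ.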
As a hint that we are on the right track to finding a quantum translation
for optimal control, the quantity inside the parentheses in (7.13)
has the same form (apart from a sign) as the control Hamiltonian
for Pontryagin control discussed in Chapter 4, with the proviso that
the control variable A(x) is now interpretable as a vector poten-
tial. Applying Eq. (7.13) to solve the traveling salesman problem
seems straightforward; namely one considers the classical limit, where
because of cancellation of phases the path integral becomes focused
on the shortest path length.
If we restrict attention to quantum deviations within a region of
area |ζb − ζa|Δy surrounding the straight-line classical path between
cities with locations ζb and ζa, then according to Landau (see
Quantum Mechanics [15]) the number of quantum states needed to
describe the deviation is B|ζb − ζa|Δy/2π, each of which has the
form exp{−2π(z − zi)²/B}; i.e. a localized Gaussian density. Evidently
(omitting a normalization factor), the ground state describing
the localization of the salesman within the region corresponding to
the deviations from the classical path has the form

Ψ = exp{−2πS/B}, (7.14)

where S is the area of the region between the salesman’s path con-
necting the square dots in Fig. 2.1 and the observations of the sales-
man’s path linking the round dots representing the cities. B = curlA
is assumed to be constant in this region. Of course, this formula only
makes sense on a Riemann surface because in the plane the sales-
man’s itinerary will in general be a self-intersecting graph. Equa-
tion (7.14) is completely consistent with our conjecture [36] that the
area S plays the same role as the Bellman function in optimal control,
i.e. the area S represents the information regarding the salesman’s
itinerary that has been lost as a result of environmental noise. In
other words, in our quantum version of the TSP the innovation noise
is just a consequence of using a path integral with the Nambu-like
action for a string [100] to describe the observations. When the devi-
ations of the quantum path from the classical path are limited such
that the path is not self-intersecting, then Bellman optimization is
equivalent to minimizing the area in Eq. (7.14). This is a very pleasant
result in the sense that the Bellman function, which represents
a loss of information, can now be interpreted as the area spanned
by the innovation. Of course, when the lost information about the
salesman’s movements due to the presence of the control variable
A(x) is recovered, the salesman becomes localized on the classical
path.
Although the action in Eq. (7.12) had its origin in our approach
to the TSP, the appearance of the vector potential A(x) suggests
that a physical analog realization of innovation for feedback control
would require magnetic fields. Serendipitously, it was discovered by
Duncan Haldane [140] that an effective magnetic field can sponta-
neously appear on the surface of certain atomic lattices with strong
spin orbit interactions. Somewhat later a class of materials, topolog-
ical insulators, was discovered with exactly this property [140]. Here
we just note the possibility that the quantum innovation, Eq. (7.12),
might have an analog realization involving topological insulators or
superconductors. Spin orbit effects in topological insulators (TIs) are
controlled by a parameter 1/κ = spin orbit influence length/atomic spacing
and in TIs can be represented as the Chern–Simons out-of-plane
“magnetic field”

$$B_{CS} = -\frac{e}{\kappa}\psi^{*}\psi$$

and an in-plane electric field

$$E^{i}_{CS} = \frac{e}{v_F}\,\varepsilon^{ij} j^{j}.$$
The 2D Schrödinger equation describing the motion of a 2D quantum
fluid of particles interacting via both a point-like interaction and
the gauge potentials for these Chern–Simons fields has the following
form [142]:
$$i\frac{\partial\psi}{\partial t} = -\frac{1}{2m}D^2\psi + eA_0\psi - g|\psi|^2\psi, \tag{7.15}$$
where Dα = ∂α −i(e/c)Aα and m is an inertia parameter. The gauge
fields A0 and Aα do not satisfy Maxwell’s equations, but instead are
determined self-consistently from the equations for Chern–Simons
electrodynamics in 2 + 1 dimensions. In the presence of a uniform 2D
electric field E, the current has the same form as the Hall current
for a magnetic field perpendicular to the plane:

jαβ = σH ε αβγ Eγ ,

where σH is the “Hall conductivity”. Neglecting spatial variations
in the electric field, the usual Gauss’s law will be replaced by the
Chern–Simons equation

$$B = -\frac{e}{\kappa}\rho,$$

where B is the strength of an effective magnetic field whose direction
is perpendicular to the surface, ρ = ψ∗ψ, 1/|κ| is an inverse
length with σH = vF κ/e, and vF is the Fermi velocity on the surface
of the TI. The control parameter is the vector potential
that appears in the covariant derivatives Dx and Dy of
Eq. (7.15). Physically, the appearance of this vector potential is associated
with the effect of spin orbit coupling between spin polarized
charge carriers moving on the surface of the TI and the nuclei of high
Z atoms just below the surface.
Although the magnitude of the vector potential on the
surface of a topological insulator cannot be controlled, the charge
carrier flows implied by Eq. (7.14) are collectively very smooth. The
Hamiltonian corresponding to Eq. (7.15) is

$$H = \int d^2x\left[\frac{\hbar^2}{2m}|(D_x \pm iD_y)\psi|^2 - \frac{1}{2}\left(g - \frac{e^2}{mv_F|\kappa|}\right)\rho^2\right]. \tag{7.16}$$

As in optimal control theory, we are interested in the Pontryagin
limit H = 0. Simple analytic expressions for these zero modes can
be found if we assume

$$g = \pm e^2/mc\kappa,$$

in which case the equation for zero modes H = 0 reduces to

$$(D_x \pm iD_y)\psi = 0. \tag{7.17}$$

This equation has the simple analytical solution

$$\psi(x) = e^{ikx}, \quad \nabla\phi = k\hat{x}, \quad j = -\frac{e}{mv_F}A = \frac{\hbar k}{m}\rho\,\hat{x}. \tag{7.18}$$
We see here that for small deviations between the model path and the
salesman’s path the holomorphic flows satisfying (7.17) closely mimic
the TSP innovation. This is an illustration of the uniform conver-
gence appearance of phase space parameters in Pontryagin control.
Equation (7.17) can also be written as a Liouville equation:
$$\nabla^2\ln\rho = \pm\frac{2e^2}{v_F\kappa}\rho, \tag{7.19}$$
which has vortex-like solutions [142]. The bottom line is that current
flows on the surface of a topological superconductor [141] may be
a way of representing Kailath’s innovations in an analog quantum
device [143].

7.3. Quantum Helmholtz Machine

The aim of the wake–sleep algorithm [9] for training the Helmholtz
machine is to produce joint probability distributions for the ensemble
of Ising spins (states = {+1, −1}) that minimize the information
costs of representing a set of observations. One of the key ideas
behind the Helmholtz machine of Dayan et al. [8] was to follow in the
footsteps of the Boltzmann machine [1] by regarding the two arrays
of Ising spins as physical systems. From this perspective, minimization
of the information cost of the descriptions of the states of the
recognition and data generation arrays is equivalent to minimizing the free
energy of this physical system of interacting Ising spins. This in turn
implies [8,9] that the conditional probability for a given model, i.e.
the l.h.s of Eq. (1.1), will be given by Bayes’s formula. We want to
extend this scenario by replacing the Ising spins with the Riemann
surface degrees of freedom introduced in Chapter 5.
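The link between free energy minimization and Bayes’s formula can be seen in the smallest possible case. The following is a minimal sketch, not the wake–sleep algorithm itself: a single two-state hidden unit, with a hypothetical prior and likelihood, and a brute-force scan over recognition distributions q. All names and numbers are assumptions made for the illustration.

```python
import math

def free_energy(q, prior, likelihood):
    """Variational free energy F(q) = sum_h q(h) [log q(h) - log p(h) p(d|h)]."""
    return sum(
        qh * (math.log(qh) - math.log(ph * lh))
        for qh, ph, lh in zip(q, prior, likelihood)
        if qh > 0.0
    )

def bayes_posterior(prior, likelihood):
    """Posterior p(h|d) and evidence p(d) from Bayes's formula."""
    joint = [ph * lh for ph, lh in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

prior = [0.6, 0.4]       # hypothetical p(h) for one two-state unit h
likelihood = [0.2, 0.9]  # hypothetical p(d | h) for the observed datum d

posterior, evidence = bayes_posterior(prior, likelihood)

# Scan recognition distributions q = (a, 1 - a) and keep the minimizer of F.
grid = [k / 1000.0 for k in range(1, 1000)]
best_a = min(grid, key=lambda a: free_energy([a, 1.0 - a], prior, likelihood))

print(best_a, posterior[0])
print(free_energy(posterior, prior, likelihood), -math.log(evidence))
```

The scan returns the Bayes posterior as the minimizer, and the minimum value of the free energy is the negative log evidence, i.e. the description length of the datum under the model.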
One possibility [52] for going in this direction would be to replace
the array of Ising spins with the Ashkin–Teller (AT) statistical
model [144] for a 2D array of spins with two Ising spins per lat-
tice site. It was discovered by Kadanoff and Brown [145] that if
the 4-spin couplings are carefully chosen, then the energy functional
for the model has a Gaussian form similar to the energy
cost functions that naturally appear in Kohonen self-organization
[13]. When the spins at each lattice site are allowed to interact,
these AT models share with self-organizing networks [108] the cru-
cial property that for critical values for the spin couplings the
AT model will possess string-like excitations where the informa-
tion regarding the state of the observer/controller or environment
can be represented by the shape of a possibly topologically non-
trivial 2D surface. The emergence here of “self-organization” means
that the original AT spin degrees of freedom are effectively replaced
with Θ-functions representing the shape of a Riemann surface. This
has the pleasant consequence that both the recognition and gen-
erative networks of the Helmholtz machine can then be described
as path integral representations of the Lax equation for the KdV
equation.
The story line here is that we want to use the double path
integral formulation of Feynman and Vernon [47] to represent the
string degrees of freedom in the two Helmholtz machine arrays; one
representing the forward evolution of an observer/controller which
includes an estimation for the innovation, and the other to represent
the backward evolution of the system or environment. Formally, this
amounts to replacing Feynman’s original path integral with a double
path integral of the form (see Appendix E):

$$\int e^{-iS[x(t)]}\,Dx(t) \;\rightarrow\; \int\!\!\int e^{-i\{S[x(t)] - S[x'(t)]\}}\,F[x(t), x'(t)]\,Dx(t)\,Dx'(t), \tag{E.3}$$

where the “influence function” F(x, x′) represents the effect of
the interaction between two separate quantum systems on their
joint evolution. The exact form for F(x, x′) depends on the details
of the two interacting quantum systems, but in general one can
write [47]

$$F[x(t), x'(t)] = \exp\left\{-\int_0^T\!\!\int_0^t \left(x(t')\alpha(t, t') - x'(t')\alpha^{*}(t, t')\right)\left(x(t) - x'(t)\right)dt'\,dt\right\}, \tag{7.20}$$
where α(t, t′) is a complex function that plays somewhat the same
role as the real valued autocorrelation function A(t, t′) for a random
time signal. When the environment consists of an assembly of quantum
oscillators that are linearly coupled to the coordinates x(t) of
the observer/controller, α(t, t′) has the following form:

$$\alpha(t, t') = -\sum_i \frac{g_i^2}{2\omega_i}\,e^{-i\omega_i(t - t')}, \tag{7.21}$$

where the ωi are the frequencies of the oscillators in the second oscillator
array making up the “environment”. For a single harmonic oscillator
and an environment consisting of oscillators with frequencies
ωi, where Δi = ℏωi is the level spacing, the exponential factor in
Eq. (7.11) becomes

$$\exp\left\{-\sum_{ij}\frac{g_i^2 M\omega_0}{2\hbar\Delta_i}\int_{t_j}^{t_{j+1}}dt\int_{t_j}^{t}ds\,\left(x(s)e^{-i\Delta_i(t-s)} - x'(s)e^{i\Delta_i(t-s)}\right)\left(x(t) - x'(t)\right)\right\}.$$

As described in Refs. [47,50], this expression can be used to exactly
evaluate the effect of a quantum oscillator environment on an
observer. However, we are mainly interested in the more sophisticated
problem of modeling an interaction between an observer/controller
with an environment that resembles the way sensory data are col-
lected in the real world, e.g. as a self-organizing map.
Because of the exponential form of F(x, x′), the double path integral
in Eq. (E.3) can be written as

$$Z(w_i(\sigma)) = \sum_i \exp[i(S_{1i} - S_{2i} + \Delta S_{12})], \tag{7.22}$$

where actions S1 and S2 can be identified with the Nambu action
[100] for an ensemble of free strings, which is the area of the
Riemann surface. The total action describing the Feynman–Vernon
dynamics of two arrays of interacting strings will be a sum of the free
string actions S1 and S2 for the two arrays plus an interaction term
ΔS12 describing the interaction between the Riemann surfaces in the
two arrays of the Helmholtz machine. This ΔS12 contribution generalizes
the nonlinear springs connecting the square and round points
in Fig. 2.1. As an illustration of what sort of interaction might replace
the nonlinear springs in the Durbin–Willshaw setup [1], ΔS12 might
be assumed to have the form [146]

$$\Delta S_{12} = \int d\sigma_1\,d\sigma_2\,[\det(g^{1}_{ab})]^{1/2}\,[\det(g^{2}_{ab})]^{1/2}\,G(\dot{x}^2 - \dot{y}^2), \tag{7.23}$$

where gab is the metric for a Riemann surface and ẋ² − ẏ² is the
Lorentz invariant distance between a position on a Riemann surface
representing the observer/controller and a position on a Riemann
surface representing the environment. This is consistent with the
Chern–Simons interaction that appeared in our treatment of the
TSP. However, the path of the salesman is fixed, so there is no Rie-
mann surface associated with the salesman’s trajectory. On the other
hand, it will turn out to be of considerable interest in the case of
the Helmholtz machine to consider what happens when the degrees
of freedom of either array are frozen in time. In fact, this takes us
back to the Segal–Wilson and T–O descriptions of exact solutions of
the KdV equation in terms of holomorphic functions on a Riemann
surface.
Our elucidation of this follows some ideas of Chu [146]. Carrying
out the integration over the string degrees of freedom for the envi-
ronment array in the double path integral in (E.3) leads to an action
function for each string in the observer/controller array of strings of
the form

$$S = \int_0^{\tau_f}d\tau\int_0^{\beta}dy\,\exp\left[c_s^2\left(\frac{dx_i}{d\tau}\right)^2 - \left(\frac{dx_i}{dy}\right)^2 + \sum_j q_j(x_i[y, \tau])\right], \tag{7.24}$$

where cs is the speed of sound along the string, qj(x) is an effective
potential for the motion of the string, and the sum over j recognizes
that there may be qualitatively different models for representing the
input data (e.g. Riemann surfaces with different topologies). We see
here how an inhomogeneous wave equation for a string moving in a
certain potential can be associated with every model for the
environment. The crucial step to go from Eq. (7.15) to the KdV equation
is to freeze the dynamics of the array of quantum strings representing
the observer/controller, which then allows one to replace the
quantum dynamics of an array of strings, with a probability distribution
for the shapes of the frozen Riemann surfaces, by a path integral
description of the thermal state of solutions of the Schrödinger–Hill
equation (which had its origins in the theory of Lagrange
triangles [125]):

$$Z(\psi_0, \psi_\beta, \beta) = \int \exp\left\{-\int_0^{\beta}\left[\left(\frac{d\psi}{dx}\right)^2 + \sum_j q_j(\psi(y))\right]dy\right\} D[\psi(y)]. \tag{7.25}$$
Although the stochastic structure of the KdV equation was not
immediately recognized in the initial flurry of papers on exact solu-
tions for the KdV equation (cf. [75]), it is now understood (see e.g.
[147]) that the KdV equation is equivalent to an ensemble of Fokker–
Planck equations where the drift terms can be identified with exact
solutions of the classical KdV equation. From our perspective the
Gaussian noise in this Fokker–Planck process can be regarded as a
consequence of the quantum fluctuations inherent in a path integral
description of the Lax equation for the KdV equation. In this setting
the classical Mumford–Rissanen MDL principle for the Helmholtz
machine corresponds to Rose optimization [34] for the Lax equa-
tion. Furthermore, it is clear the Rose optimum is attained when the
entropy created by the wake–sleep algorithm is minimized. Thus, the
information aspects of Bellman optimization arise in a natural way
from the Fokker–Planck description of the Helmholtz machine during
wake–sleep interludes as the MDL limit for the Helmholtz machine
is approached. This is completely consistent with the conventional
view [19] that the rate of change of the optimal Bellman function is
partly due to a “drift term” (aka the reward function), and partly
due to stochastic diffusion. If we go back to our quantum predecessor
of this Fokker–Planck process, we can identify the reward functions
qi (y, t) as the descendants of the interaction term ΔS12 in our quantum
Helmholtz machine. This result offers the tantalizing prospect
of being able to use a quantum model for the Helmholtz machine to
solve complex optimal control and RL problems.
Because of the emergence of a probabilistic description for both
the observer/controller and system/environment, the application of
the wake–sleep algorithm to our quantum version of the Helmholtz
machine can also be interpreted as a stochastic game [53] involving
two players, where the “payoff” for each player is the information
gathered regarding the history of the adversary. Stochastic games
provide descriptions for adversarial conflicts in many real-life circum-
stances. For example, in the terminology of von Neumann [101] the
drama illustrated in Fig. 1.1 is a two-person zero-sum game where
the gain for the squirrel — his life — is matched by the bobcat’s
loss — a good meal. (We can reveal that a short time after the pic-
ture was taken the drama ended happily for the squirrel.) John von
Neumann published separate discussions of game theory, the math-
ematical foundations of quantum theory, and computer models for
the brain, and curiously his original paper on two-person zero-sum
games (see [90]) appeared at about the same time as his treatise on
the mathematical foundations of quantum mechanics [95]. Therefore,
it is easy to imagine that von Neumann had in the back of his mind
that these topics were somehow related. One of our ambitions is to
fill in the gaps between von Neumann’s published works.
As possibly a first step in this direction, our quantum model for
the Helmholtz machine seems to provide an elegant explanation for
why the strategies that “solve” two-person stochastic games [53] are
probabilistic in the sense that on average no other strategy can pro-
duce a better outcome [105]. It is almost obvious that our quantum
theory of the wake–sleep algorithm can provide such a model for
these strategies, where the two players are identified with the recognition
and model generation networks of the Helmholtz machine,
and the “payoff” is the information gathered by each player regard-
ing the history of the other player. If one accepts that the goal of the
wake–sleep algorithm is to maximize information gathered regarding
the adversary, then it naturally follows that the innovations in this
information that appear during wake–sleep interludes contribute to
the optimization of the Bellman–Isaacs function [53] that defines the
optimal choices of strategies for both players in a stochastic game. In
addition, von Neumann’s minimax solution for two-player zero-sum
games emerges because the equations describing the evolution of the
observer/controller and system/environment from one wake–sleep
cycle to the next are linear because the quantum dynamics is linear.
The wake–sleep algorithm then forces these solutions to be compati-
ble in the sense that their density matrices for the observer/controller
and system/environment are similar, and the accumulated reward
for the observer/controller is just the negative of the accumulated
reward for the environment/system. That the expected equilibrium
payoff for one player is just the negative of the expected equilib-
rium payoff for the other player is a consequence of the fact that
the quantum dynamics for the observer/controller is forward prop-
agating, while the quantum dynamics for the system/environment
is backward propagating. (In the Feynman–Vernon formalism, this
forward and backward propagation also applies to density matrices.)
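For readers who want a concrete picture of a probabilistic minimax solution, the following sketch solves a 2 × 2 zero-sum game in closed form. It is a textbook construction rather than the quantum model described here; the payoff entries are hypothetical, and the formulas assume the game has no pure-strategy saddle point.

```python
def solve_2x2_zero_sum(a, b, c, d):
    """Mixed-strategy solution of the zero-sum game with row-player payoffs
    [[a, b], [c, d]], assuming there is no pure-strategy saddle point."""
    denom = a - b - c + d
    p = (d - c) / denom           # probability that the row player picks row 0
    q = (d - b) / denom           # probability that the column player picks column 0
    value = (a * d - b * c) / denom
    return p, q, value

def expected_payoff(a, b, c, d, p, q):
    """Expected payoff to the row player under mixed strategies p and q."""
    return (p * q * a + p * (1 - q) * b
            + (1 - p) * q * c + (1 - p) * (1 - q) * d)

game = (3.0, -1.0, -2.0, 4.0)  # hypothetical payoffs to the row player
p, q, value = solve_2x2_zero_sum(*game)
print(p, q, value)
print("column player's expected payoff:", -value)

# Against the opponent's optimal mix, every pure strategy earns the same value,
# so no unilateral change of strategy improves a player's average outcome.
for pure in (0.0, 1.0):
    print(expected_payoff(*game, pure, q), expected_payoff(*game, p, pure))
```

The equilibrium payoff of one player is the negative of the other’s, which is the zero-sum property referred to above.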
A previous hint that optimal control might be related to von
Neumann’s solution for two-person games was provided by H∞ con-
trol [84]. H∞ control is an extension of the Kalman filter that con-
strains the control history in such a way that the variance of the
innovation for the Kalman filter [83,84] remains bounded in magni-
tude. This type of constraint on the innovation also shows up in our
quantum solution of the TSP [36] when we try to extract the
optimal solution for the TSP by taking the classical limit. As noted
previously, approaching the classical limit of a quantum description
of optimal control will in general require that the control histories
be describable as paths on a Riemann surface. Our focus here
is that H∞ control provides a nice way of looking at both the von
Neumann and Nash solutions for games [105]. N-person games can
be introduced by replacing the holomorphic functions in the Hardy
space used in classical H∞ control with the Gross–Pitaevskii descrip-
tion of the order parameter for N particles discussed in Section 7.1.
In a Gross–Pitaevskii description for the “fluid states” of either the
observer/controller or system/environment, each player represents a
team of N agents that are treated symmetrically, and their dynam-
ics will be controlled by the negative of a Hamiltonian like that in
Eq. (7.16). The possible relevance of using a fluid-like model for an
observer/controller is illustrated in Fig. 7.1, which shows a SOM
model [12,13] for the somatosensory neurons on a human hand. As
we shall see, SOMs may be a useful way to emulate H∞ games.
As discussed in Section 4.2, the equilibrium state for an H∞
game will be determined by maximizing a control Hamiltonian.
Fig. 7.1. Self-organization of somatosensory sensors from Ref. [13].

On the other hand, the equilibrium state of either player can also
be described as the state where the information that the team of
agents representing a player has gathered regarding the state of the
adversarial player is maximized. In the equilibrium state, the optimal
strategies {πi∗ } for the 2N agents representing the two players satisfy
the Nash condition [105]:

$$e(\pi_1^{*}, \pi_2^{*}, \ldots, \pi_i^{*}, \ldots, \pi_{2N}^{*}) \geq e(\pi_1^{*}, \pi_2^{*}, \ldots, \pi_i, \ldots, \pi_{2N}^{*}), \quad i = 1, \ldots, N, \tag{7.26}$$

where e ({πi }) is the game payoff for two teams of N agents represent-
ing the players in a two-player stochastic game. The {πi } are a set
of mixed strategies for each agent, while the ith strategy on the r.h.s
of the inequality (7.26) is any strategy other than the optimal strat-
egy. The strategies with an asterisk are the optimal strategies that
define the Nash equilibrium state. The Nash equilibrium condition
in Eq. (7.26) is formally the same as optimization of the Bellman–Isaacs
function [53], which in this case sums up the payoffs for all the
agents acting individually with the indicated strategies. The equilibrium
state for a controlled Gross–Pitaevskii fluid will mean that no
more information regarding the state of the fluid can be obtained by
the independent actions of any agent. As in the original version of the
Helmholtz machine due to Dayan, Hinton, et al. [8], the equilibrium
state of the Gross–Pitaevskii fluid can be accessed by minimizing the
information gathered by the two players. A challenge facing this pro-
posal is to find some means for implementing this regression. In this
respect, construction of self-organizing maps with large scale parallel
computers [147] may be the path to consider.
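The Nash condition (7.26) amounts to checking that no single agent gains by a unilateral change of strategy. The sketch below performs that check by brute force for a toy game. It is an illustration only: it uses pure rather than mixed strategies, it gives each agent its own payoff function, and the three-agent coordination game is a hypothetical example unrelated to the Gross–Pitaevskii model.

```python
from itertools import product

def is_nash(payoff, profile, strategies):
    """Return True if no single agent can raise its own payoff by deviating.

    payoff(i, profile) is the payoff to agent i; profile is a tuple holding
    one strategy per agent.
    """
    for i in range(len(profile)):
        current = payoff(i, profile)
        for s in strategies:
            deviated = profile[:i] + (s,) + profile[i + 1:]
            if payoff(i, deviated) > current:
                return False
    return True

def coordination_payoff(i, profile):
    """Hypothetical game: an agent earns one unit for every other agent
    that chose the same action."""
    return sum(1 for j, s in enumerate(profile) if j != i and s == profile[i])

strategies = (0, 1)
equilibria = [
    profile
    for profile in product(strategies, repeat=3)
    if is_nash(coordination_payoff, profile, strategies)
]
print(equilibria)
```

Only the profiles in which all agents agree survive the test, since any agent in the minority can improve its payoff by switching.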
Although real games typically involve a discrete sequence of
actions, while H∞ control involves continuous variations in state and
control variables, it might be useful to regard the H∞ game as an
approximation to the discrete sequence of actions for the multi-agent
teams. One reason this might be an interesting avenue to
explore is that H∞ control provides a natural home for Kohonen
self-organization where both sensory data and explanations involve
holomorphic functions. This connection with self-organization will be
discussed further in the next section.

7.4. Ad Mammalian Intelligence

Despite numerous advertisements regarding their successes and
potential for artificial intelligence, neural networks have failed to live
up to the initial hopes that they would provide definitive insights
into why mammalian cognition works so well with a footprint dra-
matically smaller than the computational resources devoted to RL
problems of interest (see e.g. [106]). Admittedly DNNs have provided
many successes for machine learning, e.g. pattern recognition. On
the other hand, even before the most impressive successes of DNNs
appeared, a singular fundamental insight regarding mammalian cog-
nition and artificial neural networks came to light in the form of the
self-organizing maps for sensory data developed in the early 1980s
by Teuvo Kohonen and his collaborators [12,13].
The formal theory of self-organizing maps is based on the notion
of an output signal w(r), which initially we will assume is a complex
number describing a feature of its environment. We will posit that
these “feature detectors” are located at an arbitrary set of points {ri}
on a 2D surface, and evolve with time according to a rule
of the form [12]:

$$w(r_j, t + 1) = w(r_j, t) + \Lambda(r_j - r_j^{*})\,|\xi_j - w(r_j, t)|, \tag{7.27}$$

where ri∗ is the position of the neuron whose output is initially closest
to an input feature ξj , while the function Λ (r ) leads to a “receptive
field” for each detector which is the union of all input signals from
a particular field of view producing a response in the detector at
position ri that is closer to the current state of the detector at ri
than the state of any other r detector. Λ(|ri − ri∗ |) is typically a
Gaussian function that allows the feature detectors to adjust their
outputs so that not only the detector located at ri∗ , but also nearby
detectors observe the signal ξj . The “receptive field” for a sensor is
the union of all input signals from a particular field of view that
produce a response in the detector at position ri that is closer to
the current state of the detector at ri than the state of any other r
detector (cf. the receptive fields for the TSP defined by the dashed
lines in Fig. 2.1). The self-organizing algorithm (7.27) adjusts the
response of the sensor at position ri to a particular environmental
stimulus ξμ in its receptive field to be at least as strong as any of its
nearest neighbors.
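As a minimal sketch of one update of this kind, the following assumes the standard Kohonen form in which each output moves toward the input by a fraction Λ of the difference ξ − w. The grid size, the neighborhood width `sigma`, the learning rate `rate`, and the function name `som_step` are illustrative choices, not taken from [12].

```python
import numpy as np

def som_step(w, positions, xi, sigma=1.0, rate=0.5):
    """One Kohonen update in the spirit of Eq. (7.27)."""
    # Winner r*: detector whose current output is closest to the input xi.
    winner = np.argmin(np.abs(w - xi))
    # Gaussian neighborhood Lambda(|r_i - r*|) on the 2D sheet of detectors.
    dist2 = np.sum((positions - positions[winner]) ** 2, axis=1)
    neighborhood = rate * np.exp(-dist2 / (2.0 * sigma ** 2))
    # Move each output toward xi, weighted by its distance from the winner.
    return w + neighborhood * (xi - w)

rng = np.random.default_rng(0)
# Detectors on a 5 x 5 grid; the outputs w(r_i) are complex numbers.
grid = np.array([(x, y) for x in range(5) for y in range(5)], dtype=float)
w = rng.normal(size=25) + 1j * rng.normal(size=25)

xi = 0.3 + 0.4j                      # a single input feature
before = np.abs(w - xi)
w_new = som_step(w, grid, xi)
after = np.abs(w_new - xi)           # every detector is now at least as close
```

The winning detector halves its distance to the input, and its neighbors move by a smaller, Gaussian-weighted amount, which is what lets nearby detectors share similar features.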
Of great importance for us is that Kohonen’s self-organizing
maps also give rise to holomorphic functions that can serve
mammalian cognition in much the same way that the holomorphic
functions introduced in Chapters 5 and 6 can be melded together to
provide an analytic model for the reward function for optimal control.
In this regard, Ritter and Schulten have shown [13] that under the
influence of random variables ξμ, the model outputs {w(ri )} evolve
to a state which minimizes a stochastic energy functional

$$E[w(r_j)] = \frac{1}{2} \sum_{\langle r,s \rangle} \Big[ \sum_{\xi_\mu \in R} P(\xi_\mu)\, |\xi_\mu - w(r_j)|^2 \Big]. \tag{7.28}$$

The reason that a sum over neighboring nodes appears on the r.h.s
of Eq. (7.28) is that at each time step the change in position of a
node affects its neighbors, which allows the entire ensemble to relax
to a statistical configuration described by a partition function

$$Z = \sum_{L} \int \prod_{i=1}^{F} dw(r_j)\, \exp\Big( -\frac{\kappa}{2} \sum_{i,j} \| w(r_i) - w(r_j) \|^2 \Big), \tag{7.29}$$

where the sum over L means a sum over triangulations covering
the surfaces in Z[w(σ)], each triangulation consisting of F
triangles. Ritter and Schulten [13] show that in this equilibrium state the
outputs {w(ri)} can be approximated by a continuous function
satisfying the Cauchy–Riemann equations, which is the precise definition
of a holomorphic function (cf. Appendix B). This connection of an
SOM from the space of input signals to the space of models {w(rj)}
with smooth surfaces and holomorphic functions is undoubtedly the
key to understanding how the analytic methods introduced in
previous chapters relate to mammalian cognition.
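For reference, the Cauchy–Riemann conditions can be written out explicitly. Here we assume, purely for illustration, that the continuous approximation is written as w = u + iv in local surface coordinates (x, y); these symbols are not used in [13].

```latex
\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad
\frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}
\quad\Longrightarrow\quad
\nabla^2 u = \nabla^2 v = 0 .
```

Both u and v are therefore harmonic, and harmonic functions are the stationary points of the Dirichlet energy (1/2)∫|∇w|² dx dy. That energy is the natural continuum counterpart of a nearest-neighbor sum of squared differences such as the one in the exponent of Eq. (7.29). This is a standard fact offered as orientation, not a derivation of the Ritter–Schulten result.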
If the input signals ξμ in Eq. (7.28) can be represented as a sta-
tionary random variable, a self-organized detector network will relax
[13] to an asymptotic state characterized by a stationary probabil-
ity distribution P ({w(ri )}) for the various possible configurations of
detector states, which in turn is derivable from a partition function
of the form Z = exp(−E[w]):

$$E[w] = \frac{1}{2} \sum_{i,j} C_{ij}\, |w_i - w_j|^2, \tag{7.30}$$

where Cij is a covariance matrix for the detector states regarded
as random variables. Fundamentally, Eq. (7.29) reveals [109] that
looking for a pattern in the input data that is related to the
underlying explanation for the data is equivalent to minimizing the area
of a certain, possibly topologically nontrivial, surface, whose
area will turn out to play the role of the information description
cost. In addition, the energy functional (7.30) is the calling card
for a holomorphic function on a surface. It seems to be a reason-
able guess that these holomorphic functions play much the same
role as the “Hardy spaces” which animate the RH solutions of both
the KdV and NLS equations. Thus, we are invited to compare the
results of self-organization with the exact analytic solutions to the
integrable PDEs described in Chapter 5 and in T–O. In fact, Fig. 7.1,
which shows a self-organizing map for the somatosensory sensors on
a human hand, illustrates a dramatic result of such a comparison:
the sensors come in different varieties, corresponding to the finger
they are on. Moreover, this association of different types
of somatosensory sensors with fingers exactly mirrors the association
of the number of solitons in a multi-soliton solution [76] of the KdV
equation. This coincidence not only confirms our hypothesis that ana-
lytic solutions of integrable PDEs are important for understanding
mammalian cognition, but suggests that the little understood reason
for the leap in cognitive capabilities that is generally associated with
the appearance of mammals is the ability to make use of various
types of information.
This comparison between exact solutions of the KdV equation
and SOMs makes sense of von Neumann’s parallel interests in game
theory, the mathematical foundations of quantum theory, and com-
puter models for the brain. What von Neumann apparently didn’t
appreciate, or at least didn’t put to paper, is the importance of ana-
lytic solutions of integrable PDEs as well as SOMs. These maps do
in fact seem to have something in common with the way neurons
are organized and process information in the cerebral cortex [107].
Furthermore, SOMs provide a “deer trail” heading in the direction
of practical applications [109,110] for the holomorphic functions gen-
erated by self-organization. In a sense this is the most important
technical result of our presentation because the spontaneous cre-
ation of holomorphic functions within the mammalian cerebral cortex
might explain how it is that the mammalian brain can tap into the
same analytic apparatus for generating reward functions and control
actions extrapolated into the indefinite future based on solvability of
PDEs. The bottom line is that Kohonen’s self-organization provides
what ultimately can be the most useful link between mammalian
cognition and quantum mechanics.
Self-organizing networks have the property that they can organize
sensory data in such a way that one can understand what the data mean by
“visual inspection”. This is perhaps a realization of the often-quoted
pearl of wisdom that a picture is worth a thousand words. This is
illustrated in Fig. 7.1, which shows how an array of somatosensory
sensors self-organize information about a 3D object so that a natural
picture of the object is created within the sensor array itself. There
is already evidence from MEG recordings [150] that different audio
patterns are recorded in different areas of the cerebral cortex. The
ability to provide a holistic understanding of different features of an
environment is probably one of the reasons for the
evolutionary success of mammals.
Chapter 8

Holistic Computing

8.1. Quantum Mechanics and 3D Geometry

One nice feature of the Schwinger angular momentum
representation is that measurement of an angular momentum state might
reveal much about the 3D shape of a quantum state, as in the
original Stern–Gerlach experiment. Indeed, the Schwinger representation
can be used not only to describe locations on the surface of a sphere
with spherical harmonic functions, but also to encode whether the object
being sought is present or not at the associated location by attaching
a qubit to any approximate location. For example, in the
case of a Bayesian search for an object in an unknown location along
a road, one may label the progress of a search by a sequence of qubit
measurements. Ironically, in the context of these Schwinger states
the essence of a Bayesian search would be a series of measurements
à la the Stern–Gerlach experiment.
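To make the road example concrete, here is a purely classical sketch of one step of such a search; the binary found/not-found outcome plays the part that the text assigns to a qubit measurement. The five-segment road, the detection probability `p_detect`, and the function name are hypothetical choices made for illustration.

```python
def bayes_search_update(prior, looked_at, p_detect):
    """Posterior over road segments after an unsuccessful look at one segment."""
    # Likelihood of "not found": 1 - p_detect where we looked, 1 elsewhere.
    likelihood = [1.0 - p_detect if k == looked_at else 1.0
                  for k in range(len(prior))]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)          # P(not found), the Bayes denominator
    return [u / evidence for u in unnormalized]

prior = [0.1, 0.2, 0.4, 0.2, 0.1]         # belief over five road segments
posterior = bayes_search_update(prior, looked_at=2, p_detect=0.8)
```

An unsuccessful look lowers the probability of the inspected segment and raises all the others, so the sequence of binary outcomes steers where to look next.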
The role played by measurements in transforming Hilbert spaces
into Hardy spaces, and memory into a universal calculator, illustrates the
more sophisticated role that measurements can also play in forming holistic
images of an environment. Our teleportation scheme is a
generalization of the scheme introduced in Ref. [4] for teleportation of a
localized quantum state |ψ(qi ) > represented as a function of a con-
tinuous variable qi attached to a position i along a quantum wire
represented as a one-dimensional graph of nodes. As an example,
this variable might be the position of an oscillator located at i. Tele-
portation of this quantum state between two neighboring nodes i
and i + 1 of the quantum wire is accomplished by using a controlled
phase gate, exp(iQi+1 ⊗ Qi), where Qi and Qi+1 are the position
operators attached to the neighboring nodes, to entangle the
quantum states |ψ(qi)> and |pi+1 = 0> associated with these neighboring
nodes, and then making measurements of the momentum operator
Pi, where [Pi, Qi] = −i, attached to the node i. In our scheme for
teleportation of geometric objects the one-dimensional graph used
in Ref. [4] to represent a quantum wire will first be replaced by a
one-dimensional array of columns of nodes as illustrated in Fig. 1.1,
where each node represents a two-dimensional quantum oscillator.
The teleportation of three-dimensional objects will then be accom-
plished by generalizing the array of nodes shown in Fig. 3.1 to a
three-dimensional array of nodes, where each node represents a four-
dimensional oscillator.
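The structure of a single teleportation step (entangle with a controlled phase gate, then measure the input node) can be checked in the simplest discrete analogue. The sketch below replaces the continuous variable by a qubit, which is our simplifying assumption and not the scheme of Ref. [4] itself: the state |+> stands in for |p = 0>, the Hadamard gate for the change of basis, and the Pauli X for the measurement-dependent shift.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)      # qubit stand-in for the basis change
X = np.array([[0, 1], [1, 0]])                    # qubit stand-in for the shift
CZ = np.diag([1, 1, 1, -1]).astype(complex)       # controlled phase gate
plus = np.array([1, 1]) / np.sqrt(2)              # analogue of |p = 0>

def teleport_step(psi, outcome):
    """Entangle psi with |+>, measure node 1 in the X basis, return node 2."""
    joint = CZ @ np.kron(psi, plus)
    # X-basis projector on node 1: |+> for outcome 0, |-> for outcome 1.
    bra = np.array([1, (-1) ** outcome]) / np.sqrt(2)
    out = bra.conj() @ joint.reshape(2, 2)        # contract the node-1 index
    return out / np.linalg.norm(out)

psi = np.array([0.6, 0.8j])
results = [teleport_step(psi, m) for m in (0, 1)]
expected = [np.linalg.matrix_power(X, m) @ H @ psi for m in (0, 1)]
```

For either measurement outcome m the state on the second node is X^m H|ψ>, i.e. the input state in a rotated basis up to a known, outcome-dependent correction.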
We begin by generalizing our discussion of quantum teleportation
of qubit states to teleportation of angular momentum states. In this
case both the input data and underlying models can also be expressed
in terms of eigenstates of the total angular momentum operator J =
L + S with values j = l ± 1/2 and the z-component of the angular
momentum J z with values jz = m ± 1/2. The initial wave function
for the hidden layer will be chosen to be a product of the |φi = 0>
states for each node:

$$\Psi_H(t = 0) = \prod_i \Big( \sum_{j, j_z} |j\, j_z\rangle_i \Big). \tag{8.1}$$

The possibility of using quantum entanglement between quantum
systems separated by some distance to teleport quantum information
[1] has been discussed mainly in the context of teleporting quantum
information represented by qubits. In this section, we note the pos-
sibility of using quantum entanglement of angular momentum states
to teleport a model for a three-dimensional geometric object from
one location to another. Our proposal extends previous proposals
[39] for universal quantum computation using continuous variable
quantum systems; and in particular the possibility of using quantum
measurements to teleport quantum states of a continuous variable
defined on the nodes of a graph. The particular continuous variables
that carry information of our teleportation scheme are those asso-
ciated with coupled quantized angular momentum states. As was
first recognized by Wigner and Racah [112], the algebraic properties
of angular momentum coupling coefficients for 3 or more angular
momenta are intimately connected with the geometric properties of
simplexes constructed from triangles. Honeycombs of tetrahedrons
can also serve as models for three-dimensional objects with an arbi-
trary shape. Our scheme for the teleportation of a geometrical object
depends on decorating the nodes of a three-dimensional graph with
quantum angular momentum states. Teleportation is accomplished
by introducing an entangling interaction between two neighboring
triples of angular momenta states representing two faces of a tetra-
hedron sharing a common edge, and making measurements of the
angular momentum components parallel to the shared edge.
In our two-dimensional generalization of the continuous variable
teleportation scheme in [114], the quantum states attached to the
nodes of the graph are assumed to be angular momentum eigenstates
with a definite value for Ji2 = ji (ji + 1). The “position”
operators Qi attached to the nodes of a graph are the angular momen-
tum operators Jiz which generate rotations of the state attached to
node i around a fixed axis (defined for each node), while the “momen-
tum” operators Pi are the operators φi corresponding to the angle
of rotation about this axis. The angle variables φi suffer from the
well-known problem that they cannot be represented as self-adjoint
operators; so we follow the familiar prescription of using instead sin φi
and cos φi variables, which in our setting appear as the following
displacement operators:

$$e^{i\phi_i} = \frac{J_i^x + iJ_i^y}{\sqrt{J_i^2 - J_i^z (J_i^z + 1)}}. \tag{8.2}$$
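A quick numerical check shows why an operator of this kind acts as a displacement. The sketch assumes the normalization by the square root of J² − Jz(Jz + 1), with the inverse square root applied before the raising operator, and it sets the result to zero on the top state m = +j, where the denominator vanishes. The function name and the choice j = 2 are illustrative.

```python
import numpy as np

def spin_matrices(j):
    """J_z and J_+ in the basis |j, m>, ordered m = -j, ..., +j."""
    m = np.arange(-j, j + 1)
    jz = np.diag(m)
    # <m+1| J_+ |m> = sqrt(j(j+1) - m(m+1)); the m = +j entry vanishes.
    jplus = np.diag(np.sqrt(j * (j + 1) - m[:-1] * (m[:-1] + 1)), k=-1)
    return jz, jplus

j = 2
jz, jplus = spin_matrices(j)
m = np.diag(jz)
# Denominator of the displacement operator; it is zero on the top state
# m = +j, which J_+ annihilates anyway, so invert only where it is nonzero.
denom = j * (j + 1) - m * (m + 1)
inv_sqrt = np.diag([1 / np.sqrt(d) if d > 1e-12 else 0.0 for d in denom])
shift = jplus @ inv_sqrt          # J_+ [J^2 - J_z(J_z + 1)]^(-1/2)
```

The resulting matrix has ones on the first subdiagonal and zeros elsewhere: it maps |j, m> to |j, m + 1> with unit amplitude, which is the sense in which it displaces the Jz eigenvalue by one unit.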

In our scheme the single quantum wire in Ref. [4] used to tele-
port wave functions of a continuous variable q is replaced by a
two-dimensional graph for teleporting entangled states of three angu-
lar momenta. Quantum states representing triangles are created by
entangling the angular momentum states attached to three neigh-
boring nodes in a column of the quantum wire — as illustrated in
Fig. 6.1 — so that the sum of the three angular momenta is zero; i.e.
the angular momentum vectors form a perfect triangle in the classical
limit. In a basis where the angular momentum component Jz along
a fixed axis is well defined, the state representing a triangle is a
sum of products of the three states |ji mi>, where the coefficients
are Wigner’s 3j symbols [114]. In this chapter, we will focus on the
possibilities of teleporting entangled graphical states associated with
a compact subset of the entire graph which represent an array of tri-
angles in two- or three-dimensions. We will assume that initially the
quantum states attached to the nodes {i} of the graph are always
defined in a basis where the eigenvalues of the operators Jiz have
definite values. Motivated by the scheme described in Ref. [4] for
the teleportation of continuous variable quantum states, we imagine
teleporting a triangle along a one-dimensional path within the graph
by combining the action of a controlled phase gate $\prod_i \exp(i J^z_{i_t} \otimes J^z_{i_0})$
acting between the angular momenta states attached to any column
{i0} of the graph with a neighboring column {it} with measurements
of the angle operators sin ϕi and cos ϕi in the entangled state created
by the controlled phase gate. In the following, we describe how these
operations can be used to teleport geometric objects represented as
quantum states.
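The triangle state used here, three angular momenta whose sum is zero, can be constructed numerically without tabulated 3j symbols by finding the zero eigenvector of the total J². The sketch below does this for three spin-1 nodes; the spin value and the helper names `spin_ops` and `embed` are illustrative assumptions. Up to an overall phase, the components of the resulting vector are the 3j coefficients of the triangle state.

```python
import numpy as np

def spin_ops(j):
    """Return (Jx, Jy, Jz) for spin j in the |j, m> basis, m = -j..j."""
    m = np.arange(-j, j + 1)
    jplus = np.diag(np.sqrt(j * (j + 1) - m[:-1] * (m[:-1] + 1)), k=-1)
    jx = (jplus + jplus.T) / 2
    jy = (jplus - jplus.T) / (2 * 1j)
    return jx, jy, np.diag(m).astype(complex)

def embed(op, slot, dims):
    """Place a single-node operator into the three-node tensor product."""
    mats = [np.eye(d) for d in dims]
    mats[slot] = op
    return np.kron(np.kron(mats[0], mats[1]), mats[2])

j = 1                                    # three spin-1 nodes (illustrative)
dims = [2 * j + 1] * 3
total = [sum(embed(op, s, dims) for s in range(3)) for op in spin_ops(j)]
j_squared = sum(t @ t for t in total)    # (J_a + J_b + J_c)^2

evals, evecs = np.linalg.eigh(j_squared)
triangle = evecs[:, 0]                   # eigenvalue 0: the closed triangle
```

For three spin-1 nodes the zero eigenvalue is non-degenerate, and the state has six nonzero components of equal magnitude, one for each ordering of m = +1, 0, −1.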
Following the prescription in [113], the initial graphical wave func-
tion Ψin which describes the array of triangles that one wants to
transport is a product of the entangled wave functions for triples of
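nodes within a set of “input” columns {i0}:

$$\Psi_{in} = \prod_{i \in \{i_0\}} \sum_{m_a, m_b, m_c} \begin{pmatrix} j_{ia} & j_{ib} & j_{ic} \\ m_a & m_b & m_c \end{pmatrix} |j_{ia} m_a\rangle\, |j_{ib} m_b\rangle\, |j_{ic} m_c\rangle, \tag{8.3}$$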

where the symbol in parentheses is the Wigner 3j symbol. If the
initial quantum state of the graph nodes outside the input columns is
a product of angular momentum zero modes |J zi = 0>, teleportation
of Ψin to neighboring “target” columns {it } is accomplished by first
using the controlled phase gate CZ to entangle Ψin with the |J zi =
0> states on the neighboring target nodes, and then measuring the
momentum operators Pi = φi in the entangled state to effect the
transformation:
    
 JiZt → X(φi0 )R(θi0 )F |ψin , (8.4)

ita ,itb ,itc Jz i0 a,i0b ,i0c

where F is the operator that switches the Jiz basis to the φi basis,
X(<Jiz>) is a shift operator that depends on the result of a φi
measurement, and Rz(θ) is a rotation operator about a common z-axis for
the input and target columns, which can be applied either before or
after the CZ gate. The CZ gate in Eq. (6.28) allows one to entangle
an input state representing a triangle of angular momentum states
with control states |ΣJz = 0> on neighboring columns, and then as
a result of measurements of φi for the input nodes the wave function
for the nodes in the next target layer {it} becomes X(φi0)Ψin(qi1).
As an illustration of how quantum angular momentum states
might be used for the teleportation of geometric objects, let us con-
sider the teleportation of angular momentum states representing a
triangle within a quantum circuit consisting of an array of two-
dimensional oscillators. We envision that this array consists of 2D
oscillators which are localized at points in three dimensions and
connected in such a fashion that each node in the lattice is used to define
an angular momentum state in a basis where the Jiz operators for
all the nodes within each column have definite values with respect
to a common z-axis. The wave function for the input layer
corresponding to three nodes within the first column of the lattice has the
form given in Eq. (8.3), where i = 1, 2, 3. The lines connecting the
nodes correspond to controlled phase gates $CZ = \exp(i J^z_{i_t} \otimes J^z_{i_0})$
acting between nodes in neighboring columns. During the teleportation
procedure, the z-axis used to define the angular momentum states for
two neighboring columns of nodes is identified with the shared edge
for two neighboring faces of a tetrahedron. The teleportation then
rotates the quantum state representing one face about this edge so
as to represent the neighboring face of a tetrahedron. In particular,
if the rotation angles θi appearing in Eq. (8.4) are chosen to be the
dihedral angle between neighboring faces of a tetrahedron, then by
carrying out the dihedral rotations and measuring the momentum
operators $J^x + iJ^y$ for the entanglement of the input state with the
product of J z zero modes for the target layer nodes, the input state is
transformed into a quantum state for three angular momenta which
represents a distortion of the face of a tetrahedron. By removing the
distortion using the squeezing operator X(m) and concatenating this
process three times over, one can use the quantum circuit represented
in Fig. 7.1 to construct all four faces of a regular tetrahedron.
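The classical geometry behind this step is easy to verify: rotating one face of a regular tetrahedron about a shared edge by the dihedral angle, arccos(1/3) ≈ 70.53°, carries it onto the neighboring face. The sketch below checks this with a Rodrigues rotation; the vertex coordinates and the helper name are our own illustrative choices.

```python
import numpy as np

# Vertices of a regular tetrahedron (alternate corners of a cube).
A, B, C, D = (np.array(v, dtype=float) for v in
              [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)])

def rotate_about_edge(point, p0, p1, angle):
    """Rotate `point` by `angle` about the line through p0 and p1 (Rodrigues)."""
    k = (p1 - p0) / np.linalg.norm(p1 - p0)
    v = point - p0
    v_rot = (v * np.cos(angle) + np.cross(k, v) * np.sin(angle)
             + k * np.dot(k, v) * (1 - np.cos(angle)))
    return p0 + v_rot

# Dihedral angle: angle at the midpoint of edge AB between the two faces.
mid = (A + B) / 2
u, w = C - mid, D - mid
dihedral = np.arccos(np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w)))

# Rotating face ABC about the shared edge carries C onto D, i.e. onto face ABD.
C_rotated = rotate_about_edge(C, B, A, dihedral)
```

The sense of the rotation is fixed by the orientation chosen for the edge (here from B to A); reversing the edge direction reverses the sign of the required angle.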
Extending the quantum circuit represented in Fig. 8.1 by adding
more columns of nodes would allow the basic teleportation step, Eq. (8.4),
to be repeated indefinitely; in effect allowing the quantum circuit to
function as a “quantum wire” for triangles. Adding columns of nodes

Fig. 8.1. Teleportation sequence illustrating the connection between quantum
models for three-dimensional objects and knots.

along an axis perpendicular to this quantum wire would allow one
to use our basic teleportation step, Eq. (8.4), to teleport a triangle
to an arbitrary location in the resulting square lattice of columns.
It might be noted at this point that extending teleportation along
a straight wire to teleportation along an area-filling curve within
the two-dimensional array of columns could be used to construct an
area-filling array of triangles. In this way construction of arbitrary
two-dimensional shapes becomes possible, since the interior region
of any closed planar curve can be approximated by a close-packed
two-dimensional array of triangles. As Lagrange first pointed out
[146], an equilateral triangle configuration for three bodies interact-
ing via an inverse square force is quasi-stable. In the quantum case
this quasi-stability of the classical configuration may translate to
robustness of the Schwinger/Wigner representation of an equilateral
triangle against quantum noise.
Generalizing the chain of clusters of 2D oscillators representing
our Lagrange triangles to a two-dimensional array of the clusters would
allow a chain of tetrahedrons where neighboring tetrahedrons share a
common edge. Because it is clearly possible for the chain of tetrahe-
drons to connect any two given points in the two-dimensional array
of columns of triple nodes, we have a constructive proof that it is
possible to teleport a tetrahedron between any two locations in a
two-dimensional plane. This implies that there must be some sub-
tlety involved in trying to generalize this scheme for constructing
arrays of tetrahedrons to three-dimensions, because in general paths
in three-dimensions are “knotted”, which implies that a representa-
tion of the teleportation path lying within a two-dimensional array
of columns of nodes must necessarily intersect itself.
The question of some interest to us is whether there is some
generalization of the quantum circuit in Fig. 8.1 that would allow
one to construct three-dimensional space-filling arrays of tetrahe-
drons at arbitrary locations in 3D. As just noted, one can teleport a
two-dimensional closed curve by approximating the two-dimensional
region inside the curve by an array of contiguous triangles. This
suggests that one could approach the problem of teleporting three-
dimensional shapes by first approximating the three-dimensional
object by a space-filling array of tetrahedra. However, space-filling
arrays of tetrahedra cannot in general be constructed using the
two-dimensional teleportation circuit for triangles discussed above
because a mapping of the triangle teleportation path for a three-
dimensional array of tetrahedrons would be self-intersecting. The
difficulties with using the triangle teleportation scheme illustrated
in Fig. 8.1 can be understood by noting that the outer faces of the
double pyramid can be constructed in a straightforward way using
the scheme in Fig. 1.1; however, construction of the entire figure
requires a “knotted” teleportation path.
One possibility to generalize the teleportation scheme illustrated
in Fig. 8.1 so as to teleport tetrahedrons within a space-filling array
of tetrahedrons would be to extend the Schwinger oscillator repre-
sentation of the SU(2) angular momentum algebra to an oscillator
representation of the Pauli O(4) = SU(2)×SU(2) algebra [10]. For the
purposes of constructing quantum states which correspond to O(4)
representations, the one-dimension nodes of a three-dimensional
tetragonal graph generalizing the 2D lattice in Fig. 8.1 can then be
labeled by the eigenvalues of two independent angular momenta oper-
ators, J and K. Remarkably, if we utilize the freedom of labeling the
quantum states attached to nodes with J and K quantum numbers,
then the basic teleportation step Eq. (8.4) operating between nodes
of a tetragonal three-dimensional graph will allow us to construct
a finite space-filling array of tetrahedra. Since three-dimensional
objects with a smooth boundary can be modeled as a space-filling array
of tetrahedra, the replacement of SU(2) angular momentum states on a
tetragonal graph with O(4) states leads to arrays of four-dimensional
pentatopes analogous
to a space-filling array of tetrahedrons. The pentatope is the four-
dimensional generalization of a tetrahedron which contains 5 vertices,
10 edges, 10 triangles, and 5 tetrahedrons.
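The element counts of a simplex follow from a one-line combinatorial rule: a k-dimensional face is fixed by choosing k + 1 of the n + 1 vertices. The short sketch below applies this to the tetrahedron (n = 3) and the pentatope (n = 4); the function name is illustrative.

```python
from math import comb

def simplex_face_counts(n):
    """Number of k-dimensional faces of an n-simplex, for k = 0..n-1."""
    # A k-face is fixed by choosing k + 1 of the n + 1 vertices.
    return [comb(n + 1, k + 1) for k in range(n)]

tetrahedron = simplex_face_counts(3)   # vertices, edges, triangles
pentatope = simplex_face_counts(4)     # vertices, edges, triangles, tetrahedra
```

For the pentatope this gives 5 vertices, 10 edges, 10 triangles, and 5 tetrahedra; for the tetrahedron it gives 4 vertices, 6 edges, and 4 triangles.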
Remarkably, our approach to representing geometric figures is
closely related to theories of quantum gravity in three-dimensions.
In particular, tetrahedral simplexes and their relation to 6j symbols
play an important role in a model for quantum gravity in 2+1 dimen-
sions that was introduced some time ago by Ponzano and Regge
[155]. The Ponzano and Regge model is distinguished in the sense
that in the semi-classical limit the quantum action is related to
a well-known topological invariant for three-dimensional knots, the
Alexander polynomial. If the angular momentum states attached to
the nodes in the figure are replaced by representations of the quantum
group SUq (2), then it turns out that the effective quantum action for
the tetrahedral network is related to the Jones theory of knots. In the
semi-classical limit, the quantum action for Ponzano–Regge becomes
the partition function for a classical statistical model. This brings us
full circle back to statistical models, such as the quantum AT model
discussed in Chapter 7, related to Bayes’s formula Eq. (1.1).

8.2. Cognitive Science and Quantum Physics

We end our presentation with some musings about the deeper signif-
icance of the connection between quantum mechanics and Bayesian
inference. It is easy to get the impression from the chaotic literature
devoted to data science that there is nothing particularly mathe-
matically profound about machine learning algorithms. On the other
hand, one of our aims with this book is to frame the question of
the mathematical significance of Bayesian inference in terms of its
relationship to quantum mechanics. John von Neumann apparently
thought that there was something mathematically profound about
quantum mechanics. However, von Neumann did not clearly artic-
ulate what this means in his publications. Although our presen-
tation has been focused on the relationship of the Bayes formula
and quantum mechanics, we believe that our results may also shed
light on two perennial philosophical puzzles: (1) What lies beneath
the effectiveness of sophisticated mathematics for describing natural
phenomena? As has been emphasized by Wigner [114], there is no
apparent reason why this should be so, and (2) What is the meta-
mathematical meaning of mathematics? This has been a puzzle since
the time of Euclid and Plato.
If we adopt the utilitarian point of view that mathematics is sim-
ply an elaboration of the methods that the human brain has devel-
oped to solve practical problems [115], then at least to some extent
the second puzzle can be absorbed into the more general question
as to how the human brain works. Of course, whatever organiza-
tional features of the human brain are responsible for its capabil-
ity to invent mathematical methods to solve practical problems, it
seems reasonable to believe these features must have their origin in
the way the cerebral cortex of primitive primates, e.g. lemurs, is
organized. This in turn may mean that some simple organization
principle such as Kohonen self-organization may have some relevance
for understanding the emergence of mathematics. In view of the con-
nections between Kohonen self-organization, holomorphic functions,
and quantum mechanics that we have discussed in previous chapters,
we can begin to see how the subjects of cognitive science, quantum
mechanics, and the usefulness of mathematics may be intertwined.
It is interesting to note in this connection that although John von
Neumann argued that digital computers might provide a model for
how the human brain works, he was also apparently very interested
in the relationship between quantum mechanics and mathematics.
Indeed, it is perhaps not entirely coincidental that his most famous
mathematical result, the minimax theorem for zero-sum games [117],
was obtained at the same time (circa 1928) that he was working on
quantum mechanics. In his bio submitted to the National Academy
of Sciences (NAS), von Neumann revealed that “The part of my work I
consider most essential is that on quantum mechanics, which was
developed in Göttingen in 1926, and subsequently in Berlin in 1927–1929”.
Although to the author’s knowledge von Neumann never suggested
that quantum mechanics might be directly relevant to how the brain
worked, it does seem that our results complement von Neumann’s
intuition regarding the joint importance of quantum mechanics and
game theory. Apparently, what von Neumann missed is the useful-
ness of integrable dynamics and Kohonen self-organization for tying
together quantum mechanics, game theory, and cognitive science.
The ability to construct holistic three-dimensional images to repre-
sent data is an example of a capability which quantum devices seem
to share with the mammalian cortex, and in both cases conveys a
certain superiority over conventional computers.
Along these lines, after Kohonen drew attention to the usefulness of
self-organization for analyzing observational data or robotic control
[12], it was realized [13] that self-organizing maps also satisfy the
Cauchy–Riemann equations. Evidently, at the heart of the remark-
able capability of the mammalian brain to solve in real time Bayesian
pattern recognition or decision problems that would be very time
consuming — if not intractable — with conventional computational
resources is the ability of the cerebral cortex to in effect solve the
Cauchy–Riemann equations. On the other hand, efforts to emu-
late Kohonen self-organization using large-scale parallel computers
[147] may be a first step to understanding how this capability might
be approached with conventional computational resources. Further
direct investigation of how brain wave patterns are related to
cognition [154], which may benefit [155] from the analytic techniques for
solving Bayesian inference problems like those discussed in earlier
chapters, seems warranted.
Appendices

A. Gaussian Processes

A Gaussian process (GP) is a vector in an infinite dimensional space
where the vector components are jointly Gaussian random variables.
GPs are universally useful in representing and analyzing data (see
e.g. [148–149]). In the following, we will use the symbol DN to denote
the training data consisting of N pairs, (XN, ZN), of input Gaussian
processes $\{x_i^{(n)}\}$ and associated scalar labels $\{l^{(n)}\}$. In common with
“supervised” neural networks, this approach to pattern recognition
requires introducing examples of input data together with labels dis-
tinguishing different types of data. However, in contrast with artifi-
cial neural networks this approach does not necessarily require that
an interpolating function z(x) for the training data and data labels
have a relatively simple representation in terms of known functions.
Following the advice of MacKay [6], the best procedure for inferring
the interpolating function z(x) given a set of input data and training
labels relies on the use of Bayes’s formula Eq. (2.1):

$$P(z(x) \mid l_N, X_N) = \frac{P(l_N \mid z(x), X_N)\, P(z(x))}{P(l_N \mid X_N)}. \tag{A.1}$$

When the training labels l(n) are real numbers, the problem of
finding the function z(x) is usually referred to as a regression problem. If
l(n) = 0 or 1, the inference problem is usually referred to as a search
problem. Given a training dataset DN = {XN, ZN}, the regression
problem is to infer an interpolating function z(x) which would allow
one to infer the most likely target label l∗ given a new input data
vector {x∗ }. In general, it is not necessary to assume that the inter-
polating function z(x) be parameterized in terms of a finite set of
parameters; i.e. the dimensionality of z(x) can be infinite. Thus, one
might imagine that the computational complexity of finding z(x) is
in general prohibitive. However, since the input dataset is finite, it
is generally sufficient to express z(x) in terms of a set of basis func-
tions that allow one to uniformly represent the features in the data
represented by the labels l(n) . It is often useful to assume that the
signal z(x) can be written in the form

$$z(x, w) = \sum_{m=1}^{M} w_m\, \phi_m(x), \tag{A.2}$$

where we will refer to the parameters wm as weights. The number
M of basis functions will of course depend on how accurately one
would like to represent the distribution of data features represented
by the prior conditional probability for the labels $l^{(n)}$. The covariance
matrix $C(z, z')$ for the signals $\{Z_N\}$ will be given by

$$C\big(z(x^{(n)}), z(x^{(n')})\big) = \sigma_w^2 \sum_m \phi_m(x^{(n)})\, \phi_m(x^{(n')}) + \sigma_\nu^2\, \delta_{nn'}. \tag{A.3}$$

In the ML schemes described by MacKay [6], the input data are in
general characterized by a scalar label {l(n)} assigned to each specific
example {x(n) } of GP input data. If the labels are simply decimal
numbers, then this data analysis task can be considered a “regres-
sion” problem, which is how we will refer to the task in the following.
On the other hand, if the labels l(n) are discrete variables, then the
task can be thought of as a “classification” problem. In the case of
the Bayesian search discussed in Section 3.1, the obvious choice for
this label is whether the object or state being sought lies near the
location corresponding to the vector $x^{(n)} \equiv \{x_1^{(n)}, \ldots, x_d^{(n)}\}$, where
the superscript n corresponds to step n of the search. In general,
the labels {l(n) } attached to GP data cannot be determined exactly,
but with some error, and in general this error can be regarded as a
Gaussian random variable. The methodology described in [6] involves
constructing a continuous GP z(x) to interpolate a set of labels {l(n) }
for a training set XN of input data {x(k)} with, say, N examples (i.e.
k = 1, . . . , N), and then using this interpolation model to attach similar
labels to new examples of data. MacKay’s method assumes that
each measurement y(x(k)) of l(k) differs from the model z(x(k)) by
a random error:
$$y(x^{(k)}) = z(x^{(k)}) + \nu, \tag{A.4}$$
where ν represents observational noise. (It might be noted that
Eq. (A.1) is an example of the general circumstance that the presence
of some noise is always beneficial when it comes to fitting data.)
Using Bayes’s formula, one finds
$$P(z(x) \mid Y_N, X_N) = \frac{P(Y_N \mid z(x), X_N)\,P(z(x))}{P(Y_N \mid X_N)}. \tag{A.5}$$
The power of using GPs lies in the ability to find analytic expressions
for the conditional probability distributions p(z(x)|YN, XN)
and p(l∗|z(x), XN, x∗). Having these analytic expressions in hand
allows one to quickly make probabilistic predictions for the label
l∗ that should be assigned to a new example of input data x∗. Using
GP representations for p(z(x)|YN, XN) and p(l∗|z(x), XN, x∗), the
posterior conditional probability for finding label l∗ at step N + 1 in
a Bayesian search can be found by integrating over all paths y(x):
$$p(l^* \mid D, x^*) = \int dy(x)\, p(l^* \mid y(x), X_N, x^*)\, p(y(x) \mid D). \tag{A.6}$$
It is interesting that Eq. (A.6) provides a description of the evolution


of a Bayesian search for the best interpretation of an observation as
an integral over “paths”, where the paths touch certain points as in
the traveling salesman problem. This is a harbinger for the quantum
schemes introduced in Chapter 7.
A new development in data analysis — the Gaussian process neu-
ral network (GPNN) [153] — offers an alternative to “wide” (i.e.
many nodes in the hidden layers) neural networks that allows proba-
bilistic pattern recognition to be carried out in a deterministic man-
ner without the extensive training effort typically required to fix the
connection strengths in deep back propagation neural networks. The
GPNN has the elegant feature that the network outputs after L lay-
ers can be interpreted as the Bayesian prediction for obtaining an

output Gaussian process z ∗ given an input Gaussian process x∗ . In


particular, given a set of input–output Gaussian processes {x m , z m },
where the pairs for m = 1, . . . , M represent the supervised training
data, the Bayesian prediction for the output Gaussian process z ∗ in
the case of a new input x∗ is

 
$$P(z^* \mid D, x^*) = \int dz\, P(z^* \mid z, x, x^*)\, P(z \mid D) = \int dz\, P(z^*, z \mid x, x^*)\, \frac{P(t \mid z)}{P(t)}, \tag{A.7}$$

where the “observation noise” P (t|z) as well as the prior P (t) are
assumed to be normally distributed about z. If the variables xi , zi ,
and ti are all assumed to be independent Gaussian distributed vari-
ables, the integral in Eq. (A.7) can be carried out analytically. The
result is that z∗(x) is a Gaussian random variable with mean
$$\mu = K(x^*, D)\left(K(D, D) + \sigma^2 I\right)^{-1} t \tag{A.8}$$
and variance
$$\Sigma = K(x^*, x^*) - K(x^*, D)\left(K(D, D) + \sigma^2 I\right)^{-1} K(D, x^*). \tag{A.9}$$
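The predictive mean and variance above can be sketched in a few lines of plain Python. The squared-exponential kernel, its length scale, the noise level σ, and the toy training data are all assumptions made for illustration; the variance is computed in the standard form K(x∗, x∗) − K(x∗, D)(K(D, D) + σ²I)⁻¹K(D, x∗).

```python
import math

def kernel(a, b, length=1.0):
    """Squared-exponential kernel (an assumed choice for K)."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_predict(x_train, t_train, x_star, sigma=0.1):
    """Return (mean, variance) of the GP prediction at x_star."""
    n = len(x_train)
    K = [[kernel(x_train[i], x_train[j]) + (sigma ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [kernel(x_star, xi) for xi in x_train]
    alpha = solve(K, t_train)   # (K(D,D) + sigma^2 I)^{-1} t
    beta = solve(K, k_star)     # (K(D,D) + sigma^2 I)^{-1} K(D, x*)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var

# Hypothetical training set: three inputs with targets t.
mean_near, var_near = gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 1.0)
mean_far, var_far = gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 100.0)
```

Near a training point the predicted variance collapses toward the noise level, while far from the data the prediction reverts to the prior (zero mean, unit variance for this kernel).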

B. Wiener–Hopf Methods

B.1. Cauchy–Riemann equations


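A complex function f(p, q) = u + iv is analytic if u, v satisfy the Beltrami
equation:
$$\frac{\partial}{\partial q}\left(\frac{F\,\partial u/\partial p - E\,\partial u/\partial q}{\sqrt{EG - F^2}}\right) = \frac{\partial}{\partial p}\left(\frac{G\,\partial u/\partial p - F\,\partial u/\partial q}{\sqrt{EG - F^2}}\right), \tag{B.1}$$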

where E, F, G are the metric coefficients for a smooth surface

$$ds^2 = E\,dp^2 + 2F\,dp\,dq + G\,dq^2. \tag{B.2}$$



If F = 0, then the Beltrami equation becomes the Laplace equation


and u, v satisfy the Cauchy–Riemann equations:
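$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}. \tag{B.3}$$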
The Cauchy–Riemann equations are the conditions that (u, v)
represents a “holomorphic flow”.
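A quick numerical check of the Cauchy–Riemann conditions can make them concrete. The sketch below estimates the residuals u_x − v_y and u_y + v_x by central differences; the two test functions (z² and the complex conjugate of z) and the sample point are arbitrary choices made for illustration.

```python
def cr_residuals(f, x, y, h=1e-6):
    """Central-difference estimates of (u_x - v_y, u_y + v_x) at the point x + iy."""
    fx = (f(complex(x + h, y)) - f(complex(x - h, y))) / (2 * h)  # d/dx of u + iv
    fy = (f(complex(x, y + h)) - f(complex(x, y - h))) / (2 * h)  # d/dy of u + iv
    u_x, v_x = fx.real, fx.imag
    u_y, v_y = fy.real, fy.imag
    return u_x - v_y, u_y + v_x

# z^2 is holomorphic, so both residuals should vanish up to rounding.
holo = cr_residuals(lambda z: z * z, 0.3, -0.7)
# conj(z) is not holomorphic: u = x, v = -y, so u_x - v_y = 2.
nonholo = cr_residuals(lambda z: z.conjugate(), 0.3, -0.7)
```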

B.2. N/D factorization


A powerful addendum to the use of special functions to solve wave
propagation problems was introduced in 1931 by Wiener and Hopf
[61], who showed how sectionally holomorphic functions (functions
that in some neighborhood of a chosen point of the complex plane
can be written as a power series in z referring to that point as the
origin) can be used to analytically solve scattering problems that
would otherwise appear intractable. It is noteworthy from the point
of view of this book that the Wiener–Hopf method was later applied
with great effect to the problem of determining the potential in the
Schrödinger equation from the asymptotic phases of a quantum wave
function [60,117]. It is this later development which provides a crucial
signpost for our quantum approach to machine learning. In the
following, we will summarize the Wiener–Hopf method and make much
of the demonstration by Segal and Wilson [73] that the Riemann–Hilbert
method for representing sectionally holomorphic functions
allows one to construct a map between the two BFS spaces that is
equivalent to the construction of the map between two Hilbert spaces
that is the essence of essentially all machine learning algorithms.
Indeed, an abbreviated summary of what we hope to accomplish in
this book is that we will tie together the threads provided by the
use of Riemann–Hilbert, Wiener–Hopf, and Segal–Wilson methods
for analyzing integrable differential equations in order to provide a
Hamiltonian framework for solving signal analysis, optimal control,
and reinforcement learning problems.
The person who turned the Wiener–Hopf idea of using the theory
of complex variables to provide exact descriptions of
scattering problems toward machine learning was Wiener
himself. Working at the onset of World War II on the problem of
extracting a signal from a time series of measurements contaminated
with noise, Wiener derived his filter [81] by solving
an integral equation of the same form as the integral equation
for wave scattering discovered by Wiener and Hopf. Not only was
Wiener’s discovery applied with good effect during the war, but in
the years following WWII Kalman modified Wiener’s signal filter in
such a way as to address the problem of optimal control [83]. It turns
out [41] that the mathematical structure underlying both the Wiener
and Kalman filters involves functions that are rational functions (i.e.
ratios of polynomials) of the wave frequency regarded as a complex
number. Perhaps the most momentous aspect of the effort to find an
exact solution of the KdV equation is that a certain rational function
of the eigenvalue of the 1D Schrodinger operator regarded as a com-
plex variable plays much the same role as the rational functions in
the Wiener and Kalman filters. Thus, the parts of machine learning
that flow from the work of Wiener and Kalman seem to involve in
essence the construction of a certain rational function of a complex
variable. It was left to a group of Russian mathematicians [119] as
well as Segal [73] to point out [94] the connection of this effort with
algebraic geometry and that the setting for this rational function is
a Riemann surface rather than the usual complex plane.
A potentially very profound advantage of using quantum ampli-
tudes rather than real probabilities to solve pattern recognition and
decision tree problems arises from the observation that, in contrast
with classical probability densities, the quantum amplitudes used
to describe the state and evolution of a quantum system are always
complex valued quantities that are typically analytic functions of the
continuous variables describing the system. This allows one to take
advantage of powerful methods for representing analytic functions
in terms of their singularities in the complex plane. In particular, in
1931 Wiener and Hopf [61] made the remarkable observation that certain
kinds of integral equations that arise in scattering problems can
be solved by regarding the scattering amplitudes as analytic func-
tions of the parameters of the problem. For example, it was shown in
the 1950s that certain interesting problems involving the scattering
of electromagnetic waves that one might guess are intractable, e.g.
the scattering of an electromagnetic wave from a flat conductor with
a knife edge, could be easily solved by extension of the physical solu-
tion to a solution where the frequency is a complex variable. In the
case of the scattering of a quantum particle from a localized potential,
the Wiener–Hopf method can be illustrated by considering the
low energy scattering of a quantum particle. The asymptotic wave
function is proportional to
$$u(k) = \left(e^{ikr} e^{2i\delta} - e^{-ikr}\right)/r, \tag{B.4}$$
where δ is the S-wave phase shift. The scattering amplitude f(k) is
$$f(k) = \frac{e^{i\delta}\sin\delta}{k}. \tag{B.5}$$
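A short worked step, valid under the assumption that k and δ are both real, may help show why the imaginary part of f is what feeds a dispersion relation. Rewriting Eq. (B.5) gives the familiar effective-range form and the elastic unitarity condition:

```latex
f(k) = \frac{e^{i\delta}\sin\delta}{k}
     = \frac{\sin\delta}{k\,e^{-i\delta}}
     = \frac{1}{k\cot\delta - ik},
\qquad
\operatorname{Im}\frac{1}{f(k)} = -k,
\qquad
\operatorname{Im} f(k) = \frac{\sin^2\delta}{k} = k\,|f(k)|^2 .
```

In other words, on the physical axis the imaginary part of 1/f is fixed by kinematics alone, so the dynamical information resides in the real quantity k cot δ.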
The time reversal symmetry of the Schrodinger equation implies that

$$f^*(k) = f(-k^*) \quad \text{and} \quad f^*(k^2) = f(k^{2*}). \tag{B.6}$$

These properties of the scattering amplitude are consistent with


assuming that the only singularities in f (k) regarded as a complex
function of k 2 are branch cuts running along the real axis with an
imaginary discontinuity across the branch cut. This allows one to use
Cauchy’s theorem to write f (k) in the form
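$$f(k^2) = \frac{1}{\pi}\int_{-\infty}^{-k_0^2} \frac{\operatorname{Im} f(k'^2)\,dk'^2}{k'^2 - k^2 - i\epsilon} + \frac{1}{\pi}\int_{0}^{\infty} \frac{\operatorname{Im} f(k'^2)\,dk'^2}{k'^2 - k^2 - i\epsilon}. \tag{B.7}$$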

If we now write f (k) ≡ N/D, where N has singularities only in the


left-hand complex plane and D has singularities only in the right-hand
plane, then the Cauchy representation (B.7) becomes [152]
$$N(k) = \frac{1}{\pi}\int_{-\infty}^{-k_0^2} \frac{D(k'^2)\,\operatorname{Im} f(k'^2)\,dk'^2}{k'^2 - k^2}, \qquad D(k) = 1 - \frac{k^2}{\pi}\int_{0}^{\infty} \frac{N(k'^2)\,k'\,dk'^2}{k'^2 - k^2}. \tag{B.8}$$
The quantities N (k) and D(k) carry all the information regarding the
potential that is necessary to construct the S-wave scattering ampli-
tude for a particle as function of the momentum of the particle, and

provide an elegant and unique path to defining the kernel function


that appears in both Dyson’s adaptive optics setup and Bayesian
pattern recognition. In the context of quantum scattering theory the
function D(k) is known as the Jost function. This function plays an
important role in the theory of inverse scattering in both one- and
three dimensions. In three dimensions, D(k) becomes
a matrix [120], and the inverse scattering equations are known as the
Newton–Jost equations.

B.3. The Gelfand–Levitan–Marčenko (GLM) equation

The problem is to find the potential of the 1D Schrödinger equation
using the frequency dependence of the reflection and transmission
coefficients for the potential (which is normally assumed
to be spatially compact). This problem was first addressed by
Marčenko [60]. In his 1955 paper, Marčenko showed that the
Wiener–Hopf method used by Wiener to construct his noise filter
could be used to solve the problem of finding the potential
for the Schrödinger equation based on scattering data.
A similar method was also developed by Gelfand and Levitan which
relates causal and anti-causal Green’s functions for the inhomogeneous
wave equation. In order to explain how this works, we will
follow Tao’s nice derivation in his tribute to Israel Gelfand [59]. The
GLM equations allow one to infer the potential u(x) that occurs in the
inhomogeneous wave equation
$$-\frac{\partial^2\psi}{\partial t^2} + \frac{\partial^2\psi}{\partial x^2} = u(x)\,\psi(x, t) \tag{B.9}$$
based on scattering data, i.e. the data obtained by measuring the reflection
coefficient R(k) as a function of wave number. If u(x) were 0, then
the solution to Eq. (B.9) would be a superposition of a right-moving
wave f(x − t) and a left-moving wave g(x + t). If u(x) is localized
near the origin inside a finite interval [−L, L], then the solution in the
region x < −L has the form f−(x − t) + g−(x + t), while the solution
in the region x > L has the form f+(x − t) + g+(x + t). Two special
cases of these general solutions are of particular interest: (1) for large
negative values of t, f− = δ(x − t) and g+ = 0, while for large values of
t the solution for x < −L has the form G(x, t) = δ(t − x) + R(t + x),

where R(t + x) is the amplitude of the wave generated by reflection


of the input pulse from the potential, and (2) the roles of x and
t are reversed, so that as a result of a time-dependent potential a
solution where f− = δ(x − t) and g− = 0 for x < −L evolves to a
solution of the form u(x, t) = δ(x − t) + K(t − x), where K(x, t) = 0
unless x > t. The Fourier transform of the function R is referred to
as the scattering data while the Fourier transform of the matrix K is
often referred to as the Jost function [51]. Multiplying the solution
δ(x + t − s) + K(s − t, x) by R(s) and integrating over s one can show
that δ(x − t) + R(x + t) + ∫ K(x, s)R(s + t)ds is also a solution. Using
the fact that G(x, t) vanishes for x > t, one arrives at the integral
equation:
$$R(t + x) + K(t - x) + \int_{-\infty}^{x} R(s + x)\,K(s - x)\,ds = 0. \tag{B.10}$$

Given the input data R(t + x), the acausal covariance function K(x, y)
can be determined by solving the linear integral Eq. (2.15). This is the
covariance function that is used in least squares stochastic estimation.
The potential that appears in the wave equation (B.9) is given
by
$$u(x) = 2\frac{d}{dx}K(x, x). \tag{B.11}$$
The GL equations pertain to the scattering solutions of the time-
independent form of Eq. (B.9):

$$-\frac{\partial^2\psi}{\partial x^2} + u(x)\,\psi(k, x) = k^2\,\psi(k, x), \tag{B.12}$$
where u(x) is assumed to be everywhere positive (u(x) > 0) with
compact support centered on the origin. The solutions to Eq. (B.12)
that are of interest in connection with the inverse scattering problem
are solutions which for x → −∞ have the form
$$\psi(k, x) = e^{-ikx} + R(k)\,e^{ikx}, \tag{B.13}$$
where R(k) is referred to as the reflection coefficient. For x → ∞, the
solutions of interest have the form
$$\psi(k, x) = T(k)\,e^{-ikx}, \tag{B.14}$$
where the reflection and transmission coefficients satisfy
$$|T(k)|^2 + |R(k)|^2 = 1. \tag{B.15}$$
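The flux-conservation condition (B.15) can be checked on a simple solvable case. The sketch below assumes a delta-function potential u(x) = g δ(x), which is not an example used in the text, and the standard textbook amplitudes R = g/(2ik − g) and T = 2ik/(2ik − g) obtained by matching the wave function and the jump in its derivative at x = 0.

```python
def delta_potential_amplitudes(k, g):
    """Reflection and transmission amplitudes for -psi'' + g*delta(x)*psi = k^2 psi.

    Assumed illustrative potential; the amplitudes follow from continuity of psi
    and the derivative jump psi'(0+) - psi'(0-) = g * psi(0).
    """
    denom = 2j * k - g
    return g / denom, 2j * k / denom

# Hypothetical coupling strength and a few wave numbers.
flux_sums = []
for k in (0.1, 1.0, 5.0):
    R, T = delta_potential_amplitudes(k, g=1.5)
    flux_sums.append(abs(R) ** 2 + abs(T) ** 2)
```

For a real potential the sum stays at one for every k, while the split between reflection and transmission shifts toward transmission as k grows.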
The inverse problem is: given R(k) for 0 < k < ∞, determine v(x).
The Marčenko method [52] solves the integral equation
$$u(s, x) = R(s + x) + \int_{-\infty}^{t} R(\tau)\,u(\tau, x)\,d\tau, \tag{B.16}$$
where s < x. The potential for the inhomogeneous wave equation is
$$v(x) = 2\frac{\partial}{\partial x}u(x, x). \tag{B.17}$$
Alternatively, the kernel u(x, x) can be written as
$$u(x, x) = \frac{1}{2}\int_{-\infty}^{x} v(x')\,dx'. \tag{B.18}$$
As noted in the text, kernel functions u(x, y) that satisfy (B.16) also
arise in connection with finding least square estimators for signal
filters and data feature predictors. In both these cases the interpretation
of Eq. (B.12) as the time-independent Schrödinger equation is
very useful as a guide to what sorts of kernels might be of interest.
Indeed, the eigenfunction expansion considered in the original paper
of Marčenko et al. is a natural quantum Hilbert space representation
for data features.

B.4. The Riemann–Hilbert problem


Riemann’s 1851 PhD thesis first posed the question of how
to construct a complex valued analytic function from knowledge of
its real and imaginary parts on a curve in the complex plane.
In 1905, David Hilbert reshaped Riemann’s problem into the
solution of a certain nonlinear integral equation [127]. This equation
is very similar to the integral equation introduced in the 1930s
by Norbert Wiener and Eberhard Hopf for the scattering of electromagnetic
waves [61]. Historically, the mathematical underpinnings of
the Wiener–Hopf method go back to Hilbert’s 1905 paper in which
he considered a boundary value problem for a holomorphic complex
valued function Φ+ defined in a connected region S+ of the complex
plane bounded by a closed contour L:
Φ+ = G(t)Φ− + g(t), (B.19)
where Φ± are the boundary values of holomorphic functions on
S+ and its complement S−, and G(t) and g(t) are smooth bounded
functions on L. In the 1930s, it was shown that writing
$$G(t) = \frac{X^+(t)}{X^-(t)} \tag{B.20}$$
the general solution to Eq. (B.19) is [67]
$$\Phi(z) = \frac{X(z)}{2\pi i}\int_L \frac{g(t)\,dt}{X^+(t)\,(t - z)} + X(z)P(z), \tag{B.21}$$
where X(t) is the solution to Eq. (B.19) when g(t) = 0, and P(z) is a
polynomial. It might be noted that the integral in Eq. (B.21) can be
expanded in inverse powers of z. This property plays a role in our
discussion of the Baker function in Chapter 5.
As a result of the flurry of activity in the 1920s and 1930s devoted to
finding solutions of the Schrödinger equation for physically interesting
problems, certain special analytic functions
that had been discovered in the 19th century were again found to
be quite useful. Of special interest to us is that while he was a postdoctoral
student at the Bohr Institute in Copenhagen, Lev Landau
found that these “special” analytic functions
are of particular importance for quantum
mechanics because they solve the Schrödinger equation for problems
of great interest. Prominent examples are the hypergeometric functions.
In the case of the Airy function, the problem is a particle
subject to a linear potential; e.g. the potential acting on a particle
near the surface of the Earth due to gravity. In fact, Airy functions
are of importance for quantum motion in any potential because they
describe the motion of the particle near the classical turning point.
A powerful addendum to the use of special functions to analyt-
ically solve linear differential equation was introduced in 1931 by
Wiener and Hopf [61], who showed how sectionally holomorphic func-
tions (functions that in some region of the complex plane can be
written as a power series in z referring to some point in the region
as the origin) can be used to analytically solve scattering problems
that would otherwise appear intractable. It is noteworthy from the
point of view of this book that the Wiener–Hopf method was later
applied with great effect to the problem of determining the potential
in the Schrödinger equation from the asymptotic phases of a quantum wave
function [60,117]. It is this later development which provides a crucial
signpost for our path to using quantum mechanics for machine
learning. We will make much of the demonstration by Segal and
Wilson [75] that the Riemann–Hilbert method for representing sec-
tionally holomorphic functions allows one to construct a map between
the two BFS spaces that is equivalent to the construction of the
map between two Hilbert spaces that is the essence of essentially
all machine learning algorithms. Indeed, an abbreviated summary of
what we hope to accomplish in this book is that we will tie together
the threads provided by the use of Riemann–Hilbert, Wiener–Hopf,
and Segal–Wilson methods for analyzing integrable differential equa-
tions in order to provide a Hamiltonian framework for solving signal
analysis, optimal control, and reinforcement learning problems.
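$$\Psi(\xi, k) = \begin{pmatrix} 1 & \displaystyle\int_\Gamma \exp\left[i\left(\xi v + \frac{v^3}{3}\right)\right] \frac{dv}{v - u} \\ 0 & 1 \end{pmatrix}, \tag{B.22}$$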
which allows one to extract the solution for the homogeneous
Painlevé equation as the residue of the Ψ12 matrix element as λ → ∞.
Its [68] calls this the “nonabelian Airy problem”. In general, one can
extract the integrand g(λ) of an integral ∫ g(λ)dλ over a contour by
taking the limit λ → ∞ of
$$Z(\lambda) = \begin{pmatrix} 1 & \displaystyle\int \frac{g(\mu)}{\mu - \lambda}\,d\mu \\ 0 & 1 \end{pmatrix}, \tag{B.23}$$

which yields the Wiener–Hopf factor


 
$$G(\lambda, x) = \begin{pmatrix} 1 & 2\pi i\,g(\lambda) \\ 0 & 1 \end{pmatrix}. \tag{B.24}$$

The Baker wave function is


 
$$\Psi(\lambda, x) = Z(\lambda, x)\exp\left[\left(\frac{4}{3}\lambda^3 + x\lambda\right)\sigma_3\right], \tag{B.25}$$
where σ3 is the Pauli matrix.

A similar approach works for the KdV equation using the real
line as the contour [71]. G(z, x) contains the scattering data for the
Baker function, and the matrix linking the two holomorphic Hilbert
spaces is
$$G(\lambda, x) = \begin{pmatrix} 1 - |r|^2 & r\,e^{-2i(z^3 t + xz)} \\ r\,e^{2i(z^3 t + xz)} & 1 \end{pmatrix}\Psi_-. \tag{B.26}$$

We now come to the punch line; this matrix contains all the infor-
mation necessary to construct the Bellman and reward functions for
the Kalman filter:
$$\Phi(z) = \frac{1}{2\pi i}\int \frac{\Phi^+ - \Phi^-}{s - z}\,ds, \tag{B.27}$$
where

Φ+ (s) = Φ− (s)G(s),

is a finite dimensional matrix equation. If the contour in (B.27) is a


polygon, then the logarithmic jump matrix G(s) = M1 M2 . . . Mk is a
product of piecewise constant finite matrices. The rational function
φ(x, λ) can then be constructed as a product of τ -functions:

τk (Y ) = Y (λ)Mk . (B.28)

The appearance of a τ -function here is a strong hint that stochas-


tic processes underlie the KdV dynamics [72].

B.5. Inverse scattering transform


The problem of determining the nature of an inhomogeneous medium
from scattering data, the so-called inverse scattering problem, is
of great importance in several contexts; e.g. geophysical exploration
and seismology. Although in general these problems can only be
approached numerically, the scattering of a plane wave by a compact
inhomogeneity in three dimensions is one example that is tractable because of the
Wiener–Hopf method [61]. Newton [117] derived equations generalizing
the GLM equations which can be used to infer the nature of
a localized potential in three dimensions from observations of the
scattering of waves off the potential. As in the 1D case, the initial
step was to introduce the Fourier transform of the scattering amplitude.
Our quantum approach to Bayesian inference is based on an opti-
mization principle for classical inverse scattering discovered by Rose
[34]. The origin of Rose’s principle is a method for solving the GL
equation due to Marčenko [60,117]. As in our discussion of the GLM
equation in Section B.3, Rose starts by considering the solution
u+(t, x) for the inhomogeneous wave equation [34,116] when the initial
state is a delta function shaped wave incident from the left:
$$u_+(t, x) = \delta(t - x) + R(t + x), \quad \text{for } x < 0 \tag{B.29}$$
and
$$u_+(t, x) = T(t - x), \quad \text{for } x > 0. \tag{B.30}$$

The existence of the solution u+(x, t) allows one to write the solution
to (B.9) when the incident wave u0(x, t) has any shape with a sharp
wave front, i.e. u0(x, t) = 0 when x − t > x0, in the Green’s function
form:
$$u(t, x, x_0) = \int_{-\infty}^{\infty} u_+(t - \tau, x)\,u_0(\tau, 0, x_0)\,d\tau. \tag{B.31}$$
The analog of the orthogonality condition for representing kernel
eigenfunctions is
$$\int_{-\infty}^{\infty} u_+(t, x)\,u_+(t', x)\,dx = \delta(t - t') + R(t + t'). \tag{B.32}$$
The usual orthonormality conditions for the time-independent
Schrödinger equation can be recovered by Fourier transforming
Eq. (B.32) with respect to both t and t′. Causality and the finite
velocity of propagation imply that the scattered field (B.17) near the
wave front is given by
$$u(t = 0, x_0^-; x_0) = u_+(t = 0, x_0^-; x_0) - \frac{1}{2}\int_{-\infty}^{x_0} v(x)\,dx.$$

The Rose optimization principle [34] is that the “best” choice for v(x)
is the one that leads to a scattering state that is entirely focused on
a particular location x∗ at a chosen time t∗ in the future:
$$B = \int_{-\infty}^{\infty} \left[u(t, x; x_0) - \delta(t - x + x_0)\right]^2 dx = 0. \tag{B.33}$$

B.6. Wave propagation with flexible boundaries


In Methods of Theoretical Physics [116], Morse and Feshbach discussed
in some detail how to calculate the response of a string, a
membrane, or an elastic medium to an arbitrary force f(t) applied to
some limited region inside the medium or on its boundary. They described
a general method for describing the propagation of sound waves in an
elastic medium with a flexible boundary with its own dynamics. For an
elastic medium that can be modeled as a string with length l, their
approach was to write the solution as a double integral
$$\psi(x|t) = \int_0^l dx_0 \int_{-\infty}^{t} g(x|x_0|t - \tau)\,f(x_0|\tau)\,e^{-i\omega t}\,d\tau, \tag{B.34}$$

where it should be noted that the second integral only goes over times
prior to t. They showed that problems of this type can be solved
by first calculating the wave response G(x, x0 , ω) when a periodic
impulse is applied at a particular point x0 on the boundary of the
medium carrying the wave:
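$$\psi(x|x_0|t) = \int_{-\infty}^{\infty} G(x, x_0|\omega)\,F(\omega)\,e^{-i\omega t}\,d\omega, \tag{B.35}$$
where
$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{i\omega t}\,dt. \tag{B.36}$$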

Unfortunately, when G as a function of ω has singularities on the


real axis, then the integral in Eq. (B.35) is not convergent, and one
must resort to considering ω as a complex variable. This led Morse
and Feshbach [116] to consider replacing Eq. (B.36) with the Laplace
transform:
$$F(s) = \int_{0^-}^{\infty} f(t)\,e^{-st}\,dt. \tag{B.37}$$

The wave motion with a flexible boundary, taking into account
the distributed impulse resulting from the entire flexible boundary,
can now be found by finding the “filter” function g(x, x0, t) for the
coupled wave medium/flexible boundary whose Laplace transform is the
Green’s function G(x, x0, ω). Given this filter function, the complete
motion of the string due to the imposition of a distributed force
f(x0|t) along the length of the string can be obtained. We can do no
better in summarizing the Morse–Feshbach proposal for how to solve
this type of problem than simply to quote their synopsis in Methods of
Theoretical Physics [116]:
“First compute the Green’s function G(x, x0 |ω) for the steady
state response of the system to a force of unit amplitude and fre-
quency ω applied to point (x0 , y0, z0 ) within or on the boundary, by
solving an inhomogeneous Helmholtz equation with an inhomoge-
neous boundary condition. Find the impulse function g(x, x0 , t) for
which G is the Laplace transform, either by contour integration of
(Eq. (B.35)) or inverting (Eq. (B.37)). The response to f (x, t) is then
given by (Eq. (B.35))”.
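The recipe in the quotation can be sketched for the simplest possible system. The example below assumes a single damped oscillator x″ + 2γx′ + ω0²x = f(t), not the string with a flexible boundary treated by Morse and Feshbach, and the parameter values are hypothetical. It checks numerically that the Laplace transform of the causal impulse response g(t) is the steady-state response function 1/(s² + 2γs + ω0²), and then obtains the response to a force by integrating only over times prior to t.

```python
import math

GAMMA, OMEGA0 = 0.5, 2.0                       # hypothetical damping and natural frequency
OMEGA_D = math.sqrt(OMEGA0 ** 2 - GAMMA ** 2)  # damped oscillation frequency

def impulse_response(t):
    """Causal impulse response g(t) of x'' + 2*GAMMA*x' + OMEGA0^2 x = f(t)."""
    if t < 0.0:
        return 0.0
    return math.exp(-GAMMA * t) * math.sin(OMEGA_D * t) / OMEGA_D

def laplace_transform(g, s, t_max=40.0, dt=1e-3):
    """Riemann-sum estimate of G(s) = integral from 0 to t_max of g(t) exp(-s t) dt."""
    steps = int(t_max / dt)
    return sum(g(i * dt) * math.exp(-s * i * dt) for i in range(steps)) * dt

def response(force, t, dt=1e-3):
    """psi(t) = integral over tau <= t of g(t - tau) force(tau); force assumed zero for tau < 0."""
    steps = int(t / dt)
    return sum(impulse_response(t - i * dt) * force(i * dt) for i in range(steps + 1)) * dt

G_at_1 = laplace_transform(impulse_response, 1.0)   # compare with 1 / (1 + 2*GAMMA + OMEGA0**2)
step_response = response(lambda tau: 1.0, 20.0)     # settles near the static value 1 / OMEGA0**2
```

Because g(t) vanishes for negative argument, the force at times later than t never enters the sum, which is the causality property emphasized in the text.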
One beautiful feature of the Morse–Feshbach prescription for deal-
ing with rubber potentials is that it not only illustrates the role of
causality more clearly, but it also immediately illustrates why the
nonrelativistic Schrodinger equation is relevant for understanding
wave propagation with rubber potentials.

B.7. Adaptive optics


A very interesting practical application of rubber potentials came
about in the 1970s because of efforts [38] to remove the degradation of
the angular resolution of ground-based astronomical telescopes due
to atmospheric turbulence. In his 1975 paper on adaptive optics for
reflective telescopes in the presence of photon noise, Dyson showed
[58] that it is possible to remove the noise in an optical signal passing
through the atmosphere due to turbulence by adjusting the shape of a
reflecting mirror. The original approach to adaptive optics [38] made
use of neural networks similar to those mentioned in Appendix A.

In the context of the first approaches to adaptive optics, these neural


networks used as inputs the output of a phase sensor which detected
deviations in wave fronts from flatness, and then used a feedback loop
to adjust the shape of the mirror. The setup considered by Dyson
similarly used a phase sensor’s assessment of the shape of a flexible
mirror surface Σ and then used a feedback circuit to mechanically
deform the surface so as to compensate for small variations a(σ, t) in
the optical path length of light rays incident on a surface at various
locations σ. Dyson proceeded by writing down equations describing
the interplay between the controlled deformations in the shape of
a surface and changes in the output of phase sensors which record
changes in the intensity of light beams due to changes in the shape
of the surface. These equations involve two matrices, Aj (σ, t) and
B j (σ, t). The matrix B supposes that we have a control system that
adjusts the displacement δ(σ, t) of the surface with sufficient accuracy
so that the signal intensity at time t and position σ is a linear function
of the observed phase φ(σ, t) ≡ δ(σ, t) + a(σ, t)

$$I(j, t) = I_0(j) - \int d^2\sigma\, B_j(\sigma, t)\,\phi(\sigma, t), \tag{B.38}$$
where I(j, t) is the signal recorded at time t in detector j and the
I0(j) are the recorded signals in the sensor array in the absence of
imposed variations in signal intensity due to atmospheric noise. The
second equation relates the deformation of the surface at location σ
produced by the feedback control to the observed signal intensity in
the sensor array:
$$\delta(\sigma, t) = \sum_j \int_{-\infty}^{t} dt'\, A_j(\sigma, t - t')\,I(j, t'), \tag{B.39}$$
where the integral over dΩ means sampling the light intensity at a
sufficiently large number of points on the sensor array as is required
to determine the parameters which define the shape of the surface.
When photon noise N is neglected, Eqs. (B.38)–(B.39) have the classical
solution (in matrix shorthand for Aj and Bj)
$$\phi(N = 0) = [1 - AB]^{-1} a. \tag{B.40}$$
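The closed-loop solution (B.40) is equivalent to the fixed-point relation φ = a + ABφ, which can be illustrated with a toy two-channel loop. In the sketch below the loop-gain matrix AB and the path errors a are hypothetical numbers, and time dependence is ignored; it compares a direct solve of (1 − AB)φ = a with repeated passes around the feedback loop.

```python
def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def solve_2x2(m, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * b[0] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det]

AB = [[-0.5, 0.1], [0.2, -0.4]]   # hypothetical loop gain (matrix product A B), negative feedback
a = [1.0, -2.0]                    # hypothetical atmospheric path-length errors

# Direct solution of (1 - AB) phi = a.
one_minus_AB = [[1.0 - AB[0][0], -AB[0][1]], [-AB[1][0], 1.0 - AB[1][1]]]
phi_direct = solve_2x2(one_minus_AB, a)

# Iterating the loop phi <- a + AB phi converges because the gain is small.
phi_loop = [0.0, 0.0]
for _ in range(200):
    fed_back = matvec(AB, phi_loop)
    phi_loop = [a[i] + fed_back[i] for i in range(2)]
```

With negative feedback of this size the residual phase φ is smaller in magnitude than the uncorrected error a in both channels, which is the intended effect of the control loop.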
Equations (B.38–B.39) define Dyson’s adaptive optics model. What
is most remarkable about his model though is that when photon noise

is considered, the problem of changing the shape of the deformable


mirror to compensate for random changes in the optical path of the
illuminating beam across the mirror aperture becomes equivalent to
solving the inversion problem for the multi-channel Schrodinger equa-
tion. This situation is qualitatively different from the classical case
because in the classical case the negative feedback would amplify the
photon noise. Dyson showed that in the presence of photon noise the
optimal feedback matrices A(σ, x, t′) and B(σ, x) remain finite and
satisfy $A = K B^T I_0^{-1}$, where K(σ1, σ2, t1, t2) is a matrix satisfying
the nonlinear integral equation:
$$K + K^T + K\left(B^T I_0^{-1} B\right)K^T + U = 0, \tag{B.41}$$
where U is the average ⟨a1 a2⟩ over time. Dyson showed that
the solution to Eq. (B.41) is optimal in the sense that a quadratic
function of the errors is minimized. Dyson also noted that Eq. (B.41)
has the same form as the multi-channel Newton–Jost equation [117]
that solves the inverse scattering problem for the quantum mechan-
ical scattering of a quantum particle by a possibly anisotropic 3D
potential.

C. Riemann Surfaces

A Riemann surface is a topologically nontrivial smooth curved 2D


surface whose points can be parameterized by the solution of an alge-
braic equation. The appearance of this Riemann surface is a result of
a surprising theorem from the 1920s due to Burchnall and Chaundy
[77] that the complex valued eigenvalues, say y and z, for two commuting
differential operators constructed from the infinity of Lax
operators parameterize a 2D surface defined by an algebraic equation
of the form
$$y = \pm\sqrt{a_0 + a_1 z + a_2 z^2 + \cdots + a_n z^n}. \tag{C.1}$$
Riemann’s greatest achievement was to point out that the ambiguity
represented by the ± sign in an algebraic expression like Eq. (C.1)
can be removed by the assertion that the solutions to (C.1) corre-
spond to points on a topologically nontrivial 2D surface. The value
of n in Eq. (C.1) becomes the “genus” g of this surface; i.e. the num-
ber of “handles” formed by the surface (cf. [55]). There is a complex

torus, known in the literature [54,55] as the “Jacobian variety” of the


surface, associated in a canonical way with any Riemann surface. In
the case of the KdV equation, the Poisson–Arnold tori [91] emerge
as the Jacobian variety associated with the Burchnall–Chaundy
surface [77].
Algebraic geometry enters when for each torus one considers all
possible pairings of the eigenvalues λ of the Lax operator L(u) with
the eigenvalues of any operator Q that commutes with L(u). This
is a complex curve C which can be mapped into a complex torus
$\mathbb{C}^g/L$, where L is a g-dimensional complex lattice, by the Abel map
$$A(w) = \int_{w_0}^{w} d\omega,$$
where the integration path is a nontrivial 1-cycle on the surface.

What gives Riemann surfaces their punch are the Θ-functions [53–56],
which define an $n^g$-dimensional Hilbert space consisting of independent
meromorphic functions:
$$\Theta(A|T_{ij}) \equiv \sum_{n \in \mathbb{Z}^g} \exp\left[i\pi \sum_{ij} n_i T_{ij} n_j + 2\pi i \sum_j A_j n_j\right], \tag{C.2}$$
where $A_j \equiv \int_{P_0}^{P} d\omega_j$ and the $T_{ij} \equiv \oint_{B_i} d\omega_j$ are the “periods”
for the Riemann surface obtained by integrating one of the g algebraically
independent rational differentials on the Riemann surface
around one of its “B” cycles. These functions are not L-periodic, but
L-automorphic:
$$\Theta\left(A + \sum_j T_{ij} m_j \,\Big|\, T_{ij}\right) = \left[\exp\left(-i\pi T_{ii} - 2\pi i A_i\right)\right]^{n}\,\Theta(A|T_{ij}). \tag{C.3}$$
One thing that is remarkable about Θ-functions is that they define
an embedding of a Riemann surface in a projective space. Because
of the form of the prefactor in Eq. (C.3), the Θ-functions define
an embedding of the complex torus $\mathbb{C}^g/L$ into a projective space of
dimension $n^g$. The Θ-functions define an embedding of a Riemann
surface into a projective space of dimension $n^{2g}$, and therefore can
be considered as quantum wave functions on the Riemann surface.
These functions are not periodic but automorphic with respect to a
complex lattice L; i.e. instead of being L-periodic, they get
multiplied by a factor

$$ e_\alpha(x) = \exp\Big\{ \pi \Big[ H(x,\alpha) + \tfrac{1}{2} H(\alpha,\alpha) \Big] \Big\}, \tag{C.4} $$
where α ∈ L. The Hermitian form H is defined by the fact that
its imaginary part is just the usual intersection form for homology
cycles, which maps L × L to the integers. The form (C.4) plays an
important role in relating the theory of Riemann surfaces to the
theory of algebraic varieties and quantum mechanics. These relations
are encoded in two fundamental topology theorems due to Solomon
Lefschetz [54]. Of particular interest to us is that the exact
multi-soliton solutions to the KdV and NLS equations can be expressed in
terms of the Riemann Θ-function:

$$ \theta(z) \equiv \sum_{\mu\in\mathbb{Z}^g} \exp\Big[ 2\pi i \Big( \tfrac{1}{2}\, \mu\cdot T\mu + \mu\cdot z \Big) \Big], \tag{C.5} $$

where $z \in \mathbb{C}^g$, $\mu \in \mathbb{Z}^g$, and $T \in H_s$, where $H_s$ is the “Siegel
half-plane” consisting of symmetric complex g × g matrices with positive
definite imaginary part. The Θ-functions (C.5) are naturally associated with
Riemann surfaces, which for our purposes will be the auxiliary
Riemann surfaces associated with either the KdV equation or the
NSE. The period matrix $T = \{\tau_{ij}\}$ is determined by the periods of
abelian integrals around the g nontrivial “B type” homotopy cycles of
a Riemann surface with genus g. (The periods around the “A type”
cycles are conventionally normalized to 1.) The Θ-function is not
periodic on $\mathbb{C}^g$ but automorphic:

$$ \theta(z+\alpha) = e^{2\pi i[-\mu'\cdot z - \frac{1}{2}\mu'\cdot T\mu']}\,\theta(z), \tag{C.6} $$

where the lattice vector $\alpha \in \mathbb{C}^g$ can always be written in the form
$\alpha = I\mu + T\mu'$ with $\mu, \mu' \in \mathbb{Z}^g$.


(These equations are written in different ways in different books on
Riemann surfaces, but I follow Farkas and Kra.) The Riemann
Θ-functions are single valued and holomorphic on the complex torus
$\mathbb{C}^g/\Lambda$, where Λ is the 2g-dimensional real lattice defined by the
periods of abelian integrals of the allowed g independent rational
differentials [55] on a Riemann surface. A remarkable property of
Θ-functions is that although they are holomorphic functions on $\mathbb{C}^g/\Lambda$,
the ratio of Θ-functions with different values of the displacement
β in (6.40) can be used to define, by pullback from $\mathbb{C}^g/\Lambda$ to a
corresponding Riemann surface, a rational function on the Riemann
surface that provides the prefactor (and scattering data) for the
solution of the Lax equation for the KdV equation and NSE. The
exact expression for this prefactor in terms of Θ-functions can be
found in the literature on the NSE, but has the general form

$$ \phi(z) = \frac{\theta(z+\alpha)\,\theta(z-\alpha)}{\theta(z+\beta)\,\theta(z-\beta)}. \tag{C.7} $$
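The automorphy rule (C.6) and the periodicity of the ratio (C.7) can be checked numerically in the simplest case. The following is a minimal sketch for genus g = 1 only, where T reduces to a single complex number tau with positive imaginary part; the function names, the truncation `cutoff`, and the sample values of tau, z, alpha and beta are illustrative assumptions, not taken from the text.

```python
import cmath

def theta(z, tau, cutoff=20):
    """Genus-1 Riemann theta function: the series (C.5) with g = 1,
    truncated to |mu| <= cutoff (rapidly convergent when Im(tau) > 0)."""
    return sum(cmath.exp(2j * cmath.pi * (0.5 * mu * mu * tau + mu * z))
               for mu in range(-cutoff, cutoff + 1))

def automorphy_factor(z, tau):
    """Prefactor in (C.6) for a shift by one B-period (mu' = 1)."""
    return cmath.exp(2j * cmath.pi * (-z - 0.5 * tau))

def phi(z, tau, alpha, beta):
    """Ratio of theta functions with the general form (C.7)."""
    return (theta(z + alpha, tau) * theta(z - alpha, tau)) / (
        theta(z + beta, tau) * theta(z - beta, tau))

# Illustrative sample values (assumptions, chosen away from zeros of theta).
tau = 0.3 + 1.1j
z = 0.17 + 0.05j
alpha, beta = 0.21 + 0.10j, 0.40 - 0.07j

# theta is periodic under z -> z + 1 but only automorphic under z -> z + tau.
print(abs(theta(z + 1, tau) - theta(z, tau)))
print(abs(theta(z + tau, tau) - automorphy_factor(z, tau) * theta(z, tau)))
# In the ratio phi the automorphy factors cancel, so phi is doubly periodic.
print(abs(phi(z + tau, tau, alpha, beta) - phi(z, tau, alpha, beta)))
```

All three printed differences should be at the level of floating-point rounding. The cancellation in the last line is the g = 1 version of the statement that the prefactors of numerator and denominator in (C.7) coincide.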

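The function φ(z) is a periodic meromorphic function on $\mathbb{C}^g$ with 2g
period vectors $\{e_1,\ldots,e_g;\tau_1,\ldots,\tau_g\}$ (see Farkas and Kra).
Remarkably, the numerator and denominator of (C.7) also provide a map
into a projective space $P^N$ where $N = 2^g - 1$. The basis vectors
for the image of this map are provided by the shifted Riemann
Θ-functions [54,55] where the integers μ and μ′ are defined modulo 2, i.e.
$\mu, \mu' \in \mathbb{Z}^g/2\mathbb{Z}^g$. With μ = ε/2, the basis vectors within $P^N$
are called first-order Θ-functions with integer characteristics [ε, ε′]
[54–57]; in the convention of Farkas and Kra,

$$ \theta\begin{bmatrix}\varepsilon\\ \varepsilon'\end{bmatrix}(z|T) = \exp\Big\{ \pi i \Big[ \tfrac{1}{4}\,\varepsilon\cdot T\varepsilon + \varepsilon\cdot z + \tfrac{1}{2}\,\varepsilon\cdot\varepsilon' \Big] \Big\}\, \theta\Big(z + \tfrac{1}{2}\varepsilon' + \tfrac{1}{2}T\varepsilon\Big). \tag{C.8} $$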

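As the integer characteristics [ε, ε′] run over the coset labels (0,1)
for $\mathbb{Z}^g/2\mathbb{Z}^g$, the “thetanullwerte”, i.e. the values of
$\theta\begin{bmatrix}\varepsilon\\ \varepsilon'\end{bmatrix}(z|T)$ at
z = 0, define the generators for a $2^{2g}$-dimensional representation
of the Heisenberg group! In this representation, the cosets $\mathbb{Z}^g/2\mathbb{Z}^g$
play the role of t in (6.2). The action of the translation part of the
Weyl–Heisenberg group on the Θ-functions is

$$ \theta\begin{bmatrix}\varepsilon\\ \varepsilon'\end{bmatrix}(z + T_k|T) = \exp\Big\{ 2\pi i\Big[ -z_k - \frac{T_{kk}}{2} - \frac{\varepsilon'_k}{2} \Big] \Big\}\, \theta\begin{bmatrix}\varepsilon\\ \varepsilon'\end{bmatrix}(z|T). \tag{C.9} $$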
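The translation rule (C.9) can also be checked numerically for genus g = 1. The sketch below assumes the Farkas and Kra series normalization for a theta function with integer characteristic, in which the summation index is shifted by ε/2 and the argument by ε′/2; that normalization, the function names, and the sample values are assumptions made for illustration.

```python
import cmath

def theta_char(eps, eps_p, z, tau, cutoff=20):
    """Genus-1 theta function with integer characteristic [eps, eps_p],
    written as a series in the shifted index m = n + eps/2 (assumed
    Farkas-Kra normalization)."""
    total = 0j
    for n in range(-cutoff, cutoff + 1):
        m = n + eps / 2
        total += cmath.exp(2j * cmath.pi * (0.5 * m * m * tau + m * (z + eps_p / 2)))
    return total

def translation_factor(eps_p, z, tau):
    """Prefactor in (C.9) for g = 1: exp 2*pi*i [-z - tau/2 - eps'/2]."""
    return cmath.exp(2j * cmath.pi * (-z - tau / 2 - eps_p / 2))

# Illustrative sample values (assumptions).
tau = 0.3 + 1.1j
z = 0.17 + 0.05j

for eps in (0, 1):
    for eps_p in (0, 1):
        lhs = theta_char(eps, eps_p, z + tau, tau)
        rhs = translation_factor(eps_p, z, tau) * theta_char(eps, eps_p, z, tau)
        print((eps, eps_p), abs(lhs - rhs))

# The odd characteristic [1, 1] gives an odd function, so its value at z = 0 vanishes.
print(abs(theta_char(1, 1, 0.0, tau)))
```

For g = 1 there are four characteristics; three of the four values at z = 0 are nonzero and the odd one vanishes, which is why only the even characteristics contribute nontrivial thetanullwerte.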

The use of the explicit Weyl–Heisenberg generators allows one to
calculate analytically the reward function for possibly all optimal
control and RL problems of interest.

D. The Eightfold Way

As a precursor to our quantum approach to Bayesian inference, it


was discovered [72] during the flurry of activity following the paper of
Gardner et al. [73] on the inverse scattering solution of the KdV equa-
tion that the multi-soliton solutions of the KdV equation have a very
pretty purely quantum interpretation in terms of the energy eigen-
states for an array of quantum oscillators. This remarkable develop-
ment is based on a method introduced by Murray Gell-Mann and
Yuval Ne’eman [118] for constructing representations of SU(3) using
3 sets of fermion creation and annihilation operators. This method is
based on the fact that representations for SU(N) can always be
classified using N − 1 SU(2) representations. In the case of SU(3), the two
privileged SU(2) groups are traditionally referred to as “I-spin” and
“U-spin”. We are indebted to Gell-Mann for introducing the name
“Eightfold Way” for the 8 generators of SU(3), which in addition to 3
“I-spin” and 3 “U-spin” operators employ 3 “V-spin” operators. In
addition, the Buddha (channeled through James Joyce) conveyed
to him the names up-quark, down-quark, and strange quark for the
3 kinds of fermionic operators needed for SU(3). Each SU(3)
representation consists of a certain number of I-spin multiplets with
well-defined numbers of each kind of quark. The total number of quarks
minus anti-quarks attached to an SU(3) representation is known as
the baryon number. A natural way to truncate the Hilbert space of
a 3D oscillator is to fix the baryon number.
Ironically, as a scheme for modeling nuclear particles as a loose
assembly of quarks, the “Eightfold Way” turned out to not be of
any fundamental importance for elementary particle physics. On the
other hand, the use of the Gell-Mann–Ne’eman fermionic operators
to describe the representations of SU(3) turns out to be of great
importance for our program of translating Bayesian learning into
quantum mechanics. The N fermion creation and annihilation
operators that are used to construct representations for SU(N) satisfy
anti-commutation relations:

$$ \{\psi_m,\psi_n\} = \{\psi_m^*,\psi_n^*\} = 0, \qquad \{\psi_m,\psi_n^*\} = \delta_{nm}. \tag{D.1} $$

Expressions of the form $\sum_{nm} c_{nm}\psi_m\psi_n^*$ then form a Lie algebra that
acts in vector spaces $V = \sum c_n\psi_n$ and $V^* = \sum c_n\psi_n^*$ as

$$ [\psi_m\psi_n^*, \psi_p] = \delta_{np}\psi_m, \qquad [\psi_m\psi_n^*, \psi_p^*] = -\delta_{mp}\psi_n^*. \tag{D.2} $$

Exponentiation of this Clifford Lie algebra leads to a continuous group
$G(V, V^*)$; for m ≠ n,

$$ \exp(t\psi_m\psi_n^*) = 1 + t\psi_m\psi_n^*. \tag{D.3} $$
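The algebra in (D.1) to (D.3) can be made concrete with small matrices. The sketch below is an illustration only: it assumes a Jordan–Wigner construction for two fermionic modes (4 × 4 real matrices), which is one convenient realization and not necessarily the one the text has in mind, and every helper name is hypothetical. It reads the brackets in (D.2) as commutators, and checks that the exponential series terminates for m ≠ n because the square of ψ_m ψ_n* vanishes.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(A, B, sign=1):
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(c, A):
    return [[c * a for a in row] for row in A]

def kron(A, B):
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def transpose(A):
    return [list(col) for col in zip(*A)]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def zeros(n):
    return [[0] * n for _ in range(n)]

def anticomm(A, B):
    return add(matmul(A, B), matmul(B, A))

def comm(A, B):
    return add(matmul(A, B), matmul(B, A), sign=-1)

def expm(A, terms=25):
    """Matrix exponential from a truncated Taylor series."""
    result, term = identity(len(A)), identity(len(A))
    for k in range(1, terms):
        term = scale(1.0 / k, matmul(term, A))
        result = add(result, term)
    return result

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Jordan-Wigner matrices for two modes (an assumed realization).
lower = [[0, 1], [0, 0]]
Z = [[1, 0], [0, -1]]
psi = [kron(lower, identity(2)), kron(Z, lower)]
psi_star = [transpose(p) for p in psi]

def check_d1():
    """Anticommutation relations (D.1) for all pairs of modes."""
    ok = True
    for m in range(2):
        for n in range(2):
            delta = identity(4) if m == n else zeros(4)
            ok &= anticomm(psi[m], psi[n]) == zeros(4)
            ok &= anticomm(psi_star[m], psi_star[n]) == zeros(4)
            ok &= anticomm(psi[m], psi_star[n]) == delta
    return ok

def check_d2():
    """Commutator action (D.2) of the bilinears on psi_p and psi_p^*."""
    ok = True
    for m in range(2):
        for n in range(2):
            bilinear = matmul(psi[m], psi_star[n])
            for p in range(2):
                ok &= comm(bilinear, psi[p]) == scale(int(n == p), psi[m])
                ok &= comm(bilinear, psi_star[p]) == scale(-int(m == p), psi_star[n])
    return ok

def check_d3(t=3.0):
    """exp(t psi_0 psi_1^*) = 1 + t psi_0 psi_1^*, as in (D.3) with m != n."""
    X = scale(t, matmul(psi[0], psi_star[1]))
    return max_abs_diff(expm(X), add(identity(4), X)) < 1e-12

def check_diagonal():
    """For m = n the bilinear is a projector P, so exp(P) = 1 + (e - 1) P."""
    P = matmul(psi[0], psi_star[0])
    return max_abs_diff(expm(P), add(identity(4), scale(math.e - 1, P))) < 1e-12

print(check_d1(), check_d2(), check_d3(), check_diagonal())
```

The last check shows why the restriction m ≠ n matters for the truncated form of the exponential: a diagonal bilinear squares to itself rather than to zero.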

If we now fix a set of parameters $\{t_i\}$ (the “times” for the multiple
action-angle flows associated with an integrable system) for a
continuous Lie group, then one can define [154] a fermionic analog of
the Lie–Poisson Hamiltonian:

$$ H(t) \equiv \sum_{l=1}^{\infty}\sum_{n=-\infty}^{\infty} t_l\,\psi_n\psi_{n+l}^*. \tag{D.4} $$

If $g \in G(V, V^*)$, then

$$ g(t) = e^{H(t)}\, g(0)\, e^{-H(t)} \tag{D.5} $$

and the KdV τ-function introduced in Eq. (5.16) is [154]

$$ \tau(x,g) = \langle g(x)\rangle = \langle e^{H(x)} g(0)\rangle, \tag{D.6} $$

where the brackets refer to the ground state expectation value for
an array of quantum oscillators. This τ-function is the glue that ties
together the long-term behavior of an integrable system with local
behavior represented by the Hamiltonian H. Of course, in practice
the sums over l and n in the expression for H(t) would have to be
truncated, not to mention the difficulties of representing
fermions in a practical computational setting. Nevertheless, we have
a setup which in principle would allow the τ and Baker functions for
the KdV equation to be exactly evaluated using an array of quantum
oscillators. Whether this is of any practical value remains to be seen.
However, the “I-spin”, “U-spin”, and “V-spin” lines in the SU(3)
root diagram used in the Eightfold Way make a ubiquitous
appearance in the RH approach to solving nonlinear PDEs. The reason for
this is that these axes play an important role [69] in defining the
boundary separating the domains of the holomorphic functions which
are used in the Riemann–Hilbert approach to finding analytic solutions
for integrable nonlinear PDEs.

E. Quantum Theory of Brownian Motion

E.1. Quantum dynamics à la Feynman–Vernon–Keldysh
The theory of classical dynamics with noise was founded long before
quantum mechanics by Fokker and Planck. In the 1950s, Wiener and
Kac developed a classical path integral formulation of the
Fokker–Planck theory of noise. In the 1960s, Feynman and Vernon [47]
developed a quantum version. For an environment consisting of a classical
noise source (e.g. a GP), the Feynman–Keldysh propagator J for
the quantum mechanical density matrix $\sum_n \psi_n(x)\psi_n^*(x')e^{-E_n/kT}$ is

$$ J = \iint e^{i\{S[x(t)] - S[x'(t)]\}/\hbar}\, \exp\Big\{ -\int_0^T\!\!\int_0^t [x(t') - x'(t')]\,A(t,t')\,[x(t) - x'(t)]\,dt'\,dt \Big\}\, \mathcal{D}x(t)\,\mathcal{D}x'(t), \tag{E.1} $$

where S[x(t)] is the classical action for the system and A(t, t′) is the
autocorrelation function for the noise. If instead of a classical noise
signal the quantum system is coupled to a quantum environment,
then the real exponential in the formula for J is replaced by the
complex-valued influence functional F[x(t), x′(t)] and A(t, t′) is
replaced by a complex function α(t, t′). The exponential factor in
the density matrix propagator J can also be thought of as an overlap
integral for final and initial states for the forward and backward 2nd
oscillator array as a functional of x(t) and x′(t):

$$ F[x(t), x'(t)] = \int \psi_Y(y_b)\,\psi_{Y'}^*(y_b)\,dY_b. $$


Another simple problem where the path integral focuses on the
classical path is the harmonic oscillator. In this case the Feynman
path integral [47] becomes

$$ J(x_b,t_b;x_a,t_a) = \Big( \frac{m\omega}{2\pi i\hbar\sin\omega t} \Big)^{1/2} \exp\Big\{ \frac{im\omega}{2\hbar\sin\omega t}\big[ (x_a^2 + x_b^2)\cos\omega t - 2x_a x_b \big] \Big\}, \tag{E.2} $$

where $t = t_b - t_a$. As is the case for a free particle, the phase of the
exponential is just the classical action measured in units of ħ. This
means that as a first approximation the quantum motion of an array of
coupled oscillators can be calculated classically. This has the consequence
that the quantum dynamics of an array of linearly coupled quantum
oscillators can be efficiently simulated using conventional computers.
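The statement that the phase in (E.2) is the classical action can be checked directly. The following is a minimal sketch, assuming a single oscillator of mass m and frequency omega with endpoints xa at time 0 and xb at time T; it integrates the Lagrangian along the classical path with Simpson's rule and compares the result with the closed form that multiplies i/ħ in the exponent of (E.2). The function names and sample numbers are illustrative assumptions.

```python
import math

def classical_path(t, xa, xb, omega, T):
    """Position and velocity on the classical path from (xa, 0) to (xb, T)."""
    s = math.sin(omega * T)
    x = (xa * math.sin(omega * (T - t)) + xb * math.sin(omega * t)) / s
    v = omega * (-xa * math.cos(omega * (T - t)) + xb * math.cos(omega * t)) / s
    return x, v

def action_numeric(xa, xb, omega, T, m=1.0, steps=2000):
    """Integrate L = m v^2/2 - m omega^2 x^2/2 along the classical path
    using Simpson's rule (steps must be even)."""
    def lagrangian(t):
        x, v = classical_path(t, xa, xb, omega, T)
        return 0.5 * m * v * v - 0.5 * m * omega * omega * x * x
    h = T / steps
    total = lagrangian(0.0) + lagrangian(T)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * lagrangian(k * h)
    return total * h / 3

def action_closed_form(xa, xb, omega, T, m=1.0):
    """hbar times the phase of the exponential in (E.2)."""
    return m * omega / (2 * math.sin(omega * T)) * (
        (xa * xa + xb * xb) * math.cos(omega * T) - 2 * xa * xb)

# Illustrative sample values (assumptions); omega * T must avoid multiples of pi.
xa, xb, omega, T = 0.4, -1.3, 2.0, 0.9
print(action_numeric(xa, xb, omega, T), action_closed_form(xa, xb, omega, T))
```

The two printed numbers should agree to high accuracy, which is the sense in which the propagator for a quadratic Lagrangian is fixed by the classical path alone, up to the prefactor.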
The Feynman–Keldysh prescription for the quantum dynamics of
a system subject to influences by an “environment” is to replace the
original Feynman path integral with a double path integral:

$$ \int e^{iS[q(t)]}\,\mathcal{D}q(t) \;\to\; \iint e^{i\{S[q(t)] - S[q'(t)]\}}\, F[q(t), q'(t)]\,\mathcal{D}q(t)\,\mathcal{D}q'(t), \tag{E.3} $$
where F(q, q′) represents the effect of the second quantum system;
viz. the measuring apparatus. The exact form for F(q, q′) depends
on the details of the second quantum system. However, in general
one can write

$$ F[x(t), x'(t)] = \exp\Big\{ -\int_0^T\!\!\int_0^t [x(t')\alpha(t,t') - x'(t')\alpha^*(t,t')]\,[x(t) - x'(t)]\,dt'\,dt \Big\}, $$

where α(t, t′) is a complex function that plays much the same role
as the real autocorrelation function for the signal. In the case where
the environment consists of a pure classical noise, the usefulness of the
Feynman–Keldysh double path integral derives from the fact that
it allows one to precisely describe the time evolution of the
density matrix of a quantum system interacting with an environment.
The Green’s function describing the time evolution of the density
matrix of a single harmonic oscillator is

$$ J(x, y, x_0, y_0, t) = \exp\Big[ \frac{im}{2\hbar}\int_0^t \big( \dot{x}^2 - \omega_0^2 x^2 - \dot{y}^2 + \omega_0^2 y^2 \big)\,dt' \Big], \tag{E.4} $$
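where $\omega_0$ is the oscillator angular frequency. When this oscillator is
coupled to an environment, Eq. (E.4) becomes

$$ J(t_1, t_0) = \iint \exp\Big\{ i\int_{t_0}^{t_1} \frac{M}{2\hbar}\big[ \dot{q}^2 - \omega_0^2 q^2 - \dot{\bar{q}}^2 + \omega_0^2 \bar{q}^2 \big]\,dt - \int_{t_0}^{t_1}\!\!\int_{t_0}^{t} [\alpha(t,t')q(t') - \alpha^*(t,t')\bar{q}(t')]\,[q(t) - \bar{q}(t)]\,dt'\,dt \Big\}\,\mathcal{D}q(t)\,\mathcal{D}\bar{q}(t), \tag{E.5} $$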
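where α(t, t′) is a complex function that will play much the same
role as the real autocorrelation function A(t, t′) for a real-valued
random time signal. When the environment consists of another harmonic
oscillator with level spacing Δ, the influence function is

$$ F(x, y) \cong \exp\Big\{ -\frac{g^2 m\omega_0^2}{2\Delta\hbar^2} \int_{t_0}^{t_0+T}\!\!\int_{t_0}^{t} \big[ e^{-i\Delta(t-t')}x(t') - e^{i\Delta(t-t')}y(t') \big]\,[x(t) - y(t)]\,dt'\,dt \Big\} $$

(which in general implies non-Markovian quantum dynamics for the
density matrix):

$$ \frac{\partial\rho}{\partial t} = \frac{i\hbar}{2L}\Big[ \frac{\partial^2\rho}{\partial Q^2} - \frac{\partial^2\rho}{\partial Q'^2} \Big] + \Big\{ \frac{iL\omega_0^2}{2\hbar}\big[ Q'^2 - Q^2 \big] - \frac{C^2\ln(\Delta_{\max}/\Delta_{\min})}{2\Delta_{\max}} \sum_{t_i=-\infty}^{t} [Q(t) - Q'(t)] \int_{t_i-\tau}^{t_i} (Q(s) - Q'(s))\,ds \Big\}\rho. \tag{E.6} $$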

A general form for F(q, q′) that applies in many situations, and
roughly speaking provides a quantum parallel for Gaussian noise, is

$$ F[q(t), q'(t)] = \exp\Big\{ -\int_0^T\!\!\int_0^t [\alpha(t,t')q(t') - \alpha^*(t,t')q'(t')]\,[q(t) - q'(t)]\,dt'\,dt \Big\}, \tag{E.7} $$

where α(t, t′) is a complex function that plays much the same role
in quantum mechanics as the autocorrelation function for Gaussian
processes. The exact relation of α(t, t′) to classical noise can be
understood by looking at matrix elements of the quadratic functional of
q(t) and q′(t) in the Hilbert space spanned by energy eigenstates of
an array of quantum oscillators. For example,

$$ \iint e^{i\{S[q(t)] - S[q'(t)]\}} \int_0^T\!\!\int_0^t \alpha(t,t')\,q(t')\,[q(t) - q'(t)]\,dt'\,dt\;\mathcal{D}q(t)\,\mathcal{D}q'(t) = -\int_0^T\!\!\int_0^t \alpha(t,t')\,\langle m|q(t)|n\rangle\langle m|q(t')|n\rangle\,dt'\,dt. \tag{E.8} $$

When the environment consists of quantum oscillators, α(t, t′) has
the form

$$ \alpha(t,t') = -\sum_i \frac{g_i^2}{2\omega_i}\, e^{-i\omega_i(t-t')}, \tag{E.9} $$

where the $\omega_i$ are the frequencies of the oscillators in the 2nd
oscillator array making up the “environment”. For a single harmonic
oscillator and an environment consisting of oscillators with frequencies
$\omega_i$, where $\Delta_i = \hbar\omega_i$ is the level spacing, by analogy with the classical
Wiener filter one might assume that α(t, t′) has a piece that
represents the signal and a piece that represents the noise. As a reminder,
the Wiener filter $H_F(t, s)$, described in Chapter 2, is obtained as
a ratio of Laplace transforms of the signal correlation K(s, t) and
R(s, t), the sum of the signal and noise correlation functions. If we
add a classical noise term, R(s, t) becomes

$$ R(t,t') = -\sum_i \frac{g_i^2}{2\omega_i}\, e^{-i\omega_i|t-t'|} + A(t,t'). \tag{E.10} $$

The kernel that appears in Wiener’s integral expression for a signal
accompanied by white noise is determined by the Wiener–Hopf
factorization of the bilateral Laplace (not Fourier!) transform R(s) of
R(t):

$$ R(s) = \int_{-\infty}^{\infty} [\delta(t) + K(t)]\exp(-st)\,dt, \tag{E.11} $$

where K(t, s) is either a causal or anti-causal kernel:

$$ \begin{aligned} K(t,s) &= \sum_i \alpha_i \exp(-\beta_i t)\exp(\beta_i s), & t \ge s,\\ K(t,s) &= \sum_i \alpha_i \exp(-\beta_i s)\exp(\beta_i t), & t \le s. \end{aligned} \tag{E.12} $$
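A single-term example makes the factorization concrete. The sketch below assumes one exponential term, so that the kernel depends only on the time difference, K = α exp(−β|t − s|), and takes real s with |s| < β; under those assumptions the transform (E.11) is 1 + 2αβ/(β² − s²), which splits into a factor with its pole and zero in the left half-plane times its mirror image. The function names, the integration cutoff, and the sample numbers are illustrative assumptions rather than anything specified in the text.

```python
import math

def simpson(f, a, b, steps=40000):
    """Composite Simpson rule on [a, b] (steps must be even)."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def kernel(t, alpha, beta):
    """Single-term stationary kernel alpha * exp(-beta |t|), cf. (E.12)."""
    return alpha * math.exp(-beta * abs(t))

def transform_numeric(s, alpha, beta, cutoff=40.0):
    """Bilateral Laplace transform (E.11) of delta(t) + K(t), real |s| < beta.
    The delta function contributes 1; the two half-lines are integrated
    separately because the kernel has a kink at t = 0."""
    f = lambda t: kernel(t, alpha, beta) * math.exp(-s * t)
    return 1.0 + simpson(f, -cutoff, 0.0) + simpson(f, 0.0, cutoff)

def transform_closed(s, alpha, beta):
    return 1.0 + 2 * alpha * beta / (beta**2 - s**2)

def wiener_hopf_factors(s, alpha, beta):
    """Factors with R(s) = R_plus(s) * R_minus(s); R_plus has its pole and
    zero at s = -beta and s = -gamma, R_minus is its mirror image."""
    gamma = math.sqrt(beta**2 + 2 * alpha * beta)
    return (gamma + s) / (beta + s), (gamma - s) / (beta - s)

# Illustrative sample values (assumptions).
alpha, beta, s = 0.7, 2.0, 0.5
r_plus, r_minus = wiener_hopf_factors(s, alpha, beta)
print(transform_numeric(s, alpha, beta), transform_closed(s, alpha, beta), r_plus * r_minus)
```

The three printed numbers should coincide. With several exponential terms the transform is still a rational function of s², and the same splitting into left and right half-plane factors applies, although the zeros then have to be found numerically.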

The Wiener–Hopf Eq. (A.6) allows one to extract features of the
environment using the filter

$$ \hat{Z}(t) = \int H_F(t,t')\,Y(t')\,dt', \tag{E.13} $$

where $H_F(s)$ is a ratio of Laplace transforms of the K(s, t) and the
classical R(s, t):

$$ H_F(s) = \frac{R_+(s)}{1 + R_+(s)}. \tag{E.14} $$

E.2. Stochastic influence functions


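In general, the noise in a real quantum system is not Gaussian.
For example, in superconducting quantum systems 1/f noise can
be important. It is believed that this might be due to the coupling
of the system to two-level systems (TLSs). In the case of a quantum
oscillator, the Hamiltonian and the resulting equations of motion
would be

$$ H = \hbar\omega_r\Big( a^\dagger a + \tfrac{1}{2} \Big) + \sum_j \Big[ \tfrac{1}{2}\Delta_j\sigma_j^z + ig\,\sigma_j^y\,(a^\dagger - a) \Big], $$

$$ \dot{a} = i\omega_0 a - i\sum_j \frac{g_j}{\hbar}\Big( \sigma_j^+ e^{-i\Delta(t-t_0)} - \sigma_j^- e^{i\Delta(t-t_0)} \Big) + F(t), $$

$$ F(t) = -\sum_j \frac{g^2}{\hbar^2}\int_{t_0}^{t} \sin(t-t')\,\sigma_j^z(t')\,\big[ a^\dagger(t') - a(t') \big]\,dt'. $$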

Because the “starting times” $t_0$ for the episodes of coherent
evolution for each TLS are randomly distributed, the influence function
will be a product of influence functions for each TLS:

$$ F(t) \approx \prod_j \exp\Big\{ -\frac{g_j^2}{\hbar^2}\int_{t_{j0}}^{t}\!\!\int_{t_{j0}}^{t} \big[ e^{-i\Delta_j(t-t')}x(t') - e^{i\Delta_j(t-t')}y(t') \big]\,[x(t) - y(t)]\,dt'\,dt \Big\}. $$

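Modeling a TLS coupled to a quantum oscillator as the n = 0, 1
levels of a harmonic oscillator yields a density matrix evolution
equation:

$$ \frac{\partial\rho}{\partial t} = \frac{i\hbar}{2L}\Big[ \frac{\partial^2\rho}{\partial Q^2} - \frac{\partial^2\rho}{\partial Q'^2} \Big] + \Big\{ \frac{iL\omega_0^2}{2\hbar}\big( Q'^2 - Q^2 \big) - \frac{g^2}{\hbar^2}\exp\Big[ -\frac{g^2}{\hbar^2}\Big( \int_{t_0}^{t} Q(t')\int_{t_0}^{t} Q(s)\,e^{i\Delta(t'-s)/\hbar}\,ds\,dt' + \int_{t_0}^{t} Q'(t')\int_{t_0}^{t} Q'(s)\,e^{i\Delta(t'-s)/\hbar}\,ds\,dt' \Big) \Big] \int_{t_0}^{t} \big( Q(t)Q(s)\,e^{-i\Delta(t-s)/\hbar} + Q'(t)Q'(s)\,e^{i\Delta(t-s)/\hbar} \big)\,ds \Big\}\rho. $$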

The influence functional can now be regarded [155] as a random
function of time whose fluctuations can be measured by the
autocorrelation function for the oscillator amplitude.
References

[1] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of
Neural Computation (Addison-Wesley, Boston, 1991).
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning (MIT Press,
2018).
[3] K. J. Astrom and R. M. Murray, Feedback Systems (Princeton
University Press, 2009).
[4] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference (Morgan Kaufmann Publishers, San Francisco,
1998).
[5] B. Efron, Large-Scale Inference: Empirical Bayes Methods for Esti-
mation, Testing, and Prediction (Cambridge University Press, 2010).
[6] D. MacKay, Information Theory, Inference, and Learning Algorithms
(Cambridge University Press, New Delhi, 2005).
[7] C. M. Bishop, ‘Model-based machine learning’. Philosophical Trans-
actions of the Royal Society A371 (2012), 2012022.
[8] P. Dayan, G. E. Hinton, R. Neal, and R. Zemel, ‘The Helmholtz
machine’. Neural Computation 7 (1995), 889.
[9] G. E. Hinton, P. Dayan, B. Frey, and R. Neal, ‘The wake-sleep
algorithm for unsupervised neural networks’. Science 268 (1995), 1158.
[10] T. Yang and M. Shadlen, Nature 447 (2007), 1075.
[11] K. Pribram, Brain and Perception (Lawrence Erlbaum Associates,
1991).
[12] T. Kohonen, Self-organizing Maps (Springer, 1995).
[13] H. Ritter and K. Schulten, ‘On the stationary state of Kohonen’s
self-organizing mapping’. Biological Cybernetics 54 (1986), 99.
[14] M. A. Nielsen and I. I. Chuang, Quantum Computation and Quantum
Information (Cambridge University Press, 2000).

[15] L. D. Landau and E. M. Lifshitz, Quantum Mechanics (Pergamon
Press, 1977).
[16] M. Nagasawa, Schrödinger Equations and Diffusion Theory
(Birkhäuser, 1993).
[17] G. Chapline, ‘Quantum mechanics and pattern recognition’. Interna-
tional Journal of Quantum Information 2 (2004), 295.
[18] Y. Aharonov, S. Popescu, and J. Tollaksen, ‘A time-symmetric
formulation of quantum mechanics’. Physics Today (Nov. 2010).
[19] E. Todorov, ‘General duality between optimal control and estima-
tion’. Proceeding of 47th IEEE Conference on Decision and Control
(IEEE, Cancun, Mexico, 2008).
[20] R. E. Bellman, Dynamic Programming (Princeton University Press,
1952).
[21] B. D. Anderson and J. B. Moore, Optimal Control (Prentice Hall,
1990).
[22] J. S. Dugdale, Entropy and its Physical Meaning (Taylor & Francis,
1996).
[23] C. Kittel, Elementary Statistical Physics (John Wiley & Sons, 1958).
[24] D. Mumford, ‘Neuronal architectures for pattern-theoretic problems’.
In Large Scale Theories of the Cortex, C. Koch and J. Davies (eds.)
(MIT Press, 1994).
[25] G. Chaitin, Algorithmic Information Theory (Cambridge University
Press, 1987).
[26] J. Rissanen, Stochastic Complexity in Statistical Inquiry (World Sci-
entific, 1989).
[27] S. Kullback, Information Theory and Statistics (Wiley, 1959).
[28] B. Jedynak, P. Frazier, and R. Sznitman, ‘Twenty questions with
noise: Bayes optimal policies for entropy loss’. Journal of Applied
Probability 49 (2011), 114.
[29] A. P. Dempster, N. M. Laird, and D. B. Rubin, ‘Maximum likelihood
from incomplete data via the EM algorithm’. Journal of the Royal
Statistical Society B39 (1977), 1.
[30] E. Todorov, ‘Efficient computation of optimal actions’. PNAS 106
(2009), 11478.
[31] H. J. Kappen, ‘Path Integrals and symmetry breaking for optimal
control’. Journal of Statistical Mechanics P11011 (2005).
[32] L. M. Brown (ed.), Feynman’s Thesis: A New Approach to Quantum
Theory (World Scientific, 2005).
[33] W. Pauli, Pauli Lecture on Physics vol. 4 (MIT Press, 1978).
[34] J. H. Rose, ‘Single-sided focusing of the time dependent Schrodinger
equation’. Physical Review A 65 (2001), 127707.
[35] P. A. M. Dirac, The Principles of Quantum Mechanics (Oxford
University Press, 1958).
[36] G. Chapline, ‘Quantum mechanics as self-organized information
fusion’. Philosophical Magazine B81 (2001), 541.
[37] M. Nauenberg, ‘Quantum wave packets on Kepler elliptic orbits’.
Physical Review A 40 (1989), 1133.
[38] R. Tyson, Principles of Adaptive Optics (Academic Press, 1991).
[39] H. J. Briegel, D. E. Browne, W. Dür, R. Raussendorf, and M. Van
den Nest, ‘Measurement based quantum computation’. Nature Physics 5
(2009), 19.
[40] M. Planck, The Theory of Heat Radiation (Dover Publications, 1991).
[41] D. ter Haar, The Old Quantum Theory (Pergamon Press, 1967).
[42] B. L. Van Der Waerden, Sources of Quantum Mechanics (North-
Holland Publishing, 1967).
[43] W. Heisenberg, The Physical Principles of the Quantum Theory
(University of Chicago Press, 1930).
[44] H. Weyl, Theory of Groups and Quantum Mechanics (Martino
Publishing, 2014).
[45] N. D. Mermin, ‘Limits to quantum mechanics as a source of magic
tricks’. Physical Review Letters 74 (1995), 831.
[46] M. Born, ‘The statistical interpretation of quantum mechanics’. In
Nobel Lectures: Physics 1942–1962 (Elsevier Publishing, 1964).
[47] R. P. Feynman and A. Hibbs, Quantum Mechanics and Path Integrals
(McGraw-Hill, 1965).
[48] T. A. Brun, ‘A simple model of quantum trajectories’.
arXiv:quant-ph/0108312v1 (2001).
[49] J. Schwinger, ‘Quantum theory of Brownian Motion’. Journal of
Mathematical Physics 2 (1961), 407.
[50] A. O. Caldeira and A. Leggett, ‘Influence of damping on quantum
coherence’. Physical Review 46 (1985), 211.
[51] L. V. Keldysh, ‘Diagram technique for nonequilibrium processes’.
Journal of Experimental and Theoretical Physics — Soviet Physics
20 (1965), 1018.
[52] G. Chapline, ‘Machine learning and quantum mechanics’. In Advances
in the Computational Sciences, E. Schwegler, B. Rubenstein, and
S. Libby (eds.) (World Scientific, 2007).
[53] L. S. Shapley, ‘Stochastic games’. Mathematics 39 (1953), 1095.
[54] P. Griffiths and J. Harris, Principles of Algebraic Geometry (John
Wiley & Sons, 1978).
[55] D. Mumford, Curves and their Jacobians (University of Michigan
Press, 1975).
[56] D. Mumford, Tata Lectures on Theta (Birkhäuser, 1983).
[57] H. M. Farkas and I. Kra, Riemann Surfaces (Springer-Verlag, 1992).
[58] F. J. Dyson, ‘Photon noise and atmospheric noise in active optical
systems’. Journal of the Optical Society of America A 65 (1975), 551.
[59] T. Tao, ‘Tribute to Israel Gelfand’ (wordpress.com/2009/10/07/Israel).
[60] V. A. Marčenko, ‘The construction of the potential energy from the
phases of the scattered waves’. Mathematical Reviews 17 (1955).
[61] B. Noble, Methods Based on the Wiener-Hopf Technique (Chelsea
Publishing Company, 1958).
[62] J. Pearl, ‘Theoretical impediments to Machine Learning with sparks
from the causal revolution’. arXiv:1801.04016 (2018).
[63] B. Schoellkopf, ‘Causality for Machine Learning’. arXiv:1911.10500
(2019).
[64] R. P. Feynman, Quantum Electrodynamics (Benjamin, 1961).
[65] S. Mandelstam, ‘Introduction to string models and vertex functions’.
In Vertex Operators in Mathematics and Physics, J. Lepowsky, S.
Mandelstam, and I. M. Singer (eds.) (Springer-Verlag, 1985).
[66] G. Chapline, ‘The bootstrap principle and equal-time commutators’.
Il Nuovo Cimento 58 (1968), 1.
[67] N. I. Muskhelishvili, Singular Integral Equations (Dover, 1992).
[68] A. Its, ‘The Riemann-Hilbert problem and integrable systems’.
American Mathematical Society Notices 50 (2003), 1389.
[69] T. Trogdon and S. Olver, Riemann-Hilbert Problems, Their
Numerical Solution, and the Computation of Nonlinear Special Functions
(SIAM, 2016).
[70] A. K. Ghatak, R. L. Gallawa, and I. C. Goyal, Modified Airy Func-
tions and WKB Solutions to the Wave Equation (US Government
Printing Office, 1991).
[71] A. B. Migdal and V. P. Krainov, Approximation Methods in Quantum
Mechanics (W. A. Benjamin, 1969).
[72] L. A. Dickey, Soliton Equations and Hamiltonian Systems (World
Scientific, 1991).
[73] N. J. Hitchin, G. B. Segal, and R. S. Ward, Integrable Systems
(Clarendon Press, 1999).
[74] P. Lax, ‘Integrals of nonlinear equations of evolution and solitary
waves’. Communications on Pure and Applied Mathematics 21
(1968), 467.
[75] A. C. Newell, Solitons in Mathematics and Physics (Society for Indus-
trial and Applied Mathematics, 1985).
[76] R. Hirota, ‘Exact solution for the Korteweg-de Vries equation for
multiple solitons’. Physical Review Letters 27 (1971), 1192.
[77] J. L. Burchnall and T. W. Chaundy, ‘Commutative ordinary
differential operators’. Proceedings of the Royal Society of London Series A
118 (1928), 557; H. F. Baker, ‘Note on the paper by Burchnall and
Chaundy’. Ibid. 584.
[78] M. Schneider, ‘Bayesian linking of geosynchronous orbital debris
tracks as seen by the LSST.’ Advances in Space Research 49 (2012),
655.
[79] S. Godsill, ‘The relationship between Markov Chain Monte Carlo
methods for model uncertainty’. Computational and Graphical
Statistics 10 (2001), 1.
[80] A. T. White, Graphs, Groups, and Surfaces (North-Holland, 1984).
[81] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary
Time Series, With Engineering Applications (Technology Press and
Wiley, 1949).
[82] S. M. Stigler, The History of Statistics (Harvard University Press,
1986).
[83] T. Kailath, ‘A view of three decades of linear filtering theory’, IEEE
Transactions on Information Theory IT-20 (1974), 146.
[84] D. Simon, Optimal State Estimation (Wiley, 2006).
[85] D. C. Lindberg, Science in the Middle Ages (University of Chicago
Press, 1978).
[86] H. J. Sussmann and J. C. Willems, ‘300 years of optimal control
from the Brachistochrone to the maximum principle’. IEEE Control
Systems 27 (1997), 32.
[87] H. Schattler and U. Ledzewicz, Geometric Control Theory (Springer,
2012).
[88] L. Landau and E. M. Lifshitz, Mechanics (Pergamon Press, 1977).
[89] Z. Ma and C. W. Rowley, ‘Lie-Poisson integrators: A Hamiltonian,
variational approach’. International Journal for Numerical Methods
in Engineering (Wiley, 2009).
[90] J. von Neumann and O. Morgenstern, Theory of Games and Eco-
nomic Behavior (Princeton University Press, 1944).
[91] M. Gutzwiller, Chaos in Classical and Quantum Mechanics
(Springer, 1990).
[92] I. Stewart, Galois Theory (Chapman and Hall, 1973).
[93] M. Kuga, Galois’ Dream (Birkhäuser, 1993).
[94] D. Babbitt, ‘Certain Hilbert spaces of analytic functions associated
with the Heisenberg group’. In Studies in Mathematical Physics,
E. Lieb, B. Simon, and A. S. Wightman (eds.) (Princeton Univer-
sity Press, 1971).
[95] V. Bargmann, ‘On a Hilbert space of analytic functions and an
associated integral transform’. Communications on Pure and Applied
Mathematics 14 (1961), 199.
[96] S. Thangavelu, Harmonic Analysis on the Heisenberg Group
(Birkhäuser, 1998).
[97] N. Wiener, Nonlinear Problems in Random Theory (MIT Press,
1958).
[98] M. Schuld and N. Killoran, ‘Quantum Machine Learning models are
Kernel methods’. Physical Review Letters 122(4) (2019), 040504.
[99] M. Schuld and F. Petruccione, Machine Learning with Quantum
Computers (Springer, 2021).
[100] J. Polchinski, String Theory (Cambridge University Press, 1998).
[101] J. von Neumann, Mathematical Foundations of Quantum Mechanics
(Princeton University Press, 1952).
[102] C. W. Helstrom, Quantum Detection and Estimation Theory
(Academic Press, 1976).
[103] C. N. Yang, ‘Concept of off-diagonal long-range order’. Reviews of
Modern Physics 34 (1962), 694.
[104] G. Chapline, ‘Theory of the superfluid transition in liquid helium’.
Physical Review A3 (1971), 1671.
[105] L. C. Thomas, Games, Theory and Applications (Dover, 2011).
[106] D. Silver et al., ‘General reinforcement algorithm that masters chess,
shogi, and Go’. Science 362 (2018), 1140.
[107] T. Kohonen, ‘Physiological interpretation of the self-organizing map
algorithm’. Neural Networks 6 (1993), 895.
[108] G. Chapline, ‘Spontaneous origin of topological complexity in self-
organizing neural networks’, Network: Computational Neural Systems
8 (1987), 185.
[109] R. Linsker, ‘Self-organization in a perceptual network’. Computer 21
(1988), 105.
[110] G. Chapline, ‘Minimum energy information fusion in sensor net-
works’. In Proceeding of the 2nd International Conference on Infor-
mation Fusion (Society for Information Fusion, 1999).
[111] J. Schwinger, ‘The Majorana formula’. In Festschrift for I. I. Rabi
(Transactions of the New York Academy of Sciences), 38 (1977).
[112] L. C. Biedenharn and J. D. Louck, The Racah-Wigner Algebra in
Quantum Theory in Encyclopedia of Mathematics and its Applica-
tions, ed. G. Rota (Addison-Wesley, 1981).
[113] S. L. Braunstein and H. J. Kimble, ‘Teleportation of continuous quan-
tum variables’. Physical Review Letters 80 (1998), 869.
[114] E. P. Wigner, ‘The unreasonable effectiveness of mathematics in the
natural sciences’. Communications Pure and Applied Mathematics 13
(Feb. 1960), 1.
[115] G. Chapline, ‘Is theoretical physics the same thing as mathematics?’
Looking Forward: Frontiers in Theoretical Science, Physics Reports
315 (1999), 95.
[116] P. M. Morse and H. Feshbach, Methods of Theoretical Physics
(McGraw Hill, 1953).
[117] R. Newton, Inverse Schrodinger Scattering in Three Dimensions
(Springer-Verlag, New York, 1989).
[118] H. J. Lipkin, Lie Groups for Pedestrians (North-Holland, 1965).
[119] V. Barbu and S. S. Sritharan, ‘H∞ control theory of fluid dynamics’.
Proceedings of the Royal Society London A 454 (1998), 3009.
[120] M. Thieullen and A. Vigot, ‘Stochastic representation of the tau func-
tion with an application to the Korteweg-De Vries equation’. Com-
munications On Stochastic Analysis 12 (2018), 1.
[121] R. Durbin and D. Willshaw, ‘An analog approach to the traveling
salesman problem’. Nature 689 (1987), 326.
[122] H. Kleinert, Path Integrals (World Scientific, 1995).
[123] B. O. Koopman, Search and Screening: General Principles and His-
torical Applications (Pergamon Press, 1980).
[124] D. C. Woods and S. M. Lewis, ‘Design of experiments for screening’.
arXiv:1510.05248 [stat.ME] (2015).
[125] E. Finlay-Freundlich, Celestial Mechanics (Pergamon Press, 1958).
[126] V. Korepin, N. M. Bogoliubov, and A. Izergin, Quantum Inverse
Scattering Method and Correlation Functions (Cambridge University
Press, 1993).
[127] H. Bethe, Intermediate Quantum Mechanics (W. A. Benjamin, 1964).
[128] H. Volklein, Groups as Galois Groups (Cambridge University Press,
1996).
[129] G. A. Jones, ‘Bipartite graph embeddings, Riemann surfaces and
Galois groups’. Discrete Mathematics 338 (2015), 1801.
[130] D. F. Walls and G. J. Milburn, Quantum Optics (Springer-Verlag,
1995).
[131] B. Friedrich and D. Herschbach, ‘Stern and Gerlach: How a Bad Cigar
helped reorient atomic physics’. Physics Today (Dec. 2003) 53.
[132] R. Raussendorf and H. Briegel, ‘A one-way quantum computer’. Phys-
ical Review Letters 86 (2001), 5188.
[133] C. F. Barenghi and N. G. Parker, A Primer on Quantum Fluids,
arXiv:1605580v2 [cond-mat.quant-gas] (2016).
[134] D. Bohm, ‘A suggested interpretation of the quantum theory in terms
of “Hidden Variables”’. Physical Review 85 (1952), 180.
[135] B. R. Frieden, ‘Fisher information as the basis for the Schrodinger
equation’. American Journal of Physics 57 (1989), 1004.
[136] M. Reginatto, ‘Derivation of the equations of nonrelativistic
quantum mechanics using the principle of minimum Fisher information’.
Physical Review A58 (1998), 1775.
[137] I. M. Khalatnikov, Introduction to the Theory of Superfluidity
(W. A. Benjamin, 1965).
[138] G. Chapline, E. Hohlfeld, R. B. Laughlin, and D. Santiago, ‘Quantum
phase transitions and the breakdown of classical general relativity’.
Philosophical Magazine B81 (2001), 235.
[139] D. Christodoulou, ‘Reversible and irreversible transformations in
black-hole physics’. Physical Review Letters 25 (1970), 1596.
[140] F. D. M. Haldane, ‘Model for a quantum Hall effect without Landau
levels’. Physical Review Letters 61 (1988), 2015.
[141] B. A. Bernevig and T. Hughes, Topological Insulators and Topological
Superconductors (Princeton University Press, 2013).
[142] R. Jackiw and S-Y. Pi, ‘Soliton solutions to the gauged nonlinear
Schrödinger equation on a plane’. Physical Review Letters 64 (1990),
2969.
[143] G. Chapline and J. Dubois, ‘Topological quantum image analysis’.
Proceedings of SPIE 7342 (2009), 73420C.
[144] R. J. Baxter, Exactly Solved Models in Statistical Mechanics
(Academic Press, 1982).
[145] L. Kadanoff and A. C. Brown, ‘Correlation functions on the critical
lines of the Baxter and Ashkin-Teller Models’. Annals of Physics 121
(1979), 318.
[146] S-Y. Chu, ‘Statistical origin of classical mechanics and quantum
mechanics’. Physical Review Letters 71 (1993), 2847.
[147] N. Ikeda and S. Taniguchi, ‘Quadratic Wiener functionals,
Kalman-Bucy filters, and the KdV equation’. Advanced Studies in
Pure Mathematics 41 (2004), 167.
[148] K. Obermayer, H. Ritter, and K. Schulten, ‘Large-scale simulations of
self-organizing neural networks on parallel computers: Applications
to biological modeling’. Parallel Computing 14 (1990), 381.
[149] P. Suppes, Z. Lu, and B. Han, ‘Brain wave recognition of words’.
PNAS 94 (1997), 14965.
[150] G. Chapline, C-Y. Fu, and S. Nagarajan, ‘Inverse scattering approach
to improving human pattern recognition’. Proceedings of SPIE (2000).
[151] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for
Machine Learning (MIT Press, 2006).
[152] J. Ko and D. Fox, ‘GP-Bayes filters: Bayesian filtering using Gaussian
process prediction and observation models’. Autonomous Robots 27
(2009), 75.

[153] J. Lee et al., ‘Deep neural networks as Gaussian processes’
(International Conference on Learning Representations 2018,
arXiv:1711.00165v1 [stat.ML]).
[154] S. Frautschi, Regge Poles and S-Matrix Theory (W. A. Benjamin,
1963).
[155] G. Chapline and M. Otten, ‘Bayesian searches and quantum
oscillators’. Proceedings of 3rd Workshop on Microwave Cavities and
Detectors, April 2018 (Springer, 2020).
[156] J. Barrett and I. Naish-Guzman, ‘The Ponzano-Regge model’.
Classical and Quantum Gravity 26 (2009), 155014.
Index

A
adaptive optics, 8, 13, 23, 144, 152–153
Ashkin–Teller model, 114

B
Bargmann–Segal transform, 90
Bayes’s formula, 3–5, 11, 16–17, 25–27, 42, 46, 50–51, 53–54, 107, 114, 136, 137, 139
Bayes, Thomas, 15, 25
Bayesian learning, 11, 15, 19, 29, 41, 57, 71, 75, 80, 90, 92, 159
Bayesian searches, 17, 41–43, 45, 100–101
Bellman cost function, 7, 17, 44, 55
Bellman–Isaacs function, 12, 119, 121
Boltzmann distribution, 5

D
dynamic programming, 6, 11, 13, 16, 32–33, 43–46, 55
Dyson, Freeman, 13

E
eightfold way, 69

F
Θ-functions, 12, 18–19, 73–76, 79, 95
Feynman path integral, 8, 18, 36, 57
Feynman, Richard, 34
Feynman–Vernon influence function, 20

G
Galois theory, 18, 86–87
Gross–Pitaevski equation, 105, 107–108

H
H∞ control, 63–65, 120, 122
Hamilton–Jacobi–Bellman equation, 55, 105
Hardy spaces, 15, 18–20, 64, 120, 124, 127
Helmholtz machine, 5–6, 12, 17–18, 20–22, 49–53, 84, 114–115, 117–119, 122
Helstrom’s theorem, 20, 106
Hilbert spaces, 15, 19, 28, 72, 76, 89–90, 93–98, 127
Hilbert, David, 15, 18
Hinton, Geoffrey, 5–6


holomorphic functions, 2–3, 15, 18–19, 21, 29, 36–37, 64, 68, 76–77, 86, 90–94, 117, 120, 122–125, 135

I
innovation Gaussian process, 39
integrable systems, 15, 67, 80

K
Kalman filter, 15, 19, 30, 56, 63–65, 120
Kalman, Rudolf, 27, 29
KdV equation, 15, 18–21, 23, 64, 68–71, 73–75, 77–81, 83–84, 115, 117–118, 125
Kohonen self-organization, 21–22, 37, 115, 122–123, 125, 135–136
Kullback–Leibler divergence, 6, 107

L
Landau, Lev, 14
Lax equation, 73, 75, 80–81, 115, 118

M
mammalian brain, 1–4, 18, 21, 36–37, 52, 125, 136
mammalian cognition, 2–3, 17, 20, 22, 122–125
Marčenko equation, 13, 75
Markov Chain Monte Carlo regression, 38, 48
measurement-based quantum computation, 8, 19, 20, 99, 102
meromorphic function, 15, 18, 36, 72, 76, 81, 85–86
minimum description length, 4, 16, 51

N
NLS equation, 18, 20, 23, 68, 73, 80, 82, 84, 107, 124

O
Ockham’s razor, 41
Off-diagonal long-range order (ODLO), 107

P
Pontryagin control, 17–18, 20–21, 58, 64, 105, 107–108, 111, 114

Q
quantum self-organization, 24, 37, 105

R
reward function, 6–8, 11–12, 15, 58–59, 80–81, 84, 108, 118, 123, 125
Riemann surface, 12, 15, 18–19, 21, 24, 38–39, 71, 73–74, 76–79, 81, 85–87, 94–95, 111, 114–124, 120
Riemann–Hilbert problem, 68–69
Rose optimization, 8, 118

S
Schrödinger equation, 2, 11, 13–14, 23, 35, 67–68, 72, 80–81, 86, 106–107, 112
Schwinger representation, 127, 132
Segal, Graeme, 15
Segal–Wilson construction, 18, 76, 78, 92
Shannon information, 10, 33, 45
Stochastic games, 119

T
Todorov, Emanuel, 3, 7, 24, 33, 54
traveling salesman problem, 16, 20, 36–38, 109, 111
Trogdon–Olver, 19, 23

V
von Neumann, John, 65, 119–120, 125, 134–135

W
wake–sleep algorithm, 6, 12, 20–21, 51, 53, 114, 118–120
Weyl–Heisenberg group, 10, 12, 19, 24, 89–90, 95
Wiener filter, 16, 29, 79
Wiener, Norbert, 27
Wigner–Racah algebra, 98
