
Machine Learning in Finance: From Theory to Practice
Instructor's Manual

Matthew F. Dixon, Igor Halperin and Paul Bilokon

Matthew Dixon
Department of Applied Math, Illinois Institute of Technology, e-mail: [email protected]
Igor Halperin
NYU Tandon School of Engineering and Fidelity Investments, e-mail: [email protected], [email protected]
Paul Bilokon
Thalesians Ltd, London, e-mail: [email protected]

Introduction

Machine learning in finance sits at the intersection of a number of emergent and established disciplines, including pattern recognition, financial econometrics, statistical computing, probabilistic programming, and dynamic programming. With the trend towards increasing computational resources and larger datasets, machine learning has grown into a central computational engineering field, with an emphasis placed on plug-and-play algorithms made available through open-source machine learning toolkits. Algorithm-focused areas of finance, such as algorithmic trading, have been the primary adopters of this technology. But outside of engineering-based research groups and business activities, much of the field remains a mystery.
A key barrier to understanding machine learning for non-engineering students and practitioners is the absence of the well-established theories and concepts that financial time series analysis equips us with. These serve as the basis for the development of financial modeling intuition and scientific reasoning. Moreover, machine learning is heavily entrenched in engineering ontology, which makes developments in the field somewhat intellectually inaccessible for students, academics, and finance practitioners from the quantitative disciplines such as mathematics, statistics, physics, and economics. Consequently, there is a great deal of misconception and limited understanding of the capacity of this field. While machine learning techniques are often effective, they remain poorly understood and are often mathematically indefensible. How do we place key concepts in the field of machine learning in the context of more foundational theory in time series analysis, econometrics, and mathematical statistics? Under which simplifying conditions are advanced machine learning techniques such as deep neural networks mathematically equivalent to well-known statistical models such as linear regression? How should we reason about the perceived benefits of using advanced machine learning methods over more traditional econometrics methods, for different financial applications? What theory supports the application of machine learning to problems in financial modeling? How does reinforcement learning provide a model-free approach to the Black–Scholes–Merton model for derivative pricing? How does Q-learning generalize discrete-time stochastic control problems in finance?
Advantage of the Book

This book is written for advanced graduate students and academics in the mathematical sciences, in addition to quants and data scientists in the field of finance. Readers will find it useful as a bridge from these well-established foundational topics to applications of machine learning in finance. Machine learning is presented as a non-parametric extension of financial econometrics, with an emphasis on novel algorithmic representations of data, regularization, and model averaging to improve out-of-sample forecasting. The key distinguishing feature from classical financial econometrics is the absence of an assumption on the data generation process. This has important implications for modeling and performance assessment, which are emphasized with examples throughout the book. Some of the main contributions of the book are as follows:
• The textbook market is saturated with excellent books on machine learning. However, few present the topic from the perspective of financial econometrics and cast fundamental concepts in machine learning into canonical modeling and decision frameworks already well established in finance, such as financial time series analysis, investment science, and financial risk management. Only through the integration of these disciplines can we develop an intuition into how machine learning theory informs the practice of financial modeling.
• Machine learning is entrenched in engineering ontology, which makes developments in the field somewhat intellectually inaccessible for students, academics, and finance practitioners from quantitative disciplines such as mathematics, statistics, physics, and economics. Moreover, financial econometrics has not kept pace with this transformative field and there is a need to reconcile various modeling concepts between these disciplines. This textbook is built around powerful mathematical ideas that shall serve as the basis for a graduate course for students with prior training in probability and advanced statistics, linear algebra, time series analysis, and Python programming.
• This book provides a financial market motivated and compact theoretical treatment of financial modeling with machine learning for the benefit of regulators, wealth managers, federal research agencies, and professionals in other heavily regulated business functions in finance who seek a more theoretical exposition to allay concerns about the "black-box" nature of machine learning.
• Reinforcement learning is presented as a model-free framework for stochastic control problems in finance, covering portfolio optimization, derivative pricing, and wealth management applications without assuming a data generation process. We also provide a model-free approach to problems in market microstructure, such as optimal execution, with Q-learning. Furthermore, our book is the first to present methods of Inverse Reinforcement Learning.
• Multiple-choice questions, numerical examples, and approximately 80 end-of-chapter exercises are used throughout the book to reinforce the main technical concepts.
This book provides
• Python codes demonstrating the application of machine learning to algorithmic trading and financial modeling in risk management and equity research. These codes make use of powerful open-source software toolkits such as Google's TensorFlow and Pandas, a data processing environment for Python. The codes have been provided so that they can either be presented as laboratory session material or used as programming assignments.

Recommended Course Syllabus

This book has been written as an introductory textbook for a graduate course in machine learning in finance for students with strong mathematical preparation in
probability, statistics, and time series analysis. The book therefore assumes, and does not provide, concepts in elementary probability and statistics. In particular, undergraduate preparation in probability theory should include discrete and continuous random variables, conditional probabilities and expectations, and Markov chains. Statistics preparation includes experiment design, statistical inference, regression and logistic regression models, and analysis of time series, with examples in ARMA models. Preparation in financial econometrics and Bayesian statistics, in addition to some experience in the capital markets or in investment management, is advantageous but not necessary.
Our experience in teaching upper-level undergraduate and graduate programs in machine learning in finance and related courses in the departments of applied math and financial engineering has been that students with little programming skill, despite having strong math backgrounds, have difficulty with the programming assignments. It is therefore our recommendation that a course in Python programming be a prerequisite or that a Python bootcamp be run in conjunction with the beginning of the course. The course should equip students with a solid foundation in data structures, elementary algorithms, and control flow in Python. Some supplementary material to support programming has been provided in the Appendices of the book, with references to further supporting material.
Students with a background in computer science often have a distinct advantage in the programming assignments, but often need to be referred to other textbooks on probability and time series analysis first. Exercises at the end of Chapter 1 will be especially helpful in adapting to the mindset of a quant, with the focus on economic games and simple numerical puzzles. In general, we encourage liberal use of these applied probability problems as they aid understanding of the key mathematical ideas and build intuition for how they translate into practice.

Overview of the Textbook

Chapter 1

Chapter 1 provides the industry context for machine learning in finance, discussing the critical events that have shaped the finance industry's need for machine learning and the unique barriers to adoption. The finance industry has adopted machine learning to varying degrees of sophistication. How it has been adopted is heavily fragmented by the academic disciplines underpinning the applications. We view some key mathematical examples that demonstrate the nature of machine learning and how it is used in practice, with the focus on building intuition for more technical expositions in later chapters. In particular, we begin to address many finance practitioners' concerns that neural networks are a "black-box" by showing how they are related to existing well-established techniques such as linear regression, logistic regression, and autoregressive time series models. Such arguments are developed further in later chapters.
This chapter is written to be self-contained and could form the basis of a bootcamp or short workshop on machine learning in finance. The end-of-chapter questions have also been designed to assess the students' background in probability and quantitative finance.

Chapter 2

Chapter 2 introduces probabilistic modeling and reviews foundational concepts in Bayesian econometrics such as Bayesian inference, model selection, online learning, and Bayesian model averaging. We develop more versatile representations of complex data with probabilistic graphical models such as mixture models.

Chapter 3

Chapter 3 introduces Bayesian regression and shows how it extends many of the concepts in the previous chapter. We develop kernel-based machine learning methods—specifically Gaussian process regression, an important class of Bayesian machine learning methods—and demonstrate their application to "surrogate" models of derivative prices. This chapter also provides a natural point from which to develop intuition for the role and functional form of regularization in a frequentist setting—the subject of subsequent chapters.

Chapter 4

Chapter 4 provides a more in-depth description of supervised learning, deep learning, and neural networks—presenting the foundational mathematical and statistical learning concepts and explaining how they relate to real-world examples in trading, risk management, and investment management. These applications present challenges for forecasting and model design and are presented as a recurring theme throughout the book. This chapter moves towards a more engineering style exposition of neural networks, applying concepts in the previous chapters to elucidate various model design choices.

Chapter 5

Chapter 5 presents a method for interpreting neural networks which imposes minimal restrictions on the neural network design. The chapter demonstrates techniques for interpreting a feedforward network, including how to rank the importance of the features. An example demonstrating how to apply interpretability analysis to deep learning models for factor modeling is also presented.

Chapter 6

Chapter 6 provides an overview of the most important modeling concepts in financial econometrics. Such methods form the conceptual basis and performance baseline for more advanced neural network architectures presented in the next chapter. In fact, each type of architecture is a generalization of many of the models presented here. This chapter is especially useful for students from an engineering or science background, with little exposure to econometrics and time series analysis.

Chapter 7

Chapter 7 presents a powerful class of probabilistic models for financial data. Many of these models overcome some of the severe stationarity limitations of the frequentist models in the previous chapters. The fitting procedure demonstrated is also different—the use of Kalman filtering algorithms for state-space models rather than maximum likelihood estimation or Bayesian inference. Simple examples of Hidden Markov models and particle filters in finance and various algorithms are presented.

Chapter 8

Chapter 8 presents various neural network models for financial time series analysis, providing examples of how they relate to well-known techniques in financial econometrics. Recurrent neural networks (RNNs) are presented as non-linear time series models that generalize classical linear time series models such as AR(p). They provide a powerful approach for prediction in financial time series and generalize to non-stationary data. The chapter also presents convolutional neural networks for filtering time series data and exploiting different scales in the data. Finally, this chapter demonstrates how autoencoders are used to compress information and generalize principal component analysis.

Chapter 9
Chapter 9 introduces Markov Decision Processes and the classical methods of dynamic programming, before building familiarity with the ideas of reinforcement learning and other approximate methods for solving MDPs. After describing Bellman optimality and iterative value and policy updates, the chapter quickly advances towards a more engineering style exposition of the topic, covering key computational concepts such as greediness, batch learning, and Q-learning. Through a number of mini-case studies, the chapter provides insight into how RL is applied to optimization problems in asset management and trading. These examples are each supported with Python notebooks.

Chapter 10

Chapter 10 considers real-world applications of reinforcement learning in finance, and further advances the theory presented in the previous chapter. We start with one of the most common problems of quantitative finance, which is the problem of optimal portfolio trading in discrete time. Many practical problems of trading or risk management amount to different forms of dynamic portfolio optimization, with different optimization criteria, portfolio composition, and constraints. The chapter introduces a reinforcement learning approach to option pricing that generalizes the classical Black–Scholes–Merton model to a data-driven approach using Q-learning. It then presents a probabilistic extension of Q-learning called G-learning, and shows how it can be used for dynamic portfolio optimization. For certain specifications of reward functions, G-learning is semi-analytically tractable, and amounts to a probabilistic version of linear quadratic regulators (LQR). Detailed analyses of such cases are presented, and we show their solutions with examples from problems of dynamic portfolio optimization and wealth management.

Chapter 11

Chapter 11 provides an overview of the most popular methods of inverse reinforcement learning (IRL) and imitation learning (IL). These methods solve the problem of optimal control in a data-driven way, similarly to reinforcement learning, however with the critical difference that now rewards are not observed. The problem is rather to learn the reward function from the observed behavior of an agent. As behavioral data without rewards are widely available, the problem of learning from such data is certainly very interesting. The chapter provides a moderate-level description of the most promising IRL methods, equips the reader with sufficient knowledge to understand and follow the current literature on IRL, and presents examples that use simple simulated environments to see how these methods perform when we know the "ground truth" rewards. We then present use cases for IRL in quantitative finance that include applications to trading strategy identification, sentiment-based trading, option pricing, and market modeling.

Chapter 12

Chapter 12 takes us forward to emerging research topics in quantitative finance and machine learning. Among many interesting emerging topics, we focus here on two broad themes. The first part deals with the unification of supervised learning and reinforcement learning as two tasks of perception-action cycles of agents. We outline some recent research ideas in the literature, including in particular information theory-based versions of reinforcement learning, and discuss their relevance for financial applications. We explain why these ideas might have interesting practical implications for RL financial models, where feature selection could be done within the general task of optimization of a long-term objective, rather than outside of it, as is usually performed in "alpha-research."

Exercises

The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been chosen to reinforce concepts explained in the text, to stimulate the application of machine learning in finance, and to gently bridge material in other chapters. Each is graded according to difficulty, ranging from (*), which denotes a simple exercise taking a few minutes to complete, through to (***), which denotes a significantly more complex exercise. Unless specified otherwise, all equations referred to in each exercise correspond to those in the corresponding chapter.

Python Notebooks

Many of the chapters are accompanied by Python notebooks to illustrate some of the main concepts and demonstrate the application of machine learning methods. Each notebook is lightly annotated. Many of these notebooks use TensorFlow. We recommend loading these notebooks, together with any accompanying Python source files and data, in Google Colab. Please see the appendices of each textbook chapter accompanied by notebooks, and the README.md in the subfolder of each textbook chapter, for further instructions and details.

Instructor Materials
This Instructor's Manual provides worked solutions to almost all of the end-of-chapter questions. Additionally, this Instructor's Manual is accompanied by a folder with notebook solutions to some of the programming assignments. Occasionally, some additional notes on the notebook solution are also provided.
Contents

Machine Learning in Finance: From Theory to Practice Instructor's Manual
Matthew F. Dixon, Igor Halperin and Paul Bilokon
  Introduction
  Advantage of the Book
  Recommended Course Syllabus
  Overview of the Textbook
  Exercises
  Python Notebooks
  Instructor Materials

Part I Machine Learning with Cross-Sectional Data
  1 Introduction
  2 Probabilistic Modeling
  3 Bayesian Regression & Gaussian Processes
    Programming Related Questions*
  4 Feedforward Neural Networks
    Programming Related Questions*
  5 Interpretability
    Programming Related Questions*

Part II Sequential Learning
  6 Sequence Modeling
  7 Probabilistic Sequence Modeling
  8 Advanced Neural Networks
    Programming Related Questions*

Part III Sequential Data with Decision-Making
  9 Introduction to Reinforcement Learning
  10 Applications of Reinforcement Learning
  11 Inverse Reinforcement Learning and Imitation Learning
Part I
Machine Learning with Cross-Sectional Data
Chapter 1
Introduction

Exercise 1.1**: Market game

Suppose that two players enter into a market game. The rules of the game are
as follows: Player 1 is the market maker, and Player 2 is the market
taker. In each round, Player 1 is provided with information x, and must choose
and declare a value α ∈ (0, 1) that determines how much it will pay out if a
binary event G occurs in the round. G ∼ Bernoulli(p), where p = g(x |θ) for
some unknown parameter θ.
Player 2 then enters the game with a $1 payment and chooses one of the following payoffs:

V_1(G, p) = { 1/α with probability p;  0 with probability (1 − p) }

or

V_2(G, p) = { 0 with probability p;  1/(1 − α) with probability (1 − p) }
a. Given that α is known to Player 2, state the strategy [1] that will give Player 2 an expected payoff, over multiple games, of $1 without knowing p.
b. Suppose now that p is known to both players. In a given round, what is the optimal choice of α for Player 1?
c. Suppose Player 2 knows with complete certainty that G will be 1 for a particular round. What will be the payoff for Player 2?
d. Suppose Player 2 has complete knowledge in rounds {1, . . . , i} and can reinvest payoffs from earlier rounds into later rounds. Further suppose, without loss of generality, that G = 1 for each of these rounds. What will be the payoff for Player 2 after i rounds? You may assume that each game can be played with fractional dollar costs, so that, for example, if Player 2 pays Player 1 $1.50 to enter the game, then the payoff will be 1.5 V_1.

[1] The strategy refers to the choice of weight w if Player 2 is to choose a payoff V = w V_1 + (1 − w) V_2, i.e. a weighted combination of payoffs V_1 and V_2.

Solution 1.1

a. If Player 2 chooses payoff V_1 with probability α and V_2 otherwise, then the expected payoff will be

αp/α + (1 − α)(1 − p)/(1 − α) = p + (1 − p) = 1.

So Player 2 is expected to win or at least break even, regardless of their level of knowledge, merely because Player 1 had to move first.
b. If Player 1 chooses α = p, then the expected payoff to Player 2 is exactly $1. Suppose Player 1 chooses α > p; then Player 2 takes V_2 and the expected payoff is

(1 − p)/(1 − α) > 1.

And similarly, if Player 1 chooses α < p, then Player 2 chooses V_1 and also expects to earn more than $1.
N.B.: If Player 1 chooses α = p, then Player 2 in expectation earns exactly $1 regardless of the choices made by Player 2.
c. Since G = 1 is known to Player 2, they choose V_1 and earn $1/α.
d. After the first round, Player 2 will have received $1/α_1. Reinvesting this in the second round, the payoff will be $1/(α_1 α_2). Therefore, after i rounds, the payoff to Player 2 will be:

∏_{k=1}^{i} 1/α_k.
N.B.: This problem and the next are developing the intuition for the use of logarithms in entropy. One key part of this logic is that the "right" way to consider the likelihood of a dataset is by multiplying together the probability of each observation, not summing or averaging them. Another reason to prefer a product is that under a product the individual events form a σ-algebra, such that any subset of events is itself an event that is priced fairly by the game. For instance, a player can choose to bet on events i = 1 and i = 2 both happening, and the resulting event is priced fairly, at odds calculated from its actual probability of occurrence, p_1 p_2.
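The break-even result in part (a) is easy to check by simulation. The following minimal sketch (not taken from the book's notebooks; the function and variable names are illustrative) plays the game repeatedly with the mixing weight w = α and prints the average payoff, which should be close to $1 for any choice of p:

import numpy as np

def play_round(alpha, p, w, rng):
    """Play one round: Player 2 takes payoff V1 with probability w, else V2."""
    G = rng.random() < p                      # binary event G ~ Bernoulli(p)
    if rng.random() < w:                      # choose V1
        return 1.0 / alpha if G else 0.0
    else:                                     # choose V2
        return 0.0 if G else 1.0 / (1.0 - alpha)

rng = np.random.default_rng(0)
for alpha, p in [(0.3, 0.7), (0.5, 0.2), (0.8, 0.8)]:
    payoffs = [play_round(alpha, p, w=alpha, rng=rng) for _ in range(200_000)]
    print(f"alpha={alpha}, p={p}: mean payoff ~ {np.mean(payoffs):.3f}")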

Exercise 1.2**: Model comparison

Recall Example 1.2. Suppose additional information was added such that it is
no longer possible to predict the outcome with 100% probability. Consider Table
1 as the results of some experiment.

G    x
A    (0, 1)
B    (1, 1)
B    (1, 0)
C    (1, 0)
C    (0, 0)

Table 1: Sample model data.

Now if we are presented with x = (1, 0), the result could be B or C.
Consider three different models applied to this value of x which encode the value
A, B or C.
f((1, 0)) = (0, 1, 0),      predicts B with 100% certainty.   (1)

g((1, 0)) = (0, 0, 1),      predicts C with 100% certainty.   (2)

h((1, 0)) = (0, 0.5, 0.5),  predicts B or C with 50% certainty.   (3)
a. Show that each model has the same total absolute error over the samples where x = (1, 0).
b. Show that all three models assign the same average probability to the values from Table 1 when x = (1, 0).
c. Suppose that the market game in Exercise 1.1 is now played with models f or g, x = (1, 0) and α = 1 − p. Show that the losses to Player 1 are unbounded.
d. Now suppose the game is played with model h, so that B or C each triggers two separate payoffs, V_1 and V_2 respectively. Show that the losses to Player 1 are bounded.

Solution 1.2

a. Model f has an absolute error of 1 + 0, g has an error of 0 + 1, and h has an error of 0.5 + 0.5, so they are all the same.
b. The calculation is the same as above: 1 + 0 = 0 + 1 = 0.5 + 0.5.
c. In the fourth row, Player 1 pays out 1/0 if α = 1 − p, where p denotes the model confidence of event B or C, which is unbounded.
d. Player 1 pays out 1/0.5 + 1/0.5 = 4, which is bounded.

Additional problems of this sort can be generated by simply requiring that the
data set in question have at least two rows with identical values of x, but
different values of G. This ensures that no model could predict all of the events
(since a model must be a function of x), and thus it is necessary to consider the
probabilities assigned to mispredicted events.
Provided that some of the models assign absolute certainty (p = 1.0) to
some incorrectly predicted outcomes, the unbounded losses in Part (3) will occur
for those models.
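The error and payout comparisons above can be reproduced in a few lines. The following sketch is illustrative only (the per-row error is measured as one minus the probability assigned to the realized class, matching the counts in the solution):

import numpy as np

# Class order (A, B, C); model outputs at x = (1, 0). The realized outcomes in
# Table 1 for x = (1, 0) are B (row 3) and C (row 4).
models = {"f": np.array([0.0, 1.0, 0.0]),
          "g": np.array([0.0, 0.0, 1.0]),
          "h": np.array([0.0, 0.5, 0.5])}
truths = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]

for name, pred in models.items():
    # Error per row: 1 minus the probability assigned to the true class.
    errors = [1.0 - pred[t.argmax()] for t in truths]
    # Player 1 pays out 1/p on the realized outcome; p = 0 gives an unbounded loss.
    payouts = [np.inf if pred[t.argmax()] == 0 else 1.0 / pred[t.argmax()] for t in truths]
    print(f"{name}: errors = {errors}, total payout = {sum(payouts)}")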

Exercise 1.3**: Model comparison

Example 1.1 and the associated discussion alluded to the notion that some
types of models are more common than others. This exercise will explore that
concept briefly.
Recall Table 1.1 from Example 1.1:

G    x
A    (0, 1)
B    (1, 1)
C    (1, 0)
C    (0, 0)

For this exercise, consider two models "similar" if they produce the same projections for G when applied to the values of x from Table 1.1 with probability strictly greater than 0.95.
In the following subsections, the goal will be to produce sets of mutually
dissimilar models that all produce Table 1.1 with a given likelihood.
a. How many similar models produce Table 1.1 with likelihood 1.0?
b. Produce at least 4 dissimilar models that produce Table 1.1 with likelihood 0.9.
c. How many dissimilar models can produce Table 1.1 with likelihood exactly
0.95?

Solution 1.3

a. There is only one model that can produce Table 1.1 with likelihood 1.0; it is

g(x) = { {1, 0, 0} if x = (0, 1);  {0, 1, 0} if x = (1, 1);  {0, 0, 1} if x = (1, 0);  {0, 0, 1} if x = (0, 0) }.   (4)

There are no dissimilar models that can produce the output in Table 1.1 with likelihood 1.0.
b. This can be done many ways, but most easily by perfectly predicting some rows and not others. One such set of models is:

g_1(x) = { {0.9, 0.1, 0} if x = (0, 1);  {0, 1, 0} if x = (1, 1);  {0, 0, 1} if x = (1, 0);  {0, 0, 1} if x = (0, 0) }   (5)

g_2(x) = { {1, 0, 0} if x = (0, 1);  {0.1, 0.9, 0} if x = (1, 1);  {0, 0, 1} if x = (1, 0);  {0, 0, 1} if x = (0, 0) }   (6)

g_3(x) = { {1, 0, 0} if x = (0, 1);  {0, 1, 0} if x = (1, 1);  {0, 0.1, 0.9} if x = (1, 0);  {0, 0, 1} if x = (0, 0) }   (7)

g_4(x) = { {1, 0, 0} if x = (0, 1);  {0, 1, 0} if x = (1, 1);  {0, 0, 1} if x = (1, 0);  {0, 0.1, 0.9} if x = (0, 0) }   (8)

When generating models of this form, it is not necessary to use 0.9, 0.1, and 0.0. If these three values are labeled α, β, and γ, then it is enough that α² < 0.95, α ≥ 0.9, and α + β + γ = 1.0.
c. It's clear from the setup that any models that give the table with likelihood 0.95 are right on the boundary of being dissimilar, so they must not agree in any other circumstance. There are an infinite number of such models. For example, here are two:

g_1(x) = { {0.95, 0.05, 0} if x = (0, 1);  {0, 1, 0} if x = (1, 1);  {0, 0, 1} if x = (1, 0);  {0, 0, 1} if x = (0, 0) }   (9)

g_2(x) = { {0.95, 0, 0.05} if x = (0, 1);  {0, 1, 0} if x = (1, 1);  {0, 0, 1} if x = (1, 0);  {0, 0, 1} if x = (0, 0) }   (10)

but they could just as easily have been generated with 0.95 + a + (0.05 − a), a ∈ [0, 0.05].

Exercise 1.4*: Likelihood estimation

When the data is i.i.d., the negative of the log-likelihood function (the "error function") for a binary classifier is the cross-entropy

E(θ) = − Σ_{i=1}^{n} [ G_i ln g_1(x_i | θ) + (1 − G_i) ln g_0(x_i | θ) ].

Suppose now that there is a probability π_i that the class label on a training data point x_i has been correctly set. Write down the error function corresponding to the negative log-likelihood. Verify that the error function in the above equation is obtained when π_i = 1. Note that this error function renders the model robust to incorrectly labeled data, in contrast to the usual least squares error function.

Solution 1.4

Given the probability π_i that the class label is correctly assigned, the error function can be written as

E(θ) = − Σ_{i=1}^{n} [ π_i ln g_1(x_i | θ) + (1 − π_i) ln g_0(x_i | θ) ].

Clearly, when π_i = 1 we recover the cross-entropy given by the error function above.
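The two error functions are simple to compare numerically. The sketch below is illustrative only: g1 is a stand-in logistic model (not the book's), and the data and confidence values are arbitrary. Taking π_i equal to the observed labels (fully trusted labels) reproduces the standard cross-entropy.

import numpy as np

def g1(x, theta):
    # Stand-in binary classifier (logistic); g0 = 1 - g1.
    return 1.0 / (1.0 + np.exp(-theta * x))

def cross_entropy(theta, x, G):
    p = g1(x, theta)
    return -np.sum(G * np.log(p) + (1 - G) * np.log(1 - p))

def weighted_error(theta, x, pi):
    # Error function of Solution 1.4 with label-confidence weights pi.
    p = g1(x, theta)
    return -np.sum(pi * np.log(p) + (1 - pi) * np.log(1 - p))

x = np.array([-0.5, 0.2, 1.0, -1.2])
G = np.array([0.0, 1.0, 1.0, 0.0])
print(cross_entropy(0.8, x, G))                                     # standard loss
print(weighted_error(0.8, x, pi=G))                                 # fully trusted labels: identical
print(weighted_error(0.8, x, pi=np.array([0.1, 0.9, 0.95, 0.2])))   # noisy labels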

Exercise 1.5**: Optimal action

Derive Eq. 1.17 by setting the derivative of Eq. 1.16 with respect to the time-t action u_t to zero. Note that Equation 1.17 gives a non-parametric expression for the optimal action u_t in terms of a ratio of two conditional expectations. To be useful in practice, the approach might need some further modification, as you will see in the next exercise.

Solution 1.5

Setting the derivative of Eq. 1.16 w.r.t. u_t to zero,

∂/∂u_t E[ Σ_{t=0}^{T} ( · ) ] = 0,

where the summand is the expression inside Eq. 1.16, gives

E[φ_t | S_t] − 2λ u_t V[φ_t | S_t] = 0,

and rearranging gives the optimal action:

u_t = E[φ_t | S_t] / ( 2λ V[φ_t | S_t] ).

Exercise 1.6***: Basis functions

Instead of non-parametric specifications of an optimal action in Eq. 1.17, we can develop a parametric model of optimal action. To this end, assume we have a set of basis functions Ψ_k(S) with k = 1, . . . , K. Here K is the total number of basis functions—the same as the dimension of your model space. We now define the optimal action u_t = u_t(S_t) in terms of coefficients θ_k of an expansion over the basis functions Ψ_k (for example, we could use spline basis functions, Fourier bases, etc.):

u_t = u_t(S_t) = Σ_{k=1}^{K} θ_k(t) Ψ_k(S_t).

Compute the optimal coefficients θ_k(t) by substituting the above equation for u_t into Eq. 1.16 and maximizing it with respect to the set of weights θ_k(t) for the t-th time step.

Solution 1.6

Plugging the basis function representation for u_t into Eq. 1.16 and setting the derivative of the resulting expression with respect to θ_{k'} = θ_{k'}(t) to zero, we obtain a system of linear equations for each k' = 1, . . . , K:

2λ Σ_{k=1}^{K} θ_k E[ Ψ_k(S) Ψ_{k'}(S) Var[φ|S] ] = E[ Ψ_{k'}(S) ].

This system of linear equations can be conveniently solved by defining a pair of a matrix A and vector B whose elements are defined as follows:

A_{k k'} = E[ Ψ_k(S) Ψ_{k'}(S) Var[φ|S] ],   B_{k'} = E[ Ψ_{k'}(S) ].

The solution of the linear system above is then given by the following simple matrix-valued formula for the vector θ of parameters θ_k:

θ = (1/(2λ)) A^{−1} B.

Note that this relation involves matrix inversion, and in practice might have to be regularized by adding a unity matrix to matrix A with a tiny regularization parameter before the matrix inversion step.
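Once the expectations defining A and B have been estimated (for example, by sample averages over simulated states), the regularized solve is a one-liner. The following sketch is illustrative only and assumes A and B are already available; the toy A and B below are random stand-ins.

import numpy as np

def solve_coefficients(A, B, lam, reg=1e-8):
    """Solve theta = (1/(2*lam)) * A^{-1} B with a small multiple of the identity added to A."""
    K = A.shape[0]
    A_reg = A + reg * np.eye(K)            # regularization before inversion
    return np.linalg.solve(A_reg, B) / (2.0 * lam)

# Toy example with K = 3 basis functions.
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M @ M.T                                # stand-in for E[Psi_k Psi_k' Var[phi|S]]
B = rng.normal(size=3)                     # stand-in for E[Psi_k']
theta = solve_coefficients(A, B, lam=0.1)
print(theta)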
Chapter 2
Probabilistic Modeling

Exercise 2.1: Applied Bayes’ Theorem

An accountant is 95 percent effective in detecting fraudulent accounting when it is, in fact, present. However, the audit also yields a "false positive" result for one percent of the non-fraudulent companies audited. If 0.1 percent of the companies are actually fraudulent, what is the probability that a company is fraudulent given that the audit revealed fraudulent accounting?

Solution 2.1

Let F denote the presence of fraud, F^c denote its absence, and + denote a positive audit result (i.e. the audit revealed fraudulent accounting). Then P(+|F) = .95, P(+|F^c) = .01, and P(F) = .001. Then, according to Bayes' theorem,

P(F|+) = P(F) · P(+|F) / P(+) = .001(.95) / ( .001(.95) + .999(.01) ) = .0868.
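A quick numerical check of this calculation (an illustrative sketch, not part of the original solution):

p_pos_given_fraud, p_pos_given_clean, p_fraud = 0.95, 0.01, 0.001
p_fraud_given_pos = (p_fraud * p_pos_given_fraud) / (
    p_fraud * p_pos_given_fraud + (1 - p_fraud) * p_pos_given_clean)
print(round(p_fraud_given_pos, 4))   # 0.0868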

Exercise 2.2*: FX and Equity

A currency strategist has estimated that JPY will strengthen against USD with probability 60% if S&P 500 continues to rise. JPY will strengthen against USD with probability 95% if S&P 500 falls or stays flat. We are in an upward trending market at the moment, and we believe that the probability that S&P 500 will rise is 70%. We then learn that JPY has actually strengthened against USD. Taking this new information into account, what is the probability that S&P 500 will rise?
Hint: Recall Bayes' rule: P(A | B) = P(B | A) P(A) / P(B).

Solution 2.2

Denote the events as
• H = "S&P continues to rise",
• E = "JPY strengthens against USD".
Then

P[E | H] = 0.6,   P[E | H̄] = 0.95,   P[H] = 0.7.

By Bayes' rule,

P[H | E] = P[E | H] P[H] / P[E],

and we can use the Law of Total Probability to rewrite the denominator to get

P[H | E] = P[E | H] P[H] / ( P[E | H] P[H] + P[E | H̄] P[H̄] ),

so we can now substitute in the given probabilities and obtain the answer:

P[H | E] = (0.6 × 0.7) / ( 0.6 × 0.7 + 0.95 × (1 − 0.7) ) = 0.5957 ≈ 60%,

and so P[H | E] < P[H]: the posterior is less than the prior. This decrease in probability is due to the fact that the evidence supports H̄ more strongly than H: P[E | H̄] > P[E | H].

Exercise 2.3**: Bayesian inference in trading

Suppose there are n conditionally independent, but not identical, Bernoulli trials G_1, . . . , G_n generated from the map P(G_i = 1 | X = x_i) = g_1(x_i | θ) with θ ∈ [0, 1]. Show that the likelihood of G | X is given by

p(G | X, θ) = ∏_{i=1}^{n} (g_1(x_i | θ))^{G_i} · (g_0(x_i | θ))^{1 − G_i}   (11)

and the log-likelihood of G | X is given by

ln p(G | X, θ) = Σ_{i=1}^{n} [ G_i ln(g_1(x_i | θ)) + (1 − G_i) ln(g_0(x_i | θ)) ].   (12)

Using Bayes' rule, write the conditional probability density function of θ (the "posterior") given the data (X, G) in terms of the above likelihood function.
From the previous example, suppose that G = 1 corresponds to JPY strengthening against the dollar and X are the S&P 500 daily returns, and now

g_1(x | θ) = θ 1_{x > 0} + (θ + 0.35) 1_{x ≤ 0}.   (13)

Starting with a neutral view on the parameter θ (i.e. θ ∈ [0, 1]), learn the distribution of the parameter θ given that JPY strengthens against the dollar for two of the three days and S&P 500 is observed to rise for 3 consecutive days.
Hint: You can use the Beta density function with a scaling constant Γ(α, β),

p(θ | α, β) = (α + β − 1)! / ( (α − 1)!(β − 1)! ) · θ^{α−1}(1 − θ)^{β−1} = Γ(α, β) θ^{α−1}(1 − θ)^{β−1},   (14)

to evaluate the integral in the marginal density function.
If θ represents the currency analyst's opinion of JPY strengthening against the dollar, what is the probability that the model overestimates the analyst's estimate?

Solution 2.3

If the trials are conditionally independent, then

p(G | X, θ) = ∏_{i=1}^{n} p(G = G_i | X = x_i, θ)

and since the conditional probability of G_i = 1 given X = x_i is p_i = g_1(x_i | θ), it follows that

p(G | X, θ) = ∏_{i=1}^{n} p(G = G_i | X = x_i, θ) = ∏_{i=1}^{n} p_i^{G_i} (1 − p_i)^{1 − G_i}

and taking logs gives

ln p(G | X, θ) = Σ_{i=1}^{n} ln p(G = G_i | X = x_i, θ) = Σ_{i=1}^{n} [ G_i ln p_i + (1 − G_i) ln(1 − p_i) ].

From Bayes' rule, the conditional density function of θ (the "posterior") is given by

p(θ | G, X) = p(G | X, θ) p(θ) / p(G | X).

Note that the currency strategist's "prior" in the previous exercise was the point mass p(θ = 0.6) = 1, and zero for all other estimates. Here, instead, we start with a uniform prior p(θ) = 1, ∀θ ∈ [0, 1], and learn the distribution of the parameter θ given that JPY strengthens against the dollar for two of the three days and S&P 500 is observed to rise for 3 consecutive days. Since the S&P 500 rises on all three days, g_1(x_i | θ) = θ on each day, so the likelihood function is θ²(1 − θ) and, from Bayes' law,

p(θ | G, X) = p(G | X, θ) p(θ) / p(G | X) = θ²(1 − θ) / ∫_0^1 θ²(1 − θ) dθ = Γ(3, 2) θ²(1 − θ),

where we have used the Beta density function with a scaling constant Γ(α, β),

p(θ | α, β) = (α + β − 1)! / ( (α − 1)!(β − 1)! ) · θ^{α−1}(1 − θ)^{β−1} = Γ(α, β) θ^{α−1}(1 − θ)^{β−1},

to evaluate the integral in the marginal density function. The probability that the model overestimates the analyst's estimate of 0.6 is:

P[θ > 0.6 | G, X] = Γ(3, 2) ∫_{0.6}^{1} θ²(1 − θ) dθ = 1 − Γ(3, 2) ∫_{0}^{0.6} θ²(1 − θ) dθ = 1 − 12 [ θ³/3 − θ⁴/4 ]_0^{0.6} = 1 − 0.4752 = 0.5248.
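Since the posterior Γ(3, 2) θ²(1 − θ) is the Beta(3, 2) density, the tail probability can be checked directly with SciPy (an illustrative sketch):

from scipy.stats import beta

posterior = beta(3, 2)               # density proportional to theta^2 (1 - theta)
print(round(posterior.sf(0.6), 4))   # P[theta > 0.6 | G, X] = 0.5248
print(round(posterior.cdf(0.6), 4))  # P[theta <= 0.6 | G, X] = 0.4752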

Exercise 2.4*: Bayesian inference in trading

Suppose that you observe the following daily sequence of directional changes in the JPY/USD exchange rate (U (up), D (down or stays flat)):

U, D, U, U, D

and the corresponding daily sequence of S&P 500 returns is

−0.05, 0.01, −0.01, −0.02, 0.03

You propose the following probability model to explain the behavior of JPY against USD given the directional changes in S&P 500 returns: Let G denote a Bernoulli R.V., where G = 1 corresponds to JPY strengthening against the dollar and r are the S&P 500 daily returns. All observations of G are conditionally independent (but *not* identical), so that the likelihood is

p(G | r, θ) = ∏_{i=1}^{n} p(G = G_i | r = r_i, θ)

where

p(G_i = 1 | r = r_i, θ) = { θ_u, r_i > 0;  θ_d, r_i ≤ 0 }.

Compute the full expression for the likelihood that the data was generated by this model.

Solution 2.4

The data D = (G, r) consists of all contemporaneous observations of G and r. The trials are conditionally independent, so the joint likelihood of G given r and θ can be written as the product of marginal conditional density functions of G:

p(G | r, θ) = ∏_{i=1}^{n} p(G = G_i | r = r_i, θ)

and since the conditional probability of G_i = 1 given r = r_i is θ_i ∈ {θ_u, θ_d}, the data yields

p(G | r, θ) = ∏_{i=1}^{n} p(G = G_i | r = r_i, θ) = ∏_{i=1}^{n} θ_i^{G_i} (1 − θ_i)^{1 − G_i}
            = θ_d · (1 − θ_u) · θ_d · θ_d · (1 − θ_u)
            = θ_d³ (1 − θ_u)².

Exercise 2.5: Model comparison

Suppose you observe the following daily sequence of direction changes in the stock market (U (up), D (down)):

U, D, U, U, D, D, D, D, U, U, U, U, U, U, U, D, U, D, U, D,
U, D, D, D, D, U, U, D, U, D, U, U, U, D, U, D, D, D, U, U,
D, D, D, U, D, U, D, U, D, D

You compare two models for explaining its behavior. The first model, M1, assumes that the probability of an upward movement is fixed to 0.5 and the data is i.i.d.
The second model, M2, also assumes the data is i.i.d., but that the probability of an upward movement is set to an unknown θ ∈ Θ = (0, 1) with a uniform prior on θ: p(θ|M2) = 1. For simplicity, we additionally choose a uniform model prior p(M1) = p(M2).
Compute the model evidence for each model.
Compute the Bayes' factor and indicate which model we should prefer in light of this data.

Solution 2.5

There are n = 50 observations, of which 25 are D and the remaining 25 are U. Let X denote the number of U's. The first model, M1, assumes that the probability of an upward movement is fixed to 0.5 and the data is i.i.d. The second model, M2, assumes the probability of an upward movement is set to an unknown θ ∈ Θ = (0, 1) with a uniform prior density on θ: p(θ|M2) = 1. For simplicity, we additionally choose a uniform model prior p(M1) = p(M2).
Which model is most likely to have generated the given data? We compute the model evidence for each model. The model evidence for M1 is the probability mass function

p(X = 25 | M1) = C(50, 25) (1/2⁵⁰) = 0.11227.

The model evidence of M2 requires integrating over θ:

p(X = 25 | M2) = ∫_0^1 p(X = 25 | θ, M2) p(θ | M2) dθ
               = ∫_0^1 C(50, 25) θ²⁵ (1 − θ)²⁵ dθ
               = C(50, 25) / Γ(26, 26) = 0.01960784.

Note that we have used the definition of the Beta density function with a scaling constant Γ(α, β),

p(θ | α, β) = (α + β − 1)! / ( (α − 1)!(β − 1)! ) · θ^{α−1}(1 − θ)^{β−1} = Γ(α, β) θ^{α−1}(1 − θ)^{β−1},

to evaluate the integral in the marginal density function above, where α = 26 and β = 26. Here Γ(26, 26) = 6.4469 × 10¹⁵ and C(50, 25) = 1.264 × 10¹⁴.
The Bayes' factor in favor of M1 is 5.7252 and thus |ln BF| = 1.744877. From Jeffreys' table, we see that there is weak evidence in favor of M1, since the log of the factor is between 1 and 2.5.
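The model evidences and the Bayes factor can be verified numerically (an illustrative sketch using SciPy):

import numpy as np
from scipy.special import comb, betaln

n, x = 50, 25
evidence_M1 = comb(n, x) * 0.5**n                            # binomial pmf at theta = 0.5
evidence_M2 = comb(n, x) * np.exp(betaln(x + 1, n - x + 1))  # likelihood integrated over theta
bayes_factor = evidence_M1 / evidence_M2
print(evidence_M1, evidence_M2, bayes_factor, abs(np.log(bayes_factor)))
# approximately 0.11227, 0.019608 (= 1/51), 5.726, 1.745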

Exercise 2.6: Bayesian prediction and updating

Using Bayesian prediction, predict the probability of an upward movement


given the best model and data in Exercise 2.5.
Suppose now that you observe the following new daily sequence of
direction changes in the stock market (U (up), D(down)):
D, U, D, D, D, D, U, D, U, D, U, D, D, D, U, U, D, U, D, D,
D,
U, U, D, D, D, U, D, U, D, U, D, D, D, U, D, U, D, U, D, D,
D, D, U, U, D, U, D, U, U
Using the best model from Exercise 2.5, compute the new posterior
distribution function based on the new data and the data in the previous
question and predict the probability of an upward price movement given all
data. State all modeling assumptions clearly.

Solution 2.6

Part I: If we choose k_2 (even though there is weak evidence for the "best" model being k_1), we can write the density of the new predicted value y′ (assuming it is just one new data point) given the previous data y as the expected value of the likelihood of the new data under the posterior density p(θ | y, k_2):

p(y′ | y, k_2) = E_{θ|y}[ p(y′ | y, θ, k_2) ] = ∫_0^1 p(y′ | y, θ, k_2) p(θ | y, k_2) dθ.

The model evidence for k_2, written in terms of y, in the previous question is

p(y | k_2) = ∫_0^1 p(y | θ, k_2) p(θ | k_2) dθ = 1/Γ(26, 26) = 1.5511 × 10⁻¹⁶

since the likelihood function is

p(y | θ, k_2) = ∏_{i=1}^{n} θ^{y_i}(1 − θ)^{1 − y_i} = θ^{Σ_i y_i}(1 − θ)^{Σ_i (1 − y_i)} = θ^x (1 − θ)^{n − x} = θ²⁵(1 − θ)²⁵,

where we have mapped the Bernoulli trials y_i to ones and zeros instead of U's and D's. Note that we dropped the (n choose x) term since this is a likelihood function over all y. By Bayes' law, the posterior density is

p(θ | y, k_2) = p(y | θ, k_2) p(θ | k_2) / p(y | k_2) = θ²⁵(1 − θ)²⁵ · 1 / Γ(26, 26)⁻¹ = Γ(26, 26) θ²⁵(1 − θ)²⁵,

and the prediction density is

p(y′ | y, k_2) = E_{θ|y}[ p(y′ | y, θ, k_2) ] = Γ(26, 26) ∫_0^1 θ^{y′}(1 − θ)^{1 − y′} θ²⁵(1 − θ)²⁵ dθ,

where we have used the likelihood function p(y′ | y, θ, k_2) = θ^{y′}(1 − θ)^{1 − y′}.
So the prediction of y′ = 1 is the probability mass function for the upward movement

p(y′ = 1 | y, k_2) = Γ(26, 26) ∫_0^1 θ²⁶(1 − θ)²⁵ dθ = Γ(26, 26)/Γ(27, 26) = 0.5,

and for the downward movement

p(y′ = 0 | y, k_2) = Γ(26, 26) ∫_0^1 θ²⁵(1 − θ)²⁶ dθ = Γ(26, 26)/Γ(26, 27) = 0.5,

and they sum to 1.
Part II: Now let's learn again using the new data y′ given in the question, which consists of 50 new data points, of which 30 are D's and 20 are U's. We use the Bayesian updating ('online learning') formula

p(θ | y′, y, k_2) = p(y′ | θ, k_2) p(θ | y, k_2) / ∫_Θ p(y′ | θ, k_2) p(θ | y, k_2) dθ.   (15)

The new model evidence is

p(y′ | y, k_2) = ∫_0^1 p(y′ | θ, k_2) p(θ | y, k_2) dθ,

and using the likelihood function evaluated on the new data,

p(y′ | θ, k_2) = θ²⁰(1 − θ)³⁰,

and recalling the posterior over y,

p(θ | y, k_2) = θ²⁵(1 − θ)²⁵ Γ(26, 26),

we have

p(y′ | y, k_2) = ∫_0^1 θ²⁰(1 − θ)³⁰ θ²⁵(1 − θ)²⁵ Γ(26, 26) dθ = Γ(26, 26)/Γ(46, 56),

so substituting the new model evidence and the new likelihood into Equation 15, the new posterior density is

p(θ | y′, y, k_2) = θ⁴⁵(1 − θ)⁵⁵ Γ(46, 56).

Part III: Finally, let's compute the probability again, now assuming that y* is the next observation and we merge y′ and y. So now (y, y′) contains 100 observations of the stock market variable Y.
The prediction density is now

p(y* | y, y′, k_2) = E_{θ|y,y′}[ p(y* | y, y′, θ, k_2) ] = Γ(46, 56) ∫_0^1 θ^{y*}(1 − θ)^{1 − y*} θ⁴⁵(1 − θ)⁵⁵ dθ,

and evaluating the density at y* = 1:

p(y* = 1 | y, y′, k_2) = Γ(46, 56)/Γ(47, 56) = 0.4509804.

The probability that the next value y* will be a one has decreased after observing all the data. This is because there are relatively more D's in the new data than in the old data.
Using Model 1: Under model k_1,

p(y | k_1) = ∫_0^1 p(y | θ, k_1) p(θ | k_1) dθ = 1/2⁵⁰,

where p(θ | k_1) = 1 when θ = 1/2 and p(θ | k_1) = 0 otherwise (i.e., the prior is a Dirac delta function), and the likelihood function is

p(y | θ = 1/2, k_1) = 1/2⁵⁰.

By Bayes' law, the posterior density can only be

p(θ | y, k_1) = p(y | θ, k_1) p(θ | k_1) / p(y | k_1) = 1, θ = 1/2,

and zero for all other values of θ. The prediction density, for either up or downward movements, is

p(y′ | y, k_1) = E_{θ|y}[ p(y′ | y, θ = 1/2, k_1) ] = 1/2 · 1.

Note that the prediction will not change under new data.

N.B.: An alternative formulation writes the old posterior in the previous question conditioned on X = x and the new posterior conditioned on X′ = x′ and X = x. The difference will be a factor up to the n choose x operator, which will cancel through.
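Because the whole calculation is a Beta–Bernoulli conjugate update, the predictive probabilities above can be checked in a few lines (an illustrative sketch; alpha and beta count the observed U's and D's plus the uniform-prior pseudo-counts):

# Uniform prior on theta is Beta(1, 1); each U adds 1 to alpha, each D adds 1 to beta.
alpha, beta_ = 1, 1

# First data set (Exercise 2.5): 25 U's and 25 D's.
alpha, beta_ = alpha + 25, beta_ + 25
print("P(y' = 1 | y)     =", alpha / (alpha + beta_))   # 26/52  = 0.5

# Second data set (Exercise 2.6): 20 U's and 30 D's.
alpha, beta_ = alpha + 20, beta_ + 30
print("P(y* = 1 | y, y') =", alpha / (alpha + beta_))   # 46/102 = 0.4509804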

Exercise 2.7: Logistic regression is Naive Bayes


Suppose that G and X ∈ {0, 1}^p are Bernoulli random variables and the X_i's are mutually independent given G—that is, P[X | G] = ∏_{i=1}^{p} P[X_i | G]. Given a naive Bayes' classifier P[G | X], show that the following logistic regression model produces equivalent output if the weights are

w_0 = log( P[G]/P[G^c] ) + Σ_{i=1}^{p} log( P[X_i = 0 | G] / P[X_i = 0 | G^c] ),

w_i = log( P[X_i = 1 | G]/P[X_i = 1 | G^c] · P[X_i = 0 | G^c]/P[X_i = 0 | G] ),   i = 1, . . . , p.

Solution 2.7

The naive Bayes' classifier is given by:

P[G | X] = P[X | G] P[G] / P[X]   (16)
         = P[X | G] P[G] / ( P[X | G] P[G] + P[X | G^c] P[G^c] )
         = 1 / ( 1 + P[X | G^c] P[G^c] / (P[X | G] P[G]) )
         = 1 / ( 1 + exp( ln( P[X | G^c] P[G^c] / (P[X | G] P[G]) ) ) )
         = 1 / ( 1 + exp( −ln( P[X | G] P[G] / (P[X | G^c] P[G^c]) ) ) ).

We want

P[G | X] = 1 / ( 1 + exp( −Σ_{i=0}^{p} w_i X_i ) ).   (17)

By combining Eq. 16 and Eq. 17 we get:

−ln( P[X | G] P[G] / (P[X | G^c] P[G^c]) ) = −Σ_{i=0}^{p} w_i X_i.

Using the naive Bayes' assumption and some simplifications we have:

Σ_{i=0}^{p} w_i X_i = log( P[G]/P[G^c] ) + Σ_{i=1}^{p} log( P[X_i | G]/P[X_i | G^c] ).   (18)

Now we want the RHS of Eq. 18 to be a linear function of the X_i's, so we perform the following substitution (i > 0):

log( P[X_i | G]/P[X_i | G^c] ) = X_i log( P[X_i = 1 | G]/P[X_i = 1 | G^c] ) + (1 − X_i) log( P[X_i = 0 | G]/P[X_i = 0 | G^c] ).   (19)

By combining Eq. 18 and Eq. 19 we can show equivalence. To solve for w_0 we plug in X_i = 0 for i > 0 (X_0 is a dummy feature for the bias term, so it is always 1). To compute w_i we take the coefficient of X_i in Eq. 19.
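The equivalence is easy to confirm numerically for a small example. The sketch below is illustrative only: p = 2 and the class-conditional probabilities are arbitrary numbers chosen for the check, not values from the book.

import numpy as np
from itertools import product

p = 2
P_G = 0.3                                  # P[G]
P_x1_given_G = np.array([0.7, 0.2])        # P[X_i = 1 | G]
P_x1_given_Gc = np.array([0.4, 0.6])       # P[X_i = 1 | G^c]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights from Exercise 2.7.
w0 = np.log(P_G / (1 - P_G)) + np.sum(np.log((1 - P_x1_given_G) / (1 - P_x1_given_Gc)))
w = np.log(P_x1_given_G / P_x1_given_Gc) - np.log((1 - P_x1_given_G) / (1 - P_x1_given_Gc))

for X in product([0, 1], repeat=p):
    X = np.array(X)
    lik_G = np.prod(np.where(X == 1, P_x1_given_G, 1 - P_x1_given_G))
    lik_Gc = np.prod(np.where(X == 1, P_x1_given_Gc, 1 - P_x1_given_Gc))
    posterior = lik_G * P_G / (lik_G * P_G + lik_Gc * (1 - P_G))   # naive Bayes
    logistic = sigmoid(w0 + w @ X)                                 # logistic regression
    print(X, round(posterior, 6), round(logistic, 6))              # the two columns agree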

Exercise 2.8**: Restricted Boltzmann Machines

Consider a probabilistic model with two types of binary variables: visible binary stochastic units v ∈ {0, 1}^D and hidden binary stochastic units h ∈ {0, 1}^F, where D and F are the number of visible and hidden units respectively. The joint probability density to observe their values is given by the exponential distribution

p(v, h) = (1/Z) exp(−E(v, h)),   Z = Σ_{v,h} exp(−E(v, h)),

where the energy E(v, h) of the state {v, h} is

E(v, h) = −v^T W h − b^T v − a^T h = − Σ_{i=1}^{D} Σ_{j=1}^{F} W_{ij} v_i h_j − Σ_{i=1}^{D} b_i v_i − Σ_{j=1}^{F} a_j h_j,

with model parameters a, b, W. This probabilistic model is called the restricted Boltzmann machine. Show that the conditional probabilities for visible and hidden nodes are given by the sigmoid function σ(x) = 1/(1 + e^{−x}):

P[v_i = 1 | h] = σ( Σ_j W_{ij} h_j + b_i ),   P[h_j = 1 | v] = σ( Σ_i W_{ij} v_i + a_j ).

Solution 2.8

Consider the first expression, P[v_i = 1 | h]. Let v_{−i} stand for the vector of visible units with the i-th component removed. This conditional probability can be computed as follows:

P[v_i = 1 | h] = Σ_{v_{−i}} P[v_i = 1, v_{−i}, h] / p(h).   (20)

The probabilities entering this expression can be written as follows:

p(h) = Σ_v p(v, h) = (1/Z) exp(a^T h) Σ_{v_{−i}, v_i} exp( v^T W h + b^T v ),

P[v_i = 1, v_{−i}, h] = (1/Z) exp(a^T h) exp( Σ_{i′ ≠ i} Σ_{j=1}^{F} W_{i′j} v_{i′} h_j + Σ_{i′ ≠ i} b_{i′} v_{i′} + Σ_j W_{ij} h_j + b_i ).

Substituting these expressions into (20) and simplifying, we obtain

P[v_i = 1 | h] = exp( Σ_j W_{ij} h_j + b_i ) / ( exp( Σ_j W_{ij} h_j + b_i ) + 1 ) = σ( Σ_j W_{ij} h_j + b_i ).

The second relation is obtained in a similar way.
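The result can also be verified by brute force for a small RBM, comparing the sigmoid formula with the conditional probability computed from the full joint distribution. The sketch below is illustrative only, with random parameters and a fixed hidden configuration chosen for the check.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
D, F = 3, 2
W = rng.normal(size=(D, F))
b = rng.normal(size=D)
a = rng.normal(size=F)

def energy(v, h):
    return -(v @ W @ h + b @ v + a @ h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Brute-force P[v_0 = 1 | h] for a fixed hidden configuration h.
h = np.array([1.0, 0.0])
num = den = 0.0
for v in product([0.0, 1.0], repeat=D):
    v = np.array(v)
    p_unnorm = np.exp(-energy(v, h))
    den += p_unnorm
    if v[0] == 1.0:
        num += p_unnorm
print("brute force:", num / den)
print("sigmoid    :", sigmoid(W[0] @ h + b[0]))   # sigma(sum_j W_0j h_j + b_0)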
Chapter 3
Bayesian Regression & Gaussian Processes

Exercise 3.1: Posterior Distribution of Bayesian Linear Regression

Consider the Bayesian linear regression model

y = θ^T X + z,   z_i ~ N(0, σ_n²),   θ ~ N(µ, Σ).

Show that the posterior over the data D is given by the distribution

θ | D ~ N(µ′, Σ′),

with moments:

µ′ = Σ′ a = ( Σ⁻¹ + (1/σ_n²) X X^T )⁻¹ ( Σ⁻¹ µ + (1/σ_n²) y^T X ),

Σ′ = A⁻¹ = ( Σ⁻¹ + (1/σ_n²) X X^T )⁻¹.

Solution 3.1

See Eq. 3.12 to Eq. 3.19 in the chapter for the derivation of the posterior distribution moments.

Exercise 3.2: Normal conjugate distributions

Suppose that the prior is p(θ) = φ(θ; µ_0, σ_0²) and the likelihood is given by

p(x_{1:n} | θ) = ∏_{i=1}^{n} φ(x_i; θ, σ²),

where σ² is assumed to be known. Show that the posterior is also normal, p(θ | x_{1:n}) = φ(θ; µ_post, σ²_post), where

µ_post = σ_0² x̄ / (σ²/n + σ_0²) + (σ²/n) µ_0 / (σ²/n + σ_0²),

σ²_post = 1 / ( 1/σ_0² + n/σ² ),

where x̄ := (1/n) Σ_{i=1}^{n} x_i.

Solution 3.2

Starting with n = 1, we use the Schur identity for the bi-Gaussian joint distribution:

E[θ | x] = E[θ] + ( Cov(θ, x)/σ_x² )( x − E[x] ),

V[θ | x] = V[θ] − Cov²(θ, x)/σ_x².

Using that

x = θ + σ z,   z ~ N(0, 1),
θ = µ_0 + σ_0 δ,   δ ~ N(0, 1),

we have

E[x] = E[θ] = µ_0,
V[x] = V[θ] + σ² = σ_0² + σ²,
Cov(x, θ) = E[(x − µ_0)(θ − µ_0)] = σ_0²,

and plugging into the bi-Gaussian conditional moments, after some rearranging, gives

µ_post = E[θ | x] = ( σ_0² x + σ² µ_0 ) / ( σ² + σ_0² ),

σ²_post = V[θ | x] = σ² σ_0² / ( σ² + σ_0² ) = ( 1/σ_0² + 1/σ² )⁻¹.

Now for n > 1, we first show that

p(x_{1:n} | θ) ∝ exp{ −(1/(2σ²)) Σ_{i=1}^{n} (x_i − θ)² }
             ∝ exp{ −(1/(2σ²)) ( Σ_{i=1}^{n} x_i² − 2θ Σ_{i=1}^{n} x_i + nθ² ) }
             ∝ exp{ −(n/(2σ²)) (x̄ − θ)² }
             ∝ p(x̄ | θ),

where we keep only terms in θ. Given that x̄ | θ ~ N(θ, σ²/n), we can substitute x̄ into the previous result for the conditional moments to give the required result:

µ_post = ( σ_0² x̄ + (σ²/n) µ_0 ) / ( σ²/n + σ_0² ),

σ²_post = ( 1/σ_0² + n/σ² )⁻¹.
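A quick numerical sanity check of the conjugate update (an illustrative sketch; the prior, noise level, and data are arbitrary choices, and µ_post is written in the equivalent precision-weighted form):

import numpy as np

mu0, sigma0 = 1.0, 2.0          # prior N(mu0, sigma0^2)
sigma = 0.5                     # known observation noise
rng = np.random.default_rng(1)
x = rng.normal(loc=0.7, scale=sigma, size=20)
n, xbar = len(x), x.mean()

# Closed-form posterior moments from the solution above.
s2_post = 1.0 / (1.0 / sigma0**2 + n / sigma**2)
mu_post = s2_post * (n * xbar / sigma**2 + mu0 / sigma0**2)

# Brute-force grid check of the same posterior.
theta = np.linspace(-4.0, 6.0, 100_001)
log_post = (-0.5 * ((theta - mu0) / sigma0) ** 2
            - 0.5 * (((x[:, None] - theta[None, :]) / sigma) ** 2).sum(axis=0))
w = np.exp(log_post - log_post.max())
w /= w.sum()
print(mu_post, (w * theta).sum())                       # posterior means agree
print(s2_post, (w * (theta - mu_post) ** 2).sum())      # posterior variances agree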

Exercise 3.3: Prediction with GPs

Show that the predictive distribution for a Gaussian process, with model output over a test point, f_*, and assumed Gaussian noise with variance σ_n², is given by

f_* | D, x_* ~ N( E[f_* | D, x_*], var[f_* | D, x_*] ),

where the moments of the posterior over X_* are

E[f_* | D, X_*] = µ_{X_*} + K_{X_*,X} [ K_{X,X} + σ_n² I ]⁻¹ Y,

var[f_* | D, X_*] = K_{X_*,X_*} − K_{X_*,X} [ K_{X,X} + σ_n² I ]⁻¹ K_{X,X_*}.

Solution 3.3

Start with the joint density between y and f_* given in Eq. 3.35, expressed in terms of the partitioned covariance matrix:

[ y ; f_* ] ~ N( [ µ_y ; µ_{f_*} ], [ Σ_yy, Σ_{y f_*} ; Σ_{f_* y}, Σ_{f_* f_*} ] ).

Then use the Schur identity to derive Eq. 3.36. Finally, by writing y = f(x) + z, with an unknown f(x), the covariance is given by

Σ_yy = K_{X,X} + σ_n² I,

where K_{X,X} is the known covariance of f(X) over the training input X. Writing the other covariance terms gives the required form of the predictive distribution moments, with K specified by some user-defined kernel.
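The predictive moments can be computed directly from the formulas above. The following minimal sketch uses an RBF kernel and a zero prior mean (so µ_{X_*} = 0); it is illustrative only and is independent of the book's notebooks.

import numpy as np

def rbf(A, B, length=1.0, amp=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return amp**2 * np.exp(-0.5 * d2 / length**2)

def gp_predict(X, y, Xs, sigma_n=0.1, length=1.0):
    """Posterior mean and variance of f* at test points Xs given training data (X, y)."""
    K = rbf(X, X, length) + sigma_n**2 * np.eye(len(X))   # K_{X,X} + sigma_n^2 I
    Ks = rbf(Xs, X, length)                               # K_{X*,X}
    Kss = rbf(Xs, Xs, length)                             # K_{X*,X*}
    mean = Ks @ np.linalg.solve(K, y)                     # zero prior mean assumed
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T))
    return mean, var

X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=X.shape)
Xs = np.linspace(0.0, 5.0, 7)
mean, var = gp_predict(X, y, Xs)
print(np.round(mean, 3), np.round(var, 4))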

1 Programming Related Questions*

Exercise 3.4: Derivative Modeling with GPs

Using the notebook Example-1-GP-BS-Pricing.ipynb, investigate the effectiveness of a Gaussian process with RBF kernels for learning the shape of a European derivative (call) pricing function V_t = f_t(S_t), where S_t is the underlying stock's spot price. The risk-free rate is r = 0.001, the strike of the call is K_C = 130, the volatility of the underlying is σ = 0.1, and the time to maturity is τ = 1.0.
Your answer should plot the variance of the predictive distribution against the stock price, S_t = s, over a data set consisting of n ∈ {10, 50, 100, 200} gridded values of the stock price s ∈ Ω_h := {i∆s | i ∈ {0, . . . , n − 1}, ∆s = 200/(n − 1)} ⊆ [0, 200] and the corresponding gridded derivative prices V(s). Each observation of the dataset, (s_i, v_i = f_t(s_i)), is a gridded (stock, call price) pair at time t.

Solution 3.4

See notebook GP_3_4.ipynb for the implemented solution.



Fig. 1: The predictive variance of the GP against the stock price for various
training set sizes. The predictive variance is observed to monotonically decrease with
training set size.
Chapter 4
Feedforward Neural Networks

Exercise 4.1

Show that substituting

∇_{ij} I_k = { X_j, i = k;  0, i ≠ k }

into Eq. 4.47 gives

∇_{ij} σ_k ≡ ∂σ_k/∂w_{ij} = ∇_i σ_k X_j = σ_k (δ_{ki} − σ_i) X_j.
Solution 4.1

By the chain rule we have

∂σ_k/∂w_{ij} = Σ_m (∂σ_k/∂I_m)(∂I_m/∂w_{ij}).

Recall that the (k, i) component of the Jacobian of σ is:

∂σ_k/∂I_i = σ_k (δ_{ki} − σ_i),

and since ∂I_i/∂w_{ij} = X_j for i = k, we have the required answer

∂σ_k/∂w_{ij} = (∂σ_k/∂I_i) X_j = σ_k (δ_{ki} − σ_i) X_j.
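The identity can be checked against a finite-difference derivative. The sketch below is illustrative only and assumes I = W X with no bias, which matches ∂I_i/∂w_{ij} = X_j; the sizes and indices are arbitrary.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, p = 4, 3
W = rng.normal(size=(K, p))
X = rng.normal(size=p)

k, i, j = 2, 1, 0                     # check d sigma_k / d w_ij
sigma = softmax(W @ X)
analytic = sigma[k] * ((k == i) - sigma[i]) * X[j]

eps = 1e-6                            # central finite difference
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (softmax(Wp @ X)[k] - softmax(Wm @ X)[k]) / (2 * eps)
print(analytic, numeric)              # the two values agree to high precision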


Exercise 4.2

Show that substituting the derivative of the softmax function w.r.t. w_{ij} into Eq. 4.52 gives, for the special case when the output is Y_k = 1, k = i and Y_k = 0, ∀k ≠ i:

∇_{ij} L(W, b) := [∇_W L(W, b)]_{ij} = { (σ_i − 1) X_j, Y_i = 1;  0, Y_k = 0, ∀k ≠ i }.

Solution 4.2

Recalling Eq. 4.52, with σ the softmax activation function and I(W, b) the transformation,

∇L(W, b) = ∇(L ◦ σ)(I(W, b)) = ∇L(σ(I(W, b))) · ∇σ(I(W, b)) · ∇I(W, b).

Writing the model output as a function of W only, since we are not taking the derivative with respect to b:

Ŷ(W) = σ ◦ I(W).

The loss of a K-classifier is given by the cross-entropy:

L(W) = − Σ_{k=1}^{K} Y_k ln Ŷ_k.

However, we want to evaluate the loss for the special case when k = i, since Y_i = 1. Therefore the loss reduces to

L(W) = −ln Ŷ_i = −ln σ_i.

Taking the derivative of the loss w.r.t. w_{ij} gives

∂L/∂w_{ij} = Σ_{m,n} (∂L(W)/∂σ_m)(∂σ_m/∂I_n)(∂I_n/∂w_{ij}),

or equivalently

∇_{ij} L(W) = Σ_{m,n} ∇_{σ_m} L(W) ∇_{mn} σ(I) ∇_{ij} I_n(W).

The derivative of the loss w.r.t. σ_m is

∇_{σ_m} L(W) = −δ_{im}/σ_m.

The derivative of the softmax function w.r.t. I_n gives

∇_{mn} σ(I(W)) = σ_m(I(W)) (δ_{mn} − σ_n(I(W))),

and the derivative of I_n w.r.t. w_{ij} is

∇_{ij} I_n(W) = δ_{ni} X_j.

Hence the derivative of the softmax function w.r.t. w_{ij} is

σ_m(I(W)) (δ_{mn} − σ_n(I(W))) δ_{ni} X_j.

Putting these terms together,

∇_{ij} L(W) = − Σ_{m,n} (δ_{im}/σ_m) σ_m(I(W)) (δ_{mn} − σ_n(I(W))) δ_{ni} X_j.

Contracting over the m and n indices gives, for Y_i = 1:

∇_{ij} L(W) = (σ_i − 1) X_j,

and is zero if Y_k = 0, k ≠ i.

Exercise 4.3

Consider feedforward neural networks constructed using the following two types of activation functions:
• Identity: Id(x) := x
• Step function (a.k.a. Heaviside function): H(x) := { 1, if x ≥ 0;  0, otherwise }.

Consider a feedforward neural network with one input x ∈ R, a single hidden layer with K units having step function activations, H(x), and a single output with identity (a.k.a. linear) activation, Id(x). The output can be written as

f̂(x) = Id( b^(2) + Σ_{k=1}^{K} w_k^(2) H(b_k^(1) + w_k^(1) x) ).   (1)

Construct neural networks using these activation functions.
a. Consider the step function

u(x; a) := y H(x − a) = { y, if x ≥ a;  0, otherwise }.

Construct a neural network with one input x and one hidden layer, whose response is u(x; a). Draw the structure of the neural network, specify the activation function for each unit (either Id or H), and specify the values for all weights (in terms of a and y).
b. Now consider the indicator function

1_{[a,b)}(x) = { 1, if x ∈ [a, b);  0, otherwise }.

Construct a neural network with one input x and one hidden layer, whose response is y 1_{[a,b)}(x), for given real values y, a and b. Draw the structure of the neural network, specify the activation function for each unit (either Id or H), and specify the values for all weights (in terms of a, b and y).

Solution 4.3

a. There are many possible solutions. However, one such choice is as follows. Set

    b^(2) = 0,  w_1^(2) = y,  b_1^(1) = −a,  w_1^(1) = 1.

b. There are many possible solutions. However, one such choice is as follows. Set

    b^(2) = 0,  w_1^(2) = y,  w_2^(2) = −y,  b_1^(1) = −a,  w_1^(1) = 1,  b_2^(1) = −b,  w_2^(1) = 1.

Exercise 4.4

A neural network with a single hidden layer can provide an arbitrarily close approximation to any 1-dimensional bounded smooth function. This question will guide you through the proof. Let f(x) be any function whose domain is [C, D), for real values C < D. Suppose that the function is Lipschitz continuous, that is,

    ∀x, x′ ∈ [C, D),  |f(x′) − f(x)| ≤ L|x′ − x|,

for some constant L ≥ 0. Use the building blocks constructed in the previous part to construct a neural network with one hidden layer that approximates this function within z > 0, that is, ∀x ∈ [C, D), |f(x) − f̂(x)| ≤ z, where f̂(x) is the output of your neural network given input x. Your network should use only the identity or the Heaviside activation functions. You need to specify the number of hidden units, K, the activation function for each unit, and a formula for calculating each hidden-layer weight (w_1^(k), w_0^(k)), for each k ∈ {1, . . . , K}, as well as the output-layer weights. These weights may be specified in terms of C, D, L and z, as well as the values of f(x) evaluated at a finite number of x values of your choosing (you need to explicitly specify which x values you use). You do not need to explicitly write the f̂(x) function. Why does your network attain the given accuracy z?

Solution 4.4

The hidden units all use the step (Heaviside) activation function, and the output unit uses the identity activation. Take K = ⌈L(D − C)/(2z)⌉ + 1. For k ∈ {1, . . . , K},

    w_1^(k) = 1,   w_0^(k) = −(C + (k − 1)2z/L).

For the second layer, we have w_0 = 0, w_1 = f(C + z/L), and for k ∈ {2, . . . , K},

    w_k = f(C + (2k − 1)z/L) − f(C + (2k − 3)z/L).

We only evaluate f(x) at the points C + (2k − 1)z/L, for k ∈ {1, . . . , K}, which is a finite number of points. The function value f̂(x) is exactly f(C + (2k − 1)z/L) for x ∈ [C + (2k − 2)z/L, C + 2kz/L), k ∈ {1, . . . , K}. Since f is L-Lipschitz and any such x is within z/L of the evaluation point C + (2k − 1)z/L, we have |f(x) − f̂(x)| ≤ L · z/L = z.

Exercise 4.5
Consider a shallow neural network regression model with n tanh activated units in the hidden layer and d outputs. The hidden-outer weight matrix is W_{ij}^(2) = 1/n and the input-hidden weight matrix is W^(1) = 1. The biases are zero. If the features X_1, . . . , X_p are i.i.d. Gaussian random variables with mean µ = 0 and variance σ², show that

a. Ŷ ∈ [−1, 1].
b. Ŷ is independent of the number of hidden units, n ≥ 1.
c. The expectation E[Ŷ] = 0 and the variance V[Ŷ] ≤ 1.

Solution 4.5

Ŷ = W^(2) σ(I^(1)).

Each hidden pre-activation is identical:

    I_i^(1) = Σ_{j=1}^{p} w_{ij}^(1) X_j = Σ_{j=1}^{p} X_j = S_i = s,   i ∈ {1, . . . , n}.

Using this expression gives

    Ŷ = Σ_{j=1}^{n} w_{ij}^(2) σ(S_j) = (1/n) Σ_{j=1}^{n} σ(s) = σ(s),

which is independent of n. Since σ = tanh takes values in [−1, 1], so does Ŷ.

The expectation is

    E[Ŷ] = ∫ W^(2) σ(I^(1)) f_X(x) dx
         = W^(2) ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} σ(I^(1)) f_X(x_1) . . . f_X(x_p) dx_1 . . . dx_p
         = W^(2) ∫_{0}^{∞} · · · ∫_{0}^{∞} [σ(I^(1)) + σ(−I^(1))] f_X(x_1) . . . f_X(x_p) dx_1 . . . dx_p
         = 0,

where we have used the property that σ(I) is an odd function in I (and in x, since b^(1) = 0) and f_X(x) is an even function since µ = 0. The variance is E[ŶŶ^T] = E[σ(I^(1)) σ^T(I^(1))], where [E[σ(I^(1)) σ^T(I^(1))]]_{ij} = E[σ²(I^(1))], and

    E[σ²(I^(1))] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} σ²(I^(1)) f_X(x_1) . . . f_X(x_p) dx_1 . . . dx_p
                 = 2 ∫_{0}^{∞} · · · ∫_{0}^{∞} σ²(I^(1)) f_X(x_1) . . . f_X(x_p) dx_1 . . . dx_p
                 ≤ 2 ∫_{0}^{∞} · · · ∫_{0}^{∞} f_X(x_1) . . . f_X(x_p) dx_1 . . . dx_p
                 = 1.

Exercise 4.6

Determine the VC dimension of the sums of indicator functions on Ω = [0, 1]:

    F_k = { f : Ω → {0, 1},  f(x) = Σ_{i=0}^{k} 1_{x ∈ [t_{2i}, t_{2i+1})},  0 ≤ t_0 < · · · < t_{2k+1} ≤ 1,  k ≥ 1 }.

Solution 4.6

F_k is the set of all functions consisting of k + 1 indicator functions. Consider the worst case labeling that can be encountered. This occurs when the labels of the points alternate, {0, 1, 0, 1, . . . }. If the function can represent the labeled data in this case, then it can do so for any other labeling. In this worst case configuration, with k + 1 points labeled 1 and therefore 2(k + 1) points in the set, it is clear that all points can be labeled correctly by placing one indicator function over each of the k + 1 points with label 1. So VC(F_k) ≥ 2(k + 1). Moreover, if there are 2(k + 1) + 1 points, you can create a configuration of alternating labeled points with a total of k + 2 points with label 1. This is achieved by starting and ending with a 1-labeled point, so that the labels of the points are {1, 0, 1, . . . , 0, 1}. This last configuration is not representable with k + 1 indicator functions. Therefore VC(F_k) = 2(k + 1).

Exercise 4.7

Show that a feedforward binary classifier with two Heaviside activated units
shatters the data {0.25, 0.5, 0.75}.

Solution 4.7

For all permutations of label assignments to the data, we exhibit values of weights and biases which classify each arrangement:

• None of the points are activated. In this case, one of the units is essentially redundant and the output of the network is a Heaviside function: b_1^(1) = b_2^(1) and w_1^(2) = w_2^(2). For example:

    W^(1) = (1, 1)^T,  b^(1) = (c, c)^T,  W^(2) = (1/2, 1/2),  b^(2) = 0,

where c < −0.75.

  – The last point is activated: choose −0.5 > c > −0.75.
  – The last two points are activated: choose −0.25 > c > −0.5.
  – All points are activated: choose c > −0.25.

• The network behaves as an indicator function: b_1^(1) ≠ b_2^(1) and w_1^(2) = −w_2^(2), e.g.

    W^(1) = (1, 1)^T,  b^(1) = (c_1, c_2)^T,  W^(2) = (1, −1),  b^(2) = 0,

where c_2 < c_1. For example, if c_1 = −0.4 and c_2 = −0.6, then the second point is activated. Also the first point is activated with, say, c_1 = 0 and c_2 = −0.4. The first two points are activated with c_1 = 0 and c_2 = −0.6.

• Lastly, the inverse indicator function can be represented, e.g., only the first and third points are activated:

    W^(1) = (1, 1)^T,  b^(1) = (c_1, c_2)^T,  W^(2) = (−1, 1),  b^(2) = 1,

with, say, c_1 = −0.4 and c_2 = −0.6.

Hence the network shatters all three points and the VC-dimension is 3. Note that we could have approached the first part above differently by letting the network behave as a wider indicator function (with support covering two or three data points) rather than as a single Heaviside function.

Exercise 4.8

Compute the weight and bias updates of W (2) and b(2) given a shallow binary
classifier (with one hidden layer) with unit weights, zero biases, and ReLU
activation of two hidden units for the labeled observation (x = 1, y = 1).

Solution 4.8

This is a straightforward forward pass computation to obtain the error. E.g., with ReLU activation,

    I^(1) = W^(1) x = (1, 1)^T,
    Z^(1) = σ(I^(1)) = (max(1, 0), max(1, 0))^T = (1, 1)^T,
    I^(2) = W^(2) Z^(1) = (1, 1)(1, 1)^T = 2.

The prediction is

    ŷ = sigmoid(I^(2)) = 1/(1 + e^{−I^(2)}) = 1/(1 + e^{−2}) = 0.8807971.

The error is

    δ^(2) = e = y − ŷ = 1 − 0.8807971 = 0.1192029.

Applying one step of the back-propagation algorithm (multivariate chain rule), the weight update is

    ΔW^(2) = −γ δ^(2) (Z^(1))^T = −γ (0.1192029, 0.1192029),

where γ is the learning rate. The bias update is Δb^(2) = −γ δ^(2) = −γ · 0.1192029.

2 Programming Related Questions*

Exercise 4.9

Consider the following dataset (taken from Anscombe’s quartet):

(x1, y1) = (10.0, 9.14), (x2, y2) = (8.0, 8.14), (x3, y3) = (13.0, 8.74),
(x4, y4) = (9.0, 8.77), (x5, y5) = (11.0, 9.26), (x6, y6) = (14.0, 8.10),
(x7, y7) = (6.0, 6.13), (x8, y8) = (4.0, 3.10), (x9, y9) = (12.0, 9.13),
(x10, y10) = (7.0, 7.26), (x11, y11) = (5.0, 4.74).

a. Use a neural network library of your choice to show that a feedforward


network with one hidden layer consisting of one unit and a feedforward
network with no hidden layers, each using only linear activation functions, do
not outperform linear regression based on ordinary least squares (OLS).
b. Also demonstrate that a neural network with a hidden layer of three
neurons using the tanh activation function and an output layer using the linear
activation function captures the non-linearity and outperforms the linear
regression.

Solution 4.9

Worked solutions are provided in the Notebook ancombes_4_9.ipynb.
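The following hedged sketch illustrates part (a) only: a closed-form OLS fit and a Keras network with a single linear unit and no hidden layer should recover essentially the same line, since a linear-activation network is just a reparametrized OLS model. It is not the graded solution in ancombes_4_9.ipynb; the optimizer, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np
import tensorflow as tf

x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Closed-form OLS
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS intercept/slope:", beta)

# Single linear unit, no hidden layer (equivalent to OLS up to optimization error)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear")])
model.compile(optimizer=tf.keras.optimizers.Adam(0.05), loss="mse")
model.fit(x.reshape(-1, 1), y, epochs=2000, verbose=0)
w, b = model.layers[0].get_weights()
print("NN slope/intercept:", w.ravel()[0], b[0])
print("OLS MSE:", np.mean((X @ beta - y) ** 2),
      "NN MSE:", model.evaluate(x.reshape(-1, 1), y, verbose=0))
```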

Exercise 4.10

Review the Python notebook deep_classifiers.ipynb. This notebook uses Keras to build three simple feedforward networks applied to the half-moons problem: a logistic regression (with no hidden layer); a feedforward network with one hidden layer; and a feedforward architecture with two hidden layers. The half-moons problem is not linearly separable in the original coordinates. However, you will observe, after plotting the fitted weights and biases, that a network with many hidden neurons gives a linearly separable representation of the classification problem in the coordinates of the output from the final hidden layer.
Complete the following questions in your own words.
a. Did we need more than one hidden layer to perfectly classify the half-
moons dataset? If not, why might multiple hidden layers be useful for other
datasets?

b. Why not use a very large number of neurons since it is clear that the
classification accuracy improves with more degrees of freedom?
c. Repeat the plotting of the hyperplane, in Part 1b of the notebook, only without
the ReLU function (i.e., activation=“linear”). Describe qualitatively how the
decision surface changes with increasing neurons. Why is a (non-linear)
activation function needed? The use of figures to support your answer is
expected.

Solution 4.10

• Only one hidden layer is needed. One reason why multiple hidden layers could
be needed on other datasets is because the input data are highly correlated. For
some datasets, it may be more efficient to use multiple layers and reduce the
overall number of parameters.
• Adding too many neurons in the hidden layer results in a network which
takes longer to train without an increase in the in-sample performance. It also
leads to over-fitting on out-of-sample test sets.
• The code and results should be submitted with activation=“linear” in the
hidden
layer. Plots of the hidden variables will show a significant difference
compared to using ReLU. Additionally, the performance of the network with
many neurons in the hidden layer, without activation, should be compared with
ReLU activated hidden neurons.

Exercise 4.11

Using the EarlyStopping callback in Keras, modify the notebook Deep_Classifiers.ipynb to terminate training under the following stopping criterion: |L^(k+1) − L^(k)| ≤ δ with δ = 0.1.
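One possible reading of this stopping rule with the Keras EarlyStopping callback is sketched below. Note that min_delta measures the improvement in the monitored quantity between epochs, so with patience=0 training halts at the first epoch whose loss improvement is below δ; the model and data names are placeholders for whatever is defined in the notebook.

```python
import tensorflow as tf

stopper = tf.keras.callbacks.EarlyStopping(
    monitor="loss",   # training loss L^(k)
    min_delta=0.1,    # delta in the exercise
    patience=0,       # stop at the first epoch whose improvement is below delta
    verbose=1,
)

# model.fit(X_train, y_train, epochs=200, callbacks=[stopper])  # as in the notebook
```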

Exercise 4.12***

Consider a feedforward neural network with three inputs, two units in the first
hidden layer, two units in the second hidden layer, and three units in the
output layer. The activation function for hidden layer 1 is ReLU, for hidden
layer 2 is sigmoid, and for the output layer is softmax.
The initial weights are given by the matrices

    W^(1) = ( 0.1  0.3  0.5 ),   W^(2) = ( 0.6  0.7 ),   W^(3) = ( 0.4  0.3 )
            ( 0.9  0.4  0.4 )             ( 0.7  0.2 )             ( 0.6  0.7 )
                                                                   ( 0.3  0.2 ),

and all the biases are unit vectors.

Assuming that the input (0.1, 0.7, 0.3) corresponds to the output (1, 0, 0), manually compute the updated weights and biases after a single epoch (forward + backward pass), clearly stating all derivatives that you have used. You should use a learning rate of 1.

As a practical exercise, you should modify the implementation of a stochastic gradient descent routine in the back-propagation Python notebook. Note that the notebook example corresponds to the example in Sect. 5, which uses sigmoid activated hidden layers only. Compare the weights and biases obtained by TensorFlow (or your ANN library of choice) with those obtained by your procedure after 200 epochs.

Solution 4.12

Worked solutions are provided in the back-propagation solution notebook.
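A compact numpy sketch of the single forward/backward pass is given below as a cross-check against the back-propagation solution notebook. It assumes the weight matrices reconstructed in the exercise statement, unit bias vectors, and a softmax output with cross-entropy loss; it is not a replacement for the worked notebook solution.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

x = np.array([0.1, 0.7, 0.3]); y = np.array([1.0, 0.0, 0.0]); lr = 1.0
W1 = np.array([[0.1, 0.3, 0.5], [0.9, 0.4, 0.4]]);   b1 = np.ones(2)
W2 = np.array([[0.6, 0.7], [0.7, 0.2]]);             b2 = np.ones(2)
W3 = np.array([[0.4, 0.3], [0.6, 0.7], [0.3, 0.2]]); b3 = np.ones(3)

# Forward pass
z1 = W1 @ x + b1;  a1 = relu(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
z3 = W3 @ a2 + b3; a3 = softmax(z3)      # predicted class probabilities

# Backward pass (softmax + cross-entropy gives delta3 = a3 - y)
d3 = a3 - y
d2 = (W3.T @ d3) * a2 * (1.0 - a2)       # sigmoid derivative
d1 = (W2.T @ d2) * (z1 > 0)              # ReLU derivative

W3 -= lr * np.outer(d3, a2); b3 -= lr * d3
W2 -= lr * np.outer(d2, a1); b2 -= lr * d2
W1 -= lr * np.outer(d1, x);  b1 -= lr * d1
print("updated W3:\n", W3)
```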


Chapter 5
Interpretability

Exercise 5.1*

Consider the following data generation process

Y = X1 + X2 + z, X1, X2, z ~ N(0, 1),

i.e. β0 = 0 and β1 = β2 = 1.

a. For this data, write down the mathematical expression for the sensitivities of
the fitted neural network when the network has
• zero hidden layers;
• one hidden layer, with n unactivated hidden units;
• one hidden layer, with n tanh activated hidden units;
• one hidden layer, with n ReLU activated hidden units; and
• two hidden layers, each with n tanh activated hidden units.

Solution 5.1
Y = X_1 + X_2 + z,  X_1, X_2, z ~ N(0, 1),  β_0 = 0, β_1 = β_2 = 1.

• Zero hidden layers:

In this situation, the sensitivities are simply

    ∂Ŷ/∂X_i = w_i^(1),   i ∈ {1, 2}.

Note that when there is one hidden layer, the sensitivities can be generally expressed as

    ∂Ŷ/∂X = W^(2) J(I^(1)) = W^(2) D(I^(1)) W^(1),

where D_{i,i}(I) = σ'(I_i) and D_{i,j≠i} = 0.

• One hidden layer, with n unactivated hidden units:

As the hidden units are unactivated, we can write σ(x) = x → σ'(x) = 1. Therefore D(I^(1)) is the identity matrix and

    ∂Ŷ/∂X = W^(2) W^(1).

• One hidden layer, with n tanh activated hidden units:

As the hidden units are tanh activated, we can write σ(x) = tanh(x) → σ'(x) = 1 − tanh²(x) = 1 − σ²(x). Therefore, we have that

    ∂Ŷ/∂X_j = Σ_i w_i^(2) (1 − σ²(I_i^(1))) w_{ij}^(1).

• One hidden layer, with n ReLU activated hidden units:

As the hidden units are ReLU activated, it follows that σ(x) = ReLU(x) → σ'(x) = H(x), the Heaviside function. Therefore, we have that

    ∂Ŷ/∂X_j = Σ_i w_i^(2) H(I_i^(1)) w_{ij}^(1).

• Two hidden layers, each with n tanh activated hidden units:

Since we have two hidden layers, the sensitivities can be expressed as

    ∂Ŷ/∂X = W^(3) D^(2) W^(2) D^(1) W^(1),

where

    D^(2)_{i,i} = D^(2)_{i,i}(I_i^(2)) = 1 − tanh²(I_i^(2)),  D^(2)_{i,j≠i} = 0,
    D^(1)_{i,i} = D^(1)_{i,i}(I_i^(1)) = 1 − tanh²(I_i^(1)),  D^(1)_{i,j≠i} = 0.
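As a quick sanity check on the one-hidden-layer tanh formula above, the small numpy sketch below (an illustration with randomly drawn weights, not the book's notebook) compares the analytic sensitivities with a central finite-difference derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 2, 5                                   # inputs, hidden units
W1, W2 = rng.normal(size=(n, p)), rng.normal(size=(1, n))

def f(x):
    return W2 @ np.tanh(W1 @ x)

x0 = rng.normal(size=p)
analytic = W2 @ np.diag(1.0 - np.tanh(W1 @ x0) ** 2) @ W1   # 1 x p sensitivities
eps = 1e-6
numeric = np.array([(f(x0 + eps * np.eye(p)[i]) - f(x0 - eps * np.eye(p)[i])) / (2 * eps)
                    for i in range(p)]).ravel()
print("analytic:", analytic.ravel())
print("numeric :", numeric)
```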

Exercise 5.2**

Consider the following data generation process:

    Y = X_1 + X_2 + X_1 X_2 + z,   X_1, X_2 ~ N(0, 1),  z ~ N(0, σ_n²),

i.e. β_0 = 0 and β_1 = β_2 = β_12 = 1, where β_12 is the interaction term. σ_n² is the variance of the noise and σ_n = 0.01.

a. For this data, write down the mathematical expression for the interaction term
(i.e. the off-diagonal components of the Hessian matrix) of the fitted neural
network when the network has
• zero hidden layers;
• one hidden layer, with n unactivated hidden units;
• one hidden layer, with n tanh activated hidden units;
• one hidden layer, with n ReLU activated hidden units; and
• two hidden layers, each with n tanh activated hidden units.
Why is the ReLU activated network problematic for estimating interaction terms?

Solution 5.2

See the notebook solution.

3 Programming Related Questions*

Exercise 5.3*

For the same problem in the previous exercise, use 5000 simulations to generate a regression training dataset for the neural network with one hidden layer. Produce a table showing how the mean and standard deviation of the sensitivities β_i behave as the number of hidden units is increased. Compare your result with tanh and ReLU activation. What do you conclude about which activation function to use for interpretability? Note that you should use the notebook Deep-Learning-Interpretability.ipynb as the starting point for experimental analysis.

Solution 5.3

See the notebook solution.



Exercise 5.4*

Generalize the sensitivities function in Exercise 5.3 to L layers for either tanh or ReLU activated hidden layers. Test your function on the data generation process given in Exercise 5.1.

Solution 5.4

See the notebook solution.
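One possible shape for such a generalization is sketched below. It follows the chain-rule expression W^(L+1) D^(L) W^(L) · · · D^(1) W^(1) from Solution 5.1; the absence of biases, the weight shapes, and the random test weights are simplifying assumptions, so treat it only as a starting point rather than the notebook solution.

```python
import numpy as np

def sensitivities(x, weights, activation="tanh"):
    """Return dY/dX for a feedforward net y = W_L s(... s(W_1 x)) with no biases."""
    h = np.asarray(x, dtype=float)
    J = np.eye(len(h))                        # running Jacobian dh/dx
    for W in weights[:-1]:
        z = W @ h
        if activation == "tanh":
            h, d = np.tanh(z), 1.0 - np.tanh(z) ** 2
        else:                                 # ReLU
            h, d = np.maximum(z, 0.0), (z > 0).astype(float)
        J = np.diag(d) @ W @ J
    return weights[-1] @ J                    # output layer is linear

rng = np.random.default_rng(1)
ws = [rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
print(sensitivities(rng.normal(size=2), ws, activation="tanh"))
```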

Exercise 5.5**

Fixing the total number of hidden units, how do the mean and standard deviation of the sensitivities β_i behave as the number of layers is increased? Your answer should compare using either tanh or ReLU activation functions. Note, do not mix the type of activation functions across layers. What do you conclude about the effect of the number of layers, keeping the total number of units fixed, on the interpretability of the sensitivities?

Solution 5.5

Y = X_1 + X_2 + X_1 X_2 + z,  X_1, X_2 ~ N(0, 1),  z ~ N(0, σ_n²),  β_0 = 0, β_1 = β_2 = β_12 = 1, σ_n = 0.01.

• Zero hidden layers:

In this situation, the interaction term is zero, as the sensitivity is independent of X.

Note that when there is just one hidden (activated) layer, the interaction term can be generally expressed as:

    ∂²Ŷ/∂X_i ∂X_j = W^(2) diag(W_i^(1)) D'(I^(1)) W_j^(1).

• One hidden layer with n unactivated hidden units:

As the hidden units are unactivated, we have

    ∂²Ŷ/∂X_i ∂X_j = 0.

• One hidden layer with n tanh activated hidden units:

As the hidden units are tanh activated, we can write:

    σ(x) = tanh(x),   ∂σ/∂x = 1 − tanh²(x),   ∂²σ/∂x² = −2 tanh(x)(1 − tanh²(x)).

Therefore, we have that

    ∂²Ŷ/∂X_i ∂X_j = W^(2) diag(W_i^(1)) D'(I^(1)) W_j^(1),

where D'_{i,i}(I_i^(1)) = −2 tanh(I_i^(1))(1 − tanh²(I_i^(1))) and D'_{i,j≠i}(I^(1)) = 0.

• One hidden layer, with n ReLU activated hidden units:

As the hidden units are ReLU activated, we write

    σ(x) = ReLU(x),   ∂σ/∂x = H(x), the Heaviside function,   ∂²σ/∂x² = δ(x), the Dirac delta function.

Therefore, it follows that

    ∂²Ŷ/∂X_i ∂X_j = W^(2) diag(W_i^(1)) D'(I^(1)) W_j^(1),

where D'_{i,i}(I_i^(1)) = δ(I_i^(1)) and D'_{i,j≠i}(I^(1)) = 0.

• Two hidden layers, each with n tanh activated hidden units:

Since we have two hidden layers, the interaction term can be expressed as:

    ∂²Ŷ/∂X_i ∂X_j = W^(3) [ (∂D^(2)/∂X_j) W^(2) D^(1) + D^(2) W^(2) (∂D^(1)/∂X_j) ] W_i^(1),

where, applying the chain rule, we have

    ∂D^(2)/∂X_j = D^(2)'(I^(2)) W^(2) D^(1) diag(W_j^(1)),
    ∂D^(1)/∂X_j = D^(1)'(I^(1)) diag(W_j^(1)).

As we mentioned previously, for tanh:

    σ(x) = tanh(x),   ∂σ/∂x = 1 − tanh²(x),   ∂²σ/∂x² = −2 tanh(x)(1 − tanh²(x)).

Therefore, substitute in the expressions for D^(2)' and D^(1)', where

    D^(2)'_{i,i}(I_i^(2)) = −2 tanh(I_i^(2))(1 − tanh²(I_i^(2))),  D^(2)'_{i,j≠i}(I^(2)) = 0,
    D^(1)'_{i,i}(I_i^(1)) = −2 tanh(I_i^(1))(1 − tanh²(I_i^(1))),  D^(1)'_{i,j≠i}(I^(1)) = 0.

• ReLU does not exhibit sufficient smoothness for the interaction term to be continuous.

Exercise 5.6**

For the same data generation process as the previous exercise, use 5000 simulations to generate a regression training set for the neural network with one hidden layer. Produce a table showing how the mean and standard deviation of the interaction term behave as the number of hidden units is increased, fixing all other parameters. What do you conclude about the effect of the number of hidden units on the interpretability of the interaction term? Note that you should use the notebook Deep-Learning-Interaction.ipynb as the starting point for experimental analysis.

Solution 5.6

See the notebook solution.


Part II
Sequential Learning
Chapter 6
Sequence Modeling

Exercise 6.1

Calculate the mean, variance, and autocorrelation function (ACF) of the following zero-mean AR(1) process:

    y_t = φ_1 y_{t−1} + z_t,

where φ_1 = 0.5. Determine whether the process is stationary by computing the root of the characteristic equation Φ(z) = 0.

Solution 6.1

• The mean of an AR(1) process with no drift term is:

    E[y_t] = E[φ_1 y_{t−1} + z_t] = φ_1 E[y_{t−1}] + E[z_t] = φ_1 E[y_{t−1}].

Since y_{t−1} = φ_1 y_{t−2} + z_{t−1} and E[z_t] = 0, then E[y_t] = φ_1² E[y_{t−2}] = · · · = φ_1^n E[y_{t−n}]. By the property of stationarity, we have |φ_1| < 1, which gives lim_{n→∞} φ_1^n E[y_{t−n}] = 0. Hence E[y_t] = 0.

• The variance of an AR(1) process with no drift term is as follows. First note that y_t = φ_1 L[y_t] + z_t can be written as an MA(∞) process by back-substitution:

    y_t = (1 + φ_1 L + φ_1² L² + . . . )[z_t].

Taking the expectation of y_t²:

    E[y_t²] = E[((1 + φ_1 L + φ_1² L² + . . . )[z_t])²]
            = (1 + φ_1² + φ_1⁴ + . . . ) E[z_t²] + E[cross terms]
            = σ²/(1 − φ_1²).

• The lag 1 autocovariance of an AR(1) process with no drift term is:

    γ_1 := E[y_t y_{t−1}]
         = E[(z_t + φ_1 z_{t−1} + φ_1² z_{t−2} + . . . )(z_{t−1} + φ_1 z_{t−2} + φ_1² z_{t−3} + . . . )]
         = E[φ_1 z²_{t−1} + φ_1³ z²_{t−2} + φ_1⁵ z²_{t−3} + · · · + cross terms]
         = φ_1 σ² + φ_1³ σ² + φ_1⁵ σ² + . . .
         = φ_1 σ²/(1 − φ_1²).

The lag 2 autocovariance is:

    γ_2 := E[y_t y_{t−2}]
         = E[(z_t + φ_1 z_{t−1} + φ_1² z_{t−2} + . . . )(z_{t−2} + φ_1 z_{t−3} + φ_1² z_{t−4} + . . . )]
         = E[φ_1² z²_{t−2} + φ_1⁴ z²_{t−3} + φ_1⁶ z²_{t−4} + · · · + cross terms]
         = φ_1² σ² + φ_1⁴ σ² + φ_1⁶ σ² + . . .
         = φ_1² σ²/(1 − φ_1²).

Keep iterating to arrive at γ_s = φ_1^s σ²/(1 − φ_1²). Since the autocorrelation is r_s = γ_s/γ_0, we have r_s = [φ_1^s σ²/(1 − φ_1²)] / [σ²/(1 − φ_1²)] = φ_1^s.

• Finally, we can write the AR(1) process as

    Φ(L)[y_t] = z_t,

where Φ(L) := (1 − φ_1 L). The root of Φ(z) for φ_1 = 0.5 is z = 2. Since |z| = 2 > 1, the process has no unit root and is stationary.
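A short simulation (an illustrative sketch, not part of the printed solution) confirms that the sample ACF decays approximately as 0.5^s:

```python
import numpy as np

rng = np.random.default_rng(42)
phi, T = 0.5, 100_000
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.standard_normal()

def acf(x, lag):
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

for s in range(1, 5):
    print(f"lag {s}: sample ACF = {acf(y, s):.3f}, theoretical = {phi**s:.3f}")
```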

Exercise 6.2

You have estimated the following ARMA(1,1) model for some time series data

yt = 0.036 + 0.69yt–1 + 0.42ut–1 + ut,

where you are given the data at time t – 1, yt–1 = 3.4 and uˆt–1 = –1.3. Obtain
the forecasts for the series y for times t, t + 1, t + 2 using the estimated ARMA
model.
If the actual values for the series are –0.032, 0.961, 0.203 for t, t +1, t +2,
calculate the out-of-sample Mean Squared Error (MSE) and Mean Absolute Error
(MAE).

Solution 6.2

E[y_t | Ω_{t−1}] = 0.036 + 0.69 E[y_{t−1} | Ω_{t−1}] + 0.42 E[u_{t−1} | Ω_{t−1}] + E[u_t | Ω_{t−1}]
               = 0.036 + 0.69 × 3.4 + 0.42 × (−1.3) + 0 = 1.836

E[y_{t+1} | Ω_{t−1}] = 0.036 + 0.69 E[y_t | Ω_{t−1}] + 0.42 E[u_t | Ω_{t−1}] + E[u_{t+1} | Ω_{t−1}]
                     = 0.036 + 0.69 × 1.836 + 0.42 × 0 + 0 = 1.30284

E[y_{t+2} | Ω_{t−1}] = 0.036 + 0.69 E[y_{t+1} | Ω_{t−1}] + 0.42 E[u_{t+1} | Ω_{t−1}] + E[u_{t+2} | Ω_{t−1}]
                     = 0.036 + 0.69 × 1.30284 + 0.42 × 0 + 0 = 0.9350

        Forecast    Actual    Absolute Diff.    Squared Diff.
t       1.836       −0.032    1.868             3.489
t+1     1.30284     0.961     0.342             0.117
t+2     0.9350      0.203     0.732             0.536

    MSE = (3.489 + 0.117 + 0.536)/3 = 1.381,
    MAE = (1.868 + 0.342 + 0.732)/3 = 0.981.
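The arithmetic above can be reproduced with a few lines (this is a pure cross-check; no estimation is performed):

```python
import numpy as np

y_prev, u_prev = 3.4, -1.3
f_t  = 0.036 + 0.69 * y_prev + 0.42 * u_prev   # 1.836
f_t1 = 0.036 + 0.69 * f_t                      # 1.30284
f_t2 = 0.036 + 0.69 * f_t1                     # 0.93496
forecasts = np.array([f_t, f_t1, f_t2])
actuals = np.array([-0.032, 0.961, 0.203])
print("forecasts:", forecasts)
print("MSE:", np.mean((forecasts - actuals) ** 2),
      "MAE:", np.mean(np.abs(forecasts - actuals)))
```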

Exercise 6.3

Derive the mean, variance, and autocorrelation function (ACF) of a zero mean
MA(1) process.

Solution 6.3

The mean of a MA(1) series with no drift term is:

E[yt ] = E[θ1ut–1 + ut ] = θ1E[ut–1] + E[ut ] = 0.


The variance of an MA(1) series with no drift term is:

    V[y_t] = E[y_t²] = E[(θ_1 u_{t−1} + u_t)²] = θ_1² E[u²_{t−1}] + E[u_t²] = (1 + θ_1²)σ².

The lag-1 autocovariance of an MA(1) series with no drift term is:

    γ_1 = E[y_t y_{t−1}] = E[(θ_1 u_{t−1} + u_t)(θ_1 u_{t−2} + u_{t−1})] = θ_1 E[u²_{t−1}] = θ_1 σ².

The lag-1 autocorrelation is:

    r_1 = γ_1/γ_0 = θ_1/(1 + θ_1²).

Exercise 6.4

Consider the following log-GARCH(1,1) model with a constant in the mean equation:

    y_t = µ + u_t,   u_t ~ N(0, σ_t²),
    ln(σ_t²) = α_0 + α_1 u²_{t−1} + β_1 ln σ²_{t−1}.

• What are the advantages of a log-GARCH model over a standard GARCH model?
• Estimate the unconditional variance of y_t for the values α_0 = 0.01, α_1 = 0.1, β_1 = 0.3.
• Derive an algebraic expression relating the conditional variance with the unconditional variance.
• Calculate the half-life of the model and sketch the forecasted volatility.

Solution 6.4

The log-GARCH model prevents a negative volatility, whereas this is possible under a GARCH model.

The unconditional variance of a GARCH model is

    σ² := var(u_t) = α_0 / (1 − (Σ_{i=1}^{q} α_i + Σ_{i=1}^{p} β_i)) = 0.01/(1 − (0.1 + 0.3)) = 0.01667.

A necessary condition for stationarity of GARCH(p,q) models is that

    Σ_{i=1}^{q} α_i + Σ_{i=1}^{p} β_i < 1,

which is satisfied since 0.1 + 0.3 = 0.4 < 1.

The l-step ahead forecast using a GARCH(1,1) model is:

    σ_t² = α_0 + α_1 u²_{t−1} + β_1 σ²_{t−1},

    σ̂²_{t+1} = α_0 + α_1 E[u_t² | Ω_{t−1}] + β_1 σ_t² = σ² + (α_1 + β_1)(σ_t² − σ²),

    σ̂²_{t+2} = α_0 + α_1 E[u²_{t+1} | Ω_{t−1}] + β_1 E[σ²_{t+1} | Ω_{t−1}] = σ² + (α_1 + β_1)²(σ_t² − σ²),

    σ̂²_{t+l} = α_0 + α_1 E[u²_{t+l−1} | Ω_{t−1}] + β_1 E[σ²_{t+l−1} | Ω_{t−1}] = σ² + (α_1 + β_1)^l (σ_t² − σ²),

which provides a relationship between the conditional variance σ_t² and the unconditional variance σ².

The half-life is given by

    K = ln(0.5)/ln(α_1 + β_1) = ln(0.5)/ln(0.4) = 0.7564.
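The forecasted volatility term structure implied by the last formula can be sketched as follows; the current conditional variance sigma2_t below is a hypothetical starting value, chosen only to illustrate the mean reversion toward the unconditional variance.

```python
import numpy as np

alpha0, alpha1, beta1 = 0.01, 0.1, 0.3
sigma2_bar = alpha0 / (1 - (alpha1 + beta1))      # unconditional variance ~ 0.01667
sigma2_t = 0.05                                   # assumed current conditional variance
half_life = np.log(0.5) / np.log(alpha1 + beta1)  # ~ 0.7564 periods

horizons = np.arange(1, 11)
forecast = sigma2_bar + (alpha1 + beta1) ** horizons * (sigma2_t - sigma2_bar)
print("half-life:", round(half_life, 4))
for l, f in zip(horizons, forecast):
    print(f"l = {l:2d}: sigma^2 forecast = {f:.5f}")
```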

Exercise 6.5

Consider the simple moving average (SMA)

    S_t = (X_t + X_{t−1} + X_{t−2} + . . . + X_{t−N+1}) / N,

and the exponential moving average (EMA), given by E_1 = X_1 and, for t ≥ 2,

    E_t = α X_t + (1 − α) E_{t−1},

where N is the time horizon of the SMA and the coefficient α represents the degree of weighting decrease of the EMA, a constant smoothing factor between 0 and 1. A higher α discounts older observations faster.

a. Suppose that, when computing the EMA, we stop after k terms, instead of going all the way back to the initial value. What fraction of the total weight is omitted?
b. Suppose that we require 99.9% of the weight. What k do we require?
c. Show that, by picking α = 2/(N + 1), one achieves the same center of mass in the EMA as in the SMA with the time horizon N.
d. Suppose that we have set α = 2/(N + 1). Show that the first N points in an EMA represent about 86.5% of the total weight.

Solution 6.5

a. The omitted fraction is

    (weight omitted by stopping after k terms) / (total weight)
    = α[(1 − α)^k + (1 − α)^{k+1} + (1 − α)^{k+2} + . . . ] / α[1 + (1 − α) + (1 − α)² + . . . ]
    = [α(1 − α)^k / (1 − (1 − α))] / [α / (1 − (1 − α))]
    = (1 − α)^k.

b. To have 99.9% of the weight, set the above ratio equal to 0.1% and solve for k:

    k = ln(0.001)/ln(1 − α)

to determine how many terms should be used.

c. The weights of an N-day SMA have a center of mass on the R_SMA-th day, where

    R_SMA = (N + 1)/2,

or R_SMA = (N − 1)/2 if we use zero-based indexing. We shall stick with the one-based indexing. The weights of an EMA have the center of mass

    R_EMA = α Σ_{k=1}^{∞} k(1 − α)^{k−1}.

From the Maclaurin series

    1/(1 − x) = Σ_{k=0}^{∞} x^k,

taking derivatives of both sides, we get

    (x − 1)^{−2} = Σ_{k=0}^{∞} k x^{k−1}.

Substituting x = 1 − α, we get

    R_EMA = α (α)^{−2} = α^{−1}.

So the value of α that sets R_EMA = R_SMA is

    α_EMA = 2/(N_SMA + 1).

d. From the formula for the sum of a geometric series, we obtain that the sum of the weights of all the terms (i.e., the infinite number of terms) in an EMA is 1. The sum of the weights of the first N terms is 1 − (1 − α)^{N+1}, so the weight omitted after N terms is given by 1 − [1 − (1 − α)^{N+1}] = (1 − α)^{N+1}. Substituting α = 2/(N + 1) and making use of lim_{n→∞} (1 + a/n)^n = e^a, we get

    lim_{N→∞} [1 − (1 − 2/(N + 1))^{N+1}] = 1 − e^{−2} ≈ 0.8647.

Exercise 6.6

Suppose that, for the sequence of random variables {y_t}_{t=0}^{∞}, the following model holds:

    y_t = µ + φ y_{t−1} + z_t,   |φ| ≤ 1,   z_t ~ i.i.d.(0, σ²).

Derive the conditional expectation E[y_t | y_0] and the conditional variance V[y_t | y_0].

Solution 6.6

As

    y_1 = µ + φ y_0 + z_1,
    y_2 = µ + φ y_1 + z_2 = µ(1 + φ) + φ² y_0 + φ z_1 + z_2,

iterating gives

    y_t = µ Σ_{i=0}^{t−1} φ^i + φ^t y_0 + Σ_{i=1}^{t} φ^{t−i} z_i.

Hence

    E[y_t | y_0] = µ Σ_{i=0}^{t−1} φ^i + φ^t y_0 + Σ_{i=1}^{t} φ^{t−i} E[z_i] = µ Σ_{i=0}^{t−1} φ^i + φ^t y_0,

and

    V[y_t | y_0] = V[Σ_{i=1}^{t} φ^{t−i} z_i] = Σ_{i=1}^{t} V[φ^{t−i} z_i] = Σ_{i=1}^{t} φ^{2(t−i)} V[z_i] = σ² Σ_{i=0}^{t−1} φ^{2i}.
Chapter 7
Probabilistic Sequence Modeling

Exercise 7.1: Kalman Filtering of Autoregressive Moving Average ARM A(p, q)


Model

The autoregressive moving average ARMA(p, q) model can be written as

    y_t = φ_1 y_{t−1} + . . . + φ_p y_{t−p} + η_t + θ_1 η_{t−1} + . . . + θ_q η_{t−q},

where η_t ~ N(0, σ²), and includes as special cases all AR(p) and MA(q) models. Such models are often fitted to financial time series. Suppose that we would like to filter this time series using a Kalman filter. Write down suitable process and observation models.

Solution 7.1

[Kalman Filtering of the Autoregressive Moving Average ARMA(p, q) Model] Set m := max(p, q + 1), φ_i := 0 for i > p, θ_i := 0 for i > q. Then we obtain our process model as

    X_t = F_t X_{t−1} + a_t + W_t w_t,

and the observation model as

    Y_t = H_t X_t + b_t + V_t v_t,

where

    X_t = ( y_t,
            φ_2 y_{t−1} + . . . + φ_p y_{t−m+1} + θ_1 η_t + . . . + θ_{m−1} η_{t−m+2},
            φ_3 y_{t−1} + . . . + φ_p y_{t−m+2} + θ_2 η_t + . . . + θ_{m−1} η_{t−m+3},
            . . . ,
            φ_m y_{t−1} + θ_{m−1} η_t )^T ∈ R^{m×1},

    F = ( φ_1      1  0  · · ·  0
          φ_2      0  1  · · ·  0
          . . .              . . .
          φ_{m−1}  0  0  · · ·  1
          φ_m      0  0  · · ·  0 ) ∈ R^{m×m},

    W = (1, θ_1, . . . , θ_{m−1})^T ∈ R^{m×1},

    w_t = η_t,  Q_t = σ²,  H = (1, 0, . . . , 0) ∈ R^{1×m},  b_t = 0,  V_t = 0.

If y_t is stationary, then X_t ~ N(0, P) with P given by the equation P = FPF^T + σ² WW^T, so we can set the initial state and error covariance to 0 and P, respectively.

Exercise 7.2: The Ornstein-Uhlenbeck Process

Consider the one-dimensional Ornstein–Uhlenbeck (OU) process, the stationary


Gauss–Markov process given by the SDE

dXt = θ(µ – Xt ) dt + σ dWt,

where Xt e R, X0 = x0, and θ > 0, µ, and σ > 0 are constants. Formulate


the Kalman process model for this process.

Solution 7.2

[The Ornstein–Uhlenbeck Process] The solution to the SDE is given by

    X_t = x_0 e^{−θt} + µ(1 − e^{−θt}) + ∫_0^t σ e^{−θ(t−u)} dW_u.

An Itô integral, ∫_s^t f(u) dW_u, of a deterministic integrand, f, is a Gaussian random variable with mean 0 and variance ∫_s^t f²(u) du. In our case, f(u) = σ e^{−θ(t−u)}, and

    ∫_0^t f²(u) du = (σ²/(2θ))(1 − e^{−2θt}).

Since this Markov process is homogeneous, its transition density depends only upon the time difference. Setting, for s ≤ t, h_k := t − s as the time interval between the time ticks k − 1 and k, we obtain a discretized process model

    X_k = F_k X_{k−1} + a_k + w_k,

with F_k = e^{−θ h_k}, a_k = µ(1 − e^{−θ h_k}), w_k ~ N(0, (σ²/(2θ))(1 − e^{−2θ h_k})).

As a further exercise, consider extending this to a multivariate OU process.

Exercise 7.3: Deriving the Particle Filter for Stochastic Volatility with
Leverage and Jumps

We shall regard the log-variance x_t as the hidden states and the log-returns y_t as observations. How can we use the particle filter to estimate x_t on the basis of the observations y_t?

a. Show that, in the absence of jumps,

    x_t = µ(1 − φ) + φ x_{t−1} + σ_v ρ y_{t−1} e^{−x_{t−1}/2} + σ_v √(1 − ρ²) ξ_{t−1}

for some ξ_t ~ i.i.d. N(0, 1).

b. Show that

    p(z_t | x_t, y_t) = δ(z_t − y_t e^{−x_t/2}) P[J_t = 0 | x_t, y_t] + φ(z_t; µ_{x_t|J_t=1}, σ²_{x_t|J_t=1}) P[J_t = 1 | x_t, y_t],

where

    µ_{x_t|J_t=1} = y_t exp(x_t/2) / (exp(x_t) + σ_J²)

and

    σ²_{x_t|J_t=1} = σ_J² / (exp(x_t) + σ_J²).

c. Explain how you could implement random sampling from the probability distribution given by the density p(z_t | x_t, y_t).
d. Write down the probability density p(x_t | x_{t−1}, y_{t−1}, z_{t−1}).
e. Explain how you could sample from this distribution.
f. Show that the observation density is given by

    p(y_t | x̂^(i)_{t|t−1}, p, σ_J²) = (1 − p) (2π e^{x̂^(i)_{t|t−1}})^{−1/2} exp( −y_t² / (2 e^{x̂^(i)_{t|t−1}}) )
        + p (2π(e^{x̂^(i)_{t|t−1}} + σ_J²))^{−1/2} exp( −y_t² / (2(e^{x̂^(i)_{t|t−1}} + σ_J²)) ).
Solution 7.3

[Deriving the Particle Filter for Stochastic Volatility with Leverage and Jumps]

a. The Cholesky decomposition of

    Σ = ( 1  ρ
          ρ  1 )

can be written as Σ = LL^T, where

    L = ( 1          0
          ρ  √(1 − ρ²) ).

Set ζ_t ~ N(0, 1), independent of z_t and y_t. Then we can write

    (z_t, y_t)^T = L (z_t, ζ_t)^T,

so

    y_t = ρ z_t + √(1 − ρ²) ζ_t

and

    Var[(z_t, y_t)^T] = Var[L (z_t, ζ_t)^T] = L Var[(z_t, ζ_t)^T] L^T = L I L^T = Σ,

as required. Substituting this into the process model, we get

    x_{t+1} = µ(1 − φ) + φ x_t + σ_v y_t
            = µ(1 − φ) + φ x_t + σ_v (ρ z_t + √(1 − ρ²) ζ_t)
            = µ(1 − φ) + φ x_t + σ_v ρ z_t + σ_v √(1 − ρ²) ζ_t.

Since, in the absence of jumps, z_t = y_t e^{−x_t/2}, the result follows.

b. By the Law of Total Probability, we can write

    p(z_t | x_t, y_t) = p(z_t | J_t = 0, x_t, y_t) P[J_t = 0 | x_t, y_t] + p(z_t | J_t = 1, x_t, y_t) P[J_t = 1 | x_t, y_t].

It is clear that, if the process does not jump, there is a Dirac delta mass at a single point, y_t e^{−x_t/2}, so

    p(z_t | J_t = 0, x_t, y_t) = δ(z_t − y_t e^{−x_t/2}).

Let us find p(z_t | J_t = 1, x_t, y_t). By Bayes's theorem,

    p(z_t | J_t = 1, x_t, y_t) ∝ p(y_t | J_t = 1, x_t, z_t) p(z_t) = φ(y_t; z_t exp(x_t/2), σ_J²) φ(z_t; 0, 1).

Taking the natural logarithm,

    ln p(z_t | J_t = 1, x_t, y_t) = C − (1/2)(y_t − z_t exp(x_t/2))²/σ_J² − (1/2) z_t²     (22)

for some constant C. We recognise in this form the log-pdf of the normal distribution,

    ln φ(z_t; µ_{x_t|J_t=1}, σ²_{x_t|J_t=1}) = C − (1/2)(z_t − µ_{x_t|J_t=1})²/σ²_{x_t|J_t=1}.     (23)

We find the parameters µ_{x_t|J_t=1} and σ²_{x_t|J_t=1} by equating the coefficients of z_t² and z_t in the above two equations as follows. From the first equation,

    ln φ(z_t; µ_{x_t|J_t=1}, σ²_{x_t|J_t=1}) = z_t² [ −exp(x_t)/(2σ_J²) − 1/2 ] + z_t [ y_t exp(x_t/2)/σ_J² ] + constant term.

From the second,

    ln φ(z_t; µ_{x_t|J_t=1}, σ²_{x_t|J_t=1}) = z_t² [ −1/(2σ²_{x_t|J_t=1}) ] + z_t [ µ_{x_t|J_t=1}/σ²_{x_t|J_t=1} ] + constant term.

Equating the coefficients of z_t² in the above two equations, we get

    −exp(x_t)/(2σ_J²) − 1/2 = −1/(2σ²_{x_t|J_t=1}),

hence

    σ²_{x_t|J_t=1} = σ_J²/(exp(x_t) + σ_J²).

Equating the coefficients of z_t, we get

    y_t exp(x_t/2)/σ_J² = µ_{x_t|J_t=1}/σ²_{x_t|J_t=1} = µ_{x_t|J_t=1}(exp(x_t) + σ_J²)/σ_J²   (by the above result),

hence

    µ_{x_t|J_t=1} = y_t exp(x_t/2)/(exp(x_t) + σ_J²).

The result follows.

c. Since

    x_t = µ(1 − φ) + φ x_{t−1} + σ_v ρ z_{t−1} + σ_v √(1 − ρ²) ζ_{t−1},

for some ζ_t ~ i.i.d. N(0, 1), we have

    p(x_t | x_{t−1}, y_{t−1}, z_{t−1}) = φ( x_t; µ(1 − φ) + φ x_{t−1} + σ_v ρ z_{t−1}, σ_v²(1 − ρ²) ).

d. We first sample z_{t−1} from p(z_{t−1} | x_{t−1}, y_{t−1}), then sample from the above normal pdf.

e. At time t, the jump occurs with probability p and does not occur with probability 1 − p. When the jump does not occur, y_t is normally distributed with mean 0 and variance e^{x_t}. When the jump does occur, it is also normally distributed, since the sum Z = X + Y of two independent normal random variables, X ~ N(µ_X, σ_X²) and Y ~ N(µ_Y, σ_Y²), is itself a normal random variable, Z ~ N(µ_X + µ_Y, σ_X² + σ_Y²). In our case, the two normal random variables are z_t e^{x_t/2} ~ N(0, e^{x_t}) and a_t ~ N(0, σ_J²). The result follows by the Law of Total Probability.

Exercise 7.4: The Viterbi algorithm and an occasionally dishonest casino

The dealer has two coins, a fair coin, with P(Heads) = 1/2, and a loaded coin, with P(Heads) = 4/5. The dealer starts with the fair coin with probability 3/5. The dealer then tosses the coin several times. After each toss, there is a 2/5 probability of a switch to the other coin. The observed sequence is Heads, Tails, Tails, Heads, Tails, Heads, Heads, Heads, Tails, Heads. Run the Viterbi algorithm to determine which coin the dealer was most likely using for each coin toss.

Solution 7.4

The solution is presented in notebook viterbi_7_4.ipynb.

    max_probability: 1.146617856e-05,
    steps: ['Fair', 'Fair', 'Fair', 'Fair', 'Fair',
            'Loaded', 'Loaded', 'Loaded', 'Fair', 'Loaded']
Chapter 8
Advanced Neural
Networks

Exercise 8.1*

Calculate the half-life of the following univariate RNN:

    x̂_t = W_y z_{t−1} + b_y,
    z_{t−1} = tanh(W_z z_{t−2} + W_x x_{t−1} + b_h),

where W_y = 1, W_z = W_x = 0.5, b_h = 0.1 and b_y = 0.

Solution 8.1

The lag 1 unit impulse gives xˆt = tanh(0.5 + 0.1) and the intermediate
calculation of the ratios, r(k), are shown in Table 2, showing that the half-life is 4
periods.
Lag k r(k)
1 1.00
2 0.657
3 0.502
4 0.429

Table 2: The half-life characterizes the memory decay of the architecture by


measur- ing the number of periods before a lagged unit impulse has at least half of
its effect at lag 1. The calculation of the half-life involves nested composition of the
recursion relation for the hidden layer until r(k) is less than a half. In this example,
the half-life is 4 periods.


Question 8.2: Recurrent Neural Networks

• State the assumptions needed to apply plain RNNs to time series data.
• Show that a linear RNN(p) model with bias terms in both the output layer and the hidden layer can be written in the form

    ŷ_t = µ + Σ_{i=1}^{p} φ_i y_{t−i},

and state the form of the coefficients {φ_i}.
• State the conditions on the activation function and weights of a plain RNN under which the model is stable (i.e. the effect of lagged inputs does not grow).

Solution 8.2

• The data should not be i.i.d., but autocorrelated. The data should be stationary.
• Start by assuming that W_z^(1) = φ_z, with |φ_z| < 1, W_x^(1) = φ_x and W_y = 1, but keep the extra bias terms b_h and b_y. Then follow the example in the chapter, retaining the bias terms, so that µ = b_y + b_h Σ_{i=1}^{p} φ_z^{i−1} and

    ŷ_t = µ + Σ_{i=1}^{p} φ_i y_{t−i},

with φ_i = φ_x φ_z^{i−1}.
• The constraint |σ(x)| ≤ 1 on the activation function must hold for stability of the RNN.

Exercise 8.3*

Using Jensen’s inequality, calculate the lower bound on the partial


autocovariance function of the following zero-mean RNN(1) process:

yt = σ(øy t – 1 ) + ut,

for some monotonically increasing, positive and convex activation function,


σ (x) and positive constant ø. Note that Jensen’s inequality states that E[g(X)] ≥
g(E[X]) for any convex function g of a random variable X.

Solution 8.3

The lag 1 partial autocovariance is the same as the lag-1 autocovariance:



E[yt, yt–1] = E[σ(øy t – 1 ) + ut, yt–1]


= E[σ(øyt–1), yt–1]
= E[g(yt–1)]
≥ g(E[yt–1]) = σ(øE[yt–1])E[yt–1],

where we have made use of Jensen’s inequality under the property that g(x) =
σ(øx)x is convex in x. Convexity holds if σ(x) is monotonic increasing and convex
and x is non-negative.

Exercise 8.4*

Show that the discrete convolution of the input sequence X = {3, 1, 2} and the
filter
F = {3, 2, 1} given by Y = X ∗ F where
Õ∞
yi = X ∗ Fi = x j Fi–j
j=–∞

is Y = {9, 9, 11, 5, 2}.

Solution 8.4

y_0 = x_0 F_0 = 3 × 3 = 9,
y_1 = x_0 F_1 + x_1 F_0 = 3 × 2 + 1 × 3 = 9,
y_2 = x_0 F_2 + x_1 F_1 + x_2 F_0 = 3 × 1 + 1 × 2 + 2 × 3 = 11,
y_3 = x_1 F_2 + x_2 F_1 = 1 × 1 + 2 × 2 = 5,
y_4 = x_2 F_2 = 2 × 1 = 2.
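numpy's full convolution confirms the hand computation:

```python
import numpy as np
print(np.convolve([3, 1, 2], [3, 2, 1]))   # -> [ 9  9 11  5  2]
```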

Exercise 8.5*

Show that the discrete convolution xˆt = F ∗ xt defines a univariate AR(p) if a p-


width filter is defined as Fj := øj for some constant parameter ø.

Solution 8.5

Substituting the definition of the filter and expanding the discrete convolution gives:

    x̂_t = F ∗ x_t = Σ_{j=1}^{p} F_j x_{t−j} = φ x_{t−1} + φ² x_{t−2} + · · · + φ^p x_{t−p},

which is a special case of an AR(p) with geometrically decaying coefficients when |φ| < 1.

4 Programming Related Questions*

Exercise 8.6***

Modify the RNN notebook to predict coindesk prices using a univariate RNN
applied to the data c o i n d e s k . c s v . Then complete the following tasks
a. Determine whether the data is stationary by applying the augmented Dickey-
Fuller test.
b. Estimate the partial autocorrelation and determine the optimum lag at the 99% confidence level. Note that you will not be able to draw conclusions if your data is not stationary. Choose the sequence length to be equal to this optimum lag.
c. Evaluate the MSE in-sample and out-of-sample as you vary the number of
hidden
neurons. What do you conclude about the level of over-fitting?
d. Apply L1 regularization to reduce the variance.
e. How does the out-of-sample performance of a plain RNN compare with that
of a GRU?
f. Determine whether the model error is white noise or is auto-correlated by
applying
the Ljung Box test.

Solution 8.6

See the notebook solution.
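A hedged starting point for parts (a) and (b) using statsmodels is sketched below; the column name "Close" and the CSV path are assumptions about coindesk.csv, and the 99% band for the PACF is the usual large-sample approximation.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, pacf

prices = pd.read_csv("coindesk.csv")["Close"].dropna()

adf_stat, p_value, *_ = adfuller(prices)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")
# If the p-value is large, difference (or take log-returns) and re-test before using the PACF.

partial_acf = pacf(prices, nlags=20)
band = 2.58 / np.sqrt(len(prices))     # approximate 99% confidence band
optimum_lag = max((k for k, v in enumerate(partial_acf) if k > 0 and abs(v) > band), default=1)
print("largest significant lag at ~99% level:", optimum_lag)
```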

Exercise 8.7***

Modify the CNN 1D time series notebook to predict high frequency mid-prices with a single hidden layer CNN, using the data HFT.csv. Then complete the following tasks:
a. Confirm that the data is stationary by applying the augmented Dickey-Fuller test.
b. Estimate the partial autocorrelation and determine the optimum lag at the 99% confidence level.
c. Evaluate the MSE in-sample and out-of-sample using 4 filters. What do you conclude about the level of over-fitting as you vary the number of filters?
d. Apply L1 regularization to reduce the variance.
e. Determine whether the model error is white noise or is auto-correlated by applying the Ljung-Box test.

Hint: You should also review the HFT RNN notebook before you begin this
exercise.

Solution 8.7

See the notebook solution.


Part III
Sequential Data with Decision-Making
Chapter 9
Introduction to Reinforcement learning

Exercise 9.1

Consider an MDP with a reward function r_t = r(s_t, a_t). Let Q^π(s, a) be an action-value function for policy π for this MDP, and π*(a|s) = arg max_π Q^π(s, a) be an optimal greedy policy. Assume we define a new reward function as an affine transformation of the previous reward: r̃_t = w r_t + b, with constant parameters b and w > 0. How does the new optimal policy π̃* relate to the old optimal policy π*?

Solution 9.1

Using the definition of the action-value function as a conditional expectation of discounted rewards, we find that under an affine transformation of rewards r_t → w r_t + b (where w > 0 and b are fixed numbers), the action-value function transforms as follows: Q^π(s, a) → Q̃^π(s, a) = w Q^π(s, a) + f(s), where f(s) is a function of the current state s. As f(s) does not depend on actions a, the optimal greedy policies obtained with Q^π(s, a) and Q̃^π(s, a) are clearly the same: π*(a|s) = arg max_π Q^π(s, a) = arg max_π Q̃^π(s, a).

Exercise 9.2

With True/False questions, give a short explanation to support your answer.


• True/False: Value iteration always find the optimal policy, when run to
conver- gence. [3]
• True/False: Q-learning is an on policy learning (value iteration) algorithm
and estimates updates to the state-action-value function, Q(s, a) using actions
taken under the current policy π. [5]


• For Q-learning to converge we need to correctly manage the exploration vs.


exploitation tradeoff. What property needs to hold for the exploration
strategy? [4]
• True/False: Q-learning with linear function approximation will always
converge to the optimal policy. [2]

Solution 9.2

• True: Value iteration always finds the optimal policy when run to convergence (this is a result of the Bellman equations being a contraction mapping).
• False: Q-learning is an off-policy learning algorithm: the Q(s, a) function is learned from different actions (for example, random actions). We don't even need a policy at all to learn Q(s, a):

    Q(s, a) ← Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a)),

where a′ ranges over all actions that can be probed in state s′ (not only actions under the current policy).
• In the limit, every action needs to be tried sufficiently often in every possible state. This can be guaranteed with a sufficiently permissive exploration strategy.
• False. It may not even be able to represent the optimal policy due to approximation error.

Exercise 9.3*
Consider the following toy cash buffer problem. An investor owns a stock, initially valued at S_{t_0} = 1, and must ensure that their wealth (stock + cash) is not less than a certain threshold K at time t = t_1. Let W_t = S_t + C_t denote their wealth at time t, where C_t is the total cash in the portfolio. If the wealth W_{t_1} < K = 2 then the investor is penalized with a −10 reward. The investor chooses to inject either 0 or 1 units of cash, with a respective penalty of 0 or −1 (which is not deducted from the fund).

Dynamics: The stock price follows a discrete Markov chain with P(S_{t+1} = s | S_t = s) = 0.5, i.e. with probability 0.5 the stock remains at the same price over the time interval, and P(S_{t+1} = s + 1 | S_t = s) = P(S_{t+1} = s − 1 | S_t = s) = 0.25. If the wealth moves off the grid it simply bounces to the nearest value in the grid at that time. The states are grid squares, identified by their row and column number (row first). The investor always starts in state (1,0) (i.e. the initial wealth W_{t_0} = 1 at time t_0 = 0; there is no cash in the fund) and both states in the last column (i.e., at time t = t_1 = 1) are terminal (Table 3).

Using the Bellman equation (with generic state notation), give the first round of value iteration updates for each state by completing the table below. You may ignore the time value of money, i.e., set γ = 1:

    V_{i+1}(s) = max_a Σ_{s′} T(s, a, s′)(R(s, a, s′) + γ V_i(s′)).
w    t0   t1
2    0    0
1    0    −10

Table 3: The reward function depends on fund wealth w and time.

(w,t)     (1,0)   (2,0)
V_0(w)    0       0
V_1(w)    ?       NA

Solution 9.3

The stock price sequence {S_{t_0}, S_{t_1}} is a Markov chain with one transition period. The transition probabilities are denoted p_{ji} = P(S_{t+1} = s_j | S_t = s_i), with states s_1 = 1, s_2 = 2.

The wealth sequence {W_{t_0}, W_{t_1}} is an MDP. To avoid confusion in notation with the stock states, let us denote the wealth states w ∈ {w_1 = 1, w_2 = 2} (instead of the usual s_1, s_2, . . . ). The actions are a ∈ A := {a_0 = 0, a_1 = 1}. The transition matrix for this MDP is

    T_{ijk} := T(w_i, a_k, w_j) := P(W_{t+1} = w_j | W_t = w_i, a_t = a_k).

From the problem constraints and the Markov chain transition probabilities p_{ij} we can write, for action a_0 (no cash injection):

    T_{,,0} = ( 0.75  0.25
                0.25  0.75 ),

and for action a_1 (one dollar of cash):

    T_{,,1} = ( 0.25  0.75
                0.0   1.0 ).

We can now evaluate the rewards for the case when W_{t_0} = w_1 = 1 (which is the only starting state). The rewards are R(w_1, 0, w_1) = −10, R(w_1, 0, w_2) = 0, R(w_1, 1, w_1) = −11, R(w_1, 1, w_2) = −1.

Using the Bellman equation, the value update from state W_{t_0} = w_1 is

    V_1(w_1) = max_{a∈A} Σ_{j=1}^{2} T(w_1, a, w_j) R(w_1, a, w_j)
             = max{ T(w_1, 0, w_1) R(w_1, 0, w_1) + T(w_1, 0, w_2) R(w_1, 0, w_2),
                    T(w_1, 1, w_1) R(w_1, 1, w_1) + T(w_1, 1, w_2) R(w_1, 1, w_2) }
             = max(0.75 × (−10) + 0.25 × 0, 0.25 × (−11) + 0.75 × (−1))
             = −3.5,

attained by action a_{t_0} = 1. Note that V_1(w_2) does not exist, as the wealth cannot transition from w_2 at time t_0.
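A compact numerical replication of this value update is sketched below; the array encoding simply mirrors the transition and reward values written out above, with a placeholder row for the starting wealth w_2 (which is never used).

```python
import numpy as np

# T[a, i, j] = P(W_{t+1} = w_j | W_t = w_i, action a), wealth states w_1 = 1, w_2 = 2
T = np.array([[[0.75, 0.25],
               [0.25, 0.75]],      # a_0: no cash injection
              [[0.25, 0.75],
               [0.00, 1.00]]])     # a_1: inject one dollar
# R[a, i, j]: reward for starting wealth w_i, action a, next wealth w_j
# (only the w_1 rows matter, since w_1 is the only starting state)
R = np.array([[[-10.0, 0.0],
               [  0.0, 0.0]],
              [[-11.0, -1.0],
               [  0.0, 0.0]]])

V0 = np.zeros(2)                    # terminal values
q = (T * (R + V0)).sum(axis=2)      # q[a, i] = sum_j T(w_i, a, w_j)(R + V0(w_j))
print("Q-values from w_1:", q[:, 0])   # [-7.5, -3.5]
print("V_1(w_1) =", q[:, 0].max())     # -3.5, attained by a_1
```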

Exercise 9.4*

Consider the following toy cash buffer problem. An investor owns a stock,
initially valued at St0 = 1, and must ensure that their wealth (stock + cash) does
not fall below a threshold K = 1 at time t = t1 . The investor can choose to
either sell the stock or inject more cash, but not both. In the former case, the
sale of the stock at time t results in an immediate cash update st (you may
ignore transactions costs). If the investor chooses to inject a cash amount ct e {0,
1}, there is a corresponding penalty of – ct (which is not taken from the fund).
Let Wt = St + Ct denote their wealth at time t, where Ct is the total cash in
the portfolio.
Dynamics The stock price follows a discrete Markov chain with P(St+1 = s |
St =
s) = 0.5, i.e., with probability 0.5 the stock remains the same price over the
time interval. P(St +1 = s + 1 | St = s) = P(St +1 = s – 1 | St = s) = 0.25. If the
wealth moves off the grid it simply bounces to the nearest value in the grid at
that time. The states are grid squares, identified by their row and column
number (row first). The investor always starts in state (1,0) (i.e. the initial
wealth Wt0 = 1 at time t0 = 0— there is no cash in the fund) and both states in
the last column (i.e., at time t = t1 = 1) are terminal.

w t0 t1
1 0 0
0 0 –10

Table 4: The reward function depends on fund wealth w and time.

Using the Bellman equation (with generic state notation), give the first round of value iteration updates for each state by completing the table below. You may ignore the time value of money, i.e., set γ = 1:

    V_{i+1}(s) = max_a Σ_{s′} T(s, a, s′)(R(s, a, s′) + γ V_i(s′)).

(w,t)     (0,0)   (1,0)
V_0(w)    0       0
V_1(w)    ?       ?

Solution 9.4

The stock price sequence {S_{t_0}, S_{t_1}} is a Markov chain with one transition period. The transition probabilities are denoted p_{ji} = P(S_{t+1} = s_j | S_t = s_i), with states s_0 = 0, s_1 = 1, s_2 = 2.

The wealth sequence {W_{t_0}, W_{t_1}} is an MDP. Let us denote the wealth states w ∈ {w_0 = 0, w_1 = 1}. The actions a ∈ {a_0, a_1, a_2} respectively denote (i) do not inject cash or sell the stock; (ii) sell the stock but do not inject cash; and (iii) inject cash ($1) but do not sell the stock.

The transition matrix for this MDP is

    T_{ijk} := T(w_i, a_k, w_j) := P(W_{t+1} = w_j | W_t = w_i, a_t = a_k).

From the problem constraints and the Markov chain transition probabilities p_{ij} we can write, for action a_0 (no cash injection or sale):

    T_{,,0} = ( 0.75  0.25
                0.25  0.75 ).

For the sale of the stock under action a_1:

    T_{,,1} = ( 1  0
                0  1 ).

For action a_2 (one dollar of cash added):

    T_{,,2} = ( 0.25  0.75
                0.0   1.0 ).

We can now evaluate the rewards for the case when W_{t_0} = w_1 = 1 (which is the only starting state). The rewards are

    R(w_1, a_0, w_1) = 0,   R(w_1, a_0, w_0) = −10,
    R(w_1, a_1, w_1) = 0,   R(w_1, a_1, w_0) = −10,
    R(w_1, a_2, w_1) = −1,  R(w_1, a_2, w_0) = −11.

Using the Bellman equation, the value update from state W_{t_0} = w_1 is

    V_1(w_1) = max_{a∈A} Σ_{j=0}^{1} T(w_1, a, w_j) R(w_1, a, w_j)
             = max{ T(w_1, a_0, w_1) R(w_1, a_0, w_1) + T(w_1, a_0, w_0) R(w_1, a_0, w_0),
                    T(w_1, a_1, w_1) R(w_1, a_1, w_1) + T(w_1, a_1, w_0) R(w_1, a_1, w_0),
                    T(w_1, a_2, w_1) R(w_1, a_2, w_1) + T(w_1, a_2, w_0) R(w_1, a_2, w_0) }
             = max(0.75 × 0 + 0.25 × (−10), 1.0 × 0 + 0 × (−10), 1.0 × (−1) + 0 × (−11))
             = max(−2.5, 0, −1) = 0,

which states that taking action a_{t_0} = a_1 (i.e. sell the stock) is the most valuable in this state. The other empty value in the table is not filled in: V_1(w_0) does not exist, as the wealth cannot transition from w_0 at time t_0.

Exercise 9.5*

Deterministic policies such as the greedy policy π*(a|s) = arg max_π Q^π(s, a) are invariant with respect to a shift of the action-value function by an arbitrary function of a state f(s): π*(a|s) = arg max_π Q^π(s, a) = arg max_π Q̃^π(s, a), where Q̃^π(s, a) = Q^π(s, a) − f(s). Show that this implies that the optimal policy is also invariant with respect to the following transformation of an original reward function r(s_t, a_t, s_{t+1}):

    r̃(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + γ f(s_{t+1}) − f(s_t).

This transformation of a reward function is known as reward shaping (Ng,


Russell 1999). It has been used in reinforcement learning to accelerate learning
in certain settings. In the context of inverse reinforcement learning, reward
shaping invariance has far-reaching implications, as we will discuss later in the
book.

Solution 9.5

Consider the Bellman optimality equation for the action-value function, where we subtract an arbitrary function of the state f(s) from both sides of the equation. We can write it as follows:

    Q(s, a) − f(s) = E[ r(s, a, s′) + γ f(s′) − f(s) + γ max_{a′} (Q(s′, a′) − f(s′)) ].

This is equivalent to an MDP with a modified reward r̃_t = r(s, a, s′) + γ f(s′) − f(s) and a new action-value function Q̃(s, a) = Q(s, a) − f(s). Because the MDP with the new reward is obtained by an identity transformation of the original Bellman optimality equation, the optimal policy obtained with the “shaped” reward r̃_t will be identical to the optimal policy obtained with the original reward r_t = r(s_t, a_t, s_{t+1}).

Exercise 9.6**

Define the occupancy measure ρ_π : S × A → R by the relation

    ρ_π(s, a) = π(a|s) Σ_{t=0}^{∞} γ^t Pr(s_t = s | π),

where Pr(s_t = s | π) is the probability density of the state s = s_t at time t following policy π. The occupancy measure ρ_π(s, a) can be interpreted as an unnormalized density of state-action pairs. It can be used e.g. to specify the value function as an expectation value of the reward: V = ⟨r(s, a)⟩_ρ.

a. Compute the policy in terms of the occupancy measure ρ_π.
b. Compute a normalized occupancy measure ρ̃_π(s, a). How different will the policy be if we use the normalized measure ρ̃_π(s, a) instead of the unnormalized measure ρ_π?

Solution 9.6

a. The policy is

    π(a|s) = ρ_π(s, a) / Σ_{a′} ρ_π(s, a′).

b. The normalized occupancy measure is

    ρ̃_π(s, a) = ρ_π(s, a) / Σ_{s,a} ρ_π(s, a) = ρ_π(s, a) / Σ_{t=0}^{∞} γ^t = (1 − γ) ρ_π(s, a).

Therefore, if we rescale the occupancy measure ρ_π by a constant factor 1 − γ, we obtain a normalized probability density ρ̃(s, a) of state-action pairs. On the other hand, an optimal policy is invariant under a rescaling of all rewards by a constant factor (see Exercise 9.1). This means that we can always consider the occupancy measure ρ_π to be a valid normalized probability density, as any mismatch in the normalization could always be re-absorbed in rescaled rewards.

Exercise 9.7**

Theoretical models for reinforcement learning typically assume that rewards r_t := r(s_t, a_t, s_{t+1}) are bounded: r_min ≤ r_t ≤ r_max, with some fixed values r_min, r_max. On the other hand, some models of rewards used by practitioners may produce (numerically) unbounded rewards. For example, with linear architectures, a popular choice of a reward function is a linear expansion r_t = Σ_{k=1}^{K} θ_k Ψ_k(s_t, a_t) over a set of K basis functions Ψ_k. Even if one chooses a set of bounded basis functions, this expression may become unbounded via a choice of coefficients θ_t.

a. Use the policy invariance under linear transforms of rewards (see Exercise 9.1) to equivalently formulate the same problem with rewards that are bounded to the unit interval [0, 1], so they can be interpreted as probabilities.

b. How could you modify a linear unbounded specification of reward r_θ(s, a, s′) = Σ_{k=1}^{K} θ_k Ψ_k(s, a, s′) to a bounded reward function with values in a unit interval [0, 1]?

Solution 9.7

a. As greedy policies are invariant under affine transformations of a reward


function rt → wrt + b with fixed parameters w > 0 and b (see Exercise 9.1),
by a proper choice of parameters w, b, we can always map a finite interval
[rmin, rma x ] onto a unit interval [0, 1]. Upon such transformation, rewards
can be interpreted as probabilities.

b. Once we realize that any MDP with bounded rewards can be mapped onto an MDP with rewards r_t ∈ [0, 1] without any loss of generality, a simple alternative to a linear reward would be a logistic reward:

    r_θ(s_t, a_t, s_{t+1}) = 1 / (1 + exp(−Σ_{k=1}^{K} θ_k Ψ_k(s_t, a_t, s_{t+1}))).

Clearly, while a logistic reward is one simple choice of a bounded function on a unit interval, other specifications are possible as well.

Exercise 9.8

Consider an MDP with a finite number of states and actions in a real-time


setting where the agent learns to act optimally using the ε-greedy policy.
The ε-greedy policy amounts to taking an action ax = argmaxa J Q(s, a J ) in each state
s with probability 1 – ε, and taking a random action with probability ε. Will
SARSA and Q-learning converge to the same solution under such policy, using
a constant value of ε? What will be different in the answer if ε decays with the
epoch, e.g. as εt ~ 1/t?

Solution 9.8

If ε is fixed, then SARSA will converge to an optimal ε-greedy policy, while Q-learning will converge to an optimal policy. If ε is gradually reduced with the epoch, e.g. ε_t ~ 1/t, then both SARSA and Q-learning converge to the same optimal policy.

Exercise 9.9

Consider the following single-step random cost (negative reward):

    C(s_t, a_t, s_{t+1}) = y a_t + (K − s_{t+1} − a_t)_+,

where y and K are some parameters. You can use such a cost function to develop an MDP model for an agent learning to invest. For example, s_t can be the current assets in a portfolio of equities at time t, a_t an additional cash amount added to or subtracted from the portfolio at time t, and s_{t+1} the portfolio value at the end of the time interval [t, t + 1). The second term is an option-like cost of a total portfolio (equity and cash) shortfall by time t + 1 from a target value K. The parameter y controls the relative importance of paying costs now as opposed to delaying payment.

a. What is the corresponding expected cost for this problem, if the expectation is taken w.r.t. the stock prices and a_t is treated as deterministic?
b. Is the expected cost a convex or concave function of the action a_t?
c. Can you find an optimal one-step action a_t* that minimizes the one-step expected cost?

Hint: you can use the following property:

    d/dx [y − x]_+ = d/dx [(y − x) H(y − x)],

where H(x) is the Heaviside function.

Solution 9.9

a. The expected cost is

    E[C(s_t, a_t, s_{t+1})] = E[y a_t + (K − s_{t+1} − a_t)_+] = y a_t + E[(K − s_{t+1} − a_t)_+].

b. To find the extremum of the expected one-step cost, we use the relation

    d/dx [y − x]_+ = d/dx [(y − x) H(y − x)] = −H(y − x) − (y − x) δ(y − x) = −H(y − x),

where H(x) is the Heaviside step function. Note that the (y − x) δ(y − x) term is zero everywhere. Using this relation, the derivative of the expected cost is

    ∂E[C]/∂a_t = y − E[H(K − s_{t+1} − a_t)] = y − Pr(s_{t+1} ≤ K − a_t).

Taking the second derivative and assuming that the distribution of s_{t+1} has a pdf p(s_{t+1}), we obtain ∂²E[C]/∂a_t² = p(K − a_t) ≥ 0. This shows that the expected cost is convex in the action a_t.

c. An implicit equation for the optimal action a_t* is obtained at a zero of the derivative of the expected cost:

    Pr(s_{t+1} ≤ K − a_t*) = y.

Exercise 9.10

Exercise 9.9 presented a simple single-period cost function that can be used in the setting of model-free reinforcement learning. We can now formulate a model-based specification for such an option-like reward. To this end, we use the following specification of the random end-of-period portfolio state:

    s_{t+1} = (1 + r_t) s_t,
    r_t = G(F_t) + ε_t.

In words, the initial portfolio value s_t + a_t at the beginning of the interval [t, t + 1) grows with a random return r_t given by a function G(F_t) of factors F_t, corrupted by noise ε_t with E[ε_t] = 0 and E[ε_t²] = σ².

a. Obtain the form of the expected cost for this specification in Exercise 9.9.
b. Obtain the optimal single-step action for this case.
c. Compute the sensitivity of the optimal action with respect to the i-th factor F_{it}, assuming the sigmoid link function G(F_t) = σ(Σ_i ω_i F_{it}) and Gaussian noise ε_t.

Solution 9.10

a. The model-dependent expected cost in this case is

$$\bar{C}(s_t, a_t) = y\, a_t + \mathbb{E}\left[\left(K - (1 + G(F_t) + \sigma\varepsilon)\, s_t - a_t\right)^{+}\right].$$

b. Equating the partial derivative $\partial \bar{C}/\partial a_t$ to zero, we obtain an implicit equation for the optimal action $a_t^{*}$:

$$\Pr\left(\varepsilon \le \frac{K - a_t^{*} - (1 + G(F_t))\, s_t}{\sigma s_t}\right) = y.$$

c. If ε ~ N(0, 1), the last formula becomes

$$N\!\left(\frac{K - a_t^{*} - (1 + G(F_t))\, s_t}{\sigma s_t}\right) = y,$$

where N(·) is the cumulative normal distribution. Differentiating this expression with respect to the factor $F_{it}$ using the sigmoid link function $G(F_t) = \sigma\!\left(\sum_i \omega_i F_{it}\right) =: \sigma(F)$, we obtain

$$\frac{\partial a_t^{*}}{\partial F_{it}} = -s_t\, \frac{\partial G(F_t)}{\partial F_{it}} = -s_t\, \omega_i\, \sigma(F)\left(1 - \sigma(F)\right).$$
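Under the Gaussian assumption of part (c), the implicit equation can be inverted in closed form, $a_t^{*} = K - (1 + G(F_t))\, s_t - \sigma s_t\, N^{-1}(y)$. The sketch below (all parameter and factor values are made up for illustration) evaluates this formula and checks the factor sensitivity $\partial a_t^{*}/\partial F_{it} = -s_t\, \omega_i\, \sigma(F)(1 - \sigma(F))$ against a central finite difference.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed (illustrative) parameters
s_t, K, y, sigma = 100.0, 105.0, 0.3, 0.2
omega = np.array([0.5, -0.2, 0.1])       # factor loadings
F_t = np.array([0.3, 0.1, -0.4])         # current factor values

def a_star(F):
    """Optimal action from inverting N((K - a - (1+G)s)/(sigma*s)) = y."""
    G = sigmoid(omega @ F)
    return K - (1.0 + G) * s_t - sigma * s_t * norm.ppf(y)

# Analytic sensitivity vs. a central finite difference for factor i = 0
i, h = 0, 1e-5
G = sigmoid(omega @ F_t)
analytic = -s_t * omega[i] * G * (1.0 - G)
F_plus, F_minus = F_t.copy(), F_t.copy()
F_plus[i] += h
F_minus[i] -= h
numeric = (a_star(F_plus) - a_star(F_minus)) / (2 * h)

print("analytic dA*/dF_i =", round(analytic, 6), "  finite difference =", round(numeric, 6))
```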

Exercise 9.11

Assuming a discrete set of actions $a_t \in \mathcal{A}$ of dimension K, show that deterministic policy optimization by the greedy policy of Q-learning, $Q(s_t, a_t^{*}) = \max_{a_t \in \mathcal{A}} Q(s_t, a_t)$, can be equivalently expressed as maximization over a set of probability distributions π(a_t) with probabilities $\pi_k$ for $a_t = A_k$, k = 1, ..., K (this relation is known as Fenchel duality):

$$\max_{a_t \in \mathcal{A}} Q(s_t, a_t) = \max_{\{\pi_k\}} \sum_{k=1}^{K} \pi_k\, Q(s_t, A_k) \quad \text{s.t.} \quad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1.$$

Solution 9.11

As all weights are between zero and one, the solution to the maximization over the distribution $\pi = \{\pi_k\}_{k=1}^{K}$ is to put all weight on the value $k = k^{*}$ such that $a_t = A_{k^{*}}$ maximizes $Q(s_t, a_t)$: $\pi_{k^{*}} = 1$, $\pi_k = 0$ for all $k \ne k^{*}$. This means that maximization over all possible actions for a discrete action set can be reduced to a linear programming problem.
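As a quick numerical illustration of this duality (a sketch only, with randomly generated Q-values), the linear program below maximizes $\sum_k \pi_k Q(s, A_k)$ over the probability simplex and recovers the greedy value $\max_k Q(s, A_k)$, with all probability mass placed on the maximizing action.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
Q = rng.normal(size=5)            # Q(s, A_k) for K = 5 discrete actions

# Maximize sum_k pi_k Q_k  <=>  minimize -Q^T pi  subject to the simplex constraints
res = linprog(
    c=-Q,                          # linprog minimizes, so negate the objective
    A_eq=np.ones((1, Q.size)),     # sum_k pi_k = 1
    b_eq=[1.0],
    bounds=[(0.0, 1.0)] * Q.size,  # 0 <= pi_k <= 1
)

print("LP optimum      :", -res.fun)
print("greedy max_k Q_k:", Q.max())
print("optimal weights :", np.round(res.x, 3))   # all mass on argmax_k Q_k
```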

Exercise 9.12**

The reformulation of a deterministic policy search in terms of a search over probability distributions given in Exercise 9.11 is a mathematical identity where the end result is still a deterministic policy. We can convert it to a probabilistic policy search if we modify the objective function

$$\max_{a_t \in \mathcal{A}} Q(s_t, a_t) = \max_{\{\pi_k\}} \sum_{k=1}^{K} \pi_k\, Q(s_t, A_k) \quad \text{s.t.} \quad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1$$

by adding to it a KL divergence of the policy π with some reference ("prior") policy ω:

$$G^{*}(s_t, a_t) = \max_{\{\pi_k\}} \sum_{k=1}^{K} \pi_k\, Q(s_t, A_k) - \frac{1}{\beta} \sum_{k=1}^{K} \pi_k \log \frac{\pi_k}{\omega_k},$$

where β is a regularization parameter controlling the relative importance of the two terms that enforce, respectively, maximization of the action-value function and a preference for a previous reference policy ω with probabilities $\omega_k$. When the parameter β < ∞ is finite, this produces a stochastic rather than deterministic optimal policy $\pi^{*}(a|s)$.
Find the optimal policy $\pi^{*}(a|s)$ from the entropy-regularized functional $G^{*}(s_t, a_t)$. (Hint: use the method of Lagrange multipliers to enforce the normalization constraint $\sum_k \pi_k = 1$.)

Solution 9.12

By changing the sign to replace maximization by minimization, rescaling, and using the Lagrange multiplier method to enforce the normalization constraint, we have the following Lagrangian function

$$L = \sum_{k=1}^{K} \pi_k \log \frac{\pi_k}{\omega_k} - \beta \sum_{k=1}^{K} \pi_k\, Q(s_t, A_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right),$$

where λ is the Lagrange multiplier. Computing its variational derivative with respect to $\pi_k$ and setting it to zero, we obtain

$$\pi_k = \omega_k\, e^{\beta Q(s_t, A_k) - \lambda - 1}.$$

The Lagrange multiplier λ can now be found by substituting this expression into the normalization condition $\sum_{k=1}^{K} \pi_k = 1$. This produces the final result

$$\pi_k = \frac{\omega_k\, e^{\beta Q(s_t, A_k)}}{\sum_k \omega_k\, e^{\beta Q(s_t, A_k)}}.$$
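A minimal numerical sketch of this result (the Q-values and the prior ω below are made-up inputs): it computes the KL-regularized optimal policy via a numerically stable softmax and illustrates the two limits discussed in the text, β → 0 (the policy reverts to the prior) and β → ∞ (the policy concentrates on $\text{argmax}_k Q$).

```python
import numpy as np

def kl_regularized_policy(Q, omega, beta):
    """pi_k proportional to omega_k * exp(beta * Q_k), computed via a stable softmax."""
    logits = beta * Q + np.log(omega)
    logits -= logits.max()                 # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

Q = np.array([1.0, 2.0, 0.5, 1.5])         # assumed Q(s_t, A_k) values
omega = np.array([0.4, 0.1, 0.3, 0.2])     # reference ("prior") policy

for beta in [0.0, 1.0, 100.0]:
    print(f"beta = {beta:6.1f}:", np.round(kl_regularized_policy(Q, omega, beta), 3))
# beta = 0   -> returns omega itself
# beta large -> nearly all mass on argmax_k Q_k (here k = 1, with Q = 2.0)
```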

Exercise 9.13**

Regularization by KL divergence with a reference distribution ω introduced in the previous exercise can be extended to a multi-period setting. This produces maximum entropy reinforcement learning, which augments the standard RL reward by an additional entropy penalty term in the form of a KL divergence. The optimal value function in MaxEnt RL is

$$F^{*}(s) = \max_{\pi} \mathbb{E}\left[\left. \sum_{t=0}^{\infty} \gamma^{t} \left( r(s_t, a_t, s_{t+1}) - \frac{1}{\beta} \log \frac{\pi(a_t|s_t)}{\pi_0(a_t|s_t)} \right) \right| s_0 = s \right], \qquad (24)$$

where $\mathbb{E}[\cdot]$ stands for an average under a stationary distribution $\rho_\pi(a) = \sum_s \mu_\pi(s)\, \pi(a|s)$, where $\mu_\pi(s)$ is a stationary distribution over states induced by the policy π, and $\pi_0$ is some reference policy. Show that the optimal policy for this entropy-regularized MDP problem has the following form:

$$\pi^{*}(a|s) = \frac{1}{Z_t}\, \pi_0(a_t|s_t)\, e^{\beta G^{\pi}(s_t, a_t)}, \qquad Z_t \equiv \sum_{a_t} \pi_0(a_t|s_t)\, e^{\beta G^{\pi}(s_t, a_t)}, \qquad (25)$$

where $G^{\pi}(s_t, a_t) = \mathbb{E}^{\pi}\left[r(s_t, a_t, s_{t+1})\right] + \gamma \sum_{s_{t+1}} p(s_{t+1}|s_t, a_t)\, F^{\pi}_{t+1}(s_{t+1})$. Check that the limit β → ∞ reproduces the standard deterministic policy, that is, $\lim_{\beta \to \infty} V^{*}(s) = \max_{\pi} V^{\pi}(s)$, while in the opposite limit β → 0 we obtain a random and uniform policy. We will return to entropy-regularized value-based RL and stochastic policies such as (25) (which are sometimes referred to as Boltzmann policies) in later chapters of this book.

Solution 9.13

The entropy-regularized value function for policy π is

$$F^{\pi}_t(s_t) = \mathbb{E}^{\pi}\left[\left. \sum_{t'=t}^{\infty} \gamma^{t'-t} \left( r(s_{t'}, a_{t'}, s_{t'+1}) - \frac{1}{\beta} \log \frac{\pi(a_{t'}|s_{t'})}{\pi_0(a_{t'}|s_{t'})} \right) \right| s_t \right].$$

Note that this expression coincides with the usual definition of the value function in the limit β → ∞. Consider a similar extension of the action-value function:

$$G^{\pi}_t(s_t, a_t) = \mathbb{E}^{\pi}\left[\left. \sum_{t'=t}^{\infty} \gamma^{t'-t} \left( r(s_{t'}, a_{t'}, s_{t'+1}) - \frac{1}{\beta} \log \frac{\pi(a_{t'}|s_{t'})}{\pi_0(a_{t'}|s_{t'})} \right) \right| s_t, a_t \right].$$

From these two expressions, and using the fact that the entropy term vanishes for the first time step, where the action $a_t$ is fixed, we obtain

$$F^{\pi}_t(s_t) = \sum_{a_t} \pi(a_t|s_t) \left( G^{\pi}_t(s_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t|s_t)}{\pi_0(a_t|s_t)} \right).$$

Maximizing this expression with respect to $\pi(a_t|s_t)$, we obtain

$$\pi(a_t|s_t) = \frac{1}{Z_t}\, \pi_0(a_t|s_t)\, e^{\beta G^{\pi}_t(s_t, a_t)}, \qquad Z_t \equiv \sum_{a_t} \pi_0(a_t|s_t)\, e^{\beta G^{\pi}_t(s_t, a_t)},$$

where $G^{\pi}_t(s_t, a_t) = \mathbb{E}^{\pi}\left[r(s_t, a_t, s_{t+1})\right] + \gamma \sum_{s_{t+1}} p(s_{t+1}|s_t, a_t)\, F^{\pi}_{t+1}(s_{t+1})$.

Exercise 9.14*

Show that the solution for the coefficients $W_{tk}$ in the LSPI method (see Eq. (9.71)) is

$$W_t^{*} = S_t^{-1} M_t,$$

where $S_t$ is a matrix and $M_t$ is a vector with the following elements:

$$S^{(t)}_{nm} = \sum_{k=1}^{N} \Psi_n\!\left(X_t^{(k)}, a_t^{(k)}\right) \Psi_m\!\left(X_t^{(k)}, a_t^{(k)}\right),$$

$$M^{(t)}_{n} = \sum_{k=1}^{N} \Psi_n\!\left(X_t^{(k)}, a_t^{(k)}\right) \left( R_t\!\left(X_t^{(k)}, a_t^{(k)}, X_{t+1}^{(k)}\right) + \gamma\, Q^{\pi}_{t+1}\!\left(X_{t+1}^{(k)}, \pi\!\left(X_{t+1}^{(k)}\right)\right) \right).$$

Solution 9.14

The objective function is

$$L_t(W_t) = \sum_{k=1}^{N} \left( R_t\!\left(X_t^{(k)}, a_t^{(k)}, X_{t+1}^{(k)}\right) + \gamma\, Q^{\pi}_{t+1}\!\left(X_{t+1}^{(k)}, \pi\!\left(X_{t+1}^{(k)}\right)\right) - W_t\, \Psi\!\left(X_t^{(k)}, a_t^{(k)}\right) \right)^2.$$

Differentiating this expression with respect to the parameters $W_t$, equating it to zero, and re-arranging terms, we obtain the result shown in the exercise.
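The result is just a linear least-squares solve. The sketch below uses synthetic inputs (the basis matrix, targets, and dimensions are made up for illustration): it builds $S_t$ and $M_t$ from simulated samples, where the target column plays the role of $R_t + \gamma Q^{\pi}_{t+1}$, and compares $W_t^{*} = S_t^{-1} M_t$ with numpy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_basis = 500, 6                     # number of samples and basis functions

Psi = rng.normal(size=(N, n_basis))     # Psi_n(X_t^(k), a_t^(k)) stacked row-wise
target = Psi @ rng.normal(size=n_basis) + 0.1 * rng.normal(size=N)
# target plays the role of R_t + gamma * Q_{t+1}^pi evaluated on each sample

# Normal equations: S W = M  with  S = Psi^T Psi,  M = Psi^T target
S = Psi.T @ Psi
M = Psi.T @ target
W_star = np.linalg.solve(S, M)

# Cross-check against a direct least-squares fit of the same objective
W_lstsq, *_ = np.linalg.lstsq(Psi, target, rcond=None)
print("max |W_star - W_lstsq| =", np.abs(W_star - W_lstsq).max())
```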

Exercise 9.15**

Consider the Boltzmann weighted average of a function h(i) defined on a binary set I = {1, 2}:

$$\text{Boltz}_\beta\, h(i) = \sum_{i \in I} h(i)\, \frac{e^{\beta h(i)}}{\sum_{i \in I} e^{\beta h(i)}}.$$

a. Verify that this operator smoothly interpolates between the max and the mean of h(i), which are obtained in the limits β → ∞ and β → 0, respectively.
b. By taking β = 1, h(1) = 100, h(2) = 1, h'(1) = 1, h'(2) = 0, show that $\text{Boltz}_\beta$ is not a non-expansion.
c. (Programming) Using operators that are not non-expansions can lead to a loss of a solution in a generalized Bellman equation. To illustrate this phenomenon, we use the following simple example. Consider the MDP problem on the set I = {1, 2} with two actions a and b and the following specification: p(1|1, a) = 0.66, p(2|1, a) = 0.34, r(1, a) = 0.122 and p(1|1, b) = 0.99, p(2|1, b) = 0.01, r(1, b) = 0.033. The second state is absorbing, with p(1|2) = 0 and p(2|2) = 1. The discount factor is γ = 0.98. Assume we use the Boltzmann policy

$$\pi(a|s) = \frac{e^{\beta \hat{Q}(s, a)}}{\sum_a e^{\beta \hat{Q}(s, a)}}.$$

Show that the SARSA algorithm

$$\hat{Q}(s, a) \leftarrow \hat{Q}(s, a) + \alpha \left( r(s, a) + \gamma\, \hat{Q}(s', a') - \hat{Q}(s, a) \right),$$

where a, a' are drawn from the Boltzmann policy with β = 16.55 and α = 0.1, leads to oscillating estimates of $\hat{Q}(s_1, a)$ and $\hat{Q}(s_1, b)$ that do not achieve stable values with an increased number of iterations.

Solution 9.15

Part b:
To verify that $\text{Boltz}_\beta$ is not a non-expansion, we compute (writing T = 1/β = 1)

$$\left| \text{Boltz}_T\, h(i) - \text{Boltz}_T\, h'(i) \right| = \left| \sum_{i \in I} h(i)\, \frac{e^{h(i)/T}}{\sum_{i \in I} e^{h(i)/T}} - \sum_{i \in I} h'(i)\, \frac{e^{h'(i)/T}}{\sum_{i \in I} e^{h'(i)/T}} \right|$$

$$\approx |100 + 0 - 0.731 - 0| = 99.269 > 99 = \max_{i \in I} |h(i) - h'(i)|,$$

so the operator can expand distances and is therefore not a non-expansion.
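Part c is a programming exercise; a minimal sketch is given below. It runs SARSA on the two-state MDP above with Boltzmann (softmax) action selection at β = 16.55 and α = 0.1 and prints $\hat{Q}(s_1, a)$ and $\hat{Q}(s_1, b)$ at regular intervals, so that the claimed lack of convergence can be inspected directly. The restart-from-state-1 rule after absorption and the zero initialization are implementation choices not specified in the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, beta = 0.98, 0.1, 16.55
# Two-state MDP from part (c): only state 1 matters, state 2 is absorbing (value 0)
P = {"a": 0.66, "b": 0.99}          # probability of staying in state 1
R = {"a": 0.122, "b": 0.033}        # rewards for taking a, b in state 1
Q = {"a": 0.0, "b": 0.0}            # action-value estimates in state 1

def boltzmann_action():
    z = np.array([np.exp(beta * Q["a"]), np.exp(beta * Q["b"])])
    return rng.choice(["a", "b"], p=z / z.sum())

action = boltzmann_action()
for step in range(1, 200_001):
    stay = rng.random() < P[action]
    next_action = boltzmann_action()            # on-policy (SARSA) successor action
    if stay:
        target = R[action] + gamma * Q[next_action]
    else:
        target = R[action]                      # state 2 is absorbing with zero value;
                                                # the episode restarts from state 1
    Q[action] += alpha * (target - Q[action])
    action = next_action
    if step % 40_000 == 0:
        print(f"step {step:6d}:  Q(s1,a) = {Q['a']:.3f}   Q(s1,b) = {Q['b']:.3f}")
```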

Exercise 9.16**

An alternative continuous approximation to the intractable max operator in the Bellman optimality equation is given by the mellowmax function:

$$\text{mm}_\omega(X) = \frac{1}{\omega} \log \left( \frac{1}{n} \sum_{i=1}^{n} e^{\omega x_i} \right).$$

a. Show that the mellowmax function recovers the max function in the limit ω → ∞.
b. Show that mellowmax is a non-expansion.

Solution 9.16

a. Let m = max(X) and let W ≥ 1 be the number of values in X that are equal to m. We obtain

$$\lim_{\omega \to \infty} \text{mm}_\omega(X) = \lim_{\omega \to \infty} \frac{1}{\omega} \log \left( \frac{1}{n} \sum_{i=1}^{n} e^{\omega x_i} \right) = \lim_{\omega \to \infty} \frac{1}{\omega} \log \left( \frac{1}{n}\, e^{\omega m} \sum_{i=1}^{n} e^{\omega(x_i - m)} \right)$$

$$= \lim_{\omega \to \infty} \frac{1}{\omega} \log \left( \frac{1}{n}\, e^{\omega m}\, W \right) = m = \max(X).$$

b. Let X and Y be two vectors of length n, and let $\Delta_i = X_i - Y_i$ be the difference of their i-th components. Let $i^{*}$ be the index of the maximum component-wise difference: $i^{*} = \text{argmax}_i\, \Delta_i$. Without loss of generality, we assume that $x_{i^{*}} - y_{i^{*}} \ge 0$. We obtain

$$\left| \text{mm}_\omega(X) - \text{mm}_\omega(Y) \right| = \frac{1}{\omega} \left| \log \frac{\frac{1}{n} \sum_{i=1}^{n} e^{\omega x_i}}{\frac{1}{n} \sum_{i=1}^{n} e^{\omega y_i}} \right| = \frac{1}{\omega} \left| \log \frac{\sum_{i=1}^{n} e^{\omega(y_i + \Delta_i)}}{\sum_{i=1}^{n} e^{\omega y_i}} \right| \le \frac{1}{\omega} \left| \log \frac{\sum_{i=1}^{n} e^{\omega(y_i + \Delta_{i^{*}})}}{\sum_{i=1}^{n} e^{\omega y_i}} \right|$$

$$= \frac{1}{\omega} \log e^{\omega \Delta_{i^{*}}} = |\Delta_{i^{*}}| = \max_i |X_i - Y_i|.$$
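A small numerical sketch of the mellowmax operator and its limits (the test vector is arbitrary): it uses scipy's logsumexp for numerical stability, checks that $\text{mm}_\omega$ approaches the mean for small ω and the max for large ω, and samples random vector pairs to spot-check the non-expansion property.

```python
import numpy as np
from scipy.special import logsumexp

def mellowmax(x, omega):
    """mm_omega(x) = (1/omega) * log( (1/n) * sum_i exp(omega * x_i) )."""
    x = np.asarray(x, dtype=float)
    return (logsumexp(omega * x) - np.log(x.size)) / omega

x = np.array([1.0, 2.0, 3.0, 4.0])
for omega in [1e-3, 1.0, 10.0, 100.0]:
    print(f"omega = {omega:7.3f}   mm = {mellowmax(x, omega):.4f}")
print("mean =", x.mean(), "  max =", x.max())

# Spot-check the non-expansion property on random pairs of vectors
rng = np.random.default_rng(0)
checks = [abs(mellowmax(a, 5.0) - mellowmax(b, 5.0)) <= np.abs(a - b).max() + 1e-12
          for a, b in (rng.normal(size=(2, 4)) for _ in range(1000))]
print("non-expansion holds on all sampled pairs:", all(checks))
```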
Chapter 10
Applications of Reinforcement Learning

Exercise 10.1

Derive Eq.(10.46) that gives the limit of the optimal action in the QLBS model
in the continuous-time limit.

Solution 10.1

The first term in Eq. (10.46) is evaluated in the same way as Eq. (10.22) and yields

$$\lim_{\Delta t \to 0} \frac{\mathbb{E}_t\left[\Delta \hat{S}_t\, \hat{\Pi}_{t+1}\right]}{\mathbb{E}_t\left[\left(\Delta \hat{S}_t\right)^2\right]} = \frac{\partial \hat{C}_t}{\partial S_t}.$$

The second term is

$$\lim_{\Delta t \to 0} \frac{\mathbb{E}_t\left[\frac{1}{2\gamma\lambda}\, \Delta S_t\right]}{\mathbb{E}_t\left[\left(\Delta \hat{S}_t\right)^2\right]} = \frac{\mu - r}{2\lambda\sigma^2 S_t}.$$

Exercise 10.2

Consider the expression (10.121) for the optimal policy obtained with G-learning,

$$\pi(a_t|y_t) = \frac{1}{Z_t}\, \pi_0(a_t|y_t)\, e^{\beta\left(\hat{R}(y_t, a_t) + \gamma\, \mathbb{E}_{t, a_t}\left[F^{\pi}_{t+1}(y_{t+1})\right]\right)},$$

where the one-step reward is quadratic as in Eq. (10.91):

$$\hat{R}(y_t, a_t) = y_t^T R_{yy}\, y_t + a_t^T R_{aa}\, a_t + a_t^T R_{ay}\, y_t + a_t^T R_a.$$

How does this relation simplify in two cases: (a) when the conditional expectation $\mathbb{E}_{t, a_t}\left[F^{\pi}_{t+1}(y_{t+1})\right]$ does not depend on the action $a_t$, and (b) when the dynamics are linear in $a_t$ as in Eq. (10.125)?

Solution 10.2
(a) When the conditional expectation $\mathbb{E}_{t, a_t}\left[F^{\pi}_{t+1}(y_{t+1})\right]$ does not depend on the action $a_t$, this term trivially cancels between the numerator and the denominator $Z_t$. Therefore, the optimal policy in this case is determined by the one-step reward: $\pi(a_t|y_t) \propto \pi_0(a_t|y_t)\, e^{\beta \hat{R}(y_t, a_t)}$.
(b) When the dynamics are linear in the action $a_t$ as in Eq. (10.125), the optimal policy for a multi-step problem has the same Gaussian form as the optimal policy for a single-step problem; however, this time the parameters $R_{aa}$, $R_{ay}$, $R_a$ of the reward function are re-scaled.

Exercise 10.3

Derive relations (10.141).

Solution 10.3

The solution is obtained by a straightforward algebraic manipulation.

Exercise 10.4

Consider G-learning for a time-stationary case, given by Eq. (10.122):

$$G^{\pi}(y, a) = \hat{R}(y, a) + \frac{\gamma}{\beta} \sum_{y'} \rho(y'|y, a) \log \sum_{a'} \pi_0(a'|y')\, e^{\beta G^{\pi}(y', a')}.$$

Show that the high-temperature limit β → 0 of this equation reproduces the fixed-policy Bellman equation for $G^{\pi}(y, a)$, where the policy coincides with the prior policy, i.e. π = π_0.

Solution 10.4

Expand the expression in the exponent to first order in the parameter β → 0:

$$e^{\beta G^{\pi}(y', a')} = 1 + \beta\, G^{\pi}(y', a') + O(\beta^2).$$

Plugging this into Eq. (10.122) and using the first-order Taylor expansion of the logarithm, log(1 + x) = x + O(x²), we obtain

$$G^{\pi}(y, a) = \hat{R}(y, a) + \gamma \sum_{y', a'} \rho(y'|y, a)\, \pi_0(a'|y')\, G^{\pi}(y', a').$$

This is the same as the fixed-policy Bellman equation for the G-function $G^{\pi}(y, a)$ with π = π_0 and transition probability $\rho(y', a'|y, a) = \rho(y'|y, a)\, \pi_0(a'|y')$.

Exercise 10.5

Consider the policy update equations for G-learning given by Eqs. (10.174):

$$\tilde{\Sigma}_p^{-1} = \Sigma_p^{-1} - 2\beta\, Q_t^{(uu)},$$
$$\tilde{u}_t = \tilde{\Sigma}_p \left( \Sigma_p^{-1}\, \bar{u}_t + \beta\, Q_t^{(u)} \right),$$
$$\tilde{v}_t = \tilde{\Sigma}_p \left( \Sigma_p^{-1}\, \bar{v}_t + \beta\, Q_t^{(ux)} \right).$$

(a) Find the limiting forms of these expressions in the high-temperature limit β → 0 and the low-temperature limit β → ∞.
(b) Assuming that we know the stable point $(\bar{u}_t, \bar{v}_t)$ of these iterative equations, as well as the covariance $\tilde{\Sigma}_p$, invert them to find the parameters of the Q-function in terms of the stable-point values $\bar{u}_t$, $\bar{v}_t$. Note that only the parameters $Q_t^{(uu)}$, $Q_t^{(ux)}$, and $Q_t^{(u)}$ can be recovered. Can you explain why the parameters $Q_t^{(xx)}$ and $Q_t^{(x)}$ are lost in this procedure? (Note: this problem can be viewed as a prelude to the topic of inverse reinforcement learning covered in the next chapter.)

Solution 10.5

(a) In the high-temperature limit β → 0, we find that the parameters of the prior policy are not updated:

$$\tilde{\Sigma}_p^{-1} = \Sigma_p^{-1}, \qquad \tilde{u}_t = \bar{u}_t, \qquad \tilde{v}_t = \bar{v}_t.$$

In the low-temperature limit β → ∞, the policy becomes a deterministic limit of the Gaussian policy, with the following parameters that are independent of the parameters of the prior policy:

$$\tilde{\Sigma}_p = -\frac{1}{2\beta} \left[ Q_t^{(uu)} \right]^{-1} \to 0,$$
$$\tilde{u}_t = -\frac{1}{2} \left[ Q_t^{(uu)} \right]^{-1} Q_t^{(u)},$$
$$\tilde{v}_t = -\frac{1}{2} \left[ Q_t^{(uu)} \right]^{-1} Q_t^{(ux)}.$$

(b) If $(\bar{u}_t, \bar{v}_t)$ is a fixed point of the update equations and we assume it is known, along with the covariance $\tilde{\Sigma}_p$, we can invert the update equations to obtain

$$Q_t^{(uu)} = \frac{1}{2\beta} \left( \Sigma_p^{-1} - \tilde{\Sigma}_p^{-1} \right), \qquad Q_t^{(u)} = \frac{1}{\beta}\, \tilde{\Sigma}_p^{-1} \left( \mathbb{1} - \tilde{\Sigma}_p\, \Sigma_p^{-1} \right) \bar{u}_t, \qquad Q_t^{(ux)} = \frac{1}{\beta}\, \tilde{\Sigma}_p^{-1} \left( \mathbb{1} - \tilde{\Sigma}_p\, \Sigma_p^{-1} \right) \bar{v}_t.$$

This implies that if we know the parameters of the policy, we can recover from them the coefficients $Q_t^{(uu)}$, $Q_t^{(ux)}$ and $Q_t^{(u)}$. However, the parameters $Q_t^{(xx)}$ and $Q_t^{(x)}$ cannot be recovered by this procedure. The reason is that the policy depends exponentially on $G(s_t, a_t)$ as $\pi(a_t|s_t) = \frac{1}{Z(s_t)}\, e^{\beta G(s_t, a_t)}$, where $Z(s_t)$ is a normalization factor; therefore, additive terms in $G(s_t, a_t)$ that depend only on $s_t$ cancel out between the numerator and the denominator.
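The two limits in part (a) are easy to verify numerically. The sketch below uses arbitrary made-up matrices for $\Sigma_p$ and the Q-coefficients in a two-dimensional action space (with $Q_t^{(uu)}$ taken negative definite), applies the update equations (10.174) for a very small and a very large β, and compares the output with the limiting formulas above.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2                                        # action dimension (illustrative)
Sigma_p = np.diag([0.5, 0.8])                # prior policy covariance (assumed)
u_bar = np.array([0.2, -0.1])                # prior policy parameters (assumed)
v_bar = np.array([0.3, 0.4])
A = rng.normal(size=(d, d))
Q_uu = -(A @ A.T + np.eye(d))                # negative definite, as needed for a maximum
Q_u = rng.normal(size=d)
Q_ux = rng.normal(size=d)

def policy_update(beta):
    """One application of the G-learning policy update equations (10.174)."""
    Sigma_tilde = np.linalg.inv(np.linalg.inv(Sigma_p) - 2.0 * beta * Q_uu)
    u_tilde = Sigma_tilde @ (np.linalg.inv(Sigma_p) @ u_bar + beta * Q_u)
    v_tilde = Sigma_tilde @ (np.linalg.inv(Sigma_p) @ v_bar + beta * Q_ux)
    return Sigma_tilde, u_tilde, v_tilde

# High-temperature limit: beta -> 0 leaves the prior parameters unchanged
S0, u0, v0 = policy_update(1e-8)
print("beta -> 0  :", np.allclose(S0, Sigma_p), np.allclose(u0, u_bar), np.allclose(v0, v_bar))

# Low-temperature limit: beta -> infinity, compare with the closed-form limits
Sinf, uinf, vinf = policy_update(1e8)
print("beta -> inf: Sigma_tilde ~ 0 :", np.allclose(Sinf, 0.0, atol=1e-6))
print("u_tilde ->", uinf, " vs ", -0.5 * np.linalg.solve(Q_uu, Q_u))
print("v_tilde ->", vinf, " vs ", -0.5 * np.linalg.solve(Q_uu, Q_ux))
```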

Exercise 10.6***

The formula for an unconstrained Gaussian integral in n dimensions reads

$$\int e^{-\frac{1}{2} x^T A x + x^T B}\, d^n x = \sqrt{\frac{(2\pi)^n}{|A|}}\, e^{\frac{1}{2} B^T A^{-1} B}.$$

Show that when a constraint $\sum_{i=1}^{n} x_i \le \bar{X}$ with a parameter $\bar{X}$ is imposed on the integration variables, a constrained version of this integral reads

$$\int e^{-\frac{1}{2} x^T A x + x^T B}\, \theta\!\left( \bar{X} - \sum_{i=1}^{n} x_i \right) d^n x = \sqrt{\frac{(2\pi)^n}{|A|}}\, e^{\frac{1}{2} B^T A^{-1} B} \left( 1 - N\!\left( \frac{B^T A^{-1} \mathbb{1} - \bar{X}}{\sqrt{\mathbb{1}^T A^{-1} \mathbb{1}}} \right) \right),$$

where N(·) is the cumulative normal distribution.
Hint: use the integral representation of the Heaviside step function

$$\theta(x) = \lim_{\varepsilon \to 0} \frac{1}{2\pi i} \int_{-\infty}^{\infty} \frac{e^{izx}}{z - i\varepsilon}\, dz.$$

Solution 10.6
Writing $\sum_i x_i = x^T \mathbb{1}$, where $\mathbb{1}$ is a vector of ones, the integral representation of the Heaviside step function reads

$$\theta\!\left( \bar{X} - x^T \mathbb{1} \right) = \lim_{\varepsilon \to 0} \frac{1}{2\pi i} \int_{-\infty}^{\infty} \frac{e^{iz\left(\bar{X} - x^T \mathbb{1}\right)}}{z - i\varepsilon}\, dz.$$

This gives

$$\int e^{-\frac{1}{2} x^T A x + x^T B}\, \theta\!\left( \bar{X} - \sum_{i=1}^{n} x_i \right) d^n x = \int e^{-\frac{1}{2} x^T A x + x^T B} \left( \frac{1}{2\pi i} \int_{-\infty}^{\infty} \frac{e^{iz\left(\bar{X} - x^T \mathbb{1}\right)}}{z - i\varepsilon}\, dz \right) d^n x,$$

where we should take the limit ε → 0 at the end of the calculation. Exchanging the order of integration, the inner integral with respect to x can be evaluated using the unconstrained formula with the substitution B → B − iz𝟙. The integral becomes (here $a := \mathbb{1}^T A^{-1} \mathbb{1}$)

$$\sqrt{\frac{(2\pi)^n}{|A|}}\, e^{\frac{1}{2} B^T A^{-1} B}\, \frac{1}{2\pi i} \int_{-\infty}^{\infty} \frac{e^{iz\bar{X} - \frac{1}{2} a z^2 - iz\, B^T A^{-1} \mathbb{1}}}{z - i\varepsilon}\, dz$$

$$= \sqrt{\frac{(2\pi)^n}{|A|}}\, e^{\frac{1}{2} B^T A^{-1} B - \frac{\left(\bar{X} - B^T A^{-1} \mathbb{1}\right)^2}{2a}}\, \frac{1}{2\pi i} \int_{-\infty}^{\infty} \frac{e^{-\frac{a}{2}\left(z + i\,\frac{B^T A^{-1} \mathbb{1} - \bar{X}}{a}\right)^2}}{z - i\varepsilon}\, dz.$$

The integral over z can be evaluated by changing the variable $z \to z - i\,\frac{B^T A^{-1} \mathbb{1} - \bar{X}}{a}$, taking the limit ε → 0, and using the following formula (here $\beta := \frac{B^T A^{-1} \mathbb{1} - \bar{X}}{a}$):

$$\frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\beta\, e^{-\frac{1}{2} a z^2}}{\beta^2 + z^2}\, dz = \frac{\beta}{\pi} \int_{0}^{\infty} \frac{e^{-\frac{1}{2} a z^2}}{\beta^2 + z^2}\, dz = \left( 1 - N\!\left( \beta \sqrt{a} \right) \right) e^{\frac{a\beta^2}{2}},$$

where in the second equation we used Eq. (3.466) from Gradshteyn and Ryzhik². Using this relation, we finally obtain

$$\int e^{-\frac{1}{2} x^T A x + x^T B}\, \theta\!\left( \bar{X} - \sum_{i=1}^{n} x_i \right) d^n x = \sqrt{\frac{(2\pi)^n}{|A|}}\, e^{\frac{1}{2} B^T A^{-1} B} \left( 1 - N\!\left( \frac{B^T A^{-1} \mathbb{1} - \bar{X}}{\sqrt{\mathbb{1}^T A^{-1} \mathbb{1}}} \right) \right).$$

Note that in the limit $\bar{X} \to \infty$, we recover the unconstrained integral formula.

2 I.S. Gradshteyn and I.M. Ryzhik, "Table of Integrals, Series, and Products", Elsevier, 2007.
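The constrained-integral formula can also be verified numerically: dividing both sides by the unconstrained integral, it states that the probability of the event $\mathbb{1}^T x \le \bar{X}$ under $x \sim N(A^{-1}B, A^{-1})$ equals $1 - N\!\left(\left(B^T A^{-1}\mathbb{1} - \bar{X}\right)/\sqrt{\mathbb{1}^T A^{-1}\mathbb{1}}\right)$. The sketch below (with an arbitrary small test matrix A and vector B) checks this against a Monte Carlo estimate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 3
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite (assumed test input)
B = rng.normal(size=n)
X_bar = 1.0
ones = np.ones(n)

A_inv = np.linalg.inv(A)
# Closed form: ratio of the constrained to the unconstrained integral
ratio_formula = 1.0 - norm.cdf((B @ A_inv @ ones - X_bar) / np.sqrt(ones @ A_inv @ ones))

# Monte Carlo: the same ratio is P(1^T x <= X_bar) for x ~ N(A^{-1} B, A^{-1})
x = rng.multivariate_normal(mean=A_inv @ B, cov=A_inv, size=500_000)
ratio_mc = (x.sum(axis=1) <= X_bar).mean()

print("closed form:", round(ratio_formula, 4), "  Monte Carlo:", round(ratio_mc, 4))
```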
Chapter 11
Inverse Reinforcement Learning and Imitation
Learning

Exercise 11.1

a. Derive Eq.(11.7).
b. Verify that the optimization problem in Eq.(11.10) is convex.

Solution 11.1

a. To find the optimal policy corresponding to the regularized value function (11.8), we use the method of Lagrange multipliers and add the normalization constraint to this functional:

$$F^{\pi}(s) = \int \pi(a|s) \left( r(s, a) - \frac{1}{\beta} \log \pi(a|s) \right) da + \lambda \left( \int \pi(a|s)\, da - 1 \right), \qquad (26)$$

where λ is the Lagrange multiplier. Taking the variational derivative with respect to π(a|s) and equating it to zero, we obtain

$$r(s, a) - \frac{1}{\beta} \log \pi(a|s) - \frac{1}{\beta} + \lambda = 0.$$

Re-arranging this relation, we obtain

$$\pi(a|s) = e^{\beta (r(s, a) + \lambda) - 1}.$$

The value of the Lagrange multiplier λ can now be found from the normalization constraint $\int \pi(a|s)\, da = 1$. This produces the final result

$$\pi(a|s) = \frac{1}{Z_\beta(s)}\, e^{\beta r(s, a)}, \qquad Z_\beta(s) \equiv \int e^{\beta r(s, a)}\, da.$$

b. Consider the objective function defined in Eq. (11.10):

$$G(\lambda) = \log Z_\lambda - \lambda\, \bar{r}(s).$$

Taking the second derivative with respect to λ, we obtain

$$\frac{d^2 G}{d\lambda^2} = \frac{Z''(\lambda)}{Z(\lambda)} - \left( \frac{Z'(\lambda)}{Z(\lambda)} \right)^2 = \int r^2(s, a)\, \pi(a|s)\, da - \left( \int r(s, a)\, \pi(a|s)\, da \right)^2 = \text{Var}(r).$$

As the variance Var(r) is always non-negative, the optimization problem in Eq. (11.10) is convex.

Exercise 11.2

Consider the policy optimization problem with one-dimensional state and action spaces and the following parameterization of the one-step reward:

$$r(s, a) = -\log\left( 1 + e^{-\theta^T \Psi(s, a)} \right),$$

where θ is a vector of K parameters and Ψ(s, a) is a vector of K basis functions. Verify that this is a concave function of a as long as the basis functions Ψ(s, a) are linear in a.

Solution 11.2

If the basis functions Ψ(s, a) are linear in a while having an arbitrary dependence on s, the second derivative of the reward is

$$\frac{\partial^2 r}{\partial a^2} = -\left( \theta^T \frac{\partial \Psi}{\partial a} \right)^2 \frac{e^{-\theta^T \Psi(s, a)}}{\left( 1 + e^{-\theta^T \Psi(s, a)} \right)^2} \le 0.$$

Exercise 11.3

Verify that variational maximization with respect to the classifier D(s, a) in Eq. (11.75) reproduces the Jensen–Shannon divergence (11.72).

Solution 11.3

The GAIL loss function (11.75) can be written as

$$\Psi^{*}_{GA} = \max_{D \in [0,1]^{\mathcal{S}\times\mathcal{A}}} \int \rho_E(s, a) \log D(s, a)\, ds\, da + \int \rho_\pi(s, a) \log\left( 1 - D(s, a) \right) ds\, da.$$

Taking the variational derivative of this expression with respect to D(s, a) and setting it to zero, we obtain

$$\frac{\rho_E(s, a)}{D(s, a)} - \frac{\rho_\pi(s, a)}{1 - D(s, a)} = 0.$$

Re-arranging terms in this expression, we obtain Eq. (11.76):

$$D(s, a) = \frac{\rho_E(s, a)}{\rho_\pi(s, a) + \rho_E(s, a)}.$$

Substituting this expression into the first expression for $\Psi^{*}_{GA}$, we obtain

$$\Psi^{*}_{GA} = \int \rho_E(s, a) \log \frac{\rho_E(s, a)}{\rho_\pi(s, a) + \rho_E(s, a)}\, ds\, da + \int \rho_\pi(s, a) \log \frac{\rho_\pi(s, a)}{\rho_\pi(s, a) + \rho_E(s, a)}\, ds\, da.$$

Dividing the numerators and denominators inside the logarithms in both terms by two and re-arranging terms, we obtain

$$\Psi^{*}_{GA} = D_{KL}\left( \rho_\pi \,\Big\|\, \tfrac{1}{2}(\rho_\pi + \rho_E) \right) + D_{KL}\left( \rho_E \,\Big\|\, \tfrac{1}{2}(\rho_\pi + \rho_E) \right) - \log 4 = D_{JS}(\rho_\pi, \rho_E) - \log 4,$$

where $D_{JS}(\rho_\pi, \rho_E)$ is the Jensen–Shannon divergence (11.72).
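For a quick numerical sanity check of this result (using two made-up discrete distributions in place of the occupancy measures $\rho_E$ and $\rho_\pi$), the sketch below plugs the optimal discriminator $D^{*} = \rho_E/(\rho_\pi + \rho_E)$ into the GAIL objective and compares the value with $D_{JS}(\rho_\pi, \rho_E) - \log 4$, where $D_{JS}$ is computed as the sum of the two KL terms against the mixture, as in the solution above.

```python
import numpy as np

rho_E = np.array([0.1, 0.4, 0.3, 0.2])      # assumed "expert" occupancy measure
rho_pi = np.array([0.25, 0.25, 0.3, 0.2])   # assumed policy occupancy measure

D_star = rho_E / (rho_pi + rho_E)           # optimal discriminator
objective = np.sum(rho_E * np.log(D_star) + rho_pi * np.log(1.0 - D_star))

mix = 0.5 * (rho_E + rho_pi)
kl = lambda p, q: np.sum(p * np.log(p / q))
d_js = kl(rho_pi, mix) + kl(rho_E, mix)     # JS divergence without the 1/2 factors, as used above

print("GAIL objective at D* :", round(objective, 6))
print("D_JS - log 4         :", round(d_js - np.log(4.0), 6))
```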

Exercise 11.4

Using the definition (11.71) of the convex conjugate function $\psi^{*}$ of a differentiable convex function ψ(x) of a scalar variable x, show that (i) $\psi^{*}$ is convex, and (ii) $\psi^{**} = \psi$.

Solution 11.4

For a convex differentiable function ψ(y) of a scalar variable y, the convex conjugate

$$\psi^{*}(x) = \sup_y \left( xy - \psi(y) \right)$$

coincides with the Legendre transform. Taking the derivative of the expression in brackets with respect to y and equating it to zero, we have

$$\psi'(y) = x \;\; \Rightarrow \;\; y_x = g(x),$$

where $g = [\psi']^{-1}$ is the inverse function of ψ', so that we have

$$\psi'(g(x)) \equiv \psi'(y_x) = x.$$

Differentiating both sides of this equation with respect to x, we obtain

$$g'(x) = \frac{1}{\psi''(g(x))}.$$

We can therefore write the convex conjugate as a composite function of x:

$$\psi^{*}(x) = x\, y_x - \psi(y_x),$$

where $y_x = g(x)$. Differentiating this expression, we have

$$\frac{d\psi^{*}(x)}{dx} = y_x + x \frac{dy_x}{dx} - \frac{d\psi(y_x)}{dy_x}\frac{dy_x}{dx} = y_x + x \frac{dy_x}{dx} - x \frac{dy_x}{dx} = y_x.$$

Differentiating a second time, we have

$$\frac{d^2\psi^{*}(x)}{dx^2} = \frac{dy_x}{dx} = g'(x) = \frac{1}{\psi''(g(x))} \ge 0.$$

Therefore, the convex conjugate (Legendre transform) of a convex differentiable function ψ(y) is convex.
To show that $\psi^{**} = \psi$, first note that the original function can be written in terms of its transform:

$$\psi(y_x) = x\, y_x - \psi^{*}(x).$$

Using this, we obtain

$$\psi^{**}(x) = \left. \Big( x\, p_x - \psi^{*}(p_x) \Big) \right|_{\frac{d\psi^{*}(p)}{dp}\big|_{p = p_x} = x} = \left. \Big( x\, p_x - \psi^{*}(p_x) \Big) \right|_{g(p_x) = x} = g(p_x)\, p_x - \psi^{*}(p_x) = \psi\!\left( g(p_x) \right) = \psi(x),$$

where we replaced x by $g(p_x)$ in the third step, and replaced $g(p_x)$ by x in the last step.

Exercise 11.5

Show that the choice $f(x) = x \log x - (x + 1) \log \frac{x+1}{2}$ in the definition of the f-divergence (11.73) gives rise to the Jensen–Shannon divergence (11.72) of the distributions P and Q.

Solution 11.5

Using $f(x) = x \log x - (x + 1) \log \frac{x+1}{2}$ in Eq. (11.73), we obtain

$$D_f(P \| Q) = \int q(x)\, f\!\left( \frac{p(x)}{q(x)} \right) dx = \int p(x) \log \frac{2\, p(x)}{p(x) + q(x)}\, dx + \int q(x) \log \frac{2\, q(x)}{p(x) + q(x)}\, dx = D_{JS}(p, q),$$

where $D_{JS}(p, q)$ is the Jensen–Shannon divergence (11.72) between p and q.
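A short numerical check of this identity (with two arbitrary discrete distributions standing in for P and Q): it evaluates the f-divergence with the generator f above and compares it with the Jensen–Shannon divergence computed, as in the previous solutions, as the sum of the two KL terms against the mixture $\frac{1}{2}(p + q)$.

```python
import numpy as np

p = np.array([0.1, 0.5, 0.2, 0.2])          # assumed discrete distribution P
q = np.array([0.3, 0.2, 0.25, 0.25])        # assumed discrete distribution Q

f = lambda u: u * np.log(u) - (u + 1.0) * np.log((u + 1.0) / 2.0)
d_f = np.sum(q * f(p / q))                  # f-divergence D_f(P || Q)

m = 0.5 * (p + q)
kl = lambda a, b: np.sum(a * np.log(a / b))
d_js = kl(p, m) + kl(q, m)                  # JS divergence as the sum of the two KL terms

print("D_f(P||Q) =", round(d_f, 8), "   D_JS(P, Q) =", round(d_js, 8))
```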

Exercise 11.6

In the example of learning a straight line from Sect. 5.6, compute the KL divergences $D_{KL}(P_\theta \| P_E)$ and $D_{KL}(P_E \| P_\theta)$, and the JS divergence $D_{JS}(P_\theta, P_E)$.

Solution 11.6

Let $P_E$ be the distribution corresponding to a vertical line at x = 0, and $P_\theta$ be the distribution corresponding to a vertical line at x = θ. They can be written as

$$P_E(x, z) = \delta(x)\, z, \qquad P_\theta(x, z) = \delta(x - \theta)\, z.$$

As they have non-overlapping supports for θ ≠ 0, both KL divergences diverge in that case: $D_{KL}(P_\theta \| P_E) = D_{KL}(P_E \| P_\theta) = +\infty$, while for θ = 0 both KL divergences vanish. On the other hand, the JS divergence is $D_{JS}(P_\theta, P_E) = 2 \log 2$ for θ ≠ 0, while it vanishes for θ = 0.
