Instructor's Manual
Matthew Dixon
Department of Applied Math, Illinois Institute of Technology e-mail: [email protected]
Igor Halperin
NYU Tandon School of Engineering and Fidelity Investments, e-mail: [email protected]
Paul Bilokon
Thalesians Ltd, London, e-mail: [email protected]
Introduction
This book is written for advanced graduate students and academics in the mathematical sciences, in addition to quants and data scientists in the field of finance. Readers will find it useful as a bridge from these well-established foundational topics to applications of machine learning in finance. Machine learning is presented as a non-parametric extension of financial econometrics, with an emphasis on novel algorithmic representations of data, regularization and model averaging to improve out-of-sample forecasting. The key distinguishing feature from classical financial econometrics is the absence of an assumption on the data generation process. This places correspondingly greater demands on the reader's grounding in probability, statistics, and time series analysis. The book therefore assumes, and does not provide, concepts in elementary probability and statistics. In particular, undergraduate preparation in probability theory should include discrete and continuous random variables, conditional probabilities and expectations, and Markov chains. Statistics preparation includes experiment design, statistical inference, regression and logistic regression models, and analysis of time series, with examples in ARMA models. Preparation in financial econometrics and Bayesian statistics, in addition to some experience in the capital markets or in investment management, is advantageous but not necessary.
Our experience in teaching upper-level undergraduate and graduate programs in machine learning in finance, and related courses in the departments of applied math and financial engineering, has been that students with weak programming skills, despite having strong math backgrounds, have difficulty with the programming assignments. It is therefore our recommendation that a course in Python programming be a prerequisite, or that a Python bootcamp be run in conjunction with the beginning of the course. The course should equip students with a solid foundation in data structures, elementary algorithms and control flow in Python. Some supplementary material to support programming has been provided in the Appendices of the book, with references to further supporting material.
Students with a background in computer science often have a distinct advantage in the programming assignments, but often need to be referred to other textbooks on probability and time series analysis first. Exercises at the end of Chapter 1 will be especially helpful in adapting to the mindset of a quant, with the focus on economic games and simple numerical puzzles. In general we encourage liberal use of these applied probability problems as they aid understanding of the key mathematical ideas and build intuition for how they translate into practice.

Overview of the Textbook
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 8 presents various neural network models for financial time series analysis, providing examples of how they relate to well-known techniques in financial econometrics. Recurrent neural networks (RNNs) are presented as non-linear time series models and generalize classical linear time series models such as AR(p). They provide a powerful approach for prediction in financial time series and generalize to non-stationary data. The chapter also presents convolutional neural networks for filtering time series data and exploiting different scales in the data. Finally, this chapter demonstrates how autoencoders are used to compress information and generalize principal component analysis.
Chapter 9
Chapter 9 introduces Markov Decision Processes and the classical methods of dynamic programming, before building familiarity with the ideas of reinforcement learning and other approximate methods for solving MDPs. After describing Bellman optimality and iterative value and policy updates, the chapter quickly advances towards a more engineering-style exposition of the topic, covering key computational concepts such as greediness, batch learning, and Q-learning. Through a number of mini-case studies, the chapter provides insight into how RL is applied to optimization problems in asset management and trading. These examples are each supported with Python notebooks.
Chapter 10
Chapter 11
the “ground truth" rewards. We then present use cases for IRL in quantitative
finance that include applications to trading strategy identification, sentiment-based
trading, option pricing, and market modeling.
Chapter 12
Exercises
The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been chosen to reinforce concepts explained in the text, to stimulate the application of machine learning in finance, and to gently bridge material in other chapters. Each is graded according to difficulty, ranging from (*), which denotes a simple exercise taking a few minutes to complete, through to (***), which denotes a significantly more complex exercise. Unless specified otherwise, all equations referred to in each exercise correspond to those in the corresponding chapter.
Python Notebooks
Instructor Materials
This Instructor's Manual provides worked solutions to almost all of the end-of-chapter questions. Additionally, this Instructor's Manual is accompanied by a folder with notebook solutions to some of the programming assignments. Occasionally, some additional notes on the notebook solution are also provided.
Contents
1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
2Probabilistic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5Interpretability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
3Programming Related Questions* . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
xv
Part II Sequential Learning
6Sequence Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Suppose that two players enter into a market game. The rules of the game are as follows: Player 1 is the market maker, and Player 2 is the market taker. In each round, Player 1 is provided with information x, and must choose and declare a value α ∈ (0, 1) that determines how much it will pay out if a binary event G occurs in the round. G ∼ Bernoulli(p), where p = g(x|θ) for some unknown parameter θ.
Player 2 then enters the game with a $1 payment and chooses one of the following payoffs:
$$V_1(G, p) = \begin{cases} \frac{1}{\alpha} & \text{with probability } p \\ 0 & \text{with probability } (1 - p) \end{cases}$$
or
$$V_2(G, p) = \begin{cases} 0 & \text{with probability } p \\ \frac{1}{1-\alpha} & \text{with probability } (1 - p) \end{cases}$$
a. Given that α is known to Player 2, state the strategy¹ that will give Player 2 an expected payoff, over multiple games, of $1 without knowing p.
b. Suppose now that p is known to both players. In a given round, what is the optimal choice of α for Player 1?
c. Suppose Player 2 knows with complete certainty that G will be 1 for a particular round; what will be the payoff for Player 2?
d. Suppose Player 2 has complete knowledge in rounds {1, ..., i} and can reinvest payoffs from earlier rounds into later rounds. Further suppose without loss of generality that G = 1 for each of these rounds. What will be the payoff for Player 2 after i rounds? You may assume that each game can be played with fractional dollar costs, so that, for example, if Player 2 pays Player 1 $1.5 to enter the game, then the payoff will be 1.5V1.

¹ The strategy refers to the choice of weight if Player 2 is to choose a payoff V = wV1 + (1 − w)V2, i.e., a weighted combination of payoffs V1 and V2.
Solution 1.1
a. Choosing the weight w = α gives the expected payoff
$$\alpha\,\frac{p}{\alpha} + (1-\alpha)\,\frac{1-p}{1-\alpha} = p + (1 - p) = 1.$$
So Player 2 is expected to win or at least break even, regardless of their level of knowledge, merely because Player 1 had to move first.
b. If Player 1 chooses α = p, then the expected payoff to Player 2 is exactly $1. Suppose Player 1 chooses α > p, then Player 2 takes V2 and the payoff is
$$\frac{1-p}{1-\alpha} > 1.$$
And similarly if Player 1 chooses α < p, then Player 2 chooses V1 and also expects to earn more than $1.
N.B.: If Player 1 chooses α = p, then Player 2 in expectation earns exactly $1 regardless of the choices made by Player 2.
c. Since G = 1 is known to Player 2, they choose V1 and earn $1/α.
d. After the first round, Player 2 will have received $1/α₁. Reinvesting this in the second round, the payoff will be $1/(α₁α₂). Therefore after i rounds, the payout to Player 2 will be:
$$\prod_{k=1}^{i} \frac{1}{\alpha_k}.$$
N.B.: This problem and the next are developing the intuition for the
use of logarithms in entropy. One key part of this logic is that the “right”
way to consider the likelihood of a dataset is by multiplying together the
probability of each observation, not summing or averaging them. Another
reason to prefer a product is that under a product the individual events form a σ-algebra, such that any subset of events is itself an event that is priced fairly by the game. For instance, a player can choose to bet on events i = 1 and i = 2 both happening, and the resulting event is priced fairly, at odds calculated from its actual probability of occurrence, p₁p₂.
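The following is a minimal Monte Carlo sketch (not the book's code) of the result in part (a): with weight w = α on payoff V1, Player 2's expected payoff is $1 for any p. The function name and the specific values of α and p are illustrative assumptions.

```python
import numpy as np

def expected_payoff(alpha, p, n_rounds=500_000, seed=0):
    rng = np.random.default_rng(seed)
    G = rng.random(n_rounds) < p                      # G ~ Bernoulli(p)
    v1 = np.where(G, 1.0 / alpha, 0.0)                # V1 pays 1/alpha when G = 1
    v2 = np.where(G, 0.0, 1.0 / (1.0 - alpha))        # V2 pays 1/(1-alpha) when G = 0
    w = alpha                                         # Player 2's hedging weight
    return np.mean(w * v1 + (1.0 - w) * v2)

print(expected_payoff(alpha=0.3, p=0.7))              # ~1.0 for any p in (0, 1)
```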
Recall Example 1.2. Suppose additional information was added such that it is
no longer possible to predict the outcome with 100% probability. Consider Table
1 as the results of some experiment.
G   x
A   (0, 1)
B   (1, 1)
B   (1, 0)
C   (1, 0)
C   (0, 0)

Table 1: Sample model data.
Now if we are presented with x = (1, 0), the result could be B or C. Consider three different models applied to this value of x which encode the value A, B or C.

f((1, 0)) = (0, 1, 0),  Predicts B with 100% certainty.    (1)
c. Suppose that the market game in Exercise 1 is now played with models f or g. Show that losses to Player 1 are unbounded when x = (1, 0) and α = 1 − p.
d. B or C each triggers two separate payoffs, V1 and V2 respectively. Show that the losses to Player 1 are bounded.
Solution 1.2
Additional problems of this sort can be generated by simply requiring that the
data set in question have at least two rows with identical values of x, but
different values of G. This ensures that no model could predict all of the events
(since a model must be a function of x), and thus it is necessary to consider the
probabilities assigned to mispredicted events.
Provided that some of the models assign absolute certainty (p = 1.0) to
some incorrectly predicted outcomes, the unbounded losses in Part (3) will occur
for those models.
Example 1.1 and the associated discussion alluded to the notion that some
types of models are more common than others. This exercise will explore that
concept briefly.
Recall Table 1.1 from Example 1.1:

G   x
A   (0, 1)
B   (1, 1)
C   (1, 0)
C   (0, 0)
For this exercise, consider two models "similar" if they produce the same projections for G when applied to the values of x from Table 1.1 with probability strictly greater than 0.95.
In the following subsections, the goal will be to produce sets of mutually
dissimilar models that all produce Table 1.1 with a given likelihood.
a. How many similar models produce Table 1.1 with likelihood 1.0?
b. Produce at least 4 dissimilar models that produce Table 1.1 with likelihood 0.9.
c. How many dissimilar models can produce Table 1.1 with likelihood exactly
0.95?
Solution 1.3
a. There is only one model that can produce Table 1.1 with likelihood 1.0; it is
$$g(x) = \begin{cases} \{1, 0, 0\} & \text{if } x = (0, 1) \\ \{0, 1, 0\} & \text{if } x = (1, 1) \\ \{0, 0, 1\} & \text{if } x = (1, 0) \\ \{0, 0, 1\} & \text{if } x = (0, 0) \end{cases} \qquad (4)$$
There are no dissimilar models that can produce the output in Table 1.1 with likelihood 1.0.
b. This can be done many ways, but most easily by perfectly predicting some
rows and not others. One such set of models is:
When the data is i.i.d., the negative of the log-likelihood function (the "error function") for a binary classifier is the cross-entropy
$$E(\theta) = -\sum_{i=1}^{n} G_i \ln\left(g_1(x_i \mid \theta)\right) + (1 - G_i)\ln\left(g_0(x_i \mid \theta)\right).$$
Suppose now that there is a probability πᵢ that the class label on a training data point xᵢ has been correctly set. Write down the error function corresponding to the negative log-likelihood. Verify that the error function in the above equation is obtained when πᵢ = 1. Note that this error function renders the model robust to incorrectly labeled data, in contrast to the usual least squares error function.
Solution 1.4
Given the probability πᵢ that the class label is correctly assigned, the error function can be written as
$$E(\theta) = -\sum_{i=1}^{n} \pi_i \ln\left(g_1(x_i \mid \theta)\right) + (1 - \pi_i)\ln\left(g_0(x_i \mid \theta)\right).$$
Clearly when πᵢ = 1 we recover the cross-entropy given by the original error function.
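A small numpy sketch of the modified error function written above; the function name and argument layout are illustrative, not the book's notebook code.

```python
import numpy as np

def robust_cross_entropy(pi, g1):
    """pi: prob. that each label is correctly set; g1: model prob. of class 1."""
    return -np.sum(pi * np.log(g1) + (1.0 - pi) * np.log(1.0 - g1))
```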
Derive Eq. 1.17 by setting the derivative of Eq. 1.16 with respect to the
time-t action ut to zero. Note that Equation 1.17 gives a non-parametric
expression for the optimal action ut in terms of a ratio of two conditional
expectations. To be useful in practice, the approach might need some further
modification as you will use in the next exercise.
Solution 1.5
This gives
$$\mathbb{E}[\phi_t \mid S_t] - 2\lambda u_t\,\mathbb{V}[\phi_t \mid S_t] = 0,$$
and rearranging gives the optimal action:
$$u_t = \frac{\mathbb{E}[\phi_t \mid S_t]}{2\lambda\,\mathbb{V}[\phi_t \mid S_t]}.$$
Solution 1.6
Plugging the basis function representation for u_t into Eq. 1.16 and setting the derivative of the resulting expression with respect to θ_{k'} = θ_{k'}(t) to zero, we obtain a system of linear equations for each k' = 1, ..., K:
$$2\lambda \sum_{k=1}^{K} \theta_k\,\mathbb{E}\left[\Psi_k(S)\Psi_{k'}(S)\,\mathbb{V}[\phi \mid S]\right] = \mathbb{E}\left[\Psi_{k'}(S)\,\phi\right].$$
This system of linear equations can be conveniently solved by defining a pair of a matrix A and vector B whose elements are defined as follows:
$$A_{kk'} = \mathbb{E}\left[\Psi_k(S)\Psi_{k'}(S)\,\mathbb{V}[\phi \mid S]\right], \qquad B_{k'} = \mathbb{E}\left[\Psi_{k'}(S)\,\phi\right].$$
The solution of the linear system above is then given by the following simple matrix-valued formula for the vector θ of parameters θ_k:
$$\theta = \frac{1}{2\lambda} A^{-1} B.$$
Note that this relation involves matrix inversion, and in practice might have to be regularized by adding an identity matrix multiplied by a tiny regularization parameter to the matrix A before the inversion step.
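The following is a minimal numpy sketch of the regularized solve described above; A, B and lam stand in for the estimated basis-function expectations and the risk-aversion parameter, and eps is an illustrative ridge term.

```python
import numpy as np

def solve_theta(A, B, lam, eps=1e-8):
    A_reg = A + eps * np.eye(A.shape[0])            # small ridge term before inversion
    return np.linalg.solve(2.0 * lam * A_reg, B)    # theta = (1/(2*lam)) A^{-1} B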
Chapter 2
Probabilistic Modeling
Solution 2.1
Let F denote the presence of fraud, F^c denote its absence, and + denote a positive audit result (i.e. the audit revealed fraudulent accounting). Then P(+|F) = 0.95, P(+|F^c) = 0.01, and P(F) = 0.001. Then according to Bayes' theorem
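A quick numerical evaluation of the Bayes' theorem calculation set up above, using the stated values P(+|F) = 0.95, P(+|F^c) = 0.01, and P(F) = 0.001 (a sketch, not the manual's worked text).

```python
p_pos_F, p_pos_Fc, p_F = 0.95, 0.01, 0.001
p_F_pos = p_pos_F * p_F / (p_pos_F * p_F + p_pos_Fc * (1.0 - p_F))
print(p_F_pos)   # approximately 0.0868
```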
A currency strategist has estimated that JPY will strengthen against USD with probability 60% if S&P 500 continues to rise. JPY will strengthen against USD with probability 95% if S&P 500 falls or stays flat. We are in an upward trending market at the moment, and we believe that the probability that S&P 500 will rise is 70%. We then learn that JPY has actually strengthened against USD. Taking this new information into account, what is the probability that S&P 500 will rise?
Hint: Recall Bayes' rule:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
Solution 2.2
$$p(\theta \mid \alpha, \beta) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)!(\beta - 1)!}\,\theta^{\alpha-1}(1 - \theta)^{\beta-1} = \Gamma(\alpha, \beta)\,\theta^{\alpha-1}(1 - \theta)^{\beta-1} \qquad (14)$$
to evaluate the integral in the marginal density function.
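A quick numerical check of the Bayes' rule calculation that the exercise above asks for, using only the probabilities quoted in the exercise; this is a sketch and not the manual's worked solution.

```python
p_up_rise, p_up_fall, p_rise = 0.60, 0.95, 0.70
p_rise_up = p_up_rise * p_rise / (p_up_rise * p_rise + p_up_fall * (1.0 - p_rise))
print(p_rise_up)   # approximately 0.596
```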
If θ represents the currency analyst’s opinion of JPY strengthening against the
dollar, what is the probability that the model overestimates the analyst’s estimate?
Solution 2.3
$$p(G \mid X, \theta) = \prod_{i=1}^{n} p(G = G_i \mid X = x_i, \theta)$$
and since the conditional probability of G_i = 1 given X = x_i is p_i = g_1(x_i | θ), it follows that
$$p(G \mid X, \theta) = \prod_{i=1}^{n} p(G = G_i \mid X = x_i, \theta) = \prod_{i=1}^{n} p_i^{G_i}(1 - p_i)^{1 - G_i}$$
and S&P 500 is observed to rise for 3 consecutive days. So the likelihood function is given by θ²(1 − θ) and from Bayes' law:
$$p(\theta \mid G, X) = \frac{p(G \mid X, \theta)\,p(\theta)}{p(G \mid X)} = \frac{\theta^2(1 - \theta)}{\int_0^1 \theta^2(1 - \theta)\,d\theta} = \Gamma(3, 2)\,\theta^2(1 - \theta),$$
where we have used the Beta density function with a scaling constant Γ(α, β)
$$p(\theta \mid \alpha, \beta) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)!(\beta - 1)!}\,\theta^{\alpha-1}(1 - \theta)^{\beta-1} = \Gamma(\alpha, \beta)\,\theta^{\alpha-1}(1 - \theta)^{\beta-1}$$
to evaluate the integral in the marginal density function. The probability that the model overestimates the analyst's estimate is:
$$P[\theta > 0.6 \mid G, X] = \Gamma(3, 2)\int_{0.6}^{1}\theta^2(1 - \theta)\,d\theta = 1 - 12\left[\frac{\theta^3}{3} - \frac{\theta^4}{4}\right]_0^{0.6} = 1 - 0.4752 = 0.5248.$$
Suppose that you observe the following daily sequence of directional changes in
the JPY/USD exchange rate (U (up), D(down or stays flat)):
U, D, U, U, D
and the corresponding daily sequence of S&P 500 returns is
−0.05, 0.01, −0.01, −0.02, 0.03
You propose the following probability model to explain the behavior of
JPY against USD given the directional changes in S&P 500 returns: Let G
denote a Bernoulli R.V., where G = 1 corresponds to JPY strengthening against
the dollar and r are the S&P 500 daily returns. All observations of G are
conditionally independent (but *not* identical) so that the likelihood is
$$p(G \mid r, \theta) = \prod_{i=1}^{n} p(G = G_i \mid r = r_i, \theta)$$
where
$$p(G_i = 1 \mid r = r_i, \theta) = \begin{cases} \theta_u, & r_i > 0 \\ \theta_d, & r_i \leq 0 \end{cases}$$
Compute the full expression for the likelihood that the data was generated by
this model.
Solution 2.4
Suppose you observe the following daily sequence of directional changes in the stock market (U (up), D (down)):
U, D, U, U, D, D, D, D, U, U, U, U, U, U, U, D, U, D, U, D,
U, D, D, D, D, U, U, D, U, D, U, U, U, D, U, D, D, D, U, U,
D, D, D, U, D, U, D, U, D, D
You compare two models for explaining its behavior. The first model, M1, assumes that the probability of an upward movement is fixed to 0.5 and the data is i.i.d.
The second model, M2, also assumes the data is i.i.d., but that the probability of an upward movement is set to an unknown θ ∈ Θ = (0, 1) with a uniform prior on θ: p(θ|M2) = 1. For simplicity, we additionally choose a uniform model prior p(M1) = p(M2).
Compute the model evidence for each model.
Compute the Bayes factor and indicate which model we should prefer in light of this data.
Solution 2.5
Which model is most likely to have generated the given data? We compute the model evidence for each model. The model evidence for M2 is computed using the Beta density function
$$p(\theta \mid \alpha, \beta) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)!(\beta - 1)!}\,\theta^{\alpha-1}(1 - \theta)^{\beta-1} = \Gamma(\alpha, \beta)\,\theta^{\alpha-1}(1 - \theta)^{\beta-1}$$
to evaluate the integral in the marginal density function above, where α = 26 and β = 26. Γ(26, 26) = 6.4469 × 10¹⁵ and $\binom{50}{25}$ = 1.264 × 10¹⁴. The Bayes factor in favor of M1 is 5.7252 and thus |ln BF| = 1.744877 and, from Jeffreys' table, we see that there is weak evidence in favor of M1 since the log of the factor is between 1 and 2.5.
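A quick numerical check of the model evidences and Bayes factor quoted above (the solution counts 25 up moves and 25 down moves in the 50 observations); scipy's betaln gives ln B(26, 26) = −ln Γ(26, 26) in the notation above.

```python
import numpy as np
from scipy.special import betaln

log_ev_m1 = 50 * np.log(0.5)            # p(y|M1) = (1/2)^50
log_ev_m2 = betaln(26, 26)              # p(y|M2) = B(26, 26) = 1/Gamma(26, 26)
log_bf = log_ev_m1 - log_ev_m2          # log Bayes factor in favour of M1
print(np.exp(log_bf), log_bf)           # approximately 5.73 and 1.745
```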
Solution 2.6
Part I. If we choose k₂ (even though there is weak evidence for the "best" model being k₁), we can write the density of the new predicted value y′ (assuming it is just one new data point) given the previous data y as the expected value of the likelihood of the new data under the posterior density p(θ|y, k₂):
$$p(y' \mid y, k_2) = \mathbb{E}_{\theta \mid y}\left[p(y' \mid y, \theta, k_2)\right] = \int_0^1 p(y' \mid y, \theta, k_2)\,p(\theta \mid y, k_2)\,d\theta.$$
The model evidence for k₂, written in terms of y, in the previous question is
$$p(y \mid k_2) = \int_0^1 p(y \mid \theta, k_2)\,p(\theta \mid k_2)\,d\theta = \frac{1}{\Gamma(26, 26)} = 1.5511 \times 10^{-16},$$
since the likelihood function is
$$p(y \mid \theta, k_2) = \prod_{i=1}^{n} \theta^{y_i}(1 - \theta)^{1 - y_i} = \theta^{\sum_{i=1}^{n} y_i}(1 - \theta)^{\sum_{i=1}^{n}(1 - y_i)} = \theta^{x}(1 - \theta)^{n - x} = \theta^{25}(1 - \theta)^{25},$$
where we have mapped the Bernoulli trials yᵢ to ones and zeros instead of U's and D's. Note that we dropped the $\binom{n}{x}$ term since this is a likelihood function over all y. By Bayes' law, the posterior density is
$$p(\theta \mid y, k_2) = \frac{p(y \mid \theta, k_2)\,p(\theta \mid k_2)}{p(y \mid k_2)} = \frac{\theta^{25}(1 - \theta)^{25}\cdot 1}{\Gamma(26, 26)^{-1}} = \Gamma(26, 26)\,\theta^{25}(1 - \theta)^{25},$$
and the prediction density is
$$p(y' \mid y, k_2) = \mathbb{E}_{\theta \mid y}\left[p(y' \mid y, \theta, k_2)\right] = \int_0^1 \Gamma(26, 26)\,\theta^{y'}(1 - \theta)^{1 - y'}\,\theta^{25}(1 - \theta)^{25}\,d\theta,$$
where we have used the likelihood function p(y′ | y, θ, k₂) = θ^{y′}(1 − θ)^{1−y′}.
For the new data we have
$$p(y' \mid y, k_2) = \int_0^1 \theta^{20}(1 - \theta)^{30}\,\theta^{25}(1 - \theta)^{25}\,\Gamma(26, 26)\,d\theta = \frac{\Gamma(26, 26)}{\Gamma(46, 56)},$$
so substituting the new model evidence and the new likelihood into Equation 15 gives the new posterior density.
Part III. Finally, let's compute the probability again, now assuming that y* is the next observation and we merge y′ and y. So now (y, y′) contains 102 observations of the stock market variable Y.
The prediction density is now
$$p(y^* \mid y, y', k_2) = \mathbb{E}_{\theta \mid y, y'}\left[p(y^* \mid y, y', \theta, k_2)\right] = \int_0^1 \Gamma(46, 56)\,\theta^{y^*}(1 - \theta)^{1 - y^*}\,\theta^{45}(1 - \theta)^{55}\,d\theta,$$
and evaluating the density at y* = 1:
$$p(y^* = 1 \mid y, y', k_2) = \frac{\Gamma(46, 56)}{\Gamma(47, 56)} = 0.4509804.$$
The probability that the next value y* will be a one has decreased after observing all data. This is because there are relatively more D's in the new data than in the old data.
Using Model 1. Under Model k₁,
$$p(y \mid k_1) = \int_0^1 p(y \mid \theta, k_1)\,p(\theta \mid k_1)\,d\theta = \frac{1}{2^{50}},$$
where p(θ|k₁) = 1 when θ = 1/2 and p(θ|k₁) = 0 otherwise (i.e., the prior is a Dirac delta function) and the likelihood function is
$$p(y \mid \theta = 1/2, k_1) = \frac{1}{2^{50}}.$$
By Bayes' law, the posterior density can only be
$$p(\theta \mid y, k_1) = \frac{p(y \mid \theta, k_1)}{p(y \mid k_1)}\,p(\theta \mid k_1) = 1, \quad \theta = 1/2,$$
and zero for all other values of θ. The prediction density, for either up or downward movements, is
$$p(y' \mid y, k_1) = \mathbb{E}_{\theta \mid y}\left[p(y' \mid y, \theta = 1/2, k_1)\right] = \frac{1}{2}.$$
Note that the prediction will not change under new data.
Solution 2.7
$$\begin{aligned}
P[G \mid X] &= \frac{P[X \mid G]\,P[G]}{P[X]} = \frac{P[X \mid G]\,P[G]}{P[X \mid G]\,P[G] + P[X \mid G^c]\,P[G^c]} \\
&= \frac{1}{1 + \frac{P[X \mid G^c]\,P[G^c]}{P[X \mid G]\,P[G]}} = \frac{1}{1 + \exp\left(\ln\frac{P[X \mid G^c]\,P[G^c]}{P[X \mid G]\,P[G]}\right)} = \frac{1}{1 + \exp\left(-\ln\frac{P[X \mid G]\,P[G]}{P[X \mid G^c]\,P[G^c]}\right)}. \qquad (16)
\end{aligned}$$
We want
$$P[G \mid X] = \frac{1}{1 + \exp\left(-\sum_{i=0}^{p} w_i X_i\right)}. \qquad (17)$$
By combining Eq. 16 and Eq. 17 we get:
$$-\ln\frac{P[X \mid G]\,P[G]}{P[X \mid G^c]\,P[G^c]} = -\sum_{i=0}^{p} w_i X_i.$$
$$p(v, h) = \frac{1}{Z}\exp\left(-E(v, h)\right), \qquad Z = \sum_{v, h}\exp\left(-E(v, h)\right).$$
Solution 2.8
Consider the first expression P[vᵢ = 1|h]. Let v₋ᵢ stand for the vector of visible units with the i-th component removed. This conditional probability can be computed as follows:
$$P[v_i = 1 \mid h] = \sum_{v_{-i}} \frac{P[v_i = 1, v_{-i}, h]}{p(h)} \qquad (20)$$
$$P[v_i = 1, v_{-i}, h] = \frac{1}{Z}\exp\left(\sum_{i'\neq i}\sum_{j=1}^{F} W_{i'j}\,v_{i'}\,h_j + \sum_{i'\neq i} b_{i'}\,v_{i'}\right)\exp\left(\sum_{j=1}^{F} W_{ij}\,h_j + b_i\right)$$
Substituting these expressions into (20) and simplifying, we obtain
$$P[v_i = 1 \mid h] = \frac{\exp\left(\sum_j W_{ij} h_j + b_i\right)}{\exp\left(\sum_j W_{ij} h_j + b_i\right) + 1} = \sigma\left(\sum_j W_{ij} h_j + b_i\right).$$
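A small sketch of how the conditional derived above is used to sample the visible units of a binary RBM given the hidden units; the weight matrix, biases and hidden vector below are illustrative placeholders, not the book's notebook values.

```python
import numpy as np

def sample_visible_given_hidden(W, b, h, rng):
    p = 1.0 / (1.0 + np.exp(-(W @ h + b)))        # P[v_i = 1 | h] = sigma(W h + b)_i
    return (rng.random(p.shape) < p).astype(float), p

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))            # 5 visible units, 3 hidden units (placeholders)
b = np.zeros(5)
h = np.array([1.0, 0.0, 1.0])
v, p = sample_visible_given_hidden(W, b, h, rng)
```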
$$\theta \mid \mathcal{D} \sim \mathcal{N}(\mu', \Sigma'),$$
with moments:
$$\mu' = \Sigma' a = \left(\Sigma^{-1} + \frac{1}{\sigma_n^2} X X^{\mathsf{T}}\right)^{-1}\left(\Sigma^{-1}\mu + \frac{1}{\sigma_n^2}\,y^{\mathsf{T}} X\right),$$
$$\Sigma' = A^{-1} = \left(\Sigma^{-1} + \frac{1}{\sigma_n^2} X X^{\mathsf{T}}\right)^{-1}.$$
Solution 3.1
See Eq. 3.12 to Eq. 3.19 in the chapter for the derivation for the posterior
distribution moments.
Suppose that the prior is p(θ) = φ(θ; µ₀, σ₀²) and the likelihood is given by
$$p(x_{1:n} \mid \theta) = \prod_{i=1}^{n}\phi(x_i; \theta, \sigma^2),$$
where σ² is assumed to be known. Show that the posterior is also normal, p(θ | x_{1:n}) = φ(θ; µ_post, σ²_post), where
$$\mu_{\text{post}} = \frac{\sigma_0^2}{\frac{\sigma^2}{n} + \sigma_0^2}\,\bar{x} + \frac{\frac{\sigma^2}{n}}{\frac{\sigma^2}{n} + \sigma_0^2}\,\mu_0, \qquad \frac{1}{\sigma^2_{\text{post}}} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2},$$
where
$$\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Solution 3.2
Starting with n = 1, we use the Schur identity for the bi-Gaussian joint distribution:
$$\mathbb{E}[\theta \mid x] = \mathbb{E}[\theta] + \frac{\mathrm{Cov}(\theta, x)}{\sigma_x^2}(x - \mathbb{E}[x]), \qquad \mathbb{V}[\theta \mid x] = \mathbb{V}[\theta] - \frac{\mathrm{Cov}^2(\theta, x)}{\sigma_x^2}.$$
Using that
$$x = \theta + \sigma z, \quad z \sim \mathcal{N}(0, 1), \qquad \theta = \mu_0 + \sigma_0\delta, \quad \delta \sim \mathcal{N}(0, 1),$$
we have
$$\mathbb{E}[x] = \mu_0, \qquad \mathbb{V}[x] = \mathbb{V}[\theta] + \sigma^2 = \sigma_0^2 + \sigma^2, \qquad \mathrm{Cov}(x, \theta) = \mathbb{E}[(x - \mu_0)(\theta - \mu_0)] = \sigma_0^2,$$
and plugging into the bi-Gaussian conditional moments and rearranging gives
$$\mu_{\text{post}} = \mathbb{E}[\theta \mid x] = \frac{\sigma_0^2\,x + \sigma^2\mu_0}{\sigma^2 + \sigma_0^2}, \qquad \sigma^2_{\text{post}} = \mathbb{V}[\theta \mid x] = \frac{\sigma^2\sigma_0^2}{\sigma^2 + \sigma_0^2} = \left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}.$$
Now for n > 1, we first show that
$$p(x_{1:n} \mid \theta) \propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \theta)^2\right\} \propto \exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2\theta\sum_{i=1}^{n} x_i + n\theta^2\right)\right\} \propto \exp\left\{-\frac{n}{2\sigma^2}(\bar{x} - \theta)^2\right\} \propto p(\bar{x} \mid \theta),$$
where we keep only terms in θ. Given that x̄ | θ ∼ N(θ, σ²/n), we can substitute x̄ into the previous result for the conditional moments to give the required result
$$\mu_{\text{post}} = \frac{\sigma_0^2\,\bar{x} + \frac{\sigma^2}{n}\mu_0}{\frac{\sigma^2}{n} + \sigma_0^2}, \qquad \sigma^2_{\text{post}} = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}.$$
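A quick numerical sanity check of the closed-form posterior moments derived above, comparing them against a brute-force grid evaluation of the posterior; the prior and likelihood parameter values below are illustrative assumptions.

```python
import numpy as np

mu0, sig0, sig, n = 1.0, 2.0, 0.5, 10           # illustrative prior/likelihood parameters
rng = np.random.default_rng(1)
x = rng.normal(loc=0.7, scale=sig, size=n)
xbar = x.mean()

# Closed-form posterior moments from the solution above
mu_post = (sig0**2 * xbar + (sig**2 / n) * mu0) / (sig**2 / n + sig0**2)
var_post = 1.0 / (1.0 / sig0**2 + n / sig**2)

# Brute-force grid posterior for comparison
theta = np.linspace(-5.0, 5.0, 20001)
dtheta = theta[1] - theta[0]
log_post = (-0.5 * ((theta - mu0) / sig0) ** 2
            - 0.5 * (((x[:, None] - theta) / sig) ** 2).sum(axis=0))
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta
print(mu_post, (theta * post).sum() * dtheta)                       # means agree
print(var_post, ((theta - mu_post) ** 2 * post).sum() * dtheta)     # variances agree
```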
Show that the predictive distribution for a Gaussian Process, with model output over a test point, f*, and assumed Gaussian noise with variance σ_n², is given by
$$f_* \mid \mathcal{D}, x_* \sim \mathcal{N}\left(\mathbb{E}[f_* \mid \mathcal{D}, x_*],\ \mathrm{var}[f_* \mid \mathcal{D}, x_*]\right).$$
Solution 3.3
Start with the joint density between y and f* given in Eq. 3.35, expressed in terms of the partitioned covariance matrix:
$$\begin{pmatrix} y \\ f_* \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_y \\ \mu_{f_*} \end{pmatrix}, \begin{pmatrix} \Sigma_{yy} & \Sigma_{y f_*} \\ \Sigma_{f_* y} & \Sigma_{f_* f_*} \end{pmatrix}\right).$$
Then use the Schur identity to derive Eq. 3.36. Finally, by writing y = f(x) + z, with an unknown f(x), the covariance is given by Σ_yy = K_{X,X} + σ_n² I, where K_{X,X} is the known covariance of f(X) over the training input X. Writing out the other covariance terms gives the required form of the predictive distribution moments, with K specified by some user-defined kernel.
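The following is a minimal sketch of the GP predictive moments implied above, assuming (purely for illustration) an RBF kernel and synthetic training data; it is not the book's notebook code.

```python
import numpy as np

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20)                            # training inputs
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=20)    # noisy training targets
x_star, sigma_n2 = np.array([0.5]), 0.01

K = rbf(X, X) + sigma_n2 * np.eye(len(X))                # K_{X,X} + sigma_n^2 I
k_star = rbf(x_star, X)                                  # K(x*, X)
mean_star = k_star @ np.linalg.solve(K, y)                               # E[f*|D, x*]
var_star = rbf(x_star, x_star) - k_star @ np.linalg.solve(K, k_star.T)   # var[f*|D, x*]
```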
Solution 3.4
Fig. 1: The predictive variance of the GP against the stock price for various training set sizes. The predictive variance is observed to monotonically decrease with training set size.
Chapter 4
Feedforward Neural Networks
Exercise 4.1
Show that substituting
$$\nabla_{ij} I_k = \begin{cases} X_j, & i = k, \\ 0, & i \neq k, \end{cases}$$
into Eq. 4.47 gives
$$\nabla_{ij}\sigma_k \equiv \frac{\partial \sigma_k}{\partial w_{ij}} = \nabla_i\sigma_k\,X_j = \sigma_k(\delta_{ki} - \sigma_i)X_j.$$
Solution 4.1
$$\frac{\partial \sigma_k}{\partial w_{ij}} = \sum_m \frac{\partial \sigma_k}{\partial I_m}\frac{\partial I_m}{\partial w_{ij}}$$
Exercise 4.2
Show that substituting the derivative of the softmax function w.r.t. w_{ij} into Eq. 4.52 gives, for the special case when the output is Y_k = 1, k = i and Y_k = 0, ∀k ≠ i:
$$\nabla_{ij} L(W, b) := \left[\nabla_W L(W, b)\right]_{ij} = \begin{cases} (\sigma_i - 1)X_j, & Y_i = 1, \\ 0, & Y_k = 0,\ \forall k \neq i. \end{cases}$$
Solution 4.2
Recalling Eq. 4.52, with σ for the softmax activation function and I(W, b) the transformation,
$$\nabla L(W, b) = \nabla(L \circ \sigma)(I(W, b)) = \nabla L(\sigma(I(W, b))) \cdot \nabla\sigma(I(W, b)) \cdot \nabla I(W, b).$$
Writing the model output as a function of W only, since we are not taking the derivative of b:
$$\hat{Y}(W) = \sigma \circ I(W).$$
The loss of a K-classifier is given by the cross-entropy:
$$L(W) = -\sum_{k=1}^{K} Y_k\,\ln\hat{Y}_k.$$
However, we want to evaluate the loss for the special case when k = i, since Y_i = 1. Therefore the loss reduces to
$$L(W) = -\ln\hat{Y}_i = -\ln\sigma_i(I(W)),$$
or equivalently
$$\nabla_{ij} L(W) = \sum_{m, n}\nabla_{\sigma_m} L(W)\,\nabla_{mn}\sigma(I)\,\nabla_{ij} I_n(W).$$
The derivative of the loss w.r.t. σ_m is
$$\nabla_{\sigma_m} L(W) = -\frac{\delta_{im}}{\sigma_m}.$$
The derivative of the softmax function w.r.t. I_n gives
$$\nabla_{I_n}\sigma_m = \sigma_m(\delta_{mn} - \sigma_n),$$
so that, combining the terms, ∇_{ij}L(W) = (σ_i − 1)X_j when Y_i = 1, and is zero if Y_k = 0, k ≠ i.
Exercise 4.3
Consider feedforward neural networks constructed using the following two types of activation functions:
• Identity:
$$\mathrm{Id}(x) := x$$
• Step function (a.k.a. Heaviside function):
$$H(x) := \begin{cases} 1, & \text{if } x \geq 0, \\ 0, & \text{otherwise.} \end{cases}$$
a. Consider a feedforward neural network with one input x ∈ R, a single hidden layer with K units having step function activations, H(x), and a single output with identity (a.k.a. linear) activation, Id(x). The output can be written as
$$\hat{f}(x) = \mathrm{Id}\left(b^{(2)} + \sum_{k=1}^{K} w_k^{(2)} H\left(b_k^{(1)} + w_k^{(1)} x\right)\right). \qquad (1)$$
Construct a neural network with one input x and one hidden layer,
whose response is u(x; a). Draw the structure of the neural network,
specify the activation function for each unit (either Id or H), and specify the
values for all weights (in terms of a and y).
b. Now consider the indicator function
$$\mathbb{1}_{[a,b)}(x) = \begin{cases} 1, & \text{if } x \in [a, b), \\ 0, & \text{otherwise.} \end{cases}$$
Construct a neural network with one input x and one hidden layer,
whose response is y1[a,b)(x), for given real values y, a and b. Draw the
structure of the neural network, specify the activation function for each unit
(either Id or H), and specify the values for all weights (in terms of a, b and
y).
Solution 4.3
Exercise 4.4
A neural network with a single hidden layer can provide an arbitrarily close
approxi- mation to any 1-dimensional bounded smooth function. This question
will guide you through the proof. Let f (x) be any function whose domain is [C,
D), for real values C < D. Suppose that the function is Lipschitz continuous, that
is,
for some constant L ≥ 0. Use the building blocks constructed in the previous
part to construct a neural network with one hidden layer that approximates this
function within z > 0, that is, ∀x e [C, D), | f (x) – fˆ(x)| ≤ z, where fˆ(x) is
the output of your neural network given input x. Your network should use only
the identity
(k or(kthe Heaviside activation functions. You need to specify the
wk , w0 ),Kand
number
)
1 , for units,
of whidden e {1,
each kthe . . . , K }.function
activation These weights may
for each be specified
unit, in
and a formula
of C, D, L and z, as well
for calculating each weight w0 , as the
termsvalues of f (x) evaluated at a finite number
of x values of your choosing (you need to explicitly specify which x values you
use). You do not need to explicitly write the fˆ(x) function. Why does your
network attain the given accuracy z?
Solution 4.4
The hidden units all use the step (Heaviside) activation function, and the output unit uses the identity. Take
$$K = \left\lceil\frac{L(D - C)}{2z}\right\rceil + 1.$$
For k ∈ {1, ..., K}, w_1^{(k)} = 1 and w_0^{(k)} = −(C + (k − 1)2z/L). For the second layer, we have w_0 = 0, w_1 = f(C + z/L), and for k ∈ {2, ..., K}, w_k = f(C + (2k − 1)z/L) − f(C + (2k − 3)z/L). We only evaluate f(x) at the points C + (2k − 1)z/L, for k ∈ {1, ..., K}, which is a finite number of points. The function value f̂(x) is exactly f(C + (2k − 1)z/L) for k ∈ {1, ..., K}. Note that for any x ∈ [C + (2k − 1)z/L, C + 2kz/L], |f(x) − f̂(x)| ≤ z.
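A small numerical check of the construction sketched above: a one-hidden-layer Heaviside network approximating a Lipschitz function on [C, D) to within z. The test function (sin, which is 1-Lipschitz) and the tolerance are illustrative choices, not part of the exercise.

```python
import numpy as np

def build_step_network(f, C, D, L, z):
    K = int(np.ceil(L * (D - C) / (2 * z))) + 1
    thresholds = C + np.arange(K) * 2 * z / L             # unit k fires when x >= C + (k-1)*2z/L
    midpoints = C + (2 * np.arange(1, K + 1) - 1) * z / L  # points where f is evaluated
    w = np.diff(np.concatenate([[0.0], f(midpoints)]))     # telescoping output-layer weights
    def f_hat(x):
        active = np.asarray(x)[..., None] >= thresholds    # Heaviside activations
        return active @ w
    return f_hat

f_hat = build_step_network(np.sin, C=0.0, D=3.0, L=1.0, z=0.05)
x = np.linspace(0.0, 3.0, 10001, endpoint=False)
print(np.max(np.abs(np.sin(x) - f_hat(x))))                # <= 0.05
```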
Exercise 4.5
Consider a shallow neural network regression model with n tanh activated units in the hidden layer and d outputs. The hidden-outer weight matrix is W_{ij}^{(2)} = 1/n and the input-hidden weight matrix is W^{(1)} = 1. The biases are zero. If the features X_1, ..., X_p are i.i.d. Gaussian random variables with mean µ = 0 and variance σ², show that
a. Ŷ ∈ [−1, 1].
b. Ŷ is independent of the number of hidden units, n ≥ 1.
c. The expectation E[Ŷ] = 0 and the variance V[Ŷ] ≤ 1.
Solution 4.5
$$\hat{Y} = W^{(2)}\sigma(I^{(1)})$$
Each hidden variable is a vector with identical elements:
$$Z_i^{(1)} = \sum_{j=1}^{p} w_{ij}^{(1)} X_j = \sum_{j=1}^{p} X_j = S_i = s, \quad i \in \{1, \dots, n\},$$
$$\hat{Y}_i = \sum_{j=1}^{n} w_{ij}^{(2)}\sigma(S_j) = \frac{1}{n}\sum_{j=1}^{n}\sigma(s) = \sigma(s),$$
which is independent of n.
$$\mathbb{E}[\hat{Y}] = \int W^{(2)}\sigma(I^{(1)})\,f_X(x)\,dx = W^{(2)}\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty}\sigma(I^{(1)})\,f_X(x_1)\cdots f_X(x_p)\,dx_1\cdots dx_p$$
$$= W^{(2)}\int_{0}^{\infty}\!\!\cdots\!\int_{0}^{\infty}\left[\sigma(I^{(1)}) + \sigma(-I^{(1)})\right] f_X(x_1)\cdots f_X(x_p)\,dx_1\cdots dx_p = 0,$$
where we have used the property that σ(I) is an odd function in I (and in x since b^{(1)} = 0) and f_X(x) is an even function since µ = 0. The variance is
$$\mathbb{E}[\hat{Y}\hat{Y}^{\mathsf{T}}] = \mathbb{E}[\sigma(I^{(1)})\sigma^{\mathsf{T}}(I^{(1)})], \quad \text{where } \left[\mathbb{E}[\sigma(I^{(1)})\sigma^{\mathsf{T}}(I^{(1)})]\right]_{ij} = \mathbb{E}[\sigma^2(I^{(1)})],$$
$$\mathbb{E}[\sigma^2(I^{(1)})] = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty}\sigma^2(I^{(1)})\,f_X(x_1)\cdots f_X(x_p)\,dx_1\cdots dx_p = 2\int_{0}^{\infty}\!\!\cdots\!\int_{0}^{\infty}\sigma^2(I^{(1)})\,f_X(x_1)\cdots f_X(x_p)\,dx_1\cdots dx_p$$
$$\leq 2\int_{0}^{\infty}\!\!\cdots\!\int_{0}^{\infty} f_X(x_1)\cdots f_X(x_p)\,dx_1\cdots dx_p = 1.$$
Exercise 4.6
Solution 4.6
This is achieved by starting and ending with a 1 label point so that the labels of
the points are {1, 0, 1, . . . , 0, 1}. This last configuration is not representable
with k + 1 indicator functions. Therefore VC(Fk ) = 2(k + 1).
Exercise 4.7
Show that a feedforward binary classifier with two Heaviside activated units
shatters the data {0.25, 0.5, 0.75}.
Solution 4.7
Now, for all permutations of label assignments to the data, we consider values of weights and biases which classify each arrangement:
• None of the points are activated. In this case, one of the units is essentially redundant and the output of the network is a Heaviside function: b_1^{(1)} = b_2^{(1)} and w_1^{(2)} = w_2^{(2)}. For example:
$$w^{(1)} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad b^{(1)} = \begin{pmatrix} c \\ c \end{pmatrix},$$
e.g.
$$w^{(1)} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad b^{(1)} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix},$$
where c_2 < c_1. For example, if c_1 = −0.4 and c_2 = −0.6, then the second point is activated. Also the first point is activated with, say, c_1 = 0 and c_2 = −0.4. The first two points are activated with c_1 = 0 and c_2 = −0.6.
• Lastly, the inverse indicator function can be represented, e.g., only the first and third points are activated:
$$w^{(1)} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad b^{(1)} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} \quad \text{and} \quad w^{(2)} = (-1,\ 1), \quad b^{(2)} = 1,$$
with, say, c_1 = −0.4 and c_2 = −0.6.
Hence the network shatters all three points and the VC-dim is 3. Note that we could have approached the first part above differently by letting the network behave as a wider indicator function (with support covering two or three data points) rather than as a single Heaviside function.
Exercise 4.8
Compute the weight and bias updates of W (2) and b(2) given a shallow binary
classifier (with one hidden layer) with unit weights, zero biases, and ReLU
activation of two hidden units for the labeled observation (x = 1, y = 1).
Solution 4.8
This follows from straightforward forward-pass computations to obtain the error. E.g., with ReLU activation:
$$I^{(1)} = W^{(1)}x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad Z^{(1)} = \sigma(I^{(1)}) = \begin{pmatrix} \max(1, 0) \\ \max(1, 0) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix},$$
$$I^{(2)} = W^{(2)} Z^{(1)} = \begin{pmatrix} 1 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = 2.$$
The prediction is
$$\hat{y} = \mathrm{sigmoid}(I^{(2)}) = \frac{1}{1 + e^{-I^{(2)}}} = \frac{1}{1 + e^{-2}} = 0.8807971.$$
The error is
$$\delta^{(2)} = e = y - \hat{y} = 1 - 0.8807971 = 0.1192029.$$
Applying one step of the back-propagation algorithm (multivariate chain rule):
$$\text{weight update: } \Delta W^{(2)} = -\gamma\,\delta^{(2)} Z^{(1)} = -\gamma\begin{pmatrix} 0.1192029 \\ 0.1192029 \end{pmatrix},$$
where γ is the learning rate. The bias update is Δb^{(2)} = −γδ^{(2)} = −0.1192029γ.
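A numpy reproduction of the forward pass and updates written above, following the sign convention used in the solution; the learning rate value is an illustrative assumption.

```python
import numpy as np

x, y, gamma = 1.0, 1.0, 0.1                     # gamma is an illustrative learning rate
W1 = np.array([[1.0], [1.0]])                   # input -> two hidden units, unit weights
W2 = np.array([[1.0, 1.0]])                     # hidden -> output, unit weights

I1 = W1 @ np.array([x])                         # (1, 1)
Z1 = np.maximum(I1, 0.0)                        # ReLU -> (1, 1)
I2 = float(W2 @ Z1)                             # 2
y_hat = 1.0 / (1.0 + np.exp(-I2))               # sigmoid(2) = 0.8807971
delta2 = y - y_hat                              # 0.1192029
dW2 = -gamma * delta2 * Z1                      # weight update, as in the solution above
db2 = -gamma * delta2                           # bias update
print(y_hat, delta2, dW2, db2)
```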
Exercise 4.9
(x1, y1) = (10.0, 9.14), (x2, y2) = (8.0, 8.14), (x3, y3) = (13.0, 8.74),
(x4, y4) = (9.0, 8.77), (x5, y5) = (11.0, 9.26), (x6, y6) = (14.0, 8.10),
(x7, y7) = (6.0, 6.13), (x8, y8) = (4.0, 3.10), (x9, y9) = (12.0, 9.13),
(x10, y10) = (7.0, 7.26), (x11, y11) = (5.0, 4.74).
Solution 4.9
Exercise 4.10
b. Why not use a very large number of neurons since it is clear that the
classification accuracy improves with more degrees of freedom?
c. Repeat the plotting of the hyperplane, in Part 1b of the notebook, only without
the ReLU function (i.e., activation=“linear”). Describe qualitatively how the
decision surface changes with increasing neurons. Why is a (non-linear)
activation function needed? The use of figures to support your answer is
expected.
Solution 4.10
• Only one hidden layer is needed. One reason why multiple hidden layers could
be needed on other datasets is because the input data are highly correlated. For
some datasets, it may be more efficient to use multiple layers and reduce the
overall number of parameters.
• Adding too many neurons in the hidden layer results in a network which
takes longer to train without an increase in the in-sample performance. It also
leads to over-fitting on out-of-sample test sets.
• The code and results should be submitted with activation=“linear” in the
hidden
layer. Plots of the hidden variables will show a significant difference
compared to using ReLU. Additionally, the performance of the network with
many neurons in the hidden layer, without activation, should be compared with
ReLU activated hidden neurons.
Exercise 4.11
Exercise 4.12***
Consider a feedforward neural network with three inputs, two units in the first
hidden layer, two units in the second hidden layer, and three units in the
output layer. The activation function for hidden layer 1 is ReLU, for hidden
layer 2 is sigmoid, and for the output layer is softmax.
The initial weights are given by the matrices
$$W^{(1)} = \begin{pmatrix} 0.1 & 0.3 & 0.5 \\ 0.9 & 0.4 & 0.4 \end{pmatrix}, \quad W^{(2)} = \begin{pmatrix} 0.6 & 0.7 \\ 0.7 & 0.2 \end{pmatrix}, \quad W^{(3)} = \begin{pmatrix} 0.4 & 0.3 \\ 0.6 & 0.7 \\ 0.3 & 0.2 \end{pmatrix},$$
and all the biases are unit vectors.
Assuming that the input (0.1, 0.7, 0.3) corresponds to the output (1, 0, 0), manually compute the updated weights and biases after a single epoch (forward + backward pass), clearly stating all derivatives that you have used. You should use a learning rate of 1.
As a practical exercise, you should modify the implementation of a
stochastic gradient descent routine in the back-propagation Python notebook.
Note that the notebook example corresponds to the example in Sect. 5, which
uses
sigmoid activated hidden layers only. Compare the weights and biases
obtained by TensorFlow (or your ANN library of choice) with those obtained by
your procedure after 200 epochs.
Solution 4.12
Exercise 5.1*
i.e. β0 = 0 and β1 = β2 = 1.
a. For this data, write down the mathematical expression for the sensitivities of
the fitted neural network when the network has
• zero hidden layers;
• one hidden layer, with n unactivated hidden units;
• one hidden layer, with n tanh activated hidden units;
• one hidden layer, with n ReLU activated hidden units; and
• two hidden layers, each with n tanh activated hidden units.
Solution 5.1
$$Y = X_1 + X_2 + z, \quad X_1, X_2, z \sim \mathcal{N}(0, 1), \quad \beta_0 = 0,\ \beta_1 = \beta_2 = 1$$
• Zero hidden layers:
• One hidden layer, with n ReLU activated hidden units:
$$\frac{\partial \hat{Y}}{\partial X_j} = \sum_i w_i^{(2)} H\left(I_i^{(1)}\right) w_{i,j}^{(1)}$$
Exercise 5.2**
a. For this data, write down the mathematical expression for the interaction term
(i.e. the off-diagonal components of the Hessian matrix) of the fitted neural
network when the network has
• zero hidden layers;
• one hidden layer, with n unactivated hidden units;
• one hidden layer, with n tanh activated hidden units;
• one hidden layer, with n ReLU activated hidden units; and
• two hidden layers, each with n tanh activated hidden units.
Why is the ReLU activated network problematic for estimating interaction terms?
Solution 5.2
Exercise 5.3*
For the same problem as in the previous exercise, use 5000 simulations to generate a regression training set for the neural network with one hidden layer. Produce a table showing how the mean and standard deviation of the sensitivities βᵢ behave as the number of hidden units is increased. Compare your result with tanh and ReLU activation. What do you conclude about which activation function to use for interpretability? Note that you should use the notebook Deep-Learning-Interpretability.ipynb as the starting point for experimental analysis.
Solution 5.3
Exercise 5.4*
Solution 5.4
Exercise 5.5**
Fixing the total number of hidden units, how do the mean and standard deviation of the sensitivities βᵢ behave as the number of layers is increased? Your answer should compare using either tanh or ReLU activation functions. Note, do not mix the type of activation functions across layers. What do you conclude about the effect of the number of layers, keeping the total number of units fixed, on the interpretability of the sensitivities?
Solution 5.5
$$Y = X_1 + X_2 + X_1 X_2 + z, \quad X_1, X_2 \sim \mathcal{N}(0, 1), \quad z \sim \mathcal{N}(0, \sigma_n^2),$$
$$\beta_0 = 0, \quad \beta_1 = \beta_2 = \beta_{12} = 1, \quad \sigma_n = 0.01$$
• Zero hidden layers:
• One hidden layer, with n tanh activated hidden units:
$$\frac{\partial \sigma}{\partial x} = 1 - \tanh^2(x), \qquad \frac{\partial^2 \sigma}{\partial x^2} = -2\tanh(x)\left(1 - \tanh^2(x)\right).$$
Therefore, we have that
$$\frac{\partial^2 \hat{Y}}{\partial X_i\,\partial X_j} = W^{(2)} D'\left(I^{(1)}\right)\mathrm{diag}\left(W_i^{(1)}\right) W_j^{(1)},$$
where D'_{i,i}(I_i^{(1)}) = −2 tanh(I_i^{(1)})(1 − tanh²(I_i^{(1)})) and D'_{i,j≠i}(I^{(1)}) = 0.
• One hidden layer, with n ReLU activated hidden units:
• Two hidden layers, each with n tanh activated hidden units: as we mentioned previously, for tanh,
$$\sigma(x) = \tanh(x), \qquad \frac{\partial \sigma}{\partial x} = 1 - \tanh^2(x), \qquad \frac{\partial^2 \sigma}{\partial x^2} = -2\tanh(x)\left(1 - \tanh^2(x)\right).$$
Therefore substitute in the expressions for D' and D'', where
$$D'_{i,i}\left(I_i^{(2)}\right) = -2\tanh\left(I_i^{(2)}\right)\left(1 - \tanh^2\left(I_i^{(2)}\right)\right), \quad D'_{i,j\neq i}\left(I^{(2)}\right) = 0,$$
$$D'_{i,i}\left(I_i^{(1)}\right) = -2\tanh\left(I_i^{(1)}\right)\left(1 - \tanh^2\left(I_i^{(1)}\right)\right), \quad D'_{i,j\neq i}\left(I^{(1)}\right) = 0.$$
• ReLU does not exhibit sufficient smoothness for the interaction term to be continuous.
Exercise 5.6**
For the same data generation process as the previous exercise, use 5000 simulations to generate a regression training set for the neural network with one hidden layer. Produce a table showing how the mean and standard deviation of the interaction term behave as the number of hidden units is increased, fixing all other parameters. What do you conclude about the effect of the number of hidden units on the interpretability of the interaction term? Note that you should use the notebook Deep-Learning-Interaction.ipynb as the starting point for experimental analysis.
Solution 5.6
Exercise 6.1
Solution 6.1
• Since y_{t−1} = φ₁y_{t−2} + z_t and E[z_t] = 0, we have E[y_t] = φ₁²E[y_{t−2}] = ··· = φ₁ⁿE[y_{t−n}]. By the property of stationarity, we have |φ₁| < 1, which gives lim_{n→∞} φ₁ⁿE[y_{t−n}] = 0. Hence E[y_t] = 0.
• The variance of an AR(1) process with no drift term is as follows. First note that y_t = φ₁L[y_t] + z_t can be written as an MA(∞) process by back-substitution:
$$y_t = \left(1 + \phi_1 L + \phi_1^2 L^2 + \dots\right)[z_t].$$
Taking the expectation of y_t²:
$$\mathbb{E}[y_t^2] = \mathbb{E}\left[\left(\left(1 + \phi_1 L + \phi_1^2 L^2 + \dots\right)[z_t]\right)^2\right] = \left(1 + \phi_1^2 + \phi_1^4 + \dots\right)\mathbb{E}[z_t^2] + \mathbb{E}[\text{cross terms}] = \frac{\sigma^2}{1 - \phi_1^2}.$$
• The lag 1 auto-correlation of an AR(1) process with no drift term is:
$$\gamma_1 := \mathbb{E}[y_t y_{t-1}] = \mathbb{E}\left[(z_t + \phi_1 z_{t-1} + \phi_1^2 z_{t-2} + \dots)(z_{t-1} + \phi_1 z_{t-2} + \phi_1^2 z_{t-3} + \dots)\right] = \mathbb{E}\left[\phi_1 z_{t-1}^2 + \phi_1^3 z_{t-2}^2 + \phi_1^5 z_{t-3}^2 + \dots + \text{cross terms}\right] = \phi_1\sigma^2 + \phi_1^3\sigma^2 + \phi_1^5\sigma^2 + \dots = \frac{\phi_1\sigma^2}{1 - \phi_1^2},$$
$$\gamma_2 := \mathbb{E}[y_t y_{t-2}] = \mathbb{E}\left[(z_t + \phi_1 z_{t-1} + \phi_1^2 z_{t-2} + \dots)(z_{t-2} + \phi_1 z_{t-3} + \phi_1^2 z_{t-4} + \dots)\right] = \mathbb{E}\left[\phi_1^2 z_{t-2}^2 + \phi_1^4 z_{t-3}^2 + \phi_1^6 z_{t-4}^2 + \dots + \text{cross terms}\right] = \frac{\phi_1^2\sigma^2}{1 - \phi_1^2}.$$
• The process can be written as
$$\Phi(L)[y_t] = z_t,$$
where Φ(L) := (1 − φ₁L). The root of Φ(z) for φ₁ = 0.5 is z = 2. Since |z| = 2 > 1, the process has no unit root and is stationary.
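A quick simulation check of the AR(1) moments derived above, assuming (for illustration) φ₁ = 0.5 and unit-variance noise: the sample variance should be close to 1/(1 − 0.25) and the lag-1 autocovariance close to 0.5/(1 − 0.25).

```python
import numpy as np

phi, n = 0.5, 500_000
rng = np.random.default_rng(0)
z = rng.normal(size=n)                       # unit-variance white noise
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + z[t]

print(y.var(), 1.0 / (1.0 - phi**2))                     # ~1.333
print(np.mean(y[1:] * y[:-1]), phi / (1.0 - phi**2))     # ~0.667
```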
Exercise 6.2
You have estimated the following ARMA(1,1) model for some time series data, where you are given the data at time t − 1: y_{t−1} = 3.4 and û_{t−1} = −1.3. Obtain the forecasts for the series y for times t, t + 1, t + 2 using the estimated ARMA model.
If the actual values for the series are −0.032, 0.961, 0.203 for t, t + 1, t + 2, calculate the out-of-sample Mean Squared Error (MSE) and Mean Absolute Error (MAE).
Solution 6.2
$$\text{MSE} = \frac{1}{3}(3.489 + 0.117 + 0.536) = 1.381.$$
$$\text{MAE} = \frac{1}{3}(1.868 + 0.342 + 0.732) = 0.981.$$
Exercise 6.3
Derive the mean, variance, and autocorrelation function (ACF) of a zero mean
MA(1) process.
Solution 6.3
$$\gamma_1 = \mathbb{E}[y_t y_{t-1}] = \mathbb{E}\left[(\theta_1 u_{t-1} + u_t)(\theta_1 u_{t-2} + u_{t-1})\right] = \theta_1\mathbb{E}[u_{t-1}^2] = \theta_1\sigma^2$$
Exercise 6.4
Consider the following log-GARCH(1,1) model with a constant for the mean
equation
yt = µ + ut, ut ∼ N(0, t2
2 )
ln(σ σ) = α + α u 2 + β1lnσ2
t 0 1 t–1 t–1
Solution 6.4
The process is stationary provided that
$$\sum_{i}\left(\alpha_i + \beta_i\right) < 1.$$
With
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta_1\sigma_{t-1}^2,$$
the variance forecasts are
$$\hat{\sigma}_{t+1}^2 = \alpha_0 + \alpha_1\mathbb{E}[u_t^2 \mid \Omega_{t-1}] + \beta_1\sigma_t^2 = \sigma^2 + (\alpha_1 + \beta_1)(\sigma_t^2 - \sigma^2),$$
$$\hat{\sigma}_{t+2}^2 = \alpha_0 + \alpha_1\mathbb{E}[u_{t+1}^2 \mid \Omega_{t-1}] + \beta_1\mathbb{E}[\sigma_{t+1}^2 \mid \Omega_{t-1}] = \sigma^2 + (\alpha_1 + \beta_1)^2(\sigma_t^2 - \sigma^2),$$
$$\hat{\sigma}_{t+l}^2 = \alpha_0 + \alpha_1\mathbb{E}[u_{t+l-1}^2 \mid \Omega_{t-1}] + \beta_1\mathbb{E}[\sigma_{t+l-1}^2 \mid \Omega_{t-1}] = \sigma^2 + (\alpha_1 + \beta_1)^l(\sigma_t^2 - \sigma^2),$$
which provides a relationship between the conditional variance σ_t² and the unconditional variance σ².
The half-life is given by the horizon l for which (α₁ + β₁)^l = 1/2, i.e., l = ln(0.5)/ln(α₁ + β₁).
Exercise 6.5
Et = αXt + (1 – α)Et–1,
where N is the time horizon of the SMA and the coefficient α represents the
degree of weighting decrease of the EMA, a constant smoothing factor between 0
and 1. A higher α discounts older observations faster.
a. Suppose that, when computing the EMA, we stop after k terms, instead of going all the way back to the initial value. What fraction of the total weight is obtained?
b. Suppose that we require 99.9% of the weight. What k do we require?
c. Show that, by picking α = 2/(N + 1), one achieves the same center of mass
in the EMA as in the SMA with the time horizon N.
d. Suppose that we have set α = 2/(N + 1). Show that the first N points in an
EMA
represent about 87.48% of the total weight.
Solution 6.5
a. The fraction of the total weight omitted by stopping after k terms is
$$(1 - \alpha)^k.$$
b. To have 99.9% of the weight, set the above ratio equal to 0.1% and solve for k:
$$k = \frac{\ln(0.001)}{\ln(1 - \alpha)}$$
to determine how many terms should be used.
c. The weights of an N-day SMA have a center of mass on the R_SMA-th day, where
$$R_{\text{SMA}} = \frac{N + 1}{2}, \qquad \text{or} \qquad R_{\text{SMA}} = \frac{N - 1}{2}$$
if we use zero-based indexing. We shall stick with the one-based indexing. The weights of an EMA have the center of mass
$$R_{\text{EMA}} = \alpha\sum_{k=1}^{\infty} k(1 - \alpha)^{k-1}.$$
From the Maclaurin series
$$\frac{1}{1 - x} = \sum_{k=0}^{\infty} x^k,$$
taking derivatives of both sides, we get
$$(x - 1)^{-2} = \sum_{k=0}^{\infty} k\,x^{k-1}.$$
Substituting x = 1 − α, we get R_EMA = 1/α, and setting R_EMA = R_SMA gives
$$\alpha_{\text{EMA}} = \frac{2}{N_{\text{SMA}} + 1}.$$
d. From the formula for the sum of a geometric series, we obtain that the sum of the weights of all the terms (i.e., the infinite number of terms) in an EMA is 1. The sum of the weights of N terms is 1 − (1 − α)^{N+1}. The weight omitted after N terms is given by 1 − [1 − (1 − α)^{N+1}] = (1 − α)^{N+1}. Substituting α = 2/(N + 1) and making use of
$$\lim_{n\to\infty}\left(1 + \frac{a}{n}\right)^n = e^{a},$$
we get
$$\lim_{N\to\infty} 1 - \left(1 - \frac{2}{N + 1}\right)^{N+1} = 1 - e^{-2} \approx 0.8647.$$
Exercise 6.6
Suppose that, for the sequence of random variables {y_t}_{t=0}^{∞}, the following model holds:
$$y_t = \mu + \phi y_{t-1} + z_t, \quad |\phi| \leq 1, \quad z_t \sim \text{i.i.d.}(0, \sigma^2).$$
Derive the conditional expectation E[y_t | y_0] and the conditional variance V[y_t | y_0].
Solution 6.6
As
$$y_1 = \mu + \phi y_0 + z_1, \qquad y_2 = \mu + \phi y_1 + z_2 = \mu(1 + \phi) + \phi^2 y_0 + \phi z_1 + z_2,$$
we have, by iterating,
$$y_t = \mu\sum_{i=0}^{t-1}\phi^i + \phi^t y_0 + \sum_{i=1}^{t}\phi^{t-i} z_i.$$
Hence
$$\mathbb{E}[y_t \mid y_0] = \mu\sum_{i=0}^{t-1}\phi^i + \phi^t y_0 + \sum_{i=1}^{t}\phi^{t-i}\mathbb{E}[z_i] = \mu\sum_{i=0}^{t-1}\phi^i + \phi^t y_0,$$
and
$$\mathbb{V}[y_t \mid y_0] = \mathbb{V}\left[\sum_{i=1}^{t}\phi^{t-i} z_i\right] = \sum_{i=1}^{t}\phi^{2(t-i)}\mathbb{V}[z_i] = \sigma^2\sum_{i=0}^{t-1}\phi^{2i}.$$
Chapter 7
Probabilistic Sequence Modeling
where η_t ∼ N(0, σ²), and includes as special cases all AR(p) and MA(q) models. Such models are often fitted to financial time series. Suppose that we would like to filter this time series using a Kalman filter. Write down a suitable process and observation model.
Solution 7.1
The observation equation is
$$Y_t = H_t X_t + b_t + V_t v_t,$$
where
$$X_t = \begin{pmatrix} y_t \\ \phi_2 y_{t-1} + \dots + \phi_p y_{t-m+1} + \theta_1\eta_t + \dots + \theta_{m-1}\eta_{t-m+2} \\ \phi_3 y_{t-1} + \dots + \phi_p y_{t-m+2} + \theta_2\eta_t + \dots + \theta_{m-1}\eta_{t-m+3} \\ \vdots \\ \phi_m y_{t-1} + \theta_{m-1}\eta_t \end{pmatrix} \in \mathbb{R}^{m\times 1},$$
$$F = \begin{pmatrix} \phi_1 & 1 & 0 & \cdots & 0 \\ \phi_2 & 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \\ \phi_{m-1} & 0 & 0 & \cdots & 1 \\ \phi_m & 0 & 0 & \cdots & 0 \end{pmatrix} \in \mathbb{R}^{m\times m}, \qquad W = \begin{pmatrix} 1 & \theta_1 & \cdots & \theta_{m-1} \end{pmatrix}^{\mathsf{T}} \in \mathbb{R}^{m\times 1},$$
$$w_t = \eta_t, \quad Q_t = \sigma^2, \quad H = \begin{pmatrix} 1 & 0 & \cdots & 0 \end{pmatrix} \in \mathbb{R}^{1\times m}, \quad b_t = 0, \quad V_t = 0.$$
If y_t is stationary, then X_t ∼ N(0, P) with P given by the equation P = FPFᵀ + σ²WWᵀ, so we can set the initial state and error covariance to 0 and P, respectively.
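A small helper that assembles the state-space matrices written out above for an ARMA(p, q) model, so they can be passed to a standard Kalman filter; the helper and the example coefficients are illustrative assumptions, not the book's notebook code.

```python
import numpy as np

def arma_state_space(phi, theta, sigma2):
    """Assemble F, W, H, Q for the ARMA(p, q) state-space form written above."""
    p, q = len(phi), len(theta)
    m = max(p, q + 1)
    phi_full = np.r_[phi, np.zeros(m - p)]
    theta_full = np.r_[theta, np.zeros(m - 1 - q)]
    F = np.zeros((m, m))
    F[:, 0] = phi_full                          # first column carries the AR coefficients
    F[:-1, 1:] = np.eye(m - 1)                  # shift (companion) structure
    W = np.r_[1.0, theta_full].reshape(-1, 1)   # loadings of the noise eta_t
    H = np.zeros((1, m)); H[0, 0] = 1.0         # observation picks out y_t
    Q = np.array([[sigma2]])
    return F, W, H, Q

F, W, H, Q = arma_state_space(phi=[0.5, -0.2], theta=[0.3], sigma2=1.0)
```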
Solution 7.2
An Itô integral, ∫₀ᵗ f(u) dW_u, of a deterministic integrand, f, is a Gaussian random variable with mean 0 and variance ∫₀ᵗ f²(u) du. In our case, f(u) = e^{−θ(t−u)}, and
$$\sigma^2\int_0^t f^2(u)\,du = \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta t}\right).$$
Since this Markov process is homogeneous, its transition density depends only upon the time difference. Setting, for s ≤ t, h_k := t − s as the time interval between the time ticks k − 1 and k, we obtain a discretized process model
$$X_k = F_k X_{k-1} + a_k + w_k,$$
with F_k = e^{−θh_k}, a_k = µ(1 − e^{−θh_k}), and w_k ∼ N(0, (σ²/2θ)(1 − e^{−2θh_k})).
As a further exercise, consider extending this to a multivariate OU process.
Exercise 7.3: Deriving the Particle Filter for Stochastic Volatility with
Leverage and Jumps
We shall regard the log-variance x_t as the hidden state and the log-returns y_t as observations. How can we use the particle filter to estimate x_t on the basis of the observations y_t?
$$x_t = \mu(1 - \phi) + \phi x_{t-1} + \sigma_v\rho\,y_{t-1} e^{-x_{t-1}/2} + \sigma_v\sqrt{1 - \rho^2}\,\xi_{t-1}$$
Solution 7.3 ,
[Deriving the Particle Filter for Stochastic Volatility with Leverage and Jumps]
a. The Cholesky decomposition of
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \quad \text{is} \quad L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{pmatrix}.$$
This gives
$$x_{t+1} = \mu(1 - \phi) + \phi x_t + \sigma_v\rho z_t + \sigma_v\sqrt{1 - \rho^2}\,\zeta_t.$$
$$p(z_t \mid x_t, y_t) = p(z_t \mid J_t = 0, x_t, y_t)\,P[J_t = 0 \mid x_t, y_t] + p(z_t \mid J_t = 1, x_t, y_t)\,P[J_t = 1 \mid x_t, y_t].$$
It is clear that, if the process does not jump, there is a Dirac delta mass at a single point, y_t e^{-x_t/2}, so
$$p(z_t \mid J_t = 0, x_t, y_t) = \delta\left(z_t - y_t e^{-x_t/2}\right).$$
We find the parameters µ_{x_t|J_t=1} and σ²_{x_t|J_t=1} by equating the coefficients of z_t² and z_t in the following two expressions. From the first,
$$\ln\phi\left(z_t;\ \mu_{x_t|J_t=1},\ \sigma^2_{x_t|J_t=1}\right) = -z_t^2\left(\frac{\exp(x_t)}{2\sigma_J^2} + \frac{1}{2}\right) + z_t\,\frac{y_t\exp(x_t/2)}{\sigma_J^2} + \text{constant term}.$$
From the second,
$$\ln\phi\left(z_t;\ \mu_{x_t|J_t=1},\ \sigma^2_{x_t|J_t=1}\right) = -\frac{z_t^2}{2\sigma^2_{x_t|J_t=1}} + z_t\,\frac{\mu_{x_t|J_t=1}}{\sigma^2_{x_t|J_t=1}} + \text{constant term}.$$
Equating the coefficients of z_t² in the above two equations, we get
$$-\frac{\exp(x_t)}{2\sigma_J^2} - \frac{1}{2} = -\frac{1}{2\sigma^2_{x_t|J_t=1}},$$
hence
$$\sigma^2_{x_t|J_t=1} = \frac{\sigma_J^2}{\exp(x_t) + \sigma_J^2},$$
and equating the coefficients of z_t,
$$\mu_{x_t|J_t=1} = \frac{y_t\exp(x_t/2)}{\exp(x_t) + \sigma_J^2}.$$
The transition density is
$$p(x_t \mid x_{t-1}, y_{t-1}, z_{t-1}) = \phi\left(x_t;\ \mu(1 - \phi) + \phi x_{t-1} + \sigma_v\rho z_{t-1},\ \sigma_v^2(1 - \rho^2)\right).$$
d. We first sample z_{t−1} from p(z_{t−1} | x_{t−1}, y_{t−1}), then sample from the above normal pdf.
e. At t, the jump occurs with probability p and does not occur with probability 1 − p. When the jump does not occur, y_t is normally distributed with mean 0 and variance e^{x_t}. When the jump does occur, it is also normally distributed, since the sum Z = X + Y of two normal random variables, X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²), is itself a normal random variable, Z ∼ N(µ_X + µ_Y, σ_X² + σ_Y²). In our case, the two normal random variables are z_t e^{x_t/2} ∼ N(0, e^{x_t}) and a_t ∼ N(0, σ_J²). The result follows by the Law of Total Probability.
Solution 7.4
Exercise 8.1*
Solution 8.1
The lag 1 unit impulse gives x̂_t = tanh(0.5 + 0.1) and the intermediate calculation of the ratios, r(k), is shown in Table 2, showing that the half-life is 4 periods.

Lag k   r(k)
1       1.00
2       0.657
3       0.502
4       0.429
• State the assumptions needed to apply plain RNNs to time series data.
• Show that a linear RNN(p) model with bias terms in both the output layer and the hidden layer can be written in the form
$$\hat{y}_t = \mu + \sum_{i=1}^{p}\phi_i\,y_{t-i}$$
Solution 8.2
and φ_i = φ_x φ_z^{i−1}.
• The following constraint on the activation function, |σ(x)| ≤ 1, must hold for stability of the RNN.
Exercise 8.3*
$$y_t = \sigma(\phi y_{t-1}) + u_t,$$
Solution 8.3
where we have made use of Jensen's inequality under the property that g(x) = σ(φx)x is convex in x. Convexity holds if σ(x) is monotonically increasing and convex and x is non-negative.
Exercise 8.4*
Show that the discrete convolution of the input sequence X = {3, 1, 2} and the filter F = {3, 2, 1}, given by Y = X ∗ F where
$$y_i = (X * F)_i = \sum_{j=-\infty}^{\infty} x_j F_{i-j},$$
is Y = {9, 9, 11, 5, 2}.
Solution 8.4
y0 = x0 × F0 = 3 × 3 = 9,
y1 = x0 × F1 + x1 × F0 = 3 × 1 + 2 × 3 = 9,
y2 = x0 × F2 + x1 × F1 + x2 × F0 = 3 × 2 + 2 × 1 + 1 × 3 = 11,
y3 = x1 × F2 + x2 × F1 + x3 × F0 = 2 × 2 + 1 × 1 = 5,
y4 = x2 × F2 = 1 × 2 = 2,
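The term-by-term computation above can be checked in one line with numpy's full discrete convolution.

```python
import numpy as np
print(np.convolve([3, 1, 2], [3, 2, 1]))   # [ 9  9 11  5  2]
```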
Exercise 8.5*
Solution 8.5
Substituting the definition of the filter and expanding the discrete convolution gives:
$$\hat{x}_t = (F * x)_t = \sum_{j=1}^{p} F_j\,x_{t-j} = \phi x_{t-1} + \phi^2 x_{t-2} + \dots + \phi^p x_{t-p},$$
Exercise 8.6***
Modify the RNN notebook to predict coindesk prices using a univariate RNN applied to the data coindesk.csv. Then complete the following tasks:
a. Determine whether the data is stationary by applying the augmented Dickey-Fuller test.
b. Estimate the partial autocorrelation and determine the optimum lag at the 99% confidence level. Note that you will not be able to draw conclusions if your data is not stationary. Choose the sequence length to be equal to this optimum lag.
c. Evaluate the MSE in-sample and out-of-sample as you vary the number of
hidden
neurons. What do you conclude about the level of over-fitting?
d. Apply L1 regularization to reduce the variance.
e. How does the out-of-sample performance of a plain RNN compare with that
of a GRU?
f. Determine whether the model error is white noise or is auto-correlated by
applying
the Ljung Box test.
Solution 8.6
Exercise 8.7***
Modify the CNN 1D time series notebook to predict high frequency mid-prices with a single hidden layer CNN, using the data HFT.csv. Then complete the following tasks:
a. Confirm that the data is stationary by applying the augmented Dickey-Fuller test.
b. Estimate the partial autocorrelation and determine the optimum lag at the 99% confidence level.
c. Evaluate the MSE in-sample and out-of-sample using 4 filters. What do you conclude about the level of over-fitting as you vary the number of filters?
d. Apply L1 regularization to reduce the variance.
e. Determine whether the model error is white noise or is auto-correlated by applying the Ljung-Box test.
Hint: You should also review the HFT RNN notebook before you begin this
exercise.
Solution 8.7
Exercise 9.1
Solution 9.1
Exercise 9.2
Solution 9.2
• True: Value iteration always finds the optimal policy when run to convergence (this is a result of the Bellman equations being a contraction mapping).
• False: Q-learning is an off-policy learning algorithm: the Q(s, a) function is learned from different actions (for example, random actions). We do not even need a policy at all to learn Q(s, a):
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left(r + \gamma\max_{a'} Q(s', a') - Q(s, a)\right),$$
where a' ranges over all actions that were probed in state s' (not actions under the current policy).
• In the limit, every action needs to be tried sufficiently often in every possible state. This can be guaranteed with a sufficiently permissive exploration strategy.
• False. It may not even be able to represent the optimal policy due to approximation error.

Exercise 9.3*
Consider the following toy cash buffer problem. An investor owns a stock, initially valued at S_{t0} = 1, and must ensure that their wealth (stock + cash) is not less than a certain threshold K at time t = t1. Let W_t = S_t + C_t denote their wealth at time t, where C_t is the total cash in the portfolio. If the wealth W_{t1} < K = 2, then the investor is penalized with a −10 reward. The investor chooses to inject either 0 or 1 amounts of cash, with a respective penalty of 0 or −1 (which is not deducted from the fund).
Dynamics: The stock price follows a discrete Markov chain with P(S_{t+1} = s | S_t = s) = 0.5, i.e. with probability 0.5 the stock remains the same price over the time interval. P(S_{t+1} = s + 1 | S_t = s) = P(S_{t+1} = s − 1 | S_t = s) = 0.25. If the wealth moves off the grid it simply bounces to the nearest value in the grid at that time. The states are grid squares, identified by their row and column number (row first). The investor always starts in state (1,0) (i.e. the initial wealth W_{t0} = 1 at time t0 = 0; there is no cash in the fund) and both states in the last column (i.e., at time t = t1 = 1) are terminal (Table 3).
Using the Bellman equation (with generic state notation), give the first round of value iteration updates for each state by completing the table below. You may ignore the time value of money, i.e., set γ = 1.
$$V_{i+1}(s) = \max_{a}\left(\sum_{s'} T(s, a, s')\left(R(s, a, s') + \gamma V_i(s')\right)\right)$$

w    t0    t1
2    0     0
1    0     -10
Solution 9.3
The stock price sequence {S_{t0}, S_{t1}} is a Markov chain with one transition period. The transition probabilities are denoted p_{ji} = P(S_{t+1} = s_j | S_t = s_i) with states s1 = 1, s2 = 2.
The wealth sequence {W_{t0}, W_{t1}} is an MDP. To avoid confusion in notation with the stock states, let us denote the wealth states w ∈ {w1 = 1, w2 = 2} (instead of the usual s1, s2, ...). The actions are a ∈ A := {a0 = 0, a1 = 1}. The transition matrix for this MDP is
$$T_{ijk} := T(w_i, a_k, w_j) := P(W_{t+1} = w_j \mid W_t = w_i, a_t = a_k).$$
From the problem constraints and the Markov chain transition probabilities p_{ij}, we can write for action a0 (no cash injection):
$$T_{\cdot,\cdot,0} = \begin{pmatrix} 0.75 & 0.25 \\ 0.25 & 0.75 \end{pmatrix}$$
We can now evaluate the rewards for the case when W_{t0} = w1 = 1 (which is the only starting state). The rewards are R(w1, 0, w1) = −10, R(w1, 0, w2) = 0, R(w1, 1, w1) = −11, R(w1, 1, w2) = −1.
Using the Bellman equation, the value update from state W_{t0} = w1 is
$$V_1(w_1) = \max_{a\in A}\sum_{j=1}^{2} T(w_1, a, w_j)\,R(w_1, a, w_j)$$
for action a_{t0} = 1. Note that V1(w2) does not exist as the wealth cannot transition from w2 at time t0.
Exercise 9.4*
Consider the following toy cash buffer problem. An investor owns a stock, initially valued at S_{t0} = 1, and must ensure that their wealth (stock + cash) does not fall below a threshold K = 1 at time t = t1. The investor can choose to either sell the stock or inject more cash, but not both. In the former case, the sale of the stock at time t results in an immediate cash update s_t (you may ignore transaction costs). If the investor chooses to inject a cash amount c_t ∈ {0, 1}, there is a corresponding penalty of −c_t (which is not taken from the fund). Let W_t = S_t + C_t denote their wealth at time t, where C_t is the total cash in the portfolio.
Dynamics: The stock price follows a discrete Markov chain with P(S_{t+1} = s | S_t = s) = 0.5, i.e., with probability 0.5 the stock remains the same price over the time interval. P(S_{t+1} = s + 1 | S_t = s) = P(S_{t+1} = s − 1 | S_t = s) = 0.25. If the wealth moves off the grid it simply bounces to the nearest value in the grid at that time. The states are grid squares, identified by their row and column number (row first). The investor always starts in state (1,0) (i.e. the initial wealth W_{t0} = 1 at time t0 = 0; there is no cash in the fund) and both states in the last column (i.e., at time t = t1 = 1) are terminal.

w    t0    t1
1    0     0
0    0     -10

Using the Bellman equation (with generic state notation), give the first round of value iteration updates for each state by completing the table below. You may ignore the time value of money, i.e., set γ = 1.
$$V_{i+1}(s) = \max_{a}\left(\sum_{s'} T(s, a, s')\left(R(s, a, s') + \gamma V_i(s')\right)\right)$$
Solution 9.4
The stock price sequence {S_{t0}, S_{t1}} is a Markov chain with one transition period. The transition probabilities are denoted p_{ji} = P(S_{t+1} = s_j | S_t = s_i) with states s0 = 0, s1 = 1, s2 = 2.
The wealth sequence {W_{t0}, W_{t1}} is an MDP. Let us denote the wealth states w ∈ {w0 = 0, w1 = 1}. The actions a := {a0, a1, a2} respectively denote (i) do not inject cash or sell the stock; (ii) sell the stock but do not inject cash; and (iii) inject cash ($1) but do not sell the stock.
The transition matrix for this MDP is
$$T_{ijk} := T(w_i, a_k, w_j) := P(W_{t+1} = w_j \mid W_t = w_i, a_t = a_k).$$
From the problem constraints and the Markov chain transition probabilities p_{ij}, we can write for action a0 (no cash injection or sale):
$$T_{\cdot,\cdot,0} = \begin{pmatrix} 0.75 & 0.25 \\ 0.25 & 0.75 \end{pmatrix}, \qquad T_{\cdot,\cdot,1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
We can now evaluate the rewards for the case when W_{t0} = w1 = 1 (which is the only starting state). The rewards are
Using the Bellman equation, the value update from state W_{t0} = w1 is
$$V_1(w_1) = \max_{a\in A}\sum_{j=0}^{1} T(w_1, a, w_j)\,R(w_1, a, w_j).$$
The other empty value in the table is not filled in. V1(w0) does not exist as the wealth cannot transition from w0 at time t0.
Exercise 9.5*
Deterministic policies such as the greedy policy π*(a|s) = arg max_a Q^π(s, a) are invariant with respect to a shift of the action-value function by an arbitrary function of a state f(s): π*(a|s) = arg max_a Q^π(s, a) = arg max_a Q̃^π(s, a), where Q̃^π(s, a) = Q^π(s, a) − f(s). Show that this implies that the optimal policy is also invariant with respect to the following transformation of an original reward function r(s_t, a_t, s_{t+1}):
Solution 9.5
Exercise 9.6**
density of state–action pairs. It can be used, e.g., to specify the value function as
an expectation value of the reward: V = ⟨r(s, a)⟩_ρ.
a. Compute the policy in terms of the occupancy measure ρ_π.
b. Compute a normalized occupancy measure ρ̃_π(s, a). How different would the
policy be if we used the normalized measure ρ̃_π(s, a) instead of the
unnormalized measure ρ_π?
Solution 9.6
a. The policy is

    π(a|s) = ρ_π(s, a) / Σ_{a'} ρ_π(s, a').

b. The normalized occupancy measure is

    ρ̃(s, a) = ρ_π(s, a) / Σ_{s,a} ρ_π(s, a) = ρ_π(s, a) / Σ_{t=0}^{∞} γ^t = (1 − γ) ρ_π(s, a).

Therefore, if we rescale the occupancy measure ρ_π by the constant factor 1 − γ,
we obtain a normalized probability density ρ̃(s, a) of state–action pairs. On
the other hand, an optimal policy is invariant under a rescaling of all rewards
by a constant factor (see Exercise 9.1). This means that we can always
consider the occupancy measure ρ_π to be a valid normalized probability
density, as any mismatch in the normalization can always be re-absorbed
in rescaled rewards.
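A small numerical illustration of part (a), not taken from the book: for any table of non-negative occupancy weights ρ_π(s, a), normalizing over actions yields a valid policy, and rescaling ρ_π by a positive constant (such as 1 − γ) leaves that policy unchanged. The occupancy values below are made up.

import numpy as np

rho = np.array([[0.3, 0.1],      # hypothetical occupancy weights rho_pi(s, a)
                [0.2, 0.4]])     # rows index states s, columns index actions a

policy = rho / rho.sum(axis=1, keepdims=True)         # pi(a|s) = rho(s,a) / sum_a' rho(s,a')
policy_rescaled = (0.1 * rho) / (0.1 * rho).sum(axis=1, keepdims=True)

print(policy)                                # each row sums to one
print(np.allclose(policy, policy_rescaled))  # True: rescaling rho does not change pi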
Exercise 9.7**
b. How could you modify a linear unbounded specification of the reward
r_θ(s, a, s') = Σ_{k=1}^{K} θ_k Ψ_k(s, a, s') to a bounded reward function with values in the unit
interval [0, 1]?
Solution 9.7
b. Once we realize that any MDP with bounded rewards can be mapped onto
an MDP with rewards r_t ∈ [0, 1] without any loss of generality, a simple
alternative to a linear reward would be a logistic reward:

    r_θ(s_t, a_t, s_{t+1}) = 1 / ( 1 + exp( −Σ_{k=1}^{K} θ_k Ψ_k(s_t, a_t, s_{t+1}) ) ).
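A one-line numerical version of this bounded reward (a sketch with made-up coefficient and feature values, not from the book):

import numpy as np

def logistic_reward(theta, psi):
    # Bounded reward in (0, 1): r = 1 / (1 + exp(-theta . psi))
    return 1.0 / (1.0 + np.exp(-np.dot(theta, psi)))

theta = np.array([0.5, -1.0, 2.0])    # hypothetical coefficients theta_k
psi = np.array([1.0, 0.3, -0.2])      # hypothetical basis-function values Psi_k(s, a, s')
print(logistic_reward(theta, psi))    # always lies strictly between 0 and 1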
Exercise 9.8
Solution 9.8
Exercise 9.9
where y and K are some parameters. You can use such a cost function to
develop an MDP model for an agent learning to invest. For example, s_t can be
the current assets in a portfolio of equities at time t, a_t an additional cash
amount added to or subtracted from the portfolio at time t, and s_{t+1} the portfolio
value at the end of the time interval [t, t + 1). The second term is an option-like cost
of a total portfolio (equity and cash) shortfall by time t + 1 from a target value K.
The parameter y controls the relative importance of paying costs now as opposed to
delaying payment.
a. What is the corresponding expected cost for this problem, if the expectation
is taken w.r.t. the stock prices and a_t is treated as deterministic?
b. Is the expected cost a convex or concave function of the action a_t?
c. Can you find an optimal one-step action a_t* that minimizes the one-step
expected cost?
Hint: For Part (i), you can use the following property:

    d/dx [y − x]^+ = d/dx [ (y − x) H(y − x) ]
Solution 9.9
The expected cost is

    E[ C(s_t, a_t, s_{t+1}) ] = E[ y a_t + (K − s_{t+1} − a_t)^+ ] = y a_t + E[ (K − s_{t+1} − a_t)^+ ].

Using the hint,

    d/dx [ (y − x) H(y − x) ] = −H(y − x) − (y − x) δ(y − x),

where δ(x) = dH(x)/dx and H(x) is the Heaviside step function. Note that the (y − x) δ(y − x)
term is zero everywhere. Using this relation, the derivative of the expected cost is

    ∂ E[C] / ∂a_t = y − E[ H(K − s_{t+1} − a_t) ] = y − Pr( s_{t+1} ≤ K − a_t ).

Taking the second derivative and assuming that the distribution of s_{t+1} has a
pdf p(s_{t+1}), we obtain ∂² E[C] / ∂a_t² = p(K − a_t) ≥ 0. This shows that the expected
cost is convex in the action a_t.
c. An implicit equation for the optimal action a_t* is obtained at a zero of the
derivative of the expected cost: Pr( s_{t+1} ≤ K − a_t* ) = y.
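The implicit equation Pr(s_{t+1} ≤ K − a_t*) = y says that K − a_t* is the y-quantile of s_{t+1}, so a_t* = K − F^{−1}(y), where F is the distribution function of s_{t+1}. The sketch below solves this numerically under an assumed Gaussian distribution for s_{t+1} (the exercise itself does not fix one); all parameter values are hypothetical.

import numpy as np
from scipy.stats import norm

K, y = 1.0, 0.3                 # target level and cost parameter (hypothetical values)
mu, sigma = 1.05, 0.2           # assumed distribution of s_{t+1}: N(mu, sigma^2)

a_star = K - norm.ppf(y, loc=mu, scale=sigma)          # a* = K - F^{-1}(y)

# sanity check: Pr(s_{t+1} <= K - a*) should equal y
print(a_star, norm.cdf(K - a_star, loc=mu, scale=sigma))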
Exercise 9.10
Exercise 9.9 presented a simple single-period cost function that can be used in
the setting of model-free reinforcement learning. We can now give a model-based
formulation for such an option-like reward. To this end, we use the following dynamics:

    s_{t+1} = (1 + r_t) s_t
    r_t = G(F_t) + ε_t.

In words, the initial portfolio value s_t + a_t at the beginning of the interval [t, t + 1)
grows with a random return r_t given by a function G(F_t) of factors F_t, corrupted
by noise ε_t with E[ε_t] = 0 and E[ε_t²] = σ².
a. Obtain the form of the expected cost for this specification in Exercise 9.9.
b. Obtain the optimal single-step action for this case.
c. Compute the sensitivity of the optimal action with respect to the i-th factor F_{it},
assuming the sigmoid link function G(F_t) = σ( Σ_i ω_i F_{it} ) and Gaussian noise ε_t.
Solution 9.10
The optimal single-step action a_t* satisfies

    P( ε_t ≤ (K − a_t* − (1 + G(F_t)) s_t) / (σ s_t) ) = y.

c. If ε_t ∼ N(0, 1), the last formula becomes

    N( (K − a_t* − (1 + G(F_t)) s_t) / (σ s_t) ) = y,

where N(·) denotes the standard normal cumulative distribution function.
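Since N(·) is invertible, the last relation can be solved explicitly as a_t* = K − (1 + G(F_t)) s_t − σ s_t N^{−1}(y). A minimal numerical sketch of this formula follows; the factor values, loadings ω, and parameters below are made up for illustration.

import numpy as np
from scipy.stats import norm

def G(F, omega):
    # sigmoid link function G(F) = sigma(omega . F)
    return 1.0 / (1.0 + np.exp(-np.dot(omega, F)))

K, y, sigma_noise, s_t = 1.0, 0.3, 0.2, 0.9       # hypothetical parameters
omega = np.array([0.4, -0.1])                     # hypothetical factor loadings
F_t = np.array([0.5, 1.2])                        # hypothetical factor values

a_star = K - (1.0 + G(F_t, omega)) * s_t - sigma_noise * s_t * norm.ppf(y)
print(a_star)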
Exercise 9.11
Fenchel duality):

    max_{a_t ∈ A} Q(s_t, a_t) = max_{{π_k}} Σ_{k=1}^{K} π_k Q(s_t, A_k)   s.t.   0 ≤ π_k ≤ 1,   Σ_{k=1}^{K} π_k = 1
Solution 9.11
As all weights are between zero and one, the solution to the maximization over
the distribution π = {π_k}_{k=1}^{K} is to put all weight on a value k = k* such that
a_t = A_{k*} maximizes Q(s_t, a_t): π_{k*} = 1, π_k = 0 for all k ≠ k*. This means that
maximization over all possible actions for a discrete action set can be reduced to a
linear programming problem.
Exercise 9.12**
    max_{a_t ∈ A} Q(s_t, a_t) = max_{{π_k}} Σ_{k=1}^{K} π_k Q(s_t, A_k)   s.t.   0 ≤ π_k ≤ 1,   Σ_{k=1}^{K} π_k = 1
Solution 9.12
The Lagrange multiplier λ can now be found by substituting this expression into
the normalization condition Σ_{k=1}^{K} π_k = 1. This produces the final result

    π_k = ω_k e^{β Q(s_t, A_k)} / Σ_{k'} ω_{k'} e^{β Q(s_t, A_{k'})}.
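The resulting weighted Boltzmann policy is easy to compute in a numerically stable way by shifting the logits before exponentiating. A minimal sketch with made-up Q-values and prior weights ω_k:

import numpy as np

def boltzmann_policy(q_values, omega, beta):
    # pi_k proportional to omega_k * exp(beta * Q_k), normalized over k
    logits = beta * np.asarray(q_values) + np.log(omega)
    logits -= logits.max()               # stabilization; does not change the normalized result
    weights = np.exp(logits)
    return weights / weights.sum()

q = np.array([1.0, 1.5, 0.2])                    # hypothetical Q(s_t, A_k)
omega = np.array([0.2, 0.5, 0.3])                # hypothetical prior weights
print(boltzmann_policy(q, omega, beta=2.0))      # sums to one; concentrates on the best k as beta grows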
Exercise 9.13**
Solution 9.13
Exercise 9.14*
Show that the solution for the coefficients Wtk in the LSPI method (see Eq.(9.71))
is
    W_t* = S_t^{−1} M_t
Solution 9.14
The coefficients W_t minimize the quadratic loss

    L_t(W_t) = Σ_{k=1}^{N} ( R_t(X_t^{(k)}, a_t^{(k)}, X_{t+1}^{(k)}) + γ Q_{t+1}(X_{t+1}^{(k)}, π(X_{t+1}^{(k)})) − W_t Ψ(X_t^{(k)}, a_t^{(k)}) )².

Setting the derivative of L_t(W_t) with respect to W_t to zero gives a linear system of
normal equations whose solution has the stated form W_t* = S_t^{−1} M_t.
Exercise 9.15**
    Q̂(s, a) ← Q̂(s, a) + α ( r(s, a) + γ Q̂(s', a') − Q̂(s, a) ),

where a, a' are drawn from the Boltzmann policy with β = 16.55 and α =
0.1, leads to oscillating solutions for Q̂(s_1, a_1) and Q̂(s_1, a_2) that do not
achieve stable values with an increased number of iterations.
Solution 9.15
Part 2:
To verify the non-expansion property, we compute

    |Boltz_T h − Boltz_T h'| = | Σ_{i∈I} h(i) e^{h(i)/T} / Σ_{i∈I} e^{h(i)/T} − Σ_{i∈I} h'(i) e^{h'(i)/T} / Σ_{i∈I} e^{h'(i)/T} |.
Exercise 9.16**
Solution 9.16
a. Let m = max(X) and let W ≥ 1 be the number of values in X that are equal to m.
We obtain

    lim_{ω→∞} mm_ω(X) = lim_{ω→∞} (1/ω) log( (1/n) Σ_{i=1}^{n} e^{ω x_i} )
                      = lim_{ω→∞} (1/ω) log( (e^{ωm}/n) Σ_{i=1}^{n} e^{ω(x_i − m)} )
                      = lim_{ω→∞} (1/ω) log( e^{ωm} W / n ) = m = max(X).

b. Let X and Y be two vectors of length n, and let Δ_i = X_i − Y_i be the
difference of their i-th components. Let i* be the index with the maximum
component-wise difference: i* = argmax_i Δ_i. Without loss of generality, we
assume that x_{i*} − y_{i*} ≥ 0. We obtain

    |mm_ω(X) − mm_ω(Y)| = (1/ω) | log( (1/n) Σ_{i=1}^{n} e^{ω x_i} / (1/n) Σ_{i=1}^{n} e^{ω y_i} ) |
                        = (1/ω) log( Σ_{i=1}^{n} e^{ω(y_i + Δ_i)} / Σ_{i=1}^{n} e^{ω y_i} )
                        ≤ (1/ω) log( Σ_{i=1}^{n} e^{ω(y_i + Δ_{i*})} / Σ_{i=1}^{n} e^{ω y_i} )
                        = (1/ω) log e^{ω Δ_{i*}} = |Δ_{i*}| = max_i |X_i − Y_i|.
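Both properties are easy to check numerically. The sketch below implements mm_ω(X) = (1/ω) log((1/n) Σ_i e^{ω X_i}) with a log-sum-exp for numerical stability, shows that it approaches max(X) as ω grows, and checks the non-expansion bound |mm_ω(X) − mm_ω(Y)| ≤ max_i |X_i − Y_i| on random vectors (an illustration, not a proof).

import numpy as np
from scipy.special import logsumexp

def mellowmax(x, omega):
    # mm_omega(x) = (1/omega) * log((1/n) * sum_i exp(omega * x_i))
    x = np.asarray(x, dtype=float)
    return (logsumexp(omega * x) - np.log(len(x))) / omega

x = np.array([0.1, 0.7, 0.7, -0.3])
for omega in (1.0, 10.0, 100.0, 1000.0):
    print(omega, mellowmax(x, omega))       # approaches max(x) = 0.7 as omega grows

rng = np.random.default_rng(0)
X, Y = rng.normal(size=5), rng.normal(size=5)
lhs = abs(mellowmax(X, 5.0) - mellowmax(Y, 5.0))
print(lhs <= np.max(np.abs(X - Y)))         # non-expansion bound holds: True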
Chapter 10
Applications of Reinforcement Learning
Exercise 10.1
Derive Eq.(10.46) that gives the limit of the optimal action in the QLBS model
in the continuous-time limit.
Solution 10.1
The first term in Eq.(10.46) is evaluated in the same way as Eq.(10.22) and yields
    lim_{Δt→0} E_t[ ΔŜ_t Π̂_{t+1} ] / E_t[ (ΔŜ_t)² ] = ∂Ĉ_t / ∂S_t.
Exercise 10.2
Solution 10.2
(a) When the conditional expectation E_{t,a}[ F^π_{t+1}(y_{t+1}) ] does not depend on the
action a_t, this term trivially cancels with the denominator Z_t. Therefore, the
optimal policy in this case is determined by the one-step reward: π(a_t | y_t) ∼ e^{R̂(y_t, a_t)}.
(b) When the dynamics are linear in the action a_t as in Eq.(10.125), the
optimal policy for a multi-step problem has the same Gaussian form as the
optimal policy for a single-step problem; however, this time, the parameters R_{aa},
R_{ay}, R_a of the reward function are re-scaled.
Exercise 10.3
Solution 10.3
Exercise 10.4
    G^π(y, a) = R̂(y, a) + (γ/β) Σ_{y'} ρ(y' | y, a) log Σ_{a'} π_0(a' | y') e^{β G^π(y', a')}
Solution 10.4
Plugging it into Eq.(10.122) and using the first-order Taylor expansion for the
logarithm, log(1 + x) = x + O(x²), we obtain

    G^π(y, a) = R̂(y, a) + γ Σ_{y', a'} ρ(y' | y, a) π_0(a' | y') G^π(y', a').

This is the same as the fixed-policy Bellman equation for the G-function G^π(y, a)
with π = π_0 and transition probability ρ(y', a' | y, a) = ρ(y' | y, a) π_0(a' | y').
Exercise 10.5
Solution 10.5
    Σ_p = −(1/(2β)) [ Q_t^{(uu)} ]^{−1} → 0

    ũ_t = −(1/2) [ Q_t^{(uu)} ]^{−1} Q_t^{(u)}

    ṽ_t = −(1/2) [ Q_t^{(uu)} ]^{−1} Q_t^{(ux)}
Exercise 10.6***
Solution 10.6
Writing Σ_i x_i = x^T 1, where 1 is a vector of ones, the integral representation of
the Heaviside step function reads

    θ( X̄ − x^T 1 ) = lim_{ε→0} (1/(2πi)) ∫_{−∞}^{∞} e^{iz(X̄ − x^T 1)} / (z − iε) dz.
This gives

    ∫ e^{−(1/2) x^T A x + x^T B} θ( X̄ − Σ_{i=1}^{n} x_i ) d^n x
        = (1/(2πi)) ∫_{−∞}^{∞} dz/(z − iε) ∫ e^{iz(X̄ − x^T 1)} e^{−(1/2) x^T A x + x^T B} d^n x,
where we should take the limit ε → 0 at the end of the calculation. Exchanging
the order of integration, the inner integral with respect to x can be evaluated
using the unconstrained formula, where we substitute B → B − iz1. The integral
becomes (here a := 1^T A^{−1} 1)

    √( (2π)^n / |A| ) e^{(1/2) B^T A^{−1} B} (1/(2πi)) ∫ e^{ iz X̄ − (1/2) z² 1^T A^{−1} 1 − iz B^T A^{−1} 1 } / ( z − iε ) dz

    = √( (2π)^n / |A| ) e^{(1/2) B^T A^{−1} B} e^{ −(B^T A^{−1} 1 − X̄)² / (2a) } (1/(2πi)) ∫ e^{ −(1/2) a ( z + i (B^T A^{−1} 1 − X̄)/a )² } / ( z − iε ) dz
The integral over z can be evaluated by changing the variable z → z − i (B^T A^{−1} 1 − X̄)/a,
taking the limit ε → 0, and using the following formula (here β := (B^T A^{−1} 1 − X̄)/a):

    (1/(2π)) ∫_{−∞}^{∞} e^{−(1/2) a z²} / ( β + iz ) dz = (β/π) ∫_{0}^{∞} e^{−(1/2) a z²} / ( β² + z² ) dz = ( 1 − N( β √a ) ) e^{a β² / 2},
where in the second equation we used Eq.(3.466) from Gradshteyn and Ryzhik².
Using this relation, we finally obtain

    ∫ e^{−(1/2) x^T A x + x^T B} θ( X̄ − Σ_{i=1}^{n} x_i ) d^n x
        = √( (2π)^n / |A| ) e^{(1/2) B^T A^{−1} B} [ 1 − N( ( B^T A^{−1} 1 − X̄ ) / √( 1^T A^{−1} 1 ) ) ].
2 I.S. Gradshteyn and I.M. Ryzhik, "Table of Integrals, Series, and Products", Elsevier,
2007.
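The closed-form expression above can be sanity-checked by Monte Carlo: the integrand equals Z p(x) θ(X̄ − 1^T x), where p is the density of a N(A^{−1}B, A^{−1}) vector and Z = √((2π)^n/|A|) e^{(1/2) B^T A^{−1} B}. The sketch below compares the two for a randomly chosen positive-definite A and arbitrary B and X̄ (all values are illustrative, not from the book).

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
n = 3
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)           # a positive-definite matrix A
B = rng.normal(size=n)
X_bar = 0.5
ones = np.ones(n)

A_inv = np.linalg.inv(A)
a = ones @ A_inv @ ones
Z = np.sqrt((2 * np.pi) ** n / np.linalg.det(A)) * np.exp(0.5 * B @ A_inv @ B)

# closed form: Z * (1 - N((B^T A^{-1} 1 - X_bar) / sqrt(1^T A^{-1} 1)))
closed = Z * (1.0 - norm.cdf((B @ A_inv @ ones - X_bar) / np.sqrt(a)))

# Monte Carlo: Z * P(1^T x <= X_bar) with x ~ N(A^{-1} B, A^{-1})
x = multivariate_normal(mean=A_inv @ B, cov=A_inv).rvs(size=200_000, random_state=1)
mc = Z * np.mean(x.sum(axis=1) <= X_bar)

print(closed, mc)    # the two numbers should approximately agree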
Chapter 11
Inverse Reinforcement Learning and Imitation
Learning
Exercise 11.1
a. Derive Eq.(11.7).
b. Verify that the optimization problem in Eq.(11.10) is convex.
Solution 11.1
    F^π(s) = ∫ π(a|s) [ r(s, a) − (1/β) log π(a|s) ] da + λ ( ∫ π(a|s) da − 1 ),

where λ is the Lagrange multiplier. Taking the variational derivative with respect to
π(a|s) and equating it to zero, we obtain

    r(s, a) − (1/β) log π(a|s) − 1/β + λ = 0.

Re-arranging this relation, we obtain π(a|s) ∝ e^{β r(s, a)}. The value of the Lagrange
multiplier λ can now be found from the normalization constraint ∫ π(a|s) da = 1. This
produces the final result

    π(a|s) = (1/Z_β(s)) e^{β r(s, a)},   where Z_β(s) = ∫ e^{β r(s, a)} da.
Exercise 11.2
Solution 11.2
If the basis functions Ψ(s, a) are linear in a while having an arbitrary dependence
on s, the second derivative of the reward is

    ∂²r_θ / ∂a² = − ( θ^T ∂Ψ/∂a )² / ( 1 + e^{−θ^T Ψ(s, a)} )² ≤ 0
Exercise 11.3
Solution 11.3
Taking the variational derivative of this expression with respect to D(s, a), we
obtain

    ρ_E(s, a) / D(s, a) − ρ_π(s, a) / ( 1 − D(s, a) ) = 0.

Re-arranging terms in this expression, we obtain Eq.(11.76):

    D(s, a) = ρ_E(s, a) / ( ρ_π(s, a) + ρ_E(s, a) ).

Substituting this expression into the first expression for Ψ*_GA, we obtain

    Ψ*_GA = ∫ ρ_E(s, a) log [ ρ_E(s, a) / ( ρ_π(s, a) + ρ_E(s, a) ) ] ds da
          + ∫ ρ_π(s, a) log [ ρ_π(s, a) / ( ρ_π(s, a) + ρ_E(s, a) ) ] ds da.

Dividing the denominators in the logarithms in both terms by two and re-arranging
terms, we obtain

    Ψ*_GA = D_KL( ρ_π || (1/2)(ρ_π + ρ_E) ) + D_KL( ρ_E || (1/2)(ρ_π + ρ_E) ) − log 4 = D_JS(ρ_π, ρ_E) − log 4,

where D_JS(ρ_π, ρ_E) is the Jensen–Shannon distance (11.72).
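A quick numerical illustration of these two relations on a toy discrete state–action space (the occupancy tables below are made up, not from the book): the optimal discriminator is ρ_E/(ρ_π + ρ_E), and substituting it back into the objective Σ ρ_E log D + Σ ρ_π log(1 − D) reproduces D_JS(ρ_π, ρ_E) − log 4 with the convention D_JS = D_KL(ρ_π || ½(ρ_π + ρ_E)) + D_KL(ρ_E || ½(ρ_π + ρ_E)).

import numpy as np

rng = np.random.default_rng(0)
rho_pi = rng.random((3, 2)); rho_pi /= rho_pi.sum()      # hypothetical agent occupancy measure
rho_E = rng.random((3, 2)); rho_E /= rho_E.sum()         # hypothetical expert occupancy measure

D = rho_E / (rho_pi + rho_E)                             # optimal discriminator

objective = np.sum(rho_E * np.log(D)) + np.sum(rho_pi * np.log(1.0 - D))

m = 0.5 * (rho_pi + rho_E)
d_js = np.sum(rho_pi * np.log(rho_pi / m)) + np.sum(rho_E * np.log(rho_E / m))

print(np.isclose(objective, d_js - np.log(4.0)))         # True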
Exercise 11.4
Solution 11.4
coincides with the Legendre transform. Taking the derivative of this expression
with respect to y and equating it to zero, we have

    ψ'(y) = x  ⇒  y_x = g(x),
    ψ'(g(x)) ≡ ψ'(y_x) = x,
    g'(x) = 1 / ψ''(g(x)).

We can therefore write the convex conjugate as a composite function of x:

    ψ*(x) = x g(x) − ψ(g(x)).

Differentiating, dψ*(x)/dx = y_x = g(x), and hence

    d²ψ*(x)/dx² = dy_x/dx = g'(x) = 1 / ψ''(g(x)) ≥ 0.

Therefore, the convex conjugate (Legendre transform) of a convex differentiable
function ψ(y) is convex.
To show that ψ** = ψ, first note that the original function can be written in
terms of its transform:

    ψ(y_x) = x y_x − ψ*(x).

Using this, we obtain

    ψ**(x) = x p_x − ψ*(p_x),   where dψ*(p)/dp |_{p=p_x} = x, i.e., g(p_x) = x,
           = g(p_x) p_x − ψ*(p_x) = ψ(g(p_x)) = ψ(x),

where we replaced x by g(p_x) in the third step, and replaced g(p_x) by x in the
last step.
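As a quick numerical illustration (not from the book), the sketch below computes the convex conjugate ψ*(x) = max_y (xy − ψ(y)) of the convex function ψ(y) = e^y by brute force on a grid, checks that ψ*(x) ≈ x log x − x, and verifies that the biconjugate recovers ψ on the region where the maximizing slope lies inside the grid.

import numpy as np

y = np.linspace(-3.0, 3.0, 2001)
psi = np.exp(y)                                  # a smooth convex function psi(y) = exp(y)

def conjugate(grid, values, points):
    # psi*(p) = max over the grid of (p * grid - values), evaluated by brute force
    return np.array([np.max(p * grid - values) for p in points])

x = np.linspace(0.1, 5.0, 500)                   # slopes whose maximizer y = log(x) lies inside the grid
psi_star = conjugate(y, psi, x)                  # should approximate x*log(x) - x
psi_star_star = conjugate(x, psi_star, y)        # biconjugate, should recover exp(y) where covered

print(np.max(np.abs(psi_star - (x * np.log(x) - x))))            # small
mask = (np.exp(y) >= 0.1) & (np.exp(y) <= 5.0)                   # region where the slope exp(y) is covered
print(np.max(np.abs(psi_star_star[mask] - psi[mask])))           # small (up to grid discretization)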
Exercise 11.5
Show that the choice f(x) = x log x − (x + 1) log((x + 1)/2) in the definition of the
f-divergence (11.73) gives rise to the Jensen–Shannon divergence (11.72) of
distributions P and Q.
Solution 11.5
    D_f(P || Q) = ∫ q(x) f( p(x)/q(x) ) dx
                = ∫ p(x) log [ p(x) / ( p(x) + q(x) ) ] dx + ∫ q(x) log [ q(x) / ( p(x) + q(x) ) ] dx
                = D_JS(p, q) − log 4
Exercise 11.6
In the example of learning a straight line from Sect. 5.6, compute the KL
divergences D_KL(P_θ || P_E) and D_KL(P_E || P_θ), and the JS divergence D_JS(P_θ, P_E).
Solution 11.6