

Nonparametric Adaptive Bayesian Stochastic Control
Under Model Uncertainty
Tao Chen∗ Jiyoun Myung†

∗Department of Mathematics, University of Michigan Ann Arbor, East Hall 2859, 530 Church Street, Ann Arbor, MI 48109-1043
†Department of Statistics and Biostatistics, California State University, East Bay, North Science 319, Hayward, CA 94542

Abstract
In this paper we propose a new methodology for solving a discrete time stochastic
Markovian control problem under model uncertainty. By utilizing the Dirichlet process,
we model the unknown distribution of the underlying stochastic process as a random
probability measure and achieve online learning in a Bayesian manner. Our approach
integrates optimization and dynamic learning. When dealing with model uncertainty,
the nonparametric framework allows us to avoid the model misspecification that usually
occurs in other classical control methods. We then develop a numerical algorithm to
handle the infinite-dimensional state space in this setup, utilizing Gaussian process
surrogates to obtain a functional representation of the value function in the Bellman
recursion. We also build separate surrogates for the optimal control to eliminate repeated
optimizations on out-of-sample paths and bring computational speed-ups. Finally,
we demonstrate the financial advantages of the nonparametric Bayesian framework
compared to parametric approaches such as strong robust and time-consistent adaptive control.

Key words: nonparametric adaptive Bayesian control, Dirichlet process, Gaussian process surrogates, utility maximization, model uncertainty, optimal portfolio.

1 Introduction
In solving stochastic control problems, attention has been paid to model risk, the uncertainty
about the underlying system dynamics. As discussed in [18], this type of uncertainty
must be distinguished from the measurable randomness of system realizations, and is hence called
Knightian uncertainty. This ambiguity is expressed either in terms of a parametric family of
distributions or a set of probability measures. In practice, probabilities of interest are often
estimated through observing system outcomes. Then, several families of approaches, such
as “robust” and “learning” methods, are applied to tackle the control problem.
The central idea of robust techniques, which goes back to [15], is to find the optimal
strategy that performs best in the worst-case scenario. Hence, a robust stochastic control
problem takes the form of an inf-sup optimization, where the supremum is taken over the family of probabilities

and the infimum is taken over the control set. This area has been extensively studied in
the literature using different approaches, some of which are briefly described in Section 2.1. There
are two issues that must be addressed in these approaches. First, one usually assumes equal
weights for all possible distributions or probability measures in the considered family even
when some are much less plausible than the others. To overcome this drawback, a penalty
function of probabilities can be added to the objective function that is going to be optimized.
One challenge in this treatment then becomes how the penalty function is properly chosen.
Second, the set of probabilities is frequently fixed in time even in a dynamic environment, in
which uncertainty about the system can be reduced as newly incoming information about the
underlying system becomes available. To address this issue, the adaptive robust methodology
is proposed in [7], which initiates the study of dynamically reducing uncertainty of the
underlying model while solving robust stochastic control problems. Serving as a fundamental
tool for such an approach, an innovative statistical method of online updating the confidence
regions for the unknown system parameters is introduced in [6].
Traditionally, one uses the Bayesian method (cf. [22, 19]) to incorporate learning into
solving control problems. The rationales behind such methods are twofold. On one hand, the
underlying system is learned through observations of the data. On the other hand, by mod-
eling the uncertainty about the true parameters as random variables, posteriors determine
the weights to different models. Hence, the inf-sup formulation is replaced by the weighted
average across all possible models, which leads to an "inf-integral" problem. One could naively
use the Bayesian technique to learn the system, then subsequently control the learnt sys-
tem. One concern regarding such implementation for dynamically consistent problems is
that the corresponding optimal control is essentially myopic by separating the learning from
the control. As discussed in [19], the optimal control should ideally be maintained even
during the learning phase. Contrary to the naive Bayesian algorithm that separates the two
phases, some other approaches (cf. [23, 3, 7]) account for the controller being cognizant
that knowledge of the unknown model may change in the future and that, therefore, such
issues should be addressed at the present time. Indeed, a holistic Bayesian framework
known as Bayesian adaptive control has been explored, in which control and online learning
are integrated together. In Section 2.2, we will briefly review such work.
In this study, we propose a nonparametric adaptive Bayesian methodology that solves
stochastic control problems under model uncertainty in a discrete time setup according to
the Bellman principle. Some earlier related parametric frameworks are e.g., [2, 19]. In
contrast with these works, our setup does not assume any model for the underlying system
process as such. The proposed approach is more data driven and avoids the issue of model
misspecification. In addition, we prove that the optimal control problem satisfies the Bellman
principle. By considering a Borel measurable loss function, we show that optimal selectors
exist and are universally measurable with respect to the relevant augmented state variable.
We also use the machine learning technique, namely the Gaussian process surrogates, to
numerically solve the Bellman equations.
The paper is organized as follows. In Section 2 we briefly review some of the existing
methodologies for solving stochastic control problems subject to model uncertainty, from
both robust and Bayesian perspectives. We introduce our nonparametric adaptive Bayesian
approach in Section 3 and present the theoretical results in Section 4. In Section 5, as an
illustrative example, a utility maximization problem of optimal investment is considered. We
solve the problem by utilizing the proposed approach combined with some machine learning
techniques. Finally, we provide a comparative analysis to the existing control methods.

2 Existing Methodologies for Stochastic Control Problems under Model Uncertainty
We start our discussion with a review of classical methods and novel approaches introduced
recently for solving dynamically consistent stochastic control problems subject to Knightian
uncertainty.

2.1 Robust Methodologies


In this section, we will mainly discuss the existing parametric robust techniques for dealing
with model uncertainty. Nonetheless, readers should also be aware of other nonparametric
robust methods proposed in the past few decades. In [12], the author treats model uncer-
tainty as multiple probability measures and studies its impact on pricing derivatives. The
topic has been receiving more and more attention since the last financial crisis. The copula
approach, which the financial industry was using to price financial products such as
Collateralized Debt Obligations (CDOs) without accounting for the potential model risk, is
partially blamed for the disaster. An enormous amount of effort has been devoted to addressing
the issue since the financial market meltdown. In the breakthrough paper [9], the authors
prove a version of the first fundamental theorem of asset pricing under model uncertainty in
the quasi-sure sense. A related work is [4], which studies the topic by taking into account the
transaction costs. The adaptive robust framework introduced in [7] incorporates reducing
the uncertainty in the robust method, and is applied to time-inconsistent Markovian control
problems under model uncertainty in the follow-up work [8].
Although nonparametric robust methods are theoretically sound, most of them are diffi-
cult to implement in practice due to the optimization required over a family of probability
measures. On the contrary, robust techniques in the parametric setup have been widely
used by large banks when addressing the model risk. By imposing a parametric model with
unknown parameters, the numerical part of the work becomes significantly easier as one
optimizes over a set of numbers rather than abstract probability measures. To this end, we
will go through several important setups of robust stochastic control problems. In Section 5,
we will also compare our approach to one of the discussed methodologies, strong robust, via
an illustrative example.
Let $(\Omega, \mathscr{F})$ be a measurable space, and let some positive integer $T$ be a fixed time horizon. Consider a random process $\{Y_t,\ t = 0, 1, \ldots, T\}$ taking values in some measurable space. The process $\{Y_t\}$ is assumed to be observed, but its true law belongs to a family of probability distributions $\{P_\theta,\ \theta \in \Theta\}$ and corresponds to an unknown parameter $\theta^*$. Denote by $\mathbb{F} = (\mathscr{F}_t,\ t = 0, \ldots, T)$ the natural filtration generated by the process $\{Y_t\}$. A family $\mathcal{U}$ of $\mathbb{F}$-adapted processes $\{\varphi_t\}$ taking values in some measurable space is considered as the set of admissible controls. Additionally, let $L$ be a function of $Y^0 := \{Y_0, \ldots, Y_T\}$ and $\varphi^0 := \{\varphi_0, \ldots, \varphi_{T-1}\}$. The stochastic control problem at hand is then formulated as
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \mathbb{E}_{\theta^*}\big[L(Y^0, \varphi^0)\big], \tag{2.1}$$
given that one knows $\theta^*$.


However, subject to Knightian uncertainty, one cannot deal with problem (2.1) directly, since the value of $\theta^*$ is unknown. Various robust methodologies have been proposed in view of such ambiguity:

• the (static) robust control approach
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \sup_{\theta \in \Theta} \mathbb{E}_{\theta}\big[L(Y^0, \varphi^0)\big], \tag{2.2}$$
which optimizes the objective function over the worst-case model through the whole time scale, is discussed in, e.g., [16, 17, 1].

• the strong robust control approach
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \sup_{\mathbb{Q} \in \mathcal{Q}_K} \mathbb{E}_{\mathbb{Q}}\big[L(Y^0, \varphi^0)\big], \tag{2.3}$$
searches for the worst-case model in each single time period. Above, $\mathcal{Q}_K$ is a set of probability measures on the canonical space, and $K$ is the set of sequences $\{\theta_t\}$ chosen by a Knightian adversary playing against the controller (cf. [23, 3]).

• the adaptive robust control approach
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \sup_{\mathbb{Q} \in \mathcal{Q}_\Psi} \mathbb{E}_{\mathbb{Q}}\big[L(Y^0, \varphi^0)\big], \tag{2.4}$$
incorporates learning into the robust methodology by dynamically shrinking the uncertainty set and finds the worst-case model in each time period. Above, $\mathcal{Q}_\Psi$ is a set of probability measures on the relevant canonical space. The family $\mathcal{Q}_\Psi$ is constructed in such a way that the set of adversary strategies $\Psi$ consists of the set-valued processes $\tau(t, \hat{\theta}_t)$ which, for instance, can be chosen as the confidence region for $\theta^*$ at time $t$ based on the point estimator $\hat{\theta}_t$. For more details, we refer the reader to [7].

The classical (but static) robust method is usually conservative by its nature. As shown
in [7], for an optimal investment problem that requires the controller to dynamically allocate
the wealth in the risk-free asset and a risky asset, the static robust approach will lead to
investment in the risk-free asset only through the whole time scale. As discussed in [20], "If
the true model is the worst one, then this solution will be nice and dandy. However, if the
true model is the best one or something close to it, this solution could be very bad (that is,
the solution need not be robust to model error at all!)."
The strong robust method tries to overcome this drawback by considering the worst-
case model at each time period. While making decisions at time t, the controller takes into
account that she could change her opinion about the worst-case model in the future and adapts her
strategies to such a possibility. However, as demonstrated in [7], the strong robust approach provides
exactly the same solution as the static robust approach for certain problems.
A new framework called adaptive robust has been proposed recently. In view of the limitations of the two tactics mentioned above, the adaptive robust method addresses the issue by
dynamically updating the parameter space and removing the unlikely models out of consid-
eration. This procedure is completed by learning about the system dynamics and utilizing
the recursively constructed confidence regions for the unknown parameters (cf. [6]). When
the penalty term is absent in the objective function, it also partially solves the problem of
(unreasonably) considering all possible models with equal weights, since some implausible
values of the parameters will be removed due to the learning process. The strong robust
approach is essentially a special case of the adaptive robust by fixing the parameter space
throughout. The challenges in scaling such methods to high dimensional problems also inspire the
employment of machine learning techniques to solve robust control problems. In [10], the
authors propose and develop a novel algorithm for the adaptive robust control by utilizing
the ideas from regression Monte Carlo, adaptive experimental design, and statistical learn-
ing. Numerical studies in the paper, as well as in [7], show that the adaptive robust achieves
a sound balance between being conservative and aggressive.

2.2 Bayesian Methodologies


As discussed previously, methods of using the Bayesian theory to solve stochastic control
problems under model uncertainty have been developed for quite a while. The so-called
Bayesian adaptive control is studied in various projects, and we refer readers to e.g. [19, 22]
for detailed discussions. Both references integrate learning (in a Bayesian manner) and
optimization, and use the Bellman principle to solve the control problem. In [19], the sequence
of Bayesian estimators of the unknown parameters constructed via the filtering technique
is augmented to the state process, and the stochastic optimal control problem with partial
observations is turned into one with complete observations, which can be solved by dynamic
programming. In [22], the author considers a non-stationary Bayesian dynamic decision
model and reduces it to decision models with completely known transition law. The strategy
is the same as in [19]: to augment the set of posterior distributions to the state space.
A similar discussion can also be found in [2]. We hereby summarize the corresponding
formulation of control problems as follows,
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \int_{\Theta} \mathbb{E}_{\theta}\big[L(Y^0, \varphi^0)\big]\, \nu(d\theta), \tag{2.5}$$
where $\nu$ is the prior distribution on the parameter space $\Theta$. When solving such a problem
according to the Bellman principle, intermediate expectations are computed according to the
latest posterior distributions. In a recent work [11], the author considers a filtering problem
latest posterior distributions. In a recent work [11], the author considers a filtering problem
of discrete time hidden Markov models subject to model uncertainty and uses nonlinear
expectations to model the uncertainty. In particular, the expectation taken under uncertainty
is treated as a nonlinear expectation and formulation of the control problem is dynamically
consistent so that solutions are obtained by solving Bellman equations.
We want to mention that most of the aforementioned works consider a parametric model
for the system process. This methodology has shortcomings, as enforcing a specific family
of distributions for an unknown law is arguably not the optimal starting point. Nonpara-
metric problems can be considered to address this issue. In [13, 14], the Dirichlet process is
introduced as a prior distribution on the space of probability measures and indeed shown to
yield some desirable properties for handling such kind of problems.
A Dirichlet process can be viewed as a probability measure on the set of probability
distributions and therefore a random probability measure. It is shown in [13] that, with
respect to the weak convergence topology, the support of a Dirichlet process contains any
probability measure whose support is contained in the support of the parameter of the
Dirichlet process (cf. Definition 3.1 and discussion). More importantly, the posterior given a
sample of observations from the true probability distribution is also a Dirichlet process. Such
properties shed light on incorporating dynamic learning into stochastic control problems in
a nonparametric way.
Given these desirable properties, a theoretical framework utilizing the Dirichlet process
can potentially achieve some success in solving stochastic control problems under model un-
certainty. Indeed, the author in [14] explored using the Dirichlet process to handle uncertainty
and solve an adaptive investment problem (see [14, Section 5] for more discussion). How-
ever, a complete and detailed theory of nonparametric Bayesian control, to the best of our
knowledge, has not been established. In this paper, an adaptive Bayesian framework built
upon the tools developed for Dirichlet processes is proposed. We consider a discrete time
dynamic stochastic control problem where the noise process is observed but with unknown
distribution. The corresponding formulation of the problem is a blend of online learning
and optimal control, for which the Bellman principle and existence of universally measur-
able selectors are proved. Our algorithm can also be seen as a new way to construct the
augmented state space. Instead of using posterior distributions as in the existing literature,
we recursively update the parameter of the Dirichlet process based on observations of the
incoming signal and the resulting sequence is augmented to the state process. In turn, Borel
measurability of the updating rule for the state process is carried out nicely. This property
is essential in our proof for existence of measurable selectors. Finally, implementation of the
approach involves regression/interpolation against measures on the relevant space. We sug-
gest an approximation of the state space, and a machine learning technique for overcoming
the challenge in the numerical example.
There are important drawbacks of the Dirichlet process to be noted. In particular, a
sample distribution drawn from the process is discrete with probability one. Hence, in
the resulting inf-integral formulation, the integral is only taken over all discrete probability
measures. It is then worth emphasizing that the proposed methodology is not limited to
the Dirichlet process; its extensions (e.g. [21]) will apply as well. In this paper,
the Dirichlet process was chosen for simplicity and illustrative purposes. Study of
nonparametric adaptive Bayesian using random probability measures that sample continuous
distributions will be deferred to future work.

3 Nonparametric Adaptive Bayesian Control Methodology
In this section, we elaborate on the ideas presented in [2, 19], and utilize the nonparametric
tools introduced in [13, 14] to develop a nonparametric Bayesian framework for solving
stochastic control problems subject to Knightian uncertainty. Towards this end, we begin
our presentation with a precise formulation of the problem.
Similar to Section 2, fix a finite time horizon $T$ and let $(\Omega, \mathscr{F})$ be some measurable space, on which we consider a sequence of $\mathbb{R}^d$-valued random variables $\{Y_t,\ t = 0, \ldots, T\}$ with its natural filtration $\mathbb{F}$. Denote by $U \subset \mathbb{R}^k$ a compact subset of $\mathbb{R}^k$. We assume that there exists an $\mathbb{F}$-adapted process $\{\varphi_t\}$ which takes values in $U$ and plays the role of a control process. Let $\mathcal{U}$ be the set of all control processes. We also consider a real-valued noise process $\{Z_t\}$. For simplicity, we postulate that the sequence is i.i.d. (we consider an i.i.d. sequence here for illustrative purposes; of note, our theory also works for more general noise processes as long as a nonparametric Bayesian estimate is feasible). In addition, $\{Z_t\}$ is observed, but its true distribution $P_Z$ is unknown. We describe $\{Y_t\}$ as the state process of some controlled dynamical system, satisfying the following abstract dynamics
$$Y_{t+1} = f_Y(Y_t, \varphi_t, Z_{t+1}).$$
It is further assumed that the function $f_Y : \mathbb{R}^d \times U \times \mathbb{R} \to \mathbb{R}^d$ is continuous.


Denote by $\mathcal{P}(\mathbb{R})$ the set of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, where $\mathcal{B}(\mathbb{R})$ is the Borel $\sigma$-algebra on $\mathbb{R}$. We equip the set $\mathcal{P}(\mathbb{R})$ with the Borel $\sigma$-algebra corresponding to the Prokhorov metric. In this case, convergence of probability measures in the Prokhorov metric is equivalent to weak convergence, and the space $\mathcal{P}(\mathbb{R})$ is Polish. Next, we recall the definition of the Dirichlet process, which is the main tool used in this work.

Definition 3.1. Let $\alpha$ and $D$ be a finite non-null measure and a random probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, respectively. We say that $D$ is a Dirichlet process with parameter $\alpha$, and write $D \in \mathcal{D}(\alpha)$, if for every finite measurable partition $\{B_1, \ldots, B_n\}$ of $\mathbb{R}$, the random vector $(D(B_1), \ldots, D(B_n))$ has a Dirichlet distribution with parameter $(\alpha(B_1), \ldots, \alpha(B_n))$.

It is well-known that the support of D with respect to the topology of weak convergence
is the set of all distributions on R whose supports are contained in the support of α. In
this paper, we will always take α as a finite measure with full support. On the other
hand, nonparametric learning of an unknown distribution can be done through a sequence
of Dirichlet processes in a Bayesian manner. To this end, for the unknown distribution PZ ,
we assign a Dirichlet process $\mathcal{D}(\alpha)$ as its prior distribution. Let $c_0 = \alpha(\mathbb{R})$ and $P_0 = \alpha / c_0$; we will write that the prior for $P_Z$ is $\mathcal{D}(c_0 P_0)$.
Given the observations $Z_1, \ldots, Z_t$, define the random probability measure
$$P_t = \frac{c_0 P_0 + \sum_{s=1}^{t} \delta_{Z_s}}{c_0 + t},$$
where $\delta$ is the Dirac measure, and we know that the posterior for $P_Z$ is the Dirichlet process $\mathcal{D}((c_0 + t) P_t)$. Clearly, the sequence of random probability measures $P_t$, $t = 1, \ldots, T$, can be written in the following recursive way
$$P_t = \frac{(c_0 + t - 1) P_{t-1} + \delta_{Z_t}}{c_0 + t} =: f_P^{c_0}(t-1, P_{t-1}, Z_t), \qquad t = 1, \ldots, T,$$
with initial value $P_0 = \alpha / c_0$. In this work, the process $\{P_t\}$ will represent the dynamic learning of $P_Z$, as the time-$t$ posterior of $P_Z$ is given by $\mathcal{D}((c_0 + t) P_t)$, $t = 1, \ldots, T$.
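To make the update concrete, here is a minimal Python sketch (our illustration, not part of the paper's algorithm) that represents a finitely supported measure as a dict of atom-weight pairs; discretizing the base measure $P_0$ is in line with the fact that samples from a Dirichlet process are almost surely discrete:

```python
from collections import defaultdict

def dirichlet_posterior_update(P_prev, z, c0, t):
    """One step of P_t = ((c0 + t - 1) * P_{t-1} + delta_{Z_t}) / (c0 + t).

    P_prev: dict mapping atoms to weights, a finitely supported stand-in
    for P_{t-1}; z: the new observation Z_t. Returns the measure P_t.
    """
    P_next = defaultdict(float)
    for atom, w in P_prev.items():
        P_next[atom] += (c0 + t - 1) * w / (c0 + t)  # reweight existing atoms
    P_next[z] += 1.0 / (c0 + t)                      # Dirac mass at Z_t
    return dict(P_next)

# Example: a three-atom discretization playing the role of P_0 = alpha / c0.
P, c0 = {-0.01: 0.3, 0.0: 0.4, 0.01: 0.3}, 5.0
for t, z in enumerate([0.012, -0.004], start=1):
    P = dirichlet_posterior_update(P, z, c0, t)
print(abs(sum(P.values()) - 1.0) < 1e-12)  # total mass stays 1: True
```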
Now, we proceed to formulate the nonparametric adaptive Bayesian control problem. By adopting a similar idea to those presented in [2, 19, 7], we consider the augmented state process $X_t = (Y_t, P_t)$, $t = 0, \ldots, T$, and the augmented state space
$$E_X = \mathbb{R}^d \times \mathcal{P}(\mathbb{R}).$$
In view of the fact that both $\mathbb{R}^d$ and $\mathcal{P}(\mathbb{R})$ are Polish spaces and therefore Borel spaces, the Cartesian product $E_X$ with the product topology is also a Borel space, and the Borel $\sigma$-algebra $\mathcal{E}_X$ coincides with the product $\sigma$-algebra. The process $\{X_t\}$ has the following dynamics,
$$X_{t+1} = G^{c_0}(t, X_t, \varphi_t, Z_{t+1}), \qquad t = 0, \ldots, T-1,$$
where $G^{c_0}$ is defined as
$$G^{c_0}(t, x, u, z) = \big(f_Y(y, u, z),\, f_P^{c_0}(t, P, z)\big), \tag{3.1}$$
for $x = (y, P) \in E_X$.
Given our assumptions, the process $\{X_t\}$ is $\mathbb{F}$-adapted and Markovian. Therefore, we are essentially dealing with a Markov decision problem. This leads to the fact that the optimal control at any time $t = 0, \ldots, T-1$, and given any state $x \in E_X$, will be a function of $t$ and $x$. See Proposition 4.1 and Theorem 4.3 for the justification. In order to proceed, we present the following technical result regarding the updating rule $G^{c_0}$.
Lemma 3.2. For any $t = 0, \ldots, T-1$, the mapping $G^{c_0}(t, \cdot, \cdot, \cdot)$ is continuous.

Proof. It is enough to show that $f_P^{c_0}(t, P, z) := \frac{(c_0 + t)P + \delta_z}{c_0 + t + 1}$, $P \in \mathcal{P}(\mathbb{R})$, $z \in \mathbb{R}$, is continuous with respect to $P$ and $z$ for any fixed $t = 0, \ldots, T-1$.
Assume that $(P_n, z_n) \to (P, z)$, where $P, P_n \in \mathcal{P}(\mathbb{R})$, $z, z_n \in \mathbb{R}$, $n = 1, 2, \ldots$. Then $P_n \to P$ weakly and $z_n \to z$. Take $B \subset \mathbb{R}$ such that
$$\frac{(c_0 + t)P + \delta_z}{c_0 + t + 1}(\partial B) = 0.$$
Then the set $B$ satisfies $P(\partial B) = 0$ and $z \notin \partial B$. According to the Portmanteau theorem, we have $P_n(B) \to P(B)$ and $\delta_{z_n}(B) \to \delta_z(B)$. It follows that
$$\lim_{n \to \infty} \frac{(c_0 + t)P_n + \delta_{z_n}}{c_0 + t + 1}(B) = \frac{(c_0 + t)P + \delta_z}{c_0 + t + 1}(B).$$
Continuity of $f_P^{c_0}(t, \cdot, \cdot)$ follows by the Portmanteau theorem.


Lemma 3.2 shows that $G^{c_0}(t, \cdot, \cdot, \cdot) : E_X \times U \times \mathbb{R} \to E_X$ is a continuous mapping. As a result, we obtain the following corollary.

Corollary 3.3. The mapping $G^{c_0}(t, \cdot, \cdot, \cdot)$ is Borel measurable.

We proceed to define the transition probability for the process $\{X_t\}$. That is, for any $t = 0, \ldots, T-1$ and $(x, u) \in E_X \times U$, we define a probability measure on $\mathcal{E}_X$ as
$$Q(B \mid t, x, u; c_0) = \int_{\mathcal{P}(\mathbb{R})} P'\big(\{z : G^{c_0}(t, x, u, z) \in B\}\big)\, \pi(dP'), \qquad \pi = \mathcal{D}((c_0 + t)P),$$
for any $B \in \mathcal{E}_X$, where $x = (y, P)$. According to [13, Proposition 4], we have that
$$Q(B \mid t, x, u; c_0) = P\big(\{z : G^{c_0}(t, x, u, z) \in B\}\big), \qquad x = (y, P). \tag{3.2}$$
In view of Corollary 3.3, we have that $Q$ is a Borel stochastic kernel on $E_X$ given $E_X \times U$.


To proceed, we take $\mathcal{U}$ to be the set of all sequences of universally measurable functions on $E_X$. Then, for any $c_0 > 0$, $x_0 = (y_0, P_0) \in E_X$, and control process $\{\varphi_t\} \in \mathcal{U}$, we denote $\varphi^0 = (\varphi_0, \ldots, \varphi_{T-1})$ and define the probability measure $Q^{\varphi^0}_{c_0, x_0}$ on the canonical space $E_X^{T+1}$:
$$Q^{\varphi^0}_{c_0, x_0}(B_0 \times \cdots \times B_T) = \int_{B_0} \cdots \int_{B_T} \prod_{t=1}^{T} Q(dx_t \mid t-1, x_{t-1}, \varphi_{t-1}(x_{t-1}); c_0)\, \delta_{(y_0, P_0)}(dx_0). \tag{3.3}$$
Similarly, we define the probability measure $Q^{\varphi^t}_{c_0, x_t}$ on the concatenated canonical space $\prod_{s=t+1}^{T} E_X$ by
$$Q^{\varphi^t}_{c_0, x_t}(B_{t+1} \times \cdots \times B_T) = \int_{B_{t+1}} \cdots \int_{B_T} \prod_{s=t+1}^{T} Q(dx_s \mid s-1, x_{s-1}, \varphi_{s-1}(x_{s-1}); c_0), \tag{3.4}$$
where $\varphi^t := (\varphi_t, \varphi_{t+1}, \ldots, \varphi_{T-1})$, and we denote by $\mathcal{U}^t$ the collection of such sequences.


Now, the nonparametric adaptive Bayesian control problem is formulated as
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \mathbb{E}_{Q^{\varphi^0}_{c_0, x_0}}[\ell(Y_T)], \tag{3.5}$$
where $\ell$ is a measurable function. By employing the canonical construction of the augmented process space $E_X^{T+1}$, dynamic learning of the unknown distribution $P_Z$ is carried out along each path of the process $\{X_t\}$. In a robust framework such as [7], the control problem can be seen as a game between the controller and the Knightian adversary. Nature maximizes the objective function over the set of probability measures on $E_X^{T+1}$, contrary to the controller's intention to minimize across the admissible strategies. Therefore, the controller is essentially trying to minimize a nonlinear expectation. In our formulation, nature assigns weights to all possible models via a Dirichlet process and chooses her strategy as a weighted average of her options; accordingly, the controller minimizes a linear expectation on the canonical space.

Remark 3.4. In this work, we consider the Markov decision problem with terminal loss. We want to stress that our framework can be easily extended to deal with problems with intermediate costs. We can also adjust definitions (3.3), (3.4), and (3.5) by adopting history-dependent controls; then the framework applies to non-Markov decision problems.

4 Solution to the Adaptive Bayesian Control Problem


The main result in this section is to prove that problem (3.5) satisfies the dynamic programming principle. Hence, it can be solved recursively, and the optimal control will be obtained. To this end, we consider the associated adaptive Bayesian Bellman equations
$$W_T^{c_0}(x) = \ell(y), \qquad x = (y, P) \in E_X,$$
$$W_t^{c_0}(x) = \inf_{u \in U} \mathbb{E}_P\big[W_{t+1}^{c_0}\big(G^{c_0}(t, x, u, Z_{t+1})\big)\big], \qquad x = (y, P) \in E_X,\ t = 0, \ldots, T-1, \tag{4.1}$$
and we will show that
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \mathbb{E}_{Q^{\varphi^0}_{c_0, x_0}}[\ell(Y_T)] = W_0^{c_0}(x_0).$$
To proceed, we will first justify, by using the Jankov-von Neumann theorem ([5, Propositions 7.49-7.50]), that universally measurable selectors $\varphi_t^*(x)$, $t = 0, \ldots, T-1$, exist for the associated Bellman equations, i.e.,
$$\mathbb{E}_P\big[W_{t+1}^{c_0}\big(G^{c_0}(t, x, \varphi_t^*(x), Z_{t+1})\big)\big] = W_t^{c_0}(x), \tag{4.2}$$
for any $t = 0, \ldots, T-1$.
Towards this end, we postulate that the loss function $\ell : \mathbb{R} \to \mathbb{R}$ is Borel measurable. Then, we have the following result.

Proposition 4.1. The functions $W_t^{c_0}$, $t = T, T-1, \ldots, 0$, are lower semianalytic (l.s.a.), and universally measurable optimal selectors $\varphi_t^{c_0,*}(x)$, $t = T-1, \ldots, 0$, in (4.1) exist.

Proof. We will fix $c_0 > 0$ throughout. Note that $\ell$ is Borel measurable, and $G^{c_0}$ is Borel measurable according to Corollary 3.3. Thus, $W_T^{c_0}(G^{c_0}(T-1, \cdot, \cdot, \cdot))$ is Borel measurable and therefore l.s.a. on $E_X \times U$. Then, we denote
$$w_{T-1}(x, u) = \mathbb{E}_P\big[W_T^{c_0}\big(G^{c_0}(T-1, x, u, Z_T)\big)\big],$$
and $w_{T-1}$ is an l.s.a. function that maps $E_X \times U$ to $\overline{\mathbb{R}}$, where $\overline{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}$ is the extended real line.
By adopting the notations of [5, Proposition 7.50], we let
$$X = E_X = \mathbb{R}^d \times \mathcal{P}(\mathbb{R}), \quad x = (y, P),$$
$$Y = U, \quad y = u,$$
$$D = E_X \times U,$$
$$f(x, y) = w_{T-1}(x, u).$$
In view of our assumptions, both $X$ and $Y$ are Borel spaces. The set $D$ is Borel and hence analytic. It is trivial to verify that $\mathrm{proj}_X(D) = E_X$ and $D_x = U$ for any $x \in E_X$. Define $w_{T-1}^* : E_X \to \overline{\mathbb{R}}$ by
$$w_{T-1}^*(x) = \inf_{u \in U} f(x, y).$$
Then, the Jankov-von Neumann theorem (cf. [5, Propositions 7.49-7.50]) yields that for any $\varepsilon > 0$, there exists an analytically measurable function $\varphi_{T-1}^{*,\varepsilon} : E_X \to U$ satisfying
$$w_{T-1}\big(x, \varphi_{T-1}^{*,\varepsilon}(x)\big) = \begin{cases} w_{T-1}^*(x) + \varepsilon, & \text{if } w_{T-1}^*(x) > -\infty, \\ -1/\varepsilon, & \text{if } w_{T-1}^*(x) = -\infty. \end{cases}$$
Next, for every positive integer $n$, there exists an analytically measurable function $\varphi_{T-1}^{*,n}$ such that
$$w_{T-1}\big(x, \varphi_{T-1}^{*,n}(x)\big) = \begin{cases} w_{T-1}^*(x) + \frac{1}{n}, & \text{if } w_{T-1}^*(x) > -\infty, \\ -n, & \text{if } w_{T-1}^*(x) = -\infty. \end{cases}$$
Since the set $U$ is compact, for any fixed $x \in E_X$ there is a convergent subsequence $\{\varphi_{T-1}^{*,n_k}(x)\}$. Define $\tilde{\varphi}_{T-1}^{*,c_0}(x) = \lim_{k \to \infty} \varphi_{T-1}^{*,n_k}(x)$; for the fixed $x$ we have $w_{T-1}\big(x, \tilde{\varphi}_{T-1}^{*,c_0}(x)\big) = w_{T-1}^*(x)$. Therefore, the set
$$I = \{x \in E_X \mid \text{for some } u_x \in U,\ w_{T-1}(x, u_x) = w_{T-1}^*(x)\}$$
coincides with $E_X$. In view of [5, Proposition 7.50] part (b), with a slight abuse of notation, there exists a universally measurable function $\varphi_{T-1}^{c_0,*}(x)$ that is the optimal selector. Moreover, the function $W_{T-1}^{c_0}(x) = w_{T-1}^*(x)$ is l.s.a. By [5, Lemma 7.30], $W_{T-1}^{c_0}(G^{c_0}(T-2, \cdot, \cdot, \cdot))$ is l.s.a. The rest of the proof follows analogously.


Next, we move on to prove that problem (3.5) can be solved by using the dynamic programming principle. Towards this end, we define the functions
$$V_t^{c_0}(x, \varphi^t) = \mathbb{E}_{Q^{\varphi^t}_{c_0, x}}[\ell(Y_T)], \qquad t = 0, \ldots, T-1,$$
$$V_t^{c_0,*}(x) = \inf_{\varphi^t \in \mathcal{U}^t} \mathbb{E}_{Q^{\varphi^t}_{c_0, x}}[\ell(Y_T)], \qquad t = 0, \ldots, T-1,$$
$$V_T^{c_0,*}(x) = \ell(y),$$
for $x \in E_X$ and $\varphi^t$ a sequence of measurable functions. We provide the following technical result to show the regularity of the functions $V_t^{c_0}$, $t = 1, \ldots, T$, so that they can be integrated.

Lemma 4.2. For any $t = 0, \ldots, T$ and universally measurable sequence $\varphi^t$, the function $V_t^{c_0}(x, \varphi^t)$ is universally measurable.

We omit the proof of Lemma 4.2 as it follows easily from [5, Proposition 7.46]. Now, with the support of Proposition 4.1 and Lemma 4.2, we present the main result of this section.

Theorem 4.3. The process $\{\varphi_t^{c_0,*}\}$ constructed from the selectors in Proposition 4.1 is the solution of the nonparametric Bayesian control problem (3.5):
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \mathbb{E}_{Q^{\varphi^0}_{c_0, x_0}}[\ell(Y_T)] = V_0^{c_0,*}(x_0) = W_0^{c_0}(x_0). \tag{4.3}$$
Moreover, for any $t = 0, \ldots, T-1$, we have
$$V_t^{c_0}\big(x, \varphi^{c_0,*,t}\big) = V_t^{c_0,*}(x) = W_t^{c_0}(x), \qquad x \in E_X, \tag{4.4}$$
where $\varphi^{c_0,*,t} = (\varphi_t^{c_0,*}, \ldots, \varphi_{T-1}^{c_0,*})$.

Proof. We prove via backward induction in $t = T, \ldots, 0$ that
$$V_t^{c_0,*}(x) = W_t^{c_0}(x), \qquad x \in E_X.$$
First, it is clear that $V_T^{c_0,*}(x) = W_T^{c_0}(x)$ for $x \in E_X$. Taking $t = T-1$, we have
$$V_{T-1}^{c_0,*}(x) = \inf_{\varphi^{T-1} \in \mathcal{U}^{T-1}} \mathbb{E}_{Q^{\varphi^{T-1}}_{c_0, x}}[\ell(Y_T)] = \inf_{u \in U} \mathbb{E}_P\big[\ell\big(G^{c_0}(T-1, x, u, Z_T)\big)\big] = \inf_{u \in U} \mathbb{E}_P\big[W_T^{c_0}\big(G^{c_0}(T-1, x, u, Z_T)\big)\big] = W_{T-1}^{c_0}(x),$$
for $x = (y, P) \in E_X$. Hence, the function $V_{T-1}^{c_0,*}$ is l.s.a., and moreover it is universally measurable. For $t = T-2, \ldots, 1, 0$, assume that $V_{t+1}^{c_0,*}$ is l.s.a. Given a universally measurable function $\varphi_t$, the stochastic kernel $Q(\cdot \mid t, x, \varphi_t(x); c_0)$ is universally measurable on $E_X$ given $E_X$. Therefore, the integrals
$$\int_{E_X} \mathbb{E}_{Q^{\varphi^{t+1}}_{c_0, x'}}[\ell(Y_T)]\, Q(dx' \mid t, x, \varphi_t(x); c_0) \quad \text{and} \quad \int_{E_X} V_{t+1}^{c_0,*}(x')\, Q(dx' \mid t, x, \varphi_t(x); c_0)$$
are well defined, where the first one is justified by Lemma 4.2. By induction we have
$$\begin{aligned}
V_t^{c_0,*}(x) &= \inf_{\varphi^t = (\varphi_t, \varphi^{t+1}) \in \mathcal{U}^t} \int_{E_X} \mathbb{E}_{Q^{\varphi^{t+1}}_{c_0, x'}}[\ell(Y_T)]\, Q(dx' \mid t, x, \varphi_t(x); c_0) \\
&\geq \inf_{\varphi^t = (\varphi_t, \varphi^{t+1}) \in \mathcal{U}^t} \int_{E_X} \inf_{\varphi^{t+1} \in \mathcal{U}^{t+1}} \mathbb{E}_{Q^{\varphi^{t+1}}_{c_0, x'}}[\ell(Y_T)]\, Q(dx' \mid t, x, \varphi_t(x); c_0) \\
&= \inf_{\varphi^t = (\varphi_t, \varphi^{t+1}) \in \mathcal{U}^t} \int_{E_X} V_{t+1}^{c_0,*}(x')\, Q(dx' \mid t, x, \varphi_t(x); c_0) \\
&= \inf_{u \in U} \int_{E_X} W_{t+1}^{c_0}(x')\, Q(dx' \mid t, x, u; c_0) = W_t^{c_0}(x).
\end{aligned}$$
Next, fix $\varepsilon > 0$ and, for any $x \in E_X$, let $\varphi^{t+1,\varepsilon}$ be an $\varepsilon$-optimal control process at time $t+1$, namely,
$$\mathbb{E}_{Q^{\varphi^{t+1,\varepsilon}}_{c_0, x}}[\ell(Y_T)] \leq \inf_{\varphi^{t+1} \in \mathcal{U}^{t+1}} \mathbb{E}_{Q^{\varphi^{t+1}}_{c_0, x}}[\ell(Y_T)] + \varepsilon.$$
We know that $\varphi^{t+1,\varepsilon}$ exists, as shown in Proposition 4.1. Then, we obtain
$$\begin{aligned}
V_t^{c_0,*}(x) &= \inf_{\varphi^t = (\varphi_t, \varphi^{t+1}) \in \mathcal{U}^t} \int_{E_X} \mathbb{E}_{Q^{\varphi^{t+1}}_{c_0, x'}}[\ell(Y_T)]\, Q(dx' \mid t, x, \varphi_t(x); c_0) \\
&\leq \inf_{\varphi^t = (\varphi_t, \varphi^{t+1}) \in \mathcal{U}^t} \int_{E_X} \mathbb{E}_{Q^{\varphi^{t+1,\varepsilon}}_{c_0, x'}}[\ell(Y_T)]\, Q(dx' \mid t, x, \varphi_t(x); c_0) \\
&\leq \inf_{u \in U} \int_{E_X} W_{t+1}^{c_0}(x')\, Q(dx' \mid t, x, u; c_0) + \varepsilon = W_t^{c_0}(x) + \varepsilon.
\end{aligned}$$
Since $\varepsilon$ is arbitrary, we conclude that $V_t^{c_0,*}(x) = W_t^{c_0}(x)$, and this equality holds for all $t = T-1, \ldots, 0$. Equality (4.4) follows immediately.

5 Application: Nonparametric Adaptive Bayesian Utility Maximization

In this section we demonstrate our method in the context of a dynamic optimal portfolio selection problem. Consider a market model consisting of a risk-free asset with a constant interest rate $r$ and a risky asset $\{S_t\}$ with the corresponding log-return from time $t$ to $t+1$ denoted by $Z_{t+1} = \log(S_{t+1}/S_t)$. The dynamics of the wealth process $\{Y_t\}$ produced by a self-financing trading strategy in this market are given by
$$Y_{t+1} = Y_t\big(1 + r + \varphi_t(e^{Z_{t+1}} - 1 - r)\big), \qquad t = 0, \ldots, T-1, \tag{5.1}$$
with initial wealth $Y_0 = y_0$. Hence, the function $f_Y(y, u, z) = y(1 + r + u(e^z - 1 - r))$. Above, $\varphi_t \in U = [0, 1]$ is the proportion of the portfolio wealth invested in the risky asset from time $t$ to $t+1$. In this setup, the wealth process remains non-negative. We postulate
that $Z_t$, $t = 1, \ldots, T$, form an i.i.d. sequence of random variables. Both processes $\{Y_t\}$ and $\{Z_t\}$ are assumed to be observed and, in particular, the true distribution of $\{Z_t\}$ is unknown. Denote by $\mathbb{F}$ the natural filtration generated by $\{Y_t\}$, and consider $\{\varphi_t\}$ as an $\mathbb{F}$-adapted process. The prior on the distribution of $Z_i$ is chosen to be $\mathcal{D}(c_0 A_0)$, where $c_0$ is some positive real number and $A_0$ is a probability measure on $\mathbb{R}$ with full support. Consider the loss function of the form $\ell(y) = \frac{1 - y^{1-\eta}}{1 - \eta}$ with $\eta > 1$. Note that such a function is bounded from below. The adaptive Bayesian control problem at hand is then
$$\inf_{\{\varphi_t\} \in \mathcal{U}} \mathbb{E}_{Q^{\varphi^0}_{c_0, x_0}}\left[\frac{1 - Y_T^{1-\eta}}{1 - \eta}\right], \tag{5.2}$$
where $x_0 = (y_0, A_0)$. This problem is in fact equivalent to
$$\sup_{\{\varphi_t\} \in \mathcal{U}} \mathbb{E}_{Q^{\varphi^0}_{c_0, x_0}}\left[\frac{Y_T^{1-\eta} - 1}{1 - \eta}\right]. \tag{5.3}$$

Therefore, we are maximizing a CRRA utility of terminal wealth with a high risk-aversion coefficient. The corresponding Bellman equations are then written as
$$W_T^{c_0}(x) = \frac{y^{1-\eta} - 1}{1 - \eta}, \tag{5.4}$$
$$W_t^{c_0}(x) = \sup_{u \in U} \mathbb{E}_P\big[W_{t+1}^{c_0}\big(G^{c_0}(t, x, u, Z_{t+1})\big)\big], \tag{5.5}$$
for $x = (y, P) \in E_X$, $u \in U$, $t = 0, \ldots, T-1$.

Note that the main challenge in applying the nonparametric adaptive Bayesian method and solving (5.4)-(5.5) is that we need to regress against probability measures. Indeed, when numerically computing $\mathbb{E}_P\big[W_{t+1}^{c_0}(G^{c_0}(t, x, u, Z_{t+1}))\big]$, we estimate its value through Monte Carlo simulation, and therefore interpolation/extrapolation is required to evaluate $W_{t+1}(\cdot)$. In view of this difficulty, the strategy we propose is to regress against the first $M$ moments of the posterior probability measures instead of against the measures themselves (cf. Section 5.1 below for more discussion). Practically, we face a high dimensional optimization problem for which traditional grid-based methods are extremely inefficient or infeasible. To this end, we employ the machine learning algorithm proposed in [10], which has sound scalability and overcomes the challenges in solving our high dimensional stochastic control problem. To briefly summarize our numerical algorithm: we employ the regression Monte Carlo (RMC) paradigm and Gaussian process (GP) surrogates to recursively compute the optimal strategy $\{\varphi_t\}$ backward in time. The detailed description is presented in the following section.

5.1 Machine Learning Algorithm


The main purpose of this section is to propose a numerical solver for (5.4)-(5.5) in the same spirit as [10]. We begin with discretization of the state space, employing the RMC method to create a stochastic (non-gridded) mesh for the underlying state process. We will also explain how to handle the issue of regressing against probability measures along the way.
One difficulty in discretizing the state space is that the state process $\{Y_t\}$ depends on the unknown control, which prevents direct simulation of $\{Y_t\}$ when we apply the RMC paradigm. Hence, we use the idea of control randomization by generating values $\tilde{u}_t^1, \ldots, \tilde{u}_t^N$, $t = 0, 1, \ldots, T-1$, uniformly in the set $U$ along each of the $N$ sample paths. We also simulate the return process $\{Z_t\}$ according to some sampling measure and obtain $\tilde{Z}_t^1, \ldots, \tilde{Z}_t^N$, $t = 1, \ldots, T$. For each path, we choose the initial wealth $\tilde{y}_0^i$ and the parameter for the Dirichlet prior $\alpha = c_0 \tilde{P}_0^i$, $i = 1, \ldots, N$. Here $c_0$ is some positive real number and $\tilde{P}_0^i$, $i = 1, \ldots, N$, are some probability distributions with full support on $\mathbb{R}$. Then, along each sample path, the processes $\{Y_t\}$ and $\{A_t\}$ are updated according to the mapping $G^{c_0}$ by using the simulated values $\tilde{u}_t^i$ and $\tilde{Z}_{t+1}^i$, $i = 1, \ldots, N$, $t = 0, \ldots, T-1$, and the sample sites $\tilde{x}_t^i = (\tilde{y}_t^i, \tilde{P}_t^i)$, $i = 1, \ldots, N$, $t = 0, \ldots, T$, are obtained.
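As an illustration of this path-generation step, the sketch below is our own minimal Python rendering, not the authors' code: the helper names and the grid discretization of the base measure are ours, while $f_Y$ is the concrete wealth map from (5.1).

```python
import numpy as np

rng = np.random.default_rng(0)

def f_Y(y, u, z, r):
    """Wealth update (5.1): y * (1 + r + u * (exp(z) - 1 - r))."""
    return y * (1.0 + r + u * (np.exp(z) - 1.0 - r))

def simulate_sites(N, T, c0, r, y0, grid, w0, sample_Z):
    """Control randomization: along each of N paths, draw uniform controls in
    U = [0, 1], simulate returns from the sampling measure, and update the
    pair (Y_t, P_t); each measure is kept as weights on a fixed atom grid."""
    y = np.full(N, float(y0))
    w = np.tile(w0, (N, 1))                 # each row approximates P_0
    sites = [(y.copy(), w.copy())]
    for t in range(T):
        u = rng.uniform(0.0, 1.0, N)        # randomized controls
        z = sample_Z(N)                     # returns from the sampling measure
        y = f_Y(y, u, z, r)
        # Dirichlet update f_P: reweight old atoms, add mass at nearest atom
        w *= (c0 + t) / (c0 + t + 1.0)
        idx = np.abs(grid[None, :] - z[:, None]).argmin(axis=1)
        w[np.arange(N), idx] += 1.0 / (c0 + t + 1.0)
        sites.append((y.copy(), w.copy()))
    return sites

# Example with the sizes used later in Section 5.3 (N = 600, T = 30).
grid = np.linspace(-0.3, 0.3, 201)
w0 = np.exp(-0.5 * (grid / 0.07) ** 2); w0 /= w0.sum()
sites = simulate_sites(600, 30, 5.0, 0.02 / 30, 100.0, grid, w0,
                       lambda n: rng.normal(0.004, 0.06, n))
```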
When applying dynamic programming to solve (5.5) at the sampled sites $\tilde{x}_{t-1}^i$, $i = 1, \ldots, N$, $t = 1, \ldots, T-1$, we need to approximate the values of the functions $W_t(G^{c_0}(t-1, \tilde{x}_{t-1}^i, \cdot, \cdot))$, $i = 1, \ldots, N$, $t = 1, \ldots, T-1$. Therefore, a regression model for $W_t$ is needed. The difficulty in constructing such a model is twofold. First, part of the state variable, $A_t$, is a probability measure, which is essentially an infinite-dimensional variable. In the regression, we approximate this variable by its first $M$ moments and regress against the vector of such moments instead. To put it differently, we approximate the state space $E_X = \{(y, P)\}$ by $\tilde{E}_X = \{(y, m_P^1, \ldots, m_P^M)\}$, where $m_P^i$ is the $i$th moment of the probability measure $P$. With a slight abuse of notation, we will use $\tilde{x}_t^i$, $i = 1, \ldots, N$, to denote the $N$ sample sites in the space $\tilde{E}_X$. Second, although this approximation reduces the dimension of the regression problem from infinity to $1 + M$, it still leads to a fairly high dimensional optimization problem. Hence, a numerical approach with good scalability is crucial, for which we follow the methods of [10] by constructing nonparametric approximations of the value functions $W_t$, $t = 1, \ldots, T-1$, via GP surrogates.
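Conveniently, the moment coordinates need not be recomputed from the full measure at every step: since $P_t = \frac{(c_0+t-1)P_{t-1} + \delta_{Z_t}}{c_0+t}$, the $k$th moment satisfies $m_{P_t}^k = \frac{(c_0+t-1)\, m_{P_{t-1}}^k + Z_t^k}{c_0+t}$, so the moment state can itself be updated recursively. A minimal sketch of this recursion (our illustration, assuming the moments of $P_0$ are available):

```python
def update_moments(m_prev, z, c0, t, M):
    """Recursive update of the first M moments of the posterior mean measure:
    m^k_{P_t} = ((c0 + t - 1) * m^k_{P_{t-1}} + z**k) / (c0 + t)."""
    return [((c0 + t - 1) * m_prev[k] + z ** (k + 1)) / (c0 + t)
            for k in range(M)]
```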
To be more specific, we consider a regression model $\widetilde{W}_t$ of $W_t^{c_0}$ such that for any set of inputs the corresponding values of $\widetilde{W}_t$ are jointly normally distributed. Given training data $(\tilde{x}_t^i, W_t^{c_0}(\tilde{x}_t^i))$, $i = 1, \ldots, N$, for any $\tilde{x} \in \tilde{E}_X$ the predicted value $\widetilde{W}_t(\tilde{x})$ is computed as
$$\widetilde{W}_t(\tilde{x}) = \big(k(\tilde{x}, \tilde{x}_t^1), \ldots, k(\tilde{x}, \tilde{x}_t^N)\big)\,[K + \epsilon^2 I]^{-1}\,\big(W_t(\tilde{x}_t^1), \ldots, W_t(\tilde{x}_t^N)\big)^T,$$
where $I$ is the $N \times N$ identity matrix and the entries of $K$ have the form $K_{i,j} = k(\tilde{x}_t^i, \tilde{x}_t^j)$, $i, j = 1, \ldots, N$. The function $k(\cdot, \cdot)$ is the kernel function of the GP model; in this project, we choose the Matern-5/2 family. By estimating the hyperparameters inside $k(\cdot, \cdot)$, we fit the GP surrogate $\widetilde{W}_t$ and use it in (5.5) to compute $\widetilde{W}_{t-1}$. The overall algorithm is as follows:

1. (Assume that $W_t^{c_0}(\cdot)$ and $\varphi_t^{c_0,*}(\cdot)$ are computed at the sampled points, and the GP surrogates $\widetilde{W}_t$ and $\tilde{\varphi}_t$ are fitted.)

2. For time $t$, any $u \in U$, and each of the sample sites $\{\tilde{x}_{t-1}^i, i = 1, \ldots, N\} \subset \tilde{E}_X$, use Monte Carlo simulation to approximate
$$\mathbb{E}_P\big[W_t^{c_0}\big(G^{c_0}(t-1, \tilde{x}_{t-1}^i, u, Z_t)\big)\big]$$
by the sum
$$\widehat{W}_{t-1}(\tilde{x}_{t-1}^i, u) \approx \frac{1}{L} \sum_{j=1}^{L} \widetilde{W}_t\big(G^{c_0}(t-1, \tilde{x}_{t-1}^i, u, Z^j)\big),$$
where $Z^1, \ldots, Z^L$ are generated from the distribution $P$.

3. Solve the optimization problem $W_{t-1}^{c_0}(\tilde{x}_{t-1}^i) = \sup_{u \in U} \widehat{W}_{t-1}(\tilde{x}_{t-1}^i, u)$ and obtain the maximizer $\varphi_{t-1}^{c_0,*}(\tilde{x}_{t-1}^i)$, $i = 1, \ldots, N$.

4. Fit the GP surrogates $\widetilde{W}_{t-1}$ and $\tilde{\varphi}_{t-1}^*$ by using the training data $(\tilde{x}_{t-1}^i, W_{t-1}^{c_0}(\tilde{x}_{t-1}^i))$ and $(\tilde{x}_{t-1}^i, \varphi_{t-1}^{c_0,*}(\tilde{x}_{t-1}^i))$, $i = 1, \ldots, N$, respectively.

5. Go to step 1 and start the next recursion, for $t - 2$.
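For concreteness, one backward step (steps 2-4 above) can be rendered with an off-the-shelf GP library. The sketch below is our illustration, not the authors' implementation: it uses scikit-learn's GaussianProcessRegressor with a Matern ν = 5/2 kernel, a crude grid search over $u$ in place of a proper optimizer, and a hypothetical helper `step_fn` standing in for the moment-space version of $G^{c_0}$.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def backward_step(X_prev, W_next, step_fn, Z_samples, u_grid):
    """One backward recursion over sample sites X_prev (rows = (y, m^1..m^M)):
    estimate the conditional expectation by Monte Carlo through the fitted
    time-t surrogate W_next, maximize over a control grid, then fit fresh
    surrogates for the time-(t-1) value function and optimal control."""
    W_vals, u_vals = [], []
    for x in X_prev:
        best_w, best_u = -np.inf, 0.0
        for u in u_grid:
            nxt = np.array([step_fn(x, u, z) for z in Z_samples])
            w = W_next.predict(nxt).mean()     # (1/L) sum of surrogate values
            if w > best_w:
                best_w, best_u = w, u
        W_vals.append(best_w); u_vals.append(best_u)
    kernel = Matern(nu=2.5)                    # the Matern-5/2 family
    W_fit = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X_prev, W_vals)
    u_fit = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X_prev, u_vals)
    return W_fit, u_fit
```

Fitting a separate surrogate for the maximizer, as in step 4, is what removes the need to re-run the inner optimization on out-of-sample paths.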

To analyze the performance of the optimal control computed from our algorithm, we generate $N'$ out-of-sample paths as follows. We first simulate the random noise $Z_t^i$, $i = 1, \ldots, N'$, $t = 1, \ldots, T$, from the sampling measure, and pick $x_0^i \equiv (y_0, A_0) \in E_X$, $i = 1, \ldots, N'$. For each $x_t^i = (y_t^i, A_t^i)$, we compute the first $M$ moments $(m_{A_t^i}^1, \ldots, m_{A_t^i}^M)$ of $A_t^i$ and use the GP surrogate to estimate the corresponding optimal strategy as $\tilde{\varphi}_t(x_t^i) = \tilde{\varphi}_t^*(y_t^i, m_{A_t^i}^1, \ldots, m_{A_t^i}^M)$. Then we update the state process according to $x_{t+1}^i = G^{c_0}(t, x_t^i, \tilde{\varphi}_t(x_t^i), Z_{t+1}^i)$, $i = 1, \ldots, N'$. Finally, the expected value of the terminal utility is estimated as $\frac{1}{N'} \sum_{i=1}^{N'} \frac{(y_T^i)^{1-\eta} - 1}{1 - \eta}$.

5.2 Other Stochastic Control Methodologies under Model Uncertainty
We will compare the performance of the adaptive Bayesian strategy with the performance
of strategies computed from two other classical stochastic control frameworks under model
uncertainty: strong robust and time consistent adaptive.
The main purpose of the comparison is to show the advantage of the nonparametric
adaptive Bayesian approach to control methods that assume a parametric model for the
underlying random noise. Typically, when an equity investor deals with model uncertainty, she assumes that the log-returns of the underlying stock across the trading periods are i.i.d. normal random variables with unknown mean $\mu$ and variance $\sigma^2$. Then, to apply the strong robust approach, the investor constructs a confidence region $C := \tau(t_0, \hat{\mu}_0, \hat{\sigma}_0^2)$ for $(\mu, \sigma^2)$ based on historical observations, where $t_0$ is the sample size of the historical data and

$(\hat{\mu}_0, \hat{\sigma}_0^2)$ is the estimator of the unknown parameters $(\mu, \sigma^2)$ based on such data. Next, she computes the optimal strong robust strategies by solving the following Bellman equations:
$$W_T^{sr}(y_T) = \frac{y_T^{1-\eta} - 1}{1 - \eta},$$
$$W_t^{sr}(y_t) = \sup_{u \in U} \inf_{(\mu, \sigma^2) \in C} \mathbb{E}_{\mu, \sigma^2}\big[W_{t+1}^{sr}\big(f_Y(y_t, u, Z_{t+1})\big)\big], \qquad t = 0, \ldots, T-1, \tag{5.6}$$
where $\mathbb{E}_{\mu, \sigma^2}$ denotes the expectation computed under the model with parameters $(\mu, \sigma^2)$.


The idea of the strong robust method is to find the worst-case parameters $(\underline{\mu}_t(y_t, u), \underline{\sigma}_t^2(y_t, u)) \in C$ as measurable functions of the state $y_t$ and the trading strategy $u$, and then search for the optimal strategy $\varphi_t^{*,sr}(y_t)$ that performs best under its corresponding worst-case model. For more details about the strong robust methodology, we refer to the studies in [23, 3].
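To contrast with the Bayesian averaging in (3.5), the sup-inf step of (5.6) can be sketched as follows; this is our illustration only, discretizing both the confidence region $C$ and the control set, with `W_next` a fitted continuation-value function:

```python
import numpy as np

def strong_robust_step(y, W_next, C_grid, u_grid, r, rng, L=1000):
    """One sup-inf step of (5.6): for each control u, evaluate the expected
    continuation value under every candidate (mu, sigma^2) in a discretized
    confidence region, keep the worst case, then maximize over u."""
    best_val, best_u = -np.inf, None
    for u in u_grid:
        worst = np.inf
        for mu, sig2 in C_grid:
            z = rng.normal(mu, np.sqrt(sig2), L)
            y_next = y * (1.0 + r + u * (np.exp(z) - 1.0 - r))
            worst = min(worst, W_next(y_next).mean())
        if worst > best_val:
            best_val, best_u = worst, u
    return best_val, best_u
```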

µ̂0 = 4.615 × 10⁻³, σ̂0 = 5.609 × 10⁻²

          | AB, c0=1 | AB, c0=5 | AB, c0=10 | AB, c0=20 | AB, c0=30 | SR        | AD
mean(W)   | 1.8037   | 1.8036   | 1.8036    | 1.8035    | 1.8034    | 1.8020    | 1.8026
var(W)    | 4.295e-4 | 4.835e-4 | 5.421e-4  | 6.187e-4  | 6.653e-4  | 4.917e-14 | 1.162e-3
q0.30(W)  | 1.7919   | 1.7915   | 1.7891    | 1.7872    | 1.7849    | 1.8020    | 1.7841
q0.90(W)  | 1.8352   | 1.8379   | 1.8405    | 1.8425    | 1.8485    | 1.8020    | 1.8483
max(W)    | 1.8721   | 1.8704   | 1.8720    | 1.8718    | 1.8656    | 1.8020    | 1.8783
min(W)    | 1.7711   | 1.7576   | 1.7536    | 1.7632    | 1.7647    | 1.8020    | 1.7123

Table 1: Mean, variance, 30%-quantile, 90%-quantile, maximum, and minimum of the out-of-sample terminal utility for the AB, SR and AD methods; Case 1-1.

Another approach to dealing with model uncertainty is the non-robust adaptive approach based on "learning". Loosely speaking, the investor adapts to her latest belief about the unknown parameters, which is learned in the form of a point estimator, and takes actions based on such belief. At any fixed time point, the controller typically finds the optimal strategy by solving the Bellman equations according to the current parameter estimate across the remaining timeline. Such treatment bears the drawback that the computed strategy is time inconsistent: although the controller knows that she will change her view about the unknown parameters at all future time points, these inevitable future changes are not taken into consideration in the computation of the strategy.
µ̂0 = −3.987 × 10⁻³, σ̂0 = 6.288 × 10⁻²

          | AB, c0=1 | AB, c0=5 | AB, c0=10 | AB, c0=20 | AB, c0=30 | SR        | AD
mean(W)   | 1.8043   | 1.8038   | 1.8041    | 1.8043    | 1.8038    | 1.8020    | 1.8016
var(W)    | 4.092e-4 | 3.350e-4 | 3.356e-4  | 2.701e-4  | 1.768e-4  | 4.917e-14 | 2.020e-5
q0.30(W)  | 1.7940   | 1.7952   | 1.7959    | 1.7972    | 1.7981    | 1.8020    | 1.8006
q0.90(W)  | 1.8362   | 1.8334   | 1.8334    | 1.8262    | 1.8256    | 1.8020    | 1.8050
max(W)    | 1.8702   | 1.8626   | 1.8639    | 1.8598    | 1.8590    | 1.8020    | 1.8146
min(W)    | 1.7594   | 1.7638   | 1.7574    | 1.7665    | 1.7801    | 1.8020    | 1.7715

Table 2: Mean, variance, 30%-quantile, 90%-quantile, maximum, and minimum of the out-of-sample terminal utility for the AB, SR and AD methods; Case 1-2.

Modifications can be made to obtain a time-consistent adaptive control framework. In the context of our investment problem, one will solve the following Bellman equations:
$$W_T^{ad}(y_T, \hat{\mu}_T, \hat{\sigma}_T^2) = \frac{y_T^{1-\eta} - 1}{1 - \eta},$$
$$W_t^{ad}(y_t, \hat{\mu}_t, \hat{\sigma}_t^2) = \sup_{u \in U} \mathbb{E}_{\hat{\mu}_t, \hat{\sigma}_t^2}\big[W_{t+1}^{ad}\big(f_Y(y_t, u, Z_{t+1}), f_\Theta(t, \hat{\mu}_t, \hat{\sigma}_t^2, Z_{t+1})\big)\big], \tag{5.7}$$
$$f_\Theta(t, \hat{\mu}_t, \hat{\sigma}_t^2, Z_{t+1}) = \left(\frac{t\hat{\mu}_t + Z_{t+1}}{t+1},\ \frac{t(t+1)\hat{\sigma}_t^2 + t(\hat{\mu}_t - Z_{t+1})^2}{(t+1)^2}\right),$$
for $t = 0, \ldots, T-1$. The role of the function $f_\Theta$ is to update the estimators $\hat{\mu}_t$ and $\hat{\sigma}_t^2$ based on the new observation $Z_{t+1}$. Recall our earlier discussion of the adaptive robust control approach, and note that the above method is a special case of adaptive robust obtained by replacing the set $\tau(t, \hat{\mu}_t, \hat{\sigma}_t^2)$ with the singleton $\{(\hat{\mu}_t, \hat{\sigma}_t^2)\}$.
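As a quick sanity check on the update rule $f_\Theta$, the sketch below (our illustration) verifies that applying it recursively reproduces the running sample mean and the (biased) sample variance:

```python
import numpy as np

def f_theta(t, mu, sig2, z):
    """Recursive update of the running mean and (biased) variance estimators,
    as in (5.7): given t prior observations, fold in the new observation z."""
    mu_new = (t * mu + z) / (t + 1)
    sig2_new = (t * (t + 1) * sig2 + t * (mu - z) ** 2) / (t + 1) ** 2
    return mu_new, sig2_new

rng = np.random.default_rng(1)
zs = rng.normal(0.004, 0.06, 500)
mu, sig2 = zs[0], 0.0                  # estimators after the first observation
for t, z in enumerate(zs[1:], start=1):
    mu, sig2 = f_theta(t, mu, sig2, z)
print(np.isclose(mu, zs.mean()), np.isclose(sig2, zs.var()))  # True True
```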
We omit the detailed description of the algorithms that compute the optimal strong robust and time-consistent adaptive strategies, as they are direct modifications of the algorithm introduced in Section 5.1 tailored to the Bellman equations (5.6) and (5.7). For comparison, we simulate $t_0$ observations $Z_{-t_0}, \ldots, Z_{-1}$ from the sampling measure and compute $\hat{\mu}_0$, $\hat{\sigma}_0^2$, and $\tau(t_0, \hat{\mu}_0, \hat{\sigma}_0^2)$ from these observations. The robust parameter set for the strong robust approach is defined as $C = \tau(t_0, \hat{\mu}_0, \hat{\sigma}_0^2)$, and the initial guesses of the unknown parameters used in the adaptive approach are chosen as $\hat{\mu}_0$ and $\hat{\sigma}_0^2$. Through the rest of the paper, we denote by $y_T^{ab}$, $y_T^{sr}$, and $y_T^{ad}$ the terminal wealth generated by the adaptive Bayesian, strong robust, and adaptive methods, respectively. The respective optimal strategies are denoted by $\varphi_t^{ab,*}$, $\varphi_t^{sr,*}$, and $\varphi_t^{ad,*}$, $t = 0, \ldots, T-1$. We generate $N'$ paths of the out-of-sample random noise $Z_t^i$, $i = 1, \ldots, N'$, $t = 1, \ldots, T$, from the sampling measure. Then, we estimate the optimal strategies $\varphi_t^{ab,*}(y_t^{ab,i})$, $\varphi_t^{sr,*}(y_t^{sr,i})$, and $\varphi_t^{ad,*}(y_t^{ad,i})$, $t = 0, \ldots, T-1$,
[Figure 1: Path of the nonparametric Bayesian strategy ϕ^{ab} in comparison to strong robust and adaptive; Case 1. Left panel (positive µ): µ̂0 = 4.615 × 10⁻³, σ̂0 = 5.609 × 10⁻². Right panel (negative µ): µ̂0 = −3.987 × 10⁻³, σ̂0 = 6.288 × 10⁻². Curves: Bayesian (c0 = 1), Bayesian (c0 = 30), Adaptive, Robust; ϕt plotted against t = 0, …, 30.]

[Figure 2: Distribution of the out-of-sample terminal utility W^{ab} under the nonparametric Bayesian strategy in comparison to strong robust and adaptive; Case 1. Left panel (positive µ): µ̂0 = 4.615 × 10⁻³, σ̂0 = 5.609 × 10⁻². Right panel (negative µ): µ̂0 = −3.987 × 10⁻³, σ̂0 = 6.288 × 10⁻². Boxes: Bayesian (c0 = 1), Bayesian (c0 = 30), Adaptive, Robust.]
$i = 1, \ldots, N'$, by using the corresponding GP surrogates, and in turn update
$$y_{t+1}^{ab,i} = f_Y\big(y_t^{ab,i}, \varphi_t^{ab,*}(y_t^{ab,i}), Z_{t+1}^i\big),$$
$$y_{t+1}^{sr,i} = f_Y\big(y_t^{sr,i}, \varphi_t^{sr,*}(y_t^{sr,i}), Z_{t+1}^i\big),$$
$$y_{t+1}^{ad,i} = f_Y\big(y_t^{ad,i}, \varphi_t^{ad,*}(y_t^{ad,i}), Z_{t+1}^i\big),$$
for $i = 1, \ldots, N'$ and $t = 0, \ldots, T-1$.

5.3 Numerical Results


In this section, we compare the performance of the adaptive Bayesian, strong robust, and adaptive approaches by analyzing the relevant statistics of
$$W^{ab} := \left(\frac{(y_T^{ab,1})^{1-\eta} - 1}{1-\eta}, \ldots, \frac{(y_T^{ab,N'})^{1-\eta} - 1}{1-\eta}\right),$$
$$W^{sr} := \left(\frac{(y_T^{sr,1})^{1-\eta} - 1}{1-\eta}, \ldots, \frac{(y_T^{sr,N'})^{1-\eta} - 1}{1-\eta}\right),$$
$$W^{ad} := \left(\frac{(y_T^{ad,1})^{1-\eta} - 1}{1-\eta}, \ldots, \frac{(y_T^{ad,N'})^{1-\eta} - 1}{1-\eta}\right).$$

To this end, we choose one unit of time as 1/30 year and $T = 30$. The yearly interest rate is 2%, so that $r = 0.02/30 = 6.667 \times 10^{-4}$. The number of paths of sample sites for solving the Bellman equations is $N = 600$, and the number of out-of-sample paths is $N' = 200$. The initial endowment for investing is $y_0 = 100$. Some other parameters are set as $t_0 = 100$ and $M = 4$.
We consider two cases of bimodal sampling measures. In Case 1, each $Z_t$ comes with 50% chance from the normal distribution $N(\mu_1, \sigma_1^2)$ and with 50% chance from the normal distribution $N(\mu_2, \sigma_2^2)$, where $\mu_1 = -0.02/30 = -6.667 \times 10^{-4}$, $\sigma_1 = 0.4\sqrt{1/30} = 7.303 \times 10^{-2}$, and $\mu_2 = 0.13/30 = 4.333 \times 10^{-3}$, $\sigma_2 = 0.3\sqrt{1/30} = 5.477 \times 10^{-2}$. In Case 2, $\mu_1 = 0.04/30 = 1.333 \times 10^{-3}$, $\sigma_1 = 0.3\sqrt{1/30} = 5.477 \times 10^{-2}$, and $\mu_2 = 0.13/30 = 4.333 \times 10^{-3}$, $\sigma_2 = 0.5\sqrt{1/30} = 9.129 \times 10^{-2}$. The risk-aversion parameter $\eta$ chosen for the computations in Cases 1 and 2 is 1.5 and 1.002, respectively.
For both cases, we randomly generate $(\hat{\mu}_0, \hat{\sigma}_0^2)$. The initial Dirichlet process for adaptive Bayesian is $\mathcal{D}(c_0 P_0)$, where $P_0$ is the normal distribution $N(\hat{\mu}_0, \hat{\sigma}_0^2)$. A strong robust investor assumes that the one-period log-return has a normal distribution $N(\mu, \sigma^2)$, and her robust parameter set $\tau(t_0, \hat{\mu}_0, \hat{\sigma}_0^2)$ is the 80% confidence region centered at $(\hat{\mu}_0, \hat{\sigma}_0^2)$. An adaptive investor also assumes that the model is normal, and the initial guesses for the parameters are $\hat{\mu}_0$ and $\hat{\sigma}_0^2$.

Remark 5.1. Note that in the above setup, the adaptive Bayesian investor also assumes that the model for the one-period log-return is $N(\hat{\mu}_0, \hat{\sigma}_0^2)$ at the starting time. After that, at any time $t > 0$ the model she uses is the weighted average of $N(\hat{\mu}_0, \hat{\sigma}_0^2)$ and the empirical distribution, with respective weights $\frac{c_0}{c_0 + t}$ and $\frac{t}{c_0 + t}$.
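Remark 5.1 also suggests a simple way to sample from the time-$t$ predictive model, which we sketch below for illustration: with probability $c_0/(c_0+t)$ draw from the base distribution $N(\hat{\mu}_0, \hat{\sigma}_0^2)$, otherwise draw uniformly from the observations collected so far.

```python
import numpy as np

def sample_predictive(rng, mu0, sig0, observed, c0, size=1):
    """Draw from the posterior mean measure: a mixture of the base
    distribution N(mu0, sig0^2) and the empirical distribution of the
    observations, with weights c0/(c0+t) and t/(c0+t)."""
    t = len(observed)
    base = rng.normal(mu0, sig0, size)
    if t == 0:
        return base
    emp = rng.choice(np.asarray(observed), size)
    pick_base = rng.random(size) < c0 / (c0 + t)
    return np.where(pick_base, base, emp)

# Example: after two observed log-returns, draw three predictive samples.
z = sample_predictive(np.random.default_rng(0), 0.004, 0.06,
                      [0.012, -0.004], c0=5.0, size=3)
```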

µ̂0 = 6.255 × 10⁻⁴, σ̂0 = 7.090 × 10⁻²

          | AB, c0=1 | AB, c0=5 | AB, c0=10 | AB, c0=20 | AB, c0=30 | SR        | AD
mean(W)   | 4.7079   | 4.7004   | 4.7098    | 4.7096    | 4.7028    | 4.6038    | 4.6918
var(W)    | 0.0912   | 0.0674   | 0.0844    | 0.0867    | 0.0850    | 6.701e-12 | 0.05942
q0.30(W)  | 4.5071   | 4.5209   | 4.5130    | 4.5059    | 4.5005    | 4.6038    | 4.5507
q0.90(W)  | 5.1332   | 5.0739   | 5.1463    | 5.1601    | 5.1691    | 4.6038    | 5.0224
max(W)    | 5.3100   | 5.4023   | 5.5333    | 5.5709    | 5.5200    | 4.6038    | 5.310
min(W)    | 4.1512   | 4.1635   | 4.2018    | 4.2389    | 4.2327    | 4.6038    | 4.0793

Table 3: Mean, variance, 30%-quantile, 90%-quantile, maximum, and minimum of the out-of-sample terminal utility for the AB, SR and AD methods; Case 2-1.

Case 1. In this setup, we randomly generate two sets of values for the initial guess: $(\hat{\mu}_0, \hat{\sigma}_0^2) = (4.615 \times 10^{-3}, 5.609 \times 10^{-2})$ and $(\hat{\mu}_0, \hat{\sigma}_0^2) = (-3.987 \times 10^{-3}, 6.288 \times 10^{-2})$. Then we solve (5.5), (5.6), and (5.7) for these two cases. The resulting strategies from both cases are analyzed on the same set of out-of-sample random noise.
It is worth mentioning that a reasonable choice of the robust parameter set $\tau(t_0, \hat{\mu}_0, \hat{\sigma}_0^2)$ for the strong robust approach will usually lead to trivial solutions. Estimating the mean log-return is notoriously inefficient and slow even if the model is indeed Gaussian. Therefore, by assuming a wrong model in this case, the set $\tau(t_0, \hat{\mu}_0, \hat{\sigma}_0^2)$ chosen at the 80% confidence level is too large, and the worst-case parameter in such a set will result in a strategy that invests nearly all the money in the bank account at all times (cf. Figure 1). We also see this effect in both Tables 1 and 2, as the mean, quantiles, maximum, and minimum values of $W^{sr}$ are all the same and the variance of $W^{sr}$ is almost 0. Such an extremely conservative strategy will produce a relatively higher 30% quantile and minimum value of $W^{sr}$, which are both measures of investment risk.
The adaptive approach chooses the strategy based on the current view of the model parameters, which is heavily affected by the initial guess. Recall that the optimal strategies for both cases of $(\hat{\mu}_0, \hat{\sigma}_0^2)$ are tested on the same set of out-of-sample random noise. Two opposite views of the model parameters lead to strategies that are very different (cf. Figure 1). For positive $\hat{\mu}_0$, the AD strategy is very aggressive, as we observe much higher values of the mean, 90% quantile, and maximum, and much lower values of the 30% quantile and minimum of $W^{ad}$, compared to the case of negative $\hat{\mu}_0$ (see Tables 1 and 2). This means, in general, that the parametric adaptive method is very sensitive to the initial guess and not robust to model misspecification. Especially in a market that is neither bull nor bear, the investor can easily be misled by the initial guess given the relatively small number of observations and trading periods.
For the adaptive Bayesian framework, we test different choices of c0 = 1, 5, 10, 20,

[Figure 3: Path of the nonparametric Bayesian strategy ϕ^{ab} in comparison to strong robust and adaptive; Case 2. Left panel (positive µ): µ̂0 = 6.255 × 10⁻⁴, σ̂0 = 7.090 × 10⁻². Right panel (negative µ): µ̂0 = −8.347 × 10⁻³, σ̂0 = 7.805 × 10⁻². Curves: Bayesian (c0 = 1), Bayesian (c0 = 30), Adaptive, Robust; ϕt plotted against t = 0, …, 30.]

[Figure 4: Distribution of the out-of-sample terminal utility W^{ab} under the nonparametric Bayesian strategy in comparison to strong robust and adaptive; Case 2. Left panel (positive µ): µ̂0 = 6.255 × 10⁻⁴, σ̂0 = 7.090 × 10⁻². Right panel (negative µ): µ̂0 = −8.347 × 10⁻³, σ̂0 = 7.805 × 10⁻². Boxes: Bayesian (c0 = 1), Bayesian (c0 = 30), Adaptive, Robust.]

µ̂0 = −8.347 × 10⁻³, σ̂0 = 7.805 × 10⁻²

          | AB, c0=1 | AB, c0=5 | AB, c0=10 | AB, c0=20 | AB, c0=30 | SR        | AD
mean(W)   | 4.6980   | 4.6860   | 4.6823    | 4.6668    | 4.6483    | 4.6038    | 4.6036
var(W)    | 0.0758   | 0.0640   | 0.0580    | 0.0453    | 0.0310    | 6.701e-12 | 6.943e-4
q0.30(W)  | 4.5139   | 4.5387   | 4.5687    | 4.5929    | 4.5854    | 4.6038    | 4.6038
q0.90(W)  | 5.1097   | 5.0613   | 5.0462    | 5.0009    | 4.8673    | 4.6038    | 4.6049
max(W)    | 5.4234   | 5.3652   | 5.4676    | 5.4587    | 5.3856    | 4.6038    | 4.8089
min(W)    | 4.1833   | 4.2164   | 4.2659    | 4.2580    | 4.2897    | 4.6038    | 4.3850

Table 4: Mean, variance, 30%-quantile, 90%-quantile, maximum, and minimum of the out-of-sample terminal utility for the AB, SR and AD methods; Case 2-2.

As c0 increases, the weight of P0 in the posterior mean of the Dirichlet process
increases correspondingly. Hence the optimal strategy ϕ^{ab,∗}_t for c0 = 30 lies between ϕ^{ab,∗}_t
for c0 = 1 and the AD strategy ϕ^{ad,∗}_t. We also observe in Figure 2 that the distribution
of W^{ab} converges to that of W^{ad} as c0 becomes large. A higher weight of P0 will in
theory reduce the possibility of overfitting and prevent the learning of the underlying model
from picking up too much market noise, especially at early time stages. The weight c0 can also
be seen as a tuning parameter for the purpose of risk management: in Table 2, when the initial view
of the market is “pessimistic” (negative initial guess of the mean log-return), a larger c0
makes the investment strategy more conservative, and we observe that the 30% quantile and
minimum value of W^{ab} increase with respect to c0. Accordingly, the 90% quantile and maximum
value decrease, since the conservative strategy is less likely to take advantage of increases
in the stock price. In Table 1, where the initial view of the market is “optimistic” (positive
initial guess of the mean log-return), a larger c0 makes the strategy more aggressive and less
risk averse. The variance and 90% quantile of W^{ab} become larger and the 30% quantile becomes
smaller for higher c0. Interestingly, max(W^{ab}) decreases and min(W^{ab}) increases in this
case. To understand this, note that in Figure 1 the AD strategy presents a “mirror” effect:
it decreases when µ̂0 > 0 and increases when µ̂0 < 0 for time steps that are
close to T. This means that, by assuming a Gaussian model for the log-return, the AD
strategy converges to a level between 0 and 1 on any Z-path; hence, for large
c0, the AB strategy demonstrates the effect of this convergence. On the other hand,
this observation signals a warning about model misspecification: by mistakenly assuming a
Gaussian log-return, the AD strategy for µ̂0 < 0 starts to invest money in the risky asset
when the market is bad and other investors are reducing their shares of the stock. In the
nonparametric Bayesian framework, a correction is imposed: in Table 1, a large enough
c0 will eventually make q0.90(W^{ab}) higher than q0.90(W^{ar}), and in Table 2 it will make
min(W^{ab}) higher than min(W^{ar}).
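
The role of c0 can be made explicit with a short sketch. Under a Dirichlet process prior DP(c0, P0), the posterior mean measure after observing z1, . . . , zt is the mixture (c0 P0 + Σ_{s≤t} δ_{z_s})/(c0 + t), so c0 is precisely the prior weight discussed above. The sampler below, with a hypothetical Gaussian P0 and simulated data, illustrates how the predictive mean is pulled toward P0 as c0 grows.

```python
import numpy as np

# Sampling from the Dirichlet-process posterior mean measure
# (c0 * P0 + sum of Dirac masses at the observations) / (c0 + t).
rng = np.random.default_rng(1)
z_obs = rng.normal(-0.005, 0.25, size=10)         # hypothetical observed log-returns

def posterior_mean_sample(c0, P0_sampler, z_obs, n=100_000):
    """With probability c0/(c0+t) draw from the prior mean P0,
    otherwise draw uniformly from the observed data points."""
    t = len(z_obs)
    from_prior = rng.random(n) < c0 / (c0 + t)
    return np.where(from_prior, P0_sampler(n), rng.choice(z_obs, size=n))

P0 = lambda n: rng.normal(4.615e-3, 0.25, size=n)  # assumed "optimistic" Gaussian P0
for c0 in (1, 5, 10, 20, 30):
    m = posterior_mean_sample(c0, P0, z_obs).mean()
    print(c0, round(m, 4))  # larger c0 pulls the predictive mean toward that of P0
```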

Figure 5: Learning paths of µ̂ by the adaptive approach; left panel Case 1, right panel
Case 2. [Plot not reproduced.]

Amongst these three methods, it is not surprising that AB produces a lower 30% quantile
and minimum value, and a higher variance, 90% quantile, and maximum value of the terminal
utility than conservative strategies generate; this comparison is reversed when AB is set
against an aggressive methodology. Nevertheless, the estimated mean of the terminal utility
produced by AB is always higher than the ones given by SR and AD. These numbers show
that AB can be viewed as the preferred approach among the three. Finally, we stress that
in both the “optimistic” and “pessimistic” cases the numbers generated by AB are similar,
which confirms that the methodology is robust to model misspecification and to randomness
in the observed data.
Case 2. In this setup, we randomly generate two sets of values for the initial guess: (µ̂0, σ̂0^2) =
(6.255 × 10^{-4}, 7.090 × 10^{-2}) and (µ̂0, σ̂0^2) = (−8.347 × 10^{-3}, 7.805 × 10^{-2}). Then
we solve (5.5), (5.6), and (5.7) for these two cases. The resulting strategies from both cases
are analyzed on the same set of out-of-sample random noise.
The numerical results we obtain are quite similar to those of Case 1. One significant difference
is that for µ̂0 > 0 the AD strategy produces a lower var(W) than AB, and as
c0 increases, the variance of the terminal utility from the AB strategy decreases. This is
also confirmed by the box plots in Figure 4. In general, a lower variance of the terminal
wealth/utility results from a conservative strategy, and indeed we see from Figure 3 that the
AD strategy is not necessarily more aggressive than AB in this case. This phenomenon is
explained by Figure 5: in Case 2 the initial guess µ̂0, despite being positive, is smaller than
µ̂0 in Case 1. Therefore, many paths of µ̂t go below 0, which is the driving force behind the
AD strategy becoming conservative even though the initial guess is somewhat optimistic. In
any case, the AB method still produces the highest mean terminal utility, and hence it is the
preferred approach compared to the other two.

6 Conclusion
We have developed a nonparametric Bayesian approach to stochastic control problems
under model uncertainty. Our motivation comes from the multiple desirable features
of the Dirichlet process and the aim of avoiding the model misspecification inherent in
parametric model assumptions. By augmenting the state variable with the Bayesian posterior
mean, we integrate optimization and online learning when the distribution of the underlying
random process is unknown. We prove the necessary regularity of the relevant functions of the
augmented state variable so that the nonparametric adaptive Bayesian control problem can be
solved by dynamic programming and a measurable optimal control exists. The resulting
case study provides new insights into the interaction between the prior mean and adaptive
learning in the context of the utility maximization problem. The nonparametric framework is
robust to random perturbations in the learning process, and the weight of the prior
mean in the dynamic learning can be used as a tuning parameter for the purpose of risk
management.
In order to make the proposed framework numerically feasible, we develop an algorithm
based on machine learning techniques that utilize Gaussian process surrogates. Following
the idea introduced in [10], we build multiple surrogates for different pieces of the
Bellman recursion, not only for the value function but also for the feedback control. To handle
the infinite-dimensional state space associated with the nonparametric learning process,
we map each distribution to the corresponding vector of moments, thereby reducing the dimension
of the state space. Alternatively, it is possible to modify the kernel function of the Gaussian
process surrogate so that it evaluates the distance between probability distributions directly.
Further investigation of this proposal, as well as the study of extending our approach to the case
of multi-dimensional distributions, is deferred to future research.
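
To make the dimension-reduction step concrete, the sketch below (illustrative only; function names and hyperparameters are ours, not the paper's implementation) maps a measure, represented by samples, to its first few moments and compares two such vectors with a standard squared-exponential kernel; this is the type of finite-dimensional input fed to the Gaussian process surrogates.

```python
import numpy as np

def moment_features(samples, k=4):
    """Map a sample-based measure to its mean and central moments of order 2..k."""
    m1 = samples.mean()
    return np.array([m1] + [np.mean((samples - m1) ** order)
                            for order in range(2, k + 1)])

def sq_exp_kernel(x, y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on moment vectors (illustrative hyperparameters)."""
    return variance * np.exp(-0.5 * np.sum((x - y) ** 2) / lengthscale ** 2)

rng = np.random.default_rng(2)
P = rng.normal(0.00, 0.25, size=500)   # two hypothetical posterior mean measures,
Q = rng.normal(0.01, 0.30, size=500)   # each represented by a sample cloud
print(sq_exp_kernel(moment_features(P), moment_features(Q)))
```

The distribution-aware alternative mentioned above would instead build the kernel from a distance between the measures themselves (for instance, a maximum mean discrepancy), avoiding the choice of which moments to keep at the cost of more expensive kernel evaluations.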

References
[1] T. Başar and P. Bernhard, H^∞-optimal control and related minimax design problems, Systems & Control: Foundations & Applications, Birkhäuser Boston, Inc., Boston, MA, second ed., 1995. A dynamic game approach.

[2] N. Bäuerle and U. Rieder, Markov Decision Processes with Applications to Finance, Universitext, Springer-Verlag Berlin Heidelberg, 2011.

[3] E. Bayraktar, A. Cosso, and H. Pham, Robust feedback switching control: Dynamic programming and viscosity solutions, SIAM Journal on Control and Optimization, 54 (2016), pp. 2594–2628.

[4] E. Bayraktar and Y. Zhang, Fundamental theorem of asset pricing under transaction costs and model uncertainty, Mathematics of Operations Research, 41 (2016), pp. 1039–1054.

[5] D. P. Bertsekas and S. Shreve, Stochastic Optimal Control: The Discrete-Time Case, Academic Press, 1978.

[6] T. Bielecki, T. Chen, and I. Cialenco, Recursive construction of confidence regions, Electron. J. Statist., 11 (2017), pp. 4674–4700.

[7] T. Bielecki, T. Chen, I. Cialenco, A. Cousin, and M. Jeanblanc, Adaptive robust control under model uncertainty, SIAM J. Control Optim., 57 (2019).

[8] T. R. Bielecki, T. Chen, and I. Cialenco, Time-inconsistent Markovian control problems under model uncertainty with application to the mean-variance portfolio selection, Submitted for publication, (2020).

[9] B. Bouchard and M. Nutz, Arbitrage and duality in nondominated discrete-time models, Ann. Appl. Probab., 25 (2015), pp. 823–859.

[10] T. Chen and M. Ludkovski, A machine learning approach to adaptive robust utility maximization and hedging, Preprint, (2019).

[11] S. Cohen, Uncertainty and filtering of hidden Markov models in discrete time, Preprint, (2018).

[12] R. Cont, Model uncertainty and its impact on the pricing of derivative instruments, Mathematical Finance, 16 (2006), pp. 519–547.

[13] T. Ferguson, A Bayesian analysis of some nonparametric problems, The Annals of Statistics, 1 (1973), pp. 209–230.

[14] T. Ferguson, Prior distributions on spaces of probability measures, The Annals of Statistics, 2 (1974), pp. 615–629.

[15] I. Gilboa and D. Schmeidler, Maxmin expected utility with nonunique prior, J. Math. Econom., 18 (1989), pp. 141–153.

[16] L. P. Hansen, T. J. Sargent, G. Turmuhambetova, and N. Williams, Robust control and model misspecification, J. Econom. Theory, 128 (2006), pp. 45–90.

[17] L. P. Hansen and T. J. Sargent, Robustness, Princeton University Press, 2008.

[18] F. H. Knight, Risk, Uncertainty and Profit, Houghton Mifflin, 1921. Reprinted by Dover, 2006.

[19] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control, Prentice Hall, Inc., 2015.

[20] A. E. B. Lim, G. J. Shanthikumar, and Z.-J. M. Shen, Model uncertainty, robust optimization and learning, Tutorials in Operations Research, INFORMS (2006), pp. 66–94.

[21] A. Y. Lo, On a class of Bayesian nonparametric estimates: I. Density estimates, The Annals of Statistics, 12 (1984), pp. 351–357.

[22] U. Rieder, Bayesian dynamic programming, Adv. Appl. Prob., 7 (1975), pp. 330–348.

[23] M. Sirbu, A note on the strong formulation of stochastic control problems with model uncertainty, Electronic Communications in Probability, 19 (2014).

