Bayesian Reinforcement Learning
Nikos Vlassis, Mohammad Ghavamzadeh, Shie Mannor, and Pascal Poupart
Abstract This chapter surveys recent lines of work that use Bayesian techniques for
reinforcement learning. In Bayesian learning, uncertainty is expressed by a prior
distribution over unknown parameters and learning is achieved by computing a
posterior distribution based on the data observed. Hence, Bayesian reinforcement
learning distinguishes itself from other forms of reinforcement learning by explic-
itly maintaining a distribution over various quantities such as the parameters of the
model, the value function, the policy or its gradient. This yields several benefits: a)
domain knowledge can be naturally encoded in the prior distribution to speed up
learning; b) the exploration/exploitation tradeoff can be naturally optimized; and c)
notions of risk can be naturally taken into account to obtain robust policies.
Nikos Vlassis
(1) Luxembourg Centre for Systems Biomedicine, University of Luxembourg, and (2) OneTree Luxembourg, e-mail: [email protected], [email protected]
Mohammad Ghavamzadeh
INRIA, e-mail: [email protected]
Shie Mannor
Technion, e-mail: [email protected]
Pascal Poupart
University of Waterloo, e-mail: [email protected]

1 Introduction
Bayesian reinforcement learning has a long history: the first reinforcement learning problems studied by the Operations Research community in the 1950s and 1960s were treated from a Bayesian perspective, starting with Bellman's work on the sequential design of experiments and adaptive control processes (Bellman, 1956; Bellman and Kalaba, 1959; Bellman, 1961). This work was then generalized to multi-state sequential decision problems with unknown transition probabilities and rewards (Silver, 1963; Cozzolino, 1964; Cozzolino et al, 1965). The book "Bayesian Decision Problems and Markov Chains" by Martin (1967) gives a good overview of the work of that era. At the time, reinforcement learning was known as adaptive control processes and then Bayesian adaptive control.
Since Bayesian learning meshes well with decision theory, Bayesian techniques
are natural candidates to simultaneously learn about the environment while making
decisions. The idea is to treat the unknown parameters as random variables and to
maintain an explicit distribution over these variables to quantify the uncertainty. As
evidence is gathered, this distribution is updated and decisions can be made simply
by integrating out the unknown parameters.
In contrast to traditional reinforcement learning techniques that typically learn
point estimates of the parameters, the use of an explicit distribution permits a quan-
tification of the uncertainty that can speed up learning and reduce risk. In particu-
lar, the prior distribution allows the practitioner to encode domain knowledge that
can reduce the uncertainty. For most real-world problems, reinforcement learning
from scratch is intractable since too many parameters would have to be learned if
the transition, observation and reward functions are completely unknown. Hence,
by encoding domain knowledge in the prior distribution, the amount of interaction
with the environment to find a good policy can be reduced significantly. Further-
more, domain knowledge can help avoid catastrophic events that would have to be
learned by repeated trials otherwise. An explicit distribution over the parameters
also provides a quantification of the uncertainty that is very useful to optimize the
exploration/exploitation tradeoff. The choice of action is typically made to maximize future rewards based on the current estimate of the model (exploitation); however, there is also a need to explore the uncertain parts of the model in order to refine it
and earn higher rewards in the future. Hence, the quantification of this uncertainty
by an explicit distribution becomes very useful. Similarly, an explicit quantification
of the uncertainty of the future returns can be used to minimize variance or the risk
of low rewards.
The chapter is organized as follows. Section 2 describes Bayesian techniques for
model-free reinforcement learning where explicit distributions over the parameters
of the value function, the policy or its gradient are maintained. Section 3 describes
Bayesian techniques for model-based reinforcement learning, where the distribu-
tions are over the parameters of the transition, observation and reward functions. Fi-
nally, Section 4 describes Bayesian techniques that take into account the availability
of finitely many samples to obtain sample complexity bounds and for optimization
under uncertainty.
2 Model-Free Bayesian Reinforcement Learning
Model-free RL methods are those that do not explicitly learn a model of the sys-
tem and only use sample trajectories obtained by direct interaction with the system.
Model-free techniques are often simpler to implement since they do not require any
data structure to represent a model nor any algorithm to update this model. However,
it is often more complicated to reason about model-free approaches since it is not
always obvious how sample trajectories should be used to update an estimate of the
optimal policy or value function. In this section, we describe several Bayesian tech-
niques that treat the value function or policy gradient as random objects drawn from
a distribution. More specifically, Section 2.1 describes approaches to learn distri-
butions over Q-functions, Section 2.2 considers distributions over policy gradients
and Section 2.3 shows how distributions over value functions can be used to infer
distributions over policy gradients in actor-critic algorithms.
2.1 Value Function Algorithms

Value-function based RL methods search in the space of value functions to find the optimal value (action-value) function, and then use it to extract an optimal policy. In
this section, we study two Bayesian value-function based RL algorithms: Bayesian
Q-learning (Dearden et al, 1998) and Gaussian process temporal difference learn-
ing (Engel et al, 2003, 2005a; Engel, 2005). The first algorithm caters to domains
with discrete state and action spaces while the second algorithm handles continuous
state and action spaces.
In Bayesian Q-learning (BQL) (Dearden et al, 1998), the discounted return D(s, a) of executing action a in state s is assumed to be Normally distributed with unknown mean µ(s, a) and precision τ(s, a), and a Normal-Gamma distribution is maintained over these unknowns for each state-action pair and updated as returns are observed. Since the posterior does not have a closed form due to the integral, it is approximated by finding the closest Normal-Gamma distribution by minimizing KL-divergence.
At run-time, it is very tempting to select the action with the highest expected Q-value (i.e., $a^* = \arg\max_a E[Q(s,a)]$); however, this strategy does not ensure exploration. To address this, Dearden et al (1998) proposed to add an exploration bonus to
the expected Q-values that estimates the myopic value of perfect information (VPI).
If exploration leads to a policy change, then the gain in value should be taken into
account. Since the agent does not know in advance the effect of each action, VPI is
computed as an expected gain
$$VPI(s,a) = \int_{-\infty}^{\infty} \mathrm{Gain}_{s,a}(x)\, \Pr\big(Q(s,a)=x\big)\, dx \qquad (1)$$
where the gain corresponds to the improvement induced by learning the exact Q-value (denoted by $q_{s,a}$) of the action executed:
$$\mathrm{Gain}_{s,a}(q_{s,a}) = \begin{cases} q_{s,a} - E[Q(s,a_1)] & \text{if } a \neq a_1 \text{ and } q_{s,a} > E[Q(s,a_1)] \\ E[Q(s,a_2)] - q_{s,a} & \text{if } a = a_1 \text{ and } q_{s,a} < E[Q(s,a_2)] \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
There are two cases: either a is revealed to have a higher Q-value than the action a_1 with the highest expected Q-value, or the action a_1 with the highest expected Q-value is revealed to have a lower Q-value than the action a_2 with the second highest expected Q-value.
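As an illustration, the sketch below computes the VPI bonus of Eqs. 1-2 for a single state under the simplifying assumption that the marginal posterior over each Q(s, a) is Gaussian with known mean and standard deviation (BQL's Normal-Gamma posterior actually induces Student-t marginals, so this is only an approximation); the function and variable names are ours, not part of the original algorithm.

```python
import numpy as np
from scipy.stats import norm

def vpi_gaussian(means, stds):
    """Myopic value of perfect information (Eqs. 1-2), one entry per action,
    assuming Q(s, a) ~ N(means[a], stds[a]^2) under the current posterior."""
    order = np.argsort(means)[::-1]
    a1, a2 = order[0], order[1]        # actions with highest / second-highest E[Q]
    q1, q2 = means[a1], means[a2]
    vpi = np.zeros(len(means))
    for a, (m, s) in enumerate(zip(means, stds)):
        if a == a1:
            # gain only if the presumed-best action turns out worse than the runner-up:
            # E[(q2 - Q)^+] for a Gaussian Q
            z = (q2 - m) / s
            vpi[a] = (q2 - m) * norm.cdf(z) + s * norm.pdf(z)
        else:
            # gain only if action a turns out better than the presumed-best action:
            # E[(Q - q1)^+]
            z = (m - q1) / s
            vpi[a] = (m - q1) * norm.cdf(z) + s * norm.pdf(z)
    return vpi

# action selection with the exploration bonus of Dearden et al (1998):
means = np.array([1.0, 0.8, 0.2])
stds = np.array([0.1, 0.5, 1.0])
action = int(np.argmax(means + vpi_gaussian(means, stds)))
```

Note how an action with a low expected Q-value but large posterior uncertainty can still be selected, which is exactly the intended exploratory effect.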
Bayesian Q-learning (BQL) maintains a separate distribution over D(s, a) for each (s, a)-pair; thus, it cannot be used for problems with continuous state or action spaces. Engel et al (2003, 2005a) proposed a natural extension that uses Gaussian processes. As in BQL, D(s, a) is assumed to be Normal with mean µ(s, a) and precision τ(s, a). However, instead of maintaining a Normal-Gamma over µ and τ simultaneously, only a Gaussian over µ is modeled. Since µ(s, a) = Q(s, a) and the main quantity that we want to learn is the Q-function, it suffices to maintain a belief
only about the mean. To accommodate infinite state and action spaces, a Gaussian
process is used to model infinitely many Gaussians over Q(s, a) for each (s, a)-pair.
A Gaussian process (e.g., Rasmussen and Williams 2006) is the extension of the
multivariate Gaussian distribution to infinitely many dimensions or equivalently,
corresponds to infinitely many correlated univariate Gaussians. Gaussian processes
GP(µ, k) are parameterized by a mean function µ(x) and a kernel function k(x, x′), which are the limits of the mean vector and covariance matrix of multivariate Gaussians as the number of dimensions becomes infinite. Gaussian processes are often
used for functional regression based on sampled realizations of some unknown un-
derlying function.
Along those lines, Engel et al (2003, 2005a) proposed a Gaussian Process Tem-
poral Difference (GPTD) approach to learn the Q-function of a policy based on
samples of discounted sums of returns. Recall that the distribution of the sum of
discounted rewards for a fixed policy π is defined recursively as follows:
$$D(z) = r(z) + \gamma D(z'), \qquad (3)$$
where z′ is the successor of z under policy π. When z refers to states then E[D] = V and when it refers to state-action pairs then E[D] = Q. Unless otherwise specified, we will assume that z = (s, a). We can decompose D as the sum of its mean Q and a zero-mean noise term ∆Q, which will allow us to place a distribution directly over Q later on. Replacing D(z) by Q(z) + ∆Q(z) in Eq. 3 and grouping the ∆Q terms into a single zero-mean noise term N(z, z′) = ∆Q(z) − γ∆Q(z′), we obtain
$$r(z) = Q(z) - \gamma Q(z') + N(z, z'). \qquad (4)$$
The GPTD learning model (Engel et al, 2003, 2005a) is based on the statistical
generative model in Eq. 4 that relates the observed reward signal r to the unobserved
action-value function Q. Now suppose that we observe the sequence z0 , z1 , . . . , zt ,
then Eq. 4 leads to a system of t equations that can be expressed in matrix form as
$$r_{t-1} = H_t\, Q_t + N_t, \qquad (5)$$
where
$$r_t = \big(r(z_0), \ldots, r(z_t)\big)^\top, \quad Q_t = \big(Q(z_0), \ldots, Q(z_t)\big)^\top, \quad N_t = \big(N(z_0,z_1), \ldots, N(z_{t-1},z_t)\big)^\top, \qquad (6)$$
$$H_t = \begin{pmatrix} 1 & -\gamma & 0 & \cdots & 0 \\ 0 & 1 & -\gamma & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -\gamma \end{pmatrix}. \qquad (7)$$
If we assume that the residuals ∆Q(z_0), . . . , ∆Q(z_t) are zero-mean Gaussians with variance σ², and moreover, each residual is generated independently of all the others, i.e., E[∆Q(z_i)∆Q(z_j)] = 0 for i ≠ j, it is easy to show that the noise vector N_t is Gaussian with mean 0 and covariance matrix
$$\Sigma_t = \sigma^2 H_t H_t^\top = \sigma^2 \begin{pmatrix} 1+\gamma^2 & -\gamma & 0 & \cdots & 0 \\ -\gamma & 1+\gamma^2 & -\gamma & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & -\gamma & 1+\gamma^2 \end{pmatrix}. \qquad (8)$$
In episodic tasks, if z_{t−1} is the last state-action pair in the episode (i.e., s_t is a zero-reward absorbing terminal state), H_t becomes a square t × t invertible matrix of the form shown in Eq. 7 with its last column removed. The effect on the noise covariance matrix Σ_t is that the bottom-right element becomes 1 instead of 1 + γ².
Placing a GP prior GP(0, k) on Q, we may use Bayes' rule to obtain the posterior mean Q̂_t and covariance Ŝ_t of the Gaussian process on Q:
$$\hat Q_t(z) = E[Q(z)\,|\,\mathcal{D}_t] = k_t(z)^\top \alpha_t, \qquad \hat S_t(z,z') = \mathrm{Cov}[Q(z),Q(z')\,|\,\mathcal{D}_t] = k(z,z') - k_t(z)^\top C_t\, k_t(z'), \qquad (9)$$
where D_t denotes the observed data up to and including time step t. We used here the following definitions:
$$k_t(z) = \big(k(z_0,z), \ldots, k(z_t,z)\big)^\top, \qquad K_t = \big[k_t(z_0), k_t(z_1), \ldots, k_t(z_t)\big],$$
$$\alpha_t = H_t^\top \big(H_t K_t H_t^\top + \Sigma_t\big)^{-1} r_{t-1}, \qquad C_t = H_t^\top \big(H_t K_t H_t^\top + \Sigma_t\big)^{-1} H_t. \qquad (10)$$
As more samples are observed, the posterior covariance decreases, reflecting a grow-
ing confidence in the Q-function estimate Q̂t .
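The following batch sketch assembles the quantities of Eqs. 5-10 directly; it is meant only to make the notation concrete. Engel et al's algorithms compute α_t and C_t recursively and online, and the kernel, data layout, and names below are our own choices.

```python
import numpy as np

def gptd_posterior(Z, rewards, kernel, gamma, sigma2, episodic=False):
    """Batch GPTD posterior (Eqs. 5-10).
    Z       : list of t+1 visited state-action pairs z_0, ..., z_t
    rewards : array of the t rewards r(z_0), ..., r(z_{t-1})
    kernel  : prior covariance function k(z, z') of Q
    Returns alpha_t, C_t and a function giving the posterior mean/variance of Q(z)."""
    t = len(rewards)
    K = np.array([[kernel(zi, zj) for zj in Z] for zi in Z])   # (t+1, t+1) Gram matrix
    H = np.zeros((t, t + 1))
    H[np.arange(t), np.arange(t)] = 1.0                        # rows of the form (1, -gamma)
    H[np.arange(t), np.arange(t) + 1] = -gamma
    if episodic:            # absorbing terminal state: drop the last column (and Q(z_t))
        H, K, Z = H[:, :-1], K[:-1, :-1], Z[:-1]
    Sigma = sigma2 * H @ H.T                                   # noise covariance (Eq. 8)
    G = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ G @ rewards                                  # Eq. 10
    C = H.T @ G @ H
    def posterior(z):
        k_vec = np.array([kernel(zi, z) for zi in Z])
        return k_vec @ alpha, kernel(z, z) - k_vec @ C @ k_vec  # Q-hat_t(z), S-hat_t(z, z)
    return alpha, C, posterior
```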
The GPTD model described above is kernel-based and non-parametric. It is also
possible to employ a parametric representation under very similar assumptions. In
the parametric setting, the GP Q is assumed to consist of a linear combination of a
finite number of basis functions: Q(·, ·) = φ (·, ·)>W , where φ is the feature vector
and W is the weight vector. In the parametric GPTD, the randomness in Q is due
to W being a random vector. In this model, we place a Gaussian prior over W and
apply Bayes’ rule to calculate the posterior distribution of W conditioned on the
observed data. The posterior mean and covariance of Q may be easily computed by
multiplying the posterior moments of W with the feature vector φ . See Engel (2005)
for more details on parametric GPTD.
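In the parametric case the same generative model reduces to Bayesian linear regression on the weights W. The sketch below assumes a standard normal prior W ∼ N(0, I) and a known feature map; it is our paraphrase of the model, not Engel's (2005) exact derivation.

```python
import numpy as np

def parametric_gptd_posterior(Phi, rewards, gamma, sigma2):
    """Posterior over W in the model r_{t-1} = H_t Phi W + N_t, with prior W ~ N(0, I).
    Phi     : (t+1, n) matrix whose rows are phi(z_0), ..., phi(z_t)
    rewards : array of the t observed rewards."""
    t, n = len(rewards), Phi.shape[1]
    H = np.zeros((t, t + 1))
    H[np.arange(t), np.arange(t)] = 1.0
    H[np.arange(t), np.arange(t) + 1] = -gamma
    A = H @ Phi                                    # (t, n) effective design matrix
    Sigma_inv = np.linalg.inv(sigma2 * H @ H.T)    # inverse of the noise covariance (Eq. 8)
    cov_W = np.linalg.inv(np.eye(n) + A.T @ Sigma_inv @ A)
    mean_W = cov_W @ A.T @ Sigma_inv @ rewards
    return mean_W, cov_W
```

The posterior mean and variance of Q at any z are then `phi(z) @ mean_W` and `phi(z) @ cov_W @ phi(z)`; a recursive implementation of the same computation gives the O(n²)-per-sample updates mentioned next.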
In the parametric case, the computation of the posterior may be performed online in O(n²) time per sample and O(n²) memory, where n is the number of basis functions used to approximate Q. In the non-parametric case, we have a new basis function for each new sample we observe, making the cost of adding the t-th sample O(t²) in both time and memory. This would seem to make the non-parametric form
of GPTD computationally infeasible except in small and simple problems. However,
the computational cost of non-parametric GPTD can be reduced by using an online
sparsification method (e.g., Engel et al 2002), to a level that it can be efficiently
implemented online.
The choice of the prior distribution may significantly affect the performance of
GPTD. However, in the standard GPTD, the prior is set at the beginning and remains
unchanged during the execution of the algorithm. Reisinger et al (2008) developed
an online model selection method for GPTD using sequential MC techniques, called
replacing-kernel RL, and empirically showed that it yields better performance than
the standard GPTD for many different kernel families.
Finally, the GPTD model can be used to derive a SARSA-type algorithm, called GPSARSA (Engel et al, 2005a; Engel, 2005), in which state-action values are estimated using GPTD and policies are improved by an ε-greedy strategy while slowly decreasing ε toward 0. The GPTD framework, especially the GPSARSA algorithm,
has been successfully applied to large scale RL problems such as the control of an
octopus arm (Engel et al, 2005b) and wireless network association control (Aharony
et al, 2005).
2.2 Bayesian Policy Gradient

Policy gradient (PG) methods update the policy parameters in the direction of an estimate of the gradient of a performance measure; several authors proposed to replace the conventional gradient estimate with an estimate of the so-called natural policy gradient (Kakade, 2002; Bagnell and Schneider, 2003; Peters et al, 2003). In terms of the policy update rule, the move to a natural-gradient rule amounts to linearly transforming the gradient using the inverse
Fisher information matrix of the policy. In empirical evaluations, natural PG has
been shown to significantly outperform conventional PG (Kakade, 2002; Bagnell
and Schneider, 2003; Peters et al, 2003; Peters and Schaal, 2008).
However, both conventional and natural policy gradient methods rely on Monte-
Carlo (MC) techniques in estimating the gradient of the performance measure. Al-
though MC estimates are unbiased, they tend to suffer from high variance, or al-
ternatively, require excessive sample sizes (see O’Hagan, 1987 for a discussion). In
the case of policy gradient estimation this is exacerbated by the fact that consistent
policy improvement requires multiple gradient estimation steps. O’Hagan (1991)
proposes a Bayesian alternative to MC estimation of an integral, called Bayesian quadrature (BQ). The idea is to model integrals of the form $\int dx\, f(x)\, g(x)$ as random quantities. This is done by treating the first term in the integrand, f, as a random function over which we express a prior in the form of a Gaussian process (GP). Observing (possibly noisy) samples of f at a set of points $\{x_1, x_2, \ldots, x_M\}$ allows us to employ Bayes' rule to compute a posterior distribution of f conditioned on these samples. This, in turn, induces a posterior distribution over the value of the
integral. Rasmussen and Ghahramani (2003) experimentally demonstrated how this
approach, when applied to the evaluation of an expectation, can outperform MC es-
timation by orders of magnitude, in terms of the mean-squared error. Interestingly,
BQ is often effective even when f is known. The posterior of f can be viewed as an
approximation of f (that converges to f in the limit), but this approximation can be
used to perform the integration in closed form. In contrast, MC integration uses the
exact f , but only at the points sampled. So BQ makes better use of the information
provided by the samples by using the posterior to “interpolate” between the samples
and by performing the integration in closed form.
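A minimal sketch of BQ for a one-dimensional integral ∫ f(x) g(x) dx: f gets a GP prior, and the kernel integrals needed for the posterior over the integral are computed numerically on a grid for simplicity (in practice one chooses kernels and g for which they are available in closed form). The kernel, grid, and example functions are our own choices.

```python
import numpy as np

def bayesian_quadrature(xs, fs, g, kernel, noise_var=1e-8):
    """Posterior mean and variance of I = int f(x) g(x) dx with a zero-mean GP prior on f,
    given (possibly noisy) observations fs of f at the points xs, and g known."""
    grid = np.linspace(-8.0, 8.0, 801)                 # quadrature grid for the kernel integrals
    K = kernel(xs[:, None], xs[None, :]) + noise_var * np.eye(len(xs))
    z = np.array([np.trapz(kernel(grid, xi) * g(grid), grid) for xi in xs])
    Kgg = kernel(grid[:, None], grid[None, :]) * g(grid)[:, None] * g(grid)[None, :]
    c = np.trapz(np.trapz(Kgg, grid, axis=1), grid)    # int int k(x,x') g(x) g(x') dx dx'
    mean = z @ np.linalg.solve(K, fs)                  # posterior mean of the integral
    var = c - z @ np.linalg.solve(K, z)                # posterior variance of the integral
    return mean, var

# example: integrate f(x) = sin(x)^2 against a standard normal density g
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
gauss = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
xs = np.linspace(-3.0, 3.0, 12)
mean, var = bayesian_quadrature(xs, np.sin(xs) ** 2, gauss, rbf)
```

The posterior variance provides an uncertainty estimate for the integral that a plain MC average of the samples does not.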
In this section, we study a Bayesian framework for policy gradient estimation
based on modeling the policy gradient as a GP (Ghavamzadeh and Engel, 2006).
This reduces the number of samples needed to obtain accurate gradient estimates.
Moreover, estimates of the natural gradient as well as a measure of the uncertainty
in the gradient estimates, namely, the gradient covariance, are provided at little extra
cost.
Let us begin with some definitions and notations. A stationary policy π(·|s) is a
probability distribution over actions, conditioned on the current state. Given a fixed
policy π, the MDP induces a Markov chain over state-action pairs, whose transition
probability from (s_t, a_t) to (s_{t+1}, a_{t+1}) is π(a_{t+1}|s_{t+1}) P(s_{t+1}|s_t, a_t). We generically denote by ξ = (s_0, a_0, s_1, a_1, . . . , s_{T−1}, a_{T−1}, s_T), T ∈ {0, 1, . . . , ∞}, a path generated by this Markov chain. The probability (density) of such a path is given by
$$P(\xi\,|\,\pi) = P_0(s_0) \prod_{t=0}^{T-1} \pi(a_t|s_t)\, P(s_{t+1}|s_t,a_t). \qquad (11)$$
We denote by $R(\xi) = \sum_{t=0}^{T-1} \gamma^t\, r(s_t,a_t)$ the discounted cumulative return of the path ξ, where γ ∈ [0, 1] is a discount factor. R(ξ) is a random variable both because the path ξ itself is a random variable, and because, even for a given path, each of the rewards sampled in it may be stochastic. The expected value of R(ξ) for a given path ξ is denoted by R̄(ξ). Finally, we define the expected return of policy π as
$$\eta(\pi) = E[R(\xi)] = \int d\xi\, \bar R(\xi)\, P(\xi\,|\,\pi). \qquad (12)$$
When the policy is parameterized by a vector θ, we write P(ξ; θ) and η(θ), and the gradient of the expected return with respect to θ may be written as
$$\nabla\eta(\theta) = \int d\xi\, \bar R(\xi)\, \frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\, P(\xi;\theta), \qquad (13)$$
where $\frac{\nabla P(\xi;\theta)}{P(\xi;\theta)} = \nabla\log P(\xi;\theta)$ is called the score function or likelihood ratio. Since the initial-state distribution P_0 and the state-transition distribution P are independent of the policy parameters θ, we may write the score function of a path ξ using Eq. 11 as²
$$u(\xi;\theta) = \frac{\nabla P(\xi;\theta)}{P(\xi;\theta)} = \sum_{t=0}^{T-1} \frac{\nabla\pi(a_t|s_t;\theta)}{\pi(a_t|s_t;\theta)} = \sum_{t=0}^{T-1} \nabla\log\pi(a_t|s_t;\theta). \qquad (14)$$
The conventional Monte-Carlo estimate of this gradient, based on M sample paths ξ_1, . . . , ξ_M drawn from P(ξ; θ), is $\widehat{\nabla\eta}(\theta) = \frac{1}{M}\sum_{i=1}^{M} R(\xi_i)\, u(\xi_i)$. This is an unbiased estimate, and therefore, by the law of large numbers, $\widehat{\nabla\eta}(\theta) \to \nabla\eta(\theta)$ as M goes to infinity, with probability one.
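For reference, here is what the conventional (frequentist) MC gradient estimate looks like when each path is stored as a list of (s, a, r) triples; the helper grad_log_pi and the data layout are assumptions of this sketch.

```python
import numpy as np

def mc_policy_gradient(paths, grad_log_pi, gamma):
    """(1/M) * sum_i R(xi_i) u(xi_i), with u the path score function of Eq. 14."""
    estimates = []
    for path in paths:
        u = sum(grad_log_pi(s, a) for (s, a, r) in path)            # u(xi), Eq. 14
        R = sum(gamma ** t * r for t, (_, _, r) in enumerate(path)) # R(xi)
        estimates.append(R * u)
    return np.mean(estimates, axis=0)
```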
In the frequentist approach to PG, the performance measure used is η(θ ). In
order to serve as a useful performance measure, it has to be a deterministic function
of the policy parameters θ . This is achieved by averaging the cumulative return
R(ξ ) over all possible paths ξ and all possible returns accumulated in each path.
In the Bayesian approach we have an additional source of randomness, namely, our subjective Bayesian uncertainty concerning the process generating the cumulative return. Let us denote $\eta_B(\theta) = \int d\xi\, \bar R(\xi)\, P(\xi;\theta)$, where η_B(θ) is a random variable because of the Bayesian uncertainty. We are interested in evaluating the posterior distribution of the gradient of η_B(θ) w.r.t. the policy parameters θ. The posterior moments of this gradient can be computed with the Bayesian quadrature machinery described above, as we explain next.
2 To simplify notation, we omit ∇ and u’s dependence on the policy parameters θ , and use ∇ and
u(ξ ) in place of ∇θ and u(ξ ; θ ) in the sequel.
In the Bayesian policy gradient (BPG) method of Ghavamzadeh and Engel (2006),
the problem of estimating the gradient of the expected return (Eq. 16) is cast as an
integral evaluation problem, and then the BQ method (O’Hagan, 1991), described
above, is used. In BQ, we need to partition the integrand into two parts, f (ξ ; θ ) and
g(ξ ; θ ). We will model f as a GP and assume that g is a function known to us. We
will then proceed by calculating the posterior moments of the gradient ∇ηB (θ ) con-
ditioned on the observed data DM = {ξ1 , . . . , ξM }. Because in general, R(ξ ) cannot
be known exactly, even for a given ξ (due to the stochasticity of the rewards), R(ξ )
should always belong to the GP part of the model, i.e., f (ξ ; θ ). Ghavamzadeh and
Engel (2006) proposed two different ways of partitioning the integrand in Eq. 16, re-
sulting in two distinct Bayesian models. Table 1 in Ghavamzadeh and Engel (2006)
summarizes the two models. Models 1 and 2 use Fisher-type kernels for the prior
covariance of f . The choice of Fisher-type kernels was motivated by the notion that
a good representation should depend on the data generating process (see Jaakkola
and Haussler 1999; Shawe-Taylor and Cristianini 2004 for a thorough discussion).
The particular choices of linear and quadratic Fisher kernels were guided by the
requirement that the posterior moments of the gradient be analytically tractable.
Models 1 and 2 can be used to define algorithms for evaluating the gradient of the
expected return w.r.t. the policy parameters. The algorithm (for either model) takes
a set of policy parameters θ and a sample size M as input, and returns an estimate of
the posterior moments of the gradient of the expected return. This Bayesian PG eval-
uation algorithm, in turn, can be used to derive a Bayesian policy gradient (BPG)
algorithm that starts with an initial vector of policy parameters θ 0 and updates the
parameters in the direction of the posterior mean of the gradient of the expected re-
turn, computed by the Bayesian PG evaluation procedure. This is repeated N times,
or alternatively, until the gradient estimate is sufficiently close to zero.
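The outer loop of such a BPG algorithm is simple; the sketch below leaves the Bayesian gradient evaluation itself behind a placeholder posterior_gradient (e.g., an implementation of Model 1 or 2 of Ghavamzadeh and Engel (2006)), and sample_paths stands for the user's simulator. Both names, and the constant step size, are our assumptions.

```python
import numpy as np

def bayesian_policy_gradient(theta0, sample_paths, posterior_gradient,
                             step_size=0.1, num_updates=50, paths_per_update=20):
    """Repeatedly evaluate the posterior mean of grad eta_B(theta) from M sampled paths
    and move the policy parameters in that direction."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_updates):
        paths = sample_paths(theta, paths_per_update)
        grad_mean, grad_cov = posterior_gradient(theta, paths)   # posterior moments of the gradient
        if np.linalg.norm(grad_mean) < 1e-6:                     # alternative stopping rule
            break
        theta = theta + step_size * grad_mean
    return theta
```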
As mentioned earlier, the kernel functions used in Models 1 and 2 are both based
on the Fisher information matrix G(θ ). Consequently, every time we update the
policy parameters we need to recompute G. In most practical situations, G is not
known and needs to be estimated. Ghavamzadeh and Engel (2006) described two
possible approaches to this problem: MC estimation of G and maximum likelihood
(ML) estimation of the MDP’s dynamics and use it to calculate G. They empirically
showed that even when G is estimated using MC or ML, BPG performs better than
MC-based PG algorithms.
BPG may be made significantly more efficient, both in time and memory, by
sparsifying the solution. Such sparsification may be performed incrementally, and
helps to numerically stabilize the algorithm when the kernel matrix is singular, or
nearly so. Similar to the GPTD case, one possibility is to use the on-line sparsifi-
cation method proposed by Engel et al (2002) to selectively add a new observed
path to a set of dictionary paths, which are used as a basis for approximating the
full solution. Finally, it is easy to show that the BPG models and algorithms can be
extended to POMDPs along the same lines as in Baxter and Bartlett (2001).
2.3 Bayesian Actor-Critic

In actor-critic algorithms it is convenient to write the expected return of policy π as
$$\eta(\pi) = \int_{\mathcal{Z}} dz\, \mu^\pi(z)\, \bar r(z),$$
where r̄(z) is the mean reward for the state-action pair z, and $\mu^\pi(z) = \sum_{t=0}^{\infty} \gamma^t P_t^\pi(z)$, with $P_t^\pi(z)$ the probability of visiting z at time t under π, is a discounted weighting of state-action pairs encountered while following policy π. Integrating a out of µ^π(z) = µ^π(s, a) results in the corresponding discounted weighting of states encountered by following policy π: $\rho^\pi(s) = \int_{\mathcal{A}} da\, \mu^\pi(s,a)$. Unlike ρ^π and µ^π, (1 − γ)ρ^π and (1 − γ)µ^π are distributions. They are analogous to the stationary distributions over states and state-action pairs of policy π in the undiscounted setting, since as γ → 1, they tend to these stationary distributions, if they exist. The policy gradient theorem (Marbach, 1998, Proposition 1; Sutton et al, 2000, Theorem 1; Konda and Tsitsiklis, 2000, Theorem 1) states that the gradient of the expected return for parameterized policies is given by
$$\nabla\eta(\theta) = \int ds\, da\, \rho(s;\theta)\, \nabla\pi(a|s;\theta)\, Q(s,a;\theta) = \int dz\, \mu(z;\theta)\, \nabla\log\pi(a|s;\theta)\, Q(z;\theta). \qquad (17)$$
Observe that if b : S → R is an arbitrary function of s (also called a baseline), then
$$\int ds\, da\, \rho(s;\theta)\, \nabla\pi(a|s;\theta)\, b(s) = \int_{\mathcal{S}} ds\, \rho(s;\theta)\, b(s)\, \nabla\!\int_{\mathcal{A}} da\, \pi(a|s;\theta) = \int_{\mathcal{S}} ds\, \rho(s;\theta)\, b(s)\, \nabla 1 = 0,$$
and so an arbitrary baseline b(s) can be subtracted from Q in Eq. 17 without changing the gradient:
$$\nabla\eta(\theta) = \int dz\, \mu(z;\theta)\, \nabla\log\pi(a|s;\theta)\, \big(Q(z;\theta) - b(s)\big). \qquad (18)$$
Now consider the case in which the action-value function for a fixed policy π,
Qπ , is approximated by a learned function approximator. If the approximation is
sufficiently good, we may hope to use it in place of Qπ in Eqs. 17 and 18, and still
point roughly in the direction of the true gradient. Sutton et al (2000) and Konda
and Tsitsiklis (2000) showed that if the approximation Q̂π (·; w) with parameter w
is compatible, i.e., $\nabla_w \hat Q^\pi(s,a;w) = \nabla\log\pi(a|s;\theta)$, and if it minimizes the mean squared error
$$\mathcal{E}^\pi(w) = \int dz\, \mu^\pi(z)\, \big(Q^\pi(z) - \hat Q^\pi(z;w)\big)^2, \qquad (19)$$
have the same solutions (e.g., Bhatnagar et al 2007, 2009), and if the parameter w
is set to be equal to w∗ in Eq. 20, then the resulting mean squared error E π (w∗ ) is
further minimized by setting b(s) = V π (s) (Bhatnagar et al, 2007, 2009). In other
words, the variance in the action-value function estimator is minimized if the base-
line is chosen to be the value function itself. This means that it is more meaningful to consider $w^{*\top}\psi(z)$ as the least-squares optimal parametric representation for the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ rather than the action-value function $Q^\pi(s,a)$.
We are now in a position to describe the main idea behind the BAC approach. Making use of the linearity of Eq. 17 in Q and denoting $g(z;\theta) = \mu^\pi(z)\, \nabla\log\pi(a|s;\theta)$, we obtain the following expressions for the posterior moments of the policy gradient (O'Hagan, 1991):
$$E[\nabla\eta(\theta)\,|\,\mathcal{D}_t] = \int_{\mathcal{Z}} dz\, g(z;\theta)\, \hat Q_t(z;\theta) = \int_{\mathcal{Z}} dz\, g(z;\theta)\, k_t(z)^\top \alpha_t,$$
$$\mathrm{Cov}[\nabla\eta(\theta)\,|\,\mathcal{D}_t] = \int_{\mathcal{Z}^2} dz\, dz'\, g(z;\theta)\, \hat S_t(z,z')\, g(z';\theta)^\top = \int_{\mathcal{Z}^2} dz\, dz'\, g(z;\theta)\, \big(k(z,z') - k_t(z)^\top C_t\, k_t(z')\big)\, g(z';\theta)^\top, \qquad (21)$$
where Q̂_t and Ŝ_t are the posterior moments of Q computed by the GPTD critic from Eq. 9.
These equations provide us with the general form of the posterior policy gradient moments. We are now left with a computational issue, namely, how to compute the following integrals appearing in these expressions:
$$U_t = \int_{\mathcal{Z}} dz\, g(z;\theta)\, k_t(z)^\top \qquad\text{and}\qquad V = \int_{\mathcal{Z}^2} dz\, dz'\, g(z;\theta)\, k(z,z')\, g(z';\theta)^\top. \qquad (22)$$
Using the definitions in Eq. 22, we may write the gradient posterior moments compactly as
$$E[\nabla\eta(\theta)\,|\,\mathcal{D}_t] = U_t\, \alpha_t, \qquad \mathrm{Cov}[\nabla\eta(\theta)\,|\,\mathcal{D}_t] = V - U_t\, C_t\, U_t^\top. \qquad (23)$$
Ghavamzadeh and Engel (2007) showed that in order to render these integrals analytically tractable, the prior covariance kernel should be defined as k(z, z′) = k_s(s, s′) + k_F(z, z′), the sum of an arbitrary state-kernel k_s and the Fisher kernel between state-action pairs $k_F(z,z') = u(z)^\top G(\theta)^{-1} u(z')$. They proved that using this prior covariance kernel, U_t and V from Eq. 22 satisfy U_t = [u(z_0), . . . , u(z_t)] and V = G(θ). When the posterior moments of the gradient of the expected return are
available, a Bayesian actor-critic (BAC) algorithm can be easily derived by updating
the policy parameters in the direction of the mean.
Similar to the BPG case in Section 2.2, the Fisher information matrix of each
policy may be estimated using MC or ML methods, and the algorithm may be made
significantly more efficient, both in time and memory, and more numerically sta-
ble by sparsifying the solution using for example the online sparsification method
of Engel et al (2002).
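Given the GPTD critic quantities α_t and C_t, the BAC posterior gradient moments of Eq. 23 reduce to a few matrix products; here u(z) is the single-step score ∇ log π(a|s; θ) and G is the (estimated) Fisher information matrix. The data layout and names are ours.

```python
import numpy as np

def bac_gradient_moments(Z, alpha, C, grad_log_pi, G):
    """Posterior mean U_t alpha_t and covariance G - U_t C_t U_t^T of the policy gradient,
    where U_t = [u(z_0), ..., u(z_t)] and u(z) = grad log pi(a|s; theta)."""
    U = np.stack([grad_log_pi(s, a) for (s, a) in Z], axis=1)   # (n, t+1)
    grad_mean = U @ alpha
    grad_cov = G - U @ C @ U.T
    return grad_mean, grad_cov

# a BAC update then steps along the posterior mean, e.g.
# theta = theta + step_size * grad_mean
```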
3 Model-Based Bayesian Reinforcement Learning

In model-based Bayesian RL, the unknown model parameters θ are treated as additional, unobserved random variables; the RL problem then becomes a POMDP whose hidden state includes θ and whose belief state combines the (observed) nominal MDP state s with a distribution over θ. Since each observed transition pertains only to one of the unknowns, we can model beliefs as products of Dirichlets, one for each unknown model parameter $\theta_a^{s,s'}$.
Belief monitoring in this POMDP corresponds to Bayesian updating of the beliefs based on observed state transitions. For a prior belief b(θ) = Dir(θ; n) over some transition parameter θ, when a specific (s, a, s′) transition is observed in the environment, the posterior belief is analytically computed by Bayes' rule, $b'(\theta) \propto \theta_a^{s,s'}\, b(\theta)$. If we represent belief states by a tuple $\langle s, \{n_a^{s,s'}\}\rangle$ consisting of the current state s and the hyperparameters $n_a^{s,s'}$ for each Dirichlet, belief updating simply amounts to setting the current state to s′ and incrementing by one the hyperparameter $n_a^{s,s'}$ that matches the observed transition (s, a, s′).
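A belief state of this form is just a table of counts; the toy class below performs the update and computes the predictive transition probability of Eq. 25 (the Dirichlet mean). The uniform pseudo-count prior and the class interface are our choices.

```python
from collections import defaultdict

class DirichletBelief:
    """Product-of-Dirichlets belief over unknown transition probabilities,
    stored as pseudo-counts indexed by (s, a, s')."""
    def __init__(self, prior_count=1.0):
        self.prior = prior_count
        self.n = defaultdict(float)

    def update(self, s, a, s_next):
        # Bayes' rule for a Dirichlet: increment the matching hyperparameter by one
        self.n[(s, a, s_next)] += 1.0

    def transition_prob(self, s, a, s_next, states):
        # P(s'|s, b, a) of Eq. 25 is the Dirichlet mean
        total = sum(self.n[(s, a, sp)] + self.prior for sp in states)
        return (self.n[(s, a, s_next)] + self.prior) / total
```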
The POMDP formulation of Bayesian reinforcement learning provides a natural
framework to reason about the exploration/exploitation tradeoff. Since beliefs en-
code all the information gained by the learner (i.e., sufficient statistics of the history
of past actions and observations) and an optimal POMDP policy is a mapping from
beliefs to actions that maximizes the expected total rewards, it follows that an op-
timal POMDP policy naturally optimizes the exploration/exploitation tradeoff. In
other words, since the goal in balancing exploitation (immediate gain) and explo-
ration (information gain) is to maximize the overall sum of rewards, then the best
tradeoff is achieved by the best POMDP policy. Note however that this assumes that
the prior belief is accurate and that computation is exact, which is rarely the case
in practice. Nevertheless, the POMDP formulation provides a useful formalism to
design algorithms that naturally tradeoff the exploration/exploitation tradeoff.
The POMDP formulation reduces the RL problem to a planning problem with
special structure. In the next section we derive the parameterization of the optimal
value function, which can be computed exactly by dynamic programming (Poupart
et al, 2006). However, since the complexity grows exponentially with the planning
horizon, we also discuss some approximations.
The Bellman equation of the resulting belief-state MDP can be written as
$$V_s(b) = \max_a \Big[ R(b,a) + \gamma \sum_{s'} P(s'|s,b,a)\, V_{s'}\big(b_a^{s,s'}\big) \Big]. \qquad (24)$$
Here s is the current nominal MDP state, b is the current belief over the model parameters θ, and $b_a^{s,s'}$ is the updated belief after transition (s, a, s′). The transition model is defined as
$$P(s'|s,b,a) = \int_\theta d\theta\, b(\theta)\, P(s'|s,\theta,a) = \int_\theta d\theta\, b(\theta)\, \theta_a^{s,s'}, \qquad (25)$$
and is just the average transition probability P(s′|s, a) with respect to belief b. Since
an optimal POMDP policy achieves by definition the highest attainable expected
future reward, it follows that such a policy would automatically optimize the explo-
ration/exploitation tradeoff in the original RL problem.
It is known (see, e.g., chapter 12 in this book) that the optimal finite-horizon value
function of a POMDP with discrete states and actions is piecewise linear and con-
vex, and it corresponds to the upper envelope of a set Γ of linear segments called α-
vectors: V ∗ (b) = maxα∈Γ α(b). In the literature, α is both defined as a linear func-
tion of b (i.e., α(b)) and as a vector of s (i.e., α(s)) such that α(b) = ∑s b(s)α(s).
Hence, for discrete POMDPs, value functions can be parameterized by a set of α-
vectors each represented as a vector of values for each state. Conveniently, this pa-
rameterization is closed under Bellman backups.
In the case of Bayesian RL, despite the hybrid nature of the state space, the piece-
wise linearity and convexity of the value function may still hold as demonstrated by
Duff (2002) and Porta et al (2005). In particular, the optimal finite-horizon value function of a discrete-action POMDP corresponds to the upper envelope of a set Γ of linear segments called α-functions (due to the continuous nature of the POMDP state θ), which can be grouped in subsets per nominal state s:
$$V^*_s(b) = \max_{\alpha\in\Gamma} \alpha_s(b), \qquad \alpha_s(b) = \int_\theta d\theta\, b(\theta)\, \alpha_s(\theta). \qquad (26)$$
Suppose that the optimal value function $V_s^k(b)$ for k steps-to-go is composed of a set $\Gamma^k$ of α-functions such that $V_s^k(b) = \max_{\alpha\in\Gamma^k} \alpha_s(b)$. Using Bellman's equation, we can compute by dynamic programming the best set $\Gamma^{k+1}$ representing the optimal value function $V^{k+1}$ with k + 1 stages-to-go. First we rewrite Bellman's equation (Eq. 24) by substituting $V^k$ for the maximum over the α-functions in $\Gamma^k$ as in Eq. 26:
$$V_s^{k+1}(b) = \max_a\, \Big[ R(b,a) + \gamma \sum_{s'} P(s'|s,b,a)\, \max_{\alpha\in\Gamma^k} \alpha_{s'}\big(b_a^{s,s'}\big) \Big]. \qquad (27)$$
Then we decompose Bellman's equation in three steps. The first step finds the maximal α-function for each a and s′. The second step finds the best action a. The third step performs the actual Bellman backup using the maximal action and α-functions:
$$\alpha_{b,a}^{s,s'} = \arg\max_{\alpha\in\Gamma^k}\, \alpha\big(b_a^{s,s'}\big) \qquad (28)$$
$$a_b^s = \arg\max_a\, R(s,a) + \gamma \sum_{s'} P(s'|s,b,a)\, \alpha_{b,a}^{s,s'}\big(b_a^{s,s'}\big) \qquad (29)$$
$$V_s^{k+1}(b) = R(s,a_b^s) + \gamma \sum_{s'} P(s'|s,b,a_b^s)\, \alpha_{b,a_b^s}^{s,s'}\big(b_{a_b^s}^{s,s'}\big) \qquad (30)$$
We can further rewrite the third step by using α-functions in terms of θ (instead of b) and expanding the belief state $b_{a_b^s}^{s,s'}$:
$$V_s^{k+1}(b) = R(s,a_b^s) + \gamma \sum_{s'} P(s'|s,b,a_b^s) \int_\theta d\theta\, b_{a_b^s}^{s,s'}(\theta)\, \alpha_{b,a_b^s}^{s,s'}(\theta). \qquad (31)$$
For every b we define such an α-function, and together all αb,s form the set Γ k+1 .
Since each αb,s was defined by using the optimal action and α-functions in Γ k , it
follows that each αb,s is necessarily optimal at b and we can introduce a max over
all α-functions with no loss:
$$V_s^{k+1}(b) = \int_\theta d\theta\, b(\theta)\, \alpha_{b,s}(\theta) = \alpha_{b,s}(b) = \max_{\alpha\in\Gamma^{k+1}} \alpha_s(b). \qquad (36)$$
Based on the above we can show the following (we refer to the original paper for
the proof):
Theorem 1 (Poupart et al (2006)). The α-functions in Bayesian RL are linear com-
binations of products of (unnormalized) Dirichlets.
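Theorem 1 makes these value functions easy to manipulate: an α-function over a single unknown multinomial can be stored as a list of (coefficient, exponent-vector) monomial terms, and its expectation under a Dirichlet belief reduces to ratios of multivariate Beta functions. The sketch below shows this evaluation for one Dirichlet; in the full Bayesian RL case the monomials are products over all (s, a) pairs, which we omit for brevity.

```python
import numpy as np
from scipy.special import gammaln

def log_beta(n):
    """log B(n) = sum_i log Gamma(n_i) - log Gamma(sum_i n_i)."""
    n = np.asarray(n, dtype=float)
    return np.sum(gammaln(n)) - gammaln(np.sum(n))

def alpha_value(terms, counts):
    """alpha(b) = int Dir(theta; counts) * sum_j c_j prod_i theta_i^{m_ji} dtheta
                = sum_j c_j * B(counts + m_j) / B(counts)."""
    counts = np.asarray(counts, dtype=float)
    return sum(c * np.exp(log_beta(counts + np.asarray(m, dtype=float)) - log_beta(counts))
               for c, m in terms)

# e.g. alpha(theta) = 2*theta_1^3*theta_2 - 0.5*theta_3, evaluated at belief Dir(theta; [4, 2, 1])
value = alpha_value([(2.0, [3, 1, 0]), (-0.5, [0, 0, 1])], [4, 2, 1])
```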
where the $w_{\theta_i}$'s are the importance weights of the sampled models, depending on the proposal distribution used. Dearden et al (1999) describe several efficient procedures
to sample the models from some proposal distributions that may be easier to work
with than P(θ ).
An alternative myopic Bayesian action selection strategy is Thompson sampling, which involves sampling just one MDP from the current belief, solving this MDP to optimality (e.g., by dynamic programming), and executing the optimal action at the current state (Thompson, 1933; Strens, 2000), a strategy that reportedly tends to over-explore (Wang et al, 2005).
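Thompson sampling is only a few lines once the belief is stored as Dirichlet counts; the sketch below assumes known rewards R of shape (S, A) and transition counts of shape (A, S, S), both of which are our conventions for illustration.

```python
import numpy as np

def solve_mdp(P, R, gamma, iters=500):
    """Value iteration for a sampled MDP: P has shape (A, S, S), R has shape (S, A)."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                          # greedy policy of the sampled MDP

def thompson_action(trans_counts, R, s, gamma, rng=np.random.default_rng()):
    """Sample one MDP from the Dirichlet posterior, solve it, act greedily in state s."""
    A, S, _ = trans_counts.shape
    P = np.array([[rng.dirichlet(trans_counts[a, si]) for si in range(S)] for a in range(A)])
    return int(solve_mdp(P, R, gamma)[s])
```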
One may achieve a less myopic action selection strategy by trying to compute a
near-optimal policy in the belief-state MDP of the POMDP (see previous section).
Since this is just an MDP (albeit continuous and with a special structure), one may
use any approximate solver for MDPs. Wang et al (2005); Ross and Pineau (2008)
have pursued this idea by applying the sparse sampling algorithm of Kearns et al
(1999) on the belief-state MDP. This approach carries out an explicit lookahead to
the effective horizon starting from the current belief, backing up rewards through the
tree by dynamic programming or linear programming (Castro and Precup, 2007), re-
sulting in a near-Bayes-optimal exploratory action. The search through the tree does
not produce a policy that will generalize over the belief space however, and a new
tree will have to be generated at each time step which can be expensive in practice.
Presumably the sparse sampling approach can be combined with an approach that
generalizes over the belief space via an α-function parameterization as in BEETLE,
although no algorithm of that type has been reported so far.
Multi-task learning (MTL) is an important learning paradigm and has recently been
an area of active research in machine learning (e.g., Caruana 1997; Baxter 2000). A
common setup is that there are multiple related tasks for which we are interested in
improving the performance over individual learning by sharing information across
the tasks. This transfer of information is particularly important when we are pro-
vided with only a limited number of data to learn each task. Exploiting data from
related problems provides more training samples for the learner and can improve the
performance of the resulting solution. More formally, the main objective in MTL is
to maximize the improvement over individual learning averaged over the tasks. This
should be distinguished from transfer learning in which the goal is to learn a suitable
bias for a class of tasks in order to maximize the expected future performance.
Most RL algorithms need a large number of samples to solve a problem and cannot directly take advantage of the information coming from other, similar
tasks. However, recent work has shown that transfer and multi-task learning tech-
niques can be employed in RL to reduce the number of samples needed to achieve
nearly-optimal solutions. All approaches to multi-task RL (MTRL) assume that the
tasks share similarity in some components of the problem such as dynamics, reward
structure, or value function. While some methods explicitly assume that the shared
components are drawn from a common generative model (Wilson et al, 2007; Mehta
et al, 2008; Lazaric and Ghavamzadeh, 2010), this assumption is more implicit in
others (Taylor et al, 2007; Lazaric et al, 2008). In Mehta et al (2008), tasks share
the same dynamics and reward features, and only differ in the weights of the reward
function. The proposed method initializes the value function for a new task using
the previously learned value functions as a prior. Wilson et al (2007) and Lazaric
and Ghavamzadeh (2010) both assume that the distribution over some components
of the tasks is drawn from a hierarchical Bayesian model (HBM). We describe these two methods in more detail below.
Lazaric and Ghavamzadeh (2010) study the MTRL scenario in which the learner
is provided with a number of MDPs with common state and action spaces. For any
given policy, only a small number of samples can be generated in each MDP, which
may not be enough to accurately evaluate the policy. In such a MTRL problem,
it is necessary to identify classes of tasks with similar structure and to learn them
jointly. It is important to note that here a task is a pair of MDP and policy such
that all the MDPs have the same state and action spaces. They consider a particular
class of MTRL problems in which the tasks share structure in their value functions.
To allow the value functions to share a common structure, it is assumed that they
are all sampled from a common prior. They adopt the GPTD value function model
(see Section 2.1) for each task, model the distribution over the value functions us-
ing a HBM, and develop solutions to the following problems: (i) joint learning of
the value functions (multi-task learning), and (ii) efficient transfer of the informa-
tion acquired in (i) to facilitate learning the value function of a newly observed task
(transfer learning). They first present a HBM for the case in which all the value func-
tions belong to the same class, and derive an EM algorithm to find MAP estimates of
the value functions and the model’s hyper-parameters. However, if the functions do
not belong to the same class, simply learning them together can be detrimental (neg-
ative transfer). It is therefore important to have models that will generally benefit
from related tasks and will not hurt performance when the tasks are unrelated. This
is particularly important in RL as changing the policy at each step of policy iteration
(this is true even for fitted value iteration) can change the way tasks are clustered
together. This means that even if we start with value functions that all belong to the
same class, after one iteration the new value functions may be clustered into several
classes. To address this issue, they introduce a Dirichlet process (DP) based HBM
for the case that the value functions belong to an undefined number of classes, and
derive inference algorithms for both the multi-task and transfer learning scenarios
in this model.
The MTRL approach in Wilson et al (2007) also uses a DP-based HBM to model
the distribution over a common structure of the tasks. In this work, the tasks share
structure in their dynamics and reward function. The setting is incremental, i.e.,
the tasks are observed as a sequence, and there is no restriction on the number of
samples generated by each task. The focus is not on joint learning with a finite number of samples; rather, it is on using the information gained from the previous tasks to facilitate learning in a new one. In other words, the focus in this work is on transfer and not
on multi-task learning.
When transfer learning and multi-task learning are not possible, the learner may still
want to use domain knowledge to reduce the complexity of the learning task. In non-
Bayesian reinforcement learning, domain knowledge is often implicitly encoded in
the choice of features used to encode the state space, parametric form of the value
function, or the class of policies considered. In Bayesian reinforcement learning, the
prior distribution provides an explicit and expressive mechanism to encode domain
knowledge. Instead of starting with a non-informative prior (e.g., uniform, Jeffrey’s
prior), one can reduce the need for data by specifying a prior that biases the learning
towards parameters that a domain expert feels are more likely.
For instance, in model-based Bayesian reinforcement learning, Dirichlet distri-
butions over the transition and reward distributions can naturally encode an expert’s
bias. Recall that the hyperparameters $n_i - 1$ of a Dirichlet can be interpreted as the number of times that the $p_i$-probability event has been observed. Hence, if the expert has access to prior data where each event occurred $n_i - 1$ times or has reasons to believe that each event would occur $n_i - 1$ times in a fictitious experiment, then a corresponding Dirichlet can be used as an informative prior. Alternatively, if one has some belief or prior data to estimate the mean and variance of some unknown multinomial, then the hyperparameters of the Dirichlet can be set by moment matching.
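Moment matching for a Dirichlet is a one-line computation: with mean vector m and n_0 = Σ_i n_i, each component has variance m_i(1 − m_i)/(n_0 + 1), so a target variance for one component fixes n_0 and hence all hyperparameters. The helper below is our illustration of this recipe.

```python
import numpy as np

def dirichlet_from_moments(means, target_var, index=0):
    """Hyperparameters n_i of a Dirichlet with E[theta] = means and Var(theta_index) = target_var,
    using Var(theta_i) = m_i (1 - m_i) / (n_0 + 1)."""
    means = np.asarray(means, dtype=float)
    m = means[index]
    n0 = m * (1.0 - m) / target_var - 1.0
    if n0 <= 0:
        raise ValueError("requested variance is too large for a Dirichlet with these means")
    return means * n0

# expert belief: successor probabilities around (0.7, 0.2, 0.1), std of ~0.1 on the first one
counts = dirichlet_from_moments([0.7, 0.2, 0.1], target_var=0.01)   # -> [14., 4., 2.]
```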
A drawback of the Dirichlet distribution is that it only allows unimodal priors to
be expressed. However, mixtures of Dirichlets can be used to express multimodal
distributions. In fact, since Dirichlets are monomials (i.e., $Dir(\theta) \propto \prod_i \theta_i^{n_i}$), mixtures of Dirichlets are polynomials with positive coefficients (i.e., $\sum_j c_j \prod_i \theta_i^{n_{ij}}$). So with a large enough number of mixture components it is possible to approximate arbitrarily closely any desirable prior over an unknown multinomial distribution. Pavlov and Poupart (2008) explored the use of mixtures of Dirichlets to express joint priors over the model dynamics and the policy. Although mixtures of Dirichlets are quite expressive, in some situations it may be possible to structure the priors
according to a generative model. To that effect, Doshi-Velez et al (2010) explored
the use of hierarchical priors such as hierarchical Dirichlet processes over the model
dynamics and policies represented as stochastic finite state controllers. The multi-
task and transfer learning techniques described in the previous section also explore
hierarchical priors over the value function (Lazaric and Ghavamzadeh, 2010) and
the model dynamics (Wilson et al, 2007).
4 Finite Sample Analysis and Complexity Issues

One of the main attractive features of the Bayesian approach to RL is the possibility of obtaining finite-sample estimates of the statistics of a given policy, in terms of posterior expected value and variance. This idea was first pursued by Mannor
et al (2007), who considered the bias and variance of the value function estimate
of a single policy. Assuming an exogenous sampling process (i.e., we only get to
observe the transitions and rewards, but not to control them), there exists a nominal
model (obtained by, say, maximum a-posteriori probability estimate) and a posterior
probability distribution over all possible models. Given a policy π and a posterior
distribution over the model θ = ⟨T, r⟩, we can consider the expected posterior value function
$$E_{\tilde T,\tilde r}\Big[\, E_s\Big[\, \sum_{t=1}^{\infty} \gamma^t\, \tilde r(s_t) \,\Big|\, \tilde T \Big]\Big], \qquad (38)$$
where the outer expectation is according to the posterior over the parameters of the MDP model and the inner expectation is with respect to transitions given that the model parameters are fixed. Collecting the infinite sum, we get
$$E_{\tilde T,\tilde r}\big[ (I - \gamma \tilde T_\pi)^{-1}\, \tilde r_\pi \big], \qquad (39)$$
where T̃_π and r̃_π are the transition matrix and reward vector of policy π when the model ⟨T̃, r̃⟩ is the true model. This problem maximizes the expected return over both
the trajectories and the model random variables. Because of the nonlinear effect of
T̃ on the expected return, Mannor et al (2007) argue that evaluating the objective of
this problem for a given policy is already difficult.
Assuming a Dirichlet prior for the transitions and a Gaussian prior for the re-
wards, one can obtain bias and variance estimates for the value function of a given
policy. These estimates are based on first order or second order approximations of
Equation (39). From a computational perspective, these estimates can be easily com-
puted and the value function can be de-biased. When trying to optimize over the
policy space, Mannor et al (2007) show experimentally that the common approach
consisting of using the most likely (or expected) parameters leads to a strong bias in
the performance estimate of the resulting policy.
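The posterior moments in Eqs. 38-39 are easy to estimate by sampling models from the posterior, which also makes the bias of the "plug-in" value estimate visible. The sketch below assumes, as in Mannor et al (2007), a Dirichlet posterior over each row of the policy's transition matrix and an independent Gaussian posterior over its reward vector; shapes and names are ours.

```python
import numpy as np

def value_posterior_by_sampling(trans_counts, reward_mean, reward_std, gamma,
                                num_models=1000, rng=np.random.default_rng()):
    """Empirical posterior mean and std of V^pi = (I - gamma T_pi)^{-1} r_pi (Eq. 39).
    trans_counts: (S, S) Dirichlet counts for the Markov chain induced by the fixed policy pi;
    reward_mean, reward_std: (S,) posterior moments of the policy's reward vector."""
    S = trans_counts.shape[0]
    I = np.eye(S)
    values = np.empty((num_models, S))
    for k in range(num_models):
        T = np.vstack([rng.dirichlet(trans_counts[s]) for s in range(S)])  # sampled T_pi
        r = rng.normal(reward_mean, reward_std)                            # sampled r_pi
        values[k] = np.linalg.solve(I - gamma * T, r)
    return values.mean(axis=0), values.std(axis=0)
```

The same samples can be reused to estimate quantiles of the value, which is what percentile criteria such as Eq. 40 below optimize over policies.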
The Bayesian view for a finite sample naturally leads to the question of pol-
icy optimization, where an additional maximum over all policies is taken in (38).
The standard approach in Markov decision processes is to consider the so-called
robust approach: assume the parameters of the problem belong to some uncertainty
set and find the policy with the best worst-case performance. This can be done ef-
ficiently using dynamic programming style algorithms; see Nilim and El Ghaoui
(2005); Iyengar (2005). The problem with the robust approach is that it leads to
over-conservative solutions. Moreover, the currently available algorithms require
the uncertainty in different states to be uncorrelated, meaning that the uncertainty
set is effectively taken as the Cartesian product of state-wise uncertainty sets.
One of the benefits of the Bayesian perspective is that it enables using certain risk
aware approaches since we have a probability distribution on the available models.
For example, it is possible to consider bias-variance tradeoffs in this context, where
one would maximize reward subject to variance constraints or give a penalty for
excessive variance. Mean-variance optimization in the Bayesian setup seems like
a difficult problem, and there are currently no known complexity results about it.
Curtailing this problem, Delage and Mannor (2010) present an approximation to a
risk-sensitive percentile optimization criterion:
$$\begin{aligned} &\operatorname*{maximize}_{y\in\mathbb{R},\,\pi\in\Upsilon}\; y \\ &\text{s.t.}\quad P_\theta\Big( E_s\Big[ \sum_{t=0}^{\infty} \gamma^t\, r_t(s_t) \,\Big|\, s_0 \sim q, \pi \Big] \ge y \Big) \ge 1 - \varepsilon. \end{aligned} \qquad (40)$$
The complexity of stochastic branch-and-bound methods for belief-tree search in Bayesian RL has been studied by Dimitrakakis (2010). Two recent papers address the issue of complexity in model-based BRL. In
the first paper, Kolter and Ng (2009) present a simple algorithm, and prove that with
high probability it is able to perform approximately close to the true (intractable) op-
timal Bayesian policy after a polynomial (in quantities describing the system) num-
ber of time steps. The algorithm and analysis are reminiscent of PAC-MDP (e.g.,
Brafman and Tennenholtz (2002); Strehl et al (2006)) but it explores in a greedier
style than PAC-MDP algorithms. In the second paper, Asmuth et al (2009) present
an approach that drives exploration by sampling multiple models from the poste-
rior and selecting actions optimistically. The decision when to re-sample the set and
how to combine the models is based on optimistic heuristics. The resulting algo-
rithm achieves near optimal reward with high probability with a sample complexity
that is low relative to the speed at which the posterior distribution converges during
learning. Finally, Fard and Pineau (2010) derive a PAC-Bayesian style bound that
allows balancing between the distribution-free PAC and the data-efficient Bayesian
paradigms.
5 Summary and Discussion

While Bayesian Reinforcement Learning was perhaps the first kind of reinforcement learning considered in the 1960s by the Operations Research community, a recent surge of interest by the Machine Learning community has led to many advances described in this chapter. Much of this interest comes from the benefits of
maintaining explicit distributions over the quantities of interest. In particular, the
exploration/exploitation tradeoff can be naturally optimized once a distribution is
used to quantify the uncertainty about various parts of the model, value function or
gradient. Notions of risk can also be taken into account while optimizing a policy.
In this chapter we provided an overview of the state of the art regarding the use of
Bayesian techniques in reinforcement learning for a single agent in fully observable
domains. We note that Bayesian techniques have also been used in partially ob-
servable domains (Ross et al, 2007, 2008; Poupart and Vlassis, 2008; Doshi-Velez,
2009; Veness et al, 2010) and multi-agent systems (Chalkiadakis and Boutilier,
2003, 2004; Gmytrasiewicz and Doshi, 2005).
References
Aharony N, Zehavi T, Engel Y (2005) Learning wireless network association control with Gaussian
process temporal difference methods. In: Proceedings of OPNETWORK
Asmuth J, Li L, Littman ML, Nouri A, Wingate D (2009) A Bayesian sampling approach to ex-
ploration in reinforcement learning. In: Proceedings of the Twenty-Fifth Conference on Uncer-
tainty in Artificial Intelligence, AUAI Press, UAI ’09, pp 19–26
Bagnell J, Schneider J (2003) Covariant policy search. In: Proceedings of the Eighteenth Interna-
tional Joint Conference on Artificial Intelligence
Barto A, Sutton R, Anderson C (1983) Neuron-like elements that can solve difficult learning con-
trol problems. IEEE Transaction on Systems, Man and Cybernetics 13:835–846
Baxter J (2000) A model of inductive bias learning. Journal of Artificial Intelligence Research
12:149–198
Baxter J, Bartlett P (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelli-
gence Research 15:319–350
Bellman R (1956) A problem in sequential design of experiments. Sankhya 16:221–229
Bellman R (1961) Adaptive Control Processes: A Guided Tour. Princeton University Press
Bellman R, Kalaba R (1959) On adaptive control processes. Transactions on Automatic Control,
IRE 4(2):1–9
Bhatnagar S, Sutton R, Ghavamzadeh M, Lee M (2007) Incremental natural actor-critic algorithms.
In: Proceedings of Advances in Neural Information Processing Systems 20, MIT Press, pp 105–
112
Bhatnagar S, Sutton R, Ghavamzadeh M, Lee M (2009) Natural actor-critic algorithms. Automatica
45(11):2471–2482
Brafman R, Tennenholtz M (2002) R-max - a general polynomial time algorithm for near-optimal
reinforcement learning. JMLR 3:213–231
Caruana R (1997) Multitask learning. Machine Learning 28(1):41–75
Castro P, Precup D (2007) Using linear programming for Bayesian exploration in Markov decision
processes. In: Proc. 20th International Joint Conference on Artificial Intelligence
Chalkiadakis G, Boutilier C (2003) Coordination in multi-agent reinforcement learning: A
Bayesian approach. In: International Joint Conference on Autonomous Agents and Multiagent
Systems (AAMAS), pp 709–716
Chalkiadakis G, Boutilier C (2004) Bayesian reinforcement learning for coalition formation under
uncertainty. In: International Joint Conference on Autonomous Agents and Multiagent Systems
(AAMAS), pp 1090–1097
Cozzolino J, Gonzales-Zubieta R, Miller RL (1965) Markovian decision processes with uncer-
tain transition probabilities. Tech. Rep. Technical Report No. 11, Research in the Control of
Complex Systems. Operations Research Center, Massachusetts Institute of Technology
Cozzolino JM (1964) Optimal sequential decision making under uncertainty. Master’s thesis, Mas-
sachusetts Institute of Technology
Dearden R, Friedman N, Russell S (1998) Bayesian Q-learning. In: Proceedings of the Fifteenth
National Conference on Artificial Intelligence, pp 761–768
Dearden R, Friedman N, Andre D (1999) Model based Bayesian exploration. In: UAI, pp 150–159
DeGroot MH (1970) Optimal Statistical Decisions. McGraw-Hill, New York
Delage E, Mannor S (2010) Percentile optimization for Markov decision processes with parameter
uncertainty. Operations Research 58(1):203–213
Dimitrakakis C (2010) Complexity of stochastic branch and bound methods for belief tree search
in Bayesian reinforcement learning. In: ICAART (1), pp 259–264
Doshi-Velez F (2009) The infinite partially observable Markov decision process. In: Neural Infor-
mation Processing systems
Doshi-Velez F, Wingate D, Roy N, Tenenbaum J (2010) Nonparametric Bayesian policy priors for
reinforcement learning. In: NIPS
Duff M (2002) Optimal learning: Computational procedures for Bayes-adaptive Markov decision
processes. PhD thesis, University of Massachusetts Amherst
Duff M (2003) Design for an optimal probe. In: ICML, pp 131–138
Engel Y (2005) Algorithms and representations for reinforcement learning. PhD thesis, The He-
brew University of Jerusalem, Israel
Engel Y, Mannor S, Meir R (2002) Sparse online greedy support vector regression. In: Proceedings
of the Thirteenth European Conference on Machine Learning, pp 84–96
Engel Y, Mannor S, Meir R (2003) Bayes meets Bellman: The Gaussian process approach to tem-
poral difference learning. In: Proceedings of the Twentieth International Conference on Ma-
chine Learning, pp 154–161
Engel Y, Mannor S, Meir R (2005a) Reinforcement learning with Gaussian processes. In: Proceed-
ings of the Twenty Second International Conference on Machine Learning, pp 201–208
Engel Y, Szabo P, Volkinshtein D (2005b) Learning to control an octopus arm with Gaussian pro-
cess temporal difference methods. In: Proceedings of Advances in Neural Information Process-
ing Systems 18, MIT Press, pp 347–354
Fard MM, Pineau J (2010) PAC-Bayesian model selection for reinforcement learning. In: Lafferty
J, Williams CKI, Shawe-Taylor J, Zemel R, Culotta A (eds) Advances in Neural Information
Processing Systems 23, pp 1624–1632
Ghavamzadeh M, Engel Y (2006) Bayesian policy gradient algorithms. In: Proceedings of Ad-
vances in Neural Information Processing Systems 19, MIT Press
Ghavamzadeh M, Engel Y (2007) Bayesian Actor-Critic algorithms. In: Proceedings of the
Twenty-Fourth International Conference on Machine Learning
Gmytrasiewicz P, Doshi P (2005) A framework for sequential planning in multi-agent settings.
Journal of Artificial Intelligence Research (JAIR) 24:49–79
Greensmith E, Bartlett P, Baxter J (2004) Variance reduction techniques for gradient estimates in
reinforcement learning. Journal of Machine Learning Research 5:1471–1530
Iyengar GN (2005) Robust dynamic programming. Mathematics of Operations Research
30(2):257–280
Jaakkola T, Haussler D (1999) Exploiting generative models in discriminative classifiers. In: Pro-
ceedings of Advances in Neural Information Processing Systems 11, MIT Press
Kaelbling LP (1993) Learning in Embedded Systems. MIT Press
Kakade S (2002) A natural policy gradient. In: Proceedings of Advances in Neural Information
Processing Systems 14
Kearns M, Mansour Y, Ng A (1999) A sparse sampling algorithm for near-optimal planning in
large Markov decision processes. In: Proc. IJCAI
Kolter JZ, Ng AY (2009) Near-Bayesian exploration in polynomial time. In: Proceedings of the 26th
Annual International Conference on Machine Learning, ACM, New York, NY, USA, ICML ’09,
pp 513–520
Konda V, Tsitsiklis J (2000) Actor-Critic algorithms. In: Proceedings of Advances in Neural Infor-
mation Processing Systems 12, pp 1008–1014
Lazaric A, Ghavamzadeh M (2010) Bayesian multi-task reinforcement learning. In: Proceedings
of the Twenty-Seventh International Conference on Machine Learning, pp 599–606
Lazaric A, Restelli M, Bonarini A (2008) Transfer of samples in batch reinforcement learning. In:
Proceedings of ICML 25, pp 544–551
Mannor S, Simester D, Sun P, Tsitsiklis JN (2007) Bias and variance approximation in value func-
tion estimates. Management Science 53(2):308–322
Marbach P (1998) Simulated-based methods for Markov decision processes. PhD thesis, Mas-
sachusetts Institute of Technology
Martin JJ (1967) Bayesian decision problems and Markov chains. John Wiley, New York
Mehta N, Natarajan S, Tadepalli P, Fern A (2008) Transfer in variable-reward hierarchical rein-
forcement learning. Machine Learning 73(3):289–312
Meuleau N, Bourgine P (1999) Exploration of multi-state environments: local measures and back-
propagation of uncertainty. Machine Learning 35:117–154
Nilim A, El Ghaoui L (2005) Robust control of Markov decision processes with uncertain transition
matrices. Operations Research 53(5):780–798
O’Hagan A (1987) Monte Carlo is fundamentally unsound. The Statistician 36:247–249
O’Hagan A (1991) Bayes-Hermite quadrature. Journal of Statistical Planning and Inference
29:245–260
Pavlov M, Poupart P (2008) Towards global reinforcement learning. In: NIPS Workshop on Model
Uncertainty and Risk in Reinforcement Learning
Peters J, Schaal S (2008) Reinforcement learning of motor skills with policy gradients. Neural
Networks 21(4):682–697
Peters J, Vijayakumar S, Schaal S (2003) Reinforcement learning for humanoid robotics. In: Pro-
ceedings of the Third IEEE-RAS International Conference on Humanoid Robots
Peters J, Vijayakumar S, Schaal S (2005) Natural actor-critic. In: Proceedings of the Sixteenth
European Conference on Machine Learning, pp 280–291
Porta JM, Spaan MT, Vlassis N (2005) Robot planning in partially observable continuous domains.
In: Proc. Robotics: Science and Systems
Poupart P, Vlassis N (2008) Model-based Bayesian reinforcement learning in partially observable
domains. In: International Symposium on Artificial Intelligence and Mathematics (ISAIM)
Poupart P, Vlassis N, Hoey J, Regan K (2006) An analytic solution to discrete Bayesian reinforce-
ment learning. In: Proc. Int. Conf. on Machine Learning, Pittsburgh, USA
Rasmussen C, Ghahramani Z (2003) Bayesian Monte Carlo. In: Proceedings of Advances in Neural
Information Processing Systems 15, MIT Press, pp 489–496
Rasmussen C, Williams C (2006) Gaussian Processes for Machine Learning. MIT Press
Reisinger J, Stone P, Miikkulainen R (2008) Online kernel selection for Bayesian reinforcement
learning. In: Proceedings of the Twenty-Fifth Conference on Machine Learning, pp 816–823
Ross S, Pineau J (2008) Model-based Bayesian reinforcement learning in large structured domains.
In: Uncertainty in Artificial Intelligence (UAI)
Ross S, Chaib-Draa B, Pineau J (2007) Bayes-adaptive POMDPs. In: Advances in Neural Infor-
mation Processing Systems (NIPS)
Ross S, Chaib-Draa B, Pineau J (2008) Bayesian reinforcement learning in continuous POMDPs
with application to robot navigation. In: IEEE International Conference on Robotics and Au-
tomation (ICRA), pp 2845–2851
Shawe-Taylor J, Cristianini N (2004) Kernel Methods for Pattern Analysis. Cambridge University
Press
Silver EA (1963) Markov decision processes with uncertain transition probabilities or rewards.
Tech. Rep. Technical Report No. 1, Research in the Control of Complex Systems. Operations
Research Center, Massachusetts Institute of Technology
Spaan MTJ, Vlassis N (2005) Perseus: Randomized point-based value iteration for POMDPs. Jour-
nal of Artificial Intelligence Research 24:195–220
Strehl AL, Li L, Littman ML (2006) Incremental model-based learners with formal learning-time
guarantees. In: UAI
Strens M (2000) A Bayesian framework for reinforcement learning. In: ICML
Sutton R (1984) Temporal credit assignment in reinforcement learning. PhD thesis, University of
Massachusetts Amherst
Sutton R (1988) Learning to predict by the methods of temporal differences. Machine Learning
3:9–44
Sutton R, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement
learning with function approximation. In: Proceedings of Advances in Neural Information Pro-
cessing Systems 12, pp 1057–1063
Taylor M, Stone P, Liu Y (2007) Transfer learning via inter-task mappings for temporal difference
learning. JMLR 8:2125–2167
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of
the evidence of two samples. Biometrika 25:285–294
Veness J, Ng KS, Hutter M, Silver D (2010) Reinforcement learning via AIXI approximation. In:
AAAI
Wang T, Lizotte D, Bowling M, Schuurmans D (2005) Bayesian sparse sampling for on-line reward
optimization. In: ICML
Watkins C (1989) Learning from delayed rewards. PhD thesis, Kings College, Cambridge, England
Wiering M (1999) Explorations in efficient reinforcement learning. PhD thesis, University of Am-
sterdam
Williams R (1992) Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Machine Learning 8:229–256
Wilson A, Fern A, Ray S, Tadepalli P (2007) Multi-task reinforcement learning: A hierarchical
Bayesian approach. In: Proceedings of ICML 24, pp 1015–1022