Bayesian Policy Gradient Algorithms

Mohammad Ghavamzadeh    Yaakov Engel
Abstract
Policy gradient methods are reinforcement learning algorithms that adapt a param-
eterized policy by following a performance gradient estimate. Conventional pol-
icy gradient methods use Monte-Carlo techniques to estimate this gradient. Since
Monte Carlo methods tend to have high variance, a large number of samples is
required, resulting in slow convergence. In this paper, we propose a Bayesian
framework that models the policy gradient as a Gaussian process. This reduces
the number of samples needed to obtain accurate gradient estimates. Moreover,
estimates of the natural gradient as well as a measure of the uncertainty in the
gradient estimates are provided at little extra cost.
1 Introduction
Policy Gradient (PG) methods are Reinforcement Learning (RL) algorithms that maintain a param-
eterized action-selection policy and update the policy parameters by moving them in the direction
of an estimate of the gradient of a performance measure. Early examples of PG algorithms are the
class of REINFORCE algorithms of Williams [1] which are suitable for solving problems in which
the goal is to optimize the average reward. Subsequent work (e.g., [2, 3]) extended these algorithms
to the cases of infinite-horizon Markov decision processes (MDPs) and partially observable MDPs
(POMDPs), and provided much needed theoretical analysis. However, both the theoretical results
and empirical evaluations have highlighted a major shortcoming of these algorithms, namely, the
high variance of the gradient estimates. This problem may be traced to the fact that in most interest-
ing cases, the time-average of the observed rewards is a high-variance (although unbiased) estimator
of the true average reward, resulting in the sample-inefficiency of these algorithms.
One solution proposed for this problem was to use a small (i.e., smaller than 1) discount factor in
these algorithms [2, 3], however, this creates another problem by introducing bias into the gradient
estimates. Another solution, which does not involve biasing the gradient estimate, is to subtract
a reinforcement baseline from the average reward estimate in the updates of PG algorithms (e.g.,
[4, 1]). Another approach for speeding-up policy gradient algorithms was recently proposed in [5]
and extended in [6, 7]. The idea is to replace the policy-gradient estimate with an estimate of the
so-called natural policy-gradient. This is motivated by the requirement that a change in the way the
policy is parametrized should not influence the result of the policy update. In terms of the policy
update rule, the move to a natural-gradient rule amounts to linearly transforming the gradient using
the inverse Fisher information matrix of the policy.
However, both conventional and natural policy gradient methods rely on Monte-Carlo (MC) tech-
niques to estimate the gradient of the performance measure. Monte-Carlo estimation is a frequentist
procedure, and as such violates the likelihood principle [8].¹ Moreover, although MC estimates are
unbiased, they tend to produce high variance estimates, or alternatively, require excessive sample
sizes (see [9] for a discussion).
¹ The likelihood principle states that in a parametric statistical model, all the information about a data sample that is required for inferring the model parameters is contained in the likelihood function of that sample.
In [10] a Bayesian alternative to MC estimation is proposed. The idea is to model integrals of
the form ∫ f(x)p(x)dx as Gaussian Processes (GPs). This is done by treating the first term f in
the integrand as a random function, the randomness of which reflects our subjective uncertainty
concerning its true identity. This allows us to incorporate our prior knowledge on f into its prior
distribution. Observing (possibly noisy) samples of f at a set of points (x1 , x2 , . . . , xM ) allows us
to employ Bayes’ rule to compute a posterior distribution of f , conditioned on these samples. This,
in turn, induces a posterior distribution over the value of the integral. In this paper, we propose a
Bayesian framework for policy gradient, by modeling the gradient as a GP. This reduces the number
of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient
and the gradient covariance are provided at little extra cost.
A path ξ = (x_0, a_0, x_1, a_1, . . . , a_{T−1}, x_T) generated by following the policy µ has probability

Pr(ξ|µ) = P_0(x_0) ∏_{t=0}^{T−1} µ(a_t|x_t) P(x_{t+1}|x_t, a_t).   (1)

We denote by R(ξ) = Σ_{t=0}^{T−1} γ^t r(x_t, a_t) the (possibly discounted, γ ∈ [0, 1]) cumulative return of the path ξ. R(ξ) is a random variable both because the path ξ is a random variable, and because, even for a given path, each of the rewards sampled in it may be stochastic. The expected value of R(ξ) for a given ξ is denoted by R̄(ξ). Finally, let us define the expected return,

η(µ) = E(R(ξ)) = ∫ R̄(ξ) Pr(ξ|µ) dξ.   (2)
Gradient-based approaches to policy search in RL have recently received much attention as a means to sidestep problems of partial observability and of policy oscillations and even divergence encountered in value-function based methods (see [11], Sec. 6.4.2 and 6.5.3). In policy gradient (PG) methods, we define a class of smoothly parameterized stochastic policies {µ(·|x; θ), x ∈ X, θ ∈ Θ}, estimate the gradient of the expected return (2) with respect to the policy parameters θ from observed system trajectories, and then improve the policy by adjusting the parameters in the direction of the gradient [1, 2, 3]. The gradient of the expected return η(θ) = η(µ(·|·; θ)) is given by²
∇η(θ) = ∫ R̄(ξ) (∇Pr(ξ; θ)/Pr(ξ; θ)) Pr(ξ; θ) dξ,   (3)

where Pr(ξ; θ) = Pr(ξ|µ(·|·; θ)). The quantity ∇Pr(ξ; θ)/Pr(ξ; θ) = ∇ log Pr(ξ; θ) is known as the score function or likelihood ratio. Since the initial state distribution P_0 and the transition distribution P are independent of the policy parameters θ, we can write the score of a path ξ using Eq. 1 as

u(ξ) = ∇Pr(ξ; θ)/Pr(ξ; θ) = Σ_{t=0}^{T−1} ∇µ(a_t|x_t; θ)/µ(a_t|x_t; θ) = Σ_{t=0}^{T−1} ∇ log µ(a_t|x_t; θ).   (4)
² Throughout the paper, we use the notation ∇ to denote ∇_θ, the gradient w.r.t. the policy parameters.
Previous work on policy gradient methods used classical Monte-Carlo to estimate the gradient in
Eq. 3. These methods generate i.i.d. sample paths ξ1 , . . . , ξM according to Pr(ξ; θ), and estimate
the gradient ∇η(θ) using the MC estimator
∇η̂_MC(θ) = (1/M) Σ_{i=1}^{M} R(ξ_i) ∇ log Pr(ξ_i; θ) = (1/M) Σ_{i=1}^{M} R(ξ_i) Σ_{t=0}^{T_i−1} ∇ log µ(a_{t,i}|x_{t,i}; θ).   (5)
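To make Eq. 5 concrete, here is a minimal sketch (ours, not from the paper) of the MC estimator for a state-independent Gaussian policy; the `sample_path` callable standing in for an episode rollout is a hypothetical interface.

```python
# A sketch of the Monte-Carlo gradient estimator of Eq. 5 for a Gaussian policy
# a ~ N(theta[0], exp(theta[1])^2). `sample_path(theta)` is a hypothetical rollout
# returning the actions taken and the rewards received along one trajectory.
import numpy as np

def score(theta, a):
    """grad_theta log mu(a; theta) for a state-independent Gaussian policy."""
    mean, std = theta[0], np.exp(theta[1])
    return np.array([(a - mean) / std**2,              # derivative w.r.t. the mean
                     (a - mean) ** 2 / std**2 - 1.0])  # derivative w.r.t. the log-std

def mc_gradient(theta, sample_path, M=100):
    """Eq. 5: average over M paths of R(xi_i) * sum_t grad log mu(a_t; theta)."""
    grad = np.zeros(2)
    for _ in range(M):
        actions, rewards = sample_path(theta)          # one trajectory xi_i
        grad += np.sum(rewards) * sum(score(theta, a) for a in actions)
    return grad / M
```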
3 Bayesian Quadrature
Bayesian quadrature (BQ) [10] is a Bayesian method for evaluating an integral using samples of its
integrand. We consider the problem of evaluating the integral
ρ = ∫ f(x)p(x)dx.   (6)
If p(x) is a probability density function, this becomes the problem of evaluating the expected value
of f (x). In MC estimation of such expectations, samples (x1 , x2 , . . . , xM ) are drawn from p(x),
and the integral is estimated as ρ̂_MC = (1/M) Σ_{i=1}^{M} f(x_i). ρ̂_MC is an unbiased estimate of ρ, with
variance that diminishes to zero as M → ∞. However, as O’Hagan points out, MC estimation is
fundamentally unsound, as it violates the likelihood principle, and moreover, does not make full use
of the data at hand [9] .
The alternative proposed in [10] is based on the following reasoning: In the Bayesian approach, f (·)
is random simply because it is numerically unknown. We are therefore uncertain about the value
of f (x) until we actually evaluate it. In fact, even then, our uncertainty is not always completely
removed, since measured samples of f (x) may be corrupted by noise. Modeling f as a Gaussian
process (GP) means that our uncertainty is completely accounted for by specifying a Normal prior
distribution over functions. This prior distribution is specified by its mean and covariance, and is
denoted by f (·) ∼ N {f0 (·), k(·, ·)}. This is shorthand for the statement that f is a GP with prior mean
E(f (x)) = f0 (x) and covariance Cov(f (x), f (x′ )) = k(x, x′ ), respectively. The choice of kernel
function k allows us to incorporate prior knowledge on the smoothness properties of the integrand
into the estimation procedure. When we are provided with a set of samples D_M = {(x_i, y_i)}_{i=1}^{M}, where y_i is a (possibly noisy) sample of f(x_i), we apply Bayes' rule to condition the prior on these sampled values. If the measurement noise is normally distributed, the result is a Normal posterior distribution of f|D_M, with the standard expressions for the posterior mean and covariance:

E(f(x)|D_M) = f_0(x) + k_M(x)^⊤ C_M (y_M − f_0),    Cov(f(x), f(x′)|D_M) = k(x, x′) − k_M(x)^⊤ C_M k_M(x′),

where k_M(x) = (k(x_1, x), . . . , k(x_M, x))^⊤, (K_M)_{i,j} = k(x_i, x_j), f_0 = (f_0(x_1), . . . , f_0(x_M))^⊤, y_M = (y_1, . . . , y_M)^⊤, and C_M = (K_M + Σ_M)^{−1}, with Σ_M the measurement noise covariance. Since ρ is a linear functional of f, its posterior is also Gaussian, with moments

E(ρ|D_M) = ρ_0 + z_M^⊤ C_M (y_M − f_0),    Var(ρ|D_M) = z_0 − z_M^⊤ C_M z_M,

where ρ_0 = ∫ f_0(x)p(x)dx, z_M = ∫ k_M(x)p(x)dx, and z_0 = ∫∫ k(x, x′)p(x)p(x′)dx dx′.
Note that ρ0 and z0 are the prior mean and variance of ρ, respectively.
Table 1: Summary of the two Bayesian policy gradient models.

                        Model 1                                              Model 2
Known part              p(ξ; θ) = Pr(ξ; θ)                                   p(ξ; θ) = ∇Pr(ξ; θ)
Uncertain part          f(ξ; θ) = R̄(ξ)∇ log Pr(ξ; θ)                         f(ξ) = R̄(ξ)
Measurement             y(ξ) = R(ξ)∇ log Pr(ξ; θ)                            y(ξ) = R(ξ)
Prior mean of f         E(f(ξ; θ)) = 0                                       E(f(ξ)) = 0
Prior cov. of f         Cov(f(ξ; θ), f(ξ′; θ)) = k(ξ, ξ′)I                   Cov(f(ξ), f(ξ′)) = k(ξ, ξ′)
E(∇η_B(θ)|D_M)          Y_M C_M z_M                                          Z_M C_M y_M
Cov(∇η_B(θ)|D_M)        (z_0 − z_M^⊤ C_M z_M)I                               Z_0 − Z_M C_M Z_M^⊤
Kernel function         k(ξ_i, ξ_j) = (1 + u(ξ_i)^⊤ G^{−1} u(ξ_j))²          k(ξ_i, ξ_j) = u(ξ_i)^⊤ G^{−1} u(ξ_j)
z_M / Z_M               (z_M)_i = 1 + u(ξ_i)^⊤ G^{−1} u(ξ_i)                 Z_M = U_M
z_0 / Z_0               z_0 = 1 + n                                          Z_0 = G
In order to prevent the problem from "degenerating into infinite regress", as phrased by O'Hagan [10], we should choose the functions p, k, and f_0 so as to allow us to solve the integrals defining ρ_0, z_M, and z_0 analytically. For instance, O'Hagan provides the analysis required for the case where these integrands are products of multivariate Gaussians and polynomials, referred to as Bayes-Hermite quadrature. One of the contributions of the present paper is in providing analogous analysis for
kernel functions that are based on the Fisher kernel [13, 14]. It is important to note that in MC
estimation, samples must be drawn from the distribution p(x), whereas in the Bayesian approach,
samples may be drawn from arbitrary distributions. This affords us flexibility in the choice of sample points, allowing us, for instance, to actively design the samples (x_1, x_2, . . . , x_M).
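To make the quadrature machinery above concrete, the following is a small illustrative sketch (not from the paper) of one-dimensional BQ with a zero prior mean, a Gaussian kernel, and a Gaussian p(x): the Bayes-Hermite setting in which z_M and z_0 are available in closed form. All function names and constants are ours.

```python
# A toy 1-D Bayesian quadrature sketch: estimate rho = E_{x~N(mu, sigma^2)}[f(x)] with a
# zero-mean GP prior on f and the kernel k(x,x') = exp(-(x-x')^2 / (2 ell^2)).
import numpy as np

def bq_posterior(xs, ys, ell=1.0, noise=1e-6, mu=0.0, sigma=1.0):
    """Posterior mean and variance of rho = int f(x) N(x; mu, sigma^2) dx."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / ell**2)     # Gram matrix K_M
    A = K + noise * np.eye(len(xs))                                  # K_M + Sigma_M
    z = np.sqrt(ell**2 / (ell**2 + sigma**2)) * \
        np.exp(-0.5 * (xs - mu) ** 2 / (ell**2 + sigma**2))          # z_M(i) = int k(x, x_i) p(x) dx
    z0 = np.sqrt(ell**2 / (ell**2 + 2.0 * sigma**2))                 # z_0 = int int k(x,x') p(x) p(x') dx dx'
    mean = z @ np.linalg.solve(A, ys)                                # E(rho | D_M)
    var = z0 - z @ np.linalg.solve(A, z)                             # Var(rho | D_M)
    return mean, var

rng = np.random.default_rng(0)
xs = rng.normal(size=20)                 # the samples need not come from p(x), but here they do
ys = xs**2                               # noiseless evaluations of f(x) = x^2; exact rho = 1
print("MC :", ys.mean())
print("BQ :", bq_posterior(xs, ys))
```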
ηB (θ) is a random variable both because of the noise in R(ξ) and the Bayesian uncertainty. Under
the quadratic loss, our Bayesian performance measure is E(ηB (θ)|DM ). Since we are interested
in optimizing performance rather than evaluating it, we evaluate the posterior distribution of the
gradient of ηB (θ). For the mean we have
∇E(η_B(θ)|D_M) = E(∇η_B(θ)|D_M) = E( ∫ R(ξ) (∇Pr(ξ; θ)/Pr(ξ; θ)) Pr(ξ; θ) dξ | D_M ).   (12)
Consequently, in BPG we cast the problem of estimating the gradient of the expected return in
the form of Eq. 6. As described in Sec. 3, we partition the integrand into two parts, f (ξ; θ) and
p(ξ; θ). We will place the GP prior over f and assume that p is known. We will then proceed by
calculating the posterior moments of the gradient ∇ηB (θ) conditioned on the observed data. Next,
we investigate two different ways of partitioning the integrand in Eq. 12, resulting in two distinct
Bayesian models. Table 1 summarizes the two models we use in this work. Our choice of Fisher-type
kernels was motivated by the notion that a good representation should depend on the data generating
process (see [13, 14] for a thorough discussion). Our particular choices of linear and quadratic Fisher
kernels were guided by the requirement that the posterior moments of the gradient be analytically
tractable. In Table 1 we made use of the following definitions: F_M = (f(ξ_1; θ), . . . , f(ξ_M; θ)) ∼ N(0, K_M), Y_M = (y(ξ_1), . . . , y(ξ_M)) ∼ N(0, K_M + σ²I), U_M = [u(ξ_1), u(ξ_2), . . . , u(ξ_M)], Z_M = ∫ ∇Pr(ξ; θ) k_M(ξ)^⊤ dξ, and Z_0 = ∫∫ k(ξ, ξ′) ∇Pr(ξ; θ) ∇Pr(ξ′; θ)^⊤ dξ dξ′. Finally, n is the number of policy parameters, and G = E(u(ξ)u(ξ)^⊤) is the Fisher information matrix.
We can now use Models 1 and 2 to define algorithms for evaluating the gradient of the expected
return with respect to the policy parameters. The pseudo-code for these algorithms is shown in
Alg. 1. The generic algorithm (for either model) takes a set of policy parameters θ and a sample size
M as input, and returns an estimate of the posterior moments of the gradient of the expected return.
Algorithm 1: A Bayesian Policy Gradient Evaluation Algorithm

1: BPG_Eval(θ, M)   // policy parameters θ ∈ R^n, sample size M > 0 //
2: Set G = G(θ), D_0 = ∅
3: for i = 1 to M do
4:    Sample a path ξ_i using the policy µ(θ)
5:    D_i = D_{i−1} ∪ {ξ_i}
6:    Compute u(ξ_i) = Σ_{t=0}^{T_i−1} ∇ log µ(a_t|x_t; θ)
7:    R(ξ_i) = Σ_{t=0}^{T_i−1} r(x_t, a_t)
8:    Update K_i using K_{i−1} and ξ_i
9:    y(ξ_i) = R(ξ_i)u(ξ_i)  (Model 1)   or   y(ξ_i) = R(ξ_i)  (Model 2)
      (z_M)_i = 1 + u(ξ_i)^⊤ G^{−1} u(ξ_i)  (Model 1)   or   Z_M(:, i) = u(ξ_i)  (Model 2)
10: end for
11: C_M = (K_M + σ²I)^{−1}
12: Compute the posterior mean and covariance:
      E(∇η_B(θ)|D_M) = Y_M C_M z_M ,   Cov(∇η_B(θ)|D_M) = (z_0 − z_M^⊤ C_M z_M)I   (Model 1)   or
      E(∇η_B(θ)|D_M) = Z_M C_M y_M ,   Cov(∇η_B(θ)|D_M) = Z_0 − Z_M C_M Z_M^⊤   (Model 2)
13: return E(∇η_B(θ)|D_M), Cov(∇η_B(θ)|D_M)
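For concreteness, here is a compact rendering (ours, not the authors' code) of BPG_Eval for Model 1. It assumes that the Fisher matrix G and a path sampler returning (u(ξ), R(ξ)) are supplied, and uses the kernel, z_M, z_0, and posterior expressions from Table 1; the function and argument names are our own.

```python
# A sketch of Alg. 1 (Model 1): Bayesian evaluation of the policy gradient.
# `sample_path(theta)` is assumed to roll out one trajectory under mu(theta) and
# return its score vector u(xi) and return R(xi); G is the Fisher matrix G(theta).
import numpy as np

def bpg_eval_model1(theta, sample_path, G, M=20, sigma2=1e-4):
    n = len(theta)
    Ginv = np.linalg.inv(G)
    U = np.zeros((n, M))                       # columns are the path scores u(xi_i)
    R = np.zeros(M)                            # returns R(xi_i)
    for i in range(M):
        U[:, i], R[i] = sample_path(theta)
    Q = U.T @ Ginv @ U                         # Q[i, j] = u_i^T G^-1 u_j
    K = (1.0 + Q) ** 2                         # quadratic Fisher kernel k(xi_i, xi_j)
    z = 1.0 + np.diag(Q)                       # (z_M)_i = 1 + u_i^T G^-1 u_i
    z0 = 1.0 + n
    C = np.linalg.inv(K + sigma2 * np.eye(M))  # C_M = (K_M + sigma^2 I)^-1
    Y = U * R                                  # columns y(xi_i) = R(xi_i) u(xi_i)
    mean = Y @ C @ z                           # E(grad eta_B(theta) | D_M)
    cov = (z0 - z @ C @ z) * np.eye(n)         # Cov(grad eta_B(theta) | D_M)
    return mean, cov
```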
The kernel functions used in Models 1 and 2 are both based on the Fisher information matrix G(θ).
Consequently, every time we update the policy parameters we need to recompute G. In Alg. 1 we
assume that G is known; however, in most practical situations this will not be the case. Let us briefly
outline two possible approaches for estimating the Fisher information matrix.
MC Estimation: At each step j, our BPG algorithm generates M sample paths using the current
policy parameters θ j in order to estimate the gradient ∇ηB (θ j ). We can use these generated sample
paths to estimate the Fisher information matrix G(θ_j) by replacing the expectation in G with empirical averaging as

Ĝ_MC(θ_j) = (1/Σ_{i=1}^{M} T_i) Σ_{i=1}^{M} Σ_{t=0}^{T_i−1} ∇ log µ(a_t|x_t; θ_j) ∇ log µ(a_t|x_t; θ_j)^⊤.
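A sketch of this empirical average, given the per-step score vectors collected from the sampled paths (the nested-list input format is our assumption):

```python
# Empirical Fisher information: the average of grad log mu(a_t|x_t) outer products over
# all time steps of all sampled paths, as in the MC estimate of G above.
import numpy as np

def fisher_mc(step_scores):
    """step_scores: list over paths; each entry is a (T_i, n) array of per-step scores."""
    S = np.vstack(step_scores)       # stack all time steps of all paths: (sum_i T_i, n)
    return S.T @ S / S.shape[0]      # (1 / sum_i T_i) * sum of outer products
```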
Model-Based Policy Gradient: The Fisher information matrix depends on the probability distri-
bution over paths. This distribution is a product of two factors, one corresponding to the current
policy, and the other corresponding to the MDP dynamics P0 and P (see Eq. 1). Thus, if the MDP
dynamics are known, the Fisher information matrix can be evaluated off-line. We can model the
MDP dynamics using some parameterized model, and estimate the model parameters using maxi-
mum likelihood or Bayesian methods. This would be a model-based approach to policy gradient,
which would allow us to transfer information between different policies.
Alg. 1 can be made significantly more efficient, both in time and memory, by sparsifying the so-
lution. Such sparsification may be performed incrementally, and helps to numerically stabilize the
algorithm when the kernel matrix is singular, or nearly so. Here we use an on-line sparsification
method from [15] to selectively add a new observed path to a set of dictionary paths DM , which are
used as a basis for approximating the full solution. Lack of space prevents us from discussing this
method in further detail (see Chapter 2 in [15] for a thorough discussion).
The Bayesian policy gradient (BPG) algorithm is described in Alg. 2. This algorithm starts with an
initial vector of policy parameters θ0 and updates the parameters in the direction of the posterior
mean of the gradient of the expected return, computed by Alg. 1. This is repeated N times, or
alternatively, until the gradient estimate is sufficiently close to zero.
Algorithm 2: A Bayesian Policy Gradient Algorithm

1: BPG(θ_0, α, N, M)   // initial policy parameters θ_0, learning rates (α_j)_{j=0}^{N−1}, number of policy updates N > 0, BPG_Eval sample size M > 0 //
2: for j = 0 to N − 1 do
3:    Δθ_j = E(∇η_B(θ_j)|D_M) from BPG_Eval(θ_j, M)
4:    θ_{j+1} = θ_j + α_j Δθ_j   (regular gradient)   or   θ_{j+1} = θ_j + α_j G^{−1}(θ_j)Δθ_j   (natural gradient)
5: end for
6: return θ_N
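The outer loop of Alg. 2 then reduces to a few lines. This sketch (ours) reuses `bpg_eval_model1` from the earlier sketch, takes a `fisher` callable that is assumed to return G(θ) (e.g., the MC estimate above or a known model-based form), and uses a decaying learning-rate schedule like the one in Sec. 5.

```python
# A sketch of Alg. 2: repeatedly estimate the posterior mean of the gradient with
# BPG_Eval and step the policy parameters (regular or natural gradient).
import numpy as np

def bpg(theta0, sample_path, fisher, N=100, M=20, alpha0=0.05, natural=False):
    theta = np.array(theta0, dtype=float)
    for j in range(N):
        G = fisher(theta)                                    # Fisher matrix G(theta_j)
        grad, _ = bpg_eval_model1(theta, sample_path, G, M)  # posterior mean of grad eta_B
        step = np.linalg.solve(G, grad) if natural else grad
        theta = theta + alpha0 * (20.0 / (20.0 + j)) * step  # decaying learning rate
    return theta
```

If the return being differentiated is a cost to be minimized (as in the LQR problem of Sec. 5.2), one would negate the step (or the returns) when using this sketch.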
5 Experimental Results
In this section, we compare the BQ and MC gradient estimators in a continuous-action bandit prob-
lem and a continuous state and action linear quadratic regulation (LQR) problem. We also evaluate
the performance of the BPG algorithm (Alg. 2) on the LQR problem, and compare it with a standard
MC-based policy gradient (MCPG) algorithm.
5.1 A Bandit Problem
In this simple example, we compare the BQ and MC estimates of the gradient (for a fixed set of
policy parameters) using the same samples. Our simple bandit problem has a single state and A = R.
Thus, each path ξi consists of a single action ai . The policy, and therefore also the distribution over
paths, is given by a ∼ N(θ_1 = 0, θ_2² = 1). The score function of the path ξ = a and the Fisher information matrix are given by u(ξ) = [a, a² − 1]^⊤ and G = diag(1, 2), respectively.
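As a sanity check (not in the original text), these quantities follow directly from the Gaussian policy; a sketch of the algebra, including the exact gradient for r(a) = a:

```latex
% Score and Fisher matrix of the Gaussian policy a ~ N(theta_1, theta_2^2),
% evaluated at theta_1 = 0, theta_2 = 1, and the exact gradient for r(a) = a.
\begin{align*}
\log\mu(a;\theta) &= -\log\theta_2 - \frac{(a-\theta_1)^2}{2\theta_2^2} + \mathrm{const},\\
u(\xi) &= \nabla_\theta \log\mu(a;\theta)\Big|_{\theta=(0,1)}
        = \left(\frac{a-\theta_1}{\theta_2^2},\;
                \frac{(a-\theta_1)^2}{\theta_2^3}-\frac{1}{\theta_2}\right)^{\!\top}\Bigg|_{\theta=(0,1)}
        = \left(a,\; a^2-1\right)^{\top},\\
G &= \mathbf{E}\!\left(u(\xi)u(\xi)^{\top}\right)
   = \begin{pmatrix}\mathbf{E}(a^2) & \mathbf{E}(a^3-a)\\ \mathbf{E}(a^3-a) & \mathbf{E}(a^4-2a^2+1)\end{pmatrix}
   = \operatorname{diag}(1,2),\\
\nabla\eta &= \mathbf{E}\!\left(r(a)\,u(\xi)\right)
   = \left(\mathbf{E}(a^2),\;\mathbf{E}(a^3-a)\right)^{\top} = (1,\,0)^{\top}
   \quad\text{for } r(a)=a,
\end{align*}
```

which matches the "Exact" column of Table 2; the same computation for r(a) = a² gives (0, 2)^⊤.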
Table 2 shows the exact gradient of the expected return and its MC and BQ estimates (using 10 and 100 samples) for two versions of the simple bandit problem corresponding to two different deterministic reward functions, r(a) = a and r(a) = a². The average over 10⁴ runs of the MC and BQ estimates and their standard deviations are reported in Table 2. The true gradient is analytically tractable and is reported as "Exact" in Table 2 for reference.
                Exact     MC (10)             BQ (10)             MC (100)            BQ (100)
r(a) = a        1          0.9950 ± 0.438      0.9856 ± 0.050      1.0004 ± 0.140      1.000 ± 0.000001
                0         −0.0011 ± 0.977      0.0006 ± 0.060      0.0040 ± 0.317      0.000 ± 0.000004
r(a) = a²       0          0.0136 ± 1.246      0.0010 ± 0.082      0.0051 ± 0.390      0.000 ± 0.000003
                2          2.0336 ± 2.831      1.9250 ± 0.226      1.9869 ± 0.857      2.000 ± 0.000011

Table 2: The true gradient of the expected return and its MC and BQ estimates for two bandit problems.
As shown in Table 2, the BQ estimate has much lower variance than the MC estimate for both small
and large sample sizes. The BQ estimate also has a lower bias than the MC estimate for the large
sample size (M = 100), and almost the same bias for the small sample size (M = 10).
5.2 A Linear Quadratic Regulator
In this section, we consider the following linear system in which the goal is to minimize the expected
return over 20 steps. Thus, it is an episodic problem with paths of length 20.
System                                                          Policy
Initial state:  x_0 ∼ N(0.3, 0.001)                             Actions:     a_t ∼ µ(·|x_t; θ) = N(λx_t, σ²)
Rewards:        r_t = x_t² + 0.1a_t²                            Parameters:  θ = (λ, σ)^⊤
Transitions:    x_{t+1} = x_t + a_t + n_x;   n_x ∼ N(0, 0.01)
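A minimal simulator for this setup (our own sketch, treating the second arguments of N(·, ·) as variances); its output matches the (u(ξ), R(ξ)) interface assumed in the earlier sketches, with the caveat that the return here is a cost to be minimized:

```python
# A sketch of the episodic LQR problem of Sec. 5.2 (path length T = 20). The policy is
# a_t ~ N(lambda * x_t, sigma^2) with theta = (lambda, sigma); the rollout returns the
# path score u(xi) = sum_t grad_theta log mu(a_t|x_t; theta) and the (cost) return.
import numpy as np

def lqr_rollout(theta, T=20, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    lam, sig = theta
    x = rng.normal(0.3, np.sqrt(0.001))                 # x_0 ~ N(0.3, 0.001)
    scores, cost = [], 0.0
    for _ in range(T):
        a = rng.normal(lam * x, sig)                    # a_t ~ N(lambda x_t, sigma^2)
        cost += x**2 + 0.1 * a**2                       # r_t = x_t^2 + 0.1 a_t^2 (to be minimized)
        scores.append([(a - lam * x) * x / sig**2,      # d/d lambda of log mu
                       (a - lam * x) ** 2 / sig**3 - 1.0 / sig])  # d/d sigma of log mu
        x = x + a + rng.normal(0.0, np.sqrt(0.01))      # x_{t+1} = x_t + a_t + n_x
    return np.sum(scores, axis=0), cost                 # (u(xi), R(xi))
```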
We first compare the BQ and MC estimates of the gradient of the expected return for the policy
induced by the parameters λ = −0.2 and σ = 1. We use several different sample sizes (number of
paths used for gradient estimation) M = 5j, j = 1, . . . , 20 for the BQ and MC estimates. For each sample size, we compute both the MC and BQ estimates 10⁴ times, using the same samples. The true gradient is estimated using MC with 10⁷ sample paths for comparison purposes.
Figure 1 shows the mean squared error (MSE) (first column), and the mean absolute angular error
(second column) of the MC and BQ estimates of the gradient for several different sample sizes.
The absolute angular error is the absolute value of the angle between the true gradient and the
estimated gradient. In this figure, the BQ gradient estimate was calculated using Model 1 without
sparsification. With a good choice of sparsification threshold, we can attain almost identical results
much faster and more efficiently with sparsification. These results are not shown here due to space
limitations. To give an intuition concerning the speed and the efficiency attained by sparsification,
we should mention that the dimension of the feature space for the kernel used in Model 1 is 6
(Proposition 9.2 in [14]). Therefore, we deal with a kernel matrix of size 6 with sparsification versus
a kernel matrix of size M = 5j , j = 1, . . . , 20 without sparsification.
We ran another set of experiments, in which we add i.i.d. Gaussian noise to the rewards: r_t = x_t² + 0.1a_t² + n_r; n_r ∼ N(0, σ_r² = 0.1). In Model 2, we can model this by the measurement noise covariance matrix Σ = Tσ_r²I, where T = 20 is the path length. Since each reward r_t is a Gaussian random variable with variance σ_r², the return R(ξ) = Σ_{t=0}^{T−1} r_t will also be a Gaussian random variable with variance Tσ_r². The results are presented in the third and fourth columns of Figure 1.
These experiments indicate that the BQ gradient estimate has lower variance than its MC counterpart. In fact, whereas the performance of the MC estimate improves at a rate of 1/M, the performance of the BQ estimate improves at a higher rate.
Figure 1: Results for the LQR problem using Model 1 (left) and Model 2 (right), without sparsification. The Model 2 results are for a LQR problem in which the rewards are corrupted by i.i.d. Gaussian noise. For each algorithm, we show the MSE (left) and the mean absolute angular error (right), as functions of the number of sample paths M. Note that the errors are plotted on a logarithmic scale. All results are averages over 10⁴ runs.
Next, we use BPG to optimize the policy parameters in the LQR problem. Figure 2 shows the
performance of the BPG algorithm with the regular (BPG) and the natural (BPNG) gradient es-
timates, versus a MC-based policy gradient (MCPG) algorithm, for the sample sizes (number of
sample paths used for estimating the gradient of a policy) M = 5, 10, 20, and 40. We use Alg. 2
with the number of updates set to N = 100, and Model 1 for the BPG and BPNG methods. Since
Alg. 2 computes the Fisher information matrix for each set of policy parameters, an estimate of the
natural gradient is provided at little extra cost at each step. The returns obtained by these meth-
ods are averaged over 104 runs for sample sizes 5 and 10, and over 103 runs for sample sizes
20 and 40. The policy parameters are initialized randomly at each run. In order to ensure that
the learned parameters do not exceed an acceptable range, the policy parameters are defined as
λ = −1.999 + 1.998/(1 + eν1 ) and σ = 0.001 + 1/(1 + eν2 ). The optimal solution is λ∗ ≈ −0.92
and σ ∗ = 0.001 (ηB (λ∗ , σ ∗ ) = 0.1003) corresponding to ν1∗ ≈ −0.16 and ν2∗ → ∞.
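A quick numeric check (ours) of this parameterization at the reported optimum:

```python
# The squashed parameterization of Sec. 5.2: nu_1* ~ -0.16 maps to lambda* ~ -0.92,
# and sigma approaches its lower bound 0.001 as nu_2 -> infinity.
import numpy as np
lam = lambda nu1: -1.999 + 1.998 / (1.0 + np.exp(nu1))
sig = lambda nu2: 0.001 + 1.0 / (1.0 + np.exp(nu2))
print(lam(-0.16))   # ~ -0.920
print(sig(50.0))    # ~ 0.001
```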
Figure 2: A comparison of the average expected returns of BPG using regular (BPG) and natural (BPNG) gradient estimates, with the average expected return of the MCPG algorithm, for sample sizes 5, 10, 20, and 40.
Figure 2 shows that MCPG performs better than the BPG algorithm for the smallest sam-
ple size (M = 5), whereas for larger samples BPG dominates MCPG. This phenomenon is
also reported in [16]. We use two different learning rates for the two components of the
gradient. For a fixed sample size, each method starts with an initial learning rate, and de-
creases it according to the schedule αj = α0 (20/(20 + j)). Table 3 summarizes the best
initial learning rates for each algorithm. The selected learning rates for BPNG are signif-
icantly larger than those for BPG and MCPG, which explains why BPNG initially learns
faster than BPG and MCPG, but contrary to our expectations, eventually performs worse.
Figure 4: A comparison of the average return of BPG when the Fisher information matrix is known (BPG),
and when it is estimated using MC (BPG-MC) and ML (BPG-ML) methods, for sample sizes 10, 20, and 40
(from left to right). The average return of the MCPG algorithm is also provided for comparison.
6 Discussion
In this paper we proposed an alternative approach to conventional frequentist policy gradient esti-
mation procedures, which is based on the Bayesian view. Our algorithms use GPs to define a prior
distribution over the gradient of the expected return, and compute the posterior, conditioned on the
observed data. The experimental results are encouraging, but we conjecture that even higher gains
may be attained using this approach. This calls for additional theoretical and empirical work.
Although the proposed policy updating algorithm (Alg. 2) uses only the posterior mean of the gradi-
ent in its updates, we hope that more elaborate algorithms can be devised that would make judicious
use of the covariance information provided by the gradient estimation algorithm (Alg. 1). Two ob-
vious possibilities are: 1) risk-aware selection of the update step-size and direction, and 2) using
the variance in a termination condition for Alg. 1. Other interesting directions include 1) investi-
gating other possible partitions of the integrand in the expression for ∇ηB (θ) into a GP term f and
a known term p, 2) using other types of kernel functions, such as sequence kernels, 3) combining
our approach with MDP model estimation, to allow transfer of learning between different policies,
4) investigating methods for learning the Fisher information matrix, 5) extending the Bayesian ap-
proach to Actor-Critic type of algorithms, possibly by combining BPG with the Gaussian process
temporal difference (GPTD) algorithms of [15].
Acknowledgments We thank Rich Sutton and Dale Schuurmans for helpful discussions. M.G.
would like to thank Shie Mannor for his useful comments at the early stages of this work. M.G. is
supported by iCORE and Y.E. is partially supported by an Alberta Ingenuity fellowship.
References
[1] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning, 8:229–256, 1992.
[2] P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, MIT, 1998.
[3] J. Baxter and P. Bartlett. Infinite-horizon policy-gradient estimation. JAIR, 15:319–350, 2001.
[4] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning
with function approximation. In Proceedings of NIPS 12, pages 1057–1063, 2000.
[5] S. Kakade. A natural policy gradient. In Proceedings of NIPS 14, 2002.
[6] J. Bagnell and J. Schneider. Covariant policy search. In Proceedings of the 18th IJCAI, 2003.
[7] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. In Proceedings
of the Third IEEE-RAS International Conference on Humanoid Robots, 2003.
[8] J. Berger and R. Wolpert. The Likelihood Principle. Inst. of Mathematical Statistics, Hayward, CA, 1984.
[9] A. O’Hagan. Monte Carlo is fundamentally unsound. The Statistician, 36:247–249, 1987.
[10] A. O’Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29, 1991.
[11] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[12] R. Sutton and A. Barto. An Introduction to Reinforcement Learning. MIT Press, 1998.
[13] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proceedings
of NIPS 11. MIT Press, 1998.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[15] Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, The Hebrew Univer-
sity of Jerusalem, Israel, 2005.
[16] C. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Proceedings of NIPS 15. MIT Press, 2003.