Bayesian Policy Gradient Algorithms

Mohammad Ghavamzadeh    Yaakov Engel
Abstract
Policy gradient methods are reinforcement learning algorithms that adapt a param-
eterized policy by following a performance gradient estimate. Conventional pol-
icy gradient methods use Monte-Carlo techniques to estimate this gradient. Since
Monte Carlo methods tend to have high variance, a large number of samples is
required, resulting in slow convergence. In this paper, we propose a Bayesian
framework that models the policy gradient as a Gaussian process. This reduces
the number of samples needed to obtain accurate gradient estimates. Moreover,
estimates of the natural gradient as well as a measure of the uncertainty in the
gradient estimates are provided at little extra cost.
1 Introduction
Policy Gradient (PG) methods are Reinforcement Learning (RL) algorithms that maintain a param-
eterized action-selection policy and update the policy parameters by moving them in the direction
of an estimate of the gradient of a performance measure. Early examples of PG algorithms are the
class of REINFORCE algorithms of Williams [1] which are suitable for solving problems in which
the goal is to optimize the average reward. Subsequent work (e.g., [2, 3]) extended these algorithms
to the cases of infinite-horizon Markov decision processes (MDPs) and partially observable MDPs
(POMDPs), and provided much needed theoretical analysis. However, both the theoretical results
and empirical evaluations have highlighted a major shortcoming of these algorithms, namely, the
high variance of the gradient estimates. This problem may be traced to the fact that in most interest-
ing cases, the time-average of the observed rewards is a high-variance (although unbiased) estimator
of the true average reward, resulting in the sample-inefficiency of these algorithms.
One solution proposed for this problem was to use a small (i.e., smaller than 1) discount factor in
these algorithms [2, 3], however, this creates another problem by introducing bias into the gradient
estimates. Another solution, which does not involve biasing the gradient estimate, is to subtract
a reinforcement baseline from the average reward estimate in the updates of PG algorithms (e.g.,
[4, 1]). Another approach for speeding-up policy gradient algorithms was recently proposed in [5]
and extended in [6, 7]. The idea is to replace the policy-gradient estimate with an estimate of the
so-called natural policy-gradient. This is motivated by the requirement that a change in the way the
policy is parametrized should not influence the result of the policy update. In terms of the policy
update rule, the move to a natural-gradient rule amounts to linearly transforming the gradient using
the inverse Fisher information matrix of the policy.
However, both conventional and natural policy gradient methods rely on Monte-Carlo (MC) tech-
niques to estimate the gradient of the performance measure. Monte-Carlo estimation is a frequentist
procedure, and as such violates the likelihood principle [8].¹ Moreover, although MC estimates are
unbiased, they tend to produce high variance estimates, or alternatively, require excessive sample
sizes (see [9] for a discussion).
¹ The likelihood principle states that in a parametric statistical model, all the information about a data sample that is required for inferring the model parameters is contained in the likelihood function of that sample.
In [10] a Bayesian alternative to MC estimation is proposed. The idea is to model integrals of
the form ∫ f(x)p(x)dx as Gaussian Processes (GPs). This is done by treating the first term f in
the integrand as a random function, the randomness of which reflects our subjective uncertainty
concerning its true identity. This allows us to incorporate our prior knowledge on f into its prior
distribution. Observing (possibly noisy) samples of f at a set of points (x1 , x2 , . . . , xM ) allows us
to employ Bayes’ rule to compute a posterior distribution of f , conditioned on these samples. This,
in turn, induces a posterior distribution over the value of the integral. In this paper, we propose a
Bayesian framework for policy gradient, by modeling the gradient as a GP. This reduces the number
of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient
and the gradient covariance are provided at little extra cost.
A path ξ = (x_0, a_0, x_1, a_1, . . . , a_{T−1}, x_T) generated by following the policy µ has probability

Pr(ξ|µ) = P_0(x_0) ∏_{t=0}^{T−1} µ(a_t|x_t) P(x_{t+1}|x_t, a_t).   (1)

We denote by R(ξ) = Σ_{t=0}^{T−1} γ^t r(x_t, a_t) the (possibly discounted, γ ∈ [0, 1]) cumulative return of the path ξ. R(ξ) is a random variable both because the path ξ is a random variable, and because, even for a given path, each of the rewards sampled in it may be stochastic. The expected value of R(ξ) for a given ξ is denoted by R̄(ξ). Finally, let us define the expected return,

η(µ) = E(R(ξ)) = ∫ R̄(ξ) Pr(ξ|µ) dξ.   (2)
Gradient-based approaches to policy search in RL have recently received much attention as a means to sidestep problems of partial observability and of policy oscillations and even divergence encountered in value-function based methods (see [11], Sec. 6.4.2 and 6.5.3). In policy gradient (PG) methods, we define a class of smoothly parameterized stochastic policies {µ(·|x; θ), x ∈ X, θ ∈ Θ}, estimate the gradient of the expected return (2) with respect to the policy parameters θ from observed system trajectories, and then improve the policy by adjusting the parameters in the direction of the gradient [1, 2, 3]. The gradient of the expected return η(θ) = η(µ(·|·; θ)) is given by²
∇η(θ) = ∫ R̄(ξ) (∇Pr(ξ; θ)/Pr(ξ; θ)) Pr(ξ; θ) dξ,   (3)

where Pr(ξ; θ) = Pr(ξ|µ(·|·; θ)). The quantity ∇Pr(ξ; θ)/Pr(ξ; θ) = ∇ log Pr(ξ; θ) is known as the score function or likelihood ratio. Since the initial state distribution P_0 and the transition distribution P are independent of the policy parameters θ, we can write the score of a path ξ using Eq. 1 as

u(ξ) = ∇Pr(ξ; θ)/Pr(ξ; θ) = Σ_{t=0}^{T−1} ∇µ(a_t|x_t; θ)/µ(a_t|x_t; θ) = Σ_{t=0}^{T−1} ∇ log µ(a_t|x_t; θ).   (4)
² Throughout the paper, we use the notation ∇ to denote ∇_θ, the gradient w.r.t. the policy parameters.
Previous work on policy gradient methods used classical Monte-Carlo to estimate the gradient in
Eq. 3. These methods generate i.i.d. sample paths ξ1 , . . . , ξM according to Pr(ξ; θ), and estimate
the gradient ∇η(θ) using the MC estimator
∇η̂_MC(θ) = (1/M) Σ_{i=1}^{M} R(ξ_i) ∇ log Pr(ξ_i; θ) = (1/M) Σ_{i=1}^{M} R(ξ_i) Σ_{t=0}^{T_i−1} ∇ log µ(a_{t,i}|x_{t,i}; θ).   (5)
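To make Eq. 5 concrete, here is a minimal sketch (ours, not from the paper) of the MC estimator for a state-independent Gaussian policy; the `sample_path` callable standing in for an episode rollout is a hypothetical interface.

```python
# A sketch of the Monte-Carlo gradient estimator of Eq. 5 for a Gaussian policy
# a ~ N(theta[0], exp(theta[1])^2). `sample_path(theta)` is a hypothetical rollout
# returning the actions taken and the rewards received along one trajectory.
import numpy as np

def score(theta, a):
    """grad_theta log mu(a; theta) for a state-independent Gaussian policy."""
    mean, std = theta[0], np.exp(theta[1])
    return np.array([(a - mean) / std**2,              # derivative w.r.t. the mean
                     (a - mean) ** 2 / std**2 - 1.0])  # derivative w.r.t. the log-std

def mc_gradient(theta, sample_path, M=100):
    """Eq. 5: average over M paths of R(xi_i) * sum_t grad log mu(a_t; theta)."""
    grad = np.zeros(2)
    for _ in range(M):
        actions, rewards = sample_path(theta)          # one trajectory xi_i
        grad += np.sum(rewards) * sum(score(theta, a) for a in actions)
    return grad / M
```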
3 Bayesian Quadrature
Bayesian quadrature (BQ) [10] is a Bayesian method for evaluating an integral using samples of its
integrand. We consider the problem of evaluating the integral
ρ = ∫ f(x)p(x)dx.   (6)
If p(x) is a probability density function, this becomes the problem of evaluating the expected value
of f (x). In MC estimation of such expectations, samples (x1 , x2 , . . . , xM ) are drawn from p(x),
and the integral is estimated as ρ̂_MC = (1/M) Σ_{i=1}^{M} f(x_i). ρ̂_MC is an unbiased estimate of ρ, with
variance that diminishes to zero as M → ∞. However, as O’Hagan points out, MC estimation is
fundamentally unsound, as it violates the likelihood principle, and moreover, does not make full use
of the data at hand [9] .
The alternative proposed in [10] is based on the following reasoning: In the Bayesian approach, f (·)
is random simply because it is numerically unknown. We are therefore uncertain about the value
of f (x) until we actually evaluate it. In fact, even then, our uncertainty is not always completely
removed, since measured samples of f (x) may be corrupted by noise. Modeling f as a Gaussian
process (GP) means that our uncertainty is completely accounted for by specifying a Normal prior
distribution over functions. This prior distribution is specified by its mean and covariance, and is
denoted by f (·) ∼ N {f0 (·), k(·, ·)}. This is shorthand for the statement that f is a GP with prior mean
E(f (x)) = f0 (x) and covariance Cov(f (x), f (x′ )) = k(x, x′ ), respectively. The choice of kernel
function k allows us to incorporate prior knowledge on the smoothness properties of the integrand
into the estimation procedure. When we are provided with a set of samples D_M = {(x_i, y_i)}_{i=1}^{M}, where y_i is a (possibly noisy) sample of f(x_i), we apply Bayes' rule to condition the prior on these sampled values. If the measurement noise is normally distributed, the result is a Normal posterior distribution of f|D_M, with the standard expressions for the posterior mean and covariance:

E(f(x)|D_M) = f_0(x) + k_M(x)^⊤ C_M (y_M − f_0),    Cov(f(x), f(x′)|D_M) = k(x, x′) − k_M(x)^⊤ C_M k_M(x′),

where k_M(x) = (k(x_1, x), . . . , k(x_M, x))^⊤, (K_M)_{i,j} = k(x_i, x_j), f_0 = (f_0(x_1), . . . , f_0(x_M))^⊤, y_M = (y_1, . . . , y_M)^⊤, and C_M = (K_M + Σ_M)^{−1}, with Σ_M the measurement noise covariance. Since ρ is a linear functional of f, its posterior is also Gaussian, with moments

E(ρ|D_M) = ρ_0 + z_M^⊤ C_M (y_M − f_0),    Var(ρ|D_M) = z_0 − z_M^⊤ C_M z_M,

where ρ_0 = ∫ f_0(x)p(x)dx, z_M = ∫ k_M(x)p(x)dx, and z_0 = ∫∫ k(x, x′)p(x)p(x′)dx dx′.
Note that ρ0 and z0 are the prior mean and variance of ρ, respectively.
Table 1: Summary of the two Bayesian policy gradient models.

                        Model 1                                              Model 2
Known part              p(ξ; θ) = Pr(ξ; θ)                                   p(ξ; θ) = ∇Pr(ξ; θ)
Uncertain part          f(ξ; θ) = R̄(ξ)∇ log Pr(ξ; θ)                         f(ξ) = R̄(ξ)
Measurement             y(ξ) = R(ξ)∇ log Pr(ξ; θ)                            y(ξ) = R(ξ)
Prior mean of f         E(f(ξ; θ)) = 0                                       E(f(ξ)) = 0
Prior cov. of f         Cov(f(ξ; θ), f(ξ′; θ)) = k(ξ, ξ′)I                   Cov(f(ξ), f(ξ′)) = k(ξ, ξ′)
E(∇η_B(θ)|D_M)          Y_M C_M z_M                                          Z_M C_M y_M
Cov(∇η_B(θ)|D_M)        (z_0 − z_M^⊤ C_M z_M)I                               Z_0 − Z_M C_M Z_M^⊤
Kernel function         k(ξ_i, ξ_j) = (1 + u(ξ_i)^⊤ G^{−1} u(ξ_j))²          k(ξ_i, ξ_j) = u(ξ_i)^⊤ G^{−1} u(ξ_j)
z_M / Z_M               (z_M)_i = 1 + u(ξ_i)^⊤ G^{−1} u(ξ_i)                 Z_M = U_M
z_0 / Z_0               z_0 = 1 + n                                          Z_0 = G
In order to prevent the problem from "degenerating into infinite regress", as phrased by O'Hagan [10], we should choose the functions p, k, and f_0 so as to allow us to solve the integrals defining ρ_0, z_M, and z_0 analytically. For instance, O'Hagan provides the analysis required for the case where these integrands are products of multivariate Gaussians and polynomials, referred to as Bayes-Hermite quadrature. One of the contributions of the present paper is in providing analogous analysis for
kernel functions that are based on the Fisher kernel [13, 14]. It is important to note that in MC
estimation, samples must be drawn from the distribution p(x), whereas in the Bayesian approach,
samples may be drawn from arbitrary distributions. This affords us flexibility in the choice of sample points, allowing us, for instance, to actively design the samples (x_1, x_2, . . . , x_M).
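To make the quadrature machinery above concrete, the following is a small illustrative sketch (not from the paper) of one-dimensional BQ with a zero prior mean, a Gaussian kernel, and a Gaussian p(x): the Bayes-Hermite setting in which z_M and z_0 are available in closed form. All function names and constants are ours.

```python
# A toy 1-D Bayesian quadrature sketch: estimate rho = E_{x~N(mu, sigma^2)}[f(x)] with a
# zero-mean GP prior on f and the kernel k(x,x') = exp(-(x-x')^2 / (2 ell^2)).
import numpy as np

def bq_posterior(xs, ys, ell=1.0, noise=1e-6, mu=0.0, sigma=1.0):
    """Posterior mean and variance of rho = int f(x) N(x; mu, sigma^2) dx."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / ell**2)     # Gram matrix K_M
    A = K + noise * np.eye(len(xs))                                  # K_M + Sigma_M
    z = np.sqrt(ell**2 / (ell**2 + sigma**2)) * \
        np.exp(-0.5 * (xs - mu) ** 2 / (ell**2 + sigma**2))          # z_M(i) = int k(x, x_i) p(x) dx
    z0 = np.sqrt(ell**2 / (ell**2 + 2.0 * sigma**2))                 # z_0 = int int k(x,x') p(x) p(x') dx dx'
    mean = z @ np.linalg.solve(A, ys)                                # E(rho | D_M)
    var = z0 - z @ np.linalg.solve(A, z)                             # Var(rho | D_M)
    return mean, var

rng = np.random.default_rng(0)
xs = rng.normal(size=20)                 # the samples need not come from p(x), but here they do
ys = xs**2                               # noiseless evaluations of f(x) = x^2; exact rho = 1
print("MC :", ys.mean())
print("BQ :", bq_posterior(xs, ys))
```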
ηB (θ) is a random variable both because of the noise in R(ξ) and the Bayesian uncertainty. Under
the quadratic loss, our Bayesian performance measure is E(ηB (θ)|DM ). Since we are interested
in optimizing performance rather than evaluating it, we evaluate the posterior distribution of the
gradient of ηB (θ). For the mean we have
∇E(η_B(θ)|D_M) = E(∇η_B(θ)|D_M) = E( ∫ R(ξ) (∇Pr(ξ; θ)/Pr(ξ; θ)) Pr(ξ; θ) dξ | D_M ).   (12)
Consequently, in BPG we cast the problem of estimating the gradient of the expected return in
the form of Eq. 6. As described in Sec. 3, we partition the integrand into two parts, f (ξ; θ) and
p(ξ; θ). We will place the GP prior over f and assume that p is known. We will then proceed by
calculating the posterior moments of the gradient ∇ηB (θ) conditioned on the observed data. Next,
we investigate two different ways of partitioning the integrand in Eq. 12, resulting in two distinct
Bayesian models. Table 1 summarizes the two models we use in this work. Our choice of Fisher-type
kernels was motivated by the notion that a good representation should depend on the data generating
process (see [13, 14] for a thorough discussion). Our particular choices of linear and quadratic Fisher
kernels were guided by the requirement that the posterior moments of the gradient be analytically
tractable. In Table 1 we made use of the following definitions: F_M = (f(ξ_1; θ), . . . , f(ξ_M; θ)) ∼ N(0, K_M), Y_M = (y(ξ_1), . . . , y(ξ_M)) ∼ N(0, K_M + σ²I), U_M = [u(ξ_1), u(ξ_2), . . . , u(ξ_M)], Z_M = ∫ ∇Pr(ξ; θ) k_M(ξ)^⊤ dξ, and Z_0 = ∫∫ k(ξ, ξ′) ∇Pr(ξ; θ) ∇Pr(ξ′; θ)^⊤ dξ dξ′. Finally, n is the number of policy parameters, and G = E(u(ξ)u(ξ)^⊤) is the Fisher information matrix.
We can now use Models 1 and 2 to define algorithms for evaluating the gradient of the expected
return with respect to the policy parameters. The pseudo-code for these algorithms is shown in
Alg. 1. The generic algorithm (for either model) takes a set of policy parameters θ and a sample size
M as input, and returns an estimate of the posterior moments of the gradient of the expected return.
Algorithm 1: A Bayesian Policy Gradient Evaluation Algorithm

1: BPG_Eval(θ, M)   // policy parameters θ ∈ R^n, sample size M > 0 //
2: Set G = G(θ), D_0 = ∅
3: for i = 1 to M do
4:    Sample a path ξ_i using the policy µ(θ)
5:    D_i = D_{i−1} ∪ {ξ_i}
6:    Compute u(ξ_i) = Σ_{t=0}^{T_i−1} ∇ log µ(a_t|x_t; θ)
7:    R(ξ_i) = Σ_{t=0}^{T_i−1} r(x_t, a_t)
8:    Update K_i using K_{i−1} and ξ_i
9:    y(ξ_i) = R(ξ_i)u(ξ_i)  (Model 1)   or   y(ξ_i) = R(ξ_i)  (Model 2)
      (z_M)_i = 1 + u(ξ_i)^⊤ G^{−1} u(ξ_i)  (Model 1)   or   Z_M(:, i) = u(ξ_i)  (Model 2)
10: end for
11: C_M = (K_M + σ²I)^{−1}
12: Compute the posterior mean and covariance:
      E(∇η_B(θ)|D_M) = Y_M C_M z_M ,   Cov(∇η_B(θ)|D_M) = (z_0 − z_M^⊤ C_M z_M)I   (Model 1)   or
      E(∇η_B(θ)|D_M) = Z_M C_M y_M ,   Cov(∇η_B(θ)|D_M) = Z_0 − Z_M C_M Z_M^⊤   (Model 2)
13: return E(∇η_B(θ)|D_M), Cov(∇η_B(θ)|D_M)
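For concreteness, here is a compact rendering (ours, not the authors' code) of BPG_Eval for Model 1. It assumes that the Fisher matrix G and a path sampler returning (u(ξ), R(ξ)) are supplied, and uses the kernel, z_M, z_0, and posterior expressions from Table 1; the function and argument names are our own.

```python
# A sketch of Alg. 1 (Model 1): Bayesian evaluation of the policy gradient.
# `sample_path(theta)` is assumed to roll out one trajectory under mu(theta) and
# return its score vector u(xi) and return R(xi); G is the Fisher matrix G(theta).
import numpy as np

def bpg_eval_model1(theta, sample_path, G, M=20, sigma2=1e-4):
    n = len(theta)
    Ginv = np.linalg.inv(G)
    U = np.zeros((n, M))                       # columns are the path scores u(xi_i)
    R = np.zeros(M)                            # returns R(xi_i)
    for i in range(M):
        U[:, i], R[i] = sample_path(theta)
    Q = U.T @ Ginv @ U                         # Q[i, j] = u_i^T G^-1 u_j
    K = (1.0 + Q) ** 2                         # quadratic Fisher kernel k(xi_i, xi_j)
    z = 1.0 + np.diag(Q)                       # (z_M)_i = 1 + u_i^T G^-1 u_i
    z0 = 1.0 + n
    C = np.linalg.inv(K + sigma2 * np.eye(M))  # C_M = (K_M + sigma^2 I)^-1
    Y = U * R                                  # columns y(xi_i) = R(xi_i) u(xi_i)
    mean = Y @ C @ z                           # E(grad eta_B(theta) | D_M)
    cov = (z0 - z @ C @ z) * np.eye(n)         # Cov(grad eta_B(theta) | D_M)
    return mean, cov
```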
The kernel functions used in Models 1 and 2 are both based on the Fisher information matrix G(θ).
Consequently, every time we update the policy parameters we need to recompute G. In Alg. 1 we
assume that G is known; however, in most practical situations this will not be the case. Let us briefly
outline two possible approaches for estimating the Fisher information matrix.
MC Estimation: At each step j, our BPG algorithm generates M sample paths using the current
policy parameters θ j in order to estimate the gradient ∇ηB (θ j ). We can use these generated sample
paths to estimate the Fisher information matrix G(θ_j) by replacing the expectation in G with empirical averaging as

Ĝ_MC(θ_j) = (1/Σ_{i=1}^{M} T_i) Σ_{i=1}^{M} Σ_{t=0}^{T_i−1} ∇ log µ(a_t|x_t; θ_j) ∇ log µ(a_t|x_t; θ_j)^⊤.
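A sketch of this empirical average, given the per-step score vectors collected from the sampled paths (the nested-list input format is our assumption):

```python
# Empirical Fisher information: the average of grad log mu(a_t|x_t) outer products over
# all time steps of all sampled paths, as in the MC estimate of G above.
import numpy as np

def fisher_mc(step_scores):
    """step_scores: list over paths; each entry is a (T_i, n) array of per-step scores."""
    S = np.vstack(step_scores)       # stack all time steps of all paths: (sum_i T_i, n)
    return S.T @ S / S.shape[0]      # (1 / sum_i T_i) * sum of outer products
```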
Model-Based Policy Gradient: The Fisher information matrix depends on the probability distri-
bution over paths. This distribution is a product of two factors, one corresponding to the current
policy, and the other corresponding to the MDP dynamics P0 and P (see Eq. 1). Thus, if the MDP
dynamics are known, the Fisher information matrix can be evaluated off-line. We can model the
MDP dynamics using some parameterized model, and estimate the model parameters using maxi-
mum likelihood or Bayesian methods. This would be a model-based approach to policy gradient,
which would allow us to transfer information between different policies.
Alg. 1 can be made significantly more efficient, both in time and memory, by sparsifying the so-
lution. Such sparsification may be performed incrementally, and helps to numerically stabilize the
algorithm when the kernel matrix is singular, or nearly so. Here we use an on-line sparsification
method from [15] to selectively add a new observed path to a set of dictionary paths DM , which are
used as a basis for approximating the full solution. Lack of space prevents us from discussing this
method in further detail (see Chapter 2 in [15] for a thorough discussion).
The Bayesian policy gradient (BPG) algorithm is described in Alg. 2. This algorithm starts with an
initial vector of policy parameters θ0 and updates the parameters in the direction of the posterior
mean of the gradient of the expected return, computed by Alg. 1. This is repeated N times, or
alternatively, until the gradient estimate is sufficiently close to zero.
Algorithm 2: A Bayesian Policy Gradient Algorithm

1: BPG(θ_0, α, N, M)   // initial policy parameters θ_0, learning rates (α_j)_{j=0}^{N−1}, number of policy updates N > 0, BPG_Eval sample size M > 0 //
2: for j = 0 to N − 1 do
3:    Δθ_j = E(∇η_B(θ_j)|D_M) from BPG_Eval(θ_j, M)
4:    θ_{j+1} = θ_j + α_j Δθ_j   (regular gradient)   or   θ_{j+1} = θ_j + α_j G^{−1}(θ_j)Δθ_j   (natural gradient)
5: end for
6: return θ_N
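The outer loop of Alg. 2 then reduces to a few lines. This sketch (ours) reuses `bpg_eval_model1` from the earlier sketch, takes a `fisher` callable that is assumed to return G(θ) (e.g., the MC estimate above or a known model-based form), and uses a decaying learning-rate schedule like the one in Sec. 5.

```python
# A sketch of Alg. 2: repeatedly estimate the posterior mean of the gradient with
# BPG_Eval and step the policy parameters (regular or natural gradient).
import numpy as np

def bpg(theta0, sample_path, fisher, N=100, M=20, alpha0=0.05, natural=False):
    theta = np.array(theta0, dtype=float)
    for j in range(N):
        G = fisher(theta)                                    # Fisher matrix G(theta_j)
        grad, _ = bpg_eval_model1(theta, sample_path, G, M)  # posterior mean of grad eta_B
        step = np.linalg.solve(G, grad) if natural else grad
        theta = theta + alpha0 * (20.0 / (20.0 + j)) * step  # decaying learning rate
    return theta
```

If the return being differentiated is a cost to be minimized (as in the LQR problem of Sec. 5.2), one would negate the step (or the returns) when using this sketch.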
5 Experimental Results
In this section, we compare the BQ and MC gradient estimators in a continuous-action bandit prob-
lem and a continuous state and action linear quadratic regulation (LQR) problem. We also evaluate
the performance of the BPG algorithm (Alg. 2) on the LQR problem, and compare it with a standard
MC-based policy gradient (MCPG) algorithm.
5.1 A Bandit Problem
In this simple example, we compare the BQ and MC estimates of the gradient (for a fixed set of
policy parameters) using the same samples. Our simple bandit problem has a single state and A = R.
Thus, each path ξi consists of a single action ai . The policy, and therefore also the distribution over
paths, is given by a ∼ N(θ_1 = 0, θ_2² = 1). The score function of the path ξ = a and the Fisher information matrix are given by u(ξ) = [a, a² − 1]^⊤ and G = diag(1, 2), respectively.
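As a sanity check (not in the original text), these quantities follow directly from the Gaussian policy; a sketch of the algebra, including the exact gradient for r(a) = a:

```latex
% Score and Fisher matrix of the Gaussian policy a ~ N(theta_1, theta_2^2),
% evaluated at theta_1 = 0, theta_2 = 1, and the exact gradient for r(a) = a.
\begin{align*}
\log\mu(a;\theta) &= -\log\theta_2 - \frac{(a-\theta_1)^2}{2\theta_2^2} + \mathrm{const},\\
u(\xi) &= \nabla_\theta \log\mu(a;\theta)\Big|_{\theta=(0,1)}
        = \left(\frac{a-\theta_1}{\theta_2^2},\;
                \frac{(a-\theta_1)^2}{\theta_2^3}-\frac{1}{\theta_2}\right)^{\!\top}\Bigg|_{\theta=(0,1)}
        = \left(a,\; a^2-1\right)^{\top},\\
G &= \mathbf{E}\!\left(u(\xi)u(\xi)^{\top}\right)
   = \begin{pmatrix}\mathbf{E}(a^2) & \mathbf{E}(a^3-a)\\ \mathbf{E}(a^3-a) & \mathbf{E}(a^4-2a^2+1)\end{pmatrix}
   = \operatorname{diag}(1,2),\\
\nabla\eta &= \mathbf{E}\!\left(r(a)\,u(\xi)\right)
   = \left(\mathbf{E}(a^2),\;\mathbf{E}(a^3-a)\right)^{\top} = (1,\,0)^{\top}
   \quad\text{for } r(a)=a,
\end{align*}
```

which matches the "Exact" column of Table 2; the same computation for r(a) = a² gives (0, 2)^⊤.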
Table 2 shows the exact gradient of the expected return and its MC and BQ estimates (using 10 and 100 samples) for two versions of the simple bandit problem corresponding to two different deterministic reward functions, r(a) = a and r(a) = a². The average over 10⁴ runs of the MC and BQ estimates and their standard deviations are reported in Table 2. The true gradient is analytically tractable and is reported as "Exact" in Table 2 for reference.
                Exact     MC (10)             BQ (10)             MC (100)            BQ (100)
r(a) = a        1          0.9950 ± 0.438      0.9856 ± 0.050      1.0004 ± 0.140      1.000 ± 0.000001
                0         −0.0011 ± 0.977      0.0006 ± 0.060      0.0040 ± 0.317      0.000 ± 0.000004
r(a) = a²       0          0.0136 ± 1.246      0.0010 ± 0.082      0.0051 ± 0.390      0.000 ± 0.000003
                2          2.0336 ± 2.831      1.9250 ± 0.226      1.9869 ± 0.857      2.000 ± 0.000011

Table 2: The true gradient of the expected return and its MC and BQ estimates for two bandit problems.
As shown in Table 2, the BQ estimate has much lower variance than the MC estimate for both small
and large sample sizes. The BQ estimate also has a lower bias than the MC estimate for the large
sample size (M = 100), and almost the same bias for the small sample size (M = 10).
5.2 A Linear Quadratic Regulator
In this section, we consider the following linear system in which the goal is to minimize the expected
return over 20 steps. Thus, it is an episodic problem with paths of length 20.
System                                                          Policy
Initial state:  x_0 ∼ N(0.3, 0.001)                             Actions:     a_t ∼ µ(·|x_t; θ) = N(λx_t, σ²)
Rewards:        r_t = x_t² + 0.1a_t²                            Parameters:  θ = (λ, σ)^⊤
Transitions:    x_{t+1} = x_t + a_t + n_x;   n_x ∼ N(0, 0.01)
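A minimal simulator for this setup (our own sketch, treating the second arguments of N(·, ·) as variances); its output matches the (u(ξ), R(ξ)) interface assumed in the earlier sketches, with the caveat that the return here is a cost to be minimized:

```python
# A sketch of the episodic LQR problem of Sec. 5.2 (path length T = 20). The policy is
# a_t ~ N(lambda * x_t, sigma^2) with theta = (lambda, sigma); the rollout returns the
# path score u(xi) = sum_t grad_theta log mu(a_t|x_t; theta) and the (cost) return.
import numpy as np

def lqr_rollout(theta, T=20, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    lam, sig = theta
    x = rng.normal(0.3, np.sqrt(0.001))                 # x_0 ~ N(0.3, 0.001)
    scores, cost = [], 0.0
    for _ in range(T):
        a = rng.normal(lam * x, sig)                    # a_t ~ N(lambda x_t, sigma^2)
        cost += x**2 + 0.1 * a**2                       # r_t = x_t^2 + 0.1 a_t^2 (to be minimized)
        scores.append([(a - lam * x) * x / sig**2,      # d/d lambda of log mu
                       (a - lam * x) ** 2 / sig**3 - 1.0 / sig])  # d/d sigma of log mu
        x = x + a + rng.normal(0.0, np.sqrt(0.01))      # x_{t+1} = x_t + a_t + n_x
    return np.sum(scores, axis=0), cost                 # (u(xi), R(xi))
```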
We first compare the BQ and MC estimates of the gradient of the expected return for the policy
induced by the parameters λ = −0.2 and σ = 1. We use several different sample sizes (number of
paths used for gradient estimation) M = 5j, j = 1, . . . , 20 for the BQ and MC estimates. For each sample size, we compute both the MC and BQ estimates 10⁴ times, using the same samples. The true gradient is estimated using MC with 10⁷ sample paths for comparison purposes.
Figure 1 shows the mean squared error (MSE) (first column), and the mean absolute angular error
(second column) of the MC and BQ estimates of the gradient for several different sample sizes.
The absolute angular error is the absolute value of the angle between the true gradient and the
estimated gradient. In this figure, the BQ gradient estimate was calculated using Model 1 without
sparsification. With a good choice of sparsification threshold, we can attain almost identical results
much faster and more efficiently with sparsification. These results are not shown here due to space
limitations. To give an intuition concerning the speed and the efficiency attained by sparsification,
we should mention that the dimension of the feature space for the kernel used in Model 1 is 6
(Proposition 9.2 in [14]). Therefore, we deal with a kernel matrix of size 6 with sparsification versus
a kernel matrix of size M = 5j , j = 1, . . . , 20 without sparsification.
We ran another set of experiments, in which we add i.i.d. Gaussian noise to the rewards: r_t = x_t² + 0.1a_t² + n_r; n_r ∼ N(0, σ_r² = 0.1). In Model 2, we can model this by the measurement noise covariance matrix Σ = Tσ_r²I, where T = 20 is the path length. Since each reward r_t is a Gaussian random variable with variance σ_r², the return R(ξ) = Σ_{t=0}^{T−1} r_t will also be a Gaussian random variable with variance Tσ_r². The results are presented in the third and fourth columns of Figure 1.
These experiments indicate that the BQ gradient estimate has lower variance than its MC counterpart. In fact, whereas the performance of the MC estimate improves at a rate of 1/M, the performance of the BQ estimate improves at a higher rate.
Figure 1: Results for the LQR problem using Model 1 (left) and Model 2 (right), without sparsification. The Model 2 results are for a LQR problem in which the rewards are corrupted by i.i.d. Gaussian noise. For each algorithm, we show the MSE (left) and the mean absolute angular error (right), as functions of the number of sample paths M. Note that the errors are plotted on a logarithmic scale. All results are averages over 10⁴ runs.
Next, we use BPG to optimize the policy parameters in the LQR problem. Figure 2 shows the
performance of the BPG algorithm with the regular (BPG) and the natural (BPNG) gradient es-
timates, versus a MC-based policy gradient (MCPG) algorithm, for the sample sizes (number of
sample paths used for estimating the gradient of a policy) M = 5, 10, 20, and 40. We use Alg. 2
with the number of updates set to N = 100, and Model 1 for the BPG and BPNG methods. Since
Alg. 2 computes the Fisher information matrix for each set of policy parameters, an estimate of the
natural gradient is provided at little extra cost at each step. The returns obtained by these meth-
ods are averaged over 104 runs for sample sizes 5 and 10, and over 103 runs for sample sizes
20 and 40. The policy parameters are initialized randomly at each run. In order to ensure that
the learned parameters do not exceed an acceptable range, the policy parameters are defined as
λ = −1.999 + 1.998/(1 + eν1 ) and σ = 0.001 + 1/(1 + eν2 ). The optimal solution is λ∗ ≈ −0.92
and σ ∗ = 0.001 (ηB (λ∗ , σ ∗ ) = 0.1003) corresponding to ν1∗ ≈ −0.16 and ν2∗ → ∞.
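A quick numeric check (ours) of this parameterization at the reported optimum:

```python
# The squashed parameterization of Sec. 5.2: nu_1* ~ -0.16 maps to lambda* ~ -0.92,
# and sigma approaches its lower bound 0.001 as nu_2 -> infinity.
import numpy as np
lam = lambda nu1: -1.999 + 1.998 / (1.0 + np.exp(nu1))
sig = lambda nu2: 0.001 + 1.0 / (1.0 + np.exp(nu2))
print(lam(-0.16))   # ~ -0.920
print(sig(50.0))    # ~ 0.001
```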
Figure 2: A comparison of the average expected returns of BPG using regular (BPG) and natural (BPNG) gradient estimates, with the average expected return of the MCPG algorithm, for sample sizes 5, 10, 20, and 40.
Figure 2 shows that MCPG performs better than the BPG algorithm for the smallest sam-
ple size (M = 5), whereas for larger samples BPG dominates MCPG. This phenomenon is
also reported in [16]. We use two different learning rates for the two components of the
gradient. For a fixed sample size, each method starts with an initial learning rate, and de-
creases it according to the schedule αj = α0 (20/(20 + j)). Table 3 summarizes the best
initial learning rates for each algorithm. The selected learning rates for BPNG are signif-
icantly larger than those for BPG and MCPG, which explains why BPNG initially learns
faster than BPG and MCPG, but contrary to our expectations, eventually performs worse.
Figure 4: A comparison of the average return of BPG when the Fisher information matrix is known (BPG),
and when it is estimated using MC (BPG-MC) and ML (BPG-ML) methods, for sample sizes 10, 20, and 40
(from left to right). The average return of the MCPG algorithm is also provided for comparison.
6 Discussion
In this paper we proposed an alternative approach to conventional frequentist policy gradient esti-
mation procedures, which is based on the Bayesian view. Our algorithms use GPs to define a prior
distribution over the gradient of the expected return, and compute the posterior, conditioned on the
observed data. The experimental results are encouraging, but we conjecture that even higher gains
may be attained using this approach. This calls for additional theoretical and empirical work.
Although the proposed policy updating algorithm (Alg. 2) uses only the posterior mean of the gradi-
ent in its updates, we hope that more elaborate algorithms can be devised that would make judicious
use of the covariance information provided by the gradient estimation algorithm (Alg. 1). Two ob-
vious possibilities are: 1) risk-aware selection of the update step-size and direction, and 2) using
the variance in a termination condition for Alg. 1. Other interesting directions include 1) investi-
gating other possible partitions of the integrand in the expression for ∇ηB (θ) into a GP term f and
a known term p, 2) using other types of kernel functions, such as sequence kernels, 3) combining
our approach with MDP model estimation, to allow transfer of learning between different policies,
4) investigating methods for learning the Fisher information matrix, 5) extending the Bayesian ap-
proach to Actor-Critic type of algorithms, possibly by combining BPG with the Gaussian process
temporal difference (GPTD) algorithms of [15].
Acknowledgments We thank Rich Sutton and Dale Schuurmans for helpful discussions. M.G.
would like to thank Shie Mannor for his useful comments at the early stages of this work. M.G. is
supported by iCORE and Y.E. is partially supported by an Alberta Ingenuity fellowship.
References
[1] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning, 8:229–256, 1992.
[2] P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, MIT, 1998.
[3] J. Baxter and P. Bartlett. Infinite-horizon policy-gradient estimation. JAIR, 15:319–350, 2001.
[4] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning
with function approximation. In Proceedings of NIPS 12, pages 1057–1063, 2000.
[5] S. Kakade. A natural policy gradient. In Proceedings of NIPS 14, 2002.
[6] J. Bagnell and J. Schneider. Covariant policy search. In Proceedings of the 18th IJCAI, 2003.
[7] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. In Proceedings
of the Third IEEE-RAS International Conference on Humanoid Robots, 2003.
[8] J. Berger and R. Wolpert. The Likelihood Principle. Inst. of Mathematical Statistics, Hayward, CA, 1984.
[9] A. O’Hagan. Monte Carlo is fundamentally unsound. The Statistician, 36:247–249, 1987.
[10] A. O’Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29, 1991.
[11] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[12] R. Sutton and A. Barto. An Introduction to Reinforcement Learning. MIT Press, 1998.
[13] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proceedings
of NIPS 11. MIT Press, 1998.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[15] Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, The Hebrew Univer-
sity of Jerusalem, Israel, 2005.
[16] C. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Proceedings of NIPS 15. MIT Press, 2003.