Variational methods for Reinforcement Learning
Thomas Furmston, David Barber
As $\theta$ is a set of independent categorical distributions, a natural conjugate prior $p(\theta)$ is the product of independent Dirichlet distributions, i.e.

$$p(\theta) = \prod_{s,a} \mathrm{Dir}(\theta_{s,a} \mid \alpha_{s,a}) \qquad (8)$$

where $\alpha$ are hyper-parameters. This gives the posterior

$$p(\theta \mid D) = \prod_{s,a} \mathrm{Dir}(\theta_{s,a} \mid c_{s,a} + \alpha_{s,a}) \qquad (9)$$

where $c$ is the count of observed transitions: $c^{s'}_{s,a}$ is the number of times the transition from state $s$ to state $s'$ under action $a$ occurs in $D$.

E-step: For fixed $\pi^{\mathrm{old}}$ find the best $q$ that maximises the r.h.s. of (15). For no constraint on $q$, this gives $q = p(s_{1:t}, a_{1:t}, t, \theta \mid \pi^{\mathrm{old}}, D)$.

M-step: For fixed $q$ find the best $\pi$ that maximises the r.h.s. of (15). This is equivalent to maximising the energy $\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_q$ w.r.t. $\pi$.

To perform the M-step we need the maximum of $\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_q$ w.r.t. $\pi$. As the policy is independent of the transitions, this maximisation gives updates of the form

$$\pi^{\mathrm{new}}_{a,s} \propto \sum_{t=1}^{H} \sum_{\tau=1}^{t} q(s_\tau = s, a_\tau = a, t) \qquad (16)$$

Under the factorised approximation (17) the updates of the two factors take the form

$$q(s_{1:t}, a_{1:t}, t) \propto e^{\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_{q(\theta)}} \qquad (19)$$

$$q(\theta) \propto p(\theta \mid D)\, e^{\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_{q_x}} \qquad (20)$$
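As a minimal illustration of the updates (9) and (16), the following Python sketch applies the conjugate Dirichlet update and the normalised M-step policy update to a small tabular MDP. The array layout and variable names are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

# Minimal sketch (assumed array layout, not the paper's code) of the conjugate
# Dirichlet update (9) and the M-step policy update (16) for a tabular MDP.

def dirichlet_posterior(alpha, counts):
    """Eq. (9): one Dirichlet per (s, a) pair with parameters alpha + counts.
    alpha, counts: arrays of shape (S, A, S), indexed as [s, a, s_next]."""
    return alpha + counts

def m_step_policy(q_sa):
    """Eq. (16): pi_new(a|s) is proportional to the double sum over t and tau
    of q(s_tau = s, a_tau = a, t); q_sa holds that summed quantity, shape (S, A)."""
    return q_sa / q_sa.sum(axis=1, keepdims=True)

# Toy usage with S = 2 states and A = 2 actions.
S, A = 2, 2
alpha = np.ones((S, A, S))                      # uniform Dirichlet prior
counts = np.array([[[3, 1], [0, 2]],
                   [[1, 1], [2, 0]]])           # hypothetical observed counts c
posterior_params = dirichlet_posterior(alpha, counts)

q_sa = np.array([[0.4, 0.1],
                 [0.2, 0.3]])                   # stand-in for the summed q marginals
pi_new = m_step_policy(q_sa)
print(posterior_params[0, 0], pi_new)           # each row of pi_new sums to 1
```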
Writing out the $\theta$-dependent terms in the exponent of (20) gives

$$q(\theta) \propto p(\theta \mid D)\, e^{\sum_{t=1}^{H} \sum_{\tau=1}^{t-1} \langle \log \theta_{s_{\tau+1}, s_\tau, a_\tau} \rangle_{q_x}}$$

The summation of the states and actions in the exponent means that we may write

$$q(\theta) = \prod_{s,a} \mathrm{Dir}\left(\theta_{s,a} \mid \alpha_{s,a} + c_{s,a} + r_{s,a}\right) \qquad (23)$$

where

$$r^{s'}_{s,a} = \sum_{t=1}^{H} \sum_{\tau=1}^{t-1} q(s_{\tau+1} = s', s_\tau = s, a_\tau = a, t) \qquad (24)$$

Equation (23) has an intuitive interpretation: for each triple $(s', s, a)$ we have the prior term $\alpha^{s'}_{s,a}$ and the observed counts $c^{s'}_{s,a}$, which deal with the posterior of the transitions. The term $r^{s'}_{s,a}$ encodes an approximate expected reward obtained from starting in state $s$, taking action $a$, entering state $s'$ and then following $\pi$ afterwards. The posterior $q(\theta)$ is therefore a standard Dirichlet posterior on the transitions, but biased towards transitions that are likely to lead to higher expected reward. Under the approximation (17) the E-step consists of calculating the distributions (21) and (23).

5 Expectation Propagation

In order to implement the Variational Reinforcement Learning approach of section 3 we require the marginals of the intractable distribution $q = p(s_{1:t}, a_{1:t}, t, \theta \mid \pi^{\mathrm{old}}, D)$. As an alternative to the variational Bayes factorised approach, we here consider an approximate message passing (AMP) approach that approximates the required marginals directly.

The graphical structure of $q(s_{1:t}, a_{1:t}, \theta, t)$ is loopy but sparse, so that a sum-product algorithm may provide reasonable approximate marginals, see figure 2. The messages for the factor graph version of the sum-product algorithm take the following form:

$$\mu_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \to x}(x) \qquad (25)$$

$$\mu_{f \to x}(x) = \sum_{\setminus \{x\}} f(X) \prod_{y \in n(f) \setminus \{x\}} \mu_{y \to f}(y) \qquad (26)$$

where $\sum_{\setminus \{x\}}$ means the sum over all variables except $x$, $n(\cdot)$ is the set of neighbouring nodes and $X$ are the variables of the factor $f$.
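As a concrete illustration of the messages (25) and (26), here is a minimal Python sketch of sum-product message passing for discrete factors. The factor-as-ndarray representation and the function names are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the sum-product messages (25)-(26) on a discrete factor
# graph. Factors are ndarrays over their variables; the representation below
# is an illustrative assumption.

def msg_var_to_factor(incoming_msgs):
    """Eq. (25): elementwise product of the messages arriving at variable x
    from all of its neighbouring factors except the target factor f."""
    out = incoming_msgs[0].copy()
    for m in incoming_msgs[1:]:
        out = out * m
    return out

def msg_factor_to_var(factor, target_axis, incoming_msgs):
    """Eq. (26): multiply the factor table f(X) by the incoming messages on
    every axis except target_axis, then sum out those other axes."""
    out = factor.astype(float).copy()
    for axis, m in incoming_msgs.items():
        shape = [1] * out.ndim
        shape[axis] = m.size
        out = out * m.reshape(shape)
    other_axes = tuple(ax for ax in range(out.ndim) if ax != target_axis)
    return out.sum(axis=other_axes)

# Toy usage: a factor f(x, y) with |x| = 3 and |y| = 2.
f_xy = np.array([[0.2, 0.1],
                 [0.4, 0.1],
                 [0.1, 0.1]])
mu_y_to_f = np.array([0.5, 0.5])                 # message from y into f
mu_f_to_x = msg_factor_to_var(f_xy, 0, {1: mu_y_to_f})
mu_x_to_g = msg_var_to_factor([mu_f_to_x])       # x forwards the product onwards
print(mu_f_to_x, mu_x_to_g)
```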
Algorithm 3 AMP message-passing Schedule
repeat
  for t = 1 to H do
    Perform message-passing along the t-th chain, $q(s_1, a_1, \ldots, s_t, a_t)$, figure 2, holding all the messages $T(\theta)$ fixed.
  end for
  repeat
    for each $T(\theta)$ do
      Perform Expectation-Propagation to obtain $q(\theta)$, then use (31) to update $T(\theta)$.
    end for
  until Convergence of all the messages $T(\theta)$.
until Convergence of the $q$-distribution.

$$T_i = \begin{pmatrix} \theta_i & 1 - \theta_i \\ 1 - \theta_i & \theta_i \end{pmatrix}, \qquad R = \begin{pmatrix} 4 & 10 \\ 1 & 1 \end{pmatrix}$$

Figure 3: The transition and reward matrices for the two-state toy problem. $T_i$ represents the transition matrix from state $s_i$, where the columns correspond to actions and the rows correspond to the next state. The reward matrix $R$ is defined so that the actions run along the rows and the states run along the columns.

The experiment was performed on a toy two-state problem, with the transition and reward matrices given in figure 3. The horizon was set to $H = 5$ and the initial state is 1. The aim of the experiment is to com- [...]

[...] when $\theta < \bar{\theta}$, and $\pi_{s_1,a_1} = 0$ otherwise, where $\bar{\theta} = 0.7021$. The fact that we know the point, $\bar{\theta}$, at which the optimal policy of the MDP changes means that we can form a distribution of $\pi^{\mathrm{ML}}_{s_1,a_1}$. Given the sample size and the true value of the transition parameter we have the distribution

$$p(\pi^{\mathrm{ML}}_{s_1,a_1} = 1 \mid N, \theta_{\mathrm{true}}) = \sum_{\{n \le N \,\mid\, n/N < \bar{\theta}\}} B_{N, \theta_{\mathrm{true}}}(n)$$
[Figure: plot comparing ML EM, SEM, VB EM, HVB EM and AMP EM.]