Variational methods for Reinforcement Learning
Thomas Furmston, David Barber
As $\theta$ is a set of independent categorical distributions, a natural conjugate prior $p(\theta)$ is the product of independent Dirichlet distributions, i.e.

$$p(\theta) = \prod_{s,a} \mathrm{Dir}(\theta_{s,a} \mid \alpha_{s,a}) \qquad (8)$$

where $\alpha$ are hyper-parameters. This gives the posterior

$$p(\theta \mid D) = \prod_{s,a} \mathrm{Dir}(\theta_{s,a} \mid c_{s,a} + \alpha_{s,a}) \qquad (9)$$

where $c$ is the count of observed transitions: $c^{s'}_{s,a}$ is the number of times the transition from state $s$ to state $s'$ under action $a$ occurs in $D$.

E-step: For fixed $\pi^{\mathrm{old}}$ find the best $q$ that maximises the r.h.s. of (15). For no constraint on $q$, this gives $q = p(s_{1:t}, a_{1:t}, t, \theta \mid \pi^{\mathrm{old}}, D)$.

M-step: For fixed $q$ find the best $\pi$ that maximises the r.h.s. of (15). This is equivalent to maximising the energy $\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_q$ w.r.t. $\pi$.

To perform the M-step we need the maximum of $\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_q$ w.r.t. $\pi$. As the policy is independent of the transitions, this maximisation gives updates of the form

$$\pi^{\mathrm{new}}_{a,s} \propto \sum_{t=1}^{H} \sum_{\tau=1}^{t} q(s_\tau = s, a_\tau = a, t) \qquad (16)$$

Under the factorised approximation (17) the updates of the two factors take the form

$$q(s_{1:t}, a_{1:t}, t) \propto e^{\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_{q(\theta)}} \qquad (19)$$

$$q(\theta) \propto p(\theta \mid D)\, e^{\langle \log p(s_{1:t}, a_{1:t}, t \mid \theta, \pi) \rangle_{q_x}} \qquad (20)$$
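As a minimal illustration of the updates (9) and (16), the following Python sketch applies the conjugate Dirichlet update and the normalised M-step policy update to a small tabular MDP. The array layout and variable names are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

# Minimal sketch (assumed array layout, not the paper's code) of the conjugate
# Dirichlet update (9) and the M-step policy update (16) for a tabular MDP.

def dirichlet_posterior(alpha, counts):
    """Eq. (9): one Dirichlet per (s, a) pair with parameters alpha + counts.
    alpha, counts: arrays of shape (S, A, S), indexed as [s, a, s_next]."""
    return alpha + counts

def m_step_policy(q_sa):
    """Eq. (16): pi_new(a|s) is proportional to the double sum over t and tau
    of q(s_tau = s, a_tau = a, t); q_sa holds that summed quantity, shape (S, A)."""
    return q_sa / q_sa.sum(axis=1, keepdims=True)

# Toy usage with S = 2 states and A = 2 actions.
S, A = 2, 2
alpha = np.ones((S, A, S))                      # uniform Dirichlet prior
counts = np.array([[[3, 1], [0, 2]],
                   [[1, 1], [2, 0]]])           # hypothetical observed counts c
posterior_params = dirichlet_posterior(alpha, counts)

q_sa = np.array([[0.4, 0.1],
                 [0.2, 0.3]])                   # stand-in for the summed q marginals
pi_new = m_step_policy(q_sa)
print(posterior_params[0, 0], pi_new)           # each row of pi_new sums to 1
```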
Writing out the $\theta$-dependent terms in the exponent of (20) gives

$$q(\theta) \propto p(\theta \mid D)\, e^{\sum_{t=1}^{H} \sum_{\tau=1}^{t-1} \langle \log \theta_{s_{\tau+1}, s_\tau, a_\tau} \rangle_{q_x}}$$

The summation of the states and actions in the exponent means that we may write

$$q(\theta) = \prod_{s,a} \mathrm{Dir}\left(\theta_{s,a} \mid \alpha_{s,a} + c_{s,a} + r_{s,a}\right) \qquad (23)$$

where

$$r^{s'}_{s,a} = \sum_{t=1}^{H} \sum_{\tau=1}^{t-1} q(s_{\tau+1} = s', s_\tau = s, a_\tau = a, t) \qquad (24)$$

Equation (23) has an intuitive interpretation: for each triple $(s', s, a)$ we have the prior term $\alpha^{s'}_{s,a}$ and the observed counts $c^{s'}_{s,a}$, which deal with the posterior of the transitions. The term $r^{s'}_{s,a}$ encodes an approximate expected reward obtained from starting in state $s$, taking action $a$, entering state $s'$ and then following $\pi$ afterwards. The posterior $q(\theta)$ is therefore a standard Dirichlet posterior on the transitions, but biased towards transitions that are likely to lead to higher expected reward. Under the approximation (17) the E-step consists of calculating the distributions (21) and (23).

5 Expectation Propagation

In order to implement the Variational Reinforcement Learning approach of section 3 we require the marginals of the intractable distribution $q = p(s_{1:t}, a_{1:t}, t, \theta \mid \pi^{\mathrm{old}}, D)$. As an alternative to the variational Bayes factorised approach, we here consider an approximate message passing (AMP) approach that approximates the required marginals directly.

The graphical structure of $q(s_{1:t}, a_{1:t}, \theta, t)$ is loopy but sparse, so that a sum-product algorithm may provide reasonable approximate marginals, see figure 2. The messages for the factor graph version of the sum-product algorithm take the following form:

$$\mu_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \to x}(x) \qquad (25)$$

$$\mu_{f \to x}(x) = \sum_{\setminus \{x\}} f(X) \prod_{y \in n(f) \setminus \{x\}} \mu_{y \to f}(y) \qquad (26)$$

where $\sum_{\setminus \{x\}}$ means the sum over all variables except $x$, $n(\cdot)$ is the set of neighbouring nodes and $X$ are the variables of the factor $f$.
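As a concrete illustration of the messages (25) and (26), here is a minimal Python sketch of sum-product message passing for discrete factors. The factor-as-ndarray representation and the function names are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the sum-product messages (25)-(26) on a discrete factor
# graph. Factors are ndarrays over their variables; the representation below
# is an illustrative assumption.

def msg_var_to_factor(incoming_msgs):
    """Eq. (25): elementwise product of the messages arriving at variable x
    from all of its neighbouring factors except the target factor f."""
    out = incoming_msgs[0].copy()
    for m in incoming_msgs[1:]:
        out = out * m
    return out

def msg_factor_to_var(factor, target_axis, incoming_msgs):
    """Eq. (26): multiply the factor table f(X) by the incoming messages on
    every axis except target_axis, then sum out those other axes."""
    out = factor.astype(float).copy()
    for axis, m in incoming_msgs.items():
        shape = [1] * out.ndim
        shape[axis] = m.size
        out = out * m.reshape(shape)
    other_axes = tuple(ax for ax in range(out.ndim) if ax != target_axis)
    return out.sum(axis=other_axes)

# Toy usage: a factor f(x, y) with |x| = 3 and |y| = 2.
f_xy = np.array([[0.2, 0.1],
                 [0.4, 0.1],
                 [0.1, 0.1]])
mu_y_to_f = np.array([0.5, 0.5])                 # message from y into f
mu_f_to_x = msg_factor_to_var(f_xy, 0, {1: mu_y_to_f})
mu_x_to_g = msg_var_to_factor([mu_f_to_x])       # x forwards the product onwards
print(mu_f_to_x, mu_x_to_g)
```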
Algorithm 3 AMP message-passing Schedule
repeat
  for t = 1 to H do
    Perform message-passing along the t-th chain, $q(s_1, a_1, \ldots, s_t, a_t)$, figure 2, holding all the messages $T(\theta)$ fixed.
  end for
  repeat
    for each $T(\theta)$ do
      Perform Expectation-Propagation to obtain $q(\theta)$, then use (31) to update $T(\theta)$.
    end for
  until Convergence of all the messages $T(\theta)$.
until Convergence of the $q$-distribution.

$$T_i = \begin{pmatrix} \theta_i & 1 - \theta_i \\ 1 - \theta_i & \theta_i \end{pmatrix}, \qquad R = \begin{pmatrix} 4 & 10 \\ 1 & 1 \end{pmatrix}$$

Figure 3: The transition and reward matrices for the two-state toy problem. $T_i$ represents the transition matrix from state $s_i$, where the columns correspond to actions and the rows correspond to the next state. The reward matrix $R$ is defined so that the actions run along the rows and the states run along the columns.

The experiment was performed on a toy two-state problem, with the transition and reward matrices given in figure 3. The horizon was set to $H = 5$ and the initial state is 1. The aim of the experiment is to com- [...]

[...] when $\theta < \bar{\theta}$, and $\pi_{s_1,a_1} = 0$ otherwise, where $\bar{\theta} = 0.7021$. The fact that we know the point, $\bar{\theta}$, at which the optimal policy of the MDP changes means that we can form a distribution of $\pi^{\mathrm{ML}}_{s_1,a_1}$. Given the sample size and the true value of the transition parameter we have the distribution

$$p(\pi^{\mathrm{ML}}_{s_1,a_1} = 1 \mid N, \theta_{\mathrm{true}}) = \sum_{\{n \le N \,\mid\, n/N < \bar{\theta}\}} B_{N, \theta_{\mathrm{true}}}(n)$$
[Figure: plot comparing ML EM, SEM, VB EM, HVB EM and AMP EM.]