
A Generalized Reinforcement-Learning Model: Convergence and Applications

Michael L. Littman
Department of Computer Science
Brown University
Providence, RI 02912-1910, USA
mlittman@cs.brown.edu

Csaba Szepesvári*
Research Group of Artificial Intelligence
"József Attila" University, Szeged
Szeged 6720, Aradi vrt. tere 1, HUNGARY
szepes@math.u-szeged.hu

*Also Department of Adaptive Systems, Joint Department of the "József Attila" University, Szeged, and the Institute of Isotopes of the Hungarian Academy of Sciences, Budapest 1525, P.O. Box 77, HUNGARY.

Abstract

Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games, and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

1 INTRODUCTION

Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-learning scenario is that of a single agent maximizing expected discounted total reward in a finite-state environment; in this scenario experiences are of the form (x, a, y, r), with state x, action a, resulting state y, and the agent's scalar immediate reward r. A discount parameter 0 ≤ γ < 1 controls the degree to which future rewards are significant compared to immediate rewards.

The theory of Markov decision processes has been used as a theoretical foundation for important results concerning this reinforcement-learning scenario. A (finite) Markov decision process (MDP) is defined by the tuple (S, A, P, R), where S is a finite set of states, A a finite set of actions, P a transition function, and R a reward function. The optimal behavior for an agent in an MDP depends on the optimality criterion; for the infinite-horizon expected discounted total-reward criterion, the optimal behavior can be found by identifying the optimal value function, defined recursively by

    V*(x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),

for all states x ∈ S, where R(x, a) is the immediate reward for taking action a from state x, γ the discount factor, and P(x, a, y) the probability that state y is reached from state x when action a ∈ A is chosen. These simultaneous equations, known as the Bellman equations, can be solved using a variety of techniques ranging from successive approximation to linear programming (Puterman, 1994).

In the absence of complete information regarding the transition and reward functions, reinforcement-learning methods can be used to find optimal value functions. Researchers have explored model-free (direct) methods, such as Q-learning (Watkins and Dayan, 1992), and model-based (indirect) methods, such as prioritized sweeping (Moore and Atkeson, 1993), and many converge to optimal value functions under the proper conditions (Tsitsiklis, 1994; Jaakkola et al., 1994; Gullapalli and Barto, 1994).
Not all reinforcement-learning scenarios of interest can be modeled as MDPs. For example, a great deal of reinforcement-learning research has been directed to the problem of solving two-player games (e.g., Tesauro, 1995), and the reinforcement-learning algorithms for solving MDPs and their convergence proofs do not apply directly to games.

In one form of two-player game, experiences are of the form (x, a, y, r), where states x and y contain additional information concerning which player (maximizer or minimizer) gets to choose the action in that state. There are deep similarities between MDPs and this type of game; for example, it is possible to define a set of Bellman equations for the optimal minimax value of a two-player zero-sum game,

    V*(x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),  if the maximizer moves in x;
    V*(x) = min_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),  if the minimizer moves in x,

where R(x, a) is the reward to the maximizing player. When 0 ≤ γ < 1, these equations have a unique solution and can be solved by successive-approximation methods. In addition, we show that simple extensions of several reinforcement-learning algorithms for MDPs converge to optimal value functions in these games.

In this paper, we introduce a generalized Markov decision process model with applications to reinforcement learning, and list some important results concerning the model. Generalized MDPs provide a foundation for the use of reinforcement learning in MDPs and games, as well as in risk-sensitive reinforcement learning (Heger, 1994), exploration-sensitive reinforcement learning (John, 1995), and reinforcement learning in simultaneous-action games (Littman, 1994). Our main theorem addresses conditions for the convergence of asynchronous stochastic processes and shows how these conditions relate to conditions for convergence of a corresponding synchronous process; it can be used to prove the convergence of model-free and model-based reinforcement-learning algorithms under a variety of reinforcement-learning scenarios.

In Section 2, we present generalized MDPs and motivate their form via two detailed examples. In Section 3, we describe a stochastic-approximation theorem, and in Section 4 we show several applications of the theorem that prove the convergence of learning processes in generalized MDPs.

2 THE GENERALIZED MODEL

In this section, we introduce our generalized MDP model. We begin by summarizing some of the more significant results regarding the standard MDP model and some important results for two-player games.

2.1 MARKOV DECISION PROCESSES

To provide a point of departure for our generalization of Markov decision processes, we first describe the use of reinforcement learning in MDPs; proofs of the unattributed claims can be found in Puterman's (1994) book. The ultimate target of learning is an optimal policy. A policy is some function that tells the agent which actions should be chosen under which circumstances. A policy π is optimal under the expected discounted total reward criterion if, with respect to the space of all possible policies, π maximizes the expected discounted total reward from all states.

Directly maximizing over the space of all possible policies is impractical. However, MDPs have an important property that makes it unnecessary to consider such a broad space of possibilities. We say a policy π is stationary and deterministic if it maps directly from states to actions, ignoring everything else, and we write π(x) as the action chosen by π when the current state is x. In expected discounted total reward MDP environments, there is always a stationary deterministic policy that is optimal; we will use the word "policy" to mean stationary deterministic policy, unless otherwise stated.

The value function for a policy π, V^π, maps states to their expected discounted total reward under policy π. It can be defined by the simultaneous equations

    V^π(x) = R(x, π(x)) + γ Σ_y P(x, π(x), y) V^π(y),

for all x ∈ S. The optimal value function V* is the value function of an optimal policy; it is unique for 0 ≤ γ < 1. The myopic policy with respect to a value function V is the policy π_V such that

    π_V(x) ∈ argmax_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V(y) ).

Any myopic policy with respect to the optimal value function is optimal.
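To make these definitions concrete, the following sketch (our illustration, not from the paper; the two-state MDP it uses is invented) evaluates a fixed deterministic policy by solving the simultaneous equations for V^π directly, and then extracts a myopic policy from the resulting value function.

```python
import numpy as np

# A hypothetical two-state, two-action MDP, used only for illustration.
# P[x, a, y] is the transition probability, R[x, a] the immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def evaluate_policy(policy):
    """Solve the linear system V = R_pi + gamma * P_pi V for a deterministic policy."""
    n = P.shape[0]
    P_pi = np.array([P[x, policy[x]] for x in range(n)])
    R_pi = np.array([R[x, policy[x]] for x in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def myopic_policy(V):
    """Greedy (myopic) policy with respect to a value function V."""
    Q = R + gamma * np.einsum('xay,y->xa', P, V)
    return Q.argmax(axis=1)

V_pi = evaluate_policy([0, 1])
print(V_pi, myopic_policy(V_pi))
```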
The Bellman equations can be operationalized in the form of the dynamic-programming operator T, which maps value functions to value functions:

    [TV](x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V(y) ).

For 0 ≤ γ < 1, successive applications of T to a value function bring it closer and closer to the optimal value function V*, which is the unique fixed point of T: V* = TV*.

In reinforcement learning, R and P are not known in advance. In model-based reinforcement learning, R and P are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates; this algorithm converges to the optimal value function under a wide variety of choices of the order states are updated (Gullapalli and Barto, 1994).

The method of Q-learning (Watkins and Dayan, 1992) uses experience to estimate the optimal value function without ever explicitly approximating R and P. The algorithm estimates the optimal Q function

    Q*(x, a) = R(x, a) + γ Σ_y P(x, a, y) V*(y),

from which the optimal value function can be computed via V*(x) = max_a Q*(x, a). Given the experience at step t (x_t, a_t, y_t, r_t) and the current estimate Q_t(x, a) of the optimal Q function, Q-learning updates

    Q_{t+1}(x_t, a_t) := (1 − α_t(x_t, a_t)) Q_t(x_t, a_t) + α_t(x_t, a_t) ( r_t + γ max_a Q_t(y_t, a) ),

where 0 ≤ α_t(x, a) ≤ 1 is a time-dependent learning rate controlling the blending rate of new estimates with old estimates for each state-action pair. The estimated Q function converges to Q* under the proper conditions (Watkins and Dayan, 1992).

2.2 ALTERNATING MARKOV GAMES

In alternating Markov games, two players take turns issuing actions to maximize their expected discounted total reward. The model is defined by the tuple (S1, S2, A, B, P, R), where S1 is the set of states in which player 1 issues actions from the set A, S2 is the set of states in which player 2 issues actions from the set B, P is the transition function, and R is the reward function for player 1. In the zero-sum games we consider, the rewards to player 2 (the minimizer) are simply the additive inverse of the rewards to player 1 (the maximizer). Markov decision processes are a special case of alternating Markov games in which S2 is empty. Condon (1992) proves this and the other unattributed results in this section.

A popular optimality criterion for alternating Markov games is discounted minimax optimality. Under this criterion, the maximizer chooses actions to maximize its reward against the minimizer's best possible counter-policy. A pair of policies is in equilibrium if neither player has any incentive to change policies if the other player's policy remains fixed. The value function for a pair of equilibrium policies is the optimal value function for the game; it is unique when 0 ≤ γ < 1, and can be found by successive approximation. For both players, there is always a deterministic stationary optimal policy. Any myopic policy with respect to the optimal value function is optimal.

Dynamic-programming operators, Bellman equations, and reinforcement-learning algorithms can be defined for alternating Markov games by starting with the definitions used in MDPs and changing the maximum operators to either maximums or minimums conditioned on the state. We show below that the resulting algorithms share their convergence properties with the analogous algorithms for MDPs.

2.3 GENERALIZED MDPS

In alternating Markov games and MDPs, optimal behavior can be specified by the Bellman equations; any myopic policy with respect to the optimal value function is optimal. In this section, we generalize the Bellman equations to define optimal behavior for a broad class of reinforcement-learning models. The objective criterion used in these models is additive in that the value of a policy is some measure of the total reward received.

The generalized Bellman equations can be written

    V*(x) = ⊗_a^x ( R(x, a) + γ ⊕_y^{x,a} V*(y) ).    (1)

Here ⊗_a^x is an operator that summarizes values over actions as a function of the state, and ⊕_y^{x,a} is an operator that summarizes values over next states as a function of the state and action. For Markov decision processes, ⊗_a^x f(x, a) = max_a f(x, a) and ⊕_y^{x,a} g(y) = Σ_y P(x, a, y) g(y). For alternating Markov games, ⊕_y^{x,a} is the same, and ⊗_a^x f(x, a) = max_a f(x, a) or min_a f(x, a) depending on whether x is in S1 or S2. Many models can be represented in this framework; see Section 4.
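As a concrete illustration of the two summary operators (our own sketch, not part of the paper), they can be passed around as ordinary functions; choosing max over actions and an expectation over next states recovers the MDP case, while switching between max and min depending on which player owns the state recovers alternating Markov games.

```python
import numpy as np

def otimes_mdp(f_x):
    """MDP summary over actions: f_x holds one value per action."""
    return f_x.max()

def otimes_alternating(f_x, maximizer_moves):
    """Alternating-game summary: max in the maximizer's states, min otherwise."""
    return f_x.max() if maximizer_moves else f_x.min()

def oplus_expectation(values_next, p_next):
    """Expected-value summary over next states (the MDP / game choice of the ⊕ operator)."""
    return float(np.dot(p_next, values_next))
```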
From a reinforcement-learning perspective, the value functions defined by the generalized MDP model can be interpreted as the total value of the rewards received by an agent selecting actions in a stochastic environment. The agent begins in state x, takes action a, and ends up in state y. The ⊕_y^{x,a} operator defines how the value of the next state should be used in assigning value to the current state. The ⊗_a^x operator defines how an optimal agent should choose actions.

When 0 ≤ γ < 1 and ⊗^x and ⊕^{x,a} are non-expansions, the generalized Bellman equations have a unique optimal solution, and therefore, the optimal value function is well defined. The ⊗^x operator is a non-expansion if

    | ⊗_a^x f1(x, a) − ⊗_a^x f2(x, a) | ≤ max_a | f1(x, a) − f2(x, a) |

for all f1, f2, and x. An analogous condition defines when ⊕^{x,a} is a non-expansion.

Many natural operators are non-expansions, such as max, min, midpoint, median, mean, and fixed weighted averages of these operations. Several previously described reinforcement-learning scenarios are special cases of this generalized MDP model, including computing the expected return of a fixed policy (Sutton, 1988), finding the optimal risk-averse policy (Heger, 1994), and finding the optimal exploration-sensitive policy (John, 1995).

As with MDPs, we can define a dynamic-programming operator

    [TV](x) = ⊗_a^x ( R(x, a) + γ ⊕_y^{x,a} V(y) ).    (2)

The operator T is a contraction mapping for 0 ≤ γ < 1. This means

    sup_x | [TV1](x) − [TV2](x) | ≤ γ sup_x | V1(x) − V2(x) |,

where V1 and V2 are arbitrary functions and 0 ≤ γ < 1 is the index of contraction.
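Because T is a contraction for 0 ≤ γ < 1, repeatedly applying Equation 2 from any starting point converges to V*. The sketch below is an illustrative implementation of this successive approximation with the summary operators left pluggable; the random four-state model in the usage example is invented for demonstration only.

```python
import numpy as np

def generalized_value_iteration(R, P, gamma, otimes, oplus, tol=1e-8):
    """Successive approximation with [TV](x) = otimes_a ( R(x,a) + gamma * oplus_y V(y) )."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        TV = np.array([
            otimes(np.array([R[x, a] + gamma * oplus(V, P[x, a])
                             for a in range(n_actions)]))
            for x in range(n_states)
        ])
        if np.max(np.abs(TV - V)) < tol:
            return TV
        V = TV

# Example instantiation: the MDP operators (max over actions, expectation over next states).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # invented 4-state, 2-action transition model
R = rng.uniform(size=(4, 2))
V_star = generalized_value_iteration(R, P, 0.9, np.max,
                                     lambda V, p: float(p @ V))
print(V_star)
```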
We can define a notion of stationary myopic policies with respect to a value function V; it is any (stochastic) policy π_V for which T^{π_V} V = TV, where

    [T^π V](x) = Σ_a π_a(x) ( R(x, a) + γ ⊕_y^{x,a} V(y) ).

Here π_a(x) represents the probability that an agent following π would choose action a in state x. To be certain that every value function possesses a myopic policy, we require that the operator ⊗^x satisfy the following property: for all functions f and states x,

    min_a f(x, a) ≤ ⊗_a^x f(x, a) ≤ max_a f(x, a).

The value function with respect to a policy π, V^π, can be defined by the simultaneous equations V^π = T^π V^π; it is unique when T^π is a contraction mapping. A policy π* is optimal if it is myopic with respect to its own value function. If π* is an optimal policy, then V^{π*} = V* because it solves the Bellman equation: V^{π*} = T^{π*} V^{π*} = T V^{π*}.

The next section describes a general theorem that can be used to prove the convergence of several reinforcement-learning algorithms for these and other models.
3 CONVERGENCE THEOREM

The process of finding an optimal value function can be viewed in the following general way. At any moment in time, there is a set of values representing the current approximation of the optimal value function. On each iteration, we apply some dynamic-programming operator, perhaps modified by experience, to the current approximation to generate a new approximation. Over time, we would like the approximation to tend toward the optimal value function.

In this process, there are two types of approximation going on simultaneously. The first is an approximation of the dynamic-programming operator for the underlying model, and the second is the use of the approximate dynamic-programming operator to find the optimal value function. This section presents a theorem that gives a set of conditions under which this type of simultaneous stochastic approximation converges to an optimal value function.

First, we need to define the general stochastic process. Let the set X be the states of the model, and the set B(X) of bounded, real-valued functions over X be the set of value functions. Let T : B(X) → B(X) be an arbitrary contraction mapping with fixed point V*.

If we had direct access to the contraction mapping T, we could use it to successively approximate V*. In most reinforcement-learning scenarios, T is not available and we must use experience to construct approximations of T. Consider a sequence of random operators T_t : B(X) → (B(X) → B(X)) and define U_{t+1} = [T_t U_t]V, where V and U_0 ∈ B(X) are arbitrary value functions. We say T_t approximates T at V if U_t converges to TV with probability 1 uniformly over X.¹ The idea is that T_t is a randomized version of T that uses U_t as "memory" to converge to TV.

The following theorem shows that, under the proper conditions, we can use the sequence T_t to estimate the fixed point V* of T.

Theorem 1  Let T be an arbitrary mapping with fixed point V*, and let T_t approximate T at V*. Let V_0 be an arbitrary value function, and define V_{t+1} = [T_t V_t]V_t. If there exist functions 0 ≤ F_t(x) ≤ 1 and 0 ≤ G_t(x) ≤ 1 satisfying the conditions below with probability one, then V_t converges to V* with probability 1 uniformly over X:

1. for all U1 and U2 ∈ B(X), and all x ∈ X,
       | ([T_t U1]V*)(x) − ([T_t U2]V*)(x) | ≤ G_t(x) | U1(x) − U2(x) |;

2. for all U and V ∈ B(X), and all x ∈ X,
       | ([T_t U]V*)(x) − ([T_t U]V)(x) | ≤ F_t(x) sup_{x'} | V*(x') − V(x') |;

3. for all k > 0, Π_{t=k}^{n} G_t(x) converges to zero uniformly in x as n increases; and,

4. there exists 0 ≤ γ < 1 such that for all x ∈ X and large enough t,
       F_t(x) ≤ γ (1 − G_t(x)).

Note that from the conditions of the theorem, it follows that T is a contraction operator at V* with index of contraction γ. The theorem is proven in a more detailed version of this paper (Szepesvári and Littman, 1996). We next describe some of the intuition behind the statement of the theorem and its conditions.

The iterative approximation of V* is performed by computing V_{t+1} = [T_t V_t]V_t, where T_t approximates T with the help of the "memory" present in V_t. Because of Conditions 1 and 2, G_t(x) is the extent to which the estimated value function depends on its present value, and F_t(x) ≈ 1 − G_t(x) is the extent to which the estimated value function is based on "new" information (this reasoning becomes clearer in the context of the applications in Section 4).

In some applications, such as Q-learning, the contribution of new information needs to decay over time to insure that the process converges. In this case, G_t(x) needs to converge to one; Condition 3 allows this as long as the convergence is slow enough to incorporate sufficient information for the process to converge.

Condition 4 links the values of G_t(x) and F_t(x) through some quantity γ < 1. If it were somehow possible to update the values synchronously over the entire state space, the process would converge to V* even when γ = 1. In the more interesting asynchronous case, when γ = 1 the long-term behavior of V_t is not immediately clear; it may even be that V_t converges to something other than V*. The requirement that γ < 1 insures that the use of outdated information in the asynchronous updates does not cause a problem in convergence.

One of the most noteworthy aspects of this theorem is that it shows how to reduce the problem of approximating V* to the problem of approximating T at a particular point V (in particular, it is enough if T can be approximated at V*); in many cases, the latter is much easier to achieve and also to prove. For example, the theorem makes the convergence of Q-learning a consequence of the classical Robbins-Monro theorem (Robbins and Monro, 1951).

¹A sequence of functions f_t converges to f* with probability 1 uniformly over X if, for the events w for which f_t(w, x) → f*, the convergence is uniform in x.

4 APPLICATIONS

This section makes use of Theorem 1 to prove the convergence of various reinforcement-learning algorithms.

4.1 GENERALIZED Q-LEARNING FOR EXPECTED VALUE MODELS

Consider the family of finite state and action generalized MDPs defined by the Bellman equations

    V*(x) = ⊗_a^x ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),

where the definition of ⊗^x does not depend on R or P. A Q-learning algorithm for this class of models can be defined as follows. Given experience (x_t, a_t, y_t, r_t) at time t and an estimate Q_t(x, a) of the optimal Q function, let

    Q_{t+1}(x_t, a_t) := (1 − α_t(x_t, a_t)) Q_t(x_t, a_t) + α_t(x_t, a_t) ( r_t + γ ⊗_a^{y_t} Q_t(y_t, a) ).
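A direct transcription of this update into code is sketched below. It is an illustration only: the environment interface (reset/step) and the epsilon-greedy behavior policy are our own assumptions, not part of the paper. The next-state summary operator ⊗ stays pluggable, so the same loop covers MDPs (max) and alternating games (max or min depending on the state's owner).

```python
import numpy as np

def generalized_q_learning(env, n_states, n_actions, otimes,
                           gamma=0.9, episodes=5000, epsilon=0.1):
    """Q_{t+1}(x_t,a_t) = (1 - a_t) Q_t(x_t,a_t) + a_t ( r_t + gamma * otimes_a Q_t(y_t, a) )."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        x = env.reset()                        # assumed environment interface
        done = False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[x].argmax())
            y, r, done = env.step(a)           # assumed environment interface
            visits[x, a] += 1
            alpha = 1.0 / visits[x, a]         # decaying learning rate
            # Equivalent to (1 - alpha) * Q[x, a] + alpha * (r + gamma * otimes(Q[y]))
            Q[x, a] += alpha * (r + gamma * otimes(Q[y]) - Q[x, a])
            x = y
    return Q
```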
We can derive the assumptions necessary for this learning algorithm to satisfy the conditions of Theorem 1 and therefore converge to the optimal Q values. The dynamic-programming operator defining the optimal Q function is

    [TQ](x, a) = R(x, a) + γ Σ_y P(x, a, y) ⊗_{a'}^y Q(y, a').

The randomized approximate dynamic-programming operator that gives rise to the Q-learning rule is

    ([T_t Q']Q)(x, a) = (1 − α_t(x, a)) Q'(x, a) + α_t(x, a) ( r_t + γ ⊗_{a'}^{y_t} Q(y_t, a') ),  if x = x_t and a = a_t;
    ([T_t Q']Q)(x, a) = Q'(x, a),  otherwise.

If

• y_t is randomly selected according to the probability distribution defined by P(x_t, a_t, ·),

• ⊗^x is a non-expansion, and both the expected value and the variance of ⊗_a^{y_t} Q(y_t, a) exist given the way y_t is sampled,

• r_t has finite variance and expected value given x_t and a_t equal to R(x_t, a_t),

• the learning rates are decayed so that Σ_t χ(x_t = x, a_t = a) α_t(x, a) = ∞ and Σ_t χ(x_t = x, a_t = a) α_t(x, a)² < ∞ with probability 1 uniformly over X × A,²

then a standard result from the theory of stochastic approximation (Robbins and Monro, 1951) states that T_t approximates T everywhere. That is, this method of using a decayed, exponentially weighted average correctly computes the average one-step reward.

Let

    G_t(x, a) = 1 − α_t(x, a), if x = x_t and a = a_t; 1, otherwise;
    F_t(x, a) = γ α_t(x, a), if x = x_t and a = a_t; 0, otherwise.

These functions satisfy the conditions of Theorem 1 (Condition 3 is implied by the restrictions placed on the sequence of learning rates α_t). Theorem 1 therefore implies that this generalized Q-learning algorithm converges to the optimal Q function with probability 1 uniformly over X × A. The convergence of Q-learning for discounted MDPs and alternating Markov games follows trivially from this. Extensions of this result for undiscounted "all-policies-proper" MDPs (Bertsekas and Tsitsiklis, 1989), a soft state aggregation learning rule (Singh et al., 1995), and a "spreading" learning rule are given in a more detailed version of this paper (Szepesvári and Littman, 1996).

²This condition implies, among other things, that every state-action pair is updated infinitely often. Here, χ denotes the characteristic function.

4.2 Q-LEARNING FOR MARKOV GAMES

Markov games are a generalization of MDPs and alternating Markov games in which both players simultaneously choose actions at each step. The basic model is defined by the tuple (S, A, B, P, R) and discount factor γ. As in alternating Markov games, the optimality criterion is one of discounted minimax optimality, but because the players move simultaneously, the Bellman equations take on a more complex form:

    V*(x) = max_{ρ ∈ Π(A)} min_{b ∈ B} Σ_{a ∈ A} ρ(a) ( R(x, a, b) + γ Σ_{y ∈ S} P(x, a, b, y) V*(y) ).

In these equations, R(x, a, b) is the immediate reward for the maximizer for taking action a in state x at the same time the minimizer takes action b, P(x, a, b, y) is the probability that state y is reached from state x when the maximizer takes action a and the minimizer takes action b, and Π(A) represents the set of discrete probability distributions over the set A. The sets S, A, and B are finite.

Once again, optimal policies are policies that are in equilibrium, and there is always a pair of optimal policies that are stationary. Unlike MDPs and alternating Markov games, the optimal policies are sometimes stochastic; there are Markov games in which no deterministic policy is optimal. The stochastic nature of optimal policies explains the need for the optimization over probability distributions in the Bellman equations, and stems from the fact that players must avoid being "second guessed" during action selection. An equivalent set of equations can be written with a stochastic choice for the minimizer, and also with the roles of the maximizer and minimizer reversed.
The Q-learning update rule for Markov games (Littman, 1994), given step t experience (x_t, a_t, b_t, y_t, r_t), has the form

    Q_{t+1}(x_t, a_t, b_t) := (1 − α_t(x_t, a_t, b_t)) Q_t(x_t, a_t, b_t) + α_t(x_t, a_t, b_t) ( r_t + γ ⊗_{a,b}^{y_t} Q_t(y_t, a, b) ),

where

    ⊗_{a,b}^x f(x, a, b) = max_{ρ ∈ Π(A)} min_{b ∈ B} Σ_{a ∈ A} ρ(a) f(x, a, b).

The results of the previous section prove that this rule converges to the optimal Q function under the proper conditions.
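The ⊗ operator here is the value of a matrix game, which can be computed with a small linear program. The sketch below is our illustration (it relies on scipy.optimize.linprog and is not prescribed by the paper); it returns the maximizer's minimax value of a payoff matrix f(x, ·, ·), which is exactly the summary needed in the update above.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(payoff):
    """Value of max_{rho in Pi(A)} min_b sum_a rho(a) * payoff[a, b], via a linear program."""
    n_a, n_b = payoff.shape
    # Variables: rho(a) for each maximizer action a, plus the game value v. Maximize v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                   # linprog minimizes, so minimize -v
    # Constraints: v - sum_a rho(a) * payoff[a, b] <= 0 for every minimizer action b.
    A_ub = np.hstack([-payoff.T, np.ones((n_b, 1))])
    b_ub = np.zeros(n_b)
    # rho must be a probability distribution.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# Example: the value of matching pennies is 0, achieved by the uniform mixed strategy.
print(matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```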

sUPT<EPIl IF""(x), ,,,here Po is the set of permitted (sta­


4,3 RISK-SENSITIVE MODELS tionary) policies, and the a:ssociated Bellman equa­
tions are
Heger ( 1994) described an optimality criterion for
\lDPs in which only the worst possible value of the next
state makes a contribution to the value of a state. An
optimal policy llllder this criterion is one that avoids
which corresponds to a generalized MDP model with
states for \vhich a bad outcome is possible, even if it is
EBy g(y)
x''' = Ly P(x,a, y)g(y) and 0n" f(x, a) =
not probable; for this reason, the criterion has a risk­
s UPT<E PIl La 1Ta ( ;J;)f( J�, a). Recallse iTa(x) ifl a proba­
averse quality to it. The generali7,eo Bellman equa­
bility distribution over a for any given state ;J;� Q9J:
tions for thifl criterion arc
is a non-expansion and, thus� the convergence of the
aflflociated Q-Iearning algorithm follmvfi from the argll­
Y' (x) = (8( (R(X, 0) + '( min Y' (y) ).
y:P(x,a,y»O ments in Section 4.1. As a result, John's learning rule
"
gives the optimal policy under the revised optimality
The al'gulnenL in SecLioIl 4.5 ohows that model-based criterion.
reinforcement learning can be used to find optimal
policies in risk-sensitive models, as long as Q9T does 4,5 MODEL-BASED METHODS

not depend on II. or P, and P itl estimated in a vvay


The defining &<;Jsumption in reinforcement learning is
that preserves its zero vs. non-lero nature in the limit.
that the rewa.rd and transiLion [undiontl, II. and p,
For the model in which 0/ f(", 0) lllAx,J(.r,a), = are not known in advance. AlthOllgh Q-learning shows
Heger defined a Q-learning-Iike algorithm that con­ that optimal value functions can be estimated without
verges to optimal policies without estimating Rand ever explicitly learning Rand r, learning H and r
P online. Tn essence, the learning algorithm uses an makes more efficient use of experience at the expense of
update rule analogous to the rule in Q-Iearning with additional storage and computation C�iloore and Atke­
the additional requiremenL LhaL Lhe inaial Q function son, 1 993 ). The parameters of R. and P can he gleaned
be set optimistically; that is, Qo(x,a) must be larger from experience by keeping statistics for each state­
thAn Q'(.r"a) for all ," and a. action pair on the expected reward and the proportion
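A minimal sketch of the worst-case backup above, for the ⊗ = max instance, is given below. This is illustrative code written for this summary (it is the model-based backup, not Heger's experience-based learning algorithm): the expectation over next states is replaced by a minimum over states reachable with non-zero probability.

```python
import numpy as np

def risk_sensitive_backup(V, R, P, gamma):
    """One application of V'(x) = max_a ( R(x,a) + gamma * min_{y: P(x,a,y) > 0} V(y) )."""
    n_states, n_actions = R.shape
    V_new = np.empty(n_states)
    for x in range(n_states):
        q = np.empty(n_actions)
        for a in range(n_actions):
            reachable = P[x, a] > 0.0          # only the zero/non-zero structure of P matters
            q[a] = R[x, a] + gamma * V[reachable].min()
        V_new[x] = q.max()
    return V_new
```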
4.4 EXPLORATION-SENSITIVE MODELS

John (1995) considered the implications of insisting that reinforcement-learning agents keep exploring forever; he found that better learning performance can be achieved if the Q-learning rule is changed to incorporate the condition of persistent exploration. In John's formulation, the agent is forced to adopt a policy from a restricted set; in one example, the agent must choose a stochastic stationary policy that selects actions at random 5% of the time.

This approach requires that the definition of optimality be changed to reflect the restriction on policies. The optimal value function is given by V*(x) = sup_{π ∈ P0} V^π(x), where P0 is the set of permitted (stationary) policies, and the associated Bellman equations are

    V*(x) = sup_{π ∈ P0} Σ_a π_a(x) ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),

which corresponds to a generalized MDP model with ⊕_y^{x,a} g(y) = Σ_y P(x, a, y) g(y) and ⊗_a^x f(x, a) = sup_{π ∈ P0} Σ_a π_a(x) f(x, a). Because π_a(x) is a probability distribution over a for any given state x, ⊗^x is a non-expansion and, thus, the convergence of the associated Q-learning algorithm follows from the arguments in Section 4.1. As a result, John's learning rule gives the optimal policy under the revised optimality criterion.
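For the specific restriction mentioned above (policies that act uniformly at random 5% of the time), the ⊗ operator has a simple closed form; the sketch below is our illustration of this one example only, not a general treatment of John's restricted policy classes, and it assumes the best permitted policy places all non-random probability mass on the greedy action.

```python
import numpy as np

def otimes_epsilon_restricted(f_x, epsilon=0.05):
    """sup over stationary policies forced to act uniformly at random with probability
    epsilon: the best such policy puts the remaining 1 - epsilon mass on the greedy
    action, so the summary blends the max with the uniform average of f_x."""
    return (1.0 - epsilon) * f_x.max() + epsilon * f_x.mean()
```

Note that this operator is a fixed weighted average of max and mean, both non-expansions, which is consistent with the non-expansion requirement of Section 2.3.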
4.5 MODEL-BASED METHODS

The defining assumption in reinforcement learning is that the reward and transition functions, R and P, are not known in advance. Although Q-learning shows that optimal value functions can be estimated without ever explicitly learning R and P, learning R and P makes more efficient use of experience at the expense of additional storage and computation (Moore and Atkeson, 1993). The parameters of R and P can be gleaned from experience by keeping statistics for each state-action pair on the expected reward and the proportion of transitions to each next state. In model-based reinforcement learning, R and P are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates. Theorem 1 implies the convergence of a wide variety of model-based reinforcement-learning methods.

The dynamic-programming operator defining the optimal value for generalized MDPs is given in Equation 2.
Here we assume that ⊕^{x,a} may depend on P and/or R, but ⊗^x may not. It is possible to extend the following argument to allow ⊗^x to depend on P and R as well. In model-based reinforcement learning, R and P are estimated by the quantities R_t and P_t, and ⊕^{x,a,t} is an estimate of the ⊕^{x,a} operator defined using R_t and P_t. As long as every state-action pair is visited infinitely often, there are a number of simple methods for computing R_t and P_t that converge to R and P. A bit more care is needed to insure that ⊕^{x,a,t} converges to ⊕^{x,a}, however. For example, in expected-reward models, ⊕_y^{x,a} g(y) = Σ_y P(x, a, y) g(y) and the convergence of P_t to P guarantees the convergence of ⊕^{x,a,t} to ⊕^{x,a}. On the other hand, in a risk-sensitive model, ⊕_y^{x,a} g(y) = min_{y : P(x,a,y) > 0} g(y) and it is necessary to approximate P in a way that insures that the set of y such that P_t(x, a, y) > 0 converges to the set of y such that P(x, a, y) > 0. This can be accomplished easily, for example, by setting P_t(x, a, y) = 0 if no transition from x to y under a has been observed.
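A sketch of the bookkeeping this paragraph describes appears below. It is illustrative only (the counters and maximum-likelihood estimates are standard choices, not prescribed by the paper): empirical means give R_t, empirical frequencies give P_t, and, because unobserved transitions receive probability exactly zero, the zero vs. non-zero structure needed by the risk-sensitive ⊕ operator is preserved in the limit.

```python
import numpy as np

class EmpiricalModel:
    """Maintains estimates R_t and P_t from observed (x, a, y, r) experience tuples."""
    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))

    def update(self, x, a, y, r):
        self.counts[x, a, y] += 1
        self.reward_sum[x, a] += r

    def R(self):
        n = self.counts.sum(axis=2)
        return np.divide(self.reward_sum, n,
                         out=np.zeros_like(self.reward_sum), where=n > 0)

    def P(self):
        n = self.counts.sum(axis=2, keepdims=True)
        # Unvisited (x, a) pairs keep an all-zero row; unobserved transitions get
        # probability 0, which preserves the zero/non-zero structure of P.
        return np.divide(self.counts, n,
                         out=np.zeros_like(self.counts), where=n > 0)
```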
Assuming P and R are estimated in a way that results in the convergence of ⊕^{x,a,t} to ⊕^{x,a}, the sequence of dynamic-programming operators T_t defined by

    ([T_t U]V)(x) = ⊗_a^x ( R_t(x, a) + γ ⊕_y^{x,a,t} V(y) ),  if x ∈ T_t;
    ([T_t U]V)(x) = U(x),  otherwise,

approximates T for all value functions. The set T_t ⊆ S represents the set of states whose values are updated on step t; one popular choice is to set T_t = {x_t}.

The functions

    G_t(x) = 0, if x ∈ T_t; 1, otherwise;
    F_t(x) = γ, if x ∈ T_t; 0, otherwise,

satisfy the conditions of Theorem 1 as long as each x is in infinitely many T_t sets (Condition 3) and the discount factor γ is less than 1 (Condition 4).

As a consequence of this argument and Theorem 1, model-based methods can be used to find optimal policies in MDPs, alternating Markov games, Markov games, risk-sensitive MDPs, and exploration-sensitive MDPs. Also, letting R_t = R and P_t = P for all t, this result implies that real-time dynamic programming (Barto et al., 1995) converges to the optimal value function.

5 CONCLUSIONS

In this paper, we presented a generalized model of Markov decision processes, and proved the convergence of several reinforcement-learning algorithms in the generalized model.

Other Results  We have derived a collection of results (Szepesvári and Littman, 1996) for the generalized MDP model that demonstrate its general applicability: the Bellman equations can be solved by value iteration; a myopic policy with respect to an approximately optimal value function gives an approximately optimal policy; when ⊗^x has a particular "maximization" property, policy iteration converges to the optimal value function; and, for models with the maximization property and finite state and action spaces, both value iteration and policy iteration identify optimal policies in pseudopolynomial time.

Related Work  The work presented here is closely related to several previous research efforts. Szepesvári (1995) described a related generalized reinforcement-learning model and presented conditions under which there is an optimal (stationary) policy that is myopic with respect to the optimal value function.

Tsitsiklis (1994) developed the connection between stochastic-approximation theory and reinforcement learning in MDPs. Our work is similar in spirit to that of Jaakkola, Jordan, and Singh (1994). We believe the form of Theorem 1 makes it particularly convenient for proving the convergence of reinforcement-learning algorithms; our theorem reduces the proof of the convergence of an asynchronous process to a simpler proof of convergence of a corresponding synchronized one. This idea enables us to prove the convergence of asynchronous stochastic processes whose underlying synchronous process is not of the Robbins-Monro type (e.g., risk-sensitive MDPs, model-based algorithms, etc.).

Future Work  There are many areas of interest in the theory of reinforcement learning that we would like to address in future work. The results in this paper primarily concern reinforcement learning in contractive models (γ < 1 or all-policies-proper), and there are important non-contractive reinforcement-learning scenarios, for example, reinforcement learning under an average-reward criterion (Mahadevan, 1996). It would be interesting to develop a TD(λ) algorithm (Sutton, 1988) for generalized MDPs.
Theorem 1 is not restricted to finite state spaces, and it might be valuable to prove the convergence of a reinforcement-learning algorithm for an infinite state-space model.

Conclusion  By identifying common elements among several reinforcement-learning scenarios, we created a new class of models that generalizes existing models in an interesting way. In the generalized framework, we replicated the established convergence proofs for reinforcement learning in Markov decision processes, and proved new results concerning the convergence of reinforcement-learning algorithms in game environments, under a risk-sensitive assumption, and under an exploration-sensitive assumption. At the heart of our results is a new stochastic-approximation theorem that is easy to apply to new situations.

Acknowledgements

Research supported by PHARE Grant H9305-02/1022, OTKA Grant no. F20132, and by Bellcore's Support for Doctoral Education Program.

References

Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138.

Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Condon, A. (1992). The complexity of stochastic games. Information and Computation, 96(2):203-224.

Gullapalli, V. and Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 695-702, San Mateo, CA. Morgan Kaufmann.

Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 105-111, San Francisco, CA. Morgan Kaufmann.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).

John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1/2/3):159-196.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407.

Singh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, Cambridge, MA. The MIT Press.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9-44.

Szepesvári, C. (1995). General framework for reinforcement learning. In Proceedings of ICANN'95, Paris.

Szepesvári, C. and Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Brown University, Providence, RI.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, pages 58-67.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3).

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279-292.
