
A Generalized Reinforcement-Learning Model: Convergence and Applications

Michael L. Littman
Department of Computer Science
Brown University
Providence, RI 02912-1910, USA
mlittman@cs.brown.edu

Csaba Szepesvári*
Research Group of Artificial Intelligence
"József Attila" University, Szeged
Szeged 6720, Aradi vrt. tere 1, HUNGARY
szepes@math.u-szeged.hu

*Also Department of Adaptive Systems, Joint Department of the "József Attila" University, Szeged, and the Institute of Isotopes of the Hungarian Academy of Sciences, Budapest 1525, P.O. Box 77, HUNGARY.

Abstract

Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games, and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

1 INTRODUCTION

Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-learning scenario is that of a single agent maximizing expected discounted total reward in a finite-state environment; in this scenario experiences are of the form (x, a, y, r), with state x, action a, resulting state y, and the agent's scalar immediate reward r. A discount parameter 0 ≤ γ < 1 controls the degree to which future rewards are significant compared to immediate rewards.

The theory of Markov decision processes has been used as a theoretical foundation for important results concerning this reinforcement-learning scenario. A (finite) Markov decision process (MDP) is defined by the tuple (S, A, P, R), where S is a finite set of states, A a finite set of actions, P a transition function, and R a reward function. The optimal behavior for an agent in an MDP depends on the optimality criterion; for the infinite-horizon expected discounted total-reward criterion, the optimal behavior can be found by identifying the optimal value function, defined recursively by

    V*(x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),

for all states x ∈ S, where R(x, a) is the immediate reward for taking action a from state x, γ the discount factor, and P(x, a, y) the probability that state y is reached from state x when action a ∈ A is chosen. These simultaneous equations, known as the Bellman equations, can be solved using a variety of techniques ranging from successive approximation to linear programming (Puterman, 1994).

In the absence of complete information regarding the transition and reward functions, reinforcement-learning methods can be used to find optimal value functions. Researchers have explored model-free (direct) methods, such as Q-learning (Watkins and Dayan, 1992), and model-based (indirect) methods, such as prioritized sweeping (Moore and Atkeson, 1993), and many converge to optimal value functions under the proper conditions (Tsitsiklis, 1994; Jaakkola et al., 1994; Gullapalli and Barto, 1994).
Not all reinforcement-learning scenarios of interest can be modeled as MDPs. For example, a great deal of reinforcement-learning research has been directed to the problem of solving two-player games (e.g., Tesauro, 1995), and the reinforcement-learning algorithms for solving MDPs and their convergence proofs do not apply directly to games.

In one form of two-player game, experiences are of the form (x, a, y, r), where states x and y contain additional information concerning which player (maximizer or minimizer) gets to choose the action in that state. There are deep similarities between MDPs and this type of game; for example, it is possible to define a set of Bellman equations for the optimal minimax value of a two-player zero-sum game,

    V*(x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),  if the maximizer moves in x;
    V*(x) = min_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),  if the minimizer moves in x,

where R(x, a) is the reward to the maximizing player. When 0 ≤ γ < 1, these equations have a unique solution and can be solved by successive-approximation methods. In addition, we show that simple extensions of several reinforcement-learning algorithms for MDPs converge to optimal value functions in these games.

In this paper, we introduce a generalized Markov decision process model with applications to reinforcement learning, and list some important results concerning the model. Generalized MDPs provide a foundation for the use of reinforcement learning in MDPs and games, as well as in risk-sensitive reinforcement learning (Heger, 1994), exploration-sensitive reinforcement learning (John, 1995), and reinforcement learning in simultaneous-action games (Littman, 1994). Our main theorem addresses conditions for the convergence of asynchronous stochastic processes and shows how these conditions relate to conditions for convergence of a corresponding synchronous process; it can be used to prove the convergence of model-free and model-based reinforcement-learning algorithms under a variety of reinforcement-learning scenarios.

In Section 2, we present generalized MDPs and motivate their form via two detailed examples. In Section 3, we describe a stochastic-approximation theorem, and in Section 4 we show several applications of the theorem that prove the convergence of learning processes in generalized MDPs.

2 THE GENERALIZED MODEL

In this section, we introduce our generalized MDP model. We begin by summarizing some of the more significant results regarding the standard MDP model and some important results for two-player games.

2.1 MARKOV DECISION PROCESSES

To provide a point of departure for our generalization of Markov decision processes, we first describe the use of reinforcement learning in MDPs; proofs of the unattributed claims can be found in Puterman's (1994) book. The ultimate target of learning is an optimal policy. A policy is some function that tells the agent which actions should be chosen under which circumstances. A policy π is optimal under the expected discounted total reward criterion if, with respect to the space of all possible policies, π maximizes the expected discounted total reward from all states.

Directly maximizing over the space of all possible policies is impractical. However, MDPs have an important property that makes it unnecessary to consider such a broad space of possibilities. We say a policy π is stationary and deterministic if it maps directly from states to actions, ignoring everything else, and we write π(x) as the action chosen by π when the current state is x. In expected discounted total reward MDP environments, there is always a stationary deterministic policy that is optimal; we will use the word "policy" to mean stationary deterministic policy, unless otherwise stated.

The value function for a policy π, V^π, maps states to their expected discounted total reward under policy π. It can be defined by the simultaneous equations

    V^π(x) = R(x, π(x)) + γ Σ_y P(x, π(x), y) V^π(y),

for all x ∈ S. The optimal value function V* is the value function of an optimal policy; it is unique for 0 ≤ γ < 1. The myopic policy with respect to a value function V is the policy π_V such that

    π_V(x) ∈ argmax_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V(y) ).

Any myopic policy with respect to the optimal value function is optimal.
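To make these definitions concrete, the following sketch (our illustration, not from the paper; the two-state MDP it uses is invented) evaluates a fixed deterministic policy by solving the simultaneous equations for V^π directly, and then extracts a myopic policy from the resulting value function.

```python
import numpy as np

# A hypothetical two-state, two-action MDP, used only for illustration.
# P[x, a, y] is the transition probability, R[x, a] the immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def evaluate_policy(policy):
    """Solve the linear system V = R_pi + gamma * P_pi V for a deterministic policy."""
    n = P.shape[0]
    P_pi = np.array([P[x, policy[x]] for x in range(n)])
    R_pi = np.array([R[x, policy[x]] for x in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def myopic_policy(V):
    """Greedy (myopic) policy with respect to a value function V."""
    Q = R + gamma * np.einsum('xay,y->xa', P, V)
    return Q.argmax(axis=1)

V_pi = evaluate_policy([0, 1])
print(V_pi, myopic_policy(V_pi))
```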
The Bellman equations can be operationalized in the form of the dynamic-programming operator T, which maps value functions to value functions:

    [TV](x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V(y) ).

For 0 ≤ γ < 1, successive applications of T to a value function bring it closer and closer to the optimal value function V*, which is the unique fixed point of T: V* = TV*.

In reinforcement learning, R and P are not known in advance. In model-based reinforcement learning, R and P are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates; this algorithm converges to the optimal value function under a wide variety of choices of the order states are updated (Gullapalli and Barto, 1994).

The method of Q-learning (Watkins and Dayan, 1992) uses experience to estimate the optimal value function without ever explicitly approximating R and P. The algorithm estimates the optimal Q function

    Q*(x, a) = R(x, a) + γ Σ_y P(x, a, y) V*(y),

from which the optimal value function can be computed via V*(x) = max_a Q*(x, a). Given the experience at step t (x_t, a_t, y_t, r_t) and the current estimate Q_t(x, a) of the optimal Q function, Q-learning updates

    Q_{t+1}(x_t, a_t) := (1 − α_t(x_t, a_t)) Q_t(x_t, a_t) + α_t(x_t, a_t) ( r_t + γ max_a Q_t(y_t, a) ),

where 0 ≤ α_t(x, a) ≤ 1 is a time-dependent learning rate controlling the blending rate of new estimates with old estimates for each state-action pair. The estimated Q function converges to Q* under the proper conditions (Watkins and Dayan, 1992).

2.2 ALTERNATING MARKOV GAMES

In alternating Markov games, two players take turns issuing actions to maximize their expected discounted total reward. The model is defined by the tuple (S1, S2, A, B, P, R), where S1 is the set of states in which player 1 issues actions from the set A, S2 is the set of states in which player 2 issues actions from the set B, P is the transition function, and R is the reward function for player 1. In the zero-sum games we consider, the rewards to player 2 (the minimizer) are simply the additive inverse of the rewards to player 1 (the maximizer). Markov decision processes are a special case of alternating Markov games in which S2 is empty. Condon (1992) proves this and the other unattributed results in this section.

A popular optimality criterion for alternating Markov games is discounted minimax optimality. Under this criterion, the maximizer chooses actions to maximize its reward against the minimizer's best possible counter-policy. A pair of policies is in equilibrium if neither player has any incentive to change policies if the other player's policy remains fixed. The value function for a pair of equilibrium policies is the optimal value function for the game; it is unique when 0 ≤ γ < 1, and can be found by successive approximation. For both players, there is always a deterministic stationary optimal policy. Any myopic policy with respect to the optimal value function is optimal.

Dynamic-programming operators, Bellman equations, and reinforcement-learning algorithms can be defined for alternating Markov games by starting with the definitions used in MDPs and changing the maximum operators to either maximums or minimums conditioned on the state. We show below that the resulting algorithms share their convergence properties with the analogous algorithms for MDPs.

2.3 GENERALIZED MDPS

In alternating Markov games and MDPs, optimal behavior can be specified by the Bellman equations; any myopic policy with respect to the optimal value function is optimal. In this section, we generalize the Bellman equations to define optimal behavior for a broad class of reinforcement-learning models. The objective criterion used in these models is additive in that the value of a policy is some measure of the total reward received.

The generalized Bellman equations can be written

    V*(x) = ⊗_a^x ( R(x, a) + γ ⊕_y^{x,a} V*(y) ).    (1)

Here ⊗_a^x is an operator that summarizes values over actions as a function of the state, and ⊕_y^{x,a} is an operator that summarizes values over next states as a function of the state and action. For Markov decision processes, ⊗_a^x f(x, a) = max_a f(x, a) and ⊕_y^{x,a} g(y) = Σ_y P(x, a, y) g(y). For alternating Markov games, ⊕_y^{x,a} is the same, and ⊗_a^x f(x, a) = max_a f(x, a) or min_a f(x, a) depending on whether x is in S1 or S2. Many models can be represented in this framework; see Section 4.
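As a concrete illustration of the two summary operators (our own sketch, not part of the paper), they can be passed around as ordinary functions; choosing max over actions and an expectation over next states recovers the MDP case, while switching between max and min depending on which player owns the state recovers alternating Markov games.

```python
import numpy as np

def otimes_mdp(f_x):
    """MDP summary over actions: f_x holds one value per action."""
    return f_x.max()

def otimes_alternating(f_x, maximizer_moves):
    """Alternating-game summary: max in the maximizer's states, min otherwise."""
    return f_x.max() if maximizer_moves else f_x.min()

def oplus_expectation(values_next, p_next):
    """Expected-value summary over next states (the MDP / game choice of the ⊕ operator)."""
    return float(np.dot(p_next, values_next))
```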
From a reinforcement-learning perspective, the value functions defined by the generalized MDP model can be interpreted as the total value of the rewards received by an agent selecting actions in a stochastic environment. The agent begins in state x, takes action a, and ends up in state y. The ⊕_y^{x,a} operator defines how the value of the next state should be used in assigning value to the current state. The ⊗_a^x operator defines how an optimal agent should choose actions.

When 0 ≤ γ < 1 and ⊗^x and ⊕^{x,a} are non-expansions, the generalized Bellman equations have a unique optimal solution, and therefore, the optimal value function is well defined. The ⊗^x operator is a non-expansion if

    | ⊗_a^x f1(x, a) − ⊗_a^x f2(x, a) | ≤ max_a | f1(x, a) − f2(x, a) |

for all f1, f2, and x. An analogous condition defines when ⊕^{x,a} is a non-expansion.

Many natural operators are non-expansions, such as max, min, midpoint, median, mean, and fixed weighted averages of these operations. Several previously described reinforcement-learning scenarios are special cases of this generalized MDP model, including computing the expected return of a fixed policy (Sutton, 1988), finding the optimal risk-averse policy (Heger, 1994), and finding the optimal exploration-sensitive policy (John, 1995).

As with MDPs, we can define a dynamic-programming operator

    [TV](x) = ⊗_a^x ( R(x, a) + γ ⊕_y^{x,a} V(y) ).    (2)

The operator T is a contraction mapping for 0 ≤ γ < 1. This means

    sup_x | [TV1](x) − [TV2](x) | ≤ γ sup_x | V1(x) − V2(x) |,

where V1 and V2 are arbitrary functions and 0 ≤ γ < 1 is the index of contraction.
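Because T is a contraction for 0 ≤ γ < 1, repeatedly applying Equation 2 from any starting point converges to V*. The sketch below is an illustrative implementation of this successive approximation with the summary operators left pluggable; the random four-state model in the usage example is invented for demonstration only.

```python
import numpy as np

def generalized_value_iteration(R, P, gamma, otimes, oplus, tol=1e-8):
    """Successive approximation with [TV](x) = otimes_a ( R(x,a) + gamma * oplus_y V(y) )."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        TV = np.array([
            otimes(np.array([R[x, a] + gamma * oplus(V, P[x, a])
                             for a in range(n_actions)]))
            for x in range(n_states)
        ])
        if np.max(np.abs(TV - V)) < tol:
            return TV
        V = TV

# Example instantiation: the MDP operators (max over actions, expectation over next states).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # invented 4-state, 2-action transition model
R = rng.uniform(size=(4, 2))
V_star = generalized_value_iteration(R, P, 0.9, np.max,
                                     lambda V, p: float(p @ V))
print(V_star)
```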
We can define a notion of stationary myopic policies with respect to a value function V; it is any (stochastic) policy π_V for which T^{π_V} V = TV, where

    [T^π V](x) = Σ_a π_a(x) ( R(x, a) + γ ⊕_y^{x,a} V(y) ).

Here π_a(x) represents the probability that an agent following π would choose action a in state x. To be certain that every value function possesses a myopic policy, we require that the operator ⊗^x satisfy the following property: for all functions f and states x,

    min_a f(x, a) ≤ ⊗_a^x f(x, a) ≤ max_a f(x, a).

The value function with respect to a policy π, V^π, can be defined by the simultaneous equations V^π = T^π V^π; it is unique when T^π is a contraction mapping. A policy π* is optimal if it is myopic with respect to its own value function. If π* is an optimal policy, then V^{π*} = V* because it solves the Bellman equation: V^{π*} = T^{π*} V^{π*} = T V^{π*}.

The next section describes a general theorem that can be used to prove the convergence of several reinforcement-learning algorithms for these and other models.
3 CONVERGENCE THEOREM

The process of finding an optimal value function can be viewed in the following general way. At any moment in time, there is a set of values representing the current approximation of the optimal value function. On each iteration, we apply some dynamic-programming operator, perhaps modified by experience, to the current approximation to generate a new approximation. Over time, we would like the approximation to tend toward the optimal value function.

In this process, there are two types of approximation going on simultaneously. The first is an approximation of the dynamic-programming operator for the underlying model, and the second is the use of the approximate dynamic-programming operator to find the optimal value function. This section presents a theorem that gives a set of conditions under which this type of simultaneous stochastic approximation converges to an optimal value function.

First, we need to define the general stochastic process. Let the set X be the states of the model, and the set B(X) of bounded, real-valued functions over X be the set of value functions. Let T : B(X) → B(X) be an arbitrary contraction mapping with fixed point V*.

If we had direct access to the contraction mapping T, we could use it to successively approximate V*. In most reinforcement-learning scenarios, T is not available and we must use experience to construct approximations of T. Consider a sequence of random operators T_t : B(X) → (B(X) → B(X)) and define U_{t+1} = [T_t U_t]V, where V and U_0 ∈ B(X) are arbitrary value functions. We say T_t approximates T at V if U_t converges to TV with probability 1 uniformly over X.¹ The idea is that T_t is a randomized version of T that uses U_t as "memory" to converge to TV.

The following theorem shows that, under the proper conditions, we can use the sequence T_t to estimate the fixed point V* of T.

Theorem 1  Let T be an arbitrary mapping with fixed point V*, and let T_t approximate T at V*. Let V_0 be an arbitrary value function, and define V_{t+1} = [T_t V_t]V_t. If there exist functions 0 ≤ F_t(x) ≤ 1 and 0 ≤ G_t(x) ≤ 1 satisfying the conditions below with probability one, then V_t converges to V* with probability 1 uniformly over X:

1. for all U1 and U2 ∈ B(X), and all x ∈ X,
       | ([T_t U1]V*)(x) − ([T_t U2]V*)(x) | ≤ G_t(x) | U1(x) − U2(x) |;

2. for all U and V ∈ B(X), and all x ∈ X,
       | ([T_t U]V*)(x) − ([T_t U]V)(x) | ≤ F_t(x) sup_{x'} | V*(x') − V(x') |;

3. for all k > 0, Π_{t=k}^{n} G_t(x) converges to zero uniformly in x as n increases; and,

4. there exists 0 ≤ γ < 1 such that for all x ∈ X and large enough t,
       F_t(x) ≤ γ (1 − G_t(x)).

Note that from the conditions of the theorem, it follows that T is a contraction operator at V* with index of contraction γ. The theorem is proven in a more detailed version of this paper (Szepesvári and Littman, 1996). We next describe some of the intuition behind the statement of the theorem and its conditions.

The iterative approximation of V* is performed by computing V_{t+1} = [T_t V_t]V_t, where T_t approximates T with the help of the "memory" present in V_t. Because of Conditions 1 and 2, G_t(x) is the extent to which the estimated value function depends on its present value, and F_t(x) ≈ 1 − G_t(x) is the extent to which the estimated value function is based on "new" information (this reasoning becomes clearer in the context of the applications in Section 4).

In some applications, such as Q-learning, the contribution of new information needs to decay over time to insure that the process converges. In this case, G_t(x) needs to converge to one; Condition 3 allows this as long as the convergence is slow enough to incorporate sufficient information for the process to converge.

Condition 4 links the values of G_t(x) and F_t(x) through some quantity γ < 1. If it were somehow possible to update the values synchronously over the entire state space, the process would converge to V* even when γ = 1. In the more interesting asynchronous case, when γ = 1 the long-term behavior of V_t is not immediately clear; it may even be that V_t converges to something other than V*. The requirement that γ < 1 insures that the use of outdated information in the asynchronous updates does not cause a problem in convergence.

One of the most noteworthy aspects of this theorem is that it shows how to reduce the problem of approximating V* to the problem of approximating T at a particular point V (in particular, it is enough if T can be approximated at V*); in many cases, the latter is much easier to achieve and also to prove. For example, the theorem makes the convergence of Q-learning a consequence of the classical Robbins-Monro theorem (Robbins and Monro, 1951).

¹A sequence of functions f_t converges to f* with probability 1 uniformly over X if, for the events w for which f_t(w, x) → f*, the convergence is uniform in x.

4 APPLICATIONS

This section makes use of Theorem 1 to prove the convergence of various reinforcement-learning algorithms.

4.1 GENERALIZED Q-LEARNING FOR EXPECTED VALUE MODELS

Consider the family of finite state and action generalized MDPs defined by the Bellman equations

    V*(x) = ⊗_a^x ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),

where the definition of ⊗^x does not depend on R or P. A Q-learning algorithm for this class of models can be defined as follows. Given experience (x_t, a_t, y_t, r_t) at time t and an estimate Q_t(x, a) of the optimal Q function, let

    Q_{t+1}(x_t, a_t) := (1 − α_t(x_t, a_t)) Q_t(x_t, a_t) + α_t(x_t, a_t) ( r_t + γ ⊗_a^{y_t} Q_t(y_t, a) ).
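A direct transcription of this update into code is sketched below. It is an illustration only: the environment interface (reset/step) and the epsilon-greedy behavior policy are our own assumptions, not part of the paper. The next-state summary operator ⊗ stays pluggable, so the same loop covers MDPs (max) and alternating games (max or min depending on the state's owner).

```python
import numpy as np

def generalized_q_learning(env, n_states, n_actions, otimes,
                           gamma=0.9, episodes=5000, epsilon=0.1):
    """Q_{t+1}(x_t,a_t) = (1 - a_t) Q_t(x_t,a_t) + a_t ( r_t + gamma * otimes_a Q_t(y_t, a) )."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        x = env.reset()                        # assumed environment interface
        done = False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[x].argmax())
            y, r, done = env.step(a)           # assumed environment interface
            visits[x, a] += 1
            alpha = 1.0 / visits[x, a]         # decaying learning rate
            # Equivalent to (1 - alpha) * Q[x, a] + alpha * (r + gamma * otimes(Q[y]))
            Q[x, a] += alpha * (r + gamma * otimes(Q[y]) - Q[x, a])
            x = y
    return Q
```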
We can derive the assumptions necessary for this learning algorithm to satisfy the conditions of Theorem 1 and therefore converge to the optimal Q values. The dynamic-programming operator defining the optimal Q function is

    [TQ](x, a) = R(x, a) + γ Σ_y P(x, a, y) ⊗_{a'}^y Q(y, a').

The randomized approximate dynamic-programming operator that gives rise to the Q-learning rule is

    ([T_t Q']Q)(x, a) = (1 − α_t(x, a)) Q'(x, a) + α_t(x, a) ( r_t + γ ⊗_{a'}^{y_t} Q(y_t, a') ),  if x = x_t and a = a_t;
    ([T_t Q']Q)(x, a) = Q'(x, a),  otherwise.

If

• y_t is randomly selected according to the probability distribution defined by P(x_t, a_t, ·),

• ⊗^x is a non-expansion, and both the expected value and the variance of ⊗_a^{y_t} Q(y_t, a) exist given the way y_t is sampled,

• r_t has finite variance and expected value given x_t and a_t equal to R(x_t, a_t),

• the learning rates are decayed so that Σ_t χ(x_t = x, a_t = a) α_t(x, a) = ∞ and Σ_t χ(x_t = x, a_t = a) α_t(x, a)² < ∞ with probability 1 uniformly over X × A,²

then a standard result from the theory of stochastic approximation (Robbins and Monro, 1951) states that T_t approximates T everywhere. That is, this method of using a decayed, exponentially weighted average correctly computes the average one-step reward.

Let

    G_t(x, a) = 1 − α_t(x, a), if x = x_t and a = a_t; 1, otherwise;
    F_t(x, a) = γ α_t(x, a), if x = x_t and a = a_t; 0, otherwise.

These functions satisfy the conditions of Theorem 1 (Condition 3 is implied by the restrictions placed on the sequence of learning rates α_t). Theorem 1 therefore implies that this generalized Q-learning algorithm converges to the optimal Q function with probability 1 uniformly over X × A. The convergence of Q-learning for discounted MDPs and alternating Markov games follows trivially from this. Extensions of this result for undiscounted "all-policies-proper" MDPs (Bertsekas and Tsitsiklis, 1989), a soft state aggregation learning rule (Singh et al., 1995), and a "spreading" learning rule are given in a more detailed version of this paper (Szepesvári and Littman, 1996).

²This condition implies, among other things, that every state-action pair is updated infinitely often. Here, χ denotes the characteristic function.

4.2 Q-LEARNING FOR MARKOV GAMES

Markov games are a generalization of MDPs and alternating Markov games in which both players simultaneously choose actions at each step. The basic model is defined by the tuple (S, A, B, P, R) and discount factor γ. As in alternating Markov games, the optimality criterion is one of discounted minimax optimality, but because the players move simultaneously, the Bellman equations take on a more complex form:

    V*(x) = max_{ρ ∈ Π(A)} min_{b ∈ B} Σ_{a ∈ A} ρ(a) ( R(x, a, b) + γ Σ_{y ∈ S} P(x, a, b, y) V*(y) ).

In these equations, R(x, a, b) is the immediate reward for the maximizer for taking action a in state x at the same time the minimizer takes action b, P(x, a, b, y) is the probability that state y is reached from state x when the maximizer takes action a and the minimizer takes action b, and Π(A) represents the set of discrete probability distributions over the set A. The sets S, A, and B are finite.

Once again, optimal policies are policies that are in equilibrium, and there is always a pair of optimal policies that are stationary. Unlike MDPs and alternating Markov games, the optimal policies are sometimes stochastic; there are Markov games in which no deterministic policy is optimal. The stochastic nature of optimal policies explains the need for the optimization over probability distributions in the Bellman equations, and stems from the fact that players must avoid being "second guessed" during action selection. An equivalent set of equations can be written with a stochastic choice for the minimizer, and also with the roles of the maximizer and minimizer reversed.
The Q-learning update rule for Markov games (Littman, 1994), given step t experience (x_t, a_t, b_t, y_t, r_t), has the form

    Q_{t+1}(x_t, a_t, b_t) := (1 − α_t(x_t, a_t, b_t)) Q_t(x_t, a_t, b_t) + α_t(x_t, a_t, b_t) ( r_t + γ ⊗_{a,b}^{y_t} Q_t(y_t, a, b) ),

where

    ⊗_{a,b}^x f(x, a, b) = max_{ρ ∈ Π(A)} min_{b ∈ B} Σ_{a ∈ A} ρ(a) f(x, a, b).

The results of the previous section prove that this rule converges to the optimal Q function under the proper conditions.
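The ⊗ operator here is the value of a matrix game, which can be computed with a small linear program. The sketch below is our illustration (it relies on scipy.optimize.linprog and is not prescribed by the paper); it returns the maximizer's minimax value of a payoff matrix f(x, ·, ·), which is exactly the summary needed in the update above.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(payoff):
    """Value of max_{rho in Pi(A)} min_b sum_a rho(a) * payoff[a, b], via a linear program."""
    n_a, n_b = payoff.shape
    # Variables: rho(a) for each maximizer action a, plus the game value v. Maximize v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                   # linprog minimizes, so minimize -v
    # Constraints: v - sum_a rho(a) * payoff[a, b] <= 0 for every minimizer action b.
    A_ub = np.hstack([-payoff.T, np.ones((n_b, 1))])
    b_ub = np.zeros(n_b)
    # rho must be a probability distribution.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# Example: the value of matching pennies is 0, achieved by the uniform mixed strategy.
print(matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```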

sUPT<EPIl IF""(x), ,,,here Po is the set of permitted (sta­


4,3 RISK-SENSITIVE MODELS tionary) policies, and the a:ssociated Bellman equa­
tions are
Heger ( 1994) described an optimality criterion for
\lDPs in which only the worst possible value of the next
state makes a contribution to the value of a state. An
optimal policy llllder this criterion is one that avoids
which corresponds to a generalized MDP model with
states for \vhich a bad outcome is possible, even if it is
EBy g(y)
x''' = Ly P(x,a, y)g(y) and 0n" f(x, a) =
not probable; for this reason, the criterion has a risk­
s UPT<E PIl La 1Ta ( ;J;)f( J�, a). Recallse iTa(x) ifl a proba­
averse quality to it. The generali7,eo Bellman equa­
bility distribution over a for any given state ;J;� Q9J:
tions for thifl criterion arc
is a non-expansion and, thus� the convergence of the
aflflociated Q-Iearning algorithm follmvfi from the argll­
Y' (x) = (8( (R(X, 0) + '( min Y' (y) ).
y:P(x,a,y»O ments in Section 4.1. As a result, John's learning rule
"
gives the optimal policy under the revised optimality
The al'gulnenL in SecLioIl 4.5 ohows that model-based criterion.
reinforcement learning can be used to find optimal
policies in risk-sensitive models, as long as Q9T does 4,5 MODEL-BASED METHODS

not depend on II. or P, and P itl estimated in a vvay


The defining &<;Jsumption in reinforcement learning is
that preserves its zero vs. non-lero nature in the limit.
that the rewa.rd and transiLion [undiontl, II. and p,
For the model in which 0/ f(", 0) lllAx,J(.r,a), = are not known in advance. AlthOllgh Q-learning shows
Heger defined a Q-learning-Iike algorithm that con­ that optimal value functions can be estimated without
verges to optimal policies without estimating Rand ever explicitly learning Rand r, learning H and r
P online. Tn essence, the learning algorithm uses an makes more efficient use of experience at the expense of
update rule analogous to the rule in Q-Iearning with additional storage and computation C�iloore and Atke­
the additional requiremenL LhaL Lhe inaial Q function son, 1 993 ). The parameters of R. and P can he gleaned
be set optimistically; that is, Qo(x,a) must be larger from experience by keeping statistics for each state­
thAn Q'(.r"a) for all ," and a. action pair on the expected reward and the proportion
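A minimal sketch of the worst-case backup above, for the ⊗ = max instance, is given below. This is illustrative code written for this summary (it is the model-based backup, not Heger's experience-based learning algorithm): the expectation over next states is replaced by a minimum over states reachable with non-zero probability.

```python
import numpy as np

def risk_sensitive_backup(V, R, P, gamma):
    """One application of V'(x) = max_a ( R(x,a) + gamma * min_{y: P(x,a,y) > 0} V(y) )."""
    n_states, n_actions = R.shape
    V_new = np.empty(n_states)
    for x in range(n_states):
        q = np.empty(n_actions)
        for a in range(n_actions):
            reachable = P[x, a] > 0.0          # only the zero/non-zero structure of P matters
            q[a] = R[x, a] + gamma * V[reachable].min()
        V_new[x] = q.max()
    return V_new
```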
4.4 EXPLORATION-SENSITIVE MODELS

John (1995) considered the implications of insisting that reinforcement-learning agents keep exploring forever; he found that better learning performance can be achieved if the Q-learning rule is changed to incorporate the condition of persistent exploration. In John's formulation, the agent is forced to adopt a policy from a restricted set; in one example, the agent must choose a stochastic stationary policy that selects actions at random 5% of the time.

This approach requires that the definition of optimality be changed to reflect the restriction on policies. The optimal value function is given by V*(x) = sup_{π ∈ P0} V^π(x), where P0 is the set of permitted (stationary) policies, and the associated Bellman equations are

    V*(x) = sup_{π ∈ P0} Σ_a π_a(x) ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),

which corresponds to a generalized MDP model with ⊕_y^{x,a} g(y) = Σ_y P(x, a, y) g(y) and ⊗_a^x f(x, a) = sup_{π ∈ P0} Σ_a π_a(x) f(x, a). Because π_a(x) is a probability distribution over a for any given state x, ⊗^x is a non-expansion and, thus, the convergence of the associated Q-learning algorithm follows from the arguments in Section 4.1. As a result, John's learning rule gives the optimal policy under the revised optimality criterion.
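For the specific restriction mentioned above (policies that act uniformly at random 5% of the time), the ⊗ operator has a simple closed form; the sketch below is our illustration of this one example only, not a general treatment of John's restricted policy classes, and it assumes the best permitted policy places all non-random probability mass on the greedy action.

```python
import numpy as np

def otimes_epsilon_restricted(f_x, epsilon=0.05):
    """sup over stationary policies forced to act uniformly at random with probability
    epsilon: the best such policy puts the remaining 1 - epsilon mass on the greedy
    action, so the summary blends the max with the uniform average of f_x."""
    return (1.0 - epsilon) * f_x.max() + epsilon * f_x.mean()
```

Note that this operator is a fixed weighted average of max and mean, both non-expansions, which is consistent with the non-expansion requirement of Section 2.3.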
4.5 MODEL-BASED METHODS

The defining assumption in reinforcement learning is that the reward and transition functions, R and P, are not known in advance. Although Q-learning shows that optimal value functions can be estimated without ever explicitly learning R and P, learning R and P makes more efficient use of experience at the expense of additional storage and computation (Moore and Atkeson, 1993). The parameters of R and P can be gleaned from experience by keeping statistics for each state-action pair on the expected reward and the proportion of transitions to each next state. In model-based reinforcement learning, R and P are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates. Theorem 1 implies the convergence of a wide variety of model-based reinforcement-learning methods.

The dynamic-programming operator defining the optimal value for generalized MDPs is given in Equation 2.
Here we assume that ⊕^{x,a} may depend on P and/or R, but ⊗^x may not. It is possible to extend the following argument to allow ⊗^x to depend on P and R as well. In model-based reinforcement learning, R and P are estimated by the quantities R_t and P_t, and ⊕^{x,a,t} is an estimate of the ⊕^{x,a} operator defined using R_t and P_t. As long as every state-action pair is visited infinitely often, there are a number of simple methods for computing R_t and P_t that converge to R and P. A bit more care is needed to insure that ⊕^{x,a,t} converges to ⊕^{x,a}, however. For example, in expected-reward models, ⊕_y^{x,a} g(y) = Σ_y P(x, a, y) g(y) and the convergence of P_t to P guarantees the convergence of ⊕^{x,a,t} to ⊕^{x,a}. On the other hand, in a risk-sensitive model, ⊕_y^{x,a} g(y) = min_{y : P(x,a,y) > 0} g(y) and it is necessary to approximate P in a way that insures that the set of y such that P_t(x, a, y) > 0 converges to the set of y such that P(x, a, y) > 0. This can be accomplished easily, for example, by setting P_t(x, a, y) = 0 if no transition from x to y under a has been observed.
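A sketch of the bookkeeping this paragraph describes appears below. It is illustrative only (the counters and maximum-likelihood estimates are standard choices, not prescribed by the paper): empirical means give R_t, empirical frequencies give P_t, and, because unobserved transitions receive probability exactly zero, the zero vs. non-zero structure needed by the risk-sensitive ⊕ operator is preserved in the limit.

```python
import numpy as np

class EmpiricalModel:
    """Maintains estimates R_t and P_t from observed (x, a, y, r) experience tuples."""
    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))

    def update(self, x, a, y, r):
        self.counts[x, a, y] += 1
        self.reward_sum[x, a] += r

    def R(self):
        n = self.counts.sum(axis=2)
        return np.divide(self.reward_sum, n,
                         out=np.zeros_like(self.reward_sum), where=n > 0)

    def P(self):
        n = self.counts.sum(axis=2, keepdims=True)
        # Unvisited (x, a) pairs keep an all-zero row; unobserved transitions get
        # probability 0, which preserves the zero/non-zero structure of P.
        return np.divide(self.counts, n,
                         out=np.zeros_like(self.counts), where=n > 0)
```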
Assuming P and R are estimated in a way that results in the convergence of ⊕^{x,a,t} to ⊕^{x,a}, the sequence of dynamic-programming operators T_t defined by

    ([T_t U]V)(x) = ⊗_a^x ( R_t(x, a) + γ ⊕_y^{x,a,t} V(y) ),  if x ∈ T_t;
    ([T_t U]V)(x) = U(x),  otherwise,

approximates T for all value functions. The set T_t ⊆ S represents the set of states whose values are updated on step t; one popular choice is to set T_t = {x_t}.

The functions

    G_t(x) = 0, if x ∈ T_t; 1, otherwise;
    F_t(x) = γ, if x ∈ T_t; 0, otherwise,

satisfy the conditions of Theorem 1 as long as each x is in infinitely many T_t sets (Condition 3) and the discount factor γ is less than 1 (Condition 4).

As a consequence of this argument and Theorem 1, model-based methods can be used to find optimal policies in MDPs, alternating Markov games, Markov games, risk-sensitive MDPs, and exploration-sensitive MDPs. Also, letting R_t = R and P_t = P for all t, this result implies that real-time dynamic programming (Barto et al., 1995) converges to the optimal value function.

5 CONCLUSIONS

In this paper, we presented a generalized model of Markov decision processes, and proved the convergence of several reinforcement-learning algorithms in the generalized model.

Other Results  We have derived a collection of results (Szepesvári and Littman, 1996) for the generalized MDP model that demonstrate its general applicability: the Bellman equations can be solved by value iteration; a myopic policy with respect to an approximately optimal value function gives an approximately optimal policy; when ⊗^x has a particular "maximization" property, policy iteration converges to the optimal value function; and, for models with the maximization property and finite state and action spaces, both value iteration and policy iteration identify optimal policies in pseudopolynomial time.

Related Work  The work presented here is closely related to several previous research efforts. Szepesvári (1995) described a related generalized reinforcement-learning model and presented conditions under which there is an optimal (stationary) policy that is myopic with respect to the optimal value function.

Tsitsiklis (1994) developed the connection between stochastic-approximation theory and reinforcement learning in MDPs. Our work is similar in spirit to that of Jaakkola, Jordan, and Singh (1994). We believe the form of Theorem 1 makes it particularly convenient for proving the convergence of reinforcement-learning algorithms; our theorem reduces the proof of the convergence of an asynchronous process to a simpler proof of convergence of a corresponding synchronized one. This idea enables us to prove the convergence of asynchronous stochastic processes whose underlying synchronous process is not of the Robbins-Monro type (e.g., risk-sensitive MDPs, model-based algorithms, etc.).

Future Work  There are many areas of interest in the theory of reinforcement learning that we would like to address in future work. The results in this paper primarily concern reinforcement learning in contractive models (γ < 1 or all-policies-proper), and there are important non-contractive reinforcement-learning scenarios, for example, reinforcement learning under an average-reward criterion (Mahadevan, 1996). It would be interesting to develop a TD(λ) algorithm (Sutton, 1988) for generalized MDPs.
Theorem 1 is not restricted to finite state spaces, and it might be valuable to prove the convergence of a reinforcement-learning algorithm for an infinite state-space model.

Conclusion  By identifying common elements among several reinforcement-learning scenarios, we created a new class of models that generalizes existing models in an interesting way. In the generalized framework, we replicated the established convergence proofs for reinforcement learning in Markov decision processes, and proved new results concerning the convergence of reinforcement-learning algorithms in game environments, under a risk-sensitive assumption, and under an exploration-sensitive assumption. At the heart of our results is a new stochastic-approximation theorem that is easy to apply to new situations.

Acknowledgements

Research supported by PHARE Grant H9305-02/1022, OTKA Grant no. F20132, and by Bellcore's Support for Doctoral Education Program.

References

Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138.

Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Condon, A. (1992). The complexity of stochastic games. Information and Computation, 96(2):203-224.

Gullapalli, V. and Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 695-702, San Mateo, CA. Morgan Kaufmann.

Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 105-111, San Francisco, CA. Morgan Kaufmann.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).

John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1/2/3):159-196.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407.

Singh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, Cambridge, MA. The MIT Press.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9-44.

Szepesvári, C. (1995). General framework for reinforcement learning. In Proceedings of ICANN'95, Paris.

Szepesvári, C. and Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Brown University, Providence, RI.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, pages 58-67.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3).

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279-292.
