A Generalized Reinforcement-Learning Model: Convergence and Applications

Abstract
Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games, and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

*Also Department of Adaptive Systems, Joint Department of the "József Attila" University, Szeged, and the Institute of Isotopes of the Hungarian Academy of Sciences, Budapest 1525, P.O. Box 77, HUNGARY.

1 INTRODUCTION

Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-learning scenario is that of a single agent maximizing expected discounted total reward in a finite-state environment; in this scenario, experiences are of the form (x, a, y, r), with state x, action a, resulting state y, and the agent's scalar immediate reward r. A discount parameter 0 ≤ γ < 1 controls the degree to which future rewards are significant compared to immediate rewards.

The theory of Markov decision processes has been used as a theoretical foundation for important results concerning this reinforcement-learning scenario. A (finite) Markov decision process (MDP) is defined by the tuple (S, A, P, R), where S is a finite set of states, A a finite set of actions, P a transition function, and R a reward function. The optimal behavior for an agent in an MDP depends on the optimality criterion; for the infinite-horizon expected discounted total-reward criterion, the optimal behavior can be found by identifying the optimal value function, defined recursively by

V*(x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),

for all states x ∈ S, where R(x, a) is the immediate reward for taking action a from state x, γ the discount factor, and P(x, a, y) the probability that state y is reached from state x when action a ∈ A is chosen. These simultaneous equations, known as the Bellman equations, can be solved using a variety of techniques ranging from successive approximation to linear programming (Puterman, 1994).

In the absence of complete information regarding the transition and reward functions, reinforcement-learning methods can be used to find optimal value functions. Researchers have explored model-free (direct) methods, such as Q-learning (Watkins and Dayan, 1992), and model-based (indirect) methods, such as prioritized sweeping (Moore and Atkeson, 1993), and many converge to optimal value functions under the proper conditions (Tsitsiklis, 1994; Jaakkola et al., 1994; Gullapalli and Barto, 1994).
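To make the successive-approximation solution of the Bellman equations concrete, the following minimal sketch performs tabular value iteration on an MDP supplied as NumPy arrays. It is an illustration rather than an algorithm from this paper; the array layout, the function name, and the stopping tolerance are assumptions.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Successively approximate V*(x) = max_a [ R(x, a) + gamma * sum_y P(x, a, y) V*(y) ].

    P: array of shape (S, A, S), where P[x, a, y] is the transition probability.
    R: array of shape (S, A), where R[x, a] is the immediate reward.
    """
    V = np.zeros(P.shape[0])
    while True:
        # One application of the dynamic-programming operator to the current approximation.
        Q = R + gamma * np.einsum("xay,y->xa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

Since 0 ≤ gamma < 1, each sweep is a contraction in the sup norm, so the loop terminates with an approximation of V*.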
Not all reinforcement-learning scenarios of interest can be modeled as MDPs. For example, a great deal of reinforcement-learning research has been directed to the problem of solving two-player games (e.g., Tesauro, 1995), and the reinforcement-learning algorithms for solving MDPs and their convergence proofs do not apply directly to games.

In one form of two-player game, experiences are of the form (x, a, y, r), where states x and y contain additional information specifying which player is to move. The optimal value function is defined by

V*(x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),  if the maximizer moves in x;
V*(x) = min_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V*(y) ),  if the minimizer moves in x,

where R(x, a) is the reward to the maximizing player. When 0 ≤ γ < 1, these equations have a unique solution and can be solved by successive-approximation methods. In addition, we show that simple extensions of several reinforcement-learning algorithms for MDPs converge to optimal value functions in these games.

In this paper, we introduce a generalized Markov decision process model with applications to reinforcement learning, and list some important results concerning the model. Generalized MDPs provide a foundation for the use of reinforcement learning in MDPs and games, as well as in risk-sensitive reinforcement learning (Heger, 1994), exploration-sensitive reinforcement learning (John, 1995), and reinforcement learning in simultaneous-action games (Littman, 1994).

Our main theorem addresses conditions for the convergence of asynchronous stochastic processes and shows how these conditions relate to conditions for convergence of a corresponding synchronous process; it can be used to prove the convergence of model-free and model-based reinforcement-learning algorithms under a variety of reinforcement-learning scenarios.

In Section 2, we present generalized MDPs and motivate their form via two detailed examples. In Section 3, we describe a stochastic-approximation theorem, and in Section 4 we show several applications of the theorem that prove the convergence of learning processes in generalized MDPs.

2 THE GENERALIZED MODEL

In this section, we introduce our generalized MDP model. We begin by summarizing some of the more significant results regarding the standard MDP model and some important results for two-player games.

2.1 MARKOV DECISION PROCESSES

Directly maximizing over the space of all possible policies is impractical. However, MDPs have an important property that makes it unnecessary to consider such a broad space of possibilities. We say a policy π is stationary and deterministic if it maps directly from states to actions, ignoring everything else, and we write π(x) as the action chosen by π when the current state is x. In expected discounted total-reward MDP environments, there is always a stationary deterministic policy that is optimal; we will use the word "policy" to mean stationary deterministic policy, unless otherwise stated.

The value function for a policy π, V^π, maps states to their expected discounted total reward under policy π. It can be defined by the simultaneous equations

V^π(x) = R(x, π(x)) + γ Σ_y P(x, π(x), y) V^π(y),

for all x ∈ S. The optimal value function V* is the value function of an optimal policy; it is unique for 0 ≤ γ < 1. The myopic policy with respect to a value function V is the policy π_V such that

π_V(x) = argmax_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V(y) ).

Any myopic policy with respect to the optimal value function is optimal.
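Because V^π is defined by a finite system of linear simultaneous equations, it can be computed directly. The sketch below, an illustration under the same array conventions assumed earlier, solves (I − γ P^π) V^π = R^π with NumPy for a deterministic policy given as an array of action indices.

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Solve V_pi(x) = R(x, pi(x)) + gamma * sum_y P(x, pi(x), y) V_pi(y) exactly."""
    idx = np.arange(P.shape[0])
    P_pi = P[idx, pi]   # shape (S, S): transition matrix under the policy pi
    R_pi = R[idx, pi]   # shape (S,):   one-step reward under the policy pi
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, R_pi)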
The Bellman equations can be operationalized in the form of the dynamic-programming operator T, which maps value functions to value functions:

[T V](x) = max_{a ∈ A} ( R(x, a) + γ Σ_y P(x, a, y) V(y) ).

For 0 ≤ γ < 1, successive applications of T to a value function bring it closer and closer to the optimal value function V*, which is the unique fixed point of T: V* = T V*.

In reinforcement learning, R and P are not known in advance. In model-based reinforcement learning, R and P are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates; this algorithm converges to the optimal value function under a wide variety of choices of the order in which states are updated (Gullapalli and Barto, 1994).

The method of Q-learning (Watkins and Dayan, 1992) uses experience to estimate the optimal value function without ever explicitly approximating R and P. The algorithm estimates the optimal Q function

Q*(x, a) = R(x, a) + γ Σ_y P(x, a, y) V*(y),

from which the optimal value function can be computed via V*(x) = max_a Q*(x, a). Given the experience at step t, (x_t, a_t, y_t, r_t), and the current estimate Q_t(x, a) of the optimal Q function, Q-learning updates

Q_{t+1}(x_t, a_t) := (1 − α_t(x_t, a_t)) Q_t(x_t, a_t) + α_t(x_t, a_t) ( r_t + γ max_a Q_t(y_t, a) ),

where 0 ≤ α_t(x, a) ≤ 1 is a time-dependent learning rate controlling the blending of new estimates with old estimates for each state-action pair. The estimated Q function converges to Q* under the proper conditions (Watkins and Dayan, 1992).

2.2 ALTERNATING MARKOV GAMES

In alternating Markov games, two players take turns issuing actions to maximize their expected discounted total reward. The model is defined by the tuple (S_1, S_2, A, B, P, R), where S_1 is the set of states in which player 1 issues actions from the set A, S_2 is the set of states in which player 2 issues actions from the set B, P is the transition function, and R is the reward function.

A popular optimality criterion for alternating Markov games is discounted minimax optimality. Under this criterion, the maximizer chooses actions to maximize its reward against the minimizer's best possible counter-policy. A pair of policies is in equilibrium if neither player has any incentive to change policies while the other player's policy remains fixed. The value function for a pair of equilibrium policies is the optimal value function for the game; it is unique when 0 ≤ γ < 1, and can be found by successive approximation. For both players, there is always a deterministic stationary optimal policy. Any myopic policy with respect to the optimal value function is optimal. Condon (1992) proves this and the other unattributed results in this section.

Dynamic-programming operators, Bellman equations, and reinforcement-learning algorithms can be defined for alternating Markov games by starting with the definitions used in MDPs and changing the maximum operators to either maximums or minimums conditioned on the state. We show below that the resulting algorithms share their convergence properties with the analogous algorithms for MDPs.

2.3 GENERALIZED MDPS

In alternating Markov games and MDPs, optimal behavior can be specified by the Bellman equations; any myopic policy with respect to the optimal value function is optimal. In this section, we generalize the Bellman equations to define optimal behavior for a broad class of reinforcement-learning models. The objective criterion used in these models is additive in that the value of a policy is some measure of the total reward received.

The generalized Bellman equations can be written

V*(x) = ⊗_a^x ( R(x, a) + γ ⊕_y^{x,a} V*(y) ).

Here "⊗_a^x" is an operator that summarizes values over actions as a function of the state, and "⊕_y^{x,a}" is an operator that summarizes values over next states as a function of the state and action. For Markov decision processes, ⊗_a^x f(x, a) = max_a f(x, a) and ⊕_y^{x,a} g(y) = Σ_y P(x, a, y) g(y). For alternating Markov games, ⊕_y^{x,a} is the same, and ⊗_a^x f(x, a) = max_a f(x, a) when x ∈ S_1 and min_a f(x, a) when x ∈ S_2.
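As a concrete illustration of the generalized model (our sketch, not a construction from the paper), the dynamic-programming update below takes the summary operator ⊗ as a plug-in function, with ⊕ fixed to the expected next-state value. The function and variable names are assumptions; the two instantiations shown are the MDP and alternating-game choices described above.

import numpy as np

def generalized_backup(V, R, P, gamma, summarize_actions):
    """One application of [T V](x) = (⊗ over a)[ R(x, a) + gamma * (⊕ over y) V(y) ],
    with ⊕ taken to be the expectation over next states, sum_y P(x, a, y) V(y)."""
    Q = R + gamma * np.einsum("xay,y->xa", P, V)   # ⊕: expected next-state value
    return summarize_actions(Q)                    # ⊗: summarize the action values

# MDPs: ⊗ is the maximum over actions.
mdp_summarize = lambda Q: Q.max(axis=1)

# Alternating Markov games: maximum in the maximizer's states, minimum in the minimizer's.
def game_summarize(maximizer_states):
    return lambda Q: np.where(maximizer_states, Q.max(axis=1), Q.min(axis=1))

Both choices of ⊗ are non-expansions, which is the property the convergence results below rely on.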
In this paper we consider generalized MDPs in which ⊗ and ⊕ are non-expansions: ⊗ is a non-expansion if

| ⊗_a^x f(x, a) − ⊗_a^x h(x, a) | ≤ max_a | f(x, a) − h(x, a) |

for all f, h, and x. An analogous condition defines when ⊕_y^{x,a} is a non-expansion.

Many natural operators are non-expansions, such as max, min, midpoint, median, mean, and fixed weighted averages of these operations. Several previously described reinforcement-learning scenarios are special cases of this generalized MDP model, including computing the expected return of a fixed policy (Sutton, 1988), finding the optimal risk-averse policy (Heger, 1994), and finding the optimal exploration-sensitive policy (John, 1995).

As with MDPs, we can define a dynamic-programming operator

[T V](x) = ⊗_a^x ( R(x, a) + γ ⊕_y^{x,a} V(y) ).

The operator T is a contraction mapping for 0 ≤ γ < 1. This means

sup_x | [T V_1](x) − [T V_2](x) | ≤ γ sup_x | V_1(x) − V_2(x) |,

where V_1 and V_2 are arbitrary value functions and 0 ≤ γ < 1 is the index of contraction.

We can define a notion of stationary myopic policies with respect to a value function V; it is any (stochastic) policy π_V for which T^{π_V} V = T V, where

[T^π V](x) = Σ_a π_a(x) ( R(x, a) + γ ⊕_y^{x,a} V(y) ).

Here π_a(x) represents the probability that an agent following π would choose action a in state x. To be certain that every value function possesses a myopic policy, we require that ⊗_a^x f(x, a) always lie between min_a f(x, a) and max_a f(x, a).

3 CONVERGENCE THEOREM

The process of finding an optimal value function can be viewed in the following general way. At any moment in time, there is a set of values representing the current approximation of the optimal value function. On each iteration, we apply some dynamic-programming operator, perhaps modified by experience, to the current approximation to generate a new approximation. Over time, we would like the approximation to tend toward the optimal value function.

In this process, there are two types of approximation going on simultaneously. The first is an approximation of the dynamic-programming operator for the underlying model, and the second is the use of the approximate dynamic-programming operator to find the optimal value function. This section presents a theorem that gives a set of conditions under which this type of simultaneous stochastic approximation converges to an optimal value function.

First, we need to define the general stochastic process. Let the set X be the states of the model, and the set B(X) of bounded, real-valued functions over X be the set of value functions. Let T : B(X) → B(X) be an arbitrary contraction mapping with fixed point V*.

If we had direct access to the contraction mapping T, we could use it to successively approximate V*. In most reinforcement-learning scenarios, T is not available and we must use experience to construct approximations of T. Consider a sequence of random operators T_t : B(X) → (B(X) → B(X)) and define U_{t+1} = [T_t U_t]V, where V and U_0 ∈ B(X) are arbitrary value functions. We say T_t approximates T at V if U_t converges to TV with probability 1 uniformly
over X. The idea is that T_t is a randomized version of T that uses U_t as "memory" to converge to TV.

The following theorem shows that, under the proper conditions, we can use the sequence T_t to estimate the fixed point V* of T.

Theorem 1  Let T be an arbitrary mapping with fixed point V*, and let T_t approximate T at V*. Let V_0 be an arbitrary value function, and define V_{t+1} = [T_t V_t]V_t. If there exist functions 0 ≤ F_t(x) ≤ 1 and 0 ≤ G_t(x) ≤ 1 satisfying the conditions below with probability one, then V_t converges to V* with probability 1 uniformly over X:

1. for all U_1 and U_2 ∈ B(X), and all x ∈ X,
   | ([T_t U_1]V*)(x) − ([T_t U_2]V*)(x) | ≤ G_t(x) | U_1(x) − U_2(x) |;

2. for all U and V ∈ B(X), and all x ∈ X,
   | ([T_t U]V*)(x) − ([T_t U]V)(x) | ≤ F_t(x) sup_{x'} | V*(x') − V(x') |;

3. for all k > 0, Π_{t=k}^{n} G_t(x) converges to zero uniformly in x as n increases; and,

4. F_t(x) ≤ γ (1 − G_t(x)), where γ < 1.

Note that from the conditions of the theorem it follows that F_t(x) ≤ γ < 1 for all t and x.

In some applications, such as Q-learning, the contribution of new information needs to decay over time to insure that the process converges. In this case, G_t(x) needs to converge to one; Condition 3 allows this as long as the convergence is slow enough to incorporate sufficient information for the process to converge.

Condition 4 links the values of G_t(x) and F_t(x) through some quantity γ < 1. If it were somehow possible to update the values synchronously over the entire state space, the process would converge to V* even when γ = 1. In the more interesting asynchronous case, when γ = 1 the long-term behavior of V_t is not immediately clear; it may even be that V_t converges to something other than V*. The requirement that γ < 1 insures that the use of outdated information in the asynchronous updates does not cause a problem in convergence.

One of the most noteworthy aspects of this theorem is that it shows how to reduce the problem of approximating V* to the problem of approximating T at a particular point V (in particular, it is enough if T can be approximated at V*); in many cases, the latter is much easier to achieve and also to prove. For example, the theorem makes the convergence of Q-learning a consequence of the classical Robbins-Monro theorem (Robbins and Monro, 1951).

4 APPLICATIONS

This section makes use of Theorem 1 to prove the convergence of various reinforcement-learning algorithms.

4.1 GENERALIZED Q-LEARNING
Given the step-t experience (x_t, a_t, y_t, r_t) and the current estimate Q' of the optimal Q function, generalized Q-learning is described by the random operator sequence

([T_t Q']Q)(x, a) =
    (1 − α_t(x, a)) Q'(x, a) + α_t(x, a) ( r_t + γ ⊗_{a'}^{y_t} Q(y_t, a') ),  if x = x_t and a = a_t;
    Q'(x, a),  otherwise,

where 0 ≤ α_t(x, a) ≤ 1 is the sequence of learning rates. Theorem 1 yields convergence under appropriate conditions on the sequence of learning rates α_t.²

²This condition implies, among other things, that every state-action pair is updated infinitely often. Here, χ denotes the characteristic function.
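The operator above changes the estimate only at the visited pair (x_t, a_t), blending the old value with a sampled backup whose next-state values are summarized by ⊗. A minimal sketch of a single such update is given below; it assumes a tabular Q array, takes ⊗ as a plug-in function, and uses illustrative names throughout.

import numpy as np

def generalized_q_update(Q, x, a, r, y, alpha, gamma, summarize_actions):
    """One generalized Q-learning step on a tabular Q of shape (S, A):
    move Q[x, a] toward r + gamma * (⊗ over a') Q[y, a']."""
    target = r + gamma * summarize_actions(Q[y])
    Q[x, a] = (1.0 - alpha) * Q[x, a] + alpha * target
    return Q

# Ordinary Q-learning is recovered with ⊗ = max over actions:
# Q = generalized_q_update(Q, x, a, r, y, alpha=0.1, gamma=0.9, summarize_actions=np.max)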
4.2 Q-LEARNING FOR MARKOV GAMES

The Q-learning update rule for Markov games (Littman, 1994), given the step-t experience (x_t, a_t, b_t, y_t, r_t), has the form

Q_{t+1}(x_t, a_t, b_t) := (1 − α_t(x_t, a_t, b_t)) Q_t(x_t, a_t, b_t) + α_t(x_t, a_t, b_t) ( r_t + γ ⊗_{a,b}^{y_t} Q_t(y_t, a, b) ),

where

⊗_{a,b}^{x} f(x, a, b) = max_{ρ ∈ Π(A)} min_{b ∈ B} Σ_{a ∈ A} ρ(a) f(x, a, b),

with Π(A) the set of probability distributions over A; the value is unchanged with a stochastic choice for the minimizer, and also with the roles of the maximizer and minimizer reversed. The results of the previous section prove that this rule converges to the optimal Q function under the proper conditions.
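Computing ⊗ here means solving a small matrix game at the next state. One standard way to do so, shown as an assumed illustration rather than the paper's own procedure, is with a linear program; the sketch below uses scipy.optimize.linprog with an assumed variable layout of (ρ, v).

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(F):
    """Return (v, rho) where v = max over rho in Pi(A) of min over b of sum_a rho(a) F[a, b].

    F[a, b] is the maximizing player's payoff for action a against opponent action b.
    We maximize v subject to sum_a rho(a) F[a, b] >= v for every b and sum_a rho(a) = 1.
    """
    n_a, n_b = F.shape
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                  # maximize v by minimizing -v
    A_ub = np.hstack([-F.T, np.ones((n_b, 1))])   # v - sum_a rho(a) F[a, b] <= 0 for each b
    b_ub = np.zeros(n_b)
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]

The returned value can then be used as the discounted backup term in the update rule above.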
4.4 EXPLORATION-SENSITIVE MODELS

John (1995) considered the implications of insisting that reinforcement-learning agents keep exploring forever; he found that better learning performance can be achieved if the Q-learning rule is changed to incorporate the condition of persistent exploration. In John's formulation, the agent is forced to adopt a policy from a restricted set; in one example, the agent must choose a stochastic stationary policy that selects actions at random 5% of the time.

This approach requires that the definition of optimality be changed to reflect the restriction on policies. The optimal value function is then the value of the best policy in the restricted set, and it satisfies the generalized Bellman equations for a suitably chosen summary operator ⊗.
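Under one concrete reading of the 5% restriction (our illustration: with probability 0.05 the action is uniform random, otherwise it is chosen freely), the summary operator for the best restricted policy is a fixed weighted average of the max and the mean of the action values, one of the non-expansions listed in Section 2.3.

import numpy as np

def exploring_summarize(Q_row, explore=0.05):
    """⊗ for an agent forced to act uniformly at random a fraction `explore` of the time:
    the greedy value blended with the mean action value at the state."""
    return (1.0 - explore) * np.max(Q_row) + explore * np.mean(Q_row)

Plugging this ⊗ into the generalized Bellman equations, or into the generalized Q-learning update, gives one exploration-sensitive notion of optimality.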
4.5 MODEL-BASED METHODS

Let R_t and P_t be estimates of the reward and transition functions built from the first t experiences, and let ⊕^t be the next-state summary operator defined by P_t. The random operator

([T_t U]V)(x) =
    ⊗_a^x ( R_t(x, a) + γ ⊕_y^{x,a,t} V(y) ),  if x ∈ T_t;
    U(x),  otherwise,

approximates T for all value functions. The set T_t ⊆ S represents the set of states whose values are updated on step t; one popular choice is to set T_t = {x_t}.

The functions

G_t(x) = 0 if x ∈ T_t, and 1 otherwise;
F_t(x) = γ if x ∈ T_t, and 0 otherwise,

satisfy the conditions of Theorem 1 as long as each x is in infinitely many T_t sets (Condition 3) and the discount factor γ is less than 1 (Condition 4).

As a consequence of this argument and Theorem 1, model-based methods can be used to find optimal policies in MDPs, alternating Markov games, Markov games, risk-sensitive MDPs, and exploration-sensitive MDPs. Also, letting R_t = R and P_t = P for all t, this result implies that real-time dynamic programming (Barto et al., 1995) converges to the optimal value function.
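A minimal sketch of this scheme, with assumed names and a simple empirical-count model, is given below: each experience updates the estimates, and then the approximate operator is applied only to the states in T_t (here T_t = {x_t}).

import numpy as np

def model_based_step(V, counts, reward_sums, x, a, r, y, gamma, summarize_actions):
    """Update empirical estimates of R and P with (x, a, y, r), then back up state x only."""
    counts[x, a, y] += 1.0
    reward_sums[x, a] += r
    n = np.maximum(counts[x].sum(axis=1, keepdims=True), 1.0)  # visits to (x, b) for each action b
    P_hat = counts[x] / n                                       # estimated P_t(x, ., .)
    R_hat = reward_sums[x] / n[:, 0]                            # estimated R_t(x, .)
    V[x] = summarize_actions(R_hat + gamma * P_hat @ V)         # ⊗ over the backed-up action values
    return V

With summarize_actions set to max this is asynchronous model-based value iteration for MDPs; other non-expansion choices give the game, risk-sensitive, and exploration-sensitive variants.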
5 CONCLUSIONS

Szepesvári (1995) studied a related general reinforcement-learning model and presented conditions under which there is an optimal (stationary) policy that is myopic with respect to the optimal value function. Tsitsiklis (1994) developed the connection between stochastic-approximation theory and reinforcement learning in MDPs. Our work is similar in spirit to that of Jaakkola, Jordan, and Singh (1994). We believe the form of Theorem 1 makes it particularly convenient for proving the convergence of reinforcement-learning algorithms; our theorem reduces the proof of the convergence of an asynchronous process to a simpler proof of convergence of a corresponding synchronized one. This idea enables us to prove the convergence of asynchronous stochastic processes whose underlying synchronous process is not of the Robbins-Monro type (e.g., risk-sensitive MDPs, model-based algorithms, etc.).

Future Work  There are many areas of interest in the theory of reinforcement learning that we would like to address in future work. The results in this paper primarily concern reinforcement learning in contractive models (γ < 1 or all-policies-proper), and there are important non-contractive reinforcement-learning scenarios, for example, reinforcement learning under an average-reward criterion (Mahadevan, 1996). It would be interesting to develop a TD(λ) algorithm (Sutton, 1988) for generalized MDPs.
Theorem 1 is not restricted to finite state spaces, and it might be valuable to prove the convergence of a reinforcement-learning algorithm for an infinite state-space model.

Conclusion  By identifying common elements among several reinforcement-learning scenarios, we created a new class of models that generalizes existing models in an interesting way. In the generalized framework, we replicated the established convergence proofs for reinforcement learning in Markov decision processes, and proved new results concerning the convergence of reinforcement-learning algorithms in game environments, under a risk-sensitive assumption, and under an exploration-sensitive assumption. At the heart of our results is a new stochastic-approximation theorem that is easy to apply to new situations.

Acknowledgements

Research supported by PHARE grant H9305-02/1022, OTKA Grant no. F020132, and by Bellcore's Support for Doctoral Education Program.

References

Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138.

Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Condon, A. (1992). The complexity of stochastic games. Information and Computation, 96(2):203-224.

Gullapalli, V. and Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 695-702, San Mateo, CA. Morgan Kaufmann.

Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 105-111, San Francisco, CA. Morgan Kaufmann.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).

John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1/2/3):159-196.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407.

Singh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, Cambridge, MA. The MIT Press.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9-44.

Szepesvári, C. (1995). General framework for reinforcement learning. In Proceedings of ICANN'95, Paris.

Szepesvári, C. and Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Brown University, Providence, RI.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, pages 58-67.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3).

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279-292.