
Temporal Difference Learning in Continuous Time and Space

Kenji Doya
doya@hip.atr.co.jp
ATR Human Information Processing Research Laboratories
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan

Abstract

A continuous-time, continuous-state version of the temporal difference (TD) algorithm is derived in order to facilitate the application of reinforcement learning to real-world control tasks and neurobiological modeling. An optimal nonlinear feedback control law was also derived using the derivatives of the value function. The performance of the algorithms was tested in a task of swinging up a pendulum with limited torque. Both the "critic" that specifies the paths to the upright position and the "actor" that works as a nonlinear feedback controller were successfully implemented by radial basis function (RBF) networks.

1 INTRODUCTION
The temporal-difference (TD) algorithm (Sutton, 1988) for delayed reinforcement
learning has been applied to a variety of tasks, such as robot navigation, board
games, and biological modeling (Houk et al., 1994). Elucidation of the relationship
between TD learning and dynamic programming (DP) has provided good theoretical
insights (Barto et al., 1995). However, conventional TD algorithms were based on
discrete-time, discrete-state formulations. In applying these algorithms to control
problems, time, space and action had to be appropriately discretized using a priori
knowledge or by trial and error. Furthermore, when a TD algorithm is used for
neurobiological modeling, discrete-time operation is often very unnatural.
There have been several attempts to extend TD-like algorithms to continuous cases. Bradtke et al. (1994) showed convergence results for DP-based algorithms for a discrete-time, continuous-state linear system with a quadratic cost. Bradtke and Duff (1995) derived TD-like algorithms for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the "advantage updating" algorithm by modifying Q-learning so that it works with arbitrarily small time steps.

In this paper, we derive a TD learning algorithm for continuous-time, continuous-state, nonlinear control problems. The correspondence of the continuous-time version to the conventional discrete-time version is also shown. The performance of the algorithm was tested in a nonlinear control task of swinging up a pendulum with limited torque.

2 CONTINUOUS-TIME TD LEARNING
We consider a continuous-time dynamical system (plant)

$$\dot{x}(t) = f(x(t), u(t)) \qquad (1)$$

where $x \in X \subset R^n$ is the state and $u \in U \subset R^m$ is the control input (action). We denote the immediate reinforcement (evaluation) for the state and the action as

$$r(t) = r(x(t), u(t)). \qquad (2)$$

Our goal is to find a feedback control law (policy)

$$u(t) = \mu(x(t)) \qquad (3)$$

that maximizes the expected reinforcement for a certain period in the future. To be specific, for a given control law $\mu$, we define the "value" of the state $x(t)$ as

$$V^\mu(x(t)) = \int_t^\infty \frac{1}{\tau} e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds, \qquad (4)$$

where $x(s)$ and $u(s)$ ($t < s < \infty$) follow the system dynamics (1) and the control law (3). Our problem now is to find an optimal control law $\mu^*$ that maximizes $V^\mu(x)$ for any state $x \in X$. Note that $\tau$ is the time scale of "imminence-weighting" and the scaling factor $\frac{1}{\tau}$ is used for normalization, i.e., $\int_t^\infty \frac{1}{\tau} e^{-\frac{s-t}{\tau}}\, ds = 1$.

2.1 TD ERROR
The basic idea in TD learning is to predict future reinforcement in an on-line manner. We first derive a local consistency condition for the value function $V^\mu(x)$. By differentiating (4) by $t$, we have

$$\tau \frac{d}{dt} V^\mu(x(t)) = V^\mu(x(t)) - r(t). \qquad (5)$$

Let $P(t)$ be the prediction of the value function $V^\mu(x(t))$ from $x(t)$ (output of the "critic"). If the prediction is perfect, it should satisfy $\tau \dot{P}(t) = P(t) - r(t)$. If this is not satisfied, the prediction should be adjusted to decrease the inconsistency

$$\hat{r}(t) = r(t) - P(t) + \tau \dot{P}(t). \qquad (6)$$

This is a continuous version of the temporal difference error.

2.2 EULER DIFFERENTIATION: TD(0)

The relationship between the above continuous-time TD error and the discrete-time TD error (Sutton, 1988)

$$\hat{r}(t) = r(t) + \gamma P(t) - P(t - \Delta t) \qquad (7)$$

can be easily seen by a backward Euler approximation of $\dot{P}(t)$. By substituting $\dot{P}(t) = (P(t) - P(t - \Delta t))/\Delta t$ into (6), we have

$$\hat{r}(t) = r(t) + \frac{\tau}{\Delta t}\left[\left(1 - \frac{\Delta t}{\tau}\right) P(t) - P(t - \Delta t)\right].$$


This coincides with (7) if we make the "discount factor" $\gamma = 1 - \frac{\Delta t}{\tau} \simeq e^{-\frac{\Delta t}{\tau}}$, except for the scaling factor $\frac{\tau}{\Delta t}$.

Now let us consider a case when the prediction of the value function is given by

$$P(t) = \sum_i v_i\, b_i(x(t)), \qquad (8)$$

where $b_i()$ are basis functions (e.g., sigmoid, Gaussian, etc.) and $v_i$ are the weights. The gradient descent of the squared TD error is given by

$$\Delta v_i \propto -\frac{\partial \hat{r}^2(t)}{\partial v_i} \propto -\hat{r}(t)\left[\left(1 - \frac{\Delta t}{\tau}\right) \frac{\partial P(t)}{\partial v_i} - \frac{\partial P(t - \Delta t)}{\partial v_i}\right].$$

In order to "back up" the information about the future reinforcement to correct the prediction in the past, we should modify $P(t - \Delta t)$ rather than $P(t)$ in the above formula. This results in the learning rule

$$\Delta v_i \propto \hat{r}(t)\, \frac{\partial P(t - \Delta t)}{\partial v_i} = \hat{r}(t)\, b_i(x(t - \Delta t)). \qquad (9)$$

This is equivalent to the TD(0) algorithm that uses the "eligibility trace" from the previous time step.
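As an illustration only (not from the paper), the following is a minimal sketch of the Euler-discretized rule (9) for a linear critic of the form (8); the learning rate, variable names, and use of NumPy are assumptions.

```python
import numpy as np

def td0_update(v, b_prev, b_curr, r, dt, tau, lr=0.1):
    """One Euler-discretized TD(0) step for a linear critic P(x) = v . b(x).

    v       -- critic weights v_i
    b_prev  -- basis activations b_i(x(t - dt))
    b_curr  -- basis activations b_i(x(t))
    r       -- immediate reinforcement r(t)
    dt, tau -- time step and imminence-weighting time constant
    """
    P_prev = float(v @ b_prev)
    P_curr = float(v @ b_curr)
    # TD error (6) with the backward Euler approximation of dP/dt:
    # r_hat = r(t) - P(t) + tau * (P(t) - P(t - dt)) / dt
    r_hat = r - P_curr + tau * (P_curr - P_prev) / dt
    # Learning rule (9): the correction is applied through the basis
    # activations of the previous time step (one-step eligibility).
    v = v + lr * r_hat * b_prev
    return v, r_hat
```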

2.3 SMOOTH DIFFERENTIATION: TD(λ)

The Euler approximation of a time derivative is susceptible to noise (e.g., when we use stochastic control for exploration). Alternatively, we can use a "smooth" differentiation algorithm that uses a weighted average of the past input, such as

$$\dot{P}(t) \simeq \frac{P(t) - \bar{P}(t)}{\tau_c}, \quad \text{where} \quad \tau_c \frac{d}{dt}\bar{P}(t) = P(t) - \bar{P}(t)$$

and $\tau_c$ is the time constant of the differentiation. The corresponding gradient descent algorithm is

$$\Delta v_i \propto -\frac{\partial \hat{r}^2(t)}{\partial v_i} \propto \hat{r}(t)\, \frac{\partial \bar{P}(t)}{\partial v_i} = \hat{r}(t)\, \bar{b}_i(t), \qquad (10)$$

where $\bar{b}_i$ is the eligibility trace for the weight

$$\tau_c \frac{d}{dt}\bar{b}_i(t) = b_i(x(t)) - \bar{b}_i(t). \qquad (11)$$

Note that this is equivalent to the TD(λ) algorithm (Sutton, 1988) with $\lambda = 1 - \frac{\Delta t}{\tau_c}$ if we discretize the above equation with time step $\Delta t$.
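Likewise, a minimal sketch (not the authors' code) of the smooth-differentiation rule (10)-(11), in which both the filtered prediction and the eligibility traces are first-order low-pass filters with time constant tau_c; the forward-Euler integration, learning rate, and names are assumptions.

```python
import numpy as np

def td_lambda_step(v, b_bar, P_bar, b_curr, r, dt, tau, tau_c, lr=0.1):
    """One step of the TD(lambda)-like rule with smooth differentiation.

    b_bar -- eligibility traces: low-pass filtered basis activations, eq. (11)
    P_bar -- low-pass filtered prediction, used to estimate dP/dt
    """
    P = float(v @ b_curr)
    P_dot = (P - P_bar) / tau_c          # smooth estimate of dP/dt
    r_hat = r - P + tau * P_dot          # continuous TD error (6)
    v = v + lr * r_hat * b_bar           # learning rule (10)
    # Forward Euler integration of the first-order filters
    b_bar = b_bar + dt / tau_c * (b_curr - b_bar)
    P_bar = P_bar + dt / tau_c * (P - P_bar)
    return v, b_bar, P_bar, r_hat
```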

3 OPTIMAL CONTROL BY VALUE GRADIENT


3.1 HJB EQUATION
The value function $V^*$ for an optimal control $\mu^*$ is defined as

$$V^*(x(t)) = \max_{u[t,\infty)} \left[ \int_t^\infty \frac{1}{\tau} e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds \right]. \qquad (12)$$

According to the principle of dynamic programming (Bryson and Ho, 1975), we consider optimization in two phases, $[t, t + \Delta t]$ and $[t + \Delta t, \infty)$, resulting in the expression

$$V^*(x(t)) = \max_{u[t, t+\Delta t)} \left[ \int_t^{t+\Delta t} \frac{1}{\tau} e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds + e^{-\frac{\Delta t}{\tau}}\, V^*(x(t + \Delta t)) \right].$$

By Taylor expanding the value at $t + \Delta t$ as

$$V^*(x(t + \Delta t)) = V^*(x(t)) + \frac{\partial V^*}{\partial x} f(x(t), u(t))\, \Delta t + o(\Delta t)$$

and then taking $\Delta t$ to zero, we have a differential constraint for the optimal value function

$$V^*(x(t)) = \max_{u(t) \in U} \left[ r(x(t), u(t)) + \tau \frac{\partial V^*}{\partial x} f(x(t), u(t)) \right]. \qquad (13)$$
This is a variant of the Hamilton-Jacobi-Bellman equation (Bryson and Ho, 1975)
for a discounted case.
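Spelling out the limit explicitly: substituting the Taylor expansion and $e^{-\Delta t/\tau} = 1 - \frac{\Delta t}{\tau} + o(\Delta t)$ into the two-phase expression, subtracting $V^*(x(t))$ from both sides, and multiplying by $\tau/\Delta t$ gives (13) as $\Delta t \to 0$:

$$\begin{aligned}
V^*(x(t)) &= \max_{u}\left[\frac{\Delta t}{\tau}\, r(x(t),u(t)) + \Big(1 - \frac{\Delta t}{\tau}\Big)\Big(V^*(x(t)) + \frac{\partial V^*}{\partial x} f(x(t),u(t))\,\Delta t\Big)\right] + o(\Delta t)\\
0 &= \max_{u}\left[\frac{\Delta t}{\tau}\, r(x(t),u(t)) - \frac{\Delta t}{\tau}\, V^*(x(t)) + \frac{\partial V^*}{\partial x} f(x(t),u(t))\,\Delta t\right] + o(\Delta t)\\
V^*(x(t)) &= \max_{u(t)\in U}\left[r(x(t),u(t)) + \tau\, \frac{\partial V^*}{\partial x} f(x(t),u(t))\right].
\end{aligned}$$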

3.2 OPTIMAL NONLINEAR FEEDBACK CONTROL


When the reinforcement $r(x, u)$ is concave with respect to the control $u$, and the vector field $f(x, u)$ is linear with respect to $u$, the optimization problem in (13) has a unique solution. The condition for the optimal control is

$$\frac{\partial r(x, u)}{\partial u} + \tau \frac{\partial V^*}{\partial x} \frac{\partial f(x, u)}{\partial u} = 0. \qquad (14)$$

Now we consider the case when the cost for control is given by a convex potential function $G_j()$ for each control input

$$r(x, u) = r_x(x) - \sum_j G_j(u_j),$$

where the reinforcement for the state $r_x(x)$ is still unknown. We also assume that the input gain of the system

$$b_j(x) = \frac{\partial f(x, u)}{\partial u_j}$$

is available. In this case, the optimal condition (14) for $u_j$ is given by

$$-G_j'(u_j) + \tau \frac{\partial V^*}{\partial x} b_j(x) = 0.$$

Noting that the derivative $G_j'()$ is a monotonic function since $G_j()$ is convex, we have the optimal feedback control law

$$u_j = (G_j')^{-1}\!\left( \tau \frac{\partial V^*}{\partial x} b_j(x) \right). \qquad (15)$$

Particularly, when the amplitude of control is bounded as $|u_j| < u_j^{max}$, we can enforce this constraint using a control cost

$$G_j(u_j) = c_j \int_0^{u_j / u_j^{max}} g^{-1}(s)\, ds, \qquad (16)$$

where $g^{-1}()$ is an inverse sigmoid function that diverges at $\pm 1$ (Hopfield, 1984). In this case, the optimal feedback control law is given by

$$u_j = u_j^{max}\, g\!\left( \frac{u_j^{max}}{c_j}\, \tau \frac{\partial V^*}{\partial x} b_j(x) \right). \qquad (17)$$

In the limit of $c_j \to 0$, this results in the "bang-bang" control law

$$u_j = u_j^{max}\, \mathrm{sign}\!\left[ \frac{\partial V^*}{\partial x} b_j(x) \right]. \qquad (18)$$
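A minimal sketch of the bounded feedback law (17) and its bang-bang limit (18), assuming a single control channel and the sigmoid $g(x) = \frac{2}{\pi}\tan^{-1}(\frac{\pi}{2}x)$ used in the simulation below; function and variable names are illustrative.

```python
import numpy as np

def g(x):
    # Sigmoid saturating at +-1 with unit slope at the origin (Hopfield, 1984).
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * x)

def bounded_feedback(dV_dx, b_x, u_max, c, tau):
    """Sigmoid-bounded control law (17) for one control channel."""
    return u_max * g((u_max / c) * tau * float(np.dot(dV_dx, b_x)))

def bang_bang(dV_dx, b_x, u_max):
    """Limit c -> 0 of (17): the bang-bang law (18)."""
    return u_max * np.sign(float(np.dot(dV_dx, b_x)))
```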

Figure 1: A pendulum with limited torque. The dynamics is given by $ml^2\ddot{\theta} = -\mu\dot{\theta} + mgl\sin\theta + T$. Parameters were $m = l = 1$, $g = 9.8$, and $\mu = 0.01$.
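For concreteness, a minimal Euler-integration sketch of the pendulum dynamics of Figure 1 with the stated parameters; the step size and the angle convention ($\theta = 0$ upright, as suggested by the reinforcement $r_x = \cos\theta$ used later) are assumptions.

```python
import numpy as np

m, l, g_grav, mu = 1.0, 1.0, 9.8, 0.01   # parameters from Figure 1

def pendulum_step(theta, theta_dot, torque, dt=0.02):
    """One Euler step of  m l^2 theta_ddot = -mu theta_dot + m g l sin(theta) + T."""
    theta_ddot = (-mu * theta_dot + m * g_grav * l * np.sin(theta) + torque) / (m * l ** 2)
    theta_dot = theta_dot + dt * theta_ddot
    theta = theta + dt * theta_dot
    # Wrap the angle to [-pi, pi), with theta = 0 at the upright position.
    theta = (theta + np.pi) % (2.0 * np.pi) - np.pi
    return theta, theta_dot
```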

Figure 2: Left: the learning curves for (a) optimal control and (c) actor-critic. $t_{up}$: time during which $|\theta| < 90°$. Right: (b) the predicted value function P after 100 trials of optimal control; (d) the output of the controller after 100 trials with actor-critic learning. The thick gray line shows the trajectory of the pendulum. Axes: th = $\theta$ (degrees), om = $\dot{\theta}$ (degrees/sec).

4 ACTOR-CRITIC
When the information about the control cost, the input gain of the system, or the gradient of the value function is not available, we cannot use the above optimal control law. However, the TD error (6) can be used as "internal reinforcement" for training a stochastic controller, or an "actor" (Barto et al., 1983). In the simulation below, we combined our TD algorithm for the critic with a reinforcement learning algorithm for real-valued output (Gullapalli, 1990). The output of the controller was given by

$$u_j(t) = u_j^{max}\, g\!\left( \sum_i w_{ji}\, b_i(x(t)) + \sigma n_j(t) \right), \qquad (19)$$

where $n_j(t)$ is normalized Gaussian noise and $w_{ji}$ is a weight. The size of this perturbation was changed based on the predicted performance by $\sigma = \sigma_0 \exp(-P(t))$. The connection weights were changed by

$$\Delta w_{ji} \propto \hat{r}(t)\, n_j(t)\, b_i(x(t)). \qquad (20)$$
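A minimal sketch of the stochastic actor of (19)-(20), driven by the continuous TD error as internal reinforcement; the sigmoid, learning rate, and names are assumptions, not the authors' code.

```python
import numpy as np

def g(x):
    # Sigmoid saturating at +-1 (same role as in the optimal control law).
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * x)

def actor_step(W, b_x, P, r_hat, u_max, sigma0=0.01, lr=0.1, rng=np.random):
    """Stochastic real-valued actor (Gullapalli, 1990) trained by the TD error.

    W     -- actor weights w_ji, shape (n_controls, n_basis)
    b_x   -- basis activations b_i(x(t)) shared with the critic
    P     -- current value prediction, sets the exploration size
    r_hat -- continuous TD error, used as internal reinforcement
    """
    sigma = sigma0 * np.exp(-P)             # sigma = sigma_0 exp(-P(t))
    n = rng.standard_normal(W.shape[0])     # normalized Gaussian noise n_j(t)
    u = u_max * g(W @ b_x + sigma * n)      # controller output (19)
    W = W + lr * r_hat * np.outer(n, b_x)   # weight update (20)
    return u, W
```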

5 SIMULATION
The performance of the above continuous-time TD algorithm was tested on a task of swinging up a pendulum with limited torque (Figure 1). Control of this one-degree-of-freedom system is trivial near the upright equilibrium. However, bringing the pendulum near the upright position is not trivial if we set the maximal torque $T^{max}$ smaller than $mgl$. The controller has to swing the pendulum several times to build up enough momentum to bring it upright. Furthermore, the controller has to decelerate the pendulum early enough to avoid falling over.

We used a radial basis function (RBF) network to approximate the value function for the state of the pendulum $x = (\theta, \dot{\theta})$. We prepared a fixed set of 12 x 12 Gaussian basis functions. This is a natural extension of the "boxes" approach previously used to control inverted pendulums (Barto et al., 1983). The immediate reinforcement was given by the height of the tip of the pendulum, i.e., $r_x = \cos\theta$.
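A minimal sketch of a fixed 12 x 12 Gaussian RBF layer over the state $x = (\theta, \dot{\theta})$; the grid ranges and basis widths are not given in the paper and are assumed here.

```python
import numpy as np

# 12 x 12 grid of Gaussian centers over (theta, theta_dot); ranges and widths are guesses.
theta_centers = np.linspace(-np.pi, np.pi, 12)
omega_centers = np.linspace(-10.0, 10.0, 12)
centers = np.array([(th, om) for th in theta_centers for om in omega_centers])  # (144, 2)
widths = np.array([theta_centers[1] - theta_centers[0],
                   omega_centers[1] - omega_centers[0]])

def rbf_features(theta, theta_dot):
    """Gaussian basis activations b_i(x) for the pendulum state."""
    d = (np.array([theta, theta_dot]) - centers) / widths
    return np.exp(-0.5 * np.sum(d ** 2, axis=1))

# The critic is then P(x) = v . b(x) with v of shape (144,), and the actor
# uses the same 144 features with a weight matrix W of shape (1, 144).
```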

5.1 OPTIMAL CONTROL


First, we used the optimal control law (17) with the predicted value function P instead of $V^*$. We added noise to the control command to enhance exploration. The torque was given by

$$T = T^{max}\, g\!\left( \frac{T^{max}}{c}\, \tau \frac{\partial P(x)}{\partial x}\, b + \sigma n(t) \right),$$

where $g(x) = \frac{2}{\pi} \tan^{-1}\!\left(\frac{\pi}{2} x\right)$ (Hopfield, 1984). Note that the input gain $b = (0, 1/ml^2)^T$ was constant. Parameters were $T^{max} = 5$, $c = 0.1$, $\sigma_0 = 0.01$, $\tau = 1.0$, and $\tau_c = 0.1$.
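The control law above needs the value gradient $\partial P/\partial x$, which is available in closed form for a Gaussian RBF critic. A minimal sketch, with the same assumed centers and widths as before:

```python
import numpy as np

def critic_value_and_gradient(v, x, centers, widths):
    """P(x) = sum_i v_i b_i(x) and dP/dx for Gaussian basis functions.

    v       -- critic weights, shape (n_basis,)
    x       -- state (theta, theta_dot), shape (2,)
    centers -- basis centers, shape (n_basis, 2)
    widths  -- per-dimension basis widths, shape (2,)
    """
    d = (x - centers) / widths                  # (n_basis, 2)
    b = np.exp(-0.5 * np.sum(d ** 2, axis=1))   # basis activations b_i(x)
    db_dx = -b[:, None] * d / widths            # d b_i / d x
    return float(v @ b), v @ db_dx              # P(x), dP/dx of shape (2,)
```

The torque then follows by passing $\frac{T^{max}}{c}\,\tau\,\frac{\partial P}{\partial x} b + \sigma n(t)$ through the sigmoid $g$, with the constant input gain $b = (0, 1/ml^2)^T$.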
Each run was started from a random $\theta$ and was continued for 20 seconds. Within ten trials, the value function P became accurate enough to be able to swing up and hold the pendulum (Figure 2a). An example of the predicted value function P after 100 trials is shown in Figure 2b. The paths toward the upright position, which were implicitly determined by the dynamical properties of the system, can be seen as the ridges of the value function. We also had successful results when the reinforcement was given only near the goal: $r_x = 1$ if $|\theta| < 30°$, $-1$ otherwise.

5.2 ACTOR-CRITIC
Next, we tested the actor-critic learning scheme described above. The controller was also implemented by an RBF network with the same 12 x 12 basis functions as the critic network. It took about one hundred trials to achieve reliable performance (Figure 2c). Figure 2d shows an example of the output of the controller after 100 trials. We can see nearly linear feedback in the neighborhood of the upright position and a nonlinear torque field away from the equilibrium.

6 CONCLUSION
We derived a continuous-time, continuous-state version of the TD algorithm and showed its applicability to a nonlinear control task. One advantage of the continuous formulation is that we can derive an explicit form of the optimal control law, as in (17), using derivative information, whereas a one-ply search for the best action is usually required in discrete formulations.

References
Baird III, L. C. (1993). Advantage updating. Technical Report WL-TR-93-1146, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301, USA.

Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138.

Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834-846.

Bradtke, S. J. and Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 393-400. MIT Press, Cambridge, MA.

Bradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. CMPSCI Technical Report 94-49, University of Massachusetts, Amherst, MA.

Bryson, Jr., A. E. and Ho, Y.-C. (1975). Applied Optimal Control. Hemisphere Publishing, New York, 2nd edition.

Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3:671-692.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81:3088-3092.

Houk, J. C., Adams, J. L., and Barto, A. G. (1994). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia, pages 249-270. MIT Press, Cambridge, MA.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44.
