Temporal Difference Learning in Continuous Time and Space
Kenji Doya
doya@hip.atr.co.jp
ATR Human Information Processing Research Laboratories
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Abstract
A continuous-time, continuous-state version of the temporal difference (TD) algorithm is derived in order to facilitate the application of reinforcement learning to real-world control tasks and neurobiological modeling. An optimal nonlinear feedback control law was also derived using the derivatives of the value function. The performance of the algorithms was tested in a task of swinging up a pendulum with limited torque. Both the "critic" that specifies the paths to the upright position and the "actor" that works as a nonlinear feedback controller were successfully implemented by radial basis function (RBF) networks.
1 INTRODUCTION
The temporal-difference (TD) algorithm (Sutton, 1988) for delayed reinforcement
learning has been applied to a variety of tasks, such as robot navigation, board
games, and biological modeling (Houk et al., 1994). Elucidation of the relationship
between TD learning and dynamic programming (DP) has provided good theoretical
insights (Barto et al., 1995). However, conventional TD algorithms were based on
discrete-time, discrete-state formulations. In applying these algorithms to control
problems, time, space and action had to be appropriately discretized using a priori
knowledge or by trial and error. Furthermore, when a TD algorithm is used for
neurobiological modeling, discrete-time operation is often very unnatural.
There have been several attempts to extend TD-like algorithms to continuous cases.
Bradtke et al. (1994) showed convergence results for DP-based algorithms for a
discrete-time, continuous-state linear system with a quadratic cost. Bradtke and
Duff (1995) derived TD-like algorithms for continuous-time, discrete-state systems
(semi-Markov decision problems). Baird (1993) proposed the "advantage updating"
algorithm by modifying Q-learning so that it works with arbitrarily small time steps.
2 CONTINUOUS-TIME TD LEARNING
We consider a continuous-time dynamical system (plant)
$$\dot{x}(t) = f(x(t), u(t)) \qquad (1)$$
where $x \in X \subset \mathbb{R}^n$ is the state and $u \in U \subset \mathbb{R}^m$ is the control input (action). We
denote the immediate reinforcement (evaluation) for the state and the action as
$$r(t) = r(x(t), u(t)). \qquad (2)$$
Our goal is to find a feedback control law (policy)
$$u(t) = \mu(x(t)) \qquad (3)$$
that maximizes the expected reinforcement for a certain period in the future. To
be specific, for a given control law $\mu$, we define the "value" of the state x(t) as
$$V^{\mu}(x(t)) = \int_t^{\infty} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds, \qquad (4)$$
where x(s) and u(s) ($t < s < \infty$) follow the system dynamics (1) and the control law (3). Our problem now is to find an optimal control law $\mu^*$ that maximizes $V^{\mu}(x)$ for any state $x \in X$. Note that $\tau$ is the time scale of "imminence-weighting" and the scaling factor $\frac{1}{\tau}$ is used for normalization, i.e., $\int_t^{\infty} \frac{1}{\tau} e^{-\frac{s-t}{\tau}}\, ds = 1$.
2.1 TD ERROR
The basic idea in TD learning is to predict future reinforcement in an on-line manner. We first derive a local consistency condition for the value function $V^{\mu}(x)$. By differentiating (4) with respect to t, we have
$$\tau\, \frac{d}{dt} V^{\mu}(x(t)) = V^{\mu}(x(t)) - r(t). \qquad (5)$$
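The step from (4) to (5) is a direct application of the Leibniz rule to the definition of $V^{\mu}$; spelled out (using only the definitions above),
$$\frac{d}{dt} V^{\mu}(x(t)) = -\frac{1}{\tau}\, r(t) + \frac{1}{\tau} \int_t^{\infty} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds = \frac{1}{\tau}\left( V^{\mu}(x(t)) - r(t) \right),$$
where the first term comes from the moving lower limit and the second from differentiating the kernel $e^{-\frac{s-t}{\tau}}$ with respect to t.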
Let P(t) be the prediction of the value function $V^{\mu}(x(t))$ from x(t) (the output of the "critic"). If the prediction is perfect, it should satisfy $\tau \dot{P}(t) = P(t) - r(t)$. If this is not satisfied, the prediction should be adjusted to decrease the inconsistency
$$\hat{r}(t) = r(t) - P(t) + \tau\, \dot{P}(t). \qquad (6)$$
This is a continuous version of the temporal difference error.
With the Euler approximation $\dot{P}(t) \simeq (P(t) - P(t - \Delta t))/\Delta t$, this coincides with the discrete-time TD error (7) if we take the "discount factor" $\gamma = 1 - \frac{\Delta t}{\tau} \simeq e^{-\frac{\Delta t}{\tau}}$, except for the scaling factor $\frac{1}{\Delta t}$.
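As a concrete illustration (not code from the paper; the function and parameter names below are ours), the Euler-discretized TD error can be computed as

    def continuous_td_error(r_t, P_t, P_prev, dt, tau):
        """Continuous-time TD error (6) with the Euler approximation
        dP/dt ~ (P(t) - P(t - dt)) / dt."""
        P_dot = (P_t - P_prev) / dt
        return r_t - P_t + tau * P_dot

Up to scaling factors on the error and the reward, this matches the conventional discrete TD error with discount factor $\gamma = 1 - \Delta t / \tau$, as noted above.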
Now let us consider the case when the prediction of the value function is given by
$$P(t) = \sum_i v_i\, b_i(x(t)), \qquad (8)$$
where $b_i(\cdot)$ are basis functions (e.g., sigmoid, Gaussian, etc.) and $v_i$ are the weights.
The gradient descent of the squared TD error is given by
$$\Delta v_i \propto -\frac{\partial \hat{r}^2(t)}{\partial v_i} \propto -\hat{r}(t) \left[ \left( 1 - \frac{\Delta t}{\tau} \right) \frac{\partial P(t)}{\partial v_i} - \frac{\partial P(t - \Delta t)}{\partial v_i} \right].$$
In order to "back up" the information about the future reinforcement to correct the prediction in the past, we should modify $P(t - \Delta t)$ rather than $P(t)$ in the above formula. This results in the learning rule
$$\Delta v_i \propto \hat{r}(t)\, \frac{\partial P(t - \Delta t)}{\partial v_i} = \hat{r}(t)\, b_i(x(t - \Delta t)).$$
This is equivalent to the TD(0) algorithm that uses the "eligibility trace" from the
previous time step.
Note that, when the eligibility trace is accumulated with a time constant $\tau_c$ rather than over a single time step, this is equivalent to the TD($\lambda$) algorithm (Sutton, 1988) with $\lambda = 1 - \frac{\Delta t}{\tau_c}$.
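A minimal Python sketch of a critic trained this way is given below. It assumes a linear function approximator over user-supplied basis functions; the learning rate, time step, and trace time constant are illustrative choices, not values from the paper.

    import numpy as np

    class Critic:
        """Value prediction P(t) = sum_i v_i * b_i(x), adapted by the
        continuous TD error with a decaying eligibility trace."""

        def __init__(self, basis, n_basis, dt=0.02, tau=1.0, tau_c=0.1, eta=0.5):
            self.basis = basis            # callable: state -> array of b_i(x)
            self.v = np.zeros(n_basis)    # weights v_i
            self.e = np.zeros(n_basis)    # eligibility trace e_i
            self.dt, self.tau, self.tau_c, self.eta = dt, tau, tau_c, eta
            self.P_prev = 0.0
            self.b_prev = np.zeros(n_basis)

        def update(self, x, r):
            """One Euler step: returns the TD error and adjusts the weights."""
            b = self.basis(x)
            P = float(self.v @ b)
            td = r - P + self.tau * (P - self.P_prev) / self.dt   # TD error (6)
            # eligibility trace: decay with time constant tau_c and accumulate
            # the gradient of the past prediction, dP(t-dt)/dv_i = b_i(x(t-dt))
            self.e = (1.0 - self.dt / self.tau_c) * self.e + self.b_prev
            self.v += self.eta * td * self.e * self.dt            # weight update
            self.P_prev, self.b_prev = P, b
            return td

With tau_c close to dt this reduces to the TD(0)-like rule above (only the previous time step contributes), while a larger tau_c spreads the credit further back in time, in line with the TD(lambda) correspondence.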
For the derivation of an optimal control law, we consider the optimal value function
$$V^*(x(t)) = \max_{u[t,\infty)} \left[ \int_t^{\infty} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds \right], \qquad (12)$$
which satisfies the consistency condition
$$V^*(x(t)) = \max_{u[t, t+\Delta t)} \left[ \int_t^{t+\Delta t} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds + e^{-\frac{\Delta t}{\tau}}\, V^*(x(t + \Delta t)) \right]. \qquad (13)$$
where the reinforcement for the state, $r_x(x)$, is still unknown. We also assume that the input gain of the system
$$b_j(x) = \frac{\partial f(x, u)}{\partial u_j}$$
is available. In this case, the optimality condition (14) for $u_j$ is given by
$$-G_j'(u_j) + \tau\, \frac{\partial V^*}{\partial x}\, b_j(x) = 0,$$
where $G_j(u_j)$ is the cost for the control $u_j$. Noting that the derivative $G'(\cdot)$ is a monotonic function since $G(\cdot)$ is convex, we have the optimal feedback control law
$$u_j = (G_j')^{-1}\!\left( \tau\, \frac{\partial V^*}{\partial x}\, b_j(x) \right). \qquad (15)$$
In particular, when the amplitude of the control is bounded as $|u_j| \le u_j^{max}$, we can enforce this constraint using a control cost
$$G_j(u_j) = c_j \int_0^{u_j} g^{-1}(s)\, ds, \qquad (16)$$
where $g(\cdot)$ is a monotonic function that saturates at $\pm u_j^{max}$. Substituting this cost into (15) yields a bounded control law of the form (17), which in the limit of a small cost coefficient $c_j$ approaches the "bang-bang" control law
$$u_j = u_j^{max}\, \mathrm{sign}\!\left[ \frac{\partial V^*}{\partial x}\, b_j(x) \right]. \qquad (18)$$
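The control laws (15)-(18) are straightforward to evaluate once a critic provides the value gradient. The sketch below is our illustration, not the paper's code: the choice of g as a tanh scaled to the torque bound and the cost coefficient c are assumptions.

    import numpy as np

    def value_gradient_control(dV_dx, B, u_max, tau=1.0, c=0.1):
        """Feedback control from the value gradient, cf. (15)-(18).
        dV_dx : estimated gradient dV*/dx of the value function, shape (n,)
        B     : input gain df/du with columns b_j(x), shape (n, m)
        Returns one control per input dimension, saturated at +/- u_max."""
        s = tau * (dV_dx @ B) / c        # argument of the saturating function g
        return u_max * np.tanh(s)        # smooth bounded control; as c -> 0 this
                                         # approaches u_max * sign(dV_dx @ B), i.e. (18)

The gradient dV*/dx can, for instance, be obtained analytically from the RBF expansion of the critic.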
Figure 2: Left: the learning curves for (a) optimal control and (c) actor-critic learning. $t_{up}$: time during which $|\theta| < 90°$. Right: (b) the predicted value function P after 100 trials of optimal control; (d) the output of the controller after 100 trials with actor-critic learning. The thick gray line shows the trajectory of the pendulum. Axis labels: th: $\theta$ (degrees); om: $\dot{\theta}$ (degrees/sec).
4 ACTOR-CRITIC
When the information about the control cost, the input gain of the system, or the
gradient of the value function is not available, we cannot use the above optimal
control law. However, the TD error (6) can be used as "internal reinforcement" for
training a stochastic controller, or an "actor" (Barto et al., 1983).
In the simulation below, we combined our TD algorithm for the critic with a reinforcement learning algorithm for real-valued output (Gullapalli, 1990). The output of the controller was given by a weighted sum of the basis functions, $\sum_i w_{ji} b_i(x(t))$, perturbed by exploration noise, where $n_j(t)$ is normalized Gaussian noise and $w_{ji}$ is a connection weight. The size of this perturbation was changed based on the predicted performance by $\sigma = \sigma_0 \exp(-P(t))$.
The connection weights were changed by
$$\Delta w_{ji} \propto \hat{r}(t)\, n_j(t)\, b_i(x(t)). \qquad (20)$$
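A sketch of such a stochastic actor is given below. It is again illustrative: the squashing of the output through a bounded function and the specific constants are our assumptions; only the noise schedule $\sigma = \sigma_0 \exp(-P(t))$ and the update rule (20) come from the text.

    import numpy as np

    class Actor:
        """Stochastic controller over the same RBF basis as the critic."""

        def __init__(self, basis, n_basis, n_u, u_max, eta=0.5, sigma0=1.0):
            self.basis = basis
            self.W = np.zeros((n_u, n_basis))   # weights w_ji
            self.u_max, self.eta, self.sigma0 = u_max, eta, sigma0
            self.noise = np.zeros(n_u)          # last noise sample n_j(t)
            self.b = np.zeros(n_basis)          # last basis activation b_i(x(t))

        def act(self, x, P):
            """Noisy output; exploration shrinks as the predicted value P grows."""
            self.b = self.basis(x)
            sigma = self.sigma0 * np.exp(-P)           # sigma = sigma_0 exp(-P(t))
            self.noise = np.random.randn(self.W.shape[0])
            net = self.W @ self.b + sigma * self.noise
            return self.u_max * np.tanh(net)           # bounded torque (assumed squashing)

        def update(self, td_error):
            # learning rule (20): Delta w_ji proportional to r_hat(t) n_j(t) b_i(x(t))
            self.W += self.eta * td_error * np.outer(self.noise, self.b)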
5 SIMULATION
The performance of the above continuous-time TD algorithm was tested on a task
of swinging up a pendulum with limited torque (Figure 1). Control of this one-
degree-of-freedom system is trivial near the upright equilibrium. However, bringing
the pendulum near the upright position is not trivial if we set the maximal torque $T^{max}$
smaller than $mgl$. The controller has to swing the pendulum several times to
build up enough momentum to bring it upright. Furthermore, the controller has to
decelerate the pendulum early enough to avoid falling over.
We used a radial basis function (RBF) network to approximate the value function for the state of the pendulum $x = (\theta, \dot{\theta})$. We prepared a fixed set of $12 \times 12$ Gaussian basis functions. This is a natural extension of the "boxes" approach previously used to control inverted pendulums (Barto et al., 1983). The immediate reinforcement was given by the height of the tip of the pendulum, i.e., $r_x = \cos\theta$.
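To make the setup concrete, the pendulum dynamics and the 12 x 12 Gaussian basis might be coded as follows. This is a sketch under assumed physical constants, basis widths, and state ranges; these values are not reported in the text above.

    import numpy as np

    # Pendulum with limited torque, theta = 0 at the upright position:
    #   m l^2 theta_ddot = -mu_f * theta_dot + m g l sin(theta) + T
    m, l, g, mu_f = 1.0, 1.0, 9.8, 0.01       # assumed constants (mu_f: friction)
    T_max = 0.75 * m * g * l                  # torque bound smaller than mgl

    def pendulum_step(theta, theta_dot, T, dt=0.02):
        """One Euler step of the dynamics; returns the new state and r_x = cos(theta)."""
        T = np.clip(T, -T_max, T_max)
        theta_ddot = (-mu_f * theta_dot + m * g * l * np.sin(theta) + T) / (m * l ** 2)
        return theta + dt * theta_dot, theta_dot + dt * theta_ddot, np.cos(theta)

    # Fixed 12 x 12 grid of Gaussian basis functions over (theta, theta_dot)
    c_th = np.linspace(-np.pi, np.pi, 12)
    c_om = np.linspace(-8.0, 8.0, 12)
    TH, OM = np.meshgrid(c_th, c_om)
    centers = np.stack([TH.ravel(), OM.ravel()], axis=1)     # shape (144, 2)
    widths = np.array([c_th[1] - c_th[0], c_om[1] - c_om[0]])

    def rbf_basis(x):
        """Normalized Gaussian activations b_i(x) for the state x = (theta, theta_dot)."""
        d = (centers - np.asarray(x)) / widths
        a = np.exp(-0.5 * np.sum(d ** 2, axis=1))
        return a / np.sum(a)

The same basis can be shared by the critic and the actor, as in the actor-critic experiment described in the next subsection.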
5.2 ACTOR-CRITIC
Next, we tested the actor-critic learning scheme as described above. The controller
was also implemented by an RBF network with the same $12 \times 12$ basis functions as
the critic network. It took about one hundred trials to achieve reliable performance
(Figure 2c). Figure 2d shows an example of the output of the controller after 100
trials. We can see nearly linear feedback in the neighborhood of the upright position
and a non-linear torque field away from the equilibrium.
6 CONCLUSION
We derived a continuous-time, continuous-state version of the TD algorithm and
showed its applicability to a nonlinear control task. One advantage of the continuous
formulation is that we can derive an explicit form of the optimal control law, as in (17),
using derivative information, whereas a one-ply search for the best action is usually
required in discrete formulations.
References
Baird III, L. C. (1993). Advantage updating. Technical Report WL-TR-93-1146,
Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301, USA.
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time
dynamic programming. Artificial Intelligence, 72:81-138.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive
elements that can solve difficult learning control problems. IEEE Transactions
on Systems, Man, and Cybernetics, SMC-13:834-846.
Bradtke, S. J. and Duff, M. O. (1995). Reinforcement learning methods for
continuous-time Markov decision problems. In Tesauro, G., Touretzky, D. S.,
and Leen, T. K., editors, Advances in Neural Information Processing Systems
7, pages 393-400. MIT Press, Cambridge, MA.
Bradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic
control using policy iteration. CMPSCI Technical Report 94-49, University of
Massachusetts, Amherst, MA.
Bryson, A. E., Jr. and Ho, Y.-C. (1975). Applied Optimal Control. Hemisphere
Publishing, New York, 2nd edition.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning
real-valued functions. Neural Networks, 3:671-692.
Hopfield, J. J. (1984). Neurons with graded response have collective computational
properties like those of two-state neurons. Proceedings of the National Academy
of Sciences, 81:3088-3092.
Houk, J. C., Adams, J. L., and Barto, A. G. (1994). A model of how the basal
ganglia generate and use neural signals that predict reinforcement. In Houk,
J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing
in the Basal Ganglia, pages 249-270. MIT Press, Cambridge, MA.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences.
Machine Learning, 3:9-44.