Temporal Difference Learning in Continuous Time and Space
Kenji Doya
doya@hip.atr.co.jp
ATR Human Information Processing Research Laboratories
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Abstract
A continuous-time, continuous-state version of the temporal difference (TD) algorithm is derived in order to facilitate the application of reinforcement learning to real-world control tasks and neurobiological modeling. An optimal nonlinear feedback control law was also derived using the derivatives of the value function. The performance of the algorithms was tested in a task of swinging up a pendulum with limited torque. Both the "critic" that specifies the paths to the upright position and the "actor" that works as a nonlinear feedback controller were successfully implemented by radial basis function (RBF) networks.
1 INTRODUCTION
The temporal-difference (TD) algorithm (Sutton, 1988) for delayed reinforcement
learning has been applied to a variety of tasks, such as robot navigation, board
games, and biological modeling (Houk et al., 1994). Elucidation of the relationship
between TD learning and dynamic programming (DP) has provided good theoretical
insights (Barto et al., 1995). However, conventional TD algorithms were based on
discrete-time, discrete-state formulations. In applying these algorithms to control
problems, time, space and action had to be appropriately discretized using a priori
knowledge or by trial and error. Furthermore, when a TD algorithm is used for
neurobiological modeling, discrete-time operation is often very unnatural.
There have been several attempts to extend TD-like algorithms to continuous cases.
Bradtke et al. (1994) showed convergence results for DP-based algorithms for a
discrete-time, continuous-state linear system with a quadratic cost. Bradtke and
Duff (1995) derived TD-like algorithms for continuous-time, discrete-state systems
(semi-Markov decision problems). Baird (1993) proposed the "advantage updating"
algorithm by modifying Q-learning so that it works with arbitrarily small time steps.
2 CONTINUOUS-TIME TD LEARNING
We consider a continuous-time dynamical system (plant)
$$\dot{x}(t) = f(x(t), u(t)) \qquad (1)$$
where $x \in X \subset \mathbb{R}^n$ is the state and $u \in U \subset \mathbb{R}^m$ is the control input (action). We
denote the immediate reinforcement (evaluation) for the state and the action as
$$r(t) = r(x(t), u(t)). \qquad (2)$$
Our goal is to find a feedback control law (policy)
$$u(t) = \mu(x(t)) \qquad (3)$$
that maximizes the expected reinforcement for a certain period in the future. To
be specific, for a given control law $\mu$, we define the "value" of the state x(t) as
$$V^{\mu}(x(t)) = \int_t^{\infty} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds, \qquad (4)$$
where x(s) and u(s) ($t < s < \infty$) follow the system dynamics (1) and the control law (3). Our problem now is to find an optimal control law $\mu^*$ that maximizes $V^{\mu}(x)$ for any state $x \in X$. Note that $\tau$ is the time scale of "imminence-weighting" and the scaling factor $\frac{1}{\tau}$ is used for normalization, i.e., $\int_t^{\infty} \frac{1}{\tau} e^{-\frac{s-t}{\tau}}\, ds = 1$.
2.1 TD ERROR
The basic idea in TD learning is to predict future reinforcement in an on-line manner. We first derive a local consistency condition for the value function $V^{\mu}(x)$. By differentiating (4) with respect to t, we have
$$\tau\, \frac{d}{dt} V^{\mu}(x(t)) = V^{\mu}(x(t)) - r(t). \qquad (5)$$
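The step from (4) to (5) is a direct application of the Leibniz rule to the definition of $V^{\mu}$; spelled out (using only the definitions above),
$$\frac{d}{dt} V^{\mu}(x(t)) = -\frac{1}{\tau}\, r(t) + \frac{1}{\tau} \int_t^{\infty} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds = \frac{1}{\tau}\left( V^{\mu}(x(t)) - r(t) \right),$$
where the first term comes from the moving lower limit and the second from differentiating the kernel $e^{-\frac{s-t}{\tau}}$ with respect to t.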
Let P(t) be the prediction of the value function $V^{\mu}(x(t))$ from x(t) (the output of the "critic"). If the prediction is perfect, it should satisfy $\tau \dot{P}(t) = P(t) - r(t)$. If this is not satisfied, the prediction should be adjusted to decrease the inconsistency
$$\hat{r}(t) = r(t) - P(t) + \tau\, \dot{P}(t). \qquad (6)$$
This is a continuous version of the temporal difference error.
With the Euler approximation $\dot{P}(t) \simeq (P(t) - P(t - \Delta t))/\Delta t$, this coincides with the discrete-time TD error (7) if we take the "discount factor" $\gamma = 1 - \frac{\Delta t}{\tau} \simeq e^{-\frac{\Delta t}{\tau}}$, except for the scaling factor $\frac{1}{\Delta t}$.
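As a concrete illustration (not code from the paper; the function and parameter names below are ours), the Euler-discretized TD error can be computed as

    def continuous_td_error(r_t, P_t, P_prev, dt, tau):
        """Continuous-time TD error (6) with the Euler approximation
        dP/dt ~ (P(t) - P(t - dt)) / dt."""
        P_dot = (P_t - P_prev) / dt
        return r_t - P_t + tau * P_dot

Up to scaling factors on the error and the reward, this matches the conventional discrete TD error with discount factor $\gamma = 1 - \Delta t / \tau$, as noted above.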
Now let us consider the case when the prediction of the value function is given by
$$P(t) = \sum_i v_i\, b_i(x(t)), \qquad (8)$$
where $b_i(\cdot)$ are basis functions (e.g., sigmoid, Gaussian, etc.) and $v_i$ are the weights.
The gradient descent of the squared TD error is given by
$$\Delta v_i \propto -\frac{\partial \hat{r}^2(t)}{\partial v_i} \propto -\hat{r}(t) \left[ \left( 1 - \frac{\Delta t}{\tau} \right) \frac{\partial P(t)}{\partial v_i} - \frac{\partial P(t - \Delta t)}{\partial v_i} \right].$$
In order to "back up" the information about the future reinforcement to correct the prediction in the past, we should modify $P(t - \Delta t)$ rather than $P(t)$ in the above formula. This results in the learning rule
$$\Delta v_i \propto \hat{r}(t)\, \frac{\partial P(t - \Delta t)}{\partial v_i} = \hat{r}(t)\, b_i(x(t - \Delta t)).$$
This is equivalent to the TD(0) algorithm that uses the "eligibility trace" from the
previous time step.
Note that, when the eligibility trace is accumulated with a time constant $\tau_c$ rather than over a single time step, this is equivalent to the TD($\lambda$) algorithm (Sutton, 1988) with $\lambda = 1 - \frac{\Delta t}{\tau_c}$.
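A minimal Python sketch of a critic trained this way is given below. It assumes a linear function approximator over user-supplied basis functions; the learning rate, time step, and trace time constant are illustrative choices, not values from the paper.

    import numpy as np

    class Critic:
        """Value prediction P(t) = sum_i v_i * b_i(x), adapted by the
        continuous TD error with a decaying eligibility trace."""

        def __init__(self, basis, n_basis, dt=0.02, tau=1.0, tau_c=0.1, eta=0.5):
            self.basis = basis            # callable: state -> array of b_i(x)
            self.v = np.zeros(n_basis)    # weights v_i
            self.e = np.zeros(n_basis)    # eligibility trace e_i
            self.dt, self.tau, self.tau_c, self.eta = dt, tau, tau_c, eta
            self.P_prev = 0.0
            self.b_prev = np.zeros(n_basis)

        def update(self, x, r):
            """One Euler step: returns the TD error and adjusts the weights."""
            b = self.basis(x)
            P = float(self.v @ b)
            td = r - P + self.tau * (P - self.P_prev) / self.dt   # TD error (6)
            # eligibility trace: decay with time constant tau_c and accumulate
            # the gradient of the past prediction, dP(t-dt)/dv_i = b_i(x(t-dt))
            self.e = (1.0 - self.dt / self.tau_c) * self.e + self.b_prev
            self.v += self.eta * td * self.e * self.dt            # weight update
            self.P_prev, self.b_prev = P, b
            return td

With tau_c close to dt this reduces to the TD(0)-like rule above (only the previous time step contributes), while a larger tau_c spreads the credit further back in time, in line with the TD(lambda) correspondence.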
For the derivation of an optimal control law, we consider the optimal value function
$$V^*(x(t)) = \max_{u[t,\infty)} \left[ \int_t^{\infty} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds \right], \qquad (12)$$
which satisfies the consistency condition
$$V^*(x(t)) = \max_{u[t, t+\Delta t)} \left[ \int_t^{t+\Delta t} \frac{1}{\tau}\, e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds + e^{-\frac{\Delta t}{\tau}}\, V^*(x(t + \Delta t)) \right]. \qquad (13)$$
where the reinforcement for the state, $r_x(x)$, is still unknown. We also assume that the input gain of the system
$$b_j(x) = \frac{\partial f(x, u)}{\partial u_j}$$
is available. In this case, the optimality condition (14) for $u_j$ is given by
$$-G_j'(u_j) + \tau\, \frac{\partial V^*}{\partial x}\, b_j(x) = 0,$$
where $G_j(u_j)$ is the cost for the control $u_j$. Noting that the derivative $G'(\cdot)$ is a monotonic function since $G(\cdot)$ is convex, we have the optimal feedback control law
$$u_j = (G_j')^{-1}\!\left( \tau\, \frac{\partial V^*}{\partial x}\, b_j(x) \right). \qquad (15)$$
In particular, when the amplitude of the control is bounded as $|u_j| \le u_j^{max}$, we can enforce this constraint using a control cost
$$G_j(u_j) = c_j \int_0^{u_j} g^{-1}(s)\, ds, \qquad (16)$$
where $g(\cdot)$ is a monotonic function that saturates at $\pm u_j^{max}$. Substituting this cost into (15) yields a bounded control law of the form (17), which in the limit of a small cost coefficient $c_j$ approaches the "bang-bang" control law
$$u_j = u_j^{max}\, \mathrm{sign}\!\left[ \frac{\partial V^*}{\partial x}\, b_j(x) \right]. \qquad (18)$$
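The control laws (15)-(18) are straightforward to evaluate once a critic provides the value gradient. The sketch below is our illustration, not the paper's code: the choice of g as a tanh scaled to the torque bound and the cost coefficient c are assumptions.

    import numpy as np

    def value_gradient_control(dV_dx, B, u_max, tau=1.0, c=0.1):
        """Feedback control from the value gradient, cf. (15)-(18).
        dV_dx : estimated gradient dV*/dx of the value function, shape (n,)
        B     : input gain df/du with columns b_j(x), shape (n, m)
        Returns one control per input dimension, saturated at +/- u_max."""
        s = tau * (dV_dx @ B) / c        # argument of the saturating function g
        return u_max * np.tanh(s)        # smooth bounded control; as c -> 0 this
                                         # approaches u_max * sign(dV_dx @ B), i.e. (18)

The gradient dV*/dx can, for instance, be obtained analytically from the RBF expansion of the critic.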
Figure 2: Left: the learning curves for (a) optimal control and (c) actor-critic learning. $t_{up}$: time during which $|\theta| < 90°$. Right: (b) the predicted value function P after 100 trials of optimal control; (d) the output of the controller after 100 trials with actor-critic learning. The thick gray line shows the trajectory of the pendulum. Axis labels: th: $\theta$ (degrees); om: $\dot{\theta}$ (degrees/sec).
4 ACTOR-CRITIC
When the information about the control cost, the input gain of the system, or the
gradient of the value function is not available, we cannot use the above optimal
control law. However, the TD error (6) can be used as "internal reinforcement" for
training a stochastic controller, or an "actor" (Barto et al., 1983).
In the simulation below, we combined our TD algorithm for the critic with a reinforcement learning algorithm for real-valued output (Gullapalli, 1990). The output of the controller was given by a weighted sum of the basis functions, $\sum_i w_{ji} b_i(x(t))$, perturbed by exploration noise, where $n_j(t)$ is normalized Gaussian noise and $w_{ji}$ is a connection weight. The size of this perturbation was changed based on the predicted performance by $\sigma = \sigma_0 \exp(-P(t))$.
The connection weights were changed by
$$\Delta w_{ji} \propto \hat{r}(t)\, n_j(t)\, b_i(x(t)). \qquad (20)$$
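A sketch of such a stochastic actor is given below. It is again illustrative: the squashing of the output through a bounded function and the specific constants are our assumptions; only the noise schedule $\sigma = \sigma_0 \exp(-P(t))$ and the update rule (20) come from the text.

    import numpy as np

    class Actor:
        """Stochastic controller over the same RBF basis as the critic."""

        def __init__(self, basis, n_basis, n_u, u_max, eta=0.5, sigma0=1.0):
            self.basis = basis
            self.W = np.zeros((n_u, n_basis))   # weights w_ji
            self.u_max, self.eta, self.sigma0 = u_max, eta, sigma0
            self.noise = np.zeros(n_u)          # last noise sample n_j(t)
            self.b = np.zeros(n_basis)          # last basis activation b_i(x(t))

        def act(self, x, P):
            """Noisy output; exploration shrinks as the predicted value P grows."""
            self.b = self.basis(x)
            sigma = self.sigma0 * np.exp(-P)           # sigma = sigma_0 exp(-P(t))
            self.noise = np.random.randn(self.W.shape[0])
            net = self.W @ self.b + sigma * self.noise
            return self.u_max * np.tanh(net)           # bounded torque (assumed squashing)

        def update(self, td_error):
            # learning rule (20): Delta w_ji proportional to r_hat(t) n_j(t) b_i(x(t))
            self.W += self.eta * td_error * np.outer(self.noise, self.b)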
5 SIMULATION
The performance of the above continuous-time TD algorithm was tested on a task
of swinging up a pendulum with limited torque (Figure 1). Control of this one-
degree-of-freedom system is trivial near the upright equilibrium. However, bringing
the pendulum near the upright position is not trivial if we set the maximal torque $T^{max}$
smaller than $mgl$. The controller has to swing the pendulum several times to
build up enough momentum to bring it upright. Furthermore, the controller has to
decelerate the pendulum early enough to avoid falling over.
We used a radial basis function (RBF) network to approximate the value function for the state of the pendulum $x = (\theta, \dot{\theta})$. We prepared a fixed set of $12 \times 12$ Gaussian basis functions. This is a natural extension of the "boxes" approach previously used to control inverted pendulums (Barto et al., 1983). The immediate reinforcement was given by the height of the tip of the pendulum, i.e., $r_x = \cos\theta$.
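To make the setup concrete, the pendulum dynamics and the 12 x 12 Gaussian basis might be coded as follows. This is a sketch under assumed physical constants, basis widths, and state ranges; these values are not reported in the text above.

    import numpy as np

    # Pendulum with limited torque, theta = 0 at the upright position:
    #   m l^2 theta_ddot = -mu_f * theta_dot + m g l sin(theta) + T
    m, l, g, mu_f = 1.0, 1.0, 9.8, 0.01       # assumed constants (mu_f: friction)
    T_max = 0.75 * m * g * l                  # torque bound smaller than mgl

    def pendulum_step(theta, theta_dot, T, dt=0.02):
        """One Euler step of the dynamics; returns the new state and r_x = cos(theta)."""
        T = np.clip(T, -T_max, T_max)
        theta_ddot = (-mu_f * theta_dot + m * g * l * np.sin(theta) + T) / (m * l ** 2)
        return theta + dt * theta_dot, theta_dot + dt * theta_ddot, np.cos(theta)

    # Fixed 12 x 12 grid of Gaussian basis functions over (theta, theta_dot)
    c_th = np.linspace(-np.pi, np.pi, 12)
    c_om = np.linspace(-8.0, 8.0, 12)
    TH, OM = np.meshgrid(c_th, c_om)
    centers = np.stack([TH.ravel(), OM.ravel()], axis=1)     # shape (144, 2)
    widths = np.array([c_th[1] - c_th[0], c_om[1] - c_om[0]])

    def rbf_basis(x):
        """Normalized Gaussian activations b_i(x) for the state x = (theta, theta_dot)."""
        d = (centers - np.asarray(x)) / widths
        a = np.exp(-0.5 * np.sum(d ** 2, axis=1))
        return a / np.sum(a)

The same basis can be shared by the critic and the actor, as in the actor-critic experiment described in the next subsection.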
5.2 ACTOR-CRITIC
Next, we tested the actor-critic learning scheme as described above. The controller
was also implemented by an RBF network with the same $12 \times 12$ basis functions as
the critic network. It took about one hundred trials to achieve reliable performance
(Figure 2c). Figure 2d shows an example of the output of the controller after 100
trials. We can see nearly linear feedback in the neighborhood of the upright position
and a non-linear torque field away from the equilibrium.
6 CONCLUSION
We derived a continuous-time, continuous-state version of the TD algorithm and
showed its applicability to a nonlinear control task. One advantage of the continuous
formulation is that we can derive an explicit form of the optimal control law, as in (17),
using derivative information, whereas a one-ply search for the best action is usually
required in discrete formulations.
References
Baird III, L. C. (1993). Advantage updating. Technical Report WL-TR-93-1146,
Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301, USA.
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time
dynamic programming. Artificial Intelligence, 72:81-138.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive
elements that can solve difficult learning control problems. IEEE Transactions
on Systems, Man, and Cybernetics, SMC-13:834-846.
Bradtke, S. J. and Duff, M. O. (1995). Reinforcement learning methods for
continuous-time Markov decision problems. In Tesauro, G., Touretzky, D. S.,
and Leen, T. K., editors, Advances in Neural Information Processing Systems
7, pages 393-400. MIT Press, Cambridge, MA.
Bradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic
control using policy iteration. CMPSCI Technical Report 94-49, University of
Massachusetts, Amherst, MA.
Bryson, A. E., Jr. and Ho, Y.-C. (1975). Applied Optimal Control. Hemisphere
Publishing, New York, 2nd edition.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning
real-valued functions. Neural Networks, 3:671-692.
Hopfield, J. J. (1984). Neurons with graded response have collective computational
properties like those of two-state neurons. Proceedings of the National Academy
of Sciences, 81:3088-3092.
Houk, J. C., Adams, J. L., and Barto, A. G. (1994). A model of how the basal
ganglia generate and use neural signals that predict reinforcement. In Houk,
J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing
in the Basal Ganglia, pages 249-270. MIT Press, Cambridge, MA.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences.
Machine Learning, 3:9-44.