This document presents a framework for solving the infinite horizon optimal tracking problem for unknown nonlinear systems using deep neural networks (DNNs) and reinforcement learning (RL). The proposed method employs a multitimescale concurrent learning-based weight update policy that allows for real-time updates of DNN weights while ensuring stability and convergence of the control policy. Simulation results demonstrate the effectiveness of the technique, showing improved tracking performance compared to existing methods.

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 5, MAY 2023 3171

Deep Neural Network-Based Approximate Optimal Tracking for Unknown Nonlinear Systems
Max L. Greene, Member, IEEE, Zachary I. Bell, Member, IEEE, Scott Nivison, Member, IEEE, and Warren E. Dixon, Fellow, IEEE

Abstract—The infinite horizon optimal tracking problem is solved for a deterministic, control-affine, unknown nonlinear dynamical system. A deep neural network (DNN) is updated in real time to approximate the unknown nonlinear system dynamics. The developed framework uses a multitimescale concurrent learning-based weight update policy, with which the output layer DNN weights are updated in real time, but the internal DNN features are updated discretely and at a slower timescale (i.e., with batch-like updates). The design of the output layer weight update policy is motivated by a Lyapunov-based analysis, and the inner features are updated according to existing DNN optimization algorithms. Simulation results demonstrate the efficacy of the developed technique and compare its performance to existing techniques.

Index Terms—Adaptive control, neural networks, nonlinear control, reinforcement learning.

I. INTRODUCTION
Reinforcement learning (RL) is a technique that facilitates adaptation in many computational problems, such as robotics, video game playing, supply chain management, and automatic control. Generally, RL-based agents interact with an environment, sense the state of the system, and perform an action that seeks to minimize or maximize a cost function [1]. The cost depends on the environment, state, and previous action(s) of the system. RL, unlike supervised learning, can evaluate the performance of a particular action without a teacher. This makes RL well suited to determine policies in which examples, or models, of desired behavior do not exist. These qualities have motivated the use of RL to obtain online approximate solutions to optimal control problems for systems with finite state-spaces as shown in [2].

The solution to the Hamilton–Jacobi–Bellman (HJB) equation is the optimal value function, which is used to determine the optimal control policy [3]. However, the HJB equation is a nonlinear partial differential equation that generally does not have an analytical solution. RL can be used to approximate the solution to the HJB equation using an approximate dynamic programming (ADP)-based approach, as in [2], [4], [5], [6], [7], and [8]. If the value function is successfully approximated, then a stabilizing optimal control policy can be determined.

An indirect measure of a given policy's (sub)optimality, called the Bellman error (BE), is derived from the HJB equation. In conventional ADP approaches, the BE is evaluated at on-trajectory points. In the model-based ADP method developed in [9] and used in this work, the BE is calculated at on- and off-trajectory points. This process is called BE extrapolation. BE extrapolation can provide simulation of experience; however, methods such as simulation of experience and BE extrapolation require a model of the system's dynamics. Results such as [10] use a model-free approach to solve the Hamilton–Jacobi–Isaacs equation. However, Modares et al. [10] rely on an initially stabilizing control policy and a sufficiently large set of data pairs, which are collected online, to successfully approximate the optimal control policy.

Using a model for methods such as simulation of experience or BE extrapolation enables faster learning in comparison to model-free methods. However, the need for a model can limit robustness and applicability. Motivated by this issue, the model-based ADP methods in [9] and [11] use a data-driven concurrent learning (CL)-based system identifier (see [12] and [13]) to simultaneously approximate the drift dynamics and, subsequently, the optimal control policy. Using a CL-based adaptation law provides guarantees on system parameter convergence, which are not obtained via traditional gradient or least-squares-based update laws. The result in [11] uses a CL-based policy to update the weights of a single hidden-layer NN in real time. However, recent evidence indicates that deep neural networks (DNNs) utilize a more complex structure to potentially improve the function approximation performance [14].
Manuscript received 30 November 2022; accepted 3 February 2023. Date of publication 20 February 2023; date of current version 26 April 2023. This work was supported in part by AFOSR under Grant FA9550-19-1-0169, in part by AFRL under Grant FA8651-21-F1027, and in part by ONR Grant N00014-21-1-2481. Recommended by Senior Editor Tetsuya Iwasaki and Guest Editors George J. Pappas, Anuradha M. Annaswamy, Manfred Morari, Claire J. Tomlin, Rene Vidal, and Melanie N. Zeilinger. (Corresponding author: Max L. Greene.)
Max L. Greene is with Aurora Flight Sciences, Cambridge, MA 02142 USA (e-mail: [email protected]).
Zachary I. Bell is with the Munitions Directorate, Air Force Research Laboratory, Eglin AFB, Navarre, FL 32566 USA (e-mail: [email protected]).
Scott Nivison is with the Johns Hopkins University Applied Physics Laboratory, Fort Walton Beach, FL 32578 USA (e-mail: scott.nivison@jhuapl.edu).
Warren E. Dixon is with the Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TAC.2023.3246761.
Digital Object Identifier 10.1109/TAC.2023.3246761

The results in [15] leverage a multitimescale DNN-based model reference adaptive controller. Similarly, the method in [16] uses a multitimescale DNN to approximate the unknown system dynamics, which, with a robust sliding-mode controller, facilitates a trajectory tracking objective. In [16], a gradient-based adaptation policy is used to update the output layer weights of the DNN in real time. Simultaneous to real-time execution, input–output data is stored and used to update the inner layer features using traditional offline DNN function approximation training methods. The inner layer features are updated iteratively (i.e., not in real time); specifically, the inner layer features are instantaneously implemented when the inner layer DNN update policies complete retraining based on user-defined criteria. Iteratively updating the inner layer features introduces discontinuities into the adaptation algorithm; these discontinuities propagate into the closed-loop error system. Hence, the Lyapunov-based stability result from [11] cannot be easily extended. A more rigorous Lyapunov-like analysis that considers piecewise-in-time discontinuities in the dynamics is required. Furthermore, the adaptive update policy in [16] cannot be easily extended to

0018-9286 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Florida. Downloaded on May 18,2023 at 19:58:55 UTC from IEEE Xplore. Restrictions apply.

facilitate system identification within the ADP framework. The result in [16] does not guarantee convergence of the output layer weights to their respective ideal values. To prove stability of the overall system with an ADP-based controller, the adaptive update policy of the output layer DNN weights must include a CL modification from [13], which further complicates the stability analysis (cf. the model-based ADP analyses in [9] and [11]).

A novel aspect of this article is the development of a multitimescale DNN system identifier to approximate the solution to the optimal trajectory tracking problem using an online model-based RL framework. Unlike the output layer DNN weight update policy in [16], the developed output layer DNN weight update policy is augmented with a CL-based term to facilitate parameter convergence, which is necessary to prove that the trajectories of the system are uniformly ultimately bounded (UUB). Furthermore, iteratively updating the inner layer features of the DNN introduces piecewise-in-time discontinuities into the dynamics, which must be considered in the stability analysis. A Lyapunov-based stability analysis proves that, while performing continuous-time updates to the output layer weights, along with iterative updates to the inner layer features of the DNN, the applied control policy converges to within a neighborhood of the optimal control policy, and the state trajectory converges to within a neighborhood of the desired state trajectory. The simulation results show that one iteration of DNN retraining improves tracking performance by 10.7%.

II. PROBLEM FORMULATION
Consider a class of nonlinear control-affine systems¹
$$\dot{x} = f(x) + g(x)u \tag{1}$$
where $x \in \mathbb{R}^n$ denotes the system state, $u \in \mathbb{R}^m$ denotes the control input, $f : \mathbb{R}^n \to \mathbb{R}^n$ denotes the drift dynamics, and $g : \mathbb{R}^n \to \mathbb{R}^{n \times m}$ denotes the control effectiveness matrix, with $n \ge m$ and such that the pseudoinverse of $g(x)$ exists. Let $x_d \in \mathbb{R}^n$ denote a time-varying continuously differentiable desired state trajectory, and let $e \triangleq x - x_d$ quantify the error between the actual and desired state. The following assumptions facilitate the formulation of the approximate optimal tracking controller [11].

¹ For notational brevity, the trajectory $x(t)$, where $x : \mathbb{R}_{\ge 0} \to \mathbb{R}^n$, is denoted as $x \in \mathbb{R}^n$ and referred to as $x$ instead of $x(t)$. For example, an equation of the form $f + h(y, t) = g(x)$ should be interpreted as $f(t) + h(y(t), t) = g(x(t))$ $\forall t \in \mathbb{R}_{\ge 0}$. The gradient $\left[\frac{\partial f(x,y)^T}{\partial x_1}, \ldots, \frac{\partial f(x,y)^T}{\partial x_n}\right]^T$ is denoted by $\nabla_x f(x, y)$. $\|\cdot\|$ denotes both the Euclidean norm for vectors and the Frobenius norm for matrices. $1_{n \times m}$ and $0_{n \times m}$ denote matrices of ones and zeros with $n$ rows and $m$ columns, respectively. $I_{n \times n}$ denotes an $n \times n$ identity matrix. $\mathrm{vec}(\cdot)$ denotes the vectorization operator.

Assumption 1: The function $f$ is a locally Lipschitz function and $f(0) = 0$. Furthermore, $\nabla_x f : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ is continuous.

Assumption 2: The function $g$ is a locally Lipschitz function, has full column rank for all $x \in \mathbb{R}^n$, and is bounded such that $\underline{g} \le \|g(x)\| \le \overline{g}$, where $\underline{g} \in \mathbb{R}_{>0}$ is the infimum over all $x$ of the minimum singular values of $g(x)$, and $\overline{g} \in \mathbb{R}_{>0}$ is the supremum over all $x$ of the maximum singular values of $g(x)$.

Assumption 3: The desired trajectory is bounded from above by a positive constant $\overline{x_d} \in \mathbb{R}$ such that $\sup_{t \in \mathbb{R}_{\ge 0}} \|x_d\| \le \overline{x_d}$.

Assumption 4: There exists a locally Lipschitz function $h_d : \mathbb{R}^n \to \mathbb{R}^n$ such that $h_d(x_d) \triangleq \dot{x}_d$ and $g(x_d) g^+(x_d)\left(h_d(x_d) - f(x_d)\right) = h_d(x_d) - f(x_d)$, $\forall t \in \mathbb{R}_{\ge 0}$, where $g^+ : \mathbb{R}^n \to \mathbb{R}^{m \times n}$ is defined as $g^+(x) \triangleq \left(g^T(x) g(x)\right)^{-1} g^T(x)$. It follows that $\sup_{t \in \mathbb{R}_{\ge 0}} \|g^+(x_d)\| \le \overline{g_d^+}$.

Remark 1: Assumptions 2–4 are the typical assumptions necessary to facilitate the transformation of this problem from a time-varying tracking problem to a time-invariant optimal control problem, as outlined in [17]. Assumptions 3 and 4 can be satisfied based on the user selection of $x_d$.

Based on Assumptions 2–4, the control policy $u_d : \mathbb{R}^n \to \mathbb{R}^m$, which tracks the desired trajectory (i.e., the trajectory tracking component of the controller), is $u_d(x_d) \triangleq g^+(x_d)\left(h_d(x_d) - f(x_d)\right)$. However, $u_d(x_d)$ cannot be calculated if the drift dynamics $f$ are unknown. Hence, an implementable approximation of the trajectory tracking controller component $\hat{u}_d$ is subsequently defined in Section III. Motivated by the desire to transform the time-varying tracking problem into an infinite horizon regulation problem, we follow the development in [17] to rewrite (1) as
$$\dot{\zeta} = F(\zeta) + G(\zeta)\mu \tag{2}$$
where $\zeta \in \mathbb{R}^{2n}$ is the concatenated state vector $\zeta \triangleq [e^T, x_d^T]^T$, $\mu \triangleq u - u_d(x_d)$ is the transient component of the controller, $F : \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ is defined as
$$F(\zeta) \triangleq \begin{bmatrix} f(e + x_d) - h_d(x_d) + g(e + x_d) u_d(x_d) \\ h_d(x_d) \end{bmatrix} \tag{3}$$
and $G : \mathbb{R}^{2n} \to \mathbb{R}^{2n \times m}$ is defined as
$$G(\zeta) \triangleq \left[ g(e + x_d)^T,\ 0_{m \times n} \right]^T. \tag{4}$$
From Assumption 2, it follows that $0 < \|G(\zeta)\| \le \overline{G}$, where $\overline{G} \in \mathbb{R}_{>0}$.

A. Control Objective
The control objective is to find a control policy $u$ that minimizes the cost functional
$$J(\zeta, \mu) = \int_0^\infty r(\zeta(\tau), \mu(\tau))\, d\tau \tag{5}$$
subject to (2) while eliminating the tracking error, where $r : \mathbb{R}^{2n} \times \mathbb{R}^m \to \mathbb{R}_{\ge 0}$ is the instantaneous cost, which is defined as $r(\zeta, \mu) \triangleq \overline{Q}(\zeta) + \mu^T R \mu$, where $\overline{Q} : \mathbb{R}^{2n} \to \mathbb{R}_{\ge 0}$ is a positive semidefinite (PSD) cost function, and $R \in \mathbb{R}^{m \times m}$ is a constant user-defined positive definite (PD) symmetric cost matrix. Let $Q(e) = \overline{Q}(\zeta)$ $\forall \zeta \in \mathbb{R}^{2n}$, $e \in \mathbb{R}^n$, where $Q : \mathbb{R}^n \to \mathbb{R}_{\ge 0}$ is a PD function.²

² $\overline{Q}$ is PSD and $Q$ is PD so that the desired trajectory $x_d$ is not penalized and the error $e$ is penalized, e.g., let $\overline{Q}(\zeta) = e^T Q e + x_d^T 0_{n \times n} x_d$.

Property 1: The function $\overline{Q}$ satisfies $\underline{q}(\|e\|) \le \overline{Q}(\zeta) \le \overline{q}(\|e\|)$ for $\underline{q}, \overline{q} : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$.

The scalar infinite-horizon value function for the optimal solution, i.e., the cost-to-go, denoted by $V^* : \mathbb{R}^{2n} \to \mathbb{R}_{\ge 0}$, is given by $V^*(\zeta) = \min_{\mu(\tau) \in U} \int_t^\infty r(\zeta(\tau), \mu(\tau))\, d\tau$, where $U \subseteq \mathbb{R}^m$ denotes the action space. If the optimal value function is continuously differentiable, then the optimal value function $V^*$ is a solution to the corresponding HJB equation
$$0 = \nabla_\zeta V^*(\zeta)\left(F(\zeta) + G(\zeta)\mu^*(\zeta)\right) + \overline{Q}(\zeta) + \mu^*(\zeta)^T R \mu^*(\zeta) \tag{6}$$
which has the boundary condition $V^*(0) = 0$, and the optimal policy $\mu^* : \mathbb{R}^{2n} \to \mathbb{R}^m$ is $\mu^*(\zeta) = -\frac{1}{2} R^{-1} G(\zeta)^T \left(\nabla_\zeta V^*(\zeta)\right)^T$.

B. Value Function Approximation
The optimal control policy can be derived from the HJB equation in (6); however, the optimal control policy requires knowledge of the optimal value function. Parametric methods can be used to approximate the optimal value function over a compact domain $\Omega \subset \mathbb{R}^{2n}$.³ The function $V^*$ is continuous, and an approximation is sought on the compact set $\Omega$.

³ The subsequent stability analysis in Theorem 1 proves that if $\zeta$ is initialized within an appropriately-sized subset of $\Omega$, then it will remain in $\Omega$.
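The transformation of the tracking problem (1) into the regulation form (2)–(4) can be checked numerically. The following is a minimal sketch for a scalar system ($n = m = 1$), not the authors' implementation: the particular choices of `f`, `g`, and `h_d` are illustrative assumptions, and in the scalar case the pseudoinverse $g^+$ reduces to $1/g$.

```python
def f(x):
    """Drift dynamics (illustrative assumption, not from the paper)."""
    return -x ** 3


def g(x):
    """Control effectiveness; bounded away from zero so that 1/g(x) exists."""
    return 1.0 + x ** 2


def h_d(x_d):
    """Desired-trajectory dynamics, i.e. x_d_dot = h_d(x_d) (illustrative)."""
    return -0.5 * x_d


def u_d(x_d):
    """Trajectory tracking component u_d = g^+(x_d) (h_d(x_d) - f(x_d))."""
    return (h_d(x_d) - f(x_d)) / g(x_d)


def F(zeta):
    """Concatenated drift in (3) for zeta = (e, x_d)."""
    e, x_d = zeta
    x = e + x_d
    return (f(x) - h_d(x_d) + g(x) * u_d(x_d), h_d(x_d))


def G(zeta):
    """Concatenated control effectiveness in (4)."""
    e, x_d = zeta
    return (g(e + x_d), 0.0)


def zeta_dot(zeta, u):
    """Right-hand side of (2) with the transient input mu = u - u_d(x_d)."""
    mu = u - u_d(zeta[1])
    return tuple(F_k + G_k * mu for F_k, G_k in zip(F(zeta), G(zeta)))
```

With this construction the first component of `zeta_dot` reproduces the error dynamics $\dot{e} = f(x) + g(x)u - h_d(x_d)$ for any input `u`, and the first component of `F` vanishes at $e = 0$, which is why the origin of the error subsystem is an equilibrium when $\mu = 0$.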


On the compact set $\Omega$, the Stone–Weierstrass Theorem is used to express the optimal value function as
$$V^*(\zeta) = W^T \sigma(\zeta) + \varepsilon(\zeta) \quad \forall \zeta \in \Omega \tag{7}$$
where $W \in \mathbb{R}^L$ is an unknown constant weight vector, $\sigma : \mathbb{R}^{2n} \to \mathbb{R}^L$ is a user-defined vector of activation functions, and $\varepsilon : \mathbb{R}^{2n} \to \mathbb{R}$ is the bounded function approximation error.

Assumption 5: There exist constants $\overline{W}, \overline{\sigma}, \overline{\nabla_\zeta \sigma}, \overline{\varepsilon}, \overline{\nabla_\zeta \varepsilon} \in \mathbb{R}_{>0}$ such that the unknown weights $W$, user-defined activation function $\sigma(\cdot)$, and function approximation error $\varepsilon$ can be bounded such that $\|W\| \le \overline{W}$, $\sup_{\zeta \in \Omega} \|\sigma(\zeta)\| \le \overline{\sigma}$, $\sup_{\zeta \in \Omega} \|\nabla_\zeta \sigma(\zeta)\| \le \overline{\nabla_\zeta \sigma}$, $\sup_{\zeta \in \Omega} \|\varepsilon(\zeta)\| \le \overline{\varepsilon}$, and $\sup_{\zeta \in \Omega} \|\nabla_\zeta \varepsilon(\zeta)\| \le \overline{\nabla_\zeta \varepsilon}$ [2, Assumptions 9.1.c-e].⁴

⁴ Assumption 5 can be satisfied by selecting polynomials as basis functions [18, Th. 1.5].

From (6) and (7), the optimal control policy $\mu^*$ can be expressed as
$$\mu^*(\zeta) = -\frac{1}{2} R^{-1} G(\zeta)^T \left( \nabla_\zeta \sigma(\zeta)^T W + \nabla_\zeta \varepsilon(\zeta)^T \right). \tag{8}$$
The ideal weights $W$ in (7) and (8) are unknown; hence, an approximation of $W$ is sought. Using an actor–critic approach (see [2]), the critic weight estimate $\hat{W}_c \in \mathbb{R}^L$ is used to approximate the optimal value function $\hat{V} : \mathbb{R}^{2n} \times \mathbb{R}^L \to \mathbb{R}$, denoted as
$$\hat{V}(\zeta, \hat{W}_c) = \hat{W}_c^T \sigma(\zeta). \tag{9}$$
Similarly, an actor weight estimate $\hat{W}_a \in \mathbb{R}^L$ is used to approximate the optimal control policy $\hat{\mu} : \mathbb{R}^{2n} \times \mathbb{R}^L \to \mathbb{R}^m$, defined as
$$\hat{\mu}(\zeta, \hat{W}_a) \triangleq -\frac{1}{2} R^{-1} G(\zeta)^T \nabla_\zeta \sigma(\zeta)^T \hat{W}_a. \tag{10}$$

III. DNN SYSTEM IDENTIFICATION
The HJB equation in (6) is equal to zero under optimal conditions; however, substituting the approximations (9) and (10) for $V^*(\zeta)$ and $\mu^*(\zeta)$ results in a residual term called the BE, which is defined in the subsequent section. To compute this residual term, $F(\zeta)$ and $G(\zeta)$, and therefore the system model (i.e., $f(x)$ and $g(x)$), must be known. If the system model is not known, then online system identification can be used to estimate the model in real time. The ADP result in [11] approximates $f(x)$ with a single hidden-layer NN online, and $g(x)$ is known. Recent works indicate that DNNs may potentially improve function approximation performance [14]. The result in [16] develops a multitimescale DNN-based control method to approximate $f(x)$ online, which may improve the approximation of $f(x)$ [14]. The output layer weights of the DNN are adjusted in real time using adaptive update laws motivated by a Lyapunov-based stability analysis. Concurrent to real-time execution, data are collected and DNN training algorithms (e.g., stochastic gradient descent [19, Ch. 8]) iteratively update the inner layer DNN features. Since DNN learning algorithms are performed iteratively, the inner layer weights are not updated in real time; the weights are discretely updated intermittently during task execution once training is complete. Motivated to apply the aforementioned technique to ADP, this section outlines the steps required to apply multitimescale DNN system identification to ADP.

DNN architectures can approximate continuous functions on a compact set; the ability to do so is based on universal approximation theorems that can be invoked case-by-case for specific DNN architectures [20]. The drift dynamics $f$ can be approximated on a compact set $C \subset \mathbb{R}^n$ as⁵
$$f(x) = \theta^T \phi(\Phi^*(x)) + \varepsilon_\theta^*(x) \quad \forall x \in C \tag{11}$$
where $\theta \in \mathbb{R}^{h \times n}$ is an unknown bounded ideal output layer weight matrix, $\phi : \mathbb{R}^p \to \mathbb{R}^h$ is a vector of activation functions, $\Phi^* : \mathbb{R}^n \to \mathbb{R}^p$ represents the unknown inner layer features of the DNN, and $\varepsilon_\theta^* : \mathbb{R}^n \to \mathbb{R}^n$ is a bounded function approximation error. For example, the unknown inner layer DNN features $\Phi^*$ can be expressed as $\Phi^*(x) = V_k \varphi_k(V_{k-1}, \varphi_{k-1}(V_{k-2}, \varphi_{k-2}(\ldots x)))$, where $k \in \mathbb{N}$ denotes the number of inner layers of the DNN, and $V_k$ and $\varphi_k(\cdot)$ denote the corresponding inner layer weights and activation functions of the DNN, respectively.

⁵ The subsequent stability analysis in Theorem 1 proves that if $x$ is initialized within an appropriately-sized subset of $C$, then it will remain in $C$.

Based on the DNN representation in (11), the $i$th DNN-based estimate of the drift dynamics $\hat{f}_i : \mathbb{R}^n \times \mathbb{R}^{h \times n} \to \mathbb{R}^n$ is defined as
$$\hat{f}_i(x, \hat{\theta}) = \hat{\theta}^T \phi(\hat{\Phi}_i(x)) \tag{12}$$
where $\hat{\theta} \in \mathbb{R}^{h \times n}$ is the estimate of the ideal output layer weight matrix $\theta$, and $\hat{\Phi}_i : \mathbb{R}^n \to \mathbb{R}^p$ is the $i$th iteration selection of the inner features with user-selected activation functions and estimated internal-layer weights. To facilitate the convergence of the DNN-based online system identifier, (12) can be used to develop an estimator
$$\dot{\hat{x}} = \hat{f}_i(x, \hat{\theta}) + g(x)u + k_o \tilde{x} \tag{13}$$
where $\tilde{x} \triangleq x - \hat{x}$, and $k_o \in \mathbb{R}_{>0}$ is a user-selected estimator learning gain.

Assumption 6: Similar to Assumption 5, there exist constant weights $\theta$ and positive constants $\overline{\theta}, \overline{\phi}, \overline{\nabla_x \phi}, \overline{\varepsilon_\theta^*}$, and $\overline{\nabla_x \varepsilon_\theta^*} \in \mathbb{R}_{\ge 0}$ such that $\|\theta\| \le \overline{\theta}$, $\sup_{x \in C} \|\phi(x)\| \le \overline{\phi}$, $\sup_{x \in C} \|\nabla_x \phi(x)\| \le \overline{\nabla_x \phi}$, $\sup_{x \in C} \|\varepsilon_\theta^*(x)\| \le \overline{\varepsilon_\theta^*}$, and $\sup_{x \in C} \|\nabla_x \varepsilon_\theta^*(x)\| \le \overline{\nabla_x \varepsilon_\theta^*}$ [21, Ch. 4].

Assumption 7: The $i$th user-selected inner layer features $\hat{\Phi}_i$ are selected such that $\|\Phi^*(x) - \hat{\Phi}_i(x)\| \le \|\tilde{\Phi}_i(x)\|$, where $\tilde{\Phi}_i : \mathbb{R}^n \to \mathbb{R}^p$ is the inner layer DNN function reconstruction error of the $i$th iteration, and $\sup_{x \in C,\, i \in \mathbb{N}} \|\tilde{\Phi}_i(x)\| \le \overline{\tilde{\Phi}}$, where $\overline{\tilde{\Phi}} \in \mathbb{R}_{\ge 0}$ is a bounded constant for all $i$. Using the mean value theorem, $\|\phi(\Phi^*(x)) - \phi(\hat{\Phi}_i(x))\| \le \overline{\nabla_x \phi}\, \overline{\tilde{\Phi}}$.⁶

⁶ Assumption 7 is a mild assumption that is required because the inner layer features of the DNN parameterization in (11) are not assumed to be known; this assumption is required to introduce a $\tilde{\theta}$ term in the subsequent stability analysis, where $\tilde{\theta} \triangleq \theta - \hat{\theta}$. For typical activation functions (e.g., radial basis functions, sigmoids), Assumption 7 can be easily satisfied with $\overline{\tilde{\Phi}} = 2p$.

In the developed method, a DNN with uncertain output layer parameters $\hat{\theta}$ is used to facilitate system identification in the sense that $\hat{F}$ approximates $F$. To enable convergence of $\hat{F}$ to $F$, CL-based parameter update laws are developed that use recorded data for learning. This CL strategy is leveraged to modify the output layer weight update law in [16]. As shown in the subsequent stability analysis, this modification enables $\hat{\theta}$ to converge to a region containing $\theta$. Specifically, the output layer DNN weight estimates are updated using the CL-based update law
$$\dot{\hat{\theta}} = \Gamma_\theta \phi(\hat{\Phi}_i(x)) \tilde{x}^T + k_\theta \Gamma_\theta \sum_{j=1}^{M} \phi(\hat{\Phi}_i(x_j)) \left( \dot{\bar{x}}_j - g(x_j) u_j - \hat{\theta}^T \phi(\hat{\Phi}_i(x_j)) \right)^T \tag{14}$$
where $\Gamma_\theta \in \mathbb{R}^{h \times h}$ and $k_\theta \in \mathbb{R}_{>0}$ are constant user-selected adaptation gains. Assumption 8 is required to achieve the aforementioned parameter convergence objective. Specifically, Assumption 8 specifies the quality of exploration data that is required by the history stacks in the second term of (14).

Assumption 8: A history stack of input–output data pairs $\{x_j, u_j\}_{j=1}^{M}$ and a history stack of numerically computed state derivatives $\{\dot{\bar{x}}_j\}_{j=1}^{M}$, which satisfy $\lambda_{\min}\left(\sum_{j=1}^{M} \phi(\hat{\Phi}_i(x_j)) \phi(\hat{\Phi}_i(x_j))^T\right) > 0$ and $\|\dot{\bar{x}}_j - \dot{x}_j\| < \bar{d}$ $\forall j$, are available a priori for each index $j$ of $x_j$.
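The estimator (13) and the CL-based update law (14) can be illustrated with a small simulation. The sketch below is written under strong simplifying assumptions that are not taken from the paper: a scalar state ($n = 1$), two frozen features standing in for $\phi(\hat{\Phi}_i(x))$, zero function approximation error, exact recorded derivatives, $\Gamma_\theta = \gamma I$, and a hand-picked stabilizing input `u = -2x`. All names (`features`, `theta_hat_dot`, `simulate`, and so on) are hypothetical.

```python
import math

THETA_TRUE = [1.0, -0.5]  # ideal output-layer weights (illustrative assumption)


def features(x):
    """phi(Phi_hat_i(x)) for one frozen inner-feature iterate (hypothetical choice)."""
    t = math.tanh(x)
    return [t, t * t]


def f_true(x):
    """Drift dynamics that the identifier is trying to learn."""
    return sum(w * p for w, p in zip(THETA_TRUE, features(x)))


def g(x):
    """Known control effectiveness."""
    return 1.0


def min_eig_stack(stack):
    """lambda_min of sum_j phi_j phi_j^T for h = 2 (closed form, symmetric 2x2)."""
    a = b = c = 0.0
    for xj, _, _ in stack:
        p = features(xj)
        a += p[0] * p[0]
        b += p[0] * p[1]
        c += p[1] * p[1]
    return 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)


def theta_hat_dot(theta_hat, x, x_tilde, stack, gamma, k_theta):
    """Right-hand side of (14) for n = 1 with Gamma_theta = gamma * I."""
    rate = [gamma * p * x_tilde for p in features(x)]  # gradient-like term
    for xj, uj, xdot_j in stack:  # concurrent-learning term over the history stack
        pj = features(xj)
        resid = xdot_j - g(xj) * uj - sum(t * p for t, p in zip(theta_hat, pj))
        rate = [r + k_theta * gamma * p * resid for r, p in zip(rate, pj)]
    return rate


def simulate(steps=1000, dt=0.01, gamma=2.0, k_theta=1.0, k_o=2.0):
    """Forward-Euler integration of the plant (1), estimator (13), and law (14)."""
    stack = [(xj, -2.0 * xj, f_true(xj) + g(xj) * (-2.0 * xj))
             for xj in (-1.0, -0.5, 0.5, 1.0)]
    x, x_hat, theta_hat = 1.0, 0.0, [0.0, 0.0]
    for _ in range(steps):
        u = -2.0 * x
        x_tilde = x - x_hat
        f_hat = sum(t * p for t, p in zip(theta_hat, features(x)))
        d_theta = theta_hat_dot(theta_hat, x, x_tilde, stack, gamma, k_theta)
        x_next = x + dt * (f_true(x) + g(x) * u)
        x_hat += dt * (f_hat + g(x) * u + k_o * x_tilde)
        theta_hat = [t + dt * d for t, d in zip(theta_hat, d_theta)]
        x = x_next
    return theta_hat, stack
```

Calling `simulate()` returns the final weight estimate together with the history stack, and `min_eig_stack(stack)` evaluates the eigenvalue condition of Assumption 8 for that stack. In this idealized setting the recorded data alone are rich enough to drive the estimate toward `THETA_TRUE`, even though the plant state decays to the origin and stops providing excitation.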


In Assumption 8, the minimum eigenvalue is strictly positive, the derivative estimates satisfy $\|\dot{\bar{x}}_j - \dot{x}_j\| < \bar{d}$ $\forall j$, $\bar{d} \in \mathbb{R}_{>0}$ is a known constant, $\dot{x}_j \triangleq f(x_j) + g(x_j)u_j$, and the operator $\lambda_{\min}(\cdot)$ represents the minimum eigenvalue of the argument [13].⁷

⁷ Availability of the system identification history stack (i.e., the tuple $\{x_j, u_j, \dot{\bar{x}}_j\}_{j=1}^{M}$) a priori is not necessary [11]. If the system is sufficiently excited and the history stack is recorded within a finite time, then the developed controller can be used. Switching between a PE-based controller and the developed controller results in a switched subsystem with one switching event. The stability of the overall system is determined from the stability of the individual subsystems.

Since the dynamics are unknown, the trajectory tracking component of the controller $u_d(x_d)$ is similarly not known. Hence, an approximation of the trajectory tracking component of the controller $\hat{u}_d : \mathbb{R}^n \times \mathbb{R}^{h \times n} \to \mathbb{R}^m$ is defined as $\hat{u}_d(x_d, \hat{\theta}) \triangleq g^+(x_d)\left(h_d(x_d) - \hat{f}_i(x_d, \hat{\theta})\right)$. The control policy applied to the system in (1) is
$$u \triangleq \hat{\mu}(\zeta, \hat{W}_a) + \hat{u}_d(x_d, \hat{\theta}). \tag{15}$$

While the contribution of this section focuses on updating the output layer weights in real time, updating the inner layer features of the DNN system identifier can lead to improved function approximation. Data stored in the CL history stack can be collected a priori and/or online and can simultaneously update the output layer weights and inner layer features of the DNN (i.e., update $\hat{\theta}$ in real time and update $\hat{\Phi}_i(x)$ from $i$ to $i + 1$) iteratively. Following (14) and using the CL history stack, the target dataset is $\{\dot{\bar{x}}_j - g(x_j)u_j\}_{j=1}^{M}$, and the respective input dataset is $\{x_j\}_{j=1}^{M}$.

IV. BE EXTRAPOLATION
Recall that the HJB equation in (6) is equal to zero under optimal conditions; hence, substituting (9), (10), and the approximated drift dynamics $\hat{f}_i(x, \hat{\theta})$ into (6) results in a residual term $\hat{\delta} : \mathbb{R}^{2n} \times \mathbb{R}^{h \times n} \times \mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}$, which is referred to as the BE, defined as
$$\hat{\delta}(\zeta, \hat{\theta}, \hat{W}_c, \hat{W}_a) \triangleq \hat{\mu}(\zeta, \hat{W}_a)^T R\, \hat{\mu}(\zeta, \hat{W}_a) + \overline{Q}(\zeta) + \nabla_\zeta \hat{V}(\zeta, \hat{W}_c)\left( \hat{F}_i(\zeta, \hat{\theta}) + G(\zeta)\hat{\mu}(\zeta, \hat{W}_a) \right) \tag{16}$$
where $\hat{F}_i : \mathbb{R}^{2n} \times \mathbb{R}^{h \times n} \to \mathbb{R}^{2n}$ is defined as
$$\hat{F}_i(\zeta, \hat{\theta}) \triangleq \left[ \hat{f}_i(e + x_d, \hat{\theta})^T - h_d(x_d)^T + u_d(x_d)^T g(e + x_d)^T,\ h_d(x_d)^T \right]^T. \tag{17}$$

Remark 2: Performing minimization of the BE in (16) results in the broader problem of solving the HJB equation in (6). For general nonlinear systems, the HJB equation lacks a general solution. Often, numerical methods are used offline to solve the HJB equation. For cases with known dynamics, the offline-obtained solution will result in closed-loop stability. However, there are cases, such as the one considered in this article, in which the model is unknown. Because of this, the multitimescale DNN identifier is used to approximate the system dynamics in (1) online, and this approximation of the model is simultaneously used in a model-based RL framework to approximate the solution to the HJB equation in real time. The subsequently defined critic weight update policy in (19) is designed to minimize the BE online.

The BE in (16) indicates how close the actor and critic weight estimates are to their respective ideal weights. The mismatches between the estimates and their ideal values are defined as $\tilde{W}_c \triangleq W - \hat{W}_c$ and $\tilde{W}_a \triangleq W - \hat{W}_a$. Substituting (7) and (8) into (6) and subtracting from (16) yields the analytical form of the BE given by
$$\hat{\delta}(\zeta, \hat{\theta}, \hat{W}_c, \hat{W}_a) = -\omega^T \tilde{W}_c - W^T \nabla_\zeta \sigma \left( F(\zeta) - \hat{F}_i(\zeta, \hat{\theta}) \right) + \frac{1}{4} \tilde{W}_a^T G_\sigma \tilde{W}_a + O(\zeta) \tag{18}$$
where $\omega : \mathbb{R}^{2n} \times \mathbb{R}^L \times \mathbb{R}^{h \times n} \to \mathbb{R}^L$ is defined as $\omega(\zeta, \hat{W}_a, \hat{\theta}) \triangleq \nabla_\zeta \sigma(\zeta)\left(\hat{F}_i(\zeta, \hat{\theta}) + G(\zeta)\hat{\mu}(\zeta, \hat{W}_a)\right)$, $O(\zeta) \triangleq \frac{1}{2} \nabla_\zeta \varepsilon(\zeta) G_R \nabla_\zeta \sigma(\zeta)^T W + \frac{1}{4} G_\varepsilon - W^T \nabla_\zeta \sigma(\zeta) \varepsilon_\theta^*(e + x_d) - \nabla_\zeta \varepsilon(\zeta) F(\zeta)$, $G_R = G_R(\zeta) \triangleq G(\zeta) R^{-1} G(\zeta)^T$, $G_\sigma = G_\sigma(\zeta) \triangleq \nabla_\zeta \sigma(\zeta) G_R(\zeta) \nabla_\zeta \sigma(\zeta)^T$, and $G_\varepsilon = G_\varepsilon(\zeta) \triangleq \nabla_\zeta \varepsilon(\zeta) G_R(\zeta) \nabla_\zeta \varepsilon(\zeta)^T$.

At each time instant $t \in \mathbb{R}_{\ge 0}$, the estimated BE in (16) and policy in (10) are evaluated using the current system state, critic estimate, actor estimate, and output layer weight estimate matrix to get the instantaneous BE and control policy, which are denoted by $\hat{\delta} \triangleq \hat{\delta}(\zeta, \hat{\theta}, \hat{W}_c, \hat{W}_a)$ and $\hat{\mu} \triangleq \hat{\mu}(\zeta, \hat{W}_a)$, respectively. The system model, which is approximated using the aforementioned DNN-based identifier, can be used to evaluate the BE at off-trajectory states in $\Omega$. The process of evaluating the BE at off-trajectory states is called BE extrapolation. BE extrapolation yields simultaneous exploitation and exploration, which results in faster policy learning over a domain.

To facilitate BE extrapolation, the off-policy trajectories $\{\zeta_e : \zeta_e \in \Omega\}_{e=1}^{N}$ are selected, where $N \in \mathbb{N}$ denotes the number of extrapolated trajectories in $\Omega$. The state-dependent extrapolation points can be selected a priori by the user or by using an online strategy.⁸ Using the extrapolated trajectories $\zeta_e \in \Omega$, the BE in (16) is evaluated such that $\hat{\delta}_e \triangleq \hat{\delta}(\zeta_e, \hat{\theta}, \hat{W}_c, \hat{W}_a)$. Let the tuple $(\Sigma_c, \Sigma_a, \Sigma_\Gamma)$ define the extrapolation data corresponding to $\Omega$ such that $\Sigma_c \triangleq \frac{1}{N} \sum_{e=1}^{N} \frac{\omega_e}{\rho_e} \hat{\delta}_e$, $\Sigma_a \triangleq \frac{1}{N} \sum_{e=1}^{N} \frac{G_{\sigma e}^T \hat{W}_a \omega_e^T}{4 \rho_e}$, and $\Sigma_\Gamma \triangleq \frac{1}{N} \sum_{e=1}^{N} \frac{\omega_e \omega_e^T}{\rho_e}$, where $\omega_e \triangleq \omega(\zeta_e, \hat{W}_a, \hat{\theta})$, $\rho_e \triangleq \rho(\zeta_e, \hat{W}_a, \hat{\theta}) = 1 + \nu \omega_e^T \Gamma \omega_e$, and $\Gamma \in \mathbb{R}^{L \times L}$ is a user-initialized learning gain.

⁸ The design of online strategies to determine extrapolation points could potentially improve learning and is a topic for future research.

To ensure that enough off-trajectory BE extrapolation data is selected to achieve sufficient exploration, Assumption 9 is provided, which facilitates the subsequent stability analysis.

Assumption 9: There exists a finite set of trajectories $\{\zeta_e : \zeta_e \in \Omega\}_{e=1}^{N}$ such that $0 < \underline{c} \triangleq \inf_{t \in \mathbb{R}_{\ge 0}} \lambda_{\min}\{\Sigma_\Gamma\}$ for all $t \in \mathbb{R}_{\ge 0}$, where $\lambda_{\min}\{\cdot\}$ is the minimum eigenvalue, and the constant $\underline{c}$ is the lower bound of the value of each input–output data pair's minimum eigenvalues.⁹

⁹ Assumption 9 can be verified online. Furthermore, Assumption 9 can be heuristically satisfied by selecting more BE extrapolation points than the number of neurons in $\sigma$ such that $N \gg L$ [9].

V. ACTOR AND CRITIC WEIGHT UPDATE LAWS
Using the instantaneous BE $\hat{\delta}$ and extrapolated BEs $\hat{\delta}_e$, the continuous-time least-squares-based update policies for the critic and actor weights, which are designed based on the subsequent stability analysis, are
$$\dot{\hat{W}}_c = -\eta_{c1} \Gamma \frac{\omega}{\rho} \hat{\delta} - \eta_{c2} \Gamma \Sigma_c \tag{19}$$
$$\dot{\Gamma} = \left( \lambda \Gamma - \eta_{c1} \frac{\Gamma \omega \omega^T \Gamma}{\rho^2} - \eta_{c2} \Gamma \Sigma_\Gamma \Gamma \right) 1_{\{\underline{\Gamma} \le \|\Gamma\| \le \overline{\Gamma}\}} \tag{20}$$
$$\dot{\hat{W}}_a = -\eta_{a1}\left( \hat{W}_a - \hat{W}_c \right) - \eta_{a2} \hat{W}_a + \eta_{c1} \frac{G_\sigma^T \hat{W}_a \omega^T}{4 \rho} \hat{W}_c + \eta_{c2} \Sigma_a \hat{W}_c \tag{21}$$
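The BE in (16), the extrapolation tuple $(\Sigma_c, \Sigma_a, \Sigma_\Gamma)$, and the right-hand sides of (19)–(21) can be sketched for a toy problem in which everything is scalar. This is an illustration under assumptions that are not the paper's setting: a known scalar linear model $F(\zeta) = A\zeta$, $G = B$ is used in place of the DNN-identified $\hat{F}_i$, the cost is $\overline{Q}(\zeta) = Q_W \zeta^2$, and a single basis function $\sigma(\zeta) = \zeta^2$ is used ($L = 1$), so the ideal weight is the positive root of the scalar Riccati equation $2AW - B^2W^2/R + Q_W = 0$. All constant and function names are hypothetical.

```python
import math

# Toy scalar problem (assumption): F(z) = A*z, G = B, Q(z) = Q_W*z^2, cost weight R_W.
A, B, Q_W, R_W = -1.0, 1.0, 1.0, 1.0
NU = 1.0                          # normalization gain nu used in rho
W_IDEAL = math.sqrt(2.0) - 1.0    # positive root of -2*W - W^2 + 1 = 0 for these values


def mu_hat(z, w_a):
    """Actor policy (10) with sigma(z) = z^2, so grad_sigma(z) = 2*z."""
    return -0.5 / R_W * B * (2.0 * z) * w_a


def omega(z, w_a):
    """Regressor omega = grad_sigma(z) * (F(z) + G * mu_hat(z, w_a))."""
    return 2.0 * z * (A * z + B * mu_hat(z, w_a))


def bellman_error(z, w_c, w_a):
    """BE (16): mu_hat^T R mu_hat + Q(z) + grad V_hat * (F + G mu_hat)."""
    return R_W * mu_hat(z, w_a) ** 2 + Q_W * z ** 2 + w_c * omega(z, w_a)


def extrapolation_data(points, w_c, w_a, gamma):
    """(Sigma_c, Sigma_a, Sigma_Gamma) averaged over off-trajectory points."""
    s_c = s_a = s_g = 0.0
    for z in points:
        w = omega(z, w_a)
        rho = 1.0 + NU * w * gamma * w
        g_sigma = (2.0 * z) ** 2 * B * B / R_W
        s_c += w / rho * bellman_error(z, w_c, w_a)
        s_a += g_sigma * w_a * w / (4.0 * rho)
        s_g += w * w / rho
    n = len(points)
    return s_c / n, s_a / n, s_g / n


def update_rates(z, w_c, w_a, gamma, points, gains):
    """Right-hand sides of (19), (20), and (21) for the scalar case."""
    eta_c1, eta_c2, eta_a1, eta_a2, lam, g_lo, g_hi = gains
    w = omega(z, w_a)
    rho = 1.0 + NU * w * gamma * w
    delta = bellman_error(z, w_c, w_a)
    s_c, s_a, s_g = extrapolation_data(points, w_c, w_a, gamma)
    g_sigma = (2.0 * z) ** 2 * B * B / R_W
    w_c_dot = -eta_c1 * gamma * w / rho * delta - eta_c2 * gamma * s_c
    in_band = 1.0 if g_lo <= abs(gamma) <= g_hi else 0.0  # indicator in (20)
    gamma_dot = (lam * gamma - eta_c1 * (gamma * w / rho) ** 2
                 - eta_c2 * gamma * s_g * gamma) * in_band
    w_a_dot = (-eta_a1 * (w_a - w_c) - eta_a2 * w_a
               + eta_c1 * g_sigma * w_a * w / (4.0 * rho) * w_c
               + eta_c2 * s_a * w_c)
    return w_c_dot, gamma_dot, w_a_dot
```

In this toy setting the BE vanishes at every extrapolation point when both weights equal `W_IDEAL`, and the critic rate from (19) has the opposite sign of the critic error when the actor is held at the ideal weight, which is the behavior the least-squares update is designed to produce. The actor rate is generally nonzero even at the ideal weights because of the $-\eta_{a2}\hat{W}_a$ term, which is consistent with a UUB rather than an asymptotic result.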


In (19)–(21), $\eta_{c1}, \eta_{c2}, \eta_{a1}, \eta_{a2}, \lambda \in \mathbb{R}_{>0}$ are constant learning gains, $\overline{\Gamma}$ and $\underline{\Gamma} \in \mathbb{R}_{>0}$ are upper and lower bound saturation constants, and $1_{\{\cdot\}}$ denotes the indicator function. The indicator function in (20) ensures that $\underline{\Gamma} \le \|\Gamma(t)\| \le \overline{\Gamma}$ for all $t \in \mathbb{R}_{\ge 0}$. The indicator function in (20) can be removed provided $\rho$ and $\rho_e$ are changed to $\rho = 1 + \nu \omega^T \omega$ and $\rho_e = 1 + \nu \omega_e^T \omega_e$, and additional assumptions are included for the regressors $\frac{\omega}{\rho}$ and $\Sigma_\Gamma$ to ensure that $\Gamma$ is bounded [22].

Remark 3: Under Assumptions 1–4, the optimal value function can be shown to be the unique PD solution of the HJB equation. Approximation of the PD solution to the HJB equation is guaranteed by appropriately selecting initial weight estimates and the Lyapunov-based update laws in (19)–(21) [23].

VI. STABILITY ANALYSIS
Recall from Property 1 that the function $\overline{Q}$ and, therefore, the optimal value function $V^*$ in (7) are PSD. Hence, $V^*$ is not a valid Lyapunov function. The result in [17] can be used to show that a nonautonomous form of $V^*$, denoted as $V_{na}^* : \mathbb{R}^n \times \mathbb{R}_{\ge 0} \to \mathbb{R}$ and defined as $V_{na}^*(e, t) \triangleq V^*(\zeta)$, is PD and decrescent. Furthermore, $V_{na}^*(0, t) = 0$ and there exist class $\mathcal{K}_\infty$ functions $\underline{v}, \overline{v} : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ that bound $\underline{v}(\|e\|) \le V_{na}^*(e, t) \le \overline{v}(\|e\|)$ $\forall e \in \mathbb{R}^n$, $t \in \mathbb{R}_{\ge 0}$. Hence, $V_{na}^*(e, t)$ is a valid Lyapunov function. Let $Z \in \mathbb{R}^{2n + 2L + hn}$ denote a concatenated state defined as $Z \triangleq [e^T, \tilde{W}_c^T, \tilde{W}_a^T, \tilde{x}^T, \mathrm{vec}(\tilde{\theta})^T]^T$. Let $V_L : \mathbb{R}^{2n + 2L + hn} \times \mathbb{R}_{\ge 0} \to \mathbb{R}$ be a candidate Lyapunov function defined as
$$V_L(Z, t) \triangleq V_{na}^*(e, t) + \frac{1}{2} \tilde{W}_c^T \Gamma(t)^{-1} \tilde{W}_c + \frac{1}{2} \tilde{W}_a^T \tilde{W}_a + \frac{1}{2} \tilde{x}^T \tilde{x} + \frac{1}{2} \mathrm{tr}\left( \tilde{\theta}^T \Gamma_\theta^{-1} \tilde{\theta} \right). \tag{22}$$
Using the properties of $V_{na}^*(e, t)$ and [24, Lemma 4.3], (22) can be bounded as $\alpha_1(\|Z\|) \le V_L(Z, t) \le \alpha_2(\|Z\|)$ for class $\mathcal{K}_\infty$ functions $\alpha_1, \alpha_2 : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$. Using (20), the normalized regressors $\frac{\omega}{\rho}$ and $\frac{\omega_e}{\rho_e}$ can be bounded as $\sup_{t \in \mathbb{R}_{\ge 0}} \left\| \frac{\omega}{\rho} \right\| \le \frac{1}{2\sqrt{\nu \underline{\Gamma}}}$ for all $\zeta \in \Omega$ and $\sup_{t \in \mathbb{R}_{\ge 0}} \left\| \frac{\omega_e}{\rho_e} \right\| \le \frac{1}{2\sqrt{\nu \underline{\Gamma}}}$ for all $\zeta_e \in \Omega$. The matrices $G_R$ and $G_\sigma$ can be bounded as $\sup_{\zeta \in \Omega} \|G_R\| \le \lambda_{\max}\{R^{-1}\} \overline{G}^2 \triangleq \overline{G_R}$ and $\sup_{\zeta \in \Omega} \|G_\sigma\| \le (\overline{\nabla_\zeta \sigma}\, \overline{G})^2 \lambda_{\max}\{R^{-1}\} \triangleq \overline{G_\sigma}$, respectively, where $\lambda_{\max}\{\cdot\}$ denotes the maximum eigenvalue.

To facilitate the subsequent stability analysis, let $r \in \mathbb{R}_{>0}$ be the radius of a compact ball $\chi \subset \mathbb{R}^{2n + 2L + hn}$ centered at the origin, and $l \in \mathbb{R}_{>0}$ is a positive constant that depends on the

The optimal value function is parameterized with a linear combination of weights and basis functions; this has been done in results such as [2]. However, the multitimescale DNN identifier introduces new terms and piecewise-in-time discontinuities into the dynamics. Hence, existing actor–critic approaches cannot be applied to show stability of the closed-loop system. The subsequent Lyapunov-based stability analysis is performed to analyze the convergence and stability properties of the online implementation of (13), (14), and (19)–(21).

Theorem 1: Given the dynamics in (1), if Assumptions 1–9 and the conditions in (23)–(25) are satisfied, then the tracking error $e$, weight estimation errors $\tilde{W}_c$ and $\tilde{W}_a$, state estimation error $\tilde{x}$, and output layer weight matrix error $\tilde{\theta}$ are UUB. Hence, the applied control policy $\hat{u}$ converges to a neighborhood of the optimal control policy $u^*$.

Proof: Using (1), the fact that $\dot{V}_{na}^*(e, t) = \dot{V}^*(\zeta)$, $\dot{V}^*(\zeta) = \nabla_\zeta V^*(\zeta)\left(F(\zeta) + G(\zeta)\mu\right)$, (13), (14), (19)–(21), Young's Inequality, nonlinear damping, the class of dynamics in (2), Assumptions 8 and 9, and substituting the sufficient conditions in (23) and (24) yields
$$\dot{V}_L \le -\nu_l(\|Z\|), \quad \forall\, \nu_l^{-1}(l) \le \|Z\| \le \alpha_2^{-1}(\alpha_1(r)) \quad \forall t \in \mathbb{R}_{\ge 0} \tag{26}$$
where $\nu_l(\|Z\|) \triangleq \frac{1}{2} \underline{q}(\|e\|) + \frac{1}{16} \eta_{c2}\, \underline{c}\, \|\tilde{W}_c\|^2 + \frac{1}{16}(\eta_{a1} + \eta_{a2}) \|\tilde{W}_a\|^2 + \frac{k_o}{4} \|\tilde{x}\|^2 + \frac{k_\theta}{6} \lambda_{\min}\left\{ \sum_{j=1}^{M} \phi(\hat{\Phi}_i(x_j)) \phi(\hat{\Phi}_i(x_j))^T \right\} \|\mathrm{vec}(\tilde{\theta})\|^2$.

Since the discontinuities in the update laws in (13), (14), and (19)–(21) are piecewise continuous in time and (22) is a common Lyapunov function across each DNN iteration $i$, [24, Th. 4.18] can be invoked to conclude that $Z$ is UUB such that $\limsup_{t \to \infty} \|Z(t)\| \le \alpha_1^{-1}(\alpha_2(\nu_l^{-1}(l)))$ and $\hat{\mu}$ converges to a neighborhood around the optimal policy $\mu^*$. Since $Z \in \mathcal{L}_\infty$, it follows that $e, \tilde{W}_c, \tilde{W}_a, \tilde{x}, \tilde{\theta} \in \mathcal{L}_\infty$; hence, $x, \hat{W}_c, \hat{W}_a, \hat{\theta} \in \mathcal{L}_\infty$ and $\mu \in \mathcal{L}_\infty$.

Using (26), the result in [24, Th. 4.18] can be invoked to show that every trajectory $Z(t)$ that satisfies the initial condition $\|Z(0)\| \le \alpha_2^{-1}(\alpha_1(r))$ is bounded for all $t \in \mathbb{R}_{\ge 0}$. That is, $Z \in \chi$ $\forall t \in \mathbb{R}_{\ge 0}$. Since $Z \in \chi$, it follows that the individual states of $Z$ lie on compact sets.¹¹ Furthermore, since $\|x_d\| \le \overline{x_d}$, then $\zeta \in \Omega$ and $x \in C$, where $\Omega$ is the compact set that facilitates value function approximation, and $C$ is the compact set that facilitates DNN-based system identification.

VII. SIMULATION EXAMPLE
The following section applies the developed technique to an optimal tracking problem for an autonomous undersea vehicle (AUV) with the instantaneous cost function $r(\zeta, \mu) = e^T Q e + \mu^T R \mu$. The system dynamics for the AUV in this example are from [26]. To focus the scope of this simulation section, it is assumed that the AUV is neutrally buoyant
while submerged, the center of gravity is below the center of buoyancy
bounded NN constants in Assumptions 5–7. The sufficient conditions
on the z-axis, and the vehicle model accounts for small roll and pitch
for ultimate boundedness of Z are derived based on the subsequent
angles. The dynamics for the AUV in an irrotational current can be
stability analysis as
expressed as
W Gσ
ηa1 + ηa2 ≥ (ηc1 + ηc2 ) √ (23) ξ˙ = fH (ζ, νc ) + f0 (ζ, ν̇c ) + gτb (27)
νΓ
6
2 2
where ξ  [ηAU T
V νAU V ] ∈ R is the concatenated state vector, f0 :
T T

ηa1 (ηc1 + ηc2 )2 W Gσ R × R → R is the known rigid body drift dynamics, fH : R6 ×


6 3 6
c≥4 +
ηc2 4ηc2 νΓ (ηa1 + ηa2 ) R3 → R6 is the unknown hydrodynamic parameter effects (see [26]
2 2
 2 for definitions of the states and dynamics). The rigid body dynamics
3 (ηc1 + ηc2 )2 W ∇ζ σ φ + gd+ φg are assumed to be known because they are measurable a priori, whereas
+     (24) the hydrodynamic parameters are not known. The irrotational current
T
M
2ηc2 kθ νΓλmin j=1 φ Φi (xj ) φ Φi (xj ) vector for this example is νc = [−0.1, 0.1, 0]T .
The time-varying desired trajectory is ξd (t) =
π t) π t)
νl−1 (l) < α2−1 (α1 (r)) (25) π
[cos( 20 π
t), cos( 30 t), 0, −
π sin( 20
, −
π sin( 30
, 0] T
, and the control
20 30
where νl is a subsequently defined PD function.10
11 See [25, Algorithm A.2] for discussion on establishing the size of compact
10 See [9] for insight into satisfying the conditions in (23)–(25). sets χ.

Authorized licensed use limited to: University of Florida. Downloaded on May 18,2023 at 19:58:55 UTC from IEEE Xplore. Restrictions apply.
3176 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 5, MAY 2023

TABLE I
SIMULATION RESULTS AND ADP COMPARISON

Fig. 1. DNN is composed of four layers, each with 30, 10, 15, and 6
neurons, respectively.

objective is to minimize the infinite horizon cost function in (5). The


drift dynamics are unknown and approximated using the developed
DNN-based system identification method. The DNN used in this
simulation was composed of four layers, each with 30, 10, 15, and 6
neurons, respectively. The DNN architecture is illustrated in Fig. 1.
The first, second, and third layers use Elliot symmetric sigmoid,
logarithmic sigmoid, and tangent sigmoid activation functions,
respectively.12 The first, second, and third layers include bias terms.
The mean squared error was used as the loss function for training. The
Levenberg–Marquardt algorithm was used to train the weights of the
DNN. For each DNN training iteration, 70% of the data was used for
training, 15% was used for validation, and 15% was used for testing.
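The layer structure described above can be illustrated with a short forward-pass sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the input dimension of 6 (the AUV state), the linear bias-free output layer, the random initialization scale, and the names `init_dnn`, `features`, and `predict` are all hypothetical. The shape of the output-layer matrix here simply follows from the 15-neuron third layer and is not taken from the paper. Training with Levenberg–Marquardt is not shown.

```python
import numpy as np

def elliotsig(x):
    # Elliot symmetric sigmoid: x / (1 + |x|), output in (-1, 1)
    return x / (1.0 + np.abs(x))

def logsig(x):
    # Logarithmic sigmoid: 1 / (1 + exp(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tansig(x):
    # Tangent sigmoid (hyperbolic tangent), output in (-1, 1)
    return np.tanh(x)

# Input size 6 is an assumption; 30, 10, 15, 6 are the layer widths from the text.
LAYER_SIZES = [6, 30, 10, 15, 6]
ACTIVATIONS = [elliotsig, logsig, tansig]

def init_dnn(rng):
    """Randomly initialize inner-layer weights/biases and output-layer weights."""
    inner = []
    for n_in, n_out in zip(LAYER_SIZES[:3], LAYER_SIZES[1:4]):
        W = 0.1 * rng.standard_normal((n_out, n_in))
        b = np.zeros(n_out)  # the first three layers include bias terms
        inner.append((W, b))
    # Output-layer weights (assumed linear and bias-free in this sketch)
    theta = 0.1 * rng.standard_normal((LAYER_SIZES[3], LAYER_SIZES[4]))
    return inner, theta

def features(inner, x):
    """Inner-layer features: the output of the third layer."""
    h = x
    for (W, b), act in zip(inner, ACTIVATIONS):
        h = act(W @ h + b)
    return h

def predict(inner, theta, x):
    """Drift-dynamics estimate: output weights applied to the inner features."""
    return theta.T @ features(inner, x)
```

Splitting the network into `features` and `theta` mirrors the separation used in the article between inner layer features, which are retrained iteratively, and output layer weights, which are updated in real time.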
Fig. 2. Error comparisons between the retrained DNN ADP and random DNN ADP methods. The red dashed line at t = 120 s represents the beginning of the retraining, and the black dashed line at approximately t = 132.8 s represents the end of the retraining and when the new internal DNN weights are implemented. The retrained DNN has improved tracking performance. Since the random DNN and retrained DNN cases are initialized identically, they have identical performance for the first 120 s (i.e., until the inner layer DNN features are updated).

The controller cost parameters in (5) are $Q = \mathrm{diag}([100, 100, 200, 10, 10, 50])$ and $R = I_{3 \times 3}$. A total of $N = 110592$ BE extrapolation trajectories were selected across the operating domain $\Omega$. The initial conditions used for the simulated system are $\xi(0) = [-1, 1.5, \frac{3\pi}{4}, 0, 0, 0]^T$, $\hat{\xi}(0) = \xi(0)$, and $\Gamma(0) = 5000 \cdot I_{27 \times 27}$.¹³ Both $\hat{W}_c(0)$ and $\hat{W}_a(0)$ are initialized by solving the algebraic Riccati equation for the linearized rigid body AUV dynamics about the position $\xi = 0_{6 \times 1}$. The polynomial basis function $\sigma$ with 27 elements is used for value function approximation. Each $\hat{\theta}(0) \in \mathbb{R}^{25 \times 6}$ is initialized according to its subsequently defined training method. The gains were selected as $\eta_{c1} = 0$, $\eta_{c2} = 0.5$, $\eta_{a1} = 10$, $\eta_{a2} = 0.1$, $\lambda = 0.025$, $\nu = 0.025$, $\overline{\Gamma} = 5000$, $\underline{\Gamma} = 100$, $k_\theta = 5 \cdot 10^6$, $k_o = 10$, and $\Gamma_\theta = 1$. To facilitate CL, a maximum of 100 state-action pairs are recorded and replaced according to the singular value maximization algorithm defined in [13, Algorithm 1].
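The idea behind singular-value-maximizing data replacement can be sketched as follows. This is a simplified illustration in the spirit of [13, Algorithm 1], not a reproduction of it: the function names are hypothetical, each recorded data point is assumed to be summarized by a regressor vector stored as one column of the history stack, and any novelty threshold for accepting new data is omitted. A new point is appended while the stack has room. Once the stack is full, the new point replaces an existing column only if the swap increases the minimum singular value of the stack.

```python
import numpy as np

def min_singular_value(stack):
    """Smallest singular value of the history-stack matrix (one column per data point)."""
    return np.linalg.svd(stack, compute_uv=False).min()

def update_history_stack(stack, new_col, max_size):
    """Return the history stack after considering one new regressor vector.

    stack    : (n, p) array of recorded regressor vectors, or None if empty
    new_col  : length-n candidate regressor vector
    max_size : maximum number of columns to retain (100 in the simulation)
    """
    new_col = np.asarray(new_col, dtype=float).reshape(-1, 1)
    if stack is None or stack.shape[1] == 0:
        return new_col.copy()
    if stack.shape[1] < max_size:
        # Room left: record the new data point
        return np.hstack([stack, new_col])
    # Stack is full: try swapping the new point into each slot
    best_sv = min_singular_value(stack)
    best_stack = stack
    for j in range(stack.shape[1]):
        candidate = stack.copy()
        candidate[:, j] = new_col[:, 0]
        sv = min_singular_value(candidate)
        if sv > best_sv:
            best_sv, best_stack = sv, candidate
    # If no swap improves the minimum singular value, the new point is discarded
    return best_stack
```

Because a swap is accepted only when it raises the minimum singular value, the stored data never becomes less informative in this sense as new points arrive.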
This section presents simulation results for exact model knowledge (EMK) ADP, linearly parameterizable (LP) ADP, randomly initialized DNN ADP, transfer learning DNN ADP, and pretrained DNN ADP. All of the ADP methods in this simulation comparison are model-based (i.e., use BE extrapolation). EMK ADP uses EMK of $f(x)$, so the results present the best possible performance for an ADP-based controller for a given set of gains and extrapolation trajectories. LP ADP assumes that $f(x)$ is LP (i.e., $f(x) = Y(x)\theta$, where $Y(x)$ exactly parameterizes the dynamics), as typically seen in the adaptive control literature [21, Sec. 3.4.3]. LP ADP requires some model knowledge, but not EMK, and represents a special case (subset) of the dynamics in (1). For the pretrained DNN ADP method, the DNN is trained a priori using the actual dynamics in (27). The transfer learning DNN ADP method is also based on training the DNN a priori, but on a system that is similar to, and not exactly the same as, the dynamics used during implementation. For the transfer learning case, the current vector $\nu_c = [-0.1, 0.1, 0]^T$ is changed to $\nu_c = [0.1, -0.1, 0]^T$ to represent the uncertainty between the training model and the actual model. The randomly initialized DNN ADP method does not require any prior training, i.e., does not require knowledge of the drift dynamics. The retrained DNN ADP method is initialized as the random DNN ADP method; however, the retrained DNN ADP method updates the inner layer features once online. The retrained DNN ADP method highlights the performance improvements that occur through online iterative adjustment of the inner layer DNN features. For retraining, a history stack of DNN training data is collected for 120 s.¹⁴ After 120 s, the internal DNN weights begin retraining. The DNN is trained for 50 epochs, which takes approximately 12.8 s.

¹² The number of neurons per layer and activation functions were selected empirically.
¹³ To reduce the computational complexity of the simulation, the least-squares gain matrix is initialized such that $\Gamma(0) = \overline{\Gamma}$.
¹⁴ The time of 120 s was selected because it is the period of $x_d$. Collecting more data should result in improved training of the inner layer weights at the expense of additional computation time.

Table I compares the performance of each method, and Fig. 2 compares the random DNN ADP and retrained DNN ADP methods. The second column compares the total integral error of each simulation (i.e., $\int_0^{240} \|e(\tau)\|\, d\tau$). Recall that the EMK ADP method is expected to have the best performance. LP ADP is the best performing method with uncertainty, followed by transfer learning DNN ADP, pretrained DNN ADP, retrained DNN ADP, and random DNN ADP, respectively. The third column of Table I compares the ADP methods with the integral of the difference between their state trajectory and the EMK ADP state trajectory. Similarly, LP ADP performs the best, followed by pretrained DNN ADP, transfer learning DNN ADP, retrained DNN ADP, and random DNN ADP, respectively. While the transfer learning DNN ADP case has a lower integral of error in the second column, the pretrained DNN ADP case performs closer to the EMK ADP case, as seen in the third column. The fourth column of Table I compares the ADP methods after the retrained DNN ADP has completed retraining. Once
retraining is complete, the new internal weights are implemented. After retraining, the difference between the retrained DNN and random DNN controllers is notable. The retrained DNN has significantly better performance compared to the random DNN case, and it is comparable to that of the other ADP methods. The integral of error from the time at which the new inner layer weights are implemented to the end of the trial is used to compare the performance of the two techniques. After retraining, the integral of error for the random case is 163.29. The integral of error for the retrained case is 145.76. Hence, the online retraining method empirically reduced the tracking error by 10.7%. After retraining, the EMK ADP and LP ADP perform the best, followed by pretrained DNN ADP, transfer learning DNN ADP, retrained DNN ADP, and random DNN ADP, respectively. The unsurprising trend in Table I is that if a system has more model knowledge a priori, then performance improves.

These simulation studies confirm the effectiveness of a DNN-based ADP controller with a real-time output layer weight and iterative inner layer feature updates. The benefit of the developed technique is that a component of the drift dynamics f0 can be approximated without any model knowledge a priori. This simulation example illustrates the well-understood trend that more model knowledge leads to improved controller performance. Since the developed method can update the DNN features to better approximate the nonlinear drift dynamics, a DNN-based model of the drift dynamics can be learned without pretraining. Unlike existing single-layer NN-based system identifiers, the additional layers of the DNN facilitate improved function approximation. Combining existing data-based deep learning algorithms with adaptive control policies can decrease model uncertainty, enhance the quality of the value function approximation, and ultimately improve the system performance.

VIII. CONCLUSION

This article develops a framework for using a DNN-based system identifier within a model-based RL ADP framework to solve the infinite horizon optimal tracking control problem. A CL-based continuous-time update law is used to update the output layer weights of the DNN. A Lyapunov-based analysis is performed to prove UUB identification of the DNN weights, trajectory tracking, and approximation of the applied control policy to a neighborhood of the optimal control policy. Simulation results illustrate the performance of the developed method in comparison to existing methods applied to an AUV. Future work will investigate using a DNN to simultaneously approximate the value function in conjunction with a DNN-based system identifier.

ACKNOWLEDGMENT

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Aurora Flight Sciences, the Johns Hopkins Applied Physics Laboratory, or the sponsoring agencies.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[2] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. London, U.K.: The Institution of Engineering and Technology, 2013.
[3] D. Liberzon, Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton, NJ, USA: Princeton Univ. Press, 2012.
[4] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, vol. 17. Hoboken, NJ, USA: Wiley, 2013.
[5] A. Chakrabarty, D. K. Jha, G. T. Buzzard, Y. Wang, and K. G. Vamvoudakis, “Safe approximate dynamic programming via kernelized Lipschitz estimation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 405–419, Jan. 2021.
[6] B. Pang and Z.-P. Jiang, “Adaptive optimal control of linear periodic systems: An off-policy value iteration approach,” IEEE Trans. Autom. Control, vol. 66, no. 2, pp. 888–894, Feb. 2021.
[7] W. Gao, M. Mynuddin, D. C. Wunsch, and Z.-P. Jiang, “Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 10, pp. 5229–5240, Oct. 2022.
[8] B. Pang, T. Bian, and Z.-P. Jiang, “Robust policy iteration for continuous-time linear quadratic regulation,” IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3056, Nov. 2024.
[9] R. Kamalapurkar, P. Walters, and W. E. Dixon, “Model-based reinforcement learning for approximate optimal regulation,” Automatica, vol. 64, pp. 94–104, 2016.
[10] H. Modares, F. L. Lewis, and Z.-P. Jiang, “H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2550–2562, Oct. 2015.
[11] R. Kamalapurkar, L. Andrews, P. Walters, and W. E. Dixon, “Model-based reinforcement learning for infinite-horizon approximate optimal tracking,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 753–758, Mar. 2017.
[12] G. V. Chowdhary and E. N. Johnson, “Theory and flight-test validation of a concurrent-learning adaptive controller,” J. Guid. Control Dyn., vol. 34, no. 2, pp. 592–607, Mar. 2011.
[13] G. Chowdhary, T. Yucelen, M. Mühlegg, and E. N. Johnson, “Concurrent learning adaptive control of linear systems with exponentially convergent bounds,” Int. J. Adapt. Control Signal Process., vol. 27, no. 4, pp. 280–301, 2013.
[14] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[15] G. Joshi and G. Chowdhary, “Deep model reference adaptive control,” in Proc. IEEE Conf. Decis. Control, 2019, pp. 4601–4608.
[16] R. Sun, M. Greene, D. Le, Z. Bell, G. Chowdhary, and W. E. Dixon, “Lyapunov-based real-time and iterative adjustment of deep neural networks,” IEEE Control Syst. Lett., vol. 6, pp. 193–198, 2022. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9337905/
[17] R. Kamalapurkar, H. Dinh, S. Bhasin, and W. E. Dixon, “Approximate optimal trajectory tracking for continuous-time nonlinear systems,” Automatica, vol. 51, pp. 40–48, Jan. 2015.
[18] F. Sauvigny, Partial Differential Equations 1: Foundations and Integral Representations. Berlin, Germany: Springer, 2012.
[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, vol. 1. Cambridge, MA, USA: MIT Press, 2016.
[20] P. Kidger and T. Lyons, “Universal approximation with deep narrow networks,” in Proc. Conf. Learn. Theory, 2020, pp. 2306–2327.
[21] F. L. Lewis, S. Jagannathan, and A. Yesildirak, Neural Network Control of Robot Manipulators and Nonlinear Systems. Philadelphia, PA, USA: CRC Press, 1998.
[22] P. Deptula, J. Rosenfeld, R. Kamalapurkar, and W. E. Dixon, “Approximate dynamic programming: Combining regional and local state following approximations,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2154–2166, Jun. 2018.
[23] P. Deptula, Z. Bell, E. Doucette, W. J. Curtis, and W. E. Dixon, “Data-based reinforcement learning approximate optimal control for an uncertain nonlinear system with control effectiveness faults,” Automatica, vol. 116, pp. 1–10, Jun. 2020.
[24] H. K. Khalil, Nonlinear Systems, 3rd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[25] R. Kamalapurkar, P. S. Walters, J. A. Rosenfeld, and W. E. Dixon, Reinforcement Learning for Optimal Feedback Control: A Lyapunov-Based Approach. Berlin, Germany: Springer, 2018.
[26] N. Fischer, D. Hughes, P. Walters, E. Schwartz, and W. E. Dixon, “Nonlinear RISE-based control of an autonomous underwater vehicle,” IEEE Trans. Robot., vol. 30, no. 4, pp. 845–852, Aug. 2014.
