Adaptive Control and Intersections with Reinforcement Learning

Anuradha M. Annaswamy
Active Adaptive Control Laboratory, Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; email: [email protected]
1. INTRODUCTION
The overarching goals in the control of a large-scale system are to understand the underlying
dynamics and offer decision support in real time so as to realize high performance. A control
problem can be stated as the determination of inputs into the system so as to have its outputs
lead to desired performance metrics, often related to efficiency, reliability, and resilience. The
challenge that arises is that these decisions must be undertaken in the presence of uncertainties
in the system and in the environment it operates in. Adaptive control (AC) and reinforcement
learning (RL) are two different methods that have been explored over the past several decades to
address this difficult problem in a range of application domains. This article attempts to present
them using a common framework.
An interesting point to note is that the solutions for this problem have been proposed by AC
and RL using two distinct frameworks. A fundamental concept that is common to both of these
methodologies is learning. Despite this commonality, there has been very little cross-fertilization
between these methods. Both methods have distinct advantages in their approach and at the same
time have gaps in application to real-time control. This article presents a first step in provid-
ing a comparison between these two methods, exploring the role of learning, and describing the
challenges that these two fields have encountered.
Sections 2 and 3 are devoted to laying out the fundamentals of AC, with particular emphasis
on how learning occurs in the proposed solutions. This is followed by a brief exploration of RL
in Section 4. Section 5 presents two different examples of dynamic systems that are used to illus-
trate the distinction between the two approaches. Finally, Section 6 is devoted to comparisons,
combinations of the two approaches, and a few concluding remarks.
where x(t) ∈ Rn represents the system state and y(t) ∈ Rp represents all measurable system outputs.
For many physical systems, n ≥ p ≥ m (1). θ is a vector of system parameters that may be
unknown, and f (·) and g(·) denote system dynamics (which may be nonlinear) that capture the
underlying physics of the system. The functions f (·) and g(·) also vary with t, as disturbances and
stochastic noise may affect the states and output. The goal is to choose u(t) so that y(t) tracks a
desired command signal yc (t) at all t, and so that an underlying cost J((y − yc ), x, u) is minimized.
In what follows, the system that is being controlled is referred to as a plant.
As the description of the system in Equation 1 is based on a plant model, and as the goal is
to determine the control input in real time, all control approaches make assumptions regarding
what is known and unknown. The functions f and g are often not fully known, as the plant is
subject to various perturbations and modeling errors due to environmental changes, complexities
in the underlying mechanisms, aging, and anomalies. The field of AC takes a parametric approach
to distinguish the known parts from the unknown. In particular, it is assumed that f is a known
function, while the parameter θ is unknown. A real-time control input is then designed so as to
ensure that the tracking goals are achieved by including an adaptive component that attempts to
estimate the parameters online. A similar problem statement can be made for a linearized version
of the problem in Equation 1, which is of the form
y = W (s, θ )[u], 2.
where s denotes the differential operator d/dt, W(s, ·) is a rational operator of s, and θ is a parameter
vector. In this linear case, we assume that the structure of W(s) (i.e., the order and net order) is
known but that θ (i.e., the coefficients) is not known.
The following subsections further break down the approach taken to address these problems,
especially in the context of learning and optimization. While the description below is in the context
of deterministic continuous-time systems, similar efforts have been carried out in stochastic and
discrete-time dynamic systems as well (2).
lim_{t→∞} e(t) = 0, 3.
where e(t) = y(t) − yc(t). As these decisions are required to be made in real time, the focus of the AC
approach is to lead to a closed-loop dynamic system that has bounded solutions at all time t and a
desired asymptotic behavior as in Equation 3. The central question is whether this can be ensured
even when there are parametric uncertainties in θ and several other nonparametric uncertainties,
which may be due to unmodeled dynamics, disturbances, or other unknown effects. Once this
is guaranteed, the question of learning, in the form of parameter convergence, is addressed. As a
result, control for learning is a central question that is pursued in the class of problems addressed in
AC rather than learning for control (3). Once the control and learning objectives are realized, one
can then proceed to the optimization of a suitable cost J. This sequence of adapt–learn–optimize
is an underpinning of much of AC.
The above sequence can be reconciled with the well-known certainty equivalence principle,
which proceeds in the following manner: First, optimize under perfect foresight, then substi-
tute optimal estimates for unknown values. This philosophy underlies all AC solutions by first
determining a controller structure that leads to an optimal solution when the parameters are
known and then replacing the parameters in the controller with their estimates. The difficulty
in adopting this philosophy to its fullest stems from the dual nature of the adaptive controller,
as it attempts to accomplish two tasks simultaneously, estimation and control. This simultaneous
action introduces a strong nonlinearity into the picture and therefore renders a true deployment
of the certainty equivalence principle intractable. Instead, an adapt–learn–optimize sequence is
adopted, with the first step corresponding to an adaptive controller that leads to a stable solution.
This is then followed by estimation of the unknown parameters, and optimization is addressed at
the final step.
A typical solution of the adaptive controller takes the form
u = C1 (θc (t ), φ(t ), t ), 4.
θ˙c = C2 (θc , φ, t ), 5.
where θ c (t) is an estimate of a control parameter that is intentionally varied as a function of time,
and φ(t) represents all available data at time t. The nonautonomous nature of C1 and C2 is due
to the presence of exogenous signals such as set points and command signals. A stabilization task
would render these functions autonomous. The functions C1 (·) and C2 (·) are deterministic con-
structions and make the overall closed-loop system nonlinear and nonautonomous. The challenge
is to suitably construct functions C1 (t) and C2 (t) so as to have θ c (t) approach its true value θc∗ and
ensure the stability and asymptotic stability properties of the overall adaptive systems. Several
textbooks have delineated these constructions for deterministic systems (e.g., 4–9). The solutions
that additionally guarantee learning, in the sense of parameter convergence, require conditions such
as persistent excitation and uniform observability (10–14). These persistent excitation properties
are usually associated with the underlying regressor φ and are typically realized by appropriately
choosing the exogenous signals such as r(t) (which is the input into the reference model M).
The assumption that the uncertainties in Equations 1 and 2 are limited to just the parameter
θ , and that otherwise f and g or W(s) is known, is indeed an idealization. Several departures from
this assumption can take place in the form of unmodeled dynamics, time-varying parameters,
disturbances, and noise. For example, the linear plant may have the form
y = [W(s, θ(t)) + Δ(s)][u + d(t) + n(t) + g(t)], 6.
where d(t) is an exogenous bounded disturbance, n(t) represents measurement noise, and g(t)
represents nonlinear effects. The parameter θ is time-varying and is of the form
θ (t ) = θ ∗ + ϑ (t ), 7.
where θ ∗ is an unknown constant parameter but is accompanied by additional unknown varia-
tions in the form of ϑ(t). Finally, Δ(s) is due to higher-order dynamics that are not known, poorly
known, or even deliberately ignored for the sake of computational or algorithmic simplicity. In
all of these cases, a robust adaptive controller needs to be designed to ensure that the underly-
ing signals remain bounded, with errors that are proportional to the size of these perturbations.
Similar departures of unknown effects that cannot be anticipated during online operation exist
for the nonlinear system in Equation 1 as well. All AC methods have been developed with these
departures from the idealized problem statements (as addressed in Section 2.1).
As will become apparent in Section 3, the results that have been proposed for a robust adap-
tive controller are intricately linked to learning of the underlying parameters. These aspects and
implications of imperfect learning will be addressed in Section 3 as well.
With such a C1 determined, and noting that θc∗ could be unknown due to the parametric uncertainty in
the plant, the analytic part focuses on finding C2 such that output following takes place with the
closed-loop system remaining bounded. An alternative to the above direct approach of identifying
the control parameters is an indirect one where the plant parameters are first estimated, and
these estimates are then used to determine the control parameter θ c (t) at each t. The following
sections describe the details of the model reference AC approach for various classes of dynamic
systems, ranging from simple and algebraic cases to nonlinear dynamic ones.
3.1.1. Algebraic systems. Many problems in adaptive estimation and control may be expressed
as (2)
y(t ) = θ ∗T φ(t ), 8.
where θ ∗ , φ(t ) ∈ RN represent an unknown parameter and measurable regressor, respectively, and
y(t ) ∈ R represents an output that can be determined at each t. Given that θ ∗ is unknown, we
formulate an estimator ŷ(t ) = θ T (t )φ(t ), where ŷ(t ) ∈ R is the estimated output and the unknown
parameter is estimated as θ (t ) ∈ RN . This in turn results in two types of errors, a performance
error ey(t) and a learning error θ̃(t),¹
ey = ŷ − y,  θ̃ = θ − θ∗, 9.
where the former can be measured but the latter is unknown, though adjustable. From Equation 8
and the estimator, it is easy to see that ey and θ̃ are related using a simple regression relation:
ey(t) = θ̃ᵀ(t)φ(t). 10.
A common approach for adjusting the estimate θ(t) at each time t is to use a gradient rule and a
suitable loss function. One example is the choice
L1(θ) = (1/2)ey², 11.
leading to the gradient rule
θ̇(t) = −γ∇θ L1(θ(t)) = −γ ey(t)φ(t), γ > 0. 12.
That this leads to a stable estimation scheme can be shown using the Lyapunov function V = θ̃ᵀθ̃/(2γ),
whose time derivative is V̇ = −ey² ≤ 0.
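As a concrete illustration, the following minimal sketch simulates the gradient rule of Equation 12 for the algebraic system of Equation 8. The two-dimensional θ∗, the sinusoidal regressor, the gain γ, and the forward-Euler step are assumptions chosen only to make the example self-contained; with this regressor, both ey and the parameter error decay.

```python
import numpy as np

# Minimal simulation of the gradient rule in Equation 12 for the algebraic
# system of Equation 8. theta_star, the regressor phi(t), the gain gamma, and
# the forward-Euler step dt are assumptions chosen only for this illustration.
theta_star = np.array([2.0, -1.0])          # unknown parameter in Equation 8
gamma, dt, T = 5.0, 1e-3, 20.0              # adaptation gain, step size, horizon

theta = np.zeros(2)                          # initial estimate theta(0)
for k in range(int(T / dt)):
    t = k * dt
    phi = np.array([np.sin(t), np.cos(2 * t)])   # measurable regressor phi(t)
    y = theta_star @ phi                          # measured output (Equation 8)
    e_y = theta @ phi - y                         # performance error (Equation 9)
    theta += dt * (-gamma * e_y * phi)            # gradient rule (Equation 12)

print("final parameter error:", np.linalg.norm(theta - theta_star))
```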
3.1.2. Dynamic systems with states accessible. The next class of problems that has been
addressed in AC corresponds to plants with all states accessible. This section presents the solution
for the simple case of a scalar input:
ẋ = A p x + b p u, 13.
where Ap and bp are unknown, u is the control input and is a scalar, and x is the state and is accessible
for measurement. As mentioned in Section 2.1, the first step is to find a reference model M, which
takes the form
ẋm = Am xm + bm r, 14.
1 In what follows, the argument (t) is suppressed unless needed for emphasis.
that matches the closed-loop system to the reference model. This corresponds to the algebraic
part of the problem described in Section 2.1.
The final step is the analytic part—the rule for estimating the unknown parameters θ ∗ and k∗
and the corresponding AC input that replaces the input choice in Equation 17. These solutions
are given by
u = θ T (t )x + k(t )r, 18.
θ̇ = −sign(k∗)Γθ(eᵀPbm)x, 19.
k̇ = −sign(k∗)γk(eᵀPbm)r, 20.
where e = x − xm and P = Pᵀ > 0 satisfies
AᵀmP + PAm = −Q, 21.
It can be shown that a quadratic function of the state error e and the parameter errors is a Lyapunov function with V̇ = −eᵀQe and that limt → ∞ e(t) = 0 (for further details, see
chapter 3 in Reference 4). In summary, the adaptive controller that is proposed here can be
viewed as an action–response–correction sequence where the action is the control input given by
Equation 18, the response is the resulting state error e, and the correction is the parameter-adaptive
laws in Equations 19 and 20.
It should be noted that the adaptation rules in Equations 19 and 20 can also be expressed as
the gradient of a loss function (15),
L2(θ̄) = d/dt(eᵀPe/2) + eᵀQe/2, 23.
where θ¯ = [θ T , k]T , and it is assumed that k∗ > 0 for ease of exposition. It is noted that this loss
function L2 differs from that in Equation 11 and includes an additional component that reflects
the dynamics in the system. It is easy to see that ∇θ̄L2(θ̄) = φ(eᵀPbm), so that the update of θ̄(t) is
implementable and can be computed at each time t, where φ = [xpᵀ, r]ᵀ (15). This implies that a
real-time control solution that is stable depends critically on
choosing an appropriate loss function.
The matching condition given in Equation 16 is akin to the controllability condition, albeit
somewhat stronger, as it requires the existence of a θ ∗ for a known Hurwitz matrix Am (4, 16). The
other requirement is that the sign of k∗ must be known, which is required to ensure that V is a
Lyapunov function.
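The two ingredients of this construction can be checked numerically. The sketch below uses an illustrative (Ap, bp) pair and reference model; all numerical values are assumptions, and the matching condition of Equation 16 is taken here in its standard form Ap + bpθ∗ᵀ = Am with bpk∗ = bm.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Numerical check of the two ingredients of Section 3.1.2: the matching condition
# (Equation 16, taken here in its standard form A_p + b_p theta*^T = A_m, b_p k* = b_m)
# and the Lyapunov equation of Equation 21. The numerical values of A_p, b_p, A_m,
# and b_m are assumptions chosen only for this sketch.
A_p = np.array([[0.0, 1.0], [2.0, -1.0]])
b_p = np.array([[0.0], [1.0]])
A_m = np.array([[0.0, 1.0], [-4.0, -2.0]])     # Hurwitz reference-model matrix
b_m = b_p.copy()                                # corresponds to k* = 1

# Solve b_p theta*^T = A_m - A_p for theta*^T by least squares and check the residual.
theta_star_T, *_ = np.linalg.lstsq(b_p, A_m - A_p, rcond=None)
print("theta*^T =", theta_star_T.ravel())
print("matching residual:", np.linalg.norm(A_p + b_p @ theta_star_T - A_m))

# Lyapunov equation of Equation 21: A_m^T P + P A_m = -Q, with Q = Q^T > 0.
Q = np.eye(2)
P = solve_continuous_lyapunov(A_m.T, -Q)
print("P =\n", P)                               # P = P^T > 0 used in Equations 19 and 20
```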
3.1.3. Adaptive observers. The AC solution in Equations 18 and 19 in Section 3.1.2 required
that the state x(t) be available for measurement at each t. A central challenge in developing adaptive
solutions for plants whose states are not accessible is the simultaneous generation of estimates of
both states and parameters in real time. Unlike the Kalman filter in the stochastic case or the
Luenberger observer in the deterministic case, the problem becomes significantly more complex,
as state estimates require plant parameters, and parameter estimation is facilitated when states
are accessible. This loop is broken using a nonminimal representation of the plant, leading to a
tractable observer design. Starting with a plant model as in Equation 2, a state representation of
the same can be derived as given by Luders & Narendra (17):
ω̇1 = Λω1 + ℓu,
y = θ1ᵀω1 + θ2ᵀω2, 25.
where ω = [ω1ᵀ, ω2ᵀ]ᵀ is a nonminimal state of the plant transfer function Wp(s) between the input
u and the output y, Λ ∈ Rn×n is a Hurwitz matrix, (Λ, ℓ) is controllable, and Λ and ℓ are known pa-
rameters. Assuming that Wp (s) has n poles and m coprime zeros, in contrast to a minimal nth-order
representation, Equation 25 is nonminimal and has 2n states. The adaptive observer leverages
Equation 25 and generates a state estimate ω̂ and a parameter estimate θ̂ as follows:
ω̂̇1 = Λω̂1 + ℓu,
ŷ = θ̂1ᵀω̂1 + θ̂2ᵀω̂2, 26.
where θ̂ = [θ̂1ᵀ, θ̂2ᵀ]ᵀ and ω̂ = [ω̂1ᵀ, ω̂2ᵀ]ᵀ. The adaptive law that adjusts the parameter estimates
is chosen as
θ̂̇ = −(ŷ − y)ω̂, 27.
3.1.4. Adaptive control with output feedback. The two assumptions made in the development
of adaptive systems in Section 3.1.2 include matching conditions and the availability of states of
the underlying dynamic system at each instant t. Both are often violated in many problems, which
led to the development of adaptive systems when only partial measurements are available. With
the focus primarily on linear time-invariant plants, the first challenge was to address the prob-
lem of the separation principle employed in the control of linear time-invariant plants (23, 24).
The idea therein is to allow a simultaneous estimation of states using an observer and a feedback
control using state estimates with a linear–quadratic regulator to be implemented, and to allow
enabled a tractable problem formulation, where ω(t) is generated as in Equation 25. The added ad-
vantage of the nonminimal representation is that it ensures the existence of a solution that matches
the controlled plant using Equation 28 to that of the reference model. That is, the existence of a
control parameter θ ∗ and k∗ such that
u(t ) = θ ∗T ω(t ) + k∗ r(t ) 29.
ensured that the closed-loop transfer function from r to y matched that of a reference model with
a transfer function Wm (s), specified as
ym (t ) = Wm (s)r(t ). 30.
That is, the controller in Equation 29 is guaranteed to exist for which the output error ey = yp −
ym has a limiting property of limt → ∞ ey (t) = 0. For this purpose, the well-known Bézout identity
(23) and the requirement that Wp (s) has stable zeros were leveraged.
When the adaptive controller in Equation 28 is used, the plant model in Equation 2 and the
existence of θ ∗ and k∗ that guarantee that the output error ey (t) goes to zero lead to an error model
of the form
ey = (1/k∗)Wm(s)[(θ̄ − θ̄∗)ᵀφ], 31.
It can be shown that
V = eᵀPe + (1/|k∗|)(‖θc − θ∗‖² + |k − k∗|²) 34.
is a Lyapunov function where P is the solution of the KYL for the realization {Am , b, c} of Wm (s),
which is SPR. This follows from first noting that
V̇ = eᵀ(AᵀmP + PAm)e + 2eᵀPb(θ̃cᵀω + k̃r) − θ̃cᵀeyω − k̃eyr.
Since Wm (s) is SPR, the use of the KYL applied to Wm (s) together with the adaptive laws in
Equations 32–34 causes the second term to cancel out the third and fourth terms, and hence
V̇ = −eT Qe ≤ 0. The structure of the adaptive controller in Equation 28 guarantees that ey , θ c , k,
ω, yp , and u are bounded and that limt → ∞ ey (t) = 0. Additions of positive definite gain matrices to
Equations 32 and 33 as in Equations 19 and 20 are straightforward. Similar to Section 3.1.2, the
action–response–correction sequence is accomplished by u in Equation 28, ey , and the adaptive
laws in Equations 32 and 33.
The choice of the adaptive laws as in Equations 32 and 33 centrally depended on the KYL,
which in turn required that Wm (s) be SPR. An SPR transfer function (4) leads to the requirement
that the relative degree—the difference between the number of poles and zeros of Wp (s)—is unity
and has stable zeros (zeros only in Re[s] < 0), also defined as hyperminimum phase (31). Qualita-
tively, it implies that a stable adjustment rule for the parameter should be based on loss functions
that do not significantly lag the times at which new data come into the system. For a general case
when the relative degree of Wp (s) exceeds unity, it poses a significant stability problem, as it is clear
that the same simple adaptive laws as in Equations 32 and 33 will no longer suffice because the
corresponding transfer function Wm (s) of the reference model cannot be made SPR.
A final note regarding the assumptions made about the plant model in Equation 2 is in order.
For the controller in Equation 29 to allow the closed-loop system to match the reference model
in Equation 30 for any reference input r(t), a reference model Wm (s) with the same order and net
order as that of Wp (s) must be chosen, which implies that the order and net order of the plant must
be known. Determination of a Lyapunov function requires that the sign of k∗ be known. Finally, the
model matching starting with a nonminimal representation of the plant requires stable pole-zero
cancellations, which necessitates that the zeros be stable.
Extensions to a general case with output feedback have been proposed using several novel
tools, including an augmented error approach (4), backstepping (8), averaging theory (32), and
high-order tuners (33). In all cases, the complexity of the adaptive controller is increased, as the
error model in Equation 31 does not permit the realizations of simple loss functions as in Li (θ),
i = 1, 2. Annaswamy & Fradkov (2) provided a concise presentation of these extensions.
3.1.5. Learning and imperfect learning. As is clear from all preceding discussions, the hall-
mark of all AC problems is the inclusion of a parameter estimation algorithm. In addition to
ensuring that the closed-loop system is bounded and that the performance errors are brought
to zero, all adaptive systems attempt to learn the underlying parameters, with the goal that the
parameter error θ − θ ∗ is reduced, if not brought to zero.
Four important implications of learning and imperfect learning should be kept in mind (and
are expanded further in Section 5). The first is the necessary and sufficient condition under which
learning occurs.
Definition 1 (4). A bounded function φ : [t0 , ∞) → RN is persistently exciting if there exist
T > 0 and α > 0 such that
∫_t^{t+T} φ(τ)φᵀ(τ) dτ ≥ αI, ∀t ≥ t0.
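Definition 1 can be checked numerically for a candidate regressor by approximating the integral over sliding windows and examining the smallest eigenvalue of the resulting matrix, as in the sketch below. The regressor, window length, and sampling step are illustrative assumptions.

```python
import numpy as np

# Numerical check of Definition 1 for a candidate regressor phi(t): the integral
# over each window [t, t + T] is approximated by a Riemann sum, and persistent
# excitation requires its smallest eigenvalue to stay bounded away from zero.
dt, T_win, t_end = 1e-3, 5.0, 40.0

def phi(t):
    return np.array([np.sin(0.5 * t), np.sin(1.5 * t)])   # assumed two-component regressor

times = np.arange(0.0, t_end, dt)
outer = np.array([np.outer(phi(t), phi(t)) for t in times])
win = int(T_win / dt)
alphas = [np.linalg.eigvalsh(outer[s:s + win].sum(axis=0) * dt).min()
          for s in range(0, len(times) - win, win)]

print("smallest window eigenvalue (alpha):", min(alphas))   # > 0 indicates persistent excitation
```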
The second implication is that the control goals, which are the typical goals in system estimation and control, can be achieved without relying on learning. That
is, a guaranteed safe behavior of the controlled system can be assured in real time even without
reaching the learning goal, as output matching is an easier task, while parameter matching is task
dependent and faces challenges due to spectral properties of a dynamic system. This guarantee
in the presence of imperfect learning is essential and suggests that for real-time decision-making,
control for learning is the practical goal, in contrast to learning for control. It should also be
added that when the excitation level is insufficient or there is simply no persistent excitation, the
parameters will not converge to the true values (40).
The third implication is the strong link between learning and robustness. Ensuring that the
parameter estimates have converged to their true values opens the door to several attractive prop-
erties of a linear system, the foremost of which are exponential stability and robustness to various
departures from idealization. In fact, a treatment of bounded behavior in the presence of persis-
tent excitation has been established globally for the case when these departures are due to external
disturbances (38) and locally for a larger class of perturbations (32). The foundation for these state-
ments stems from the fact that AC systems are nonlinear, and bounded-input, bounded-output
stability is not guaranteed when the unforced system is uniformly asymptotically stable and not
exponentially stable (38, 41). Use of regularization and other modifications, such as projection and
dead zone modifications, has been suggested to ensure robustness when no persistent excitation
properties can be guaranteed (4, 6).
The fourth implication, which rounds off this topic, is this: When there is no persistent exci-
tation and disturbances are present, the closed-loop system can produce large bursts of tracking
error (34–36). That is, imperfect learning exhibits a clearly nonrobust property that leads to a sig-
nificant departure from a tracking or regulation goal, exhibiting an undesirable behavior over short
periods when the tracking error becomes significantly large. A specific example that illustrates this
behavior is the following (34): Consider a first-order plant with two unknown parameters a and b
of the form
yk+1 = ayk + buk , 35.
whose AC solution is given by (42)
uk = (1/bk)(−ak yk + y∗k+1), 36.
where ak and bk denote the estimates of a and b at time k.
The results of Goodwin et al. (43) can be used to reparameterize Equation 36 as
uk = −θc1 ,k yk + θc2 ,k y∗k+1 37.
and propose parameter adjustment rules for θc1 ,k and θc2 ,k . These adjustment rules guarantee that
(a) θci ,k and yk are bounded (42); (b) θci ,k converge to constants θc0i , which may not coincide with
the true values (44); and (c) yk approaches y∗k as k → ∞ (42). In addition, when φ c, k is persistently
exciting (i.e., satisfies Definition 1), we also have that the estimates θci ,k approach the true values
θc∗i . However, when such a persistent excitation is not present and when perturbations are present,
bursting can occur, which can be explained as follows.
Suppose we consider a simple regulation problem with y∗k ≡ 1. The control input in
Equation 37 leads to a closed-loop system of the form
yk+1 = g(θc1,k)yk + h(θc2,k)y∗k+1, 38.
where
g(θc1 ,k ) = a − bθc1 ,k , h(θc2 ,k ) = bθc2 ,k . 39.
This implies that the closed-loop system is (a) unstable if |g(θc1 ,k )| > 1 and (b) stable if |g(θc1 ,k )| < 1.
The most troublesome scenario occurs when there is marginal stability—i.e., θc1,k = θcb1, where
g(θcb1) = −1. Suppose that the parameters θci,k become arbitrarily close to θcbi for some k = k0; at
k0+, a disturbance pulse is introduced, which can cause the parameters to drift, with θc1,k approaching
θcb1, which in turn leads to oscillations in yk, causing θci,k to readjust and once again approach
another set of constant values, θc0i . Such a phenomenon has been shown to occur in the absence of
persistent excitation (34), including in continuous-time systems (37). This phenomenon is not pe-
culiar to the specific systems in question and can occur in any dynamic system where simultaneous
identification and control are attempted.
3.1.6. Numerical illustration of learning and imperfect learning. As an example, consider a
continuous-time F-16 model (45), where the nominal dynamics are linearized about level flying
at 500 feet/s at an altitude of 15,000 feet to produce a linear time-invariant system similar to the
one described by Stevens & Lewis (45), with states, inputs, and parameters defined as follows:
where α, q, and δ e are the angle of attack (degrees), pitch rate (degrees per second), and elevator
deflection (degrees), respectively. The goal is to control the nominal dynamics using a linear–
quadratic regulator with cost matrices
QLQR = [1 0; 0 0], RLQR = 1. 41.
Solving the discrete algebraic Riccati equation provides us with a feedback gain vector
θLQR = [−0.1536, 0.8512]ᵀ. 42.
Finally, applying the feedback gain above to the dynamics in Equation 40, along with a reference
input r(t), gives the following closed-loop system, which we choose as the reference system:
ẋm = Ap xm + bp(θLQRᵀ xm + r) = Ah xm + bp r. 43.
Assuming that both states are measurable, the AC goal is to ensure that the plant tracks the refer-
ence system for a given reference signal r(t), regardless of differences between the plant and the
nominal dynamics.
We now introduce a parametric perturbation into the plant model such that the true open-loop
dynamics are given by
ẋ = Ā p x + b p u, 44.
Figure 1
Simulation results of a simple F-16 model with imperfect learning. (a) The reference signal r(t) to be followed by the reference model
and the plant. Within any given 20-s period, this reference signal does not provide enough excitation for the adaptive system to fully
learn. (b) The tracking error ‖e(t)‖2. The tracking error goes to zero within the first 20-s period, but bursting occurs every time the
reference signal subsequently changes. (c) The parameter error ‖θ̃(t)‖2. The adaptive system fully learns the true parameters by the end
of the simulation, but while the parameters are not fully learned, bursting occurs.
where
Āp = [−0.5078  0.8839; 9.4950  −5.3970]. 45.
One can verify that Ā p , bp , and Ah satisfy the matching condition in Assumption 1 for some θ ∗ and
k∗ = 1. Because the perturbation in the plant’s dynamics is unknown to the control designer, we
choose an initial parameter estimate of θ (0) = θ LQR . We assume x(0) = 0. The resulting closed-
loop adaptive system with Equations 18, 19, 43, and 44 is simulated. k(t) is set to 1, and Γθ is chosen
as the identity matrix. The responses are shown in Figures 1 and 2. In Figure 1b,c, the tracking
error ‖x(t) − xm(t)‖2 and the parameter error norm ‖θ(t) − θ∗‖2 are shown for the reference signal
r(t) shown in Figure 1a, where e(t) = x(t) − xm(t) and θ̃(t) = θ(t) − θ∗. The same is illustrated in
Figure 2a–c as well.
Figure 1 is an illustration of imperfect learning: Although the tracking error in Figure 1b has
gone completely to zero at the end of the initial 20-s interval, the parameter error in Figure 1c
Figure 2
Simulation results of a simple F-16 model with full learning. (a) The reference signal r(t). In this simulation, the reference signal is
modified during the first 20-s period to provide sufficient excitation for full learning. (b) The tracking error ‖e(t)‖2. Due to full learning
during the first 20 s, the plant tracks the reference model in Equation 43 through subsequent changes in r(t) without bursting. (c) The
parameter error ‖θ̃(t)‖2. Due to persistent excitation within the first 20-s time window, the adaptive system quickly learns the true
parameters.
remains nonzero due to lack of excitation in the reference input r(t) (Figure 1a). As a result,
each time r(t) subsequently changes, which could occur at any instant, the plant and reference
model initially move in different directions. The bursting phenomenon is therefore apparent in
Figure 1b because each time r(t) changes, the tracking error explodes to large nonzero values
before the adaptive law is able to correct this behavior. Contrast Figure 1 with Figure 2, in which
the reference input is set to r(t) = sin(0.5t) + sin(1.5t) during the first 20-s interval to provide the
plant with persistent excitation and allow the parameter error to go to zero alongside the tracking
error, as evidenced by Figure 2c. Learning occurs in this second simulation, and thus no bursting
is exhibited in Figure 2b.
This simple example uses a piecewise constant reference input with large jumps to illustrate the
danger of imperfect learning. In practice, a control designer might not choose such an adversarial
reference input. However, the same phenomenon can be caused by several other potentially adver-
sarial influences on the system that are out of the control designer’s hands, such as state-dependent
disturbances due to unmodeled dynamics (46).
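A minimal simulation sketch of this experiment is given below. The perturbed matrix Āp of Equation 45 and the gain θLQR of Equation 42 are taken from the text, while bp and the Hurwitz reference matrix Ah used here are illustrative assumptions, so the results are qualitative rather than a reproduction of Figures 1 and 2; the two reference signals mirror the piecewise-constant and persistently exciting cases.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Sketch of the adaptive system of Equations 18, 19, 43, and 44, integrated with
# forward Euler. A_bar (Equation 45) and theta_lqr (Equation 42) are from the text;
# b_p and the Hurwitz matrix A_h below are illustrative assumptions, so the
# trajectories differ quantitatively from Figures 1 and 2.
A_bar = np.array([[-0.5078, 0.8839], [9.4950, -5.3970]])   # true plant (Equation 45)
b_p = np.array([[0.0], [-1.0]])                             # assumed input vector
A_h = np.array([[-0.5078, 0.8839], [-2.0, -3.0]])           # assumed Hurwitz reference matrix
theta_lqr = np.array([[-0.1536], [0.8512]])                 # initial estimate (Equation 42)

P = solve_continuous_lyapunov(A_h.T, -np.eye(2))            # A_h^T P + P A_h = -Q (Equation 21)

def simulate(r_fun, T=80.0, dt=1e-3):
    x, xm = np.zeros((2, 1)), np.zeros((2, 1))
    theta = theta_lqr.copy()                                 # theta(0) = theta_LQR, k(t) = 1
    err = []
    for k in range(int(T / dt)):
        r = r_fun(k * dt)
        u = (theta.T @ x).item() + r                         # Equation 18 with k(t) = 1
        e = x - xm
        theta += dt * (-(e.T @ P @ b_p).item() * x)          # Equation 19, sign(k*) = 1, Gamma = I
        x += dt * (A_bar @ x + b_p * u)                      # plant (Equation 44)
        xm += dt * (A_h @ xm + b_p * r)                      # reference model (Equation 43)
        err.append(np.linalg.norm(e))
    return np.array(err)

# Piecewise-constant reference (imperfect learning) versus a PE reference in the first 20 s.
e_steps = simulate(lambda t: [2.0, 6.0, 1.0, 4.0][int(t // 20) % 4])
e_pe = simulate(lambda t: np.sin(0.5 * t) + np.sin(1.5 * t) if t < 20 else 4.0)
print("peak tracking error after t = 20 s:", e_steps[20000:].max(), "vs.", e_pe[20000:].max())
```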
where zi denotes the ith element of z, and the functions γ i , β i , and β are all known, with θ as
an unknown parameter, all with suitable dimensions. The backstepping approach involves the
construction of a suitable Lyapunov function that allows a stable adaptive controller even though
a certainty-equivalence-based approach does not readily lead to a suitable structure and overcomes
the fact that the triangular structure does not satisfy a matching condition.
Several approaches to AC of nonlinear systems are based on approximation of nonlinear right-
hand sides by linear ones. There are only a few publications with explicit formulations of dynamic
properties of the overall system, e.g., a paper by Wen & Hill (49) where the nonlinear model is
reduced by standard linearization via finite differences. There are a few results dealing with high-
gain linear controllers for nonlinear systems (50, 51). Other tools, such as absolute stability (52,
53), passivity (54), passification (55–57), and immersion and invariance (58), have led to a successful
set of approaches for AC of nonlinear systems.
An additional approach that requires special mention is AC of nonlinear systems based on neu-
ral networks. The basic principles of using neural networks in control of nonlinear systems have
been addressed in a number of papers (59–62), and this approach is one of the common tools
employed in both AC and RL due to neural networks’ ability to approximate complex nonlinear
maps and powerful interpolation properties. Another approach that has been used is an approxi-
mation of Lyapunov functions for control using neural networks, so that the stability of the closed
loop (63, 64) is guaranteed. The central challenge addressed in all of these works is to come up
with a stable approach that addresses the well-known fact that there is an underlying issue of an
treatment wherein principles of optimality as well as the entire approach outlined here have clear
analogs.
The goal is to design a control input of the form
uk = π (xk ) 49.
so that an underlying cost Jπ (x0 ) is minimized, where
Jπ(x0) = lim_{k→∞} (1/k) Σ_{i=0}^{k} c(xi, π(xi)), x0 ∈ X, 50.
and π is a deterministic policy that maps x ∈ X to u ∈ U(x). X ⊂ Rn and U ⊂ Rm are desired sets
for the state x and the control input u, respectively. It is assumed that policies π (xk ) can be found
such that
0 ≤ c(xk, uk) < ∞, ∀xk ∈ X, uk ∈ U(xk). 51.
With the above problem statement, optimal control results provide a framework for determining
the solutions of Equation 50. The foundation for this framework comes from the well-known
Bellman equation:
J∗(x) = min_{u∈U(x)} {c(x, u) + J∗(f(x, u))}, ∀x ∈ X, 52.
in which the fixed point solution J∗ is the optimal cost. The solution leads to an optimal control
input u∗ (xk ) that satisfies
u∗(x) = arg min_{u∈U(x)} {c(x, u) + J∗(f(x, u))}, ∀x ∈ X. 53.
The dynamic programming approach utilizes the principle of optimality in which the following
cost-to-go problem is solved:
J∗(xk) = min_{uk∈Uk} {c(xk, uk) + J∗(xk+1)}, 54.
with a boundary condition J∗ (x∞ ) specified. The challenge in solving the Bellman equation comes
from its computational burden, especially when the underlying dimensions of x and u as well as
the sets U and X are large and when f is unknown.
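To make the recursion in Equation 54 concrete, the following sketch applies it to an assumed, deliberately tiny deterministic problem with a zero-cost absorbing goal state, for which the cost-to-go is simply the number of steps to the goal; value iteration repeatedly applies the right-hand side of Equation 54 until J converges.

```python
import numpy as np

# Minimal illustration of the cost-to-go recursion of Equation 54 on an assumed
# finite problem: states 0..4 on a line, actions move left or right, state 4 is
# an absorbing goal with zero cost, and every other step costs 1.
n_states, actions = 5, (-1, +1)

def f(x, u):                        # known dynamics x_{k+1} = f(x_k, u_k)
    return x if x == 4 else int(np.clip(x + u, 0, 4))

def c(x, u):                        # immediate cost c(x_k, u_k)
    return 0.0 if x == 4 else 1.0

J = np.zeros(n_states)              # initial guess of the cost-to-go
for _ in range(50):                 # value iteration: repeatedly apply Equation 54
    J = np.array([min(c(x, u) + J[f(x, u)] for u in actions) for x in range(n_states)])

policy = [min(actions, key=lambda u, x=x: c(x, u) + J[f(x, u)]) for x in range(n_states)]
print("J* =", J)                    # expected: [4, 3, 2, 1, 0]
print("greedy policy (Equation 53):", policy)
```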
Motivated by these concerns of computational burden and overall complexity in determining
the optimal control input when the model is uncertain, an approximation is employed to solve the
Bellman equation and forms the subject matter of the field of RL, often denoted as approximate dy-
namic (or sometimes neuro-dynamic) programming. The evolution of this field centrally involves
the determination of a suitable approximate cost and a corresponding approximately optimal con-
trol policy. Various iterative approaches, including policy iteration, Q-learning, and value iteration,
have been proposed in the literature to determine the best approximation. Bertsekas (65, 66) and
Watkins & Dayan (67) have published excellent expositions on Q-learning and value iteration. A
brief discussion of policy iteration is given below.
Here, policy iteration is illustrated using the concept of a Q-function. The optimal Q-function
corresponds to the cost stemming from the current control action, assuming that all future costs
will be optimized using an optimal policy. Such a function is defined as the solution to
Q∗(xk, uk) = c(xk, uk) + min_{u∈U} Q∗(f(xk, uk), u). 55.
An implicit assumption here is that the cost c(x, u) can be determined in Equation 57, even
when the model is unknown, using the concept of an oracle (68). As Q̂i → Q∗, it follows that
the corresponding input from Equation 56 will yield the optimal u∗.
To determine an efficient approximation, a parametric approach is often used. Denoting this
parameter as θ ∈ R p , we estimate the Q-function as
Q̂θ(xk, uk) = Σ_{i=1}^{p} φi(xk, uk)θi, 58.
where φ(x, u) is a basis function in R p for the Q-function. One can then associate a parameter θ π
for a particular policy uk = π (xk ) by defining
Q̂θπ(xk, uk) = φᵀ(xk, uk)θπ. 59.
Alternatively, a neural network can be used, in which case the Q-function is parameterized as
Q̂θπ(xk, uk) = g(xk, uk, θπ), 60.
where g is a nonlinear function of xk and uk as well as parameters θ π , which represent the weights
of the neural network. This ability to approximate even complex functions has been successfully
leveraged in AC as well (59, 60, 62, 69). The structure of the network g often permits a powerful
approximation and leads to a desired approximation Q̂∗θ with a minimal approximation error. By
collecting several samples {xkj, ukj, ckj}, j = 1, . . . , N, subject to the policy ukj = π(xkj), one can
use a least-squares-based approach (70, 71) to compute θπ. One can also use a recursive approach
to determine these parameters, based on a gradient descent rule that minimizes a loss function of
the approximation error, an example of which is the well-known back-propagation approach (72,
73). A large number of variations occur in the type of adjustments used in determining θ π (74–76),
motivated by performance, efficiency, and robustness, an exposition of which is beyond the scope
of this article. Once θ π is determined, an approximate optimal policy is determined as
û(x) = arg min_{u∈U(x)} Q̂(x, u) = arg min_{u∈U(x)} g(x, u, θπ). 61.
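A minimal sketch of this parametric approach is given below: it fits the linear-in-parameters Q-function of Equations 58 and 59 by ordinary least squares and then extracts the greedy policy of Equation 61 by a finite search. The scalar system, quadratic basis, behavior policy, and the use of truncated empirical costs-to-go as regression targets are all assumptions for illustration; the least-squares formulations in References 70 and 71 may differ in detail.

```python
import numpy as np

# Least-squares fit of a linear-in-parameters Q-function (Equations 58 and 59)
# followed by the greedy policy of Equation 61. The system, basis, behavior
# policy, and rollout horizon are assumptions chosen only for this sketch.
rng = np.random.default_rng(0)
a, b = 0.8, 1.0                                    # x_{k+1} = a x_k + b u_k (oracle/simulator)
cost = lambda x, u: x**2 + 0.1 * u**2              # immediate cost c(x, u)
policy = lambda x: -0.5 * x                        # behavior policy pi(x)
phi = lambda x, u: np.array([x**2, u**2, x * u, 1])  # basis phi(x, u), p = 4

def rollout_cost(x, u, horizon=200):
    """Empirical cost-to-go of applying u at x and then following pi (truncated)."""
    total = cost(x, u)
    x = a * x + b * u
    for _ in range(horizon):
        u = policy(x)
        total += cost(x, u)
        x = a * x + b * u
    return total

# Collect samples {x_j, u_j} with their empirical costs-to-go, then solve for theta_pi.
X = rng.uniform(-2, 2, size=200)
U = rng.uniform(-2, 2, size=200)
Phi = np.array([phi(x, u) for x, u in zip(X, U)])
targets = np.array([rollout_cost(x, u) for x, u in zip(X, U)])
theta_pi, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

# Greedy (approximately improved) policy of Equation 61, via a finite search over u.
u_grid = np.linspace(-3, 3, 601)
u_hat = lambda x: u_grid[np.argmin([phi(x, u) @ theta_pi for u in u_grid])]
print("theta_pi =", theta_pi, "  u_hat(1.0) =", u_hat(1.0))
```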
In recent years, RL has been increasingly used to not only learn the optimal policy through
an approximate structure but also carry out this learning when the dynamic system is uncertain.
And it is in the context of the latter problem that the commonality between AC and RL begins to
emerge. As the RL approach is predicated on access to a simulation engine or dataset that enables
repeated exploration of various policies, one of the major uncertainties that must be contended
with is the oft-mentioned challenge of a sim-to-real gap (77, 78), which describes the difficulty of
generalizing a trained policy to a new environment. This challenge takes even more of a center
stage in the context of real-time control.
Three points should be noted in particular. First, RL algorithms, whether based on nonrecur-
sive or recursive approaches, are geared toward ensuring that the function approximations (e.g.,
Q̂), and not the parameter estimates of θ π , converge to their true values. It should be noted that
when the dimensions of θ π are large, as in deep networks, the focus is exclusively on the opti-
mization of the underlying Q-function, value function, or policy. Any such overparameterization
often negates identifiability and can lead to imperfect learning, as there can be infinitely many θ̂ that
solve Q̂θ̂(x, u) = Q̂θπ(x, u).
Second, the parametric updates, whether using least squares or recursive counterparts, are exclu-
sively an offline exercise. As we move the focus of the RL methods toward an online solution, a
huge set of obstacles that were mentioned in the previous section will have to be addressed here.
Imperfect learning can often occur because of a lack of identifiability or lack of convergence. It is
not clear that robustness properties will always be satisfied or that bursting can be avoided. Third,
the approximation error in Q is a function of the input u and the state x. This in turn implies
once again that in real time, any perturbations that occur due to departures from the simulation
environment, because of unforeseen anomalies, environmental changes, or modeling errors, may
lead to fundamental questions of stability and robustness.
where A and B are unknown matrices. The control objective is to determine uk such that the cost
function
J(A, B) := lim sup_{T→∞} (1/T) Σ_{i=1}^{T} (xiᵀQxi + uiᵀRui), 63.
where Q = QT > 0 and R = RT > 0, is minimized. To address this objective, the following first
describes the AC approach (both direct and indirect) and then the RL approach.
5.1.1. Adaptive control approach. In what follows, we outline two different approaches, indi-
rect and direct, where in the former, the plant parameters are first estimated and then the control
parameters are determined, whereas in the latter, the control parameters are directly estimated.
5.1.1.1. Indirect approach. It is well known that for this linear–quadratic system, the following
control input is optimal:
uk = K (A, B)xk , 64.
where
K (A, B) = −[BT PB + R]−1 BT PA
and P solves the discrete-time algebraic Riccati equation
P = AT PA − AT PB(BT PB + R)−1 BT PA + Q.
The results of Campi & Kumar (79) show that the problem becomes significantly more com-
plex when A and B are unknown, and the control gain in Equation 64 must be replaced with one
that depends on parameter estimates of A and B. Suppose we define (AkLS, BkLS) as the least-squares
estimate of [A, B], i.e.,
(AkLS, BkLS) := argmin_{(A,B)∈Θ0} Σ_{s=1}^{k} ||xs − Axs−1 − Bus−1||². 65.
Becker et al. (44) showed that for autoregressive–moving-average with exogenous inputs
(ARMAX) systems, which are equivalent representations of the plant in Equation 62, the param-
eter estimates can converge to false values with positive probabilities when measurement noise is
present in Equation 62; Borkar & Varaiya (80) provided an example of the above statement for
general Markov chains. Campi & Kumar (79, 81) reported an interesting fix for this problem that
enabled a suboptimal solution. The following assumptions are made.
Assumption 2. The true value (A0, B0) lies in the interior of a known compact set Θ0, and
(A, B) is stabilizable at all points in this compact set Θ0.
The approach used by Campi & Kumar (79) consists of adding a bias term to the cost J in
Equation 63 so as to lead to estimates of the form
(Âk, B̂k) = argmin_{(A,B)∈Θ0} { Σ_{s=1}^{k} ||xs − Axs−1 − Bus−1||² + μk J(A, B) }, 66.
In the above, μk is a deterministic sequence that tends to infinity as o(log k). The points to note here
are that (a) the estimation is based on a nonrecursive approach, (b) the problem is strictly focused
on a stabilization task, and (c) the solution exploits the linear dynamics and the quadratic structure of the cost.
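The indirect approach can be sketched as follows: the data are fit by ordinary least squares as in Equation 65 (here without the cost-bias term or the compact-set constraint), and the certainty-equivalence gain of Equation 64 is recomputed from the current estimates. The true system, probing noise, and horizon are assumptions made for this illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Sketch of the indirect (certainty-equivalence) approach of Section 5.1.1.1:
# batch least squares as in Equation 65 (without the cost-bias term), combined
# with the gain of Equation 64. System, noise, and horizon are assumptions.
rng = np.random.default_rng(1)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])

def lq_gain(A, B):
    """K(A, B) = -(B^T P B + R)^{-1} B^T P A, with P from the discrete Riccati equation."""
    P = solve_discrete_are(A, B, Q, R)
    return -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)

x, K = np.zeros((2, 1)), np.zeros((1, 2))
Phi, Y = [], []                                            # data: [x_{s-1}; u_{s-1}] -> x_s
for k in range(400):
    u = K @ x + 0.5 * rng.standard_normal((1, 1))          # probing term for excitation
    x_next = A_true @ x + B_true @ u + 0.01 * rng.standard_normal((2, 1))
    Phi.append(np.vstack([x, u]).ravel()); Y.append(x_next.ravel())
    if k >= 20:                                            # least squares over all data so far
        Theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(Y), rcond=None)
        A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]
        try:
            K = lq_gain(A_hat, B_hat)                      # certainty-equivalence controller
        except (np.linalg.LinAlgError, ValueError):
            pass                                           # keep previous gain if Riccati solve fails
    x = x_next

print("A_hat =\n", A_hat, "\nB_hat =\n", B_hat, "\nK =", K)
```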
5.1.1.2. Direct approach. In the direct approach, the control parameters themselves are directly estimated. To describe this method, we start with a single-input version of
Equation 62, rewritten as
xk+1 = Axk + buk , 68.
where for ease of exposition only A is assumed to be unknown. Suppose that a nominal value of A
is given by Am , the reference system is designed as in Section 3.1.2 as
xm(k+1) = Am xmk + bumk , 69.
and umk is designed to optimize a cost function
J(Am, b) := lim sup_{T→∞} (1/T) Σ_{i=1}^{T} (xmiᵀQxmi + umiᵀRumi). 70.
Noting that the cost function is quadratic and the dynamics in Equation 69 is linear, the optimal
control input is given by umk = kTopt xmk , which leads to a reference system xm(k + 1) = Ah xmk , where
Ah is Schur-stable and is such that it leads to an optimal cost of J. We assume that (A, b) satis-
fies Assumption 1, with Equation 16 satisfied for k∗ = 1. The structure of the optimal input umk
suggests that as θ ∗ is not known, an AC input is chosen as
uk = θkT x pk , 71.
which leads to the closed-loop adaptive system
xk+1 = (Ah + bθ̃kT )xk , 72.
where θ̃k = θk − θ∗. To drive the parameter error θ̃k to zero, we attempt to drive ek to zero, where
ek = xpk − xmk.
As in Section 3.1.2, we derive an error model that relates ek to θ̃k , which takes the form
e(k+1) = Ah ek + bθ̃kT xk . 73.
The question then is whether it is possible to adjust the parameter estimate as
θk+1 = θk − γ g(ek , θk ) 74.
using a suitable gradient g(ek , θ k ) that ensures stability, convergence of ek to zero, and optimality.
The following constants, both scalars and matrices, are first defined:
P = Pᵀ > 0 such that AhᵀPAh − P = −Q, Q = Qᵀ > 0, 75.
c = 2AhᵀPb, d > bᵀPb, d0 = d/(bᵀPb). 76.
Next, we define a few variables:
N(x) = 1/(1 + αγ d xᵀx), 77.
ωk+1 = Ahωk + b(αγ xkᵀxk)yk, 78.
εk = ek − ωk, 79.
yk = N(xk)[cᵀεk + d0 bᵀP(ek+1 − Ahek)]. 80.
In the above, Equation 75 is the Lyapunov equation for discrete linear time-invariant systems,
yielding a positive definite solution P; d0 is a positive constant that exceeds unity and is useful in
developing the update law for θ k ; c determines a unique combination of a vector of state errors and
plays a central role that will become clear below; and ωk and εk (and therefore yk) are augmented
state and output errors, respectively, that lead to a provably correct adaptive law that is in lieu of
What is remarkable about the choice of the gradient g(·, ·) as yk xk is that it causes the positive
definite function
Vk = 2εkᵀPεk + (1/γ)θ̃kᵀθ̃k 82.
to become a Lyapunov function, i.e., ΔVk = Vk+1 − Vk ≤ 0. To provide a rationale for the choice of V in
Equation 82, the following theorem is needed.
Theorem 1 (87). A dynamic system given by
together with the adaptive law in Equation 81, permits the Lyapunov function in
Equation 82 with a nonpositive decrease
The crucial point to note here is that the augmented state error εk and the output yk in Equations 79
and 80 can be shown to satisfy the dynamic relations in Equation 83. Thanks to Theorem 1
(87), we have therefore identified a provably correct law (Equation 81) for the control param-
eter θ k in Equation 71. This leads to the complete solution using the direct AC approach, given
by Equations 71 and 81.
Several properties follow from Theorem 1. First, θk and εk are bounded. Second, as k → ∞,
εk, νk, and ykxk all approach zero. Third, because νk = θ̃kᵀxk − αγ xkᵀxk yk, it follows that θ̃kᵀxk
approaches zero asymptotically. Fourth, the true state error ek approaches zero asymptotically, as
Ah is Schur-stable. And fifth, all variables in the closed-loop system are bounded. The dynamic
model in Equations 83 and 84 can be viewed as an SPR transfer function between ν k and yk . The
normalization N(x) as in Equation 77 and the choice of yk as in Equation 80 were necessary to
create such an SPR operator, which clearly adds to the complexity of the underlying solution.
dence on persistent excitation a strong one; in this regard, a direct approach is more robust, as
imperfect learning is implicitly accounted for in its formulation. Its development, however, en-
tails more complexity because the algorithm requires an appropriate gradient function g(·) that
5.2.1. Adaptive control approach. We start with the case when the dynamic system in
Equation 86 is affine and controllable, where f (x) is assumed to be unknown, and g(x) = B. With-
out loss of generality, it is assumed that f (0) = 0. With this assumption, we rewrite Equation 86
as
ẋ = Ax + B[u + f1 (x)], 88.
where A and B denote the Jacobian evaluated at the equilibrium point x = 0, and (A, B) is a
controllable pair. The uncertainty in f and g in Equation 86 can be assumed to pertain to A, B, and
f1 (·).
Similar to Example 1, we propose a reference system
ẋr = Ar xr + Br [ur + f1r (xr )] 89.
where Ar , Br , and f1r (x) can be viewed as nominal values of the matrices A and B and the nonlinearity
f1 (x). Suppose that a nominal controller is designed as
ur = −f1r(xr) + Θl,r xr + ucom, 90.
where Θl,r is such that Ar + BrΘl,r = AH, with AH a Hurwitz matrix, and ucom is such that xr(t)
tracks a desired command signal xcom (t) as closely as possible and such that the cost function in
Equation 87 corresponding to xr and ucom is minimized.
Suppose that the plant dynamics in Equation 88 had uncertainties that satisfy the following
assumptions.
Assumption 3. The nonlinearity f1(x) is equal to ΘnΦn(x), where Θn ∈ Rm×l is unknown,
while Φn(x) ∈ Rl is a known nonlinearity.
Assumption 4. The unknown linear parameters A and B are such that a matrix Θ∗ ∈ Rm×n
exists such that
A + BrΘ∗ = AH 91.
and a diagonal matrix Λ exists, with known signs for all diagonal entries, such that
B = BrΛ. 92.
where the uncertainties in the plant to be controlled are lumped into the matrices Λ ∈ Rm×m
and Θn ∈ Rm×l. The structure of the uncertainty in Equation 92 often occurs in many practi-
cal applications in the form of a loss of control effectiveness. This is typically due to unforeseen
anomalies that may occur in real time, such as accidents or aging in system components, especially
in actuators. Parametric uncertainty n in the nonlinearity f1 (x) may be due to modeling errors.
As nonlinearities are always more difficult to model even with system identification, it may not
always be possible to accurately identify the parameters of nonlinear effects even if the underlying
mechanisms may be known; this provides the rationale for Assumption 3. The problem here is
therefore control of Equation 93, where Br and Φn(x) are known but A, Λ, and Θn are unknown.
Overall, the model structure in Equation 93 is utilized to develop the AC solution.
The adaptive controller is chosen as follows:
u = Θ̂(t)Φ(t), 94.
Θ̂̇ = −γBrᵀPeΦᵀ,  e = x − xr, 95.
where Θ̂ is a parameter estimate of Θ, and
Θ := [Λ⁻¹, −Θn, Λ⁻¹Θ∗],  Φ := [ucomᵀ, Φn(x)ᵀ, xᵀ]ᵀ. 96.
The efforts in AC guarantee that the closed-loop system determined by Equations 93–95 is
globally stable for any initial conditions in x(0), xr(0), and Θ̂(0).
This follows by deriving an error model that relates e and the parameter error Θ̃ = Θ̂ − Θ,
which takes the form
ė = AHe + BrΛ[u − ΘΦ]. 97.
That is, the key component that connects the uncertain parameter Θ to the performance error e
is the regressor Φ. Together with the AC input as in Equation 94, we get a fundamental error model
of the form
ė = AHe + BrΛ[Θ̃Φ]. 98.
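A compact numerical sketch of the controller in Equations 94–96 is given below for a second-order example. All numerical values (AH, Br, Λ, Θn, the nonlinearity Φn, the command ucom, and the gains) are assumptions chosen only for illustration, and forward Euler is used for integration.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Sketch of the adaptive controller of Equations 94-96 on a second-order example.
# All numerical values below are illustrative assumptions.
A_H = np.array([[0.0, 1.0], [-2.0, -3.0]])       # Hurwitz matrix of Equation 90
B_r = np.array([[0.0], [1.0]])
Lam = np.array([[0.6]])                           # loss of control effectiveness (Equation 92)
Theta_n = np.array([[0.8]])                       # unknown weight of the known nonlinearity
Phi_n = lambda x: np.array([[x[0, 0] ** 3]])      # known nonlinearity Phi_n(x) (Assumption 3)
Theta_star = np.array([[1.5, -0.5]])
A = A_H - B_r @ Theta_star                        # consistent with Equation 91
B = B_r @ Lam                                     # consistent with Equation 92

P = solve_continuous_lyapunov(A_H.T, -np.eye(2))  # A_H^T P + P A_H = -Q
gamma, dt = 10.0, 1e-3
u_com = lambda t: np.array([[np.sin(t)]])

x, xr = np.zeros((2, 1)), np.zeros((2, 1))
Theta_hat = np.zeros((1, 4))                      # estimate of Theta = [Lam^-1, -Theta_n, Lam^-1 Theta*]
for k in range(int(40.0 / dt)):
    t = k * dt
    Phi = np.vstack([u_com(t), Phi_n(x), x])      # regressor of Equation 96
    u = Theta_hat @ Phi                           # Equation 94
    e = x - xr
    Theta_hat += dt * (-gamma * (B_r.T @ P @ e) @ Phi.T)   # Equation 95
    x += dt * (A @ x + B @ (u + Theta_n @ Phi_n(x)))        # plant (Equation 88)
    xr += dt * (A_H @ xr + B_r @ u_com(t))                   # reference system (Equations 89 and 90)

print("final tracking error ||x - xr||:", np.linalg.norm(x - xr))
```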
The following comments can be made regarding the choice of the adaptive controller and the
behavior of the closed-loop system.
definite.
3. There are several extensions of the AC approach to broader nonlinear systems, as in
It is clear that Assumption 5 assumes that despite the uncertainty in Equation 86, a stabilizing
policy u1 (x) can be found for some V0 . It should also be noted that V0 serves as a Lyapunov function
for this system. It then follows that V0 could be used as an upper bound for the cost incurred using
this stabilizing policy—that is,
J(x0 , u1 ) ≤ V0 (x0 ), ∀x0 ∈ Rn . 101.
This stability assumption is then connected with optimality through the Hamilton–Jacobi–
Bellman equation, necessitating the following assumption.
Assumption 6. There exists a proper, positive definite, and continuously differentiable
function V∗ (x) such that the Hamilton–Jacobi–Bellman equation holds:
H(V ∗ ) = 0, 102.
where
1
H(V ) = ∇V T (x) f (x) + q(x) −∇V T (x)g(x)R−1 (x)gT (x)∇V (x).
4
Such a V∗ can be easily shown to be a Lyapunov function such that
V ∗ (x0 ) = min J(x0 , u) = J(x0 , u∗ ), ∀x0 ∈ Rn , 103.
u
corresponds to the optimal value function and also yields the optimal control input
u∗(x) = −(1/2)R⁻¹(x)gᵀ(x)∇V∗(x). 104.
Finding a V∗ that solves Equation 102 is in general intractable. In addition, it is easy to see that it requires
knowledge of f and g. Policy iteration is often used to find V∗ , where the control input u is iterated,
and with each new input u, the corresponding Lyapunov function is computed. The following
procedure is often utilized:
1. For i = 1, 2, . . . , solve for the cost function Vi (x) ∈ C 1 , with Vi (0) = 0, from the partial
differential equation
L(Vi (x), ui (x)) = 0. 105.
It can be seen that solving for Equation 105 requires the immediate cost c(x, ui ), which may
be available through an oracle when the plant model is not known.
2. Update the control policy using ui and the value function estimate Vi as
ui+1(x) = −(1/2)R⁻¹(x)gᵀ(x)∇Vi(x). 106.
That is, instead of solving Equation 102, the approach seeks to find a sequence of Vi that satisfies
Equation 105. Step 1 is referred to as policy evaluation, and step 2 is referred to as policy im-
provement. Then convergence of the stabilizing policy u1 to an optimal policy u∗ can be achieved,
stated in the following theorem.
Theorem 2. Suppose Assumptions 5 and 6 hold, and the solution Vi (x) ∈ C 1 satisfying
Equation 105 exists, for i = 1, 2, . . . . Let Vi (x) and ui + 1 (x) be the functions generated from
Equations 105 and 106. Then the following properties hold, i = 0, 1, . . . :
1. V ∗ (x) ≤ Vi+1 (x) ≤ Vi (x), ∀x ∈ Rn .
2. ui + 1 is globally stabilizing.
3. Suppose there exist V o ∈ C 1 and uo such that ∀x0 ∈ Rn , we have limi → ∞ Vi (x0 ) = Vo (x0 )
and limi → ∞ ui (x0 ) = uo (x0 ). Then, V∗ = Vo and u∗ = uo .
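For linear dynamics f(x) = Ax, g(x) = B, quadratic cost q(x) = xᵀQx, and constant R, the value functions are quadratic, Vi(x) = xᵀPix, and the two steps above reduce to a Lyapunov equation (policy evaluation, Equation 105) and a gain update (policy improvement, Equation 106); this is Kleinman's classical iteration. The sketch below illustrates this special case; A, B, Q, R, and the initial stabilizing gain are assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Policy iteration (Equations 105 and 106) specialized to linear dynamics
# f(x) = Ax, g(x) = B with quadratic cost, where V_i(x) = x^T P_i x. Policy
# evaluation becomes a Lyapunov equation and policy improvement a gain update.
# A, B, Q, R, and the initial stabilizing gain K are illustrative assumptions.
A = np.array([[0.0, 1.0], [1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

K = np.array([[3.0, 1.0]])                 # u_1(x) = -K x, stabilizing (Assumption 5)
for i in range(10):
    A_cl = A - B @ K
    # Policy evaluation (Equation 105): A_cl^T P + P A_cl + Q + K^T R K = 0.
    P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))
    # Policy improvement (Equation 106): u_{i+1}(x) = -R^{-1} B^T P x.
    K = np.linalg.solve(R, B.T @ P)

P_star = solve_continuous_are(A, B, Q, R)  # the fixed point V*(x) = x^T P* x
print("||P_i - P*|| after 10 iterations:", np.linalg.norm(P - P_star))
```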
The problem that still remains is the following. The solution of Equation 105 that deter-
mines a global solution Vi (x) for all x and policies ui (x), especially when the precise knowledge
of f or g is not available, is still nontrivial. Also, as mentioned earlier, any approximation-based
approaches, such as RL, pose problems of stability and robustness. As stated succinctly by Jiang &
Jiang (89, p. 2919), “although approximation methods can give acceptable results on some compact
set in the state space, they cannot be used to achieve global stabilization.” Any efforts to reduce
the approximation error, including neural networks and sum of squares, carry with them a large
computational burden in addition to issues of robustness. By contrast, stability and robustness
properties are clearly delineated with the use of AC. However, these guarantees are predicated on
specific structures of dynamic systems such as Equation 88 or Equations 46 and 47.
Figure 3
Schematics of (a) an adaptive controller and (b) reinforcement learning. The adaptive controller is an online solution that monitors the
performance in real time (such as the loss function L1 in Equation 11 or L2 in Equation 23) and suitably designs the control input u.
The adaptive controller uses the plant model structure to identify an underlying regressor φ and a reference model to determine both
the input and a parameter estimate θ . The regressor and the reference model lead to an error model that relates the parameter error
to the real-time performance. This error model is utilized to determine the rule by which the parameter estimate θ is adjusted. Such an
adaptive system is always guaranteed to achieve the desired real-time performance, even with imperfect learning. In addition, if the
regressor φ satisfies persistent excitation properties, then parameter learning takes place as well. Reinforcement learning is an offline
approach where the oracle is the entity used to generate the response to a policy choice and can be viewed as the plant model. An
immediate cost c(x, ui ) is computed based on the system response generated by the oracle and is used to update the estimated value
function Vi − 1 (x). An iterative procedure is used to update both the policy as ui and estimates of value function, using which an optimal
policy u∗ (x) is obtained after ui (x) converges. If the oracle is identical to the plant dynamics, then applying u∗ (x) to the plant online
achieves the optimal value function V∗ (x).
and that one can collect a large amount of data pertaining to (x, u) pairs, often enough to permit the
learning of the optimal policy. When the underlying Bellman equation becomes computationally
infeasible to solve, approximations are deployed. The optimality of the resulting policy improves
as the approximation error becomes small.
All of the above statements apply to Example 2 as well. The AC approach relied completely on a
model structure, including the order as well as the nature of the nonlinearities present. The specific
result outlined required the system to be feedback linearizable. Here, too, imperfect learning was
possible by leveraging the model structure, including that of the nonlinearities. The focus once
again was on stability, and as before, optimality follows once learning takes place. The specific RL
approach that was discussed proposed an optimal solution for a class of nonlinear systems under
some assumptions.
Schematics of both the AC and RL approaches are shown in Figure 3. The RL approach
indicated in the figure illustrates popular methods, including the Q-learning approach described
in Section 4 and methods based on temporal-difference learning (90). It should be noted that there
are online methods based on RL applied to the control of nonlinear systems (89, 91, 92) that
provide analytical guarantees based on assumptions such as knowledge of an initial stable control
law u0 and a sufficiently accurate oracle to reduce the sim-to-real gap.
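The structure of the adaptive loop in Figure 3a can be illustrated with a minimal Python sketch for a scalar plant ẋ = ax + u with a unknown: the regressor is φ = x, the reference model is ẋm = am xm + r, and the error model ė = am e + (θ − θ∗)φ yields the gradient adjustment rule θ̇ = −γ e φ. The plant parameter, reference model, adaptation gain, and command signal below are assumptions made only for illustration.

import numpy as np

a_true, a_m, gamma, dt = 2.0, -3.0, 5.0, 1e-3   # plant parameter (unknown to the controller), reference model, gain
x, x_m, theta = 0.0, 0.0, 0.0                   # the ideal value of theta is theta* = a_m - a_true
for k in range(int(40 / dt)):
    r = np.sign(np.sin(0.5 * k * dt))           # square-wave command (persistently exciting here)
    u = theta * x + r                           # control input from the current parameter estimate
    e = x - x_m                                 # real-time performance: the tracking error
    theta += dt * (-gamma * e * x)              # adjustment rule derived from the error model
    x += dt * (a_true * x + u)                  # plant, Euler step
    x_m += dt * (a_m * x_m + r)                 # reference model, Euler step

print(f"tracking error {e:.4f}, parameter error {theta - (a_m - a_true):.4f}")

With the square-wave command providing persistent excitation, the parameter estimate tends to converge as well; with a less exciting command, the tracking error is still driven to zero even though learning remains imperfect, consistent with the caption of Figure 3.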
Both methods require assumptions and restrictions. The AC approach is predicated on a model
structure, with the rationale that the underlying problem is grounded in physical mechanisms and
is therefore amenable to a model structure that could be determined from an understanding of
physical laws, conservation equations, and constitutive relations. Assumptions that equilibrium
points, order, model structure, and feedback linearizability are all known are not always valid.
These assumptions are stress tested as the scope of the systems being addressed increases in com-
plexity, size, and scale. Not all systems can be modeled as in the above examples. Complex systems
and stringent performance specifications of efficiency, reliability, and resilience pose tremendous
challenges, as unmodeled components introduce uncertainties of various kinds in real time.
The RL approach outlined here is model agnostic and is therefore applicable to a wider class
of systems. However, restrictions arise because it is a data-intensive approach—often requiring
training on an offline dataset or through the use of an environment simulator. This implies in turn that sufficient information about the true system is available to replicate it in the form of a simulation engine or a collected dataset. That is, the sim-to-real gap is a significant challenge that must
be addressed to render RL applicable in real time for safety-critical systems. As uncertainties will always occur that cannot be anticipated or incorporated in the simulation engine, the desirable properties of robustness, stability, reliability, and resilience all need to be
addressed. In all cases, these decisions must be synthesized with inaccurate, partial, and limited in-
formation in real time, which in turn imposes challenges for both approaches, albeit different ones.
While the challenges seem formidable and pose roadblocks for both approaches, they also
present tremendous opportunities. The fact that the two approaches are different suggests that
there are ways in which they could be integrated so as to realize their combined advantages. The
focus of AC on stability and RL on optimality suggests that one such candidate is a multiloop
approach, with an inner loop focused on AC methods that are capable of delivering real-time per-
formance with stability guarantees and an outer loop focused on RL methods that can anticipate
an optimal policy when the sim-to-real gap is small (93–97).
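One possible realization of this multiloop idea is sketched below in Python, under stated assumptions: an outer-loop policy (a stand-in for a gain obtained offline, for example by RL on a simulator of the nominal model) commands the plant, while an inner adaptive loop cancels the mismatch between the nominal and true dynamics so that the closed loop the outer policy was designed for is recovered. All models, gains, and signals here are hypothetical.

import numpy as np

a_nom, a_true, b = -1.0, 0.8, 1.0         # nominal model used offline vs. true (unknown) dynamics
k_rl = 2.0                                # outer-loop gain, designed offline on the nominal model
a_ref = a_nom - b * k_rl                  # nominal closed loop; serves as the reference model
gamma, dt = 5.0, 1e-3

x, x_ref, theta = 1.0, 1.0, 0.0           # theta should approach (a_nom - a_true)/b
for k in range(int(20 / dt)):
    u_outer = -k_rl * x                   # outer loop: the offline-designed (e.g., RL-trained) policy
    u_inner = theta * x                   # inner loop: adaptive augmentation canceling the model mismatch
    e = x - x_ref                         # deviation from the behavior the outer loop expects
    theta += dt * (-gamma * e * x)        # gradient adaptive law acting on the mismatch
    x += dt * (a_true * x + b * (u_outer + u_inner))
    x_ref += dt * (a_ref * x_ref)         # reference model: the nominal closed loop

# Without persistent excitation theta need not converge to the true mismatch,
# but the tracking error is driven to zero, preserving stability.
print(f"tracking error {e:.4f}, residual mismatch {b * theta - (a_nom - a_true):.4f}")

The inner loop thus supplies the stability guarantee in the presence of the sim-to-real gap, while the outer loop retains the optimality that the offline design targeted.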
The problem of controlling a large-scale dynamic system in real time is exceedingly complex.
The two solutions that have been delineated in this article, AC and RL, are huge subfields of
control that have been researched over the past several decades. AC has been synthesized through
the lens of stability, parameter learning, online control, and continuous-time systems, and RL
has been synthesized through an optimality-based and data-driven viewpoint. It is clear that the
concept of learning is common to both. Stability is followed by learning and optimality in AC,
while RL attempts to achieve optimality through learning and simulation. While analytical rigor
and provable correctness in real time are hallmarks of AC, the approach is also hampered by several restrictions and by the difficulty of extending it to complex dynamic systems. Comparatively,
RL has achieved enormous success in difficult problems related to games and pattern recogni-
tion, although the lack of guarantees of stability and robustness is a deficiency that remains to be
addressed. Both approaches have learning as a fundamental tenet and employ an iterative proce-
dure that consists of an action–response–correction sequence. Despite these rich intersections and
commonalities, little effort has been expended in comparing the two approaches or in combining
their philosophies and methodologies. This article takes a first step in this direction.
There are several directions that were not explored here owing to space limitations. Each field
is vast, with several subtopics that have deep and varying insights and rich results. The intent here
is not to provide a comprehensive exposition of these topics but rather to expose the reader to
these distinct inquiries into an extremely challenging problem. Several societal-scale challenges,
including sustainability, quality of life, and resilient infrastructure, have at their core the need to
analyze and synthesize complex systems. AC and RL are fundamental building blocks that need
to be refined to meet this need.
ACKNOWLEDGMENTS
I gratefully acknowledge support from the Boeing Strategic University Initiative. I would also like
to acknowledge several useful discussions with Peter Fisher, Anubhav Guha, and Sunbochen Tang
as well as the comments of an anonymous reviewer, which led to a significant improvement in the
readability of the article.
LITERATURE CITED
1. Qu Z, Thomsen B, Annaswamy AM. 2020. Adaptive control for a class of multi-input multi-output plants
with arbitrary relative degree. IEEE Trans. Autom. Control 65:3023–38
2. Annaswamy AM, Fradkov AL. 2021. A historical perspective of adaptive control and learning. Annu. Rev.
Control 52:18–41
3. Krstić M. 2021. Control has met learning: aspirational lessons from adaptive control theory. Control Meets
Learning Seminar, virtual, June 9
4. Narendra KS, Annaswamy AM. 2005. Stable Adaptive Systems. Mineola, NY: Dover (original publication
by Prentice Hall, 1989)
5. Åström KJ, Wittenmark B. 1995. Adaptive Control. Reading, MA: Addison-Wesley. 2nd ed.
6. Ioannou PA, Sun J. 1996. Robust Adaptive Control. Upper Saddle River, NJ: Prentice Hall
7. Sastry S, Bodson M. 1989. Adaptive Control: Stability, Convergence and Robustness. Upper Saddle River, NJ:
Prentice Hall
8. Krstić M, Kanellakopoulos I, Kokotović P. 1995. Nonlinear and Adaptive Control Design. New York: Wiley
9. Tao G. 2003. Adaptive Control Design and Analysis. New York: Wiley
10. Narendra KS, Annaswamy AM. 1987. Persistent excitation in adaptive systems. Int. J. Control 45:127–60
11. Boyd S, Sastry S. 1983. On parameter convergence in adaptive control. Syst. Control Lett. 3:311–19
12. Morgan AP, Narendra KS. 1977. On the uniform asymptotic stability of certain linear nonautonomous
differential equations. SIAM J. Control Optim. 15:5–24
13. Anderson BDO, Johnson CR Jr. 1982. Exponential convergence of adaptive identification and control
algorithms. Automatica 18:1–13
14. Jenkins B, Krupadanam A, Annaswamy AM. 2019. Fast adaptive observers for battery management
systems. IEEE Trans. Control Syst. Technol. 28:776–89
15. Gaudio JE, Annaswamy AM, Bolender MA, Lavretsky E, Gibson TE. 2021. A class of high order tuners
for adaptive systems. IEEE Control Syst. Lett. 5:391–96
16. Lavretsky E, Wise KA. 2013. Robust and Adaptive Control with Aerospace Applications. London: Springer
17. Luders G, Narendra KS. 1974. Stable adaptive schemes for state estimation and identification of linear
systems. IEEE Trans. Autom. Control 19:841–47
18. Lion PM. 1967. Rapid identification of linear and nonlinear systems. AIAA J. 5:1835–42
19. Kreisselmeier G. 1977. Adaptive observers with exponential rate of convergence. IEEE Trans. Autom.
Control 22:2–8
20. Aranovskiy S, Belov A, Ortega R, Barabanov N, Bobtsov A. 2019. Parameter identification of linear
time-invariant systems using dynamic regressor extension and mixing. Int. J. Adapt. Control Signal Process.
33:1016–30
21. Ortega R, Aranovskiy S, Pyrkin A, Astolfi A, Bobtsov A. 2020. New results on parameter estimation via
dynamic regressor extension and mixing: continuous and discrete-time cases. IEEE Trans. Autom. Control
66:2265–72
22. Gaudio JE, Annaswamy AM, Lavretsky E, Bolender MA. 2020. Fast parameter convergence in adaptive
flight control. In AIAA Scitech 2020 Forum, pap. 2020-0594. Reston, VA: Am. Inst. Aeronaut. Astronaut.
23. Kailath T. 1980. Linear Systems. Englewood Cliffs, NJ: Prentice Hall
24. Chen CT. 1984. Linear System Theory and Design. New York: Holt, Rinehart & Winston
25. Feldbaum A. 1960. Dual control theory. I. Avtom. Telemekhanika 21:1240–49
26. Yakubovich VA. 1962. The solution of certain matrix inequalities in automatic control theory. Dokl. Akad.
Nauk 143:1304–7
27. Kalman RE. 1963. Lyapunov functions for the problem of Lur’e in automatic control. PNAS 49:201–5
28. Meyer K. 1965. On the existence of Lyapunov function for the problem of Lur’e. J. Soc. Ind. Appl. Math.
A 3:373–83
29. Lefschetz S. 1965. Stability of Nonlinear Control Systems. New York: Academic
30. Narendra KS, Taylor JH. 1973. Frequency Domain Criteria for Absolute Stability. New York: Academic
31. Fradkov A. 1974. Synthesis of adaptive system of stabilization for linear dynamic plants. Autom. Remote
Control 35:1960–66
32. Anderson BDO, Bitmead RR, Johnson CR Jr., Kokotović PV, Kosut RL, et al. 1986. Stability of Adaptive Systems: Passivity and Averaging Analysis. Cambridge, MA: MIT Press
34. Anderson BDO. 1985. Adaptive systems, lack of persistency of excitation and bursting phenomena.
Automatica 21:247–58
35. Morris A, Fenton T, Nazer Y. 1977. Application of self-tuning regulators to the control of chemical
processes. IFAC Proc. Vol. 10(16):447–55
36. Fortescue T, Kershenbaum LS, Ydstie BE. 1981. Implementation of self-tuning regulators with variable
forgetting factors. Automatica 17:831–35
37. Narendra KS, Annaswamy AM. 1987. A new adaptive law for robust adaptation without persistent
excitation. IEEE Trans. Autom. Control 32:134–45
38. Narendra KS, Annaswamy AM. 1986. Robust adaptive control in the presence of bounded disturbances.
IEEE Trans. Autom. Control 31:306–15
39. Jenkins BM, Annaswamy AM, Lavretsky E, Gibson TE. 2018. Convergence properties of adaptive systems
and the definition of exponential stability. SIAM J. Control Optim. 56:2463–84
40. Kumar PR. 1983. Optimal adaptive control of linear-quadratic-Gaussian systems. SIAM J. Control Optim.
21:163–78
41. Desoer C, Liu R, Auth L. 1965. Linearity versus nonlinearity and asymptotic stability in the large. IEEE
Trans. Circuit Theory 12:117–18
42. Goodwin GC, Ramadge PJ, Caines PE. 1981. Discrete time stochastic adaptive control. SIAM J. Control
Optim. 19:829–53
43. Goodwin GC, Ramadge PJ, Caines PE. 1980. Discrete-time multivariable adaptive control. IEEE Trans.
Autom. Control 25:449–56
44. Becker A, Kumar PR, Wei CZ. 1985. Adaptive control with the stochastic approximation algorithm:
geometry and convergence. IEEE Trans. Autom. Control 30:330–38
45. Stevens BL, Lewis FL. 2003. Aircraft Control and Simulation. Hoboken, NJ: Wiley. 2nd ed.
46. Rohrs C, Valavani L, Athans M, Stein G. 1985. Robustness of continuous-time adaptive control algorithms
in the presence of unmodeled dynamics. IEEE Trans. Autom. Control 30:881–89
47. Marino R, Tomei P. 1993. Global adaptive output-feedback control of nonlinear systems. II. Nonlinear
parameterization. IEEE Trans. Autom. Control 38:33–48
48. Seto D, Annaswamy AM, Baillieul J. 1994. Adaptive control of nonlinear systems with a triangular
structure. IEEE Trans. Autom. Control 39:1411–28
49. Wen C, Hill DJ. 1990. Adaptive linear control of nonlinear systems. IEEE Trans. Autom. Control 35:1253–
57
50. Gusev S. 1988. Linear stabilization of nonlinear systems program motion. Syst. Control Lett. 11:409–12
51. Marino R. 1985. High-gain feedback in non-linear control systems. Int. J. Control 42:1369–85
52. Haddad WM, Chellaboina V, Hayakawa T. 2001. Robust adaptive control for nonlinear uncertain sys-
tems. In Proceedings of the 40th IEEE Conference on Decision and Control, Vol. 2, pp. 1615–20. Piscataway,
NJ: IEEE
53. Fradkov A, Lipkovich M. 2015. Adaptive absolute stability. IFAC-PapersOnLine 48(11):258–63
61. Polycarpou MM. 1996. Stable adaptive neural control scheme for nonlinear systems. IEEE Trans. Autom.
Control 41:447–51
62. Lewis FL, Yesildirek A, Liu K. 1996. Multilayer neural-net robot controller with guaranteed tracking
performance. IEEE Trans. Neural Netw. 7:388–99
63. Chang YC, Roohi N, Gao S. 2019. Neural Lyapunov control. In Advances in Neural Information Processing
Systems 32, ed. H Wallach, H Larochelle, A Beygelzimer, F d’Alché-Buc, E Fox, R Garnett, pp. 3245–54.
Red Hook, NY: Curran
64. Yu SH, Annaswamy AM. 1998. Stable neural controllers for nonlinear dynamic systems. Automatica
34:641–50
65. Bertsekas DP. 2015. Value and policy iterations in optimal control and adaptive dynamic programming.
IEEE Trans. Neural Netw. Learn. Syst. 28:500–9
66. Bertsekas DP. 2017. Dynamic Programming and Optimal Control, Vol. 1. Belmont, MA: Athena Sci.
67. Watkins CJ, Dayan P. 1992. Q-learning. Mach. Learn. 8:279–92
68. Recht B. 2019. A tour of reinforcement learning: the view from continuous control. Annu. Rev. Control
Robot. Auton. Syst. 2:253–79
69. Yu SH, Annaswamy AM. 1996. Neural control for nonlinear dynamic systems. In Advances in Neural
Information Processing Systems 8, ed. D Touretzky, MC Mozer, ME Hasselmo, pp. 1010–16. Cambridge,
MA: MIT Press
70. Lagoudakis MG, Parr R. 2003. Least-squares policy iteration. J. Mach. Learn. Res. 4:1107–49
71. Bradtke SJ, Barto AG. 1996. Linear least-squares algorithms for temporal difference learning. Mach. Learn.
22:33–57
72. Narendra KS, Parthasarathy K. 1991. Gradient methods for the optimization of dynamical systems
containing neural networks. IEEE Trans. Neural Netw. 2:252–62
73. Werbos PJ. 1990. Backpropagation through time: what it does and how to do it. Proc. IEEE 78:1550–60
74. Finn C, Abbeel P, Levine S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks.
In Proceedings of the 34th International Conference on Machine Learning, ed. D Precup, YW Teh,
pp. 1126–35. Proc. Mach. Learn. Res. 70. N.p.: PMLR
75. Schulman J, Levine S, Abbeel P, Jordan M, Moritz P. 2015. Trust region policy optimization. In Proceedings
of the 32nd International Conference on Machine Learning, ed. F Bach, D Blei, pp. 1889–97. Proc. Mach.
Learn. Res. 37. N.p.: PMLR
76. Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M. 2014. Deterministic policy gradient
algorithms. In Proceedings of the 31st International Conference on Machine Learning, ed. EP Xing, T Jebara,
pp. 387–95. Proc. Mach. Learn. Res. 32. N.p.: PMLR
77. Zhao W, Queralta JP, Westerlund T. 2020. Sim-to-real transfer in deep reinforcement learning for
robotics: a survey. In 2020 IEEE Symposium Series on Computational Intelligence, pp. 737–44. Piscataway,
NJ: IEEE
78. Chebotar Y, Handa A, Makoviychuk V, Macklin M, Issac J, et al. 2019. Closing the sim-to-real loop:
adapting simulation randomization with real world experience. In 2019 International Conference on Robotics
and Automation, pp. 8973–79. Piscataway, NJ: IEEE
79. Campi MC, Kumar PR. 1998. Adaptive linear quadratic Gaussian control: the cost-biased approach
revisited. SIAM J. Control Optim. 36:1890–907
80. Borkar V, Varaiya P. 1979. Adaptive control of Markov chains, I: finite parameter set. IEEE Trans. Autom.
Control 24:953–57
81. Campi MC, Kumar PR. 1996. Optimal adaptive control of an LQG system. In Proceedings of 35th IEEE
Conference on Decision and Control, Vol. 1, pp. 349–53. Piscataway, NJ: IEEE
82. Guo L, Chen HF. 1991. The Åström–Wittenmark self-tuning regulator revisited and ELS-based adaptive
trackers. IEEE Trans. Autom. Control 36:802–12
83. Guo L. 1995. Convergence and logarithm laws of self-tuning regulators. Automatica 31:435–50
84. Duncan T, Guo L, Pasik-Duncan B. 1999. Adaptive continuous-time linear quadratic Gaussian control.
IEEE Trans. Autom. Control 44:1653–62
85. Dean S, Tu S, Matni N, Recht B. 2018. Safely learning to control the constrained linear quadratic regulator