1. INTRODUCTION
Reinforcement learning (RL) is the subfield of machine learning that studies how to use past data
to enhance the future manipulation of a dynamical system. A control engineer might be puzzled
by such a definition and interject that this is precisely the scope of control theory. That the RL and
control communities remain practically disjoint has led to the codevelopment of vastly different
approaches to the same problems. However, it should be impossible for a control engineer not to
be impressed by the recent successes of the RL community, such as solving Go (1).
Indeed, given this dramatic recent progress in RL, a tremendous opportunity lies in deploy-
ing its data-driven systems in more demanding interactive tasks, including self-driving vehicles,
distributed sensor networks, and agile robotic systems. For RL to expand into such technologies,
however, the methods must be both safe and reliable—the failure of such systems can have severe
societal and economic consequences, including the loss of human life. How can we guarantee that
our new data-driven automated systems are robust? These types of reliability concerns are at the
core of control engineering, and RL practitioners might be able to make their methods robust by
applying appropriate control tools for engineering systems to match prescribed safety guarantees.
This survey aims to provide a language for the control and reinforcement learning communities
to begin communicating, highlighting what each can learn from the other. Control is the theory of
designing complex actions from well-specified models, while reinforcement learning often makes
intricate, model-free predictions from data alone. Yet both RL and control aim to design systems
that use richly structured perception, perform planning and control that adequately adapt to en-
vironmental changes, and exploit safeguards when surprised by a new scenario. Understanding
how to properly analyze, predict, and certify such systems requires insights from current machine
learning practice and from the applied mathematics of optimization, statistics, and control theory.
With a focus on problems in continuous control, I will try to disentangle the similarities and dif-
ferences of methods of the complementary perspectives and present a set of challenging problems
whose solution will require significant input from both sets of practitioners.
I focus first on casting RL problems in an optimization framework, establishing the sorts of
methodological tools brought to bear in contemporary RL. I then lay out the main solution tech-
niques of RL, including the dichotomy between the model-free and model-based methodologies.
Next, I try to put RL and control techniques on the same footing through a case study of the
linear quadratic regulator (LQR) with unknown dynamics. This baseline will illuminate the vari-
ous trade-offs associated with techniques from RL and control. In particular, we will see that the
so-called model-free methods popular in deep RL are considerably less effective in both theory
and practice than simple model-based schemes when applied to LQR. Perhaps surprisingly, I also
show cases where these observations continue to hold on more challenging nonlinear applications.
I then argue that model-free and model-based perspectives can be unified, combining their rel-
ative merits. This leads to a concluding discussion of some of the challenges at the interface of
control and learning that must be solved before we can build robust, safe learning systems that
interact with an uncertain physical environment, which will surely require tools from both the
machine learning and control communities.
In the classic optimal control problem, we begin with a dynamical system governed by the
difference equation xt+1 = ft (xt , ut , et ), where xt is the state of the system, ut is the control action,
and et is a random disturbance; ft is the rule that maps the current state, control action, and
disturbance at time t to a new state. Assume that at every time, we receive some reward R(xt , ut ) for
our current xt and ut . The goal is to maximize this reward. In terms of mathematical optimization,
we aim to solve the problem
\[
\begin{aligned}
\text{maximize} \quad & \mathbb{E}_{e_t}\!\left[\sum_{t=0}^{N} R_t(x_t, u_t)\right] \\
\text{subject to} \quad & x_{t+1} = f_t(x_t, u_t, e_t) \\
& (x_0 \ \text{given}).
\end{aligned}
\tag{1}
\]
That is, we aim to maximize the expected reward over N time steps with respect to the control
sequence ut , subject to the dynamics specified by the state-transition rule ft . The expected value
is over the disturbance and assumes that ut is to be chosen having seen only the states x0 through
xt and previous inputs u0 through ut−1 . Rt is the reward gained at each time step and is determined
by the state and control action. Note that xt is not really a decision variable in the optimization
problem; it is determined entirely by the previous state, control action, and disturbance. I will refer
to a trajectory, τt , as a sequence of states and control actions generated by a dynamical system:
\[
\tau_t = (u_1, \dots, u_{t-1}, x_0, \dots, x_t).
\tag{2}
\]
Since the dynamics are stochastic, the optimal control problem typically allows a controller
to observe the state before deciding upon the next action (2). This allows a controller to contin-
ually mitigate uncertainty through feedback. Hence, rather than optimizing over deterministic
sequences of controls ut , we instead optimize over policies. A control policy (or simply “a policy”)
is a function, π , that takes a trajectory from a dynamical system and outputs a new control action.
Note that π has access only to previous states and control actions.
To slightly lower the notational burden, I will work with the time-invariant version of
Problem 1, assuming that the dynamical update rule is constant over time and that the rewards
for state–action pairs are also constant:
\[
\begin{aligned}
\text{maximize} \quad & \mathbb{E}_{e_t}\!\left[\sum_{t=0}^{N} R(x_t, u_t)\right] \\
\text{subject to} \quad & x_{t+1} = f(x_t, u_t, e_t), \quad u_t = \pi_t(\tau_t) \\
& (x_0 \ \text{given}).
\end{aligned}
\tag{3}
\]
In RL, we do not know the transition function f, and we can interact with the system only by running experiments: We select a policy, execute it for L time steps, and record the states visited and rewards received. This interaction protocol is our oracle model and is called episodic RL (see, for example, chapter 3 of Reference 3, chapter 2 of Reference 4, or Reference 5). We want the expected reward to be high for our derived policy, but we also need the number of oracle queries to be small.
This oracle model is considerably more complicated than those typically considered in oracle
models for optimization (6). Each episode returns a complex feedback signal of states and rewards.
What is the best way to tie this information together in order to improve performance? What is
the best way to query and probe a system to achieve high-quality control with as few interventions
as possible? Here, “best” is also not clearly defined. Do we decide an algorithm is best if it crosses
some reward threshold in the fewest number of samples? Or is it best if it achieves the highest
reward given a fixed budget of samples? Or maybe there is a middle ground? This oracle provides
a rich and complex model for interacting with a system and brings with it considerably more
complexity than exists in standard stochastic optimization settings. What is the most efficient way
to use all of the collected data in order to improve future performance?
¹To achieve notational consistency, I am adopting the control-centric notation of denoting state–action pairs as (x, u) rather than (s, a), as is commonly used in RL.
Model-free methods aim to solve optimal control problems only by probing the system and improving strategies based on
past rewards and states. Many researchers argue for algorithms that can innately learn to control
without access to the complex details required to simulate a dynamical system. They argue that
it is often easier to find a policy for a task than it is to fit a general-purpose model of the system
dynamics (see, for example, the discussion in chapter 3 of Reference 16). Model-free methods
are primarily divided into two approaches: policy search and approximate dynamic programming.
Policy search directly searches for policies by using data from previous episodes in order to im-
prove the reward. Approximate dynamic programming uses Bellman’s principle of optimality to
approximate Problem 3 using previously observed data.
Throughout, my aim will be to highlight the main conceptual ideas of different approaches and
to avoid embroiling myself in a thorough discussion of the myriad of technical details required to
make all of the statements crisply precise. What is important is that all of the approaches surveyed
reduce to some sort of function fitting from noisy observations of the dynamical system, though
performance can be drastically different depending on how you fit this function. In model-based
RL, we fit a model of the state transitions to best match observed trajectories. In approximate
dynamic programming, we estimate a function that best characterizes the cost to go for exper-
imentally observed states. And in direct policy search, we attempt to find a policy that directly
maximizes the optimal control problem using only input–output data. The main questions are
which of these approaches makes the best use of samples and how quickly the derived policies
converge to optimality.
The model-based approach posits a parametric model of the state transitions,
\[
x_{t+1} \approx \varphi(x_t, u_t),
\tag{4}
\]
where ϕ is some model aiming to approximate the true dynamics; ϕ might arise from a first-
principles physical model or might be a nonparametric approximation by a neural network. The
state-transition function can then be fit using supervised learning. For instance, a model can be fit
by solving the least squares problem
\[
\text{minimize}_{\varphi} \ \sum_{t=0}^{N-1} \|x_{t+1} - \varphi(x_t, u_t)\|^2.
\]
Let ϕ̂ denote the function fit to the collected data to model the dynamics. Let ωt denote a
random variable that we will use as a model for the noise process. With such a point estimate for
the model, we might solve the optimal control problem
\[
\begin{aligned}
\text{maximize} \quad & \mathbb{E}_{\omega_t}\!\left[\sum_{t=0}^{N} R(x_t, u_t)\right] \\
\text{subject to} \quad & x_{t+1} = \hat{\varphi}(x_t, u_t) + \omega_t, \quad u_t = \pi_t(\tau_t).
\end{aligned}
\]
In this case, we are solving the wrong problem to get our control policies πt . Not only is the model
incorrect, but this formulation requires some plausible model of the noise process. But if ϕ̂ and f
are close, this approach might work well in practice.
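As a concrete illustration of this fit-then-plan recipe, the following sketch fits a transition model by least squares on observed (x_t, u_t, x_{t+1}) triples and then plans against the point estimate. Every detail here (the toy simulator, the linear-in-(x, u) model class, and the shooting-based planner) is an illustrative assumption, not a procedure taken from the text.

import numpy as np

rng = np.random.default_rng(0)

def true_dynamics(x, u):
    # Hypothetical unknown system used only to generate data for this sketch.
    return np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u[0]]) + 0.01 * rng.standard_normal(2)

# Collect transitions under random excitation.
X, Y = [], []
x = np.zeros(2)
for _ in range(200):
    u = rng.standard_normal(1)
    x_next = true_dynamics(x, u)
    X.append(np.concatenate([x, u])); Y.append(x_next)
    x = x_next

# Least squares fit of a linear-in-(x, u) transition model phi_hat.
W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
phi_hat = lambda x, u: np.concatenate([x, u]) @ W

# Certainty-equivalent planning by random shooting against the fitted model.
def plan_first_action(x0, horizon=10, candidates=500):
    best_u, best_cost = None, np.inf
    for _ in range(candidates):
        us = rng.standard_normal((horizon, 1))
        xc, cost = x0, 0.0
        for u in us:
            xc = phi_hat(xc, u)
            cost += xc @ xc + 0.1 * u @ u      # stand-in for the negative reward
        if cost < best_cost:
            best_u, best_cost = us[0], cost
    return best_u

print(plan_first_action(np.array([1.0, 0.0])))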
The Q-function determines the value of the optimal control problem that is attained when the first
action is set to be u and the initial condition is x. Note that it then trivially follows that the optimal
value of Problem 3 is maxu Q(x0 , u), and the optimal policy is π (x0 ) = arg maxu Q(x0 , u). If we had
access to the Q-function, we would have everything we need to know to take the first step in the
optimal control problem. We can use dynamic programming to compute this Q-function and the
Q-function associated with every subsequent action. That is, we define the terminal Q-function
to be
\[
Q_N(x, u) = R(x, u),
\]
and for earlier times we define the Q-functions via the backward recursion
\[
Q_k(x, u) = R(x, u) + \mathbb{E}_{e_k}\!\left[\max_{u'} Q_{k+1}\big(f(x, u, e_k), u'\big)\right].
\tag{6}
\]
This is the dynamic programming algorithm in a nutshell: We can recursively define the Q-functions
by passing backward in time, and then compute the optimal controls from any starting x0 by
applying the policy that maximizes the right-hand side of Equation 6 at each time step. Equation 6
is known as Bellman’s equation. Note that for all time, the optimal policy is uk = arg maxu Qk (xk , u)
and depends only on the current state.
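For a small tabular problem with deterministic transitions (so the expectation over e_k is trivial), the backward recursion of Equation 6 takes only a few lines. The transition and reward tables below are made up purely for illustration.

import numpy as np

n_states, n_actions, N = 4, 2, 5
rng = np.random.default_rng(1)
f = rng.integers(0, n_states, size=(n_states, n_actions))   # f[x, u] = next state
R = rng.standard_normal((n_states, n_actions))               # R[x, u]

Q = np.zeros((N + 1, n_states, n_actions))
Q[N] = R                                                     # terminal Q-function
for k in range(N - 1, -1, -1):                               # pass backward in time
    V_next = Q[k + 1].max(axis=1)                            # max over u' of Q_{k+1}(x', u')
    Q[k] = R + V_next[f]                                     # Equation 6, noiseless case

# The optimal first action from state x0 maximizes Q[0, x0, :].
x0 = 0
print("optimal first action:", Q[0, x0].argmax())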
Approximate dynamic programming methods typically try to compute these action-value func-
tions from data. They do so by assuming that the Q-function is stationary [i.e., Qk (x, u) = Q(x, u)
for all k and some function Q]. Such stationarity indeed arises assuming the time horizon is infinite.
Consider the limit
\[
\text{maximize} \ \lim_{N \to \infty} \mathbb{E}_{e_t}\!\left[\frac{1}{N}\sum_{t=0}^{N} R(x_t, u_t)\right],
\tag{7}
\]
and define the Q-function Q(x0, u0) to be the average reward accrued running from state x0 with initial action u0.
with initial action u0 . Unfortunately, Equation 7 is not directly amenable to dynamic program-
ming without introducing further technicalities. For mathematical convenience and to connect to
common practice in RL, it is useful to instead consider the discounted reward problem
\[
\begin{aligned}
\text{maximize} \quad & (1-\gamma)\,\mathbb{E}_{e_t}\!\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, u_t)\right] \\
\text{subject to} \quad & x_{t+1} = f(x_t, u_t, e_t), \quad u_t = \pi_t(\tau_t) \\
& (x_0 \ \text{given}),
\end{aligned}
\tag{8}
\]
where γ is a scalar in (0, 1) called the discount factor. For γ close to 1, the discounted reward is
approximately equal to the average reward (16). The discounted cost has particularly clean opti-
mality conditions that make it amenable to estimation. If we define Qγ (x, u) to be the Q-function
obtained from solving Problem 8 with initial condition x, then we have a discounted version of
dynamic programming, now with the same Q-functions on the left- and right-hand sides:
\[
Q_\gamma(x, u) = R(x, u) + \gamma\, \mathbb{E}_{e}\!\left[\max_{u'} Q_\gamma\big(f(x, u, e), u'\big)\right],
\tag{9}
\]
and the associated policy is to choose u_t = arg max_u Q_γ(x_t, u) at every time.
This is a remarkably simple formula, which is part of what makes Q-learning methods so attractive.
We can try to solve for the Q-function using stochastic approximation. If we draw a sam-
ple trajectory using the policy given by Equation 9, then we should have (approximately and in
expectation)
\[
Q_\gamma(x_k, u_k) \approx R(x_k, u_k) + \gamma \max_{u'} Q_\gamma(x_{k+1}, u').
\]
Thus, beginning with some initial guess Q_γ^{(old)} for the Q-function, we can update
\[
Q_\gamma^{(\mathrm{new})}(x_k, u_k) = (1-\eta)\, Q_\gamma^{(\mathrm{old})}(x_k, u_k) + \eta\left( R(x_k, u_k) + \gamma \max_{u'} Q_\gamma^{(\mathrm{old})}(x_{k+1}, u') \right),
\tag{10}
\]
where η is a step size or learning rate. Equation 10 forms the basis of Q-learning algorithms (20, 21).
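A bare-bones tabular sketch of the update in Equation 10 might look as follows; the toy environment and the epsilon-greedy exploration rule are illustrative assumptions rather than anything prescribed above.

import numpy as np

n_states, n_actions = 10, 3
gamma, eta, eps = 0.9, 0.1, 0.2
rng = np.random.default_rng(2)
P = rng.integers(0, n_states, size=(n_states, n_actions))   # deterministic toy dynamics
R = rng.random((n_states, n_actions))                        # toy rewards

Q = np.zeros((n_states, n_actions))
x = 0
for _ in range(20000):
    # epsilon-greedy action from the current Q estimate
    u = rng.integers(n_actions) if rng.random() < eps else Q[x].argmax()
    x_next = P[x, u]
    # Equation 10: blend the old estimate with the one-step bootstrapped target.
    target = R[x, u] + gamma * Q[x_next].max()
    Q[x, u] = (1 - eta) * Q[x, u] + eta * target
    x = x_next

print(Q.round(2))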
Methods for estimating value functions are also widely used in RL and developed through
the perspective of estimation and stochastic approximation. In particular, temporal difference
methods (22–26) estimate the value function of a fixed policy from sampled trajectories by stochastic approximation, using the discrepancy between successive predictions as the error signal.
3.3.1. A generic algorithm for sampling to optimize. Let us begin with a review of a general
paradigm for leveraging random sampling to solve optimization problems. Consider the general
unconstrained optimization problem
\[
\text{maximize}_{z} \ R(z).
\tag{11}
\]
Any optimization problem like this is equivalent to an optimization over probability distributions
on z:
\[
\text{maximize}_{p(z)} \ \mathbb{E}_{p}[R(z)].
\]
If z⋆ is the optimal solution, then we get the same value if we put a δ-function around z⋆. Moreover,
if p is a probability distribution, it is clear that the expected value of the reward function can never
be larger than the maximal reward achievable by a fixed z. So we can either optimize over z or
optimize over distributions over z.
Since optimizing over the space of all probability densities is intractable, we must restrict the
class of densities over which we optimize. For example, we can consider a family parameterized by a parameter vector ϑ, {p(z; ϑ)}, and attempt to solve
\[
\text{maximize}_{\vartheta} \ \mathbb{E}_{p(z;\vartheta)}[R(z)].
\tag{12}
\]
If this family of distributions contains all of the δ-functions, then the optimal value will coincide
with the nonrandom optimization problem. But if the family does not contain δ-functions, the
resulting optimization problem only provides a lower bound on the optimal value no matter how
good of a probability distribution we find.
That said, this reparameterization provides a powerful and general algorithmic framework for
optimization. In particular, we can compute the derivative of J(ϑ) := E_{p(z;ϑ)}[R(z)] using the following calculation (called the log-likelihood trick):
\[
\nabla_{\vartheta} J(\vartheta)
= \int R(z)\, \nabla_{\vartheta} p(z; \vartheta)\, \mathrm{d}z
= \int R(z)\, \frac{\nabla_{\vartheta} p(z; \vartheta)}{p(z; \vartheta)}\, p(z; \vartheta)\, \mathrm{d}z
= \mathbb{E}_{p(z;\vartheta)}\!\big[ R(z)\, \nabla_{\vartheta} \log p(z; \vartheta) \big].
\tag{13}
\]
This derivation reveals that the gradient of J with respect to ϑ is the expected value of the function
\[
G(z, \vartheta) = R(z)\, \nabla_{\vartheta} \log p(z; \vartheta).
\]
Hence, if we sample z from the distribution defined by p(z; ϑ ), we can compute G(z, ϑ ) and will
have an unbiased estimate of the gradient of J. We can follow this direction and will be running
stochastic gradient descent on J, defining Algorithm 1.
Algorithm 1 (REINFORCE).
Hyperparameters: step sizes α j > 0.
Initialize: ϑ0 and k = 0.
while ending condition not satisfied do
Sample zk ∼ p(z; ϑk ).
Set ϑk+1 = ϑk + αk R(zk )∇ϑ log p(zk ; ϑk ).
k ← k + 1.
end while
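As a sanity check of how little Algorithm 1 needs, here is a sketch for the special case in which p(z; ϑ) is a Gaussian with mean ϑ and fixed covariance σ²I; the toy reward, the value of σ, and the step-size schedule are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
d, sigma = 5, 0.5
R = lambda z: -np.sum(z ** 2)                    # toy reward to maximize

theta = rng.standard_normal(d)
for k in range(2000):
    z = theta + sigma * rng.standard_normal(d)   # sample z_k from p(z; theta_k)
    # For a Gaussian mean parameter, grad_theta log p(z; theta) = (z - theta) / sigma^2.
    grad_log_p = (z - theta) / sigma ** 2
    alpha = 0.01 / (1 + k / 500)                 # diminishing step sizes
    theta = theta + alpha * R(z) * grad_log_p    # REINFORCE step

print(np.round(theta, 3))                        # should drift toward the maximizer at 0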
Algorithm 1 is typically called REINFORCE (27), and its main appeal is that it is trivial to
implement. If you can efficiently sample from p(z; ϑ ), you can run this algorithm on essentially any
problem. But such generality comes with a significant cost. The algorithm operates on stochastic
gradients of the sampling distribution, but the function we cared about optimizing (R) is accessed
only through function evaluations. Direct search methods that use the log-likelihood trick are
necessarily derivative-free optimization methods and, in turn, are necessarily less effective than
methods that compute actual gradients, especially when the function evaluations are noisy (28).
Another significant concern is that the choice of distribution can lead to very high variance in the
stochastic gradients. Such high variance in turn implies that many samples need to be drawn to
find a stationary point.
That said, the ease of implementation should not be readily discounted. Direct search methods
are trivial to implement, and oftentimes reasonable results can be achieved with considerably less
effort than custom solvers tailored to the structure of the optimization problem. There are two
primary ways that this sort of stochastic search arises in RL: policy gradient and pure random
search.
3.3.2. Policy gradient. As seen from Bellman’s equation, the optimal policy for Problem 3 is
always deterministic. Nonetheless, the main idea behind policy gradient is to use probabilistic
policies. Probabilistic policies are optimal for other optimization-based control problems, such as
control of partially observed Markov decision processes (29, 30) or in zero-sum games. Hence,
exploring their value for the RL problems studied in this survey does not appear too outlandish
at first glance.
We fix our attention on parametric, randomized policies such that ut is sampled from a dis-
tribution p(u|τt ; ϑ ) that is a function only of the currently observed trajectory and a parameter
vector ϑ. A probabilistic policy induces a probability distribution over trajectories:
\[
p(\tau; \vartheta) = \prod_{t=0}^{L-1} p(x_{t+1} \mid x_t, u_t)\, p(u_t \mid \tau_t; \vartheta).
\tag{14}
\]
Moreover, we can overload the notation and define the reward of a trajectory to be
\[
R(\tau) = \sum_{t=0}^{N} R_t(x_t, u_t).
\]
Then our optimization problem for RL tidily takes the form of Equation 12. Policy gradient
thus proceeds by sampling a trajectory using the probabilistic policy with parameters ϑk and then
updating using REINFORCE.
Using the log-likelihood trick and Equation 14, it is straightforward to verify that the gradient
of J with respect to ϑ is not an explicit function of the underlying dynamics. However, at this
point this should not be surprising: By shifting to distributions over policies, we push the burden
of optimization onto the sampling procedure.
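To spell out the verification, take the logarithm of Equation 14 and differentiate; the transition terms do not depend on ϑ and drop out:
\[
\nabla_{\vartheta} \log p(\tau; \vartheta)
= \nabla_{\vartheta} \sum_{t=0}^{L-1} \Big[\log p(x_{t+1} \mid x_t, u_t) + \log p(u_t \mid \tau_t; \vartheta)\Big]
= \sum_{t=0}^{L-1} \nabla_{\vartheta} \log p(u_t \mid \tau_t; \vartheta).
\]
Hence the REINFORCE update R(τ)∇_ϑ log p(τ; ϑ) can be computed from the policy's log-probabilities alone.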
3.3.3. Pure random search. An older and more widely applied method to solve Problem 11
is to directly perturb the current decision variable z by random noise and then update the model
based on the received reward at this perturbed value. That is, we apply Algorithm 1 with sampling
distribution p(z; ϑ ) = p0 (z − ϑ ) for some distribution p0 . The simplest examples for p0 would be
the uniform distribution on a sphere or a normal distribution. Perhaps less surprisingly here, RE-
INFORCE can again be run without any knowledge of the underlying dynamics. Note that in this
case, the REINFORCE algorithm has a simple interpretation in terms of gradient approximation.
Indeed, REINFORCE is equivalent to approximate gradient ascent of R with the estimator
\[
g_\sigma(\vartheta) = \frac{R(\vartheta + \sigma \varepsilon) - R(\vartheta - \sigma \varepsilon)}{2\sigma}\, \varepsilon,
\]
where ε is the sampled perturbation direction. This update says to compute a finite-difference approximation to the gradient along the direction ε and move along the gradient. One can reduce the variance of such a finite-difference estimate
by sampling along multiple directions and averaging:
\[
g^{(m)}_\sigma(\vartheta) = \frac{1}{m}\sum_{i=1}^{m} \frac{R(\vartheta + \sigma \varepsilon_i) - R(\vartheta - \sigma \varepsilon_i)}{2\sigma}\, \varepsilon_i.
\]
This is akin to approximating the gradient in the random subspace spanned by the ε_i.
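The following sketch runs this averaged two-point estimator on a toy quadratic reward; the dimension, the smoothing radius σ, the number of directions m, and the step size are all arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
d, sigma, m, alpha = 10, 0.1, 8, 0.02
R = lambda theta: -np.sum(theta ** 2)            # toy reward to maximize

theta = rng.standard_normal(d)
for _ in range(500):
    eps = rng.standard_normal((m, d))            # m random search directions
    g = np.zeros(d)
    for e in eps:
        # two-point finite-difference estimate along direction e
        g += (R(theta + sigma * e) - R(theta - sigma * e)) / (2 * sigma) * e
    theta = theta + alpha * g / m                # ascend the averaged estimate

print(np.round(theta, 3))                        # drifts toward the maximizer at 0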
This particular algorithm and its generalizations go by many different names. Probably the
earliest proposal for this method was made by Rastrigin (31). In an unexpected historical surprise,
Rastrigin initially developed this method to solve RL problems! His main motivating example
was an inverted pendulum. A rigorous analysis using contemporary techniques was provided by
Nesterov & Spokoiny (32). Random search was also discovered by the evolutionary algorithms
community, where it is called the (μ, λ)-evolution strategy (33, 34). Random search has also been
studied in the context of stochastic approximation (35) and bandits (36, 37). Algorithms that are
invented independently by four different communities probably have something good going for
them.
The random search method is considerably simpler than the policy gradient algorithm, but it
uses much less structure from the problem as well. Since RL problems tend to be nonconvex, it
is not clear which of these approaches is better unless we focus on specific instances. In light of
this, in the next section we turn to a set of instances where we may be able to glean more insights
about the relative merits of all of the approaches to RL covered in this section.
The baseline I will use throughout the rest of this survey is the problem of the linear quadratic regulator:
\[
\begin{aligned}
\text{minimize} \quad & \mathbb{E}_{e_t}\!\left[\frac{1}{2}\sum_{t=0}^{N} \big(x_t^\mathsf{T} Q x_t + u_t^\mathsf{T} R u_t\big) + \frac{1}{2} x_{N+1}^\mathsf{T} S x_{N+1}\right] \\
\text{subject to} \quad & x_{t+1} = A x_t + B u_t + e_t, \quad u_t = \pi_t(\tau_t) \\
& (x_0 \ \text{given}).
\end{aligned}
\tag{15}
\]
Here, Q, R, and S are positive semidefinite matrices. Note that we have switched to minimization
from maximization, as is conventional in optimal control. The state transitions are governed by a
linear update rule with A and B appropriately sized matrices.
A few words are in order to defend this baseline as instructive for general problems in contin-
uous control and RL. Though linear dynamics are somewhat restrictive, many systems are linear
over the range we would like them to operate. Indeed, enormous engineering effort goes into
designing systems so that their responses are as close to linear as possible. From an optimization
perspective, linear dynamics are the only class where we are guaranteed that our constraint set is
convex, which is another appealing feature for analysis.
What about cost functions? Whereas dynamics are typically handed to the engineer, cost func-
tions are completely at their discretion. Designing and refining cost functions are part of optimal
control design, and different characteristics can be extracted by iteratively refining cost functions
to meet specifications. This is no different in machine learning, where, for example, combinatorial
losses in classification are replaced with smooth losses like logistic or squared loss. Designing cost
functions is a major challenge and tends to be an art form in engineering. But since we are design-
ing our cost functions, we should focus our attention on costs that are easier to solve. Quadratic
cost is particularly attractive not only because it is convex, but also because of how it interacts with
noise. The cost of the stochastic problem is equal to that of the noiseless problem plus a constant
that is independent of the choice of ut . The noise will degrade the achievable cost, but it will not
affect how control actions are chosen.
Note that when the parameters of the dynamical system are known, the standard LQR problem
admits an elegant dynamic programming solution (38). The control action is a linear function of
the state
ut = −Kt xt
for some matrix Kt that can be computed via a simple linear algebraic recursion with only knowl-
edge of A, B, Q, and R.
In the limit as the time horizon tends to infinity, the optimal control policy is static, linear state feedback:
\[
u_t = -K x_t, \qquad K = (R + B^\mathsf{T} M B)^{-1} B^\mathsf{T} M A.
\]
That is, for LQR on an infinite time horizon, π_t(x_t) = −K x_t. Here, M is the unique solution of the discrete algebraic Riccati equation
\[
M = Q + A^\mathsf{T} M A - A^\mathsf{T} M B\,(R + B^\mathsf{T} M B)^{-1} B^\mathsf{T} M A
\tag{16}
\]
for which all of the eigenvalues of A − BK have magnitude less than 1. Finding this specific solution is relatively easy using standard linear algebraic techniques (38).
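For reference, one standard way to compute K is to iterate the Riccati recursion until M converges; the sketch below does this for an example system whose matrices are chosen only for illustration (a production implementation would call a dedicated Riccati solver).

import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    # Iterate M <- Q + A'MA - A'MB (R + B'MB)^{-1} B'MA until (approximate) convergence.
    M = Q.copy()
    for _ in range(iters):
        BtMB = R + B.T @ M @ B
        M = Q + A.T @ M @ A - A.T @ M @ B @ np.linalg.solve(BtMB, B.T @ M @ A)
    K = np.linalg.solve(R + B.T @ M @ B, B.T @ M @ A)
    return K, M

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

K, M = lqr_gain(A, B, Q, R)
print("K =", K)
print("closed-loop eigenvalue magnitudes:", np.abs(np.linalg.eigvals(A - B @ K)))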
There are a variety of ways to derive these formulae. In particular, one can use dynamic pro-
gramming as in Section 3.2. In this case, one can check that the Q-function on a finite time horizon
satisfies a recursion of the form
\[
Q_k(x, u) = x^\mathsf{T} Q x + u^\mathsf{T} R u + \mathbb{E}_{e}\!\left[(A x + B u + e)^\mathsf{T} M_{k+1} (A x + B u + e)\right]
\]
for some positive definite matrix M_{k+1}. The limits of these matrices are the solution of Equation 16.
Though LQR cannot capture every interesting optimal control problem, it has many of the
salient features of the generic optimal control problem. Dynamic programming recursion lets us
compute the control actions efficiently, and for long time horizons, a static policy is nearly optimal.
Now the main question to consider in the context of RL: What happens when we do not know
A and B? What is the right way to interact with the dynamical system in order to quickly and
efficiently get it under control? Let us now dive into the different styles of RL and connect them
to ideas in control, using LQR as a guiding baseline.
The most direct strategy is to estimate the state-transition matrices from data: Excite the system with some input sequence, record the resulting states, and fit A and B by least squares,
\[
\text{minimize}_{A,B} \ \sum_{t=0}^{N-1} \|x_{t+1} - A x_t - B u_t\|^2.
\]
How well do these model estimates work for the LQR problem? Suppose we treat the estimates
as true and use them to compute a state feedback control from a Riccati equation. While we might
expect this to work well in practice, how can we verify the performance? As a simple case, suppose
that the true dynamics are slightly unstable, so that A has at least one eigenvalue of magnitude
larger than 1. It is fully possible that the least squares estimate of one of the diagonal entries will
be less than 1, and, consequently, the optimal control strategy using the estimate will fail to account
for the poorly estimated unstable mode. How can we include the knowledge that our model is just
an estimate and not accurate with a small sample count? One possible solution is to use tools from
robust control to mitigate this uncertainty.
My collaborators and I have been considering an approach to merge robust control and high-
dimensional statistics dubbed coarse-ID control. The general framework consists of the following
three steps:
1. Use supervised learning to learn a coarse model of the dynamical system to be controlled. I
will refer to the system estimate as the nominal system.
2. Using either prior knowledge or statistical tools like the bootstrap, build probabilistic guar-
antees about the distance between the nominal system and the true, unknown dynamics.
3. Solve a robust optimization problem that optimizes control of the nominal system
while penalizing signals with respect to the estimated uncertainty, ensuring stable, robust
execution.
As long as the true system behavior lies in the estimated uncertainty set, we are guaranteed to
find a performant controller. The key here is that we are using machine learning to identify not
only the plant to be controlled, but also the uncertainty. Indeed, the main advances in the past two
decades of estimation theory consist of providing reasonable estimates of such uncertainty sets
with guaranteed bounds on their errors as a function of the number of observed samples. Taking
these new tools and merging them with old and new ideas from robust control allows us to bound
the end-to-end performance of a controller in terms of the number of observations.
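A sketch of step 2 under one illustrative choice, a parametric bootstrap: refit the least squares estimate on data simulated from the nominal model and use the spread of the refits as a proxy for the estimation error. The system matrices, trajectory length, and confidence level below are assumptions made only for the example.

import numpy as np

rng = np.random.default_rng(5)

def fit_AB(X, U, Xnext):
    Z = np.hstack([X, U])                              # regressors [x_t, u_t]
    W, *_ = np.linalg.lstsq(Z, Xnext, rcond=None)
    d = X.shape[1]
    return W[:d].T, W[d:].T                            # A_hat, B_hat

def rollout(A, B, T, noise_std=0.01):
    d, p = B.shape
    X = np.zeros((T, d)); U = rng.standard_normal((T, p)); Xn = np.zeros((T, d))
    x = np.zeros(d)
    for t in range(T):
        X[t] = x
        x = A @ x + B @ U[t] + noise_std * rng.standard_normal(d)
        Xn[t] = x
    return X, U, Xn

A_true = np.array([[1.01, 0.0], [0.0, 0.9]]); B_true = np.eye(2)
X, U, Xn = rollout(A_true, B_true, T=200)
A_hat, B_hat = fit_AB(X, U, Xn)

# Bootstrap: simulate from the nominal model, refit, and record the spread.
errs = []
for _ in range(100):
    Xb, Ub, Xnb = rollout(A_hat, B_hat, T=200)
    A_b, B_b = fit_AB(Xb, Ub, Xnb)
    errs.append(np.linalg.norm(A_b - A_hat, 2))
print("bootstrap 95th-percentile proxy for the A-estimation error:", np.quantile(errs, 0.95))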
The coarse-ID procedure is well illustrated through the case study of LQR (39). We can guar-
antee the accuracy of the least squares estimates for A and B using novel probabilistic analysis (40).
With the estimate of model error in hand, one can pose a robust variant of the standard LQR op-
timal control problem that computes a robustly stabilizing controller seeking to minimize the
worst-case performance of the system given the (high-probability) norm bounds on our modeling
errors.
To design a good control policy, we here turn to state-of-the-art tools from robust control. We
leverage the recently developed system-level synthesis framework (41, 42) to solve this robust op-
timization problem. System-level synthesis lifts the system description into a higher-dimensional
space that enables efficient search for controllers. The proposed approach provides nonasymptotic
bounds that guarantee finite performance on the infinite time horizon and quantitatively bound
the gap between the computed solution and the true optimal controller.
Suppose in LQR that we have a state dimension d and a control dimension p. Denote the minimum cost achievable by the optimal controller as J⋆. Our analysis guarantees that, after observing a trajectory of length T, we can design a controller that will have infinite-time-horizon cost Ĵ with
\[
\frac{\hat{J} - J_\star}{J_\star} = \tilde{O}\!\left(\sqrt{\frac{d+p}{T}}\right).
\]
Here, the notation Õ(·) suppresses logarithmic factors and instance-dependent constants. In
particular, we can guarantee that we stabilize the system after seeing only a finite amount of data.
Where coarse-ID control differs from nominal control is that it explicitly accounts for the
uncertainty in the least squares point estimate. By appending this uncertainty to the original LQR
optimization problem, we can circumvent the need to study perturbations of Riccati equations.
Moreover, since the approach is optimization based, it can be readily applied to other optimal
control problems beyond the LQR baseline.
With regard to direct search methods, we can already see variance issues enter the picture even
for small LQR instances. Consider the most trivial example of LQR, in which there are no dynamics and we simply want to minimize the cost R(u) = ‖u‖². Suppose we use a policy gradient method with Gaussian policies, sampling u from a normal distribution with mean ϑ and covariance σ²I. The expected cost is then
\[
\mathbb{E}_{p(u;\vartheta)}[R(u)] = \|\vartheta\|^2 + \sigma^2 d.
\]
Obviously, the best thing to do would be to set ϑ = 0. Note that the expected reward is off by σ²d
at this point, but at least this would be finding a good guess for u. Also, as a function of ϑ, the cost
is strongly convex, and the most important thing to know is the expected norm of the gradient, as
this will control the number of iterations. Now, after sampling u from a Gaussian with mean ϑ0 and variance σ²I and using Equation 13, the first gradient will be
\[
g = -\frac{\|\omega - \vartheta_0\|^2}{\sigma^2}\, \omega,
\]
where ω is a normally distributed random vector with mean zero and covariance σ²I. The expected norm of this stochastic gradient is on the order of
\[
O\!\left(\sigma d^{1.5} + \sigma^{-1} d^{0.5}\, \|\vartheta_0\|^2\right),
\]
which is enormous compared with the norm of the true gradient, 2‖ϑ0‖, so a very large number of samples must be averaged before the estimate points in a reliably useful direction.
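A few lines of simulation make the contrast vivid; the dimension and σ below are arbitrary, and the stochastic gradient is sampled exactly as in the formula above.

import numpy as np

rng = np.random.default_rng(6)
d, sigma = 100, 0.1
theta0 = np.ones(d)

# omega is the zero-mean Gaussian perturbation with covariance sigma^2 I.
omega = sigma * rng.standard_normal((10000, d))
g = -np.sum((omega - theta0) ** 2, axis=1, keepdims=True) * omega / sigma ** 2

print("true gradient norm:       ", 2 * np.linalg.norm(theta0))
print("mean stochastic grad norm:", np.linalg.norm(g, axis=1).mean())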
5. NUMERICAL COMPARISONS
The preceding analyses of the RL paradigms when applied to LQR are striking. A model-based
approach combining supervised learning and robust control achieves nearly optimal performance
given its sampling budget. Approximate dynamic programming appears to fare worse in terms
of worst-case performance. And direct policy search seems to be of too high variance to work in
practice. In this section, we implement these various methods and test them on some simple LQR
instances to see how these theoretical predictions reflect practice.
Consider the classic double integrator, with dynamics
\[
x_{t+1} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} x_t + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_t + e_t.
\tag{17}
\]
Such a system could model, say, the position (first state) and velocity (second state) of a unit-mass object under force u.
As an instance of LQR, we can try to steer this system to reach point 0 from initial condition x0 = [−1, 0] without expending much force:
\[
Q = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad R = r_0
\tag{18}
\]
for some scalar r0 . Note that even in this simple instance there is an element of design: Changing
the value of r0 changes the character of the control law, allowing one to weigh the importance of
expending control energy against the time required to reach the destination.
To compare the different approaches, I ran experiments on this instance with a small amount
of noise (e_t zero mean with covariance 10^{-4} I) and training episode length L = 10. The goal was
to design a controller that works on an arbitrarily long time horizon using the fewest number of
simulations of length L.
With one simulation (10 samples), using a white noise input with unit variance, the nominal
estimate is correct to three digits of precision. And, not surprisingly, this returns a nearly opti-
mal control policy. Right out of the box, this nominal control strategy works well on this simple
example. Note that using a least squares estimator makes the nominal controller’s life hard here
because all prior information about sparsity on the state-transition matrices is discarded. In a more
realistic situation, the only parameter that would need to be estimated would be the (2,1) entry
in B, which governs how much force is put out by the actuator and how much mass the system
has.
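For concreteness, the following sketch runs the nominal pipeline end to end on this instance: one ten-step episode with white-noise excitation, a least squares fit of (A, B), and a certainty-equivalent Riccati solve. The value of r0, the tiny regularization of Q, and the use of SciPy's discrete Riccati solver are choices made for the example, not details taken from the experiments reported here.

import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(7)
A = np.array([[1.0, 1.0], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 1e-6]); R = 0.1 * np.eye(1)        # tiny regularization keeps Q positive definite

# One training episode of length L = 10 with unit-variance white-noise input;
# the process noise standard deviation 1e-2 corresponds to covariance 1e-4 I.
x = np.array([-1.0, 0.0]); X, U, Xn = [], [], []
for _ in range(10):
    u = rng.standard_normal(1)
    x_next = A @ x + B @ u + 1e-2 * rng.standard_normal(2)
    X.append(x); U.append(u); Xn.append(x_next); x = x_next

Z = np.hstack([np.array(X), np.array(U)])            # regressors [x_t, u_t]
W, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)
A_hat, B_hat = W[:2].T, W[2:].T

M = solve_discrete_are(A_hat, B_hat, Q, R)           # certainty-equivalent Riccati solve
K = np.linalg.solve(R + B_hat.T @ M @ B_hat, B_hat.T @ M @ A_hat)
print("nominal gain K:", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))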
Now, let us compare this approach with approximate dynamic programming and policy search
methods. For policy search, let us restrict to policies that use a static, linear gain, as would be
optimal on an infinite time horizon. Note that a static linear policy works almost as well as a
time-varying policy for this simple LQR problem with two state dimensions. Moreover, there
are only two decision variables for this simple problem. For policy gradient, I used the Adam
algorithm to shape the iterates (45). I also subtracted the mean reward of previous iterates, a
popular baseline subtraction heuristic to reduce variance [Dayan (46) attributes this heuristic to
Sutton (47) and Williams (48)]. I was unable to get policy gradient to converge without these
additional algorithmic ornamentations. I also compared against a simple approximate dynamic
programming method called least squares policy iteration (LSPI), proposed by Lagoudakis &
Parr (49). I ran each of these methods using 10 different random seeds. Figure 1 plots the median
performance of the various methods with error bars encompassing the maximum and minimum
over all trials. Both nominal control and LSPI are able to find high-quality controllers with only
10 observations. Direct policy methods, on the other hand, require many times as many samples.
Policy gradient, in particular, requires thousands of times as many samples as simple nominal
control.
[Figure 1 shows the cost attained versus the number of samples (0–30,000) for policy gradient, random search, and the optimal controller.]
Figure 1
Cost for the double-integrator model for various reinforcement learning algorithms. The solid plots denote
the median performance, and the shaded regions capture the maximum and minimum performance.
Nominal control and least squares policy iteration are indistinguishable from the optimal controller in this
experiment and hence are omitted.
As a second, more challenging instance, consider an unstable Laplacian model of an idealized data center, in which each state is the temperature of a heat source coupled to its neighbors and each diagonal entry of A is slightly larger than 1. Note that the open loop system here is unstable: With any nonzero initial condition, the state
vector will blow up because the limit of Ak is infinite. Moreover, if a method estimates one of the
diagonal entries of A to be less than 1, we might guess that this mode is actually stable and put
less effort into cooling that source. So it is imperative to obtain a high-quality estimate of the
system’s true behavior for near-optimal control. Or, rather, we must be able to ascertain whether
our current policy is safe, or the consequences can be disastrous.
Let us try to solve the LQR problem with the settings Q = I and R = 1,000I. This models a
high relative cost for power consumption and hence may encourage small control inputs on modes
that are estimated as stable. What happens for our RL methods in this instance?
Figure 2 compares nominal control to two versions of the robust LQR problem described in
Section 4.1. To solve the robust LQR problem, we end up solving a small semidefinite program-
ming problem as described by Dean et al. (39). These semidefinite programs are solved on my
laptop in well under a second. Note that estimating the error from data only yields slightly worse
LQR performance than exactly knowing the true model error.
Note also that the nominal controller does tend to frequently find controllers that fail to stabi-
lize the true system. A necessary and sufficient condition for stabilization is for the matrix A + BK
to have all of its eigenvalues with magnitude less than 1. We can plot how frequently the various search meth-
ods find stabilizing control policies when looking at a finite horizon in Figure 2b. The robust
optimization really helps here to provide controllers that are guaranteed to find a stabilizing so-
lution. On the other hand, in industrial practice, nominal control does seem to work quite well.
A great open problem is to find reasonable assumptions under which the nominal controller is
stabilizing.
[Figure 2 shows (a) the LQR cost and (b) the fraction of stabilizing controllers found, comparing nominal control with the robust LQR variants.]
Figure 3 additionally compares the performance to model-free methods in this instance. Here we again see that they are indeed far off from their model-based counterparts. The x axis has increased by a factor of 10, and yet even the approximate dynamic programming approach does
not find a decent solution. Surprisingly, LSPI, which worked very well on the double integrator,
now performs worse than random search. This is likely because the LSPI subroutine requires a
stabilizing controller for all iterations and also requires careful tuning of the discount factor. Not
only are model-free methods sample hungry, but they fail to be safe. And safety is much more
critical than sample complexity.
[Figure 3 shows (a) the cost on a logarithmic scale and (b) the fraction of stabilizing controllers versus the number of samples (1,000–5,000), comparing nominal control, robust LQR, robust LQR (bootstrap), LSPI, random search, and policy gradient.]
Figure 3
(a) Cost for the Laplacian model for varied models over 5,000 iterations. (b) The fraction of the time that the synthesized control
strategy returns a stabilizing controller. Abbreviations: LQR, linear quadratic regulator; LSPI, least squares policy iteration.
Let us now consider a more challenging set of benchmarks: the robotic locomotion tasks simulated in the MuJoCo physics engine (53). Many of the tasks are very simple, but some are quite difficult, like the complicated humanoid models with 22
degrees of freedom. The dynamics of legged robots are well specified by Lagrange’s equations (54),
but planning locomotion from these models is challenging because it is not clear how to best design
the objective function and because the model is piecewise linear. The model changes whenever
part of the robot comes into contact with a solid object, and hence a normal force is introduced
that was not previously acting upon the robot. Hence, getting robots to work without having to
deal with complicated nonconvex nonlinear models seems like a solid and interesting challenge for
the RL paradigm. Moreover, seminal work by Tedrake et al. (55) demonstrated that direct policy
search could rapidly find feedback control policies for certain constrained legged robot designs.
Levine & Koltun (56) were among the first to use MuJoCo as a test bed for learning-based
control, and they were able to achieve walking in complex simulators without special-purpose
techniques. Since then, these techniques have become standard continuous control benchmarks
for RL (see, for example, 57–61). Recently, Salimans and his collaborators at OpenAI (62) showed
that random search worked quite well on these benchmarks. In particular, they fit neural network
controllers using random search with a few algorithmic enhancements. Random search had indeed
enjoyed significant success in some corners of the robotics community, and others had noted that
in their applications, random search outperformed policy gradient (63). In another piece of great
work, Rajeswaran et al. (64) showed that natural policy gradient could learn linear policies that
could complete these benchmarks. That is, they showed that static linear state feedback, like the
kind we use in LQR, was also sufficient to control these complex robotic simulators. This of course
left an open question: Can simple random search find linear controllers for these MuJoCo tasks?
Mania et al. (65) tested this out, coding up a rather simple version of random search with a
couple of small algorithmic enhancements. Many RL papers were using statistics of the states and
whitening the states before passing them into the neural network mapping from state to action.
This study found that when random search performed the same whitening with linear controls,
this algorithm was able to get state-of-the-art results on all of the MuJoCo benchmark tasks.
There are a few important takeaways from this study. On the one hand, the results suggest
that these MuJoCo demos are easy, or at least considerably easier than they were believed to be.
Benchmarking is difficult, and having only a few simulation benchmarks encourages overfitting to
these benchmarks. Indeed, it does seem like these benchmarks are more about taking advantage
of simulation approximations in MuJoCo than they are about learning reasonable policies for
walking. In terms of benchmarking, this is what makes LQR so attractive: LQR with unknown
dynamics is a reasonable task to master, as it is easy to specify new instances, and it is relatively
easy to understand the limits of achievable performance.
Second, note that since our random search method is fast, we can evaluate its performance on
many random seeds. All model-free methods exhibit alarmingly high variance on these bench-
marks. For instance, on the humanoid task, the model is slow to train almost a quarter of the time
even when supplied with what we thought were good parameters (see Figure 4b). And, for those
random seeds, we found that the method returned rather peculiar gaits. Henderson et al. (66)
and Islam et al. (67) observed this phenomenon with deep RL methods, but our results on linear
controllers suggest that such high variability will be a symptom of all model-free methods. Though direct policy search methods are easy to code up, their reliability on any reasonable control task remains in question.
[Figure 4b shows reward versus training episodes (0–400,000) for 100 runs of random search on the humanoid model, with curves grouped into bands labeled 70–100 runs, 30–70 runs, and 0–30 runs.]
Figure 4
(a) Sample frame of the MuJoCo humanoid. (b) Variance of learning performance on 100 runs of random search on the humanoid model. Note that though high rewards are often achieved, it is more common to observe poor control performance from a random initialization. (c) Using model predictive control and a poor model, complex actions with humanoid simulations can be executed, such as climbing into a vehicle. Panel b adapted from Reference 65; panel c adapted with permission from a supplementary video from Reference 68.
One way to interpolate between learning a model and learning a Q-function is receding horizon control (RHC), also known as model predictive control: At each time step, plan over a finite horizon using the model, apply only the first action of the plan, and then re-plan from the next observed state. Observe that, for any horizon H, the problem
\[
\begin{aligned}
\text{maximize} \quad & \mathbb{E}_{e_t}\!\left[\sum_{t=0}^{H-1} \gamma^t R(x_t, u_t) + \gamma^{H} Q_\gamma(x_H, u_H)\right] \\
\text{subject to} \quad & x_{t+1} = f(x_t, u_t, e_t), \quad u_t = \pi_t(\tau_t)
\end{aligned}
\]
is equivalent to Problem 8. Here, we have just unrolled the cost beyond one step. Though this is
trivial, it is again incredibly powerful: The longer we make the time horizon, the less we need to
worry about the Q-function being accurate. Of course, now we need to worry about the accuracy
of the state-transition map, f . But, especially in problems with continuous variables, it is not at all
obvious which accuracy is more important in terms of finding algorithms with fast learning rates
and short computation times. There is a trade-off between learning models and learning value
functions, and this is a trade-off that needs to be better understood.
Though RHC methods appear fragile to model mismatch, the repeated feedback inside RHC
can correct for many modeling errors. As an example, it is worth revisiting the robotic locomotion
tasks inside the MuJoCo framework. These tasks were actually designed to test the power of a
nonlinear RHC algorithm developed by Tassa et al. (69). The receding horizon controller works
to keep the robot upright even when the model is poorly specified. Moreover, the RHC approach
to humanoid control solved for the controller in 7× real time in 2012. In 2013, the same research
group published a cruder version of their controller that they used during the DARPA Robotics
Challenge (68). All these behaviors are generated by RHC in real time. Though the resulting
walking is not of the same quality as what can be obtained from computationally intensive long-
horizon trajectory optimization, it does look considerably better than the sort of gaits typically
obtained by popular RL methods.
Is there a middle ground between expensive offline trajectory optimization and real-time
RHC? I think the answer is yes, in the same way that there is a middle ground between learn-
ing dynamical models and learning Q-functions. The performance of an RHC system can be im-
proved by better modeling of the Q-function that defines the terminal cost: The better a model
you make of the Q-function, the shorter a time horizon you need for simulation, and the closer
you get to real-time operation. Of course, if you had a perfect model of the Q-function, you could
just solve the Bellman equation, and you would then have the optimal control policy. But by having
an approximation to the Q-function, you can still extract high performance in real time.
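A minimal sketch of this receding-horizon loop in the linear-quadratic setting: plan over H steps with a crude terminal weight standing in for the Q-function, apply only the first action, and re-plan from the next state. The matrices, horizon, and terminal weight are illustrative assumptions.

import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1)
H = 10
M_term = 10.0 * np.eye(2)        # crude stand-in for the true cost to go

def rhc_action(x0):
    # Finite-horizon backward Riccati sweep starting from the terminal weight.
    M = M_term
    K = None
    for _ in range(H):
        K = np.linalg.solve(R + B.T @ M @ B, B.T @ M @ A)
        M = Q + A.T @ M @ A - A.T @ M @ B @ K
    return -K @ x0               # first action of the H-step plan

x = np.array([5.0, 0.0])
for t in range(30):
    u = rhc_action(x)            # re-plan from the current state (feedback!)
    x = A @ x + B @ u
print("final state:", x)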
So what if we learn to iteratively improve the Q-function while running RHC? This idea has
been explored in a project by Rosolia & Borrelli (70). In their approach, the terminal cost is learned
by a method akin to nearest neighbors. The terminal cost of a state is the value obtained the last
time that state was tried. If that state has not been visited, the cost is infinite. This formulation
constrains the terminal condition to be in a state observed before. It enables the control system
to explore new ways to decrease cost as long as it maintains the ability to reach a state that has
already been demonstrated to be safe. This nearest-neighbors approach works surprisingly well
in practice: In radio-controlled car demonstrations, the learned controller works better than a
human operator after only a few laps around a fixed track.
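The following sketch conveys only the flavor of that learned terminal cost, not the actual algorithm of Reference 70: previously visited states are stored with the cost that was eventually accrued from them, and the terminal cost is infinite for states that have never been demonstrated. The discretization and tolerance are invented for the example.

import numpy as np

TOL = 0.5
visited = {}                                   # rounded state -> best observed cost to go

def state_key(x):
    return tuple(np.round(np.asarray(x) / TOL).astype(int))

def terminal_cost(x):
    return visited.get(state_key(x), np.inf)

def record_trajectory(xs, stage_costs):
    # After an episode, store the cost to go from every visited state.
    cost_to_go = np.cumsum(np.asarray(stage_costs)[::-1])[::-1]
    for x, c in zip(xs, cost_to_go):
        k = state_key(x)
        visited[k] = min(visited.get(k, np.inf), float(c))

xs = [np.array([1.0, 0.0]), np.array([0.5, -0.1]), np.array([0.1, 0.0])]
stage_costs = [1.0, 0.26, 0.01]
record_trajectory(xs, stage_costs)
print(terminal_cost(np.array([0.45, 0.0])))    # finite: near a previously visited state
print(terminal_cost(np.array([3.0, 3.0])))     # inf: never demonstrated to be safe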
Another reason to like this blended RHC approach to learning to control is that one can hard
code in constraints on controls and states and easily incorporate models of disturbance directly
into the optimization problem. Some of the most challenging problems in control are how to
execute safely while continuing to learn more about a system’s capability, and an RHC approach
provides a direct route toward balancing safety and performance. Indeed, an interesting direction
of future work would be merging the robust learning of coarse-ID control with RHC.
A further challenge arises when the controller must act from rich sensor measurements, such as camera images, rather than from direct observations of the state. In general, this problem gets into very old intractability issues of nonlinear output feedback
in control (72) and partially observed Markov decision processes in RL (73). Nonetheless, some
early results in RL have shown promise in training optimal controllers directly from pixels (12,
58). Of course, these results have even worse sample complexity than the same methods trained
from states, but they are making progress.
In my opinion, the most promising approaches in this space follow the ideas of guided policy
search, which bootstraps standard state feedback to provide training data for a map from sensors
directly to optimal action (56, 74). That is, a mapping from sensor to action can be learned itera-
tively by first finding the optimal action and then finding a map to that control setting. A coupling
along these lines, where reliance on a precise state estimator is reduced over time, could potentially
provide a reasonably efficient method for learning to control from sensors.
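The following sketch illustrates the general pattern, not the specific guided policy search algorithm: a state-feedback teacher labels sensor observations with actions, and a map from observation to action is then fit by supervised regression. The toy sensor model, the teacher gain, and the ridge regression are all assumptions of the example.

import numpy as np

rng = np.random.default_rng(8)
A = np.array([[1.0, 1.0], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
K = np.array([[0.6, 1.2]])                       # a stabilizing state-feedback teacher (e.g., from LQR)
C = rng.standard_normal((6, 2))                  # toy sensor: y = C x + noise

# Roll out the teacher and collect (observation, teacher action) pairs.
Y, U = [], []
x = np.array([2.0, 0.0])
for _ in range(500):
    u = -K @ x
    Y.append(C @ x + 0.01 * rng.standard_normal(6)); U.append(u)
    x = A @ x + B @ u + 0.01 * rng.standard_normal(2)

Y, U = np.array(Y), np.array(U)
# Ridge regression: action = W @ y, trained to imitate the teacher.
lam = 1e-3
W = np.linalg.solve(Y.T @ Y + lam * np.eye(6), Y.T @ U).T

# The learned sensor-based policy u = W @ y can now run without a state estimator.
print("fit residual:", np.linalg.norm(Y @ W.T - U) / np.linalg.norm(U))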
However, these problems remain daunting. Moving from fully observed scenarios to partially
observed scenarios makes the control problem exponentially more difficult. How to use diverse
sensor measurements in a safe and reliable manner remains an active and increasingly important
research challenge (75–77).
A final set of challenges concerns autonomous systems that must interact with people. A common modeling choice is to treat the human as another optimizing agent, acting according to some cost function under which their behavior is constrained. If we know the cost, then we get a complex game-theoretic version of
RHC (88, 89). But, as is usually the case, humans are bad at specifying their objectives, and hence
what they are optimizing must be learned. This becomes a problem of inverse optimal control (90)
or inverse RL (91), where we must estimate the reward functions of the human and understand
the loss accrued for crudely modeling these rewards.
As expressed above, I think that all of the daunting problems in machine learning are now RL prob-
lems. Whether they be autonomous transportation systems or seemingly mundane social network
engagement systems, actively interacting with reality has high stakes. Indeed, as soon as a machine
learning system is unleashed in feedback with humans, that system is an RL system. The broad
engineering community must take responsibility for the now ubiquitous machine learning systems
and understand what happens when we set them loose on the world.
Solving these problems will require advances in both machine learning and control. Perhaps
this intersection needs a new name, so that researchers can stop arguing about territory. I person-
ally am fond of the term actionable intelligence, as it sums up not only robotics but smarter, safer
analytics. But regardless of what we call it, the point is that there is a large community spanning
multiple disciplines that is invested in making progress on these problems. Hopefully this tour has
set the stage for a lot of great research at the intersection of machine learning and control, and I
am excited to see what progress the communities can make working together.
DISCLOSURE STATEMENT
The author received a 2018 Amazon AWS Machine Learning Research Award, was a consultant
at Google from 2015 to 2017, and is a member of the Berkeley DeepDrive Consortium (sponsors
listed at https://deepdrive.berkeley.edu) and Berkeley Artificial Intelligence Research (sponsors
listed at https://bair.berkeley.edu/sponsor.html).
ACKNOWLEDGMENTS
Countless individuals have helped to shape the content of this review. First, this work was gener-
ously supported in part by two forward-looking programs at the US Department of Defense: the
Mathematical Data Science program at the Office of Naval Research and the Foundations and
Limits of Learning program at DARPA. Second, this article was distilled from a series on my blog
at http://argmin.net. I greatly appreciated the lively debates on Twitter, and I hope that even
those who disagree with my perspectives find their input incorporated into this survey.
I thank Chris Wiggins for sharing his taxonomy on machine learning, Roy Frostig for shaping
Section 3.3, Pavel Pravdin for consulting on how to get policy gradient methods up and run-
ning, and Max Raginsky for perspectives on adaptive control and translations of Russian. I also
thank Moritz Hardt, Eric Jonas, and Ali Rahimi for helping to shape the language, rhetoric, and
focus of the blog series and this survey, and Nevena Lazic and Gergely Neu for many helpful
suggestions for improving the article’s readability and accuracy. Additionally, I thank my other
colleagues in machine learning and control for many helpful conversations and pointers about
this material: Murat Arcak, Karl Astrom, Francesco Borrelli, John Doyle, Andy Packard, Anders
Rantzer, Lorenzo Rosasco, Shankar Sastry, Yoram Singer, Csaba Szepesvari, Claire Tomlin, and
Stephen Wright. I also thank my colleagues in robotics—Anca Dragan, Leslie Kaebling, Sergey
Levine, Pierre-Yves Oudeyer, Olivier Sigaud, Russ Tedrake, and Emo Todorov—for sharing their
perspectives on the RL and optimization technologies that work for them and the challenges they
face in their research.
I thank everyone who took CS281B with me in the spring of 2017, where I first tried to make
sense of the problems in learning to control. And most importantly, a big thanks to everyone in
my research group, who have been wrestling with these ideas with me for the past several years
and who have done much of the research that shaped my views on this space, particularly Ross
Boczar, Nick Boyd, Sarah Dean, Animesh Garg, Aurelia Guy, Qingqing Huang, Kevin Jamieson,
Sanjay Krishnan, Laurent Lessard, Horia Mania, Nik Matni, Becca Roelofs, Ugo Rosolia, Ludwig
LITERATURE CITED
1. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, et al. 2016. Mastering the game of Go with deep neural
networks and tree search. Nature 529:484–89
2. Bertsekas DP. 2017. Dynamic Programming and Optimal Control, Vol. 1. Nashua, NH: Athena Sci. 4th ed.
3. Sutton RS, Barto AG. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press
4. Puterman ML. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Hoboken, NJ:
Wiley-Interscience
5. Dann C, Brunskill E. 2015. Sample complexity of episodic fixed-horizon reinforcement learning. In Ad-
vances in Neural Information Processing Systems 28, ed. C Cortes, ND Lawrence, DD Lee, M Sugiyama,
R Garnett, pp. 2818–26. Red Hook, NY: Curran
6. Nemirovski A, Yudin D. 1983. Problem Complexity and Method Efficiency in Optimization. New York: Wiley
7. Zhu X. 2005. Semi-supervised learning literature survey. Tech. Rep. 1530, Dep. Comput. Sci., Univ. Wisc.,
Madison
8. Hazan E, Kale S, Shalev-Shwartz S. 2012. Near-optimal algorithms for online matrix prediction. In
Proceedings of the 25th Annual Conference on Learning Theory, ed. S Mannor, N Srebro, RC Williamson,
pp. 38.1–13. Proc. Mach. Learn. Res. 23. N.p.: PMLR
9. Bertsekas DP, Tsitsiklis JN. 1996. Neuro-Dynamic Programming. Belmont, MA: Athena Sci.
10. Kaelbling LP, Littman ML, Moore AW. 1996. Reinforcement learning: a survey. J. Artif. Intell. Res. 4:
237–85
11. Bowling M, Burch N, Johanson M, Tammelin O. 2015. Heads-up limit hold’em poker is solved. Science
347:145–49
12. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, et al. 2015. Human-level control through deep
reinforcement learning. Nature 518:529–33
13. Tesauro G. 1995. TD-Gammon: a self-teaching backgammon program. In Applications of Neural Networks,
ed. AF Murray, pp. 267–85. Boston, MA: Springer
14. Bottou L, Peters J, Quiñonero-Candela J, Charles DX, Chickering DM, et al. 2013. Counterfactual rea-
soning and learning systems: the example of computational advertising. J. Mach. Learn. Res. 14:3207–
60
15. Strehl A, Langford J, Li L, Kakade SM. 2010. Learning from logged implicit exploration data. In Advances
in Neural Information Processing Systems 23, ed. JD Lafferty, CKI Williams, J Shawe-Taylor, RS Zemel,
A Culotta, pp. 2217–25. Red Hook, NY: Curran
16. Bertsekas DP. 2012. Dynamic Programming and Optimal Control, Vol. 2. Nashua, NH: Athena Sci. 4th ed.
17. Ljung L. 1998. System Identification: Theory for the User. Upper Saddle River, NJ: Prentice Hall. 2nd ed.
18. Campi MC, Weyer E. 2002. Finite sample properties of system identification methods. IEEE Trans. Autom.
Control 47:1329–34
19. Vidyasagar M, Karandikar RL. 2008. A learning theory approach to system identification and stochastic
adaptive control. J. Process Control 18:421–30
20. Tsitsiklis JN. 1994. Asynchronous stochastic approximation and Q-learning. Mach. Learn. 16:185–202
21. Watkins CJ, Dayan P. 1992. Q-learning. Mach. Learn. 8:279–92
22. Sutton RS. 1988. Learning to predict by the method of temporal differences. Mach. Learn. 3:9–44
23. Dayan P. 1992. The convergence of TD(λ) for general λ. Mach. Learn. 8:341–62
24. Bradtke SJ, Barto AG. 1996. Linear least-squares algorithms for temporal difference learning. Mach. Learn.
22:33–57
25. Bertsekas DP, Ioffe S. 1996. Temporal differences-based policy iteration and applications in neuro-dynamic pro-
gramming. Tech. Rep. LIDS-P-2349, Lab. Inf. Decis. Syst., Mass. Inst. Technol., Cambridge, MA
26. Yu H, Bertsekas DP. 2009. Convergence results for some temporal difference methods based on least
squares. IEEE Trans. Autom. Control 54:1515–31
27. Williams RJ. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8:229–56
49. Lagoudakis MG, Parr R. 2003. Least-squares policy iteration. J. Mach. Learn. Res. 4:1107–49
50. Gao J. 2014. Machine learning applications for data center optimization. White Pap., Google, Mountain View,
CA
51. Efron B. 1979. Bootstrap methods: another look at the jackknife. Ann. Stat. 7:1–26
52. Shao J, Tu D. 1995. The Jackknife and Bootstrap. New York: Springer-Verlag
53. Todorov E, Erez T, Tassa Y. 2012. MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ
International Conference on Intelligent Robots and Systems, pp. 5026–33. New York: IEEE
54. Murray RM. 2017. A Mathematical Introduction to Robotic Manipulation. Boca Raton, FL: CRC
55. Tedrake R, Zhang TW, Seung HS. 2004. Stochastic policy gradient reinforcement learning on a simple
3D biped. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2849–54. New
York: IEEE
56. Levine S, Koltun V. 2013. Guided policy search. In Proceedings of the 30th International Conference on Ma-
chine Learning, ed. S Dasgupta, D McAllester, pp. 1–9. Proc. Mach. Learn. Res. 28(3). N.p.: PMLR
57. Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M. 2014. Deterministic policy gradient
algorithms. In Proceedings of the 31st International Conference on Machine Learning, ed. EP Xing, T Jebara,
pp. 387–95. Proc. Mach. Learn. Res. 32(1). N.p.: PMLR
58. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, et al. 2015. Continuous control with deep reinforcement
learning. arXiv:1509.02971 [cs.LG]
59. Schulman J, Levine S, Abbeel P, Jordan M, Moritz P. 2015. Trust region policy optimization. In Proceedings
of the 32nd International Conference on Machine Learning, ed. F Bach, D Blei, pp. 1889–97. Proc. Mach.
Learn. Res. 37. N.p.: PMLR
60. Schulman J, Moritz P, Levine S, Jordan M, Abbeel P. 2015. High-dimensional continuous control using
generalized advantage estimation. arXiv:1506.02438 [cs.LG]
61. Wu Y, Mansimov E, Liao S, Grosse R, Ba J. 2017. Scalable trust-region method for deep reinforcement
learning using Kronecker-factored approximation. arXiv:1708.05144 [cs.LG]
62. Salimans T, Ho J, Chen X, Sutskever I. 2017. Evolution strategies as a scalable alternative to reinforcement
learning. arXiv:1703.03864 [stat.ML]
63. Stulp F, Sigaud O. 2013. Robot skill learning: from reinforcement learning to evolution strategies. Paladyn
4:49–61
64. Rajeswaran A, Lowrey K, Todorov E, Kakade S. 2017. Towards generalization and simplicity in continuous
control. In Advances in Neural Information Processing Systems 30, ed. I Guyon, UV Luxburg, S Bengio,
H Wallach, R Fergus, et al., pp. 6550–61. Red Hook, NY: Curran
65. Mania H, Guy A, Recht B. 2018. Simple random search provides a competitive approach to reinforcement
learning. arXiv:1803.07055 [cs.LG]
66. Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D. 2017. Deep reinforcement learning
that matters. arXiv:1709.06560 [cs.LG]
67. Islam R, Henderson P, Gomrokchi M, Precup D. 2017. Reproducibility of benchmarked deep reinforce-
ment learning tasks for continuous control. arXiv:1708.04133 [cs.LG]
68. Erez T, Lowrey K, Tassa Y, Kumar V, Kolev S, Todorov E. 2013. An integrated system for real-time model
predictive control of humanoid robots. In 2013 13th IEEE-RAS International Conference on Humanoid
Robots, pp. 292–99. New York: IEEE
69. Tassa Y, Erez T, Todorov E. 2012. Synthesis and stabilization of complex behaviors through online trajec-
tory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–13.
New York: IEEE
70. Rosolia U, Borrelli F. 2018. Learning model predictive control for iterative tasks. A data-driven control
framework. IEEE Trans. Autom. Control 63:1883–96
71. Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B, et al. 2016. End to end learning for self-
driving cars. arXiv:1604.07316 [cs.CV]
72. Blondel VD, Tsitsiklis JN. 2000. A survey of computational complexity results in systems and control.
Automatica 36:1249–74
73. Papadimitriou CH, Tsitsiklis JN. 1987. The complexity of Markov decision processes. Math. Oper. Res.
12:441–50
74. Levine S, Finn C, Darrell T, Abbeel P. 2016. End-to-end training of deep visuomotor policies. J. Mach.
Learn. Res. 17:1–40
75. Akametalu AK, Fisac JF, Gillula JH, Kaynama S, Zeilinger MN, Tomlin CJ. 2014. Reachability-based
safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, pp. 1424–31. New
York: IEEE
76. Aswani A, Gonzalez H, Sastry SS, Tomlin CJ. 2013. Provably safe and robust learning-based model pre-
dictive control. Automatica 49:1216–26
77. Berkenkamp F, Turchetta M, Schoellig A, Krause A. 2017. Safe model-based reinforcement learning with
stability guarantees. In Advances in Neural Information Processing Systems 30, ed. I Guyon, UV Luxburg,
S Bengio, H Wallach, R Fergus, et al., pp. 908–18. Red Hook, NY: Curran
78. Goodwin GC, Ramadge PJ, Caines PE. 1981. Discrete time stochastic adaptive control. SIAM J. Control
Optim. 19:829–53
79. Abbasi-Yadkori Y, Szepesvári C. 2011. Regret bounds for the adaptive control of linear quadratic systems.
In Proceedings of the 24th Annual Conference on Learning Theory, ed. SM Kakade, U von Luxburg, pp. 1–26.
Proc. Mach. Learn. Res. 19. N.p.: PMLR
80. Auer P, Cesa-Bianchi N, Fischer P. 2002. Finite-time analysis of the multiarmed bandit problem. Mach.
Learn. 47:235–56
81. Lai TL, Robbins H. 1985. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6:4–22
82. Abbasi-Yadkori Y, Szepesvári C. 2015. Bayesian optimal control of smoothly parameterized systems: the
lazy posterior sampling algorithm. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial
Intelligence, ed. M Meila, T Heskes, pp. 2–11. Arlington, VA: AUAI
83. Abeille M, Lazaric A. 2017. Thompson sampling for linear-quadratic control problems. In Proceedings of
the 20th International Conference on Artificial Intelligence and Statistics, ed. A Singh, J Zhu, pp. 176–84. Proc.
Mach. Learn. Res. 54. N.p.: PMLR
84. Ouyang Y, Gagrani M, Jain R. 2017. Learning-based control of unknown linear systems with Thompson
sampling. arXiv:1709.04047 [cs.SY]
85. Abbasi-Yadkori Y, Lazic N, Szepesvári C. 2018. The return of ε-greedy: sublinear regret for model-free
linear quadratic control. arXiv:1804.06021 [cs.LG]
86. Dean S, Mania H, Matni N, Recht B, Tu S. 2018. Regret bounds for robust adaptive control of the linear
quadratic regulator. arXiv:1805.09388 [cs.LG]
87. Sadigh D, Sastry S, Seshia SA, Dragan AD. 2016. Planning for autonomous cars that leverage effects on
human actions. In Robotics: Science and Systems XII, ed. D Hsu, N Amato, S Berman, S Jacobs, chap. 29.
N.p.: Robot. Sci. Syst. Found.
88. Bialas WF. 1989. Cooperative n-person Stackelberg games. In Proceedings of the 28th IEEE Conference on
Decision and Control, Vol. 3, pp. 2439–44. New York: IEEE
89. Li N, Oyler DW, Zhang M, Yildiz Y, Kolmanovsky I, Girard AR. 2017. Game theoretic modeling of driver
and vehicle interactions for verification and validation of autonomous vehicle control systems. IEEE Trans.
Control Syst. Technol. 26:1782–97
90. Kalman RE. 1964. When is a linear control system optimal? J. Basic Eng. 86:51–60
91. Ng AY, Russell SJ. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth
International Conference on Machine Learning, ed. P Langley, pp. 663–70. San Francisco: Morgan Kaufmann