To appear in Bayesian Brain, Doya, K. (ed), MIT Press (2006)

Optimal Control Theory

Emanuel Todorov
University of California San Diego

Optimal control theory is a mature mathematical discipline with numerous applications in both science and engineering. It is emerging as the computational framework of choice for studying the neural control of movement, in much the same way that probabilistic inference is emerging as the computational framework of choice for studying sensory information processing. Despite the growing popularity of optimal control models, however, the elaborate mathematical machinery behind them is rarely exposed and the big picture is hard to grasp without reading a few technical books on the subject. While this chapter cannot replace such books, it aims to provide a self-contained mathematical introduction to optimal control theory that is sufficiently broad and yet sufficiently detailed when it comes to key concepts. The text is not tailored to the field of motor control (apart from the last section, and the overall emphasis on systems with continuous state) so it will hopefully be of interest to a wider audience. Of special interest in the context of this book is the material on the duality of optimal control and probabilistic inference; such duality suggests that neural information processing in sensory and motor areas may be more similar than currently thought. The chapter is organized in the following sections:

1. Dynamic programming, Bellman equations, optimal value functions, value and policy iteration, shortest paths, Markov decision processes.

2. Hamilton-Jacobi-Bellman equations, approximation methods, finite and infinite horizon formulations, basics of stochastic calculus.

3. Pontryagin's maximum principle, ODE and gradient descent methods, relationship to classical mechanics.

4. Linear-quadratic-Gaussian control, Riccati equations, iterative linear approximations to nonlinear problems.

5. Optimal recursive estimation, Kalman filter, Zakai equation.

6. Duality of optimal control and optimal estimation (including new results).

7. Optimality models in motor control, promising research directions.

1 Discrete control: Bellman equations

Let x ∈ X denote the state of an agent's environment, and u ∈ U(x) the action (or control) which the agent chooses while at state x. For now both X and U(x) are finite sets. Let next(x, u) ∈ X denote the state which results from applying action u in state x, and cost(x, u) ≥ 0 the cost of applying action u in state x. As an example, x may be the city where we are now, u the flight we choose to take, next(x, u) the city where that flight lands, and cost(x, u) the price of the ticket. We can now pose a simple yet practical optimal control problem: find the cheapest way to fly to your destination. This problem can be formalized as follows: find an action sequence (u_0, u_1, ..., u_{n-1}) and corresponding state sequence (x_0, x_1, ..., x_n) minimizing the total cost

$$J(x_\cdot, u_\cdot) = \sum_{k=0}^{n-1} \mathrm{cost}(x_k, u_k)$$

where x_{k+1} = next(x_k, u_k) and u_k ∈ U(x_k). The initial state x_0 = x^init and destination state x_n = x^dest are given. We can visualize this setting with a directed graph where the states are nodes and the actions are arrows connecting the nodes. If cost(x, u) = 1 for all (x, u) the problem reduces to finding the shortest path from x^init to x^dest in the graph.
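To make the setup concrete, here is a minimal Python sketch of the flight example. The cities, routes, and ticket prices are hypothetical, and the names `flights`, `next_state`, `cost`, and `total_cost` are illustrative choices, not anything defined in the chapter.

```python
# Hypothetical flight network: states are cities, actions are direct flights.
# next(x, u) is the landing city, cost(x, u) is the ticket price.
flights = {
    "SAN": {"to_LAX": ("LAX", 80), "to_PHX": ("PHX", 120)},
    "LAX": {"to_SFO": ("SFO", 90), "to_SEA": ("SEA", 150)},
    "PHX": {"to_SEA": ("SEA", 170)},
    "SFO": {"to_SEA": ("SEA", 110)},
    "SEA": {},  # destination: no outgoing actions needed
}

def next_state(x, u):
    return flights[x][u][0]

def cost(x, u):
    return flights[x][u][1]

def total_cost(x0, actions):
    """Total cost J of an action sequence starting from x0."""
    J, x = 0, x0
    for u in actions:
        J += cost(x, u)
        x = next_state(x, u)
    return J, x

# Example: SAN -> LAX -> SFO -> SEA costs 80 + 90 + 110 = 280.
print(total_cost("SAN", ["to_LAX", "to_SFO", "to_SEA"]))
```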
1.1 Dynamic programming

Optimization problems such as the one stated above are efficiently solved via dynamic programming (DP). DP relies on the following obvious fact: if a given state-action sequence is optimal, and we were to remove the first state and action, the remaining sequence is also optimal (with the second state of the original sequence now acting as initial state). This is the Bellman optimality principle. Note the close resemblance to the Markov property of stochastic processes (a process is Markov if its future is conditionally independent of the past given the present state). The optimality principle can be reworded in similar language: the choice of optimal actions in the future is independent of the past actions which led to the present state. Thus optimal state-action sequences can be constructed by starting at the final state and extending backwards. Key to this procedure is the optimal value function (or optimal cost-to-go function)

v(x) = "minimal total cost for completing the task starting from state x"

This function captures the long-term cost for starting from a given state, and makes it possible to find optimal actions through the following algorithm:

Consider every action available at the current state, add its immediate cost to the optimal value of the resulting next state, and choose an action for which the sum is minimal.

The above algorithm is "greedy" in the sense that actions are chosen based on local information, without explicit consideration of all future scenarios. And yet the resulting actions are optimal. This is possible because the optimal value function contains all information about future scenarios that is relevant to the present choice of action. Thus the optimal value function is an extremely useful quantity, and indeed its calculation is at the heart of many methods for optimal control.

The above algorithm yields an optimal action u = π(x) ∈ U(x) for every state x. A mapping from states to actions is called a control law or control policy. Once we have a control law π : X → U(X) we can start at any state x_0, generate action u_0 = π(x_0), transition to state x_1 = next(x_0, u_0), generate action u_1 = π(x_1), and keep going until we reach x^dest. Formally, an optimal control law π satisfies

$$\pi(x) = \arg\min_{u \in \mathcal{U}(x)} \big\{ \mathrm{cost}(x,u) + v(\mathrm{next}(x,u)) \big\} \tag{1}$$

The minimum in (1) may be achieved for multiple actions in the set U(x), which is why π may not be unique. However the optimal value function v is always uniquely defined, and satisfies

$$v(x) = \min_{u \in \mathcal{U}(x)} \big\{ \mathrm{cost}(x,u) + v(\mathrm{next}(x,u)) \big\} \tag{2}$$

Equations (1) and (2) are the Bellman equations. If for some x we already know v(next(x, u)) for all u ∈ U(x), then we can apply the Bellman equations directly and compute π(x) and v(x). Thus dynamic programming is particularly simple in acyclic graphs where we can start from x^dest with v(x^dest) = 0, and perform a backward pass in which every state is visited after all its successor states have been visited. It is straightforward to extend the algorithm to the case where we are given non-zero final costs for a number of destination states (or absorbing states).
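The backward pass on an acyclic graph is easy to implement. The following Python sketch uses a hypothetical four-state graph and a hand-listed reverse topological order; a general implementation would compute that order with a topological sort.

```python
# Minimal backward-pass dynamic programming on an acyclic graph.
# graph[x] maps each action u to (next(x, u), cost(x, u)); data is hypothetical.
graph = {
    "A": {"a1": ("B", 2), "a2": ("C", 5)},
    "B": {"b1": ("C", 2), "b2": ("D", 6)},
    "C": {"c1": ("D", 1)},
    "D": {},  # destination
}

def backward_pass(graph, dest):
    v = {dest: 0.0}   # optimal cost-to-go, equation (2)
    pi = {}           # optimal control law, equation (1)
    # Visit states in reverse topological order; here we simply list it by hand.
    for x in ["C", "B", "A"]:
        best_u, best_val = None, float("inf")
        for u, (y, c) in graph[x].items():
            if c + v[y] < best_val:
                best_u, best_val = u, c + v[y]
        v[x], pi[x] = best_val, best_u
    return v, pi

v, pi = backward_pass(graph, "D")
print(v)   # {'D': 0.0, 'C': 1.0, 'B': 3.0, 'A': 5.0}
print(pi)  # {'C': 'c1', 'B': 'b1', 'A': 'a1'}
```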
1.2 Value iteration and policy iteration

The situation is more complex in graphs with cycles. Here the Bellman equations are still valid, but we cannot apply them in a single pass. This is because the presence of cycles makes it impossible to visit each state only after all its successors have been visited. Instead the Bellman equations are treated as consistency conditions and used to design iterative relaxation schemes, much like partial differential equations (PDEs) are treated as consistency conditions and solved with corresponding relaxation schemes. By "relaxation scheme" we mean guessing the solution, and iteratively improving the guess so as to make it more compatible with the consistency condition. The two main relaxation schemes are value iteration and policy iteration.

Value iteration uses only (2). We start with a guess v^(0) of the optimal value function, and construct a sequence of improved guesses:

$$v^{(i+1)}(x) = \min_{u \in \mathcal{U}(x)} \big\{ \mathrm{cost}(x,u) + v^{(i)}(\mathrm{next}(x,u)) \big\} \tag{3}$$

This process is guaranteed to converge to the optimal value function v in a finite number of iterations. The proof relies on the important idea of contraction mappings: one defines the approximation error e(v^(i)) = max_x |v^(i)(x) − v(x)|, and shows that the iteration (3) causes e(v^(i)) to decrease as i increases. In other words, the mapping v^(i) → v^(i+1) given by (3) contracts the "size" of v^(i) as measured by the error norm e(v^(i)).

Policy iteration uses both (1) and (2). It starts with a guess π^(0) of the optimal control law, and constructs a sequence of improved guesses:

$$v^{\pi^{(i)}}(x) = \mathrm{cost}\big(x, \pi^{(i)}(x)\big) + v^{\pi^{(i)}}\big(\mathrm{next}(x, \pi^{(i)}(x))\big) \tag{4}$$
$$\pi^{(i+1)}(x) = \arg\min_{u \in \mathcal{U}(x)} \big\{ \mathrm{cost}(x,u) + v^{\pi^{(i)}}(\mathrm{next}(x,u)) \big\}$$

The first line of (4) requires a separate relaxation to compute the value function v^{π^(i)} for the control law π^(i). This function is defined as the total cost for starting at state x and acting according to π^(i) thereafter. Policy iteration can also be proven to converge in a finite number of iterations. It is not obvious which algorithm is better, because each of the two nested relaxations in policy iteration converges faster than the single relaxation in value iteration. In practice both algorithms are used depending on the problem at hand.

1.3 Markov decision processes

The problems considered thus far are deterministic, in the sense that applying action u at state x always yields the same next state next(x, u). Dynamic programming easily generalizes to the stochastic case where we have a probability distribution over possible next states:

p(y | x, u) = "probability that next(x, u) = y"

In order to qualify as a probability distribution the function p must satisfy

$$\sum_{y \in \mathcal{X}} p(y \mid x, u) = 1, \qquad p(y \mid x, u) \geq 0$$

In the stochastic case the value function equation (2) becomes

$$v(x) = \min_{u \in \mathcal{U}(x)} \big\{ \mathrm{cost}(x,u) + E\left[v(\mathrm{next}(x,u))\right] \big\} \tag{5}$$

where E denotes expectation over next(x, u), and is computed as

$$E\left[v(\mathrm{next}(x,u))\right] = \sum_{y \in \mathcal{X}} p(y \mid x, u)\, v(y)$$

Equations (1, 3, 4) generalize to the stochastic case in the same way as equation (2) does.

An optimal control problem with discrete states and actions and probabilistic state transitions is called a Markov decision process (MDP). MDPs are extensively studied in reinforcement learning, which is a sub-field of machine learning focusing on optimal control problems with discrete state. In contrast, optimal control theory focuses on problems with continuous state and exploits their rich differential structure.
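As an illustration of the stochastic case, here is a minimal value iteration sketch in Python for a hypothetical three-state MDP with an absorbing zero-cost goal state; it implements the stochastic counterpart of update (3) and stops when successive guesses agree to a tolerance.

```python
# Value iteration for a small Markov decision process (hypothetical data).
# mdp[x][u] = (cost(x, u), {y: p(y | x, u)})
mdp = {
    "s0":   {"safe":  (1.0, {"goal": 0.5, "s0": 0.5}),
             "risky": (2.0, {"goal": 0.8, "s1": 0.2})},
    "s1":   {"move":  (4.0, {"goal": 1.0})},
    "goal": {"stay":  (0.0, {"goal": 1.0})},
}

def value_iteration(mdp, tol=1e-9):
    v = {x: 0.0 for x in mdp}                      # initial guess v^(0)
    while True:
        v_new = {}
        for x, actions in mdp.items():
            # Stochastic version of update (3): cost plus expected next value.
            v_new[x] = min(c + sum(p * v[y] for y, p in trans.items())
                           for c, trans in actions.values())
        if max(abs(v_new[x] - v[x]) for x in mdp) < tol:
            return v_new
        v = v_new

print(value_iteration(mdp))  # roughly {'s0': 2.0, 's1': 4.0, 'goal': 0.0}
```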
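A companion sketch of policy iteration, reusing the hypothetical `mdp` dictionary defined in the previous block: the inner loop evaluates the current policy by relaxing the first line of (4), and the outer loop improves it greedily as in the second line, stopping when the policy no longer changes.

```python
def policy_iteration(mdp, tol=1e-9):
    pi = {x: next(iter(actions)) for x, actions in mdp.items()}  # initial guess pi^(0)
    while True:
        # Policy evaluation: relax v(x) = cost(x, pi(x)) + E[v(next)] to convergence.
        v = {x: 0.0 for x in mdp}
        while True:
            v_new = {}
            for x in mdp:
                c, trans = mdp[x][pi[x]]
                v_new[x] = c + sum(p * v[y] for y, p in trans.items())
            if max(abs(v_new[x] - v[x]) for x in mdp) < tol:
                break
            v = v_new
        # Policy improvement: greedy with respect to the evaluated value function.
        pi_new = {}
        for x, actions in mdp.items():
            pi_new[x] = min(actions,
                            key=lambda u: actions[u][0] +
                            sum(p * v[y] for y, p in actions[u][1].items()))
        if pi_new == pi:
            return pi, v
        pi = pi_new

print(policy_iteration(mdp))  # the optimal policy chooses "safe" at s0
```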
2 Continuous control: Hamilton-Jacobi-Bellman equations

We now turn to optimal control problems where the state x ∈ R^{n_x} and control u ∈ U(x) ⊆ R^{n_u} are real-valued vectors. To simplify notation we will use the shortcut min_u instead of min_{u ∈ U(x)}, although the latter is implied unless noted otherwise.

Consider the stochastic differential equation

$$dx = f(x, u)\, dt + F(x, u)\, dw \tag{6}$$

where dw is n_w-dimensional Brownian motion. This is sometimes called a controlled Ito diffusion, with f(x, u) being the drift and F(x, u) the diffusion coefficient. In the absence of noise, i.e. when F(x, u) = 0, we can simply write ẋ = f(x, u). However in the stochastic case this would be meaningless because the sample paths of Brownian motion are not differentiable (the term dw/dt is infinite). What equation (6) really means is that the integral of the left hand side is equal to the integral of the right hand side:

$$x(t) - x(0) = \int_0^t f(x(s), u(s))\, ds + \int_0^t F(x(s), u(s))\, dw(s)$$

The last term is an Ito integral, defined for square-integrable functions g(t) as

$$\int_0^t g(s)\, dw(s) = \lim_{n \to \infty} \sum_{k=0}^{n-1} g(s_k)\,\big(w(s_{k+1}) - w(s_k)\big), \qquad 0 = s_0 < s_1 < \cdots < s_n = t$$

We will stay away from the complexities of stochastic calculus to the extent possible. Instead we will discretize the time axis and obtain results for the continuous-time case in the limit of infinitely small time step. The appropriate Euler discretization of (6) is

$$x_{k+1} = x_k + \Delta\, f(x_k, u_k) + \sqrt{\Delta}\, F(x_k, u_k)\, \varepsilon_k$$

where Δ is the time step, ε_k ~ N(0, I^{n_w}) and x_k = x(kΔ). The √Δ term appears because the variance of Brownian motion grows linearly with time, and thus the standard deviation of the discrete-time noise should scale as √Δ.

To define an optimal control problem we also need a cost function. In finite-horizon problems, i.e. when a final time t_f is specified, it is natural to separate the total cost into a time-integral of a cost rate ℓ(x, u, t) ≥ 0, and a final cost h(x) ≥ 0 which is only evaluated at the final state x(t_f). Thus the total cost for a given state-control trajectory {x(t), u(t) : 0 ≤ t ≤ t_f} is defined as

$$J(x(\cdot), u(\cdot)) = h(x(t_f)) + \int_0^{t_f} \ell(x(t), u(t), t)\, dt$$

Keep in mind that we are dealing with a stochastic system. Our objective is to find a control law u = π(x, t) which minimizes the expected total cost for starting at a given (x, t) and acting according to π thereafter.

In discrete time the total cost becomes

$$J(x_\cdot, u_\cdot) = h(x_n) + \Delta \sum_{k=0}^{n-1} \ell(x_k, u_k, k\Delta)$$

where n = t_f / Δ is the number of time steps (assume that t_f / Δ is an integer).
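The Euler discretization translates directly into a simulation loop. The sketch below (Python with NumPy) propagates the discretized diffusion and accumulates the discrete-time total cost for a hypothetical scalar system; the dynamics, cost terms, and control law are made up for illustration and are not claimed to be optimal.

```python
import numpy as np

# Hypothetical scalar example: dx = u dt + sigma dw, quadratic costs.
sigma = 0.2
def f(x, u): return u                        # drift
def F(x, u): return sigma                    # diffusion coefficient
def ell(x, u, t): return 0.5 * (x**2 + u**2) # cost rate
def h(x): return 5.0 * x**2                  # final cost
def pi(x, t): return -x                      # some control law (not optimal)

def simulate(x0, tf, dt, rng):
    """Euler discretization of the controlled diffusion plus total cost."""
    n = int(round(tf / dt))
    x, J = x0, 0.0
    for k in range(n):
        t = k * dt
        u = pi(x, t)
        J += dt * ell(x, u, t)
        eps = rng.standard_normal()
        # sqrt(dt) scaling: the std of Brownian increments grows as sqrt(time).
        x = x + dt * f(x, u) + np.sqrt(dt) * F(x, u) * eps
    return J + h(x)

rng = np.random.default_rng(0)
costs = [simulate(x0=1.0, tf=1.0, dt=1e-3, rng=rng) for _ in range(100)]
print("expected total cost ~", np.mean(costs))
```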
2.1 Derivation of the HJB equations

We are now ready to apply dynamic programming to the time-discretized stochastic problem. The development is similar to the MDP case except that the state space is now infinite: it consists of n+1 copies of R^{n_x}. The reason we need multiple copies of R^{n_x} is that we have a finite-horizon problem, and therefore the time when a given x ∈ R^{n_x} is reached makes a difference. The state transitions are now stochastic: the probability distribution of x_{k+1} given x_k, u_k is the multivariate Gaussian

$$x_{k+1} \sim \mathcal{N}\big(x_k + \Delta\, f(x_k, u_k),\; \Delta\, S(x_k, u_k)\big)$$

where S(x, u) = F(x, u) F(x, u)^T.

The Bellman equation for the optimal value function v is similar to (5), except that v is now a function of space and time. We have

$$v(x, k) = \min_u \big\{ \Delta\, \ell(x, u, k\Delta) + E\left[v(x + \Delta f(x,u) + \xi,\; k+1)\right] \big\} \tag{7}$$

where ξ ~ N(0, Δ S(x, u)) and v(x, n) = h(x).

Consider the second-order Taylor-series expansion of v, with the time index k+1 suppressed for clarity:

$$v(x + \delta) = v(x) + \delta^T v_x(x) + \tfrac{1}{2}\, \delta^T v_{xx}(x)\, \delta + o(\delta^3)$$

where δ = Δ f(x, u) + ξ, v_x = ∂v/∂x, v_xx = ∂²v/∂x∂x^T.

Now compute the expectation of the optimal value function at the next state, using the above Taylor-series expansion and only keeping terms up to first order in Δ. The result is:

$$E[v] = v(x) + \Delta\, f(x,u)^T v_x(x) + \tfrac{1}{2}\, \mathrm{tr}\big(\Delta\, S(x,u)\, v_{xx}(x)\big) + o(\Delta^2)$$

The trace term appears because

$$E\left[\xi^T v_{xx}\, \xi\right] = E\left[\mathrm{tr}\big(\xi \xi^T v_{xx}\big)\right] = \mathrm{tr}\big(\mathrm{Cov}[\xi]\, v_{xx}\big) = \mathrm{tr}\big(\Delta\, S\, v_{xx}\big)$$

Note the second-order derivative v_xx in the first-order approximation to E[v]. This is a recurrent theme in stochastic calculus. It is directly related to Ito's lemma, which states that if x(t) is an Ito diffusion with coefficient σ, then

$$dg(x(t)) = g_x(x(t))\, dx(t) + \tfrac{1}{2}\, \sigma^2 g_{xx}(x(t))\, dt$$

Coming back to the derivation, we substitute the expression for E[v] in (7), move the term v(x) outside the minimization operator (since it does not depend on u), and divide by Δ. Suppressing x, u, k on the right hand side, we have

$$\frac{v(x,k) - v(x,k+1)}{\Delta} = \min_u \big\{ \ell + f^T v_x + \tfrac{1}{2}\, \mathrm{tr}(S\, v_{xx}) + O(\Delta) \big\}$$

Recall that t = kΔ, and consider the optimal value function v(x, t) defined in continuous time. The left hand side in the above equation is then

$$\frac{v(x,t) - v(x,t+\Delta)}{\Delta}$$

In the limit Δ → 0 the latter expression becomes the negative time derivative, which we denote −v_t. Thus, for 0 ≤ t < t_f, we obtain

$$-v_t(x,t) = \min_{u \in \mathcal{U}(x)} \big\{ \ell(x,u,t) + f(x,u)^T v_x(x,t) + \tfrac{1}{2}\, \mathrm{tr}\big(S(x,u)\, v_{xx}(x,t)\big) \big\} \tag{8}$$

with boundary condition v(x, t_f) = h(x). This is the Hamilton-Jacobi-Bellman (HJB) equation for the finite-horizon problem.

An alternative is the infinite-horizon discounted-cost formulation, with total cost

$$J(x(\cdot), u(\cdot)) = \int_0^\infty \exp(-\alpha t)\, \ell(x(t), u(t))\, dt$$

with α > 0 being the discount factor. Intuitively this says that future costs are less costly (whatever that means). Here we do not have a final cost h(x), and the cost rate ℓ(x, u) no longer depends on time explicitly. The HJB equation for the optimal value function becomes

$$\alpha\, v(x) = \min_{u \in \mathcal{U}(x)} \big\{ \ell(x,u) + f(x,u)^T v_x(x) + \tfrac{1}{2}\, \mathrm{tr}\big(S(x,u)\, v_{xx}(x)\big) \big\} \tag{10}$$

Another alternative is the average-cost-per-stage formulation, with total cost

$$J(x(\cdot), u(\cdot)) = \lim_{t_f \to \infty} \frac{1}{t_f}\, E\left[\int_0^{t_f} \ell(x(t), u(t))\, dt\right]$$

In this case the HJB equation for the optimal value function is

$$\lambda = \min_{u \in \mathcal{U}(x)} \big\{ \ell(x,u) + f(x,u)^T v_x(x) + \tfrac{1}{2}\, \mathrm{tr}\big(S(x,u)\, v_{xx}(x)\big) \big\} \tag{11}$$

where λ ≥ 0 is the average cost per stage, and v now has the meaning of a differential value function.

Equations (10) and (11) do not depend on time, which makes them more amenable to numerical approximations in the sense that we do not need to store a copy of the optimal value function at each point in time. From another point of view, however, (8) may be easier to solve numerically. This is because dynamic programming can be performed in a single backward pass through time: initialize v(x, t_f) = h(x) and simply integrate (8) backward in time, computing the spatial derivatives numerically along the way. In contrast, (10) and (11) call for relaxation methods (such as value iteration or policy iteration) which in the continuous-state case may take an arbitrary number of iterations to converge. Relaxation methods are of course guaranteed to converge in a finite number of iterations for any finite state approximation, but that number may increase rapidly as the discretization of the continuous state space is refined.
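To illustrate the backward pass for equation (8), the following sketch integrates the finite-horizon HJB backward in time for a hypothetical one-dimensional problem (dx = u dt + sigma dw, quadratic cost rate, unconstrained control), approximating v_x and v_xx by finite differences on a grid. All parameter values are assumptions, and the explicit scheme is chosen for clarity rather than numerical robustness, so treat the output as a rough estimate.

```python
import numpy as np

# Hypothetical 1-D problem: dx = u dt + sigma dw,
# cost rate l = 0.5*(q*x^2 + r*u^2), final cost h = 0.5*qf*x^2.
q, r, qf, sigma, tf = 1.0, 1.0, 2.0, 0.3, 1.0
nx, nt = 241, 2000
x = np.linspace(-4.0, 4.0, nx)
dx = x[1] - x[0]
dt = tf / nt

v = 0.5 * qf * x**2          # boundary condition v(x, tf) = h(x)
for k in range(nt):          # integrate (8) backward in time
    vx = np.gradient(v, dx)              # spatial derivative v_x
    vxx = np.gradient(vx, dx)            # second derivative v_xx
    # Unconstrained minimization of l + u*vx + 0.5*sigma^2*vxx gives u* = -vx/r,
    # so the minimized right-hand side of (8) is:
    rhs = 0.5 * q * x**2 - vx**2 / (2.0 * r) + 0.5 * sigma**2 * vxx
    v = v + dt * rhs          # -v_t = rhs, so v(., t - dt) = v(., t) + dt*rhs

print("v(0, 0) ~", v[nx // 2])  # approximate optimal cost-to-go from x = 0 at t = 0
```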
3 Deterministic control: Pontryagin's maximum principle

Optimal control theory is based on two fundamental ideas. One is dynamic programming and the associated optimality principle, introduced by Bellman in the United States. The other is the maximum principle, introduced by Pontryagin in the Soviet Union. The maximum principle applies only to deterministic problems, and yields the same solutions as dynamic programming. Unlike dynamic programming, however, the maximum principle avoids the curse of dimensionality. Here we derive the maximum principle indirectly via the HJB equation, and directly via Lagrange multipliers. We also clarify its relationship to classical mechanics.

3.1 Derivation via the HJB equations

For deterministic dynamics ẋ = f(x, u) the finite-horizon HJB equation (8) becomes

$$-v_t(x,t) = \min_{u \in \mathcal{U}(x)} \big\{ \ell(x,u,t) + f(x,u)^T v_x(x,t) \big\} \tag{12}$$

Suppose a solution to the minimization problem in (12) is given by an optimal control law π(x, t) which is differentiable in x. Setting u = π(x, t) we can drop the min operator in (12) and write

$$0 = v_t(x,t) + \ell(x, \pi(x,t), t) + f(x, \pi(x,t))^T v_x(x,t)$$

This equation is valid for all x, and therefore can be differentiated w.r.t. x to obtain (in shortcut notation)

$$0 = v_{tx} + \ell_x + \pi_x^T \ell_u + \big(f_x^T + \pi_x^T f_u^T\big) v_x + v_{xx} f$$

Regrouping terms, and using the identity $\dot{v}_x = v_{xt} + v_{xx}\dot{x} = v_{xt} + v_{xx} f$, yields

$$0 = \dot{v}_x + \ell_x + f_x^T v_x + \pi_x^T \big(\ell_u + f_u^T v_x\big)$$

We now make a key observation: the term in the brackets is the gradient w.r.t. u of the quantity being minimized w.r.t. u in (12). That gradient is zero (assuming unconstrained minimization), which leaves us with

$$-\dot{v}_x(x,t) = \ell_x(x, \pi(x,t), t) + f_x(x, \pi(x,t))^T v_x(x,t) \tag{13}$$

This may look like a PDE for v, but if we think of v_x as a vector p instead of a gradient of a function which depends on x, then (13) is an ordinary differential equation (ODE) for p. That equation holds along any trajectory generated by π(x, t). The vector p is called the costate vector.

We are now ready to formulate the maximum principle. If {x(t), u(t) : 0 ≤ t ≤ t_f} is an optimal state-control trajectory