RL and ObC
Lecture 1: Introduction
1 Optimal Control
2 Adaptive Control
3 Reinforcement Learning
4 RL Applications
Optimal Control
• Minimizes a prescribed performance function
• Usually designed offline by solving the HJB equation
• Uses complete knowledge of the system
• Solving the nonlinear HJB equation is often hard or impossible

Adaptive Control
• Learns online via feedback
• Not usually designed to be optimal
• First identifies the system, then uses the model
Linear Quadratic Regulator (LQR)
The LQR is the most basic optimal controller for LTI systems. Consider the following system
ẋ = Ax(t) + Bu(t)
where the state x(t) ∈ R^n and the control input u(t) ∈ R^m. The system is associated with the infinite-horizon quadratic cost function
V(x(t_0), t_0) = ∫_{t_0}^{∞} ( x^T(τ) Q x(τ) + u^T(τ) R u(τ) ) dτ
The solution is given by u(t) = −Kx(t), where the gain matrix is
K = R^{−1} B^T P
and P is a positive definite solution of the Algebraic Riccati Equation (ARE)
A^T P + PA + Q − P B R^{−1} B^T P = 0
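As a numerical aside, the ARE and the LQR gain can be computed with SciPy. The A, B, Q, R values below (a double integrator) are chosen arbitrarily for illustration, not taken from the lecture.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative double-integrator example (A, B, Q, R chosen arbitrarily).
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)          # state weighting
R = np.array([[1.0]])  # control weighting

# Solve A^T P + P A + Q - P B R^{-1} B^T P = 0 for P > 0.
P = solve_continuous_are(A, B, Q, R)

# LQR gain K = R^{-1} B^T P, giving the control law u = -K x.
K = np.linalg.solve(R, B.T @ P)

# The closed-loop matrix A - B K should be Hurwitz (eigenvalues in the left half-plane).
print("K =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))
```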
Linear Quadratic Zero-sum Games
Now add a disturbance d(t) acting through the matrix D; the control minimizes and the disturbance maximizes the quadratic cost with attenuation level γ:
ẋ = Ax(t) + Bu(t) + Dd(t)
The saddle-point policies are
u(x) = −R^{−1} B^T P x = −Kx
d(x) = (1/γ²) D^T P x = Lx
where P is the solution to the game ARE
0 = A^T P + PA + Q − P B R^{−1} B^T P + (1/γ²) P D D^T P
• There exists a solution P > 0 if (A, B) is stabilizable, (A, √Q) is observable, and γ > γ*, the H-infinity gain.
• This is an offline solution that requires complete knowledge of the system dynamics (A, B, D).
• If the system dynamics (A, B, D) change or the performance index (Q, R, γ) varies, a new optimal control solution is needed.
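A possible numerical sketch for the game ARE: fold the disturbance term into the state weighting and repeatedly solve a standard ARE. Any fixed point of this iteration satisfies the game ARE; convergence is simply assumed here for γ above the H-infinity gain, and the residual is checked at the end. All matrices and the value of γ are made up for illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative system matrices (chosen arbitrarily for this sketch).
A = np.array([[-1.0, 2.0],
              [0.0, -3.0]])
B = np.array([[0.0],
              [1.0]])
D = np.array([[1.0],
              [0.5]])
Q = np.eye(2)
R = np.array([[1.0]])
gamma = 5.0  # assumed to be above the H-infinity gain gamma*

# Fixed-point sketch: fold the disturbance term into the state weighting and
# re-solve the standard ARE.  Any fixed point satisfies the game ARE;
# convergence is assumed here for large enough gamma.
P = solve_continuous_are(A, B, Q, R)
for _ in range(200):
    Q_eff = Q + (P @ D @ D.T @ P) / gamma**2
    P_next = solve_continuous_are(A, B, Q_eff, R)
    if np.linalg.norm(P_next - P) < 1e-10:
        P = P_next
        break
    P = P_next

K = np.linalg.solve(R, B.T @ P)   # control gain,     u = -K x
L = (D.T @ P) / gamma**2          # disturbance gain, d =  L x

# Residual of the game ARE should be ~0 if the iteration converged.
residual = (A.T @ P + P @ A + Q
            - P @ B @ np.linalg.solve(R, B.T @ P)
            + (P @ D @ D.T @ P) / gamma**2)
print("||game ARE residual|| =", np.linalg.norm(residual))
```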
2 Adaptive Control
Model Reference Adaptive Control (MRAC)
Consider the scalar plant
ẋ = ax + bu
where the state x(t) ∈ R, the control input u(t) ∈ R, and the input gain b > 0. It is desired for the plant state to follow the state of a reference model given by
ẋm = −am xm + bm r
where r(t) ∈ R is the reference input signal. Take the controller structure as
u = −kx + dr
which has a feedback term and a feedforward term. The gains k and d are unknown and are to be determined so that the state tracking error e(t) = x(t) − xm(t) goes to zero.
If the gains are updated according to the tuning laws
k̇ = αex,   ḋ = −βer
where α, β > 0 are tuning parameters, then the tracking error e(t) goes to zero with time.
• The feedback gain k is tuned by the product of the state x(t) and the tracking error e(t).
• The feedforward gain d is tuned by the product of the reference input r(t) and the tracking error e(t).
• The plant dynamics (a, b) are not needed in the tuning laws! (A small simulation sketch follows below.)
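A minimal Euler-integration sketch of these tuning laws; the plant and model parameters (a, b, am, bm, α, β) and the square-wave reference are arbitrary choices for illustration. Note that a and b are used only to simulate the plant, never inside the adaptation laws.

```python
import numpy as np

# Plant and reference-model parameters (illustrative values only).
a, b = 1.0, 2.0           # unknown to the controller, used only to simulate the plant
am, bm = 3.0, 3.0         # stable reference model:  xm_dot = -am*xm + bm*r
alpha, beta = 10.0, 10.0  # adaptation gains

dt, T = 1e-3, 20.0
x, xm, k, d = 0.0, 0.0, 0.0, 0.0

for step in range(int(T / dt)):
    t = step * dt
    r = np.sign(np.sin(0.5 * t))   # square-wave reference input
    u = -k * x + d * r             # controller: feedback + feedforward terms
    e = x - xm                     # tracking error

    # Adaptation laws: k_dot = alpha*e*x, d_dot = -beta*e*r (no knowledge of a, b).
    k += dt * alpha * e * x
    d += dt * (-beta * e * r)

    # Euler integration of plant and reference model.
    x += dt * (a * x + b * u)
    xm += dt * (-am * xm + bm * r)

print(f"final |e| = {abs(x - xm):.4f}, k = {k:.3f}, d = {d:.3f}")
```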
3 Reinforcement Learning
System terminology
• Agent → Controller or decision maker
• Action → Control or decision
• Environment → Dynamic system
Learning/Planning terminology
• Learning → Solving a problem with simulation
• Self-learning → Solving a problem with simulation-based policy iteration
• Planning vs Learning → Solving a problem with model-based vs. model-free simulation
Value Functions
• Value functions measure the goodness of a particular state or state/action pair: how good it is for the agent to be in a particular state, or to execute a particular action in a particular state, under a given policy.
• Optimal value functions measure the best possible goodness of states
or state/action pairs under all possible policies.
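A small illustration in the cost-minimization convention used on the following slides: the value of a fixed policy is obtained by solving the linear Bellman equations, and the optimal value function by value iteration. The 3-state, 2-action MDP, its transition probabilities, and its stage costs are all made up for this sketch.

```python
import numpy as np

# Toy MDP with 3 states and 2 actions (all numbers are made up for illustration).
# P[a][s, s'] = transition probability, C[a][s] = stage cost of action a in state s.
P = [np.array([[0.8, 0.2, 0.0],
               [0.0, 0.6, 0.4],
               [0.1, 0.0, 0.9]]),
     np.array([[0.2, 0.8, 0.0],
               [0.3, 0.0, 0.7],
               [0.0, 0.5, 0.5]])]
C = [np.array([1.0, 0.0, 2.0]),
     np.array([0.0, 1.5, 0.5])]
gamma = 0.9          # discount factor
policy = [0, 1, 0]   # a fixed deterministic policy: action chosen in each state

# Value of the fixed policy: J^pi solves the linear system J = C^pi + gamma * P^pi J.
P_pi = np.array([P[policy[s]][s] for s in range(3)])
C_pi = np.array([C[policy[s]][s] for s in range(3)])
J_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, C_pi)

# Optimal value function via value iteration: J*(s) = min_a [ C(s,a) + gamma * sum_s' P(s'|s,a) J*(s') ].
J = np.zeros(3)
for _ in range(1000):
    J = np.min([C[a] + gamma * P[a] @ J for a in range(2)], axis=0)

print("J^pi =", J_pi)   # cost of the given policy from each state
print("J*   =", J)      # best achievable cost over all policies (J* <= J^pi elementwise)
```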
Optimal decision
• At the current state, apply the decision that minimizes
  Current stage cost + J*(Next state)
  where J*(Next state) is the optimal future cost, starting from the next state
• This defines the optimal policy: an optimal control to apply at each state
Principle of optimality
Let {u_0*, ..., u_{N−1}*} be an optimal control sequence with corresponding state sequence {x_0*, ..., x_N*}. Consider the tail subproblem that starts at x_k* at time k; then the tail control sequence {u_k*, ..., u_{N−1}*} is optimal for this tail subproblem.
By the principle of optimality, start with
J_N*(x_N) = g_N(x_N), for all x_N
and, for k = N − 1, ..., 0, let
J_k*(x_k) = min_{u_k ∈ U_k(x_k)} [ g_k(x_k, u_k) + J_{k+1}*(f_k(x_k, u_k)) ], for all x_k.
Then the optimal cost J*(x_0) is obtained at the last step: J_0*(x_0) = J*(x_0).
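A compact sketch of this backward recursion on a made-up finite problem; the state set, control set, dynamics, and costs are all illustrative.

```python
# Toy finite-horizon problem: state x in {0,...,4}, control u in {-1, 0, +1},
# dynamics f(x,u) = clip(x + u, 0, 4), stage cost g(x,u) = x^2 + |u|,
# terminal cost gN(x) = 10*|x - 2|.  All numbers are illustrative.
N = 5
states = range(5)
controls = (-1, 0, 1)

f = lambda x, u: min(max(x + u, 0), 4)
g = lambda x, u: x**2 + abs(u)
gN = lambda x: 10 * abs(x - 2)

# Backward DP recursion: J_N = gN, then J_k(x) = min_u [ g(x,u) + J_{k+1}(f(x,u)) ].
J = [None] * (N + 1)
policy = [None] * N
J[N] = {x: gN(x) for x in states}
for k in range(N - 1, -1, -1):
    J[k], policy[k] = {}, {}
    for x in states:
        costs = {u: g(x, u) + J[k + 1][f(x, u)] for u in controls}
        policy[k][x] = min(costs, key=costs.get)   # argmin over admissible controls
        J[k][x] = costs[policy[k][x]]

print("optimal cost from x0 = 4:", J[0][4])
print("optimal first control at x0 = 4:", policy[0][4])
```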
Constraints via Infinite Cost Values
A constrained optimal control problem can be rewritten as the equivalent unconstrained formulation
min_{s,a} Σ_{k=0}^{N−1} c̄(s_k, a_k) + Ē(s_N)
where the modified stage cost c̄ and terminal cost Ē are set to +∞ for any state/action pair that violates a constraint.
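A tiny illustration of this idea, reusing the style of the DP sketch above: the (made-up) constraint x ≥ 1 is enforced purely by assigning infinite cost values, and the backward recursion itself is unchanged.

```python
import math

# Same toy problem as above, but with an illustrative state constraint x >= 1,
# enforced purely through infinite values in the modified costs c_bar and E_bar.
N = 5
states = range(5)
controls = (-1, 0, 1)
f = lambda x, u: min(max(x + u, 0), 4)

def c_bar(x, u):
    # Infinite cost if the move leads to a constraint-violating state, ordinary cost otherwise.
    return math.inf if f(x, u) < 1 else x**2 + abs(u)

E_bar = lambda x: math.inf if x < 1 else 10 * abs(x - 2)   # modified terminal cost

# The backward DP recursion is unchanged; infeasible choices never win the minimization.
J = {x: E_bar(x) for x in states}
for _ in range(N):
    J = {x: min(c_bar(x, u) + J[f(x, u)] for u in controls) for x in states}

print(J)   # cost-to-go of each state under the constraint
```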
George Box
"All models are wrong but some models are useful."
4 RL Applications
Historical highlights
• Exact DP, Optimal Control - Bellman, Shannon, others 1950s
• AI/RL and Decision Making ideas - late 80s and early 90s
• Backgammon programs - Tesauro, 1992
• Algorithm era, analysis, applications, books - mid 90s
• Machine Learning, Big Data, Neural Networks - mid 2000s
• AlphaGo and AlphaZero - Deepmind, 2016, 2017
• DARPA AlphaDogFight against real F-16 pilots - 2019, 2020
https://www.youtube.com/watch?v=kopoLzvh5jY
• Baspinar, B., Koyuncu, E., "Survivability based Optimal Air Combat Mission Planning with Reinforcement Learning," IEEE Conference on Control Technology and Applications (CCTA), Copenhagen, Denmark, August 21-24, 2018.
• Baspinar, B., Koyuncu, E., "Assessment of Aerial Combat Game via Optimization-Based Receding Horizon Control," IEEE Access, vol. 8, pp. 35853-35863, 2020, doi: 10.1109/ACCESS.2020.2974792.
• Baspinar, B., Koyuncu, E., "Evaluation of Two-vs-One Air Combats Using Hybrid Maneuver-Based Framework and Security Strategy Approach," Journal of Aeronautics and Space Technologies, vol. 12, no. 1, pp. 95-107, January 2019.
• Baspinar, B., Koyuncu, E., "Differential Flatness-based Optimal Air Combat Maneuver Strategy Generation," AIAA Science and Technology Forum and Exposition (AIAA SciTech 2019), San Diego, California, 7-11 January 2019.
• Baspinar, B., Koyuncu, E., "Aerial Combat Simulation Environment for One-on-One Engagement," AIAA SciTech Forum and Exposition: Modelling and Simulation Technologies, Gaylord Palms, Kissimmee, FL, 8-12 January 2018.
https://www.youtube.com/watch?v=8IiLQFQ3V0E
• Hasanzade, M., Koyuncu, E., "A Dynamically Feasible Fast Replanning Strategy with Deep Reinforcement Learning," Journal of Intelligent and Robotic Systems, vol. 101, issue 1, 2021.