
Reinforcement Learning and Optimization-based Control

Assoc. Prof. Dr. Emre Koyuncu

Department of Aeronautics Engineering


Istanbul Technical University

Lecture 1: Introduction

E. Koyuncu (ITU) RL and ObC Lecture 1 1 / 34


Table of Contents

1 Optimal Control and RL

2 Adaptive Control

3 Reinforcement Learning

4 RL Applications

5 About this Course

E. Koyuncu (ITU) RL and ObC Lecture 1 2 / 34


Table of Contents

1 Optimal Control and RL

2 Adaptive Control

3 Reinforcement Learning

4 RL Applications

5 About this Course

E. Koyuncu (ITU) RL and ObC Lecture 1 3 / 34


Adaptive and Optimal Control

Optimal Control
• Minimizes a prescribed performance function
• Usually designed offline by solving the HJB equation
• Uses complete knowledge of the system
• Solving the nonlinear HJB equation is often hard or impossible

Adaptive Control
• Learns the control function online via feedback
• Not usually designed to be optimal
• First identifies the system, then uses the model

E. Koyuncu (ITU) RL and ObC Lecture 1 4 / 34


MPC and RL
• Both are frameworks to solve sequential decision making problems
• Both automatically design controllers based on desired outcomes
(reward/cost, constraints, etc.)
Reinforcement Learning
• Controller directly learned from data via exploration and exploitation
• Works with both continuous and binary/sparse rewards
• Constraints imposed via penalties
• Mostly parameterized controllers (Deep Learning integrated); cheap to evaluate
• History usually included in the definition of the state

Model Predictive Control
• System identification precedes control implementation; model fixed during execution
• Typically convex stage costs
• Constraints imposed explicitly
• Online optimization over a prediction horizon - expensive?
• Usually combined with a state estimator
E. Koyuncu (ITU) RL and ObC Lecture 1 5 / 34
Linear Quadratic Regulators (LQR)

The most basic sort of optimal controller for LTI systems. Consider the following system

ẋ(t) = A x(t) + B u(t)

where the state x(t) ∈ R^n and the control input u(t) ∈ R^m. The system is associated with the infinite-horizon quadratic cost function

V(x(t_0), t_0) = ∫_{t_0}^∞ ( x^T(τ) Q x(τ) + u^T(τ) R u(τ) ) dτ

with weighting matrices Q ≥ 0, R > 0.

• it is assumed that (A, B) is stabilizable: there exists a control input that makes the system stable
• (A, √Q) is detectable: unstable modes are observable through the output y = √Q x

E. Koyuncu (ITU) RL and ObC Lecture 1 6 / 34


Linear Quadratic Regulators (LQR)
The LQR optimal control problem requires finding the policy that minimizes the cost

u*(t) = arg min_{u(t), t_0 ≤ t < ∞} V(t_0, x(t_0), u(t))

The solution is given by u(t) = −K x(t), where the gain matrix is

K = R^{-1} B^T P

and P is a positive definite solution of the Algebraic Riccati Equation (ARE)

A^T P + P A + Q − P B R^{-1} B^T P = 0

• under the stabilizability and detectability conditions there is a unique positive semi-definite solution
• the closed-loop system A − BK is then asymptotically stable
• this is an offline solution that requires complete knowledge of the system dynamics
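
The ARE above can be solved numerically with standard tools. Below is a minimal sketch (not part of the lecture) using SciPy's solve_continuous_are; the double-integrator system and the weighting values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])          # double integrator (assumed example)
B = np.array([[0.0],
              [1.0]])
Q = np.diag([1.0, 0.1])             # state weighting, Q >= 0
R = np.array([[0.01]])              # control weighting, R > 0

# Solve A^T P + P A + Q - P B R^{-1} B^T P = 0 for the stabilizing P
P = solve_continuous_are(A, B, Q, R)

# Optimal state-feedback gain, u = -K x
K = np.linalg.solve(R, B.T @ P)

# Closed-loop matrix A - B K should be Hurwitz (eigenvalues in the open left half-plane)
print("K =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))
```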
E. Koyuncu (ITU) RL and ObC Lecture 1 7 / 34
Linear Quadratic Zero-sum Games

The LQ zero-sum (ZS) games have the following linear dynamics

ẋ(t) = A x(t) + B u(t) + D d(t)

where the state x(t) ∈ R^n, control input u(t) ∈ R^m, and disturbance d(t) ∈ R^k. The system is associated with the infinite-horizon quadratic cost function

V(x(t), u, d) = (1/2) ∫_t^∞ ( x^T Q x + u^T R u − γ^2 ‖d‖^2 ) dτ ≡ ∫_t^∞ r(x, u, d) dτ

with the control weighting matrix R = R^T > 0 and a scalar γ > 0.

E. Koyuncu (ITU) RL and ObC Lecture 1 8 / 34


Linear Quadratic Zero-sum Games
The LQ-ZS games require finding the control policy that minimizes the cost with respect to the control and maximizes the cost with respect to the disturbance

V*(x(0)) = min_u max_d J(x(0), u, d)
         = min_u max_d ∫_0^∞ ( x^T Q x + u^T R u − γ^2 ‖d‖^2 ) dt

The solution of this optimal control problem is given by

u(x) = −R^{-1} B^T P x = −K x
d(x) = (1/γ^2) D^T P x = L x

where P is the solution to the game ARE

0 = A^T P + P A + Q − P B R^{-1} B^T P + (1/γ^2) P D D^T P
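
As a complement, here is a minimal sketch (not from the lecture) of one standard way to solve the game ARE above: take the stable invariant subspace of the associated Hamiltonian matrix via an ordered real Schur form. The helper name solve_game_are and the example matrices are assumptions.

```python
import numpy as np
from scipy.linalg import schur

def solve_game_are(A, B, D, Q, R, gamma):
    """Solve 0 = A'P + PA + Q - P B R^{-1} B' P + (1/gamma^2) P D D' P for P."""
    n = A.shape[0]
    S = B @ np.linalg.solve(R, B.T) - (1.0 / gamma**2) * (D @ D.T)
    H = np.block([[A, -S],
                  [-Q, -A.T]])
    # Order the real Schur form so the stable (left half-plane) eigenvalues come first
    T, Z, sdim = schur(H, output='real', sort='lhp')
    assert sdim == n, "no n-dimensional stable subspace: check gamma > gamma* and stabilizability"
    X1, X2 = Z[:n, :n], Z[n:, :n]
    return X2 @ np.linalg.inv(X1)          # P = X2 X1^{-1}

# illustrative data (assumed, not from the lecture)
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q = np.eye(2); R = np.array([[1.0]]); gamma = 2.0

P = solve_game_are(A, B, D, Q, R, gamma)
K = np.linalg.solve(R, B.T @ P)            # u = -K x
L = (1.0 / gamma**2) * D.T @ P             # d = L x
print("P =\n", P)
```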
E. Koyuncu (ITU) RL and ObC Lecture 1 9 / 34
Linear Quadratic Zero-sum Games


• There exists a solution P > 0 if (A, B) is stabilizable, (A, √Q) is observable, and γ > γ*, the H-infinity gain.
• this is an offline solution that requires complete knowledge of the system dynamics (A, B, D)
• if the system dynamics (A, B, D) change or the performance index (Q, R, γ) varies, a new optimal control solution is needed.

E. Koyuncu (ITU) RL and ObC Lecture 1 10 / 34


Table of Contents

1 Optimal Control and RL

2 Adaptive Control

3 Reinforcement Learning

4 RL Applications

5 About this Course

E. Koyuncu (ITU) RL and ObC Lecture 1 11 / 34


Model Reference Adaptive Controller (MRAC)

Consider the simple scalar case

ẋ = a x + b u

where the state x(t) ∈ R, the control input u(t) ∈ R, and the input gain b > 0. It is desired for the plant state to follow the state of a reference model given by

ẋ_m = −a_m x_m + b_m r

where r(t) ∈ R is the reference input signal. Take the controller structure as

u = −k x + d r

which has a feedback term and a feedforward term. k and d are unknown and are to be determined so that the state tracking error e(t) = x(t) − x_m(t) goes to zero.
E. Koyuncu (ITU) RL and ObC Lecture 1 12 / 34


Model Reference Adaptive Controller (MRAC)

E. Koyuncu (ITU) RL and ObC Lecture 1 13 / 34


Model Reference Adaptive Controller (MRAC)

Tune the controller parameters online. E.g., using Lyapunov techniques, the parameters are tuned according to

k̇ = α e x,   ḋ = −β e r

where α, β > 0 are tuning parameters; then the tracking error e(t) goes to zero with time.
• the feedback gain k is tuned by the product of the state x(t) with the tracking error e(t)
• the feedforward gain d is tuned by the product of the reference input r(t) with the tracking error e(t)
• the plant dynamics (a, b) are not needed in the tuning laws!
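
A minimal simulation sketch of the scalar MRAC above (not from the lecture); the plant, reference model, adaptation gains and square-wave reference are illustrative assumptions, and forward-Euler integration is used for simplicity.

```python
import numpy as np

a, b = 1.0, 2.0            # "unknown" plant, used only to simulate x
a_m, b_m = 4.0, 4.0        # reference model: xm_dot = -a_m*xm + b_m*r
alpha, beta = 10.0, 10.0   # adaptation gains (assumed values)
dt, T = 1e-3, 20.0

x = xm = 0.0
k = d = 0.0                # adaptive gains, initialized with no knowledge of (a, b)
for step in range(int(T / dt)):
    t = step * dt
    r = 1.0 if (t % 10.0) < 5.0 else -1.0      # square-wave reference
    u = -k * x + d * r                         # controller structure
    e = x - xm                                 # tracking error
    # tuning laws: k_dot = alpha*e*x, d_dot = -beta*e*r
    k += dt * alpha * e * x
    d += dt * (-beta * e * r)
    # plant and reference model (forward Euler)
    x  += dt * (a * x + b * u)
    xm += dt * (-a_m * xm + b_m * r)

# e(t) -> 0 is guaranteed; gain convergence to the ideal values additionally
# needs a persistently exciting reference r(t)
print("final tracking error:", x - xm)
print("learned gains k, d:", k, d, " ideal:", (a + a_m) / b, b_m / b)
```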

E. Koyuncu (ITU) RL and ObC Lecture 1 14 / 34


Table of Contents

1 Optimal Control and RL

2 Adaptive Control

3 Reinforcement Learning

4 RL Applications

5 About this Course

E. Koyuncu (ITU) RL and ObC Lecture 1 15 / 34


Reinforcement Learning
RL has close connections to both optimal and adaptive control.
• allows designing adaptive controllers that learn online and in real time
• provides solutions to user-prescribed optimal control problems

E.g., the actor-critic structure:
• policy evaluation is executed by the critic
• policy improvement is performed by the actor
• the critic determines how close to optimal the current action is
• the actor modifies the control policy based on the value function
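
The same evaluation/improvement split can be illustrated in the simplest tabular setting. Below is a minimal sketch (not from the lecture) of policy iteration on a toy MDP, where the "critic" step solves for the value of the current policy and the "actor" step improves the policy greedily; the MDP data are illustrative assumptions.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[a, s, s'] transition probabilities, R[s, a] stage rewards (toy values)
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]   # action 0: "stay"
P[1] = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]   # action 1: "move right"
R = np.array([[0.0, -0.1], [0.0, -0.1], [1.0, 1.0]])         # state 2 is rewarding

policy = np.zeros(n_states, dtype=int)
for _ in range(50):
    # critic: policy evaluation -- solve V = R_pi + gamma * P_pi V exactly
    P_pi = P[policy, np.arange(n_states), :]
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # actor: policy improvement -- act greedily wrt the evaluated value function
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy:", policy, "values:", V)
```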
E. Koyuncu (ITU) RL and ObC Lecture 1 16 / 34
AI/RL vs Control Terminology
RL uses max value, Control uses min cost
• Reward of a stage → Cost of a stage
• State value → State cost
• Value function → Cost function

System terminology
• Agent → Controller or decision maker
• Action → Control or decision
• Environment → Dynamic system

Learning/Planning terminology
• Learning → Solving a problem with simulation
• Self-learning → Solving a problem with simulation-based policy iteration
• Planning vs Learning → Solving a problem with model-based vs model-free simulations
E. Koyuncu (ITU) RL and ObC Lecture 1 17 / 34
Value Functions
• Value functions measure the goodness of a particular state or state/action pair: how good it is for the agent to be in a particular state, or to execute a particular action at a particular state, for a given policy.
• Optimal value functions measure the best possible goodness of states or state/action pairs under all possible policies.

• Prediction: for a given policy, estimate the state and state/action value functions
• Control (Optimal): estimate the optimal state and state/action value functions
E. Koyuncu (ITU) RL and ObC Lecture 1 18 / 34
Sequential Decision

Optimal decision
• At the current state, apply the decision that minimizes
Current stage cost + J*(Next state)
where J*(Next state) is the optimal future cost starting from the next state
• This defines the optimal policy: an optimal control to apply at each state

E. Koyuncu (ITU) RL and ObC Lecture 1 19 / 34


Principle of Optimality

Principle of optimality
Let {u_0*, . . . , u_{N−1}*} be an optimal control sequence with corresponding state sequence {x_0*, . . . , x_N*}. Consider the tail subproblem that starts at x_k* at time k and minimizes over {u_k, . . . , u_{N−1}} the cost-to-go from k to N,

g_k(x_k*, u_k) + Σ_{m=k+1}^{N−1} g_m(x_m, u_m) + g_N(x_N)

Then the tail optimal control sequence {u_k*, . . . , u_{N−1}*} is optimal for the tail subproblem.
E. Koyuncu (ITU) RL and ObC Lecture 1 20 / 34
Dynamic Programming
Solve all the tail subproblems of a given time length using the solution of all the tail subproblems of shorter time length.

By the principle of optimality
• Consider every possible u_k and solve the tail subproblem that starts at the next state x_{k+1} = f_k(x_k, u_k)
• Optimize over all u_k

DP algorithm (by the principle of optimality)
Start with

J_N*(x_N) = g_N(x_N), for all x_N

and, going backwards, for k = N − 1, . . . , 0, let

J_k*(x_k) = min_{u_k ∈ U_k(x_k)} [ g_k(x_k, u_k) + J_{k+1}*( f_k(x_k, u_k) ) ], for all x_k.

The optimal cost J*(x_0) is then obtained at the last step: J_0*(x_0) = J*(x_0).
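
For concreteness, a minimal sketch (not from the lecture) of the backward recursion above on a small finite-horizon problem with deterministic dynamics; the horizon, state/action counts and random cost tables are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_x, n_u = 5, 4, 3                          # horizon, number of states, actions
g  = rng.uniform(0, 1, size=(N, n_x, n_u))     # stage costs g_k(x, u)
f  = rng.integers(0, n_x, size=(N, n_x, n_u))  # deterministic dynamics x_{k+1} = f_k(x, u)
gN = rng.uniform(0, 1, size=n_x)               # terminal cost g_N(x)

J  = [None] * (N + 1)                          # optimal cost-to-go tables J_k*(x)
mu = [None] * N                                # optimal policy mu_k(x)
J[N] = gN
for k in range(N - 1, -1, -1):                 # go backwards: k = N-1, ..., 0
    Qk    = g[k] + J[k + 1][f[k]]              # g_k(x,u) + J*_{k+1}(f_k(x,u)) for all (x,u)
    mu[k] = Qk.argmin(axis=1)
    J[k]  = Qk.min(axis=1)

print("optimal costs J_0*(x):", J[0])
print("first-stage policy mu_0(x):", mu[0])
```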
E. Koyuncu (ITU) RL and ObC Lecture 1 21 / 34
Constraints via Infinite Cost Values

Can assign infinite cost to infeasible points, using the extended reals

R̄ := R ∪ {∞, −∞}

Constrained Optimal Control Problem
min_{s,a} Σ_{k=0}^{N−1} c(s_k, a_k) + E(s_N)
s.t. s_0 = s̄_0
     s_{k+1} = f(s_k, a_k), k = 0, . . . , N − 1
     0 ≥ h(s_k, a_k), k = 0, . . . , N − 1
     0 ≥ r(s_N)

Equivalent Unconstrained Formulation
min_{s,a} Σ_{k=0}^{N−1} c̄(s_k, a_k) + Ē(s_N)
s.t. s_0 = s̄_0
     s_{k+1} = f(s_k, a_k), k = 0, . . . , N − 1
with
c̄(s, a) = c(s, a) if h(s, a) ≤ 0, ∞ else
Ē(s) = E(s) if r(s) ≤ 0, ∞ else
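
A minimal sketch (not from the lecture) of the extended-value reformulation above: wrap a stage cost c and constraint function h into c̄ with an infinite penalty on infeasible pairs. The function names and the quadratic example are assumptions.

```python
import numpy as np

def make_extended_cost(c, h):
    """Return c_bar(s, a) = c(s, a) if h(s, a) <= 0, else +inf."""
    def c_bar(s, a):
        return c(s, a) if h(s, a) <= 0 else np.inf
    return c_bar

# example: quadratic stage cost with the box constraint |a| <= 1
c_bar = make_extended_cost(lambda s, a: s**2 + 0.1 * a**2,
                           lambda s, a: abs(a) - 1.0)
print(c_bar(2.0, 0.5), c_bar(2.0, 3.0))   # finite cost vs. +inf for an infeasible action
```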

E. Koyuncu (ITU) RL and ObC Lecture 1 22 / 34


Model-free VS Model-based

George Box
”All models are wrong but some models are useful”

• Due to model error, model-free methods often achieve better policies, though they are more time consuming
• (Adaptivity) We will examine the use of (inaccurate) learned models and ways to accelerate learning without hindering the final policy

E. Koyuncu (ITU) RL and ObC Lecture 1 23 / 34


Bellman’s curse of dimensionality

• Exact Dynamic Programming is an elegant and powerful way to solve any optimal control problem to global optimality, independent of convexity. It can be interpreted as an efficient implementation of an exhaustive search that explores all possible control actions for all possible circumstances.
• However, it requires the tabulation of cost-to-go functions for all possible states s ∈ S. Thus, it is exactly implementable only for discrete state and action spaces, and otherwise requires a discretization of the state space. Its computational complexity grows exponentially in the state dimension. This "curse of dimensionality", a phrase coined by Richard Bellman, unfortunately makes exact DP impossible to apply to systems with larger state dimensions.
• Classical MPC circumvents this problem by restricting itself to finding only the optimal trajectory that starts at the current state s_0.
• Explicit MPC suffers from the same curse of dimensionality as DP.
E. Koyuncu (ITU) RL and ObC Lecture 1 24 / 34
Table of Contents

1 Optimal Control and RL

2 Adaptive Control

3 Reinforcement Learning

4 RL Applications

5 About this Course

E. Koyuncu (ITU) RL and ObC Lecture 1 25 / 34


Reinforcement Learning History

Historical highlights
• Exact DP, Optimal Control - Bellman, Shannon, others 1950s
• AI/RL and Decision Making ideas - late 80s and early 90s
• Backgammon programs - Tesauro, 1992
• Algorithm era, analysis, applications, books - mid 90s
• Machine Learning, Big Data, Neural Networks - mid 2000s
• AlphaGo and AlphaZero - DeepMind, 2016, 2017
• DARPA AlphaDogfight Trials against real F-16 pilots - 2019, 2020

E. Koyuncu (ITU) RL and ObC Lecture 1 26 / 34


Multiagent Reinforcement Learning

OpenAI Hide and Seek game with emergent behaviours


https://openai.com/blog/emergent-tool-use

https://www.youtube.com/watch?v=kopoLzvh5jY

E. Koyuncu (ITU) RL and ObC Lecture 1 27 / 34


RL-based Strategical War Gaming

• Survivability based Optimal Air Combat Mission Planning with Reinforcement Learning, IEEE Conference on Control
Technology and Applications (CCTA), Copenhagen, Denmark, August 21-24, 2018, Baspinar, B., Koyuncu, E.,

E. Koyuncu (ITU) RL and ObC Lecture 1 28 / 34


RL-based Tactical Air Combat

• Assessment of Aerial Combat Game via Optimization-Based Receding Horizon Control, IEEE Access, vol. 8, pp.
35853-35863, 2020, doi: 10.1109/ACCESS.2020.2974792 Baspinar, B., Koyuncu, E.,
• Evaluation of Two-vs-One Air Combats Using Hybrid Maneuver-Based Framework and Security Strategy Approach,
Journal of Aeronautics and Space Technologies, v. 12-1, pg. 95-107, January 2019 Baspinar, B., Koyuncu, E.,
• Differential Flatness-based Optimal Air Combat Maneuver Strategy Generation, AIAA Science and Technology Forum
and Exposition (AIAA SciTech 2019), San Diego, California, 7-11 January 2019 Baspinar B., Koyuncu E.,
• Aerial Combat Simulation Environment for One-on-One Engagement, AIAA SciTech Forum and Exposition: Modelling
and Simulation Technologies, Gaylord Palms, Kissimmee, FL, 8-12 January 2018 Baspinar, B., Koyuncu, E.,

E. Koyuncu (ITU) RL and ObC Lecture 1 29 / 34


RL-based Fast Flight Replanning

https://www.youtube.com/watch?v=8IiLQFQ3V0E
• A Dynamically Feasible Fast Replanning Strategy with Deep Reinforcement Learning, Journal of Intelligent and Robotic
Systems, v. 101, issue 1, 2021 Hasanzade, M., Koyuncu, E.,

E. Koyuncu (ITU) RL and ObC Lecture 1 30 / 34


Table of Contents

1 Optimal Control and RL

2 Adaptive Control

3 Reinforcement Learning

4 RL Applications

5 About this Course

E. Koyuncu (ITU) RL and ObC Lecture 1 31 / 34


Course Topics

• Introduction; Optimal Control; Adaptive Control and RL


• RL and Optimal Control of Discrete Systems
• RL-based Optimal Adaptive Control for Linear Systems
• RL-based Optimal Adaptive Control for Nonlinear Systems
• Policy iteration for continuous-time systems
• Value iteration for continuous-time systems
• RL-based Optimal Adaptive Control with Online Learning
• Online Learning for Zero-sum Games and H-infinity Control
• Online Learning for multiplayer non-zero-sum Games
• RL for Zero-sum Games

E. Koyuncu (ITU) RL and ObC Lecture 1 32 / 34


Grading Policy

• 20% Paper abstract - problem selection and presentation, in class,


Due date is April 15.
• 40% Submission ready paper - 6 pages, including coding
implementation, in IFAC CPHS template - Due date is May 15, strict.
• 40% Paper presentation, including coding implementation - online, in
final exam week.
• Groups of 1 to 3 people

E. Koyuncu (ITU) RL and ObC Lecture 1 33 / 34


IFAC CPHS 2024, Antalya Turkey

E. Koyuncu (ITU) RL and ObC Lecture 1 34 / 34
