
Deep Reinforcement Learning

September 15, 2024

1 Course Description
This graduate-level course offers a comprehensive introduction to Deep Rein-
forcement Learning (DRL) through the lens of control theory, with a strong
focus on its applications in aerospace engineering. Students will explore the fun-
damental principles of reinforcement learning, its synergy with classical control
concepts, and its powerful applications in decision-making and control problems.
The course covers both theoretical foundations and practical implementations,
equipping students with the knowledge to apply DRL techniques to complex
aerospace systems. From autonomous navigation and adaptive flight control to
space robotics and satellite operations, students will learn how DRL is revo-
lutionizing the aerospace industry. By the end of the course, participants will
be able to formulate, implement, and analyze DRL solutions for cutting-edge
aerospace challenges, positioning them at the forefront of this rapidly evolving
field.

2 Prerequisites
• Mathematical maturity

• Programming experience in Python or MATLAB


• Linear systems theory (recommended)

3 Course Schedule
Week 1-2: Introduction to Reinforcement Learning and Control
1. Introduction to RL and its applications in control
2. RL problem formulation: states, actions, rewards

3. Markov Decision Processes (MDPs)


4. Value functions and Bellman equations
5. Exploration vs. exploitation
6. Control system basics and state-space models

Week 3-4: Dynamic Programming, Monte Carlo, and Temporal Difference
1. Policy and value iteration
2. Multi-Armed Bandit (Action selection strategies)

3. Monte Carlo methods


4. Examples
5. Temporal Difference (TD) learning

6. SARSA and Q-learning


7. Examples
Week 5: Function Approximation in RL
1. Introduction to function approximation

2. Linear function approximation


3. Neural networks for RL (Feedforward, RNN, LSTM, CNN)
4. Examples of NN vs linear approximators

Week 6-7: RL and Optimal Control


1. ADP in control theory
2. Actor-Critic method using NNs
3. Linear Quadratic Regulator (LQR)

4. Single neural network in actor-critic structure


5. Model-free control and Integral RL

6. Examples
Week 8: Deep Value-based RL
1. Deep Q-Networks (DQN) (DDQN, Dueling DQN)
2. Experience replay and prioritized experience replay

3. Examples
Week 9-11: Policy Gradient Methods
1. Policy gradient theorem

2. REINFORCE algorithm
3. Actor-Critic methods (A2C, A3C)
4. Deep Deterministic Policy Gradient (DDPG), TD3, SAC
5. Trust Region Policy Optimization (TRPO), Proximal Policy Optimization
(PPO)
Week 12-13: Advanced RL Techniques
1. Distributional Reinforcement Learning
2. Multi-agent RL

3. Imitation learning / Inverse reinforcement learning


4. ODE technique to analyse stability
Week 13-14: RL in Control Systems

1. Model-based RL for control


2. RL for optimal control problems
3. Stochastic Optimal Control and Information Theory
4. Case studies in control applications

Week 15: Advanced Topics and Future Directions


1. Meta-learning in RL
2. Safe RL for control systems

3. Current research trends in RL for control


4. Ethics and societal impacts of RL in control applications

4 Assignments and Grading
• Homework assignments (20%, 5 series)
• Midterm and final exams (40%)
• Final project (25%)
• Weekly quizzes and class participation (15+5%)

5 Course Outcomes
By the end of this course, students will be able to:
• Understand the fundamentals of reinforcement learning and its relation-
ship to control theory
• Formulate RL problems in the context of control systems
• Implement and analyze basic RL as well as advanced DRL algorithms
• Apply DRL techniques to solve control problems, particularly in aerospace
engineering
• Critically evaluate the strengths and limitations of DRL methods in con-
trol applications

6 Textbooks and Resources


Required:
• Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2018.
• Hao Dong, Zihan Ding, Shanghang Zhang, and T. Chang. Deep Reinforcement Learning. Springer Singapore, 2020.
• Morales, Miguel. Grokking Deep Reinforcement Learning. Manning Publications, 2020.
Recommended:
• Nimish Sanghi. Deep Reinforcement Learning with Python, Second Edition. Springer, 2024.
• Ravichandiran, Sudharsan. Deep Reinforcement Learning with Python: Master classic RL, deep RL, distributional RL, inverse RL, and more with OpenAI Gym and TensorFlow. Packt Publishing Ltd, 2020.
• Åström, Karl Johan, and Richard Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, 2021.

Sunday, Aban 27 (Persian calendar): midterm exam

Weeks 1-4: RL in discrete environments (Sutton & Barto)
Week 5: introduction to neural networks and the idea of function approximation
Weeks 6-7: approximate/adaptive optimal control (Approximate Dynamic Programming)
Weeks 8-11/12: Deep RL
Weeks 13-15: Distributional RL, Multi-agent RL, Stochastic Optimal Control (SOC), Meta RL, Safe RL

- Artificial intelligence
- Machine learning
- Reinforcement learning

A brief history:
- Behavioral psychology → 1950s-60s: conditioning experiments on animals
- Optimal control → 1950s, Bellman → Dynamic Programming
Combining these two ideas in the 1980s → RL becomes mainstream → Temporal Difference learning → Watkins 1989: Q-learning
1990s-2000s: growing use of function approximators
2013: Deep Q-Network
2015 onward: policy-gradient methods → DDPG, SAC, TRPO, PPO

RL methods fall into two broad categories:
1- Model-free: the optimal behavior is learned directly from input-output data.
2- Model-based RL: a model of the system is learned from data, and the optimal behavior is then learned from the model.

Another classification:
1- On-policy (e.g., Sarsa)
2- Off-policy (e.g., Q-learning)

Problem definition
Policy: deterministic / stochastic
Action: discrete / continuous
Observation → State: fully observable → MDP; partially observable → POMDP

Markov property (system dynamics):
P(x_{k+1} | x_k, u_k) = P(x_{k+1} | x_0, u_0, x_1, u_1, ..., x_k, u_k)
Transition model: P(x' | x, u)
Reward; the MDP is the tuple (S, A, P, R, γ).

Return (long-term reward):
G_k = r_{k+1} + γ r_{k+2} + γ² r_{k+3} + ⋯
Objective: max_π E_π(G_k) over policies π(u|x).

Value function:
V^π(x) = E_π(G_k | x_k = x) = E_π(Σ_{i=0}^∞ γ^i r_{k+i+1} | x_k = x)
V^π(x) = E_π(r_{k+1} + γ r_{k+2} + ⋯ | x_k = x) = E_π(r_{k+1} + γ G_{k+1})
       = Σ_u π(u|x) [r_{k+1} + γ E_π(G_{k+1} | x_k = x)]
       = Σ_u π(u|x) [r_{k+1} + γ Σ_{x'} P^u_{xx'} E(G_{k+1} | x_{k+1} = x')]
V^π(x) = Σ_u π(u|x) [r_{k+1} + γ Σ_{x'} P^u_{xx'} V^π(x')]
Bellman Expectation Eq.:
V^π(x) = Σ_u π(u|x) (r_{k+1} + γ Σ_{x'} P^u_{xx'} V^π(x')) = E_π(r_{k+1} + γ V^π(x_{k+1}) | x_k = x)

Grid-world example (uniform random policy):
r = −10 if the agent leaves the grid
r = −1 if it stays inside but does not reach the goal
r = +10 if it reaches the goal
Actions: a: up, b: down, c: left, d: right.
For the equiprobable policy:
V(1) = (1/4)[(a: −10) + (b: −1 + V(3)) + (c: −10) + (d: −1 + V(2))] = (1/4)(−22 + V(3) + V(2))
V(2) = (1/4)[(a: −10) + (b: +10) + (c: −1 + V(1)) + (d: −10)] = (1/4)(−11 + V(1))
V(3): by symmetry with state 2, V(3) = (1/4)(−11 + V(1))
Action-Value Function
Q^π(x, u) = E_π(G_k | x_k = x, u_k = u)
V^π(x) = Σ_u π(u|x) Q^π(x, u)
Q^π(x, u) = R(x, u) + γ Σ_{x'} P^u_{xx'} V^π(x')
Q^π(x, u) = R(x, u) + γ Σ_{x'} P^u_{xx'} Σ_{u'} π(u'|x') Q^π(x', u')

Optimal value functions:
V*(x) = max_π V^π(x)
Q*(x, u) = max_π Q^π(x, u)
V*(x) = max_π V^π(x) = max_u Q*(x, u) = max_u (r_{k+1} + E[γ V*(x') | x_k = x, u_k = u])
      = max_u (r_{k+1} + γ Σ_{x'} P^u_{xx'} V*(x'))
Q*(x, u) = max_π Q^π(x, u) = r_{k+1} + E[γ V*(x') | x_k = x, u_k = u]
         = r_{k+1} + γ Σ_{x'} P^u_{xx'} V*(x')
         = r_{k+1} + γ Σ_{x'} P^u_{xx'} max_{u'} Q*(x', u')
u* = arg max_u Q*(x, u)

Same grid-world example, now with the optimal policy:
r = −1 if we stay inside the grid
r = −10 if we leave the grid
r = +10 if we reach the goal
Actions: a: up, b: down, c: left, d: right.
V*(1) = max[−10, −1 + V*(3), −10, −1 + V*(2)]
V*(2) = max[−10, +10 + 0, −1 + V*(1), −10] = V*(3)
V*(1) = max(−1 + V*(2), −10)
V*(2) = max(−10, +10, −1 + V*(1)) = max(−10, +10, −1 + max(−1 + V*(2), −10))
⇒ V*(1) = +9, V*(2) = V*(3) = +10
Comparison: random policy V^π(1) = −7.8, V^π(2) = V^π(3) = −4.7; optimal policy V*(1) = 9, V*(2) = V*(3) = 10.
Policy Evaluation
V^π(x) = Σ_u π(u|x) (r_{k+1} + γ Σ_{x'} P^u_{xx'} V^π(x'))
Iterate, starting from an arbitrary V^π_0:
V^π_{i+1}(x) = Σ_u π(u|x) (r_{k+1} + γ Σ_{x'} P^u_{xx'} V^π_i(x'))   →   stop when V^π_i(x) = V^π_{i+1}(x)
For the grid example, starting from V^π_0(1) = V^π_0(2) = V^π_0(3) = 0:
V^π_1(1) = (1/4)(−10 + (−1) + (−10) + (−1)) = −5.5
V^π_1(2) = (1/4)(−10 + 10 + (−1) + (−10)) = −2.75
V^π_2(1) = (1/4)(−10 + (−1 + (−2.75)) + (−1 + (−2.75)) + (−10)) = −6.9
V^π_2(2) = (1/4)(−10 + 10 + (−10) + (−1 + (−5.5))) = −4.12
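To make the iteration above concrete, here is a minimal Python/NumPy sketch of iterative policy evaluation for a generic finite MDP. The array layout (`P[u, x, x']` for transitions, `R[x, u]` for expected rewards, `pi[x, u]` for the policy) is an assumption for illustration, not something fixed by the notes.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for a finite MDP.

    P[u, x, x'] : transition probabilities (assumed layout)
    R[x, u]     : expected immediate reward for taking u in x
    pi[x, u]    : policy probabilities
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q(x,u) = R(x,u) + gamma * sum_x' P(x'|x,u) V_i(x')
        Q = R + gamma * np.einsum("uxy,y->xu", P, V)
        # V_{i+1}(x) = sum_u pi(u|x) Q(x,u)
        V_new = np.sum(pi * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```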
From V^π we can form Q^π(x, u) and improve the policy: u' = arg max_u Q^π(x, u).
V^π(x) = Σ_u π(u|x) Σ_{x'} P^u_{xx'} (R(x, u) + γ V^π(x'))
→ iterative form: V^π_{k+1}(x) = Σ_u π(u|x) Σ_{x'} P^u_{xx'} (R(x, u) + γ V^π_k(x'))
V*(x) = max_u Σ_{x'} P^u_{xx'} (R(x, u) + γ V*(x'))
Given V^π:
Q^π(x, u) = R(x, u) + γ Σ_{x'} P^u_{xx'} V^π(x')
π'(x) = arg max_u Q^π(x, u)   →   V^{π'}(x) ≥ V^π(x)

Policy Improvement theorem
If, for every u,
Q^π(x, π'(x)) ≥ Q^π(x, u),
then in particular
Q^π(x, π'(x)) ≥ Q^π(x, π(x)) = V^π(x),
so the greedy policy π' is at least as good as π. For the greedy policy itself,
V^{π'}(x) = max_u [R(x, u) + γ Σ_{x'} P^u_{xx'} V^π(x')] ≥ V^π(x),
and equality in this expression holds only for V*.

Policy Evaluation – Policy Iteration
1- Policy evaluation: π → V^π(x) = Σ_u π(u|x) Σ_{x'} P^u_{xx'} (R(x, u) + γ V^π(x')), and then
   Q^π(x, u) = R(x, u) + γ Σ_{x'} P^u_{xx'} V^π(x')
2- Policy Improvement: π'(x) = arg max_u Q^π(x, u)
If π' = π for all states (equivalently |V^{π'}(x) − V^π(x)| ≤ Δ), stop; otherwise go back to step 1.
V^{π'}(x) = Σ_u π'(u|x) Σ_{x'} P^u_{xx'} (R(x, u) + γ V^{π'}(x'))
Grid example, one improvement step on the random-policy values: in state 1 the greedy policy moves toward states 2 or 3, so
V^{π'}(1) = (1/2)(V^{π'}(3) − 1) + (1/2)(V^{π'}(2) − 1) = V^{π'}(2) − 1
V^{π'}(2) = +10 + 0 = 10
Value Iteration
V_{k+1}(x) = max_u Σ_{x'} P^u_{xx'} (R(x, u) + γ V_k(x'))   →   V*
π*(x) = arg max_u Q*(x, u)
Grid example:
V_1(1) = max[−10, −1, −10, −1] = −1
V_1(2) = max[−10, 10, −2, −10] = +10 = V_1(3)
V_2(1) = max[−10, 9, −10, 9] = +9
V_2(2) = max[−10, +10, 8, −10] = +10 = V_2(3)
PI:
V_{k+1}(x) = Σ_u π(u|x) Σ_{x'} P^u_{xx'} [R(x, u) + γ V^π_k(x')]
π'(x) = arg max_u [R(x, u) + γ Σ_{x'} P^u_{xx'} V^π(x')]
VI:
V_{k+1}(x) = max_u Σ_{x'} P^u_{xx'} [R(x, u) + γ V_k(x')]
π*(x) = arg max_u [R(x, u) + γ Σ_{x'} P^u_{xx'} V*(x')]
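A matching sketch of value iteration with greedy policy extraction, using the same assumed `P`, `R` array layout as the policy-evaluation sketch above.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """V_{k+1}(x) = max_u sum_x' P(x'|x,u) (R(x,u) + gamma V_k(x'))."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum("uxy,y->xu", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)      # pi*(x) = arg max_u Q*(x, u)
    return V_new, policy
```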

GPI (Generalized Policy Iteration)
ADP (Approximate Dynamic Programming)

Multi-Armed Bandit
Example arm values: −5, +2, 0, +5, +1
Action space: 1 … n
Sample-average estimate:
Q(u) = Σ_{k=0}^N r_k 1_{u_k = u} / Σ_{k=0}^N 1_{u_k = u}

How to balance exploration against exploitation of the information gathered so far:

Greedy:
u_k = arg max_u Q(u)

ε-greedy policy:
with probability ε = 0.1 → random action (10%)
with probability 1 − ε → greedy action (90%)
(e.g., 100 runs of 200-step episodes)

Softmax algorithm:
π(u_k = u) = exp(Q(u)/τ) / Σ_{u'} exp(Q(u')/τ)
τ → 0 → greedy

Upper Confidence Bound (UCB):
π(u) = arg max_u [Q(u) + C √(ln k / N_k(u))]

Incremental Implementation
Q(u) = Σ r_k 1_{u_k = u} / Σ 1_{u_k = u}
Q_n = (1/n) Σ_{k=1}^n r_k, so
Q_{n+1} = 1/(n+1) Σ_{k=1}^{n+1} r_k = 1/(n+1) [Σ_{k=1}^n r_k + r_{n+1}]
        = n/(n+1) Q_n + r_{n+1}/(n+1) = Q_n + 1/(n+1) (r_{n+1} − Q_n)
(r_{n+1} − Q_n) → estimation error
New Est. = Old Est. + Step Size · Error
α_n = Step Size = 1/n → simple average
Error = (Target − Old Est.)
Constant α ← for nonstationary problems
Convergence conditions on the step size:
Σ α_n → ∞ and Σ α_n² < ∞
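A small sketch that ties the ε-greedy rule and the incremental update Q_{n+1} = Q_n + (1/(n+1))(r_{n+1} − Q_n) together on a simple bandit. The arm means reuse the example values −5, +2, 0, +5, +1 from above; the Gaussian reward noise and the run length are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([-5.0, 2.0, 0.0, 5.0, 1.0])  # illustrative arm values from the notes
n_arms, eps, n_steps = len(true_means), 0.1, 2000

Q = np.zeros(n_arms)   # action-value estimates
N = np.zeros(n_arms)   # pull counts

for _ in range(n_steps):
    if rng.random() < eps:              # explore with probability eps
        u = int(rng.integers(n_arms))
    else:                               # exploit: greedy action
        u = int(np.argmax(Q))
    r = rng.normal(true_means[u], 1.0)  # sample a reward (assumed Gaussian noise)
    N[u] += 1
    Q[u] += (r - Q[u]) / N[u]           # NewEst = OldEst + StepSize * Error

print(Q)  # should approach the true arm means
```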

Monte Carlo (Chapter 5):

Estimation based on averaging sampled data:
Monte Carlo sampling example: estimate π from the area of a quarter circle, πr²/4; the hit ratio approaches π/4, giving π ≈ 3.16 from random samples.
↗ Prediction: given π → estimate Q^π, V^π
↘ Control ⇝ obtaining π*
MC Prediction
Given π, find V^π(x) = ?
V^π(x) = E_π[G_k | x_k = x] ≈ (Σ_i G_i) / N,   e.g. N → 1000 episodes

First visit
Every visit
MC (Prediction) algorithm (a Python sketch follows below):
- Generate a number of episodes.
- For each episode, loop i from the last step K down to 0:
    G ← r_{i+1} + γ G
    If x_i has not appeared among the earlier states of the episode → first visit:
        return(x_i) ← {G, return(x_i)}
    V(x_i) ← average(return(x_i))

MC: no bias / high variance / uses a complete trajectory / each update is built from one sampled path.
DP: biased / lower variance / proceeds step by step / at each step all possible transitions are considered.
Bootstrapping ⇝ a new estimate is built from previous estimates at other points.
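A sketch of first-visit MC prediction as outlined in the algorithm above. `sample_episode(pi)` is an assumed helper that rolls out one episode under the policy and returns (state, reward) pairs; only the backward G ← r + γG pass and the first-visit averaging are spelled out.

```python
from collections import defaultdict

def mc_first_visit_prediction(sample_episode, pi, n_episodes=1000, gamma=0.95):
    """First-visit Monte Carlo prediction of V^pi.

    sample_episode(pi) -> [(x_0, r_1), (x_1, r_2), ..., (x_K, r_{K+1})]
    (assumed helper; reward r_{i+1} is stored next to the state x_i it followed)
    """
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = sample_episode(pi)
        G = 0.0
        # loop i from the last step K down to 0
        for i in reversed(range(len(episode))):
            x_i, r_ip1 = episode[i]
            G = r_ip1 + gamma * G
            # first visit: x_i must not appear earlier in the episode
            if x_i not in [x for x, _ in episode[:i]]:
                returns[x_i].append(G)
    return {x: sum(gs) / len(gs) for x, gs in returns.items()}
```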

MC Control
Goal: find π.
V^π alone is not enough without a model:
π' = arg max_u [r(x, u) + γ Σ_{x'} P^u_{xx'} V^π(x')] requires the dynamics P^u_{xx'}.
Instead estimate Q: V^π ⇝ Q ⇝ π' = arg max_u Q^π(x, u), combined with an ε-soft policy.
If some state-action pairs are never tried, e.g. Q(a, 1) = ? while only Q(a, 2), … are visited, their values are never updated, so exploration is needed:

Exploring starts
• Choose the starting state-action pair at random.
- Exploring starts
- ε-greedy ← the best policy among all ε-soft policies, i.e. policies with
π(u|x) ≥ ε/|A|
where
A → action space

1- Policy Evaluation {with random starting points / with randomized (ε-soft) policies} → Q^π(x, u)
2- Policy Improvement: π'(x) = arg max_u Q^π(x, u)
MC Control Algorithm
- Generate a new episode (with a random starting point / following the current policy π).
- For each episode, loop k: K → 0:
    G ← γ G + r_{k+1}
    If (x_k, u_k) does not appear in the earlier steps of the episode:
        {Return(x_k, u_k), G} → Return(x_k, u_k)
        Q(x_k, u_k) ← average(Return(x_k, u_k))
        (Policy Imp.) π(x_k) ← arg max_u Q(x_k, u)
- Go back to the episode-generation step.

Off-policy evaluation: V^π(x) = E_π[G_k | x_k = x]
Example: the target policy π (whose value we want) chooses ← with probability 90% and → with probability 10%, while the behavior policy π' (which generated the data) chooses ← and → with probability 50% each. Ten returns are collected: G^1, …, G^5 from → episodes and G^6, …, G^10 from ← episodes.
The naive sample average
V^π(x) = (Σ_{i=1}^{10} G^i) / 10
estimates the behavior policy's value, not V^π(x).
Ordinary Importance Sampling
The importance sampling ratios are π/π' = 0.1/0.5 = 1/5 and 0.9/0.5 = 9/5, so
V^π(x) ⇝ [ (1/5) Σ_{i=1}^{5} G^i + (9/5) Σ_{i=6}^{10} G^i ] / 10
Ordinary I.S. method → high variance / unbiased → can be more easily extended to continuous (function-approximation) settings.
Weighted Importance Sampling
V^π(x) ⇝ [ (1/5) Σ_{i=1}^{5} G^i + (9/5) Σ_{i=6}^{10} G^i ] / [ Σ_{i=1}^{5} (1/5) + Σ_{i=6}^{10} (9/5) ]
Weighted I.S. method → low variance / biased → preferred.
Incremental form of the weighted estimator:
V_n = Σ_{i=1}^n w_i G^i / Σ_{i=1}^n w_i = Σ_{i=1}^n w_i G^i / C_n
V_{n+1} = Σ_{i=1}^{n+1} w_i G^i / Σ_{i=1}^{n+1} w_i = (Σ_{i=1}^n w_i G^i + w_{n+1} G^{n+1}) / (C_n + w_{n+1})
        = (V_n C_n + w_{n+1} G^{n+1}) / (C_n + w_{n+1}) = V_n C_n/(C_n + w_{n+1}) + w_{n+1} G^{n+1}/C_{n+1}
V_{n+1} = V_n + (w_{n+1}/C_{n+1}) (G^{n+1} − V_n),   C_{n+1} = C_n + w_{n+1}
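A sketch of the two estimators applied to a batch of returns; `G` holds made-up return values and `rho` the per-return importance ratios (1/5 and 9/5 in the example above).

```python
import numpy as np

def ordinary_is(G, rho):
    """V ~ (1/n) sum_i rho_i G_i  (unbiased, high variance)."""
    G, rho = np.asarray(G), np.asarray(rho)
    return np.sum(rho * G) / len(G)

def weighted_is(G, rho):
    """V ~ sum_i rho_i G_i / sum_i rho_i  (biased, low variance, preferred)."""
    G, rho = np.asarray(G), np.asarray(rho)
    return np.sum(rho * G) / np.sum(rho)

# Five returns with ratio 1/5 and five with ratio 9/5; the return values are made up.
G = [1.0, 2.0, 1.5, 0.5, 1.0, 3.0, 2.5, 3.5, 2.0, 3.0]
rho = [1/5] * 5 + [9/5] * 5
print(ordinary_is(G, rho), weighted_is(G, rho))
```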

Trajectory probabilities under the two policies:
P_π(τ) = π(u_k|x_k) P^{u_k}_{x_k x_{k+1}} π(u_{k+1}|x_{k+1}) P^{u_{k+1}}_{x_{k+1} x_{k+2}} ⋯
P_{π'}(τ) = π'(u_k|x_k) P^{u_k}_{x_k x_{k+1}} π'(u_{k+1}|x_{k+1}) P^{u_{k+1}}_{x_{k+1} x_{k+2}} ⋯
Imp. Sampling ratio = P_π(τ)/P_{π'}(τ) = [π(u_k|x_k)/π'(u_k|x_k)] · [π(u_{k+1}|x_{k+1})/π'(u_{k+1}|x_{k+1})] ⋯ = Π_{i=k}^{K−1} π(u_i|x_i)/π'(u_i|x_i)
(the unknown transition probabilities cancel)

Off-policy MC algorithm → target policy (π), behavior policy (b)
- Generate a new episode following the behavior policy b.
- For each step of the episode (working backwards, with G the return and W the accumulated importance weight):
    C(x_k, u_k) ← C(x_k, u_k) + W
    Q(x_k, u_k) ← Q(x_k, u_k) + (W / C(x_k, u_k)) (G − Q(x_k, u_k))
    π(x_k) ← arg max_u Q(x_k, u)
    W ← W · π(u_k|x_k) / b(u_k|x_k)
MC:
V^π(x) = E_π[G_k | x_k = x] → π' = arg max_u Q(x, u)
Off-policy:
behavior policy → b
target policy → π
E_π[f(x)] = E_b[(π(x, u)/b(x, u)) f(x)]
Incremental implementation:
V(x) ← V(x) + α (G_k − V(x))
Temporal Difference Learning [TD(0)]

Target: V(x_k) ≈ r_{k+1} + γ V(x_{k+1})

Update: V_{n+1}(x_k) = V_n(x_k) + α (r_{k+1} + γ V_n(x_{k+1}) − V_n(x_k))
i.e. the return is approximated by G_k ≈ r_{k+1} + γ V(x_{k+1}).
For reference, the full expectation is
V^π(x) = Σ_u π(u|x) Σ_{x'} P^u_{xx'} (R(x, u) + γ V^π(x')).

DP:
V^π(x) = E_π[r_{k+1} + γ V^π(x_{k+1})]
← requires the system model; DP is approximate because it uses the estimated value of V on the right-hand side.
MC:
V^π(x) = E_π[G_k | x_k = x]
← MC is approximate because it replaces the expectation with sample averaging.
TD:
approximates for both reasons:
← it bootstraps
← it is sample-based

Example: estimate V(A) and V(B) from the episodes
{A, 0, B, 0}
{B, 1}  (6 repeats)
{B, 0}
V(B) = 6/8 = 3/4
V(A) = 0        ← MC, which minimizes the mean squared error on the observed returns
V(A) = V(B)     ← TD, the certainty-equivalent estimate
If TD's implicit model estimate is correct, it converges to the optimal answer.
On-policy TD Control
SARSA

Q_{n+1}(x_k, u_k) = Q_n(x_k, u_k) + α (r_{k+1} + γ Q_n(x_{k+1}, u_{k+1}) − Q_n(x_k, u_k))

TD error = r_{k+1} + γ Q_n(x_{k+1}, u_{k+1}) − Q_n(x_k, u_k)

π'(x) = arg max_u Q(x_k, u)
Expected Sarsa:
V^π(x) = Σ_u π(u|x) Q^π(x, u)

Q_{n+1}(x_k, u_k) = Q_n(x_k, u_k) + α (r_{k+1} + γ Σ_u π(u|x_{k+1}) Q_n(x_{k+1}, u) − Q_n(x_k, u_k))

(replacing the expectation with the sampled next action gives back the Sarsa update
Q_n(x_k, u_k) + α (r_{k+1} + γ Q_n(x_{k+1}, u_{k+1}) − Q_n(x_k, u_k)))
Q-learning:

Q_{n+1}(x_k, u_k) = Q_n(x_k, u_k) + α (r_{k+1} + γ max_u Q_n(x_{k+1}, u) − Q_n(x_k, u_k))

→ Q*(x, u) → π*(x) = arg max_u Q*(x, u)
Targets compared (see the update sketch below):
Sarsa:
Q(x_k, u_k) ← Q(x_k, u_k) + α (r_{k+1} + γ Q(x_{k+1}, u_{k+1}) − Q(x_k, u_k))
Target: r_{k+1} + γ Q(x_{k+1}, u_{k+1})
Expected Sarsa:
Q(x_k, u_k) ← Q(x_k, u_k) + α (r_{k+1} + γ Σ_u π(u|x_{k+1}) Q(x_{k+1}, u) − Q(x_k, u_k))
Target: r_{k+1} + γ Σ_u π(u|x_{k+1}) Q(x_{k+1}, u)
Q-learning:
Q(x_k, u_k) ← Q(x_k, u_k) + α (r_{k+1} + γ max_u Q(x_{k+1}, u) − Q(x_k, u_k))
Target: r_{k+1} + γ max_u Q(x_{k+1}, u)

TD ↔ MC
TD: bias ↑, variance ↓     MC: bias ↓, variance ↑
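A sketch of the three tabular updates side by side, operating on a Q-table stored as a NumPy array. `pi_probs` (the policy's distribution over actions in the next state, needed by Expected Sarsa) and the default α, γ values are assumptions.

```python
import numpy as np

def sarsa_update(Q, x, u, r, x_next, u_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[x_next, u_next]            # on-policy sampled next action
    Q[x, u] += alpha * (target - Q[x, u])

def expected_sarsa_update(Q, x, u, r, x_next, pi_probs, alpha=0.1, gamma=0.99):
    target = r + gamma * np.dot(pi_probs, Q[x_next])  # E_pi[Q(x', .)]
    Q[x, u] += alpha * (target - Q[x, u])

def q_learning_update(Q, x, u, r, x_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[x_next])            # max_u Q(x', u)
    Q[x, u] += alpha * (target - Q[x, u])
```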
N-step TD:
G_{k:k+n} = r_{k+1} + γ r_{k+2} + γ² r_{k+3} + ⋯ + γ^{n−1} r_{k+n} + γ^n V(x_{k+n})
G_{k:k+1} = r_{k+1} + γ V(x_{k+1})
Prediction:
V_{k+n}(x_k) = V_{k+n−1}(x_k) + α (G_{k:k+n} − V_{k+n−1}(x_k))
Error reduction property:
max_x |E[G_{k:k+n} | x_k = x] − V^π(x)| ≤ γ^n max_x |V_{k+n−1}(x) − V^π(x)|

n-step Sarsa:
Q(x_k, u_k) ← Q(x_k, u_k) + α [G_{k:k+n} − Q(x_k, u_k)]
G_{k:k+n} = r_{k+1} + γ r_{k+2} + ⋯ + γ^{n−1} r_{k+n} + γ^n Q(x_{k+n}, u_{k+n})
n-step Expected Sarsa:
G_{k:k+n} = r_{k+1} + γ r_{k+2} + ⋯ + γ^{n−1} r_{k+n} + γ^n Σ_u π(u|x_{k+n}) Q(x_{k+n}, u)

n-step off-policy learning:

Importance sampling ratio:
ρ_{k:k+n−1} = Π_{i=k}^{k+n−1} π(u_i|x_i) / b(u_i|x_i)

Prediction:
V(x_k) ← V(x_k) + α ρ_{k:k+n−1} [G_{k:k+n} − V(x_k)]
n-step off-policy Sarsa control:
Q(x_k, u_k) ← Q(x_k, u_k) + α ρ_{k+1:k+n} [G_{k:k+n} − Q(x_k, u_k)]
n-step off-policy Expected Sarsa:
Q(x_k, u_k) ← Q(x_k, u_k) + α ρ_{k+1:k+n−1} [G_{k:k+n} − Q(x_k, u_k)]
λ-return: combine all the n-step returns {G_{k:k+1}, G_{k:k+2}, …, G_{k:k+n}, …}:
G_k^λ = (1 − λ) Σ_{i=1}^∞ λ^{i−1} G_{k:k+i} = (1 − λ) Σ_{i=1}^{T−k−1} λ^{i−1} G_{k:k+i} + λ^{T−k−1} G_k

Sarsa(λ)
TD(λ)
G_k^{λ=0} = G_{k:k+1} ← TD(0)
G_k^{λ=1} = G_k ← MC
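A sketch that computes G_{k:k+n} and the λ-return for a finished episode; it simply restates the formulas above. The indexing convention (rewards[i] = r_{i+1}, values[i] = V(x_i)) is an assumption.

```python
def n_step_return(rewards, values, k, n, gamma=0.99):
    """G_{k:k+n} = r_{k+1} + ... + gamma^{n-1} r_{k+n} + gamma^n V(x_{k+n})."""
    T = len(rewards)                    # episode length
    n = min(n, T - k)                   # truncate at the end of the episode
    G = sum(gamma**i * rewards[k + i] for i in range(n))
    if k + n < T:
        G += gamma**n * values[k + n]   # no bootstrap past the end of the episode
    return G

def lambda_return(rewards, values, k, lam=0.9, gamma=0.99):
    """G_k^lambda = (1-lam) sum_i lam^{i-1} G_{k:k+i} + lam^{T-k-1} G_k."""
    T = len(rewards)
    G_lam = sum((1 - lam) * lam**(i - 1) * n_step_return(rewards, values, k, i, gamma)
                for i in range(1, T - k))
    G_lam += lam**(T - k - 1) * n_step_return(rewards, values, k, T - k, gamma)
    return G_lam
```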
Function approximation

Example (a cart-pole-like balancing task):
Action ∈ {+1: left, 0: right}
State x = [θ, θ̇, y, ẏ]ᵀ ⇝ V^π(x) ≈ V(x; φ)
Generalization
V(x; φ) = φ_1 x + φ_2 x² + ⋯,   φ = [φ_1, φ_2, …]ᵀ

- Efficiency
Possible targets U_k for updating V(x; φ):
- DP → Σ_u π(u|x) Σ_{x'} P^u_{xx'} (r_{k+1} + γ V(x'; φ))
- MC → G_k
- TD(0) → V^π(x_k) ≈ r_{k+1} + γ V(x_{k+1}; φ)
- n-step → G_{k:k+n}
Objective

mse = Σ_x μ(x) [U_k − V(x; φ)]²

Σ_x μ(x) = 1
μ(x) = probability of the system being in state x over the whole episode

Gradient descent
φ ← φ − α ∇_φ mse
Stochastic GD (SGD)
φ ← φ − α ∇_φ [U_k − V(x; φ)]²
φ ← φ + 2α [U_k − V(x; φ)] ∇_φ V(x; φ)

MC → converges to the best possible solution (for the given approximator).
DP, TD(0), n-step → semi-gradient: weaker convergence guarantees, but an online implementation is possible.

Semi-gradient algorithm
- At each step of the episode:
  o generate an action from the policy π
  o compute the target from U_k = r_{k+1} + γ V(x_{k+1}; φ)
  o update φ ← φ + 2α δ_k ∇_φ V(x_k; φ), with δ_k = U_k − V(x_k; φ)

Supervised-learning analogy: minimize Σ (V(x; φ) − label)² over training pairs x → label
(e.g. 1 → 1, 2 → 2, 3 → 3, 1 → 1, 4 → 4).
Linear structure:
V(x; φ) = φ_1 f_1(x) + φ_2 f_2(x) + ⋯ = φᵀ f(x)
φ = [φ_1, φ_2, …]ᵀ,   f(x) = [f_1(x), f_2(x), …]ᵀ → basis functions (features)
- Linear structure:
  o The MC method converges to the global optimum.
  o TD(0): converges, but not necessarily to the optimum;
    TD(0) estimation error ≤ (1/(1 − γ)) × the smallest possible error
- Types of features (a semi-gradient TD(0) sketch with such features follows below):
  1) Polynomials:
     x, x², …
     x_1, x_1 x_2, x_2, …
  2) Fourier series:
     sin(ω_1 x), cos(ω_1 x), sin(2ω_1 x), cos(2ω_1 x), …
  3) Tile coding
  4) Coarse coding

Update: φ ← φ + α δ_k ∇_φ V(x)
A practical step size: α ≈ 1 / (T E[fᵀ(x) f(x)])
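A sketch of semi-gradient TD(0) with a linear approximator V(x; φ) = φᵀ f(x). `features`, `policy`, and `env` are assumptions: the feature map could be any of the options listed above, and the environment is assumed to expose a Gymnasium-style reset/step interface.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.01, gamma=0.99, n_episodes=500):
    """Semi-gradient TD(0) with a linear value function V(x; phi) = phi^T f(x)."""
    phi = np.zeros(n_features)
    for _ in range(n_episodes):
        x, _ = env.reset()
        done = False
        while not done:
            u = policy(x)                                   # action from the fixed policy pi
            x_next, r, terminated, truncated, _ = env.step(u)
            done = terminated or truncated
            v = phi @ features(x)
            v_next = 0.0 if terminated else phi @ features(x_next)
            delta = r + gamma * v_next - v                  # TD error (target minus estimate)
            phi += alpha * delta * features(x)              # grad_phi V = f(x) for a linear V
            x = x_next
    return phi
```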
Neural networks
Neuron

- Activation functions:
  1) Sigmoid: 1/(1 + e^{−x})
  2) Tanh: (e^x − e^{−x}) / (e^x + e^{−x})
  3) ReLU (beware the "dying ReLU" problem)

Forward pass of a network with one hidden layer:
y_j = f(Σ_i w_{ij} x_i + b_j)
z = Σ_j m_j y_j = Σ_j m_j f(Σ_i w_{ij} x_i + b_j)
Loss:
E_k = (U_k − Z_k)²
E = (1/N) Σ_k E_k
Gradients (backpropagation):
∂Z_k/∂b_j = m_j f'(Σ_i w_{ij} x_i + b_j)
∂Z_k/∂w_{ij} = (∂Z_k/∂f)(∂f/∂w_{ij}) = m_j f'(Σ_i w_{ij} x_i + b_j) x_i
∂Z_k/∂m_j = y_j
Vanishing gradient
Fully connected networks vs. convolutional NN (CNN)
Overfitting vs. generalizable models; remedies:
1- Cross validation (early stopping)
2- Dropout
3- Batch normalization
4- Regularization:
   E = (1/N) Σ_k E_k + (penalty on network complexity)
5- Deep residual connections (ResNet)
The network is then used as a nonlinear function approximator for value-based RL:
V* = Σ_u π*(x, u) Q*(x, u),   Q*,   π* = arg max_u Q*(x, u)
Convolutional NN (CNN):

Input → Convolution → Pooling → Convolution → Pooling → ⋯ → Fully Connected → Output
Recurrent NN (RNN):

h_k = f(x_k, h_{k−1}) = f(x_k, x_{k−1}, …)
(compare with a linear state-space model: x_{k+1} = f(x_k, u_k) = A x_k + B u_k)
Vanishing gradients motivate the Long Short-Term Memory (LSTM) cell:
f_k: forget gate = σ(w_f [h_{k−1}, x_k] + b_f)
i_k: input gate  = σ(w_i [h_{k−1}, x_k] + b_i)
o_k: output gate = σ(w_o [h_{k−1}, x_k] + b_o)
C̃_k = tanh(w_c [h_{k−1}, x_k] + b_c)
C_k = f_k C_{k−1} + i_k C̃_k
h_k = o_k tanh(C_k)
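A NumPy sketch of one LSTM cell step implementing the gate equations above; the dictionary of weight matrices acting on the concatenated [h_{k−1}, x_k] is an assumed parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_k, h_prev, C_prev, W, b):
    """One LSTM step. W['f'], W['i'], W['o'], W['c'] act on [h_{k-1}, x_k]."""
    z = np.concatenate([h_prev, x_k])
    f_k = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_k = sigmoid(W["i"] @ z + b["i"])        # input gate
    o_k = sigmoid(W["o"] @ z + b["o"])        # output gate
    C_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    C_k = f_k * C_prev + i_k * C_tilde        # new cell state
    h_k = o_k * np.tanh(C_k)                  # new hidden state
    return h_k, C_k
```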
Linear value function: V(x) = φᵀ f(x),   f = [f_1(x), f_2(x), …, f_n(x)]ᵀ
Least-squares TD:

min_φ Σ_{k=0}^T [r_{k+1} + γ V(x_{k+1}) − V(x_k)]²
= min_φ Σ_{k=0}^T [r_{k+1} + γ φᵀ f(x_{k+1}) − φᵀ f(x_k)]²

Setting the gradient to zero gives A φ = b with
A = Σ_k f(x_k) [f(x_k) − γ f(x_{k+1})]ᵀ
b = Σ_k f(x_k) r_{k+1}

TD fixed point:
φ = A^{−1} b

Alternatively an incremental update φ ← φ + α(⋯) can be used; the difficulty is then choosing the step size α.
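A batch least-squares TD sketch: given feature matrices for x_k and x_{k+1} and the rewards, it forms A and b as above and solves for the TD fixed point. The small ridge term for numerical invertibility is an added assumption.

```python
import numpy as np

def lstd(F, F_next, rewards, gamma=0.99, reg=1e-6):
    """Least-squares TD: phi = A^{-1} b.

    F[k]       : f(x_k)      (shape [T, n_features])
    F_next[k]  : f(x_{k+1})
    rewards[k] : r_{k+1}
    """
    A = F.T @ (F - gamma * F_next)   # sum_k f(x_k) (f(x_k) - gamma f(x_{k+1}))^T
    b = F.T @ rewards                # sum_k f(x_k) r_{k+1}
    A += reg * np.eye(A.shape[0])    # small ridge term for invertibility (assumption)
    return np.linalg.solve(A, b)
```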
• Non-parametric methods:
Memory-based approximation:
V(x) = (1/k) Σ_{i=1}^k V(x_i)
V(x) = Σ_i w_i V(x_i) / Σ_i w_i
with w_i proportional to how close the stored point x_i is to x (k-NN).
• Kernel-based approximation:
V(x) = Σ_i k(x_i, x) V(x_i)
k(x_i, x) = exp(−‖x − x_i‖² / (2σ²))
Linear V(x) = φᵀ f(x) ↔ kernel-based with k(x_i, x) = f(x_i)ᵀ f(x):
given samples [x_i, V(x_i) = φᵀ f(x_i)] → V(x) = Σ_i k(x_i, x) V(x_i) = Σ_i f(x_i)ᵀ f(x) V(x_i)
Kernel trick
→ RBFSampler
→ Nystroem
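A sketch of the Gaussian-kernel value estimate built from stored samples. The optional normalization by Σ k(x_i, x), which turns the sum into a weighted average, is an added assumption; the unnormalized form matches the notes.

```python
import numpy as np

def kernel_value(x, X_mem, V_mem, sigma=1.0, normalize=True):
    """Kernel-based value estimate from stored samples (X_mem[i], V_mem[i])."""
    d2 = np.sum((X_mem - x) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * sigma**2))      # Gaussian kernel k(x_i, x)
    if normalize:                           # weighted-average variant (assumption)
        return np.dot(k, V_mem) / np.sum(k)
    return np.dot(k, V_mem)                 # V(x) = sum_i k(x_i, x) V(x_i), as in the notes
```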

• Working around the difficulty of training deep networks by pre-training the earlier layers.

Chapter 4: The optimal control problem for continuous-time systems
V(x) = ∫_t^∞ L(x, u) dτ = cost-to-go = ∫_t^∞ (running cost) dτ

ẋ = f(x, u)

L(x, u) = xᵀ Q x + uᵀ R u

Tracking: x → x_d
e = x − x_d → ė = f(x, u) − ẋ_d
V(t, x) = ∫_t^∞ L(t, x, u) dτ
Split the integral:
V*(t, x) → ∫_t^∞ = ∫_t^{t+dt} + ∫_{t+dt}^∞

V*(t, x) = L(t, x, u) dt + V*(t + dt, x + dx)

V*(t, x) = min_u [L(t, x, u) dt + V*(t + dt, x + dx)]
Expanding to first order,
V*(t + dt, x + dx) = V*(t, x) + (∂V*/∂t) dt + (∂V*ᵀ/∂x) dx
and with ẋ = f(x, u), dx = f(x, u) dt:
→ min_u [L(t, x, u) + ∂V*/∂t + (∂V*ᵀ/∂x) f(x, u)] = 0

→ −∂V*/∂t = min_u [L(t, x, u) + (∂V*ᵀ/∂x) f(x, u)]        (HJB)
• V, L have no explicit time dependence
• ẋ = f(x, u) = F(x) + B(x) u   (control-affine)
0 = min_u [L(x, u) + (∂V*ᵀ/∂x) (F(x) + B(x) u)]
Hamiltonian: H(x, λ, u) = L(x, u) + λᵀ (F(x) + B(x) u)
HJB: min_u H(x, ∂V*/∂x, u) = 0
→ Cost:
L(x, u) = xᵀ Q x + uᵀ R u
→ For an LTI system the HJB reduces to the algebraic Riccati equation.
Assumptions used:
∙ affine dynamics
∙ L, V with no explicit time dependence
∙ quadratic cost
min_u [xᵀ Q x + uᵀ R u + (∂V*ᵀ/∂x) (F(x) + B(x) u)] = 0
Differentiating with respect to u:

2uᵀ R + (∂V*ᵀ/∂x) B(x) = 0

u* = −(1/2) R^{−1} Bᵀ(x) ∂V*/∂x
Substituting back, the HJB for V* becomes
xᵀ Q x − (1/4) (∂V*ᵀ/∂x) B R^{−1} Bᵀ (∂V*/∂x) + (∂V*ᵀ/∂x) F(x) = 0
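For the LTI special case mentioned above (ẋ = Ax + Bu with quadratic cost), the HJB collapses to the algebraic Riccati equation. Below is a sketch with SciPy's CARE solver; the double-integrator system is a made-up illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative double integrator: x = [position, velocity]
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# V*(x) = x^T P x solves the algebraic Riccati equation (LTI reduction of the HJB)
P = solve_continuous_are(A, B, Q, R)

# u* = -(1/2) R^{-1} B^T dV*/dx = -R^{-1} B^T P x  =>  state-feedback gain K
K = np.linalg.solve(R, B.T @ P)
print("Riccati solution P =\n", P)
print("Optimal gain K =", K)
```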
Summary of the setup:
ẋ = f(x, u)
V(x) = ∫_t^∞ L(x, u) dt

V(x) = cost-to-go

L(x, u) = running cost

u = −k(x)

→ −∂V*/∂t = min_u {L(t, x, u) + (∂V*ᵀ/∂x) f(x, u)}
• affine: ẋ = F(x) + B(x) u
• L and V have no explicit time dependence.
• L(x, u) = xᵀ Q x + uᵀ R u

0 = min_u {xᵀ Q x + uᵀ R u + (∂V*ᵀ/∂x) (F(x) + B(x) u)}
H(x, λ, u) = xᵀ Q x + uᵀ R u + λᵀ (F(x) + B(x) u)
min_u H(x, ∂V*/∂x, u) = 0 → ∂H/∂u = 0 → u* = (−1/2) R^{−1} Bᵀ ∂V*/∂x
HJB (the H₂ case):

V*:  xᵀ Q x − (1/4) (∂V*ᵀ/∂x) B R^{−1} Bᵀ (∂V*/∂x) + (∂V*ᵀ/∂x) F(x) = 0

H∞ case: ẋ = F(x) + B(x) u + D(x) w
L(x, u) = xᵀ Q x + uᵀ R u − wᵀ P w
HJI:
min_u max_w {xᵀ Q x + uᵀ R u − wᵀ P w + (∂V*ᵀ/∂x) (F(x) + B(x) u + D(x) w)} = 0

→ { w* = (1/2) P^{−1} Dᵀ ∂V*/∂x
    u* = (−1/2) R^{−1} Bᵀ ∂V*/∂x }
Solving these equations exactly is hard in general → Function approximation ⇝ ADP
The pair of equations to be solved is
min_u H(x, ∂V*/∂x, u) = 0
u* = (−1/2) R^{−1} Bᵀ ∂V*/∂x
For a fixed admissible policy u(x):
V(x) = ∫_t^∞ L(x, u) dτ

dV(x) = −L(x, u) dt

(∂Vᵀ/∂x) (dx/dt) dt = −L(x, u) dt
L(x, u) + (∂Vᵀ/∂x) (F(x) + B(x) u) = 0 = H(x, ∂V/∂x, u)
u = (−1/2) R^{−1} Bᵀ ∂V/∂x
Policy Iteration:
1- Evaluation: solve
   L(x, u) + (∂Vᵀ/∂x) (F(x) + B(x) u) = 0 = H(x, ∂V/∂x, u)
   for V under the current policy u(x).
2- Improvement:
   u = (−1/2) R^{−1} Bᵀ ∂V/∂x
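For an LTI system the continuous-time policy iteration above reduces to Kleinman's algorithm: evaluation solves a Lyapunov equation for the current gain and improvement recomputes the gain. A sketch with SciPy, reusing the illustrative double integrator; the stabilizing initial gain is an assumption the method requires.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lqr_policy_iteration(A, B, Q, R, K0, n_iter=20):
    """Kleinman's policy iteration for continuous-time LQR.

    K0 must be a stabilizing initial gain (assumption).
    Evaluation:  (A - B K)^T P + P (A - B K) + Q + K^T R K = 0
    Improvement: K <- R^{-1} B^T P
    """
    K = K0
    for _ in range(n_iter):
        Ac = A - B @ K
        P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
        K = np.linalg.solve(R, B.T @ P)
    return P, K

# Reusing the illustrative double integrator
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K0 = np.array([[1.0, 1.0]])   # stabilizing initial gain for this example
P, K = lqr_policy_iteration(A, B, Q, R, K0)
print(P, K)
```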
Given a policy u(x), approximate the value function with a critic:
V = φᵀ μ(x) + ε(x)
V̂(x) = φ̂ᵀ μ(x)  →  ∂V̂/∂x = (∂μ/∂x)ᵀ φ̂
where ∂μ/∂x is the N×m Jacobian of the basis vector, with entries ∂μ_i/∂x_j.
Bellman (Hamiltonian) residual:
e_V = H(x, ∂V̂/∂x, u) − H(x, ∂V/∂x, u) = L(x, u) + φ̂ᵀ (∂μ/∂x) (F(x) + B(x) u)
β = (∂μ/∂x) (F(x) + B(x) u)
E_V = (1/2) e_V²  ⇝  gradient descent on the critic weights: φ̂̇ = −α ∂E_V/∂φ̂ = −α e_V β
Normalized update:
φ̂̇ = −α β/(1 + βᵀβ)² e_V
H(x, ∂V/∂x, u) = 0 = xᵀ Q x + uᵀ R u + (∂Vᵀ/∂x) (F(x) + B(x) u)
u(x) = (−1/2) R^{−1} Bᵀ ∂V/∂x

→ V(x) = φᵀ μ(x) + ε(x)

V̂(x) = φ̂ᵀ μ(x)
e_V = H(x, ∂V̂/∂x, u) − H(x, ∂V/∂x, u) = xᵀ Q x + uᵀ R u + φ̂ᵀ (∂μ/∂x) (F(x) + B(x) u)
β = (∂μ/∂x) (F(x) + B(x) u)

E_V = (1/2) e_V²
φ̂̇ = −α ∂E_V/∂φ̂ = −α e_V β  →  φ̂̇ = −α β/(1 + βᵀβ)² e_V
Weight-error dynamics (φ̃ = φ − φ̂):
φ̃̇ = −(α β/(1 + βᵀβ)²) [φ̃ᵀ β − (∂εᵀ/∂x) (F(x) + B(x) u)]
φ̃̇ = −α (β βᵀ/(1 + βᵀβ)²) φ̃ + (bounded perturbation terms)
φ̃ → 0 provided that
λ_min(β βᵀ) > 0
λ_min(β βᵀ) > 0 ⇔ PE (persistency of excitation)
β = (∂μ/∂x) (F(x) + B(x) u)
Actor-critic form with a compensating term:
φ̂̇ = −α β e_V/(1 + βᵀβ)² + compensating term → guarantees stability while moving toward the optimal solution
u = (−1/2) R^{−1} Bᵀ (∂μ/∂x)ᵀ φ̂ + compensating term
Integral RL:
V(x) = ∫_t^∞ L(x, u) dτ → V̇ = −L(x, u) ⇝ ∫_t^{t+T} V̇ dτ = −∫_t^{t+T} L(x, u) dτ
⇒ V(t + T) − V(t) = −∫_t^{t+T} L(x, u) dτ

V = φᵀ μ(x) + ε(x)
V̂ = φ̂ᵀ μ(x)
φ̂ᵀ μ(t + T) − φ̂ᵀ μ(t) = −y → φ̂ᵀ [μ(t) − μ(t + T)] = φ̂ᵀ θ = y
e = φ̂ᵀ θ − y
E = ∫ (φ̂ᵀ θ − y)² dτ

∂E/∂φ̂ = 0 = ∫ 2 (φ̂ᵀ θ − y)ᵀ (∂e/∂φ̂) dτ = ∫ 2 eᵀ θ dτ = 0
Weighted Residual method:
∫ (φ̂ᵀ θ − y)ᵀ θ dτ = 0
with the inner product ⟨f, g⟩ = ∫ fᵀ g dx and the condition ⟨e, ∂e/∂φ̂⟩ = 0:
∫ (φ̂ᵀ θ)ᵀ θ dτ = ∫ yᵀ θ dτ
∫ θ (θᵀ φ̂) dτ = ∫ y θ dτ
(∫ θ θᵀ dτ) φ̂ = ∫ y θ dτ
φ̂ = (∫ θ θᵀ dτ)^{−1} (∫ y θ dτ) = (Σ_i θ_i θ_iᵀ)^{−1} (Σ_i y_i θ_i)
θ = μ(t) − μ(t + T)
y = ∫_t^{t+T} L(x, u) dτ
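A batch least-squares sketch of the integral-RL critic: each data window contributes θ_i = μ(t_i) − μ(t_i + T) and y_i = ∫ L dτ over that window, and φ̂ = (Σ θ_i θ_iᵀ)^{−1} Σ y_i θ_i. The array layout and the small regularizer are assumptions.

```python
import numpy as np

def irl_critic_weights(Theta, y, reg=1e-8):
    """Integral-RL critic by batch least squares.

    Theta[i] : theta_i = mu(t_i) - mu(t_i + T)   (shape [M, N_basis])
    y[i]     : integral of the running cost L(x, u) over [t_i, t_i + T]
    """
    A = Theta.T @ Theta              # sum_i theta_i theta_i^T
    b = Theta.T @ y                  # sum_i y_i theta_i
    A += reg * np.eye(A.shape[0])    # small regularizer (assumption) for invertibility
    return np.linalg.solve(A, b)     # phi_hat = A^{-1} b
```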

Off-policy Integral RL (data generated with a behavior input u_b while evaluating a target policy u_T):
V̇ = −L(x, u)
(∂Vᵀ/∂x) ẋ = −xᵀ Q x − u_Tᵀ R u_T
(∂Vᵀ/∂x) (F(x) + B(x) u_b) = −xᵀ Q x − u_Tᵀ R u_T ?
ẋ = F(x) + B(x) u_b

V(x) = ∫_t^∞ L(x, u_T) dτ
V(x + dx) = ∫_t^∞ L(x + dx, u_T) dτ

V̇ = (∂Vᵀ/∂x) (dx/dt) = (∂Vᵀ/∂x) (F(x) + B(x) u_b)  ≟  −xᵀ Q x − u_Tᵀ R u_T

V̇ = (∂Vᵀ/∂x) (F(x) + B(x) u_b) = (∂Vᵀ/∂x) [F(x) + B(x) u_T + B(x) (u_b − u_T)]
and for the target policy
(∂Vᵀ/∂x) [F(x) + B(x) u_T] = −xᵀ Q x − u_Tᵀ R u_T,
so
∫_t^{t+T} V̇ dτ = ∫_t^{t+T} [−xᵀ Q x − u_Tᵀ R u_T + (∂Vᵀ/∂x) B(x) (u_b − u_T)] dτ
V(t + T) − V(t) = −∫_t^{t+T} [xᵀ Q x + u_Tᵀ R u_T − (∂Vᵀ/∂x) B(x) (u_b − u_T)] dτ

As before, this is written as φ̂ᵀ θ = y with θ = μ(t) − μ(t + T).