Deep Reinforcement Learning
1 Course Description
This graduate-level course offers a comprehensive introduction to Deep Reinforcement Learning (DRL) through the lens of control theory, with a strong focus on its applications in aerospace engineering. Students will explore the fundamental principles of reinforcement learning, its synergy with classical control concepts, and its powerful applications in decision-making and control problems. The course covers both theoretical foundations and practical implementations, equipping students with the knowledge to apply DRL techniques to complex aerospace systems. From autonomous navigation and adaptive flight control to space robotics and satellite operations, students will learn how DRL is revolutionizing the aerospace industry. By the end of the course, participants will be able to formulate, implement, and analyze DRL solutions for cutting-edge aerospace challenges, positioning them at the forefront of this rapidly evolving field.
2 Prerequisites
• Mathematical maturity
3 Course Schedule
Week 1-2: Introduction to Reinforcement Learning and Control
1. Introduction to RL and its applications in control
2. RL problem formulation: states, actions, rewards
6. Examples
Week 8: Deep Value-based RL
1. Deep Q-Networks (DQN) (DDQN, Dueling DQN)
2. Experience replay and prioritized experience replay
3. Examples
Week 9-11: Policy Gradient Methods
1. Policy gradient theorem
2. REINFORCE algorithm
3. Actor-Critic methods (A2C, A3C)
4. Deep Deterministic Policy Gradient (DDPG), TD3, SAC
5. Trust Region Policy Optimization (TRPO), Proximal Policy Optimization
(PPO)
Week 12-13: Advanced RL Techniques
1. Distributional Reinforcement Learning
2. Multi-agent RL
4 Assignments and Grading
• Homework assignments (20%, five assignment sets)
• Midterm and final exams (40%)
• Final project (25%)
• Weekly quizzes and class participation (15% + 5%)
5 Course outcome
By the end of this course, students will be able to:
• Understand the fundamentals of reinforcement learning and its relationship to control theory
• Formulate RL problems in the context of control systems
• Implement and analyze basic RL as well as advanced DRL algorithms
• Apply DRL techniques to solve control problems, particularly in aerospace engineering
• Critically evaluate the strengths and limitations of DRL methods in control applications
Midterm exam: Sunday, 27 Aban.
Weeks 13-15: { Distributional RL, Multi-agent RL, SOC, Meta RL, Safe RL }
- Artificial intelligence
- Machine learning
- Reinforcement learning
RL arose from combining the two concepts above; it became widespread in the 1980s → Temporal Difference → Watkins, 1989 → Q-learning.
1990s-2000s: growing use of function approximators.
2015
1- On-policy → Sarsa
2- Off-policy → Q-learning
Problem definition
Policy: deterministic / stochastic
Action: discrete / continuous
MDP / POMDP: state vs. observation
System dynamics (Markov property):
P(x_{k+1} \mid x_k, u_k) = P(x_{k+1} \mid x_0, u_0, x_1, u_1, \dots, x_k, u_k)
Value function
G_k = \sum_{i=0}^{\infty} \gamma^i r_{k+i+1}

V^\pi(x) = E_\pi[G_k \mid x_k = x] = E_\pi[r_{k+1} + \gamma G_{k+1} \mid x_k = x]
V^\pi(x) = \sum_u \pi(u \mid x) \Big[ r_{k+1} + \gamma \sum_{x'} P^u_{xx'} \, E[G_{k+1} \mid x_{k+1} = x'] \Big]
V^\pi(x) = \sum_u \pi(u \mid x) \Big[ r_{k+1} + \gamma \sum_{x'} P^u_{xx'} \, V^\pi(x') \Big]
V^\pi(x) = E_\pi[r_{k+1} + \gamma V^\pi(x_{k+1}) \mid x_k = x]
Actions: a = up (↑), b = down (↓), c = left (←), d = right (→); under the uniform random policy each action has probability 1/4.

V(1) = \frac{1}{4}\big[(a:\ -10) + (b:\ -1 + V(3)) + (c:\ -10) + (d:\ -1 + V(2))\big] = \frac{1}{4}\big(-22 + V(3) + V(2)\big)
V(2) = \frac{1}{4}\big[(a:\ -10) + (b:\ +10) + (c:\ -1 + V(1)) + (d:\ -10)\big] = \frac{1}{4}\big(-11 + V(1)\big)
V(3) = \frac{1}{4}\big[(a:\ -1 + V(1)) + (b:\ -10) + (c:\ -10) + (d:\ +10)\big] = \frac{1}{4}\big(-11 + V(1)\big)
Action-Value Function
Q^\pi(x, u) = E_\pi[G_k \mid x_k = x, u_k = u]
V^\pi(x) = \sum_u \pi(u \mid x)\, Q^\pi(x, u)
Q^\pi(x, u) = R(x, u) + \gamma \sum_{x'} P^u_{xx'}\, V^\pi(x')
Q^\pi(x, u) = R(x, u) + \gamma \sum_{x'} P^u_{xx'} \sum_{u'} \pi(u' \mid x')\, Q^\pi(x', u')
V^*(x) = \max_\pi V^\pi(x)
Actions: a = ↑, b = ↓, c = ←, d = →

V^*(1) = \max_u \big[-10,\ -1 + V^*(3),\ -10,\ -1 + V^*(2)\big]
V^*(2) = \max_u \big[-10,\ +10 + 0,\ -1 + V^*(1),\ -10\big] = V^*(3)

V^*(1) = \max(-1 + V^*(2),\ -10)
V^*(2) = \max(-10,\ +10,\ -1 + V^*(1)) = \max\big(-10,\ +10,\ -1 + \max(-1 + V^*(2),\ -10)\big)

V^*(1) = +9, \qquad V^*(2) = V^*(3) = +10

Compare with the uniform random policy: V^\pi(1) \approx -7.8,\ V^\pi(2) = V^\pi(3) \approx -4.7, versus V^*(1) = 9,\ V^*(2) = V^*(3) = 10.
Policy Evaluation
V^\pi(x) = \sum_u \pi(u \mid x)\Big(r_{k+1} + \gamma \sum_{x'} P^u_{xx'} V^\pi(x')\Big) = \sum_u \pi(u \mid x) \sum_{x'} P^u_{xx'}\big(R(x, u) + \gamma V^\pi(x')\big)

Iterative policy evaluation: start from an arbitrary V_0^\pi and repeat
V^\pi_{i+1}(x) = \sum_u \pi(u \mid x)\Big(r_{k+1} + \gamma \sum_{x'} P^u_{xx'} V^\pi_i(x')\Big)
until V^\pi_{i+1}(x) = V^\pi_i(x).

Bellman optimality equation:
V^*(x) = \max_u \sum_{x'} P^u_{xx'}\big(R(x, u) + \gamma V^*(x')\big)
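As a concrete check of the iterative policy evaluation above, here is a minimal Python sketch that iterates the three Bellman equations of the earlier grid example (uniform random policy, γ = 1) until convergence; the state names and tolerance are illustrative choices, not taken from the notes.

```python
# Iterative policy evaluation for the 3-state grid example under the
# uniform random policy (each action has probability 1/4, gamma = 1).
def evaluate_random_policy(tol=1e-10):
    V = {1: 0.0, 2: 0.0, 3: 0.0}            # arbitrary initial values
    while True:
        V_new = {
            1: 0.25 * (-22 + V[3] + V[2]),  # V(1) backup from the notes
            2: 0.25 * (-11 + V[1]),         # V(2) backup
            3: 0.25 * (-11 + V[1]),         # V(3) backup (mirrors V(2))
        }
        if max(abs(V_new[s] - V[s]) for s in V) < tol:
            return V_new
        V = V_new

print(evaluate_random_policy())
# Converges to roughly V(1) ≈ -7.86, V(2) = V(3) ≈ -4.71,
# matching the values quoted in the notes (≈ -7.8 and -4.7).
```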
Given V^\pi:
Q^\pi(x, u) = R(x, u) + \gamma \sum_{x'} P^u_{xx'} V^\pi(x')
\pi'(x) = \arg\max_u Q^\pi(x, u) \;\Rightarrow\; V^{\pi'}(x) \ge V^\pi(x)
since Q^\pi(x, \pi'(x)) \ge Q^\pi(x, u) for every u, and Q^\pi(x, \pi(x)) = V^\pi(x).

Is V^{\pi'}(x) \ge Q^\pi(x, \pi'(x))?
V^{\pi'}(x) = \max_u\Big[R(x, u) + \gamma \sum_{x'} P^u_{xx'} V^\pi(x')\Big] \ge V^\pi(x)
Policy Improvement
If |V^{\pi'}(x) - V^\pi(x)| \le \Delta (for all x), then \pi' = \pi and we stop; otherwise return to the first step (policy evaluation).

V^{\pi'}(x) = \sum_u \pi'(u \mid x) \sum_{x'} P^u_{xx'}\big(R(x, u) + \gamma V^{\pi'}(x')\big)

Grid example, improved policy (state 1 splits equally between ↓ and →):
V^{\pi'}(1) = \tfrac{1}{2}\big(V^{\pi'}(3) - 1\big) + \tfrac{1}{2}\big(V^{\pi'}(2) - 1\big) = V^{\pi'}(2) - 1
V^{\pi'}(2) = +10 + 0 = 10
Value Iteration
V_{k+1}(x) = \max_u \sum_{x'} P^u_{xx'}\big(R(x, u) + \gamma V_k(x')\big) \;\to\; V^*
\pi^*(x) = \arg\max_u \Big[R(x, u) + \gamma \sum_{x'} P^u_{xx'} V^*(x')\Big]
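A minimal tabular value-iteration sketch in Python, consistent with the update above; the MDP encoding (a dict of transition lists) and the toy example values are illustrative assumptions, not taken from the notes.

```python
# Tabular value iteration: V_{k+1}(x) = max_u sum_{x'} P(x'|x,u) [R(x,u) + gamma*V_k(x')]
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P[x][u] is a list of (prob, next_state); R[x][u] is the expected reward.
    V = {x: 0.0 for x in P}
    while True:
        V_new = {
            x: max(R[x][u] + gamma * sum(p * V[xn] for p, xn in P[x][u])
                   for u in P[x])
            for x in P
        }
        if max(abs(V_new[x] - V[x]) for x in P) < tol:
            break
        V = V_new
    # Greedy policy extraction: pi*(x) = argmax_u [R(x,u) + gamma * sum P V*]
    pi = {x: max(P[x], key=lambda u: R[x][u] + gamma * sum(p * V[xn] for p, xn in P[x][u]))
          for x in P}
    return V, pi

# Two-state toy MDP (illustrative): actions "stay" or "go".
P = {0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]},
     1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]}}
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 2.0, "go": 0.0}}
print(value_iteration(P, R))
```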
GPI (Generalized Policy Iteration)
ADP (Adaptive / Approximate Dynamic Programming)
Multi-armed Bandit
Example arm values: -5, +2, 0, +5, +1
Multi-armed bandit
Action space: 1..n (arms 1, 2, …, n-1, n)

Sample-average action-value estimate:
Q(u) = \frac{\sum_{k=0}^{N} r_k \, \mathbb{1}_{u_k = u}}{\sum_{k=0}^{N} \mathbb{1}_{u_k = u}}

ε-greedy policy: with probability ε = 0.1 choose a random action (10%); with probability 1 − ε act greedily (90%).

UCB-style exploration bonus: C\sqrt{\dfrac{\ln k}{N_k(u)}}
Incremental Implementation
Q(u) = \frac{\sum_k r_k \, \mathbb{1}_{u_k = u}}{\sum_k \mathbb{1}_{u_k = u}}

Q_n = \frac{1}{n}\sum_{k=1}^{n} r_k \;\to\; Q_{n+1} = \frac{1}{n+1}\sum_{k=1}^{n+1} r_k = \frac{1}{n+1}\Big[\sum_{k=1}^{n} r_k + r_{n+1}\Big]
= \frac{n}{n+1} Q_n + \frac{r_{n+1}}{n+1} = Q_n + \frac{1}{n+1}\big(r_{n+1} - Q_n\big)

(r_{n+1} - Q_n): estimation error
New estimate = old estimate + step size × error
\alpha_n = step size = \frac{1}{n} → simple average

Step-size (stochastic approximation) conditions:
\sum_n \alpha_n \to \infty, \qquad \sum_n \alpha_n^2 < \infty
Monte Carlo (MC):
V^\pi(x) \approx \frac{1}{N}\sum_{i=1}^{N} G_i, \qquad e.g.\ N \to 1000 episodes
First visit / Every visit
MC (Prediction) algorithm: accumulate the return backward through the episode, G \leftarrow r + \gamma G.
First visit: update V(x_i) only if x_i has not appeared among the earlier states of the episode.
DP: biased, lower variance, proceeds step by step, and at each step considers all possible successor outcomes.
MC Control
\pi? \;\rightsquigarrow\; Q^\pi \;\Rightarrow\; \pi' = \arg\max_u Q^\pi(x, u)
versus, via V^\pi: \pi' = \arg\max_u \Big[r(x, u) + \gamma \sum_{x'} P^u_{xx'} V^\pi(x')\Big], which requires the dynamics P^u_{xx'}.

ε-soft
Without exploration some action values are never estimated, e.g. Q(a, 2), … are visited but Q(a, 1) = ?

Exploring starts:
• Start each episode from a randomly chosen point.
- Exploring starts
- ε-greedy ← converge to the policy that is optimal among all ε-soft policies.
V^\pi(x) = E_\pi[G_k \mid x_k = x]

Off-policy example. Target policy \pi (whose value function we want to compute): 10% ←, 90% →. Behavior policy \pi' (which generates the episodes): 50% ←, 50% →. Ten episodes are collected under \pi': five go left (returns G_1, \dots, G_5) and five go right (returns G_6, \dots, G_{10}).

The plain average of the returns estimates the behavior policy's value:
V^{\pi'}(x) = \frac{1}{10}\sum_{i=1}^{10} G_i
Ordinary Importance Sampling
Importance-sampling ratios: 0.1/0.5 = 1/5 for the left episodes and 0.9/0.5 = 9/5 for the right episodes.
V^\pi(x) \;\rightsquigarrow\; \frac{\frac{1}{5}\sum_{i=1}^{5} G_i + \frac{9}{5}\sum_{i=6}^{10} G_i}{10}
Ordinary I.S. method → high variance / unbiased → extends more easily to continuous settings.
Weighted importance sampling divides by the sum of the ratios instead:
V^\pi(x) \;\rightsquigarrow\; \frac{\frac{1}{5}\sum_{i=1}^{5} G_i + \frac{9}{5}\sum_{i=6}^{10} G_i}{\frac{1}{5}\cdot 5 + \frac{9}{5}\cdot 5}
Weighted I.S. method → low variance / biased → preferred.
V_n = \frac{\sum_{i=1}^{n} w_i G_i}{\sum_{i=1}^{n} w_i} = \frac{\sum_{i=1}^{n} w_i G_i}{C_n}
\;\rightsquigarrow\;
V_{n+1} = \frac{\sum_{i=1}^{n+1} w_i G_i}{\sum_{i=1}^{n+1} w_i} = \frac{\sum_{i=1}^{n} w_i G_i + w_{n+1} G_{n+1}}{\sum_{i=1}^{n} w_i + w_{n+1}} = \frac{V_n C_n + w_{n+1} G_{n+1}}{C_n + w_{n+1}} = V_n + \frac{w_{n+1}}{C_n + w_{n+1}}\big(G_{n+1} - V_n\big)

V_{n+1} = V_n + \frac{w_{n+1}}{C_{n+1}}\big(G_{n+1} - V_n\big), \qquad C_{n+1} = C_n + w_{n+1}
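A short Python sketch of the incremental weighted importance-sampling update derived above; the class name and the example return/weight pairs are illustrative.

```python
# Incremental weighted importance sampling:
#   C_{n+1} = C_n + w_{n+1}
#   V_{n+1} = V_n + (w_{n+1} / C_{n+1}) * (G_{n+1} - V_n)
class WeightedISEstimator:
    def __init__(self):
        self.V = 0.0   # running weighted estimate
        self.C = 0.0   # cumulative sum of importance weights

    def update(self, G, w):
        if w == 0.0:          # a zero-weight return leaves the estimate unchanged
            return self.V
        self.C += w
        self.V += (w / self.C) * (G - self.V)
        return self.V

est = WeightedISEstimator()
for G, w in [(4.0, 1/5), (7.0, 9/5), (3.0, 1/5)]:   # illustrative (return, weight) pairs
    est.update(G, w)
print(est.V)
```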
P_\pi(\tau) = \pi(u_k \mid x_k)\, P^{u_k}_{x_k x_{k+1}}\, \pi(u_{k+1} \mid x_{k+1})\, P^{u_{k+1}}_{x_{k+1} x_{k+2}} \cdots
P_{\pi'}(\tau) = \pi'(u_k \mid x_k)\, P^{u_k}_{x_k x_{k+1}}\, \pi'(u_{k+1} \mid x_{k+1})\, P^{u_{k+1}}_{x_{k+1} x_{k+2}} \cdots

Importance sampling ratio = \frac{P_\pi(\tau)}{P_{\pi'}(\tau)} = \frac{\pi(u_k \mid x_k)}{\pi'(u_k \mid x_k)} \cdot \frac{\pi(u_{k+1} \mid x_{k+1})}{\pi'(u_{k+1} \mid x_{k+1})} \cdots = \prod_{i=k}^{K-1} \frac{\pi(u_i \mid x_i)}{\pi'(u_i \mid x_i)}

(The transition probabilities cancel, so the ratio depends only on the two policies.)
DP:
V^\pi(x) = E_\pi[r_{k+1} + \gamma V^\pi(x_{k+1})]
Requires the system model; DP is approximate because the estimated value of V is used inside the equation.

MC:
V^\pi(x) = E_\pi[G_k \mid x_k = x]
MC is approximate because the expectation is replaced by sampling.

TD:
Bootstraps and is sample-based, so it approximates for both reasons.
Example with two states A and B. Observed episodes:
A, 0, B, 0
B, 1   (repeated 6 times)
B, 0
V(B) = \frac{6}{8} = \frac{3}{4}
V(A) = 0 ← MC (minimizes the mean squared error on the observed returns)
V(A) = V(B) ← TD (certainty-equivalence estimate)
If TD estimates the model correctly, it converges to the optimal (certainty-equivalence) answer.
On-policy TD Control
SARSA:
Q_{n+1}(x_k, u_k) = Q_n(x_k, u_k) + \alpha\big[r_{k+1} + \gamma Q_n(x_{k+1}, u_{k+1}) - Q_n(x_k, u_k)\big]
Q-learning (off-policy):
Q_{n+1}(x_k, u_k) = Q_n(x_k, u_k) + \alpha\big[r_{k+1} + \gamma \max_{u} Q_n(x_{k+1}, u) - Q_n(x_k, u_k)\big]

TD vs. MC: TD has higher bias and lower variance; MC has lower bias and higher variance.
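A compact Python sketch of the tabular Q-learning update above (SARSA differs only in bootstrapping on the action actually taken at x_{k+1} instead of the max); the environment interface (`env.reset()`, `env.step()`) is an assumed Gym-style API, not something defined in the notes.

```python
import random
from collections import defaultdict

# Tabular Q-learning:
#   Q(x,u) <- Q(x,u) + alpha * [r + gamma * max_u' Q(x',u') - Q(x,u)]
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                      # Q[(state, action)], default 0
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < eps:
                u = random.choice(actions)
            else:
                u = max(actions, key=lambda a: Q[(x, a)])
            x_next, r, done = env.step(u)       # assumed Gym-like interface
            target = r + (0.0 if done else gamma * max(Q[(x_next, a)] for a in actions))
            Q[(x, u)] += alpha * (target - Q[(x, u)])   # TD update
            x = x_next
    return Q
```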
N-step TD:
G_{k:k+n} = r_{k+1} + \gamma r_{k+2} + \gamma^2 r_{k+3} + \cdots + \gamma^{n-1} r_{k+n} + \gamma^n V(x_{k+n})
G_{k:k+1} = r_{k+1} + \gamma V(x_{k+1})
Prediction:
𝑉𝑘+𝑛 (𝑥𝑘 ) = 𝑉𝑘+𝑛−1 (𝑥𝑘 ) + 𝛼(𝐺𝑘:𝑘+𝑛 − 𝑉𝑘+𝑛−1 (𝑥𝑘 ))
Error reduction property:
\max_x \big| E[G_{k:k+n} \mid x_k = x] - V^\pi(x) \big| \le \gamma^n \max_x \big| V_{k+n-1}(x) - V^\pi(x) \big|
n-step Sarsa:
𝑄(𝑥𝑘 , 𝑢𝑘 ) = 𝑄 (𝑥𝑘 , 𝑢𝑘 ) + 𝛼 [𝐺𝑘:𝑘+𝑛 − 𝑄 (𝑥𝑘 , 𝑢𝑘 )]
𝐺𝑘:𝑘+𝑛 = 𝑟𝑘+1 + 𝛾𝑟𝑘+2 + ⋯ + 𝛾 𝑛−1 𝑟𝑘+𝑛 + 𝛾 𝑛 𝑄 (𝑥𝑘+𝑛 , 𝑢𝑘+𝑛 )
n-step Expected Sarsa:
𝐺𝑘:𝑘+𝑛 = 𝑟𝑘+1 + 𝛾𝑟𝑘+2 + ⋯ + 𝛾 𝑛−1 𝑟𝑘+𝑛 + 𝛾 𝑛 ∑ 𝜋(𝑢|𝑥𝑘+𝑛 )𝑄 (𝑥𝑘+𝑛 , 𝑢)
Off-policy prediction:
V(x_k) = V(x_k) + \alpha\,\rho_{k:k+n}\big[G_{k:k+n} - V(x_k)\big]
n-step off-policy Sarsa control:
𝑄 (𝑥𝑘 , 𝑢𝑘 ) = 𝑄(𝑥𝑘 , 𝑢𝑘 ) + 𝛼𝜌𝑘+1:𝑘+𝑛 [𝐺𝑘:𝑘+𝑛 − 𝑄(𝑥𝑘 , 𝑢𝑘 )]
n-step off-policy Expected Sarsa:
𝑄 (𝑥𝑘 , 𝑢𝑘 ) = 𝑄(𝑥𝑘 , 𝑢𝑘 ) + 𝛼𝜌𝑘+1:𝑘+𝑛−1 [𝐺𝑘:𝑘+𝑛 − 𝑄(𝑥𝑘 , 𝑢𝑘 )]
\lambda-return: average the n-step returns G_{k:k+1}, G_{k:k+2}, \dots, G_{k:k+n}, \dots with geometrically decaying weights:
G_k^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{k:k+n}
= (1 - \lambda) \sum_{n=1}^{T-k-1} \lambda^{n-1} G_{k:k+n} + \lambda^{T-k-1} G_k \quad (episodic case)

Sarsa(\lambda), TD(\lambda)
G_k^{\lambda=0} = G_{k:k+1} ← TD(0)
G_k^{\lambda=1} = G_k ← MC
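As a quick illustration of the n-step return and the λ-return above, a small Python sketch; the reward list and value estimates are illustrative stand-ins.

```python
# n-step return: G_{k:k+n} = r_{k+1} + gamma*r_{k+2} + ... + gamma^{n-1}*r_{k+n}
#                           + gamma^n * V(x_{k+n})
def n_step_return(rewards, V_boot, gamma, n):
    G = sum(gamma**i * rewards[i] for i in range(n))
    return G + gamma**n * V_boot          # bootstrap on the value of x_{k+n}

# lambda-return, truncated at the episode end; here T is the number of
# remaining steps (T - k in the notes).
def lambda_return(rewards, values, gamma, lam):
    T = len(rewards)
    G_full = sum(gamma**i * rewards[i] for i in range(T))     # Monte Carlo return
    out = (1 - lam) * sum(lam**(n - 1) * n_step_return(rewards, values[n], gamma, n)
                          for n in range(1, T))
    return out + lam**(T - 1) * G_full

rewards = [1.0, 0.0, 2.0]                 # illustrative rewards r_{k+1..k+3}
values  = [0.0, 0.5, 0.3, 0.0]            # illustrative V(x_{k..k+3}); terminal value 0
print(lambda_return(rewards, values, gamma=0.9, lam=0.8))
# lam = 0 recovers the 1-step TD target; lam = 1 recovers the MC return.
```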
Function approximation
Action: \{+1 \to left,\ 0 \to right\}
x = [\theta,\ \dot\theta,\ y,\ \dot y]^T \;\rightsquigarrow\; V^\pi(x) \approx V(x; \phi)

Generalization
V(x; \phi) = \phi_1 x + \phi_2 x^2 + \cdots, \qquad \phi = [\phi_1,\ \phi_2,\ \dots]^T
Efficiency
Targets used with function approximation:
- DP → \sum_u \pi(u \mid x) \sum_{x'} P^u_{xx'}\big(r_{k+1} + \gamma V(x'; \phi)\big)
- MC → G_k
- TD(0) → r_{k+1} + \gamma V(x_{k+1}; \phi)
- N-step → G_{k:k+n}

Objective (mean squared value error, weighted by a state distribution \mu with \sum_x \mu(x) = 1):
\overline{VE}(\phi) = \sum_x \mu(x)\big[V^\pi(x) - V(x; \phi)\big]^2
Semi-gradient algorithm
- At each step of the episode:
  o Generate an action according to the policy \pi.
  o Compute the target U_k = r_{k+1} + \gamma V(x_{k+1}; \phi).
  o Update \phi \leftarrow \phi + \alpha\,\delta_k\,\nabla_\phi V(x_k; \phi), where \delta_k = U_k - V(x_k; \phi).
V(x; \phi) = \phi_1 f_1(x) + \phi_2 f_2(x) + \cdots = \phi^T f(x)
\phi = [\phi_1,\ \phi_2,\ \dots]^T, \qquad f(x) = [f_1(x),\ f_2(x),\ \dots]^T \quad (basis functions / features)
- Linear structure:
  o MC converges to the global optimum.
  o TD(0) converges, but not necessarily to the optimum:
    TD(0) estimation error \le \frac{1}{1 - \gamma} \times minimum possible error
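A minimal Python sketch of semi-gradient TD(0) with a linear feature structure, following the update φ ← φ + α δ_k ∇_φ V(x_k; φ) = φ + α δ_k f(x_k); the feature map, step size, and toy transition data are illustrative assumptions.

```python
import numpy as np

# Semi-gradient TD(0) with linear value function V(x; phi) = phi^T f(x).
def semi_gradient_td0(transitions, feature_fn, n_features, alpha=0.05, gamma=0.99):
    phi = np.zeros(n_features)
    for x, r, x_next, done in transitions:          # stream of (x_k, r_{k+1}, x_{k+1}, done)
        f_x = feature_fn(x)
        v_next = 0.0 if done else phi @ feature_fn(x_next)
        delta = r + gamma * v_next - phi @ f_x       # TD error delta_k
        phi += alpha * delta * f_x                   # gradient of V w.r.t. phi is f(x_k)
    return phi

# Illustrative polynomial features f(x) = [1, x, x^2] for a scalar state.
poly = lambda x: np.array([1.0, x, x * x])
transitions = [(0.0, 1.0, 0.5, False), (0.5, 0.0, 1.0, True)]   # toy data
print(semi_gradient_td0(transitions, poly, n_features=3))
```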
- Types of features:
  1) Polynomials:
     x,\ x^2,\ \dots
     x_1,\ x_1 x_2,\ x_2,\ \dots
  2) Fourier series:
     \sin(\omega_1 x),\ \cos(\omega_1 x),\ \sin(2\omega_1 x),\ \cos(2\omega_1 x),\ \dots
  3) Tile coding
- Activation functions:
  1) Sigmoid: \dfrac{1}{1 + e^{-x}}
  2) Tanh: \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
  3) ReLU (and the dying-ReLU problem)
One hidden layer:
y_j = f\Big(\sum_i w_{ij} x_i + b_j\Big)
z = \sum_j m_j y_j = \sum_j m_j f\Big(\sum_i w_{ij} x_i + b_j\Big)
E_k = (U_k - Z_k)^2, \qquad E = \frac{1}{N} \sum_k E_k

\frac{\partial Z_k}{\partial b_j} = m_j\, f'\Big(\sum_i w_{ij} x_i + b_j\Big)
\frac{\partial Z_k}{\partial w_{ij}} = \frac{\partial Z_k}{\partial f}\,\frac{\partial f}{\partial w_{ij}} = m_j\, f'\Big(\sum_i w_{ij} x_i + b_j\Big)\, x_i
\frac{\partial Z_k}{\partial m_j} = y_j
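The derivatives above can be checked numerically; below is a small NumPy sketch that computes z and the three gradients for a one-hidden-layer network with a tanh activation (the layer sizes and random inputs are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
x = rng.normal(size=n_in)
W = rng.normal(size=(n_in, n_hidden))   # w_{ij}
b = rng.normal(size=n_hidden)           # b_j
m = rng.normal(size=n_hidden)           # output weights m_j

pre = x @ W + b                          # sum_i w_{ij} x_i + b_j
y = np.tanh(pre)                         # y_j = f(...)
z = m @ y                                # z = sum_j m_j y_j

fprime = 1.0 - np.tanh(pre) ** 2         # f'(...) for tanh
dz_db = m * fprime                       # dz/db_j   = m_j f'(.)
dz_dW = np.outer(x, m * fprime)          # dz/dw_{ij} = m_j f'(.) x_i
dz_dm = y                                # dz/dm_j   = y_j
print(z, dz_db, dz_dW.shape, dz_dm)
```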
Vanishing gradient
Fully connected vs. convolutional NN (CNN)
Overfitting vs. generalization; common remedies:
1- Cross validation (Early stopping)
2- Dropout
3- Batch normalization
4- Regularization
E_k = (U_k - Z_k)^2, \qquad E = \frac{1}{N} \sum_k E_k + (network-complexity penalty)
5- Deep Residual
ResNet
V^* = \sum_u \pi^*(x, u)\, Q^*(x, u)
\pi^* = \arg\max_u Q^*(x, u)
Nonlinear function approximator
Convolutional NN (CNN)
Recurrent NN (RNN): h_k = f(x_k, h_{k-1}) = f(x_k, x_{k-1}, \dots)
Compare with a state-space model: x_{k+1} = f(x_k, u_k) = A x_k + B u_k
Vanishing Gradient
Long Short-Term Memory
𝑓𝑘 : 𝑓𝑜𝑟𝑔𝑒𝑡 𝑔𝑎𝑡𝑒 = 𝜎(𝑤𝑓 [ℎ𝑘−1 , 𝑥𝑘 ] + 𝑏𝑓 )
𝑖𝑘 : 𝐼𝑛𝑝𝑢𝑡 𝑔𝑎𝑡𝑒 = 𝜎(𝑤𝑖 [ℎ𝑘−1 , 𝑥𝑘 ] + 𝑏𝑖 )
𝑜𝑘 : 𝑂𝑢𝑡𝑝𝑢𝑡 𝑔𝑎𝑡𝑒 = 𝜎(𝑤𝑜 [ℎ𝑘−1 , 𝑥𝑘 ] + 𝑏𝑜 )
𝐶𝑘 = 𝑓𝑘 𝐶𝑘−1 + 𝑖𝑘 𝐶̃𝑘
𝐶̃𝑘 = tanh(𝑤𝑐 [ℎ𝑘−1 , 𝑥𝑘 ] + 𝑏𝑐 )
ℎ𝑘 = 𝑜𝑘 tanh(𝐶𝑘 )
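Implementing the LSTM equations above directly in NumPy (elementwise products for the gates; the dimensions and random parameters are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One LSTM step following the notes:
#   f_k = sigma(W_f [h_{k-1}, x_k] + b_f),  i_k, o_k analogous,
#   C~_k = tanh(W_c [h_{k-1}, x_k] + b_c),
#   C_k = f_k * C_{k-1} + i_k * C~_k,   h_k = o_k * tanh(C_k)
def lstm_step(x, h_prev, C_prev, params):
    z = np.concatenate([h_prev, x])                 # [h_{k-1}, x_k]
    f = sigmoid(params["Wf"] @ z + params["bf"])    # forget gate
    i = sigmoid(params["Wi"] @ z + params["bi"])    # input gate
    o = sigmoid(params["Wo"] @ z + params["bo"])    # output gate
    C_tilde = np.tanh(params["Wc"] @ z + params["bc"])
    C = f * C_prev + i * C_tilde                    # cell state
    h = o * np.tanh(C)                              # hidden state
    return h, C

rng = np.random.default_rng(0)
n_x, n_h = 3, 4
params = {k: rng.normal(size=(n_h, n_h + n_x)) for k in ("Wf", "Wi", "Wo", "Wc")}
params.update({k: np.zeros(n_h) for k in ("bf", "bi", "bo", "bc")})
h, C = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), params)
print(h, C)
```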
V(x) = \phi^T f(x), \qquad f(x) = [f_1(x)\ \ f_2(x)\ \ \dots\ \ f_n(x)]^T

Least-squares TD (LSTD):
A\,[\phi_1,\ \phi_2,\ \dots,\ \phi_n]^T = b, \qquad A = \sum_k f(x_k)\,[f(x_k) - \gamma f(x_{k+1})]^T, \qquad b = \sum_k f(x_k)\, r_{k+1}

TD fixed point:
[\phi_1,\ \phi_2,\ \dots,\ \phi_n]^T = A^{-1} b
Alternatively, iterate \phi \leftarrow \phi + \alpha(\cdots), which requires choosing the step size \alpha; LSTD solves for \phi directly.
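A direct NumPy implementation of the LSTD equations above (A and b accumulated from sampled transitions, then φ = A⁻¹b); the feature map and toy transitions are illustrative, and a small ridge term is an added assumption for numerical safety.

```python
import numpy as np

# LSTD: A = sum_k f(x_k)[f(x_k) - gamma f(x_{k+1})]^T,  b = sum_k f(x_k) r_{k+1},
# then phi = A^{-1} b (here with a tiny ridge term for invertibility).
def lstd(transitions, feature_fn, n_features, gamma=0.99, ridge=1e-6):
    A = ridge * np.eye(n_features)
    b = np.zeros(n_features)
    for x, r, x_next, done in transitions:
        f = feature_fn(x)
        f_next = np.zeros(n_features) if done else feature_fn(x_next)
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)       # phi such that A phi = b

poly = lambda x: np.array([1.0, x, x * x])            # illustrative features
data = [(0.0, 1.0, 0.5, False), (0.5, 0.5, 1.0, False), (1.0, 0.0, 0.0, True)]
print(lstd(data, poly, n_features=3))
```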
• Non-parametric methods:
Memory-based approaches:
V(x) = \frac{1}{k} \sum_{i=1}^{k} V(x_i)
V(x) = \frac{\sum_i w_i V(x_i)}{\sum_i w_i}, \qquad w_i proportional to how close x_i is to x
k-NN
• Kernel-based approaches:
V(x) = \sum_i k(x_i, x)\, V(x_i)
k(x_i, x) = \exp\Big(\frac{-\lVert x - x_i \rVert^2}{2\sigma^2}\Big)

Linear ↔ kernel-based: k(x_i, x) = f(x_i)^T f(x)
[x_i,\ V(x_i) = \phi^T f(x_i)] \;\to\; V(x) = \sum_i k(x_i, x)\, V(x_i) = \sum_i f(x_i)^T f(x)\, V(x_i)

Kernel trick
→ RBFSampler
→ Nystroem
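A short NumPy sketch of the Gaussian-kernel value estimate above from stored samples; the stored states and values are illustrative, and the normalized (weighted-average) variant mentioned in the comment is an added option, not something the notes specify.

```python
import numpy as np

# Kernel-based value estimate from stored samples (x_i, V(x_i)):
#   V(x) = sum_i k(x_i, x) V(x_i),  k(x_i, x) = exp(-||x - x_i||^2 / (2 sigma^2))
def kernel_value(x, xs, vs, sigma=0.5):
    xs, vs = np.asarray(xs, float), np.asarray(vs, float)
    d2 = np.sum((xs - x) ** 2, axis=1)            # squared distances ||x - x_i||^2
    k = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian kernel weights
    return k @ vs                                 # (divide by k.sum() for a weighted average)

xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]         # illustrative stored states
vs = [1.0, 2.0, 3.0]                              # illustrative stored values
print(kernel_value(np.array([0.2, 0.1]), xs, vs))
```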
HJB:
-\frac{\partial V^*}{\partial t} = \min_u \Big[L(t, x, u) + \frac{\partial V^*}{\partial x}^T f(x, u)\Big]

• V, L have no explicit time dependence.
• Affine dynamics: \dot x = f(x, u) = F(x) + B(x) u

0 = \min_u \Big[L(x, u) + \frac{\partial V^*}{\partial x}^T \big(F(x) + B(x) u\big)\Big]

Hamiltonian: H(x, \lambda, u) = L(x, u) + \lambda^T \big(F(x) + B(x) u\big)
HJB: \min_u H\Big(x, \frac{\partial V^*}{\partial x}, u\Big) = 0

Cost: L(x, u) = x^T Q x + u^T R u
LTI system → HJB → algebraic Riccati equation

Assumptions: affine dynamics; L and V have no explicit time dependence; quadratic cost.

\min_u \Big[x^T Q x + u^T R u + \frac{\partial V^*}{\partial x}^T \big(F(x) + B(x) u\big)\Big] = 0

Differentiating with respect to u:
2 R u^* + B(x)^T \frac{\partial V^*}{\partial x} = 0 \;\Rightarrow\; u^* = -\frac{1}{2} R^{-1} B(x)^T \frac{\partial V^*}{\partial x}

Substituting back:
x^T Q x - \frac{1}{4} \frac{\partial V^*}{\partial x}^T B R^{-1} B^T \frac{\partial V^*}{\partial x} + \frac{\partial V^*}{\partial x}^T F(x) = 0
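For the linear-quadratic special case mentioned above (LTI dynamics, quadratic cost), the HJB equation reduces to the algebraic Riccati equation, and V*(x) = xᵀPx gives u* = -R⁻¹BᵀPx. A hedged SciPy sketch under these assumptions follows; the A, B, Q, R matrices are illustrative.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# LTI + quadratic cost: V*(x) = x^T P x, with P solving the continuous-time ARE,
# and u* = -1/2 R^{-1} B^T dV*/dx = -R^{-1} B^T P x.
A = np.array([[0.0, 1.0], [0.0, -0.5]])      # illustrative plant
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)              # optimal gain: u* = -K x
print("P =", P)
print("K =", K)

# HJB residual check at a sample state x: it should be ~0 for the LQR solution.
x = np.array([1.0, -2.0])
dVdx = 2.0 * P @ x
u = -K @ x
residual = x @ Q @ x + u @ R @ u + dVdx @ (A @ x + B @ u)
print("HJB residual:", residual)
```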
\dot x = f(x, u), \qquad V(x) = \int_t^{\infty} L(x, u)\, dt, \qquad u = -k(x)

-\frac{\partial V^*}{\partial t} = \min_u \Big\{L(t, x, u) + \frac{\partial V^*}{\partial x}^T f(x, u)\Big\}

• Affine: \dot x = F(x) + B(x) u
• L and V have no explicit time dependence.
• L(x, u) = x^T Q x + u^T R u

0 = \min_u \Big\{x^T Q x + u^T R u + \frac{\partial V^*}{\partial x}^T \big(F(x) + B(x) u\big)\Big\}
H(x, \lambda, u) = x^T Q x + u^T R u + \lambda^T \big(F(x) + B(x) u\big)
\min_u H\Big(x, \frac{\partial V^*}{\partial x}, u\Big) = 0 \;\to\; \frac{\partial H}{\partial u} = 0 \;\to\; u^* = -\frac{1}{2} R^{-1} B^T \frac{\partial V^*}{\partial x}

HJB (H_2):
x^T Q x - \frac{1}{4} \frac{\partial V^*}{\partial x}^T B R^{-1} B^T \frac{\partial V^*}{\partial x} + \frac{\partial V^*}{\partial x}^T F(x) = 0
H_\infty: \dot x = F(x) + B(x) u + D(x) w
L(x, u) = x^T Q x + u^T R u - w^T P w

HJI:
\min_u \max_w \Big\{x^T Q x + u^T R u - w^T P w + \frac{\partial V^*}{\partial x}^T \big(F(x) + B(x) u + D(x) w\big)\Big\} = 0

\to \quad w^* = \frac{1}{2} P^{-1} D^T \frac{\partial V^*}{\partial x}, \qquad u^* = -\frac{1}{2} R^{-1} B^T \frac{\partial V^*}{\partial x}
Function approximation \rightsquigarrow ADP

\min_u H\Big(x, \frac{\partial V^*}{\partial x}, u\Big) = 0, \qquad u^* = -\frac{1}{2} R^{-1} B^T \frac{\partial V^*}{\partial x}, \qquad V(x) = \int_t^{\infty} L(x, u)\, dt
Approximate the value function with basis functions \mu(x):
\hat V(x) = \hat\phi^T \mu(x) \;\Rightarrow\; \frac{\partial \hat V}{\partial x} = \Big[\frac{\partial \hat V}{\partial x_1},\ \dots,\ \frac{\partial \hat V}{\partial x_m}\Big]^T = \Big(\frac{\partial \mu}{\partial x}\Big)^T \hat\phi

\frac{\partial \mu}{\partial x} = \begin{bmatrix} \partial\mu_1/\partial x_1 & \partial\mu_1/\partial x_2 & \cdots & \partial\mu_1/\partial x_m \\ \vdots & & & \vdots \\ \partial\mu_N/\partial x_1 & & \cdots & \partial\mu_N/\partial x_m \end{bmatrix}
e_V = H\Big(x, \frac{\partial \hat V}{\partial x}, u\Big) - H\Big(x, \frac{\partial V}{\partial x}, u\Big)
e_V = L(x, u) + \hat\phi^T \frac{\partial \mu}{\partial x}\big(F(x) + B(x) u\big)

\beta = \frac{\partial \mu}{\partial x}\big(F(x) + B(x) u\big)

E_V = \frac{1}{2} e_V^2 \;\rightsquigarrow\; \dot{\hat\phi} = -\alpha \frac{\partial E_V}{\partial \hat\phi} = -\alpha\, e_V\, \beta

Normalized: \dot{\hat\phi} = -\alpha \frac{\beta}{(1 + \beta^T \beta)^2}\, e_V
H\Big(x, \frac{\partial V}{\partial x}, u\Big) = 0 = x^T Q x + u^T R u + \frac{\partial V}{\partial x}^T \big(F(x) + B(x) u\big)
u(x) = -\frac{1}{2} R^{-1} B^T \frac{\partial V}{\partial x}

E_v = \frac{1}{2} e_V^2, \qquad \dot{\hat\phi} = -\alpha \frac{\partial E_v}{\partial \hat\phi} = -\alpha\, e_v\, \beta \;\to\; \dot{\hat\phi} = -\frac{\alpha\beta}{(1 + \beta^T \beta)^2}\, e_v
\dot{\hat\phi} = -\frac{\alpha\beta}{(1 + \beta^T \beta)^2}\Big[\hat\phi^T\beta - \Big(\phi^T\beta + \frac{\partial\epsilon}{\partial x}^T\big(F(x) + B(x) u\big)\Big)\Big]

With the weight error \tilde\phi = \hat\phi - \phi:
\dot{\tilde\phi} = -\frac{\alpha\beta}{(1 + \beta^T \beta)^2}\Big[\tilde\phi^T\beta - \frac{\partial\epsilon}{\partial x}^T\big(F(x) + B(x) u\big)\Big]
\dot{\tilde\phi} = -\frac{\alpha\,\beta\beta^T}{(1 + \beta^T \beta)^2}\,\tilde\phi + \cdots

\tilde\phi \to 0 \quad if \quad \lambda_{min}(\beta\beta^T) > 0 \quad (PE: persistent excitation)
\beta = \frac{\partial \mu}{\partial x}\big(F(x) + B(x) u\big)

\dot{\hat\phi} = -\frac{\alpha\,\beta\, e_v}{(1 + \beta^T \beta)^2} + compensation term → guarantees stability during the transient toward the optimal solution

u = -\frac{1}{2} R^{-1} B^T \frac{\partial \mu}{\partial x}^T \hat\phi + compensation term
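A numerical sketch of one normalized critic-update step from the equations above, using polynomial basis functions μ(x); the dynamics F, B, the basis, the gains, and the Euler step are illustrative assumptions, and the actor compensation term is omitted.

```python
import numpy as np

# One step of the normalized critic update:
#   beta = (dmu/dx) (F(x) + B(x) u),   e_V = L(x, u) + phi_hat^T beta,
#   phi_hat_dot = -alpha * beta * e_V / (1 + beta^T beta)^2
def critic_step(x, phi_hat, F, B, dmu_dx, Q, R, alpha=1.0, dt=0.01):
    dmu = dmu_dx(x)                                             # N x m Jacobian of the basis
    u = -0.5 * np.linalg.solve(R, B(x).T @ (dmu.T @ phi_hat))   # u = -1/2 R^-1 B^T dV/dx
    xdot = F(x) + B(x) @ u
    beta = dmu @ xdot
    e_V = x @ Q @ x + u @ R @ u + phi_hat @ beta                # Hamiltonian residual
    phi_dot = -alpha * beta * e_V / (1.0 + beta @ beta) ** 2
    return phi_hat + dt * phi_dot, u

# Illustrative setup: x in R^2, mu(x) = [x1^2, x1*x2, x2^2].
F = lambda x: np.array([x[1], -x[0]])
B = lambda x: np.array([[0.0], [1.0]])
dmu_dx = lambda x: np.array([[2 * x[0], 0.0], [x[1], x[0]], [0.0, 2 * x[1]]])
Q, R = np.eye(2), np.eye(1)
phi_hat = np.zeros(3)
phi_hat, u = critic_step(np.array([1.0, -1.0]), phi_hat, F, B, dmu_dx, Q, R)
print(phi_hat, u)
```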
V(x) = \int_t^{\infty} L(x, u)\, d\tau \;\Rightarrow\; \dot V = -L(x, u) \;\rightsquigarrow\; \int_t^{t+T} \dot V\, d\tau = -\int_t^{t+T} L(x, u)\, d\tau
V(t + T) - V(t) = -\int_t^{t+T} L(x, u)\, d\tau

V = \phi^T \mu(x) + \epsilon(x), \qquad \hat V = \hat\phi^T \mu(x)
\hat\phi^T \mu(t + T) - \hat\phi^T \mu(t) = -y \;\to\; \hat\phi^T\big[\mu(t) - \mu(t + T)\big] = \hat\phi^T\theta = y

e = \hat\phi^T \theta - y
E = \int \big(\hat\phi^T \theta - y\big)^2 d\tau
\frac{\partial E}{\partial \hat\phi} = 0 = \int 2\big(\hat\phi^T \theta - y\big)\,\frac{\partial e}{\partial \hat\phi}\, d\tau = \int 2\, e\, \theta\, d\tau = 0
Weighted Residual
\int \big(\hat\phi^T \theta - y\big)\, \theta\, d\tau = 0
\langle f, g \rangle = \int f^T g\, dx, \qquad \Big\langle e, \frac{\partial e}{\partial \hat\phi} \Big\rangle = 0

\int \big(\hat\phi^T \theta\big)\, \theta\, d\tau = \int y\, \theta\, d\tau
\hat\phi = \Big(\int \theta \theta^T d\tau\Big)^{-1} \Big(\int \theta\, y\, d\tau\Big) = \Big(\sum_i \theta_i \theta_i^T\Big)^{-1} \Big(\sum_i \theta_i\, y_i\Big)

\theta = \mu(t) - \mu(t + T), \qquad y = \int_t^{t+T} L(x, u)\, d\tau
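The batch least-squares solution for the critic weights can be computed directly from a set of sampled pairs (θ_i, y_i); a NumPy sketch with illustrative data follows (the small ridge term is an added assumption for invertibility).

```python
import numpy as np

# phi_hat = (sum_i theta_i theta_i^T)^-1 (sum_i theta_i y_i)
def fit_critic(thetas, ys, ridge=1e-8):
    thetas = np.asarray(thetas, float)        # shape (n_samples, n_basis)
    ys = np.asarray(ys, float)                # shape (n_samples,)
    A = thetas.T @ thetas + ridge * np.eye(thetas.shape[1])
    b = thetas.T @ ys
    return np.linalg.solve(A, b)

# Illustrative samples: theta_i = mu(t_i) - mu(t_i + T), y_i = integral of L over [t_i, t_i + T].
thetas = [[0.3, -0.1, 0.05], [0.1, 0.2, -0.02], [-0.2, 0.1, 0.08]]
ys = [0.4, 0.1, 0.2]
print(fit_critic(thetas, ys))
```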
On-policy relation: \dot V = -L(x, u)
\frac{\partial V}{\partial x}^T \dot x = -x^T Q x - u_T^T R\, u_T

Data are generated by a behavior policy u_b, so the trajectory follows \dot x = F(x) + B(x) u_b, while the value being learned is that of the target policy u_T:
V(x) = \int_t^{\infty} L(x, u_T)\, d\tau, \qquad V(x + dx) = \int_t^{\infty} L(x + dx, u_T)\, d\tau

\dot V = \frac{\partial V}{\partial x}^T \frac{dx}{dt} = \frac{\partial V}{\partial x}^T \big(F(x) + B(x) u_b\big) \stackrel{?}{=} -x^T Q x - u_T^T R\, u_T
Decompose the behavior input:
\dot V = \frac{\partial V}{\partial x}^T \big(F(x) + B(x) u_b\big) = \frac{\partial V}{\partial x}^T \big[F(x) + B(x) u_T + B(x)(u_b - u_T)\big]
\frac{\partial V}{\partial x}^T \big[F(x) + B(x) u_T\big] = -x^T Q x - u_T^T R\, u_T

\int_t^{t+T} \dot V\, d\tau = \int_t^{t+T} \Big[-x^T Q x - u_T^T R\, u_T + \frac{\partial V}{\partial x}^T B(x)(u_b - u_T)\Big] d\tau
V(t + T) - V(t) = -\int_t^{t+T} \Big[x^T Q x + u_T^T R\, u_T - \frac{\partial V}{\partial x}^T B(x)(u_b - u_T)\Big] d\tau

\hat\phi^T \theta = y, \qquad \theta = \mu(t) - \mu(t + T)