Lecture 37 - Deep Deterministic Policy Gradient (DDPG)

The document discusses the Deep Deterministic Policy Gradient (DDPG) algorithm in reinforcement learning, highlighting its advantages over the Deep Q-Network (DQN) in handling continuous action spaces. It explains the limitations of DQN, particularly its inability to efficiently manage high-dimensional and continuous domains, and introduces DDPG as a solution that incorporates policy gradient methods. The lecture also covers concepts such as exploration vs. exploitation, the role of replay buffers, and the actor-critic framework in reinforcement learning.


AI-832 Reinforcement Learning

Instructor: Dr. Zuhair Zafar

Lecture # 37: Deep Deterministic Policy Gradient (DDPG)


Recap

• Exploration vs. Exploitation

• Regret

• Optimistic Initializations

• Optimism in the Face of Uncertainty


Today’s Agenda

• DDPG
DQN to DDPG: DQN algorithm

• DQN is capable of human-level performance on many Atari games.

• Off-policy training: the replay buffer breaks the correlation between samples collected by the agent (a minimal buffer sketch follows below).
• High-dimensional observations: a deep neural network extracts features from high-dimensional inputs.
• Learning stability: the target network makes the training process stable.
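A minimal replay buffer sketch, assuming plain Python; the class name, capacity, and batch size are illustrative, not from the lecture:

```python
# Minimal uniform replay buffer illustrating the off-policy component above.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)
```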

Notation: $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward to go.

Q-learning update:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big]$

DQN update:
$Q_{\pi_\theta}(s_t, a_t) \leftarrow Q_{\pi_\theta}(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_a Q_{\pi_{\theta'}}(s_{t+1}, a) - Q_{\pi_\theta}(s_t, a_t)\big]$

Policy ($\pi$): $a_t = \arg\max_a Q_{\pi_\theta}(s_t, a_t)$

[Diagram: the agent acts in the environment with $a_t = \arg\max_a Q(s_t, a_t; \theta)$; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer; sampled transitions are used to update the Q network $Q(s_t, a_t; \theta)$ with the DQN loss computed from the target Q network $\max_a Q(s_{t+1}, a; \theta')$, which is periodically copied from the Q network.]
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
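The DQN update above can be sketched in code. This is a minimal illustration assuming PyTorch; the network sizes, optimizer, and variable names are hypothetical, not the lecture's implementation:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, a; theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, a; theta')
target_net.load_state_dict(q_net.state_dict())                             # periodic copy of theta -> theta'
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s_next, done):
    # s: [B, 4], a: [B] (int64), r: [B], s_next: [B, 4], done: [B] sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s_t, a_t; theta)
    with torch.no_grad():
        # TD target: r + gamma * max_a Q(s_{t+1}, a; theta')
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```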



DQN to DDPG: Limitation of DQN (discrete action spaces)

• Discrete action spaces

- DQN can only handle discrete and low-dimensional action spaces.
- As the action dimension grows, the number of discrete actions (output nodes) grows exponentially.
- i.e. $k$ discretized values per dimension over $n$ dimensions -> $k^n$ actions (see the sketch below).

• DQN cannot be straightforwardly applied to continuous domains.

• Why? Both the policy and the update require a maximization over the action, which is itself a non-trivial optimization when the action space is continuous:
  1. Policy ($\pi$): $a_t = \arg\max_a Q_{\pi_\theta}(s_t, a_t)$
  2. Update: $Q_{\pi_\theta}(s_t, a_t) \leftarrow Q_{\pi_\theta}(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_a Q_{\pi_{\theta'}}(s_{t+1}, a) - Q_{\pi_\theta}(s_t, a_t)\big]$

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
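A quick illustration of the exponential growth in the number of discretized actions; the values of k and n are hypothetical:

```python
# Exponential blow-up when discretizing a continuous action space for DQN.
k = 5                     # discretization levels per action dimension
for n in (1, 2, 3, 7):    # action dimensions, e.g. a 7-DoF robot arm
    print(f"{n} dims -> {k**n} discrete actions")
# 1 dims -> 5, 2 dims -> 25, 3 dims -> 125, 7 dims -> 78125 output nodes
```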



DDPG: DQN with Policy gradient methods

[Diagram: Q-learning evolves into DQN by adding (1) a replay buffer, (2) a deep neural network, and (3) a target network. Policy gradient (REINFORCE) evolves into actor-critic and then DPG, which handles continuous action spaces. DQN and DPG combine into DDPG.]

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).



Policy gradient: The goal of reinforcement learning

[Diagram: at each step the agent observes state $s_t$ and samples an action $a_t$ from a stochastic policy $\pi_\theta(a_t|s_t)$ with weights $\theta$; the world returns reward $r_t$ and next state $s_{t+1}$ according to the transition model $p(s_{t+1}|s_t, a_t)$.]

Markov decision process: trajectory distribution
$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$

Goal of reinforcement learning
$\theta^* = \arg\max_\theta \, E_{\tau \sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$

objective: $J(\theta)$
Policy gradient: REINFORCE

• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t|s_t)$

[Diagram: the policy network maps the state $s_t$ to a probability distribution $\pi_\theta(a_t|s_t)$ over discrete actions, e.g. probabilities 0.2, 0.4, 0.1, 0.2, 0.1.]
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
Policy gradient: REINFORCE

• REINFORCE models the policy as a stochastic decision: $a_t \sim \pi_\theta(a_t|s_t)$

Objective ($\theta$: weights of the actor network):
$J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$

Policy gradient:
$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\big(\textstyle\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\big)\big(\sum_{t=1}^{T} r(s_t, a_t)\big)\Big]$

Monte-Carlo estimate over $N$ episodes:
$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)\Big)\Big(\sum_{t=1}^{T} r(s_t^i, a_t^i)\Big)$

Update ($\alpha$: learning rate):
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

The agent must experience complete episodes before it can update.

Problems:
1. Slow training process
2. High gradient variance

Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
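A minimal REINFORCE sketch implementing the Monte-Carlo gradient estimate and update above, assuming PyTorch and Gymnasium's CartPole-v1 environment (both are assumptions, not part of the lecture):

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # pi_theta(a|s)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)   # stochastic policy
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # grad J(theta) ~ (sum_t grad log pi(a_t|s_t)) * (sum_t r_t): one full episode per update,
    # which is slow and has high gradient variance, as the slide notes.
    episode_return = sum(rewards)
    loss = -torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```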
Policy gradient: Actor-critic

REINFORCE's problems (high gradient variance, slow training) are addressed by replacing the Monte-Carlo return with a learned critic $Q_\phi(s_t, a_t)$, where $\phi$ are the weights of the critic network.

• Actor ($\pi_\theta(a_t|s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic.
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions.

1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t|s_t)$ $i$ times.
2. Fit $Q_\phi(s_t, a_t)$ to the sampled data.
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q_\phi(s_t, a_t)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

[Diagram: the actor $\pi_\theta(a_t|s_t)$ acts in the environment; the sampled transitions $(s_t, a_t, r_t, s_{t+1})_{0 \ldots i}$ update the critic $Q_\phi(s_t, a_t)$, which in turn provides $\nabla J(\theta)$ to update the actor.]

Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
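A minimal actor-critic update sketch following steps 1-4 above, assuming PyTorch; the discrete-action setting and all sizes and names are illustrative assumptions, not the lecture's implementation:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))   # pi_theta(a|s)
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # Q_phi(s, .)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_update(s, a, r, s_next, a_next):
    # s, s_next: [B, obs_dim]; a, a_next: [B] (int64); r: [B] -- transitions sampled from pi_theta
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)                  # Q_phi(s_t, a_t)
    with torch.no_grad():
        q_next = critic(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q_sa, r + gamma * q_next)         # step 2: fit the critic
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -(dist.log_prob(a) * q_sa.detach()).mean()                # step 3: grad log pi * Q_phi
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()         # step 4
```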



Policy gradient: DPG

• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$

[Diagram: stochastic policy $\pi_\theta(a_t|s_t)$ vs. deterministic policy $\mu_\theta(s_t)$ for a 2-dimensional action $(a_x, a_y)$.]

• Stochastic policy: 10 outputs are needed for a 2-dimensional action discretized into 5 values per dimension.
• Deterministic policy: only 2 outputs $(a_x, a_y)$ are needed.

Silver, David, et al. "Deterministic policy gradient algorithms." 2014.



Policy gradient: DPG

• Deterministic policy gradient (DPG) models the actor policy as a deterministic decision: $a_t = \mu_\theta(s_t)$

Trajectory distribution
Stochastic: $p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$
Deterministic: $p_\theta(s_1, s_2, \ldots, s_T) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, \mu_\theta(s_t))$

Objective
Stochastic: $J(\theta) = E_{s, a \sim p_\theta(\tau)}[Q_\phi(s_t, a_t)]$
Deterministic: $J(\theta) = E_{s \sim p_\theta(\tau)}[Q(s, \mu_\theta(s))]$

1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\mu_\theta(s)$ $i$ times.
2. Fit $Q_\phi(s_t, a_t)$ to the samples by minimizing the TD error $L = r_t + \gamma Q_\phi(s_{t+1}, \mu_\theta(s_{t+1})) - Q_\phi(s_t, a_t)$ (in practice, its square).
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q_\phi(s_t, a)\big|_{a=\mu_\theta(s_t)}$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
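A minimal sketch of the DPG actor update (step 3 above), assuming PyTorch; dimensions and names are illustrative. Autograd applies the chain rule $\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\phi(s,a)|_{a=\mu_\theta(s)}$ automatically when $Q_\phi(s, \mu_\theta(s))$ is maximized:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())  # mu_theta(s)
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))        # Q_phi(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def dpg_actor_update(s):
    # Maximize Q_phi(s, mu_theta(s)) with respect to the actor weights theta only.
    a = actor(s)
    actor_loss = -critic(torch.cat([s, a], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```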



DDPG: DQN + DPG

[Diagram: DQN contributes off-policy training (replay buffer), stable updates (target network), and high-dimensional observation spaces, but is limited to discrete action spaces. The policy-gradient line (REINFORCE -> actor-critic -> DPG) contributes continuous action spaces and lower variance, but on its own has no replay buffer (correlated samples) and no target network (unstable training). DDPG combines both.]

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).



DDPG: algorithm (1/2)

• Policy
Instead of $a = \arg\max_a Q_{\pi_\theta}(s, a)$, the action is given directly by the actor: $a = \mu_\theta(s)$.

• Exploration
$\mu'(s) = \mu_\theta(s) + \mathcal{N}$
Since the policy is deterministic, noise (e.g. white Gaussian noise) is added to the action for exploration.

• Soft target update
$\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, where $\tau \ll 1$
The target networks are constrained to change slowly, which stabilizes the training process.

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
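A minimal sketch of the exploration noise and the soft target update above, assuming PyTorch; the values of sigma and tau are illustrative assumptions:

```python
import torch

def select_action(actor, state, sigma=0.1):
    # mu'(s) = mu_theta(s) + N, with white Gaussian noise for exploration
    with torch.no_grad():
        action = actor(state)
    return action + sigma * torch.randn_like(action)

def soft_update(net, target_net, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta', with tau << 1
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau)
        p_targ.data.add_(tau * p.data)
```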



DDPG: algorithm (2/2)

[Diagram: the actor selects $a_t = \mu_\theta(s_t) + N$ in the environment; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer. A batch of $i$ transitions $(s_t, a_t, r_t, s_{t+1})$ is sampled; the critic $Q_\phi$ is updated with loss $L(\phi)$ computed from the target critic $Q_{\phi'}$ and target policy $\mu_{\theta'}(s_{t+1})$; the actor $\mu_\theta$ is updated with $\nabla J(\theta)$ from the critic; both target networks are soft-updated ($\phi'$ and $\theta'$).]

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
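A compact sketch of one DDPG training step matching the diagram above, assuming PyTorch; network sizes, gamma, tau, and the replay-buffer interface are illustrative assumptions, not the lecture's exact implementation:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005
def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

actor, critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
target_actor, target_critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s_next, done):
    # (s, a, r, s_next, done): a batch sampled from the replay buffer (float tensors)
    # 1. Critic update: L(phi) = (r + gamma * Q_phi'(s', mu_theta'(s')) - Q_phi(s, a))^2
    with torch.no_grad():
        target_q = r + gamma * (1 - done) * target_critic(
            torch.cat([s_next, target_actor(s_next)], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(
        critic(torch.cat([s, a], dim=1)).squeeze(1), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 2. Actor update: maximize Q_phi(s, mu_theta(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # 3. Soft-update both target networks: theta' <- tau*theta + (1 - tau)*theta'
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.data.mul_(1 - tau)
            p_targ.data.add_(tau * p.data)
```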



Appendix: DDPG algorithm



Conclusion & Future work

• DQN has difficulty handling continuous action spaces directly.

• DDPG handles continuous action spaces by combining policy gradient methods with an actor-critic architecture.
• MADDPG extends DDPG to multi-agent RL.
• DDPG can be applied to continuous-action decision-making problems, e.g. navigation and obstacle avoidance.
