Lecture 37 - Deep Deterministic Policy Gradient (DDPG)
• Regret
• Optimistic Initializations
• DDPG
DQN to DDPG: DQN algorithm
Q-learning (tabular update):
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big]$$
DQN (parameterized by $\theta$, with target network $\theta'$):
$$Q^{\pi_\theta}(s_t, a_t) \leftarrow Q^{\pi_\theta}(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a} Q^{\pi_{\theta'}}(s_{t+1}, a) - Q^{\pi_\theta}(s_t, a_t)\big]$$
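As a rough sketch of how the DQN update is implemented in practice (PyTorch; the names `q_net`, `q_target`, and the batch layout are illustrative assumptions, not from the lecture), the tabular assignment becomes a gradient step on the squared TD error computed with the frozen target network:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One DQN gradient step on a mini-batch of transitions (s, a, r, s', done)."""
    s, a, r, s_next, done = batch  # states, int64 actions, rewards, next states, float done flags

    # TD target built with the frozen target network theta'
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values

    # Current estimate Q_theta(s_t, a_t) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Squared TD error replaces the tabular update rule
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```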
[Diagram: the agent takes $(s_t, a_t)$ in the environment, observes $r_t$ and $s_{t+1}$, and stores the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer.]
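A minimal replay-buffer sketch along the lines of this diagram (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s_t, a_t, r_t, s_{t+1}) and samples mini-batches for training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from the buffer is what removes the sample correlation mentioned in the comparison later in the lecture.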
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DQN extends Q-learning with:
1. a replay buffer
2. a deep neural network
3. a target network
Roadmap: Policy gradient (REINFORCE) → Actor-critic → DPG (continuous action spaces) → DDPG
[Diagram: agent-world loop. The agent's policy $\pi_\theta$ is a stochastic policy with weights $\theta$ that outputs the action $a_t$; the quantity to optimize is the objective $J(\theta)$.]
Policy gradient: REINFORCE
[Figure: the stochastic policy assigns a probability to each discrete action, e.g. 0.2, 0.4, 0.1, ...]
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
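For intuition, a stochastic policy over discrete actions can be sketched as a small network whose softmax defines the action probabilities shown in the figure (a toy example, not the lecture's network):

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """pi_theta(a|s): maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, s):
        # Softmax over the logits gives the per-action probabilities
        return torch.distributions.Categorical(logits=self.net(s))

policy = CategoricalPolicy(state_dim=4, n_actions=3)
dist = policy(torch.zeros(4))   # distribution over the 3 actions
a = dist.sample()               # action sampled from pi_theta(a|s)
log_prob = dist.log_prob(a)     # log pi_theta(a|s), used by the REINFORCE gradient below
```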
Policy gradient: REINFORCE
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big], \qquad \theta:\ \text{weights of the actor network}$$
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\Big)\Big(\sum_{t=1}^{T} r(s_t^i, a_t^i)\Big), \qquad N:\ \text{number of sampled episodes}$$
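A hedged sketch of this Monte-Carlo estimator, reusing the toy `CategoricalPolicy` from earlier and assuming `episodes` is a list of $N$ trajectories of `(state, action, reward)` tuples with array-like states and integer actions:

```python
import torch

def reinforce_loss(policy, episodes):
    """Surrogate loss whose gradient is the REINFORCE estimate of grad_theta J(theta)."""
    per_episode = []
    for episode in episodes:                                      # N sampled episodes
        states, actions, rewards = zip(*episode)
        states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
        actions = torch.as_tensor(actions, dtype=torch.long)
        log_probs = policy(states).log_prob(actions)              # log pi_theta(a_t | s_t) for each t
        episode_return = float(sum(rewards))                      # sum_t r(s_t, a_t)
        per_episode.append(-(log_probs.sum() * episode_return))   # minus sign: we ascend J(theta)
    return torch.stack(per_episode).mean()                        # average over the N episodes
```

A gradient step is then `reinforce_loss(policy, episodes).backward()` followed by an optimizer step on the policy parameters.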
Policy gradient: Actor-critic
• Actor ($\pi_\theta(a_t|s_t)$): outputs an action distribution from the policy network and is updated in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions
[Diagram: starting from an initial state, the actor $\pi_\theta(a_t|s_t)$ selects $a_t$ in the environment; transitions $(s_t, a_t, r_t, s_{t+1})_{0 \ldots i}$ are sampled $i$ times; the critic $Q_\phi(s_t, a_t)$ is updated, and the actor is then updated using $\nabla J(\theta)$ computed with the critic.]
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
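A sketch of one actor-critic update consistent with the diagram (the one-step TD target, the module names, and the assumptions that the critic takes `(s, a)` and the actor returns a torch distribution are illustrative choices, not details fixed by the slide):

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    """Critic: regress Q_phi toward a one-step TD target. Actor: ascend critic-weighted log-probs."""
    s, a, r, s_next, done = batch

    # ---- update critic Q_phi(s_t, a_t) ----
    with torch.no_grad():
        a_next = actor(s_next).sample()                           # bootstrap action from the current policy
        td_target = r + gamma * (1.0 - done) * critic(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # ---- update actor pi_theta(a_t|s_t) in the direction suggested by the critic ----
    actor_loss = -(actor(s).log_prob(a) * critic(s, a).detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```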
• Deterministic policy gradient (DPG) models the actor as a deterministic policy: $a_t = \mu_\theta(s_t)$
[Diagram: a 2-dimensional action $(a_x, a_y)$ as a function of $s_t$, discretized vs. continuous.]
• Discretizing a 2-dimensional action into 5 values per dimension requires 10 discrete outputs; a deterministic continuous policy needs only 2 outputs $(a_x, a_y)$
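To illustrate the output-size argument above, a deterministic actor for a 2-dimensional continuous action needs only two outputs, $a_x$ and $a_y$ (toy sketch; layer sizes are arbitrary):

```python
import torch.nn as nn

class DeterministicActor(nn.Module):
    """mu_theta(s): maps a state directly to a continuous 2-D action (a_x, a_y) in [-1, 1]^2."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # 2 outputs instead of 10 (5 bins x 2 dimensions)
        )

    def forward(self, s):
        return self.net(s)
```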
• With a deterministic policy $a_t = \mu_\theta(s_t)$, the trajectory distribution simplifies compared to the stochastic case:
Trajectory distribution:
Stochastic policy: $p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t \mid s_t)\,p(s_{t+1} \mid s_t, a_t)$
Deterministic policy: $p_\theta(s_1, s_2, \ldots, s_T) = p(s_1)\prod_{t=1}^{T} p(s_{t+1} \mid s_t, \mu_\theta(s_t))$
Objective: $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\sum_t r(s_t, \mu_\theta(s_t))\big]$
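For reference, the gradient of this objective, as stated in Lillicrap et al. (2015) rather than derived on the slide, is what the DDPG actor update later implements: the gradient flows from the critic into the deterministic actor,

```latex
\nabla_\theta J(\theta) \;\approx\;
\mathbb{E}_{s \sim p_\theta}\!\left[
  \nabla_\theta \mu_\theta(s)\,
  \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}
\right]
```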
Roadmap to DDPG:
• Policy gradient (REINFORCE): high variance; no replay buffer (sample correlation); no target network (unstable)
• Actor-critic: lower variance
• DPG: continuous action spaces
• DDPG: combines these, adding DQN's replay buffer and target networks
• Policy: instead of $a = \arg\max_a Q^{\pi_\theta}(s, a)$ as in DQN, DDPG uses a deterministic actor $a = \mu_\theta(s)$
• Exploration: add noise to the deterministic action, $\mu'(s) = \mu_\theta(s) + \mathcal{N}$, e.g. white Gaussian noise
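A minimal sketch of this exploration rule (white Gaussian noise added to the deterministic action; the noise scale and clipping range are assumptions, not values from the slide):

```python
import torch

def explore(actor, s, noise_std=0.1, low=-1.0, high=1.0):
    """mu'(s) = mu_theta(s) + N, with N drawn as white Gaussian noise."""
    with torch.no_grad():
        a = actor(s)                              # deterministic action mu_theta(s)
    a = a + noise_std * torch.randn_like(a)       # additive exploration noise
    return a.clamp(low, high)                     # keep the noisy action inside valid bounds
```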
[Diagram: DDPG architecture. The actor (policy $\mu_\theta$) selects $a_t = \mu_\theta(s_t) + \mathcal{N}$ and acts in the environment; the transition $(s_t, a_t, r_t, s_{t+1})$ is observed; the critic $Q_\phi$ is updated by minimizing a loss $L(\phi)$; the actor is updated using $\nabla J(\theta)$ from the critic; target networks $\phi'$ and $\theta'$ are maintained with soft updates.]
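Putting the diagram together, a hedged sketch of one DDPG training step in PyTorch (network and optimizer names are illustrative, and the hyperparameters are typical defaults rather than values from the lecture):

```python
import torch
import torch.nn.functional as F

def ddpg_step(actor, critic, actor_targ, critic_targ,
              actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update: critic regression, deterministic actor ascent, soft target updates."""
    s, a, r, s_next, done = batch

    # ---- critic: minimize L(phi) = (Q_phi(s,a) - y)^2 with target networks phi', theta' ----
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_targ(s_next, actor_targ(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # ---- actor: maximize Q_phi(s, mu_theta(s)), i.e. follow grad J(theta) through the critic ----
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # ---- soft updates of the target networks phi' and theta' ----
    with torch.no_grad():
        for targ, net in ((critic_targ, critic), (actor_targ, actor)):
            for p_targ, p in zip(targ.parameters(), net.parameters()):
                p_targ.mul_(1.0 - tau).add_(tau * p)
```

The soft update with a small $\tau$ slowly tracks the learned networks instead of copying them periodically, which is the "soft update $\phi'$, $\theta'$" shown in the diagram.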