Lec 04 Reinforcement Learning
4.1 Markov Decision Processes
Reinforcement Learning
So far:
I Supervised learning: lots of expert demonstrations required
I Use of auxiliary, short-term loss functions
I Imitation learning: per-frame loss on action
I Direct perception: per-frame loss on affordance indicators
Now:
I Learning of models based on the loss that we actually care about, e.g.:
I Minimize time to target location
I Minimize number of collisions
I Minimize risk
I Maximize comfort
I etc.
Unsupervised Learning:
I Dataset: {xi} with xi = data (no labels). Goal: Discover the structure underlying the data
I Examples: Clustering, dimensionality reduction, feature learning, etc.
Reinforcement Learning:
I An agent interacts with an environment which provides numeric reward signals
I Goal: Learn how to take actions in order to maximize reward
I Examples: Learning manipulation or control tasks (any task that involves interaction)
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 5
Introduction to Reinforcement Learning
[Figure: agent-environment loop — at each time step t the agent observes state st, takes action at, and the environment returns reward rt and next state st+1]
https://gym.openai.com/envs/#classic_control
Example: Robot Locomotion
http://blog.openai.com/roboschool/
https://gym.openai.com/envs/#mujoco
Example: Atari Games
http://blog.openai.com/gym-retro/
https://gym.openai.com/envs/#atari
Example: Go
www.deepmind.com/research/alphago/
Example: Self-Driving
https://gym.openai.com/envs/CarRacing-v0/
Reinforcement Learning: Overview
[Figure: agent-environment loop — the agent observes state st, takes action at, and receives reward rt and next state st+1 from the environment]
Markov Decision Process
A Markov Decision Process (MDP) models the environment and is defined by the tuple
(S, A, R, P, γ)
with
I S : set of possible states
I A: set of possible actions
I R(rt |st , at ): distribution of current reward given (state,action) pair
I P (st+1 |st , at ): distribution over next state given (state,action) pair
I γ: discount factor (determines value of future rewards)
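To make the tuple concrete, here is a minimal sketch of a toy MDP written as plain Python data structures (a hypothetical two-state example, not one from the lecture; rewards are given as expected values rather than the full distribution R(rt|st, at)):

# MDP (S, A, R, P, γ) for a hypothetical two-state problem
S = ["cool", "overheated"]          # set of possible states
A = ["slow", "fast"]                # set of possible actions
gamma = 0.9                         # discount factor

# P[s][a] = list of (probability, next_state) pairs, i.e. P(s_{t+1} | s_t, a_t)
P = {
    "cool":       {"slow": [(1.0, "cool")],
                   "fast": [(0.8, "cool"), (0.2, "overheated")]},
    "overheated": {"slow": [(1.0, "overheated")],
                   "fast": [(1.0, "overheated")]},
}

# R[s][a] = expected immediate reward for the (state, action) pair
R = {
    "cool":       {"slow": 1.0, "fast": 2.0},
    "overheated": {"slow": -1.0, "fast": -1.0},
}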
Policy
A policy π specifies what action to take in each state (for a deterministic policy, a mapping from S to A):
I A policy fully defines the behavior of an agent
I Deterministic policy: a = π(s)
I Stochastic policy: π(a|s) = P (at = a|st = s)
Remark:
I MDP policies depend only on the current state and not the entire history
I However, the current state may include past observations
Exploration vs. Exploitation
Answer: We need to explore the state/action space. Thus RL combines two tasks:
I Exploration: Try a novel action a in state s , observe reward rt
I Discovers more information about the environment, but sacrifices total reward
I Game-playing example: Play a novel experimental move
19
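One common way to balance the two is ε-greedy action selection; a minimal sketch (the Q-table layout and ε = 0.1 are assumptions for illustration):

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability ε explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                    # exploration: try a novel action
    return max(actions, key=lambda a: Q[(state, a)])     # exploitation: best known action

# Example with a small Q-table:
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
action = epsilon_greedy(Q, "s0", ["left", "right"])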
4.2 Bellman Optimality and Q-Learning
Value Functions
How good is a state?
The state-value function V^π(s_t) is the expected cumulative discounted reward when starting in s_t and following policy π:
V^π(s_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, π ]
I The discount factor γ < 1 determines the present value of future rewards at the current time t
I Weights immediate reward higher than future reward
(e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, ...)
I Determines the agent's far-/short-sightedness
I Avoids infinite returns in cyclic Markov processes
Value Functions
How good is a state-action pair?
The action-value function Q^π(s_t, a_t) is the expected cumulative discounted reward when taking action a_t in state s_t and following policy π thereafter:
Q^π(s_t, a_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, a_t, π ]
I The discount factor γ ∈ [0, 1] determines the present value of future rewards at the current time t
I Weights immediate reward higher than future reward
(e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, ...; a short worked example follows below)
I Determines the agent's far-/short-sightedness
I Avoids infinite returns in cyclic Markov processes
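As a tiny worked example of the discounted return Σ_{k≥0} γ^k r_{t+k} (the reward sequence is arbitrary):

def discounted_return(rewards, gamma=0.5):
    """Σ_k γ^k r_{t+k}: later rewards are weighted down geometrically."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# γ = 0.5 weights the rewards by 1, 1/2, 1/4, 1/8, ...
print(discounted_return([1.0, 1.0, 1.0, 1.0]))   # 1 + 0.5 + 0.25 + 0.125 = 1.875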
Optimal Value Functions
The optimal state-value function V ∗ (st ) is the best V π (st ) over all policies π:
V^*(s_t) = max_π V^π(s_t),   where   V^π(s_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, π ]
The optimal action-value function Q∗ (st , at ) is the best Qπ (st , at ) over all policies π:
Q^*(s_t, a_t) = max_π Q^π(s_t, a_t),   where   Q^π(s_t, a_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, a_t, π ]
I The optimal value functions specify the best possible performance in the MDP
I However, searching over all possible policies π is computationally intractable
Optimal Policy
The optimal policy π^* acts greedily with respect to the optimal action-value function: π^*(s) = argmax_a Q^*(s, a)
A Simple Grid World Example
I States: the cells of the grid, including terminal states (marked cells)
I Actions = {right, left, up, down}
I Reward: r = −1 for each transition
Objective: Reach one of the terminal states (marked cells) in the least number of actions
I Penalty (negative reward) given for every transition made
A tabular sketch of this grid world is given below.
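A minimal tabular sketch of such a grid world, solved by repeatedly applying the Bellman optimality backup Q(s, a) ← r + γ max_{a′} Q(s′, a′); the 4×4 size and the two terminal corner cells are assumptions, and γ = 1 for this episodic task:

N = 4                                        # assumed 4x4 grid
terminals = {(0, 0), (N - 1, N - 1)}         # assumed terminal cells (marked in the figure)
actions = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}
states = [(i, j) for i in range(N) for j in range(N)]
Q = {(s, a): 0.0 for s in states for a in actions}

def step(state, action):
    """Deterministic transition: move if possible, stay otherwise; reward is always -1."""
    i, j = state
    di, dj = actions[action]
    ni = min(max(i + di, 0), N - 1)
    nj = min(max(j + dj, 0), N - 1)
    return (ni, nj), -1.0

# Bellman optimality backups until convergence (50 sweeps are plenty for a 4x4 grid)
for _ in range(50):
    for (s, a) in Q:
        if s in terminals:
            continue                                   # terminal states keep Q = 0
        s_next, r = step(s, a)
        v_next = 0.0 if s_next in terminals else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = r + v_next                         # γ = 1

# Greedy policy: in every state, pick the action with the highest Q-value
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}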
A Simple Grid World Example
[Figure: a random policy on the grid world]
I The arrows indicate equal probability of moving into each of the directions
Solving for the Optimal Policy
Bellman Optimality Equation:
Q^*(s_t, a_t) = E[ r_t + γ max_{a′} Q^*(s_{t+1}, a′) | s_t, a_t ]
Deep Q-Learning: use a deep neural network with parameters θ as a function approximator of the optimal action-value function (a sketch follows below):
Q(s, a; θ) ≈ Q^*(s, a)
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 36
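A minimal PyTorch sketch of such a Q-network for a vector-valued state (layer sizes and state/action dimensions are illustrative assumptions, not the convolutional architecture of Mnih et al.):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action: Q(s, ·; θ)."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection: a = argmax_a Q(s, a; θ)
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1).item()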
Training the Q Network
Forward Pass:
The loss function is the mean-squared error in Q-values:
L(θ) = E[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ) − Q(s_t, a_t; θ) )^2 ]
Backward Pass:
Gradient update with respect to the Q-function parameters θ:
∇_θ L(θ) = ∇_θ E[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ) − Q(s_t, a_t; θ) )^2 ]
Optimize objective end-to-end with stochastic gradient descent (SGD) using ∇θ L(θ).
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 37
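A hedged sketch of one such gradient step in PyTorch, reusing the hypothetical q_net from the sketch above (here the same network also computes the target, which is treated as a constant; the fixed-target variant follows two slides later):

import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """Mean-squared TD error: ( r + γ max_a' Q(s', a'; θ) − Q(s, a; θ) )^2."""
    s, a, r, s_next, done = batch                              # mini-batch tensors
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t; θ)
    with torch.no_grad():                                      # target treated as a constant
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

# One gradient step:
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
# loss = dqn_loss(q_net, batch); optimizer.zero_grad(); loss.backward(); optimizer.step()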
Experience Replay
To speed up training we would like to train on mini-batches:
I Problem: Learning from consecutive samples is inefficient
I Reason: Strong correlations between consecutive samples
I Solution: Store transitions (st, at, rt, st+1) in a replay memory D and sample mini-batches uniformly at random, which breaks these correlations (see the sketch below)
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 38
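A minimal replay-memory sketch (capacity and batch size are arbitrary assumptions):

import random
from collections import deque

class ReplayMemory:
    """Stores transition tuples (s_t, a_t, r_t, s_{t+1}, done) and samples random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)              # old transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)    # uniform sampling breaks correlations
        return list(zip(*batch))                          # columns: (states, actions, rewards, ...)

    def __len__(self):
        return len(self.buffer)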
Fixed Q Targets
Problem: Non-stationary targets
I As the policy changes, so do our targets: rt + γ max_{a′} Q(st+1, a′; θ)
I This may lead to oscillation or divergence
I Solution: Compute the targets with a separate target network whose parameters θ− are held fixed and only periodically copied from θ (see the sketch below)
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 39
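A sketch of such a fixed target network with periodic hard updates (the copy interval is an assumption; DDPG's soft variant appears later):

import copy
import torch

target_net = copy.deepcopy(q_net)          # θ− starts as a copy of θ
target_net.requires_grad_(False)           # no gradients flow into the target network

def td_target(r, s_next, done, gamma=0.99):
    # r_t + γ max_a' Q(s_{t+1}, a'; θ−), computed with the frozen target network
    return r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

def maybe_update_target(step, every=10_000):
    if step % every == 0:
        target_net.load_state_dict(q_net.state_dict())   # θ− ← θ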
Putting it together
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 40
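A compact sketch of how the pieces fit together (ε-greedy exploration, replay memory, fixed targets); it assumes a Gymnasium-style environment with vector observations and reuses the hypothetical QNetwork and ReplayMemory from the sketches above, so it is an illustration rather than the exact procedure of Mnih et al.:

import random
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, target_net, memory, steps=100_000, gamma=0.99,
              eps=0.1, batch_size=32, lr=1e-4, target_every=10_000):
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    s, _ = env.reset()
    for step in range(steps):
        # ε-greedy behavior policy
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            s_t = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
            a = q_net(s_t).argmax(dim=1).item()
        s_next, r, terminated, truncated, _ = env.step(a)
        memory.push(s, a, r, s_next, float(terminated))
        s = env.reset()[0] if (terminated or truncated) else s_next

        if len(memory) >= batch_size:
            states, actions, rewards, next_states, dones = memory.sample(batch_size)
            bs = torch.as_tensor(states, dtype=torch.float32)
            ba = torch.as_tensor(actions, dtype=torch.int64)
            br = torch.as_tensor(rewards, dtype=torch.float32)
            bs_next = torch.as_tensor(next_states, dtype=torch.float32)
            bdone = torch.as_tensor(dones, dtype=torch.float32)
            q_sa = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
            with torch.no_grad():   # fixed target: r + γ max_a' Q(s', a'; θ−)
                target = br + gamma * (1.0 - bdone) * target_net(bs_next).max(dim=1).values
            loss = F.mse_loss(q_sa, target)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

        if step % target_every == 0:
            target_net.load_state_dict(q_net.state_dict())   # θ− ← θ

# Usage sketch (assumption): env = gymnasium.make("CartPole-v1"), q_net = QNetwork(4, 2),
# target_net = copy.deepcopy(q_net), memory = ReplayMemory()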
Case Study: Playing Atari Games
[Figure: DQN on Atari — the state is a stack of recent raw game frames, the actions are the discrete joystick/button commands, and the reward is the change in game score]
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 42
Case Study: Playing Atari Games
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 43
Deep Q-Learning Shortcomings
I Q-learning requires a max/argmax over actions, so it only handles discrete, low-dimensional action spaces
I Continuous control tasks (e.g., steering angles, torques) need a different approach (see DDPG below)
Deep Deterministic Policy Gradients
DDPG addresses the problem of continuous action spaces.
Problem: Finding a continuous action requires optimization at every timestep.
Solution: Use two networks, an actor (deterministic policy) and a critic.
I Actor: a deterministic policy µ(s; θ^µ) that maps a state s to a continuous action a = µ(s; θ^µ)
I Critic: an action-value network Q(s, a; θ^Q) that evaluates the action chosen by the actor
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 45
Deep Deterministic Policy Gradients
The actor is trained with the deterministic policy gradient, backpropagating through the critic:
∇_{θ^µ} E_{(s_t,a_t,r_t,s_{t+1})∼D}[ Q(s_t, µ(s_t; θ^µ); θ^Q) ] = E[ ∇_a Q(s_t, a; θ^Q)|_{a=µ(s_t; θ^µ)} ∇_{θ^µ} µ(s_t; θ^µ) ]
I Remark: No maximization over actions is required, as action selection is now performed by the learned actor µ(·) (see the sketch below)
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 46
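In code, this chain rule comes for free from automatic differentiation: maximize the critic's value of the actor's action. A minimal sketch with hypothetical actor/critic modules following the interfaces named above:

import torch

def ddpg_actor_update(actor, critic, actor_optim, states):
    """Ascend E[ Q(s, µ(s; θ^µ); θ^Q) ] by descending its negative."""
    actions = actor(states)                         # a = µ(s; θ^µ), differentiable w.r.t. θ^µ
    actor_loss = -critic(states, actions).mean()    # gradient flows through the critic into the actor
    actor_optim.zero_grad()
    actor_loss.backward()                           # implements ∇_a Q · ∇_{θ^µ} µ via backprop
    actor_optim.step()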
Deep Deterministic Policy Gradients
Experience replay and target networks are again used to stabilize training:
I Replay memory D stores transition tuples (st , at , rt , st+1 )
I Target networks are updated using “soft” target updates
I Weights are not directly copied but slowly adapted:
θ^{Q−} ← τ θ^Q + (1 − τ) θ^{Q−}
θ^{µ−} ← τ θ^µ + (1 − τ) θ^{µ−}
where 0 < τ ≪ 1 controls the tradeoff between speed and stability of learning (see the sketch below)
I Exploration is handled by adding noise from a random process N to the actor's output: a = µ(s; θ^µ) + N
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 47
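A sketch of the soft target update and of exploration noise (τ and the Gaussian noise are illustrative choices; Lillicrap et al. use an Ornstein-Uhlenbeck process for N):

import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    """θ_target ← τ θ_online + (1 − τ) θ_target, applied parameter-wise."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p_o)

def noisy_action(actor, state, noise_std=0.1):
    """Exploration: a = µ(s; θ^µ) + N (Gaussian noise here as a simple stand-in)."""
    with torch.no_grad():
        a = actor(state)
        return a + noise_std * torch.randn_like(a)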
Prioritized Experience Replay
Transitions are not sampled uniformly from the replay memory but with a probability that grows with the magnitude of their temporal-difference (TD) error
δ = r_t + γ max_{a′} Q(s_{t+1}, a′; θ^{Q−}) − Q(s_t, a_t; θ^Q)
so that surprising transitions are replayed more often (a simplified sketch follows below).
Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 49
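A simplified sketch of proportional prioritization based on this TD error (the exponent α and constant ε follow the common formulation and are assumptions here):

import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) ∝ (|δ_i| + ε)^α: larger TD error → higher replay probability."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

# Example: sample a mini-batch of indices according to these probabilities
td_errors = np.array([0.1, 2.0, 0.5, 0.05])
probs = sampling_probabilities(td_errors)
batch_idx = np.random.choice(len(td_errors), size=2, p=probs, replace=False)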
Learning to Drive in a Day
Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 50
Other flavors of Deep RL
Asynchronous Deep Reinforcement Learning
Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016. 52
Bootstrapped DQN
Bootstrapping for efficient exploration:
I Approximate a distribution over Q-values via K bootstrapped "heads"
I At the start of each epoch, a single head Qk is selected uniformly at random
I After training, all heads can be combined into a single ensemble policy
[Figure: a shared network body θ_shared computes features of the state s and feeds K separate Q-value heads Q_1(·; θ^{Q_1}), ..., Q_K(·; θ^{Q_K}); a sketch follows below]
Osband et al.: Deep Exploration via Bootstrapped DQN. NIPS, 2016. 53
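A minimal sketch of such a multi-head Q-network (layer sizes and K are assumptions):

import torch
import torch.nn as nn

class BootstrappedQNetwork(nn.Module):
    """Shared body θ_shared feeding K independent Q-value heads."""
    def __init__(self, state_dim, num_actions, num_heads=10, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, num_actions) for _ in range(num_heads)])

    def forward(self, state, head=None):
        features = self.shared(state)
        if head is not None:                  # act with the currently selected head
            return self.heads[head](features)
        return torch.stack([h(features) for h in self.heads])  # all heads, e.g. for the ensemble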
Double Q-Learning
I Decouple the Q-function used to select actions from the one used to evaluate them, to avoid Q-value overestimation and stabilize training. Targets (both computed in the sketch below):
DQN: r_t + γ max_{a′} Q(s_{t+1}, a′; θ−)
Double DQN: r_t + γ Q(s_{t+1}, argmax_{a′} Q(s_{t+1}, a′; θ); θ−)
I Online network with weights θ is used to determine greedy policy
I Target network with weights θ− is used to determine corresponding action value
I Improves performance on Atari benchmarks
van Hasselt et al.: Deep Reinforcement Learning with Double Q-learning. AAAI, 2016. 54
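A sketch of the two target computations (online net θ, target net θ−; batched Q-value tensors of shape [batch, num_actions] are assumed):

import torch

@torch.no_grad()
def dqn_target(r, s_next, done, q_target, gamma=0.99):
    # r_t + γ max_a' Q(s_{t+1}, a'; θ−): select and evaluate with the target network
    return r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values

@torch.no_grad()
def double_dqn_target(r, s_next, done, q_online, q_target, gamma=0.99):
    a_star = q_online(s_next).argmax(dim=1, keepdim=True)            # select with the online network θ
    q_eval = q_target(s_next).gather(1, a_star).squeeze(1)           # evaluate with the target network θ−
    return r + gamma * (1.0 - done) * q_eval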
Deep Recurrent Q-Learning
Add recurrence to a deep Q-network to handle partial observability of states:
[Figure: DRQN architecture — convolutional layers process individual frames, an LSTM integrates information over time, and a fully-connected output layer (FC-Out) produces the Q-values; a sketch follows below]
Hausknecht and Stone: Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI, 2015 55
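A minimal recurrent Q-network sketch in the same spirit (a feature encoder followed by an LSTM and a Q-value head; sizes and the non-convolutional encoder are assumptions, not the architecture from the paper):

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q-values conditioned on the observation history via an LSTM."""
    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, num_actions)     # FC-Out (Q-values)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: [batch, time, obs_dim]; the LSTM integrates over the time dimension
        features = self.encoder(obs_seq)
        out, hidden_state = self.lstm(features, hidden_state)
        return self.q_head(out), hidden_state            # Q-values per timestep + recurrent state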
Faulty Reward Functions
https://blog.openai.com/faulty-reward-functions/
Summary