Lecture 7: Imitation Learning in Large State Spaces
Emma Brunskill, CS234 Reinforcement Learning
Winter 2020
With slides from Katerina Fragkiadaki and Pieter Abbeel
Refresh Your Knowledge 6
Deep RL
Class Structure
Double DQN
Recall: Double Q-Learning
1: Initialize Q1(s, a) and Q2(s, a) for all s ∈ S, a ∈ A; t = 0; observe initial state s_t
2: loop
3:   Select a_t by ε-greedy w.r.t. Q1(s_t, a) + Q2(s_t, a)
4:   Observe (r_t, s_{t+1})
5:   if (with 0.5 probability) then
6:     Q1(s_t, a_t) ← Q1(s_t, a_t) + α(r_t + γ Q2(s_{t+1}, arg max_{a'} Q1(s_{t+1}, a')) − Q1(s_t, a_t))
7:   else
8:     Q2(s_t, a_t) ← Q2(s_t, a_t) + α(r_t + γ Q1(s_{t+1}, arg max_{a'} Q2(s_{t+1}, a')) − Q2(s_t, a_t))
9:   end if
10:  t = t + 1
11: end loop
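
As a concrete illustration, here is a minimal Python/NumPy sketch of the update in steps 5-9; the function name and interface are illustrative, not from the slides.

import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng):
    # Flip a fair coin: update one table, using the other to evaluate the target.
    if rng.random() < 0.5:
        a_star = np.argmax(Q1[s_next])            # select the action with Q1 ...
        target = r + gamma * Q2[s_next, a_star]   # ... but evaluate it with Q2
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = np.argmax(Q2[s_next])            # select with Q2 ...
        target = r + gamma * Q1[s_next, a_star]   # ... evaluate with Q1
        Q2[s, a] += alpha * (target - Q2[s, a])

Decoupling action selection from evaluation in this way removes the maximization bias of ordinary Q-learning.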
Deep RL
Check Your Understanding: Mars Rover Model-Free Policy Evaluation
Potential Impact of Ordering Episodic Replay Updates
In prioritized experience replay, the priority p_i of tuple i is proportional to the DQN error:

p_i = |r_i + γ max_{a'} Q(s_{i+1}, a'; w⁻) − Q(s_i, a_i; w)|

Sample tuple i for update with probability

P(i) = p_i^α / Σ_k p_k^α

See the paper for details and an alternative prioritization scheme.
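
A minimal sketch of this proportional sampling in Python/NumPy; the names are illustrative, and the linear scan stands in for the sum-tree the paper uses for efficient sampling.

import numpy as np

def sample_indices(priorities, alpha, batch_size, rng):
    # P(i) = p_i^alpha / sum_k p_k^alpha
    scaled = np.asarray(priorities, dtype=float) ** alpha
    probs = scaled / scaled.sum()
    return rng.choice(len(probs), size=batch_size, p=probs)

With alpha = 0 this reduces to uniform replay; larger alpha concentrates updates on high-error tuples.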
Exercise: Prioritized Replay
Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1})
Sample tuples for update using priority function
Priority of a tuple i is proportional to the DQN error:

p_i = |r_i + γ max_{a'} Q(s_{i+1}, a'; w⁻) − Q(s_i, a_i; w)|
Performance of Prioritized Replay vs Double DQN
Deep RL
Value & Advantage Function

Advantage function: A^π(s, a) = Q^π(s, a) − V^π(s)
Dueling DQN
Check Understanding: Unique?

Advantage function: given the decomposition Q̂(s, a; w) = V̂(s; w) + Â(s, a; w), are V̂ and Â uniquely determined?
Uniqueness
Advantage function decomposition Q̂(s, a; w) = V̂(s; w) + Â(s, a; w)
Not unique: adding a constant to V̂ and subtracting it from Â leaves Q̂ unchanged
Option 1: Force Â(s, a) = 0 if a is the action taken by the greedy policy:

Q̂(s, a; w) = V̂(s; w) + ( Â(s, a; w) − max_{a'∈A} Â(s, a'; w) )
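
A minimal sketch of this dueling head, assuming PyTorch; the layer shapes are illustrative, not from the slides.

import torch.nn as nn

class DuelingHead(nn.Module):
    # Split shared features into V(s) and A(s, a) streams, then recombine.
    def __init__(self, in_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(in_dim, 1)              # V̂(s; w)
        self.advantage = nn.Linear(in_dim, n_actions)  # Â(s, a; w)

    def forward(self, features):
        v = self.value(features)       # shape (batch, 1)
        a = self.advantage(features)   # shape (batch, n_actions)
        # Option 1: subtract the max so the greedy action has zero advantage
        return v + a - a.max(dim=1, keepdim=True).values

The published dueling architecture also considers subtracting the mean advantage instead of the max, which loses the exact V/A semantics but is more stable in practice.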
Dueling DQN vs. Double DQN with Prioritized Replay
Practical Tips for DQN on Atari (from J. Schulman) cont.
Try the Huber loss on the Bellman error:

L(x) = x²/2 if |x| ≤ δ
L(x) = δ|x| − δ²/2 otherwise
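
A minimal sketch of this loss in Python; the function name and default δ are illustrative.

def huber(x, delta=1.0):
    # Quadratic for small errors, linear for large ones: robust to outlier
    # Bellman errors while staying smooth near zero.
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * abs(x) - 0.5 * delta * delta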
Deep Reinforcement Learning
We want RL algorithms that can handle:
Optimization
Delayed consequences
Exploration
Generalization
And do it all statistically and computationally efficiently
Generalization and Efficiency
Class Structure
So Far in this Course
Reward Shaping
Rewards that are dense in time closely guide the agent. How can we supply these rewards?
Manually design them: often brittle
Implicitly specify them through demonstrations
Simulated highway driving [Abbeel and Ng, ICML 2004; Syed and Schapire, NIPS 2007; Majumdar et al., RSS 2017]
Parking lot navigation [Abbeel, Dolgov, Ng, and Thrun, IROS 2008]
Learning from Demonstrations
Problem Setup
Input:
State space, action space
Transition model P(s′ | s, a)
No reward function R
Set of one or more teacher's demonstrations (s_0, a_0, s_1, a_1, . . .) (actions drawn from the teacher's policy π*)
Behavioral Cloning:
Can we directly learn the teacher’s policy using supervised learning?
Inverse RL:
Can we recover R?
Apprenticeship learning via Inverse RL:
Can we use R to generate a good policy?
Table of Contents
1 Behavioral Cloning
Behavioral Cloning
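
Behavioral cloning reduces imitation to supervised learning: fit a policy to the demonstrated (s, a) pairs. A minimal sketch, assuming scikit-learn; the classifier choice is illustrative, not from the lecture.

from sklearn.linear_model import LogisticRegression

def behavioral_cloning(states, actions):
    # states: array of shape (N, state_dim); actions: array of shape (N,)
    policy = LogisticRegression(max_iter=1000).fit(states, actions)
    return policy   # policy.predict(s) imitates the teacher's action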
ALVINN (Pomerleau, 1989): an early behavioral cloning success, a neural network trained to steer a vehicle directly from camera input.
Problem: Compounding Errors
Supervised learning assumes i.i.d. (s, a) pairs and ignores temporal structure
If errors were independent in time, a per-step error rate of ε would give expected total error of order εT over horizon T. But under the learned policy, one mistake shifts the agent into states the expert never visited, where further mistakes are more likely, so the expected total error can grow as O(εT²).
Problem: Compounding Errors
Idea: get more labels of the expert action along the path taken by the policy computed by behavior cloning (the idea behind DAgger; see the sketch below)
Obtains a stationary deterministic policy with good performance under its induced state distribution
Key limitation?
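
A rough sketch of this label-aggregation loop; the environment and expert interfaces are illustrative assumptions, not an API from the lecture.

def dagger(env, expert_action, fit_policy, n_iters=10, horizon=100):
    states, labels = [], []
    policy = expert_action                    # first rollout follows the expert
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(horizon):
            states.append(s)
            labels.append(expert_action(s))   # expert labels the states WE visit
            s, done = env.step(policy(s))     # but we follow the learned policy
            if done:
                break
        policy = fit_policy(states, labels)   # supervised learning on all data
    return policy

The key limitation alluded to above: the expert must be available online to label every state the learner visits, which is often expensive or impossible.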
Feature Based Reward Function
Check Your Understanding: Feature Based Reward Function
Linear Feature Reward Inverse RL
V^π = E_{s∼π}[ Σ_{t=0}^∞ γ^t R(s_t) ]
Linear Feature Reward Inverse RL
Relating Frequencies to Optimality
V* ≥ V^π for all policies π
Relating Frequencies to Optimality
Recall linear value function approximation
Similarly, here consider when the reward is linear over features:
R(s) = w^T x(s), where w ∈ R^n and x : S → R^n
Goal: identify the weight vector w given a set of demonstrations
The resulting value function for a policy π can be expressed as

V^π = E[ Σ_{t=0}^∞ γ^t R(s_t) | π ] = w^T E[ Σ_{t=0}^∞ γ^t x(s_t) | π ] = w^T µ(π)

where µ(π) = E[ Σ_{t=0}^∞ γ^t x(s_t) | π ] is the discounted weighted frequency of state features under π
If the demonstrations come from an optimal policy π*, then for the true weights w*:

w*^T µ(π*) ≥ w*^T µ(π), ∀π ≠ π*
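
A minimal Monte Carlo sketch for estimating µ(π) from rollouts of π; phi plays the role of the feature map x(s), and all names are illustrative.

def feature_expectations(trajectories, phi, gamma):
    # trajectories: list of rollouts, each a list of states [s_0, s_1, ...]
    total = None
    for states in trajectories:
        disc = sum(gamma ** t * phi(s) for t, s in enumerate(states))
        total = disc if total is None else total + disc
    return total / len(trajectories)   # average discounted feature sum

The same estimator applied to the teacher's demonstrations gives µ(π*), which is what the feature-matching condition on the next slide compares against.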
Feature Matching
If feature expectations match, ‖µ(π) − µ(π*)‖₁ ≤ ε, then for any weights with ‖w‖_∞ ≤ 1, Hölder's inequality gives |w^T µ(π) − w^T µ(π*)| ≤ ε, so π performs ε-close to the expert under the (unknown) linear reward
Ambiguity
Learning from Demonstration / Imitation Learning Pointers
Summary
Class Structure