Lecture 7: Imitation Learning in Large State Spaces

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2020

With slides from Katerina Fragkiadaki and Pieter Abbeel
Refresh Your Knowledge 6

Experience replay in deep Q-learning (select all):

1 Involves using a bank of prior (s, a, r, s′) tuples and doing Q-learning updates using all the tuples in the bank
2 Always uses the most recent history of tuples
3 Reduces the data efficiency of DQN
4 Increases the computational cost
5 Not sure

Deep RL

Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL
Some immediate improvements (many others!)
Double DQN (Deep Reinforcement Learning with Double Q-Learning,
Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network Architectures
for Deep Reinforcement Learning, Wang et al, ICML 2016)

Class Structure

Last time: CNNs and Deep Reinforcement learning


This time: DRL and Imitation Learning in Large State Spaces
Next time: Policy Search

Double DQN

Recall maximization bias challenge


Max of the estimated state-action values can be a biased estimate of
the max
Double Q-learning

Recall: Double Q-Learning

1: Initialize Q1(s, a) and Q2(s, a) ∀s ∈ S, a ∈ A; t = 0, initial state st = s0
2: loop
3:    Select at using ε-greedy π(s) = argmax_a [Q1(st, a) + Q2(st, a)]
4:    Observe (rt, st+1)
5:    if (with 0.5 probability) then
6:       Q1(st, at) ← Q1(st, at) + α(rt + γ Q1(st+1, argmax_{a'} Q2(st+1, a')) − Q1(st, at))
7:    else
8:       Q2(st, at) ← Q2(st, at) + α(rt + γ Q2(st+1, argmax_{a'} Q1(st+1, a')) − Q2(st, at))
9:    end if
10:   t = t + 1
11: end loop
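A minimal tabular sketch of the loop above, assuming a small environment with a classic Gym-style reset()/step() API returning (next_state, reward, done, info) and integer states and actions; all names and hyperparameters here are illustrative, not from the lecture:

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99,
                      epsilon=0.1, n_steps=10_000):
    """Tabular double Q-learning, following the update rule in the pseudocode above."""
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy on the sum Q1 + Q2 (line 3 of the pseudocode)
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q1[s] + Q2[s]))
        s_next, r, done, _ = env.step(a)
        if np.random.rand() < 0.5:
            # update Q1: the maximizing action is chosen by the other estimate,
            # decoupling action selection from the estimate being updated
            a_star = int(np.argmax(Q2[s_next]))
            target = r + gamma * Q1[s_next, a_star] * (not done)
            Q1[s, a] += alpha * (target - Q1[s, a])
        else:
            a_star = int(np.argmax(Q1[s_next]))
            target = r + gamma * Q2[s_next, a_star] * (not done)
            Q2[s, a] += alpha * (target - Q2[s, a])
        s = env.reset() if done else s_next
    return Q1, Q2
```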

Deep RL

Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL
Some immediate improvements (many others!)
Double DQN (Deep Reinforcement Learning with Double Q-Learning,
Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network Architectures
for Deep Reinforcement Learning, Wang et al, ICML 2016)

Check Your Understanding: Mars Rover Model-Free Policy
Evaluation

!" !# !$ !% !& !' !(

) !" = +1 ) !# = 0 ) !$ = 0 ) !% = 0 ) !& = 0 ) !' = 0 ) !( = +10

89/: ./01/!123
.2456 7214 .2456 7214

π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates episode


Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
First visit MC estimate of V of each state? [1 1 1 0 0 0 0]
TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
Choose 2 "replay" backups to do. Which should we pick to get an estimate closest to the MC first-visit estimate? (A small sketch for trying both orderings follows the options below.)
1 Doesn’t matter, any will yield the same
2 (s3 , a1 , 0, s2 ) then (s2 , a1 , 0, s1 )
3 (s2 , a1 , 0, s1 ) then (s3 , a1 , 0, s2 )
4 (s2 , a1 , 0, s1 ) then (s3 , a1 , 0, s2 )
5 Not sure
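A small sketch for trying both orderings under the setup above (α = 1, γ = 1, TD values initialized from the single observed trajectory); the helper and state names are purely illustrative:

```python
def replay(backups):
    """Apply TD(0) replay updates with alpha = 1, gamma = 1 on the Mars rover example."""
    # TD estimates after processing the trajectory once: V(s1) = 1, V(s2) = V(s3) = 0
    V = {"s1": 1.0, "s2": 0.0, "s3": 0.0}
    for s, r, s_next in backups:
        V[s] = r + V[s_next]   # TD(0) target with alpha = 1, gamma = 1
    return V

print(replay([("s3", 0.0, "s2"), ("s2", 0.0, "s1")]))  # ordering in option 2
print(replay([("s2", 0.0, "s1"), ("s3", 0.0, "s2")]))  # ordering in options 3/4
```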
Impact of Replay?

In tabular TD-learning, the order in which updates are replayed can help speed learning
Repeating some updates seems to propagate information better than others
Are there systematic ways to prioritize updates?

Potential Impact of Ordering Episodic Replay Updates

Figure: Schaul, Quan, Antonoglou, Silver ICLR 2016



Oracle: picks the (s, a, r, s′) tuple to replay that will minimize global loss
Exponential improvement in convergence, measured in the number of updates needed to converge
The oracle is not a practical method, but it illustrates the impact of ordering
Prioritized Experience Replay

Let i be the index of the i-th tuple of experience (si, ai, ri, si+1)


Sample tuples for update using priority function
Priority of a tuple i is proportional to DQN error

pi = | ri + γ max_{a'} Q(si+1, a'; w−) − Q(si, ai; w) |

Update pi after every update; pi for new tuples is set to the maximal priority so far, so new experience is replayed at least once


One method: proportional (stochastic prioritization), sketched below


P(i) = pi^α / Σ_k pk^α

(See the paper for details and an alternative.)
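A minimal sketch of the proportional scheme over a toy replay buffer (plain NumPy, ignoring the sum-tree data structure and importance-sampling corrections discussed in the paper; class and method names are illustrative):

```python
import numpy as np

class ProportionalReplay:
    """Toy prioritized replay: sample index i with P(i) = p_i^alpha / sum_k p_k^alpha."""

    def __init__(self, alpha=0.6, eps=1e-6):
        self.alpha, self.eps = alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition):
        # give new tuples the current maximum priority so they are replayed at least once
        p_max = max(self.priorities, default=1.0)
        self.buffer.append(transition)
        self.priorities.append(p_max)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return idx, [self.buffer[i] for i in idx]

    def update_priority(self, idx, td_errors):
        # priority proportional to the absolute TD/DQN error of the latest update
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = abs(delta) + self.eps
```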
Exercise: Prioritized Replay
Let i be the index of the i-th tuple of experience (si, ai, ri, si+1)
Sample tuples for update using priority function
Priority of a tuple i is proportional to DQN error

pi = | ri + γ max_{a'} Q(si+1, a'; w−) − Q(si, ai; w) |

Update pi after every update; pi for new tuples is set to the maximal priority so far, so new experience is replayed at least once


One method: proportional (stochastic prioritization)

P(i) = pi^α / Σ_k pk^α
α = 0 yields what rule for selecting among existing tuples?
Selects randomly
Selects the one with the highest priority
It depends on the priorities of the tuples
Not Sure

Performance of Prioritized Replay vs Double DQN

Figure: Schaul, Quan, Antonoglou, Silver ICLR 2016

Deep RL

Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL
Some immediate improvements (many others!)
Double DQN (Deep Reinforcement Learning with Double Q-Learning,
Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network
Architectures for Deep Reinforcement Learning, Wang et al, ICML
2016)

Value & Advantage Function

Intuition: the features needed to accurately represent the value may be different from those needed to specify the relative difference between actions
E.g.
Game score may help accurately predict V(s)
But it is not necessarily helpful for indicating the relative action values Q(s, a1) vs Q(s, a2)
Advantage function (Baird 1993)

A^π(s, a) = Q^π(s, a) − V^π(s)

Dueling DQN

Check Understanding: Unique?

Advantage function

A^π(s, a) = Q^π(s, a) − V^π(s)

For a given advantage function, is there a unique Q and V ?


1 Yes
2 No
3 Not sure

Uniqueness

Advantage function

A^π(s, a) = Q^π(s, a) − V^π(s)

Not unique
Option 1: Force A(s, a) = 0 if a is the action taken

Q̂(s, a; w) = V̂(s; w) + ( Â(s, a; w) − max_{a'∈A} Â(s, a'; w) )

Option 2: Use mean as baseline (more stable)


Q̂(s, a; w) = V̂(s; w) + ( Â(s, a; w) − (1/|A|) Σ_{a'∈A} Â(s, a'; w) )
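A small NumPy sketch of the dueling head with the mean baseline (Option 2) for a single state; the feature vector and weight shapes are illustrative assumptions, not the exact architecture from the paper:

```python
import numpy as np

def dueling_q(features, w_v, w_a):
    """Combine value and advantage streams into Q-values using the mean baseline.

    features: shared feature vector phi(s), shape (d,)
    w_v:      value-stream weights,         shape (d,)
    w_a:      advantage-stream weights,     shape (d, n_actions)
    """
    v = features @ w_v            # scalar estimate of V(s)
    a = features @ w_a            # one advantage estimate per action
    return v + (a - a.mean())     # Q(s, a) for every action; mean advantage is zero
```

Adding a constant to v and subtracting it from every entry of a leaves the output unchanged, which is exactly the non-uniqueness the baseline removes.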

Dueling DQN vs. Double DQN with Prioritized Replay

Figure: Wang et al, ICML 2016


Practical Tips for DQN on Atari (from J. Schulman)

DQN is more reliable on some Atari tasks than others. Pong is a reliable task: if it doesn't achieve good scores, something is wrong
Large replay buffers improve robustness of DQN, and memory
efficiency is key
Use uint8 images, don’t duplicate data
Be patient. DQN converges slowly—for ATARI it’s often necessary to
wait for 10-40M frames (couple of hours to a day of training on GPU)
to see results significantly better than random policy
In our Stanford class: Debug implementation on small test
environment

Practical Tips for DQN on Atari (from J. Schulman) cont.

Try the Huber loss on the Bellman error:

L(x) = x²/2            if |x| ≤ δ
L(x) = δ|x| − δ²/2     otherwise
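A direct NumPy transcription of this piecewise loss (δ = 1 is a common choice for DQN, but treat the default here as illustrative):

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss on the Bellman error x: quadratic near zero, linear in the tails."""
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * np.abs(x) - 0.5 * delta ** 2)
```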

Practical Tips for DQN on Atari (from J. Schulman) cont.

Try the Huber loss on the Bellman error:

L(x) = x²/2            if |x| ≤ δ
L(x) = δ|x| − δ²/2     otherwise

Consider trying Double DQN: a significant improvement from a small code change in TensorFlow.
To test out your data pre-processing, try your own skills at navigating
the environment based on processed frames
Always run at least two different seeds when experimenting
Learning rate scheduling is beneficial. Try high learning rates in initial
exploration period
Try non-standard exploration schedules
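For example, a linearly annealed ε schedule is one simple non-constant schedule (the start/end values and horizon below are illustrative placeholders, not the lecture's recommendation):

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps, then hold."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```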
Recap: Deep Model-free RL, 3 of the Big Ideas

Double DQN (Deep Reinforcement Learning with Double Q-Learning, Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network
Architectures for Deep Reinforcement Learning, Wang et al, ICML
2016)

Deep Reinforcement Learning

Hessel, Matteo, et al. "Rainbow: Combining Improvements in Deep Reinforcement Learning."
Summary of Model Free Value Function Approximation
with DNN

DNNs are very expressive function approximators


Can use to represent the Q function and do MC or TD style methods
Should be able to implement DQN (assignment 2)
Be able to list a few extensions that help performance beyond DQN

We want RL Algorithms that Perform

Optimization
Delayed consequences
Exploration
Generalization
And do it all statistically and computationally efficiently

Generalization and Efficiency

We will discuss efficient exploration in more depth later in the class


But there exist hardness results showing that learning in a generic MDP can require a large number of samples to learn a good policy
Alternate idea: use structure and additional knowledge to help constrain and speed up reinforcement learning
Today: Imitation learning
Later:
Policy search (can encode domain knowledge in the form of the policy
class used)
Strategic exploration
Incorporating human help (in the form of teaching, reward
specification, action specification, . . . )

Class Structure

Last time: CNNs and Deep Reinforcement learning


This time: Imitation Learning with Large State Spaces
Next time: Policy Search

So Far in this Course

Reinforcement Learning: Learning policies guided by (often sparse) rewards (e.g. win the game or not)
Good: simple, cheap form of supervision
Bad: High sample complexity
Where is it most successful?
In simulation where data is cheap and parallelization is easy
Harder when:
Execution of actions is slow
Very expensive or not tolerable to fail
Want to be safe

Reward Shaping

Rewards that are dense in time closely guide the agent. How can we
supply these rewards?
Manually design them: often brittle
Implicitly specify them through demonstrations

Learning from Demonstration for Autonomous Navigation in Complex Unstructured Terrain, Silver et al. 2010
Examples

Simulated highway driving [ Abbeel and Ng, ICML 2004; Syed and
Schapire, NIPS 2007; Majumdar et al., RSS 2017 ]
Parking lot navigation [Abbeel, Dolgov, Ng, and Thrun, IROS 2008]

Learning from Demonstrations

Expert provides a set of demonstration trajectories: sequences of states and actions
Imitation learning is useful when it is easier for the expert to
demonstrate the desired behavior rather than:
Specifying a reward that would generate such behavior,
Specifying the desired policy directly

Problem Setup

Input:
State space, action space
Transition model P(s 0 | s, a)
No reward function R
Set of one or more of the teacher's demonstrations (s0, a0, s1, a1, ...)
(actions drawn from the teacher's policy π∗)
Behavioral Cloning:
Can we directly learn the teacher’s policy using supervised learning?
Inverse RL:
Can we recover R?
Apprenticeship learning via Inverse RL:
Can we use R to generate a good policy?

Table of Contents

1 Behavioral Cloning

2 Inverse Reinforcement Learning

Behavioral Cloning

Formulate the problem as a standard machine learning problem:
Fix a policy class (e.g. neural network, decision tree, etc.)
Estimate a policy from training examples (s0, a0), (s1, a1), (s2, a2), ... (a minimal sketch follows this list)
Two notable success stories:
Pomerleau, NIPS 1989: ALVINN
Sammut et al., ICML 1992: Learning to fly in a flight simulator
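A minimal behavioral-cloning sketch using scikit-learn's MLPClassifier as the policy class (any supervised learner would do; the function and variable names are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def behavioral_cloning(demo_states, demo_actions):
    """Fit a policy to expert (state, action) pairs by ordinary supervised learning.

    demo_states:  array of shape (N, state_dim) collected from expert trajectories
    demo_actions: array of shape (N,) with the expert's discrete actions
    """
    policy = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    policy.fit(demo_states, demo_actions)
    return policy   # policy.predict(s) imitates the expert's action at state s
```

At test time, a = policy.predict(s.reshape(1, -1))[0] gives the cloned action.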

ALVINN

Problem: Compounding Errors

Supervised learning assumes i.i.d. (s, a) pairs and ignores temporal structure
Errors independent in time:

Error at time t with probability ε

E[Total errors] ≤ εT

Problem: Compounding Errors

Data distribution mismatch!


In supervised learning, (x, y ) ∼ D during train and test. In MDPs:
Train: st ∼ Dπ∗
Test: st ∼ Dπθ

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al. 2011
Problem: Compounding Errors

Error at time t with probability ε

E[Total errors] ≤ ε(T + (T − 1) + (T − 2) + ... + 1) ∝ εT²

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al. 2011
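A worked version of the usual argument behind this quadratic bound (a sketch, not a statement from the slide): if the learned policy errs with probability ε at each step and an error at step t can derail the remaining T − t steps, then

```latex
\mathbb{E}[\text{total errors}]
  \;\le\; \epsilon \sum_{t=1}^{T} (T - t + 1)
  \;=\; \epsilon \, \frac{T(T+1)}{2}
  \;=\; O(\epsilon T^{2}),
```

compared with the εT bound under the i.i.d. assumption.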
DAGGER: Dataset Aggregation

Idea: Get more labels of the expert's action along the path taken by the policy computed by behavioral cloning (a sketch of the loop follows below)
Obtains a stationary deterministic policy with good performance
under its induced state distribution
Key limitation?
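A rough sketch of the DAGGER loop (assuming hypothetical helpers expert_policy(s) and fit_policy(dataset), and a Gym-style environment; this shows the general shape of the algorithm, not the exact procedure from the paper):

```python
def dagger(env, expert_policy, fit_policy, n_iters=10, rollout_len=200):
    """DAGGER sketch: roll out the current policy, relabel visited states with
    expert actions, aggregate the data, and refit by supervised learning."""
    dataset = []        # aggregated (state, expert_action) pairs
    policy = None
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(rollout_len):
            a_expert = expert_policy(s)                    # expert labels the visited state
            a = a_expert if policy is None else policy(s)  # first iteration follows the expert
            dataset.append((s, a_expert))
            s, _, done, _ = env.step(a)
            if done:
                s = env.reset()
        policy = fit_policy(dataset)                       # supervised learning on the aggregate
    return policy
```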
Table of Contents

1 Behavioral Cloning

2 Inverse Reinforcement Learning

Feature Based Reward Function

Given state space, action space, transition model P(s 0 | s, a)


No reward function R
Set of one or more of the teacher's demonstrations (s0, a0, s1, a1, ...)
(actions drawn from the teacher's policy π∗)
Goal: infer the reward function R
Assume that the teacher’s policy is optimal. What can be inferred
about R?

Check Your Understanding: Feature Based Reward
Function

Given state space, action space, transition model P(s 0 | s, a)


No reward function R
Set of one or more of the teacher's demonstrations (s0, a0, s1, a1, ...)
(actions drawn from the teacher's policy π∗)
Goal: infer the reward function R
Assume that the teacher’s policy is optimal.

1 There is a single unique R that makes the teacher's policy optimal
2 There are many possible R that make the teacher's policy optimal
3 It depends on the MDP
4 Not sure

Linear Feature Reward Inverse RL

Recall linear value function approximation


Similarly, here consider when reward is linear over features
R(s) = w^T x(s) where w ∈ R^n, x : S → R^n
Goal: identify the weight vector w given a set of demonstrations
The resulting value function for a policy π can be expressed as

V^π = Es∼π[ Σ_{t=0}^∞ γ^t R(st) ]

Linear Feature Reward Inverse RL

Recall linear value function approximation


Similarly, here consider when reward is linear over features
R(s) = w^T x(s) where w ∈ R^n, x : S → R^n
Goal: identify the weight vector w given a set of demonstrations
The resulting value function for a policy π can be expressed as

V^π = Es∼π[ Σ_{t=0}^∞ γ^t R(st) | π ]
    = Es∼π[ Σ_{t=0}^∞ γ^t w^T x(st) | π ]
    = w^T Es∼π[ Σ_{t=0}^∞ γ^t x(st) | π ]
    = w^T µ(π)

where µ(π) is defined as the discounted weighted frequency of state features under policy π.
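A small sketch of how µ(π) could be estimated from sampled trajectories by a discounted Monte Carlo sum of features (the trajectory format and the feature map x(·) are assumed inputs, not objects defined in the lecture):

```python
import numpy as np

def feature_expectations(trajectories, x, gamma=0.99):
    """Monte Carlo estimate of mu(pi) = E_pi[ sum_t gamma^t x(s_t) ].

    trajectories: list of state sequences [s_0, s_1, ...] sampled by following pi
    x:            feature map, x(s) -> np.ndarray of shape (n,)
    """
    mu = np.zeros_like(x(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * x(s)
    return mu / len(trajectories)

# With R(s) = w^T x(s), the value of pi is then estimated by w @ feature_expectations(...).
```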

Relating Frequencies to Optimality

Assume R(s) = w^T x(s) where w ∈ R^n, x : S → R^n


Goal: identify the weight vector w given a set of demonstrations
V^π = Es∼π[ Σ_{t=0}^∞ γ^t R∗(st) | π ] = w^T µ(π), where
µ(π)(s) = discounted weighted frequency of state s under policy π.

V∗ ≥ V^π

Relating Frequencies to Optimality
Recall linear value function approximation
Similarly, here consider when reward is linear over features
R(s) = w^T x(s) where w ∈ R^n, x : S → R^n
Goal: identify the weight vector w given a set of demonstrations
The resulting value function for a policy π can be expressed as

V^π = w^T µ(π)

µ(π)(s) = discounted weighted frequency of state s under policy π.



Es∼π∗[ Σ_{t=0}^∞ γ^t R∗(st) | π∗ ] = V∗ ≥ V^π = Es∼π[ Σ_{t=0}^∞ γ^t R∗(st) | π ]   ∀π

Therefore, if the expert's demonstrations are from the optimal policy, then to identify w it is sufficient to find w∗ such that

w∗^T µ(π∗) ≥ w∗^T µ(π), ∀π ≠ π∗
Feature Matching

Want to find a reward function such that the expert policy outperforms other policies.
For a policy π to be guaranteed to perform as well as the expert policy π∗, it is sufficient that its discounted summed feature expectations match the expert policy's [Abbeel & Ng, 2004].
More precisely, if

‖µ(π) − µ(π∗)‖_1 ≤ ε

then for all w with ‖w‖_∞ ≤ 1:

|w^T µ(π) − w^T µ(π∗)| ≤ ε
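The bound is a one-line consequence of Hölder's inequality under the linear reward assumption R(s) = w^T x(s), sketched here for completeness:

```latex
|w^{\top}\mu(\pi) - w^{\top}\mu(\pi^{*})|
  = |w^{\top}(\mu(\pi) - \mu(\pi^{*}))|
  \le \|w\|_{\infty}\,\|\mu(\pi) - \mu(\pi^{*})\|_{1}
  \le \epsilon .
```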

Ambiguity

There is an infinite number of reward functions with the same optimal policy.
There are infinitely many stochastic policies that can match feature
counts
Which one should be chosen?

Learning from Demonstration / Imitation Learning Pointers

Many different approaches


Two of the key papers are:
Maximum Entropy Inverse Reinforcement Learning (Ziebart et al., AAAI 2008)
Generative adversarial imitation learning (Ho and Ermon, NeurIPS
2016)

Summary

Imitation learning can greatly reduce the amount of data needed to learn a good policy
Challenges remain, and one exciting area is combining inverse RL / learning from demonstration with online reinforcement learning
For a look at some of the theory connecting imitation learning and RL, see Sun, Venkatraman, Gordon, Boots, Bagnell (ICML 2017)

Class Structure

Last time: CNNs and Deep Reinforcement learning


This time: DRL and Imitation Learning in Large State Spaces
Next time: Policy Search

