cs237b Lecture 12
Imitation Learning
The formulation of the imitation learning problem is quite similar to the RL problem formulation from the previous chapter. The main difference is that instead of leveraging an explicit reward function r_t = R(x_t, u_t), it is assumed that a set of demonstrations from an expert is provided. (Note: the issue of sparse rewards is less relevant if data is cheap, for example when training in simulation.)

It will be assumed that the system is a Markov Decision Process (MDP) with state x and control input u, where the sets of admissible states and controls are denoted as X and U. (Note: the field of RL often uses s to express the state and a to represent an action, but x and u are used here for consistency with previous chapters.) The system dynamics are expressed by the probabilistic transition model:

p(x_t | x_{t−1}, u_{t−1}).    (10.1)
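As a concrete (hypothetical) illustration of such a transition model, the sketch below samples x_t from an assumed linear-Gaussian model; the matrices and noise level are placeholders, not from the notes:

```python
import numpy as np

# Hypothetical linear-Gaussian transition model p(x_t | x_{t-1}, u_{t-1}):
#   x_t = A x_{t-1} + B u_{t-1} + w,   w ~ N(0, SIGMA^2 I)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed dynamics matrix
B = np.array([[0.0], [0.1]])             # assumed input matrix
SIGMA = 0.01                             # assumed process-noise standard deviation

def sample_transition(x_prev, u_prev, rng):
    """Draw x_t ~ p(x_t | x_{t-1}, u_{t-1}) for the assumed model."""
    mean = A @ x_prev + B @ u_prev
    return mean + SIGMA * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
x_next = sample_transition(np.array([0.0, 0.0]), np.array([1.0]), rng)
```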
Each demonstration is a trajectory of states and controls which are drawn from the expert policy π∗. The imitation learning problem is therefore to determine a policy π that imitates the expert policy π∗:
Definition 10.1.1 (Imitation Learning Problem). For a system with transition model
(10.1) with states x ∈ X and controls u ∈ U , the imitation learning problem is to
leverage a set of demonstrations Ξ = {ξ 1 , . . . , ξ D } from an expert policy π ∗ to find a
policy π̂ ∗ that imitates the expert policy.
There are generally two approaches to imitation learning: the first is to di-
rectly learn how to imitate the expert’s policy and the second is to indirectly
imitate the policy by instead learning the expert’s reward function. This chap-
ter will first introduce two classical approaches to imitation learning (behavior
cloning and the DAgger algorithm) that focus on directly imitating the policy.
Then a set of approaches for learning the expert’s reward function will be dis-
cussed, which is commonly referred to as inverse reinforcement learning. The
chapter will then conclude with a few short discussions of related topics on learning from experts (e.g. through comparisons or physical feedback) as well as on interaction-aware control.
The most direct approach, typically referred to as behavior cloning, is to fit a policy to the expert's demonstrations via supervised learning:

π̂∗ = arg min_π  ∑_x L(π(x), π∗(x)),

where the sum is over the states x visited in the demonstrations Ξ, L is the cost function⁴, π∗(x) is the expert's action at the state x, and π̂∗ is the approximated policy.

4. Different loss functions could include p-norms (e.g. the Euclidean norm) or f-divergences (e.g. the KL divergence), depending on the form of the policy.
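As a minimal sketch of behavior cloning under an assumed linear policy and squared-error loss (the demonstration arrays below are placeholders for data extracted from Ξ):

```python
import numpy as np

# Placeholder demonstration data: states visited by the expert and the
# corresponding expert actions pi*(x). Here the "expert" is a random linear map.
X = np.random.randn(500, 4)          # demonstrated states
U = X @ np.random.randn(4, 2)        # corresponding expert actions

# Behavior cloning with a linear policy u ≈ x W and squared-error loss:
# solve min_W sum_x ||x W - pi*(x)||^2 in closed form via least squares.
W, *_ = np.linalg.lstsq(X, U, rcond=None)

def policy(x):
    """Approximated policy pi-hat*(x) obtained by behavior cloning."""
    return x @ W
```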
However, this approach may not yield very good performance since the learning process is based only on the set of samples provided by the expert. In many cases the expert demonstrations will not be uniformly sampled across the entire state space, and therefore the learned policy is likely to perform poorly in states far from those found in Ξ. This is particularly true when the expert demonstrations come from a trajectory of sequential states and actions, such that the distribution of the sampled states x in the dataset is defined by the expert policy. Then, when an estimated policy π̂∗ is used in practice it produces its own distribution of visited states, which will likely not be the same as in the expert demonstrations! This distributional mismatch leads to compounding errors, which is a major challenge in imitation learning.
One way to address this distributional mismatch is to keep querying the expert for more information! The behavior cloning algorithm that leverages this idea is known as DAgger⁶ (Dataset Aggregation).

6. S. Ross, G. Gordon, and D. Bagnell. "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning". In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 627–635.

Algorithm 1: DAgger: Dataset Aggregation
    Data: π∗
    Result: π̂∗
    D ← ∅
    Initialize π̂
    for i = 1 to N do
        π_i = β_i π∗ + (1 − β_i) π̂
        Rollout policy π_i to sample trajectory τ = {x_0, x_1, . . . }
        Query expert to generate dataset D_i = {(x_0, π∗(x_0)), (x_1, π∗(x_1)), . . . }
        Aggregate datasets, D ← D ∪ D_i
        Retrain policy π̂ using aggregated dataset D
    return π̂
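Below is a compact sketch of Algorithm 1 in Python. The helpers expert_action (the queryable expert π∗), rollout (returns the states visited when executing a policy), and fit_policy (supervised learning on state–action pairs) are hypothetical; the dataset is seeded with one expert rollout as an initialization choice, and the mixture π_i is realized by sampling the expert with probability β_i:

```python
import numpy as np

def dagger(expert_action, rollout, fit_policy, n_iters=10, beta0=1.0, decay=0.5):
    """Sketch of DAgger (Algorithm 1). expert_action(x) queries the expert pi*,
    rollout(policy) returns the states visited under a policy, and
    fit_policy(dataset) performs supervised learning on (state, action) pairs."""
    dataset = [(x, expert_action(x)) for x in rollout(expert_action)]  # seed D
    policy = fit_policy(dataset)                                       # initialize pi-hat
    for i in range(n_iters):
        beta = beta0 * (decay ** i)        # mixing coefficient beta_i
        def mixed(x, policy=policy, beta=beta):
            # pi_i = beta_i * pi* + (1 - beta_i) * pi-hat, realized by sampling
            return expert_action(x) if np.random.rand() < beta else policy(x)
        states = rollout(mixed)                                   # roll out pi_i
        dataset += [(x, expert_action(x)) for x in states]        # query expert: D_i
        policy = fit_policy(dataset)                              # retrain on D = D ∪ D_i
    return policy
```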
Approaches that directly learn policies to imitate expert actions can be limited by several factors, which motivates the alternative of instead learning the expert's reward function (inverse reinforcement learning).
A common starting point is to assume that the reward function can be expressed as a linear combination of known features φ(x, u):

R(x, u) = w^T φ(x, u),

where w is a vector of weights. The value function of a policy π over a horizon T is then

V_T^π(x) = E[ ∑_{t=0}^{T−1} γ^t R(x_t, π(x_t)) | x_0 = x ],

which, under the linear reward assumption, can be written as

V_T^π(x) = w^T μ(π, x),    μ(π, x) = E_π[ ∑_{t=0}^{T−1} γ^t φ(x_t, π(x_t)) | x_0 = x ],

where μ(π, x) is referred to as the feature expectation of the policy π.
Theoretically, identifying the vector w∗ associated with the expert policy can be accomplished by finding a vector w for which the expert policy outperforms all other policies, i.e. w^T μ(π∗, x) ≥ w^T μ(π, x) for all π. However, this can potentially lead to ambiguities! For example, the choice w = 0 satisfies this condition trivially! In fact, reward ambiguity is one of the main challenges associated with inverse reinforcement learning¹⁰. The algorithms discussed in the following chapters will propose techniques for alleviating this issue.

10. A. Ng and S. Russell. "Algorithms for Inverse Reinforcement Learning". In: Proceedings of the Seventeenth International Conference on Machine Learning. 2000, pp. 663–670.
In particular, if a policy π approximately matches the expert's feature expectations, then by the Cauchy–Schwarz inequality the difference in value satisfies

|w^T μ(π, x) − w^T μ(π∗, x)| ≤ ‖μ(π, x) − μ(π∗, x)‖₂,

for any w as long as ‖w‖₂ ≤ 1. In other words, as long as the feature expectations can be matched, the performance will be as good as the expert's even if the vector w does not match w∗. Another practical aspect of the approach is that the initial state x_0 is assumed to be drawn from a distribution D, such that the value function is also considered in expectation:

E_{x_0∼D}[V_T^π(x_0)] = w^T μ(π),    μ(π) = E_π[ ∑_{t=0}^{T−1} γ^t φ(x_t, π(x_t)) ],

where the expectation defining μ(π) is now also taken over the initial state x_0 ∼ D.
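A sketch of estimating μ(π) by Monte Carlo rollouts is shown below; the helpers sample_initial_state (draws x_0 ∼ D), sample_transition, and features (the map φ) are assumed to be provided:

```python
import numpy as np

def feature_expectations(policy, features, sample_initial_state,
                         sample_transition, horizon=50, gamma=0.95,
                         n_rollouts=100):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(x_t, pi(x_t)) ],
    with x_0 ~ D. The sampling helpers and feature map are assumed given."""
    mu = None
    for _ in range(n_rollouts):
        x = sample_initial_state()          # x_0 ~ D
        acc = None
        for t in range(horizon):
            u = policy(x)
            phi = features(x, u)            # phi(x_t, pi(x_t)), a numpy array
            acc = gamma**t * phi if acc is None else acc + gamma**t * phi
            x = sample_transition(x, u)     # x_{t+1} ~ p(. | x_t, u_t)
        mu = acc if mu is None else mu + acc
    return mu / n_rollouts
```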
    if t_i ≤ ε then
        π̂∗ ← best feature matching policy from {π_0, . . . , π_{i−1}}
        return π̂∗
    Use RL to find an optimal policy π_i for the reward function defined by w_i
t∗(w) = max_t  t,
      s.t.  w^T μ(π∗) ≥ w^T μ(π) + t,   ∀ π ∈ {π_0, π_1, . . . },

which is essentially computing the smallest performance loss among the candidate policies {π_0, π_1, . . . } with respect to the expert policy, assuming the reward function weights are w. If w were known, then t∗(w) ≤ ε would guarantee that one of the candidate policies effectively performs as well as the expert.
Since w is not known, the actual optimization problem (10.5) maximizes this smallest performance loss over all vectors w with ‖w‖₂ ≤ 1. Therefore, if t_i ≤ ε (i.e. the termination condition in Algorithm 2), then there must be a candidate policy whose performance loss is small for all possible choices of w! In other words, there is a candidate policy that matches the feature expectations well enough that good performance can be guaranteed without assuming the reward function is known, and without attempting to estimate the reward accurately.
Again this problem computes the reward function vector w such that the expert
policy maximally outperforms the policies in the set {π0 , π1 , . . . }.
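A sketch of this max-margin feature-matching step using cvxpy is shown below, assuming the feature expectation vectors of the expert and of the candidate policies have already been estimated (e.g. by Monte Carlo rollouts as above):

```python
import cvxpy as cp

def max_margin_weights(mu_expert, mu_candidates):
    """Solve max_{||w||_2 <= 1} min_j w^T (mu_expert - mu_j), i.e. the
    optimization that yields the pair (w_i, t_i) in apprenticeship learning."""
    d = mu_expert.shape[0]
    w = cp.Variable(d)
    t = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_expert >= w @ mu_j + t for mu_j in mu_candidates]
    problem = cp.Problem(cp.Maximize(t), constraints)
    problem.solve()
    return w.value, t.value
```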
The maximum margin planning (MMP) approach uses a similar formulation, but improves it in two ways: it adds a slack term to account for potential expert suboptimality, and it adds a similarity function that gives more “margin” to policies that are dissimilar to the expert policy. This new formulation is:
ŵ∗ = arg min_{w,v}  ‖w‖₂² + C v,
      s.t.  w^T μ(π∗) ≥ w^T μ(π) + m(π∗, π) − v,   ∀ π ∈ {π_0, π_1, . . . },        (10.6)
where v is a slack variable that can account for expert suboptimality, C > 0 is a
hyperparameter that is used to penalize the amount of assumed suboptimality,
and m(π ∗ , π ) is a function that quantifies how dissimilar two policies are.
One example of where this formulation is advantageous over the apprentice-
ship learning formulation (10.5) is when the expert is suboptimal. In this case it
is possible that there is no w that makes the expert policy outperform all other
policies, such that the optimization (10.5) returns w_i = 0 and t_i = 0 (which is obviously not the appropriate solution). Alternatively, the slack variable in the MMP formulation allows for a reasonable w to be computed.
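Analogously, a sketch of the MMP problem (10.6) with cvxpy, assuming the margins m(π∗, π_j) have been precomputed and taking the slack variable to be nonnegative:

```python
import cvxpy as cp

def mmp_weights(mu_expert, mu_candidates, margins, C=10.0):
    """Solve min_{w,v} ||w||_2^2 + C v subject to the margin constraints of
    (10.6). margins[j] is the precomputed dissimilarity m(pi*, pi_j)."""
    d = mu_expert.shape[0]
    w = cp.Variable(d)
    v = cp.Variable(nonneg=True)      # slack accounting for expert suboptimality
    constraints = [w @ mu_expert >= w @ mu_j + m_j - v
                   for mu_j, m_j in zip(mu_candidates, margins)]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * v), constraints)
    problem.solve()
    return w.value, v.value
```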
p(τ) ≥ 0, ∀ τ.

The resulting maximum entropy distribution takes an exponential form:

p∗(τ, λ) = (1/Z(λ)) e^{λ^T f(τ)},    Z(λ) = ∫ e^{λ^T f(τ)} dτ,

where Z(λ) normalizes the distribution, and where λ must be chosen such that the feature expectations match:

∫ p∗(τ, λ) f(τ) dτ = ∫ p_{π∗}(τ) f(τ) dτ.
In other words, the maximum entropy IRL approach tries to find a distribution parameterized by λ that matches the expert's feature expectations, but also requires that the distribution p∗(τ, λ) belong to the exponential family.
To determine the value of λ that matches features, it is assumed that the expert also selects trajectories with high reward with exponentially higher probability:

p_{π∗}(τ) ∝ e^{w∗^T f(τ)},

and therefore ideally λ = w∗. Of course w∗ (and more generally p_{π∗}(τ)) are not known, and therefore a maximum likelihood estimation approach is used to compute λ to best approximate w∗ based on the sampled expert demonstrations²⁰.

20. By assuming the expert policy is also exponential, the maximum likelihood estimate is theoretically consistent (i.e. λ → w∗ as the number of demonstrations approaches infinity).

In particular, an estimate ŵ∗ of the reward weights is computed from the expert demonstrations Ξ = {ξ_0, ξ_1, . . . } (where each demonstration ξ_i is a trajectory) by solving the maximum likelihood problem:

ŵ∗ = arg max_λ  J(λ),    J(λ) = ∑_{ξ_i∈Ξ} log p∗(ξ_i, λ),
which can be solved using a gradient ascent algorithm, where the gradient is computed as:

∇_λ J(λ) = ∑_{ξ_i∈Ξ} ( f(ξ_i) − E_{τ∼p∗(τ,λ)}[f(τ)] ).
The first term of this gradient is easily computable since the expert demonstrations are known, and the second term can be approximated through Monte Carlo sampling. However, this Monte Carlo estimate requires sampling trajectories from the distribution p∗(τ, λ). This leads to an iterative algorithm:

1. Initialize λ and collect the set of expert demonstrations Ξ = {ξ_0, ξ_1, . . . }.

2. Compute the optimal policy²¹ π_λ with respect to the reward function with w = λ.

3. Using the policy π_λ, sample trajectories of the system and compute an approximation of E_{τ∼p∗(τ,λ)}[f(τ)].

4. Take a gradient step on λ using ∇_λ J(λ), then return to step 2 and repeat until convergence.

21. For example through traditional RL methods.
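The following is a minimal sketch of this iterative procedure, assuming hypothetical helper functions solve_rl(w) (returns an optimal policy for reward weights w), sample_trajectories(policy), and feature_counts(trajectory):

```python
import numpy as np

def maxent_irl(expert_features, solve_rl, sample_trajectories, feature_counts,
               n_iters=100, step_size=0.1):
    """Sketch of the iterative MaxEnt IRL procedure. expert_features is the
    list of feature counts f(xi_i) of the expert demonstrations; the helpers
    solve_rl(w), sample_trajectories(policy), and feature_counts(traj) are
    assumed to be provided."""
    d = expert_features[0].shape[0]
    lam = np.zeros(d)                                   # initialize lambda
    for _ in range(n_iters):
        policy = solve_rl(lam)                          # step 2: policy for w = lambda
        trajs = sample_trajectories(policy)             # step 3: sample trajectories
        expected_f = np.mean([feature_counts(t) for t in trajs], axis=0)
        # step 4: gradient step on the (summed) log-likelihood
        grad = np.sum([f - expected_f for f in expert_features], axis=0)
        lam += step_size * grad
    return lam
```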
where f(τ) are the collective feature counts (same as in Section 10.4), this comparison can be used to conclude that:

w∗^T f(τ_H) ≥ w∗^T f(τ_R),

which simply states that the reward of the new trajectory τ_H is higher. In other words, this comparison has split the space of possible reward weights w in half through the hyperplane defined by w^T ( f(τ_H) − f(τ_R) ) = 0. This insight is then leveraged in a maximum a posteriori approach for updating the estimate ŵ∗ after each interaction. Specifically, this update takes the form:

ŵ∗ ← ŵ∗ + β ( f(τ_H) − f(τ_R) ),
where β > 0 is a scalar step size. The robot then uses the new estimate to
change its policy, and the process iterates. Note that this idea yields an ap-
proach that is similar to the concept of matching feature expectations from
inverse reinforcement learning, except that the approach is iterative rather than
requiring a batch of complete expert demonstrations.
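As a small illustrative sketch, assuming trajectories are summarized by precomputed feature counts, the update rule could be written as:

```python
import numpy as np

def update_reward_weights(w_hat, f_human, f_robot, beta=0.1):
    """One iteration of w_hat <- w_hat + beta * (f(tau_H) - f(tau_R)), where
    f_human and f_robot are the feature counts of the human-preferred and
    robot trajectories, respectively."""
    return w_hat + beta * (np.asarray(f_human) - np.asarray(f_robot))
```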
Yet another interesting problem in robot autonomy arises when robots and humans are interacting to accomplish shared or individual goals.
In this setting the system dynamics depend on the actions of both agents:

p(x_t | x_{t−1}, u_{R,t−1}, u_{H,t−1}).

In other words, the interaction dynamics evolve according to the actions taken by both the robot and the human. In this interaction the robot's reward function is denoted as R_R(x, u_R, u_H) and the human's reward function is denoted as R_H(x, u_R, u_H), which are both functions of the combined state and both agents' actions²⁷.

27. While R_R and R_H do not have to be the same, choosing R_R = R_H may be desirable for the robot to achieve human-like behavior.

Under the assumption that both the robot and the human act optimally²⁸ with respect to their reward functions:

u∗_R(x) = arg max_{u_R} R_R(x, u_R, u∗_H(x)),
u∗_H(x) = arg max_{u_H} R_H(x, u∗_R(x), u_H).        (10.8)

28. While not necessarily true, this assumption is important to make the resulting problem formulation tractable to solve in practice.
In other words the human is assumed to see the action taken by the robot before
deciding on their own action. The robot policy can therefore be computed by
solving:
u∗_R(x) = arg max_{u_R} R_R(x, u_R, u∗_H(x, u_R)),        (10.9)
which can be solved using a gradient ascent approach. To apply gradient ascent, the gradient of

J(x, u_R) = R_R(x, u_R, u∗_H(x, u_R))

with respect to u_R is needed, which by the chain rule is

∂J/∂u_R = ∂R_R/∂u_R + (∂R_R/∂u∗_H)(∂u∗_H/∂u_R).
Since the reward function R_R is known, the terms ∂R_R/∂u_R and ∂R_R/∂u∗_H can be easily determined. In order to compute the term ∂u∗_H/∂u_R, which represents how much the robot's actions impact the human's actions, an additional step is required. First, assuming the human acts optimally according to (10.8), the necessary optimality condition is:
g(x, u_R, u∗_H) = 0,    g = ∂R_H / ∂u_H,
which for fixed values of x and u_R specifies u∗_H. Then, implicitly differentiating this condition with respect to the robot action u_R gives:

∂g/∂u_R + (∂g/∂u∗_H)(∂u∗_H/∂u_R) = 0,

which can be rearranged to give

∂u∗_H/∂u_R = −(∂g/∂u∗_H)^{−1} (∂g/∂u_R).
Notice that every term in this expression can be computed³⁰, and therefore it can be substituted into the gradient calculation:

∂J/∂u_R = ∂R_R/∂u_R − (∂R_R/∂u∗_H)(∂g/∂u∗_H)^{−1}(∂g/∂u_R).

30. Assuming the human's reward function is known.
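A sketch of how this gradient could be assembled numerically is shown below, for scalar robot and human actions and using finite differences for the partial derivatives; the reward functions R_R, R_H and the human best-response solver are hypothetical inputs:

```python
EPS = 1e-5

def grad_J(x, u_R, R_R, R_H, best_response):
    """Approximate dJ/du_R = dR_R/du_R - (dR_R/du_H*)(dg/du_H*)^-1 (dg/du_R)
    for scalar actions, where g = dR_H/du_H and u_H* = best_response(x, u_R).
    R_R(x, u_R, u_H), R_H(x, u_R, u_H), and best_response are assumed given."""
    u_H = best_response(x, u_R)

    # g(x, u_R, u_H) = dR_H/du_H, approximated by central differences
    g = lambda uR, uH: (R_H(x, uR, uH + EPS) - R_H(x, uR, uH - EPS)) / (2 * EPS)

    dRR_duR = (R_R(x, u_R + EPS, u_H) - R_R(x, u_R - EPS, u_H)) / (2 * EPS)
    dRR_duH = (R_R(x, u_R, u_H + EPS) - R_R(x, u_R, u_H - EPS)) / (2 * EPS)
    dg_duR = (g(u_R + EPS, u_H) - g(u_R - EPS, u_H)) / (2 * EPS)
    dg_duH = (g(u_R, u_H + EPS) - g(u_R, u_H - EPS)) / (2 * EPS)

    # For scalars the matrix inverse reduces to a division.
    return dRR_duR - dRR_duH * dg_duR / dg_duH
```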
The objective of intent inference is therefore to estimate the parameters θ, for example by leveraging the Maximum Entropy IRL approach.
This can be accomplished by maintaining a belief b_t(θ) over the parameters, which is updated after each observed human action using Bayes' rule:

b_{t+1}(θ) = (1/η) p(u_{H,t} | x_t, u_{R,t}, θ) b_t(θ),

where η is a normalization constant.
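A minimal sketch of this update over a discrete set of candidate parameters θ is shown below; the observation model likelihood(u_H, x, u_R, θ), standing in for p(u_H | x, u_R, θ), is an assumed input:

```python
import numpy as np

def update_belief(belief, thetas, x, u_R, u_H, likelihood):
    """Bayesian belief update b_{t+1}(theta) ∝ p(u_H | x, u_R, theta) b_t(theta)
    over a discrete set of candidate parameters."""
    weights = np.array([likelihood(u_H, x, u_R, th) * b
                        for th, b in zip(thetas, belief)])
    return weights / weights.sum()      # division by eta (normalization)
```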
The robot's reward function can then be augmented to also reward information gathering:

R_R(x, u_R, u_H, θ) = I(b(θ), u_R) + λ R_goal(x, u_R, u_H, θ),
where λ > 0 is a tuning parameter and I (b(θ), u R ) denotes a function that quan-
tifies the amount of information gained with respect to the belief distribution
from taking action u R . In other words the robot’s reward is a tradeoff between
exploiting the current knowledge of θ to accomplish the objective and taking
exploratory actions to improve the intent inference. With this robot reward func-
tion, the robot's actions are chosen to maximize the expected reward, where the expectation is taken with respect to the current belief b(θ).
For example, in a lane-change scenario the robot can take exploratory actions to determine whether the human driver will yield to the lane change (timid driving behavior) or block the lane change (aggressive driving behavior). Once the robot has a strong enough belief about the human's behavior it may choose to either complete the lane change or slow down to merge behind the human driver.
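A sketch of this tradeoff, assuming a discrete set of candidate robot actions, a discrete belief over θ, and hypothetical helpers for the information-gain term, the goal reward, and the predicted human response:

```python
def choose_robot_action(candidate_actions, belief, thetas, x,
                        info_gain, goal_reward, human_response, lam=1.0):
    """Pick u_R maximizing I(b, u_R) + lam * E_theta[R_goal(x, u_R, u_H, theta)],
    where the expectation is over the current belief b(theta). The helpers
    info_gain(belief, u_R), goal_reward(x, u_R, u_H, theta), and
    human_response(x, u_R, theta) are assumed to be provided."""
    def expected_reward(u_R):
        total = info_gain(belief, u_R)      # exploration term I(b(theta), u_R)
        for th, b in zip(thetas, belief):
            u_H = human_response(x, u_R, th)            # predicted human action
            total += b * lam * goal_reward(x, u_R, u_H, th)
        return total
    return max(candidate_actions, key=expected_reward)
```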