
10 Imitation Learning

As discussed in the previous chapter, the goal of reinforcement learning is to determine closed-loop control policies that maximize an accumulated reward, and RL algorithms are generally classified as either model-based or model-free. In both cases it is generally assumed that the reward function is known, and both typically rely on collecting system data to either update a learned model (model-based) or directly update a learned value function or policy (model-free).
While successful in many settings, these approaches to RL also suffer from several drawbacks. First, determining an appropriate reward function that accurately represents the true performance objectives can be challenging.[1] Second, rewards may be sparse, which makes the learning process expensive both in terms of the required amount of data and in the number of failures that may be experienced while exploring with a suboptimal policy.[2] This chapter introduces the imitation learning approach to RL, where the reward function is not assumed to be known a priori but rather is assumed to be described implicitly through expert demonstrations.

[1] RL agents can sometimes learn how to exploit a reward function without actually producing the desired behavior. This is commonly referred to as reward hacking. Consider training an RL agent with a reward for each piece of trash collected. Rather than searching the area to find more trash (the desired behavior), the agent may decide to throw the trash back onto the ground and pick it up again!

[2] This issue of sparse rewards is less relevant if data is cheap, for example when training in simulation.

Imitation Learning

The formulation of the imitation learning problem is quite similar to the RL problem formulation from the previous chapter. The main difference is that instead of leveraging an explicit reward function r_t = R(x_t, u_t), it is assumed that a set of demonstrations from an expert is provided.

10.1 Problem Formulation

It will be assumed that the system is a Markov Decision Process (MDP) with state x and control input u, and that the sets of admissible states and controls are denoted X and U. The system dynamics are expressed by the probabilistic transition model:

    p(x_t | x_{t−1}, u_{t−1}),    (10.1)

which is the conditional probability distribution over x_t given the previous state and control. (The field of RL often uses s for the state and a for the action, but x and u are used here for consistency with previous chapters.) As in the previous chapter, the goal is to define a policy π that defines the closed-loop control law:[3]

    u_t = π(x_t).    (10.2)

[3] This chapter will consider a stationary policy for simplicity.

The primary difference in formulation from the previous RL problem is that we do not have access to the reward function; instead, we have access to a set of expert demonstrations, where each demonstration ξ consists of a sequence of state-control pairs:

    ξ = {(x_0, u_0), (x_1, u_1), ...},    (10.3)

which are drawn from the expert policy π*. The imitation learning problem is therefore to determine a policy π that imitates the expert policy π*:

Definition 10.1.1 (Imitation Learning Problem). For a system with transition model (10.1) with states x ∈ X and controls u ∈ U, the imitation learning problem is to leverage a set of demonstrations Ξ = {ξ_1, ..., ξ_D} from an expert policy π* to find a policy π̂* that imitates the expert policy.

There are generally two approaches to imitation learning: the first is to directly learn how to imitate the expert's policy, and the second is to indirectly imitate the policy by instead learning the expert's reward function. This chapter will first introduce two classical approaches to imitation learning (behavior cloning and the DAgger algorithm) that focus on directly imitating the policy. Then a set of approaches for learning the expert's reward function will be discussed, which is commonly referred to as inverse reinforcement learning. The chapter concludes with a couple of short discussions of related topics on learning from experts (e.g. through comparisons or physical feedback) as well as on interaction-aware control.

10.2 Behavior Cloning

Behavior cloning approaches use a set of expert demonstrations ξ ∈ Ξ to determine a policy π that imitates the expert. This can be accomplished through supervised learning techniques, where the difference between the learned policy and the expert demonstrations is minimized with respect to some metric. Concretely, the goal is to solve the optimization problem:

    π̂* = arg min_π Σ_{ξ∈Ξ} Σ_{x∈ξ} L(π(x), π*(x)),

where L is the loss function,[4] π*(x) is the expert's action at the state x, and π̂* is the approximated policy.

[4] Different loss functions could include p-norms (e.g. the Euclidean norm) or f-divergences (e.g. the KL divergence), depending on the form of the policy.

However, this approach may not yield very good performance since the learning process is only based on a set of samples provided by the expert. In many cases these expert demonstrations will not be uniformly sampled across the entire state space, and therefore it is likely that the learned policy will perform poorly when not close to states found in ξ. This is particularly true when the expert demonstrations come from a trajectory of sequential states and actions, such that the distribution of the sampled states x in the dataset is defined by the expert policy. Then, when an estimated policy π̂* is used in practice it produces its own distribution of visited states, which will likely not be the same as in the expert demonstrations! This distributional mismatch leads to compounding errors, which is a major challenge in imitation learning.
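To make the supervised-learning view concrete, below is a minimal sketch of behavior cloning under the assumption of a linear policy class and a squared-error loss; the helper name fit_bc_policy and the data layout are illustrative, not from the text.

```python
import numpy as np

# Minimal behavior cloning sketch: fit a linear policy u = K^T x by least squares
# on the expert's state-control pairs. The linear policy class and squared-error
# loss are illustrative choices for the loss L in the optimization above.

def fit_bc_policy(demos):
    """demos: list of expert trajectories, each a list of (x, u) pairs."""
    X = np.vstack([x for traj in demos for (x, u) in traj])   # states, shape (N, n_x)
    U = np.vstack([u for traj in demos for (x, u) in traj])   # expert controls, shape (N, n_u)
    # Least-squares solution of min_K sum_i ||x_i K - u_i||^2.
    K, *_ = np.linalg.lstsq(X, U, rcond=None)
    return lambda x: x @ K                                    # learned policy pi_hat(x)
```

In practice the linear map would typically be replaced by a neural network trained with stochastic gradient descent, but the structure of the problem (regression from demonstrated states to demonstrated controls) is the same.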

10.3 DAgger: Dataset Aggregation

One straightforward idea for addressing the issue of distributional mismatch between the states seen under the expert policy and those seen under the learned policy is to simply collect new expert data as needed.[5] In other words, when the learned policy π̂* leads to states that aren't in the expert dataset, just query the expert for more information! The behavioral cloning algorithm that leverages this idea is known as DAgger (Dataset Aggregation).[6]

[5] Assuming the expert can be queried on demand.

[6] S. Ross, G. Gordon, and D. Bagnell. "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning". In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 627–635.

Algorithm 1: DAgger: Dataset Aggregation
    Data: π*
    Result: π̂*
    D ← ∅
    Initialize π̂
    for i = 1 to N do
        π_i = β_i π* + (1 − β_i) π̂
        Rollout policy π_i to sample trajectory τ = {x_0, x_1, ...}
        Query expert to generate dataset D_i = {(x_0, π*(x_0)), (x_1, π*(x_1)), ...}
        Aggregate datasets, D ← D ∪ D_i
        Retrain policy π̂ using aggregated dataset D
    return π̂

As can be seen in Algorithm 1, this approach iteratively improves the learned policy by collecting additional data from the expert. This is accomplished by rolling out the current learned policy for some number of time steps and then asking the expert what actions they would have taken at each step along that trajectory. Over time this process drives the learned policy to better approximate the true policy and reduces the incidence of distributional mismatch. One disadvantage of the approach is that the policy needs to be retrained at each step, which may be computationally inefficient.
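The following sketch mirrors the structure of Algorithm 1, assuming the expert can be queried programmatically. The helpers expert_policy, fit_policy (any supervised learner mapping a dataset of (x, u) pairs to a policy), and env_step are hypothetical user-supplied functions, and the schedule β_i = 0.5^i with per-step stochastic mixing is one common way to realize π_i = β_i π* + (1 − β_i) π̂, not the only one.

```python
import numpy as np

# Sketch of the DAgger loop: roll out a mixture of the expert and the learned
# policy, query the expert along the visited states, aggregate, and retrain.

def dagger(expert_policy, fit_policy, env_step, x0, N=10, T=50):
    D = [(x0, expert_policy(x0))]            # seed dataset
    pi_hat = fit_policy(D)                   # initialize pi_hat
    for i in range(N):
        beta = 0.5 ** i                      # mixing coefficient beta_i
        x = x0
        states = [x]
        for _ in range(T):                   # roll out the mixed policy pi_i
            u = expert_policy(x) if np.random.rand() < beta else pi_hat(x)
            x = env_step(x, u)
            states.append(x)
        D += [(x, expert_policy(x)) for x in states]   # query the expert along the rollout
        pi_hat = fit_policy(D)               # retrain on the aggregated dataset D
    return pi_hat
```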

10.4 Inverse Reinforcement Learning

Approaches that learn policies to imitate expert actions can be limited by several factors:

1. Behavior cloning provides no way to understand the underlying reasons for the expert behavior (no reasoning about outcomes or intentions).

2. The "expert" may actually be suboptimal.[7]

3. A policy that is optimal for the expert may not be optimal for the agent if they have different dynamics, morphologies, or capabilities.

[7] Although the discussion of inverse RL in this section will also assume the expert is optimal, there exist approaches to remove this assumption.

An alternative approach to behavioral cloning is to reason about, and try to learn a representation of, the underlying reward function R that the expert was using to generate its actions. By learning the expert's intent, the agent can potentially outperform the expert or adjust for differences in capabilities.[8] This approach (learning reward functions) is known as inverse reinforcement learning.

[8] Learned reward representations can potentially generalize across different robot platforms that tackle similar problems!

Inverse RL approaches assume a specific parameterization of the reward function, and in this section the fundamental concepts will be presented by parameterizing the reward as a linear combination of (nonlinear) features:

    R(x, u) = w^T φ(x, u),

where w ∈ R^n is a weight vector and φ(x, u) : X × U → R^n is a feature map. For a given feature map φ, the goal of inverse RL can be simplified to determining the weights w. Recall from the previous chapter on RL that the total (discounted) reward under a policy π is defined for a time horizon T as:

    V_T^π(x) = E[ Σ_{t=0}^{T−1} γ^t R(x_t, π(x_t)) | x_0 = x ].

Using the reward function R(x, u) = w^T φ(x, u), this value function can be expressed as:

    V_T^π(x) = w^T µ(π, x),    µ(π, x) = E_π[ Σ_{t=0}^{T−1} γ^t φ(x_t, π(x_t)) | x_0 = x ],

where µ(π, x) is defined by an expectation over the trajectories of the system under policy π (starting from state x) and is referred to as the feature expectation.[9] One insight that can now be leveraged is that, by definition, the optimal expert policy π* will always produce a greater value function:

    V_T^{π*}(x) ≥ V_T^π(x),    ∀x ∈ X, ∀π,

which can be expressed in terms of the feature expectation as:

    w*^T µ(π*, x) ≥ w*^T µ(π, x),    ∀x ∈ X, ∀π.    (10.4)

[9] Feature expectations are often computed using a Monte Carlo technique (e.g. using the set of demonstrations for the expert policy).
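As noted in sidenote [9], feature expectations are typically estimated by Monte Carlo. A minimal sketch is given below, assuming user-supplied policy, env_step, and feature map phi functions (hypothetical names); for the expert, the same discounted feature counts would instead be averaged over the demonstrations in Ξ.

```python
import numpy as np

# Monte Carlo estimate of the feature expectation mu(pi, x0): average the
# discounted feature counts over simulated rollouts of the policy. env_step may
# be stochastic; horizon, discount, and rollout count are illustrative values.

def feature_expectation(policy, env_step, phi, x0, gamma=0.95, T=100, n_rollouts=200):
    n_features = phi(x0, policy(x0)).shape[0]
    mu = np.zeros(n_features)
    for _ in range(n_rollouts):
        x = x0
        for t in range(T):
            u = policy(x)
            mu += (gamma ** t) * phi(x, u) / n_rollouts
            x = env_step(x, u)
    return mu   # estimate of mu(pi, x0) = E[ sum_t gamma^t phi(x_t, pi(x_t)) | x_0 = x0 ]
```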

Theoretically, identifying the vector w* associated with the expert policy can be accomplished by finding a vector w that satisfies this condition. However, this can potentially lead to ambiguities: for example, the choice w = 0 satisfies the condition trivially! In fact, reward ambiguity is one of the main challenges associated with inverse reinforcement learning.[10] The algorithms discussed in the following sections propose techniques for alleviating this issue.

[10] A. Ng and S. Russell. "Algorithms for Inverse Reinforcement Learning". In: Proceedings of the Seventeenth International Conference on Machine Learning. 2000, pp. 663–670.

10.4.1 Apprenticeship Learning


The apprenticeship learning algorithm[11] attempts to avoid some of the problems with reward ambiguity by leveraging an additional insight from condition (10.4). Specifically, the insight is that it doesn't matter how well w* is estimated as long as a policy π can be found that matches the feature expectations. Mathematically, this conclusion is derived by noting that:

    ||µ(π, x) − µ(π*, x)||_2 ≤ ε   ⟹   |w^T µ(π, x) − w^T µ(π*, x)| ≤ ε

for any w with ||w||_2 ≤ 1. In other words, as long as the feature expectations can be matched, the performance will be as good as the expert's even if the vector w does not match w*. Another practical aspect of the approach is that the initial state x_0 is assumed to be drawn from a distribution D, such that the value function is also considered in expectation:

    E_{x_0∼D}[ V_T^π(x_0) ] = w^T µ(π),    µ(π) = E_π[ Σ_{t=0}^{T−1} γ^t φ(x_t, π(x_t)) ].

This is useful to avoid having to consider all x ∈ X when matching features.[12]

[11] P. Abbeel and A. Ng. "Apprenticeship Learning via Inverse Reinforcement Learning". In: Proceedings of the Twenty-First International Conference on Machine Learning. 2004.

[12] Trying to find a policy that matches features for every possible starting state x is likely intractable or even infeasible.

To summarize, the goal of the apprenticeship learning approach is to find a policy π that matches the feature expectations with respect to the expert policy (i.e. makes µ(π) as similar as possible to µ(π*)).[13] This is accomplished through Algorithm 2, which uses an iterative approach to finding better policies.

[13] See Example 10.4.1 for an example of why matching features is intuitively useful.

Algorithm 2: Apprenticeship Learning
    Data: µ(π*), ε
    Result: π̂*
    Initialize policy π_0
    for i = 1, 2, ... do
        Compute µ(π_{i−1}) (or approximate it via Monte Carlo)
        Solve problem (10.5) with policies {π_0, ..., π_{i−1}} to compute w_i and t_i:

            (w_i, t_i) = arg max_{w,t} t,
                s.t. w^T µ(π*) ≥ w^T µ(π) + t,  ∀π ∈ {π_0, ..., π_{i−1}},
                     ||w||_2 ≤ 1.                                        (10.5)

        if t_i ≤ ε then
            π̂* ← best feature-matching policy from {π_0, ..., π_{i−1}}
            return π̂*
        Use RL to find an optimal policy π_i for the reward function defined by w_i

To better understand this algorithm it is useful to further examine the optimization problem (10.5).[14] Suppose that instead of making w a decision variable it was actually fixed; then the resulting optimization would be:

    t*(w) = max_t t,
        s.t. w^T µ(π*) ≥ w^T µ(π) + t,  ∀π ∈ {π_0, π_1, ...},

which essentially computes the smallest performance loss among the candidate policies {π_0, π_1, ...} with respect to the expert policy, assuming the reward function weights are w. If w were known, then t*(w) ≤ ε would guarantee that one of the candidate policies effectively performs as well as the expert.

[14] This problem can be thought of as an inverse RL problem that seeks the reward function vector w under which the expert maximally outperforms the other policies.

Since w is not known, the actual optimization problem (10.5) maximizes the smallest performance loss over all vectors w with ||w||_2 ≤ 1. Therefore, if t_i ≤ ε (i.e. the termination condition in Algorithm 2), then there must be a candidate policy whose performance loss is small for all possible choices of w! In other words, there is a candidate policy that matches feature expectations well enough that good performance can be guaranteed without assuming the reward function is known, and without attempting to estimate the reward accurately.
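Problem (10.5) is a small convex program (linear objective and constraints plus one norm-ball constraint), so each iteration of Algorithm 2 can be solved with an off-the-shelf solver. The sketch below assumes the cvxpy package is available; solve_margin_problem is an illustrative helper name.

```python
import cvxpy as cp

# Solve the max-margin problem (10.5) for one iteration of Algorithm 2.
# mu_expert is mu(pi*); mus is the list of feature expectations of the
# candidate policies {pi_0, ..., pi_{i-1}} (numpy arrays).

def solve_margin_problem(mu_expert, mus):
    n = mu_expert.shape[0]
    w = cp.Variable(n)
    t = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_expert >= w @ mu_j + t for mu_j in mus]
    problem = cp.Problem(cp.Maximize(t), constraints)
    problem.solve()
    return w.value, t.value   # (w_i, t_i); terminate Algorithm 2 when t_i <= eps
```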

Example 10.4.1 (Apprenticeship Learning vs. Behavioral Cloning). Consider a problem where the goal is to drive a car across a city in as short a time as possible. In the imitation learning formulation it is assumed that the reward function is not known, but that there is an expert who shows how to drive across the city (i.e. what routes to take). A behavioral cloning approach would simply try to mimic the actions taken by the expert, such as memorizing that whenever the agent is at a particular intersection it should turn right. Of course this approach is not robust at intersections that the expert never visited!

The apprenticeship learning approach tries to avoid the inefficiency of behavioral cloning by instead identifying features of the expert's trajectories that are more generalizable, and developing a policy that experiences the same feature expectations as the expert. For example, it could be more efficient to notice that the expert takes routes without stop signs, or routes with higher speed limits, and then try to find policies that also seek out those features!

10.4.2 Maximum Margin Planning


The maximum margin planning (MMP) approach[15] computes the reward function weights w through an optimization that is very similar to (10.5) but with some additional flexibility. In its most standard form the MMP optimization is:

    ŵ* = arg min_w ||w||_2^2,
        s.t. w^T µ(π*) ≥ w^T µ(π) + 1,  ∀π ∈ {π_0, π_1, ...}.

Again, this problem computes the reward function vector w such that the expert policy maximally outperforms the policies in the set {π_0, π_1, ...}.

[15] N. Ratliff, J. A. Bagnell, and M. Zinkevich. "Maximum Margin Planning". In: Proceedings of the 23rd International Conference on Machine Learning. 2006, pp. 729–736.

However, the formulation is also improved in two ways: it adds a slack term to account for potential expert suboptimality, and it adds a similarity function that gives more "margin" to policies that are dissimilar to the expert policy. This new formulation is:

    ŵ* = arg min_{w,v} ||w||_2^2 + C v,
        s.t. w^T µ(π*) ≥ w^T µ(π) + m(π*, π) − v,  ∀π ∈ {π_0, π_1, ...},    (10.6)

where v is a slack variable that can account for expert suboptimality, C > 0 is a hyperparameter that penalizes the amount of assumed suboptimality, and m(π*, π) is a function that quantifies how dissimilar two policies are.

One example of where this formulation is advantageous over the apprenticeship learning formulation (10.5) is when the expert is suboptimal. In this case it is possible that there is no w that makes the expert policy outperform all other policies, so that the optimization (10.5) returns w_i = 0 and t_i = 0 (which is obviously not the appropriate solution). Alternatively, the slack variable in the MMP formulation allows a reasonable w to be computed.
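For comparison with (10.5), the sketch below poses the MMP program (10.6) with the same tooling, again assuming cvxpy is available. Treating the slack v as nonnegative and supplying the margins m(π*, π_j) as precomputed numbers are assumptions made for the sketch.

```python
import cvxpy as cp

# The MMP quadratic program (10.6). m_values[j] holds the margin m(pi*, pi_j)
# for candidate policy pi_j; C is the suboptimality hyperparameter.

def solve_mmp(mu_expert, mus, m_values, C=1.0):
    n = mu_expert.shape[0]
    w = cp.Variable(n)
    v = cp.Variable(nonneg=True)          # slack (nonnegativity assumed here)
    constraints = [w @ mu_expert >= w @ mu_j + m_j - v
                   for mu_j, m_j in zip(mus, m_values)]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * v), constraints)
    problem.solve()
    return w.value
```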

10.4.3 Maximum Entropy Inverse Reinforcement Learning


While the apprenticeship learning approach shows that matching feature counts is a necessary and sufficient condition to ensure a policy performs as well as an expert, it also has some ambiguity (similar to the reward weight ambiguity problem discussed before). This ambiguity is associated with the fact that there could be different policies that lead to the same feature expectations!

This issue can also be thought of in a slightly more intuitive way in terms of distributions over trajectories. Specifically, a policy π induces a distribution over trajectories[16] τ = {(x_0, π(x_0)), (x_1, π(x_1)), ...}, denoted p_π(τ). The feature expectations can be rewritten in terms of this distribution as:

    µ(π) = E_π[f(τ)] = ∫ p_π(τ) f(τ) dτ,

where f(τ) = Σ_{t=0}^{T−1} γ^t φ(x_t, π(x_t)). Now suppose a policy π was found that matches feature expectations[17] with an expert policy π*, such that:

    ∫ p_π(τ) f(τ) dτ = ∫ p_{π*}(τ) f(τ) dτ.

[16] This distribution can be visualized as a set of paths generated by simulating the system many times with policy π (i.e. using a Monte Carlo method).

[17] For example by using apprenticeship learning.

Crucially, this condition is not sufficient to guarantee that p_π(τ) = p_{π*}(τ) (which would be ideal). In fact, the distribution p_π(τ) could also have an arbitrary preference for some paths that is unrelated to the feature matching objective.

The main idea in the maximum entropy inverse RL approach[18] is to not only match the feature expectations, but also to remove ambiguity in the path distribution p_π(τ) by making p_π(τ) as broadly uncommitted as possible. In other words, find a policy that matches feature expectations but otherwise has no additional path preferences. This concept is known as the maximum entropy principle.[19]

[18] B. D. Ziebart et al. "Maximum Entropy Inverse Reinforcement Learning". In: Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence. 2008, pp. 1433–1438.

[19] A maximum entropy distribution can be thought of as the least informative distribution of a class of distributions. This is useful in situations where it is undesirable to encode unintended prior information.

The maximum entropy IRL approach finds a minimally preferential, feature expectation matching distribution by solving the optimization problem:

    p*(τ) = arg max_p  −∫ p(τ) log p(τ) dτ,
        s.t. ∫ p(τ) f(τ) dτ = ∫ p_{π*}(τ) f(τ) dτ,
             ∫ p(τ) dτ = 1,                                             (10.7)
             p(τ) ≥ 0,  ∀τ,

where the objective is the mathematical definition of a distribution's entropy, the first constraint requires feature expectation matching, and the remaining constraints ensure that p(τ) is a valid probability distribution. It turns out that the solution to this problem has the exponential form:

    p*(τ, λ) = (1/Z(λ)) e^{λ^T f(τ)},    Z(λ) = ∫ e^{λ^T f(τ)} dτ,

where Z(λ) normalizes the distribution, and where λ must be chosen such that the feature expectations match:

    ∫ p*(τ, λ) f(τ) dτ = ∫ p_{π*}(τ) f(τ) dτ.

In other words, the maximum entropy IRL approach tries to find a distribution parameterized by λ that matches features, but also requires that the distribution p*(τ, λ) belong to the exponential family.

To determine the value of λ that matches features, it is assumed that the expert also selects trajectories with high reward with exponentially higher probability:

    p_{π*}(τ) ∝ e^{w*^T f(τ)},

and therefore ideally λ = w*. Of course w* (and more generally p_{π*}(τ)) is not known, and therefore a maximum likelihood estimation approach is used to compute a λ that best approximates w* based on the sampled expert demonstrations.[20]

[20] By assuming the expert policy is also exponential, the maximum likelihood estimate is theoretically consistent (i.e. λ → w* as the number of demonstrations approaches infinity).

In particular, an estimate ŵ* of the reward weights is computed from the expert demonstrations Ξ = {ξ_0, ξ_1, ...} (where each demonstration ξ_i is a trajectory) by solving the maximum likelihood problem:

    ŵ* = arg max_λ ∏_{ξ_i∈Ξ} p*(ξ_i, λ)
       = arg max_λ Σ_{ξ_i∈Ξ} ( λ^T f(ξ_i) − log Z(λ) ),

which can be solved using a gradient ascent algorithm where the gradient is computed as:

    ∇_λ J(λ) = Σ_{ξ_i∈Ξ} ( f(ξ_i) − E_{τ∼p*(τ,λ)}[f(τ)] ).
The first term of this gradient is easily computed since the expert demonstrations are known, and the second term can be approximated through Monte Carlo sampling. However, this Monte Carlo estimate requires sampling trajectories from the distribution p*(τ, λ), which leads to the following iterative algorithm:

1. Initialize λ and collect the set of expert demonstrations Ξ = {ξ_0, ξ_1, ...}.

2. Compute the optimal policy π_λ with respect to the reward function with w = λ.[21]

3. Using the policy π_λ, sample trajectories of the system and compute an approximation of E_{τ∼p*(τ,λ)}[f(τ)].

4. Perform a gradient step on λ to improve the maximum likelihood cost.

5. Repeat steps 2–4 until convergence (a sketch of this loop is given below).

[21] For example through traditional RL methods.
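A minimal sketch of this loop is given below, assuming helper routines solve_rl (step 2) and sample_feature_counts (step 3) are provided by the user; the learning rate, iteration count, and the use of the average log-likelihood gradient are illustrative choices.

```python
import numpy as np

# Sketch of the maximum entropy IRL loop. f_demos is a list of feature-count
# vectors f(xi_i) of the expert demonstrations; solve_rl(lam) returns a policy
# optimal for reward w = lam; sample_feature_counts(policy) is a Monte Carlo
# estimate of E_{tau ~ p*(tau, lam)}[f(tau)].

def maxent_irl(f_demos, solve_rl, sample_feature_counts, n_features,
               lr=0.01, n_iters=100):
    lam = np.zeros(n_features)
    for _ in range(n_iters):
        policy = solve_rl(lam)                          # step 2: RL under reward w = lambda
        f_model = sample_feature_counts(policy)         # step 3: model feature counts
        grad = np.mean(f_demos, axis=0) - f_model       # average log-likelihood gradient
        lam += lr * grad                                # step 4: gradient step on lambda
    return lam                                          # estimate of the reward weights w*
```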


To summarize, the maximum entropy inverse reinforcement learning approach identifies a distribution over trajectories that matches feature expectations with the expert and, by restricting the distribution to the exponential family, ensures that spurious preferences (path preferences not motivated by feature matching) are not introduced. Additionally, this distribution over trajectories is parameterized by a value λ that is an estimate of the reward function weights.

10.5 Learning From Comparisons and Physical Feedback

Both behavioral cloning and inverse reinforcement learning approaches rely on expert demonstrations of behavior. However, in some practical scenarios it may actually be difficult for the expert to provide complete, high-quality demonstrations. For example, it has been shown[22] that when humans are asked to demonstrate good driving behavior in simulation, they retroactively think their behavior was too aggressive! As another example, if a robot has a high-dimensional control or state space, it may be difficult for the expert to specify the full high-dimensional behavior. Therefore another interesting question in imitation learning is how to learn from alternative data sources besides complete demonstrations.

[22] C. Basu et al. "Do You Want Your Autonomous Car to Drive Like You?" In: 12th ACM/IEEE International Conference on Human-Robot Interaction. 2017, pp. 417–425.

10.5.1 Learning from Comparisons


One alternative approach is to use pairwise comparisons,[23] where an expert is shown two different behaviors and asked to rank which behavior is better. Through repeated queries it is possible to converge to an understanding of the underlying reward function. For example, suppose two trajectories τ_A and τ_B are shown to an expert and that trajectory τ_A is preferred. Then, assuming that the reward function is:

    R(τ) = w^T f(τ),

where f(τ) are the collective feature counts (the same as in Section 10.4), this comparison can be used to conclude that:

    w^T f(τ_A) > w^T f(τ_B).

In other words, this comparison has split the space of possible reward weights w in half through the hyperplane:

    (f(τ_A) − f(τ_B))^T w = 0.

By continuously querying the expert with new comparisons,[24] the space of possible reward weights w will continue to shrink until a good estimate of w* can be made. In practice the expert's decisions may be a little noisy, and therefore the hyperplanes don't define hard cutoffs but rather can be used to "weight" the possible reward vectors w.

[23] D. Sadigh et al. "Active Preference-Based Learning of Reward Functions". In: Robotics: Science and Systems. 2017.

[24] The comparisons shown can be selectively chosen to maximally split the remaining space of potential w, in order to minimize the total number of expert queries that are required.
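One simple way to realize this "weighting" of reward vectors is to maintain samples of candidate w and reweight them after each comparison. The sketch below assumes a logistic noise model for the expert's answers, which is an illustrative choice rather than a method prescribed in the text.

```python
import numpy as np

# Preference-based reward learning sketch: candidate weight vectors are
# reweighted by a soft (logistic) likelihood of each observed comparison,
# reflecting that expert answers may be noisy.

def update_weight_samples(W, probs, f_A, f_B, preferred_A=True, beta=5.0):
    """W: (n_samples, n_features) candidate w's; probs: their current probabilities."""
    margin = W @ (f_A - f_B)                       # positive if a candidate w prefers tau_A
    if not preferred_A:
        margin = -margin
    likelihood = 1.0 / (1.0 + np.exp(-beta * margin))
    probs = probs * likelihood
    return probs / probs.sum()                     # renormalized posterior over w
```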

10.5.2 Learning from Physical Feedback


Another alternative to learning from complete expert demonstrations is to simply allow the expert to physically interact with the robot to correct undesirable behavior.[25] In this approach, a physical interaction (i.e. a correction) is assumed to occur when the robot takes actions that result in a lower reward than the expert's action.

[25] A. Bajcsy et al. "Learning Robot Objectives from Physical Human Interaction". In: Proceedings of the 1st Annual Conference on Robot Learning. 2017, pp. 217–226.

For a reward function of the form R(x, u) = w^T φ(x, u), the robot maintains an estimate ŵ* of the reward weights and the expert is assumed to act according to a true set of optimal weights w*. Suppose the robot's policy, which is based on the estimated reward function with weights ŵ*, yields a trajectory τ_R. Then, if the expert physically interacts with the robot to make a correction, the resulting actual trajectory τ_H is assumed to satisfy:

    w*^T f(τ_H) ≥ w*^T f(τ_R),

which simply states that the reward of the new trajectory is higher. This insight is then leveraged in a maximum a posteriori approach for updating the estimate ŵ* after each interaction. Specifically, this update takes the form:

    ŵ* ← ŵ* + β ( f(τ_H) − f(τ_R) ),

where β > 0 is a scalar step size. The robot then uses the new estimate to change its policy, and the process iterates. Note that this idea yields an approach that is similar to the concept of matching feature expectations from inverse reinforcement learning, except that the approach is iterative rather than requiring a batch of complete expert demonstrations.
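The resulting online loop can be sketched in a few lines, assuming hypothetical helpers plan_trajectory (planning under the current estimate ŵ*) and feature_counts (the map f):

```python
import numpy as np

# Sketch of the correction loop: plan with the current weight estimate, observe
# the human's physical correction, and update the estimate as described above.

def physical_feedback_update(w_hat, plan_trajectory, feature_counts,
                             observed_correction, beta=0.1):
    tau_R = plan_trajectory(w_hat)                    # robot's plan under w_hat
    tau_H = observed_correction                       # trajectory after the human's correction
    w_hat = w_hat + beta * (np.asarray(feature_counts(tau_H))
                            - np.asarray(feature_counts(tau_R)))
    return w_hat                                      # replan with the updated estimate
```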

10.6 Interaction-aware Control and Intent Inference

Yet another interesting problem in robot autonomy arises when robots and humans interact to accomplish shared or individual goals. Many classical examples of this problem arise in autonomous driving settings, when human-driven vehicles interact with autonomous vehicles in scenarios such as highway merging or at intersections. While the imitation learning problems from the previous sections focus on understanding the expert's behavior for the purpose of imitating it, in this setting the human's behavior needs to be understood in order to ensure safe interactions. However, there is an additional component to understanding interactions: the robot's behavior can influence the human's behavior.[26]

[26] It is particularly important in interaction-aware robot control to understand the effects of the robot's actions on the human's behavior. Otherwise the humans could simply be modeled as dynamic obstacles!

10.6.1 Interaction-aware Control with Known Human Model

One common approach is to model the interaction between humans and robots as a dynamical system with a combined state x, where the robot controls are denoted u_R and the human decisions or inputs are denoted u_H. The transition model is therefore defined as:

    p(x_t | x_{t−1}, u_{R,t−1}, u_{H,t−1}).

In other words, the interaction dynamics evolve according to the actions taken by both the robot and the human. In this interaction the robot's reward function is denoted R_R(x, u_R, u_H) and the human's reward function is denoted R_H(x, u_R, u_H), both of which are functions of the combined state and both agents' actions.[27]

[27] While R_R and R_H do not have to be the same, choosing R_R = R_H may be desirable for the robot to achieve human-like behavior.

It is assumed that both the robot and the human act optimally[28] with respect to their reward functions:

    u_R*(x) = arg max_{u_R} R_R(x, u_R, u_H*(x)),
    u_H*(x) = arg max_{u_H} R_H(x, u_R*(x), u_H).

[28] While not necessarily true, this assumption is important to make the resulting problem formulation tractable to solve in practice.

Even assuming both reward functions R_R and R_H are known,[29] computing u_R* is still extremely challenging due to the two-player game dynamics of the decision making process. However, this problem can be made more tractable by modeling it as a Stackelberg game, which restricts the two-player game to a leader-follower structure. Under this assumption the robot is the "leader" and, as the follower, the human acts according to:

    u_H*(x, u_R) = arg max_{u_H} R_H(x, u_R, u_H).    (10.8)

[29] The reward function R_H could be approximated using inverse reinforcement learning techniques.

In other words, the human is assumed to see the action taken by the robot before deciding on their own action. The robot policy can therefore be computed by solving:

    u_R*(x) = arg max_{u_R} R_R(x, u_R, u_H*(x, u_R)),    (10.9)
which can be solved using a gradient ascent approach. For this approach the gradient of:

    J(x, u_R) = R_R(x, u_R, u_H*(x, u_R))

can be computed using the chain rule as:

    ∂J/∂u_R = ∂R_R/∂u_R + (∂R_R/∂u_H*) (∂u_H*/∂u_R).

Since the reward function R_R is known, the terms ∂R_R/∂u_R and ∂R_R/∂u_H* can be easily determined. In order to compute the term ∂u_H*/∂u_R, which represents how much the robot's actions impact the human's actions, an additional step is required. First, assuming the human acts optimally according to (10.8), the necessary optimality condition is:

    g(x, u_R, u_H*) = 0,    g = ∂R_H/∂u_H,

which, for fixed values of x and u_R, specifies u_H*. Then, implicitly differentiating this condition with respect to the robot action u_R gives:

    ∂g/∂u_R + (∂g/∂u_H*) (∂u_H*/∂u_R) = 0,

which can be used to solve for:

    ∂u_H*/∂u_R (x, u_R, u_H*) = −(∂g/∂u_H*)^{−1} (∂g/∂u_R).

Notice that every term in this expression can be computed,[30] and therefore it can be substituted into the gradient calculation:

    ∂J/∂u_R = ∂R_R/∂u_R − (∂R_R/∂u_H*) (∂g/∂u_H*)^{−1} (∂g/∂u_R),

which can then be evaluated as long as it is possible to compute u_H*(x, u_R).

[30] Assuming the human's reward function is known.
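The gradient above can be assembled numerically with automatic differentiation. The sketch below uses JAX with simple quadratic reward functions chosen purely for illustration (they are assumptions, not models from the text); best_human_response approximates u_H*(x, u_R) by gradient ascent on R_H, and dJ_du_R applies the implicit function theorem exactly as derived above.

```python
import jax
import jax.numpy as jnp

# Hypothetical quadratic rewards for illustration only.
def R_H(x, u_R, u_H):
    # Human prefers a nominal action of 1.0 while staying close to the robot's action.
    return -jnp.sum((u_H - 1.0) ** 2) - jnp.sum((u_H - u_R) ** 2)

def R_R(x, u_R, u_H):
    # Robot trades off a state-dependent objective against control effort.
    return -jnp.sum((x + u_R + u_H) ** 2) - 0.1 * jnp.sum(u_R ** 2)

def best_human_response(x, u_R, u_H_init, steps=200, lr=0.1):
    # Inner maximization of R_H over u_H, approximating u_H*(x, u_R).
    grad_uH = jax.grad(R_H, argnums=2)
    u_H = u_H_init
    for _ in range(steps):
        u_H = u_H + lr * grad_uH(x, u_R, u_H)
    return u_H

def dJ_du_R(x, u_R, u_H_star):
    # g(x, u_R, u_H) = dR_H/du_H; its Jacobians give du_H*/du_R via the implicit function theorem.
    g = jax.grad(R_H, argnums=2)
    dg_duH = jax.jacfwd(g, argnums=2)(x, u_R, u_H_star)
    dg_duR = jax.jacfwd(g, argnums=1)(x, u_R, u_H_star)
    duH_duR = -jnp.linalg.solve(dg_duH, dg_duR)
    dRR_duR = jax.grad(R_R, argnums=1)(x, u_R, u_H_star)
    dRR_duH = jax.grad(R_R, argnums=2)(x, u_R, u_H_star)
    return dRR_duR + duH_duR.T @ dRR_duH

# Usage: one gradient ascent step on the robot's objective J(x, u_R).
x = jnp.array([0.5]); u_R = jnp.array([0.0]); u_H0 = jnp.array([0.0])
u_H_star = best_human_response(x, u_R, u_H0)
u_R_new = u_R + 0.05 * dJ_du_R(x, u_R, u_H_star)
```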


To summarize, one approach to interaction-aware control is to model the interaction as a Stackelberg game, where it is assumed that both the human and the robot act optimally with respect to some reward functions. This formulation of the problem enables the robot to choose actions based on an implicit understanding of how the human will react.

10.6.2 Intent Inference


One disadvantage of the approach to interaction-aware control from the previous section is that it assumes the human acts optimally with respect to a known reward function. While a reward function could be learned through inverse reinforcement learning, this is not practical in real-world settings where different humans behave differently. Returning to the example of interaction between human drivers and autonomous vehicles, a human could exhibit drastically different behavior depending on whether they have an aggressive or passive driving style. In these settings the problem of intent inference focuses on identifying underlying behavioral characteristics that can lead to more accurate behavioral models.[31]

[31] This problem can be formulated as a partially observable Markov decision process (POMDP), since the underlying behavioral characteristic is not directly observable yet influences the system's behavior.

One approach to intent inference[32] is to model the underlying behavioral differences through a set of unknown parameters θ, which need to be inferred by observing the human's behavior. Mathematically, this is expressed by defining the human's reward function R_H(x, u_R, u_H, θ) to be a function of θ, and assuming the human chooses actions according to:

    p(u_H | x, u_R, θ) ∝ e^{R_H(x, u_R, u_H, θ)}.

In other words, this model assumes the human is exponentially more likely to pick high-reward actions,[33] but that they may pick suboptimal actions as well.

[32] D. Sadigh et al. "Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state". In: Autonomous Robots 42.7 (2018), pp. 1405–1426.

[33] This assumption was also used in the maximum entropy IRL approach.

The objective of intent inference is therefore to estimate the parameters θ, which can be accomplished through Bayesian inference methods. In the Bayesian approach a probability distribution over the parameters θ is updated based on observations. Specifically, the belief distribution is denoted b(θ), and given an observation of the human's action u_H the belief distribution is updated as:

    b_{t+1}(θ) = (1/η) p(u_{H,t} | x_t, u_{R,t}, θ) b_t(θ),

where η is a normalizing constant. This Bayesian update simply takes the prior belief over θ and updates the distribution based on the likelihood of observing the human action u_H under that prior. Note that this concept is quite similar to the concepts of inverse reinforcement learning: a set of parameters that describe the human's (the expert's) behavior is continually updated as new observations of their actions are gathered.
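For a finite set of candidate parameters θ and a finite candidate action set for the human, the belief update can be sketched directly; the reward function R_H, the grid thetas, and the action set U_H below are assumed inputs, and the observed action is passed as its index in U_H so the Boltzmann observation model can be normalized.

```python
import numpy as np

# Sketch of the Bayesian intent-inference update over a discrete grid of
# candidate parameters theta, using the Boltzmann observation model above.

def belief_update(belief, thetas, x, u_R, u_H_index, R_H, U_H):
    new_belief = np.zeros_like(belief)
    for k, theta in enumerate(thetas):
        # p(u_H | x, u_R, theta) proportional to exp(R_H), normalized over U_H
        rewards = np.array([R_H(x, u_R, u_H, theta) for u_H in U_H])
        probs = np.exp(rewards - rewards.max())
        probs /= probs.sum()
        new_belief[k] = probs[u_H_index] * belief[k]   # likelihood times prior
    return new_belief / new_belief.sum()               # normalized posterior b_{t+1}(theta)
```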
While the robot could sit around and passively observe the human act in order to collect samples for the Bayesian updates, it is often more efficient for the robot to probe the human into taking interesting actions that are more useful for revealing the intent parameters θ. This can be accomplished by choosing the robot's reward function to be:

    R_R(x, u_R, u_H, θ) = I(b(θ), u_R) + λ R_goal(x, u_R, u_H, θ),

where λ > 0 is a tuning parameter and I(b(θ), u_R) is a function that quantifies the amount of information gained with respect to the belief distribution by taking action u_R. In other words, the robot's reward trades off exploiting the current knowledge of θ to accomplish the objective against taking exploratory actions to improve the intent inference. With this reward function the robot's actions are chosen to maximize the expected reward:

    u_R*(x) = arg max_{u_R} E_θ[ R_R(x, u_R, u_H, θ) ].

To summarize, this robot policy tries to simultaneously accomplish the robot's objective and gather more information to improve the inference of the human's intent (modeled through the parameters θ). In a highway lane-changing scenario this type of policy might lead the robot to nudge into the other lane to see whether the other car slows down (passive driving behavior) or tries to block the lane change (aggressive driving behavior). Once the robot has a strong enough belief about the human's behavior, it may choose to either complete the lane change or slow down and merge behind the human driver.
