
Interactive Learning from Policy-Dependent Human Feedback

James MacGlashan 1 Mark K Ho 2 Robert Loftin 3 Bei Peng 4 Guan Wang 2 David L. Roberts 3
Matthew E. Taylor 4 Michael L. Littman 2

* Equal contribution. 1 Cogitai, 2 Brown University, 3 North Carolina State University, 4 Washington State University. Correspondence to: James MacGlashan <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s). arXiv:1701.06049v2 [cs.AI] 28 Jan 2023.

Abstract

This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy. We present empirical results that show this assumption to be false—whether human trainers give a positive or negative feedback for a decision is influenced by the learner's current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.

1. Introduction

Programming robots is very difficult, in part because the real world is inherently rich and—to some degree—unpredictable. In addition, our expectations for physical agents are quite high and often difficult to articulate. Nevertheless, for robots to have a significant impact on the lives of individuals, even non-programmers need to be able to specify and customize behavior. Because of these complexities, relying on end-users to provide instructions to robots programmatically seems destined to fail.

Reinforcement learning (RL) from human trainer feedback provides a compelling alternative to programming because agents can learn complex behavior from very simple positive and negative signals. Furthermore, real-world animal training is an existence proof that people can train complex behavior using these simple signals. Indeed, animals have been successfully trained to guide the blind, locate mines in the ocean, detect cancer or explosives, and even solve complex, multi-stage puzzles.

Despite success when learning from environmental reward, traditional reinforcement-learning algorithms have yielded limited success when the reward signal is provided by humans. This failure underscores the importance of basing algorithms that learn from humans on appropriate models of human feedback. Indeed, much human-centered RL work has investigated and employed different models of human feedback (Knox & Stone, 2009b; Thomaz & Breazeal, 2006; 2007; 2008; Griffith et al., 2013; Loftin et al., 2015). Many of these algorithms leverage the observation that people tend to give feedback that is best interpreted as guidance on the policy the agent should be following, rather than as a numeric value to be maximized by the agent. However, these approaches assume models of feedback that are independent of the policy the agent is currently following. We present empirical results that demonstrate that this assumption is incorrect and further demonstrate cases in which policy-independent learning algorithms suffer from this assumption. Following this result, we present Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent human feedback. COACH is based on the insight that the advantage function (a value roughly corresponding to how much better or worse an action is compared to the current policy) provides a better model of human feedback, capturing human-feedback properties like diminishing returns, rewarding improvement, and giving 0-valued feedback a semantic meaning that combats forgetting. We compare COACH to other approaches in a simple domain with simulated feedback. Then, to validate that COACH scales to complex problems, we train five different behaviors on a TurtleBot robot.
2. Background

For modeling the underlying decision-making problem of an agent being taught by a human, we adopt the Markov Decision Process (MDP) formalism. An MDP is a 5-tuple ⟨S, A, T, R, γ⟩, where S is the set of possible states of the environment; A is the set of actions available to the agent; T(s′|s, a) is the transition function, which defines the probability of the environment transitioning to state s′ when the agent takes action a in environment state s; R(s, a, s′) is the reward function specifying the numeric reward the agent receives for taking action a in state s and transitioning to state s′; and γ ∈ [0, 1] is a discount factor specifying how much immediate rewards are preferred to more distant rewards.

A stochastic policy π for an MDP is a per-state action probability distribution that defines an agent's behavior; π : S × A → [0, 1], where Σ_{a∈A} π(s, a) = 1 for all s ∈ S. In the MDP setting, the goal is to find the optimal policy π*, which maximizes the expected future discounted reward when the agent selects actions in each state according to π*; that is, π* = argmax_π E[Σ_{t=0}^∞ γ^t r_t | π], where r_t is the reward received at time t. Two important concepts in MDPs are the value function (V^π) and action–value function (Q^π). The value function defines the expected future discounted reward from each state when following some policy, and the action–value function defines the expected future discounted reward when an agent takes some action in some state and then follows some policy π thereafter. These functions can be recursively defined via the Bellman equation: V^π(s) = Σ_a π(s, a) Q^π(s, a) and Q^π(s, a) = Σ_{s′} T(s′|s, a) [R(s, a, s′) + γ V^π(s′)]. For shorthand, the value functions for the optimal policies are usually denoted V* and Q*.

In reinforcement learning (RL), an agent interacts with an environment modeled as an MDP, but does not have direct access to the transition function or reward function and instead must learn a policy from environment observations. A common class of RL algorithms are actor-critic algorithms. Bhatnagar et al. (2009) provide a general template for these algorithms. Actor-critic algorithms are named for the two main components of the algorithms: the actor is a parameterized policy that dictates how the agent selects actions; the critic estimates the value function for the actor and provides critiques at each time step that are used to update the policy parameters. Typically, the critique is the temporal difference (TD) error, δ_t = r_t + γV(s_t) − V(s_{t−1}), which describes how much better or worse a transition went than expected.
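To make the Bellman definitions concrete, the following minimal tabular policy-evaluation sketch (an illustration added here, not code from the paper; the MDP is assumed to be given as dense NumPy arrays) iterates the two equations above until V^π and Q^π stop changing.

import numpy as np

def evaluate_policy(T, R, pi, gamma=0.99, tol=1e-8):
    """Estimate V^pi and Q^pi for a tabular MDP.

    T[s, a, s2]: transition probabilities, R[s, a, s2]: rewards,
    pi[s, a]: stochastic policy (each row sums to 1).
    """
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q^pi(s, a) = sum_s' T(s'|s, a) [R(s, a, s') + gamma V^pi(s')]
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        # V^pi(s) = sum_a pi(s, a) Q^pi(s, a)
        V_new = np.einsum("sa,sa->s", pi, Q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new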
3. Human-centered Reinforcement Learning

In this work, a human-centered reinforcement-learning (HCRL) problem is a learning problem in which an agent is situated in an environment described by an MDP, but in which rewards are generated by a human trainer instead of from a stationary MDP reward function that the agent is meant to maximize. The trainer has a target policy π* they are trying to teach the agent. The trainer communicates this policy by giving numeric feedback as the agent acts in the environment. The goal of the agent is to learn the target policy π* from the feedback.

To define a learning algorithm for this problem, we first characterize how human trainers typically use numeric feedback to teach target policies. If feedback is stationary and intended to be maximized, it can be treated as a reward function and standard RL algorithms used. Although this approach has had some success (Pilarski et al., 2011; Isbell et al., 2001), there are complications that limit its applicability. In particular, a trainer must take care that the feedback they give contains no unanticipated exploits, constraining the feedback strategies they can use. Indeed, prior research has shown that interpreting human feedback like a reward function often induces positive reward cycles that lead to unintended behaviors (Knox, 2012; Ho et al., 2015).

The issues with interpreting feedback as reward have led to the insight that human feedback is better interpreted as commentary on the agent's behavior; for example, positive feedback roughly corresponds to "that was good" and negative feedback roughly corresponds to "that was bad." In the next section, we review existing HCRL approaches that build on this insight.

4. Related Work

A number of existing approaches to HCRL, and to RL that includes human feedback, have been explored in the past. The most similar to ours, and a primary inspiration for this work, is the TAMER framework (Knox, 2012). In TAMER, trainers provide interactive numeric feedback as the learner takes actions. The learner attempts to estimate a target reward function by interpreting trainer feedback as exemplars of this function. When the agent makes rapid decisions, TAMER divides the feedback among the recent state–action pairs according to a probability distribution. TAMER makes decisions by myopically choosing the action with the highest reward estimate. Because the agent myopically maximizes reward, the feedback can also be thought of as exemplars of Q*. Later work also investigated non-myopically maximizing the learned reward function with a planning algorithm (Knox & Stone, 2013), but this approach requires a model of the environment and special treatment of termination conditions.

Two other closely related approaches are SABL (Loftin et al., 2015) and Policy Shaping (Griffith et al., 2013). Both of these approaches treat feedback as discrete probabilistic evidence of the trainer's target parameterized policy. SABL's probabilistic model additionally includes (learnable) parameters for describing how often a trainer is expected to give explicit positive or negative feedback.
There have also been some domains in which treating human feedback as reward signals to maximize has had some success, such as in shaping the control for a prosthetic arm (Pilarski et al., 2011) and learning how to interact in an online chat room from multiple users' feedback (Isbell et al., 2001). Some complications with how people give feedback have been reported, however.

Some research has also examined combining human feedback with more traditional environmental rewards (Knox & Stone, 2010; Tenorio-Gonzalez et al., 2010; Clouse & Utgoff, 1992; Maclin et al., 2005). A challenge in this context in practice is that rewards do not naturally come from the environment and must be programmatically defined. However, it is appealing because the agent can learn in the absence of an active trainer. We believe our approach to HCRL could also straightforwardly incorporate learning from environmental reward as well, but we leave this investigation for future work.

Finally, a related research area is learning from demonstration (LfD), in which a human provides examples of the desired behavior. There are a number of different approaches to solving this problem, surveyed by Argall et al. (2009). We see these approaches as complementary to HCRL because it is not always possible, or convenient, to provide demonstrations. LfD approaches that learn a parameterized policy could also operate with COACH, allowing an agent's policy to be seeded by demonstrations and then fine-tuned with interactive feedback.

Note that the policy-dependent feedback we study here is viewed as essential in behavior analysis reinforcement schedules (Miltenberger, 2011). Trainers are taught to provide diminishing returns (gradual decreases in positive feedback for good actions as the agent adopts those actions), differential feedback (varied magnitude of feedbacks depending on the degree of improvement or deterioration in behavior), and policy shaping (positive feedback for suboptimal actions that improve behavior and then negative feedback after the improvement has been made), all of which are policy dependent.

5. Policy-dependent Feedback

A common assumption of existing HCRL algorithms is that feedback depends only on the quality of an agent's action selection. An alternative hypothesis is that feedback also depends on the agent's current policy. That is, an action selection may be more greatly rewarded or punished depending on how often the agent would typically be inclined to select it; for example, a trainer might more greatly reward the agent for improving its performance than for maintaining the status quo. We call the former model of feedback policy-independent and the latter policy-dependent. If people are more naturally inclined toward one model of feedback, algorithms based on the wrong assumption may result in unexpected responses to feedback. Consequently, we were interested in investigating which model better fits human feedback.

Despite existing HCRL algorithms assuming policy-independent feedback, evidence of policy-dependent feedback can be found in prior works with these algorithms. For example, it was often observed that trainers taper their feedback over the course of learning (Ho et al., 2015; Knox et al., 2012; Isbell et al., 2001). Although diminishing feedback is a property that is explained by people's feedback being policy-dependent—as the learner's performance improves, trainer feedback is decreased—an alternative explanation is simply trainer fatigue. To further make the case for human feedback being policy dependent, we provide a stronger result showing that trainers—for the same state–action pair—choose positive or negative feedback depending on their perception of the learner's behavior.

5.1. Empirical Results

We had Amazon Mechanical Turk (AMT) participants teach an agent in a simple sequential task, illustrated in Figure 1. Participants were instructed to train a virtual dog to walk to the yellow goal location in a grid world as fast as possible but without going through the green cells. They were additionally told that, as a result of prior training, their dog was already either "bad," "alright," or "good" at the task and were shown examples of each behavior before training. In all cases, the dog would start in the location shown in Figure 1. "Bad" dogs walked straight through the green cells to the yellow cell. "Alright" dogs first moved left, then up, and then to the goal, avoiding green but not taking the shortest route. "Good" dogs took the shortest path to yellow without going through green.

Figure 1. The training interface shown to AMT users.

During training, participants saw the dog take an action from one tile to another and then gave feedback after every action using a continuous labeled slider as shown. The slider always started in the middle of the scale on each trial, and several points were labeled with different levels
of reward (praise and treats) and punishment (scolding and a mild electric shock). Participants went through a brief tutorial using this interface. Responses were coded as a numeric value from −50 to 50, with "Do Nothing" as the zero-point.

During the training phase, participants trained a dog for three episodes that all started in the same position and ended at the goal. The dog's behavior was pre-programmed in such a way that the first step of the final episode would reveal if feedback was policy dependent. Each user was placed into one of three different conditions: improving, steady, or degrading. For all three conditions, the dog's behavior in the final episode was "alright," regardless of any prior feedback. The conditions differed in terms of the behavior users observed in the first two episodes. In the first two episodes, users observed bad behavior in the improving condition (improving to alright); alright behavior in the steady condition; and good behavior in the degrading condition. If feedback is policy-dependent, we would expect more positive feedback in the final episode for the improving condition, but not for policy-independent feedback, since the final behavior was the same for all conditions.

Figure 2 shows boxplots and individual responses for the first step of the final episode under each of the three conditions. These results indicate that the sign of feedback is sensitive to the learner's policy, as predicted. The mean and median feedback under the improving condition are slightly positive (Mean = 9.8, Median = 24, S.D. = 22.2; planned Wilcoxon one-sided signed-rank test: Z = 1.71, p < 0.05), whereas it is negative for the steady condition (Mean = −18.3, Median = −23.5, S.D. = 24.6; planned Wilcoxon two-sided signed-rank test: Z = −3.15, p < 0.01) and degrading condition (Mean = −10.8, Median = −18.0, S.D. = 20.7; planned Wilcoxon one-sided signed-rank test: Z = −2.33, p < 0.05). There was a main effect across the three conditions (p < 0.01, Kruskal-Wallis test), and pairwise comparisons indicated that only the improving condition differed from the steady and degrading conditions (p < 0.01 for both, Bonferroni-corrected Mann-Whitney pairwise tests).

Figure 2. The feedback distribution for the first step of the final episode for each condition. Feedback tended to be positive for improving behavior, but negative otherwise.

6. Convergent Actor-Critic by Humans

In this section, we introduce Convergent Actor-Critic by Humans (COACH), an actor-critic-based algorithm capable of learning from policy-dependent feedback. COACH is based on the insight that the advantage function is a good model of human feedback and that actor–critic algorithms update a policy using the critic's TD error, which is an unbiased estimate of the advantage function. Consequently, an agent's policy can be directly modified by human feedback without a critic component. We first define the advantage function and its interpretation as trainer feedback. Then, we present the general update rule for COACH and its convergence. Finally, we present Real-time COACH, which includes mechanisms for providing variable magnitude feedback and learning in problems with a high-frequency decision cycle.

6.1. The Advantage Function and Feedback

The advantage function (Baird, 1995) A^π is defined as

A^π(s, a) = Q^π(s, a) − V^π(s).    (1)

Roughly speaking, the advantage function describes how much better or worse an action selection is compared to the agent's performance under policy π. The function is closely related to the update used in policy iteration (Puterman, 1994): defining π′(s) = argmax_a A^π(s, a) is guaranteed to produce an improvement over π whenever π is suboptimal. It can also be used in policy gradient methods to gradually improve the performance of a policy, as described later.

It is worth noting that feedback produced by the advantage function is consistent with that recommended in behavior analysis. It trivially results in differential feedback since it is defined as the magnitude of improvement of an action over the current policy. It induces diminishing returns because, as π improves, opportunities to improve on it decrease. Indeed, once π is optimal, all advantage-function-based feedback is zero or negative. Finally, advantage-function feedback induces policy shaping in that whether feedback is positive or negative for an action depends on whether it is a net improvement over the current behavior.
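Continuing the policy-evaluation sketch from Section 2 (again our illustration, not the paper's code), Equation 1 and its policy-iteration connection are direct to write down; once the greedy step no longer changes the policy, every advantage is zero or negative, which is exactly the diminishing-returns property described above.

import numpy as np

def advantage(Q, V):
    # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
    return Q - V[:, None]

def greedy_improvement(Q, V):
    """One policy-iteration step: act greedily with respect to A^pi."""
    A = advantage(Q, V)
    pi_new = np.zeros_like(A)
    pi_new[np.arange(A.shape[0]), A.argmax(axis=1)] = 1.0
    return pi_new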
6.2. Convergence and Update Rule
Given a performance metric ρ, Sutton et al. (1999) derive a policy gradient algorithm of the form ∆θ = α ∇_θ ρ. Here, θ represents the parameters that control the agent's behavior and α is a learning rate. Under the assumption that ρ is the discounted expected reward from a fixed start-state distribution, they show that

∇_θ ρ = Σ_s d^π(s) Σ_a ∇_θ π(s, a) Q^π(s, a),

where d^π(s) is the component of the (discounted) stationary distribution at s. A benefit of this form of the gradient is that, given that states are visited according to d^π(s) and actions are taken according to π(s, a), the update at time t can be made as

∆θ_t = α_t ∇_θ π(s_t, a_t) f_{t+1} / π(s_t, a_t),    (2)

where E[f_{t+1}] = Q^π(s_t, a_t) − v(s) for any action-independent function v(s).

In the context of the present paper, f_{t+1} represents the feedback provided by the trainer. It follows trivially that if the trainer chooses the policy-dependent feedback f_t = Q^π(s_t, a_t), we obtain a convergent learning algorithm that (locally) maximizes discounted expected reward. In addition, feedback of the form f_t = Q^π(s_t, a_t) − V^π(s_t) = A^π(s_t, a_t) also results in convergence. Note that for the trainer to provide feedback in the form of Q^π or A^π, they would need to "peer inside" the learner and observe its policy. In practice, the trainer estimates π by observing the agent's actions.
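To spell out why any action-independent baseline v(s) leaves the expected update unbiased (a short derivation added here for clarity; it follows from the cited policy-gradient result and is not text from the paper), take the expectation of Equation 2 over d^π and π:

\[
\mathbb{E}[\Delta\theta_t]
  = \alpha \sum_s d^\pi(s) \sum_a \pi(s,a)\,
    \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)} \bigl(Q^\pi(s,a) - v(s)\bigr)
  = \alpha \nabla_\theta \rho
    \;-\; \alpha \sum_s d^\pi(s)\, v(s)\, \nabla_\theta \sum_a \pi(s,a)
  = \alpha \nabla_\theta \rho ,
\]

since Σ_a π(s, a) = 1 in every state, so the gradient of the second term vanishes. This is why feedback with mean Q^π(s_t, a_t) − v(s_t), and in particular the advantage A^π(s_t, a_t), drives the same ascent direction as Q^π itself.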


6.3. Real-time COACH

There are challenges in implementing Equation 2 for real-time use in practice. Specifically, the interface for providing variable magnitude feedback needs to be addressed, and the question of how to handle sparseness and the timing of feedback needs to be answered. Here, we introduce Real-time COACH, shown in Algorithm 1, to address these issues.

For providing variable magnitude reward, we use reward aggregation (Knox & Stone, 2009b). In reward aggregation, a trainer selects from a discrete set of feedback values and further raises or lowers the numeric value by giving multiple feedbacks in succession that are summed together.

While sparse feedback is not especially problematic (because no feedback results in no change in policy), it may slow down learning unless the trainer is provided with a mechanism to allow feedback to affect a history of actions. We use eligibility traces (Barto et al., 1983) to help apply feedback to the relevant transitions. An eligibility trace is a vector that keeps track of the policy gradient and decays exponentially with a parameter λ. Policy parameters are then updated in the direction of the trace, allowing feedback to affect earlier decisions. However, a trainer may not always want to influence a long history of actions. Consequently, Real-time COACH maintains multiple eligibility traces with different temporal decay rates, and the trainer chooses which eligibility trace to use for each update. This trace choice may be handled implicitly with the feedback value selection or explicitly.

Due to reaction time, human feedback is typically delayed by about 0.2 to 0.8 seconds from the event to which the trainer meant to respond (Knox, 2012). To handle this delay, feedback in Real-time COACH is associated with events from d steps ago to cover the gap. Eligibility traces further smooth the feedback to older events.

Finally, we note that just as there are numerous variants of actor-critic update rules, similar variations can be used in the context of COACH.

Algorithm 1 Real-time COACH
Require: policy π_θ0, trace set λ, delay d, learning rate α
  Initialize traces e_λ ← 0 for all λ ∈ λ
  observe initial state s_0
  for t = 0 to ∞ do
    select and execute action a_t ∼ π_θt(s_t, ·)
    observe next state s_{t+1}, summed feedback f_{t+1}, and λ
    for λ′ ∈ λ do
      e_λ′ ← λ′ e_λ′ + (1 / π_θt(s_{t−d}, a_{t−d})) ∇_θt π_θt(s_{t−d}, a_{t−d})
    end for
    θ_{t+1} ← θ_t + α f_{t+1} e_λ
  end for
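The Python sketch below is an illustration of Algorithm 1 only; the environment interface, the policy and gradient helpers, and the feedback source are assumptions rather than the authors' implementation. It keeps one eligibility trace per decay rate, credits feedback to the state–action pair from d steps earlier, and lets the feedback channel select which trace drives the update.

import numpy as np
from collections import deque

def real_time_coach(env, policy, grad_log_pi, get_feedback, theta,
                    lambdas=(0.95, 0.9999), delay=6, alpha=0.05, n_steps=10000):
    """Sketch of Real-time COACH (Algorithm 1).

    policy(theta, s)          -> action probabilities for state s
    grad_log_pi(theta, s, a)  -> gradient of log pi_theta(s, a), i.e. grad pi / pi
    get_feedback()            -> (summed feedback f, index of trace to apply)
    env.reset() / env.step(a) -> current / next observation (simplified interface)
    """
    traces = [np.zeros_like(theta) for _ in lambdas]   # e_lambda <- 0
    history = deque(maxlen=delay + 1)                  # recent (s, a) pairs
    s = env.reset()
    for t in range(n_steps):
        probs = policy(theta, s)
        a = np.random.choice(len(probs), p=probs)      # a_t ~ pi_theta(s_t, .)
        s_next = env.step(a)
        history.append((s, a))
        f, trace_idx = get_feedback()                  # aggregated human feedback
        s_d, a_d = history[0]                          # event from ~d steps ago
        for i, lam in enumerate(lambdas):
            traces[i] = lam * traces[i] + grad_log_pi(theta, s_d, a_d)
        theta = theta + alpha * f * traces[trace_idx]  # update along the chosen trace
        s = s_next
    return theta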
7. Comparison of Update Rules

To understand the behavior of COACH under different types of trainer feedback strategies, we carried out a controlled comparison in a simple grid world. The domain is essentially an expanded version of the dog domain used in our human-subject experiment. It is an 8 × 5 grid in which the agent starts in (0, 0) and must get to (7, 0), which yields +5 reward. However, the cells from (1, 0) to (6, 0) are cells the agent needs to avoid, which yield −1 reward.

7.1. Learning Algorithms and Feedback Strategies

Three types of learning algorithms were tested. Each maintains an internal data structure, which it updates with feedback of the form ⟨s, a, f, s′⟩, where s is a state, a is an action taken in that state, f is the feedback received from the trainer, and s′ is the resulting next state. The algorithm also must produce an action for each state encountered.

The first algorithm, Q-learning (Watkins & Dayan, 1992), represents a standard value-function-based RL algorithm designed for reward maximization under delayed feedback. It maintains a data structure Q(s, a), initially 0. Its update rule has the form

∆Q(s, a) = α [f + γ max_{a′} Q(s′, a′) − Q(s, a)].    (3)

Actions are chosen using the rule argmax_a Q(s, a), where ties are broken randomly. We tested a handful of parameters and used the best values: discount factor γ = 0.99 and learning rate α = 0.2.

In TAMER (Knox & Stone, 2009a), a trainer provides interactive numeric feedback that is interpreted as an exemplar of the reward function for the demonstrated state–action pair as the learner takes actions. We assumed that each feedback applies to the last action, and thus used a simplified version of the algorithm that did not attempt to spread updates over multiple transitions. TAMER maintains a data structure R_H(s, a) for the predicted reward in each state, initially 0. It is updated by ∆R_H(s, a) = α f. We used α = 0.2. Actions are chosen via an ε-greedy rule on R_H(s, a) with ε = 0.2.

Lastly, we examined COACH, which is also designed to work well with human-generated feedback. We used a softmax policy with a single λ = 0 trace. The parameters were a matrix of values θ(s, a), initially zero. The stochastic policy defined by these parameters was

π(s, a) = e^{βθ(s,a)} / Σ_{a′} e^{βθ(s,a′)},

with β = 1. Parameters were updated via

∆θ = α ∇_θ π(s, a) f / π(s, a),    (4)

where α is a learning rate. We used α = 0.05.
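For concreteness, the three update rules can each be written in a few lines. The sketch below is our paraphrase, with states and actions assumed to be integer indices into NumPy arrays: the Q-learning update of Equation 3, the simplified TAMER update, and the softmax COACH update of Equation 4 (written with the identity ∇_θ π / π = ∇_θ log π).

import numpy as np

def q_learning_update(Q, s, a, f, s_next, alpha=0.2, gamma=0.99):
    # Equation 3: delta Q(s, a) = alpha [f + gamma max_a' Q(s', a') - Q(s, a)]
    Q[s, a] += alpha * (f + gamma * Q[s_next].max() - Q[s, a])

def tamer_update(RH, s, a, f, alpha=0.2):
    # Simplified TAMER: feedback is an exemplar of the reward for (s, a).
    RH[s, a] += alpha * f

def coach_update(theta, s, a, f, alpha=0.05, beta=1.0):
    # Equation 4 with the softmax policy over theta(s, .).
    prefs = beta * theta[s]
    pi_s = np.exp(prefs - prefs.max())
    pi_s /= pi_s.sum()
    grad_log_pi = -beta * pi_s    # d log pi(s, a) / d theta(s, a') is -beta pi(s, a') ...
    grad_log_pi[a] += beta        # ... plus beta for the chosen action a
    theta[s] += alpha * f * grad_log_pi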
In effect, each of these learning rules makes an assumption about the kind of feedback it expects trainers to use. We wanted to see how they would behave with feedback strategies that matched these assumptions and those that did not. The first feedback strategy we studied is the classical task-based reward function ("task"), where the feedback is sparse: +5 reward when the agent reaches the goal state, −1 for avoidance cells, and 0 for all other transitions. Q-learning is known to converge to optimal behavior with this type of feedback. The second strategy provides policy-independent feedback for each state–action pair ("action"): +5 when the agent reaches termination, +1 reward when the selected action matches an optimal policy, −1 for reaching an avoidance cell, and 0 otherwise. This type of feedback serves TAMER well. The third strategy ("improvement") used feedback defined by the advantage function of the learner's current policy π, A^π(s, a) = Q^π(s, a) − V^π(s), where the value functions are defined based on the task rewards. This type of feedback is very well suited to COACH.

7.2. Results

Each combination of algorithm and feedback strategy was run 99 times, with the median value of the number of steps needed to reach the goal reported. Episodes were ended after 1,000 steps if the goal was not reached.

Figure 3(a) shows the steps needed to reach the goal for the three algorithms trained with task feedback. The figure shows that TAMER can fail to learn in this setting. COACH also performs poorly with λ = 0, which prevents feedback from influencing earlier decisions. We did a subsequent experiment (not shown) with λ = 0.9 and found that COACH converged to reasonable behavior, although not as quickly as Q-learning. This result helps justify using traces to combat the challenges of delayed feedback.

Figure 3(b) shows results with action feedback. This time, Q-learning fails to perform well, a consequence of this feedback strategy inducing positive behavior cycles as it tries to avoid ending the trial, the same kind of problem that HCRL algorithms have been designed to avoid. Both TAMER and COACH perform well with this feedback strategy. TAMER performs slightly better than COACH, as this is precisely the kind of feedback TAMER was designed to handle.

Figure 3(c) shows the results of the three algorithms with improvement feedback, which is generated via the advantage function defined on the learner's current policy. These results tell a different story. Here, COACH performs the best. Q-learning largely flounders for most of the time, but with enough training it sometimes starts to converge (although, 14% of the time, Q-learning fails to do well even after 100 training episodes). TAMER, on the other hand, performs very badly at first. While the median score in the plot shows TAMER suddenly performing more comparably to COACH after about 10 episodes, 29% of our training trials completely failed to improve and timed out across all 100 episodes.

Figure 3. Steps to goal for Q-learning (blue), TAMER (red), and COACH (yellow) in Cliff world under different feedback strategies: (a) task feedback, (b) action feedback, (c) improvement feedback. The y-axis is on a logarithmic scale.

8. Robotics Case Study

In this section, we present qualitative results on Real-time COACH applied to a TurtleBot robot. The goal of this study was to test that COACH can scale to a complex domain involving multiple challenges, including training an agent that operates on a fast decision cycle (33ms), noisy non-Markov observations from a camera, and agent perception that is hidden from the trainer. To demonstrate the flexibility of COACH, we trained it to perform five different behaviors involving a pink ball and a cylinder with an orange top, using the same parameter selections. We discuss these behaviors below. We also contrast the results to training with TAMER. We chose TAMER as a comparison because, to our knowledge, it is the only HCRL algorithm with success on a similar platform (Knox et al., 2013).
The TurtleBot is a mobile base with two degrees of freedom that senses the world from a Kinect camera. We discretized the action space to five actions: forward, backward, rotate clockwise, rotate counterclockwise, and do nothing. The agent selects one of these actions every 33ms. To deliver feedback, we used a Nintendo Wii controller to give +1, +4, or −1 numeric feedback, and to pause and continue training. For perception, we used only the RGB image channels from the Kinect. Because our behaviors were based around a relocatable pink ball and a fixed cylinder with an orange top, we hand constructed relevant image features to be used by the learning algorithms. These features were generated using techniques similar to those used in neural network architectures. The features were constructed by first transforming the image into two color channels associated with the colors of the ball and cylinder. Sum pooling to form a lower-dimensional 8 × 8 grid was applied to each color channel. Each sum-pooling unit was then passed through three different normalized threshold units defined by T_i(x) = min(x/φ_i, 1), where φ_i specifies the saturation point. Using multiple saturation parameters differentiates the distance of objects, resulting in three "depth" scales per color channel. Finally, we passed these results through a 2 × 8 max-pooling layer with stride 1.
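The pooling and threshold stages might be sketched as follows (our reconstruction under assumptions: the color-channel transform is omitted, the saturation points φ_i are placeholders since the paper does not report their values, and the 2 × 8 max-pooling window is interpreted as two rows by all eight columns).

import numpy as np

def sum_pool(channel, out_hw=(8, 8)):
    """Sum-pool a 2-D color-channel response map down to an 8 x 8 grid."""
    H, W = channel.shape
    h, w = out_hw
    cropped = channel[:H - H % h, :W - W % w]
    return cropped.reshape(h, cropped.shape[0] // h, w, cropped.shape[1] // w).sum(axis=(1, 3))

def threshold_units(pooled, saturations=(5.0, 25.0, 100.0)):
    """T_i(x) = min(x / phi_i, 1); three saturation scales act as coarse 'depth' bins.
    The phi_i values here are placeholders, not the paper's."""
    return np.stack([np.minimum(pooled / phi, 1.0) for phi in saturations])

def max_pool_2x8(scales):
    """2 x 8 max pooling with stride 1 over each 8 x 8 threshold map."""
    return np.concatenate([
        np.array([m[i:i + 2, :].max() for i in range(m.shape[0] - 1)])
        for m in scales
    ])

def image_features(ball_channel, cylinder_channel):
    """Full pipeline for the two color channels; returns a flat feature vector."""
    return np.concatenate([
        max_pool_2x8(threshold_units(sum_pool(ch)))
        for ch in (ball_channel, cylinder_channel)
    ])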


The five behaviors we trained were push–pull, hide, ball following, alternate, and cylinder navigation. In push–pull, the TurtleBot is trained to navigate to the ball when it is far, and back away from it when it is near. The hide behavior has the TurtleBot back away from the ball when it is near and turn away from it when it is far. In ball following, the TurtleBot is trained to navigate to the ball. In the alternate task, the TurtleBot is trained to go back and forth between the cylinder and ball. Finally, cylinder navigation involves the agent navigating to the cylinder. We further classify the training methods for these behaviors as flat, involving the push–pull, hide, and ball following behaviors; and compositional, involving the alternate and cylinder navigation behaviors.

In all cases, our human trainer (one of the co-authors) used differential feedback and diminishing returns to quickly reinforce behaviors and restrict focus to the areas needing tuning. However, in alternate and cylinder navigation, they attempted more advanced compositional training methods. For alternate, the agent was first trained to navigate to the ball when it sees it, and then turn away when it is near. Then, the same was independently done for the cylinder. After training, introducing both objects would cause the agent to move back and forth between them. For cylinder navigation, they attempted to make use of an animal-training method called lure training, in which an animal is first conditioned to follow a lure object, which is then used to guide it through more complex behaviors. In cylinder navigation, they first trained the ball to be a lure, used it to
guide the TurtleBot to the cylinder, and finally gave a +4 reward to reinforce the behaviors it took when following the ball (turning to face the cylinder, moving toward it, and stopping upon reaching it). The agent would then navigate to the cylinder without requiring the ball to be present.

For COACH parameters, we used a softmax parameterized policy, where each action preference value was a linear function of the image features, plus tanh(θ_a), where θ_a is a learnable parameter for action a, providing a preference in the absence of any stimulus. We used two eligibility traces, with λ = 0.95 for feedback +1 and −1, and λ = 0.9999 for feedback +4. The feedback-action delay d was set to 6, which is 0.198 seconds. Additionally, we used an actor-critic parameter-update rule variant in which action preference values are directly modified (along their gradient), rather than by the gradient of the policy (Sutton & Barto, 1998). This variant more rapidly communicates stimulus–response preferences.
nicates stimulus–response preferences. For TAMER, we
it is less than expected. COACH does not exhibit this
used typical parameter values for fast decision cycle prob-
problem—any positive reward for staying will strengthen
lems: delay-weighted aggregate TAMER with uniform dis-
the behavior.
tribution credit assignment over 0.2 to 0.8 seconds, p = 0,
and cmin = 1 (Knox, 2012). (See prior work for parameter
meaning.) TAMER’s reward-function approximation used 9. Conclusion


the same representation as COACH. In this work, we presented empirical results that show


that the numeric feedback people give agents in an inter-


8.1. Results and Discussion active training paradigm is influenced by the agent’s cur-


COACH was able to successfully learn all five be- rent policy and argued why such policy-dependent feed-
back enables useful training strategies. We then intro-


haviors and a video showing its learning is available
online at https://www.youtube.com/watch?v= duced COACH, an algorithm that, unlike existing human-


e2Ewxumy8EA. Each of these behaviors were trained in centered reinforcement-learning algorithms, converges to a
less than two minutes, including the time spent verifying local optimum when trained with policy-dependent feed-
that a behavior worked. Differential feedback and dimin- back. We showed that COACH learns robustly in the
ishing returns allowed only the behaviors in need of tuning face of multiple feedback strategies and finally showed that
to be quickly reinforced or extinguished without any ex- COACH can be used in the context of robotics with ad-
plicit division between training and testing. Moreover, the vanced training methods.
agent successfully benefited from the compositional train- There are a number of exciting future directions to ex-
ing methods, correctly combining subbehaviors for alter- tend this work. In particular, because COACH is built on
nate, and quickly learning cylinder navigation with the lure. the actor-critic paradigm, it should be possible to combine
TAMER only successfully learned the behaviors using the it straightforwardly with learning from demonstration and
flat training methodology and failed to learn the composi- environmental rewards, allowing an agent to be trained in
tionally trained behaviors. In all cases, TAMER tended to a variety of ways. Second, because people give policy-
forget behavior, requiring feedback for previous decisions dependent feedback, investigating how people model the
it learned to be resupplied after it learned a new decision. current policy of the agent and how their model differs from
For the alternate behavior, this forgetting led to failure: af- the agent’s actual policy may produce even greater gains.
ter training the behavior for the cylinder, the agent forgot
some of the ball-related behavior and ended up drifting off Acknowledgements
course when it was time to go to the ball. TAMER also
We thank the anonymous reviewers for their useful sug-
failed to learn from lure training because TAMER does not
gestions and comments. This research has taken place in
allow reinforcing a long history of behaviors.
part at the Intelligent Robot Learning (IRL) Lab, Wash-
We believe TAMER’s forgetting is a result of interpret- ington State University. IRL’s support includes NASA
ing feedback as reward-function exemplars in which new NNX16CD07C, NSF IIS-1149917, NSF IIS-1643614, and
feedback in similar contexts can change the target. To il- USDA 2014-67021-22174.
References

Argall, Brenna D., Chernova, Sonia, Veloso, Manuela, and Browning, Brett. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

Baird, Leemon. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, SMC-13(5):834–846, 1983.

Bhatnagar, Shalabh, Sutton, Richard S., Ghavamzadeh, Mohammad, and Lee, Mark. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.

Clouse, Jeffery A. and Utgoff, Paul E. A teaching method for reinforcement learning. In Proceedings of the Ninth International Conference on Machine Learning (ICML'92), pp. 92–101, 1992.

Griffith, Shane, Subramanian, Kaushik, Scholz, Jonathan, Isbell, Charles, and Thomaz, Andrea L. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2625–2633, 2013.

Ho, Mark K., Littman, Michael L., Cushman, Fiery, and Austerweil, Joseph L. Teaching with rewards and punishments: Reinforcement or communication? In Proceedings of the 37th Annual Meeting of the Cognitive Science Society, 2015.

Isbell, Charles, Shelton, Christian R., Kearns, Michael, Singh, Satinder, and Stone, Peter. A social reinforcement learning agent. In Proceedings of the Fifth International Conference on Autonomous Agents, pp. 377–384. ACM, 2001.

Knox, W. Bradley and Stone, Peter. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pp. 9–16, 2009a.

Knox, W. Bradley and Stone, Peter. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pp. 9–16. ACM, 2009b.

Knox, W. Bradley and Stone, Peter. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), May 2010.

Knox, W. Bradley and Stone, Peter. Learning non-myopically from human-generated reward. In Proceedings of the 2013 International Conference on Intelligent User Interfaces, pp. 191–202. ACM, 2013.

Knox, W. Bradley, Glass, Brian D., Love, Bradley C., Maddox, W. Todd, and Stone, Peter. How humans teach agents. International Journal of Social Robotics, 4(4):409–421, 2012.

Knox, W. Bradley, Stone, Peter, and Breazeal, Cynthia. Training a robot via human feedback: A case study. In Social Robotics, pp. 460–470. Springer, 2013.

Knox, William Bradley. Learning from human-generated reward. PhD thesis, University of Texas at Austin, 2012.

Loftin, Robert, Peng, Bei, MacGlashan, James, Littman, Michael L., Taylor, Matthew E., Huang, Jeff, and Roberts, David L. Learning behaviors via human-delivered discrete feedback: Modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems, 30(1):30–59, 2015.

Maclin, Richard, Shavlik, Jude, Torrey, Lisa, Walker, Trevor, and Wild, Edward. Giving advice about preferred actions to reinforcement learners via knowledge-based kernel regression. In Proceedings of the National Conference on Artificial Intelligence, volume 20, pp. 819, 2005.

Miltenberger, Raymond G. Behavior Modification: Principles and Procedures. Cengage Learning, 2011.

Pilarski, Patrick M., Dawson, Michael R., Degris, Thomas, Fahimi, Farbod, Carey, Jason P., and Sutton, Richard S. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In 2011 IEEE International Conference on Rehabilitation Robotics, pp. 1–7. IEEE, 2011.

Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Sutton, Richard S., McAllester, David A., Singh, Satinder P., Mansour, Yishay, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057–1063, 1999.

Tenorio-Gonzalez, Ana C., Morales, Eduardo F., and Villaseñor-Pineda, Luis. Dynamic reward shaping: Training a robot by voice. In Advances in Artificial Intelligence–IBERAMIA 2010, pp. 483–492. Springer, 2010.

Thomaz, Andrea Lockerd and Breazeal, Cynthia. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In AAAI, volume 6, pp. 1000–1005, 2006.

Thomaz, Andrea L. and Breazeal, Cynthia. Robot learning via socially guided exploration. In Development and Learning, 2007. ICDL 2007. IEEE 6th International Conference on, pp. 82–87. IEEE, 2007.

Thomaz, Andrea L. and Breazeal, Cynthia. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172:716–737, 2008.

Watkins, Christopher J. C. H. and Dayan, Peter. Q-learning. Machine Learning, 8(3):279–292, 1992.
