Discussion 8 Solutions
This discussion covers unsupervised pretraining methods in NLP and imitation learning.
One way to learn embeddings is to optimize the representation to predict nearby words. More precisely, we
can consider a center word c and try to predict its neighbors in the sentence (context words o) via logistic
regression. We will associate with each word $w$ two vectors $u_w$ and $v_w$, and optimize these word embeddings to maximize the likelihood
$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
averaged over all selections of center word c and context word o. Intuitively, this objective means that
words that occur together (and so would show up as center/context word pairs o, c) would have higher inner
product $u_o^\top v_c$. It would also mean that if words a, b were similar in meaning and so were interchangeable,
we would expect their embeddings ua and ub to be similar (as well as va , vb ), since they would appear in
similar contexts. The embedding for each word was split into two components u, v to make optimization
more tractable; after training, the two components are simply averaged together to produce the final embedding.
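To make the objective concrete, here is a minimal sketch of computing $\log p(o \mid c)$ with two embedding tables. The vocabulary size, embedding dimension, and use of PyTorch are assumptions for illustration, not the original word2vec implementation.

```python
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 300            # assumed sizes for illustration
u = nn.Embedding(vocab_size, dim)        # context vectors u_w
v = nn.Embedding(vocab_size, dim)        # center vectors v_w

def log_p_o_given_c(o: int, c: int) -> torch.Tensor:
    """log p(o|c) = u_o^T v_c - log sum_{w in V} exp(u_w^T v_c)."""
    v_c = v.weight[c]                    # (dim,)
    scores = u.weight @ v_c              # u_w^T v_c for every word w in V
    return scores[o] - torch.logsumexp(scores, dim=0)

# After training, word w's final embedding is the average (u.weight[w] + v.weight[w]) / 2.
```

Summing this log-likelihood over all observed center/context pairs in a corpus gives the averaged objective described above.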
The issue with this approach is that if we only sample positive examples (pairs of words o, c that are actually
neighbors of one another), then a trivial solution would be to make u and v the same, very large vector
for all words, which would cause our binary classifier to always confidently predict that any two words are
valid center and context pairs.
To avoid this degenerate solution, we also include negative samples in every batch. Instead of just maximizing
the likelihood on valid center/context pairs, we’ll also reduce the likelihood of words randomly sampled from
the dictionary to get the objective
$$\max \sum_{c,o} \left( \log p(o \text{ is the right word} \mid c) + \sum_{w} \log p(w \text{ is wrong} \mid c) \right),$$
where the negative examples w are randomly sampled. Now, optimizing this objective will try to push vc
and uo closer together for valid context/center pairs, while pushing vc away from uw for all other words in
the vocabulary.
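A hedged sketch of the corresponding negative-sampling loss for a batch of (center, context) pairs, modeling "o is the right word" as $\sigma(u_o^\top v_c)$ and "w is wrong" as $\sigma(-u_w^\top v_c)$. The uniform negative sampler and batch shapes here are simplifying assumptions (word2vec in practice samples negatives from a smoothed unigram distribution).

```python
import torch
import torch.nn.functional as F

def sgns_loss(u, v, centers, contexts, num_neg=5):
    """Negative of the objective above for a batch; u, v are nn.Embedding tables,
    centers/contexts are (B,) LongTensors of word indices."""
    v_c = v(centers)                                    # (B, dim)
    u_o = u(contexts)                                   # (B, dim)
    pos = F.logsigmoid((u_o * v_c).sum(-1))             # log p(o is the right word | c)

    # Randomly sampled negatives (assumed uniform over the vocabulary here).
    neg_ids = torch.randint(0, u.num_embeddings, (centers.shape[0], num_neg))
    u_neg = u(neg_ids)                                  # (B, num_neg, dim)
    neg = F.logsigmoid(-(u_neg @ v_c.unsqueeze(-1)).squeeze(-1)).sum(-1)

    return -(pos + neg).mean()                          # minimize the negated objective
```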
Figure 1: ELMo takes the hidden states in a bi-directional LSTM to generate word embeddings. The LSTMs
are both trained via sequence prediction.
ELMo: We note that if we simply ran an LSTM forward through a sentence to generate the embeddings
of words, the embedding of each word would only depend on those that came before it, rather than the full
sentence. ELMo instead runs LSTMs in both directions and combines their hidden states to form each word's embedding (Figure 1).
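As a rough sketch of this idea (not the actual ELMo architecture, which uses character convolutions, multiple LSTM layers, and separately trained forward/backward language models; all sizes below are assumptions):

```python
import torch
import torch.nn as nn

class TinyBiLMEmbedder(nn.Module):
    """Toy ELMo-style embedder: each token's embedding is the concatenation of
    forward and backward LSTM hidden states at that position."""
    def __init__(self, vocab_size=10_000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):            # token_ids: (B, T) integer ids
        x = self.embed(token_ids)            # (B, T, dim)
        h, _ = self.lstm(x)                  # (B, T, 2*dim): forward ++ backward states
        return h                             # one contextual embedding per token
```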
We can imagine the state s as capturing the current state of the world, actions a to be the decisions we
make, the transition function $P(s' \mid s, a)$ describing how our actions affect the world around us, and the reward
r(s, a) capturing some notion of success at what we want to accomplish.
While solving imitation learning problems will not explicitly require access to the reward, we should keep in
mind that success in imitation learning is not necessarily measured directly in how well we match the expert
(as measured in perhaps negative-log-likelihood on the expert dataset like we would consider in supervised
learning), but in how well our learned policy actually executes the task we care about. The task we care
about is often specified (loosely) as a reward function.
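To make these pieces concrete, here is a toy, entirely hypothetical MDP with two states and two actions; the names and numbers are made up purely for illustration.

```python
# Toy MDP: states, actions, transition probabilities P(s'|s, a), and reward r(s, a).
states = ["far_from_goal", "at_goal"]
actions = ["move", "stay"]

# P[s][a] is a dict mapping next states to probabilities.
P = {
    "far_from_goal": {"move": {"at_goal": 0.8, "far_from_goal": 0.2},
                      "stay": {"far_from_goal": 1.0}},
    "at_goal":       {"move": {"far_from_goal": 1.0},
                      "stay": {"at_goal": 1.0}},
}

def r(s, a):
    # Success here means being at the goal and staying there.
    return 1.0 if s == "at_goal" and a == "stay" else 0.0
```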
Problem 1: Probability of a trajectory under a Markovian policy
Given a policy $\pi_\theta(a \mid s)$, compute the log probability of a trajectory $\tau = ((s_0, a_0), (s_1, a_1), \ldots, (s_T, a_T))$
using the Markov property.
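One way to carry out this computation in code, assuming we are handed callables for $\log p(s_0)$, $\log p(s_t \mid s_{t-1}, a_{t-1})$, and $\log \pi_\theta(a_t \mid s_t)$ (these function names are placeholders for illustration, not a fixed API):

```python
def log_prob_trajectory(traj, log_p0, log_P, log_pi):
    """traj = [(s_0, a_0), (s_1, a_1), ..., (s_T, a_T)]; returns log p(traj)."""
    s0, a0 = traj[0]
    total = log_p0(s0) + log_pi(a0, s0)
    # Markov property: each transition depends only on the previous (state, action).
    for (s_prev, a_prev), (s, a) in zip(traj[:-1], traj[1:]):
        total += log_P(s, s_prev, a_prev) + log_pi(a, s)
    return total
```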
In practice, we may not be able to access the full state s that satisfies the Markov property, but instead rely
on some limited observations o. For example, when driving a car, we cannot see everything around us, even
though things we cannot see may impact how the world behaves around us.
Here, optimal policies may no longer depend only on the current observation, but may need to depend on all past observations
($\pi(a_t \mid o_1, \ldots, o_t)$). Unless otherwise specified, we're going to assume we're in a fully observed MDP with
access to the true state.
Figure 3: The Markov property still holds for the underlying state s in POMDPs, but our policies cannot
depend on s directly. Instead, we are forced to take actions based only on the observations o.
Using the decomposition of the probability of a trajectory from the previous problem, show that
the behavior cloning objective maximizes the probability of the expert trajectories (assumed to be
generated by some expert β that produces a distribution over trajectories) under the learned policy
$$\max_\theta \; \mathbb{E}_{\tau \sim \beta}\left[\log \pi_\theta(\tau)\right].$$
Recall that
$$\log p(\tau) = \log p(s_0) + \log \pi_\theta(a_0 \mid s_0) + \sum_{t=1}^{T} \Big( \log p(s_t \mid s_{t-1}, a_{t-1}) + \log \pi_\theta(a_t \mid s_t) \Big).$$
Ignoring terms that don't depend on the policy parameters, maximizing the log-likelihood of a trajectory $\tau$ is then equivalent to maximizing
$$\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t).$$
Averaging over all the timesteps in each trajectory and over all the trajectories, we see this is identical
to the supervised objective for behavior cloning (up to having only an empirical distribution over
expert trajectories in the dataset instead of the true distribution).
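A hedged sketch of that supervised objective for a policy with discrete actions; the policy network interface, tensor shapes, and use of PyTorch are assumptions for illustration.

```python
import torch

def bc_loss(policy, expert_states, expert_actions):
    """Behavior cloning: average negative log-likelihood of expert actions.
    policy(expert_states) is assumed to return logits of shape (N, num_actions);
    expert_actions is a (N,) LongTensor of the expert's chosen actions."""
    logits = policy(expert_states)
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return nll.mean()      # average over timesteps and trajectories
```

Minimizing this with any standard optimizer is exactly the supervised learning setup described above.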
2.3 DAgger
However, having a policy that assigns high likelihoods to expert trajectories is often not enough to ensure
the learned policy attains good performance.
The primary issue is a distribution shift between training and test time. During training, the policy is only
being trained on states that were visited by the expert policy. During test time, we actually execute the
learned policy, and small mismatches between the learned policy and the expert can potentially take us to
new states not seen during training. On these new states that the imitation policy hasn’t been trained on,
the imitation policy would likely not match the expert’s behavior well. Thus, even if the expert policy were
capable of recovering from the earlier mistakes and still solving the task, the imitation learning policy would
instead continue to make mistakes.
One way to remedy this is to simply do a better job at matching the expert policy to avoid deviating far
from the expert trajectories. However, this can require extremely accurate models, and can be tricky to
accomplish (even with large amounts of expert data) when the expert policy is non-Markovian (so cannot
be matched exactly by a Markovian policy) or if the expert policy is multimodal (and the probability
distribution we choose for our policy is not expressive enough to match it).
Another approach is to alter the data distribution we train on to better cover the trajectories we'll encounter
during test time. This is the approach taken in the dataset aggregation (DAgger) algorithm, which
iteratively collects new trajectories from the current policy, labels those trajectories using the expert policy,
adds the relabeled trajectories to our dataset, and retrains. This way, we are constantly updating our
state distribution to include our current policy, and we can stop when our imitation policy’s distributions
stabilize and we obtain good performance in our desired task.
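A high-level sketch of that loop; `collect_rollout`, `expert_action`, and `train_supervised` are hypothetical callables standing in for environment interaction, expert queries, and the behavior-cloning step.

```python
def dagger(policy, collect_rollout, expert_action, train_supervised, num_iters=10):
    """collect_rollout(policy) -> list of states visited by the current policy;
    expert_action(s) -> the expert's action at state s;
    train_supervised(policy, dataset) -> behavior cloning on (state, action) pairs."""
    dataset = []                                            # aggregated dataset
    for _ in range(num_iters):
        states = collect_rollout(policy)                    # run the *current* policy
        dataset += [(s, expert_action(s)) for s in states]  # expert relabels the new states
        train_supervised(policy, dataset)                   # retrain on the aggregated data
    return policy
```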
While DAgger can be very effective at mitigating the distribution mismatch issues, it does require the agent
to interact with the environment during learning, as well as querying the expert to figure out what actions it
would take at the new states. Both of these can potentially be costly.