Discussion 8 Solutions
This discussion covers unsupervised pretraining methods in NLP and imitation learning.
One way to learn embeddings is to optimize the representation to predict nearby words. More precisely, we
can consider a center word c and try to predict its neighbors in the sentence (context words o) via logistic
regression. We will associate with each word $w$ two vectors $u_w$ and $v_w$, and optimize these word embeddings to maximize the likelihood
$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
averaged over all selections of center word c and context word o. Intuitively, this objective means that
words that occur together (and so would show up as center/context word pairs o, c) would have higher inner
product $u_o^\top v_c$. It would also mean that if words a, b were similar in meaning and so were interchangeable,
we would expect their embeddings ua and ub to be similar (as well as va , vb ), since they would appear in
similar contexts. The embedding for each word was split into two components u, v to make optimization
more tractable; after training, the two components are simply averaged together to produce the final embedding.
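To make the objective concrete, here is a minimal sketch of computing $\log p(o \mid c)$ with two embedding tables. The vocabulary size, embedding dimension, and use of PyTorch are assumptions for illustration, not the original word2vec implementation.

```python
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 300            # assumed sizes for illustration
u = nn.Embedding(vocab_size, dim)        # context vectors u_w
v = nn.Embedding(vocab_size, dim)        # center vectors v_w

def log_p_o_given_c(o: int, c: int) -> torch.Tensor:
    """log p(o|c) = u_o^T v_c - log sum_{w in V} exp(u_w^T v_c)."""
    v_c = v.weight[c]                    # (dim,)
    scores = u.weight @ v_c              # u_w^T v_c for every word w in V
    return scores[o] - torch.logsumexp(scores, dim=0)

# After training, word w's final embedding is the average (u.weight[w] + v.weight[w]) / 2.
```

Summing this log-likelihood over all observed center/context pairs in a corpus gives the averaged objective described above.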
The issue with this approach is that if we only sample positive examples (pairs of words o, c that are actually
neighbors of one another), then a trivial solution would be to make u and v the same, very large vector
for all words, which would cause our binary classifier to always confidently predict that any two words are
valid center and context pairs.
To avoid this degenerate solution, we also include negative samples in every batch. Instead of just maximizing
the likelihood on valid center/context pairs, we’ll also reduce the likelihood of words randomly sampled from
the dictionary to get the objective
$$\max \sum_{c,o} \left( \log p(o \text{ is the right word} \mid c) + \sum_{w} \log p(w \text{ is wrong} \mid c) \right),$$
where the negative examples w are randomly sampled. Now, optimizing this objective will try to push vc
and uo closer together for valid context/center pairs, while pushing vc away from uw for all other words in
the vocabulary.
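A hedged sketch of the corresponding negative-sampling loss for a batch of (center, context) pairs, modeling "o is the right word" as $\sigma(u_o^\top v_c)$ and "w is wrong" as $\sigma(-u_w^\top v_c)$. The uniform negative sampler and batch shapes here are simplifying assumptions (word2vec in practice samples negatives from a smoothed unigram distribution).

```python
import torch
import torch.nn.functional as F

def sgns_loss(u, v, centers, contexts, num_neg=5):
    """Negative of the objective above for a batch; u, v are nn.Embedding tables,
    centers/contexts are (B,) LongTensors of word indices."""
    v_c = v(centers)                                    # (B, dim)
    u_o = u(contexts)                                   # (B, dim)
    pos = F.logsigmoid((u_o * v_c).sum(-1))             # log p(o is the right word | c)

    # Randomly sampled negatives (assumed uniform over the vocabulary here).
    neg_ids = torch.randint(0, u.num_embeddings, (centers.shape[0], num_neg))
    u_neg = u(neg_ids)                                  # (B, num_neg, dim)
    neg = F.logsigmoid(-(u_neg @ v_c.unsqueeze(-1)).squeeze(-1)).sum(-1)

    return -(pos + neg).mean()                          # minimize the negated objective
```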
Figure 1: ELMo takes the hidden states in a bi-directional LSTM to generate word embeddings. The LSTMs
are both trained via sequence prediction.
ELMo: We note that if we simply ran an LSTM forward through a sentence to generate the embeddings
of words, the embedding of each word would only depend on those that came before it, rather than the full
sentence. ELMo instead runs LSTMs in both directions and combines their hidden states to form each word's embedding (Figure 1).
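As a rough sketch of this idea (not the actual ELMo architecture, which uses character convolutions, multiple LSTM layers, and separately trained forward/backward language models; all sizes below are assumptions):

```python
import torch
import torch.nn as nn

class TinyBiLMEmbedder(nn.Module):
    """Toy ELMo-style embedder: each token's embedding is the concatenation of
    forward and backward LSTM hidden states at that position."""
    def __init__(self, vocab_size=10_000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):            # token_ids: (B, T) integer ids
        x = self.embed(token_ids)            # (B, T, dim)
        h, _ = self.lstm(x)                  # (B, T, 2*dim): forward ++ backward states
        return h                             # one contextual embedding per token
```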
We can imagine the state s as capturing the current state of the world, actions a to be the decisions we
make, the transition function $P(s' \mid s, a)$ describing how our actions affect the world around us, and the reward
r(s, a) capturing some notion of success at what we want to accomplish.
While solving imitation learning problems will not explicitly require access to the reward, we should keep in
mind that success in imitation learning is not necessarily measured directly in how well we match the expert
(as measured in perhaps negative-log-likelihood on the expert dataset like we would consider in supervised
learning), but in how well our learned policy actually executes the task we care about. The task we care
about is often specified (loosely) as a reward function.
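To make these pieces concrete, here is a toy, entirely hypothetical MDP with two states and two actions; the names and numbers are made up purely for illustration.

```python
# Toy MDP: states, actions, transition probabilities P(s'|s, a), and reward r(s, a).
states = ["far_from_goal", "at_goal"]
actions = ["move", "stay"]

# P[s][a] is a dict mapping next states to probabilities.
P = {
    "far_from_goal": {"move": {"at_goal": 0.8, "far_from_goal": 0.2},
                      "stay": {"far_from_goal": 1.0}},
    "at_goal":       {"move": {"far_from_goal": 1.0},
                      "stay": {"at_goal": 1.0}},
}

def r(s, a):
    # Success here means being at the goal and staying there.
    return 1.0 if s == "at_goal" and a == "stay" else 0.0
```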
Problem 1: Probability of a trajectory under a Markovian policy
Given a policy $\pi_\theta(a \mid s)$, compute the log probability of a trajectory $\tau = ((s_0, a_0), (s_1, a_1), \ldots, (s_T, a_T))$
using the Markov property.
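One way to carry out this computation in code, assuming we are handed callables for $\log p(s_0)$, $\log p(s_t \mid s_{t-1}, a_{t-1})$, and $\log \pi_\theta(a_t \mid s_t)$ (these function names are placeholders for illustration, not a fixed API):

```python
def log_prob_trajectory(traj, log_p0, log_P, log_pi):
    """traj = [(s_0, a_0), (s_1, a_1), ..., (s_T, a_T)]; returns log p(traj)."""
    s0, a0 = traj[0]
    total = log_p0(s0) + log_pi(a0, s0)
    # Markov property: each transition depends only on the previous (state, action).
    for (s_prev, a_prev), (s, a) in zip(traj[:-1], traj[1:]):
        total += log_P(s, s_prev, a_prev) + log_pi(a, s)
    return total
```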
In practice, we may not be able to access the full state s that satisfies the Markov property, but instead rely
on some limited observations o. For example, when driving a car, we cannot see everything around us, even
though things we cannot see may impact how the world behaves around us.
Here, optimal policies may no longer depend only on the current observation, but may need to depend on all past observations
($\pi(a_t \mid o_1, \ldots, o_t)$). Unless otherwise specified, we're going to assume we're in a fully observed MDP with
access to the true state.
Figure 3: The Markov property still holds for the underlying state s in POMDPs, but our policies cannot
depend on s directly. Instead, we are forced to take actions based only on the observations o.
Using the decomposition of the probability of a trajectory from the previous problem, show that
the behavior cloning objective maximizes the probability of the expert trajectories (assumed to be
generated by some expert β that produces a distribution over trajectories) under the learned policy
$$\max_\theta \; \mathbb{E}_{\tau \sim \beta}\left[\log \pi_\theta(\tau)\right].$$
Recall that
$$\log p(\tau) = \log p(s_0) + \log \pi_\theta(a_0 \mid s_0) + \sum_{t=1}^{T} \Big( \log p(s_t \mid s_{t-1}, a_{t-1}) + \log \pi_\theta(a_t \mid s_t) \Big).$$
Ignoring terms that don't depend on the policy parameters, maximizing the log-likelihood of a trajectory $\tau$ is then equivalent to maximizing
$$\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t).$$
Averaging over all the timesteps in each trajectory and over all the trajectories, we see this is identical
to the supervised objective for behavior cloning (up to having only an empirical distribution over
expert trajectories in the dataset instead of the true distribution).
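A hedged sketch of that supervised objective for a policy with discrete actions; the policy network interface, tensor shapes, and use of PyTorch are assumptions for illustration.

```python
import torch

def bc_loss(policy, expert_states, expert_actions):
    """Behavior cloning: average negative log-likelihood of expert actions.
    policy(expert_states) is assumed to return logits of shape (N, num_actions);
    expert_actions is a (N,) LongTensor of the expert's chosen actions."""
    logits = policy(expert_states)
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return nll.mean()      # average over timesteps and trajectories
```

Minimizing this with any standard optimizer is exactly the supervised learning setup described above.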
2.3 DAgger
However, having a policy that assigns high likelihoods to expert trajectories is often not enough to ensure
the learned policy attains good performance.
The primary issue is a distribution shift between training and test time. During training, the policy is only
being trained on states that were visited by the expert policy. During test time, we actually execute the
learned policy, and small mismatches between the learned policy and the expert can potentially take us to
new states not seen during training. On these new states that the imitation policy hasn’t been trained on,
the imitation policy would likely not match the expert’s behavior well. Thus, even if the expert policy were
capable of recovering from the earlier mistakes and still solving the task, the imitation learning policy would
instead continue to make mistakes.
One way to remedy this is to simply do a better job at matching the expert policy to avoid deviating far
from the expert trajectories. However, this can require extremely accurate models, and can be tricky to
accomplish (even with large amounts of expert data) when the expert policy is non-Markovian (so cannot
be matched exactly by a Markovian policy) or if the expert policy is multimodal (and the probability
distribution we choose for our policy is not expressive enough to match it).
Another approach is to alter the data distribution we train on to better cover the trajectories we'll encounter
during test time. This is the approach taken in the dataset aggregation (DAgger) algorithm, which
iteratively collects new trajectories from the current policy, labels those trajectories using the expert policy,
adds the relabeled trajectories to our dataset, and retrains. This way, we are constantly updating our
state distribution to include our current policy, and we can stop when our imitation policy’s distributions
stabilize and we obtain good performance in our desired task.
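A high-level sketch of that loop; `collect_rollout`, `expert_action`, and `train_supervised` are hypothetical callables standing in for environment interaction, expert queries, and the behavior-cloning step.

```python
def dagger(policy, collect_rollout, expert_action, train_supervised, num_iters=10):
    """collect_rollout(policy) -> list of states visited by the current policy;
    expert_action(s) -> the expert's action at state s;
    train_supervised(policy, dataset) -> behavior cloning on (state, action) pairs."""
    dataset = []                                            # aggregated dataset
    for _ in range(num_iters):
        states = collect_rollout(policy)                    # run the *current* policy
        dataset += [(s, expert_action(s)) for s in states]  # expert relabels the new states
        train_supervised(policy, dataset)                   # retrain on the aggregated data
    return policy
```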
While DAgger can be very effective at mitigating the distribution mismatch issues, it does require the agent
to interact with the environment during learning, as well as querying the expert to figure out what actions it
would take at the new states. Both of these can potentially be costly.