
CSC6203

Large Language Model

Lecture 2: Large language model and beyond

Fall 2024
Benyou Wang
School of Data Science
Before the lecture …
OpenAI o1 is coming

Maybe there will never be a GPT-5?

https://openai.com/index/learning-to-reason-with-llms/
“We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America”
A score of 13.9/15 (obtained with re-ranking over 1,000 samples; re-ranking over 64 samples gives 12.5/15, i.e. 83%) places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
I: Scaling test-time computing (TTC)
Example of TTC: OVM
[Figure: starting from question q, the generator expands candidate reasoning steps level by level (Level 1–4); each branch of steps ends in a candidate answer (a_1, a_2, …).]

Inference-time tree search to enhance multi-step reasoning performance

Fei Yu, Anningzhe Gao, Benyou Wang. OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning.
https://arxiv.org/abs/2311.09724. Findings of NAACL 2024.
https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k
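To make the idea concrete, here is a minimal Python sketch of value-guided, step-level beam search in the spirit of OVM. The generator.propose_steps and value_model.score interfaces are hypothetical stand-ins for the fine-tuned generator and the outcome-supervised value model, not the paper's actual API.

```python
def tree_search(question, generator, value_model, beam_size=2, branch=3, max_levels=4):
    """Keep the beam_size partial solutions the value model rates most likely to reach a correct answer."""
    beams = [[]]  # each beam is a list of reasoning steps generated so far
    for _ in range(max_levels):
        candidates = []
        for steps in beams:
            # hypothetical: sample `branch` candidate next steps from the generator
            for step in generator.propose_steps(question, steps, n=branch):
                candidates.append(steps + [step])
        # hypothetical: the value model scores each partial solution's expected outcome
        candidates.sort(key=lambda s: value_model.score(question, s), reverse=True)
        beams = candidates[:beam_size]
        if all(s and s[-1].startswith("Answer:") for s in beams):
            break
    return beams[0]  # highest-value completed solution
```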
II: Chain of thought (CoT)

Instruction: Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format.
II: CoT (1)
So the user is requesting a bash script that can take a string representing a matrix, such as '[1,2],[3,4],[5,6]', and output its transpose in the same format.

Let's first understand the input and output formats.

Input: '[1,2],[3,4],[5,6]'
which corresponds to the matrix:

1 2
3 4
5 6

So the input is a series of rows, each in brackets, separated by commas.

The transpose of this matrix would be:

1 3 5
2 4 6

So the requested output is '[1,3,5],[2,4,6]'


II: CoT (2)
Overall Task:

Write a bash script that takes one argument (the string representing the matrix) and outputs its transpose in the same format.

Ok, let's think about how to parse the input string in bash, build the matrix (probably in arrays), transpose it, and then output the resulting
transposed matrix in the expected format.

Constraints:

- Since it's bash, we have to be careful with parsing strings.


- The script should accept the input string as an argument.

Approach:

- Parse the input string to extract the matrix elements.


- Build the matrix as an array of arrays.
- Transpose the matrix.
- Output the transposed matrix in the same format.
II: CoT (3)
Implementation Steps:

1. Capture input string as argument.


2. Remove any spaces (if any).
3. Parse the input string to extract numbers into arrays.
- Since the input is in the format '[1,2],[3,4],[5,6]', we can:
- Remove outer brackets if necessary.
- Split the string by '],' to get each row.
- For each row, remove '[' and ']', then split by ',' to get elements.

4. Build a 2D array in bash (arrays containing arrays).


- Since bash arrays are one-dimensional, we need to simulate 2D arrays.
5. Transpose the matrix.
- For transposing, we'll need to switch rows and columns.

6. Output the transposed matrix in the required format.


- Reconstruct the output string in the format '[1,3,5],[2,4,6]'

Now, start coding step by step.

Potential issues:

- Bash does not have multi-dimensional arrays natively, so we need to simulate them.


- We need to handle numbers that can be more than one digit.
II: CoT (4)
Let me try to simulate what happens.

I'll write this as an answer.

[code]

Save this script to a file, for example, transpose.sh, make it executable with
chmod +x transpose.sh, and run it:

./transpose.sh '[1,2],[3,4],[5,6]'

It will output:

[1,3,5],[2,4,6]
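The bash script produced by o1 is elided above ("[code]"). As a language-agnostic illustration of the steps the chain of thought plans (parse the string, transpose, reformat), here is a minimal Python sketch; it is not o1's answer, and a bash version would follow the same structure.

```python
import sys

def transpose_matrix_string(s: str) -> str:
    # Parse '[1,2],[3,4],[5,6]' into a list of rows.
    s = s.replace(" ", "")
    rows = [row.strip("[]").split(",") for row in s.split("],[")]
    # Transpose: element (i, j) becomes element (j, i).
    cols = list(zip(*rows))
    # Reformat as '[1,3,5],[2,4,6]'.
    return ",".join("[" + ",".join(col) + "]" for col in cols)

if __name__ == "__main__":
    print(transpose_matrix_string(sys.argv[1]))  # e.g. '[1,2],[3,4],[5,6]'
```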

System 1 vs. System 2

https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
To recap…
What is language modeling?

A language model assigns a probability to an n-gram:

f : V^n → R^+

A conditional language model assigns a probability to a word given some conditioning context:

g : (V^(n−1), V) → R^+

and p(w_n | w_1 ⋯ w_(n−1)) = g(w_1 ⋯ w_(n−1), w_n) = f(w_1 ⋯ w_n) / f(w_1 ⋯ w_(n−1))

p(w_n | w_1 ⋯ w_(n−1)) is the foundation of modern large language models (GPT, ChatGPT, etc.)
Language models: Narrow Sense
A probabilistic model that assigns a probability to every finite sequence (grammatical or not)

GPT-3 still acts in this way, but the model is implemented as a very large neural network with 175 billion parameters!
Language models: Broad Sense

❖ Decoder-only models (GPT-x models)
❖ Encoder-only models (BERT, RoBERTa, ELECTRA)
❖ Encoder-decoder models (T5, BART)

The latter two usually involve a different pre-training objective.
Today’s lecture

• Language model in a narrow sense


(Probability theory, N-gram language model)

• Language model in broad sense

• More thoughts on language model


Why do we need language models?
Many NLP tasks require natural language output:
- Machine translation: return text in the target language
- Speech recognition: return a transcript of what was spoken
- Natural language generation: return natural language text
- Spell-checking: return corrected spelling of input

Language models define probability distributions over (natural language)


strings or sentences.
➔ We can use a language model to score possible output strings so that we can choose the best (i.e. most likely) one: if P_LM(A) > P_LM(B), return A, not B
Hmmm, but…
… what does it mean for a language model to “define a
probability distribution”? [Google N-gram dataset]

… why would we want to define probability


distributions over languages? [evaluation]

… how can we construct a language model such that it


actually defines a probability distribution? [evaluation]

http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
Reminder:
Basic Probability Theory
Sampling with replacement
Pick a random shape, then put it back in the bag.

P( ) = 2/15 P( ) = 1/15 P( or ) = 2/15


P(blue) = 5/15 P(red) = 5/15 P( |red) = 3/5
P(blue | ) = 2/5 P( ) = 5/15
Sampling with replacement
Pick a random shape, then put it back in the bag.
What sequence of shapes will you draw?
P( )
= 1/15 ×1/15 ×1/15 ×2/15
= 2/50625
P( )
= 3/15 ×2/15 ×2/15 ×3/15
= 36/50625
P( ) = 2/15 P( ) = 1/15 P( or ) = 2/15
P(blue) = 5/15 P(red) = 5/15 P( |red) = 3/5
P(blue | ) = 2/5 P( ) = 5/15
Sampling with replacement
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(of) = 3/66 P(her) = 2/66


P(Alice) = 2/66 P(sister) = 2/66
P(was) = 2/66 P(,) = 4/66
P(to) = 2/66 P(') = 4/66
Sampling with replacement
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(her) = 2/66


P(Alice) = 2/66 P(sister) = 2/66
P(was) = 2/66 P(,) = 4/66
P(to) = 2/66 P(') = 4/66

In this model, P(English sentence) = P(word salad)


Probability theory: terminology
Trial (aka “experiment”)
Picking a shape, predicting a word
Sample space Ω:
The set of all possible outcomes
(all shapes; all words in Alice in Wonderland)
Event ω ⊆ Ω:
An actual outcome (a subset of Ω)
(predicting ‘the’, picking a triangle)
Random variable X: Ω → T
A function from the sample space (often the identity function)
Provides a ‘measurement of interest’ from a trial/experiment
(Did we pick ‘Alice’/a noun/a word starting with “x”/…?)
What is a probability distribution?
P(ω) defines a distribution over Ω iff

1) Every event ω has a probability P(ω) between 0 and 1:
   0 ≤ P(ω ⊆ Ω) ≤ 1

2) The null event ∅ has probability 0:
   P(∅) = 0

3) And the probability of all disjoint events sums to 1.


Joint and Conditional Probability
The conditional probability of X given Y, P(X | Y ),
is defined in terms of the probability of Y, P( Y ),
and the joint probability of X and Y, P(X,Y ):

P(X | Y) = P(X, Y) / P(Y)
P(blue | ) = 2/5
The chain rule
The joint probability P(X,Y) can also be expressed in
terms of the conditional probability P(X | Y)
P(X, Y ) = P(X|Y )P(Y )

This leads to the so-called chain rule:
P(X_1, X_2, …, X_n) = P(X_1) · P(X_2 | X_1) · … · P(X_n | X_1, …, X_(n−1))


Independence
Two random variables X and Y are independent if
P(X, Y ) = P ( X ) P ( Y )

If X and Y are independent, then P(X | Y) = P(X):


P(X | Y) = P(X, Y) / P(Y)
         = P(X) P(Y) / P(Y)    (X, Y independent)
         = P(X)
Probability models
Building a probability model consists of two steps:
1. Defining the model
2. Estimating the model’s parameters
(= training/learning )

Models (almost) always make


independence assumptions.
That is, even though X and Y are not actually independent,
our model may treat them as independent.

This reduces the number of model parameters that


we need to estimate (e.g. from n² to 2n)
Language modeling with n-grams
Language modeling with N-grams
A language model over a vocabulary V
assigns probabilities to strings drawn from V*.

Recall the chain rule:


P(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(1))

An n-gram language model assumes each word


depends only on the last n−1 words:
P_ngram(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(i−n+1))
N-gram models
N-gram models assume each word (event)
depends only on the previous n−1 words (events):
Unigram model: P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i))

Bigram model: P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1))

Trigram model: P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1), w(i−2))

Such independence assumptions are called


Markov assumptions (of order n−1).
A unigram model for Alice
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(her) = 2/66


P(Alice) = 2/66 P(sister) = 2/66
P(was) = 2/66 P(,) = 4/66
P(to) = 2/66 P(') = 4/66

In this model, P(English sentence) = P(word salad)


A bigram model for Alice
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(w(i) = of | w(i–1) = tired) = 1
P(w(i) = of | w(i–1) = use) = 1
P(w(i) = sister | w(i–1) = her) = 1
P(w(i) = beginning | w(i–1) = was) = 1/2
P(w(i) = reading | w(i–1) = was) = 1/2
P(w(i) = bank | w(i–1) = the) = 1/3
P(w(i) = book | w(i–1) = the) = 1/3
P(w(i) = use | w(i–1) = the) = 1/3
Where do we get the probabilities from?
Learning (estimating) a language model
Where do we get the parameters of our model
(its actual probabilities) from?
P(w(i) = ‘the’ | w(i–1) = ‘on’) = ???
We need (a large amount of) text as training data
to estimate the parameters of a language model.

The most basic parameter estimation technique:


relative frequency estimation (= counts)
P(w(i) = ‘the’ | w(i–1) = ‘on’) = C(‘on the’) / C(‘on’)
Also called Maximum Likelihood Estimation (MLE)

NB: MLE assigns all probability mass to events


that occur in the training corpus.
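As a concrete illustration, here is a minimal Python sketch of relative frequency (MLE) estimation for a bigram model; the toy corpus and whitespace tokenization are placeholders.

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()  # toy training data

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))

def p_bigram(w, prev):
    """MLE estimate: P(w | prev) = C(prev, w) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # unseen history: MLE gives no estimate; return 0 here
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("cat", "the"))  # C('the cat') / C('the') = 2/3
```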
Are n-gram models actual language models?
How do n-gram models define P(L)?
An n-gram model defines P_ngram(w(1) … w(N)) in terms of the probability of predicting each word:

P_bigram(w(1) … w(N)) = ∏_{i=1…N} P(w(i) | w(i−1))

With a fixed vocabulary V, it's easy to make sure P(w(i) | w(i−1)) is a distribution:

∑_{i=1…|V|} P(w_i | w_j) = 1   and   ∀ i,j : 0 ≤ P(w_i | w_j) ≤ 1

If P(w(i) | w(i−1)) is a distribution, this model defines


one distribution (over all strings) for each length N

But the strings of a language L don’t all have the same length
English = {“yes!”, “I agree”, “I see you”, …}
And there is no Nmax that limits how long strings in L can get.

Solution: the EOS (end-of-sentence) token!


How do n-gram models define P(L)?
Think of a language model as a stochastic process:
- At each time step, randomly pick one more word.
- Stop generating more words when the word you pick is a special end-
of-sentence (EOS) token.
To be able to pick the EOS token, we have to modify our
training data so that each sentence ends in EOS.
This means our vocabulary is now VEOS = V ∪{EOS}
We then get an actual language model,
i.e. a distribution over strings of any length
Technically, this is only true because P(EOS | …) will be high enough that we are always
guaranteed to stop after having generated a finite number of words

Why do we care about having one model for all lengths?


We can now compare the probabilities of strings of different
lengths, because they’re computed by the same distribution.
A couple more modifications…
Handling unknown words: UNK
Training:
- Assume a fixed vocabulary (e.g. all words that occur at least
n times in the training corpus)
- Replace all other words in the corpus by a token <UNK>
- Estimate the model on this modified training corpus.

Testing (e.g to compute probability of a string):


- Replace any words not in the vocabulary by <UNK>

Refinements:
use different UNK tokens for different types of words
(numbers, etc.).
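A minimal sketch of the UNK preprocessing described above, assuming a whitespace-tokenized corpus and a simple count threshold:

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep words occurring at least min_count times; everything else becomes <UNK>."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def apply_unk(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "a a a b b c".split()
vocab = build_vocab(train, min_count=2)      # {'a', 'b'}
print(apply_unk("a c d".split(), vocab))     # ['a', '<UNK>', '<UNK>']
```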
What about the beginning of the sentence?
In a trigram model
P(w(1)w(2)w(3)) = P(w(1))P(w(2) | w(1))P(w(3) | w(2), w(1))
only the third term P(w(3) | w(2), w(1)) is an actual trigram
probability. What about P(w(1)) and P(w(2) | w(1)) ?

If this bothers you:


Add n–1 beginning-of-sentence (BOS) symbols to
each sentence for an n–gram model:
BOS1 BOS2 Alice was …
Now the unigram and bigram probabilities
involve only BOS symbols.
Using language models
How do we use language models?
Independently of any application, we can use a
language model as a random sentence generator
(i.e we sample sentences according to their language model
probability)

Systems for applications such as machine translation,


speech recognition, spell-checking, generation, often
produce multiple candidate sentences as output.
- We prefer output sentences SOut that have a higher probability
- We can use a language model P(SOut) to score and rank these
different candidate output sentences, e.g. as follows:
argmax_{S_Out} P(S_Out | Input) = argmax_{S_Out} P(Input | S_Out) · P(S_Out)

Example: language model in information retrieval.


An example of ASR to use language models

这儿有周杰伦演唱会 (There is a Jay Chou concert!)

• Acoustic model alone: 周杰轮? 周捷伦? (wrong-character homophones of the name)

• Acoustic model + language model: 周杰伦 (the correct name)


Using n-gram models to generate language
Generating from a distribution
How do you generate text from an n-gram model?

That is, how do you sample from a distribution P(X |Y=y)?


- Assume X has N possible outcomes (values): {x1, …, xN}
and P(X=xi | Y=y) = pi
- Divide the interval [0,1] into N smaller intervals according to the probabilities of the outcomes
- Generate a random number r between 0 and 1.
- Return the outcome x_i whose interval the number falls in.

[Diagram: the interval [0,1] is split at p_1, p_1+p_2, p_1+p_2+p_3, p_1+p_2+p_3+p_4 into segments for x_1 … x_5; r falls into one of the segments.]
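A minimal Python sketch of this inverse-CDF sampling procedure (the outcomes and probabilities below are made-up examples):

```python
import random

def sample(outcomes, probs):
    """Sample one outcome: draw r in [0,1) and find the interval it falls into."""
    r = random.random()
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if r < cumulative:
            return x
    return outcomes[-1]  # guard against floating-point round-off

words = ["the", "cat", "sat"]
probs = [0.5, 0.3, 0.2]
print(sample(words, probs))
```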
Generating the Wall Street Journal
Generating Shakespeare
Shakespeare as corpus
The Shakespeare corpus consists of N=884,647 word
tokens and a vocabulary of V=29,066 word types

Shakespeare produced 300,000 bigram types


out of V² = 844 million possible bigram types.

99.96% of possible bigrams don’t occur in the corpus.

Our relative frequency estimate assigns non-zero


probability to only 0.04% of the possible bigrams
That percentage is even lower for trigrams, 4-grams, etc.

Use data from https://huggingface.co/datasets/Trelis/tiny-shakespeare or


https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
MLE doesn’t capture unseen events
We estimated a model on 440K word tokens, but:

Only 30,000 word types occur in the training data


Any word that does not occur in the training data
has zero probability!

Only 0.04% of all possible bigrams (over 30K word


types) occur in the training data
Any bigram that does not occur in the training data
has zero probability (even if we have seen both words in
the bigram)
How do we assign non-zero probability to unseen events?
We have to “smooth” our distributions to assign some
probability mass to unseen events
MLE model: P(seen) = 1.0, P(unseen) = 0.0
Smoothed model: P(seen) < 1.0, P(unseen) > 0.0

We won’t talk much about smoothing this year.


Smoothing methods
Add-one smoothing:
Hallucinate counts that didn’t occur in the data

Linear interpolation:
P̃(w | w′, w″) = λ · P̂(w | w′, w″) + (1 − λ) · P̃(w | w′)
Interpolate n-gram model with (n–1)-gram model.

Absolute Discounting: Subtract a constant from the counts of frequent (seen) events and redistribute that probability mass to rare/unseen events
Kneser-Ney: AD with modified unigram probabilities
Add-One (Laplace) Smoothing
A really simple way to do smoothing:
Increment the actual observed count of every possible
event (e.g. bigram) by a hallucinated count of 1
(or by a hallucinated count of some k with 0<k<1).

Shakespeare bigram model (roughly):


0.88 million actual bigram counts
+ 844.xx million hallucinated bigram counts

Oops. Now almost none of the counts in our model


come from actual data. We’re back to word salad.
k needs to be really small. But it turns out that this still doesn't work very well.
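A minimal sketch of add-k smoothing for the bigram estimate above (the counts, vocabulary size, and k are toy values):

```python
def p_addk(w, prev, bigram_counts, unigram_counts, vocab_size, k=0.1):
    """Add-k smoothed bigram probability:
    P(w | prev) = (C(prev, w) + k) / (C(prev) + k * |V|)."""
    return (bigram_counts.get((prev, w), 0) + k) / (
        unigram_counts.get(prev, 0) + k * vocab_size
    )

bigrams = {("the", "cat"): 2}
unigrams = {"the": 3}
print(p_addk("cat", "the", bigrams, unigrams, vocab_size=10))  # (2+0.1)/(3+1.0) = 0.525
print(p_addk("dog", "the", bigrams, unigrams, vocab_size=10))  # (0+0.1)/(3+1.0) = 0.025
```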
Evaluation
Intrinsic vs Extrinsic Evaluation
How do we know whether one language
model is better than another?

There are two ways to evaluate models:


- intrinsic evaluation captures how well the model
captures what it is supposed to capture (e.g.
probabilities)
- extrinsic (task-based) evaluation captures how useful
the model is in a particular task.

Both cases require an evaluation metric that allows us


to measure and compare the performance of different
models.
Intrinsic Evaluation of
Language Models: Perplexity
Perplexity
The perplexity of a language model is the inverse of the probability it assigns to the test set, normalized by the number of tokens N in the test set.

If a LM assigns probability P(w_1, …, w_N) to a test corpus w_1 … w_N, the LM's perplexity is

PP(w_1 … w_N) = P(w_1 … w_N)^(−1/N) = the N-th root of 1 / P(w_1 … w_N)

A LM with lower perplexity is better because it assigns


a higher probability to the unseen test corpus.
LM1 and LM2’s perplexity can only be compared if they use the same vocabulary
— Trigram models have lower perplexity than bigram models;
— Bigram models have lower perplexity than unigram models, etc.
Practical issues

• Since language model probabilities are very small, multiplying them together often leads to underflow.

• It is often better to use logarithms instead: replace the product ∏_i P(w_i | w_1 … w_(i−1)) with a sum of log probabilities, i.e. compute log PP(w_1 … w_N) = −(1/N) ∑_i log P(w_i | w_1 … w_(i−1)).
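A minimal sketch of computing perplexity from per-token conditional probabilities; the probabilities below are placeholders for whatever the LM assigns to a test string:

```python
import math

def perplexity(token_probs):
    """PP = exp(-(1/N) * sum(log p_i)); summing logs avoids underflow."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# e.g. a 4-token test string with these conditional probabilities under the LM:
print(perplexity([0.2, 0.1, 0.5, 0.05]))
```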
Extrinsic (Task-Based)
Evaluation of LMs:
Word Error Rate
Intrinsic vs. Extrinsic Evaluation
Perplexity tells us which LM assigns a higher
probability to unseen text

This doesn’t necessarily tell us which LM is better for


our task (i.e. is better at scoring candidate sentences)

Task-based evaluation:
- Train model A, plug it into your system for performing task T
- Evaluate performance of system A on task T.
- Train model B, plug it in, evaluate system B on same task T.
- Compare scores of system A and system B on task T.
Word Error Rate (WER)
Originally developed for speech recognition.

How much does the predicted sequence of words


differ from the actual sequence of words in the correct
transcript?
WER = (Insertions + Deletions + Substitutions) / (# of words in the actual transcript)

Insertions: “eat lunch” → “eat a lunch”


Deletions: “see a movie” → “see movie”
Substitutions: “drink ice tea” → “drink nice tea”
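A minimal sketch of WER via edit distance between the reference transcript and the hypothesis; whitespace tokenization is an assumption:

```python
def wer(reference, hypothesis):
    """Word error rate = minimum edit distance (ins + del + sub) / # reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("drink ice tea", "drink nice tea"))  # 1 substitution / 3 words ≈ 0.33
```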
To recap….
Key concepts in summary
N-gram language models
Independence assumptions
Getting from n-grams to a distribution over a language
Relative frequency (maximum likelihood) estimation
Smoothing
Intrinsic evaluation: Perplexity
Extrinsic evaluation: WER
Contents

• Language model in a narrow sense


(Probability theory, N-gram language model)

• Language model in broad sense


(BERT and beyond)

• More thoughts on language model


More on N-gram LMs
N-gram Language Models

the students opened their


•Question: How to learn a Language Model?
•Answer (pre- Deep Learning): learn an n-gram Language Model!

•Definition: An n-gram is a chunk of n consecutive words.


•unigrams: “the”, “students”, “opened”, ”their”
•bigrams: “the students”, “students opened”, “opened their”
•trigrams: “the students opened”, “students opened their”
•four-grams: “the students opened their”

•Idea: Collect statistics about how frequent different n-grams are and use
these to predict next word.
N-gram Language Models
• First we make a Markov assumption: x^(n) depends only on the preceding n−1 words

•Question: How do we get these n-gram and (n-1)-gram probabilities?


•Answer: By counting them in some large corpus of text!
(statistical approximation)
N-gram Language Models: Example
Suppose we are learning a 4-gram Language Model.

as the proctor started the clock, the students opened their ___

(discard the earlier words; condition on the last n−1 = 3 words, "students opened their")

For example, suppose that in the corpus:


• “students opened their” occurred 1000 times
• “students opened their books” occurred 400 times
• P(books | students opened their) = 0.4
• “students opened their exams” occurred 100 times
• P(exams | students opened their) = 0.1
Sparsity Problems with n-gram Language Models
Sparsity Problem 1
Problem: What if "students opened their w" never occurred in the data? Then w has probability 0!
(Partial) Solution: Add a small δ to the count for every w ∈ V. This is called smoothing.

Sparsity Problem 2
Problem: What if "students opened their" never occurred in the data? Then we can't calculate a probability for any w!
(Partial) Solution: Just condition on "opened their" instead. This is called backoff.

Note: Increasing n makes sparsity problems worse.


Typically, we can’t have n bigger than 5.
Storage Problems with n-gram Language Models

Storage: Need to store


count for all n-grams
you saw in the corpus.

Increasing n or increasing
corpus increases model size!

How to build a neural language model?
• Recall the Language Modeling task:
• Input: sequence of words
• Output: prob. dist. of the next word

• How about a window-based neural model?


• We saw this applied to Named Entity Recognition: e.g., predicting the label LOCATION for the center word "Paris" in the window "museums in Paris are amazing".


A fixed-window neural Language Model

[Figure: the input words "the students opened their" as one-hot vectors → concatenated word embeddings → hidden layer → output distribution over the vocabulary (e.g. "books", "laptops", …, "a", "zoo").]
A fixed-window neural Language Model
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model

Improvements over n-gram LM:
• No sparsity problem
• Don't need to store all observed n-grams

Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges W
• Window can never be large enough!
• x(1) and x(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed.

We need a neural architecture that can process input of any length.
Recurrent NN is the solution!
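Before moving on, here is a minimal PyTorch sketch of the fixed-window neural LM just described; the vocabulary size, embedding size, window size, and hidden size are arbitrary placeholders, not Bengio et al.'s settings.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, window=4, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)  # W: its size grows with the window
        self.out = nn.Linear(hidden_dim, vocab_size)           # scores over the vocabulary

    def forward(self, window_ids):               # window_ids: (batch, window)
        e = self.emb(window_ids)                  # (batch, window, emb_dim)
        x = e.flatten(start_dim=1)                # concatenate the window's embeddings
        h = torch.tanh(self.hidden(x))
        return self.out(h)                        # logits for the next word

model = FixedWindowLM()
logits = model(torch.randint(0, 10000, (2, 4)))   # e.g. "the students opened their" as ids
probs = torch.softmax(logits, dim=-1)             # distribution over next words
print(probs.shape)                                # torch.Size([2, 10000])
```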
From N-gram LMs to Word vectors
Byproducts of NNLM: word embeddings

[Figure: the same fixed-window network ("the students opened their" → distribution over "books", "laptops", …, "a", "zoo"); its input embedding layer provides word embeddings/vectors as a byproduct.]
How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)


❏ the idea that is represented by a word, phrase, etc.
❏ the idea that a person wants to express by using words, signs, etc.
❏ the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:


❏ signifier (symbol) ⟺ signified (idea or thing)
= denotational semantics
❏ Tree ⟺ {🌳, 🌲, 🌴, …}

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Representing words as discrete symbols

❏ In traditional NLP, we regard words as discrete symbols:


hotel, conference, motel – a localist representation

❏ Such symbols for words can be represented by one-hot vectors:


motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

❏ Vector dimension = number of words in vocabulary (e.g., 500,000+)

These two vectors are orthogonal


There is no natural notion of similarity for one-hot vectors!

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Representing words by their context
Distributional semantics: A word’s meaning is given by the words that frequently
appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby
(within a fixed-size window).
• We use the many contexts of w to build up a representation of w

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Word2Vec Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
Idea:
• We have a large corpus (“body”) of text: a long list of words
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o
given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Word2vec: objective function
❏ We want to minimize the objective function: the average negative log-likelihood of context words given center words (see the formulas below)

❏ Question: How do we calculate P(w_{t+j} | w_t; θ)?

Answer: We will use two vectors per word w:

❏ v_w when w is a center word
❏ u_w when w is a context word

Then for a center word c and a context word o, use the softmax (see below):
"max" because softmax amplifies the probability of the largest values
"soft" because it still assigns some probability to smaller values
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
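The formulas on this slide are images in the source; for reference, the standard skip-gram objective and softmax, in the notation of the cited CS224n slides, are:

```latex
% Objective: average negative log-likelihood over positions t and window offsets j
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)

% Softmax over the vocabulary, with u_o the context ("outside") vector and v_c the center vector
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```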
Word structure and subword models
We assume a fixed vocab of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single UNK.

Finite vocabulary assumptions make even less sense in many languages.


• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Interesting characters/words

• 夵 《广韵》《集韵》并以冉切，音琰 (yan3)。物上大下小也。又《集韵》他刀切，音叨 (tao1)。进也。
  (A rare character: per the Guangyun/Jiyun rhyme dictionaries it is read yan3, meaning "big on top, small below"; Jiyun also gives the reading tao1, meaning "to advance".)
• LGUer
• Looooooooong
A paper of ours: MorphTE

Guobing Gan, Peng Zhang, Sunzhu Li, Xiuqing Lu, Benyou Wang. MorphTE: Injecting Morphology in Tensorized Embeddings. NeurIPS 2022
From static word vector to
contextualized word vectors
What’s wrong with word2vec?

• One vector for each word type

• Complex characteristics of word use: semantics, syntactic behavior, and connotations

• Polysemous words, e.g., bank, mouse


Contextualized word embeddings

Let’s build a vector for each word conditioned on its context!

Contextualized word embeddings

the movie was terribly exciting !

f : (w_1, w_2, …, w_n) ⟶ x_1, …, x_n ∈ ℝ^d


ELMo
• NAACL’18: Deep contextualized word representations

• Key idea:

• Train an LSTM-based language model on some


large corpus
• Use the hidden states of the LSTM for each token
to compute a vector representation of each word
ELMo

[Figure: the ELMo biLM — stacked forward and backward LSTMs run over the words in the sentence, from the input embeddings up to a softmax over the vocabulary.]
How to use ELMo?
h^LM_{k,0} = x^LM_k,    h^LM_{k,j} = [→h^LM_{k,j} ; ←h^LM_{k,j}]   for j = 1 … L (the # of layers)

ELMo^task_k = γ^task · Σ_{j=0…L} s^task_j · h^LM_{k,j}

• γ^task: allows the task model to scale the entire ELMo vector
• s^task_j: softmax-normalized weights across layers
• Plug ELMo into any (neural) NLP model: freeze all the LM's weights and change the input representation to [x_k ; ELMo^task_k]

(could also insert into higher layers)


Use ELMo in practice
https://allennlp.org/elmo

Also available in TensorFlow


BERT
• First released in Oct 2018.
• NAACL’19: BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding

How is BERT different from ELMo?

#1. Unidirectional context vs bidirectional context

#2. LSTMs vs Transformers (will talk later)

#3. The weights are not frozen; this is called fine-tuning


Bidirectional encoders
• Language models only use left context or right context (although
ELMo used two independent LMs from each direction).
• Language understanding is bidirectional

Lecture 9:

Why are LMs unidirectional?


Bidirectional encoders
• Language models only use left context or right context (although
ELMo used two independent LMs from each direction).
• Language understanding is bidirectional
Masked language models (MLMs)

• Solution: Mask out 15% of the input words, and then predict the
masked words

• Too little masking: too expensive to train


• Too much masking: not enough context
Masked language models (MLMs)

A little more complication:

Because [MASK] is never seen when BERT is used…


Next sentence prediction (NSP)
Always sample two sentences, and predict whether the second sentence actually follows the first one.

Recent papers show that NSP is not necessary…


(Joshi*, Chen*, et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pre-training and fine-tuning

Pre-training Fine-tuning

Key idea: all the weights are fine-tuned on downstream


tasks
Applications
More details
• Input representations

• Use word pieces instead of words: playing => play ##ing (see Assignment 4)

• Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)

• Released two model sizes: BERT_base, BERT_large


Variants of
contextualized word vectors
Overview

Benyou Wang et al. Pre-trained Language Models in Biomedical Domain: A Systematic Survey. ACM Computing Surveys.
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Get bidirectional context – can condition on the future!
• How do we train them to build strong representations?

Encoder-decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining encoders: what pretraining objective to use?
So far, we’ve looked at language model pretraining. But encoders get bidirectional context,
so we can’t do language modeling!

Idea: replace some fraction of words in the


input with a special [MASK] token; predict
these words.

Only add loss terms from words that are "masked out." If x̃ is the masked version of x, we're learning p_θ(x | x̃). Called Masked LM. [Devlin et al., 2018]

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the “Masked LM” objective and released the weights of a
pretrained Transformer, a model they labeled BERT.

Some more details about Masked LM for BERT:


• Predict a random 15% of (sub)word tokens.
• Replace input word with [MASK] 80%
of the time
• Replace input word with a random token
10% of the time
• Leave input word unchanged 10% of the
time (but still predict it!)
• Why? Doesn’t let the model get complacent
and not build strong representations of non-
masked words. (No masks are seen at fine-
tuning time!)
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Devlin et al., 2018]
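A minimal sketch of the 80/10/10 masking rule described above, applied to a toy list of token ids; the mask probability, the mask id, and the vocabulary size are placeholders, not BERT's tokenizer constants.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (corrupted input, prediction targets); -100 marks positions with no loss."""
    inputs, targets = [], []
    for tok in token_ids:
        if random.random() < mask_prob:                      # selected for prediction
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                inputs.append(mask_id)                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(vocab_size))   # 10%: random token
            else:
                inputs.append(tok)                            # 10%: keep unchanged (still predicted)
        else:
            inputs.append(tok)
            targets.append(-100)                              # not selected: no loss term
    return inputs, targets

print(mask_tokens([12, 7, 99, 3, 42], vocab_size=30000, mask_id=103))
```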
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Get bidirectional context – can condition on the future!
• How do we train them to build strong representations?

Encoder-decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a prefix
of every input is provided to the encoder and is not predicted.

The encoder portion benefits from bidirectional context;


The decoder portion is used to train the whole model through
language modeling.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2018]
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2018 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with


unique placeholders; decode out the spans that were
removed!

This is implemented in text preprocessing: it’s


still an objective that looks like language
modeling at the decoder side.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2018]
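A minimal sketch of span-corruption preprocessing in the T5 style; the sentinel format and the example spans are illustrative and do not reproduce T5's exact tokenizer behavior.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a unique sentinel; the target decodes the removed spans."""
    corrupted, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    corrupted += tokens[prev:]
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 4), (7, 8)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your <extra_id_1> last week
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> party
```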
Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property of T5: it


can be finetuned to answer a
wide range of questions,
retrieving knowledge from its
parameters.

NQ: Natural Questions


WQ: WebQuestions
TQA: Trivia QA

All “open-domain” versions

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2018]
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Get bidirectional context – can condition on the future!
• How do we train them to build strong representations?

Encoder-decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Back to the language model
(next word predict)
Pretraining decoders
When using language-model-pretrained decoders, we can ignore that they were trained to model p_θ(w_t | w_1, …, w_(t−1)).

We can finetune them by training a classifier on the last word's hidden state:

h_1, …, h_T = Decoder(w_1, …, w_T)
y ∼ A h_T + b

where A and b are randomly initialized and specified by the downstream task.
Gradients backpropagate through the whole network.
[Note how the linear layer hasn't been pretrained and must be learned from scratch.]
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
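A minimal PyTorch sketch of this finetuning setup; the GRU below is only a stand-in for a pretrained decoder, and the sizes and labels are placeholders.

```python
import torch
import torch.nn as nn

d_model, num_classes = 768, 2
pretrained_decoder = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a pretrained LM decoder
classifier = nn.Linear(d_model, num_classes)                     # A, b: randomly initialized

x = torch.randn(8, 16, d_model)           # (batch, seq_len, d_model) token representations
h, _ = pretrained_decoder(x)               # hidden states h_1 ... h_T
logits = classifier(h[:, -1, :])           # classify from the last word's hidden state h_T
loss = nn.functional.cross_entropy(logits, torch.randint(0, num_classes, (8,)))
loss.backward()                             # gradients flow through the whole network
```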
Pretraining decoders

It’s natural to pretrain decoders as language models and then


use them as generators, finetuning their

This is helpful in tasks where the output


is a sequence with a vocabulary like that
at pretraining time!
• Dialogue (context=dialogue history)
• Summarization (context=document)

Here A and b were pretrained in the language model! [Note how the linear layer has been pretrained.]
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Increasingly convincing generations (GPT2) [Radford et al., 2018]
We mentioned how pretrained decoders can be used in their capacities as language
models. GPT-2, a larger version (1.5B) of GPT trained on more data, was shown to
produce relatively convincing samples of natural language.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
GPT-3, In-context learning, and very large models

So far, we’ve interacted with pretrained models in two ways:


• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about, and take their predictions.

Very large language models seem to perform some kind of learning without
gradient steps simply from examples you provide within their contexts.

GPT-3 is the canonical example of this. The largest T5 model had 11


billion parameters. GPT-3 has 175 billion parameters.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
LLaMA, Open-Source Models

Meta hopes to advance NLP research through LLaMA, particularly the academic exploration of large language models.

LLaMA can be customized for a variety of use cases, and it is especially well suited to research and non-commercial projects.

Through architectural optimizations, LLaMA achieves performance similar to GPT-3 while using fewer computational resources.

Llama 2: Open Foundation and Fine-Tuned Chat Models. https://arxiv.org/pdf/2307.09288


Phi-3: Small but Strong

Despite its compact size, the Phi-3 model has demonstrated performance on par with, or even superior to, larger models on various public academic benchmarks.
Phi-3's training method, inspired by children's learning, uses a "curriculum-
based" strategy. It starts with simplified data, gradually guiding the model to
grasp complex concepts.

Phi-3 adopts an architecture optimized specifically for mobile devices, with a


design that supports significant extension of the model's context length through
the LongRope system, thereby enhancing its ability to handle long-sequence data.

https://azure.microsoft.com/en-us/products/phi-3
Today’s lecture

• Language model in a narrow sense


(Probability theory, N-gram language model)

• Language model in broad sense

• More thoughts on language model


• LM (next-word prediction) is scalable
• LM does not need annotations
• LM is simple, so it is easy to adapt to many tasks
• LM could model human thoughts
• LM captures knowledge efficiently (imagine using images to record knowledge instead)
• Humans do language modeling every day (next-word / next-second prediction)
What can we learn from reconstructing the input?

I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____

Overall, the value I got from the two hours watching it was the sum total of the
popcorn and the drink. The movie was ___.

The woman walked across the street, checking for traffic over ___ shoulder.

I went to the ocean to see the fish, turtles, seals, and _____.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Tutorial

https://platform.openai.com/docs/libraries/python-library
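A minimal sketch of calling the OpenAI Python library documented at the link above; the model name is a placeholder, and it assumes OPENAI_API_KEY is set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a language model is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```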
Prompt Engineering

Related resource:
❖ https://www.promptingguide.ai/zh
❖ https://www.youtube.com/watch?v=dOxUroR57xs&ab_channel=ElvisSaravia
❖ https://github.com/dair-ai/Prompt-Engineering-Guide
Assignment 1: Using ChatGPT API

This will be released in the next week!


See updates in our BB system, WeChat and Emails.
Acknowledgement
• Princeton COS 484: Natural Language Processing.
Contextualized Word Embeddings. Fall 2019
• CS447: Natural Language Processing. Language Models.
http://courses.engr.illinois.edu/cs447
