
CSC6203

Large Language Model

Lecture 2: Large language model and beyond

Fall 2024
Benyou Wang
School of Data Science
Before the lecture …
OpenAI o1 is coming

Maybe there will never be a GPT-5?

https://openai.com/index/learning-to-reason-with-llms/
“We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America”
A score of 13.9/15 (obtained with re-ranking over 1,000 samples; re-ranking over 64 samples gives 12.5/15, i.e. 83%) places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
I: Scaling test-time computing (TTC)
Example of TTC: OVM
[Figure: starting from question q, the generator expands candidate reasoning steps level by level (Level 1–4); each branch of steps ends in a candidate answer (a_1, a_2, …).]

Inference-time tree search to enhance multi-step reasoning performance

Fei Yu, Anningzhe Gao, Benyou Wang. OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning.
https://arxiv.org/abs/2311.09724. Findings of NAACL 2024.
https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k
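To make the idea concrete, here is a minimal Python sketch of value-guided, step-level beam search in the spirit of OVM. The generator.propose_steps and value_model.score interfaces are hypothetical stand-ins for the fine-tuned generator and the outcome-supervised value model, not the paper's actual API.

```python
def tree_search(question, generator, value_model, beam_size=2, branch=3, max_levels=4):
    """Keep the beam_size partial solutions the value model rates most likely to reach a correct answer."""
    beams = [[]]  # each beam is a list of reasoning steps generated so far
    for _ in range(max_levels):
        candidates = []
        for steps in beams:
            # hypothetical: sample `branch` candidate next steps from the generator
            for step in generator.propose_steps(question, steps, n=branch):
                candidates.append(steps + [step])
        # hypothetical: the value model scores each partial solution's expected outcome
        candidates.sort(key=lambda s: value_model.score(question, s), reverse=True)
        beams = candidates[:beam_size]
        if all(s and s[-1].startswith("Answer:") for s in beams):
            break
    return beams[0]  # highest-value completed solution
```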
II: Chain of thought (CoT)

Instruction: Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format.
II: CoT (1)
So the user is requesting a bash script that can take a string representing a matrix, such as '[1,2],[3,4],[5,6]', and output its transpose in the same format.

Let's first understand the input and output formats.

Input: '[1,2],[3,4],[5,6]'
which corresponds to the matrix:

1 2
3 4
5 6

So the input is a series of rows, each in brackets, separated by commas.

The transpose of this matrix would be:

1 3 5
2 4 6

So the requested output is '[1,3,5],[2,4,6]'


II: CoT (2)
Overall Task:

Write a bash script that takes one argument (the string representing the matrix) and outputs its transpose in the same format.

Ok, let's think about how to parse the input string in bash, build the matrix (probably in arrays), transpose it, and then output the resulting
transposed matrix in the expected format.

Constraints:

- Since it's bash, we have to be careful with parsing strings.


- The script should accept the input string as an argument.

Approach:

- Parse the input string to extract the matrix elements.


- Build the matrix as an array of arrays.
- Transpose the matrix.
- Output the transposed matrix in the same format.
II: CoT (3)
Implementation Steps:

1. Capture input string as argument.


2. Remove any spaces (if any).
3. Parse the input string to extract numbers into arrays.
- Since the input is in the format '[1,2],[3,4],[5,6]', we can:
- Remove outer brackets if necessary.
- Split the string by '],' to get each row.
- For each row, remove '[' and ']', then split by ',' to get elements.

4. Build a 2D array in bash (arrays containing arrays).


- Since bash arrays are one-dimensional, we need to simulate 2D arrays.
5. Transpose the matrix.
- For transposing, we'll need to switch rows and columns.

6. Output the transposed matrix in the required format.


- Reconstruct the output string in the format '[1,3,5],[2,4,6]'

Now, start coding step by step.

Potential issues:

- Bash does not have multi-dimensional arrays natively, so we need to simulate them.


- We need to handle numbers that can be more than one digit.
II: CoT (4)
Let me try to simulate what happens.

I'll write this as an answer.

[code]

Save this script to a file, for example, transpose.sh, make it executable with
chmod +x transpose.sh, and run it:

./transpose.sh '[1,2],[3,4],[5,6]'

It will output:

[1,3,5],[2,4,6]
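The bash script produced by o1 is elided above ("[code]"). As a language-agnostic illustration of the steps the chain of thought plans (parse the string, transpose, reformat), here is a minimal Python sketch; it is not o1's answer, and a bash version would follow the same structure.

```python
import sys

def transpose_matrix_string(s: str) -> str:
    # Parse '[1,2],[3,4],[5,6]' into a list of rows.
    s = s.replace(" ", "")
    rows = [row.strip("[]").split(",") for row in s.split("],[")]
    # Transpose: element (i, j) becomes element (j, i).
    cols = list(zip(*rows))
    # Reformat as '[1,3,5],[2,4,6]'.
    return ",".join("[" + ",".join(col) + "]" for col in cols)

if __name__ == "__main__":
    print(transpose_matrix_string(sys.argv[1]))  # e.g. '[1,2],[3,4],[5,6]'
```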

System 1 vs. System 2

https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
To recap…
What is language modeling?

A language model assigns a probability to an n-gram:

f : V^n → R^+

A conditional language model assigns a probability to a word given some conditioning context:

g : (V^(n−1), V) → R^+

and p(w_n | w_1 ⋯ w_(n−1)) = g(w_1 ⋯ w_(n−1), w_n) = f(w_1 ⋯ w_n) / f(w_1 ⋯ w_(n−1))

p(w_n | w_1 ⋯ w_(n−1)) is the foundation of modern large language models (GPT, ChatGPT, etc.)
Language models: Narrow Sense
A probabilistic model that assigns a probability to every finite sequence (grammatical or not)

GPT-3 still acts in this way, but the model is implemented as a very large neural network with 175 billion parameters!
Language models: Broad Sense

❖ Decoder-only models (GPT-x models)
❖ Encoder-only models (BERT, RoBERTa, ELECTRA)
❖ Encoder-decoder models (T5, BART)

The latter two usually involve a different pre-training objective.
Today’s lecture

• Language model in a narrow sense


(Probability theory, N-gram language model)

• Language model in broad sense

• More thoughts on language model


Why do we need language models?
Many NLP tasks require natural language output:
- Machine translation: return text in the target language
- Speech recognition: return a transcript of what was spoken
- Natural language generation: return natural language text
- Spell-checking: return corrected spelling of input

Language models define probability distributions over (natural language)


strings or sentences.
➔ We can use a language model to score possible output strings so that we can choose the best (i.e. most likely) one: if P_LM(A) > P_LM(B), return A, not B
Hmmm, but…
… what does it mean for a language model to “define a
probability distribution”? [Google N-gram dataset]

… why would we want to define probability


distributions over languages? [evaluation]

… how can we construct a language model such that it


actually defines a probability distribution? [evaluation]

http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
Reminder:
Basic Probability Theory
Sampling with replacement
Pick a random shape, then put it back in the bag.

P( ) = 2/15 P( ) = 1/15 P( or ) = 2/15


P(blue) = 5/15 P(red) = 5/15 P( |red) = 3/5
P(blue | ) = 2/5 P( ) = 5/15
Sampling with replacement
Pick a random shape, then put it back in the bag.
What sequence of shapes will you draw?
P( )
= 1/15 ×1/15 ×1/15 ×2/15
= 2/50625
P( )
= 3/15 ×2/15 ×2/15 ×3/15
= 36/50625
P( ) = 2/15 P( ) = 1/15 P( or ) = 2/15
P(blue) = 5/15 P(red) = 5/15 P( |red) = 3/5
P(blue | ) = 2/5 P( ) = 5/15
Sampling with replacement
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(of) = 3/66 P(her) = 2/66


P(Alice) = 2/66 P(sister) = 2/66
P(was) = 2/66 P(,) = 4/66
P(to) = 2/66 P(') = 4/66
Sampling with replacement
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(her) = 2/66


P(Alice) = 2/66 P(sister) = 2/66
P(was) = 2/66 P(,) = 4/66
P(to) = 2/66 P(') = 4/66

In this model, P(English sentence) = P(word salad)


Probability theory: terminology
Trial (aka “experiment”)
Picking a shape, predicting a word
Sample space Ω:
The set of all possible outcomes
(all shapes; all words in Alice in Wonderland)
Event ω ⊆ Ω:
An actual outcome (a subset of Ω)
(predicting ‘the’, picking a triangle)
Random variable X: Ω → T
A function from the sample space (often the identity function)
Provides a ‘measurement of interest’ from a trial/experiment
(Did we pick ‘Alice’/a noun/a word starting with “x”/…?)
What is a probability distribution?
P(ω) defines a distribution over Ω iff

1) Every event ω has a probability P(ω) between 0 and 1:
   0 ≤ P(ω ⊆ Ω) ≤ 1

2) The null event ∅ has probability 0:
   P(∅) = 0

3) And the probability of all disjoint events sums to 1.


Joint and Conditional Probability
The conditional probability of X given Y, P(X | Y ),
is defined in terms of the probability of Y, P( Y ),
and the joint probability of X and Y, P(X,Y ):

P(X | Y) = P(X, Y) / P(Y)
P(blue | ) = 2/5
The chain rule
The joint probability P(X,Y) can also be expressed in
terms of the conditional probability P(X | Y)
P(X, Y ) = P(X|Y )P(Y )

This leads to the so-called chain rule:
P(X_1, X_2, …, X_n) = P(X_1) · P(X_2 | X_1) · … · P(X_n | X_1, …, X_(n−1))


Independence
Two random variables X and Y are independent if
P(X, Y ) = P ( X ) P ( Y )

If X and Y are independent, then P(X | Y) = P(X):


P(X | Y) = P(X, Y) / P(Y)
         = P(X) P(Y) / P(Y)    (X, Y independent)
         = P(X)
Probability models
Building a probability model consists of two steps:
1. Defining the model
2. Estimating the model’s parameters
(= training/learning )

Models (almost) always make


independence assumptions.
That is, even though X and Y are not actually independent,
our model may treat them as independent.

This reduces the number of model parameters that


we need to estimate (e.g. from n² to 2n)
Language modeling with n-grams
Language modeling with N-grams
A language model over a vocabulary V
assigns probabilities to strings drawn from V*.

Recall the chain rule:


P(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(1))

An n-gram language model assumes each word


depends only on the last n−1 words:
P_ngram(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(i−n+1))
N-gram models
N-gram models assume each word (event)
depends only on the previous n−1 words (events):
Unigram model: P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i))

Bigram model: P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1))

Trigram model: P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1), w(i−2))

Such independence assumptions are called


Markov assumptions (of order n−1).
A unigram model for Alice
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(her) = 2/66


P(Alice) = 2/66 P(sister) = 2/66
P(was) = 2/66 P(,) = 4/66
P(to) = 2/66 P(') = 4/66

In this model, P(English sentence) = P(word salad)


A bigram model for Alice
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(w(i) = of | w(i–1) = tired) = 1
P(w(i) = of | w(i–1) = use) = 1
P(w(i) = sister | w(i–1) = her) = 1
P(w(i) = beginning | w(i–1) = was) = 1/2
P(w(i) = reading | w(i–1) = was) = 1/2
P(w(i) = bank | w(i–1) = the) = 1/3
P(w(i) = book | w(i–1) = the) = 1/3
P(w(i) = use | w(i–1) = the) = 1/3
Where do we get the probabilities from?
Learning (estimating) a language model
Where do we get the parameters of our model
(its actual probabilities) from?
P(w(i) = ‘the’ | w(i–1) = ‘on’) = ???
We need (a large amount of) text as training data
to estimate the parameters of a language model.

The most basic parameter estimation technique:


relative frequency estimation (= counts)
P(w(i) = ‘the’ | w(i–1) = ‘on’) = C(‘on the’) / C(‘on’)
Also called Maximum Likelihood Estimation (MLE)

NB: MLE assigns all probability mass to events


that occur in the training corpus.
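As a concrete illustration, here is a minimal Python sketch of relative frequency (MLE) estimation for a bigram model; the toy corpus and whitespace tokenization are placeholders.

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()  # toy training data

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))

def p_bigram(w, prev):
    """MLE estimate: P(w | prev) = C(prev, w) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # unseen history: MLE gives no estimate; return 0 here
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("cat", "the"))  # C('the cat') / C('the') = 2/3
```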
Are n-gram models actual language models?
How do n-gram models define P(L)?
An n-gram model defines P_ngram(w(1) … w(N)) in terms of the probability of predicting each word:

P_bigram(w(1) … w(N)) = ∏_{i=1…N} P(w(i) | w(i−1))

With a fixed vocabulary V, it's easy to make sure P(w(i) | w(i−1)) is a distribution:

∑_{i=1…|V|} P(w_i | w_j) = 1   and   ∀ i,j : 0 ≤ P(w_i | w_j) ≤ 1

If P(w(i) | w(i−1)) is a distribution, this model defines


one distribution (over all strings) for each length N

But the strings of a language L don’t all have the same length
English = {“yes!”, “I agree”, “I see you”, …}
And there is no Nmax that limits how long strings in L can get.

Solution: the EOS (end-of-sentence) token!


How do n-gram models define P(L)?
Think of a language model as a stochastic process:
- At each time step, randomly pick one more word.
- Stop generating more words when the word you pick is a special end-
of-sentence (EOS) token.
To be able to pick the EOS token, we have to modify our
training data so that each sentence ends in EOS.
This means our vocabulary is now VEOS = V ∪{EOS}
We then get an actual language model,
i.e. a distribution over strings of any length
Technically, this is only true because P(EOS | …) will be high enough that we are always
guaranteed to stop after having generated a finite number of words

Why do we care about having one model for all lengths?


We can now compare the probabilities of strings of different
lengths, because they’re computed by the same distribution.
A couple more modifications…
Handling unknown words: UNK
Training:
- Assume a fixed vocabulary (e.g. all words that occur at least
n times in the training corpus)
- Replace all other words in the corpus by a token <UNK>
- Estimate the model on this modified training corpus.

Testing (e.g to compute probability of a string):


- Replace any words not in the vocabulary by <UNK>

Refinements:
use different UNK tokens for different types of words
(numbers, etc.).
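A minimal sketch of the UNK preprocessing described above, assuming a whitespace-tokenized corpus and a simple count threshold:

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep words occurring at least min_count times; everything else becomes <UNK>."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def apply_unk(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "a a a b b c".split()
vocab = build_vocab(train, min_count=2)      # {'a', 'b'}
print(apply_unk("a c d".split(), vocab))     # ['a', '<UNK>', '<UNK>']
```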
What about the beginning of the sentence?
In a trigram model
P(w(1)w(2)w(3)) = P(w(1))P(w(2) | w(1))P(w(3) | w(2), w(1))
only the third term P(w(3) | w(2), w(1)) is an actual trigram
probability. What about P(w(1)) and P(w(2) | w(1)) ?

If this bothers you:


Add n–1 beginning-of-sentence (BOS) symbols to
each sentence for an n–gram model:
BOS1 BOS2 Alice was …
Now the unigram and bigram probabilities
involve only BOS symbols.
Using language models
How do we use language models?
Independently of any application, we can use a
language model as a random sentence generator
(i.e we sample sentences according to their language model
probability)

Systems for applications such as machine translation,


speech recognition, spell-checking, generation, often
produce multiple candidate sentences as output.
- We prefer output sentences SOut that have a higher probability
- We can use a language model P(SOut) to score and rank these
different candidate output sentences, e.g. as follows:
argmax_{S_Out} P(S_Out | Input) = argmax_{S_Out} P(Input | S_Out) · P(S_Out)

Example: language model in information retrieval.


An example of ASR to use language models

这儿有周杰伦演唱会 (There is a Jay Chou concert!)

• Acoustic model alone: 周杰轮? 周捷伦? (wrong-character homophones of the name)

• Acoustic model + language model: 周杰伦 (the correct name)


Using n-gram models to generate language
Generating from a distribution
How do you generate text from an n-gram model?

That is, how do you sample from a distribution P(X |Y=y)?


- Assume X has N possible outcomes (values): {x1, …, xN}
and P(X=xi | Y=y) = pi
- Divide the interval [0,1] into N smaller intervals according to the probabilities of the outcomes
- Generate a random number r between 0 and 1.
- Return the outcome x_i whose interval the number falls in.

[Diagram: the interval [0,1] is split at p_1, p_1+p_2, p_1+p_2+p_3, p_1+p_2+p_3+p_4 into segments for x_1 … x_5; r falls into one of the segments.]
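A minimal Python sketch of this inverse-CDF sampling procedure (the outcomes and probabilities below are made-up examples):

```python
import random

def sample(outcomes, probs):
    """Sample one outcome: draw r in [0,1) and find the interval it falls into."""
    r = random.random()
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if r < cumulative:
            return x
    return outcomes[-1]  # guard against floating-point round-off

words = ["the", "cat", "sat"]
probs = [0.5, 0.3, 0.2]
print(sample(words, probs))
```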
Generating the Wall Street Journal
Generating Shakespeare
Shakespeare as corpus
The Shakespeare corpus consists of N=884,647 word
tokens and a vocabulary of V=29,066 word types

Shakespeare produced 300,000 bigram types


out of V² = 844 million possible bigram types.

99.96% of possible bigrams don’t occur in the corpus.

Our relative frequency estimate assigns non-zero


probability to only 0.04% of the possible bigrams
That percentage is even lower for trigrams, 4-grams, etc.

Use data from https://huggingface.co/datasets/Trelis/tiny-shakespeare or


https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
MLE doesn’t capture unseen events
We estimated a model on 440K word tokens, but:

Only 30,000 word types occur in the training data


Any word that does not occur in the training data
has zero probability!

Only 0.04% of all possible bigrams (over 30K word


types) occur in the training data
Any bigram that does not occur in the training data
has zero probability (even if we have seen both words in
the bigram)
How do we assign non-zero probability to unseen events?
We have to “smooth” our distributions to assign some
probability mass to unseen events
MLE model: P(seen) = 1.0, P(unseen) = 0.0
Smoothed model: P(seen) < 1.0, P(unseen) > 0.0

We won’t talk much about smoothing this year.


Smoothing methods
Add-one smoothing:
Hallucinate counts that didn’t occur in the data

Linear interpolation:
P̃(w | w′, w″) = λ · P̂(w | w′, w″) + (1 − λ) · P̃(w | w′)
Interpolate n-gram model with (n–1)-gram model.

Absolute Discounting: Subtract a constant from the counts of frequent (seen) events and redistribute that probability mass to rare/unseen events
Kneser-Ney: AD with modified unigram probabilities
Add-One (Laplace) Smoothing
A really simple way to do smoothing:
Increment the actual observed count of every possible
event (e.g. bigram) by a hallucinated count of 1
(or by a hallucinated count of some k with 0<k<1).

Shakespeare bigram model (roughly):


0.88 million actual bigram counts
+ 844.xx million hallucinated bigram counts

Oops. Now almost none of the counts in our model


come from actual data. We’re back to word salad.
k needs to be really small. But it turns out that this still doesn't work very well.
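A minimal sketch of add-k smoothing for the bigram estimate above (the counts, vocabulary size, and k are toy values):

```python
def p_addk(w, prev, bigram_counts, unigram_counts, vocab_size, k=0.1):
    """Add-k smoothed bigram probability:
    P(w | prev) = (C(prev, w) + k) / (C(prev) + k * |V|)."""
    return (bigram_counts.get((prev, w), 0) + k) / (
        unigram_counts.get(prev, 0) + k * vocab_size
    )

bigrams = {("the", "cat"): 2}
unigrams = {"the": 3}
print(p_addk("cat", "the", bigrams, unigrams, vocab_size=10))  # (2+0.1)/(3+1.0) = 0.525
print(p_addk("dog", "the", bigrams, unigrams, vocab_size=10))  # (0+0.1)/(3+1.0) = 0.025
```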
Evaluation
Intrinsic vs Extrinsic Evaluation
How do we know whether one language
model is better than another?

There are two ways to evaluate models:


- intrinsic evaluation captures how well the model
captures what it is supposed to capture (e.g.
probabilities)
- extrinsic (task-based) evaluation captures how useful
the model is in a particular task.

Both cases require an evaluation metric that allows us


to measure and compare the performance of different
models.
Intrinsic Evaluation of
Language Models: Perplexity
Perplexity
The perplexity of a language model is the inverse of the probability it assigns to the test set, normalized by the number of tokens N in the test set.

If a LM assigns probability P(w_1, …, w_N) to a test corpus w_1 … w_N, the LM's perplexity is

PP(w_1 … w_N) = P(w_1 … w_N)^(−1/N) = the N-th root of 1 / P(w_1 … w_N)

A LM with lower perplexity is better because it assigns


a higher probability to the unseen test corpus.
LM1 and LM2’s perplexity can only be compared if they use the same vocabulary
— Trigram models have lower perplexity than bigram models;
— Bigram models have lower perplexity than unigram models, etc.
Practical issues

• Since language model probabilities are very small, multiplying them together often leads to underflow.

• It is often better to use logarithms instead: replace the product ∏_i P(w_i | w_1 … w_(i−1)) with a sum of log probabilities, i.e. compute log PP(w_1 … w_N) = −(1/N) ∑_i log P(w_i | w_1 … w_(i−1)).
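A minimal sketch of computing perplexity from per-token conditional probabilities; the probabilities below are placeholders for whatever the LM assigns to a test string:

```python
import math

def perplexity(token_probs):
    """PP = exp(-(1/N) * sum(log p_i)); summing logs avoids underflow."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# e.g. a 4-token test string with these conditional probabilities under the LM:
print(perplexity([0.2, 0.1, 0.5, 0.05]))
```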
Extrinsic (Task-Based)
Evaluation of LMs:
Word Error Rate
Intrinsic vs. Extrinsic Evaluation
Perplexity tells us which LM assigns a higher
probability to unseen text

This doesn’t necessarily tell us which LM is better for


our task (i.e. is better at scoring candidate sentences)

Task-based evaluation:
- Train model A, plug it into your system for performing task T
- Evaluate performance of system A on task T.
- Train model B, plug it in, evaluate system B on same task T.
- Compare scores of system A and system B on task T.
Word Error Rate (WER)
Originally developed for speech recognition.

How much does the predicted sequence of words


differ from the actual sequence of words in the correct
transcript?
WER = (Insertions + Deletions + Substitutions) / (# of words in the actual transcript)

Insertions: “eat lunch” → “eat a lunch”


Deletions: “see a movie” → “see movie”
Substitutions: “drink ice tea” → “drink nice tea”
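A minimal sketch of WER via edit distance between the reference transcript and the hypothesis; whitespace tokenization is an assumption:

```python
def wer(reference, hypothesis):
    """Word error rate = minimum edit distance (ins + del + sub) / # reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("drink ice tea", "drink nice tea"))  # 1 substitution / 3 words ≈ 0.33
```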
To recap….
Key concepts in summary
N-gram language models
Independence assumptions
Getting from n-grams to a distribution over a language
Relative frequency (maximum likelihood) estimation
Smoothing
Intrinsic evaluation: Perplexity
Extrinsic evaluation: WER
Contents

• Language model in a narrow sense


(Probability theory, N-gram language model)

• Language model in broad sense


(BERT and beyond)

• More thoughts on language model


More on N-gram LMs
N-gram Language Models

the students opened their


•Question: How to learn a Language Model?
•Answer (pre- Deep Learning): learn an n-gram Language Model!

•Definition: An n-gram is a chunk of n consecutive words.


•unigrams: “the”, “students”, “opened”, ”their”
•bigrams: “the students”, “students opened”, “opened their”
•trigrams: “the students opened”, “students opened their”
•four-grams: “the students opened their”

•Idea: Collect statistics about how frequent different n-grams are and use
these to predict next word.
N-gram Language Models
• First we make a Markov assumption: x^(n) depends only on the preceding n−1 words

•Question: How do we get these n-gram and (n-1)-gram probabilities?


•Answer: By counting them in some large corpus of text!
(statistical approximation)
N-gram Language Models: Example
Suppose we are learning a 4-gram Language Model.

as the proctor started the clock, the students opened their ___

(discard the earlier words; condition on the last n−1 = 3 words, "students opened their")

For example, suppose that in the corpus:


• “students opened their” occurred 1000 times
• “students opened their books” occurred 400 times
• P(books | students opened their) = 0.4
• “students opened their exams” occurred 100 times
• P(exams | students opened their) = 0.1
Sparsity Problems with n-gram Language Models
Sparsity Problem 1
Problem: What if "students opened their w" never occurred in the data? Then w has probability 0!
(Partial) Solution: Add a small δ to the count for every w ∈ V. This is called smoothing.

Sparsity Problem 2
Problem: What if "students opened their" never occurred in the data? Then we can't calculate a probability for any w!
(Partial) Solution: Just condition on "opened their" instead. This is called backoff.

Note: Increasing n makes sparsity problems worse.


Typically, we can’t have n bigger than 5.
Storage Problems with n-gram Language Models

Storage: Need to store


count for all n-grams
you saw in the corpus.

Increasing n or increasing
corpus increases model size!

How to build a neural language model?
• Recall the Language Modeling task:
• Input: sequence of words
• Output: prob. dist. of the next word

• How about a window-based neural model?


• We saw this applied to Named Entity Recognition: e.g., predicting the label LOCATION for the center word "Paris" in the window "museums in Paris are amazing".


A fixed-window neural Language Model

[Figure: the input words "the students opened their" as one-hot vectors → concatenated word embeddings → hidden layer → output distribution over the vocabulary (e.g. "books", "laptops", …, "a", "zoo").]
A fixed-window neural Language Model
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model

Improvements over n-gram LM:
• No sparsity problem
• Don't need to store all observed n-grams

Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges W
• Window can never be large enough!
• x(1) and x(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed.

We need a neural architecture that can process input of any length.
Recurrent NN is the solution!
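Before moving on, here is a minimal PyTorch sketch of the fixed-window neural LM just described; the vocabulary size, embedding size, window size, and hidden size are arbitrary placeholders, not Bengio et al.'s settings.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, window=4, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)  # W: its size grows with the window
        self.out = nn.Linear(hidden_dim, vocab_size)           # scores over the vocabulary

    def forward(self, window_ids):               # window_ids: (batch, window)
        e = self.emb(window_ids)                  # (batch, window, emb_dim)
        x = e.flatten(start_dim=1)                # concatenate the window's embeddings
        h = torch.tanh(self.hidden(x))
        return self.out(h)                        # logits for the next word

model = FixedWindowLM()
logits = model(torch.randint(0, 10000, (2, 4)))   # e.g. "the students opened their" as ids
probs = torch.softmax(logits, dim=-1)             # distribution over next words
print(probs.shape)                                # torch.Size([2, 10000])
```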
From N-gram LMs to Word vectors
Byproducts of NNLM: word embeddings

[Figure: the same fixed-window network ("the students opened their" → distribution over "books", "laptops", …, "a", "zoo"); its input embedding layer provides word embeddings/vectors as a byproduct.]
How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)


❏ the idea that is represented by a word, phrase, etc.
❏ the idea that a person wants to express by using words, signs, etc.
❏ the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:


❏ signifier (symbol) ⟺ signified (idea or thing)
= denotational semantics
❏ Tree ⟺ {🌳, 🌲, 🌴, …}

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Representing words as discrete symbols

❏ In traditional NLP, we regard words as discrete symbols:


hotel, conference, motel – a localist representation

❏ Such symbols for words can be represented by one-hot vectors:


motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

❏ Vector dimension = number of words in vocabulary (e.g., 500,000+)

These two vectors are orthogonal


There is no natural notion of similarity for one-hot vectors!

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Representing words by their context
Distributional semantics: A word’s meaning is given by the words that frequently
appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby
(within a fixed-size window).
• We use the many contexts of w to build up a representation of w

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Word2Vec Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
Idea:
• We have a large corpus (“body”) of text: a long list of words
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o
given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Word2vec: objective function
❏ We want to minimize the objective function: the average negative log-likelihood of context words given center words (see the formulas below)

❏ Question: How do we calculate P(w_{t+j} | w_t; θ)?

Answer: We will use two vectors per word w:

❏ v_w when w is a center word
❏ u_w when w is a context word

Then for a center word c and a context word o, use the softmax (see below):
"max" because softmax amplifies the probability of the largest values
"soft" because it still assigns some probability to smaller values
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
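The formulas on this slide are images in the source; for reference, the standard skip-gram objective and softmax, in the notation of the cited CS224n slides, are:

```latex
% Objective: average negative log-likelihood over positions t and window offsets j
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)

% Softmax over the vocabulary, with u_o the context ("outside") vector and v_c the center vector
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```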
Word structure and subword models
We assume a fixed vocab of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single UNK.

Finite vocabulary assumptions make even less sense in many languages.


• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Interesting characters/words

• 夵 《广韵》《集韵》并以冉切，音琰 (yan3)。物上大下小也。又《集韵》他刀切，音叨 (tao1)。进也。
  (A rare character: per the Guangyun/Jiyun rhyme dictionaries it is read yan3, meaning "big on top, small below"; Jiyun also gives the reading tao1, meaning "to advance".)
• LGUer
• Looooooooong
A paper of ours: MorphTE

Guobing Gan, Peng Zhang, Sunzhu Li, Xiuqing Lu, Benyou Wang. MorphTE: Injecting Morphology in Tensorized Embeddings. NeurIPS 2022
From static word vector to
contextualized word vectors
What’s wrong with word2vec?

• One vector for each word type

• Complex characteristics of word use: semantics, syntactic behavior, and connotations

• Polysemous words, e.g., bank, mouse


Contextualized word embeddings

Let’s build a vector for each word conditioned on its context!

Contextualized word embeddings

the movie was terribly exciting !

f : (w_1, w_2, …, w_n) ⟶ x_1, …, x_n ∈ ℝ^d


ELMo
• NAACL’18: Deep contextualized word representations

• Key idea:

• Train an LSTM-based language model on some


large corpus
• Use the hidden states of the LSTM for each token
to compute a vector representation of each word
ELMo

[Figure: the ELMo biLM — stacked forward and backward LSTMs run over the words in the sentence, from the input embeddings up to a softmax over the vocabulary.]
How to use ELMo?
h^LM_{k,0} = x^LM_k,    h^LM_{k,j} = [→h^LM_{k,j} ; ←h^LM_{k,j}]   for j = 1 … L (the # of layers)

ELMo^task_k = γ^task · Σ_{j=0…L} s^task_j · h^LM_{k,j}

• γ^task: allows the task model to scale the entire ELMo vector
• s^task_j: softmax-normalized weights across layers
• Plug ELMo into any (neural) NLP model: freeze all the LM's weights and change the input representation to [x_k ; ELMo^task_k]

(could also insert into higher layers)


Use ELMo in practice
https://allennlp.org/elmo

Also available in TensorFlow


BERT
• First released in Oct 2018.
• NAACL’19: BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding

How is BERT different from ELMo?

#1. Unidirectional context vs bidirectional context

#2. LSTMs vs Transformers (will talk later)

#3. The weights are not frozen; this is called fine-tuning


Bidirectional encoders
• Language models only use left context or right context (although
ELMo used two independent LMs from each direction).
• Language understanding is bidirectional

Lecture 9:

Why are LMs unidirectional?


Bidirectional encoders
• Language models only use left context or right context (although
ELMo used two independent LMs from each direction).
• Language understanding is bidirectional
Masked language models (MLMs)

• Solution: Mask out 15% of the input words, and then predict the
masked words

• Too little masking: too expensive to train


• Too much masking: not enough context
Masked language models (MLMs)

A little more complication:

Because [MASK] is never seen when BERT is used…


Next sentence prediction (NSP)
Always sample two sentences, and predict whether the second sentence actually follows the first one.

Recent papers show that NSP is not necessary…


(Joshi*, Chen*, et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pre-training and fine-tuning

Pre-training Fine-tuning

Key idea: all the weights are fine-tuned on downstream


tasks
Applications
More details
• Input representations

• Use word pieces instead of words: playing => play ##ing (see Assignment 4)

• Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)

• Released two model sizes: BERT_base, BERT_large


Variants of
contextualized word vectors
Overview

Benyou Wang et al. Pre-trained Language Models in Biomedical Domain: A Systematic Survey. ACM Computing Surveys.
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Get bidirectional context – can condition on the future!
• How do we train them to build strong representations?

Encoder-decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining encoders: what pretraining objective to use?
So far, we’ve looked at language model pretraining. But encoders get bidirectional context,
so we can’t do language modeling!

Idea: replace some fraction of words in the


input with a special [MASK] token; predict
these words.

Only add loss terms from words that are "masked out." If x̃ is the masked version of x, we're learning p_θ(x | x̃). Called Masked LM. [Devlin et al., 2018]

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the “Masked LM” objective and released the weights of a
pretrained Transformer, a model they labeled BERT.

Some more details about Masked LM for BERT:


• Predict a random 15% of (sub)word tokens.
• Replace input word with [MASK] 80%
of the time
• Replace input word with a random token
10% of the time
• Leave input word unchanged 10% of the
time (but still predict it!)
• Why? Doesn’t let the model get complacent
and not build strong representations of non-
masked words. (No masks are seen at fine-
tuning time!)
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Devlin et al., 2018]
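A minimal sketch of the 80/10/10 masking rule described above, applied to a toy list of token ids; the mask probability, the mask id, and the vocabulary size are placeholders, not BERT's tokenizer constants.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (corrupted input, prediction targets); -100 marks positions with no loss."""
    inputs, targets = [], []
    for tok in token_ids:
        if random.random() < mask_prob:                      # selected for prediction
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                inputs.append(mask_id)                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(vocab_size))   # 10%: random token
            else:
                inputs.append(tok)                            # 10%: keep unchanged (still predicted)
        else:
            inputs.append(tok)
            targets.append(-100)                              # not selected: no loss term
    return inputs, targets

print(mask_tokens([12, 7, 99, 3, 42], vocab_size=30000, mask_id=103))
```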
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Get bidirectional context – can condition on the future!
• How do we train them to build strong representations?

Encoder-decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a prefix
of every input is provided to the encoder and is not predicted.

The encoder portion benefits from bidirectional context;


The decoder portion is used to train the whole model through
language modeling.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2018]
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2018 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with


unique placeholders; decode out the spans that were
removed!

This is implemented in text preprocessing: it’s


still an objective that looks like language
modeling at the decoder side.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2018]
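A minimal sketch of span-corruption preprocessing in the T5 style; the sentinel format and the example spans are illustrative and do not reproduce T5's exact tokenizer behavior.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a unique sentinel; the target decodes the removed spans."""
    corrupted, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    corrupted += tokens[prev:]
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 4), (7, 8)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your <extra_id_1> last week
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> party
```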
Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property of T5: it


can be finetuned to answer a
wide range of questions,
retrieving knowledge from its
parameters.

NQ: Natural Questions


WQ: WebQuestions
TQA: Trivia QA

All “open-domain” versions

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2018]
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Get bidirectional context – can condition on the future!
• How do we train them to build strong representations?

Encoder-decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Back to the language model
(next word predict)
Pretraining decoders
When using language-model-pretrained decoders, we can ignore that they were trained to model p_θ(w_t | w_1, …, w_(t−1)).

We can finetune them by training a classifier on the last word's hidden state:

h_1, …, h_T = Decoder(w_1, …, w_T)
y ∼ A h_T + b

where A and b are randomly initialized and specified by the downstream task.
Gradients backpropagate through the whole network.
[Note how the linear layer hasn't been pretrained and must be learned from scratch.]
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
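A minimal PyTorch sketch of this finetuning setup; the GRU below is only a stand-in for a pretrained decoder, and the sizes and labels are placeholders.

```python
import torch
import torch.nn as nn

d_model, num_classes = 768, 2
pretrained_decoder = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a pretrained LM decoder
classifier = nn.Linear(d_model, num_classes)                     # A, b: randomly initialized

x = torch.randn(8, 16, d_model)           # (batch, seq_len, d_model) token representations
h, _ = pretrained_decoder(x)               # hidden states h_1 ... h_T
logits = classifier(h[:, -1, :])           # classify from the last word's hidden state h_T
loss = nn.functional.cross_entropy(logits, torch.randint(0, num_classes, (8,)))
loss.backward()                             # gradients flow through the whole network
```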
Pretraining decoders

It’s natural to pretrain decoders as language models and then


use them as generators, finetuning their

This is helpful in tasks where the output


is a sequence with a vocabulary like that
at pretraining time!
• Dialogue (context=dialogue history)
• Summarization (context=document)

Here A and b were pretrained in the language model! [Note how the linear layer has been pretrained.]
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Increasingly convincing generations (GPT2) [Radford et al., 2018]
We mentioned how pretrained decoders can be used in their capacities as language
models. GPT-2, a larger version (1.5B) of GPT trained on more data, was shown to
produce relatively convincing samples of natural language.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
GPT-3, In-context learning, and very large models

So far, we’ve interacted with pretrained models in two ways:


• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about, and take their predictions.

Very large language models seem to perform some kind of learning without
gradient steps simply from examples you provide within their contexts.

GPT-3 is the canonical example of this. The largest T5 model had 11


billion parameters. GPT-3 has 175 billion parameters.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
LLaMA, Open-Source Models

Meta hopes to advance NLP research through LLaMA, particularly the academic exploration of large language models.

LLaMA can be customized for a variety of use cases, and it is especially well suited to research and non-commercial projects.

Through architectural optimizations, LLaMA achieves performance similar to GPT-3 while using fewer computational resources.

Llama 2: Open Foundation and Fine-Tuned Chat Models. https://arxiv.org/pdf/2307.09288


Phi-3: Small but Strong

Despite its compact size, the Phi-3 model has demonstrated performance on par with, or even superior to, larger models on various public academic benchmarks.
Phi-3's training method, inspired by children's learning, uses a "curriculum-
based" strategy. It starts with simplified data, gradually guiding the model to
grasp complex concepts.

Phi-3 adopts an architecture optimized specifically for mobile devices, with a


design that supports significant extension of the model's context length through
the LongRope system, thereby enhancing its ability to handle long-sequence data.

https://azure.microsoft.com/en-us/products/phi-3
Today’s lecture

• Language model in a narrow sense


(Probability theory, N-gram language model)

• Language model in broad sense

• More thoughts on language model


• LM (next-word prediction) is scalable
• LM does not need annotations
• LM is simple, so it is easy to adapt to many tasks
• LM could model human thoughts
• LM captures knowledge efficiently (imagine using images to record knowledge instead)
• Humans do language modeling every day (next-word / next-second prediction)
What can we learn from reconstructing the input?

I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____

Overall, the value I got from the two hours watching it was the sum total of the
popcorn and the drink. The movie was ___.

The woman walked across the street, checking for traffic over ___ shoulder.

I went to the ocean to see the fish, turtles, seals, and _____.

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Tutorial

https://platform.openai.com/docs/libraries/python-library
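A minimal sketch of calling the OpenAI Python library documented at the link above; the model name is a placeholder, and it assumes OPENAI_API_KEY is set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a language model is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```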
Prompt Engineering

Related resource:
❖ https://www.promptingguide.ai/zh
❖ https://www.youtube.com/watch?v=dOxUroR57xs&ab_channel=ElvisSaravia
❖ https://github.com/dair-ai/Prompt-Engineering-Guide
Assignment 1: Using ChatGPT API

This will be released in the next week!


See updates in our BB system, WeChat and Emails.
Acknowledgement
• Princeton COS 484: Natural Language Processing.
Contextualized Word Embeddings. Fall 2019
• CS447: Natural Language Processing. Language Models.
http://courses.engr.illinois.edu/cs447
