Lecture 2: Language Model
Fall 2024
Benyou Wang
School of Data Science
Before the lecture …
OpenAI o1 is coming
https://openai.com/index/learning-to-reason-with-llms/
“We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America”
A score of 13.9/15 (93%, with re-ranking over 1000 samples; 83%, i.e., 12.5/15, with consensus among 64 samples) places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
I: Scaling test-time computing (TTC)
Example of TTC: OVM
[Figure: given a question q, the generator expands candidate reasoning steps level by level (Level 1–4); different step sequences s lead to different final answers a1, a2, ….]
Fei Yu, Anningzhe Gao, Benyou Wang. OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning.
https://arxiv.org/abs/2311.09724. Findings of NAACL 2024.
https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k
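As an illustration of how a value model can scale test-time compute, here is a minimal best-of-N re-ranking sketch. `generate_solutions` and `score_with_value_model` are hypothetical stand-ins for a generator LM and an outcome-supervised value model; this is not the OVM codebase.

```python
# Minimal sketch of verifier-based re-ranking (best-of-N) at test time.
from typing import Callable, List, Tuple

def best_of_n(question: str,
              generate_solutions: Callable[[str, int], List[str]],
              score_with_value_model: Callable[[str, str], float],
              n: int = 64) -> Tuple[str, float]:
    """Sample n candidate solutions and keep the one the value model ranks highest."""
    candidates = generate_solutions(question, n)                      # n sampled reasoning paths
    scored = [(sol, score_with_value_model(question, sol)) for sol in candidates]
    return max(scored, key=lambda pair: pair[1])                      # (best solution, its score)
```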
II: Chain of thought
Input: '[1,2],[3,4],[5,6]'
which corresponds to the matrix:
1 2
3 4
5 6
Write a bash script that takes one argument (the string representing the matrix) and outputs its transpose in the same format.
Ok, let's think about how to parse the input string in bash, build the matrix (probably in arrays), transpose it, and then output the resulting
transposed matrix in the expected format.
Constraints:
Approach:
[code]
Save this script to a file, for example, transpose.sh, make it executable with
chmod +x transpose.sh, and run it:
./transpose.sh '[1,2],[3,4],[5,6]'
It will output:
[1,3,5],[2,4,6]
…
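The script body itself is elided above ([code]). Purely as an illustration of the task, here is a Python sketch of the same transformation; the prompt asks for bash, so this is not the model's answer.

```python
# Parse '[1,2],[3,4],[5,6]', transpose the matrix, and print '[1,3,5],[2,4,6]'.
def transpose_matrix_string(s: str) -> str:
    # '[1,2],[3,4],[5,6]' -> [[1, 2], [3, 4], [5, 6]]
    rows = [list(map(int, part.split(','))) for part in s.strip('[]').split('],[')]
    cols = zip(*rows)                                                 # transpose
    return ','.join('[' + ','.join(map(str, col)) + ']' for col in cols)

print(transpose_matrix_string('[1,2],[3,4],[5,6]'))                   # [1,3,5],[2,4,6]
```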
System 1 vs. System 2
https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
To recap…
What is language modeling?
P(w(n) | w(1) ⋯ w(n−1)) is the foundation of modern large language models (GPT, ChatGPT, etc.)
Language models: Narrow Sense
A probabilistic model that assigns a probability to every finite sequence (grammatical or not)
GPT-3 still acts in this way, but the model is implemented as a very large neural network with 175 billion parameters!
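To make the narrow-sense definition concrete, here is a sketch that scores a sentence with the chain rule using a pretrained GPT-2 via the Hugging Face transformers library (assumed to be installed; any causal LM checkpoint would do).

```python
# Sketch: score a sentence with the chain rule p(w_1..w_n) = prod_t p(w_t | w_<t).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("I went to the ocean to see the fish", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                           # [1, seq_len, vocab]
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)    # p(w_t | w_<t) for t >= 2
token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print("log p(sentence) ≈", token_lp.sum().item())        # sum of conditional log-probs
```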
Language models: Broad Sense
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
Reminder:
Basic Probability Theory
Sampling with replacement
Pick a random shape, then put it back in the bag.
P(X | Y) = P(X, Y) / P(Y)
P(blue | ) = 2/5
The chain rule
The joint probability P(X,Y) can also be expressed in
terms of the conditional probability P(X | Y)
P(X, Y ) = P(X|Y )P(Y )
But the strings of a language L don’t all have the same length
English = {“yes!”, “I agree”, “I see you”, …}
And there is no Nmax that limits how long strings in L can get.
Refinements:
use different UNK tokens for different types of words
(numbers, etc.).
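A minimal sketch of that refinement: replace out-of-vocabulary words with type-specific UNK tokens. The frequency threshold and token names here are illustrative choices, not a standard.

```python
# Map out-of-vocabulary words to different UNK tokens by type.
from collections import Counter

def build_vocab(tokens, min_count=2):
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def normalize(word, vocab):
    if word in vocab:
        return word
    if word.isdigit():
        return "<UNK-NUM>"        # unseen numbers get their own placeholder
    if word[:1].isupper():
        return "<UNK-CAP>"        # unseen capitalized words (e.g., names)
    return "<UNK>"
```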
What about the beginning of the sentence?
In a trigram model
P(w(1)w(2)w(3)) = P(w(1))P(w(2) | w(1))P(w(3) | w(2), w(1))
only the third term P(w(3) | w(2), w(1)) is an actual trigram
probability. What about P(w(1)) and P(w(2) | w(1)) ?
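The standard fix is to pad each sentence with beginning-of-sentence markers <s>, so that the first two factors also become trigram probabilities: P(w(1) | <s>, <s>) and P(w(2) | <s>, w(1)). A minimal sketch, assuming a hypothetical trigram_prob function:

```python
# Pad with <s> so every factor in the chain rule is a trigram probability.
import math

def sentence_logprob(sentence, trigram_prob):
    """trigram_prob(w, u, v) is assumed to return P(w | u, v)."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for i in range(2, len(words)):
        lp += math.log(trigram_prob(words[i], words[i - 2], words[i - 1]))
    return lp
```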
[Figure: sampling a word — the unit interval [0, 1] is divided into segments of widths p1, …, p5 with boundaries 0, p1, p1+p2, p1+p2+p3, p1+p2+p3+p4, 1; a uniform random draw selects the word x1 … x5 whose segment it falls into.]
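The figure corresponds to inverse-CDF sampling; a minimal sketch:

```python
# Draw u ~ Uniform(0, 1) and return the word whose cumulative-probability
# segment contains u.
import random

def sample_word(words, probs):
    u, cumulative = random.random(), 0.0
    for w, p in zip(words, probs):
        cumulative += p
        if u < cumulative:
            return w
    return words[-1]              # guard against floating-point round-off

print(sample_word(["x1", "x2", "x3", "x4", "x5"], [0.4, 0.3, 0.15, 0.1, 0.05]))
```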
Generating the Wall Street Journal
Generating Shakespeare
Shakespeare as corpus
The Shakespeare corpus consists of N=884,647 word
tokens and a vocabulary of V=29,066 word types
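Token and type counts like N and V come from a simple pass over the corpus. The sketch below uses whitespace tokenization and a hypothetical local file, so its numbers would differ from the figures above.

```python
# Count word tokens (N) vs. word types (V) in a corpus.
with open("shakespeare.txt") as f:        # hypothetical local copy of the corpus
    tokens = f.read().lower().split()
print("N (tokens):", len(tokens))
print("V (types): ", len(set(tokens)))
```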
MLE model: P(seen) = 1.0 — no probability mass is left for unseen events.
Smoothed model: P(seen) < 1.0 — some mass is reserved for unseen events.
Linear interpolation:
P̃(w | w′, w′′) = λ P̂(w | w′, w′′) + (1 − λ) P̃(w | w′)
Interpolate n-gram model with (n–1)-gram model.
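A direct sketch of the interpolation formula, with hypothetical component estimators and a λ that would normally be tuned on held-out data (in the slide's recursive form, the lower-order term is itself interpolated):

```python
# Linearly interpolate a trigram estimate with a bigram estimate.
def interpolated_prob(w, prev2, prev1, p_trigram, p_bigram, lam=0.7):
    """p_trigram(w, u, v) ≈ P(w | u v); p_bigram(w, v) ≈ P(w | v)."""
    return lam * p_trigram(w, prev2, prev1) + (1 - lam) * p_bigram(w, prev1)
```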
Perplexity: the inverse probability of the test set, normalized by the number of words N:
PP(W) = P(w(1) … w(N))^(−1/N)
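In practice perplexity is computed in log space for numerical stability; a minimal sketch:

```python
# Perplexity from per-token conditional log-probabilities.
import math

def perplexity(log_probs):
    """log_probs: list of log P(w_t | history) for every token in the test set."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

print(perplexity([math.log(0.1)] * 5))   # uniform 0.1 per word -> perplexity 10.0
```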
Extrinsic (Task-Based)
Evaluation of LMs:
Word Error Rate
Intrinsic vs. Extrinsic Evaluation
Perplexity tells us which LM assigns a higher
probability to unseen text
Task-based evaluation:
- Train model A, plug it into your system for performing task T
- Evaluate performance of system A on task T.
- Train model B, plug it in, evaluate system B on same task T.
- Compare scores of system A and system B on task T.
Word Error Rate (WER)
Originally developed for speech recognition.
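WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words; a minimal sketch:

```python
# Word error rate via a word-level Levenshtein (edit) distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i went to the ocean", "i went the the ocean"))  # 0.2
```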
• Idea: Collect statistics about how frequent different n-grams are and use these to predict the next word.
N-gram Language Models
• First we make a Markov assumption: x(n) depends only on the preceding n−1 words.
Sparsity Problem 2
Problem: What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w!
(Partial) Solution: Just condition on “opened their” instead. This is called backoff (see the sketch below).
Increasing n or increasing the corpus increases model size!
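A count-based sketch of the backoff idea above. It is simplified: real backoff schemes such as Katz backoff also discount the higher-order estimates, and the count dictionaries here are hypothetical.

```python
# If the trigram context was never seen, fall back to the bigram, then the unigram.
def backoff_prob(w, w1, w2, trigram_counts, bigram_counts, unigram_counts):
    """Counts are hypothetical dicts keyed by tuples of words."""
    if trigram_counts.get((w1, w2, w), 0) > 0:
        return trigram_counts[(w1, w2, w)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w), 0) > 0:
        return bigram_counts[(w2, w)] / unigram_counts[(w2,)]
    return unigram_counts.get((w,), 0) / sum(unigram_counts.values())
```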
How to build a neural language model?
• Recall the Language Modeling task:
• Input: sequence of words
• Output: prob. dist. of the next word
[Figure: a neural language model — the input words are embedded, passed through a hidden layer, and mapped to an output probability distribution over the vocabulary (a … zoo).]
Word embeddings / word vectors!
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Representing words as discrete symbols
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Representing words by their context
Distributional semantics: A word’s meaning is given by the words that frequently
appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!
• When a word w appears in a text, its context is the set of words that appear nearby
(within a fixed-size window).
• We use the many contexts of w to build up a representation of w
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Word2Vec Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
Idea:
• We have a large corpus (“body”) of text: a long list of words
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o
given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture01-wordvecs1.pdf
Word2vec: objective function
❏ We want to minimize the objective function:
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
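The objective itself is not visible in the extracted slide; for reference, the standard skip-gram objective in the cited cs224n notation is (T = corpus length, m = window size, v_c and u_o = center and outside word vectors):

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta),
\qquad
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```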
Interesting characters/words
• 夵: according to the Guangyun (《广韵》) and Jiyun (《集韵》) rime dictionaries, pronounced like 琰 (yan3): an object that is large on top and small at the bottom. The Jiyun also gives a reading pronounced like 叨 (tao1), meaning “to advance.”
• LGUer
• Looooooooong
A paper of ours: MorphTE
Guobing Gan, Peng Zhang, Sunzhu Li, Xiuqing Lu, Benyou Wang. MorphTE: Injecting Morphology in Tensorized Embeddings. NeurIPS 2022
From static word vectors to contextualized word vectors
What’s wrong with word2vec?
• Key idea: train a forward and a backward LSTM language model jointly, summing the log-likelihood over the N words in the sentence; the two directions share the token input representation and the softmax output layer.
How to use ELMo?
• Plug ELMo into any (neural) NLP model: freeze all the LM weights and change the input representation to a weighted combination of the biLM layers (L = # of layers):
ELMo_k^task = γ^task · Σ_{j=0..L} s_j^task · h_{k,j}^LM
• γ^task: allows the task model to scale the entire ELMo vector
• s_j^task: softmax-normalized weights across layers
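A small numpy sketch of that weighted combination; the layer representations below are random placeholders for the biLM outputs.

```python
# Softmax-normalized layer weights s_j and a task-specific scale gamma
# applied to the biLM layer outputs for one token.
import numpy as np

def elmo_vector(layer_reps, s_logits, gamma):
    """layer_reps: array of shape [L+1, dim] with the biLM layer outputs for one token."""
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                              # softmax over layers
    return gamma * (s[:, None] * layer_reps).sum(axis=0)

layers = np.random.randn(3, 1024)                # e.g. token embedding + 2 LSTM layers
print(elmo_vector(layers, np.zeros(3), gamma=1.0).shape)   # (1024,)
```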
• Solution: Mask out 15% of the input words, and then predict the masked words
• Pre-training → Fine-tuning
• Use word pieces instead of words: playing => play ##ing (Assignment 4)
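A simplified sketch of the masking step; BERT's actual recipe also replaces some selected tokens with random words or keeps them unchanged.

```python
# Randomly choose ~15% of the tokens as prediction targets and replace them
# with a [MASK] symbol.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    masked, targets = list(tokens), {}
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]       # the model must predict the original token
            masked[i] = mask_token
    return masked, targets

print(mask_tokens("the man went to the store to buy milk".split()))
```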
Benyou Wang et al. Pre-trained Language Models in Biomedical Domain: A Systematic Survey. ACM Computing Surveys.
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining encoders: what pretraining objective to use?
So far, we’ve looked at language model pretraining. But encoders get bidirectional context,
so we can’t do language modeling!
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the “Masked LM” objective and released the weights of a
pretrained Transformer, a model they labeled BERT.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a prefix
of every input is provided to the encoder and is not predicted.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2019]
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2019 found to work best was span corruption. Their model: T5.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2019]
Pretraining encoder-decoders: what pretraining objective to use?
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
[Raffel et al., 2019]
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Back to the language model
(next word predict)
Pretraining decoders
When using language-model-pretrained decoders, we can ignore that they were trained to model p(w_t | w_1, …, w_{t−1}).
Where A, b were pretrained in the language model! [Note how the linear layer has been pretrained.]
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
We mentioned how pretrained decoders can be used in their capacities as language
models. GPT-2, a larger version (1.5B) of GPT trained on more data, was shown to
produce relatively convincing samples of natural language.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without
gradient steps simply from examples you provide within their contexts.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
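An illustration of in-context learning: the "training examples" live entirely in the prompt (this mirrors the translation prompt in the GPT-3 paper), and the model is asked to continue the pattern with no gradient updates.

```python
# A few-shot prompt; the demonstrations are part of the context, not the weights.
prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
```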
LLaMA, Open-Source Models
Meta hopes to advance NLP research through LLaMA, particularly the academic exploration of large language models.
LLaMA can be customized for a variety of use cases, and it is especially well suited to research and non-commercial projects.
Despite its compact size, the Phi-3 model has demonstrated performance on par with, or even superior to, larger models on various academic benchmarks.
Phi-3's training method, inspired by children's learning, uses a "curriculum-
based" strategy. It starts with simplified data, gradually guiding the model to
grasp complex concepts.
https://azure.microsoft.com/en-us/products/phi-3
Today’s lecture
I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____
Overall, the value I got from the two hours watching it was the sum total of the
popcorn and the drink. The movie was ___.
The woman walked across the street, checking for traffic over ___ shoulder.
I went to the ocean to see the fish, turtles, seals, and _____.
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Tutorial
https://platform.openai.com/docs/libraries/python-library
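A minimal sketch following the linked docs for the OpenAI Python library (openai ≥ 1.0). The model name is an assumption — check the documentation for currently available models — and the OPENAI_API_KEY environment variable must be set first.

```python
# Call the Chat Completions API with the OpenAI Python library.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable
response = client.chat.completions.create(
    model="gpt-4o-mini",                         # assumed model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a language model is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```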
Prompt Engineering
Related resource:
❖ https://www.promptingguide.ai/zh
❖ https://www.youtube.com/watch?v=dOxUroR57xs&ab_channel=ElvisSaravia
❖ https://github.com/dair-ai/Prompt-Engineering-Guide
Assignment 1: Using ChatGPT API