Brief Introduction To LLM
Sachin Kumar
[email protected]
What are we going to talk about?
● The language modeling problem
● How do we learn a language model?
○ A quick primer on learning a model via gradient descent
○ The role of training data
● ✨✨ The Transformer ✨✨
○ The two things that make it such an improvement over our previous techniques for language modeling
○ More detail about both of those two things
● From language models to large language models
● Large language models for chat (à la ChatGPT)
Quick poll
1. Are you familiar with supervised machine learning? With gradient descent?
p(how are you this evening ? has your house ever been burgled ?) = 10⁻¹⁵
p(how are you this evening ? fine , thanks , how about you ?) = 10⁻⁹
The language modeling problem
A language model answers the question: What is p(text)?
At each timestep, sample a token from the language model's probability distribution over next tokens, given the context generated so far.
[Figure: sampling a continuation one token at a time, starting from the context "The ____"]
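To make the factorization behind p(text) concrete, here is a minimal sketch (not from the slides) of how a model's per-timestep next-token distributions combine into the probability of a whole string; `next_token_distribution` is a hypothetical stand-in for whatever model supplies those distributions.

```python
import math

def log_p_text(next_token_distribution, tokens):
    """log p(text) = sum over timesteps of log p(token_t | tokens before t).

    `next_token_distribution(context)` is a hypothetical function that returns
    a dict {token: probability} for the next token given the context so far.
    """
    total = 0.0
    for t, token in enumerate(tokens):
        dist = next_token_distribution(tokens[:t])   # distribution at this timestep
        total += math.log(dist[token])               # probability assigned to the actual token
    return total
```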
Language models in the news (these days, ChatGPT)
We use language models every day
Why language modeling?
● Machine translation
○ p(strong winds) > p(large winds)
● Spelling correction
○ The office is about fifteen minuets from my house
○ p(about fifteen minutes from) > p(about fifteen minuets from)
● Speech recognition
○ p(I saw a van) >> p(eyes awe of an)
How we learn a language model
How do we learn a language model?
Simplest idea: estimate probabilities by counting occurrences in text data.
Problem: counting alone gives no generalization! Word sequences never seen in the data get probability zero.
Markov assumption
● We make the Markov assumption: x(t+1) depends only on the preceding n−1 words:
  p(x(t+1) | x(t), …, x(1)) ≈ p(x(t+1) | x(t), …, x(t−n+2))
Markov assumption
● … or maybe even on fewer preceding words
n-gram Language Models
“I have a dog whose name is Lucy. I have two cats, they like playing with Lucy.”
● Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word
unigram probability
“I have a dog whose name is Lucy. I have two cats, they like playing with Lucy.”
● corpus size m = 17
● P(Lucy) = 2/17; P(cats) = 1/17
● Unigram probability: P(w) = count(w) / m
bigram probability
“I have a dog whose name is Lucy. I have two cats, they like playing with Lucy.”
● Bigram probability: P(w2 | w1) = count(w1 w2) / count(w1)
● For example, P(have | I) = count(I have) / count(I) = 2/2 = 1
trigram probability
“I have a dog whose name is Lucy. I have two cats, they like playing with Lucy.”
● Trigram probability: P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
n-gram probability
“I have a dog whose name is Lucy. I have two cats, they like playing with Lucy.”
● n-gram probability: P(wn | w1 … wn−1) = count(w1 … wn) / count(w1 … wn−1)
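As a concrete (if toy) illustration of the counts behind these formulas, here is a short Python sketch; the whitespace-and-punctuation tokenization is an assumption, chosen so that the corpus comes out to the 17 tokens used above.

```python
from collections import Counter

text = ("I have a dog whose name is Lucy. "
        "I have two cats, they like playing with Lucy.")

# Assumed tokenization: strip punctuation and split on whitespace
tokens = [w.strip(".,") for w in text.split()]
m = len(tokens)                                   # corpus size: 17 tokens

unigram_counts = Counter(tokens)
# Bigram counts (ignoring sentence boundaries for simplicity)
bigram_counts = Counter(zip(tokens, tokens[1:]))

p_lucy = unigram_counts["Lucy"] / m                                  # 2/17
p_cats = unigram_counts["cats"] / m                                  # 1/17
p_have_given_i = bigram_counts[("I", "have")] / unigram_counts["I"]  # 2/2 = 1.0

print(p_lucy, p_cats, p_have_given_i)
```

With these counts, P(Lucy) = 2/17 and P(have | I) = 1, matching the numbers above.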
Sampling from an n-gram language model
[Figure: generating text one token at a time by repeatedly sampling from the model's next-token distribution]
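A minimal, self-contained sketch of sampling from a bigram model built on the same toy sentence; smoothing and sentence boundaries are ignored, and the tokenization is again an assumption.

```python
import random
from collections import Counter, defaultdict

text = ("I have a dog whose name is Lucy. "
        "I have two cats, they like playing with Lucy.")
tokens = [w.strip(".,") for w in text.split()]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# P(next | prev) = count(prev next) / count(prev)
next_word_dist = defaultdict(dict)
for (w1, w2), c in bigram_counts.items():
    next_word_dist[w1][w2] = c / unigram_counts[w1]

def sample_from_bigram(start, n_steps=10):
    word, output = start, [start]
    for _ in range(n_steps):
        dist = next_word_dist.get(word)
        if not dist:                         # no observed continuation: stop
            break
        words, probs = zip(*dist.items())
        word = random.choices(words, weights=probs, k=1)[0]
        output.append(word)
    return " ".join(output)

print(sample_from_bigram("I"))   # e.g. "I have a dog whose name is Lucy ..."
```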
Neural language models
[Figure: the text "I have a dog whose name is Lucy. I have two …" is fed into f(Θ), a differentiable function (e.g., a neural network), which predicts the next token]
How do we maximize the likelihood?
The dominant strategy from the past decade: gradient descent.
Gradient descent: a throwback to calculus
Q: Given the current parameter w, should we make w bigger or smaller to minimize our loss?
A: Move w in the direction opposite to the slope of the loss function
Let's first visualize for a single scalar w
Q: Given the current w, should we make it bigger or smaller?
A: Move w in the direction opposite to the slope of the function
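A tiny sketch of this picture in code, using a made-up loss L(w) = (w − 3)² and an arbitrary learning rate; both are illustrative assumptions, not anything from the slides.

```python
def loss(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def d_loss_dw(w):
    return 2.0 * (w - 3.0)         # slope (derivative) of the loss

w = 0.0                            # arbitrary starting point
learning_rate = 0.1
for step in range(50):
    w = w - learning_rate * d_loss_dw(w)   # move opposite to the slope

print(w)                           # approaches 3.0
```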
Now let’s imagine 2 dimensions, w and b
[Figure: the loss surface over the two parameters (w, b), with the negative gradient vector drawn at the red point]
Gradient Descent → Stochastic Gradient Descent
Key difference from our motivating scenario: in practice, calculating the exact gradient over the whole dataset is really time-consuming.
So… we estimate the gradient using small random samples (minibatches) of data.
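A sketch of the "stochastic" part under a toy least-squares setup (the dataset, model, batch size, and learning rate below are all illustrative assumptions): rather than summing the gradient over every example, we estimate it from a small random batch.

```python
import random

# Toy dataset: fit y ≈ w * x, where the "true" w is 2 (illustrative assumption)
xs = [random.uniform(0.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x) for x in xs]

def batch_gradient(batch, w):
    # d/dw of the mean squared error (w*x - y)^2 over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w, learning_rate = 0.0, 0.1
for step in range(2000):
    batch = random.sample(data, 8)                # estimate gradient from 8 examples
    w -= learning_rate * batch_gradient(batch, w)

print(w)    # close to 2.0
```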
✨✨ The Transformer ✨✨
Why did the transformer make such a big difference for language modeling?
It allowed for faster learning of more model parameters on more data (by allowing parallel computation on GPUs!)
A brief aside about some visual shorthand I’ll be using
A 3-layer LSTM’s calculations for an input of 10 tokens
[Figure: a (variable number of) input vectors, each multiplied by a weight, then combined into a single weighted average]
Computed how?
1. Dot product between a parameter vector and each input vector.
2. Softmax the resulting set of scalars to get the weights.
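A small numpy sketch of exactly that computation (the vector sizes and random values are placeholders): dot the parameter vector with each input vector, softmax the scores, and take the weighted average.

```python
import numpy as np

rng = np.random.default_rng(0)

# A variable number of input vectors: here 10 tokens, each a 4-dim vector
inputs = rng.normal(size=(10, 4))
param = rng.normal(size=(4,))        # a learned parameter vector (random stand-in)

scores = inputs @ param                             # 1. dot product with each input vector
weights = np.exp(scores) / np.exp(scores).sum()     # 2. softmax the resulting scalars
output = weights @ inputs                           # weighted average of the input vectors

print(weights.shape, output.shape)   # (10,) (4,) -> a single output vector
```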
Pros and cons
Pros:
● We have a function that can compute a weighted average of an arbitrary number of vectors, (largely) in parallel!
● The parameters that determine what makes it into our output representation are learned
Cons:
● We're also hoping to produce n different output token representations… and this only produces one
Enter “self attention”
[Figure: each input vector is projected into a query (Q), a key (K), and a value (V); each token's query is scored against every token's key, and the softmaxed scores weight the value vectors, giving one output vector per token]
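A minimal numpy sketch of self attention under the usual Q/K/V formulation (the dimensions and random projection matrices are placeholders; a causal mask, multiple heads, and other details are omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 10, 4                      # illustrative sizes

X = rng.normal(size=(n_tokens, d))       # one vector per input token

# Learned projection matrices (random stand-ins here)
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # queries, keys, values: all matrix multiplies

scores = Q @ K.T / np.sqrt(d)            # every token's query vs. every token's key
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
# (for language modeling a causal mask would be applied to the scores; omitted here)

outputs = weights @ V                    # n different output token representations

print(outputs.shape)                     # (10, 4): one output vector per input token
```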
Hooray for self attention!
Our function is still made up almost entirely of matrix multiplications, which are very parallelizable (→ efficient!)
We still learn fixed-size blocks of parameters that can be used for a sequence of arbitrary length
From language models to large language models
Web text
The transformer allows training on large amounts of data (think the whole internet's worth of text).
It allows adding more and more layers to the model (a larger model).
ChatGPT (and many other models) is further trained on supervised data to follow instructions.
Instruction Tuning
● Collect a large dataset of instruction-following examples of the form
○ <instruction> <input> <output>
○ For example,
○ Summarize this news article [ARTICLE] [SUMMARY]
○ Answer this question [QUESTION] [ANSWER]
○ Predict the sentiment of this review [REVIEW] [SENTIMENT]....
● Continue training the model on this dataset (again using the same training objective); a sketch of what these serialized examples might look like follows below
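A sketch of how such examples might be serialized into ordinary training sequences; the field names and template are illustrative assumptions rather than any particular dataset's format.

```python
# Hypothetical instruction-following examples of the form <instruction> <input> <output>
examples = [
    {"instruction": "Summarize this news article",
     "input": "[ARTICLE]", "output": "[SUMMARY]"},
    {"instruction": "Answer this question",
     "input": "[QUESTION]", "output": "[ANSWER]"},
    {"instruction": "Predict the sentiment of this review",
     "input": "[REVIEW]", "output": "[SENTIMENT]"},
]

def to_training_sequence(example):
    # One possible template; the model is then trained on sequences like this
    # with the usual next-token prediction objective.
    return (f"Instruction: {example['instruction']}\n"
            f"Input: {example['input']}\n"
            f"Output: {example['output']}")

for ex in examples:
    print(to_training_sequence(ex))
    print("---")
```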
Aligning the model to humans’ preferences
● Chat-based models are supposed to converse with humans
● Basic idea: the model samples multiple outputs, and users rank them based on their preference
○ Convert user preferences into reward scores: the more preferred output gets a higher reward (a small sketch of this step follows below)
○ Treat the LLM like an agent and use RL to maximize this reward (RLHF)
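One common way to turn these rankings into reward scores, shown here as a hedged sketch (the slides don't specify the exact recipe), is to train a reward model with a pairwise loss that pushes the preferred output's score above the other's.

```python
import numpy as np

def pairwise_preference_loss(reward_preferred, reward_other):
    # Bradley-Terry style loss: low when the preferred output scores higher
    return -np.log(1.0 / (1.0 + np.exp(-(reward_preferred - reward_other))))

# Toy reward scores for a pair of sampled outputs (made-up numbers)
print(pairwise_preference_loss(2.0, 0.5))   # small loss: ranking respected
print(pairwise_preference_loss(0.5, 2.0))   # larger loss: ranking violated
```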
So what does this mean ChatGPT is good at?
Some aspects of producing answers that might fall under that category:
● Writing in specific styles (that have appeared in the model’s training data)
● Grammatical consistency
● Generating boilerplate sentences that often appear at the beginning or end of emails, etc.
● Fluency
What are some problems that ChatGPT's training leaves it prone to?
Inaccuracies
● The language model doesn’t “plan” what it will say in advance
● The model doesn’t store facts, just outputs plausible looking sentences which
may or may not be factual
Lack of source attribution
Just like the model doesn’t store facts… it doesn’t store sources.
Outputs that reflect social biases
An example from machine translation a few years ago: [figure omitted]
Thanks! Questions?