
Course 3: Language Modeling

1
How does it really work?

Course 3: Language Modeling 2


What is Language Modeling?

Course 3: Language Modeling 3


Definition
A sequence of tokens: (w_1, w_2, …, w_L)
For a position i, a language model (LM) predicts P(w_i | w_1, …, w_{i-1})

In words: a LM predicts the probability of a token given its context

Course 3: Language Modeling 4


Example
I went to the ??? yesterday

P( park | I went to the ??? yesterday ) = 0.1

P( zoo | I went to the ??? yesterday ) = 0.07

...

P( under | I went to the ??? yesterday ) = 0

Course 3: Language Modeling 5


Why is it hard?
Large vocabularies: ~170,000 English words
Lots of possible contexts:
For V possible tokens, there are V^n contexts of size n (in theory)
Inherent uncertainty: the next token is not obvious, even for humans

Course 3: Language Modeling 6


Basic approach - Unigram
Learn the non-contextual probability (= frequency) of each token: P(w) = count(w) / N, with N the total number of tokens

Example
chart against operations at influence the surface plays crown a inaro
the three @ but the court lewis on hand american of seamen mu role
due roger executives
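As a toy illustration (not from the slides), a minimal Python sketch of a unigram model estimated by counting, with an illustrative whitespace-tokenized corpus:

```python
from collections import Counter
import random

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Non-contextual MLE: P(w) = count(w) / N
counts = Counter(corpus)
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}

# Sampling tokens independently reproduces word frequencies but no structure,
# which is why unigram "text" looks like the example above
words, weights = zip(*probs.items())
print(" ".join(random.choices(words, weights=weights, k=10)))
```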

Course 3: Language Modeling 7


Include context - Bigram
Predict based on the last token only: P(w_i | w_{i-1})
MLE: measure next-token frequency: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Example
the antiquamen lost to dios nominated former is carved stone oak
were problematic, 1910. his willingness to receive this may have been
seen anything

Course 3: Language Modeling 8


Include more context - n-gram
Predict based on the last n-1 tokens only: P(w_i | w_{i-n+1}, …, w_{i-1})
MLE: measure occurrences of tokens after (w_{i-n+1}, …, w_{i-1})

Example (n=4)
eva gauthier performed large amounts of contemporary french music
across the united states marshals service traveled to frankfurt,
germany and took custody of the matthews
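A minimal sketch of n-gram MLE estimation and sampling (illustrative corpus and helper names; a real model would need far more data and some smoothing):

```python
from collections import Counter, defaultdict
import random

def train_ngram(tokens, n):
    # MLE: P(w | context) = count(context, w) / count(context)
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, w = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts[context][w] += 1
    return counts

def sample(counts, context, length=10):
    out = list(context)
    for _ in range(length):
        nxt = counts.get(tuple(out[-len(context):]))
        if not nxt:  # unseen context: the model cannot extrapolate
            break
        words, weights = zip(*nxt.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
counts = train_ngram(tokens, n=3)
print(sample(counts, context=("the", "cat")))
```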

Course 3: Language Modeling 9


Statistical n-grams: pro/cons
Strengths:
Easy to train
Easy to interpret
Fast inference
Limitations:
Very limited context
Unable to extrapolate: can only model what it has seen

Course 3: Language Modeling 10


The embedding paradigm

Course 3: Language Modeling 11


LM with RNNs

Course 3: Language Modeling 12


LM with RNNs - Training
θ: parameters of the RNN
(w_1, …, w_L): training sequence
Cross-entropy loss: ℓ(θ) = − Σ_i log P_θ(w_i | w_1, …, w_{i-1})

Train via back-propagation + SGD
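A minimal PyTorch sketch of this training setup (the toy data, the GRU cell and the hyper-parameters are illustrative choices, not the course's exact configuration):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)  # logits for the next token at each position

model = RNNLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

seq = torch.randint(0, vocab_size, (1, 20))   # toy training sequence
inputs, targets = seq[:, :-1], seq[:, 1:]     # predict w_i from w_1..w_{i-1}

for step in range(100):                       # back-propagation + SGD
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```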

Course 3: Language Modeling 13


Reminder - Back-propagation

Course 3: Language Modeling 14


Reminder - Stochastic Gradient Descent
Goal: minimize a loss function ℓ(θ, D) for given data D with respect to model parameters θ
Method:
Split D into smaller parts D_k (called mini-batches)
Compute ℓ(θ, D_k) (forward) and ∇_θ ℓ (back-prop)
Update: θ ← θ − η ∇_θ ℓ (η: learning rate)

Course 3: Language Modeling 15


LM with RNNs: Generation

Course 3: Language Modeling 16


RNNs: pro/cons
Strengths
Still relatively fast to train
... and for inference (O(L) for a sequence of length L)
Can extrapolate (works with continuous features)
Limitations
Context dilution when information is far away

Course 3: Language Modeling 17


Extending RNNs: BiLSTMs
LSTM: improves context capacity
Read the sequence in both directions

Course 3: Language Modeling 18


Transformers

Course 3: Language Modeling 19


Information flow - RNN
How many steps between source of info and current position?

What is the previous word? => 1 step

What is the subject of verb X? => as many steps as the distance to the subject
What are the other occurrences of the current word? => up to L steps
...

Course 3: Language Modeling 20


Information flow - Transformers
How many steps between source of info and current position?

What is the previous word? => 1 step

What is the subject of verb X? => 1 step
What are the other occurrences of the current word? => 1 step
... => 1 step

Course 3: Language Modeling 21


Outside Transformers
A Transformer network
Input: a sequence of vectors (x_1, …, x_L)
Output: a sequence of vectors (y_1, …, y_L)
Each output y_i may depend on the whole input sequence

Course 3: Language Modeling 22


Inside Transformers

Course 3: Language Modeling 23


Inside Transformers : Embeddings
Before going in the network:

Given an input token sequence (t_1, …, t_L):

We retrieve token embeddings (e_{t_1}, …, e_{t_L})
We retrieve position embeddings (p_1, …, p_L)
We compute input embeddings: x_i = e_{t_i} + p_i
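A minimal PyTorch sketch of this embedding step (dimensions and token ids are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64
tok_emb = nn.Embedding(vocab_size, d_model)   # one vector per token id
pos_emb = nn.Embedding(max_len, d_model)      # one vector per position

tokens = torch.tensor([[5, 42, 7, 99]])       # (batch, L) token ids
positions = torch.arange(tokens.size(1)).unsqueeze(0)

# x_i = e_{t_i} + p_i : what actually enters the first Transformer layer
x = tok_emb(tokens) + pos_emb(positions)      # (batch, L, d_model)
```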

Course 3: Language Modeling 24


Inside Transformers : Self-attention

Course 3: Language Modeling 25


Inside Transformers : Q and K
=> Model interactions between tokens:

Course 3: Language Modeling 26


Inside Transformers : Q and K
Each row of QKᵀ is then normalized using softmax
Interpretable patterns:

Course 3: Language Modeling 27


Inside Transformers : Q and K
Formally: A = softmax( QKᵀ / √d )

where d is the hidden dimension of the model

Course 3: Language Modeling 28


Inside Transformers : A and V

Course 3: Language Modeling 29


Inside Transformers : Self-attention summary
Inputs are mapped to Queries, Keys and Values
Queries and Keys are used to measure interactions (A)
Interaction weights are used to "select" relevant combinations of Values
Complexity: O(L²)
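A minimal PyTorch sketch of single-head self-attention as summarized above (random weights, illustrative shapes):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: A = softmax(QK^T / sqrt(d)), output = A V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # map inputs to Queries, Keys, Values
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (L, L) interactions -> the O(L^2) cost
    a = torch.softmax(scores, dim=-1)                # normalize each row with softmax
    return a @ v                                     # weighted combination of Values

L, d = 5, 16
x = torch.randn(L, d)                                # one input vector per token
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
y = self_attention(x, w_q, w_k, w_v)                 # (L, d): each output sees all inputs
```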

Course 3: Language Modeling 30


Inside Transformers : Multi-head attention

Course 3: Language Modeling 31


Inside Transformers : LayerNorm
Avoids gradient explosion

Course 3: Language Modeling 32


Inside Transformers : Output layer

Course 3: Language Modeling 33


Modern flavors : Relative Positional Embeddings
Encode position at the attention level:

Rotary Positional Embeddings (RoPE, Su et al. 2023)
Queries and keys are rotated by an angle proportional to their position, so q_iᵀk_j only depends on the relative offset i − j; no position embeddings are added to the input
Linear Biases (ALiBi, Press et al. 2022)
A_ij = q_iᵀk_j − m·|i − j|, with m a fixed head-specific slope

Course 3: Language Modeling 34


Modern flavors : RMSNorm
Replaces LayerNorm
Re-scaling is all you need: RMSNorm(x) = x / RMS(x) · g, with RMS(x) = √(mean(x_i²))
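A minimal sketch of RMSNorm (eps and shapes are illustrative):

```python
import torch

def rms_norm(x, gain, eps=1e-6):
    # Re-scale only: no mean subtraction and no bias, unlike LayerNorm
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gain

x = torch.randn(2, 8)
gain = torch.ones(8)          # learnable gain in practice
print(rms_norm(x, gain).shape)
```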

Course 3: Language Modeling 35


Modern flavors : Grouped-Query Attention

Course 3: Language Modeling 36


Encoder Models

Course 3: Language Modeling 37


Masked Language Models
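A minimal sketch of the masked-LM input corruption (the 15% rate follows BERT; ids are illustrative and BERT's 80/10/10 mask/keep/replace refinement is omitted):

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Randomly mask ~15% of tokens; the encoder is trained to recover them."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = -100          # only masked positions contribute to the loss
    inputs = token_ids.clone()
    inputs[masked] = mask_id        # simplified corruption scheme
    return inputs, labels

ids = torch.randint(5, 1000, (1, 12))        # toy batch of token ids
inputs, labels = mask_tokens(ids, mask_id=4)
# loss = cross-entropy between the encoder's logits and labels, on masked positions only
```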

Course 3: Language Modeling 38


BERT (Devlin et al., 2018)
Pre-trained on 128B tokens from Wikipedia + BooksCorpus
Additional Next Sentence Prediction (NSP) loss
Two versions:
BERT-base (110M parameters)
BERT-large (340M parameters)
Cost: ~1000 GPU hours

Course 3: Language Modeling 39


RoBERTa (Liu et al., 2019)
Pre-trained on ~2T tokens from web data (10× more data than BERT)
No more Next Sentence Prediction (NSP) loss
Two versions:
RoBERTa-base (125M parameters)
RoBERTa-large (355M parameters)
Better results in downstream tasks
Cost: ~25000 GPU hours

Course 3: Language Modeling 40


Multilingual BERT (mBERT)
Pre-trained on 128B tokens from multilingual Wikipedia
104 languages
One version:
mBERT-base (179M parameters)
Cost: unknown

Course 3: Language Modeling 41


XLM-RoBERTa (Conneau et al., 2019)
Pre-trained on 63T tokens from CommonCrawl
100 languages
Two versions:
XLM-RoBERTa-base (279M parameters)
XLM-RoBERTa-large (561M parameters)
Cost: ~75000 GPU hours

Course 3: Language Modeling 42


ELECTRA (Clark et al., 2020)

Course 3: Language Modeling 43


ELECTRA (Clark et al., 2020)
Replaced-token detection: a discriminator is trained to spot tokens substituted by a small generator
Pre-trained on English data (the same corpora as BERT for the small/base versions)
Three versions:
ELECTRA-small (14M parameters)
ELECTRA-base (110M parameters)
ELECTRA-large (335M parameters)
Substantially better downstream results than BERT/RoBERTa at the same compute budget
Cost: ≈ BERT

Course 3: Language Modeling 44


Encoders: Fine-tuning

Course 3: Language Modeling 45


Encoders: Classical applications
Natural Language Inference (NLI)
I like cake! / Cake is bad => same|neutral|opposite
Text classification (+ clustering)
I'm so glad to be here! => joy
Named Entity Recognition (NER)
I voted for Obama! => (Obama, pos:3, class:PER)
and many others...
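A minimal sketch of fine-tuning an encoder for one of these tasks (text classification), assuming the Hugging Face transformers library; the model name, label and hyper-parameters are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 3)   # e.g. 3 emotion classes

opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5
)

batch = tokenizer(["I'm so glad to be here!"], return_tensors="pt")
labels = torch.tensor([0])                               # illustrative gold label

out = encoder(**batch).last_hidden_state[:, 0]           # [CLS] embedding
loss = nn.functional.cross_entropy(classifier(out), labels)
loss.backward()                                          # updates both encoder and head
opt.step()
```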

Course 3: Language Modeling 46


Decoders

Course 3: Language Modeling 47


Decoders - Motivation
Models that are designed to generate text
Next-word predictors: P(w_i | w_1, …, w_{i-1})

Problem: how do we prevent self-attention from attending to future tokens?

Course 3: Language Modeling 48


Decoders - Attention mask

Each attention input can only attend to previous positions
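A minimal sketch of this causal mask applied to the raw attention scores (shapes illustrative):

```python
import torch

L = 5
scores = torch.randn(L, L)                       # raw Q·K^T interactions

# Upper-triangular positions (j > i) are future tokens: set them to -inf
causal_mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))

attn = torch.softmax(scores, dim=-1)             # row i attends only to positions <= i
```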

Course 3: Language Modeling 49


Decoders - Causal LM pre-training
Teacher-forcing: at every position, the model is fed the ground-truth prefix and trained to predict the next token

Course 3: Language Modeling 50


Decoders - Causal LM inference (greedy)

Course 3: Language Modeling 51


Decoders - Causal LM inference (greedy)

Course 3: Language Modeling 52


Decoders - Refining inference
What we have: a good model for P(w_i | w_1, …, w_{i-1})
What we want at inference: the most likely continuation, i.e. the argmax over (w_i, …, w_{i+T}) of P(w_i, …, w_{i+T} | w_1, …, w_{i-1})

For a given completion length T, there are V^T possibilities (in theory)

e.g.: 19 new tokens with a vocab of 30,000 tokens => 30000^19 > number of atoms in the observable universe
We need approximations

Course 3: Language Modeling 53


Decoders - Greedy inference
Keep the best token at each step and start again:

ŵ_i = argmax_w P(w | w_1, …, w_{i-1}), where the argmax runs over the whole vocabulary

Course 3: Language Modeling 54


Decoders - Beam search
Keep the k best chains of tokens at each step (see the sketch below):
Take the k best tokens w and compute P(w' | …, w) for each
Take the k best w' in each sub-case (now we have k × k pairs to consider)
Keep only the k most likely pairs (w, w')
Compute P(w'' | …, w, w') for the k candidates
and so on...
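A minimal sketch of beam search, assuming a hypothetical next_log_probs(prefix) helper that returns log P(w | prefix) for every vocabulary token w:

```python
def beam_search(next_log_probs, prefix, k=3, steps=5):
    # Each beam is (token sequence, cumulative log-probability)
    beams = [(list(prefix), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            log_probs = next_log_probs(seq)   # dict: token -> log P(token | seq)
            # expand each beam with its k best continuations (k*k pairs overall)
            best = sorted(log_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
            candidates += [(seq + [w], score + lp) for w, lp in best]
        # keep only the k most likely sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams
```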

Course 3: Language Modeling 55


Decoders - Top-k sampling
Randomly sample among the top-k tokens, with weights proportional to P(w | w_1, …, w_{i-1})

Course 3: Language Modeling 56


Decoders - Top-p (=Nucleus) sampling
Randomly sample among the most likely tokens whose cumulative probability P(w | w_1, …, w_{i-1}) reaches p%

Course 3: Language Modeling 57


Decoders - Generation Temperature
Alter the softmax function: P(w_i) ∝ exp(z_i / T), where T is the temperature (T < 1 sharpens the distribution, T > 1 flattens it)
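A minimal sketch combining temperature with the top-k and top-p filters from the previous slides (all threshold values are illustrative):

```python
import torch

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = logits / temperature                       # T < 1 sharpens, T > 1 flattens
    probs = torch.softmax(logits, dim=-1)

    sorted_probs, sorted_ids = probs.sort(descending=True)
    keep = torch.ones_like(sorted_probs, dtype=torch.bool)
    keep[top_k:] = False                                # top-k: k most likely tokens
    keep &= sorted_probs.cumsum(dim=-1) - sorted_probs < top_p  # top-p: nucleus of mass p

    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    idx = torch.multinomial(filtered / filtered.sum(), 1)
    return sorted_ids[idx]

next_token = sample_next(torch.randn(30000))            # toy vocabulary of 30,000 logits
```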

Course 3: Language Modeling 58


Decoders - Inference speed
For greedy decoding without prefix:
L passes, with sequences of length 1, 2, …, L
Each pass is O(L²)
Overall complexity: O(L³)
Other decoding strategies are more costly
Ways to go faster?

Course 3: Language Modeling 59


Decoders - Key-Value (KV) caching

Course 3: Language Modeling 60


Decoders - Speculative decoding
Generate tokens with a smaller draft model q (much cheaper than the target model p)
Forward the drafted tokens through the bigger model p in teacher-forcing mode to get its own predictions
Compare p and q and only keep the drafted tokens where they don't differ too much

Course 3: Language Modeling 61


Encoder-Decoder models

Course 3: Language Modeling 62


T5 pre-training
Span corruption: random spans of the input are replaced by sentinel tokens and the decoder reconstructs them

Course 3: Language Modeling 63


All models can do everything
Encoders are mostly used to get contextual embeddings
They can also generate: ("I love [MASK]")
Decoders are mostly used for language generation
They can also give contextual embeddings: ("I love music!")
Or solve any task using prompts:
"What is the emotion in this tweet? Tweet: '...' Answer:"
Encoders-decoders are used for language in-filling

Course 3: Language Modeling 64


Evaluating models
A useful evaluation metric: Perplexity
Defined as: PPL = exp( −(1/L) Σ_i log P(w_i | w_1, …, w_{i-1}) ); lower is better

Other metrics: accuracy, f1-score, ...
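A minimal sketch of the perplexity computation, assuming the model's per-token probabilities P(w_i | w_<i) are already available:

```python
import math

# P(w_i | w_<i) assigned by the model to each token of a held-out sequence (toy values)
token_probs = [0.2, 0.05, 0.5, 0.1]

# PPL = exp( -(1/L) * sum_i log P(w_i | w_<i) ); lower is better
ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(ppl)
```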

Course 3: Language Modeling 65


Zero-shot evaluation
Never-seen problems/data
Example: "What is the capital of Italy? Answer:"
Open-ended: let the model continue the sentence and check for an exact match
Ranking: get the next-word likelihood for "Rome", "Paris", "London", ... and check if "Rome" is best
Perplexity: compute the perplexity of "Rome" and compare with other models

Course 3: Language Modeling 66


Few-shot evaluation / In-context learning
Never-seen problems/data
Example: "Paris is the capital of France. London is the capital of the
UK. Rome is the capital of"
Chain-of-Thought (CoT) examples:
Normal: "(2+3)x5=25. What's (3+4)x2?"
CoT: "To solve (2+3)x5, we first compute (2+3) = 5 and then
multiply (2+3)x5=5x5=25. What's (3+4)x2?"

Course 3: Language Modeling 67


Open-sourced evaluation
Generative models are evaluated on benchmarks
Example (LLM Leaderboard from HuggingFace):

Course 3: Language Modeling 68


Lab session

Course 3: Language Modeling 69
