
Course 3: Language Modeling

1
How does it really work?

Course 3: Language Modeling 2


What is Language Modeling?

Course 3: Language Modeling 3


Definition
A sequence of tokens: (w_1, w_2, …, w_L)
For a position i, a language model (LM) predicts P(w_i | w_1, …, w_{i-1})

In words: a LM predicts the probability of a token given its context

Course 3: Language Modeling 4


Example
I went to the ??? yesterday

P( park | I went to the ??? yesterday ) = 0.1

P( zoo | I went to the ??? yesterday ) = 0.07

...

P( under | I went to the ??? yesterday ) = 0

Course 3: Language Modeling 5


Why is it hard?
Large vocabularies: ~170,000 English words
Lots of possible contexts:
For V possible tokens, there are V^n contexts of size n (in theory)
Inherent uncertainty: the next token is not obvious, even for humans

Course 3: Language Modeling 6


Basic approach - Unigram
Learn the non-contextual probability (= frequency) of each token: P(w) = count(w) / N, with N the total number of tokens

Example
chart against operations at influence the surface plays crown a inaro
the three @ but the court lewis on hand american of seamen mu role
due roger executives
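As a toy illustration (not from the slides), a minimal Python sketch of a unigram model estimated by counting, with an illustrative whitespace-tokenized corpus:

```python
from collections import Counter
import random

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Non-contextual MLE: P(w) = count(w) / N
counts = Counter(corpus)
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}

# Sampling tokens independently reproduces word frequencies but no structure,
# which is why unigram "text" looks like the example above
words, weights = zip(*probs.items())
print(" ".join(random.choices(words, weights=weights, k=10)))
```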

Course 3: Language Modeling 7


Include context - Bigram
Predict based on the last token only: P(w_i | w_{i-1})
MLE: measure next-token frequency: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Example
the antiquamen lost to dios nominated former is carved stone oak
were problematic, 1910. his willingness to receive this may have been
seen anything

Course 3: Language Modeling 8


Include more context - n-gram
Predict based on the last n-1 tokens only: P(w_i | w_{i-n+1}, …, w_{i-1})
MLE: measure occurrences of tokens after (w_{i-n+1}, …, w_{i-1})

Example (n=4)
eva gauthier performed large amounts of contemporary french music
across the united states marshals service traveled to frankfurt,
germany and took custody of the matthews
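A minimal sketch of n-gram MLE estimation and sampling (illustrative corpus and helper names; a real model would need far more data and some smoothing):

```python
from collections import Counter, defaultdict
import random

def train_ngram(tokens, n):
    # MLE: P(w | context) = count(context, w) / count(context)
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, w = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts[context][w] += 1
    return counts

def sample(counts, context, length=10):
    out = list(context)
    for _ in range(length):
        nxt = counts.get(tuple(out[-len(context):]))
        if not nxt:  # unseen context: the model cannot extrapolate
            break
        words, weights = zip(*nxt.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
counts = train_ngram(tokens, n=3)
print(sample(counts, context=("the", "cat")))
```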

Course 3: Language Modeling 9


Statistical n-grams: pro/cons
Strengths:
Easy to train
Easy to interpret
Fast inference
Limitations:
Very limited context
Unable to extrapolate: can only model what it has seen

Course 3: Language Modeling 10


The embedding paradigm

Course 3: Language Modeling 11


LM with RNNs

Course 3: Language Modeling 12


LM with RNNs - Training
θ: parameters of the RNN
(w_1, …, w_L): training sequence
Cross-entropy loss: ℓ(θ) = − Σ_i log P_θ(w_i | w_1, …, w_{i-1})

Train via back-propagation + SGD
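A minimal PyTorch sketch of this training setup (the toy data, the GRU cell and the hyper-parameters are illustrative choices, not the course's exact configuration):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)  # logits for the next token at each position

model = RNNLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

seq = torch.randint(0, vocab_size, (1, 20))   # toy training sequence
inputs, targets = seq[:, :-1], seq[:, 1:]     # predict w_i from w_1..w_{i-1}

for step in range(100):                       # back-propagation + SGD
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```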

Course 3: Language Modeling 13


Reminder - Back-propagation

Course 3: Language Modeling 14


Reminder - Stochastic Gradient Descent
Goal: minimize a loss function ℓ(θ, D) for given data D with respect to model parameters θ
Method:
Split D into smaller parts D_k (called mini-batches)
Compute ℓ(θ, D_k) (forward) and ∇_θ ℓ (back-prop)
Update: θ ← θ − η ∇_θ ℓ (η: learning rate)

Course 3: Language Modeling 15


LM with RNNs: Generation

Course 3: Language Modeling 16


RNNs: pro/cons
Strengths
Still relatively fast to train
... and for inference (O(L) for a sequence of length L)
Can extrapolate (works with continuous features)
Limitations
Context dilution when information is far away

Course 3: Language Modeling 17


Extending RNNs: BiLSTMs
LSTM: improves context capacity
Read the sequence in both directions

Course 3: Language Modeling 18


Transformers

Course 3: Language Modeling 19


Information flow - RNN
How many steps between source of info and current position?

What is the previous word? => 1 step

What is the subject of verb X? => as many steps as the distance to the subject
What are the other occurrences of the current word? => up to L steps
...

Course 3: Language Modeling 20


Information flow - Transformers
How many steps between source of info and current position?

What is the previous word? => 1 step

What is the subject of verb X? => 1 step
What are the other occurrences of the current word? => 1 step
... => 1 step

Course 3: Language Modeling 21


Outside Transformers
A Transformer network
Input: a sequence of vectors (x_1, …, x_L)
Output: a sequence of vectors (y_1, …, y_L)
Each output y_i may depend on the whole input sequence

Course 3: Language Modeling 22


Inside Transformers

Course 3: Language Modeling 23


Inside Transformers : Embeddings
Before going in the network:

Given an input token sequence (t_1, …, t_L):

We retrieve token embeddings (e_{t_1}, …, e_{t_L})
We retrieve position embeddings (p_1, …, p_L)
We compute input embeddings: x_i = e_{t_i} + p_i
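A minimal PyTorch sketch of this embedding step (dimensions and token ids are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64
tok_emb = nn.Embedding(vocab_size, d_model)   # one vector per token id
pos_emb = nn.Embedding(max_len, d_model)      # one vector per position

tokens = torch.tensor([[5, 42, 7, 99]])       # (batch, L) token ids
positions = torch.arange(tokens.size(1)).unsqueeze(0)

# x_i = e_{t_i} + p_i : what actually enters the first Transformer layer
x = tok_emb(tokens) + pos_emb(positions)      # (batch, L, d_model)
```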

Course 3: Language Modeling 24


Inside Transformers : Self-attention

Course 3: Language Modeling 25


Inside Transformers : Q and K
=> Model interactions between tokens:

Course 3: Language Modeling 26


Inside Transformers : Q and K
Each row of QKᵀ is then normalized using softmax
Interpretable patterns:

Course 3: Language Modeling 27


Inside Transformers : Q and K
Formally: A = softmax( QKᵀ / √d )

where d is the hidden dimension of the model

Course 3: Language Modeling 28


Inside Transformers : A and V

Course 3: Language Modeling 29


Inside Transformers : Self-attention summary
Inputs are mapped to Queries, Keys and Values
Queries and Keys are used to measure interactions (A)
Interaction weights are used to "select" relevant combinations of Values
Complexity: O(L²)
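A minimal PyTorch sketch of single-head self-attention as summarized above (random weights, illustrative shapes):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: A = softmax(QK^T / sqrt(d)), output = A V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # map inputs to Queries, Keys, Values
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (L, L) interactions -> the O(L^2) cost
    a = torch.softmax(scores, dim=-1)                # normalize each row with softmax
    return a @ v                                     # weighted combination of Values

L, d = 5, 16
x = torch.randn(L, d)                                # one input vector per token
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
y = self_attention(x, w_q, w_k, w_v)                 # (L, d): each output sees all inputs
```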

Course 3: Language Modeling 30


Inside Transformers : Multi-head attention

Course 3: Language Modeling 31


Inside Transformers : LayerNorm
Avoids gradient explosion

Course 3: Language Modeling 32


Inside Transformers : Output layer

Course 3: Language Modeling 33


Modern flavors : Relative Positional Embeddings
Encode position at the attention level:

Rotary Positional Embeddings (RoPE, Su et al. 2023)
Queries and keys are rotated by an angle proportional to their position, so q_iᵀk_j only depends on the relative offset i − j; no position embeddings are added to the input
Linear Biases (ALiBi, Press et al. 2022)
A_ij = q_iᵀk_j − m·|i − j|, with m a fixed head-specific slope

Course 3: Language Modeling 34


Modern flavors : RMSNorm
Replaces LayerNorm
Re-scaling is all you need: RMSNorm(x) = x / RMS(x) · g, with RMS(x) = √(mean(x_i²))
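A minimal sketch of RMSNorm (eps and shapes are illustrative):

```python
import torch

def rms_norm(x, gain, eps=1e-6):
    # Re-scale only: no mean subtraction and no bias, unlike LayerNorm
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gain

x = torch.randn(2, 8)
gain = torch.ones(8)          # learnable gain in practice
print(rms_norm(x, gain).shape)
```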

Course 3: Language Modeling 35


Modern flavors : Grouped-Query Attention

Course 3: Language Modeling 36


Encoder Models

Course 3: Language Modeling 37


Masked Language Models
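A minimal sketch of the masked-LM input corruption (the 15% rate follows BERT; ids are illustrative and BERT's 80/10/10 mask/keep/replace refinement is omitted):

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Randomly mask ~15% of tokens; the encoder is trained to recover them."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = -100          # only masked positions contribute to the loss
    inputs = token_ids.clone()
    inputs[masked] = mask_id        # simplified corruption scheme
    return inputs, labels

ids = torch.randint(5, 1000, (1, 12))        # toy batch of token ids
inputs, labels = mask_tokens(ids, mask_id=4)
# loss = cross-entropy between the encoder's logits and labels, on masked positions only
```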

Course 3: Language Modeling 38


BERT (Devlin et al., 2018)
Pre-trained on 128B tokens from Wikipedia + BooksCorpus
Additional Next Sentence Prediction (NSP) loss
Two versions:
BERT-base (110M parameters)
BERT-large (340M parameters)
Cost: ~1000 GPU hours

Course 3: Language Modeling 39


RoBERTa (Liu et al., 2019)
Pre-trained on ~2T tokens from web data (10× more data than BERT)
No more Next Sentence Prediction (NSP) loss
Two versions:
RoBERTa-base (125M parameters)
RoBERTa-large (355M parameters)
Better results in downstream tasks
Cost: ~25000 GPU hours

Course 3: Language Modeling 40


Multilingual BERT (mBERT)
Pre-trained on 128B tokens from multilingual Wikipedia
104 languages
One version:
mBERT-base (179M parameters)
Cost: unknown

Course 3: Language Modeling 41


XLM-RoBERTa (Conneau et al., 2019)
Pre-trained on 63T tokens from CommonCrawl
100 languages
Two versions:
XLM-RoBERTa-base (279M parameters)
XLM-RoBERTa-large (561M parameters)
Cost: ~75000 GPU hours

Course 3: Language Modeling 42


ELECTRA (Clark et al., 2020)

Course 3: Language Modeling 43


ELECTRA (Clark et al., 2020)
Replaced-token detection: a discriminator is trained to spot tokens substituted by a small generator
Pre-trained on English data (the same corpora as BERT for the small/base versions)
Three versions:
ELECTRA-small (14M parameters)
ELECTRA-base (110M parameters)
ELECTRA-large (335M parameters)
Substantially better downstream results than BERT/RoBERTa at the same compute budget
Cost: ≈ BERT

Course 3: Language Modeling 44


Encoders: Fine-tuning

Course 3: Language Modeling 45


Encoders: Classical applications
Natural Language Inference (NLI)
I like cake! / Cake is bad => same|neutral|opposite
Text classification (+ clustering)
I'm so glad to be here! => joy
Named Entity Recognition (NER)
I voted for Obama! => (Obama, pos:3, class:PER)
and many others...
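A minimal sketch of fine-tuning an encoder for one of these tasks (text classification), assuming the Hugging Face transformers library; the model name, label and hyper-parameters are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 3)   # e.g. 3 emotion classes

opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5
)

batch = tokenizer(["I'm so glad to be here!"], return_tensors="pt")
labels = torch.tensor([0])                               # illustrative gold label

out = encoder(**batch).last_hidden_state[:, 0]           # [CLS] embedding
loss = nn.functional.cross_entropy(classifier(out), labels)
loss.backward()                                          # updates both encoder and head
opt.step()
```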

Course 3: Language Modeling 46


Decoders

Course 3: Language Modeling 47


Decoders - Motivation
Models that are designed to generate text
Next-word predictors: P(w_i | w_1, …, w_{i-1})

Problem: how do we prevent self-attention from attending to future tokens?

Course 3: Language Modeling 48


Decoders - Attention mask

Each attention input can only attend to previous positions
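A minimal sketch of this causal mask applied to the raw attention scores (shapes illustrative):

```python
import torch

L = 5
scores = torch.randn(L, L)                       # raw Q·K^T interactions

# Upper-triangular positions (j > i) are future tokens: set them to -inf
causal_mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))

attn = torch.softmax(scores, dim=-1)             # row i attends only to positions <= i
```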

Course 3: Language Modeling 49


Decoders - Causal LM pre-training
Teacher-forcing: at every position, the model is fed the ground-truth prefix and trained to predict the next token

Course 3: Language Modeling 50


Decoders - Causal LM inference (greedy)

Course 3: Language Modeling 51


Decoders - Causal LM inference (greedy)

Course 3: Language Modeling 52


Decoders - Refining inference
What we have: a good model for P(w_i | w_1, …, w_{i-1})
What we want at inference: the most likely continuation, i.e. the argmax over (w_i, …, w_{i+T}) of P(w_i, …, w_{i+T} | w_1, …, w_{i-1})

For a given completion length T, there are V^T possibilities (in theory)

e.g.: 19 new tokens with a vocab of 30,000 tokens => 30000^19 > number of atoms in the observable universe
We need approximations

Course 3: Language Modeling 53


Decoders - Greedy inference
Keep the best token at each step and start again:

ŵ_i = argmax_w P(w | w_1, …, w_{i-1}), where the argmax runs over the whole vocabulary

Course 3: Language Modeling 54


Decoders - Beam search
Keep the k best chains of tokens at each step (see the sketch below):
Take the k best tokens w and compute P(w' | …, w) for each
Take the k best w' in each sub-case (now we have k × k pairs to consider)
Keep only the k most likely pairs (w, w')
Compute P(w'' | …, w, w') for the k candidates
and so on...
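A minimal sketch of beam search, assuming a hypothetical next_log_probs(prefix) helper that returns log P(w | prefix) for every vocabulary token w:

```python
def beam_search(next_log_probs, prefix, k=3, steps=5):
    # Each beam is (token sequence, cumulative log-probability)
    beams = [(list(prefix), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            log_probs = next_log_probs(seq)   # dict: token -> log P(token | seq)
            # expand each beam with its k best continuations (k*k pairs overall)
            best = sorted(log_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
            candidates += [(seq + [w], score + lp) for w, lp in best]
        # keep only the k most likely sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams
```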

Course 3: Language Modeling 55


Decoders - Top-k sampling
Randomly sample among the top-k tokens, with weights proportional to P(w | w_1, …, w_{i-1})

Course 3: Language Modeling 56


Decoders - Top-p (=Nucleus) sampling
Randomly sample among the most likely tokens whose cumulative probability P(w | w_1, …, w_{i-1}) reaches p%

Course 3: Language Modeling 57


Decoders - Generation Temperature
Alter the softmax function: P(w_i) ∝ exp(z_i / T), where T is the temperature (T < 1 sharpens the distribution, T > 1 flattens it)
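A minimal sketch combining temperature with the top-k and top-p filters from the previous slides (all threshold values are illustrative):

```python
import torch

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = logits / temperature                       # T < 1 sharpens, T > 1 flattens
    probs = torch.softmax(logits, dim=-1)

    sorted_probs, sorted_ids = probs.sort(descending=True)
    keep = torch.ones_like(sorted_probs, dtype=torch.bool)
    keep[top_k:] = False                                # top-k: k most likely tokens
    keep &= sorted_probs.cumsum(dim=-1) - sorted_probs < top_p  # top-p: nucleus of mass p

    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    idx = torch.multinomial(filtered / filtered.sum(), 1)
    return sorted_ids[idx]

next_token = sample_next(torch.randn(30000))            # toy vocabulary of 30,000 logits
```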

Course 3: Language Modeling 58


Decoders - Inference speed
For greedy decoding without prefix:
L passes, with sequences of length 1, 2, …, L
Each pass is O(L²)
Overall complexity: O(L³)
Other decoding strategies are more costly
Ways to go faster?

Course 3: Language Modeling 59


Decoders - Key-Value (KV) caching

Course 3: Language Modeling 60


Decoders - Speculative decoding
Generate tokens with a smaller draft model q (much cheaper than the target model p)
Forward the drafted tokens through the bigger model p in teacher-forcing mode to get its own predictions
Compare p and q and only keep the drafted tokens where they don't differ too much

Course 3: Language Modeling 61


Encoder-Decoder models

Course 3: Language Modeling 62


T5 pre-training
Span corruption: random spans of the input are replaced by sentinel tokens and the decoder reconstructs them

Course 3: Language Modeling 63


All models can do everything
Encoders are mostly used to get contextual embeddings
They can also generate: ("I love [MASK]")
Decoders are mostly used for language generation
They can also give contextual embeddings: ("I love music!")
Or solve any task using prompts:
"What is the emotion in this tweet? Tweet: '...' Answer:"
Encoders-decoders are used for language in-filling

Course 3: Language Modeling 64


Evaluating models
A useful evaluation metric: Perplexity
Defined as: PPL = exp( −(1/L) Σ_i log P(w_i | w_1, …, w_{i-1}) ); lower is better

Other metrics: accuracy, f1-score, ...
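A minimal sketch of the perplexity computation, assuming the model's per-token probabilities P(w_i | w_<i) are already available:

```python
import math

# P(w_i | w_<i) assigned by the model to each token of a held-out sequence (toy values)
token_probs = [0.2, 0.05, 0.5, 0.1]

# PPL = exp( -(1/L) * sum_i log P(w_i | w_<i) ); lower is better
ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(ppl)
```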

Course 3: Language Modeling 65


Zero-shot evaluation
Never-seen problems/data
Example: "What is the capital of Italy? Answer:"
Open-ended: let the model continue the sentence and check for an exact match
Ranking: get the next-word likelihood for "Rome", "Paris", "London", ... and check if "Rome" is best
Perplexity: compute the perplexity of "Rome" and compare with other models

Course 3: Language Modeling 66


Few-shot evaluation / In-context learning
Never-seen problems/data
Example: "Paris is the capital of France. London is the capital of the
UK. Rome is the capital of"
Chain-of-Thought (CoT) examples:
Normal: "(2+3)x5=25. What's (3+4)x2?"
CoT: "To solve (2+3)x5, we first compute (2+3) = 5 and then
multiply (2+3)x5=5x5=25. What's (3+4)x2?"

Course 3: Language Modeling 67


Open-sourced evaluation
Generative models are evaluated on benchmarks
Example (LLM Leaderboard from HuggingFace):

Course 3: Language Modeling 68


Lab session

Course 3: Language Modeling 69
