(10 December 2024, NeurIPS) Tutorial On Language Modeling
Open Models vs. Closed API Models
Are we done with scientific research on LMs?
The goal of this tutorial is to build foundational understanding for LM research.
Outline:
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)
Prerequisites
We’re assuming you are comfortable with:
● Training ML models
○ e.g., “learning rate schedulers”, “AdamW”, “batch size”, “transformers”
● Core LM concepts
○ e.g., “next word prediction”, “tokenization”, “sequence length”
● PyTorch*
*Treat our code snippets like pseudocode; no guarantees they will run!
Data tensor shapes
● Tokenizer: text becomes a tensor of dimension (batch_size, seq_len, embedding_dim) via the input embeddings
● Model: output becomes a tensor of dimension (batch_size, seq_len, …)
● argmax over the last dimension picks token IDs, which the tokenizer decodes back into text
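To make the shapes concrete, here is a toy PyTorch sketch (tiny illustrative dimensions, not a real model; treat it like the other snippets here):

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only; real models are far larger.
batch_size, seq_len, vocab_size, embedding_dim = 2, 8, 1000, 64

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # from the tokenizer
embed = nn.Embedding(vocab_size, embedding_dim)
x = embed(token_ids)               # (batch_size, seq_len, embedding_dim)

# Stand-in for the transformer stack: project to logits over the vocabulary.
lm_head = nn.Linear(embedding_dim, vocab_size)
logits = lm_head(x)                # (batch_size, seq_len, vocab_size)

next_ids = logits.argmax(dim=-1)   # greedy choice; the tokenizer decodes these back to text
```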
Base – Data
What is LM data?
Looking at the data: unstructured text
(The data curation loop: acquire data → run experiment / pretrain LM)
What is “good” data?
What makes data “good”? It is non-IID, and curation trades off scale, “quality”, access, and constraints.
The data curation loop: acquire data → run experiment (pretrain LM)
Broad & wide crawls are easiest to scale; domain-specific crawls are easiest to ensure quality.
Image sources: https://www.reddit.com/r/cats/comments/10dpv9p/yall_werent_kidding_when_you_said_cats_love_churus/ and https://www.reddit.com/r/cats/comments/ytcv0n/got_my_cat_a_gravity_feeder_why/
How to get the content?
<p>My Title</p>.
<div id="contentDiv"></div>
“My Title. Click Me. Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore et
dolore magna aliqua. Ut enim ad m…”
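As a sketch of this extraction step, here is one way to flatten HTML to text with BeautifulSoup (one tool among many; the tutorial does not prescribe an extractor, and the HTML below is the toy snippet from the slide):

```python
from bs4 import BeautifulSoup

html = """
<p>My Title</p>
<a href="#">Click Me</a>
<div id="contentDiv">Lorem ipsum dolor sit amet, consectetur adipiscing elit...</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop non-content tags, then flatten whatever text remains.
for tag in soup(["script", "style"]):
    tag.decompose()
text = " ".join(soup.get_text(separator=" ").split())
print(text)  # "My Title Click Me Lorem ipsum dolor sit amet, ..."
```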
What websites to target?

Site | Quality | Volume | Difficulty | Coverage
example.org | Highly curated. | ~100,000 pages, 742 words/page | ~186 words/sec | —
example.cat | Highly curated. | Reports 3MM books. | Free-range crawl | Maybe not very crawlable? Substantial non-text information, PDF
example.gov | High variance (generally curated) | Reports 2.5MM sites over 92 languages. High variance on words/doc. | Free-range crawl | Single URLs vs. linked sites; highly parallelizable
example.xyz | High variance (generally curated) | 12,000 English URLs reported. High variance. | Free-range crawl | Single URLs vs. linked sites; highly parallelizable
Harder to get data via crawling
Longpre et al. 2024. Consent in Crisis: The Rapid Decline of the AI Data Commons. Data Provenance Initiative.
Widening inequality in data access
What does language model data look like?
DIAGNOSTICS ``What?'' if you get a question wrong. ``Right!'' if you get it for yourself.
What about PDFs & Scanned Docs?
Old scanned docs using a classical OCR pipeline. (Annotations: proper OCR despite bad lighting; proper handling of footnotes.)

“The Mahomedan faith has been appropriately entitled, *The religion of the sword*; and with equal propriety may we so designate the religion of these belligerent friars. The Portuguese writers give an account of one of their missionaries, Fernando Vinagre, who was as prompt in the field of battle as at the baptismal font. This man, though a secular priest, undertook the command of a squadron that was sent to the assistance of the rajah of Tidore,⁴ on which occasion he is said to have acted in the twofold capacity of a great commander, and a great apostle, at one time appearing in armour, at another in a surplice; and even occasionally, baptizing the converts of his sword without putting off his armour, but covering it with his ecclesiastical vest. In this crusade⁵ he had two
---
³ Geddes History, &c., pp. 24—27.
Pudet haec opprobria nobis
Vel dici potuisse.
⁴ Called *Tadure* or *Daco*, an island in the Indian Ocean, one of the Moluccas
⁵ 'These a la Dragoon conversions.' Geddes' History, p. 27.”
Is there a “best” linearization?
Filtering
Filter low-quality content
Examples: bufvc.ac.uk/allbufvc/search.php/item?q=Discussion, https://finance.yahoo.com, https://i.redd.it/r76y8e47qrvb1.jpg, https://www.reddit.com/r/interestingasfuck/comments/10slutr/the_cats_that_sailed_on_ships_until_the_mid20th/
Filter duplicate data
The data curation loop, expanded: acquire data → transform the data (data interventions: language filtering, quality filtering, safety filtering, deduplication) → run experiment (pretrain LM)
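For the deduplication step, a minimal sketch of exact deduplication via hashing (real pipelines add fuzzy methods such as MinHash, which this omits):

```python
import hashlib

# Drop documents whose whitespace/case-normalized text hashes collide.
def dedupe(docs):
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello  world", "hello world", "Different doc"]
print(dedupe(docs))  # ['Hello  world', 'Different doc']
```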
fastText
● 2,000 docs per second per CPU
● $0.04/hr ($8.5/hr for c7i instance, 192 cores)
BERT-Base
● 1,600 docs per second per H100
● $2.50/hr
Two example training setups for quality classifiers:
● Positive: Llama-labeled “Edu” content; Negative: Llama-labeled “non-Edu” content
● Positive: diverse set of “High Quality” docs; Negative: randomly sampled Common Crawl
C4 and Gopher used rules. Dolma, FineWeb, and DCLM all use some commonsense text heuristics.
FineWeb used Llama 70B to generate labels, then distilled them into fastText.
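A minimal sketch of the distill-into-fastText recipe, assuming a hypothetical train.txt in fastText's supervised format and a hypothetical 0.5 keep threshold:

```python
import fasttext

# train.txt lines look like: "__label__hq <document text>"
# (labels distilled from an LLM judge, per the slide).
model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

labels, probs = model.predict("My cat loves gravity feeders.", k=1)
keep = labels[0] == "__label__hq" and probs[0] > 0.5  # illustrative threshold
```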
Side-effects of filters (e.g. “quality”)
Examples: https://nationalstocksign.com/terms.php, https://www.aximtrd.com/term-conditions, https://www.lawinsider.com/clause/definitions, https://www.picturesofengland.com/agreements/ExtendedLicence
Subramani et al. 2023. Detecting Personal Information in Training Corpora: an Analysis. TrustNLP workshop at ACL.
Coarse text filtering is faster but can hide issues
e.g., scientific content in English alongside NSFW content in Chinese.
Each data source requires its own pipeline (e.g., GitHub code).
Some data interventions are hard to “test”
● Deduplication: no effect from deduplication? So little noise!
Data curriculum actually works?
● Instruction data
● “Synthetic” data
2. Data acquisition
a. Crawling is hard.
i. Broad → scale vs. domain-specific → quality. Scales with people, compute, time, and $$$.
b. Use public bulk APIs. Support their efforts!
3. Data transformation
a. Filtering, filtering, filtering.
b. Don’t forget about linearization and choice of text units.
c. Manually inspect your data often!
Speaker: Akshita Bhagia
Pretraining
Goal: Equip the language model with general language capabilities through self-supervised training on large amounts of unstructured text.
Architecture choices
Transformer
Dubey, Abhimanyu et al. “The Llama 3 Herd of Models.” ArXiv abs/2407.21783 (2024).
Training Configurations
Size, shape, and input representation:
Config | A | B | C
d_model | 4096 | 4096 | 4544
n_heads | 32 | 32 | 71
n_layers | 32 | 32 | 32
mlp_ratio | 5.375 | ~6 | ??
ln type | RMSNorm | parametric | parametric
affine in layer norm | TRUE | TRUE | TRUE
bias in layer norm | FALSE | TRUE | TRUE
pos embeddings | RoPE | RoPE | RoPE
attention_ln (qk layernorm) | FALSE | FALSE | FALSE
multi-query attention | FALSE | FALSE | TRUE
parallel blocks | FALSE | FALSE | TRUE
activation | SwiGLU | SwiGLU | GELU
sequence length | 4000 | 2048 | 2048
batch size (instances) | 1024 | 2048 | 2304
batch size warmup | n/a | no | linear (30B tokens)
weight tying | FALSE | FALSE | FALSE

How to optimize the loss:
Config | A | B | C
optimizer | AdamW | AdamW | AdamW
init | megatron_full_init | mitchell | (probably closer to megatron full init)
warmup | 2000 | 2000 | 4B tokens
peak lr | 3.00E-04 | 3.00E-04 | 6.00E-04
min lr | 3.00E-05 | 3.00E-05 | 1.20E-05
schedule | cosine | cosine | cosine
wd | 0.1 | 0.1 | 0.1
beta1 | 0.9 | 0.9 | 0.999
beta2 | 0.95 | 0.95 | 0.999
eps | 1.00E-05 | 1.00E-05 | 1.00E-05
grad clip | global 1 | global 1 | global 1
reduce | fp32 | fp32 | bf16
optimizer state | n/a | fp32 | fp32
z-loss | n/a | no | 1.00E-04
Models don’t always agree on best configs
Some “standard” choices
A mistake in pretraining can cost up to millions of dollars…
Pre-training runs are costly
“Standard” practices
Size of the model
Given a fixed compute budget C, what model size do you train?
C ≈ 6ND, D ≈ 20N, where N is the number of parameters and D the number of training tokens.
“Performance depends strongly on scale, weakly on model shape”
Kaplan, Jared et al. “Scaling Laws for Neural Language Models.”
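A toy calculation under these rules of thumb (the budget value is illustrative, not from the tutorial):

```python
# Solve 6 * N * D = C with D = 20 N  =>  N = sqrt(C / 120).
C = 1e23                      # FLOPs budget (illustrative)
N = (C / 120) ** 0.5          # ≈ 2.9e10 parameters (~29B)
D = 20 * N                    # ≈ 5.8e11 tokens (~577B)
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```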
● RoPE positional embeddings
● SwiGLU activation
● RMSNorm
Su, Jianlin et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” ArXiv abs/2104.09864 (2021): n. pag.
Xiong, Wenhan et al. “Effective Long-Context Scaling of Foundation Models.” North American Chapter of the Association for Computational Linguistics (2023).
Shazeer, Noam M. “GLU Variants Improve Transformer.” ArXiv abs/2002.05202 (2020): n. pag.
What to look for?
How do you determine if your model is training well?
● Loss convergence
● In-loop perplexity: language modeling fit (potentially on specific domains) evaluations
Magnusson, Ian et al. “Paloma: A Benchmark for Evaluating Language Model Fit.” ArXiv abs/2312.10523 (2023): n. pag.
Note: Poster at NeurIPS on Friday at 4:30 pm!
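For reference, in-loop perplexity is just the exponentiated per-token cross-entropy; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# ppl = exp(mean negative log-likelihood per token) — a standard identity.
def perplexity(logits, targets):
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return torch.exp(loss)
```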
● Loss convergence — is this enough?
● Language modeling fit (potentially on specific domains)
● Downstream performance looks fine
https://wandb.ai/ai2-llm/OLMo-7B/reports
Spikes can indicate eventual divergence
For larger models, spikes can be an early indicator of model divergence.
Takase, Sho et al. “Spike No More: Stabilizing the Pre-training of Large Language Models.” ArXiv abs/2312.16903 (2023): n. pag.
Wortsman, Mitchell et al. “Small-scale proxies for large-scale Transformer training instabilities.” ArXiv abs/2309.14322 (2023): n. pag.
Use normal initialization!
Cowsik, Aditya et al. “Geometric Dynamics of Signal Propagation Predict Trainability of Transformers.” ArXiv abs/2403.02579 (2024): n. pag.
Additionally:
Zhang, Biao and Rico Sennrich. “Root Mean Square Layer Normalization.” ArXiv abs/1910.07467 (2019): n. pag.
Team, Chameleon. “Chameleon: Mixed-Modal Early-Fusion Foundation Models.” ArXiv abs/2405.09818 (2024): n. pag.
Do more with less compute
How do you ensure that small model behavior will match the large model?
Yang, Greg et al. “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer.” (2022).
Gadre, Samir Yitzhak et al. “Language models scale reliably with over-training and on downstream tasks.” ArXiv abs/2403.08540 (2024): n. pag.
Bhagia, Akshita et al. “Establishing Task Scaling Laws via Compute-Efficient Model Ladders.” (2024).
Porian, Tomer et al. “Resolving Discrepancies in Compute-Optimal Scaling of Language Models.” ArXiv abs/2406.19146 (2024): n. pag.
Annealing the learning rate: warm up over the first <10B tokens to a peak LR (e.g. 3e-4), decay over trillions of tokens to ~5e-5, then anneal linearly to 0 over the final ~50B tokens.
LR → 0 is all you need?
Llama 3
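A sketch of the pictured schedule: linear warmup to a peak, cosine decay to a minimum, then a linear anneal to zero at the end (all constants illustrative):

```python
import math

def lr(step, total=100_000, warmup=2_000, anneal=5_000,
       peak=3e-4, minimum=5e-5):
    if step < warmup:                            # linear warmup
        return peak * step / warmup
    decay_end = total - anneal
    if step < decay_end:                         # cosine decay to minimum
        t = (step - warmup) / (decay_end - warmup)
        return minimum + 0.5 * (peak - minimum) * (1 + math.cos(math.pi * t))
    return minimum * (total - step) / anneal     # linear anneal to 0
```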
Do more with less compute: MoE vs. dense models.
Note: Learn more about OLMoE at the ESNLP workshop at NeurIPS on Saturday!
Using hardware effectively
Goal: maximize the number of tokens processed per second (TPS) without loss of model performance.
● Model parallelism
● Tensor parallelism
● Pipeline parallelism
In practice, use FSDP… (a minimal sketch below)
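A minimal FSDP wrap, assuming torch.distributed is already initialized (e.g. via torchrun); real setups also configure a wrapping policy, mixed precision, etc.:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # stand-in model
model = FSDP(model.cuda())      # shards params, grads, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```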
Throughput: 5,000 TPS vs. 750 TPS. Manual GC!
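A sketch of the manual-GC trick (the interval and training-step function are illustrative): disable Python's automatic collector and collect on a fixed schedule, so pauses hit all ranks at the same step:

```python
import gc

gc.disable()                       # no surprise collections mid-step
total_steps = 100_000              # illustrative
for step in range(total_steps):
    train_step()                   # hypothetical training-step function
    if step % 1000 == 0:
        gc.collect()               # collect at a predictable cadence
```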
2. Saving checkpoints
Takeaways
1. Minimize the things you need to worry about
3. Throughput matters
Break
(Or catching up if behind)
Speaker: Nathan Lambert
Adaptation
(Post-training)
Language model adaptation
Raw pre-trained LMs are neither safe nor robust for public use and interaction; they require “alignment” between AI and humans.
Initial approaches to modern post-training
ChatGPT blog post:
From: https://www.interconnects.ai/p/frontier-model-post-training
Initial approaches to modern post-training
Three stage approach:
1. Instruction tune base model.
2. Collect preference data & train reward model.
3. Fine-tune with RL.
From: https://www.interconnects.ai/p/frontier-model-post-training
Current frontier model post-training
Three training objectives are most popular:
1. Supervised Finetuning – teach formatting and form the base of instruction-following abilities.
2. Preference Finetuning – align to human preferences (with a smaller bump in capabilities).
3. Reinforcement Finetuning – final stage to boost performance on verifiable tasks.
Getting the ingredients to start post-training
Successful adaptation starts with:
1. Meaningful evaluations for targeted skills, and
2. Prompts of representative queries for said skills.
Lambert, Nathan et al. 2024. Tülu 3.
Getting the ingredients to start post-training: Prompts
All post-training stages require prompts in the distribution of target tasks.
Example prompt budget:
The role of instruction tuning
Accomplishes two primary tasks:
1. Adapt base model to specific style of input for chat interactions.
2. Ability to include system prompts, multi-turn dialogues, and other chat
templates.
A very large proportion of post-training gains come from the SFT stage.
<|system|>
You’re a helpful agent            ← system prompt
<|end|>
<|user|>
{query}
<|end|>
<|assistant|>{Answer goes here}
(Special tokens: <|system|>, <|user|>, <|assistant|>, <|end|>)
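A sketch of rendering this template in Python (the special tokens are the ones above, not any particular model's):

```python
def render_chat(query: str, system: str = "You're a helpful agent") -> str:
    # The model is trained to continue generating after <|assistant|>.
    return (
        f"<|system|>\n{system}\n<|end|>\n"
        f"<|user|>\n{query}\n<|end|>\n"
        f"<|assistant|>"
    )

prompt = render_chat("What is SFT?")
```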
Koala (3 Apr. 2023)
● Diverse dataset (Alpaca, Anthropic HH, ShareGPT, WebGPT…)
● Human evaluation
● LLaMA 7B diff.
https://bair.berkeley.edu/blog/2023/04/03/koala/

Dolly (12 Apr. 2023)
● 15k human-written data
● Trained on Pythia 12b
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
The role of preference finetuning (PreFT)
Aligning to human preferences gives:
● Stronger training influence for style and chat evaluations
(e.g. ChatBotArena).
● Continue building capabilities of skills from SFT, but lower
absolute magnitude of improvements.
Bradley–Terry model: estimate the probability that a given pairwise preference (chosen over rejected completion) is true, with probability ∝ reward.
UltraFeedback: https://arxiv.org/abs/2310.01377
Model: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
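A minimal Bradley–Terry reward-model loss in PyTorch (a standard formulation, not code from Zephyr or Tülu):

```python
import torch
import torch.nn.functional as F

# p(chosen ≻ rejected) = sigmoid(r_chosen - r_rejected);
# training maximizes its log-likelihood.
def bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen / r_rejected: scalar rewards per pair from a reward model.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```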
DPO model proliferation
Allen AI’s Tülu 2 70B
● First to scale DPO to 70B
parameters.
● State-of-the-art open model on
external benchmarks.
● Open models began to match and
surpass GPT-4.
More discussion:
https://twitter.com/srush_nlp/status/1729896568956895370,
https://www.interconnects.ai/p/the-dpo-debate,
https://www.youtube.com/watch?v=YJMCSVLRUNs
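For reference, a compact sketch of the DPO objective (Rafailov et al. 2023), assuming precomputed summed log-probs per response under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the reference model; beta is the usual temperature.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```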
In open research:
● Synthetic data dominates due to price.
○ One LLM-as-a-judge label costs <1 cent.
○ One human datapoint costs $5-20.
RL finetuning
Reinforcement learning as a training objective other than just for
human preferences:
● OpenAI’s o1 and related models trained with "large-scale RL" for reasoning
● Finetuning based on verifiable outputs:
○ Tülu 3’s Reinforcement Learning with Verifiable Rewards (RLVR; sketched below) or
○ OpenAI’s Reinforcement Finetuning API
○ Extensive research in specific domains: Code verification, VinePPO for
math, Quiet STaR, etc.
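A toy verifiable-reward function in the spirit of RLVR (the answer-extraction pattern is hypothetical, not Tülu 3's actual implementation):

```python
import re

def verifiable_reward(completion: str, gold: str) -> float:
    # Reward 1 if the extracted final answer matches ground truth, else 0.
    match = re.search(r"answer is\s*(.+)", completion, re.IGNORECASE)
    pred = match.group(1).strip().rstrip(".") if match else completion.strip()
    return 1.0 if pred == gold else 0.0

print(verifiable_reward("... so the answer is 42.", "42"))  # 1.0
```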
Open questions in post-training
Methods and practices common in frontier laboratories but understudied in
academic research:
Research Still Needed
● Science of LMs
● Improve LMs
● Build the next generation of LMs
● Extend LMs beyond text
● LMs for science
● LMs for health
● Use LMs in the real world
● LM agents
● Planning
Extra, hidden RLHF slides
State of open recipes for fine-tuning
No models in the top 60 of LMSYS ChatBotArena with open fine-tuning data. We can change this! (As of Dec. 9th, 2024)
PersonaHub: https://github.com/tencent-ailab/persona-hub
https://unsloth.ai/blog/gradient
RLHF phase: SteerLM & Starling
Still plenty of models showing that PPO (and other RL methods) can outperform DPO!
SteerLM: https://huggingface.co/nvidia/SteerLM-llama2-13B
Starling: https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha
Inference with a language model
Tokenizer & model: text → tensor, tensor → text.
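A minimal end-to-end sketch with Hugging Face transformers (gpt2 as a stand-in checkpoint; decoding settings illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; any causal LM checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Language models are", return_tensors="pt")   # text → tensor
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # tensor → text
```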